Sentiment Classification on a Chinese Corpus

title: Sentiment Classification: Chinese Corpus

date: 2017-03-04

tags: nltk

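After a few days of struggling, I finally got Chinese corpus classification working with nltk this morning. Here is a record of the whole process.

I used Tan Songbo's hotel-review corpus, which comes in four versions: 2000 (balanced), 4000 (balanced), 6000 (balanced), and 10000 (unbalanced). The corpus is laid out as follows: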

-chnsenticorp_htl_ba_2000
|-neg
  |-neg.0.txt ~ neg.999.txt
|-pos
  |-pos.0.txt ~ pos.999.txt

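The corpus is encoded in gb2312, so to make it easier to handle in python and nltk I converted it to utf-8 with a small gb2312<->utf-8 transcoding tool. Once everything is converted, the next step is Chinese word segmentation, for which I used jieba.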
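The post does not say which transcoding tool was used, so the following is only a minimal sketch of how the same conversion could be done with the Python standard library. The function names `transcode_file` and `transcode_tree` are hypothetical, and `errors="replace"` is an assumption: it swaps any byte sequence that is not valid gb2312 for a replacement character instead of failing.

```python
from pathlib import Path


def transcode_file(src, dst, src_encoding="gb2312", dst_encoding="utf-8"):
    """Re-encode one text file and return the decoded text."""
    # Assumption: losing an occasional undecodable byte is acceptable,
    # so invalid sequences are replaced rather than raising an error.
    text = Path(src).read_bytes().decode(src_encoding, errors="replace")
    Path(dst).write_text(text, encoding=dst_encoding)
    return text


def transcode_tree(src_root, dst_root, pattern="*.txt"):
    """Re-encode every matching file under src_root into dst_root,
    keeping the same relative paths. Returns the number of files written."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    count = 0
    for src in sorted(src_root.rglob(pattern)):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        transcode_file(src, dst)
        count += 1
    return count
```

Calling `transcode_tree` on the corpus root with a fresh output directory would leave the gb2312 originals untouched, which makes it easy to re-run if some files turn out to use a different encoding.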

import glob
import jieba

src_pattern = r"c:\users\rumusan\desktop\chnsenticorp_htl_ba_2000\pos\*.txt"
for i, file in enumerate(glob.glob(src_pattern)):
    with open(file, "r", encoding="utf-8") as f1:
        lines = f1.read()
    lines = lines.replace('\n', '')  # the original files contain many blank lines; drop them
    seg_list = ' '.join(jieba.cut(lines))  # segment, then join the tokens with spaces
    # print(seg_list)  # show the segmentation result
    with open(r"c:\users\rumusan\desktop\2\%d.txt" % i, 'w', encoding='utf-8') as f2:
        f2.write(seg_list)  # write the segmentation result

Process the 1000 files in neg and the 1000 files in pos separately and save the results:

-hotel_reviews
|-neg
  |-0.txt ~ 999.txt
|-pos
  |-0.txt ~ 999.txt

The segmentation quality is quite good:

硬體 設施 太舊 , 和 房價 不 相符 , ** 還是 貴 了……

Now that the text has been preprocessed, it can be used as a corpus. nltk offers two ways to load your own corpus:

The first:

from nltk.corpus import PlaintextCorpusReader
corpus_root = r"c:\users\rumusan\desktop\hotel_reviews"
hotel_reviews = PlaintextCorpusReader(corpus_root, '.*')
hotel_reviews.fileids()  # list all files

The second:

from nltk.corpus import BracketParseCorpusReader
corpus_root = r"c:\users\rumusan\desktop\hotel_reviews"
file_pattern = r".*/.*\.txt"
hotel_reviews = BracketParseCorpusReader(corpus_root, file_pattern)
hotel_reviews.fileids()  # list all files

I went with the first approach. To make processing easier, we convert the corpus into the same structure used in "Sentiment Analysis: example":

import nltk
import random
# load our own corpus
from nltk.corpus import PlaintextCorpusReader

# paths
# whole corpus (a later step processes the entire corpus, so it is loaded a second time)
corpus_root_reviews = r"c:\users\rumusan\desktop\hotel_reviews"
corpus_root_neg = r"c:\users\rumusan\desktop\hotel_reviews\neg"  # neg
corpus_root_pos = r"c:\users\rumusan\desktop\hotel_reviews\pos"  # pos

# load
reviews = PlaintextCorpusReader(corpus_root_reviews, '.*')  # whole corpus
neg = PlaintextCorpusReader(corpus_root_neg, '.*')  # neg
pos = PlaintextCorpusReader(corpus_root_pos, '.*')  # pos

documents_neg = [(list(neg.words(fileid)), 0)  # attach label 0
                 for fileid in neg.fileids()]
documents_pos = [(list(pos.words(fileid)), 1)  # attach label 1
                 for fileid in pos.fileids()]
documents_neg.extend(documents_pos)  # combine documents_neg and documents_pos
documents = documents_neg  # name the combined corpus documents

From here on it is largely the same as classifying an English corpus:

random.shuffle(documents)
all_words = nltk.FreqDist(w for w in reviews.words())
word_features = [word for (word, freq) in all_words.most_common(3000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
# which features the classifier found most informative
classifier.show_most_informative_features(10)
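To make the shape of the feature sets concrete, here is a small self-contained sketch of the same bag-of-words step on toy data. Everything in it is illustrative: the three documents are made up, `collections.Counter` stands in for `nltk.FreqDist`, and `document_features` takes the vocabulary as an explicit second argument (unlike the version in the post) so that the example runs without nltk or the real corpus.

```python
from collections import Counter

# Toy, already-segmented documents invented for illustration: (tokens, label)
toy_documents = [
    (["房間", "乾淨", "服務", "好"], 1),
    (["房間", "太舊", "服務", "差"], 0),
    (["位置", "好", "房間", "乾淨"], 1),
]

# Count every token across all documents; Counter.most_common plays the
# role that FreqDist.most_common plays in the nltk version.
all_words = Counter(w for tokens, _ in toy_documents for w in tokens)
word_features = [word for word, _ in all_words.most_common(3)]


def document_features(document, vocabulary):
    """Map a token list to {word: bool} over a fixed vocabulary."""
    document_words = set(document)
    return {word: (word in document_words) for word in vocabulary}


# One (feature dict, label) pair per document, the format the
# classifier's train() method expects.
featuresets = [(document_features(d, word_features), c) for d, c in toy_documents]
print(word_features)
print(featuresets[1])
```

Each document becomes a dictionary with one boolean per vocabulary word, so every feature dictionary has the same keys regardless of how long the review is.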

The overall pipeline now works. The next step is to refine the details: the features, the classifier, the training and test sets, and so on.
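As one possible refinement of the train/test split, the single 500-item hold-out could be replaced by k-fold cross-validation. This is not something the post does; the sketch below is a hypothetical helper (`k_fold_splits`, with assumed defaults `k=5` and `seed=0`) that only produces the splits and uses nothing beyond the standard library.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists so that each item lands in the
    test fold exactly once across the k splits."""
    items = list(items)
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(items)
    for i in range(k):
        # Every k-th item starting at offset i forms the test fold.
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test
```

With the feature sets from the pipeline, each `(train, test)` pair could be passed to `nltk.NaiveBayesClassifier.train` and `nltk.classify.accuracy` in turn, and the accuracies averaged to get a less split-dependent estimate.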
