Spam Classification with NLTK and sklearn

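This post uses NLTK to preprocess the data and extract features, then hands the result to sklearn to train classifiers. Several machine learning algorithms are tried and their classification performance is evaluated.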

The code:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import csv
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import tree
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Preprocessing: tokenize, drop stop words and short tokens, lowercase, lemmatize
def preprocessing(text):
    # text = text.decode("utf-8")
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    stops = stopwords.words('english')
    tokens = [token for token in tokens if token not in stops]
    tokens = [token.lower() for token in tokens if len(token) >= 3]
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(token) for token in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# Read the data set (tab-separated file)
file_path = r'e:\mycode\dataset\smsspamcollection\smsspamcollection'
sms = open(file_path, 'r', encoding='utf-8')
sms_data = []
sms_label = []
csv_reader = csv.reader(sms, delimiter='\t')
for line in csv_reader:
    # Assumed row layout: line[0] is the label, line[1] is the message text
    sms_label.append(line[0])
    sms_data.append(preprocessing(line[1]))
sms.close()
# print(sms_data)

# Split into training and test sets at a 0.7:0.3 ratio, then vectorize
dataset_size = len(sms_data)
trainset_size = int(round(dataset_size * 0.7))
print('dataset_size:', dataset_size, ' trainset_size:', trainset_size)
x_train = np.array(sms_data[0:trainset_size])
y_train = np.array(sms_label[0:trainset_size])
# Start the test set at trainset_size so that no sample is skipped
x_test = np.array(sms_data[trainset_size:dataset_size])
y_test = np.array(sms_label[trainset_size:dataset_size])
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', strip_accents='unicode', norm='l2')
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)

# Naive Bayes classifier
clf = MultinomialNB().fit(x_train, y_train)
y_nb_pred = clf.predict(x_test)
print(y_nb_pred)
print('nb_confusion_matrix:')
cm = confusion_matrix(y_test, y_nb_pred)
print(cm)
print('nb_classification_report:')
cr = classification_report(y_test, y_nb_pred)
print(cr)
# Print the n features with the lowest and the highest log probability for the
# second class. Recent scikit-learn releases no longer expose coef_ and
# intercept_ on MultinomialNB; feature_log_prob_ and class_log_prior_ hold the
# same values.
feature_names = vectorizer.get_feature_names_out()
coefs = clf.feature_log_prob_[1:]
intercept = clf.class_log_prior_[1:]
coefs_with_fns = sorted(zip(coefs[0], feature_names))
n = 10
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))

# Decision tree
clf = tree.DecisionTreeClassifier().fit(x_train.toarray(), y_train)
y_tree_pred = clf.predict(x_test.toarray())
print('tree_confusion_matrix:')
cm = confusion_matrix(y_test, y_tree_pred)
print(cm)
print('tree_classification_report:')
print(classification_report(y_test, y_tree_pred))

# SGD (the old n_iter parameter is now called max_iter)
clf = SGDClassifier(alpha=0.0001, max_iter=50).fit(x_train, y_train)
y_sgd_pred = clf.predict(x_test)
print('sgd_confusion_matrix:')
cm = confusion_matrix(y_test, y_sgd_pred)
print(cm)
print('sgd_classification_report:')
print(classification_report(y_test, y_sgd_pred))

# SVM
clf = LinearSVC().fit(x_train, y_train)
y_svm_pred = clf.predict(x_test)
print('svm_confusion_matrix:')
cm = confusion_matrix(y_test, y_svm_pred)
print(cm)
print('svm_classification_report:')
print(classification_report(y_test, y_svm_pred))

# RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf.fit(x_train, y_train)
y_rf_pred = clf.predict(x_test)
print('rf_confusion_matrix:')
print(confusion_matrix(y_test, y_rf_pred))
print('rf_classification_report:')
print(classification_report(y_test, y_rf_pred))
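
Each classifier above prints a confusion matrix followed by a classification report. As a rough illustration of how the per-class figures in the report relate to the matrix, the sketch below recomputes precision, recall, and F1 by hand in plain Python. The function name `prf_from_confusion` and the sample counts are made up for illustration and are not results from the SMS data set. Rows are assumed to be true labels and columns predicted labels, which is the layout sklearn's `confusion_matrix` returns.

```python
def prf_from_confusion(cm, cls):
    """Return (precision, recall, f1) for class index `cls`.

    `cm` is a square list of lists where cm[i][j] counts the samples
    whose true class is i and whose predicted class is j.
    """
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    actual = sum(cm[cls])                    # row sum: everything that truly is cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only:
# row 0 = true ham, row 1 = true spam.
example_cm = [[90, 10],
              [5, 45]]

for idx, name in enumerate(["ham", "spam"]):
    p, r, f = prf_from_confusion(example_cm, idx)
    print("%-5s precision=%.3f recall=%.3f f1=%.3f" % (name, p, r, f))
```

For spam filtering the two error types are usually not equally costly: a ham message flagged as spam (row 0, column 1) may matter more than a missed spam, so it can be worth reading the matrix itself and not only the averaged scores.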
