Naive Bayes

2021-09-29 18:13:47

Preparing the data: building word vectors from text

```python
def loaddataset():
    # Toy corpus of forum posts; 1 = abusive, 0 = not abusive
    postinglist = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classvec = [0, 1, 0, 1, 0, 1]
    return postinglist, classvec

def createvocablist(dataset):
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)  # union with each document's word set
    return list(vocabset)

def wordstovec(vocablist, inputset):
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary" % word)
    return returnvec
```

This code maps a document onto the vocabulary as a vector of 0s and 1s: if a vocabulary word appears in the document, the entry at that word's position in the vocabulary is set to 1, otherwise it stays 0. This set-of-words representation makes the probability calculations later on straightforward.
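As a quick standalone check of the set-of-words model described above (the functions are re-defined here for illustration, with the same names used in this post):

```python
def createvocablist(dataset):
    # Union of all words seen across the documents
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)
    return list(vocabset)

def wordstovec(vocablist, inputset):
    # 0/1 vector: 1 at the position of each vocabulary word present in inputset
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] = 1
    return returnvec

docs = [['my', 'dog', 'is', 'cute'], ['stupid', 'dog']]
vocab = createvocablist(docs)   # 5 distinct words
vec = wordstovec(vocab, ['my', 'dog'])
print(len(vocab), sum(vec))     # 5 distinct words, 2 of them present -> 5 2
```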

```python
from math import log
from numpy import ones

def trainnb(trainmatrix, traincategory):
    numtraindocs = len(trainmatrix)
    numwords = len(trainmatrix[0])
    pausive = sum(traincategory) / float(numtraindocs)  # prior p(class = 1)
    # The straightforward initialization would be p0num = zeros(numwords);
    # p1num = zeros(numwords) with p0denom = p1denom = 0.0. Initializing the
    # counts to 1 and the denominators to 2.0 instead (Laplace smoothing)
    # keeps a single zero probability from zeroing out the whole product.
    p0num = ones(numwords)
    p1num = ones(numwords)
    p0denom = 2.0
    p1denom = 2.0
    for i in range(numtraindocs):
        if traincategory[i] == 1:
            p1num += trainmatrix[i]
            p1denom += sum(trainmatrix[i])
        else:
            p0num += trainmatrix[i]
            p0denom += sum(trainmatrix[i])
    p1vect = p1num / p1denom
    p0vect = p0num / p0denom
    # The conditional probabilities are simply p1num/p1denom and p0num/p0denom;
    # taking logs here prevents the tiny factors from underflowing to zero
    # when they are multiplied (summed in log space) during classification.
    p0vect = [log(x) for x in p0vect]
    p1vect = [log(x) for x in p1vect]
    return p0vect, p1vect, pausive
```
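The comment about underflow can be made concrete. A minimal sketch with hypothetical per-word probabilities (not values produced by `trainnb`) shows why the product of many small factors fails while the sum of their logs does not:

```python
import math

# 200 words, each with a small conditional probability of 0.01
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p            # 0.01**200 = 1e-400 underflows to 0.0 in float64

log_sum = sum(math.log(p) for p in probs)  # 200 * ln(0.01) ~ -921.03, perfectly representable

print(product, log_sum)     # -> 0.0 -921.03...
```

This is exactly why `trainnb` returns log probabilities and `classifynb` adds them instead of multiplying raw probabilities.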

Implementing the classifier

```python
from math import log
from numpy import array

def classifynb(vecclassify, p0vec, p1vec, pclass1):
    # Compare log p(w|c1) + log p(c1) against log p(w|c0) + log p(c0);
    # sums of logs replace the products of probabilities.
    p1 = sum(vecclassify * p1vec) + log(pclass1)
    p0 = sum(vecclassify * p0vec) + log(1.0 - pclass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingnb():
    listposts, listclasses = loaddataset()
    myvocablist = createvocablist(listposts)
    trainmat = []
    for postindoc in listposts:
        trainmat.append(wordstovec(myvocablist, postindoc))
    p0v, p1v, pab = trainnb(trainmat, listclasses)
    testentry = ['love', 'my', 'dalmation']
    thisdoc = array(wordstovec(myvocablist, testentry))
    print(testentry, ' classified as:', classifynb(thisdoc, p0v, p1v, pab))
    testentry = ['stupid', 'garbage']
    thisdoc = array(wordstovec(myvocablist, testentry))
    print(testentry, ' classified as:', classifynb(thisdoc, p0v, p1v, pab))
```
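The decision rule inside `classifynb` can be checked in isolation. This sketch uses toy log probabilities (made-up numbers, not the values trained by `testingnb`) for a two-word vocabulary:

```python
import math

# Hypothetical per-word conditional log probabilities for the two classes
p1vec = [math.log(0.5), math.log(0.1)]   # log p(word_i | class 1)
p0vec = [math.log(0.2), math.log(0.4)]   # log p(word_i | class 0)
pclass1 = 0.5                            # prior p(class 1)

x = [1, 1]  # document containing both vocabulary words

# Same rule as classifynb: sum the log probabilities of the present words,
# add the log prior, and pick the larger score.
p1 = sum(a * b for a, b in zip(x, p1vec)) + math.log(pclass1)
p0 = sum(a * b for a, b in zip(x, p0vec)) + math.log(1.0 - pclass1)
result = 1 if p1 > p0 else 0
print(result)
```

Here p0 ≈ -3.22 beats p1 ≈ -3.69, so the document is assigned class 0 even though one word favors class 1; the scores weigh all present words jointly.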
