Machine Learning in Action study notes: decision trees

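The decision tree here is built with the ID3 algorithm, which splits the dataset on its features. The criterion for choosing a split comes from information theory: information gain, which is computed from information entropy (Shannon entropy).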

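In machine learning, the Shannon entropy is computed as:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where x_i is a class and p(x_i) is the probability of class x_i.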
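As a quick check of the formula, consider five samples of which two are labelled 'y' and three are labelled 'n' (the same proportions as the unmodified dataset created below). The arithmetic, rounded to three decimals, works out as:

```latex
H = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right)
  \approx 0.529 + 0.442
  = 0.971
```

A set containing only one class has entropy 0, and the entropy grows as the classes become more evenly mixed.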

First, create the dataset and compute its Shannon entropy.

from math import log

def calcshannonent(dataset):
    # Compute the Shannon entropy of the class labels in dataset
    numents = len(dataset)
    labelcounts = {}  # maps each label to its number of occurrences
    for featvec in dataset:
        currentlabel = featvec[-1]  # the last item of each row ('n' or 'y') is the label
        if currentlabel not in labelcounts:  # first time this label is seen: add it
            labelcounts[currentlabel] = 0
        labelcounts[currentlabel] += 1  # count this label
    shannonent = 0.0
    print(labelcounts)  # print each label and its count
    for key in labelcounts:
        prob = float(labelcounts[key]) / numents  # probability of this label, i.e. p(xi)
        shannonent -= prob * log(prob, 2)  # accumulate -p(xi) * log2(p(xi))
    return shannonent

def createdataset():
    # Build the sample dataset and the feature names
    dataset = [
        [1, 1, 'y'],
        [1, 1, 'y'],
        [1, 0, 'n'],
        [0, 1, 'n'],
        [0, 1, 'n'],
    ]
    labels = ["no surfacing", "flippers"]
    return dataset, labels

mydata, labels = createdataset()
mydata[0][-1] = 'maybe'  # change the label of the first row to 'maybe'
print(mydata)
print(calcshannonent(mydata))

The output is (the label-count dictionary printed inside calcshannonent is not shown):

[[1, 1, 'maybe'], [1, 1, 'y'], [1, 0, 'n'], [0, 1, 'n'], [0, 1, 'n']]
1.3709505944546687

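With the entropy available, the next step is to split the dataset by maximum information gain.

The idea is to compute, for each feature, the entropy that remains after splitting on it. The feature with the highest information gain is the best feature to split on.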
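The function that follows calls splitdataset, which is not listed in this post. The block below is a minimal sketch of such a helper, assuming it returns the rows whose column `axis` equals `value` with that column removed; this assumption matches the subdataset lines in the output further down. The parameter names `axis` and `value` are chosen here for illustration.

```python
def splitdataset(dataset, axis, value):
    """Return the rows whose column `axis` equals `value`, with that column removed."""
    retdataset = []  # a new list, so the caller's dataset is left untouched
    for featvec in dataset:
        if featvec[axis] == value:
            # keep everything except the column that was tested
            reduced = featvec[:axis] + featvec[axis + 1:]
            retdataset.append(reduced)
    return retdataset


rows = [[1, 1, 'y'], [1, 0, 'n'], [0, 1, 'n']]
print(splitdataset(rows, 0, 1))  # [[1, 'y'], [0, 'n']]
```

Building a new list instead of deleting from `dataset` matters because Python passes lists by reference, so in-place changes would leak back to the caller.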

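def choosebestfeaturetosplit(dataset):
    numfeatures = len(dataset[0]) - 1  # the last column is the class label
    baseentropy = calcshannonent(dataset)  # entropy before any split
    bestinfogain = 0.0
    bestfeature = -1
    for i in range(numfeatures):
        featlist = [example[i] for example in dataset]  # all values in column i
        print("featlist:", featlist)
        uniquevals = set(featlist)  # the distinct values this feature can take
        print("uniquevals", uniquevals)
        newentropy = 0.0
        for value in uniquevals:
            # splitdataset returns the rows whose column i equals value, with column i removed
            subdataset = splitdataset(dataset, i, value)
            print("subdataset:", subdataset)
            prob = len(subdataset) / float(len(dataset))
            newentropy += prob * calcshannonent(subdataset)  # weight each subset's entropy by its share of rows
        print("%d 列屬性的熵為：" % i, newentropy)  # the entropy left after splitting on column i
        infogain = baseentropy - newentropy  # information gain; the larger it is, the more this feature decides
        print("inforgain:", infogain)
        if infogain > bestinfogain:
            bestinfogain = infogain
            bestfeature = i
    return bestfeature

mydata, labels = createdataset()  # start again from the unmodified dataset (no 'maybe' label)
print("bestfeature:", choosebestfeaturetosplit(mydata))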

The output is (the label-count dictionaries printed inside calcshannonent are not shown):

featlist: [1, 1, 1, 0, 0]
uniquevals {0, 1}
subdataset: [[1, 'n'], [1, 'n']]
subdataset: [[1, 'y'], [1, 'y'], [0, 'n']]
0 列屬性的熵為： 0.5509775004326937
inforgain: 0.4199730940219749
featlist: [1, 1, 0, 1, 1]
uniquevals {0, 1}
subdataset: [[1, 'n']]
subdataset: [[1, 'y'], [1, 'y'], [0, 'n'], [0, 'n']]
1 列屬性的熵為： 0.8
inforgain: 0.17095059445466854
bestfeature: 0
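The numbers above can be checked by hand. Writing D for the dataset, A for a feature, and D_v for the subset of rows where A takes the value v (notation introduced here for illustration), the information gain and the two values printed above, rounded to three decimals, are:

```latex
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)

\text{column 0: } 0.971 - \left(\tfrac{2}{5}\cdot 0 + \tfrac{3}{5}\cdot 0.918\right) = 0.971 - 0.551 = 0.420

\text{column 1: } 0.971 - \left(\tfrac{1}{5}\cdot 0 + \tfrac{4}{5}\cdot 1\right) = 0.971 - 0.800 = 0.171
```

Column 0 leaves less entropy behind, so it has the larger gain and is chosen as the best feature.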

Next, build the decision tree.

The code is as follows:

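import operator

def majoritycnt(classlist):
    # Return the class name that occurs most often
    classcount = {}
    for vote in classlist:
        if vote not in classcount:
            classcount[vote] = 0  # create the dictionary entry for a new class
        classcount[vote] += 1  # count this class
    # sort by count; reverse=True gives descending order
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]

def createtree(dataset, labels):
    classlist = [example[-1] for example in dataset]
    if classlist.count(classlist[0]) == len(classlist):  # every row has the same class: stop splitting
        return classlist[0]
    if len(dataset[0]) == 1:  # all features used up (only the label column is left): return the majority class
        return majoritycnt(classlist)
    bestfeat = choosebestfeaturetosplit(dataset)
    bestfeatlabel = labels[bestfeat]
    mytree = {bestfeatlabel: {}}  # the tree is stored as a nested dictionary
    del(labels[bestfeat])  # remove the name of the chosen feature
    featvalues = [example[bestfeat] for example in dataset]  # all values of the chosen feature
    uniquevals = set(featvalues)  # its distinct values
    for value in uniquevals:
        sublabels = labels[:]  # copy, so the recursive call does not alter this level's labels
        mytree[bestfeatlabel][value] = createtree(splitdataset(dataset, bestfeat, value), sublabels)  # recurse
    return mytree

mydata, labels = createdataset()  # fresh, unmodified dataset; createtree deletes entries from labels
print("createtree:", createtree(mydata, labels))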

The last line of the output is:

createtree: {'no surfacing': {0: 'n', 1: {'flippers': {0: 'n', 1: 'y'}}}}

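The result is a multi-level nested dictionary: each level tests one feature, and the innermost values are the leaf nodes, that is, the class labels.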
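To show how such a nested dictionary is read, here is a minimal sketch that walks a tree to classify one sample. The `classify` helper is not part of this post, and the tree literal is written out by hand for illustration, using the feature names "no surfacing" and "flippers".

```python
def classify(tree, featlabels, sample):
    """Walk a nested-dict tree and return the class label for `sample`."""
    if not isinstance(tree, dict):
        return tree  # reached a leaf: the value itself is the class label
    featlabel = next(iter(tree))  # the single key names the feature tested at this node
    featindex = featlabels.index(featlabel)  # position of that feature in the sample
    subtree = tree[featlabel][sample[featindex]]  # raises KeyError for an unseen feature value
    return classify(subtree, featlabels, sample)


# Hand-written example tree (an assumption for illustration)
tree = {'no surfacing': {0: 'n', 1: {'flippers': {0: 'n', 1: 'y'}}}}
featlabels = ['no surfacing', 'flippers']
print(classify(tree, featlabels, [1, 1]))  # y
print(classify(tree, featlabels, [1, 0]))  # n
```

Note that `featlabels` must be the full, original list of feature names, so pass a copy to any function that deletes entries from it.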
