Machine Learning in Action: Decision Trees


from math import log

def calcshannonent(dataset):
    # Shannon entropy of the class labels (the last column of each entry)
    numentries = len(dataset)
    labelcounts = {}
    for featvec in dataset:
        currentlabel = featvec[-1]
        if currentlabel not in labelcounts:
            labelcounts[currentlabel] = 0
        labelcounts[currentlabel] += 1
    shannonent = 0.0
    for key in labelcounts:
        prob = float(labelcounts[key]) / numentries
        shannonent -= prob * log(prob, 2)
    return shannonent
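
As a quick sanity check, the function can be run on the small fish-identification dataset the book builds this chapter around. The createdataset helper below is a minimal sketch of that dataset, reconstructed here for illustration; with two 'yes' and three 'no' labels the entropy is -(2/5)log2(2/5) - (3/5)log2(3/5) ≈ 0.971.

def createdataset():
    # toy data: [can survive without surfacing, has flippers, is it a fish?]
    dataset = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataset, labels

mydata, labels = createdataset()
print(calcshannonent(mydata))   # about 0.971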

For each feature, compute the information entropy of the resulting split of the dataset once, then judge which feature gives the best way to partition the data. The first step is a helper that performs one such split:

def splitdataset(dataset, axis, value):
    # collect every entry whose feature at index axis equals value,
    # and strip that feature out of the returned entries
    retdataset = []
    for featvec in dataset:
        if featvec[axis] == value:
            # build a new list so the caller's dataset is not mutated
            reducefeatvec = featvec[:axis]
            reducefeatvec.extend(featvec[axis+1:])
            retdataset.append(reducefeatvec)
    return retdataset
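
For example, splitting on feature 0 (assuming the createdataset sketch above):

mydata, labels = createdataset()
print(splitdataset(mydata, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitdataset(mydata, 0, 0))   # [[1, 'no'], [1, 'no']]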

The entropy calculation tells us which way of splitting the dataset organizes the data best.
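
Concretely, the function below picks the feature $A$ with the largest information gain; written out (the standard definition, stated here for reference rather than taken from the original text):

$$\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v), \qquad H(D) = -\sum_{k} p_k \log_2 p_k$$

where $D_v$ is the subset of $D$ whose feature $A$ takes the value $v$.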

def choosebestfeaturetosplit(dataset):
    numfeatures = len(dataset[0]) - 1      # the last column is the class label
    baseentropy = calcshannonent(dataset)
    bestinfogain = 0.0
    bestfeature = -1
    for i in range(numfeatures):
        # collect the i-th feature value of every example into featlist
        featlist = [example[i] for example in dataset]
        uniquevals = set(featlist)
        newentropy = 0.0
        for value in uniquevals:
            subdataset = splitdataset(dataset, i, value)
            prob = len(subdataset) / float(len(dataset))
            newentropy += prob * calcshannonent(subdataset)
        infogain = baseentropy - newentropy
        if infogain > bestinfogain:
            bestinfogain = infogain
            bestfeature = i
    return bestfeature
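
On the toy dataset this picks feature 0 ('no surfacing'): splitting on it drops the weighted entropy to about 0.551 (a gain of about 0.420), versus a gain of only about 0.171 for 'flippers'. Again assuming the createdataset sketch above:

mydata, labels = createdataset()
print(choosebestfeaturetosplit(mydata))   # 0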

If the dataset has been processed down to no remaining attributes but the class labels are still not unique, we must decide how to define that leaf node. In this situation we usually settle the leaf's classification by majority vote.

import operator

def majoritycnt(classlist):
    # majority vote: return the class label that occurs most often
    classcount = {}
    for vote in classlist:
        if vote not in classcount:
            classcount[vote] = 0
        classcount[vote] += 1
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]
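
A trivial check:

print(majoritycnt(['yes', 'no', 'yes']))   # 'yes'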

def createtree(dataset, labels):
    classlist = [example[-1] for example in dataset]
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]       # stop splitting when all of the classes are equal
    if len(dataset[0]) == 1:      # stop splitting when there are no more features in dataset
        return majoritycnt(classlist)
    bestfeat = choosebestfeaturetosplit(dataset)
    bestfeatlabel = labels[bestfeat]
    mytree = {bestfeatlabel: {}}
    del(labels[bestfeat])
    featvalues = [example[bestfeat] for example in dataset]
    uniquevals = set(featvalues)
    for value in uniquevals:
        sublabels = labels[:]     # copy so the recursive calls do not share one label list
        mytree[bestfeatlabel][value] = createtree(splitdataset(dataset, bestfeat, value), sublabels)
    return mytree
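
Building the tree on the toy dataset yields the nested-dictionary form the rest of the code consumes. Note that createtree deletes entries from labels as it recurses, so pass in a copy if the list is needed afterwards (createdataset is the sketch from above):

mydata, labels = createdataset()
mytree = createtree(mydata, labels[:])
print(mytree)   # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}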

Storing data together with its features raises a problem: the program cannot tell on its own where a given feature sits within a data entry. The list of feature labels helps the program resolve this.

def classify(inputtree, featlabels, testvec):
    firststr = list(inputtree.keys())[0]        # the feature tested at this node
    seconddict = inputtree[firststr]
    featindex = featlabels.index(firststr)      # find that feature's position in testvec
    for key in seconddict.keys():
        if testvec[featindex] == key:
            if type(seconddict[key]).__name__ == 'dict':
                classlabel = classify(seconddict[key], featlabels, testvec)   # descend into the subtree
            else:
                classlabel = seconddict[key]    # leaf reached: this is the predicted class
    return classlabel
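
Classifying two test vectors against the tree built above (the label list supplies the feature positions, which is why it must be kept intact):

mydata, labels = createdataset()
mytree = createtree(mydata, labels[:])      # pass a copy so labels survives
print(classify(mytree, labels, [1, 0]))     # 'no'
print(classify(mytree, labels, [1, 1]))     # 'yes'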

def storetree(inputtree, filename):
    # pickle the tree to disk; pickle requires a binary-mode file
    import pickle
    with open(filename, 'wb') as fw:
        pickle.dump(inputtree, fw)

def grabtree(filename):
    import pickle
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
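
A round trip through disk (the filename here is arbitrary):

storetree(mytree, 'classifierstorage.txt')
print(grabtree('classifierstorage.txt'))   # the same nested dict as mytree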
