Decision Trees: The ID3 Algorithm

2021-07-14 09:44:53

The ID3 algorithm computes the information gain of every attribute, treats a larger gain as a better attribute, and at each split chooses the attribute with the largest information gain; repeating this process top-down grows a decision tree. Information entropy describes, on average, how surprised an outcome leaves us: if all outcomes are equally probable, entropy is at its maximum and the uncertainty (average surprise) is greatest; if one outcome is highly probable and the rest are very unlikely, entropy is low and the result is largely unsurprising. It is computed as:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
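For instance (my own numbers, not from the original post): a fair coin has H = -(0.5 · log2 0.5 + 0.5 · log2 0.5) = 1 bit of entropy, while a coin with p(heads) = 0.9 has H = -(0.9 · log2 0.9 + 0.1 · log2 0.1) ≈ 0.469 bits; the lopsided coin is far less uncertain.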

Information gain describes the change in information entropy before and after an event occurs.

In ID3 it can be understood as the change in entropy after a feature is introduced to partition the data (this is purely my own understanding; corrections are welcome). It is computed as:

Gain(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|} H(D_v)

where n is the number of classes, p_i is the proportion of samples in class i, and D_v is the subset of D on which attribute A takes the value v.

data.txt (tab-separated; note that the class label "no lenses" itself contains a space):

young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses

label.txt:

age	prescript	astigmatic	tearrate
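To make the gain formula concrete, here is the first split worked by hand (my own arithmetic, not part of the original post). The 24 rows contain 15 "no lenses", 5 "soft", and 4 "hard", so

H(D) = -(15/24) log2(15/24) - (5/24) log2(5/24) - (4/24) log2(4/24) ≈ 1.326 bits.

Splitting on tearrate: the 12 "reduced" rows are all "no lenses" (entropy 0), and the 12 "normal" rows contain 5 soft / 4 hard / 3 no lenses (entropy ≈ 1.555), so

Gain(D, tearrate) = 1.326 - (12/24) · 0 - (12/24) · 1.555 ≈ 0.549 bits,

the largest of the four attributes (astigmatic ≈ 0.377; age and prescript ≈ 0.04), so ID3 splits on tearrate first.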
#encoding: utf-8

import math

def cal_shannon_ent(data):
    # Shannon entropy of the class labels (last column) of the data set.
    data_num = len(data)
    cnt = {}
    for feat_vec in data:
        tmp_class = feat_vec[-1]              # class label is the last column
        if tmp_class not in cnt:
            cnt[tmp_class] = 0
        cnt[tmp_class] += 1
    shannon_ent = 0.0
    for key in cnt:
        p = float(cnt[key]) / data_num        # empirical probability of the class
        shannon_ent -= p * math.log(p, 2)     # H -= p * log2(p)
    return shannon_ent
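# Quick sanity check (my own example, not from the original post):
#   cal_shannon_ent([[1, 'yes'], [1, 'yes'], [0, 'no']])
#   = -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.9183 bits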

def split_data(data, index, value):
    # Rows whose feature at `index` equals `value`, with that column removed.
    ret_data = []
    for feat_vec in data:
        if feat_vec[index] == value:
            remain_feat_vec = feat_vec[:index]            # columns before index
            remain_feat_vec.extend(feat_vec[index + 1:])  # columns after index
            ret_data.append(remain_feat_vec)
    return ret_data
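# Example (mine, not the original author's): with the lens data loaded below,
# split_data(data, 3, 'reduced') returns the 12 rows whose tearrate is
# 'reduced', each shortened to [age, prescript, astigmatic, class].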

def get_best_feat(data):
    # Index of the feature with the largest information gain.
    base_ent = cal_shannon_ent(data)
    feat_num = len(data[0]) - 1               # last column is the class label
    best_info_gain = 0.0
    best_feat = -1
    for i in range(feat_num):
        values = [j[i] for j in data]
        unique_values = set(values)
        new_ent = 0.0
        for value in unique_values:
            sub_data = split_data(data, i, value)
            p = float(len(sub_data)) / len(data)
            new_ent += p * cal_shannon_ent(sub_data)  # branch entropy weighted by branch size
        info_gain = base_ent - new_ent
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feat = i
    return best_feat
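# On the lens data this returns 3 (the tearrate column), matching the
# hand computation shown after the data listing above.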

def get_major_class(class_list):
    # Most frequent class label (majority vote).
    cnt = {}
    maxx = -1
    major_class = -1
    for _class in class_list:
        if _class not in cnt:
            cnt[_class] = 0
        cnt[_class] += 1
        if cnt[_class] > maxx:
            maxx = cnt[_class]
            major_class = _class
    return major_class
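# An equivalent using only the standard library (an alternative sketch,
# not the original author's code):
#   from collections import Counter
#   major_class = Counter(class_list).most_common(1)[0][0]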

def create_tree_id3(data, label):
    # Recursively build the ID3 tree as nested dicts: {feature: {value: subtree}}.
    class_list = [i[-1] for i in data]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]                  # all samples agree: leaf node
    if len(data[0]) == 1:
        return get_major_class(class_list)    # no features left: majority vote
    best_feat = get_best_feat(data)
    best_feat_label = label[best_feat]
    ret_tree = {best_feat_label: {}}
    del(label[best_feat])
    values = [i[best_feat] for i in data]
    unique_values = set(values)
    for value in unique_values:
        sub_label = label[:]                  # copy, so recursive calls don't share one mutated list
        ret_tree[best_feat_label][value] = create_tree_id3(split_data(data, best_feat, value), sub_label)
    return ret_tree
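# The post builds the tree but never queries it; below is a minimal
# classifier that walks the nested dicts (my own sketch, not from the
# original post). `label` must be the untouched feature-name list, since
# create_tree_id3 deletes entries from the list it is given.
def classify(tree, label, feat_vec):
    if not isinstance(tree, dict):
        return tree                           # reached a leaf: a class label
    feat_label = list(tree.keys())[0]         # feature this node splits on
    idx = label.index(feat_label)             # position in the full feature vector
    return classify(tree[feat_label][feat_vec[idx]], label, feat_vec)
# Usage (hypothetical example):
#   classify(tree, ['age', 'prescript', 'astigmatic', 'tearrate'],
#            ['young', 'myope', 'no', 'reduced'])  ->  'no lenses'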

if __name__ == "__main__":
    # data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    # label = ['no surfacing', 'flippers']
    fr = open('data.txt')
    data = [line.strip().split('\t') for line in fr.readlines()]
    fr = open('label.txt')
    label = [line.strip().split('\t') for line in fr.readlines()][0]
    print(create_tree_id3(data, label))

Output:

{'tearrate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
