sklearn Study Notes (7): Decision Trees and Random Forests

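The idea behind decision trees is very simple. The conditional branch in programming is the if-then structure, and the earliest decision trees were a classification method that used this kind of structure to split data.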

Information entropy was introduced by Claude Elwood Shannon in 1948 to solve the problem of measuring information quantitatively.

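Information gain: the information gain g(D, A) of feature A on training set D is defined as the difference between the entropy H(D) of D and the conditional entropy H(D|A) of D given feature A, that is: g(D, A) = H(D) - H(D|A)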
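The two quantities above can be sketched in a few lines of plain Python. This is a minimal illustration, not how sklearn computes them internally; the function names and the toy weather data are assumptions made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A) for one categorical feature A."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        # Labels of the samples where feature A takes this value
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy data: does a customer buy, given the weather?
weather = ["sunny", "sunny", "rainy", "rainy"]
buys = ["yes", "yes", "no", "no"]
print(entropy(buys))                    # 1.0 (two equally likely classes)
print(information_gain(weather, buys))  # 1.0 (the feature separates the classes)
```

A feature that tells nothing about the label has a gain of 0, so a tree builder that maximizes information gain would pick the weather feature here.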

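Algorithms commonly used for decision trees

ID3: choose the split with the largest information gain

C4.5: choose the split with the largest information gain ratio

CART: for regression trees, minimize the squared error; for classification trees, minimize the Gini index. The Gini index is the default splitting criterion in sklearn, and another criterion can be selected.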
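To make the Gini criterion concrete, here is a small sketch of the impurity of a node and of a binary split. The helper names are hypothetical and the labels are toy data; sklearn performs the equivalent computation internally.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighted by the size of each side."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


print(gini(["a", "a", "b", "b"]))             # 0.5
print(weighted_gini(["a", "a"], ["b", "b"]))  # 0.0
```

A classification tree built with this criterion prefers the split with the lowest weighted impurity, so the perfectly separating split above would be chosen.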

sklearn decision tree API: sklearn.tree.DecisionTreeClassifier
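A minimal usage sketch of this class follows. It assumes the built-in iris dataset instead of the Titanic data used later in these notes, and the max_depth and random_state values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# criterion can be "gini" (the default) or "entropy"
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("accuracy:", accuracy)
```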

sklearn API for exporting the tree structure to a local file: sklearn.tree.export_graphviz

sklearn.tree.export_graphviz(estimator, out_file="tree.dot", feature_names=[...]) exports the fitted tree in DOT format.

Graphviz is a tool that converts DOT files to PDF or PNG.

Installing Graphviz: on Ubuntu, sudo apt-get install graphviz; on Mac, brew install graphviz

Run the command: dot -Tpng tree.dot -o tree.png
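The export step can be tried without the Titanic data. The sketch below assumes the iris dataset and passes out_file=None, which makes export_graphviz return the DOT source as a string instead of writing a file.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# With out_file=None the DOT text is returned rather than saved to disk
dot_source = export_graphviz(clf, out_file=None, feature_names=iris.feature_names)
print(dot_source[:60])
```

Writing dot_source to tree.dot and running the dot command on it should produce the same picture as exporting directly to a file.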

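Advantages, disadvantages, and improvements of decision trees

Advantages: simple to understand and interpret, and the tree can be visualized.

Little data preparation is needed, whereas other techniques often require data normalization.

Disadvantages: a decision tree learner can build an overly complex tree that does not generalize well to new data. This is called overfitting.

Improvements: pruning, as in the CART algorithm (already implemented in the decision tree API; the random forest parameter tuning section covers it as well)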
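One way to see pruning in action is sklearn's cost-complexity pruning parameter ccp_alpha. Whether this is the exact mechanism the note refers to is an assumption; limiting max_depth, as the sample code in these notes does, is another common option. The dataset and the alpha value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unrestricted tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A positive ccp_alpha removes branches whose impurity reduction
# is too small to justify the added complexity
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print("nodes without pruning:", full_tree.tree_.node_count)
print("nodes with ccp_alpha=0.01:", pruned_tree.tree_.node_count)
```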

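Random forests

Note: because decision trees are easy to analyze, they are widely used in important business decision-making.

Definition: in machine learning, a random forest is a classifier that contains multiple decision trees, and its output class is the mode of the classes output by the individual trees.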
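The "mode of the individual trees" in the definition amounts to a majority vote. The sketch below uses a hypothetical helper and made-up predictions to show only the voting step.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single sample
print(majority_vote(["survived", "died", "survived", "survived", "died"]))  # survived
```

Note that sklearn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch follows the textbook definition, not the library's exact behavior.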

sklearn random forest API: sklearn.ensemble.RandomForestClassifier
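A minimal usage sketch of this class, again assuming the iris dataset; the n_estimators, max_depth, and random_state values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# n_estimators is the number of trees in the forest
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X_train, y_train)

rf_accuracy = rf.score(X_test, y_test)
print("trees in the forest:", len(rf.estimators_))
print("accuracy:", rf_accuracy)
```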

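Advantages of random forests

Accuracy is typically very good compared with many other commonly used algorithms

Runs efficiently on large datasets

Handles input samples with high-dimensional features without requiring dimensionality reduction

Can estimate the importance of each feature for the classification problem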
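The last point can be checked through the feature_importances_ attribute of a fitted forest. The sketch below assumes the iris dataset; the exact numbers depend on the data and the random seed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from high to low
ranked = sorted(
    zip(iris.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature.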

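# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def decision_ex():
    """Predict Titanic survival with a decision tree and a random forest.

    :return: None
    """
    # Load the data and select a few features plus the target column
    # (the file path is left empty in the source)
    titan = pd.read_csv("")
    # The third column name is masked as "***" in the source
    # (likely the passenger's sex column)
    x = titan[["pclass", "age", "***"]].copy()
    y = titan["survived"]
    # age has missing values; fill them with the mean.
    # Assigning back avoids pandas' chained-assignment warning.
    x["age"] = x["age"].fillna(x["age"].mean())
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # Dict feature extraction: one-hot encodes the categorical features
    data_dict = DictVectorizer(sparse=False)
    x_train = data_dict.fit_transform(x_train.to_dict(orient="records"))
    # Reuse the mapping fitted on the training set; do not refit on test data
    x_test = data_dict.transform(x_test.to_dict(orient="records"))
    # get_feature_names() was removed in scikit-learn 1.2
    print(data_dict.get_feature_names_out())

    # Decision tree (left commented out, as in the source)
    # dec = DecisionTreeClassifier(max_depth=5)
    # dec.fit(x_train, y_train)
    # print("Accuracy:", dec.score(x_test, y_test))
    # # Export the tree structure
    # export_graphviz(dec, out_file="./tree.dot",
    #                 feature_names=list(data_dict.get_feature_names_out()))
    # print("Tree structure exported.")

    # Random forest with a grid search over hyperparameters
    rf = RandomForestClassifier()
    # Build the parameter grid.
    # NOTE: the grid itself is missing from the source; the values below are
    # illustrative assumptions, not the author's original settings.
    param = {"n_estimators": [120, 200, 300], "max_depth": [5, 8, 15]}
    gc = GridSearchCV(rf, param_grid=param, cv=2)
    gc.fit(x_train, y_train)
    print("Accuracy:", gc.score(x_test, y_test))
    print("Selected parameter combination:", gc.best_params_)


if __name__ == "__main__":
    decision_ex()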
