Feature Selection: GBDT Feature Importance

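Ensemble learning has attracted wide attention for its high predictive accuracy, especially ensemble algorithms that use decision trees as base learners. The best-known tree ensemble algorithms are random forests and GBDT. Random forests resist overfitting well, and their main parameter (the number of decision trees) has little effect on predictive performance, so tuning is easy: it is usually enough to set it to a fairly large number. GBDT rests on an elegant theoretical foundation and generally achieves better performance.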

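Tree-based ensemble algorithms have another useful property: once training is finished, the model can report the relative importance of the features it used. This helps us select features and understand which factors have a decisive influence on the prediction, which matters a great deal in fields such as bioinformatics and neuroscience. This article describes how tree-based ensemble algorithms compute the relative importance of each feature.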
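As a quick illustration of reading these importances from a trained model, here is a minimal sketch that assumes scikit-learn is installed. The synthetic dataset from `make_classification`, the hyperparameters, and the variable names are illustrative choices, not taken from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (an assumption for illustration): 8 features, of which
# the first 3 are informative because shuffle=False keeps them in front.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, y)

# One score per feature; current scikit-learn versions scale them to sum to 1.
importances = model.feature_importances_

# Rank feature indices from most to least important.
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
for j in ranking:
    print(f"feature {j}: {importances[j]:.3f}")
```

The ranking can then be used to keep only the top-scoring features, although the cutoff is a judgment call that depends on the task.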

Friedman proposed the following method in his GBM paper: the global importance of feature $j$ is measured as the average of its importance in the individual trees:

$$\hat{J_j^2} = \frac{1}{M} \sum_{m=1}^{M} \hat{J_j^2}(T_m)$$

where $M$ is the number of trees. The importance of feature $j$ in a single tree $T$ is:

$$\hat{J_j^2}(T) = \sum_{t=1}^{L-1} \hat{i_t^2} \, 1(v_t = j)$$

where $L$ is the number of leaf nodes in the tree, so $L-1$ is the number of non-leaf nodes (every tree built is a binary tree whose internal nodes have both a left and a right child), $v_t$ is the feature associated with node $t$, and $\hat{i_t^2}$ is the reduction in squared loss after node $t$ is split.
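The two formulas can be traced by hand with a small worked sketch. The representation is hypothetical: each tree is a list of `(feature index, squared-loss reduction)` pairs, one per internal node, and the numbers are made up for illustration.

```python
def tree_importance(internal_nodes, n_features):
    """Per-tree importance: sum the improvement of every internal node
    that splits on feature j (the indicator 1(v_t = j) in the formula)."""
    scores = [0.0] * n_features
    for feature, improvement in internal_nodes:
        scores[feature] += improvement
    return scores


def ensemble_importance(trees, n_features):
    """Global importance: average the per-tree scores over all M trees."""
    totals = [0.0] * n_features
    for nodes in trees:
        for j, score in enumerate(tree_importance(nodes, n_features)):
            totals[j] += score
    return [total / len(trees) for total in totals]


# Two toy trees over 3 features; each tuple is (v_t, i_t^2) for one
# internal node. The values are made up purely for illustration.
toy_trees = [
    [(0, 4.0), (1, 1.0), (0, 2.0)],  # tree 1: feature 0 is used twice
    [(2, 3.0), (0, 1.0)],            # tree 2
]

scores = ensemble_importance(toy_trees, n_features=3)
print(scores)  # [3.5, 0.5, 1.5]
```

Feature 0 scores highest because it appears in both trees and its splits yield the largest reductions; a feature that is never used for a split gets a score of 0.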

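To make the computation of feature importance easier to understand, the scikit-learn implementation is shown below, with some irrelevant parts of the code removed.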

The following code is how the feature_importances_ attribute of the GradientBoostingClassifier object is computed:

    def feature_importances_(self):
        total_sum = np.zeros((self.n_features, ), dtype=np.float64)
        for tree in self.estimators_:
            total_sum += tree.feature_importances_
        importances = total_sum / len(self.estimators_)
        return importances

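Here, self.estimators_ is the array of decision trees built by the algorithm, and tree.feature_importances_ is the feature importance vector of a single tree, which is computed as follows: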

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        # Walk every node in the tree's flat node array.
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]
                # Weighted impurity decrease produced by this split.
                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1
        # Scale by the weighted sample count at the root.
        importances /= nodes[0].weighted_n_node_samples
        return importances

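The code above has been simplified but keeps the core idea: for every non-leaf node, compute the reduction in weighted impurity produced by its split and credit it to the feature used for that split. The larger the total reduction, the more important the feature.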
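For readers who prefer plain Python to Cython, the same traversal can be sketched over parallel lists. The layout loosely mirrors the arrays that a fitted scikit-learn tree exposes (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`), but the function name, the `LEAF` constant, and the toy tree values are assumptions made for illustration. The final normalization step is what the `normalize` flag is taken to mean here: dividing by the total so the scores sum to 1.

```python
LEAF = -1  # marker for "no child", chosen here for illustration


def impurity_importances(children_left, children_right, feature,
                         impurity, weighted_n, n_features, normalize=True):
    """Sum the weighted impurity decrease of every split, grouped by feature."""
    importances = [0.0] * n_features
    for node in range(len(feature)):
        left, right = children_left[node], children_right[node]
        if left == LEAF:
            continue  # leaves do not split, so they contribute nothing
        importances[feature[node]] += (
            weighted_n[node] * impurity[node]
            - weighted_n[left] * impurity[left]
            - weighted_n[right] * impurity[right]
        )
    # Scale by the number of (weighted) samples at the root.
    importances = [v / weighted_n[0] for v in importances]
    if normalize:
        total = sum(importances)
        if total > 0.0:
            importances = [v / total for v in importances]
    return importances


# Toy tree: node 0 splits on feature 0, node 1 splits on feature 1,
# and nodes 2, 3 and 4 are leaves. All numbers are made up.
children_left = [1, 3, LEAF, LEAF, LEAF]
children_right = [2, 4, LEAF, LEAF, LEAF]
feature = [0, 1, -2, -2, -2]
impurity = [0.5, 0.25, 0.0, 0.0, 0.0]
weighted_n = [8.0, 4.0, 4.0, 2.0, 2.0]

raw = impurity_importances(children_left, children_right, feature,
                           impurity, weighted_n, n_features=2, normalize=False)
print(raw)  # [0.375, 0.125]
```

The root split removes 8 × 0.5 − 4 × 0.25 − 4 × 0 = 3 units of weighted impurity and the second split removes 1, so after dividing by the 8 samples at the root the scores are 0.375 and 0.125. With `normalize=True` they become 0.75 and 0.25.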
