Constructing New Features with a GBDT Model

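In real-world problems, the features that can be fed directly into a machine learning model are often few. Whether you can mine useful features out of "messy" raw logs largely determines how well the model performs. To quote a popular saying: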

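Features determine the upper bound on what any algorithm can achieve; different algorithms merely differ in how close they get to that bound.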

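In this post I introduce a method, published by Facebook [1], for constructing new features with a GBDT model. The idea is simple: first train a GBDT model on the existing features, then use the trees the GBDT model has learned to construct new features, and finally train a model on the original features together with these new features. The constructed feature vector takes 0/1 values, and each of its elements corresponds to one leaf node of a tree in the GBDT model. When a sample passes through a tree and lands on one of its leaf nodes, the element for that leaf is set to 1 and the elements for that tree's other leaves are set to 0. The length of the new feature vector equals the total number of leaf nodes over all trees in the GBDT model.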
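
The three steps can be sketched as follows. This is a minimal illustration using scikit-learn, not the implementation from [1]; the synthetic dataset, the hyperparameters, and the choice of logistic regression as the final model are all assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data stands in for real features (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: train a GBDT model on the existing features.
gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# Step 2: apply() gives, for each sample, the index of the leaf it reaches
# in every tree. For binary classification the last axis has size 1.
leaves_train = gbdt.apply(X_train)[:, :, 0]
leaves_test = gbdt.apply(X_test)[:, :, 0]

# One-hot encode the leaf indices: one 0/1 column per leaf,
# and exactly one active column per tree for each training sample.
encoder = OneHotEncoder(handle_unknown="ignore")
new_train = encoder.fit_transform(leaves_train)
new_test = encoder.transform(leaves_test)

# Step 3: train the final model on original features plus the new 0/1 features.
X_train_all = hstack([csr_matrix(X_train), new_train]).tocsr()
X_test_all = hstack([csr_matrix(X_test), new_test]).tocsr()
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train_all, y_train)
print("test accuracy:", final_model.score(X_test_all, y_test))
```

For brevity the sketch fits the GBDT and the final model on the same training rows; fitting them on separate splits is a common way to reduce overfitting.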

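An example. The two trees in the figure below were learned by GBDT; the first tree has 3 leaf nodes and the second has 2. Suppose an input sample x lands on the second leaf of the first tree and on the first leaf of the second tree. The new feature vector obtained from GBDT is then [0, 1, 0, 1, 0], where the first three elements correspond to the 3 leaves of the first tree and the last two to the 2 leaves of the second tree.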
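
The same example can be reproduced in a few lines of plain Python. The helper name `leaves_to_vector` and its 0-based leaf positions are hypothetical, chosen only for this illustration.

```python
def leaves_to_vector(leaf_positions, leaf_counts):
    """Build the concatenated 0/1 vector from per-tree leaf positions.

    leaf_positions[t] is the 0-based position of the leaf reached in tree t;
    leaf_counts[t] is the number of leaves in tree t.
    """
    vector = []
    for position, count in zip(leaf_positions, leaf_counts):
        if not 0 <= position < count:
            raise ValueError("leaf position out of range for this tree")
        block = [0] * count   # one slot per leaf of this tree
        block[position] = 1   # mark the leaf the sample landed on
        vector.extend(block)
    return vector


# Tree 1 has 3 leaves and x lands on its 2nd leaf (position 1);
# tree 2 has 2 leaves and x lands on its 1st leaf (position 0).
print(leaves_to_vector([1, 0], [3, 2]))  # [0, 1, 0, 1, 0]
```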

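So how many trees does the GBDT need to give the best results? The exact number clearly depends on your application and on how much data you have. In general, when data is scarce, too many trees lead to overfitting. In the application described by the paper's authors, results essentially stop improving at around 500 trees. The authors also constrain the number of leaf nodes per tree when building the GBDT: no more than 12 leaves.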
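
One way to explore both choices is sketched below with scikit-learn, under stated assumptions: the synthetic data, the 100-tree budget, and the use of validation log loss as the stopping signal are illustrative choices, not the procedure from [1]. `max_leaf_nodes=12` mirrors the leaf limit mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

gbdt = GradientBoostingClassifier(
    n_estimators=100,
    max_leaf_nodes=12,  # at most 12 leaves per tree
    max_depth=6,        # deep enough that the leaf limit, not depth, can bind
    random_state=0,
)
gbdt.fit(X_train, y_train)

# Largest leaf count across all fitted trees.
max_leaves = max(tree.get_n_leaves() for tree in gbdt.estimators_[:, 0])

# Validation log loss after 1, 2, ..., n_estimators trees.
valid_loss = [
    log_loss(y_valid, proba) for proba in gbdt.staged_predict_proba(X_valid)
]
best_n_trees = int(np.argmin(valid_loss)) + 1

print("max leaves per tree:", max_leaves)
print("trees at lowest validation loss:", best_n_trees)
```

If the lowest validation loss occurs well before the last tree, the extra trees are not helping on held-out data and a smaller model is likely enough.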

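Below are the actual results of this method on a probability prediction problem at our company, 世紀佳緣 (Jiayuan). We used only 30 trees. The first figure shows the result with the original features only; the second shows the result with the original features plus the new GBDT features. The horizontal axis is the predicted probability and the vertical axis is the actual probability, so the closer the points lie to the reference line y = x, the better. Clearly, with the new features constructed by GBDT, the model's predictions are considerably better.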
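
A plot of predicted against actual probability can be computed with scikit-learn's `calibration_curve`. The sketch below is one way to do it, not the code behind the figures in this post; the synthetic data, the logistic regression model, and the 10-bin setting are assumptions.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a simple model stand in for the real setup (assumptions).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predicted = model.predict_proba(X_test)[:, 1]

# Bin the predictions; for each non-empty bin this returns the observed
# positive rate (y-axis) and the mean predicted probability (x-axis).
actual_rate, mean_predicted = calibration_curve(y_test, predicted, n_bins=10)

# Average vertical distance of the points from the reference line y = x.
mean_gap = float(np.mean(np.abs(actual_rate - mean_predicted)))

for p, a in zip(mean_predicted, actual_rate):
    print(f"predicted {p:.2f} -> actual {a:.2f}")
print("mean distance from y = x:", round(mean_gap, 3))
```

A smaller mean distance means the points sit closer to y = x, which is the criterion used to compare the two figures.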

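Incidentally, someone has already used this method to win a Kaggle CTR prediction competition, and their published solution contains a concrete implementation of it.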

Below is an implementation that generates the new features with xgboost; it requires xgboost 0.6 or later.

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 3 21:37:30 2017
@author: bryan
"""
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier


def mergetoone(x, x2):
    """Append the leaf indices x2 to the original features x, row by row."""
    return np.hstack([np.asarray(x), np.asarray(x2)])


data = pd.read_csv(r"e:\data\wine.csv")
# Shuffle the data
data = data.sample(len(data))
y = data.label
x = data.drop("label", axis=1)
# Split into training and test sets; test_size is the fraction used for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)
# x_train_1 trains the feature-generating model; x_train_2 is combined with
# the new features to form the new training set
x_train_1, x_train_2, y_train_1, y_train_2 = train_test_split(x_train, y_train, test_size=0.6, random_state=0)

clf = XGBClassifier(
    learning_rate=0.2,  # default is 0.3
    n_estimators=200,  # number of trees
    max_depth=8,
    min_child_weight=10,
    gamma=0.5,
    subsample=0.75,
    colsample_bytree=0.75,
    objective='binary:logistic',  # logistic regression loss
    nthread=8,  # number of CPU threads
    scale_pos_weight=1,
    reg_alpha=1e-05,
    reg_lambda=10,
    seed=1024)  # random seed
clf.fit(x_train_1, y_train_1)

# apply() returns, for every sample, the index of the leaf it lands in for each tree
new_feature = clf.apply(x_train_2)
x_train_new2 = mergetoone(x_train_2, new_feature)
new_feature_test = clf.apply(x_test)
x_test_new = mergetoone(x_test, new_feature_test)

# Both comparison models use the same hyperparameters
final_params = dict(
    learning_rate=0.05,  # default is 0.3
    n_estimators=300,  # number of trees
    max_depth=7,
    min_child_weight=1,
    gamma=0.5,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',  # logistic regression loss
    nthread=8,  # number of CPU threads
    scale_pos_weight=1,
    reg_alpha=1e-05,
    reg_lambda=1,
    seed=1024)  # random seed

# Baseline: original features only
model = XGBClassifier(**final_params)
model.fit(x_train_2, y_train_2)
y_pre = model.predict(x_test)
y_pro = model.predict_proba(x_test)[:, 1]
print("auc score :", metrics.roc_auc_score(y_test, y_pro))
print("accuracy :", metrics.accuracy_score(y_test, y_pre))

# Original features plus the new leaf-index features
model = XGBClassifier(**final_params)
model.fit(x_train_new2, y_train_2)
y_pre = model.predict(x_test_new)
y_pro = model.predict_proba(x_test_new)[:, 1]
print("auc score :", metrics.roc_auc_score(y_test, y_pro))
print("accuracy :", metrics.accuracy_score(y_test, y_pre))
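
xgboost's `apply` returns integer leaf indices, one column per tree, rather than the 0/1 vectors described earlier in this post. If 0/1 features are wanted, the indices can be one-hot encoded before they are merged with the original features. The sketch below shows this on made-up numbers; `leaf_index` and `original` are hypothetical stand-ins, not output from the wine data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up leaf indices for 3 samples and 2 trees, standing in for the
# (n_samples, n_trees) integer matrix that a fitted model's apply() returns.
leaf_index = np.array([[3, 5],
                       [4, 5],
                       [3, 6]])
# Made-up original features for the same 3 samples.
original = np.array([[0.1, 1.0],
                     [0.2, 2.0],
                     [0.3, 3.0]])

# Treat each tree's leaf index as a categorical value and expand it to 0/1 columns.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = encoder.fit_transform(leaf_index).toarray()
print(leaf_onehot)
# [[1. 0. 1. 0.]
#  [0. 1. 1. 0.]
#  [1. 0. 0. 1.]]

# Original features followed by the 0/1 leaf features.
combined = np.hstack([original, leaf_onehot])
print(combined.shape)  # (3, 6)
```

Tree models can split on raw leaf indices directly, so the encoding matters most when the final model is linear, such as logistic regression.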

# References

[1] Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook, 2014.
