Feature Importance and Feature Selection in XGBoost

2021-09-06 10:06:16

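1. Feature importance scores

Alternatively, XGBoost has a built-in function that ranks the features for you, which is friendlier to read.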

2. Feature selection

SelectFromModel

For example, like this (remember to transform the data with the selector before passing it to the model):

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_x_train = selection.transform(x_train)

# train model
selection_model = XGBClassifier()
selection_model.fit(select_x_train, y_train)

# eval model
select_x_test = selection.transform(x_test)
y_pred = selection_model.predict(select_x_test)
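To make the role of `threshold` concrete, here is a minimal pure-Python sketch of the rule scikit-learn documents for `SelectFromModel`: a feature is kept when its importance is greater than or equal to the threshold. The eight importance values are the thresholds printed in the output later in this post; the order in which they are assigned to columns is an assumption, and `select_mask` is a hypothetical helper, not a library function.

```python
# Importance values taken from the post's printed thresholds.
# Which column each value belongs to is unknown; this order is an assumption.
importances = [0.071, 0.073, 0.084, 0.090, 0.128, 0.160, 0.186, 0.208]


def select_mask(importances, threshold):
    """Return True for each feature whose importance is >= threshold."""
    return [imp >= threshold for imp in importances]


for thresh in sorted(importances):
    kept = sum(select_mask(importances, thresh))
    print("thresh=%.3f, n=%d" % (thresh, kept))
```

The counts this prints (8 features at the lowest threshold, down to 1 at the highest) match the `n` column of the post's output.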

Full code:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

# split data into x and y
x = dataset[:, 0:8]
y = dataset[:, 8]

# split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=7)

# fit model on all training data
model = XGBClassifier()
model.fit(x_train, y_train)

# make predictions for test data and evaluate
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("accuracy: %.2f%%" % (accuracy * 100.0))

# fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_x_train = selection.transform(x_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_x_train, y_train)
    # eval model
    select_x_test = selection.transform(x_test)
    y_pred = selection_model.predict(select_x_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("thresh=%.3f, n=%d, accuracy: %.2f%%" % (thresh, select_x_train.shape[1], accuracy * 100.0))

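In other words, we need to set a threshold that decides whether a feature gets selected. This example sorts the importances, tries each one in turn as the threshold, and checks which gives the best result. The output is as follows: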

accuracy: 77.95%

thresh=0.071, n=8, accuracy: 77.95%

thresh=0.073, n=7, accuracy: 76.38%

thresh=0.084, n=6, accuracy: 77.56%

thresh=0.090, n=5, accuracy: 76.38%

thresh=0.128, n=4, accuracy: 76.38%

thresh=0.160, n=3, accuracy: 74.80%

thresh=0.186, n=2, accuracy: 71.65%

thresh=0.208, n=1, accuracy: 63.78%
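One way to read the table above programmatically is sketched below. The `(thresh, n, accuracy)` tuples are copied from the output; the 0.5 percentage-point tolerance and the idea of preferring the smallest feature set within that tolerance are assumptions added for illustration, not part of the original post.

```python
# (threshold, number of features, accuracy in %) copied from the output above.
results = [
    (0.071, 8, 77.95),
    (0.073, 7, 76.38),
    (0.084, 6, 77.56),
    (0.090, 5, 76.38),
    (0.128, 4, 76.38),
    (0.160, 3, 74.80),
    (0.186, 2, 71.65),
    (0.208, 1, 63.78),
]

# Highest accuracy overall.
best = max(results, key=lambda r: r[2])

# Assumed tolerance: accept up to 0.5 percentage points below the best accuracy.
tolerance = 0.5
candidates = [r for r in results if r[2] >= best[2] - tolerance]

# Among acceptable candidates, prefer the one with the fewest features.
compact = min(candidates, key=lambda r: r[1])

print("best:    thresh=%.3f, n=%d, accuracy=%.2f%%" % best)
print("compact: thresh=%.3f, n=%d, accuracy=%.2f%%" % compact)
```

With these numbers the best accuracy uses all 8 features, while 6 features lose less than half a percentage point. Note that the comparison is made on a single train/test split, so small differences may not be meaningful.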
