Python Examples, Lesson 4: Imputing Missing Values

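This example shows that imputing missing values can give better results than discarding them. Note, however, that imputation does not always improve predictions, so evaluate it with cross-validation. Sometimes dropping the rows with missing values, or using a marker value, works better.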

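Missing values are usually replaced with the mean, the median, or the mode. The median is the more robust choice when some variables take very large values (a long tail) that would otherwise dominate the result. In this example, imputation brings the regressor's score close to the score on the complete data.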
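To make the three strategies concrete, here is a minimal sketch using only the standard library. The helper `impute` and the sample column are hypothetical and are not part of the original example; missing entries are represented as `None`.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature column: one extreme value and one missing entry.
column = [1, 2, 2, 3, 1000, None]

filled_mean = impute(column, "mean")
filled_median = impute(column, "median")
filled_mode = impute(column, "mode")

# Show the value each strategy puts in place of the missing entry.
print(filled_mean[-1], filled_median[-1], filled_mode[-1])
```

The single extreme value pulls the mean far away from the typical entries, while the median and the mode stay at 2. This is the sense in which the median is the more robust fill value.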

First, import the required modules.

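import numpy as np
# load_boston was removed in scikit-learn 1.2, so this needs an older release.
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it.
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score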

After importing the modules, create a NumPy random number generator. It is used later to decide which values go missing.

rng = np.random.RandomState(0)

RandomState returns a random number generator. The argument 0 is the seed: the same seed always produces the same sequence of random numbers.
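A quick way to see the effect of the seed is to build two generators with the same seed and compare what they draw. This is a small sketch; the names `rng_a` and `rng_b` are for illustration only.

```python
import numpy as np

# Two independent generators created with the same seed.
rng_a = np.random.RandomState(0)
rng_b = np.random.RandomState(0)

# Each draws five integers from [0, 100).
draws_a = rng_a.randint(0, 100, size=5)
draws_b = rng_b.randint(0, 100, size=5)

# With identical seeds the two sequences are identical.
same_seed_matches = bool(np.array_equal(draws_a, draws_b))
print(same_seed_matches)
```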

Load the Boston housing dataset, which was introduced in [Python Examples, Lesson 1].

dataset = load_boston()
x_full, y_full = dataset.data, dataset.target
n_samples = x_full.shape[0]
n_features = x_full.shape[1]
print(x_full.shape)

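Create a random forest estimator with 100 trees, seeded with random_state=0. A random forest is an estimator that fits many decision trees (regression trees here, since the target is a price) and averages their predictions to improve predictive accuracy and control overfitting. Each tree is built on a bootstrap sample of the original samples.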

estimator = RandomForestRegressor(random_state=0, n_estimators=100)
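The two ideas in that description, bootstrap sampling and averaging, can be sketched without scikit-learn. In the sketch below each "tree" is only a stand-in that predicts the mean of its bootstrap sample; it is not a real decision tree, and the target values are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # hypothetical target values
n_samples = y.shape[0]
n_trees = 100

tree_predictions = []
for _ in range(n_trees):
    # Bootstrap sample: draw n_samples row indices with replacement.
    idx = rng.randint(0, n_samples, n_samples)
    # Stand-in for a fitted tree: predict the mean of its bootstrap sample.
    tree_predictions.append(y[idx].mean())

# The ensemble prediction is the average of the per-tree predictions.
forest_prediction = float(np.mean(tree_predictions))
print(round(forest_prediction, 2))
```

Individual bootstrap means vary from draw to draw, but their average is much more stable, which is the variance reduction the ensemble relies on.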

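Evaluate the estimator with cross-validation, take the mean of the fold scores, and print it to two decimal places. For a regressor, the default score is the coefficient of determination R^2.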

score = cross_val_score(estimator, x_full, y_full).mean()
print("score with the entire dataset = %.2f" % score)
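As a rough sketch of what the cross-validated mean score involves, the code below splits a hypothetical one-feature dataset into five folds, fits a straight line on four of them, scores R^2 on the held-out fold, and averages the five scores. The data, the helper `r2`, and the line fit are assumptions for illustration. Unlike this sketch, cross_val_score does not shuffle the samples by default.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.RandomState(0)
x = np.linspace(0.0, 10.0, 60)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape[0])  # hypothetical data

# Shuffle the indices (x is sorted), then cut them into 5 folds.
order = rng.permutation(x.shape[0])
folds = np.array_split(order, 5)

scores = []
for i, test_idx in enumerate(folds):
    # Train on every fold except the i-th, test on the i-th.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    scores.append(r2(y[test_idx], slope * x[test_idx] + intercept))

mean_score = float(np.mean(scores))
print("mean R^2 over 5 folds = %.2f" % mean_score)
```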

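# Mark 75% of the samples as having one missing value each.
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
# Boolean mask over the rows: True means the row gets a missing value.
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
# For each affected row, pick the feature (column) that goes missing.
missing_features = rng.randint(0, n_features, n_missing_samples)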
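The mask construction is easier to follow on a toy array. The sketch below repeats the same steps on a hypothetical 8 x 3 matrix and blanks the chosen cells with NaN (rather than 0, as the example itself does later) so that they stand out when printed.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 8, 3  # hypothetical toy sizes
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))

# Row mask: False for rows left intact, True for rows that lose one value.
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                             np.ones(n_missing_samples, dtype=bool)))
rng.shuffle(missing_samples)

# One randomly chosen column per affected row.
missing_features = rng.randint(0, n_features, n_missing_samples)

x = np.arange(1.0, n_samples * n_features + 1).reshape(n_samples, n_features)
rows = np.where(missing_samples)[0]
x[rows, missing_features] = np.nan  # pairs up (row, column) index by index
print(x)
```

Because `rows` and `missing_features` have the same length, the indexed assignment touches exactly one cell in each affected row.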

# Baseline: drop every row that contains a missing value.
x_filtered = x_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, x_filtered, y_filtered).mean()
print("score without the samples containing missing values = %.2f" % score)

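Mark the missing entries with 0, then replace every entry marked 0 with the mean of its column. Since columns are features, each missing value is replaced with the mean of that feature. Note that any entry that is genuinely 0 in the original data is treated as missing as well. Finally, compute the score on the imputed dataset.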

x_missing = x_full.copy()
x_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
# Impute inside a pipeline so the column means are learned on each training fold.
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, x_missing, y_missing).mean()
print("score after imputation of the missing values = %.2f" % score)
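The column-mean fill itself can be written in a few lines of NumPy. This is a sketch on a hypothetical 4 x 2 array, not the library's implementation: zeros are treated as missing, each column mean is computed over the observed entries only, and the zeros are then overwritten with that mean.

```python
import numpy as np

# Hypothetical data; 0 marks a missing entry.
x = np.array([[1.0, 10.0],
              [0.0, 20.0],
              [3.0, 0.0],
              [5.0, 30.0]])

missing = (x == 0)

# Column means over the observed (non-missing) entries only.
col_sums = np.where(missing, 0.0, x).sum(axis=0)
col_counts = (~missing).sum(axis=0)
col_means = col_sums / col_counts

# Fill each missing cell with the mean of its own column.
x_imputed = np.where(missing, col_means, x)
print(x_imputed)
```

In the cross-validated pipeline, the means are computed from the training part of each fold and then applied to the held-out part, so the test data does not leak into the fill values.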

Let us compare the scores for the three cases:

score with the entire dataset = 0.56

score without the samples containing missing values = 0.48

score after imputation of the missing values = 0.57

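As the output shows, the score after imputation (0.57) is much closer to the score on the complete dataset (0.56) than the score obtained by dropping the incomplete rows (0.48).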
