Python Examples, Lesson 6: Multilabel Classification

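This example simulates a multilabel document classification problem. The dataset is generated randomly by the following process: for each document, draw the number of labels n from a Poisson distribution with mean n_labels; choose n classes c from a multinomial distribution; draw the document length k from a Poisson distribution; then choose k words w from a multinomial distribution that depends on the chosen classes.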

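In the process above, rejection sampling ensures that n never exceeds the number of classes (two in this example) and that the document length is never zero. Likewise, a class that has already been chosen for a document is rejected. Documents assigned to both classes are circled in two colors in the plots.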
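The sampling process can be sketched in a few lines of NumPy. This is only an illustrative re-implementation under stated assumptions, not the code used by scikit-learn: the function name make_docs, its defaults, the random class prior and word distributions, and the uniform word choice for unlabeled documents are all hypothetical.

```python
import numpy as np


def make_docs(n_samples=100, n_classes=2, n_labels=1, length=50,
              n_words=20, allow_unlabeled=True, seed=0):
    """Illustrative sampler; all names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    # Assumed: a random class prior and random per-class word distributions.
    class_probs = rng.random(n_classes)
    class_probs /= class_probs.sum()
    word_probs = rng.random((n_classes, n_words))
    word_probs /= word_probs.sum(axis=1, keepdims=True)

    X = np.zeros((n_samples, n_words), dtype=int)
    Y = np.zeros((n_samples, n_classes), dtype=int)
    for i in range(n_samples):
        # Number of labels: redraw while n is too large
        # (or zero, when unlabeled documents are not allowed).
        n = rng.poisson(n_labels)
        while n > n_classes or (n == 0 and not allow_unlabeled):
            n = rng.poisson(n_labels)

        # Document length: redraw until it is non-zero.
        k = rng.poisson(length)
        while k == 0:
            k = rng.poisson(length)

        if n == 0:
            # Assumed: unlabeled documents draw words uniformly at random.
            p_words = np.full(n_words, 1.0 / n_words)
        else:
            # Drawing without replacement stands in for rejecting
            # classes that were already chosen.
            labels = rng.choice(n_classes, size=n, replace=False,
                                p=class_probs)
            Y[i, labels] = 1
            p_words = word_probs[labels].mean(axis=0)

        words = rng.choice(n_words, size=k, p=p_words)
        X[i] = np.bincount(words, minlength=n_words)
    return X, Y


X, Y = make_docs()
print(X.shape, Y.shape)
```

Each row of X holds word counts for one document, and each row of Y is a 0/1 indicator vector over the classes, which is the same layout that make_multilabel_classification returns.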

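For visualization, the data are first projected onto two components, and sklearn.multiclass.OneVsRestClassifier is then used to learn one discriminative model for each of the two classes. Note that PCA performs unsupervised dimensionality reduction, whereas CCA (canonical correlation analysis) performs supervised dimensionality reduction. The classification results for the different cases are shown in the figure below.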

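Note: in the figure below, "unlabeled samples" does not mean that we do not know their labels; it means that the samples simply have no label.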
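To make the idea of a sample without any label concrete, the following sketch counts the rows of the label matrix that contain no 1 at all, once with allow_unlabeled=True and once with allow_unlabeled=False. The dictionary name counts is only illustrative.

```python
from sklearn.datasets import make_multilabel_classification

counts = {}
for allow in (True, False):
    _, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                          allow_unlabeled=allow,
                                          random_state=1)
    labels_per_sample = Y.sum(axis=1)
    counts[allow] = {
        "unlabeled": int((labels_per_sample == 0).sum()),   # no class
        "both": int((labels_per_sample == 2).sum()),        # both classes
    }
    print("allow_unlabeled =", allow, counts[allow])
```

With allow_unlabeled=False the count of unlabeled rows is zero by construction.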

First, import the required libraries.

print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

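The figure contains four subplots, and each one shows the separating hyperplanes learned by the classifiers. To draw them, define a helper function plot_hyperplane.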

def plot_hyperplane(clf, min_x, max_x, linestyle, label):
    # get the separating hyperplane
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(min_x - 5, max_x + 5)  # make sure the line is long enough
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plt.plot(xx, yy, linestyle, label=label)
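The slope and intercept used in plot_hyperplane come from solving w[0]*x + w[1]*y + b = 0 for y, which gives y = -(w[0]/w[1])*x - b/w[1]. As a quick sanity check, the sketch below fits a linear SVC on made-up two-dimensional toy data (an assumption for illustration only) and confirms that points on this line have a decision value of approximately zero.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumed): two Gaussian blobs separated along the diagonal.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)
w = clf.coef_[0]
b = clf.intercept_[0]

# Points on the line w[0]*x + w[1]*y + b = 0, solved for y.
xx = np.linspace(-5, 5, 5)
yy = -(w[0] / w[1]) * xx - b / w[1]

# The decision function should vanish on the boundary.
vals = clf.decision_function(np.column_stack([xx, yy]))
print(np.allclose(vals, 0.0, atol=1e-6))
```

The formula divides by w[1], so it only applies when the boundary is not vertical in the plotted plane.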

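Next, define a function plot_subfigure that draws one subplot: it projects the data, fits the classifier, and plots the samples together with the two decision boundaries.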

def plot_subfigure(X, Y, subplot, title, transform):
    # project the data onto two components
    if transform == "pca":
        X = PCA(n_components=2).fit_transform(X)
    elif transform == "cca":
        X = CCA(n_components=2).fit(X, Y).transform(X)
    else:
        raise ValueError("transform must be 'pca' or 'cca'")
    min_x = np.min(X[:, 0])
    max_x = np.max(X[:, 0])
    min_y = np.min(X[:, 1])
    max_y = np.max(X[:, 1])
    # one linear SVC per label
    classif = OneVsRestClassifier(SVC(kernel='linear'))
    classif.fit(X, Y)
    plt.subplot(2, 2, subplot)
    plt.title(title)
    # indices of the samples that carry each label
    zero_class = np.where(Y[:, 0])
    one_class = np.where(Y[:, 1])
    plt.scatter(X[:, 0], X[:, 1], s=40, c='gray', edgecolors=(0, 0, 0))
    plt.scatter(X[zero_class, 0], X[zero_class, 1], s=160, edgecolors='b',
                facecolors='none', linewidths=2, label='Class 1')
    plt.scatter(X[one_class, 0], X[one_class, 1], s=80, edgecolors='orange',
                facecolors='none', linewidths=2, label='Class 2')
    plot_hyperplane(classif.estimators_[0], min_x, max_x, 'k--',
                    'Boundary\nfor class 1')
    plot_hyperplane(classif.estimators_[1], min_x, max_x, 'k-.',
                    'Boundary\nfor class 2')
    plt.xticks(())
    plt.yticks(())
    plt.xlim(min_x - .5 * max_x, max_x + .5 * max_x)
    plt.ylim(min_y - .5 * max_y, max_y + .5 * max_y)
    if subplot == 2:
        plt.xlabel('First principal component')
        plt.ylabel('Second principal component')
        plt.legend(loc="upper left")
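The plotting code reads the fitted models through classif.estimators_, one entry per label. The sketch below, which skips the projection step and fits directly on the raw features, illustrates this: given a two-column indicator matrix, OneVsRestClassifier fits two binary SVCs and predicts an indicator matrix of the same shape.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=False,
                                      random_state=1)

clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, Y)

# One binary classifier per label column of Y.
n_estimators = len(clf.estimators_)

# Predictions come back as a 0/1 indicator matrix, one column per label.
Y_pred = clf.predict(X)
print(n_estimators, Y_pred.shape)
```

Because each label gets its own binary model, a sample can be predicted as belonging to both classes, to one, or to neither.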

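Finally, call make_multilabel_classification from sklearn.datasets to generate the data, once with and once without unlabeled samples, and draw one subplot for each combination of dataset and projection (CCA or PCA).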

plt.figure(figsize=(8, 6))
X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)
plot_subfigure(X, Y, 1, "With unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 2, "With unlabeled samples + PCA", "pca")
X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=False,
                                      random_state=1)
plot_subfigure(X, Y, 3, "Without unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 4, "Without unlabeled samples + PCA", "pca")
plt.subplots_adjust(.04, .02, .97, .94, .09, .2)
plt.show()
