Scikit-learn in Practice: Clustering with KMeans

In scikit-learn, clustering of unlabeled data is performed with the sklearn.cluster module.

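Each clustering algorithm comes in two variants. One is a class, which implements the fit method to learn the clusters from unlabeled training data. The other is a function which, given training data, returns an array of integer labels, one per sample, identifying the cluster each sample belongs to. For the class variant, the labels of the training data are available in the labels_ attribute.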
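As a minimal sketch of the two variants, the snippet below runs KMeans once through the class and once through the k_means function. The six toy points, n_clusters=2, n_init=10 and random_state=0 are illustrative assumptions, not values from the article.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Six toy points forming two obvious groups (illustrative data only).
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Class variant: fit() learns the clusters, labels_ holds the training labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)
class_labels = model.labels_

# Function variant: returns the centers, the integer labels and the inertia.
centers, func_labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)

print(class_labels)
print(func_labels)
```

The integer assigned to a cluster is arbitrary, so the two label arrays should be compared as partitions rather than value by value.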

1.1 Input data

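One important point is that the algorithms implemented in this module can accept different kinds of matrix as input. All of them accept standard data matrices of shape [n_samples, n_features].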
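To illustrate the [n_samples, n_features] layout, the sketch below fits KMeans on a dense NumPy array and on a SciPy CSR sparse matrix holding the same values. The synthetic data (40 samples, 3 features) and all parameter values are assumptions chosen for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# 40 samples with 3 features each: two well-separated groups of 20.
dense_X = np.vstack([rng.normal(0.0, 0.5, size=(20, 3)),
                     rng.normal(8.0, 0.5, size=(20, 3))])

# The same data stored as a sparse matrix; the shape is unchanged.
sparse_X = sparse.csr_matrix(dense_X)

labels_dense = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dense_X).labels_
labels_sparse = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sparse_X).labels_

print(dense_X.shape, sparse_X.shape)   # rows are samples, columns are features
print(labels_dense)
print(labels_sparse)
```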

1.2 How the clustering methods compare

Comparison of the clustering algorithms in scikit-learn

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
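The difference between flat and non-flat geometry can be seen on a small synthetic dataset. The sketch below compares KMeans and DBSCAN on two interleaved half-moons. The dataset, eps=0.2, min_samples=5 and the use of the adjusted Rand index as a score are all assumptions made for the illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a typical non-flat geometry.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# KMeans relies on distances to cluster centers and tends to cut the moons
# with a straight boundary.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through neighboring points, so it can follow each moon.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print("KMeans ARI:", round(kmeans_ari, 3))
print("DBSCAN ARI:", round(dbscan_ari, 3))
```

On data made of compact, roughly spherical blobs the ranking would typically be reversed or tied, which is why the use-case column matters when picking a method.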

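The KMeans algorithm clusters data by assigning the training samples to k groups of equal variance. Its core idea is to minimize the within-cluster sum of squares. The algorithm requires the number of clusters to be specified in advance. It scales well to large numbers of samples and has been used across a wide range of application areas.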
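Because the number of clusters has to be supplied, a common way to explore it is to fit the model for several values and watch the within-cluster sum of squares, which scikit-learn exposes as inertia_. The sketch below assumes synthetic data from make_blobs with three true groups; the data and parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=42)

inertias = {}
for k in (1, 2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances for this k.
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))
```

The inertia usually drops sharply until k reaches the number of real groups and flattens afterwards, which gives a rough hint for choosing k.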

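The KMeans algorithm divides a dataset X of n samples into a set C of k clusters, each described by the mean μ_j of the samples in the cluster. The means μ_j are commonly called the cluster centers. Note that they are generally not points from X, although they live in the same vector space. KMeans aims to choose the cluster centers that minimize the within-cluster sum of squares:

\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2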
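The quantity being minimized, the sum over all samples of the squared distance to the nearest cluster center, can be recomputed by hand and compared with the inertia_ value reported by a fitted model. This is a sketch on assumed synthetic data, not code from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers = model.cluster_centers_

# Squared Euclidean distance from every sample to every center: shape (200, 3).
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Keep the nearest center for each sample, then sum over all samples.
objective = sq_dist.min(axis=1).sum()
print(objective, model.inertia_)

# A center is an average of samples, so it usually matches no row of X.
center_is_sample = [bool((X == c).all(axis=1).any()) for c in centers]
print(center_is_sample)
```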

Here is a concrete code example:
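The listing itself is not reproduced in this copy of the article. As a stand-in, here is a minimal end-to-end sketch of a typical KMeans workflow: fit on training data, inspect the centers and cluster sizes, then assign new points. The blob centers, sample count and all parameter values are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic groups around hand-picked centers (illustrative only).
true_centers = [[-5, -5], [0, 5], [5, -5]]
X, _ = make_blobs(n_samples=300, centers=true_centers,
                  cluster_std=0.8, random_state=0)

# Fit KMeans; the number of clusters must be given up front.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(X)

print(model.cluster_centers_)        # learned centers, shape (3, 2)
print(np.bincount(model.labels_))    # number of training samples per cluster

# Assign previously unseen points to the nearest learned center.
new_points = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
new_labels = model.predict(new_points)
print(new_labels)
```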
