首页 > 代码库 > 聚类算法总结

聚类算法总结

最近要在spark上做一个聚类的项目,数据量和类的个数都比较大。KMeans效果尚可,但是有点慢,因而重新看了下常用的算法。


kmeans
     attention: init centers (randomize vs kmeans++)

mini-batch kmeans
     loops: random samples; find closest for all; update centers for each

mean shift
     init: get centers by bandwidth
     loops: find neighbours of centers; update centers; dedup

AP cluster
     init: get S; Rik=0, Aik=0
     loops: Rik = Sik - Max_k‘!=k(Aik‘ + Sik‘); Aik = min(0, Rkk + Sum_i‘!=i,k Max(0, Ri‘k)); Akk = Sum_i‘!=k Max(0, Ri’k)
     end: for any i, Max_k Rik + Aik as it’s exemplar

spectral clustering
     steps: similarity matrix S; S=UV; kmeans of U

Ward hierarchical
     init: each sample as center
     loops: merge to minimize RMSE within clusters

DBSCAN
     init: get densest core samples
     loops: get more core samples nearby old samples

其实scikit-learn实现了很多算法,也有现成的数据集可以做做实验。例如:http://scikit-learn.org/stable/modules/clustering.html 上有一些效果图,和算法扩展性的说明。

                                             A comparison of the clustering algorithms in scikit-learn



Method nameParametersScalabilityUsecaseGeometry (metric used)
K-Meansnumber of clustersVery large n_samples, medium n_clusterswith MiniBatch codeGeneral-purpose, even cluster size, flat geometry, not too many clustersDistances between points
Affinity propagationdamping, sample preferenceNot scalable with n_samplesMany clusters, uneven cluster size, non-flat geometryGraph distance (e.g. nearest-neighbor graph)
Mean-shiftbandwidthNot scalable withn_samplesMany clusters, uneven cluster size, non-flat geometryDistances between points
Spectral clusteringnumber of clustersMedium n_samples, small n_clustersFew clusters, even cluster size, non-flat geometryGraph distance (e.g. nearest-neighbor graph)
Ward hierarchical clusteringnumber of clustersLarge n_samples andn_clustersMany clusters, possibly connectivity constraintsDistances between points
Agglomerative clusteringnumber of clusters, linkage type, distanceLarge n_samples andn_clustersMany clusters, possibly connectivity constraints, non Euclidean distancesAny pairwise distance
DBSCANneighborhood sizeVery large n_samples, medium n_clustersNon-flat geometry, uneven cluster sizesDistances between nearest points
Gaussian mixturesmanyNot scalableFlat geometry, good for density estimationMahalanobis distances to centers

聚类算法总结