首页 > 代码库 > 聚类算法总结
聚类算法总结
最近要在spark上做一个聚类的项目,数据量和类的个数都比较大。KMeans效果尚可,但是有点慢,因而重新看了下常用的算法。
kmeans
attention: init centers (randomize vs kmeans++)
mini-batch kmeans
loops: random samples; find closest for all; update centers for each
mean shift
init: get centers by bandwidth
loops: find neighbours of centers; update centers; dedup
AP cluster
init: get S; Rik=0, Aik=0
loops: Rik = Sik - Max_k‘!=k(Aik‘ + Sik‘); Aik = min(0, Rkk + Sum_i‘!=i,k Max(0, Ri‘k)); Akk = Sum_i‘!=k Max(0, Ri’k)
end: for any i, Max_k Rik + Aik as it’s exemplar
spectral clustering
steps: similarity matrix S; S=UV; kmeans of U
Ward hierarchical
init: each sample as center
loops: merge to minimize RMSE within clusters
DBSCAN
init: get densest core samples
loops: get more core samples nearby old samples
其实scikit-learn实现了很多算法,也有现成的数据集可以做做实验。例如:http://scikit-learn.org/stable/modules/clustering.html 上有一些效果图,和算法扩展性的说明。
A comparison of the clustering algorithms in scikit-learn
Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
---|---|---|---|---|
K-Means | number of clusters | Very large n_samples, medium n_clusterswith MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Mean-shift | bandwidth | Not scalable withn_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Ward hierarchical clustering | number of clusters | Large n_samples andn_clusters | Many clusters, possibly connectivity constraints | Distances between points |
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples andn_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
聚类算法总结