聚类算法总结

首页 > 代码库 > 聚类算法总结

2024-07-22 08:33:52 221人阅读

最近要在spark上做一个聚类的项目，数据量和类的个数都比较大。KMeans效果尚可，但是有点慢，因而重新看了下常用的算法。

kmeans
attention: init centers (randomize vs kmeans++)

mini-batch kmeans
loops: random samples; find closest for all; update centers for each

mean shift
init: get centers by bandwidth
loops: find neighbours of centers; update centers; dedup

AP cluster
init: get S; Rik=0, Aik=0
loops: Rik = Sik - Max_k‘!=k(Aik‘ + Sik‘); Aik = min(0, Rkk + Sum_i‘!=i,k Max(0, Ri‘k)); Akk = Sum_i‘!=k Max(0, Ri’k)
end: for any i, Max_k Rik + Aik as it’s exemplar

spectral clustering
steps: similarity matrix S; S=UV; kmeans of U

Ward hierarchical
init: each sample as center
loops: merge to minimize RMSE within clusters

DBSCAN
init: get densest core samples
loops: get more core samples nearby old samples

其实scikit-learn实现了很多算法，也有现成的数据集可以做做实验。例如：http://scikit-learn.org/stable/modules/clustering.html 上有一些效果图，和算法扩展性的说明。

A comparison of the clustering algorithms in scikit-learn

Method name	Parameters	Scalability	Usecase	Geometry (metric used)
K-Means	number of clusters	Very large `n_samples`, medium `n_clusters`with MiniBatch code	General-purpose, even cluster size, flat geometry, not too many clusters	Distances between points
Affinity propagation	damping, sample preference	Not scalable with n_samples	Many clusters, uneven cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Mean-shift	bandwidth	Not scalable with`n_samples`	Many clusters, uneven cluster size, non-flat geometry	Distances between points
Spectral clustering	number of clusters	Medium `n_samples`, small `n_clusters`	Few clusters, even cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering	number of clusters	Large `n_samples` and`n_clusters`	Many clusters, possibly connectivity constraints	Distances between points
Agglomerative clustering	number of clusters, linkage type, distance	Large `n_samples` and`n_clusters`	Many clusters, possibly connectivity constraints, non Euclidean distances	Any pairwise distance
DBSCAN	neighborhood size	Very large `n_samples`, medium `n_clusters`	Non-flat geometry, uneven cluster sizes	Distances between nearest points
Gaussian mixtures	many	Not scalable	Flat geometry, good for density estimation	Mahalanobis distances to centers

聚类算法总结

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

首页 > 代码库 > 聚类算法总结

聚类算法总结

看完仍有疑问？有类似问题直接问程序猿