k-平均演算法

k-平均演算法（英文：k-means clustering）源於訊號處理中的一種向量量化方法，現在則更多地作為一種聚類分析方法流行於數據探勘領域。k-平均聚類的目的是：把 $n$ 個點（可以是樣本的一次觀察或一個實例）劃分到k個聚類中，使得每個點都屬於離他最近的均值（此即聚類中心）對應的聚類，以之作為聚類的標準。這個問題將歸結為一個把數據空間劃分為Voronoi cells的問題。

這個問題在計算上是NP困難的，不過存在高效的啟發式演算法。一般情況下，都使用效率比較高的啟發式演算法，它們能夠快速收斂於一個局部最佳解。這些演算法通常類似於通過迭代最佳化方法處理高斯混合分佈的最大期望值演算法（EM演算法）。而且，它們都使用聚類中心來為數據建模；然而k-平均聚類傾向於在可比較的空間範圍內尋找聚類，期望值-最大化技術卻允許聚類有不同的形狀。

k-平均聚類與k-近鄰之間沒有任何關係（後者是另一流行的機器學習技術）。

描述

已知觀測集 $(x_{1},x_{2},...,x_{n})$ ，其中每個觀測都是一個 $d$ -維實向量，k-平均聚類要把這 $n$ 個觀測劃分到k個集合中(k≤n),使得組內平方和（WCSS within-cluster sum of squares）最小。換句話說，它的目標是找到使得下式滿足的聚類 $S_{i}$ ，

{\underset {\mathbf {S} }{\operatorname {arg\,min} }}\sum _{i=1}^{k}\sum _{\mathbf {x} \in S_{i}}\left\|\mathbf {x} -{\boldsymbol {\mu }}_{i}\right\|^{2}

其中 $\mu _{i}$ 是 $S_{i}$ 中所有點的均值。

歷史源流

雖然其思想能夠追溯到1957年的胡戈·施泰因豪斯（英語：Hugo Steinhaus） ^[1] ，術語「k-平均」於1967年才被詹姆斯·麥昆（James MacQueen） ^[2] 首次使用。標準演算法則是在1957年被史都華·勞埃德（Stuart Lloyd）作為一種脈衝碼調製的技術所提出，但直到1982年才被貝爾實驗室公開出版 ^[3] 。在1965年，E·W·弗吉（E. W. Forgy）發表了本質上相同的方法，所以這一演算法有時被稱為勞埃德-弗吉方法。更高效的版本則被J·A·哈蒂根（J. A. Hartigan）和M·A·王（M. A. Wong）提出（1975/1979） ^[4]^[5]^[6]。

演算法

標準演算法

最常用的演算法使用了迭代最佳化的技術。它被稱為k-平均演算法而廣為使用，有時也被稱為Lloyd演算法（尤其在電腦科學領域）。已知初始的k個均值點 $m_{1}^{(t)},...,m_{k}^{(t)}$ ,演算法的按照下面兩個步驟交替進行 ^[7] ：

分配(Assignment)：將每個觀測分配到聚類中，使得組內平方和（WCSS）達到最小。

因為這一平方和就是平方後的歐氏距離，所以很直觀地把觀測分配到離它最近的均值點即可 ^[8] 。（數學上，這意味依照由這些均值點生成的Voronoi圖來劃分上述觀測）。

S_{i}^{(t)}=\left\{x_{p}:\left\|x_{p}-m_{i}^{(t)}\right\|^{2}\leq \left\|x_{p}-m_{j}^{(t)}\right\|^{2}\forall j,1\leq j\leq k\right\}

其中每個 $x_{p}$ 都只被分配到一個確定的聚類 $S^{t}$ 中，儘管在理論上它可能被分配到2個或者更多的聚類。

更新(Update)：對於上一步得到的每一個聚類，以聚類中觀測值的圖心，作為新的均值點。

m_{i}^{(t+1)}={\frac {1}{\left|S_{i}^{(t)}\right|}}\sum _{x_{j}\in S_{i}^{(t)}}x_{j}

因為算術平均是最小平方估計，所以這一步同樣減小了目標函數組內平方和（WCSS）的值。

這一演算法將在對於觀測的分配不再變化時收斂。由於交替進行的兩個步驟都會減小目標函數WCSS的值，並且分配方案只有有限種，所以演算法一定會收斂於某一（局部）最佳解。注意：使用這一演算法無法保證得到全域最佳解。

這一演算法經常被描述為「把觀測按照距離分配到最近的聚類」。標準演算法的目標函數是組內平方和（WCSS），而且按照「最小平方和」來分配觀測，確實是等價於按照最小歐氏距離來分配觀測的。如果使用不同的距離函數來代替（平方）歐氏距離，可能使得演算法無法收斂。然而，使用不同的距離函數，也能得到k-平均聚類的其他變體，如球體k-平均演算法和k-中心點演算法。

初始化方法

通常使用的初始化方法有Forgy和隨機劃分(Random Partition)方法 ^[9] 。Forgy方法隨機地從數據集中選擇k個觀測作為初始的均值點；而隨機劃分方法則隨機地為每一觀測指定聚類，然後執行「更新(Update)」步驟,即計算隨機分配的各聚類的圖心，作為初始的均值點。Forgy方法易於使得初始均值點散開，隨機劃分方法則把均值點都放到靠近數據集中心的地方。參考Hamerly et al的文章 ^[9] ，可知隨機劃分方法一般更適用於k-調和均值和模糊k-平均演算法。對於期望值-最大化(EM)演算法和標準k-平均演算法，Forgy方法作為初始化方法的表現會更好一些。

這是一個啟發式演算法，無法保證收斂到全域最佳解，並且聚類的結果會依賴於初始的聚類。又因為演算法的執行速度通常很快，所以一般都以不同的起始狀態執行多次來得到更好的結果。不過，在最差的情況下，k-平均演算法會收斂地特別慢：尤其是已經證明了存在這一的點集（甚至在2維空間中），使得k-平均演算法收斂的時間達到指數級（ $2^{\Omega (n)}$ ） ^[10] 。好在在現實中，這樣的點集幾乎不會出現：因為k-平均演算法的平滑執行時間是多項式時間的 ^[11] 。

註：把「分配」步驟視為「期望值」步驟，把「更新」步驟視為「最大化步驟」，可以看到，這一演算法實際上是廣義期望值-最大化演算法（GEM）的一個變體。

複雜度

在 $d$ 維空間中找到k-平均聚類問題的最佳解的計算複雜度：

NP-hard：一般歐式空間中，即使目標聚類數僅為2^[12]^[13]
NP困難：平面中，不對聚類數目k作限制^[14]
如果k和 $d$ 都是固定的，時間複雜度為 $O(n^{dk+1}logn)$ ,其中 $n$ 為待聚類的觀測點數目^[15]

相比之下，Lloyds演算法的執行時間通常為 $O(nkdi)$ ,k和 $d$ 定義如上， $i$ 為直到收斂時的迭代次數。如果數據本身就有一定的聚類結構，那麼收斂所需的迭代數目通常是很少的，並且進行少數迭代之後，再進行迭代的話，對於結果的改善效果很小。鑑於上述原因，Lloyds演算法在實踐中通常被認為幾乎是線性複雜度的。

下面有幾個關於這一演算法複雜度的近期研究：

Lloyd's k-平均演算法具有多項式平滑執行時間。對於落在空間 $[0,1]^{d}$ 任意的 $n$ 點集合，如果每一個點都獨立地受一個均值為 $0$ ，標準差為 $\sigma$ 的正態分佈所影響，那麼k-平均演算法的期望值執行時間上界為 $O(n^{34}k^{34}d^{8}log^{4}(n)/\sigma ^{6})$ ，即對於 $n,k,i,d$ 和 $1/\sigma$ 都是多項式時間的^[11]。
在更簡單的情況下，有更好的上界。例如^[16]，在整數網格 $\left\{1,...,M\right\}^{d}$ 中，k-平均演算法執行時間的上界為 $O(dn^{4}M^{2})$ 。

演算法的變體

與其他統計機器學習方法的關係

k-平均聚類，以及它與EM演算法的聯絡，是高斯混合模型的一個特例。很容易能把k-平均問題一般化為高斯混合模型^[19]。另一個k-平均演算法的推廣則是k-SVD演算法，後者把數據點視為「編碼本向量」的稀疏線性組合。而k-平均對應於使用單編碼本向量的特殊情形（其權重為1）^[20]。

Mean Shift 聚類

基本的Mean Shift聚類要維護一個與輸入數據集規模大小相同的數據點集。初始時，這一集合就是輸入集的副本。然後對於每一個點，用一定距離範圍內的所有點的均值來迭代地替換它。與之對比，k-平均把這樣的迭代更新限制在（通常比輸入數據集小得多的）K個點上，而更新這些點時，則利用了輸入集中與之相近的所有點的均值（亦即，在每個點的Voronoi劃分內）。還有一種與k-平均類似的Mean shift演算法，即似然Mean shift，對於迭代變化的集合，用一定距離內在輸入集中所有點的均值來更新集合里的點^[21]。Mean Shift聚類與k-平均聚類相比，有一個優點就是不用指定聚類數目，因為Mean shift傾向於找到儘可能少的聚類數目。然而，Mean shift會比k-平均慢得多，並且同樣需要選擇一個「寬度」參數。和k-平均一樣，Mean shift演算法有許多變體。

主成分分析（PCA）

有一些研究^[22]^[23]表明，k-平均的放鬆形式解（由聚類指示向量表示），可由主成分分析中的主成分給出，並且主成分分析由主方向張成的子空間與聚類圖心空間是等價的。不過，主成分分析是k-平均聚類的有效放鬆形式並不是一個新的結果(如，見^[24])，並且還有的研究結果直接揭示了關於「聚類圖心子空間是由主成分方向張成的」這一論述的反例^[25] 。

獨立成分分析(ICA)

有研究表明^[26]，在稀疏假設以及輸入數據經過白化的預處理後，k-平均得到的解就是獨立成分分析的解。這一結果對於解釋k-平均在特徵學習方面的成功應用很有幫助。

雙向過濾

k-平均演算法隱含地假設輸入數據的順序不影響結果。雙向過濾與k-平均演算法和Mean shift演算法類似之處在於它同樣維護着一個迭代更新的數據集（亦是被均值更新）。然而，雙向過濾限制了均值的計算只包含了在輸入數據中順序相近的點^[21]，這使得雙向過濾能夠被應用在圖像去噪等數據點的空間安排是非常重要的問題中。

相似問題

目標函數是使得聚類平方誤差最小化的演算法還有k-中心點演算法，該方法保持聚類的中心在一個真實數據點上，亦即使用中心而非圖心作為均值點。

參考資料

[1]
Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1957, 4 (12): 801–804. MR 0090073. Zbl 0079.16403 （法語）.
[2]
MacQueen, J. B. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1. University of California Press: 281–297. 1967 [2009-04-07]. MR 0214227. Zbl 0214.46201. （原始內容存檔於2019-10-29）.
[3]
Lloyd, S. P. Least square quantization in PCM. Bell Telephone Laboratories Paper. 1957. Published in journal much later: Lloyd., S. P. Least squares quantization in PCM (PDF). IEEE Transactions on Information Theory. 1982, 28 (2): 129–137 [2009-04-15]. doi:10.1109/TIT.1982.1056489. （原始內容存檔 (PDF)於2009-06-17）.
[4]
E.W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965, 21: 768–769.
[5]
J.A. Hartigan. Clustering algorithms. John Wiley & Sons, Inc. 1975.
[6]
Hartigan, J. A.; Wong, M. A. Algorithm AS 136: A k-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C. 1979, 28 (1): 100–108. JSTOR 2346830.
[7]
MacKay, David. Chapter 20. An Example Inference Task: Clustering (PDF). Information Theory, Inference and Learning Algorithms. Cambridge University Press. 2003: 284–292 [2015-04-02]. ISBN 0-521-64298-1. MR 2012999. （原始內容存檔於2016-02-17）.
[8]
Since the square root is a monotone function, this also is the minimum Euclidean distance assignment.
[9]
Hamerly, G. and Elkan, C. Alternatives to the k-means algorithm that find better clusterings (PDF). Proceedings of the eleventh international conference on Information and knowledge management (CIKM). 2002 [2015-04-02]. （原始內容 (PDF)存檔於2012-08-05）.
[10]
Vattani., A. k-means requires exponentially many iterations even in the plane (PDF). Discrete and Computational Geometry. 2011, 45 (4): 596–616 [2015-04-02]. doi:10.1007/s00454-011-9340-1. （原始內容存檔 (PDF)於2011-10-08）.
[11]
Arthur, D.; Manthey, B.; Roeglin, H. k-means has polynomial smoothed complexity. Proceedings of the 50th Symposium on Foundations of Computer Science (FOCS). 2009.
[12]
Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning. 2009, 75: 245–249. doi:10.1007/s10994-009-5103-0.
[13]
Dasgupta, S. and Freund, Y. Random Projection Trees for Vector Quantization. Information Theory, IEEE Transactions on. July 2009, 55: 3229–3242. arXiv:0805.1390 . doi:10.1109/TIT.2009.2021326.
[14]
Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The Planar k-Means Problem is NP-Hard. Lecture Notes in Computer Science. 2009, 5431: 274–285. doi:10.1007/978-3-642-00202-1_24.
[15]
Inaba, M.; Katoh, N.; Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. Proceedings of 10th ACM Symposium on Computational Geometry: 332–339. 1994. doi:10.1145/177424.178042.
[16]
Arthur; Abhishek Bhowmick. A theoretical analysis of Lloyd's algorithm for k-means clustering (學位論文). 2009.[1]^{[失效連結]}
[17]
Honarkhah, M and Caers, J, 2010, Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling, Mathematical Geosciences, 42: 487 - 517
[18]
Coates, Adam; Ng, Andrew Y. Learning feature representations with k-means (PDF). G. Montavon, G. B. Orr, K.-R. Müller (編). Neural Networks: Tricks of the Trade. Springer. 2012 [2015-04-02]. （原始內容 (PDF)存檔於2013-07-06）.
[19]
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP. Section 16.1. Gaussian Mixture Models and k-Means Clustering. Numerical Recipes: The Art of Scientific Computing 3rd. New York: Cambridge University Press. 2007 [2015-04-02]. ISBN 978-0-521-88068-8. （原始內容存檔於2011-08-11）.
[20]
Template:Cite Journal
[21]
Little, M.A.; Jones, N.S. Generalized Methods and Solvers for Piecewise Constant Signals: Part I (PDF). Proceedings of the Royal Society A. 2011 [2015-04-02]. （原始內容存檔 (PDF)於2019-08-19）.
[22]
H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. Spectral Relaxation for k-means Clustering (PDF). Neural Information Processing Systems vol.14 (NIPS 2001) (Vancouver, Canada). Dec 2001: 1057–1064 [2015-04-02]. （原始內容存檔 (PDF)於2020-09-30）.
[23]
Chris Ding and Xiaofeng He. k-means Clustering via Principal Component Analysis (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2004). July 2004: 225–232 [2015-04-02]. （原始內容存檔 (PDF)於2020-09-30）.
[24]
Drineas, P.; A. Frieze; R. Kannan; S. Vempala; V. Vinay. Clustering large graphs via the singular value decomposition (PDF). Machine learning. 2004, 56: 9–33 [2012-08-02]. doi:10.1023/b:mach.0000033113.59016.96. （原始內容存檔 (PDF)於2020-10-24）.
[25]
Cohen, M.; S. Elder; C. Musco; C. Musco; M. Persu. Dimensionality reduction for k-means clustering and low rank approximation (Appendix B). ArXiv. 2014 [2014-11-29]. （原始內容存檔於2020-08-06）.
[26]
Alon Vinnikov and Shai Shalev-Shwartz. k-means Recovers ICA Filters when Independent Components are Sparse (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2014). 2014 [2015-04-02]. （原始內容存檔 (PDF)於2020-11-26）.

外部連結

[1] [1]
Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1957, 4 (12): 801–804. MR 0090073. Zbl 0079.16403 （法語）.

[macqueen1967-2] [2]
MacQueen, J. B. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1. University of California Press: 281–297. 1967 [2009-04-07]. MR 0214227. Zbl 0214.46201. （原始內容存檔於2019-10-29）.

[lloyd1957-3] [3]
Lloyd, S. P. Least square quantization in PCM. Bell Telephone Laboratories Paper. 1957. Published in journal much later: Lloyd., S. P. Least squares quantization in PCM (PDF). IEEE Transactions on Information Theory. 1982, 28 (2): 129–137 [2009-04-15]. doi:10.1109/TIT.1982.1056489. （原始內容存檔 (PDF)於2009-06-17）.

[forgy65-4] [4]
E.W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965, 21: 768–769.

[hartigan1975-5] [5]
J.A. Hartigan. Clustering algorithms. John Wiley & Sons, Inc. 1975.

[hartigan1979-6] [6]
Hartigan, J. A.; Wong, M. A. Algorithm AS 136: A k-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C. 1979, 28 (1): 100–108. JSTOR 2346830.

[7] [7]
MacKay, David. Chapter 20. An Example Inference Task: Clustering (PDF). Information Theory, Inference and Learning Algorithms. Cambridge University Press. 2003: 284–292 [2015-04-02]. ISBN 0-521-64298-1. MR 2012999. （原始內容存檔於2016-02-17）.

[8] [8]
Since the square root is a monotone function, this also is the minimum Euclidean distance assignment.

[hamerly-9] [9]
Hamerly, G. and Elkan, C. Alternatives to the k-means algorithm that find better clusterings (PDF). Proceedings of the eleventh international conference on Information and knowledge management (CIKM). 2002 [2015-04-02]. （原始內容 (PDF)存檔於2012-08-05）.

[10] [10]
Vattani., A. k-means requires exponentially many iterations even in the plane (PDF). Discrete and Computational Geometry. 2011, 45 (4): 596–616 [2015-04-02]. doi:10.1007/s00454-011-9340-1. （原始內容存檔 (PDF)於2011-10-08）.

[Arthur,_D.;_Manthey,_B.;_Roeglin,_H._2009-11] [11]
Arthur, D.; Manthey, B.; Roeglin, H. k-means has polynomial smoothed complexity. Proceedings of the 50th Symposium on Foundations of Computer Science (FOCS). 2009.

[12] [12]
Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning. 2009, 75: 245–249. doi:10.1007/s10994-009-5103-0.

[13] [13]
Dasgupta, S. and Freund, Y. Random Projection Trees for Vector Quantization. Information Theory, IEEE Transactions on. July 2009, 55: 3229–3242. arXiv:0805.1390 . doi:10.1109/TIT.2009.2021326.

[14] [14]
Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The Planar k-Means Problem is NP-Hard. Lecture Notes in Computer Science. 2009, 5431: 274–285. doi:10.1007/978-3-642-00202-1_24.

[15] [15]
Inaba, M.; Katoh, N.; Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. Proceedings of 10th ACM Symposium on Computational Geometry: 332–339. 1994. doi:10.1145/177424.178042.

[16] [16]
Arthur; Abhishek Bhowmick. A theoretical analysis of Lloyd's algorithm for k-means clustering (學位論文). 2009.[1]^{[失效連結]}

[17] [17]
Honarkhah, M and Caers, J, 2010, Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling, Mathematical Geosciences, 42: 487 - 517

[Coates2012-18] [18]
Coates, Adam; Ng, Andrew Y. Learning feature representations with k-means (PDF). G. Montavon, G. B. Orr, K.-R. Müller (編). Neural Networks: Tricks of the Trade. Springer. 2012 [2015-04-02]. （原始內容 (PDF)存檔於2013-07-06）.

[19] [19]
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP. Section 16.1. Gaussian Mixture Models and k-Means Clustering. Numerical Recipes: The Art of Scientific Computing 3rd. New York: Cambridge University Press. 2007 [2015-04-02]. ISBN 978-0-521-88068-8. （原始內容存檔於2011-08-11）.

[20] [20]
Template:Cite Journal

[Little2011-21] [21]
Little, M.A.; Jones, N.S. Generalized Methods and Solvers for Piecewise Constant Signals: Part I (PDF). Proceedings of the Royal Society A. 2011 [2015-04-02]. （原始內容存檔 (PDF)於2019-08-19）.

[22] [22]
H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. Spectral Relaxation for k-means Clustering (PDF). Neural Information Processing Systems vol.14 (NIPS 2001) (Vancouver, Canada). Dec 2001: 1057–1064 [2015-04-02]. （原始內容存檔 (PDF)於2020-09-30）.

[23] [23]
Chris Ding and Xiaofeng He. k-means Clustering via Principal Component Analysis (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2004). July 2004: 225–232 [2015-04-02]. （原始內容存檔 (PDF)於2020-09-30）.

[24] [24]
Drineas, P.; A. Frieze; R. Kannan; S. Vempala; V. Vinay. Clustering large graphs via the singular value decomposition (PDF). Machine learning. 2004, 56: 9–33 [2012-08-02]. doi:10.1023/b:mach.0000033113.59016.96. （原始內容存檔 (PDF)於2020-10-24）.

[25] [25]
Cohen, M.; S. Elder; C. Musco; C. Musco; M. Persu. Dimensionality reduction for k-means clustering and low rank approximation (Appendix B). ArXiv. 2014 [2014-11-29]. （原始內容存檔於2020-08-06）.

[26] [26]
Alon Vinnikov and Shai Shalev-Shwartz. k-means Recovers ICA Filters when Independent Components are Sparse (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2014). 2014 [2015-04-02]. （原始內容存檔 (PDF)於2020-11-26）.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

描述

歷史源流

演算法

標準演算法

初始化方法

複雜度

演算法的變體

更多的討論

相關應用

向量的量化

聚類分析

特徵學習