A K-Neighborhood Weighted Spectral Clustering Algorithm Based on an Enlarged Eigengap: K Neighbor Relationship for Spectral Clustering with Larger Eigengap

Computer Science and Application
Vol. 10  No. 02 ( 2020 ), Article ID: 34245 , 14 pages
10.12677/CSA.2020.102030

K Neighbor Relationship for Spectral Clustering with Larger Eigengap

Wenmin Tian, Xuan Wang, Chunlin Tian

Shaanxi Normal University (SNNU), Xi’an Shaanxi

Received: Feb. 1st, 2020; accepted: Feb. 13th, 2020; published: Feb. 20th, 2020

ABSTRACT

Clustering is currently one of the most active topics in unsupervised learning, and spectral clustering is a popular clustering method that often shows good performance. A crucial step of spectral clustering is constructing the similarity matrix. However, traditional models do not account well for the distribution structure of the data set, so it is difficult for them to truly reflect the similarity between data points. Inspired by the way density clustering finds nearest-neighbor relationships among data points, we obtain an optimized similarity matrix. In this paper, we propose a novel construction method for the similarity matrix that uses a graph with a k-neighbor relationship (KNRS); because this model considers the k-neighborhood distribution of the data, the eigengap becomes larger. We also compare our method with several others on common benchmark data sets, and the experiments show the superiority of our model.

Keywords: Spectral Clustering, Eigengap, Similarity Matrix, Euclidean Distance

Copyright © 2020 by author(s) and Hans Publishers Inc.

1. Introduction

2. The Spectral Clustering Algorithm and Its Analysis

2.1. Spectral Clustering

2.2. The Similarity Matrix

2.3. The Eigengap

3. The KNRS Algorithm and Its Proofs

3.1. Definitions

$d(i,j)=\sqrt{(x_{i1}-x_{j1})^{2}+(x_{i2}-x_{j2})^{2}+\cdots+(x_{in}-x_{jn})^{2}}$ (1)

$A(i,j)=\begin{cases}1, & \text{if the distance to the }j\text{th point}\le\text{the distance to the }k\text{th nearest point}\\ 0, & \text{otherwise}\end{cases}$ (2)

Here k is the parameter of KNRS; it is used to find the k nearest neighbors of each point.
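Eqs. (1)-(2) can be sketched as follows. This is a minimal illustration; the function name and the choice to exclude each point from its own neighbor list are our assumptions, not details from the paper.

```python
import numpy as np

def knn_adjacency(X, k):
    """0/1 k-nearest-neighbor matrix A of Eq. (2), using the
    Euclidean distance of Eq. (1). Each point's own index is
    excluded from its neighbor list (an assumption)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # Eq. (1)
    n = len(X)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = [j for j in np.argsort(dist[i]) if j != i]
        A[i, order[:k]] = 1                    # Eq. (2)
    return A

# Three nearby points and one distant outlier.
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 10.0]])
A = knn_adjacency(X, k=2)
```

Note that A is in general asymmetric: the outlier's nearest neighbors are in the cluster, but no cluster point chooses the outlier.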

$\rho =\mathrm{exp}\left(-x+\mathrm{log}\left(1-\alpha \right)\right)$ (3)

$x=\left({\sum }_{i=1}^{n}A\right)/k$ (4)
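Eqs. (3)-(4) can be read as a per-point density weight. A hedged sketch follows, assuming the sum in Eq. (4) is a column sum of A (i.e., how often a point appears in other points' k-neighbor lists) and that α < 1; both are our reading of the formulas, not verified details.

```python
import numpy as np

def density_weight(A, k, alpha=0.5):
    """Density weight of Eqs. (3)-(4). Assumes the sum in Eq. (4)
    is a column sum of A (reverse-neighbor count) and alpha < 1."""
    x = A.sum(axis=0) / k                      # Eq. (4)
    return np.exp(-x + np.log(1.0 - alpha))    # Eq. (3), i.e. (1-alpha)*exp(-x)

# Two mutual nearest neighbors, k = 1.
rho = density_weight(np.array([[0, 1], [1, 0]]), k=1, alpha=0.5)
```

Since exp(−x + log(1−α)) = (1−α)·exp(−x), the weight decays as a point is chosen as a neighbor more often, i.e. as its local density rises.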

$\delta=\min\left\{|\lambda-s| : \lambda\ \text{an eigenvalue of}\ A,\ \lambda\notin S_{1},\ s\in S_{1}\right\}$ (5)

The distance $d\left({V}_{1},{\stackrel{˜}{V}}_{1}\right)$ between ${V}_{1}$ and ${\stackrel{˜}{V}}_{1}$ satisfies:

$d\left({V}_{1},{\stackrel{˜}{V}}_{1}\right)\le \frac{‖H‖}{\delta }$ (6)

3.2. The KNRS Algorithm

The KNRS method consists of five stages; the overall workflow is shown in Figure 1.

$L=D-W$ (7)

Figure 1. KNRS algorithm flowchart

${L}_{sym}={D}^{-1/2}L{D}^{-1/2}=I-{D}^{-1/2}W{D}^{-1/2}$ (8)

${L}_{rw}={D}^{-1}L=I-{D}^{-1}W$. (9)
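The three Laplacians of Eqs. (7)-(9) can be computed directly, assuming a symmetric weight matrix W with no isolated vertices (so that D is invertible):

```python
import numpy as np

def laplacians(W):
    """Unnormalized, symmetric, and random-walk Laplacians,
    Eqs. (7)-(9), for symmetric W with no isolated vertices."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                          # Eq. (7)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = d_inv_sqrt @ L @ d_inv_sqrt         # Eq. (8)
    L_rw = np.diag(1.0 / d) @ L                 # Eq. (9)
    return L, L_sym, L_rw

# A small star graph on three vertices.
W = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
L, L_sym, L_rw = laplacians(W)
```

All three matrices annihilate the appropriate constant-like vector: L and L_rw have zero row sums, and L_sym has eigenvalue 0 with eigenvector D^{1/2}·1.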

$f^{\mathrm{T}}Lf=\frac{1}{2}\sum_{i,j=1}^{n}w_{ij}\left(f_{i}-f_{j}\right)^{2}$ (10)

$\begin{aligned}f^{\mathrm{T}}L_{sym}f&=f^{\mathrm{T}}\left(I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}\right)f\\&=\sum_{i=1}^{n}f_{i}^{2}-\sum_{i,j=1}^{n}f_{i}f_{j}\frac{w_{ij}}{\sqrt{d_{i}d_{j}}}\\&=\frac{1}{2}\left(\sum_{i=1}^{n}f_{i}^{2}+\sum_{j=1}^{n}f_{j}^{2}-2\sum_{i,j=1}^{n}f_{i}f_{j}\frac{w_{ij}}{\sqrt{d_{i}d_{j}}}\right)\\&=\frac{1}{2}\sum_{i,j=1}^{n}w_{ij}\left(\frac{f_{i}}{\sqrt{d_{i}}}-\frac{f_{j}}{\sqrt{d_{j}}}\right)^{2}\end{aligned}$ (11)
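The quadratic-form identity in Eq. (11) is easy to verify numerically on a random symmetric weight matrix:

```python
import numpy as np

# Verify Eq. (11): f^T L_sym f == (1/2) sum_ij w_ij (f_i/sqrt(d_i) - f_j/sqrt(d_j))^2.
rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2.0          # symmetric positive weights
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
L_sym = np.eye(n) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
f = rng.standard_normal(n)

lhs = f @ L_sym @ f
g = f / np.sqrt(d)
rhs = 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)
```

Because the right-hand side is a sum of squares, this also shows that L_sym is positive semidefinite.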

1. Step 1: construct the complete connection graph, select the k nearest points, and obtain the k-nearest-neighbor matrix W.

2. Step 2: find the k-neighbor relationship.

3. Step 3: obtain the new similarity matrix W.

4. Step 4: obtain the Laplacian matrix L.

5. Step 5: obtain the final clustering result using, for example, the K-means algorithm.
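The five steps above can be sketched end to end. The shared-neighbor weighting used for steps 2-3 below is an illustrative stand-in for the paper's exact KNRS re-weighting rule, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def knrs_sketch(X, k, n_clusters):
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Step 1: complete graph -> 0/1 k-nearest-neighbor matrix (self excluded).
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d2[i])[1:k + 1]] = 1.0
    # Steps 2-3: k-neighbor relationship -> new similarity matrix W.
    # Here: count of shared neighbors, an assumed proxy for the KNRS weights.
    S = A @ A.T
    W = (S + S.T) / 2.0
    np.fill_diagonal(W, 0.0)
    # Step 4: symmetric normalized Laplacian.
    deg = np.maximum(W.sum(axis=1), 1e-12)
    L_sym = np.eye(n) - np.diag(deg ** -0.5) @ W @ np.diag(deg ** -0.5)
    # Step 5: K-means on the row-normalized bottom eigenvectors.
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :n_clusters]
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(U)

# Two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = knrs_sketch(X, k=2, n_clusters=2)
```

On this toy input the similarity graph splits into two connected components, so the bottom eigenvectors are component indicators and K-means recovers the two groups exactly.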

3.3. Proofs Related to the KNRS Algorithm

$f(D-W)\ge f(e_{\min}I-W)\ \sim\ e_{\min}-\lambda\ge e_{\min}-\lambda_{\max}\ge e_{\min}-\min\left(\|W\|_{\infty},\|W\|_{1}\right)$ (12)

${\mathrm{max}}_{1\le i\le n}|{\lambda }_{i}|\le {‖A‖}_{1}={\mathrm{max}}_{1\le j\le n}{\sum }_{i=1}^{n}|{a}_{ij}|$ (13)

${\mathrm{max}}_{1\le i\le n}|{\lambda }_{i}|\le {‖A‖}_{\infty }={\mathrm{max}}_{1\le i\le n}{\sum }_{j=1}^{n}|{a}_{ij}|$ (14)
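Eqs. (13)-(14) bound the spectral radius by the induced 1-norm (maximum column sum) and ∞-norm (maximum row sum). A quick numerical check on a random matrix:

```python
import numpy as np

# Check Eqs. (13)-(14): every eigenvalue of A satisfies
# |lambda| <= ||A||_1 (max column sum) and |lambda| <= ||A||_inf (max row sum).
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))

spectral_radius = np.abs(np.linalg.eigvals(A)).max()
norm_1 = np.abs(A).sum(axis=0).max()     # ||A||_1
norm_inf = np.abs(A).sum(axis=1).max()   # ||A||_inf
```

The bound holds because the spectral radius never exceeds any induced operator norm.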

$f\left(D-W\right)\le f\left({e}_{\mathrm{max}}I-W\right)~{e}_{\mathrm{max}}-\lambda \le {e}_{\mathrm{max}}-{\lambda }_{\mathrm{min}}\le \mathrm{min}\left({‖W‖}_{\infty },{‖W‖}_{1}\right)-{\lambda }_{\mathrm{min}}$ (15)

$\begin{aligned}\Delta&=\sup(L)-\operatorname{sub}(L)\\&=\left(\min\left(\|W\|_{\infty},\|W\|_{1}\right)-\lambda_{\min}\right)-\left(e_{\min}-\min\left(\|W\|_{\infty},\|W\|_{1}\right)\right)\\&=-\lambda_{\min}-e_{\min}+2\min\left(\|W\|_{\infty},\|W\|_{1}\right)\\&=\min\left(\|W\|_{\infty},\|W\|_{1}\right)-\lambda_{\min}\quad(\text{for a symmetric matrix }W)\\&\le\lambda_{\max}-\lambda_{\min}\le\lambda_{\max}\end{aligned}$ (16)

$\begin{array}{c}\Delta =sup\left(L\right)-sub\left(L\right)\\ \le {\lambda }_{\mathrm{max}}\\ \le ‖L+{H}^{\prime }‖\\ \le ‖L‖+‖{H}^{\prime }‖\end{array}$ (17)

4. Experiments

4.1. Datasets

The Iris data set contains 3 classes with 50 instances each and 4 attributes, where each class represents a type of iris plant. The Wine data set records the quantities of 13 constituents found in each of 178 wine samples from three cultivars. The Soybean data set has 47 samples, 35 attributes, and 7 classes. The Seeds data set has 210 samples, 7 attributes, and 3 classes. The Two-circles artificial data set has 420 samples arranged in two circles, with three attributes and two classes. The Boat data set contains 100 samples, 2 attributes, and 3 classes. These two artificial data sets contain two kinds of samples: a dense region of medium density and a sparse surrounding region. The MNIST database of handwritten digits has a training set of 60,000 examples, a test set of 10,000 examples, 784 attributes, and 10 classes; it is used to verify that the algorithm is effective not only for clustering data sets with uniform density but also for data sets with non-uniform density distributions. These benchmark data sets are used to further examine the effectiveness and generalization ability of the algorithm.

Table 1. Benchmark datasets

4.2. Clustering Evaluation Metrics

4.2.1. NMI

$NMI=\frac{I(X,Y)}{\sqrt{H(X)H(Y)}}$ (18)

$NMI=\frac{\sum_{k=1}^{C}\sum_{m=1}^{C}n_{k,m}\log\left(\frac{n\,n_{k,m}}{n_{k}\hat{n}_{m}}\right)}{\sqrt{\left(\sum_{k=1}^{C}n_{k}\log\frac{n_{k}}{n}\right)\left(\sum_{m=1}^{C}\hat{n}_{m}\log\frac{\hat{n}_{m}}{n}\right)}}$ (19)
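Eq. (19) can be computed directly from two label vectors. The sketch below is a count-based transcription of the formula (note it is undefined when either labeling is constant, since an entropy in the denominator is then zero):

```python
import numpy as np

def nmi(y_true, y_pred):
    """NMI as in Eq. (19): mutual information normalized by the
    square root of the product of the two label entropies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency counts n_{k,m}.
    C = np.array([[np.sum((y_true == c) & (y_pred == g)) for g in clusters]
                  for c in classes], dtype=float)
    nk, nm = C.sum(axis=1), C.sum(axis=0)
    nz = C > 0                 # skip empty cells (0 * log 0 = 0)
    mi = np.sum(C[nz] * np.log(n * C[nz] / np.outer(nk, nm)[nz]))
    h_true = -np.sum(nk * np.log(nk / n))
    h_pred = -np.sum(nm * np.log(nm / n))
    return mi / np.sqrt(h_true * h_pred)

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # a pure label permutation
```

Because NMI is invariant to permuting cluster labels, the example above scores 1 even though no label matches.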

4.3. Summary and Analysis of Experiments


Figure 2. KNRS clustering results on benchmark datasets

Table 2. Results on datasets by NMI

MNIST is a database of handwritten digits, shown in Figure 3; it is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. For our experiments, we selected 150 attributes from 500 images in the MNIST training set. The NMI clustering results on the MNIST data set are summarized in Figure 4. According to Figure 4, the KNRS method outperforms the other compared methods.

Figure 3. MNIST dataset

Figure 4. Results on MNIST datasets by NMI

Figure 5. Results of increased eigengap of dataset in KNRS and SC methods

Figure 6. Comparison of the effect of increasing the eigenspace on NMI on 6 sets of datasets

Figure 7. Clustering results of KNRS method for different δ and k. (δ = 0.1, 0.2, 0.4, 0.8, 1); number of neighbors k = 1, 3, 5, 7, 9, 10, 13, 15, 16, 17, 18, 19, 20, 21

5. Conclusion

K Neighbor Relationship for Spectral Clustering with Larger Eigengap [J]. Computer Science and Application, 2020, 10(02): 289-302. https://doi.org/10.12677/CSA.2020.102030
