
Artificial Intelligence and Robotics Research
Vol.05 No.02(2016), Article ID:17566,12 pages
10.12677/AIRR.2016.52003

Image Cluster Method Based on Ensemble Locality Sensitive Clustering

Tianqiang Peng1, Haolin Gao2

1Department of Computer Science and Engineering, Henan Institute of Engineering, Zhengzhou, Henan

2Information System Engineering Institute, Information Engineering University, Zhengzhou, Henan

Received: Apr. 25th, 2016; accepted: May 15th, 2016; published: May 18th, 2016

ABSTRACT

To overcome the weaknesses of k-means in image clustering, especially visual image clustering, we propose an Ensemble Locality Sensitive Clustering (ELSC) method. It first determines the number of clusters in the dataset, then generates multiple clustering solutions based on the Exact Euclidean Locality Sensitive Hashing (E2LSH) algorithm, and finally applies cluster ensemble methods to obtain the final partition. Experiments on a synthetic dataset and an image dataset show that the new method matches k-means combined with cluster ensemble in clustering accuracy on the synthetic dataset, and is slightly less accurate on the image dataset. However, the new method clusters faster than k-means and is well suited to incremental clustering. Ensemble Locality Sensitive Clustering is therefore a promising clustering method for high-dimensional image data.

Keywords: Exact Euclidean Locality Sensitive Hashing, Random Projection, Image Clustering, Cluster Ensemble


1. Introduction

E2LSH (Exact Euclidean Locality Sensitive Hashing) [10] is a special case of random projection that has attracted increasing attention in recent years and has been applied in many fields, such as retrieval [11] and copy detection [12]. When used for approximate nearest neighbor search, it achieves very low running time while maintaining high accuracy, and is currently a mainstream algorithm for approximate nearest neighbor search. By mapping high-dimensional vectors into a low-dimensional space, it also accomplishes dimensionality reduction. At the same time, points falling into the same bucket are more similar to one another than points in different buckets, so grouping data points by their bucket labels also clusters them. Moreover, E2LSH is a data-independent method: the mapping of one point does not depend on any other point, which means it can conveniently index a dynamic database. Likewise, when it is used for clustering, a change in the dataset does not require re-clustering the whole dataset; only the changed points need to be processed. In other words, it naturally supports dynamic clustering, and together with its speed advantage this makes it well suited to incremental clustering of large datasets. In fact, the authors of the LSH algorithm already pointed out that LSH can be used for fast clustering, but they provided no further theoretical proof or experimental validation. Ravichandran et al. have, however, applied E2LSH, the Euclidean-space implementation of LSH, to noun clustering [13]. Image data are far more complex than text data, so clustering images is also considerably harder than clustering nouns.

2. Feasibility of Locality Sensitive Clustering

2.1. Distance-Preserving and Margin-Preserving Properties of Random Projection

The Johnson-Lindenstrauss lemma states the distance-preserving property of random projection. Given $\varepsilon \in (0,1)$ and a dataset $X$ of $n$ points in $\mathbb{R}^{N}$, for any positive integer $k \ge O(\varepsilon^{-2}\ln n)$ there exists a map $f:\mathbb{R}^{N} \to \mathbb{R}^{k}$ such that for all $u, v \in X$:

$$(1-\varepsilon)\|u-v\|^{2} \le \|f(u)-f(v)\|^{2} \le (1+\varepsilon)\|u-v\|^{2} \tag{1}$$

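The distance-preserving property above can be checked numerically. The following is a minimal sketch in pure Python, assuming a Gaussian random projection scaled by 1/sqrt(k); all function names are illustrative, not part of the paper's method:

```python
import random
from math import sqrt

random.seed(0)

def gaussian_matrix(k, n):
    # k x n matrix with i.i.d. standard normal entries
    return [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]

def project(A, v):
    # f(v) = A v / sqrt(k); this scaling makes E[||f(u)-f(v)||^2] = ||u-v||^2
    k = len(A)
    return [sum(a * x for a, x in zip(row, v)) / sqrt(k) for row in A]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Project 5 random points from R^500 down to R^100 and measure the worst
# relative distortion of pairwise squared distances.
N, k = 500, 100
points = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(5)]
A = gaussian_matrix(k, N)
images = [project(A, p) for p in points]

max_distortion = 0.0
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        ratio = sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
        max_distortion = max(max_distortion, abs(ratio - 1.0))
print("worst distortion:", max_distortion)
```

Even though the dimension drops from 500 to 100, with this fixed seed the worst pairwise distortion stays well below 1, as inequality (1) predicts for moderate $\varepsilon$.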

2.2. Locality Sensitive Clustering

E2LSH is the Euclidean-space implementation of LSH (Locality Sensitive Hashing) and is a method based on random projection; algorithms built on it therefore inherit the ability to cluster data. The foundation of locality sensitive clustering is the locality sensitive hash function of the E2LSH algorithm, which is based on a p-stable distribution. The locality sensitive hash function is defined as:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor \tag{5}$$

where $a$ is a vector whose entries are drawn independently from a p-stable (for the Euclidean case, Gaussian) distribution, $b$ is drawn uniformly from $[0, w]$, and $w$ is the bucket width.
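A minimal sketch of this hash function in pure Python (the helper names are illustrative; a is drawn from a 2-stable, i.e. Gaussian, distribution and b uniformly from [0, w)):

```python
import random
from math import floor

def e2lsh_hash(a, b, w, v):
    # h_{a,b}(v) = floor((a . v + b) / w)
    projection = sum(a_i * v_i for a_i, v_i in zip(a, v))
    return floor((projection + b) / w)

def make_hash(dim, w, rng=random):
    # Draw a from a 2-stable (Gaussian) distribution and b uniformly from [0, w)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return a, b

random.seed(1)
a, b = make_hash(2, w=4.0)
# Nearby points tend to land in the same bucket; distant points rarely do.
print(e2lsh_hash(a, b, 4.0, [0.0, 0.0]), e2lsh_hash(a, b, 4.0, [0.1, 0.1]))
```

Because the bucket index depends only on the point being hashed (and the shared a, b, w), new points can be assigned to buckets without touching the rest of the dataset, which is the data-independence property discussed in the introduction.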

3. Ensemble Locality Sensitive Clustering

3.1. Cluster Ensemble Framework

Strehl and Ghosh define the consensus partition (i.e., the final partition) as the partition that shares the most information with the partitions produced by the individual base clusterers [19]. To measure the information shared by two partitions, they define the Normalized Mutual Information (NMI), based on mutual information from information theory, as a similarity measure between two partitions. Let $n$ denote the number of data points, $n_i^{(a)}$ the number of points in cluster $i$ of partition $\lambda^{(a)}$, $n_j^{(b)}$ the number of points in cluster $j$ of partition $\lambda^{(b)}$, and $n_{ij}$ the number of points in both clusters. Then NMI is defined as:

$$\phi^{\mathrm{NMI}}\left(\lambda^{(a)}, \lambda^{(b)}\right) = \frac{\sum_{i}\sum_{j} n_{ij}\log\left(\dfrac{n \, n_{ij}}{n_i^{(a)} n_j^{(b)}}\right)}{\sqrt{\left(\sum_{i} n_i^{(a)}\log\dfrac{n_i^{(a)}}{n}\right)\left(\sum_{j} n_j^{(b)}\log\dfrac{n_j^{(b)}}{n}\right)}} \tag{6}$$

The consensus partition is then the partition that maximizes the total NMI with all $r$ base partitions:

$$\lambda^{*} = \arg\max_{\lambda} \sum_{a=1}^{r} \phi^{\mathrm{NMI}}\left(\lambda, \lambda^{(a)}\right) \tag{7}$$
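Equation (6) is mutual information normalized by the geometric mean of the two partitions' entropies. A small self-contained sketch (the function name is illustrative; partitions are given as equal-length lists of cluster labels):

```python
from collections import Counter
from math import log, sqrt

def nmi(part_a, part_b):
    # Normalized mutual information between two partitions of the same points
    n = len(part_a)
    counts_a = Counter(part_a)
    counts_b = Counter(part_b)
    counts_ab = Counter(zip(part_a, part_b))
    # Mutual information: the numerator of Eq. (6), scaled by 1/n
    mi = sum((n_ij / n) * log(n * n_ij / (counts_a[i] * counts_b[j]))
             for (i, j), n_ij in counts_ab.items())
    # Entropies of each partition: the denominator of Eq. (6), same 1/n scaling
    h_a = -sum((c / n) * log(c / n) for c in counts_a.values())
    h_b = -sum((c / n) * log(c / n) for c in counts_b.values())
    return mi / sqrt(h_a * h_b)
```

Identical partitions score 1 regardless of how the labels are named, and independent partitions score 0, which is what makes NMI usable for comparing candidate consensus partitions.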

3.2. Ensemble Locality Sensitive Clustering

1) Determine the number of clusters k for dataset S.

2) Generate a random matrix $A = [a_1, \ldots, a_d]^{\mathrm{T}}$ whose entries follow a Gaussian distribution, together with $b$ and $w$, where $b$ is a d-dimensional vector whose entries are drawn uniformly from $[0, w]$ and $w$ is the bucket width.

3) Apply the random projection to all data points; the projection of each point $v$ is a d-dimensional bucket label:

$$g(v) = \left( \left\lfloor \frac{a_1 \cdot v + b_1}{w} \right\rfloor, \ldots, \left\lfloor \frac{a_d \cdot v + b_d}{w} \right\rfloor \right) \tag{8}$$

4) Use the PAM algorithm to cluster each column of the bucket-label matrix into k clusters; the results are d partitions of the dataset, denoted $\lambda^{(1)}, \ldots, \lambda^{(d)}$.

5) Apply cluster ensemble methods to combine the d base partitions into the final partition. The ensemble methods used are CSPA, HGPA, and MCLA; each produces an ensemble partition, denoted $\lambda_{\mathrm{CSPA}}$, $\lambda_{\mathrm{HGPA}}$, and $\lambda_{\mathrm{MCLA}}$. The final partition is the one with the largest normalized mutual information with the other two, i.e.

$$\lambda^{*} = \arg\max_{\lambda \in \{\lambda_{\mathrm{CSPA}},\, \lambda_{\mathrm{HGPA}},\, \lambda_{\mathrm{MCLA}}\}} \sum_{\lambda' \ne \lambda} \phi^{\mathrm{NMI}}\left(\lambda, \lambda'\right) \tag{9}$$
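Steps 2) through 4) can be sketched compactly in pure Python. This is an illustration only: a brute-force 1-D k-medoids stands in for PAM (adequate only for small inputs), the CSPA/HGPA/MCLA consensus of step 5) is omitted, and all names are hypothetical rather than the authors' implementation:

```python
import random
from itertools import combinations

def bucket_matrix(points, d, w, rng):
    # Steps 2)-3): d random hash functions -> n x d matrix of bucket labels (Eq. (8))
    dim = len(points[0])
    hashes = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(d)]
    return [[int((sum(ai * vi for ai, vi in zip(a, v)) + b) // w)
             for (a, b) in hashes] for v in points]

def kmedoids_1d(values, k):
    # Step 4) stand-in for PAM: exhaustive 1-D k-medoids, small inputs only
    uniq = sorted(set(values))
    if len(uniq) <= k:
        index = {u: i for i, u in enumerate(uniq)}
        return [index[v] for v in values]
    best, best_cost = None, float("inf")
    for medoids in combinations(uniq, k):
        cost = sum(min(abs(v - m) for m in medoids) for v in values)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return [min(range(k), key=lambda i: abs(v - best[i])) for v in values]

def base_partitions(points, k, d=5, w=4.0, seed=0):
    # Steps 2)-4): one base partition per hash column
    rng = random.Random(seed)
    cols = bucket_matrix(points, d, w, rng)
    n = len(points)
    return [kmedoids_1d([cols[p][j] for p in range(n)], k) for j in range(d)]
```

The d resulting label lists are exactly the base partitions that step 5) would feed into CSPA, HGPA, and MCLA before selecting the winner by Eq. (9).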

4. Experiments

4.1. Experiments on Synthetic Datasets

Figure 1. k-means clustering result in data set 1

Figure 2. ELSC bucket number result in data set 1

Table 1. The labels of k-means clustering for the first class in data set 1

Table 2. The bucket indices of the ELSC algorithm for the first class in data set 1

Figure 3. k-means clustering result in data set 2

Figure 4. ELSC clustering result in data set 2

[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3]. This result is also identical to that of Ek-means.

4.2. Experiments on Image Sets

Table 3. The bucket indices of the ELSC algorithm for the second class of images

Figure 5. k-means clustering result in image set 1

Figure 6. ELSC bucket index result in image set 1

Figure 7. Cluster labels of the ELSC partition in image set 1

Figure 8. The clustering time of two algorithms in image set 1

Figure 9. The MAP of ELSC in image set 1

Table 4. The F-measure of the algorithms for each class of image set 1

5. Conclusion

Peng, T.Q. and Gao, H.L. (2016) Image Cluster Method Based on Ensemble Locality Sensitive Clustering. Artificial Intelligence and Robotics Research, 5, 23-34. http://dx.doi.org/10.12677/AIRR.2016.52003

1. Cao, Y. and Wu, J. (2002) Projective ART for Clustering Data Sets in High Dimensional Spaces. Neural Networks, 15, 105-120.

2. Agrawal, R., Gehrke, J., Gunopulos, D. and Raghavan, P. (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of SIGMOD Record ACM Special Interest Group on Management of Data, 94-105. http://dx.doi.org/10.1145/276304.276314

3. Schclar, A., Rokach, L. and Amit, A. (2013) Ensembles of Classifiers Based on Dimensionality Reduction. Pattern Analysis and Applications Journal, 5, 1305-4345.

4. Dasgupta, S. and Sinha, K. (2013) Randomized Partition Trees for Exact Nearest Neighbor Search. JMLR: Workshop and Conference Proceedings, 30, 1-21.

5. Baraniuk, R.G., Davenport, M., DeVore, R. and Wakin, M.B. (2008) A Simple Proof of the Restricted Isometry Property for Random Matrices. Constructive Approximation, 28, 253-263.

6. Fowler, J.E. and Du, Q. (2012) Anomaly Detection and Reconstruction from Random Projections. IEEE Transactions on Image Processing, 21, 184-195.

7. Schulman, L.J. (2000) Clustering for Edge-Cost Minimization. Proceedings of the Annual ACM Symposium on Theory of Computing, 547-555.

8. Balcan, M.-F., Blum, A. and Vempala, S. (2006) Kernels as Features: On Kernels, Margins, and Low-Dimensional Mappings. Machine Learning, 65, 79-94.

9. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A.J. and Vishwanathan, S.V.N. (2009) Hash Kernels for Structured Data. Journal of Machine Learning Research, 10, 2615-2637.

10. Andoni, A. and Indyk, P. (2008) Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51, 117-122.

11. Jegou, H., Douze, M. and Schmid, C. (2010) Improving Bag-of-Features for Large Scale Image Search. International Journal of Computer Vision, 87, 316-336. http://dx.doi.org/10.1007/s11263-009-0285-2

12. Liu, Z., Liu, T. and David, G. (2010) Effective and Scalable Video Copy Detection. Proceedings of the ACM SIGMM International Conference on Multimedia Information Retrieval, ACM, New York, 119-128.

13. Ravichandran, D., Pantel, P. and Hovy, E. (2005) Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Stroudsburg, 622-629. http://dx.doi.org/10.3115/1219840.1219917

14. Blum, A. (2006) Random Projection, Margins, Kernels, and Feature Selection. Proceedings of the 2005 International Conference on Subspace, Latent Structure and Feature Selection, LNCS, 52-68.

15. Shi, Q.F., Shen, C.H., Hill, R. and van den Hengel, A. (2012) Is Margin Preserved after Random Projection? Proceedings of the International Conference on Machine Learning, Edinburgh.

16. Vega-Pons, S. and Ruiz-Shulcloper, J. (2011) A Survey of Clustering Ensemble Algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25, 337-372. http://dx.doi.org/10.1142/S0218001411008683

17. Topchy, A.P., Jain, A.K. and Punch, W.F. (2005) Clustering Ensembles: Models of Consensus and Weak Partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1866-1881. http://dx.doi.org/10.1109/TPAMI.2005.237

18. Topchy, A., Minaei-Bidgoli, B., Jain, A.K. and Punch, W.F. (2004) Adaptive Clustering Ensembles. Proceedings of the 17th International Conference on Pattern Recognition, Washington DC, 272-275. http://dx.doi.org/10.1109/icpr.2004.1334105

19. Strehl, A. and Ghosh, J. (2002) Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3, 583-617.