A Three-Way Spectral Clustering Algorithm Based on Perturbation Analysis

Vol.06 No.08(2017), Article ID:22809,7 pages
10.12677/AAM.2017.68118


Xiaolei Wang, Pingxin Wang*

School of Science, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu

*Corresponding author.

Received: Nov. 8th, 2017; accepted: Nov. 21st, 2017; published: Nov. 27th, 2017

ABSTRACT

We construct a three-way clustering algorithm based on the NJW algorithm. The main idea is to perturb the data set at each object by adding weighted copies of it and to repeat the clustering, which benefits from the stability of the NJW algorithm. According to the perturbation observed for each object, the object is classified into the core region or the fringe region of its cluster. The compactness of the core and fringe regions is described by a modified S_Dbw(c) index. The algorithm is tested on synthetic and UCI data sets, and the results meet expectations.

Keywords: Three-Way Clustering, Disturbance Degree, NJW Algorithm

1. Introduction

2. Related Work and Concepts

Figure 1. A schematic diagram of a data set

In a three-way clustering, each cluster ${C}_{i}$ is represented by a core region ${C}_{i}^{P}$, a fringe region ${C}_{i}^{B}$ and a negative region ${C}_{i}^{N}$, which satisfy

${C}_{i}^{P}\cup {C}_{i}^{B}\cup {C}_{i}^{N}=U$ (1)

${C}_{i}^{P}\ne \varnothing ,\text{ }i=1,2,\cdots ,k$ (2)

$\underset{i=1}{\overset{k}{\cup }}\left({C}_{i}^{P}\cup {C}_{i}^{B}\right)=U$ (3)

A three-way clustering result is then the family of core/fringe pairs

$TC=\left\{\left({C}_{1}^{P},{C}_{1}^{B}\right),\cdots ,\left({C}_{k}^{P},{C}_{k}^{B}\right)\right\}$
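As a toy illustration, conditions (1)-(3) can be checked directly; the concrete regions below are hypothetical and only serve to exercise the definitions:

```python
# Toy three-way clustering of U = {0,...,9} with k = 2 clusters.
U = set(range(10))
P = [{0, 1, 2}, {5, 6, 7}]               # core regions C_i^P
B = [{3, 4}, {4, 8, 9}]                  # fringe regions C_i^B (may overlap)
N = [U - P[0] - B[0], U - P[1] - B[1]]   # negative regions C_i^N

for i in range(2):
    assert P[i] | B[i] | N[i] == U       # condition (1)
    assert P[i] != set()                 # condition (2)
assert set().union(*(P[i] | B[i] for i in range(2))) == U  # condition (3)
```

Note that the fringe regions of different clusters may share points (here the point 4), while the core regions are disjoint.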

3. A Perturbation-Based Three-Way Clustering Algorithm

Step 1. Apply the NJW algorithm to the data set U once, obtaining $C=\left\{{C}_{1},{C}_{2},\cdots ,{C}_{k}\right\}$.

Step 2. Pick any point q that has not yet been judged; add m copies of q to the original data set to obtain a new data set ${U}_{1}$, and recluster ${U}_{1}$ with the NJW algorithm, obtaining ${C}^{*}=\left\{{C}_{1}^{*},{C}_{2}^{*},{C}_{3}^{*},\cdots \right\}$.

Step 3. In ${C}^{*}$, find the cluster ${C}_{i}^{*}$ corresponding to q and compare the Euclidean distance between the two cluster centers. If the distance exceeds the standard perturbation distance w, assign q to the fringe region; otherwise, assign q to the core region.

Step 4. If all points other than q have been judged, go to Step 5; otherwise, return to Step 2.

Step 5. Output the core and fringe regions.

A refined version scales the threshold by a perturbation coefficient, so that the same standard perturbation distance remains usable for data sets of different sizes:

Step 1. Apply the NJW algorithm to the data set U once, obtaining $C=\left\{{C}_{1},{C}_{2},\cdots ,{C}_{k}\right\}$.

Step 2. Pick any point q, $q\in {C}_{i}$, that has not yet been judged; add m data points to the original data set U to obtain a new data set ${U}_{1}$, and find the cluster ${C}_{i}^{*}$ that originally contains q. Define

perturbation coefficient = (number of added objects) / (total number of objects in the data set).

Step 3. Compute the center of ${C}_{i}^{*}$ and compare it with the corresponding center in C. If the Euclidean distance exceeds the product of the standard perturbation distance and the perturbation coefficient, assign q to the fringe region; otherwise, assign q to the core region.

Step 4. If all points other than q have been judged, go to Step 5; otherwise, return to Step 2.

Step 5. Output the core and fringe regions.
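The steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: for brevity a plain k-means step (the `kmeans` helper) stands in for the NJW spectral clustering step, and `w` denotes the standard perturbation distance.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Stand-in for the NJW spectral clustering step (illustration only).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def three_way_perturb(X, k, m, w):
    """For each point q: add m copies of q, recluster, and compare the
    center of q's cluster before and after the perturbation. If the
    center moves farther than w times the perturbation coefficient
    m/(n+m), q goes to the fringe region; otherwise to the core region."""
    n = len(X)
    labels, centers = kmeans(X, k)      # Step 1: initial clustering
    coeff = m / (n + m)                 # perturbation coefficient
    core, fringe = [], []
    for q in range(n):                  # Steps 2-4: judge every point
        X1 = np.vstack([X, np.repeat(X[q:q + 1], m, axis=0)])
        labels1, centers1 = kmeans(X1, k)
        shift = np.linalg.norm(centers1[labels1[q]] - centers[labels[q]])
        (fringe if shift > w * coeff else core).append(q)  # Step 3
    return core, fringe                 # Step 5
```

With a very large w every point falls into the core region; shrinking w moves points whose removal-or-duplication noticeably shifts a cluster center, typically points near cluster boundaries, into the fringe region.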

4. Experimental Results

4.1. Synthetic Data Sets

${a}_{i}=\left(\left(rand\left(0,1\right)-1\right)*11\right)+1,\text{ }i=1,\cdots ,8$, where $rand\left(0,1\right)$ denotes a random number taken in the interval (0,1), and

$\sigma =\left[\begin{array}{cc}{b}_{1}& {b}_{2}\\ \vdots & \vdots \\ {b}_{7}& {b}_{8}\end{array}\right]$, where ${b}_{j}=\left(rand\left(0,1\right)/4\right)+2,\text{ }j=1,\cdots ,8$.
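Read literally, the parameter draws above can be sketched as follows; this is a hypothetical reading of the formulas, and only the draws of $a_i$, $b_j$ and the assembly of $\sigma$ are shown, not how the 400 points are sampled from them:

```python
import random

random.seed(0)  # rand(0,1): uniform random number in (0,1)
a = [(random.random() - 1) * 11 + 1 for _ in range(8)]  # a_1..a_8, values in [-10, 1)
b = [random.random() / 4 + 2 for _ in range(8)]         # b_1..b_8, values in [2, 2.25)
sigma = [b[j:j + 2] for j in range(0, 8, 2)]            # the 4x2 matrix sigma
```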

4.2. Evaluation Criteria

Figure 2. A synthetic data set with 400 points

$Scat\left(c\right)=\frac{1}{c}\underset{i=1}{\overset{c}{\sum }}\frac{‖\sigma \left({v}_{i}\right)‖}{‖\sigma \left(U\right)‖}$ (4)

${\sigma }_{x}^{P}=\frac{1}{n}\underset{k=1}{\overset{n}{\sum }}{\left({X}_{k}^{P}-{\stackrel{¯}{X}}^{P}\right)}^{2}$ (5)

$\stackrel{¯}{X}=\frac{1}{n}\underset{k=1}{\overset{n}{\sum }}{X}_{k},\text{}\forall {X}_{k}\in U$ (6)

$\sigma \left({v}_{i}\right)$ is the variance of cluster ${C}_{i}$:

${\sigma }_{{v}_{i}}^{P}=\frac{1}{{n}_{i}}\underset{k=1}{\overset{{n}_{i}}{\sum }}{\left({X}_{k}^{P}-{v}_{i}^{P}\right)}^{2}$ (7)

$‖x‖$ is defined as follows: if x is a vector, then

$‖x‖=\sqrt{x*{x}^{\prime }}$ (8)

$P=\left\{{c}_{1}^{P},{c}_{2}^{P},\cdots \right\}$, $B=\left\{{c}_{1}^{B},{c}_{2}^{B},\cdots \right\}$, $C=\left\{{c}_{1},{c}_{2},\cdots \right\}$, where P is the collection of the core regions of the clusters, B is the collection of the fringe regions, and C is the original two-way clustering result. By the Scat(c) index, it follows theoretically that

$Scat\left({c}^{P}\right)=\frac{1}{{c}^{P}}\underset{i=1}{\overset{{c}^{P}}{\sum }}\frac{‖\sigma \left({v}_{i}\right)‖}{‖\sigma \left(P\right)‖}$ ,

$Scat\left(c\right)=\frac{1}{c}\underset{i=1}{\overset{c}{\sum }}\frac{‖\sigma \left({v}_{i}\right)‖}{‖\sigma \left(U\right)‖}$ ,

$Scat\left({c}^{B}\right)=\frac{1}{{c}^{B}}\underset{i=1}{\overset{{c}^{B}}{\sum }}\frac{‖\sigma \left({v}_{i}\right)‖}{‖\sigma \left(B\right)‖}$ ,

$Scat\left({c}^{P}\right)<Scat\left(c\right)<Scat\left({c}^{B}\right)$ (9)
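As a concrete reading of Eqs. (4)-(8), the scatter term can be computed as in the following numpy sketch; this is an illustration under the stated definitions, not the authors' code:

```python
import numpy as np

def scat(X, labels):
    """Average scatter of Eq. (4): the mean over clusters of the ratio
    between the norm of the per-dimension cluster variance (Eq. 7) and
    the norm of the per-dimension variance of the whole data set
    (Eqs. 5-6), using the Euclidean norm of Eq. (8)."""
    var_all = np.linalg.norm(X.var(axis=0))          # ||sigma(U)||
    ratios = [np.linalg.norm(X[labels == c].var(axis=0)) / var_all
              for c in np.unique(labels)]
    return float(np.mean(ratios))
```

Compact, well-separated clusters yield values far below 1, so a tighter partition produces a smaller Scat value.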

Table 1. UCI data sets used in experiments

Table 2. Results of experiments

4.3. Tests on UCI Data Sets

5. Conclusion

Wang, X.L. and Wang, P.X. A Three-Way Spectral Clustering Algorithm Based on Perturbation Analysis[J]. Advances in Applied Mathematics (应用数学进展), 2017, 06(08): 985-991. http://dx.doi.org/10.12677/AAM.2017.68118
