﻿ 基于数据映射算法的近邻存储方法研究 Research on the Method of Neighbour Storage Based on Data Mapping Algorithm

Hans Journal of Data Mining
Vol.06 No.04(2016), Article ID:18698,9 pages
10.12677/HJDM.2016.64016

Research on the Method of Neighbour Storage Based on Data Mapping Algorithm

Shanshan Li

School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu

Received: Aug. 15th, 2016; accepted: Oct. 7th, 2016; published: Oct. 10th, 2016

ABSTRACT

With the high-speed development of the Internet, processing of high-dimensional and massive amounts of data for querying is a key challenge. However, for the traditional distributed storage scheme, such as the P2P network and Chord, the data storage capacity and the switch overheads from the nodes are increasing, thus decreasing the storage efficiency and data query efficiency continuously. In this article, a neighbor data storage approach based on data mapping algorithm is proposed. The experiment results show that the proposed method can improve the query accuracy rate and reduce network bandwidth through relevant query.

Keywords:P2P, Distributed-Memory, Neighbor Data Storage, Relevant Query

1. 引言

2. 相似性度量方式

(1) (距离非负)；

(2)当且仅当(只有点到自身的距离为0，其他距离都大于0)；

(3) (距离具有对称性)；

(4) (三角不等式)。

(1) 海明距离

(2.1)

(2) 范数距离

(2.2)

(2.3)

3. 基于数据映射算法的近邻存储方法的提出

3.1. 理论基础

3.2. 实现方法

(1) 利用谱哈希进行哈希映射

Figure 1. The flow chart of the mapping

(3.1)

(3.2)

(2) 新型数据映射算法

(3) 分布式节点网络的构建

Chord由麻省理工学院(MIT)在2001年提出，其目的是提供一种能在P2P网络快速定位资源的的算法。Chord通过把Node和Key映射到相同的空间而保证一致性哈希。为了保证通过哈希函数得到的哈希码不具有重复性，选择SHA-1作为哈希函数，它会产生一个2160大小的空间，环上的每一项均为一个16字节的大整数。这些整数首尾相连形成一个环，称之为Chord环。整数在Chord环上按大小顺时针排

Figure 2. Z-curve method

(3.3)

(4) 负载均衡

objec1->m1-1；objec2->m1-2；objec3->m18-2；objec4->m18-1；

Figure 3. Z-Value distribution of the ring

Figure 4. Distributed storage architecture diagram

4. 实验验证

5. 总结

Figure 5. Similarity degree between systems

Figure 6. System switching expenditure

1) 针对海量数据下分布式存储系统中相似性查询查全率较低等问题，提出了近邻存储这一创新性概念。基于这个理念，提出了一种基于数据映射算法的近邻存储方法，该方法利用谱哈希哈希映射算法和数据映射算法将高维原始数据映射到分布式存储空间，实现了分布式近邻存储。

2) 为了能够更好的将相近原始数据映射到邻近的服务器节点中，提出了一种基于Z-curve的数据映射方法。该方法从原理证明了在原始空间相近的数据能够更好的映射到分布式存储系统的邻近的服务器节点，保证了在各个空间中数据的相似性。

3) 为了防止出现系统热点问题，影响系统可靠性，通过虚拟节点的方法，动态的分析和移动各个节点的负载，有效了降低了各个节点的负载，减轻了系统的过载情况，实现了系统的负载均衡。

4) 最后通过实验可以证明，在多个不同指标下，本文提出的方案在系统相似度和系统切换开销均优于chord。通过本文实验可以说明本方案在分布式网络中实现了分布式近邻存储，在大数据领域具有较大的应用价值。

Research on the Method of Neighbour Storage Based on Data Mapping Algorithm[J]. 数据挖掘, 2016, 06(04): 139-147. http://dx.doi.org/10.12677/HJDM.2016.64016

1. 1. Svendsen, H.B. and Erickson, M. (2016) System and Method for Identifying Music Content in a P2P Real Time Rec-ommendation Network. http://xueshu.baidu.com/s?wd=paperuri%3A%2879696d30bc0034769b511c43747ca6ab%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fwww.freepatentsonline.com%2F8422490.html&ie=utf-8&sc_us=3331819690017585512

2. 2. Li, P., Wang, M., Cheng, J., et al. (2013) Spectral Hashing with Semantically Consistent Graph for Image Indexing. IEEE Transactions on Multimedia, 15, 141-152. http://dx.doi.org/10.1109/TMM.2012.2199970

3. 3. 彭良睿, 李学明. 一种基于树型结构的P2P系统高维数据检索方法[J]. 计算机应用研究, 2015, 32(3): 842-845.

4. 4. Yao, C., Bu, J.J., Wu, C.X., Chen, G.C., et al. (2013) Semi-Supervised Spectral Hashing for Fast Similarity Search. Neurocomputing, 101, 52-58. http://dx.doi.org/10.1016/j.neucom.2012.06.035

5. 5. Washbourne, L. (2015) A Survey of P2P Network Security. arXiv:1504.01358

6. 6. Shao, J., Wu, F., Ouyang, C., et al. (2012) Sparse Spectral Hashing. Pattern Recognition Letters, 33, 271-277. http://dx.doi.org/10.1016/j.patrec.2011.10.018

7. 7. Zou, F., Liu, C., Ling, H., et al. (2013) Least Square Regu-larized Spectral Hashing for Similarity Search. Signal Processing, 93, 2265-2273. http://dx.doi.org/10.1016/j.sigpro.2012.05.033

8. 8. Stoica, I., Morris, R., Karger, D., et al. (2001) Chord: A Sclable Peer-to-Peer Lookup Service for Internet Applications. Proceedings of the 2001 SIGCOMM, 31, 149-160. http://dx.doi.org/10.1145/383059.383071