Netinfo Security, 2018, Vol. 18, Issue (8): 43-49. doi: 10.3969/j.issn.1671-1122.2018.08.006


A Big Data Clustering Algorithm Based on the Yarn Cloud Computing Platform and NMF

Xinyang FENG1, Jianjing SHEN2

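  1. School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou, Henan 450046, China
  2. PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450002, China
  • Received: 2018-03-10 Online: 2018-08-20 Published: 2020-05-11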
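  • About the authors:

    FENG Xinyang (born 1980), male, from Anhui, lecturer, Ph.D.; main research interests: cloud computing and financial big data. SHEN Jianjing (born 1961), male, from Hubei, professor, Ph.D.; main research interest: distributed intelligent systems.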

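    As a typical industry that accumulates big data, the telecommunications industry has amassed large volumes of mobile-user behavior data, including the base station from which each call is placed, the call time, and the call duration. On the one hand, these data can be used to study the social networks formed among users. On the other hand, because the behavior data carry geographic context, they can also be combined with geographic attributes, using network theory, to study the relationships between different areas of a city and the functions of those areas.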

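  • Supported by:
    National Natural Science Foundation of China [61202285]; Henan Province Science and Technology Research Project [122102210387]; Science and Technology Research Project of the Education Department of Henan Province [13B520902]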

A Yarn and NMF Based Big Data Clustering Algorithm

Xinyang FENG1, Jianjing SHEN2

  1. School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou, Henan 450046, China
  2. PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450002, China
  • Received: 2018-03-10 Online: 2018-08-20 Published: 2020-05-11

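Abstract:

To improve on the performance of early versions of MapReduce for big data clustering, this paper proposes a big data clustering method based on the Yarn (Yet Another Resource Negotiator) cloud computing platform and non-negative matrix factorization (NMF). It discusses how similarity clustering of high-dimensional data can be combined with NMF, and how the clustering task is partitioned for MapReduce. The method is implemented on the Yarn platform of Hadoop 2.0 and uses the Hadoop Distributed File System (HDFS) to store large volumes of external data. The paper describes the coding and implementation of the NMF-based similarity clustering method for big data and tests it with a case program on a telecom operator's big data. The experimental results show that the Yarn cloud platform achieves better running time and speedup than the traditional non-negative matrix method for data clustering, and that it can process the telecom operator's big data within an acceptable time.

Keywords: cloud computing, big data, Yarn platform, non-negative matrix factorization, clustering algorithm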

Abstract:

To improve on the performance of MapReduce version 1 in big data processing, this paper proposes a parallel hierarchical clustering algorithm based on Yarn and NMF (Non-negative Matrix Factorization). It then discusses the combination of big data classification with the NMF algorithm and the task partitioning used in the MapReduce approach. The approach uses the Yarn distributed computation programming model of Hadoop 2.0, and the big data is stored in HDFS (Hadoop Distributed File System). The coding mechanism and the flow of hierarchical data clustering on Yarn are also described in detail. To demonstrate the efficiency of the approach, a series of simulation experiments was run on telecommunication big data. The results and performance analysis show that big data processing can be completed within an acceptable time with the Yarn framework, and that good performance and speedup were obtained in the tests.

Keywords: cloud computing, big data, Yarn platform, non-negative matrix factorization, clustering algorithm
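
The abstract does not give the update rules or the task partition, so the following is only an illustrative sketch of how NMF-based clustering can be split into MapReduce-style steps. It assumes the standard Lee-Seung multiplicative updates for V ≈ WH and a row-block partition of V. The function names `nmf_blockwise` and `cluster_labels`, the parameters, and the partition scheme are all hypothetical and are not taken from the paper. It runs in a single process with NumPy and does not use any Hadoop or Yarn API.

```python
import numpy as np


def nmf_blockwise(V, k, n_blocks=4, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as V ≈ W @ H with k components.

    Assumption: Lee-Seung multiplicative updates, with the rows of V split
    into blocks so that each block plays the role of one map task.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps

    # Contiguous row blocks; on a cluster each block would live on one worker.
    bounds = np.linspace(0, n, n_blocks + 1, dtype=int)
    blocks = [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]

    for _ in range(n_iter):
        # "Map": every block emits its partial sums W_b^T V_b and W_b^T W_b.
        partials = [(W[b].T @ V[b], W[b].T @ W[b]) for b in blocks]
        # "Reduce": add the partial sums, then update the shared factor H.
        WtV = sum(p[0] for p in partials)
        WtW = sum(p[1] for p in partials)
        H *= WtV / (WtW @ H + eps)

        # Each row of W depends only on the same row of V and on H,
        # so the blocks can be updated independently of one another.
        HHt = H @ H.T
        for b in blocks:
            W[b] *= (V[b] @ H.T) / (W[b] @ HHt + eps)

    return W, H


def cluster_labels(W):
    """Assign each row (sample) to the component with the largest weight."""
    return np.argmax(W, axis=1)
```

In this sketch only the small k x m and k x k partial sums cross block boundaries, which is the usual reason a row partition suits a map/reduce split. A call such as `W, H = nmf_blockwise(V, k=2)` followed by `cluster_labels(W)` yields one cluster index per row of `V`.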

CLC Number: