Research on Hadoop-based Massive Security Log Clustering Algorithm

doi:10.3969/j.issn.1671-1122.2018.08.008

Abstract

In the big data environment, network security incidents occur constantly, and network security has become a focus of attention across sectors. As a form of dark data in this environment, security logs record important information about the running status of devices. Analyzing them makes it possible to track the network security situation in real time, and they can serve as a security auditing tool for protection before an incident and accountability after it, supporting the attribution and tracing of abnormal events. Given the importance of log auditing and the role of data mining in log analysis, and because processing massive data on a single machine is comparatively inefficient, this paper proposes a Hadoop-based clustering algorithm for massive security logs. First, the K-means clustering algorithm is improved using the maximum-minimum distance (MMD) and the mean, which overcomes the randomness of initial cluster center selection in traditional K-means. Second, to handle massive data effectively and improve the efficiency and speed of clustering, the improved K-means algorithm is deployed on Map/Reduce for iterative computation. Experiments show that the improved algorithm is more accurate than other typical algorithms, that its clustering results are stable, and that it achieves good running speed and speedup on the cluster.

Key words: security log, clustering, K-means, Map/Reduce, Hadoop
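
The two ideas named in the abstract, seeding K-means with the maximum-minimum distance rule and running each iteration as a map step followed by a reduce step, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not say how the mean is combined with MMD, so starting from the point nearest the overall mean is an assumption of this sketch, as are the fixed `k`, the Euclidean distance, and all function names. The map and reduce phases are plain Python functions standing in for Hadoop jobs.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_init(points, k):
    """Pick k initial centers with the max-min distance heuristic."""
    # Assumption of this sketch: start from the point nearest the overall mean.
    dim = len(points[0])
    mean = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    centers = [min(points, key=lambda p: dist(p, mean))]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def map_phase(points, centers):
    """Map: emit (index of nearest center, point) for every record."""
    for p in points:
        yield min(range(len(centers)), key=lambda i: dist(p, centers[i])), p


def reduce_phase(pairs, centers):
    """Reduce: replace each center with the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # centers with no members are left unchanged
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[i] for m in members) / len(members) for i in range(dim)
        )
    return new_centers


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Iterate map and reduce until no center moves by more than tol."""
    centers = maxmin_init(points, k)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    return centers
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` seeds one center in each of the two visible groups and converges to their means. Because the seeding is deterministic, repeated runs on the same data give the same clusters, which is the stability property the abstract attributes to avoiding random initial centers.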

CLC number:

TP309

Xie LU, Shoushan LUO, Yumei ZHANG. Research on Hadoop-based Massive Security Log Clustering Algorithm[J]. Netinfo Security, 2018, 18(8): 56-63.
