Netinfo Security (信息网络安全) ›› 2018, Vol. 18 ›› Issue (8): 56-63. doi: 10.3969/j.issn.1671-1122.2018.08.008


Research on Hadoop-based Massive Security Log Clustering Algorithm

Xie LU1, Shoushan LUO1, Yumei ZHANG2

  1. Information Security Center, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Comprehensive Information Support Center, Staff Department, Tianjin Corps of the Chinese People’s Armed Police Force, Tianjin 300001, China
  • Received: 2018-04-04 Online: 2018-08-20 Published: 2020-05-11
  • About the authors:

    Xie LU (born 1985), female, from Yunnan, master's student; main research interest: information security. Shoushan LUO (born 1962), male, from Beijing, professor, Ph.D.; main research interest: network and information security. Yumei ZHANG (born 1967), female, from Tianjin, bachelor's degree; main research interests: network communication and information security.

  • Supported by:
    National High Technology Research and Development Program of China (863 Program) [2015AA016005]

Abstract:

In the big data environment, network security incidents keep emerging, and network security has become a focus of attention in many fields. Security logs, a form of dark data in this environment, record important information about the running state of devices. Analyzing them makes it possible to track the network security situation in real time, and they can serve as a security audit tool for protection before an incident and accountability after it, so that abnormal events can be traced and attributed. Given the importance of log auditing and the role of data mining in log analysis, and given that processing massive data on a single machine is comparatively slow, this paper proposes a Hadoop-based clustering algorithm for massive security logs. First, the K-means clustering algorithm is improved using the maximum-minimum distance (MMD) and a mean-based idea, which overcomes the randomness with which traditional K-means chooses its initial cluster centers. Second, to process massive data effectively and to improve the efficiency and speed of clustering, the improved K-means algorithm is deployed on Map/Reduce for iterative computation. Experiments show that the improved algorithm is more accurate than other typical algorithms, that its clustering results are stable, and that it achieves good running speed and speedup on the cluster.

Key words: security log, clustering, K-means, Map/Reduce, Hadoop
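
The abstract names two ingredients, maximum-minimum distance seeding and Map/Reduce iteration, without giving pseudocode. The following is only a minimal Python sketch of how the two could fit together. It assumes the classic max-min distance seeding (not necessarily the authors' exact variant, and the mean-based refinement is not reproduced), and it imitates one Map/Reduce round in a single process rather than calling any Hadoop API. All function names are hypothetical.

```python
import math
from collections import defaultdict


def max_min_centers(points, k):
    """Classic max-min distance seeding (an assumption, not the paper's exact rule).

    Start from one point, then repeatedly add the point whose distance to its
    nearest already-chosen center is largest.
    """
    centers = [points[0]]  # assumption: the first record is used as the seed
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def map_phase(points, centers):
    """Map step: emit (index of nearest center, point) for every record."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p


def reduce_phase(pairs, centers):
    """Reduce step: average the points grouped under each center index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no members keeps its old position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centers


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Iterate map/reduce rounds until no center moves by more than tol."""
    centers = max_min_centers(points, k)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    # Toy two-dimensional records standing in for numeric log features.
    logs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(logs, 2))
```

Because the seeding is deterministic, repeated runs on the same data start from the same centers, which is the kind of stability the abstract attributes to replacing random initialization. In a real Hadoop job the map and reduce steps would run as distributed tasks, with the current centers shared between rounds.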

CLC number: