Netinfo Security ›› 2024, Vol. 24 ›› Issue (5): 767-777. doi: 10.3969/j.issn.1671-1122.2024.05.010

• Theoretical Research •

An Automatic Discovery Method for Heuristic Log Templates

ZHANG Shuya1,2,3, CHEN Liangguo1,2,3, CHEN Xingshu1,2,3

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
  2. Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Chengdu 610065, China
  3. Cyber Science Research Institute, Sichuan University, Chengdu 610065, China
  • Received: 2024-03-01 Online: 2024-05-10 Published: 2024-06-24
  • Contact: CHEN Xingshu E-mail: chenxsh@scu.edu.cn
  • About the authors: ZHANG Shuya (born 1999), female, from Sichuan, master's student; main research interest: data security management | CHEN Liangguo (born 1993), male, from Guizhou, doctoral student; main research interests: big data and network security | CHEN Xingshu (born 1968), female, from Guizhou, professor, Ph.D.; main research interests: cloud computing security, data security, threat detection, open-source intelligence, and artificial intelligence security
  • Supported by:
    National Natural Science Foundation of China (U19A2081); Fundamental Research Funds for the Central Universities (SCU2023D008); Fundamental Research Funds for the Central Universities (2022SCU12116); Fundamental Research Funds for the Central Universities (2023SCU12129); Fundamental Research Funds for the Central Universities (2023SCU12126); Science and Engineering Development Program of Sichuan University (2020SCUNG129)

Abstract:

Logs are an important data source for security analysis. Unstructured raw logs, however, cannot be used directly for such analysis, so parsing them into structured templates is a critical first step. Most existing log parsing methods assume that log messages belonging to the same template have the same length. Because logs contain variable-length variables, messages that belong to the same template are then incorrectly extracted as different templates. This paper therefore proposes KeyParse, a method for automatic log template discovery. First, the similarity between a log message and a template is computed with the longest common subsequence algorithm, which discounts the differences introduced by variable-length variables and allows logs to be matched to templates. Second, log templates are grouped by their highest-frequency item, which prevents log messages that belong to the same event but differ in length from being assigned to different template groups, reduces template redundancy, and improves template matching efficiency. Finally, the HeavyGuardian algorithm is used to track the highest-frequency items in streaming log messages, addressing the difficulty that traditional frequency-counting methods have in adapting to the dynamically changing word frequencies of streaming logs. Experimental results show that KeyParse achieves relatively high accuracy across multiple types of log datasets, with an average parsing accuracy of 0.968, and performs better when parsing large log datasets.

Key words: log parsing, template grouping, template auto-discovery
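
The matching and grouping steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the KeyParse implementation. The similarity definition (share of a template's constant tokens recovered by the longest common subsequence), the 0.5 threshold, the `<*>` wildcard, and the names `lcs`, `similarity`, `merge`, and `TemplateStore` are all assumptions made for illustration. An exact `collections.Counter` stands in for HeavyGuardian, which is an approximate streaming structure and is not reproduced here.

```python
from collections import Counter, defaultdict

WILDCARD = "<*>"  # assumed placeholder for a variable part of a template


def lcs(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:  # walk the table to recover the tokens
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def similarity(log_tokens, template_tokens):
    """Assumed metric: fraction of template constants found, in order, in the log."""
    constants = [t for t in template_tokens if t != WILDCARD]
    if not constants:
        return 0.0
    return len(lcs(log_tokens, constants)) / len(constants)


def merge(log_tokens, template_tokens):
    """Keep the shared tokens; collapse each run of other log tokens to one wildcard.

    Simplification: a wildcard is only emitted where the log has extra tokens.
    """
    constants = [t for t in template_tokens if t != WILDCARD]
    common = lcs(log_tokens, constants)
    merged, k = [], 0
    for tok in log_tokens:
        if k < len(common) and tok == common[k]:
            merged.append(tok)
            k += 1
        elif not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)
    return merged


class TemplateStore:
    def __init__(self, threshold=0.5):
        self.threshold = threshold       # assumed similarity cut-off
        self.freq = Counter()            # exact counts; stand-in for HeavyGuardian
        self.groups = defaultdict(list)  # group key -> list of templates

    def parse(self, message):
        tokens = message.split()
        if not tokens:
            return ""
        self.freq.update(tokens)
        # Group key: the message token with the highest count so far
        # (max returns the earliest token when counts tie).
        key = max(tokens, key=lambda t: self.freq[t])
        templates = self.groups[key]
        best, best_sim = None, 0.0
        for idx, tpl in enumerate(templates):
            sim = similarity(tokens, tpl)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= self.threshold:
            templates[best] = merge(tokens, templates[best])
            return " ".join(templates[best])
        templates.append(tokens)  # no close match: start a new template
        return " ".join(tokens)


if __name__ == "__main__":
    store = TemplateStore()
    for line in ("Accepted password for alice from 10.0.0.1",
                 "Accepted password for bob smith from 10.0.0.2"):
        print(store.parse(line))
```

In this sketch the second message has one more token than the first, yet both land in the same group and resolve to a single template, which is the behavior that a fixed-length assumption would prevent. Because the group key depends on counts seen so far, a real stream can shift a message's key over time; how KeyParse handles that is not described in the abstract and is not modeled here.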

CLC Number: