Netinfo Security (信息网络安全) ›› 2022, Vol. 22 ›› Issue (4): 30-39. doi: 10.3969/j.issn.1671-1122.2022.04.004

• Technical Research •

Log Compression Optimization Method Based on Parser Tree

LIU Jiqiang¹, HE Jiahao¹, ZHANG Jiancheng²,³, HUANG Xuezhen⁴

  1. Department of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  2. Shandong Computer Science Center, Jinan 250014, China
  3. Shandong Zhengzhong Information Technology Co., Ltd., Jinan 250014, China
  4. The First Research Institute of the Ministry of Public Security, Beijing 100048, China
  • Received: 2022-01-12 Online: 2022-04-10 Published: 2022-05-12
  • Contact: LIU Jiqiang E-mail: jqliu@bjtu.edu.cn
  • About the authors: LIU Jiqiang (born 1973), male, from Shandong, professor, Ph.D.; main research interests: trusted computing, privacy protection, and cloud computing security | HE Jiahao (born 1997), male, from Henan, master's student; main research interests: blockchain and secure data storage | ZHANG Jiancheng (born 1973), male, from Henan, associate research fellow, master's degree; main research interests: cryptographic technology and Internet of Things security | HUANG Xuezhen (born 1984), female, from Shanxi, engineer, Ph.D.; main research interests: privacy protection and data security
  • Supported by:
    National Key Research and Development Program of China (2020YFB2103800); Science and Technology Research and Development Program of China State Railway Group Co., Ltd. (N2020W005); Major Scientific and Technological Innovation Project of Shandong Province (2019JZZY020128)

Abstract:

Information system log data is essential for security analysis. As log volumes grow daily, storing and auditing log data efficiently has become one of the key issues in information system security. Log data compression reduces the heavy storage overhead of log data and has become an active research topic. Traditional compression tools and algorithms work well on small-scale text but are not suited to the large-scale log data generated by information systems. Existing log compression algorithms compress data by extracting log structure, but they do not significantly improve the compression ratio or compression speed for the numerical variable parts of log data. This paper proposes a parser-tree-based log compression optimization method (TOLC), which uses a parser to construct a parser tree, extracts the corresponding log templates and compresses them, and then encodes and compresses the numerical variable parts. TOLC is evaluated on five large log datasets of different types and compared with other methods. The experimental results show that TOLC achieves the highest compression ratio on all datasets, also shows good compression speed on large log datasets, and performs best overall.

Key words: parser tree, log compression, template extraction, numerical encoding, compression ratio
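
The general idea named in the abstract, separating each log line into a template and its numerical variables and then encoding the numbers, can be illustrated with a minimal Python sketch. This is not the TOLC implementation: the parser tree is replaced here by a simple regular-expression template extractor, and the numerical encoding is assumed to be a per-template delta encoding followed by `zlib`. The names `split_line`, `compress`, `decompress`, and `PLACEHOLDER` are hypothetical, and the sketch assumes that log lines never contain a NUL character.

```python
import json
import re
import zlib

# Assumption: NUL never appears in log text, so it can mark a variable slot.
PLACEHOLDER = "\x00"
# Match integers without leading zeros so that str(int(x)) restores them exactly;
# tokens such as "007" are left inside the template untouched.
NUM = re.compile(r"(?<![0-9])(?:0|[1-9][0-9]*)(?![0-9])")


def split_line(line):
    """Split a log line into (template, list of numeric variables)."""
    values = [int(v) for v in NUM.findall(line)]
    template = NUM.sub(PLACEHOLDER, line)
    return template, values


def compress(lines):
    """Store each template once; store numbers as deltas per template."""
    templates = {}   # template text -> template id
    ids = []         # template id of every line, in input order
    deltas = []      # per line: difference to the previous line of that template
    last = {}        # template id -> numeric values of its previous line
    for line in lines:
        template, values = split_line(line)
        tid = templates.setdefault(template, len(templates))
        prev = last.get(tid, [0] * len(values))
        deltas.append([v - p for v, p in zip(values, prev)])
        last[tid] = values
        ids.append(tid)
    payload = {"templates": list(templates), "ids": ids, "deltas": deltas}
    return zlib.compress(json.dumps(payload).encode("utf-8"), 9)


def decompress(blob):
    """Invert compress(): rebuild the numbers, then fill the template slots."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    last = {}
    lines = []
    for tid, delta in zip(payload["ids"], payload["deltas"]):
        prev = last.get(tid, [0] * len(delta))
        values = [p + d for p, d in zip(prev, delta)]
        last[tid] = values
        parts = payload["templates"][tid].split(PLACEHOLDER)
        line = parts[0] + "".join(str(v) + part for v, part in zip(values, parts[1:]))
        lines.append(line)
    return lines
```

Calling `decompress(compress(lines))` on a list of log lines should return the original list. Delta encoding is chosen for the sketch because counters and timestamps in consecutive lines of the same template tend to differ by small amounts, which a general-purpose compressor handles well; the paper's actual encoding scheme may differ.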

CLC Number: