Netinfo Security, 2022, Vol. 22, Issue (6): 9-25. doi: 10.3969/j.issn.1671-1122.2022.06.002

• Technical Research •

机器学习在x86二进制反汇编中的应用研究综述

王鹃1,2, 王蕴茹1,2, 翁斌1,2, 龚家新1,2

  1. 武汉大学国家网络安全学院,武汉 430072
  2. 武汉大学空天信息安全与可信计算教育部重点实验室,武汉 430072
  • 收稿日期:2022-01-13 出版日期:2022-06-10 发布日期:2022-06-30
  • 通讯作者: 王鹃 E-mail:jwang@whu.edu.cn
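  • About the authors: WANG Juan (born 1976), female, from Hubei, professor, Ph.D.; main research interests: software security, trusted computing, applications of artificial intelligence, cloud computing, and Internet of Things security | WANG Yunru (born 1997), female, from Shandong, master's student; main research interests: artificial intelligence and software security | WENG Bin (born 2000), male, from Fujian, undergraduate; main research interests: artificial intelligence and software security | GONG Jiaxin (born 1999), male, from Anhui, master's student; main research interests: software security and vulnerability discovery
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (520940210009)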

Survey on Application of Machine Learning in Disassembly on x86 Binaries

WANG Juan1,2, WANG Yunru1,2, WENG Bin1,2, GONG Jiaxin1,2

  1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
  2. Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, Wuhan University, Wuhan 430072, China
  • Received: 2022-01-13 Online: 2022-06-10 Published: 2022-06-30
  • Contact: WANG Juan E-mail: jwang@whu.edu.cn

摘要:

二进制反汇编技术是二进制漏洞检测、控制流完整性和代码相似度检测的核心。传统反汇编技术高度依赖于预先定义的启发式规则和专家知识,在函数识别、变量类型识别、控制流生成等任务中应用效果不够好。机器学习在序列和图数据结构处理上的发展为二进制分析注入了新活力,弥补了传统二进制逆向技术的缺陷,推动了二进制分析研究工作。文章从机器学习在x86二进制反汇编中的应用入手,对函数识别、函数指纹复原、数据流生成等任务的相关工作进行调研分析,首先总结反汇编的传统技术及难点;然后提炼在x86二进制反汇编中应用机器学习的一般工作模式,包括二进制特征提取、特征向量化、模型训练及评估,并依据特征包含的信息和嵌入方式分别对特征提取和向量化过程的方法进行分类,同时依据具体工作总结机器学习模型训练中的重要技术;最后基于研究现状总结已有工作的局限性和面临的挑战,阐述未来可能的研究方向。

关键词: 反汇编, 机器学习, 软件安全, 逆向工程

Abstract:

Binary disassembly is at the core of vulnerability detection, control-flow integrity, and code similarity measurement. Traditional disassembly techniques rely heavily on predefined heuristics and expert knowledge, and they do not perform well enough on tasks such as function identification, variable type recovery, and control-flow graph reconstruction. Advances in machine learning for sequential and graph-structured data have allowed it to be applied to binary analysis, compensating for the shortcomings of traditional disassembly techniques and advancing research on binary reverse analysis. This paper focuses on the application of machine learning to the disassembly of x86 binaries and surveys work on function identification, function signature recovery, and data-flow reconstruction. First, it summarizes the traditional disassembly methods and their difficulties. Second, it distills the general workflow for applying machine learning to x86 binary disassembly, which comprises binary feature extraction, feature vectorization, and model training and evaluation. It classifies feature extraction methods by the information the features contain and vectorization methods by their embedding approach, and it summarizes the key model-training techniques used in specific disassembly tasks. Finally, based on the current state of research, it summarizes the limitations of existing work and the challenges it faces, and outlines possible directions for future research.

Key words: disassembly, machine learning, software security, reverse engineering

CLC number: