Netinfo Security ›› 2022, Vol. 22 ›› Issue (6): 9-25. doi: 10.3969/j.issn.1671-1122.2022.06.002

Survey on Application of Machine Learning in Disassembly on x86 Binaries

WANG Juan1,2, WANG Yunru1,2, WENG Bin1,2, GONG Jiaxin1,2

  1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
  2. Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, Wuhan University, Wuhan 430072, China
  • Received: 2022-01-13 Online: 2022-06-10 Published: 2022-06-30
  • Contact: WANG Juan E-mail: jwang@whu.edu.cn

Abstract:

Binary disassembly is central to vulnerability discovery, control flow integrity, and code similarity measurement. Traditional disassembly techniques rely heavily on predefined heuristics and expert knowledge, and they fall short on tasks such as identifying function boundaries, recovering variable types, and reconstructing control flow graphs. Advances in machine learning for sequential and graph-structured data have made it possible to apply machine learning to binary analysis and to compensate for the shortcomings of traditional disassembly techniques, thereby advancing research on binary reverse analysis. This paper focuses on the application of machine learning to the disassembly of x86 binaries and analyzes in depth the research on function identification, function signature recovery, and data flow reconstruction. First, the traditional methods for disassembling x86 binaries and the challenges they face are summarized. Second, the general workflow of machine learning in x86 binary disassembly is distilled into three stages: binary feature extraction, vectorization, and model training. Feature extraction methods are classified by feature content and vectorization methods by embedding approach, and the key model training techniques used in specific disassembly tasks are then summarized. Finally, the limitations and challenges of current work are summarized, and future research directions are discussed.

Key words: disassembly, machine learning, software security, reverse engineering
