Netinfo Security, 2021, Vol. 21, Issue (7): 72-79. doi: 10.3969/j.issn.1671-1122.2021.07.009

• Technical Research •

Anti-noise Application Layer Binary Protocol Format Reverse Method

FANG Minzhi1,2, CHENG Guang1,2, KONG Panyu1,2

  1. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
  2. International Governance Research Base of Cyberspace, Southeast University, Nanjing 211189, China
  • Received: 2021-02-04 Online: 2021-07-10 Published: 2021-07-23
  • Contact: FANG Minzhi E-mail: mzfang@njnet.edu.cn
  • About the authors: FANG Minzhi (born 1996), male, from Jiangsu, master's student, main research interest: protocol reverse analysis | CHENG Guang (born 1972), male, from Anhui, professor, Ph.D., main research interest: encrypted traffic | KONG Panyu (born 1996), male, from Chongqing, master's student, main research interest: encrypted traffic analysis
  • Supported by:
    National Key Research and Development Program of China (2018YFB1800602)

Abstract:

Existing traffic-based methods for reverse engineering binary protocol formats infer the format by comparing multiple messages of the same type, but noise messages in the message set lower the accuracy of format identification. This paper proposes a method that automatically removes noise and infers the protocol format. The method first mines the frequent items at each position of the message sequences to identify the special identifications (FDs) in the message set, and then removes noise messages according to the sum of the FD frequencies at each position. Next, it performs recursive denoising and message segmentation based on the FDs in the message header. Within each message set produced by the segmentation, it runs k-means clustering, using the silhouette coefficient to determine the number of clusters k automatically, which yields message subsets that each contain a single protocol format. Finally, it applies a progressive multiple sequence alignment algorithm to each subset to obtain the protocol format. Experimental results show that the method effectively removes the noise messages mixed into real-world traffic and extracts the keywords of the protocol format, from which the protocol format is inferred.

Keywords: binary protocol reverse engineering, special identification, recursive clustering, sequence alignment, frequent item mining
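
The first two steps of the abstract (position-wise frequent-item mining and noise removal by summed FD frequency) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_support` and `keep_ratio` thresholds, the `max_offset` header window, and the exact scoring rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def mine_positional_frequent_items(messages, min_support=0.5, max_offset=16):
    """Map each byte offset to the byte values that are frequent there.

    A byte value counts as frequent at an offset when the fraction of all
    messages carrying it at that offset is at least min_support.
    min_support and max_offset are hypothetical parameters.
    """
    frequent = {}
    for offset in range(max_offset):
        column = [m[offset] for m in messages if len(m) > offset]
        if not column:
            break  # no message is long enough to reach this offset
        counts = Counter(column)
        hits = {
            value: count / len(messages)
            for value, count in counts.items()
            if count / len(messages) >= min_support
        }
        if hits:
            frequent[offset] = hits
    return frequent


def fd_score(message, frequent):
    """Sum the frequencies of the frequent items that the message matches."""
    return sum(
        freqs.get(message[offset], 0.0)
        for offset, freqs in frequent.items()
        if offset < len(message)
    )


def remove_noise(messages, min_support=0.5, keep_ratio=0.5, max_offset=16):
    """Drop messages whose score is below keep_ratio times the best possible score.

    keep_ratio is an assumed cut-off, not a value taken from the paper.
    """
    frequent = mine_positional_frequent_items(messages, min_support, max_offset)
    if not frequent:
        return list(messages), frequent
    best = sum(max(freqs.values()) for freqs in frequent.values())
    kept = [m for m in messages if fd_score(m, frequent) >= keep_ratio * best]
    return kept, frequent
```

Called on a list of `bytes` payloads, `remove_noise` returns the retained messages together with the mined per-offset frequent items. Messages that share the dominant header bytes score high and are kept, while unrelated payloads score near zero and are dropped. The later steps in the abstract (recursive segmentation, k-means with the silhouette coefficient, progressive multiple sequence alignment) are not covered by this sketch.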

CLC Number: