Identification Method of Malicious Software Hidden Function Based on Siamese Architecture

doi:10.3969/j.issn.1671-1122.2023.05.007

Abstract:

Hiding techniques are now widely used in malware to evade detection by anti-virus engines and reverse analysis by researchers, so effectively identifying hidden functions in malware is important for malware code detection and in-depth analysis. Existing methods in this area, however, suffer to varying degrees from problems such as limited accuracy and poor robustness on datasets with few samples or an imbalanced class distribution. To provide a practical detection method for hidden functions in malware, this paper proposes a novel identification method based on the Siamese architecture that detects the type of a hidden function. The method effectively improves the accuracy of hidden function identification, and the Siamese architecture mitigates the poor robustness on small datasets. Experiments on a dataset of 15 common types of hidden functions extracted from malware show that the embedding vectors generated by the method are of higher quality than those of the embedding neural network SAFE, and that the method achieves higher detection accuracy than several commonly used hidden function detection tools.

Keywords: binary analysis, hidden function detection, neural network, instruction embedding
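
The abstract does not spell out the training objective. As a rough sketch, assuming the Siamese network is trained with a contrastive loss over pairs of function embeddings (a common choice for Siamese architectures, not confirmed by the text above), the per-pair loss could look like the following. The function names, the margin value, and the toy embeddings are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same_type, margin=1.0):
    """Contrastive loss for one pair of function embeddings.

    same_type: True if both functions implement the same hidden algorithm.
    margin:    assumed value; the source does not give one.
    """
    d = euclidean(u, v)
    if same_type:
        # Similar pair: penalize any distance between the embeddings.
        return 0.5 * d ** 2
    # Dissimilar pair: penalize only if the embeddings are closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2

# Toy 2-D embeddings, made up for illustration.
md5_a, md5_b, crc32_a = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
loss_same = contrastive_loss(md5_a, md5_b, same_type=True)    # distance 0.5
loss_far = contrastive_loss(md5_a, crc32_a, same_type=False)  # distance 5.0, beyond the margin
loss_near = contrastive_loss(md5_a, md5_b, same_type=False)   # distance 0.5, inside the margin
```

With a loss of this shape, same-type pairs are pulled together while different-type pairs only contribute until they are at least `margin` apart.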

CLC number:

TP309

CHEN Zitong, JIA Peng, LIU Jiayong. Identification Method of Malicious Software Hidden Function Based on Siamese Architecture[J]. Netinfo Security, 2023, 23(5): 62-75.



Hidden algorithm type	Number of functions
ADLER32	391
aPLib	18
Big-number	156
SHA-1	43
SHA-256	26
BASE64	224
DES[char]	132
CRC32	451
CRC32[poly]	39
MD5	450
BLOWFISH	23
ZLIB[long]	126
ZLIB[word]	151
RC5/RC6	63
HAVAL(5 pass)	136
Total	2429
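
The counts above are strongly imbalanced. As an illustrative sketch, assuming a Siamese setup trains on pairs of functions rather than on single functions (the source does not describe how pairs are sampled), the number of same-type pairs available per class can be computed directly from the table. The variable names are hypothetical.

```python
from math import comb

# Function counts per hidden algorithm type, copied from the table above.
counts = {
    "ADLER32": 391, "aPLib": 18, "Big-number": 156, "SHA-1": 43,
    "SHA-256": 26, "BASE64": 224, "DES[char]": 132, "CRC32": 451,
    "CRC32[poly]": 39, "MD5": 450, "BLOWFISH": 23, "ZLIB[long]": 126,
    "ZLIB[word]": 151, "RC5/RC6": 63, "HAVAL(5 pass)": 136,
}

total = sum(counts.values())

# Ratio between the largest and the smallest class.
imbalance = max(counts.values()) / min(counts.values())

# Number of distinct same-type (positive) pairs each class can provide: n choose 2.
positive_pairs = {name: comb(n, 2) for name, n in counts.items()}

print(total, round(imbalance, 1), positive_pairs["aPLib"])
```

Under this assumption, even the smallest class (18 functions) yields 153 distinct positive pairs, which gives some intuition for why pair-based training can be less sensitive to small classes.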

Hidden algorithm type	Number of functions
ADLER32	27
aPLib	2
Big-number	11
SHA-1	3
SHA-256	2
BASE64	15
DES[char]	9
CRC32	32
CRC32[poly]	3
MD5	30
BLOWFISH	2
ZLIB[long]	8
ZLIB[word]	10
RC5/RC6	4
HAVAL(5 pass)	10
Other functions	32
Total	200

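Hyperparameter	Value
Embedding dimension	100
Spatial dropout rate	20%
Number of convolution kernels	128
Kernel sizes of the 4 convolutional layers	5, 6, 7, 8
k of the K-Max pooling layer	3
Dropout rate	60%
Epochs	100
Mini-batch size	32
Learning rate	0.1%
Length threshold	1100
Optimizer	Adam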
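
The table lists k = 3 for the K-Max pooling layer. As a minimal sketch of what k-max pooling does on a single feature map (keep the k largest activations and preserve their original order), assuming a plain Python list stands in for the feature map. The function name and sample values are hypothetical, and this is not the authors' implementation.

```python
def k_max_pooling(sequence, k=3):
    """Return the k largest values of `sequence`, kept in their original order."""
    if k >= len(sequence):
        return list(sequence)
    # Indices of the k largest activations; the stable sort breaks ties
    # in favor of the earlier position.
    top = sorted(range(len(sequence)), key=lambda i: sequence[i], reverse=True)[:k]
    # Re-sort the selected indices by position so relative order is preserved.
    return [sequence[i] for i in sorted(top)]

# Made-up activations of one convolutional feature map.
feature_map = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
pooled = k_max_pooling(feature_map, k=3)
print(pooled)  # [0.9, 0.7, 0.8]
```

Unlike ordinary max pooling, which keeps a single value, this keeps several strong activations together with their relative positions in the sequence.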

Neural network model	Accuracy	Precision	Recall	F1
BiLSTM	58.34%	54.81%	56.93%	55.85%
CNN_LSTM	61.04%	60.49%	61.04%	60.76%
AvRNN	93.91%	93.08%	93.40%	93.24%
DropoutAvRNN	92.38%	90.76%	89.93%	90.34%
textCNN	91.94%	90.84%	90.35%	90.59%
AvCNN	95.26%	93.83%	92.67%	93.20%
K-Max-CNN	96.51%	95.72%	94.76%	95.24%
K-Max-DCNN-Attention	98.45%	97.95%	97.03%	97.49%
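
For reference, F1 is conventionally the harmonic mean of precision and recall. The short sketch below recomputes it for two rows of the table, assuming the reported precision and recall are the averaged values that F1 was derived from (the source does not state the averaging scheme). The helper name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from two rows of the table above.
rows = {
    "BiLSTM": (54.81, 56.93, 55.85),
    "K-Max-DCNN-Attention": (97.95, 97.03, 97.49),
}

# Recompute F1 from precision and recall, rounded to two decimals like the table.
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(name, recomputed[name], reported)
```

For these two rows the recomputed values agree with the reported F1 to two decimal places.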

Algorithm type (rows) / Detection tool (columns)	Proposed method	Findcrypt	IDAscope	HCD	Crypto Searcher	DRACA
ADLER32	○	●	○	○	●	●
aPLib	○	●	●	●	●	●
BASE64	○	○	●	○	○	●
BLOWFISH	○	○	○	○	○	○
CRC32	○	○	○	○	○	○
CRC32[poly]	○	○	○	○	●	●
DES[char]	○	●	○	○	○	○
HAVAL(5 pass)	○	●	○	○	●	●
MD5	○	○	○	○	○	○
RC5/RC6	○	●	○	○	○	○
SHA-256	○	●	●	○	○	●
SHA-1	○	○	●	○	●	○
ZLIB[long]	○	●	○	○	●	●
ZLIB[word]	○	●	○	○	●	●
Big-number	○	○	●	●	●	●