Analyzing Malware Behavior and Capability Related Text Based on Feature Extraction

doi:10.3969/j.issn.1671-1122.2019.12.009

Abstract

To counter the threat that malware poses to cyberspace security, security vendors have published a large number of malware reports. These reports contain a wealth of security-related information, such as the capabilities of a malware sample and the specific actions it takes. Analyzing the reports to obtain this information helps researchers understand malware functionality fully and mount an effective defense. Automatically extracting text related to malware capabilities and behaviors from the reports is hampered by three problems: the sheer number of reports, their loose text structure, and polysemy. This paper therefore proposes obtaining feature vectors from the pre-trained BERT model to disambiguate polysemous words, then extracting further features with a BiLSTM and an attention mechanism to train a classifier. In experiments on the MalwareTextDB dataset, the model achieves a recall of 85.56% and an F1 score of 66.67%. Compared with the other models evaluated, it extracts text related to malware behavior and capabilities from malware reports more efficiently.

Key words: malware, text classification, BERT, BiLSTM, attention mechanism

CLC number: TP309

Xuruirui FENG, Jiayong LIU, Pengsen CHENG. Analyzing Malware Behavior and Capability Related Text Based on Feature Extraction[J]. Netinfo Security, 2019, 19(12): 72-78.
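
The abstract describes the pipeline only at a high level: contextual BERT vectors feed a BiLSTM, and an attention mechanism condenses the resulting hidden states before classification. As a minimal sketch of that condensing step, the Python below applies dot-product attention pooling to a short list of toy hidden-state vectors. The function names `softmax` and `attention_pool`, the context vector, and all numeric values are hypothetical illustrations; the source does not state which form of attention the authors used.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    context: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Collapse per-token hidden states into one sentence vector.

    Each token is scored by its dot product with a context vector
    (an assumed, normally learned, parameter); the softmax of those
    scores weights the sum of the hidden states.
    """
    scores = [sum(h * u for h, u in zip(state, context)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(context)
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights


# Toy stand-ins for BiLSTM outputs: three tokens, four dimensions each.
states = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.6],
    [0.0, 0.1, 0.0, 0.2],
]
context = [1.0, 1.0, 1.0, 1.0]  # hypothetical context vector

sentence_vector, weights = attention_pool(states, context)
print("attention weights:", [round(w, 3) for w in weights])
print("sentence vector:", [round(v, 3) for v in sentence_vector])
```

In a trained model the context vector would be learned jointly with the BiLSTM, and the pooled sentence vector would be passed to the classifier layer that decides whether a sentence is malware-related.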

Dataset	Total sentences	Total words	Malware-related sentences	Malware-related words
Training set	9,435	231,180	2,204	12,165
Validation set	1,213	32,029	79	459
Test set	618	13,080	90	453

Sentence	Dataset	Label
All three samples provided remote access to the attacker, via two Command and Control (C2) Servers .	Training set	Malware-related
The samples were clearly malicious and varied in sophistication .	Training set	Not malware-related
To provide access to the server of interest the attackers may appropriately modify rules for firewalls Microsoft TMG, CISCO, etc .	Validation set	Malware-related
Here is a table with the minimal information about 46 different samples .	Validation set	Not malware-related
The "Cohhoc" malware uses an obfuscation layer, to disguise the malware and to complicate the analysis .	Test set	Malware-related
For example, this code can perform any of the following actions.	Test set	Not malware-related

Hyperparameter	Value
BiLSTM hidden units	128
Dropout rate	0.3
Learning rate	5e-5
Batch size	16

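	Actually malware-related	Actually not malware-related
Predicted malware-related	True positive (TP)	False positive (FP)
Predicted not malware-related	False negative (FN)	True negative (TN)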

Model	TP	TN	FP	FN	Precision/%	Recall/%	F1/%	Accuracy/%
Random word embedding	55	454	74	35	42.64	61.11	50.23	82.36
GloVe	63	433	95	27	39.87	70.00	50.81	80.26
Proposed model	77	464	64	13	54.61	85.56	66.67	87.54
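
The percentages in the results table follow directly from the four confusion-matrix counts. As a quick check, the Python sketch below recomputes precision, recall, F1, and accuracy from the TP, TN, FP, and FN values listed for each model, using the standard definitions. The class name `Confusion`, the helper `as_percentages`, and the dictionary keys are illustrative choices and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Confusion:
    tp: int  # predicted malware-related, actually malware-related
    tn: int  # predicted not related, actually not related
    fp: int  # predicted malware-related, actually not related
    fn: int  # predicted not related, actually malware-related

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in count form.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total


# Counts copied from the results table (test set).
RESULTS = {
    "Random word embedding": Confusion(tp=55, tn=454, fp=74, fn=35),
    "GloVe": Confusion(tp=63, tn=433, fp=95, fn=27),
    "Proposed model": Confusion(tp=77, tn=464, fp=64, fn=13),
}


def as_percentages(c: Confusion) -> Tuple[float, float, float, float]:
    """Return (precision, recall, F1, accuracy) in percent, two decimals."""
    return tuple(round(100 * v, 2) for v in (c.precision, c.recall, c.f1, c.accuracy))


for name, c in RESULTS.items():
    p, r, f, a = as_percentages(c)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f} Acc={a:.2f} (n={c.total})")
```

Every row sums to 618 sentences, with TP + FN = 90, which matches the test-set size and the number of malware-related test sentences in the dataset statistics. The recomputed values agree with the published percentages to two decimal places.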