Netinfo Security, 2025, Vol. 25, Issue (10): 1579-1588. doi: 10.3969/j.issn.1671-1122.2025.10.009
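Received: 2025-05-25
Online: 2025-10-10
Published: 2025-11-07
Corresponding author: SUN Yi, E-mail: 11112072@bjtu.edu.cn
About the authors: WANG Youhe (born 1998), male, from Henan, master's student; research interests: network and information security, malicious-content detection. SUN Yi (born 1979), female, from Henan, professor, Ph.D.; research interests: network and information security, secure data exchange.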
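Abstract:
Existing methods for detecting malicious PDF documents overlook the semantic relationships among features and are limited to analyzing a single type of feature. To address these problems, this paper proposes a detection scheme that applies a CNN-BiLSTM-CBAM model and multi-feature fusion to malicious PDF document detection. The method fuses the general information and structural information extracted by static analysis with the API sequence information captured by dynamic analysis, building a comprehensive, multi-dimensional feature set. First, the model uses a convolutional neural network (CNN) to extract local features from the feature set. Then, a bidirectional long short-term memory (BiLSTM) network captures the dependencies and contextual semantic relationships among features, and a convolutional block attention module (CBAM) assigns different weights to different features to select the most discriminative ones. Finally, a Softmax classifier computes the detection result. Experimental results show that the model outperforms existing methods on key metrics, reaching 99.847% accuracy, 99.639% recall, and a 99.768% F1 score (Table 2), which effectively improves the detection of malicious PDF documents.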
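The pipeline described in the abstract (CNN, then BiLSTM, then CBAM, then Softmax) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `CBAM1d` and `CnnBiLstmCbam`, the input shape `(batch, length, in_dim)`, all layer sizes, the 1-D adaptation of CBAM, and the mean pooling before the classifier are hypothetical choices that the abstract does not specify.

```python
import torch
import torch.nn as nn


class CBAM1d(nn.Module):
    """CBAM adapted to 1-D feature maps of shape (batch, channels, length).

    Assumption: the paper's exact CBAM configuration is not given here.
    """

    def __init__(self, channels: int, reduction: int = 8, kernel_size: int = 7):
        super().__init__()
        # Shared MLP applied to the average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Convolution over the stacked [mean, max] maps for position-wise attention.
        self.spatial = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: one weight per channel.
        channel_weights = torch.sigmoid(self.mlp(x.mean(dim=2)) + self.mlp(x.amax(dim=2)))
        x = x * channel_weights.unsqueeze(2)
        # Spatial attention: one weight per sequence position.
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.spatial(stats))


class CnnBiLstmCbam(nn.Module):
    """Hypothetical layer sizes; only the ordering of stages follows the abstract."""

    def __init__(self, in_dim: int = 16, conv_channels: int = 32,
                 hidden: int = 32, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(conv_channels, hidden, batch_first=True, bidirectional=True)
        self.cbam = CBAM1d(2 * hidden)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, in_dim), the fused feature sequence.
        h = self.conv(x.transpose(1, 2))        # local features: (B, C, L)
        h, _ = self.bilstm(h.transpose(1, 2))   # context features: (B, L, 2H)
        h = self.cbam(h.transpose(1, 2))        # reweighted features: (B, 2H, L)
        logits = self.fc(h.mean(dim=2))         # pooled over length: (B, n_classes)
        return torch.softmax(logits, dim=1)     # class probabilities


model = CnnBiLstmCbam()
probs = model(torch.randn(4, 20, 16))  # 4 random samples, 20 steps, 16 features each
```

Each row of `probs` holds the benign and malicious class probabilities. For training with `nn.CrossEntropyLoss`, one would return the logits instead, since that loss applies the softmax internally.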
Cite this article: WANG Youhe, SUN Yi. Multi-Feature Fusion for Malicious PDF Document Detection Based on CNN-BiLSTM-CBAM[J]. Netinfo Security, 2025, 25(10): 1579-1588.
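Table 1. Security relevance of the general features

| Feature | Security relevance |
|---|---|
| pdfsize | Because page size and content differ, malicious documents are generally smaller |
| metadata size | Metadata describes the PDF file and can be used to hide embedded content |
| contains text | Presenting text is not the purpose of a malicious PDF file, so such files may contain little text |
| header | Obfuscating the PDF header is a common way to evade antivirus scanning; malicious files often modify the header format |
| obj, endobj | Counts of object start and end keywords, indicating the number of objects in the document; the two counts are equal in a benign document |
| stream, endstream | Counts of binary data sequences (streams) in the PDF; the two counts are equal in a benign document |
| encrypt | Malicious documents tend to use encryption to conceal malicious behavior |
| /ObjStm | Defines an object stream, which can hide other objects (text, scripts, web links, images, etc.) |
| /JS, /JavaScript | Malicious documents often embed JS code and use it to perform malicious operations such as heap spraying or exploiting parsing vulnerabilities |
| /JBIG2Decode | Indicates that the PDF uses JBIG2 compression, a filter commonly used to encode malicious content |
| /launch | Executes an action or program; the number of executed actions is associated with the OpenAction field |
| /EmbeddedFile | Malicious documents often use embedded malicious files (e.g., Word documents, images) to carry out malicious behavior |
| /XFA | The XML Forms Architecture contained in some PDF files, which may expose scripting techniques that attackers can exploit |
| /AA, /Acroform, /OpenAction | Most malicious PDF files with embedded JavaScript execute it automatically; AcroForm denotes forms created with Adobe Acrobat, which attackers may use to craft malicious scripts |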
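Several of the features in Table 1 are plain keyword counts, which can be illustrated with a short Python sketch. It is not the authors' extractor: the function names, the `KEYWORDS` selection, and the derived `header_ok`, `obj_balanced`, and `stream_balanced` flags are hypothetical, and the sketch scans raw bytes only, so it does not decode compressed streams or hex-escaped names (for example `/J#53`), which real malicious files may use.

```python
import re

# Keywords taken from Table 1; this subset and the dict keys are illustrative.
KEYWORDS = [
    b"obj", b"endobj", b"stream", b"endstream",
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
    b"/ObjStm", b"/EmbeddedFile", b"/XFA", b"/JBIG2Decode",
]


def keyword_pattern(keyword: bytes) -> "re.Pattern[bytes]":
    """Build a pattern that matches the keyword as a whole token."""
    body = re.escape(keyword)
    if keyword.startswith(b"/"):
        # PDF names may directly follow another token (e.g. /S/JavaScript),
        # so only the right-hand side is guarded.
        return re.compile(body + rb"(?![A-Za-z0-9])")
    # Guard both sides so that "obj" does not also count "endobj".
    return re.compile(rb"(?<![A-Za-z])" + body + rb"(?![A-Za-z])")


def extract_general_features(data: bytes) -> dict:
    """Count Table 1 keywords in the raw bytes of a PDF file."""
    features = {"pdfsize": len(data)}
    for keyword in KEYWORDS:
        features[keyword.decode()] = len(keyword_pattern(keyword).findall(data))
    features["header_ok"] = data.startswith(b"%PDF-")
    features["obj_balanced"] = features["obj"] == features["endobj"]
    features["stream_balanced"] = features["stream"] == features["endstream"]
    return features


# A tiny hand-made sample whose third object is never closed.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\n"
)
features = extract_general_features(sample)
```

In this sample `features["obj"]` and `features["endobj"]` differ, so `obj_balanced` is false, which matches the table's note that the two counts are equal in a benign document.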
Table 2. Comparison of experimental results across models

| Model | Precision | Recall | F1-Score | Accuracy | FPR | Training time (s) |
|---|---|---|---|---|---|---|
| RF | 97.792% | 95.944% | 96.859% | 96.889% | 2.167% | 66.175 |
| SVM | 94.737% | 83.478% | 88.752% | 89.420% | 4.638% | 73.394 |
| AdaBoost | 97.428% | 88.389% | 92.689% | 93.028% | 2.333% | 89.728 |
| CNN | 98.861% | 96.444% | 97.638% | 97.667% | 2.278% | 112.327 |
| TextCNN | 99.042% | 99.087% | 99.064% | 99.122% | 1.049% | 125.331 |
| Proposed model | 99.898% | 99.639% | 99.768% | 99.847% | 0.271% | 133.172 |
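The metrics in Table 2 follow the standard binary-classification definitions, sketched below in Python. The function names and the confusion-matrix counts in `example` are hypothetical; the paper's actual counts are not given on this page. The `TABLE2` values are copied from the table, and the final loop recomputes each F1 score from the reported precision and recall as a consistency check.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (malicious = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fpr": fp / (fp + tn),  # share of benign files flagged as malicious
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=90, fp=5, tn=95, fn=10)

# (precision %, recall %, reported F1 %) copied from Table 2.
TABLE2 = {
    "RF": (97.792, 95.944, 96.859),
    "SVM": (94.737, 83.478, 88.752),
    "AdaBoost": (97.428, 88.389, 92.689),
    "CNN": (98.861, 96.444, 97.638),
    "TextCNN": (99.042, 99.087, 99.064),
    "Proposed model": (99.898, 99.639, 99.768),
}

# Difference between the recomputed and the reported F1 for each model.
f1_gaps = {
    name: abs(f1_score(p, r) - reported)
    for name, (p, r, reported) in TABLE2.items()
}
```

Every entry in `f1_gaps` stays within rounding error, so the reported F1 scores are consistent with the reported precision and recall.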
Table 3. Ablation study

| Model | Precision | Recall | F1-Score | Accuracy | FPR | Training time (s) |
|---|---|---|---|---|---|---|
| CNN | 98.861% | 96.444% | 97.638% | 97.667% | 2.278% | 112.327 |
| CNN-CBAM | 99.432% | 99.134% | 99.283% | 99.185% | 0.975% | 113.244 |
| BiLSTM | 98.729% | 99.333% | 99.030% | 99.083% | 1.167% | 152.692 |
| BiLSTM-CBAM | 99.592% | 99.491% | 99.541% | 99.333% | 1.085% | 154.733 |
| CNN-BiLSTM | 99.527% | 99.596% | 99.561% | 99.741% | 0.778% | 165.038 |
| CNN-BiLSTM-Attention | 99.796% | 99.694% | 99.745% | 99.629% | 0.543% | 148.317 |
| CNN-BiLSTM-CBAM | 99.898% | 99.639% | 99.768% | 99.847% | 0.271% | 133.172 |