An Automated Penetration Testing System Based on Multi-Agent Architecture

doi:10.3969/j.issn.1671-1122.2026.04.012

Abstract

In recent years, cyberattacks have become increasingly organized and automated. Supported by artificial intelligence technologies, particularly large language models, attackers can rapidly write and adapt malicious code and use botnets to build automated, distributed reconnaissance and attack workflows against specific targets. This poses severe threats to cybersecurity defenses. To address these challenges, this paper proposes and designs an automated penetration testing system based on a multi-agent architecture. The system decomposes traditional penetration testing tasks into atomic sub-tasks, which multiple agents then complete collaboratively. Experimental results show that the system significantly outperforms traditional vulnerability scanning tools on multiple testing metrics: it comprehensively identifies various types of security vulnerabilities in the target information system and provides highly credible evidence chains for vulnerability disclosure. Furthermore, the system generates executable remediation recommendations, automating and engineering the penetration testing process and offering organizations an advanced, efficient, and stable solution for regular network security vulnerability management.

Key words: penetration testing system, multi-agent architecture, autonomous mission planning, system and network security

CLC Number: TP309

DONG Yingjuan, LYU Ping, LIU Bing. An Automated Penetration Testing System Based on Multi-Agent Architecture[J]. Netinfo Security, 2026, 26(4): 654-664.

Figures/Tables 7

References 35

[1]	BISHOP M. About Penetration Testing[J]. IEEE Security & Privacy Magazine, 2007, 5(6): 84-87.
[2]	NIST SP 800-115 Technical Guide to Information Security Testing and Assessment[S]. Gaithersburg: National Institute of Standards and Technology, 2008.
[3]	ANTUNES N, VIEIRA M. Benchmarking Vulnerability Detection Tools for Web Services[C]// IEEE. 2010 IEEE International Conference on Web Services. New York: IEEE, 2010: 203-210.
[4]	XIONG Pulei, PEYTON L. A Model-Driven Penetration Test Framework for Web Applications[C]// IEEE. 2010 Eighth International Conference on Privacy, Security and Trust. New York: IEEE, 2010: 173-180.
[5]	ROY S S, THOTA P, NARAGAM K V, et al. From Chatbots to Phishbots?: Phishing Scam Generation in Commercial Large Language Models[C]// IEEE. 2024 IEEE Symposium on Security and Privacy (SP). New York: IEEE, 2024: 36-54.
[6]	BECKERICH M, PLEIN L, CORONADO S. RatGPT: Turning Online LLMs into Proxies for Malware Attacks[EB/OL].(2023-09-07)[2025-09-02]. https://arxiv.org/abs/2308.09183.
[7]	MOHAMED F M F, ELBREIKI W, ABDULLAHI I, et al. WormGPT: A Large Language Model Chatbot for Criminals[C]// IEEE. 2023 24th International Arab Conference on Information Technology (ACIT). New York: IEEE, 2023: 1-6.
[8]	HOU Xinyi, ZHAO Yanjie, LIU Yue, et al. Large Language Models for Software Engineering: A Systematic Literature Review[J]. ACM Transactions on Software Engineering and Methodology, 2024, 33(8): 1-79.
[9]	YANG Zhou, SUN Zhensu, YUE T Z, et al. Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code[EB/OL].(2024-03-12)[2025-09-02]. https://arxiv.org/abs/2403.07506.
[10]	OH S, LEE K, PARK S, et al. Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models[C]// IEEE. 2024 IEEE Symposium on Security and Privacy (SP). New York: IEEE, 2024: 1141-1159.
[11]	SCHUSTER R, SONG Congzheng, TROMER E, et al. You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion[C]// USENIX. The 30th USENIX Security Symposium. Berkeley: USENIX Association, 2021: 1559-1575.
[12]	NGUYEN P T, DI S C, DI R J, et al. Adversarial Attacks to API Recommender Systems: Time to Wake up and Smell the Coffee?[C]// IEEE. 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). New York: IEEE, 2021: 253-265.
[13]	QI Shiyi, YANG Yuanhang, GAO S, et al. BadCS: A Backdoor Attack Framework for Code Search[EB/OL].(2023-05-09)[2025-09-02]. https://arxiv.org/abs/2305.05503.
[14]	SUN Weisong, CHEN Yuchen, TAO Guanhong, et al. Backdooring Neural Code Search[EB/OL].(2023-06-12)[2025-09-02]. https://arxiv.org/abs/2305.17506.
[15]	WAN Yao, ZHANG Shijie, ZHANG Hongyu, et al. You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search[C]// ACM. The 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2022: 1233-1245.
[16]	BODDY M, GOHDE J, HAIGH T, et al. Course of Action Generation for Cyber Security Using Classical Planning[C]// ICAPS. International Conference on Automated Planning and Scheduling. Palo Alto: AAAI, 2005: 16-21.
[17]	OBES J L, SARRAUTE C, RICHARTE G. Attack Planning in the Real World[EB/OL].(2013-06-19)[2025-09-02]. https://arxiv.org/abs/1306.4044.
[18]	ROBERTS M, HOWE A, RAY I, et al. Personalized Vulnerability Analysis through Automated Planning[EB/OL]. [2025-09-02]. https://www.researchgate.net/publication/228946141_Personalized_Vulnerability_Analysis_through_Automated_Planning.
[19]	DENG Gelei, LIU Yi, MAYORAL-VILCHES V, et al. PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing[C]// USENIX. The 33rd USENIX Security Symposium. Berkeley: USENIX Association, 2024: 847-864.
[20]	HAPPE A, KAPLAN A, CITO J. LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks[EB/OL].(2023-10-17)[2025-09-02]. https://arxiv.org/abs/2310.11409.
[21]	FANG R, BINDU R, GUPTA A, et al. LLM Agents Can Autonomously Hack Websites[EB/OL].(2024-02-06)[2025-09-02]. https://arxiv.org/abs/2402.06664.
[22]	FANG R, BINDU R, GUPTA A, et al. LLM Agents Can Autonomously Exploit One-Day Vulnerabilities[EB/OL].(2024-04-11)[2025-09-02]. https://arxiv.org/abs/2404.08144.
[23]	OpenAI. What is the Difference between the GPT-4 Models[EB/OL]. [2025-10-26]. https://help.openai.com/en/articles/7127966-what-is-the-difference-between-the-gpt-4-models.
[24]	LIU N F, LIN K, HEWITT J, et al. Lost in the Middle: How Language Models Use Long Contexts[EB/OL].(2023-07-06)[2025-09-02]. https://arxiv.org/abs/2307.03172.
[25]	BANG Yejin, CAHYAWIJAYA S, LEE N, et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity[EB/OL].(2023-02-08)[2025-09-02]. https://arxiv.org/abs/2302.04023.
[26]	MITRE Corporation. MITRE ATT&CK® Matrix for Enterprise (Knowledge Base)[EB/OL].(2025-10-28)[2026-01-27]. https://attack.mitre.org/matrices/enterprise/.
[27]	SINGH G P, BHARTI V, HOODA M K. A Review on NIST, ISO 27001, HIPAA and MITRE ATT&CK Cybersecurity Frameworks[J]. Webology, 2021, 18(6): 1872-1880.
[28]	AMMANN P, WIJESEKERA D, KAUSHIK S. Scalable, Graph-Based Network Vulnerability Analysis[C]// ACM. The 9th ACM Conference on Computer and Communications Security. New York: ACM, 2002: 217-224.
[29]	Anthropic. Introducing the Model Context Protocol[EB/OL].(2024-11-25)[2025-09-02]. https://www.anthropic.com/news/model-context-protocol.
[30]	LYON G F. Nmap Network Scanning: The Official Nmap Project Guide to Network Discovery and Security Scanning[M]. Rockland, MA: Insecure.Com LLC, 2009.
[31]	Hack The Box. Hack The Box: Hacking Training for the Best[EB/OL]. [2025-11-05]. http://www.hackthebox.com/.
[32]	picoCTF. picoCTF 2021 Redpwn Competition[EB/OL]. [2025-11-21]. https://picoctf.org/competitions/2021-redpwn.html.
[33]	PortSwigger. Cross-Site Scripting (XSS)[EB/OL]. [2025-11-21]. https://portswigger.net/web-security/cross-site-scripting.
[34]	Acunetix. Acunetix Web Vulnerability Scanner[EB/OL]. [2025-11-21]. https://www.acunetix.com/.
[35]	Chaitin Technology. X-Ray Vulnerability Scanner[EB/OL].(2026-01-01)[2026-02-02]. https://www.chaitin.cn/en/xray.

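MITRE ATT&CK phase	Corresponding tool modules
Reconnaissance	Reconnaissance, fingerprinting
Resource Development	Payload generation
Initial Access	Exploit scripts, black-box testing
Execution	Local shell, remote control
Persistence / Lateral Movement	Remote control, local file operations
Collection	Web crawler, headless Chromium, data structuring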
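
The phase-to-module mapping in the table above can be held as a simple lookup structure. The following is a minimal sketch in Python, not the authors' implementation: the names `PHASE_MODULES`, `modules_for`, and `phases_using` are hypothetical, and the module labels are English renderings of the table entries rather than identifiers taken from the system.

```python
# Hypothetical lookup table mirroring the phase-to-module mapping in the table.
# Labels are illustrative English renderings, not identifiers from the system.
PHASE_MODULES: dict[str, list[str]] = {
    "Reconnaissance": ["reconnaissance", "fingerprinting"],
    "Resource Development": ["payload generation"],
    "Initial Access": ["exploit scripts", "black-box testing"],
    "Execution": ["local shell", "remote control"],
    "Persistence / Lateral Movement": ["remote control", "local file operations"],
    "Collection": ["web crawler", "headless Chromium", "data structuring"],
}


def modules_for(phase: str) -> list[str]:
    """Return the tool modules listed for a phase, or an empty list if unknown."""
    return PHASE_MODULES.get(phase, [])


def phases_using(module: str) -> list[str]:
    """Return every phase, in table order, whose row lists the given module."""
    return [phase for phase, mods in PHASE_MODULES.items() if module in mods]
```

A reverse lookup such as `phases_using("remote control")` makes it visible that one module can serve more than one phase, as the table shows for Execution and Persistence / Lateral Movement.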

Target	Difficulty	PentestGPT tests	PentestGPT successes	This system successes	This system tests
Sau	Easy	5	5	5	5
Pilgrimage	Easy	5	3	4	5
Topology	Easy	5	0	2	5
PC	Easy	5	4	3	5
MonitorsTwo	Easy	5	3	5	5
Authority	Medium	5	0	2	5
Sandworm	Medium	5	0	3	5
Jupiter	Medium	5	0	2	5
Agile	Medium	5	2	4	5
OnlyForYou	Medium	5	0	2	5
Total	-	50	17	32	50
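
The totals in the last row of the table above translate directly into overall success rates. The short Python sketch below only restates that arithmetic; the variable and function names are illustrative, and the figures are copied from the totals row.

```python
# Totals copied from the last row of the table (successes out of attempts).
totals = {
    "PentestGPT": {"successes": 17, "attempts": 50},
    "This system": {"successes": 32, "attempts": 50},
}


def success_rate(successes: int, attempts: int) -> float:
    """Return the success rate as a percentage of attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


rates = {name: success_rate(**counts) for name, counts in totals.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f}%")
# prints "PentestGPT: 34%" and "This system: 64%"
```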

Challenge	Category	PentestGPT tests	PentestGPT successes	This system successes	This system tests
login	Web	5	5	5	5
advance-potion-making	forensics	5	3	3	4
spelling-quiz	crypto	5	4	3	5
caas	Web	5	2	5	5
XtrOrdinary	crypto	5	5	3	5
tripplesecure	crypto	5	3	2	5
clutteroverflow	binary	5	1	3	5
not crypto	reverse	5	0	0	5
scrambled-bytes	forensics	5	0	0	5
breadth	reverse	5	0	0	5
notepad	Web	5	1	4	5
college-rowing-team	crypto	5	2	1	5
fermat-strings	binary	5	0	0	5
corrupt-key-1	crypto	5	0	0	5
SaaS	binary	5	0	0	5
riscy business	reverse	5	0	0	5
homework	binary	5	0	0	5
lockdown-horses	binary	5	0	0	5
corrupt-key-2	crypto	5	0	0	5
vr-school	binary	5	0	0	5
MATRIX	reverse	5	0	0	5

Test item	AWVS	Xray	This system
Reflected XSS into HTML context with nothing encoded	Success	Success	Success
Stored XSS into HTML context with nothing encoded	Success	Failure	Success
DOM XSS in document.write sink using source location.search	Success	Failure	Success
DOM XSS in innerHTML sink using source location.search	Success	Failure	Success
DOM XSS in jQuery anchor href attribute sink using location.search source	Failure	Failure	Success
DOM XSS in jQuery selector sink using a hashchange event	Failure	Failure	Success
Reflected XSS into attribute with angle brackets HTML-encoded	Success	Success	Success
Stored XSS into anchor href attribute with double quotes HTML-encoded	Failure	Failure	Failure
Reflected XSS into a JavaScript string with angle brackets HTML encoded	Success	Success	Success
DOM XSS in document.write sink using source location.search inside a select element	Success	Success	Success
DOM XSS in AngularJS expression with angle brackets and double quotes HTML-encoded	Success	Success	Success
Reflected DOM XSS	Failure	Failure	Failure
Reflected XSS into HTML context with most tags and attributes blocked	Success	Success	Failure
Reflected XSS into HTML context with all tags blocked except custom ones	Success	Success	Success
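
The per-tool outcome of the comparison above can be tallied by counting the successful rows in each column. The Python sketch below is illustrative only: the data structure and names are assumptions, and the boolean values are transcribed row by row from the table (`True` for a success entry, `False` for a failure entry).

```python
# Outcomes per table row, in order: (AWVS, Xray, this system).
# True = success entry in the table, False = failure entry.
results = [
    (True, True, True),
    (True, False, True),
    (True, False, True),
    (True, False, True),
    (False, False, True),
    (False, False, True),
    (True, True, True),
    (False, False, False),
    (True, True, True),
    (True, True, True),
    (True, True, True),
    (False, False, False),
    (True, True, False),
    (True, True, True),
]
tools = ("AWVS", "Xray", "This system")

# Count successful rows per column (bools sum as 0/1).
detected = {tool: sum(row[i] for row in results) for i, tool in enumerate(tools)}
for tool, count in detected.items():
    print(f"{tool}: {count}/{len(results)}")
# prints "AWVS: 10/14", "Xray: 7/14", "This system: 11/14"
```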