A Study on Autonomous Decision-Making for Network Defense Based on Hierarchical Reinforcement Learning

doi:10.3969/j.issn.1671-1122.2026.01.008

Abstract

Traditional network defense decision-making methods cope poorly with complex, dynamic network environments and diverse network attacks. To address this, this paper proposes an autonomous network defense decision-making method based on hierarchical reinforcement learning, developed in combination with a high-fidelity network attack and defense simulation environment. A Markov attack-defense game model with incomplete information is constructed to analyze the dynamic interaction between attacker and defender and to formally represent the optimal defense strategy. The complex defense decision-making task that arises from the unknown attacker type is decomposed through collaboration between the top-level control agent and the bottom-level execution agent. Simulation results under different attack and defense scenarios show that the method responds flexibly and efficiently to two types of penetration attack patterns, maintains resilient defense, and generates interpretable action distributions. Comparison with existing related work further confirms the advantage of the proposed method in defense effectiveness.

Key words: network defense decision, Markov game, hierarchical reinforcement learning, autonomous cyber operation

CLC Number: TP309

WANG Huanzhen, XU Hongping, LI Kuangdai, LIU Yang, YAO Linyuan. A Study on Autonomous Decision-Making for Network Defense Based on Hierarchical Reinforcement Learning[J]. Netinfo Security, 2026, 26(1): 91-101.

Figures/Tables 15

References 23

[1]	CrowdStrike. CrowdStrike 2024 Threat Hunting Report[EB/OL]. (2024-08-15)[2025-02-02]. ://crowdstrike.com/explore/crowdstrike-2024-threat-hunting-report/crowdstrike-2024-threat-hunting-report.
[2]	MIAO Li, LI Shuai, WU Xiangjuan, et al. Mean-Field Stackelberg Game-Based Security Defense and Resource Optimization in Edge Computing[J]. Applied Sciences, 2024, 14(9): 3406-3417. doi: 10.3390/app14083406
[3]	WU Huici, GAO Qiuyue, TAO Xiaofeng, et al. Differential Game Approach for Attack-Defense Strategy Analysis in Internet of Things Networks[J]. IEEE Internet of Things Journal, 2021, 9(12): 10340-10353. doi: 10.1109/JIOT.2021.3122115
[4]	LIU Liang, TANG Chuhao, ZHANG Lei, et al. A Generic Approach for Network Defense Strategies Generation Based on Evolutionary Game Theory[EB/OL]. (2024-08-15)[2025-02-02]. https://doi.org/10.1016/j.ins.2024.120875.
[5]	SUN Pengyu, TAN Jinglei, LI Chenwei, et al. Network Security Defense Decision-Making Method Based on Time Differential Game[J]. Netinfo Security, 2022, 22(5): 64-74.
[6]	HAMMAR K, STADLER R. Finding Effective Security Strategies through Reinforcement Learning and Self-Play[C]// IEEE. The 16th International Conference on Network and Service Management. New York: IEEE, 2020: 1-9.
[7]	HAMMAR K, STADLER R. Intrusion Prevention through Optimal Stopping[J]. IEEE Transactions on Network and Service Management, 2021, 19: 2333-2348. doi: 10.1109/TNSM.2022.3176781
[8]	ALSHAMRANI A, ALSHAHRANI A. Adaptive Cyber Defense Technique Based on Multiagent Reinforcement Learning Strategies[J]. Intelligent Automation & Soft Computing, 2023, 36(3): 2757-2771.
[9]	SELMONAJ A, SZEHR O, DEL R G, et al. Hierarchical Multi-Agent Reinforcement Learning for Air Combat Maneuvering[C]// IEEE. 2023 International Conference on Machine Learning and Applications. New York: IEEE, 2023: 1031-1038.
[10]	TANG Yunlong, SUN Jing, WANG Huan, et al. A Method of Network Attack-Defense Game and Collaborative Defense Decision-Making Based on Hierarchical Multi-Agent Reinforcement Learning[EB/OL]. (2024-07-01)[2025-02-02]. https://doi.org/10.1016/j.cose.2024.103871.
[11]	CHEAH M, STONE J, HAUBRICK P, et al. CO-DECYBER: Cooperative Decision Making for Cybersecurity Using Deep Multi-Agent Reinforcement Learning[C]// Springer. The 29th European Symposium on Research in Computer Security. Heidelberg: Springer, 2023: 628-643.
[12]	STANDEN M, LUCAS M, BOWMAN D, et al. CybORG: A Gym for the Development of Autonomous Cyber Agents[EB/OL]. (2021-08-20)[2025-02-02]. https://doi.org/10.48550/arXiv.2108.09118.
[13]	WIEBE J, MALLAH R A, LI L. Learning Cyber Defence Tactics from Scratch with Multi-Agent Reinforcement Learning[EB/OL]. (2023-08-25)[2025-02-02]. https://doi.org/10.48550/arXiv.2310.05939.
[14]	PALMER G, PARRY C, HARROLD D J B, et al. Deep Reinforcement Learning for Autonomous Cyber Operations: A Survey[EB/OL]. (2024-09-17)[2025-02-02]. https://doi.org/10.48550/arXiv.2310.07745.
[15]	LIU Xiaohu, ZHANG Hengwei, DONG Shuqin, et al. Network Defense Decision-Making Based on a Stochastic Game System and a Deep Recurrent Q-Network[EB/OL]. (2021-12-01)[2025-02-02]. https://doi.org/10.1016/j.cose.2021.102480.
[16]	WAHAB O A, BENTAHAR J, OTROK H, et al. Resource-Aware Detection and Defense System against Multi-Type Attacks in the Cloud: Repeated Bayesian Stackelberg Game[J]. IEEE Transactions on Dependable and Secure Computing, 2019, 18(2): 605-622. doi: 10.1109/TDSC.8858
[17]	SLIVKINS A. Introduction to Multi-Armed Bandits[EB/OL]. (2019-04-15)[2025-02-02]. https://doi.org/10.48550/arXiv.1904.07272.
[18]	SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal Policy Optimization Algorithms[EB/OL]. (2019-04-15)[2025-02-02]. https://doi.org/10.48550/arXiv.1707.06347.
[19]	PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-Driven Exploration by Self-Supervised Prediction[C]// PMLR. International Conference on Machine Learning. New York: PMLR, 2017: 2778-2787.
[20]	KIELY M, BOWMAN D, STANDEN M, et al. On Autonomous Agents in a Cyber Defence Environment[EB/OL]. (2023-09-02)[2025-02-02]. https://doi.org/10.48550/arXiv.2309.07388.
[21]	OASIS Open. OpenC2 Language Specification Version 2.0[EB/OL]. (2024-05-15)[2025-02-02]. https://docs.oasis-open.org/openc2/oc2ls/v2.0/oc2ls-v2.0.pdf.
[22]	HANNAY J. Champion Award of CAGE Challenge 2[EB/OL]. (2023-06-06)[2025-02-02]. https://github.com/john-cardiff/-cyborg-cage-2.
[23]	RASHID T, SAMVELYAN M, DE W C S, et al. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning[J]. Journal of Machine Learning Research, 2020, 21(178): 1-51.

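Attack action and ATT&CK ID	Meaning
Network discovery (T1018)	Actively scans with tools to discover new hosts and IP addresses in the network
Service discovery (T1046)	Discovers responsive services on a selected host by initiating connections to it
Vulnerability exploitation (T1210)	Exploits remote service vulnerabilities to gain unauthorized access to internal systems
Privilege escalation (TA0004)	Obtains higher-level access privileges, such as root or administrator
Service stop (T1489)	Blocks the normal operation of critical services, degrading network performance and achieving the attack objective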

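Defense type and OpenC2 action ID	Action command	Meaning
Reconnaissance (1&30)	Monitor	Collects information on malicious activity in the network
Reconnaissance (1&30)	Analyze	Locates malicious files on a host
Recovery (10&23)	Remove	Removes malicious processes, files, and services
Recovery (10&23)	Restart	Restarts the system to its initial state
Deception (15&18)	Deploy Apache honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy Femitter honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy HarakaSMPT honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy SMSS honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy SSHD honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy Svchost honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy Tomcat honeypot	Deploys the corresponding honeypot service on a specified host
Deception (15&18)	Deploy Vsftpd honeypot	Deploys the corresponding honeypot service on a specified host
Sleep (none)	Sleep	Performs no actual action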
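
To make the two-level decomposition described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the top-level control agent can be approximated by an epsilon-greedy bandit that chooses between two bottom-level policies, one per penetration attack pattern. The bottom-level policies here are hand-written stubs, whereas the paper trains its agents with reinforcement learning. All names (`TopLevelController`, `policy_pattern_a`, `policy_pattern_b`, `simulated_episode_return`, and the observation keys) are hypothetical.

```python
import random
from typing import Callable, Dict, List

Observation = Dict[str, bool]
Policy = Callable[[Observation], str]

# Defender action set from the table above; identifier spellings are illustrative.
HONEYPOTS = ["Apache", "Femitter", "HarakaSMPT", "SMSS", "SSHD", "Svchost", "Tomcat", "Vsftpd"]
ACTIONS: List[str] = ["Monitor", "Analyze", "Remove", "Restart", "Sleep"] + [
    f"Deploy{name}" for name in HONEYPOTS
]


def policy_pattern_a(obs: Observation) -> str:
    # Hypothetical stub: clean up when malware is seen, otherwise keep monitoring.
    return "Remove" if obs.get("malware_seen", False) else "Monitor"


def policy_pattern_b(obs: Observation) -> str:
    # Hypothetical stub: restart a fully compromised host, otherwise deploy a honeypot.
    return "Restart" if obs.get("root_compromised", False) else "DeploySSHD"


class TopLevelController:
    """Epsilon-greedy bandit over bottom-level policies (an assumed stand-in)."""

    def __init__(self, n_policies: int, epsilon: float = 0.1, seed: int = 0) -> None:
        self.values = [0.0] * n_policies  # running mean return per policy
        self.counts = [0] * n_policies
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def best_arm(self) -> int:
        return max(range(len(self.values)), key=self.values.__getitem__)

    def select(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return self.best_arm()

    def update(self, arm: int, episode_return: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]


def simulated_episode_return(arm: int) -> float:
    # Toy stand-in for an environment rollout: policy 1 suits the hidden attacker better.
    return -10.0 if arm == 1 else -20.0


sub_policies: List[Policy] = [policy_pattern_a, policy_pattern_b]
controller = TopLevelController(len(sub_policies), epsilon=0.1, seed=0)
for _ in range(200):
    arm = controller.select()
    controller.update(arm, simulated_episode_return(arm))

best = controller.best_arm()
action = sub_policies[best]({"root_compromised": True})
```

Because the toy returns are negative and the value estimates start at zero, every policy is tried at least once before the controller settles on the one with the higher return. A real top-level agent would condition on observations rather than on a single running average.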

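Parameter	Setting
Learning rate	0.0005
Clipping coefficient	0.5
Value function loss coefficient	1
Value function clipping coefficient	5
ICM reward weight	1
ICM learning rate	0.001
Activation function	ReLU
Machine learning framework	RLlib+PyTorch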
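
As an illustration only, the hyperparameters in the table above could be collected into an RLlib-style configuration dictionary. The key names below (`lr`, `clip_param`, `vf_loss_coeff`, `vf_clip_param`, `fcnet_activation`, and a `Curiosity` exploration block with `eta` and `lr`) are an assumed mapping onto RLlib's PPO and curiosity options. The source does not state which configuration keys the authors used, and this is not a complete training configuration.

```python
# Assumed mapping of the table's hyperparameters onto RLlib-style keys.
# Values come from the table; key names are an assumption, not the authors' config.
PPO_ICM_CONFIG = {
    "framework": "torch",                    # RLlib+PyTorch
    "lr": 5e-4,                              # learning rate
    "clip_param": 0.5,                       # clipping coefficient
    "vf_loss_coeff": 1.0,                    # value function loss coefficient
    "vf_clip_param": 5.0,                    # value function clipping coefficient
    "model": {"fcnet_activation": "relu"},   # activation function
    "exploration_config": {
        "type": "Curiosity",                 # ICM-style intrinsic reward module
        "eta": 1.0,                          # ICM reward weight
        "lr": 1e-3,                          # ICM learning rate
    },
}


def describe(config: dict, prefix: str = "") -> list:
    """Flatten the nested config into 'key.subkey = value' lines for logging."""
    lines = []
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            lines.extend(describe(value, prefix=f"{name}."))
        else:
            lines.append(f"{name} = {value}")
    return lines


summary = describe(PPO_ICM_CONFIG)
```

Keeping the settings in one dictionary makes it easy to log them next to each experiment and compare runs.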

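Scheme	Strategy	Score
CAGE Challenge 2^[20]	No response	-1034.33 ± 165.21
CAGE Challenge 2^[20]	Random response	-666.26 ± 364.30
CAGE Challenge 2^[20]	Restart response	-460.49 ± 353.58
HANNAY^[22]	PPO+heuristics	-13.24 ± 4.25
WIEBE et al.^[13]	QMIX^[23]	-16.75 ± 3.48
TANG et al.^[10]	Hierarchical PPO	-16.60 ± 4.03
This paper	Hierarchical PPO+ICM	-14.79 ± 4.84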
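
The scores above can be ranked with a few lines of Python. This sketch assumes that a score closer to zero is better, as the baseline rows suggest. It uses only the means and standard deviations reported in the table, and the `Result` type and variable names are illustrative.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    scheme: str
    strategy: str
    mean: float
    std: float


# Values copied from the results table above.
RESULTS: List[Result] = [
    Result("CAGE Challenge 2", "No response", -1034.33, 165.21),
    Result("CAGE Challenge 2", "Random response", -666.26, 364.30),
    Result("CAGE Challenge 2", "Restart response", -460.49, 353.58),
    Result("HANNAY", "PPO+heuristics", -13.24, 4.25),
    Result("WIEBE et al.", "QMIX", -16.75, 3.48),
    Result("TANG et al.", "Hierarchical PPO", -16.60, 4.03),
    Result("This paper", "Hierarchical PPO+ICM", -14.79, 4.84),
]

# Rank from best (closest to zero) to worst.
ranked = sorted(RESULTS, key=lambda r: r.mean, reverse=True)

# Gap between the proposed method and the top-ranked entry, in score units.
ours = next(r for r in RESULTS if r.scheme == "This paper")
gap_to_best = round(ours.mean - ranked[0].mean, 2)

for position, r in enumerate(ranked, start=1):
    print(f"{position}. {r.scheme} ({r.strategy}): {r.mean:.2f} ± {r.std:.2f}")
```

Since the reported standard deviations (roughly 3.5 to 4.8 for the learned methods) are larger than the differences between their means, the ranking among the top four entries should be read with that spread in mind.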