Discovery and Optimization Method of Attack Paths Based on PPO Algorithm

doi:10.3969/j.issn.1671-1122.2023.09.005

Abstract

Selecting penetration actions with a policy network and discovering the optimal attack path are key techniques in automated penetration testing. However, existing methods suffer from excessive ineffective actions and slow convergence during training. To address these problems, this paper applies the proximal policy optimization (PPO) algorithm to the attack path optimization problem and proposes an improved algorithm, improved PPO with penetration action selection (IPPOPAS), which incorporates a penetration action selection module. This module enables the algorithm to select actions according to the penetration testing scenario during the experience collection phase. The paper designs and implements the components of IPPOPAS, including the policy network, the value network, and the penetration action selection module, to improve the action selection process. Parameter tuning and algorithm optimization are also performed to improve performance and efficiency. Experimental results show that, in specific network scenarios, IPPOPAS converges faster than the traditional DQN algorithm and its variants, and that this convergence advantage becomes more pronounced as the number of vulnerabilities on a host increases. The effectiveness of IPPOPAS is also validated in scenarios with larger network scales.

Key words: automated penetration testing, policy network, PPO algorithm, attack path discovery

CLC Number: TP309

ZHANG Guomin, ZHANG Shaoyong, ZHANG Jinwei. Discovery and Optimization Method of Attack Paths Based on PPO Algorithm[J]. Netinfo Security, 2023, 23(9): 47-57.

Vulnerability ID	Target service or process	Attack complexity	Success probability
CVE-2019-1069	Schtask	Low	0.8
CVE-2020-9484	Tomcat	Low	0.8
CVE-2019-0841	Daclsvc	Low	0.8
CVE-2019-16759	HTTP	Low	0.8
CVE-2018-15473	SSH	Low	0.8
CVE-2019-18232	FTP	Medium	0.5
CVE-2019-19317	SMTP	High	0.2
CVE-2017-7494	SAMBA	Low	0.8
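
The probability column can be read as the chance that a single exploit attempt succeeds. The following is a minimal sketch of that reading, assuming each attempt is an independent Bernoulli trial. The names `EXPLOIT_SUCCESS_PROB`, `attempt_exploit`, and `expected_attempts` are hypothetical and are not taken from the paper.

```python
import random

# Success probabilities copied from the table above, keyed by CVE ID.
# In the table, Low complexity maps to 0.8, Medium to 0.5, High to 0.2.
EXPLOIT_SUCCESS_PROB = {
    "CVE-2019-1069": 0.8,   # Schtask
    "CVE-2020-9484": 0.8,   # Tomcat
    "CVE-2019-0841": 0.8,   # Daclsvc
    "CVE-2019-16759": 0.8,  # HTTP
    "CVE-2018-15473": 0.8,  # SSH
    "CVE-2019-18232": 0.5,  # FTP
    "CVE-2019-19317": 0.2,  # SMTP
    "CVE-2017-7494": 0.8,   # SAMBA
}


def attempt_exploit(cve_id, rng):
    """Simulate one attempt; assumes an independent Bernoulli trial."""
    return rng.random() < EXPLOIT_SUCCESS_PROB[cve_id]


def expected_attempts(cve_id):
    """Mean number of attempts until the first success (geometric distribution)."""
    return 1.0 / EXPLOIT_SUCCESS_PROB[cve_id]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(attempt_exploit("CVE-2019-19317", rng))
    print(expected_attempts("CVE-2019-19317"))
```

Under this assumption, a high-complexity exploit needs more attempts on average than a low-complexity one, which is one reason ineffective actions can slow down training.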

Address	Operating system	Host value	Services	Processes
(1,0)	Linux	0	HTTP	-
(2,0)	Windows	100	SMTP	Schtask
(2,1)	Windows	0	SMTP	Schtask
(3,0),(3,3),(3,4),(4,3)	Windows	0	FTP	Schtask
(3,1)	Windows	0	FTP, HTTP	Daclsvc
(3,2)	Windows	0	FTP	-
(4,0),(4,1),(4,2),(5,2)	Linux	0	SSH	-
(5,0)	Windows	100	SSH, SAMBA	Tomcat
(5,1)	Linux	0	SSH, HTTP	Tomcat
(5,3)	Linux	0	SSH	Daclsvc
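
Read together, the vulnerability table and the host table suggest a simple way to rule out ineffective actions: an exploit can only work on a host that runs the service or process it targets. The sketch below illustrates that kind of scenario-based filtering. It is an assumption made for illustration, not the paper's penetration action selection module, whose exact criteria are not given here. The names `EXPLOIT_TARGET`, `HOSTS`, and `applicable_exploits` are hypothetical, and operating system compatibility is deliberately ignored.

```python
# Target service or process of each exploit, from the vulnerability table.
EXPLOIT_TARGET = {
    "CVE-2019-1069": "Schtask",
    "CVE-2020-9484": "Tomcat",
    "CVE-2019-0841": "Daclsvc",
    "CVE-2019-16759": "HTTP",
    "CVE-2018-15473": "SSH",
    "CVE-2019-18232": "FTP",
    "CVE-2019-19317": "SMTP",
    "CVE-2017-7494": "SAMBA",
}

# A subset of hosts from the host table: address -> services and processes.
HOSTS = {
    (1, 0): {"HTTP"},
    (2, 0): {"SMTP", "Schtask"},
    (3, 1): {"FTP", "HTTP", "Daclsvc"},
    (5, 0): {"SSH", "SAMBA", "Tomcat"},
}


def applicable_exploits(address):
    """Return the sorted CVE IDs whose target runs on the given host."""
    running = HOSTS[address]
    return sorted(cve for cve, target in EXPLOIT_TARGET.items() if target in running)


if __name__ == "__main__":
    for address in HOSTS:
        print(address, applicable_exploits(address))
```

For host (1,0), which only exposes HTTP, this filter leaves a single candidate exploit out of eight, so an agent sampling from the filtered set wastes fewer steps on actions that cannot succeed.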

Network scenario	Number of hosts	Number of sensitive hosts	Number of subnets	Number of open network services	Number of running processes
Scenario 1	16	2	5	5	3
Scenario 2	35	2	10	5	3
Scenario 3	16	2	5	40	3
Scenario 4	35	2	10	80	3
Scenario 5	50	3	13	5	5
Scenario 6	75	3	18	5	5
Scenario 7	150	3	22	5	5

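Hyperparameter	Meaning	Value (Scenarios 1 to 4)	Value (Scenarios 5 to 7)
Actor learning rate, $actor\_lr$	Learning rate of the actor network	1e-4	1e-5
Critic learning rate, $critic\_lr$	Learning rate of the critic network	5e-3	5e-4
$\lambda$	Parameter used in the GAE computation	0.9	0.9
Discount factor, $\gamma$	Discount factor	0.9	0.9
Hidden layer size	Number of hidden layers and neurons per layer	[128, 128]	[128, 128]
n_steps	Length of each sampled trajectory	2000 (Scenario 1); 3000 (Scenario 2); 5000 (Scenario 3); 8000 (Scenario 4)	8000
Epochs	Number of training epochs per collected trajectory	10	10
Clip ratio, $\epsilon$	Clipping range parameter in PPO	0.2	0.2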
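
To show where three of these hyperparameters enter a standard PPO update, the sketch below computes generalized advantage estimates with the discount factor 0.9 and GAE parameter 0.9, and evaluates the clipped surrogate term with clip ratio 0.2. It follows the textbook PPO and GAE formulas in plain Python and assumes a single trajectory with no early termination. The function names are hypothetical and this is not the IPPOPAS implementation.

```python
GAMMA = 0.9     # discount factor, from the table above
LAMBDA = 0.9    # GAE parameter, from the table above
CLIP_EPS = 0.2  # PPO clip ratio, from the table above


def gae_advantages(rewards, values, last_value=0.0):
    """Generalized advantage estimation over one trajectory.

    values[t] is the critic estimate for the state at step t;
    last_value bootstraps the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        # One-step temporal-difference error.
        delta = rewards[t] + GAMMA * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + GAMMA * LAMBDA * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - CLIP_EPS, min(ratio, 1.0 + CLIP_EPS))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    print(gae_advantages([0.0, 1.0], [0.0, 0.0]))
    print(clipped_surrogate(1.5, 1.0))
```

With a positive advantage, a probability ratio above 1.2 earns no extra objective value, which limits how far one batch of collected experience can move the policy.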