Safe Optimization Algorithm for Zero-Sum Game of Nonlinear Cyber-Physical Systems Based on Model-Free Reinforcement Learning

doi:10.3969/j.issn.1671-1122.2026.05.004

Abstract

This paper proposes a safe optimization algorithm for the zero-sum game of nonlinear cyber-physical systems based on model-free reinforcement learning, targeting vehicle active suspension systems subjected to denial-of-service (DoS) attacks. The algorithm addresses secure control when the system model is unknown and network packets are lost. A Bernoulli random sequence characterizes the packet loss caused by DoS attacks, so the attacked system is modeled as a stochastic nonlinear system. A discounted cost function incorporating control effort and a disturbance penalty is defined, which transforms the secure control problem into a zero-sum game. A model-free value iteration algorithm based on Q-learning is then designed; it constructs a Q-function of the state, control, and disturbance and thereby avoids reliance on the system model. A neural-network-based critic-actor-disturbance structure is adopted for function approximation: the critic network approximates the Q-function, while the actor and disturbance networks generate the control and disturbance policies, respectively. Theoretical analysis shows that the value function sequence generated by the algorithm converges monotonically and is uniformly bounded. Simulation results indicate that the method maintains the stability and control performance of the suspension system under DoS attacks.
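
As a reading aid, the formulation summarized in the abstract can be sketched in standard zero-sum adaptive dynamic programming notation. The symbols below ($f$, $g$, $h$, $\alpha_k$, $\bar{\alpha}$, $\gamma$, $M$, $R$, $\beta$, $Q_i$) are assumptions chosen for illustration; the paper's own notation and exact cost function may differ.

$$
x_{k+1} = f(x_k) + \alpha_k\, g(x_k)\, u_k + h(x_k)\, w_k, \qquad \Pr\{\alpha_k = 1\} = \bar{\alpha}, \quad \Pr\{\alpha_k = 0\} = 1 - \bar{\alpha}
$$

$$
J = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, U(x_k, u_k, w_k) \right], \qquad U(x, u, w) = x^{\top} M x + u^{\top} R u - \beta^{2} w^{\top} w
$$

$$
Q_{i+1}(x_k, u_k, w_k) = U(x_k, u_k, w_k) + \gamma\, \mathbb{E}\left[ \min_{u} \max_{w} Q_i(x_{k+1}, u, w) \right]
$$

Here $\alpha_k$ is the Bernoulli packet-delivery indicator, $u_k$ is the control (minimizing player), $w_k$ is the disturbance (maximizing player), and $0 < \gamma \le 1$ is the discount factor. The last line is a generic Q-function value iteration; because $Q_i$ takes $(x, u, w)$ as arguments, the update can be evaluated from measured data without knowing $f$, $g$, or $h$.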

Key words: denial-of-service attack, adaptive dynamic programming, Q-learning, critic-actor-disturbance structure

CLC Number: TP309

XIE Xiangpeng, ZHU Qi. Safe Optimization Algorithm for Zero-Sum Game of Nonlinear Cyber-Physical Systems Based on Model-Free Reinforcement Learning[J]. Netinfo Security, 2026, 26(5): 713-724.

Parameter	Definition
$m_s$ / $m_u$	Sprung mass / unsprung mass
$c_s$ / $c_u$	Suspension damping / tire damping
$k_s$ / $k_u$	Suspension stiffness / tire stiffness
$l_s$ / $l_u$ / $l_r$	Body / tire / road displacement
$\dot{l}_s$ / $\dot{l}_u$ / $\dot{l}_r$	Time derivative of body / tire / road displacement
$k_{sn}$	Cubic stiffness

Parameter	Value
$m_s$	2.45 kg
$c_s$	7.5 N·s/m
$k_s$	900 N/m
$k_{sn}$	10 N/m³
$m_u$	1 kg
$c_u$	5 N·s/m
$k_u$	2500 N/m
$\max\{l_r\}$	0.038 m
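
To show how the parameter values above might be used, the following minimal sketch simulates a quarter-car suspension whose control packets are dropped according to a Bernoulli sequence. Several things here are assumptions and not taken from the paper: the equations of motion (a common quarter-car form with a cubic spring term), the zero-input strategy on packet loss, the sinusoidal road profile, the forward-Euler step, the delivery probability `p_success`, and the skyhook-style feedback gain `c_sky`, which stands in for the learned actor network.

```python
import numpy as np

# Values from the parameter table; units in comments follow standard usage.
M_S, M_U = 2.45, 1.0        # sprung / unsprung mass [kg]
C_S, C_U = 7.5, 5.0         # suspension / tire damping [N*s/m]
K_S, K_U = 900.0, 2500.0    # suspension / tire stiffness [N/m]
K_SN = 10.0                 # cubic stiffness (multiplies deflection cubed)
L_R_MAX = 0.038             # maximum road displacement [m]


def suspension_dynamics(x, u, l_r, dl_r):
    """Assumed quarter-car model with state x = [l_s, dl_s, l_u, dl_u]."""
    l_s, dl_s, l_u, dl_u = x
    defl = l_s - l_u
    f_susp = K_S * defl + K_SN * defl**3 + C_S * (dl_s - dl_u)
    f_tire = K_U * (l_u - l_r) + C_U * (dl_u - dl_r)
    ddl_s = (-f_susp + u) / M_S
    ddl_u = (f_susp - f_tire - u) / M_U
    return np.array([dl_s, ddl_s, dl_u, ddl_u])


def simulate(p_success=0.8, c_sky=20.0, dt=1e-3, steps=5000, seed=0):
    """Forward-Euler rollout with Bernoulli packet loss on the control channel."""
    rng = np.random.default_rng(seed)
    alpha = rng.binomial(1, p_success, size=steps)  # 1 = packet delivered
    t = np.arange(steps) * dt
    road = L_R_MAX * np.sin(2 * np.pi * t)               # hypothetical 1 Hz road
    droad = L_R_MAX * 2 * np.pi * np.cos(2 * np.pi * t)  # its time derivative
    x = np.zeros(4)
    traj = np.empty((steps, 4))
    for k in range(steps):
        traj[k] = x
        u_cmd = -c_sky * x[1]         # hypothetical skyhook-style damping law
        u_applied = alpha[k] * u_cmd  # dropped packet -> zero input (assumed)
        x = x + dt * suspension_dynamics(x, u_applied, road[k], droad[k])
    return t, traj, alpha, road


if __name__ == "__main__":
    t, traj, alpha, road = simulate()
    print("packet delivery rate:", alpha.mean())
    print("peak body displacement [m]:", np.abs(traj[:, 0]).max())
```

Lowering `p_success` emulates a more aggressive attack. Replacing the `u_cmd` line with a learned policy is where a model-free controller such as the one described in the abstract would plug in.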