Netinfo Security (信息网络安全), 2026, Vol. 26, Issue (5): 713-724. doi: 10.3969/j.issn.1671-1122.2026.05.004

• Academic Research •

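Safe Optimization Algorithm for Zero-Sum Game of Nonlinear Cyber-Physical Systems Based on Model-Free Reinforcement Learning

XIE Xiangpeng¹, ZHU Qi²

  1. School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2026-02-05 Online: 2026-05-10 Published: 2026-06-03
  • Corresponding author: ZHU Qi, 13511598422@163.com
  • About the authors: XIE Xiangpeng (born 1982), male, from Shandong, professor, Ph.D.; main research interest: fuzzy systems for industrial control | ZHU Qi (born 2002), female, from Jiangsu, master's student; main research interest: adaptive dynamic programming
  • Funding:
    National Natural Science Foundation of China (U25B2056)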

Abstract:

This paper proposes a safe optimization algorithm for zero-sum games of nonlinear cyber-physical systems based on model-free reinforcement learning, targeting vehicle active suspension systems subjected to denial-of-service (DoS) attacks. The algorithm addresses secure control when the system model is unknown and network packets are lost. A Bernoulli random sequence characterizes the packet loss caused by DoS attacks, so the attacked system is modeled as a stochastic nonlinear system. A discounted cost function that penalizes both the control input and the disturbance is defined, which transforms the secure control problem into a zero-sum game. A model-free value iteration algorithm based on Q-learning is then designed; it constructs a Q-function of the state, control, and disturbance, which avoids the reliance of traditional methods on a system model. A neural-network-based critic-actor-disturbance architecture is adopted for function approximation: the critic network approximates the Q-function, while the actor network and the disturbance network generate the control policy and the disturbance policy, respectively. Theoretical analysis shows that the proposed algorithm guarantees monotonic convergence and uniform boundedness of the value function sequence. Simulation results indicate that the algorithm maintains the stability and control performance of the suspension system under DoS attacks.

Key words: denial-of-service attack, adaptive dynamic programming, Q-learning, critic-actor-disturbance architecture
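To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It replaces the neural-network critic, actor, and disturbance networks with lookup tables on a coarse grid, and every number and the scalar plant in it are hypothetical. It assumes that a dropped packet means zero control reaches the plant, that the stage cost has the form q·x² + r·u² − ρ²·w², and that the learner only sees sampled transitions from a black-box simulator. Under those assumptions it draws Bernoulli delivery indicators, runs Q-function value iteration with a min over the control and a max over the disturbance, and records the value function sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: none of these values come from the paper.
P_DELIVER = 0.8                    # assumed probability a control packet survives the attack
GAMMA = 0.9                        # assumed discount factor
Q_W, R_W, RHO2 = 1.0, 0.5, 4.0     # assumed weights on state, control, disturbance

xs = np.linspace(-2.0, 2.0, 21)    # state grid
us = np.linspace(-1.0, 1.0, 11)    # control grid (contains 0)
ws = np.linspace(-0.3, 0.3, 7)     # disturbance grid (contains 0)


def plant(x, u, w, delivered):
    """Black-box simulator standing in for the unknown plant (assumed dynamics)."""
    # A dropped packet (delivered == 0) means no control input reaches the plant.
    return 0.9 * np.sin(x) + delivered * u + w


def collect_transitions(n_samples=20):
    """Sample next-state grid indices for every (x, u, w) triple."""
    X, U, W = np.meshgrid(xs, us, ws, indexing="ij")
    # Bernoulli sequence: 1 = packet delivered, 0 = packet lost to the DoS attack.
    delivered = rng.binomial(1, P_DELIVER, size=X.shape + (n_samples,))
    x_next = plant(X[..., None], U[..., None], W[..., None], delivered)
    x_next = np.clip(x_next, xs[0], xs[-1])
    # Map each sampled next state to its nearest grid point.
    idx = np.rint((x_next - xs[0]) / (xs[1] - xs[0])).astype(int)
    return np.clip(idx, 0, len(xs) - 1)


def q_value_iteration(next_idx, n_iter=200):
    """Value iteration on the Q-function using only the sampled transitions."""
    X, U, W = np.meshgrid(xs, us, ws, indexing="ij")
    utility = Q_W * X**2 + R_W * U**2 - RHO2 * W**2
    V = np.zeros(len(xs))          # start from the zero value function
    history = [V.copy()]
    for _ in range(n_iter):
        # Empirical expectation of the next-state value over the Bernoulli samples.
        Q = utility + GAMMA * V[next_idx].mean(axis=-1)
        # Zero-sum game: the disturbance maximizes, the controller minimizes.
        V = Q.max(axis=2).min(axis=1)
        history.append(V.copy())
    return Q, np.array(history)


next_idx = collect_transitions()
Q, history = q_value_iteration(next_idx)
print("largest change in the final sweep:", np.abs(history[-1] - history[-2]).max())
```

In this toy setting the recorded sequence is non-decreasing from the zero initial value and stays below max-utility / (1 − γ). This mirrors the monotonicity and boundedness properties the abstract states for the proposed algorithm, but it does not reproduce the paper's proof or its suspension-system simulation.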

CLC Number: