信息网络安全 ›› 2025, Vol. 25 ›› Issue (9): 1377-1384.doi: 10.3969/j.issn.1671-1122.2025.09.006

• 入选论文 • 上一篇    下一篇

基于深度语义挖掘的大语言模型越狱检测方法研究

刘会1,2, 朱正道3, 王淞鹤1, 武永成4, 黄林荃5,6()   

  1. 1.华中师范大学计算机学院,武汉 430079
    2.华中师范大学人工智能与智慧学习湖北省重点实验室,武汉 430079
    3.华中师范大学人工智能教育学部,武汉 430079
    4.荆楚理工学院人工智能学院,荆门 448000
    5.武汉软件工程职业学院信息学院,武汉 430205
    6.武汉开放大学信息学院,武汉 430205
  • 收稿日期:2025-06-15 出版日期:2025-09-10 发布日期:2025-09-18
  • 通讯作者: 黄林荃 huanglq@whvcse.edu.cn
  • 作者简介:刘会(1992—),男,湖北,讲师,博士,CCF会员,主要研究方向为人工智能安全、隐私保护|朱正道(2005—),男,江西,本科,主要研究方向为人工智能安全|王淞鹤(2005—),女,吉林,本科,主要研究方向为人工智能安全|武永成(1971—),男,湖北,副教授,硕士,主要研究方向为人工智能、网络空间安全|黄林荃(1991—),女,湖北,讲师,硕士,主要研究方向为隐私保护、计算机视觉
  • 基金资助:
    国家资助博士后研究人员计划(GZC20230922);中国博士后科学基金(2024M751050);湖北省博士后创新人才培养项目(2024HBBHCXB042);人工智能与智慧学习湖北省重点实验室2025年度开放研究基金(2025AISL007)

Jailbreak Detection for Large Language Model Based on Deep Semantic Mining

LIU Hui1,2, ZHU Zhengdao3, WANG Songhe1, WU Yongcheng4, HUANG Linquan5,6()   

  1. 1. School of Computer Science, Central China Normal University, Wuhan 430079, China
    2. Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
    3. Faculty of Artificial Intelligence Education, Central China Normal University, Wuhan 430079, China
    4. School of Artificial Intelligence, Jingchu University of Technology, Jingmen 448000, China
    5. School of Information, Wuhan Vocational College of Software and Engineering, Wuhan 430205, China
    6. School of Information, Wuhan Open University, Wuhan 430205, China
  • Received:2025-06-15 Online:2025-09-10 Published:2025-09-18

摘要:

对用户提示词进行伪装是大语言模型(LLM)越狱攻击中常见的手段,常见形式包括语义编码和前缀注入等,旨在绕过LLM的安全审查机制,从而诱导其生成违反伦理规范的内容。为应对这一挑战,文章提出一种基于深度语义挖掘的LLM越狱检测方法,通过挖掘用户提示词的潜在真实意图,有效激活模型内置的安全审查机制,实现对越狱攻击的准确识别。文章针对3种典型的越狱攻击方式在3个主流LLM上开展了广泛实验。实验结果表明,文章所提方法的平均准确率达到了96.48%,将越狱攻击的平均攻击成功率从33.75%降至1.38%,相比于当前较优检测方法,该方法将防御能力提升了4%,展现出较强的越狱防护能力。

关键词: 大语言模型, 深度语义挖掘, 安全审查, 越狱攻击

Abstract:

Jailbreak attacks on large language model (LLM) often involve disguising user prompts to evade built-in safety mechanisms. Common strategies include semantic encoding and prefix injection, which induce LLM to generate unethical or harmful content. To address this issue, we proposed a jailbreak detection method based on deep semantic mining. By uncovering the latent intent embedded in user prompts, our approach effectively activated the model’s safety protocols, enabling accurate identification of malicious prompts. We evaluated the proposed method across 3 representative jailbreak techniques on 3 mainstream LLM. Experimental results show that the proposed method achieves an average detection accuracy of 96.48%, reducing the jailbreak attack success rate from 33.75% to 1.38%. Compared to the latest existing detection methods, it improves defense performance by 4%, demonstrating strong capability in mitigating jailbreak attacks.

Key words: large language model, deep semantic mining, safety protocol, jailbreak attack

中图分类号: