Netinfo Security, 2025, Vol. 25, Issue (9): 1377-1384. doi: 10.3969/j.issn.1671-1122.2025.09.006


Jailbreak Detection for Large Language Model Based on Deep Semantic Mining

LIU Hui1,2, ZHU Zhengdao3, WANG Songhe1, WU Yongcheng4, HUANG Linquan5,6

1. School of Computer Science, Central China Normal University, Wuhan 430079, China
2. Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
3. Faculty of Artificial Intelligence Education, Central China Normal University, Wuhan 430079, China
4. School of Artificial Intelligence, Jingchu University of Technology, Jingmen 448000, China
5. School of Information, Wuhan Vocational College of Software and Engineering, Wuhan 430205, China
6. School of Information, Wuhan Open University, Wuhan 430205, China
Received: 2025-06-15; Online: 2025-09-10; Published: 2025-09-18

Abstract:

Jailbreak attacks on large language models (LLMs) often disguise user prompts to evade built-in safety mechanisms. Common strategies include semantic encoding and prefix injection, which induce LLMs to generate unethical or harmful content. To address this issue, we propose a jailbreak detection method based on deep semantic mining. By uncovering the latent intent embedded in user prompts, the approach effectively activates the model's safety protocols and enables accurate identification of malicious prompts. We evaluated the method against three representative jailbreak techniques on three mainstream LLMs. Experimental results show that it achieves an average detection accuracy of 96.48% and reduces the jailbreak attack success rate from 33.75% to 1.38%. Compared with the latest existing detection methods, it improves defense performance by 4%, demonstrating a strong capability to mitigate jailbreak attacks.
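The abstract reports two metrics, detection accuracy and attack success rate, without defining them. The following is a minimal sketch under the assumption that they follow their usual meanings: accuracy is the fraction of prompts whose malicious/benign flag matches the ground truth, and attack success rate is the fraction of jailbreak attempts that still yield harmful output with the defense in place. The `Trial` record and all function names are hypothetical and are not taken from the paper. The sketch also shows the relative reduction implied by the reported drop from 33.75% to 1.38%.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One evaluated prompt (hypothetical record format, not from the paper)."""
    is_malicious: bool    # ground truth: the prompt is a jailbreak attempt
    flagged: bool         # detector output: the prompt was flagged as malicious
    harmful_output: bool  # the target LLM still produced harmful content


def detection_accuracy(trials: Sequence[Trial]) -> float:
    """Fraction of prompts whose detector flag matches the ground-truth label."""
    if not trials:
        raise ValueError("no trials to evaluate")
    correct = sum(t.flagged == t.is_malicious for t in trials)
    return correct / len(trials)


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of jailbreak attempts that yielded harmful output despite the defense."""
    attacks = [t for t in trials if t.is_malicious]
    if not attacks:
        raise ValueError("no jailbreak attempts among the trials")
    return sum(t.harmful_output for t in attacks) / len(attacks)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, e.g. in attack success rate."""
    return (before - after) / before


if __name__ == "__main__":
    # Reported attack success rates (%) without and with the defense.
    print(f"{relative_reduction(33.75, 1.38):.1%}")
```

Read this way, the reported drop corresponds to roughly 95.9% fewer successful jailbreaks relative to the undefended baseline. The exact evaluation protocol used by the authors may differ.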

Key words: large language model, deep semantic mining, safety protocol, jailbreak attack
