Netinfo Security, 2024, Vol. 24, Issue (7): 1098-1109. doi: 10.3969/j.issn.1671-1122.2024.07.011

• Theoretical Research •

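Large Language Model-Generated Text Detection Based on Linguistic Feature Ensemble Learning

XIANG Hui, XUE Yunhao, HAO Lingxin

  1. School of Cyberspace, Hangzhou Dianzi University, Hangzhou 310018, China
  • Received: 2024-02-01 Online: 2024-07-10 Published: 2024-08-02
  • Corresponding author: XIANG Hui, xianghui@hdu.edu.cn
  • About the authors: XIANG Hui (born 2000), female, from Zhejiang, master's student; main research interests: natural language processing and large language models | XUE Yunhao (born 1999), male, from Zhejiang, master's student; main research interests: large models and large-model applications and security | HAO Lingxin (born 2000), male, from Shanxi, master's student; main research interests: Web security and automated discovery of Web vulnerabilities.
  • Funding:
    National Natural Science Foundation of China (61772162); Key Research and Development Program of Zhejiang Province (2023C03198)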

Abstract:

The rapid development of large language models (LLMs) has brought great convenience to daily life and work, but it has also created challenges for individuals and society. Detectors that can identify LLM-generated text are therefore urgently needed. To achieve both strong detection performance and good generalization, this paper proposes EBF Detection, an LLM-generated text detection method based on ensemble learning over linguistic features. EBF Detection combines a fine-tuned pre-trained language model with higher-order statistical features of natural language and uses a decision mechanism to detect LLM-generated text. Experimental results show that EBF Detection achieves an average detection accuracy of 98.72% on in-domain data and 96.79% on out-of-domain data.

Key words: large language model, LLM-generated text detection, ensemble learning, linguistic feature

CLC Number: