董騰飛 楊頻 徐宇 代金鞘 賈鵬



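Abstract: The malicious abuse of text generation technology is an increasingly serious problem, which makes generated text detection essential. Existing detection methods rely on statistical anomaly features derived from specific datasets, which leads to poor generalization. Considering the factual errors and semantic conflicts that commonly appear in all kinds of generated text, this paper proposes a generated text detection method based on factual and semantic consistency. The method uses entities to compare the text with an external knowledge base, yielding the factual consistency feature of the text. In addition, it uses textual entailment to infer the relationship between the preceding and following context of the text, yielding the semantic consistency feature. Finally, the two types of features are concatenated with the output hidden vector of RoBERTa and fed into a linear classification layer for prediction. Experimental results show that the method achieves higher accuracy and better generalization than current detection methods.
Keywords: Text generation; Generated text detection; External knowledge base; Textual entailment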
CLC number: TP391.1  Document code: A  DOI: 10.19907/j.0490-6756.2023.042002
Generated text detection based on factual and semantic consistency
DONG Teng-Fei, YANG Pin, XU Yu, DAI Jin-Qiao, JIA Peng
(College of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China)
Abstract: The malicious abuse of text generation technology has become increasingly serious, which makes the detection of generated text critically important. Existing detection methods mainly rely on statistical anomaly features derived from specific datasets, which leads to poor generalization. Considering the factual errors and semantic conflicts common in generated text, this paper proposes a generated text detection method based on factual and semantic consistency. Using the entities in a text, the proposed method compares the text with an external knowledge base to obtain its factual consistency feature. In addition, textual entailment is used to infer the semantic relationship between the preceding and following context, yielding the semantic consistency feature. Finally, these two types of features are concatenated with the RoBERTa output hidden vector and fed into a linear classification layer for prediction. Experimental results show that the proposed method achieves higher accuracy and better generalization than existing detection methods.
Keywords: Text generation; Generated text detection; External knowledge base; Textual entailment
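
The final step described in the abstract, where the two consistency features are concatenated with the RoBERTa hidden vector and passed to a linear classification layer, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. The names `concat_features` and `LinearClassifier`, the vector sizes and values, the softmax over two classes, and the randomly initialised weights are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def concat_features(hidden: Sequence[float],
                    factual: Sequence[float],
                    semantic: Sequence[float]) -> List[float]:
    """Concatenate the three feature vectors into one classifier input."""
    return list(hidden) + list(factual) + list(semantic)


class LinearClassifier:
    """One linear layer followed by a softmax (hypothetical helper)."""

    def __init__(self, in_dim: int, n_classes: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Random weights stand in for trained parameters (assumption).
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(n_classes)]
        self.bias = [0.0] * n_classes

    def predict_proba(self, x: Sequence[float]) -> List[float]:
        if len(x) != len(self.weights[0]):
            raise ValueError("input dimension mismatch")
        # Linear layer: one dot product plus bias per class.
        logits = [sum(w * v for w, v in zip(row, x)) + b
                  for row, b in zip(self.weights, self.bias)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Toy vectors; every dimension and value here is hypothetical.
hidden_vec = [0.2, -0.1, 0.4, 0.05]  # stand-in for the RoBERTa hidden vector
factual_vec = [0.9, 0.1]             # stand-in for the factual consistency feature
semantic_vec = [0.3, 0.6, 0.1]       # stand-in for the semantic consistency feature

features = concat_features(hidden_vec, factual_vec, semantic_vec)
clf = LinearClassifier(in_dim=len(features))
probs = clf.predict_proba(features)  # one probability per class
```

In a real system the hidden vector would come from a RoBERTa encoder and the weights would be learned during training. The sketch only shows how the three inputs are joined before classification.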
1 Introduction
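Text generation is one of the core technologies of natural language processing. It uses given information and a text generative model (TGM) to generate text that meets a specific goal. Text generation has many beneficial applications, such as story generation [1], conversational response generation [2], code auto-completion [3], and radiology report generation [4]. Unfortunately, the technology is also maliciously abused to generate neural fake news [5-7], fake product reviews [8], and spam email [9]. Such false information spreads widely over the Internet and can harm nations, society, and individuals. Meanwhile, generated text is becoming increasingly realistic, which greatly reduces the feasibility of manual detection. Generated text detection technology, by contrast, can effectively distinguish generated text from human-written……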