










A knowledge-integrated named entity recognition method for process text
YANG Hong-peng¹, WANG Pei-yan¹, CAI Dong-feng¹, ZHANG Gui-ping¹,², ZHU Yong-kang¹
(1. Human-Computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, China; 2. Knowledge Engineering & Service Department, Shenyang Global Envoy Software Co., Ltd., Shenyang 110136, China)
Abstract: A knowledge-integrated named entity recognition method is proposed for texts in the process manufacturing domain, with the aim of accurately recognizing 12 types of entities in process text. The method designs regular-expression rules from process domain knowledge, uses them to pre-recognize entities in the text sequence, and forms a pre-recognition feature matrix, which is then encoded by an encoder. The pre-recognized results are saved in a lexicon, the input text is segmented into words to train a word-based knowledge representation, and both are finally added to a neural-network-based entity recognition model. With BiLSTM as the pre-recognition feature matrix encoder and BiLSTM-CRF as the neural network model, the F1 value reaches 92.55%. The experimental results show that the knowledge-integrated method effectively improves entity recognition in process text.
Key words: process manufacturing; regular rules; neural network; named entity recognition; feature matrix encoder; BiLSTM
CLC number: TP391 Document code: A
doi:10.3969/j.issn.2095-1248.2023.01.009
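Named entity recognition (NER) is the task of accurately identifying named entities in text, such as person names, place names, organization names, and other proper nouns, and assigning each one an entity type [1]. It is an important basic tool for applications such as information extraction, question answering, syntactic parsing, and machine translation, and it plays an important role in making natural language processing technology practical. NER in the process manufacturing domain aims to recognize, accurately and quickly, 12 types of entities in texts such as process manufacturing instructions, process manufacturing outlines, and process manufacturing specifications; these entities include the engineering drawings used, the parts used, the specifications followed, and the manufacturing standards. NER for process text is an indispensable component of process knowledge graphs and automatic process generation, and it plays an important role in practical applications.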
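To make the pre-recognition step summarized in the abstract more concrete, the following is a minimal sketch of how regular-expression rules could mark candidate entities character by character and collect them in a lexicon. It is an illustration under stated assumptions, not the authors' implementation: the two rule names (`STANDARD`, `DRAWING`), their patterns, the sample sentence, and the function `pre_recognize` are all hypothetical, since the actual rules and the full list of 12 entity types are not given in this section. The BiLSTM encoding of the matrix is not shown.

```python
import re

# Hypothetical rules: the paper's real regular expressions and its 12 entity
# types are not listed here, so these two patterns are illustrative only.
RULES = {
    "STANDARD": re.compile(r"(?:GB|HB|GJB)(?:/T)?\s?\d+(?:\.\d+)?-\d{2,4}"),
    "DRAWING": re.compile(r"[A-Z]{2}\d{3}-\d{2}"),
}
TYPES = list(RULES)  # column order of the feature matrix


def pre_recognize(text):
    """Return (feature_matrix, lexicon) for one character sequence.

    feature_matrix[i][j] is 1 if character i lies inside a match of rule j,
    otherwise 0. lexicon maps each matched string to its rule name.
    """
    matrix = [[0] * len(TYPES) for _ in text]
    lexicon = {}
    for j, name in enumerate(TYPES):
        for m in RULES[name].finditer(text):
            for i in range(m.start(), m.end()):
                matrix[i][j] = 1
            lexicon[m.group()] = name
    return matrix, lexicon


# Sample sentence (assumed): "Machine part AB123-01 according to GB/T 1804-2000"
text = "按GB/T 1804-2000加工零件AB123-01"
matrix, lexicon = pre_recognize(text)
print(lexicon)
```

In this sketch each row of `matrix` corresponds to one character and each column to one rule, so the matrix has the same length as the input sequence and could be fed to a sequence encoder alongside the character embeddings. A 0/1 indicator per rule is only one possible encoding; a BIO-style position tag per rule would be another.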
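New named entity recognition techniques for the general domain continue to emerge. Zhang et al. [2] proposed the Lattice LSTM (Long Short-Term Memory) [3] network structure, which avoids the impact of word segmentation errors. ……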