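〔Abstract〕Automatic synonym recognition plays an important role in information retrieval, knowledge mining, and related areas, and has long been a focus of attention in the field. This paper combines link analysis of online dictionaries with co-occurrence analysis to extract synonyms automatically: synonyms are identified by analyzing each page's backward-link information and redirect pages, and by applying co-occurrence analysis to the page content. Compared with traditional synonym recognition methods, the approach achieves better coverage and accuracy.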
〔Keywords〕synonym recognition; link mining; co-occurrence analysis; similarity
〔CLC number〕TP393 〔Document code〕C 〔Article ID〕1008-0821(2009)08-0125-03
Automatic Recognition of Chinese Synonyms Using
Link Structure and Co-occurrence Analysis
Huang Fang1 Liu Youhua1,2 Zhang Kezhuang1 Li Yin2
(1. Department of Information Management, Nanjing University, Nanjing 210093, China;
2. School of Management and Engineering, Nanjing University, Nanjing 210093, China)
〔Abstract〕Automatic recognition of synonyms plays an important role in areas such as information retrieval and knowledge mining, and has long been a focus in these fields. This paper combines link structure analysis of web dictionaries with co-occurrence analysis to detect synonyms. Synonyms are identified by analyzing backward-link and redirect-page information, and by co-occurrence analysis of the web page contents. The results show that this method achieves better coverage and precision than traditional synonym recognition methods.
〔Key words〕synonym recognition; link mining; co-occurrence analysis; similarity
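At present, the main methods for automatically recognizing Chinese synonyms are: (1) algorithms based on literal (character-level) similarity and morpheme similarity; (2) synonym recognition based on the semantic systems of Tongyici Cilin (《同義詞詞林》)[1] and HowNet (《知網》)[2]; and (3) synonym recognition algorithms based on information retrieval[3]. The first kind of method relies on literal similarity, yet many synonyms are not literally similar, and words that are literally similar are not necessarily synonyms. Methods based on a dictionary's semantic system overcome the shortcomings of literal similarity, but they depend on specialized dictionaries, and dictionaries such as Tongyici Cilin and HowNet have insufficient vocabulary coverage and do not include newly coined words. Recognition algorithms based on information retrieval, in turn, depend on the performance of the search engine. With the development of the Internet and web mining technology, obtaining synonyms by analyzing the links of online dictionaries has gradually become a research direction in synonym recognition. Reference [4] uses the HITS algorithm to analyze the links and categories of Wikipedia to obtain synonyms and related words, with fairly good results; reference [5] treats the explaining and explained relations between dictionary entries as semantic links and applies the PageRank algorithm to compute the semantic similarity between words……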
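To make the limitation of literal similarity concrete, the following minimal sketch scores word pairs by character overlap. The Dice coefficient, the function name `char_dice`, and the example word pairs are illustrative assumptions chosen here; they are not taken from the methods cited above.

```python
def char_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of characters in two words.

    Assumption: this is one simple stand-in for "literal similarity";
    the cited literal-similarity algorithms may be defined differently.
    """
    chars_a, chars_b = set(a), set(b)
    if not chars_a and not chars_b:
        return 0.0
    return 2 * len(chars_a & chars_b) / (len(chars_a) + len(chars_b))


# Hypothetical example pairs, picked only to illustrate the two failure modes.
pairs = [
    ("電腦", "計算機"),  # synonyms ("computer") that share no character
    ("土豆", "馬鈴薯"),  # synonyms ("potato") that share no character
    ("大學", "大人"),    # share a character, but "university" vs. "adult"
]

for a, b in pairs:
    print(a, b, round(char_dice(a, b), 2))
```

Under this measure the two genuine synonym pairs score 0.0 while the unrelated pair scores 0.5, which is the mismatch that motivates looking at link structure and co-occurrence instead of surface form.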