陳鵬 郭小燕



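Abstract: The naive Bayes classifier depends heavily on the quality of the data to be classified: when the data exhibit complex, multivariate attributes, its classification performance drops sharply. To address this, the Adaboost algorithm is used to combine multiple naive Bayes classifiers into an A_B model. 3600 raw samples are processed by Chinese word segmentation, syntactic parsing, and text vectorization, and are then used to train the A_B model into an A_B classifier, which resolves the classifier's sensitivity to the data to be classified. Two A_B classifiers work together to turn the binary classifier into a three-class classifier, which sorts raw agricultural text into three categories: agricultural news, agricultural technology, and agricultural economy. The classifier is tested on 600 standard samples and on complex data with 30% interfering information added. The experimental results show that the A_B classifier performs well not only on standard data but also on complex, multivariate data. Tests with different test data further show that the A_B classifier converges well in every case and that its performance does not depend on the characteristics of the data, i.e., it is stable.
Keywords: Bayes; Adaboost; agricultural short text; classification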
Chinese Library Classification number: S24; TP3    Document code: A    DOI: 10.3969/j.issn.1003-6970.2020.09.004
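Citation format for this article: 陳鵬, 郭小燕. Agricultural short text information classification based on Adaboost and naive Bayes [J]. 軟件 (Software), 2020, 41(09): 1318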
【Abstract】: The naive Bayes classifier relies too heavily on the quality of the data to be classified. When the data exhibit complex multivariate attributes, its classification performance decreases sharply. The Adaboost algorithm is used to combine multiple naive Bayes classifiers into an A_B model. After Chinese word segmentation, syntactic parsing, and text vectorization, 3600 sets of original data are used to train the A_B model into an A_B classifier, which solves the problem that the classifier is sensitive to the data to be classified. Two A_B classifiers work together to convert two two-category classifiers into one three-category classifier, which solves the problem of dividing the original agricultural text information into three types: agricultural news, agricultural technology, and agricultural economy. The classifier is tested with 600 sets of standard data and with complex data containing 30% disturbing information. The experimental results show that the A_B classifier not only classifies standard data well but also performs well on complex, multivariate data. Tests with different test data show that the A_B classifier converges well, that its classification performance does not depend on the characteristics of the data, and that this performance is stable.
【Key words】: Bayes; Adaboost; Agricultural short text; Classification
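The idea summarized in the abstract, boosting several naive Bayes classifiers with Adaboost and letting two such binary classifiers cooperate to produce three classes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`train_nb`, `train_a_b`, `train_three_way`), the toy English token lists standing in for segmented Chinese text, the use of discrete Adaboost with a weighted multinomial naive Bayes base learner, and in particular the cascade layout (news versus the rest, then technology versus economy), since the abstract does not state how the two A_B classifiers are wired together.

```python
import math
from collections import Counter

def train_nb(docs, labels, weights):
    """Weighted multinomial naive Bayes with Laplace smoothing (labels are +1 / -1)."""
    n = len(docs)
    vocab = {tok for doc in docs for tok in doc}
    prior = {1: 0.0, -1: 0.0}
    counts = {1: Counter(), -1: Counter()}
    for doc, y, w in zip(docs, labels, weights):
        prior[y] += w * n              # rescale so sample weights act like counts
        for tok in doc:
            counts[y][tok] += w * n
    total = {c: sum(counts[c].values()) for c in (1, -1)}

    def predict(doc):
        score = {}
        for c in (1, -1):
            s = math.log(prior[c] + 1e-12)
            for tok in doc:
                if tok in vocab:       # tokens never seen in training are ignored
                    s += math.log((counts[c][tok] + 1.0) / (total[c] + len(vocab)))
            score[c] = s
        return 1 if score[1] >= score[-1] else -1
    return predict

def train_a_b(docs, labels, rounds=10):
    """Discrete Adaboost over naive Bayes base learners (assumed form of the A_B model)."""
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        clf = train_nb(docs, labels, weights)
        preds = [clf(d) for d in docs]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if err >= 0.5:                 # base learner no better than chance: stop
            if not ensemble:
                ensemble.append((1.0, clf))
            break
        err = max(err, 1e-10)          # avoid division by zero on a perfect fit
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, clf))
        if err <= 1e-10:
            break
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * p) for w, p, y in zip(weights, preds, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return lambda doc: 1 if sum(a * c(doc) for a, c in ensemble) >= 0 else -1

def train_three_way(docs, labels):
    """Cascade two binary A_B classifiers into a three-class classifier (assumed layout)."""
    stage1 = train_a_b(docs, [1 if y == "news" else -1 for y in labels])
    rest = [(d, y) for d, y in zip(docs, labels) if y != "news"]
    stage2 = train_a_b([d for d, _ in rest],
                       [1 if y == "technology" else -1 for _, y in rest])

    def predict(doc):
        if stage1(doc) == 1:
            return "news"
        return "technology" if stage2(doc) == 1 else "economy"
    return predict

# Hypothetical toy corpus; real input would be segmented, vectorized Chinese text.
DOCS = [
    ["ministry", "announces", "policy"], ["village", "holds", "festival"],
    ["wheat", "irrigation", "technique"], ["pest", "control", "technique"],
    ["grain", "price", "rises"], ["pork", "price", "market"],
]
LABELS = ["news", "news", "technology", "technology", "economy", "economy"]

classify = train_three_way(DOCS, LABELS)
print(classify(["wheat", "pest", "technique"]))
```

The cascade needs only two binary models for three classes, at the cost that an error in the first stage cannot be corrected by the second. A one-vs-rest arrangement would be an equally plausible reading of "two classifiers working together".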
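0 Introduction
With the accelerating informatization of agriculture, agricultural information platforms such as agricultural news websites, agricultural product sales websites, agricultural technology websites, and agricultural databases have emerged. Agricultural data is growing explosively over time, and massive volumes of it need to be processed. Text is the main carrier of online information: BBS posts, blogs, and news comments often contain information on agricultural policies and regulations, farmers' consumption needs, and rural development trends. To gain insight into how rural areas and agriculture develop and into farmers' consumption patterns, it is necessary to analyze and mine this text appropriately. ……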