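Abstract: To address the inability of the support vector machine (SVM) model to handle massive-data mining effectively, an improved SVM method based on active learning (AL_SVM) is proposed. The method first randomly partitions the training set into several independent and identically distributed subsets and selects one of them as the initial training set, on which an SVM is trained to obtain an initial classifier and a support vector set. Then, using the information in the classifier already obtained, it selects from the remaining samples the valuable samples that contribute most to improving the classifier, and merges them with the existing support vector set to form a new training set for updating the classifier. In this way, the important support vector information is retained while a large number of unimportant support vectors are removed, which avoids over-fitting to some extent and improves learning efficiency. Experiments show that the AL_SVM method improves learning efficiency while preserving the generalization ability of the learner.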
Keywords: support vector machine; active learning; valuable sample; support vector
CLC number: TN911-34 Document code: A Article ID: 1004-373X(2013)24-0022-03
Support vector machine algorithm based on active learning
BAI Ning
(Department of Computer Science and Technology, Shanxi Police Academy, Taiyuan 030021, China)
Abstract: To address the problem that the support vector machine (SVM) model cannot process massive-dataset mining effectively, an improved support vector machine algorithm based on active learning (AL_SVM) is presented in this paper. The training set is randomly divided into several independent and identically distributed subsets, and one of them is selected as the initial training set of the SVM model to obtain the initial classifier and the support vector set. The most valuable samples are then selected from the remaining samples according to the current classifier and combined with the support vector set from the previous SVM training to train a new SVM model, so as to improve the classifier. With this method, the important support vectors are retained and the unimportant support vectors are discarded. The over-fitting problem can therefore be avoided to some extent, and the training efficiency is improved. Simulation results demonstrate that the AL_SVM method can maintain the learner's generalization ability and improve the learning efficiency simultaneously.
Keywords: support vector machine; active learning; valuable sample; support vector
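The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in this paper, and several choices are assumptions made only for illustration: a linear soft-margin SVM trained by batch subgradient descent stands in for a full SVM solver, a remaining sample is treated as "valuable" when its margin under the current classifier is below 1, and support vectors are identified with a tolerance `sv_tol` because the subgradient iterate only approximates the optimum. The names `train_linear_svm`, `al_svm`, `make_clusters`, and `accuracy`, as well as the synthetic two-cluster data, are hypothetical.

```python
import random

def decision(w, b, x):
    """Signed score f(x) = w.x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(samples, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss
    (an assumed stand-in for a full SVM solver)."""
    dim, n = len(samples[0][0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in samples:
            if y * decision(w, b, x) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def al_svm(train_set, n_subsets=5, sv_tol=0.1, seed=0):
    """Train on one random subset, then update the classifier with
    support vectors plus valuable samples from each remaining subset."""
    data = train_set[:]
    random.Random(seed).shuffle(data)
    subsets = [data[i::n_subsets] for i in range(n_subsets)]  # random partition
    current = subsets[0]
    w, b = train_linear_svm(current)
    for subset in subsets[1:]:
        # Assumed value criterion: margin below 1 under the current classifier.
        valuable = [(x, y) for x, y in subset if y * decision(w, b, x) < 1.0]
        if not valuable:
            continue  # nothing in this subset would change the classifier
        margins = [y * decision(w, b, x) for x, y in current]
        # Approximate support vectors: samples on or near the margin.
        limit = max(1.0, min(margins)) + sv_tol
        support = [s for s, m in zip(current, margins) if m <= limit]
        current = support + valuable
        w, b = train_linear_svm(current)
    return (w, b), current

def make_clusters(n_per_class, seed):
    """Synthetic two-class data: Gaussian clusters around (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)], label))
    return data

def accuracy(model, data):
    """Fraction of samples whose predicted sign matches the label."""
    w, b = model
    hits = sum(1 for x, y in data if (1 if decision(w, b, x) >= 0 else -1) == y)
    return hits / len(data)

if __name__ == "__main__":
    train, test = make_clusters(150, seed=1), make_clusters(100, seed=2)
    model, kept = al_svm(train)
    print("final training-set size:", len(kept), "of", len(train))
    print("test accuracy:", accuracy(model, test))
```

In this sketch each retraining step sees only the approximate support vectors and the newly selected samples, which is what keeps the working set much smaller than the full training set.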
0 Introduction
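With the progress of modern science and technology and the improvement in people's management and knowledge levels, the volume of data that must be processed in the real world keeps growing. In September 2008, Nature published a special issue discussing the storage, management, and analysis of big data [1]. McKinsey & Company and Science subsequently published a big-data report and a special research issue [2], which discussed in depth the big-data problems arising in scientific research and demonstrated the importance and necessity of handling them. The study of big data has therefore become a hot topic in artificial intelligence. To extract useful knowledge from data that are massive in scale, tightly interrelated, structurally complex, and diverse in type, data mining techniques for big data have attracted wide attention from researchers in recent years. Data mining [3] refers to extracting potentially useful information or knowledge from large-scale, incomplete, noisy, fuzzy, and random complex datasets. ……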