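Abstract: Among the many clustering algorithms, K-means and the self-organizing map (SOM) neural network are two of the classics. Based on an analysis of the strengths and weaknesses of the two algorithms, a SOM-based K-means two-phase clustering algorithm is proposed. The algorithm exploits the automatic clustering capability of SOM to obtain a preliminary number of clusters and the centre of each cluster, and uses these as the initial input of K-means for further clustering, thereby obtaining more precise clustering information. Finally, the algorithm is applied to household customer data from a telecom operator in one region, and the results show that it clusters well.
Keywords: clustering; self-organizing neural network; K-means; segmentation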
CLC number: TN911-34    Document code: A
Article ID: 1004-373X(2010)16-0113-04
SOM+K-means Two-phase Clustering Algorithm and Its Application
ZHOU Huan, LI Guang-ming, ZHANG Gao-yu
(School of Information Management, Shanghai Finance University, Shanghai 201209, China)
Abstract: K-means and the SOM network are two classical clustering algorithms. Based on an analysis of the advantages and shortcomings of the two algorithms, a new SOM-based K-means two-phase clustering algorithm is proposed. The preliminary number of clusters and the centre of each cluster are obtained with the SOM algorithm, exploiting its automatic clustering capability. These results are then taken as the initial input of the K-means algorithm for further clustering, which yields more accurate clustering results. The algorithm is applied to data on telecom household customers in one district. The results indicate that the algorithm performs better than the SOM network or the K-means algorithm used separately.
Keywords: clustering; SOM network; K-means; segmentation
0 Introduction
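Cluster analysis [1] is a tool for exploring the structure of data. Its core is clustering, that is, partitioning objects into clusters so that objects in the same cluster are similar and objects in different clusters are dissimilar. Objects can be described by certain measurements (such as attributes or features) or by their relationships to other objects (for example, pairwise distances or similarities). Clustering is an unsupervised learning technique. In the business world there is a pressing need to organize rapidly growing volumes of data and to learn valuable information from them, which has made clustering a very active research area and the most widely used analysis method in data mining and in practice.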
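Among clustering algorithms, K-means [2] is one of the most widely used. It is simple, easy to understand, computationally convenient and fast, and it handles large databases effectively, which has made it a classic algorithm in cluster analysis. However, K-means has inherent drawbacks [3-6]: the initial values strongly affect the clustering result, it is prone to getting trapped in local optima, the optimal number of clusters must be judged from experience, and it is sensitive to noise and outliers. These defects greatly limit its scope of application and its effectiveness. In contrast to K-means, the SOM [7-8] (self-organizing map) neural network is an unsupervised learning model that maps data from a high-dimensional space onto a low-dimensional one, finds the main statistical features of multidimensional data through dimensionality reduction, and automatically groups the data into categories according to the similarity among data items, thereby enhancing the useful customer information and reducing the influence of noise. ……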
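The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `train_som`, `som_seeds` and `kmeans`, the one-dimensional chain of SOM units, the linear decay of the learning rate and neighbourhood width, and the rule of treating every SOM unit that wins at least one sample as a preliminary cluster are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to the point."""
    return min(range(len(centres)), key=lambda i: sq_dist(point, centres[i]))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, seed=0):
    """Phase 1 (assumed form): train a 1-D chain of SOM units."""
    rng = random.Random(seed)
    # Initialise each unit with a copy of a randomly chosen sample.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    sigma0 = max(n_nodes / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
        for point in rng.sample(data, len(data)):
            bmu = nearest(point, weights)        # best-matching unit
            for j, w in enumerate(weights):
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (point[d] - w[d])
    return weights


def som_seeds(data, weights):
    """Keep only units that win at least one sample; return the mean of
    each unit's samples as a preliminary cluster centre."""
    groups = {}
    for p in data:
        groups.setdefault(nearest(p, weights), []).append(p)
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]


def kmeans(data, centres, max_iter=100):
    """Phase 2: Lloyd's K-means started from the given centres."""
    centres = [list(c) for c in centres]
    for _ in range(max_iter):
        labels = [nearest(p, centres) for p in data]
        new = []
        for i, c in enumerate(centres):
            members = [p for p, lab in zip(data, labels) if lab == i]
            # An empty cluster keeps its previous centre.
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else c)
        if new == centres:
            break
        centres = new
    labels = [nearest(p, centres) for p in data]
    return centres, labels


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic 2-D data around three hypothetical centres.
    data = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
    seeds = som_seeds(data, train_som(data))
    centres, labels = kmeans(data, seeds)
    print(len(seeds), centres)
```

In this sketch the number of clusters passed to K-means is not chosen by hand: it is the number of SOM units that attract samples, so it is bounded above by `n_nodes`. Because K-means starts from SOM-derived centres rather than random ones, the sensitivity to initial values described above is reduced, though the result still depends on the SOM settings.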