
Parallel Spectral Clustering Based on MapReduce

ZTE Communications, 2013, No. 2

Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, and Lei Hu

(1. School of Information Science and Engineering, Southeast University, Nanjing 210096, China;

2. ZTE Corporation, Nanjing 210012, China)

Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. The spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. As database sizes soar, clustering algorithms face steep growth in computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.

Keywords: spectral clustering; parallel implementation; massive dataset; Hadoop MapReduce; data mining

1 Introduction

Clustering is a method of unsupervised learning that is widely used in exploratory data analysis [1]. Recently, the spectral clustering algorithm was shown to be more effective than many other traditional algorithms in detecting clusters. Extracting valuable information from an ocean of data is a hot research topic, and applications of spectral clustering range from computer vision and information retrieval to social science and biology [1], [2]. Spectral clustering cannot be adequately scaled in terms of computational time and memory use to deal with massive datasets. For example, a quadratic resource bottleneck occurs when computing pairwise similarities and the number of data instances n (Table 1) is large, and a considerable amount of time is needed to compute the first k (Table 1) eigenvectors of a Laplacian matrix; therefore, it is necessary to develop dataset-oriented parallel implementations [2], [3]. In [4], the message passing interface (MPI) is used to parallelize the spectral clustering algorithm. MapReduce has an advantage over MPI in that it can automatically split massive data with the Hadoop Distributed File System (HDFS). Better load balancing and data parallelism improve the scalability and efficiency of the machine clusters when a dataset is large. Therefore, a MapReduce framework is more suitable for handling this problem.

▼ Table 1. Notation used in this paper.

MapReduce is a programming model and associated implementation for processing large datasets [5]. Users specify the computational rules in the form of a Map function that processes a <key, value> pair to generate a set of intermediate <key, value> pairs, and a Reduce function that merges all intermediate values associated with the same key. Programs written in this way can be executed in parallel on large commodity machine clusters. The underlying runtime system automatically partitions the input data, schedules the program's execution, handles machine failures, and manages intermachine communication. This allows programmers who are inexperienced with parallel programming to easily use the resources of a large machine cluster. The overall flow of a common MapReduce operation is shown in Fig. 1. Both Google and Hadoop provide MapReduce runtimes that are flexible and tolerant to faults.
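To make the <key, value> flow concrete, the following is a minimal sketch of the canonical word-count job written against the Hadoop 0.20 MapReduce API used later in this paper. The class names and the whitespace tokenization are illustrative assumptions, not part of the proposed implementation.

```java
// A minimal sketch of the canonical word-count job in the Hadoop 0.20 API,
// showing how a Map function emits intermediate <key, value> pairs and how a
// Reduce function merges all values that share a key.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: for every input line, emit an intermediate <word, 1> pair per token.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: all counts for the same word arrive at one reduce call; sum them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
```

The mappers and reducers of PSCI described in section 3 follow exactly this pattern; only the contents of the <key, value> pairs change.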

In this paper, we propose a parallel spectral clustering implementation (PSCI) that is based on Hadoop MapReduce and that can be applied to massive datasets. By constructing proper <key, value> pairs, the proposed implementation can be efficiently executed in parallel. Tests on benchmark networks show the effectiveness of PSCI in detecting communities. We analyze a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo (a popular Twitter-like microblog in China) to show the efficiency and practicability of PSCI for massive datasets.

In section 2, we give a brief overview of the spectral clustering algorithm in order to understand particular bottlenecks and to analyze the parallel parts of the algorithm. In section 3, we propose PSCI based on the Hadoop MapReduce framework. In section 4, we show the results of tests on PSCI and evaluate its clustering quality and scalability (in terms of runtime speedup and scaleup). Concluding remarks are made in section 5.

▲ Figure 1. MapReduce execution overview.

2 Spectral Clustering Algorithm

The most common spectral clustering algorithm is based on the graph Laplacian matrix [1], [6]. We assume that G = (V, E) is a weighted undirected graph with vertex set V = {v_1, v_2, ..., v_n} and edge set E = {e_1, e_2, ..., e_m}. Each edge between vertices v_i and v_j has a non-negative weight w_ij ≥ 0, and w_ij = w_ji. The adjacency matrix of the graph is

$W = (w_{ij})_{i,j=1,\dots,n},$ (1)

where w_ij = 0 means that v_i and v_j are not connected by an edge. The degree of a vertex v_i is then defined as

$d_i = \sum_{j=1}^{n} w_{ij},$ (2)

where the sum in (2) effectively runs only over the vertices adjacent to v_i, because the weight w_ij = 0 for all other vertices v_j. The degree matrix of the graph is

$D = \mathrm{diag}(d_1, d_2, \dots, d_n),$ (3)

where the degrees d_1, d_2, ..., d_n are on the diagonal. Based on the adjacency matrix and the degree matrix, the un-normalized graph Laplacian matrix is given by L = D − W, and the normalized graph Laplacian matrix is given by

$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}.$ (4)

We now assume that a dataset comprises n instances x_1, x_2, ..., x_n, which can be arbitrary objects. The pairwise similarity between x_i and x_j is given by s_ij = s(x_i, x_j) and can be measured by a similarity function that is non-negative and symmetric. The corresponding similarity matrix is given by

$S = (s_{ij})_{i,j=1,\dots,n}.$ (5)

An example of a similarity function is the Gaussian function

$s(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right),$ (6)

where the parameter σ controls the width of the neighborhood. Nevertheless, in a network, s_ij is usually defined directly from the graph structure as

$s_{ij} = \begin{cases} 1, & (v_i, v_j) \in E \\ 0, & \text{otherwise}. \end{cases}$ (7)

If we replace W in (4) with S, we obtain the formula used in the spectral clustering algorithm:

$L_{\mathrm{sym}} = I - D^{-1/2} S D^{-1/2}.$ (8)

Furthermore, the matrix S is often reduced to a sparse one by considering only the significant relationships between instances, which saves computational time. A summary of the spectral clustering algorithm is shown here [1].

Algorithm 1. Spectral Clustering Algorithm

Input: the similarity matrix S ∈ R^{n×n} and the number of desired clusters k.
Output: k clusters.
Procedure:
1. Construct a similarity graph and let W be its weighted adjacency matrix.
2. Compute the normalized Laplacian matrix L_sym.
3. Compute the first k eigenvectors u_1, u_2, ..., u_k of L_sym.
4. Let U ∈ R^{n×k} be the matrix containing the vectors u_1, u_2, ..., u_k as columns.
5. Form the matrix T ∈ R^{n×k} from U by normalizing the rows to norm 1.
6. For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row of T.
7. Cluster the points (y_i), i = 1, ..., n, with the k-means algorithm into clusters C_1, C_2, ..., C_k.
8. Output clusters A_1, A_2, ..., A_k with A_i = {j | y_j ∈ C_i}.

Generally, it is useful to change the representation of the abstract data points x_i to points y_i ∈ R^k because of the properties of the graph Laplacian matrix. This change of representation enhances the cluster properties of the data, so that clusters can be trivially detected in the new representation. In particular, the simple k-means clustering algorithm detects the clusters without any difficulties.

Algorithm 2. k-Means Algorithm

Input: the dataset comprising the instances, and the number of desired clusters k.
Output: k clusters.
Procedure:

1. k initial group centroids (so-called "means") are randomly selected from the dataset.

2. k clusters are created by associating every object with the nearest centroid.

3. The positions of the k centroids are recalculated once all objects have been assigned.

4. Steps 2 and 3 are repeated until the centroids no longer move (i.e., convergence is reached).
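A minimal serial sketch of these four steps is given below. It assumes the dataset is already in memory as a double[][] of n instances with d features; the class, method, and variable names are hypothetical and are only meant to illustrate the assignment and centroid-update loop.

```java
// A minimal serial k-means sketch following steps 1-4 above.
import java.util.Random;

public class KMeans {

  static double[][] cluster(double[][] data, int k, int maxIter, long seed) {
    int n = data.length, d = data[0].length;
    Random rnd = new Random(seed);
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {                 // step 1: random initial centroids
      centers[c] = data[rnd.nextInt(n)].clone();
    }
    int[] assign = new int[n];
    for (int iter = 0; iter < maxIter; iter++) {
      for (int i = 0; i < n; i++) {               // step 2: assign each object to the nearest centroid
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = 0;
          for (int j = 0; j < d; j++) {
            double diff = data[i][j] - centers[c][j];
            dist += diff * diff;
          }
          if (dist < bestDist) { bestDist = dist; best = c; }
        }
        assign[i] = best;
      }
      double[][] sums = new double[k][d];         // step 3: recompute the centroids
      int[] counts = new int[k];
      for (int i = 0; i < n; i++) {
        counts[assign[i]]++;
        for (int j = 0; j < d; j++) sums[assign[i]][j] += data[i][j];
      }
      boolean moved = false;
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;             // keep an empty center unchanged
        for (int j = 0; j < d; j++) {
          double v = sums[c][j] / counts[c];
          if (v != centers[c][j]) moved = true;
          centers[c][j] = v;
        }
      }
      if (!moved) break;                          // step 4: stop when the centroids no longer move
    }
    return centers;
  }
}
```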

3 Parallel Implementation Based on MapReduce

There are three intensive computing processes in the spectral clustering algorithm: construction of the Laplacian matrix, computation of the first k eigenvectors, and calculation of distances in k-means. Therefore, after preprocessing the original dataset, we partition the similarity matrix computation and the sparsification computation by data-point index in order to construct the Laplacian matrix. When calculating the eigenvectors, we put the Laplacian matrix on HDFS and launch distributed Lanczos operations to get the first k eigenvectors of the Laplacian matrix. Finally, we run the parallel k-means clustering algorithm on the transposed eigenvector matrix to get the final clustering results. The entire process of our parallel implementation is shown in Fig. 2.

3.1 Construction of the Laplacian Matrix

The Laplacian matrix is constructed in two steps: computation of the similarity matrix and sparsification of the similarity matrix. Fortunately, neither the computation of the pairwise similarities of one object nor the search for its t-nearest neighbors depends on the computations for other objects. Therefore, the computations for different objects can be done in parallel via dataset partitioning (Fig. 3).

3.1.1 Mapper: Calculation of Similarity

Input: <key, value> pair, where key is the unique index and features of one object, and value is the index and corresponding features of each other object whose index is greater.

Output: <key′, value′> pair, where key′ is the index of one object, and value′ is the index of each other object and its similarity with the object in key′.

3.1.2 Reducer: Construction of the Sparse Matrix

Input: <key, value> pair, where key is the index of one object, and value is the index of each other object and its similarity with the object in key.

Output: <key′, value′> pair, where key′ is the index of one object, and value′ is the index and corresponding similarity of each other object among the t-nearest neighbors of the object in key′.
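The following is a hedged sketch of these two steps in the Hadoop 0.20 API. The tab-separated record layout (index followed by feature values), the use of the Gaussian similarity of (6), the SIGMA and T constants, and all class names are assumptions made for illustration; they are not taken from the paper.

```java
// Sketch of similarity computation (mapper) and t-nearest-neighbor
// sparsification (reducer) for constructing the Laplacian matrix (Fig. 3).
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LaplacianConstruction {

  static final double SIGMA = 1.0;  // width of the Gaussian similarity (assumed)
  static final int T = 10;          // number of nearest neighbors kept (assumed)

  // Mapper: key holds one object (index and features), value holds another
  // object with a greater index; emit the pair's similarity for both rows of S.
  public static class SimilarityMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] a = key.toString().split("\t");    // a[0] = index i, a[1..] = features of x_i
      String[] b = value.toString().split("\t");  // b[0] = index j, b[1..] = features of x_j
      double dist2 = 0;
      for (int d = 1; d < a.length; d++) {
        double diff = Double.parseDouble(a[d]) - Double.parseDouble(b[d]);
        dist2 += diff * diff;
      }
      double sim = Math.exp(-dist2 / (2 * SIGMA * SIGMA));
      context.write(new Text(a[0]), new Text(b[0] + ":" + sim));
      context.write(new Text(b[0]), new Text(a[0] + ":" + sim));
    }
  }

  // Reducer: keep only the t largest similarities of each row, i.e. the
  // t-nearest-neighbor sparsification of the similarity matrix.
  public static class SparsifyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String[]> entries = new ArrayList<String[]>();
      for (Text v : values) {
        entries.add(v.toString().split(":"));     // [neighbor index, similarity]
      }
      Collections.sort(entries, new Comparator<String[]>() {
        public int compare(String[] x, String[] y) {
          return Double.compare(Double.parseDouble(y[1]), Double.parseDouble(x[1]));
        }
      });
      StringBuilder row = new StringBuilder();
      for (int i = 0; i < Math.min(T, entries.size()); i++) {
        if (i > 0) row.append(" ");
        row.append(entries.get(i)[0]).append(":").append(entries.get(i)[1]);
      }
      context.write(key, new Text(row.toString()));
    }
  }
}
```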

3.2 Computation of the First k Eigenvectors

After we have calculated and stored the Laplacian matrix, we must parallelize the eigensolver. The Lanczos algorithm is an iterative algorithm. It is an adaptation of the power method and is used to find eigenvalues and eigenvectors of a square matrix or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices.

Algorithm 3. Lanczos Algorithm

Input: the Laplacian matrix L_sym.
Output: a tridiagonal matrix T and a matrix Q.
Procedure:
1. Initialization: q_0 ← 0, β_1 ← 0, and q_1 ← a random vector normalized to norm 1.
2. Iteration: for k = 1, 2, ..., m:
   w_k = L_sym q_k − β_k q_{k−1};
   α_k = (w_k, q_k);
   w_k = w_k − α_k q_k;
   β_{k+1} = ‖w_k‖;
   q_{k+1} = w_k / β_{k+1}.
3. Return Q_{n×m} = (q_1, q_2, ..., q_m) and

$T_{m\times m} = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \beta_m \\ & & & \beta_m & \alpha_m \end{pmatrix}.$
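For reference, a minimal serial sketch of this iteration on a dense symmetric matrix is shown below. Production eigensolvers add re-orthogonalization, which is omitted here; the array names mirror Algorithm 3, and everything else, including the breakdown check, is an illustrative assumption.

```java
// Serial Lanczos iteration on a dense symmetric matrix (no re-orthogonalization).
import java.util.Random;

public class Lanczos {

  /** Returns {alpha, beta}: the diagonal and off-diagonal entries of T. */
  static double[][] run(double[][] L, int m, long seed) {
    int n = L.length;
    Random rnd = new Random(seed);
    double[] qPrev = new double[n];            // q_0 = 0
    double[] q = new double[n];                // q_1 = random vector with norm 1
    for (int i = 0; i < n; i++) q[i] = rnd.nextGaussian();
    scale(q, 1.0 / norm(q));
    double[] alpha = new double[m + 1];
    double[] beta = new double[m + 2];         // beta_1 = 0
    for (int k = 1; k <= m; k++) {
      double[] w = multiply(L, q);                               // w_k = L q_k
      for (int i = 0; i < n; i++) w[i] -= beta[k] * qPrev[i];    // ... - beta_k q_{k-1}
      alpha[k] = dot(w, q);                                      // alpha_k = (w_k, q_k)
      for (int i = 0; i < n; i++) w[i] -= alpha[k] * q[i];       // w_k = w_k - alpha_k q_k
      beta[k + 1] = norm(w);                                     // beta_{k+1} = ||w_k||
      if (beta[k + 1] == 0) break;             // breakdown: an invariant subspace was found
      qPrev = q;
      q = w;
      scale(q, 1.0 / beta[k + 1]);                               // q_{k+1} = w_k / beta_{k+1}
    }
    return new double[][] { alpha, beta };
  }

  static double[] multiply(double[][] A, double[] x) {
    double[] y = new double[A.length];
    for (int i = 0; i < A.length; i++)
      for (int j = 0; j < x.length; j++) y[i] += A[i][j] * x[j];
    return y;
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static double norm(double[] v) { return Math.sqrt(dot(v, v)); }

  static void scale(double[] v, double s) {
    for (int i = 0; i < v.length; i++) v[i] *= s;
  }
}
```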

▲ Figure 2. Parallel spectral clustering implementation process.

▲ Figure 3. Laplacian matrix construction.

The Lanczos iteration converts the Laplacian matrix L_{n×n} into a real symmetric tridiagonal matrix T_{m×m} (m < n), whose eigenvalues and eigenvectors can be easily computed by methods such as QR iteration. An <eigenvalue, eigenvector> pair of L, which is (λ_k, Q_{n×m} v_k), corresponds to the <eigenvalue, eigenvector> pair of T, which is (λ_k, v_k). In each iteration of the Lanczos algorithm, matrix-vector multiplication is the most intensive calculation, and it can be done in parallel via matrix partitioning (Fig. 4).
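This correspondence is a standard property of the Lanczos factorization; a hedged sketch of the relation (not taken from the paper), in the notation of Algorithm 3 and with v_k denoting an eigenvector of T, is:

```latex
% A hedged sketch of the Lanczos relation behind the eigenpair correspondence.
\[
  L_{\mathrm{sym}} Q_{n\times m} \;=\; Q_{n\times m} T_{m\times m}
      \;+\; \beta_{m+1}\, q_{m+1} e_m^{\top},
\]
\[
  T_{m\times m} v_k = \lambda_k v_k
  \;\Longrightarrow\;
  L_{\mathrm{sym}} (Q_{n\times m} v_k)
      \;=\; \lambda_k (Q_{n\times m} v_k) \;+\; \beta_{m+1} (e_m^{\top} v_k)\, q_{m+1},
\]
% so (lambda_k, Q v_k) approximates an <eigenvalue, eigenvector> pair of L_sym
% whenever the residual term |beta_{m+1}| |e_m^T v_k| is small.
```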

3.2.1 Mapper: Matrix-Vector Multiplication

Input: the global vector V, and a <key, value> pair, where key is the row index of the matrix, and value is the row content corresponding to the index in key.

Output: <key′, value′> pair, where key′ is NULL, and value′ is the row index and the multiplication result between value and the vector V.

3.2.2 Reducer: Construction of the Multiplication Result

Input: <key, value> pair, where key is NULL, and value is all the partial results.

Output: <key′, value′> pair, where key′ is NULL, and value′ is the final result vector of the matrix-vector multiplication.
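A hedged sketch of this step in the Hadoop 0.20 API is shown below. For illustration, the current Lanczos vector is assumed to be passed to every mapper through the job configuration as a comma-separated string (a real implementation at this scale would more likely use the distributed cache), sparse rows are assumed to be stored as space-separated "column:value" entries, and all class names are hypothetical. The driver would run one such job per Lanczos iteration; the remaining scalar and vector updates of Algorithm 3 are cheap and stay on a single node.

```java
// Sketch of the parallel matrix-vector multiplication step (Fig. 4).
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixVectorMultiply {

  // Mapper: one sparse matrix row at a time; emit <NULL, "rowIndex:partialResult">.
  public static class RowMapper extends Mapper<Text, Text, NullWritable, Text> {
    private double[] v;   // the global vector q_k

    @Override
    protected void setup(Context context) {
      // Hypothetical configuration key holding the broadcast vector.
      String[] parts = context.getConfiguration().get("lanczos.vector").split(",");
      v = new double[parts.length];
      for (int i = 0; i < parts.length; i++) {
        v[i] = Double.parseDouble(parts[i]);
      }
    }

    @Override
    protected void map(Text rowIndex, Text rowContent, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      for (String entry : rowContent.toString().split("\\s+")) {
        String[] cell = entry.split(":");   // "columnIndex:value"
        sum += Double.parseDouble(cell[1]) * v[Integer.parseInt(cell[0])];
      }
      context.write(NullWritable.get(), new Text(rowIndex + ":" + sum));
    }
  }

  // Reducer: collect every "rowIndex:partialResult" pair into the result vector w.
  public static class AssembleReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder w = new StringBuilder();
      for (Text partial : values) {
        if (w.length() > 0) w.append(" ");
        w.append(partial.toString());
      }
      context.write(NullWritable.get(), new Text(w.toString()));
    }
  }
}
```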

3.3 Parallel k-Means Algorithm

Computations of the distances between one object and the centers are not related to computations of the distances between other objects and the corresponding centers [7]. Therefore, the computation of distances between different objects and centers can be done in parallel. The new centers, which will be used in the next iteration, should be updated in each iteration. Therefore, the iterative procedures must be executed serially (Fig. 5).

3.3.1 Mapper: Associate Every Instance with the Nearest Center

Input: the global variable centers, and a <key, value> pair, where key is the index of one instance and value is the feature (i.e., the dimension values) corresponding to the instance in key.

Output: <key′, value′> pair, where key′ is the index of the nearest center of the instance, and value′ is the index and feature of the instance.

3.3.2 Combiner: Calculation of the Sum and Quadratic Sum of the Values in Each Dimension of the Instances Assigned to the Same Center, and of Their Number, on the Local Disk

Input: <key, value> pair, where key is the index of the center, and value is the index and feature of each instance assigned to the same center.

Output: <key′, value′> pair, where key′ is the index of the center, and value′ is the sum and quadratic sum of the values in each dimension of the instances with the same center, together with their number.

3.3.3 Reducer: Calculation of the New Centers and the Iteration Condition

Input: <key, value> pair, where key is the index of the center, and value comprises the sum and quadratic sum of the values in each dimension of the instances with the same center from all hosts, and the number of those instances.

Output: <key′, value′> pair, where key′ is the index of the new center, and value′ comprises the features representing the new center.
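A hedged sketch of one such iteration in the Hadoop 0.20 API is given below. The way the current centers are passed (a semicolon/comma-separated configuration string), the record formats, and all class names are assumptions made for illustration, and the quadratic sums used for the convergence test are omitted to keep the sketch short.

```java
// Sketch of one parallel k-means iteration (Fig. 5): assign, pre-aggregate, update.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParallelKMeansIteration {

  /** Merge "count|s1,s2,..." records into one record of the same form. */
  static String accumulate(Iterable<Text> values) {
    long count = 0;
    double[] sum = null;
    for (Text v : values) {
      String[] halves = v.toString().split("\\|");
      count += Long.parseLong(halves[0]);
      String[] parts = halves[1].split(",");
      if (sum == null) sum = new double[parts.length];
      for (int j = 0; j < parts.length; j++) sum[j] += Double.parseDouble(parts[j]);
    }
    StringBuilder sb = new StringBuilder().append(count).append("|");
    for (int j = 0; j < sum.length; j++) {
      if (j > 0) sb.append(",");
      sb.append(sum[j]);
    }
    return sb.toString();
  }

  // Mapper: assign each row y_i of the eigenvector matrix to its nearest
  // center and emit <centerIndex, "1|y_i">.
  public static class AssignMapper extends Mapper<Text, Text, IntWritable, Text> {
    private double[][] centers;

    @Override
    protected void setup(Context context) {
      String[] rows = context.getConfiguration().get("kmeans.centers").split(";");
      centers = new double[rows.length][];
      for (int c = 0; c < rows.length; c++) {
        String[] parts = rows[c].split(",");
        centers[c] = new double[parts.length];
        for (int j = 0; j < parts.length; j++) {
          centers[c][j] = Double.parseDouble(parts[j]);
        }
      }
    }

    @Override
    protected void map(Text index, Text features, Context context)
        throws IOException, InterruptedException {
      String[] parts = features.toString().split(",");
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        double dist = 0;
        for (int j = 0; j < parts.length; j++) {
          double diff = Double.parseDouble(parts[j]) - centers[c][j];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      context.write(new IntWritable(best), new Text("1|" + features));
    }
  }

  // Combiner: pre-aggregate per-dimension sums and instance counts on the local node.
  public static class PartialSumCombiner
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable center, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      context.write(center, new Text(accumulate(values)));
    }
  }

  // Reducer: merge the partial sums from all hosts and divide by the count to
  // obtain the center used in the next iteration.
  public static class NewCenterReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable center, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String[] halves = accumulate(values).split("\\|");
      long count = Long.parseLong(halves[0]);
      String[] parts = halves[1].split(",");
      StringBuilder newCenter = new StringBuilder();
      for (int j = 0; j < parts.length; j++) {
        if (j > 0) newCenter.append(",");
        newCenter.append(Double.parseDouble(parts[j]) / count);
      }
      context.write(center, new Text(newCenter.toString()));
    }
  }
}
```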

4 Empirical Analysis

We designed our experiments to validate the scalability and clustering quality of the parallel implementation in PSCI. Our experiments are based on both computer-generated benchmark networks and a real social network dataset of 1,628,853 vertices and 176,620,423 edges crawled beforehand from Sina Weibo. We ran the experiments on an IBM blade cluster with 10 compute nodes. All the nodes were identical, and each was configured with a CPU faster than 2 GHz and memory greater than 8 GB. Hadoop 0.20.0 and Java 1.6.0 were used as the MapReduce system for all experiments.

▲ Figure 4. Matrix-vector multiplication.

▲ Figure 5. Parallel k-means algorithm.

To test the performance of the parallel implementation, we use Modularity as an evaluation function. Modularity is a property of a network and a specific proposed division of that network into communities [8], [9]. It measures whether the division is good in the sense that there are many more edges within communities and only a few between them. The higher the Modularity score, the better the clustering quality. Generally, this score falls between 0.3 and 0.7 in social networks [9].

If G = (V, E) has the adjacency matrix A constructed as

$A_{uv} = \begin{cases} 1, & (u, v) \in E \\ 0, & \text{otherwise}, \end{cases}$ (9)

then we assume that the vertices are divided into communities so that vertex u belongs to community c_u. The Modularity Q is

$Q = \frac{1}{2m} \sum_{u,v} \left[ A_{uv} - \frac{d_u d_v}{2m} \right] \delta(c_u, c_v).$ (10)

Equivalently,

$Q = \sum_{i} \left( e_{ii} - a_i^2 \right),$ (11)

where e_ij (i ≠ j) is the fraction of edges that join vertices in community i to vertices in community j, and e_ii is the fraction of edges in the same community i. The δ function δ(i, j) is 1 if i = j and 0 otherwise; d_u is the degree of vertex u; a_i is the fraction of ends of edges that are attached to vertices in community i; and m is the number of edges in the graph.
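As a worked illustration of (11), the short sketch below computes the Modularity score from an undirected edge list and a community assignment. The method and variable names are hypothetical; this is not the evaluation code used in the experiments.

```java
// Sketch of the Modularity score Q = sum_i (e_ii - a_i^2) from an edge list.
public class Modularity {

  /**
   * edges[e] = {u, v} for each undirected edge; community[u] is the community
   * index of vertex u; k is the number of communities.
   */
  static double score(int[][] edges, int[] community, int k) {
    double m = edges.length;            // number of edges in the graph
    double[] eii = new double[k];       // fraction of edges inside community i
    double[] ai = new double[k];        // fraction of edge ends attached to community i
    for (int[] edge : edges) {
      int cu = community[edge[0]];
      int cv = community[edge[1]];
      ai[cu] += 1.0 / (2 * m);          // each edge contributes two ends
      ai[cv] += 1.0 / (2 * m);
      if (cu == cv) {
        eii[cu] += 1.0 / m;
      }
    }
    double q = 0;
    for (int i = 0; i < k; i++) {
      q += eii[i] - ai[i] * ai[i];
    }
    return q;
  }
}
```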

4.1 Experiments on Benchmark Networks

Benchmark networks are a standard testbed for community-detection algorithms [10]. Here, we generate the benchmark networks with parameters <k> = 120, β = 1, γ = 2, μ = 0.2, k_max = 1000, s_min = 500, and s_max = 2000. First, we test the performance of PSCI with network sizes ranging from 20,000 to 500,000. The results are shown in Fig. 6 and Fig. 7. Modularity is high for the different network sizes and falls between 0.4 and 0.7, which supports the results in [9]. Fig. 7 shows that PSCI has very good scalability: the computational time grows almost linearly once the network size exceeds 100,000. This means that PSCI scales well and handles massive datasets efficiently.

When determining the speedup with different numbers of machines on a benchmark network of 500,000 instances (Fig. 8), we see that in the beginning the implementation achieves nearly linear speedup as the number of machines increases. Then the speedup slows down because of the overhead of framework startup and the required intermachine communication.

4.2 Experiments on Real Networks

▲ Figure 6. Modularity of clustering results for different network sizes.

▲ Figure 7. Runtimes of PSCI for different network sizes.

▲ Figure 8. Runtimes of PSCI as the number of machines increases.

Here, we analyze the quality and speedup on a real social network of 1,628,853 vertices and 176,620,423 edges crawled from Sina Weibo. Each edge represents the following relationship between one vertex and another. Table 2 shows the Modularity and runtimes of processing the real social network on 10 machines. The visual clustering result using the adjacency matrix plot is shown in Fig. 9. The results indicate that our parallel implementation speeds up the clustering process without sacrificing quality.

▼ Table 2. Modularity and runtimes on the real network.

▲ Figure 9. A visual clustering result using the adjacency matrix plot of the real network. The horizontal and vertical coordinates represent the vertices, and the clear diagonal line indicates the good quality of the clustering.

In addition, the runtime reported in [4] for processing Picasa (a dataset of 637,137 images) on 16 machines is nearly 15,000 seconds; that work parallelizes the spectral clustering algorithm with MPI. By comparison, our implementation is faster for parallel processing because both the dataset and the spectral clustering computation are distributed.

5 Conclusion

The spectral clustering algorithm is one of the most important modern algorithms and has been shown to be more effective in community detection than many traditional algorithms. However, the growing amount of data in applications makes spectral clustering of massive datasets challenging. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. In our PSCI, both the computation and data storage are distributed, and this solves the problems of most of the existing algorithms mentioned at the outset of this paper. By empirically analyzing both benchmark networks and a real massive social network dataset, we show that the proposed implementation achieves high Modularity across different network sizes, that its computational time grows almost linearly, and that its speedup is nearly linear as the number of machines increases within a certain range. The implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
