Survey on Social Media Big Data Analytics*

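2017-01-18 08:14:59 DU Zhijuan, WANG Qiuyue, MENG Xiaofeng

Computer and Life, 2017, Issue 1

Keywords: user; information; model

DU Zhijuan, WANG Shuo, WANG Qiuyue, MENG Xiaofeng

School of Information, Renmin University of China, Beijing 100872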


DU Zhijuan, WANG Shuo, WANG Qiuyue, et al. Survey on social media big data analytics. Journal of Frontiers of Computer Science and Technology, 2017, 11(1): 1-23.

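As an important channel through which people spread information and express opinions, social media contains a wealth of useful information and has in recent years become one of the most representative sources of big data; mining and analyzing this information has a far-reaching effect on social development. Following the constituent elements of social media, this survey divides current research into three categories: user-based analysis, relationship-based analysis, and analysis based on interaction content. First, user-based work is reviewed in terms of identifying user identities across multi-source heterogeneous networks, discovering communities, and computing user influence. Second, analysis centered on interaction relationships is discussed from three angles: computing user relationship strength, information propagation, and influence maximization. Third, four problems based on users' interaction content are discussed: feature extraction and selection, topic and event mining, multimedia data analysis, and sentiment analysis. Finally, the challenges facing existing research and new perspectives for future research are identified in six areas: information propagation, influence computation, feature extraction and selection, microblog news mining, social media big data fusion, and cross-lingual sentiment analysis.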

Keywords: social media; big data; user behavior; interaction relationship; interaction content

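1 Introduction

Over the past decade or so, online social networks have become increasingly popular, including blogs, Flickr (whose main function is photo sharing), Facebook, Google+, LinkedIn, and microblogs with a strong media character. They have grown rapidly and allow users to connect, interact, share, and collaborate, creating a powerful new medium for communication and a platform for discovering and sharing information [1-2]. On average [3], each Facebook user spends 7.75 hours per month communicating with friends, and Facebook receives 3.2 billion posts per day; Twitter receives 340 million posts per day; more than 3 000 photos are uploaded to Flickr every minute; and blogs carry more than 153 million posts per year.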

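The rapid and deep development of social networks has made them increasingly large and complex. Current social networks have more than 100 million users and exceptionally large social graphs; the RenRen social graph [4], for example, has 753 300 edges, 27 400 visible interaction graphs, and 241 000 latent interaction graphs. Users interact continuously across different social media, and all kinds of information spread rapidly across multiple social networks. These characteristics pose great challenges for social network research. Although social networks take many forms, they all consist of users, relationships, and content. This paper therefore analyzes existing research from these three aspects, as shown in Fig.1.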

Fig.1 Typical characteristics of social media big data

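At the user level, active users are the core of a social network and drive its interactions. Users in social media can be divided into bloggers, followees, and followers; they can post, follow, retweet (RT), mention (@), reply, and comment, and the same user can take part in several social networks. User-centered research therefore focuses on three problems. (1) Identifying user identities and determining user roles in multi-source heterogeneous networks, which can draw on URLs, mentions, and similar signals. For example, URLs have been used to determine links to other social networks [5], and the in-degree and out-degree of @mentions have been used to distinguish users in different roles [6-7]; such work is very useful for fusing user information. (2) Birds of a feather flock together: when users in a social network interact over a period of time and develop a stable group structure, consistent behavioral characteristics, and a shared ideology, they form a community [8]. This is very useful for studying group characteristics and behavioral patterns. (3) Every field has its influential figures, and social networks are no exception; user influence computation [9] and opinion leader discovery [10] are widely applied in recommender systems, viral marketing, advertising, information propagation, expert finding, and other areas [11].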

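At the level of interaction relationships, users are connected by follow, propagation, and reciprocal relationships. Follow relationships arise from follower behavior and can be used for influence analysis [12]; they give rise to the weak-tie and clustering properties of the user network [13]. Propagation relationships arise from retweets, mentions, and embedded URLs and show stronger topical relevance [14]. Reciprocal relationships arise from comments and replies and are a special case of propagation relationships. The basic premise of this research is the propagation of information, and its value is mainly commercial and political. For example, by studying the propagation ability and authority of users and user groups, one can select users with strong reach and influence as the initial seed set so that information spreads as widely as possible. At the same time, the interests of each party are maximized to varying degrees, and competing parties can act on the breadth and depth of social network relationships to constrain the other side or advance their own interests [15-16].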

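At the level of interaction content, what users exchange is not only text but also a large amount of multimedia information such as geographic locations, images, and videos, and this information also carries sentiment. The value of social media therefore lies in three things. (1) Multimedia data can be analyzed using location information and the dynamic, time-sensitive nature of social media. (2) Analyzing sentiment in interaction content helps extract public mood and opinion in different domains; it can gauge the impact of opinion polls [17], explain and describe political events effectively [18], and predict stock trends [19]. However, microblog discussions are unconstrained in form and highly variable, and such interaction causes public sentiment to shift constantly, which makes the task harder. (3) Fragmented information can be linked and integrated. As vast numbers of people from different cultural backgrounds exchange ideas, originally fragmentary information becomes connected through topics and events and converges into streams of thought. These streams view a matter from many angles and so reveal it more fully. However, the short texts and multilingual background of microblogs [20], together with colloquial language, misspellings, abbreviations, and special symbols, make content understanding very challenging. Hashtags (#), retweets, mentions, and URLs can assist content analysis [21]. For example, hashtags are used to collect information on specific topics and events [5-6], to improve retrieval performance, and to support semantic analysis [14]; retweets are used to estimate topic interest or post importance [7,20]; mentions are used to find individuals with particular interests or views on a particular topic [22]; and URL counts are used to measure event popularity [14].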

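Social media big data thus holds a great deal of valuable information, and mining it faces many challenges. Sections 2, 3, and 4 of this paper review existing research on users, interaction relationships, and interaction content, respectively; Section 5 identifies open challenges and new problems.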

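2 User-Based Analysis

User-based research on social networks covers user identity identification in multi-source heterogeneous networks, community discovery, and user influence computation.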

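2.1 User Identity Identification

An online social network can be viewed as a heterogeneous information network whose information typically includes time, place, people, and events, and users often belong to several different social networks at once. Because of this heterogeneity, the same person appears somewhat differently in different networks, and identifying that person under these conditions has become a hot topic in heterogeneous social network research in recent years. Ref. [23] proposed a method for identifying users across heterogeneous social networks, as shown in Fig.2.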

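The main idea is an inference strategy for user matching: under a one-to-one matching constraint, an extended Jaccard coefficient and an extended Adamic/Adar measure are used to analyze multiple features such as text content, spatial distribution, and temporal distribution. Similarly, a synergistic partitioning model [24] can be used to identify the same identity across multiple large-scale social networks. This method draws on graph theory: it performs a balanced partition of a social network's topology so that matching partition patterns can be found in different networks and identities can then be aligned. Inspired by the interaction of forces and the principle of conservation of energy, Ref. [25] proposed COSNET, a model based on an energy function; it uses unsupervised pairwise network alignment and transitive integrated network alignment to address user matching in heterogeneous networks in terms of local consistency and global consistency, respectively.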

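All of the above target non-anonymized networks, but user identification in anonymized networks also matters in practice. Ref. [26] therefore designed an unsupervised multi-network alignment model for anonymized social networks that copes with missing user information and missing anchor links. In short, these methods take the characteristics of heterogeneous networks into account and mine what the same identity has in common across networks in order to identify it.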

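2.2 Community Discovery

A community is a set of individuals and social relationships with a stable group structure, consistent behavioral characteristics, and a shared ideology, formed by users interacting over a period of time. Within a community, user relationships are strong and cohesion is high; between communities, relationships are weak and dispersion is high [27]. The aim of community mining is to discover latent regularities in user behavior, group structure, and relationship patterns.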

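Community structures fall into two types according to users' social relationships and their interest in text content [27]. (1) User-centered community structures consist of a microblogger, followers, friends, and users sharing the same hashtags or interests. The central microblogger usually has considerable influence and acts as an opinion leader, while other users comment on and retweet the microblogger's posts; this structure weakens as the microblogger's prestige or the popularity of the posts declines. (2) Topic-centered community structures gather users who mostly share interests or hashtags around the content of a topic. Their discussions mostly concern time-sensitive, high-attention hot topics; members have equal status and are evenly distributed, and the structure disappears when the topic ends.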

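Early community partitioning was mainly static, using graph-clustering methods and similarity-based methods. Graph-clustering methods model a complex network as a graph, compute node similarity, and partition the network so that similarity is high within each subnetwork and the number of links between subnetworks is minimal; each subnetwork is taken as a community. Most such algorithms repeatedly bisect the network, seeking the optimal split at each step until subgraphs satisfying the required conditions are obtained. Well-known examples are the Kernighan-Lin algorithm [28] and spectral bisection based on the eigenvectors of the graph Laplacian [29]. Similarity-based methods decide whether to keep or delete edges according to the similarity between nodes or the strength of their connections, thereby reconstructing the groups in the network; the GN algorithm [30] and Newman's fast algorithm [31] are representative. In addition, a user may appear in different communities under different identities at the same time, which led to overlapping community discovery [32] and later to dynamic community discovery [33]. The latter partitions the network according to how information resources and network structure evolve, for example by dynamic programming at both the group and individual levels, or by partitioning according to constraints from the current community structure, historical evolution patterns, and the multi-community membership of individual nodes at a given moment.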

Fig.2 Schematic of user identity identification

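2.3 User Influence Computation

Influence computation measures the influence of a single user, usually represented by a node weight. It is currently studied from three angles: network topology, characteristics of individuals and their relationships, and information propagation structure. Methods based on network topology are listed in Table 1.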

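In terms of network topology, influence can be divided into the influence of nodes and the influence of edges. In Table 1, node degree comprises in-degree [34], out-degree [35], and degree centrality [36]. In-degree and out-degree are directional and indicate the direction of information flow: out-degree measures the influence of neighboring nodes on the current node, and in-degree the reverse, while degree centrality measures the current node's average influence on its neighbors. Closeness centrality [37] reflects the distance information travels from the current node to other nodes; it can measure the current node's indirect influence on other nodes as well as its own relationship strength. The higher the closeness centrality, the faster the current node influences others. Betweenness centrality [36] reflects the amount of information flowing through the current node; the larger the value, the more important the node is in the network. The PageRank algorithm [38] ranks the influence of the current node in the network; where a node's influence depends on that of other nodes, the two are positively correlated. The HITS (hyperlink-induced topic search) algorithm [39] considers both the authority and the hub score of a node, but does not consider how node influence is divided. The clustering coefficient [40] reflects the likelihood of connections forming among the current node and its neighbors; it is transitive and can be used to predict the likelihood of community formation.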
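Several of the topology-based measures above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the method of any cited paper: the graph is a plain dictionary, the function names are hypothetical, and the PageRank damping factor of 0.85 and the fixed iteration count are conventional choices that do not come from the survey.

```python
from collections import deque


def degree_centrality(graph):
    """graph: dict mapping each node to the set of its neighbours (undirected)."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph, source):
    """Number of reachable nodes divided by the sum of shortest-path distances."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives hop distances
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: dict mapping each node to a list of nodes it points to.

    Every target must itself be a key of out_links (assumption of this sketch).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v] or nodes  # dangling node: spread evenly
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank
```

On a star graph the hub scores highest on every measure, which matches the intuition that a node reaching others in one hop influences them fastest.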

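Some studies show, however, that information propagation is not necessarily tied to network topology, so influence computed from topology alone is not accurate enough [41-42]. User behavior and relationship characteristics can improve accuracy [43-45]. For example, Refs. [43-44] use the propagation frequency and reach of user actions such as retweets and mentions to measure the influence of the information a user posts. Work on information propagation structure mainly uses information propagation trees, studying their size, depth, and breadth. Examples include an influence ranking algorithm based on the topical similarity of users and information propagation trees [13], which is simple to apply but lacks objectivity and accuracy; measuring user influence by number of followers, number of retweets, number of mentions, and PageRank value [38]; and measuring topic-level influence in the heterogeneous Twitter network with a generative graphical model [46]. These studies show that measures taken from different angles give quite different results, whereas measures from the same angle give similar results. Research also shows that time plays an important role in influence computation [47-48].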

Table 1 User influence measures from network topology

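3 Relationship-Based Analysis

A social network is usually represented as a graph. Nodes represent the set of users and are either active or inactive, and node weights represent user influence; edges represent the set of user relationships, and edge weights represent relationship strength. Current work mainly studies information propagation on the basis of user relationships, which has practical value for managing public opinion, promoting products, and similar tasks.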

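3.1 User Relationship Strength Computation

User relationship strength characterizes the probability of interaction between users and is represented in a microblog network by edge weights. Research shows that many factors affect it, such as user type and behavior [49], network structure [50], and the content and syntactic features of microblog posts [51]. Typical computation methods are listed in Table 2.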

Table 2 User relationship strength computation methods

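From the standpoint of pure link analysis, relationship strength computation can be divided into three types: similarity computation, edge betweenness, and influence graphs. Similarity computation [52-53] usually uses Jaccard similarity, cosine similarity, or overlap similarity. Edge betweenness [30] is analogous to betweenness centrality except that it is defined on edges. An influence graph [38] represents the social network as a directed weighted graph in which the direction of an arc indicates the source of influence and its weight indicates the strength of influence, which is positively correlated with the arc's multiplicity.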
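The three similarity measures named above can be sketched as set operations on the neighbor sets of two users. This is a minimal illustration, assuming each user is represented by the set of accounts it is linked to; the function names are hypothetical and the set-based form of cosine similarity is one common variant.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two neighbour sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|), the cosine of two binary membership vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

For example, two users who follow {1, 2, 3} and {2, 3, 4, 5} share two of five distinct accounts, so their Jaccard similarity is 0.4, while the overlap similarity is 2/3 because it normalizes by the smaller set.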

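Depending on whether time is considered, relationship strength models are either static or temporal. Static models include the method of Ref. [54] and the latent variable model [55]. Ref. [54] uses the independent cascade model and real propagation data to formulate the problem as likelihood maximization, which is then solved by expectation maximization. The latent variable model computes relationship strength from the similarity of users' profile content and the interactions between users. Both methods are computationally expensive and unsuitable for large datasets. Temporal methods [44] add a link between theoretical time and actual time and usually adopt a continuous or discrete exponential decay model. Continuous-time relationship strength captures temporal dynamics, but the joint influence of users can only be computed non-incrementally, which does not scale to large data. Relationship strength approximated by a discrete time function was introduced for this reason; it allows the joint influence of users to be computed incrementally.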

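Research shows that even when users are not connected in the network structure, an indirect influence relationship exists between them as long as their interaction content shows influence. Ref. [56] therefore proposed the HF-NMF method based on historical interaction information, where the interaction information comprises information items and the relationships between users and information. Ref. [47] uses transfer entropy to quantify how interaction information evolves and thereby computes indirect influence between users.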

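3.2 Information Propagation

Information propagation models study how users in a social network forward and adopt information. On Twitter, for example, a user who retweets a message must first interact with the message itself, so the broadcast of an initial message creates a cascade of new notifications and posts; these objects are called information cascades [57]. Propagation models are divided into opinion dynamics (OD) models and game-theoretical (GT) models. Table 3 compares the models.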

Table 3 Information propagation models

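Opinion dynamics models include cascade models, threshold models, and epidemic models. In the cascade model, an inactive node u becomes active as soon as any one of its neighboring nodes v succeeds in activating it, which happens with probability p_{u,v}; if the attempt fails, v can never try to activate u again. Each p_{u,v} is treated as independent. In the linear threshold model, u becomes active when the combined activating power of its active neighbors exceeds u's activation threshold θ_u. In both the cascade and threshold models, an active node cannot revert to the inactive state. Extensions also exist: Ref. [58] replaces p_{u,v} in the independent cascade model with an incremental function p_v(u, F_v), where F_v denotes the set of neighboring nodes whose activation attempts have failed, and Ref. [63] found that p_v(u, F_v) decreases as the number of failed activations grows. Ref. [58] also replaces the threshold test with a threshold function f_u(A_v), where A_v denotes the set of neighbors activated at the previous time step.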
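The two basic models can be simulated in a few lines. The sketch below is illustrative only: the data layout (an adjacency dictionary, a dictionary of edge probabilities, a dictionary of incoming weights) and the function names are assumptions, and the threshold test uses `>=`, where a strict inequality would be an equally simple variant.

```python
import random


def independent_cascade(graph, probs, seeds, rng=None):
    """graph: dict node -> list of out-neighbours.
    probs: dict (v, u) -> probability that active v activates u.
    Each newly active node gets exactly one attempt per inactive neighbour.
    """
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in graph.get(v, []):
                if u not in active and rng.random() < probs[(v, u)]:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active


def linear_threshold(in_weights, thresholds, seeds):
    """in_weights: dict u -> dict v -> weight of neighbour v's influence on u.
    thresholds: dict u -> activation threshold of u.
    A node activates once the summed weight of its active neighbours
    reaches its threshold; active nodes never deactivate.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for u, weights in in_weights.items():
            if u in active:
                continue
            total = sum(w for v, w in weights.items() if v in active)
            if total >= thresholds[u]:
                active.add(u)
                changed = True
    return active
```

With all edge probabilities set to 1 the cascade reduces to graph reachability from the seeds, which is a convenient sanity check before using estimated probabilities.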

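The epidemic (virus propagation) model holds that as long as u has at least one infected neighbor, u is infected (activated) with a fixed probability p, and after some time u can return to the susceptible (inactive) state. Nodes in the immune state are neither infected nor able to infect others. The probability p is fixed; it generally depends only on the information itself and not on user relationships. Take a rumor spreading on Twitter: if the rumor reaches a susceptible user, that user becomes infected with probability α; if it reaches an already infected user, that user becomes immune with probability β; otherwise, when neighbors no longer send it the rumor, it continues to spread the rumor with probability 1-β.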
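A discrete-time rumor simulation in the spirit of this description might look as follows. It is a simplified sketch, not the model of any cited paper: the synchronous update, the state labels "S", "I", and "R", and in particular the rule that a spreader becomes immune when it contacts someone who already knows the rumor are assumptions made for illustration.

```python
import random


def simulate_rumor(graph, seeds, alpha, beta, steps, rng=None):
    """graph: dict node -> list of neighbours (undirected, listed both ways).
    alpha: probability that a susceptible contact becomes a spreader.
    beta: probability that a spreader turns immune after contacting
          someone who is not susceptible.
    Returns a dict node -> "S" (susceptible), "I" (infected), or "R" (immune).
    """
    rng = rng or random.Random(0)
    state = {v: "S" for v in graph}
    for s in seeds:
        state[s] = "I"
    for _ in range(steps):
        new_state = dict(state)  # synchronous update from the old state
        for v in graph:
            if state[v] != "I":
                continue  # susceptible and immune nodes do not spread
            for u in graph[v]:
                if state[u] == "S":
                    if rng.random() < alpha:
                        new_state[u] = "I"
                elif rng.random() < beta:
                    new_state[v] = "R"
        state = new_state
    return state
```

Setting beta to 0 recovers a pure spreading process, while larger values of beta make the rumor die out sooner because spreaders lose interest after meeting informed neighbors.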

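The defining feature of opinion dynamics models is that simple rules and behaviors must be predefined, which costs the models their dynamism and flexibility and means they consider only a single factor. Game-theoretical models were introduced in response. They hold that node u should take all of its neighbors and the information content into account at once so as to maximize its own payoff. For example, the stochastic best-response dynamics model chooses actions according to the probability distribution of each action's future utility [30], while in Ref. [62] the future utility p_{i,β}(y_i | x_{N(i)}) arises from the interaction between the current node and its neighbors. The difference from threshold models is that a user chooses an action by weighing the payoffs of the available strategies and maximizing its own payoff rather than by comparing against a threshold, which is more flexible.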

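Both classes of models are theoretical propagation models: they simulate information propagation purely in theory, and the time steps in them are abstract intervals rather than real time. Propagation models in which relationship strength is computed from real data were therefore developed. They model the propagation process using properties of the information itself, user relationships, factors external to the microblog network, and more, and predict both the dynamics of propagation and the propagation behavior of individual users. There are two main lines of research: (1) taking the network as a whole, predicting the speed, scope, breadth, and depth of information diffusion [7,42,64]; and (2) taking the individual as the starting point, predicting the probability that a given user propagates a given piece of information and from there studying propagation across the whole social network [7,65].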

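3.3 Influence Maximization

Influence computation concerns a single user node, whereas the influence maximization problem [66] involves many users in the network and considers their joint influence. It uses an information propagation model to assemble a set of users that influences as many other users as possible, so that information spreads as far as possible. It is an important research problem for online social networks, and work on it can be divided into traditional and new influence maximization problems.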

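Traditional influence maximization concerns a single piece of information. The main approaches are approximate greedy algorithms based on information propagation models, heuristic algorithms, and hybrid algorithms [58,67-69], together with improvements to their scalability [70-73]. Among greedy algorithms, those based on the independent cascade and linear threshold models avoid the situation in which several pieces of information reach a node simultaneously, whereas methods based on degree centrality and distance centrality confine diffusion to a local group and cannot reach the whole network. Heuristic algorithms consider only network structure and ignore the dynamics of diffusion in the network. Hybrid methods, for example, compute influence with the cascade and threshold models and use a greedy approximation heuristic to select the k best initial seeds so that influence spread is maximized.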
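The greedy seed selection idea can be sketched as follows. This is a hedged illustration, not the algorithm of any specific cited paper: it assumes an independent cascade with one uniform activation probability `p`, estimates expected spread by Monte Carlo simulation, and uses hypothetical function names and default values.

```python
import random


def simulate_ic(graph, p, seeds, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in graph.get(v, ()):
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return len(active)


def expected_spread(graph, p, seeds, runs, rng):
    """Average cascade size over `runs` simulations."""
    return sum(simulate_ic(graph, p, seeds, rng) for _ in range(runs)) / runs


def greedy_seeds(graph, p, k, runs=200, seed=0):
    """Pick k seeds, each time adding the node with the largest estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {u for targets in graph.values() for u in targets}
    chosen = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for v in sorted(nodes - set(chosen)):
            spread = expected_spread(graph, p, chosen + [v], runs, rng)
            if spread > best_spread:
                best, best_spread = v, spread
        chosen.append(best)
    return chosen
```

Every round re-simulates the cascade for each remaining candidate, which is the cost that the scalability improvements mentioned above aim to reduce.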

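New influence maximization problems include competitive, minimum-cost, and adaptive influence maximization. Competitive influence maximization concerns several mutually influencing pieces of information that spread at the same time, such as new-product information from different brands or manufacturers, or rumors and credible reports about the same event. For each piece of information, the question of how to choose an initial node set from its own standpoint so that its influence is maximized is called the competitive influence maximization problem. It was first addressed in Refs. [61,74-75], which proved that maximizing a competing group's influence with the competitors' initial seeds as prior knowledge is NP-hard and submodular, and designed two hill-climbing algorithms. Ref. [76] studied a similar problem using the linear threshold model and the general threshold model. Refs. [15-16] studied influence maximization for advertising campaigns in social networks. These studies share the drawback of making two rather unrealistic assumptions: (1) once the current user has chosen seeds, it is no longer aware of new competitors; and (2) the current user knows its competitors' strategies and selects seeds from target users who have already been given free samples. In response, Ref. [76] proposed a new framework based on game theory [77] with three more realistic assumptions: (1) a competitive network contains r influence-maximizing groups, each of which independently selects k seeds under the same strategy; (2) each group is aware that competitors exist but not of the strategies they use; and (3) during propagation, once a user has been influenced by some group, it is no longer influenced by any other group.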

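Minimum-cost influence maximization aims to determine the smallest number of seed users able to trigger a wide cascade of information propagation [78]. Early work was confined to a single network, but considering propagation in only one network reduces accuracy, because a user can belong to several social networks such as Twitter and Facebook and spread the same information in each. Recently, influence maximization across multiple social networks has been studied [78]; it maps multiple networks onto a single network using lossless and lossy coupling schemes. The lossless scheme preserves all properties of the original networks and yields high-quality solutions, whereas the lossy scheme takes running time and memory consumption into account.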

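All of the above work uses a non-adaptive setting, in which the marketer must select all seed users up front, give them free samples, and so on. The marketer is thus forced to rely solely on the propagation model to choose every seed, and if some chosen seeds perform poorly there is no opportunity to choose better ones. Ref. [79] therefore proposed adaptive influence maximization with two adaptive offline strategies, MaxSpread and MinTss. Given a seed budget and a time horizon, MaxSpread maximizes the influence spread; given a time horizon and an expected number of influenced target users, MinTss minimizes the number of seeds required.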

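4 Content-Based Analysis

Text is the core of social media data [80]; research on it covers text feature extraction and selection, topic mining, and event and news detection.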

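4.1 Text Feature Extraction and Selection

Raw collected text is loosely organized, and using it directly harms the accuracy of text analysis [81-82]. Preprocessing applies feature extraction and feature selection to organize documents into a fixed number of predefined categories; typical techniques are shown in Fig.3.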

Fig.3 Document preprocessing techniques

4.1.1 Feature Extraction

Feature extraction methods fall roughly into three classes [81]: morphological analysis, syntactic analysis, and semantic analysis.

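Morphological analysis mainly converts a document into a sequence of words (with punctuation removed) and comprises tokenization, stop-word removal, and stemming. Tokenization strips punctuation from a document and splits it into a sequence of words [83]. Stop-word removal discards words such as "the", "a", and "or", mainly to reduce the number of words in a document and so improve the efficiency and effectiveness of text processing [84]. Stemming reduces a word to its root form, as in "talking" → "talk"; typical stemming algorithms include brute-force, suffix-stripping, affix-removal, and n-gram [81].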
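The three morphological steps can be chained in a short Python sketch. It is deliberately naive and makes several assumptions: the stop-word list and suffix list are tiny illustrative samples, the tokenizer handles only lowercase ASCII letters and digits, and the suffix-stripping rule is a toy rather than a full stemmer such as Porter's.

```python
import re

# Tiny illustrative lists; a real system would use far larger ones.
STOP_WORDS = {"the", "a", "or", "and", "is", "to"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]
```

Both "talking" and "talked" reduce to "talk" here, so later frequency counts treat them as the same feature.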

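Syntactic analysis examines the logical meaning of a sentence; typical methods include part-of-speech (POS) tagging and parsing. POS tagging assigns a lexical category to each word according to the grammatical context of the word in the sentence, to support linguistic analysis. Typical POS tagging techniques are rule-based morphological analysis and stochastic models such as the hidden Markov model (HMM) [85]. The HMM is a stochastic tagging technique used mainly to find the most likely POS tags for an input word sequence. Parsing [86] detects the grammatical structure of a sentence and usually uses a parse tree to analyze word order.

Semantic analysis seeks to understand the meaning of a sentence and includes keyword spotting and semantic network techniques. Keyword spotting extracts useful content from text, usually with the help of a semantic lexicon such as WordNet-Affect; it can be used for sentiment analysis, but it depends on words that appear explicitly in the text. For example, a report that many people died in a plane crash conveys sadness, but because the word "sad" does not appear, keyword spotting cannot detect the emotion. Semantic network techniques were introduced to remedy this; they represent concepts, events, and the relationships among them [86-87], and rely on the background information of words rather than on explicit keywords.

4.1.2 Feature Selection

Feature selection removes irrelevant and redundant information from the target text, mainly by selecting important features according to the importance scores of words in a document [88]. The main approaches are frequency-based methods, latent semantic indexing (LSI), and random mapping. The most common measures are frequency-based, such as TF/IDF, which in its standard form is defined as follows:

$$\mathrm{TFIDF}(t,d) = \mathrm{TF}(t,d) \times \log\frac{|D|}{\mathrm{DF}(t)}$$

where D is a set of documents, TF(t,d) is the frequency of term t in document d, and the document frequency DF(t) is the number of documents in which t occurs. IDF scales down term weights to lessen the effect of frequently occurring terms, but it is ill-suited to microblog data [89], for four reasons. (1) Posts are so short that term frequencies are usually close to 0 or 1 and cannot distinguish one post from another [89]; moreover, a frequency weight expresses only a term's importance within the collection, not its importance with respect to time, so it cannot adequately capture time-sensitive topics. (2) Noise and rare words receive higher IDF scores, whereas a topic word that appears in many posts is thereby shown to be more important, so down-weighting such topic words may degrade performance [90]. (3) The high throughput of posts makes computing global weights impractical. (4) Term-frequency techniques do not capture term order, so information is lost in processing.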
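As a concrete illustration of frequency-based weighting, the sketch below computes TF/IDF weights for a small collection of tokenized posts. It assumes the common form TF(t, d) × log(|D| / DF(t)) with raw counts and the natural logarithm; the function name and data layout are hypothetical, and other TF/IDF variants differ in smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one dict of term -> weight per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights
```

A term that occurs in every document receives weight 0, which shows the concern raised in point (2): a topic word shared by many posts is pushed down even though its ubiquity is what marks the topic.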

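LSI [91] tends to improve lexical matching, while random mapping builds a map from a large document collection; any selected region of the map can be used to retrieve new documents on similar topics.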

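4.2 Topic and Event Mining

An event is something that happens at a particular time and place and has causes and consequences, while a topic is a larger event made up of all directly related events [92]. The main task of topic mining is topic detection and tracking (TDT), which uses retrospective tracing of historical events and automatic online recognition of new events [93]. TDT has been studied extensively, and topic detection for full news reports [93-94] and blogs [95] in particular has achieved good results. However, because microblogs are complex in format, short, and informal in language, TDT techniques cannot simply be applied to them [20]. The following subsections cover topic models, topic summarization, and topic detection and tracking.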

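4.2.1 Topic Models

Topic models identify the latent semantics of text. Typical static topic models are as follows. (1) The vector space model represents words as vectors; it is convenient to compute but lacks semantic associations, and new words, polysemous words, and variant-sense words are very challenging for lexical chain models that rely on third-party dictionaries or language resources. (2) Graph models take full account of contextual semantic relationships and make up for the missing semantic information of traditional topic models, but in practice they are computationally expensive and require large storage. (3) Probabilistic models, typified by LDA (latent Dirichlet allocation) [96], represent latent topics with a three-level Bayesian structure and generalize well, but they too are not well suited to sparse data and short texts. This led to the L-LDA (labeled LDA) model [97] and the Twitter-LDA model [98], which target single-topic microblog posts. L-LDA mainly takes hashtags into account and has been applied to modeling tweet ranking and user recommendation tasks [97]. The processing flow by which L-LDA and LDA discover latent topics is shown in Fig.4.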

Fig.4 Labeled LDA and its generation process

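Twitter-LDA [98] rests on the finding that a single tweet is usually about a single topic. Various studies also show that the brevity of tweets makes LDA poorly suited to Twitter. One way around the problem is to aggregate tweets to provide more context; tweets can be grouped by terms according to content [99], by topic [13], or with the author-topic (AT) model [100]. However, research shows that applying the AT model directly yields no significant improvement over simple term-based methods, and that aggregation by content performs better than aggregation with the AT model.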

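Microblog topic data has the character of a data stream and evolves continuously over time, which has given rise to dynamic topic evolution methods [101]. The main ones are: (1) treating a document's time information as one dimension of its topic features and building a dynamically evolving topic model on the traditional vector space model; (2) mining topic evolution in strength and content with probabilistic topic models, mainly by computing the posterior distributions of time information with respect to topics, documents [102], and terms [103]; and (3) mining topic evolution from how existing term features and newly added term features change. Methods (1) and (2) have clear weaknesses in dynamic updating and only use mean-based generalization to extend the features of an evolving topic incrementally, which hurts accuracy; method (3) improves the accuracy of topic association and effectively addresses the skew problem in topic evolution.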

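4.2.2 Topic Summarization

Topic summarization aims to generate a summary automatically from multiple posts on the same topic, to help readers grasp the topic's core meaning. Research on microblog topic summarization falls roughly into two classes: summaries of topic events and summaries for information retrieval. The main approaches are as follows. (1) Extractive summarization, for example based on relevance, maximal marginal relevance, or correlation coefficients combined with maximal content coverage [104]; its drawback is that efficient algorithms give poor summaries while high-quality algorithms are computationally expensive. (2) Abstractive (understanding-based) summarization, which differs from the first approach in that the summary is not taken entirely from the original text: a meaning representation is obtained through semantic understanding and the summary is generated from it, for example by using a hidden Markov model to discover the hidden states of a Twitter event and summarize all posts on the topic [105]. (3) Summarization by keyword selection and sequencing, for example extracting named entities, times, event phrases, and types as the summary [106], or sequencing terms with a graph model [107]. These methods mainly target event topics; summarization methods driven by user queries are rare.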

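Query-driven summarization, largely absent from current microblog work, has to select and organize summary content according to the user's query and the characteristics of microblogs. Moreover, in existing work summaries are mostly simple listings of words, phrases, sentences, or messages; they give little consideration to the relationships among posts, and their organization and presentation are poor. Suitable strategies for organizing and presenting microblog summaries still need to be explored.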

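4.2.3 Topic Detection and Tracking

Topic detection and tracking comprises online new event detection and historical event tracing [93,108]. The task of online new event detection is to identify events in real time from media feeds [93]; the task of historical event tracing is to identify previously unknown events in documents accumulated over time [108].

New event detection has long been studied. Early work used traditional event detection methods, while most current methods are feature-based, detecting new events through burstiness, trend analysis, and similar signals. Table 4 compares the existing literature.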

Table 4 New event detection methods based on burst detection and trend analysis

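To overcome the speed and efficiency limits of classical event detection on massive data, Ref. [109] proposed a constant-time, constant-space approach. Its main strength is that it combines locality-sensitive hashing (LSH) with a variational inference strategy, an approximation that can detect previously unseen events. Ref. [110] uses incremental online clustering to detect similar topics in posts and classifies the clusters with a support vector machine (SVM) using temporal, social, and local features. The temporal features characterize the volume of frequent terms in a cluster; the local features capture user interactions within a message cluster, such as retweets, replies, and mentions; and the topical features describe the local coherence of a cluster, since event clusters revolve around a central topic. Ref. [111] proposed a unified pipeline for event detection, tracking, and summarization: it first detects events by clustering topic words, then casts the tracking of related events as a bipartite graph matching problem, and finally turns the tracked event chains into summaries that users can readily understand. Refs. [112-113] cast the problem as graph partitioning and detect events with wavelet analysis; Ref. [112] builds wavelet signals from words, whereas Ref. [113] uses hashtags as the wavelet signals and detects events through hashtag co-occurrence. Ref. [114] proposed Twevent, a segment-based event detection system for tweets, where a segment is one or more consecutive words in a tweet. It uses a non-overlapping k-nearest-neighbor graph and an n-gram technique over consecutive text segments based on generalized symmetric conditional probability (SCP). Its distinctive feature is that it treats bursty segments as event segments rather than relying on bursty words or topics. Refs. [116-117] detect specific kinds of events: Ref. [116] uses supervised classification to detect events such as earthquakes, typhoons, and traffic accidents, while Ref. [117] presents TEDAS, which uses predefined rules to detect crime- and disease-related events on Twitter. Ref. [118] proposed three alternative regression and learning models based on gradient boosted decision trees for detecting controversial events. Ref. [104] extended Ref. [118] to allow entities to be ranked. Ref. [106] proposed TwiCal, an open-domain event extraction and classification system for Twitter. It trains a named entity tagger on 800 randomly selected tweets to extract named entities, extracts event mentions with an existing Twitter-tuned part-of-speech tagger, and then classifies the extracted events with the latent variable model LinkLDA. Ref. [12] proposed the TwitterStand system, which uses location information and a clustering algorithm based on weighted term vectors to pick up breaking news from posts automatically.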

Historical event tracing has also been widely studied; Table 5 compares the approaches.

Table 5 Event tracing methods

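Working at the message level, Ref. [119] uses conditional random fields to extract field values such as location, a factor graph model to capture interactions between decisions, and variational inference to improve the efficiency and effectiveness of prediction over massive message volumes. Ref. [120] uses term frequency analysis and location co-occurrence to improve recall, and Ref. [121] extends this method to different social media networks. Ref. [124] proposed the ETree system, which first uses an n-gram technique to group short messages into semantically coherent information blocks, then uses incremental hierarchical modeling to build a temporal theme structure at different granularities, and finally uses temporal analysis to identify the underlying causal relationships between information blocks. Ref. [125] detects events from Flickr photos through user-annotated tags in three steps: (1) event tag detection, which uses the temporal and spatial distributions of tags to find event-related tags and applies a wavelet transform to reduce noise; (2) event generation, which detects the characteristics of distribution patterns and clusters event-related tags; and (3) event photo identification, which clusters the photos related to each event according to its tags.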

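4.2.4 Microblog News Detection

Microblogs now have a major impact on journalism. In major political events and emergencies, journalists use microblogs to interact with audiences and follow news as it develops [126]; they can also post personal views, which gives news reporting additional context and additional transparency [127]. Moreover, Twitter has repeatedly been shown to function as a news medium [128]. Detecting news topics in microblogs differs both from news aggregation and from traditional topic and trend detection. News aggregators such as Google News and Yahoo News focus on news articles, which are rich in news topics and contain little noise. Topic and trend detection identifies and merges topics in posts without checking whether a topic relates to a real event. News topics nevertheless originate from topics, so topic detection is the fundamental task of microblog news topic detection. Burst detection [129], topic trends [130], and information monitoring are also indispensable. Current tools for detecting and tracking microblog information are listed in Table 6 [131].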

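There are also some influential microblog news detection systems. Eddi, an interactive topic browsing system for Twitter [132], works only on a single user's stream and does not process the public tweet stream. Another typical system is TwitterStand [12], shown in Fig.5.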

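Architecturally, TwitterStand and TwitInfo have the more complete feature sets, covering tweet crawling, event identification, and topic identification. TwitterStand [12] performs online clustering on term frequencies to find topics and periodically merges duplicate clusters. It collects microblog news well but is not fully automatic: its topic detection performance depends on preselected seeds. It lets users browse news by region and suits applications with a geographic dimension. TwitInfo [133] helps users browse user-specified events; it detects topics by finding peaks in the time series of tweet frequency and can apply sentiment analysis to help users visualize the key points of an event. Both systems use TF-IDF weights to reduce the effect of popular terms.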
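Peak detection on a tweet-frequency time series can be sketched with exponentially weighted running statistics. This is a simplified illustration and not TwitInfo's actual algorithm: the smoothing factor `alpha`, the threshold multiplier `tau`, the floor `min_dev`, and the function name are all assumptions chosen for the example.

```python
def detect_peaks(counts, alpha=0.125, tau=2.0, min_dev=1.0):
    """counts: tweet counts per time bin, oldest first.

    Flags bin i as a peak when its count exceeds the running mean by more
    than tau times the running mean deviation (floored at min_dev so that a
    perfectly flat history does not make every small rise a peak).
    Returns the list of peak indices.
    """
    if not counts:
        return []
    mean = float(counts[0])
    mean_dev = 0.0
    peaks = []
    for i in range(1, len(counts)):
        c = counts[i]
        if c > mean + tau * max(mean_dev, min_dev):
            peaks.append(i)
        # Update the running statistics after testing the current bin.
        mean_dev = alpha * abs(c - mean) + (1 - alpha) * mean_dev
        mean = alpha * c + (1 - alpha) * mean
    return peaks
```

Because the statistics are updated incrementally, the detector needs constant memory per tracked keyword, which suits a continuous stream.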

Table 6 Existing information discovery and tracking tools for Twitter

Fig.5 System architecture of TwitterStand

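4.3 Mining and Analysis of Multimedia Data

Multimedia data arises from specific problems in specific domains. It is analyzed here from three aspects: the use of location information, the dynamic and time-sensitive nature of social media, and the deep semantics present in social media big data.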

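4.3.1 Geographic and Location Information Analysis

Understanding and discovering patterns of human mobility is especially important for traffic management, urban planning, security management, and similar areas, and the recent emergence of GeoSM (geo-tagged social media) has made such problems easier to study. Geo-tags in social media can be used to learn people's locations, but mobility information is often scattered and people's ability to move is limited. To deal with this tag sparsity, Ref. [134] groups users with tag information as a reference and then builds HMMs to perform grouping and mobility modeling at the same time. People with similar movement behavior are placed in the same group, and their movement trends can be predicted from the group model, so that sensible grouping and effective location-based guidance are both achieved. Ref. [135] proposed the Geo-SAGE system, which uses a sparse additive generative model for spatial item recommendation. Built on the probabilistic mixture model LCA-LDA, the model learns a user's spatial preferences from the user's locations and visit frequencies; a spatial pyramid structure then combines individual preferences with the crowd preferences of geographic regions to mine overall preference patterns and so recommend effectively to users. For atmospheric environment monitoring, Ref. [136] proposed mining spatiotemporal co-evolving patterns from the spatiotemporal tags of geo-sensors. Data from spatial sensors are first filtered with a wavelet transform and time information is attached to the spatial information, giving the evolution pattern of each sensor; these individual patterns are stored in a data structure called the SCP search tree; and a search algorithm over this structure then discovers co-evolving patterns under spatial constraints, so that changes in the spatiotemporal relationships of the atmospheric environment can support more effective correlation analysis and sound remediation.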

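4.3.2 Analysis of the Dynamics and Timeliness of Social Media

The most typical example is a news event or news report: it not only has a time and place of occurrence but may also generate many social topics and social effects, all of which change over time. Ref. [137] proposed EKNOT, a system based on online news analysis. Within a given time range it extracts news events from Google News and finds comments and topics about each event on Twitter; it then provides a description of the event, a timeline of its development, the entities involved and the relationships among them, and statistics on comments and attitudes toward the event. The whole can be seen as a linked process that mostly uses common natural language processing methods, but the system needs some time to complete the analysis. Ref. [138] combined real-time analysis with geographic information in the GeoBurst system. GeoBurst discovers bursty local events online and in real time from the tweet stream and extracts geo-tagged news topics; it then clusters them to find other bursty topics related to the same topic and location, and builds a timeline of the bursty event to better support tasks such as public opinion analysis. GeoBurst was also compared with two existing systems, EvenTweet and Wavelet; experiments on the same datasets showed that GeoBurst finds bursty events more accurately, that is, it is more sensitive to burstiness. Ref. [139] further observed that a news event spawns many derivative topics from different angles as time passes and that the dynamics of the event itself are diverse, with many subtopics under one theme. It therefore proposed a hierarchical event facet model to help detect events and determine their themes, although this is built mainly on a batch processing system. Ref. [140] proposed a multi-view clustering algorithm that uses a two-stage random walk strategy to build a dynamic central topic model, which describes how themes change over time rather than only their different facets.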

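4.3.3 Mining Deep Semantics in Social Media Big Data

Mining deeper semantics from news, video, images, and other social media is receiving growing attention. For the news event and topic discovery problem discussed above, Ref. [141] proposed an opinion bias discovery model that goes beyond finding topics or events: it follows how a topic or event changes to determine whether an extreme tendency has appeared between the old and new states. The problem resembles sentiment analysis, but sentiment must be classified and abnormal or extreme arguments identified on that basis. Ref. [142], besides using many recent natural language processing techniques for entity recognition and coreference resolution, uses entities and external knowledge bases to uncover user preferences and on that basis to estimate which news items or events will become more popular. Ref. [143] does the same for a different medium, focusing on image-rich media such as television and film; beyond the view count feature, it adds methods such as image correlation analysis to introduce a deeper analysis of influencing factors and so judge topic popularity more reliably. Ref. [144] analyzes human behavior: alongside a general model of human behavior, it argues for making full use of many media, including film scripts, video clips, and fan discussions, to mine behavior patterns, build multiple models of human behavior, and integrate them into a behavioral semantic framework.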

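Information mining in social media has also given rise to many applications. Ref. [145] detects spam by analyzing the relationship between senders and messages; Ref. [146] uses geographic location information to help taxi drivers plan routes in real time; and Ref. [147] combines YouTube and Twitter to discover viral marketing patterns. In short, researchers have begun to use the semantic information in social media to improve social services and strengthen network security; these applications are not discussed further here.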

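4.4 Sentiment Analysis

Sentiment analysis, also called opinion mining, aims to identify and extract from a corpus the attributes, aspects, and implicit subjective information about a particular subject, relative to an opinion target. The opinion target is usually called an entity and can be a person, an event, or a topic; it is associated with aspects and sub-aspects, each of which has its own set of sentiment attributes. Microblog sentiment analysis can extract public mood and opinion in different domains; it can gauge the impact of opinion polls [17], explain and describe political events effectively [18], and predict stock trends [19]. A variety of sentiment analysis techniques, the high density of sentiment-bearing words, and informal words (such as "coooool") help in classifying microblog sentiment [148]. The challenges of sentiment analysis and the existing work are analyzed and summarized in detail in the report [149] and the monograph [150]. Multidimensional measures of sentiment are still lacking, however, and the multi-relational nature of microblogs together with the evolution of topics causes sentiment itself to evolve dynamically; as microblog data streams grow rapidly, this problem also needs attention.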
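How informal, elongated words such as "coooool" can still contribute to sentiment classification is shown in the following lexicon-based sketch. It is a toy under stated assumptions: the six-word lexicon, the normalization rule, and the function names are hypothetical, and real systems use much larger lexicons or trained classifiers and handle negation, which this sketch does not.

```python
import re

# Tiny illustrative polarity lexicon (+1 positive, -1 negative).
LEXICON = {"cool": 1, "good": 1, "love": 1, "bad": -1, "sad": -1, "hate": -1}


def normalize(token):
    """Map an elongated token such as 'coooool' back to a lexicon entry if possible."""
    # First squeeze runs of three or more identical characters down to two.
    squeezed = re.sub(r"(.)\1{2,}", r"\1\1", token)
    if squeezed in LEXICON:
        return squeezed
    # Otherwise try squeezing every repeated run down to a single character.
    single = re.sub(r"(.)\1+", r"\1", token)
    return single if single in LEXICON else squeezed


def sentiment(text):
    """Sum the polarity of all tokens; positive totals suggest positive sentiment."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(normalize(t), 0) for t in tokens)
```

The two-stage squeeze matters because some lexicon words legitimately contain a doubled letter ("cool", "good") while others do not ("bad").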

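5 Research Outlook

As the analysis above shows, social media has attracted wide attention and has produced a body of research results, but as society develops and needs change, mining social media big data faces new challenges.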

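(1) Capturing information propagation effects

Characterizing information propagation effects in social media networks is a complex problem. Propagation is shaped jointly by factors intrinsic to the information, social factors, and factors external to the network, and the attributes of users and of the information also affect one another, so reflecting propagation effects accurately and comprehensively is key. Solving this problem further depends on influence, user relationship strength, and propagation patterns.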

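① Measuring influence by retweet count, or studying influence from a single isolated angle, neither characterizes information propagation well nor fully reflects a user's influence. Network topology needs to be combined with information propagation trees, considering not only the size of the tree but especially its depth and breadth.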

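② Information propagation is a dynamic process, so the dynamics of user relationship strength and propagation relationships must be captured. Theoretical propagation models are generally used at present, but the relationship strengths they produce are detached from reality, and theoretical time still has to be tied to real time. One option is to mine relationship strength from historical propagation data, connecting theoretical models with real data so that they have practical value, and to exploit the group characteristics of social media data by using dynamic communities to capture propagation patterns.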

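(2) Influence computation

A research direction in relationship-based analysis with considerable commercial value is influence computation and the maximization of information propagation. Global optimization of information propagation has been proved NP-hard; for large-scale social networks only approximate solutions obtained with optimization algorithms are currently available, and the best current algorithms for influence maximization have handled only social networks with millions of nodes [69]. Microblog networks now have more than 100 million nodes, and how to compute a fixed-size set of the most influential nodes quickly in such a network remains to be explored.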

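In addition: ① because competing pieces of information choose their initial nodes in some order, information at different positions in that order will adopt different selection strategies, which needs to be taken into account. ② Besides text, online social networks contain large amounts of multimedia such as images and audio, which poses new challenges for influence analysis. ③ Research shows that latent interaction graphs spread information faster and reveal more important relationships than visible interaction graphs [4]; how influence in the two kinds of graph is related, and how to quantify the connection, remains to be studied. ④ Topic propagation models vary widely, whereas user influence is relatively stable; how the two affect each other, and to what degree, remains to be explored. ⑤ Beyond competitive influence maximization, minimum-cost, adaptive, and multiple influence maximization are also open problems.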

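(3) Feature extraction and selection

Many feature extraction and selection methods exist for traditional data, but they handle low-frequency words and new features poorly, and both are abundant in microblog data. Compared with term-frequency models, sequential pattern mining preserves word order and can capture latent semantics, so it explains topics better. Pattern mining faces two major challenges, however: the generation of large numbers of redundant patterns and the low support of long patterns. Redundant patterns are unavoidable in any pattern mining, but noise in posts aggravates the problem. For new feature discovery, especially in posts, it is important both to distinguish novel information and to discover new features; POS tagging, word overlap, and sentence similarity between posts all contribute substantially to distinguishing novelty. Moreover, feature extraction and selection in social networks currently addresses text, but social networks also contain large amounts of multimedia such as images and audio, and how to handle this information needs further study.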

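(4) Microblog news mining

News detection in social networks has produced many results, but microblog news detection is limited to specific domains or events; cross-domain news topic detection techniques for microblogs, and computation models tailored to the properties of microblogs, are still lacking. News must above all be new, so how can such socially generated short-text streams be processed online in real time? News is scattered across massive numbers of posts, each only a small fragment of a larger topic, so how can news topics be identified and news events detected in real time? News topics evolve dynamically, so how can the continuity of an event be judged and this dynamic, interrelated evolution be mined? The core of news mining is topic mining, so how can meaningful and easily understood microblog topics be extracted quickly from massive numbers of posts? Most microblog users are now on mobile devices, so in what form should mined news be presented, and how should a dynamic news aggregation system for microblogs be designed? All of these questions await in-depth study. In addition, traditional news detection mostly targets text and rarely considers the effect of multimedia information, which also remains to be addressed.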

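(5) Social media big data fusion

As social network services develop, users take up many services in their social interactions and accumulate large amounts of information. How to integrate distributed social networks and fuse the various social media data sources, so as to provide better data for knowledge mining, has become an urgent problem. Because social media content is posted spontaneously, its truthfulness and reliability cannot be guaranteed, which makes fusion harder. One use of social media data is event and topic mining, where the current tendency is to build topic knowledge bases for use as references: for example, a knowledge base of abbreviations for recognizing and linking abbreviations, a knowledge base of common social media expressions, or, more ambitiously, a knowledge base of topic events. This is currently an important research direction.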

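(6) Cross-lingual sentiment analysis

Sentiment is mined for its commercial value. Big data is moving toward cross-lingual fusion, and sentiment analysis is correspondingly moving toward cross-lingual sentiment analysis. Languages differ, however, in their linguistic features and in the distribution of aspects, and the difficulty of linking languages makes cross-lingual sentiment analysis an even greater challenge and an urgent problem.

Social media big data has distinctive characteristics: it carries not only social relationship attributes but also text, multimedia, and other data worth mining. There are many active research problems; this paper has summarized, analyzed, and looked ahead at the related work from only four aspects: user behavior, information propagation, text mining, and multimedia data analysis.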

[1]Khan N,Yaqoob I,Hashem I A T,et al.Big data:survey, technologies,opportunities,and challenges[J].The Scientific World Journal,2014:1-18.

[2]Brewin M W.Media,society,world:social theory and digital media practice[J].New Media&Society,2013,15 (7):1195-1197.

[3]216 social media and Internet statistics[EB/OL].[2015-12-16].http://thesocialskinny.com/216-social-media-andinternet-statistics-september-2012/.

[4]Saini S,Jin H,Jespersen D,et al.An early performance evaluation of many integrated core architecture based SGI rackable computing system[C]//Proceedings of the 2013 International Conference on High Performance Computing, Networking,Storage and Analysis,Denver,USA,Nov 17-21,2013.New York:ACM,2013:94.

[5]Chang H C.A new perspective on twitter hashtag use:diffusion of innovation theory[J].Proceedings of the American Society for Information Science and Technology,2010, 47(1):1-4.

[6]Bruns A,Burgess J E,Crawford K,et al.#qldfloods and @QPSMedia:crisis communication on Twitter in the 2011 south east Queensland floods[M].Brisbane:ARC Centre of Excellence for Creative Industries and Innovation,2012:19-23.

[7]Yang Zi,Guo Jingyi,Cai Keke,et al.Understanding retweeting behaviors in social networks[C]//Proceedings of the 19th ACM International Conference on Information and Knowledge Management,Toronto,Canada,Oct 26-30,2010. New York:ACM,2010:1633-1636.

[8]Wang Meng,Wang Chaokun,Yu J X,et al.Community detection in social networks:an in-depth benchmarking study with a procedure-oriented framework[J].Proceedings of the VLDB Endowment,2015,8(10):998-1009.

[9]Li Dong,Xu Zhiming,Li Sheng,et al.A survey on information diffusion in online social networks[J].Chinese Journal of Computers,2014,37(1):189-206.

[10]Merton R K.Social theory and social structure[M].New York:Simon and Schuster,1968.

[11]Wu Xindong,Li Yi,Li Lei.Influence analysis of online social networks[J].Chinese Journal of Computers,2014,37 (4):735-752.

[12]Sankaranarayanan J,Samet H,Teitler B E,et al.Twitter-Stand:news in Tweets[C]//Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems,Seattle,USA,Nov 4-6, 2009.New York:ACM,2009:42-51.

[13]Weng Jianshu,Lim E P,Jiang Jing,et al.TwitterRank: finding topic-sensitive influential Twitterers[C]//Proceedings of the 3rd ACM International Conference on Web Search and Data Mining,New York,Feb 4-6,2010.New York:ACM,2010:261-270.

[14]Welch M J,Schonfeld U,He D,et al.Topical semantics of twitter link[C]//Proceedings of the 4th International Conference on Web Search and Web Data Mining,Hong Kong, China,Feb 9-12,2011.New York:ACM,2011:327-336.

[15]Budak C,Agrawal D,El Abbadi A.Limiting the spread of misinformation in social networks[C]//Proceedings of the 20th International Conference on World Wide Web,Hyderabad,India,Mar 28-Apr 1,2011.New York:ACM,2011: 665-674.

[16]Tsai J,Nguyen T H,Tambe M.Security games for controlling contagion[C]//Proceedings of the 26th AAAI Conference on Artificial Intelligence,Toronto,Canada,Jul 22-26, 2012.Menlo Park,USA:AAAI Press,2012:1464-1470.

[17]Tumasjan A,Sprenger T O,Sandner P G,et al.Election forecasts with twitter:how 140 characters reflect the political landscape[J].Social Science Computer Review,2011, 29(4):402-418.

[18]Shamma D A,Kennedy L,Churchill E F.Tweet the debates: understanding community annotation of uncollected sources [C]//Proceedings of the 1st SIGMM Workshop on Social Media,Beijing,Oct 23,2009.New York:ACM,2009:3-10.

[19]Bollen J,Mao Huina,Zeng Xiaojun.Twitter mood predicts the stock market[J].Journal of Computational Science,2011,2(1):1-8.

[20]Kwak H,Lee C,Park H,et al.What is Twitter,a social network or a news media?[C]//Proceedings of the 19th International Conference on World Wide Web,Raleigh, USA,Apr 26-30,2010.New York:ACM,2010:591-600.

[21]Liao Yang,Moshtaghi M,Han Bo,et al.Mining micro-blogs: opportunities and challenges[M]//Computational Social Networks.London:Springer,2011:129-159.

[22]Teevan J,Ramage D,Morris M R.TwitterSearch:a comparison of microblog search and Web search[C]//Proceedings of the 4th ACM International Conference on Web Search and Data Mining,Hong Kong,China,Feb 9-12,2011. New York:ACM,2011:35-44.

[23]Kong Xiangnan,Zhang Jiawei,Yu P S.Inferring anchor links across multiple heterogeneous social networks[C]// Proceedings of the 22nd ACM International Conference on Information&Knowledge Management,San Francisco, USA,Oct 27-Nov 1,2013.New York:ACM,2013:179-188.

[24]Jin Songchang,Zhang Jiawei,Yu P S,et al.Synergistic partitioning in multiple large scale social networks[C]//Proceedings of the 2014 IEEE International Conference on Big Data,Washington,Oct 27-30,2014.Piscataway,USA: IEEE,2014:281-290.

[25]Zhang Yutao,Tang Jie,Yang Zhilin,et al.COSNET:connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Sydney,Australia,Aug 10-13,2015.New York: ACM,2015:1485-1494.

[26]Zhang Jiawei,Yu P S.Multiple anonymized social networks alignment[C]//Proceedings of the 2015 IEEE International Conference on Data Mining,Atlantic,USA,Nov 14-17,2015.Piscataway,USA:IEEE,2015:599-608.

[27]Bhattacharya P,Ghosh S,Kulshrestha J,et al.Deep twitter diving:exploring topical groups in microblogs at scale[C]// Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work&Social Computing,Baltimore,USA,Feb 15-19,2014.New York:ACM,2014:197-210.

[28]Kernighan B W,Lin S.An efficient heuristic procedure for partitioning graphs[J].Bell System Technical Journal,1970, 49(2):291-307.

[29]Pothen A,Simon H D,Liou K P.Partitioning sparse matrices with eigenvectors of graphs[J].SIAM Journal on Matrix Analysis and Applications,1990,11(3):430-452.

[30]Girvan M,Newman M E J.Community structure in social and biological networks[J].Proceedings of the National Academy of Sciences,2002,99(12):7821-7826.

[31]Newman M E J.Fast algorithm for detecting community structure in networks[J].Physical Review E,2004,69(6): 66-133.

[32]Jiang Yawen.Community detection in complex networks [D].Beijing:Beijing Jiaotong University,2014.

[33]Tantipathananandh C,Berger-Wolf T,Kempe D.A framework for community identification in dynamic social networks[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose,USA,Aug 12-15,2007.New York:ACM,2007: 717-726.

[34]Borgatti S P,Everett M G.A graph-theoretic perspective on centrality[J].Social Networks,2006,28(4):466-484.

[35]Ghosh R,Lerman K.Predicting influential users in online social networks[J].arXiv:1005.4882,2010.

[36]Freeman L C.Centrality in social networks conceptual clarification[J].Social Networks,1978,1(3):215-239.

[37]Sabidussi G.The centrality index of a graph[J].Psychometrics,1966,31(4):581-603.

[38]Java A,Kolari P,Finin T,et al.Modeling the spread of influence on the blogosphere[C]//Proceedings of the 15th International World Wide Web Conference,Edinburgh, UK,May 23-26,2006.New York:ACM,2006:22-26.

[39]Awekar A C,Mitra P,Kang J.Selective hypertext induced topic search[C]//Proceedings of the 15th International World Wide Web Conference,Edinburgh,UK,May 23-26,2006.New York:ACM,2006:1023-1024.

[40]Holland P W,Leinhardt S.Transitivity in structural models of small groups[J].Comparative Group Studies,1971, 2(2):107-124.

[41]Kwak H,Lee C,Park H,et al.What is Twitter,a social network or a news media?[C]//Proceedings of the 19th International Conference on World Wide Web,Raleigh, USA,Apr 26-30,2010.New York:ACM,2010:591-600.

[42]Yang J,Leskovec J.Modeling information diffusion in implicit networks[C]//Proceedings of the 2010 IEEE International Conference on Data Mining,Sydney,Australia,Dec 14-17,2010.Piscataway,USA:IEEE,2010:599-608.

[43]Cha M,Haddadi H,Benevenuto F,et al.Measuring user influence in Twitter:the million follower fallacy[C]//Proceedings of the 4th International Conference on Weblogs and Social Media,Washington,May 23-26,2010:30.

[44]Goyal A,Bonchi F,Lakshmanan L V S.Learning influence probabilities in social networks[C]//Proceedings of the 3rd ACM International Conference on Web Search and Data Mining,New York,Feb 4-6,2010.New York:ACM,2010: 241-250.

[45]Tan Chenhao,Tang Jie,Sun Jimeng,et al.Social action tracking via noise tolerant time-varying factor graphs[C]//Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Washington,Jul 25-28,2010.New York:ACM,2010:1049-1058.

[46]Liu Lu,Tang Jie,Han Jiawei,et al.Mining topic-level influence in heterogeneous networks[C]//Proceedings of the 19th ACM Conference on Information and Knowledge Management,Toronto,Canada,Oct 26-30,2010.New York: ACM,2010:199-208.

[47]Ver Steeg G,Galstyan A.Information transfer in social media[C]//Proceedings of the 21st International Conference on World Wide Web,Lyon,France,Apr 16-20,2012.New York:ACM,2012:509-518.

[48]Ver Steeg G,Galstyan A.Information-theoretic measures of influence based on content dynamics[C]//Proceedings of the 6th ACM International Conference on Web Search and Data Mining,Rome,Italy,Feb 4-8,2013.New York: ACM,2013:3-12.

[49]Lumezanu F,Klein H.Measuring the tweeting behavior of propagandists[C]//Proceedings of the 6th International AAAI Conference on Weblogs and Social Media,Dublin, Ireland,Jun 4-7,2012.Menlo Park,USA:AAAI,2012: 864-864.

[50]Lerman K,Ghosh R.Information contagion:an empirical study of the spread of news on Digg and Twitter social networks[C]//Proceedings of the 4th International Conference on Weblogs and Social Media,Washington,May 23-26,2010.Menlo Park,USA:AAAI,2010:90-97.

[51]Lehmann J,Gonçalves B,Ramasco J J,et al.Dynamical classes of collective attention in Twitter[C]//Proceedings of the 21st International Conference on World Wide Web,Lyon,France,Apr 16-20,2012.New York:ACM,2012:251-260.

[52]Granovetter M S.The strength of weak ties[J].American Journal of Sociology,1973,78(6):1360-1380.

[53]Crandall D,Cosley D,Huttenlocher D,et al.Feedback effects between similarity and social influence in online communities[C]//Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Las Vegas,USA,Aug 24-27,2008.New York:ACM,2008:160-168.

[54]Saito K,Nakano R,Kimura M.Prediction of information diffusion probabilities for independent cascade model[C]// LNCS 5179:Proceedings of the 2008 International Conference on Knowledge-Based and Intelligent Information and Engineering Systems,Zagreb,Croatia,Sep 3-5,2008. Berlin,Heidelberg:Springer,2008:67-75.

[55]Xiang R,Neville J,Rogati M.Modeling relationship strength in online social networks[C]//Proceedings of the 19th International Conference on World Wide Web,Raleigh,USA,Apr 26-30,2010.New York:ACM,2010: 981-990.

[56]Cui Peng,Wang Fei,Liu Shaowei,et al.Who should share what?:item-level social influence prediction for users and posts ranking[C]//Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval,Beijing,Jul 25-29,2011.New York:ACM,2011:185-194.

[57]Easley D,Kleinberg J.Networks,crowds,and markets:reasoning about a highly connected world[M].Cambridge:Cambridge University Press,2010.

[58]Kempe D,Kleinberg J,Tardos É.Maximizing the spread of influence through a social network[J].Theory of Computing,2015,11(4):105-147.

[59]Chen Wei,Yuan Yifei,Zhang Li.Scalable influence maximization in social networks under the linear threshold model[C]//Proceedings of the 2010 IEEE International Conference on Data Mining,Sydney,Australia,Dec 14-17,2010.Piscataway,USA:IEEE,2010:88-97.

[60]May R M,Lloyd A L.Infection dynamics on scale-free networks[J].Physical Review E,2001,64(6):066112.

[61]Blume L E.The statistical mechanics of strategic interaction [J].Games and Economic Behavior,1993,5(3):387-424.

[62]Young H P.The dynamics of social innovation[J].Proceedings of the National Academy of Sciences,2011,108(S4): 21285-21291.

[63]Kempe D,Kleinberg J,Tardos É.Influential nodes in a diffusion model for social networks[C]//LNCS 3580:Proceedings of the 32nd International Colloquium on Automata,Languages,and Programming,Lisbon,Portugal,Jul 11-15,2005.Berlin,Heidelberg:Springer,2005:1127-1138.

[64]Yang J,Counts S.Predicting the speed,scale,and range of information diffusion in Twitter[C]//Proceedings of the 4th International Conference on Weblogs and Social Media, Washington,May 23-26,2010.Menlo Park,USA:AAAI, 2010:355-358.

[65]Song X,Chi Y,Hino K,et al.Information flow modeling based on diffusion rate for prediction and ranking[C]//Proceedings of the 16th International Conference on World Wide Web,Banff,Canada,May 8-12,2007.New York: ACM,2007:191-200.

[66]Richardson M,Domingos P.Mining knowledge-sharing sites for viral marketing[C]//Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Edmonton,Canada,Jul 23-26,2002. New York:ACM,2002:61-70.

[67]Leskovec J,Krause A,Guestrin C,et al.Cost-effective outbreak detection in networks[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,San Jose,USA,Aug 12-15, 2007.New York:ACM,2007:420-429.

[68]Chen Wei,Wang Yajun,Yang Siyu.Efficient influence maximization in social networks[C]//Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Paris,France,Jun 28-Jul 1,2009.New York:ACM,2009:199-208.

[69]Chen Wei,Wang Chi,Wang Yajun.Scalable influence maximization for prevalent viral marketing in large-scale social networks[C]//Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Washington,Jul 25-28,2010.New York:ACM,2010: 1029-1038.

[70]Goyal A,Bonchi F,Lakshmanan L V S.A data-based approach to social influence maximization[J].Proceedings of the VLDB Endowment,2011,5(1):73-84.

[71]Goyal A,Lu W,Lakshmanan L V S.Simpath:an efficient algorithm for influence maximization under the linear threshold model[C]//Proceedings of the 2011 IEEE 11th International Conference on Data Mining,Vancouver,Canada, Dec 11-14,2011.Piscataway,USA:IEEE,2011:211-220.

[72]Kim J,Kim S K,Yu H.Scalable and parallelizable processing of influence maximization for large-scale social networks? [C]//Proceedings of the 2013 IEEE 29th International Conference on Data Engineering,Brisbane,Australia,Apr 8-12, 2013.Piscataway,USA:IEEE,2013:266-277.

[73]Li Hui,Bhowmick S S,Sun A.Cinema:conformity-aware greedy algorithm for influence maximization in online social networks[C]//Proceedings of the 16th International Conference on Extending Database Technology,Genoa,Italy,Mar 18-22,2013.New York:ACM,2013:323-334.

[74]Carnes T,Nagarajan C,Wild S M,et al.Maximizing influence in a competitive social network:a follower's perspective[C]//Proceedings of the 9th International Conference on Electronic Commerce,Minneapolis,USA,Aug 19-22, 2007.New York:ACM,2007:351-360.

[75]Bharathi S,Kempe D,Salek M.Competitive influence maximization in social networks[C]//LNCS 4858:Proceedings of the 3rd International Workshop on Web and Internet Economics,San Diego,USA,Dec 12-14,2007.Berlin,Heidelberg:Springer,2007:306-311.

[76]Borodin A,Filmus Y,Oren J.Threshold models for competitive influence in social networks[C]//LNCS 6484:Proceedings of the 6th International Workshop on Internet and Network Economics,Stanford,USA,Dec 13-17,2010.Berlin,Heidelberg:Springer,2010:539-550.

[77]Li Hui,Bhowmick S S,Cui Jiangtao,et al.Getreal:towards realistic selection of influence maximization strategies in competitive networks[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data,Melbourne,Australia,May 31-Jun 4,2015.New York:ACM,2015:1525-1537.

[78]Zhang H,Nguyen D T,Zhang H,et al.Least cost influence maximization across multiple social networks[J].IEEE/ ACM Transactions on Networking,2016,24(2):929-939.

[79]Vaswani S,Lakshmanan L V S.Adaptive influence maximization in social networks:why commit when you can adapt?[J].arXiv:1604.08171,2016.

[80]Hu Xia,Liu Huan.Text analytics in social media[M]//Mining Text Data.New York:Springer US,2012:385-414.

[81]Forman G,Kirshenbaum E.Extremely fast text feature extraction for classification and indexing[C]//Proceedings of the 17th ACM Conference on Information and Knowledge Management,Napa Valley,USA,Oct 26-30,2008.New York:ACM,2008:1221-1230.

[82]Dai Yue,Kakkonen T,Sutinen E.MinEDec:a decision-support model that combines text-mining technologies with two competitive intelligence analysis methods[J].International Journal of Computer Information Systems and Industrial Management Applications,2011,3:165-173.

[83]Negi P S,Rauthan M M S,Dhami H S.Language model for information retrieval[J].International Journal of Computer Applications,2010,12(7):13-17.

[84]ChandraShekar B H,Shoba G.Classification of documents using Kohonen's self-organizing map[J].International Journal of Computer Theory and Engineering,2009,1(5): 610-613.

[85]Yuan Lichi.Improvement for the automatic part-of-speech tagging based on hidden Markov model[C]//Proceedings of the 2010 2nd International Conference on Signal Processing Systems,Dalian,China,Jul 5-7,2010.Piscataway, USA:IEEE,2010,1:744-747.

[86]Ling H S,Bali R,Salam R A.Emotion detection using keywords spotting and semantic network IEEE ICOCI 2006[C]//Proceedings of the 2006 International Conference on Computing & Informatics,Kuala Lumpur,Malaysia,Jun 6-8,2006.Piscataway,USA:IEEE,2006:1-5.

[87]Li J,Khan S U.MobiSN:semantics-based mobile ad hoc social network framework[C]//Proceedings of the Global Communications Conference,Honolulu,USA,Nov 30-Dec 4,2009.Piscataway,USA:IEEE,2009:1-6.

[88]Hua J,Tembe W D,Dougherty E R.Performance of feature-selection methods in the classification of high-dimension data[J].Pattern Recognition,2009,42(3):409-424.

[89]Efron M,Organisciak P,Fenlon K.Improving retrieval of short texts through document expansion[C]//Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval,Portland, USA,Aug 12-16,2012.New York:ACM,2012:911-920.

[90]Lee C H,Wu C H,Chien T F.BursT:a dynamic term weighting scheme for mining microblogging messages [C]//LNCS 6677:Proceedings of the 8th International Symposium on Neural Networks,Guilin,China,May 29-Jun 1,2011.Berlin,Heidelberg:Springer,2011:548-557.

[91]Yoshida K,Tsuruoka Y,Miyao Y,et al.Ambiguous part-of-speech tagging for improving accuracy and domain portability of syntactic parsers[C]//Proceedings of the 20th International Joint Conference on Artificial Intelligence,Hyderabad,India,Jan 6-12,2007.San Francisco,USA:Morgan Kaufmann,2007:1783-1788.

[92]Fiscus J G,Doddington G R.Topic detection and tracking evaluation overview[M]//Topic Detection and Tracking. New York:Springer US,2002:17-31.

[93]Allan J,Papka R,Lavrenko V.On-line new event detection and tracking[C]//Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,Melbourne,Australia,Aug 24-28,1998.New York:ACM,1998:37-45.

[94]Allan J,Harding S,Fisher D,et al.Taking topic detection from evaluation to practice[C]//Proceedings of the 38th Annual Hawaii International Conference on System Sciences,Big Island,USA,Jan 3-6,2005.Piscataway,USA: IEEE,2005:101a.

[95]Sekiguchi Y,Kawashima H,Okuda H,et al.Topic detection from blog documents using users'interests[C]//Proceedings of the 7th International Conference on Mobile Data Management,Nara,Japan,May 9-13,2006.Piscataway,USA:IEEE,2006:108.

[96]Blei D,Ng A,Jordan M.Latent Dirichlet allocation[J]. The Journal of Machine Learning Research,2003,3(1): 993-1022.

[97]Ramage D,Dumais S T,Liebling D J.Characterizing microblogs with topic models[C]//Proceedings of the 4th International Conference on Weblogs and Social Media,Washington,May 23-26,2010.Menlo Park,USA:AAAI,2010: 130-137.

[98]Zhao W Xin,Jiang Jing,Weng Jianshu,et al.Comparing Twitter and traditional media using topic models[C]// LNCS 6611:Proceedings of the 33rd European Conference on Information Retrieval,Dublin,Ireland,Apr 18-21, 2011.Berlin,Heidelberg:Springer,2011:338-349.

[99]Hong L,Davison B D.Empirical study of topic modeling in Twitter[C]//Proceedings of the 1st Workshop on Social Media Analytics,Washington,Jul 25,2010.New York: ACM,2010:80-88.

[100]Steyvers M,Smyth P,Rosen-Zvi M,et al.Probabilistic author-topic models for information discovery[C]//Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Seattle,USA,Aug 22-25,2004.New York:ACM,2004:306-315.

[101]Zhao Xujian,Yang Chunming,Li Bo,et al.A topic evolution mining algorithm of news text based on feature evolving [J].Chinese Journal of Computers,2014,37(4):819-832.

[102]Wang X,McCallum A.Topics over time:a non-Markov continuous-time model of topical trends[C]//Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Philadelphia, USA,Aug 20-23,2006.New York:ACM,2006:424-433.

[103]Xu Ge,Wang Houfeng.The development of the topic models in natural language processing[J].Chinese Journal of Computers,2011,34(8):1423-1436.

[104]Popescu A M,Pennacchiotti M,Paranjpe D.Extracting events and event descriptions from twitter[C]//Proceedings of the 20th International Conference Companion on World Wide Web,Hyderabad,India,Mar 28-Apr 1,2011. New York:ACM,2011:105-106.

[105]Chakrabarti D,Punera K.Event Summarization using Tweets[C]//Proceedings of the 5th International Conference on Weblogs and Social Media,Barcelona,Spain,Jul 17-21,2011.Menlo Park,USA:AAAI,2011:66-73.

[106]Ritter A,Etzioni O,Clark S.Open domain event extraction from Twitter[C]//Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Beijing,Aug 12-16,2012.New York:ACM, 2012:1104-1112.

[107]Sharifi B,Hutton M A,Kalita J.Summarizing microblogs automatically[C]//Proceedings of Human Language Technologies:Conference of the North American Chapter of the Association of Computational Linguistics,Los Angeles, USA,Jun 2-4,2010.Stroudsburg,USA:ACL,2010:685-688.

[108]Yang Y,Pierce T,Carbonell J.A study of retrospective and on-line event detection[C]//Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,Melbourne,Australia,Aug 24-28,1998.New York:ACM,1998:28-36.

[109]Petrović S,Osborne M,Lavrenko V.Streaming first story detection with application to Twitter[C]//Proceedings of Human Language Technologies:Conference of the North American Chapter of the Association of Computational Linguistics,Los Angeles,USA,Jun 2-4,2010.Stroudsburg,USA:ACL,2010:181-189.

[110]Becker H,Naaman M,Gravano L.Beyond trending topics:real-world event identification on Twitter[C]//Proceedings of the 5th International Conference on Weblogs and Social Media,Barcelona,Spain,Jul 17-21,2011.Menlo Park,USA:AAAI,2011:438-441.

[111]Long Rui,Wang Haofen,Chen Yuqiang,et al.Towards effective event detection,tracking and summarization on microblog data[C]//LNCS 6897:Proceedings of the 12th International Conference on Web-Age Information Management, Wuhan,China,Sep 14-16,2011.Berlin,Heidelberg:Springer, 2011:652-663.

[112]Weng J,Lee B S.Event detection in Twitter[C]//Proceedings of the 5th International Conference on Weblogs and Social Media,Barcelona,Spain,Jul 17-21,2011.Menlo Park,USA:AAAI,2011:401-408.

[113]Cordeiro M.Twitter event detection:combining wavelet analysis and topic inference summarization[C]//Proceedings of the 7th Doctoral Symposium in Informatics Engineering,Porto,Portugal,Jan 26-27,2012.Porto:Faculdade de Engenharia da Universidade do Porto,2012:123-138.

[114]Li C,Sun A,Datta A.Twevent:segment-based event detection from Tweets[C]//Proceedings of the 21st ACM International Conference on Information and Knowledge Management,Maui,USA,Oct 29-Nov 2,2012.New York:ACM,2012:155-164.

[115]Robinson B,Power R,Cameron M.A sensitive Twitter earthquake detector[C]//Proceedings of the 22nd International Conference on World Wide Web,Rio de Janeiro,Brazil,May 13-17,2013.New York:ACM,2013:999-1002.

[116]Sakaki T,Okazaki M,Matsuo Y.Earthquake shakes Twitter users:real-time event detection by social sensors[C]//Proceedings of the 19th International Conference on World Wide Web,Raleigh,USA,Apr 26-30,2010.New York: ACM,2010:851-860.

[117]Li R,Lei K H,Khadiwala R,et al.Tedas:a Twitter-based event detection and analysis system[C]//Proceedings of the 2012 IEEE 28th International Conference on Data Engineering,Washington,Apr 1-5,2012.Piscataway,USA: IEEE,2012:1273-1276.

[118]Popescu A M,Pennacchiotti M.Detecting controversial events from twitter[C]//Proceedings of the 19th ACM International Conference on Information and Knowledge Management,Toronto,Canada,Oct 26-30,2010.New York: ACM,2010:1873-1876.

[119]Benson E,Haghighi A,Barzilay R.Event discovery in social media feeds[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:Human Language Technologies-Volume 1,Portland,USA,Jun 19-24,2011.Stroudsburg,USA:ACL,2011:389-398.

[120]Becker H,Chen F,Iter D,et al.Automatic identification and presentation of Twitter content for planned events[C]// Proceedings of the 5th International Conference on Weblogs and Social Media,Barcelona,Spain,Jul 17-21,2011. Menlo Park,USA:AAAI,2011:1-2.

[121]Becker H,Iter D,Naaman M,et al.Identifying content for planned events across social media sites[C]//Proceedings of the 5th ACM International Conference on Web Search and Data Mining,Seattle,USA,Feb 8-12,2012.New York: ACM,2012:533-542.

[122]Massoudi K,Tsagkias M,De Rijke M,et al.Incorporating query expansion and quality indicators in searching microblog posts[C]//LNCS 6611:Advances in Information Retrieval,Proceedings of the 33rd European Conference on IR Research,Dublin,Ireland,Apr 18-21,2011.Berlin,Heidelberg:Springer,2011:362-367.

[123]Metzler D,Cai C,Hovy E.Structured event retrieval over microblog archives[C]//Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Montréal,Canada,Jun 3-8,2012.Stroudsburg,USA:ACL,2012:646-655.

[124]Gu Hansu,Xie Xing,Lv Qin,et al.Etree:effective and efficient event modeling for real-time online social media networks[C]//Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence,Lyon,France, Aug 22-27,2011.Piscataway,USA:IEEE,2011:300-307.

[125]Chen L,Roy A.Event detection from Flickr data through wavelet-based spatial analysis[C]//Proceedings of the 18th ACM Conference on Information and Knowledge Management,Hong Kong,China,Nov 2-6,2009.New York: ACM,2009:523-532.

[126]Lee R,Wakamiya S,Sumiya K.Discovery of unusual regional social activities using geo-tagged microblogs[J].World Wide Web,2011,14(4):321-349.

[127]Hayes A S,Singer J B,Ceppos J.Shifting roles,enduring values:the credible journalist in a digital age[J].Journal of Mass Media Ethics,2007,22(4):262-279.

[128]Lu Rong,Xu Zhiheng,Zhang Yang,et al.Life activity modeling of news event on Twitter using energy function [C]//Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining,Kuala Lumpur,Malaysia,May 29-Jun 1,2012.Berlin,Heidelberg:Springer,2012:73-84.

[129]Kleinberg J.Bursty and hierarchical structure in streams[J]. Data Mining and Knowledge Discovery,2003,7(4):373-397.

[130]Kontostathis A,Galitsky L M,Pottenger W M,et al.A survey of emerging trend detection in textual data mining [M]//Survey of Text Mining.New York:Springer,2004: 185-224.

[131]Lau C H.Detecting news topics from microblogs using sequential pattern mining[D].Brisbane:Queensland University of Technology,2014.

[132]Bernstein M S,Suh B,Hong L,et al.Eddi:interactive topic-based browsing of social status streams[C]//Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology,New York,Oct 3-6,2010.New York:ACM,2010:303-312.

[133]Marcus A,Bernstein M S,Badar O,et al.Twitinfo:aggregating and visualizing microblogs for event exploration[C]//Proceedings of the 2011 SIGCHI Conference on Human Factors in Computing Systems,Vancouver,Canada,May 7-12,2011.New York:ACM,2011:227-236.

[134]Zhang Chao,Zhang Keyang,Yuan Quan,et al.GMove: group-level mobility modeling using geo-tagged social media[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco,USA,Aug 13-17,2016.New York:ACM, 2016:1305-1314.

[135]Wang Weiqing,Yin Hongzhi,Chen Ling,et al.Geo-SAGE: a geographical sparse additive generative model for spatial item recommendation[C]//Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Sydney,Australia,Aug 10-13,2015.New York:ACM,2015:1255-1264.

[136]Zhang Chao,Zheng Yu,Ma Xiuli,et al.Assembler:efficient discovery of spatial co-evolving patterns in massive geo-sensory data[C]//Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Sydney,Australia,Aug 10-13,2015.New York: ACM,2015:1415-1424.

[137]Min Li.EKNOT:event knowledge from news and opinions on Twitter[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence,Phoenix,USA,Feb 12-17, 2016.Menlo Park,USA:AAAI,2016:4367-4368.

[138]Zhang Chao,Zhou Guangyu,Yuan Quan,et al.GeoBurst: real-time local event detection in geo-tagged Tweet streams [C]//Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval,Pisa,Italy,Jul 17-21,2016.New York:ACM, 2016:513-522.

[139]Wang Jingjing,Tong Wenzhu,Yu Hongkun,et al.Mining multi-aspect reflection of news events in Twitter:discovery,linking and presentation[C]//Proceedings of the 2015 IEEE International Conference on Data Mining,Atlantic City,USA,Nov 14-17,2015.Piscataway,USA:IEEE,2015: 429-438.

[140]Peng Min,Zhu Jiahui,Li Xuhui,et al.Central topic model for event-oriented topics mining in microblog stream[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:1611-1620.

[141]Lu Haokai,Caverlee J,Niu Wei.BiasWatch:a lightweight system for discovering and tracking topic-sensitive opinion bias in social media[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:213-222.

[142]Prasojo R E,Kacimi M,Nutt W.Entity and aspect extraction for organizing news comments[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:233-242.

[143]Ding Wanying,Shang Yue,Guo Lifan,et al.Video popularity prediction by sentiment propagation via implicit network[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015: 1621-1630.

[144]Tandon N,de Melo G,De A,et al.Knowlywood:mining activity knowledge from Hollywood narratives[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:223-232.

[145]Wu Fangzhao,Shu Jinyun,Huang Yongfeng,et al.Social spammer and spam message co-detection in microblogging with social context regularization[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:1601-1610.

[146]Qian S,Cao J,Mouël F L,et al.SCRAM:a sharing considered route assignment mechanism for fair taxi route recommendations[C]//Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Sydney,Australia,Aug 10-13,2015.New York:ACM,2015:955-964.

[147]Vallet D,Berkovsky S,Ardon S,et al.Characterizing and predicting viral-and-popular video content[C]//Proceedings of the 24th ACM International Conference on Information and Knowledge Management,Melbourne,Australia,Oct 19-23,2015.New York:ACM,2015:1591-1600.

[148]Brody S,Diakopoulos N.Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!:using word lengthening to detect sentiment in microblogs[C]//Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing,Edinburgh,UK,Jul 27-31,2011.Stroudsburg,USA:ACL,2011:562-570.

[149]Pang B,Lee L.Opinion mining and sentiment analysis[J].Foundations and Trends in Information Retrieval,2007,2(1/2):1-135.

[150]Liu Bing.Sentiment analysis and opinion mining[J].Synthesis Lectures on Human Language Technologies,2012,5(1):1-167.

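Chinese-language references:

[9]Li Dong,Xu Zhiming,Li Sheng,et al.Information diffusion in online social networks[J].Chinese Journal of Computers,2014,37(1):189-206.(in Chinese)

[11]Wu Xindong,Li Yi,Li Lei.Influence analysis of online social networks[J].Chinese Journal of Computers,2014,37(4):735-752.(in Chinese)

[32]Jiang Yawen.Community detection in complex networks[D].Beijing:Beijing Jiaotong University,2014.(in Chinese)

[101]Zhao Xujian,Yang Chunming,Li Bo,et al.A topic evolution mining algorithm of news text based on feature evolving[J].Chinese Journal of Computers,2014,37(4):819-832.(in Chinese)

[103]Xu Ge,Wang Houfeng.The development of the topic models in natural language processing[J].Chinese Journal of Computers,2011,34(8):1423-1436.(in Chinese)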

DU Zhijuan was born in 1986. She is a Ph.D. candidate at Renmin University of China and a student member of CCF. Her research interests include Web data management and cloud data management.

WANG Shuo was born in 1981. He is a Ph.D. candidate at Renmin University of China and a lecturer at the College of Mathematics and Information Science, Hebei University. His research interests include data fusion and knowledge fusion, machine learning, and soft computing.

WANG Qiuyue was born in 1974. She holds a Ph.D. and is an assistant professor at Renmin University of China. Her research interests include database and information systems, information retrieval, large-scale knowledge processing, and natural language question answering.

MENG Xiaofeng was born in 1964. He is a professor and Ph.D. supervisor at Renmin University of China and a fellow of CCF. His research interests include cloud data management, Web data management, flash-based databases, and privacy protection.

Survey on Social Media Big Data Analytics*

DU Zhijuan+,WANG Shuo,WANG Qiuyue,MENG Xiaofeng
School of Information,Renmin University of China,Beijing 100872,China
+Corresponding author:E-mail:nmg-duzhijuan@163.com

Social media, which contains a large amount of valuable information, is an important channel through which people spread information and express opinions. In recent years it has become one of the most representative sources of big data, and mining and analyzing this information has a profound impact on social development. According to the constituent elements of social media, this paper divides current research into three categories: user-based analysis, relationship-based analysis, and analysis based on interactive content. Firstly, user-centered data are analyzed through user identification across multi-source heterogeneous networks, community detection, and user influence computation. Secondly, relationship-centered analysis is discussed from three perspectives: user relationship strength calculation, information diffusion, and influence maximization. Thirdly, four problems based on user interactive content are discussed: feature extraction and selection, topic and event mining, multimedia data analysis, and sentiment analysis. Finally, this paper points out the challenges of existing research and new perspectives for future work in six areas: information diffusion, influence computation, feature extraction and selection, microblog news mining, social media big data fusion, and cross-lingual sentiment analysis.

CLC number:TP393

doi:10.3778/j.issn.1673-9418.1601037

*The National Natural Science Foundation of China under Grant Nos. 61379050, 61532010, 91224008, 61532016; the National Key R&D Program of China under Grant Nos. 2016YFB1000602, 2016YFB1000603; the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20130004130001; the Research Funds of Renmin University of China under Grant No. 11XNL010.

Received 2016-01,Accepted 2016-09.

CNKI online-first publication:2016-09-08,http://www.cnki.net/kcms/detail/11.5602.TP.20160908.1045.002.html

Key words:social media;big data;user behavior;interactive relationship;interactive content