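I. Corpus Overview
Basic information about the electronic text of the classic novel Dream of the Red Chamber (《紅樓夢》) is listed below (Table 1):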

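The Dream of the Red Chamber corpus analyzed in this paper was downloaded from http://ling.ccnu.edu.cn/ylk/gudian.htm. The corpus covers all 120 chapters and, besides Chinese characters, contains 210 special symbols, including non-Chinese symbols, graphic symbols, structural marks, punctuation marks, Arabic numerals, Japanese characters, and Latin letters.
These special symbols are listed below in two groups according to how they display. One group is visible, that is, displayable. The other group is not displayed: no symbol can be seen in the text, but each one still has its own code point. The first group (Table 2) contains 173 special symbols and the second group (Table 3) contains 37. Because the symbols in the second group are not displayed, their hexadecimal codes are given in parentheses.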
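The two-group split described above can be sketched in Python. This is only an illustration, not the procedure used for Tables 2 and 3: the function names are hypothetical, "Chinese character" is approximated by two Unicode ranges, and "not displayed" is assumed to mean the Unicode control, format, and separator categories.

```python
import unicodedata

def is_han(ch: str) -> bool:
    """Rough test: CJK Unified Ideographs and Extension A only (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"

def is_displayable(ch: str) -> bool:
    """Assumption: categories C* (control/format) and Z* (separators) are 'not displayed'."""
    return unicodedata.category(ch)[0] not in ("C", "Z")

def split_special_symbols(text: str):
    """Return (displayable, non_displayable) for the distinct non-Han symbols in text.

    Non-displayable symbols are reported as hexadecimal code points in parentheses.
    """
    shown, hidden = [], []
    for ch in sorted(set(text)):
        if is_han(ch):
            continue
        if is_displayable(ch):
            shown.append(ch)
        else:
            hidden.append(f"(0x{ord(ch):04X})")
    return shown, hidden

# Toy sample: Han characters, punctuation, a newline, an ideographic space,
# a digit, an ASCII space, and a Latin letter.
sample = "寶玉笑道:「好!」\n\u3000第1回 A"
shown, hidden = split_special_symbols(sample)
```

On the toy sample, `hidden` holds the newline, the ASCII space, and the ideographic space as hexadecimal codes, while `shown` holds the punctuation marks, the digit, and the letter.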

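All of these symbols are part of the text of Dream of the Red Chamber, but this paper is mainly concerned with the characteristics of the displayable characters, so the non-Chinese special symbols were left out of the analysis and statistics. Because punctuation marks account for a relatively large share, they are analyzed and discussed separately below.
II. Statistics on Punctuation Marks
Punctuation marks also carry a certain amount of meaning and contribute both to the understanding of the novel and to the expression of its language.
The statistics show that non-Chinese characters occur 137850 times in Dream of the Red Chamber, of which 137540 occurrences are punctuation marks. Given that the novel contains 868996 characters in total, punctuation marks account for 137540/868996≈0.158; in other words, 15.8% of the novel's characters are punctuation marks. The two most frequent punctuation marks are the comma and the full stop, which are also the most frequent characters in the novel overall, so the symbols repeated most often in the novel are punctuation marks rather than Chinese characters. The comma and the full stop occur 59357 and 29400 times respectively. Sentence-final marks such as the full stop and the exclamation mark allow a rough estimate of the novel's scale, but these figures alone say nothing about its content. The text can be properly understood and analyzed only when its characters and words are considered together with its punctuation. We now turn to the individual characters of the novel.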
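The punctuation share can be reproduced with a short Python sketch. The function name is hypothetical, and treating every character in a Unicode `P*` category as a punctuation mark is an assumption; the source does not say how punctuation was identified.

```python
import unicodedata
from collections import Counter

def punctuation_stats(text: str):
    """Count punctuation marks (Unicode categories P*) and their share of all characters."""
    marks = Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))
    total = len(text)
    share = sum(marks.values()) / total if total else 0.0
    return marks, share

# Share computed from the two totals reported in the text.
reported_share = 137540 / 868996

# Toy sample: 9 characters, 3 of them punctuation marks.
marks, share = punctuation_stats("寶玉笑道,好,好。")
```

`marks.most_common(2)` would give the two most frequent marks, which the text reports to be the comma and the full stop for the whole novel.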
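III. Character Frequency Statistics and Character Associations
Because non-Chinese symbols contribute little to the analysis of Dream of the Red Chamber, they were excluded from the character frequency counts. The data are therefore as follows: the text contains 4316 distinct Chinese characters and 731146 Chinese character occurrences in total. It is not feasible to analyze more than four thousand characters one by one, and some of them occur only once or a few times, so we selected representative characters, namely those with a frequency of 0.1% or above, as the object of study. One possible criterion for choosing representative characters would be to take the most frequent characters until their cumulative count exceeds half of all Chinese character occurrences (731146). On inspecting the statistics, however, we found that some particles occur far more often than the verbs and nouns that carry concrete meaning, so we ultimately chose the characters with a frequency of 0.1% or above. The results show that this choice is reasonable.
There are 194 Chinese characters with a frequency of 0.1% or above. Together they occur 502347 times, which is 68.7% of all Chinese character occurrences in the novel, or close to 70% of the text, yet they make up only 194/4316≈0.0449, less than 5%, of the distinct characters. All high-frequency Chinese characters in Dream of the Red Chamber are listed below in descending order of frequency. Because there are many characters, the records are separated by semicolons. Each record gives the character, its number of occurrences, and its share of all Chinese character occurrences written as a decimal fraction (fields separated by commas).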
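The selection by a 0.1% threshold, and the two shares derived from it, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, Chinese characters are approximated by the range U+4E00 to U+9FFF, and "0.1% or above" is read as inclusive.

```python
from collections import Counter

def high_frequency_chars(text: str, threshold: float = 0.001):
    """Return (rows, coverage, type_share) for Han characters at or above threshold.

    rows: (character, count, relative frequency), most frequent first.
    coverage: share of all Han occurrences covered by the selected characters.
    type_share: selected characters as a share of distinct Han characters.
    """
    han = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    total = len(han)
    if total == 0:
        return [], 0.0, 0.0
    counts = Counter(han)
    rows = [(ch, n, n / total) for ch, n in counts.most_common() if n / total >= threshold]
    coverage = sum(n for _, n, _ in rows) / total
    type_share = len(rows) / len(counts)
    return rows, coverage, type_share

# Toy sample with an artificially high threshold so the cut is visible.
rows, coverage, type_share = high_frequency_chars("了了了的的不", threshold=0.3)

# The same two shares computed from the totals reported in the text.
reported_coverage = 502347 / 731146
reported_type_share = 194 / 4316
```

On the full novel, `coverage` and `type_share` correspond to the reported 68.7% and just under 5%.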
了,21193,0.028986;的,15720,0.0215;不,15025,0.02055;一,12149,0.016616;來,11429,0.01563;道,11059,0.0151;人,10542,0.0144;是,10142,0.01387;說,9692,0.013256;我,9173,0.012546;這,7810,0.010682;他,7737,0.01058;你,7142,0.009768;去,6186,0.00846;著,6166,0.00843;也,6106,0.00835;兒,6074,0.008308;玉,6051,0.008276;有,5987,0.008189;寶,5820,0.00796;個,5656,0.007736;子,5466,0.007476;又,5220,0.007139;賈,5201,0.00711;里,5143,0.00703;那,4909,0.00671;們,4893,0.00669;見,4804,0.00657;只,4677,0.006397;太,4302,0.00588;便,4078,0.005578;好,4042,0.005528;在,4002,0.00547;笑,3957,0.00541;家,3917,0.005357;上,3809,0.0052;么,3670,0.00502;得,3610,0.004937;大,3466,0.00474;姐,3443,0.004709;頭,3403,0.00465;聽,3301,0.004515;就,3253,0.004449;出,3225,0.00441;回,3070,0.004199;知,2922,0.003996;日,2917,0.00399;要,2903,0.00397;下,2775,0.003795;都,2677,0.00366;心,2655,0.00363;事,2641,0.00361;二,2630,0.003597;老,2602,0.003559;過,2584,0.00353;話,2504,0.003425;還,2496,0.0034;起,2477,0.003388;自,2455,0.003358;如,2357,0.0032;看,2353,0.003218;叫,2267,0.0031;到,2243,0.003068;沒,2243,0.003068;兩,2230,0.00305;母,2206,0.003017;些,2172,0.00297;時,2156,0.002949;之,2139,0.002926;今,2117,0.002895;小,2020,0.00276;問,2001,0.002737;因,1977,0.0027;鳳,1949,0.002666;奶,1947,0.00266;等,1938,0.00265;娘,1871,0.002559;可,1863,0.002548;什,1855,0.002537;呢,1826,0.002497;忙,1822,0.00249;夫,1805,0.002469;想,1792,0.00245;面,1781,0.002436;爺,1773,0.002425;才,1771,0.0024;中,1672,0.002287;王,1661,0.00227;打,1588,0.00217;進,1548,0.002117;此,1538,0.0021;倒,1534,0.002098;罷,1525,0.002086;樣,1507,0.00206;吃,1455,0.00199;和,1453,0.001987;正,1411,0.0019;幾,1400,0.001915;無,1400,0.001915;姑,1395,0.001908;后,1388,0.001898;黛,1383,0.00189;天,1362,0.00186;然,1292,0.001767;前,1281,0.00175;為,1274,0.00174;意,1261,0.001725;別,1253,0.0017;再,1253,0.0017;門,1242,0.001699;丫,1232,0.001685;走,1222,0.00167;外,1221,0.00167;襲,1213,0.001659;作,1212,0.001658;怎,1206,0.001649;三,1203,0.001645;眾,1189,0.001626;妹,1188,0.001625;方,1170,0.0016;生,1170,0.0016;多,1164,0.00159;明,1157,0.00158;將,1156,0.00158;已,1150,0.00157;身,1142,0.00156;把,1141,0.00156;以,1133,0.00155;氣,1125,0.001539;釵,1119,0.0015;何,1117,0.001528;親,1087,0.001487;給,1077,0.00147;拿,1066,0.001458;與,1059,0.001448;手,1054,0.00144;坐,1054,0.00144;年,1048,0.00143;若,1038,0.0014;十,1036,0.001417;用,1036,0.001417;請,1031,0.0014;房,1027,0.001405;發,993,0.001358;薛,993,0.001358;且,991,0.001355;春,983,0.001344;媽,979,0.001339;政,978,0.001338;命,972,0.001329;姨,959,0.0013;原,952,0.00130;花,950,0.001299;所,948,0.001297;處,934,0.001277;先,909,0.00124;邊,904,0.001236;誰,902,0.001234;己,899,0.00123;平,899,0.00123;瞧,895,0.001224;璉,892,0.00122;內,888,0.001215;住,887,0.001213;管,886,0.001212;女,880,0.001204;死,866,0.001184;送,856,0.001171;連,834,0.001141;至,831,0.001137;告,830,0.001135;早,823,0.001126;會,817,0.001117;東,815,0.001115;香,812,0.001111;林,807,0.001104;往,802,0.001097;西,802,0.001097;月,797,0.00109;帶,794,0.001086;雖,790,0.00108;應,785,0.001074;必,772,0.001056;從,770,0.001053;口,767,0.001049;分,765,0.001046;怕,761,0.001041;聲,758,0.001037;四,754,0.001031;當,746,0.00102;放,745,0.001019;能,744,0.001018;未,744,0.001018;云,736,0.001007
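A list in this format can be read back and checked with a few lines of Python. The sketch below is an illustration only: the function name is hypothetical, it assumes every record has exactly three comma-separated fields, and it uses just the first three records of the list as sample input.

```python
def parse_records(data: str):
    """Parse 'character,count,fraction' records separated by semicolons."""
    rows = []
    for record in data.split(";"):
        record = record.strip()
        if not record:
            continue
        ch, count, fraction = record.split(",")
        rows.append((ch, int(count), float(fraction)))
    return rows

TOTAL_HAN = 731146  # total Chinese character occurrences reported in the text

sample = "了,21193,0.028986;的,15720,0.0215;不,15025,0.02055"
rows = parse_records(sample)

# Largest gap between a listed fraction and count / TOTAL_HAN.
max_gap = max(abs(n / TOTAL_HAN - f) for _, n, f in rows)
```

For these three records the listed fractions agree with count divided by 731146 to within rounding, which confirms that the third field is a decimal fraction of all Chinese character occurrences rather than a percentage.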
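The statistics above show the following:
1) Function words are used very frequently in Dream of the Red Chamber, including 了, 的, 不, 著, 也, 個, 又, 得, 就, 還, 之, and others.
Although function words are fewer in number than content words, their meanings are more complex. They generally modify content words and produce various meanings in combination with them. The role of a function word can be understood and analyzed only by going back to the novel and examining the words it combines with.
2) Nouns also account for a large share, for example 人, 兒, 子, 玉, 寶, 賈, 家, 姐, 頭, 母, 鳳, 奶, 娘, 夫, 爺, 王, 姑, 黛, 丫, 妹, 薛, 媽, 姨, 女, and so on.
These high-frequency nouns show that Dream of the Red Chamber revolves mainly around people and that its main subject is the affairs of the four great families Jia (賈), Wang (王), Shi (史), and Xue (薛). The characters 寶, 玉, and 黛, which appear in the protagonists' names, are also relatively frequent. From the relationships among these nouns we can further infer that this is a large family with sons and daughters, in which the kinship and household terms 爺, 奶, 母, 姐, 妹, 姑, and 夫 are all present, and that female roles make up a large share. Combining the character 丫 with the character 頭 to form 丫頭 (maidservant) also suggests that Dream of the Red Chamber tells the story of a wealthy household with many maids.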
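The idea of combining two high-frequency characters, as with 丫 and 頭, can be tested against the text by counting adjacent pairs. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and pairs are counted only between characters in the range U+4E00 to U+9FFF, so a pair never spans a punctuation mark.

```python
from collections import Counter

def is_han(ch: str) -> bool:
    """Rough test for a Chinese character (basic CJK block only, an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"

def bigram_counts(text: str) -> Counter:
    """Count adjacent pairs of Han characters."""
    pairs = Counter()
    for a, b in zip(text, text[1:]):
        if is_han(a) and is_han(b):
            pairs[a + b] += 1
    return pairs

def pairing_rate(text: str, first: str, second: str) -> float:
    """Share of the occurrences of first that are immediately followed by second."""
    n_first = text.count(first)
    return bigram_counts(text)[first + second] / n_first if n_first else 0.0

# Toy sentence, not a quotation from the novel.
sample = "丫頭們都笑了,那丫頭回頭去了。"
pairs = bigram_counts(sample)
rate = pairing_rate(sample, "丫", "頭")
```

A pairing rate close to 1 on the full text would support reading 丫 and 頭 together as 丫頭, whereas a low rate would indicate that the two characters mostly occur in other combinations.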
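3) High-frequency verbs: 來, 道, 是, 去, 有, 見, 笑, 聽, 出, 知, 要, 看, 叫, 到, 死, and others.
It is hard to infer the main activities of the characters in Dream of the Red Chamber from these verbs. Each of them may belong to several parts of speech in the text, and a single character viewed in isolation is ambiguous, so its exact meaning in the text cannot be determined. The relationships among the verbs, and between the verbs and the content of the novel, therefore have to be analyzed in context within the text itself.
4) There are also other high-frequency nouns, such as 香, 月, 云, 花, and 春. These characters readily suggest that Dream of the Red Chamber has no shortage of poetic imagery and romantic love.
These high-frequency characters also hint at the characteristics of the author's use of language. In addition, computing character frequencies separately for each chapter makes it possible, to some extent, to infer subtle changes in the development of the story and to trace the thematic thread that runs through the novel.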
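A per-chapter frequency table of the kind suggested here could be built as in the following Python sketch. It is hypothetical: the function name is invented, the chapters are assumed to be available already as a list of strings (how they are split out of the corpus file is left open), and the two sample strings are toy stand-ins rather than text from the novel.

```python
from collections import Counter

def per_chapter_frequency(chapters, chars):
    """For each chapter, map each tracked character to its relative frequency
    among that chapter's Han characters."""
    table = []
    for chapter in chapters:
        han = [ch for ch in chapter if "\u4e00" <= ch <= "\u9fff"]
        counts = Counter(han)
        total = len(han)
        table.append({ch: (counts[ch] / total if total else 0.0) for ch in chars})
    return table

chapters = ["寶玉笑道好", "黛玉去了"]  # toy stand-ins for chapter texts
table = per_chapter_frequency(chapters, ["玉", "寶", "黛"])
```

Reading one character's values across `table` in chapter order shows where it rises or falls, which is the kind of change in the story's development that the text proposes to look for.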
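IV. Conclusion
The statistics above show that although high-frequency characters are few in number, they play a decisive role in conveying the content of the story. In this paper, high-frequency characters are those that each account for 0.1% or more of the novel's Chinese character occurrences. As the analysis above shows, these characters together make up close to 70% of the full text, so they can reasonably be taken as representative of the novel's high-frequency characters. This also shows how much weight they carry in the novel: few as they are, they lie at the center of how the novel conveys meaning. The characters at or above 0.1% allow a rough inference of the novel's main characters and the general direction of its content. More precise views and explanations, however, require returning to the text for further analysis and more detailed data; they cannot rest on frequency alone.
Most of these characters are nouns, verbs, pronouns, and particles, which also reflects how the novel uses written language. The usage of these characters has changed little over time, and they have largely kept their place among high-frequency words.
As a next step we will take this research further, extending it from characters to vocabulary and from Dream of the Red Chamber to other classic works, in order to identify their similarities and differences, summarize the patterns of language development and change, and explore the close connection between characters and words on the one hand and the plot on the other.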
(那日松 吉日嘎拉, School of Broadcasting and Hosting Arts (播音主持藝術學院), Communication University of China)