
Spam Filtering: Online Naive Bayes Based on TONE

Guanglu Sun, Hongyue Sun, Yingcai Ma, and Yuewu Shen

ZTE Communications, 2013, Issue 2 (published 2013-06-06)

(1. Research Institute of Information Technology, Tsinghua University, Beijing 100084, China;

2. ZTE Corporation, Shenzhen 518057, China;

3. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)

Abstract: The naive Bayes (NB) model has been successfully used to tackle spam and is very accurate. However, there is still room for improvement. We use a train on or near error (TONE) method in online NB to enhance the performance of NB and reduce the number of training emails. We conducted an experiment to determine the performance of the improved algorithm by plotting (1-ROCA)% curves. The results show that the proposed method improves the performance of original NB.

Keywords: spam filtering; online naive Bayes; train on or near error

1 Introduction

Email is an efficient communication technology and one of the most widely used internet applications. However, spam is a drain on network resources and is often detrimental to user experience. Some people use spam for malicious purposes, so spam filtering is a hot spot in current research.

The body of the email contains essential information and is arguably the most important part of the email. Content-based filtering is a reliable method for combating spam. Machine learning provides more accurate prediction and is an attractive solution for content-based filtering. However, there is no consensus about which learning algorithms are best [1].

Machine learning techniques are usually based on generative models, such as naive Bayes (NB), or discriminative models, such as support vector machines (SVMs). For most large-scale tasks, discriminative models perform better than generative models [2]. This is especially true when there is sufficient training data. In the TREC spam track, Bogofilter, a fast Bayesian spam filter, is used as the baseline [3]. Many researchers have achieved state-of-the-art spam filtering using SVMs; however, SVMs typically require training time that is quadratic in the number of training examples [4]. SVMs are not suitable for online filtering because they are not updated in real time. With the Bayesian method, filtering is less accurate, but only linear training time is required, and robustness is less likely to be affected by bad data [5], [6]. The Bayesian filtering system is easy to deploy because it is simple and lightweight [7].

In this paper, we propose an improved online NB classifier. Online NB is often used in spam filtering, and unlike an SVM, it can update itself in real time according to spam behavior. In the original NB model, training data is passively accepted. Updating the training data is expensive for most classifiers, and this practice has been strongly discouraged by industry [8]. Train on or near error (TONE) is a sample-selection method that can be used to discard useless examples [9]. With TONE, only a subset of the examples is used for training. When TONE is applied to the online NB model and tested with several large spam data sets, the online model performs better than the original NB model. In particular, the number of examples needed to train an effective classifier decreases.

In section 2, we review the framework of an online learning model for spam filtering. In section 3, we describe online NB models based on TONE. In section 4, we show that the improved algorithm is much more efficient than the original NB algorithm. Section 5 concludes the paper.

2 Online Learning Model for Spam Filtering

Many models used in traditional machine-learning applications operate in pool-based (offline) mode [10]. The model is trained on a large data set, and examples are reclassified without retraining. An offline learning process tends to be optimized over all the training data at once, whereas online learning can adapt to a changing environment. Online learning algorithms update the learner with newly received examples; that is, they can use an old hypothesis (if one exists) as the starting point for retraining and adapting to changes in the data.

Spam filtering is typically done online (Fig. 1). Emails are viewed as a stream, not as a pool, entering the system one by one. The filter makes a spam or ham prediction for each email. Next, the user reads the message and perhaps gives feedback to the learning-based filter. The filter uses the label to update the feature library and retrain the learner. Ideally, this improves future predictive performance. Large-scale and online classification problems can be solved with a classifier that allows online training and classification [11]. In a changing environment, an online NB learner, which proceeds incrementally, is typically used for spam filtering [12]. An online NB learner has only linear training time and can be easily deployed in an online setting with incremental updates.
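The predict-then-train loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `ToyFilter` is a hypothetical stand-in classifier (a simple word-vote counter) used only to show the online protocol: each message is classified first, and the true label is then fed back to update the learner incrementally.

```python
class ToyFilter:
    """Hypothetical stand-in classifier: counts how often each word has
    appeared in spam vs. ham so far and votes by weighted majority."""
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}

    def predict(self, words):
        spam_votes = sum(self.spam_counts.get(w, 0) for w in words)
        ham_votes = sum(self.ham_counts.get(w, 0) for w in words)
        return "spam" if spam_votes > ham_votes else "ham"

    def train(self, words, label):
        # Incremental update: fold the labeled example into the counts.
        counts = self.spam_counts if label == "spam" else self.ham_counts
        for w in words:
            counts[w] = counts.get(w, 0) + 1

def run_online(filt, stream):
    """Emails arrive as a stream: classify each one, then learn from
    the user's feedback (the true label) before the next message."""
    predictions = []
    for words, true_label in stream:
        predictions.append(filt.predict(words))  # predict first
        filt.train(words, true_label)            # then update the learner
    return predictions
```

Because the learner is updated after every prediction, later messages benefit from everything seen so far, which is what lets the filter track a changing spam distribution.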

▲Figure 1. The online spam filtering scenario.

3 Online NB Model Based on TONE

3.1 Bayesian Spam Filtering

Naive Bayes is popular in industry, probably because of its simplicity and the ease with which it can be implemented. Its computational complexity is linear, and its accuracy is comparable to that of more elaborate learning algorithms.

Here, we give notation for the NB model. In an example dataset {(X(1), y(1)), ..., (X(m), y(m)), ...}, X(m) denotes a vector containing the features of the m-th example, and y(m) is the corresponding label. The spam likelihood P(y = spam | X) is calculated using the Bayes formula:

P(y = spam | X) = P(X | y = spam) P(spam) / P(X)    (1)

Similarly, the ham likelihood is calculated using

P(y = ham | X) = P(X | y = ham) P(ham) / P(X)    (2)

To model P(y | X), we assume each feature x_i is conditionally independent for a given y. This assumption is called the NB assumption. The resulting algorithm is called the NB classifier and is given by

P(y | X) = P(y) ∏_i P(x_i | y) / P(X)    (3)

In spam filtering, there is no need to estimate P(X). The quotient of (1) and (2) is given by

P(y = spam | X) / P(y = ham | X) = [P(spam) ∏_i P(x_i | y = spam)] / [P(ham) ∏_i P(x_i | y = ham)]    (4)

We can use (4) to classify the email as spam or ham. In (4), P(spam) is the a priori probability of spam, and P(x_i | y = spam) is expressed as a frequency in the spam category. The a priori probability of spam is given by

P(spam) = N_spam / (N_spam + N_ham)    (5)

and P(x_i | y = spam) is given by

P(x_i | y = spam) = N(x_i, spam) / Σ_j N(x_j, spam)    (6)

where N_spam is the number of spam messages, N_ham is the number of ham messages, and N(x_i, spam) is the number of occurrences of feature x_i in spam. The situation for ham is similar to that for spam.
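The maximum-likelihood estimates in (5) and (6) can be sketched directly from counts. This is an illustrative sketch under the paper's definitions (no smoothing yet, so unseen words would get probability zero); the function names are our own.

```python
def estimate(messages):
    """Estimate P(spam) and P(x_i | y = spam) from labeled data.
    messages: list of (words, label) pairs, label in {'spam', 'ham'}."""
    n_spam = sum(1 for _, y in messages if y == "spam")
    n_ham = len(messages) - n_spam
    p_spam = n_spam / (n_spam + n_ham)  # equation (5): class prior

    # Pool all word occurrences from the spam class.
    spam_words = [w for ws, y in messages if y == "spam" for w in ws]
    total = len(spam_words)

    def p_word_given_spam(w):
        # Equation (6): relative frequency of the feature within spam.
        return spam_words.count(w) / total

    return p_spam, p_word_given_spam
```

The ham-side estimates are symmetric; section 3.2 adds the smoothing needed to avoid zero probabilities.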

3.2 The Model

The proposed NB algorithm works fairly well, but there is a simple tweak that makes it work much better, especially for text classification. If a feature only occurs in ham, then P(x_i | y = spam) may be zero. To avoid this, we can use Laplace smoothing, given by

P(x_i | y = spam) = (N(x_i, spam) + 1) / (Σ_j N(x_j, spam) + |V|)    (7)

where |V| is the size of the feature vocabulary.

To avoid underflow in the practical calculation, we use logarithms. Therefore, (4) is transformed into

P' = log P(spam) − log P(ham) + Σ_i [log P(x_i | y = spam) − log P(x_i | y = ham)]    (8)

We can now classify the email by P'. If P' is greater than 0, the mail is predicted to be spam; otherwise, it is ham.

To apply the TONE algorithm, we use the logistic function to convert P' into a score between 0 and 1. Equation (9) maps P' to a score between 0 and approximately 1. The scale parameter s ensures that P' is not too big:

score = 1 / (1 + e^(−P'/s))    (9)
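The log-odds of (8) and the logistic mapping of (9) can be sketched as two small functions. This is an illustrative sketch, not the paper's code; the default scale of 2500 follows the experimental setting reported in section 4.3, and `p_w_spam` / `p_w_ham` stand for the smoothed conditional probabilities of (7).

```python
import math

def log_odds(words, p_spam, p_w_spam, p_w_ham):
    """P' of equation (8): log prior ratio plus summed log likelihood
    ratios, computed in log space to avoid underflow."""
    p_prime = math.log(p_spam) - math.log(1.0 - p_spam)
    for w in words:
        p_prime += math.log(p_w_spam(w)) - math.log(p_w_ham(w))
    return p_prime

def to_score(p_prime, scale=2500.0):
    """Equation (9): squash P' into (0, 1) with a scaled logistic
    function; score > 0.5 corresponds to P' > 0 (spam)."""
    return 1.0 / (1.0 + math.exp(-p_prime / scale))
```

A score of exactly 0.5 corresponds to P' = 0, so the spam/ham decision boundary is unchanged by the mapping; the score is only needed so that TONE's thick threshold c can be applied on a common 0-to-1 scale.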

To meet the spam filter's requirements, the online filter should update itself at the appropriate time. Spam filtering needs to be highly scalable because it involves large amounts of high-dimensional data. Content-based spam detection often requires training the learner. In original NB, there is no selective updating of the learner; however, in improved online NB, TONE can be applied to the training process (called thick threshold training) [13]. TONE is developed from train on error (TOE). There are two scenarios in which the learner's training mode is activated with this approach: 1) when samples have been misclassified by the filter, and 2) when correctly classified samples fall within a predefined boundary. We improve the predictive ability of NB by introducing an online NB method based on TONE. The improved algorithm, called NB-TONE, is cheap and does not result in performance loss.

With TONE, the examples with the least classification confidence are chosen. The parameter c is a thick threshold for training. Regardless of how the email is classified, if its score does not clear the thick threshold, the email is not well classified, and the learner has to be trained and updated. On the other hand, TONE can also make a classifier more robust so that overfitting is averted. If an example is far from the hyperplane, the classifier predicts it with high confidence, and such examples do not need training. TONE is a sample-selection method that reduces the number of training examples and cuts down training time.

Algorithm 1. NB-TONE
1: for each mail <X(i), y(i)>, i = 1, 2, ...
2:   a new message arrives
3:   calculate P' using (8)
4:   map P' to a score using (9)
5:   if score > 0.5 then
6:     X(i) is spam
7:   else
8:     X(i) is ham
9:   end if
10:  if |score − 0.5| < c or X(i) is misclassified then
11:    train the model with <X(i), y(i)>
12:  end if
13: end for
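Algorithm 1 can be sketched as a small class. This is an illustrative sketch under the definitions in this section, not the authors' implementation: it uses Laplace add-one smoothing for (7), the log-odds of (8), the scaled logistic of (9), and TONE's train-on-or-near-error rule; all identifiers are our own.

```python
import math
from collections import defaultdict

class NBTone:
    """Sketch of Algorithm 1 (NB-TONE): online naive Bayes that trains
    only on misclassified examples or on examples whose score falls
    inside the thick threshold, |score - 0.5| < c."""

    def __init__(self, c=0.15, scale=2500.0):
        self.c = c                      # thick threshold for training
        self.scale = scale              # scale parameter s of equation (9)
        self.word_counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.total_words = {"spam": 0, "ham": 0}
        self.n_msgs = {"spam": 0, "ham": 0}
        self.vocab = set()

    def _log_p(self, word, label):
        # Laplace-smoothed log P(word | label), as in equation (7);
        # the +1 in the denominator keeps the empty model well-defined.
        num = self.word_counts[label][word] + 1
        den = self.total_words[label] + len(self.vocab) + 1
        return math.log(num / den)

    def score(self, words):
        # P' of (8): log prior ratio plus summed log likelihood ratios
        # (smoothed priors), mapped through the scaled logistic of (9).
        prior = (self.n_msgs["spam"] + 1) / (self.n_msgs["ham"] + 1)
        p_prime = math.log(prior)
        for w in words:
            p_prime += self._log_p(w, "spam") - self._log_p(w, "ham")
        return 1.0 / (1.0 + math.exp(-p_prime / self.scale))

    def _train(self, words, label):
        self.n_msgs[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.total_words[label] += 1
            self.vocab.add(w)

    def process(self, words, true_label):
        s = self.score(words)
        predicted = "spam" if s > 0.5 else "ham"
        # TONE rule: train on error, or when the score is near 0.5.
        if predicted != true_label or abs(s - 0.5) < self.c:
            self._train(words, true_label)
        return predicted
```

Setting c = 0.5 makes the near-error condition always true, so every example is trained on and the filter degenerates into pure online NB, matching the remark in section 4.3.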

4 Evaluationsand Results

In section 3, online NB spam filtering based on TONE was proposed. Here, we test the algorithm on large benchmark sets of email data.

4.1 Data Sets

Two benchmark data sets were used, both from TREC spam filtering competitions. These data sets were trec05p, which contains 92,189 messages in English [3], and trec06p, which contains 37,822 messages in English [14]. For a fair comparison, each data set was ordered, and we compared our method with the original model. (1-ROCA)% was used as the standard performance measure.

4.2 Feature Space

Feature extraction is important for machine learning. An appropriate feature extraction method greatly improves the accuracy of the learner. In [15], character-level n-grams are shown to be valid and robust for a variety of spam detection methods [16]. Here, an email is represented as a vector that has a unique dimension for each possible substring of n characters. With the 4-gram feature extraction method, only the first 3000 features of each email were extracted, and repeated features were removed from each email.

4.3 Classification Performance

In our experiments on NB-TONE, the sampling threshold c ranged from 0.01 to 0.50. We used the parameter ε = 10⁻⁵, and the scale parameter was 2500.

We examined the difference between pure NB and NB-TONE. For NB-TONE, c = 0.15 and c = 0.25. Note that if c = 0.5, the algorithm degenerates into pure NB. Train% represents the overall percentage of data used for training. The results in Table 1 show that NB-TONE comprehensively outperforms pure NB. Moreover, NB-TONE cuts down the number of training examples and reduces computational cost.

▼Table 1. NB-TONE beats pure NB on trec05p and trec06p.

4.4 Parameter Sensitivity

Fig. 2 shows the effect of c on (1-ROCA)% performance. The results indicate that (1-ROCA)% performance varies with respect to c. From c = 0 to 0.15, the number of training examples increases and performance improves. However, as c approaches 0.5, performance worsens.

5 Conclusion

▲Figure 2. NB-TONE results on the data sets, reported as (1-ROCA)% by threshold c.

We improved traditional online naive Bayes by using TONE. In the online process, the classifier updates itself at the appropriate time. This method improves the classification of low-confidence examples and reduces the number of labels needed for high performance. Furthermore, the approach is well suited to this domain because spam filtering is inherently an online task. Our experiments show that NB-TONE is reliable.
