
The Research on How to Detect Plagiarism in the Theses Based on Automatic Abstraction

2010-04-16 09:15:20 Zhao Junjie, Wang Li, Wang Pingshui
Computer & Telecommunication (電腦與電信), 2010, Issue 2

Zhao Junjie, Wang Li, Wang Pingshui

(Anhui University of Finance & Economics, Bengbu 233061, Anhui)

Abstract: Automatic abstraction uses a computer to extract, from a text or a collection of texts, a brief and coherent passage that reflects the main content completely and accurately, to meet the needs of general or particular users. This paper first reviews the definition, function and classification of automatic abstraction, and then presents an automatic abstraction technique based on keyword retrieval. It also proposes a method for detecting plagiarism in theses based on automatic abstraction and analyzes the experimental results. Finally, further work is briefly outlined.

Keywords: automatic abstraction; keywords; extraction; retrieval; plagiarism detection

About the author: Zhao Junjie, male, from Suzhou, Anhui; master's degree; lecturer; research interests: data mining and information retrieval.

Funding: Ministry of Education Social Sciences Research Fund, Youth Project (No. 07JC870006); Anhui University of Finance and Economics key teaching and research project (No. ACJYZD200914).

1. Introduction

Automatic abstraction is the use of a computer to extract an abstract automatically from the original literature [1]. It quickly condenses large volumes of electronic text, and is an accurate and efficient way to speed up reading and obtain information. An abstract is a brief and coherent passage that accurately reflects the central content of a document. There are three main types: indicative, informative and critical [2]. This paper studies the informative abstract, a concentrated expression of the details of the content. By reading only the abstract, users can grasp the core content of the original paper, which saves considerable time and improves reading efficiency. The main purpose of this study is to design an automatic abstraction technique based on keyword retrieval and apply it to the rapid detection of plagiarism in papers.

2. Overview of Automatic Abstraction Technology

Automatic abstraction consists of three steps: text analysis, information selection and generalization, and abstract generation. Text analysis finds the most representative components of the original content. The conversion step (information selection and generalization) compresses the text by summarizing it. The last step recombines the selected content and generates the abstract [3].

There are four main methods of automatic abstraction: automatic extraction, understanding-based automatic abstraction, information extraction, and structure-based automatic abstraction [4].

2.1 Automatic Extraction

Automatic extraction treats a text as a linear sequence of sentences and each sentence as a linear sequence of words. It usually works in four steps: (1) calculate the weight of each word; (2) calculate the weight of each sentence; (3) sort all the original sentences in descending order of weight and select the highest-weighted ones as abstract sentences; (4) output the abstract sentences in their original order in the text. The calculation of word and sentence weights and the selection of abstract sentences all rely on six formal features of the text: word frequency, title, position, syntactic structure, cue words and indicative phrases. These six features are the basis of automatic extraction, and they indicate the theme of the text from different angles.
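The four steps can be illustrated with a minimal Python sketch. It uses only one of the six features (word frequency), works on English text with a naive sentence splitter, and all function names are hypothetical; it is an illustration under those assumptions, not the system described in this paper.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lower-case alphabetic tokens; a stand-in for real word segmentation."""
    return re.findall(r"[a-z]+", sentence.lower())

def extract_summary(text, n_sentences=2, stopwords=frozenset()):
    sentences = split_sentences(text)
    # Step 1: word weights (assumption: raw frequency over the whole text).
    word_weight = Counter(
        w for s in sentences for w in tokenize(s) if w not in stopwords
    )
    # Step 2: sentence weights (sum of the weights of the words they contain).
    scored = []
    for idx, s in enumerate(sentences):
        weight = sum(word_weight[w] for w in tokenize(s) if w not in stopwords)
        scored.append((weight, idx))
    # Step 3: sort by weight, highest first, and keep the top n sentences.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n_sentences]
    # Step 4: output the selected sentences in their original order.
    return [sentences[idx] for _, idx in sorted(top, key=lambda t: t[1])]
```

Summing word weights favors long sentences; a fuller implementation would normalize by length and add the title, position and cue-word features listed above.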

2.2 Automatic Abstraction Based on Understanding

The obvious difference between this method and automatic extraction lies in the use of knowledge. It not only uses linguistic knowledge to obtain the structure of the language, but also uses domain knowledge to obtain the meaning of the text. The abstract is then generated from that meaning.

2.3 Information Extraction

Information extraction automatically identifies information such as entities, relationships and events in a given set of texts, and stores or manages that information. Using information extraction for automatic summarization involves first identifying the theme of the text, then choosing an abstract framework, analyzing the useful fragments in depth, and filling the framework with relevant phrases or sentences. Finally, an abstract template is used to convert the content of the framework into the abstract, which is then output.

2.4 Automatic Abstraction Based on Structure

Abstract sentences are usually regarded as the central sentences of a network of sentences, that is, sentences related to many others. The relationship between sentences can be judged from shared words or conjunctions. A long article can likewise be regarded as a network of paragraphs. Each paragraph is given a feature vector, and the inner product of the feature vectors of two paragraphs is taken as the strength of the connection between them. If the connection strength exceeds a given threshold, the two paragraphs are considered semantically linked. Finally, the central paragraphs, those linked to many others, are extracted to form the abstract of the article.
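The paragraph-network idea can be sketched as follows. The feature vector here is assumed to be a plain bag-of-words count, and the function names and the threshold value are hypothetical; the source does not specify how the vectors are built.

```python
import re
from collections import Counter

def feature_vector(paragraph):
    """Assumed feature vector: word counts of the paragraph."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))

def inner_product(u, v):
    """Inner product of two sparse count vectors."""
    return sum(count * v[word] for word, count in u.items())

def central_paragraphs(paragraphs, threshold, top_k=1):
    """Return indices of the top_k paragraphs with the most semantic links."""
    vectors = [feature_vector(p) for p in paragraphs]
    n = len(paragraphs)
    links = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Two paragraphs are linked when their connection strength
            # exceeds the given threshold.
            if inner_product(vectors[i], vectors[j]) > threshold:
                links[i] += 1
                links[j] += 1
    ranked = sorted(range(n), key=lambda i: (-links[i], i))
    return sorted(ranked[:top_k])
```

In practice the vectors would usually be length-normalized so that long paragraphs do not dominate the inner product.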

3. A Technique of Automatic Abstraction Based on Keyword Retrieval

3.1 Keyword Extraction

The keyword extraction model combines the following components in a single framework: word segmentation and part-of-speech tagging, text preprocessing, a linear weighting algorithm, the formation and filtering of compound words, and keyword merging. Its two important data structures are the word information table and the compound word information table. The generated compound words are not treated as special cases; they are assigned weights in a principled way and compete with the other words (those scored by the linear weighting algorithm). The two tables are then merged to obtain the final keywords [5].

The text is first preprocessed, segmented and tagged with parts of speech, and then the linear weighting algorithm is applied. By analyzing word frequency, part of speech and position in the Chinese text, the weighting factors are quantified and the weight of each word is calculated. Candidate keywords are then extracted in order of weight and serve as the basis for the final keywords. A second list of candidate keywords is obtained from the compound words formed on top of the linear weighting results. Finally, duplicate items are removed from the two tables and the keywords are output in order of weight. The number of keywords can be specified by the user.
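A minimal sketch of linear weighting and table merging is given below. The input is assumed to be already segmented and tagged; the coefficients, the part-of-speech scores and the function names are hypothetical values chosen for illustration, since the source does not give the actual ones.

```python
from collections import defaultdict

# Hypothetical coefficients and part-of-speech scores (not from the paper).
W_FREQ, W_POS, W_LOC = 0.5, 0.3, 0.2
POS_SCORE = {"n": 1.0, "v": 0.6, "a": 0.4}

def score_words(tokens):
    """tokens: list of (word, pos_tag, in_title) produced by a tagger.

    Returns the word information table as {word: weight}, where the weight
    is a linear combination of frequency, part of speech and position.
    """
    freq = defaultdict(int)
    pos = {}
    in_title = defaultdict(bool)
    for word, tag, title_flag in tokens:
        freq[word] += 1
        pos.setdefault(word, tag)
        in_title[word] = in_title[word] or title_flag
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {
        w: W_FREQ * freq[w] / max_freq
        + W_POS * POS_SCORE.get(pos[w], 0.0)
        + W_LOC * (1.0 if in_title[w] else 0.0)
        for w in freq
    }

def merge_keywords(word_scores, compound_scores, limit=5):
    """Merge the word table and the compound word table, drop duplicates
    (keeping the higher weight), and return the top `limit` keywords."""
    merged = dict(word_scores)
    for w, s in compound_scores.items():
        merged[w] = max(s, merged.get(w, 0.0))
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:limit]]
```

The `limit` parameter corresponds to the user-specified number of keywords; how compound words are formed and scored is left outside this sketch.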

3.2 Algorithm of Automatic Abstraction

The automatic abstraction algorithm first segments the text with a segmentation tool [6] and then extracts the keywords. The keywords are stored by paragraph, and each keyword is given a weight according to its order of extraction (1.0, 0.9, 0.8); the weight of each sentence in a paragraph is then calculated from the weights of the keywords it contains. Besides word frequency, the title, the position and the length of sentences are also important factors in choosing abstract sentences. According to statistics, the probability that an abstract sentence appears in the title is about 95%, at the beginning of a paragraph 85%, and at the end of a paragraph 7%. Titles that contain keywords are therefore taken directly as abstract sentences. The other sentences are sorted by weight, and the 5 highest-weighted sentences in each paragraph are picked as candidate key sentences. The abstract sentences are then selected from these candidates by considering their position and length, and the abstract of the whole thesis is formed. The specific process is shown in Figure 1.
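As a rough illustration of the sentence-weighting stage, the sketch below assigns rank-based weights to the keywords and picks the candidate sentences of each paragraph. Continuing the 1.0, 0.9, 0.8 pattern in steps of 0.1 with a floor of 0.1, and matching keywords by substring, are assumptions made here; the function names are hypothetical, and the final selection by position and length is not covered.

```python
def keyword_weights(keywords, start=1.0, step=0.1, floor=0.1):
    """Weights by order of extraction: 1.0, 0.9, 0.8, ...

    The step and the floor are assumptions; the source lists only the
    first three values.
    """
    return {
        kw: round(max(start - step * i, floor), 2)
        for i, kw in enumerate(keywords)
    }

def sentence_weight(sentence, weights):
    """Sum of the weights of the keywords that occur in the sentence.

    Substring matching is a simplification; segmented Chinese text would
    be matched token by token. Keywords are assumed to be lower-case.
    """
    text = sentence.lower()
    return sum(w for kw, w in weights.items() if kw in text)

def candidate_sentences(paragraphs, weights, per_paragraph=5):
    """paragraphs: list of paragraphs, each a list of sentences.

    Returns, for each paragraph, its highest-weighted sentences (at most
    `per_paragraph`), kept in their original order.
    """
    result = []
    for sentences in paragraphs:
        scored = [(sentence_weight(s, weights), i) for i, s in enumerate(sentences)]
        top = sorted(scored, key=lambda t: (-t[0], t[1]))[:per_paragraph]
        result.append(
            [sentences[i] for w, i in sorted(top, key=lambda t: t[1]) if w > 0]
        )
    return result
```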

4. Detection of Thesis Copying Based on Automatic Abstraction

4.1 Basic Thought

Because most papers are long, comparing them in full is time-consuming. We therefore compare their abstracts first, and compare the full texts only if the abstracts are highly similar, in order to find the content suspected of plagiarism. However, some authors provide abstracts that are too brief (no more than 200 words) or not very accurate, whereas a summary of the full content is a good way to bring out the key points. This paper therefore generates automatic abstracts of the papers and compares those, which improves accuracy.

4.2 Concrete Steps

Step 1: Segment the paper to be checked and the original paper.

Step 2: Extract the keywords of the paper to be checked and of the original paper, and store them.

Step 3: Calculate and sort the sentence weights in the paper to be checked and in the original paper, and generate their automatic abstracts.

Step 4: Compare the automatic abstracts of the paper to be checked and the original paper and calculate their similarity; also calculate the similarity between the abstract provided by the author and the automatic abstract.

Step 5: If the similarity exceeds 10%, the paper is suspected of copying: compare the full texts of the paper to be checked and the original paper, and output the copied content. Otherwise, the paper is not considered a copy.

Figure 1 Automatic Abstraction Based on Keyword Retrieval
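The two-stage decision in Steps 4 and 5 can be sketched as below. The similarity measure is passed in as a function, and exact sentence matching is used to output the copied content; both choices, and the name `detect_plagiarism`, are assumptions for illustration rather than the implementation used in the experiment.

```python
def detect_plagiarism(suspect_abstract, source_abstract,
                      suspect_sentences, source_sentences,
                      similarity, threshold=0.10):
    """Compare abstracts first; compare full texts only above the threshold.

    similarity: function (text_a, text_b) -> value in [0, 1].
    threshold:  0.10 corresponds to the 10% cut-off in Step 5.
    """
    score = similarity(suspect_abstract, source_abstract)
    if score <= threshold:
        # Not considered a copy: skip the expensive full-text comparison.
        return {"suspected": False, "abstract_similarity": score, "copied": []}
    # Full-text comparison (assumption: a sentence counts as copied when it
    # matches a source sentence exactly, ignoring case and outer whitespace).
    source_set = {s.strip().lower() for s in source_sentences}
    copied = [s for s in suspect_sentences if s.strip().lower() in source_set]
    return {"suspected": True, "abstract_similarity": score, "copied": copied}
```

Passing the similarity function as a parameter keeps the screening logic independent of how similarity is measured.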

4.3 Experimental Result

Three copied files, D1, D2 and D3, were designed as test samples, with plagiarism proportions of about 20%, 30% and 50% respectively. The main purpose is to test how different proportions of plagiarism affect the comparison result. Similarity is calculated using word-frequency statistics, that is, as the proportion of similar words among the total words [7]. Figure 2 shows the interface of the automatic abstraction system. Table 1 lists the three copied files D1, D2 and D3 and their corresponding abstracts, together with the similarity between them and the original text, its abstract and its automatic abstract.

Figure 2 Automatic Abstract for a Certain Document

Table 1 Experimental Result
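One way to read the word-frequency similarity described in Section 4.3 is as the share of word occurrences in the checked text that are also found in the original. The sketch below implements that reading with multiset counts; it assumes the text is already segmented into space-delimited words, the function names are hypothetical, and the exact formula of [7] may differ.

```python
import re
from collections import Counter

def word_counts(text):
    """Word-frequency table of a text (assumes space-delimited words)."""
    return Counter(re.findall(r"\w+", text.lower()))

def frequency_similarity(suspect, source):
    """Proportion of the suspect text's words that also occur in the source.

    Each word is matched at most as many times as it appears in the source.
    """
    suspect_counts = word_counts(suspect)
    source_counts = word_counts(source)
    total = sum(suspect_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((suspect_counts & source_counts).values())
    return shared / total
```

For example, "the cat sat on the mat" compared with "the cat ran" shares two of its six word occurrences, giving a similarity of one third.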

4.4 Basic Summary

The experimental results show that both the full-text similarity and the automatic-abstract similarity are very close to the proportion of copying. The similarity based on the author-provided abstract, however, is sometimes off because of the accuracy and length of that abstract. The abstract generated by keyword-retrieval-based automatic abstraction can roughly summarize the text and stand in for the text to be checked. Of course, this is only a preliminary inspection; detailed full-text detection is still needed.

In addition, the keywords given by some authors are few and not very accurate. This system usually extracts 5-8 keywords that reflect the theme of the text, so the automatic abstract based on keyword retrieval is more accurate.

5. Conclusion

A good-quality abstract can, to a certain extent, stand in for the original text in retrieval, which reduces the time spent on information retrieval. Researchers at home and abroad continue to search for accurate and efficient automatic abstraction algorithms. The method in this paper still has room for improvement. At present, the abstract of a 7000-word paper is about 700 words, and the more words or paragraphs the text has, the longer the abstract becomes. It is therefore necessary to reduce the length of the abstract to within 500 words. In practice, several paragraphs can be combined, or key sentences can be selected per subsection rather than per paragraph.

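[1] Chai Xiaoli. Research and Application of Automatic Abstracting Technology [D]. Master's thesis, Changchun University of Science and Technology, 2006.

[2] Huang Liqiong. Research on Chinese Automatic Abstracting and Its Evaluation Methods [D]. Master's thesis, Chongqing University, 2007.

[3] Guo Yanhui, Zhong Yixin, et al. A Survey of Automatic Abstracting [J]. Journal of the China Society for Scientific and Technical Information, 2002, 21(5): 582-591.

[4] Liu Ting, Wang Kaizhu. Four Main Methods of Automatic Abstracting [J]. Journal of the China Society for Scientific and Technical Information, 1999, 18(1): 10-19.

[5] Zhang Hongying. A Chinese Text Keyword Extraction Algorithm Based on Fuzzy Processing [J]. New Technology of Library and Information Service, 2009, (5): 39-43.

[6] Li Ronglu. Research on Text Classification and Related Technologies [D]. Doctoral dissertation, Fudan University, 2005.

[7] Zhao Junjie. An Algorithm for Judging Paper Plagiarism Based on Paragraph Word-Frequency Statistics [J]. Computer Technology and Development, 2009, 19(4): 231-233, 238.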
