
Integrating Deep Learning and Machine Translation for Understanding Unrefined Languages

Computers, Materials & Continua, 2022, Issue 1

HongGeun Ji, Soyoung Oh, Jina Kim, Seong Choi and Eunil Park*

1 Department of Applied Artificial Intelligence, Sungkyunkwan University, Seoul, 03063, Korea

2 Raon Data, Seoul, 03073, Korea

3 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA

Abstract: In the field of natural language processing (NLP), the advancement of neural machine translation has paved the way for cross-lingual research. Yet, most studies in NLP have evaluated the proposed language models on well-refined datasets. We investigate whether a machine translation approach is suitable for multilingual analysis of unrefined datasets, particularly chat messages on Twitch. To address this, we collected a dataset of 7,066,854 and 3,365,569 chat messages from English and Korean streams, respectively. We employed several machine learning classifiers and neural networks with two different types of embedding: word-sequence embedding and the final layer of a pre-trained language model. The results of the employed models indicate that the accuracy difference between English and English-to-Korean data was relatively high, ranging from 3% to 12%. For Korean data (Korean and Korean-to-English), it ranged from 0% to 2%. The results therefore imply that translation from a low-resource language (e.g., Korean) into a high-resource language (e.g., English) yields higher performance than the reverse. Several implications and limitations of the presented results are also discussed. For instance, we suggest the feasibility of translating resource-poor languages so that the tools of resource-rich languages can be used in further analysis.

Keywords: Twitch; multilingual; machine translation; machine learning

1 Introduction

In linguistic and computer science research, one of the most challenging topics is developing systems for high-quality translation and multilingual processing. Thus, many scholars have attempted to propose state-of-the-art translation services and systems to improve translation quality.

Alongside translation research, natural language processing (NLP) technologies have been improving rapidly. Because of international collaboration in research and development, the majority of NLP research investigates resource-rich languages that are widely used in global society. Hence, NLP research is more focused on English than on other languages [1].

Because of insufficient research and development on under-resourced languages, several scholars have attempted to apply English NLP technologies to understand and investigate other languages [2-4]. For instance, Patel and colleagues used machine translation for sentiment analysis of movie reviews and then compared the results of the translation approach with native Hindustani NLP [3].

To employ NLP technologies for low-resource languages, a two-step approach can be used. First, well-constructed translation methodologies are employed to translate content in the low-resource language into a high-resource language. Second, the translated content is represented as vectors by various word embedding algorithms. Improved translation methodologies can therefore enhance the results of NLP technologies in other languages.
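As a minimal illustration of this two-step approach, the Python sketch below routes low-resource messages through a translation step and then vectorizes the translated text. Here, `translate_to_english` is a hypothetical placeholder for any machine translation service, and scikit-learn's CountVectorizer stands in for the word embedding step purely as an example.

```python
# Minimal sketch of the two-step approach; translate_to_english is a
# hypothetical placeholder, not part of any real translation library.
from sklearn.feature_extraction.text import CountVectorizer

def translate_to_english(text: str) -> str:
    # Placeholder: in practice, call a machine translation service here.
    # Kept as an identity function so the sketch runs end to end.
    return text

korean_messages = ["예시 채팅 메시지 하나", "예시 채팅 메시지 둘"]

# Step 1: translate low-resource content into a high-resource language.
english_messages = [translate_to_english(m) for m in korean_messages]

# Step 2: represent the translated content as vectors.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(english_messages)
```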

Within this trend, several studies have attempted to develop state-of-the-art translation techniques. One remarkable improvement is Google's neural machine translation system (GNMT) [5]. Compared with the phrase-based production system, GNMT reduced translation errors by 40% under human evaluation [5]. Using such rapidly improving machine translation techniques, Kocich et al. [6] successfully categorized the sentiments in an online social network dataset using an English sentiment library.

However, most recent studies have been conducted on well-refined content. Unrefined content poses hindrances, for example, when chat messages are processed and explored. Communication in chat messages (known as "netspeak") has unique spelling and grammar characteristics, including the use of acronyms and abbreviations [7]. Moreover, because many me-media channels, which are interactive media platforms for viewers and streamers, have been introduced globally, a huge amount of chat messages and content in various languages is produced. Thus, we aim to investigate whether machine translation is applicable for multilingual analysis of unrefined content. To address this, unrefined chat messages of both English and Korean streamers on Twitch [8], a widely used online streaming service, were collected for analysis.

2 Related Work

Machine learning and deep learning approaches have become mainstream in NLP research, and cross-lingual approaches in NLP have been extensively explored with considerable results. Thanks to these approaches, diverse tasks can be performed not only for resource-rich languages (e.g., English) but also for limited-resource languages (e.g., Spanish and Hindi) [2-4].

Among these tasks, text categorization using bilingual-corpus datasets has been presented as a cost-effective methodology yielding comparable accuracy [9].

Moreover, with the advancement of neural machine translation (NMT) beyond conventional translation models, several cross-lingual approaches have applied this technique [3,10,11]. Patel and colleagues showed comparable sentiment classification accuracy by translating low-resource languages into English (as a high-resource language) [3]. Furthermore, the performance of NMT models can be enhanced by focusing on topic-level attention during translation [11].

Recent cross-lingual approaches have been improved by pre-trained language models based on neural networks [12,13]. Pre-trained word-embedding techniques, such as Skip-Gram [14] and GloVe [15], capture different properties of words. Moreover, for learning contextual meaning and syntactic structure, several state-of-the-art pre-trained language models have been introduced, including CoVe [16], ELMo [17], and BERT [18]. Transformer encoders enable such models to handle complex representations of contextual semantics. All of these representative pre-trained language models were trained on refined large text corpora (such as Wikipedia in English, as a commonly used language).

Owing to these properties, several studies have applied pre-trained language models to large-scale data [19]. However, the majority of prior studies have used relatively well-refined datasets (e.g., Wikipedia, social networking sites, microblogs, or user reviews) [20]. As pre-trained language models read the whole sequence of words and have shown remarkable improvements in NLP tasks, we examine whether applying advanced pre-trained language models to unrefined content, to learn the entire context of words, can be recommended in the field of machine translation.

Thus, we investigate whether machine translation approaches are applicable to a classification task on unrefined data, compared with evaluation in the original language.

3 Method

To validate our approach on unrefined data, we used chat messages from a representative live-streaming platform, Twitch, where there are active interactions and communications between viewers and streamers [21]. We selected a straightforward binary classification task for chat messages: predicting whether a specific viewer on Twitch is a subscriber who pays for live game-streaming services.

3.1 Data Acquisition and Preprocessing

We collected the 50 most-followed English and Korean streamers from TwitchMetrics [22]. Specifically, we collected all chat messages from five recent streams of each streamer using an open-source crawler, Twitch-Chat-Downloader [23]. The dataset included 7,066,854 and 3,365,569 chat messages from English and Korean streams, respectively.

Fig. 1 shows the whole data preprocessing procedure. During preprocessing, we first excluded chat messages with URLs, user tags annotated with @, and emoticons. In addition, we eliminated the notifications indicating who subscribed to the streamers. We did not apply stemming or lemmatization, to prevent information loss in short messages. We also removed chat messages of fewer than five words, which cannot convey the states of the viewers. Subsequently, we used the Google Translation API to translate English chat messages to Korean and vice versa. Chat messages that were not translated properly were removed. Finally, we used 1,321,445 English (EN) and English-to-Korean (EN2KO) and 109,419 Korean (KO) and Korean-to-English (KO2EN) chat messages. Moreover, to classify whether a specific viewer is a subscriber, we identified the subscription badges of viewers, which were displayed in messages.
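As a hedged sketch of the filtering stage described above, the snippet below drops messages containing URLs, @-tags, or emote tokens, as well as messages shorter than five words. The regular expressions are illustrative approximations, not the authors' exact rules.

```python
import re

# Illustrative approximations of the exclusion rules, not the exact
# patterns used in the paper.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"@\w+")
EMOTE_RE = re.compile(r":\w+:")  # assumed emoticon format

def keep_message(text: str) -> bool:
    """Keep a chat message only if it is clean and at least five words long."""
    if URL_RE.search(text) or TAG_RE.search(text) or EMOTE_RE.search(text):
        return False
    return len(text.split()) >= 5

messages = ["check this https://example.com", "@someone hi", "what a great play that was"]
filtered = [m for m in messages if keep_message(m)]  # keeps only the last one
```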

3.2 Embedding

We employed two embedding techniques: word-sequence embedding and sentence embedding.

3.2.1 Word-Sequence Embedding

We employed two tokenization techniques according to the target language. For English (EN and KO2EN), we employed the Tokenizer of the Python library Keras [24]. We tokenized the Korean chat messages (KO and EN2KO) using the Open Korea Text tokenizer of the Korean NLP library KoNLPy [25]. After tokenization, we embedded the tokens as 256-dimensional vectors.
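A sketch of the word-sequence embedding under stated assumptions: the vocabulary size, sequence length, and padding strategy are not reported in the paper, so the values below are placeholders.

```python
# Word-sequence embedding sketch; maxlen and the toy corpus are assumptions.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

english_messages = ["what a great play that was", "this stream is really fun today"]

# English (EN, KO2EN): tokenize with the Keras Tokenizer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(english_messages)
sequences = tokenizer.texts_to_sequences(english_messages)
padded = pad_sequences(sequences, maxlen=50)  # maxlen is an assumption

# Embed tokens as 256-dimensional vectors, as in the paper.
embedding_layer = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=256)

# Korean (KO, EN2KO) would instead use KoNLPy's Open Korea Text tokenizer
# (requires a JVM), e.g.:
#   from konlpy.tag import Okt
#   tokens = Okt().morphs("예시 채팅 메시지")
```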

Figure 1: Workflow procedures

3.2.2 Sentence Embedding: BERT

We used the embedding vector extracted from the last layer of a widely used pre-trained language model, BERT, which reflects the context of the sentences. Among the wide range of BERT model sizes, we chose the BERT-base-uncased model for the English chat messages (EN and KO2EN) [26]. For the Korean chat messages (KO and EN2KO), we applied KoBERT [27]. Specifically, we used the hidden state of the first token of the input sequence (the [CLS] token) in the last layer of BERT, a 768-dimensional vector, as one of the embedding techniques.
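The paper does not publish its extraction code, so the following is an assumed equivalent using the Hugging Face transformers library: it feeds a message through BERT-base-uncased and reads off the last-layer hidden state of the [CLS] token as a 768-dimensional vector.

```python
# Assumed-equivalent sketch of [CLS] embedding extraction with Hugging Face
# transformers; the paper's own tooling is not specified.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("what a great play that was", return_tensors="pt",
                   truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Hidden state of the first token ([CLS]) in the last layer: a 768-d vector.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```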

3.2.3 Classification Models

We applied both machine learning classifiers and deep neural networks: Logistic Regression, Naïve Bayes, Random Forest, XGBoost, Multilayer Perceptron (MLP), STACKED-LSTM, and CONV-LSTM. The STACKED-LSTM model consists of two long short-term memory (LSTM) layers with 128 recurrent neurons and a fully connected layer. The CONV-LSTM has a one-dimensional convolutional layer with 64 filters, a max-pooling layer, an LSTM layer with 128 recurrent neurons, and a fully connected layer. The output of the fully connected layer is passed through the softmax activation function.
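The two neural architectures can be sketched in Keras as follows; the layer widths and embedding dimension follow the description above, while the kernel size, vocabulary size, and sequence length are assumptions the paper does not state.

```python
# Sketches of STACKED-LSTM and CONV-LSTM as described in the text;
# VOCAB_SIZE and kernel_size are assumed hyperparameters.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Conv1D, MaxPooling1D, Dense

VOCAB_SIZE = 20000  # assumption

stacked_lstm = Sequential([
    Embedding(VOCAB_SIZE, 256),
    LSTM(128, return_sequences=True),  # first LSTM layer, 128 recurrent neurons
    LSTM(128),                         # second LSTM layer
    Dense(2, activation="softmax"),    # fully connected output layer
])

conv_lstm = Sequential([
    Embedding(VOCAB_SIZE, 256),
    Conv1D(64, kernel_size=3, activation="relu"),  # 64 filters; kernel size assumed
    MaxPooling1D(),
    LSTM(128),
    Dense(2, activation="softmax"),
])
```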

We divided the collected chat messages into training (80%) and testing (20%) sets. The training sets thus included 87,535 (KO) and 1,057,156 (EN) chat messages, and the test sets included 21,884 (KO) and 264,289 (EN). We applied the synthetic minority over-sampling technique (SMOTE) for the machine learning classifiers [28]; moreover, we adjusted class weights in the cross-entropy function of the deep neural networks to handle class imbalance (Fig. 2) [29,30].
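A minimal sketch of the two imbalance remedies, assuming random placeholder features and labels: SMOTE oversamples the minority class for the classical classifiers, and inverse-frequency class weights (passed to Keras via model.fit's class_weight argument) reweight the cross-entropy loss for the neural networks.

```python
# Class-imbalance handling sketch; X_train / y_train are random placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X_train = rng.random((1000, 256))                 # placeholder embedded features
y_train = rng.choice([0, 1], 1000, p=[0.8, 0.2])  # imbalanced subscriber labels

# Machine learning classifiers: oversample the minority class with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Deep neural networks: weight the cross-entropy loss inversely to class
# frequency instead of resampling.
counts = np.bincount(y_train)
class_weight = {c: len(y_train) / (2 * n) for c, n in enumerate(counts)}
```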

Figure 2: Class distribution for English and Korean datasets

4 Results

4.1 Classification Models with English Data

The accuracy of the classifiers using English data (EN and EN2KO) is summarized in Tab. 1. Among classifiers using untranslated English (EN), Random Forest with word-sequence embedding showed the highest performance, with an accuracy of 89.35%. The STACKED-LSTM model with word-sequence embedding showed the highest accuracy (82.03%) among the models with English-to-Korean input data (EN2KO).

The average accuracy of the models with word-sequence embedding was slightly higher with untranslated data (EN: 78.79%) than with translated data (EN2KO: 73.30%). Similarly, in the case of BERT embedding, the models with untranslated data (EN: 80.17%) outperformed the models with translated data (EN2KO: 78.13%).

In the case of the Naïve Bayes classifier, performance was much better with BERT embedding than with word-sequence embedding, which achieved only approximately 25% (EN) and 27% (EN2KO) accuracy, respectively.

As shown on the left side of Fig. 3, the accuracy of classifiers with word-sequence embedding of the untranslated data (EN) was higher than with BERT embedding (Random Forest, XGBoost, CONV-LSTM, and STACKED-LSTM).

4.2 Classification Models with Korean Data

Tab. 2 presents the accuracy of classifiers using Korean data as input (KO and KO2EN). Random Forest with BERT embedding showed the highest performance for both untranslated and translated data (KO: 86.92%, KO2EN: 86.70%). The average accuracy of classifiers with word-sequence embedding was similar for untranslated and translated input data (KO: 73.74%, KO2EN: 72.11%). This aligns with the results of BERT embedding (KO: 80.30%, KO2EN: 79.33%).

Table 1: Classification metrics with English data

In addition, the accuracy of Naïve Bayes was much higher with BERT embedding (KO: 76.42%, KO2EN: 79.95%) than with word-sequence embedding (KO: 25.32%, KO2EN: 26.31%). The right side of Fig. 3 shows the accuracy of the classifiers trained on Korean data (KO, KO2EN). Overall, the classifiers with relatively high accuracy used different embedding methods.

5 Discussion

We aimed to validate whether machine-translated datasets are applicable to NLP tasks. We conducted binary classification on unrefined data (chat messages from the live-streaming platform Twitch) using several machine learning classifiers and neural networks. Moreover, we employed two different types of embedding: word-sequence embedding and the final layer of BERT. We chose both English (resource-rich) and Korean (resource-poor) for validation and named the datasets as follows: EN, KO, EN2KO, and KO2EN.

Figure 3: Classification accuracy for English data (EN, EN2KO) and Korean data (KO, KO2EN)

According to our results, the accuracy difference between EN and EN2KO was relatively high, ranging from 3% to 12%. For Korean data (KO and KO2EN), it ranged from 0% to 2%. The results therefore imply that translation from a low-resource language (e.g., Korean) into a high-resource language (e.g., English) yields higher performance than the reverse.

Among the classifiers showing high accuracy for English (EN and EN2KO), word-sequence embedding was predominantly employed. Meanwhile, for Korean (KO and KO2EN), there was no significant difference in dominance between word-sequence and BERT embedding. This suggests that the contextual approach of BERT does not effectively benefit the analysis of unrefined data.

In the case of classifiers yielding low accuracy, Naïve Bayes in the current study, BERT embedding showed much higher accuracy than word-sequence embedding in the multilingual analysis of unrefined content.

Overall, the evaluation of all classifiers implies that using machine translation from a resource-poor language (e.g., Korean) to a resource-rich language (e.g., English) for input data (KO2EN) does not significantly affect performance. This suggests the feasibility of translating resource-poor languages so that the tools of resource-rich languages can be used in further analysis.

Although we investigated the efficacy of machine translation from a low-resource language to a high-resource language, several limitations must be considered. First, our evaluation was limited to English and Korean; we may further investigate whether our approach produces comparable results in other languages. Second, given the rapid advancement of machine learning, more advanced classifiers may be considered. These limitations can be addressed in future work.

Table 2: Classification metrics with Korean data

Acknowledgement: Prof. Eunil Park thanks TwitchMetrics and the ICAN Program, an IITP grant funded by the Korea government (MSIT) (No. IITP-2020-0-01816).

Funding Statement: This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00358, AI·Big data based Cyber Security Orchestration and Automated Response Technology Development).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
