
New Generation Model of Word Vector Representation Based on CBOW or Skip-Gram

Computers, Materials & Continua, 2019, No. 7

Zeyu Xiong, Qiangqiang Shen, Yueshan Xiong, Yijie Wang and Weizi Li

Abstract: Word vector representations are widely used in natural language processing tasks. Most word vectors are generated from a probability model whose bag-of-words features have two major weaknesses: they lose the ordering of the words and they ignore the semantics of the words. Recently, the neural-network language models CBOW and Skip-Gram have been developed as continuous-space language models that represent words as high-dimensional real-valued vectors. These vector representations have demonstrated promising results in various NLP tasks because of their superiority in capturing syntactic and contextual regularities in language. In this paper, we propose a new strategy based on optimization over contiguous subsets of documents and a regression method for combining vectors; two new models for word vector learning, CBOW-OR and SkipGram-OR, are established. Experimental results show that for some word pairs the cosine distance obtained by the CBOW-OR (or SkipGram-OR) model is generally larger and more reasonable than that of CBOW (or Skip-Gram), that the vector spaces of Skip-Gram and SkipGram-OR keep the same structural property in Euclidean distance, and that the SkipGram-OR model achieves higher overall performance in retrieving related word pairs. Both the CBOW-OR and SkipGram-OR models are inherently parallel and can be expected to apply to large-scale information processing.

Keywords: Distributed word vector, continuous-space language model, hierarchical softmax.

1 Introduction

A word vector representation is a mathematical object associated with each word. Generating word representations is an essential task of natural language processing (NLP) [Bengio, Ducharme and Vincent (2001); Collobert and Weston (2008)]. Many NLP tasks, such as sentiment analysis and sentence or text classification, consider words as basic units. An important step is the introduction of continuous representations of words [Bengio, Ducharme, Vincent et al. (2003)]. When it comes to texts, one of the most commonly used fixed-length features is the bag-of-words. Traditionally, the default word representation regards a word as a one-hot vector whose size equals that of the vocabulary. Despite its popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they ignore the semantics of the words. To address related issues, Cao et al. use the histogram of the bag-of-words (BOW) model to determine the number of sub-images in an image that convey secret information, for the purpose of improving retrieval efficiency [Cao, Zhou, Sun et al. (2018)].
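
To make these two weaknesses concrete, the following sketch (with a hypothetical toy vocabulary, not taken from the paper) builds one-hot and bag-of-words vectors and shows that both word order and word similarity are lost:

```python
import numpy as np

# Toy vocabulary; indices are arbitrary (illustrative assumption only).
vocab = ["dog", "cat", "car", "chases", "the"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def bag_of_words(tokens):
    """Sum of one-hot vectors: counts only, so word order is discarded."""
    return sum(one_hot(t) for t in tokens)

# Ordering is lost: both sentences map to the same bag-of-words vector.
print(np.array_equal(bag_of_words("the dog chases the cat".split()),
                     bag_of_words("the cat chases the dog".split())))    # True

# Semantics are lost: "dog" is as (dis)similar to "cat" as to "car".
print(one_hot("dog") @ one_hot("cat"), one_hot("dog") @ one_hot("car"))  # 0.0 0.0
```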

Continuous-space language models [Holger (2007); Bengio, Schwenk, Senécal et al. (2006)] are neural-network language models in which words are represented as high-dimensional real-valued vectors. These vector representations have recently demonstrated promising results in various tasks [Collobert and Weston (2008); Bengio, Schwenk, Senécal et al. (2006)] due to their superiority in capturing syntactic and contextual regularities in language.

Recent works on learning vector representations of words use neural networks [Mnih and Hinton (2008); Turian, Ratinov and Bengio (2010); Mikolov, Sutskever, Chen et al. (2013)]. The outcome is that, after the neural network model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations. Distributed word representations have drawn increasing attention for their better performance in a wide range of natural language processing tasks, ranging from part-of-speech tagging [Santos and Zadrozny (2014)], named entity recognition [Turian, Ratinov and Bengio (2010)], parsing [Socher, Lin, Manning et al. (2011)], semantic role labeling [Collobert, Weston, Bottou et al. (2011)], phrase recognition [Socher, Lin, Manning et al. (2011)], sentiment analysis [Socher, Pennington, Huang et al. (2011)] and paraphrase detection [Socher, Huang, Pennin et al. (2011)], to machine translation [Cho, Merriënboer, Gulcehre et al. (2014)]. Kim et al. [Kim, Kim and Cho (2017)] proposed a method to create concepts by clustering word vectors generated from word2vec, and used the frequencies of these concept clusters to represent document vectors.

Distributed word vector learning depends mainly on the words in the vocabulary and on the corpus; corpora are generally collected in time order and grouped by topic or event. In this paper we divide the documents into several subsets and, in order to preserve accurate proximity information among the subsets, construct a combination model based on an optimization-and-regression strategy as an extension of distributed word vectors.

The rest of this paper is organized as follows: Section 2 introduces prior research related to the n-gram, CBOW and Skip-Gram models. Section 3 formally presents our approach to the integrated extension model for word vector representation; two novel models, CBOW-OR and SkipGram-OR, are proposed. Section 4 describes the experimental settings and results. Finally, we conclude the paper and discuss future work in Section 5.

2 Related works

2.1 n-gram model

The goal of statistical language modeling [Bengio, Ducharme, Vincent et al. (2003)] is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.

Curse of dimensionality: For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary $V$ of size 100,000, there are potentially $100000^{10} - 1 = 10^{50} - 1$ free parameters.

A statistical model of language can be represented by the conditional probability of the next word given all the previous ones, since

$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1}),$$

where $w_t$ is the $t$-th word and $w_i^j = (w_i, w_{i+1}, \dots, w_j)$ denotes the subsequence from the $i$-th to the $j$-th word.

Such statistical language models have already been found useful in many technological applications involving natural language, such as speech recognition, language translation, and information retrieval.

Following the above-mentioned models, n-gram models construct tables of conditional probabilities for the next word, one for each of a large number of contexts, i.e., combinations of the last $n-1$ words:

$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1}). \qquad (2)$$
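
As a concrete illustration of such a table, a minimal count-based trigram estimator (toy corpus, maximum-likelihood counts only, no smoothing) can be sketched as follows:

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Count occurrences of each next word given the previous n-1 words."""
    context_counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        context_counts[context][tokens[i]] += 1
    return context_counts

def cond_prob(context_counts, context, word):
    """P(word | context) by maximum likelihood; 0 for unseen contexts."""
    counts = context_counts.get(tuple(context))
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())

tokens = "the cat sat on the mat the cat sat on the hat".split()
table = train_ngram(tokens, n=3)
print(cond_prob(table, ["cat", "sat"], "on"))   # 1.0
print(cond_prob(table, ["on", "the"], "mat"))   # 0.5
```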

Bengio et al. [Bengio, Ducharme, Vincent et al. (2003)] proposed a neural network model to calculate formula (2); feature vectors of words are learned based on their probability of co-occurring in the same documents.

The training set is a sequence $w_1, w_2, \dots, w_T$ of words belonging to $V$, where the vocabulary $V$ is a large but finite set. The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ that gives high out-of-sample likelihood. The model is decomposed into the following two parts:

1) A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vector associated with each word in the vocabulary. In practice, $C$ is represented by a $|V| \times m$ matrix of free parameters.

2) A function $g$ that maps an input sequence of feature vectors of words in the context, $\big(C(w_{t-1}), \dots, C(w_{t-n+1})\big)$, to a conditional probability distribution over words in $V$ for the next word $w_t$. The output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$, as shown in Fig. 1.

Figure 1: Neural architecture $f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$

Training is achieved by finding $\theta$ that maximizes the penalized log-likelihood of the training corpus:

$$L = \frac{1}{T}\sum_{t} \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta),$$

where $R(\theta)$ is a regularization term: a weight decay penalty applied to the weights of the neural network and to the matrix $C$.

The softmax output layer is calculated as follows:

$$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}.$$

Here $y_i$ is the unnormalized log-probability for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$:

$$y = b + Wx + U\tanh(d + Hx),$$

where the hyperbolic tangent $\tanh$ is applied element by element, $W$ can optionally be set to zero (no direct connections), and $x$ is the word-features layer activation vector, i.e., the concatenation of the input word features from the matrix $C$:

$$x = \big(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\big).$$

After each training example, stochastic gradient ascent updates the parameters as

$$\theta \leftarrow \theta + \varepsilon\,\frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta},$$

where $\varepsilon$ is the learning rate.
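
A minimal numpy sketch of this forward pass, with random parameters and illustrative dimensions (not a trained model), makes the shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 1000, 50, 100, 4      # vocab size, embedding dim, hidden units, n-gram order

# Parameters: C is the |V| x m embedding matrix; b, W, U, d, H as in y = b + Wx + U tanh(d + Hx).
C = rng.normal(size=(V, m))
x_dim = (n - 1) * m
b = rng.normal(size=V); W = rng.normal(size=(V, x_dim))
d = rng.normal(size=h); H = rng.normal(size=(h, x_dim))
U = rng.normal(size=(V, h))

def next_word_distribution(context_ids):
    """P(w_t | w_{t-1}, ..., w_{t-n+1}) for a context of n-1 word indices."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenated feature vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)            # unnormalized log-probabilities
    e = np.exp(y - y.max())                           # numerically stable softmax
    return e / e.sum()

p = next_word_distribution([12, 7, 431])              # arbitrary word indices
print(p.shape, p.sum())                               # (1000,) ~1.0
```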

2.2 The word vector model

Mikolov et al. [Mikolov, Sutskever, Chen et al. (2013)] introduced the CBOW and Skip-Gram models. Both models include three layers: input, projection and output (Fig. 2 and Fig. 3). The training objective is to learn word vector representations that predict the nearby words well.

Figure 2: The CBOW model

Figure 3: The Skip-gram model

Given a sequence of training words $w_1, w_2, \dots, w_T$, the CBOW model is asked to maximize the following average log probability,

$$\frac{1}{T}\sum_{t=1}^{T} \log p\big(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\big),$$

while the Skip-Gram model is asked to maximize the average log probability

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),$$

where $c$ is the size of the training context. The basic Skip-Gram formulation of $p(w_{t+j} \mid w_t)$ is defined using the softmax function as follows:

$$p(w_{t+j} \mid w_t) = \frac{\exp\big({v'_{w_{t+j}}}^{\top} v_{w_t}\big)}{\sum_{i=1}^{W} \exp\big({v'_{w_i}}^{\top} v_{w_t}\big)}, \qquad (12)$$

where $v_{w_t}$ is the input vector representation of word $w_t$, and $v'_{w_{t+j}}$ and $v'_{w_i}$ are the output vector representations of words $w_{t+j}$ and $w_i$. $W$ is the number of words in the vocabulary.
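
Both objectives are implemented in standard toolkits such as gensim; a minimal sketch of training the two models (gensim 4.x parameter names assumed, tiny illustrative corpus, hierarchical softmax enabled) is:

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences (assumption, not the paper's data).
sentences = [
    "the dog chases the cat".split(),
    "the nurse works at the hospital".split(),
    "students study at the school".split(),
] * 50   # repeat so the toy model has something to learn from

# sg=0 -> CBOW, sg=1 -> Skip-Gram; hs=1 enables hierarchical softmax.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=0, hs=1, negative=0, epochs=20)
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=1, hs=1, negative=0, epochs=20)

print(cbow.wv["dog"].shape)               # (50,)
print(skip.wv.similarity("dog", "cat"))   # cosine similarity between the two vectors
```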

There are many methods for visualizing the relationships between word vector representations. Fig. 4 shows one way to display term relevancy: a two-dimensional PCA projection of the 1000-dimensional Skip-Gram vectors of countries and their capital cities [Mikolov, Sutskever, Chen et al. (2013)]. It illustrates the ability of the model [Mikolov, Sutskever, Chen et al. (2013)] to organize concepts automatically and learn relationships implicitly: no supervised information about what a capital city means is provided during training, yet the relationships between countries and capitals are recovered.

Figure 4: Country and capital vectors projected by PCA [Kim, Kim and Cho (2017)]
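
A two-dimensional projection like the one in Fig. 4 can be produced with scikit-learn's PCA; the sketch below uses random vectors as stand-ins for trained country/capital embeddings (the word list is illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random vectors stand in for trained word embeddings (illustrative assumption).
rng = np.random.default_rng(0)
words = ["china", "beijing", "japan", "tokyo", "france", "paris"]
vectors = np.stack([rng.normal(size=100) for _ in words])

coords = PCA(n_components=2).fit_transform(vectors)   # project 100-D vectors to 2-D
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.title("2-D PCA projection of word vectors")
plt.show()
```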

Fig. 5 shows another visualization method, which clusters related words using t-SNE. t-SNE is a technique that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding [Geoffrey and Roweis (2002)] that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities [Kim, Kim and Cho (2017)]. t-SNE [Maaten and Hinton (2008)] employs a Student t-distribution with one degree of freedom as the heavy-tailed distribution in the low-dimensional map.

As Fig. 5 shows, representing words in a continuous embedded space is very important. Various conventional machine learning and data mining techniques can be applied in this space to solve various text mining tasks [Cui, Shi and Chen (2016); Bansal, Gimpel and Livescu (2014); Xue, Fu and Zhan (2014); Cao and Wang (2015); Ren, Kiros and Zemel (2015)]. Fig. 5 shows an example of such an embedded space visualized by t-SNE [Cui, Shi and Chen (2016)]. The embedded words located in one circle represent the names of baseball players; the names of soccer players and the names of countries fall into separate clusters. Words with similar meanings are located close together, while words with different meanings are located far apart.
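
scikit-learn also provides a t-SNE implementation; a minimal sketch (random vectors stand in for trained embeddings) is:

```python
import numpy as np
from sklearn.manifold import TSNE

# X: one row per word vector; random data here stands in for trained embeddings.
X = np.random.default_rng(0).normal(size=(200, 100))

# Student-t kernel in the low-dimensional map; perplexity controls the neighborhood size.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(X)
print(coords.shape)   # (200, 2)
```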

2.2.1 Hierarchical softmax

Formula (12) is a full softmax model, which is impractical because the cost of computing it is proportional to $W$, which is often large ($10^5$-$10^7$ terms). The hierarchical softmax is a computationally efficient approximation of the full softmax. Its main advantage is that instead of evaluating $W$ output nodes in the neural network to obtain the probability distribution, it evaluates only about $\log_2(W)$ nodes.

Figure 5: Embedded space using t-SNE [Maaten and Hinton (2008); Kim, Kim and Cho (2017)]

The hierarchical probabilistic neural network language model was first proposed by Morin [Morin and Bengio (2005)]. Mnih and Hinton [Mnih and Hinton (2008)] explored a number of methods for constructing the tree structure and improved both the training time and the resulting model accuracy. Mikolov et al. [Mikolov, Sutskever, Chen et al. (2013); Mikolov (2012)] used a binary Huffman tree, as it assigns short codes to frequent words, which results in fast training.

The hierarchical softmax uses a binary tree representation of the output layer with the $W$ words as its leaves. For each word $w$ located at a leaf, let $n(w,j)$ be the $j$-th node on the path from the root to $w$, and let $Len(w)$ be the length of this path, so that $n(w,1) = \text{root}$ and $n(w, Len(w)) = w$. Let $child(n)$ be an arbitrary fixed child of inner node $n$, and let $\langle x \rangle$ be 1 if $x$ is true and $-1$ otherwise; then the hierarchical softmax is defined as follows:

$$p(w \mid w_I) = \prod_{j=1}^{Len(w)-1} \sigma\Big(\big\langle n(w,j+1) = child(n(w,j))\big\rangle \cdot {v'_{n(w,j)}}^{\top} v_{w_I}\Big),$$

where $\sigma(x) = 1/(1+\exp(-x))$ and $v'_{n(w,j)}$ is the vector representation of inner node $n(w,j)$. The cost of computing $\log p(w_O \mid w_I)$ and $\nabla \log p(w_O \mid w_I)$ is proportional to $Len(w_O)$, which on average is no greater than $\log W$.
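
A minimal sketch of this path-product computation, with a toy path and random vectors (the node indices and signs are illustrative assumptions, not the Huffman tree that word2vec actually builds), is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(path_nodes, path_signs, inner_vectors, v_in):
    """
    p(w | w_I) = prod_j sigma(sign_j * v'_{n_j} . v_{w_I})
    path_nodes: indices of inner nodes from the root towards the leaf of w
    path_signs: +1 if the path turns to the designated child at that node, else -1
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * inner_vectors[node] @ v_in)
    return p

rng = np.random.default_rng(0)
dim, n_inner = 50, 7                                   # toy sizes
inner_vectors = rng.normal(size=(n_inner, dim)) * 0.1  # vectors of the inner tree nodes
v_in = rng.normal(size=dim) * 0.1                      # input vector of word w_I

# Example: a word reached via inner nodes 0, 2, 5 with turns +1, -1, +1.
print(hierarchical_softmax_prob([0, 2, 5], [+1, -1, +1], inner_vectors, v_in))
```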

3 New learning model of word vector representation

We first divide the training documents into several related subsets, where relatedness may be understood as collecting documents that share contextual and semantic features.

Let $Cp_1, \dots, Cp_n$ be the $n$ subsets and $V_1(w), \dots, V_n(w)$ be the corresponding distributed word vectors for word $w$ generated by the CBOW or Skip-Gram model on each subset. Let SAMT be a sampling set of topic words in the vocabulary. In order to preserve accurate proximity information among the subsets, we consider a regression model as an extension of the distributed word vectors. The new learning model for distributed word vector representation is described by the following optimization problem and regression strategy.

Step 2. Reconstruct the word vector for each word w.

Fig.6 shows the integrated extension model for distributed word vector representation.

Figure 6: The new generation model of word vector representation
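
As an illustration of the overall pipeline only, the following sketch trains Skip-Gram vectors on each subset and combines them through a simple least-squares fit over the topic sample set SAMT; this combination rule is an illustrative stand-in, not the exact optimization and regression steps defined above:

```python
import numpy as np
from gensim.models import Word2Vec

def train_subset_vectors(subsets, dim=100):
    """Train one Skip-Gram model per document subset (shared preprocessing assumed)."""
    return [Word2Vec(s, vector_size=dim, window=5, min_count=1, sg=1, hs=1, negative=0)
            for s in subsets]

def combine_vectors(models, samt, word):
    """
    Illustrative combination: fit weights a_1..a_n by least squares so that the weighted
    sum of subset vectors best reproduces the mean subset vector on the SAMT words,
    then apply those weights to every word.  (A stand-in for Step 1 / Step 2.)
    """
    samt = [w for w in samt if all(w in m.wv for m in models)]
    A = np.stack([np.concatenate([m.wv[w] for w in samt]) for m in models], axis=1)
    target = A.mean(axis=1)
    weights, *_ = np.linalg.lstsq(A, target, rcond=None)
    return sum(a * m.wv[word] for a, m in zip(weights, models))

# subsets = [tokenized_sentences_1, tokenized_sentences_2, tokenized_sentences_3]
# models = train_subset_vectors(subsets)
# vec = combine_vectors(models, samt=["cat", "dog", "school", "nurse"], word="hospital")
```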

4 Experiments

4.1 Task description

Our task is to develop a new method for generating word vectors and to verify its efficiency on an actual document dataset. The new generation model of word vectors includes two sub-models: one is based on the combination of CBOW [Mikolov (2012)] and our regression-with-optimization strategy, which we denote the CBOW-OR model; the other is based on the combination of Skip-Gram [Mikolov (2012)] and our regression-with-optimization strategy, which we denote the SkipGram-OR model. In the following experiments, we divide the documents into three sub-documents and, on the premise of sharing the same vocabulary, train word vectors of the same dimension separately on the three subsets. The optimization and regression methods are then used to integrate the three vectors into a single vector, which is regarded as the word vector of the word.

4.2 Dataset description

The dataset we use is text8, which is downloaded with the Google word2vec toolkit. The size of the corpus is 100 MB and the vocabulary contains 71,291 words; some auxiliary words (e.g., a, the, is) and some rare words have been removed. In total, the documents contain 4,406,976 words. We divide the documents into three parts of sizes 36.6 MB, 36.6 MB, and 26.8 MB, named text8_1, text8_2, and text8_3. The same word is trained separately on the three sub-documents to obtain three vectors, and then a single word vector is obtained by using the optimization and regression described above.
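
A simple way to produce the three sub-corpora, assuming text8 has already been downloaded as a single whitespace-separated file in the working directory and rounding the split points to word boundaries, is:

```python
# Split the single-line text8 corpus into three parts of roughly the sizes used above.
sizes = [36_600_000, 36_600_000, None]   # bytes for text8_1, text8_2; the rest goes to text8_3

with open("text8", "r") as f:
    data = f.read()

start = 0
for i, size in enumerate(sizes, start=1):
    # Cut on a word boundary so no token is split across files.
    end = len(data) if size is None else data.rfind(" ", 0, start + size)
    with open(f"text8_{i}", "w") as out:
        out.write(data[start:end])
    start = end + 1
```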

4.3 Evaluation mode

In order to test the effect of our method, we design two sets of comparative experiments: one measures the Euclidean distance between the vectors of the same word in two different vector spaces of the same dimension; the other measures the cosine distance between two word vectors within each vector space.

Three groups of experiments are conducted with different SAMT sets in Eq. (14): for Tabs. 1 and 3, SAMT = {cat, China, computer, dog, exam, hospital, Japan, nurse, school, software}; for Tabs. 4 and 6, SAMT = {car, children, country, driver, hospital, nation, nurse, parent, school, students}; for Tabs. 7 and 9, SAMT = {army, chef, friend, fruit, gentlemen, ladies, partner, restaurant, soldier, vegetables}. Four kinds of word vector spaces are generated by CBOW, Skip-Gram, CBOW-OR and SkipGram-OR, respectively; the last two models are proposed in this paper.

In Tabs. 1, 2, 4, 5, 7 and 8, we compare the Euclidean distance between the vectors learned for the same word in different vector spaces. A vector can be represented as a point in vector space, and since all the vector spaces share the same dimension, we can compare the Euclidean distance of a word's vectors across spaces. We test the relation between the Euclidean distances of multiple key words in any two different vector spaces in order to test the structural consistency of the different vector spaces.

Word pairs that are semantically related in an article, i.e., words with a high probability of co-occurring, such as cat and dog, hospital and nurse, or school and students, should have trained vectors with a relatively large cosine distance. So in Tabs. 3, 6 and 9 we compare the cosine distance between the vector pairs of the same set of word pairs, obtained under the different learning mechanisms, as a criterion for evaluating the word vectors.
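
Both measures are straightforward to compute; the sketch below uses random vectors as stand-ins for two trained vector spaces over the same vocabulary:

```python
import numpy as np

def euclidean_distance_across_spaces(space_a, space_b, word):
    """Distance between the vectors of the same word learned in two different spaces."""
    return float(np.linalg.norm(space_a[word] - space_b[word]))

def cosine_distance(space, w1, w2):
    """Cosine of the angle between two word vectors (larger means more related)."""
    a, b = space[w1], space[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: random vectors stand in for the trained spaces (illustrative only).
rng = np.random.default_rng(0)
words = ["cat", "dog", "school", "students"]
space_a = {w: rng.normal(size=100) for w in words}
space_b = {w: rng.normal(size=100) for w in words}

print(euclidean_distance_across_spaces(space_a, space_b, "cat"))
print(cosine_distance(space_a, "school", "students"))
```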

Tab. 1 to Tab. 9 show some interesting properties. Tabs. 1 and 2, Tabs. 4 and 5, and Tabs. 7 and 8 show that the word vector spaces of Skip-Gram and SkipGram-OR keep the same structural property in Euclidean distance. Tab. 3 shows that the cosine distance between two words is more accurate with the SkipGram-OR model than with CBOW, CBOW-OR or Skip-Gram; meanwhile, Tabs. 6 and 9 show that the cosine distance between two words is more accurate with the CBOW-OR model than with CBOW, SkipGram-OR or Skip-Gram.

Table 1: Euclidean distance of the vector learned from the same word under different methods

Table 2: Euclidean distance of the vector learned from the same word under different methods

Table 3: The cosine distance of the two word vectors

Table 4: Euclidean distance of the vector learned from the same word under different methods

Table 5: Euclidean distance of the vector learned from the same word under different methods

Table 6: The cosine distance of the two word vectors

Table 7: Euclidean distance of the vector learned from the same word under different methods

Table 8: Euclidean distance of the vector learned from the same word under different methods

Table 9: The cosine distance of the two word vectors

Combining Tabs. 3, 6 and 9 gives a synopsis of 15 different word pairs across the four models. Fig. 7 shows that the SkipGram-OR model achieves the highest overall performance for retrieving related word pairs.

Figure 7: Comparative results of the four models for 15 different word pairs

5 Conclusions

We develop two kinds of models for generating word vectors: CBOW-OR and SkipGram-OR. The key strategy of these two models is to use optimization over contiguous training-document subsets and a regression method to combine the vectors. CBOW-OR and SkipGram-OR can be performed in parallel. Experimental results show that for some word pairs the cosine distance obtained by the CBOW-OR or SkipGram-OR model is generally larger and more reasonable than that of CBOW and Skip-Gram.

We also obtained encouraging results concerning consistency: the Euclidean distances between the vectors of the same word learned under different mechanisms are close, so the vector spaces obtained by the different models show a degree of consistency. That is, the Euclidean distance of a word's vectors between any two vector spaces is approximately the same across words. In particular, we find that the vector spaces of Skip-Gram and SkipGram-OR keep the same structural property in Euclidean distance.

Given their inherent parallelism in generating word vectors and the semantic validity of the resulting word pairs, the proposed models can be expected to apply to large-scale information processing.

Acknowledgement: The authors would like to thank all anonymous reviewers for their suggestions and feedback. This work was supported by the National Natural Science Foundation of China (Nos. 61379103, 61379052), the National Key Research and Development Program (2016YFB1000101), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026), and the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).
