
Learning DALTS for cross-modal retrieval


Zheng Yu, Wenmin Wang

School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, People’s Republic of China

Abstract: Cross-modal retrieval has recently been proposed to find an appropriate subspace in which the similarity across different modalities, such as image and text, can be measured directly. In this study, unlike most existing works, the authors propose a novel model for cross-modal retrieval based on a domain-adaptive limited text space (DALTS) rather than a common space or an image space. Experimental results on three widely used datasets, Flickr8K, Flickr30K and Microsoft Common Objects in Context (MSCOCO), show that the proposed method, dubbed DALTS, is able to learn superior text space features which effectively capture the information necessary for cross-modal retrieval. Meanwhile, DALTS achieves promising improvements in accuracy for cross-modal retrieval compared with the current state-of-the-art methods.

1 Introduction

The task of cross-modal retrieval has recently attracted increasing attention. Given an image (text) query, the aim is to retrieve the most relevant text (image). However, multimedia data are intrinsically heterogeneous, which makes it hard to measure cross-modal similarity directly. The main challenge in cross-modal retrieval is therefore how to embed heterogeneous multimedia data into a homogeneous space so that their similarity can be measured directly. More specifically, this challenge consists of the following two sub-problems.

The first problem is how to learn efficient features for multimedia data; such features have gradually evolved from hand-crafted to deep representations. For images, given the great success of convolutional neural networks (CNNs), Sharif Razavian et al. [1] argue that a pre-trained deep CNN is an effective image feature extractor for many computer vision tasks, including cross-modal retrieval. However, do off-the-shelf CNNs provide sufficient information for cross-modal retrieval? Most existing works employ off-the-shelf CNNs such as VGGNet [2] and ResNet [3] to extract image features. These models are usually pre-trained for classification and therefore only need to consider the category information contained in an image. As a result, they inevitably miss detailed cues, such as how the objects relate to each other, their attributes and the activities they are involved in, which may play an indispensable role in cross-modal retrieval. As shown in Fig. 1, given two different input images, a pre-trained CNN can only recognise the objects contained in each image, which are similar to each other (e.g. 'man', 'surfboard' and 'wave'). However, it tends to miss crucial cues that are entirely dissimilar between the two images, such as how the man interacts with the wave. With the recent progress in image captioning, we can obtain sensible descriptive sentences for an input image, containing both nouns and verbs. That is, image captioning models not only recognise the objects in the image (nouns) but also preserve rich relation information among different objects (verbs). Therefore, we adopt image captioning models to make up for the shortcomings of traditional CNN features.

As for text, Word2Vec [4], latent Dirichlet allocation (LDA) [5] and Fisher vectors (FVs) [6] are all popular choices for text representation. However, they are all pre-trained on specific corpora that are quite different from the datasets used in cross-modal retrieval. Therefore, instead of using off-the-shelf models, we employ a recurrent neural network (RNN) to learn text features from scratch.

Given efficient features for image and text, the second problem is how to find a homogeneous space. Since this paper focuses only on retrieval between image and text, cross-modal retrieval can be achieved in a common space [7–18], a text space [19–21] or an image space [22]. Considering the way people perform cross-modal retrieval, different modalities are processed asymmetrically in the brain. This relates to the well-known semantic gap [23], which reflects the fact that textual features are closer to human understanding (and language) than pixel-based features [19]. Therefore, textual features provide more accurate information than pixel-based features during retrieval. Moreover, it is more straightforward for the brain to understand text than images, because natural language is the result of high-level abstraction of image content. Accordingly, we propose a feature embedding network to explore the possibility of performing cross-modal retrieval in a text space.

The text space is highly discriminative. If a linear classifier is trained on top of the text space to predict whether a vector comes from an image or a sentence, it can achieve nearly 100% accuracy. That is, we can fit a hyperplane in the text space that almost perfectly separates images from sentences. This property violates the original goal of finding a homogeneous space. Thus, since the source image space and the target text space can be regarded as two different domains, we propose a domain classifier to further minimise the discrepancy between the features from different modalities, similar to the idea of domain adaptation in [24]. That is, the domain classifier tries to discriminate between the source domain (the original image space) and the target domain (the text space) during training, while the feature embedding network tries to learn domain-invariant features and confuse the domain classifier. Therefore, an additional adversarial loss is back-propagated to the feature embedding network in order to guide the network to learn domain-invariant text space features for image and text.

The text space is essentially a vector space spanned by a set of base vectors, which correspond to different Chinese characters or English words. For Chinese, there is no exact count of characters, but the number is close to 100,000. Meanwhile, the emergence of numerous new words every year makes the size of the text space continue to grow. A similar phenomenon appears in other languages such as English: according to incomplete statistics, the number of English words has exceeded 1,000,000 and is still growing by thousands every year. Therefore, natural language is inherently divergent, and it is almost impossible to learn a complete and unlimited text space.

Fig. 1 Illustration of the problem of using pre-trained CNNs to extract image features. Such classification models extract similar features for two images with different interactions among objects ('jumping off' versus 'surfing, paddling toward')

However, in most cases, people only need to remember some of the commonly used Chinese characters or English words to meet their daily needs. For example, many English linguists argue that about 3650 commonly used English words suffice for more than 95% of everyday expression and communication. The 'National Dictionary of Modern Chinese', published by the National Board of Education in November 1987, proposes that the number of commonly used words in modern Chinese is 2500, accounting for more than 99% of daily Chinese usage.

Therefore, this paper ensures the convergence of the proposed algorithm by learning a limited text space (LTS) with a fixed vocabulary. The expressive ability of the LTS is affected by the size of the vocabulary: the larger the vocabulary, the stronger the ability. However, blindly increasing the number of words does not improve retrieval performance but does increase the time and space complexity of the algorithm.

Our core contributions are summarised as follows:

• We propose a novel model, domain-adaptive LTS (DALTS), which performs cross-modal retrieval in a text space and thereby better imitates human behaviour. Moreover, we give a brief explanation of the LTS.

• In contrast to the commonly used pre-trained features for both image and text, DALTS is able to learn task-specific features.

• To further minimise the discrepancy between the source domain (the original image space) and the target domain (the LTS), the idea of domain adaptation is applied to the model to learn a DALTS.

The rest of this paper is organised as follows. We review related work on cross-modal retrieval in Section 2. In Section 3, we present our model and describe it in detail. To demonstrate the effectiveness of DALTS, Sections 4 and 5 report extensive experiments on three benchmark datasets. Finally, we conclude the paper in Section 6.

2 Related work

2.1 Multi-modal feature learning

For cross-modal retrieval, most existing works directly use off-the-shelf features to represent images [8, 11, 14–17]. However, the pre-trained features are likely to leave out crucial information that may be key to cross-modal retrieval. Recently, image captioning models [25–28] have been used to learn task-specific features that provide more information useful for cross-modal retrieval. Given an input image, before decoding it into a descriptive sentence, image captioning models first map the image into a text space. Thus, the text space feature of an image contains not only category information but also rich relation information among different objects. Typically, the multi-modal RNN (m-RNN) [25], neural image caption (NIC) [26], deep visual-semantic alignments [27] and unifying visual-semantic embeddings (VSEs) [28] are all representative image captioning methods.

As for text, typical methods such as Word2Vec [4], LDA [5] and FV [6] are similarly pre-trained on specific corpora that differ substantially from the benchmark datasets in cross-modal retrieval. Recently, with the great progress in machine translation [29], the RNN has been found to be a powerful tool for language modelling that can be trained from scratch, making it more suitable for cross-modal retrieval.

2.2 Homogeneous space learning

The mainstream approach tries to learn a common space via affine transformations on both the image and text sides. Typically, canonical correlation analysis [15] learns a common space by maximising the correlations between relevant image and text features. Karpathy et al. [10] break both image and text down into fragments and embed them into a common multi-modal space that exploits fine-grained alignments between image and text. Niu et al. [13] address the problem of dense VSE, which maps not only full sentences and whole images but also phrases within sentences and salient regions within images into a multi-modal embedding space. Wang et al. [16] propose deep structure-preserving embeddings (DSPEs) for image and text, which extend the pairwise ranking loss to model intra-modal relationships and adopt a sophisticated data sampling scheme. Nam et al. [12] propose dual attention networks (DANs), which jointly leverage visual and textual attention mechanisms to capture the fine-grained interplay between image and text. In addition to a common space, in the DeViSE model developed by Frome et al. [20], a text space is formed by a pre-trained Word2Vec model. The text space vector of an image is obtained as a convex combination of the word embedding vectors of the visual labels predicted to be most relevant to the image. However, the visual labels only reflect the objects contained in an image and ignore how these objects relate to each other, as well as their attributes and the activities they are involved in. Thus, the Word2Vec space is not an effective text space for cross-modal retrieval. Recently, the distributional visual embedding space provided by Word2VisualVec [22] has been found to be an effective space for cross-modal retrieval, obtained by embedding text into an image space.

2.3 Domain adaptation

In the absence of labelled data for a certain task, domain adaptation often provides an attractive option when labelled data of a similar nature but from a different domain are available. Ganin and Lempitsky [24] propose an approach to domain adaptation in deep architectures that learns features which are discriminative for the main learning task on the source domain and invariant with respect to the shift between domains, achieved by a domain classifier and a simple gradient reversal layer. Inspired by generative adversarial nets [30], an alternative adversarial training strategy can be used instead of the gradient reversal layer. Recently, Park and Im [14] learn a common space for cross-modal retrieval based on domain adaptation and achieve competitive experimental results.

3 Proposed method

The general framework of DALTS is shown in Fig. 2a, which contains a feature extraction network, a feature embedding network and a domain classifier.

3.1 Feature extraction

Image representation: The network for image feature extraction consists of two branches: VGGNet, which is pre-trained for image classification, and NIC [26], which is pre-trained for image captioning. As mentioned earlier, VGGNet tends to capture rich category information but leaves out detailed cues needed for cross-modal retrieval. Conversely, NIC has an innate advantage in mining rich relation information among the different objects contained in an image. The two are therefore complementary for cross-modal retrieval. Accordingly, we design the network for image feature extraction as a combination of these two separate models.

As shown in Fig. 3a, the blue and green dashed boxes represent NIC and VGGNet, respectively, which are pre-trained on the image captioning and image classification tasks. Given an input image, a forward pass of the pre-trained VGGNet produces a 4096-dimensional feature I_VGG. As for NIC, in order to avoid information loss during decoding, we regard the 512-dimensional output of the image embedding layer as the image feature I_NIC. Finally, we denote the 4608-dimensional concatenation of I_VGG and I_NIC as I_Concat, which serves as the feature for the input image. In practice, we tried a further step of fine-tuning the parameters of NIC, but no significant gains were observed, so we decided to leave them fixed.
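For concreteness, the following is a minimal sketch of this feature extraction step, assuming a torchvision VGG-16 backbone and a hypothetical nic_image_embedding module standing in for the pre-trained NIC image embedding layer (the names are illustrative, not the authors' code):

```python
import torch
import torchvision.models as models

# Pre-trained VGG-16; keep the 4096-d output of the penultimate fully connected layer.
vgg = models.vgg16(pretrained=True)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

def extract_image_feature(image_batch, nic_image_embedding):
    """image_batch: (N, 3, 224, 224) tensor.
    nic_image_embedding: hypothetical module returning the 512-d NIC image embedding."""
    with torch.no_grad():                         # both branches stay fixed during training
        i_vgg = vgg(image_batch)                  # (N, 4096) category-oriented feature
        i_nic = nic_image_embedding(image_batch)  # (N, 512) relation-oriented feature
    return torch.cat([i_vgg, i_nic], dim=1)       # (N, 4608) concatenated feature I_Concat
```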

Text representation: We employ a long short-term memory (LSTM) network to learn d-dimensional text features, as shown in Fig. 3b. Here, d is also the dimensionality of the LTS. Let S = (s_0, s_1, ..., s_T), t ∈ {0, ..., T}, be an input text with length T, where each word is represented as a one-hot vector s_t whose dimension equals the size of the dictionary. Note that s_T denotes a special end word which designates the end of the text. Before being fed into the LSTM, each s_t is embedded into a denser space as x_t = W_e s_t, where W_e is a word embedding matrix. We then feed the vectors x_t into the LSTM, which takes the form

Fig. 2 Overview of DALTS. The overall loss function contains the traditional pairwise ranking loss (the blue dashed lines) and the additional adversarial loss (the brown dashed lines)

Fig. 3 Detailed illustration of the feature extraction network
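Assuming the standard LSTM formulation used in NIC [26], the recurrence is

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t)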

where i_t, f_t, o_t, c_t and h_t denote the input gate, forget gate, output gate, memory cell and hidden state of the LSTM at time step t, respectively. Here, x_t is the input word embedding at time step t and h_{t-1} is the hidden state of the LSTM at the previous time step t - 1. σ denotes the sigmoid function and ⊙ indicates element-wise multiplication. W, U and b represent the trainable parameters of the LSTM. Thus, the feature for S can be obtained from the hidden state of the LSTM at time T, that is, h_T.

3.2 Domain classifier

We adopt the concept from [14, 24]. Instead of using the gradient reversal layer, we advocate an adversarial training strategy by designing a domain classifier. Specifically, the domain classifier is a simple feed-forward neural network with three fully connected layers, as shown in Fig. 2c. Given image and text features in the LTS, the domain classifier tries to predict the domain label for each input, for example, [0, 1] for image features and [1, 0] for text features. During training, we minimise the cross-entropy loss for better domain discrimination with respect to the parameters θ_d
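A plausible form of this objective (a reconstruction assuming the usual cross-entropy over the softmax output, summed over the image and text features in a batch) is

L_d(θ_d) = − Σ_i y_i^T log D_θd(x_i)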

where x_i and y_i represent the ith input feature and its corresponding domain label, respectively. The mapping function D_θd(·) predicts the domain label given an input feature x_i.

3.3 Feature embedding

The feature embedding network aims to learn an LTS with parameters θ_f. As shown in Fig. 2b, we design two mapping functions, denoted f(·) and g(·), to transform I_VGG and I_NIC into d-dimensional text space features, respectively. Similar to I_VGG and I_NIC, the two embedded features are complementary to each other as well. Therefore, we add a fusion layer on top to combine the two features by summation. The whole process can be defined as
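A plausible instantiation, based on the layer sizes given in Section 4.1 (the exact arrangement of non-linearities, dropout and batch normalisation is an assumption), is

I_final = f(I_VGG) + g(I_NIC),  with  f(I_VGG) = W_2 ReLU(W_1 I_VGG)  and  g(I_NIC) = V I_NIC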

where I_final denotes the LTS features for an input image. Note that the procedure of extracting text features from scratch with the LSTM is equivalent to embedding the text into the LTS. Therefore, θ_f also involves the parameters of the LSTM.

After embedding image and text into the LTS, the next step is to compare their similarities. We define a scoring function s(v, t) = v · t, where v and t represent image and text features, respectively. To make s equivalent to cosine similarity, v and t are first scaled to have unit norm by an l2-normalisation layer.

Then, two kinds of loss functions are exploited to train the embedding network: the pairwise ranking loss and the adversarial loss L_d. The pairwise ranking loss is widely adopted for cross-modal retrieval.

Let θ_f denote all the parameters to be learnt. We optimise the following pairwise ranking loss:
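A standard bidirectional hinge formulation consistent with the description below (the exact form is an assumption; the margin α is set to 0.3 in Section 4.1) is

L_rank(θ_f) = Σ_k max(0, α − s(v, t) + s(v, t_k)) + Σ_k max(0, α − s(t, v) + s(t, v_k))

summed over all matching image-text pairs (v, t) in a training batch.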

where t_k is a negative text for a given image v and v_k is a negative image for a given text t. To obtain the non-matching terms, we choose them randomly from the training set and re-sample every epoch.

Meanwhile, the adversarial loss L_d is back-propagated to the feature embedding network simultaneously. Since the feature embedding network tries to maximise L_d in order to learn domain-invariant features, the optimisation goals of these two loss functions are opposite. Therefore, the overall loss function for the feature embedding network can be defined as
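Given that the embedding network minimises the ranking loss while maximising L_d, the overall objective presumably takes the form

L(θ_f) = L_rank(θ_f) − λ L_d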

where λ is an adaptation factor varying from 0 to 1 in order to suppress noisy signals from the domain classifier at the early stages of training. Following Ganin and Lempitsky [24], we update λ by the following equation:
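The schedule used in [24] is

λ = 2 / (1 + exp(−10·p)) − 1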

where p is the fraction of the current training step over the maximum number of training steps.

3.4 Training procedure

The training procedure can be divided into five stages. We denote the parameters of the domain classifier and the feature embedding network as θ_d and θ_f, respectively.

Stage 1: During the first training stage, we pre-train NIC for image captioning using the benchmark datasets in cross-modal retrieval, such as Flickr30K and Microsoft Common Objects in Context (MSCOCO). After this training completes, we can extract efficient image features.

Stage 2: After extracting features for all images, we start stage 2 to learn the LTS. Given the loss function L for the feature embedding network, we fix θ_d and update θ_f by the following rule:
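Presumably this is the standard gradient descent update, as in [24]:

θ_f ← θ_f − μ ∂L/∂θ_f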

where μ is the learning rate.

Stage 3: After stage 2, we start stage 3 to enhance the discriminating ability of the domain classifier. Given the loss function L_d for the domain classifier, we fix θ_f and update θ_d by the following rule:
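Again, assuming a standard gradient descent step:

θ_d ← θ_d − μ ∂L_d/∂θ_d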

where μ is the learning rate.

Stage 4: For each training batch, repeat stage 2 and stage 3 until DALTS converges (a code sketch of this alternation is given after the stage list).

Stage 5: We can further fine-tune the parameters of NIC.
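To make the alternating schedule of stages 2–4 concrete, here is a minimal PyTorch-style sketch. The names embed_net (feature embedding network), domain_clf (domain classifier) and ranking_loss are illustrative placeholders rather than the authors' code, and the domain-label convention is arbitrary:

```python
import torch
import torch.nn.functional as F

def train_epoch(loader, embed_net, domain_clf, ranking_loss, opt_f, opt_d, lam):
    """One epoch of the alternating schedule (stages 2 and 3, repeated per batch).
    opt_f updates only embed_net parameters; opt_d updates only domain_clf parameters."""
    for img_feat, txt_tokens in loader:
        v, t = embed_net(img_feat, txt_tokens)      # d-dim LTS features for images and texts

        # Domain labels: images -> 0, texts -> 1 (an arbitrary convention for this sketch).
        x = torch.cat([v, t], dim=0)
        y = torch.cat([torch.zeros(len(v)), torch.ones(len(t))]).long()

        # Stage 2: update the feature embedding network with the domain classifier fixed.
        l_d = F.cross_entropy(domain_clf(x), y)
        loss_f = ranking_loss(v, t) - lam * l_d     # adversarial term encourages domain confusion
        opt_f.zero_grad()
        loss_f.backward()
        opt_f.step()

        # Stage 3: update the domain classifier with the embedding network fixed.
        x_det = torch.cat([v.detach(), t.detach()], dim=0)
        loss_d = F.cross_entropy(domain_clf(x_det), y)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
```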

4 Experiments

In this section, we perform extensive experiments on the Flickr8K [31], Flickr30K [32] and MSCOCO [33] datasets, following the dataset splits in [10]. Evaluation is performed using Recall@K (with K = 1, 5, 10), which measures the percentage of queries for which a correct text (image) is ranked within the top-K retrieved results.
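For reference, Recall@K can be computed from an image-text similarity matrix as in the following sketch; this is a generic implementation assuming one ground-truth text per image (the datasets actually provide five captions per image), not the authors' evaluation code:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (N, N) similarity matrix where sim[i, j] scores image i against text j
    and the ground-truth pair is (i, i). Returns image-to-text Recall@K in percent."""
    n = sim.shape[0]
    ranks = np.empty(n, dtype=int)
    for i in range(n):
        order = np.argsort(-sim[i])              # texts sorted by decreasing similarity
        ranks[i] = np.where(order == i)[0][0]    # rank position of the ground-truth text
    return {k: 100.0 * np.mean(ranks < k) for k in ks}
```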

4.1 Implementation details

For image feature extraction, we first pre-train NIC on the image captioning task using Flickr30K and MSCOCO and fix the parameters of NIC and VGGNet during the whole training procedure. In practice, we tried a further step of fine-tuning the parameters of NIC, but no significant gains were observed, so we decided to leave them fixed. More specifically, we first rescale the image to 256×256 and then use a single centre crop of size 224×224 to compute the 1-crop VGG image feature. For text feature extraction, we set the dimensionality d of the LTS to 1024. Meanwhile, the dimensionality of the word embedding is set to 1024 as well.

The feature embedding network contains two functions, f(·) and g(·). For f(·), W_1 is a 4096×2048 matrix and W_2 is a 2048×1024 matrix. Between these layers, the rectified linear unit (ReLU) is adopted as the activation function, and a dropout layer with probability 0.5 is added right after the ReLU in order to reduce overfitting. For g(·), V is a 512×1024 matrix. The margin is set to 0.3 in all our experiments. To accelerate training and make gradient updates more stable, we apply batch normalisation right after each mapping function.
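A minimal PyTorch sketch of this embedding network follows; the exact placement of dropout and batch normalisation is inferred from the description above (an assumption, not confirmed code):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureEmbedding(nn.Module):
    """f(.) maps the 4096-d VGG feature and g(.) maps the 512-d NIC feature into the
    d-dimensional LTS; the fused feature is their sum, followed by l2 normalisation."""
    def __init__(self, d=1024):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, d), nn.BatchNorm1d(d),
        )
        self.g = nn.Sequential(
            nn.Linear(512, d), nn.BatchNorm1d(d),
        )

    def forward(self, i_vgg, i_nic):
        fused = self.f(i_vgg) + self.g(i_nic)    # fusion by summation
        return F.normalize(fused, dim=1)         # unit norm so the dot product is cosine similarity
```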

We employ a three-layer feed-forward neural network activated by ReLU as the domain classifier. The output dimensions of the intermediate layers D_1 and D_2 are set to [512, 512]. A softmax layer is added right after the last layer D_2.
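A corresponding sketch of the domain classifier; the final 2-way output layer is an assumption, since the text only specifies the intermediate dimensions [512, 512] followed by a softmax:

```python
import torch.nn as nn

class DomainClassifier(nn.Module):
    """Three fully connected layers with ReLU activations; outputs 2-way domain logits."""
    def __init__(self, d=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 512), nn.ReLU(),    # D1
            nn.Linear(512, 512), nn.ReLU(),  # D2
            nn.Linear(512, 2),               # assumed 2-way output (image vs. text)
        )

    def forward(self, x):
        # Returns logits; the softmax is applied inside the cross-entropy loss during training.
        return self.net(x)
```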

During training, we adopt the Adam optimiser with a learning rate of 0.0002 for the first 15 epochs and then decay the learning rate by a factor of 0.1 for the remaining 15 epochs. We use a mini-batch size of 128 in all our experiments.

4.2 Comparison with the state of the art

In this section, we report experimental results for cross-modal retrieval, including image-to-text retrieval (Img2Text) and text-to-image retrieval (Text2Img), on the benchmark Flickr8K, Flickr30K and MSCOCO datasets.

For Flickr8K, experimental results are presented in Table 1. Comparing DALTS with the current state-of-the-art method, the hierarchical multimodal LSTM (HMLSTM) [13], we observe that our model achieves new state-of-the-art results on image-to-text retrieval. However, it performs slightly worse than HMLSTM on text-to-image retrieval. This is because, instead of the global features we use, which carry massive redundant information, HMLSTM extracts features for phrases within sentences and salient regions within images and embeds them into a denser space.

On Flickr30K, the best-performing competitor on both tasks is DAN_vgg [12], as shown in Table 2. Only DAN_vgg outperforms DALTS on image-to-text retrieval. As for text-to-image retrieval, DALTS achieves new state-of-the-art results. Owing to its attention mechanism, DAN is able to focus on certain aspects of the data sequentially and aggregate essential information over time to infer the results. In contrast, we use global features to represent both image and text, which are likely to contain noisy or unnecessary information.

As shown in Table 3, with enough training data, DALTS achieves about 1% and 2% improvements in R@5 and R@10, respectively, on image-to-text retrieval compared with DSPE. However, DALTS performs slightly worse than smLSTM [9], which utilises an attention mechanism similar to DAN. As for text-to-image retrieval, DALTS performs slightly worse than DSPE. One possible reason may be that the chain-structured LSTM is likely to miss the intrinsic hierarchical structure of texts and thus shows a weaker ability to learn text features than the Fisher vector, which is learned from external text corpora. In particular, DALTS performs better than MRLA [14], which suggests that the pairwise ranking loss is more suitable for cross-modal retrieval than a category classification loss.

Table 1 Bidirectional retrieval results on Flickr8K

Table 2 Bidirectional retrieval results on Flickr30K

For a fair comparison, the results of VSE++ [7] in Tables 2 and 3 are based on 1-crop VGG image features without fine-tuning. Note that if 10-crops ResNet image features were used to train the models with fine-tuning, the experimental results could be further improved.

As shown in the Appendix, Fig. 4 presents some qualitative results for cross-modal retrieval on Flickr8K. To highlight the effectiveness of our proposed model, the retrieval results for each query are listed from left to right according to three variants of our proposed model: DALTS (VGG+BLSTM), DALTS (NIC+BLSTM) and DALTS (VGG+NIC+BLSTM). We can observe that the retrieval results improve significantly from left to right, especially from DALTS (VGG+BLSTM) to DALTS (NIC+BLSTM). Furthermore, the incorrectly retrieved results are reasonable as well, compared with the ground truth.

In general, DALTS achieves promising improvements in accuracy for cross-modal retrieval compared with the current state-of-the-art methods, though it has some obvious limitations. In the future, we will employ a stronger CNN (ResNet) in our experiments. Meanwhile, an attention mechanism will be applied to reduce the negative impact of redundant information.

5 Further study on DALTS

5.1 Importance of different components

To demonstrate the impact of different components in DALTS, we report results for the following variants in Table 4:

• DALTS (VGG+LSTM): In this setting, we remove NIC while keeping the rest of the model unchanged.

Table 3 Bidirectional retrieval results on MSCOCO

Fig. 4 Qualitative cross-modal retrieval results on Flickr8K. The first column lists image and text queries for retrieval. The second to fourth columns show the top five retrieved results for each query by our proposed model variants DALTS (VGG+BLSTM), DALTS (NIC+BLSTM) and DALTS (VGG+NIC+BLSTM), respectively. For image-to-text retrieval, the correctly retrieved texts for each image query are denoted in red. For text-to-image retrieval, a tick mark indicates the correctly retrieved image for a text query

Table 4 Experimental results on four variants of DALTS

• DALTS (NIC+LSTM): In contrast to DALTS (VGG+LSTM), we remove VGGNet while keeping the rest of the model unchanged.

• DALTS (VGG+NIC+LSTM): The network as shown in Fig. 2a.

• DALTS (VGG+NIC+BLSTM): The network structure is as above, but the LSTM is replaced by a BLSTM.

For Flickr8K, we observe that changing the image feature extractor from VGG to NIC improves the accuracy by about 22% for image-to-text retrieval and about 20% for text-to-image retrieval. This demonstrates that NIC captures the information needed for cross-modal retrieval, such as the interaction information among different objects, better than VGG. Furthermore, changing from NIC alone to the combination of VGG and NIC improves the accuracy by about 6% for both image-to-text and text-to-image retrieval. This reveals that the combination of VGG and NIC captures not only fine-grained category information but also detailed relation information among different objects. Finally, replacing the LSTM with a BLSTM provides an additional improvement of about 2% for image-to-text retrieval and about 1% for text-to-image retrieval. For Flickr30K and MSCOCO, the trends in the experimental results are the same as for Flickr8K.

5.2 Impact of domain adaptation

As mentioned before, domain adaptation is adopted to further minimise the discrepancy between the features from image and text.

Table 5 Experimental results on DALTS and LTS

Fig. 5 t-SNE visualisation of our feature embeddings on the Flickr8K test set (1000 images and 1000 texts). The red circles represent the image features and the blue circles represent the text features

Table 6 Results for DALTS (1-crop) and DALTS (10-crops) on MSCOCO

To verify the effectiveness of DALTS compared with our prior work, dubbed LTS [34], we report results in Table 5. We denote the aforementioned DALTS (VGG+NIC+LSTM) as DALTS. As for LTS, we remove the modality (i.e. domain) classifier and train the network without domain adaptation.

From Table 5, we can easily observe that DALTS achieves significant improvements in accuracy for both image-to-text and text-to-image retrieval, which verifies the effectiveness of our proposed model. LTS outperforms DALTS only in terms of R@1 on the MSCOCO dataset for image-to-text retrieval.

As shown in Fig. 5, we use the t-distributed stochastic neighbour embedding (t-SNE) tool to visualise the distribution of the LTS features learned by DALTS on Flickr8K. Since each image is annotated with five texts, we randomly sample one text for each image. The comparison in Fig. 5 shows that DALTS is able to minimise the discrepancy between the distributions of image and text features.
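For reference, such a visualisation can be produced with scikit-learn's t-SNE as in the following generic sketch (not the authors' plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_lts_embeddings(img_feats, txt_feats):
    """img_feats, txt_feats: (N, d) arrays of LTS features for images and sampled texts."""
    feats = np.concatenate([img_feats, txt_feats], axis=0)
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(feats)
    n = len(img_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c='red', s=8, label='image')
    plt.scatter(emb[n:, 0], emb[n:, 1], c='blue', s=8, label='text')
    plt.legend()
    plt.show()
```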

5.3 Effect of 10-crops VGG image features

In previous experiments, we used 1-crop VGG image features to train DALTS. We also consider computing the image features as the mean of the feature vectors for 10 crops of similar size. Therefore, we report results for DALTS [the aforementioned variant DALTS (VGG+NIC+BLSTM)] trained with 1-crop and 10-crops VGG image features, respectively, as shown in Table 6.
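A sketch of 10-crop feature averaging with torchvision follows; the exact crop configuration used by the authors is not specified beyond the text above, so this standard recipe is an assumption:

```python
import torch
import torchvision.transforms as T

ten_crop = T.Compose([
    T.Resize(256),
    T.TenCrop(224),  # four corners + centre, each with its horizontal flip
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])

def ten_crop_feature(pil_image, cnn):
    """Average the CNN feature over the 10 crops of a PIL image."""
    crops = ten_crop(pil_image)        # (10, 3, 224, 224)
    with torch.no_grad():
        feats = cnn(crops)             # (10, feature_dim) per-crop features
    return feats.mean(dim=0)           # mean-pooled image feature
```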

From Table 6, we can observe that DALTS (10-crops) achieves a significant improvement in accuracy for bidirectional retrieval compared with DALTS (1-crop), which demonstrates the effectiveness of 10-crops VGG image features.

6 Conclusion

In this paper, we propose a novel model, DALTS, which performs cross-modal retrieval in an LTS combined with an adversarial learning mechanism. Extensive experiments on three benchmark datasets demonstrate that the structure of our model is well designed and that NIC is a more powerful image feature extractor than traditional CNNs such as VGGNet for cross-modal retrieval. Moreover, DALTS achieves promising improvements in accuracy compared with typical state-of-the-art methods. In the future, we will pay more attention to attention mechanisms and try to learn fine-grained features for both image and text to reduce the negative impact of redundant information. Meanwhile, stronger CNNs, such as ResNet, will be used to further verify the effectiveness of our proposed model.

7 Acknowledgments

This project was supported by the Shenzhen Peacock Plan (20130408-183003656), the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467) and the National Natural Science Foundation of China (NSFC, No. U1613209).

8 References

[1] Sharif Razavian, A., Azizpour, H., Sullivan, J., et al.: 'CNN features off-the-shelf: an astounding baseline for recognition'. Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813

[2] Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556, 2014

[3] He, K., Zhang, X., Ren, S., et al.: 'Deep residual learning for image recognition'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778

[4] Mikolov, T., Sutskever, I., Chen, K., et al.: 'Distributed representations of words and phrases and their compositionality'. Advances in Neural Information Processing Systems, 2013, pp. 3111–3119

[5] Blei, D.M., Ng, A.Y., Jordan, M.I.: 'Latent Dirichlet allocation', J. Mach. Learn. Res., 2003, 3, pp. 993–1022

[6] Klein, B., Lev, G., Sadeh, G., et al.: 'Fisher vectors derived from hybrid Gaussian–Laplacian mixture models for image annotation', arXiv preprint arXiv:1411.7399, 2014

[7] Faghri, F., Fleet, D.J., Kiros, J.R., et al.: 'VSE++: improving visual-semantic embeddings with hard negatives', 2017

[8] Fan, M., Wang, W., Wang, R.: 'Coupled feature mapping and correlation mining for cross-media retrieval'. 2016 IEEE Int. Conf. Multimedia & Expo Workshops (ICMEW), 2016, pp. 1–6

[9] Huang, Y., Wang, W., Wang, L.: 'Instance-aware image and sentence matching with selective multimodal LSTM'. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, vol. 2, no. 6, p. 7

[10] Karpathy, A., Joulin, A., Fei-Fei, L.F.: 'Deep fragment embeddings for bidirectional image sentence mapping'. Advances in Neural Information Processing Systems, 2014, pp. 1889–1897

[11] Ma, L., Lu, Z., Shang, L., et al.: 'Multimodal convolutional neural networks for matching image and sentence'. Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2623–2631

[12] Nam, H., Ha, J.W., Kim, J.: 'Dual attention networks for multimodal reasoning and matching', arXiv preprint arXiv:1611.00471, 2016

[13] Niu, Z., Zhou, M., Wang, L., et al.: 'Hierarchical multimodal LSTM for dense visual-semantic embedding'. 2017 IEEE Int. Conf. Computer Vision (ICCV), 2017, pp. 1899–1907

[14] Park, G., Im, W.: 'Image-text multi-modal representation learning by adversarial backpropagation', arXiv preprint arXiv:1612.08354, 2016

[15] Thompson, B.: 'Canonical correlation analysis', Encyclopedia of Statistics in Behavioral Science, 2005

[16] Wang, L., Li, Y., Lazebnik, S.: 'Learning deep structure-preserving image-text embeddings'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5005–5013

[17] Yan, F., Mikolajczyk, K.: 'Deep correlation for matching images and text'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3441–3450

[18] Zheng, Z., Zheng, L., Garrett, M., et al.: 'Dual-path convolutional image-text embedding with instance loss', arXiv preprint arXiv:1711.05535, 2017

[19] Chami, I., Tamaazousti, Y., Le Borgne, H.: 'AMECON: abstract meta-concept features for text-illustration'. Proc. 2017 ACM Int. Conf. Multimedia Retrieval, 2017, pp. 347–355

[20] Frome, A., Corrado, G.S., Shlens, J., et al.: 'DeViSE: a deep visual-semantic embedding model'. Advances in Neural Information Processing Systems, 2013, pp. 2121–2129

[21] Norouzi, M., Mikolov, T., Bengio, S., et al.: 'Zero-shot learning by convex combination of semantic embeddings', arXiv preprint arXiv:1312.5650, 2013

[22] Dong, J., Li, X., Snoek, C.G.M.: 'Word2VisualVec: cross-media retrieval by visual feature prediction', CoRR abs/1604.06838, 2016

[23] Smeulders, A.W.M., Worring, M., Santini, S., et al.: 'Content-based image retrieval at the end of the early years', IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22, (12), pp. 1349–1380

[24] Ganin, Y., Lempitsky, V.: 'Unsupervised domain adaptation by backpropagation', arXiv preprint arXiv:1409.7495, 2014

[25] Mao, J., Xu, W., Yang, Y., et al.: 'Deep captioning with multimodal recurrent neural networks (m-RNN)', arXiv preprint arXiv:1412.6632, 2014

[26] Vinyals, O., Toshev, A., Bengio, S., et al.: 'Show and tell: a neural image caption generator'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3156–3164

[27] Karpathy, A., Fei-Fei, L.: 'Deep visual-semantic alignments for generating image descriptions'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137

[28] Kiros, R., Salakhutdinov, R., Zemel, R.S.: 'Unifying visual-semantic embeddings with multimodal neural language models', arXiv preprint arXiv:1411.2539, 2014

[29] Sutskever, I., Vinyals, O., Le, Q.V.: 'Sequence to sequence learning with neural networks'. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112

[30] Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: 'Generative adversarial nets'. Advances in Neural Information Processing Systems, 2014, pp. 2672–2680

[31] Hodosh, M., Young, P., Hockenmaier, J.: 'Framing image description as a ranking task: data, models and evaluation metrics', J. Artif. Intell. Res., 2013, 47, pp. 853–899

[32] Young, P., Lai, A., Hodosh, M., et al.: 'From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions', Trans. Assoc. Comput. Linguist., 2014, 2, pp. 67–78

[33] Lin, T.Y., Maire, M., Belongie, S., et al.: 'Microsoft COCO: common objects in context'. European Conf. Computer Vision, 2014, pp. 740–755

[34] Yu, Z., Wang, W., Fan, M.: 'Learning a limited text space for cross-media retrieval'. Int. Conf. Computer Analysis of Images and Patterns, 2017, pp. 292–303
