【Paper】Deep Multimodal Representation Learning: A Survey (Part 3)

Continued from: 【Paper】Deep Multimodal Representation Learning: A Survey (Part 2)



Deep Multimodal Representation Learning: A Survey



E. ATTENTION MECHANISM

The attention mechanism allows a model to focus on specific regions of a feature map or specific time steps of a feature sequence. Via the attention mechanism, not only can improved performance be achieved, but feature representations also become more interpretable. This mechanism mimics the human ability to extract the most discriminative information for recognition. Rather than using all of the information at once, the attention decision process selectively concentrates on the part of the scene that is needed [151]. Recently, this method has demonstrated its unique power in improving performance in many applications such as visual classification [152]–[154], neural machine translation [155], [156], speech recognition [92], image captioning [13], [91], video description [42], [90], visual question answering [24], [157], cross-modal retrieval [31], [158], and sentiment analysis [22].


According to whether a key is used when selecting part of the features, attention mechanisms can be categorized into two groups: key-based attention and keyless attention. Key-based attention uses a key to search for salient localized features. Take image captioning as an example [13]: its typical structure is illustrated in Fig. 8, where a CNN encodes the image into a feature set {a_i}, and an RNN decodes the input into hidden states {h_t}. At time step t, the output y_t is predicted based on h_t and c_t, where c_t is the salient feature summarized from {a_i}. During the extraction of the salient feature c_t, the current decoder state h_t serves as the key, and the encoder states {a_i} serve as the source to be searched [159]. The computation of the attention mechanism [13], [156] can be defined as in (26) to (28), and the compatibility scores between the key and the source can be evaluated via one of the three functions listed in (29).

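Assuming the standard soft-attention formulation of [13], [156], equations (26) to (29) presumably take the following form, where the exact notation in the original may differ:

```latex
e_{t,i} = \operatorname{score}(h_t, a_i)                              % (26) compatibility between the key h_t and source a_i
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k}\exp(e_{t,k})}            % (27) attention weights (softmax over the source)
c_t = \sum_{i} \alpha_{t,i}\, a_i                                     % (28) salient (attended) feature
\operatorname{score}(h_t, a_i) \in
  \bigl\{\, h_t^{\top} a_i,\;\; h_t^{\top} W a_i,\;\;
            v^{\top}\tanh\!\bigl(W [h_t; a_i]\bigr) \,\bigr\}         % (29) dot / general / concat forms [156]
```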

Key-based attention is widespread in visual description applications [13], [90], [160], where an encoder-decoder network is commonly used. It provides an approach to evaluate the importance of features within a modality or across modalities. On the one hand, the attention mechanism can be used to select the most salient features within a modality; on the other hand, it can be used to balance the contributions of different modalities when fusing them.


In order to recognize and describe objects contained in the visual modality, a set of localized region features, which potentially encode different objects distinctly, is more helpful than a single feature vector. By dynamically selecting the most salient regions of an image or time steps of a video sequence, both system performance and noise tolerance can be improved. For example, Xu et al. [13] adopted the attention mechanism to detect salient objects in an image and fused them with text features in a decoder unit for captioning. In this case, guided by the text generated at time step t, the attention module searches for local regions appropriate for predicting the next word.


To locate local features more accurately, several attention models have been proposed. Yang et al. [157] proposed a stacked attention network for searching image regions. They suggested that multiple steps of search or reasoning help locate fine-grained regions. Initially, the model locates one or more local regions in the image by attention, using language features as the key, and then combines the attended visual and language features into a vector, which in turn serves as the key for the next iteration. After K steps, not only are the appropriate local regions located, but the two features are also fused. Zhu et al. [161] proposed a structured attention model to capture the semantic structure among image regions, and their experiments showed that this model is capable of inferring spatial relations and attending to the right region. Chen et al. [162] proposed to incorporate spatial and channel-wise attention in a CNN network. In their model, not only local regions but also channels of the CNN features are filtered simultaneously.

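As an illustration, one reasoning step of such a stacked attention network could be sketched in PyTorch as follows; the layer sizes and names are illustrative assumptions, and for brevity a single step module is reused here, whereas [157] stacks separate attention layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttentionStep(nn.Module):
    """One attention/reasoning step: use the current query (e.g., a language
    feature) as the key to attend over image region features, then fuse the
    attended visual feature back into the query for the next step."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(dim, hidden)   # project region features
        self.proj_q = nn.Linear(dim, hidden)   # project the query/key
        self.score = nn.Linear(hidden, 1)      # compatibility score per region

    def forward(self, regions, query):
        # regions: (batch, num_regions, dim), query: (batch, dim)
        h = torch.tanh(self.proj_v(regions) + self.proj_q(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)        # (batch, num_regions, 1)
        attended = (alpha * regions).sum(dim=1)        # weighted sum of region features
        return attended + query                        # fused vector = key for the next step

regions = torch.randn(2, 36, 1024)   # 36 localized region features per image
query = torch.randn(2, 1024)         # initial language feature
step = StackedAttentionStep(1024)
for _ in range(2):                   # K reasoning steps (K = 2 here)
    query = step(regions, query)     # query now encodes attended visual + language evidence
```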

So far, attention models have mostly been trained using indirect cues because explicit attention annotations are lacking. Alternatively, Gan et al. [163] trained the attention module using direct supervision. They collected link information between visual segments and words from several datasets and then utilized it to explicitly guide the training of the attention module. The experiments showed that improved performance could be achieved.


Balancing the contributions of different modalities is a key issue that should be considered when fusing multimodal features. In contrast to concatenation or fixed-weight fusion methods, an attention-based method can adaptively balance the contributions of different modalities. Several studies [90], [91], [164] have reported that dynamically assigning weights to modality-specific features, conditioned on a context, helps improve application performance.


Hori et al. [90] proposed to tackle multimodal fusion based on attention for video description. In addition to attending to specific regions and time steps, the proposed method highlights attending to modality-specific information. After modality-specific features have been extracted, the attention module produces appropriate weights to combine features from different modalities based on the context. In a cross-modal retrieval task, Chen et al. [164] adopted a similar strategy to adaptively fuse modalities and filter out unrelated information within each modality according to the search keys.

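A minimal sketch of this kind of modality-level attention fusion, with weights predicted from a context vector such as the decoder state, might look like the following (the dimensions and module names are assumptions rather than the exact architecture of [90]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttentionFusion(nn.Module):
    """Fuse modality-specific feature vectors with weights predicted from a
    context vector (e.g., the decoder state), rather than concatenation or
    fixed weights."""
    def __init__(self, dims, ctx_dim, out_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in dims])             # per-modality projections
        self.score = nn.ModuleList([nn.Linear(ctx_dim + out_dim, 1) for _ in dims])  # per-modality scores

    def forward(self, feats, ctx):
        # feats: list of (batch, dim_m) tensors, ctx: (batch, ctx_dim)
        projected = [torch.tanh(p(f)) for p, f in zip(self.proj, feats)]
        scores = [s(torch.cat([ctx, z], dim=-1)) for s, z in zip(self.score, projected)]
        beta = F.softmax(torch.cat(scores, dim=-1), dim=-1)        # (batch, num_modalities)
        stacked = torch.stack(projected, dim=1)                    # (batch, num_modalities, out_dim)
        fused = (beta.unsqueeze(-1) * stacked).sum(dim=1)          # attention-weighted combination
        return fused, beta

# Example: fuse visual (2048-d) and audio (128-d) features given a 512-d decoder state.
fusion = ModalityAttentionFusion(dims=[2048, 128], ctx_dim=512, out_dim=256)
video, audio, state = torch.randn(4, 2048), torch.randn(4, 128), torch.randn(4, 512)
fused, weights = fusion([video, audio], state)   # weights shows each modality's contribution
```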

Lu et al. [91] introduced an adaptive attention framework to determine whether to include a visual feature when generating each word of a caption. They argued that some words, such as "the", are not related to any visual object, so no visual feature is needed in that case. When the visual feature is excluded, the decoder depends only on the language features to predict a word.


Keyless attention is mostly used for classification or regression tasks. In such application scenarios, since the result is generated in a single step, it is hard to define a key to guide the attention module. Instead, attention is applied directly to the localized features without any key involved. The computation can be illustrated as follows:


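A plausible form of these equations, assuming the standard keyless (self-weighted) attention used in works such as [165] (the exact notation in the original may differ), is:

```latex
e_{i} = w^{\top} \tanh(W a_i + b)                       % score computed from the localized feature itself (no key)
\alpha_{i} = \frac{\exp(e_{i})}{\sum_{k}\exp(e_{k})}    % normalized attention weights
c = \sum_{i} \alpha_{i}\, a_i                           % fixed-length weighted summary of the features
```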

Because it naturally selects prominent cues from the raw input, the keyless attention mechanism is suitable for multimodal feature fusion tasks that suffer from issues such as semantic conflict, duplication, and noise. The attention mechanism provides an approach to evaluate the relationships between parts of different modalities, which may be complementary or supplementary. By selecting complementary features from different modalities and fusing them into a single representation, semantic ambiguity can be eased.


The advantage of the attention mechanism in multimodal fusion has been proven in many applications. For example, Long et al. [165] compared four multimodal fusion methods and demonstrated that the attention-based method is the most effective one for addressing the video classification problem. They performed experiments in different setups: early fusion, middle-level fusion, attention-based fusion, and late fusion, which correspond to different fusion points. The experimental results also show that attention-based fusion is robust across various datasets. Other studies have also demonstrated the promising perspective of attention-based methods for multimodal feature fusion [166], [167].


A special issue in multimodal feature fusion is fusing features from several variable-length sequences such as videos, audio, sentences, or a set of localized features. A simple way to tackle this problem is fusing each sequence independently via the attention mechanism: after each sequence has been combined into a weighted representation with a fixed length, the representations are concatenated or fused into a single vector. This approach is beneficial for fusing several sequences even when their lengths differ, which is commonly the case in a multimodal dataset. However, such a simplified method does not explicitly consider the interaction between modalities, and thus may ignore fine-grained cross-modal relationships.

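A minimal sketch of this simple strategy, assuming a keyless attention pooling applied to each modality before concatenation (names and sizes are illustrative), is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeylessAttentionPool(nn.Module):
    """Summarize a variable-length feature sequence into one fixed-length vector
    without any external key."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                        # x: (batch, seq_len, dim)
        alpha = F.softmax(self.score(x), dim=1)  # (batch, seq_len, 1) attention weights
        return (alpha * x).sum(dim=1)            # (batch, dim) weighted summary

# Each modality sequence is pooled independently, then the summaries are concatenated.
video_pool, audio_pool = KeylessAttentionPool(1024), KeylessAttentionPool(128)
video = torch.randn(4, 120, 1024)                # 120 video frames per sample
audio = torch.randn(4, 300, 128)                 # 300 audio frames (a different length is fine)
fused = torch.cat([video_pool(video), audio_pool(audio)], dim=-1)   # (4, 1152)
```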

A solution for modeling the interactions between attention modules is to construct a shared context as an extra condition for computing the modality-specific attention modules. For example, Lu et al. [24] proposed to construct a global context by calculating the similarity between visual and text features. Nam et al. [158] used an iterative strategy to update the shared context and the modality-specific attention distributions: first, modality-specific features are summarized by the attention modules, then they are fused into a context used in the next iteration.


Recently, a novel learning strategy named the multi-attention mechanism, which utilizes several attention modules to extract different types of features from the same input data, has been explored. Generally, each type of feature lies in a distinct subspace and reflects different semantics. Hence, the multi-attention mechanism is helpful in discovering different inter-modal dynamics. For example, Zadeh et al. [22] proposed to discover diverse interactions between modalities using the multi-attention mechanism. At each time step t, the hidden states h_t^m from all modalities are concatenated into a vector h_t; then multiple attentions are applied to h_t to extract K different weighted vectors that reflect distinctive cross-modal relationships. After that, all K vectors are fused into a single vector representing the shared hidden state across modalities at time t.

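A simplified sketch of such a multi-attention block, with K parallel attention distributions over the concatenated state followed by fusion, could look as follows; it is an illustrative assumption and does not reproduce the exact architecture of [22]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionBlock(nn.Module):
    """Apply K parallel attention distributions over the concatenated cross-modal
    state h_t; each yields a differently weighted view, and the K views are fused
    into a single shared representation."""
    def __init__(self, dim, k=4, out_dim=128):
        super().__init__()
        self.k, self.dim = k, dim
        self.att = nn.Linear(dim, k * dim)        # scores for K distributions over the dimensions of h_t
        self.fuse = nn.Linear(k * dim, out_dim)   # fuse the K weighted views

    def forward(self, h):                          # h: (batch, dim), concatenation of all h_t^m
        scores = self.att(h).view(-1, self.k, self.dim)
        alpha = F.softmax(scores, dim=-1)          # K attention distributions
        views = alpha * h.unsqueeze(1)             # (batch, k, dim): K weighted views of h
        return self.fuse(views.flatten(1))         # shared cross-modal state at time t

h_video, h_audio, h_text = torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 300)
h_t = torch.cat([h_video, h_audio, h_text], dim=-1)   # concatenate modality-specific hidden states
mab = MultiAttentionBlock(dim=h_t.size(-1), k=4)
shared = mab(h_t)                                     # one fused vector reflecting K cross-modal views
```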

Another example is the model from Zhou et al. [167], which fuses heterogeneous features of user behaviors based on the multi-attention mechanism. Here, each user behavior type can be seen as a distinct modality, because different types of behaviors have distinct attributes. The authors suppose that the semantics of a user behavior can be affected by the context, and hence the semantic intensity of that behavior also depends on the context. First, the model projects all types of behaviors into a concatenated vector denoted as S, which is a global feature and serves as the context in the attention module. Then, S is projected into K latent semantic subspaces to represent different semantics. After that, the model fuses the K subspaces through the attention module.


One of the advantages of the attention mechanism is its capability to select salient and discriminative localized features, which can not only improve the performance of multimodal representations but also lead to better interpretability. Additionally, by selecting prominent cues, this technique can also help to tackle issues such as noise and to fuse complementary semantics into multimodal representations.



IV. CONCLUSION AND FUTURE DIRECTIONS

TABLE 3. A summary of the key issues, advantages, and disadvantages of each framework or typical model described in this paper. Note that both the cross-modal similarity model and deep canonical correlation analysis (DCCA) belong to the coordinated representation framework.

In this paper, we provided a comprehensive survey on deep multimodal representation learning. According to the underlying structures in which different modalities are integrated, we categorize deep multimodal representation learning methods into three frameworks: joint representation, coordinated representation, and encoder-decoder. Additionally, we summarize some typical models in this area, ranging from conventional models to newly developed technologies, including probabilistic graphical models, multimodal autoencoders, deep canonical correlation analysis, generative adversarial networks, and the attention mechanism. For each framework or model, we describe its basic structure, learning objective, and application scenarios. We also discuss their key issues, advantages, and disadvantages, which are briefly summarized in Table 3.


Turning to the learning objectives and key issues of the various learning frameworks and typical models, we can clearly see that the primary objective of multimodal representation learning is to narrow the distribution gap in a joint semantic subspace while keeping modality-specific semantics intact. They achieve this objective in different ways: the joint representation framework maps all modalities into a global common subspace; the coordinated representation framework maximizes the similarity or correlation between modalities while keeping each modality independent; the encoder-decoder framework maximizes the conditional distribution among modalities and keeps their semantics consistent; probabilistic graphical models maximize the joint probability distribution across modalities; multimodal autoencoders endeavor to keep modality-specific distributions intact by minimizing reconstruction errors; generative adversarial networks aim to narrow the distribution difference between modalities through an adversarial process; and the attention mechanism selects salient features from the modalities such that they are similar in local manifolds or complementary to each other.


With the rapid development of deep multimodal representation learning methods, the need for more training data is growing. However, the volume of current multimodal datasets is limited because of the high cost of manual labeling; acquiring high-quality labeled datasets is extremely labor-intensive. A popular solution to this problem is transfer learning, which transfers general knowledge from a source domain with a large-scale dataset to a target domain with insufficient data [168]. Transfer learning has been widely used in the multimodal representation learning area and has been shown to be effective in improving performance on many multimodal tasks. One example is the reuse of pre-trained CNN networks such as VGGNet [48] and ResNet [49], which can be used to extract image features in a multimodal system. A second example is word embeddings such as word2vec [50] and GloVe [51]. Although these word representations are trained only on general-purpose language corpora, they can be transferred to other datasets directly, even without fine-tuning.

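As an illustration, a typical way to reuse such a pre-trained component as a frozen feature extractor in a multimodal pipeline might look as follows (a sketch assuming torchvision ≥ 0.13 for the pre-trained weights API):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Reuse a CNN pre-trained on ImageNet as a frozen image feature extractor
# (an analogous strategy applies to pre-trained word embeddings for the text modality).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = nn.Identity()                  # drop the classifier head, keep the 2048-d pooled features
resnet.eval()
for p in resnet.parameters():
    p.requires_grad = False                # transfer the general visual knowledge without fine-tuning

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # a batch of preprocessed images
    visual_features = resnet(images)       # (8, 2048) features fed to the multimodal model
```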

In contrast to the widespread use of convenient and effective knowledge transfer strategies in the image and language modalities, similar methods are not yet available for the audio or video modalities. Hence, the deep networks used for extracting audio or video features more easily suffer from overfitting due to the limited number of training instances. As a result, in many applications based on fused multimodal features, such as sentiment analysis and emotion recognition, it is relatively hard to improve performance when only audio and video data are available. Instead, most works have to rely increasingly on a stronger language model. Although some efforts have been made to transfer cross-domain knowledge to the audio and video modalities, more convenient and effective methods are still required in the multimodal representation learning area.


In addition to knowledge transfer within the same modality, cross-modal transfer learning, which aims to transfer knowledge from one modality to another, is also a significant research direction. For example, recent studies show that knowledge transferred from images can help improve the performance of video analysis tasks [169]. An alternative but more challenging approach is transfer learning between multimodal datasets. The advantage of this method is that the correlation information among different modalities in the source domain can also be exploited; the weakness is its complexity, since both the modality difference and the domain discrepancy must be tackled simultaneously.


Another feasible future direction for tackling the reliance on large-scale labeled datasets is unsupervised or weakly supervised learning, which can be trained using the ubiquitous multimodal data generated by internet users. Unsupervised learning has been widely used for dimensionality reduction and feature extraction on unlabeled datasets.


That is why conventional unsupervised learning methods such as multimodal autoencoders are still active today, even though their performance is not as good as that of CNN or RNN features. For a similar reason, generative adversarial networks have recently attracted much attention in the multimodal learning area.


Most recently, weakly supervised learning has demonstrated its potential for exploiting useful knowledge hidden behind multimodal data. For example, given an image and its description, it is highly likely that an image segment can be described by some words in the sentence. Although the one-to-one correspondences between them are fully unknown, the work proposed by Karpathy and Fei-Fei [76] shows that these hidden relationships can be discovered via weakly supervised learning. Potentially, a more promising application of this type of weak-supervision-based method is video analysis, where different modalities such as actions, audio, and language are roughly aligned along the timeline.


For a long time, multimodal representation learning has suffered from issues such as semantic conflict, duplication, and noise. Although the attention mechanism can be used to partially address these problems, it works implicitly and cannot be actively controlled. A more promising method for this problem is integrating reasoning ability into multimodal representation learning networks. Via a reasoning mechanism, a system would have the capability to actively select the evidence it needs, which could play an important role in mitigating the impact of these troubling issues. We believe that the close combination of representation learning and reasoning mechanisms will endow machines with intelligent cognitive capabilities.




