【Paper】CNN-LSTM: Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Venue: CVPR 2015 (oral)
Citations: 3,673 (as of 04/24/20)
Full paper: click here


This paper is the seminal CNN-LSTM work, aimed mainly at generating image descriptions. The first draft appeared in 2014 and received a CVPR oral; the fourth version (the one covered here) was released in 2016.



Long-term Recurrent Convolutional Networks for Visual Recognition and Description



Abstract

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.



1 INTRODUCTION

Recognition and description of images and videos is a fundamental challenge of computer vision. Dramatic progress has been achieved by supervised convolutional neural network (CNN) models on image recognition tasks, and a number of extensions to process video have been recently proposed. Ideally, a video model should allow processing of variable length input sequences, and also provide for variable length outputs, including generation of full length sentence descriptions that go beyond conventional one-versus-all prediction tasks. In this paper we propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable (Figure 1). We instantiate our architecture for specific video activity recognition, image caption generation, and video description tasks as described below.

Fig. 1. We propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs. LRCN processes the (possibly) variable-length visual input (left) with a CNN (middle left), whose outputs are fed into a stack of recurrent sequence models (LSTMs, middle-right), which finally produce a variable-length prediction (right). Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.


Research on CNN models for video processing has considered learning 3D spatio-temporal filters over raw sequence data [1], [2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [3], [4]. Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling. Following the same inspiration that motivates current deep convolutional models, we advocate for video recognition and description models which are also deep over temporal dimensions; i.e., have temporal recurrence of latent variables. Recurrent Neural Network (RNN) models are “deep in time” – explicitly so when unrolled – and form implicit compositional representations in the time domain. Such “deep” models predated deep spatial convolution models in the literature [5], [6].


The use of RNNs in perceptual applications has been explored for many decades, with varying results. A significant limitation of simple RNN models which strictly integrate state information over time is known as the “vanishing gradient” effect: the ability to backpropagate an error signal through a long-range temporal interval becomes increasingly difficult in practice. Long Short-Term Memory (LSTM) units, first proposed in [7], are recurrent modules which enable long-range learning. LSTM units have hidden state augmented with nonlinear mechanisms to allow state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs have recently been demonstrated to be capable of large-scale learning of speech recognition [8] and language translation models [9], [10].


We show here that convolutional networks with recurrent units are generally applicable to visual time-series modeling, and argue that in visual tasks where static or flat temporal models have previously been employed, LSTM style RNNs can provide significant improvement when ample training data are available to learn or refine the representation. Specifically, we show that LSTM type models provide for improved recognition on conventional video activity challenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language descriptions. We also show that these models improve generation of descriptions from intermediate visual representations derived from conventional visual models.


We instantiate our proposed architecture in three experimental settings (Figure 3). First, we show that by directly connecting a visual convolutional model to deep LSTM networks, we are able to train video recognition models that capture temporal state dependencies (Figure 3 left; Section 4). While existing labeled video activity datasets may not have actions or activities with particularly complex temporal dynamics, we nonetheless observe significant improvements on conventional benchmarks.


Second, we explore end-to-end trainable image to sentence mappings. Strong results for machine translation tasks have recently been reported [9], [10]; such models are encoder-decoder pairs based on LSTM networks. We propose a multimodal analog of this model, and describe an architecture which uses a visual convnet to encode a deep state vector, and an LSTM to decode the vector into a natural language string (Figure 3 middle; Section 5). The resulting model can be trained end-to-end on large-scale image and text datasets, and even with modest training provides competitive generation results compared to existing methods.


Finally, we show that LSTM decoders can be driven directly from conventional computer vision methods which predict higher-level discriminative labels, such as the semantic video role tuple predictors in [11] (Figure 3, right; Section 6). While not end-to-end trainable, such models offer architectural and performance advantages over previous statistical machine translation-based approaches.


We have realized a generic framework for recurrent models in the widely adopted deep learning framework Caffe [12], including ready-to-use implementations of RNN and LSTM units. (See http://jeffdonahue.com/lrcn/.)


Fig. 2. A diagram of a basic RNN unit (left) and an LSTM memory cell (right) as used in this paper (from [13], a slight simplification of the architecture described in [14], which was derived from the LSTM originally proposed in [7]).


2 BACKGROUND: RECURRENT NETWORKS

Traditional recurrent neural networks (RNNs, Figure 2, left) model temporal dynamics by mapping input sequences to hidden states, and hidden states to outputs via the following recurrence equations (Figure 2, left):


$$h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$z_t = g(W_{hz} h_t + b_z)$$

where $g$ is an element-wise non-linearity, such as a sigmoid or hyperbolic tangent, $x_t$ is the input, $h_t \in \R^N$ is the hidden state with $N$ hidden units, and $z_t$ is the output at time $t$. For a length $T$ input sequence $(x_1, x_2, ..., x_T)$, the updates above are computed sequentially as $h_1$ (letting $h_0 = 0$), $z_1, h_2, z_2, ..., h_T, z_T$.

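As a concrete reference, the recurrence can be written in a few lines of NumPy. This is a minimal sketch of the two update equations above; the weight names (`W_xh`, `W_hh`, `W_hz`) and the toy dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hz, b_h, b_z, g=np.tanh):
    """Vanilla RNN recurrence: h_t = g(W_xh x_t + W_hh h_{t-1} + b_h), z_t = g(W_hz h_t + b_z)."""
    N = W_hh.shape[0]
    h = np.zeros(N)                      # h_0 = 0
    hs, zs = [], []
    for x_t in x_seq:                    # computed sequentially: h_1, z_1, h_2, z_2, ..., h_T, z_T
        h = g(W_xh @ x_t + W_hh @ h + b_h)
        z = g(W_hz @ h + b_z)
        hs.append(h)
        zs.append(z)
    return hs, zs

# Toy usage: T = 5 inputs of dimension D = 8, N = 16 hidden units, output dimension K = 4.
rng = np.random.default_rng(0)
D, N, K, T = 8, 16, 4, 5
x_seq = [rng.standard_normal(D) for _ in range(T)]
hs, zs = rnn_forward(x_seq,
                     rng.standard_normal((N, D)) * 0.1,
                     rng.standard_normal((N, N)) * 0.1,
                     rng.standard_normal((K, N)) * 0.1,
                     np.zeros(N), np.zeros(K))
```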

Though RNNs have proven successful on tasks such as speech recognition [15] and text generation [16], it can be difficult to train them to learn long-term dynamics, likely due in part to the vanishing and exploding gradients problem [7] that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular time step. LSTMs provide a solution by incorporating memory units that explicitly allow the network to learn when to “forget” previous hidden states and when to update hidden states given new information. As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [13] (Figure 2, right), a slight simplification of the one described in [8], which was derived from the original LSTM unit proposed in [7].


Letting $\sigma(x) = (1 + e^{-x})^{-1}$ be the sigmoid non-linearity which squashes real-valued inputs to a $[0,1]$ range, and letting $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ be the hyperbolic tangent non-linearity, similarly squashing its inputs to a $[-1,1]$ range, the LSTM updates for time step $t$ given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ are:

σ(x)=(1+ex)1σ(x) = (1 + e^{−x})^{−1} 爲非線性sigmoid激活函數,它將實值輸入壓縮到[0,1]範圍,並且讓 tanh(x)=exexex+ex=2σ(2x)1tanh(x) = \frac {e^x−e^{−x}} {e^x+e^{−x}} = 2σ(2x) − 1 是雙曲正切非線性,類似地將其輸入壓縮到[−1,1]範圍,LSTM在給定輸入xt,ht1x_t, h_{t−1}ct1c_{t−1} 的情況下更新時間步 tt
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$
$x \odot y$ denotes the element-wise product of vectors $x$ and $y$.

In addition to a hidden unit $h_t \in \R^N$, the LSTM includes an input gate $i_t \in \R^N$, forget gate $f_t \in \R^N$, output gate $o_t \in \R^N$, input modulation gate $g_t \in \R^N$, and memory cell $c_t \in \R^N$. The memory cell unit $c_t$ is a sum of two terms: the previous memory cell unit $c_{t-1}$, which is modulated by $f_t$, and $g_t$, a function of the current input and previous hidden state, modulated by the input gate $i_t$. Because $i_t$ and $f_t$ are sigmoidal, their values lie within the range $[0,1]$, and $i_t$ and $f_t$ can be thought of as knobs that the LSTM learns to selectively forget its previous memory or consider its current input. Likewise, the output gate $o_t$ learns how much of the memory cell to transfer to the hidden state. These additional cells seem to enable the LSTM to learn complex and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state $h_t^{(\ell-1)}$ of the LSTM in layer $\ell - 1$ as the input to the LSTM in layer $\ell$.

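A minimal NumPy sketch of one LSTM update, following the gate equations above; the parameter layout (separate `W_x*`/`W_h*` matrices per gate) is an assumption chosen for readability rather than the paper's Caffe parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update as in Section 2; W holds eight per-gate matrices, b four bias vectors."""
    W_xi, W_hi, W_xf, W_hf, W_xo, W_ho, W_xc, W_hc = W
    b_i, b_f, b_o, b_c = b
    i_t = sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)   # input gate
    f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)   # forget gate
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)   # output gate
    g_t = np.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)   # input modulation gate
    c_t = f_t * c_prev + i_t * g_t                    # memory cell: keep some old state, add some new
    h_t = o_t * np.tanh(c_t)                          # how much of the cell is exposed as hidden state
    return h_t, c_t

# Toy usage: input dimension D = 8, N = 16 hidden units.
rng = np.random.default_rng(0)
D, N = 8, 16
W = tuple(rng.standard_normal(s) * 0.1 for s in [(N, D), (N, N)] * 4)
b = tuple(np.zeros(N) for _ in range(4))
h, c = lstm_step(rng.standard_normal(D), np.zeros(N), np.zeros(N), W, b)
```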

Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [8] and machine translation [9], [10]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [8] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text translations. [9] and [10] translate sentences from English to French with a multilayer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows an input sequence of arbitrary length to be mapped to an output sequence of different length. The sequence-to-sequence architecture for machine translation circumvents the need for language models.


The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed length inputs or outputs allowing simple modeling for sequential data of varying lengths, such as text or video. We next describe a unified framework to combine recurrent models such as LSTMs with deep convolutional networks to form end-to-end trainable networks capable of complex visual and sequence prediction tasks.



3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN) MODEL

This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. LRCN works by passing each visual input $x_t$ (an image in isolation, or a frame from a video) through a feature transformation $\phi_V(\cdot)$ with parameters $V$, usually a CNN, to produce a fixed-length vector representation $\phi_V(x_t)$. The outputs of $\phi_V$ are then passed into a recurrent sequence learning module.


In its most general form, a recurrent model has parameters $W$, and maps an input $x_t$ and a previous time step hidden state $h_{t-1}$ to an output $z_t$ and updated hidden state $h_t$. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: $h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$, then $h_2 = f_W(x_2, h_1)$, etc., up to $h_T$. Some of our models stack multiple LSTMs atop one another as described in Section 2.

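A rough PyTorch sketch of this structure for the activity-recognition case (the paper's implementation is in Caffe; the ResNet-18 backbone, hidden size, and class count here are stand-ins, and the backbone is randomly initialized rather than ImageNet-pretrained as in the paper):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LRCNSketch(nn.Module):
    """Activity-recognition-style LRCN sketch: a shared CNN per frame, then an LSTM over time."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        backbone = models.resnet18()            # stand-in for the paper's CaffeNet/VGG phi_V
        feat_dim = backbone.fc.in_features      # 512 for ResNet-18
        backbone.fc = nn.Identity()             # expose the pooled visual feature phi_V(x_t)
        self.cnn = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)  # sequence model f_W
        self.classifier = nn.Linear(hidden_size, num_classes)         # linear prediction W_z z_t + b_z

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))  # CNN is time-invariant, so all B*T frames run as one batch
        z, _ = self.lstm(feats.view(B, T, -1))  # z_t for every time step, computed sequentially inside
        return self.classifier(z)               # per-step logits; average over t for a video-level label

# Toy usage: a batch of 2 clips, 5 frames each, with a UCF101-sized label space.
logits = LRCNSketch(num_classes=101)(torch.randn(2, 5, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 5, 101])
```

Flattening the time axis into the batch axis before the CNN reflects the time-invariance of $\phi_V$ discussed later in this section: every frame can be processed independently and in parallel, while only the LSTM must run step by step.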

To predict a distribution $P(y_t)$ over outcomes $y_t \in C$ (where $C$ is a discrete, finite set of outcomes) at time step $t$, the outputs $z_t \in \R^{d_z}$ of the sequential model are passed through a linear prediction layer $\hat{y}_t = W_z z_t + b_z$, where $W_z \in \R^{|C| \times d_z}$ and $b_z \in \R^{|C|}$ are learned parameters. Finally, the predicted distribution $P(y_t)$ is computed by taking the softmax of $\hat{y}_t$:
$$P(y_t = c) = \mathrm{softmax}(\hat{y}_t) = \frac{\exp(\hat{y}_{t,c})}{\sum_{c' \in C} \exp(\hat{y}_{t,c'})}$$

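For reference, the prediction layer and softmax can be written directly in NumPy (a sketch; the max subtraction is a standard numerical-stability step, not something specified in the paper):

```python
import numpy as np

def predict_distribution(z_t, W_z, b_z):
    """Map a sequence-model output z_t in R^{d_z} to P(y_t) over |C| outcomes."""
    y_hat = W_z @ z_t + b_z                  # linear prediction layer, shape (|C|,)
    y_hat = y_hat - y_hat.max()              # subtract max for numerical stability
    exp_y = np.exp(y_hat)
    return exp_y / exp_y.sum()               # softmax: P(y_t = c) for each c in C
```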

The success of recent deep models for object recognition [17], [18], [19] suggests that strategically composing many “layers” of non-linear functions can result in powerful models for perceptual problems. For large $T$, the above recurrence indicates that the last few predictions from a recurrent network with $T$ time steps are computed by a very “deep” ($T$ layer) non-linear function, suggesting that the resulting recurrent model may have similar representational power to a $T$ layer deep network. Critically, however, the sequence model’s weights $W$ are reused at every time step, forcing the model to learn generic time step-to-time step dynamics (as opposed to dynamics conditioned on $t$, the sequence index) and preventing the parameter size from growing in proportion to the maximum sequence length.


In most of our experiments, the visual feature transformation $\phi$ corresponds to the activations in some layer of a deep CNN. Using a visual transformation $\phi_V(\cdot)$ which is time-invariant and independent at each time step has the important advantage of making the expensive convolutional inference and training parallelizable over all time steps of the input, facilitating the use of fast contemporary CNN implementations whose efficiency relies on independent batch processing, and end-to-end optimization of the visual and sequential model parameters $V$ and $W$.


We consider three vision problems (activity recognition, image description and video description), each of which instantiates one of the following broad classes of sequential learning tasks:


  1. Sequential input, static output (Figure 3, left): $(x_1, x_2, ..., x_T) \to y$. The visual activity recognition problem can fall under this umbrella, with videos of arbitrary length $T$ as input, but with the goal of predicting a single label like running or jumping drawn from a fixed vocabulary.


  2. Static input, sequential output (Figure 3, middle): $x \to (y_1, y_2, ..., y_T)$. The image captioning problem fits in this category, with a static (non-time-varying) image as input, but a much larger and richer label space consisting of sentences of any length.


  3. Sequential input and output (Figure 3, right): $(x_1, x_2, ..., x_T) \to (y_1, y_2, ..., y_{T'})$. In tasks such as video description, both the visual input and output are time-varying, and in general the number of input and output time steps may differ (i.e., we may have $T \neq T'$). In video description, for example, the number of frames in the video should not constrain the length of (number of words in) the natural language description.

Fig. 3. Task-specific instantiations of our LRCN model for activity recognition, image description, and video description.

In the previously described generic formulation of recurrent models, each instance has $T$ inputs $(x_1, x_2, ..., x_T)$ and $T$ outputs $(y_1, y_2, ..., y_T)$. Note that this formulation does not align cleanly with any of the three problem classes described above: in the first two classes, either the input or output is static, and in the third class, the input length $T$ need not match the output length $T'$. Hence, we describe how we adapt this formulation in our hybrid model to each of the above three problem settings.


With sequential inputs and static outputs (class 1), we take a late-fusion approach to merging the per-time step predictions $(y_1, y_2, ..., y_T)$ into a single prediction $y$ for the full sequence. With static inputs $x$ and sequential outputs (class 2), we simply duplicate the input $x$ at all $T$ time steps: $\forall t \in \{1, 2, ..., T\}: x_t := x$. Finally, for a sequence-to-sequence problem with (in general) different input and output lengths (class 3), we take an “encoder-decoder” approach, as proposed for machine translation by [9], [20]. In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector, and another sequence model, the decoder, unrolls this vector to a sequential output of arbitrary length. Under this type of model, a run of the full system on one instance occurs over $T + T' - 1$ time steps. For the first $T$ time steps, the encoder processes the input $(x_1, x_2, ..., x_T)$, and the decoder is inactive until time step $T$, when the encoder’s output is passed to the decoder, which in turn predicts the first output $y_1$. For the latter $T' - 1$ time steps, the decoder predicts the remainder of the output $(y_2, y_3, ..., y_{T'})$ with the encoder inactive. This encoder-decoder approach, as applied to the video description task, is depicted in Section 6, Figure 5 (left).

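The three adaptations are mostly bookkeeping around the same recurrent core. The sketch below uses a trivial stand-in for the recurrent step (all names are illustrative, not from the paper) just to show the control flow: late fusion for class 1, input duplication for class 2, and the encoder-decoder schedule for class 3.

```python
import numpy as np

def step(x_t, h):                                # dummy recurrent core standing in for an LSTM step f_W
    h = np.tanh(0.5 * h + x_t.mean())
    return h, h                                  # (new hidden state, per-step output z_t)

def class1_late_fusion(x_seq):
    """Sequential input -> static output: average the per-step predictions into one y."""
    h, zs = 0.0, []
    for x_t in x_seq:
        h, z = step(x_t, h)
        zs.append(z)
    return np.mean(zs)

def class2_duplicate_input(x, T):
    """Static input -> sequential output: feed the same x at every time step."""
    h, ys = 0.0, []
    for _ in range(T):
        h, y = step(x, h)
        ys.append(y)
    return ys

def class3_encoder_decoder(x_seq, T_out):
    """Sequential input -> sequential output via an encoder-decoder schedule."""
    h = 0.0
    for x_t in x_seq:                            # encoder steps 1..T; decoder is inactive
        h, _ = step(x_t, h)
    ys = []
    for _ in range(T_out):                       # decoder predicts y_1..y_{T'} from the encoder state
        h, y = step(np.zeros(1), h)              # (the paper folds the first decoder output into step T,
        ys.append(y)                             #  giving T + T' - 1 total steps; separated here for clarity)
    return ys
```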

Under the proposed system, the parameters $(V, W)$ of the model’s visual and sequential components can be jointly optimized by maximizing the likelihood of the ground truth outputs $y_t$ at each time step $t$, conditioned on the input data and labels up to that point $(x_{1:t}, y_{1:t-1})$. In particular, for a training set $D$ of labeled sequences $(x_t, y_t)_{t=1}^T \in D$, we optimize parameters $(V, W)$ to minimize the expected negative log likelihood of a sequence sampled from the training set, $L(V, W, D) = -\frac{1}{|D|} \sum_{(x_t, y_t)_{t=1}^T \in D} \sum_{t=1}^T \log P(y_t \mid x_{1:t}, y_{1:t-1}, V, W)$. One of the most appealing aspects of the described system is the ability to learn the parameters “end-to-end,” such that the parameters $V$ of the visual feature extractor learn to pick out the aspects of the visual input that are relevant to the sequential classification problem. We train our LRCN models using stochastic gradient descent, with backpropagation used to compute the gradient $\nabla_{V,W} L(V, W, \widetilde{D})$ of the objective $L$ with respect to all parameters $(V, W)$ over minibatches $\widetilde{D} \subset D$ sampled from the training dataset $D$.

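A sketch of this objective for a single minibatch, assuming the per-step distributions $P(y_t \mid x_{1:t}, y_{1:t-1}, V, W)$ have already been computed (e.g., with the softmax layer above); in practice the gradient $\nabla_{V,W} L$ comes from backpropagation through the whole network rather than being written by hand.

```python
import numpy as np

def sequence_nll(prob_seq, label_seq):
    """Negative log likelihood of one labeled sequence: -sum_t log P(y_t | x_{1:t}, y_{1:t-1}, V, W)."""
    return -sum(np.log(p[y]) for p, y in zip(prob_seq, label_seq))

def minibatch_loss(batch):
    """Average the per-sequence NLL over a minibatch sampled from the training set D."""
    return sum(sequence_nll(p, y) for p, y in batch) / len(batch)
```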

We next demonstrate the power of end-to-end trainable hybrid convolutional and recurrent networks by exploring three applications: activity recognition, image captioning, and video description.



4 ACTIVITY RECOGNITION (omitted)

5 IMAGE CAPTIONING (omitted)

6 VIDEO DESCRIPTION (omitted)


7 RELATED WORK

We review prior literature related to the three tasks discussed in this work. In addition, we discuss subsequent extensions that combine convolutional and recurrent networks to obtain improved results on activity recognition, image captioning, and video description, as well as on related new tasks such as visual question answering.


7.1 Prior Work

7.1.1 Activity Recognition

State-of-the-art shallow models combine spatio-temporal features along dense trajectories [50] and encode features as bags of words or Fisher vectors for classification. Such shallow features track how low level features change through time but cannot track higher level features. Furthermore, by encoding features as bags of words or Fisher vectors, temporal relationships are lost.


Many deep architectures proposed for activity recognition stack a fixed number of video frames for input to a deep network. [3] propose a fusion convolutional network which fuses layers which correspond to different input frames at various levels of a deep network. [4] proposes a two stream CNN which combines one CNN trained on RGB frames and one CNN trained on a stack of 10 flow frames. When combining RGB and flow by averaging softmax scores, results are comparable to state-of-the-art shallow models on UCF101 [25] and HMDB51 [51]. Results are further improved by using an SVM to fuse RGB and flow as opposed to simply averaging scores. Alternatively, [1] and [2] propose learning deep spatio-temporal features with 3D convolutional neural networks. [2], [52] propose extracting visual and motion features and modeling temporal dependencies with recurrent networks. This architecture most closely resembles our proposed architecture for activity classification, though it differs in two key ways. First, we integrate 2D CNNs that can be pre-trained on large image datasets. Second, we combine the CNN and LSTM into a single model to enable end-to-end fine-tuning.



7.1.2 Image Captioning

Several early works [53], [54], [55], [56] on image captioning combine object and scene recognition with template or tree based approaches to generate captions. Such sentences are typically simple and are easily distinguished from more fluent human generated descriptions. [46], [57] address this by composing new sentences from existing caption fragments which, though more human like, are not necessarily accurate or correct.


More recently, a variety of deep and multi-modal models [27], [29], [30], [58] have been proposed for image and caption retrieval, as well as caption generation. Though some of these models rely on deep convolutional nets for image feature extraction [30], [58], recently researchers have realized the importance of also including temporally deep networks to model text. [29] propose an RNN to map sentences into a multi-modal embedding space. By mapping images and language into the same embedding space, they are able to compare images and descriptions for image and annotation retrieval tasks. [27] propose a model for caption generation that is more similar to the model proposed in this work: predictions for the next word are based on previous words in a sentence and image features. [58] propose an encoder-decoder model for image caption retrieval which relies on both a CNN and LSTM encoder to learn an embedding of image-caption pairs. Their model uses a neural language decoder to enable sentence generation. As evidenced by the rapid growth of image captioning, visual sequence models like LRCN are increasingly important for describing the visual world using natural language.



7.1.3 Video Description

Recent approaches to describing video with natural language have made use of templates, retrieval, or language models [11], [59], [60], [61], [62], [63], [64]. To our knowledge, we present the first application of deep models to the video description task. Most similar to our work is [11], which uses phrase-based SMT [47] to generate a sentence. In Section 6 we show that phrase-based SMT can be replaced with LSTMs for video description as has been shown previously for language translation [9], [65].



7.2 Contemporaneous and Subsequent Work

Similar work in activity recognition and visual description was conducted contemporaneously with our work, and a variety of subsequent work has combined convolutional and recurrent networks to both improve upon our results and achieve exciting results on other sequential visual tasks.



7.2.1 Activity Recognition

Contemporaneous with our work, [66] train a network which combines CNNs and LSTMs for activity recognition. Because activity recognition datasets like UCF101 are relatively small in comparison to image recognition datasets, [66] pretrain their network using the Sports-1M [3] dataset which includes over a million videos mined from YouTube. By training a much larger network (four stacked LSTMs) and pretraining on a large video dataset, [66] achieve 88.6% on the UCF101 dataset.


[67] also combines a convolutional network with an LSTM to predict multiple activities per frame. Unlike LRCN, [67] focuses on frame-level (rather than video-level) predictions, which allows their system to label multiple activities that occur in different temporal locations of a video clip. Like we show for activity recognition, [67] demonstrates that including temporal information improves upon a single frame baseline. Additionally, [67] employ an attention mechanism to further improve results.



7.2.2 Image Captioning

[45] and [38] also propose models which combine a CNN with a recurrent network for image captioning. Though similar to LRCN, the architectures proposed in [45] and [38] differ in how image features are input into the sequence model. In contrast to our system, in which image features are input at each time step, [45] and [38] only input image features at the first time step. Furthermore, they do not explore a “factored” representation (Figure 4). Subsequent work [44] has proposed attention to focus on which portion of the image is observed during sequence generation. By including attention, [44] aim to visually focus on the current word generated by the model. Other works aim to address specific limitations of captioning models based on combining convolutional and recurrent architectures. For example, methods have been proposed to integrate new vocabulary with limited [40] or no [68] examples of images and corresponding captions.



7.2.3 Video Description

In this work, we rely on intermediate features for video description, but end-to-end trainable models for visual captioning have since been proposed. [69] propose creating a video feature by pooling high level CNN features across frames. The video feature is then used to generate descriptions in the same way an image is used to generate a description in LRCN. Though achieving good results, by pooling CNN features, temporal information from the video is lost. Consequently, [70] propose an LSTM to encode video frames into a fixed length vector before sentence generation with an LSTM. Using an end-to-end trainable “sequence-to-sequence” model which can exploit temporal structure in video, [70] improve upon results for video description. [71] propose a similar model, adding a temporal attention mechanism which weights video frames differently when generating each word in a sentence.



7.2.4 Visual Grounding

[72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase. In order to reconstruct the phrase, the model must learn to visually ground the input phrase to the appropriate location in the image.



7.2.5 Natural Language Object Retrieval

In this work, we present methods for image retrieval based on a natural language description. In contrast, [73] use a model based on LRCN for object retrieval, which returns the bounding box around a given object as opposed to an entire image. In order to adapt LRCN to the task of object retrieval, [73] include local convolutional features which are extracted from object proposals and the spatial configuration of object proposals in addition to a global image feature. By including local features, [73] effectively adapt LRCN for object retrieval.



8 CONCLUSION

We’ve presented LRCN, a class of models that is both spatially and temporally deep, and flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs. Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve upon previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.


As the field of computer vision matures beyond tasks with static input and predictions, deep sequence modeling tools like LRCN are increasingly central to vision systems for problems with sequential structure. The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to handle with little input preprocessing and no hand-designed features.


Fig. 6. Image description: images with corresponding captions generated by our finetuned LRCN model. These are images 1-12 of our randomly chosen validation set from COCO 2014 [33]. We used beam search with a beam size of 5 to generate the sentences, and display the top (highest likelihood) result above.


