反思視頻摘要的評估標準

由本人翻譯自原文鏈接,爲追求順口和易於理解並未嚴格按照原文翻譯,爲此也在翻譯下方提供了原文。

摘要 Abstract

視頻摘要是這樣一種技術:在保留主要故事/內容的同時,對原始視頻進行簡短概述。隨着可用素材(視頻)數量的迅猛增長,自動化這一過程大有可爲。公共基準數據集促進了近階段自動化技術的進展,使我們能更容易也更公平地比較不同的方法。目前已建立的評估協議,是將機器生成的摘要與數據集提供的一組參考摘要進行比較。在本文中,我們將使用兩個流行的基準數據集對該管道進行深入評估。我們驚奇地發現,隨機生成的摘要居然能取得與最先進摘要算法相當甚至更好的性能。在某些情況下,隨機摘要的表現甚至優於留一法實驗中人工生成的摘要。

Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There exists a substantial interest in automatizing this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation protocol is to compare the generated summary with respect to a set of reference summaries provided by the dataset. In this paper, we will provide in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve comparable or better performance to the state-of-the-art. In some cases, the random summaries outperform even the human generated summaries in leave-one-out experiments.

此外,結果表明,視頻分割對性能指標的影響最爲顯著。而通常來說,視頻分割是提取視頻摘要的固定預處理方法。基於我們的觀察,我們提出了評估重要性得分的替代方法,以及估計得分和人工標註之間相關性的直觀可視化方法。

Moreover, it turns out that the video segmentation, which is often considered as a fixed pre-processing method, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of correlation between the estimated scoring and human annotations.

1.介紹 Introduction

隨手可得的視頻素材數量激增,使得對技術的需求也與日俱增,這樣才能使用戶能夠快速瀏覽和觀看視頻。一種補救方法是自動視頻摘要,其用來生成一個簡短的視頻概覽,保留原始視頻中最重要的內容。例如,體育賽事的原始畫面可以壓縮成幾分鐘的摘要,只保留重要事件,如進球、點球等。

The tremendous growth of the available video material has escalated the demand for techniques that enable users to quickly browse and watch videos. One remedy is provided by the automatic video summarization, where the aim is to produce a short video skim that preserve the most important content of the original video. For instance, the original footage from a sport event could be compressed into a few minute summary illustrating the most important events such as goals, penalty kicks, etc.

圖1: 一個常用的視頻摘要管道和我們的隨機化測試的說明。我們利用隨機摘要來驗證當前的評估框架。
Figure 1. An illustration of the commonly used video summarization pipeline and our randomization test. We utilize random summaries to validate the current evaluation frameworks.

各種文獻中已經提出了許多自動摘要方法。如圖1所示,最新方法的範式由視頻分割、重要性評分預測和視頻片段選擇組成。該流程中最具挑戰性的部分是重要性評分預測,其任務是突出顯示對視頻內容最重要的部分。多種因素都會影響視頻各部分的重要性,一旦採用的重要性標準不同,同一個視頻就可能得到不同的摘要。事實上,之前的研究已經提出了多種重要性標準,如視覺趣味性[2,3]、緊湊性(即冗餘較小)[28]和多樣性[25,29]。

Numerous automatic summarization methods have been proposed in the literature. The most recent methods follow a paradigm that consists of a video segmentation, importance score prediction, and video segment selection as illustrated in Figure 1. The most challenging part of this pipeline is the importance score prediction, where the task is to highlight the parts that are most important for the video content. Various factors affect importance of video parts, and different video summaries are possible for a single video given a different criterion of importance. In fact, previous works have proposed a variety of importance criteria, such as visual interestingness[2, 3], compactness(i.e., smaller redundancy) [28], and diversity [25, 29].

儘管人們對自動視頻摘要做了大量的研究,但仍不知道如何評估生成的摘要的合理性。

  • 一個直接但令人信服的方法是對生成的摘要進行主觀評價;然而,收集人類的反應代價昂貴,而且由於人的主觀性,實驗結果基本無法重現;
  • 另一種方法,是將生成的視頻摘要與一組由人類標註員準備的固定參考摘要進行比較。爲此,會邀請人類標註員手工創建視頻摘要,然後這些視頻摘要被視爲標準答案(ground truth)。該方法的優點是參考摘要可以重用,即不同的視頻摘要方法可以在不反覆添加人工標註的情況下進行評估,實驗可以重現。

Despite the extensive efforts toward automatic video summarization, the evaluation of the generated summaries is still an unsolved problem. A straightforward but yet convincing approach would be to utilise a subjective evaluation; however, collecting human responses is expensive, and reproduction of the result is almost impossible due to the subjectivity. Another approach is to compare generated video summaries to a set of fixed reference summaries prepared by human annotators. To this end, the human annotators are asked to create video summaries, which are then treated as ground truth. The advantage of this approach is the reusability of the reference summaries, i.e., different video summarization methods can be evaluated without additional annotations and the experiments can be reproduced.

最常用的數據集是SumMe[2]和TVSum[18],它們都用於以參考爲基礎的評估。這些數據集提供了一組視頻,並爲每個原始視頻提供了多位標註者生成的參考摘要(或重要性分數)。兩個數據集使用的基本評估方法,是用F1評分(F1 score)來衡量機器生成的摘要與人工參考摘要之間的一致性。SumMe和TVSum自問世以來,在近期的視頻摘要文獻中被廣泛採用[3,12,21,25,26,27,29]。然而,這種基於參考摘要的評估方法是否真的有效,此前並沒有人研究過。

The most popular datasets used for reference based evaluations are SumMe[2] and TVSum[18]. These datasets provide a set of videos as well as multiple human generated reference summaries (or importance scores) for each original video. The basic evaluation approach, used with both datasets, is to measure the agreement between the generated summary and the reference summaries using F1 score. Since their introduction, SumMe and TVSum have been widely adopted in the recent video summarization literature [3, 12, 21, 25, 26, 27, 29]. Nevertheless, the validity of reference summary-based evaluation has not been previously investigated.

本文利用SumMe[2]和TVSum[18]數據集,對當前基於參考摘要的評估框架進行了深入研究。我們將首先審查該框架,然後應用隨機化測試來評估結果的質量。我們提出的隨機化測試基於隨機重要性評分和隨機視頻分割來生成視頻摘要。這樣生成的摘要提供了一個完全靠運氣就能達到的基線分數。

This paper delves deeper into the current reference based evaluation framework using SumMe [2] and TVSum [18] datasets. We will first review the framework and then apply a randomization test to assess the quality of the results. The proposed randomization test generates video summaries based on random importance scores and random video segmentation. Such summaries provide a baseline score that is achievable by chance.

圖2:比較兩種最近的方法與我們的隨機方法(第4節)所創建的摘要。藍色的線表示片段級(segment level)重要性得分隨時間(幀)的變化。橙色區域表示爲最終摘要選擇的幀。三種方法都使用相同的KTS[14]分割邊界。有趣的是,儘管重要性得分明顯不同,但所有方法(包括隨機方法)產生了非常相似的輸出。
Figure 2. Comparison of summaries created by two recent methods and our randomized method (Section 4). The blue line shows the segment level importance scores with respect to time (frames). The orange areas indicate the frames selected for the final summary. All of the three methods use the same segment boundaries by KTS [14]. Interestingly, all methods (including the random one) produce very similar outputs despite clear differences in the importance scores.

圖2說明了我們工作的一個主要發現。結果發現,“隨機方法”完全無視視頻內容進行重要分數預測,但得出的摘要與最先進的方法幾乎相同。更深入的分析表明,雖然重要性分數有差異,但在彙總以得到最終摘要時,這些差異被忽略了。隨機化測試揭示了當前視頻摘要評估方案的重大缺陷,這促使我們提出了一個新的框架用於評估重要性排名。

Figure 2 illustrates one of the main findings of our work. It turned out that the random method produces summaries that are almost identical to the state-of-the-art despite the fact that it is not using the video content at all for importance score prediction. Deeper analysis shows that while there are differences in the importance scores, they are ignored when assembling the final summary. The randomization test revealed critical issues in the current video summarization evaluation protocol, which motivated us to propose a new framework for assessing the importance rankings.

本論文的主要貢獻如下:

The main contributions of this paper are as follows:

  • 我們評估了當前基於參考摘要的評估框架的有效性,並揭示了這樣的事實,即一種隨機方法也能夠達到與當前最先進的技術相似的性能分數。

We assess the validity of the current reference summary-based evaluation framework and reveal that a random method is able to reach similar performance scores as the current state-of-the-art.

  • 我們證明了廣泛使用的F1評分主要是由視頻片段長度的分佈決定的。我們的分析爲這一現象提供了一個簡單的解釋。

We demonstrate that the widely used F1 score is mostly determined by the distribution of video segment lengths. Our analysis provides a simple explanation for this phenomenon.

  • 我們演示了使用預測排序與人類標註員排序之間的相關性來評估重要性排名。此外,我們還提出了幾種可視化方法,讓我們能夠洞察預測得分與隨機得分之間的關係。

We demonstrate evaluating the importance rankings using correlation between the predicted ordering and the ordering by human annotators. Moreover, we propose several visualisations that give insight to the predicted scoring versus random scores.

2. 相關工作 Related Work

2.1. 視頻摘要 Video Summarization

文獻中提出了一系列不同的視頻摘要方法。有一組作品是通過測量視覺趣味性[2]來檢測重要鏡頭,如視覺特徵的動態性[8],視覺顯著性[11]。Gygli等人[3]結合了多種屬性,包括顯著性、美學和幀畫面中是否有人。

A diverse set of video summarization approaches have been presented in the literature. One group of works aim at detecting important shots by measuring the visual interestingness [2], such as dynamics of visual features [8], and visual saliency [11]. Gygli et al. [3] combined multiple properties including saliency, aesthetics, and presence of people in the frames.

另一組方法通過丟棄冗餘鏡頭[28]來實現緊湊性。最大化輸出視頻的代表性和多樣性也是近期工作中被廣泛使用的標準[1,14,25]。這些方法都基於這樣一個假設:一個好的摘要應該具有多樣化的內容,同時採樣的鏡頭還要能解釋原視頻中的事件。

Another group of methods aims at compactness by discarding redundant shots [28]. Maximization of representativeness and diversity in the output video are also widely used criteria in the recent works [1, 14, 25]. These methods are based on the assumption that a good summary should have diverse content while the sampled shots explain the events in the original video.

最近,人們提出了基於LSTM的深度神經網絡模型,直接預測人類標註者給出的重要性評分[26]。該模型還通過行列式點過程(determinantal point process)[7]進行了擴展,以保證片段選擇的多樣性。最後,Zhou等人[29]使用強化學習獲得一種幀選擇策略,以最大化生成摘要的多樣性和代表性。

Recently, LSTM-based deep neural network models have been proposed to directly predict the importance scores given by the human annotators [26]. The model is also extended with a determinantal point process [7] to ensure diverse segment selection. Finally, Zhou et al. [29] applied reinforcement learning to obtain a policy for the frame selection in order to maximize the diversity and representativeness of the generated summary.

儘管這些工作使用不同的重要性標準,但其中許多都使用類似的處理管道。

  • 首先,對原始視頻中的每一幀進行重要度評分;
  • 其次,將得到的視頻分割成短片段。
  • 最後,通過在揹包(knapsack)約束下最大化重要性分數來選擇視頻片段子集,生成輸出摘要。

Although these works use various importance criteria, many of them employ a similar processing pipeline. Firstly, the importance scores are produced for each frame in the original video. Secondly, the obtained video is divided into short segments. Finally, the output summary is generated by selecting a subset of video segments by maximising the importance scores with the knapsack constraint.
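
譯者注:爲便於理解上面這個「幀級打分 → 分割 → 揹包約束下選段」的流程,下面給出一段示意性的 Python 代碼(非論文官方實現;函數名、平均池化等細節均爲譯者假設),其中片段選擇用 0/1 揹包的動態規劃求解,容量取原視頻長度的 15%。

```python
import numpy as np

def select_segments(frame_scores, boundaries, budget_ratio=0.15):
    """示意:在揹包約束下選擇片段子集(非論文官方實現)。
    frame_scores: 長度爲 N 的幀級重要性得分數組
    boundaries:   片段邊界,如 [0, 60, 155, ..., N]
    budget_ratio: 摘要長度佔原視頻的比例(SumMe 中約束爲 15%)
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    seg_lens = [e - s for s, e in zip(boundaries[:-1], boundaries[1:])]
    # 片段級得分:對片段內幀級得分取平均(取和還是取平均的影響見 4.3 節討論)
    seg_scores = [frame_scores[s:e].mean() for s, e in zip(boundaries[:-1], boundaries[1:])]
    capacity = int(budget_ratio * boundaries[-1])

    # 0/1 揹包動態規劃:dp[c] 表示容量爲 c 時可達到的最大總得分
    dp = np.zeros(capacity + 1)
    keep = np.zeros((len(seg_lens), capacity + 1), dtype=bool)
    for i, (l, v) in enumerate(zip(seg_lens, seg_scores)):
        for c in range(capacity, l - 1, -1):
            if dp[c - l] + v > dp[c]:
                dp[c] = dp[c - l] + v
                keep[i, c] = True
    # 回溯得到被選中的片段下標
    selected, c = [], capacity
    for i in range(len(seg_lens) - 1, -1, -1):
        if keep[i, c]:
            selected.append(i)
            c -= seg_lens[i]
    return sorted(selected)
```

這段示意代碼把片段平均分當作揹包中的「價值」、片段幀數當作「重量」;具體方法在細節上可能有所不同。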

2.2. 視頻摘要評估 Video Summary Evaluation

視頻摘要的評估是一項具有挑戰性的任務。這主要是因爲質量標準具有主觀性:不同觀衆之間、甚至同一觀衆在不同時間點的判斷都會有所不同。用於評估的視頻和標註數量有限,進一步放大了這種模糊性問題。

The evaluation of a video summary is a challenging task. This is mainly due to subjective nature of the quality criterion that varies from viewer to viewer and from one time instant to another. The limited number of evaluation videos and annotations further magnify this ambiguity problem.

大多數早期工作[10,11,19]以及一些近期工作[22]採用用戶研究,即讓觀衆對專門爲各自工作[10,15,23]準備的視頻摘要進行主觀打分。這種方法的關鍵缺點是成本高且可重複性差。也就是說,即使讓同一批觀衆重新評價同樣的視頻,也無法得到相同的評價結果。

Most early works [10, 11, 19] as well as some recent works [22] employ user studies, in which viewers subjectively score the quality of output video summaries prepared solely for the respective works [10, 15, 23]. The critical shortcoming in such approach is the related cost and reproducibility. That is, one cannot obtain the same evaluation results, even if the same set of viewers would re-evaluate the same videos.

許多最近的工作反而通過與人工參考摘要進行比較來評價它們生成的摘要。

  • Khosla等人[5]提出,在參考摘要和生成的摘要中,使用關鍵幀之間的像素級距離。
  • Lee等人[9]使用包含感興趣對象的幀數作爲相似性度量。
  • Gong等人[1]計算由人類標註員選擇的關鍵幀的精確度和召回率。
  • Yeung等人[24]提出了不同的方法,其基於文本描述評估摘要的語義相似度,爲此,他們生成了一個以自我爲中心的長視頻數據集,其中的片段用文本描述進行了註釋。該框架主要使用場景是:基於用戶查詢的視頻摘要評估[13,16]。

最近,計算人工參考摘要和機器生成摘要之間的重疊,已經成爲視頻摘要評價的標準框架[2,3,14,17,18,28]。

Many recent works instead evaluate their summaries by comparing them to reference summaries. Khosla et al. [5] proposed to use the pixel-level distance between keyframes in reference and generated summaries. Lee et al. [9] use number of frames that contain objects of interest as a similarity measure. Gong et al. [1] compute precision and recall scores over keyframes selected by human annotators. Yeung et al. [24] propose a different approach and evaluate the semantic similarity of the summaries based on textual descriptions. For this, they generated a dataset with long egocentric videos for which the segments are annotated with textual descriptions. This framework is mainly used to evaluate video summaries based on user queries [13, 16]. More recently, computing overlap between reference and generated summaries has become the standard framework for video summary evaluation [2, 3, 14, 17, 18, 28].

表1:近期工作中報告的SumMe和TVSum基準上的F1值。Average (Avr)表示所有參考摘要上F1得分的平均值,maximum (Max)表示參考摘要中的最高F1得分[3]。此外,我們展示了隨機測試和人工標註(留一法測試)的F1值。可以注意到,隨機摘要的結果可以與最先進的技術媲美,甚至可以與人工標註媲美。
Table 1. The F1 measures for SumMe and TVSum benchmarks as reported in recent works. Average (Avr) denotes the average of F1 scores over all reference summaries and maximum (Max) denotes the highest F1 score within the reference summaries [3]. In addition, we show the F1 values for our randomized test and human annotations (leave-one-out test). It can be noted that random summaries achieve comparable results to the state-of-the-art and even to human annotations.

本文研究了將生成的摘要與一組人工標註的參考摘要進行比較的評估框架。目前,有兩個公共數據集有助於這種類型的評估。SumMe[2]和TVSum[18]數據集提供手動創建的參考摘要,是目前最流行的評估基準。SumMe數據集包含個人視頻,以及從15-18位標註員處收集的相應參考摘要。TVSum數據集則爲YouTube視頻提供了鏡頭級別的重要性得分。大多數文獻使用機器生成摘要與人工參考摘要之間的F1測度作爲性能指標。表1顯示了兩個數據集的報告分數。SumMe數據集(每個視頻大約有15個不同的人工參考摘要)有兩種聚合F1分數的方式:一種是計算所有人工參考摘要上F1度量的平均值,另一種是使用最大分數。

This paper investigates the evaluation framework where generated summaries are compared to a set of human annotated references. Currently, there are two public datasets that facilitate this type of evaluation. SumMe [2] and TVSum [18] datasets provide manually created reference summaries and are currently the most popular evaluation benchmarks. The SumMe dataset contains personal videos and the corresponding reference summaries collected from 15–18 annotators. The TVSum dataset provides shot-level importance scores for YouTube videos. Most of the literature uses the F1 measure between generated summaries and reference summaries as a performance indicator. Table 1 shows reported scores for both datasets. The SumMe dataset, which has around 15 different reference summaries, has two possible ways for aggregating the F1 scores: One is to compute an average of F1 measures over all reference summaries, and the other is to use the maximum score.

3. 目前的評估框架 Current evaluation framework

3.1. SumMe

SumMe是一個視頻摘要數據集,包含從YouTube獲得的25條個人視頻。這些視頻是未經編輯或最低限度編輯的。該數據集爲每個視頻提供了15-18個人工參考摘要。人工標註者單獨製作參考摘要,使每個摘要的長度小於原視頻長度的15%。爲了進行評估,機器生成的摘要在長度上也要受到同樣的限制。

SumMe is a video summarization dataset that contains 25 personal videos obtained from the YouTube. The videos are unedited or minimally edited. The dataset provides 15–18 reference summaries for each video. Human annotators individually made the reference summaries so that the length of each summary is less than 15% of the original video length. For evaluation, generated summaries should be subject to the same constraint on the summary length.

3.2. TVSum

TVSum包含50個YouTube視頻,每個視頻都有一個標題和一個類別標籤作爲元數據。TVSum數據集不提供人工參考摘要,而是爲每個視頻的每兩秒片段提供人工標註的重要性評分。爲了進行評估,根據這些重要性分數,按照以下步驟生成具有預定義長度的參考摘要:

  • 首先,將視頻劃分爲短片段,所用的分割與待評估的機器摘要相同。
  • 然後,對視頻片段內的重要性得分進行平均,得到一個片段級的重要性得分。
  • 最後,通過找到一個片段子集,使摘要中的總重要性得分最大化,生成一個類參考摘要。

這種方法的優點是能夠生成所需長度的摘要。

TVSum contains 50 YouTube videos, each of which has a title and a category label as metadata. Instead of providing reference summaries, the TVSum dataset provides human annotated importance scores for every two second of each video. For evaluation, the reference summaries, with a predefined length, are generated from these importance scores using the following procedure: Firstly, videos are divided into short video segments, which are the same as in the generated summary. Then, the importance scores within a video segment are averaged to obtain a segment-level importance score. Finally, a reference summary is generated by finding a subset of segments that maximizes the total importance score in the summary. The advantage of this approach is the ability to generate summaries with desired length.

3.3. 評價指標 Evaluation measure

最常用的評價方法是計算預測摘要和人工參考摘要之間的F1測度(F1 measure)。設 \(y_i\in\{0,1\}\) 表示第i幀是否被機器選入摘要的標籤(即,如果第i幀被選擇,則 \(y_i = 1\),否則爲0)。對於人工參考摘要,給定類似的標籤 \(y_i^*\),則F1得分定義爲

\(F1=\frac{2 \times PRE \times REC}{PRE+REC}\) (1)

其中

\(PRE=\frac{\sum_{i=1}^N y_i \cdot y_i^*}{\sum_{i=1}^N y_i}\) and \(REC=\frac{\sum_{i=1}^N y_i \cdot y_i^*}{\sum_{i=1}^N y_i^*}\) (2)

是幀級的精確度和召回率得分。N爲原始視頻的總幀數。

譯者注: 上面的公式可以簡化爲下面的公式,\(S_{user}\)是用戶標註的高光部分,\(S_{machine}\)是機器標註的高光部分,而\(S_{user}\bigcap S_{machine}\)則是用戶和機器標註的高光重合部分。

\(PRE=\frac{|S_{user}\bigcap S_{machine}|}{|S_{machine}|}\) and \(REC=\frac{|S_{user}\bigcap S_{machine}|}{|S_{user}|}\) (2)

The most common evaluation approach is to compute F1 measure between the predicted and the reference summaries.
Let \(y_i\in\{0,1\}\) denote a label indicating which frames from the original video are selected to the summary (i.e. \(y_i\) = 1 if the i-th frame is selected and 0 otherwise). Given a similar label \(y_i^*\) for the reference summary, the F1 score is defined as

\(F1=\frac{2 \cdot PRE \cdot REC}{PRE+REC}\) (1)

where

\(PRE=\frac{\sum_{i=1}^N y_i \cdot y_i^*}{\sum_{i=1}^N y_i}\) and \(REC=\frac{\sum_{i=1}^N y_i \cdot y_i^*}{\sum_{i=1}^N y_i^*}\) (2)

are the frame level precision and recall scores. N denotes the total number of frames in the original video.

在實驗中,分別計算每個人工參考摘要的F1得分,然後對每個視頻按兩種方式之一進行彙總:取平均值或取最大值。前者意味着生成的摘要應包含一致性最高的片段;後者則認爲所有人類標註員給出的重要性分數都是合理的,因此只要機器生成的摘要與至少一個人工參考摘要匹配,就應該得到較高的分數。

In the experiments, the F1 score is computed for each reference summary separately and the scores are summarised either by averaging or selecting the maximum for each video. The former approach implies that the generated summary should include segments with largest number of agreement, while the latter argue that all human annotators provided reasonable importance scores and thus the generated summary should have high score if it matches at least one of the reference summaries.
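
譯者注:式(1)(2)以及兩種彙總方式可以寫成如下幾行 Python(示意代碼,變量名爲譯者自擬):

```python
import numpy as np

def f1_score(y_machine, y_ref):
    """y_machine, y_ref: 長度爲 N 的 0/1 數組,1 表示該幀被選入摘要。"""
    y_machine, y_ref = np.asarray(y_machine), np.asarray(y_ref)
    overlap = np.sum(y_machine * y_ref)            # 機器與參考摘要重合的幀數
    if overlap == 0:
        return 0.0
    precision = overlap / np.sum(y_machine)        # 式 (2) 中的 PRE
    recall = overlap / np.sum(y_ref)               # 式 (2) 中的 REC
    return 2 * precision * recall / (precision + recall)   # 式 (1)

def aggregate_f1(y_machine, reference_list, mode="avg"):
    """reference_list: 多個人工參考摘要的 0/1 數組;mode 取 'avg'(平均)或 'max'(最大)。"""
    scores = [f1_score(y_machine, r) for r in reference_list]
    return float(np.mean(scores)) if mode == "avg" else float(np.max(scores))
```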

4. 隨機測試 Randomization test

通常,視頻摘要管道由三個部分組成:重要性分數估計、視頻分割和鏡頭選擇(圖1)。我們設計了一個隨機化測試來評估每個部分對最終評估分數的貢獻。在這些實驗中,我們利用隨機重要性分數和隨機視頻分割邊界,生成與視頻內容無關的視頻摘要。其中,每一幀的重要性得分獨立地從[0,1]均勻分佈中採樣。當需要時,通過對相應的幀級隨機分數做平均池化來產生片段級分數。對於視頻分割,我們使用下面這些選項(各分割方式的採樣過程可參見列表後的示意代碼):

Commonly video summarization pipeline consists of three components; importance score estimation, video segmentation, and shot selection (Figure 1). We devise a randomization test to evaluate the contribution of each part to the final evaluation score. In these experiments we generate video summaries that are independent of video content by utilising random importance scores and random video segment boundaries. Specifically, the importance score for each frame is drawn independently from an uniform distribution [0, 1]. When needed, the segment-level scores are produced by average pooling the corresponding frame-level random scores. For video segmentation, we utilise the options defined below.

  • 均勻分割將視頻分割爲固定時長的片段。我們在實驗中使用了60幀,這大致相當於2秒(SumMe和TVSum數據集的幀率分別爲30 fps和25 fps)。

Uniform segmentation divides the video into segments of constant duration. We used 60 frames in our experiments, which roughly corresponds to 2 seconds (the frame rates in SumMe and TVSum datasets are 30 fps and 25 fps, respectively).

  • 單峯分割從單峯分佈中採樣每個片段的幀數。我們假設相鄰鏡頭邊界(shot boundary)之間的幀數服從事件率 \(\lambda = 60\) 的泊松分佈。

One-peak segmentation samples the number of frames in each segment from an unimodal distribution. We assume that the number of frames between adjacent shot boundaries follow the Poisson distribution with event rate \(\lambda = 60\).

  • 雙峯分割類似於單峯版本,但利用雙峯分佈,即兩個泊松分佈的混合,其事件率分別是 \(\lambda = 30\) 和 \(\lambda = 90\)。對於抽樣,我們以相等概率從兩個泊松分佈中隨機選擇一個,然後對幀數進行抽樣。因此,一個視頻被分爲較長和較短的片段,但一個片段中的期望幀數仍是60幀。除了這些完全隨機的方法,我們還評估一種常用的分割方法及其結合隨機分數的變種。

Two-peak segmentation is similar to one-peak version, but utilises bimodal distribution, i.e., a mixture of two Poisson distributions, whose event rates are \(\lambda = 30\) and \(\lambda = 90\), respectively. For sampling, we randomly choose one of the two Poisson distributions with the equal probability and then sample the number of frames. Consequently, a video is segmented into both longer and shorter segments, yet the expected number of frames in a segment is 60 frames. In addition to the completely random methods, we assess one commonly used segmentation approach and its variation in conjunction with the random scores.

  • 基於核的時域分割(KTS)[14]基於視頻的視覺內容,是近期視頻摘要文獻中應用最廣泛的方法(表1)。KTS通過檢測視覺特徵的變化產生分割邊界。如果視覺特徵沒有發生顯著變化,視頻片段往往會很長。

Kernel temporal segmentation (KTS) [14] is based on the visual content of a video and is the most widely used method in the recent video summarization literature (Table 1). KTS produces segment boundaries by detecting changes in visual features. A video segment tends to be long if visual features do not change considerably.

  • 隨機化KTS首先用KTS分割視頻,然後打亂分割順序;因此,段長分佈與KTS完全相同,但分割邊界與視覺特徵不同步。

Randomized KTS first segments the video with KTS and then shuffles the segment ordering; therefore, the distribution of segment lengths is exactly the same as KTS’s, but the segment boundaries are not synchronized with the visual features.
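
譯者注:下面用 Python 粗略演示隨機重要性得分以及上述幾種分割方式的採樣過程(示意代碼;λ 等參數取自正文,函數名爲譯者自擬)。KTS 本身依賴視覺特徵,這裏只演示「隨機化 KTS」對片段長度的洗牌。

```python
import numpy as np

rng = np.random.default_rng(0)

def random_scores(n_frames):
    """每幀的重要性得分獨立地從 [0, 1] 均勻分佈中採樣。"""
    return rng.uniform(0.0, 1.0, size=n_frames)

def lengths_to_boundaries(lengths, n_frames):
    """把片段長度序列裁剪成覆蓋整段視頻的邊界 [0, ..., n_frames]。"""
    bounds = np.cumsum(lengths)
    bounds = bounds[bounds < n_frames]
    return np.concatenate([[0], bounds, [n_frames]])

def uniform_segmentation(n_frames, seg_len=60):        # 均勻分割:固定 60 幀(約 2 秒)
    return lengths_to_boundaries([seg_len] * (n_frames // seg_len + 1), n_frames)

def one_peak_segmentation(n_frames, lam=60):           # 單峯:片段長度 ~ Poisson(60)
    lengths = rng.poisson(lam, size=n_frames // 10 + 1) + 1   # +1 僅爲避免零長度片段
    return lengths_to_boundaries(lengths, n_frames)

def two_peak_segmentation(n_frames, lams=(30, 90)):    # 雙峯:等概率混合 Poisson(30)/Poisson(90)
    chosen = rng.choice(lams, size=n_frames // 10 + 1)
    lengths = rng.poisson(chosen) + 1
    return lengths_to_boundaries(lengths, n_frames)

def randomized_kts(kts_boundaries):
    """隨機化 KTS:保持 KTS 的片段長度分佈不變,只打亂片段順序。"""
    lengths = np.diff(np.asarray(kts_boundaries))
    rng.shuffle(lengths)
    return np.concatenate([[0], np.cumsum(lengths)])
```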

這些隨機(和部分隨機)摘要得到的F1分數,可以作爲完全靠運氣就能達到的基線。合理的評估框架應該爲能產生合理重要性分數的方法打出更高的分數。此外,人們會期望人類生成的標準答案(ground truth)摘要在留一法實驗中獲得最高分。

F1 scores obtained by these randomized (and partially randomized) summaries serve as a baseline that can be achieved completely by chance. Reasonable evaluation framework should produce higher scores for methods that are producing sensible importance scores. Furthermore, one would expect that human generated ground truth summaries should produce top scores in leave one out experiments.

4.1. 對SumMe數據集的分析 Analysis on the SumMe dataset

圖3:F1爲SumMe的不同分割和重要性分數組合。淺藍色條表示隨機摘要,深藍條表示人工創建的參考摘要的分數(留一法測試)。紫色條表示不同分割方法下DR-DSN重要性評分。左:F1平均分數相對於參考摘要的平均值。右:最大分數的平均值。
Figure 3. F1 scores for different segmentation and importance score combinations for SumMe. Light blue bars refer to random summaries and dark blue bars indicate scores of manually created reference summaries (leave-one-out test). Purple bars show the scores for DR-DSN importance scoring with different segmentation methods. Left: the average of mean F1 scores over reference summaries. Right: the average of the maximum scores.

圖3顯示了使用我們隨機化方法的不同版本(見上一節)所獲得的F1分數(平均值和最大值)。我們針對每種隨機設置進行了100次試驗,黑色誤差條表示95%置信區間。此外,同一張圖中還包含每種隨機分割方法對應的F1分數,但使用的是最近發佈的方法DR-DSN[29]的幀級重要性分數。人工參考的性能是通過留一法得到的,此時最終結果通過對每個參考摘要獲得的F1分數(平均或最大)再取平均來計算。

Figure 3 displays the F1 scores (average and maximum) obtained with different versions of our randomized method (see previous section). We performed 100 trials for every random setting and the black bar is the 95% confidence interval. In addition, the same figure contains the corresponding F1 scores for each random segmentation method, but using frame level importance scores from one recently published methods DR-DSN [29]. The reference performance is obtained using human created reference summaries in leave-one-out scheme. In this case, the final result is calculated by averaging the F1 scores (avg or max) obtained for each reference summary.

圖4:最近報道的SumMe中使用KTS分割方法的F1分數。採用KTS分割的隨機摘要的平均得分用淺藍色虛線表示。
Figure 4. Recently reported F1 scores for methods using KTS segmentation in SumMe. The average score for random summaries with KTS segmentation is represented by a light blue dashed line.

有趣的是,我們觀察到性能明顯由分割方法決定,而重要性評分的影響很小(如果有的話)。此外,人類性能與表現最好的自動方法之間的差異,在量級上與各分割方法之間的差異相近。圖4展示了SumMe數據集上最新的結果。令人驚訝的是,使用隨機重要性分數的KTS分割獲得了與已發表的最好方法相當的性能。第4.3節對此現象提供了可能的解釋。

Interestingly, we observe that the performance is clearly dictated by the segmentation method and there is small (if any) impact on the importance scoring. Moreover, the difference between human performance and the best performing automatic method is similar in magnitude to the differences between the segmentation approaches. Figure 4 illustrates the recent state-of-the-art results for SumMe dataset. Surprisingly, KTS segmentation with random importance scores obtains comparable performance to the best published methods. Section 4.3 provides possible explanations for this phenomenon.

4.1.1 人對SumMe的評價 Human Evaluation on SumMe

我們進行了人工評估,以比較SumMe數據集上的摘要。受試者比較兩個視頻摘要,並判斷哪一個更好地總結了原始視頻。在第一個實驗中,我們要求受試者對使用隨機重要性評分和DR-DSN評分生成的視頻摘要進行評分,兩種方法都使用KTS分割。總體而言,隨機評分略高於DR-DSN,然而46%的回答認爲它們同樣好(或同樣差)。這一結果與第4.1節的觀察一致,即重要性評分幾乎不會影響SumMe數據集上的評估分數。我們還用隨機重要性評分比較了KTS分割和均勻分割。結果是,對於記錄長時間活動的視頻,如參觀自由女神像、水肺潛水等,受試者更傾向於均勻分割;另一方面,KTS更適合於有顯著事件或活動的視頻。對於這類視頻,重要部分的歧義性較小,因此基於機器生成摘要與人工參考摘要一致性的F1分數可以更高。人工評估的詳細結果見補充材料。

We conducted human evaluation to compare summaries on the SumMe dataset. Subjects compare two video summaries and determine which video better summarizes the original video. In the first experiment, we asked subjects to rate video summaries generated using random importance scores and DR-DSN scores. Both methods use KTS segmentation. Overall, random scores got a slightly higher score than DR-DSN, however, 46% of answers were that the summaries are equally good (bad). This result agrees with the observation in the Section 4.1 that the importance scoring hardly affects the evaluation score on the SumMe dataset. We also compare KTS and uniform segmentation with random importance scoring. As a result, subjects prefer uniform segmentation for videos recording long activity, e.g., sightseeing of the statue of liberty and scuba diving. On the other hand, KTS works better for videos with notable events or activities. For such videos, the important parts have little ambiguity, therefore the F1 scores based on the agreement between generated summaries and reference summaries can get higher. For the detailed results of the human evaluation, see the supplementary material.

4.2. TVSum數據集分析 Analysis on TVSum dataset

TVSum數據集不包含人類參考摘要,而是包含原始視頻中每2秒片段的人工標註重要性分數。這種方法的主要優點是能夠生成任意長度的參考摘要,也可以使用不同的分割方法。基於這些原因,TVSum爲研究重要性評分和分割在當前評估框架中的作用提供了一個很好的工具。

Instead of reference summaries, TVSum dataset contains human annotated importance scores for every 2 second segment in the original video. The main advantage of this approach is the ability to generate reference summaries of arbitrary length. It is also possible to use different segmentation methods. For these reasons, TVSum provides an excellent tool for studying the role of importance scoring and segmentation in the current evaluation framework.

圖5:對於TVSum數據集,不同的分割方法結合隨機或人工標註的重要性評分(留一法)的F1分數。淺藍條表示隨機得分,深藍條表示人工標註。有趣的是,在大多數情況下,隨機評分和人工標註獲得了類似的F1分數。
Figure 5. F1 scores for different segmentation methods combined to either random or human annotated importance scores (leave-one-out) for TVSum dataset. Light blue bars refer to random scores and dark blue bars indicates human annotations. Interestingly, the random and human annotations obtain similar F1 scores in most cases.

圖5顯示了不同分割方法分別使用隨機重要性評分和人工標註重要性評分的F1分數。在後一種情況下,結果是使用留一法計算的。令人驚訝的是,對於大多數分割方法,隨機的重要性分數具有與人工標註相似的性能。另外,完全隨機的雙峯分割與基於內容的KTS分割效果相當。此外,表1中的結果表明,我們的隨機結果與文獻中報告的最佳結果處於同一水平(或至少可以相提並論)。均勻分割和單峯分割達不到同樣的結果,但在這些情況下,更好的重要性評分似乎是有幫助的。總的來說,這些結果突出了使用當前基於F1的評估框架所面臨的挑戰。

Figure 5 displays the F1 scores for different segmentation methods using both random and the human annotated importance scores. In the latter case, the results are computed using leave-one-out procedure. Surprisingly, for the most of the segmentation methods, the random importance scores have similar performance as human annotations. In addition, the completely random two-peak segmentation performs equally well as content based KTS segmentation. Furthermore, the results in Table 1 illustrate that our random results are on-par (or at least comparable) with the best reported results in the literature. The uniform and one-peak segmentation do not reach the same results, but in these cases the better importance scoring seems to help. In general, the obtained results highlight the challenges in utilizing the current F1 based evaluation frameworks.

4.3. 討論 Discussion

正如前幾節所觀察到的,隨機摘要產生了驚人的高性能分數,其結果與最先進水平相當,有時甚至超過人類水平。特別是那些使片段長度變化較大的分割方法(即雙峯、KTS和隨機化KTS)產生了較高的F1分數。通過考察片段長度如何影響揹包(knapsack)形式下的片段選擇過程(這是視頻摘要方法中最常採用的做法),可以理解這一結果。

As observed in the previous sections, the random summaries resulted in surprisingly high performance scores. The results were on-par with the state-of-the-art and sometimes surpassed even the human level scores. In particular, the segmentation methods that produce large variation in the segment length (i.e. two-peak, KTS, and randomized KTS) produced high F1 scores. The results may be understood by examining how the segment length affects the selection procedure in the knapsack formulation that is most commonly adopted in video summarization methods.

圖6:長片段被隱式地排除在摘要之外,只有短片段被選中。上圖:綠色和淺綠色區域顯示由雙峯分割方法產生的片段邊界。下圖顯示動態規劃算法選擇的片段(藍色)、最短的15%視頻片段(淺綠色),以及兩者重疊的片段(紫色)。請注意,大多數被選中的部分都落在最短的那組片段之中。
Figure 6. Long segments are implicitly discarded from the summary and only short segments are selected. Top: Green and light green areas visualize segment boundaries generated by the two-peak segmentation method. Bottom plot shows segments selected by dynamic programming algorithm (blue), and top 15% of the shortest video segments (light green), and segments overlapping between them (purple). Notice that the most of the selected parts are within the group of the shortest segments.

通常用於揹包問題的動態規劃求解器,只有在一個片段對總分的貢獻大於任何總長度更短的其餘片段組合的貢獻時,纔會選擇該片段。換句話說,只有當不存在組合長度小於A、且對總分的貢獻大於或等於A的片段B和C時,片段A才會被選中。
在當前的摘要任務中,較長的片段很少滿足這一條件,因此摘要通常只由較短的片段組成。這種現象極大地限制了片段子集選擇的可選空間。例如,雙峯分割從衆數分別爲30幀和90幀的兩個分佈中採樣片段長度;因此我們可以粗略地說,較長的片段佔據了總長度的約三分之二。如果這些較長的片段全部被丟棄,生成的摘要就只能來自原始視頻剩下的三分之一。當生成長度爲原始視頻時長15%的摘要時,無論重要性分數如何,機器生成的摘要和參考摘要預計會共享大部分片段,圖6說明了這一點。由於同樣的原因,如果所有片段長度相同,重要性分數的影響就會更大(參見圖5中均勻分割和單峯分割的結果)。

A dynamic programming solver, commonly used for the knapsack problem, selects a segment only if the corresponding effect on the overall score is larger than that of any combination of remaining segments whose total length is shorter. In other words, a segment A is selected only if there are no segments B and C whose combined length is less than A and whose effect on the total score is more than or equal to A's. This is rarely true for longer segments in the current summarization tasks, and therefore the summary is usually only composed of short segments. This phenomenon significantly limits the reasonable choices available for segment subset selection.
For example, two-peak segmentation draws a segment length from two distributions whose modes are 30 frames and 90 frames; therefore, we can roughly say that longer segments occupy two-thirds of the total length. If these longer segments are all discarded, the generated summary only consists of the remaining one-third of the original video. For generating a summary whose length is 15% of the original video duration, most of the segments are expected to be shared between generated and reference summaries regardless of associated importance scores. This is illustrated in Figure 6. Due to the same reason, the importance scores have more impact if all the segments have equal length (see uniform and one-peak results in Figure 5).
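
譯者注:下面的小實驗可以粗略驗證上面這段論述——在雙峯分割下,僅靠最短的那部分片段就足以填滿 15% 的摘要預算,因此無論重要性分數如何,可供選擇的內容都被限制在短片段之間(幀數、試驗次數等數值均爲譯者假設,並非論文中的實驗)。

```python
import numpy as np

rng = np.random.default_rng(0)
fractions = []
for _ in range(100):
    # 等概率混合 Poisson(30)/Poisson(90) 採樣片段長度,直到覆蓋約 10000 幀
    lengths = []
    while sum(lengths) < 10000:
        lengths.append(rng.poisson(rng.choice([30, 90])) + 1)
    lengths = np.array(lengths)
    budget = 0.15 * lengths.sum()                     # 摘要預算:總時長的 15%
    covered = np.cumsum(np.sort(lengths))             # 從最短的片段開始累加時長
    n_short = np.searchsorted(covered, budget) + 1    # 填滿預算所需的最短片段個數
    fractions.append(n_short / len(lengths))
print("平均只需最短的約 %.0f%% 片段即可填滿 15%% 的摘要預算" % (100 * np.mean(fractions)))
```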

使用幀級分數的總和(而非平均)可以緩解這一問題;然而,大多數工作採用平均值,因爲這會大大提高在TVSum上的F1分數。若採用求和,人工摘要明顯優於隨機摘要,但我們仍然可以看到分割的影響。4.1節中SumMe數據集上的結果說明了另一個挑戰:在這個數據集上,基於KTS的參考摘要獲得了非常高的性能分數。KTS的使用隱式地引入了低冗餘策略,其目標是創建一個視覺上無冗餘的視頻摘要。也就是說,KTS將視覺上相似的幀歸入同一個片段。因此,長片段很可能是冗餘的、不那麼生動,因此也不那麼有趣,人類標註員不會願意把這樣的片段放進他們的摘要。與此同時,基於動態規劃的片段子集選擇也如前所述傾向於避免長片段。因此,生成的摘要往往符合人類的偏好。

Using the sum of frame-level scores may alleviate the challenge; however, most works instead employ averaging because this drastically increases F1 scores on TVSum. With summation, human summaries clearly outperform random ones, but we can still see the effect of segmentation. The results on the SumMe dataset in Section 4.1 illustrate another challenge. For this dataset, KTS-based references obtain really high performance scores. The use of KTS implicitly incorporates a small-redundancy strategy, which aims to create a visually non-redundant video summary. That is, KTS groups visually similar frames into a single segment. Therefore, long segments are likely to be redundant and less lively and thus they are less interesting. Human annotators would not like to include such segments in their summaries. Meanwhile, the dynamic programming-based segment subset selection tends to avoid long segments as discussed above. Thus generated summaries tend to match the human preference.

5. 重要性評分評價框架 Importance score evaluation framework

上述挑戰表明,目前的基準並不適用於評估重要性分數的質量。與此同時,近年來的視頻摘要文獻大多恰恰是針對重要性分數預測提出方法的。爲了克服這個問題,我們提出了一種新的替代評估方法。

The aforementioned challenges render the current benchmarks inapplicable for assessing the quality of the importance scores. At the same time, most of the recent video summarization literature present methods particularly for importance score prediction. To overcome this problem, we present a new alternative approach for the evaluation.

5.1. 使用等級順序統計量進行評估 Evaluation using rank order statistics

在統計學中,等級相關係數是比較順序關聯(即排名之間的關係)的成熟工具。我們利用這些工具來比較機器生成的和人類標註的幀級重要性分數(類似於[20]),從而測量它們所隱含的排名之間的相似性。

In statistics, rank correlation coefficients are well established tools for comparing the ordinal association (i.e. relationship between rankings). We take advantage of these tools in measuring the similarities between the implicit rankings provided by generated and human annotated frame level importance scores as in [20].

更準確地說,我們使用Kendall的\(\tau\)[4]和Spearman的\(\rho\)[6]相關係數。爲了得到結果,我們首先根據機器生成的重要性分數和人工標註的參考分數(每個標註者一個排名)對視頻幀進行排序。在第二階段,我們將生成的排名與每個參考排名進行比較。最後的相關分數是通過平均每個結果得到的。

More precisely, we use Kendall’s \(\tau\)[4] and Spearman’s \(\rho\)[6] correlation coefficients. To obtain the results, we first rank the video frames according to the generated importance scores and the human annotated reference scores (one ranking for each annotator). In the second stage, we compare the generated ranking with respect to each reference ranking. The final correlation score is then obtained by averaging over the individual results.
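
譯者注:Kendall's τ 和 Spearman's ρ 都可以直接用 scipy 計算。下面按正文的描述給出一個對多位標註者取平均的簡單示意(函數名與輸入格式爲譯者假設):

```python
import numpy as np
from scipy import stats

def rank_correlation(pred_scores, ref_scores_list):
    """pred_scores: 機器預測的幀級重要性得分(長度 N)
    ref_scores_list: 每位標註者的幀級參考得分組成的列表
    返回對所有標註者取平均後的 (Kendall tau, Spearman rho)。"""
    taus, rhos = [], []
    for ref in ref_scores_list:
        tau, _ = stats.kendalltau(pred_scores, ref)      # Kendall's tau(忽略 p 值)
        rho, _ = stats.spearmanr(pred_scores, ref)       # Spearman's rho(忽略 p 值)
        taus.append(tau)
        rhos.append(rho)
    return float(np.mean(taus)), float(np.mean(rhos))
```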

表2:在TVSum數據集上,計算不同重要性得分與人工標註得分之間的 Kendall's \(\tau\) 和 Spearman's \(\rho\) 相關係數。
Table 2. Kendall’s \(\tau\) and Spearman’s \(\rho\) correlation coefficients computed between different importance scores and manually annotated scores on TVSum dataset.

我們通過評估兩種最新的視頻摘要方法(dppLSTM[26]和DR-DSN[29])來演示等級相關度量。對於這兩種方法,我們都使用了原作者提供的實現。作爲合理性檢驗,我們還使用隨機評分計算了結果,根據定義,它的平均相關分數應爲零。這些結果是通過爲每個原始視頻生成100個在[0,1]內均勻分佈的隨機分數序列,並對得到的相關係數取平均得到的。人類的表現使用留一法得到。表2總結了在TVSum數據集上獲得的結果。

We demonstrate the rank order correlation measures, by evaluating two recent video summarization methods (dppLSTM [26] and DR-DSN [29]). For both methods, we utilise the implementations provided by the original authors. For sanity check, we also compute the results using random scoring, which by definition should produce zero average score. These results are obtained by generating 100 uniformly-distributed random value sequences in [0, 1] for each original video and averaging over the obtained correlation coefficients. The human performance is produced using leave-one-out approach. Table 2 summarizes the obtained results for TVSum dataset.

總的來說,該指標清楚地區分了被測試方法與隨機評分。此外,人工標註的相關係數顯著高於其他任何方法,這證實了人類重要性得分之間相互相關。在被測試的方法中,dppLSTM的結果優於DR-DSN。這是合理的,因爲dppLSTM是專門訓練來預測人工標註重要性分數的,而DR-DSN旨在最大化生成摘要內容的多樣性。不過,這兩種方法都明顯優於隨機評分。

Overall, the metric shows a clear difference between tested methods and the random scoring. In addition, the correlation coefficient for human-annotations is significantly higher than for any other method, which confirms that human importance scores correlate to each other. From the tested methods, dppLSTM results in higher performance compared to DR-DSN. This makes sense, since dppLSTM is particularly trained to predict human annotated importance scores, while DR-DSN aims at maximizing the diversity of the content in the generated summaries. However, both methods clearly outperform the random scoring.

我們進一步研究重要性得分的相關度量與輸出視頻摘要質量之間的關係。我們比較了分別使用與人工標註正相關和負相關的重要性分數生成的視頻摘要。人工評估結果表明,使用正相關重要性分數生成的視頻摘要表現更好。結果的細節見補充材料。

We further investigate the relation between the correlation measures for importance scores and the quality of output video summaries. We compare video summaries generated using importance scores which positively correlate with human annotations and those using importance scores with negative correlation. The result of human evaluation demonstrated that video summaries generated using importance scores with positive correlation outperformed others. The details of the result are in the supplementary material.

5.2. 可視化重要性得分相關性 Visualizing importance score correlations

評估視頻摘要的主要挑戰之一是人工標註之間的不一致性。事實上,雖然人工標註的相關係數在表2中最高,但其絕對值仍然相對較低。這源於重要性分數標註的主觀性和模糊性。可以想象,視頻中什麼是重要的可能高度主觀,標註者之間可能一致,也可能不一致。此外,即使標註者都同意某個視頻內容很重要,視頻中也可能有多個片段以不同視角和表現方式呈現相同的內容,在這些片段之間做選擇仍然是模糊的。

One of the main challenges in the evaluation of video summaries is the inconsistency between the human annotations. In fact, although the human annotations result in the highest correlation coefficient in Table 2, the absolute value of the correlation is still relatively low. This stems from subjectivity and ambiguity in the importance score annotation. As we can imagine, what is important in a video can be highly subjective, and the annotators may or may not agree. Furthermore, even if the annotators agree that a certain video content is important, there can be multiple parts in a video that contain the same content in different viewpoints and expressions. Selection from these parts may still be an ambiguous problem.

圖7:分數曲線形成概述。
Figure 7. Overview of the score curve formation.

爲了突出標註之間的差異,我們建議將預測的重要性評分排名相對於參考標註進行可視化。爲此:

  • 我們首先計算所有人類標註員的幀級平均得分;
  • 在第二階段,我們根據預測的重要性得分對幀進行降序排序(圖7,中);
  • 在最後一步,我們根據第二階段得到的排名來累積平均後的參考分數。更精確地說,

\(a_i=\sum_{t=1}^{i}\frac{S_t}{\sum_{j=1}^{N} S_j}\)

這裏 \(S_i\) 表示排序後的視頻中第i幀的人類標註平均得分。分母中的歸一化因子確保最大值等於1。如圖7(下圖)所示,\(a_i\) 在排序後的幀上形成一條單調遞增的曲線。如果預測得分與人類得分有很高的相關性,曲線會迅速上升。也可以用留一法爲各人類標註得分生成類似的曲線。

To highlight the variation in the annotations, we propose to visualize the predicted importance score ranking with respect to the reference annotations. To do this, we first compute the frame level average scores over the human annotators. In the second stage, we sort the frames with respect to the predicted importance scores in descending order (Figure 7, middle). In the final step, we accumulate the averaged reference scores based on the ranking obtained in the second stage. More precisely,

\(a_i=\sum_{t=1}^{i}\frac{S_t}{\sum_{j=1}^{N} S_j}\)

where \(S_i\) denotes the average human-annotated score for the i-th frame in the sorted video. The normalization factor in the denominator ensures that the maximum value equals to 1. As shown in Figure 7 (bottom), \(a_i\) forms a monotonically increasing curve over the sorted frames. If the predicted scores have high correlation to human scores, the curve should increase rapidly. Similar curves can be produced for the human scores using leave-one-out approach.
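
譯者注:按正文的三個步驟,累積分數曲線 \(a_i\) 可以用幾行 Python 實現(示意代碼),得到的數組可直接用 matplotlib 畫出圖 7 / 圖 8 中那樣的曲線:

```python
import numpy as np

def correlation_curve(pred_scores, ref_scores_list):
    """pred_scores: 機器預測的幀級重要性得分
    ref_scores_list: 各標註者的幀級參考得分列表
    返回長度爲 N 的單調遞增數組 a,且 a[-1] == 1。"""
    mean_ref = np.mean(ref_scores_list, axis=0)        # 第一步:所有標註者的幀級平均得分
    order = np.argsort(-np.asarray(pred_scores))       # 第二步:按預測得分降序排列幀
    sorted_ref = mean_ref[order]                       # 按該順序重排的參考得分 S_t
    return np.cumsum(sorted_ref) / sorted_ref.sum()    # 第三步:累積並歸一化,得到 a_i
```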

圖8:從TVSum數據集中兩個視頻生成的示例相關曲線(sTEELN-vY30 和 kLxoNp-UchI 爲視頻id)。紅線表示每個人類標註員的相關曲線,黑色虛線表示隨機重要性分數的期望。藍色和綠色曲線分別爲dppLSTM和DR-DSN方法的結果。更多結果請參閱補充材料。
Figure 8. Example correlation curves produced for two videos from TVSum dataset (sTEELN-vY30 and kLxoNp-UchI are video ids). The red lines represent correlation curves for each human annotator and the black dashed line is the expectation for a random importance scores. The blue and green curves show the corresponding results to dppLSTM and DR-DSN methods, respectively. See supplementary material for more results.

圖8顯示了從TVSum數據集生成的兩個視頻的相關曲線。紅線表示每個人類標註員的at曲線,黑色虛線表示隨機重要性分數的期望。藍色和綠色曲線分別爲dppLSTM和DR-DSN方法的結果。淺藍色表示相關曲線所在的區域。也就是說,當預測的重要性分數與人類標註的平均分數完全一致時,即基於分數的排名相同,曲線位於淺藍色區域的上界。另一方面,當分數排名與參考順序相反時,曲線與區域的下界重合。

Figure 8 shows correlation curves produced for two videos from TVSum dataset. The red lines show the \(a_i\) curve for each human annotator and the black dashed line is the expectation for random importance scores. The blue and green curves show the corresponding results for the dppLSTM and DR-DSN methods, respectively. The light-blue colour illustrates the area where correlation curves may lie. That is, when the predicted importance scores are perfectly concordant with the averaged human-annotated scores, i.e., the score based rankings are the same, the curve lies on the upper bound of the light-blue area. On the other hand, a curve coincides with the lower bound of the area when the ranking of the scores is in a reverse order of the reference.

大多數人類標註員獲得的曲線都遠高於圖8中的隨機基線。此外,圖8(a)顯示,dppLSTM和DR-DSN都能夠預測出與人類標註正相關的重要性分數。另一方面,圖8(b)中有兩條紅線遠低於黑色虛線,這意味着這兩位標註員給出的標註幾乎與總體共識相反。圖9中的細節顯示事實確實如此:這些異常標註員突出強調了1500幀和3000幀附近的片段,而其他標註員對這些片段的看法幾乎相反。所提出的可視化爲展示這種趨勢提供了直觀的工具。

The most of the human annotators obtain a curve that is well above the random baseline in Figure 8. Moreover, Figure 8 (a) shows that both dppLSTM and DR-DSN are able to predict importance scores that are positively correlated with human annotations. On the other hand, Figure 8 (b) shows two red lines that are well below the black dashed line. This implies that these annotators labelled almost opposite responses to the overall consensus. Detailed observation in Figure 9 reveals that this is indeed the case. The outliers highlighted segments around 1500 and 3000 frames, on the other hand, other annotators showed almost opposite opinion for the segments. The proposed visualization provides intuitive tool for illustrating such tendencies.

圖9:人工標註得分的比較。底部一行顯示所選的兩位人類標註員(異常標註,outliers)的幀級重要性得分。中間一行顯示對其餘人類標註員(正常標註,inliers)取平均得到的得分。頂部一行展示對應視頻中的關鍵幀。可以注意到,正常標註與異常標註突出顯示的視頻部分幾乎完全相反。
Figure 9. Comparison of human-annotated scores. The bottom row shows the frame level importance scores for the selected two human annotators (outliers). The middle row displays the similar score obtained by averaging over the remaining human annotators (inliers). The top row illustrates keyframes from the corresponding video. One can notice that inliers and outlier have highlighted almost completely opposite parts of the video.

6、結論 Conclusion

公共基準數據集扮演着重要的角色,因爲它們使不同方法之間的比較變得簡單而公平。基準評估的質量影響很大,因爲研究工作往往會朝着最大化基準結果的方向推進。在本文中,我們評估了兩個廣泛使用的視頻摘要基準的有效性。我們的分析表明,目前基於F1評分的評估框架存在嚴重的問題。

Public benchmark datasets play an important role as they facilitate easy and fair comparison of methods. The quality of the benchmark evaluations has a high impact as the research work is often steered to maximise the benchmark results. In this paper, we have assessed the validity of two widely used video summarization benchmarks. Our analysis reveals that the current F1 score based evaluation framework has severe problems.

結果表明,在大多數情況下,隨機生成的摘要能夠獲得與最先進方法相似甚至更好的性能分數,完全隨機方法的性能有時甚至超過人工標註員。進一步分析發現,得分主要由視頻分割決定,特別是片段長度的分佈,這主要源於被廣泛使用的片段子集選擇程序。在大多數情況下,重要性分數的貢獻被基準測試完全忽略了。基於我們的觀察,我們提出使用預測的重要性分數與人工標註的重要性分數之間的相關性來評估不同方法,而不是使用由片段子集選擇過程給出的最終摘要。引入的評估方法爲理解摘要方法的行爲提供了更多的見解。我們還提出了通過累積分數曲線來可視化相關性的方法,它直觀地展示了預測的重要性分數相對於各人工標註的質量。

In most cases it turned out that randomly generated summaries were able to reach similar or even better performance scores than the state-of-the-art methods. Sometimes the performance of completely random method surpassed that of human annotators. Closer analysis revealed that score formation was mainly dictated by the video segmentation and particularly the distribution of the segment lengths. This was mainly due to the widely used subset selection procedure. In most cases, the contribution of the importance scores were completely ignored by the benchmark tests. Based on our observations, we proposed to evaluate the methods using the correlation between predicted and human-annotated importance scores instead of the final summary given by the segment subset selection process. The introduced evaluation offers additional insights about the behaviour of the summarization methods. We also proposed to visualize the correlations by accumulative score curve, which intuitively illustrates the quality of the importance scores with respect to various human annotations.

我們提出的新評估框架只涵蓋估計幀級重要性分數的方法,不適用於其他方法,例如挑選靠近聚類中心的視頻片段的基於聚類的方法。此外,我們主要討論了基於與人工標註相關性的評估。視頻中故事的可理解性、視覺美感以及與用戶查詢的相關性等其他因素對各種應用同樣有價值,我們認爲在今後的工作中處理這些方面是很重要的。此外,我們認爲需要新的、規模大得多的數據集來推動視頻摘要研究的發展。

The proposed new evaluation framework covers only methods that estimate the frame level importance scores. It is not suitable for other approaches such as, e.g., clustering-based methods that pick out video segments close to cluster centres. In addition, we primarily addressed the evaluation based on correlation with human annotations. Other factors like comprehensibility of a story in a video, visual aesthetics and relevance to a user query would also be valuable for various applications. We believe that it would be important to address these aspects in future works. Moreover, we believe that new substantially larger datasets are needed for pushing video summarization research forward.

鳴謝 Acknowledgement

本工作得到了JSPS KAKENHI批准號16K16086和18H03264的部分支持。

This work was partly supported by JSPS KAKENHI Grant Nos. 16K16086 and 18H03264.

引用 References

  • [1] B. Gong, W.-L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems (NIPS), pages 2069–2077, 2014.
  • [2] M. Gygli, H. Grabner, H. Riemenschneider, and L. van Gool. Creating summaries from user videos. In European Conference on Computer Vision (ECCV), pages 505–520, 2014.
  • [3] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3090–3098, 2015.
  • [4] M. G. Kendall. The treatment of ties in ranking problems. Biometrika, 33(3):239–251, 1945.
  • [5] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Largescale video summarization using web-image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2698–2705, 2013.
  • [6] S. Kokoska and D. Zwillinger. CRC standard probability and statistics tables and formulae. Crc Press, 1999.
  • [7] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012.
  • [8] R. Laganière, R. Bacco, A. Hocevar, P. Lambert, G. Païs, and B. E. Ionescu. Video summarization from spatio-temporal features. In ACM TRECVid Video Summarization Workshop, pages 144–148, 2008.
  • [9] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1346–1353, 2012.
  • [10] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2714–2721, 2013.
  • [11] Y. Ma, L. Lu, H. Zhang, and M. Li. A user attention model for video summarization. In ACM International Conference on Multimedia (MM), pages 533–542, 2002.
  • [12] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya. Video summarization using deep semantic features. In Asian Conference on Computer Vision (ACCV), volume 10115, pages 361–377, 2016.
  • [13] B. Plummer, M. Brown, and S. Lazebnik. Enhancing video summarization via vision-language embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5781–5789, 2017.
  • [14] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In European Conference on Computer Vision (ECCV), pages 540–555, 2014.
  • [15] J. Sang and C. Xu. Character-based movie summarization. In ACM International Conference on Multimedia (MM), pages 855–858, 2010.
  • [16] A. Sharghi, B. Gong, and M. Shah. Query-focused extractive video summarization. In European Conference on Computer Vision (ECCV), pages 3–19, 2016.
  • [17] Y. Song. Real-time video highlights for yahoo esports. In Neural Information Processing Systems (NIPS) Workshops, 5 pages, 2016.
  • [18] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. TVSum: Summarizing web videos using titles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5179–5187, 2015.
  • [19] C. M. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, and E. E. J. Delp. Automated video program summarization using speech transcripts. IEEE Transactions on Multimedia, 8(4):775–790, 2006.
  • [20] A. B. Vasudevan, M. Gygli, A. Volokitin, and L. Van Gool. Query-adaptive video summarization via quality-aware relevance estimation. In ACM International Conference on Multimedia (MM), pages 582–590, 2017.
  • [21] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao. Video Summarization via Semantic Attended Networks. In AAAI Conference on Artificial Intelligence, pages 216–223, 2018.
  • [22] T. Yao, T. Mei, and Y. Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [23] T. Yao, T. Mei, and Y. Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [24] S. Yeung, A. Fathi, and L. Fei-fei. VideoSET : Video summary evaluation through text. arXiv preprint arXiv:1406.5824v1, 2014.
  • [25] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1059–1067, 2016.
  • [26] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision (ECCV), pages 766–782, may 2016.
  • [27] K. Zhang, K. Grauman, and F. Sha. Retrospective Encoders for Video Summarization. In European Conference on Computer Vision (ECCV), pages 383–399, 2018.
  • [28] B. Zhao and E. P. Xing. Quasi real-time summarization for consumer videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2513–2520, 2014.
  • [29] K. Zhou, Y. Qiao, and T. Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In AAAI Conference on Artificial Intelligence, 2018.