Supervised Hashing for Image Retrieval via Image Representation Learning: Notes 1

Abstract
Background:
     In the existing supervised hashing methods for images, an input image is usually encoded by a vector of hand-crafted visual features.
     Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of image pairs, which may often degrade the performance of hash function learning.

In this paper:
     We propose a supervised hashing method for image retrieval, in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product of HH^T, where H is a matrix with each of its rows being the approximate hash code associated to a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images.

Introduction
The learning-based hashing methods can be divided into three main streams:
(a) Unsupervised methods, in which only unlabeled data is used to learn hash functions.
(b) The other two streams are semi-supervised and supervised methods, which make use of labeled data.

Key question:
     The key question in learning-based hashing for images is how to encode images into a useful feature representation so as to enhance the hashing performance.
     Ideally, one would like to automatically learn such a feature representation that sufficiently preserves the semantic similarities for images during the hash learning process.
     e.g. Without using hand-crafted visual features, Semantic Hashing (Salakhutdinov and Hinton 2007) is a hashing method which automatically constructs a binary-code feature representation for images by a multi-layer auto-encoder, with the raw pixels of the images being directly used as input.
     However, Semantic Hashing imposes a difficult optimization problem.

     In this paper, we propose a supervised hashing method for image retrieval which simultaneously learns a set of hash functions as well as a useful image representation tailored to the hashing task.
     Given n images I = {I1, I2, ..., In} and a pairwise similarity matrix S in which Sij = 1 if Ii and Ij are semantically similar and Sij = −1 otherwise, the task of supervised hashing is to learn a set of q hash functions based on S and I.
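When per-image class labels are available, such a similarity matrix can be derived directly from them. A minimal sketch (the label values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical class labels for n = 4 training images.
labels = np.array([2, 1, 1, 3])

# Sij = 1 if images i and j share a class label, otherwise -1.
S = np.where(labels[:, None] == labels[None, :], 1, -1)

print(S)
```

Note that S is symmetric by construction and its diagonal is all ones, since every image is trivially similar to itself.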
 
     Formulating the hash learning task as a single optimization problem usually leads to a complex and highly non-convex objective which may be difficult to optimize. To avoid this issue, a popular approach is to decompose the learning process into a hash code learning stage followed by a hash function learning stage (e.g., (Zhang et al. 2010; Lin et al. 2013)).
     As shown in Figure 1, the proposed method also adopts such a two-stage paradigm. In the first stage, we propose a scalable coordinate descent algorithm to approximately decompose S into a product form S ≈ (1/q)HH^T, where H ∈ R^{n×q} with each of its elements being in {−1, 1}. The k-th row in H is regarded as the approximate target hash code of the image Ik. In the second stage, we simultaneously learn a set of q hash functions and a feature representation for the images in I by deep convolutional neural networks.

Related Work
LSH:
     The early research of hashing focuses on data-independent methods, in which the Locality Sensitive Hashing (LSH) methods (Gionis, Indyk, and Motwani 1999; Charikar 2002) are the most well-known representatives. LSH methods use simple random projections as hash functions, which are independent of the data. However, LSH usually requires long hash codes to achieve acceptable accuracy, which leads to larger storage costs and lower recall.
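As a concrete illustration of such data-independent hashing, the random-hyperplane variant of LSH (Charikar 2002) can be sketched as follows; the feature dimension and code length here are arbitrary choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 64, 16                        # feature dimension and code length (arbitrary)
W = rng.standard_normal((d, q))      # random hyperplanes, drawn independently of any data

def lsh_code(x):
    """Hash a feature vector into q bits by the sign of random projections."""
    return (x @ W >= 0).astype(np.int8)

x = rng.standard_normal(d)           # a made-up feature vector
code = lsh_code(x)
print(code)
```

Because W never looks at the data, nearby vectors only collide with a probability governed by the random projections, which is exactly why long codes are needed for accuracy.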

The Approach
Stage 1: learning approximate hash codes
      We define an n by q binary matrix H whose k-th row Hk ∈ {−1, 1}^q represents the target q-bit hash code for the image Ik. The goal of supervised hashing is to generate hash codes that preserve the semantic similarities over image pairs. Specifically, the Hamming distance between two hash codes Hi and Hj (associated to Ii and Ij, respectively) is expected to be correlated with Sij, which indicates the semantic similarity of Ii and Ij. Existing studies (Liu et al. 2012) have pointed out that the code inner product Hi·Hj^T has a one-to-one correspondence to the Hamming distance between Hi and Hj. Since Hi ∈ {−1, 1}^q, the code inner product Hi·Hj^T is in the range [−q, q]. Hence, as shown in Figure 1 (Stage 1), we learn the approximate hash codes for the training images in I by minimizing the following reconstruction error:
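The one-to-one correspondence between the code inner product and the Hamming distance can be checked numerically: for codes in {−1, 1}^q, each matching bit contributes +1 and each differing bit −1 to the inner product, so dist_H(Hi, Hj) = (q − Hi·Hj)/2. A quick sanity check:

```python
import numpy as np

q = 8
rng = np.random.default_rng(1)
Hi = rng.choice([-1, 1], size=q)
Hj = rng.choice([-1, 1], size=q)

hamming = np.sum(Hi != Hj)   # number of differing bits
inner = np.dot(Hi, Hj)       # code inner product, always in [-q, q]

# inner = (q - hamming) - hamming = q - 2 * hamming
assert hamming == (q - inner) / 2
```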
     Note: since typesetting the formulas takes too much time, please refer to the formulas in the paper directly; the same applies below.
     Equation (2) in the paper minimizes the reconstruction error ||S − (1/q)HH^T||_F^2 over the binary matrix H.
     Equation (2) is hard to optimize directly, so the authors propose equation (3), whose key change is relaxing the entries of H from {−1, 1} to the interval [−1, 1]. Since equation (3) is still a non-convex problem, "We propose to solve it by a coordinate descent algorithm using Newton directions." This algorithm sequentially or randomly chooses one entry of H to update at a time.
     The algorithm details are omitted from these notes; the follow-up methods proposed in 2015 abandoned this first-stage work entirely.
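Since the notes skip the algorithm, here is only a toy sketch of minimizing the relaxed stage-1 objective ||S − (1/q)HH^T||_F^2 with H constrained to [−1, 1]. It uses projected gradient descent rather than the paper's coordinate descent with Newton directions, and all sizes and data are made up, so treat it as an illustration of the objective, not the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 6, 4                                  # toy problem size
labels = rng.integers(0, 2, size=n)          # made-up binary class labels
S = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

H = rng.uniform(-1, 1, size=(n, q))          # relaxed codes in [-1, 1]

def loss(H):
    # Frobenius-norm reconstruction error of S by (1/q)HH^T.
    return np.sum((S - (H @ H.T) / q) ** 2)

initial = loss(H)
lr = 0.01
for _ in range(500):
    R = S - (H @ H.T) / q                    # residual (symmetric, since S is)
    grad = -(4.0 / q) * (R @ H)              # gradient of the objective w.r.t. H
    H = np.clip(H - lr * grad, -1.0, 1.0)    # project back into the relaxed box
final = loss(H)

H_binary = np.where(H >= 0, 1, -1)           # binarize the relaxed solution to {-1, 1}
```

The final sign step mirrors how the relaxed solution is turned back into approximate binary hash codes for stage 2.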

Stage 2: learning image feature representation and hash functions
     The main task is learning a feature representation for the training images as well as a set of hash functions.
     In many scenarios of supervised hashing for images, the similarity/dissimilarity labels on image pairs are derived from the discrete class labels of the individual images. That is, in hash learning, the discrete class labels of the training images are usually available. Here we can design the output layer of our network in two ways, depending on whether the discrete class labels of the training images are available. In the first way, given only the learned hash code matrix H with each of its rows being a q-bit hash code for a training image, we define an output layer with q output units (the red nodes in the output layer in Figure 1 (Stage 2)), each of which corresponds to one bit in the target hash code for an image. We denote the proposed hashing method using a CNN with such an output layer (with only the red nodes, ignoring the black nodes and the associated lines) as CNNH.
     In the second way, we assume the discrete class labels of the training images are available. Specifically, for n training images in c classes (an image may belong to multiple classes), we define an n by c discrete label matrix Y ∈ {0, 1}^{n×c}, where Yij = 1 if the i-th training image belongs to the j-th class, and Yij = 0 otherwise. For the output layer in our network, in addition to defining the q output units (the red nodes in the output layer) corresponding to the hash bits as in the first way, we add c output units (the black nodes in the output layer in Figure 1 (Stage 2)) which correspond to the class labels of the training images. By incorporating the image class labels as a part of the output layer, we enforce the network to learn a shared image representation which matches both the approximate hash codes and the image class labels. It can be regarded as a transfer learning case in which the incorporated image class labels are expected to be helpful for learning a more accurate image representation (i.e. the hidden units in the fully connected layer). Such a better image representation may be advantageous for hash function learning. We denote the proposed method using a CNN whose output layer has both the red nodes and the black nodes as CNNH+. For example, a 4 by 3 label matrix Y might look like:
              class 1   class 2   class 3
     image 1     0         1         0
     image 2     1         0         0
     image 3     1         0         0
     image 4     0         0         1
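The label matrix above can be built directly from per-image class labels. A minimal single-label sketch reproducing the 4 by 3 example (assuming the example's values are read row by row, with 1-based class indices; the paper also allows multi-label images, which this sketch does not cover):

```python
import numpy as np

n, c = 4, 3
class_of = [2, 1, 1, 3]       # hypothetical 1-based class label of each image

Y = np.zeros((n, c), dtype=int)
for i, label in enumerate(class_of):
    Y[i, label - 1] = 1       # Yij = 1 iff image i belongs to class j

print(Y)
```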