How the Fisher Vector Works

(This is an original blog post; please credit the source when reposting: http://blog.csdn.net/xuexiyanjiusheng/article/details/46927491)

I. Core idea

In essence, a Fisher vector represents an image by the gradient vector of a likelihood function.

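II. Background

1. Gaussian distribution

In everyday life and in nature, the distribution of many things can be treated as approximately Gaussian. Take the grades of a class (excellent, good, fair, poor): the very best and the very worst students are always few, and the average students are the majority.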

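Intuitively, a Gaussian distribution looks like this: [Figure: bell-shaped Gaussian curve]. Its probabilities are represented like this: [Figure: probabilities under the Gaussian curve]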

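2. Gaussian mixture distribution

The catch is that the scores of a class may also be distributed like this: very few students below 60 or above 95, many between 60 and 75, suddenly fewer between 75 and 85, and then more again between 85 and 90. Intuitively, that looks like this: [Figure: a curve with two peaks]

In this case, fitting two Gaussians, each with its own weight, clearly works much better than fitting a single Gaussian. If there are more than two peaks, it is best to fit correspondingly more Gaussians at the same time. This is how to understand the GMM (Gaussian Mixture Model).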
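The two-peak situation above can be sketched numerically. This is a minimal illustration, not a fitted model: the weights, means and standard deviations are made-up values chosen to mimic the grade example, and `gaussian_pdf` and `mixture_pdf` are hypothetical helper names.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of Gaussian densities; the weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical class scores: a larger group around 68 and a smaller one around 88.
weights, means, stds = [0.6, 0.4], [68.0, 88.0], [5.0, 3.0]
for score in (68, 80, 88):
    print(score, mixture_pdf(score, weights, means, stds))
```

The density at 80 comes out lower than at either peak, which is the dip between the two groups that a single Gaussian cannot reproduce.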

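3. Gaussian distributions applied to images

Most readers will be familiar with the independent and identically distributed (i.i.d.) assumption. Applied to images, it means that the features used to represent the image are treated as independent of one another. Take a person as an analogy: if we describe him by his height, weight and body measurements, those numbers are his features, and we treat them as independent. The same is done for an image.

The most important reason for assuming independence is that the probability of one sample (one image) can then be written as the product of the probabilities of its individual features.

After taking the logarithm, the product becomes a sum of log-probabilities, which greatly reduces the computational difficulty.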
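A small sketch of why the log form is easier to work with, assuming one-dimensional standard Gaussian densities and made-up sample values. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_product(samples, mean, std):
    """Joint density of independent samples as a plain product."""
    result = 1.0
    for x in samples:
        result *= gaussian_pdf(x, mean, std)
    return result

def log_likelihood(samples, mean, std):
    """The same quantity in the log domain: a sum of per-sample log-densities."""
    return sum(math.log(gaussian_pdf(x, mean, std)) for x in samples)

small = [0.1, -0.4, 0.7]                       # a handful of made-up samples
large = [0.1 * (i % 7) for i in range(2000)]   # many made-up samples

# For a few samples both forms agree.
print(math.log(likelihood_product(small, 0.0, 1.0)), log_likelihood(small, 0.0, 1.0))
# For many samples the product underflows in floating point; the sum does not.
print(likelihood_product(large, 0.0, 1.0), log_likelihood(large, 0.0, 1.0))
```

Besides avoiding underflow, the sum form is also what makes the later differentiation term-by-term.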

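4. Manifold learning

A low-dimensional manifold embedded in a high-dimensional space: the most intuitive examples are usually two-dimensional or one-dimensional manifolds embedded in three-dimensional space. Take a piece of cloth. It can be regarded as a two-dimensional plane, which is a two-dimensional Euclidean space. If we now twist it (in three dimensions), it becomes a manifold (of course, it is also a manifold when it is not twisted, since Euclidean space is a special case of a manifold). Intuitively, then, a manifold is like a d-dimensional space that has been bent inside an m-dimensional space (m > d).

For details, see my earlier post: http://blog.csdn.net/xuexiyanjiusheng/article/details/46928771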

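5. The essence of the Fisher Vector

The essence of the Fisher Vector is to take partial derivatives with respect to the parameters of the Gaussian distributions, that is, with respect to the weights, the means and the standard deviations. A normalization step is then applied to the result. The detailed computation is given further below.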

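6. Why the Fisher Vector is more effective than a Gaussian distribution alone

Suppose we approximate an image by a Gaussian distribution and let that distribution represent the image. If the task is object detection, then whenever you encounter an image with the same Gaussian distribution you would conclude that it shows the target. In practice this does not necessarily hold. Consider this figure: [Figure: two images with feature points in black regions]

In the two images the feature points lie in the black regions, yet the two distributions can be identical (admittedly my drawing is not very good).

So if, on top of the Gaussian distribution, we also find the direction of change, we can represent the image more accurately.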

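III. Detailed principle

In essence, a Fisher vector represents an image by the gradient vector of the likelihood function. The meaning of this gradient vector is that it "describes the direction in which parameters should be modified to best fit the data".

From "Fisher Kernels on Visual Vocabularies for Image Categorization":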

We propose to apply Fisher kernels on visual vocabularies, where the vocabularies of visual words are represented by means of a GMM.

Let $X = \{x_t,\ t = 1 \ldots T\}$ denote the set of low-level feature vectors extracted from an image and $\lambda$ the set of parameters of the GMM.

$\lambda = \{w_i, \mu_i, \Sigma_i,\ i = 1 \ldots N\}$, where $w_i$, $\mu_i$ and $\Sigma_i$ denote respectively the weight, mean vector and covariance matrix of Gaussian $i$ and where $N$ denotes the number of Gaussians. Each Gaussian represents a word of the visual vocabulary: $w_i$ encodes the relative frequency of word $i$, $\mu_i$ the mean of the word and $\Sigma_i$ the variation around the mean.

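We denote $L(X \mid \lambda) = \log p(X \mid \lambda)$. Under an independence assumption (the paper states here that the i.i.d. assumption is used), we have

$$p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda) \tag{1}$$

Taking the logarithm gives:

$$L(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda) \tag{2}$$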

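Next, a linear combination of $N$ Gaussians is used to approximate the distribution of these i.i.d. samples; the parameters of this Gaussian mixture are again denoted $\lambda$. (The likelihood that observation $x_t$ was generated by the GMM is:)

$$p(x_t \mid \lambda) = \sum_{i=1}^{N} w_i \, p_i(x_t \mid \lambda) \tag{3}$$

The coefficients of the linear combination satisfy: (The weights are subject to the constraint)

$$\sum_{i=1}^{N} w_i = 1 \tag{4}$$

$p_i$ denotes a Gaussian distribution: (the components $p_i$ are given by)

$$p_i(x \mid \lambda) = \frac{\exp\left\{ -\frac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right\}}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \tag{5}$$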

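Here $D$ is the dimensionality of the feature vectors, and the covariance matrix captures the relationships between different dimensions. The covariance matrices are assumed to be diagonal, i.e. the different dimensions of a feature are independent of one another. In the paper's words: where $D$ is the dimensionality of the feature vectors. We assume that the covariance matrices are diagonal as (i) any distribution can be approximated with an arbitrary precision by a weighted sum of Gaussians with diagonal covariances and (ii) the computational cost of diagonal covariances is much lower than the cost involved by full covariances. We use the notation $\sigma_i^2 = \mathrm{diag}(\Sigma_i)$.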

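Differentiate formula (2); the partial derivatives, i.e. the gradient, serve as the Fisher vector. Before doing so, define one more quantity: (In the following, $\gamma_t(i)$ denotes the occupancy probability, i.e. the probability for observation $x_t$ to have been generated by the $i$-th Gaussian. Bayes formula gives)

$$\gamma_t(i) = \frac{w_i \, p_i(x_t \mid \lambda)}{\sum_{j=1}^{N} w_j \, p_j(x_t \mid \lambda)} \tag{6}$$

$\gamma_t(i)$ is the occupancy probability, i.e. the probability that feature $x_t$ was generated by the $i$-th Gaussian.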
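The occupancy probability can be sketched in a few lines. This is a minimal illustration assuming diagonal covariances, with made-up mixture parameters; `log_gaussian_diag` and `occupancy` are hypothetical helper names, and the computation is done in the log domain purely for numerical stability.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v
                      for xd, m, v in zip(x, mean, var))

def occupancy(x, weights, means, variances):
    """Posterior probability of each Gaussian given one descriptor x (Bayes' rule)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)                      # subtract the maximum before exponentiating
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up two-component mixture in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

gamma = occupancy([0.2, -0.1], weights, means, variances)
print(gamma)  # almost all of the mass goes to the first Gaussian
```

The returned values always sum to 1 over the components, so each descriptor is softly shared among the visual words.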

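The partial derivatives are given by the following formulas, where the superscript $d$ denotes the $d$-th dimension of a vector: (Straightforward derivations provide the following results)

$$
\begin{aligned}
\frac{\partial L(X \mid \lambda)}{\partial w_i} &= \sum_{t=1}^{T} \left[ \frac{\gamma_t(i)}{w_i} - \frac{\gamma_t(1)}{w_1} \right] \quad \text{for } i \geq 2, \\
\frac{\partial L(X \mid \lambda)}{\partial \mu_i^d} &= \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{x_t^d - \mu_i^d}{(\sigma_i^d)^2} \right], \\
\frac{\partial L(X \mid \lambda)}{\partial \sigma_i^d} &= \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t^d - \mu_i^d)^2}{(\sigma_i^d)^3} - \frac{1}{\sigma_i^d} \right].
\end{aligned}
\tag{7}
$$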

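Note that the vectors obtained above are not yet normalized, so a normalization step is needed. Because we are working in a probability space, the normalization differs from the one used in Euclidean space: the Fisher information matrix is introduced for this purpose.

For the three groups of parameters in formula (7), the corresponding (diagonal) Fisher matrix terms needed for normalization are:

$$f_{w_i} = T \left( \frac{1}{w_i} + \frac{1}{w_1} \right), \qquad f_{\mu_i^d} = \frac{T \, w_i}{(\sigma_i^d)^2}, \qquad f_{\sigma_i^d} = \frac{2 \, T \, w_i}{(\sigma_i^d)^2} \tag{8}$$

The final normalized Fisher vector is therefore made up of:

$$f_{w_i}^{-1/2} \, \frac{\partial L(X \mid \lambda)}{\partial w_i}, \qquad f_{\mu_i^d}^{-1/2} \, \frac{\partial L(X \mid \lambda)}{\partial \mu_i^d}, \qquad f_{\sigma_i^d}^{-1/2} \, \frac{\partial L(X \mid \lambda)}{\partial \sigma_i^d} \tag{9}$$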
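Putting the gradient and the normalization together gives a short sketch. It assumes the expressions from the cited paper as I understand them: weight gradient `sum_t [gamma_t(i)/w_i - gamma_t(1)/w_1]` for i >= 2, mean gradient `sum_t gamma_t(i) (x - mu)/sigma^2`, standard-deviation gradient `sum_t gamma_t(i) [(x - mu)^2/sigma^3 - 1/sigma]`, and Fisher terms `T(1/w_i + 1/w_1)`, `T w_i/sigma^2`, `2 T w_i/sigma^2`. The function names, the output ordering and the toy data are hypothetical choices for illustration only.

```python
import math

def occupancy(x, weights, means, sigmas):
    """Posterior of each Gaussian given descriptor x (diagonal covariances)."""
    logs = []
    for w, mu, sg in zip(weights, means, sigmas):
        log_p = -0.5 * sum(math.log(2.0 * math.pi * s * s) + ((xd - m) / s) ** 2
                           for xd, m, s in zip(x, mu, sg))
        logs.append(math.log(w) + log_p)
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def fisher_vector(X, weights, means, sigmas):
    """Normalized log-likelihood gradient: weights (N-1), then means, then sigmas."""
    T, N, D = len(X), len(weights), len(X[0])
    g_w = [0.0] * N
    g_mu = [[0.0] * D for _ in range(N)]
    g_sg = [[0.0] * D for _ in range(N)]
    for x in X:
        gamma = occupancy(x, weights, means, sigmas)
        for i in range(N):
            g_w[i] += gamma[i] / weights[i] - gamma[0] / weights[0]
            for d in range(D):
                diff = x[d] - means[i][d]
                s = sigmas[i][d]
                g_mu[i][d] += gamma[i] * diff / s ** 2
                g_sg[i][d] += gamma[i] * (diff ** 2 / s ** 3 - 1.0 / s)
    fv = []
    for i in range(1, N):  # the first weight is fixed by the sum-to-one constraint
        fv.append(g_w[i] / math.sqrt(T * (1.0 / weights[i] + 1.0 / weights[0])))
    for i in range(N):
        for d in range(D):
            fv.append(g_mu[i][d] / math.sqrt(T * weights[i] / sigmas[i][d] ** 2))
    for i in range(N):
        for d in range(D):
            fv.append(g_sg[i][d] / math.sqrt(2.0 * T * weights[i] / sigmas[i][d] ** 2))
    return fv

# Made-up data: three 2-D descriptors and a two-component mixture.
X = [[0.1, -0.2], [3.9, 4.2], [0.3, 0.1]]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
fv = fisher_vector(X, weights, means, sigmas)
print(len(fv))  # (2*D + 1)*N - 1 entries
```

With a single Gaussian and all descriptors lying to one side of its mean, the mean component of the result is positive, which matches the reading of the gradient as the direction in which the parameters should move to fit the data better.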

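Each feature is $D$-dimensional and a linear combination of $N$ Gaussians is used. By formula (8), the gradient contributes $N - 1$ entries for the weights (one weight is fixed by the sum-to-one constraint), $ND$ entries for the means and $ND$ entries for the standard deviations, so a Fisher vector has $(2D + 1)N - 1$ dimensions.

With the Fisher vector in hand, you can do image classification. Papers [2, 3] both describe further improvements to the Fisher vector, which are not covered here.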

IV. The description in VLFeat

Fisher vector fundamentals

The FV is an image representation obtained by pooling local image features. It is frequently used as a global image descriptor in visual classification.

While the FV can be derived as a special, approximate, and improved case of the general Fisher Kernel framework, it is easy to describe directly. Let $I = (x_1, \dots, x_N)$ be a set of $D$-dimensional feature vectors (e.g. SIFT descriptors) extracted from an image. Let $\Theta = (\mu_k, \Sigma_k, \pi_k : k = 1, \dots, K)$ be the parameters of a Gaussian Mixture Model fitting the distribution of descriptors. The GMM associates each vector $x_i$ to a mode $k$ in the mixture with a strength given by the posterior probability:

$$q_{ik} = \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{t=1}^{K} \pi_t \, \mathcal{N}(x_i; \mu_t, \Sigma_t)}$$

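For each mode $k$, consider the mean and covariance deviation vectors (with $\sigma_{jk}^2$ the $j$-th diagonal entry of $\Sigma_k$)

$$u_{jk} = \frac{1}{N \sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik} \, \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad v_{jk} = \frac{1}{N \sqrt{2 \pi_k}} \sum_{i=1}^{N} q_{ik} \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^2 - 1 \right]$$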

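where $j = 1, 2, \dots, D$ spans the vector dimensions. The FV of image $I$ is the stacking of the vectors $u_k$ and then of the vectors $v_k$ for each of the $K$ modes in the Gaussian mixtures:

$$\Phi(I) = \begin{bmatrix} \vdots \\ u_k \\ \vdots \\ v_k \\ \vdots \end{bmatrix}$$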
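The encoding described here can be sketched as follows. This is an illustration in plain Python, not VLFeat's implementation: it assumes diagonal covariances, deviation vectors of the form `u_jk = 1/(N sqrt(pi_k)) sum_i q_ik z` and `v_jk = 1/(N sqrt(2 pi_k)) sum_i q_ik (z^2 - 1)` with `z = (x_ji - mu_jk)/sigma_jk`, and an output layout with all `u_k` first and all `v_k` after. The function names and the toy data are hypothetical.

```python
import math

def posteriors(x, priors, means, sigmas):
    """Soft assignment q_ik of one descriptor x to each of the K modes."""
    logs = []
    for p, mu, sg in zip(priors, means, sigmas):
        log_p = -0.5 * sum(math.log(2.0 * math.pi * s * s) + ((xj - m) / s) ** 2
                           for xj, m, s in zip(x, mu, sg))
        logs.append(math.log(p) + log_p)
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def fisher_encode(X, priors, means, sigmas):
    """Stack the mean deviations u_k, then the covariance deviations v_k."""
    n, K, D = len(X), len(priors), len(X[0])
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for x in X:
        q = posteriors(x, priors, means, sigmas)
        for k in range(K):
            for j in range(D):
                z = (x[j] - means[k][j]) / sigmas[k][j]
                u[k][j] += q[k] * z
                v[k][j] += q[k] * (z * z - 1.0)
    phi = []
    for k in range(K):
        phi.extend(value / (n * math.sqrt(priors[k])) for value in u[k])
    for k in range(K):
        phi.extend(value / (n * math.sqrt(2.0 * priors[k])) for value in v[k])
    return phi

# Made-up data: three 2-D descriptors, two modes.
X = [[0.1, -0.2], [3.9, 4.2], [0.3, 0.1]]
priors = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
phi = fisher_encode(X, priors, means, sigmas)
print(len(phi))  # 2 * K * D entries
```

Unlike the gradient with respect to the weights discussed earlier, this form keeps only the mean and covariance parts, so the vector has 2KD entries.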

Normalization and improved Fisher vectors

The improved Fisher Vector [24] (IFV) improves the classification performance of the representation by using two ideas:

Non-linear additive kernel. The Hellinger's kernel (or Bhattacharyya coefficient) can be used instead of the linear one at no cost by signed square rooting. This is obtained by applying the function $\sqrt{|z|} \, \operatorname{sign} z$ to each dimension of the vector $\Phi(I)$. Other additive kernels can also be used at an increased space or time cost.
Normalization. Before using the representation in a linear model (e.g. a support vector machine), the vector Φ(I) is further normalized by the l2 norm (note that the standard Fisher vector is normalized by the number of encoded feature vectors).

After square-rooting and normalization, the IFV is often used in a linear classifier such as an SVM.
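The two steps, signed square rooting followed by l2 normalization, can be sketched as below. `improved_normalize` is a hypothetical name, and returning the vector unchanged when its norm is zero is my own choice for the degenerate case.

```python
import math

def improved_normalize(phi):
    """Signed square root of each entry, followed by l2 normalization."""
    rooted = [math.copysign(math.sqrt(abs(z)), z) for z in phi]
    norm = math.sqrt(sum(z * z for z in rooted))
    if norm == 0.0:
        return rooted  # an all-zero vector is left unchanged
    return [z / norm for z in rooted]

print(improved_normalize([4.0, -9.0, 0.0]))
```

The square root keeps the sign of every entry while compressing large magnitudes, and the final division puts every encoded image on the unit sphere before it reaches the linear classifier.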

Faster computations

In practice, several data-to-cluster assignments $q_{ik}$ are likely to be very small or even negligible. The fast version of the FV sets to zero all but the largest assignment for each input feature $x_i$.
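A minimal sketch of that sparsification step, applied to the assignments of one feature. `keep_largest` is a hypothetical name; the surviving value is kept as it is here, and whether an actual implementation rescales it (for example to 1) is not stated in the text, so that part is an assumption.

```python
def keep_largest(q):
    """Zero every assignment of one feature except the largest one."""
    best = max(range(len(q)), key=lambda k: q[k])
    return [value if k == best else 0.0 for k, value in enumerate(q)]

print(keep_largest([0.1, 0.7, 0.2]))
```

With only one non-zero assignment per feature, each feature contributes to a single mode, so the accumulation over modes can skip all the others.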