Chapter 11: CNNs (2)

Convolutional Networks (2)




Variants of the basic convolution function

The convolution functions used in neural networks differ from the standard discrete convolution operation in several ways.

  1. Many convolutions are applied in parallel, because a single kernel can only extract one kind of feature.
  2. The input is usually not just a grid of real values but a grid of vector-valued observations; e.g., a color image has 3 channels (red, green, and blue).
  • Assume we have a 4-D kernel array K with elements k_{i,l,m,n}, giving the connection strength between a unit in channel i of the output and a unit in channel l of the input, with an offset of m rows and n columns between the output unit and the input unit.
  • Assume our input consists of observed data V with elements v_{i,j,k}, giving the value of the input unit in channel i at row j and column k.
  • Assume our output Z has the same format as V. If Z is produced by convolving K across V without flipping K,
    then
    z_{i,j,k} = \sum_{l,m,n} v_{l,j+m,k+n} \, k_{i,l,m,n}

    If we want to skip over some positions of the kernel in order to reduce the computational cost, we can sample only every s pixels in each direction. This defines a downsampled (strided) convolution function c such that:
    z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\, j \times s + m,\, k \times s + n} \, k_{i,l,m,n} \right]
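As a concrete sketch (my own illustration, not from the chapter), the strided formula above can be written directly in NumPy; the function name conv and all variable names are assumptions:

```python
import numpy as np

def conv(K, V, s=1):
    """Strided multi-channel "convolution" (cross-correlation, no kernel flip).

    K : kernel, shape (out_channels, in_channels, kh, kw)
    V : input,  shape (in_channels, h, w)
    s : stride (the sampling interval s from the text)

    Implements z[i,j,k] = sum_{l,m,n} V[l, j*s+m, k*s+n] * K[i,l,m,n].
    """
    oc, ic, kh, kw = K.shape
    _, h, w = V.shape
    oh, ow = (h - kh) // s + 1, (w - kw) // s + 1
    Z = np.zeros((oc, oh, ow))
    for i in range(oc):
        for j in range(oh):
            for k in range(ow):
                # Sum over input channels l and kernel offsets m, n at once.
                Z[i, j, k] = np.sum(V[:, j*s:j*s+kh, k*s:k*s+kw] * K[i])
    return Z
```

With s = 1 this reduces to the unstrided formula z_{i,j,k} = \sum_{l,m,n} v_{l,j+m,k+n} k_{i,l,m,n}.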

Zero-padding

Zero-padding prevents the output from shrinking at every layer.
Without zero-padding: if the image is m × m and the kernel is k × k, then:
output size: (m − k + 1) × (m − k + 1)

type    zero-padding                                       output size
valid   none                                               (m − k + 1) × (m − k + 1)
same    enough to keep the output the same size as input   m × m
full    enough that every pixel is visited k times         (m + k − 1) × (m + k − 1)
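The same three modes exist for 1-D convolution in NumPy, which makes the size rules easy to check (a small sketch of mine using np.convolve):

```python
import numpy as np

# 1-D analogue of the table above: np.convolve supports the same three modes.
x = np.ones(5)  # "image" of width m = 5
w = np.ones(3)  # "kernel" of width k = 3

print(len(np.convolve(x, w, mode="valid")))  # m - k + 1 = 3
print(len(np.convolve(x, w, mode="same")))   # m         = 5
print(len(np.convolve(x, w, mode="full")))   # m + k - 1 = 7
```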

How to train

Assume the loss function is J(V, K). During back-propagation we receive an array G with G_{i,j,k} = \frac{\partial}{\partial z_{i,j,k}} J(V, K).

To train the network, we need to compute the derivatives of the loss with respect to the weights in the kernel. To do so, we can use a function

g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial k_{i,j,k,l}} J(V, K) = \sum_{m,n} g_{i,m,n} \, v_{j,\, m \times s + k,\, n \times s + l}

If this layer is not the bottom layer of the network, we’ll need to compute the gradient with respect to V in order to backpropagate the error farther down. To do so, we can use a function

h(K, G, s)_{i,j,k} = \frac{\partial}{\partial v_{i,j,k}} J(V, K) = \sum_{l,m \,\mid\, s \times l + m = j} \; \sum_{n,p \,\mid\, s \times n + p = k} \; \sum_q k_{q,i,m,p} \, g_{q,l,n}
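The kernel-gradient formula can be sanity-checked numerically. The sketch below (all function and variable names are my own) picks J = Σ z, so that G is an array of ones, and compares the formula against a finite difference; since Z is linear in K, the finite difference is exact up to floating-point error:

```python
import numpy as np

def conv(K, V, s):
    """Forward pass: z[i,j,k] = sum_{l,m,n} V[l, j*s+m, k*s+n] * K[i,l,m,n]."""
    oc, ic, kh, kw = K.shape
    _, h, w = V.shape
    oh, ow = (h - kh) // s + 1, (w - kw) // s + 1
    Z = np.zeros((oc, oh, ow))
    for i in range(oc):
        for j in range(oh):
            for k in range(ow):
                Z[i, j, k] = np.sum(V[:, j*s:j*s+kh, k*s:k*s+kw] * K[i])
    return Z

def g(G, V, s, kshape):
    """Kernel gradient: g[i,j,k,l] = sum_{m,n} G[i,m,n] * V[j, m*s+k, n*s+l]."""
    dK = np.zeros(kshape)
    _, oh, ow = G.shape
    for i, j, k, l in np.ndindex(*kshape):
        for m in range(oh):
            for n in range(ow):
                dK[i, j, k, l] += G[i, m, n] * V[j, m*s + k, n*s + l]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 6, 6))
K = rng.standard_normal((3, 2, 3, 3))
s = 1

# Choose J(V, K) = sum(Z), so that G = dJ/dZ is all ones.
G = np.ones_like(conv(K, V, s))
dK = g(G, V, s, K.shape)

# Finite difference on a single kernel weight.
eps = 1e-6
K2 = K.copy()
K2[0, 0, 0, 0] += eps
num = (conv(K2, V, s).sum() - conv(K, V, s).sum()) / eps
print(np.isclose(num, dK[0, 0, 0, 0], atol=1e-4))  # True
```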

Data types

With convolution, the network can process images of varying width and height: the kernel is simply applied a different number of times depending on the size of the input, and the size of the output scales accordingly.
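A 1-D illustration of mine: one fixed kernel slides over inputs of different lengths, and the output length simply tracks the input.

```python
import numpy as np

w = np.array([1.0, -1.0, 1.0])           # one fixed kernel, k = 3
for m in (5, 9, 30):                     # inputs of different sizes
    x = np.ones(m)
    y = np.convolve(x, w, mode="valid")  # kernel is applied m - k + 1 times
    print(m, "->", len(y))               # 5 -> 3, 9 -> 7, 30 -> 28
```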

Efficient convolution algorithms

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
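This equivalence is easy to verify numerically (a sketch of mine; both signals are zero-padded to length m + k − 1 so the circular convolution computed by the DFT matches the linear one):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)  # input signal, length m = 64
w = rng.standard_normal(9)   # kernel, length k = 9

direct = np.convolve(x, w, mode="full")  # naive discrete convolution

# Fourier route: transform, multiply point-wise, transform back.
n = len(x) + len(w) - 1  # pad to m + k - 1 to avoid circular wrap-around
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

print(np.allclose(direct, via_fft))  # True
```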

More information about data types

1-D
  • Single channel — Audio waveform: the axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.
  • Multi-channel — Skeleton animation data: animations of 3-D computer-rendered characters are generated by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D
  • Single channel — Audio data that has been preprocessed with a Fourier transform: we can transform the audio waveform into a 2-D array with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation, but at a different height in the network's output.
  • Multi-channel — Color image data: one channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.

3-D
  • Single channel — Volumetric data: a common source of this kind of data is medical imaging technology, such as CT scans.
  • Multi-channel — Color video data: one axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

Next

Theano or Recurrent and Recursive Nets
