Deep Learning: Regularization (Part 9)

Parameter Tying and Parameter Sharing

Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero.

  • However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters.
  • Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters $w^{(A)}$ and model B with parameters $w^{(B)}$. The two models map the input to two different, but related, outputs: $\hat{y}^{(A)} = f(w^{(A)}, x)$ and $\hat{y}^{(B)} = g(w^{(B)}, x)$.
Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: $\forall i$, $w_i^{(A)}$ should be close to $w_i^{(B)}$. We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form:

$$\Omega(w^{(A)}, w^{(B)}) = \lVert w^{(A)} - w^{(B)} \rVert_2^2$$
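To make this concrete, here is a minimal sketch of how such a penalty could be added to the training loss, assuming PyTorch (the text does not mandate any library). Model B is fit on its own data while its parameters are pulled toward those of an already trained model A; the layer shapes, the coefficient `lam`, and the choice to keep model A fixed via `detach()` are illustrative assumptions, not prescriptions from the text.

```python
# Sketch: regularize model B's parameters toward model A's (assumed PyTorch).
import torch
import torch.nn as nn

model_a = nn.Linear(10, 2)   # hypothetical model A, already trained on task A
model_b = nn.Linear(10, 2)   # hypothetical model B, being trained on task B

def tying_penalty(m_a, m_b, lam=1e-2):
    """Squared L2 distance between corresponding parameters, scaled by lam."""
    penalty = 0.0
    for p_a, p_b in zip(m_a.parameters(), m_b.parameters()):
        # detach() treats model A as fixed; drop it to pull both models together
        penalty = penalty + torch.sum((p_a.detach() - p_b) ** 2)
    return lam * penalty

x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model_b(x), y) + tying_penalty(model_a, model_b)
loss.backward()  # gradients nudge model_b's weights toward model_a's
```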

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters.
A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory. In certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model.
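In code, hard parameter sharing amounts to reusing the same parameter tensors in several places. The sketch below (again assuming PyTorch; the shared layer and the two task-specific heads are hypothetical) shows the idea: both tasks pass through one and the same linear layer, so only a single copy of its weights exists and is updated by gradients from both tasks.

```python
# Sketch: hard parameter sharing by reusing one module (assumed PyTorch).
import torch
import torch.nn as nn

shared = nn.Linear(10, 16)   # a single shared parameter set

head_a = nn.Linear(16, 2)    # task-A-specific output layer (hypothetical)
head_b = nn.Linear(16, 2)    # task-B-specific output layer (hypothetical)

x = torch.randn(4, 10)
y_a = head_a(torch.relu(shared(x)))  # both forward passes go through `shared`
y_b = head_b(torch.relu(shared(x)))

# Only one copy of shared.weight and shared.bias is stored; gradients from
# both tasks' losses accumulate into the same tensors.
```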
Convolutional Neural Networks

By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision. Natural images have many statistical properties that are invariant to translation.
For example, a photo of a cat remains a photo of a cat if it is translated one pixel to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input.
Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.
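The memory argument is easy to verify: a convolutional layer's parameter count depends only on the kernel size and the numbers of channels, not on the image size. The comparison below (assuming PyTorch; the 32x32 image size and channel counts are arbitrary choices for illustration) contrasts a small convolutional layer with the dense layer that would be needed to produce an output of the same shape without sharing.

```python
# Sketch: why convolutional parameter sharing saves memory (assumed PyTorch).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

img = torch.randn(1, 3, 32, 32)   # a single 32x32 RGB image
features = conv(img)              # the same 8 filters are applied at every location

n_conv = sum(p.numel() for p in conv.parameters())   # 8*3*3*3 + 8 = 224
# A dense layer mapping the flattened 3x32x32 input to an 8x32x32 output
# would instead need (3*32*32) * (8*32*32) + 8*32*32 parameters:
n_dense = 3 * 32 * 32 * 8 * 32 * 32 + 8 * 32 * 32    # ~25 million
print(n_conv, n_dense)
```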
