Dropout network, DropConnect network

Notations

  • input $v$
  • output $r$
  • weight parameter $W \in \mathbb{R}^{d \times m}$
  • activation function $a$
  • mask $m$ for vectors and $M$ for matrices

Dropout

  • Randomly set the activations of each layer to zero with probability $1-p$:
    $$r = m \circ a(Wv), \qquad m_j \sim \text{Bernoulli}(p).$$
  • Since many activation functions (e.g. ReLU, tanh) satisfy $a(0)=0$, the mask can equivalently be applied before the nonlinearity:
    $$r = a(m \circ Wv).$$
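A minimal NumPy sketch of such a layer, assuming ReLU for $a$; the test-time scaling by $p$ is the standard weight-scaling rule, not something stated in the notes above:

```python
import numpy as np

def dropout_layer(W, v, p, train=True, rng=None):
    """Fully connected layer with dropout: r = m ∘ a(Wv).

    p is the *keep* probability, matching m_j ~ Bernoulli(p) above.
    a is assumed to be ReLU, so a(0) = 0 and the mask could equally
    be applied before the nonlinearity.
    """
    if rng is None:
        rng = np.random.default_rng()
    r = np.maximum(W @ v, 0.0)                  # a(Wv) with a = ReLU
    if train:
        m = rng.binomial(1, p, size=r.shape)    # m_j ~ Bernoulli(p)
        return m * r                            # r = m ∘ a(Wv)
    return p * r  # test time: scale by keep probability (standard rule)
```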

DropConnect

  • Randomly set the weights of each layer to zero with probability $1-p$:
    $$r = a((M \circ W)v), \qquad M_{ij} \sim \text{Bernoulli}(p).$$
  • Each $M_{ij}$ is drawn independently for every training example, so the memory required to store the masks grows with the mini-batch size, and the implementation needs to be designed carefully (see the training sketch after this list).
  • The overall model is $f(x;\theta,M)$ with $\theta = \{W_g, W, W_s\}$, where $W_g$ parameterizes the feature extractor producing $v$ and $W_s$ the softmax layer $s$. Averaging over masks gives
    $$\begin{aligned} o = \mathbb{E}_M[f(x;\theta,M)] &= \sum_M p(M)\, f(x;\theta,M)\\ &= \frac{1}{|M|}\sum_M s\big(a((M \circ W)v);\, W_s\big) \quad \text{if } p = 0.5, \end{aligned}$$
    since all $2^{d \times m}$ possible masks are equally likely when $p = 0.5$ ($|M|$ denotes the number of masks).
  • Inference (test stage): instead of enumerating all masks, sample from a Gaussian approximation (see the inference sketch after this list):
    $$\begin{aligned} r &= \frac{1}{|M|} \sum_M a\big((M \circ W)v\big)\\ &\approx \frac{1}{Z} \sum_{z=1}^{Z} r_z = \frac{1}{Z} \sum_{z=1}^{Z} a(u_z), \end{aligned}$$
    where $u_z \sim \mathcal{N}\!\big(pWv,\; p(1-p)(W \circ W)(v \circ v)\big)$ and $Z$ denotes the number of random samples drawn from the Gaussian distribution.
    Idea: approximate each pre-activation, a weighted sum of Bernoulli random variables, by a Gaussian random variable; this is partially justified by the central limit theorem.
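As referenced above, a training-time sketch in NumPy with one mask per example; the mini-batch layout and the choice of ReLU for $a$ are assumptions for illustration:

```python
import numpy as np

def dropconnect_forward(W, V, p, rng=None):
    """Training-time DropConnect over a mini-batch: r = a((M ∘ W)v).

    W has shape (d, m); V has shape (batch, m), one input per row.
    A fresh mask M of the same shape as W is drawn per example, so the
    masks alone occupy batch * d * m entries -- the memory cost noted
    in the list above.
    """
    if rng is None:
        rng = np.random.default_rng()
    batch = V.shape[0]
    M = rng.binomial(1, p, size=(batch,) + W.shape)  # M_ij ~ Bernoulli(p)
    U = np.einsum('bdm,bm->bd', M * W, V)            # (M ∘ W)v per example
    return np.maximum(U, 0.0)                        # a = ReLU (assumed)
```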
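And the inference sketch referenced above: moment-match each pre-activation with a Gaussian, draw $Z$ samples, and average the activations (again assuming ReLU for $a$; the default sample count $Z$ is an arbitrary choice):

```python
import numpy as np

def dropconnect_inference(W, v, p, Z=1000, rng=None):
    """Test-time DropConnect via the Gaussian approximation.

    Each pre-activation u = (M ∘ W)v is a weighted sum of Bernoulli
    variables, approximated by
        u ~ N(p W v, p (1 - p) (W ∘ W)(v ∘ v)),
    and r ≈ (1/Z) Σ_z a(u_z).
    """
    if rng is None:
        rng = np.random.default_rng()
    mean = p * (W @ v)                                 # E[u]
    std = np.sqrt(p * (1 - p) * ((W * W) @ (v * v)))   # sqrt(Var[u])
    u = rng.normal(mean, std, size=(Z,) + mean.shape)  # samples u_z
    return np.maximum(u, 0.0).mean(axis=0)             # (1/Z) Σ_z a(u_z)
```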

$\textcolor{red}{\text{\small Limitation}}$:
Both techniques apply to fully connected layers only.
