Dropout
Randomly set the activations of each layer to zero with probability 1−p: r = m ∘ a(Wv), with m_j ∼ Bernoulli(p).
Since many activation functions satisfy a(0) = 0, this is equivalent to r = a(m ∘ Wv).
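A minimal NumPy sketch of the training-time dropout forward pass described above (the function name `dropout_forward` and the choice of ReLU for a are illustrative, not from the source):

```python
import numpy as np

def dropout_forward(W, v, p, rng):
    """Training-time dropout: keep each activation with probability p.
    r = m ∘ a(Wv), with m_j ~ Bernoulli(p); here a = ReLU, so a(0) = 0
    and a(m ∘ Wv) would give the same result.
    """
    m = rng.binomial(1, p, size=W.shape[0])  # one mask entry per output unit
    pre = W @ v                              # pre-activations Wv
    return m * np.maximum(pre, 0.0)          # m ∘ a(Wv)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
v = rng.standard_normal(3)
r = dropout_forward(W, v, p=0.5, rng=rng)
```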
DropConnect
Randomly set each weight of the layer to zero with probability 1−p: r = a((M ∘ W)v), with M_ij ∼ Bernoulli(p).
Each M_ij is drawn independently for each example during training.
The memory required for the masks M grows with the mini-batch size, so the implementation needs to be designed carefully.
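A sketch of the DropConnect forward pass with an independent weight mask per example; the mask tensor has shape (batch, out, in), which is exactly the memory growth noted above (the function name and ReLU choice are illustrative):

```python
import numpy as np

def dropconnect_forward(W, V, p, rng):
    """Training-time DropConnect: r = a((M ∘ W) v), M_ij ~ Bernoulli(p).
    V has shape (batch, in); one mask M per example, so the mask tensor
    has shape (batch, out, in) -- memory grows with the mini-batch size.
    """
    batch = V.shape[0]
    M = rng.binomial(1, p, size=(batch,) + W.shape)  # independent mask per example
    pre = np.einsum('boi,bi->bo', M * W, V)          # (M ∘ W) v for each example
    return np.maximum(pre, 0.0)                      # a = ReLU

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
V = rng.standard_normal((8, 3))                      # mini-batch of 8 examples
r = dropconnect_forward(W, V, p=0.5, rng=rng)
```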
Overall model: f(x;θ,M), where θ = {W_g, W, W_s}.
o = E_M[f(x;θ,M)] = Σ_M p(M) f(x;θ,M) = (1/|M|) Σ_M s(a((M ∘ W)v); W_s) if p = 0.5,
where s(·; W_s) denotes the final (softmax) layer; with p = 0.5 every mask is equally likely, so the expectation becomes a uniform average over all |M| masks.
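For a toy-sized layer this expectation can be computed exactly by enumerating all masks; a sketch with p = 0.5, omitting the softmax layer s(·; W_s) for brevity (enumeration is exponential in the number of weights, so it is only feasible at toy scale):

```python
import itertools
import numpy as np

def exact_expectation(W, v, a=lambda x: np.maximum(x, 0.0)):
    """E_M[a((M ∘ W) v)] with p = 0.5: every mask M is equally likely,
    so the expectation is the plain average over all 2**W.size masks."""
    out = np.zeros(W.shape[0])
    masks = list(itertools.product([0, 1], repeat=W.size))
    for bits in masks:
        M = np.array(bits, dtype=float).reshape(W.shape)
        out += a((M * W) @ v)
    return out / len(masks)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))  # kept tiny: 2**6 = 64 masks
v = rng.standard_normal(3)
o = exact_expectation(W, v)
```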
Inference (test stage): r = (1/|M|) Σ_M a((M ∘ W)v) ≈ (1/Z) Σ_{z=1}^{Z} r_z ≈ (1/Z) Σ_{z=1}^{Z} a(u_z),
where u_z ∼ N(pWv, p(1−p)(W ∘ W)(v ∘ v)); Z denotes the number of random samples drawn from the Gaussian distribution.
Idea: each pre-activation (M ∘ W)v is a sum of weighted Bernoulli random variables, which is approximated by a Gaussian random variable; this is partially justified by the central limit theorem.
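A sketch of this Gaussian moment-matching inference step: sample u_z from the stated normal distribution and average a(u_z) over Z draws (assuming a diagonal covariance, as the elementwise formula above suggests; names are illustrative):

```python
import numpy as np

def dropconnect_inference(W, v, p, Z, rng, a=lambda x: np.maximum(x, 0.0)):
    """Approximate r = E_M[a((M ∘ W) v)] by moment matching:
    u_z ~ N(p·Wv, p(1−p)·(W ∘ W)(v ∘ v)), then r ≈ (1/Z) Σ_z a(u_z)."""
    mean = p * (W @ v)
    var = p * (1 - p) * ((W * W) @ (v * v))  # elementwise, per output unit
    samples = rng.normal(mean, np.sqrt(var), size=(Z, W.shape[0]))
    return a(samples).mean(axis=0)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
v = rng.standard_normal(3)
r = dropconnect_inference(W, v, p=0.5, Z=1000, rng=rng)
```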
Limitations:
Both techniques are suitable for fully connected layers only.