Has the relationship between sample size and parameter count ever been considered in deep learning?

In deep learning, what is the relationship between the amount of training data and the number of parameters?

Is it simply the case that more data and more parameters mean better model performance?
More parameters naturally raise the concern of overfitting, so what relationship should the amount of data and the number of parameters maintain?

Reference paper: Scaling Laws for Neural Language Models

Summary

The paper mainly discusses the following points:

**Performance depends strongly on scale, weakly on model shape:** Model performance depends most strongly on scale, which consists of three factors: the number of model parameters \(N\) (excluding embeddings), the size of the dataset \(D\), and the amount of compute \(C\) used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)

  • Model performance depends mostly on the scale of the model and only weakly on its shape.
    Scale consists of three factors: the number of model parameters \(N\) (excluding embeddings), the size of the dataset \(D\), and the compute \(C\) used for training. Within reasonable limits, performance depends very little on architectural details such as depth vs. width.
    This is discussed in Section 3 of the paper; a parameter-count sketch follows below.
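
To make the "non-embedding parameter count \(N\)" concrete, here is a minimal sketch in Python of the rule of thumb the paper uses for a decoder-only Transformer, \(N \approx 12 \cdot n_{\mathrm{layer}} \cdot d_{\mathrm{model}}^2\); the example shape is purely illustrative, and biases and embeddings are ignored.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count N of a decoder-only
    Transformer: attention + feed-forward weights only, using the
    N ~= 12 * n_layer * d_model**2 rule of thumb (biases and
    embedding matrices are excluded)."""
    return 12 * n_layer * d_model ** 2

# Example: a GPT-2-small-like shape (illustrative numbers only).
print(non_embedding_params(n_layer=12, d_model=768))  # ~85 million
```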

**Smooth power laws:** Performance has a power-law relationship with each of the three scale factors \(N, D, C\) when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)

  • Smooth power laws
    The loss follows a power law in each of the three scale factors \(N, D, C\) when it is not bottlenecked by the other two, with trends spanning more than six orders of magnitude (Figure 1). The experiments show no deviation from these trends at the upper end, although performance must eventually flatten out before reaching zero loss. (The approximate functional forms are sketched below.)
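
For reference, the single-factor power laws fitted in the paper have the form below; the exponents quoted are the approximate values reported there and should be read as rough fits rather than exact constants.

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}},
\]

with roughly \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), and \(\alpha_C^{\min} \approx 0.050\).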

**Universality of overfitting:** Performance improves predictably as long as we scale up \(N\) and \(D\) in tandem, but enters a regime of diminishing returns if either \(N\) or \(D\) is held fixed while the other increases. The performance penalty depends predictably on the ratio \(N^{0.74} / D\), meaning that every time we increase the model size \(8\mathrm{x}\), we only need to increase the data by roughly \(5\mathrm{x}\) to avoid a penalty. (Section 4)

  • Universality of overfitting

Performance improves as long as \(N\) and \(D\) are scaled up together, but diminishing returns set in if either of them is held fixed while the other grows. The penalty is governed by the ratio \(N^{0.74} / D\). What does this ratio mean? It means that every time the model size grows \(8\mathrm{x}\), the training data only needs to grow by roughly \(5\mathrm{x}\) to avoid a penalty.

This is the relationship between parameter count and training-set size that the paper gives; a quick check of the arithmetic follows below.
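
A quick sanity check of the 8x-to-5x claim: keeping \(N^{0.74}/D\) constant means \(D\) must grow in proportion to \(N^{0.74}\), so:

```python
# Keeping N**0.74 / D constant means D must grow like N**0.74.
# An 8x larger model therefore needs roughly:
print(8 ** 0.74)  # ~4.66, i.e. about 5x more data
```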

**Universality of training:** Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)

  • Universality of training
    Training curves also follow power laws, with parameters that are roughly independent of model size. By extrapolating the early part of a training curve, one can roughly predict the loss that would be reached after training much longer. (A minimal fitting sketch is given below.)
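
As an illustration of what extrapolating a training curve could look like, the sketch below fits a plain power law to hypothetical early-training losses in log-log space and extrapolates it; the measurements are made up, and the fit omits the irreducible-loss term used in the paper's full parameterization.

```python
import numpy as np

# Hypothetical early-training measurements (step, training loss) -- illustrative only.
steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4])
losses = np.array([4.8, 4.3, 3.9, 3.6, 3.4])

# Fit log(loss) = a * log(steps) + b, i.e. loss ~= exp(b) * steps**a.
a, b = np.polyfit(np.log(steps), np.log(losses), deg=1)

def predicted_loss(step: float) -> float:
    """Extrapolate the fitted power-law training curve to a later step."""
    return float(np.exp(b) * step ** a)

# Rough prediction for a run 50x longer than the last measurement.
print(predicted_loss(1e6))
```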

**Transfer improves with test performance:** When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss - in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)

  • Transfer improves with test performance
    When models are evaluated on text from a distribution different from the training one, the results correlate strongly with those on the training validation set, with a roughly constant offset in the loss. In other words, transferring to a different distribution incurs a constant penalty, but otherwise performance improves roughly in step with performance on the training set.

**Sample efficiency:** Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).

**Convergence is inefficient:** When working within a fixed compute budget \(C\) but without any other restrictions on the model size \(N\) or available data \(D\), we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as \(D \sim C^{0.27}\) with training compute. (Section 6)

  • Convergence is inefficient
    With a fixed compute budget \(C\) but no other restrictions on the model size \(N\) or the available data \(D\), optimal performance is obtained by training very large models and stopping significantly short of convergence (Figure 3). Compute-efficient training is therefore far more sample-efficient than one would expect from training small models to convergence, with data requirements growing very slowly with compute, as \(D \sim C^{0.27}\).
    Put simply: if compute is sufficient, the larger the model and the training set, the better. (A rough budget-allocation sketch follows below.)
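
As a rough illustration of "train a bigger model and stop short of convergence", the sketch below converts a relative increase in compute into relative increases in model size and data, using the approximate compute-efficient exponents \(N \sim C^{0.73}\) and \(D \sim C^{0.27}\) from the paper; prefactors are omitted, so only ratios are meaningful.

```python
def scale_plan(compute_multiplier: float) -> tuple[float, float]:
    """Given a multiplier on the training compute budget C, return the rough
    multipliers on model size N and dataset size D implied by N ~ C**0.73
    and D ~ C**0.27 (approximate exponents from the paper's compute-efficient
    frontier; prefactors omitted, so only ratios are meaningful)."""
    return compute_multiplier ** 0.73, compute_multiplier ** 0.27

# Example: with 10x more compute, grow the model ~5.4x but the data only ~1.9x.
n_mult, d_mult = scale_plan(10.0)
print(f"model x{n_mult:.1f}, data x{d_mult:.1f}")
```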

**Optimal batch size:** The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly \(1-2\) million tokens at convergence for the largest models we can train. (Section 5.1)

  • Optimal batch size
    The ideal batch size for training these models is roughly a power of the loss only, and can be determined by measuring the gradient noise scale [MKAT18]; for the largest models it is roughly 1-2 million tokens at convergence. (A sketch of this relation follows below.)
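
For concreteness, the critical-batch-size relation is roughly \(B_{\mathrm{crit}}(L) \approx B_*/L^{1/\alpha_B}\); the sketch below uses the approximate fitted constants from the paper (\(B_* \approx 2\times10^8\) tokens, \(\alpha_B \approx 0.21\)), which should be treated as order-of-magnitude estimates.

```python
def critical_batch_size_tokens(loss: float,
                               b_star: float = 2.0e8,
                               alpha_b: float = 0.21) -> float:
    """Rough critical batch size in tokens as a power of the loss only:
    B_crit(L) ~= B_* / L**(1/alpha_B). The default constants are the
    approximate fits from the scaling-laws paper (order of magnitude only)."""
    return b_star / loss ** (1.0 / alpha_b)

# Example: a converged loss around 3 nats/token gives on the order of 1M tokens.
print(f"{critical_batch_size_tokens(3.0):.2e}")
```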

To be continued...
