Top 2 Language Models

ULMFiT/AWD-LSTM

In 2018 a new NLP paradigm, pretraining a language model and then transferring it to a specific task, started to become mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) is one of its pioneers. The pipeline, shown below, has three stages: general language model pretraining, target-task language model fine-tuning, and classifier fine-tuning. The language model is based on AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier stacks pooling and linear layers on top of the language model:
(figure: the three stages of ULMFiT)
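
As a concrete illustration of the three stages, here is a minimal sketch using fastai v1's text API, in which ULMFiT is implemented. The sample dataset, learning rates, and epoch counts are placeholders rather than the paper's settings, and exact signatures differ across fastai versions (fastai v2 renamed several of these classes), so treat this as an assumption-laden outline rather than the reference implementation.

```python
from fastai.text import *   # fastai v1 text API (names differ in fastai v2)

path = untar_data(URLs.IMDB_SAMPLE)
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv',
                                       vocab=data_lm.train_ds.vocab, bs=32)

# Stages 1+2: start from the AWD-LSTM pretrained on WikiText-103 (the
# fastai default), then fine-tune the language model on the target corpus.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('ft_enc')                 # keep the fine-tuned encoder

# Stage 3: classifier = fine-tuned encoder + concat pooling + linear head,
# trained with gradual unfreezing and discriminative learning rates.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 1e-2)                 # only the head is trainable at first
learn.freeze_to(-2)                          # unfreeze one more layer group
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()                             # finally train the whole network
learn.fit_one_cycle(1, slice(1e-3 / (2.6 ** 4), 1e-3))
```
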
The details of the AWD-LSTM (Merity et al. 2017) language model are as follows:
(figure: AWD-LSTM language model details)
The essence of this model is dropout everywhere: a basic LSTM plus many regularization and optimization tricks reached the SOTA of its time (a minimal sketch of the main tricks follows this list):

  • DropConnect regularization on the LSTM's hidden-to-hidden weight matrices (standard Dropout on activations disrupts long-range dependencies; DropConnect instead randomly zeroes part of the network's weights rather than activations, so it leaves the standard cuDNN LSTM implementation untouched and is more efficient)
  • Dropout after the embedding layer and after the LSTM layers to prevent overfitting (standard Dropout samples a new mask every time; LockedDropout samples the mask once and locks and reuses it, which is more efficient)
  • The embedding layer and the softmax layer share parameters, which reduces the total parameter count and keeps the model from having to learn a one-to-one mapping between inputs and outputs; this helps language model training considerably
  • Activation Regularization (AR) / Temporal AR (TAR): on the output of the last LSTM layer, an L2 penalty pulls the activations toward 0, and an L2 penalty is also applied to the difference between outputs at consecutive time steps
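
A minimal PyTorch sketch of these regularizers is below. `WeightDrop`, `locked_dropout`, and `ar_tar_loss` are illustrative names rather than the authors' code, and the dropout rates and dimensions are placeholders; with cuDNN on GPU the weight-dropping pattern may emit a non-contiguous-weights warning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def locked_dropout(x, p=0.4, training=True):
    """LockedDropout: sample one dropout mask per sequence (not per step)
    and reuse it at every time step.  x: (seq_len, batch, hidden)."""
    if not training or p == 0.0:
        return x
    mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask

class WeightDrop(nn.Module):
    """DropConnect on the hidden-to-hidden weights of an nn.LSTM: randomly
    zero entries of weight_hh_l0 before each forward pass, leaving the
    cuDNN LSTM kernel itself untouched."""
    def __init__(self, module, weights=('weight_hh_l0',), p=0.5):
        super().__init__()
        self.module, self.weights, self.p = module, weights, p
        for name in self.weights:
            raw = getattr(self.module, name)
            del self.module._parameters[name]
            self.module.register_parameter(name + '_raw', nn.Parameter(raw.data))

    def forward(self, *args):
        for name in self.weights:
            raw = getattr(self.module, name + '_raw')
            setattr(self.module, name,
                    F.dropout(raw, p=self.p, training=self.training))
        return self.module(*args)

def ar_tar_loss(raw_out, dropped_out, alpha=2.0, beta=1.0):
    """AR: L2 penalty on the dropped output of the last LSTM layer.
    TAR: L2 penalty on the difference between consecutive time steps
    of the raw output."""
    ar = alpha * dropped_out.pow(2).mean()
    tar = beta * (raw_out[1:] - raw_out[:-1]).pow(2).mean()
    return ar + tar

# Weight tying: the softmax decoder shares its weight matrix with the
# input embedding (requires embedding dim == final output dim).
vocab_size, emb_dim = 10000, 400
encoder = nn.Embedding(vocab_size, emb_dim)
decoder = nn.Linear(emb_dim, vocab_size)
decoder.weight = encoder.weight

# Putting it together for a single layer:
lstm = WeightDrop(nn.LSTM(emb_dim, emb_dim), p=0.5)
tokens = torch.randint(0, vocab_size, (70, 32))        # (seq_len, batch)
emb = locked_dropout(encoder(tokens), p=0.4, training=True)
raw_out, _ = lstm(emb)
dropped_out = locked_dropout(raw_out, p=0.4, training=True)
logits = decoder(dropped_out)
reg = ar_tar_loss(raw_out, dropped_out)
```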

ULMFiT adds quite a few training tricks on top of AWD-LSTM (see the sketch after this list):

  • Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly and then decreases linearly, with a short increase phase and a long decrease phase, so the model first converges quickly and is then fine-tuned.
  • Discriminative Fine-Tuning: different layers of the network capture different information, from general to task-specific. Pick the learning rate of the last layer, then decrease it layer by layer as $\eta^{l-1} = \eta^l / 2.6$
  • Gradual Unfreezing: unfreeze and train the layers gradually from the top down, to prevent catastrophic forgetting
  • Variable-Length BPTT: add randomness to the default BPTT length of 70 to get an effect similar to shuffling
  • Concat Pooling: concatenate the last hidden state with max pooling and mean pooling of the hidden states, which alleviates the long-dependency problem to some extent
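
A minimal PyTorch sketch of these fine-tuning tricks follows. `stlr`, `discriminative_param_groups`, and `ConcatPoolingHead` are illustrative names, and the layer sizes, learning rates, and step counts are placeholders rather than the settings from the paper.

```python
import torch
import torch.nn as nn

def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: a short linear warm-up over
    cut_frac of the T steps, then a long linear decay."""
    cut = int(T * cut_frac)
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

def discriminative_param_groups(layer_groups, eta_last=0.01, decay=2.6):
    """Discriminative fine-tuning: each lower layer group gets the learning
    rate of the group above it divided by 2.6 (eta^{l-1} = eta^l / 2.6)."""
    return [{'params': layer.parameters(), 'lr': eta_last / (decay ** depth)}
            for depth, layer in enumerate(reversed(list(layer_groups)))]

class ConcatPoolingHead(nn.Module):
    """Classifier head: concatenate the last hidden state with max- and
    mean-pooled hidden states over time, then apply a linear layer."""
    def __init__(self, hidden_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(3 * hidden_dim, n_classes)

    def forward(self, hiddens):               # hiddens: (seq_len, batch, hidden)
        pooled = torch.cat([hiddens[-1],
                            hiddens.max(dim=0).values,
                            hiddens.mean(dim=0)], dim=1)
        return self.fc(pooled)

# Layer groups of a toy classifier: three LSTM layers plus the head.
layer_groups = nn.ModuleList([nn.LSTM(400, 400), nn.LSTM(400, 400),
                              nn.LSTM(400, 400), ConcatPoolingHead(400, 2)])
optimizer = torch.optim.SGD(discriminative_param_groups(layer_groups), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: stlr(t, T=1000) / 0.01)  # relative STLR factor

# Gradual unfreezing: start with everything frozen except the head, then
# unfreeze one more layer group per epoch, from the top of the network down.
for p in layer_groups.parameters():
    p.requires_grad_(False)
for layer in reversed(list(layer_groups)):
    for p in layer.parameters():
        p.requires_grad_(True)
    # ... fine-tune for one epoch with the groups unfrozen so far,
    #     stepping `optimizer` and `scheduler` each batch ...
```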

ELMo

Basic word vectors (Word2vec, GloVe, fastText) assign each word a single representation, but in reality a word's meaning varies with its context. This is the most fundamental difficulty of NLP: ambiguity. Contextualized word embeddings therefore try to produce a different representation for a word depending on its context, and ELMo (Embeddings from Language Models) is the most successful such model. ELMo is based on a 2-layer bidirectional LSTM language model (biLM), so unlike Word2vec it can use much longer context (the whole sentence instead of a context window). It uses the internal states of each LSTM layer as the word representation, so every word vector is computed dynamically from the sentence it appears in. ELMo representations are a function of all of the internal layers of the biLM: a linear combination of the vectors stacked above each input word, learned for each end task (a sketch of this layer mixing follows the architecture list below).
(figure: ELMo biLM architecture)

  • Use character CNN to build initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512 dim projection
  • Use 4096 dim hidden/cell LSTM states with 512 dim projections to next input
  • Use a residual connection
  • Tie parameters of token input and output (softmax) and tie these between forward and backward LMs
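
The task-specific linear combination of biLM layers mentioned above can be written as $\mathrm{ELMo}_k = \gamma \sum_j s_j\, h_{k,j}$ with softmax-normalized weights $s_j$. Here is a minimal PyTorch sketch; `ScalarMix` and the tensor shapes are illustrative rather than the original implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific linear combination of biLM layers (ELMo):
    ELMo_k = gamma * sum_j softmax(w)_j * h_{k,j}, where h_{k,0} is the
    character-CNN token layer and h_{k,1..L} are the biLSTM layers."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # w_j, learned per task
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq_len, dim), frozen biLM outputs
        weights = torch.softmax(self.scalars, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# e.g. the 2-layer biLM plus the token layer gives 3 layers of
# representations (1024-dim in the original ELMo) for each word.
layer_reps = torch.randn(3, 8, 20, 1024)     # (layers, batch, seq, dim)
elmo = ScalarMix(num_layers=3)
word_vectors = elmo(layer_reps)              # (8, 20, 1024), fed to the task model
```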

ELMo effectively demonstrated that exposing the deep internals of the pre-trained network is very useful, giving downstream tasks semi-supervision signals at different levels:

  • Lower-level states are better for lower-level syntax: Part-of-speech tagging, syntactic dependencies, NER
  • Higher-level states are better for higher-level semantics: Sentiment, Semantic role labeling, question answering, SNLI

Transformer-based Neural LMs

Transformer Done!

GPT

Bert

Bert marked the beginning of a new era in NLP.

GPT-2

Reference

  • Books and Tutorials
    • Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
    • CS224n: Natural Language Processing with Deep Learning, Winter 2019
    • CMU11-747: Neural Nets for NLP, Spring 2019
    • Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
    • Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
  • Papers
    • Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
    • Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
    • Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
    • Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
    • Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
    • Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
    • Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)
  • Blogs