ULMFiT/AWD-LSTM
In 2018, the pretrain-then-fine-tune transfer learning paradigm for NLP — language model pretraining followed by task-specific transfer — became mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) was one of its pioneers. The pipeline has three stages: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The language model is AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier stacks pooling and linear layers on top of the language model:
The details of the AWD-LSTM (Merity et al. 2017) language model are as follows:
The essence of the model is dropout everywhere: a plain LSTM augmented with many regularization and optimization tricks reached the then state of the art:
- DropConnect regularization on the LSTM's hidden-to-hidden weight matrices (conventional dropout on recurrent activations disrupts long-range dependencies; DropConnect instead randomly zeroes a subset of the network weights rather than activations, which leaves the standard cuDNN LSTM implementation untouched and is therefore more efficient)
- Dropout after the embedding layer and between LSTM layers to prevent overfitting (conventional dropout resamples its mask at every timestep; LockedDropout samples the mask once per sequence, locks it, and reuses it, which is more efficient)
- Weight tying between the embedding layer and the softmax layer, which reduces the total parameter count and avoids having to learn a separate one-to-one mapping between inputs and outputs; this helps language model training considerably
- Activation Regularization (AR) / Temporal Activation Regularization (TAR): on the output of the last LSTM layer, AR applies an L2 penalty to the activations to push them toward 0, and TAR applies an L2 penalty to the difference between outputs at consecutive timesteps
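The DropConnect, LockedDropout, and AR/TAR tricks above can be sketched in a few lines of NumPy. This is an illustrative sketch with hypothetical tensor shapes, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect(W, p):
    """Randomly zero a fraction p of the *weights* (not activations),
    as applied to the hidden-to-hidden matrix once per forward pass."""
    mask = rng.random(W.shape) >= p
    return W * mask / (1.0 - p)

def locked_dropout(x, p):
    """Variational / locked dropout: sample ONE mask per sequence and
    reuse it at every timestep.  x has shape (seq_len, batch, dim)."""
    mask = (rng.random((1, x.shape[1], x.shape[2])) >= p) / (1.0 - p)
    return x * mask  # the same mask is broadcast over seq_len

def ar_tar(h, alpha=2.0, beta=1.0):
    """AR: L2 penalty pushing final-layer activations toward 0.
    TAR: L2 penalty on differences between consecutive timesteps."""
    ar = alpha * (h ** 2).mean()
    tar = beta * ((h[1:] - h[:-1]) ** 2).mean()
    return ar + tar

h = rng.standard_normal((5, 2, 4))       # (seq_len, batch, dim)
out = locked_dropout(h, p=0.5)
# every timestep shares the same mask: zeros land at identical positions
assert ((out[0] == 0) == (out[1] == 0)).all()
```

Because LockedDropout's mask is fixed across time, the recurrent signal a unit carries is either kept for the whole sequence or dropped for the whole sequence, which is what preserves long-range dependencies.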
ULMFiT adds a number of training techniques on top of AWD-LSTM:
- Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly and then decreases linearly, with a short increase phase and a long decrease phase, so training converges quickly at first and then fine-tunes.
- Discriminative Fine-Tuning: different layers of the network capture different information, from general to task-specific; the learning rate of the last layer is chosen first, and each lower layer's rate decreases by a fixed ratio
- Gradual Unfreezing: unfreeze layers for training one at a time from top to bottom, to prevent catastrophic forgetting
- Variable-Length BPTT: add randomness to the default BPTT length of 70, achieving an effect similar to shuffling
- Concat Pooling: concatenate the last hidden state with max-pooled and mean-pooled hidden states, which partially mitigates the long-dependency problem
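Several of these tricks can be written down directly from the formulas in the ULMFiT paper. A minimal sketch in plain Python, using the paper's default hyperparameters (`cut_frac=0.1`, `ratio=32`, per-layer decay 2.6):

```python
def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted Triangular Learning Rate at iteration t of T total:
    linear warm-up for the first cut_frac fraction, then linear decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_last, n_layers, decay=2.6):
    """Per-layer learning rates: the last layer gets lr_last, and each
    lower layer's rate is divided by a further factor of `decay`."""
    return [lr_last / decay ** (n_layers - 1 - i) for i in range(n_layers)]

def concat_pooling(H):
    """H: hidden states over time, a list of equal-length float lists.
    Returns [h_T, maxpool(H), meanpool(H)] concatenated."""
    T, d = len(H), len(H[0])
    maxp = [max(h[j] for h in H) for j in range(d)]
    meanp = [sum(h[j] for h in H) / T for j in range(d)]
    return H[-1] + maxp + meanp
```

For example, `stlr(0, 100)` starts at `lr_max/ratio`, peaks at `lr_max` after 10% of the iterations, and decays back to `lr_max/ratio` by the end — the "slanted triangle" shape.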
ELMo
Basic word vectors (Word2vec, GloVe, fastText) give each word a single fixed representation, but in reality a word's meaning changes with its context. This is NLP's most fundamental difficulty: ambiguity. Contextualized word embeddings therefore try to produce different representations for a word depending on its context, and ELMo (Embeddings from Language Models) was the most successful of these models. ELMo is based on a 2-layer bidirectional LSTM language model (biLM), so compared with Word2vec it can use a much longer context (the whole sentence instead of a fixed context window). It uses the internal states of each LSTM layer as word representations, and every word vector is computed dynamically from the sentence it appears in. The ELMo representations are a function of all of the internal layers of the biLM: for each end task, a learned linear combination of the vectors stacked above each input word.
- Use a character CNN to build the initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512 dim projection
- Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input
- Use a residual connection
- Tie the parameters of the token input and output (softmax) layers, and tie these between the forward and backward LMs
ELMo effectively demonstrated that exposing the deep internals of a pre-trained network is highly effective, providing downstream tasks with semi-supervision signals at different levels:
- Lower-level states are better for lower-level syntax: part-of-speech tagging, syntactic dependencies, NER
- Higher-level states are better for higher-level semantics: sentiment analysis, semantic role labeling, question answering, SNLI
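The task-specific linear combination — softmax-normalized per-layer weights s_j plus a scalar γ, applied to the stacked biLM layer outputs — can be sketched in NumPy. Dimensions here are hypothetical (3 layers including the token layer, 512 dim):

```python
import numpy as np

def elmo_embedding(layer_states, s_logits, gamma):
    """layer_states: shape (L+1, seq_len, dim) -- the stacked biLM
    layer representations for each token (layer 0 is the char-CNN
    token layer).  s_logits: unnormalized per-layer weights, length
    L+1.  Returns per-token ELMo vectors of shape (seq_len, dim)."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                          # softmax over layers
    return gamma * np.tensordot(s, layer_states, axes=1)

rng = np.random.default_rng(0)
states = rng.standard_normal((3, 7, 512))    # L+1=3 layers, 7 tokens
vecs = elmo_embedding(states, np.zeros(3), gamma=1.0)
```

With zero logits the softmax weights are uniform, so the result is simply the mean of the layers; during fine-tuning each end task learns its own `s_logits` and `gamma`, which is how syntax-heavy tasks end up weighting the lower layer and semantics-heavy tasks the upper one.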
Transformer-based Neural LMs
Transformer Done!
GPT
Bert
beginning of a new era in NLP
GPT-2
Reference
- Books and Tutorials
- Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
- CS224n: Natural Language Processing with Deep Learning, Winter 2019
- CMU11-747: Neural Nets for NLP, Spring 2019
- Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
- Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
- Papers
- Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
- Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
- Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
- Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
- Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
- Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)
- Blogs