ULMFiT/AWD-LSTM
In 2018, the pretrain-then-fine-tune transfer learning paradigm for NLP — language model pretraining followed by task-specific transfer — became mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) was one of its pioneers. The pipeline has three stages: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The language model is AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier stacks pooling and linear layers on top of the language model:
The details of the AWD-LSTM (Merity et al. 2017) language model are as follows:
The essence of the model is dropout everywhere: a plain LSTM augmented with many regularization and optimization tricks reached the then state of the art:
- DropConnect regularization on the LSTM's hidden-to-hidden weight matrices (conventional dropout on recurrent activations disrupts long-range dependencies; DropConnect instead randomly zeroes a subset of the network weights rather than activations, which leaves the standard cuDNN LSTM implementation untouched and is therefore more efficient)
- Dropout after the embedding layer and between LSTM layers to prevent overfitting (conventional dropout resamples its mask at every timestep; LockedDropout samples the mask once per sequence, locks it, and reuses it, which is more efficient)
- Weight tying between the embedding layer and the softmax layer, which reduces the total parameter count and avoids having to learn a separate one-to-one mapping between inputs and outputs; this helps language model training considerably
- Activation Regularization (AR) / Temporal Activation Regularization (TAR): on the output of the last LSTM layer, AR applies an L2 penalty to the activations to push them toward 0, and TAR applies an L2 penalty to the difference between outputs at consecutive timesteps
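The DropConnect, LockedDropout, and AR/TAR tricks above can be sketched in a few lines of NumPy. This is an illustrative sketch with hypothetical tensor shapes, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect(W, p):
    """Randomly zero a fraction p of the *weights* (not activations),
    as applied to the hidden-to-hidden matrix once per forward pass."""
    mask = rng.random(W.shape) >= p
    return W * mask / (1.0 - p)

def locked_dropout(x, p):
    """Variational / locked dropout: sample ONE mask per sequence and
    reuse it at every timestep.  x has shape (seq_len, batch, dim)."""
    mask = (rng.random((1, x.shape[1], x.shape[2])) >= p) / (1.0 - p)
    return x * mask  # the same mask is broadcast over seq_len

def ar_tar(h, alpha=2.0, beta=1.0):
    """AR: L2 penalty pushing final-layer activations toward 0.
    TAR: L2 penalty on differences between consecutive timesteps."""
    ar = alpha * (h ** 2).mean()
    tar = beta * ((h[1:] - h[:-1]) ** 2).mean()
    return ar + tar

h = rng.standard_normal((5, 2, 4))       # (seq_len, batch, dim)
out = locked_dropout(h, p=0.5)
# every timestep shares the same mask: zeros land at identical positions
assert ((out[0] == 0) == (out[1] == 0)).all()
```

Because LockedDropout's mask is fixed across time, the recurrent signal a unit carries is either kept for the whole sequence or dropped for the whole sequence, which is what preserves long-range dependencies.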
ULMFiT adds a number of training techniques on top of AWD-LSTM:
- Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly and then decreases linearly, with a short increase phase and a long decrease phase, so training converges quickly at first and then fine-tunes.
- Discriminative Fine-Tuning: different layers of the network capture different information, from general to task-specific; the learning rate of the last layer is chosen first, and each lower layer's rate decreases by a fixed ratio
- Gradual Unfreezing: unfreeze layers for training one at a time from top to bottom, to prevent catastrophic forgetting
- Variable-Length BPTT: add randomness to the default BPTT length of 70, achieving an effect similar to shuffling
- Concat Pooling: concatenate the last hidden state with max-pooled and mean-pooled hidden states, which partially mitigates the long-dependency problem
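Several of these tricks can be written down directly from the formulas in the ULMFiT paper. A minimal sketch in plain Python, using the paper's default hyperparameters (`cut_frac=0.1`, `ratio=32`, per-layer decay 2.6):

```python
def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted Triangular Learning Rate at iteration t of T total:
    linear warm-up for the first cut_frac fraction, then linear decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_last, n_layers, decay=2.6):
    """Per-layer learning rates: the last layer gets lr_last, and each
    lower layer's rate is divided by a further factor of `decay`."""
    return [lr_last / decay ** (n_layers - 1 - i) for i in range(n_layers)]

def concat_pooling(H):
    """H: hidden states over time, a list of equal-length float lists.
    Returns [h_T, maxpool(H), meanpool(H)] concatenated."""
    T, d = len(H), len(H[0])
    maxp = [max(h[j] for h in H) for j in range(d)]
    meanp = [sum(h[j] for h in H) / T for j in range(d)]
    return H[-1] + maxp + meanp
```

For example, `stlr(0, 100)` starts at `lr_max/ratio`, peaks at `lr_max` after 10% of the iterations, and decays back to `lr_max/ratio` by the end — the "slanted triangle" shape.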
ELMo
Basic word vectors (Word2vec, GloVe, fastText) give each word a single fixed representation, but in reality a word's meaning changes with its context. This is NLP's most fundamental difficulty: ambiguity. Contextualized word embeddings therefore try to produce different representations for a word depending on its context, and ELMo (Embeddings from Language Models) was the most successful of these models. ELMo is based on a 2-layer bidirectional LSTM language model (biLM), so compared with Word2vec it can use a much longer context (the whole sentence instead of a fixed context window). It uses the internal states of each LSTM layer as word representations, and every word vector is computed dynamically from the sentence it appears in. The ELMo representations are a function of all of the internal layers of the biLM: for each end task, a learned linear combination of the vectors stacked above each input word.
- Use a character CNN to build the initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512 dim projection
- Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input
- Use a residual connection
- Tie the parameters of the token input and output (softmax) layers, and tie these between the forward and backward LMs
ELMo effectively demonstrated that exposing the deep internals of a pre-trained network is highly effective, providing downstream tasks with semi-supervision signals at different levels:
- Lower-level states are better for lower-level syntax: part-of-speech tagging, syntactic dependencies, NER
- Higher-level states are better for higher-level semantics: sentiment analysis, semantic role labeling, question answering, SNLI
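The task-specific linear combination — softmax-normalized per-layer weights s_j plus a scalar γ, applied to the stacked biLM layer outputs — can be sketched in NumPy. Dimensions here are hypothetical (3 layers including the token layer, 512 dim):

```python
import numpy as np

def elmo_embedding(layer_states, s_logits, gamma):
    """layer_states: shape (L+1, seq_len, dim) -- the stacked biLM
    layer representations for each token (layer 0 is the char-CNN
    token layer).  s_logits: unnormalized per-layer weights, length
    L+1.  Returns per-token ELMo vectors of shape (seq_len, dim)."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                          # softmax over layers
    return gamma * np.tensordot(s, layer_states, axes=1)

rng = np.random.default_rng(0)
states = rng.standard_normal((3, 7, 512))    # L+1=3 layers, 7 tokens
vecs = elmo_embedding(states, np.zeros(3), gamma=1.0)
```

With zero logits the softmax weights are uniform, so the result is simply the mean of the layers; during fine-tuning each end task learns its own `s_logits` and `gamma`, which is how syntax-heavy tasks end up weighting the lower layer and semantics-heavy tasks the upper one.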
Transformer-based Neural LMs
Transformer Done!
GPT
Bert
beginning of a new era in NLP
GPT-2
Reference
- Books and Tutorials
- Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
- CS224n: Natural Language Processing with Deep Learning, Winter 2019
- CMU11-747: Neural Nets for NLP, Spring 2019
- Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
- Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
- Papers
- Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
- Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
- Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
- Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
- Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
- Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)
- Blogs