Efficient Estimation of Word Representations in Vector Space (Translation)

We propose two novel model architectures for computing continuous vector representations
of words from very large data sets. The quality of these representations
is measured in a word similarity task, and the results are compared to the previously
best performing techniques based on different types of neural networks. We
observe large improvements in accuracy at much lower computational cost, i.e. it
takes less than a day to learn high quality word vectors from a 1.6 billion words
data set. Furthermore, we show that these vectors provide state-of-the-art performance
on our test set for measuring syntactic and semantic word similarities.

We propose two new model architectures for computing continuous vector representations of words from very large data sets. These representations are evaluated on a word similarity task, and the results are compared with the previously best-performing techniques based on different types of neural networks. The comparison shows large improvements in accuracy at much lower computational cost; for example, high-quality word vectors can be learned from a 1.6-billion-word data set in less than a day. These word vectors also achieve state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

 

1 Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity
between words, as these are represented as indices in a vocabulary. This choice has several good
reasons - simplicity, robustness and the observation that simple models trained on huge amounts of
data outperform complex systems trained on less data. An example is the popular N-gram model
used for statistical language modeling - today, it is possible to train N-grams on virtually all available
data (trillions of words [3]).

Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, because they are represented as indices in a vocabulary. This choice has several advantages: simplicity, robustness, and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. A popular example is the N-gram model used for statistical language modeling; today it is possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.

However, these simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited; performance is usually dominated by the size of the high-quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billion words or less. Thus, there are situations where simply scaling up the basic techniques will not yield significant progress, and we have to focus on more advanced techniques.

With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they typically outperform simple models. Probably the most successful concept is the distributed representation of words [10]. For example, neural-network-based language models significantly outperform N-gram models [1, 27, 17].

1.1 Goals of the Paper

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.

The main goal of this paper is to introduce techniques for learning high-quality word vectors from huge data sets (billions of words, with millions of words in the vocabulary). As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred million words, with a modest word-vector dimensionality of 50-100.

We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].

We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation not only that similar words will tend to be close to each other, but also that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages: for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].

Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [20].

Somewhat more surprisingly, it was found that the similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique, where simple algebraic operations are performed on the word vectors, vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [20].
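
To make the word offset technique concrete, here is a minimal sketch of an analogy query. It assumes a pretrained mapping vectors from words to numpy arrays (the mapping itself is not part of the paper); it simply returns the vocabulary word whose vector is closest, by cosine similarity, to vector(b) - vector(a) + vector(c).

    import numpy as np

    def analogy(vectors, a, b, c):
        """Return the word closest to vector(b) - vector(a) + vector(c),
        excluding the three query words themselves.
        analogy(vectors, "Man", "King", "Woman") should yield "Queen"
        if the vectors capture the corresponding regularity."""
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target)
        best_word, best_sim = None, -1.0
        for word, vec in vectors.items():
            if word in (a, b, c):
                continue
            sim = np.dot(vec, target) / np.linalg.norm(vec)
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word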

In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.

In this paper, we try to maximize the accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data.

1.2 Previous Work

Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.

Representing words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating a neural network language model (NNLM) was proposed in [1]: a feedforward neural network with a linear projection layer and a non-linear hidden layer is used to jointly learn the word vector representation and a statistical language model. This work has been followed by many others.

Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

Another interesting NNLM architecture was presented in [13, 14], where the word vectors are first learned using a neural network with a single hidden layer, and are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture and focus just on the first step, where the word vectors are learned using a simple model.

It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].

It was later shown that word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive to train than the one proposed in [13], with the exception of certain versions of the log-bilinear model where diagonal weight matrices are used [23].

2 Model Architectures

Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.

Many different types of models have been proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, since it was previously shown that they perform significantly better than LSA at preserving linear regularities among words [20, 31], while LDA becomes computationally very expensive on large data sets.

Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.

Similar to [18], to compare different model architectures we first define the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we try to maximize accuracy while minimizing the computational complexity.

 For all the following models, the training complexity is proportional to

       O = E × T × Q, (1)

where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 − 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].

For all of the following models, the training complexity is proportional to

     O = E × T × Q, (1)

where E is the number of training epochs, T is the number of words in the training set, and Q is defined further for each model architecture. A common choice is E = 3-50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].
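
As a rough illustration of equation (1): the numbers below are placeholders rather than values from the paper; E and T are within the ranges quoted above, and Q stands in for a per-example cost defined later for each architecture.

    E = 3              # training epochs (the paper suggests E = 3-50)
    T = 1_000_000_000  # words in the training set (up to one billion)
    Q = 1_000          # placeholder per-example cost; defined per architecture below
    O = E * T * Q      # total number of parameter accesses during training
    print(f"O = {O:.1e}")   # 3.0e+12 for this particular choice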

2.1 Feedforward Neural Net Language Model (NNLM)

The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.

The probabilistic feedforward neural network language model was proposed in [1]. It consists of input, projection, hidden, and output layers. At the input layer, the N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. The input layer is then projected to a projection layer P with dimensionality N × D, using a shared projection matrix. Since only N inputs are active at any given time, composing the projection layer is a relatively cheap operation.

The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V . Thus, the computational complexity per each training example is

      Q = N × D + N × D × H + H × V, (2)

The NNLM architecture becomes computationally complex between the projection and the hidden layer, because the values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute the probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity per training example is Q = N × D + N × D × H + H × V.

where the dominating term is H × V . However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V ). Thus, most of the complexity is caused by the term N × D × H.

The dominating term is H × V. However, several practical solutions have been proposed for avoiding it: either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalization entirely by using models that are not normalized during training [4, 9]. With a binary tree representation of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V). Thus, most of the complexity is then caused by the term N × D × H.
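
A quick back-of-the-envelope check of the three terms in Q, using the typical sizes quoted above (an illustration, not code from the paper), shows why the output term dominates with a full softmax and why N × D × H dominates once a binary-tree softmax is used:

    import math

    N, D, H, V = 10, 500, 500, 1_000_000      # context size, vector dim, hidden units, vocabulary

    projection = N * D                        # input -> projection layer
    hidden     = N * D * H                    # projection -> hidden layer
    full_out   = H * V                        # hidden -> full softmax over the vocabulary
    tree_out   = H * math.ceil(math.log2(V))  # hidden -> binary-tree (hierarchical) softmax

    print(projection, hidden, full_out)  # 5000, 2500000, 500000000 -> H * V dominates
    print(hidden, tree_out)              # 2500000, 10000 -> now N * D * H dominates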

In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while balanced binary tree would require log2(V ) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram perplexity(V )). For example when the vocabulary size is one million words, this results in about two times speedup in evaluation. While this is not crucial speedup for neural network LMs as the computational bottleneck is in the N ×D×H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.

In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural network language models [16]. Huffman trees assign short binary codes to frequent words, which further reduces the number of output units that need to be evaluated: while a balanced binary tree would require log2(V) outputs to be evaluated, the Huffman-tree-based hierarchical softmax requires only about log2(Unigram perplexity(V)). For example, when the vocabulary size is one million words, this results in about a twofold speedup in evaluation. While this is not a crucial speedup for neural network LMs, whose computational bottleneck is the N × D × H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.
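
The reduction from log2(V) to roughly the log of the unigram perplexity can be sanity-checked with a small Huffman computation. The sketch below assumes a Zipf-like unigram distribution (an assumption for illustration, not data from the paper) and uses the standard fact that the expected Huffman code length equals the sum of the internal-node weights divided by the total weight.

    import heapq
    import math

    def expected_huffman_code_length(freqs):
        """Frequency-weighted average depth of a Huffman tree over freqs,
        i.e. the expected number of binary decisions per training word."""
        heap = list(freqs)
        heapq.heapify(heap)
        total = sum(freqs)
        cost = 0.0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            cost += a + b              # total internal-node weight = sum over leaves of freq * depth
            heapq.heappush(heap, a + b)
        return cost / total

    V = 100_000                                      # smaller vocabulary to keep the demo fast
    zipf = [1.0 / rank for rank in range(1, V + 1)]  # assumed Zipf-like word frequencies
    print(math.log2(V))                              # about 16.6 outputs for a balanced tree
    print(expected_huffman_code_length(zipf))        # noticeably fewer for the Huffman tree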

2.2 Recurrent Neural Net Language Model (RNNLM)

Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step. The complexity per training example of the RNN model is

The recurrent neural network based language model was proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order N of the model), and because, theoretically, RNNs can efficiently represent more complex patterns than shallow neural networks [15, 2]. The RNN model does not have a projection layer; it has only input, hidden, and output layers. What is special about this type of model is the recurrent matrix that connects the hidden layer to itself using time-delayed connections. This allows the recurrent model to form a kind of short-term memory: information from the past can be represented by the hidden-layer state, which is updated based on the current input and the state of the hidden layer in the previous time step. The complexity per training example of the RNN model is

        Q = H × H + H × V, (3)

where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the complexity then comes from H × H.

The word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the complexity then comes from H × H.
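
To see where the H × H and H × V terms come from, here is a minimal sketch of one RNN language-model step with a plain softmax output; the sizes and random initialization are illustrative assumptions, not the setup used in the paper.

    import numpy as np

    V, H = 10_000, 100                          # vocabulary size and hidden-layer size (assumed)
    rng = np.random.default_rng(0)
    W_hh = rng.normal(scale=0.1, size=(H, H))   # recurrent hidden -> hidden matrix (the H * H term)
    W_xh = rng.normal(scale=0.1, size=(V, H))   # input word -> hidden (cheap: one row lookup per word)
    W_hy = rng.normal(scale=0.1, size=(H, V))   # hidden -> output distribution (the H * V term)

    def rnn_step(word_id, h_prev):
        """One time step: update the short-term memory and predict the next word."""
        h = np.tanh(W_xh[word_id] + W_hh @ h_prev)  # new hidden state
        scores = h @ W_hy                           # logits over the whole vocabulary
        probs = np.exp(scores - scores.max())
        return h, probs / probs.sum()

    h = np.zeros(H)
    h, next_word_probs = rnn_step(42, h)            # feed an arbitrary word id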

2.3 Parallel Training of Neural Networks

To train models on huge data sets, we have implemented several models on top of a large-scale distributed framework called DistBelief [6], including the feedforward NNLM and the new models proposed in this paper. The framework allows us to run multiple replicas of the same model in parallel, and each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, we use mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad [7]. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores at different machines in a data center.

To train models on huge data sets, we have implemented several models, including the feedforward NNLM and the new models proposed in this paper, on top of a large-scale distributed framework called DistBelief [6]. The framework allows us to run multiple replicas of the same model in parallel; each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, we use mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad [7]. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores on different machines in a data center.
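
The adaptive learning rate is standard Adagrad; a minimal per-parameter sketch (single machine, ignoring the asynchronous replica synchronization handled by DistBelief; the learning rate is an assumed value) looks like this:

    import numpy as np

    def adagrad_update(params, grads, accum, lr=0.025, eps=1e-8):
        """One Adagrad step: each parameter gets its own effective learning rate,
        which shrinks for parameters that have accumulated large gradients."""
        accum += grads ** 2                              # running sum of squared gradients
        params -= lr * grads / (np.sqrt(accum) + eps)    # per-parameter scaled update
        return params, accum

    # usage sketch with made-up values
    theta = np.zeros(5)              # model parameters
    hist = np.zeros(5)               # accumulated squared gradients
    grad = np.array([0.1, -0.2, 0.0, 0.3, -0.1])
    theta, hist = adagrad_update(theta, hist, grad)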

3 New Log-linear Models

In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.

In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The main observation from the previous section is that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can be trained on much more data efficiently.

The new architectures directly follow those proposed in our earlier work [13, 14], where it was found that neural network language model can be successfully trained in two steps: first, continuous word vectors are learned using simple model, and then the N-gram NNLM is trained on top of these distributed representations of words. While there has been later substantial amount of work that focuses on learning word vectors, we consider the approach proposed in [13] to be the simplest one. Note that related models have been proposed also much earlier [26, 8].

The new architectures directly follow those proposed in our earlier work [13, 14], where it was found that a neural network language model can be successfully trained in two steps: first, continuous word vectors are learned using a simple model, and then the N-gram NNLM is trained on top of these distributed representations of words. While there has since been a substantial amount of work focused on learning word vectors, we consider the approach proposed in [13] to be the simplest one. Note that related models were also proposed much earlier [26, 8].

3.1 Continuous Bag-of-Words Model

The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model as the order of words in the history does not influence the projection. Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. Training complexity is then

The first proposed architecture is similar to the feedforward NNLM: the non-linear hidden layer is removed and the projection layer is shared by all words (not just the projection matrix); thus, all words are projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model because the order of the words in the history does not influence the projection. Furthermore, we also use words from the future: we obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. The training complexity is then

 Q = N × D + D × log2(V). (4)

We denote this model further as CBOW, as unlike standard bag-of-words model, it uses continuous distributed representation of the context. The model architecture is shown at Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions in the same way as in the NNLM.

We denote this model further as CBOW since, unlike the standard bag-of-words model, it uses a continuous distributed representation of the context. The model architecture is shown in Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions, in the same way as in the NNLM.
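
To make the averaging and the classification criterion concrete, here is a minimal single-machine sketch of one CBOW training step. It uses a plain softmax over the full vocabulary instead of the hierarchical softmax, and the sizes, initialization, and learning rate are assumptions for illustration only.

    import numpy as np

    V, D = 10_000, 300                           # vocabulary size and vector dimensionality (assumed)
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.01, size=(V, D))   # shared input/projection matrix = the word vectors
    W_out = np.zeros((V, D))                     # output weights (plain softmax, not hierarchical)

    def cbow_step(W_in, W_out, context_ids, target_id, lr=0.025):
        """One training example: average the vectors of the history and future
        context words and train a log-linear classifier to predict the middle word."""
        h = W_in[context_ids].mean(axis=0)       # projection layer: averaged context vectors
        scores = W_out @ h                       # logits for every word in the vocabulary
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        grad_scores = probs.copy()
        grad_scores[target_id] -= 1.0            # cross-entropy gradient w.r.t. the logits
        grad_h = W_out.T @ grad_scores           # gradient flowing back into the projection
        W_out -= lr * np.outer(grad_scores, h)   # in-place update of the caller's array
        W_in[context_ids] -= lr * grad_h / len(context_ids)
        return -np.log(probs[target_id])         # loss of this example, for monitoring

    # usage sketch: four history and four future word ids predicting the middle word
    loss = cbow_step(W_in, W_out, context_ids=[3, 7, 11, 2, 25, 4, 9, 6], target_id=17)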

 

 
