Lecture2_Stanford cs224d_Simple Word Vector Representations: word2vec,GloVe
Contents
- Lecture2_Stanford cs224d_Simple Word Vector Representations: word2vec,GloVe
- 1. Word meaning
- A question lies ahead
- discrete representation
- How to make neighbors represent words?
- Word meaning is defined in terms of vectors
- Basic idea of learning neural network word embeddings
- 2. Word2vec Introduction
- 3. Research Highlight (Danqi Chen)
- 4. Word2vec objective function gradients
- 5. Optimization Refresher
- Reference
1. Word meaning
A question lies ahead
Q: How do we represent meaning in a form a computer can use?
Common answer: use a taxonomy like WordNet, which has hypernym (is-a) relationships and synonym sets.
- WordNet is one of the most famous taxonomic resources and is popular among computational linguists, because it is free to download and provides a lot of taxonomic information about words.
- the components of WordNet
- demo implemented in Python
The demo gets hold of WordNet through NLTK, one of the main Python packages for NLP.
Running it under Python 3.7 gives the following result.
The list of hypernyms of the word "panda" is:
[Synset('procyonid.n.01'),
Synset('carnivore.n.01'),
Synset('placental.n.01'),
Synset('mammal.n.01'),
Synset('vertebrate.n.01'),
Synset('chordate.n.01'),
Synset('animal.n.01'),
Synset('organism.n.01'),
Synset('living_thing.n.01'),
Synset('whole.n.02'),
Synset('object.n.01'),
Synset('physical_entity.n.01'),
Synset('entity.n.01')]
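Since the demo picture is not available, here is a minimal sketch of how such a list might be produced. It assumes NLTK is installed and the WordNet corpus has been downloaded (for example with `nltk.download("wordnet")`); the function name `panda_hypernym_closure` is made up for this example.

```python
def panda_hypernym_closure():
    """Return every synset reachable from 'panda.n.01' via hypernym links."""
    # Imported inside the function so that merely defining it
    # does not require NLTK or the WordNet data to be present.
    from nltk.corpus import wordnet as wn

    panda = wn.synset("panda.n.01")
    # closure() follows the given relation repeatedly until nothing new appears.
    return list(panda.closure(lambda s: s.hypernyms()))
```

Calling `panda_hypernym_closure()` in an environment with the WordNet data should give a list of `Synset` objects like the one above.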
discrete representation
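Although a discrete representation of words such as the one above is a valuable linguistic resource, in practice it often works less well than one would hope. The synonyms it groups together still differ in nuance, and its main limitations are:
- it misses new words (it is hard to keep up to date)
- it is subjective
- it requires a great deal of human labor to create and maintain
- it is hard to compute word similarity accurately
A large share of rule-based and statistical NLP work treats a word as an atomic, indivisible unit. In vector space terms, each word is a vector with a single "1" and a lot of "0"s; this is called a "one-hot" representation. Its problems are:
1. The vocabulary of a corpus is large, so the vectors become very high-dimensional.
Dimensionality:
20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
- "speech" means a speech recognizer
- PTB: the Penn Treebank
NLTK exposes a reader for it in Python: from nltk.corpus import ptb
- big vocab
If we are building a machine translation system, we might use a 500,000-word vocabulary.
- Google 1T
Google released a very large corpus built from a web crawl (the "1T" refers to about one trillion word tokens).
2. Any two distinct word vectors are orthogonal (their dot product, or inner product, is zero).
In other words, there is no natural notion of similarity.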
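The orthogonality problem can be checked directly. The sketch below uses a made-up three-word vocabulary; `one_hot` and `dot` are helper names introduced only for this illustration.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    """Dot (inner) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["hotel", "motel", "panda"]  # hypothetical toy vocabulary
vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Two different words always give 0, however related their meanings are.
hotel_motel = dot(vectors["hotel"], vectors["motel"])
# A word compared with itself gives 1.
hotel_hotel = dot(vectors["hotel"], vectors["hotel"])
```

With one-hot vectors, "hotel" is exactly as dissimilar to "motel" as it is to "panda", which is why the representation carries no notion of similarity.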
How to make neighbors represent words?
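The linguist J. R. Firth proposed that the meaning of a word can be obtained from its context. He went further and suggested that you only really know a word's meaning if you can place it in the right contexts.
"You shall know a word by the company it keeps."
Wittgenstein, a 20th-century philosopher of language, held a similar view: "the right way to think about the meaning of words is understanding their use in text."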
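Word meaning is defined in terms of vectors
We adjust the vector of a word and the vectors of its context words so that the similarity of two words can be estimated from their two vectors, or so that a word's context can be predicted from its vector. The approach is recursive: vectors are adjusted on the basis of other vectors, much as dictionary senses are defined in terms of other words.
In addition, distributed representations are the opposite of symbolic representations (localist representations, one-hot representations); a discrete representation is close in meaning to the latter and to denotation. Be careful not to confuse the words "distributed" and "discrete".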
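Basic idea of learning neural network word embeddings
- Define a model that predicts the context of a word $w_t$ in terms of word vectors: $p(\text{context} \mid w_t) = \dots$
- Define the loss function: $J = 1 - p(w_{-t} \mid w_t)$
Here $w_{-t}$ denotes the context of $w_t$ (a minus subscript usually means "everything except"). The goal of training is to change the vector representations of the words so that the loss approaches 0.
- Then take training examples from many positions $t$ in a large corpus and adjust the word vectors to minimize the loss.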
2. Word2vec Introduction
Main idea of word2vec
Skip-gram prediction
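The idea of the skip-gram model: at each step, pick one word as the center word and predict the context words within a fixed window around it. The training objective is to maximize the probability of those context words given the center word.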
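Details of word2vec
For each position t = 1, 2, …, T in the text (T is the number of tokens in the corpus, not the vocabulary size), we predict the context words within a window of "radius" m around the word $w_t$.
- Objective function: for each center word, maximize the predicted probability of every one of its context words:
$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta)$
$\theta$ represents all variables we will optimize.
During training we adjust the model parameters so that the predicted probability of the context words around each center word is as high as possible.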
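To make the window of radius m concrete, the sketch below lists the (center word, context word) pairs that the double product ranges over. The function name `skipgram_pairs` and the sample sentence are assumptions made for this example; positions outside the text are simply skipped.

```python
def skipgram_pairs(tokens, m):
    """List (center, context) pairs for every position t and offset j != 0 with |j| <= m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center word itself and offsets that fall outside the text.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "the quick brown fox".split()  # hypothetical four-token corpus
pairs = skipgram_pairs(tokens, m=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair contributes one factor $p(w_{t+j} \mid w_t; \theta)$ to the objective.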
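- Maximizing this objective is equivalent to minimizing its negative logarithm, which gives the loss function, the negative of the (average) log-likelihood:
$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t; \theta)$
This is the negative log-likelihood.
The original figure plots the logarithm function as a red curve.
Because we take the logarithm of a probability, $p$ lies in $(0, 1]$, so only the part of the curve over that interval matters, where $\log p \le 0$. Negating it gives the negative log-likelihood curve, which is non-negative on that interval.
The larger the probability the better: as the probability approaches 1, the function value approaches 0, which is where the loss function reaches its minimum.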
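The behaviour described above can be checked numerically. This is a small sketch with made-up probabilities; `negative_log_likelihood` is a name chosen for the example and takes the probabilities the model assigned to the true context words.

```python
import math


def negative_log_likelihood(probs):
    """Average of -log p over the probabilities assigned to the true context words."""
    return -sum(math.log(p) for p in probs) / len(probs)


# Hypothetical predicted probabilities for the observed context words.
confident = negative_log_likelihood([0.9, 0.8, 0.95])
unsure = negative_log_likelihood([0.1, 0.2, 0.05])
perfect = negative_log_likelihood([1.0, 1.0])

print(confident, unsure)
```

Higher probabilities give a smaller loss, and the loss reaches 0 only when every true context word gets probability 1.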
[A small cheat hidden in the function]
The objective above is written as if $\theta$ were the only thing it depends on, but the model also has hyperparameters. One is the window size $m$, and we will come across a couple of other fudge factors later in the lecture. All of them are hyperparameters that you could adjust; for now we treat them as constants.
- Terminology:
Loss function = Cost function = Objective function
For probability distributions, the usual loss is the cross-entropy loss.
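- How do we use word vectors to minimize the negative log-likelihood?
We first compute, for a given center word, a probability distribution over the words that can appear in its context. The simplest formulation is:
$p(o \mid c) = \dfrac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$
Here $o$ is the index of one particular output (context) word, $c$ is the index of the center word, and $v_c$ and $u_o$ are the vector representations of the center word and the context word.
Dot product: $u^{\top} v = u \cdot v = \sum_{i=1}^{n} u_i v_i$
The more similar the vectors $u$ and $v$ are, the larger their dot product.
The sum over $w = 1, \dots, V$ in the denominator computes the similarity between every word vector $u_w$ and $v_c$.
Softmax turns these scores into the probability of each context word given the center word.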
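As an illustration of the formula for $p(o \mid c)$, the sketch below evaluates it for a three-word vocabulary with hand-picked 2-dimensional vectors. All numbers and the names `center_vecs`, `outside_vecs` and `prob_context_given_center` are invented for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prob_context_given_center(o, c, outside_vecs, center_vecs):
    """p(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)."""
    v_c = center_vecs[c]
    exps = [math.exp(dot(u_w, v_c)) for u_w in outside_vecs]
    return exps[o] / sum(exps)


# Hypothetical vectors: row i holds the vector of word i.
center_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # v_w
outside_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # u_w

# Distribution over context words when word 0 is the center word.
probs = [prob_context_given_center(o, 0, outside_vecs, center_vecs) for o in range(3)]
print(probs)
```

The context word whose vector has the largest dot product with $v_c$ receives the highest probability, and the three probabilities sum to 1.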
- The softmax function: the standard map from $\mathbb{R}^{V}$ to a probability distribution (it turns arbitrary numbers into probabilities):
$p_i = \dfrac{\exp(u_i)}{\sum_{j} \exp(u_j)}$
Raw scores can be positive or negative, so they cannot be used as probabilities directly. We therefore exponentiate them, which maps every real number to a positive number, and then normalize to obtain probabilities (exponentiate to make positive, normalize to give probability).
It is called softmax because it is close to a max function: when you exponentiate, the big values get much bigger and dominate while small values become negligible, so the result moves in the direction of a max function, but not fully. It is still a "soft" choice.
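The "soft max" behaviour is easy to see on a few numbers. In this sketch the scores are arbitrary, and subtracting the maximum before exponentiating is a common numerical-stability step added here; it is not part of the lecture's definition and does not change the result.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to probabilities: exponentiate, then normalize."""
    shift = max(scores)  # stability trick (an addition): avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 5.0])
print(probs)

# Negative scores are handled the same way and still give valid probabilities.
mixed = softmax([-3.0, 0.0, 3.0])
```

The largest score takes most of the probability mass, but the smaller ones keep a non-zero share, unlike a hard max.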
- Another trick: one word with two vector representations?
You might think that one word should have only one vector representation, and if you really wanted to you could do that. But the math becomes considerably easier if each word has two vector representations: one used when it is the center word and another used when it is a context word.
This not only makes the math a lot easier, because the two representations are separated during optimization rather than tied to each other; empirically it also works a little better in practice. If your life is easier and better, who would not choose that?
So we have two vectors for each word.
- A question raised by a student:
When I am dealing with the context words, am I paying attention to where they are or just to their identity?
Where they are has nothing to do with it in this model; all that matters is the identity of the words somewhere in the window. So there is just one probability distribution and one representation of each context word. You might think that is not really a good idea, and there are other models that do pay attention to position and distance. For some purposes, especially syntactic rather than semantic ones, that actually helps a lot. But if you are mostly interested in word meaning, it turns out that not paying attention to position tends to help rather than hurt.
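- The skip-gram figure, read from left to right: the input is the one-hot representation of the center word. Between the input layer and the projection layer is the matrix W of center-word representations; multiplying W by the one-hot vector selects the corresponding column of W, which is the representation of the center word. Between the projection layer and the output layer is a second matrix W', which stores the representations of the context words. We take the dot product of the center word vector with each context word vector; the resulting scores can be positive or negative, so we apply softmax to turn them into a probability distribution. As a generative model, given a center word the model must predict how likely each word is to appear in its context.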
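The first step of that pipeline, selecting a column with a one-hot vector, can be sketched as follows. The matrix values are invented, and the layout (d rows, V columns, one column per word) is an assumption chosen to match the description above.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


# Hypothetical center-word matrix with d = 2 rows and V = 3 columns.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
x = [0, 1, 0]  # one-hot vector for the word with index 1

v_c = matvec(W, x)
column_1 = [row[1] for row in W]
print(v_c, column_1)
```

The product equals column 1 of W, so in practice the multiplication can be replaced by a direct lookup of that column.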
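Both matrices contain V word vectors, so every word has two vectors. Which one should be the final embedding handed to other applications? There are two strategies: add them or concatenate them. The CS224n programming assignment concatenates them; the summing alternative is shown commented out:
    import numpy as np
    # concatenate the input and output word vectors
    # (wordVectors and nWords come from the assignment's training code)
    wordVectors = np.concatenate(
        (wordVectors[:nWords, :], wordVectors[nWords:, :]),
        axis=0)
    # wordVectors = wordVectors[:nWords, :] + wordVectors[nWords:, :]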
The vectors in the center-word matrix W are usually called input vectors, and the vectors in the context-word matrix W' are called output vectors.
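- Training the model: computing the gradients of the parameter vector
We usually collect all parameters of the model into one long vector $\theta$. With d-dimensional word vectors and a vocabulary of V words:
$\theta = \begin{bmatrix} v_{aardvark} \\ v_{a} \\ \vdots \\ v_{zebra} \\ u_{aardvark} \\ u_{a} \\ \vdots \\ u_{zebra} \end{bmatrix} \in \mathbb{R}^{2dV}$
These are the parameters we then optimize.
Note that every word has two vector representations. That is why each word appears twice in $\theta$, and why its dimension is $dV$ multiplied by 2.
The standard practice when training a model is to take all of its parameters, put them into one big vector, and then optimize: change these parameters so as to maximize the model's objective function (equivalently, minimize its loss).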
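[How should we understand the dimension d versus the vocabulary size V here? Aren't they the same in a one-hot representation?] A one-hot vector does have V dimensions, but d is the dimension of the learned dense vectors; it is chosen independently of V and is typically far smaller.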
Our parameters are as follows:
for each word we have a little d-dimensional vector for when it is a center word and another for when it is a context word.
So with a vocabulary of some size, we have a vector for "aardvark" as a context word, a vector for "a" as a context word, and so on; then a vector for "aardvark" as a center word, a vector for "a" as a center word, and so on.
(This explains that a word corresponds to one word vector as a context word and another as a center word.)
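One way to picture the long parameter vector is to build it for a tiny vocabulary. The sketch below is only an illustration: the vocabulary, the dimension d = 4, the random initial values and the ordering (all center vectors first, then all context vectors) are assumptions.

```python
import random

vocab = ["aardvark", "a", "zebra"]  # hypothetical vocabulary, V = 3
d = 4                               # hypothetical vector dimension
V = len(vocab)

random.seed(0)
center = {w: [random.uniform(-0.1, 0.1) for _ in range(d)] for w in vocab}
outside = {w: [random.uniform(-0.1, 0.1) for _ in range(d)] for w in vocab}

# Flatten everything into one long list: center vectors, then context vectors.
theta = []
for w in vocab:
    theta.extend(center[w])
for w in vocab:
    theta.extend(outside[w])


def center_vector(theta, i, d):
    """Slice the center-word vector of word i back out of theta."""
    return theta[i * d:(i + 1) * d]


def outside_vector(theta, i, d, V):
    """Slice the context-word vector of word i back out of theta."""
    return theta[(V + i) * d:(V + i + 1) * d]


print(len(theta))  # 2 * d * V entries
```

Keeping a single flat vector means an optimizer only ever needs to update one object, while the slicing helpers recover the per-word vectors when needed.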
3. Research Highlight (Danqi Chen)
This highlight introduces a paper from Princeton titled "A Simple but Tough-to-Beat Baseline for Sentence Embeddings". In this lecture we are learning word vector representations, and we hope these vectors can encode word meanings.
- Sentence Embedding
- From Bag-of-words to Complex Models
  - Bag-of-words (BoW)
  - Complex Models
    - Recurrent neural networks
    - Recursive neural networks
    - Convolutional neural networks
- This paper
  - weighted Bag-of-words + remove some special direction
- Step 1
There are two steps. The first is to average the word vectors of the sentence, as in bag-of-words, but with a separate weight for each word:
$v_s = \dfrac{1}{|s|} \sum_{w \in s} \dfrac{a}{a + p(w)}\, v_w$
Here $a$ is a constant and $p(w)$ is the frequency of word $w$, so the weighted average down-weights frequent words.
In other words: for each sentence $s$ in a set of sentences, split the sentence into words; each word $w$ has vector $v_w$ and frequency $p(w)$, and the more frequent the word, the smaller its weight in the sentence vector $v_s$.
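Step 1 can be sketched in a few lines. The word vectors, the frequencies and the default value a = 1e-3 below are made-up values for illustration, and `weighted_sentence_vector` is a name chosen for this example.

```python
def weighted_sentence_vector(words, vectors, freq, a=1e-3):
    """Average the word vectors of a sentence, weighting each word by a / (a + p(w))."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for w in words:
        weight = a / (a + freq[w])  # frequent words get a small weight
        for k in range(dim):
            total[k] += weight * vectors[w][k]
    return [x / len(words) for x in total]


# Hypothetical 2-d word vectors and unigram frequencies.
vectors = {"the": [1.0, 0.0], "messi": [0.0, 1.0]}
freq = {"the": 0.05, "messi": 1e-5}

v_s = weighted_sentence_vector(["the", "messi"], vectors, freq)
print(v_s)
```

The rare word "messi" dominates the sentence vector, while the contribution of "the" is shrunk to almost nothing.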
- Why down-weight frequent words?
Common words like 'is', 'the', 'a' etc. tend to appear much more frequently than the words that are important to a document. For example, a document A on Lionel Messi will contain more occurrences of the word "Messi" than other documents do, but common words like "the" will also be present at high frequency in almost every document. Ideally, we want to down-weight the common words that occur in almost all documents and give more importance to words that appear in only a subset of documents.
- Step 2
After computing all of these sentence vectors, we compute their first principal component $u$ and subtract from each sentence vector its projection onto that first principal component:
$v_s \leftarrow v_s - u u^{\top} v_s$
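Step 2 can be sketched with a plain power iteration. This is an illustration under stated assumptions: the sentence vectors are invented, the dominant direction is computed without mean-centering (as the top eigenvector of the sum of $v_s v_s^{\top}$), and the helper names are made up. A real implementation would more likely call an SVD routine.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def first_direction(vectors, iterations=100):
    """Approximate the dominant direction of the vectors by power iteration."""
    dim = len(vectors[0])
    u = [1.0] * dim  # arbitrary starting guess
    for _ in range(iterations):
        # Apply sum_s v_s v_s^T to u, i.e. new_u = sum_s (v_s . u) v_s.
        new_u = [0.0] * dim
        for v in vectors:
            coeff = dot(v, u)
            for k in range(dim):
                new_u[k] += coeff * v[k]
        norm = math.sqrt(dot(new_u, new_u))
        u = [x / norm for x in new_u]
    return u


def remove_direction(vectors, u):
    """Subtract from every vector its projection onto the unit vector u."""
    cleaned = []
    for v in vectors:
        proj = dot(v, u)
        cleaned.append([x - proj * u_k for x, u_k in zip(v, u)])
    return cleaned


# Hypothetical sentence vectors that share a large common component.
sentence_vecs = [[3.0, 1.0], [2.9, -1.0], [3.1, 0.5]]
u = first_direction(sentence_vecs)
cleaned = remove_direction(sentence_vecs, u)
print(cleaned)
```

After the subtraction, every sentence vector is orthogonal to the shared direction, so what remains is what distinguishes the sentences from one another.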
- The idea of the paper
Basically, given the sentence representation, the probability of emitting a single word is related to the frequency of that word and also to how close the word is to the sentence representation.
4. Word2vec objective function gradients
5. Optimization Refresher
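We will optimize (maximize or minimize) our objective/cost functions.
To minimize the loss function we use gradient descent: $\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)$.
As simplified Python code:
    # J, corpus, theta, alpha and evaluate_gradient are placeholders defined elsewhere
    while True:
        theta_grad = evaluate_gradient(J, corpus, theta)
        theta = theta - alpha * theta_grad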
In the contour plot, the red lines are contour lines of the value of the objective function. When you calculate the gradient, its negative gives you the direction of steepest descent; you walk a little in that direction each time and, hopefully, move smoothly towards the minimum.
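A runnable version of the update loop needs a concrete objective. The sketch below swaps in a toy quadratic, $J(\theta) = (\theta_0 - 3)^2 + (\theta_1 + 1)^2$, which is not the word2vec objective; the step size and the number of iterations are likewise arbitrary choices for the example.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = (theta[0] - 3)^2 + (theta[1] + 1)^2."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta = [0.0, 0.0]  # starting point
alpha = 0.1         # step size (learning rate)

for _ in range(200):
    theta_grad = grad_J(theta)
    # Walk a little way in the direction of steepest descent.
    theta = [t - alpha * g for t, g in zip(theta, theta_grad)]

print(theta)
```

Each step shrinks the distance to the minimum at (3, -1) by a constant factor, which is the smooth walk towards the minimum described above.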
In practice we might have 40 billion tokens in our corpus, and working out the gradient of the objective function over a 40-billion-word corpus for every single update would take forever.
So instead we use stochastic gradient descent (SGD), which is our key tool: we take just one position in the text, with one center word and the words around it, and update the parameters using the gradient computed from that window alone.
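The difference from full gradient descent is that each update looks at a single sample. In the sketch below, a list of numbers stands in for the positions in a corpus and the per-sample loss is $(\theta - x)^2$; both are stand-ins chosen for illustration, not the word2vec loss.

```python
import random

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # stands in for the windows of a corpus
random.seed(0)

theta = 0.0
alpha = 0.01  # step size

for _ in range(5000):
    x = random.choice(data)   # sample one "position" at random
    grad = 2 * (theta - x)    # gradient of the loss on that sample only
    theta -= alpha * grad     # noisy but very cheap update

print(theta)
```

The estimate ends up close to 3.0, the minimizer of the average loss over all samples, even though no single update ever looked at the whole data set.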
Reference
1. Course video (YouTube)
https://www.youtube.com/watch?v=ERibwqs9p38&t=431s
2. Stanford cs224n syllabus
http://cs224d.stanford.edu/syllabus.html
3. Other technical blog consulted
http://www.hankcs.com/nlp/word-vector-representations-word2vec.html