[Course Notes] Lecture 2 - Stanford NLP CS224n

Lecture 2, Stanford CS224d: Simple Word Vector Representations: word2vec, GloVe



1. Word meaning

A question lies ahead

Q: How do we represent usable word meaning in a computer?
Common answer: use a taxonomy like WordNet that has hypernym relationships and synonym sets.

  • WordNet is one of the most famous taxonomic resources and is popular among computational linguists, because it is free to download and provides a lot of taxonomic information about words.
  • the components of WordNet
    [figure: the components of WordNet]
  • demo implemented in Python
    [figure: querying WordNet from Python]
    The picture above shows how to get hold of WordNet through NLTK, one of the main Python packages for NLP.

Running this under Python 3.7 prints the list of hypernyms of the word "panda":

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
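This output can be reproduced with a short NLTK snippet; a minimal sketch, assuming the WordNet corpus has already been fetched with nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

panda = wn.synset('panda.n.01')      # the noun sense of "panda"
hyper = lambda s: s.hypernyms()      # step one level up the taxonomy
print(list(panda.closure(hyper)))    # transitive closure: every hypernym up to 'entity'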

Discrete representation

Although the discrete representation of words described above is a useful linguistic resource, in practice it does not work as well as one might hope, because the "synonyms" it finds still differ in nuance. Its main limitations are:

  • it misses new words (it is hard to keep up to date)
  • it is subjective
  • it takes a lot of human labor to build and maintain
  • it is hard to compute word similarity accurately
    A large number of rule-based and statistical NLP systems treat words as atomic, indivisible units. In vector space each word is then represented by a vector with a single 1 and many 0s, which we call a "one-hot" representation. Its problems are:
    1. The vocabulary of a corpus is large, so the vectors become extremely high-dimensional.
    Dimensionality:
    20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
    - "speech" refers to the vocabulary of a speech recognizer
    - PTB: the Penn Treebank; in Python it can be loaded directly with: from nltk.corpus import ptb
    - big vocab: if we are building a machine translation system, we might use a 500,000-word vocabulary
    - Google 1T: Google released a roughly 1-terabyte corpus of web-crawl data
    2. Any two one-hot word vectors are orthogonal (their dot product, i.e. inner product, is zero);
    in other words, there is no natural notion of similarity (see the sketch after this list).
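A tiny sketch of that orthogonality problem, using hypothetical one-hot vectors for "motel" and "hotel" in a toy five-word vocabulary:

import numpy as np

# hypothetical toy vocabulary of 5 words; "motel" is word 1, "hotel" is word 3
motel = np.array([0, 1, 0, 0, 0])
hotel = np.array([0, 0, 0, 1, 0])

print(motel @ hotel)   # 0: the vectors are orthogonal, so one-hot encoding
                       # says nothing about how similar the two words are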

How to make neighbors represent words?

The linguist J. R. Firth proposed that a word's meaning can be obtained from its context. Firth even suggested that you only really know a word's meaning once you can put it into the right contexts.

"You shall know a word by the company it keeps." (J. R. Firth, 1957)
The philosopher Wittgenstein, who thought deeply about the philosophy of language, made a similar point: "the right way to think about the meaning of words is understanding their use in text."

Word meaning is defined in terms of vectors

We adjust the vector of each word and the vectors of its context words so that, from two vectors, we can estimate how similar the two words are, or predict a word's context from its vector. The approach is recursive: vectors are tuned against other vectors, much as dictionary senses are defined in terms of other words.

In addition, distributed representations stand in contrast to symbolic representations (also called localist or one-hot representations); "discrete representation" refers to the latter and is close in meaning to denotation. Do not confuse the words "distributed" and "discrete".

The basic idea of learning neural-network word embeddings

  1. Define a model that predicts the context of a given word:

$p(\text{context}\mid w_{t})$

  2. Define the loss function as:
    $J = 1 - p(w_{-t}\mid w_{t})$

Here $w_{-t}$ denotes the context of $w_{t}$ (the minus sign conventionally means "everything except"). The ultimate goal of training is to change the word-vector representations so that the loss approaches 0.

  3. Then collect training examples from many different positions in a large corpus and adjust the word vectors to minimize the loss, as sketched below.
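A small sketch of what "training examples from different positions in the corpus" means in practice: every (center word, context word) pair inside a window of radius m. The toy corpus and the radius here are made up for illustration:

# enumerate (center, context) training pairs from a toy corpus, window radius m = 2
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:5])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]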

2. Word2vec Introduction

Main idea of word2vec

[figure: the main idea of word2vec]

Skip-gram prediction

[figure: skip-gram prediction]

The idea of the skip-gram model: at each estimation step, pick one word as the center word and predict the context words within a window of a given size around it. The model is trained to maximize the probability of the observed context.

Details of word2vec

For each word t = 1, 2, …, T (where T is the length of the text), we predict the context words inside a window of "radius" m around it.

  • Objective function: for the given center word, maximize the predicted probability of each of its context words.

$$J'(\theta)=\prod_{t=1}^{T}\prod_{-m\leq j\leq m,\; j\neq 0} p(w_{t+j}\mid w_{t};\theta)$$

$\theta$ represents all the variables we will optimize.

During training we adjust the model's parameters so that the probability assigned to the context words of each center word is as high as possible.

  • To maximize the objective we take the negative logarithm of the probabilities (and average over the corpus), giving the loss function, the negative log-likelihood:

$$J(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leq j\leq m,\; j\neq 0}\log p(w_{t+j}\mid w_{t})$$

This is the negative log-likelihood.

The log function is the red curve in the figure below:

[figure: the log function (red curve)]

Since we take the log of a probability, $p(w_{t+j}\mid w_{t})$ satisfies $0\leq p\leq 1$, so only the part of the red curve over $[0,1]$ matters; negating it gives the negative log-likelihood shown below:

[figure: the negative log-likelihood over $[0,1]$]

We want this probability to be as large as possible: the closer it is to 1, the closer the loss is to 0, i.e. the loss reaches its minimum.
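A quick numerical check of that claim, with a few illustrative probabilities:

import numpy as np

# the better the prediction, the closer the per-word loss -log(p) is to 0
for p in (0.99, 0.5, 0.01):
    print(p, -np.log(p))   # 0.0100..., 0.6931..., 4.6051...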

[a small trick hidden in the function]

The function above looks as if it had a single parameter $\theta$; the window size also enters as a parameter here, and in fact the model has several hyperparameters that we will meet later in the course. All of them can be tuned, but for now we treat them as constants.

The window size is a parameter of the model. Actually, it turns out that there are several hyperparameters of the model. One is the window size, and it turns out that we'll come across a couple of other fudge factors later in the lecture. And all of those things are hyperparameters that you could adjust.

  • Terminology:

    Loss function = Cost function = Objective function

    In terms of terminology: loss function = cost function = objective function.

    For probability distributions, the most common loss is the cross-entropy loss.

  • How do we use these word vectors to minimize the negative log-likelihood?

    We first compute, for a given center word, a probability distribution over its likely context words; that distribution takes the following form:

    $$p(o\mid c)=\frac{\exp(u_{o}^{T}v_{c})}{\sum_{w=1}^{V}\exp(u_{w}^{T}v_{c})}$$

    Here $o$ is one particular output context word, $c$ is the center word, and $v_{c}$ and $u_{o}$ are the vector representations of the center word and the context word.

    Dot product: $u^{T}v=u\cdot v=\sum_{i=1}^{n}u_{i}v_{i}$

    The more similar the vectors $u$ and $v$ are, the larger their dot product.

    Iterating over $w=1,\dots,V$, the terms $u_{w}^{T}v_{c}$ compute the similarity between each word and $v_{c}$.

    The softmax then uses the center word to obtain a probability for each context word.

  • The softmax function: the standard map from $\mathbb{R}^{V}$ to a probability distribution (it turns arbitrary numbers into a probability distribution).

    Since raw scores (which may be positive or negative) cannot be used directly as probabilities, we simply exponentiate them, which maps any real number to a positive one, and then normalize to obtain probabilities (exponentiate to make positive, normalize to give a probability). A worked numpy sketch of the softmax and the full forward pass appears a little further below.

    $$p_{i}=\frac{e^{u_{i}}}{\sum_{j}e^{u_{j}}}$$

    It is called a softmax because exponentiation makes large numbers much larger and small numbers negligible, so its selective behavior resembles the max function.

    The reason why this is called a softmax function is because it's kind of close to a max function: when you exponentiate things, the big things get way bigger and so they really dominate, so this really blows out in the direction of a max function, but not fully. It's still a sort of soft thing.

  • Another trick: one word with two vector representations?

    You might think that one word should only have one vector representation, and if you really wanted to you could do that, but it turns out you can make the math considerably easier by saying that each word actually has two vector representations: one vector representation when it is the center word, and another vector representation when it is a context word.

    In the formula above,

    $$p(o\mid c)=\frac{\exp(u_{o}^{T}v_{c})}{\sum_{w=1}^{V}\exp(u_{w}^{T}v_{c})}$$

    $o$ is, as before, one particular output context word, $c$ is the center word, and $v_{c}$ and $u_{o}$ are the center-word and context-word vector representations.

    And it turns out that not only does that make the math a lot easier, because the two representations are separated during optimization rather than tied to each other; in practice it also empirically works a little better. So if your life is easier and better, who would not choose that?

    So we have two vectors for each word.

  • Question proposed by a student:

    When I am dealing with the context words, am I paying attention to where they are, or just to their identity?

    Where they are has nothing to do with it in this model; it's just: what is the identity of the word somewhere in the window? So there's just one probability distribution and one representation of the context word. Now, you know, that's not really a good idea; there are other models which absolutely pay attention to position and distance. And for some purposes, especially more syntactic purposes rather than semantic purposes, that actually helps a lot. But if you're more interested in just word meaning, it turns out that not paying attention to position actually tends to help you rather than hurt you.

    [figure: the skip-gram model architecture]

  • The figure above, read from left to right: the input is the one-hot representation of the center word $w_{t}$. Between the input layer and the projection layer there is a center-word representation matrix $W$; multiplying the one-hot vector by this matrix selects the corresponding column, which is the representation of the center word. From the projection layer to the output layer there is a second matrix that stores the context-word representations. We take the dot product between the center word and each context word; the results may be positive or negative, so we apply a softmax to turn them into a probability distribution. As a generative model, given a center word, the model predicts how likely every word is to appear in its context.

    Both matrices contain V word vectors, which means every word has two word vectors. Which one do we hand to downstream applications as the final embedding? There are two strategies: add them, or concatenate them. The CS224n programming assignment concatenates them:

    # concatenate the input and output word vectors
    # (assumes numpy as np; wordVectors holds the nWords input vectors
    #  stacked on top of the nWords output vectors)
    wordVectors = np.concatenate(
        (wordVectors[:nWords, :], wordVectors[nWords:, :]),
        axis=0)
    # alternative strategy: sum the two sets of vectors instead
    # wordVectors = wordVectors[:nWords, :] + wordVectors[nWords:, :]

The vectors in $W$ are usually called input vectors, and the vectors in $W'$ output vectors.

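Putting the pieces of this section together, here is a minimal numpy sketch of the forward pass just described: look up the center word's vector in the input matrix, take its dot product with every context-word vector in the output matrix, and apply the softmax to get $p(o\mid c)$. The matrix names, toy sizes, and random initialization are illustrative assumptions, not the assignment's code:

import numpy as np

vocab_size, dim = 10, 4                        # toy sizes, for illustration only
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, dim))      # center-word ("input") vectors, one per row
W_out = rng.normal(size=(vocab_size, dim))     # context-word ("output") vectors

def softmax(z):
    z = z - z.max()                            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

c = 3                                          # index of the center word
scores = W_out @ W_in[c]                       # u_w . v_c for every word w in the vocabulary
probs = softmax(scores)                        # p(o | c) for every candidate context word o
print(probs.sum())                             # 1.0: a valid probability distribution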
  • Training the model: computing the gradients of the parameter vector

    We collect all of the model's parameters into one long vector $\theta$. With $d$-dimensional word vectors and a vocabulary of size $V$:
    [figure: $\theta\in\mathbb{R}^{2dV}$, stacking each word's center-word vector $v$ and context-word vector $u$]

    We then optimize these parameters.

    Note that every word has two vector representations; that is why each word appears twice in $\theta$, and why the dimension of $\theta$ is $2dV$ rather than $dV$.

    During training, the standard practice is to take all the parameters out of the model, put them into one big vector $\theta$, and then optimize: change these parameters so as to maximize the model's objective (a small sketch of this layout appears after this list).

    [How should we understand the dimension d and the vocabulary size V here? Aren't they the same thing in a one-hot representation?]

    What our parameters are is this:

    for each word, we're going to have a little d-dimensional vector, one when it's a center word and one when it's a context word.

    And so we've got a vocabulary of some size, so we're gonna have a vector for "aardvark" as a context word, a vector for "a" as a context word, and then we're going to have a vector for "aardvark" as a center word, a vector for "a" as a center word.

    (This explains that each word has one vector when used as a context word and another vector when used as the center word.)
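A small sketch of how the $2dV$-dimensional parameter vector can be assembled; the sizes below are toy values chosen only for illustration:

import numpy as np

V_size, d = 5, 3                              # toy vocabulary size and vector dimension
rng = np.random.default_rng(0)
center_vecs = rng.normal(size=(V_size, d))    # one d-dimensional vector per word as a center word
context_vecs = rng.normal(size=(V_size, d))   # a second vector per word as a context word

theta = np.concatenate([center_vecs.ravel(), context_vecs.ravel()])
print(theta.shape)                            # (30,) = 2 * d * V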

3. Research Highlight (Danqi Chen)

This highlight introduces a paper from Princeton titled A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In this lecture we are learning word vector representations, and we hope these vectors encode word meanings; this paper asks the analogous question at the sentence level.

  • Sentence Embedding

    [figure: sentence embeddings]

  • From Bag-of-words to Complex Models

    • Bag-of-words (BoW)

      [figure: bag-of-words sentence representation]

    • Complex Models

      Recurrent neural networks

      Recursive neural networks

      Convolutional neural networks

  • This paper

    • weighted Bag-of-words + remove some special direction

      • Step 1

        [figure: Step 1, $v_{s}=\frac{1}{|s|}\sum_{w\in s}\frac{a}{a+p(w)}\,v_{w}$ for each sentence $s\in S$]

        There are two steps. The first step is that, just as when we compute the plain average of the word vector representations, they also take an average, but each word gets a separate weight. Here a is a constant and p(w) is the frequency of the word, so this weighted average basically down-weights the frequent words.

        In the expression above, a is a constant, $p(w)$ is the word's frequency, and the factor $\frac{a}{a+p(w)}$ down-weights high-frequency words.

        The figure says: given a collection of sentences $S$, how do we represent a sentence $s\in S$ as a vector $v_{s}$? We split the sentence into words, each word $w$ having vector representation $v_{w}$ and frequency $p(w)$, and we down-weight the words that occur frequently.

      • Why down-weight the high-frequency words?

        Common words like "is", "the", "a", etc. tend to appear quite frequently in comparison to the words that are important to a document. For example, a document A on Lionel Messi is going to contain more occurrences of the word "Messi" than other documents do. But common words like "the" are also going to be present at high frequency in almost every document. Ideally, we want to down-weight the common words occurring in almost all documents and give more importance to words that appear in a subset of documents.

      • Step 2
        [figure: Step 2, subtracting the projection of each $v_{s}$ onto the first principal component]

        For step 2, after computing all of these sentence vector representations, we compute their first principal component and subtract each sentence's projection onto that first principal component (a combined sketch of both steps appears after this list).

      • The idea of the paper
        [figure: the paper's generative model for a word given the sentence representation]

      So basically, the idea is that given the sentence representation, the probability of emitting any single word is related to the frequency of the word, and also to how close the word is to this sentence representation.
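A rough numpy sketch of the two steps under the assumptions above: sentences is a list of tokenized sentences, word_vec maps a word to its vector, word_freq maps a word to its unigram probability p(w), and a is the small constant. The names and the helper are illustrative, not the paper's released code:

import numpy as np

def sif_embeddings(sentences, word_vec, word_freq, a=1e-3):
    # Step 1: frequency-weighted average of the word vectors in each sentence
    vs = np.array([
        np.mean([a / (a + word_freq[w]) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: subtract each sentence's projection onto the first principal component
    u, _, _ = np.linalg.svd(vs.T, full_matrices=False)
    pc = u[:, 0]
    return vs - np.outer(vs @ pc, pc)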

4. Word2vec objective function gradients
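As a sketch that follows directly from the softmax definition above, differentiating the log-probability with respect to the center-word vector $v_{c}$ gives the familiar "observed minus expected" form:

$$\frac{\partial}{\partial v_{c}}\log p(o\mid c)=u_{o}-\sum_{w=1}^{V}p(w\mid c)\,u_{w}$$

that is, the observed context vector $u_{o}$ minus the model's current expectation of the context vector. The gradients with respect to the context vectors $u_{w}$ have the same observed-minus-expected structure.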

5. Optimization Refresher

We will optimize (maximize or minimize) our objective/cost functions.

To minimize the loss function, we use gradient descent.

[figure: the gradient descent update, $\theta^{new}=\theta^{old}-\alpha\nabla_{\theta}J(\theta)$, where $\alpha$ is the step size]

Expressed as simplified Python code:

while True:
    # gradient of the objective J over the entire corpus
    theta_grad = evaluate_gradient(J, corpus, theta)
    # step in the direction of steepest descent, scaled by the learning rate alpha
    theta = theta - alpha * theta_grad

[figure: contour plot of the objective, with gradient descent steps walking toward the minimum]

The picture above shows the red lines as the contour lines of the objective function's value. When you calculate the gradient, it gives you the direction of steepest descent; you walk a little bit in that direction each time, and you hopefully walk smoothly towards the minimum.

[figure: stochastic gradient descent]

In practice we might have 40 billion tokens in our corpus to go through, and if you have to work out the gradient of your objective function over a 40-billion-word corpus, that is going to take forever.

So instead, what we do is use stochastic gradient descent; SGD is our key tool. We just take one position in the text, with one center word and the words around it, and update the parameters from that single window, as sketched below.
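A self-contained toy sketch of that idea, combining the softmax gradient from section 4 with per-window SGD updates. The corpus, dimensions, learning rate, and number of steps are made up for illustration; this is not the lecture's code:

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V_size, d, m, alpha = len(vocab), 10, 2, 0.05      # vocab size, dimension, radius, step size

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(V_size, d))        # center-word vectors v
U = rng.normal(scale=0.1, size=(V_size, d))        # context-word vectors u

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1000):
    t = rng.integers(len(corpus))                  # pick one position in the text
    c = idx[corpus[t]]                             # its center word
    for j in range(-m, m + 1):                     # each context word in the window
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        o = idx[corpus[t + j]]
        probs = softmax(U @ V[c])                  # p(w | c) for every word w
        grad_vc = U.T @ probs - U[o]               # expected context vector minus observed u_o
        grad_U = np.outer(probs, V[c])
        grad_U[o] -= V[c]
        V[c] -= alpha * grad_vc                    # update from this single window only
        U -= alpha * grad_U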

Reference

1. Course video (YouTube): https://www.youtube.com/watch?v=ERibwqs9p38&t=431s

2. Stanford CS224d/CS224n syllabus: http://cs224d.stanford.edu/syllabus.html

3. Other reference blog: http://www.hankcs.com/nlp/word-vector-representations-word2vec.html
