Lecture 1
Introduction and Word Vectors
I mainly recorded the word-vector part of the lecture.
- Required tools: plain Python and PyTorch or TensorFlow.
Definition: meaning (from the Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
The commonest linguistic way of thinking of meaning:
signifier (symbol) <=> signified (idea or thing)
=> denotational semantics
Compared with traditional NLP:
In traditional NLP, we regard words as discrete symbols ("hotel", "motel"): a localist representation.
Words can be represented by one-hot vectors (one-hot encoding):
For example:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
vector dimension = number of words in the vocabulary (e.g., 500,000)
But how do we capture relationships between the meanings of words?
Example: in web search, if a user searches for "Seattle motel", we would also like to match documents containing "Seattle hotel", because there is almost no difference between the two words.
But the one-hot vectors of the two words are completely different:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
There is no similarity relationship between them; in mathematical terms, these two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors.
And a word-similarity table would be too huge and incomplete.
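A minimal sketch of this problem (using NumPy and a tiny made-up vocabulary; a real vocabulary could have ~500,000 words): the dot product of any two distinct one-hot vectors is always 0, so they carry no similarity information.

```python
import numpy as np

# Tiny hypothetical vocabulary for illustration only.
vocab = ["seattle", "hotel", "motel", "airport", "city"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel = one_hot("hotel")
motel = one_hot("motel")

# Two different one-hot vectors are orthogonal, so "hotel" looks no more
# similar to "motel" than to any other word in the vocabulary.
print(hotel @ motel)   # 0.0
print(hotel @ hotel)   # 1.0
```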
Solution: learn to encode similarity in the vectors themselves.
Introduce a new idea: distributional semantics: a word's meaning is given by the words that frequently appear close by.
Word vectors (sometimes called word embeddings or word representations) are a distributed representation.
A word vector is a smallish dense vector in which all of the numbers are non-zero.
In the course video, there is an example for the word "banking", shown as a nine-dimensional vector. In practice people use a larger dimensionality: 50 is roughly the minimum, 300 is a typical size on a laptop, and 1000 to 3000 may be better if you want to really maximize performance.
Word2vec --- a framework for learning word vectors.
Main ideas:
- We have a large corpus of text (a body of text).
- Every word in a fixed vocabulary is represented by a vector.
- Go through each position t in the text, which has a centre word c and context words o ("outside").
- Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa), as in the formulas below.
- Keep adjusting the word vectors to maximize this probability.
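For reference, the standard skip-gram formulation of word2vec makes this concrete: the probability of an outside word o given the centre word c is a softmax over the inner products of the "outside" vector $u_o$ and the centre vector $v_c$, and training minimizes the average negative log-likelihood over all positions t and window offsets j:

$$
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
$$

$$
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
$$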
Then I attach some slides here, which I think are hard to understand if you have no background in machine learning or deep learning. (Having just finished a probability course, I feel this is quite similar to maximum likelihood estimation.)
This is my preliminary understanding of Lecture 1; if anything is incorrect, I will keep refining and correcting it.
Lecture 2
Representing word meaning
Word meaning is the concept or thing that a word refers to. How, then, can we obtain a usable representation of word meaning in a computer? For English, the usual approach is WordNet, a network of words built from synonym relations and word-hierarchy (hypernym) relations. The well-known NLP library NLTK includes WordNet; below are two examples of using it:
The example on the left retrieves the synonyms for each sense of the word "good"; the example on the right retrieves the hypernyms of the word "panda".
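A sketch of those two examples with NLTK (the exact code on the slides may differ slightly; the WordNet data must be downloaded once with nltk.download):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")  # fetch the WordNet data the first time

# Left example: synonyms (lemmas) for every sense of "good".
poses = {"n": "noun", "v": "verb", "s": "adj (s)", "a": "adj", "r": "adv"}
for synset in wn.synsets("good"):
    lemmas = ", ".join(lemma.name() for lemma in synset.lemmas())
    print(f"{poses[synset.pos()]}: {lemmas}")

# Right example: hypernyms ("is-a" ancestors) of "panda".
panda = wn.synset("panda.n.01")
hypernym = lambda s: s.hypernyms()
print(list(panda.closure(hypernym)))
```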
Representing words as discrete symbols with one-hot vectors is much simpler than WordNet, but it also has many problems, for example:
- The curse of dimensionality: such sparse vectors incur a huge cost in both storage and training.
- Every vector is orthogonal to every other, and all Euclidean distances are equal, so the similarity between words cannot be measured directly.
So people want to construct dense vectors, built from each word's characteristics (such as the contexts it co-occurs with), to represent words in a way that captures their meaning.
Representing word meaning through context
The resulting vectors look like the figure below. This way of representing words is called word vectors, also known as word embeddings or word representations. Such word vectors are relatively easy to obtain, and word similarity can be computed with methods such as cosine similarity.
Example of word vectors
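As a quick illustration of measuring similarity with dense vectors, here is a cosine-similarity sketch; the low-dimensional vectors below are made up for illustration (real word vectors are learned and typically 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors, chosen only to make the point.
hotel = np.array([0.29, 0.80, -0.31, 0.52])
motel = np.array([0.31, 0.77, -0.25, 0.48])
banking = np.array([-0.61, 0.12, 0.90, -0.05])

print(cosine_similarity(hotel, motel))    # close to 1: similar words
print(cosine_similarity(hotel, banking))  # much lower: dissimilar words
```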
Word2Vec
Word2Vec [1] is a method that uses a neural network to learn word vectors from context. Let us first introduce the two most basic Word2Vec models: Skip-gram and CBOW (Continuous Bag-Of-Words).
Note: to describe Word2Vec in more detail, this section also draws on reference [2], so the notation is not exactly the same as in the course.
Skip-gram
The idea of Skip-gram is simple: first select a center word, then use that word to predict the context words within a window of a given size. For example:
When we select "into" as the center word and the window size is 2, the context words are "problems", "turning", "banking", and "crises". Note that in Word2Vec the distance between the center word and a context word is considered unimportant; all context words are treated equally. Our goal is to use the word "into" to predict the other four words, and whenever the prediction does not match the actual words, the word vectors are updated. We then slide the window one word to the right, giving:
Now the center word becomes "banking" and the context words become "turning", "into", "crises", and "as". By repeatedly updating in this way, we eventually obtain our word vectors.
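A small sketch (plain Python, illustrative only) of how Skip-gram training pairs could be generated by sliding a window over the text in this way:

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (center, context) pairs, treating every word in the window equally."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window_size, window_size + 1):
            if j == 0:
                continue  # skip the center word itself
            pos = t + j
            if 0 <= pos < len(tokens):
                pairs.append((center, tokens[pos]))
    return pairs

sentence = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# e.g. with center "into": into -> problems, into -> turning,
#      into -> banking, into -> crises
```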
Lecture 2: to be continued.