Lecture 1
Introduction and Word Vectors
I mainly recorded the word-vector part of the lecture.
- Required tools: plain Python and PyTorch or TensorFlow.
Definition: meaning (from the Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
The commonest linguistic way of thinking of meaning:
signifier (symbol) <=> signified (idea or thing)
=> denotational semantics
Compared with traditional NLP:
In traditional NLP, we regard words as discrete symbols ("hotel", "motel"): a localist representation.
Words can be represented by one-hot vectors (one-hot encoding):
For example:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
vector dimension = number of words in the vocabulary (e.g., 500,000)
But how do we capture relationships between the meanings of words?
Example: in web search, if a user searches for "Seattle motel", we would also like to match documents containing "Seattle hotel", because there is almost no difference between the two words.
But the one-hot vectors of the two words are completely different:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
There is no similarity relationship between them; in mathematical terms, these two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors.
And a word-similarity table would be too huge and incomplete.
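A minimal sketch of this problem (using NumPy and a tiny made-up vocabulary; a real vocabulary could have ~500,000 words): the dot product of any two distinct one-hot vectors is always 0, so they carry no similarity information.

```python
import numpy as np

# Tiny hypothetical vocabulary for illustration only.
vocab = ["seattle", "hotel", "motel", "airport", "city"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel = one_hot("hotel")
motel = one_hot("motel")

# Two different one-hot vectors are orthogonal, so "hotel" looks no more
# similar to "motel" than to any other word in the vocabulary.
print(hotel @ motel)   # 0.0
print(hotel @ hotel)   # 1.0
```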
Solution: learn to encode similarity in the vectors themselves.
Introduce a new idea: distributional semantics: a word's meaning is given by the words that frequently appear close by.
Word vectors (sometimes called word embeddings or word representations) are a distributed representation.
A word vector is a smallish dense vector in which all of the numbers are non-zero.
In the course video, there is an example for the word "banking", shown as a nine-dimensional vector. In practice people use a larger dimensionality: 50 is roughly the minimum, 300 is a typical size on a laptop, and 1000 to 3000 may be better if you want to really maximize performance.
Word2vec --- a framework for learning word vectors.
Main ideas:
- We have a large corpus of text (a body of text).
- Every word in a fixed vocabulary is represented by a vector.
- Go through each position t in the text, which has a centre word c and context words o ("outside").
- Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa), as in the formulas below.
- Keep adjusting the word vectors to maximize this probability.
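For reference, the standard skip-gram formulation of word2vec makes this concrete: the probability of an outside word o given the centre word c is a softmax over the inner products of the "outside" vector $u_o$ and the centre vector $v_c$, and training minimizes the average negative log-likelihood over all positions t and window offsets j:

$$
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
$$

$$
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
$$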
Then I attach some slides here, which I think are hard to understand if you have no background in machine learning or deep learning. (Having just finished a probability course, I feel this is quite similar to maximum likelihood estimation.)
This is my preliminary understanding of Lecture 1; if anything is incorrect, I will keep refining and correcting it.
Lecture 2
Representing word meaning
Word meaning is the concept or thing that a word refers to. How, then, can we obtain a usable representation of word meaning in a computer? For English, the usual approach is WordNet, a network of words built from synonym relations and word-hierarchy (hypernym) relations. The well-known NLP library NLTK includes WordNet; below are two examples of using it:
The example on the left retrieves the synonyms for each sense of the word "good"; the example on the right retrieves the hypernyms of the word "panda".
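A sketch of those two examples with NLTK (the exact code on the slides may differ slightly; the WordNet data must be downloaded once with nltk.download):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")  # fetch the WordNet data the first time

# Left example: synonyms (lemmas) for every sense of "good".
poses = {"n": "noun", "v": "verb", "s": "adj (s)", "a": "adj", "r": "adv"}
for synset in wn.synsets("good"):
    lemmas = ", ".join(lemma.name() for lemma in synset.lemmas())
    print(f"{poses[synset.pos()]}: {lemmas}")

# Right example: hypernyms ("is-a" ancestors) of "panda".
panda = wn.synset("panda.n.01")
hypernym = lambda s: s.hypernyms()
print(list(panda.closure(hypernym)))
```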
Representing words as discrete symbols with one-hot vectors is much simpler than WordNet, but it also has many problems, for example:
- The curse of dimensionality: such sparse vectors incur a huge cost in both storage and training.
- Every vector is orthogonal to every other, and all Euclidean distances are equal, so the similarity between words cannot be measured directly.
So people want to construct dense vectors, built from each word's characteristics (such as the contexts it co-occurs with), to represent words in a way that captures their meaning.
Representing word meaning through context
The resulting vectors look like the figure below. This way of representing words is called word vectors, also known as word embeddings or word representations. Such word vectors are relatively easy to obtain, and word similarity can be computed with methods such as cosine similarity.
Example of word vectors
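As a quick illustration of measuring similarity with dense vectors, here is a cosine-similarity sketch; the low-dimensional vectors below are made up for illustration (real word vectors are learned and typically 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors, chosen only to make the point.
hotel = np.array([0.29, 0.80, -0.31, 0.52])
motel = np.array([0.31, 0.77, -0.25, 0.48])
banking = np.array([-0.61, 0.12, 0.90, -0.05])

print(cosine_similarity(hotel, motel))    # close to 1: similar words
print(cosine_similarity(hotel, banking))  # much lower: dissimilar words
```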
Word2Vec
Word2Vec [1] is a method that uses a neural network to learn word vectors from context. Let us first introduce the two most basic Word2Vec models: Skip-gram and CBOW (Continuous Bag-Of-Words).
Note: to describe Word2Vec in more detail, this section also draws on reference [2], so the notation is not exactly the same as in the course.
Skip-gram
The idea of Skip-gram is simple: first select a center word, then use that word to predict the context words within a window of a given size. For example:
When we select "into" as the center word and the window size is 2, the context words are "problems", "turning", "banking", and "crises". Note that in Word2Vec the distance between the center word and a context word is considered unimportant; all context words are treated equally. Our goal is to use the word "into" to predict the other four words, and whenever the prediction does not match the actual words, the word vectors are updated. We then slide the window one word to the right, giving:
Now the center word becomes "banking" and the context words become "turning", "into", "crises", and "as". By repeatedly updating in this way, we eventually obtain our word vectors.
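A small sketch (plain Python, illustrative only) of how Skip-gram training pairs could be generated by sliding a window over the text in this way:

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (center, context) pairs, treating every word in the window equally."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window_size, window_size + 1):
            if j == 0:
                continue  # skip the center word itself
            pos = t + j
            if 0 <= pos < len(tokens):
                pairs.append((center, tokens[pos]))
    return pairs

sentence = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# e.g. with center "into": into -> problems, into -> turning,
#      into -> banking, into -> crises
```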
Lecture 2: to be continued.