Lecture 01 : Introduction and Word Vectors

slides: link 1, link 2

notes: link

video: link

1. Word representation

1.1 WordNet

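Misses nuance: it cannot express fine-grained differences between words
Misses new meanings of words
Subjective
Requires human labor to create and maintain
Cannot be used to compute word similarity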

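1.2 One-hot

High dimensionality: each vector is as long as the vocabulary
There is no natural notion of similarity for one-hot vectors! Any two distinct words are orthogonal, so there is no way to compute similarity from the vectors.

Solution: learn to encode similarity in the vectors themselves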
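The orthogonality problem can be checked directly. The following is a minimal sketch; the three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vocabulary; the words are illustrative only.
vocab = ["motel", "hotel", "cat"]
vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Two different words always give 0, no matter how related they are.
sim_motel_hotel = dot(vectors["motel"], vectors["hotel"])
# A word compared with itself gives 1.
sim_motel_motel = dot(vectors["motel"], vectors["motel"])
print(sim_motel_hotel, sim_motel_motel)
```

Because every pair of distinct words scores 0, the dot product carries no information about which words are related.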

1.3 Word vectors

Word vectors, also called word embeddings or word representations, are a distributed representation.

Distributional semantics: a word's meaning is given by the words that frequently appear close by.

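2. SVD Based Methods

First loop over a large dataset and accumulate the word co-occurrence counts into a matrix $X$, then perform singular value decomposition on $X$ to get $USV^T$, and use the rows of $U$ as the word vectors.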

2.1 Word-Document Matrix

Entry $X_{ij}$ records that word $i$ appears in document $j$. This matrix is very large, $R^{|V|\times M}$, because the number of documents $M$ is very large.

2.2 Window based Co-occurrence Matrix

Count the number of times each word appears inside a window of a fixed size around each word of interest.
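A minimal sketch of window-based counting, assuming whitespace tokenization and a symmetric window. The toy sentences and the function name `cooccurrence` are illustrative, not from the lecture.

```python
from collections import defaultdict


def cooccurrence(tokens, window=1):
    """Count how often each ordered (word, neighbor) pair occurs within `window` positions."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbor
                counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus; windows do not cross sentence boundaries here.
corpus = ["I enjoy flying", "I like NLP", "I like deep learning"]
counts = defaultdict(int)
for sentence in corpus:
    for pair, c in cooccurrence(sentence.split(), window=1).items():
        counts[pair] += c

print(counts[("I", "like")], counts[("I", "enjoy")])
```

The counts are symmetric, so storing them as a $|V|\times|V|$ matrix gives the matrix $X$ that the SVD step operates on.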

2.3 Applying SVD to the cooccurrence matrix

This gives $X=USV^T$. We then cut the singular values off at some index $k$, chosen from the desired percentage of variance captured:

$\frac{\sum_{i=1}^k \sigma _i}{\sum_{i=1}^{|V|} \sigma_i}$

$U_{1:|V|,1:k}$ is our word vector matrix.
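Choosing the cut-off index can be sketched as follows, assuming the singular values are already available in descending order. The values in `sigmas`, the target ratio, and the function name `choose_k` are hypothetical.

```python
def choose_k(singular_values, target=0.9):
    """Return the smallest k whose leading singular values reach `target` of the total sum."""
    total = sum(singular_values)
    running = 0.0
    for k, sigma in enumerate(singular_values, start=1):
        running += sigma
        if running / total >= target:
            return k
    return len(singular_values)


# Hypothetical singular values, sorted from largest to smallest.
sigmas = [5.0, 3.0, 1.0, 0.5, 0.5]
k = choose_k(sigmas, target=0.85)
print(k)
```

Here the first two values cover 0.8 of the total and the first three cover 0.9, so a target of 0.85 keeps three dimensions.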

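This approach is hard to apply to large corpora, because the matrix becomes very large and sparse and the SVD is expensive to compute.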

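3. Iteration Based Methods - Word2vec

Instead of computing and storing global information, the model learns one iteration at a time and is trained toward a specific objective. At every iteration we run the model, evaluate the error, and follow an update rule, and in the end we have learned the word vectors.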

Go through each position t in the text, which has a center word c and context ("outside") words o
Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa) (note to self: this part needs another look)
Keep adjusting the word vectors to maximize this probability
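The first step, walking over positions and collecting (center, context) pairs, can be sketched as below. The sentence, the window size, and the name `training_pairs` are illustrative assumptions; positions that fall outside the text are simply skipped.

```python
def training_pairs(tokens, m=2):
    """Return (center, context) pairs for every position t and offset j with 0 < |j| <= m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs


# Hypothetical example sentence.
tokens = "the cat jumped over the puddle".split()
pairs = training_pairs(tokens, m=1)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Skip-gram uses each pair as "predict the context from the center", while CBOW groups the pairs of one position and predicts the center from the whole bag.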

Two algorithms:

Skip-grams (SG)

Predict context ("outside") words (position independent) given the center word

Continuous Bag of Words (CBOW)

Predict the center word from a (bag of) context words

Training methods:
1. Negative sampling
2. Hierarchical softmax

3.1 Continuous Bag of Words Model (CBOW)

$V$ and $U$ are the input and output word vector matrices, respectively.

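Steps:

Generate the one-hot word vectors for the input context of size $m$: $(x^{(c-m)},...,x^{(c-1)},x^{(c+1)},...,x^{(c+m)}\in R^{|V|})$
Multiply each one-hot vector by $V$ to get the embedded word vectors
Average them: $\hat v=\frac{v_{c-m}+...+v_{c-1}+v_{c+1}+...+v_{c+m}}{2m}$
Generate the score vector $z=U\hat v \in R^{|V|}$
Turn the scores into probabilities $\hat y=softmax(z) \in R^{|V|}$
We want $\hat y$ to match $y$, which is also a one-hot vector (that of the actual center word).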

We then need a function that measures the distance between $y$ and $\hat y$; we use cross entropy:

$H(\hat y,y)=-\sum_{j=1}^{|V|}y_j\log(\hat y_j)$

Since $y$ is one-hot, only the term of the center word survives, so:

$minimize\ J = -\log P(w_c|w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}) \\ = -\log P(u_c|\hat v)\\ =-\log \frac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\\ =-u_c^T\hat v+\log \sum_{j=1}^{|V|}\exp(u_j^T\hat v)$
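The CBOW forward pass and loss can be sketched in plain Python. This is only an illustration under stated assumptions: the vectors are made-up numbers, `V[i]` and `U[i]` hold the input and output vector of word `i` (so multiplying a one-hot vector by the matrix becomes a simple lookup), and the function names are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_loss(V, U, context_ids, center_id):
    """Average the context embeddings, score every word, return (cross-entropy loss, y_hat)."""
    n = len(V[0])
    v_hat = [sum(V[i][d] for i in context_ids) / len(context_ids) for d in range(n)]
    z = [sum(u[d] * v_hat[d] for d in range(n)) for u in U]
    y_hat = softmax(z)
    # y is one-hot, so the cross entropy reduces to -log of the center word's probability.
    return -math.log(y_hat[center_id]), y_hat


# Hypothetical 4-word vocabulary with 2-dimensional vectors.
V = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
U = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0], [0.1, -0.2]]
loss, y_hat = cbow_loss(V, U, context_ids=[0, 3], center_id=2)
print(loss)
```

The returned loss equals $-u_c^T\hat v+\log\sum_j \exp(u_j^T\hat v)$ for these numbers, which is the last line of the derivation written as code.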

3.2 Skip-Gram Model

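The center word predicts its context words. (Here we call the word "jumped", the given center word of the example, the context.)

Steps:

Generate the one-hot vector of the center word, $x\in R^{|V|}$
Get the embedded vector of the center word, $v_c=Vx \in R^n$
Generate the score vector $z=Uv_c$
Turn the scores into probabilities $\hat y=softmax(z)$
We want $\hat y$ to be close to $y$

We then need an objective function. We invoke a Naive Bayes assumption: given the center word, all output words are completely independent.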

So:

$minimize\ J=-\log P(w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}|w_c)\\ =-\log \prod_{j=0,j\neq m}^{2m}P(u_{c-m+j}|v_c)\\ =-\log \prod_{j=0,j\neq m}^{2m}\frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|}\exp(u_k^T v_c)}\\ =-\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^{T}v_c+2m\log\sum_{k=1}^{|V|}\exp(u_k^Tv_c)$
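The final line of this objective can be evaluated directly. The sketch below assumes made-up 2-dimensional vectors, with `U[k]` the output vector of word `k`; `skipgram_loss` is a hypothetical helper name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def skipgram_loss(v_c, U, context_ids):
    """Return -sum_o u_o.v_c + (number of context words) * log sum_k exp(u_k.v_c)."""
    scores = [dot(u, v_c) for u in U]
    shift = max(scores)  # subtract the max before exponentiating for stability
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return -sum(scores[o] for o in context_ids) + len(context_ids) * log_norm


# Hypothetical 4-word vocabulary and center vector.
U = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0], [0.1, -0.2]]
v_c = [0.3, 0.1]
loss = skipgram_loss(v_c, U, context_ids=[1, 3])

# With an all-zero center vector every word gets probability 1/|V|.
uniform_loss = skipgram_loss([0.0, 0.0], U, context_ids=[1, 3])
print(loss, uniform_loss)
```

With a zero center vector the loss is $2\log 4$ for two context words, which is a handy sanity check. Note that the normalizer still sums over the whole vocabulary.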

3.3 Negative Sampling

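Looking at the objective functions of the two models above, every iteration has to sum over all $|V|$ words, which is very time-consuming. So instead of computing over the whole vocabulary, we only sample a few negative examples from a noise distribution $P_n(w)$, whose probabilities match the ordering of the frequency of the vocabulary.

Consider a word and context pair $(w,c)$. $P(D=1|w,c)$ and $P(D=0|w,c)$ denote the probabilities that $(w,c)$ is a positive sample and a negative sample, respectively.

We model the probability that $(w,c)$ is a positive sample as: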

$P(D=1|w,c,\theta)=\sigma(v_c^T v_w)=\frac{1}{1+e^{(-v_c^T v_w)}}$

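The objective function is redefined accordingly.

For skip-gram, for an observed context word $c-m+j$ given the center word $c$:

$-\log\sigma(u_{c-m+j}^T v_c)-\sum_{k=1}^{K}\log\sigma(-\hat u_k^T v_c)$

For CBOW, for an observed center word vector $u_c$ given the averaged context vector $\hat v$:

$-\log\sigma(u_c^T \hat v)-\sum_{k=1}^{K}\log\sigma(-\hat u_k^T \hat v)$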
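As a rough sketch, assume the negative-sampling loss for one (center, context) pair has the usual form $-\log\sigma(u_o^T v_c)-\sum_{k=1}^{K}\log\sigma(-\hat u_k^T v_c)$. All vectors below are made-up numbers, and `negatives` stands in for the $K$ sampled vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_loss(v_c, u_o, negatives):
    """One positive term plus one term per sampled negative vector."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


# Hypothetical vectors: one true context word and K = 3 sampled negatives.
v_c = [0.3, 0.1]
u_o = [0.2, 0.4]
negatives = [[-0.1, 0.3], [0.0, 0.2], [0.5, -0.4]]
loss = negative_sampling_loss(v_c, u_o, negatives)

# With a zero center vector every sigmoid is 0.5, so the loss is (K + 1) * log 2.
zero_loss = negative_sampling_loss([0.0, 0.0], u_o, negatives)
print(loss, zero_loss)
```

The cost per pair now depends on $K$ rather than on $|V|$. For CBOW the same function applies with the averaged context vector in place of `v_c` and the center word's output vector in place of `u_o`.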

$\{\hat u_k|k=1,...,K\}$ are sampled from $P_n(w)$. So what does $P_n(w)$ look like?

What seems to work best is the Unigram Model raised to the power of 3/4. Why 3/4?
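One common reading of the 3/4 power is that it flattens the unigram distribution, so rare words are sampled more often than their raw frequency suggests while frequent words barely change. The sketch below illustrates this; the unigram probabilities and the name `noise_distribution` are assumptions.

```python
import random


def noise_distribution(unigram, power=0.75):
    """Raise each unigram probability to `power` and renormalize so the result sums to 1."""
    weights = {w: p ** power for w, p in unigram.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


# Hypothetical unigram probabilities for three words.
unigram = {"is": 0.9, "constitution": 0.09, "bombastic": 0.01}
p_n = noise_distribution(unigram)
for word in unigram:
    print(word, unigram[word], round(p_n[word], 4))

# Draw K = 5 negative samples from the noise distribution.
rng = random.Random(0)
negatives = rng.choices(list(p_n), weights=list(p_n.values()), k=5)
```

In this toy example the rare word "bombastic" ends up with a noticeably larger share than its raw 0.01, while "is" loses a little.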

3.4 Hierarchical Softmax

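In practice, hierarchical softmax tends to work better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.

Hierarchical softmax uses a binary tree in which each leaf node represents a word.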

In this model, the probability of a word $w$ given a vector $w_i$, $P(w|w_i)$, is equal to the probability of a random walk starting in the root and ending in the leaf node corresponding to $w$. The cost is $O(\log|V|)$.

$L(w)$: the number of nodes in the path from the root to the leaf $w$

$n(w,i)$: the $i$-th node on this path, with associated vector $v_{n(w,i)}$; $n(w,1)$ is the root, and $n(w,L(w))$ is the parent of $w$.

For each inner node $n$, we choose one of its children and call it $ch(n)$ (always the left node).

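So the probability $P(w|w_i)$ is the product, over the inner nodes $n$ on the path from the root to $w$, of $\sigma(v_n^T v_{w_i})$ when the path continues to $ch(n)$ and $\sigma(-v_n^T v_{w_i})$ otherwise.

Because, for every inner node $n$,
$\sigma(v_n^T v_{w_i})+\sigma(-v_n^T v_{w_i})=1$
the probabilities are guaranteed to be normalized:
$\sum_{w=1}^{|V|}P(w|w_i)=1$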
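The normalization can be checked on a tiny tree. This sketch assumes the path probability is a product of sigmoids, with a positive sign when the walk goes to the left child and a negative sign otherwise. The tree shape, node names, and all vectors are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hs_probability(path, node_vectors, v_wi):
    """Multiply sigmoid(+v_n.v_wi) for left turns and sigmoid(-v_n.v_wi) for right turns."""
    p = 1.0
    for node, goes_left in path:
        sign = 1.0 if goes_left else -1.0
        p *= sigmoid(sign * dot(node_vectors[node], v_wi))
    return p


# Hypothetical balanced tree: a root with two inner children and four leaf words.
node_vectors = {"root": [0.5, -0.3], "a": [0.1, 0.8], "b": [-0.4, 0.2]}
paths = {
    "w1": [("root", True), ("a", True)],
    "w2": [("root", True), ("a", False)],
    "w3": [("root", False), ("b", True)],
    "w4": [("root", False), ("b", False)],
}
v_wi = [0.7, -1.2]
probs = {w: hs_probability(p, node_vectors, v_wi) for w, p in paths.items()}
total = sum(probs.values())
print(probs, total)
```

Each word needs only as many sigmoid evaluations as its path is long, which is where the $O(\log|V|)$ cost comes from in a balanced tree.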

To train the model, our goal is still to minimize the negative log likelihood $-\log P(w|w_i)$. But instead of updating output vectors per word, we update the vectors of the nodes in the binary tree that are in the path from root to leaf node.
The speed of this method is determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree.
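To see why a Huffman tree gives frequent words shorter paths, here is a minimal sketch that computes only the leaf depths using the standard `heapq` module. The word counts are invented and `huffman_depths` is a hypothetical helper, not the word2vec implementation.

```python
import heapq
import itertools


def huffman_depths(freqs):
    """Return the depth of each word's leaf in a Huffman tree built from `freqs`."""
    counter = itertools.count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]


# Hypothetical word counts.
freqs = {"the": 50, "cat": 20, "jumped": 15, "puddle": 10, "bombastic": 5}
depths = huffman_depths(freqs)
print(depths)
```

The most frequent word sits one step below the root while the rarest words sit deepest, so the average number of node updates per training example stays small.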

3.5 Objective function

likelihood:

$L(\theta)=\prod_{t=1}^T\prod_{-m\leq j\leq m;j\neq 0}P(w_{t+j}|w_t;\theta)$

objective function $J(\theta)$

$J(\theta)=-\frac{1}{T}\log{L(\theta)}$

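Minimizing the objective function is the same as maximizing the predictive probability.

$v_w$: the vector of $w$ when $w$ is a center word
$u_w$: the vector of $w$ when $w$ is a context word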

$P(o|c)=\frac{exp(u_o^T v_c)}{\sum_{w\in V} exp(u_w^T v_c)}$

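Exponentiation makes anything positive
Every word has two vectors, $v$ and $u$

Interpretation: with $w_{t+j}$ and $w_t$ observed, $P(w_{t+j}|w_t;\theta)$ is the probability that the parameters $\theta$ assign to this observed outcome. Different values of $\theta$ give different probabilities, and the $\theta$ that maximizes the product is the most likely set of parameters. $P(\cdot)$ is the probability model we assume.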
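The objective $J(\theta)$ can be evaluated on a toy corpus as a sanity check. This is a sketch under assumptions: token ids and vectors are invented, `V[w]` and `U[w]` hold the center and context vector of word `w`, and window positions that fall outside the text are skipped.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prob(o, c, U, V):
    """P(o|c) = exp(u_o.v_c) / sum_w exp(u_w.v_c), computed with a stability shift."""
    scores = [dot(u, V[c]) for u in U]
    shift = max(scores)
    denom = sum(math.exp(s - shift) for s in scores)
    return math.exp(scores[o] - shift) / denom


def objective(token_ids, U, V, m=1):
    """J(theta) = -(1/T) * sum over positions t and offsets j of log P(w_{t+j} | w_t)."""
    T = len(token_ids)
    log_likelihood = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            log_likelihood += math.log(prob(token_ids[t + j], token_ids[t], U, V))
    return -log_likelihood / T


# Hypothetical corpus of 4 tokens over a 3-word vocabulary, all vectors zero.
token_ids = [0, 1, 2, 1]
V = [[0.0, 0.0] for _ in range(3)]
U = [[0.0, 0.0] for _ in range(3)]
J = objective(token_ids, U, V, m=1)
print(J)
```

With all-zero vectors every $P(o|c)$ is $1/3$, and the 6 (center, context) pairs give $J = 6\log 3/4$. Training should drive $J$ below this uniform baseline.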
