lecture02 : Word Vectors 2 and Word Senses

1. Word vectors and word2vec

The basics of word2vec are covered in lecture01; this section only adds some supplementary notes.

Word2vec maximizes its objective function by putting similar words nearby in space.

In other words, word2vec pulls the vectors of similar words closer together.

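Optimization:

Gradient Descent: computes the gradient over the entire training set for every update.
Stochastic Gradient Descent: randomly samples from the training set and computes the gradient on the sample only. Each update uses a single sample.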
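The difference between the two update rules can be seen on a toy problem. The sketch below is only an illustration: the data, the one-parameter model y = w * x, and the learning rate are all assumptions made for the example, not part of the lecture.

```python
import random

# Toy problem (hypothetical): fit y = w * x by minimizing squared error.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

def grad(w, x, y):
    # Derivative of (w * x - y)^2 with respect to w.
    return 2.0 * (w * x - y) * x

def gradient_descent(w, lr=0.01, steps=200):
    for _ in range(steps):
        # Full batch: average the gradient over every sample.
        g = sum(grad(w, x, y) for x, y in data) / len(data)
        w -= lr * g
    return w

def sgd(w, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Stochastic: one randomly drawn sample per update.
        x, y = rng.choice(data)
        w -= lr * grad(w, x, y)
    return w

w_gd = gradient_descent(0.0)
w_sgd = sgd(0.0)
```

Both runs approach w = 3 here. Each SGD step is much cheaper because it touches one sample instead of the whole dataset, which is why it is preferred on large corpora.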

Why two vectors?

Each word gets one vector as a center word and one as a context word. This makes optimization easier. Average both at the end.

Why not capture co-occurrence counts directly?

With a co-occurrence matrix $X$ there are two options: count within a window around each word, or count over whole documents.
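A minimal sketch of the window-based variant, assuming a tiny tokenized corpus and a symmetric window of size 1 (the function name `cooccurrence_counts` and the corpus are illustrative choices):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1.0
    return counts

# Toy corpus, for illustration only.
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
X = cooccurrence_counts(corpus, window=1)
```

Storing only the pairs that actually occur (a sparse dictionary) sidesteps part of the storage cost of a dense |V| x |V| matrix.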

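Problems:

The matrix grows with the size of the vocabulary
Very high dimensional: requires a lot of storage
Subsequent classification models have sparsity issues

As a result, models are less robust.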

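Solutions

Reduce the dimensionality of $X$, for example with SVD. Some words (function words such as "the", "he", "has") are still too frequent. Possible fixes:

(1) Cap the counts: $\min(X, t)$ with $t \approx 100$
(2) Ignore them all
(3) Ramped windows that count closer words more
(4) Use Pearson correlations instead of counts, then set negative values to 0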
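Capping the counts and then applying SVD might look like the sketch below. It assumes NumPy is available. The matrix values, the helper name `reduce_with_svd`, and the choice to scale the left singular vectors by the singular values are all assumptions made for illustration.

```python
import numpy as np

def reduce_with_svd(X, k=2, t=100.0):
    """Cap counts at t, then keep the top-k SVD dimensions as word vectors."""
    X = np.minimum(np.asarray(X, dtype=float), t)   # min(X, t) tames very frequent words
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                         # one k-dimensional row per word

# Hypothetical 4-word co-occurrence matrix with two very large counts.
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 500],
    [0, 1, 500, 0],
], dtype=float)
vectors = reduce_with_svd(X, k=2)
```

Each row of `vectors` is a dense 2-dimensional representation of one word. Without the cap, the two entries of 500 would dominate the leading singular directions.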

Encoding meaning in vector differences

Log-bilinear model: $w_i \cdot w_j=\log P(i|j)$

with vector differences $w_x \cdot (w_a-w_b)=\log\frac{P(x|a)}{P(x|b)}$
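Assuming the log-bilinear form holds for both $a$ and $b$, the vector-difference identity follows in one step from the linearity of the dot product:

$w_x \cdot (w_a-w_b)=w_x \cdot w_a-w_x \cdot w_b=\log P(x|a)-\log P(x|b)=\log\frac{P(x|a)}{P(x|b)}$

So a ratio of co-occurrence probabilities corresponds to a difference of word vectors.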

2. GloVe : Combining the best of both

$J=\sum_{i,j=1}^{V}f(X_{ij})(w_i^T\tilde w_j+b_i+\tilde b_j -\log X_{ij})^2$

Advantages:

Fast training
Scalable to huge corpora
Good performance even with small corpus and small vectors

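2.1 Comparison with previous models

So far there are two families of models for obtaining word embeddings. The first is count-based (LSA, HAL). These methods make efficient use of global statistics, but they mainly capture word similarity and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure. The second is window-based (skip-gram and CBOW). These can capture semantic patterns but do not use global co-occurrence statistics.

GloVe consists of a weighted least squares model that trains on global co-occurrence counts and so makes efficient use of the statistics.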

2.2 Co-occurrence Matrix

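$X_{ij}$ is the number of times word $j$ appears in the context of word $i$. $X_i=\sum_k X_{ik}$ is the total number of times any word appears in the context of word $i$. $P_{ij}=P(w_j|w_i)=\frac{X_{ij}}{X_i}$ is the probability that word $j$ appears in the context of word $i$.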
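As a small worked example of these definitions, the sketch below turns counts into probabilities. The words and counts are made up for illustration.

```python
# Hypothetical counts X[i][j]: times word j appears in the context of word i.
X = {
    "ice":   {"solid": 8.0, "water": 12.0},
    "steam": {"gas": 6.0, "water": 14.0},
}

def cooccurrence_probabilities(X):
    P = {}
    for i, row in X.items():
        X_i = sum(row.values())                       # X_i = sum_k X_ik
        P[i] = {j: x_ij / X_i for j, x_ij in row.items()}
    return P

P = cooccurrence_probabilities(X)
```

Here `P["ice"]["solid"]` is 8 / 20 = 0.4, and every row of `P` sums to 1.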

Computing the co-occurrence matrix for a large corpus is computationally expensive, but it is a one-time, up-front cost.

2.3 Least Squares Objective

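In the skip-gram model we use softmax to compute the probability $Q_{ij}$ that word $j$ appears in the context of word $i$, and then compute the cross-entropy loss: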

$J=-\sum_{i\in corpus}\sum_{j\in context(i)}\log Q_{ij}$

But the same pair $i$ and $j$ can co-occur many times, so it is more efficient to group those terms together:

$J=-\sum_{i=1}^W\sum_{j=1}^W X_{ij}\log Q_{ij}$
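The two forms of the loss give the same value. The grouped one simply visits each distinct pair once and weights it by its count. A quick check on made-up pairs and made-up model probabilities `Q` (both hypothetical):

```python
import math
from collections import Counter

# Hypothetical (center, context) pairs seen in a corpus, with repeats.
pairs = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("a", "b")]
# Hypothetical model probabilities Q_ij.
Q = {("a", "b"): 0.5, ("a", "c"): 0.25, ("b", "a"): 0.8}

# One term per occurrence in the corpus.
loss_per_occurrence = -sum(math.log(Q[p]) for p in pairs)

# One term per distinct pair, weighted by its count X_ij.
X = Counter(pairs)
loss_grouped = -sum(x_ij * math.log(Q[p]) for p, x_ij in X.items())
```

The grouped sum has 3 terms instead of 5 here. On a real corpus the saving is far larger.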

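A significant drawback of the cross-entropy loss is that it requires the distribution $Q$ to be normalized, which involves an expensive summation over the entire vocabulary. So we use a least squares objective instead:

$\hat J=\sum_{i=1}^W\sum_{j=1}^W X_{i}(\hat P_{ij}-\hat Q_{ij})^2$

where $\hat P_{ij}=X_{ij}$ and $\hat Q_{ij}=\exp(\hat u_j^T \hat v_i)$ are the unnormalized distributions. This creates a new problem: $X_{ij}$ is often very large, which makes optimization difficult, so we take the logarithm of $\hat P$ and $\hat Q$:

$\hat J=\sum_{i=1}^W\sum_{j=1}^W X_{i}(\log{\hat P_{ij}}-\log{\hat Q_{ij}})^2$

$=\sum_{i=1}^W\sum_{j=1}^W X_{i}(\hat u_j^T \hat v_i-\log X_{ij})^2$

The weighting factor $X_i$ is not guaranteed to be optimal, so it is replaced by a more general weighting function $f(X_{ij})$: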

$\hat J=\sum_{i=1}^W\sum_{j=1}^W f(X_{ij})(\hat u_j^T \hat v_i-\log X_{ij})^2$
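A sketch of evaluating this objective on a tiny example. The notes do not define $f$; the form used below, $(x/x_{max})^{\alpha}$ capped at 1 with $x_{max}=100$ and $\alpha=0.75$, is the one from the GloVe paper and is an assumption here. The vectors and counts are made up, and the bias terms from the full GloVe objective are left out to match the formula above.

```python
import math

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function (assumed: form and constants from the GloVe paper).
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def weighted_ls_loss(X, u, v):
    """Sum of f(X_ij) * (u_j . v_i - log X_ij)^2 over the nonzero X_ij."""
    total = 0.0
    for (i, j), x_ij in X.items():
        if x_ij > 0:                      # zero counts are skipped (log 0 is undefined)
            total += f(x_ij) * (dot(u[j], v[i]) - math.log(x_ij)) ** 2
    return total

# Hypothetical counts and 2-dimensional vectors.
X = {(0, 1): 10.0, (1, 0): 10.0, (0, 0): 0.0}
v = {0: [1.0, 0.0], 1: [0.0, 1.0]}        # center vectors v_i
u = {0: [0.0, 1.0], 1: [1.0, 0.0]}        # context vectors u_j
loss = weighted_ls_loss(X, u, v)
```

Because the weighting saturates at 1, very frequent pairs cannot dominate the loss, while rare pairs are down-weighted.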

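2.4 Conclusion

GloVe trains only on the nonzero elements of the co-occurrence matrix and makes efficient use of global statistics. Under the same conditions (corpus, vocabulary, window size, and training time) it is reported to outperform word2vec.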

3. How to evaluate word vectors?

3.1 Intrinsic vs extrinsic

Intrinsic:

Evaluation on a specific intermediate subtask (for example, analogy completion)
Fast to compute
Helps to understand that system
Not clear if really helpful unless correlation to real task is established

Extrinsic:

Evaluation on a real task
Can take a long time to compute accuracy
Unclear if the subsystem is the problem or its interaction or other subsystems
If replacing exactly one subsystem with another improves accuracy --> Winning!

3.2 Intrinsic word vector evaluation

Method 1: Word Vector Analogies

Word Vector Analogies: Syntactic and Semantic

$a:b :: c:?$

$d=\underset{i}{\operatorname{argmax}}\frac{(x_b-x_a+x_c)^Tx_i}{||x_b-x_a+x_c||}$
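The argmax above can be sketched with NumPy as follows. The toy vectors are invented so that the analogy works out. Normalizing every vector to unit length and excluding the three query words from the candidates are common conventions that are assumed here rather than stated in the formula.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d."""
    words = list(vectors)
    M = np.array([vectors[w] for w in words], dtype=float)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-length rows
    idx = {w: k for k, w in enumerate(words)}
    target = M[idx[b]] - M[idx[a]] + M[idx[c]]
    scores = M @ target / np.linalg.norm(target)       # cosine similarity to target
    for w in (a, b, c):                                # exclude the query words
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]

# Hypothetical 3-dimensional vectors, for illustration only.
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}
result = analogy("man", "woman", "king", toy)
```

With these toy vectors `result` is `"queen"`.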

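Open problem: what if the information is there but is not linear?

Findings from intrinsic hyperparameter analysis:

Around 300 dimensions works best
Asymmetric context (only words to the left) is not as good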
window size of 8 is good for Glove
Performance is heavily dependent on the model used for word embedding
Performance increases with larger corpus sizes
Performance is lower for extremely low dimensional word vectors

Method 2: Correlation Evaluation

Word vector distances and their correlation with human judgments

Example dataset: WordSim353
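A sketch of this kind of evaluation: score each word pair with cosine similarity and compare the ranking with human scores using Spearman rank correlation. The vectors and the human scores below are invented and are not real WordSim353 data. The simple `ranks` helper assumes there are no ties.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    # Rank 1 = smallest value; assumes no ties, for simplicity.
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r

def spearman(xs, ys):
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical vectors and made-up human similarity scores.
vectors = {
    "tiger": [0.9, 0.1], "cat": [0.8, 0.3],
    "car": [0.1, 0.9], "book": [0.5, 0.5],
}
human = [(("tiger", "cat"), 7.4), (("tiger", "car"), 1.5), (("cat", "book"), 3.0)]

model = [cosine(vectors[a], vectors[b]) for (a, b), _ in human]
gold = [score for _, score in human]
rho = spearman(model, gold)
```

A correlation close to 1 means the model orders the pairs the same way people do. In this toy case the two rankings agree exactly.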

3.3 Word senses and word sense ambiguity

Most words have several senses. Does a single word vector capture all of them? Some approaches:

Improving Word Representations Via Global Context And Multiple Word Prototypes

Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters bank1, bank2.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

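$v_{\text{pike}}=\alpha_1 v_{\text{pike}_1}+\alpha_2 v_{\text{pike}_2}+\dots$

$\alpha_i$ is the relative frequency of sense $i$, for example $\alpha_1=\frac{f_1}{f_1+f_2+\dots}$ where $f_i$ is the frequency of sense $i$.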
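The weighted sum can be made concrete with two invented senses. The sense names, vectors, and frequencies below are all hypothetical.

```python
# Hypothetical sense vectors and sense frequencies for the word "pike".
sense_vectors = {"pike_fish": [1.0, 0.0], "pike_weapon": [0.0, 1.0]}
sense_freq = {"pike_fish": 30.0, "pike_weapon": 10.0}

# alpha_i = f_i / (f_1 + f_2 + ...): the relative frequency of each sense.
total = sum(sense_freq.values())
alpha = {s: freq / total for s, freq in sense_freq.items()}

# v_pike = alpha_1 * v_pike_1 + alpha_2 * v_pike_2
dim = 2
v_pike = [
    sum(alpha[s] * sense_vectors[s][k] for s in sense_vectors)
    for k in range(dim)
]
```

The single vector ends up closer to the more frequent sense: here `v_pike` is `[0.75, 0.25]`.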

3.4 Extrinsic word vector evaluation

Extrinsic evaluation of word vectors: all subsequent tasks in this class, for example named entity recognition.

3.4.1 Problem Formulation

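Most NLP classification problems can be formulated as a training set:

$\{x^{(i)},y^{(i)}\}_1^N$

where $x^{(i)}$ is an input (a word vector) and $y^{(i)}$ is its label. Unlike in general classification problems, NLP introduces the idea of retraining the input word embeddings for the task.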

3.4.2 Retraining Word Vectors

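Implementation Tip: Word vector retraining should be considered for large training datasets. For small datasets, retraining word vectors will likely worsen performance.

If we retrain word vectors on an extrinsic task, we need to make sure the training set is large enough to cover most words in the vocabulary. Otherwise the vectors of words seen in training move while the rest stay where they were, which can break the relationships learned during pre-training. If the training set is small, the word vectors should not be retrained. If it is large, retraining can improve performance.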

3.4.3 Softmax Classification and Regularization

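Over $N$ training samples, the cross-entropy loss is:

$-\sum_{i=1}^N\log\left(\frac{\exp(W_{k(i)}x^{(i)})}{\sum_{c=1}^C \exp(W_c x^{(i)})}\right)$

$k(i)$ returns the index of the correct class for sample $x^{(i)}$. With $C$ classes and $d$-dimensional word vectors, retraining the word vectors as well means updating $C \cdot d+|V| \cdot d$ parameters. With that many parameters overfitting is a risk, so an L2 regularization term is added:

$-\sum_{i=1}^N\log\left(\frac{\exp(W_{k(i)}x^{(i)})}{\sum_{c=1}^C \exp(W_c x^{(i)})}\right)+\lambda\sum_{k=1}^{C \cdot d+|V| \cdot d}\theta_k^2$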
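A NumPy sketch of this loss. For brevity the penalty below covers only the classifier weights `W`, whereas the formula above regularizes all parameters $\theta$, including the word vectors. The shapes and values are assumptions made for the example.

```python
import numpy as np

def softmax_loss(W, X, y, lam=0.0):
    """Negative log-likelihood of the correct classes plus an L2 penalty on W.

    W: (C, d) class weights, X: (N, d) inputs, y: (N,) correct class indices k(i).
    """
    scores = X @ W.T                                       # (N, C): entry [i, c] is W_c x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)    # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].sum()
    return nll + lam * np.sum(W ** 2)

# Hypothetical setup: C = 3 classes, d = 2 dimensions, N = 4 samples.
W = np.zeros((3, 2))
X = np.ones((4, 2))
y = np.array([0, 1, 2, 0])
loss = softmax_loss(W, X, y, lam=0.1)
```

With all-zero weights every class gets probability 1/3, so the loss is 4 * log(3) and the penalty is zero.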

3.4.4 Window Classification

In most cases we tend to use a sequence of words, rather than a single word, as the input to the model.
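The notes stop here, so the following is only one plausible reading: a common way to feed a word sequence to a classifier is to concatenate the vectors of a center word and its neighbors. The function name `window_input`, the padding token, and the vectors are all hypothetical.

```python
def window_input(tokens, center, vectors, size=1, pad="<pad>"):
    """Concatenate the vectors of the words within `size` positions of `center`."""
    padded = [pad] * size + list(tokens) + [pad] * size
    window = padded[center : center + 2 * size + 1]    # center word plus neighbors
    x = []
    for w in window:
        x.extend(vectors[w])
    return x

# Hypothetical 2-dimensional vectors.
vectors = {
    "<pad>": [0.0, 0.0],
    "museums": [0.1, 0.2],
    "in": [0.3, 0.4],
    "Paris": [0.5, 0.6],
}
x = window_input(["museums", "in", "Paris"], center=2, vectors=vectors, size=1)
```

For the center word "Paris" the window is "in", "Paris", and one padding token, so `x` has 3 * 2 = 6 entries and can be passed to a softmax classifier like any other input vector.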

