[Course Notes] Lecture 2 - Stanford NLP cs224n

Lecture 2_Stanford cs224d_Simple Word Vector Representations: word2vec, GloVe



1. Word meaning

A question lies ahead

Q: How do we have a usable meaning in a computer?
Common answer: use a taxonomy like WordNet that has hypernym relationships and synonym sets.

  • WordNet is one of the most famous taxonomic resources and is popular among computational linguists: a copy is free to download, and it provides a lot of taxonomic information about words.
  • the components of WordNet
    [figure: the components of WordNet]
  • demo implemented in Python
    [figure: querying WordNet with NLTK]
    The picture above shows how to get hold of WordNet using NLTK, which is one of the main Python packages for NLP.

Running this under Python 3.7, the list of hypernyms of the word "panda" comes out as follows:
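A minimal sketch of the NLTK calls that produce this list (assuming the WordNet data has already been downloaded, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# first noun synset for "panda"
panda = wn.synset('panda.n.01')
# transitively collect the hypernyms up the taxonomy
hyper = lambda s: s.hypernyms()
print(list(panda.closure(hyper)))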

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

discrete representation

Although the discrete representation of words above is a genuine linguistic resource, in practice it does not work as well as one might hope, because the "synonyms" it finds still differ in nuance. Its main limitations are:

  • it misses new words (it is hard to keep up to date)
  • it is subjective
  • it requires a lot of human labor to create and maintain
  • it is hard to compute word similarity accurately
    A large number of rule-based and statistical NLP tasks treat words as indivisible atomic units. In vector space, each such vector consists of a single "1" and a great many "0"s; this is called a "one-hot" representation. Its problems are:
    1. The vocabulary of a corpus is large, so the dimensionality of the vectors becomes very large.
    Dimensionality:
    20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
    - "speech" means a speech recognizer
    - PTB: the Penn Treebank; in Python it can be loaded directly via from nltk.corpus import ptb
    - big vocab: if we were building a machine translation system, we might use a 500,000-word vocabulary
    - Google 1T: Google released a roughly 1-terabyte corpus of web-crawl data
    2. Any two word vectors are orthogonal (their dot product, i.e. inner product, is zero),
    or in other words there is no natural notion of similarity (a quick check follows below).
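A quick numpy check of point 2, using a toy vocabulary of size 5 and the "motel"/"hotel" pair from the lecture (the indices chosen are arbitrary):

import numpy as np

V = 5                                  # toy vocabulary size
motel = np.zeros(V); motel[1] = 1.0    # one-hot vector for "motel"
hotel = np.zeros(V); hotel[3] = 1.0    # one-hot vector for "hotel"

print(motel @ hotel)                   # 0.0: the vectors are orthogonal, so there is no similarity signal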

How to make neighbors represent words?

The linguist J. R. Firth proposed that a word's meaning can be obtained from its context. Firth even suggested that only when you can put a word into the right contexts have you really mastered its meaning.

"You shall know a word by the company it keeps."
The philosopher of language Wittgenstein likewise held that "the right way to think about the meaning of words is understanding their use in text."

Word meaning is defined in terms of vectors

By adjusting a word's vector together with the vectors of its context words, we can use two vectors to estimate how similar two words are, or use a word's vector to predict its context. The approach is recursive: vectors are adjusted against other vectors, much as dictionary definitions are written in terms of other definitions.

In addition, distributed representations stand in contrast to symbolic representations (localist or one-hot representations); "discrete representation" is close in meaning to the latter and to denotation. Be careful not to confuse the words "distributed" and "discrete".

Basic idea of learning neural-network word embeddings

  1. Define a model that predicts the context of a given word:

$p(context \mid w_t)$

  2. Define the loss function as:
    $J = 1 - p(w_{-t} \mid w_t)$

Here $w_{-t}$ denotes the context of $w_t$ (the minus sign conventionally means "everything except"); the ultimate goal of training is to change the vector representations of words so that the loss gradually approaches 0.

  3. Then collect training examples from many different positions in a large corpus and adjust the word vectors to minimize the loss.

2. Word2vec Introduction

Main idea of word2vec

[figure: the main idea of word2vec]

Skip-gram prediction

[figure: skip-gram prediction]

The idea of the skip-gram model: at each step, pick one word as the center word and predict the context words within a window of a given size around it. The model is trained to maximize the probability of the observed context words.

Details of word2vec

For each word t = 1, 2, …, T (T being the length of the text), we predict the context words within a window of "radius" m around it.

  • Objective function: for the given center word, maximize the predicted probability of every one of its context words.

$$J'(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta)$$

$\theta$ represents all the variables we will optimize.

During training, we adjust the model's parameters so that the predicted probability of each center word's context words is as high as possible.

  • Maximizing this objective is equivalent to minimizing the negative logarithm of the probabilities, averaged over the corpus; this gives the loss function, the negative of the log likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

This expression is the negative log likelihood.

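The step from $J'(\theta)$ to $J(\theta)$ is just taking the logarithm (which turns the products into sums), negating, and averaging over the $T$ positions:

$$J(\theta) = -\frac{1}{T}\log J'(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \le j \le m \\ j \ne 0}}\log p(w_{t+j} \mid w_t;\theta)$$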
The logarithm is the red curve in the figure below:

[figure: the log function (red curve)]

Since we take the log of probabilities, and a probability $p(w_{t+j} \mid w_t)$ satisfies $0 \le p \le 1$, only the part of the red curve over the interval $[0, 1]$ matters; negating it gives the negative log likelihood shown below:

[figure: the negative log likelihood over [0, 1]]

We want this probability to be as large as possible: the closer it is to 1, the closer the overall value is to 0, i.e. the smaller the loss.

[a tiny cheating trick hidden in the function]

The function above appears to have a single parameter $\theta$, but the window size also enters, and the actual model has some further hyperparameters that we will meet later in the course. All of these are tunable hyperparameters; for now we treat them as constants.

The window size is a parameter of the model. Actually, it turns out that there are several hyperparameters of the model: one is the window size, and we'll come across a couple of other fudge factors later in the lecture. All of those are hyperparameters that you could adjust.

  • Terminology:

    Loss function = Cost function = Objective function

    For probability distributions, the most commonly used loss is the cross-entropy loss.

  • How do we apply these word vectors to minimize the negative log likelihood?

    We first compute, for a given center word, the probability of each possible context word; this probability distribution takes the form below:

    $$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$$

    In this expression, $o$ is one particular output context word, $c$ is the center word, and $v_c$ and $u_o$ are the vector representations of the center word and of that context word.

    Dot product: $u^T v = u \cdot v = \sum_{i=1}^{n} u_i v_i$

    The more similar the vectors $u$ and $v$, the larger their dot product (inner product).

    Looping over $w = 1, \dots, V$, each term $u_w^T v_c$ measures the similarity between word $w$ and the center word vector $v_c$.

    The softmax uses the center word to obtain a probability for each context word.

  • The softmax function: the standard mapping from $\mathbb{R}^V$ to a probability distribution (it turns numbers into a probability distribution)

    Since raw scores (which may be positive or negative) cannot be used directly as a probability distribution, we simply exponentiate them, which maps real numbers to positive numbers, and then normalize to obtain probabilities (exponentiate to make positive, normalize to give a probability).

    $$p_i = \frac{e^{u_i}}{\sum_j e^{u_j}}$$

    Softmax is called softmax because exponentiation makes the larger numbers much larger and the small ones negligible; this selection behaves much like a max function.

    The reason why this is called a softmax function is that it's kind of close to a max function: when you exponentiate things, the big things get way bigger and really dominate, so this blows out in the direction of a max function, but not fully. It's still a soft thing.
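    A minimal numpy sketch of the softmax, showing how the largest score ends up dominating (the input scores are made up for illustration):

    import numpy as np

    def softmax(scores):
        # exponentiate to make everything positive, then normalize to sum to 1
        exps = np.exp(scores - np.max(scores))  # subtracting the max keeps the exponentials from overflowing
        return exps / exps.sum()

    print(softmax(np.array([1.0, 2.0, 5.0])))  # ~[0.017, 0.047, 0.936]: the largest score dominates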

  • Another trick: one word with two vector representations?

    You might think that one word should only have one vector representation, and if you really wanted to you could do that, but it turns out you can make the math considerably easier by saying that each word actually has two vector representations: one vector representation when it is the center word, and another vector representation when it is a context word.

    In the formula above,

    $$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$$

    $o$ is one particular output context word, $c$ is the center word, and $v_c$ and $u_o$ are the vector representations of the center word and of that context word.

    And it turns out that not only does this make the math a lot easier, because the two representations are kept separate during optimization rather than tied to each other, but in practice it also empirically works a little better. So if your life is easier and better, who would not choose that?

    So we have two vectors for each word.

  • Question proposed by a student:

    When I am dealing with the context words, am I paying attention to where they are or just to their identity?

    Where they are has nothing to do with it in this model; it's just the identity of the word somewhere in the window. So there's just one probability distribution and one representation for the context word. Now, you might say that's not really a good idea; there are other models which absolutely pay attention to position and distance, and for some purposes, especially more syntactic rather than semantic purposes, that actually helps a lot. But if you're more interested in word meaning, it turns out that not paying attention to position actually tends to help rather than hurt.

    [figure: the skip-gram model architecture]

  • The figure above, read from left to right: the input is the one-hot representation of the center word $w_t$. Between the input layer and the projection layer there is a matrix W of center-word representations; multiplying the one-hot vector by this matrix selects the corresponding column of the matrix, which is the representation of the center word. Between the projection layer and the output layer there is a second matrix that stores the context-word representations. The model takes the dot product of the center word with each context word; since these scores can be positive or negative, a softmax turns them into a probability distribution. As a generative model, given a center word it predicts how likely every word is to appear in its context.

    Both matrices contain V word vectors, so the same word ends up with two word vectors. Which one should be used as the final embeddings handed to other applications? There are two strategies: add them, or concatenate them. The CS224n programming assignment takes the concatenation strategy:

    # concatenate the input and output word vectors
    wordVectors = np.concatenate(
        (wordVectors[:nWords, :], wordVectors[nWords:, :]),
        axis=0)

    # or, alternatively, sum the two vectors of each word
    wordVectors = wordVectors[:nWords, :] + wordVectors[nWords:, :]
    

The vectors in $W$ are usually called input vectors, and the vectors in $W'$ are called output vectors.

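Putting the pieces above together, here is a minimal numpy sketch of one skip-gram forward step. The names (W for the center-word matrix, U for the context-word matrix), the toy sizes, and the random initialization are assumptions for illustration, not the assignment's code:

import numpy as np

V, d = 10, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))           # center-word ("input") vectors, one column per word
U = rng.normal(size=(V, d))           # context-word ("output") vectors, one row per word

center = 3                            # index of the center word w_t
x = np.zeros(V); x[center] = 1.0      # its one-hot representation
v_c = W @ x                           # picks out column `center` of W

scores = U @ v_c                      # u_w^T v_c for every word w in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: p(o | c) for every candidate o
print(probs.sum())                    # 1.0, a proper probability distribution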
  • Training the model: computing the gradients of the parameter vector

    We usually collect all of the model's parameters into one long vector $\theta$; with d-dimensional word vectors and a vocabulary of size V:
    [figure: the parameter vector θ, of dimension 2dV]

    We then optimize these parameters.

    Note that every word has two vector representations; this is why each word appears twice in $\theta$, and why the dimensionality of $\theta$ is $2dV$ rather than $dV$.

    The standard practice when training the model is to take all the parameters, put them into one big vector $\theta$, and then optimize: change these parameters so as to maximize the model's objective.

    [How should we understand the dimension d and the vocabulary size V here? Aren't they the same thing in a one-hot representation?]

    What our parameters are is this:

    for each word, we're going to have a little d-dimensional vector when it's a center word and another when it's a context word.

    And so we've got a vocabulary of some size, so we're gonna have a vector for "aardvark" as a context word, a vector for "a" as a context word, and then we're going to have a vector for "aardvark" as a center word and a vector for "a" as a center word.

    (This explains that a word has one word vector when it serves as a context word and another when it serves as the center word.)
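    As a small check of the $2dV$ dimensionality, a toy sketch (the matrix names and sizes are assumptions for illustration):

    import numpy as np

    V, d = 10, 4
    center_vecs = np.zeros((V, d))     # one d-dimensional vector per word as a center word
    context_vecs = np.zeros((V, d))    # one d-dimensional vector per word as a context word

    theta = np.concatenate([center_vecs.ravel(), context_vecs.ravel()])
    print(theta.shape)                 # (80,) == 2 * d * V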

3. Research Highlight (Danqi Chen)

This highlight introduces a paper from Princeton titled "A Simple but Tough-to-Beat Baseline for Sentence Embeddings". In this lecture we are learning word vector representations, and we hope these vectors can encode word meanings.

  • Sentence Embedding

    [figure: sentence embedding]

  • From Bag-of-words to Complex Models

    • Bag-of-words (BoW)

      [figure: bag-of-words sentence representation]

    • Complex Models

      Recurrent neural networks

      Recursive neural networks

      Convolutional neural networks

  • This paper

    • weighted Bag-of-words + remove some special direction

      • Step 1

        [figure: step 1, the weighted average of word vectors]

        There are two steps. The first step is that, just as when we compute the average of the vector representations, they also do this, but each word gets a separate weight. Here a is a constant, and p(w) is the frequency of the word. So this basically means that the averaged representation down-weights the frequent words.

        In the formula above, $a$ is a constant and $p(w)$ is the frequency of the word; the factor $\frac{a}{a + p(w)}$ down-weights high-frequency words.

        The figure says: given a set of sentences $S$, how do we represent a sentence $s \in S$ as a vector $v_s$? We break the sentence into words; each word $w$ has a vector representation $v_w$ and a frequency $p(w)$, and we want to down-weight the words that occur very frequently.

      • Why down-weight high-frequency words?

        Common words like "is", "the", "a", etc. tend to appear quite frequently in comparison to the words that are important to a document. For example, a document about Lionel Messi is going to contain more occurrences of the word "Messi" than other documents, but common words like "the" are also present with high frequency in almost every document. Ideally, we would want to down-weight the common words that occur in almost all documents and give more importance to words that appear in only a subset of documents.

      • Step 2
        [figure: step 2, removing the projection onto the first principal component]

        For step 2, after we compute all of these sentence vector representations, we compute the first principal component and subtract each sentence's projection onto this first principal component. (A small code sketch of both steps follows at the end of this section.)

      • The idea of the paper
        [figure: the idea of the paper]

      So basically, the idea is that, given the sentence representation, the probability of emitting a single word is related to the frequency of the word and also to how closely the word is related to this sentence representation.
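      A minimal numpy sketch of the two steps. The function name, the toy inputs, and the use of an SVD to obtain the first principal component are my own illustration of the paper's description, not the authors' released code:

      import numpy as np

      def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
          # sentences: list of lists of words; word_vecs: word -> vector; word_freq: word -> p(w)
          # Step 1: weighted average, down-weighting each word by a / (a + p(w))
          vs = np.array([
              np.mean([a / (a + word_freq[w]) * word_vecs[w] for w in s], axis=0)
              for s in sentences
          ])
          # Step 2: take the first singular vector of the stacked sentence vectors
          # and subtract each sentence's projection onto it
          u = np.linalg.svd(vs, full_matrices=False)[2][0]
          return vs - np.outer(vs @ u, u)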

4. Word2vec objective function gradients

5. Optimization Refresher

We will optimize (maximize or minimize) our objective/cost functions.

Minimizing the loss function means using gradient descent.

[figure: the gradient descent update equations]

Represented as simplified Python code:

while True:
    # gradient of the objective J over the entire corpus (batch gradient descent)
    theta_grad = evaluate_gradient(J, corpus, theta)
    # step against the gradient with learning rate alpha
    theta = theta - alpha * theta_grad

[figure: contour lines of the objective with gradient descent steps]

The picture above shows the red lines as the contour lines of the value of the objective function. When you calculate the gradient, its negative gives you the direction of steepest descent; you walk a little bit in that direction each time, and you hopefully walk smoothly towards the minimum.


In practice we might have 40 billion tokens in our corpus to go through, and if you have to work out the gradient of your objective function relative to a 40-billion-word corpus, that's gonna take forever.

So instead, what we do is use stochastic gradient descent. SGD is our key tool: we just take one position in the text, with one center word and the words around it, and update the parameters from that single window.
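A stochastic version of the earlier loop might look like the sketch below; the window sampling and the helper evaluate_gradient_at_window are illustrative assumptions, in the same spirit as the evaluate_gradient placeholder above:

import random

while True:
    # sample one position in the text: a single center word plus its window of radius m
    t = random.randrange(len(corpus))
    window = corpus[max(0, t - m): t + m + 1]
    # a noisy but cheap gradient estimate computed from this one window only
    theta_grad = evaluate_gradient_at_window(J, window, theta)
    theta = theta - alpha * theta_grad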

Reference

1. Course video (YouTube)

https://www.youtube.com/watch?v=ERibwqs9p38&t=431s

2. Stanford cs224n syllabus

http://cs224d.stanford.edu/syllabus.html

3. Other reference blogs

http://www.hankcs.com/nlp/word-vector-representations-word2vec.html
