Sklearn_vectorizer and Intro to PyTorch

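Goals of this chapter:
- Understand the basic approach of supervised learning
- Learn how inputs are encoded for learning tasks
- Understand what a computational graph is
- Master the basics of PyTorch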

Scikit-learn CountVectorizer and TfidfVectorizer

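CountVectorizer and TfidfVectorizer are two ways to turn text into feature vectors in sklearn. CountVectorizer considers only how often each term occurs in a given document. TfidfVectorizer also starts from that term frequency, but additionally weights it by the inverse of the number of documents that contain the term (the inverse document frequency), so terms that appear in almost every document count for less.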

EX1_1 one hot representation

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
# vocab = set([word for sen in corpus for word in sen.split(" ")])
one_hot_vectorizer = CountVectorizer(binary=True)
# fit_transform() learns the vocabulary from the corpus and returns the document-term matrix
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
# get_feature_names_out() returns the feature names in matrix column order
# (it replaces get_feature_names(), which was removed in scikit-learn 1.2)
vocab = one_hot_vectorizer.get_feature_names_out()
sns.heatmap(one_hot, annot=True,
            cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
print(one_hot_vectorizer.get_stop_words())
print(one_hot_vectorizer.vocabulary_)
# "a" is not treated as a token. The scikit-learn documentation explains:
# "The default configuration tokenizes the string by extracting words of at least 2 letters."
print(one_hot_vectorizer.vocabulary_.get("a"))
print(one_hot_vectorizer.vocabulary_.get("an"))
print(one_hot)
plt.show()

Output:

None
{'time': 6, 'flies': 3, 'like': 5, 'an': 0, 'arrow': 1, 'fruit': 4, 'banana': 2}
None
0
[[1 1 0 1 0 1 1]
 [0 0 1 1 1 1 0]]
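
The two observations above (binary=True hides the repeated "flies", and "a" is dropped by the default tokenizer) can be checked with a short sketch. The token pattern used here, `(?u)\b\w+\b`, is my own variation of the default pattern and is only one possible way to keep one-letter words; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']

# Raw counts (binary=False is the default): repeated words are counted.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus).toarray()
flies_col = count_vectorizer.vocabulary_['flies']
print(counts[0, flies_col])  # count of "flies" in sentence 1

# A token pattern that also accepts one-character words, so "a" is kept.
# The default pattern is (?u)\b\w\w+\b, which requires at least 2 characters.
keep_short = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
keep_short.fit(corpus)
print(keep_short.vocabulary_.get('a'))  # no longer None
```

With raw counts the entry for "flies" in the first sentence is 2 instead of 1, and with the relaxed pattern "a" gets its own column.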

EX1_2 TF-IDF extension

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns


corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]
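# Path 1: binary term presence, then TF-IDF weighting with TfidfTransformer
cv = CountVectorizer(binary=True)
words = cv.fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(words)
# Path 2: TfidfVectorizer works on raw counts, so "to" (twice in sentence 5) gets a larger weight here
tfidf_vectorizer = TfidfVectorizer()
tfidf2 = tfidf_vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; tolist() gives plain Python strings
vocab = cv.get_feature_names_out().tolist()
sns.heatmap(tfidf2, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2', 'Sentence 3', 'Sentence 4', 'Sentence 5'])
print(vocab)
print(words.toarray())
print(tfidf)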
plt.show()

Output:

['american', 'and', 'come', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'to', 'travel']
[[0 0 0 1 1 1 0 0 1 0 1 0 0]
 [0 0 0 1 0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 0 1 0 1 1 0 0 0]
 [0 0 0 1 1 1 0 0 1 0 1 0 0]
 [1 0 1 0 0 0 0 0 0 0 0 1 1]]
(0, 10) 0.44027050419943065
(0, 5) 0.44027050419943065
(0, 8) 0.3703694278374568
(0, 4) 0.5303886653382521
(0, 3) 0.44027050419943065
(1, 10) 0.4103997467310884
(1, 5) 0.4103997467310884
(1, 8) 0.34524120496743227
(1, 3) 0.4103997467310884
(1, 7) 0.6128006641982455
(2, 8) 0.30931749359185684
(2, 1) 0.5490363340004775
(2, 9) 0.5490363340004775
(2, 6) 0.5490363340004775
(3, 10) 0.44027050419943065
(3, 5) 0.44027050419943065
(3, 8) 0.3703694278374568
(3, 4) 0.5303886653382521
(3, 3) 0.44027050419943065
(4, 2) 0.5
(4, 11) 0.5
(4, 0) 0.5
(4, 12) 0.5
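
To see where these weights come from, here is a minimal pure-Python sketch that recomputes them by hand. It assumes scikit-learn's default settings: a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper names (`tokenize`, `tfidf_row`) are my own, and the regex only approximates the default tokenizer.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel',
]

def tokenize(text):
    # Approximate the default tokenizer: lowercase words of at least 2 characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sets give binary term presence, matching CountVectorizer(binary=True).
docs = [set(tokenize(d)) for d in corpus]
n_docs = len(docs)
vocab = sorted(set().union(*docs))

# Document frequency and smoothed inverse document frequency per term.
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc_terms):
    # Binary term frequency (1 if present) times idf, then L2-normalize the row.
    raw = {t: idf[t] for t in doc_terms}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row0 = tfidf_row(docs[0])
row4 = tfidf_row(docs[4])
for term in sorted(row0):
    print(term, round(row0[term], 4))
```

The values for the first sentence should agree with the (0, ...) entries printed above, for example about 0.5304 for "first" and 0.3704 for "the". In the last sentence every term occurs in one document only, so all four weights are equal and normalize to 0.5.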

PyTorch Basics

Dynamic vs. static computational graph frameworks

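Static frameworks such as Theano, Caffe, and TensorFlow (in its 1.x graph mode) require the computational graph to be defined first, then compiled, and only then executed.
- Pros: leads to extremely efficient implementations
- Cons: cumbersome
Dynamic frameworks such as Chainer, DyNet, and PyTorch build the graph as the code runs.
- Pros: more flexible and imperative
Dynamic computational graph frameworks are especially useful for modeling NLP tasks, where each input can potentially produce a different graph structure.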
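
As a rough illustration of "a different graph per input", the sketch below applies one operation per token, so a longer sentence records a longer chain of operations. The `score` function and its toy computation are hypothetical and not a real NLP model; only the PyTorch calls themselves are standard.

```python
import torch

def score(tokens, weight):
    # The loop length depends on the input, so each call records a
    # different chain of operations in the autograd graph.
    total = torch.zeros(())
    for _ in tokens:
        total = torch.tanh(total + weight)
    return total

weight = torch.tensor(0.5, requires_grad=True)

for sentence in (["time", "flies"],
                 ["fruit", "flies", "like", "a", "banana"]):
    weight.grad = None          # clear the gradient from the previous sentence
    out = score(sentence, weight)
    out.backward()              # backpropagate through the graph just built
    print(len(sentence), out.item(), weight.grad.item())
```

No graph is declared or compiled ahead of time: plain Python control flow decides its shape on every call.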

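Some of the PyTorch operations covered in this tutorial:

- Creating tensors
- Operations on tensors
- Indexing, slicing, and joining tensors
- Computing gradients of tensors
- Using CUDA tensors when GPUs are available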
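
Before going into detail, here is a small sketch that touches the operations listed above (everything except CUDA, which has its own section). The variable names are mine and the values are arbitrary.

```python
import torch

x = torch.arange(6, dtype=torch.float32).view(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Operations: element-wise addition and a sum along a dimension.
doubled = x + x
col_sums = x.sum(dim=0)

# Indexing and slicing.
first_row_first_two = x[0, :2]
element = x[1, 2]

# Joining: concatenate along rows (dim=0) or columns (dim=1).
stacked_rows = torch.cat([x, x], dim=0)  # shape (4, 3)
stacked_cols = torch.cat([x, x], dim=1)  # shape (2, 6)

# Gradients: mark a tensor as requiring grad, build a scalar, call backward().
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # d(loss)/dw = 2 * w
```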

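Installing PyTorch

On the official website, select the build that matches your hardware environment, then copy the generated command and run it in a terminal. For example, for my Mac and basic programming environment the command is: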

conda install pytorch torchvision -c pytorch

Test program

import torch

def describe(x):
    # Print the tensor's type, shape, and values
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values:\n{}".format(x))

print("torch.Tensor: initialize a random one by specifying its dimensions")
describe(torch.Tensor(2,3))  # allocates uninitialized memory, so the values are arbitrary rather than properly random

print("torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)")
describe(torch.rand(2,3))  # values drawn uniformly from [0, 1)

print("torch.Tensor: initialize with standard normal distribution")
describe(torch.randn(2,3))  # values drawn from the standard normal distribution

# Creating a filled tensor
print("filled with zeros\n")
describe(torch.zeros(2,3))

print("filled with ones\n")
x = torch.ones(2,3)
describe(x)

print("filled with a certain value\n")
x.fill_(5)
describe(x)

Output:

torch.Tensor: initialize a random one by specifying its dimensions
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-7.7542e+20, 4.5799e-41, -7.7542e+20],
        [ 4.5799e-41, 0.0000e+00, 0.0000e+00]])
torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0.7461, 0.5937, 0.9421],
        [0.5716, 0.6240, 0.3719]])
torch.Tensor: initialize with standard normal distribution
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-1.5297, -0.7294, 0.1784],
        [-1.4134, 0.2278, 1.1762]])
filled with zeros

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0., 0., 0.],
        [0., 0., 0.]])
filled with ones

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[1., 1., 1.],
        [1., 1., 1.]])
filled with a certain value

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[5., 5., 5.],
        [5., 5., 5.]])
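
A few related ways of creating tensors, as a hedged sketch rather than part of the original test program: `torch.empty` as the explicit way to allocate without initializing, `torch.tensor` to build from a Python list, and `torch.manual_seed` to make the random values repeatable. The seed 1234 is arbitrary.

```python
import torch

# torch.empty is the explicit spelling of "allocate without initializing".
uninit = torch.empty(2, 3)

# Build a tensor from a nested Python list; the dtype is inferred (float32 here).
from_list = torch.tensor([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])

# Seeding the generator makes random initialization repeatable.
torch.manual_seed(1234)
a = torch.rand(2, 3)
torch.manual_seed(1234)
b = torch.rand(2, 3)

print(uninit.shape, from_list.dtype)
print(torch.equal(a, b))  # same seed, same values
```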

CUDA Tensors

CUDA

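So far we have only allocated tensors in CPU memory. For heavy linear algebra operations, however, it pays to use a GPU when one is available. To use a GPU, you first need to allocate the tensors in the GPU's memory. Access to GPUs goes through a specialized API called CUDA. The CUDA API was created by NVIDIA and is limited to NVIDIA GPUs.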

PyTorch

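PyTorch makes it very easy to create CUDA tensors: a tensor can be moved from the CPU to the GPU while keeping its underlying type. More importantly, PyTorch supports a device-agnostic approach, so the code we write runs on either the CPU or the GPU.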

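First, check whether a GPU is available:

torch.cuda.is_available()

Then select the device accordingly (torch.device needs the device name as an argument):

torch.device("cuda" if torch.cuda.is_available() else "cpu")

From then on, every tensor is instantiated and moved to the target device with .to(device):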

import torch 

def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    # Report the tensor's own device rather than relying on a global variable
    print("Values:\n{} device = {}".format(x, x.device))

# Check whether CUDA is available
print(torch.cuda.is_available())

# Select and print the device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# instantiate and move to the target device
x = torch.rand(3,3).to(device)
describe(x)

Output:

False
cpu
Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values:
tensor([[0.9827, 0.8346, 0.1842],
        [0.3609, 0.1259, 0.7131],
        [0.6021, 0.3017, 0.3955]]) device = cpu
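
A slightly longer device-agnostic sketch, under the assumption that all operands of an operation must live on the same device: it allocates one tensor directly on the target device, moves another with `.to(device)`, combines them, and brings the result back to CPU memory. It runs unchanged whether or not a GPU is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device instead of creating on CPU and moving.
x = torch.rand(3, 3, device=device)
# Or create on the CPU first and move afterwards.
y = torch.ones(3, 3).to(device)

# Both operands live on the same device, so the operation is valid.
z = x + y
print(z.device)

# Bring the result back to CPU memory, e.g. before handing it to other libraries.
z_cpu = z.cpu()
print(z_cpu.device)
```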
