Sklearn_vectorizer and Intro to PyTorch

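Goals of this chapter:
- Understand the basic approach to supervised learning
- Learn how inputs are encoded for learning tasks
- Understand what a computational graph is
- Master the basics of PyTorch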

Scikit-learn CountVectorizer and TfidfVectorizer

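CountVectorizer and TfidfVectorizer are two feature-vectorization methods in sklearn. CountVectorizer considers only how often each term occurs in a given training text. TfidfVectorizer also uses that term frequency, but additionally weights it by the inverse of the number of training texts that contain the term, so terms that appear in many texts receive lower weights.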
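
As a rough sketch of that weighting, assuming scikit-learn's default settings (smooth_idf=True and norm='l2'; other settings change the formula), with n the number of training texts, df(t) the number of texts containing term t, and tf(t, d) the count of t in text d:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Each document vector is then divided by its Euclidean (L2) norm, so every row of the resulting matrix has length 1.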

EX1_1 one hot representation

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
# vocab = set([word for sen in corpus for word in sen.split(" ")])
# binary=True records only whether a term occurs (1) or not (0), not how often
one_hot_vectorizer = CountVectorizer(binary=True)
# fit_transform() learns the vocabulary from the corpus and returns the document-term matrix;
# transform() only encodes text with a vocabulary that has already been learned
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
# Feature names in column order (get_feature_names() was removed in scikit-learn 1.2)
vocab = one_hot_vectorizer.get_feature_names_out().tolist()
sns.heatmap(one_hot, annot=True,
            cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
print(one_hot_vectorizer.get_stop_words())
print(one_hot_vectorizer.vocabulary_)
# "a" is not treated as a token. The sklearn documentation explains: "The default
# configuration tokenizes the string by extracting words of at least 2 letters."
print(one_hot_vectorizer.vocabulary_.get("a"))
print(one_hot_vectorizer.vocabulary_.get("an"))
print(one_hot)
plt.show()

Output:

None
{'time': 6, 'flies': 3, 'like': 5, 'an': 0, 'arrow': 1, 'fruit': 4, 'banana': 2}
None
0
[[1 1 0 1 0 1 1]
[0 0 1 1 1 1 0]]
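
The missing "a" comes from the tokenizer. As a minimal sketch, the snippet below applies the regular expression that the scikit-learn documentation gives as the default token_pattern to the second sentence, using only Python's re module. The variable names are illustrative.

```python
import re

sentence = "Fruit flies like a banana.".lower()

# Default token_pattern documented for CountVectorizer: two or more word characters
default_pattern = r"(?u)\b\w\w+\b"
# A looser pattern that also accepts single-character tokens
single_char_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, sentence)
all_tokens = re.findall(single_char_pattern, sentence)

print(default_tokens)  # ['fruit', 'flies', 'like', 'banana']
print(all_tokens)      # ['fruit', 'flies', 'like', 'a', 'banana']
```

Passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer should therefore keep "a" in the vocabulary.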

EX1_2 tf-idf extension

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]
# Route 1: binary term presence, then tf-idf weighting with TfidfTransformer
cv = CountVectorizer(binary=True)
words = cv.fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(words)
# Route 2: TfidfVectorizer counts terms and applies tf-idf weighting in one step
tfidf_vectorizer = TfidfVectorizer()
tfidf2 = tfidf_vectorizer.fit_transform(corpus).toarray()
# Both vectorizers learn the same vocabulary, so cv's feature names also label tfidf2's columns
# (get_feature_names() was removed in scikit-learn 1.2)
vocab = cv.get_feature_names_out().tolist()
sns.heatmap(tfidf2, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2', 'Sentence 3', 'Sentence 4', 'Sentence 5'])
print(vocab)
print(words.toarray())
print(tfidf)
plt.show()

Output:

['american', 'and', 'come', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'to', 'travel']
[[0 0 0 1 1 1 0 0 1 0 1 0 0]
[0 0 0 1 0 1 0 1 1 0 1 0 0]
[0 1 0 0 0 0 1 0 1 1 0 0 0]
[0 0 0 1 1 1 0 0 1 0 1 0 0]
[1 0 1 0 0 0 0 0 0 0 0 1 1]]
(0, 10) 0.44027050419943065
(0, 5) 0.44027050419943065
(0, 8) 0.3703694278374568
(0, 4) 0.5303886653382521
(0, 3) 0.44027050419943065
(1, 10) 0.4103997467310884
(1, 5) 0.4103997467310884
(1, 8) 0.34524120496743227
(1, 3) 0.4103997467310884
(1, 7) 0.6128006641982455
(2, 8) 0.30931749359185684
(2, 1) 0.5490363340004775
(2, 9) 0.5490363340004775
(2, 6) 0.5490363340004775
(3, 10) 0.44027050419943065
(3, 5) 0.44027050419943065
(3, 8) 0.3703694278374568
(3, 4) 0.5303886653382521
(3, 3) 0.44027050419943065
(4, 2) 0.5
(4, 11) 0.5
(4, 0) 0.5
(4, 12) 0.5
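
The weights printed above can be reproduced by hand. The following is a minimal pure-Python sketch, assuming scikit-learn's default smoothed idf and L2 normalization. The helper names tokenize and tfidf_rows are made up for this illustration and are not part of sklearn.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel',
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def tfidf_rows(docs, binary):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: number of documents that contain each term
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed inverse document frequency (assumed default settings)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        # Term frequency is 0/1 in binary mode, otherwise the raw count
        weights = {t: (1 if binary else doc.count(t)) * idf[t] for t in set(doc)}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

binary_rows = tfidf_rows(corpus, binary=True)   # mirrors CountVectorizer(binary=True) + TfidfTransformer
count_rows = tfidf_rows(corpus, binary=False)   # mirrors TfidfVectorizer on raw counts

print(round(binary_rows[0]["first"], 4))  # 0.5304
print(round(binary_rows[4]["to"], 4))     # 0.5
print(round(count_rows[4]["to"], 4))      # 0.7559
```

The last two lines show where the two routes diverge: "to" occurs twice in the fifth sentence, so with raw counts it gets a larger weight than the 0.5 printed for the binary matrix.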

PyTorch Basics

Dynamic vs. Static Computational Graph Frameworks

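Static frameworks such as Theano, Caffe, and TensorFlow 1.x require the computational graph to be defined and compiled before it is executed.
- Pros: leads to extremely efficient implementations
- Cons: cumbersome
Dynamic frameworks such as Chainer, DyNet, and PyTorch build the graph while the code runs.
- Pros: more flexible and imperative
Dynamic computational graph frameworks are very effective for modeling NLP tasks, where each input can potentially produce a different graph structure.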
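
To make "a different graph per input" concrete, here is a minimal sketch in PyTorch. The function weighted_sum and its inputs are invented for illustration; the point is only that the loop length, and therefore the recorded graph, depends on the input.

```python
import torch

def weighted_sum(values, w):
    # The loop runs once per input element, so the recorded graph grows with the input
    total = torch.zeros(())
    for v in values:
        total = total + w * v
    return total

w = torch.tensor(2.0, requires_grad=True)

short_out = weighted_sum([1.0, 2.0], w)       # graph with two multiply-add steps
short_out.backward()
grad_short = w.grad.item()                    # d/dw of w * (1 + 2) is 3
w.grad.zero_()                                # clear the accumulated gradient

long_out = weighted_sum([1.0, 2.0, 3.0], w)   # graph with three multiply-add steps
long_out.backward()
grad_long = w.grad.item()                     # d/dw of w * (1 + 2 + 3) is 6

print(grad_short, grad_long)  # 3.0 6.0
```

No graph is declared up front here: ordinary Python control flow decides its shape on each call, which is convenient for variable-length sentences.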

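Some of the PyTorch operations covered in this tutorial:

- Creating tensors
- Operations on tensors
- Indexing, slicing, and joining tensors
- Computing gradients of tensors
- Using CUDA tensors when GPUs are available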
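
As a quick preview of the items in this list, the sketch below touches one operation of each kind on a small tensor. The variable names are illustrative only.

```python
import torch

x = torch.arange(6, dtype=torch.float32).view(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Operations on tensors
doubled = x + x                      # element-wise addition
row_sums = x.sum(dim=1)              # sum over columns: one value per row

# Indexing, slicing, and joining
first_row = x[0]                     # first row, shape (3,)
last_two_cols = x[:, 1:]             # all rows, columns 1 and 2
stacked = torch.cat([x, x], dim=0)   # join along rows, shape (4, 3)

# Gradients
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                      # fills w.grad with d(loss)/dw = 2 * w

print(row_sums)
print(last_two_cols.shape, stacked.shape)
print(w.grad)
```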

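Installing PyTorch

On the official website, select the build that matches your hardware, copy the generated command, and run it in a terminal. For example, this is the command for my Mac and basic programming environment: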

conda install pytorch torchvision -c pytorch

Test program

import torch

def describe(x):
    # Print the tensor's type, shape, and values
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

print("torch.Tensor: initialize a random one by specifying its dimensions")
# torch.Tensor(2, 3) allocates memory without initializing it, so the values
# are whatever happened to be in memory, not samples from a distribution
describe(torch.Tensor(2, 3))

print("torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)")
describe(torch.rand(2, 3))  # uniform random values in [0, 1)

print("torch.Tensor: initialize with standard normal distribution")
describe(torch.randn(2, 3))  # values drawn from the standard normal distribution

# Creating a filled tensor
print("filled with zeros\n")
describe(torch.zeros(2, 3))

print("filled with ones\n")
x = torch.ones(2, 3)
describe(x)

print("filled with a certain value\n")
x.fill_(5)  # the trailing underscore marks an in-place operation
describe(x)

Output:

torch.Tensor: initialize a random one by specifying its dimensions
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-7.7542e+20, 4.5799e-41, -7.7542e+20],
[ 4.5799e-41, 0.0000e+00, 0.0000e+00]])
torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0.7461, 0.5937, 0.9421],
[0.5716, 0.6240, 0.3719]])
torch.Tensor: initialize with standard normal distribution
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-1.5297, -0.7294, 0.1784],
[-1.4134, 0.2278, 1.1762]])
filled with zeros

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0., 0., 0.],
[0., 0., 0.]])
filled with ones

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[1., 1., 1.],
[1., 1., 1.]])
filled with a certain value

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[5., 5., 5.],
[5., 5., 5.]])
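
Tensors can also be created from existing Python data or shaped after another tensor. The sketch below shows a few such constructors from the torch API, with variable names chosen for illustration.

```python
import torch

# Explicitly uninitialized memory, like torch.Tensor(2, 3) in the program above
uninitialized = torch.empty(2, 3)

# From a nested Python list; the dtype is inferred from the values (integers give int64)
from_list = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Convert to 32-bit floats, the default floating-point type
as_float = from_list.float()

# Filled with a constant in one call, instead of ones() followed by fill_()
filled = torch.full((2, 3), 5.0)

# Same shape and dtype as an existing tensor
zeros_like = torch.zeros_like(as_float)

print(from_list.dtype, as_float.dtype)
print(filled)
```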

CUDA Tensors

CUDA

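So far we have only allocated tensors in CPU memory. Linear algebra operations, however, can be sped up considerably on a GPU. To use a GPU, you first need to allocate the tensors in the GPU's memory. GPUs are accessed through a specialized API called CUDA. The CUDA API was created by NVIDIA and works only on NVIDIA GPUs.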

PyTorch

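PyTorch makes it very easy to create CUDA tensors: a tensor can be transferred from the CPU to the GPU while keeping its underlying type. More importantly, PyTorch encourages a device-agnostic style, so that the same code runs whether it is on a CPU or a GPU.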

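First, check whether a GPU is available:

torch.cuda.is_available()
Then build the device object:

torch.device("cuda" if torch.cuda.is_available() else "cpu")
From then on, every tensor is instantiated and moved to the target device with .to(device).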

import torch

def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    # Report the device the tensor itself lives on rather than a global variable
    print("Values: \n{} device = {}".format(x, x.device))

# Check whether CUDA is available
print(torch.cuda.is_available())

# Pick the device: the GPU if there is one, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Instantiate and move to the target device
x = torch.rand(3, 3).to(device)
describe(x)

Output:

False
cpu
Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values:
tensor([[0.9827, 0.8346, 0.1842],
[0.3609, 0.1259, 0.7131],
[0.6021, 0.3017, 0.3955]]) device = cpu
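
One practical consequence of device-agnostic code is that tensors taking part in the same operation need to live on the same device. The following is a minimal sketch of that pattern, with illustrative variable names. It runs on a CPU-only machine as well.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(3, 3)                 # created on the CPU
y = torch.rand(3, 3, device=device)  # created directly on the target device

# Move x to the same device as y before combining them
z = x.to(device) + y
print(z.device)

# Bring the result back to the CPU, e.g. before converting it to NumPy
result = z.cpu()
print(result.device)  # cpu
```

Creating tensors with device=device avoids an extra copy, while .to(device) is useful for tensors that already exist.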
