NLP Notes - Word Embedding // Bag of Words

Concept Explanations

  • Word/sentence/document embedding
    Don't be misled by the everyday sense of the word "embedding". Here it is a mathematical term for a mapping. For example, in a Chinese-English dictionary the Chinese word "鈔票" maps to the English word "money". This technique maps the words or phrases of a vocabulary to vectors of real numbers. In a computer, a word is often mapped simply to its index number, since for now computers only understand numbers. Put plainly, embedding turns text into a language the computer can understand. (See also: the ins and outs of word embedding.)
  • TF-IDF (term frequency-inverse document frequency)
    TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical method for evaluating how important a word is to one document in a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to its frequency across the corpus. The more often a word appears in one article, and the less often it appears across all documents, the better it represents that article.

Term frequency (TF) is the number of times a given word appears in the document. This count is usually normalized (typically the word count divided by the total number of words in the document) to prevent a bias toward long documents. Formula:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$.

The main idea behind inverse document frequency (IDF) is: the fewer documents contain the term $t$, the larger its IDF, and the better the term discriminates between categories. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of the quotient. Formula:

$$\mathrm{idf}_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents and the denominator counts the documents containing term $t_i$.

A high term frequency within a particular document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$

TF-IDF therefore tends to filter out common words and keep the important ones.
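To make the formulas concrete, here is a minimal pure-Python sketch that computes these weights for a tiny made-up corpus (the documents and names below are purely illustrative):

import math

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [["cat", "sat", "on", "mat"],
        ["cat", "ate", "fish"],
        ["dog", "sat", "on", "log"]]

def tf(term, doc):
    # term count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (total number of documents / documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("cat", docs[0], docs))  # 'cat' occurs in 2 of 3 docs -> lower weight
print(tfidf("mat", docs[0], docs))  # 'mat' occurs in 1 of 3 docs -> higher weight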

Bag-of-Words Model Example

Installing gensim

First install the third-party module gensim. Open cmd and change into the Scripts directory under your Anaconda installation.

Changing to the target path in cmd:
>>>d:
>>>cd <path>

Then run pip install gensim and wait for the installation to finish.
Link to the official gensim tutorial
If you want to see log output before starting:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Text Preprocessing

Input corpus:
A corpus is a collection of documents. Here we input a small corpus in which each document is one sentence.
The document contents come from: https://radimrehurek.com/gensim/mycorpus.txt

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

First, some preprocessing:

  • Tokenize the text (tokenization)

  • Remove common words/stop words (words like for/a/of/the/...)

  • Remove words that appear only once (to avoid vectors that are too sparse)

Preprocessing:

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
# remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)  # frequency is a dict counting how many times each word occurs
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-print, one document per line
pprint(texts)

Tips:

  • The difference between dict() and defaultdict(): with dict(), accessing a key that does not exist raises a KeyError. defaultdict() instead returns a default value of the type given in the parentheses, without raising an error.

from collections import defaultdict
a = dict()
b = defaultdict(int)

Running:

print(b["a"])
>>>0

If we change the definition of b above to

b = defaultdict(list)

then running:

print(b["a"])
>>>[]

And running:

print(a["a"])
>>>KeyError: 'a'
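As an aside, a plain dict can achieve the same counting behavior with its get method, a common alternative when you don't want to import collections:

frequency = {}
for text in texts:
    for token in text:
        # get(token, 0) returns 0 for unseen keys instead of raising KeyError
        frequency[token] = frequency.get(token, 0) + 1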

Output:

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
Build and save the dictionary. Input:

dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')  # store the dictionary
print(dictionary)

Output:

2019-03-27 09:23:28,872 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-03-27 09:23:28,873 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-03-27 09:23:28,954 : INFO : saving Dictionary object under deerwester.dict, separately None
2019-03-27 09:23:28,955 : INFO : saved deerwester.dict
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

We can see that the dictionary built from the corpus contains 12 unique tokens. This means each text in the corpus, i.e. each sentence, can be represented by a 12-dimensional sparse vector.
Input:

print(dictionary.token2id)

Output:
The dictionary is printed: each word in the corpus is associated with a unique id. Words and ids correspond one to one, though the exact id values may differ between runs:

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
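To go the other way, from an id back to its token, the dictionary can be indexed directly (a small sketch using the dictionary built above):

print(dictionary[0])   # 'computer'
print(dictionary[11])  # 'minors'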

Bag-of-words output example:

Representing documents with the bag-of-words model

What "bag of words" means:
To infer the latent structure of documents, we need a document representation that can be handled mathematically. One approach is to express each document as a vector. There are many possible representations; a common one is the bag-of-words model. In this model each document (here, each sentence string) is represented as a vector counting how many times each dictionary word occurs. An important property of the bag-of-words model is that it completely ignores the order in which words appear in the sentence, which is where the name "bag of words" comes from.

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

Output:

[(0, 1), (1, 1)]

This shows, for each dictionary word id, how many times it occurred; the word "interaction" does not appear in the dictionary, so it is not counted.
In the tuples produced by doc2bow(), the left element is the word id and the right element is the number of times that word occurs in the sample. The result is a sparse vector of the form [(word_id, word_count), ...], i.e. the bag of words. "interaction" is not in the dictionary, so it does not appear in the sparse vector; likewise, words that are in the dictionary but occur zero times in the new sentence are also omitted. This means the count on the right side of each tuple is never less than 1.
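To make a sparse vector easier to read, the ids can be mapped back to tokens (a small sketch using the dictionary and new_vec from above):

# turn [(word_id, count), ...] into [(token, count), ...]
readable = [(dictionary[word_id], count) for word_id, count in new_vec]
print(readable)  # [('computer', 1), ('human', 1)]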

Convert all the sentences in the corpus into sparse vectors. Input:

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus)  # store to disk
pprint(corpus)

Output:

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
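The serialized corpus can later be streamed back from disk without loading everything into memory at once. A minimal sketch using the file saved above:

from gensim import corpora

# MmCorpus streams documents from disk one at a time
corpus_loaded = corpora.MmCorpus('deerwester.mm')
for doc in corpus_loaded:
    print(doc)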

Iterating over a large text file and converting it to the bag-of-words model one document at a time:

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
          
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly) 
# <__main__.MyCorpus object at 0x10d5690>

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

Output:

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

Complete example: read the text file and build the dictionary:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

from gensim import corpora
stoplist = set('for a of the and to in'.split())
from six import iteritems
# collect statistics about all tokens
# remove stop words and words that appear only once
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
# Dictionary(12 unique tokens)
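As an aside, gensim's Dictionary also provides filter_extremes, which can prune rare words by document frequency declaratively (the parameters below are illustrative; stop words would still need filter_tokens as above):

# drop tokens that appear in fewer than 2 documents; keep everything else
dictionary.filter_extremes(no_below=2, no_above=1.0)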

Code for Computing TF-IDF

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
from gensim import corpora
stoplist = set('for a of the and to in'.split())
from six import iteritems
# collect statistics about all tokens
# remove stop words and words that appear only once
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
# Dictionary(12 unique tokens)

# build the bag-of-words corpus
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())


corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
# print(corpus_memory_friendly)
# <__main__.MyCorpus object at 0x10d5690>

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
 
from gensim import models
# train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus_memory_friendly)
string = "system minors"
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_bow)
print(string_tfidf)

Output:

Folder "C:\Users\ADMINI~1.PC-\AppData\Local\Temp" will be used to save temporary dictionary and corpus.
2019-03-27 13:35:00,337 : INFO : 'pattern' package not found; tag filters are not available for English
2019-03-27 13:35:00,341 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-03-27 13:35:00,342 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
2019-03-27 13:35:00,342 : INFO : collecting document frequencies
2019-03-27 13:35:00,343 : INFO : PROGRESS: processing document #0
2019-03-27 13:35:00,343 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
<__main__.MyCorpus object at 0x0000000002985518>
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

# TF-IDF output
[(5, 1), (11, 1)]
[(5, 0.58983416267400446), (11, 0.80752440244407231)]
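As a sanity check, the two weights can be reproduced by hand. By default gensim's TfidfModel computes idf(t) = log2(N / df(t)) and L2-normalizes the resulting vector; "system" occurs in 3 of the 9 documents and "minors" in 2 (a sketch, assuming these defaults):

import math

N = 9                           # number of documents in the corpus
idf_system = math.log2(N / 3)   # 'system' appears in 3 documents
idf_minors = math.log2(N / 2)   # 'minors' appears in 2 documents

# both query terms occur once, so tf = 1; L2-normalize the vector
norm = math.hypot(idf_system, idf_minors)
print(idf_system / norm)  # ~0.5898
print(idf_minors / norm)  # ~0.8075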