Bag-of-Words (BoW) Model

Concept Explanations

  • Word / sentence / text embedding
    Don't be misled by the everyday Chinese sense of "嵌入" (to embed): in mathematics, an embedding is simply a mapping. For example, in a Chinese-English dictionary the Chinese word "钞票" maps to the English word "money". This technique maps the words or phrases of a vocabulary to vectors of real numbers. Inside a computer, a word is often mapped simply to its index number, since at bottom a computer only understands numbers. Put plainly, it turns text into a language the machine can understand (see the short sketch after this list).
  • TF-IDF (term frequency–inverse document frequency)
    TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical method for evaluating how important a word is to a single document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how frequently it appears across the corpus. The more often a word occurs in one article, and the less often it occurs in all documents, the better it represents that article.
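As a minimal illustration of the word-to-index idea, here is a toy sketch (the vocabulary and variable names are made up for illustration, not any library's API):

# A toy "embedding" in the index sense: map each word to an integer id.
vocab = ["money", "bank", "cash"]
word2id = {word: idx for idx, word in enumerate(vocab)}
print(word2id["money"])  # 0 -- the computer sees the word as this number

Now, more detail on TF-IDF: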

Term frequency (TF) is the number of times a given word appears in the document. This count is usually normalized (typically, the raw count divided by the total number of words in the document) to avoid a bias toward long documents. Formula:

$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, and the denominator is the total number of words in $d_j$.

Inverse document frequency (IDF): the main idea is that the fewer documents contain a term $t$, the larger its IDF, and the better the term discriminates between categories. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of the quotient.
Formula:

$\mathrm{idf}_i = \log \dfrac{|D|}{|\{\, j : t_i \in d_j \,\}|}$

where $|D|$ is the total number of documents and the denominator counts the documents containing term $t_i$.

A high term frequency within a given document, combined with a low document frequency for that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words while keeping important ones:

$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$
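To make the formulas concrete, here is a minimal pure-Python sketch of the computation (the toy corpus and function names are made up for illustration):

import math

# Toy corpus: each document is a list of tokens.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["cherry", "cherry", "cherry"]]

def tf(term, doc):
    # term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log of (total docs / docs containing term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("apple", docs[0], docs))   # high: frequent here, absent elsewhere
print(tfidf("banana", docs[0], docs))  # lower: also appears in another document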

Bag-of-Words Model Example

Installing gensim

First, install the third-party module gensim. Open cmd and change into the Scripts folder under your Anaconda installation.

Navigating to a directory in cmd:
>>>d:
>>>cd <path>

Then run pip install gensim and wait for the installation to finish.
(These steps follow the official gensim tutorial.)
If you want logging output, enable it before starting:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Text Preprocessing

Input corpus:
A corpus is a collection of documents. Here we input a small corpus in which each document is one sentence.
The document contents come from: https://radimrehurek.com/gensim/mycorpus.txt

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

First, some preprocessing:

  • Tokenize the texts (tokenization)

  • Remove common words / stop words (words like for / a / of / the / ...)

  • Remove words that appear only once (to avoid overly sparse vectors)

Preprocessing:

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
# remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)  # frequency is a dict that counts how many times each word occurs
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-print, one document per line
pprint(texts)

Tips:

  • The difference between dict() and defaultdict(): with a plain dict(), accessing a key that does not exist raises a KeyError. A defaultdict() instead returns the default value of the type named in the parentheses, without raising an error.

from collections import defaultdict
a = dict()
b = defaultdict(int)

Running:
print(b["a"])
>>>0

If the definition of b is changed to:
b = defaultdict(list)

Running:
print(b["a"])
>>>[]

Running:
print(a["a"])
>>>KeyError: 'a'

Output:

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
Build and save the dictionary. Input:

dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')  # store the dictionary
print(dictionary)

输出:

2019-03-27 09:23:28,872 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-03-27 09:23:28,873 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-03-27 09:23:28,954 : INFO : saving Dictionary object under deerwester.dict, separately None
2019-03-27 09:23:28,955 : INFO : saved deerwester.dict
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

As the output shows, the dictionary built from this corpus contains 12 unique words. This means that every text in the corpus, i.e. every sentence, can be represented as a 12-dimensional sparse vector.
Input:

print(dictionary.token2id)

Output:
The dictionary is printed; every word in the corpus is associated with a unique id. The word-to-id mapping is one-to-one, although the exact id numbers may differ from run to run:

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Bag-of-words output example

Representing texts with the bag-of-words model

What "bag of words" means:
To infer the latent structure of documents, we need a representation of documents that can be handled mathematically. One approach is to express each document as a vector. There are many possible representations; a common one is the bag-of-words model. In this model, each document (here, each sentence string) is represented as a vector recording how many times each dictionary word occurs. An important property of the bag-of-words model is that it completely ignores the order in which words appear in the sentence, which is where the name "bag of words" comes from.
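A quick way to see this order-insensitivity, using the dictionary built above:

# Two sentences with the same words in different order give the same vector.
bow1 = dictionary.doc2bow("user response time".split())
bow2 = dictionary.doc2bow("time response user".split())
print(bow1 == bow2)  # True: word order is discarded

Now convert a new document into its bag-of-words vector: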

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

Output:

[(0, 1), (1, 1)]

This shows how many times each dictionary word id occurred; the word "interaction" does not appear in the dictionary, so it is not counted.
In the tuples produced by the doc2bow() function, the left element is the word id and the right element is how many times that word occurs in the example. The result is a sparse vector of the form [(word_id, word_count), ...], i.e. the bag of words. "interaction" is not in the dictionary, so it does not appear in the sparse vector. Likewise, words that are in the dictionary but occur zero times in the new sentence are also omitted from the sparse vector, which means the count on the right of each tuple is never less than 1.
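To see a count greater than 1, try a new document that repeats a dictionary word (an illustrative check):

repeat_vec = dictionary.doc2bow("human human computer".lower().split())
print(repeat_vec)  # [(0, 1), (1, 2)] -- "human" (id 1) occurs twice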

Convert all the sentences in the corpus to sparse vectors. Input:

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus)  # store to disk
pprint(corpus)

Output:

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
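The serialized corpus can later be streamed back from disk. A short sketch, reusing the file name from above:

# Load the Matrix Market file saved above; documents are read lazily.
corpus_loaded = corpora.MmCorpus('deerwester.mm')
print(corpus_loaded)  # summary, e.g. MmCorpus(9 documents, 12 features, 28 non-zero entries)
for doc in corpus_loaded:  # one vector at a time, memory friendly
    print(doc)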

For a large text, iterate over the file and convert it to the bag-of-words representation one document at a time:

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
          
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly) 
# <__main__.MyCorpus object at 0x10d5690>

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

Output:

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

Complete example: read the text file and build the dictionary:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

from gensim import corpora
stoplist = set('for a of the and to in'.split())
from six import iteritems
# collect statistics about all tokens
# remove stop words and words that appear only once
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
# Dictionary(12 unique tokens)
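As an aside, Dictionary also provides filter_extremes(), which can replace the manual removal of once-only words (stop words, which have a high document frequency, would still need the token-id approach above). A rough sketch:

dictionary2 = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
# keep only tokens that appear in at least 2 documents
dictionary2.filter_extremes(no_below=2, no_above=1.0)
print(dictionary2)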

Code for computing TF-IDF

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
from gensim import corpora
stoplist = set('for a of the and to in'.split())
from six import iteritems
# collect statistics about all tokens
# remove stop words and words that appear only once
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
# Dictionary(12 unique tokens)

# build the bag-of-words corpus
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())


corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
# print(corpus_memory_friendly)
# <__main__.MyCorpus object at 0x10d5690>

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
 
from gensim import models
# fit the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus_memory_friendly)
string = "system minors"
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_bow)
print(string_tfidf)

Output:

Folder "C:\Users\ADMINI~1.PC-\AppData\Local\Temp" will be used to save temporary dictionary and corpus.
2019-03-27 13:35:00,337 : INFO : 'pattern' package not found; tag filters are not available for English
2019-03-27 13:35:00,341 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-03-27 13:35:00,342 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
2019-03-27 13:35:00,342 : INFO : collecting document frequencies
2019-03-27 13:35:00,343 : INFO : PROGRESS: processing document #0
2019-03-27 13:35:00,343 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
<__main__.MyCorpus object at 0x0000000002985518>
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

# TF-IDF output
[(5, 1), (11, 1)]
[(5, 0.58983416267400446), (11, 0.80752440244407231)]
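These weights can be checked by hand: by default, gensim's TfidfModel uses a base-2 logarithm for the IDF and L2-normalizes each output vector. A small sketch of the arithmetic:

import math

# 'system' (id 5) appears in 3 of the 9 documents; 'minors' (id 11) in 2 of 9.
idf_system = math.log2(9 / 3)  # ~1.585
idf_minors = math.log2(9 / 2)  # ~2.170

# Both words occur once in "system minors", so the raw weights equal the IDFs;
# dividing by the L2 norm reproduces the printed values.
norm = math.sqrt(idf_system**2 + idf_minors**2)
print(idf_system / norm, idf_minors / norm)  # ~0.5898, ~0.8075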