【NLP】LDA2Vec notes (based on cemoody/lda2vec; not successfully reproduced)

Reference: https://blog.csdn.net/u010161379/article/details/51250109


Contents

Overview

Theory

__init__.py

Introduction

Modified and annotated code

corpus.py

Introduction

preprocess.py

Setup

Defining the tokenize function

Full preprocess.py (annotated)

examples: hacker_news

Execution order

Notes and caveats

preprocess.py


Overview

  • The source code comes from GitHub - cemoody/lda2vec. It was released about four years ago and targets Python 2.7, so quite a lot of it no longer works out of the box.
  • There is plenty of discussion of the code on GitHub, and some people have managed to reproduce the model after modifying it, so it is still a useful reference.
  • I tried two of the examples (hacker_news and twenty_newsgroups), both without success. The main problems: (1) not enough memory; (2) the resulting topics consisted entirely of "skip" and "OoV" (out-of-vocabulary) tokens.

Theory

Word2Vec captures local features, LDA captures global features, and lda2vec combines the two.

Lda2vec absorbed the idea of "globality" from LDA: LDA predicts globally, i.e. it predicts a word with regard to the global context (the whole set of documents).

Lda2vec took the idea of "locality" from word2vec: word2vec is local in the sense that it creates vector representations of words (word embeddings) from small text intervals (windows). Word2vec predicts words locally: given one word, it can predict the following word.

Word2Vec vectors are dense and hard to interpret; LDA vectors are sparse and easy to interpret.

A typical word2vec vector is a dense vector filled with real numbers, while an LDA vector is a sparse vector of probabilities. Sparsity here means that most values in the vector are equal to zero.

Due to this sparsity (though not only because of it), the LDA model can be interpreted relatively easily by a human being, but it is inflexible. On the contrary (surprise!), the dense word2vec vector is not human-interpretable, but very flexible (it has more degrees of freedom).
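To make the dense/sparse contrast concrete, here is a toy illustration (the numbers are made up and are not output of either model):

# Toy illustration only: contrast a dense word2vec-style vector with a
# sparse LDA-style topic-proportion vector (made-up numbers).
import numpy as np

word2vec_like = np.random.normal(size=10)          # dense: every dimension holds some real number
lda_like = np.array([0.0, 0.0, 0.7, 0.0, 0.0,
                     0.25, 0.0, 0.0, 0.05, 0.0])   # sparse: a few topics carry all the probability

print(np.count_nonzero(word2vec_like))  # typically 10 -> hard to read any meaning off the dimensions
print(np.count_nonzero(lda_like))       # 3 -> "70% topic 2, 25% topic 5, 5% topic 8"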

And why would anyone need this?

  • the lda2vec approach should improve the quality of topic modeling

The source code can be downloaded directly from the author's GitHub. Since it was written about four years ago for Python 2.7, quite a few things need adjusting today. Below I record those adjustments and add comments, to make the code easier for beginners like me to understand.

__init__.py

Introduction

  • __init__.py marks a directory as a Python package: with this file present the directory becomes importable, and only then can the modules inside it be imported. Importing a package actually runs its __init__.py, and __init__.py can itself import other packages or modules, so a single import of the package pulls in many modules at once instead of requiring every import statement to be written out file by file.
  • Batch imports (defining __all__ to control wildcard imports). Because importing a package really imports its __init__.py, all the modules we need can be imported there in one place rather than one by one. (A minimal sketch follows this list.)
  • Package initialization. __init__.py is an ordinary Python source file, so any initialization code can simply go in it.
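A minimal sketch of the idea, using a made-up package (the names here are for illustration only, not part of lda2vec):

#Hypothetical layout:
#   mypkg/
#       __init__.py
#       alpha.py   # defines a function foo()
#       beta.py    # defines a class Bar
#
#mypkg/__init__.py
from mypkg import alpha
from mypkg import beta

foo = alpha.foo
Bar = beta.Bar

#Controls what "from mypkg import *" exposes
__all__ = ['foo', 'Bar']

#Elsewhere, a single "import mypkg" now runs this file and exposes mypkg.foo and mypkg.Bar.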

Modified and annotated code

#__init__.py
#lda2vec is the package
#each .py file inside it is a module

#Import the required modules from the lda2vec package (installed at D:\Anaconda\Lib\site-packages\lda2vec)
from lda2vec import dirichlet_likelihood
from lda2vec import embed_mixture
from lda2vec import tracking
from lda2vec import preprocess
from lda2vec import corpus
from lda2vec import topics
from lda2vec import negative_sampling

#Expose the classes/functions defined in each module, e.g. the dirichlet_likelihood function of the dirichlet_likelihood module.
#Note: the original line "dirichlet_likelihood = dirichlet_likelihood.dirichlet_likelihood" is asking for trouble:
#do not give a variable, a module, and the function inside it the very same name.

dirichlet = dirichlet_likelihood.dirichlet_likelihood_func
EmbedMixture = embed_mixture.EmbedMixture
Tracking = tracking.Tracking
tokenize = preprocess.tokenize
Corpus = corpus.Corpus
prepare_topics = topics.prepare_topics
print_top_words_per_topic = topics.print_top_words_per_topic
negative = negative_sampling.negative_sampling_func
topic_coherence = topics.topic_coherence

 

corpus.py

Introduction

#The docstrings below are quoted from the source code
class Corpus():
    _keys_frequency = None
    def __init__(self, out_of_vocabulary=-1, skip=-2):

        """ The Corpus helps with tasks involving integer representations of
        words. This object is used to filter, subsample, and convert loose
        word indices to compact word indices.

        'Loose' word arrays are word indices given by a tokenizer. The word
        index is not necessarily representative of word's frequency rank, and
        so loose arrays tend to have 'gaps' of unused indices, which can make
        models less memory efficient. As a result, this class helps convert
        a loose array to a 'compact' one where the most common words have low
        indices, and the most infrequent have high indices.

        Corpus maintains a count of how many of each word it has seen so
        that it can later selectively filter frequent or rare words. However,
        since word popularity rank could change with incoming data the word
        index count must be updated fully and `self.finalize()` must be called
        before any filtering and subsampling operations can happen.

        Arguments
        ---------
        out_of_vocabulary : int, default=-1
            Token index to replace whenever we encounter a rare or unseen word.
            Instead of skipping the token, we mark as an out of vocabulary
            word.
        skip : int, default=-2
            Token index to replace whenever we want to skip the current frame.
            Particularly useful when subsampling words or when padding a
            sentence.
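To make the loose/compact distinction concrete, here is a toy illustration of the idea (not the actual Corpus implementation):

#Toy illustration of the loose -> compact idea (not the actual Corpus code).
#Loose indices come straight from a tokenizer and may have large gaps;
#compact indices are assigned by frequency rank, after reserving the special tokens.
loose_counts = {503: 7, 12: 2, 90210: 41}        # loose index -> count
ranked = sorted(loose_counts, key=loose_counts.get, reverse=True)

n_specials = 2                                    # skip and out_of_vocabulary take the first slots
loose_to_compact = {loose: rank + n_specials for rank, loose in enumerate(ranked)}
print(loose_to_compact)                           # {90210: 2, 503: 3, 12: 4}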

 


def update_word_count(self, loose_array):
    Update the corpus word counts given a loose array of word indices.
    Can be called multiple times, but once `finalize` is called the word
    counts cannot be updated.

    Arguments
    ---------
    loose_array : int array, Array of word indices.
 

#A small example of update_word_count(self, loose_array) follows

#Example: update_word_count(self, loose_array)
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module

corpus.update_word_count(np.arange(10))
#print(np.arange(10))>>>[0 1 2 3 4 5 6 7 8 9]
corpus.update_word_count(np.arange(8))
#print(np.arange(8))>>>[0 1 2 3 4 5 6 7]

print(corpus.counts_loose[10])#>>>0  i.e. the loose index 10 appears 0 times in the arrays above
print(corpus.counts_loose[0])#>>>2  i.e. the loose index 0 appears 2 times

def _loose_keys_ordered(self):
    """ Get the loose keys in order of decreasing frequency"""

 


def finalize(self):
    """ Call `finalize` once done updating word counts. 更新完词频后需要执行finalize
    This means the object will no longer accept new word count data, but the loose to compact index mapping can be computed. This frees the object to filter, subsample, and compactify incoming word arrays.

 

#A small example of finalize(self) follows

#Frankly I did not fully follow the arithmetic here =.=
#But at least the "numpy.AxisError: axis -1 is out of bounds for array of dimension 0" error is fixed (see the notes after the example), and the code runs.

import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
# We'll update the word counts, making sure that word index 2 is the most common word index.
corpus.update_word_count(np.arange(8) + 2)#>>>[2 3 4 5 6 7 8 9]
#print(np.arange(8)) >>>[0 1 2 3 4 5 6 7]
#print(np.arange(8) + 2) >>>[2 3 4 5 6 7 8 9]
#print(corpus.counts_loose[11])#>>>0 occurrences
#print(corpus.counts_loose[9]) #>>>1 occurrence
#print(corpus.keys_counts[0]) >>>raises AttributeError: 'Corpus' object has no attribute 'keys_counts'
# The corpus has not been finalized yet, and so the compact mapping has not yet been computed.

#corpus.update_word_count(np.arange(8) + 2)#>>>[2 3 4 5 6 7 8 9] 
corpus.finalize()
print(corpus.n_specials) #2
# The special tokens are mapped to the first compact indices
print(corpus.compact_to_loose[0]) #-2
corpus.compact_to_loose[0] == corpus.specials['skip'] #True
corpus.compact_to_loose[1] == corpus.specials['out_of_vocabulary'] #True
print(corpus.compact_to_loose[2])  #9 Most popular token is mapped next
print(corpus.loose_to_compact[3])  #8 2nd most popular token is mapped next
first_non_special = corpus.n_specials
print(corpus.keys_counts[first_non_special]) #1 First normal token

# Return the loose keys and counts in descending count order
# so that the counts array is already in compact order

 

#Error: numpy.AxisError: axis -1 is out of bounds for array of dimension 0
#Fix (following https://blog.csdn.net/qq_41185868/article/details/87913872); both changes are Python 2 -> 3 issues, since zip() and dict.values() return iterators rather than lists in Python 3:
#  1. In finalize(self), wrap "zip(self.keys_loose, self.keys_compact)" in a list, i.e. "list(zip(self.keys_loose, self.keys_compact))".
#  2. Follow the traceback into site-packages/lda2vec/corpus.py and change the failing line "specials = np.sort(self.specials.values())" to "specials = np.sort(list(self.specials.values()))".

 


def filter_count(self, words_compact, min_count=15, max_count=0,
                 max_replacement=None, min_replacement=None):

    """ Replace word indices below min_count with the pad index.


    Arguments
    ---------
    words_compact: int array
        Source array whose values will be replaced. This is assumed to
        already be converted into a compact array with `to_compact`.
    min_count : int
        Replace words occurring less frequently than this count. This
        defines the threshold for what words are very rare.
    max_count : int
        Replace words occurring more frequently than this count. This
        defines the threshold for very frequent words.
    min_replacement : int, default is out_of_vocabulary
        Replace words less than min_count with this.
    max_replacement : int, default is out_of_vocabulary
        Replace words greater than max_count with this.

 

#A small example of filter_count() follows

import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module

# Make 1000 word indices with index < 100 and update the word counts.
word_indices = np.random.randint(100, size=1000)#1000 random indices in [0, 100)
corpus.update_word_count(word_indices)
corpus.finalize()
# Any word index above 99 will now be treated as unseen and filtered out.

# Now create a new text, but with some indices at or above 100
word_indices = np.random.randint(200, size=1000)
# word_indices.max() < 100 is now False
# Remove words that have never appeared in the original corpus.
filtered = corpus.filter_count(word_indices, min_count=1)
filtered.max() < 100  # True: unseen indices were replaced
# We can also remove highly frequent words.
filtered = corpus.filter_count(word_indices, max_count=2)
len(np.unique(word_indices)) > len(np.unique(filtered))  # True

 


def subsample_frequent(self, words_compact, threshold=1e-5):
    """ Subsample the most frequent words. This aggressively replaces words
    with frequencies higher than `threshold`. Words are replaced with the
    out_of_vocabulary token.

    Words will be replaced with a probability that is a function of their
    frequency in the training corpus:

    Arguments
    ---------
    words_compact: int array
        The input array to subsample.
    threshold: float in [0, 1]
        Words with frequencies higher than this will be increasingly subsampled.

 

#A small example of subsample_frequent() follows

#Frankly I did not fully follow the arithmetic here =.=
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
word_indices = (np.random.power(5.0, size=1000) * 100).astype('i')#astype: cast the array to integers
#np.random.power(): draws samples in [0, 1] from a power distribution with positive exponent a - 1
corpus.update_word_count(word_indices)#update the word counts
corpus.finalize()#done updating word counts
compact = corpus.to_compact(word_indices)#convert word_indices to compact indices
sampled = corpus.subsample_frequent(compact, threshold=1e-2)
skip = corpus.specials_to_compact['skip']
np.sum(compact == skip)  #0: no skips in the compact tokens
np.sum(sampled == skip) > 0  #True: many skips in the sampled tokens

#Reference: "Distributed Representations of Words and Phrases and their Compositionality"
#Chinese translation: https://blog.csdn.net/u010555997/article/details/76598666
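The subsampling rule in that paper is usually stated as a discard probability p(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is the threshold. A rough sketch of the idea (not the exact code inside corpus.py):

#Rough sketch of frequency subsampling (Mikolov et al. 2013); not the exact corpus.py implementation.
import numpy as np

def discard_probability(freq, threshold=1e-5):
    """freq: relative frequency of each word in the corpus (between 0 and 1)."""
    return np.clip(1.0 - np.sqrt(threshold / freq), 0.0, 1.0)

freq = np.array([0.05, 1e-3, 1e-5, 1e-7])   # very common ... very rare
print(discard_probability(freq))            # ~[0.986 0.9 0. 0.]: common words are dropped most of the time,
                                            # words at or below the threshold are never dropped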

 


def to_compact(self, word_loose):
    """ Convert a loose word index matrix to a compact array using
    a fixed loose to dense mapping. Out of vocabulary word indices
    will be replaced by the out of vocabulary index. The most common
    index will be mapped to 0, the next most common to 1, and so on.

    Arguments
    ---------
    word_loose : int array
        Input loose word array to be converted into a compact array.
#The line "word_compact = corpus.to_compact(word_indices)"
#raised: AssertionError: self.finalized() must be called before any other array ops
#Tracing the assertion into the installed corpus.py, the failing block is:
'''
    def _check_finalized(self):
        msg = "self.finalized() must be called before any other array ops"
        assert self._finalized, msg
'''
#My crude workaround at the time was to comment out this assertion / replace the self._check_finalized() call with self.finalize() inside the installed corpus.py.
#That made the example run, but only run; the cleaner fix is exactly what the assertion asks for: call corpus.finalize() after updating the word counts and before to_compact(), as in the example below.


import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
word_indices = np.random.randint(100, size=1000)
n_words = len(np.unique(word_indices))
corpus.update_word_count(word_indices)
corpus.finalize()#required before to_compact() and the other array ops
word_compact = corpus.to_compact(word_indices)

# The most common word in the training set will be mapped to compact index 2
# (compact indices 0 and 1 are reserved for the special tokens)
np.argmax(np.bincount(word_compact)) == 2  # True
most_common = np.argmax(np.bincount(word_indices))
corpus.loose_to_compact[most_common] == 2  # True

# Out of vocabulary indices will be mapped to 1
word_indices = np.random.randint(150, size=1000)
word_compact_oov = corpus.to_compact(word_indices)
oov = corpus.specials_to_compact['out_of_vocabulary']
oov#1
oov in word_compact#False
oov in word_compact_oov#True

 


def to_loose(self, word_compact):
    """ Convert a compacted array back into a loose array.

    Arguments
    ---------
    word_compact : int array
        Input compacted word array to be converted into a loose array.

#A small example of to_loose() follows
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
word_indices = np.random.randint(100, size=10)
corpus.update_word_count(word_indices)
corpus.finalize()
word_compact = corpus.to_compact(word_indices)
print(word_compact)#[ 4  3  6 11  7  2  5  9 10  8]
word_loose = corpus.to_loose(word_compact)
np.all(word_loose == word_indices)  # True: the compact array round-trips back to the original loose indices
print(word_loose)#[78 89 61 15 40 93 65 34 21 39]

 


def compact_to_flat(self, word_compact, *components):
    """ Ravel a 2D compact array of documents (rows) and word
    positions (columns) into a 1D array of words. Leave out special
    tokens and ravel the component arrays in the same fashion.

    Arguments
    ---------
    word_compact : int array
        Array of word indices in documents. Has shape (n_docs, max_length).
    components : list of arrays
        A list of arrays detailing per-document properties. Each array
        must be n_docs long.

    Returns
    -------
    flat : int array
        An array of all words unravelled into a 1D shape.
    components : list of arrays
        Each array here is also unravelled into the same shape.
#Note: be careful about hacking the .py files installed under Anaconda's site-packages.

#A small example of compact_to_flat() follows
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
word_indices = np.random.randint(100, size=1000)
corpus.update_word_count(word_indices)
corpus.finalize()
doc_texts = np.arange(8).reshape((2, 4))
doc_texts[:, -1] = -2  # Mark as skips
doc_ids = np.arange(2)
compact = corpus.to_compact(doc_texts)
oov = corpus.specials_to_compact['out_of_vocabulary']
compact[1, 3] = oov  # Mark the last word as OOV
flat = corpus.compact_to_flat(compact)
flat.shape[0] == 6  # True 2 skips were dropped from 8 words
flat[-1] == corpus.loose_to_compact[doc_texts[1, 2]] #True
flat, (flat_id,) = corpus.compact_to_flat(compact, doc_ids)
print(flat_id) #>>>[0 0 1 1 1]

 


def word_list(self, vocab, max_compact_index=None, oov_token='<OoV>'):
    """ Translate compact keys back into string representations for a word.

    Arguments
    ---------
    vocab : dict
        The vocab object has loose indices as keys and word strings as
        values.

    max_compact_index : int
        Only return words up to this index. If None, defaults to the number
        of compact indices available

    oov_token : str
        Returns this string if a compact index does not have a word in the
        vocab dictionary provided.

    Returns
    -------
    word_list : list
        A list of string representations corresponding to word indices
        zero to `max_compact_index`
#Translate the compact keys back into string words.

#A small example of word_list() follows
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
vocab = {0: 'But', 1: 'the', 2: 'night', 3: 'was', 4: 'warm'}
word_indices = np.zeros(50).astype('int32')#an array of 50 zeros
word_indices[:25] = 0   # 'But' fills the first 25 slots (the untouched slots 35-39 also stay 0)
word_indices[25:35] = 1  # 'the' appears 10 times
word_indices[40:46] = 2  # 'night' appears 6 times
word_indices[46:49] = 3  # 'was' appears 3 times
word_indices[49:] = 4   # 'warm' appears 1 time
corpus.update_word_count(word_indices)
corpus.finalize()
# Build a vocabulary of word indices
corpus.word_list(vocab)
print(corpus.word_list(vocab))#['skip', 'out_of_vocabulary', 'But', 'the', 'night', 'was', 'warm']

 


def compact_word_vectors(self, vocab, filename=None, array=None, top=20000):
    """ Retrieve pretrained word vectors for our vocabulary.
    The returned word array has row indices corresponding to the
    compact index of a word, and columns corresponding to the word
    vector.

    Arguments
    ---------
    vocab : dict
        Dictionary where keys are the loose index, and values are the word string.

    use_spacy : bool
        Use SpaCy to load in word vectors. Otherwise Gensim.

    filename : str
        Filename for SpaCy-compatible word vectors or if use_spacy=False
        then uses word2vec vectors via gensim.

    Returns
    -------
    data : numpy float array
        Array such that data[compact_index, :] = word_vector
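The notes above include no example for this method because it depends on pretrained SpaCy/gensim vectors. As a purely conceptual sketch of the returned layout (data[compact_index, :] = word_vector), with a toy in-memory vector table standing in for real pretrained vectors:

#Conceptual sketch of the data[compact_index, :] = word_vector layout.
#The toy vectors below stand in for real SpaCy/gensim pretrained vectors.
import numpy as np

compact_words = ['skip', 'out_of_vocabulary', 'night', 'warm']  # e.g. output of corpus.word_list(vocab)
toy_vectors = {'night': np.array([0.1, 0.9]),                   # hypothetical pretrained vectors
               'warm':  np.array([0.8, 0.2])}

n_dim = 2
data = np.random.normal(scale=1e-3, size=(len(compact_words), n_dim))  # small random init for missing words
for compact_index, word in enumerate(compact_words):
    if word in toy_vectors:
        data[compact_index, :] = toy_vectors[word]

print(data.shape)  # (4, 2): row = compact index, columns = the word vector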

 

 


def compact_to_bow(self, word_compact, max_compact_index=None):
    """ Given a 2D array of compact indices, return the bag-of-words
    representation where the column is the word index, the row is the
    document index, and the value is the number of times that word
    appears in that document.

#A small example of compact_to_bow() follows
import numpy.linalg as nl
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module

vocab = {19: 'shuttle', 5: 'astronomy', 7: 'cold', 3: 'hot'}
word_indices = np.zeros(50).astype('int32')
word_indices[:25] = 19  # 'Shuttle' shows 25 times
word_indices[25:35] = 5  # 'astronomy' is in 10 times
word_indices[40:46] = 7  # 'cold' is in 6 times
word_indices[46:] = 3  # 'hot' appears 4 times (the untouched slots 35-39 keep the value 0)
corpus.update_word_count(word_indices)
corpus.finalize()
v = corpus.compact_to_bow(word_indices)
print(len(v))#20
print(v[:6])#[ 5  0  0  4  0 10]
print(v[19])#25
print(v.sum())#50
words = [[0, 0, 0, 3, 4], [1, 1, 1, 4, 5]]
words = np.array(words)
bow = corpus.compact_to_bow(words)
print(bow.shape)#(2, 6)

 


def compact_to_coocurrence(self, word_compact, indices, window_size=10):
    """ From an array of compact tokens and an aligned array of document indices,
    compute (word, word, document) co-occurrences within a moving window.

    Arguments
    ---------
    word_compact: int array
    Sequence of tokens.

    indices: dict of int arrays
    Each array in this dictionary should represent the document index it
    came from.

    window_size: int
    Indicates the moving window size around which all co-occurrences will
    be computed.

    Returns
    -------
    counts : DataFrame
    Returns a DataFrame with two columns for word index A and B,
    one extra column for each document index, and a final column for counts
    in that key.

#A small example of compact_to_coocurrence() follows
import numpy.linalg as nl
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
corpus = corpus.Corpus()#instantiate the Corpus class of the corpus module
compact = np.array([0, 1, 1, 1, 2, 2, 3, 0])
doc_idx = np.array([0, 0, 0, 0, 1, 1, 1, 1])
counts = corpus.compact_to_coocurrence(compact, {'doc': doc_idx})
counts.counts.sum()
counts.query('doc == 0').counts.values
compact = np.array([0, 1, 1, 1, 2, 2, 3, 0])
doc_idx = np.array([0, 0, 0, 1, 1, 2, 2, 2])
counts = corpus.compact_to_coocurrence(compact, {'doc': doc_idx})
counts.counts.sum()
print(counts.counts.sum())#14
counts.query('doc == 0').word_index_x.values
counts.query('doc == 0').word_index_y.values
counts.query('doc == 0').counts.values
counts.query('doc == 1').counts.values
print(counts.query('doc == 1').counts.values)#[1 1]

 


def fast_replace(data, keys, values, skip_checks=False):
    """ Do a search-and-replace in array `data`.

    Arguments
    ---------
    data : int array
        Array of integers
    keys : int array
        Array of keys inside of `data` to be replaced
    values : int array
        Array of values that replace the `keys` array
    skip_checks : bool, default=False
        Optionally skip sanity checking the input.
#Note: in corpus.py, fast_replace() is a module-level function, not a method of the Corpus class, so call it directly as corpus.fast_replace(...)

import numpy.linalg as nl
import numpy as np
from lda2vec import corpus#import the corpus module from the lda2vec package
#corpus = corpus.Corpus()#no Corpus instance is needed here
corpus.fast_replace(np.arange(5), np.arange(5), np.arange(5)[::-1])
print(corpus.fast_replace(np.arange(5), np.arange(5), np.arange(5)[::-1]))#[4 3 2 1 0]

 

preprocess.py

Setup

#Setup
from spacy.lang.en import English
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL
import numpy as np

Defining the tokenize function

#tokenize: uses spaCy to quickly tokenize text and return an array of indices
def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None, **kwargs):

"""
Parameters
    ----------
    text : list of unicode strings. These are the input documents.
          There can be multiple sentences per item in the list.
          (In Python 3 all strings are stored as Unicode, so it makes no difference whether the u prefix is used.)

    max_length : int
        This is the maximum number of words per document. If the document is
        shorter then this number it will be padded to this length.
    skip : int, optional
        Short documents will be padded with this variable up until max_length.
    attr : int, from spacy.attrs
        What to transform the token to. Choice must be in spacy.attrs, and
        common choices are (LOWER, LEMMA)
    merge : int, optional
        Merge noun phrases into a single token. Useful for turning 'New York'
        into a single token.
    nlp : None
        A spaCy NLP object. Useful for not reinstantiating the object multiple
        times.
    kwargs : dict, optional
        Any further argument will be sent to the spaCy tokenizer. For extra
        speed consider setting tag=False, parse=False, entity=False, or
        n_threads=8.

    Returns
    -------
    arr : 2D array of ints
        Has shape (len(texts), max_length). Each value represents
        the word index.
    vocab : dict
        Keys are the word index, and values are the string. The pad index gets
        mapped to None

    >>> sents = [u"Do you recall a class action lawsuit", u"hello zombo.com"]
    >>> arr, vocab = tokenize(sents, 10, merge=True)
    >>> arr.shape[0]
    2
    >>> arr.shape[1]
    10
    >>> w2i = {w: i for i, w in vocab.items()}  # (vocab.iteritems() in the original Python 2 code)
    >>> arr[0, 0] == w2i[u'do']  # First word and its index should match
    True
    >>> arr[0, 1] == w2i[u'you']
    True
    >>> arr[0, -1]  # last word in 0th document is a pad word
    -2
    >>> arr[0, 4] == w2i[u'class action lawsuit']  # noun phrase is tokenized
    True
    >>> arr[1, 1]  # The URL token is thrown out
    -2
    """

 

Full preprocess.py (annotated)

#Setup
from spacy.lang.en import English
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL
import numpy as np
from lda2vec import preprocess
import spacy
nlp = spacy.load("en_core_web_sm")


#Main function: tokenize. Uses spaCy to quickly tokenize text and return an array of indices
def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None, **kwargs):
    if nlp is None:
        nlp = English()#spaCy's English model (models for other languages such as German and French also exist, https://spacy.io/models/en)
    data = np.zeros((len(texts), max_length), dtype='int32')#all-zero matrix with len(texts) rows and max_length columns
    data[:] = skip#fill the matrix with skip (-2) instead of 0
    bad_deps = ('amod', 'compound')#amod: adjectival modifier; compound: compound word; dep = dependency relation

    #nlp.pipe() streams the documents through the spaCy pipeline;
    #enumerate() pairs each document with its row index.
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:# merge noun phrases into single tokens (useful for turning 'New York' into a single token)

            for phrase in doc.noun_chunks:#noun chunks; only keep adjectives and nouns, e.g. "good ideas"
                #e.g. turn 'New York' into 'New_York'
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)

        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        #Note: in recent spaCy versions these attribute values are 64-bit hashes,
        #so casting to 'int32' may overflow into negative or colliding values.
        if len(dat) > 0:
            #msg = "Negative indices reserved for special tokens"
            #assert dat.min() >= 0, msg  # the original check; commented out here while debugging
            # Replace email and URL tokens with the skip marker
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab



'''
if __name__ == "__main__":
    import doctest
    doctest.testmod()
'''

 

 

examples: hacker_news

Execution order

[Figure: top-level directory layout of examples/hacker_news]

  • Taking examples/hacker_news (a Hacker News comments dataset) as the example: as far as I can tell, you should first run data/preprocess.py (it also contains the code that downloads the dataset). It preprocesses the data and saves its outputs, as shown in the figure below.
[Figure: files produced by examples/hacker_news/data/preprocess.py]

 

  • Then run examples\hacker_news\lda2vec\lda2vec_run.py to actually train the model. (It consumes the preprocessing outputs and produces lda2vec.hdf5.)
[Figure: examples\hacker_news\lda2vec\lda2vec_run.py]

 

Notes and caveats

  • Use "print" sparingly

When debugging it is natural to add lots of print calls to watch every step. But for large batches of data, printing can easily add tens of times the runtime and badly hurt throughput, so keep it under control (a logging-based sketch follows).
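One way to keep debug output switchable instead of scattering bare print calls is the standard logging module (the hacker_news preprocess.py below already calls logging.basicConfig()). A minimal sketch; the loop and messages here are made up:

#Sketch: switchable progress/debug output via logging instead of bare print calls.
import logging

logging.basicConfig(level=logging.INFO)   # raise the level to logging.WARNING to silence the chatter
log = logging.getLogger("preprocess")

for i in range(1000000):
    # ... heavy per-row work would go here ...
    if i % 100000 == 0:                   # report only occasionally, not on every row
        log.info("processed %d rows", i)
log.debug("only shown when the level is DEBUG")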

  • Progress bar
#Reference: https://blog.csdn.net/The_Time_Runner/article/details/87735801
#A progress bar via sys.stdout.write() (can be pasted straight into your code)

import time,sys
for i in range(100):
    percent = i / 100
    sys.stdout.write("\r{0}{1}".format("|"*i , '%.2f%%' % (percent * 100)))
    sys.stdout.flush()
    time.sleep(1)

 

preprocess.py

# Author: Chris Moody <[email protected]>
# License: MIT

# This example loads a large 800MB Hacker News comments dataset
# and preprocesses it. This can take a few hours, and a lot of
# memory, so please be patient!

from lda2vec import preprocess, Corpus
import numpy as np
import pandas as pd
import logging
import pickle  # Python 3: cPickle has been merged into the standard pickle module
import os.path

logging.basicConfig()

max_length = 250   # Limit of 250 words per comment
min_author_comments = 50  # Exclude authors with fewer comments
nrows = None  # Number of rows of file to read; None reads in full file

fn = "hacker_news_comments.csv"
url = "https://zenodo.org/record/45901/files/hacker_news_comments.csv"
if not os.path.exists(fn):
    import requests
    response = requests.get(url, stream=True, timeout=2400)
    with open(fn, 'wb') as fh:  # iter_content yields bytes, so open the file in binary mode
        # Iterate over 1MB chunks
        for data in response.iter_content(1024**2):
            fh.write(data)


features = []
# Convert to unicode (spaCy only works with unicode)
features = pd.read_csv(fn, encoding='utf8', nrows=nrows)
# Convert all integer arrays to int32
for col, dtype in zip(features.columns, features.dtypes):
    if dtype is np.dtype('int64'):
        features[col] = features[col].astype('int32')

# Tokenize the texts
# If this fails it's likely spacy. Install a recent spacy version.
# Only the most recent versions have tokenization of noun phrases
# I'm using SHA dfd1a1d3a24b4ef5904975268c1bbb13ae1a32ff
# Also try running python -m spacy.en.download all --force
texts = features.pop('comment_text').values
tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
                                    merge=True)
del texts

# Make a ranked list of rare vs frequent words
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()

# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=10)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
print "n_words", np.unique(clean).max()

# Extract numpy arrays over the fields we want covered by topics
# Convert to categorical variables
author_counts = features['comment_author'].value_counts()
to_remove = author_counts[author_counts < min_author_comments].index
mask = features['comment_author'].isin(to_remove).values
author_name = features['comment_author'].values.copy()
author_name[mask] = 'infrequent_author'
features['comment_author'] = author_name
authors = pd.Categorical(features['comment_author'])
author_id = authors.codes
author_name = authors.categories
story_id = pd.Categorical(features['story_id']).codes
# Chop timestamps into days
story_time = pd.to_datetime(features['story_time'], unit='s')
days_since = (story_time - story_time.min()) / pd.Timedelta('1 day')
time_id = days_since.astype('int32')
features['story_id_codes'] = story_id
features['author_id_codes'] = author_id
features['time_id_codes'] = time_id

print "n_authors", author_id.max()
print "n_stories", story_id.max()
print "n_times", time_id.max()

# Extract outcome supervised features
ranking = features['comment_ranking'].values
score = features['story_comment_count'].values

# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
feature_arrs = (story_id, author_id, time_id, ranking, score)
flattened, features_flat = corpus.compact_to_flat(pruned, *feature_arrs)
# Flattened feature arrays
(story_id_f, author_id_f, time_id_f, ranking_f, score_f) = features_flat

# Save the data
pickle.dump(corpus, open('corpus', 'wb'), protocol=2)  # pickle files must be opened in binary mode
pickle.dump(vocab, open('vocab', 'wb'), protocol=2)
features.to_pickle('features.pd')
data = dict(flattened=flattened, story_id=story_id_f, author_id=author_id_f,
            time_id=time_id_f, ranking=ranking_f, score=score_f,
            author_name=author_name, author_index=author_id)
np.savez('data', **data)
np.save(open('tokens', 'wb'), tokens)  # np.save needs a binary file handle

 
