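  • The source code comes from GitHub - cemoody/lda2vec. It was published four years ago and targets Python 2.7, so much of it inevitably no longer works as-is.
  • There is a lot of discussion of this code on GitHub, and some experienced users have reproduced the model after modifying it, which suggests the code still has reference value.
  • I tried two of the examples (hacker_news and twenty_newsgroups), both without success. The main problems: (1) the machine ran out of memory; (2) the topic results consisted entirely of "skip" and "OoV" (out-of-vocabulary) tokens.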



lda2vec takes the idea of "globality" from LDA: LDA predicts a word from its global context, i.e. the whole set of documents.

lda2vec takes the idea of "locality" from word2vec. Word2vec is local in that it builds vector representations of words (word embeddings) from small text intervals (windows): given one word, it predicts the words that appear near it inside the window.


A typical word2vec vector is a dense vector of real numbers, while an LDA vector is a sparse vector of probabilities. Sparse here means that most values in the vector are equal to zero.

Because of this sparsity (among other reasons), an LDA model is relatively easy for a human to interpret, but it is inflexible. A dense word2vec vector, by contrast, is not human-interpretable but is very flexible (it has more degrees of freedom).
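
To make the dense versus sparse contrast concrete, here is a minimal sketch. The numbers are made up for illustration and do not come from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense word2vec-style vector: every component is a real number, none are zero.
word_vector = rng.normal(size=8)

# Sparse LDA-style document vector: topic probabilities, most of them zero.
topic_vector = np.array([0.0, 0.7, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0])

print("nonzero entries in word vector :", np.count_nonzero(word_vector))
print("nonzero entries in topic vector:", np.count_nonzero(topic_vector))
print("topic vector sums to", topic_vector.sum())
```

The topic vector can be read directly ("70% topic 1, 30% topic 4"), while the individual components of the word vector carry no such meaning on their own.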

Why would anyone need it?

  • The lda2vec approach is intended to improve the quality of topic modeling.




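  • __init__.py marks a directory as a Python package. Every regular package has one, and with this file in place the modules in the directory can be imported. Importing a package actually runs its __init__.py, and that file can in turn import other packages or modules. As a result, importing the package automatically pulls in those modules, so callers do not have to write every import statement themselves, which reduces the amount of code.
  • Bulk imports (define __all__ to control wildcard imports). Because importing a package runs its __init__.py, the modules we need can be imported there in one place instead of one by one.
  • Module initialization. The file is an ordinary Python source file, so initialization code can be placed in it.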
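
The three points above can be tried out on a throwaway package. This is only a sketch: the package name `demo_pkg`, the module `helpers` and the function `greet` are hypothetical and have nothing to do with lda2vec.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (all names here are hypothetical).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "helpers.py").write_text("def greet():\n    return 'hello'\n")
(pkg / "__init__.py").write_text(
    "from demo_pkg import helpers\n"
    "greet = helpers.greet      # re-export under a short name\n"
    "__all__ = ['greet']        # names exposed by a wildcard import\n"
)

sys.path.insert(0, str(root))
demo_pkg = importlib.import_module("demo_pkg")  # this runs demo_pkg/__init__.py

print(demo_pkg.greet())   # the function re-exported by __init__.py
print(demo_pkg.__all__)
```

The lda2vec `__init__.py` follows the same pattern: import the submodules, then re-export selected names at package level.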



from lda2vec import dirichlet_likelihood
from lda2vec import embed_mixture
from lda2vec import tracking
from lda2vec import preprocess
from lda2vec import corpus
from lda2vec import topics
from lda2vec import negative_sampling

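# Note: the original line "dirichlet_likelihood = dirichlet_likelihood.dirichlet_likelihood" rebinds the module name to the function of the same name, which is asking for trouble, so different names are used below.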

dirichlet = dirichlet_likelihood.dirichlet_likelihood_func
EmbedMixture = embed_mixture.EmbedMixture
Tracking = tracking.Tracking
tokenize = preprocess.tokenize
Corpus = corpus.Corpus
prepare_topics = topics.prepare_topics
print_top_words_per_topic = topics.print_top_words_per_topic
negative = negative_sampling.negative_sampling_func
topic_coherence = topics.topic_coherence




class Corpus():
    _keys_frequency = None
    def __init__(self, out_of_vocabulary=-1, skip=-2):

        """ The Corpus helps with tasks involving integer representations of
        words. This object is used to filter, subsample, and convert loose
        word indices to compact word indices.

        'Loose' word arrays are word indices given by a tokenizer. The word
        index is not necessarily representative of word's frequency rank, and
        so loose arrays tend to have 'gaps' of unused indices, which can make
        models less memory efficient. As a result, this class helps convert
        a loose array to a 'compact' one where the most common words have low
        indices, and the most infrequent have high indices.

        Corpus maintains a count of how many of each word it has seen so
        that it can later selectively filter frequent or rare words. However,
        since word popularity rank could change with incoming data the word
        index count must be updated fully and `self.finalize()` must be called
        before any filtering and subsampling operations can happen.

        out_of_vocabulary : int, default=-1
            Token index to replace whenever we encounter a rare or unseen word.
            Instead of skipping the token, we mark it as an out of vocabulary
            word.
        skip : int, default=-2
            Token index to replace whenever we want to skip the current frame.
            Particularly useful when subsampling words or when padding a
            sentence.
        """

def update_word_count(self, loose_array):
    """ Update the corpus word counts given a loose array of word indices.
    Can be called multiple times, but once `finalize` is called the word
    counts cannot be updated.

    loose_array : int array
        Array of word indices.
    """

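# A short example of update_word_count(self, loose_array)
import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class (this rebinds the name `corpus`)

corpus.update_word_count(np.arange(10))  # counts the indices [0 1 2 3 4 5 6 7 8 9]
corpus.update_word_count(np.arange(8))   # counts the indices [0 1 2 3 4 5 6 7]

print(corpus.counts_loose[10])  # >>> 0  my understanding: the index 10 occurs 0 times in the arrays above
print(corpus.counts_loose[0])   # >>> 2  my understanding: the index 0 occurs 2 times in the arrays above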

def _loose_keys_ordered(self):
    """ Get the loose keys in order of decreasing frequency"""


def finalize(self):
    """ Call `finalize` once done updating word counts (it must be run after the word counts have been updated).
    This means the object will no longer accept new word count data, but the loose to compact index mapping can be computed. This frees the object to filter, subsample, and compactify incoming word arrays.
    """



# Personal note: I did not really follow this part =.= it reads like a numbers game.
# At least I got rid of the "numpy.AxisError: axis -1 is out of bounds for array of dimension 0" error (by trial and error), so the example code runs.

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
# With a single update every index occurs exactly once, so no index is really
# more common than the others and the ranking below comes from tie-breaking.
corpus.update_word_count(np.arange(8) + 2)  # >>> [2 3 4 5 6 7 8 9]
# print(np.arange(8))      >>> [0 1 2 3 4 5 6 7]
# print(np.arange(8) + 2)  >>> [2 3 4 5 6 7 8 9]
# print(corpus.counts_loose[11])  >>> 0 occurrences
# print(corpus.counts_loose[9])   >>> 1 occurrence
# The corpus has not been finalized yet, and so the compact mapping has not yet been computed:
# print(corpus.keys_counts[0])  >>> AttributeError: 'Corpus' object has no attribute 'keys_counts'

corpus.finalize()  # required before the attributes below exist
print(corpus.n_specials)  # 2
# The special tokens are mapped to the first compact indices
print(corpus.compact_to_loose[0])  # -2
corpus.compact_to_loose[0] == corpus.specials['skip']  # True
corpus.compact_to_loose[1] == corpus.specials['out_of_vocabulary']  # True
print(corpus.compact_to_loose[2])  # 9 in my run; the most popular token is mapped next
print(corpus.loose_to_compact[3])  # 8 in my run; the 2nd most popular token is mapped next
first_non_special = corpus.n_specials
print(corpus.keys_counts[first_non_special])  # 1  count of the first normal token

# Return the loose keys and counts in descending count order
# so that the counts array is already in compact order


# Error: numpy.AxisError: axis -1 is out of bounds for array of dimension 0
# Fix (needed because zip() and dict.values() no longer return lists in Python 3): (1) in finalize(self), wrap "zip(self.keys_loose, self.keys_compact)" in a list, i.e. "list(zip(self.keys_loose, self.keys_compact))"; (2) follow the traceback to corpus.py of the lda2vec package under lib/site-packages and change the failing line "specials = np.sort(self.specials.values())" to "specials = np.sort(list(self.specials.values()))".
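
Since the "numbers game" is hard to follow from the docstrings alone, here is a plain NumPy sketch of what I understand the loose to compact mapping to be: special tokens first, then words ranked by descending count. `build_compact_mapping` is a hypothetical helper written for this note, not lda2vec's implementation, and its tie-breaking may differ from the library's.

```python
import numpy as np

def build_compact_mapping(loose_tokens, specials=(-2, -1)):
    """Map loose indices to compact ones, most frequent first.

    Hypothetical helper, not part of lda2vec. The special tokens
    (skip=-2, out_of_vocabulary=-1) take compact indices 0 and 1.
    """
    keys, counts = np.unique(loose_tokens, return_counts=True)
    # Sort by descending count; the stable sort breaks ties by loose index.
    order = np.argsort(-counts, kind="stable")
    loose_to_compact = {int(s): i for i, s in enumerate(specials)}
    for rank, key in enumerate(keys[order]):
        loose_to_compact[int(key)] = rank + len(specials)
    return loose_to_compact

tokens = np.array([7, 7, 7, 42, 42, 1000])
mapping = build_compact_mapping(tokens)
print(mapping)  # {-2: 0, -1: 1, 7: 2, 42: 3, 1000: 4}
```

The loose indices 7, 42 and 1000 leave large gaps, while the compact indices 2, 3 and 4 are contiguous and ordered by frequency.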


def filter_count(self, words_compact, min_count=15, max_count=0,
                 max_replacement=None, min_replacement=None):

    """ Replace word indices below min_count with the pad index.

    words_compact: int array
        Source array whose values will be replaced. This is assumed to
        already be converted into a compact array with `to_compact`.
    min_count : int (handles low-frequency words)
        Replace words occurring less frequently than this count. This
        defines the threshold for what words are very rare.
    max_count : int (handles high-frequency words)
        Replace words occurring more frequently than this count. This
        defines the threshold for very frequent words.
    min_replacement : int, default is out_of_vocabulary
        Replace words less frequent than min_count with this.
    max_replacement : int, default is out_of_vocabulary
        Replace words more frequent than max_count with this.
    """



import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class

# Make 1000 word indices with index < 100 and update the word counts.
word_indices = np.random.randint(100, size=1000)  # 1000 random integers in [0, 100)
corpus.update_word_count(word_indices)
corpus.finalize()  # any word indices above 99 will be filtered

# Now create a new text, but with some indices above 100
word_indices = np.random.randint(200, size=1000)
print(word_indices.max() < 100)
# Remove words that have never appeared in the original corpus.
filtered = corpus.filter_count(word_indices, min_count=1)
print(filtered.max() < 100)
# We can also remove highly frequent words.
filtered = corpus.filter_count(word_indices, max_count=2)
print(len(np.unique(word_indices)) > len(np.unique(filtered)))


def subsample_frequent(self, words_compact, threshold=1e-5):
    """ Subsample the most frequent words. This aggressively replaces words
    with frequencies higher than `threshold`. Words are replaced with the
    out_of_vocabulary token.
    Words will be replaced with probability as a function of their frequency
    in the training corpus.

    words_compact: int array
        The input array to subsample.
    threshold: float in [0, 1]
        Words with frequencies higher than this will be increasingly subsampled.
    """



# Personal note: I did not really follow this part either =.=
import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
word_indices = (np.random.power(5.0, size=1000) * 100).astype('i')  # astype converts the array's data type
# np.random.power(): draws samples in [0, 1] from a power distribution with positive exponent a - 1.
corpus.update_word_count(word_indices)
corpus.finalize()
compact = corpus.to_compact(word_indices)  # convert word_indices to compact indices
sampled = corpus.subsample_frequent(compact, threshold=1e-2)
skip = corpus.specials_to_compact['skip']
np.sum(compact == skip)  # 0  No skips in the compact tokens
np.sum(sampled == skip) > 0  # True  Many skips in the sampled tokens

# Reference: "Distributed Representations of Words and Phrases and their Compositionality"
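
The paper referenced above discards a word with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is the threshold. The sketch below implements that formula in plain NumPy to show why only frequent words are affected. `discard_probability` and `subsample` are hypothetical helpers, and the exact expression used inside lda2vec may differ from the paper's.

```python
import numpy as np

def discard_probability(freq, threshold=1e-5):
    """P(discard) = 1 - sqrt(threshold / freq), clipped to [0, 1]."""
    freq = np.asarray(freq, dtype=float)
    return np.clip(1.0 - np.sqrt(threshold / freq), 0.0, 1.0)

def subsample(tokens, threshold=1e-5, skip=-2, seed=0):
    """Randomly replace frequent tokens with `skip` (hypothetical helper)."""
    tokens = np.asarray(tokens)
    _, inverse, counts = np.unique(tokens, return_inverse=True,
                                   return_counts=True)
    freq = counts / tokens.size                      # relative frequency per word
    p_drop = discard_probability(freq, threshold)[inverse]
    rng = np.random.default_rng(seed)
    out = tokens.copy()
    out[rng.uniform(size=tokens.shape) < p_drop] = skip
    return out

# Word 0 makes up 90% of the text; words 1..100 appear once each.
tokens = np.array([0] * 900 + list(range(1, 101)))
sampled = subsample(tokens, threshold=1e-2)
print("skips:", int(np.sum(sampled == -2)))
print("rare words untouched:", bool(np.all(sampled[900:] == tokens[900:])))
```

Words whose frequency is below the threshold get a discard probability of zero, so only the dominant word is thinned out.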


def to_compact(self, word_loose):
    """ Convert a loose word index matrix to a compact array using
    a fixed loose to dense mapping. Out of vocabulary word indices
    will be replaced by the out of vocabulary index. The most common
    index will be mapped to 0, the next most common to 1, and so on.

    word_loose : int array
        Input loose word array to be converted into a compact array.
    """
# The line "word_compact = corpus.to_compact(word_indices)"
# raises: AssertionError: self.finalized() must be called before any other array ops
# The message comes from this check:
    def _check_finalized(self):
        msg = "self.finalized() must be called before any other array ops"
        assert self._finalized, msg

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
word_indices = np.random.randint(100, size=1000)
n_words = len(np.unique(word_indices))
corpus.update_word_count(word_indices)
corpus.finalize()  # without this call, to_compact raises the AssertionError above
word_compact = corpus.to_compact(word_indices)

# The most common word in the training set will be mapped to 2,
# right after the two special tokens.
np.argmax(np.bincount(word_compact)) == 2  # True
most_common = np.argmax(np.bincount(word_indices))
corpus.loose_to_compact[most_common] == 2  # True

# Out of vocabulary indices will be mapped to 1
word_indices = np.random.randint(150, size=1000)
word_compact_oov = corpus.to_compact(word_indices)
oov = corpus.specials_to_compact['out_of_vocabulary']
oov in word_compact  # False
oov in word_compact_oov  # True


def to_loose(self, word_compact):
    """ Convert a compacted array back into a loose array.

    word_compact : int array
        Input compacted word array to be converted into a loose array.
    """

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
word_indices = np.random.randint(100, size=10)
corpus.update_word_count(word_indices)
corpus.finalize()
word_compact = corpus.to_compact(word_indices)
print(word_compact)  # [ 4  3  6 11  7  2  5  9 10  8] in one random run
word_loose = corpus.to_loose(word_compact)
print(np.all(word_loose == word_indices))
print(word_loose)  # [78 89 61 15 40 93 65 34 21 39] in the same run


def compact_to_flat(self, word_compact, *components):
    """ Ravel a 2D compact array of documents (rows) and word
    positions (columns) into a 1D array of words. Leave out special
    tokens and ravel the component arrays in the same fashion.

    word_compact : int array
        Array of word indices in documents. Has shape (n_docs, max_length)
    components : list of arrays
        A list of arrays detailing per-document properties. Each array must
        be n_docs long.

    flat : int array
        An array of all words unravelled into a 1D shape
    components : list of arrays
        Each array here is also unravelled into the same shape
    """

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
word_indices = np.random.randint(100, size=1000)
corpus.update_word_count(word_indices)
corpus.finalize()
doc_texts = np.arange(8).reshape((2, 4))
doc_texts[:, -1] = -2  # Mark as skips
doc_ids = np.arange(2)
compact = corpus.to_compact(doc_texts)
oov = corpus.specials_to_compact['out_of_vocabulary']
compact[1, 3] = oov  # Mark the last word as OOV
flat = corpus.compact_to_flat(compact)
flat.shape[0] == 6  # True  2 skips were dropped from 8 words
flat[-1] == corpus.loose_to_compact[doc_texts[1, 2]]  # True
flat, (flat_id,) = corpus.compact_to_flat(compact, doc_ids)
print(flat_id)  # >>> [0 0 1 1 1] in my run
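
The flattening step is easier to see without the library. The sketch below is a minimal NumPy version under the assumption that special tokens occupy compact indices 0 and 1. `flatten_documents` is a hypothetical helper, and lda2vec's own masking rule may differ in detail.

```python
import numpy as np

def flatten_documents(word_compact, *components, n_specials=2):
    """Drop special tokens and ravel per-document arrays alongside the words.

    Hypothetical helper; assumes special tokens occupy compact indices
    0 .. n_specials - 1.
    """
    keep = word_compact >= n_specials       # boolean mask, shape (n_docs, max_length)
    max_length = word_compact.shape[1]
    flat_components = [
        # Repeat each per-document value across the row, then apply the mask.
        np.repeat(np.asarray(c)[:, None], max_length, axis=1)[keep]
        for c in components
    ]
    return word_compact[keep], flat_components

docs = np.array([[2, 3, 4, 0],      # the last position is a skip (0)
                 [5, 6, 1, 0]])     # an OoV (1) followed by a skip (0)
doc_ids = np.array([10, 11])
flat, (flat_ids,) = flatten_documents(docs, doc_ids)
print(flat)      # [2 3 4 5 6]
print(flat_ids)  # [10 10 10 11 11]
```

Each surviving word keeps the document id of the row it came from, which is what the model needs to pair every word with its document.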


def word_list(self, vocab, max_compact_index=None, oov_token='<OoV>'):
    """ Translate compact keys back into string representations for a word.

    vocab : dict
        The vocab object has loose indices as keys and word strings as
        values.

    max_compact_index : int
        Only return words up to this index. If None, defaults to the number
        of compact indices available

    oov_token : str
        Returns this string if a compact index does not have a word in the
        vocab dictionary provided.

    word_list : list
        A list of strings representations corresponding to word indices
        zero to `max_compact_index`
    """
# Translate the compact keys into string words

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
vocab = {0: 'But', 1: 'the', 2: 'night', 3: 'was', 4: 'warm'}
word_indices = np.zeros(50).astype('int32')  # an array of 50 zeros
word_indices[:25] = 0  # 'But' on the first 25 positions (positions 35-39 also stay 0, so 30 in total)
word_indices[25:35] = 1  # 'the' appears 10 times
word_indices[40:46] = 2  # 'night' appears 6 times
word_indices[46:49] = 3  # 'was' appears 3 times
word_indices[49:] = 4  # 'warm' appears once
corpus.update_word_count(word_indices)
corpus.finalize()
# Build a vocabulary of word indices
print(corpus.word_list(vocab))  # ['skip', 'out_of_vocabulary', 'But', 'the', 'night', 'was', 'warm']


def compact_word_vectors(self, vocab, filename=None, array=None, top=20000):
    """ Retrieve pretrained word vectors for our vocabulary.
    The returned word array has row indices corresponding to the
    compact index of a word, and columns corresponding to the word
    vector.

    vocab : dict
        Dictionary where keys are the loose index, and values are the word string.

    use_spacy : bool
        Use SpaCy to load in word vectors. Otherwise Gensim.

    filename : str
        Filename for SpaCy-compatible word vectors or if use_spacy=False
        then uses word2vec vectors via gensim.

    data : numpy float array
        Array such that data[compact_index, :] = word_vector
    """



def compact_to_bow(self, word_compact, max_compact_index=None):  # turns a 2D index array into a count matrix
    """ Given a 2D array of compact indices, return the bag of words
    representation where the column is the word index, row is the document
    index, and the value is the number of times that word appears in that
    document.
    """

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class

vocab = {19: 'shuttle', 5: 'astronomy', 7: 'cold', 3: 'hot'}
word_indices = np.zeros(50).astype('int32')
word_indices[:25] = 19  # 'shuttle' appears 25 times
word_indices[25:35] = 5  # 'astronomy' appears 10 times
# positions 35-39 keep the initial value 0
word_indices[40:46] = 7  # 'cold' appears 6 times
word_indices[46:] = 3  # 'hot' appears 4 times
corpus.update_word_count(word_indices)
corpus.finalize()
v = corpus.compact_to_bow(word_indices)
print(v[:6])  # [ 5  0  0  4  0 10]
words = [[0, 0, 0, 3, 4], [1, 1, 1, 4, 5]]
words = np.array(words)
bow = corpus.compact_to_bow(words)
print(bow.shape)  # (2, 6)
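
The bag-of-words conversion can also be reproduced with `np.bincount`, which helps when checking the numbers by hand. `bag_of_words` is a hypothetical helper written for this note, not the library's implementation.

```python
import numpy as np

def bag_of_words(word_compact, n_words=None):
    """Count occurrences of each compact index per document (hypothetical helper)."""
    word_compact = np.atleast_2d(word_compact)
    if n_words is None:
        n_words = int(word_compact.max()) + 1
    # One bincount per document (row); minlength pads every row to n_words columns.
    return np.stack([np.bincount(row, minlength=n_words) for row in word_compact])

words = np.array([[0, 0, 0, 3, 4], [1, 1, 1, 4, 5]])
bow = bag_of_words(words)
print(bow.shape)  # (2, 6)
print(bow)
# [[3 0 0 1 1 0]
#  [0 3 0 0 1 1]]
```

Row 0 says that word 0 appears three times and words 3 and 4 once each in the first document, which matches the (2, 6) shape printed above.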


def compact_to_coocurrence(self, word_compact, indices, window_size=10):
    """ From an array of compact tokens and aligned array of document indices
    compute (word, word, document) co-occurrences within a moving window.

    word_compact: int array
        Sequence of tokens.

    indices: dict of int arrays
        Each array in this dictionary should represent the document index it
        came from.

    window_size: int
        Indicates the moving window size around which all co-occurrences will
        be computed.

    counts : DataFrame
        Returns a DataFrame with two columns for word index A and B,
        one extra column for each document index, and a final column for counts
        in that key.
    """

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
corpus = corpus.Corpus()  # instantiate the Corpus class
compact = np.array([0, 1, 1, 1, 2, 2, 3, 0])
doc_idx = np.array([0, 0, 0, 0, 1, 1, 1, 1])
counts = corpus.compact_to_coocurrence(compact, {'doc': doc_idx})
counts.query('doc == 0').counts.values
compact = np.array([0, 1, 1, 1, 2, 2, 3, 0])
doc_idx = np.array([0, 0, 0, 1, 1, 2, 2, 2])
counts = corpus.compact_to_coocurrence(compact, {'doc': doc_idx})
counts.query('doc == 0').word_index_x.values
counts.query('doc == 0').word_index_y.values
counts.query('doc == 0').counts.values
counts.query('doc == 1').counts.values
print(counts.query('doc == 1').counts.values)  # [1 1]


def fast_replace(data, keys, values, skip_checks=False):
    """ Do a search-and-replace in array `data`.

    data : int array
        Array of integers
    keys : int array
        Array of keys inside of `data` to be replaced
    values : int array
        Array of values that replace the `keys` array
    skip_checks : bool, default=False
        Optionally skip sanity checking the input.
    """

import numpy as np
from lda2vec import corpus  # the corpus module of the lda2vec package
# corpus = corpus.Corpus()  # not needed: fast_replace is a module-level function, not a Corpus method
print(corpus.fast_replace(np.arange(5), np.arange(5), np.arange(5)[::-1]))  # [4 3 2 1 0]
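
A vectorized search-and-replace of this kind can be built from `np.argsort` and `np.searchsorted`. The sketch below is one way to do it. `replace_values` is a hypothetical helper, and the library's `fast_replace` may be implemented differently.

```python
import numpy as np

def replace_values(data, keys, values):
    """Replace each element of `data` equal to keys[i] with values[i].

    Hypothetical helper; every element of `data` must occur in `keys`.
    """
    data, keys, values = map(np.asarray, (data, keys, values))
    order = np.argsort(keys)
    sorted_keys, sorted_values = keys[order], values[order]
    # Binary-search the position of every data element among the sorted keys.
    positions = np.searchsorted(sorted_keys, data)
    positions = np.clip(positions, 0, len(sorted_keys) - 1)
    if not np.array_equal(sorted_keys[positions], data):
        raise ValueError("data contains elements that are not in keys")
    return sorted_values[positions]

print(replace_values(np.arange(5), np.arange(5), np.arange(5)[::-1]))  # [4 3 2 1 0]
print(replace_values([7, 3, 3, 9], [3, 7, 9], [30, 70, 90]))           # [70 30 30 90]
```

Sorting the keys once and looking every element up by binary search avoids a Python-level loop over `data`, which matters for arrays with millions of tokens.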




from spacy.lang.en import English
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL
import numpy as np


# tokenize: uses spaCy to quickly tokenize text and return an array of indices
def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None, **kwargs):
    """
    texts : list of unicode strings
        These are the input documents. There can be multiple sentences per
        item in the list.
    max_length : int
        This is the maximum number of words per document. If the document is
        shorter than this number it will be padded to this length.
    skip : int, optional
        Short documents will be padded with this variable up until max_length.
    attr : int, from spacy.attrs
        What to transform the token to. Choice must be in spacy.attrs, and
        common choices are (LOWER, LEMMA)
    merge : int, optional
        Merge noun phrases into a single token. Useful for turning 'New York'
        into a single token.
    nlp : None
        A spaCy NLP object. Useful for not reinstantiating the object multiple
        times.
    kwargs : dict, optional
        Any further argument will be sent to the spaCy tokenizer. For extra
        speed consider setting tag=False, parse=False, entity=False, or

    arr : 2D array of ints
        Has shape (len(texts), max_length). Each value represents
        the word index.
    vocab : dict
        Keys are the word index, and values are the string. The pad index gets
        mapped to None

    >>> sents = [u"Do you recall a class action lawsuit", u"hello zombo.com"]
    >>> arr, vocab = tokenize(sents, 10, merge=True)
    >>> arr.shape[0]
    >>> arr.shape[1]
    >>> w2i = {w: i for i, w in vocab.items()}
    >>> arr[0, 0] == w2i[u'do']  # First word and its index should match
    >>> arr[0, 1] == w2i[u'you']
    >>> arr[0, -1]  # last word in 0th document is a pad word
    >>> arr[0, 4] == w2i[u'class action lawsuit']  # noun phrase is tokenized
    >>> arr[1, 1]  # The URL token is thrown out
    """



from spacy.lang.en import English
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL
import numpy as np
from lda2vec import preprocess
import spacy
nlp = spacy.load("en_core_web_sm")

# Main function: tokenize. Uses spaCy to quickly tokenize text and return an array of indices
def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None, **kwargs):
    if nlp is None:
        nlp = English()  # spaCy's English pipeline (models for German, French, etc. also exist, see https://spacy.io/models/en)
    data = np.zeros((len(texts), max_length), dtype='int32')  # zero matrix with len(texts) rows and max_length columns
    data[:] = skip  # fill the matrix with skip (-2) instead of 0
    bad_deps = ('amod', 'compound')  # dependency labels: amod = adjectival modifier, compound = compound word

    # enumerate() yields (index, item) pairs, so `row` is the document number.
    # Each `doc` comes from nlp.pipe; re-running nlp(texts[1]) here would make every row a copy of document 1.
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:  # merge noun phrases into single tokens (useful for turning 'New York' into one token)
            # Note: Span.merge() was removed in spaCy 3, so this branch needs an older spaCy.
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    # (the third argument is an assumption; the call is cut off in my copy)
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)

        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            # msg = "Negative indices reserved for special tokens"
            # assert dat.min() >= 0, msg
            # Replace email and URL tokens
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab

if __name__ == "__main__":
    import doctest
    doctest.testmod()



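  • Take examples-hacker_news as an example. As far as I can tell, data-preprocess.py should be run first (this script also contains the code that downloads the data). It preprocesses the data and saves its outputs when done (see the figure below):


  • Then run examples\hacker_news\lda2vec-lda2vec_run.py to actually train the model. (This program reads the preprocessing outputs and generates lda2vec.hdf5.)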



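  • Use "print" sparingly

When debugging, it is common to add many print calls to watch the output of every step. On large batches of data, however, print can make a run dozens of times slower and seriously hurts throughput.

  • Progress bar

import time, sys
for i in range(1, 101):
    percent = i / 100
    sys.stdout.write("\r{0}{1}".format("|" * i, '%.2f%%' % (percent * 100)))
    sys.stdout.flush()
    time.sleep(0.05)  # stand-in for real work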
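
Both tips can be combined by reporting progress only every N items instead of on every iteration. A minimal sketch follows. The function `process` and its `report_every` parameter are hypothetical names chosen for this note.

```python
import sys

def process(items, report_every=1000, out=sys.stdout):
    """Process items, writing a progress line only every `report_every` items."""
    total = len(items)
    checksum = 0
    for i, item in enumerate(items, start=1):
        checksum += item                       # stand-in for the real work
        if i % report_every == 0 or i == total:
            out.write("\rprocessed {0}/{1} ({2:.1%})".format(i, total, i / total))
            out.flush()
    out.write("\n")
    return checksum

result = process(list(range(10000)), report_every=2500)
print(result)
```

With 10000 items and `report_every=2500` the loop writes four progress updates instead of ten thousand, so the output cost stays negligible next to the actual work.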



# Author: Chris Moody
# License: MIT

# This example loads a large 800MB Hacker News comments dataset
# and preprocesses it. This can take a few hours, and a lot of
# memory, so please be patient!

from lda2vec import preprocess, Corpus
import numpy as np
import pandas as pd
import logging
import pickle  # cPickle exists only in Python 2; Python 3's pickle uses the C implementation automatically
import os.path


max_length = 250   # Limit of 250 words per comment
min_author_comments = 50  # Exclude authors with fewer comments
nrows = None  # Number of rows of file to read; None reads in full file

fn = "hacker_news_comments.csv"
url = "https://zenodo.org/record/45901/files/hacker_news_comments.csv"
if not os.path.exists(fn):
    import requests
    response = requests.get(url, stream=True, timeout=2400)
    with open(fn, 'wb') as fh:
        # Iterate over 1MB chunks
        for data in response.iter_content(1024**2):
            fh.write(data)

features = []
# Convert to unicode (spaCy only works with unicode)
features = pd.read_csv(fn, encoding='utf8', nrows=nrows)
# Convert all integer arrays to int32
for col, dtype in zip(features.columns, features.dtypes):
    if dtype == np.dtype('int64'):
        features[col] = features[col].astype('int32')

# Tokenize the texts
# If this fails it's likely spacy. Install a recent spacy version.
# Only the most recent versions have tokenization of noun phrases
# I'm using SHA dfd1a1d3a24b4ef5904975268c1bbb13ae1a32ff
# Also try running python -m spacy.en.download all --force
texts = features.pop('comment_text').values
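tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4)
del texts

# Make a ranked list of rare vs frequent words
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()  # must be called before to_compact and the other array operations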

# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=10)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
print("n_words", np.unique(clean).max())

# Extract numpy arrays over the fields we want covered by topics
# Convert to categorical variables
author_counts = features['comment_author'].value_counts()
to_remove = author_counts[author_counts < min_author_comments].index
mask = features['comment_author'].isin(to_remove).values
author_name = features['comment_author'].values.copy()
author_name[mask] = 'infrequent_author'
features['comment_author'] = author_name
authors = pd.Categorical(features['comment_author'])
author_id = authors.codes
author_name = authors.categories
story_id = pd.Categorical(features['story_id']).codes
# Chop timestamps into days
story_time = pd.to_datetime(features['story_time'], unit='s')
days_since = (story_time - story_time.min()) / pd.Timedelta('1 day')
time_id = days_since.astype('int32')
features['story_id_codes'] = story_id
features['author_id_codes'] = author_id
features['time_id_codes'] = time_id

print("n_authors", author_id.max())
print("n_stories", story_id.max())
print("n_times", time_id.max())

# Extract outcome supervised features
ranking = features['comment_ranking'].values
score = features['story_comment_count'].values

# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
feature_arrs = (story_id, author_id, time_id, ranking, score)
flattened, features_flat = corpus.compact_to_flat(pruned, *feature_arrs)
# Flattened feature arrays
(story_id_f, author_id_f, time_id_f, ranking_f, score_f) = features_flat

# Save the data
# Binary mode ('wb') is required for pickle and np.save under Python 3
pickle.dump(corpus, open('corpus', 'wb'), protocol=2)
pickle.dump(vocab, open('vocab', 'wb'), protocol=2)
data = dict(flattened=flattened, story_id=story_id_f, author_id=author_id_f,
            time_id=time_id_f, ranking=ranking_f, score=score_f,
            author_name=author_name, author_index=author_id)
np.savez('data', **data)
np.save(open('tokens', 'wb'), tokens)


