tensorflow: how to use tf.contrib.learn.preprocessing.VocabularyProcessor

Original post

2019-03-29 14:18

VocabularyProcessor builds a vocabulary from a corpus and converts Chinese text sequences into sequences of word IDs.
Function
tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

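Parameters:
max_document_length: the maximum document length. A text longer than this is truncated; a shorter one is padded with 0.
min_frequency: the minimum word frequency. Words that occur fewer times than this are not added to the vocabulary.
vocabulary: a CategoricalVocabulary object.
tokenizer_fn: the tokenizer function. When it is None, the library's default tokenizer is used.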
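To make the truncate/pad behaviour and the reserved ID 0 concrete, here is a minimal plain-Python sketch. It is not the library's implementation: build_vocab and to_ids are hypothetical helper names, tokens are assumed to be separated by spaces, IDs are assumed to be handed out in order of first appearance (which matches the output shown later in this post), and min_frequency is ignored. Plain Python is used because tf.contrib only exists in TensorFlow 1.x.

```python
def build_vocab(tokenized_docs):
    """Assign IDs from 1 upward in order of first appearance; 0 stays reserved."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_ids(doc, vocab, max_document_length):
    """Map tokens to IDs, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 0) for token in doc.split()]  # unknown word -> 0
    ids = ids[:max_document_length]                       # truncate long docs
    return ids + [0] * (max_document_length - len(ids))   # pad short docs


docs = ["a b a c", "b d"]
vocab = build_vocab(docs)
rows = [to_ids(doc, vocab, max_document_length=3) for doc in docs]

print(vocab)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
print(rows)   # [[1, 2, 1], [2, 4, 0]]
```

The first document has four tokens and is cut to three; the second has two and gets one trailing 0. A word that was never seen, such as "z" in to_ids("a z", vocab, 3), also maps to 0, so padding and unknown words cannot be told apart in the output.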

Usage example:

from tensorflow.contrib import learn
import tensorflow as tf
import numpy as np
import jieba

test_text =['盼望着，盼望着，東風來了，春天的腳步近了。',
            '一切都像剛睡醒的樣子，欣欣然張開了眼。山朗潤起來了，水漲起來了，太陽的臉紅起來了。',
            '桃樹、杏樹、梨樹，你不讓我，我不讓你，都開滿了花趕趟兒。']
# Segment each sentence into words with jieba and drop punctuation
def tokenizer(document):
    tokenizer_document = []
    for text in document:
        content = jieba.cut(text)
        stoplist = ['，', '。', '、']
        outstr = ""
        for word in content:
            if word not in stoplist:
                outstr+=word
                outstr+=" "
        tokenizer_document.append(outstr)
    return tokenizer_document
    
# The text after word segmentation
tokenizer_document = tokenizer(test_text)
# ['盼望着 盼望着 東風 來 了 春天 的 腳步 近 了 ', '一切 都 像 剛 睡醒 的 樣子 欣欣然 張開 了 眼 山朗潤 起來 了 水 漲起來 了 太陽 的 臉紅 起來 了 ', '桃樹 杏樹 梨樹 你 不讓 我 我 不讓 你 都 開滿 了 花 趕趟兒 ']

max_document_length = max([len(text.split(" ")) for text in tokenizer_document])

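# Create the vocabulary processor
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

# Build the vocabulary from the segmented text
vocab_processor.fit(tokenizer_document)

# Convert the text to word-ID sequences; ID 0 stands for unknown words and padding.
# The vocabulary was already fitted above, so transform() is enough here.
x = np.array(list(vocab_processor.transform(tokenizer_document)))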

print(x)
# [[ 1  1  2  3  4  5  6  7  8  4  0  0  0  0  0  0  0  0  0  0  0  0  0]
#  [ 9 10 11 12 13  6 14 15 16  4 17 18 19  4 20 21  4 22  6 23 19  4  0]
#  [24 25 26 27 28 29 29 28 27 10 30  4 31 32  0  0  0  0  0  0  0  0  0]]
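A side note on the output above: the longest sentence has 22 tokens, yet every row has 23 columns. The likely reason is that each segmented string ends with a space, so text.split(" ") in the max_document_length line returns one extra empty string. The snippet below is plain Python, with the first segmented sentence copied from the listing, and shows the difference. Switching that line to text.split() is only a suggestion, not something the original code does.

```python
# First segmented sentence from the example, including its trailing space.
text = '盼望着 盼望着 東風 來 了 春天 的 腳步 近 了 '

# split(" ") keeps the empty string that follows the trailing space.
with_empty = text.split(" ")
# split() with no argument ignores leading and trailing whitespace.
without_empty = text.split()

print(len(with_empty))     # 11
print(len(without_empty))  # 10
```

With split(), max_document_length would equal the real token count of the longest sentence, and the rows would lose the extra trailing 0.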
