Parameters and Usage of fastText, a Tool for Text Classification and Word-Vector Training

fastText, open-sourced by Facebook and based on the ideas in the fastText papers, is mainly used for two tasks: training word vectors and text classification.

Download and documentation: the official fastText website (https://fasttext.cc).

Main features of fastText:

Training Supervised Classifier [supervised] Supervised classifier training for text classification. This is fastText's main line of business.

Training SkipGram Model [skipgram] Learning word representations/word vectors using the skip-gram technique.

Quantization [quantize] Quantization is applied to a trained model to reduce memory usage during prediction, shrinking the model size.

Predictions [predict] Predicting labels for a given text: text classification (see the example commands after this list).

Predictions with Probabilities [predict-prob] Predicting probabilities in addition to labels for a given text.

Training of CBOW model [cbow] Learning word representations/word vectors using the CBOW (Continuous Bag Of Words) technique.

Print Word Vectors [print-word-vectors] Printing word vectors for a trained model, one word vector per line.

Print Sentence Vectors [print-sentence-vectors] Printing sentence vectors for a trained model, one vector per line of text; every text's vector has the same length and represents the aggregate features of all its words.

Query Nearest Neighbors [nn] Finding the nearest neighbors of a given word.

Query for Analogies [analogies] Finding analogy words of the form A - B + C, e.g. Berlin - Germany + France = Paris.
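
As a quick illustration of the classification commands above, a typical workflow looks like this (train.txt, test.txt and xxxclf are placeholder names; the training-file format is described in the Python section below):

fasttext supervised -input train.txt -output xxxclf
fasttext predict xxxclf.bin test.txt
fasttext predict-prob xxxclf.bin test.txt 2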

Using fastText from the command line:

1. Train word vectors (word2vec-style) on your own corpus

fasttext skipgram -input xxxcorpus -output xxxmodel

Training produces two files, xxxmodel.bin and xxxmodel.vec: the binary model file and the word vectors in text format, respectively.

The subcommand can be skipgram or cbow, corresponding to the skip-gram and CBOW models respectively.
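
For example, the CBOW variant uses the same flags:

fasttext cbow -input xxxcorpus -output xxxmodel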

2. Query a word's nearest neighbors with a trained model

fasttext nn xxxmodel.bin

Type a word at the Query word? prompt to get its nearest neighbors.
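
The analogies query works the same way: run fasttext analogies xxxmodel.bin and enter three words (e.g. berlin germany france) at the prompt to get the words closest to A - B + C.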

3. Some other flags (a combined example follows the list):

-minn and -maxn: length range of subwords (character n-grams); defaults are 3 and 6
-epoch and -lr: number of epochs and learning rate; defaults are 5 and 0.05
-dim: dimensionality of the word vectors; larger tends to be more expressive, but uses more memory and slows down computation.
-thread: number of threads to run; self-explanatory.
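
Putting these flags together, a skip-gram run might look like the following (the values are purely illustrative, not tuned recommendations):

fasttext skipgram -input xxxcorpus -output xxxmodel -minn 2 -maxn 5 -dim 300 -epoch 10 -lr 0.05 -thread 8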

Using the Python module:

The parameters have essentially the same meanings and functions; usage is as follows.

An example:

import fasttext

def train_text_classifier(train_fname, epoch, lr, save_model_fname, thr):
    """
    Train a supervised text classifier and save the model.
    """
    dim = 500             # size of word vectors [100]
    ws = 5                # size of the context window [5]
    minCount = 500        # minimal number of word occurrences [1]
    minCountLabel = 1     # minimal number of label occurrences [1]
    minn = 1              # min length of char ngram [0]
    maxn = 2              # max length of char ngram [0]
    neg = 5               # number of negatives sampled [5]
    wordNgrams = 2        # max length of word ngram [1]
    loss = 'softmax'      # loss function {ns, hs, softmax, ova} [softmax]
    lrUpdateRate = 100    # change the rate of updates for the learning rate [100]
    t = 0.0001            # sampling threshold [0.0001]
    label = '__label__'   # label prefix ['__label__']

    model = fasttext.train_supervised(train_fname, lr=lr, epoch=epoch, dim=dim, ws=ws,
                                      minCount=minCount, minCountLabel=minCountLabel,
                                      minn=minn, maxn=maxn, neg=neg,
                                      wordNgrams=wordNgrams, loss=loss,
                                      lrUpdateRate=lrUpdateRate, thread=thr,
                                      t=t, label=label, verbose=2)
    model.save_model(save_model_fname)

    return model

if __name__ == "__main__":
    # example parameter settings (placeholder file names and values)
    train_fname = 'train.txt'
    test_fname = 'test.txt'
    save_model_fname = 'clf.bin'
    epoch, lr, thr = 5, 0.1, 4
    model = train_text_classifier(train_fname, epoch, lr, save_model_fname, thr)
    print(model.get_nearest_neighbors('word'))  # nearest neighbors of a word
    print(model.predict('some sentence'))       # predicted label(s) for a sentence
    print(model.test(test_fname))               # (N, precision@1, recall@1); for single-label data both equal accuracy
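
Note that the training file for train_supervised must contain one sample per line, with each label marked by the label prefix (default __label__) followed by the text, for example:

__label__positive this movie was great
__label__negative boring plot and wooden acting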

The APIs for unsupervised word-vector learning and supervised text-classification training are listed below; a minimal usage sketch follows each parameter list.

train_unsupervised parameters

input # training file path (required)
model # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr # learning rate [0.05]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [5]
minn # min length of char ngram [3]
maxn # max length of char ngram [6]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [ns]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
verbose # verbose [2]
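
A minimal train_unsupervised sketch (corpus.txt and vectors.bin are placeholder paths; the keyword values shown match the defaults above):

import fasttext

# train skip-gram word vectors on a plain-text corpus
model = fasttext.train_unsupervised('corpus.txt', model='skipgram',
                                    dim=100, minn=3, maxn=6, epoch=5, lr=0.05)
print(model.get_word_vector('hello'))        # 100-dim vector for one word
print(model.get_nearest_neighbors('hello'))  # list of (similarity, word) pairs
model.save_model('vectors.bin')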

train_supervised parameters

input # training file path (required)
lr # learning rate [0.1]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [1]
minCountLabel # minimal number of label occurrences [1]
minn # min length of char ngram [0]
maxn # max length of char ngram [0]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [softmax]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
label # label prefix ['__label__']
verbose # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
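
And a corresponding train_supervised sketch (train.txt and valid.txt are placeholder paths); predicting with k > 1 is the Python counterpart of the predict-prob command:

import fasttext

# train a classifier on __label__-prefixed data, then inspect predictions
model = fasttext.train_supervised('train.txt', lr=0.1, epoch=5, wordNgrams=2)
labels, probs = model.predict('some input text', k=3)  # top-3 labels with probabilities
print(model.test('valid.txt'))                         # (N, precision@1, recall@1)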
