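https://github.com/facebookresearch/fastText
Python versions
https://github.com/salestock/fastText.py
This is an unofficial binding and is no longer in use.
The official project now provides its own Python bindings:
https://github.com/facebookresearch/fastText/tree/master/python
The official version is the one in use today.
I kept getting errors at first because I had installed the official package
but was still calling the unofficial API.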
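fastText is Facebook's open-source tool for word vectors and text classification. Academically it offers little novelty; its strengths are a simple model and very fast training. A quick trial shows that it is easy to use and gives good results, good enough for production use.
In short, fastText turns every word in a document into a vector through a lookup table, averages those vectors, and feeds the average directly into a linear classifier to get the class. fastText is quite similar to the deep averaging network from ACL-15 (DAN, see the figure below); it is a simplified version with the intermediate hidden layers removed. The paper points out that for some simple classification tasks, comparable results can be obtained without an overly complex network structure.
fastText architecture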
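The pipeline described above (lookup table, average, linear classifier) can be sketched in a few lines of plain Python. This is only an illustration with assumed toy sizes and random weights; `fasttext_forward` and all variable names are hypothetical and are not part of the fastText API.

```python
import math
import random

def fasttext_forward(token_ids, embeddings, weights, bias):
    """Average the token embeddings, then apply a linear layer and softmax."""
    dim = len(embeddings[0])
    # Lookup table + average: one vector for the whole document.
    doc = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
           for d in range(dim)]
    # Linear classifier: one score per class.
    logits = [sum(w * x for w, x in zip(row, doc)) + b
              for row, b in zip(weights, bias)]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters (random, untrained) just to exercise the function.
rng = random.Random(0)
vocab_size, dim, num_classes = 100, 8, 5
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = fasttext_forward([3, 17, 42], embeddings, weights, bias)
print(probs)
```

Because the vectors are averaged, reordering the tokens gives the same probabilities, which is exactly why plain unigrams lose word-order information.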
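The fastText paper mentions two tricks:
- hierarchical softmax
  - When the number of classes is large, a Huffman coding tree is built to speed up the computation of the softmax layer. This is the same trick used earlier in word2vec.
- N-gram features
  - Using only unigrams discards word-order information, so N-gram features are added to compensate, and hashing is used to reduce the storage needed for the N-grams.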
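To make the hashing trick more concrete, here is a minimal sketch of mapping word bigrams to a fixed number of buckets. It uses `zlib.crc32` purely for illustration; this is not fastText's own hash function, and `hashed_ngram_ids` and `num_buckets` are hypothetical names.

```python
import zlib

def hashed_ngram_ids(tokens, n=2, num_buckets=2_000_000):
    """Map each word n-gram to a bucket id so no n-gram vocabulary is stored."""
    ids = []
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        # Illustrative hash only; any stable hash would do for this sketch.
        ids.append(zlib.crc32(ngram.encode("utf-8")) % num_buckets)
    return ids

a = hashed_ngram_ids(["china", "leads", "korea"])
b = hashed_ngram_ids(["korea", "leads", "china"])
print(a, b)
```

The two sentences contain the same unigrams but different bigrams, so their bucket ids will generally differ, which is how the N-gram features bring back some word-order information. Different N-grams may collide in the same bucket; that is the price paid for bounded memory.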
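fastText supervised learning (classification) example
The package that contains the fastText Python interface can be installed with pip install fasttext.
For text classification, fastText expects the text to be stored in the following form: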
__label__2 , birchas chaim , yeshiva birchas chaim is a orthodox jewish mesivta high school in lakewood township new jersey . it was founded by rabbi shmuel zalmen stein in 2001 after his father rabbi chaim stein asked him to open a branch of telshe yeshiva in lakewood . as of the 2009-10 school year the school had an enrollment of 76 students and 6 . 6 classroom teachers ( on a fte basis ) for a student–teacher ratio of 11 . 5 1 .
__label__6 , motor torpedo boat pt-41 , motor torpedo boat pt-41 was a pt-20-class motor torpedo boat of the united states navy built by the electric launch company of bayonne new jersey . the boat was laid down as motor boat submarine chaser ptc-21 but was reclassified as pt-41 prior to its launch on 8 july 1941 and was completed on 23 july 1941 .
__label__11 , passiflora picturata , passiflora picturata is a species of passion flower in the passifloraceae family .
__label__13 , naya din nai raat , naya din nai raat is a 1974 bollywood drama film directed by a . bhimsingh . the film is famous as sanjeev kumar reprised the nine-role epic performance by sivaji ganesan in navarathri ( 1964 ) which was also previously reprised by akkineni nageswara rao in navarathri ( telugu 1966 ) . this film had enhanced his status and reputation as an actor in hindi cinema .
The leading __label__ is a prefix, which you can also define yourself; whatever follows __label__ is the category.
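As a small illustration of this line format, the sketch below builds and parses such lines. Both helper names are hypothetical. If I recall the official Python module correctly, `train_supervised` also accepts a `label` argument for a custom prefix, but please check the documentation before relying on that.

```python
def to_fasttext_line(category, tokens, prefix="__label__"):
    """Build one training line: prefix + category, a separator, then the tokens."""
    return f"{prefix}{category} , {' '.join(tokens)}"

def parse_fasttext_line(line, prefix="__label__"):
    """Split a training line back into (category, text)."""
    label, _, text = line.partition(" ")
    if not label.startswith(prefix):
        raise ValueError("line does not start with the label prefix: " + line)
    # Drop the " , " separator used in the examples above.
    return label[len(prefix):], text.lstrip(", ").strip()

line = to_fasttext_line(5, ["中國隊", "領先"])
print(line)
print(parse_fasttext_line(line))
```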
We define our 5 categories as:
1:technology
2:car
3:entertainment
4:military
5:sports
Generating the text format
import jieba
import pandas as pd
import random
cate_dic = {'technology':1, 'car':2, 'entertainment':3, 'military':4, 'sports':5}
df_technology = pd.read_csv("./data/technology_news.csv", encoding='utf-8')
df_technology = df_technology.dropna()
df_car = pd.read_csv("./data/car_news.csv", encoding='utf-8')
df_car = df_car.dropna()
df_entertainment = pd.read_csv("./data/entertainment_news.csv", encoding='utf-8')
df_entertainment = df_entertainment.dropna()
df_military = pd.read_csv("./data/military_news.csv", encoding='utf-8')
df_military = df_military.dropna()
df_sports = pd.read_csv("./data/sports_news.csv", encoding='utf-8')
df_sports = df_sports.dropna()
technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[:20000]
military = df_military.content.values.tolist()[:20000]
sports = df_sports.content.values.tolist()[:20000]
stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
# A set makes the membership test during preprocessing fast
stopwords = set(stopwords['stopword'].values)
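def preprocess_text(content_lines, sentences, category):
    """Segment each line with jieba, filter the tokens, and append a fastText-formatted line to sentences."""
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            # Drop single-character tokens and stopwords
            segs = [s for s in segs if len(s) > 1 and s not in stopwords]
            sentences.append("__label__" + str(category) + " , " + " ".join(segs))
        except Exception:
            # Print the line that could not be processed and move on
            print(line)
            continue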
# Generate the training data
sentences = []
preprocess_text(technology, sentences, cate_dic['technology'])
preprocess_text(car, sentences, cate_dic['car'])
preprocess_text(entertainment, sentences, cate_dic['entertainment'])
preprocess_text(military, sentences, cate_dic['military'])
preprocess_text(sports, sentences, cate_dic['sports'])
random.shuffle(sentences)
print("writing data to fasttext format...")
# Write the processed documents of every category to the file, one per line
with open('train_data.txt', 'wb') as out:
    for sentence in sentences:
        out.write(sentence.encode('utf8') + b"\n")
print("done!")
Training, which is very fast
"""
Call fastText to train and produce a model
https://fasttext.cc/docs/en/python-module.html
In other words, there used to be no official Python bindings; now that there are, the two projects are being merged, so what we install today is the new one.
https://github.com/facebookresearch/fastText/tree/master/python
"""
import fasttext
model = fasttext.train_supervised('train_data.txt')
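Scores measured on the same file that was used for training tend to be optimistic, so a held-out split is a common companion to this training call. The sketch below is one simple way to do it in plain Python; `split_lines`, `valid_fraction`, and the demo data are hypothetical and not part of fastText.

```python
import random

def split_lines(lines, valid_fraction=0.1, seed=42):
    """Shuffle the lines and hold out a fraction of them for validation."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_fraction)
    return lines[n_valid:], lines[:n_valid]

# Demo with synthetic fastText-formatted lines
demo = [f"__label__{i % 5 + 1} , doc {i}" for i in range(100)]
train_lines, valid_lines = split_lines(demo)
print(len(train_lines), len(valid_lines))
```

The two lists could then be written to separate files (for example train_data.txt and a hypothetical valid_data.txt), training on the first and passing the second to model.test.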
"""
Evaluate the model
"""
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

# Note: this evaluates on the training file itself, so the scores are optimistic
print_results(*model.test('train_data.txt'))
"""
N 87584
P@1 0.974
R@1 0.974
"""
Saving the model
"""
Save the model
"""
model.save_model("model_filename.bin")
Loading the model
"""
Load the model
"""
m1 = fasttext.load_model("model_filename.bin")
Compressing the model
"""
Compress model files with quantization
When you want to save a supervised model file,
fastText can compress it in order to have a much smaller model file by sacrificing only a little bit of performance.
Compress the model so that it fits on smaller machines
"""
# with the reloaded copy `m1` of the previously trained model, call:
m1.quantize(input='train_data.txt', retrain=True)
# then display results and save the new model:
# print_results(*m1.test(valid_data))
# save the quantized object (m1), not the original uncompressed model
m1.save_model("model_filename.ftz")
Prediction
"""
Actual prediction
"""
label_to_cate = {'__label__1':'technology', '__label__2':'car','__label__3':'entertainment', '__label__4':'military', '__label__5':'sports'}
texts = ['中新網 日電 2018 預賽 亞洲區 強賽 中國隊 韓國隊 較量 比賽 上半場 分鐘 主場 作戰 中國隊 率先 打破 場上 僵局 利用 角球 機會 大寶 前點 攻門 得手 中國隊 領先']
labels = model.predict(texts)
print(labels)
#([['__label__5']], array([[0.99998939]]))
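To turn the raw prediction into readable category names, the label strings can be looked up in label_to_cate. A minimal sketch follows, assuming the prediction has the (labels, probabilities) shape shown in the comment above; `decode_predictions` is a hypothetical helper, and the sample tuple is a hand-written stand-in for real model output.

```python
def decode_predictions(prediction, label_to_cate):
    """Map the top predicted label of each text to (category name, probability)."""
    labels_per_text, probs_per_text = prediction
    results = []
    for labels, probs in zip(labels_per_text, probs_per_text):
        top_label = labels[0]
        # Fall back to the raw label if it is missing from the mapping
        results.append((label_to_cate.get(top_label, top_label), float(probs[0])))
    return results

label_to_cate = {'__label__1': 'technology', '__label__2': 'car',
                 '__label__3': 'entertainment', '__label__4': 'military',
                 '__label__5': 'sports'}
# Stand-in for the output of model.predict(texts)
sample = ([['__label__5']], [[0.99998939]])
decoded = decode_predictions(sample, label_to_cate)
print(decoded)  # [('sports', 0.99998939)]
```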