StarSpace Series, Part 1: TagSpace

Problem Type

TagSpace: word and tag embeddings
Use case: learn a mapping from a short text to relevant hashtags, e.g., as described in this paper. This is a typical classification application.

Model: the mapping from a set of words to a set of tags is learned by embedding both. For example, the input "restaurant has great food <\tab> #restaurant <\tab> #yum" is translated into the graph below. (The nodes of the graph are the entities whose embeddings are learned; the edges are the relations between the entities.)

[Figure: words and hashtags as embedded entities (nodes) connected by their relations (edges)]
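A tiny Python sketch (illustrative only; the parse_example helper is made up and not part of StarSpace) of how such an input line splits into the two kinds of entities, words on the left-hand side and hashtags on the right-hand side:

# Split a TagSpace-style input line into word entities and tag entities.
# The words form the left-hand side (the document), the hashtags the
# right-hand side (the labels to be ranked against it).
def parse_example(line, label_prefix="#"):
    tokens = line.replace("\t", " ").split()
    words = [t for t in tokens if not t.startswith(label_prefix)]
    tags = [t for t in tokens if t.startswith(label_prefix)]
    return words, tags

words, tags = parse_example("restaurant has great food\t#restaurant\t#yum")
print(words)  # ['restaurant', 'has', 'great', 'food']
print(tags)   # ['#restaurant', '#yum']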

Training Data


The AG’s news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
News data, 4 major classes, 120,000 training articles:
World
Sports
Business
Sci/Tech

Data Sample

The file classes.txt contains a list of classes corresponding to each label.

__label__2 , garca winds up best in tough going , given what sergio garca has achieved in his career already it is difficult to believe he is only 24 years old . he had a 67 yesterday , four under , to share the volvo masters lead with his fellow spaniard
__label__3 , us shares take a tumble on oil prices , new york , nov 23 ( afp ) - wall street shares slid on tuesday as oil prices surged higher and investors sensed weaknesses in the technology sector .
__label__4 , product review blackberry 7100t smartphone ( newsfactor ) , newsfactor - research in motion ' s ( nasdaq rimm ) quad-band \blackberry 7100t with \pda capabilities is a gsm/gprs ( 850/900/1800/1900 mhz ) cellular handset that can make and receive phone calls in more than 100 countries around the world .
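For orientation, here is a rough Python sketch of turning the raw AG news CSV ("class","title","description") into the __label__ format shown above. The file names are placeholders and the real example script does its own preprocessing (including punctuation normalization), so treat this only as an illustration of the target format:

import csv

# Hypothetical conversion: raw AG news CSV -> fastText-style lines
# "__label__<class> , <lowercased text>". File names are placeholders.
with open("train.csv", newline="") as src, open("ag_news.train", "w") as dst:
    for cls, title, desc in csv.reader(src):
        text = f"{title} {desc}".lower()  # punctuation normalization omitted
        dst.write(f"__label__{cls} , {text}\n")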

Training

./classification_ag_news.sh
Downloading dataset ag_news
Compiling StarSpace
make: *** No targets specified and no makefile found.  Stop.
Start to train on ag_news data:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /tmp/starspace/data/ag_news.train
Read 5M words
Number of words in dictionary:  95811
Number of labels in dictionary: 4
Loading data from file : /tmp/starspace/data/ag_news.train
Total number of examples loaded : 120000
Initialized model weights. Model size :
matrix : 95815 10
Training epoch 0: 0.01 0.002
Epoch: 100.0%  lr: 0.008017  loss: 0.036824  eta: <1min   tot: 0h0m0s  (20.0%)
 ---+++                Epoch    0 Train error : 0.03529871 +++--- ☃
Training epoch 1: 0.008 0.002
Epoch: 100.0%  lr: 0.006033  loss: 0.018947  eta: <1min   tot: 0h0m1s  (40.0%)
 ---+++                Epoch    1 Train error : 0.01904551 +++--- ☃
Training epoch 2: 0.006 0.002
Epoch: 100.0%  lr: 0.004000  loss: 0.015143  eta: <1min   tot: 0h0m1s  (60.0%)
 ---+++                Epoch    2 Train error : 0.01569214 +++--- ☃
Training epoch 3: 0.004 0.002
Epoch: 100.0%  lr: 0.002000  loss: 0.014580  eta: <1min   tot: 0h0m2s  (80.0%)
 ---+++                Epoch    3 Train error : 0.01361201 +++--- ☃
Training epoch 4: 0.002 0.002
Epoch: 100.0%  lr: -0.000000  loss: 0.011692  eta: <1min   tot: 0h0m2s  (100.0%)
 ---+++                Epoch    4 Train error : 0.01211345 +++--- ☃
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
Start to evaluate trained model:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 50
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to load a trained starspace model.
STARSPACE-2018-2
Initialized model weights. Model size :
matrix : 95815 10
Model loaded.
Loading data from file : /tmp/starspace/data/ag_news.test
Total number of examples loaded : 7600
------Loaded model args:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Predictions use 4 known labels.
Evaluation Metrics :
hit@1: 0.917105 hit@10: 1 hit@20: 1 hit@50: 1 mean ranks : 1.10263 Total examples : 7600
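hit@k is the fraction of test examples whose true label appears among the top k labels ranked by the model, and mean rank is the average position of the true label in that ranking. A small illustrative Python sketch of these metrics (the example data below is made up):

# Illustrative computation of hit@k and mean rank from ranked predictions.
# `ranked_labels` holds, for each test example, labels sorted by model score;
# `gold` holds the true label for that example.
def evaluate(ranked_labels, gold, ks=(1, 10, 20, 50)):
    ranks = [preds.index(g) + 1 for preds, g in zip(ranked_labels, gold)]
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    mean_rank = sum(ranks) / len(ranks)
    return hits, mean_rank

hits, mean_rank = evaluate(
    [["__label__2", "__label__1", "__label__3", "__label__4"],
     ["__label__4", "__label__3", "__label__1", "__label__2"]],
    ["__label__2", "__label__3"],
)
print(hits[1], mean_rank)  # 0.5 1.5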

The core script is:

../starspace train \
  -trainFile "${DATADIR}"/ag_news.train \
  -model "${MODELDIR}"/ag_news \
  -initRandSd 0.01 \
  -adagrad false \
  -ngrams 1 \
  -lr 0.01 \
  -epoch 5 \
  -thread 20 \
  -dim 10 \
  -negSearchLimit 5 \
  -trainMode 0 \
  -label "__label__" \
  -similarity "dot" \
  -verbose true
  

Analysis

Training produces the model:
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
models$ wc -l ag_news.tsv

95815 ag_news.tsv
This file is just the word vectors and the label vectors (95,811 words + 4 labels = 95,815 rows):

race-day	-0.00570984	0.00158739	0.0122618	0.0040543	0.0113636	0.0171366	-0.000411966	-0.00239621	0.00648128	0.00166139
807km	-0.0186726	0.00619443	-0.000953957	-0.00184454	-0.00456583	0.00638993	0.00550364	-0.00182587	-0.00843166	0.0182373
__label__2	0.277595	0.276433	-0.137682	0.20364	-0.129544	-0.21292	0.284562	-0.127947	0.18984	0.0745169
__label__4	0.00365386	-0.15706	-0.0372935	-0.0641772	0.0239136	0.0957492	-0.175419	0.372076	-0.166987	0.168438
__label__3	-0.0821545	-0.0460339	0.0548889	-0.274909	0.325552	-0.0362561	-0.0883801	-0.110232	-0.0197069	-0.107682
__label__1	-0.22841	-0.0855479	0.102414	0.166055	-0.27244	0.14407	-0.0423679	-0.149521	-0.00979879	-0.135866
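With trainMode 0 and dot similarity, prediction amounts to building a document vector from its word vectors and ranking the label vectors by dot product against it. A simplified Python sketch using the exported TSV above (it simply averages the word vectors; StarSpace's own normalization of the summed embeddings differs slightly, so this only approximates the model's scoring):

# Re-score a document with the exported embeddings: average its word
# vectors and pick the label vector with the highest dot product.
import numpy as np

vecs = {}
with open("/tmp/starspace/models/ag_news.tsv") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        vecs[parts[0]] = np.array([float(x) for x in parts[1:]])

def predict(text):
    words = [w for w in text.lower().split() if w in vecs]
    doc = np.mean([vecs[w] for w in words], axis=0)
    labels = [k for k in vecs if k.startswith("__label__")]
    return max(labels, key=lambda l: float(doc @ vecs[l]))

print(predict("wall street shares slid as oil prices surged higher"))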

Training with other formats

Documents in other formats can also be used for training.
Input file format:
restaurant has great food #yum #restaurant
Command:
$ ./starspace train -trainFile input.txt -model tagspace -label '#'

Example script:
We apply the model to the text classification problem on the AG news topic classification dataset, where the labels are news article categories and classification accuracy is measured with hit@1. This example script downloads the data and runs the StarSpace model from the examples directory:

$bash examples/classification_ag_news.sh

Summary

  1. Well suited for tagging text, where the tag set can be much larger than a classification taxonomy. For a plain classification problem you can use the original fastText directly, though this works as well.
  2. And if you want to embed class labels or hashtags together with word vectors in a single feature space, this is exactly the right tool.

If you want to understand the model in more depth

Please continue with the references below; here I only summarize the core ideas.

https://www.aclweb.org/anthology/D14-1194
Let's briefly analyze this paper, which was published in 2014.

  1. Paper title
    TAGSPACE: Semantic Embeddings from Hashtags
  2. Paper idea
    Use a CNN to build the document vector, then optimize a ranking objective over the scores f(w, t+) for correct tags and f(w, t-) for sampled negative tags. This yields vector representations of tags t and documents in the same feature space, so a document's hashtags can be found by ranking tags against it. For the WARP loss, see the last reference; a small sketch of the ranking loss follows the figure below.

All the essentials are captured in one figure:
[Figure: the TAGSPACE model architecture, from the paper]
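The training run above used loss: hinge with margin 0.05: the objective pushes the score of a correct tag f(w, t+) above the score of a sampled negative tag f(w, t-) by at least the margin. (The WARP variant discussed in the last reference additionally weights violations by the estimated rank.) A minimal Python sketch of this pairwise ranking loss, with made-up vectors:

# Sketch of the pairwise ranking (hinge) objective behind TagSpace:
# require the correct tag to score at least `margin` above each sampled
# negative tag; violations contribute to the loss.
import numpy as np

def hinge_rank_loss(doc_vec, pos_tag_vec, neg_tag_vecs, margin=0.05):
    f_pos = doc_vec @ pos_tag_vec
    # sum of margin violations over the sampled negative tags
    return sum(max(0.0, margin - f_pos + doc_vec @ neg) for neg in neg_tag_vecs)

rng = np.random.default_rng(0)
doc, pos = rng.normal(size=10), rng.normal(size=10)
negs = [rng.normal(size=10) for _ in range(5)]
print(hinge_rank_loss(doc, pos, negs))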

References

  1. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
  2. Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
  3. https://www.aclweb.org/anthology/D14-1194
  4. https://medium.com/@gabrieltseng/intro-to-warp-loss-automatic-differentiation-and-pytorch-b6aa5083187a