Pretrained Word Vectors: Notable Datasets

English Corpora

Google word2vec

Word vectors pretrained on the Google News corpus (about 100 billion words): 300-dimensional vectors for roughly 3 million words and phrases. Implementation paper

download link | source link
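The GoogleNews file is distributed in word2vec's binary format: a text header of "vocab_size dim", then, for each entry, the word, a single space, and dim little-endian float32 values. As a minimal stdlib-only sketch of that layout, the code below round-trips a fabricated two-word, 3-dimensional vocabulary (the words and values are invented for illustration):

```python
import io
import struct

def write_word2vec_bin(entries, dim):
    """Serialize {word: [floats]} in word2vec binary format."""
    buf = io.BytesIO()
    buf.write(f"{len(entries)} {dim}\n".encode("utf-8"))
    for word, vec in entries.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dim}f", *vec))  # little-endian float32
    return buf.getvalue()

def read_word2vec_bin(data):
    """Parse word2vec binary format back into {word: [floats]}."""
    buf = io.BytesIO(data)
    vocab_size, dim = map(int, buf.readline().decode("utf-8").split())
    vectors = {}
    for _ in range(vocab_size):
        # The word runs up to the first space byte.
        word_bytes = bytearray()
        while (ch := buf.read(1)) != b" ":
            word_bytes += ch
        word = word_bytes.decode("utf-8")
        vectors[word] = list(struct.unpack(f"<{dim}f", buf.read(4 * dim)))
    return vectors

# Round-trip a toy 3-dimensional vocabulary.
toy = {"king": [0.5, 0.1, -0.3], "queen": [0.4, 0.2, -0.25]}
blob = write_word2vec_bin(toy, 3)
parsed = read_word2vec_bin(blob)
print(sorted(parsed))  # ['king', 'queen']
```

Note that values only round-trip to float32 precision, which is also all the released file stores.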

Facebook fastText

1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

2 million word vectors trained on Common Crawl (600B tokens).

download link | source link
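The .vec downloads above are plain text: a header line "vocab_size dim", then one word per line followed by its dim space-separated values. A minimal loader sketch (the tiny vocabulary here is made up):

```python
def load_vec(lines):
    """Parse fastText .vec text format into {word: [float]}."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())  # header line
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim
        vectors[word] = values
    return vectors

# Toy file contents standing in for a real .vec download.
toy_file = [
    "2 4\n",
    "hello 0.1 0.2 0.3 0.4\n",
    "world -0.1 0.0 0.2 0.5\n",
]
vecs = load_vec(toy_file)
print(len(vecs), len(vecs["hello"]))  # 2 4
```

For real use, pass an open file handle (with UTF-8 encoding) instead of the list of strings.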

Stanford GloVe

Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)

download link | source link

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)

download link | source link

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

download link | source link

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)

download link | source link
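Unlike fastText's .vec files, GloVe's text files carry no header line: each line is just a word followed by its values. A sketch of loading one and ranking neighbors by cosine similarity (the three toy vectors are invented for illustration):

```python
import math

def load_glove(lines):
    """Parse GloVe text format (no header line) into {word: [float]}."""
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(x) for x in values]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

toy = load_glove([
    "cat 0.9 0.1 0.0\n",
    "dog 0.8 0.2 0.1\n",
    "car -0.5 0.9 0.3\n",
])

# Nearest neighbor of "cat" among the other words.
best = max((w for w in toy if w != "cat"), key=lambda w: cosine(toy["cat"], toy[w]))
print(best)  # dog
```

The same loop works on the real 400K-word files, though for anything beyond a quick lookup a vectorized library is the more practical choice.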

Chinese Corpora

word2vec

Wikipedia database; vector size 300, corpus size 1 GB, vocabulary size 50,101; Jieba tokenizer

download link | source link

fastText

Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The Stanford word segmenter was used for tokenization.

download link | source link
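The character n-grams mentioned above are extracted after fastText wraps each word in the boundary markers '<' and '>', so prefixes and suffixes get distinct subwords. A minimal sketch of that extraction, fixing n = 5 as in the model description (the sample word is arbitrary):

```python
def char_ngrams(word, n=5):
    """fastText-style character n-grams: the word is wrapped in the
    boundary markers '<' and '>' before n-grams are extracted."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

grams = char_ngrams("where")
print(grams)  # ['<wher', 'where', 'here>']
```

At training time each word's vector is the sum of its n-gram vectors, which is what lets fastText produce embeddings for out-of-vocabulary words.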
Appendix, processing method:
https://github.com/Kyubyong/wordvectors
