Downloading and Using the Google Books Ngram Viewer Dataset

Original post

2020-02-20 23:25

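I have recently been working on a word2vec project that needs the dataset behind the Google Books Ngram Viewer. When I opened the download page, though, I found that because the data is so large, Google has split it into many separate files. Downloading it by hand means opening the links one at a time, which is tedious.

Later I stumbled on a Python package that can access the Google Books Ngram Viewer data directly.

First, install the package: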

pip install google-ngram-downloader

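The package can be used in two ways:

Command-line tool

The package provides a command-line tool, google-ngram-downloader, which can be used to download the Google Books Ngram Viewer dataset. Its usage is as follows: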

google-ngram-downloader help
usage: google-ngram-downloader <command> [options]

commands:

 cooccurrence  Write the cooccurrence frequencies of a word and its contexts.
 download      Download The Google Books Ngram Viewer dataset version 20120701.
 help          Show help for a given help topic or a help overview.
 readline      Print the raw content.

For example:

google-ngram-downloader download -n 5  # download the 5-gram dataset
google-ngram-downloader download -h    # show help, which covers the download path and the language
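The readline command prints the raw content of the dataset files. As a minimal sketch, assuming each raw line follows the 20120701 layout of ngram, year, match_count and volume_count separated by tabs, one line can be turned into a structured record like this. The Record type and parse_line function here are hypothetical local helpers written for illustration, not part of the package.

```python
from collections import namedtuple

# Hypothetical local stand-in with the four fields of one dataset line.
Record = namedtuple("Record", "ngram year match_count volume_count")


def parse_line(line):
    """Split one raw tab-separated line into a Record (assumed layout)."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Record(ngram, int(year), int(match_count), int(volume_count))


# Sample line built from the record shown later in this post.
sample = '0 " A most useful\t1860\t1\t1\n'
record = parse_line(sample)
print(record)
```

The ngram itself contains spaces, so splitting on tabs rather than on whitespace keeps it in one piece.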

As a Python API

In Python code, the data can be obtained like this:

>>> from google_ngram_downloader import readline_google_store
>>>
>>> fname, url, records = next(readline_google_store(ngram_len=5))
>>> fname
'googlebooks-eng-all-5gram-20120701-0.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-0.gz'
>>> next(records)
Record(ngram=u'0 " A most useful', year=1860, match_count=1, volume_count=1)
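Since each record covers a single ngram in a single year, a project such as word2vec will usually want the counts summed across years. The following is a rough sketch of that step, not the package's own method: total_counts and first_file_records are hypothetical helpers, the local Record type only mirrors the fields shown above, and the sample rows other than the first are made up for illustration.

```python
from collections import Counter, namedtuple
from itertools import islice

# Local stand-in mirroring the fields of the records shown above.
Record = namedtuple("Record", "ngram year match_count volume_count")


def total_counts(records, limit=None):
    """Sum match_count over all years for each ngram.

    `limit` caps how many records are read; None reads everything.
    """
    totals = Counter()
    for record in islice(records, limit):
        totals[record.ngram] += record.match_count
    return totals


def first_file_records():
    """Hypothetical helper: stream records of the first 5-gram file.

    Needs the google-ngram-downloader package and network access,
    so it is not called in this sketch.
    """
    from google_ngram_downloader import readline_google_store
    fname, url, records = next(readline_google_store(ngram_len=5))
    return records


# Illustrative data: the first row is from the post, the rest are made up.
sample = [
    Record('0 " A most useful', 1860, 1, 1),
    Record('0 " A most useful', 1861, 2, 2),
    Record('a made up five gram', 1900, 4, 3),
]
totals = total_counts(sample)
print(totals.most_common())
```

With the package installed, total_counts(first_file_records(), limit=100000) would apply the same aggregation to real data; the limit matters because a single file can hold a very large number of records.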