NLP Dataset Roundup (continuously updated...)

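This post summarizes the corpora used in the NLP papers I have read, and it will be updated continuously. If you know of a dataset that is not listed here, please tell me in the comments~ Just give the dataset name, the corresponding paper, and the download URL, and I will add it to this post as soon as I see it ^.^
Free English corpora for NLP experiments are organized below. (The link to each corpus is in the paper cited in the corresponding footnote, and those papers also describe how the data is used.)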

Semantic Similarity

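WordSim353 [1]: Contains 353 word pairs and is used for ranking the semantic similarity between words. The similarity of a pair is usually computed as the cosine similarity between the two word vectors.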
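As a minimal sketch of how such a ranking could be scored, the snippet below computes the cosine similarity for each word pair and then compares the model's ranking with human ratings using Spearman rank correlation. Using Spearman correlation is an assumption based on common practice, not something stated in this post. The vectors, word pairs, and human scores are toy values invented for illustration and are not taken from WordSim353.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """Rank values from 1 (smallest) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy 3-dimensional vectors and human scores (hypothetical, for illustration only).
vectors = {
    "tiger": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "car": [0.1, 0.9, 0.2],
    "stock": [0.0, 0.2, 0.9],
}
pairs = [("tiger", "cat", 7.0), ("cat", "car", 2.0), ("tiger", "stock", 0.5)]

model_scores = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

A rank-based measure fits this task because only the ordering of the pairs matters, not the absolute similarity values.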

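TOEFL [2]: Contains 80 multiple-choice synonym questions. Each question has 4 candidates, and the task is to pick the word closest in meaning. For example, for levied the four options are imposed (correct), believed, requested, and correlated. Cosine similarity between word vectors is again used to measure how close two words are and to find the nearest candidate.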
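Answering a question of this kind with word vectors can be sketched as follows: score each candidate by its cosine similarity to the target word and return the highest-scoring one. The function name `answer_synonym_question` and the 2-dimensional vectors are hypothetical, chosen only to make the example runnable. Real word vectors would come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_synonym_question(target, candidates, vectors):
    """Return the candidate whose vector is closest (by cosine) to the target's."""
    return max(
        candidates,
        key=lambda c: cosine_similarity(vectors[target], vectors[c]),
    )

# Toy 2-dimensional vectors (hypothetical, for illustration only).
toy_vectors = {
    "levied": [0.9, 0.2],
    "imposed": [0.8, 0.3],
    "believed": [0.1, 0.9],
    "requested": [0.5, 0.6],
    "correlated": [-0.3, 0.7],
}

choice = answer_synonym_question(
    "levied",
    ["imposed", "believed", "requested", "correlated"],
    toy_vectors,
)
print(choice)
```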

Semantic&Syntactic [3]: Contains 8869 semantic questions and 10675 syntactic questions. It is used in T. Mikolov's word2vec work. The questions all look like “man is to (woman) as king is to queen” or “predict is to (predicting) as dance is to dancing”.
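Analogy questions like these are commonly answered with vector arithmetic: for "a is to b as c is to ?", take the vector b - a + c and return the vocabulary word closest to it by cosine similarity, excluding the three question words. The sketch below assumes that procedure. The function name `solve_analogy` and the 3-dimensional vectors are hypothetical and were constructed so that the offset works out exactly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# Toy vectors (hypothetical): dimensions loosely mean royalty, female, person.
toy_vectors = {
    "man": [0.1, 0.0, 1.0],
    "woman": [0.1, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.9],
    "apple": [0.2, 0.3, -1.0],
}

answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

Accuracy on the test set would then be the fraction of questions for which the returned word matches the expected answer.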

Classification

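IMDB [4]: This dataset has three parts: a training set, a test set, and an unlabeled set. The training and test sets are used to train and evaluate text classification models, and the unlabeled set is used to train word vectors. It is used for sentiment analysis.

Stanford Sentiment Treebank [5]: This dataset is relatively small and has been used for CNN-based sentiment classification. It can also be used in experiments on semantic representation.

Google Snippets [6]: This dataset contains 10060 training samples and 2280 test samples, divided into 8 classes. On average, each snippet has 18.07 words.

TREC [7]: This dataset covers 6 question types. The training set has 5452 labeled questions and the test set has 500 questions.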

Sentiment analysis

MR: Sentiment polarity dataset from Movie Review. URL
This dataset includes:

  • document-level: polarity dataset v2.0: 1000 positive and 1000 negative processed reviews.
  • sentence-level: sentence polarity dataset v1.0: 5331 positive and 5331 negative processed sentences/snippets.
  • Sentiment-scale datasets: scale dataset v1.0: a collection of documents whose labels come from a rating scale.
  • Subjectivity dataset v1.0: 5000 subjective and 5000 objective processed sentences

Subj: Subjectivity dataset. URL This corpus is annotated for whether each opinion is subjective.

consists of the top 20 results returned by the Yahoo! search engine in response to each of a set of 69 queries containing the word “review.” The queries were drawn from the publicly available list of real MSN users’ queries released for the 2005 KDD Cup competition; the KDD data itself is available at http://www.acm.org/sigs/sigkdd/kdd2005/Labeled800Queries.zip. Note that “sales pitches” were marked objective on the premise that they represent biased reviews that users might wish to avoid seeing.

CR: Customer review dataset (Hu and Liu, 2004) URL

This dataset consists of reviews of five electronics products downloaded from Amazon and Cnet. The sentences have been manually labeled as to whether an opinion is expressed, and if so, what feature from a pre-defined list is being evaluated. An addendum with nine products is also available (http://www.cs.uic.edu/~liub/FBS/Reviews-9-products.rar). The curator, Bing Liu, also distributes a comparative-sentence dataset that is available by request.

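MPQA: Opinion polarity dataset (Wiebe et al., 2005) URL This corpus contains 535 articles from a variety of news sources, manually annotated at the sentence and sub-sentence level for opinions and other private states (for example beliefs, emotions, sentiments, and speculations). Wiebe et al., "Annotating expressions of opinions and emotions in language", describes the annotation in more detail.

All four of the above are used in one article [8], whose authors kindly provide the links.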

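Others

CoNLL03: Used for NER (named entity recognition) [9].

Wall Street Journal: Used for the POS (part-of-speech) tagging task [10]. It is open-ended text, though, so it can be used for many other problems as well.

The papers that the datasets come from are listed below: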


  1. L. Finkelstein et al. Placing search in context: The concept revisited. TOIS, 2002.
  2. T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 1997.
  3. T. Mikolov et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
  4. A. L. Maas et al. Learning word vectors for sentiment analysis. ACL, 2011.
  5. R. Socher et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013.
  6. Xuan-Hieu Phan et al. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. WWW, 2008.
  7. Xin Li and Dan Roth. Learning question classifiers. COLING, 2002.
  8. B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008.
  9. L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. CoNLL, 2009.
  10. K. Toutanova et al. Feature-rich part-of-speech tagging with a cyclic dependency network. NAACL-HLT, 2003.