deeplearning4j對word2vec的介紹

Word2Vec

Contents

Introduction to word2vector

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not adeep neural network, it turns text into a numerical form that deep nets can understand.Deeplearning4j implements a distributed form of Word2vec for Java andScala, which works on Spark with GPUs.

翻譯:word2vec是一個用來處理文本信息的兩層神經網絡。它的輸入是一個文本語料庫,它的輸出的一組向量:這些語料中單詞的特徵向量。然而word2vec並不是一個深層神經網絡,它將文本轉化成一種能夠讓深層網絡理解的數值形式。Deeplearning4j爲java和scala實現了一種分佈式的word2vec,可以工作在spark和GPUs上。

Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well togenes, code,likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

翻譯:word2vec是對在自然狀態下語句解析的延展性的應用。同樣可以應用到基因、代碼、喜好、播放列表、社交媒介圖或者其他的語言或符號系列的模式識別。

Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.

翻譯:爲什麼呢?因爲單詞是最簡單的獨立的狀態,就像上面提到的數據,並且我們僅僅是在尋找這些狀態之間的過渡概率:他們將共現的可能性。所以,gene2vec、like2vec和follower2vec都是可能的。記住這一點,下面的教程將幫助你理解如何爲任何獨立狀態或共現狀態的組合創建神經嵌入。

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vectorspace. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.

翻譯:word2vec的目標和用處是爲了把向量空間中相似詞語的向量分到同組。這就是說,他能發現數學上的相似之處。word2vec創建的向量,這些向量是由單詞的分佈的數值來表現的單詞特徵,例如這些特徵可以是單詞的上下文。而這過程無需人工干預。

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search,sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.


The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.


Measuring cosine similarity, no similarity is expressed as a 90 degree angle, while total similarity of 1 is a 0 degree angle, complete overlap; i.e. Sweden equals Sweden, while Norway has a cosine distance of 0.760124 from Sweden, the highest of any other country.


discrete 離散的,獨立的 adj.

co-occur 共現,同現 n.

detect 發現 vt.

mathematically 算術的,數學上的 adv.

numerical 數值的 adj.

representation 表現,代表 n.


參考資料:

http://deeplearning4j.org/word2vec

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章