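Training word vectors with skip-gram: a Python implementation

I won't go through the theory and derivation of skip-gram in detail. This post mainly records the first neural network for which I wrote both the forward and the backward pass by hand, and I finally got to see for myself how dramatically negative sampling speeds up word-vector training. Impressive! The overall running time is still fairly high, though, and I am looking into why gensim, which is also used from Python, is so much faster.

The data and code will go up on my GitHub (tomorrow, if I have time). The code is fairly rough and still needs improving...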

1. Tools

Python: 3.6

Machine: run locally on a Mac

Dataset: the text8 English corpus

2. Data preprocessing

Replace special symbols in the text
Tokenize the text
Remove low-frequency words

from collections import Counter


def preprocess(text, freq=5):
    '''
    Preprocess the raw text.

    Parameters
    ---
    text: raw text data
    freq: word-count threshold; words occurring freq times or fewer are dropped
    '''
    # Replace punctuation with special tokens
    text = text.lower()
    text = text.replace('.', ' <PERIOD> ')
    text = text.replace(',', ' <COMMA> ')
    text = text.replace('"', ' <QUOTATION_MARK> ')
    text = text.replace(';', ' <SEMICOLON> ')
    text = text.replace('!', ' <EXCLAMATION_MARK> ')
    text = text.replace('?', ' <QUESTION_MARK> ')
    text = text.replace('(', ' <LEFT_PAREN> ')
    text = text.replace(')', ' <RIGHT_PAREN> ')
    text = text.replace('--', ' <HYPHENS> ')
    text = text.replace(':', ' <COLON> ')
    words = text.split()

    # Drop low-frequency words to reduce noise
    word_counts = Counter(words)
    trimmed_words = [word for word in words if word_counts[word] > freq]

    return trimmed_words

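3. Building the training samples

Build the vocabulary, i.e. the two lookup tables id->word and word->id.
Convert the text sequence into a sequence of ids.
Remove stop words: stop words tend to be very frequent, so the following formula is used to compute the probability that each word is dropped.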

$P \left( w _ { i } \right) = 1 - \sqrt { \frac { t } { f \left( w _ { i } \right) } }$

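where $f \left( w _ { i } \right)$ is the relative frequency of word $w _ { i }$ (its count divided by the total number of tokens) and $t$ is a threshold, typically between 1e-3 and 1e-5. If $P \left( w _ { i } \right)$ exceeds a chosen cutoff (the threshold argument in the code), $w _ { i }$ is removed.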
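To get a feel for the formula, here is a small sketch that evaluates it for a few made-up relative frequencies. The words and their frequencies are hypothetical; only t = 1e-5 and the cutoff 0.8 are taken from the main function later in this post. Note that a rare word can get a negative value, which simply means it is never dropped.

```python
import numpy as np


def drop_probability(freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), where f(w) is the relative frequency of w."""
    return 1.0 - np.sqrt(t / freq)


# Hypothetical relative frequencies, chosen only for illustration.
example_freqs = {"the": 5e-2, "network": 1e-4, "skipgram": 1e-6}
t, threshold = 1e-5, 0.8

for word, f in example_freqs.items():
    p = drop_probability(f, t)
    # Same rule as in this post: a word is removed when P(w) is above the cutoff.
    print(word, round(float(p), 4), "dropped" if p > threshold else "kept")
```

With these settings the cutoff is deterministic rather than random: every word whose relative frequency is high enough is removed entirely.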

import numpy as np


def get_train_words(path, t, threshold, freq):
    '''
    Read the corpus, build the vocabulary and drop very frequent words.

    Returns the id->word table and the list of training word ids.
    '''
    with open(path) as f:
        text = f.read()
    words = preprocess(text, freq)
    vocab = set(words)
    vocab_to_int = {w: c for c, w in enumerate(vocab)}
    int_to_vocab = {c: w for c, w in enumerate(vocab)}

    # Convert the text from words to integer ids
    int_words = [vocab_to_int[w] for w in words]

    # Count how often each word id occurs
    int_word_counts = Counter(int_words)
    total_count = len(int_words)
    # Relative frequency of each word
    word_freqs = {w: c/total_count for w, c in int_word_counts.items()}
    # Probability of dropping each word
    prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
    # Keep only the words whose drop probability is below the cutoff
    train_words = [w for w in int_words if prob_drop[w] < threshold]
    return int_to_vocab, train_words

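4. Generating the skip-gram input pairs (center word, context word)

The size of the context window is sampled at random here, so that words closer to the center word are sampled more often. After all, the closer a word is to the center word, the more strongly the two are related!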

def get_targets(words, idx, window_size):
    '''
    Get the list of context words for a center word.

    Parameters
    ---
    words: list of word ids
    idx: index of the input (center) word
    window_size: maximum window size
    '''
    # Draw the actual window radius uniformly from 1..window_size
    target_window = np.random.randint(1, window_size + 1)
    # Handle the case where there are not enough words before the input word
    start_point = max(idx - target_window, 0)
    end_point = idx + target_window
    # Output words (the context words inside the window), de-duplicated
    targets = set(words[start_point: idx] + words[idx + 1: end_point + 1])
    return list(targets)


def get_batches(words, window_size):
    '''
    Pair each center word with every one of its context words.
    '''
    for idx in range(0, len(words)):
        targets = get_targets(words, idx, window_size)
        for y in targets:
            yield words[idx], y
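Why does a randomly sized window favour nearby words? If the radius is drawn uniformly from 1..window_size, a word at distance d is only included when the radius is at least d, which happens with probability (window_size - d + 1) / window_size. The sketch below checks this with a quick simulation; the helper names are my own and are not part of the training code.

```python
import numpy as np


def inclusion_probability(distance, window_size):
    """Chance that a word `distance` positions away falls inside a window
    whose radius is drawn uniformly from 1..window_size."""
    return (window_size - distance + 1) / window_size


def empirical_inclusion(distance, window_size, trials=20000, seed=0):
    """Estimate the same quantity by simulating the random window radius."""
    rng = np.random.default_rng(seed)
    radii = rng.integers(1, window_size + 1, size=trials)  # high end is exclusive
    return float(np.mean(radii >= distance))


window_size = 10
for d in (1, 5, 10):
    print(d, inclusion_probability(d, window_size),
          round(empirical_inclusion(d, window_size), 3))
```

So with a maximum window of 10, the immediate neighbour is always used, while a word 10 positions away is used only about one time in ten.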

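5. Some basic helper functions

Here sigmoid_grad computes the gradient of the sigmoid function. Note that it expects the output of sigmoid as its argument, not the raw input.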

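def softmax(vector):
    # Subtract the maximum first so np.exp cannot overflow; the result is unchanged
    res = np.exp(vector - np.max(vector))
    e_sum = np.sum(res)
    res /= e_sum
    return res


def sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp))


def sigmoid_grad(inp):
    # inp is the sigmoid output s, so the derivative is s * (1 - s)
    return inp * (1 - inp)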
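A quick way to convince yourself that the gradient helper is right is to compare it with a finite-difference estimate. This is a minimal self-contained sketch (it redefines the two functions so it can run on its own); the test points and step size are arbitrary choices of mine.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(s):
    """Derivative of sigmoid, expressed in terms of its output s = sigmoid(x)."""
    return s * (1.0 - s)


x = np.array([-2.0, 0.0, 3.0])
eps = 1e-6

# Central difference approximation of d sigmoid / dx
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
# Analytic gradient: pass the sigmoid OUTPUT, not x itself
analytic = sigmoid_grad(sigmoid(x))

print(np.allclose(numeric, analytic, atol=1e-8))
```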

6. Building the skip-gram model

def forward_backword(input_vectors, output_vectors, in_idx, out_idx, sigma, vector_dimension, vocabulary_size):
    # Forward pass: look up the center-word vector and score every word in the vocabulary
    hidden = input_vectors[in_idx]
    output = np.dot(output_vectors, hidden)
    output_p = softmax(output)
    loss = -np.log(output_p[out_idx])
    # Backward pass: the gradient of cross-entropy w.r.t. the scores is (probabilities - one-hot)
    output_grad = output_p.copy()
    output_grad[out_idx] -= 1.0
    hidden_grad = np.dot(output_vectors.T, output_grad)
    # The outer product gives the gradient for the whole output matrix
    hidden = hidden.reshape(vector_dimension, 1)
    output_grad = output_grad.reshape(vocabulary_size, 1)
    output_vectors_grad = np.dot(output_grad, hidden.T)
    # SGD step with learning rate sigma
    output_vectors -= sigma * output_vectors_grad
    input_vectors[in_idx] -= sigma * hidden_grad
    return loss

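Note, however, that this is the most basic forward and backward pass for skip-gram, and it is far too slow, to the point of being unusable! So the negative-sampling version below replaces it.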
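A rough count of the work per training pair shows where the gap comes from. The full softmax touches every row of the output matrix, while negative sampling touches only the positive row plus K negative rows. In this sketch the vocabulary size is a hypothetical figure (the real one depends on the corpus after preprocessing); the vector dimension and K are the values used in the main function.

```python
vocabulary_size = 50_000   # hypothetical; the real value depends on the corpus
vector_dimension = 200     # value used in the main function
K = 10                     # negative samples per pair, as in the main function

# Full softmax: every output vector takes part in the dot products and the update.
softmax_rows = vocabulary_size
# Negative sampling: one positive row plus K negative rows.
negative_sampling_rows = K + 1

print("softmax multiply-adds per forward pass:", softmax_rows * vector_dimension)
print("negative-sampling multiply-adds:", negative_sampling_rows * vector_dimension)
print("ratio:", softmax_rows / negative_sampling_rows)
```

This only counts the dot products, so it is an order-of-magnitude estimate rather than a measurement.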

def neg_forward_backword(input_vectors, output_vectors, in_idx, out_idx, sigma, vocabulary_size, K=10):
    epsilon = 1e-5  # keeps log() away from zero
    hidden = input_vectors[in_idx]
    neg_idxs = neg_sample(vocabulary_size, out_idx, K)
    # Positive pair: push sigmoid(u_out . h) towards 1
    tmp = sigmoid(np.dot(output_vectors[out_idx], hidden))
    hidden_grad = (tmp - 1.0) * output_vectors[out_idx]
    output_vectors[out_idx] -= sigma * (tmp - 1.0) * hidden
    loss = -np.log(tmp + epsilon)
    # Negative samples: push sigmoid(u_neg . h) towards 0
    for idx in neg_idxs:
        tmp = sigmoid(np.dot(output_vectors[idx], hidden))
        loss -= np.log(1.0 - tmp + epsilon)
        hidden_grad += tmp * output_vectors[idx]
        output_vectors[idx] -= sigma * tmp * hidden
    # Update the center-word vector last, using the accumulated gradient
    input_vectors[in_idx] -= sigma * hidden_grad
    return loss


def neg_sample(vocabulary_size, out_idx, K):
    # Draw K word ids uniformly at random, re-drawing whenever the true context word comes up
    res = [None] * K
    for i in range(K):
        tmp = np.random.randint(0, vocabulary_size)
        while tmp == out_idx:
            tmp = np.random.randint(0, vocabulary_size)
        res[i] = tmp
    return np.array(res)
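Since the gradients here are written by hand, a numerical gradient check is a cheap sanity test. The sketch below restates the negative-sampling loss for one pair and compares the analytic gradient with respect to the center-word vector against central differences. It is a simplified stand-in, not the training function itself: the helper names and the tiny sizes are my own, and the small epsilon inside the logarithm is left out so that the two gradients match exactly.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_loss(hidden, pos_vec, neg_vecs):
    """Negative-sampling loss for one (center, context) pair."""
    loss = -np.log(sigmoid(pos_vec @ hidden))
    loss -= np.sum(np.log(1.0 - sigmoid(neg_vecs @ hidden)))
    return loss


def neg_loss_grad_hidden(hidden, pos_vec, neg_vecs):
    """Analytic gradient of neg_loss with respect to the center-word vector."""
    grad = (sigmoid(pos_vec @ hidden) - 1.0) * pos_vec
    grad += sigmoid(neg_vecs @ hidden) @ neg_vecs
    return grad


rng = np.random.default_rng(0)
dim, K = 5, 3                       # tiny hypothetical sizes
hidden = rng.normal(size=dim)
pos_vec = rng.normal(size=dim)
neg_vecs = rng.normal(size=(K, dim))

# Central differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros(dim)
for i in range(dim):
    step = np.zeros(dim)
    step[i] = eps
    numeric[i] = (neg_loss(hidden + step, pos_vec, neg_vecs)
                  - neg_loss(hidden - step, pos_vec, neg_vecs)) / (2 * eps)

analytic = neg_loss_grad_hidden(hidden, pos_vec, neg_vecs)
print(np.max(np.abs(numeric - analytic)))
```

If the printed difference is tiny, the formula accumulated in hidden_grad (positive term plus the sum over negative samples) agrees with the loss being reported.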

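7. Finding the K most similar words for some words

To check how well the word vectors have been trained, we look at whether the K nearest words of a given word really are similar to it. This function randomly picks some validation words from two id ranges and computes their similarity to every word. (The ranges would correspond to high-frequency words only if ids were assigned in order of frequency; here the vocabulary is built from a set, so the picks are effectively arbitrary words.)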

import random


def get_simi(input_vectors):
    valid_size = 16
    valid_window = 100
    # Pick 8 word ids from each of two ranges: [0, 100) and [1000, 1100)
    valid_examples = np.array(random.sample(range(valid_window), valid_size // 2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000, 1000 + valid_window), valid_size // 2))

    valid_size = len(valid_examples)

    # Compute the norm of each word vector and normalize to unit length
    norm = np.sqrt(np.square(input_vectors).sum(axis=1)).reshape(len(input_vectors), 1)
    normalized_embedding = input_vectors / norm
    # Look up the vectors of the validation words
    valid_embedding = normalized_embedding[valid_examples]
    # Cosine similarity between validation words and all words
    similarity = np.dot(valid_embedding, normalized_embedding.T)
    return similarity, valid_size, valid_examples
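The normalize-then-dot-product trick is easier to see on a toy example. The four 2-d vectors below are made up purely for illustration; after normalizing each row to unit length, a single matrix product gives all pairwise cosine similarities.

```python
import numpy as np

# Hypothetical 2-d "embeddings", just to make the geometry easy to see.
vectors = np.array([
    [1.0, 0.0],    # word 0
    [2.0, 0.1],    # word 1: nearly the same direction as word 0
    [0.0, 3.0],    # word 2: orthogonal to word 0
    [-1.0, 0.0],   # word 3: opposite to word 0
])

norm = np.linalg.norm(vectors, axis=1, keepdims=True)
normalized = vectors / norm
similarity = normalized @ normalized.T      # cosine similarity matrix

# Rank the neighbours of word 0; position 0 of the ranking is the word itself.
ranking = (-similarity[0]).argsort()
print(ranking[1:])
```

The first entry of the ranking is always the query word (similarity 1 with itself), which is why the main function skips index 0 when it takes the top 8 neighbours.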

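8. The main function

Set the parameters
Lay out the overall flow of the code (i.e. call the functions above in order)
Check the results (i.e. look at the K most similar words)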

import time

if __name__ == "__main__":
    path = './text8.txt'
    t = 1e-5
    threshold = 0.8  # drop-probability cutoff
    freq = 5
    windows = 10
    int_to_vocab, train_words = get_train_words(path, t, threshold, freq)
    # The dict is stored as a pickled object array; load it with
    # np.load('int_to_vocab.npy', allow_pickle=True).item()
    np.save('int_to_vocab', int_to_vocab)
    vocabulary_size = len(int_to_vocab)
    vector_dimension = 200
    input_vectors = np.random.random([vocabulary_size, vector_dimension])
    output_vectors = np.random.random([vocabulary_size, vector_dimension])
    epochs = 10  # number of passes over the data
    sigma = 0.01  # learning rate
    K = 10  # negative samples per training pair

    iteration = 1
    for e in range(1, epochs + 1):
        # Learning-rate schedule: test the larger bound first,
        # otherwise the second branch can never be reached
        if e > 3:
            sigma = 0.0001
        elif e > 1:
            sigma = 0.001
        loss = 0
        batches = get_batches(train_words, windows)
        start = time.time()
        for x, y in batches:
            loss += neg_forward_backword(input_vectors, output_vectors, x, y, sigma, vocabulary_size, K)
            if iteration % 100000 == 0:
                end = time.time()
                print("Epoch {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(loss / 100000),
                      "{:.4f} sec/100000".format((end - start)))
                loss = 0
                start = time.time()
            if iteration % 4000000 == 0:
                np.save('input_vectors', input_vectors)
                similarity, valid_size, valid_examples = get_simi(input_vectors)
                for i in range(valid_size):
                    valid_word = int_to_vocab[valid_examples[i]]
                    top_k = 8  # take the 8 most similar words
                    # Index 0 is the word itself, so skip it
                    nearest = (-similarity[i, :]).argsort()[1:top_k + 1]
                    log = 'Nearest to [%s]:' % valid_word
                    for k in range(top_k):
                        close_word = int_to_vocab[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
            iteration += 1

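9. Results

The vectors do show some effect, but because the running time is high, I did no hyperparameter tuning, did not run enough epochs, and used a fairly small amount of data, so the results are not great. Still, it is more than enough for getting familiar with the inner workings of skip-gram and with negative sampling! I am also looking into why gensim, which is used from Python as well, is so much faster. I plan to borrow some ideas from it and then implement hierarchical softmax myself.

References: https://www.leiphone.com/news/201706/QprrvzsrZCl4S2lw.html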

https://zhuanlan.zhihu.com/p/33625794
