Learning Python Code Suggestion with a Sparse Pointer Network [ICLR 2017]
Paper: Learning Python Code Suggestion with a Sparse Pointer Network
Authors: Avishkar Bhoopchand et al.
Affiliation: University College London (UCL)
Venue: ICLR 2017
Model
Neural language model
For a token sequence $S = a_1, \ldots, a_N$, the joint probability of $S$ factorizes as

$$P_{\theta}(S) = P_{\theta}(a_1) \cdot \prod_{t=2}^{N} P_{\theta}(a_t \mid a_{t-1}, \ldots, a_1)$$
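As a tiny worked example of this chain rule, with made-up conditional probabilities for a three-token sequence:

import math

# Made-up per-step conditionals: P(a1), P(a2 | a1), P(a3 | a2, a1).
step_probs = [0.2, 0.5, 0.1]

# Summing log-probabilities avoids underflow on long sequences.
log_p = sum(math.log(p) for p in step_probs)
print(math.exp(log_p))   # ~0.01 = 0.2 * 0.5 * 0.1, the joint P(S)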
Given a sequence of Python code tokens, the task is to predict the next $M$ tokens:
$$\underset{a_{t+1}, \ldots, a_{t+M}}{\arg\max}\; P_{\theta}(a_1, \ldots, a_t, a_{t+1}, \ldots, a_{t+M})$$
The conditional probabilities are estimated with an LSTM:
$$P_{\theta}(a_t = \tau \mid a_{t-1}, \ldots, a_1) = \frac{\exp(\boldsymbol{v}_{\tau}^T \boldsymbol{h}_t + b_{\tau})}{\sum_{\tau'} \exp(\boldsymbol{v}_{\tau'}^T \boldsymbol{h}_t + b_{\tau'})}$$
where $\boldsymbol{v}_{\tau}$ is the parameter vector of token $\tau$.
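Decoding the next $M$ tokens by exhaustive argmax is exponential in $M$; the code released with the paper approximates it with beam search (noted in the summary below). A minimal sketch, where next_log_probs is a hypothetical stand-in for a function that runs the LSTM on a prefix and returns per-token log-probabilities:

import heapq

def beam_search(prefix, next_log_probs, M, beam_width=5):
    # Each beam entry is (cumulative log-probability, generated tokens).
    beams = [(0.0, [])]
    for _ in range(M):
        candidates = []
        for score, tokens in beams:
            # next_log_probs: hypothetical, maps a token sequence to a
            # dict {token: log P(token | sequence)} from the LSTM.
            for token, lp in next_log_probs(prefix + tokens).items():
                candidates.append((score + lp, tokens + [token]))
        # Keep only the beam_width highest-scoring partial completions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])[1]   # best M-token completion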
Attention
$$\begin{aligned}
\boldsymbol{M}_t &= [\boldsymbol{m}_1 \ldots \boldsymbol{m}_K] &&\in \mathbb{R}^{k \times K}\\
\boldsymbol{G}_t &= \tanh\!\left(\boldsymbol{W}^M \boldsymbol{M}_t + \boldsymbol{1}_K^T (\boldsymbol{W}^h \boldsymbol{h}_t)\right) &&\in \mathbb{R}^{k \times K}\\
\boldsymbol{\alpha}_t &= \mathrm{softmax}(\boldsymbol{w}^T \boldsymbol{G}_t) &&\in \mathbb{R}^{1 \times K}\\
\boldsymbol{c}_t &= \boldsymbol{M}_t \boldsymbol{\alpha}_t^T &&\in \mathbb{R}^k
\end{aligned}$$
where $\boldsymbol{M}_t$ is a memory of length $K$.
$$\begin{aligned}
\boldsymbol{n}_t &= \tanh\!\left(\boldsymbol{W}^A \begin{bmatrix} \boldsymbol{h}_t\\ \boldsymbol{c}_t \end{bmatrix}\right) &&\in \mathbb{R}^k\\
\boldsymbol{y}_t &= \mathrm{softmax}(\boldsymbol{W}^V \boldsymbol{n}_t + \boldsymbol{b}^V) &&\in \mathbb{R}^{|V|}
\end{aligned}$$
where $\boldsymbol{y}_t$ is the computed probability distribution over the predicted token.
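A minimal NumPy sketch of the attention and output equations above, with toy dimensions and random matrices standing in for the learned parameters:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, K, V = 4, 3, 10                     # hidden size, memory length, |V| (toy)
rng = np.random.default_rng(0)

# Random stand-ins for the learned parameters.
W_M = rng.normal(size=(k, k))
W_h = rng.normal(size=(k, k))
w = rng.normal(size=k)
W_A = rng.normal(size=(k, 2 * k))
W_V = rng.normal(size=(V, k))
b_V = np.zeros(V)

h_t = rng.normal(size=k)               # current LSTM hidden state
M_t = rng.normal(size=(k, K))          # memory M_t = [m_1 ... m_K]

G_t = np.tanh(W_M @ M_t + (W_h @ h_t)[:, None])  # (k, K); broadcast = 1_K^T term
alpha_t = softmax(w @ G_t)                       # (K,)  attention weights
c_t = M_t @ alpha_t                              # (k,)  context vector

n_t = np.tanh(W_A @ np.concatenate([h_t, c_t]))  # (k,)
y_t = softmax(W_V @ n_t + b_V)                   # (|V|,) next-token distribution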
Sparse Pointer Network
$$\begin{aligned}
\boldsymbol{s}_t[i] &= \begin{cases}
\boldsymbol{\alpha}_t[j] & \text{if } \boldsymbol{m}_t[j] = i\\
-C & \text{otherwise}
\end{cases}\\
\boldsymbol{i}_t &= \mathrm{softmax}(\boldsymbol{s}_t) \in \mathbb{R}^{|V|}
\end{aligned}$$
This yields a pseudo-sparse distribution over the global vocabulary,
where $-C$ is a very small constant (a large negative number), and $\boldsymbol{m}_t = [\mathrm{id}_1, \ldots, \mathrm{id}_K] \in \mathbb{N}^K$ holds the vocabulary IDs of the identifiers in the memory.
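A minimal NumPy sketch of this scatter step, with toy values; m_t holds the vocabulary IDs of the K identifiers currently in memory (duplicate IDs would need extra care):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, C = 10, 1000.0
alpha_t = np.array([0.7, 0.2, 0.1])    # attention over the K = 3 memory slots
m_t = np.array([4, 8, 2])              # vocab IDs of the identifiers in memory

s_t = np.full(V, -C)                   # -C everywhere ...
s_t[m_t] = alpha_t                     # ... except at the remembered IDs
i_t = softmax(s_t)                     # pseudo-sparse distribution over V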
The language model's own distribution over the vocabulary is

$$\boldsymbol{y}_t = \mathrm{softmax}(\boldsymbol{W}^V \boldsymbol{h}_t + \boldsymbol{b}^V) \quad \in \mathbb{R}^{|V|}$$

The two distributions are then mixed by a controller:
$$\begin{aligned}
\boldsymbol{h}_t^{\lambda} &= \begin{bmatrix} \boldsymbol{h}_t\\ \boldsymbol{x}_t\\ \boldsymbol{c}_t \end{bmatrix} &&\in \mathbb{R}^{3k}\\
\boldsymbol{\lambda}_t &= \mathrm{softmax}\!\left(\boldsymbol{W}^{\lambda} \boldsymbol{h}_t^{\lambda} + \boldsymbol{b}^{\lambda}\right) &&\in \mathbb{R}^2\\
\boldsymbol{y}_t^{\ast} &= [\boldsymbol{y}_t \;\, \boldsymbol{i}_t]\, \boldsymbol{\lambda}_t &&\in \mathbb{R}^{|V|}
\end{aligned}$$
where $\boldsymbol{x}_t$ is the representation of the input token, $\boldsymbol{c}_t$ is the attention context vector computed above, and $\boldsymbol{W}^{\lambda} \in \mathbb{R}^{2 \times 3k}$.
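A sketch of this controller computation, again with random stand-ins for the learned weights and for the two distributions being mixed:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, V = 4, 10
rng = np.random.default_rng(1)

h_t = rng.normal(size=k)               # LSTM hidden state
x_t = rng.normal(size=k)               # input-token representation
c_t = rng.normal(size=k)               # attention context vector
y_t = softmax(rng.normal(size=V))      # language-model distribution
i_t = softmax(rng.normal(size=V))      # pointer distribution

W_lam = rng.normal(size=(2, 3 * k))
b_lam = np.zeros(2)

h_lam = np.concatenate([h_t, x_t, c_t])        # (3k,)
lam_t = softmax(W_lam @ h_lam + b_lam)         # (2,) mixture weights
y_star = np.column_stack([y_t, i_t]) @ lam_t   # (|V|,) final distribution

Since lam_t sums to one, y_star is a convex combination of the two inputs and therefore itself a valid probability distribution over the vocabulary.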
Training
Data preprocessing
Identifiers are replaced by an identifier group plus a number; numeric literals are replaced with the $NUM$ token, and out-of-vocabulary tokens with $OOV$. A sketch follows.
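A rough sketch of such normalization using Python's built-in tokenize module; the single identifier group here is my own simplification for illustration (the paper distinguishes several groups), and the paper's actual pipeline is more involved:

import io
import keyword
import tokenize

SKIP = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)

def normalize(source):
    ids = {}   # identifier -> numbered placeholder, e.g. 'identifier1'
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            ids.setdefault(tok.string, "identifier%d" % (len(ids) + 1))
            out.append(ids[tok.string])         # normalized identifier
        elif tok.type == tokenize.NUMBER:
            out.append("$NUM$")                 # all numeric literals
        elif tok.type in (tokenize.NAME, tokenize.OP):
            out.append(tok.string)              # keywords and operators
        else:
            out.append("$OOV$")                 # e.g. string literals
    return out

print(normalize("def area(r):\n    return 3.14 * r * r\n"))
# ['def', 'identifier1', '(', 'identifier2', ')', ':',
#  'return', '$NUM$', '*', 'identifier2', '*', 'identifier2']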
Results
$$\begin{array}{lccccccccc}
\hline
\text{Model} & \text{Train PP} & \text{Dev PP} & \text{Test PP} & \multicolumn{3}{c}{\text{Acc}\ [\%]} & \multicolumn{3}{c}{\text{Acc@5}\ [\%]}\\
 & & & & \text{All} & \text{IDs} & \text{Other} & \text{All} & \text{IDs} & \text{Other}\\
\hline
\text{3-gram} & 12.90 & 24.19 & 26.90 & 13.19 & - & - & 50.81 & - & -\\
\text{4-gram} & 7.60 & 21.07 & 23.85 & 13.68 & - & - & 51.26 & - & -\\
\text{5-gram} & 4.52 & 19.33 & 21.22 & 13.90 & - & - & 51.49 & - & -\\
\text{6-gram} & 3.37 & 18.73 & 20.17 & 14.51 & - & - & 51.76 & - & -\\
\hline
\text{LSTM} & 9.29 & 13.08 & 14.01 & 57.91 & 2.1 & 62.8 & 76.30 & 4.5 & 82.6\\
\text{LSTM w/ Attn 20} & 7.30 & 11.07 & 11.74 & 61.30 & 21.4 & 64.8 & 79.32 & 29.9 & 83.7\\
\text{LSTM w/ Attn 50} & 7.09 & 9.83 & 10.05 & \mathbf{63.21} & \mathbf{30.2} & \mathbf{65.3} & 81.69 & 41.3 & 84.1\\
\hline
\text{This paper} & 6.41 & \mathbf{9.40} & \mathbf{9.18} & 62.97 & 27.3 & 64.9 & \mathbf{82.62} & \mathbf{43.6} & \mathbf{84.5}\\
\hline
\end{array}$$
Problems encountered
When crawling the data with the scraper the authors provide, the following error occurred:
Traceback (most recent call last):
  File "github-scraper/scraper.py", line 143, in <module>
    main(sys.argv[1:])
  File "github-scraper/scraper.py", line 130, in main
    repos = create_repos(dbFile)
  File "github-scraper/scraper.py", line 59, in create_repos
    repos = pickle.load(infile)
AttributeError: Can't get attribute 'UTC' on <module 'github3.utils' from '/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/github3/utils.py'>
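This AttributeError usually means the pickle file was produced with a different version of github3.py, one in which github3.utils still defined a UTC class; pickle.load then fails because that attribute no longer exists in the installed version. A hedged workaround sketch, assuming the pickled objects only need a tzinfo-compatible UTC (the file name is hypothetical):

import datetime
import pickle

import github3.utils

class UTC(datetime.tzinfo):
    # Stub for the UTC class that the older github3.py defined.
    def utcoffset(self, dt):
        return datetime.timedelta(0)
    def tzname(self, dt):
        return "UTC"
    def dst(self, dt):
        return datetime.timedelta(0)

# Re-register the missing attribute so pickle.load can resolve it.
github3.utils.UTC = UTC

with open("repos.pkl", "rb") as infile:   # hypothetical file name
    repos = pickle.load(infile)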
Also, when running the code from GitHub, some packages were not in the locations the scripts expected; I am not sure why.
Summary
The model in this paper is easy to understand: LSTM + attention + pointer. The difficulty, as usual, lies in the data processing. The authors provide neither the raw data nor the processed pkl files, which makes reproduction inconvenient. Looking through the code files, they also implement beam search. All in all, for big-code work, data processing seems to be the hard and tedious part.
$$\begin{aligned}
\boldsymbol{s}_t[i] &= \begin{cases}
\boldsymbol{\alpha}_t[j] & \text{if } \boldsymbol{m}_t[j] = i\\
-C & \text{otherwise}
\end{cases}\\
\boldsymbol{i}_t &= \mathrm{softmax}(\boldsymbol{s}_t) \in \mathbb{R}^{|V|}
\end{aligned}$$
I did not fully understand what $\boldsymbol{m}_t[j] = i$ means here. (My reading, from the definitions above: memory slot $j$ stores the identifier whose vocabulary ID is $i$, so its attention weight $\boldsymbol{\alpha}_t[j]$ is copied into position $i$ of the vocabulary-sized score vector $\boldsymbol{s}_t$.)
References