Learning Python Code Suggestion with a Sparse Pointer Network [ICLR 2017]
Paper: Learning Python Code Suggestion with a Sparse Pointer Network
Authors: Avishkar Bhoopchand et al.
Affiliation: University College London (UCL)
Venue: ICLR 2017
Model
Neural language model
For a token sequence $S = a_1, \ldots, a_N$, the joint probability of $S$ factorizes as

$$P_{\theta}(S) = P_{\theta}(a_1) \cdot \prod_{t=2}^{N} P_{\theta}(a_t \mid a_{t-1}, \ldots, a_1)$$
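As a tiny worked example of this chain rule, with made-up conditional probabilities for a three-token sequence:

import math

# Made-up per-step conditionals: P(a1), P(a2 | a1), P(a3 | a2, a1).
step_probs = [0.2, 0.5, 0.1]

# Summing log-probabilities avoids underflow on long sequences.
log_p = sum(math.log(p) for p in step_probs)
print(math.exp(log_p))   # ~0.01 = 0.2 * 0.5 * 0.1, the joint P(S)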
Given a sequence of Python code tokens, the task is to predict the next $M$ tokens:
$$\underset{a_{t+1}, \ldots, a_{t+M}}{\arg\max}\; P_{\theta}(a_1, \ldots, a_t, a_{t+1}, \ldots, a_{t+M})$$
The conditional probabilities are estimated with an LSTM:
$$P_{\theta}(a_t = \tau \mid a_{t-1}, \ldots, a_1) = \frac{\exp(\boldsymbol{v}_{\tau}^T \boldsymbol{h}_t + b_{\tau})}{\sum_{\tau'} \exp(\boldsymbol{v}_{\tau'}^T \boldsymbol{h}_t + b_{\tau'})}$$
where $\boldsymbol{v}_{\tau}$ is the parameter vector of token $\tau$.
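Decoding the next $M$ tokens by exhaustive argmax is exponential in $M$; the code released with the paper approximates it with beam search (noted in the summary below). A minimal sketch, where next_log_probs is a hypothetical stand-in for a function that runs the LSTM on a prefix and returns per-token log-probabilities:

import heapq

def beam_search(prefix, next_log_probs, M, beam_width=5):
    # Each beam entry is (cumulative log-probability, generated tokens).
    beams = [(0.0, [])]
    for _ in range(M):
        candidates = []
        for score, tokens in beams:
            # next_log_probs: hypothetical, maps a token sequence to a
            # dict {token: log P(token | sequence)} from the LSTM.
            for token, lp in next_log_probs(prefix + tokens).items():
                candidates.append((score + lp, tokens + [token]))
        # Keep only the beam_width highest-scoring partial completions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])[1]   # best M-token completion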
Attention
$$\begin{aligned}
\boldsymbol{M}_t &= [\boldsymbol{m}_1 \ldots \boldsymbol{m}_K] &&\in \mathbb{R}^{k \times K}\\
\boldsymbol{G}_t &= \tanh\!\left(\boldsymbol{W}^M \boldsymbol{M}_t + \boldsymbol{1}_K^T (\boldsymbol{W}^h \boldsymbol{h}_t)\right) &&\in \mathbb{R}^{k \times K}\\
\boldsymbol{\alpha}_t &= \mathrm{softmax}(\boldsymbol{w}^T \boldsymbol{G}_t) &&\in \mathbb{R}^{1 \times K}\\
\boldsymbol{c}_t &= \boldsymbol{M}_t \boldsymbol{\alpha}_t^T &&\in \mathbb{R}^k
\end{aligned}$$
where $\boldsymbol{M}_t$ is a memory of length $K$.
$$\begin{aligned}
\boldsymbol{n}_t &= \tanh\!\left(\boldsymbol{W}^A \begin{bmatrix} \boldsymbol{h}_t\\ \boldsymbol{c}_t \end{bmatrix}\right) &&\in \mathbb{R}^k\\
\boldsymbol{y}_t &= \mathrm{softmax}(\boldsymbol{W}^V \boldsymbol{n}_t + \boldsymbol{b}^V) &&\in \mathbb{R}^{|V|}
\end{aligned}$$
where $\boldsymbol{y}_t$ is the computed probability distribution over the predicted token.
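A minimal NumPy sketch of the attention and output equations above, with toy dimensions and random matrices standing in for the learned parameters:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, K, V = 4, 3, 10                     # hidden size, memory length, |V| (toy)
rng = np.random.default_rng(0)

# Random stand-ins for the learned parameters.
W_M = rng.normal(size=(k, k))
W_h = rng.normal(size=(k, k))
w = rng.normal(size=k)
W_A = rng.normal(size=(k, 2 * k))
W_V = rng.normal(size=(V, k))
b_V = np.zeros(V)

h_t = rng.normal(size=k)               # current LSTM hidden state
M_t = rng.normal(size=(k, K))          # memory M_t = [m_1 ... m_K]

G_t = np.tanh(W_M @ M_t + (W_h @ h_t)[:, None])  # (k, K); broadcast = 1_K^T term
alpha_t = softmax(w @ G_t)                       # (K,)  attention weights
c_t = M_t @ alpha_t                              # (k,)  context vector

n_t = np.tanh(W_A @ np.concatenate([h_t, c_t]))  # (k,)
y_t = softmax(W_V @ n_t + b_V)                   # (|V|,) next-token distribution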
Sparse Pointer Network
$$\begin{aligned}
\boldsymbol{s}_t[i] &= \begin{cases}
\boldsymbol{\alpha}_t[j] & \text{if } \boldsymbol{m}_t[j] = i\\
-C & \text{otherwise}
\end{cases}\\
\boldsymbol{i}_t &= \mathrm{softmax}(\boldsymbol{s}_t) \in \mathbb{R}^{|V|}
\end{aligned}$$
This yields a pseudo-sparse distribution over the global vocabulary,
where $-C$ is a very small constant (a large negative number), and $\boldsymbol{m}_t = [\mathrm{id}_1, \ldots, \mathrm{id}_K] \in \mathbb{N}^K$ holds the vocabulary IDs of the identifiers in the memory.
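A minimal NumPy sketch of this scatter step, with toy values; m_t holds the vocabulary IDs of the K identifiers currently in memory (duplicate IDs would need extra care):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, C = 10, 1000.0
alpha_t = np.array([0.7, 0.2, 0.1])    # attention over the K = 3 memory slots
m_t = np.array([4, 8, 2])              # vocab IDs of the identifiers in memory

s_t = np.full(V, -C)                   # -C everywhere ...
s_t[m_t] = alpha_t                     # ... except at the remembered IDs
i_t = softmax(s_t)                     # pseudo-sparse distribution over V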
The language model's own distribution over the vocabulary is

$$\boldsymbol{y}_t = \mathrm{softmax}(\boldsymbol{W}^V \boldsymbol{h}_t + \boldsymbol{b}^V) \quad \in \mathbb{R}^{|V|}$$

The two distributions are then mixed by a controller:
$$\begin{aligned}
\boldsymbol{h}_t^{\lambda} &= \begin{bmatrix} \boldsymbol{h}_t\\ \boldsymbol{x}_t\\ \boldsymbol{c}_t \end{bmatrix} &&\in \mathbb{R}^{3k}\\
\boldsymbol{\lambda}_t &= \mathrm{softmax}\!\left(\boldsymbol{W}^{\lambda} \boldsymbol{h}_t^{\lambda} + \boldsymbol{b}^{\lambda}\right) &&\in \mathbb{R}^2\\
\boldsymbol{y}_t^{\ast} &= [\boldsymbol{y}_t \;\, \boldsymbol{i}_t]\, \boldsymbol{\lambda}_t &&\in \mathbb{R}^{|V|}
\end{aligned}$$
where $\boldsymbol{x}_t$ is the representation of the input token, $\boldsymbol{c}_t$ is the attention context vector computed above, and $\boldsymbol{W}^{\lambda} \in \mathbb{R}^{2 \times 3k}$.
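A sketch of this controller computation, again with random stand-ins for the learned weights and for the two distributions being mixed:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, V = 4, 10
rng = np.random.default_rng(1)

h_t = rng.normal(size=k)               # LSTM hidden state
x_t = rng.normal(size=k)               # input-token representation
c_t = rng.normal(size=k)               # attention context vector
y_t = softmax(rng.normal(size=V))      # language-model distribution
i_t = softmax(rng.normal(size=V))      # pointer distribution

W_lam = rng.normal(size=(2, 3 * k))
b_lam = np.zeros(2)

h_lam = np.concatenate([h_t, x_t, c_t])        # (3k,)
lam_t = softmax(W_lam @ h_lam + b_lam)         # (2,) mixture weights
y_star = np.column_stack([y_t, i_t]) @ lam_t   # (|V|,) final distribution

Since lam_t sums to one, y_star is a convex combination of the two inputs and therefore itself a valid probability distribution over the vocabulary.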
Training
Data preprocessing
Identifiers are replaced by an identifier group plus a number; numeric literals are replaced with the $NUM$ token, and out-of-vocabulary tokens with $OOV$. A sketch follows.
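A rough sketch of such normalization using Python's built-in tokenize module; the single identifier group here is my own simplification for illustration (the paper distinguishes several groups), and the paper's actual pipeline is more involved:

import io
import keyword
import tokenize

SKIP = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)

def normalize(source):
    ids = {}   # identifier -> numbered placeholder, e.g. 'identifier1'
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            ids.setdefault(tok.string, "identifier%d" % (len(ids) + 1))
            out.append(ids[tok.string])         # normalized identifier
        elif tok.type == tokenize.NUMBER:
            out.append("$NUM$")                 # all numeric literals
        elif tok.type in (tokenize.NAME, tokenize.OP):
            out.append(tok.string)              # keywords and operators
        else:
            out.append("$OOV$")                 # e.g. string literals
    return out

print(normalize("def area(r):\n    return 3.14 * r * r\n"))
# ['def', 'identifier1', '(', 'identifier2', ')', ':',
#  'return', '$NUM$', '*', 'identifier2', '*', 'identifier2']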
Results
$$\begin{array}{lccccccccc}
\hline
\text{Model} & \text{Train PP} & \text{Dev PP} & \text{Test PP} & \multicolumn{3}{c}{\text{Acc}\ [\%]} & \multicolumn{3}{c}{\text{Acc@5}\ [\%]}\\
 & & & & \text{All} & \text{IDs} & \text{Other} & \text{All} & \text{IDs} & \text{Other}\\
\hline
\text{3-gram} & 12.90 & 24.19 & 26.90 & 13.19 & - & - & 50.81 & - & -\\
\text{4-gram} & 7.60 & 21.07 & 23.85 & 13.68 & - & - & 51.26 & - & -\\
\text{5-gram} & 4.52 & 19.33 & 21.22 & 13.90 & - & - & 51.49 & - & -\\
\text{6-gram} & 3.37 & 18.73 & 20.17 & 14.51 & - & - & 51.76 & - & -\\
\hline
\text{LSTM} & 9.29 & 13.08 & 14.01 & 57.91 & 2.1 & 62.8 & 76.30 & 4.5 & 82.6\\
\text{LSTM w/ Attn 20} & 7.30 & 11.07 & 11.74 & 61.30 & 21.4 & 64.8 & 79.32 & 29.9 & 83.7\\
\text{LSTM w/ Attn 50} & 7.09 & 9.83 & 10.05 & \mathbf{63.21} & \mathbf{30.2} & \mathbf{65.3} & 81.69 & 41.3 & 84.1\\
\hline
\text{This paper} & 6.41 & \mathbf{9.40} & \mathbf{9.18} & 62.97 & 27.3 & 64.9 & \mathbf{82.62} & \mathbf{43.6} & \mathbf{84.5}\\
\hline
\end{array}$$
Problems encountered
When crawling the data with the scraper the authors provide, the following error occurred:
Traceback (most recent call last):
  File "github-scraper/scraper.py", line 143, in <module>
    main(sys.argv[1:])
  File "github-scraper/scraper.py", line 130, in main
    repos = create_repos(dbFile)
  File "github-scraper/scraper.py", line 59, in create_repos
    repos = pickle.load(infile)
AttributeError: Can't get attribute 'UTC' on <module 'github3.utils' from '/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/github3/utils.py'>
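This AttributeError usually means the pickle file was produced with a different version of github3.py, one in which github3.utils still defined a UTC class; pickle.load then fails because that attribute no longer exists in the installed version. A hedged workaround sketch, assuming the pickled objects only need a tzinfo-compatible UTC (the file name is hypothetical):

import datetime
import pickle

import github3.utils

class UTC(datetime.tzinfo):
    # Stub for the UTC class that the older github3.py defined.
    def utcoffset(self, dt):
        return datetime.timedelta(0)
    def tzname(self, dt):
        return "UTC"
    def dst(self, dt):
        return datetime.timedelta(0)

# Re-register the missing attribute so pickle.load can resolve it.
github3.utils.UTC = UTC

with open("repos.pkl", "rb") as infile:   # hypothetical file name
    repos = pickle.load(infile)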
Also, when running the code from GitHub, some packages were not in the locations the scripts expected; I am not sure why.
Summary
The model in this paper is easy to understand: LSTM + attention + pointer. The difficulty, as usual, lies in the data processing. The authors provide neither the raw data nor the processed pkl files, which makes reproduction inconvenient. Looking through the code files, they also implement beam search. All in all, for big-code work, data processing seems to be the hard and tedious part.
$$\begin{aligned}
\boldsymbol{s}_t[i] &= \begin{cases}
\boldsymbol{\alpha}_t[j] & \text{if } \boldsymbol{m}_t[j] = i\\
-C & \text{otherwise}
\end{cases}\\
\boldsymbol{i}_t &= \mathrm{softmax}(\boldsymbol{s}_t) \in \mathbb{R}^{|V|}
\end{aligned}$$
I did not fully understand what $\boldsymbol{m}_t[j] = i$ means here. (My reading, from the definitions above: memory slot $j$ stores the identifier whose vocabulary ID is $i$, so its attention weight $\boldsymbol{\alpha}_t[j]$ is copied into position $i$ of the vocabulary-sized score vector $\boldsymbol{s}_t$.)
References