big code: Neural Code Completion [ICLR 2017]

Paper: Neural Code Completion

Authors: Chang Liu, Xin Wang

Affiliation: University of California, Berkeley

Venue: ICLR 2017 (oddly, the OpenReview page lists the submission as rejected)

A demo

(demo figure omitted)

Model equations

Embedding

$$E_i = A N_i + B T_i$$
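
A minimal PyTorch sketch of this embedding step; the vocabulary sizes, embedding width, and function name below are my own placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

# Assumed sizes: non-terminal/terminal vocabularies and embedding width (placeholders).
NUM_NONTERMINALS, NUM_TERMINALS, EMBED_DIM = 100, 50000, 300

# A and B from E_i = A N_i + B T_i, realised as embedding tables
# (an embedding lookup equals multiplying a one-hot vector by a matrix).
embed_N = nn.Embedding(NUM_NONTERMINALS, EMBED_DIM)  # plays the role of A
embed_T = nn.Embedding(NUM_TERMINALS, EMBED_DIM)     # plays the role of B

def embed_pair(n_idx: torch.Tensor, t_idx: torch.Tensor) -> torch.Tensor:
    """E_i = A N_i + B T_i for a batch of (non-terminal, terminal) indices."""
    return embed_N(n_idx) + embed_T(t_idx)
```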

LSTM

$$
\begin{aligned}
\begin{pmatrix} q \\ f \\ o \\ g \end{pmatrix} &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \mathbf{P}_{J,\,2J} \begin{pmatrix} \mathbf{x}_i \\ \mathbf{h}_{i-1} \end{pmatrix} \\
\mathbf{c}_i &= f \odot \mathbf{c}_{i-1} + q \odot g \\
\mathbf{h}_i &= o \odot \tanh\left(\mathbf{c}_i\right)
\end{aligned}
$$
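
Writing the gate equations out step by step, here is a sketch of a single LSTM step in PyTorch that mirrors the formula; treating the input and hidden sizes as both equal to J is my assumption about the dimensions:

```python
import torch
import torch.nn as nn

class PaperLSTMCell(nn.Module):
    """One step of the LSTM above: (q, f, o, g) = (sigma, sigma, sigma, tanh)(P [x_i; h_{i-1}])."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # P maps the concatenated [x_i; h_{i-1}] (size 2J) to the four gates (size 4J).
        self.P = nn.Linear(2 * hidden_size, 4 * hidden_size)

    def forward(self, x, h_prev, c_prev):
        gates = self.P(torch.cat([x, h_prev], dim=-1))
        q, f, o, g = gates.chunk(4, dim=-1)
        q, f, o, g = torch.sigmoid(q), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c_prev + q * g            # c_i = f * c_{i-1} + q * g (elementwise)
        h = o * torch.tanh(c)             # h_i = o * tanh(c_i)
        return h, c
```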

softmax

$$\hat{N}_{k+1} = \mathrm{softmax}\left(W_N h_k + b_N\right)$$

Model diagrams

NT2N

Using the sequence of (Non-terminal, Terminal) pairs to predict the next Non-terminal.

Use N and T to predict N.

(figure omitted)

N and T are embedded, fed through the LSTM, and then a fully-connected layer followed by a softmax gives the classification.
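
Putting the pieces together, a minimal sketch of the NT2N pipeline (embedding, LSTM, FC, softmax); all layer sizes and names are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class NT2N(nn.Module):
    """Predict the next non-terminal from a segment of (N, T) pairs."""
    def __init__(self, num_N: int, num_T: int, embed_dim: int, hidden: int):
        super().__init__()
        self.embed_N = nn.Embedding(num_N, embed_dim)   # A
        self.embed_T = nn.Embedding(num_T, embed_dim)   # B
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_N)              # W_N, b_N

    def forward(self, n_seq, t_seq):
        # n_seq, t_seq: (batch, segment_length) index tensors
        x = self.embed_N(n_seq) + self.embed_T(t_seq)   # E_i = A N_i + B T_i
        h, _ = self.lstm(x)
        logits = self.fc(h[:, -1])                      # W_N h_k + b_N at the last step
        return torch.log_softmax(logits, dim=-1)        # distribution over next N

# Example usage (hypothetical sizes): NT2N(100, 50000, 300, 1000)(n_seq, t_seq)
```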

NTN2T

Using the (Non-terminal, Terminal) pair sequence plus the given next Non-terminal to predict the next Terminal.

Use N and T, together with the next N, to predict T.

(figure omitted)

Compared with NT2N, a linear transformation of the next N is added after the FC layer, and then a softmax gives the classification.
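
One way to realise that modification in the same sketch style: feed the embedding of the given next N through an extra linear layer and add it to the FC output before the softmax over terminals. This is my reading of the description above, not necessarily the authors' exact wiring:

```python
import torch
import torch.nn as nn

class NTN2T(nn.Module):
    """Predict the next terminal from (N, T) pairs plus the next non-terminal."""
    def __init__(self, num_N: int, num_T: int, embed_dim: int, hidden: int):
        super().__init__()
        self.embed_N = nn.Embedding(num_N, embed_dim)
        self.embed_T = nn.Embedding(num_T, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, hidden)
        self.next_N = nn.Linear(embed_dim, hidden)   # extra linear transform of the given next N
        self.out = nn.Linear(hidden, num_T)

    def forward(self, n_seq, t_seq, next_n):
        x = self.embed_N(n_seq) + self.embed_T(t_seq)
        h, _ = self.lstm(x)
        mixed = self.fc(h[:, -1]) + self.next_N(self.embed_N(next_n))
        return torch.log_softmax(self.out(mixed), dim=-1)   # distribution over terminals
```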

Other variants

The other x2y variants I'll skip here.

Training

Code I found: GitHub

In that code I couldn't find where the data preprocessing happens, and since the GitHub repo is a whole system rather than just the model, it's hard to tell exactly what each part does.

The model itself is easy to build in PyTorch, but if the data cannot be prepared properly, that effort is wasted.

I also did not fully understand the data processing used here.

(figure omitted)
My current impression is that programs are padded with EOF tokens and then split into segments matching the LSTM input size.

Quoting the paper:

We divide each program into segments consisting of s consecutive tokens. The
last segment of a program, which may not be full, is padded with ⟨EOF⟩ tokens.
We coalesce multiple epochs together. We organize all training data into b
buckets. In each epoch, we randomly shuffle all programs in the training data
to construct a queue. Whenever a bucket is empty, a program is popped from the
queue and all segments of the program are inserted into the empty bucket
sequentially. When the queue becomes empty, i.e., the current epoch finishes,
all programs are re-shuffled randomly to reconstruct the queue. Each mini-batch
is formed by b segments, i.e., one segment popped from each bucket. When the
training data has been shuffled for e = 8 times, i.e., e epochs are inserted
into the bucket, we stop adding whole programs, and start adding only the first
segment of each program: when a bucket is empty, a program is chosen randomly,
and its first segment is added to the bucket. We terminate the training process
when all buckets are empty at the same time. That is, all programs from the
first 8 epochs have been trained.
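
As a rough reconstruction of the segmentation and bucket batching described above (the ⟨EOF⟩ token id, segment length s, and bucket count b are placeholders; the multi-epoch coalescing and the first-segment refill are omitted for brevity):

```python
import random

EOF = 0          # placeholder id for the <EOF> padding token (assumption)
s, b = 50, 32    # segment length and number of buckets (placeholder values)

def segments(program, s, eof=EOF):
    """Split one program (a list of token ids) into length-s segments,
    padding the last, possibly short, segment with <EOF>."""
    out = []
    for i in range(0, len(program), s):
        seg = program[i:i + s]
        out.append(seg + [eof] * (s - len(seg)))
    return out

def one_epoch_batches(programs, s, b):
    """One epoch of the bucket scheme: shuffle programs into a queue, keep b
    buckets filled with the segments of whole programs, and pop one segment
    per non-empty bucket to form each mini-batch."""
    queue = random.sample(programs, len(programs))   # shuffled copy of the programs
    buckets = [[] for _ in range(b)]
    while queue or any(buckets):
        for bucket in buckets:
            if not bucket and queue:
                bucket.extend(segments(queue.pop(), s))
        batch = [bucket.pop(0) for bucket in buckets if bucket]
        if batch:
            yield batch
```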

Paper reviews

Papers mentioned

  • Toward Deep Learning Software Repositories
  • Code Completion with Statistical Language Models
  • Structured Generative Models of Natural Source Code
  • Probabilistic Model for Code with Decision Trees
  • Grammar as a Foreign Language

Questions raised

  • Why serialize the AST into (N, T) pairs?
    Because prior work does the same, which makes comparison easier.
  • How is the completed code guaranteed to be syntactically valid?
    It is not guaranteed, but the accuracy is high (96%), which is good enough.
