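JavaScript Type Inference (2): Let's Start Training
Preparing the Training Data
In this part we preprocess the type information collected in the previous installment and turn it into data we can train on.
The code lives in GetTypes.js, which creates three related directories: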
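// Excerpt from GetTypes.js; assumes Node's `fs` module is already imported.
let root = "data/Repos-cleaned";
let outputDirGold = "data/outputs-gold/";
let outputDirAll = "data/outputs-all/";
let outputDirCheckJS = "data/outputs-checkjs";
// Note: mkdirSync throws if a directory already exists, and because all three
// calls share one try block, the first failure skips the remaining calls.
try {
    fs.mkdirSync(outputDirGold);
    fs.mkdirSync(outputDirAll);
    fs.mkdirSync(outputDirCheckJS);
}
catch (err) {
    console.log(err);
}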
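Of these, the outputs-all data is used for training. outputs-gold holds the type annotations that developers wrote by hand; this valuable data serves as the test set. outputs-checkjs is used for comparison against the results of the CheckJS tool.
The training data that is finally generated looks like this: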
let a = 0 ; let s = "s" ; console . log ( s ) ; O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O
class Test { public value : number ; constructor ( v ) { this . value = v ; } } let t = new Test ( 0 ) ; O $any$ O O $number$ O O O O O $number$ O O O O $number$ O $number$ O O O O $Test$ O O $any$ O O O O
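This is the same pairing we saw in the previous installment: the code tokens, followed by one type label per token.
You should already be familiar with how this part works, so we will not walk through the source code in detail.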
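To make the layout of the example lines above concrete, here is a minimal Python sketch that pairs each token with its label. It assumes a line holds N whitespace-separated tokens followed by N labels, as in the examples shown here; the separator used in the real output files may differ, and split_example is a hypothetical helper, not part of GetTypes.js.

```python
from typing import List, Tuple


def split_example(line: str) -> List[Tuple[str, str]]:
    """Pair each source token with its type label.

    Assumes the line is N whitespace-separated tokens followed by N labels.
    """
    parts = line.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected the same number of tokens and labels")
    half = len(parts) // 2
    return list(zip(parts[:half], parts[half:]))


# The first example line from this article.
example = ('let a = 0 ; let s = "s" ; console . log ( s ) ; '
           "O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O")

pairs = split_example(example)
# Keep only the tokens that carry a label other than O.
typed = [(tok, label) for tok, label in pairs if label != "O"]
print(typed)
```

Filtering out the O labels leaves only the identifiers that received a type, which is a quick way to eyeball whether a generated line looks sensible.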
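Splitting into Training and Test Sets
Once the training data is ready, we can run lexer.py to split it into training, validation, and test sets.
Here is how the split turns out when we take the first 68 projects as an example: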
File counts= 68
Processing 0: 0xProject__0x.js.json
Processing 1: 1backend__1backend.json
Processing 2: 2fd__graphdoc.json
Processing 3: 43081j__rar.js.json
Processing 4: 500tech__angular-tree-component.json
Processing 5: 5calls__5calls.json
Processing 6: 74th__vscode-vim.json
Processing 7: accounts-js__accounts.json
Processing 8: adriancarriger__angularfire2-offline.json
Processing 9: AFASSoftware__maquette.json
Processing 10: afrad__angular2-websocket.json
Processing 11: aggarwalankush__ionic-mosum.json
Processing 12: aggarwalankush__ionic-push-base.json
Processing 13: ahomu__Talkie.json
Processing 14: aikoven__typescript-fsa.json
Processing 15: aioutecism__amVim-for-VSCode.json
Processing 16: airbrake__airbrake-js.json
Processing 17: ajtoo__vscode-org-mode.json
Processing 18: akfish__node-vibrant.json
Processing 19: akserg__ng2-dnd.json
Processing 20: akserg__ng2-slim-loading-bar.json
Processing 21: akserg__ng2-toasty.json
Processing 22: alamgird__angular-next-starter-kit.json
Processing 23: Alberplz__angular2-color-picker.json
Processing 24: alefragnani__vscode-project-manager.json
Processing 25: alex3165__react-mapbox-gl.json
Processing 26: alexjlockwood__avocado.json
Processing 27: alexjlockwood__ShapeShifter.json
Processing 28: alexjoverm__tslint-config-prettier.json
Processing 29: alexjoverm__typescript-library-starter.json
Processing 30: AlexKhymenko__ngx-permissions.json
Processing 31: AlgusDark__bloomer.json
Processing 32: amcdnl__ngrx-actions.json
Processing 33: anandanand84__technicalindicators.json
Processing 34: andrei-markeev__ts2c.json
Processing 35: andrerpena__react-mde.json
Processing 36: andrucz__ionic2-rating.json
Processing 37: angular-redux__store.json
Processing 38: angular-ui__ui-router.json
Processing 39: angulartics__angulartics2.json
Processing 40: ant-design__ant-design-mobile.json
Processing 41: ant-design__ant-design.json
Processing 42: antivanov__js-crawler.json
Processing 43: APIs-guru__graphql-faker.json
Processing 44: APIs-guru__graphql-lodash.json
Processing 45: APIs-guru__graphql-voyager.json
Processing 46: appbaseio__mirage.json
Processing 47: arangodb__arangojs.json
Processing 48: argonjs__argon.json
Processing 49: arkon__ng-sidebar.json
Processing 50: artemsky__ng-snotify.json
Processing 51: artemsky__vue-snotify.json
Processing 52: artsy__emission.json
Processing 53: ascoders__gaea-editor.json
Processing 54: ascoders__react-native-image-viewer.json
Processing 55: ascoders__react-native-image-zoom.json
Processing 56: ashubham__webshot-factory.json
Processing 57: Asymmetrik__ngx-leaflet.json
Processing 58: atom-community__markdown-preview-plus.json
Processing 59: atom-haskell__ide-haskell.json
Processing 60: atom__atom-languageclient.json
Processing 61: aurelia__ux.json
Processing 62: aurelia__validation.json
Processing 63: auth0__angular2-jwt.json
Processing 64: avatsaev__angular-contacts-app-example.json
Processing 65: avatsaev__angular4-docker-example.json
Processing 66: aviabird__angularspree.json
Processing 67: Azure__kashti.json
Train projects: 54
Validation projects: 7
Test projects: 7
Train files: 2184
Validation files: 364
Test files: 187
Producing vocabularies
Size of source vocab: 3377
Size of target vocab: 707
Writing train/valid/test files
Overall tokens: 896479 train, 134374 valid and 60516 test
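The log shows that the split is made per project rather than per file, so all files of one project end up in the same set. The sketch below illustrates one way to do such a split in Python. It is an assumption for illustration, not lexer.py's actual logic: the 80% training fraction, the fixed seed, and the split_projects helper are all hypothetical. With 68 projects it happens to give 54/7/7, the same counts as in the log, though the real script may arrive at them differently.

```python
import random
from typing import Dict, List


def split_projects(projects: List[str], train_frac: float = 0.8,
                   seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle project names and cut them into train/valid/test groups."""
    shuffled = sorted(projects)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    rest = shuffled[n_train:]
    n_valid = len(rest) // 2  # the remainder is halved into valid and test
    return {
        "train": shuffled[:n_train],
        "valid": rest[:n_valid],
        "test": rest[n_valid:],
    }


# Hypothetical project names standing in for the 68 JSON files in the log.
projects = [f"project_{i}.json" for i in range(68)]
split = split_projects(projects)
print({name: len(items) for name, items in split.items()})
```

Splitting by project keeps near-duplicate files from the same codebase out of the test set, which would otherwise make the results look better than they are.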
This finally produces three files: train.txt, valid.txt and test.txt.
Let's take one line from them and look at its format:
<s> import 's' ; import { configure } from 's' ; import * as _UNKNOWN_ from 's' ; configure ( { adapter : new _UNKNOWN_ ( ) } ) ; </s> O O O O O O $any$ O O O O O O O $any$ O O O $any$ O O $any$ O O $any$ O O O O O O
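Again, it is the preprocessed source code, lined up with token type labels like those we generated in part one.
Two vocabulary files are generated as well: source_wl and target_wl.
source_wl is the vocabulary of source tokens, for example: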
.
(
)
,
;
:
{
}
's'
"s"
=
this
0
[
]
const
from
=>
import
null
return
if
export
let
expect
<
>
new
?
function
string
<s>
</s>
public
as
private
!
false
true
===
The last entry is _UNKNOWN_, which stands for any unknown word.
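As a rough illustration of how a vocabulary with an _UNKNOWN_ entry can be built and applied, here is a small Python sketch. The frequency threshold min_count and the helper names build_vocab and encode are assumptions made for this example; the rule lexer.py actually uses to decide which tokens stay in source_wl may be different.

```python
from collections import Counter
from typing import Iterable, List

UNKNOWN = "_UNKNOWN_"


def build_vocab(sequences: Iterable[List[str]], min_count: int = 2) -> List[str]:
    """Return tokens seen at least min_count times, most frequent first,
    with the unknown marker appended as the last entry."""
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = [tok for tok, n in counts.most_common() if n >= min_count]
    return vocab + [UNKNOWN]


def encode(tokens: List[str], vocab: List[str]) -> List[str]:
    """Replace every out-of-vocabulary token with the unknown marker."""
    known = set(vocab)
    return [tok if tok in known else UNKNOWN for tok in tokens]


corpus = [
    ["let", "a", "=", "0", ";"],
    ["let", "s", "=", '"s"', ";"],
]
vocab = build_vocab(corpus)
encoded = encode(["let", "enzyme", "=", "0", ";"], vocab)
print(vocab)
print(encoded)
```

Rare identifiers collapse into a single marker, which keeps the source vocabulary small; this is why _UNKNOWN_ shows up in the sample line from train.txt.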
target_wl is the vocabulary of types. Here are its first few lines:
O
$any$
$string$
$number$
$complex$
$void$
$boolean$
$any[]$
$string[]$
$number[]$
$Assertion$
$undefined$
${}$
$HTMLElement$
$Promise$
$ExpectStatic$
$Promise<any>$
$PromiseConstructor$
$Promise<void>$
$Element$
$this$
$ErrorConstructor$
$ZeroEx$
$Math$
$SignedOrder$
$Projection$
$JSON$
$JsApi$
$StockData$
$Console$
$VNode$
$T$
The first entry in the type vocabulary, O, stands for unknown.
In addition, test_projects.txt is generated, for example:
43081j__rar.js.json
adriancarriger__angularfire2-offline.json
aikoven__typescript-fsa.json
alexjoverm__tslint-config-prettier.json
AlgusDark__bloomer.json
andrerpena__react-mde.json
arangodb__arangojs.json
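Format Conversion
Before CNTK can process the data, we need to convert the txt files into the ctf format that CNTK expects.
The conversion tool is available in the CNTK repository: https://github.com/microsoft/CNTK/blob/master/Scripts/txt2ctf.py
The commands are shown below, using Windows PowerShell as the example; on other systems you can drop the full path and simply call python: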
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
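To give a feel for what the conversion does, the following Python sketch turns one token/label sequence into CTF-style rows. It is a simplified illustration, not the code of txt2ctf.py: it assumes that a word's index is its zero-based line number in the vocabulary file, that the two streams are named S0 and S1, and that both streams have the same length. The to_ctf helper and the toy vocabularies are hypothetical.

```python
from typing import Dict, List


def to_ctf(seq_id: int, tokens: List[str], labels: List[str],
           source_index: Dict[str, int], target_index: Dict[str, int]) -> List[str]:
    """Emit one row per token: the sequence id, then a sparse one-hot entry
    (index:1) for the source stream S0 and for the target stream S1."""
    rows = []
    for tok, label in zip(tokens, labels):
        rows.append(f"{seq_id}\t|S0 {source_index[tok]}:1\t|S1 {target_index[label]}:1")
    return rows


# Toy vocabularies; a word's index is assumed to be its line number in the file.
source_wl = ["let", "a", "=", "0", ";"]
target_wl = ["O", "$number$"]
source_index = {word: i for i, word in enumerate(source_wl)}
target_index = {word: i for i, word in enumerate(target_wl)}

rows = to_ctf(0, ["let", "a", "=", "0", ";"],
              ["O", "$number$", "O", "O", "O"],
              source_index, target_index)
print("\n".join(rows))
```

All rows that share a sequence id belong to the same sequence, which is how a whole source file stays together as one training example.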
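Training
With everything in place, we can run infer.py to start training.
Remember to install Microsoft's CNTK framework first.
Here are my training command and its output: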
C:\Python\Python36\python.exe .\infer.py
Selected GPU[0] GeForce GTX 960M as the process wide default device.
-------------------------------------------------------------------
Build info:
Built time: Apr 23 2019 21:50:08
Last modified date: Tue Apr 23 17:37:55 2019
Build type: Release
Build target: GPU
With ASGD: yes
Math lib: mkl
CUDA version: 10.0.0
CUDNN version: 7.6.2
Build Branch: HEAD
Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
Training 4597857 parameters in 21 parameter tensors.
-------------------------------------------------------------------
Build info:
Built time: Apr 23 2019 21:50:08
Last modified date: Tue Apr 23 17:37:55 2019
Build type: Release
Build target: GPU
With ASGD: yes
Math lib: mkl
CUDA version: 10.0.0
CUDNN version: 7.6.2
Build Branch: HEAD
Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
Learning rate per 1 samples: 0.001
Minibatch[ 1- 10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
Minibatch[ 11- 20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;
Minibatch[ 21- 30]: loss = 0.486434 * 42913, metric = 8.57% * 42913;
Minibatch[ 31- 40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;
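The progress lines are easy to post-process if you want to track training over time. The Python sketch below parses the four lines shown above and computes a sample-weighted mean of the loss. The regular expression is written against the line layout in this log only, and parse_progress and weighted_mean are hypothetical helpers, so treat it as a starting point rather than a general CNTK log parser.

```python
import re
from typing import List, Tuple

LINE = re.compile(
    r"Minibatch\[\s*(\d+)-\s*(\d+)\]: loss = ([\d.]+) \* (\d+), "
    r"metric = ([\d.]+)% \* (\d+);"
)


def parse_progress(log: str) -> List[Tuple[float, float, int]]:
    """Return (loss, metric_percent, samples) for each progress line."""
    return [(float(m.group(3)), float(m.group(5)), int(m.group(4)))
            for m in LINE.finditer(log)]


def weighted_mean(values: List[float], weights: List[int]) -> float:
    """Average the values, weighting each by its sample count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


log = """
Minibatch[ 1- 10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
Minibatch[ 11- 20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;
Minibatch[ 21- 30]: loss = 0.486434 * 42913, metric = 8.57% * 42913;
Minibatch[ 31- 40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;
"""

progress = parse_progress(log)
mean_loss = weighted_mean([p[0] for p in progress], [p[2] for p in progress])
print(progress)
print(mean_loss)
```

Weighting by the sample count matters because each group of ten minibatches covers a different number of tokens.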
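Evaluating the Results
In evaluation.py, set the model_file variable to the CNTK model file produced by the training step above, then run the script to evaluate how well training went.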
model_file = "models/model-1.cntk"
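One plausible way to score predictions for this task is token-level accuracy over the tokens that actually carry a type. The Python sketch below shows that idea on made-up labels. It is an assumption about what a useful metric looks like, not a description of what evaluation.py computes; typed_accuracy and the sample label lists are hypothetical.

```python
from typing import List


def typed_accuracy(gold: List[str], predicted: List[str], skip: str = "O") -> float:
    """Fraction of gold-typed tokens whose predicted label matches exactly.

    Tokens whose gold label equals `skip` are left out of the score.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    scored = [(g, p) for g, p in zip(gold, predicted) if g != skip]
    if not scored:
        return 0.0
    return sum(g == p for g, p in scored) / len(scored)


# Made-up labels: one of the two typed tokens is predicted correctly.
gold = ["O", "$number$", "O", "O", "O", "O", "$string$", "O", "O", "O"]
predicted = ["O", "$number$", "O", "O", "O", "O", "$any$", "O", "O", "O"]
accuracy = typed_accuracy(gold, predicted)
print(accuracy)
```

Leaving the O tokens out of the score avoids inflating the number, since most tokens in a line are punctuation and keywords that are trivially labeled O.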