JavaScript Type Inference (2): Let's Start Training

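Preparing the training data

Next, we preprocess the type information collected in the previous part and turn it into data we can train on.

The code is in GetTypes.js, which creates three output directories: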

let root = "data/Repos-cleaned";
let outputDirGold = "data/outputs-gold/";
let outputDirAll = "data/outputs-all/";
let outputDirCheckJS = "data/outputs-checkjs";
try {
    fs.mkdirSync(outputDirGold);
    fs.mkdirSync(outputDirAll);
    fs.mkdirSync(outputDirCheckJS);
}
catch (err) {
    console.log(err);
}

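The outputs-all data is used for training. outputs-gold stores the type information that developers annotated by hand; this valuable data is used as the test set. outputs-checkjs is used for comparison against the results of the checkJs tool.

The final training data looks like this: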

let a = 0 ; let s = "s" ; console . log ( s ) ;	O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O
class Test { public value : number ; constructor ( v ) { this . value = v ; } } let t = new Test ( 0 ) ;	O $any$ O O $number$ O O O O O $number$ O O O O $number$ O $number$ O O O O $Test$ O O $any$ O O O O

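This is the same pairing of code and tokens that we saw in the previous part: each line holds the tokenized source, a tab, and then one label per token, where O marks a token with no type label.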
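To make the line format concrete, here is a minimal Python sketch that splits one such line back into (token, label) pairs. The function names `parse_example` and `typed_tokens` are hypothetical and not part of the project; the sketch also assumes that tokens and labels contain no internal whitespace.

```python
def parse_example(line):
    """Split one 'tokens<TAB>labels' line into (token, label) pairs."""
    source, target = line.rstrip("\n").split("\t")
    tokens = source.split()
    labels = target.split()
    # Every token must have exactly one label.
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} labels")
    return list(zip(tokens, labels))


def typed_tokens(pairs):
    """Keep only the tokens that carry a type label (anything other than O)."""
    return [(tok, lab) for tok, lab in pairs if lab != "O"]


# The first sample line from the article.
example = (
    'let a = 0 ; let s = "s" ; console . log ( s ) ;'
    "\t"
    "O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O"
)
pairs = parse_example(example)
for token, label in typed_tokens(pairs):
    print(token, label)
```

For the sample line this lists a, s, console, log and s together with their labels, which is a quick way to check that the two columns stay aligned.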

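You should already be familiar with how this part works, so we won't walk through the source code in detail.

Splitting into training and test sets

Once the training data is ready, we can run lexer.py to split it into training, validation, and test sets.

Below is the split for the first 68 projects, as an example: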

File counts= 68
Processing 0: 0xProject__0x.js.json
Processing 1: 1backend__1backend.json
Processing 2: 2fd__graphdoc.json
Processing 3: 43081j__rar.js.json
Processing 4: 500tech__angular-tree-component.json
Processing 5: 5calls__5calls.json
Processing 6: 74th__vscode-vim.json
Processing 7: accounts-js__accounts.json
Processing 8: adriancarriger__angularfire2-offline.json
Processing 9: AFASSoftware__maquette.json
Processing 10: afrad__angular2-websocket.json
Processing 11: aggarwalankush__ionic-mosum.json
Processing 12: aggarwalankush__ionic-push-base.json
Processing 13: ahomu__Talkie.json
Processing 14: aikoven__typescript-fsa.json
Processing 15: aioutecism__amVim-for-VSCode.json
Processing 16: airbrake__airbrake-js.json
Processing 17: ajtoo__vscode-org-mode.json
Processing 18: akfish__node-vibrant.json
Processing 19: akserg__ng2-dnd.json
Processing 20: akserg__ng2-slim-loading-bar.json
Processing 21: akserg__ng2-toasty.json
Processing 22: alamgird__angular-next-starter-kit.json
Processing 23: Alberplz__angular2-color-picker.json
Processing 24: alefragnani__vscode-project-manager.json
Processing 25: alex3165__react-mapbox-gl.json
Processing 26: alexjlockwood__avocado.json
Processing 27: alexjlockwood__ShapeShifter.json
Processing 28: alexjoverm__tslint-config-prettier.json
Processing 29: alexjoverm__typescript-library-starter.json
Processing 30: AlexKhymenko__ngx-permissions.json
Processing 31: AlgusDark__bloomer.json
Processing 32: amcdnl__ngrx-actions.json
Processing 33: anandanand84__technicalindicators.json
Processing 34: andrei-markeev__ts2c.json
Processing 35: andrerpena__react-mde.json
Processing 36: andrucz__ionic2-rating.json
Processing 37: angular-redux__store.json
Processing 38: angular-ui__ui-router.json
Processing 39: angulartics__angulartics2.json
Processing 40: ant-design__ant-design-mobile.json
Processing 41: ant-design__ant-design.json
Processing 42: antivanov__js-crawler.json
Processing 43: APIs-guru__graphql-faker.json
Processing 44: APIs-guru__graphql-lodash.json
Processing 45: APIs-guru__graphql-voyager.json
Processing 46: appbaseio__mirage.json
Processing 47: arangodb__arangojs.json
Processing 48: argonjs__argon.json
Processing 49: arkon__ng-sidebar.json
Processing 50: artemsky__ng-snotify.json
Processing 51: artemsky__vue-snotify.json
Processing 52: artsy__emission.json
Processing 53: ascoders__gaea-editor.json
Processing 54: ascoders__react-native-image-viewer.json
Processing 55: ascoders__react-native-image-zoom.json
Processing 56: ashubham__webshot-factory.json
Processing 57: Asymmetrik__ngx-leaflet.json
Processing 58: atom-community__markdown-preview-plus.json
Processing 59: atom-haskell__ide-haskell.json
Processing 60: atom__atom-languageclient.json
Processing 61: aurelia__ux.json
Processing 62: aurelia__validation.json
Processing 63: auth0__angular2-jwt.json
Processing 64: avatsaev__angular-contacts-app-example.json
Processing 65: avatsaev__angular4-docker-example.json
Processing 66: aviabird__angularspree.json
Processing 67: Azure__kashti.json
Train projects: 54
Validation projects: 7
Test projects: 7
Train files: 2184
Validation files: 364
Test files: 187
Producing vocabularies
Size of source vocab: 3377
Size of target vocab: 707
Writing train/valid/test files
Overall tokens: 896479 train, 134374 valid and 60516 test
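
The log above shows that the split is made per project rather than per file, so files from one project never end up in both the training and the test set. As a rough sketch of that idea (not the actual lexer.py code), the following Python splits a list of project names. The 10% validation and test fractions, the rounding, and the fixed seed are assumptions; they happen to reproduce the 54/7/7 counts for 68 projects.

```python
import random


def split_projects(projects, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle project names and split them into train/valid/test lists."""
    shuffled = list(projects)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Placeholder names standing in for the 68 project JSON files.
projects = [f"project_{i}.json" for i in range(68)]
train, valid, test = split_projects(projects)
print(len(train), len(valid), len(test))  # prints "54 7 7"
```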

In the end, three files are generated: train.txt, valid.txt, and test.txt.

Let's take one line and look at its format:

<s> import 's' ; import { configure } from 's' ; import * as _UNKNOWN_ from 's' ; configure ( { adapter : new _UNKNOWN_ ( ) } ) ; </s>	O O O O O O $any$ O O O O O O O $any$ O O O $any$ O O $any$ O O $any$ O O O O O O

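Again, this is the preprocessed source code paired with the per-token type labels we generated in Part 1.

In addition, two vocabulary files are generated, source_wl and target_wl.
source_wl is the vocabulary of source tokens, for example: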

.
(
)
,
;
:
{
}
's'
"s"
=
this
0
[
]
const
from
=>
import
null
return
if
export
let
expect
<
>
new
?
function
string
<s>
</s>
public
as
private
!
false
true
===

The last entry is _UNKNOWN_, which stands for unknown words.

target_wl is the vocabulary of types. Here are its first few lines:

O
$any$
$string$
$number$
$complex$
$void$
$boolean$
$any[]$
$string[]$
$number[]$
$Assertion$
$undefined$
${}$
$HTMLElement$
$Promise$
$ExpectStatic$
$Promise<any>$
$PromiseConstructor$
$Promise<void>$
$Element$
$this$
$ErrorConstructor$
$ZeroEx$
$Math$
$SignedOrder$
$Projection$
$JSON$
$JsApi$
$StockData$
$Console$
$VNode$
$T$

The first entry in the type vocabulary, O, stands for unknown.
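
To illustrate how vocabulary files like these can be produced, here is a minimal Python sketch that counts tokens, keeps the frequent ones, and maps everything else to _UNKNOWN_. It is not the lexer.py implementation: the `min_count` cutoff, the frequency ordering, and the function names are assumptions made for the example.

```python
from collections import Counter

UNKNOWN = "_UNKNOWN_"


def build_vocab(sequences, min_count=2):
    """Return tokens seen at least min_count times, most frequent first,
    with the unknown-word marker appended as the last entry."""
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = [tok for tok, n in counts.most_common() if n >= min_count]
    vocab.append(UNKNOWN)
    return vocab


def replace_rare(tokens, vocab):
    """Replace every token that is not in the vocabulary with _UNKNOWN_."""
    known = set(vocab)
    return [tok if tok in known else UNKNOWN for tok in tokens]


sequences = [
    ["let", "a", "=", "0", ";"],
    ["let", "b", "=", "0", ";"],
]
vocab = build_vocab(sequences)
print(vocab)
print(replace_rare(["let", "b", "=", "0", ";"], vocab))
```

In this toy input the identifiers a and b each occur only once, so both fall below the cutoff and are written out as _UNKNOWN_, much like the _UNKNOWN_ tokens in the train.txt line shown earlier.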

Besides these, test_projects.txt is also generated, for example:

43081j__rar.js.json
adriancarriger__angularfire2-offline.json
aikoven__typescript-fsa.json
alexjoverm__tslint-config-prettier.json
AlgusDark__bloomer.json
andrerpena__react-mde.json
arangodb__arangojs.json

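Format conversion

Before handing the data to CNTK, we still need to convert the txt files into the CTF (CNTK Text Format) files that CNTK expects.

The conversion tool is available in the CNTK repository: https://github.com/microsoft/CNTK/blob/master/Scripts/txt2ctf.py

The commands are as follows, using Windows as the example. On other systems you don't need the full path; just call python directly: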

& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
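
Since the three commands differ only in the split name, they can also be generated in a loop. The Python sketch below builds the same argument lists using the interpreter that runs the script, which avoids hard-coding a Windows path. The helper names `txt2ctf_command` and `convert_all` are hypothetical; the flags are the ones used in the commands above, and txt2ctf.py is assumed to sit in the current directory.

```python
import subprocess
import sys


def txt2ctf_command(split, data_dir="data", script="txt2ctf.py"):
    """Build the txt2ctf.py argument list for one split (train, valid or test)."""
    return [
        sys.executable, script,
        "--map", f"{data_dir}/source_wl", f"{data_dir}/target_wl",
        "--input", f"{data_dir}/{split}.txt",
        "--output", f"{data_dir}/{split}.ctf",
    ]


def convert_all(splits=("train", "valid", "test")):
    """Run the conversion for every split, stopping at the first failure."""
    for split in splits:
        subprocess.run(txt2ctf_command(split), check=True)


# Only print the commands here; call convert_all() to actually run them.
for split in ("train", "valid", "test"):
    print(" ".join(txt2ctf_command(split)))
```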

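Training

With everything in place, we can run infer.py to train the model.
Remember to install Microsoft's CNTK framework first.

Here is my training command and its output: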

C:\Python\Python36\python.exe .\infer.py
Selected GPU[0] GeForce GTX 960M as the process wide default device.
-------------------------------------------------------------------
Build info:

                Built time: Apr 23 2019 21:50:08
                Last modified date: Tue Apr 23 17:37:55 2019
                Build type: Release
                Build target: GPU
                With ASGD: yes
                Math lib: mkl
                CUDA version: 10.0.0
                CUDNN version: 7.6.2
                Build Branch: HEAD
                Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
Training 4597857 parameters in 21 parameter tensors.
-------------------------------------------------------------------
Build info:

                Built time: Apr 23 2019 21:50:08
                Last modified date: Tue Apr 23 17:37:55 2019
                Build type: Release
                Build target: GPU
                With ASGD: yes
                Math lib: mkl
                CUDA version: 10.0.0
                CUDNN version: 7.6.2
                Build Branch: HEAD
                Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
Learning rate per 1 samples: 0.001
 Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
 Minibatch[  11-  20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;
 Minibatch[  21-  30]: loss = 0.486434 * 42913, metric = 8.57% * 42913;
 Minibatch[  31-  40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;
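
To follow training progress over a long run, it can help to pull the numbers out of these Minibatch lines. The Python sketch below parses the line format exactly as it appears in the output above; the function name and the dictionary keys are hypothetical. The metric column is presumably the classification error rate, but that is an assumption, so the sketch only records it as a percentage.

```python
import re

# Matches lines such as:
#  Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
MINIBATCH_RE = re.compile(
    r"Minibatch\[\s*(\d+)-\s*(\d+)\]: "
    r"loss = ([\d.]+) \* (\d+), metric = ([\d.]+)% \* (\d+);"
)


def parse_minibatch_line(line):
    """Return the fields of one progress line, or None if it does not match."""
    match = MINIBATCH_RE.search(line)
    if match is None:
        return None
    first, last, loss, samples, metric, _ = match.groups()
    return {
        "first": int(first),
        "last": int(last),
        "loss": float(loss),
        "metric_percent": float(metric),
        "samples": int(samples),
    }


log_lines = [
    " Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;",
    " Minibatch[  11-  20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;",
    "Learning rate per 1 samples: 0.001",
]
records = [r for r in map(parse_minibatch_line, log_lines) if r is not None]
for record in records:
    print(record["last"], record["loss"], record["metric_percent"])
```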

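Evaluating the results

In evaluation.py, set the model_file variable to the .cntk file produced by the training step above, then run the script to evaluate how well the model was trained.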

model_file = "models/model-1.cntk"