JavaScript Type Inference (3) - Analyzing the Algorithm Model

Building the Training Model

In the previous installment we covered how to generate the training, test, and validation sets, as well as the vocabulary files.
These five files are the basic material for training:

files = {
    'train': { 'file': 'data/train.ctf', 'location': 0 },
    'valid': { 'file': 'data/valid.ctf', 'location': 0 },
    'test': { 'file': 'data/test.ctf', 'location': 0 },
    'source': { 'file': 'data/source_wl', 'location': 1 },
    'target': { 'file': 'data/target_wl', 'location': 1 }
}

The vocabularies need to be converted into hash tables (dictionaries):

# load dictionaries
source_wl = [line.rstrip('\n') for line in open(files['source']['file'])]
target_wl = [line.rstrip('\n') for line in open(files['target']['file'])]
source_dict = {source_wl[i]:i for i in range(len(source_wl))}
target_dict = {target_wl[i]:i for i in range(len(target_wl))}
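
For a quick sanity check, here is a small sketch (not part of the original pipeline; the token name is hypothetical) of how these dictionaries map a token to the index that the sparse one-hot encoding in the CTF files refers to:

import numpy as np

# Look up a (hypothetical) source token and build the corresponding one-hot row.
# In the real pipeline this encoding is produced by the CTF reader, not by hand.
idx = source_dict.get('NumericLiteral', 0)
one_hot = np.zeros(len(source_dict), dtype=np.float32)
one_hot[idx] = 1.0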

Next come some global parameters:

# number of words in the vocabulary and number of output (type) labels
vocab_size = len(source_dict)
num_labels = len(target_dict)
epoch_size = 17.955*1000*1000
minibatch_size = 5000
emb_dim = 300
hidden_dim = 650
num_epochs = 10

Next we define three variables, x, y, and t, whose dimensions correspond to the input vocabulary size, the number of output labels, and the hidden dimension. t will later carry the enhanced features produced by enhance_data and fed back into the model:

# Create the containers for input feature (x) and the label (y)
x = C.sequence.input_variable(vocab_size, name="x")
y = C.sequence.input_variable(num_labels, name="y")
t = C.sequence.input_variable(hidden_dim, name="t")

Now let's walk through the training flow:

model = create_model()
enc, dec = model(x, t)
trainer = create_trainer()
train()

The Training Model

The first layer is a word embedding:

def create_model():
    embed = C.layers.Embedding(emb_dim, name='embed')

Then come two bidirectional recurrent networks (built from GRUs), a fully connected layer, and a dropout layer:

    encoder = BiRecurrence(C.layers.GRU(hidden_dim//2), C.layers.GRU(hidden_dim//2))
    recoder = BiRecurrence(C.layers.GRU(hidden_dim//2), C.layers.GRU(hidden_dim//2))
    project = C.layers.Dense(num_labels, name='classify')
    do = C.layers.Dropout(0.5)

Then the layers above are combined, together with the embedding, into the model:

    def recode(x, t):
        inp = embed(x)
        inp = C.layers.LayerNormalization()(inp)
        
        enc = encoder(inp)
        rec = recoder(enc + t)
        proj = project(do(rec))
        
        dec = C.ops.softmax(proj)
        return enc, dec
    return recode

The bidirectional recurrent network is defined as follows:

def BiRecurrence(fwd, bwd):
    F = C.layers.Recurrence(fwd)
    G = C.layers.Recurrence(bwd, go_backwards=True)
    x = C.placeholder()
    apply_x = C.splice(F(x), G(x))
    return apply_x
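
Note that C.splice concatenates the forward and backward outputs along the feature axis, so two GRUs of width hidden_dim//2 together produce a hidden_dim-wide output at every step. A minimal sketch (assuming the definitions above) to verify the dimensionality:

# Each direction contributes hidden_dim//2 features; splice concatenates them.
birnn = BiRecurrence(C.layers.GRU(hidden_dim//2), C.layers.GRU(hidden_dim//2))
out = birnn(C.sequence.input_variable(emb_dim))
print(out.output.shape)   # expected: (hidden_dim,)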

Building the Training Process

First we define the criterion. It consists of two parts: the cross-entropy loss and the classification error:

def criterion(model, labels):
    ce   = -C.reduce_sum(labels * C.ops.log(model))
    errs = C.classification_error(model, labels)
    return ce, errs
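
Since the model already applies softmax to its output, the cross entropy is written out by hand here. As a side note (a sketch, not from the original article), if the model returned the raw projection instead of softmax(proj), the same objective could be expressed with CNTK's built-in helper:

def criterion_with_logits(logits, labels):
    # Equivalent objective on the un-normalized projection (logits)
    ce   = C.cross_entropy_with_softmax(logits, labels)
    errs = C.classification_error(logits, labels)
    return ce, errs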

With the criterion in place, we train by gradient descent using the Adam optimizer with momentum:

def create_trainer():
    # Mask out positions whose label index is 0 so they do not contribute to the loss
    masked_dec = dec*C.ops.clip(C.ops.argmax(y), 0, 1)
    loss, label_error = criterion(masked_dec, y)
    loss *= C.ops.clip(C.ops.argmax(y), 0, 1)

    lr_schedule = C.learning_parameter_schedule_per_sample([1e-3]*2 + [5e-4]*2 + [1e-4], epoch_size=int(epoch_size))
    momentum_as_time_constant = C.momentum_as_time_constant_schedule(1000)
    learner = C.adam(parameters=dec.parameters,
                         lr=lr_schedule,
                         momentum=momentum_as_time_constant,
                         gradient_clipping_threshold_per_sample=15, 
                         gradient_clipping_with_truncation=True)

    progress_printer = C.logging.ProgressPrinter(tag='Training', num_epochs=num_epochs)
    trainer = C.Trainer(dec, (loss, label_error), learner, progress_printer)
    C.logging.log_number_of_parameters(dec)
    return trainer

Training

With the model defined, we can move on to training.
First we use the cntk.io module to define a data reader:

def create_reader(path, is_training):
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
            source      = C.io.StreamDef(field='S0', shape=vocab_size, is_sparse=True),
            slot_labels = C.io.StreamDef(field='S1', shape=num_labels, is_sparse=True)
    )), randomize=is_training, max_sweeps=C.io.INFINITELY_REPEAT if is_training else 1)
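
The reader expects CNTK's CTF text format, in which every token carries a sparse S0 (source word) entry and an S1 (label) entry. For illustration only (the indices are made up), a few lines of train.ctf would look roughly like this:

0 |S0 1055:1 |S1 12:1
0 |S0 9:1 |S1 3:1
1 |S0 266:1 |S1 12:1

The leading number is the sequence id; tokens that share an id belong to the same sequence.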

Then we use this reader to feed data into the training loop:

def train():
    train_reader = create_reader(files['train']['file'], is_training=True)
    step = 0
    pp = C.logging.ProgressPrinter(freq=10, tag='Training')
    for epoch in range(num_epochs):
        epoch_end = (epoch+1) * epoch_size
        while step < epoch_end:
            data = train_reader.next_minibatch(minibatch_size, input_map={
                x: train_reader.streams.source,
                y: train_reader.streams.slot_labels
            })
            # Enhance data
            enhance_data(data, enc)
            # Train model
            trainer.train_minibatch(data)
            pp.update_with_trainer(trainer, with_metric=True)
            step += data[y].num_samples
        pp.epoch_summary(with_metric=True)
        trainer.save_checkpoint("models/model-" + str(epoch + 1) + ".cntk")
        validate()
        evaluate()

The call to enhance_data in the code above deserves an explanation.
The data is not fed straight through the model; it goes through an extra enhancement step: for every sequence, the encoder's predictions at all occurrences of the same token are averaged, and those averages are fed back into the model as the extra input t:

def enhance_data(data, enc):
    guesses = enc.eval({x: data[x]})
    inputs = C.ops.argmax(x).eval({x: data[x]})
    tables = []
    for i in range(len(inputs)):
        ts = []
        table = {}
        counts = {}
        for j in range(len(inputs[i])):
            inp = int(inputs[i][j])
            if inp not in table:
                table[inp] = guesses[i][j]
                counts[inp] = 1
            else:
                table[inp] += guesses[i][j]
                counts[inp] += 1
        for inp in table:
            table[inp] /= counts[inp]
        for j in range(len(inputs[i])):
            inp = int(inputs[i][j])
            ts.append(table[inp])
        tables.append(np.array(np.float32(ts)))
    s = C.io.MinibatchSourceFromData(dict(t=(tables, C.layers.typing.Sequence[C.layers.typing.tensor])))
    mems = s.next_minibatch(minibatch_size)
    data[t] = mems[s.streams['t']]
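
To make the averaging concrete, here is a small standalone sketch with made-up numbers: every occurrence of the same input token within a sequence ends up with the mean of the encoder outputs observed at all of its occurrences.

import numpy as np

inputs  = [7, 3, 7]                           # token ids of one toy sequence
guesses = np.array([[1.0, 0.0],               # encoder output at position 0
                    [0.5, 0.5],               # encoder output at position 1
                    [0.0, 1.0]],              # encoder output at position 2
                   dtype=np.float32)

table, counts = {}, {}
for tok, g in zip(inputs, guesses):
    table[tok]  = table.get(tok, 0) + g
    counts[tok] = counts.get(tok, 0) + 1

enhanced = np.array([table[tok] / counts[tok] for tok in inputs])
# Positions 0 and 2 (token 7) both receive [0.5, 0.5], the mean of their guesses.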

Testing and Validation

The validation and test passes also need the data-enhancement step introduced above:

def validate():
    valid_reader = create_reader(files['valid']['file'], is_training=False)
    while True:
        data = valid_reader.next_minibatch(minibatch_size, input_map={
                x: valid_reader.streams.source,
                y: valid_reader.streams.slot_labels
        })
        if not data:
            break
        enhance_data(data, enc)
        trainer.test_minibatch(data)
    trainer.summarize_test_progress()

evaluate has exactly the same logic as validate; only the file it reads differs:

def evaluate():
    test_reader = create_reader(files['test']['file'], is_training=False)
    while True:
        data = test_reader.next_minibatch(minibatch_size, input_map={
            x: test_reader.streams.source,
            y: test_reader.streams.slot_labels
        })
        if not data:
            break
        # Enhance data
        enhance_data(data, enc)
        # Test model
        trainer.test_minibatch(data)
    trainer.summarize_test_progress()