A Step-by-Step Guide to Building AI-Powered WeChat Mini Programs (3): Loading Data

Original

云水木石

2020-02-21 21:32

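In the previous article, "A Step-by-Step Guide to Building AI-Powered WeChat Mini Programs (2): A Linear Regression Model", we hard-coded a small set of training data. For machine learning, that little data is not enough. Datasets can come from many sources; this article shows how to load one from the network.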

After reading this article, you will know:

How to load data from the network with fetch
How to normalize data

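Loading data from the network

This article uses a public dataset from the web, Boston House price, as its example. The dataset is available in several formats; to keep things simple, we start with JSON. I have put the house.json file on my personal website, and you can also find it in the source repository that accompanies this article.

JavaScript has a very convenient fetch API for retrieving data over the network. Unfortunately, for reasons unknown, this API is not available in WeChat mini programs. In "A Step-by-Step Guide to Building AI-Powered WeChat Mini Programs (1): Hello WeChat!", we had to import a fetch-wechat module in order to use tfjs; that module is in fact an implementation of the fetch API built on top of the WeChat mini program APIs. We can use the same module in our code to fetch data from the network.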

// fetch-wechat is the module imported in part (1) of this series.
const fetchWechat = require('fetch-wechat');

async function getData() {
  const fetch = fetchWechat.fetchFunc();
  const houseDataReq = await fetch('https://ilego.club/ai/dataset/house.json');
  const houseData = await houseDataReq.json();
  // Keep only price and room count, then drop incomplete records.
  const cleaned = houseData.map(house => ({
    price: house.Price,
    rooms: house.AvgAreaNumberofRooms,
  }))
    .filter(house => (house.price != null && house.rooms != null));

  return cleaned;
}

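After fetching the data, the code above does two things:

Housing prices depend on many factors. To simplify the problem, we assume the price depends only on the number of rooms, so only the room count and the price are kept.
Entries with no price or no room count are filtered out.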
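
To make the two cleaning steps concrete, here is a small sketch that applies the same map and filter to a few hand-written records. The sample records and the helper name cleanHouseData are made up for illustration; only the field names Price and AvgAreaNumberofRooms come from the code above.

```javascript
// Hypothetical helper: the same map/filter as in getData(), without the network call.
function cleanHouseData(houseData) {
  return houseData
    .map(house => ({
      price: house.Price,
      rooms: house.AvgAreaNumberofRooms,
    }))
    // `!= null` (loose comparison) rejects both null and undefined.
    .filter(house => house.price != null && house.rooms != null);
}

// Made-up sample records, not taken from house.json.
const sample = [
  { Price: 1000000, AvgAreaNumberofRooms: 6.5, Address: 'A' },
  { Price: null, AvgAreaNumberofRooms: 7.0 }, // price missing
  { Price: 800000 },                          // room count missing
];

const cleanedSample = cleanHouseData(sample);
console.log(cleanedSample); // [ { price: 1000000, rooms: 6.5 } ]
```

Only the first record survives, and its extra Address field is dropped by the map step.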

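Normalizing feature data

Normalization is a common data-processing technique in machine learning. Its purpose is to remove the effect of differing units and scales on the model, which usually helps training converge. The simplest approach is min-max scaling, which maps the data into the range [0, 1] using the formula: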

Xnorm = (X - Xmin) / (Xmax - Xmin)
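
As a quick worked example of this formula, the following sketch normalizes a plain JavaScript array without TensorFlow.js. The function name minMaxNormalize and the room counts are made up for illustration, and the handling of a zero range is an assumption of this sketch rather than something the article specifies.

```javascript
// Min-max normalization of a plain array: (x - min) / (max - min).
function minMaxNormalize(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  // Assumption: if all values are equal the range is 0, so map everything
  // to 0 instead of dividing by zero.
  return values.map(v => (range === 0 ? 0 : (v - min) / range));
}

const rooms = [4, 5, 6, 8]; // made-up room counts
const normalizedRooms = minMaxNormalize(rooms);
console.log(normalizedRooms); // [ 0, 0.25, 0.5, 1 ]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.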

Here is how the same idea is implemented with TensorFlow.js:

function convertToTensor(data) {
  return tf.tidy(() => {
    // Step 1. Shuffle the data
    tf.util.shuffle(data);
    // Step 2. Convert data to Tensor
    const inputs = data.map(d => d.rooms);
    const labels = data.map(d => d.price);
    const inputTensor = tf.tensor2d(inputs, [inputs.length, 1]);
    const labelTensor = tf.tensor2d(labels, [labels.length, 1]);
    // Step 3. Normalize the data to the range 0 - 1 using min-max scaling
    const inputMax = inputTensor.max();
    const inputMin = inputTensor.min();
    const labelMax = labelTensor.max();
    const labelMin = labelTensor.min();
    const normalizedInputs = inputTensor.sub(inputMin).div(inputMax.sub(inputMin));
    const normalizedLabels = labelTensor.sub(labelMin).div(labelMax.sub(labelMin));
    return {
      inputs: normalizedInputs,
      labels: normalizedLabels,
      // Return the min/max bounds so we can use them later.
      inputMax,
      inputMin,
      labelMax,
      labelMin,
    }
  });
}

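Step 1 shuffles the data randomly, which keeps training from depending on the original order of the records and helps the model generalize.

Step 2 converts the arrays into tensors.

Step 3 normalizes the data.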
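
To illustrate what an in-place shuffle does, here is a minimal sketch of the Fisher-Yates algorithm in plain JavaScript. It is only an illustration: shuffleInPlace is a hypothetical helper, not the source of tf.util.shuffle, and the records are made up. Because each record holds both rooms and price, shuffling the records never separates an input from its label.

```javascript
// Fisher-Yates shuffle: walk the array from the end and swap each element
// with a randomly chosen element at or before it. Mutates the array in place.
function shuffleInPlace(array) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [array[i], array[j]] = [array[j], array[i]];
  }
  return array;
}

// Made-up records for illustration.
const records = [
  { rooms: 4, price: 100 },
  { rooms: 5, price: 200 },
  { rooms: 6, price: 300 },
];
shuffleInPlace(records);
// The order is random, but every rooms value is still paired with its own price.
console.log(records);
```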

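Building and training the model

This step is the same as in the previous article, except that the model is modified slightly by adding one more layer: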

function createModel() {
  // Create a sequential model
  const model = tf.sequential();

  // Add a single hidden layer
  model.add(tf.layers.dense({ inputShape: [1], units: 1, useBias: true }));

  // Add an output layer
  model.add(tf.layers.dense({ units: 1, useBias: true }));
  return model;
}
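
As a rough picture of what this two-layer model computes: a dense layer with one unit, a bias and no activation function applies w * x + b to its input. The sketch below chains two such layers in plain JavaScript. The weights are arbitrary made-up numbers, and denseUnit and forward are hypothetical helper names, not part of TensorFlow.js.

```javascript
// One dense layer with a single unit and no activation: output = w * x + b.
function denseUnit(x, w, b) {
  return w * x + b;
}

// Made-up weights for illustration; real values are learned during training.
const hidden = { w: 2, b: 1 };
const output = { w: 3, b: -1 };

function forward(x) {
  const h = denseUnit(x, hidden.w, hidden.b);
  return denseUnit(h, output.w, output.b);
}

// Each layer has one weight and one bias, so there are 4 trainable parameters.
const paramCount = 2 + 2;

console.log(forward(2)); // 3 * (2 * 2 + 1) - 1 = 14
console.log(paramCount); // 4
```

Because neither layer uses an activation function, the composition is still a straight line in x; the extra layer changes how the parameters are arranged, not the kind of function that can be fitted.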

Next we train the model. Because there is a fair amount of data, training on all of it at once could run out of memory, so we specify a batch size:

async function trainModel(model, inputs, labels) {
  // Prepare the model for training.
  model.compile({
    optimizer: tf.train.adam(),
    loss: tf.losses.meanSquaredError,
    metrics: ['mse'],
  });

  const batchSize = 28;
  const epochs = 50;

  return await model.fit(inputs, labels, {
    batchSize,
    epochs
  });
}

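Note that the optimizer and loss function are specified differently from the previous article: they are passed directly rather than as strings. Both approaches work, so choose whichever you prefer.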
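
To make the effect of the batch size concrete, the following sketch splits a list of samples into batches in plain JavaScript. It only illustrates the idea of processing the data in chunks; toBatches is a hypothetical helper and the sample count of 100 is made up, so this is not how TensorFlow.js slices data internally.

```javascript
// Split an array into consecutive batches of at most batchSize items.
function toBatches(items, batchSize) {
  const batches = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Made-up example: 100 samples with the batch size used above.
const samples = Array.from({ length: 100 }, (_, i) => i);
const batches = toBatches(samples, 28);

console.log(batches.length);             // 4
console.log(batches.map(b => b.length)); // [ 28, 28, 28, 16 ]
```

With 100 samples and a batch size of 28, one pass over the data takes four steps, the last one on a smaller batch.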

Inference

Note that because the model was trained on normalized data, the input must be normalized at inference time, and the result must be denormalized:

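    // inputMin, inputMax, labelMin and labelMax are the tensors returned by convertToTensor().
    const inputTensor = tf.tensor2d([5], [1, 1]);
    const normalizedInputs = inputTensor.sub(inputMin).div(inputMax.sub(inputMin));
    const preds = model.predict(normalizedInputs);
    // Undo the min-max scaling to get the prediction back in price units.
    const unNormPreds = preds.mul(labelMax.sub(labelMin)).add(labelMin);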
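
The arithmetic of this round trip can be checked with plain numbers. In the sketch below, the bounds and the model output are made-up values standing in for the tensors above, and normalize and denormalize are hypothetical helper names.

```javascript
// Made-up bounds standing in for inputMin/inputMax and labelMin/labelMax.
const bounds = { inputMin: 3, inputMax: 11, labelMin: 100000, labelMax: 2100000 };

// Same formula as used for training: (x - min) / (max - min).
const normalize = (x, min, max) => (x - min) / (max - min);
// Inverse of normalize: y * (max - min) + min.
const denormalize = (y, min, max) => y * (max - min) + min;

const normalizedInput = normalize(5, bounds.inputMin, bounds.inputMax);
console.log(normalizedInput); // 0.25

// Pretend the model returned 0.5 for this input (a made-up value).
const fakePrediction = 0.5;
const price = denormalize(fakePrediction, bounds.labelMin, bounds.labelMax);
console.log(price); // 1100000
```

The input is scaled with the input bounds, while the prediction is scaled back with the label bounds; mixing the two up would give a price in the wrong range.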

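Summary

This article showed how to load a dataset from the network and preprocess it with min-max normalization. The example is simplified and still far from practical. In the next article I will present a slightly more complex example: handwritten digit recognition. If you have any suggestions, feel free to leave a comment.

The source code for this series is available at: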

https://github.com/mogotech/wechat-tfjs-examples
