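In most cases, MXNet deep learning programs are written through the Python interface, which is quick and convenient. Sometimes, however, the training and inference code has to be deployed inside a production application, such as a game or a cloud service, and writing that part in a compiled language such as C++ is what improves performance. This article describes how to build MXNet from source on Windows, install the Python package, and create an example program with the native C++ interface.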
Goals
- Build the GPU versions of libmxnet.lib and libmxnet.dll
- Install the MXNet Python package from source
- Build an MXNet C++ example program
Environment
- Windows 10
- VS2015
- CMake 3.7.2
- Miniconda2 (Python 2.7.14)
- CUDA 8.0
- MXNet 1.2
- OpenCV 3.4.1
- OpenBLAS-v0.2.19-Win64-int32
- cudnn-8.0-windows10-x64-v7.1 (not needed if you build the CPU version of MXNet)
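Steps
Download the source
It is best to download with git so that all dependent sub-repositories are fetched recursively; the source root directory is mxnet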
git clone --recursive https://github.com/dmlc/mxnet
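Dependencies
Before this step, make sure CMake and Python are installed correctly and added to the environment variables, then download the third-party dependencies
- Download and install CUDA; make sure the machine has an NVIDIA graphics card that supports CUDA. Address: https://developer.nvidia.com/cuda-toolkit
- Download and install the prebuilt OpenCV. Address: https://sourceforge.net/projects/opencvlibrary/files/opencv-win/3.4.1/opencv-3.4.1-vc14_vc15.exe/download
- Download the prebuilt OpenBLAS. Address: https://sourceforge.net/projects/openblas/files/v0.2.19/
- Download the prebuilt cuDNN, making sure it matches your CUDA version. Address: https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/8.0_20171129/cudnn-8.0-windows10-x64-v7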
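CMake configuration
Open cmake-gui, set the source directory and the build directory, and choose VS2015 Win64 as the compiler
Configure the third-party dependencies
Run Configure and Generate
Note: do not tick the cpppackage option at this step, otherwise the build fails. This option exists to generate the op.h header, but because of a bug in the official documentation, op.h currently cannot be generated automatically through the CMake build. Generating it manually is covered later.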
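Build the VS project
Open mxnet.sln, set the configuration to Release x64, and build the whole solution
When the build finishes, the MXNet lib and dll are generated in the corresponding folder
At this point the whole process is half done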
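Install the MXNet Python package
With libmxnet.dll available, the Python version of the MXNet package can be installed from source
The prerequisite, however, is to collect every other DLL it depends on. As shown in the figure, copy all of these DLLs into the mxnet/python/mxnet directory
Tip: where the DLLs come from
- The OpenCV, OpenBLAS and cuDNN DLLs are copied from the directories of those libraries
- libgcc_s_seh-1.dll and libwinpthread-1.dll are copied from a MinGW-related library directory; the git and Qt installation directories, among others, contain them
- libgfortran-3.dll and libquadmath_64-0.dll are copied from the adda library (https://github.com/adda-team/adda/releases); note that they need to be renamed
Then, in the mxnet/python directory, install the MXNet Python package from the command line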
python setup.py install
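During installation, Python automatically copies the corresponding DLLs to the installation directory. Once the installation completes normally, you can import mxnet in Python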
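Generate the C++ dependency header op.h
This step is essential for using the native C++ interface: its purpose is to generate the op.h file that MXNet C++ programs depend on
If you only want to build MXNet and use it from C++, the Python package installation above can be skipped
Copy all the required DLLs into the mxnet/cpp-package/scripts directory
Run the following command in this directory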
python OpWrapperGenerator.py libmxnet.dll
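Normally, op.h is then generated in the mxnet/cpp-package/include/mxnet-cpp directory
If errors appear during this process, the cause is most likely a missing DLL or a DLL of the wrong version, which is easy to fix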
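Build the C++ example program
Create a cpp project. The classic MNIST handwritten digit recognition training example is used here (download the MNIST data beforehand, address: mnist), with GPU support enabled
Select Release x64 mode
Configure the include and lib directories and the additional dependencies
Include directories:
- D:\mxnet\include
- D:\mxnet\dmlc-core\include
- D:\mxnet\nnvm\include
- D:\mxnet\cpp-package\include
Lib directory:
- D:\mxnet\build_x64\Release
Additional dependencies:
- libmxnet.lib
Code: main.cpp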
#include <chrono>
#include <cstdio>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace std;
using namespace mxnet::cpp;

// Build a multi-layer perceptron: one FullyConnected layer per entry in
// `layers`, ReLU between layers, and a softmax output on the last layer.
Symbol mlp(const vector<int> &layers)
{
    auto x = Symbol::Variable("X");
    auto label = Symbol::Variable("label");

    vector<Symbol> weights(layers.size());
    vector<Symbol> biases(layers.size());
    vector<Symbol> outputs(layers.size());

    for (size_t i = 0; i < layers.size(); ++i)
    {
        weights[i] = Symbol::Variable("w" + to_string(i));
        biases[i] = Symbol::Variable("b" + to_string(i));
        Symbol fc = FullyConnected(
            i == 0 ? x : outputs[i - 1],  // data
            weights[i],
            biases[i],
            layers[i]);
        outputs[i] = i == layers.size() - 1 ? fc : Activation(fc, ActivationActType::kRelu);
    }

    return SoftmaxOutput(outputs.back(), label);
}
int main(int argc, char** argv)
{
    const int image_size = 28;
    const vector<int> layers{128, 64, 10};
    const int batch_size = 100;
    const int max_epoch = 10;
    const float learning_rate = 0.1;
    const float weight_decay = 1e-2;

    auto train_iter = MXDataIter("MNISTIter")
        .SetParam("image", "./mnist_data/train-images.idx3-ubyte")
        .SetParam("label", "./mnist_data/train-labels.idx1-ubyte")
        .SetParam("batch_size", batch_size)
        .SetParam("flat", 1)
        .CreateDataIter();
    auto val_iter = MXDataIter("MNISTIter")
        .SetParam("image", "./mnist_data/t10k-images.idx3-ubyte")
        .SetParam("label", "./mnist_data/t10k-labels.idx1-ubyte")
        .SetParam("batch_size", batch_size)
        .SetParam("flat", 1)
        .CreateDataIter();

    auto net = mlp(layers);

    // start training
    cout << "==== mlp training begin ====" << endl;
    auto start_time = chrono::system_clock::now();

    Context ctx = Context::gpu();  // Use GPU for training

    std::map<string, NDArray> args;
    args["X"] = NDArray(Shape(batch_size, image_size*image_size), ctx);
    args["label"] = NDArray(Shape(batch_size), ctx);
    // Let MXNet infer shapes of other parameters such as weights
    net.InferArgsMap(ctx, &args, args);

    // Initialize all parameters with uniform distribution U(-0.01, 0.01)
    auto initializer = Uniform(0.01);
    for (auto& arg : args)
    {
        // arg.first is parameter name, and arg.second is the value
        initializer(arg.first, &arg.second);
    }

    // Create sgd optimizer
    Optimizer* opt = OptimizerRegistry::Find("sgd");
    opt->SetParam("rescale_grad", 1.0 / batch_size)
        ->SetParam("lr", learning_rate)
        ->SetParam("wd", weight_decay);
    std::unique_ptr<LRScheduler> lr_sch(new FactorScheduler(5000, 0.1));
    opt->SetLRScheduler(std::move(lr_sch));

    // Create executor by binding parameters to the model
    auto *exec = net.SimpleBind(ctx, args);
    auto arg_names = net.ListArguments();

    // Create metrics
    Accuracy train_acc, val_acc;

    // Start training
    for (int iter = 0; iter < max_epoch; ++iter)
    {
        int samples = 0;
        train_iter.Reset();
        train_acc.Reset();

        auto tic = chrono::system_clock::now();
        while (train_iter.Next())
        {
            samples += batch_size;
            auto data_batch = train_iter.GetDataBatch();
            // Data provided by DataIter are stored in memory, should be copied to GPU first.
            data_batch.data.CopyTo(&args["X"]);
            data_batch.label.CopyTo(&args["label"]);
            // CopyTo is imperative, need to wait for it to complete.
            NDArray::WaitAll();

            // Compute gradients
            exec->Forward(true);
            exec->Backward();

            // Update parameters
            for (size_t i = 0; i < arg_names.size(); ++i)
            {
                if (arg_names[i] == "X" || arg_names[i] == "label") continue;
                opt->Update(i, exec->arg_arrays[i], exec->grad_arrays[i]);
            }
            // Update metric
            train_acc.Update(data_batch.label, exec->outputs[0]);
        }
        // one epoch of training is finished
        auto toc = chrono::system_clock::now();
        float duration = chrono::duration_cast<chrono::milliseconds>(toc - tic).count() / 1000.0;
        LG << "Epoch[" << iter << "] " << samples / duration
           << " samples/sec " << "Train-Accuracy=" << train_acc.Get();

        val_iter.Reset();
        val_acc.Reset();
        while (val_iter.Next())
        {
            auto data_batch = val_iter.GetDataBatch();
            data_batch.data.CopyTo(&args["X"]);
            data_batch.label.CopyTo(&args["label"]);
            NDArray::WaitAll();

            // Only forward pass is enough as no gradient is needed when evaluating
            exec->Forward(false);
            val_acc.Update(data_batch.label, exec->outputs[0]);
        }
        LG << "Epoch[" << iter << "] Val-Accuracy=" << val_acc.Get();
    }

    // end training
    auto end_time = chrono::system_clock::now();
    float total_duration = chrono::duration_cast<chrono::milliseconds>(end_time - start_time).count() / 1000.0;
    cout << "total duration: " << total_duration << " s" << endl;
    cout << "==== mlp training end ====" << endl;

    //delete exec;
    MXNotifyShutdown();

    getchar();  // wait here
    return 0;
}
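Build output directory
- Copy the MNIST data in beforehand, keeping the relative directory structure
- All the required DLLs must also be copied into the execution directory
Run results
The official examples include both a CPU and a GPU version of the mlp; if you are interested, run both and compare them
In fact, when the amount of data is small, the GPU version does not necessarily take noticeably less training time than the CPU version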
With that, the job is done