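1. Introduction to MLPerf

MLPerf is a common benchmark suite for measuring and improving the performance of machine learning software and hardware. It mainly measures the time needed to train different neural networks and to run inference with them. The suite contains benchmarks from several domains: image classification, object detection, translation, recommendation, speech recognition, sentiment analysis, and reinforcement learning.

The table below shows the MLPerf benchmarks by domain:

Reference: MLPerf website: https://mlperf.org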

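2. The ideas behind MLPerf

MLPerf draws on two sources: Harvard's Fathom project and Stanford's DAWNBench. From the former it borrows the use of many different machine learning tasks, so that the benchmark is sufficiently representative; from the latter it borrows the comparative evaluation metric, to keep the comparison fair.

MLPerf defines a separate benchmark for each problem domain. For MLPerf Training, the metric of each benchmark is the wall-clock time needed to train a model on a given dataset until it reaches the quality target. Because training time varies considerably from run to run, the final training result is the average over a specified number of runs after discarding the lowest and highest values; typically 5 runs are used. The measured training time includes model construction, data preprocessing, training, and quality evaluation.
For MLPerf Inference, the metric of each benchmark is the inference performance of a model on a given dataset, including latency and throughput.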
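
The averaging rule for training results can be illustrated with a minimal sketch. The function name and the run times are hypothetical; only the rule itself (drop the lowest and highest run, average the rest) comes from the text.

```python
def mlperf_training_score(run_minutes):
    """Average the run times after dropping one lowest and one highest value."""
    if len(run_minutes) < 3:
        raise ValueError("need at least 3 runs to drop the min and the max")
    ordered = sorted(run_minutes)
    kept = ordered[1:-1]  # discard the fastest and the slowest run
    return sum(kept) / len(kept)


# Hypothetical time-to-quality (minutes) of 5 runs of one benchmark.
runs = [130.0, 125.0, 140.0, 128.0, 132.0]
score = mlperf_training_score(runs)
print(score)
```

With 5 runs, three values contribute to the reported number, so a single unusually fast or slow run does not shift the result.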

3. MLPerf Training

3.1 Training Division

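MLPerf Training is split into a Closed Model Division and an Open Model Division. The Closed Model Division requires the same model and optimizer and restricts the values of hyperparameters such as batch size or learning rate; it is intended for fair comparison of hardware and software systems. The Open Model Division only requires solving the same problem with the same data; the model and platform are otherwise unrestricted, and it is intended to encourage innovation in ML models and optimization.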

3.1.1 Closed Model Division

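MLPerf provides a reference implementation for the benchmark in each domain. The Closed Model Division requires tests to use the same preprocessing, model, and training method as the reference implementation.
The specific requirements of the Closed Model Division are:

Preprocessing must follow the same steps as the reference implementation, and images must be the same size as in the reference implementation.
Weights and biases must be initialized with the same constants or random value distributions as the reference implementation.
The loss function must be the same as in the reference implementation.
The optimizer must be the same as in the reference implementation.
The RL environment and its parameters must also be the same as in the reference implementation.
Hyperparameters may be chosen by the submitter, within the restrictions noted above.

The models that can be used in the Closed Division are as follows: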

3.1.2 Open Model Division

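The Open Model Division requires solving the same problem with the same dataset, but allows arbitrary preprocessing, models, and training methods. It is intended to encourage innovation in ML models and optimization. At the time of writing, all MLPerf participants had concentrated on the Closed Division, and no results had been submitted to the Open Division.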

3.2 MLPerf Training Result

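As in DAWNBench, an MLPerf result is defined as the wall-clock time needed to train a model to the target quality. This time includes model construction, data preprocessing, training, and quality evaluation, and is typically hours or days. MLPerf also provides a reference baseline: the result MLPerf measured on a single Pascal P100 GPU without any optimization.
Reference: MLPerf Results: https://mlperf.org/results/

The table below shows the reference baseline and vendor results for the Closed Division.

MLPerf results are grouped first by Division and then by Category. Each row of the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each row contains the following information: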

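Name	Description
Submitter	The organization that submitted the results.
Hardware	The type of ML hardware used, for example an accelerator or a high-performance CPU.
Chip Count and Type	The number of ML hardware chips used, and whether they are accelerators or CPUs.
Software	The ML framework and main ML hardware libraries used.
Benchmark result	The time to train the model to the target quality in each benchmark.
Cloud Scale	Derived from the pricing of several major cloud providers; a rough indicator of relative system size/cost. The reference single Pascal P100 system has a cloud scale of 1, and a system with a cloud scale of 4 costs roughly four times as much.
Power	Information for available on-premise systems. Because standardizing power measurement is complex, this version of MLPerf only allows voluntary reporting of arbitrary, unofficial power information.
Details	Link to the submission metadata.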

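Results can also be reported as speedups. The speedup compares the current result with the reference baseline (a single Pascal P100 GPU without any optimization), that is, the reference training time divided by the measured training time. A speedup of 10 therefore means that the system under test trains the same model 10 times faster than a single Pascal P100, in 1/10 of the time.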
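
As a small worked example of the speedup definition, the sketch below divides a reference time by a measured time. The function name and the minute values are hypothetical and are not taken from the published results.

```python
def speedup(reference_minutes, measured_minutes):
    """Speedup relative to the reference system: reference time / measured time."""
    if reference_minutes <= 0 or measured_minutes <= 0:
        raise ValueError("training times must be positive")
    return reference_minutes / measured_minutes


# Hypothetical numbers: the reference system needs 1000 minutes,
# the system under test reaches the same target quality in 100 minutes.
example = speedup(1000.0, 100.0)
print(example)  # the system under test is this many times faster
```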
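The table below shows the reference baseline and vendor speedups for the Closed Division.

The reference baseline is a single Pascal P100, whose score on every benchmark is 1 (the reference value). Google's hardware is 4 TPUv2.8 with a cloud scale of 2.6, and its relative training speedup for ResNet-50 v1.5 on the ImageNet image classification task is 29.3. TPUv3.8 is very efficient: the average speedup per chip is 12, at a cloud price similar to that of a single Pascal P100.
NVIDIA, with its deep pockets, demonstrated large-scale GPU parallelism: using 80 DGX-1 systems (640 V100 GPUs, at 8 per DGX-1) it achieved a speedup of 1424.4 on the ImageNet dataset.

Even averaging several runs does not remove all variation from an MLPerf result. Results of the MLPerf image-processing benchmarks vary by about +/-2.5%, and those of the other MLPerf benchmarks by about +/-5%.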

3.3 MLPerf Training Benchmark

MLPerf provides a reference implementation for each benchmark, including:

image_classification - Resnet-50 v1 applied to Imagenet.
object_detection - Mask R-CNN applied to COCO.
single_stage_detector - SSD applied to COCO 2017.
speech_recognition - DeepSpeech2 applied to Librispeech.
translation - Transformer applied to WMT English-German.
recommendation - Neural Collaborative Filtering applied to MovieLens 20 Million (ml-20m).
sentiment_analysis - Seq-CNN applied to IMDB dataset.
reinforcement - Mini-go applied to predicting pro game moves.

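These reference implementations have all been tested in the following environment, and the results serve as the reference baseline for MLPerf Results.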

16 CPUs, one Nvidia P100.
Ubuntu 16.04, including docker with nvidia support.
600GB of disk (though many benchmarks do require less disk).
Either CPython 2 or CPython 3, depending on benchmark (see Dockerfiles for details).

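Each benchmark runs until the target quality is reached; training then stops and the elapsed time is recorded. The MLPerf Training benchmarks are described one by one below.

Reference: GitHub repository: https://github.com/mlperf/training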

3.3.1 image classification

The image_classification benchmark uses the ResNet v1.5 model for image classification. The implementation is derived from https://github.com/tensorflow/models/tree/master/official/resnet.

1) Set up the environment

# Install nvidia-docker
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
rpm -ivh nvidia-docker-1.0.1-1.x86_64.rpm
sudo systemctl restart nvidia-docker

2) Prepare the dataset
Use the provided script to convert the ImageNet dataset to TFRecords, or create the TFRecord dataset directly from the .tar files downloaded from image-net.org.

3) Run the benchmark

cd image_classification/tensorflow/
IMAGE=`sudo docker build . | tail -n 1 | awk '{print $3}'`
SEED=2
NOW=`date "+%F-%T"`
sudo nvidia-docker run -v /home/PSR/DataSet/imagenet/ILSVRC2012:/imn -it $IMAGE "./run_and_time.sh" $SEED | tee benchmark-$NOW.log

3.3.2 object_detection

object_detection trains Mask R-CNN with a ResNet50 backbone. Reference: https://github.com/mlperf/training/tree/master/object_detection.

1) Configure the environment

# 1. Checkout the MLPerf repository
mkdir -p mlperf
cd mlperf
git clone https://github.com/mlperf/training.git
#2. Install CUDA and Docker
source training/install_cuda_docker.sh
#3. Build the docker image for the object detection task
cd training/object_detection/
nvidia-docker build . -t mlperf/object_detection
#4. Run docker container and install code
nvidia-docker run -v .:/workspace -t -i --rm --ipc=host mlperf/object_detection "cd mlperf/training/object_detection && ./install.sh"

2) Prepare the dataset

source download_dataset.sh

3) Run the benchmark

nvidia-docker run -v .:/workspace -t -i --rm --ipc=host mlperf/object_detection "cd mlperf/training/object_detection && ./run_and_time.sh"

4. MLPerf Inference

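From the start, MLPerf did not limit itself to evaluating training systems; it was also meant to cover inference systems. For training, MLPerf already provides a fairly complete and fair method: hardware and software vendors run the MLPerf Training benchmark suite and compare the time and cost of training a model to a given accuracy. One year after MLPerf Training v0.5 was released, MLPerf released Inference v0.5, which adds evaluation of ML inference performance. MLPerf Inference is still under active development.
The figure below shows the planned MLPerf schedule: the submission deadline for the 2019 Inference round is September 6, and MLPerf publishes the results on October 4.

MLPerf Inference consists of three main parts:

Load Generator: MLPerf uses a load generator (LoadGen) to produce different test scenarios for inference.
Cloud: a set of benchmarks designed for the cloud scenario, including image classification (resnet50-v1.5, mobilenet-v1), object detection (ssd-mobilenet, ssd-resnet34), language modeling (lstm), translation (gnmt), speech recognition (DeepSpeech2), and sentiment analysis (Seq2CNN).
Edge: a set of benchmarks designed for the edge scenario, including face identification (Sphereface20), object classification (mobilenet, Shufflenet), object detection (SSD-MobileNet), object segmentation (MaskRCNN2GO), speech recognition (DeepSpeech2), and translation (gnmt). The edge benchmarks focus on small-model scenarios.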

4.1 Load Generator

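The Load Generator is MLPerf's load generator; it starts and controls the inference benchmark runs. MLPerf Inference has three basic concepts: SUT, sample, and query.

SUT is the system under test.
A sample is the unit on which inference is run, for example one image or one sentence.
A query is a set of N samples submitted for inference together. For example, a single query may contain 8 images.

To test a variety of inference platforms and use cases representatively, MLPerf defines four inference scenarios, described below.

SINGLE STREAM: LoadGen runs a single test to measure the 90th percentile latency. Each query contains exactly 1 sample, which can be thought of as batch size 1. LoadGen sends the next query only after the previous one has completed, and continues until both the configured number of queries has been exceeded and the configured test duration has been reached. This dual requirement on query count and duration lets the test fit both large and small inference systems.
Note that the reported result is the tail latency, not the average latency.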
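
To show why tail latency and average latency differ, here is a minimal sketch that computes a 90th percentile over a list of per-query latencies. The nearest-rank method and the latency values are assumptions made for illustration; the text does not say which percentile method LoadGen uses.

```python
import math


def percentile_nearest_rank(latencies, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not latencies:
        raise ValueError("no latencies recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(latencies)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-query latencies in seconds (one sample per query).
latencies = [0.014, 0.015, 0.013, 0.016, 0.050, 0.014, 0.015, 0.017, 0.014, 0.030]
p90 = percentile_nearest_rank(latencies, 90)
mean = sum(latencies) / len(latencies)
print(p90, mean)
```

In this example two slow queries pull the 90th percentile well above the mean, which is the behavior a tail-latency metric is meant to expose.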

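MULTI-STREAM: LoadGen runs multiple tests to measure the maximum number of streams the system can support. Each test evaluates a specific number of streams: LoadGen generates queries in which the number of samples equals the number of streams, and all samples of a query are allocated contiguously in memory. LoadGen uses binary search to find the maximum number of streams, and then runs 5 more tests to verify that this maximum is stable.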
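
The search for the maximum stream count can be sketched as a binary search over a pass/fail test. This is an illustration, not LoadGen's code: `passes` is a hypothetical callback that runs one test at a given stream count, and the sketch assumes that if N streams pass then every smaller count passes too.

```python
def find_max_streams(passes, upper_bound):
    """Largest n in [1, upper_bound] for which passes(n) is True, or 0 if none passes."""
    lo, hi = 0, upper_bound  # invariant: lo is known to pass (or is 0); nothing above hi passes
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def is_stable(passes, streams, repeats=5):
    """Re-run the test several times at the chosen stream count."""
    return all(passes(streams) for _ in range(repeats))


# Toy SUT (assumption): each sample costs 3 ms and a query must finish within 50 ms.
def toy_passes(streams):
    return streams * 3 <= 50


max_streams = find_max_streams(toy_passes, upper_bound=64)
stable = is_stable(toy_passes, max_streams)
print(max_streams, stable)
```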

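SERVER: LoadGen runs multiple tests to determine the system's throughput. The user specifies the expected QPS, and queries are generated following a Poisson distribution with that QPS to measure throughput. This simulates a server workload.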
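
Poisson query arrivals at a target QPS can be simulated by drawing exponentially distributed gaps between queries. The sketch below only illustrates that idea with Python's standard library; the function name, seed, and parameters are hypothetical and unrelated to LoadGen's actual implementation.

```python
import random


def poisson_arrival_times(qps, duration_s, seed=0):
    """Arrival timestamps (seconds) of a Poisson process with rate qps over duration_s."""
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    while True:
        now += rng.expovariate(qps)  # exponential gap with mean 1/qps
        if now >= duration_s:
            return arrivals
        arrivals.append(now)


arrivals = poisson_arrival_times(qps=200, duration_s=10.0)
print(len(arrivals), "queries scheduled in 10 s")
```

The number of queries fluctuates around qps * duration, and individual gaps can be much shorter than 1/qps, which is what makes this scenario harder on a system than evenly spaced queries.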

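OFFLINE: LoadGen runs a single test to measure system throughput. At the start of the test, LoadGen immediately sends all queries to the inference system. This mode simulates an offline scenario.

Each LoadGen run evaluates the system in one of the four modes, and the submitted results can be any combination of {models, scenarios}.

In an inference test, LoadGen is responsible for the following:

Generating queries according to the scenario under test
Tracking query latency
Verifying the accuracy of the results
Computing the final metric

Latency is defined here as the time from when LoadGen passes a query to the SUT (system under test) until inference completes and the response is received.

The overall Load Generator flow is as follows:

LoadGen signals the SUT (System Under Test).
The SUT starts up and signals that it is ready.
LoadGen starts the clock and begins generating queries.
Once the configured number of queries has been exceeded and the test duration has been reached, LoadGen stops generating queries.
LoadGen waits for all queries to complete.
LoadGen computes the final metric.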
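
The flow above, and in particular the double stop condition, can be sketched for the single-stream case. Everything here is a simplified stand-in: `run_single_stream` and `FakeSUT` are hypothetical names, the clock is simulated in integer milliseconds, and the sketch stops once the query count is reached rather than modelling LoadGen's exact rule.

```python
import time


def run_single_stream(issue_query, min_queries, min_duration, clock=time.perf_counter):
    """Issue one query at a time until BOTH the query count and the duration are met."""
    latencies = []
    start = clock()
    while len(latencies) < min_queries or clock() - start < min_duration:
        sent_at = clock()
        issue_query()  # blocks until the SUT has answered
        latencies.append(clock() - sent_at)
    return latencies


class FakeSUT:
    """Hypothetical SUT driven by a simulated integer-millisecond clock."""

    def __init__(self, latency_ms):
        self.latency_ms = latency_ms
        self.now_ms = 0

    def clock(self):
        return self.now_ms

    def issue_query(self):
        self.now_ms += self.latency_ms


fast = FakeSUT(latency_ms=10)
fast_latencies = run_single_stream(fast.issue_query, min_queries=5, min_duration=100, clock=fast.clock)
slow = FakeSUT(latency_ms=50)
slow_latencies = run_single_stream(slow.issue_query, min_queries=5, min_duration=100, clock=slow.clock)
print(len(fast_latencies), len(slow_latencies))
```

The fast system is held until the duration elapses, while the slow system is held until the query count is met, which is how one pair of settings can serve both large and small systems.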

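Note that caching any query, query parameter, or intermediate result is prohibited in LoadGen.

4.2 Cloud

MLPerf Inference plans to provide benchmarks in 6 domains for the cloud environment, as shown in the table below: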

Domain	Model	Dataset
image classification	resnet50-v1.5, mobilenet-v1	imagenet
object detection	ssd-mobilenet, ssd-resnet34	coco
language modeling	lstm	wmt11
translation	gnmt	wmt16
speech recognition	DeepSpeech2	OpenSLR
sentiment analysis	Seq2CNN	sentiment IMDB

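Inference is still under development. The benchmarks implemented so far are shown in the table below:

Note: TBD (to be determined); quality and latency targets are still pending.
MLPerf Inference provides a cloud reference system and a reference implementation for each benchmark. The cloud reference system is a Google Compute Platform n1-highmem-16 instance (16 vCPUs, 104 GB memory). All MLPerf Inference reference implementations run on the cloud reference system.

4.2.1 image classification and object detection

The cloud image_classification directory contains the benchmarks for both image classification and object detection. The resnet v1.5 and mobilenet-v1 models are used for image classification, and the ssd-resnet34 and ssd-mobilenet models for object detection, covering inference on mobile and host platforms respectively. Link: https://github.com/mlperf/inference/tree/master/cloud/image_classification
The models supported for image classification are shown in the table below:

Running the benchmark
1) Prepare the dataset
You can download the datasets from the following sites.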

dataset	download link
imagenet2012 (validation)	http://image-net.org/challenges/LSVRC/2012/
coco (validation)	http://images.cocodataset.org/zips/val2017.zip

You can also download the datasets with the ck tool.
ImageNet 2012 validation dataset:

#ImageNet 2012 validation dataset
$ ck install package --tags=image-classification,dataset,imagenet,val,original,full
$ ck install package --tags=image-classification,dataset,imagenet,aux
#Copy the labels next to the images:
$ ck locate env --tags=image-classification,dataset,imagenet,val,original,full
/home/dvdt/CK-TOOLS/dataset-imagenet-ilsvrc2012-val
$ ck locate env --tags=image-classification,dataset,imagenet,aux
/home/dvdt/CK-TOOLS/dataset-imagenet-ilsvrc2012-aux
$ cp `ck locate env --tags=aux`/val.txt `ck locate env --tags=val`/val_map.txt

COCO 2017 validation dataset:

#COCO 2017 validation dataset:
$ ck install package --tags=object-detection,dataset,coco,2017,val,original
$ ck locate env --tags=object-detection,dataset,coco,2017,val,original
/home/dvdt/CK-TOOLS/dataset-coco-2017-val

2) Set up the environment
When running the benchmark without Docker, perform the following steps.

1. pip install tensorflow or pip install tensorflow-gpu
2. cd ../../loadgen; CFLAGS="-std=c++14" python setup.py develop; cd ../cloud/image_classification
3. python setup.py develop

When using Docker, simply run:

Run as Docker container
./run_and_time.sh backend model device

backend is one of [tf|onnxruntime|pytorch|tflite]
model is one of [resnet50|mobilenet|ssd-mobilenet|ssd-resnet34]
device is one of [cpu|gpu]
For example:

./run_and_time.sh tf resnet50 gpu

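./run_and_time.sh builds the Docker environment and runs the benchmark directly.

3) Run the benchmark

1. Two environment variables must be set in both the local and the Docker environment
export MODEL_DIR=YourModelFileLocation
export DATA_DIR=YourImageNetLocation

2. Run the script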
./run_local.sh backend model device

backend is one of [tf|onnxruntime|pytorch|tflite]
model is one of [resnet50|mobilenet|ssd-mobilenet|ssd-resnet34]
device is one of [cpu|gpu]

For example:
./run_local.sh onnxruntime mobilenet gpu --accuracy

To use the Server scenario, try:

./run_local.sh tf resnet50 gpu --count 100 --time 60 --scenario Server --qps 200 --max-latency 0.2

To debug accuracy problems, try:

./run_local.sh tf ssd-mobilenet gpu --accuracy --count 100 --time 60 --scenario Server --qps 100 --max-latency 0.2

The arguments of main.py are as follows:

usage: main.py [-h]
[--dataset {imagenet,imagenet_mobilenet,coco,coco-300,coco-1200,coco-1200-onnx,coco-1200-pt,coco-1200-tf}]
--dataset-path DATASET_PATH [--dataset-list DATASET_LIST]
[--data-format {NCHW,NHWC}]
[--profile {defaults,resnet50-tf,resnet50-onnxruntime,mobilenet-tf,mobilenet-onnxruntime,ssd-mobilenet-tf,ssd-mobilenet-onnxruntime,ssd-resnet34-tf,ssd-resnet34-pytorch,ssd-resnet34-onnxruntime}]
[--scenario list of SingleStream,MultiStream,Server,Offline]
[--mode {Performance,Accuracy,Submission}]
[--queries-single QUERIES_SINGLE]
[--queries-offline QUERIES_OFFLINE]
[--queries-multi QUERIES_MULTI] [--max-batchsize MAX_BATCHSIZE]
--model MODEL [--output OUTPUT] [--inputs INPUTS]
[--outputs OUTPUTS] [--backend BACKEND] [--threads THREADS]
[--time TIME] [--count COUNT] [--qps QPS]
[--max-latency MAX_LATENCY]
[--cache CACHE]
[--accuracy]

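Argument descriptions:

Argument	Description
`--dataset`	The dataset to use. Currently only ImageNet is supported.
`--dataset-path`	Path to the dataset.
`--data-format {NCHW,NHWC}`	Data format of the model.
`--scenario {SingleStream,MultiStream,Server,Offline}`	Comma-separated list of benchmark scenarios.
`--model MODEL`	The model file.
`--inputs INPUTS`	Comma-separated list of input names, for model formats that do not provide them. Required for TensorFlow, because the graph does not specify its inputs.
`--outputs OUTPUTS`	Comma-separated list of output names, for model formats that do not provide them.
`--output OUTPUT`	Location of the JSON output file.
`--backend BACKEND`	The backend to use. Currently supported: tensorflow, onnxruntime, pytorch, and tflite.
`--threads THREADS`	Number of worker threads to use (default: the number of processors in the system).
`--count COUNT`	Number of images from the dataset to use (default: all images in the dataset).
`--qps QPS`	Expected queries per second.
`--max-latency MAX_LATENCY`	Comma-separated list of latencies (in seconds) to try to reach at the 99th percentile (default: 0.01,0.05,0.100).
`--mode {Performance,Accuracy,Submission}`	The LoadGen mode to run (default: Performance).
`--queries-single QUERIES_SINGLE`	Queries to use for the SingleStream scenario (default: 1024).
`--queries-offline QUERIES_OFFLINE`	Queries to use for the Offline scenario (default: 24576).
`--queries-multi QUERIES_MULTI`	Queries to use for the MultiStream scenario (default: 24576).
`--max-batchsize MAX_BATCHSIZE`	Maximum batch size (default: 128).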

4) Run Result

INFO:main:Namespace(accuracy=True, backend='onnxruntime', cache=0, count=None, data_format=None, dataset='imagenet_mobilenet', dataset_list=None, dataset_path='/home/gs/inference/cloud/image_classification/fake_imagenet', inputs=None, max_batchsize=128, max_latency=[0.2], mode='Performance', model='/home/gs/inference/cloud/image_classification/mobilenet_v1_1.0_224.onnx', output='/home/gs/inference/cloud/image_classification/output/mobilenet-onnxruntime-cpu/results.json', outputs=['MobilenetV1/Predictions/Reshape_1:0'], profile='mobilenet-onnxruntime', qps=10, queries_multi=24576, queries_offline=20, queries_single=1024, scenario=[TestScenario.SingleStream], threads=8, time=10)

INFO:imagenet:loaded 8 images, cache=0, took=0.0sec
INFO:main:starting accuracy pass on 8 items
Accuracy qps=48.12, mean=0.019353, time=0.17, acc=87.50, queries=8, tiles=50.0:0.0152,80.0:0.0198,90.0:0.0278,95.0:0.0366,99.0:0.0436,99.9:0.0451

INFO:main:starting TestScenario.SingleStream, latency=1.0
TestScenario.SingleStream-1.0 qps=65.80, mean=0.015138, time=10.05, acc=0.00, queries=661, tiles=50.0:0.0143,80.0:0.0167,90.0:0.0178,95.0:0.0199,99.0:0.0302,99.9:0.0522
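
For post-processing, a result line like the ones above can be turned into a dictionary. The parser below is a sketch whose field layout is inferred only from this sample output, not from a documented format, so it may need adjusting for other versions of the script.

```python
def parse_result_line(line):
    """Parse '<name> key=value, key=value, ..., tiles=p:v,p:v' into a dict."""
    name, _, rest = line.strip().partition(" ")
    result = {"name": name}
    for field in rest.split(", "):
        key, _, value = field.partition("=")
        if key == "tiles":
            # percentile -> latency in seconds
            pairs = (item.split(":") for item in value.split(","))
            result[key] = {float(p): float(v) for p, v in pairs}
        elif key == "queries":
            result[key] = int(value)
        else:
            result[key] = float(value)
    return result


sample_line = (
    "TestScenario.SingleStream-1.0 qps=65.80, mean=0.015138, time=10.05, acc=0.00, "
    "queries=661, tiles=50.0:0.0143,80.0:0.0167,90.0:0.0178,95.0:0.0199,99.0:0.0302,99.9:0.0522"
)
parsed = parse_result_line(sample_line)
print(parsed["name"], parsed["qps"], parsed["tiles"][90.0])
```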

4.2.2 translation gnmt

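MLPerf Inference uses the GNMT model for machine translation. For the steps to train the GNMT model and generate the dataset, see train_gnmt.txt; they essentially follow the MLPerf Training code.

Install dependencies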

# GPU
pip install --user tensorflow-gpu

Run GNMT on the full dataset:

$ cd /path/to/gnmt/tensorflow/
# Make the scripts executable and download the pre-trained model and dataset:
$ chmod +x ./download_trained_model.sh
$ ./download_trained_model.sh
$ chmod +x ./download_dataset.sh
$ ./download_dataset.sh
# verify the dataset
$ chmod +x ./verify_dataset.sh
$ ./verify_dataset.sh
# Evaluate performance with a specific batch size.
$ python run_task.py --run=performance --batch_size=32
$ python run_task.py --run=accuracy

Run GNMT through LoadGen:
1) Install LoadGen following the instructions at https://github.com/mlperf/inference/blob/master/loadgen/README.md.
2) Run the test

python loadgen_gnmt.py --store_translation

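This runs the SingleStream scenario (the default --scenario option) in Performance mode (the default --mode option), and also stores the translation of each sentence in a separate file.
Other scenarios can be run by changing the "--scenario" option. Accuracy tracking can be enabled with the "--mode Accuracy" option.
To test accuracy, run the following commands: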

python loadgen_gnmt.py --mode Accuracy
python process_accuracy.py

4.3 Edge

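MLPerf Inference plans to provide benchmarks in 6 domains for the edge environment, as shown in the table below:

Domain	Model	Dataset
face identification	Sphereface20	LFW
object classification	mobilenet, Shufflenet	imagenet
object detection	ssd-mobilenet	coco
object segmentation	MaskRCNN2GO	COCO 2014 minival
speech recognition	DeepSpeech2	OpenSLR
translation	gnmt	wmt16

4.3.1 Face identification

The MLPerf Inference edge face identification test uses Sphereface-20 as the reference model, a 20-layer residual convolutional neural network. The implementation is based on TensorFlow Lite.

1. Prepare the environment

1) Clone the mlperf/inference repository and enter the inference directory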

$ git clone https://github.com/mlperf/inference.git
$ cd inference

2) Pull the TensorFlow Docker image

$ docker pull tensorflow/tensorflow:nightly-devel-py3

3) Run the Docker image, mounting the sphereface20 directory at /mnt in the container.

docker run -it -w /mnt -v ${PWD}/edge/face_identification/sphereface20:/mnt -e HOST_PERMS="$(id -u):$(id -g)" tensorflow/tensorflow:nightly-devel-py3 bash

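Inside the container, prepare the environment (install packages and set environment paths)

$ ./prepare_env.sh

2. Prepare the dataset

1) CASIA-WebFace is used as the training dataset and LFW as the test dataset.
Inside the container, download the LFW dataset for testing: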

$ ./dataset/download_lfw.sh

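2) The downloaded /tmp/dataset/pairs.txt file contains 6,000 test pairs. The first 600 pairs, i.e. set1, are used as the benchmark.

3. TFLite model inference
1) The following script runs the whole inference procedure.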

$ ./inference_sphereface.sh

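4. Results

The table below gives the model information: the number of parameters and GMAC (Giga Multiply-and-Accumulate).

Parameters	GMAC
2.16 * 10^7	1.75

The table below records the time taken by the test.

The table below records the average inference time and accuracy of the Sphereface20 model.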

4.3.2 object classification

MLPerf Inference object classification has two model implementations: MobileNet and Facebook's Shufflenet.

Object Classification - MobileNet
Reference: https://github.com/mlperf/inference/tree/master/edge/object_classification/mobilenets

1. Prepare the dataset
1) Pull the CK repository

$ ck pull repo:ck-mlperf
Note: Transitive dependencies include repo:ck-tensorflow.

2) Install a small dataset (500 images)

$ ck install package:imagenet-2012-val-min
Note: ImageNet dataset descriptions are in repo:ck-env.

3) Install the full dataset (50,000 images)

$ ck install package:imagenet-2012-val

Note: if you have already downloaded the ImageNet validation dataset into a directory, e.g. $HOME/ilsvrc2012-val/, you can detect it as follows:

$ ck detect soft:dataset.imagenet.val --full_path=$HOME/ilsvrc2012-val/ILSVRC2012_val_00000001.JPEG

2. Prepare the environment
1) Install TensorFlow (Python)

$ ck install package:lib-tensorflow-1.13.1-cpu

Or install from source:

$ ck install package:lib-tensorflow-1.13.1-src-cpu

2) Install the MobileNet model for TensorFlow
To install the non-quantized model directly:

$ ck install package --tags=model,tf,mlperf,mobilenet,non-quantized

To install the quantized model directly:

$ ck install package --tags=model,tf,mlperf,mobilenet,quantized

3) Run the TensorFlow (Python) image classification client
Run the client. For the non-quantized model:

$ ck run program:image-classification-tf-py
...
*** Dependency 4 = weights (TensorFlow-Python model and weights):
...
    Resolved. CK environment UID = f934f3a3faaf4d73 (version 1_1.0_224_2018_02_22)
...
ILSVRC2012_val_00000001.JPEG - (65) n01751748 sea snake
0.82 - (65) n01751748 sea snake
0.10 - (58) n01737021 water snake
0.04 - (34) n01665541 leatherback turtle, leatherback, leather...
0.01 - (54) n01729322 hognose snake, puff adder, sand viper
0.01 - (57) n01735189 garter snake, grass snake
---------------------------------------

Summary:
-------------------------------
Graph loaded in 0.855126s
All images loaded in 0.001089s
All images classified in 0.116698s
Average classification time: 0.116698s
Accuracy top 1: 1.0 (1 of 1)
Accuracy top 5: 1.0 (1 of 1)
--------------------------------

For the quantized model:

$ ck run program:image-classification-tf-py
...
*** Dependency 4 = weights (TensorFlow-Python model and weights):
...
    Resolved. CK environment UID = b18ad885d440dc77 (version 1_1.0_224_quant_2018_08_02)
...
ILSVRC2012_val_00000001.JPEG - (65) n01751748 sea snake
0.16 - (60) n01740131 night snake, Hypsiglena torquata
0.10 - (600) n03532672 hook, claw
0.07 - (58) n01737021 water snake
0.05 - (398) n02666196 abacus
0.05 - (79) n01784675 centipede
---------------------------------------

Summary:
-------------------------------
Graph loaded in 1.066851s
All images loaded in 0.001507s
All images classified in 0.178281s
Average classification time: 0.178281s
Accuracy top 1: 0.0 (0 of 1)
Accuracy top 5: 0.0 (0 of 1)
--------------------------------

Object Classification - Shufflenet v1
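Reference: https://github.com/mlperf/inference/tree/master/edge/object_classification/shufflenet/caffe2

Shufflenet v1 is an object classification model, provided here in a Caffe2 implementation from Facebook. The ImageNet 2012 validation dataset is used for evaluation.
Test procedure
1) Download the ImageNet validation dataset from the official ImageNet challenge website and extract the files.
2) Run the following directly (tested on a clean install of Ubuntu 16.04)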

run.sh <imagenet dir> [group1|group2|group3|group4|group8]

Five Shufflenet models are provided, corresponding to group1, group2, group3, group4, and group8. The models are stored under the directory shufflenet / - model / group [x], which contains the following files:

model_dev.svg (svg file to show model structure)
model_init_dev.svg (svg file to show the init model structure)
model_init.pb (init model used by Caffe2 with all trained weights and biases)
model_init.pbtxt (text file showing the weights and biases for all blobs)
model.pb (model used by Caffe2)
model.pbtxt (text file showing the model structure)
model.pkl (model weights in pickle format)

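To run the shufflenet model on the ImageNet validation dataset, the benchmark provides a script that simplifies the test procedure. It works on the reference platform and takes care of installing all dependencies, building the framework, and invoking FAI-PEP to run the benchmark. Future model submitters may need to understand the flow and adapt it to new hardware platforms.
To make the flow clear, the components of the benchmark flow are described below:

An overview of benchmarking on the ImageNet validation dataset can be found in the Wiki
build.sh builds the PyTorch/Caffe2 framework.
shufflenet_accuracy_imagenet.json specifies the overall benchmark flow.
imagenet_test_map.py converts the ImageNet validation dataset into a format the benchmark flow can consume.
convert_image_to_tensor.cc preprocesses the images (scaling, cropping, normalization, etc.) and converts them into blobs that the framework can use directly.
caffe2_benchmark.cc runs the model under various settings and outputs the results.
classification_compare.py compares the model output with the golden values.
Because the validation dataset is large and cannot be processed in a single iteration, the images are processed in batches. aggregate_classification_results.py aggregates the results of all runs into the final result.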
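
The aggregation step can be pictured as summing per-batch counts before dividing. The sketch below is not the logic of aggregate_classification_results.py, whose input format is not described here; the dictionary keys and the numbers are hypothetical.

```python
def aggregate_accuracy(batches):
    """Combine per-batch counts into overall top-1 and top-5 accuracy."""
    total = sum(b["total"] for b in batches)
    if total == 0:
        raise ValueError("no images were evaluated")
    return {
        "total": total,
        "top1": sum(b["top1_correct"] for b in batches) / total,
        "top5": sum(b["top5_correct"] for b in batches) / total,
    }


# Hypothetical per-batch counts; batches may have different sizes.
batch_results = [
    {"total": 100, "top1_correct": 70, "top5_correct": 90},
    {"total": 50, "top1_correct": 20, "top5_correct": 40},
]
overall = aggregate_accuracy(batch_results)
print(overall)
```

Summing counts rather than averaging per-batch accuracies keeps the result correct when the last batch is smaller than the others.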

4.3.3 object detection ssd-mobilenet

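This section is similar to Object Classification - MobileNet in the previous section.

1. Prepare the dataset
1) Pull the CK repository

$ ck pull repo:ck-mlperf
Note: Transitive dependencies include repo:ck-tensorflow.

Update all CK repositories

$ ck pull repo --all

2) Install the COCO 2017 validation dataset (5,000 images)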

$ ck install package --tags=object-detection,dataset,coco.2017,val,original,full

Note: if you have previously installed the COCO 2017 validation dataset via CK, e.g. under $HOME/coco/, you can detect it as follows:

$ ck detect soft:dataset.coco.2017.val --full_path=$HOME/coco/val2017/000000000139.jpg

3) Preprocess the COCO 2017 validation dataset (first 50 images)

$ ck install package --tags=object-detection,dataset,coco.2017,preprocessed,first.50

Preprocess the COCO 2017 validation dataset (all 5,000 images)

$ ck install package --tags=object-detection,dataset,coco.2017,preprocessed,full

2. Prepare the environment
1) Install TensorFlow (Python)

$ ck install package:lib-tensorflow-1.13.1-cpu

Or install from source:

$ ck install package:lib-tensorflow-1.13.1-src-cpu

2) Install the SSD-MobileNet model
To install the non-quantized model directly:

$ ck install package --tags=model,tf,object-detection,mlperf,ssd-mobilenet,non-quantized

To install the quantized model directly:

$ ck install package --tags=model,tf,object-detection,mlperf,ssd-mobilenet,quantized,finetuned

3) Run the TensorFlow (Python) object detection client on 50 images

$ ck run program:object-detection-tf-py --env.CK_BATCH_COUNT=50
...
********************************************************************************
* Process results
********************************************************************************

Convert results to coco ...

Evaluate metrics as coco ...
loading annotations into memory...
Done (t=0.55s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.12s).
Accumulating evaluation results...
DONE (t=0.22s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.315
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.439
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.064
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.184
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.296
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.322
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.066
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.187
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.711

Summary:
-------------------------------
Graph loaded in 0.748657s
All images loaded in 12.052568s
All images detected in 1.251245s
Average detection time: 0.025536s
mAP: 0.3148934914889957
Recall: 0.3225293342489256
--------------------------------

3. Benchmark the performance

$ ck benchmark program:object-detection-tf-py \
--repetitions=10 --env.CK_BATCH_SIZE=1 --env.CK_BATCH_COUNT=2 --env.CK_METRIC_TYPE=COCO \
--record --record_repo=local --record_uoa=mlperf-object-detection-ssd-mobilenet-tf-py-performance \
--tags=mlperf,object-detection,ssd-mobilenet,tf-py,performance \
--skip_print_timers --skip_stat_analysis --process_multi_keys

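Note: with a batch count of N, the program runs object detection on N images, but the slow first run is excluded when computing the average detection time, for example: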

$ ck benchmark program:object-detection-tf-py \
--repetitions=10 --env.CK_BATCH_SIZE=1 --env.CK_BATCH_COUNT=2
...
Graph loaded in 0.7420s

Detect image: 000000000139.jpg (1 of 2)
Detected in 1.9351s

Detect image: 000000000285.jpg (2 of 2)
Detected in 0.0284s
...
Summary:
-------------------------------
Graph loaded in 0.741997s
All images loaded in 0.604377s
All images detected in 0.028387s
Average detection time: 0.028387s
mAP: 0.15445544554455443
Recall: 0.15363636363636363
--------------------------------
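
The effect of skipping the first run can be reproduced from the two per-image times in the log above (1.9351 s and 0.0284 s). The helper below is a sketch written for this explanation, not code from the CK program.

```python
def average_detection_time(times_s):
    """Mean detection time, ignoring the first (warm-up) image when there are others."""
    if not times_s:
        raise ValueError("no detections recorded")
    steady = times_s[1:] if len(times_s) > 1 else times_s
    return sum(steady) / len(steady)


# Per-image detection times taken from the log above.
per_image = [1.9351, 0.0284]
avg = average_detection_time(per_image)
naive = sum(per_image) / len(per_image)
print(avg, naive)
```

Including the warm-up image would report an average of roughly 0.98 s, about 35 times the steady-state value, which is why it is left out.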

4. Benchmark the accuracy

$ ck benchmark program:object-detection-tf-py \
--repetitions=1 --env.CK_BATCH_SIZE=1 --env.CK_BATCH_COUNT=5000 --env.CK_METRIC_TYPE=COCO \
--record --record_repo=local --record_uoa=mlperf-object-detection-ssd-mobilenet-tf-py-accuracy \
--tags=mlperf,object-detection,ssd-mobilenet,tf-py,accuracy \
--skip_print_timers --skip_stat_analysis --process_multi_keys

5. Reference accuracy
1) SSD-MobileNet non-quantized

********************************************************************************
* Process results
********************************************************************************

Convert results to coco ...

Evaluate metrics as coco ...
loading annotations into memory...
Done (t=0.49s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.07s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=12.46s).
Accumulating evaluation results...
DONE (t=2.09s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.231
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.349
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.166
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.531
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.209
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.262
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.263
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.023
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.190
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604

Summary:
-------------------------------
Graph loaded in 0.753200s
All images loaded in 1193.981655s
All images detected in 123.461871s
Average detection time: 0.024697s
mAP: 0.23111107753357035
Recall: 0.26304841188725403
--------------------------------

2) SSD-MobileNet quantized

********************************************************************************
* Process results
********************************************************************************

Convert results to coco ...

Evaluate metrics as coco ...
loading annotations into memory...
Done (t=0.48s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.18s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=12.74s).
Accumulating evaluation results...
DONE (t=2.13s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.236
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.361
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.259
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.019
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.166
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.546
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.212
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.269
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.618

Summary:
-------------------------------
Graph loaded in 1.091316s
All images loaded in 1150.996266s
All images detected in 122.103661s
Average detection time: 0.024426s
mAP: 0.23594222525632427
Recall: 0.26864982712779556
--------------------------------

4.3.4 translation gnmt

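Reference: https://github.com/mlperf/inference/tree/master/edge/translation/gnmt/tensorflow
For this part, see the GNMT code in the cloud directory; the code is identical.

4.4 Inference Rule

The following are some details of the rules that must be followed when running the MLPerf Inference benchmarks.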

4.4.1 Inference Division

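MLPerf Inference is also split into a Closed Division and an Open Division.

The Closed Division requires preprocessing, postprocessing, and a model equivalent to the reference implementation. It allows calibration for quantization but does not allow any retraining.
The Open Division only requires the same dataset; it allows arbitrary preprocessing, postprocessing, and models, including retraining.

4.4.2 Data Sets

1) For each benchmark, MLPerf provides the following datasets:

An accuracy dataset, used to determine whether a submission meets the quality target and used as the validation set.
A speed/performance dataset, used to measure performance; it is a subset of the accuracy dataset.

For every benchmark except GNMT, MLPerf provides a calibration dataset for quantization, which is a part of the training dataset used to generate the weights. Each reference implementation should include a script that verifies the dataset using a checksum. The dataset must be unchanged at the start of each run.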
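
A checksum verification script of the kind mentioned above could look like the following sketch. The choice of SHA-256, the function names, and the manifest layout (a mapping from file path to expected digest) are all assumptions; the reference implementations may use a different hash or format.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected):
    """Return the paths whose current digest differs from the expected one."""
    return sorted(path for path, digest in expected.items() if file_sha256(path) != digest)


# Demo on a temporary file standing in for a dataset file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "val_map.txt")
    content = b"ILSVRC2012_val_00000001.JPEG 65\n"
    with open(sample, "wb") as handle:
        handle.write(content)
    unchanged = find_mismatches({sample: hashlib.sha256(content).hexdigest()})
    tampered = find_mismatches({sample: "0" * 64})

print(unchanged, len(tampered))
```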

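2) Pre- and post-processing
All imaging benchmarks take uncropped, uncompressed bitmaps as input; NMT takes text.

4.4.3 Model weight quantization

For the Closed Division, MLPerf provides the weights and biases of the reference implementation in fp32 format. MLPerf provides a calibration dataset for every benchmark except GNMT, and submitters may only use the provided calibration dataset for quantization. The supported numerical formats are:

INT8
INT16
UINT8
UINT16
FP11 (1-bit sign, 5-bit exponent, 5-bit mantissa)
FP16
bfloat16
FP32

Note: calibration may only use the calibration dataset provided by the benchmark owner. Convolution layers may be in NCHW or NHWC format. No other retraining is allowed.
For the Open Division, weights and biases must be initialized to the same values for each run, and any quantization scheme is allowed.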
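
To make the role of calibration data concrete, here is a sketch of one common scheme, symmetric max-abs INT8 quantization. The rules above do not prescribe this scheme; it is an assumption chosen for illustration, and the function names and values are hypothetical.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so that the largest calibration magnitude maps to the largest integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the symmetric integer range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


# Hypothetical values observed on the calibration dataset, then values seen at inference.
scale = calibrate_scale([-2.0, 0.5, 1.0])
quantized = quantize([-2.0, 0.5, 1.1, 3.0], scale)
restored = dequantize(quantized, scale)
print(quantized)
```

The value 3.0 lies outside the calibrated range and is clamped, which shows why the calibration data should be representative of the inputs seen at inference time.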

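4.4.4 Model implementation

Model implementation choices allowed by the rules:

Any framework and runtime: TensorFlow, TensorFlow-lite, ONNX, PyTorch, etc.
Any data ordering
Weights as input and weights as represented in memory are allowed to differ.
Mathematically equivalent transformations (e.g. Tanh versus Logistic, Relu6 versus Relu8)
Replacing dense operations with equivalent sparse operations
Fusing or unfusing operations
Dynamically switching between one or more batch sizes
Mixed precision is allowed
Reducing an ImageNet classifier with 1001 classes to 1000 classes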
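
The last item in the list can be sketched as dropping one extra class from the output vector. The sketch assumes the extra class is a background class at index 0, as in some TensorFlow ImageNet checkpoints; that index is an assumption and should be checked against the model actually used.

```python
def drop_background(scores_1001, background_index=0):
    """Remove the assumed background class so that 1000 ImageNet classes remain."""
    if len(scores_1001) != 1001:
        raise ValueError("expected 1001 class scores")
    return scores_1001[:background_index] + scores_1001[background_index + 1:]


def top1_label(scores_1001, background_index=0):
    """Index of the best class in the 1000-class label space."""
    scores = drop_background(scores_1001, background_index)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical output vector with a single confident class at position 66.
scores = [0.0] * 1001
scores[66] = 0.82
reduced = drop_background(scores)
label = top1_label(scores)
print(len(reduced), label)
```

With the background class at index 0, every remaining label shifts down by one, so position 66 in the 1001-way output becomes label 65 in the 1000-class space.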

Model implementation choices not allowed by the rules:

Wholesale weight replacement or supplements
Discarding non-zero weight elements
Caching queries or responses
Coalescing identical queries

5. References

https://mlperf.org/
https://github.com/mlperf
https://mlperf.org/results/
https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc
https://github.com/mlperf/inference/tree/master/cloud
https://github.com/mlperf/inference/blob/master/cloud/image_classification/GettingStarted.ipynb
https://github.com/mlperf/inference/tree/master/edge
https://github.com/mlperf/inference/tree/master/edge/face_identification/sphereface20/tflite
https://github.com/mlperf/inference/tree/master/edge/object_classification/mobilenets
https://github.com/mlperf/inference/tree/master/edge/object_classification/shufflenet/caffe2
https://github.com/facebook/FAI-PEP/wiki/Run-Imagenet-validate-dataset

