[TVM Handbook] 3. Quantization Summary

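TVM has a great deal of material on quantization. It is valuable but extremely scattered, which makes it unfriendly to individual users. This post collects it in one place.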

Official References

TVM quantization roadmap

INT8 quantization proposal

Quantization Story - 2019-09


Quantization Development

  • [RFC] Search-based Automated Quantization - 2020-01-22
    • I proposed a new quantization framework that brings hardware and learning methods into the loop.
    • Borrowing ideas from existing quantization frameworks, I chose to adopt a three-phase annotation-calibration-realization design:
    • Annotation: The annotation pass rewrites the graph and inserts simulated quantize operations according to the rewrite function of each operator. A simulated quantize operation simulates the rounding error and saturation error of quantizing from float to integer.
    • Calibration: The calibration pass adjusts the thresholds of the simulated quantize operations to reduce the accuracy drop.
    • Realization: The realization pass transforms the simulation graph, which actually computes in float32, into a real low-precision integer graph.
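
The three phases can be illustrated with a minimal pure-Python sketch. This is not TVM's implementation: the function names `simulated_quantize`, `calibrate_threshold`, and `realize` are hypothetical, symmetric signed quantization is assumed, and taking the largest observed magnitude as the threshold is only one simple calibration rule chosen for illustration.

```python
from typing import List


def simulated_quantize(x: float, threshold: float, bits: int = 8) -> float:
    """Snap x to a signed integer grid, then map it back to float.

    The value stays a float, but it now carries the rounding and
    saturation error of the integer representation.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    scale = threshold / qmax              # float step per integer level
    q = round(x / scale)                  # source of rounding error
    q = max(-qmax, min(qmax, q))          # source of saturation error
    return q * scale


def calibrate_threshold(samples: List[float]) -> float:
    """Assumed calibration rule: use the largest magnitude observed."""
    return max(abs(v) for v in samples)


def realize(x: float, threshold: float, bits: int = 8) -> int:
    """Return the actual integer behind the simulated value."""
    qmax = 2 ** (bits - 1) - 1
    scale = threshold / qmax
    return max(-qmax, min(qmax, round(x / scale)))


calibration_data = [-0.8, 0.1, 0.5, 1.27]
threshold = calibrate_threshold(calibration_data)

in_range = simulated_quantize(0.503, threshold)   # rounded to the nearest grid point
saturated = simulated_quantize(3.0, threshold)    # clipped to the threshold
as_int = realize(0.503, threshold)                # integer used by the real graph
```

Inputs inside the threshold lose at most half a quantization step, while inputs beyond it are clipped. That trade-off is why the choice of threshold during calibration affects accuracy.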

Quantization Framework supported by TVM

TF Quantization Related

TVM supports all pre-quantized TFLite hosted models

  • Performance was evaluated on a C5.12xlarge Cascade Lake machine, which supports Intel VNNI.
  • The models have not been autotuned yet.

PyTorch Quantization Related

MXNet Related

  • Model Quantization for Production-Level Neural Network Inference
    • The CPU performance below was measured on an AWS EC2 C5.24xlarge instance with custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake).
    • Model quantization delivers a more stable speedup across all models, such as 3.66X for ResNet 50 v1, 3.82X for ResNet 101 v1 and 3.77X for SSD-VGG16, which is very close to the theoretical 4X speedup from INT8.
    • The accuracy of the Apache MXNet quantization solution is very close to that of the FP32 models, without requiring the model to be retrained. In Figure 8, MXNet ensured only a small reduction in accuracy, less than 0.5%.
    • [topi] add ARM v8.2 udot (uint8) support #3978
      • Adds a uint8 intrinsic for ARM. Currently it is udot.v2i32.v8i8, which may have too few lanes; more will be added later.
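
To make the lane remark in the udot note concrete, here is a small pure-Python emulation of what an unsigned dot-product instruction of this shape computes. It assumes the intrinsic name is read as 8 uint8 inputs per operand and 2 accumulator lanes of 32 bits, with each lane accumulating four products. The helper `udot_v2i32_v8i8` is a hypothetical name used only for this sketch, not a TVM or LLVM function.

```python
from typing import List


def udot_v2i32_v8i8(acc: List[int], a: List[int], b: List[int]) -> List[int]:
    """Emulate an unsigned dot product: 8 uint8 pairs into 2 32-bit lanes."""
    if len(acc) != 2 or len(a) != 8 or len(b) != 8:
        raise ValueError("expected 2 accumulator lanes and 8 elements per operand")
    if any(not 0 <= v <= 255 for v in a + b):
        raise ValueError("operands must be uint8 values")
    out = []
    for lane in range(2):
        # Each 32-bit lane consumes four consecutive uint8 pairs.
        group = range(4 * lane, 4 * lane + 4)
        total = acc[lane] + sum(a[i] * b[i] for i in group)
        out.append(total & 0xFFFFFFFF)    # wrap like a 32-bit register
    return out


result = udot_v2i32_v8i8(
    [0, 10],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 1, 1, 1, 2, 2, 2, 2],
)
# result == [10, 62]
```

With only two output lanes, a single call covers eight multiply-accumulates, which is why a wider variant would be attractive.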

Tensor Core Related

Related Commit

Speed Up

Comparison

Automatic Integer Quantization

Accepting Pre-quantized Integer models

Speed Profile Tools

  • How to profile speed in each layer with RPC?
    • The debug runtime gives you some profiling information from the embedded device, e.g.:
      Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
      ---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
      1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
      _contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
      relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
      _contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
      relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
      _contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
      relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
      _contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
      reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
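
A report like the one above can be post-processed to find the hot layers. The sketch below is plain Python, not part of TVM: it assumes columns are separated by two or more spaces (as in the dump above), and `parse_profile` and `LayerTime` are hypothetical names. Only the first three columns of four rows are reproduced to keep the example short.

```python
import re
from typing import List, NamedTuple


class LayerTime(NamedTuple):
    node: str
    op: str
    time_us: float


REPORT = """\
Node Name               Ops                           Time(us)
---------               ---                           --------
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc    12436.11
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1  213108.6
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2  104623.15
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3  25218.4
"""


def parse_profile(report: str) -> List[LayerTime]:
    """Parse the per-layer rows, skipping the header and separator lines."""
    rows = []
    for line in report.splitlines()[2:]:
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue                      # ignore blank or malformed lines
        rows.append(LayerTime(fields[0], fields[1], float(fields[2])))
    return rows


rows = parse_profile(REPORT)
total_us = sum(r.time_us for r in rows)
slowest = max(rows, key=lambda r: r.time_us)
share_of_listed = slowest.time_us / total_us   # share among the rows parsed here

for r in sorted(rows, key=lambda r: r.time_us, reverse=True):
    print(f"{r.node:<24} {r.time_us:>12.2f} us")
```

Note that `share_of_listed` is relative to the rows passed in, so it will differ from the Time(%) column, which is relative to the whole run.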
      

Devices Attributes

Third-Party Tutorials

Theory Summary

Practice

Partners

See tvmai/meetup-slides for more recent information on what other partners have done with TVM.

Alibaba

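  • A Record of 2019
    • Describes Alibaba's development history with TVM.
    • In April of this year (2019), I came back to work with colleagues on quantization optimization for ARM CPUs, because a business use case needed it. After grinding away together for a while, we are happy to say that we are faster than QNNPACK: on MobileNet V1 we reach 1.61x TFLite and 1.27x QNNPACK, and on MobileNet V2 we reach 2x TFLite and 1.34x QNNPACK.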
  • TVM@AliOS

Facebook
