A Brief about How MQBench Supports Hardware and Software
MQBench can quantize models for many different backends with many different algorithms, which relies on keeping hardware (backends) and software (algorithms) independent. To support a new backend, both the hardware side and the software side must be added.
We provide a typical workflow:

1. Add a FakeQuantizeAffine to simulate the hardware behavior.
2. Add a corresponding Observer to provide the FakeQuantizeAffine with the needed information.
3. Add a ModelQuantizer to insert quantize nodes into the model.
4. Add an ONNX-to-backend translator to deploy the output graph.
FakeQuantize, Observer and Quantization Scheme
When the model is in calib mode, the Observer collects the needed statistics and the FakeQuantize does not quantize its input; when in quant mode, the FakeQuantize performs the quantized forward with the qparams calculated by the Observer.
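The calib/quant split above can be sketched in plain Python. This is a minimal illustration of the idea, not the actual MQBench classes: the names MinMaxObserver and fake_quantize are hypothetical, and real observers work on tensors, not lists.

```python
# Minimal sketch of the calib/quant split (hypothetical names,
# not the actual MQBench classes).

class MinMaxObserver:
    """Collects min/max statistics during calibration."""
    def __init__(self, qmin=-128, qmax=127):
        self.qmin, self.qmax = qmin, qmax
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def forward(self, xs):
        # Calib mode: only record statistics, pass data through unchanged.
        self.min_val = min(self.min_val, min(xs))
        self.max_val = max(self.max_val, max(xs))
        return xs

    def calculate_qparams(self):
        # Affine (asymmetric) qparams from the observed range.
        scale = (self.max_val - self.min_val) / (self.qmax - self.qmin)
        zero_point = round(self.qmin - self.min_val / scale)
        return scale, zero_point

def fake_quantize(xs, scale, zero_point, qmin=-128, qmax=127):
    # Quant mode: quantize-dequantize, so the forward pass sees the
    # quantization error while staying in floating point.
    out = []
    for x in xs:
        q = round(x / scale) + zero_point
        q = max(qmin, min(qmax, q))
        out.append((q - zero_point) * scale)
    return out
```

In the real classes the round operation is wrapped so the backward pass uses a straight-through estimator, which is exactly why a custom torch.autograd.Function is needed (see below).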
To add a FakeQuantizeAffine, refer to mqbench/fake_quantize/:
1. Function: add a fake quantization function that describes the quantized forward. If the function is not differentiable, also describe the backward.
2. Autograd and Symbolic: wrap the quantization forward/backward in a torch.autograd.Function. To enable ONNX export, a symbolic function is also needed.
3. Inherit QuantizeBase: define the class that holds the qparams and performs the quantized forward. This class inherits from mqbench.fake_quantize.quantize_base.QuantizeBase.
4. Add an Observer if needed: inherit from ObserverBase, collect statistics in forward, and compute the qparams in calculate_qparams.

Use it in prepare_by_platform:

1. Import your class in mqbench.prepare_by_platform.
2. Add your backend type to the enumeration mqbench.prepare_by_platform.BackendType.
3. Define its default scheme in mqbench.prepare_by_platform.ParamsTable.
4. Add the mappings to mqbench.prepare_by_platform.FakeQuantizeDict.
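The registration steps above can be pictured as extending an enum and a couple of tables. This is a schematic sketch only: the backend name MyBackend and the scheme fields are illustrative, and the real tables in mqbench.prepare_by_platform hold richer entries.

```python
from enum import Enum

# Schematic sketch of backend registration. The real tables in
# mqbench.prepare_by_platform hold richer entries; names here are
# illustrative.

class BackendType(Enum):
    Tensorrt = "Tensorrt"
    MyBackend = "MyBackend"   # the new backend being added

# Default quantization scheme per backend (hypothetical fields).
ParamsTable = {
    BackendType.MyBackend: dict(
        w_qscheme=dict(bit=8, symmetric=True,  per_channel=True),
        a_qscheme=dict(bit=8, symmetric=False, per_channel=False),
    ),
}

# Map a string name to the FakeQuantize class to instantiate
# (paths here are placeholders).
FakeQuantizeDict = {
    "MyFixedFakeQuantize": "mqbench.fake_quantize.MyFixedFakeQuantize",
}
```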
For now, we have provided several linear quantization affines/observers, which should save a lot of work.
Custom Quantizer
To prepare the model for a platform, quantization nodes should be inserted by, and only by, the ModelQuantizer, which hides all details of building the needed graph. After preparation, the model can be used for QAT or PTQ.
The ModelQuantizer will:

1. Fuse ops, e.g. Conv2d + BN2d + ReLU -> ConvBnReLU2d.
2. Swap modules for QAT: fuse the modules into QAT modules and insert quantize nodes for the weights.
3. Insert quantize nodes after activations, which requires finding which outputs will be quantized on the backend.
Op Fusion
Fusion uses the torch API, and you need to edit torch's default fuser method mappings and patterns because the torch fuser does not accept any other arguments.
If you add new fusion patterns:

1. Add the fusion patterns and the related fusion methods to mqbench.fuser_method_mapping. This turns certain op patterns into intrinsic fused modules. If you create new intrinsic modules, add them to mqbench.nn.intrinsic. If you add new fusion methods, add them to mqbench.utils.fusion.
2. Add the mapping from intrinsic modules to QAT modules. Define the QAT modules in mqbench.nn.intrinsic.qat if needed.
3. Add all of the above to mqbench.fuser_method_mapping.fuse_custom_config_dict.

NOTE: if a pattern only applies to a certain backend, update the default dicts in its ModelQuantizer rather than fuse_custom_config_dict!
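The effect of a fusion method such as Conv + BN folding can be shown with scalar weights. This is a toy 1-D illustration of the standard BN-folding algebra, not MQBench code; the function names are made up for the example.

```python
import math

# Toy 1-D "conv": y = w * x + b, followed by batch norm with running
# statistics (mean, var) and affine parameters (gamma, beta).

def conv(x, w, b):
    return w * x + b

def bn(y, mean, var, gamma, beta, eps=1e-5):
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fuse_conv_bn(w, b, mean, var, gamma, beta, eps=1e-5):
    # Fold BN into the conv parameters so a single affine op computes
    # bn(conv(x)):  w' = w * s,  b' = (b - mean) * s + beta,
    # where s = gamma / sqrt(var + eps).
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

w, b = 2.0, 0.5
mean, var, gamma, beta = 1.0, 4.0, 1.5, -0.3
fw, fb = fuse_conv_bn(w, b, mean, var, gamma, beta)
```

The fused op (fw, fb) produces the same output as running conv and bn separately, which is why fusion can happen before quantization without changing the float model.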
QAT Swap
Swap the intrinsic modules for QAT ones. Add the mapping to mqbench.fuser_method_mapping if it is a universal mapping; if not, add it to the ModelQuantizer's additional_qat_module_mappings.
We already ship bias-quant intrinsic and QAT modules, so refer to mqbench.nn to see how to add one yourself.
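The swap itself is essentially a class-to-class lookup. The sketch below uses dummy stand-in classes; real QAT modules follow torch's from_float convention of copying weights and attaching fake quantizers.

```python
# Schematic module swap: replace intrinsic modules with their QAT
# counterparts via a class-to-class mapping (dummy classes for brevity).

class ConvBnReLU2d:            # stand-in for an intrinsic fused module
    pass

class QATConvBnReLU2d:         # stand-in for its QAT counterpart
    @classmethod
    def from_float(cls, mod):
        # A real implementation copies weights and attaches weight
        # fake-quantize modules here.
        return cls()

qat_module_mappings = {ConvBnReLU2d: QATConvBnReLU2d}

def swap_module(mod, mappings):
    # Look up the module's class; leave unmapped modules untouched.
    qat_cls = mappings.get(type(mod))
    return qat_cls.from_float(mod) if qat_cls else mod
```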
Insert Quantize nodes
TensorRT's ModelQuantizer is usually a good example: it quantizes every quantizable input tensor of a module. If you need to quantize other nodes, add them to the set with your own logic.
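The selection logic can be sketched as a pass over the traced graph. Real MQBench walks a torch.fx graph; here nodes are plain dicts and the op set is illustrative, not TensorRT's actual list.

```python
# Sketch: decide which node outputs need an activation fake-quantize.
# Real code walks a torch.fx graph; nodes here are plain dicts and the
# op set is illustrative.

QUANTIZABLE_OPS = {"conv2d", "linear", "add"}

def find_nodes_to_quantize(graph):
    to_quantize = []
    for node in graph:
        # Quantizable ops consume quantized inputs, so mark the
        # producers whose outputs feed into them.
        if node["op"] in QUANTIZABLE_OPS:
            to_quantize.extend(node["inputs"])
    return sorted(set(to_quantize))

graph = [
    {"name": "x",     "op": "placeholder", "inputs": []},
    {"name": "conv1", "op": "conv2d",      "inputs": ["x"]},
    {"name": "relu1", "op": "relu",        "inputs": ["conv1"]},
    {"name": "fc",    "op": "linear",      "inputs": ["relu1"]},
]
```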
Deploy Functions
The deploy stage turns our torch-based model into ONNX and then into backend-specific graphs. Typically, we merge BN into conv/deconv/linear, convert the quantized model to ONNX, and finally remove all fake-quantize ops and deploy. The functions are defined in mqbench.convert_deploy and registered with @register_deploy_function(BackendType).
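The mechanism behind this registration can be sketched as a decorator-based registry. This is a schematic, not the actual MQBench implementation; the backend string and pass names are illustrative.

```python
from collections import defaultdict

# Schematic registry: each backend accumulates an ordered list of
# deploy passes (merge BN, export ONNX, strip fake-quantize, ...).

DEPLOY_FUNCTIONS = defaultdict(list)

def register_deploy_function(backend):
    def decorator(fn):
        DEPLOY_FUNCTIONS[backend].append(fn)
        return fn
    return decorator

@register_deploy_function("MyBackend")     # backend name is illustrative
def convert_merge_bn(model):
    return model + ["bn_merged"]

@register_deploy_function("MyBackend")
def convert_onnx(model):
    return model + ["onnx"]

def convert_deploy(model, backend):
    # Run every registered pass for the backend, in registration order.
    for fn in DEPLOY_FUNCTIONS[backend]:
        model = fn(model)
    return model
```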
The deploy stage might be the hardest part of the flow, since it takes care of all the hardware details. The normal flow is:
1. No extra work is needed for merging BN and converting to ONNX.
2. For removing fake-quantize nodes and collecting qparams, check mqbench.deploy. There are examples for linear/logarithmic quantization and a self-defined ONNX runtime. Some platforms also support standard ONNX quantization; if yours does, export the FakeQuantizeAffine into ONNX Runtime QDQ operators. 1 2
3. If extra tools are needed to complete the translation, integrate them into mqbench.deploy. There is also an example for Vitis-AI.