TensorRT
Introduction
NVIDIA TensorRT is a platform for high-performance deep learning inference on NVIDIA GPUs.
Quantization Scheme
8-bit per-channel symmetric linear quantization:

\[q = \mathrm{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil, lb, ub\right)\]

where \(s\) is the scaling factor that maps a number from the floating-point range to the integer range, and \(lb\) and \(ub\) are the bounds of the integer range. For weights, \([lb, ub] = [-127, 127]\). For activations, \([lb, ub] = [-128, 127]\).
For weights, each filter (output channel) has an independent scale \(s\).
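The scheme above can be sketched in plain Python. This is a minimal illustration of per-channel symmetric quantization, not TensorRT's actual implementation; all function and variable names here are hypothetical, and the activation scale is an assumed calibrated value.

```python
def quantize(x, s, lb, ub):
    """Quantize a float x with scale s, clamped to the integer range [lb, ub]."""
    q = round(x / s)
    return max(lb, min(ub, q))

def per_channel_weight_scales(weight):
    """One scale per filter: s = max|w| / 127, so weights map into [-127, 127]."""
    return [max(abs(w) for w in filt) / 127 for filt in weight]

# Example: a weight tensor with two filters (output channels).
weight = [[0.5, -1.27, 0.2], [0.02, 0.04, -0.08]]
scales = per_channel_weight_scales(weight)
q_weight = [[quantize(w, s, -127, 127) for w in filt]
            for filt, s in zip(weight, scales)]
# First filter quantizes to [50, -127, 20] with its own scale.

# Activations share a single scale and use the full range [-128, 127];
# 2.0/127 stands in for a scale obtained from calibration.
act_scale = 2.0 / 127
q_act = quantize(-1.5, act_scale, -128, 127)  # -95
```

Note that each filter's largest-magnitude weight always quantizes to ±127 under its own scale, which is the point of per-channel quantization: small filters are not crushed by a single global scale.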
Deploy on TensorRT
Requirements:
Install TensorRT >= 8.0 EA from NVIDIA.
Deployment:
We provide an example of deploying a quantized model to TensorRT using AdaRound and explicit mode.
First, edit the datasets, pretrained, and output paths in </path-of-MQBench/application/imagenet_example/PTQ/configs/adaround/r18_8_8_trt.yaml>, then export the quantized model to ONNX.
cd /path-of-MQBench/application/imagenet_example/PTQ/ptq
python ptq.py --config /path-of-MQBench/application/imagenet_example/PTQ/configs/adaround/r18_8_8_trt.yaml
Second, build the TensorRT INT8 engine and evaluate it. Please make sure [dataset_path] contains a subfolder named [val].
python onnx2trt.py --onnx <path-of-onnx_quantized_deploy_model.onnx> --trt <model_name.trt> --data <dataset_path> --evaluate