TensorRT

NVIDIA TensorRT is a platform for high-performance deep learning inference on GPU device.

Quantization Scheme

8bit per-channel symmetric linear quantization.

\[\begin{equation} q = \mathtt{clamp}(\lfloor x * s \rceil, lb, ub) \end{equation}\]

where \(s\) is scaling factor to quantize a number from floating range to integer range, \(lb\) and \(ub\) are bounds of integer range. For weights, [lb, ub] = [-127, 127]. For activations, [lb, ub] = [-128, 127].

For weights, each filter needs an independent scale \(s\).

In fact, when building the TensorRT engine, the official tool requires the clipping value as quantization parameters, which can be calculated by \(c = s * 127\).