# NNIE

NNIE is a Neural Network Inference Engine of Hisilicon. It support INT8/INT16 quantization.

## Quantization Scheme

8/16 bit per-layer logarithmic quantization.

The specific quantization formulation is:

\begin{split}\begin{aligned} &z = \lfloor 16 * \log_2(c) \rceil - 127 \\ &\mathtt{fakequant(x)} = \begin{cases} - 2 ^ {\dfrac{\mathtt{clamp}(\lfloor 16 * \log_2(-x) \rceil - z, 1, 127) + z}{16}}, & x \lt - 2 ^ {\dfrac{z + 1}{16} - 1} \\ % 0, & - 2 ^ {\dfrac{z + 1}{16} - 1} \le x \lt 2 ^ {\dfrac{z}{16} - 1} \\ 2 ^ {\dfrac{\mathtt{clamp}(\lfloor 16 * \log_2(x) \rceil - z, 0, 127) + z}{16}}, & x \ge 2 ^ {\dfrac{z}{16} - 1} \\ zero, & otherwise \end{cases} \end{aligned}\end{split}

where $$c$$ is clipping range. $$2 ^ {\dfrac{z}{16}}$$ is the smallest positive value that can be represented after quantization.

It represents the integer number in True Form format. The highest bit represents the sign and the rest represents the absolute value of the number.

Floating Numer

Integer Number

Dequantized Floating Number

$$\bigg(- \infty, - 2 ^ {\dfrac{z + 126.5}{16}}\bigg]$$

-127

0xFF

$$- 2 ^ {\dfrac{z+127}{16}}$$

$$\bigg(- 2 ^ {\dfrac{z + 2.5}{16}}, - 2 ^ {\dfrac{z + 1.5}{16}}\bigg]$$

-2

0x82

$$- 2 ^ {\dfrac{z+2}{16}}$$

$$\bigg(- 2 ^ {\dfrac{z + 1.5}{16}}, - 2 ^ {\dfrac{z + 1}{16} - 1}\bigg)$$

-1

0x81

$$- 2 ^ {\dfrac{z+1}{16}}$$

$$\bigg[- 2 ^ {\dfrac{z + 1}{16} - 1}, 2 ^ {\dfrac{z}{16} - 1}\bigg)$$

-0

0x80

0

$$\bigg[2 ^ {\dfrac{z}{16} - 1}, 2 ^ {\dfrac{z + 0.5}{16}}\bigg)$$

0

0x00

$$2 ^ {\dfrac{z}{16}}$$

$$\bigg[2 ^ {\dfrac{z + 0.5}{16}}, 2 ^ {\dfrac{z + 1.5}{16}}\bigg)$$

1

0x01

$$2 ^ {\dfrac{z+1}{16}}$$

$$\bigg[2 ^ {\dfrac{z + 126.5}{16}}, + \infty\bigg)$$

127

0x7F

$$2 ^ {\dfrac{z+127}{16}}$$

NNIE performs a per-layer quantization, which means the inputs of the same layer share the same $$z_a$$ and the weights of the same layer share the same $$z_w$$.

In fact, when building engine using the official tool of NNIE, it requires the clipping value $$c$$ rather than $$z$$. $$c$$ needs to be a number in the ‘gfpq_param_table_8bit.txt’ which ensures that $$16 * \log_2{c}$$ is an integer.

Attention

Pooling: ceil_mode = True

Avoid using depthwise convolution.

Only support 2x nearest neighbor upsample.

For Detection task, you’d better choose RetinaNet structure.