
Getting Started with Deep Learning Compilers (Q3): Model Quantization - INT8 Convolution


INT8 Convolution

Before quantization: conv2d-fp32
Y_{fp32} = K_{fp32} * X_{fp32} + b_{fp32}

After quantization [1]:

  1. Quantization encoding: fp32-to-int8-IO
    Offline weight quantization: K_{fp32} \to K_{int8}
    Online input quantization: X_{fp32} \to X_{int8}
  2. INT8 convolution: conv2d-int8
    inner_{int32} = K_{int8} * X_{int8}
  3. Dequantization: int32-to-fp32-IO
    inner_{int32} \to inner_{fp32}
  4. Bias addition: Add
    Y_{fp32} = inner_{fp32} + b_{fp32}
    (a NumPy sketch of the full pipeline follows this list)
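
The four steps map directly onto a few lines of NumPy. The sketch below is illustrative only: it treats the convolution as a plain GEMM, assumes symmetric per-tensor max-abs calibration (scale = 127 / max|x|), and the helper name quantize_int8 and the scale variables are made up for this example rather than taken from any particular framework.

# Minimal sketch of the four steps above (assumed symmetric per-tensor quantization)
import numpy as np

def quantize_int8(x_fp32):
    """fp32 -> int8 encoding (step 1); returns the int8 tensor and its scale."""
    scale = 127.0 / np.max(np.abs(x_fp32))                    # assumed max-abs calibration
    x_int8 = np.clip(np.round(x_fp32 * scale), -127, 127).astype(np.int8)
    return x_int8, scale

rng = np.random.default_rng(0)
K_fp32 = rng.standard_normal((16, 64)).astype(np.float32)    # weights (conv expressed as a GEMM)
X_fp32 = rng.standard_normal((64, 32)).astype(np.float32)    # input
b_fp32 = rng.standard_normal((16, 1)).astype(np.float32)     # bias

# Step 1: offline weight encoding + online input encoding
K_int8, k_scale = quantize_int8(K_fp32)
X_int8, x_scale = quantize_int8(X_fp32)

# Step 2: INT8 "convolution" (here a GEMM), accumulated in int32
inner_int32 = K_int8.astype(np.int32) @ X_int8.astype(np.int32)

# Step 3: dequantize int32 -> fp32 by dividing out both scales
inner_fp32 = inner_int32.astype(np.float32) / (k_scale * x_scale)

# Step 4: add the (still fp32) bias
Y_fp32 = inner_fp32 + b_fp32

# Compare against the fp32 reference Y = K * X + b
Y_ref = K_fp32 @ X_fp32 + b_fp32
print("max abs error:", np.max(np.abs(Y_fp32 - Y_ref)))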

TensorRT 2.1 convolution kernel pseudocode implementation [2]

// I8 input tensors: I8_input, I8_weights
// I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]

I32_gemm_out = I8_input * I8_weights       // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out       // Cast I32 GEMM output to F32 float

// At this point we have F32_gemm_out which is scaled by  ( input_scale * weights_scale[K] ), 
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
For i in 0, ... K-1:
        rescaled_F32_gemm_out[ :, i, :, :] =  F32_gemm_out[ :, i, :, :]   * \
                         [ output_scale / (input_scale * weights_scale[ i ] ) ]

// Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"

rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias

// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out_with_bias)

// Convert to INT8 and save to global memory
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
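
For readers who prefer runnable code, here is the same logic re-expressed in NumPy. This is a sketch under simplifying assumptions, not TensorRT's actual kernel: the convolution is flattened into a GEMM (as for a 1x1 conv) so a single matmul stands in for the DP4A path, and the function name int8_conv_kernel and its argument layout are invented for illustration. Only weights_scale is per output channel, mirroring weights_scale[K] above.

# NumPy re-expression of the pseudocode above (illustrative sketch, not TensorRT code)
import numpy as np

def int8_conv_kernel(I8_input, I8_weights, bias_f32,
                     input_scale, output_scale, weights_scale):
    # I8_input: [N, C, P] (spatial dims flattened to P), I8_weights: [K, C],
    # bias_f32 and weights_scale: [K]; scales follow the convention int8 = fp32 * scale.

    # Compute the INT8 GEMM with int32 accumulation (stand-in for DP4A)
    I32_gemm_out = np.einsum('kc,ncp->nkp',
                             I8_weights.astype(np.int32),
                             I8_input.astype(np.int32))
    F32_gemm_out = I32_gemm_out.astype(np.float32)

    # Per-channel rescale so the result is scaled by output_scale
    rescale = output_scale / (input_scale * weights_scale)        # shape [K]
    rescaled_F32_gemm_out = F32_gemm_out * rescale[None, :, None]

    # Rescale the original fp32 bias with output_scale before adding
    with_bias = rescaled_F32_gemm_out + output_scale * bias_f32[None, :, None]

    # ReLU in fp32, then round and saturate to int8
    F32_result = np.maximum(with_bias, 0.0)
    I8_output = np.clip(np.rint(F32_result), -127, 127).astype(np.int8)
    return I8_output

Note that, as in the pseudocode, the bias stays in fp32 and is multiplied by output_scale, so the addition happens in the output's scale rather than the accumulator's scale of input_scale * weights_scale[i].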

  1. Quantization strategy

  2. 8-bit Inference with TensorRT, Szymon Migacz, NVIDIA, May 8, 2017