
Getting Started with Deep Learning Compilers (Q3): Model Quantization - INT8 Convolution


INT8 Convolution

Before quantization: conv2d-fp32
Y_{fp32} = K_{fp32} * X_{fp32} + b_{fp32}

After quantization [1]:

  1. Quantization encoding: fp32-to-int8-IO
    Offline weight quantization: K_{fp32} \to K_{int8}
    Online input quantization: X_{fp32} \to X_{int8}
  2. INT8 convolution: conv2d-int8
    inner_{int32} = K_{int8} * X_{int8}
  3. Dequantization: int32-to-fp32-IO
    inner_{int32} \to inner_{fp32}
  4. Bias addition: Add
    Y_{fp32} = inner_{fp32} + b_{fp32}
    (a NumPy sketch of the full pipeline follows this list)
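
The four steps map directly onto a few lines of NumPy. The sketch below is illustrative only: it treats the convolution as a plain GEMM, assumes symmetric per-tensor max-abs calibration (scale = 127 / max|x|), and the helper name quantize_int8 and the scale variables are made up for this example rather than taken from any particular framework.

# Minimal sketch of the four steps above (assumed symmetric per-tensor quantization)
import numpy as np

def quantize_int8(x_fp32):
    """fp32 -> int8 encoding (step 1); returns the int8 tensor and its scale."""
    scale = 127.0 / np.max(np.abs(x_fp32))                    # assumed max-abs calibration
    x_int8 = np.clip(np.round(x_fp32 * scale), -127, 127).astype(np.int8)
    return x_int8, scale

rng = np.random.default_rng(0)
K_fp32 = rng.standard_normal((16, 64)).astype(np.float32)    # weights (conv expressed as a GEMM)
X_fp32 = rng.standard_normal((64, 32)).astype(np.float32)    # input
b_fp32 = rng.standard_normal((16, 1)).astype(np.float32)     # bias

# Step 1: offline weight encoding + online input encoding
K_int8, k_scale = quantize_int8(K_fp32)
X_int8, x_scale = quantize_int8(X_fp32)

# Step 2: INT8 "convolution" (here a GEMM), accumulated in int32
inner_int32 = K_int8.astype(np.int32) @ X_int8.astype(np.int32)

# Step 3: dequantize int32 -> fp32 by dividing out both scales
inner_fp32 = inner_int32.astype(np.float32) / (k_scale * x_scale)

# Step 4: add the (still fp32) bias
Y_fp32 = inner_fp32 + b_fp32

# Compare against the fp32 reference Y = K * X + b
Y_ref = K_fp32 @ X_fp32 + b_fp32
print("max abs error:", np.max(np.abs(Y_fp32 - Y_ref)))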

TensorRT 2.1 convolution kernel pseudocode implementation [2]

// I8 input tensors: I8_input, I8_weights
// I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]

I32_gemm_out = I8_input * I8_weights       // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out       // Cast I32 GEMM output to F32 float

// At this point we have F32_gemm_out which is scaled by  ( input_scale * weights_scale[K] ), 
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
For i in 0, ... K-1:
        rescaled_F32_gemm_out[ :, i, :, :] =  F32_gemm_out[ :, i, :, :]   * \
                         [ output_scale / (input_scale * weights_scale[ i ] ) ]

// Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"

rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias

// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out_with_bias)

// Convert to INT8 and save to global memory
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
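
For readers who prefer runnable code, here is the same logic re-expressed in NumPy. This is a sketch under simplifying assumptions, not TensorRT's actual kernel: the convolution is flattened into a GEMM (as for a 1x1 conv) so a single matmul stands in for the DP4A path, and the function name int8_conv_kernel and its argument layout are invented for illustration. Only weights_scale is per output channel, mirroring weights_scale[K] above.

# NumPy re-expression of the pseudocode above (illustrative sketch, not TensorRT code)
import numpy as np

def int8_conv_kernel(I8_input, I8_weights, bias_f32,
                     input_scale, output_scale, weights_scale):
    # I8_input: [N, C, P] (spatial dims flattened to P), I8_weights: [K, C],
    # bias_f32 and weights_scale: [K]; scales follow the convention int8 = fp32 * scale.

    # Compute the INT8 GEMM with int32 accumulation (stand-in for DP4A)
    I32_gemm_out = np.einsum('kc,ncp->nkp',
                             I8_weights.astype(np.int32),
                             I8_input.astype(np.int32))
    F32_gemm_out = I32_gemm_out.astype(np.float32)

    # Per-channel rescale so the result is scaled by output_scale
    rescale = output_scale / (input_scale * weights_scale)        # shape [K]
    rescaled_F32_gemm_out = F32_gemm_out * rescale[None, :, None]

    # Rescale the original fp32 bias with output_scale before adding
    with_bias = rescaled_F32_gemm_out + output_scale * bias_f32[None, :, None]

    # ReLU in fp32, then round and saturate to int8
    F32_result = np.maximum(with_bias, 0.0)
    I8_output = np.clip(np.rint(F32_result), -127, 127).astype(np.int8)
    return I8_output

Note that, as in the pseudocode, the bias stays in fp32 and is multiplied by output_scale, so the addition happens in the output's scale rather than the accumulator's scale of input_scale * weights_scale[i].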

  1. Quantization strategy

  2. 8-bit Inference with TensorRT, Szymon Migacz, NVIDIA, May 8, 2017