QIL
> Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss
Quantization Functions:
\[\begin{split}\hat w = \begin{cases}
0, & \text{if } |w| < c_W - d_W \\
\text{sign}(w), & \text{if } |w| > c_W + d_W \\
(\alpha_W |w| + \beta_W)^{\gamma} \cdot \text{sign}(w), & \text{otherwise}
\end{cases}\end{split}\]
where \(\alpha_W = 0.5/d_W\) and \(\beta_W = -0.5c_W/d_W + 0.5\)
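The weight mapping above can be sketched in a few lines of plain Python. This is only an illustration of the piecewise formula for a single scalar weight: the function name `transform_weight` is hypothetical, `sign(0)` is taken to be 0, and the clamp to \([0, 1]\) is an added guard against floating-point round-off rather than part of the formula.

```python
import math


def transform_weight(w, c_w, d_w, gamma=1.0):
    """Map one full-precision weight into [-1, 1] with the piecewise rule.

    c_w is the interval center, d_w the half-width (must be positive),
    gamma the exponent applied inside the interval.
    """
    if d_w <= 0:
        raise ValueError("d_w must be positive")
    mag = abs(w)
    # Assumption: sign(0) is treated as 0.
    sign = math.copysign(1.0, w) if w != 0 else 0.0
    if mag < c_w - d_w:   # below the interval: mapped to zero
        return 0.0
    if mag > c_w + d_w:   # above the interval: clipped to +/-1
        return sign
    alpha = 0.5 / d_w
    beta = -0.5 * c_w / d_w + 0.5
    # Guard against tiny negative values caused by round-off before the power.
    u = min(max(alpha * mag + beta, 0.0), 1.0)
    return (u ** gamma) * sign


if __name__ == "__main__":
    for w in (0.1, 0.5, -0.5, -0.9):
        print(w, transform_weight(w, c_w=0.5, d_w=0.3, gamma=1.0))
```

With \(\gamma = 1\) the mapping inside the interval is linear in \(|w|\); other values of \(\gamma\) bend that segment while leaving the two thresholds unchanged.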
\[\begin{split}\hat a = \begin{cases}
0, & \text{if } a < c_A - d_A \\
1, & \text{if } a > c_A + d_A \\
\alpha_A a + \beta_A, & \text{otherwise}
\end{cases}\end{split}\]
where \(\alpha_A = 0.5/d_A\) and \(\beta_A = -0.5c_A/d_A + 0.5\)
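A minimal Python sketch of the activation mapping, again for a single scalar and with a hypothetical function name (`transform_activation`). Substituting \(\alpha_A\) and \(\beta_A\) shows that the middle branch equals \((a - (c_A - d_A)) / (2 d_A)\), so the whole function is a linear ramp from 0 to 1 across the interval.

```python
def transform_activation(a, c_a, d_a):
    """Map one activation into [0, 1] with the piecewise rule.

    c_a is the interval center and d_a the half-width (must be positive).
    """
    if d_a <= 0:
        raise ValueError("d_a must be positive")
    if a < c_a - d_a:     # below the interval: mapped to zero
        return 0.0
    if a > c_a + d_a:     # above the interval: saturates at one
        return 1.0
    alpha = 0.5 / d_a
    beta = -0.5 * c_a / d_a + 0.5
    # The clamp only guards against floating-point round-off at the edges.
    return min(max(alpha * a + beta, 0.0), 1.0)


if __name__ == "__main__":
    for a in (-3.0, 0.2, 1.0, 1.25, 2.0):
        print(a, transform_activation(a, c_a=1.0, d_a=0.5))
```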
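Training the quantizer:
- Gradients with respect to the network parameters \(\hat w\) are computed by backpropagation (BP); the quantizer parameters \(\gamma\), \(c_W\), and \(d_W\) then receive their gradients through:
- \(\frac{\partial \hat w}{\partial \gamma}\)
- \(\frac{\partial \hat w}{\partial c_W}\)
- \(\frac{\partial \hat w}{\partial d_W}\)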
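As a sketch, these three derivatives can be written out by differentiating the weight formula directly. This covers only the piecewise mapping as stated in these notes and does not account for any further discretization step. Inside the interval, write \(u = \alpha_W |w| + \beta_W = \frac{|w| - c_W}{2 d_W} + \frac{1}{2}\), so that \(\hat w = u^{\gamma} \cdot \text{sign}(w)\). Then, for \(u > 0\):

\[\begin{split}
\frac{\partial \hat w}{\partial \gamma} &= u^{\gamma} \ln u \cdot \text{sign}(w) \\
\frac{\partial \hat w}{\partial c_W} &= -\frac{\gamma}{2 d_W} \, u^{\gamma - 1} \cdot \text{sign}(w) \\
\frac{\partial \hat w}{\partial d_W} &= -\frac{\gamma \, (|w| - c_W)}{2 d_W^{2}} \, u^{\gamma - 1} \cdot \text{sign}(w)
\end{split}\]

Outside the interval \(\hat w\) is constant (either \(0\) or \(\text{sign}(w)\)), so all three derivatives are zero there. Only weights whose magnitude falls inside \([c_W - d_W, c_W + d_W]\) contribute to the updates of \(\gamma\), \(c_W\), and \(d_W\).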
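Advantages:
- Although the quantization procedure is complex, the quantized weights are fixed at inference time and can still be represented as integers.
- The interval is offset from zero, so weights with \(|w| < c_W - d_W\) are mapped to 0, which has a pruning effect.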
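The pruning effect can be illustrated with a small Python sketch that counts how many weights fall below the lower threshold \(c_W - d_W\). The helper name `pruned_fraction`, the Gaussian toy weights, and the chosen \((c_W, d_W)\) pairs are all assumptions made for illustration, not values from the paper.

```python
import random


def pruned_fraction(weights, c_w, d_w):
    """Fraction of weights with |w| < c_w - d_w, i.e. mapped to exactly 0."""
    threshold = c_w - d_w
    zeros = sum(1 for w in weights if abs(w) < threshold)
    return zeros / len(weights)


# Toy weights drawn from a zero-mean Gaussian (an arbitrary choice).
random.seed(0)
weights = [random.gauss(0.0, 0.3) for _ in range(10_000)]

if __name__ == "__main__":
    # A narrower interval around the same center raises the lower
    # threshold, so more small weights are zeroed out.
    for c_w, d_w in [(0.5, 0.5), (0.5, 0.3), (0.5, 0.1)]:
        frac = pruned_fraction(weights, c_w, d_w)
        print(f"c_W={c_w}, d_W={d_w}: {frac:.1%} of weights mapped to 0")
```

When \(c_W = d_W\) the lower threshold is 0 and nothing is pruned; shrinking \(d_W\) at a fixed \(c_W\) raises the threshold and increases sparsity.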
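Disadvantages:
- The weights that actually take part in the network computation are restricted to \(w\in [-1,1]\), which does not generalize well to tasks other than classification.
- The same holds for the activations, \(a\in [0, 1]\).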
Experiments
The 2/2-bit results are obtained with progressive finetuning, whereas LQ-Net is trained from scratch, so the two are not very comparable.
Direct: direct finetuning from full-precision weights. Progressive finetuning: finetuning through successively lower bit-widths (i.e., FP → 5/5 → 4/4 → 3/3 → 2/2 for a 2/2-bit network).
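The difference between the two initialization strategies can be sketched as a list of training stages. The function names are hypothetical, and the generalization to other target bit-widths (always starting at 5/5 and dropping one bit per stage) is an assumption extrapolated from the single 2/2-bit example above.

```python
def direct_schedule(target_bits):
    """Direct finetuning: go straight from full precision to the target."""
    return ["FP", f"{target_bits}/{target_bits}"]


def progressive_schedule(target_bits, start_bits=5):
    """Progressive finetuning: lower the bit-width one step at a time.

    Assumption: each stage is initialized from the previous one, and the
    chain always starts at start_bits/start_bits after full precision.
    """
    if not 1 <= target_bits <= start_bits:
        raise ValueError("target_bits must be between 1 and start_bits")
    stages = ["FP"]
    for bits in range(start_bits, target_bits - 1, -1):
        stages.append(f"{bits}/{bits}")
    return stages


if __name__ == "__main__":
    print(" -> ".join(direct_schedule(2)))
    print(" -> ".join(progressive_schedule(2)))
```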
Joint Training: The weights and quantization parameters are trained jointly. Quantizer only: Only the quantizers are optimized.