Sunday, September 22, 2019

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding


This is a method for DNN model reduction (compression). The first stage removes (prunes) the connections in the original model whose weights fall below a threshold, then retrains the network to make sure the error rate does not increase. The second stage clusters the weights of each layer and uses each cluster's centroid (or mean) as a code book entry representing that layer's weights; this step closely resembles vector quantization. The third stage then compresses the result with a Huffman code according to how often each code word in the code book occurs.
Most DNNs can be compressed by roughly 20x this way, and they run faster, too! (Compression ratios of 35x to 49x were reported in the paper; as expected, this approach consumes a lot of time and computing resources in the training phase.)
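A minimal NumPy sketch of the first two stages (magnitude pruning and k-means weight sharing); the threshold, layer shape, and cluster count below are illustrative choices of mine, not values from the paper, and the retraining step is only noted in a comment.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Stage 1: zero out connections whose |weight| is below the threshold.
    In the real pipeline the surviving weights are then retrained."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def kmeans_weight_sharing(weights, mask, n_clusters=16, n_iters=20):
    """Stage 2: cluster the surviving weights; each weight becomes an index
    into a small codebook of shared centroid values."""
    vals = weights[mask]
    # Linear initialization over the weight range, as the paper recommends.
    centroids = np.linspace(vals.min(), vals.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid ...
        idx = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
        # ... and move each non-empty centroid to the mean of its members.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = vals[idx == k].mean()
    return idx, centroids

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256))   # stand-in for one layer's weights
w_pruned, mask = prune_by_magnitude(w, threshold=0.04)
indices, codebook = kmeans_weight_sharing(w_pruned, mask, n_clusters=16)
bits = int(np.ceil(np.log2(len(codebook))))
print(f"kept {mask.mean():.1%} of connections; "
      f"{len(codebook)} shared weights -> {bits}-bit indices")
```

With 16 clusters each surviving weight is stored as a 4-bit index plus one shared 16-entry codebook per layer, which is the same mechanism behind the 32-bit-to-5-bit reduction the abstract describes.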

https://arxiv.org/abs/1510.00149

[Deep Neural Network Compression] Deep Compression (ICLR 2016 Best Paper)
https://zhuanlan.zhihu.com/p/21574328


Neural networks are both computationally intensive and memory intensive, making
them difficult to deploy on embedded systems with limited hardware resources. To
address this limitation, we introduce “deep compression”, a three stage pipeline:
pruning, trained quantization and Huffman coding, that work together to reduce
the storage requirement of neural networks by 35× to 49× without affecting their
accuracy. Our method first prunes the network by learning only the important
connections. Next, we quantize the weights to enforce weight sharing; finally, we
apply Huffman coding. After the first two steps we retrain the network to fine-tune
the remaining connections and the quantized centroids. Pruning reduces the
number of connections by 9× to 13×; quantization then reduces the number of
bits that represent each connection from 32 to 5. On the ImageNet dataset, our
method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB,
without loss of accuracy. Our method reduced the size of VGG-16 by 49× from
552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model
into on-chip SRAM cache rather than off-chip DRAM memory. Our compression
method also facilitates the use of complex neural networks in mobile applications
where application size and download bandwidth are constrained. Benchmarked on
CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup
and 3× to 7× better energy efficiency.
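
The abstract's third stage assigns shorter codes to the more frequent quantization indices. Here is a minimal, self-contained sketch of that idea; the skewed toy index distribution and the `huffman_code_lengths` helper are my assumptions for illustration, not the paper's code, and real index statistics come from the trained, quantized network.

```python
import heapq
import math
from collections import Counter

import numpy as np

def huffman_code_lengths(counts):
    """Return {symbol: code length} for a Huffman code over symbol counts."""
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code inside them.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy skewed distribution standing in for cluster indices after weight sharing.
rng = np.random.default_rng(0)
p = np.array([2.0 ** -k for k in range(16)])
indices = rng.choice(16, size=65536, p=p / p.sum())

counts = Counter(indices.tolist())
lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / sum(counts.values())
fixed_bits = math.ceil(math.log2(len(counts)))
print(f"Huffman: {avg_bits:.2f} bits/weight vs {fixed_bits} bits fixed-length")
```

Because the index histogram after weight sharing is far from uniform, the average code length drops below the fixed-length log2(K) bits per weight; that gap is the extra saving Huffman coding contributes on top of pruning and quantization.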


