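Although I have been training and running inference with DNN models for quite a while, this week was the first time I actually understood what cuDNN does.
I had wondered before how TensorFlow/PyTorch implement convolution: wouldn't an FFT be the better choice?
The reference below gives a very good explanation: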
Why GEMM is at the heart of deep learning
Why GEMM works for Convolutions
Hopefully you can now see how you can express a convolutional layer as a matrix multiplication, but it’s still not obvious why you would do it. The short answer is that it turns out that the Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs. This paper from Nvidia is a good introduction to some of the different approaches you can use, but they also describe why they ended up with a modified version of GEMM as their favored approach. There are also a lot of advantages to being able to batch up a lot of input images against the same kernels at once, and this paper on Caffe con troll uses those to very good effect. The main competitor to the GEMM approach is using Fourier transforms to do the operation in frequency space, but the use of strides in our convolutions makes it hard to be as efficient.
The good news is that having a single, well-understood function taking up most of our time gives a very clear path to optimizing for speed and power usage, both with better software implementations and by tailoring the hardware to run the operation well.
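To make the "convolution as matrix multiplication" idea concrete, here is a minimal pure-Python sketch of the im2col + GEMM approach. It is only an illustration under simplifying assumptions (single-channel 2D input, no padding, naive matmul) and not how cuDNN is actually implemented; the function names `im2col`, `matmul`, `conv2d_gemm`, and `conv2d_direct` are made up for this example. As in deep learning frameworks, "convolution" here means cross-correlation (the kernel is not flipped).

```python
def im2col(image, kh, kw, stride=1):
    """Unroll every kh x kw patch of a 2D image (list of rows) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def matmul(a, b):
    """Naive GEMM: (m x k) times (k x n). A real BLAS does this far faster."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_gemm(image, kernels, stride=1):
    """Apply several kernels at once with a single matrix multiplication."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    # Patch matrix: (num_patches x kh*kw). Overlapping pixels are duplicated,
    # which is the "wasteful storage cost" mentioned in the quote.
    patches = im2col(image, kh, kw, stride)
    # Weight matrix: (kh*kw x num_kernels), one column per flattened kernel.
    weights = [[k[di][dj] for k in kernels]
               for di in range(kh) for dj in range(kw)]
    return matmul(patches, weights)  # (num_patches x num_kernels)


def conv2d_direct(image, kernel, stride=1):
    """Reference sliding-window version for one kernel, used for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        for j in range(0, len(image[0]) - kw + 1, stride):
            out.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
    return out
```

Each column of the `conv2d_gemm` result should match `conv2d_direct` for the corresponding kernel. Strides are handled simply by skipping patches in `im2col`, while the multiplication itself stays one regular GEMM.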
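It also finally made sense to me why, as long as the network architecture does not change, setting
torch.backends.cudnn.benchmark = True
makes things about 15% faster on my machine. With this flag, cuDNN times the available convolution algorithms for each new input configuration and caches the fastest one, so the one-time search pays off when the shapes stay fixed.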
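A rough way to check the effect of this flag yourself is sketched below. The layer sizes, batch shape, and the helper name `time_conv` are arbitrary assumptions for illustration, not the setup behind the 15% figure; the function returns None when PyTorch or a CUDA GPU is not available, and the actual speedup depends on the GPU, the cuDNN version, and the layer shapes.

```python
import time

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def time_conv(benchmark, iters=50):
    """Return seconds per forward pass, or None if no CUDA GPU is usable."""
    if torch is None or not torch.cuda.is_available():
        return None
    torch.backends.cudnn.benchmark = benchmark
    # Hypothetical layer and input shape, chosen only for illustration.
    conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
    x = torch.randn(32, 64, 56, 56, device="cuda")
    with torch.no_grad():
        for _ in range(10):  # warm-up; algorithm search runs here if enabled
            conv(x)
        torch.cuda.synchronize()  # GPU work is async, so sync before timing
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    print("benchmark=False:", time_conv(False))
    print("benchmark=True: ", time_conv(True))
```

If input shapes change from batch to batch, the search is repeated for every new shape, and the flag can end up slowing things down instead.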
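Because so many problems can be solved with AI workloads, and GEMM is the key to AI performance, advances in semiconductors can keep paying off in AI accelerators and get past Amdahl's Law. This addresses the predicament of the past twenty years, shown in the red part of the figure below.
That is why John Hennessy and David Patterson talk about: A New Golden Age for Computer Architecture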
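The Amdahl's Law point can be made concrete with a small calculation. Amdahl's Law says that if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The fractions below are made-up numbers for illustration, not measurements; the function name `amdahl_speedup` is also just for this sketch.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of run time is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # Hypothetical workloads: the larger the share of time spent in GEMM,
    # the more of a 10x GEMM accelerator shows up in the end-to-end speedup.
    for p in (0.5, 0.9, 0.99):
        print(f"GEMM fraction {p:.2f}: overall speedup {amdahl_speedup(p, 10):.2f}x")
```

When one well-understood operation such as GEMM takes up nearly all of the time, p is close to 1 and specialized hardware translates almost directly into end-to-end gains, which is the situation the quoted article describes.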