P31: Understanding the Performance of Small Convolution Operations for CNN on Intel Architecture
Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, natural language processing, and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations including matrix-matrix multiplication formulation, FFT-formulation, Winograd transformation, and direct convolution primarily targeting GPUs. In this paper, we optimize a direct convolution and Winograd implementation for x86 architectures, in particular for Xeon Phi systems, via a dynamic compilation approach. We then show how these JIT optimizations can be integrated in a high-level domain-specific language setting. We shed light on what is possible and what is not possible based on different data-formats and blocking techniques. Our JIT-based Ninja implementation shows close to theoretical peak results on modern x86 architectures, depending on setting and the CPU architecture at hand.
Award: Best Poster Finalist (BP): no
Two-page extended abstract: pdf