The running time of a GPU kernel depends on an invocation parameter: the number of threads in each thread block. Sometimes the dependence is quite strong, leading to a 50-100% change in execution time for long-running kernels. Until now, choosing the optimal setting for this parameter has been an art form. Nvidia provides a tool for CUDA kernels, called OCC, that guides a developer toward this goal. In this paper, we show that OCC maximizes the occupancy of GPU cores but does not meet the performance goal in a wide class of applications. We develop a solution called Snowpack that uses static features in a statistical learning framework to choose the optimal block size parameter. It does this without needing to execute the kernel multiple times, as Autotuner, a possible alternative solution, does. We evaluate Snowpack on 89 kernels from 10 applications.
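To make the notion of occupancy concrete, the following minimal sketch estimates theoretical occupancy (resident warps divided by the maximum number of warps per multiprocessor) for several candidate block sizes. It is neither Nvidia's tool nor Snowpack: the names `SMLimits`, `occupancy`, and `max_occupancy_block_sizes` are hypothetical, the hardware limits are assumed values chosen for illustration, and allocation granularity is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits. These values are ASSUMED for illustration only."""
    max_blocks: int = 32          # resident thread blocks
    max_warps: int = 64           # resident warps
    registers: int = 65536        # 32-bit registers
    shared_mem_bytes: int = 49152
    warp_size: int = 32


def occupancy(block_size, regs_per_thread, smem_per_block=0, sm=SMLimits()):
    """Simplified theoretical occupancy in [0, 1] for one block size."""
    warps_per_block = -(-block_size // sm.warp_size)  # ceiling division
    # Each resource caps the number of blocks that can be resident at once.
    by_warps = sm.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    by_regs = sm.registers // regs_per_block if regs_per_block else sm.max_blocks
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


def max_occupancy_block_sizes(candidates, regs_per_thread, smem_per_block=0):
    """Return every candidate block size that ties for the highest occupancy."""
    scores = {b: occupancy(b, regs_per_thread, smem_per_block) for b in candidates}
    best = max(scores.values())
    return [b for b in candidates if scores[b] == best]


if __name__ == "__main__":
    candidates = [32, 64, 128, 256, 512, 1024]
    for b in candidates:
        print(b, occupancy(b, regs_per_thread=32))
    print(max_occupancy_block_sizes(candidates, regs_per_thread=32))
```

Under these assumed limits and 32 registers per thread, every candidate from 64 to 1024 threads reaches full occupancy, so occupancy alone does not single out one block size. Deciding among such candidates would still require running the kernel or using a predictive model.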