With recent advances in OpenCL-based High-Level Synthesis, FPGAs have become a more attractive choice for accelerating High Performance Computing workloads. Despite their power-efficiency advantage, FPGAs usually fall short of GPUs in raw performance because their memory bandwidth and compute performance are several times lower. In this work, we show that, owing to the architectural advantage of FPGAs for stencil computation, these devices can offer performance comparable to high-end GPUs in addition to their power efficiency. We achieve this with a parameterized OpenCL-based implementation that combines spatial and temporal blocking with multiple advanced FPGA-specific optimizations to maximize performance. We show that it is possible to achieve up to 60 GBps and 230 GBps of effective throughput for 3D stencil computation on Intel Stratix V and Arria 10 FPGAs, respectively, which is comparable to highly optimized implementations on high-end GPUs.
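To make the combination of spatial and temporal blocking concrete, the following is a minimal sketch and not the OpenCL kernel described above. It assumes a 1D 3-point averaging stencil with fixed end cells instead of a 3D stencil, and uses overlapped blocking, in which each block is loaded with a halo as wide as the number of fused time steps. All names (`step`, `reference`, `blocked`, `block`, `tdeg`) are hypothetical and chosen only for illustration.

```python
def step(cells):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out


def reference(grid, iters):
    """Unblocked baseline: every time step sweeps the whole grid."""
    for _ in range(iters):
        grid = step(grid)
    return grid


def blocked(grid, iters, block, tdeg):
    """Overlapped spatial and temporal blocking (illustrative assumption).

    block: number of output cells per spatial block.
    tdeg:  number of time steps fused per pass (degree of temporal blocking).
    """
    n = len(grid)
    done = 0
    while done < iters:
        t = min(tdeg, iters - done)  # the last pass may fuse fewer steps
        out = [0.0] * n
        for start in range(0, n, block):
            end = min(start + block, n)
            # Load the block plus a halo of t cells per side, clamped to the grid.
            lo = max(0, start - t)
            hi = min(n, end + t)
            tile = grid[lo:hi]
            # Run t time steps on the tile without touching the full grid.
            # Stale values creep inward one cell per step from artificial edges,
            # so after t steps only the halo is invalid.
            for _ in range(t):
                tile = step(tile)
            # Write back only the valid interior of the tile.
            out[start:end] = tile[start - lo:end - lo]
        grid = out
        done += t
    return grid


grid = [float((i * 7) % 13) for i in range(50)]
print(blocked(grid, 10, 8, 4) == reference(grid, 10))
```

In this sketch each pass reads the grid once and advances it by up to `tdeg` time steps, which is how temporal blocking reduces external memory traffic. The cost is redundant computation in the halo cells, which grows with `tdeg` relative to `block`.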