Investigating Half-Precision Arithmetic to Accelerate Dense Linear System Solvers
Workshop: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Authors: Azzam Haidar (University of Tennessee)
Abstract: Low-precision arithmetic in mixed-precision computing methods has been a powerful tool for accelerating numerous scientific computing applications. Artificial intelligence (AI), in particular, has pushed this to current extremes, using half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is the high performance it can achieve on today's powerful manycore GPU accelerators, such as the NVIDIA V100, which alone can provide 120 TeraFLOPS in FP16. We present an investigation showing that other HPC applications can harness this power too, in particular the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic on high-end GPUs. We then show, for the first time, how FP16 arithmetic can significantly accelerate FP32- or FP64-precision Ax = b solvers and make them more energy efficient. Our results are reproducible, and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach.
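The mixed-precision iterative refinement technique named in the abstract can be illustrated with a small sketch. This is not the MAGMA implementation or the authors' GPU algorithm: it is a pure-Python toy that assumes FP16 can be imitated by rounding through the `struct` module's half-precision format, and the helper names (`to_fp16`, `lu_factor_fp16`, `lu_solve`, `refine_solve`) and the 4x4 test matrix are made up for illustration. The factorization is done once in simulated FP16, while residuals and solution updates are kept in FP64.

```python
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest IEEE half-precision value.

    Assumption: values stay within the FP16 range (|x| <= 65504);
    struct raises OverflowError otherwise.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_factor_fp16(A):
    """LU with partial pivoting; every stored entry is rounded to FP16."""
    n = len(A)
    LU = [[to_fp16(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest pivot in column k and swap whole rows.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = to_fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = to_fp16(LU[i][j] - LU[i][k] * LU[k][j])
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve with the stored factors (unit lower L, upper U)."""
    n = len(LU)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):  # forward substitution with L
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def refine_solve(A, b, tol=1e-12, max_iter=50):
    """Factor once in low precision, then refine the solution in FP64."""
    n = len(A)
    LU, perm = lu_factor_fp16(A)
    x = lu_solve(LU, perm, b)  # initial low-accuracy solution
    for it in range(max_iter):
        # The residual is computed with the original FP64 matrix.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        d = lu_solve(LU, perm, r)  # cheap correction from the FP16 factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


# Illustrative, well-conditioned test problem (hypothetical data).
A = [[10.0, 1.0, 2.0, 0.0],
     [1.0, 12.0, -1.0, 3.0],
     [2.0, -1.0, 9.0, 1.0],
     [0.0, 3.0, 1.0, 11.0]]
x_true = [1.0, -2.0, 3.0, 0.5]
b = [sum(A[i][j] * x_true[j] for j in range(4)) for i in range(4)]

x, iters = refine_solve(A, b)
print("refinement steps:", iters)
print("max error:", max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

The design point the sketch tries to capture is that the expensive O(n^3) factorization runs only once in the fast, low-precision format, while each refinement step costs only O(n^2) in higher precision. Convergence depends on the conditioning of A; for ill-conditioned matrices a loop like this may stall or diverge, which is one of the practical limitations the abstract says the work quantifies.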