Batched, Reproducible, and Reduced Precision BLAS

Authors: Piotr Luszczek (University of Tennessee)

Abstract: This BoF will bring together the community focused on extending the Basic Linear Algebra Subprograms (BLAS). The existing BLAS have proven effective in supporting portable, efficient software for sequential machines and the current class of high-performance computers. We’d like to investigate the possibility of extending the currently accepted standards to provide greater parallelism for small-size operations, reproducibility, and reduced-precision support. This is an open forum to discuss and formalize details. The agenda and talks from past workshops can be found here: http://bit.ly/Batch-BLAS-2017 http://bit.ly/Batch-BLAS-2016

A standard interface will be considered for the Batched, Reproducible, and Reduced Precision BLAS.

Long Description: Historically, most design efforts in the HPC community have been directed at solving large linear algebra problems, which were handled by the original set of Basic Linear Algebra Subprograms. In recent years, however, the state-of-the-art approaches for addressing large-scale problems have been undergoing a tremendous change. It is becoming increasingly common in many scientific fields to decompose very large-scale simulation problems into a multitude of very small linear algebra operations that can be computed in parallel. Representative applications that exhibit this computing pattern include tensor contraction codes for the quantum Hall effect, astrophysics calculations, metabolic network applications, CFD and the resulting PDE solvers that use direct and multifrontal solvers, high-order FEM schemes for hydrodynamics, mixed direct-iterative preconditioned solvers, quantum chemistry calculations, image analysis, and signal processing.

Unfortunately, applications with many small matrix or tensor operations can exhibit very poor performance when using the standard optimized vendor linear algebra libraries. Different strategies, including the use of compiler technologies and autotuning schemes, have been investigated to adapt the existing libraries to small matrix problems, but without achieving satisfactory performance. Individually, these problems are too small to use modern HPC systems and the associated optimized libraries at full efficiency. Nevertheless, the fact that thousands of these problems have to be solved independently suggests that it is worth designing new linear algebra libraries. Consequently, batched BLAS algorithms have been introduced to solve thousands of small BLAS operations with only one function call. The computational science community and commercial outfits focused on intensive data analysis are actively working on implementations that fulfill the need for optimized batched BLAS-like kernels.

The Intel Math Kernel Library (MKL) has released a batched matrix-matrix multiplication (batched GEMM) as well as a batched triangular solver (batched TRSM). NVIDIA cuBLAS includes the same two routines, batched GEMM and batched triangular solve, along with batched versions of more advanced numerical linear algebra routines. The MAGMA library provides open source implementations of a number of batched BLAS routines for GPU accelerators. At the same time, some application developers design and implement their own batched BLAS-like kernels. The gradual introduction of batched BLAS routines in vendor libraries and important research software demonstrates awareness of the need for batched BLAS functionality, which is very encouraging.

To fully empower applications based on batched BLAS, the community needs to make an effort towards standardization of the batched BLAS routines. The batched BLAS interfaces currently provided by Intel MKL, NVIDIA cuBLAS, MAGMA, and other libraries differ significantly from each other, which results in a serious portability issue. The increasing gap between modern GPU architectures, co-processors, and regular multi-core CPUs further burdens the effort to provide a standard interface for batched BLAS functions. The calling interfaces and the optimal data layout for storing a batch of small matrices that are necessary for good performance vary depending on the architecture. As a first step toward an objective standard that imposes no severe performance penalty on any architecture, the benefits and drawbacks of existing batched BLAS interfaces have been analyzed. This BoF continues this effort.
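To make the interface question concrete, the following is a minimal reference sketch of a batched GEMM in two calling styles: one that takes a list of separate matrices, and one that takes a single flat buffer with a fixed stride between consecutive matrices. The function names, argument order, and row-major layout are assumptions chosen for illustration only. They are not the interfaces of Intel MKL, NVIDIA cuBLAS, or MAGMA, and the plain loops show only the semantics of "many small operations, one call", not an optimized implementation.

```python
from typing import List

Matrix = List[List[float]]


def gemm_batched(alpha: float, a_batch: List[Matrix], b_batch: List[Matrix],
                 beta: float, c_batch: List[Matrix]) -> None:
    """Hypothetical list-of-matrices style: C[s] = alpha*A[s]*B[s] + beta*C[s].

    Every matrix in the batch is a separate object, so sizes may differ
    from one problem to the next. Results are written into c_batch in place.
    """
    if not (len(a_batch) == len(b_batch) == len(c_batch)):
        raise ValueError("all batches must contain the same number of matrices")
    for a, b, c in zip(a_batch, b_batch, c_batch):
        m, k, n = len(a), len(b), len(b[0])
        for i in range(m):
            for j in range(n):
                acc = sum(a[i][p] * b[p][j] for p in range(k))
                c[i][j] = alpha * acc + beta * c[i][j]


def gemm_strided_batched(m: int, n: int, k: int, alpha: float,
                         a: List[float], stride_a: int,
                         b: List[float], stride_b: int,
                         beta: float, c: List[float], stride_c: int,
                         batch_count: int) -> None:
    """Hypothetical strided style: all matrices share one size and live
    back to back in flat row-major buffers, stride_* elements apart."""
    for s in range(batch_count):
        a0, b0, c0 = s * stride_a, s * stride_b, s * stride_c
        for i in range(m):
            for j in range(n):
                acc = sum(a[a0 + i * k + p] * b[b0 + p * n + j]
                          for p in range(k))
                c[c0 + i * n + j] = alpha * acc + beta * c[c0 + i * n + j]


# Two independent 2x2 products computed by a single call in each style.
a_list = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 0.0], [0.0, 2.0]]]
b_list = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
c_list = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
gemm_batched(1.0, a_list, b_list, 0.0, c_list)

a_flat = [1.0, 2.0, 3.0, 4.0, 2.0, 0.0, 0.0, 2.0]
b_flat = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
c_flat = [0.0] * 8
gemm_strided_batched(2, 2, 2, 1.0, a_flat, 4, b_flat, 4, 0.0, c_flat, 4, 2)
```

Both calls compute the same two products, but the caller has to prepare the data differently for each. The list style allows matrices of varying size, while the strided style requires a uniform size and contiguous storage. Differences of this kind between library interfaces are the source of the portability issue described above.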

Conference Presentation: pdf
