Authors
Event Type: Paper
Time: Thursday, November 16th, 2:30pm - 3pm
Location: 405-406-407
Description: Many applications rely on BLAS/LAPACK routines applied to large groups of very small matrices. For example, many PDE-based simulations and machine learning applications require batched BLAS/LAPACK routines. While existing batched BLAS APIs provide meaningful speedup over alternatives such as OpenMP loops around traditional BLAS/LAPACK calls, significant further speedup is possible with a non-canonical data layout that allows cross-matrix vectorization in batched BLAS/LAPACK routines.
We propose a new compact data layout that interleaves matrices in blocks according to the architecture's SIMD vector length, and we investigate its merits. We then discuss the proposed data layout in two libraries, one an open-source and one a vendor implementation. In our experiments, the compact data layout provides up to 5x, 15x, and 18x speedup over batched dgemm, dtrsm, and dgetrf, respectively, for a block size of 5 on Intel Knights Landing. Finally, we demonstrate the improved performance by using the compact data layout in a line solver for coupled CFD codes.