Description

In the last several years, advances in deep neural networks (DNNs) have led to many breakthroughs in machine learning, spanning diverse fields such as computer vision, speech recognition, and natural language processing. Since then, the size and complexity of DNNs have significantly outpaced the growth of CPU performance, leading to an explosion of DNN accelerators.
Recently, major cloud vendors have turned to ASICs and FPGAs to accelerate DNN serving in latency-sensitive online services. While significant attention has been devoted to deep convolutional neural networks (CNNs), much less has been given to the more difficult problems posed by memory-intensive Multilayer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs).
Improving the performance of memory-intensive DNNs (e.g., LSTMs) can be achieved through a variety of methods. Batching is one popular approach for driving up device utilization, but it can harm tail latencies (e.g., the 99.9th percentile) in the context of online services. Increasing off-chip bandwidth is another option, but it incurs high cost and power, and still may not be sufficient to saturate an accelerator with tens or even hundreds of TFLOPS of raw performance.
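The batching/tail-latency tension can be made concrete with a toy queueing sketch. All numbers below are hypothetical (not from this work): requests arrive at a steady rate, each device launch takes a fixed time regardless of batch size, and batched serving holds requests until a full batch has arrived.

```python
# Toy queueing sketch (hypothetical numbers, not from this work): a device
# that keeps up with unbatched traffic still shows much worse tail latency
# once requests are held back to form batches, because the first request in
# each batch waits for the batch to fill.

ARRIVAL_MS = 5.0   # one request arrives every 5 ms (assumed)
KERNEL_MS = 4.0    # device time per launch; a batch of 1..B costs the same (assumed)
N = 1000           # number of requests to simulate

def p999_latency(batch_size):
    """Return the 99.9th-percentile request latency (ms) for a given batch size."""
    latencies = []
    device_free = 0.0
    for start_idx in range(0, N, batch_size):
        batch = range(start_idx, min(start_idx + batch_size, N))
        last_arrival = batch[-1] * ARRIVAL_MS   # the batch waits for its last request
        start = max(last_arrival, device_free)  # then waits for the device, if busy
        finish = start + KERNEL_MS
        device_free = finish
        for r in batch:
            latencies.append(finish - r * ARRIVAL_MS)
    latencies.sort()
    return latencies[int(0.999 * len(latencies)) - 1]

print(p999_latency(1))  # no batching: every request sees 4.0 ms
print(p999_latency(8))  # batch of 8: tail latency balloons to 39.0 ms
```

In this sketch the device is never the bottleneck; the entire tail-latency penalty comes from the first request in each batch waiting for the batch to fill, which is why batching is attractive for throughput-oriented workloads but problematic for latency-sensitive online serving.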
In this work, we present BrainWave, a scalable, distributed DNN acceleration architecture that enables inference on MLPs and RNNs at near-peak device efficiency without the need for input batching. The BrainWave architecture is powered by FPGAs and reduces latency by 10-100X relative to well-tuned software.