P42: TRIP: An Ultra-Low Latency, TeraOps/s Reconfigurable Inference Processor for Multi-Layer Perceptrons
Abstract: The Multi-Layer Perceptron (MLP) is one of the most commonly deployed deep neural networks, representing 61% of the workload in Google data centers. MLP inference, a memory-bound problem, typically has hard response-time deadlines and prioritizes low latency over throughput. We designed a TeraOps/s Reconfigurable Inference Processor for MLPs (TRIP) on FPGAs that alleviates the memory bottleneck by storing all application-specific weights on-chip. It can be deployed in multiple configurations, including host-independent operation. We show that TRIP achieves 60x better performance than the current state-of-the-art Google Tensor Processing Unit (TPU) for MLP inference. TRIP was demonstrated on the cancer patient datasets used in the CANDLE Exascale Computing Project (ECP).
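The memory-bound claim in the abstract can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the poster. It assumes a plain fully connected layer, 1-byte weights and activations, and hypothetical layer sizes and on-chip capacity. The function names `dense_layer_intensity` and `weights_fit_on_chip` are illustrative only.

```python
def dense_layer_intensity(n_in, n_out, batch, bytes_per_value=1):
    """Operations per byte moved for one fully connected layer,
    assuming weights are fetched from off-chip memory once per batch."""
    # One multiply and one add per weight per input sample.
    ops = 2 * batch * n_in * n_out
    weight_bytes = n_in * n_out * bytes_per_value
    # Input activations read plus output activations written.
    activation_bytes = batch * (n_in + n_out) * bytes_per_value
    return ops / (weight_bytes + activation_bytes)


def weights_fit_on_chip(layer_sizes, on_chip_bytes, bytes_per_value=1):
    """True if all weight matrices of an MLP with the given layer widths
    fit in the assumed on-chip memory budget (biases ignored)."""
    total_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return total_weights * bytes_per_value <= on_chip_bytes


if __name__ == "__main__":
    # Hypothetical 1024x1024 layer: latency-oriented (batch 1) vs batched.
    print(dense_layer_intensity(1024, 1024, batch=1))
    print(dense_layer_intensity(1024, 1024, batch=256))
    # Hypothetical MLP and a 2 MiB on-chip budget (assumed, not TRIP's figure).
    print(weights_fit_on_chip([1024, 1024, 10], on_chip_bytes=2 * 1024 * 1024))
```

At batch size 1, which is what a hard latency deadline tends to force, each weight byte fetched supports only about two operations, so off-chip bandwidth rather than compute limits throughput. Keeping the weights on-chip removes that fetch from the critical path, which is the motivation the abstract gives.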
Award: Best Poster Finalist (BP): no
Two-page extended abstract: pdf