Description

Deep Learning frameworks like Caffe, TensorFlow, and CNTK have brought forward new requirements and challenges for communication runtimes like MVAPICH2-GDR, including support for low-latency, high-bandwidth communication of very large GPU-resident buffers. This support is essential for scalable distributed training of Deep Neural Networks on GPU clusters, yet current MPI runtimes offer only limited support for large-message GPU-based collectives. To address this, we propose the S-Caffe framework: a co-design of distributed training in Caffe and large-message collectives in MVAPICH2-GDR. We highlight two designs for MPI_Bcast, one that exploits NVIDIA NCCL and another based on ring algorithms. Further, we present designs for MPI_Reduce that deliver up to 2.5X improvement. We also present layer-wise gradient aggregation designs in S-Caffe that exploit overlap of computation and communication together with the proposed reduce design. S-Caffe scales out to 160 GPUs for GoogLeNet training and delivers performance comparable to CNTK for AlexNet training.
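The ring-based broadcast mentioned above can be illustrated with a simplified, serial simulation. This is only a sketch of the general chunked-ring idea, not the MVAPICH2-GDR implementation (which operates on GPU-resident buffers over the interconnect); the helper name `ring_broadcast` and its parameters are hypothetical.

```python
def ring_broadcast(buffers, root=0, chunk_size=4):
    # Simulate a chunked ring broadcast: the root's message is split into
    # chunks that travel hop-by-hop around the ring of ranks. In a real
    # pipelined implementation, sending chunk k to the next rank overlaps
    # with receiving chunk k+1 from the previous rank, which is what makes
    # ring algorithms effective for very large messages.
    n = len(buffers)
    data = list(buffers[root])
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for chunk in chunks:
        for hop in range(1, n):                 # ranks downstream of the root
            receiver = (root + hop) % n
            buffers[receiver].extend(chunk)     # chunk arrives from the predecessor
    return buffers

# Usage: four ranks, rank 0 holds an 8-element message; afterwards every
# rank holds a full copy.
bufs = ring_broadcast([list(range(8)), [], [], []], root=0, chunk_size=3)
```

Chunking is the key design choice: instead of one monolithic transfer from the root, the message is pipelined across all ranks, so no single link or rank becomes a bandwidth bottleneck for large buffers.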