Description

Scientific discovery via advances in simulation and data analytics is an ongoing national priority. A corresponding challenge is to sustain the research, development, and deployment of the high performance infrastructure needed to enable those discoveries. Early cloud data centers are evolving with new technologies to better support massive data analytics. Analysis of Big Data use cases identifies the need for HPC technologies in the Apache Big Data Stack (HPC-ABDS). Deep learning on GPU clusters is a clear example, but many Machine Learning algorithms also require iterative computation, HPC-style communication, and related optimizations.
Our research has concentrated on runtime and data management to support HPC-ABDS. This is illustrated by our open source software Harp, a plug-in for native Apache Hadoop, which offers a convenient science interface, high performance communication, and the ability to invoke Intel's Data Analytics Acceleration Library (DAAL). We tested this on both a complex Latent Dirichlet Allocation topic model and on subgraph mining algorithms using Intel's Xeon and Xeon Phi architectures. Other tests show that Harp can run K-means, Graph Layout, and Multi-Dimensional Scaling algorithms with realistic application datasets over 4096 cores on the IU Big Red II supercomputer while achieving near-linear speedup.
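To illustrate the communication pattern behind iterative algorithms such as K-means in this setting, the following is a minimal sketch (not Harp's actual API) of data-parallel K-means in which each worker computes partial centroid sums over its data shard and a collective allreduce-style step combines them before the next iteration. All function names here are hypothetical, and the `allreduce` helper merely simulates the collective operation in-process.

```python
import numpy as np

def local_accumulate(points, centroids):
    # Worker-local step: assign each point to its nearest centroid,
    # then return per-centroid partial sums and counts.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for p, lab in zip(points, labels):
        sums[lab] += p
        counts[lab] += 1
    return sums, counts

def allreduce(partials):
    # Stand-in for a collective allreduce: element-wise sum of the
    # partial (sums, counts) pairs from all workers.
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    return total_sums, total_counts

def kmeans(shards, centroids, iters=10):
    # Iterate: local accumulation on each shard, global reduction,
    # centroid update. This is the synchronization pattern that an
    # HPC collective-communication layer accelerates.
    for _ in range(iters):
        partials = [local_accumulate(shard, centroids) for shard in shards]
        sums, counts = allreduce(partials)
        centroids = centroids.copy()
        nonzero = counts > 0
        centroids[nonzero] = sums[nonzero] / counts[nonzero, None]
    return centroids
```

Because each iteration exchanges only the small model (centroid sums and counts) rather than the data, the pattern scales with core count, which is the property behind the speedup results reported above.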
We are building a scalable parallel Machine Learning library that includes routines from Apache Mahout, MLlib, and others, built in an NSF-funded collaboration. The library already contains 20 members (12 of which use DAAL) and is being tested as we add more functionality. Our results show that data-centric parallelism extends our understanding of distributed and parallel computation, enabling advances in handling big model data and in speed of convergence. This finding demonstrates the effectiveness of using HPC machines for Big Data problems. We will continue to collaborate with academia, industry, and national centers in exploring these computational capabilities and their applications.