Scientific discovery via advances in simulation and data analytics is an ongoing national priority. A corresponding challenge is to sustain the research, development and deployment of the high performance infrastructure needed to enable those discoveries. Early cloud data centers are evolving with new technologies to better support massive data analytics.
Analysis of Big Data use cases identifies the need for High Performance Computing (HPC) technologies in the Apache Big Data Stack (HPC-ABDS). Deep learning on GPU clusters is a clear example, but many machine learning algorithms also require iteration, HPC-style communication, and related optimizations.
Our research has concentrated on runtime and data management to support HPC-ABDS. This is illustrated by our open source software Harp, a plug-in for native Apache Hadoop, which provides a convenient science interface and high-performance communication, and can invoke Intel’s Data Analytics Acceleration Library (DAAL).
We tested this on both a complex Latent Dirichlet Allocation topic model and subgraph mining algorithms using Intel’s Xeon and Xeon Phi architectures. Other tests show that Harp can run K-means, Graph Layout, and Multi-Dimensional Scaling algorithms with realistic application datasets over 4096 cores on the IU Big Red II Supercomputer while achieving linear speedup.
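The iterative, communication-intensive pattern behind algorithms such as K-means can be sketched in a few lines. The code below is an illustration of the general map-collective idea (each data partition computes partial statistics locally, which are then combined as a collective allreduce would combine them), not Harp’s actual API; the function and variable names are hypothetical.

```python
def kmeans_iteration(partitions, centroids):
    """One map-collective step: each partition computes partial sums and
    counts locally (the 'map'), the partials are combined as an allreduce
    would combine them (the 'collective'), and new centroids are derived."""
    k, dim = len(centroids), len(centroids[0])
    partials = []
    for part in partitions:
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for point in part:
            # Assign the point to its nearest centroid (squared distance).
            j = min(range(k),
                    key=lambda c: sum((point[d] - centroids[c][d]) ** 2
                                      for d in range(dim)))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += point[d]
        partials.append((sums, counts))
    # Allreduce-style combination: global sums and counts across partitions.
    total_sums = [[sum(p[0][c][d] for p in partials) for d in range(dim)]
                  for c in range(k)]
    total_counts = [sum(p[1][c] for p in partials) for c in range(k)]
    # New centroid = global mean of assigned points (keep old if empty).
    return [[total_sums[c][d] / total_counts[c] for d in range(dim)]
            if total_counts[c] else centroids[c]
            for c in range(k)]

# Two "workers", each holding a slice of a toy 1-D dataset.
partitions = [[[1.0], [1.2]], [[8.0], [8.4]]]
centroids = [[0.0], [9.0]]
for _ in range(5):
    centroids = kmeans_iteration(partitions, centroids)
print([[round(x, 6) for x in c] for c in centroids])  # → [[1.1], [8.2]]
```

In a real framework the combination step is a distributed collective operation rather than an in-memory sum, but the per-iteration structure (local computation, global synchronization of model data) is the same, which is why HPC communication optimizations matter for these workloads.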
We are building a scalable parallel machine learning library that includes routines from Apache Mahout, MLlib, and others, developed in an NSF-funded collaboration. The library already contains 20 members (12 of which use DAAL) and is being tested as we add more functionality. Our results show that data-centric parallelism extends our understanding of distributed and parallel computation, enabling further advances in handling big model data and in speed of convergence.
These findings demonstrate the effectiveness of using HPC machines for Big Data problems. We will continue to collaborate with academia, industry, and national centers in exploring these computational capabilities and their applications.
About the Speaker:
Dr. Judy Qiu is an associate professor of Intelligent Systems Engineering at Indiana University (IU). Her general area of research is data-intensive computing at the intersection of Cloud and HPC multicore technologies. This includes a specialization in programming models that support iterative computation, spanning storage to analysis, and that can scalably execute data-intensive applications.
Her research has been funded by NSF, NIH, Microsoft, Google, Intel, and IU. Judy Qiu leads an Intel Parallel Computing Center (IPCC) site at IU. Dr. Qiu was the recipient of an NSF CAREER Award in 2012, the Indiana University Trustees Award for Teaching Excellence in 2013-2014, and IU’s Outstanding Junior Faculty Award in 2015.