SC17 Denver, CO

High Performance Computing Education in US Data Science

Authors: Dr. Weijia Xu (Texas Advanced Computing Center, University of Texas)

Abstract: A key property of Data Science is the adoption of new techniques and tools for conquering extremely large datasets using the latest computing infrastructure. However, challenges remain in integrating high performance computing (HPC) knowledge and hands-on practices into Data Science education. This BoF will consist of a panel of experts who will discuss the challenges and needs of HPC education in Data Science programs and have a lively discussion with the audience to explore viable approaches. The ultimate goal is to bring HPC specialists, data scientists, and educators together to broaden HPC education and practice in Data Science.

Long Description: Data Science has emerged as a dedicated field of study and grown quickly within the higher education system. A major driving force behind its rising popularity is the growing need for data-driven analytics, fueled by massive amounts of complex data produced by businesses, scientific applications, government agencies, and social applications. Computational analysis tasks in Data Science often require high performance computing (HPC) resources. Education and training on the essential tools of advanced computing are therefore pivotal in preparing new data scientists.

This BoF is centered on this very issue. Drawing on our past experience in providing successful, relevant training sessions, we have identified several challenges unique to teaching HPC technology in Data Science education. As volumes of data grow, solutions often become viable only with large-scale cyberinfrastructure (CI). The current National Science Foundation (NSF) CI successfully serves thousands of users advancing critical areas of science and engineering. However, tens of thousands of other NSF-supported researchers and students, whose communities have only recently come to large-scale computation as a research tool, are not yet trained to take advantage of data-intensive computing environments and their applications, languages, and libraries. In addition to the increasing amount of available data, the architectures and methods used to store, process, and analyze this data have changed drastically in just a few years. Students must have sufficient knowledge of those tools and resources in order to transform theoretical models and methodologies into efficient solutions. This knowledge requirement can easily overwhelm students entering the field. Furthermore, the fast pace of technology and software development makes keeping that knowledge up to date a challenge not only for students but also for educators. On the other hand, a student in Data Science should not have to become an expert in HPC in order to succeed.

These challenges cannot be solved by HPC specialists or data scientists alone. There is a pressing need to bring data scientists, researchers, educators, and HPC professionals together to explore innovative methods that integrate HPC with existing Data Science programs across the country.

Organized around this central mission, this BoF features experts from diverse institutional backgrounds and aims to facilitate productive dialogue. Dr. Kelly Gaither is a veteran leader at an HPC center with rich experience in HPC research and pedagogy. Having recently served as an NSF program director, Dr. Daniel Katz will offer unique perspectives on the needs and gaps in utilizing HPC for data-driven science. Dr. Ann Stapleton is a biologist who uses cyberinfrastructure extensively in her own research and helped UNCW establish its first Data Science program. Dr. Mark Speck will share his experiences and insights on jumpstarting data science education as a new data scientist at Chaminade University of Honolulu. Mr. Oppiliappan will discuss the needs of HPC education for data scientists based on his years of experience in industry. In addition to the panel discussion, the BoF includes an open discussion with the audience on common issues and next steps for community-building.

Conference Presentation: pdf
