SC17 Denver, CO

P38: Benchmarking Parallelized File Aggregation Tools for Large Scale Data Management


Authors: Tiffany Li (National Center for Supercomputing Applications, University of Illinois), Craig Steffen (National Center for Supercomputing Applications, University of Illinois), Ryan Chui (National Center for Supercomputing Applications, University of Illinois), Roland Haas (National Center for Supercomputing Applications, University of Illinois), Liudmila S. Mainzer (National Center for Supercomputing Applications, University of Illinois)

Abstract: Large-scale genomic data analyses have given rise to bottlenecks in data management due to the production of many small files. Existing file-archiving utilities, such as tar, cannot efficiently package datasets of multiple terabytes and hundreds of thousands of files. As parallelized and multi-threaded alternatives, the Blue Waters team and the NCSA Genomics team developed ParFu (parallel archiving file utility), MPItar, and ptgz (parallel tar gzip), which perform parallel archiving (and, eventually, extraction). Scalability was tested for each tool as a function of the number of ranks and the stripe count on a Lustre filesystem. We used two datasets typical of genomic analyses to measure the effects of different file-size distributions. These tests suggest the best user parameters, and the associated costs, for using these tools as efficient replacements for existing data-packaging tools.
Award: Best Poster Finalist (BP): no

Poster: pdf
Two-page extended abstract: pdf

