Reliable Access to Massive Restricted Texts: Experience-Based Evaluation
Workshop: The Eighth International Workshop on Data-Intensive Computing in the Clouds
Abstract: Libraries are seeing growing numbers of digitized textual corpora that carry restrictions on their content. Probing and mining these massive corpora, which are of interest to scholars, can be cumbersome because of their size, granularity, access restrictions, and organization. Efficient management of such a collection, especially under failures, depends on the primary storage system. In this paper, we identify the requirements for managing a massive text corpus, based on experience in managing the 5.5 billion pages of the HathiTrust digital library. Using these requirements, we assess candidate storage solutions through a combination of experimental evaluation and comparison to identify an optimal choice.