Author/Presenters
Event Type
Workshop

Applications
Clouds and Distributed Computing
SIGHPC Workshop
Time
Sunday, November 12th, 11:40am - 12:10pm
Location
507
Description
Libraries are seeing growing numbers of digitized
textual corpora with restrictions on their content.
Probing and mining these massive corpora, of interest to
scholars, can be cumbersome because of size,
granularity, access restrictions, and organization.
Efficient management of such a collection, especially
under failures, depends on the primary storage system. In
this paper, we identify the requirements for managing a
massive text corpus based on experience in managing the
5.5 billion pages of the HathiTrust digital library.
Using these requirements, we compare candidate storage
solutions and, through a combination of experimental
evaluation and comparison, identify an optimum
choice.