An Open Source Infrastructure for Quality Assurance and Preservation of a Large Digital Book Collection

Sven Schlarb:
An Open Source Infrastructure for Quality Assurance and Preservation of a Large Digital Book Collection
In: Archiving 2013, Washington, DC; April 2013; p. 234-238; ISBN / ISSN: 978-0-89208-304-6

Abstract
This article presents an open source infrastructure for processing the large collections of digital books held by the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure.
More concretely, it presents a set of best practices for the data analysis of large document image collections on the basis of Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.
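
To make the Hadoop Streaming style of job mentioned in the abstract more concrete, the following is a minimal sketch in Python. It is not taken from the paper. The record layout (`book_id`, `page_id`, `image_width` separated by tabs), the function names, and the quality measure (mean page image width per book) are all assumptions chosen for illustration. The only convention it relies on is that a Streaming mapper and reducer exchange tab-separated key/value lines and that the reducer receives its input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit 'book_id<TAB>width' for each well-formed record.

    Assumed (hypothetical) input layout: book_id<TAB>page_id<TAB>image_width
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records instead of failing the job
        book_id, _page_id, width = parts
        if width.isdigit():
            yield f"{book_id}\t{width}"


def reduce_lines(lines):
    """Reducer step: input must be sorted by key; emit 'book_id<TAB>mean_width'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for book_id, group in groupby(pairs, key=lambda kv: kv[0]):
        widths = [int(value) for _, value in group]
        yield f"{book_id}\t{sum(widths) / len(widths):.1f}"


def run_local(records):
    """Simulate map -> sort -> reduce locally, without a Hadoop cluster."""
    mapped = sorted(map_lines(records))  # stands in for the shuffle/sort phase
    return list(reduce_lines(mapped))
```

In an actual Streaming job the two steps would typically be separate scripts reading `sys.stdin` and printing to standard output, with Hadoop performing the sort between them. `run_local` only imitates that sequence so the logic can be checked on a small sample before it is submitted to a cluster.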
