D6.3 Optimization of Preservation Processes

In this deliverable, we examine two practical scenarios to investigate how algorithmic limitations of the MapReduce paradigm can be overcome to optimize execution speed, and to assess the real-world applicability of preservation workflows defined in MapReduce. In particular, we describe the optimization of an algorithmic operation that requires iteration, and a large-scale preservation workflow that converts very large collections of images from TIFF to the JP2 file format in a distributed environment. We expand on the work reported in deliverable D6.2, in which we began formulating preservation workflows using the Apache Pig dataflow language as a higher-order intermediary. Both the PPL translator and the large-scale preservation use case are now formulated in Apache Pig, which is in turn compiled down to MapReduce for execution in a distributed environment. Our findings indicate that preservation workflows can be formulated and executed efficiently within the boundaries of the MapReduce paradigm. Our PPL translator is available online at https://github.com/umaqsud/taverna-to-pig.

SCAPE_D6.3_TUB_V1.0

 
