D9.3 Characterisation technology, Release 3 + release report

This report describes the year 3 activities of the SCAPE project in the Characterisation Components work package and presents an evaluation of format identification tools for execution in a parallelised MapReduce environment. We report two general solutions that complement each other, each with its own strengths and weaknesses. We present a solution to the problem of different tools giving different results on the same data. We discuss the concept of policy-driven validation of digital objects according to an institutional preservation policy and give a reference to a concrete proof-of-concept solution. We present an evaluation of deploying Apache Tika and DROID on the SCAPE Azure platform as an alternative to the general SCAPE Execution Platform. Finally, we present a research project on extracting semantic information from web-based text corpora and describe how such a system could be used by the digital preservation community.