D9.2 Characterisation technology, Release 2 + release report

This report describes the year 2 activities of the SCAPE project in the Characterisation Components work package. It evaluates the suitability of format identification tools for execution in a parallelised MapReduce environment, and describes the publication of format identification data along with its implications for data curation and publication. It also presents a novel tool for mining domain-specific semantic meaning from web archives, and describes an Azure-based application that facilitates the conversion and quality assurance of large document collections. The report closes with overall findings and a roadmap for the coming year.