Web archive collections contain very large numbers of files, all packaged into ARC or WARC files, which makes it difficult to run identification tools over their contents. As a result, you may not be aware of all the file formats held in your archives.
Nanite builds on Apache Tika and The National Archives’ DROID to provide a rich format identification and characterisation system. It aims to make it easier to run identification and characterisation at scale, and helps compare and combine the results of different tools.
Nanite-Hadoop can work directly on web archives (ARC and WARC files) without requiring any intermediate decompression of the input data. It produces two main types of output. The first is characterisation data that is compatible with C3PO. The second is a table that lists, for every file in the dataset, the MIME type returned by the original server, the MIME types according to DROID and Apache Tika, the year of harvest and the file extension.
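To illustrate how such a table can help compare tool results, here is a minimal Java sketch that counts, per harvest year, the files on which DROID and Apache Tika disagree. The tab-separated column order (server MIME, DROID MIME, Tika MIME, year, extension) and the class and method names are assumptions made for this example; they are not taken from Nanite's actual output format or API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: the column layout below is an assumption, not Nanite's documented output. */
final class FormatTableSketch {

    /** One row of the hypothetical table. */
    static final class Row {
        final String serverMime;
        final String droidMime;
        final String tikaMime;
        final int year;
        final String extension;

        Row(String serverMime, String droidMime, String tikaMime, int year, String extension) {
            this.serverMime = serverMime;
            this.droidMime = droidMime;
            this.tikaMime = tikaMime;
            this.year = year;
            this.extension = extension;
        }
    }

    /** Parses one tab-separated line in the assumed column order. */
    static Row parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 tab-separated fields: " + line);
        }
        return new Row(f[0], f[1], f[2], Integer.parseInt(f[3].trim()), f[4]);
    }

    /** Strips parameters such as "; charset=UTF-8" and normalises case. */
    static String baseType(String mime) {
        int semi = mime.indexOf(';');
        String base = semi < 0 ? mime : mime.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True when DROID and Tika report the same base MIME type. */
    static boolean toolsAgree(Row r) {
        return baseType(r.droidMime).equals(baseType(r.tikaMime));
    }

    /** Counts, per harvest year, the rows on which the two tools disagree. */
    static Map<Integer, Integer> disagreementsByYear(List<Row> rows) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Row r : rows) {
            if (!toolsAgree(r)) {
                counts.merge(r.year, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows, for illustration only.
        List<Row> rows = new ArrayList<>();
        rows.add(parse("text/html\ttext/html\ttext/html; charset=UTF-8\t2004\thtml"));
        rows.add(parse("text/plain\tapplication/pdf\tapplication/pdf\t2004\tpdf"));
        rows.add(parse("image/jpeg\timage/png\timage/jpeg\t2008\tjpg"));
        System.out.println(disagreementsByYear(rows)); // prints {2008=1}
    }
}
```

Comparing base types rather than raw strings avoids counting a difference in parameters, such as a charset suffix, as a disagreement between the tools.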
What is Nanite?
Nanite consists of two modules:
- Nanite-Core contains a wrapped version of DROID
- Nanite-Hadoop can rapidly identify and characterise the contents of web archives using Nanite-Core and Apache Tika.
Characterisation data can be output in a format suitable for import into C3PO.
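As a rough illustration of what identifying a stream involves, the following Java sketch guesses a MIME type from the first few bytes of an InputStream. It is a deliberately simplified stand-in: it does not use DROID, Nanite-Core or Apache Tika, it knows only three well-known signatures, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Simplified stand-in for signature-based identification; not DROID's or Nanite's API. */
final class MagicSniffSketch {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF = "GIF8".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Reads up to 8 bytes from the stream and matches them against known signatures. */
    static String identify(InputStream in) {
        byte[] head = new byte[8];
        int n = 0;
        try {
            // A single read() may return fewer bytes than requested, so loop.
            while (n < head.length) {
                int read = in.read(head, n, head.length - n);
                if (read < 0) {
                    break;
                }
                n += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, n, PDF)) return "application/pdf";
        if (startsWith(head, n, PNG)) return "image/png";
        if (startsWith(head, n, GIF)) return "image/gif";
        return "application/octet-stream"; // unknown or too short
    }

    /** True when the first sig.length of the n valid bytes in head equal sig. */
    private static boolean startsWith(byte[] head, int n, byte[] sig) {
        if (n < sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (head[i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identify(new ByteArrayInputStream(sample))); // prints application/pdf
    }
}
```

Working on an InputStream rather than a file path is what allows records inside an ARC or WARC container to be identified without first unpacking them to disk.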
What are the benefits?
- In-depth knowledge of your files
- Scalability – analyse large amounts of data quickly
- Easy integration with C3PO for visualising and exploring characterisation information from your web archives
- A reusable Java library that uses DROID to identify the format of data read from InputStreams
More information