ToMaR – A Data Generator for Large Volumes of Content

Rainer Schmidt, Matthias Rella, and Sven Schlarb

We present ToMaR, a scalable application that
supports the efficient integration of legacy applications within
a MapReduce environment. The work is motivated by scenarios
for scalable content processing developed in the context of the
EC project SCAPE. ToMaR specifically addresses the need
for extracting data sets from large volumes of binary content
based on existing, content-specific applications within a scalable
data management environment. This paper discusses the main
functionalities of ToMaR and describes how ToMaR is utilized
as part of a typical workflow. We present a real-world scenario
that makes use of ToMaR for the characterization of archived
web content. We discuss a workflow and experimental results
produced using sample content from the Web Archive Austria.
