Duplicate Detection Approaches for Quality Assurance of Document Image Collections

Roman Graf, Reinhold Huber-Mörk, Alexander Schindler, and Sven Schlarb

The International ACM Conference on Management of Emergent Digital EcoSystems (MEDES 2013)

Abstract:

This paper evaluates different methods for automatic duplicate detection in digitized collections. These approaches are meant to support quality assurance and decision making for long-term preservation of digital content in libraries and archives. We demonstrate the advantages and drawbacks of each approach. Our goal is to select the most efficient method that satisfies the digital preservation requirements for duplicate detection in digital document image collections. Workflows of differing complexity were designed to demonstrate possible duplicate detection approaches. Each approach is assessed on workflow simplicity, detection accuracy, and acceptable performance, since image processing methods typically require significant computation. The applied image processing methods create expert knowledge that facilitates decision making for long-term preservation. We employ AI techniques such as expert rules and clustering to infer explicit knowledge about the content of the digital collection. A statistical analysis of the aggregated information and a qualitative analysis of the aggregated knowledge are presented in the evaluation part of the paper.
