Structural and Visual Comparisons for Web Page Archiving

Marc Teva Law, Nicolas Thome, Stéphane Gançarski and Matthieu Cord:
Structural and Visual Comparisons for Web Page Archiving
In: DocEng ’12: Proceedings of the 2012 ACM Symposium on Document Engineering, pages 117-120. ACM, New York, NY, USA, ©2012. ISBN: 978-1-4503-1116-8. doi: 10.1145/2361354.2361380

In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores computed between successive versions of pages. The trained model then determines whether two versions, represented by their vector of similarity scores, are similar or not. Experiments on real archives validate our approach.
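
The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear SVM trained by batch subgradient descent on the hinge loss, and the two features (a structural and a visual similarity score), the toy data, and the names `train_linear_svm` and `predict` are all hypothetical. The abstract does not specify the kernel, the features, or the solver.

```python
# Hypothetical sketch: a linear SVM over similarity-score vectors.
# Feature names, data values and hyperparameters are illustrative assumptions,
# not taken from the paper.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin, so its hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (versions judged similar) or -1 (versions judged dissimilar)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Each row is one pair of successive versions:
# [structural_similarity, visual_similarity], both in [0, 1] (made-up values).
X_train = [
    [0.95, 0.90], [0.85, 0.92], [0.90, 0.80],  # labelled similar
    [0.20, 0.30], [0.35, 0.15], [0.10, 0.25],  # labelled dissimilar
]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)

# Classify a new pair of versions from its similarity-score vector.
label = predict(w, b, [0.9, 0.9])
```

In this sketch each pair of versions is reduced to a fixed-length vector of scores before classification, so adding or removing a similarity measure only changes the vector length, not the training code.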