Structural and Visual Similarity Learning for Web Page Archiving

Marc Teva Law, Carlos Sureda Gutierrez, Nicolas Thome, Stéphane Gancarski, Matthieu Cord:
Structural and Visual Similarity Learning for Web Page Archiving
In: 10th International Workshop on Content-Based Multimedia Indexing (CBMI 2012), Annecy, France, 27-29 June 2012, pp. 96-101, ISBN 9781467323680

Abstract

This paper presents a Web page archiving approach that combines image and structural techniques. The main goal is to learn a similarity measure between Web pages in order to detect whether successive versions of a page are similar. The system is based on a visual similarity measure designed for Web pages. This measure is combined with a structural analysis of Web page source code, and a supervised feature selection method adapted to Web archiving is proposed. Experiments on real Web archives, including scalability issues, are reported.

