SCAPE | FactCrawl: A Retrieval Framework for Full-Text Indices

Christoph Boden, Alexander Löser, Christoph Nagel, and Stephan Pieper
FactCrawl: A Fact Retrieval Framework for Full-Text Indices
In: 14th International Workshop on the Web and Databases (WebDB 2011) at SIGMOD 2011, 2011.

Abstract
We present FactCrawl, a framework for retrieving structured, factual information by leveraging the full-text index of a search engine. The framework applies an approximation algorithm to solve the problem of retrieving all facts in a document collection using a minimal set of keywords while minimizing cost. The search engine is queried with automatically generated keywords, the results are re-ranked according to our fact score, and the documents are forwarded to a fact extractor. Keywords are determined using structural, syntactic, lexical and semantic information from sample documents. We estimate the fact score of a document by combining the observations of keywords in the document. We report results of an experimental evaluation over 20 fact extractors on a Reuters NIST corpus with 731,752 pages. Our experiments demonstrate that FactCrawl more than doubles recall in an online query scenario and nearly halves processing costs in an archive scenario, compared to existing approaches.
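
The re-ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so everything below is an assumption made for illustration: the function names `fact_score` and `rerank`, the per-keyword weights, and the choice to combine keyword observations by summing the weights of the keywords found in a document. It is not the authors' method.

```python
from typing import Dict, List, Tuple


def fact_score(text: str, keyword_weights: Dict[str, float]) -> float:
    """Hypothetical fact score: sum the weights of all keywords seen in the text.

    The paper's real combination function is not given in the abstract;
    a plain weighted sum is assumed here.
    """
    tokens = set(text.lower().split())
    return sum(weight for kw, weight in keyword_weights.items() if kw in tokens)


def rerank(
    docs: List[str], keyword_weights: Dict[str, float], top_k: int
) -> List[Tuple[float, str]]:
    """Score every retrieved document and return the top_k, best first."""
    scored = [(fact_score(doc, keyword_weights), doc) for doc in docs]
    # Python's sort is stable, so equally scored documents keep search-engine order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Illustrative weights and documents (made up for this example).
weights = {"acquired": 2.0, "ceo": 1.0}
results = [
    "Acme acquired Beta and named a new CEO",
    "Weather was mild today",
    "The CEO resigned",
]
ranked = rerank(results, weights, top_k=2)
```

In this sketch only the documents in `ranked` would be passed on to a fact extractor, which is how re-ranking could reduce processing cost under these assumptions.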

Download: link
