FactCrawl: A Fact Retrieval Framework for Full-Text Indices

Christoph Boden, Alexander Löser, Christoph Nagel, and Stephan Pieper
FactCrawl: A Fact Retrieval Framework for Full-Text Indices
In: 14th International Workshop on the Web and Databases (WebDB 2011), at SIGMOD 2011, 2011.

Abstract
We present FactCrawl, a framework for retrieving structured, factual information by leveraging the full-text index of a search engine. The framework applies an approximation algorithm to solve the problem of retrieving all facts in a document collection using a minimal set of keywords while minimizing cost. The search engine is queried with automatically generated keywords, the results are re-ranked according to our fact score, and the documents are forwarded to a fact extractor. Keywords are determined using structural, syntactic, lexical and semantic information from sample documents. We estimate the fact score of a document by combining the observations of keywords in the document. We report results of an experimental evaluation over 20 fact extractors on a Reuters NIST corpus with 731,752 pages. Our experiments demonstrate that FactCrawl more than doubles recall in an online query scenario and nearly halves processing costs in an archive scenario, compared to existing approaches.
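
The keyword-selection problem in the abstract (cover all fact-bearing documents with few keywords at low cost) resembles weighted set cover. The abstract does not name the approximation algorithm, so the following is only an illustrative sketch: the greedy heuristic, the function name `greedy_keyword_cover`, and the toy keywords, document ids and costs are all assumptions, not the authors' method.

```python
def greedy_keyword_cover(keyword_docs, keyword_cost):
    """Greedy weighted set cover (assumed stand-in, not the paper's algorithm).

    keyword_docs: dict mapping keyword -> set of fact-bearing document ids
                  that a query with this keyword retrieves
    keyword_cost: dict mapping keyword -> positive cost of issuing the query
    Returns the chosen keywords in selection order.
    """
    # Everything that at least one keyword can retrieve must be covered.
    uncovered = set().union(*keyword_docs.values()) if keyword_docs else set()
    chosen = []
    while uncovered:
        # Pick the keyword with the lowest cost per newly covered document;
        # ties are broken alphabetically so the result is deterministic.
        best = min(
            (k for k in keyword_docs if keyword_docs[k] & uncovered),
            key=lambda k: (keyword_cost[k] / len(keyword_docs[k] & uncovered), k),
        )
        chosen.append(best)
        uncovered -= keyword_docs[best]
    return chosen


# Hypothetical toy data: which documents each keyword retrieves, at unit cost.
keyword_docs = {
    "acquired": {1, 2, 3},
    "merger": {3, 4},
    "bought": {1, 4, 5},
    "deal": {5},
}
keyword_cost = {"acquired": 1.0, "merger": 1.0, "bought": 1.0, "deal": 1.0}

selected = greedy_keyword_cover(keyword_docs, keyword_cost)
print(selected)
```

On this toy input two of the four keywords are enough to reach all five documents. A greedy choice like this is a standard approximation for set cover; whether FactCrawl uses it or a different scheme is not stated in the abstract.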
