A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data

Stefan Pröll and Andreas Rauber

Abstract:

Sharing research data is becoming increasingly important because it enables peers to validate and reproduce data-driven experiments. Exchanging data also allows scientists to reuse data in different contexts and to gain new knowledge from available sources. As data volumes grow, however, researchers need to reference exact versions of datasets. Until now, access to research data has often been based on single archives of data files, where support for versioning and subsetting is limited. In this paper, we introduce a mechanism that allows researchers to create versioned subsets of research data that can be cited and shared in a lightweight manner. We demonstrate a prototype that supports researchers in creating subsets by filtering and sorting source data. These subsets can be cited for later reference and reuse. Using cryptographic hashing, the system produces evidence that allows users to verify the correctness and completeness of a subset. We describe a replication scenario for enabling scalable data citation in dynamic contexts.
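
As a rough illustration of the hash-based verification idea, the following minimal Python sketch builds a subset by filtering and sorting records and computes a digest over a fixed serialization of the result. It is not the prototype's implementation: the function and field names (subset_hash, is_station_a, by_id, station, temp) are hypothetical, the choice of SHA-256 is an assumption, and versioning of the source data is left out.

```python
import hashlib


def subset_hash(rows, predicate, sort_key):
    """Filter and sort rows, then hash the subset in a fixed serialization.

    Hypothetical helper for illustration; SHA-256 is an assumed choice.
    """
    subset = sorted((r for r in rows if predicate(r)), key=sort_key)
    digest = hashlib.sha256()
    for row in subset:
        # Serialize each record deterministically: sorted field names,
        # unit separator between fields, record separator at the end.
        line = "\x1f".join(f"{k}={row[k]}" for k in sorted(row)) + "\x1e"
        digest.update(line.encode("utf-8"))
    return subset, digest.hexdigest()


def is_station_a(row):
    # Hypothetical filter criterion.
    return row["station"] == "A"


def by_id(row):
    # Hypothetical sort criterion; a unique key makes the order unambiguous.
    return row["id"]


rows = [
    {"id": 3, "station": "B", "temp": 11.5},
    {"id": 1, "station": "A", "temp": 9.0},
    {"id": 2, "station": "A", "temp": 10.2},
]

# Hash recorded when the subset is cited.
subset, cited_hash = subset_hash(rows, is_station_a, by_id)

# Later re-execution against the same data, stored in a different order.
_, recomputed = subset_hash(list(reversed(rows)), is_station_a, by_id)
assert cited_hash == recomputed
```

Because the digest covers every selected record in a defined order, a missing, additional, or modified record changes the hash, so a matching hash serves as evidence that the re-created subset is both complete and correct under these assumptions.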
