Unknown Data

Mining and Consolidating Research Dataset Metadata on the Web

August 2021 – September 2025

Description

Research data is essential to scientific progress, yet many valuable datasets are hidden on web sites and in small repositories, or are hard to find due to insufficient metadata. Only a fraction of researchers proactively share dataset metadata through public portals, and curating such metadata collections is costly. Unknown Data will provide means to automatically discover, extract, and publish metadata about research data that is hidden on the Web or in scholarly publications. The project’s goal is thus to improve the findability and re-usability of research data by (a) improving metadata quality, in particular with respect to the authority and use of existing datasets, and (b) uncovering datasets that are not yet reflected in public data repositories and registries.

Our approach (1) utilizes data citations from scholarly articles and web pages to collect metadata about relevant datasets, (2) discovers datasets and their context by crawling web pages, (3) consolidates metadata by linking information from domain-specific databases, (4) facilitates high metadata quality by establishing a discipline-specific curation process, and (5) ensures long-term availability of original data sources by archiving relevant web pages.
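As a rough illustration of steps (1) and (3), the sketch below collects dataset references from a handful of texts and groups them by identifier. It assumes, as a simplification, that datasets are cited by DOI; the regular expression, the function names, and the sample inputs are hypothetical and are not taken from the project's software.

```python
import re
from collections import defaultdict

# Assumption: dataset citations carry a DOI such as 10.1234/abc.5678.
# Real-world citations are often informal and would need more than a regex.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Return the DOIs found in text, lowercased, without trailing punctuation."""
    return [match.rstrip(".,;)").lower() for match in DOI_PATTERN.findall(text)]


def consolidate(documents):
    """Map each DOI to the sorted list of source ids that mention it."""
    mentions = defaultdict(set)
    for source_id, text in documents.items():
        for doi in extract_dois(text):
            mentions[doi].add(source_id)
    return {doi: sorted(sources) for doi, sources in mentions.items()}


# Hypothetical inputs: one publication snippet and one web page snippet.
documents = {
    "paper-1": "We use the survey data (doi:10.1234/Example.001).",
    "page-2": "Download: https://doi.org/10.1234/example.001, and 10.5678/other-set",
}
citations = consolidate(documents)
```

Grouping by a normalized identifier is what lets mentions from publications and web pages count toward the same dataset, which is the basis for judging how often a dataset is actually used.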

Obtaining metadata about research data directly from web sources and publications is a novel approach, increasing the visibility of “long tail” datasets while at the same time providing crucial insights into the actual use and impact of (known) datasets.

The results of this project will benefit two disciplines, computer science and the social sciences, through use case pilots. The DBLP computer science bibliography and the GESIS portals, accessible through GESIS Search, are among the most prestigious and widely used metadata collections in their respective fields. Both feed into many other search engines, such as Google Dataset Search and CESSDA. Through Unknown Data, the effectiveness and efficiency of researchers in common data search use cases will be significantly improved by (1) creating, for the first time in computer science, a centralized and comprehensive collection of metadata about research datasets and (2) fundamentally improving the quality and quantity of dataset metadata in the social sciences.

Dataset citations extracted from web pages and publications enable users to judge the impact and authority of datasets. These are crucial features for assessing the usefulness of research data and can thereby boost its reuse.

All metadata collected in this project will be made publicly available as linked open data and through REST APIs beyond the project’s duration. In doing so, we are actively contributing to making research data findable, accessible, interoperable, and re-usable for both researchers and machines (i.e., following the FAIR Data Principles). All software products from this project will be made available as open source, and methods and code can be adapted to further disciplines.
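To give a sense of what a dataset record published as linked open data might look like, the sketch below serializes a minimal record as JSON-LD. The use of the schema.org `Dataset` type, the function name, and all field values are assumptions made for illustration; the project's actual data model and API are not described here.

```python
import json


def to_jsonld(name, identifier, url):
    """Build a minimal JSON-LD record using the schema.org Dataset type (assumed vocabulary)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "identifier": identifier,
        "url": url,
    }


# Hypothetical record; none of these values refer to a real dataset.
record = to_jsonld(
    name="Example Survey 2020",
    identifier="10.1234/example.001",
    url="https://example.org/datasets/1",
)
serialized = json.dumps(record, indent=2)
```

A record of this shape is readable by people and parseable by machines, which is the sense in which such metadata serves both audiences.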

Partners

Schloss Dagstuhl – Leibniz Center for Informatics
GESIS – Leibniz Institute for the Social Sciences
Heinrich Heine University of Düsseldorf, Data & Knowledge Engineering
Humboldt University of Berlin, School of Library and Information Science

Organisation

Project staff Schloss Dagstuhl: Dr. Marcel R. Ackermann, Benedikt Maria Beckermann
Project lead GESIS: Prof. Dr. Alexia Katsanidou, Prof. Dr. Brigitte Mathiak
Project lead HHU Düsseldorf: Prof. Dr. Stefan Dietze
Project lead HU Berlin: Prof. Dr. Robert Jäschke

The project is funded by a grant from the German Research Foundation (DFG) under the funding programme "e-Research Technologies" (grant project number 460676019).
