Enhancing Open Repositories through Harvesting and Extracting Affiliation Data as First-class Citizen
October 2023 – September 2026
The dblp computer science bibliography provides search functionality over the metadata of scientific publications in computer science, along with links to their full-text PDFs. To this end, dblp provides high-quality metadata on author names, titles, and venues, including a unique identification of authors whenever possible. As the next step, we plan to extend the dblp database with affiliation information wherever it is available. We have compiled three use cases in which users benefit from affiliation data. These include direct benefits, such as new and useful search functionality, and indirect benefits, such as better author disambiguation and a more accurate data basis for scientometric studies and for measuring scientific output. The goal of this project is to develop and evaluate an e-research tool chain that addresses all three use cases and elevates affiliations to a first-class citizen in the dblp data environment.
We have broken down this challenge into four tasks: get the data, extract the metadata, integrate it into both the back end and the front end of dblp, and introduce the data to the community. Specifically, we will build a multi-source metadata harvester to automatically discover and collect metadata from different structured and unstructured web sources such as RDF on the Web of Data, full-text PDFs, websites, and custom APIs provided by publishers. We download the content, then extract and cleanse the metadata from the different web sources. For example, we apply entity recognition to extract the metadata from a PDF, in particular the authors’ affiliations, and match the extracted metadata to external knowledge bases, e.g., lists of known affiliations. The extracted information is fused into a metadata record based on an extended, provenance-aware metadata model. We ingest the new metadata records into the dblp database, where they are manually inspected, edited, and confirmed by curators using the dblp editorial manager. This iterative manual inspection generates feedback, which is used to improve both the machine learning model for extracting affiliation information and the metadata harvester.
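The matching and fusion steps described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the class names `Candidate` and `AffiliationRecord`, the functions `match_affiliation` and `fuse`, the small `KNOWN_AFFILIATIONS` list, and the similarity threshold are all hypothetical, and simple string similarity from the standard library stands in for the entity recognition and knowledge-base matching that the project plans to develop.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Hypothetical stand-in for an external knowledge base of known affiliations.
KNOWN_AFFILIATIONS = [
    "Ulm University",
    "GESIS – Leibniz Institute for the Social Sciences",
    "Schloss Dagstuhl – Leibniz Center for Informatics",
]


@dataclass
class Candidate:
    """One affiliation string as extracted from a single web source."""
    author: str
    raw_affiliation: str
    source: str  # assumed labels, e.g. "pdf", "publisher_api", "rdf"


@dataclass
class AffiliationRecord:
    """Fused record that keeps track of the sources supporting it."""
    author: str
    affiliation: str
    provenance: list = field(default_factory=list)
    confirmed_by_curator: bool = False  # set later during manual inspection


def normalize(text):
    """Lowercase, drop commas, and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(",", " ").split())


def match_affiliation(raw, threshold=0.6):
    """Return the most similar known affiliation, or None below the threshold."""
    best, best_score = None, 0.0
    for known in KNOWN_AFFILIATIONS:
        score = SequenceMatcher(None, normalize(raw), normalize(known)).ratio()
        if score > best_score:
            best, best_score = known, score
    return best if best_score >= threshold else None


def fuse(candidates):
    """Merge candidates per (author, matched affiliation), recording provenance."""
    records = {}
    for c in candidates:
        matched = match_affiliation(c.raw_affiliation)
        if matched is None:
            continue  # unmatched strings would need separate handling
        key = (c.author, matched)
        record = records.setdefault(key, AffiliationRecord(c.author, matched))
        record.provenance.append({"source": c.source, "raw": c.raw_affiliation})
    return list(records.values())


if __name__ == "__main__":
    fused = fuse([
        Candidate("A. Author", "Ulm University, Germany", "pdf"),
        Candidate("A. Author", "Ulm University", "publisher_api"),
    ])
    for record in fused:
        print(record)
```

Keeping the raw string and its source alongside the matched affiliation is what would let a curator see why a record was proposed before confirming or correcting it.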
Through user studies, we tailor the new affiliation interface to the users’ needs, and we also take care to integrate the new information into the ongoing author disambiguation and quality assurance processes of dblp’s editorial management system. Last but not least, all gathered information will be made publicly available in accordance with the FAIR principles as part of the ongoing dblp effort to support the e-research community with high-quality and trustworthy datasets, which are already used by thousands of researchers and software developers worldwide.
- Schloss Dagstuhl – Leibniz Center for Informatics
- GESIS – Leibniz Institute for the Social Sciences
- Ulm University, Institute of Databases and Information Systems
- Project staff Schloss Dagstuhl: Dr. Florian Reitz, Dr. Florian Jäckel
- Project lead GESIS: Prof. Dr. Brigitte Mathiak
- Project lead Ulm University: Prof. Dr. Ansgar Scherp
The project is funded by a grant from the German Research Foundation (DFG) under the funding programme "e-Research Technologies" (project number 515537520).