CANCELLED DDI - Cross Domain Integration (DDI-CDI)

Arofan Gregory (Jaffrey, US)
Hilde Orten (NSD – Bergen, NO)
Joachim Wackerow (GESIS – Mannheim, DE)

Introduction and Motivation

The Data Documentation Initiative (DDI) Alliance has been a leader in setting metadata standards for the social, behavioural, and economic sciences (SBE) for many years. It has provided specifications that support data collection, management, and dissemination, with detailed descriptions of the data typical of those domains. As with many other branches of statistics and research, however, the types, volume, and sources of data have multiplied in the recent past. Many projects are now cross-disciplinary, involving data from different domains. At the same time, computational approaches to the analysis of data and to the reproduction and origination of research have evolved. These factors combine to highlight the need for an enhanced ability to integrate and understand data across domain boundaries, and to understand the provenance and processing of data, even as more and more of the work is performed programmatically by systems which leverage machine learning and other advanced technologies.

The DDI Alliance has recently published a new specification intended to fill this need for integrating data from disparate sources: DDI - Cross Domain Integration (DDI-CDI). Unlike earlier DDI work products, DDI-CDI is not domain-specific, but is designed to be used with research data from any domain. The specification provides a model for understanding and integrating data across a wide range of sources, including big data/NoSQL, event history and register data, traditional columnar data, and multi-dimensional data. Further, it provides a way of describing data provenance, with a focus not only on traditional linear processes, but also on declarative "black box" processes employed by many modern systems. DDI-CDI is intended not to replace traditional domain models for data description, but to supplement them when data from different sources and of different types are being integrated. It is designed to work easily with many other popular standards and models, including semantic vocabularies and generic technology specifications for data processing, dissemination, and cataloguing.

With a production release expected at the end of 2020, the current draft of the specification is undergoing testing and initial implementation. This workshop will bring together experts from different domains and technical backgrounds to assess the DDI-CDI model in light of a series of cross-domain use cases in specific focus areas. The goal is to provide not only feedback on the specification, but also documented examples of how it can be applied, so that these can be shared with implementers in future and help to guide the use of the specification.

Interoperability, Sustainability, and Alignment with Other Standards

DDI-CDI is fundamentally a model which is intended to be implemented across a wide variety of technology platforms, and in combination with many other standards, models, and specifications. To support this use, it is formalized using a limited subset of the Unified Modelling Language (UML). The model is provided in the form of Canonical XMI – an interchange format for UML models supported by many different modelling and development tools. Further, a syntax representation is provided in XML, so that direct implementation of the model is possible if needed.

The platform-independence of the model makes it more easily applicable across a broad range of applications and helps ensure that it will be sustainable even as the technology landscape evolves. DDI-CDI builds on many other standard models and is aligned with them where appropriate. This is shown in the model itself, where formalizations from other models and specifications are refined, extended, or directly used. The specification includes a description of what these other standards and models are, and how they are used in DDI-CDI.


Participants will come from multiple domains and will include data experts who are familiar with domain-specific use cases and technical experts who are familiar with domain-specific metadata standards. Additionally, people involved in relevant W3C standards will participate (e.g., Data Catalog Vocabulary - DCAT, RDF Data Cube Vocabulary, PROV Ontology, CSV on the Web). Data from several domains have been identified to serve as the focus of the workshop: geographical data, social science data, environmental data, and bio-medical data.


Outputs of the workshop are expected to include:

  • Recommendations for extensions/corrections to the DDI-CDI model, with reference to specific issues encountered in the work on the focus areas/use cases and general considerations
  • Documentation on the focus areas/use cases explicating the various data and the way in which they have been described for integration/harmonization
  • Description of the provenance of the data from the focus areas/use cases
  • Description of the integration and harmonization processes used on the data for each focus area/use case

All descriptions will be provided at both a general level and a detailed technical level, to help not only business users but also the technical implementers of the standards.

