28. – 31. Januar 2003, Dagstuhl Seminar 03051
Information and Process Integration: A Life Science Perspective
Rolf Apweiler (European Bioinformatics Institute – Cambridge, GB)
Thure Etzold (LION Bioscience – Cambridge, GB)
Johann-Christoph Freytag (HU Berlin, DE)
Carole Goble (University of Manchester, GB)
Peter Schwarz (IBM Almaden Center, US)
The result of the seminar showed that integration is still a wide-open field, owing to differences in technology, in the expectations of users, and in the kinds of problems that biologists and life scientists try to solve. It became apparent that the integration task is often driven by the specifics of the application (“lab protocols” and their mapping onto computer systems). The discussions also made clear that integration must include semantic integration, in particular the meaningful integration of different space and time scales (microseconds vs. millions of years) and the representation of discrete and continuous data (the former is well understood, the latter is an open area). Another open (biological) issue is the use of measurements that are often not reproducible, which makes them difficult to compare and to reuse. Finally, it became apparent that biologists and computer scientists must cooperate much more closely to solve the complex problems that exist in life science and those about to appear on the (scientific) horizon.
Break Out Sessions
- Extensible Programming Environments: What’s so cool about Expression Templates?
- Integration of Time and Space Dependent Data in Life Science
- How Can Geeks and Biologists Cooperate?
- Wineflow Group
This seminar brought together scientists as well as industrial developers and researchers to discuss the challenges of integrating bioinformatics/life science data in a meaningful way. Despite technological advances, many open problems and issues persist and need to be addressed. This workshop focused on the main issues of data and process integration in the life science domain.
Detailed Agenda for the Workshop
Over the last fifteen years the amount of data in the area of Life Science/Bioinformatics has grown exponentially. This data is stored and is available in an ever increasing number of data collections (also often referred to as databases), each focusing on specific aspects of life science, such as nucleotide or protein sequences, functional motifs, metabolic pathways, specific organisms, or information related to specific diseases. At the same time the bioinformatics community has developed hundreds of tools to visualize, to analyze, and to process that data, with the goal of turning raw data as produced by sequencing machines into knowledge applicable to drug design and to the development of new therapies. Examples include gene prediction, motif recognition, the computation of phylogenetic relationships, and the deduction of pathways from gene expression arrays. However, almost all of these tools use proprietary, non-standard data formats, making it (almost) impossible to replace them or to introduce new tools without first bridging the gap between the existing world of data and processing conventions and promising new approaches.
With the advent of middleware technology, the focus of research and development in data integration has begun to shift. While many previous efforts have addressed the syntactic integration of data collections, the real challenge now, and for years to come, will be the development of new approaches, techniques, methods and algorithms for performing semantic integration. What will be needed are systems that bring together data that “belongs” together, making this determination on the basis of both structure and meaning. To achieve this goal, current middleware technology will need to be extended so that it can take advantage of ontologies, semantic networks and other metadata (e.g. information about data quality) to gain a deeper understanding of the primary data.
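The core of semantic integration as described above is deciding, on the basis of meaning rather than syntax, which records “belong” together. The following minimal sketch illustrates the idea with an invented synonym ontology; all terms, accession strings, and function names are hypothetical examples, not taken from any real data collection.

```python
# Hypothetical sketch: two data collections label the same concept
# differently; a small synonym ontology maps both labels onto one
# canonical concept, so records that "belong" together can be merged.

ONTOLOGY = {  # source-specific label -> canonical concept
    "hsp70": "heat shock protein 70",
    "heat-shock protein 70": "heat shock protein 70",
    "HSPA1A": "heat shock protein 70",
}

def canonical(term: str) -> str:
    """Map a source-specific label onto its canonical concept."""
    return ONTOLOGY.get(term.strip(), term.strip().lower())

def semantic_join(source_a, source_b):
    """Group (label, record) pairs from two sources by canonical concept."""
    merged = {}
    for term, record in source_a + source_b:
        merged.setdefault(canonical(term), []).append(record)
    return merged

# Two sources that use different labels for the same protein family:
a = [("hsp70", "sequence P0DMV8"), ("p53", "sequence P04637")]
b = [("HSPA1A", "pathway: stress response")]
result = semantic_join(a, b)
```

A purely syntactic join on the labels would keep the `hsp70` and `HSPA1A` records apart; the ontology lookup is what brings them together under one concept. Real systems would of course replace the dictionary with a curated ontology service.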
The problems described are present in academic and research institutions as well as in pharmaceutical, drug design, medical, and health care businesses. Only modern technology promises users a platform that brings diverse data, information, knowledge, and processing software together to advance science and to satisfy business needs. If the time currently needed to develop a new drug, estimated at approximately 10–15 years, is to be reduced fundamentally, the process from molecular-biology evidence to clinical studies has to be highly streamlined, which requires a tight yet flexible intertwining of a multitude of databases and applications.
This seminar aims to bring together scientists and practitioners from the fields of bioinformatics and information technology in order to better understand the new challenges as well as existing approaches and relevant technologies. Solutions to the new problems will most likely be driven by extending existing technology (e.g. object-relational DBMS) to meet new needs (e.g. federated database management, highly parallel distributed problem-solving on a grid), by emerging tools and standards for managing semi-structured data (e.g. XML, XQuery, XML Schema), and by process technologies (e.g. CORBA, Java Beans, message-driven workflow using Web Services).
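The federated-database approach mentioned above typically hides each heterogeneous source behind a wrapper exposing a uniform query interface, as in SQL/MED wrappers or J2EE connectors. The sketch below illustrates that pattern in miniature; the source classes and record formats are invented for illustration only.

```python
# Illustrative sketch of the wrapper idea behind federated access:
# each heterogeneous source is hidden behind the same query interface,
# so a federated query can fan out without knowing the native formats.

class FlatFileSource:
    """Wraps tab-separated flat-file lines as uniform records."""
    def __init__(self, lines):
        self.lines = lines

    def query(self, keyword):
        for line in self.lines:
            rec_id, desc = line.split("\t", 1)
            if keyword in desc:
                yield {"id": rec_id, "description": desc}

class DictSource:
    """Wraps an in-memory mapping behind the same interface."""
    def __init__(self, data):
        self.data = data

    def query(self, keyword):
        for rec_id, desc in self.data.items():
            if keyword in desc:
                yield {"id": rec_id, "description": desc}

def federated_query(sources, keyword):
    """Fan a single query out to all wrapped sources."""
    return [rec for src in sources for rec in src.query(keyword)]

sources = [
    FlatFileSource(["X1\tkinase inhibitor"]),   # hypothetical flat file
    DictSource({"Y2": "tyrosine kinase"}),      # hypothetical in-memory DB
]
results = federated_query(sources, "kinase")
```

The caller sees one result format regardless of whether the data came from a flat file or a structured store; production federation layers add query planning, pushdown, and error handling on top of this basic shape.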
New technology areas such as ontologies, the Semantic Web, and the Grid are highly applicable to a more meaningful integration of data, information, and processes for the life sciences. Mutual understanding between the research and business worlds is essential for making the necessary advances in bioinformatics. Now is also the time to evaluate current solutions and approaches so that the pressing needs of the bioinformatics/life science community can drive future research and development directions.
The areas to discuss include:
- Achieving semantic integration
- What are today’s approaches for semantic integration? Are those sufficient for the life science domain?
- Which concepts, such as ontologies, are necessary to perform semantic integration?
- Which languages are required to specify the various forms of biological and medical knowledge needed for bioinformatics research? Are relations and attributes really enough?
- Which knowledge management techniques (personalization, community building, knowledge sharing, text mining) are appropriate to the Life Science area?
- How to ensure data quality, data consistency, and completeness? How can data quality be compared, assessed, measured, combined?
- Information discovery and publication
- What is the optimal form of access to the various data collections that are important to scientific organizations and businesses in the different life science areas?
- Can XML be used as the “universal language” for describing the integrated information base? How can the “navigational access” via hyper-linked HTML pages, as performed today in many application areas, be captured?
- Version management for data collections and metadata that change daily/weekly? Are there compression schemes that can reduce the large amount of repeated (redundant) data? How can we efficiently store the relationships between new or changing evidence and new versions of data?
- How is information described? What are approaches to handle the description of data (metadata)? Which metadata is relevant (schema, ontologies)? How to store and access it? How to keep it current?
- What is a federated schema if structured and unstructured data are brought together? Which schema integration techniques, federated query and search technologies are applicable?
- What are possible system structures in a highly dynamic world that constantly changes and that makes constant progress?
- Information processing paradigms
- Which processing/transaction models are appropriate?
- How can ontologies and other meta data support more meaningful processing techniques? Are current techniques adequate for distributed query processing? What are new requirements coming from Life Science?
- How to represent and manage derived data, data quality and data provenance?
- How do Semantic Web and Grid technologies contribute?
- Which federated database technologies can be used in which context? What are the trade-offs that provide the basis for deciding which approach to choose in a particular situation?
- Information technologies and standardization
- How can different technologies such as SQL/MED wrappers, J2EE connectors, EAI adapters, and Web Services be used for virtual or physical integration? Which technology should be used under which circumstances?
- Which role will database systems, application servers, workflow systems, messaging systems, portal servers, etc. play? How do they relate and cooperate?
- Does Web Database Technology suffice?
- What is the query/retrieval interface for the future?
- What must be standardized in the storage, access, and processing for better information integration?
- What is the minimum in standards one needs for improved “cooperation” and “collaboration” of applications?
- How can XML-based metadata help to improve understanding of the semantics of data in order to perform challenging tasks such as information integration?
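Several of the questions above concern XML-based metadata that describes a data collection so that integration software can interpret its contents. The following sketch shows one conceivable shape of such a metadata record and how a program could read it; the element names, attributes, and the ontology terms are invented for illustration and do not follow any real standard.

```python
# Hypothetical XML metadata describing a data collection: each field
# is annotated with a shared ontology term, so integration software
# can map source-specific field names onto common concepts.
import xml.etree.ElementTree as ET

METADATA = """
<collection name="ExampleProteinDB">
  <version date="2003-01-28">42</version>
  <field name="accession" type="string" ontologyTerm="protein identifier"/>
  <field name="sequence" type="string" ontologyTerm="amino acid sequence"/>
</collection>
"""

root = ET.fromstring(METADATA)
# Build a map from local field names to shared ontology terms:
fields = {f.get("name"): f.get("ontologyTerm") for f in root.iter("field")}
collection_name = root.get("name")
```

With such a map, an integration layer could recognize that this collection’s `sequence` field and another collection’s differently named field both denote an amino acid sequence; the `version` element hints at how frequently changing collections (see the versioning question above) might advertise their state.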
As cross-fertilization is important, the major goal of the seminar is to bring representatives of the different communities (researchers, vendors, and users) together for a joint in-depth understanding of the issues, to identify and prioritize the main research items, to identify standardization needs, and to discuss demanding questions and open problems in detail. As a major driving force we plan to use case studies contributed by life scientists to discuss many of these issues from a user’s (i.e. life science) perspective.