Dagstuhl Seminar 23352
Integrating HPC, AI, and Workflows for Scientific Data Analysis
(Aug 27 – Sep 01, 2023)
Organizers
- Rosa Maria Badia (Barcelona Supercomputing Center, ES)
- Laure Berti-Equille (IRD - Montpellier, FR)
- Rafael Ferreira da Silva (Oak Ridge National Laboratory, US)
- Ulf Leser (HU Berlin, DE)
Contact
- Michael Gerke (for scientific matters)
- Susanne Bach-Bernhard (for administrative matters)
Summary
The Dagstuhl Seminar, held from August 27 to September 1, 2023, highlighted the interdependence of high performance computing (HPC), artificial intelligence (AI), and workflow technologies for modern Big Data analysis. With a focus on bridging the gaps between these historically siloed areas, the seminar addressed the increased resource demands that arise when AI is integrated into scientific workflows, the challenges this poses to HPC architectures, and the potential of AI for optimizing workflow systems and operations, including scheduling and fault tolerance.
The seminar developed a nuanced understanding of AI+HPC integrated workflows, elaborating on the different modes in which AI and HPC components can be coupled within workflows. These range from AI models that replace computationally intensive components (AI-in-HPC), to AI models that operate externally to steer HPC components or generate new data (AI-out-HPC), to concurrent AI models that optimize HPC runtime systems (AI-about-HPC). Such integration is vital for the future of scientific workflows, where AI and HPC not only coexist but also co-evolve to foster more effective and intelligent scientific inquiry.
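To make these three coupling modes concrete, here is a minimal, purely illustrative Python sketch; every name in it (expensive_simulation, Surrogate, steer, pick_next_task) is a hypothetical placeholder, not the API of any existing workflow system.

```python
import random

def expensive_simulation(x: float) -> float:
    """Stand-in for a costly HPC component (e.g., a PDE solver)."""
    return x ** 2  # trivially cheap here; imagine hours of compute

class Surrogate:
    """AI-in-HPC: a trained model replaces the expensive component."""
    def predict(self, x: float) -> float:
        # A real surrogate would be a trained neural network.
        return x ** 2 + random.gauss(0.0, 0.01)

def steer(history: list) -> float:
    """AI-out-HPC: an external model picks the next input to explore."""
    # In practice: Bayesian optimization, active learning, etc.
    return max(history, default=0.0) + 1.0

def pick_next_task(queue: list) -> str:
    """AI-about-HPC: a model optimizes the runtime system itself."""
    # A learned scheduler would rank tasks by predicted resource usage.
    return queue[0]

surrogate = Surrogate()
x_next = steer([1.0, 2.0])                    # AI-out-HPC: steering
y_fast = surrogate.predict(x_next)            # AI-in-HPC: surrogate call
task = pick_next_task(["simulate", "train"])  # AI-about-HPC: scheduling
print(x_next, y_fast, task)
```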
A paradigm shift of HPC systems towards real-time interaction within workflows was another focal point of the seminar. Moving away from traditional batch-oriented operation, the seminar shed light on the emerging need for workflows that support dynamic, on-the-fly interactions. Such interactions are vital not only for the real-time steering of computations and the runtime recalibration of parameters, but also for making informed decisions on cost-value trade-offs, thereby optimizing both computational and financial resources.
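As a stylized illustration of such cost-value-driven steering, the following sketch stops refining a computation once the estimated marginal value of another step drops below its cost; estimate_value, refine, and COST_PER_STEP are hypothetical stand-ins for the monitoring and accounting services a real system would query.

```python
def estimate_value(step: int) -> float:
    """Hypothetical estimate of the marginal value of one more step."""
    return 1.0 / (step + 1)  # diminishing returns as refinement proceeds

def refine(result: float) -> float:
    """Stand-in for one batch of HPC computation."""
    return result + 0.5

COST_PER_STEP = 0.2  # e.g., node-hours or cloud dollars per iteration

result, step = 0.0, 0
while estimate_value(step) > COST_PER_STEP:  # cost-value trade-off
    result = refine(result)  # runtime recalibration could also happen here
    step += 1
print(f"stopped after {step} steps with result {result}")
```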
The discussion also ventured into the realm of federated workflows, distinguishing them from the conventional grid computing model. Federated workflows, or cross-facility workflows, emphasize the orchestration of workflows across different computational facilities, each with distinct environments and policies. This paradigm advocates for a seamless execution of complex processes, underscoring the necessity of maintaining coherence and coordination throughout the workflow life cycle.
Contractual and quality-of-service (QoS) considerations in federated workflows, especially when crossing organizational boundaries, were identified as critical areas of focus. The seminar highlighted the need for formal contracts to manage the intricate bindings and dynamic interactions between various entities. The role of a federation engine was emphasized as a tool for translating requirements, ensuring compliance, and resolving disputes, thereby ensuring the workflow's needs are met at each federation point.
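A minimal sketch of what a federation engine's compliance check might look like, assuming a hypothetical Contract/FacilityOffer data model; a real federation engine would additionally handle requirement translation, negotiation, monitoring, and dispute resolution.

```python
from dataclasses import dataclass

@dataclass
class Contract:
    """Formal requirements a workflow step brings to a federation point."""
    max_latency_s: float        # QoS bound the facility must meet
    min_gpus: int               # resources the step requires
    requires_data_export: bool  # whether results must leave the site

@dataclass
class FacilityOffer:
    """What a computing facility promises under its own policies."""
    latency_s: float
    gpus: int
    allows_data_export: bool

def complies(offer: FacilityOffer, contract: Contract) -> bool:
    """Core check of a federation engine: does the offer meet the contract?"""
    return (offer.latency_s <= contract.max_latency_s
            and offer.gpus >= contract.min_gpus
            and (offer.allows_data_export or not contract.requires_data_export))

# Example: evaluate candidate facilities for one workflow step.
step = Contract(max_latency_s=5.0, min_gpus=4, requires_data_export=True)
offers = [FacilityOffer(8.0, 16, True), FacilityOffer(2.0, 8, True)]
print([complies(o, step) for o in offers])  # [False, True]
```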
Moreover, the seminar identified key challenges and opportunities at the intersection of these technologies, such as the stochastic nature of ML and its impact on the reproducibility of data analysis on HPC systems. It highlighted the need for holistic co-design approaches, where workflows are introduced early and scaled from small-scale experiments to large-scale executions. This approach is essential for integrating the "full" workflow environment, including ML/AI components, early in the process, thereby replacing expensive simulations with fast-running surrogates and enabling interactive exploration with the entire software environment.
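As one small, concrete facet of the reproducibility question, the sketch below pins the seeds of the Python and NumPy random generators so that a stochastic ML step yields identical results across runs. This is only a sketch: frameworks such as PyTorch or TensorFlow need their own seeds and deterministic kernel settings, and parallel execution on HPC systems introduces further nondeterminism (e.g., floating-point reduction order).

```python
import random

import numpy as np

def stochastic_step(seed: int) -> float:
    """Stand-in for an ML task whose result depends on randomness."""
    random.seed(seed)                  # Python's built-in RNG
    rng = np.random.default_rng(seed)  # NumPy's generator
    weights = rng.normal(size=4)       # e.g., random model initialization
    return float(weights.sum())

# Same seed, same result: a necessary (though not sufficient) condition
# for reproducible ML-heavy workflows.
assert stochastic_step(42) == stochastic_step(42)
```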
In summary, Dagstuhl Seminar 23352 provided an in-depth exploration of the synergistic relationship between HPC, AI, and scientific workflows. It paved the way for future research directions and practical implementations, aiming to revolutionize scientific data analysis by harmonizing computational power with intelligent, data-driven analysis. The discussions and outcomes of the seminar are poised to influence the development of workflow systems and technologies in the years to come, signaling a shift towards more integrated, adaptive, and efficient scientific computing paradigms.
Motivation
Modern scientific Big Data analysis builds on three pillars: (i) workflow technologies to express and steer analyses and make them reproducible, (ii) machine learning (ML) and artificial intelligence (AI) as fundamental and versatile analysis steps, and (iii) HPC for scalable execution of analytic pipelines over very large data sets. Yet their interplay is complex and under-researched. Scientific workflow systems (SWF) today are used universally across scientific domains to approach large data analysis problems and have underpinned some of the most significant discoveries of the past decades [1]. Many SWF have significant computational, storage, and communication demands, and thus must execute on a wide range of platforms, including high performance computing (HPC) centers and even exascale platforms [2]; conversely, for many researchers SWF are the method of choice for developing HPC applications. In the past 10 years, this interplay of workflow technologies and HPC has been challenged by the fast rise of AI technologies, in particular ML [3]. SWF are becoming ML-rich, with models trained and applied to large data sets, leading to resource requirements at the scale available in HPC centers. At the same time, ML-heavy tasks bring new requirements to HPC, such as GPUs or neuromorphic chips and the need to support iterative computations. Conversely, ML techniques increasingly permeate workflow steering and HPC optimization, especially in scheduling and resource provisioning. This leads to a three-way relationship between HPC, workflows, and ML, in which each offers important capabilities to the others but must also react to the new requirements the others bring [4].
The above three pillars are researched by communities that are largely separate from each other. However, coping with current and upcoming large-scale scientific challenges, such as Earth Science, Population Genetics, or Computational Material Science, requires their close interaction for the benefit of society. Previous attempts to unify the communities were at best bi-directional, ignoring the importance of the interplay of all three factors. For instance, in 2021 some of the organizers of this Dagstuhl Seminar organized or attended a series of virtual events that brought the workflows community together in an attempt to mitigate the proliferation of newly developed workflow systems and to provide a community roadmap for bringing ML closer to workflows [4]. In this Dagstuhl Seminar, we aim to bring these three communities together to study challenges, opportunities, new research directions, and future pathways at the interplay of SWF, HPC, and ML. In particular, the seminar will focus on the following research questions:
- How can ML technologies be used to improve SWF and HPC operations, for instance by better scheduling, improved fault tolerance, or energy-efficient resource provisioning?
- How must HPC architectures be adapted to better fit the requirements of large-scale ML technology, in particular from the field of Deep Learning?
- How must SWF languages and execution systems change to unravel the full power of ML-heavy data analysis on HPC systems?
- What are the most prominent use cases of ML techniques on HPC, and what specific and currently unmet requirements do they yield?
- How does the stochastic nature of ML affect the reproducibility of data analysis on HPC?
To approach these questions, the seminar will follow an innovative "continuous integration" setup in which individual contributions of experts are iteratively bundled together to eventually produce a common knowledge framework as a basis for paving the road ahead. We expect the seminar to produce tangible outputs both in terms of joint reports and publications and in terms of new international collaborations.
References
[1] Liew, C. S., Atkinson, M. P., Galea, M., Ang, T. F., Martin, P., & Van Hemert, J. I. (2016). Scientific workflows: moving across paradigms. ACM Computing Surveys, 49(4).
[2] Badia Sala, R. M., Ayguadé Parra, E., & Labarta Mancho, J. J. (2017). Workflows for science: A challenge when facing the convergence of HPC and big data. Supercomputing Frontiers and Innovations, 4(1).
[3] Ramirez-Gargallo, G., Garcia-Gasulla, M., & Mantovani, F. (2019). TensorFlow on state-of-the-art HPC clusters: a machine learning use case. Int. Symp. on Cluster, Cloud and Grid Computing.
[4] Ferreira da Silva, R., et al. (2021). A community roadmap for scientific workflows research and development. IEEE Workshop on Workflows in Support of Large-Scale Science.
Participants
- Ilkay Altintas (San Diego Supercomputer Center, US)
- Rosa Maria Badia (Barcelona Supercomputing Center, ES)
- Laure Berti-Equille (IRD - Montpellier, FR)
- Silvina Caino-Lores (INRIA - Rennes, FR)
- Kyle Chard (University of Chicago, US)
- Rafael Ferreira da Silva (Oak Ridge National Laboratory, US)
- Rosa Filgueira (University of St Andrews, GB)
- Ana Gainaru (Oak Ridge National Laboratory, US)
- Shantenu Jha (Rutgers University - Piscataway, US & Brookhaven National Laboratory - Upton, US)
- Timo Kehrer (Universität Bern, CH)
- Christine Kirkpatrick (San Diego Supercomputer Center, US)
- Daniel Laney (LLNL - Livermore, US)
- Ulf Leser (HU Berlin, DE)
- Bertram Ludäscher (University of Illinois at Urbana-Champaign, US)
- Dejan Milojicic (HP Labs - Milpitas, US)
- Paolo Missier (Newcastle University, GB)
- Wolfgang E. Nagel (TU Dresden, DE)
- Jedrzej Rybicki (Jülich Supercomputing Centre, DE)
- Frédéric Suter (Oak Ridge National Laboratory, US)
- Domenico Talia (University of Calabria, IT)
- Jeyan Thiyagalingam (Rutherford Appleton Lab. - Didcot, GB)
- Matthias Weidlich (HU Berlin, DE)
- Sean R. Wilkinson (Oak Ridge National Laboratory, US)
Classification
- Computational Engineering, Finance, and Science
- Distributed, Parallel, and Cluster Computing
- Machine Learning
Keywords
- Scientific Workflows
- High Performance Computing
- Machine Learning
- Scientific Data Analysis
- Big Data