Two key changes are driving an immediate need for deeper understanding of I/O workloads in high-performance computing (HPC): applications are evolving beyond the traditional bulk-synchronous models to include integrated, multi-step workflows, in-situ analysis, AI, and data analytics methods; and storage system designs are evolving beyond a two-tiered file system and archive model to complex hierarchies containing temporary, fast tiers of storage close to compute resources with markedly different performance properties. Both of these changes represent a significant departure from the decades-long status quo and require investigation from storage researchers and practitioners to understand their impacts on overall I/O performance. Without an in-depth understanding of I/O workload behavior, storage system designers, I/O middleware developers, facility operators, and application developers will not know how best to design or utilize the additional tiers for optimal performance of a given I/O workload.
The goal of this Dagstuhl Seminar is to bring together experts in I/O performance analysis and storage system architecture to collectively evaluate how our community is capturing and analyzing I/O workloads on HPC systems, identify any gaps in our methodologies, and determine how to develop a better, in-depth understanding of their impact on HPC systems. We expect our discussions to result in a) a set of common terminology across the community to describe I/O workloads; b) concrete recommendations to homogenize measurement and analysis of I/O workloads across centers; c) a roadmap showing how the collected I/O data can have practical impact for users; and d) a special issue of a journal documenting our findings and providing the needed outreach to the wider community.
In the seminar, we will discuss key topic areas and related questions aimed towards our goal of improving understanding of HPC application I/O behavior. We anticipate the following topics and questions will generate lively discussions:
- I/O workflow analysis: What data do we need to collect in order to understand I/O patterns? What analysis do we need to perform in order to know how to support these emerging I/O patterns?
- Tools for I/O analysis: Are our current tools adequate for understanding I/O behavior? If not, what new capabilities do we need? How can we couple our tools to meet the needed capabilities? What lessons can we learn from instrumenting applications in the past that we can apply to our future endeavors?
- Changing workloads and their requirements: How are workflows changing on HPC systems? How do their I/O patterns differ from what we have seen in the past? How do we expect workflows to behave in the future?
- Data center support: What data do HPC centers need to collect about their workloads to ensure they stay current with the I/O needs of their applications and workflows? What do HPC system administrators need to know to tune their systems for high I/O performance?
- Storage system designs: Are there advanced storage system designs that could aid in improving the performance of anticipated future workflows? Can we influence the designs such that adequate I/O monitoring and analysis is built into the hardware?
The results of this seminar will have broad applicability for those interested in improving the I/O performance of HPC applications, an often overlooked bottleneck in system efficiency. We anticipate that our meeting will spark long-term, international collaboration among HPC I/O performance researchers who share the goal of understanding and improving HPC I/O.
Dagstuhl Seminar 21332, "Understanding I/O behavior in scientific and data-intensive computing," brought together computer scientists from around the world to survey how I/O workloads are measured and analyzed on high-performance computing (HPC) systems, identify gaps in methodologies, and debate how to best apply this technology to advance HPC productivity. The hybrid, week-long event attracted 10 physical and 25 virtual attendees. They included representatives from seven countries spanning a variety of career levels in academia, industry, and government. The diversity of perspectives, combined with an intense week-long seminar format, offered an unprecedented opportunity for researchers to share ideas and spark new collaborative opportunities.
The seminar agenda was structured as a combination of full-group plenary sessions and subgroup breakout sessions. The plenary sessions were used to discuss high-level issues, vote on subtopics to investigate, relay results from breakout sessions, and present "lightning" talks that highlighted key issues in the community. The breakout sessions employed small groups (roughly five people each) to follow up in "deep dive" discussions on specific subtopics. This format enabled attendees from numerous time zones to remain productively engaged throughout the week. We also found it to be successful in facilitating discussion despite the COVID-19 safety considerations that prevented us from assembling at a single venue. The final day of the seminar was devoted to recording seminar findings in a timely manner while subject matter experts were still available for consultation.
Over the course of the seminar, the attendees converged on six high-level topics for deep dive discussions that are covered in this report.
- Tools: Cross-Cutting Issues (Section 4.1) explored common challenges in development of tools for understanding HPC I/O.
- Data Sources and Acquisition (Section 4.2) addressed how to acquire various forms of raw I/O instrumentation from production systems.
- Analysis (Section 4.3) focused on how to interpret I/O instrumentation once acquired.
- Enacting Actionable Responses (Section 4.4) investigated how to best utilize the outcomes from I/O analysis.
- Data Center Support (Section 4.5) focused on strategies for facility operators to facilitate better understanding of I/O behavior.
- Community Support (Section 4.6) explored the unique characteristics of the I/O analysis community and how to foster its growth.
This report presents a separate summary for each deep dive topic, including a survey of the state of the art, gaps, challenges, and recommendations. The report concludes in Section 5 with a summary of cross-cutting themes and recommendations produced by the seminar as a whole. We found that understanding I/O behavior in scientific and data-intensive computing is increasingly important in an era of evolving workloads and increasingly complex HPC systems and that several cross-cutting challenges must be addressed in order to maximize its potential.
- Wolfgang Frings (Jülich Supercomputing Centre, DE)
- Yi Ju (Max Planck Computing and Data Facility - Garching, DE)
- Andreas Knüpfer (TU Dresden, DE)
- Julian Kunkel (Gesellschaft f. wissenschaftl. Datenverarbeitung, DE)
- Erwin Laure (Max Planck Computing and Data Facility - Garching, DE)
- Radita Liem (RWTH Aachen, DE)
- Frank Mueller (North Carolina State University - Raleigh, US)
- Sarah Neuwirth (Goethe-Universität Frankfurt am Main, DE)
- Sebastian Oeste (TU Dresden, DE)
- Martin Schulz (TU München, DE)
- Marcus Vincent Boden (Gesellschaft f. wissenschaftl. Datenverarbeitung, DE)
- Jim Brandt (Sandia National Labs - Albuquerque, US)
- André Brinkmann (Universität Mainz, DE)
- Suren Byna (Lawrence Berkeley National Laboratory, US)
- Philip Carns (Argonne National Laboratory, US)
- Fahim Tahmid Chowdhury (Florida State University - Tallahassee, US)
- Hariharan Devarajan (LLNL - Livermore, US)
- Ann Gentile (Sandia National Labs - Albuquerque, US)
- Sivalingam Karthee (Huawei Technologies - Reading, GB)
- Roland Laifer (KIT - Karlsruher Institut für Technologie, DE)
- Jay Lofstead (Sandia National Labs - Albuquerque, US)
- Johann Lombardi (Intel Corporation - Meudon, FR)
- Stefano Markidis (KTH Royal Institute of Technology - Stockholm, SE)
- Sandra Adriana Mendez (Barcelona Supercomputing Center, ES)
- Kathryn Mohror (LLNL - Livermore, US)
- Sarp Oral (Oak Ridge National Laboratory, US)
- Michael Ott (LRZ - München, DE)
- Marc Snir (University of Illinois - Urbana, US)
- Shane Snyder (Argonne National Laboratory, US)
- Mehmet Soysal (KIT - Karlsruher Institut für Technologie, DE)
- Osamu Tatebe (University of Tsukuba, JP)
- Devesh Tiwari (Northeastern University - Boston, US)
- Chen Wang (University of Illinois at Urbana Champaign, US)
- Michele Weiland (University of Edinburgh, GB)
- Weikuan Yu (Florida State University - Tallahassee, US)
- Dagstuhl-Seminar 17202: Challenges and Opportunities of User-Level File Systems for HPC (2017-05-14 - 2017-05-19)
- Distributed, Parallel, and Cluster Computing
- Information Retrieval
- I/O performance measurement
- understanding user I/O patterns
- HPC I/O
- I/O Characterization