Dagstuhl Seminar 25032

Task and Situation-Aware Evaluation of Speech and Speech Synthesis

(Jan 12 – Jan 15, 2025)


Permalink
Please use the following short URL to link to this page: https://www.dagstuhl.de/25032


Program

Summary

This report documents the program and the outcomes of Dagstuhl Seminar "Task and Situation-Aware Evaluation of Speech and Speech Synthesis" (25032).

The recent advances in deep neural networks have pushed the boundaries for synthetic speech to the point where synthetic speech is, in some contexts, indistinguishable from human speech. Alongside a slew of well-known issues with deepfakes, this development raises fundamental questions concerning the evaluation of synthetic speech and its relation to the evaluation of human speech. Human speech and synthetic speech have traditionally been evaluated in different ways, with human speech often serving as an implicit or explicit gold standard for synthetic speech. At the same time, the technical distinction between synthetic and human speech is becoming increasingly blurred: human speech is delivered through encoding/decoding processes that change the signal fundamentally, most notably in applications such as speech-to-speech translation and anonymisation, and voice cloning of recorded speech passes as speech synthesis. We hold that the fundamental question when evaluating speech these days is no longer "How similar to human performance is this?", but rather: "Is this 'good' speech?"

The issue is made more complex still by the fact that what constitutes good speech, synthetic or human, is not a trivial question. Finally, standard evaluation methodologies fail to take into account the interdisciplinary nature of speech science and speech technology through their assumption that one single evaluation metric should satisfy all requirements.

The goal of the Dagstuhl Seminar "Task and Situation-Aware Evaluation of Speech and Speech Synthesis" (25032) was to initiate a shift in the different communities affected by these developments. To do so, we gathered a total of 22 renowned researchers from various disciplines (among others: engineering, phonetics, user interfaces, computer science, speech pathology) to exchange views on this fundamental issue. Exchange between groups was encouraged by organising the seminar around working groups designed to explore the breadth of research fields and applications with stakes in speech and synthetic speech evaluation. This was also intended to encourage more active involvement of the participants both during and after the seminar. This hands-on approach came at the expense of formal talks and panel discussions, which were limited to two talks, both intended to get background "out of the way" and to allow the rest of the discussions to focus on the future.

In the present executive summary, our goal is twofold. First, we present the immediate outcomes of the discussions that took place during the seminar. In addition, we are convinced that the manner in which the seminar was organized represents a contribution in itself, as it demonstrates a process towards fruitful trans- and interdisciplinary exchange leading to long-term impact.

Each day of the three-day seminar was given a broad goal: The first day focussed on background, the second on innovation and solutions, and the third on consolidation and structuring. Each day was further divided into sessions, each with a specific result in mind.

Following a general introduction to the seminar, including Dagstuhl practicalities, the first day started out with three-minute participant presentations. As most of our participants have long and broad experience in the speech field, they were encouraged to focus their presentations on matters of direct relevance to the seminar, and they were also asked to specify their own interests in relation to the seminar description. Since this seminar was designed to gather and reconcile as much as possible of the collective experience in speech and speech synthesis evaluation, with few presentations and much discussion and collaborative work, we include all personal statements in the full report. They constitute a fair representation of the morning session. A subsequent discussion and a brief introduction to the afternoon sessions concluded the morning session.

The afternoon session continued and concluded what we view as the background work in this seminar. First, two talks presented the limits of the current state of the art in speech synthesis evaluation methodologies. A longer session dedicated to an extensive exchange about these methodologies and their limits followed the talks. Although these shortcomings were well known within the group of participants, experience has shown that discussions on improvements and evaluation innovation easily get stuck in repeated discussions about the shortcomings of current methods. Our goal was to get these discussions out of the way on day one, and then explicitly avoid going back to them on days two and three, to ensure that the remaining time of the seminar was dedicated to the exploration of new horizons.

At the end of day one, the organisers presented a set of speech and speech technology areas with suggested use cases, together with group assignments for all participants, in order to allow the participants to muse over these in preparation for the second day.

The second day was dedicated to discovering and exploring new directions for speech and speech synthesis evaluation. This day was structured around four working group sessions interspersed with plenary flash-presentation sessions. The goal of these plenary sessions was not only to fuel the discussion of the following sessions, but also to inform all participants of the reflections of each working group. The first three sessions were centred around high-level use cases known to be affected by the recent evolution in speech synthesis.

The organizers defined the groups by taking into account the backgrounds and interests of the participants as communicated in their personal statements. This strategy proved effective, as no participant requested to change groups. While the first two sessions aimed at developing the initial use cases and what falls under each use case, the third session focused on exploring potential methodologies. Note that the use cases here served primarily as focal points for discussion, designed to capture specific TTS and speech characteristics and requirements as well as to cover specific types of applications. Thus the goal was not to create fully fledged, ready-to-use protocols, but to explore what constraints and requirements future methodologies will have to meet.

Informed by these meetings, the organizers then defined a set of five methodology umbrellas which emerged from the different use cases, and the last session was dedicated to exploring these umbrellas. For this session, the participants were free to choose which group to join. While this led to an uneven distribution, it also provided space for the participants to focus on some of their more immediate interests.

The last session of the day was a plenary general discussion of the day's events, aiming to allow participants to bring up points of criticism and complementary information.

The last day was dedicated to cleaning up, collecting, and summarizing the information produced on the previous days, with a clear aim at future work and practical ways to continue the work discussed in the seminar. The result was a less intensive day, and the organisers also decided to give the participants more freedom to determine the concrete activities they wanted to participate in. Working groups formed spontaneously to establish and engage in short-term solutions to address the current flaws in speech evaluation.

The immediate outcome of the seminar is the establishment of a core network of renowned researchers from various disciplines dedicated to speech and speech synthesis evaluation. Due to its multidisciplinarity and breadth, this community can reach a range of different research communities.

A major achievement in the wake of the seminar is a set of new guidelines for reviewing TTS papers with respect to evaluation. Members of the network are part of the organising committees of Interspeech (the reference conference on speech science and speech technology) and the Speech Synthesis Workshop (SSW), and have promoted (for Interspeech) or enforced (for SSW) the use of these guidelines. In addition, a discussion on edits and amendments to the ITU reports on TTS is currently under way with the ITU.

Several papers are also under way as direct results of the seminar, including a position paper for SSW about the new directions in speech synthesis evaluation and a more substantial survey article for the journal Computer Speech & Language (CSL).

To increase awareness, we are currently exploring other avenues of dissemination, such as tutorials to be presented at Interspeech 2026, as well as a recurring workshop series. We will also propose a special issue of the journal CSL dedicated to speech and speech synthesis evaluation.

In the longer term, the goal reaches far beyond documenting the current state of speech and speech synthesis evaluation, towards a dynamic process that avoids renewed fossilisation. We believe we have achieved this on three fronts. First, the seminar ensured the transmission of information between generations of researchers. This is necessary to keep the field cohesive. Second, the seminar brought together researchers from both academia and industry. This is critical to ensure that balance is maintained between the different interests. Finally, the seminar not only provided the space to identify new directions, but also established a core set of renowned researchers whose duty is now to drive this change in their respective communities, armed with the different resources provided by the activities of the seminar.

Copyright Jens Edlund, Sébastien Le Maguer, Christina Tånnander, and Petra Wagner

Motivation

Call to Action
This Dagstuhl Seminar, "Task and Situation-Aware Evaluation of Speech and Speech Synthesis", is an opportunity to help redefine the metrics and methods traditionally used to evaluate speech synthesis and human speech as they are used across disciplines, tasks, and applications. The seminar is designed as a collaborative platform where experts from engineering, the humanities, social sciences, and more come together to challenge the status quo and drive innovation. Through the combination of perspectives and the bridging of gaps between scientific disciplines, we hope to uncover and develop evaluation techniques that are not only scientifically rigorous but also contextually relevant to the diverse uses of speech technology today. We seek the collective expertise of participants from diverse areas of science and technology – ranging from phonetics to machine learning, from rhetorical analysis to practical applications in assistive technologies. This interdisciplinary approach is essential for crafting evaluation standards that are as dynamic and nuanced as the technologies and applications they aim to assess. To this end, we encourage you to share, debate, and refine ideas, methods, and tools that can contribute to a transformative discussion on the evaluation of speech synthesis.

Context and Objectives
While the de facto standard TTS evaluation metrics, such as the Mean Opinion Score (MOS), have been criticised for decades, they are currently under a barrage of publications pointing to a variety of flaws. More importantly, the recent disruptive progress in TTS techniques has rendered traditional evaluation targets such as naturalness (a hard-to-define blend of signal quality and human-likeness) and intelligibility all but obsolete. Indeed, one of the (positive) reviewers of this seminar proposal suggested asking the fundamental question "Is evaluation for speech synthesis still needed?" We believe it is, but we also note that the current methods deliver problematic results, such as ceiling effects and the conclusion that synthetic voices are more humanlike than human voices.
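To make the ceiling-effect problem concrete, the following is a minimal sketch (not part of the seminar material; the ratings are invented toy data) of how a MOS and an approximate 95% confidence interval are commonly computed from listener ratings on the 5-point absolute category rating scale:

    # Sketch only: MOS with an approximate 95% confidence interval.
    # The rating lists below are hypothetical toy data, not seminar results.
    import math
    import statistics

    def mos_with_ci(ratings, z=1.96):
        """Return (MOS, half-width of the ~95% confidence interval)."""
        mean = statistics.mean(ratings)
        # Standard error of the mean; z = 1.96 gives a ~95% interval.
        sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
        return mean, z * sem

    human_speech = [5, 5, 4, 5, 5, 4, 5, 5, 4, 5]   # toy ratings, human recordings
    neural_tts   = [5, 4, 5, 5, 5, 5, 4, 5, 5, 5]   # toy ratings, neural TTS

    for name, ratings in [("human", human_speech), ("neural TTS", neural_tts)]:
        mos, ci = mos_with_ci(ratings)
        print(f"{name}: MOS = {mos:.2f} ± {ci:.2f}")

When both systems saturate near the top of the scale with overlapping intervals, as in this toy example, the metric can no longer separate them, which is the ceiling effect described above.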

Our concern here is that the conventional methods fall short when it comes to addressing the contextual nuances and the specific application needs of modern synthesized speech. If we change the question from "Rank the naturalness of this voice" to "Rank this voice as if it were the voice of a professional human performing the same task", the artificial voice will rank considerably lower for a great many tasks. There is a pressing need for more sophisticated approaches that take contextual and situational framing into account and incorporate the complexity and diversity of current and future speech synthesis applications. This may also involve a more nuanced treatment of the participants in evaluations, as the assumption of normally distributed participant characteristics in all populations is unlikely to hold.
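As a small, hypothetical illustration of the related normality issue (again with invented toy data, not a method prescribed by the seminar), one can check whether rating data are plausibly normal with a Shapiro-Wilk test before applying statistics that assume normality; discrete, ceiling-bound ratings typically fail such a test:

    # Sketch only: testing the normality assumption on toy rating data.
    from scipy import stats

    ratings = [5, 5, 4, 5, 5, 5, 4, 5, 3, 5, 5, 4, 5, 5, 5]  # hypothetical

    statistic, p_value = stats.shapiro(ratings)
    print(f"Shapiro-Wilk W = {statistic:.3f}, p = {p_value:.4f}")
    if p_value < 0.05:
        # Discrete, skewed, ceiling-bound ratings usually fail this test,
        # so rank-based or ordinal analyses are safer than t-tests on means.
        print("Normality rejected: prefer non-parametric analyses.")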

Agenda Overview
The seminar progresses through a series of sessions of different natures, all of which are partially prepared in advance by organisers and participants alike. After an initial "Existing Evaluation Methods Review", in which specifications, validations, and known issues with existing methods are discussed as a benchmark for new methodologies, we delve into "Use Case Workshops" focused on identifying and detailing specific use cases for speech synthesis and human speech alike, in order to better understand the varied requirements of different applications. Next, we engage in "Hands-On Method Development" in group sessions to propose, develop, and refine evaluation methods for selected use cases. These practical sessions transition from theory to action, allowing participants to experiment with and iteratively improve evaluation approaches. Finally, the results are discussed and collected in "Guideline Formulation", where we work towards formulating guidelines for selecting and implementing evaluation methods. This session will focus on creating a decision framework that assists researchers and practitioners in choosing appropriate evaluation metrics based on specific criteria.

Copyright Jens Edlund, Sébastien Le Maguer, Christina Tånnander, and Petra Wagner

Participants


  • Elisabeth André (Universität Augsburg, DE) [dblp]
  • Gérard Bailly (University Grenoble Alpes, FR)
  • Erica Cooper (NICT - Kyoto, JP) [dblp]
  • Benjamin Cowan (University College - Dublin, IE) [dblp]
  • Jens Edlund (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Naomi Harte (Trinity College Dublin, IE) [dblp]
  • Simon King (University of Edinburgh, GB) [dblp]
  • Esther Klabbers (Beaverton, US) [dblp]
  • Sébastien Le Maguer (University of Helsinki, FI) [dblp]
  • Zofia Malisz (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Bernd Möbius (Universität des Saarlandes, DE) [dblp]
  • Sebastian Möller (TU Berlin, DE & DFKI Berlin, DE) [dblp]
  • Roger K. Moore (University of Sheffield, GB) [dblp]
  • Ayushi Pandey (Trinity College Dublin, IE) [dblp]
  • Olivier Perrotin (University Grenoble Alpes, FR) [dblp]
  • Fritz Michael Seebauer (Universität Bielefeld, DE) [dblp]
  • Sofia Strömbergsson (Karolinska Institute - Stockholm, SE) [dblp]
  • Christina Tånnander (Swedish Agency for Accessible Media - Malmö, SE) [dblp]
  • David R. Traum (USC - Playa Vista, US) [dblp]
  • Petra Wagner (Universität Bielefeld, DE) [dblp]
  • Junichi Yamagishi (National Institute of Informatics - Tokyo, JP) [dblp]
  • Yusuke Yasuda (Nagoya University, JP)

Classification
  • Artificial Intelligence
  • Computation and Language
  • Human-Computer Interaction

Keywords
  • Evaluation
  • Human-in-the-Loop
  • Speech technology
  • Text-to-Speech Synthesis