Computational Audio Analysis
(03. Nov – 08. Nov, 2013)
- Meinard Müller (Universität Erlangen-Nürnberg, DE)
- Shrikanth S. Narayanan (University of Southern California, US)
- Björn Schuller (TU München, DE)
- Andreas Dolzmann (for scientific matters)
- Susanne Bach-Bernhard (for administrative matters)
With the rapid growth and omnipresence of digitized multimedia data, the processing, analysis, and understanding of such data by means of automated methods has become a central issue in computer science and associated areas of research. In the acoustic domain, audio analysis has traditionally focused on speech data, with the goal of recognizing and transcribing the spoken words. In the proposed seminar, we want to consider current and future audio analysis tasks that go beyond the classical speech recognition scenario. On the one hand, we want to look at the computational analysis of speech with regard to the speakers' traits (e.g., gender, age, height, cultural and social background), physical conditions (e.g., sleepiness, alcohol intoxication, health state), or emotion-related and affective states (e.g., stress, interest, confidence, frustration). Rather than recognizing what is being said, the goal is to find out how and by whom it is being said. On the other hand, there is a rich variety of sounds besides speech, such as music recordings, animal sounds, environmental sounds, and mixtures thereof. Here, similarly to the speech domain, we want to study how to decompose and classify the content of complex sound mixtures with the objective of inferring semantically meaningful information.
When dealing with specific audio domains such as speech or music, it is crucial to properly understand and exploit the appropriate domain-specific properties, be they acoustic, linguistic, or musical. Furthermore, data-driven learning techniques that exploit the availability of carefully annotated audio material have been used successfully for recognition and classification tasks. However, when dealing with rather vague categories as in emotion recognition, or when considering general audio sources such as environmental sounds, model assumptions are often violated, or it may even be impossible to define explicit models. Furthermore, for non-standard audio material, annotated datasets are scarcely available. In this context, data-driven methods as used in speech recognition are therefore not directly applicable; instead, semi-supervised or unsupervised learning techniques are a promising way to remedy these issues. Finally, audio sources as they occur in the real world are typically superimposed and highly blended, potentially consisting of overlapping speech, music, and general sound sources. Thus, efficient source separation techniques are required that allow for splitting up, re-synthesizing, analyzing, and classifying the individual sources, a problem that, for general audio signals, is not yet well understood.
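To make the source separation idea concrete, one widely used family of techniques factorizes a nonnegative time-frequency representation into spectral templates and their activations (nonnegative matrix factorization, NMF). The following is a minimal sketch, not tied to any particular method discussed at the seminar; the toy "spectrogram", the templates, and all parameter values are invented for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (frequency x time) as V ~ W @ H
    using Lee-Seung multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps  # columns: spectral templates
    H = rng.random((rank, T)) + eps  # rows: time-varying activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "mixture": two invented spectral templates, active at different times.
s1 = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.2])[:, None]
s2 = np.array([0.0, 0.0, 0.3, 1.0, 0.4, 0.0])[:, None]
a1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0])[None, :]
a2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0])[None, :]
V = s1 @ a1 + s2 @ a2  # 6 x 8 nonnegative "spectrogram"

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each column of W then acts as a learned spectral template and each row of H as its activation over time, from which separated source estimates can be re-synthesized. Real mixtures are far less clean than this exact rank-2 toy, which is precisely why the problem remains open for general audio.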
The aim of this seminar is to gather researchers who are experts in different audio-related multimedia domains and who cover a broad spectrum of data analysis techniques (including audio signal processing, machine learning, and probabilistic models) and tasks (including annotation, alignment, segmentation, classification, and searching). In doing so, we hope to break new ground for analyzing, processing, experiencing, and understanding multi-faceted and highly blended real-world audio sources from a semantic perspective. Furthermore, by bringing together experts from different disciplines, young researchers, and people from industry, we expect this seminar to generate strong interest and to stimulate vibrant discussions, while highlighting opportunities for new collaborations among the attendees.
General questions and issues that will be addressed in this seminar include:
- Efficient blind source separation with limited knowledge of the type of audio
- Unsupervised learning for handling general audio
- Machine learning techniques that allow for coupling various sources of information
- Development of probabilistic frameworks for multi-faceted analysis
- Detection and exploitation of mutual dependencies between different aspects
- Handling partial, uncertain, and scattered information
- Discovering long-term temporal structures
- Automatic annotation of noisy raw data
- Designing and learning robust and expressive mid-level audio features
- Matching semantically similar data
- Extracting emotion-related parameters from speech, music, and general audio
- Können Computer Gefühlszustände erkennen? ("Can computers recognize emotional states?")
Article about seminar 13451, published in the Saarbrücker Zeitung on October 29, 2013 (in German).
- Zwischentöne für Computer ("Nuances for computers")
Ralf Krauter's interview in German with Dr. Björn Schuller. The interview aired on November 6, 2013 in "Forschung Aktuell," a program of the German public radio station Deutschlandfunk.
- Schloss Dagstuhl: Können Computersysteme Emotionen erkennen? ("Can computer systems recognize emotions?")
Press release (in German)
With the rapid growth and omnipresence of digitized multimedia data, the processing, analysis, and understanding of such data by means of automated methods has become a central issue in computer science and associated areas of research. In the acoustic domain, audio analysis has traditionally focused on speech data, with the goal of recognizing and transcribing the spoken words. In this seminar, we considered current and future audio analysis tasks that go beyond the classical speech recognition scenario. For example, we looked at the computational analysis of speech with regard to the speakers' traits (e.g., gender, age, height, cultural and social background), physical conditions (e.g., sleepiness, alcohol intoxication, health state), or emotion-related and affective states (e.g., stress, interest, confidence, frustration). Rather than recognizing what is being said, the goal is to find out how and by whom it is being said. Besides speech, there is a rich variety of sounds such as music recordings, animal sounds, environmental sounds, and combinations thereof. Just as for the speech domain, we discussed how to decompose and classify the content of complex sound mixtures with the objective of inferring semantically meaningful information.
When dealing with specific audio domains such as speech or music, it is crucial to properly understand and exploit the appropriate domain-specific properties, be they acoustic, linguistic, or musical. Furthermore, data-driven learning techniques that exploit the availability of carefully annotated audio material have been used successfully for recognition and classification tasks. In this seminar, we discussed issues that arise when dealing with rather vague categories as in emotion recognition or when considering general audio sources such as environmental sounds. In such scenarios, model assumptions are often violated, or it becomes impossible to define explicit representations or models. Furthermore, for non-standard audio material, annotated datasets are scarcely available. Also, data-driven methods as used in speech recognition are often not directly applicable in this context; instead, semi-supervised or unsupervised learning techniques are a promising way to remedy these issues. Another central topic of this seminar was the problem of source separation. In the real world, acoustic data is very complex, typically consisting of a superposition of overlapping speech, music, and general sound sources. Therefore, efficient source separation techniques are required that allow for splitting up, re-synthesizing, analyzing, and classifying the individual sources, a problem that, for general audio signals, is not yet well understood.
In this executive summary, we give a short overview of the main topics addressed in this seminar. We start by briefly describing the background of the participants and the overall organization. We then give an overview of the presentations of the participants and the results obtained from the different working groups. Finally, we reflect on the most important aspects of this seminar and conclude with future implications.
Participants, Interaction, Activities
In our seminar, we had 41 participants, who came from various countries around the world, including North America (10 participants), Japan (1 participant), and Europe (Austria, Belgium, Finland, France, Germany, Greece, Italy, Netherlands, Spain, United Kingdom). Most of the participants came to Dagstuhl for the first time and expressed enthusiasm about the open and retreat-like atmosphere. Besides its international character, the seminar was also highly interdisciplinary. While most of the participating researchers work in the fields of signal processing and machine learning, we also had participants with a background in cognition, human-computer interaction, music, linguistics, and other fields. This gave the seminar many cross-disciplinary intersections and thought-provoking discussions, as well as numerous social activities, including communal music making.
Overall Organization and Schedule
Dagstuhl seminars are known for their high degree of flexibility and interactivity, which allows participants to discuss ideas and to raise questions rather than to present research results. Following this tradition, we fixed the schedule during the seminar, asking for spontaneous contributions with future-oriented content, thus avoiding a conference-like atmosphere in which the focus lies on past research achievements. The first two days were used to let people introduce themselves, present scientific problems they are particularly interested in, and express their expectations and wishes for the seminar. In addition, we had six initial stimulus talks, in which selected participants were asked to address some burning questions on speech, music, and sound processing from a more meta point of view (see Section Stimulus Talks). Rather than being conventional presentations, most of these stimulus talks seamlessly turned into open discussions with the plenum. Based on this input, the second day concluded with a brainstorming session in which we identified central topics covering the participants' interests and discussed the schedule and format of the subsequent days. To discuss these topics, we split up into five groups, each discussing one of the topics in greater depth in parallel sessions on Wednesday morning. The results and conclusions of these group meetings were then presented to the plenum on Thursday morning, which led to vivid discussions. Continuing the previous activities, further parallel group meetings were held on Thursday afternoon, the results of which were presented on Friday morning. Finally, we asked each participant to give a short written statement of what he or she understands by the seminar's overall topic "Computational Audio Analysis," and going through and discussing all these statements one by one made for a very entertaining and stimulating session. The results of this session can be found in Section Definition CAA.
In summary, having a mixture of different presentation styles and group meetings gave all participants the opportunity for presenting and discussing their ideas, while avoiding a monotonous conference-like atmosphere.
We discussed various topics that address the challenges of dealing with mixtures of general and non-standard acoustic data. A particular focus was put on data representations and analysis techniques, including audio signal processing, machine learning, and probabilistic models. After a joint brainstorming session, we agreed on five central topics that fit into the overall theme of the seminar and reflected the participants' interests. We now give a brief summary of these topics, which were addressed in the parallel group meetings and the resulting panel discussions. A more detailed summary of the outcome of the group sessions can be found in Section Group Sessions.
- The "Small Data" group looked at audio analysis and classification scenarios where only a few labeled examples or small amounts of training data are available. In such scenarios, machine learning techniques that depend on large amounts of training data ("Big Data") are not applicable. Various strategies, including model-based as well as semi-supervised and unsupervised approaches, were discussed.
- The "Source Separation" group addressed the task of decomposing a given sound mixture into elementary sources, which is not only a fundamental problem in audio processing but also constitutes an intellectual and interdisciplinary challenge. Besides questioning the way the source separation problem is often posed, the group discussed the need for concrete application scenarios as well as for suitable evaluation metrics.
- The "Interaction and Affect" group discussed the question of how to generate and interpret signals that express interactions between different agents. One main conclusion was that more flexible models are required that better adapt to the temporal and situational context as well as to the agents' roles, behaviors, and traits.
- The "Knowledge Representation" group addressed the issue of how knowledge can be used to define and derive sound units that serve as elementary building blocks for a wide range of applications. Based on deep neural network techniques, the group discussed how database information and other metadata can be better exploited and integrated using feed-forward as well as recurrent architectures.
- The "Unsupervised Learning" group looked at the problem of how to learn the structure of data without reference to external objectives. Besides issues of learning meaningful elementary units, the need to consider hierarchies of abstractions and multi-layer characterizations was discussed.
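To make the last point concrete: a minimal example of learning structure without reference to external labels is clustering frame-level feature vectors, for instance with k-means. The sketch below uses plain numpy; the "feature vectors", their dimensionality, and the farthest-point initialization are all chosen for illustration only, not taken from any seminar contribution.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Basic k-means: group feature vectors into k clusters without labels."""
    # Farthest-point initialization: start with the first vector, then
    # repeatedly add the vector farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated groups of synthetic 3-dimensional "frame features".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 3)),
               rng.normal(1.0, 0.1, (20, 3))])
labels, centers = kmeans(X, k=2)
```

The discovered cluster indices have no predefined meaning; interpreting them as elementary sound units, and stacking such groupings into hierarchies of abstraction, is exactly the kind of open question the group discussed.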
Besides the extensive discussion of these five main topics, we had many further contributions and smaller discussions on issues concerning natural human-machine communication, human-centered audio processing, computational paralinguistics, sound processing in everyday environments, acoustic monitoring, informed source separation, and audio structure analysis.
In our seminar, we addressed central issues of how to process audio material of various types and degrees of complexity. In view of the richness and multitude of acoustic data, one requires representations and machine learning techniques that allow for capturing and coupling various sources of information. In particular, unsupervised and semi-supervised learning procedures are needed in scenarios where only very few examples and poor training resources are available. Also, source separation techniques are needed that yield meaningful audio decomposition results even with only limited knowledge of the type of audio. Another central issue of this seminar was how to bring the human into the audio processing pipeline. On the one hand, we discussed what we can learn from the way humans process and perceive sounds. On the other hand, we addressed the issue of extracting human-related parameters such as affective and paralinguistic information from sound sources. These discussions showed that understanding and processing complex sound mixtures using computational tools poses many challenging research problems yet to be solved.
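As one small illustration of the low-level processing that paralinguistic analysis typically builds on, the sketch below computes two classic frame-wise descriptors, short-time energy and zero-crossing rate, with numpy. The toy signal, sampling rate, and frame parameters are invented for illustration; real paralinguistic systems use much richer feature sets.

```python
import numpy as np

def frame_features(x, frame_len=256, hop=128):
    """Compute two classic low-level descriptors per frame:
    short-time energy and zero-crossing rate (ZCR)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = float(np.mean(frame ** 2))                       # loudness proxy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # noisiness proxy
        feats.append((energy, zcr))
    return np.array(feats)

# Toy signal: a quiet low-frequency tone followed by a loud noisy segment.
sr = 8000
t = np.arange(sr) / sr
quiet = 0.1 * np.sin(2 * np.pi * 100 * t)
rng = np.random.default_rng(0)
loud = 0.8 * rng.standard_normal(sr)
feats = frame_features(np.concatenate([quiet, loud]))
```

In this toy example, the noisy segment shows both higher energy and a higher zero-crossing rate than the tonal one; such simple trajectories over time are the raw material from which higher-level affective or speaker-related cues are inferred.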
The Dagstuhl seminar gave us the opportunity to discuss such issues in an inspiring and retreat-like atmosphere. The generation of novel, technically oriented scientific contributions was not the focus of the seminar. Naturally, many of the contributions and discussions were on a rather abstract level, laying the foundations for future projects and collaborations. Thus, the main impact of the seminar is likely to take place in the medium to long term. Some more immediate results, such as plans to share research data and software, also arose from the discussions. As measurable outputs from the seminar, we expect to see several joint papers and applications for funding. Besides the scientific aspect, the social aspect of our seminar was just as important. We had an interdisciplinary, international, and very interactive group of researchers, consisting of leaders and future leaders in our field. Most of our participants visited Dagstuhl for the first time and enthusiastically praised the open and inspiring atmosphere. The group dynamics were excellent, with many personal exchanges and common activities. Some scientists mentioned their appreciation of having the opportunity for prolonged discussions with researchers from neighboring research fields, something that is often impossible during conference-like events.
In conclusion, our expectations of the seminar were not only met but exceeded, in particular with respect to networking and community building. Last but not least, we heartily thank the Dagstuhl board for allowing us to organize this seminar, the Dagstuhl office for their great support in the organization process, and the entire Dagstuhl staff for their excellent services during the seminar.
- Xavier Anguera (Telefónica Research - Barcelona, ES) [dblp]
- Jon Barker (University of Sheffield, GB) [dblp]
- Stephan Baumann (DFKI - Kaiserslautern, DE) [dblp]
- Murtaza Bulut (Philips Research Lab. - Eindhoven, NL) [dblp]
- Carlos Busso (The University of Texas at Dallas, US) [dblp]
- Nick Campbell (Trinity College Dublin, IE) [dblp]
- Laurence Devillers (LIMSI - Orsay, FR) [dblp]
- Jonathan Driedger (Universität Erlangen-Nürnberg, DE) [dblp]
- Bernd Edler (Universität Erlangen-Nürnberg, DE) [dblp]
- Anna Esposito (Intern. Institute for Advanced Scientific Studies, IT) [dblp]
- Sebastian Ewert (Queen Mary University of London, GB) [dblp]
- Cédric Févotte (JL Lagrange Laboratory - Nice, FR) [dblp]
- Jort Gemmeke (KU Leuven, BE) [dblp]
- Franz Graf (Joanneum Research - Graz, AT) [dblp]
- Martin Heckmann (Honda Research Europe - Offenbach, DE) [dblp]
- Dorothea Kolossa (Ruhr-Universität Bochum, DE) [dblp]
- Gernot Kubin (TU Graz, AT) [dblp]
- Frank Kurth (Fraunhofer FKIE - Wachtberg, DE) [dblp]
- Sungbok Lee (University of Southern California, US) [dblp]
- Florian Metze (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Roger K. Moore (University of Sheffield, GB) [dblp]
- Emily Mower Provost (University of Michigan - Ann Arbor, US) [dblp]
- Meinard Müller (Universität Erlangen-Nürnberg, DE) [dblp]
- Shrikanth S. Narayanan (University of Southern California, US) [dblp]
- Nobutaka Ono (National Institute of Informatics - Tokyo, JP) [dblp]
- Bryan Pardo (Northwestern University - Evanston, US) [dblp]
- Alexandros Potamianos (National Technical University of Athens, GR) [dblp]
- Bhiksha Raj (Carnegie Mellon University, US) [dblp]
- Gaël Richard (Telecom ParisTech, FR) [dblp]
- Mark Sandler (Queen Mary University of London, GB) [dblp]
- Björn Schuller (TU München, DE) [dblp]
- Joan Serrà (IIIA - CSIC - Barcelona, ES) [dblp]
- Rita Singh (Carnegie Mellon University, US) [dblp]
- Paris Smaragdis (University of Illinois - Urbana-Champaign, US) [dblp]
- Stefano Squartini (Polytechnic University of Marche, IT) [dblp]
- Shiva Sundaram (Audyssey Laboratories, US) [dblp]
- Khiet Truong (University of Twente, NL) [dblp]
- Christian Uhle (Fraunhofer IIS - Erlangen, DE) [dblp]
- Emmanuel Vincent (INRIA Lorraine - Nancy, FR) [dblp]
- Tuomas Virtanen (Tampere University of Technology, FI) [dblp]
- society / human-computer interaction
- Audio Analysis
- Signal Processing
- Machine Learning
- Affective Computing