November 3–8, 2013, Dagstuhl Seminar 13451
Computational Audio Analysis
- Können Computer Gefühlszustände erkennen? ("Can computers recognize emotional states?")
Article about seminar 13451, published in the Saarbrücker Zeitung on October 29, 2013 (in German).
- Zwischentöne für Computer ("Nuances for computers")
Ralf Krauter's interview with Dr. Björn Schuller (in German). The interview aired on November 6, 2013 in "Forschung Aktuell," a program of the German public radio station Deutschlandfunk.
- Schloss Dagstuhl: Können Computersysteme Emotionen erkennen? ("Schloss Dagstuhl: Can computer systems recognize emotions?")
Press release (in German)
With the rapid growth and omnipresence of digitized multimedia data, the processing, analysis, and understanding of such data by means of automated methods has become a central issue in computer science and associated areas of research. In the acoustic domain, audio analysis has traditionally focused on speech data, with the goal of recognizing and transcribing the spoken words. In this seminar, we considered current and future audio analysis tasks that go beyond the classical speech recognition scenario. For example, we looked at the computational analysis of speech with regard to the speakers' traits (e.g., gender, age, height, cultural and social background), physical conditions (e.g., sleepiness, alcohol intoxication, health state), or emotion-related and affective states (e.g., stress, interest, confidence, frustration). So, rather than recognizing what is being said, the goal is to find out how and by whom it is being said. Besides speech, there is a rich variety of sounds such as music recordings, animal sounds, environmental sounds, and combinations thereof. Just as for the speech domain, we discussed how to decompose and classify the content of complex sound mixtures with the objective of inferring semantically meaningful information.
When dealing with specific audio domains such as speech or music, it is crucial to properly understand and apply the appropriate domain-specific properties, be they acoustic, linguistic, or musical. Furthermore, data-driven learning techniques that exploit the availability of carefully annotated audio material have successfully been used for recognition and classification tasks. In this seminar, we discussed issues that arise when dealing with rather vague categories, as in emotion recognition, or when considering general audio sources such as environmental sounds. In such scenarios, model assumptions are often violated, or it becomes impossible to define explicit representations or models. Furthermore, for non-standard audio material, annotated datasets are hardly available. Also, the data-driven methods used in speech recognition are often not directly applicable in this context; instead, semi-supervised or unsupervised learning techniques can be a promising approach to remedy these issues. Another central topic of this seminar was the problem of source separation. In the real world, acoustic data is very complex, typically consisting of a superposition of overlapping speech, music, and general sound sources. Therefore, efficient source separation techniques are required that allow for splitting up, re-synthesizing, analyzing, and classifying the individual sources, a problem that, for general audio signals, is not yet well understood.
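To make the superposition model behind source separation concrete, the following Python sketch may help. It is purely illustrative and is not a method discussed at the seminar: it assumes two synthetic sinusoidal "sources" that occupy disjoint frequency bands, so that a simple frequency mask recovers each one. The signal length, the frequencies, the cutoff, and all function names are assumptions chosen for the example.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for very short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT; returns the real part of the reconstruction."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

N = 64  # number of samples (assumed toy length)

# Two hypothetical sources with disjoint spectra: 3 and 12 cycles per window.
low = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]
high = [0.5 * math.sin(2 * math.pi * 12 * t / N) for t in range(N)]

# The observed mixture is the sample-wise superposition of the sources.
mixture = [a + b for a, b in zip(low, high)]

spectrum = dft(mixture)
CUTOFF = 8  # assumed boundary (in DFT bins) between the two sources

def is_low_bin(k):
    # Bin k and its mirrored bin N - k represent the same frequency.
    return min(k, N - k) < CUTOFF

# Binary frequency masks split the mixture spectrum into two estimates.
low_est = idft([c if is_low_bin(k) else 0 for k, c in enumerate(spectrum)])
high_est = idft([0 if is_low_bin(k) else c for k, c in enumerate(spectrum)])

max_err_low = max(abs(a - b) for a, b in zip(low, low_est))
max_err_high = max(abs(a - b) for a, b in zip(high, high_est))
print(max_err_low, max_err_high)
```

The masks work here only because the two sources do not overlap in frequency. Real mixtures of speech, music, and environmental sounds overlap in both time and frequency, which is one reason the general problem remains difficult.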
In this executive summary, we give a short overview of the main topics addressed in this seminar. We start by briefly describing the background of the participants and the overall organization. We then give an overview of the presentations of the participants and the results obtained from the different working groups. Finally, we reflect on the most important aspects of this seminar and conclude with future implications.
Participants, Interaction, Activities
In our seminar, we had 41 participants, who came from various countries around the world including North America (10 participants), Japan (1 participant), and Europe (Austria, Belgium, Finland, France, Germany, Greece, Italy, Netherlands, Spain, United Kingdom). Most of the participants came to Dagstuhl for the first time and expressed enthusiasm about the open and retreat-like atmosphere. Besides its international character, the seminar was also highly interdisciplinary. While most of the participating researchers work in the fields of signal processing and machine learning, we also had participants with a background in cognition, human-computer interaction, music, linguistics, and other fields. This mix made the seminar special, leading to many cross-disciplinary intersections and thought-provoking discussions as well as numerous social activities, including making music together.
Overall Organization and Schedule
Dagstuhl seminars are known for their high degree of flexibility and interactivity, which allows participants to discuss ideas and raise questions rather than present research results. Following this tradition, we fixed the schedule during the seminar, asking for spontaneous contributions with future-oriented content and thus avoiding a conference-like atmosphere, where the focus is on past research achievements. The first two days were used to let people introduce themselves, present scientific problems they are particularly interested in, and express their expectations and wishes for the seminar. In addition, we had six initial stimulus talks, in which specific participants were asked to address some burning questions on speech, music, and sound processing from a more meta point of view (see also Section Stimulus Talks). Rather than being usual presentations, most of these stimulus talks seamlessly moved towards an open discussion in the plenum. Based on this input, the second day concluded with a brainstorming session, where we identified central topics covering the participants' interests and discussed the schedule and format of the subsequent days. To discuss these topics, we split up into five groups, each group discussing one of the topics in greater depth in parallel sessions on Wednesday morning. The results and conclusions of these group meetings were then presented to the plenum on Thursday morning, which resulted in lively discussions. Continuing the previous activities, further parallel group meetings were held on Thursday afternoon, the results of which were presented on Friday morning. Finally, we asked each participant to give a short written statement of what he or she understands by the seminar's overall topic "Computational Audio Analysis," and we had a very entertaining and stimulating session going through and discussing all these statements one by one. The result of this session can be found in Section Definition CAA.
In summary, having a mixture of different presentation styles and group meetings gave all participants the opportunity to present and discuss their ideas, while avoiding a monotonous conference-like atmosphere.
We discussed various topics that addressed the challenges that arise when dealing with mixtures of general and non-standard acoustic data. A particular focus was put on data representations and analysis techniques including audio signal processing, machine learning, and probabilistic models. After a joint brainstorming session, we agreed on discussing five central topics that fit the overall theme of the seminar and reflected the participants' interests. We now give a brief summary of these topics, which were addressed in the parallel group meetings and the resulting panel discussions. A more detailed summary of the outcome of the group sessions can be found in Section Group Sessions.
- The "Small Data" group looked at audio analysis and classification scenarios where only a few labeled examples or small amounts of (training) data are available. In such scenarios, machine learning techniques that depend on large amounts of (training) data ("Big Data") are not applicable. Various strategies including model-based as well as semi-supervised and unsupervised approaches were discussed.
- The "Source Separation" group addressed the task of decomposing a given sound mixture into elementary sources, which is not only a fundamental problem in audio processing, but also constitutes an intellectual and interdisciplinary challenge. Besides questioning the way the source separation problem is often posed, the group discussed the need for concrete application scenarios as well as the objective of suitable evaluation metrics.
- The "Interaction and Affect" group discussed the question of how to generate and interpret signals that express interactions between different agents. One main conclusion was that one requires more flexible models that better adapt to the temporal and situational context as well as to the agents' roles, behaviors, and traits.
- The "Knowledge Representation" group addressed the issue of how knowledge can be used to define and derive sound units that can be used as elementary building blocks for a wide range of applications. Based on deep neural network techniques, the group discussed how database information and other metadata can be better exploited and integrated using feed-forward as well as recurrent architectures.
- The "Unsupervised Learning" group looked at the problem of how to learn the structure of data without reference to external objectives. Besides issues of learning meaningful elementary units, the need to consider hierarchies of abstractions and multi-layer characterizations was discussed.
Besides an extensive discussion of these five main topics, we had many further contributions and smaller discussions on issues that concern natural human-machine communication, human-centered audio processing, computational paralinguistics, sound processing in everyday environments, acoustic monitoring, informed source separation, and audio structure analysis.
In our seminar, we addressed central issues of how to process audio material of various types and degrees of complexity. In view of the richness and multitude of acoustic data, one requires representations and machine learning techniques that allow for capturing and coupling various sources of information. Therefore, unsupervised and semi-supervised learning procedures are needed in scenarios where only very few examples and poor training resources are available. Also, source separation techniques are needed that yield meaningful audio decomposition results even with only limited knowledge of the type of audio. Another central issue of this seminar was how to bring the human into the audio processing pipeline. On the one hand, we discussed how we can learn from the way humans process and perceive sounds. On the other hand, we addressed the issue of extracting human-related parameters such as affective and paralinguistic information from sound sources. These discussions showed that understanding and processing complex sound mixtures using computational tools poses many challenging research problems yet to be solved.
The Dagstuhl seminar gave us the opportunity to discuss such issues in an inspiring and retreat-like atmosphere. The generation of novel, technically oriented scientific contributions was not the focus of the seminar. Naturally, many of the contributions and discussions were on a rather abstract level, laying the foundations for future projects and collaborations. Thus, the main impact of the seminar is likely to take place in the medium to long term. Some more immediate results, such as plans to share research data and software, also arose from the discussions. As measurable outputs from the seminar, we expect to see several joint papers and applications for funding. Besides the scientific aspect, the social aspect of our seminar was just as important. We had an interdisciplinary, international, and very interactive group of researchers, consisting of leaders and future leaders in our field. Most of our participants visited Dagstuhl for the first time and enthusiastically praised the open and inspiring atmosphere. The group dynamics were excellent with many personal exchanges and common activities. Some scientists mentioned their appreciation of having the opportunity for prolonged discussions with researchers from neighboring research fields, something which is often impossible during conference-like events.
In conclusion, our expectations of the seminar were not only met but exceeded, in particular with respect to networking and community building. Last but not least, we heartily thank the Dagstuhl board for allowing us to organize this seminar, the Dagstuhl office for their great support in the organization process, and the entire Dagstuhl staff for their excellent services during the seminar.
Creative Commons BY 3.0 Unported license
Meinard Müller, Shrikanth S. Narayanan, and Björn Schuller
- Society / Human-computer Interaction
- Audio Analysis
- Signal Processing
- Machine Learning
- Affective Computing