Deep Learning and Knowledge Integration for Music Audio Analysis
(February 20 – 25, 2022)
- Rachel Bittner (Spotify - Paris, FR)
- Meinard Müller (Universität Erlangen-Nürnberg, DE)
- Juhan Nam (KAIST - Daejeon, KR)
- Michael Gerke (for scientific matters)
- Simone Schilke (for administrative matters)
- Week 71: The Dagstuhl Dawgs (folk-rnn v Swedish + Sturm) - Blog entry by Bob L. T. Sturm on Tunes from the Ai Frontiers.
Given the increasing amount of digital music, the development of computational tools that allow users to find, organize, analyze, and interact with music has become central to the research field known as Music Information Retrieval (MIR). As in general multimedia processing, many of the recent advances in MIR have been driven by techniques based on deep learning (DL). For example, DL-based techniques have led to significant improvements for numerous MIR tasks, including music source separation, music transcription, chord recognition, melody estimation, beat tracking, tempo estimation, and lyric alignment. In particular, significant improvements could be achieved for specific music scenarios where sufficient training data is available. A particular strength of DL-based approaches is their ability to extract complex features directly from raw audio data, which can then be used to make predictions based on hidden structures and relations.
However, DL-based approaches also come at a cost, being a data-hungry and computing-intensive technology. The design of suitable network architectures can be cumbersome, and the behavior of DL-based systems is often hard to understand. These general properties of DL-based approaches can also be observed when processing music, which spans an enormous range of forms and styles, not to speak of the many ways music may be generated and represented. While in music analysis and classification problems, one aims at capturing musically relevant aspects related to melody, harmony, rhythm, or instrumentation, data-driven approaches often capture confounding factors that may not directly relate to the target concept. One main advantage of classical model-based engineering approaches is that they result in explainable and explicit models that can be adjusted intuitively. On the downside, such hand-engineered approaches require profound signal processing skills and domain knowledge and may result in highly specialized solutions that cannot be directly transferred to other problems.
In this Dagstuhl Seminar, we will critically review the potential and limitations of recent deep learning techniques using music as a challenging application domain. As one main objective of the seminar, we want to systematically explore how musical knowledge can be integrated into neural network architectures to obtain explainable models that are less vulnerable to data biases and confounding factors. Furthermore, besides explainability and generalization aspects, we will also discuss robustness and efficiency issues in the learning as well as inference stage. To give the seminar cohesion, our main focus will be on music analysis tasks applied to audio representations (rather than symbolic music representations). However, related research problems in neighboring fields such as music generation and audio synthesis may also play a role.
More specific questions and issues that will be addressed in this seminar include, but are not limited to, the following:
- Data mining, collection, and annotation
- Data accessibility and copyright issues
- Preprocessing of music data for deep learning
- Musically informed data augmentation
- Multitask learning
- Transfer learning
- Explainable deep learning models
- Differentiable digital signal processing
- Hierarchical models for short-term/long-term dependencies
- Efficiency and robustness issues
- Musical conditioning of deep learning models
- Musically informed input representations
- Structured output spaces
- Integrating knowledge from music-perception and neuroscience research in deep learning systems
- Human-in-the-loop systems for music processing
This executive summary gives an overview of our discussions on the integration of musical knowledge in deep learning approaches while summarizing the main topics covered in this seminar. We also describe the seminar's group composition, the overall organization, and the seminar's activities. Finally, we reflect on the most important aspects of this seminar and conclude with future implications and acknowledgments.
Music is a ubiquitous and vital part of our lives. Thanks to the proliferation of digital music services, we have access to music nearly anytime and anywhere, and we interact with music in a variety of ways, both as listeners and active participants. As a result, music has become one of the most popular categories of multimedia content. In general terms, music processing research aims to contribute concepts, models, and algorithms that extend our capabilities of accessing, analyzing, understanding, and creating music. In particular, the development of computational tools that allow users to find, organize, analyze, generate, and interact with music has become central to the research field known as Music Information Retrieval (MIR). Given the complexity and diversity of music, research has to account for various aspects such as the genre, instrumentation, musical form, melodic and harmonic properties, dynamics, tempo, rhythm, timbre, and so on.
As in general multimedia processing, many of the recent advances in MIR have been driven by techniques based on deep learning (DL). For example, DL-based techniques have led to significant improvements for numerous MIR tasks including music source separation, music transcription, chord recognition, melody extraction, beat tracking, tempo estimation, and lyrics alignment. In particular, major improvements could be achieved for specific music scenarios where sufficient training data is available. A particular strength of DL-based approaches is their capability to extract complex features directly from raw audio data, which can then be used for making predictions based on hidden structures and relations. Furthermore, powerful software packages allow for easily designing, implementing, and experimenting with machine learning models based on deep neural networks (DNNs).
However, DL-based approaches also come at a cost, being a data-hungry and computing-intensive technology. Furthermore, the design of suitable network architectures (including the adaptation of hyperparameters and optimization strategies) can be cumbersome and time-consuming, a process that is commonly seen more as an art than a science. Finally, the behavior of DL-based systems is often hard to understand; the trained models may capture information that is not directly related to the core problem. These general properties of DL-based approaches can also be observed when analyzing and processing music, which spans an enormous range of forms and styles, not to speak of the many ways music may be generated and represented. While music analysis and classification aim at capturing musically relevant aspects related to melody, harmony, rhythm, or instrumentation, data-driven approaches often capture confounding factors that may not directly relate to the target concept (e.g., recording conditions in music classification or loudness in singing voice detection).
One main advantage of classical knowledge-based engineering approaches is that they result in explainable and explicit models that can be adjusted intuitively. On the downside, such hand-engineered approaches not only require profound signal processing skills as well as domain knowledge, but also may result in highly specialized solutions that cannot be directly transferred to other problems.
As mentioned earlier, one strong advantage of deep learning is its ability to learn, rather than hand-design, features as part of a model. Nowadays, it seems that attaining state-of-the-art solutions via machine learning depends more on the availability of large quantities of data than on the sophistication of the approach itself. In this seminar, we critically questioned this statement in the context of concrete music analysis and processing applications. In particular, we explored existing approaches and new directions for combining recent deep learning approaches with classical model-based strategies by integrating knowledge at various stages in the processing pipeline.
There are various ways to integrate prior knowledge in DL-based MIR systems. First, one may exploit knowledge already at the input level by using data representations to better isolate information known to be relevant to a task and remove information known to be irrelevant (e.g., by performing vocal source separation before transcribing lyrics). Next, one may incorporate musical knowledge via the model architecture in order to force the model to use its capacity to characterize a particular aspect (e.g., limiting receptive fields to prevent a model from “seeing” too much, or introducing constraints that mimic DSP systems). Furthermore, the hidden representations can be conditioned to provide humans with “musically sensible control knobs” of the model (e.g., transforming an embedding space to separate out different musical instruments). Knowledge can also be exploited in the design of the output representation (e.g., structured output spaces for chord recognition that account for bass, root, and chroma) or the loss function used for optimization. During the data generation and training process, one may use musically informed data augmentation techniques to enforce certain invariances (e.g., applying pitch shifting to become invariant to musical modulations). Exploiting musical knowledge by combining deep learning techniques with ideas from classical model-based approaches was a core topic of this seminar.
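The structured output space mentioned above for chord recognition can be illustrated with a minimal sketch. It is not taken from the seminar: the function names, the tiny chord vocabulary, and the sharps-only spelling are assumptions chosen for illustration. The sketch encodes a chord as three coupled targets (root, bass, chroma) and shows how such a target would be transposed when the audio is pitch-shifted for augmentation.

```python
# Pitch-class names; index 0 = C, ..., 11 = B (sharps only, for simplicity).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical, deliberately tiny chord vocabulary: intervals above the root.
CHORD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7), "7": (0, 4, 7, 10)}


def encode_chord(root, quality, bass=None):
    """Encode a chord as three coupled targets: root, bass, and 12-dim chroma."""
    root_idx = PITCH_CLASSES.index(root)
    bass_idx = PITCH_CLASSES.index(bass) if bass is not None else root_idx
    chroma = [0] * 12
    for interval in CHORD_INTERVALS[quality]:
        chroma[(root_idx + interval) % 12] = 1
    chroma[bass_idx] = 1  # the bass note of a slash chord also sounds
    return {"root": root_idx, "bass": bass_idx, "chroma": chroma}


def transpose_target(target, semitones):
    """Transpose an encoded chord, e.g. to follow a pitch-shift augmentation."""
    k = semitones % 12
    chroma = [target["chroma"][(i - k) % 12] for i in range(12)]
    return {
        "root": (target["root"] + k) % 12,
        "bass": (target["bass"] + k) % 12,
        "chroma": chroma,
    }


c_over_e = encode_chord("C", "maj", bass="E")
d_over_f_sharp = transpose_target(c_over_e, 2)
```

With such an encoding, related chords such as C major and C/E share the root and chroma targets and differ only in the bass target, so a model is not asked to treat them as unrelated classes.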
The success of deep learning approaches for learning hidden structures and relations very much depends on the availability of (suitably annotated and structured) data. Therefore, as one fundamental topic, we discussed aspects of generating, collecting, accessing, representing, annotating, preprocessing, and structuring music-related data. These issues are far from trivial. First of all, music offers a wide range of data types and formats, including text, symbolic data, audio, image, and video. For example, music can be represented as printed sheet music (image domain), encoded as MIDI or MusicXML files (symbolic domain), and played back as audio recordings (acoustic domain). Then, depending on the MIR task, one may need to deal with various types of annotations, including lyrics, chords, guitar tabs, tapping (beat, measure) positions, album covers, as well as a variety of user-generated tags and other types of metadata. To algorithmically exploit the wealth of these various types of information, one requires methods for linking semantically related data sources (e.g., songs and lyrics, sheet music and recorded performances, lead sheet and guitar tabs). Temporal alignment approaches are particularly important to obtain labels for automatic music transcription and analysis tasks. As for data accessibility, copyright issues are the main obstacle to distributing and using music collections in academic research. The generation of freely accessible music (including music composition, performance, and production) requires considerable effort, experience, time, and cost.
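As an illustration of the temporal alignment mentioned above, the following is a minimal sketch of dynamic time warping (DTW), one common alignment technique; the summary itself does not name a specific method. The function name and the toy one-dimensional "pitch" features are assumptions for illustration. Real systems typically align multi-dimensional features such as chroma vectors.

```python
def dtw_path(x, y):
    """Align two 1-D feature sequences; return (total cost, warping path)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost of aligning x[:i] with y[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end to recover which frames were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return acc[n][m], path[::-1]


# Toy example: three score notes, performed with the first and last held longer.
score = [60, 62, 64]
performance = [60, 60, 62, 64, 64]
total_cost, path = dtw_path(score, performance)
```

Each pair in `path` links a score position to a performance frame, so annotations attached to the score (such as note labels) can be transferred to the corresponding frames of the recording.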
Besides the quantity of raw music data and its availability, another crucial issue is the input representation used as the front-end of deep neural networks. For example, log-frequency or Mel spectrograms are often used as input representations when dealing with music signals. We discussed recent research efforts where one tries to directly start with the raw waveform-based audio signal rather than relying on hand-engineered audio representations that exploit domain knowledge. In this context, we discussed how one might resolve phase shift issues by using carefully designed neural network architectures. Further recent research directions include the design of network layers to mimic common front-end transforms or incorporate differentiable filter design methods into a neural network pipeline.
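To make the notion of a hand-engineered, musically informed input representation concrete, here is a small sketch under stated assumptions. The helper names are hypothetical, the mel formula is the HTK-style variant (one of several in use), and the pooling is a deliberately crude nearest-pitch assignment rather than a proper filterbank. It maps linear-frequency STFT bins onto a log-frequency axis with one bin per semitone.

```python
import math


def hz_to_midi(freq_hz, a4_hz=440.0):
    """Map a frequency in Hz to a (fractional) MIDI pitch; A4 = 69."""
    return 69.0 + 12.0 * math.log2(freq_hz / a4_hz)


def hz_to_mel(freq_hz):
    """HTK-style mel scale; other mel variants exist."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def pool_to_log_frequency(magnitudes, sample_rate, n_fft, low=24, high=108):
    """Sum linear-frequency STFT magnitudes into one bin per MIDI pitch."""
    pooled = [0.0] * (high - low + 1)
    for k, mag in enumerate(magnitudes):
        if k == 0:
            continue  # skip the DC bin: log2(0) is undefined
        pitch = round(hz_to_midi(k * sample_rate / n_fft))
        if low <= pitch <= high:
            pooled[pitch - low] += mag
    return pooled


# Toy spectrum: a single peak at 440 Hz (bin 55 when bins are 8 Hz apart).
spectrum = [0.0] * 501
spectrum[55] = 1.0
log_freq = pool_to_log_frequency(spectrum, sample_rate=8000, n_fft=1000)
```

On such an axis a transposition becomes a simple shift along the frequency dimension, which is one reason log-frequency representations are attractive as network inputs for music.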
Another central topic we discussed during our seminar was how to exploit musical structures via self-supervised and semi-supervised learning. Instead of relying on large amounts of labeled data, these techniques exploit known variants and invariants of a dataset, using large amounts of unlabeled data. For example, without knowing the transcription of a musical piece, we know how the transcription would change if we shift the whole audio signal by some number of semitones. As another example, we can learn a notion of audio similarity by exploiting the fact that two samples from a single musical audio signal are more similar than two samples drawn from different musical audio signals. We also discussed using multi-modal data to give implicit labels, such as text, image, video, and audio correspondences. On the semi-supervised learning side, representations learned in a self-supervised way can be fine-tuned to a particular task with a small amount of labeled data. In this vein, we discussed model generalization, model adaptability, active learning, few-shot learning, and human-in-the-loop systems.
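The semitone-shift example above can be sketched as a label-free consistency check. Everything here is a toy assumption: the input is a log-frequency frame with one bin per semitone, the "model" is a fixed folding into 12 pitch classes rather than a trained network, and only non-negative upward shifts are handled. The point is that the loss needs no transcription, only the knowledge of how the output must move when the input is shifted.

```python
def shift_bins(frame, k):
    """Pitch-shift a log-frequency frame upward by k >= 0 semitone bins."""
    return [0.0] * k + frame[: len(frame) - k]


def roll(vec, k):
    """Circularly rotate a vector by k positions."""
    n = len(vec)
    return [vec[(i - k) % n] for i in range(n)]


def fold_to_chroma(frame):
    """Toy 'model': sum all octaves into 12 pitch classes."""
    chroma = [0.0] * 12
    for b, value in enumerate(frame):
        chroma[b % 12] += value
    return chroma


def equivariance_loss(model, frame, k):
    """Mean squared gap between model(shifted input) and the shifted model output."""
    predicted = model(shift_bins(frame, k))
    expected = roll(model(frame), k)
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


# Three octaves, with energy on the pitch classes of a C major triad.
frame = [0.0] * 36
frame[0], frame[4], frame[7] = 1.0, 0.5, 0.5


def truncated_model(f):
    """A model that ignores the upper six pitch classes, hence not equivariant."""
    return fold_to_chroma(f)[:6] + [0.0] * 6


good = equivariance_loss(fold_to_chroma, frame, 3)
bad = equivariance_loss(truncated_model, frame, 3)
```

In a training setup, a term of this kind would be minimized over unlabeled audio, pushing the network towards outputs that transpose consistently with the input.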
Finally, we addressed topics related to the evaluation of MIR systems. In particular, we discussed the gap between loss functions typically used for optimizing deep learning pipelines and evaluation metrics designed for evaluating specific MIR tasks. In this context, we pointed out the vulnerability of standard metrics to slight variances irrelevant to the perceived output quality, expressing the need for more reliable evaluation metrics. Furthermore, we envisioned the possibility of closing the gap by designing more meaningful loss functions that may be used in the context of end-to-end learning systems.
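The sensitivity of tolerance-based metrics can be illustrated with a simplified beat-tracking F-measure. The summary does not single out a particular metric, so this is only an assumed example: the greedy matching is a simplification, and the 70 ms tolerance window is a commonly used convention adopted here as an assumption.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure where an estimate is a hit if within +/- tolerance seconds."""
    unmatched = list(reference)
    hits = 0
    for est in estimated:
        # Greedily match each estimate to the closest still-unmatched reference.
        candidates = [ref for ref in unmatched if abs(ref - est) <= tolerance]
        if candidates:
            unmatched.remove(min(candidates, key=lambda ref: abs(ref - est)))
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = [0.5, 1.0, 1.5, 2.0]
slightly_late = [t + 0.069 for t in reference]
a_bit_later = [t + 0.071 for t in reference]

score_inside = beat_f_measure(reference, slightly_late)
score_outside = beat_f_measure(reference, a_bit_later)
```

The two estimates differ by only 2 ms, a difference unlikely to be audible, yet the first scores 1.0 and the second 0.0. The hard threshold also makes the metric non-differentiable, which is one reason it cannot be used directly as a training loss.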
Participants and Group Composition
In our seminar, we had 22 participants, who came from various locations around the world, including North America (2 participants from the United States), Asia (2 participants from South Korea), and Europe (18 participants from France, Germany, the Netherlands, Sweden, and the United Kingdom). The number of participants and the international constellation are remarkable considering the ongoing pandemic. (Note that many of the invited participants, particularly from overseas, were not allowed to go on business trips.) More than half of the participants (12 out of 22) came to Dagstuhl for the first time and expressed enthusiasm about the open and retreat-like atmosphere. Besides its international character, the seminar was also highly interdisciplinary. While most of the participating researchers are working in music information retrieval, we also had participants with a background in musicology, signal processing, machine learning, mathematics, computer vision, and other fields. Our seminar stimulated cross-disciplinary discussions by having experts working in technical and non-technical disciplines while highlighting opportunities for new collaborations among our attendees. Furthermore, the number of participants from industry (6 out of 22) was relatively high, which also underlines the relevance of the seminar’s topic beyond fundamental research. Most of the participants had a strong musical background, some of them even having a dual career in an engineering discipline and music. This led to numerous social activities, including playing music together. In addition to geographical locations and research disciplines, we tried to foster variety in terms of seniority levels (e.g., we had three Ph.D. students and six participants on the postdoc/junior/assistant professor level) and in terms of gender (6 out of 22 of the participants identify as female).
Besides scientific questions, we also discussed various challenges that younger colleagues typically face when setting up their research groups and scientific curriculum at the beginning of their academic careers.
Overall Organization and Schedule
Dagstuhl Seminars have a high degree of flexibility and interactivity, allowing participants to discuss ideas and raise questions rather than presenting research results. Following this tradition, we fixed the schedule during the seminar, asking for spontaneous contributions with future-oriented content, thus avoiding a conference-like atmosphere, where the focus tends to be on past research achievements. After the organizers gave an overview of the Dagstuhl concept, we started the first day with self-introductions, where all participants introduced themselves and expressed their expectations and wishes for the seminar. We then continued with short (15 to 20 minutes) stimulus talks, where specific participants addressed some critical questions related to the seminar’s overall topic in a non-technical fashion. Each of these talks seamlessly moved towards an open discussion among all participants, where the respective presenters took over the role of a moderator. These discussions were well received and often lasted for more than half an hour. The first day closed with a brainstorming session on central topics covering the participants’ interests while shaping the overall schedule and format for the next day. We continued having stimulus talks interleaved with extensive discussions on the subsequent days. On the second day, we split into smaller groups, each group discussing a more specific topic in greater depth. The results and conclusions of these parallel group sessions, which lasted between 60 and 90 minutes, were then presented and discussed with the plenum. However, since the overall seminar size of 22 participants was relatively small, it turned out that the division into subgroups was not necessary. Thanks to excellent group dynamics and a fair distribution of speaking time, all participants had their say and were able to express their thoughts in the plenum while avoiding a monotonous conference-like presentation format.
On the last day, we enjoyed a tutorial by Umut Simsekli on some theoretical concepts behind deep learning (a topic unanimously desired by the group). We concluded the seminar with a session we called “self-outroductions” where each participant presented their personal view on the seminar’s results.
While working in technical engineering disciplines, most participants also have a strong background and interest in music. This versatility significantly shaped the seminar’s atmosphere, leading to cross-disciplinary exchange, thought-provoking discussions, and intensive joint music-making during the breaks and in the evenings. One particular highlight was a concert on Thursday evening organized by Cynthia Liem and Christof Weiß, where various participant-based ensembles performed a wide variety of music, including classical music, Irish folk music, and jazz.
Conclusions and Acknowledgment
There is a growing trend toward building more interpretable deep learning systems, from the data collection and generation stage, to the input and output representations, to the model structure itself. On the other hand, classical model-based approaches bring a wealth of expertise on techniques for knowledge integration in system design. The Dagstuhl Seminar gave us the opportunity for connecting experts from classical model-based approaches, deep learning-based approaches, and related interdisciplinary fields such as music perception and human-computer interaction in order to generate discussion and spark new collaborations. The generation of novel, technically oriented scientific contributions was not the main focus of the seminar. Naturally, many of the contributions and discussions were on a conceptual level, laying the foundations for future projects and collaborations. Thus, the main impact of the seminar is likely to take place in the medium and long term. Some more immediate results, such as plans to share research data and software, also arose from the discussions. As further measurable outputs from the seminar, we expect to see several joint papers and applications for funding.
Besides the scientific aspect, the social aspect of our seminar was just as important. We had an interdisciplinary, international, and interactive group of researchers, consisting of leaders and future leaders in our field. Many of our participants were visiting Dagstuhl for the first time and enthusiastically praised the open and inspiring setting. The group dynamics were excellent, with many personal exchanges and shared activities. Some scientists expressed their appreciation for having the opportunity for prolonged discussions with researchers from neighboring research fields, which is often impossible during conference-like events. At this point, we would like to let some of the participants have their say:
- Stefan Balke (pmOne – Paderborn, DE): “Dagstuhl is always a wonderful experience, having time to think, talk, and play music. All in a relaxed atmosphere, the seminar feels like a family meeting – especially in these times.”
- Alice Cohen-Hadria (IRCAM – Paris, FR): “Now I feel like a part of a community.”
- Dasaem Jeong (Sogang University – Seoul, KR): “Full of insightful discussions, music, and friends in a beautiful place.”
- Cynthia Liem (TU Delft, NL): “Dagstuhl is the one place in the world where one effectively can have a week long unconference. More deeply talking about research and new ideas, enjoying time with academic friends, with much less distraction than one would have at home, or even in a ‘regular’ conference. Especially coming out of a pandemic, I am realizing this is among the most valuable things in our professional life.”
- Daniel Stoller (Spotify – Bonn, DE): “Dagstuhl brings perspectives on the big issues.”
- Yu Wang (New York University – Brooklyn, US): “Discussion is like music: the live version is always better.”
In conclusion, our expectations for the seminar were not only met but exceeded, in particular concerning networking and community building. We want to express our gratitude to the Dagstuhl board for giving us the opportunity to organize this seminar, the Dagstuhl office for their exceptional support in the organization process, and the entire Dagstuhl staff for their excellent service during the seminar. In particular, we want to thank Susanne Bach-Bernhard and Michael Gerke for their assistance during the preparation and organization of the seminar.
- Stefan Balke (pmOne - Paderborn, DE)
- Rachel Bittner (Spotify - Paris, FR)
- Alice Cohen-Hadria (IRCAM - Paris, FR)
- Simon Dixon (Queen Mary University of London, GB)
- Simon Durand (Spotify - Paris, FR)
- Sebastian Ewert (Spotify GmbH - Berlin, DE)
- Magdalena Fuentes (NYU - Brooklyn, US)
- Dasaem Jeong (Sogang University - Seoul, KR)
- Michael Krause (Universität Erlangen-Nürnberg, DE)
- Cynthia Liem (TU Delft, NL)
- Gabriel Meseguer Brocal (Deezer - Paris, FR)
- Meinard Müller (Universität Erlangen-Nürnberg, DE)
- Juhan Nam (KAIST - Daejeon, KR)
- Yigitcan Özer (Universität Erlangen-Nürnberg, DE)
- Geoffroy Peeters (Telecom Paris, FR)
- Gaël Richard (Telecom Paris, FR)
- Umut Simsekli (INRIA - Paris, FR)
- Daniel Stoller (Spotify GmbH - Berlin, DE)
- Bob Sturm (KTH Royal Institute of Technology - Stockholm, SE)
- Gül Varol (ENPC - Marne-la-Vallée, FR)
- Yu Wang (New York University - Brooklyn, US)
- Christof Weiß (Universität Erlangen-Nürnberg, DE)
- Dagstuhl Seminar 11041: Multimodal Music Processing (2011-01-23 - 2011-01-28)
- Dagstuhl Seminar 16092: Computational Music Structure Analysis (2016-02-28 - 2016-03-04)
- Dagstuhl Seminar 19052: Computational Methods for Melody and Voice Processing in Music Recordings (2019-01-27 - 2019-02-01)
- Information Retrieval
- Machine Learning
- Music information retrieval
- Audio signal processing
- Deep learning
- Knowledge representation
- User interaction and interfaces