Estimation of elements related to musical concepts


By musical concept extraction we refer to the estimation of the elements of a notation system from the audio signal, and to the estimation of higher-level semantic information from these elements. These elements belong to a vocabulary and are assembled according to a grammar specific to a culture. The challenge here is to automatically derive musical concepts from audio signals or from commonly available symbolic data, such as MIDI or scores. Extracting musical concepts from audio signals remains technically very difficult, and new ways of performing it still need to be found. More specifically, we need to develop better source separation algorithms, develop methodologies for the joint estimation of music content parameters, and combine symbolic information with audio data to extract higher-level semantic concepts. For this reason, this challenge involves not only researchers (in signal processing, machine learning, cognition and perception, and musicology), but also content providers (record companies), who could help by delivering material for research (such as the separate audio tracks of multi-track recordings). Enabling the description of data in terms of musical concepts can improve the understanding of the content and hence lead to better or new uses of this content. Having access to separate source elements or to accurate score information may have a huge impact on the creative industries (game industry, music e-learning, ...) and the music distribution industry (better access to music), as well as facilitating large-scale musicological analyses.


Back to → Roadmap:Technological perspective



State of the art

In MIR, Audio Content Analysis (ACA) aims at extracting musical concepts using algorithms applied to the audio signal. One of its goals is to estimate the score of a music track (melody, harmony, rhythm, beat and downbeat positions, overall structure) from the audio signal. ACA has been a major focus of research in the MIR community over the past decade. But how can algorithm performance be further improved in this field?

Most ACA algorithms aim at estimating two kinds of concepts: subjective or application-oriented concepts (such as genre, mood, user tags and similarity), and musical concepts (such as pitch, beat, meter, chords and structure). Whatever the task, ACA algorithms first need to extract meaningful information from the signal and then map it to the concept. ACA therefore involves research related to signal processing (extracting better audio features, creating better signal models such as the sinusoidal model, or performing better source separation) and to knowledge encoding and discovery (how to encode or acquire knowledge in or with an algorithm), which includes machine learning (SVM, AdaBoost, Random Forest). Since subjective concepts are hard to define explicitly, their estimation is usually performed from examples, using machine learning to acquire the knowledge from those examples. Musical concepts can be defined either explicitly or by example, so ACA algorithms acquire the knowledge either through predefined models (such as a musical grammar defining chord transition probabilities) or through training on annotated data.
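
To make this two-stage pipeline concrete, here is a minimal sketch (not taken from any cited system) that summarises a track with MFCC statistics and maps them to a subjective concept with an SVM; librosa and scikit-learn are assumed as tooling, and the file names and labels are purely hypothetical.

```python
# Minimal sketch of the two-stage ACA pipeline: extract audio features,
# then map them to a concept with a trained classifier.
# Assumes librosa and scikit-learn; file names and labels are hypothetical.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def summarise(path):
    """Mean and standard deviation of MFCCs: a crude bag-of-frames descriptor."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled excerpts (knowledge acquired from examples).
train_files = ["blues_01.wav", "rock_01.wav", "blues_02.wav", "rock_02.wav"]
train_labels = ["blues", "rock", "blues", "rock"]

X = np.vstack([summarise(f) for f in train_files])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, train_labels)
print(model.predict([summarise("unknown.wav")]))
```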

A final trend concerns the estimation of subjective concepts using previously estimated musical concepts as information (for example, inferring the genre from the estimated tempo and/or chord progression).

"Musical concepts" denotes the parameters related to written music. Since the MIR community is largely made up of Western researchers, written music refers to the music notation system originating from European classical music, consisting of notes with an associated position and duration inside a bar, in the context of a meter (hence providing beat positions inside the bars), a clef (indicating the octaves assigned to the notes), a key signature (series of sharps or flats) organised by instrument (or hands) into parallel staffs, and finally organised in a large structure corresponding to musical sections. An extension of common notation summarises groups of simultaneously occurring notes using chord symbols. ACA aims at retrieving this music notation from the observation of an audio music track (realisation of a generative music process). Since the audio signal represents a realisation of the music notation it exhibits variations in terms of interpretation (not all the notes are played, pitches vary over time, and musicians modify timing). ACA algorithms estimate pitches with associated starting and ending times which are then mapped to the [pitch-height, clef, key, metrical position and duration] system. All this makes music transcription a difficult problem to solve. Moreover, until recently, from an application point-of-view, the market place was considered limited (to users with musical training). Today, with the success of applications such as Melodyne (multi-pitch estimation), Garage-Band, the need for search using Query-by-Humming (dominant melody extraction), mobile applications such as Tonara (iPad) and online applications such as Songs2See, information related to music transcription is now reaching everyday people. For the estimation of music transcription two major trends can be distinguished.

Non-informed estimation (estimation-from-scratch)

These approaches attempt to estimate the various music score concepts from scratch (without any information such as score or chord-tabs). In this category, approaches have been proposed for estimating the various pitches, the key, the sequence of chords, the beat and downbeat positions and the global structure.

Multi-pitch estimation is probably the most challenging task, since it involves identifying the various pitches occurring simultaneously and estimating the number of sources playing at any time. According to [Yeh et al., 2010], most multi-pitch algorithms follow three main principles closely related to mechanisms of the auditory system: harmonicity, spectral smoothness, and synchronous amplitude evolution within a given source. From these principles a number of approaches are derived: solving the problem with a global optimisation scheme such as NMF [Vincent et al., 2008], harmonic temporal structured clustering [Kameoka et al., 2007], iterative optimisation, or a probabilistic framework [Ryynänen and Klapuri, 2008]. Considering that performance in the related MIREX task has remained almost constant in recent years (~69% note accuracy for simple musical material), it seems that a glass ceiling has been reached in this domain and that new approaches should be studied. A sub-problem of multi-pitch estimation is the simpler "melody extraction" problem (which is also related to the "lyric recognition/alignment" task described below). The principles underlying melody extraction methods [Salamon and Gómez, 2012] are similar, but only one pitch needs to be estimated, usually the most predominant and hence the most easily detected in the signal. Because of this, much better performance has been achieved for this task (up to 85% in MIREX-2011).
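
As a toy illustration of the NMF family cited above, the following sketch factorises a magnitude spectrogram into spectral templates and activations and reads a crude pitch candidate off each template. It is a deliberate over-simplification of harmonically constrained models such as [Vincent et al., 2008]; librosa and scikit-learn are assumed, and the input file is hypothetical.

```python
# Toy sketch of NMF-based multi-pitch analysis: factorise the magnitude
# spectrogram into spectral templates W and activations H, then take the
# strongest bin of each template as a crude pitch candidate. Real systems
# additionally enforce harmonicity and spectral smoothness.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("example.wav", sr=22050)       # hypothetical input file
S = np.abs(librosa.stft(y, n_fft=4096, hop_length=512))

model = NMF(n_components=6, init="nndsvd", max_iter=500)
W = model.fit_transform(S)      # (freq_bins, components): spectral templates
H = model.components_           # (components, frames): activations over time

freqs = librosa.fft_frequencies(sr=sr, n_fft=4096)
for k in range(W.shape[1]):
    f0 = freqs[np.argmax(W[:, k])]   # dominant bin only, no harmonicity constraint
    active = np.sum(H[k] > 0.5 * H[k].max())
    print(f"component {k}: ~{f0:.1f} Hz, strongly active in {active} frames")
```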

Key and chord estimation are two closely related topics. They both aim at assigning a label chosen from a dictionary (a fixed set of 24 tonalities, or the various triads with possible extensions) to a segment of time. Given that the estimation of key and chords from estimated multi-pitch data is still unreliable, algorithms rely for the most part on the extraction of Chroma or Harmonic Pitch Class Profiles [Gómez, 2006], possibly including harmonic/pitch enhancement or spectral whitening. A model (either resulting from perceptual experiments, trained on data, or inspired by music theory) is then used to map the observations to the labels. In this domain, modelling the dependencies (with HMMs or Bayesian networks) between the various musical parameters is common practice: dependencies between chords and key [Pauwels and Martens, 2010], between successive chords, between chord, metrical position and bass note [Mauch and Dixon, 2010], or between chord and downbeat [Papadopoulos and Peeters, 2010]. Key and chord estimation is the research topic that relies the most on music theory.
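
A minimal sketch of this chroma-plus-template idea, assuming librosa and a hypothetical input file: the average chroma vector of the track is correlated against the 24 rotated Krumhansl-Kessler key profiles (the commonly published values). Per-frame matching against triad templates would follow the same pattern for chord estimation.

```python
# Minimal sketch of template-based key estimation: average chroma vector
# correlated against the 24 rotated Krumhansl-Kessler key profiles.
# Assumes librosa; the input file is hypothetical.
import numpy as np
import librosa

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

y, sr = librosa.load("example.wav", sr=22050)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)

best = max(
    ((np.corrcoef(chroma, np.roll(profile, k))[0, 1], NAMES[k], mode)
     for k in range(12)
     for profile, mode in [(MAJOR, "major"), (MINOR, "minor")]),
    key=lambda t: t[0],
)
print(f"estimated key: {best[1]} {best[2]} (r={best[0]:.2f})")
```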

While music scores define the temporal grid at multiple metrical levels, most research focuses on the beat level (known as the tactus). In this field, methods can be roughly subdivided into a) audio-to-symbolic or onset-based methods and b) energy-variation-based methods [Scheirer, 1998]. The periodicities extracted by these front-ends can then be used to infer the tempo directly, or to infer the whole metrical structure (tatum, tactus, measure, systematic time deviations such as the swing factor [Laroche, 2003]) through probabilistic or multi-agent models. Other kinds of front-ends have also been used to provide higher-level context information (chroma variation, spectral balance [Goto, 2001], [Klapuri et al., 2006], [Peeters and Papadopoulos, 2011]). Given the importance of correctly estimating the musical time-grid provided by beat and downbeat information, this field will remain active for some time. A good overview can be found in [Gouyon and Dixon, 2005].
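
A concrete, minimal instance of the energy-variation family, assuming librosa and a hypothetical input file: an onset-strength envelope is computed and fed to a dynamic-programming beat tracker, yielding a tempo estimate and beat positions.

```python
# Sketch of an energy-variation-based beat tracker: derive an onset-strength
# envelope, then let a dynamic-programming tracker pick a tempo and a beat
# sequence. One concrete instance of the family surveyed above.
import librosa

y, sr = librosa.load("example.wav", sr=22050)       # hypothetical input file
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print("estimated tempo (BPM):", tempo)
print("first beats (s):", beat_times[:8])
```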

Research on the estimation of music structure from audio started at the end of the 1990s with the work of Foote [Foote, 1999] (co-occurrence matrix) and Logan [Logan and Chu, 2000]. By "structure", these works mean either the detection of homogeneous parts (the state approach [Peeters, 2004]) or the detection of repeated sequences of events, possibly including transpositions or time-stretching (the sequence approach [Peeters, 2004]). Both approaches use low-level features such as MFCC or Chroma/PCP as a front-end. In the first case, methods are usually based on time-segmentation and various clustering or HMM techniques [Levy and Sandler, 2008]. Sequence approaches usually first detect repetitions in a self-similarity matrix and then infer the structure from the detected repetitions using heuristics or fitness approaches [Paulus and Klapuri, 2009]. A good overview of this topic can be found in [Paulus et al., 2010].
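
Both families share a self-similarity front-end; a minimal sketch of it, assuming librosa and a hypothetical input file, computes beat-synchronous chroma features and their pairwise cosine similarity. State approaches then look for homogeneous blocks in this matrix, sequence approaches for off-diagonal stripes (repetitions).

```python
# Shared front-end of both structure approaches: a self-similarity matrix
# over beat-synchronous chroma features. Assumes librosa; the input file is
# hypothetical.
import librosa

y, sr = librosa.load("example.wav", sr=22050)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
_, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma_sync = librosa.util.sync(chroma, beats)            # beat-synchronous features

X = librosa.util.normalize(chroma_sync, norm=2, axis=0)   # unit-norm columns
ssm = X.T @ X                                             # cosine self-similarity
print("self-similarity matrix shape:", ssm.shape)
```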

Informed estimation (alignment and followers)

These approaches use prior information (such as that given by a score, a MIDI file or a text transcription) and align it to an audio file, the alignment itself providing the estimation. This strategy is currently applied to two areas where estimation-from-scratch remains very complex: scores and lyrics.

Score alignment and score following are two closely related topics, in the sense that the latter is the real-time version of the former. They both consist in finding a time-synchronisation between a symbolic representation and an audio signal. Historically, score following was developed first, with the goal of allowing interaction between a computer and a musician ([Dannenberg, 1984], [Vercoe, 1984]), using MIDI or fingering information rather than audio because of CPU limitations. This work was later extended by Puckette [Puckette, 1990] to take pitch estimation from audio into account and to deal with polyphonic data. Given the imperfect nature of the observations, [Grubb and Dannenberg, 1997] introduced statistical approaches. Since 1999, the Hidden Markov Model / Viterbi framework has been the main model used to represent time dependency [Raphael, 1999]. The choice of Viterbi decoding, which is also at the heart of dynamic time warping (DTW) algorithms, is the common point between alignment and followers [Orio and Schwarz, 2001]. Since then, the two fields have diverged: alignment focuses on solving computational issues related to DTW, while following focuses on anticipation (using tempo or recurrence information [Cont, 2008]). While formerly restricted to a small circle of practitioners, score following is now accessible to a large audience through recent applications such as Tonara (iPad) or Songs2See (web-based).
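
A minimal sketch of offline score-to-audio alignment via DTW, assuming librosa and hypothetical file names: chroma features of a reference rendering of the score (e.g. synthesised from MIDI) are warped onto those of the performance, and the warping path gives the time-synchronisation.

```python
# Sketch of offline score-to-audio alignment via DTW: extract chroma from a
# reference rendering of the score (e.g. synthesised MIDI) and from the
# performance, then warp one onto the other. Assumes librosa; file names are
# hypothetical.
import librosa

hop = 512
ref, sr = librosa.load("score_rendering.wav", sr=22050)    # synthesised score
perf, _ = librosa.load("performance.wav", sr=22050)        # real recording

C_ref = librosa.feature.chroma_cqt(y=ref, sr=sr, hop_length=hop)
C_perf = librosa.feature.chroma_cqt(y=perf, sr=sr, hop_length=hop)

# Dynamic-programming alignment over the pairwise cost matrix.
D, wp = librosa.sequence.dtw(X=C_ref, Y=C_perf, metric="cosine")
times = librosa.frames_to_time(wp[::-1], sr=sr, hop_length=hop)
print("first aligned (score_time, performance_time) pairs:")
print(times[:5])
```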

Automatic transcription of the lyrics of a music track is another complex task. It involves first locating the signal of the singer in the mixed audio track, and then recognising the lyrics conveyed by this signal (large differences between the characteristics of the singing voice and speech make standard speech transcription systems unsuitable for the singing voice). Work on alignment started with the isolated singing voice [Loscos et al., 1999] and was later extended to the singing voice mixed with other sources. Systems usually first attempt to isolate the singing voice (e.g. using the PreFEst dominant melody detection algorithm [Fujihara et al., 2011]), then estimate a voice activity criterion, and finally decode the phoneme sequence using a modified HMM topology (the filler model in [Fujihara et al., 2011]), adapting the speech phoneme models to singing. Other systems also exploit the temporal relationships between the text of the lyrics and the music. For example, the LyricAlly system [Wang et al., 2004] relies on the assumption that lyrics are organised in paragraphs just as the music is organised in segments: the central segment, the chorus, serves as an anchor point for paragraphs, and measure positions serve as anchor points for lines.

Deriving musical information from symbolic representations

Research related to the extraction of higher-level music elements from symbolic representations has always been at the heart of MIR, with research centred around systems such as Humdrum [Huron, 2002], MuseData, Guido/MIR [Hoos et al., 2001], jSymbolic [McKay and Fujinaga, 2006], Music21 [Cuthbert et al., 2011] or projects such as Wedel Music [Barthélemy and Bonardi, 2001] and MUSART [Birmingham et al., 2001]. Most of the higher-level elements are the same as those targeted by audio description (for example, the symbolic tasks run at MIREX: genre, artist, similarity, cover song, chord, key, melody identification, meter estimation), but some, due to current limitations of audio processing, are still specific to symbolic processing (e.g. recognition of motives and cadences).
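
A minimal sketch of this kind of symbolic processing, using music21 (one of the toolkits cited above) on a hypothetical MIDI file: the score is parsed and queried directly for the key and a simple pitch-class histogram.

```python
# Minimal sketch of symbolic analysis with music21: parse a symbolic file and
# query higher-level elements directly, here the key and a pitch-class
# histogram. The file name is hypothetical.
from collections import Counter
from music21 import converter

score = converter.parse("example.mid")         # MIDI or MusicXML input
print("estimated key:", score.analyze("key"))  # Krumhansl-style key analysis

pitch_classes = Counter(
    n.pitch.name for n in score.flatten().notes if n.isNote
)
print("pitch-class histogram:", pitch_classes.most_common())
```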



Back to → Roadmap:Technological perspective
