Estimation of elements related to musical concepts
By musical concept extraction we refer to the estimation of the elements of a notation system from the audio signal, and to the estimation of higher-level semantic information from these elements. These elements belong to a vocabulary and are assembled according to a grammar specific to a culture. The challenge here is to automatically derive musical concepts from audio signals or from commonly available symbolic data, such as MIDI or scores. Extracting musical concepts from audio signals is technically very difficult, and new ways to perform it still need to be found. More specifically, we need to develop better source separation algorithms, develop methodologies for the joint estimation of music content parameters, and combine symbolic information with audio data to extract higher-level semantic concepts. For this reason, this challenge involves not only researchers (in signal processing, machine learning, cognition and perception, and musicology), but also content providers (record companies), who could help by delivering material for research (such as the separate audio tracks of multi-track recordings). Enabling the description of data in terms of musical concepts can improve the understanding of the content and hence enable better and new uses of it. Having access to separate source elements or to accurate score information may have a huge impact on the creative industries (game industry, music e-learning, ...) and the music distribution industry (better access to music), as well as facilitating large-scale musicological analyses.
State of the art
In MIR, Audio Content Analysis (ACA) aims at extracting musical concepts using algorithms applied to the audio signal. One of its goals is to estimate the score of a music track (melody, harmony, rhythm, beat and downbeat positions, overall structure) from the audio signal. ACA has been a major focus of research in the MIR community over the past decade. But how can algorithm performance be further improved in this field?
Most ACA algorithms aim at estimating two kinds of concepts: subjective or application-oriented concepts (such as genre, mood, user tags and similarity), and musical concepts (such as pitch, beat, meter, chords and structure). Whatever the task, ACA algorithms first need to extract meaningful information from the signal and then map it to the concept. ACA therefore involves research related to signal processing (extracting better audio features, creating better signal models such as the sinusoidal model, or performing better source separation) and to knowledge encoding and discovery (how to encode or acquire knowledge in or with an algorithm), which in turn involves machine learning (SVM, AdaBoost, Random Forests). Since subjective concepts are hard to define, their estimation is usually performed using examples, hence using machine learning to acquire the knowledge from these examples. Musical concepts can be defined explicitly or by example, hence ACA algorithms either acquire the knowledge through predefined models (such as a musical grammar defining chord transition probabilities) or through training.
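As a minimal illustration of this extract-then-map pipeline (a sketch only, assuming Python with librosa and scikit-learn; the file names and genre labels are placeholders, and a real system would need a much larger training set and richer features):

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def summary_features(path):
    """Mean and standard deviation of MFCCs as a crude track-level feature vector."""
    y, sr = librosa.load(path, duration=60.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled training set: (audio file, genre tag) pairs (placeholders).
train_files = [("song1.wav", "rock"), ("song2.wav", "jazz")]
X = np.array([summary_features(f) for f, _ in train_files])
y = [tag for _, tag in train_files]

# Map the audio features to the subjective concept with an SVM, as mentioned above.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([summary_features("new_song.wav")]))   # placeholder query file
```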
A further trend concerns the estimation of subjective concepts from estimated musical concepts (for example, inferring the genre from the estimated tempo and/or chord progression).
"Musical concepts" denotes the parameters related to written music. Since the MIR community is largely made up of Western researchers, written music refers to the music notation system originating from European classical music, consisting of notes with an associated position and duration inside a bar, in the context of a meter (hence providing beat positions inside the bars), a clef (indicating the octaves assigned to the notes), a key signature (series of sharps or flats) organised by instrument (or hands) into parallel staffs, and finally organised in a large structure corresponding to musical sections. An extension of common notation summarises groups of simultaneously occurring notes using chord symbols. ACA aims at retrieving this music notation from the observation of an audio music track (realisation of a generative music process). Since the audio signal represents a realisation of the music notation it exhibits variations in terms of interpretation (not all the notes are played, pitches vary over time, and musicians modify timing). ACA algorithms estimate pitches with associated starting and ending times which are then mapped to the [pitch-height, clef, key, metrical position and duration] system. All this makes music transcription a difficult problem to solve. Moreover, until recently, from an application point-of-view, the market place was considered limited (to users with musical training). Today, with the success of applications such as Melodyne (multi-pitch estimation), Garage-Band, the need for search using Query-by-Humming (dominant melody extraction), mobile applications such as Tonara (iPad) and online applications such as Songs2See, information related to music transcription is now reaching everyday people. For the estimation of music transcription two major trends can be distinguished.
Non-informed estimation (estimation-from-scratch)
These approaches attempt to estimate the various music score concepts from scratch (without any information such as score or chord-tabs). In this category, approaches have been proposed for estimating the various pitches, the key, the sequence of chords, the beat and downbeat positions and the global structure.
Multi-pitch estimation is probably the most challenging task, since it involves identifying the various pitches occurring simultaneously and estimating the number of sources playing at any time. According to [Yeh et al., 2010], most multi-pitch algorithms follow three main principles closely related to mechanisms of the auditory system: harmonicity, spectral smoothness, and synchronous amplitude evolution within a given source. From these principles a number of approaches are derived: solving the problem with a global optimisation scheme such as NMF [Vincent et al., 2008], harmonic temporal structured clustering [Kameoka et al., 2007], iterative optimisation, or a probabilistic framework [Ryynänen and Klapuri, 2008]. Given that the performance obtained in recent years in the related MIREX task (~69% note accuracy for simple musical material) has remained almost constant, a glass ceiling seems to have been reached in this domain, and new approaches should be studied. A sub-problem of multi-pitch estimation is the simpler "melody extraction" problem (which is also related to the "lyric recognition/alignment" described below). The principles underlying melody extraction methods [Salamon and Gómez, 2012] are similar, but only one pitch needs to be estimated, usually the most predominant and hence the most easily detected in the signal. Because of this, much better performance has been achieved for this task (up to 85% in MIREX-2011).
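As a very rough sketch of how the NMF family of approaches proceeds (this is not a reimplementation of the cited systems; it assumes librosa and scikit-learn, and the file name, number of components and threshold are placeholders):

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("piano_excerpt.wav")             # placeholder file
C = np.abs(librosa.cqt(y, sr=sr, n_bins=72, bins_per_octave=12))

# Factorise the constant-Q magnitude spectrogram into spectral templates and
# their activations over time: C.T (frames x bins) ~= W (frames x K) @ H (K x bins).
model = NMF(n_components=30, init="nndsvd", max_iter=500)
activations = model.fit_transform(C.T)                 # frames x components
templates = model.components_                          # components x CQT bins

# Crude pitch labelling: assign each template the frequency of its strongest bin.
freqs = librosa.cqt_frequencies(n_bins=72, fmin=librosa.note_to_hz("C1"),
                                bins_per_octave=12)
pitches = freqs[templates.argmax(axis=1)]

# Print the pitches active above a placeholder threshold, every 100th frame.
threshold = 0.2 * activations.max()
for t in range(0, activations.shape[0], 100):
    print(t, np.round(sorted(pitches[activations[t] > threshold]), 1))
```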
Key and chord estimation are two closely related topics. They both aim at assigning a label chosen from a dictionary (a fixed set of 24 tonalities, or the various triads with possible extensions) to a segment of time. Given that the estimation of key and chords from estimated multi-pitch data is still unreliable, algorithms rely for the most part on the extraction of Chroma or Harmonic Pitch Class Profiles [Gómez, 2006], possibly including harmonic/pitch enhancement or spectral whitening. A model (either resulting from perceptual experiments, trained using data, or inspired by music theory) is then used to map the observations to the labels. In this domain, modelling the dependencies between the various musical parameters (with HMMs or Bayesian networks) is common practice: dependencies between chords and key [Pauwels and Martens, 2010], between successive chords, between chord, metrical position and bass note [Mauch and Dixon, 2010], or between chord and downbeat [Papadopoulos and Peeters, 2010]. Key and chord estimation is the research topic that relies the most on music theory.
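The template-matching idea can be sketched for global key estimation as follows (assuming librosa; the profiles are the standard Krumhansl-Kessler key profiles, and the file name is a placeholder; real systems refine this considerably, e.g. with tuning correction and local analysis):

```python
import numpy as np
import librosa

# Krumhansl-Kessler major/minor key profiles (perceptually derived templates).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

y, sr = librosa.load("song.wav")                       # placeholder file
chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)

# Correlate the averaged chroma vector with each of the 24 rotated profiles.
scores = {}
for i, note in enumerate(NOTES):
    scores[note + " major"] = np.corrcoef(chroma, np.roll(MAJOR, i))[0, 1]
    scores[note + " minor"] = np.corrcoef(chroma, np.roll(MINOR, i))[0, 1]

print(max(scores, key=scores.get))                     # most likely key
```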
While music scores define the temporal grid at multiple metrical levels, most research focuses on the beat level (the tactus). In this field, methods can be roughly subdivided into a) audio-to-symbolic or onset-based methods and b) energy-variation-based methods [Scheirer, 1998]. Periodicities in the resulting onset or energy functions can then be used to infer the tempo directly, or to infer the whole metrical structure (tatum, tactus, measure, systematic timing deviations such as the swing factor [Laroche, 2003]) through probabilistic or multi-agent models. Other kinds of front-end have also been used to provide higher-level context information (chroma variation, spectral balance [Goto, 2001], [Klapuri et al., 2006], [Peeters and Papadopoulos, 2011]). Given the importance of a correct estimation of the musical time-grid provided by beat and downbeat information, this field will remain active for some time. A good overview can be found in [Gouyon and Dixon, 2005].
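A much reduced sketch of the energy-variation/periodicity idea (assuming librosa; it returns a single tempo candidate only and ignores tatum, measure and octave ambiguities; file name and tempo range are placeholders):

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav")                       # placeholder file
hop = 512

# Onset strength envelope: a proxy for the energy-variation front-end.
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

# Periodicities of the envelope via autocorrelation.
ac = librosa.autocorrelate(onset_env, max_size=1000)
frames_per_sec = sr / hop

# Restrict to plausible tempi (here 60-200 BPM) and pick the strongest lag.
lags = np.arange(1, len(ac))
bpms = 60.0 * frames_per_sec / lags
valid = (bpms >= 60) & (bpms <= 200)
best_lag = lags[valid][np.argmax(ac[1:][valid])]
print("estimated tempo: %.1f BPM" % (60.0 * frames_per_sec / best_lag))
```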
Research on the estimation of Music Structure from audio started at the end of the '90s with the work of Foote [Foote, 1999] (self-similarity matrix) and Logan [Logan and Chu, 2000]. By "structure", these works mean either the detection of homogeneous parts (state approach [Peeters, 2004]) or of repeated sequences of events, possibly including transpositions or time-stretching (sequence approach [Peeters, 2004]). Both approaches share the use of low-level features such as MFCC or Chroma/PCP as a front-end. In the first case, methods are usually based on time-segmentation and various clustering or HMM techniques [Levy and Sandler, 2008]. Sequence approaches usually first detect repetitions in a self-similarity matrix and then infer the structure from the detected repetitions using heuristics or fitness approaches [Paulus and Klapuri, 2009]. A good overview of this topic can be found in [Paulus et al., 2010].
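Both families of methods start from a self-similarity representation, which can be sketched as follows (assuming librosa; the file name is a placeholder, and beat-synchronous chroma is just one possible front-end):

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav")                       # placeholder file

# Low-level front-end: chroma features, averaged between beats.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
_, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma_sync = librosa.util.sync(chroma, beats)

# Self-similarity matrix: cosine similarity between all pairs of frames.
X = librosa.util.normalize(chroma_sync, norm=2, axis=0)
ssm = X.T @ X                                          # frames x frames

# Repeated sequences appear as stripes parallel to the main diagonal
# (sequence approach); homogeneous sections appear as blocks (state approach).
print(ssm.shape, ssm.min(), ssm.max())
```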
Informed estimation (alignment and followers)
These approaches use prior information (such as a score, a MIDI file or a text transcription) and align it to an audio file, so that the estimation is obtained implicitly through the alignment. This strategy is currently applied to two problems for which estimation-from-scratch remains very complex: scores and lyrics.
Score alignment and score following are two closely related topics, in the sense that the latter is the real-time version of the former. They both consist in finding a time synchronisation between a symbolic representation and an audio signal. Historically, score following was developed first, with the goal of allowing interactions between a computer and a musician ([Dannenberg, 1984], [Vercoe, 1984]), using MIDI or fingering information rather than audio because of CPU limitations. This work was later extended by Puckette [Puckette, 1990] to take pitch estimation from audio into account and to deal with polyphonic data. Given the imperfect nature of the observations, [Grubb and Dannenberg, 1997] introduced statistical approaches. Since 1999, the Hidden Markov Model with Viterbi decoding seems to have become the main model for representing time dependency [Raphael, 1999]. Viterbi decoding, which is also used in dynamic time warping (DTW) algorithms, is the common point between alignment and followers [Orio and Schwarz, 2001]. Since then, the focuses of the two fields have diverged: alignment concentrates on solving computational issues related to DTW, and following on anticipation (using tempo or recurrence information [Cont, 2008]). While formerly restricted to a small number of specialists, score following is now accessible to a large audience through recent applications such as Tonara (iPad) or Songs2See (web-based).
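The offline alignment case can be sketched with standard DTW over chroma features (a sketch only, assuming librosa and pretty_midi; the MIDI file stands in for the score, and the file names are placeholders):

```python
import librosa
import pretty_midi

# Chroma features of the audio performance.
y, sr = librosa.load("performance.wav")                # placeholder file
hop = 512
chroma_audio = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)

# Chroma-like representation of the score, rendered from a MIDI file
# at the same frame rate as the audio features.
fs = sr / hop
midi = pretty_midi.PrettyMIDI("score.mid")             # placeholder file
chroma_score = midi.get_chroma(fs=fs)

# Normalise both feature sequences frame-wise, then align them with DTW.
A = librosa.util.normalize(chroma_audio, norm=2, axis=0)
S = librosa.util.normalize(chroma_score, norm=2, axis=0)
D, wp = librosa.sequence.dtw(X=A, Y=S)                 # wp: warping path (reversed)

# The warping path maps audio frames to score frames; convert both to seconds.
for audio_frame, score_frame in wp[::200]:             # print a sparse subset
    print("audio %6.2f s  <->  score %6.2f s"
          % (audio_frame * hop / sr, score_frame / fs))
```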
Automatic transcription of the lyrics of a music track is another complex task. It involves first locating the signal of the singer in the mixed audio track, and then recognising the lyrics conveyed by this signal (large differences between the characteristics of the singing voice and speech make standard speech transcription systems unsuitable for the singing voice). Work on alignment started with the isolated singing voice [Loscos et al., 1999] and was later extended to the singing voice mixed with other sources. Systems usually first attempt to isolate the singing voice (e.g. using the PreFEst dominant melody detection algorithm [Fujihara et al., 2011]), then estimate a voice activity criterion, and then decode the phoneme sequence using a modified HMM topology (the filler model in [Fujihara et al., 2011]), adapting the speech phoneme model to singing. Other systems also exploit the temporal relationships between the text of the lyrics and the music. For example, the LyricAlly system [Wang et al., 2004] uses the specific assumption that lyrics are organised in paragraphs just as the music is organised in segments. The central segment, the chorus, serves as an anchor point, and measure positions are used as anchor points for lines.
Deriving musical information from symbolic representations
Research related to the extraction of higher-level music elements from symbolic representations has always been at the heart of MIR, with research centred around systems such as Humdrum [Huron, 2002], MuseData, Guido/MIR [Hoos et al., 2001], jSymbolic [McKay and Fujinaga, 2006], Music21 [Cuthbert et al., 2011] or projects such as Wedel Music [Barthélemy and Bonardi, 2001] and MUSART [Birmingham et al., 2001]. Most of the higher-level elements are the same as those targeted by audio description (for example, the symbolic tasks run at MIREX: genre, artist, similarity, cover song, chord, key, melody identification, meter estimation), but some, due to current limitations of audio processing, are still specific to symbolic processing (e.g. recognition of motives and cadences).
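As a small illustration of this kind of symbolic processing (assuming the music21 toolkit cited above; the file name is a placeholder), higher-level elements such as the key and chord labels can be derived directly from a symbolic file:

```python
from music21 import converter

# Parse a symbolic file (MIDI, MusicXML, Humdrum **kern, ...).
score = converter.parse("example.mid")                 # placeholder file

# Global key estimate from the pitch content of the symbolic data.
print(score.analyze("key"))

# Collapse simultaneously sounding notes into chords, measure by measure.
chords = score.chordify()
for c in list(chords.recurse().getElementsByClass("Chord"))[:10]:
    print(c.measureNumber, c.pitchedCommonName)
```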
References
- [Barthélemy and Bonardi, 2001] Jérôme Barthélemy and Alain Bonardi. Figured bass and tonality recognition. In Proceedings of the 2nd International Symposium on Music Information Retrieval, pp. 129-136, Bloomington, Indiana, USA, 2001.
- [Birmingham et al., 2001] W. Birmingham, R. Dannenberg, G. Wakefield, M. Bartsch, D. Bykowski, D. Mazzoni, C. Meek, M. Mellody, and W. Rand. MUSART: Music retrieval via aural queries. In Proceedings of the 2nd International Symposium on Music Information Retrieval, pp. 73-81, Bloomington, Indiana, USA, 2001.
- [Cont, 2008] A. Cont. ANTESCOFO: Anticipatory synchronization and control of interactive parameters in computer music. In Proceedings of International Computer Music Conference, Belfast, Ireland, 2008.
- [Cuthbert et al., 2011] M.S. Cuthbert, C. Ariza, and L. Friedland. Feature extraction and machine learning on symbolic music using the music21 toolkit. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 387-392, Miami, Florida, USA, 2011.
- [Dannenberg, 1984] R.B. Dannenberg. An on-line algorithm for real-time accompaniment. In Proceedings of the International Computer Music Conference, pp. 193-198, Paris, France, 1984. Computer Music Association.
- [Foote, 1999] Jonathan Foote. Visualizing music and audio using self-similarity. In Proceedings of the ACM Multimedia, pp. 77-80, Orlando, Florida, USA, 1999.
- [Fujihara et al., 2011] H. Fujihara, Masataka Goto, J. Ogata, and H. Okuno. LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics. IEEE Journal of Selected Topics in Signal Processing, 5(6):1252-1261, 2011.
- [Gómez, 2006] E. Gómez. Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing, Special Cluster on Computation in Music, 18(3), 2006.
- [Goto, 2001] Masataka Goto. An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds. Journal of New Music Research, 30(2): 159–171, 2001.
- [Gouyon and Dixon, 2005] F. Gouyon and S. Dixon. A review of rhythm description systems. Computer Music Journal, 29(1): 34-54, 2005.
- [Grubb and Dannenberg, 1997] L. Grubb and R.B. Dannenberg. A stochastic method of tracking a vocal performer. In Proceedings of the International Computer Music Conference, pp. 301-308, Thessaloniki, Greece, 1997.
- [Hoos et al., 2001] H. Hoos, K. Renz, and M. Gorg. GUIDO/MIR an experimental musical information retrieval system based on GUIDO music notation. In Proceedings of the 2nd International Symposium on Music Information Retrieval, Bloomington, Indiana, USA, 2001.
- [Huron, 2002] D. Huron. Music information processing using the humdrum toolkit: Concepts, examples, and lessons. Computer Music Journal, 26(2): 11-26, 2002.
- [Kameoka et al., 2007] H. Kameoka, T. Nishimoto, and S. Sagayama. A multipitch analyzer based on harmonic temporal structured clustering. IEEE Transactions on Audio, Speech and Language Processing, 15(3): 982-994, 2007.
- [Klapuri, 2008] A. Klapuri. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Transactions on Audio, Speech and Language Processing, 16(2): 255-266, 2008.
- [Klapuri et al., 2006] Anssi Klapuri, Antti Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1): 342-355, 2006.
- [Laroche, 2003] J. Laroche. Efficient tempo and beat tracking in audio recordings. Journal of the Audio Engineering Society, 51(4): 226-233, 2003.
- [Levy and Sandler, 2008] Mark Levy and Mark Sandler. Structural segmentation of musical audio by constrained clustering. IEEE Transactions on Audio, Speech and Language Processing, 16(2): 318-326, 2008.
- [Logan and Chu, 2000] Beth Logan and Stephen Chu. Music summarization using key phrases. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume II, pp. 749-752, Istanbul, Turkey, 2000.
- [Loscos et al., 1999] A. Loscos, P. Cano, and J. Bonada. Low-delay singing voice alignment to text. In Proceedings of International Computer Music Conference, p. 23, Beijing, China, 1999.
- [Mauch and Dixon, 2010] Matthias Mauch and Simon Dixon. Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6): 1280-1289, 2010.
- [McKay and Fujinaga, 2006] C. McKay and I. Fujinaga. jSymbolic: A feature extractor for MIDI files. In Proceedings of the 7th International Conference on Music Information Retrieval, pp. 302-5, Victoria, Canada, 2006.
- [Orio and Schwarz, 2001] Nicola Orio and Diemo Schwarz. Alignment of monophonic and polyphonic music to a score. In Proceedings of International Computer Music Conference, Havana, Cuba, 2001.
- [Papadopoulos and Peeters, 2010] Hélène Papadopoulos and Geoffroy Peeters. Joint estimation of chords and downbeats from an audio signal. IEEE Transactions on Audio, Speech and Language Processing, 19(1): 138-152, 2010.
- [Paulus and Klapuri, 2009] Jouni Paulus and Anssi Klapuri. Music structure analysis using a probabilistic fitness measure and a greedy search algorithm. IEEE Transactions on Audio, Speech and Language Processing, 17(6): 1159-1170, 2009.
- [Paulus et al., 2010] Jouni Paulus, Meinard Müller, and Anssi Klapuri. Audio-based music structure analysis. In Proceedings of the 11th International Society for Music Information Retrieval Conference, Utrecht, The Netherlands, 2010.
- [Pauwels and Martens, 2010] Johan Pauwels and Jean-Pierre Martens. Integrating musicological knowledge into a probabilistic system for chord and key extraction. In Proceedings of Audio Engineering Society 128th Convention, London, UK, 2010.
- [Peeters, 2004] Geoffroy Peeters. Deriving Musical Structures from Signal Analysis for Music Audio Summary Generation: Sequence and State Approach, pp. 142-165, Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, 2004.
- [Peeters and Papadopoulos, 2011] Geoffroy Peeters and Hélène Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: theory and large-scale evaluation. IEEE Transactions on Audio, Speech and Language Processing, 19(6): 1754-1769, 2011.
- [Puckette, 1990] M. Puckette. Explode: A user interface for sequencing and score following. In Proceedings of International Computer Music Conference, pp. 259-261, Glasgow, Scotland, 1990.
- [Raphael, 1999] C. Raphael. Automatic segmentation of acoustic musical signals using hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4): 360-370, 1999.
- [Ryynänen and Klapuri, 2008] M.P. Ryynänen and A.P. Klapuri. Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal, 32(3): 72-86, 2008.
- [Salamon and Gómez, 2012] J. Salamon and E. Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6): 1759-1770, 2012.
- [Scheirer, 1998] Eric Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1): 588-601, 1998.
- [Vercoe, 1984] B. Vercoe. The synthetic performer in the context of live performance. In Proceedings of International Computer Music Conference, pp. 199-200, Paris, France, 1984.
- [Vincent et al., 2008] E. Vincent, N. Bertin, and R. Badeau. Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 109-112, Las Vegas, Nevada, USA, 2008.
- [Wang et al., 2004] Y. Wang, M.Y. Kan, T.L. Nwe, A. Shenoy, and J. Yin. LyricAlly: automatic synchronization of acoustic musical signals and textual lyrics. In Proceedings of the ACM Multimedia, pp. 212-219. ACM, 2004.
- [Yeh et al., 2010] Chunghsin Yeh, Axel Roebel, and Xavier Rodet. Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals. IEEE Transactions on Audio, Speech and Language Processing, 18(6), 2010.
Challenges
- Separate the various sources of an audio signal. The separation of the various sources of an audio track (source separation) facilitates its conversion to a symbolic representation (including the score and the instrument names). Conversely, prior knowledge of this symbolic information (score and/or instruments) facilitates the separation of the sources. Despite the efforts made over the last decades, efficient source separation and multi-pitch estimation algorithms are still lacking. Alternative strategies, such as collaborative estimation, should therefore be explored in order to achieve both tasks.
- Jointly estimate the musical concepts. In a music piece, many of the different parameters are inter-dependent (notes often start on beat or tatum positions, pitches most likely belong to the local key). Holistic/joint estimation should be considered to improve algorithm performance, and the associated computational issues should be solved.
- Develop style-specific musical representations and estimation algorithms. Depending on the music style, different types of representation may be used (e.g. full score for classical music and lead sheets for jazz). Based on previous knowledge of the music style, a priori information may be used to help the estimation of the relevant musical concepts.
- Consider non-Western notation systems. Currently, most analyses are performed from the point of view of Western symbolic notation. Dependence of our algorithms on this system should be made explicit. Other notation systems, other informative and user-adapted music representations, possibly belonging to other music cultures, should be considered, and taken into account by our algorithms.
- Compute values for the reliability of musical concept estimation. Many musical concepts (such as multi-pitch or tempo) are obtained through "estimation" (as opposed to, say, MFCCs, which result from a fixed cascade of mathematical operators). The values obtained by these estimations may therefore be wrong. The challenge is to enable algorithms to compute a measure of the reliability of their estimation, i.e. how confident the algorithm is in its own result (a minimal sketch of such a confidence value is given after this list). From a research point of view, this will allow the use of this "uncertainty" estimate in a higher-level system. From an exploitation point of view, this will allow the use of these estimations for automatically tagging music without human intervention.
- Take into account reliability in systems. Estimations of musical concepts (such as multi-pitch or beat) can be used to derive higher-level musical analysis. We should study how the uncertainty of the estimation of these musical concepts can be taken into account in higher-level algorithms.
- Develop user-assisted systems. If it is not possible to estimate the musical concepts fully automatically, then a challenge is to study how this can be done interactively with the user (using relevance feedback).
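As a loose illustration of the reliability values mentioned in the challenge on estimation confidence above (a sketch only, not an established measure; the scores could be, for instance, the 24 key-correlation scores of a template-matching key estimator):

```python
def estimation_confidence(scores):
    """Crude reliability value: the margin between the best and the
    second-best hypothesis score. A small margin flags an ambiguous,
    hence unreliable, estimation."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]

# Placeholder scores for two cases: a clear-cut key and an ambiguous one.
print(estimation_confidence({"C major": 0.82, "A minor": 0.45}))  # ~0.37 -> reliable
print(estimation_confidence({"C major": 0.61, "G major": 0.58}))  # ~0.03 -> uncertain
```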