Music representations
Data representations impact the effectiveness of MIR systems in two ways: algorithms are limited by the types of input data they receive, and the user experience depends on the way that MIR systems present music information to the user. A major challenge is to provide abstractions which enable researchers and industry to develop algorithms that meet user needs and to present music information in a form that accords with users’ understanding of music. The same challenge applies to content providers, who need to select appropriate abstractions for structuring, visualising, and sonifying music information. These abstractions include features describing music information, be it audio, symbolic, textual, or image data; ontologies, taxonomies and folksonomies for structuring music information; graphical representations of music information; and formats for maintaining and sonifying music data. The development of standard representations will advance MIR by increasing algorithm and system interoperability between academia and industry as well as between researchers working on MIR subtasks, and will provide a satisfactory user experience by means of musically and semantically meaningful representations.
State of the art
While audio recordings capture musical performances with a high level of detail, there is no direct relationship between the individual audio samples and the experience of music, which involves notes, beats, instruments, phrases or melodies (the musicological perspective), and which might give rise to memories or emotions associated with times, places or events where identical or similar music was heard (the user perspective). Although there is a large body of research investigating the relationship between music and its meaning from the philosophical and psychological perspectives [e.g., Minsky, 1981; Robinson, 1997; Cross and Tolbert, 2008; JMM], scientific research has tended to focus more on bridging the "semantic gap" between audio recordings and the abstractions that are found in various types of musical scores, such as pitches, rhythms, melodies and harmonies. This work is known as semantic audio or audio content analysis (see section Estimation of elements related to musical concepts).
In order to facilitate the extraction of useful information from audio recordings, a standard practice is to compute intermediate representations at various levels of abstraction. At each level, features can describe an instant in time (e.g. the onset time of a note), a segment or time interval (e.g. the duration of a chord) or the whole piece (e.g. the key of a piece). Various sets of features and methods for evaluating their appropriateness have been catalogued in the MIR literature [McKinney and Breebaart, 2003; Peeters, 2004; Kim et al., 2005; McEnnis et al., 2005; Pachet and Roy, 2007; Mitrovic et al., 2010].
Low-level features relate directly to signal properties and are computed according to simple formulae. Examples are the zero-crossing rate, spectral centroid and global energy of the signal. Time-domain features such as the amplitude envelope and attack time are computed without any frequency transform being applied to the signal, whereas spectral features such as centroid, spread, flatness, skewness, kurtosis and slope require a time-frequency representation such as the short-time Fourier transform (STFT), the constant-Q transform (CQT) [Brown, 1991] or the wavelet transform [Mallat, 1999] to be applied as a first processing step. Auditory model-based representations [Meddis and Hewitt, 1992] are also commonly used as a front-end for MIR research.
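As a minimal NumPy sketch of two of these low-level features, the code below computes the zero-crossing rate and the spectral centroid of a single analysis frame; the frame length, window and test signal are illustrative choices rather than fixed conventions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive samples whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one windowed analysis frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

# Toy example: a 440 Hz sine at 44.1 kHz, analysed in a 2048-sample frame.
sr = 44100
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)
frame = signal[:2048]
print(zero_crossing_rate(frame), spectral_centroid(frame, sr))
```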
Mid-level features (e.g. pitches and onset times of notes) are characterised by more complex computations, where the algorithms employed are not always successful at producing the intended results. Typically a modelling step will be performed (e.g. sinusoidal modelling), and the choice of parameters for the model will influence results. For example, in Spectral Modelling Synthesis [Serra and Smith, 1990], the signal is explained in terms of sinusoidal partial tracks created by tracking spectral peaks across analysis frames, plus a residual signal which contains the non-sinusoidal content. The thresholds and rules used to select and group the spectral peaks determine the amount of the signal which is interpreted as sinusoidal. This flexibility means that the representation with respect to such a model is not unique, and the optimal choice of parameters is dependent on the task for which the representation will be used.
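To illustrate the peak-picking step that underlies sinusoidal models such as SMS, the sketch below selects local spectral maxima above a magnitude threshold; the threshold value is purely illustrative, and full analysis/synthesis systems add refinements such as parabolic peak interpolation and frame-to-frame partial tracking.

```python
import numpy as np

def spectral_peaks(frame, sample_rate, threshold_db=-60.0):
    """Return (frequency, magnitude-dB) pairs for local maxima above a threshold.

    The threshold (and any continuation rules applied across frames) determines
    how much of the signal is treated as sinusoidal; the remainder becomes the
    residual component of the model.
    """
    windowed = frame * np.hanning(len(frame))
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    peaks = []
    for k in range(1, len(mag_db) - 1):
        if mag_db[k] > mag_db[k - 1] and mag_db[k] > mag_db[k + 1] and mag_db[k] > threshold_db:
            peaks.append((freqs[k], mag_db[k]))
    return peaks
```

Raising or lowering `threshold_db` directly changes how much of the frame is explained as sinusoidal content, which is exactly the parameter sensitivity described above.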
High-level features (e.g. genre, tonality, rhythm, harmony and mood) correspond to the terms and concepts used by musicians or listeners to describe aspects of music. To generate such features, the models employed tend to be more complex, and might include a classifier trained on a relevant data set, or a probabilistic model such as a hidden Markov model (HMM) or dynamic Bayesian network (DBN). Automatic extraction of high-level features is not reliable, which means that in practice there is a tradeoff between the expressiveness of the features (e.g. number of classes they describe) and the accuracy of the feature computation.
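As a schematic example of deriving a high-level label from lower-level descriptors, the sketch below trains a generic classifier (an SVM from scikit-learn) on a hypothetical table of per-track feature vectors and genre labels; the data are random placeholders, not a real MIR dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: one row of low/mid-level features per track, one label per track.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 20))   # e.g. summary statistics of MFCCs and chroma
labels = rng.integers(0, 4, size=200)   # e.g. four genre classes

# A trained classifier maps feature vectors to a high-level label such as genre.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(model, features, labels, cv=5).mean())
```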
It should also be noted that the classification of features into categories such as "high-level" is not an absolute judgement, and some shift in usage is apparent, resulting from the search for ever higher levels of abstraction in signal descriptors. Thus features which might have been described as high-level a decade ago might now be considered to be mid-level features. Features are also sometimes described in terms of the models used to compute them, such as psychoacoustic features (e.g. roughness, loudness and sharpness), which are based on auditory models. Some features have been standardised, e.g. in the MPEG7 standard [Kim et al., 2005]. Another form of standardisation is the use of ontologies to capture the semantics of data representations and to support automatic reasoning about features, such as the Audio Feature Ontology proposed by Fazekas [2010].
In addition to the literature discussing feature design for various MIR tasks, another strand of research investigates the automatic generation of features [e.g., Pachet and Roy, 2009]. In this pragmatic approach, candidate features are built from combinations of simple operators and tested on the training data in order to select the most suitable ones. More recently, deep learning techniques have been used for automatic feature learning in MIR tasks [Humphrey et al., 2012], where they have been reported to be superior to hand-crafted feature sets for classification tasks, although these results have not yet been replicated in MIREX evaluations. It should be noted, however, that automatically generated features might not be musically meaningful, which limits their usefulness.
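The sketch below illustrates the general idea (not the specific algorithm of Pachet and Roy) of enumerating candidate features as compositions of simple operators and scoring them on training data; the operators, the correlation-based score and the data are illustrative stand-ins.

```python
import itertools
import numpy as np

# Simple operators; candidate features are compositions of these applied to a frame.
OPS = {
    "abs":  np.abs,
    "diff": lambda x: np.diff(x, prepend=x[0]),
    "sq":   np.square,
}
REDUCERS = {"mean": np.mean, "std": np.std, "max": np.max}

def candidate_features(max_depth=2):
    """Enumerate operator chains, each ending in a scalar reducer."""
    for depth in range(1, max_depth + 1):
        for chain in itertools.product(OPS, repeat=depth):
            for reducer in REDUCERS:
                yield chain + (reducer,)

def evaluate(chain, frames, labels):
    """Score a candidate by absolute correlation with the labels (a crude
    stand-in for wrapper-style selection on training data)."""
    values = []
    for frame in frames:
        x = frame
        for name in chain[:-1]:
            x = OPS[name](x)
        values.append(REDUCERS[chain[-1]](x))
    values = np.asarray(values)
    if np.std(values) == 0:
        return 0.0
    return abs(np.corrcoef(values, labels)[0, 1])

# Placeholder training data: random frames and binary labels.
rng = np.random.default_rng(1)
frames = [rng.normal(size=1024) for _ in range(50)]
labels = rng.integers(0, 2, size=50)
best = max(candidate_features(), key=lambda c: evaluate(c, frames, labels))
print(best)
```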
Much music information is not in the form of audio recordings, but rather symbolic representations of the pitch, timing, dynamics and/or instrumentation of each of the notes. There are various ways such a representation can arise. First, via the composition process: for example, when music notation software is employed, a score can be created that instructs the musicians how to perform the piece. Alternatively, a score might be created via a process of transcription (automatic or manual) of a musical performance. For electronic music, the programming or performance using a sequencer or synthesiser could result in an explicit or implicit score. For example, electronic dance music can be generated, recorded, edited and mixed in the digital domain using audio editing, synthesis and sequencing software, and in this case the software’s own internal data format(s) can be considered to be an implicit score representation.
In each of these cases the description (or prescription) of the notes played might be complete or incomplete. In the Western classical tradition, it is understood that performers have a certain degree of freedom in creating their rendition of a composition, which may involve the choice of tempo, dynamics and articulation, or also ornamentation and sometimes even the notes to be played for an entire section of a piece (an improvised cadenza). Likewise in Western pop and jazz music, a work is often described in terms of a sequence of chord symbols, the melody and the lyrics; the parts of each instrument are then rehearsed or improvised according to the intended style of the music. In these cases, the resulting score can be considered to be an abstract representation of the underlying musical work. One active topic in MIR research is the reduction of a music score to a higher-level, abstract representation [Marsden, 2010]. However, not all styles of music are based on the traditional Western score. For example, freely improvised and many non-Western musics might have no score before a performance and no established language for describing the performance after the fact.
A further type of music information is textual data, which includes both structured data such as catalogue metadata and unstructured data such as music reviews and tags associated with recordings by listeners. Structured metadata might describe the composers, performers, musical works, dates and places of recordings, instrumentation, as well as key, tempo, and onset times of individual notes. Digital libraries use metadata standards such as Dublin Core and models such as the Functional Requirements for Bibliographic Records (FRBR) to organise catalogue and bibliographic databases. To assist interoperability between data formats and promote the possibility of automatic inference from music metadata, ontologies have been developed such as the Music Ontology [Raimond et al., 2007].
Another source of music information is image data from digitised handwritten or printed music scores. For preserving, distributing, and analysing such information, systems for optical music recognition (OMR) have been under development for several years [Rebelo et al., 2012]. As with audio recordings, intermediate representations at various abstraction levels are computed for digitised scores. The lowest-level representation consists of raw pixels from a digitised grayscale score, from which low-level features such as staff line thickness and vertical line distance are computed. Mid-level features include segmented (but not recognised) symbols, while higher-level features include interpreted symbols and information about connected components or symbol orientation. In order to formalise these abstractions, grammars are employed to represent allowed combinations of symbols.
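A common heuristic for the low-level score features named above is to histogram vertical run lengths in the binarised image; the sketch below follows that general idea and is not tied to any particular OMR system.

```python
import numpy as np
from collections import Counter

def staffline_stats(binary_image):
    """Estimate staff-line thickness and staff-space height from a binarised score.

    binary_image: 2-D array, 1 = black (ink), 0 = white. The most frequent vertical
    black run length approximates staff line thickness; the most frequent white run
    length approximates the vertical distance between lines.
    """
    black_runs, white_runs = Counter(), Counter()
    for column in binary_image.T:               # scan each pixel column top to bottom
        run_value, run_length = column[0], 0
        for pixel in column:
            if pixel == run_value:
                run_length += 1
            else:
                (black_runs if run_value else white_runs)[run_length] += 1
                run_value, run_length = pixel, 1
        (black_runs if run_value else white_runs)[run_length] += 1
    thickness = black_runs.most_common(1)[0][0] if black_runs else 0
    spacing = white_runs.most_common(1)[0][0] if white_runs else 0
    return thickness, spacing
```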
Looking beyond the conceptual organisation of the data, we briefly address its organisation into specific file formats, and the development and maintenance of software to read, write and translate between these formats. For audio data, two types of representations are used: uncompressed and compressed. Uncompressed (or pulse code modulated, PCM) data consists of just the audio samples for each channel, usually prepended by a short header which specifies basic metadata such as the file format, sampling rate, word size and number of channels. Compression algorithms convert the audio samples into model parameters which describe each block of audio, and these parameters are stored instead of the audio samples, again with a header containing basic metadata. Common audio file formats such as WAV, which is usually associated with PCM data, act as containers that can hold a large variety of audio representations. The MP3 format (formally MPEG-1/MPEG-2 Audio Layer III) uses lossy audio compression and is common for consumer audio storage; the use of MP3 files in MIR research has increased in recent years due to the emergence of large-scale datasets. Standard open source software libraries such as libsndfile are available for reading and writing common non-proprietary formats, but some file formats are difficult to support with open source software due to the license required to implement an encoder.
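As an example of the basic header metadata carried by uncompressed files, the sketch below reads a PCM WAV header with Python's standard wave module; the file path is a placeholder, and a libsndfile wrapper such as the soundfile package would serve for a wider range of formats.

```python
import wave

# Read header metadata from an uncompressed PCM WAV file ("example.wav" is a
# placeholder path).
with wave.open("example.wav", "rb") as wav_file:
    print("channels:   ", wav_file.getnchannels())
    print("sample rate:", wav_file.getframerate())
    print("word size:  ", wav_file.getsampwidth() * 8, "bits")
    print("num frames: ", wav_file.getnframes())
```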
For symbolic music data, a popular file format is MIDI (musical instrument digital interface), but this is limited in expressiveness and scope, as it was originally designed for keyboard instrument sequencing. For scores, richer formats such as MusicXML or MEI (Music Encoding Initiative) are required; these are XML-based representations that include information such as note spelling and layout. For guitar "tabs" (a generic term covering tablature as well as chord symbols with or without lyrics), free text is still commonly used, with no standard format, although software has been developed which can parse the majority of such files [Macrae and Dixon, 2011]. Some tab web sites have developed their own formats using HTML or XML for markup of the text files. Other text formats such as the MuseData and Humdrum kern formats [Selfridge-Field, 1997] have been used extensively for musicological analysis of corpora of scores.
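Because MusicXML is plain XML, note-level information can be extracted with a generic XML parser; the sketch below pulls pitch and duration from an uncompressed MusicXML file (the path is a placeholder, and namespaces, chords, ties and compressed .mxl packaging are ignored).

```python
import xml.etree.ElementTree as ET

# Minimal sketch: list pitch and duration of each note in a MusicXML file.
tree = ET.parse("score.xml")   # placeholder path
for note in tree.getroot().iter("note"):
    pitch = note.find("pitch")
    duration = note.find("duration")
    if pitch is not None and duration is not None:
        step = pitch.findtext("step")
        octave = pitch.findtext("octave")
        alter = pitch.findtext("alter") or "0"
        print(f"{step}{octave} (alter {alter}), duration {duration.text}")
```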
For structured metadata, formats such as XML are commonly used, and in particular semantic web formats for linked data such as RDFa, RDF/XML, N3 and Turtle are employed. Since these are intended as machine-readable formats rather than for human consumption, the particular format chosen is less important than the underlying ontology which provides the semantics for the data. For image data, OMR systems typically process sheet music scanned at 300 dpi resolution, producing output in expMIDI (expressive MIDI), MusicXML or NIFF (Notation Interchange File Format) formats.
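The sketch below uses the rdflib library to express a couple of metadata triples against the Music Ontology namespace and to serialise them both as Turtle and as RDF/XML, illustrating that the triples, not the chosen serialisation, carry the semantics; the track URI and title are made-up examples.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

MO = Namespace("http://purl.org/ontology/mo/")          # Music Ontology
DC = Namespace("http://purl.org/dc/elements/1.1/")      # Dublin Core

g = Graph()
g.bind("mo", MO)
g.bind("dc", DC)
track = URIRef("http://example.org/track/1")             # made-up identifier
g.add((track, RDF.type, MO.Track))
g.add((track, DC.title, Literal("An Example Track")))

# serialize() returns a str in rdflib 6+ (bytes in earlier versions).
print(g.serialize(format="turtle"))   # human-readable Turtle
print(g.serialize(format="xml"))      # the same triples as RDF/XML
```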
Finally, although music exists primarily in the auditory domain, there is a long tradition of representing music in various graphical formats. Common Western music notation is a primary example, but piano-roll notation, spectrograms and chromagrams also present musical information in potentially useful formats. Since music is a time-based phenomenon, it is common to plot the evolution of musical parameters as a function of time, such as tempo and dynamics curves, which have been used extensively in performance research [Desain and Honing, 1991]. Simultaneous representations of two or more temporal parameters have been achieved using animation, for example the Performance Worm [Dixon et al., 2002], which shows the temporal evolution of tempo and loudness as a trajectory in a two-dimensional space. Other visualisations include similarity matrices for audio alignment and structural segmentation [Müller et al., 2011] and various representations for analysis of harmony and tonality [e.g. Sapp, 2012].
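As a simple example of one such time-frequency visualisation, the sketch below plots a spectrogram of a synthetic test tone with matplotlib; a chromagram or a plot of real audio would follow the same pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic test tone with rising pitch, used here only to have something to plot.
sr = 22050
t = np.arange(5 * sr) / sr
signal = np.sin(2 * np.pi * (220 + 40 * t) * t)

# Spectrogram: time on the x-axis, frequency on the y-axis, magnitude as colour.
plt.specgram(signal, NFFT=2048, Fs=sr, noverlap=1024)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram (time-frequency representation)")
plt.show()
```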
References
- [Brown, 1991] J.C. Brown. Calculation of a Constant-Q Spectral Transform. Journal of the Acoustical Society of America, 89 (1): 425-434, 1991.
- [Cross and Tolbert, 2008] I. Cross and E. Tolbert. Music and Meaning. In S. Hallam, I. Cross, and M. Thaut, editors: The Oxford Handbook of Music Psychology, Oxford University Press, 2008.
- [Desain and Honing, 1991] P. Desain and H. Honing. Tempo Curves Considered Harmful: A Critical Review of the Representation of Timing in Computer Music. In Proceedings of the International Computer Music Conference, pp. 143-149, Montreal, Canada, 1991.
- [Dixon et al., 2002] S. Dixon, W. Goebl and G. Widmer. The Performance Worm: Real Time Visualisation of Expression Based on Langner’s Tempo-Loudness Animation. In Proceedings of the International Computer Music Conference, pp. 361-364, Gothenburg, Sweden, 2002.
- [Fazekas, 2010] G. Fazekas. Audio Features Ontology, 2010. http://www.omras2.org/AudioFeatures
- [Humphrey et al., 2012] E.J. Humphrey, J.P. Bello, and Y. LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the 13th International Society for Music Information Retrieval Conference, pp. 403-408, Porto, Portugal, 2012.
- [JMM] The Journal of Music and Meaning. URL: www.musicandmeaning.net
- [Kim et al., 2005] H.G. Kim, N. Moreau and T. Sikora. MPEG7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley and Sons, 2005.
- [Macrae and Dixon, 2011] R. Macrae and S. Dixon. Guitar Tab Mining, Analysis and Ranking. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 453-458, Miami, Florida, USA, 2011.
- [Mallat, 1999] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, San Diego, CA, USA, 1999.
- [Marsden, 2010] A. Marsden. Schenkerian analysis by computer: A proof of concept. Journal of New Music Research, 39(3): 269-289, 2010.
- [McEnnis et al., 2005] D. McEnnis, C. McKay, I. Fujinaga and P. Depalle. JAudio: A Feature Extraction Library. In Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, 2005.
- [McKinney and Breebaart, 2003] M. F. McKinney and J. Breebaart. Features for Audio and Music Classification. In Proceedings of the 4th International Conference on Music Information Retrieval, Maryland, USA, 2003.
- [Meddis and Hewitt, 1992] R. Meddis and M.J. Hewitt. Modelling the identification of concurrent vowels with different fundamental frequencies. Journal of the Acoustical Society of America, 91(1): 233-245, 1992.
- [Minsky, 1981] M. Minsky. Music, Mind, and Meaning. Computer Music Journal, 5(3), 1981.
- [Mitrovic et al., 2010] D. Mitrovic, M. Zeppelzauer, and C. Breiteneder. Features for content-based audio retrieval. Advances in Computers, 78: 71-150, 2010.
- [Müller et al., 2011] M. Müller, D.P.W. Ellis, A. Klapuri, and G. Richard. Signal processing for music analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6): 1088-1110, 2011.
- [Pachet and Roy, 2009] F. Pachet and P. Roy. Analytical Features: A Knowledge-Based Approach to Audio Feature Generation. EURASIP Journal on Audio, Speech, and Music Processing, 2009: 1–23, 2009.
- [Peeters, 2004] G. Peeters. A Large Set of Audio Features for Sound Description, http://recherche.ircam.fr/anasyn/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf, 2004.
- [Raimond et al., 2007] Y. Raimond, S. Abdallah, M. Sandler and F. Giasson. The Music Ontology. In Proceedings of the 8th International Conference on Music Information Retrieval, pp. 417-422, 2007.
- [Rebelo et al., 2012] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A.R.S. Marcal, C. Guedes, and J.S. Cardoso. Optical music recognition: State-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1: 173-190, 2012.
- [Robinson, 1997] J. Robinson. Music and Meaning. Cornell University Press, New York, USA, 1997.
- [Sapp, 2012] C.S. Sapp. Visual hierarchical key analysis. ACM Computers in Entertainment, 3(4), 2012.
- [Selfridge-Field, 1997] E. Selfridge-Field. Beyond MIDI: The Handbook of Musical Codes. MIT Press, 1997.
- [Serra and Smith, 1990] X. Serra and J. Smith. Spectral Modeling Synthesis: A Sound Analysis/Synthesis Based on a Deterministic plus Stochastic Decomposition. Computer Music Journal, 14(4): 12-24, 1990.
Challenges
- Investigate more musically meaningful features and representations. There is still a significant semantic gap between the representations used in MIR and the concepts and language of musicians and audiences. In particular, many of the abstractions used in MIR do not make sense to a musically trained user, as they ignore or are unable to capture essential aspects of musical communication. The challenge of designing musically meaningful representations must be overcome in order to build systems that provide a satisfactory user experience. This is particularly the case for automatically generated features, such as those produced by deep learning techniques, where the difficulty is to create features that are well suited to MIR tasks yet still interpretable by humans.
- Develop more flexible and general representations. Many representations are limited in scope and thus constrained in their expressive possibilities. For example, most representations have been created specifically for describing Western tonal music. Although highly constrained representations might provide advantages in terms of simplicity and computational complexity, such specificity means that new representations have to be developed for each new task, which inhibits rapid prototyping and testing of new ideas. Thus there is a need to create representations and abstractions which are sufficiently adaptable, flexible and general to cater for the full range of music styles and cultures, as well as for unforeseen musical tasks and situations.
- Determine the most appropriate representation for each application. For some use cases it is not beneficial to use the most general representation, as domain- or task-specific knowledge might aid the analysis and interpretation of data. However, there is no precise methodology for developing or choosing representations, and existing "best practice" covers only a small proportion of the breadth of musical styles, creative ideas and contexts for which representations might be required.
- Unify formats and improve system interoperability. The wealth of different standards and formats creates a difficulty for service providers who wish to create seamless systems with a high degree of interoperability with other systems and for researchers who want to experiment with software and data from disparate sources. By encouraging the use of open standards, common platforms, and formats that promote semantic as well as syntactic interoperability, system development will be simpler and more efficient.
- Extend the scope of existing ontologies. Existing ontologies cover only a small fraction of musical terms and concepts, so an important challenge is to extend these ontologies to describe all types of music-related information, covering diverse music cultures, communities and styles. These ontologies must also be linked to existing ontologies within and outside of the MIR community in order to gain maximum benefit from the data which is structured according to the ontologies.
- Create compact representations that can be efficiently used for large-scale music analysis. It is becoming increasingly important that representations facilitate processing of the vast amounts of music data that exist in current and future collections, for example, by supporting efficient indexing, search and retrieval of music data.
- Develop and integrate representations for multimodal data. In order to facilitate content-based retrieval and browsing applications, representations are required that enable comparison and combination of data from diverse modalities, including audio, video and gesture data.