Music representations


Data representations impact the effectiveness of MIR systems in two ways: algorithms are limited by the types of input data they receive, and the user experience depends on the way that MIR systems present music information to the user. A major challenge is to provide abstractions which enable researchers and industry to develop algorithms that meet user needs and to present music information in a form that accords with users’ understanding of music. The same challenge applies to content providers, who need to select appropriate abstractions for structuring, visualising, and sonifying music information. These abstractions include features describing music information, be it audio, symbolic, textual, or image data; ontologies, taxonomies and folksonomies for structuring music information; graphical representations of music information; and formats for maintaining and sonifying music data. The development of standard representations will advance MIR by increasing algorithm and system interoperability between academia and industry as well as between researchers working on MIR subtasks, and will provide a satisfactory user experience by means of musically and semantically meaningful representations.


Back to → Roadmap:Technological perspective



State of the art

While audio recordings capture musical performances with a high level of detail, there is no direct relationship between the individual audio samples and the experience of music, which involves notes, beats, instruments, phrases or melodies (the musicological perspective), and which might give rise to memories or emotions associated with times, places or events where identical or similar music was heard (the user perspective). Although there is a large body of research investigating the relationship between music and its meaning from the philosophical and psychological perspectives [e.g., Minsky, 1981; Robinson, 1997; Cross and Tolbert, 2008; JMM], scientific research has tended to focus more on bridging the "semantic gap" between audio recordings and the abstractions that are found in various types of musical scores, such as pitches, rhythms, melodies and harmonies. This work is known as semantic audio or audio content analysis (see section Estimation of elements related to musical concepts).

In order to facilitate the extraction of useful information from audio recordings, a standard practice is to compute intermediate representations at various levels of abstraction. At each level, features can describe an instant in time (e.g. the onset time of a note), a segment or time interval (e.g. the duration of a chord) or the whole piece (e.g. the key of a piece). Various sets of features and methods for evaluating their appropriateness have been catalogued in the MIR literature [McKinney and Breebaart, 2003; Peeters, 2004; Kim et al., 2005; McEnnis et al., 2005; Pachet and Roy, 2007; Mitrovic et al., 2010].

Low-level features relate directly to signal properties and are computed according to simple formulae. Examples are the zero-crossing rate, spectral centroid and global energy of the signal. Time-domain features such as the amplitude envelope and attack time are computed without any frequency transform being applied to the signal, whereas spectral features such as centroid, spread, flatness, skewness, kurtosis and slope require a time-frequency representation such as the short-time Fourier transform (STFT), the constant-Q transform (CQT) [Brown, 1991] or the wavelet transform [Mallat, 1999] to be applied as a first processing step. Auditory model-based representations [Meddis and Hewitt, 1992] are also commonly used as a front-end for MIR research.
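
As a concrete illustration, the following sketch (assuming NumPy, a mono signal and a known sampling rate; the function names are ours) computes two of these low-level features, the zero-crossing rate and the spectral centroid, on a framewise basis.

    import numpy as np

    def frame_signal(x, frame_size=2048, hop_size=512):
        """Slice a 1-D signal into overlapping frames."""
        n_frames = 1 + max(0, (len(x) - frame_size) // hop_size)
        return np.stack([x[i * hop_size : i * hop_size + frame_size]
                         for i in range(n_frames)])

    def zero_crossing_rate(frames):
        """Fraction of adjacent sample pairs whose signs differ, per frame."""
        return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    def spectral_centroid(frames, sr):
        """Magnitude-weighted mean frequency (Hz) of each windowed frame."""
        spectra = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
        freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
        return (spectra @ freqs) / (spectra.sum(axis=1) + 1e-12)

    # Toy input: one second of a 440 Hz sine wave at 44.1 kHz.
    sr = 44100
    x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    frames = frame_signal(x)
    print(zero_crossing_rate(frames)[0])    # roughly 2 * 440 / 44100
    print(spectral_centroid(frames, sr)[0]) # close to 440 Hz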

Mid-level features (e.g. pitches and onset times of notes) are characterised by more complex computations, where the algorithms employed are not always successful at producing the intended results. Typically a modelling step will be performed (e.g. sinusoidal modelling), and the choice of parameters for the model will influence results. For example, in Spectral Modelling Synthesis [Serra and Smith, 1990], the signal is explained in terms of sinusoidal partial tracks created by tracking spectral peaks across analysis frames, plus a residual signal which contains the non-sinusoidal content. The thresholds and rules used to select and group the spectral peaks determine the amount of the signal which is interpreted as sinusoidal. This flexibility means that the representation with respect to such a model is not unique, and the optimal choice of parameters is dependent on the task for which the representation will be used.
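
For example, a minimal sketch of the peak-picking step on which such sinusoidal models are built might look as follows (NumPy assumed; `dynamic_range_db` is an illustrative parameter name, and it is precisely this kind of threshold that determines how much of the signal is treated as sinusoidal).

    import numpy as np

    def spectral_peaks(frame, sr, dynamic_range_db=60.0):
        """Return (frequency_Hz, magnitude_dB) pairs for prominent local maxima."""
        window = np.hanning(len(frame))
        mag_db = 20 * np.log10(np.abs(np.fft.rfft(frame * window)) + 1e-12)
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        floor = mag_db.max() - dynamic_range_db   # keep peaks within this range of the maximum
        peaks = []
        for k in range(1, len(mag_db) - 1):
            if mag_db[k] > floor and mag_db[k - 1] < mag_db[k] >= mag_db[k + 1]:
                peaks.append((freqs[k], mag_db[k]))
        return peaks

    # Toy frame containing partials at 440 Hz and 880 Hz.
    sr = 44100
    t = np.arange(2048) / sr
    frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
    peaks = spectral_peaks(frame, sr)
    print(sorted(peaks, key=lambda p: -p[1])[:2])   # two strongest peaks, near 440 and 880 Hz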

High-level features (e.g. genre, tonality, rhythm, harmony and mood) correspond to the terms and concepts used by musicians or listeners to describe aspects of music. To generate such features, the models employed tend to be more complex, and might include a classifier trained on a relevant data set, or a probabilistic model such as a hidden Markov model (HMM) or dynamic Bayesian network (DBN). Automatic extraction of high-level features is not reliable, which means that in practice there is a tradeoff between the expressiveness of the features (e.g. number of classes they describe) and the accuracy of the feature computation.
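
As an illustration of this kind of pipeline (not of any specific published system), the sketch below trains a scikit-learn classifier on synthetic feature vectors to predict one of three hypothetical genre labels; in practice the feature vectors would come from the low- and mid-level descriptors discussed above.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))    # placeholder 12-dimensional feature vectors
    y = rng.integers(0, 3, size=200)  # three hypothetical genre classes

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))   # near chance level on random data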

It should also be noted that the classification of features into categories such as "high-level" is not an absolute judgement, and some shift in usage is apparent, resulting from the search for ever higher levels of abstraction in signal descriptors. Thus features which might have been described as high-level a decade ago might now be considered to be mid-level features. Also, features are sometimes described in terms of the models used to compute them, such as psychoacoustic features (e.g. roughness, loudness and sharpness) which are based on auditory models. Some features have been standardised, e.g. in the MPEG-7 standard [Kim et al., 2005]. Another form of standardisation is the use of ontologies to capture the semantics of data representations and to support automatic reasoning about features, such as the Audio Feature Ontology proposed by Fazekas [2010].

In addition to the literature discussing feature design for various MIR tasks, another strand of research investigates the automatic generation of features [e.g., Pachet and Roy, 2009]. This is a pragmatic approach in which candidate features are generated from combinations of simple operators and tested on the training data in order to select the most suitable ones. More recently, deep learning techniques have been used for automatic feature learning in MIR tasks [Humphrey et al., 2012], where they have been reported to be superior to hand-crafted feature sets for classification tasks, although these results have not yet been replicated in MIREX evaluations. It should be noted, however, that automatically generated features might not be musically meaningful, which limits their usefulness.
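
The sketch below is written in the spirit of such operator-composition approaches rather than reproducing any published algorithm: candidate features are formed by chaining simple operators, then scored on toy training data using a class-separation criterion.

    import itertools
    import numpy as np

    OPERATORS = {
        "abs": np.abs,
        "diff": lambda x: np.diff(x, append=x[-1]),
        "square": np.square,
    }
    REDUCERS = {"mean": np.mean, "std": np.std, "max": np.max}

    def make_feature(op_names, reducer_name):
        """Compose a chain of operators followed by a reduction to a scalar."""
        def feature(x):
            for name in op_names:
                x = OPERATORS[name](x)
            return REDUCERS[reducer_name](x)
        return feature

    def fisher_score(values, labels):
        """Simple class-separation score for a 1-D feature."""
        a, b = values[labels == 0], values[labels == 1]
        return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

    # Toy training data: class 0 = noise, class 1 = noisy sine waves.
    rng = np.random.default_rng(1)
    signals = [rng.normal(size=512) for _ in range(20)] + \
              [np.sin(np.linspace(0, 50, 512)) + 0.3 * rng.normal(size=512) for _ in range(20)]
    labels = np.array([0] * 20 + [1] * 20)

    candidates = [(ops, red) for ops in itertools.permutations(OPERATORS, 2) for red in REDUCERS]
    scored = []
    for ops, red in candidates:
        values = np.array([make_feature(ops, red)(s) for s in signals])
        scored.append((fisher_score(values, labels), ops, red))
    print(max(scored))   # the best-scoring operator composition on this toy data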

Much music information is not in the form of audio recordings, but rather symbolic representations of the pitch, timing, dynamics and/or instrumentation of each of the notes. Such representations can arise in various ways. First, a score may be created during the composition process, for example using music notation software, to instruct the musicians how to perform the piece. Alternatively, a score might be created via a process of transcription (automatic or manual) of a musical performance. For electronic music, programming or performance using a sequencer or synthesiser can result in an explicit or implicit score. For example, electronic dance music can be generated, recorded, edited and mixed in the digital domain using audio editing, synthesis and sequencing software, and in this case the software’s own internal data format(s) can be considered to be an implicit score representation.

In each of these cases, the description (or prescription) of the notes played might be complete or incomplete. In the Western classical tradition, it is understood that performers have a certain degree of freedom in creating their rendition of a composition, which may involve the choice of tempo, dynamics and articulation, as well as ornamentation and sometimes even the notes to be played for an entire section of a piece (an improvised cadenza). Likewise, in Western pop and jazz music, a work is often described in terms of a sequence of chord symbols, the melody and the lyrics; the parts of each instrument are then rehearsed or improvised according to the intended style of the music. In these cases, the resulting score can be considered to be an abstract representation of the underlying musical work. One active topic in MIR research is the reduction of a music score to a higher-level, abstract representation [Marsden, 2010]. However, not all styles of music are based on the traditional Western score. For example, freely improvised and many non-Western musics might have no score before a performance and no established language for describing the performance after the fact.

A further type of music information is textual data, which includes both structured data such as catalogue metadata and unstructured data such as music reviews and tags associated with recordings by listeners. Structured metadata might describe the composers, performers, musical works, dates and places of recordings, instrumentation, as well as key, tempo, and onset times of individual notes. Digital libraries use metadata standards such as Dublin Core and models such as the Functional Requirements for Bibliographic Records (FRBR) to organise catalogue and bibliographic databases. To assist interoperability between data formats and promote the possibility of automatic inference from music metadata, ontologies have been developed such as the Music Ontology [Raimond et al., 2007].
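
As an illustration of this kind of structured description, the following sketch (assuming the rdflib library) builds a few triples using the Music Ontology and FOAF vocabularies; the resource URIs and property choices are invented for the example.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, RDF, DCTERMS

    MO = Namespace("http://purl.org/ontology/mo/")   # Music Ontology namespace

    g = Graph()
    g.bind("mo", MO)
    g.bind("foaf", FOAF)

    artist = URIRef("http://example.org/artist/1")   # hypothetical resource URIs
    track = URIRef("http://example.org/track/1")

    g.add((artist, RDF.type, MO.MusicArtist))
    g.add((artist, FOAF.name, Literal("Example Artist")))
    g.add((track, RDF.type, MO.Track))
    g.add((track, DCTERMS.title, Literal("Example Track")))
    g.add((track, FOAF.maker, artist))

    print(g.serialize(format="turtle"))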

Another source of music information is image data from digitised handwritten or printed music scores. For preserving, distributing, and analysing such information, systems for optical music recognition (OMR) have been under development for several years [Rebelo et al., 2012]. As with audio recordings, intermediate representations at various levels of abstraction are computed for digitised scores. The lowest-level representation consists of the raw pixels of a digitised greyscale score, from which low-level features such as staff line thickness and vertical line distance are computed. Mid-level features include segmented (but not yet recognised) symbols, while higher-level features include interpreted symbols and information about connected components or symbol orientation. In order to formalise these abstractions, grammars are employed to represent the allowed combinations of symbols.
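
A common way of obtaining such low-level measurements is to take the most frequent lengths of vertical black and white pixel runs in a binarised score image; the sketch below (NumPy assumed, with a synthetic image standing in for a scanned score) illustrates the idea.

    import numpy as np
    from collections import Counter

    def vertical_run_lengths(image, value):
        """Lengths of vertical runs of `value` in each column of a boolean image."""
        runs = []
        for col in image.T:
            length = 0
            for pixel in col:
                if pixel == value:
                    length += 1
                elif length:
                    runs.append(length)
                    length = 0
            if length:
                runs.append(length)
        return runs

    def estimate_staff_metrics(image):
        """Staff line thickness and inter-line gap as the most common run lengths."""
        line_thickness = Counter(vertical_run_lengths(image, True)).most_common(1)[0][0]
        line_gap = Counter(vertical_run_lengths(image, False)).most_common(1)[0][0]
        return line_thickness, line_gap

    # Synthetic image (True = black pixel): five 2-pixel-thick staff lines, 8-pixel gaps.
    image = np.zeros((60, 200), dtype=bool)
    for top in range(5, 55, 10):
        image[top:top + 2, :] = True
    print(estimate_staff_metrics(image))   # -> (2, 8)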

Looking beyond the conceptual organisation of the data, we briefly address its organisation into specific file formats, and the development and maintenance of software to read, write and translate between these formats. For audio data, two types of representations are used: uncompressed and compressed. Uncompressed (or pulse code modulated, PCM) data consists of just the audio samples for each channel, usually preceded by a short header which specifies basic metadata such as the file format, sampling rate, word size and number of channels. Compression algorithms convert the audio samples into model parameters which describe each block of audio, and these parameters are stored instead of the audio samples, again with a header containing basic metadata. Common audio file formats such as WAV, which is usually associated with PCM data, act as containers allowing a large variety of audio representations. The MP3 format (formally MPEG-1 Audio Layer III) uses lossy audio compression and is common for consumer audio storage; the use of MP3 files in MIR research has increased in recent years due to the emergence of large-scale datasets. Standard open source software libraries such as libsndfile are available for reading and writing common non-proprietary formats, but some file formats are difficult to support with open source software due to the license required to implement an encoder.
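
The sketch below, using only the Python standard library, writes a short PCM file with the `wave` module and then reads the basic metadata back from its canonical RIFF/WAVE header; real-world files can contain additional chunks, which is one reason why libraries such as libsndfile are normally used instead.

    import struct
    import wave

    # Create a small 16-bit mono PCM file to inspect.
    with wave.open("example.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)                     # 16-bit samples
        w.setframerate(44100)
        w.writeframes(b"\x00\x00" * 44100)    # one second of silence

    # Parse the canonical 36-byte RIFF/WAVE header.
    with open("example.wav", "rb") as f:
        header = f.read(36)

    riff, _, wave_id, fmt_id, fmt_size, audio_fmt, channels, sample_rate, \
        byte_rate, block_align, bits = struct.unpack("<4sI4s4sIHHIIHH", header)
    print(riff, wave_id, fmt_id)                   # b'RIFF' b'WAVE' b'fmt '
    print(audio_fmt, channels, sample_rate, bits)  # 1 (PCM), 1 channel, 44100 Hz, 16 bits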

For symbolic music data, a popular file format is MIDI (Musical Instrument Digital Interface), but this is limited in expressiveness and scope, as it was originally designed for keyboard instrument sequencing. For scores, richer formats such as MusicXML or MEI (Music Encoding Initiative) are required; these are XML-based representations including information such as note spelling and layout. For guitar "tabs" (a generic term covering tablature as well as chord symbols with or without lyrics), free text is still commonly used, with no standard format, although software has been developed which can parse the majority of such files [Macrae and Dixon, 2011]. Some tab web sites have developed their own formats using HTML or XML for markup of the text files. Other text formats such as the MuseData and Humdrum kern formats [Selfridge-Field, 1997] have been used extensively for musicological analysis of corpora of scores.
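
As a small illustration of how such XML-based score formats can be processed, the sketch below extracts pitches and durations from a fragment of (partwise) MusicXML using the Python standard library; elements such as ties, chords and voices are ignored, and a real score would be read from a file rather than an embedded string.

    import xml.etree.ElementTree as ET

    MUSICXML = """<score-partwise version="3.1">
      <part id="P1">
        <measure number="1">
          <note><pitch><step>C</step><octave>4</octave></pitch><duration>2</duration></note>
          <note><pitch><step>E</step><alter>-1</alter><octave>4</octave></pitch><duration>2</duration></note>
          <note><rest/><duration>4</duration></note>
        </measure>
      </part>
    </score-partwise>"""

    root = ET.fromstring(MUSICXML)
    for note in root.iter("note"):
        pitch = note.find("pitch")
        duration = note.findtext("duration")
        if pitch is None:                       # a rest has no <pitch> element
            print("rest", "duration", duration)
            continue
        step = pitch.findtext("step")
        alter = pitch.findtext("alter", default="0")
        octave = pitch.findtext("octave")
        print(f"{step}{'b' if alter == '-1' else ''}{octave}", "duration", duration)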

For structured metadata, formats such as XML are commonly used, and in particular semantic web formats for linked data such as RDFa, RDF/XML, N3 and Turtle are employed. Since these are intended as machine-readable formats rather than for human consumption, the particular format chosen is less important than the underlying ontology which provides the semantics for the data. For image data, OMR systems typically process sheet music scanned at 300 dpi resolution, producing output in expMIDI (expressive MIDI), MusicXML or NIFF (Notation Interchange File Format) formats.

Finally, although music exists primarily in the auditory domain, there is a long tradition of representing music in various graphical formats. Common Western music notation is a primary example, but piano-roll notation, spectrograms and chromagrams also present musical information in potentially useful formats. Since music is a time-based phenomenon, it is common to plot the evolution of musical parameters as a function of time, such as tempo and dynamics curves, which have been used extensively in performance research [Desain and Honing, 1991]. Simultaneous representations of two or more temporal parameters have been achieved using animation, for example the Performance Worm [Dixon et al., 2002], which shows the temporal evolution of tempo and loudness as a trajectory in a two-dimensional space. Other visualisations include similarity matrices for audio alignment and structural segmentation [Müller et al., 2011] and various representations for analysis of harmony and tonality [e.g. Sapp, 2012].
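
As an example of such a representation, the sketch below (assuming NumPy and Matplotlib) computes a simple chromagram by folding STFT magnitude bins onto the twelve pitch classes; the frame and hop sizes are arbitrary choices.

    import numpy as np
    import matplotlib.pyplot as plt

    def chromagram(x, sr, frame_size=4096, hop_size=1024):
        n_frames = 1 + max(0, (len(x) - frame_size) // hop_size)
        window = np.hanning(frame_size)
        freqs = np.fft.rfftfreq(frame_size, d=1.0 / sr)
        # Map each FFT bin above 20 Hz to a pitch class relative to A (440 Hz).
        valid = freqs > 20
        pitch_class = (np.round(12 * np.log2(freqs[valid] / 440.0)) % 12).astype(int)
        chroma = np.zeros((12, n_frames))
        for t in range(n_frames):
            frame = x[t * hop_size : t * hop_size + frame_size] * window
            mag = np.abs(np.fft.rfft(frame))[valid]
            np.add.at(chroma[:, t], pitch_class, mag)   # accumulate energy per pitch class
        return chroma

    # Toy input: a C major arpeggio (C4, E4, G4), one second per note.
    sr = 22050
    notes = [261.63, 329.63, 392.00]
    x = np.concatenate([np.sin(2 * np.pi * f * np.arange(sr) / sr) for f in notes])
    plt.imshow(chromagram(x, sr), aspect="auto", origin="lower")
    plt.xlabel("frame")
    plt.ylabel("pitch class (0 = A)")
    plt.show()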


References


Challenges



Back to → Roadmap:Technological perspective
