Data processing methodologies


Since its origins, the MIR community has used and adapted data processing methodologies from related research fields such as speech processing, text information retrieval, and computer vision. A natural follow-on challenge is to identify potentially relevant methodologies from data processing disciplines more systematically and to stay up to date with their latest developments. This exchange of data processing methodologies reduces duplication of research effort and exploits synergies between disciplines which, at a more abstract level, deal with similar data processing problems. It will become even more relevant as MIR embraces the full multi-modality of music and its full complexity as a cultural phenomenon. This requires the regular involvement and commitment of researchers from diverse fields of science, an effort of communication across disciplines, and possibly even the formulation of joint research agendas. Such an organised exchange of methodologies is likely to benefit all participating disciplines through the joining of forces and combined effort.


Back to → Roadmap:Technological perspective



State of the art

The origins of Music Information Research were multi-disciplinary in nature. Although the number of research papers at the first edition of the ISMIR conference series, in 2000, was significantly smaller than in later editions, they drew ideas from a relatively large number of disciplines: digital libraries, information retrieval, musicology and symbolic music analysis, speech processing, signal processing, perception and cognition, image processing (with applications to optical music recognition), and user modelling. This initial conference also debated intellectual property matters and systematic evaluations.

Since then, the ISMIR conference has grown tremendously, as illustrated by the number of unique authors, which increased by 400% between 2000 and 2011. Over these first 12 years, neighbouring fields of science with a longer history have influenced the growth of the MIR community. Not all of the initially diverse backgrounds and disciplines had an equal influence on this growth: looking back on the first 12 years of ISMIR shows a clear predominance of bottom-up methodologies originating from data-intensive disciplines such as Speech Processing, Text Retrieval and Computer Vision, as opposed to knowledge-based disciplines such as Musicology or (Music) Psychology. One possible reason for the relatively stronger influence of data-intensive disciplines is that the initial years of ISMIR coincided with phenomena such as the industrial application of audio compression research and the explosive growth in the availability of data through the Internet, including audio files [Downie et al., 2009]. Furthermore, following typical tasks from Speech Processing, Computer Vision and Text Retrieval, MIR research rapidly focused on a relatively small set of preferential tasks such as local feature extraction, data modelling for comparison and classification, and efficient retrieval. In the following, we review data processing methods employed in the three above-mentioned disciplines and relate their domains to the music domain, pointing out how MIR could benefit from further cross-fertilisation with them.

The discipline of Speech Processing aims at extracting information from speech signals. It has a long history and has been influential in a number of MIR developments, namely transcription, timbre characterisation, source recognition and source separation.

Musical audio representations have been influenced by research in speech transcription and speaker recognition. It is commonplace to start any analysis of musical audio by extracting a set of local features typical of speech transcription and speaker recognition, such as Mel Frequency Cepstral Coefficients (MFCCs) computed on short-time Fourier transforms. In speech processing, these features make up the basic building blocks of machine learning algorithms that map patterns of features to individual speakers or to likely sequences of words in multiple stages (i.e. short sequences of features are mapped to phones, sequences of phones to words, and sequences of words to sentences). A prevalent technique for mapping from one stage to the next is the Hidden Markov Model (HMM). Similar schemes have been adapted to music audio data and nowadays form the basis of the classification of music signals into genres, tags or particular instruments.
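
To make this concrete, the following minimal sketch extracts such a frame-wise MFCC matrix from an audio file. It assumes the open-source librosa library; the file name, sampling rate and number of coefficients are illustrative choices rather than anything prescribed by the roadmap.

    # Minimal sketch: frame-wise MFCC extraction as used in both speech
    # processing and MIR. Assumes the third-party librosa library; the
    # file name "audio.wav" is a placeholder.
    import librosa

    # Load the recording as a mono floating-point signal.
    y, sr = librosa.load("audio.wav", sr=22050, mono=True)

    # Compute 13 MFCCs per short-time frame (the STFT is taken internally).
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    print(mfccs.shape)  # (13, number_of_frames): one coefficient vector per frame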

Research in speech processing has also addressed the problem of separating out a single voice from a recording of many people speaking simultaneously (known as the "cocktail party" problem). A parallel problem when dealing with music data is isolating the components of a polyphonic music signal. Source separation is easier if there are at least as many sensors as sound sources [Mitianoudis, 2004], but a typical research problem in MIR is the under-determined source separation of many sound sources in a stereo or mono recording. The most basic instantiation of the problem assumes that N source signals are linearly mixed into M < N channels, and the task is to infer the source signals and their mixture coefficients from the mixed signal. To solve it, the space of solutions has to be restricted by making further assumptions, leading to different methods: Independent Component Analysis (ICA) assumes the sources to be independent and non-Gaussian, Sparse Component Analysis (SCA) assumes the sources to be sparse, and Non-negative Matrix Factorisation (NMF) assumes the sources, coefficients and mixture to be non-negative. Given that speech processing and content-based MIR both work in the audio domain, local features can be directly adopted – and in fact, MFCCs have been used in music similarity estimation from the very beginning of MIR [Foote, 1997]. HMMs have also been employed for modelling sequences of audio features or symbolic music [Flexer et al., 2005]. Several attempts have been made to apply source separation techniques to music, utilising domain-specific assumptions on the extracted sources to improve performance: [Virtanen and Klapuri, 2002] assume the signals to be harmonic, [Virtanen, 2007] assumes continuity in time, and [Burred, 2008] incorporates instrument timbre models.
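
As an illustration of the non-negativity route, the following sketch factorises a magnitude spectrogram with NMF into spectral templates and their activations in time, the core step of many NMF-based separation systems. librosa and scikit-learn are assumed, and the file name and number of components are placeholders; a complete system would additionally resynthesise time-domain signals from the masked spectrogram.

    # Minimal sketch: NMF factorisation of a magnitude spectrogram into
    # non-negative spectral templates and time activations, the basic
    # building block of many NMF-based separation systems. Assumes librosa
    # and scikit-learn; "mixture.wav" and n_components=4 are placeholders.
    import numpy as np
    import librosa
    from sklearn.decomposition import NMF

    y, sr = librosa.load("mixture.wav", mono=True)
    S = np.abs(librosa.stft(y))   # non-negative magnitude spectrogram

    # Factorise S into W (spectral templates as columns) times H (their
    # activations over time as rows), with all entries non-negative.
    model = NMF(n_components=4, init="nndsvd", max_iter=500)
    W = model.fit_transform(S)    # shape: (frequency_bins, 4)
    H = model.components_         # shape: (4, time_frames)

    # Magnitude contribution of a single component, e.g. the first one;
    # a full system would filter the mixture phase to resynthesise audio.
    S_component0 = np.outer(W[:, 0], H[0])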

Text Retrieval has also had a great influence on MIR, particularly through the tasks of document retrieval (in a given collection, find documents relevant to a textual query in the form of search terms or an example document) and document classification (assign a given document to at least one of a given set of classes, e.g. detect the topic of a news article or filter spam emails). Both problems require some abstract model of a document. The first system for document classification [Maron, 1961] represented each document as a word count vector over a manually assembled vocabulary of "clue words", then applied a Naïve Bayes classifier to derive the document's topic, regarding neither the order nor the co-occurrence of words within the document. Today, documents are still commonly represented as word count vectors – or Bags of Words (BoW) – for both classification and retrieval, but improvements over [Maron, 1961] have been proposed on several levels, namely stemming, term weighting [Salton and Buckley, 1988], topic modelling [Deerwester et al., 1990], semantic hashing [Hinton and Salakhutdinov, 2011], word sense disambiguation [Navigli, 2009], and N-gram models. Some of these techniques have been applied to find useful abstract representations of music pieces as well, but their use implies that a suitable equivalent to words can be defined for music. Some authors applied vector quantisation ("stemming") to frame-wise audio features ("words") to form a BoW model for similarity search [Seyerlehner et al., 2008]; [Riley et al., 2008] additionally employ TF/IDF term weighting of their so-called "audio-words"; and [Hoffman et al., 2008] successfully applied Hierarchical Dirichlet Process (HDP) topic models for similarity estimation, albeit modelling topics as Gaussian distributions of MFCCs rather than multinomials over discrete words.
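
The following sketch illustrates such a bag-of-audio-words pipeline: frame-wise MFCCs are vector-quantised with k-means into a discrete vocabulary, each track becomes a histogram of audio words, and TF/IDF weighting is applied exactly as in text retrieval. It is a generic illustration in the spirit of the approaches above, not a reimplementation of any cited system; librosa and scikit-learn are assumed, and all file names and sizes are placeholders.

    # Minimal sketch of a bag-of-audio-words pipeline (a generic
    # illustration, not a reimplementation of any cited system). Assumes
    # librosa and scikit-learn; file names, vocabulary size and feature
    # settings are placeholders.
    import numpy as np
    import librosa
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfTransformer

    paths = ["track1.wav", "track2.wav", "track3.wav"]  # placeholder collection
    frames = []  # one (n_frames, 13) MFCC matrix per track
    for p in paths:
        y, sr = librosa.load(p, mono=True)
        frames.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)

    # Vector quantisation: learn a codebook ("vocabulary") of 64 audio words.
    codebook = KMeans(n_clusters=64, n_init=10).fit(np.vstack(frames))

    # Each track becomes a histogram of audio-word counts (its BoW vector).
    bow = np.array([np.bincount(codebook.predict(f), minlength=64)
                    for f in frames])

    # TF/IDF term weighting, exactly as in text retrieval.
    tfidf = TfidfTransformer().fit_transform(bow)  # sparse (n_tracks, 64)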

Finally, three typical Computer Vision problems have been particularly influential in MIR research, namely scene recognition (classifying images of scenery), multiple object detection (decomposing a complex image into a set of known entities and their locations) and image retrieval by example. Again, in Computer Vision, all these tasks require abstract representations of images or image parts to work with, and researchers have developed a wealth of image-specific local features and global descriptors (see [Datta et al., 2008], pp. 17-24 for a review). A common framework has been inspired by Text Retrieval: [Zhu et al., 2002] regard images as documents composed of "keyblocks", in analogy to text composed of keywords. Keyblocks are vector-quantised image patches extracted on a regular grid, forming a 2-dimensional array of "visual words", which can be turned into a Bag of Visual Words (BoVW) by building histograms. Several improvements have since been proposed, regarding visual words [Gemert et al., 2008], pooling [Boureau et al., 2010], spatial pyramids [Lazebnik et al., 2006], topic modelling [Sivic et al., 2005], generative image models [Krizhevsky and Hinton, 2011], learning invariances [Hinton et al., 2011], and semantic hashing [Krizhevsky and Hinton, 2011]. As with Speech and Text Processing, some of these techniques have been adopted for the processing of music audio features. Examples include [Abdallah, 2002], who employs sparse coding of short spectrogram excerpts of harpsichord music, yielding note detectors; [Casagrande et al., 2005], who use Haar-like feature extractors inspired by object detection to discriminate speech from music; [Pohle et al., 2010], who apply horizontal and vertical edge detectors to identify harmonic and percussive elements; [Lee et al., 2009], who apply Convolutional Restricted Boltzmann Machines (RBMs) for local feature extraction with some success in genre classification; and [Schlüter and Osendorfer, 2011], who learn local image features for music similarity estimation. Additionally, as music pieces can be represented directly as images, e.g. as spectrograms, several authors have applied image processing techniques to music directly: [Costa et al., 2012] extract features for genre classification with oriented difference-of-Gaussians filters and report recent improvements on using image features for music classification.
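
As a small illustration of treating the spectrogram as an image, the following sketch applies horizontal and vertical median filters to separate sustained (harmonic) from broadband (percussive) structure, loosely echoing the edge-detector idea above. It is a generic sketch rather than the method of any cited paper; librosa and scipy are assumed, and the filter lengths are arbitrary placeholders.

    # Minimal sketch of treating a spectrogram as an image: horizontal
    # (time-direction) and vertical (frequency-direction) median filters
    # enhance harmonic lines and percussive onsets respectively. A generic
    # illustration of the idea, not the method of any cited paper; assumes
    # librosa and scipy, with arbitrary filter lengths.
    import numpy as np
    import librosa
    from scipy.ndimage import median_filter

    y, sr = librosa.load("audio.wav", mono=True)
    S = np.abs(librosa.stft(y))  # (frequency, time) magnitude "image"

    # Sustained harmonic partials form horizontal lines; broadband
    # percussive onsets form vertical lines.
    harmonic_like = median_filter(S, size=(1, 17))    # filter along time
    percussive_like = median_filter(S, size=(17, 1))  # filter along frequency

    # A soft mask assigns each time-frequency "pixel" to the harmonic layer.
    harmonic_mask = harmonic_like / (harmonic_like + percussive_like + 1e-10)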


References


Challenges



Back to → Roadmap:Technological perspective
