Data processing methodologies
Since its origins, the MIR community has used and adapted data processing methodologies from related research fields like speech processing, text information retrieval, and computer vision. A natural consequent challenge is to identify potentially relevant methodologies from data processing disciplines more systematically and to stay up to date with their latest developments. This exchange of data processing methodologies reduces duplication of research efforts and exploits synergies between disciplines which, at a more abstract level, deal with similar data processing problems. It will become even more relevant as MIR embraces the full multi-modality of music and its full complexity as a cultural phenomenon. This requires the regular involvement and commitment of researchers from diverse fields of science, an effort of communication across disciplines, and possibly even the formulation of joint research agendas. Such an organised exchange of methodologies is likely to benefit all participating disciplines through the joining of forces and combined effort.
State of the art
The origins of Music Information Research were multi-disciplinary in nature. At the first edition of the ISMIR conference series, in 2000, although the number of research papers was significantly smaller than in later editions, papers drew ideas from a relatively large number of disciplines: digital libraries, information retrieval, musicology and symbolic music analysis, speech processing, signal processing, perception and cognition, image processing (with applications to optical music recognition), and user modelling. This first conference also featured debates on intellectual property and on systematic evaluation.
Since then, the ISMIR conference has grown tremendously, as illustrated by the 400% increase in the number of unique authors between 2000 and 2011. In the last 12 years, neighbouring fields of science with a longer history have influenced this growth of the MIR community. Of the initial diversity of backgrounds and disciplines, not all had equal influence on the growth of MIR. Looking back on the first 12 years of ISMIR shows a clear predominance of bottom-up methodologies drawn from data-intensive disciplines such as Speech Processing, Text Retrieval and Computer Vision, as opposed to knowledge-based disciplines such as Musicology or (Music) Psychology. One possible reason for the relatively stronger influence of data-intensive disciplines over knowledge-based ones is that the initial years of ISMIR coincided with phenomena such as the industrial application of audio compression research and the explosive growth in the availability of data through the Internet (including audio files) [Downie et al., 2009]. Further, following typical tasks from Speech Processing, Computer Vision and Text Retrieval, MIR research rapidly focused on a relatively small set of preferential tasks such as local feature extraction, data modelling for comparison and classification, and efficient retrieval. In the following, we review data processing methods employed in the three above-mentioned disciplines and relate their domains to the music domain, to point out how MIR could benefit from further cross-fertilisation with these disciplines.
The discipline of Speech Processing aims at extracting information from speech signals. It has a long history and has been influential in a number of MIR developments, namely transcription, timbre characterisation, source recognition and source separation.
Musical audio representations have been influenced by research in speech transcription and speaker recognition. It is commonplace to start any analysis of musical audio by extracting a set of local features typical of speech transcription and speaker recognition, such as Mel Frequency Cepstrum Coefficients (MFCCs) computed on short-time Fourier transforms. In speech processing, these features make up the basic building blocks of machine learning algorithms that map patterns of features to individual speakers or likely sequences of words in multiple stages (i.e. short sequences of features mapped to phones, sequences of phones mapped to words, and sequences of words mapped to sentences). A prevalent technique for mapping from one stage to the next is the Hidden Markov Model (HMM). Similar schemes have been adapted to music audio data and nowadays form the basis of music signal classification into genres, tags or particular instruments.
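As a concrete illustration of this pipeline, here is a minimal sketch of frame-wise MFCC extraction followed by fitting an HMM to the feature sequence, using the open-source librosa and hmmlearn libraries. The file name and all parameter values are illustrative assumptions, not settings prescribed by the works cited above.

```python
# Minimal sketch: frame-wise MFCC extraction, then an HMM over the sequence.
# File name and all parameter values are illustrative assumptions.
import librosa
from hmmlearn.hmm import GaussianHMM

# Load ~30 s of mono audio at a typical analysis sample rate.
y, sr = librosa.load("example_track.wav", sr=22050, mono=True, duration=30.0)

# MFCCs computed on short-time Fourier transforms: 2048-sample windows
# (~93 ms at 22.05 kHz) advanced in 512-sample hops (~23 ms).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print(mfcc.shape)  # (13, n_frames): one 13-dimensional feature vector per frame

# Fit a 4-state Gaussian HMM to the frame sequence, in the spirit of HMM
# modelling of audio feature sequences (cf. [Flexer et al., 2005]).
model = GaussianHMM(n_components=4, covariance_type="diag").fit(mfcc.T)
states = model.predict(mfcc.T)  # most likely hidden state per frame
```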
Research in speech processing has also addressed the problem of separating out a single voice from a recording of many people speaking simultaneously (known as the "cocktail party" problem). A parallel problem when dealing with music data is isolating the components of a polyphonic music signal. Source separation is easier if there are at least as many sensors as sound sources [Mitianoudis, 2004], but a typical research problem in MIR is the under-determined source separation of many sound sources in a stereo or mono recording. The most basic instantiation of the problem assumes that N source signals are linearly mixed into M < N channels, and the task is to infer the source signals and their mixture coefficients from the mixed signal. To solve it, the space of solutions has to be restricted by making further assumptions, leading to different methods: Independent Component Analysis (ICA) assumes the sources to be independent and non-Gaussian, Sparse Component Analysis (SCA) assumes the sources to be sparse, and Non-negative Matrix Factorisation (NMF) assumes the sources, coefficients and mixture to be non-negative. Given that speech processing and content-based MIR both work in the audio domain, local features can be directly adopted – and in fact, MFCCs have been used in music similarity estimation from the very beginning of MIR [Foote, 1997]. HMMs have also been employed for modelling sequences of audio features or symbolic music [Flexer et al., 2005]. Several attempts have been made to apply source separation techniques to music, utilising domain-specific assumptions on the extracted sources to improve performance: [Virtanen and Klapuri, 2002] assume signals to be harmonic, [Virtanen, 2007] assumes continuity in time, and [Burred, 2008] incorporates instrument timbre models.
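To make the under-determined, non-negative setting concrete, the sketch below applies NMF to the magnitude spectrogram of a monaural mixture, factorising it into spectral templates and their time-varying activations. The file name, the number of components and the solver settings are assumptions for illustration; practical systems add the domain-specific constraints cited above.

```python
# Minimal sketch: NMF decomposition of a monaural mixture's magnitude
# spectrogram. Component count and solver settings are illustrative.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("mixture.wav", sr=22050, mono=True)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # non-negative magnitudes

# Factorise S ~ W @ H: columns of W are spectral templates of the (assumed)
# sources, rows of H their time-varying gains.
model = NMF(n_components=4, init="nndsvda",
            beta_loss="kullback-leibler", solver="mu", max_iter=400)
W = model.fit_transform(S)   # (n_freq_bins, 4)
H = model.components_        # (4, n_frames)

# Each component k can be isolated by Wiener-style masking of the mixture.
for k in range(4):
    mask = np.outer(W[:, k], H[k]) / (W @ H + 1e-10)
    component_spectrogram = mask * S
```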
Text Retrieval has also had a great influence on MIR, particularly through the tasks of document retrieval (in a given collection, find documents relevant to a textual query in the form of search terms or an example document) and document classification (assign a given document to at least one of a given set of classes, e.g., detect the topic of a news article or filter spam emails). Both problems require some abstract model for a document. The first system for document classification [Maron, 1961] represented each document as a word count vector over a manually assembled vocabulary of "clue words", then applied a Naïve Bayes classifier to derive the document's topic, regarding neither the order nor the co-occurrence of words within the document. Today, documents are still commonly represented as a word count vector – or Bag of Words (BoW) – for both classification and retrieval, but improvements over [Maron, 1961] have been proposed on several levels, namely stemming, term weighting [Salton and Buckley, 1988], topic modelling [Deerwester et al., 1990], semantic hashing [Hinton and Salakhutdinov, 2011], word sense disambiguation [Navigli, 2009], and N-gram models. Some of these techniques have been applied to find useful abstract representations of music pieces as well, but their use implies that a suitable equivalent to words can be defined for music. Some authors applied vector quantisation ("stemming") to frame-wise audio features ("words") to form a BoW model for similarity search [Seyerlehner et al., 2008]. [Riley et al., 2008] additionally employ TF/IDF term weighting of their so-called "audio-words". [Hoffman et al., 2008] successfully applied Hierarchical Dirichlet Process (HDP) topic models for similarity estimation, albeit modelling topics as Gaussian distributions of MFCCs rather than multinomials over discrete words.
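The following is a minimal sketch of this "audio words" idea: frame-wise features are quantised with a k-means codebook, counted into per-track Bag-of-Words histograms, and re-weighted with TF/IDF. The input is assumed to be precomputed per-track feature matrices (e.g. the MFCC arrays from the sketch above, transposed to one row per frame); the codebook size and the random stand-in data are illustrative assumptions.

```python
# Minimal sketch: k-means "audio words", per-track BoW histograms, TF/IDF.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

def audio_bow(tracks, n_words=64):
    """tracks: list of (n_frames, n_features) arrays, one per music piece."""
    # Learn a codebook over all frames of all tracks (vector quantisation).
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    codebook.fit(np.vstack(tracks))

    # One histogram of codeword counts per track: the BoW representation.
    counts = np.stack([np.bincount(codebook.predict(t), minlength=n_words)
                       for t in tracks])

    # TF/IDF re-weighting of the "audio words", as in [Riley et al., 2008].
    return TfidfTransformer().fit_transform(counts)

# Usage with random stand-in "tracks" of MFCC-like frames:
rng = np.random.default_rng(0)
bow = audio_bow([rng.normal(size=(500, 13)) for _ in range(10)])
print(bow.shape)  # (10 tracks, 64 codewords)
```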
Finally, three typical Computer Vision problems have been particularly influential in MIR research, namely scene recognition (classifying images of scenery), multiple object detection (decomposing a complex image into a set of known entities and their locations) and image retrieval by example. Again, in Computer Vision, all these tasks require abstract representations of images or image parts to work with, and researchers have developed a wealth of image-specific local features and global descriptors (see [Datta et al., 2008], pp. 17–24 for a review). A common framework has been inspired by Text Retrieval: [Zhu et al., 2002] regard images as documents composed of "keyblocks", in analogy to text composed of keywords. Keyblocks are vector-quantised image patches extracted on a regular grid, forming a 2-dimensional array of "visual words", which can be turned into a Bag of Visual Words (BoVW) by building histograms. Several improvements have since been proposed, regarding visual words [Gemert et al., 2008], pooling [Boureau et al., 2010], spatial pyramids [Lazebnik et al., 2006], topic modelling [Sivic et al., 2005], generative image models [Krizhevsky and Hinton, 2011], learning invariances [Hinton et al., 2011], and semantic hashing [Krizhevsky and Hinton, 2011]. As with speech and text processing, some of these techniques have been adopted for the processing of music audio features. Examples include [Abdallah, 2002], who employs sparse coding of short spectrogram excerpts of harpsichord music, yielding note detectors. [Casagrande et al., 2005] use Haar-like feature extractors inspired by object detection to discriminate speech from music. [Pohle et al., 2010] apply horizontal and vertical edge detectors to identify harmonic and percussive elements. [Lee et al., 2009] apply Convolutional RBMs for local feature extraction with some success in genre classification. [Schlüter and Osendorfer, 2011] learn local image features for music similarity estimation. Additionally, as music pieces can be represented directly as images, e.g. as spectrograms, several authors have applied image processing techniques to music directly: [Costa et al., 2012] extract textural features (local binary patterns) from spectrograms for genre classification.
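As a small illustration of treating the spectrogram as an image, the sketch below applies horizontal and vertical derivative (Sobel) filters to a log-magnitude spectrogram, in the spirit of the edge-detector idea in [Pohle et al., 2010]. This is an assumed reconstruction for illustration, not the authors' exact method, and the file name and parameters are placeholders.

```python
# Minimal sketch: edge filtering of a spectrogram treated as an image.
import numpy as np
import librosa
from scipy.ndimage import sobel

y, sr = librosa.load("example_track.wav", sr=22050, mono=True, duration=30.0)
S = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=2048, hop_length=512)))

# In a (frequency x time) spectrogram image, harmonic content forms
# horizontal ridges (stable frequencies over time), while percussive content
# forms vertical ridges (broadband energy at onset instants).
harmonic_edges = sobel(S, axis=0)    # derivative along frequency: horizontal lines
percussive_edges = sobel(S, axis=1)  # derivative along time: vertical lines
```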
References
- [Abdallah, 2002] Samer A. Abdallah. Towards Music Perception by Redundancy Reduction and Unsupervised Learning in Probabilistic Models. PhD thesis, King’s College London, London, UK, 2002.
- [Boureau et al., 2010] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 111–118, Haifa, Israel, 2010.
- [Burred, 2008] Juan José Burred. From Sparse Models to Timbre Learning: New Methods for Musical Source Separation. PhD thesis, Technical University of Berlin, Berlin, Germany, 2008.
- [Casagrande et al., 2005] Norman Casagrande, Douglas Eck, and Balázs Kégl. Frame-Level Speech/Music Discrimination using AdaBoost. In Proceedings of the 6th International Society for Music Information Retrieval Conference (ISMIR 2005), pp. 345–350, 2005.
- [Costa et al., 2012] Y. Costa, L. Oliveira, A. Koerich, F. Gouyon, and J. Martins. Music genre classification using LBP textural features. Signal Processing, 92(11): 2723–2737, 2012.
- [Datta et al., 2008] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2): 1–60, 2008.
- [Deerwester et al., 1990] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391–407, 1990.
- [Downie et al., 2009] J. Stephen Downie, Donald Byrd, and Tim Crawford. Ten years of ISMIR: Reflections on challenges and opportunities. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pp. 13–18, 2009.
- [Flexer et al., 2005] Arthur Flexer, Elias Pampalk, and Gerhard Widmer. Hidden Markov models for spectral similarity of songs. In Proceedings of the 8th International Conference on Digital Audio Effects (DAFx-05), Madrid, Spain, 2005.
- [Foote, 1997] Jonathan T. Foote. Content-based retrieval of music and audio. In: C.-C. Jay Kuo, Shih-Fu Chang, and Venkat N. Gudivada, editors, Multimedia Storage and Archiving Systems II (Proceedings SPIE), volume 3229, pp. 138–147, 1997.
- [Gemert et al., 2008] Jan van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, and Arnold W. M. Smeulders. Kernel codebooks for scene categorization. In Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), volume 5304 of Lecture Notes in Computer Science, pp. 696–709. Springer, 2008.
- [Hinton and Salakhutdinov, 2011] Geoffrey Hinton and Ruslan Salakhutdinov. Discovering Binary Codes for Documents by Learning Deep Generative Models. Topics in Cognitive Science, 3(1): 74–91, 2011.
- [Hinton et al., 2011] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming Auto-Encoders. In Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN 2011), volume 6791 of Lecture Notes in Computer Science, pp. 44–51, Espoo, Finland, Springer, 2011.
- [Hoffman et al., 2008] Matthew Hoffman, David M. Blei, and Perry R. Cook. Content-Based Musical Similarity Computation using the Hierarchical Dirichlet Process. In Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR 2008), pp. 349–354, Philadelphia, USA, 2008.
- [Krizhevsky and Hinton, 2011] Alex Krizhevsky and Geoffrey E. Hinton. Using Very Deep Autoencoders for Content-Based Image Retrieval. In Proceedings of the 19th European Symposium on Artificial Neural Networks (ESANN 2011), Bruges, Belgium, 2011.
- [Lazebnik et al., 2006] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pp. 2169–2178, 2006.
- [Lee et al., 2009] Honglak Lee, Yan Largman, Peter Pham, and Andrew Y. Ng. Unsupervised Feature Learning for Audio Classification using Convolutional Deep Belief Networks. In: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS 2009), pp. 1096–1104. 2009.
- [Maron, 1961] M.E. Maron. Automatic indexing: an experimental inquiry. Journal of the Association for Computing Machinery, 8(3): 404–417, 1961.
- [Mitianoudis, 2004] Nikolaos Mitianoudis. Audio Source Separation using Independent Component Analysis. PhD thesis, Queen Mary University of London, 2004.
- [Navigli, 2009] Roberto Navigli. Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2):10:1–10:69, 2009.
- [Pohle et al., 2010] Tim Pohle, Peter Knees, Klaus Seyerlehner, and Gerhard Widmer. A High-Level Audio Feature For Music Retrieval and Sorting. In Proceedings of the 13th International Conference on Digital Audio Effects (DAFx-10), Graz, Austria, 2010.
- [Riley et al., 2008] Matthew Riley, Eric Heinen, and Joydeep Ghosh. A text retrieval approach to content-based audio retrieval. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), pp. 295–300, Philadelphia, USA, 2008.
- [Salton and Buckley, 1988] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5): 513–523, 1988.
- [Schlüter and Osendorfer, 2011] Jan Schlüter and Christian Osendorfer. Music Similarity Estimation with the Mean-Covariance Restricted Boltzmann Machine. In Proceedings of the 10th International Conference on Machine Learning and Applications (ICMLA 2011), Honolulu, USA, 2011.
- [Seyerlehner et al., 2008] Klaus Seyerlehner, Gerhard Widmer, and Peter Knees. Frame-level Audio Similarity – A Codebook Approach. In Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.
- [Sivic et al., 2005] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman. Discovering Objects and their Localization in Images. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV’05), volume 1, pp. 370–377, 2005.
- [Virtanen and Klapuri, 2002] Tuomas Virtanen and Anssi Klapuri. Separation of Harmonic Sounds Using Linear Models for the Overtone Series. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’02), volume II, pp. 1757–1760, Orlando, FL, USA, 2002.
- [Virtanen, 2007] Tuomas Virtanen. Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3): 1066–1074, 2007.
- [Zhu et al., 2002] Lei Zhu, Aibing Rao, and Aidong Zhang. Theory of Keyblock-based Image Retrieval. ACM Transactions on Information Systems, 20(2): 224–257, 2002.
Challenges
- Systematise cross-disciplinary transfer of methodologies. Early breakthroughs in MIR came from a relatively limited number of external fields, mainly through the contributions of individual researchers working in neighbouring fields (e.g. Speech Processing) who applied their methodologies to music. Being more systematic about this implies two challenges for the MIR community: first, to stay up-to-date with the latest developments in disciplines that were influential at various points of MIR's evolution, and second, to define ways to systematically identify potentially relevant methodologies from neighbouring disciplines.
- Take advantage of the multiple modalities of music data. Music exists in many diverse modalities (audio, text, video, score, etc.) which in turn call for different processing methodologies. Given a particular modality of interest (e.g. audio), in addition to identifying promising processing methodologies from neighbouring fields dealing with the same modality (e.g. speech processing), an effort will have to be made to apply methodologies across modalities. Further, as music exists simultaneously in diverse modalities, another challenge for MIR will be to include methodologies from cross-modal processing, i.e. using joint representations/models for data that exists, and can be represented, simultaneously in diverse modalities.
- Adopt recent Machine Learning techniques. As exemplified above, MIR makes great use of machine learning methodologies; in particular, many tasks are formulated in a batch learning setting, where a fixed amount of annotated training data is used to learn models which can then be evaluated on similar data. However, music data can now be found in very large amounts (e.g. on the scale of hundreds of thousands of items for music pieces in diverse modalities, or tens of millions in the case of e.g. tags), music increasingly exists in data streams rather than in static data sets, and the characterisation of music data can evolve with time (e.g. tag annotations are constantly evolving, sometimes even in an adverse way). These Big Data characteristics (i.e. very large amounts, streaming, non-stationarity) imply a number of challenges for MIR, such as data acquisition, dealing with weakly structured data formats, scalability, online (and real-time) learning (see the sketch after this list), semi-supervised learning, iterative learning and model updates, learning from sparse data, learning with only positive examples, learning with uncertainty, etc. (see e.g. Yahoo! Labs' "key scientific challenges" in Machine Learning and the White Paper "Challenges and Opportunities with Big Data" published by the Computing Community Consortium).
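To illustrate the contrast between the batch formulation and the streaming setting mentioned in the last item above, here is a minimal sketch of online learning via scikit-learn's partial_fit interface. The data is randomly generated stand-in for streaming feature vectors, and all shapes and labels are assumptions.

```python
# Minimal sketch: incremental (online) learning over a data stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])            # e.g. a binary tag; declared up front
model = SGDClassifier(loss="log_loss")

# Consume the stream in mini-batches: the model is updated incrementally and
# never needs the full (potentially unbounded) data set in memory.
for _ in range(100):
    X_batch = rng.normal(size=(32, 13))    # 32 incoming feature vectors
    y_batch = rng.integers(0, 2, size=32)  # their (possibly noisy) labels
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(5, 13))))  # predictions for new items
```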