Evaluation methodologies
It is paramount to MIR that independent researchers build upon previous research, and an overarching challenge in MIR is to define and implement research evaluation methodologies that effectively contribute to the creation of knowledge and to general improvements in the field. In many scientific disciplines dealing with data processing, significant long-term improvements have been achieved by empirically refining evaluation methodologies over several iterations of an experimental "loop" of formalisation, implementation, experimentation and, finally, validity analysis. In MIR, evaluation initiatives have played an increasing role over the last 10 years, and the community is presently facing the validity analysis issue: finding the most appropriate way to build upon its own legacy and redefine the evaluation methodologies that will best lead to future improvements, the resolution of which will in turn raise further technical challenges further down the "loop". This will require, above all, a deeper involvement of more MIR researchers in the very definition of the evaluation methodologies, as they are the individuals with the best understanding of the relevant computational issues. Importantly, it will also require the involvement of the music industry (e.g. by proposing evaluations of relevance to them) and of content providers (so that researchers have access to data). Effective MIR evaluations will fundamentally change the way MIR research is done, broaden and deepen MIR research, and increase the relevance of MIR to other research fields.
Back to → Roadmap: Technological perspective
State of the art
Many experimental disciplines have witnessed significant long-term improvements thanks to community-wide efforts in systematic evaluation. This is the case, for instance, for (text-based) Information Retrieval with the TREC (Text REtrieval Conference) and CLEF (Cross-Language Evaluation Forum) initiatives, for Speech Recognition [Pearce and Hirsch, 2000], for Machine Learning [Guyon et al., 2004], and for Video and Multimedia Retrieval with e.g. the TRECVID and VideoCLEF initiatives (the latter later generalised into the "MediaEval Benchmarking Initiative for Multimedia Evaluation").
Although evaluation "per se" has not been a traditional focus of pioneering computer music conferences (such as the ICMC) and journals (e.g. Computer Music Journal), recent attention has been given to the topic. In 1992, the visionary Marvin Minsky declared: "the most critical thing, in both music research and general AI research, is to learn how to build a common music database" [Minsky and Laske, 1992], but it was not until a series of encounters, workshops and special sessions organised between 1999 and 2003 by researchers from the newly-born Music Information Retrieval community that the necessity of conducting rigorous and comprehensive evaluations was recognised [Downie, 2003].
The first public international evaluation benchmark took place at the ISMIR 2004 conference [Cano et al., 2006], where the objective was to compare state-of-the-art audio algorithms and systems for several music content description tasks. This effort was then systematised and continued through the yearly Music Information Retrieval Evaluation eXchange (MIREX), which has widened the scope of the evaluations and now covers a broad range of tasks, including symbolic data description and retrieval [Downie, 2006].
The number of evaluation endeavours originating in other communities (e.g. Signal Processing, Data Mining, Information Retrieval), yet relevant to MIR, has recently increased significantly. For instance, the Signal Separation Evaluation Campaign (SiSEC), started in 2008, deals with aspects of source separation in signals of different kinds (music, audio, biomedical, etc.). A Data Mining contest organised at the 19th International Symposium on Methodologies for Intelligent Systems (ISMIS) and hosted on the TunedIT platform included two tracks relevant to MIR research: Music Genre recognition and Musical Instrument recognition. The CLEF initiative (an IR evaluation forum) extended its scope to MIR with the MusiCLEF benchmark [Orio et al., 2011]. The ACM Special Interest Group on Knowledge Discovery and Data Mining organises a yearly competition, the KDD Cup, focusing on a different Data Mining topic every year; in 2011 the competition addressed a core MIR topic, Music Recommendation. In 2012, MediaEval (the Benchmarking Initiative for Multimedia Evaluation) organised a music-related task for the first time. Also in 2012, the Million Song Dataset challenge was launched: a music recommendation challenge incorporating many different sorts of data (user data, tags, ...).
The establishment of an annual evaluation forum (MIREX), accepted by the community, and the appearance of relevant satellite forums in neighbouring fields have undoubtedly been beneficial to the MIR field. However, much work is still needed to reach a level where evaluations have a systematic and traceable positive impact on the development of MIR systems and on the creation of new knowledge in MIR. For about 10 years, meta-evaluation methodologies have been instrumental in the advancement of the Text Information Retrieval field; they need to be addressed in MIR too [Urbano, 2013]. A special panel and late-breaking news session held at ISMIR 2012 examined the various methodologies used in the MIR field and compared them to those used in other initiatives such as MediaEval [Peeters et al., 2012].
Reproducible Research
Much computational science research is conducted without regard to the long-term sustainability of the outcomes of the research, apart from that which appears in journal and conference publications. Outcomes such as research data and computer software are often stored on local computers, and can be lost over time as projects end, students graduate and equipment fails and/or is replaced. Enormous effort is invested in the production of these outputs, which have great potential value for future research, but the benefit of this effort is rarely felt outside of the research group in which it took place. Arguments for sustainability begin with the cost-savings that result from re-use of software and data, but extend to other issues more fundamental to the scientific process. These are enunciated in the "reproducible research" movement [Buckheit and Donoho, 1995], [Vandewalle et al., 2009], which promotes the idea that, along with any scientific publication, there should be a simultaneous release of all software and data used in generating the results in the publication, so that results may be verified, comparisons with alternative approaches performed, and algorithms extended, without the significant overhead of reimplementing published work.
Various practical difficulties hinder the creation of long-term sustainable research outputs. The research software development process is usually gradual and exploratory, rather than following standard software engineering principles. This makes code less robust, so that it requires greater effort to maintain and adapt. Researchers have varying levels of coding ability, and may be unwilling to publicise their less-than-perfect efforts. Even when researchers do make code available, their priority is to move on to other research, rather than to undertake the additional software engineering effort that might make their work more usable. Such engineering effort can be difficult to justify in research funding proposals, where priority is given to work that is seen as "research" rather than "development". Moreover, research careers tend to progress on the basis of high-impact papers, while software, data and other outputs are rarely considered. Another perceived difficulty is that public release of software might compromise later opportunities for commercialisation, although various licences exist which allow both to occur [Stodden, 2009].
To these general problems we may add several issues specific to the music information research community. The release of data is restricted by copyright regulations, particularly relating to audio recordings, but this is also relevant for scores, MIDI files, and other types of data. The laws are complex and vary between countries. Many researchers, being unsure of the legal ramifications of the release of data, prefer the safer option of not releasing data. Reliance on specific hardware or software platforms also makes code difficult to maintain in the longer term. One solution for obsolete hardware platforms is the use of software emulation, as addressed by the EU projects PLANETS and KEEP. For music-related research, such general-purpose emulation platforms might not be sufficient to reproduce audio-specific hardware [Pennycook, 2008].
In the MIR community, great effort has been expended to provide a framework for the comparison of music analysis and classification algorithms, via the MIREX evaluations as well as the more recent MusiCLEF and Million Song Dataset challenges (cf. the section above). The Mellon-funded NEMA project attempted to develop a web service allowing researchers to test their algorithms outside of the annual MIREX cycle. Although there is a growing number of open-access journals and repositories for software and data, obstacles such as publication costs and lack of training hinder widespread adoption. The training aspect is addressed by the Sound Software and Sound Data Management Training projects; further resources include the SHERPA/RoMEO service, which documents journal self-archiving and open-access policies, and the spol initiative for reproducible research (spol-discuss@list.ipol.im).
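As an illustration of the kind of practice the reproducible research movement advocates, the following sketch records, alongside experiment results, the metadata needed to rerun and verify them (code revision, dataset checksum, parameters, environment). It is a minimal, hypothetical example, not part of any MIREX or NEMA infrastructure; all file names, fields and values are assumptions.

```python
# Minimal sketch of saving experiment results with provenance metadata.
# All names and values are illustrative assumptions, not an existing tool.
import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone

def sha256_of(path):
    """Return the SHA-256 digest of a file, pinning the exact dataset version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def git_revision():
    """Record the code revision, if the experiment runs inside a git checkout."""
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def save_with_provenance(results, dataset_path, params, out_path="results.json"):
    """Write results together with the metadata needed to reproduce them."""
    record = {
        "results": results,
        "parameters": params,
        "dataset_sha256": sha256_of(dataset_path),
        "code_revision": git_revision(),
        "python_version": platform.python_version(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)

# Hypothetical usage:
# save_with_provenance({"f_measure": 0.82}, "annotations.csv", {"window": 0.05})
```

A record of this kind, released together with the code and data it points to, lets a third party check that the published numbers can actually be regenerated from a specific code revision and dataset.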
References
- [Buckheit and Donoho, 1995] J. Buckheit and D. L. Donoho. WaveLab and reproducible research. In Wavelets and Statistics, Springer-Verlag, Berlin, New York, 1995.
- [Cano et al., 2006] P. Cano, E. Gómez, F. Gouyon, P. Herrera, M. Koppenberger, B. Ong, X. Serra, S. Streich, and N. Wack. ISMIR 2004 Audio Description Contest. Music Technology Group Technical Report, Universitat Pompeu Fabra, MTG-TR-2006-02, 2006.
- [Downie, 2006] J. Stephen Downie. The Music Information Retrieval Evaluation eXchange (MIREX). D-Lib Magazine, 2006.
- [Downie, 2003] J. S. Downie. The MIR/MDL Evaluation Project White Paper Collection: Establishing Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation Frameworks: Preliminary Foundations and Infrastructures. Graduate School of Library and Information Science, University of Illinois, 2003.
- [Guyon et al., 2004] Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2004.
- [Minsky and Laske, 1992] Marvin Minsky and Otto E. Laske. A conversation with Marvin Minsky. AI Magazine, 13(3): 31-45, 1992.
- [Orio et al., 2011] Nicola Orio, David Rizo, Riccardo Miotto, Markus Schedl, Nicola Montecchio, and Olivier Lartillot. MusiCLEF: A benchmark activity in multimodal music information retrieval. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 603-608, Miami, USA, 2011.
- [Pearce and Hirsch, 2000] David Pearce and Hans-Gunter Hirsch. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), pp. 29-32, 2000.
- [Peeters et al., 2012] Geoffroy Peeters, Julián Urbano, and Gareth J. F. Jones. Notes from the ISMIR12 late-breaking session on evaluation in music information retrieval. In Proceedings of the 13th International Society for Music Information Retrieval Conference, 2012.
- [Pennycook, 2008] Bruce Pennycook. Who will turn the knobs when I die? Organised Sound, 13(3): 199-208, 2008.
- [Stodden, 2009] Victoria Stodden. The legal framework for reproducible scientific research: Licensing and copyright. Computing in Science and Engineering, 11(1): 35-40, 2009.
- [Urbano, 2013] Julián Urbano. Evaluation in music information retrieval. Journal of Intelligent Information Systems, in press, 2013.
- [Vandewalle et al., 2009] Patrick Vandewalle, Jelena Kovacevic, and Martin Vetterli. Reproducible research in signal processing - what, why, and how. IEEE Signal Processing Magazine, 26(3): 37-47, 2009.
Challenges
- Promote best-practice evaluation methodology within the MIR community. The MIR community should strive to promote, at the level of individual researchers, the use of rigorous evaluation practices where appropriate.
- Define meaningful evaluation tasks. Specific tasks that are part of large-scale international evaluations define de facto the topics that new contributors to the MIR field will work on. The very definition of such tasks is therefore of utmost importance and should be addressed according to some agreed criteria. For instance, tasks should have a well-defined community of users for whom they are relevant, e.g. while audio onset detection is only marginally relevant for industry, it is very relevant to research. The MIR research community should also open up to tasks defined by the industry, e.g. as the Multimedia community does with the "Grand Challenges" at the ACM Multimedia conference.
- Define meaningful evaluation methodologies. Evaluation of algorithms should effectively contribute to the creation of knowledge and to general improvements in the MIR community. Effectively building upon the MIR legacy and providing meaningful improvements call for a constant questioning of all aspects of the evaluation methodology (metrics, corpus definition, etc.). For instance, evaluation metrics are currently useful for quantifying each system's performance (a hedged sketch of one such metric follows this list); a further challenge is to make them also provide qualitative insight into how to improve a given system. Likewise, data curation is costly and time-consuming, so a related challenge is to aggregate, for evaluation purposes, data and metadata with the quality of a curated collection, and to preserve their provenance.
- Evaluate whole MIR systems. While the evaluation of basic MIR components (estimators for beat, chords, fundamental frequency, etc.) is important, the MIR community must dedicate more effort to the evaluation of whole MIR systems, e.g. music recommendation or music browsing systems. Such evaluations will provide insight into which components are relevant to the overall system and which are not.
- Promote evaluation tasks using multimodal data. Most MIR systems are concerned with audio-only or symbolic-only scenarios. A particular challenge is to target the evaluation of multimodal systems, aggregating information from e.g. audio, text, etc.
- Implement sustainable MIR evaluation initiatives. An important challenge for MIR evaluation initiatives is to ensure their sustainability over time. The MIR community must dedicate more effort to its legacy in terms of evaluation frameworks. This raises many issues related, for example, to funding, data availability, manpower, infrastructure costs, continuity and reproducibility.
- Target long-term sustainability of Music Information Research. Focusing on the sustainability of MIR evaluation initiatives is only part of the general challenge of ensuring the long-term sustainability of MIR itself. In particular, consistent efforts should be made to foster reproducible research through papers, software and data that can be reused to verify or extend published work. Training will also be necessary to reorient the MIR community towards research practices that ensure reproducibility from the code, data and publication perspectives. Any progress towards reproducible research will have an immediate impact not only on the MIR field itself, but also on the application of MIR technologies in other research fields.
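To make the notion of a quantitative evaluation metric concrete, the sketch below computes a tolerance-window precision, recall and F-measure for event-level tasks such as onset or beat detection. It is an illustrative assumption rather than the MIREX implementation: the 50 ms tolerance, the greedy matching strategy and the example values are all hypothetical.

```python
# Illustrative event-level evaluation metric (not the MIREX implementation):
# precision, recall and F-measure with a fixed tolerance window in seconds.
def f_measure(reference, estimated, window=0.05):
    """Greedily match each estimated event to at most one reference event
    within +/- `window` seconds and report (precision, recall, F)."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = 0
    i = 0
    for est in estimated:
        # Skip reference events that can no longer match any later estimate.
        while i < len(reference) and reference[i] < est - window:
            i += 1
        if i < len(reference) and abs(reference[i] - est) <= window:
            matched += 1
            i += 1  # each reference event may be matched only once
    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Hypothetical example: reference onsets vs. detector output, in seconds.
print(f_measure([0.50, 1.00, 1.52, 2.00], [0.48, 1.06, 1.50]))
```

Such a metric summarises performance in a single number, which is exactly what makes it useful for ranking systems and, at the same time, illustrates the challenge raised above: the score alone says little about why a system missed particular events or how it could be improved.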