Evaluation methodologies


It is paramount to MIR that independent researchers build upon previous research, and an overarching challenge in MIR is to define and implement research evaluation methodologies that effectively contribute to the creation of knowledge and to general improvements in the field. In many scientific disciplines dealing with data processing, significant improvements over the long term have been achieved by empirically defining evaluation methodologies through several iterations of an experimental "loop" comprising formalisation, implementation, experimentation and, finally, validity analysis. In MIR, evaluation initiatives have played an increasing role over the last 10 years, and the community is presently facing the validity-analysis issue: that is, finding the most appropriate way to build upon its own legacy and to redefine the evaluation methodologies that will best lead to future improvements, a resolution that will in turn raise further technical challenges further down the "loop". This will require above all a deeper involvement of more MIR researchers in the very definition of the evaluation methodologies, as they are the individuals with the best understanding of the relevant computational issues. It will also require the involvement of the music industry (e.g. by proposing evaluations of relevance to them) and of content providers (so that researchers have access to data). Effective MIR evaluations will fundamentally change the way MIR research is done; they will positively affect the breadth and depth of MIR research and increase its relevance to other research fields.
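
To make the loop concrete, a minimal sketch is given below (in Python, with purely hypothetical task, clip and system names, not any existing benchmark): formalisation fixes the task, ground truth and metric; implementation and experimentation run each system and score it; and validity analysis inspects the spread of the scores before any ranking is trusted.

# Illustrative sketch of the experimental "loop": formalisation,
# implementation, experimentation and validity analysis.
# All task, clip and system names are hypothetical placeholders.

from statistics import mean, stdev

def formalise_task():
    """Formalisation: fix the task, the ground truth and the metric."""
    ground_truth = {"clip_01": "rock", "clip_02": "jazz", "clip_03": "rock"}
    def per_item_accuracy(predicted, reference):
        return 1.0 if predicted == reference else 0.0
    return ground_truth, per_item_accuracy

def run_experiment(system, ground_truth, metric):
    """Implementation and experimentation: run a system and score each item."""
    predictions = {clip: system(clip) for clip in ground_truth}
    return [metric(predictions[clip], label) for clip, label in ground_truth.items()]

def validity_analysis(scores_per_system):
    """Validity analysis: inspect the score spread before trusting a ranking."""
    for name, scores in scores_per_system.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        print(f"{name}: mean score {mean(scores):.2f}, spread {spread:.2f}")

if __name__ == "__main__":
    ground_truth, metric = formalise_task()
    systems = {
        "always_rock": lambda clip: "rock",  # trivial baseline
        "variant": lambda clip: "jazz" if clip.endswith("02") else "rock",
    }
    scores = {name: run_experiment(fn, ground_truth, metric)
              for name, fn in systems.items()}
    validity_analysis(scores)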


Back to → Roadmap: Technological perspective



State of the art

Many experimental disciplines have witnessed significant improvements over the long term thanks to community-wide efforts in systematic evaluations. This is the case, for instance, of (text-based) Information Retrieval with the TREC (Text REtrieval Conference) and CLEF (Cross-Language Evaluation Forum) initiatives, of Speech Recognition [Pearce and Hirsch, 2000], of Machine Learning [Guyon et al., 2004], and of Video and Multimedia Retrieval with e.g. the TRECVID and VideoCLEF initiatives (the latter later generalised into the MediaEval Benchmarking Initiative for Multimedia Evaluation).

Although evaluation "per se" has not been a traditional focus of pioneering computer music conferences (such as the ICMC) and journals (e.g. Computer Music Journal), recent attention has been given to the topic. In 1992, the visionary Marvin Minsky declared: "the most critical thing, in both music research and general AI research, is to learn how to build a common music database" [Minsky and Laske, 1992], but it was not until a series of encounters, workshops and special sessions organised between 1999 and 2003 by researchers from the newly-born Music Information Retrieval community that the necessity of conducting rigorous and comprehensive evaluations was recognised [Downie, 2003].

The first public international evaluation benchmark took place at the ISMIR Conference in 2004 [Cano et al., 2006], with the objective of comparing state-of-the-art audio algorithms and systems relevant to several music content description tasks. This effort has since been systematised and continued through the yearly Music Information Retrieval Evaluation eXchange (MIREX), which has widened the scope of the evaluations and now covers a broad range of tasks, including symbolic data description and retrieval [Downie, 2006].

The number of evaluation endeavours originating from other communities (e.g. Signal Processing, Data Mining, Information Retrieval) yet relevant to MIR has recently increased significantly. For instance, the Signal Separation Evaluation Campaign (SiSEC), started in 2008, deals with aspects of source separation in signals of different natures (music, audio, biomedical, etc.). A Data Mining contest organised at the 19th International Symposium on Methodologies for Intelligent Systems (ISMIS) and run on the TunedIT platform included two tracks relevant to MIR research: Music Genre recognition and Musical Instrument recognition. The CLEF initiative (an IR evaluation forum) extended its scope to MIR with the MusiCLEF initiative [Orio et al., 2011]. The ACM Special Interest Group on Knowledge Discovery and Data Mining organises a yearly competition, the KDD Cup, focusing on a different Data Mining topic each year; in 2011 the competition focused on a core MIR topic, music recommendation. In 2012, MediaEval (the Benchmarking Initiative for Multimedia Evaluation) organised a music-related task for the first time. Also in 2012, the Million Song Dataset challenge appeared: a music recommendation challenge incorporating many different sorts of data (user data, tags, ...).

The establishment of an annual evaluation forum (MIREX), accepted by the community, and the appearance of relevant satellite forums in neighbouring fields have undoubtedly been beneficial to the MIR field. However, a lot of work is still necessary to reach a level where evaluations have a systematic and traceable positive impact on the development of MIR systems and on the creation of new knowledge in MIR. For about 10 years, meta-evaluation methodologies have been instrumental in the advancement of the Text Information Retrieval field; they need to be addressed in MIR too [Urbano, 2013]. The special panel and late-breaking news session held at ISMIR 2012 addressed the various evaluation methodologies used in the MIR field and compared them to those used in other initiatives such as MediaEval [Peeters et al., 2012].
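
As a small illustration of one question such meta-evaluation raises, the sketch below (using SciPy and invented per-query scores, not real MIREX figures) applies a paired Wilcoxon signed-rank test to ask whether an observed difference between two systems is distinguishable from chance.

# Toy illustration of one meta-evaluation question: is the observed
# difference between two systems' per-query scores statistically meaningful?
# The score arrays below are invented placeholders, not real evaluation results.

from scipy import stats

system_a = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59, 0.70, 0.52]  # per-query scores
system_b = [0.58, 0.57, 0.69, 0.50, 0.60, 0.55, 0.71, 0.49]

# Paired test: the same queries are evaluated by both systems.
statistic, p_value = stats.wilcoxon(system_a, system_b)

mean_diff = sum(a - b for a, b in zip(system_a, system_b)) / len(system_a)
print(f"mean difference: {mean_diff:+.3f}, Wilcoxon p-value: {p_value:.3f}")
# A small mean difference with a large p-value is a reminder that system
# rankings obtained on a single collection may not generalise.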


Reproducible Research

Much computational science research is conducted without regard to the long-term sustainability of its outcomes, apart from those that appear in journal and conference publications. Outcomes such as research data and computer software are often stored on local computers, and can be lost over time as projects end, students graduate and equipment fails or is replaced. Enormous effort is invested in the production of these outputs, which have great potential value for future research, but the benefit of this effort is rarely felt outside the research group in which it took place. Arguments for sustainability begin with the cost savings that result from re-use of software and data, but extend to other issues more fundamental to the scientific process. These are enunciated in the "reproducible research" movement [Buckheit and Donoho, 1995], [Vandewalle et al., 2009], which promotes the idea that, along with any scientific publication, there should be a simultaneous release of all software and data used in generating its results, so that results may be verified, comparisons with alternative approaches performed, and algorithms extended, without the significant overhead of reimplementing published work.
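
A minimal sketch of this idea is given below (with hypothetical file names and a toy experiment, not a procedure prescribed by the cited papers): a result-generating script that fixes its random seed and records the execution environment alongside its output, so that a released code-plus-data bundle can regenerate the published numbers.

# Minimal sketch of a "reproduce the paper's numbers" script.
# File names and the experiment itself are hypothetical; the point is
# fixed seeding plus environment logging next to the results.

import json
import platform
import random
import sys

SEED = 20120401  # fixed seed so the released code regenerates identical output

def run_experiment(seed):
    """Stand-in for the real experiment: here, a toy simulated accuracy."""
    rng = random.Random(seed)
    return {"accuracy": round(0.70 + 0.05 * rng.random(), 4)}

if __name__ == "__main__":
    results = run_experiment(SEED)
    # Record everything needed to audit the run alongside the results.
    record = {
        "results": results,
        "seed": SEED,
        "python": sys.version,
        "platform": platform.platform(),
    }
    with open("results.json", "w") as fh:
        json.dump(record, fh, indent=2)
    print(json.dumps(record, indent=2))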

Various practical difficulties hinder the creation of long-term sustainable research outputs. The research software development process is usually gradual and exploratory, rather than following standard software engineering principles. This makes code less robust, so that it requires greater effort to maintain and adapt. Researchers have varying levels of coding ability, and may be unwilling to publicise their less-than-perfect efforts. Even when researchers do make code available, their priority is to move on to other research rather than undertake the additional software engineering effort that might make their work more usable. Such software engineering efforts can be difficult to justify in research funding proposals, where priority is given to work that is seen as "research" rather than "development". Moreover, research careers tend to progress on the basis of high-impact papers, while software, data and other outputs are rarely considered. Another perceived difficulty is that public release of software might compromise later opportunities for commercialisation, although various licences exist which allow both to occur [Stodden, 2009].

To these general problems we may add several issues specific to the music information research community. The release of data is restricted by copyright regulations, particularly those relating to audio recordings, but this also applies to scores, MIDI files and other types of data. The relevant laws are complex and vary between countries, and many researchers, unsure of the legal ramifications of releasing data, prefer the safer option of not releasing it at all. Reliance on specific hardware or software platforms also makes code difficult to maintain in the longer term. One solution for obsolete hardware platforms is software emulation, as addressed by the EU projects PLANETS and KEEP; for music-related research, however, such general-purpose emulation platforms might not be sufficient to reproduce audio-specific hardware [Pennycook, 2008].

In the MIR community, great effort has been expended to provide a framework for the comparison of music analysis and classification algorithms, via the MIREX evaluations as well as the more recent MusiCLEF and Million Song Dataset challenges (cf. the section above). More recently, the Mellon-funded NEMA project attempted to develop a web service allowing researchers to test their algorithms outside the annual MIREX cycle. Although there is a growing number of open-access journals and repositories for software and data, obstacles such as publication costs and a lack of training hinder widespread adoption. Addressing the training aspect are the Sound Software and Sound Data Management Training projects, alongside the SHERPA/RoMEO project, which provides information on journal self-archiving and open-access policies, and the spol initiative for reproducible research (spol-discuss@list.ipol.im).


References


Challenges



Back to → Roadmap: Technological perspective
