Evaluation methodologies: Challenges
From MIReS
- Promote best-practice evaluation methodology within the MIR community. The MIR community should strive to promote, at the level of individual researchers, the use of rigorous evaluations where appropriate.
- Define meaningful evaluation tasks. The specific tasks included in large-scale international evaluations de facto define the topics that newcomers to the MIR field will work on. The definition of such tasks is therefore of utmost importance and should follow agreed criteria. For instance, tasks should have a well-defined community of users for whom they are relevant: audio onset detection, for example, is only marginally relevant to industry but very relevant to research. The MIR research community should also open up to tasks defined by industry, as the Multimedia community does with the "Grand Challenges" at the ACM Multimedia conference.
- Define meaningful evaluation methodologies. Evaluation of algorithms should contribute effectively to the creation of knowledge and to general progress in the MIR community. Building effectively upon the MIR legacy and delivering meaningful improvements call for constant questioning of every aspect of the evaluation methodology (metrics, corpus definition, etc.). For instance, current evaluation metrics are useful for quantifying each system's performance; a challenge is to make them also yield qualitative insight into how to improve that system (a minimal sketch of this idea follows this list). In addition, data curation is costly and time-consuming, so a further challenge is to aggregate, for evaluation purposes, data and metadata with the quality of a curated collection, and to preserve their provenance.
- Evaluate whole MIR systems. While evaluation of basic MIR components (estimators of beat, chords, fundamental frequency, etc.) is important, the MIR community must dedicate more effort to evaluating whole MIR systems, e.g. music recommendation or music browsing systems. Such evaluations will yield insight into which components actually matter to the system and which do not (see the ablation sketch after this list).
- Promote evaluation tasks using multimodal data. Most MIR systems address audio-only or symbolic-only scenarios. A particular challenge is to target the evaluation of multimodal systems that aggregate information from audio, text, and other sources.
- Implement sustainable MIR evaluation initiatives. An important challenge for MIR evaluation initiatives is to ensure their sustainability over time. The MIR community must dedicate more effort to its legacy of evaluation frameworks. This raises many issues related, for example, to funding, data availability, manpower, infrastructure costs, continuity, and reproducibility.
- Target long-term sustainability of Music Information Research. Ensuring the sustainability of MIR evaluation initiatives is only part of the broader challenge of securing the long-term sustainability of MIR itself. In particular, consistent efforts should be made to foster reproducible research through papers, software and data that can be reused to verify or extend published work (see the provenance sketch after this list). Training will also be needed to reorient the MIR community towards research practices that ensure reproducibility from the code, data, and publication perspectives. Any progress towards reproducible research will have an immediate impact not only on the MIR field itself, but also on the application of MIR technologies in other research fields.
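To make the metrics point concrete, here is a minimal sketch of an onset-detection evaluation that reports not only the usual precision/recall/F-measure but also the lists of missed and spurious onsets, i.e. the qualitative information a developer needs to improve a system. The greedy matching, the 50 ms tolerance window, and all names are illustrative assumptions, not a prescribed standard; in practice an established implementation such as mir_eval should be preferred.

```python
def onset_scores(reference, estimated, window=0.05):
    """Greedily match estimated onsets to reference onsets within a
    +/- `window` second tolerance (one-to-one), returning the usual
    metrics together with the unmatched events for error analysis."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched_ref, matched_est = set(), set()
    for j, est in enumerate(estimated):
        # Nearest still-unmatched reference onset within the tolerance.
        candidates = [(abs(ref - est), i)
                      for i, ref in enumerate(reference)
                      if i not in matched_ref and abs(ref - est) <= window]
        if candidates:
            matched_ref.add(min(candidates)[1])
            matched_est.add(j)
    tp = len(matched_est)
    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    # The *qualitative* part: which events were missed or spurious.
    misses = [r for i, r in enumerate(reference) if i not in matched_ref]
    false_alarms = [e for j, e in enumerate(estimated) if j not in matched_est]
    return f, precision, recall, misses, false_alarms

# Toy annotations in seconds.
f, p, r, miss, fa = onset_scores([0.50, 1.00, 1.52, 2.00],
                                 [0.51, 1.03, 2.40])
print(f"F={f:.2f} P={p:.2f} R={r:.2f} missed={miss} spurious={fa}")
```

Inspecting where the misses cluster (e.g. soft onsets or dense passages) turns a single score into actionable feedback rather than a mere ranking.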
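For whole-system evaluation, a simple leave-one-component-out (ablation) protocol already reveals which components matter end-to-end. The sketch below assumes a hypothetical music recommender with three components and uses placeholder scores so that it runs; in a real study, `evaluate_system` would execute the full pipeline on held-out data and compute an end-to-end metric such as precision@k.

```python
COMPONENTS = ("timbre_features", "collaborative_filtering", "metadata_tags")

def evaluate_system(enabled):
    """Stand-in for an end-to-end evaluation of a hypothetical music
    recommender (e.g. precision@k on held-out listening histories).
    The scores below are placeholders; replace with real measurements."""
    placeholder = {
        frozenset(COMPONENTS): 0.71,
        frozenset(COMPONENTS) - {"timbre_features"}: 0.70,
        frozenset(COMPONENTS) - {"collaborative_filtering"}: 0.52,
        frozenset(COMPONENTS) - {"metadata_tags"}: 0.66,
    }
    return placeholder[frozenset(enabled)]

full = evaluate_system(COMPONENTS)
print(f"full system: {full:.2f}")
for comp in COMPONENTS:
    score = evaluate_system(set(COMPONENTS) - {comp})
    # A large drop means the component matters end-to-end, even if its
    # isolated component-level benchmark score looks unremarkable.
    print(f"without {comp}: {score:.2f} (delta {score - full:+.2f})")
```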
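Finally, on reproducibility: a lightweight first step is to record, alongside every published result, enough provenance to rerun the experiment. The sketch below captures a dataset checksum, code revision, environment, and parameters in a JSON sidecar; the field names and file layout are illustrative assumptions, not an established MIR convention.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time

def provenance_record(dataset_path, params, out_path="run_provenance.json"):
    """Write a small JSON sidecar capturing what is needed to reproduce
    a run: data checksum, code revision, environment, and parameters."""
    with open(dataset_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown (not a git checkout)"
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "dataset_sha256": digest,
        "git_commit": commit,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": params,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# Example: log the exact conditions of an onset-detection experiment
# (the corpus filename and parameter names are hypothetical).
# provenance_record("onsets_corpus.tar.gz", {"window_s": 0.05, "hop_s": 0.01})
```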