Musically relevant data
We define "musically relevant data" as any type of machine-readable data that can be analysed by algorithms and that can give us relevant information for the development of musical applications. The main challenge is to gather musically relevant data of sufficient quantity and quality to enable music information research that respects the broad multi-modality of music. After all, music today is an all-encompassing experience that is an important part of videos, computer games, Web applications, mobile apps and services, specialised blogs, artistic applications, etc. We should therefore be concerned with identifying all sources of musically relevant data, properly documenting the process of data assembly, and resolving all legal and ethical issues concerning the data. Sufficient quantity and quality of data is of course the prerequisite for any kind of music information research. To make progress in this direction it is necessary that the research community work together with the owners of data, be they copyright holders in the form of companies or individual persons sharing their data. Since music information research is by definition a data-intensive science, any progress in these directions will have immediate impact on the field. It will allow our research to mature and, with the availability of new kinds of musically relevant data, open up possibilities for new kinds of research and applications.
State of the art
Music Information Research (MIR) has so far been concerned to a large degree with audio, neglecting many of the other forms of media in which music also plays an important role. As recently as ten years ago, the main media concerned with music were audio recordings on CDs, terrestrial radio broadcasts, music videos on TV, and printed text in music magazines. Today music is an all-encompassing experience that is an important part of videos, computer games, Web applications, mobile apps and services, artistic applications, etc. In addition to printed text on music there exists a vast range of websites, blogs and specialised communities devoted to writing and publishing about music. It is therefore necessary for MIR to broaden its horizons and include a multitude of yet untapped data sources in its research agenda. Data that is relevant for Music Information Research can be categorised into four different subgroups: (i) audio content is any kind of information computed directly from the audio signal; (ii) music scores are any type of symbolic notation that is normally used for music performance and that captures the musical intention of a composer; (iii) music context is all information relevant to music which is not directly computable from the audio itself or the score, e.g. cover artwork and lyrics, but also artists' background and collaborative tags connected to the music; (iv) user context is any kind of data that allows us to model the users in specific usage settings.
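As a minimal sketch (with purely illustrative names and fields, not a prescribed schema), these four categories could be made explicit when assembling a multi-modal collection, so that every item carries a record of which kind of source it came from:

```python
# Minimal sketch: tagging items of a multi-modal music collection with the
# four categories of musically relevant data distinguished above.
# All names and fields are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class DataCategory(Enum):
    AUDIO_CONTENT = "audio content"    # information computed from the signal
    MUSIC_SCORE = "music score"        # symbolic notation (MIDI, MusicXML, ...)
    MUSIC_CONTEXT = "music context"    # lyrics, artwork, tags, artist background
    USER_CONTEXT = "user context"      # data describing the user and usage setting

@dataclass
class DataItem:
    category: DataCategory
    source: str        # e.g. a file path, URL or API endpoint
    description: str   # free-text documentation of how the item was obtained
```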
Let us start with the most prevalent source of data: audio content and any kind of information computed directly from the audio. Such information is commonly referred to as "features", with a certain consensus on distinguishing between low-level and high-level features (see e.g. [Casey et al., 2008]). Please see the section Music representations for an overview of the different kinds of features. It is obvious that audio content data is by far the most widely used and researched form of information in our community. This can be seen, for example, by looking at the tasks of the recent "Music Information Retrieval Evaluation eXchange" (MIREX 2012). MIREX is the foremost yearly community-based framework for the formal evaluation of MIR algorithms and systems. Out of its 16 tasks, all but one (Symbolic Melodic Similarity) deal with audio analysis, including challenges like Audio Classification, Audio Melody Extraction, Cover Song Identification, Audio Key Detection, Structural Segmentation and Audio Tempo Estimation. Concerning the availability of audio content data there are several legal and copyright issues. To give an example, by far the largest data set in MIR, the "Million Song Dataset", does not include any audio, only derived features. If researchers need to compute their own features, they have to use services like "7-Digital" to access the audio. Collections that do contain audio are usually very small, like the well-known "GTzan" collection assembled by George Tzanetakis in 2002, which consists of 1000 songs freely available from the Marsyas webpage. The largest freely downloadable audio data set is the "1517 Artists" collection consisting of 3180 songs from 1517 artists. There also exist alternative collaborative databases of Creative Commons licensed sounds like Freesound.
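As a hedged illustration of what "computing features directly from the audio" looks like in practice, the following sketch uses the open-source librosa library (one toolkit among many, not one prescribed above) to extract a few common low-level descriptors from a single audio file:

```python
# Minimal sketch: extracting common low-level audio features with librosa.
# The choice of library, sample rate and descriptors is an assumption made
# for illustration only.
import librosa
import numpy as np

def extract_basic_features(path, sr=22050):
    """Return a small dict of low-level descriptors for one audio file."""
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Timbre: MFCCs averaged over the whole track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Global tempo estimate (beats per minute)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Spectral centroid as a rough "brightness" descriptor
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

    return {
        "mfcc_mean": np.mean(mfcc, axis=1),
        "tempo_bpm": float(tempo),
        "centroid_mean": float(np.mean(centroid)),
    }
```

Descriptors of this kind are the typical low-level starting point; higher-level features (key, chords, structure) are usually derived from them in further processing steps.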
An important source of information to start with is of course symbolic data, i.e. the score of a piece of music if it is available in a machine-readable format such as MIDI, MusicXML, sequencer data or other kinds of abstract representations of music. Such music representations can be very close to audio content, e.g. the score of one specific audio rendering, but they are usually not fully isomorphic. Going beyond more traditional annotations, recent work in MIR [Macrae and Dixon, 2011] has turned its attention to machine-readable tablatures and chord sequences, which are a form of hand-annotated score available in non-standardised text files (e.g. "ultimate guitar" contains more than 2.5 million guitar tabs). At the first ISMIR conference a large part of the contributed papers was concerned with symbolic data. Almost ten years later this imbalance seems to have reversed, with authors [Downie et al., 2009] lamenting that "ISMIR must rebalance the portfolio of music information types with which it engages" and that "research exploiting the symbolic aspects of music information has not thrived under ISMIR". Symbolic annotations of music present legal and copyright issues just like audio, but substantial collections (e.g. of MIDI files) do exist.
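To make the notion of machine-readable symbolic data concrete, the following sketch reads note events from a MIDI file using the pretty_midi library (one of several possible toolkits; music21 or mido would serve equally well):

```python
# Minimal sketch: reading note events from a MIDI file with pretty_midi.
# The library choice is an assumption for illustration.
import pretty_midi

def midi_note_list(path):
    """Return (pitch, start_time, end_time) triples for all non-drum notes."""
    pm = pretty_midi.PrettyMIDI(path)
    notes = []
    for instrument in pm.instruments:
        if instrument.is_drum:
            continue
        for note in instrument.notes:
            notes.append((note.pitch, note.start, note.end))
    # Sort by onset time to obtain a simple symbolic representation
    return sorted(notes, key=lambda n: n[1])
```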
Music context is all information relevant to a music item under consideration that is not extracted from the respective audio file itself or the score (see e.g. [Schedl and Knees, 2009] for an overview). A large part of research on music context is strongly related to web content mining. Over the last decade, mining the World Wide Web has been established as another major source of music-related information. Music-related data mined from the Web can be divided into "editorial" and "cultural" data. Whereas editorial data originates from music experts and editors often associated with the music distribution industry, cultural data makes use of the wisdom of the crowd by mining large numbers of music-related websites, including social networks. Advantages of web-based MIR are the vast amount of available data as well as its potential to access high-level semantic descriptions and subjective aspects of music not obtainable from audio-based analysis alone. Data sources include artist-related Web pages, published playlists, song lyrics, or blogs and Twitter data concerned with music. Other data sources of music context are collaborative tags, mined for example from Last.fm [Levy and Sandler, 2007] or gathered via tagging games [Turnbull et al., 2007]. A problem with information obtained automatically from the Web is that it is inherently noisy and erroneous, which requires special techniques and care for data clean-up. Data about new and lesser-known artists in the so-called "long tail" is usually very sparse, which introduces an unwanted popularity bias [Celma, 2010]. A list of data sets frequently used in Web-based MIR is provided by Markus Schedl. The "Million Song Dataset" contains some web-related information, e.g. tag information provided by Last.fm.
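As a hedged sketch of how collaborative tags might be gathered, the snippet below queries the public Last.fm web API for the top tags of a track; the method name follows the public API documentation, but the API key is a placeholder and the exact response structure should be checked against the current docs. The defensive lookups reflect the noisiness of web-mined data noted above.

```python
# Minimal sketch: fetching collaborative tags for one track from the Last.fm
# web API. API key and error handling are illustrative assumptions.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

def top_tags(artist, track):
    resp = requests.get(
        "http://ws.audioscrobbler.com/2.0/",
        params={
            "method": "track.gettoptags",
            "artist": artist,
            "track": track,
            "api_key": API_KEY,
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Web-mined data is noisy: fields may be missing, so guard every lookup.
    tags = data.get("toptags", {}).get("tag", [])
    return [t.get("name", "") for t in tags if isinstance(t, dict)]
```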
A possibly very rich source of additional information on music content that has so far received little attention in our community is music videos. The most prominent source for music videos is YouTube, but alternatives like Vimeo exist. Uploaded material ranges from amateur clips to video blogs to complete movies, with a large part of it being music videos. Whereas a lot of the content on YouTube has been uploaded by individuals, which may entail all kinds of copyright and legal issues, some large media companies have lately decided to also offer some of their content. There exists a lively community around the so-called TRECVid campaign, a forum, framework and conference series on video retrieval evaluation. One of the major tasks in video information retrieval is the automatic labelling of videos, e.g. according to genre, which can be done either globally or locally [Brezeale and Cook, 2008]. Typical information extracted from videos includes visual descriptors like colour, its entropy and variance, and hue, as well as temporal cues like cuts, fades and dissolves. Object-based features like the occurrence of faces or text, and motion-based information like motion density and camera movement, are also of interest. Text-based information derived from sub-titles, transcripts of dialogues, synopses or user tags is another valuable source. A potentially very promising approach is the combined analysis of a music video and its corresponding audio, pooling information from both image and audio signals. The combination of general audio and video information is an established topic in the literature, see e.g. [Wang et al., 2003] for an early survey. There already is a limited amount of research explicitly on music videos exploiting both the visual and the audio domain [Gillet et al., 2007]. Although the TRECVid evaluation framework supports a "Multimedia event detection" evaluation track consisting of both audio and video, to our knowledge no data set dedicated specifically to music videos exists.
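The following sketch illustrates two of the visual descriptors mentioned above, per-frame colour histograms and a crude cut detector, using the OpenCV library; the histogram resolution and the cut threshold are illustrative assumptions rather than recommended values:

```python
# Minimal sketch: per-frame colour histograms and a simple cut detector for a
# music video, using OpenCV. Parameters are illustrative assumptions only.
import cv2
import numpy as np

def colour_histograms_and_cuts(path, cut_threshold=0.5):
    cap = cv2.VideoCapture(path)
    hists, cuts = [], []
    prev_hist, frame_idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Hue/saturation histogram as a coarse colour descriptor
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        hists.append(hist)
        if prev_hist is not None:
            # A large distance between consecutive histograms suggests a cut
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > cut_threshold:
                cuts.append(frame_idx)
        prev_hist = hist
        frame_idx += 1
    cap.release()
    return np.array(hists), cuts
```

Descriptors of this kind could then be aligned and pooled with audio features from the same clip for the combined audio-visual analysis discussed above.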
Another as yet untapped source are machine-readable texts on musicology that are available online (e.g. via Google Books). Google Books is a search engine that searches the full text of books that have already been scanned and digitised by Google. This offers the possibility of using Natural Language Processing tools to analyse text books on music, thereby introducing MIR topics to the emerging field of digital humanities.
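A deliberately simple sketch of such text analysis is given below: it counts mentions of a few composer names across a directory of digitised texts. The name list and file layout are toy assumptions; real digital-humanities work would rely on proper NLP tooling (tokenisation, named entity recognition, etc.).

```python
# Minimal sketch: a toy corpus statistic over digitised music texts.
# Composer list and directory layout are illustrative assumptions.
import re
from collections import Counter
from pathlib import Path

COMPOSERS = ["Bach", "Beethoven", "Mozart", "Schubert"]  # toy list

def composer_mentions(text_dir):
    """Count whole-word mentions of each composer across all .txt files."""
    counts = Counter()
    for path in Path(text_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name in COMPOSERS:
            counts[name] += len(re.findall(rf"\b{name}\b", text))
    return counts
```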
As stated above, user-context data is any kind of data that allows us to model a single user in one specific usage setting. In most MIR research and applications so far, the prospective user is seen as a generic being for whom a generic one-size-fits-all solution is sufficient. Typical systems aim at modelling a supposedly objective music similarity function which then drives music recommendation, playlisting and other related services. This, however, neglects the very subjective nature of music experience and perception. Not only do different people perceive music in different ways depending on their likes, dislikes and listening history, but even one and the same person will exhibit changing tastes and preferences depending on a wide range of factors: time of day, social situation, current mood, location, etc. Personalising music services can therefore be seen as an important topic of future MIR research.
Following recent proposals (see e.g. [Schedl and Knees, 2011]), we distinguish five different kinds of user-context data: (i) Environment Context, (ii) Personal Context, (iii) Task Context, (iv) Social Context, (v) Spatio-temporal Context. The environmental context is defined as all entities that can be measured from the surroundings of a user, like the presence of other people and things, climate including temperature and humidity, noise and light. The personal context can be divided into the physiological context and the mental context. Whereas the physiological context refers to attributes like weight, blood pressure, pulse, or eye colour, the mental context is any data describing a user's psychological aspects like stress level, mood, or expertise. Another important form of physiological context data are recordings of gestures during musical performances with either traditional instruments or new interfaces to music. The task context describes all current activities pursued by the user, including direct user input to smart mobile phones and applications, activities like jogging or driving a car, but also interaction with diverse messenger and microblogging services. The latter is a valuable source for a user's social context, giving information about relatives, friends, or collaborators. The spatio-temporal context reveals information about a user's location, place, direction, speed, and time. As a general remark, the recent emergence of "always on" devices (e.g. smart phones) equipped not only with a permanent Web connection but also with various built-in sensors has remarkably facilitated the logging of user-context data from a technical perspective. Data sets on the user context are still very rare, but e.g. the "user - song - play count" triplets and the Last.fm tags of the "Million Song Dataset" could be said to contain this type of personal information.
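One possible way to organise logged user-context data along these five dimensions is sketched below; all field names are illustrative assumptions, not a proposed standard.

```python
# Minimal sketch: a record structure for one logged snapshot of user context,
# grouped by the five context categories distinguished above.
# Field names and types are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UserContextSnapshot:
    # Environment context
    noise_level_db: Optional[float] = None
    temperature_c: Optional[float] = None
    # Personal context (physiological and mental)
    heart_rate_bpm: Optional[int] = None
    self_reported_mood: Optional[str] = None
    # Task context
    current_activity: Optional[str] = None      # e.g. "jogging", "driving"
    # Social context
    nearby_contacts: List[str] = field(default_factory=list)
    # Spatio-temporal context
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    timestamp_utc: Optional[str] = None         # ISO 8601 string
```

Sensors on "always on" devices would fill such snapshots continuously, and a personalised music service could condition its recommendations on the most recent snapshot.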
The proper documentation of the process of data assembly for all kinds of musically relevant data is a major issue which has not yet gained sufficient attention in the MIR community. In [Peeters and Fort, 2012] an overview is provided of the different practices of annotating MIR corpora. Currently, several methodologies are used for collecting these data: creating an artificial corpus [Yeh et al., 2007], recording corpora [Goto, 2006], or sampling the world of music according to specific criteria (Isophonics [Mauch et al., 2009], Salami [Smith et al., 2011], Billboard [Burgoyne et al., 2011], MillionSong [Bertin-Mahieux et al., 2011]). The data can then be obtained using experts (this is the usual manual annotation [Mauch et al., 2009]), using crowd-sourcing [Levy, 2011] or so-called games with a purpose (Listen-Game [Turnbull et al., 2007], TagATune [Law et al., 2007], MajorMiner [Mandel and Ellis, 2008]), or by aggregating other content (Guitar-Tab [McVicar and De Bie, 2010], MusiXMatch and Last.fm in the case of the MillionSong). As opposed to other domains, micro-working (such as Amazon Mechanical Turk) is not (yet) a common practice in the MIR field. These various methodologies involve various costs, from the most expensive (traditional manual annotation) to the least expensive (aggregation or crowd-sourcing). They also yield data of varying quality. This is related to the inter-annotator and intra-annotator agreement, which is rarely assessed in the case of MIR. Compared to other fields, such as natural language processing or speech, music-related data collection or creation does not follow dedicated protocols. One of the major issues in the MIR field will be to better define protocols for building reliable annotated MIR corpora.

Another important aspect is how our research community relates to initiatives aiming at unifying data formats on the World Wide Web. Initiatives that come to mind are e.g. Linked Data, which is a collection of best practices for publishing and connecting structured data on the Web, and, especially relevant for MIR, MusicBrainz, which strives to become the ultimate source of music information or even the universal lingua franca of music. It should also be clear that the diverse forms of data important for MIR are very much "live data", i.e. many data sets are constantly changing over time and need to be updated accordingly. Additionally, our community should strive to create data repositories which allow open access for the research community and possibly even the general public.
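As a small illustration of the inter-annotator agreement mentioned above, the following sketch computes Cohen's kappa for two annotators labelling the same segments (here with toy chord labels), using scikit-learn; any categorical annotation task could be assessed the same way.

```python
# Minimal sketch: inter-annotator agreement on categorical labels via
# Cohen's kappa. The label lists are toy data for illustration only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["C:maj", "G:maj", "A:min", "F:maj", "C:maj"]
annotator_b = ["C:maj", "G:maj", "A:min", "F:maj", "G:maj"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```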
References
- [Bertin-Mahieux et al., 2011] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In [Klapuri and Leider, 2011], pp. 591-596.
- [Brezeale and Cook, 2008] D. Brezeale and D. J. Cook. Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 38(3): 416-430, 2008.
- [Burgoyne et al., 2011] J. A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground-truth set for audio chord recognition and music analysis. In [Klapuri and Leider, 2011], pp. 633-638.
- [Casey et al., 2008] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-Based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE, 96(4): 668-696, March 2008.
- [Celma, 2010] Ò. Celma. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer-Verlag New York Inc, 2010.
- [Dannenberg et al., 2006] R. Dannenberg, K. Lemström, and A. Tindale, editors. ISMIR 2006, 7th International Conference on Music Information Retrieval, Victoria, Canada, 8-12 October 2006, Proceedings, 2006.
- [Dixon et al., 2007] S. Dixon, D. Bainbridge, and R. Typke, editors. Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007, Vienna, Austria, September 23-27, 2007, Austrian Computer Society, 2007.
- [Downie et al., 2009] J. S. Downie, D. Byrd, and T. Crawford. Ten Years of ISMIR: Reflections On Challenges and Opportunities. In [Hirata et al., 2009], pp. 13-18.
- [Gillet et al., 2007] O. Gillet, S. Essid and G. Richard. On the Correlation of Audio and Visual Segmentations of Music Videos. IEEE Transactions on Circuits and Systems for Video Technology, 17 (3): 347-355, March 2007.
- [Goto, 2006] M. Goto. AIST annotation for the RWC music database. In [Dannenberg et al., 2006], pp. 359-360.
- [Gouyon et al., 2012] F. Gouyon, P. Herrera, L. G. Martins, and M. Müller, editors. Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012, Mosteiro S.Bento Da Vitória, Porto, Portugal, October 8-12, 2012. FEUP Edições, 2012.
- [Hirata et al., 2009] K. Hirata, G. Tzanetakis, and K. Yoshii, editors. Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Kobe International Conference Center, Kobe, Japan, October 26-30, 2009. International Society for Music Information Retrieval, 2009.
- [Klapuri and Leider, 2011] A. Klapuri and C. Leider, editors. Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011. University of Miami, 2011.
- [Law et al., 2007] E. L. M. Law, L. von Ahn, R. B. Dannenberg, and M. Crawford. Tagatune: A game for music and sound annotation. In [Dixon et al., 2007], pp. 361-364.
- [Levy, 2011] M. Levy. Improving perceptual tempo estimation with crowd-sourced annotations. In [Klapuri and Leider, 2011], pp. 317-322.
- [Levy and Sandler, 2007] M. Levy and M. Sandler. A semantic space for music derived from social tags. In [Dixon et al., 2007], pp. 411-416.
- [Macrae and Dixon, 2011] R. Macrae and S. Dixon. Guitar Tab Mining, Analysis and Ranking. In [Klapuri and Leider, 2011], pp. 453-458.
- [Mandel and Ellis, 2008] M. Mandel and D. Ellis. A web-based game for collecting music metadata. Journal of New Music Research, 37(2): 151–165, 2008.
- [Mauch et al., 2009] M. Mauch, C. Cannam, M. Davies, S. Dixon, C. Harte, S. Kolozali, D. Tidhar, and M. Sandler. OMRAS2 metadata project 2009. In Late-breaking session at the 10th International Conference on Music Information Retrieval, Kobe, Japan, 2009.
- [McVicar and De Bie, 2010] M. McVicar and T. De Bie. Enhancing chord recognition accuracy using web resources. In Proceedings of 3rd International Workshop on Machine Learning and Music, MML '10, pp. 41–44, New York, USA, 2010.
- [Peeters and Fort, 2012] G. Peeters and K. Fort. Towards a (better) definition of the description of annotated M.I.R. corpora. In [Gouyon et al., 2012], pp. 25-30.
- [Schedl and Knees, 2011] M. Schedl and P. Knees. Personalization in Multimodal Music Retrieval. In Proceedings of the 9th International Workshop on Adaptive Multimedia Retrieval (AMR'11), Barcelona, Spain, 2011.
- [Schedl and Knees, 2009] M. Schedl and P. Knees. Context-based music similarity estimation. In Proceedings of the 3rd International Workshop on Learning the Semantics of Audio Signals (LSAS 2009), Graz, Austria, 2009.
- [Smith et al., 2011] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. De Roure, and J. Stephen Downie. Design and creation of a large-scale database of structural annotations. In [Klapuri and Leider, 2011], pp. 555-560.
- [Turnbull et al., 2007] D. Turnbull, R. Liu, L. Barrington, and G. R. G. Lanckriet. A game-based approach for collecting semantic annotations of music. In [Dixon et al., 2007], pp. 535-538.
- [Wang et al., 2003] H. Wang, A. Divakaran, A. Vetro, S.-F. Chang, and H. Sun. Survey of compressed-domain features used in audio-visual indexing and analysis. Journal of Visual Communication and Image Representation, 14(2): 150-183, 2003.
- [Yeh et al., 2007] C. Yeh, N. Bogaards, and A. Röbel. Synthesized polyphonic music database with verifiable ground truth for multiple f0 estimation. In [Dixon et al., 2007], pp. 393–398.
Challenges
- Identify all relevant types of data sources describing music. We have to consider the all-encompassing experience of music in all its broad multi-modality beyond just audio (video, lyrics, scores, symbolic annotations, gesture, tags, diverse metadata from websites and blogs, etc.). To achieve this it will be necessary to work together with experts from the full range of the multimedia community and to organise the data gathering process more systematically than has been the case so far.
- Guarantee sufficient quality of data (both audio and meta-data). At the moment the data available to our community stems from a wide range of very different sources, obtained with very different methods that are often not documented sufficiently. We will have to come to an agreement concerning unified data formats and protocols documenting the quality of our data. For this a dialogue within our community is necessary, which should also clarify our relation to more general efforts towards unifying data formats.
- Clarify the legal and ethical concerns regarding data availability as well as its use and exploitation. This applies to the question of what data we are allowed to have and what data we should have. The various copyright issues will make it indispensable to work together with content owners, copyright holders and other stakeholders. All ethical concerns about privacy have to be resolved. The combination of multiple sources of data poses additional problems in this sense.
- Ascertain what data users are willing to share. One of the central goals of future MIR will be to model the tastes, behaviours and needs of individual rather than merely generic users. Modelling individual users for the personalisation of MIR services presents a whole range of new privacy issues, since it requires handling very detailed and possibly sensitive information. This is of course closely connected to the privacy policies of diverse on-line systems concerning user data. This is also a matter of system acceptance, going far beyond mere legal concerns.
- Make available a sufficient amount of data to the research community, allowing easy and legal access to the data. Even for audio data, which has been used for research from the very beginning of MIR, the availability of sufficient benchmark data sets usable for evaluation purposes is still not a fully resolved issue. To allow MIR to grow from an audio-centred into a fully multi-modal science we will need benchmark data for all these modalities to allow evaluation and comparison of our results. Hence the already existing problem of data availability will become even more severe.
- Create open access data repositories. It will be of great importance for the advancement of MIR to create and maintain sustainable repositories of diverse forms of music related data. These repositories should follow open access licensing schemes.