Musically relevant data

From MIReS


We define "musically relevant data" as any type of machine-readable data that can be analysed by algorithms and that can give us relevant information for the development of musical applications. The main challenge is to gather musically relevant data of sufficient quantity and quality to enable music information research that respects the broad multi-modality of music. After all, music today is an all-encompassing experience that is an important part of videos, computer games, Web applications, mobile apps and services, specialised blogs, artistic applications, etc. Therefore we should be concerned with the identification of all sources of musically relevant data, the proper documentation of the process of data assembly, and the resolution of all legal and ethical issues concerning the data. Sufficient quantity and quality of data is of course the prerequisite for any kind of music information research. To make progress in this direction it is necessary that the research community work together with the owners of data, be they copyright holders in the form of companies or individual persons sharing their data. Since music information research is by definition a data-intensive science, any progress in these directions will have an immediate impact on the field. It will enable a fostering and maturing of our research and, with the availability of new kinds of musically relevant data, open up possibilities for new kinds of research and applications.


Back to → Roadmap:Technological perspective



State of the art

Music Information Research (MIR) has so far been largely concerned with audio, neglecting many of the other forms of media in which music also plays an important role. As recently as ten years ago, the main media carrying music were audio recordings on CDs, terrestrial radio broadcasts, music videos on TV, and printed text in music magazines. Today music is an all-encompassing experience that is an important part of videos, computer games, Web applications, mobile apps and services, artistic applications, etc. In addition to printed text on music there exists a vast range of websites, blogs and specialised communities dedicated to writing and publishing about music. Therefore it is necessary for MIR to broaden its horizons and include a multitude of yet untapped data sources in its research agenda. Data that is relevant for Music Information Research can be categorised into four different subgroups: (i) audio content: any kind of information computed directly from the audio signal; (ii) music scores: any type of symbolic notation that is normally used for music performance and that captures the musical intention of a composer; (iii) music context: all information relevant to music which is not directly computable from the audio itself or the score, e.g. cover artwork, lyrics, but also artists' backgrounds and collaborative tags connected to the music; (iv) user context: any kind of data that allows us to model the users in specific usage settings.
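
Purely as an illustration (not a prescribed format), the four subgroups could be gathered around a single music item roughly as in the following sketch; all class and field names are hypothetical:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class AudioContent:
    # (i) features computed directly from the audio signal
    low_level: Dict[str, List[float]] = field(default_factory=dict)   # e.g. MFCC frames
    high_level: Dict[str, str] = field(default_factory=dict)          # e.g. key, genre labels

@dataclass
class MusicScore:
    # (ii) symbolic notation capturing the composer's intention
    format: str = "MIDI"              # MIDI, MusicXML, tablature, ...
    path: Optional[str] = None

@dataclass
class MusicContext:
    # (iii) information not computable from the audio or the score
    lyrics: Optional[str] = None
    cover_artwork_url: Optional[str] = None
    collaborative_tags: List[str] = field(default_factory=list)

@dataclass
class UserContext:
    # (iv) data describing the user in a specific usage setting
    location: Optional[str] = None
    activity: Optional[str] = None    # e.g. "jogging", "driving"
    mood: Optional[str] = None

@dataclass
class MusicItem:
    title: str
    artist: str
    audio: Optional[AudioContent] = None
    score: Optional[MusicScore] = None
    music_context: Optional[MusicContext] = None
    user_context: Optional[UserContext] = None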

Let us start with the most prevalent source of data: audio content and any kind of information computed directly from the audio. Such information is commonly referred to as "features", with a certain consensus on distinguishing between low-level and high-level features (see e.g. [Casey et al., 2008]). Please see the section Music representations for an overview of the different kinds of features. It is obvious that audio content data is by far the most widely used and researched form of information in our community. This can be seen, for example, by looking at the tasks of the recent "Music Information Retrieval Evaluation eXchange" (MIREX 2012). MIREX is the foremost yearly community-based framework for the formal evaluation of MIR algorithms and systems. Out of its 16 tasks, all but one (Symbolic Melodic Similarity) deal with audio analysis, including challenges such as Audio Classification, Audio Melody Extraction, Cover Song Identification, Audio Key Detection, Structural Segmentation and Audio Tempo Estimation. Concerning the availability of audio content data there are several legal and copyright issues. To give an example, by far the largest data set in MIR, the "Million Song Dataset", does not include any audio, only the derived features. If researchers need to compute their own features, they have to use services like "7digital" to access the audio. Collections that do contain audio are usually very small, e.g. the well-known "GTZAN" collection assembled by George Tzanetakis in 2002, consisting of 1000 songs freely available from the Marsyas webpage. The largest freely downloadable audio data set is the "1517 Artists" collection, consisting of 3180 songs from 1517 artists. There also exist alternative collaborative databases of Creative Commons licensed sounds, such as Freesound.
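
As a minimal sketch of what such features look like in practice, the following example computes a few common low-level and mid-level descriptors, assuming the librosa library is installed; "song.wav" is a placeholder file name:

import librosa

# Load the audio as a mono signal at a fixed sampling rate.
y, sr = librosa.load("song.wav", sr=22050, mono=True)

# Timbre: 13 Mel-frequency cepstral coefficients per analysis frame (low-level).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Harmony: 12-dimensional chroma (pitch-class) features, often used for key and chord tasks.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Rhythm: a global tempo estimate in beats per minute (closer to a high-level feature).
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

print(mfcc.shape, chroma.shape, tempo)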

Another important source of information is of course symbolic data, i.e. the score of a piece of music if it is available in a machine-readable format such as MIDI, MusicXML, sequencer data or other kinds of abstract representations of music. Such music representations can be very close to the audio content, e.g. the score of one specific audio rendering, but they are usually not fully isomorphic to it. Going beyond more traditional annotations, recent work in MIR [Macrae and Dixon, 2011] has turned its attention to machine-readable tablatures and chord sequences, which are a form of hand-annotated score available in non-standardised text files (e.g. "Ultimate Guitar" contains more than 2.5 million guitar tabs). At the first ISMIR conference a large part of the contributed papers were concerned with symbolic data. Almost ten years later this balance seems to have reversed, with authors [Downie et al., 2009] lamenting that "ISMIR must rebalance the portfolio of music information types with which it engages" and that "research exploiting the symbolic aspects of music information has not thrived under ISMIR". Symbolic annotations of music present legal and copyright issues just like audio, but substantial collections (e.g. of MIDI files) do exist.
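
To illustrate how easily such symbolic data can be accessed programmatically, the following rough sketch reads a MIDI file, assuming the pretty_midi library; "score.mid" is a placeholder file name:

import pretty_midi

# Parse the MIDI file into instruments and note events.
midi = pretty_midi.PrettyMIDI("score.mid")

# Collect (onset time, MIDI pitch) pairs from all non-drum instruments,
# a simple symbolic representation usable for melody or similarity tasks.
notes = sorted(
    (note.start, note.pitch)
    for instrument in midi.instruments
    if not instrument.is_drum
    for note in instrument.notes
)

print(len(notes), "notes; estimated tempo:", round(midi.estimate_tempo(), 1), "BPM")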

Music context is all information relevant to a music item under consideration that is not extracted from the respective audio file itself or the score (see e.g. [Schedl and Knees, 2009] for an overview). A large part of research on music context is strongly related to web content mining. Over the last decade, mining the World Wide Web has been established as another major source of music-related information. Music-related data mined from the Web can be divided into "editorial" and "cultural" data. Whereas editorial data originates from music experts and editors, often associated with the music distribution industry, cultural data makes use of the wisdom of the crowd by mining large numbers of music-related websites, including social networks. Advantages of Web-based MIR are the vast amount of available data as well as its potential to access high-level semantic descriptions and subjective aspects of music not obtainable from audio-based analysis alone. Data sources include artist-related Web pages, published playlists, song lyrics, or blog and Twitter data concerned with music. Other data sources of music context are collaborative tags, mined for example from Last.fm [Levy and Sandler, 2007] or gathered via tagging games [Turnbull et al., 2007]. A problem with information obtained automatically from the Web is that it is inherently noisy and erroneous, which requires special techniques and care for data clean-up. Data about new and lesser-known artists in the so-called "long tail" is usually very sparse, which introduces an unwanted popularity bias [Celma, 2011]. A list of data sets frequently used in Web-based MIR is provided by Markus Schedl. The "Million Song Dataset" contains some Web-related information, e.g. tag information provided by Last.fm.
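
As an illustration of gathering music-context data, the following sketch queries collaborative tags for an artist from the Last.fm web API using the requests library; the API key is a placeholder, and the simple filtering step only hints at the noise-handling discussed above:

import requests

API_KEY = "YOUR_LASTFM_API_KEY"   # placeholder: a personal key must be requested from Last.fm

response = requests.get(
    "https://ws.audioscrobbler.com/2.0/",
    params={
        "method": "artist.gettoptags",
        "artist": "Radiohead",
        "api_key": API_KEY,
        "format": "json",
    },
    timeout=10,
)
response.raise_for_status()

# Collaborative tags are noisy: lower-case them and drop rarely applied ones.
tags = [
    tag["name"].lower()
    for tag in response.json()["toptags"]["tag"]
    if int(tag["count"]) >= 10
]
print(tags[:10])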

A possibly very rich source of additional information on music content that has so far received little attention in our community is music videos. The most prominent source for music videos is YouTube, but alternatives like Vimeo exist. Uploaded material ranges from amateur clips to video blogs to complete movies, with a large part of it being music videos. Whereas a lot of the content on YouTube has been uploaded by individuals, which may entail all kinds of copyright and legal issues, some large media companies have lately decided to also offer some of their content. There exists a lively community around the so-called TRECVid campaign, a forum, framework and conference series on video retrieval evaluation. One of the major tasks in video information retrieval is the automatic labelling of videos, e.g. according to genre, which can be done either globally or locally [Brezeale and Cook, 2008]. Typical information extracted from videos includes visual descriptors such as colour, its entropy and variance, and hue, as well as temporal cues like cuts, fades and dissolves. Object-based features like the occurrence of faces or text, and motion-based information like motion density and camera movement, are also of interest. Text-based information derived from subtitles, transcripts of dialogues, synopses or user tags is another valuable source. A potentially very promising approach is the combined analysis of a music video and its corresponding audio, pooling information from both image and audio signals. The combination of general audio and video information is an established topic in the literature, see e.g. [Wang et al., 2003] for an early survey. There already is a limited amount of research explicitly on music videos exploiting both the visual and audio domains [Gillet et al., 2007]. Although the TRECVid evaluation framework supports a "Multimedia event detection evaluation track" consisting of both audio and video, to our knowledge no data set dedicated specifically to music videos exists.
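
To make the visual descriptors mentioned above more concrete, the following toy sketch computes per-frame colour histograms and a crude cut detector based on histogram differences, assuming OpenCV (cv2); the file name and threshold are placeholders, and real shot-boundary detectors are considerably more robust:

import cv2

capture = cv2.VideoCapture("video.mp4")
previous_hist, cuts, frame_index = None, [], 0

while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Colour descriptor: a 16x16 hue/saturation histogram of the frame.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()
    # Temporal cue: a large distance between consecutive histograms suggests a hard cut.
    if previous_hist is not None:
        if cv2.compareHist(previous_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > 0.5:
            cuts.append(frame_index)
    previous_hist, frame_index = hist, frame_index + 1

capture.release()
print(frame_index, "frames,", len(cuts), "candidate cuts at frames", cuts[:10])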

Another as yet untapped source is machine-readable texts on musicology that are available online (e.g. via Google Books). Google Books is a search engine that searches the full text of books that have already been scanned and digitised by Google. This offers the possibility of using Natural Language Processing tools to analyse text books on music, thereby introducing MIR topics into the emerging field of digital humanities.
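
A toy first step in this direction, using only the Python standard library, could be counting mentions of a small hand-picked vocabulary of musical terms in a digitised text; the file name and term list are placeholders for whatever corpus and vocabulary a study would actually use:

import re
from collections import Counter

terms = {"sonata", "fugue", "counterpoint", "cadence", "motif"}

# Tokenise the digitised text very crudely into lower-case words.
with open("treatise.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

# Count how often each musical term occurs.
counts = Counter(token for token in tokens if token in terms)
print(counts.most_common())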

As stated above, user-context data is any kind of data that allows us to model a single user in one specific usage setting. In most MIR research and applications so far, the prospective user is seen as a generic being for whom a generic one-size-fits-all solution is sufficient. Typical systems aim at modelling a supposedly objective music similarity function, which then drives music recommendation, playlisting and other related services. This, however, neglects the very subjective nature of music experience and perception. Not only do different people perceive music in different ways depending on their likes, dislikes and listening history, but even one and the same person will exhibit changing tastes and preferences depending on a wide range of factors: time of day, social situation, current mood, location, etc. Personalising music services can therefore be seen as an important topic of future MIR research.

Following recent proposals (see e.g. [Schedl and Knees, 2011]), we distinguish five different kinds of user context data: (i) Environment Context, (ii) Personal Context, (iii) Task Context, (iv) Social Context, and (v) Spatio-temporal Context. The environment context is defined as all entities that can be measured from the surroundings of a user, such as the presence of other people and things, climate including temperature and humidity, noise and light. The personal context can be divided into the physiological context and the mental context. Whereas physiological context refers to attributes like weight, blood pressure, pulse, or eye colour, the mental context is any data describing a user's psychological aspects like stress level, mood, or expertise. Another important form of physiological context data are recordings of gestures during musical performances with either traditional instruments or new interfaces to music. The task context describes all current activities pursued by the user, including direct user input to smart mobile phones and applications, activities like jogging or driving a car, but also interaction with diverse messenger and microblogging services. The latter is a valuable source for a user's social context, giving information about relatives, friends, or collaborators. The spatio-temporal context reveals information about a user's location, place, direction, speed, and time. As a general remark, the recent emergence of "always on" devices (e.g. smart phones) equipped not only with a permanent Web connection, but also with various built-in sensors, has remarkably facilitated the logging of user context data from a technical perspective. Data sets on the user context are still very rare, but e.g. the user-song-play count triplets and the Last.fm tags of the "Million Song Dataset" could be said to contain this type of personal information.
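
Purely as an illustration of how such heterogeneous user-context data might be structured for logging on an "always on" device, the following sketch groups the five kinds of context into one record; all class and field names are hypothetical:

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class EnvironmentContext:
    temperature_celsius: Optional[float] = None
    noise_level_db: Optional[float] = None
    people_nearby: Optional[int] = None

@dataclass
class PersonalContext:
    pulse_bpm: Optional[int] = None        # physiological
    mood: Optional[str] = None             # mental
    expertise: Optional[str] = None

@dataclass
class TaskContext:
    activity: Optional[str] = None         # e.g. "jogging", "driving"

@dataclass
class SocialContext:
    friends_present: List[str] = field(default_factory=list)

@dataclass
class SpatioTemporalContext:
    location: Optional[Tuple[float, float]] = None   # (latitude, longitude)
    timestamp: Optional[str] = None                  # ISO 8601 string

@dataclass
class UserContextRecord:
    environment: EnvironmentContext = field(default_factory=EnvironmentContext)
    personal: PersonalContext = field(default_factory=PersonalContext)
    task: TaskContext = field(default_factory=TaskContext)
    social: SocialContext = field(default_factory=SocialContext)
    spatio_temporal: SpatioTemporalContext = field(default_factory=SpatioTemporalContext)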

The proper documentation of the process of data assembly for all kinds of musically relevant data is a major issue which has not yet gained sufficient attention from the MIR community. In [Peeters and Fort, 2012] an overview is provided of the different practices of annotating MIR corpora. Currently, several methodologies are used for collecting these data: creating an artificial corpus [Yeh et al., 2007], recording corpora [Goto, 2006], or sampling the world of music according to specific criteria (Isophonics [Mauch et al., 2009], SALAMI [Smith et al., 2011], Billboard [Burgoyne et al., 2011], Million Song Dataset [Bertin-Mahieux et al., 2011]). The annotations can then be obtained from experts (the usual manual annotation [Mauch et al., 2009]), through crowd-sourcing [Levy, 2011] or so-called games with a purpose (Listen-Game [Turnbull et al., 2007], TagATune [Law et al., 2007], MajorMiner [Mandel and Ellis, 2008]), or by aggregating other content (Guitar-Tab [McVicar and De Bie, 2010], MusiXMatch and Last.fm in the case of the Million Song Dataset). As opposed to other domains, micro-working (such as Amazon Mechanical Turk) is not (yet) a common practice in the MIR field. These various methodologies involve various costs, from the most expensive (traditional manual annotation) to the least expensive (aggregation or crowd-sourcing). They also yield data of varying quality. This is related to inter-annotator and intra-annotator agreement, which is rarely assessed in MIR (a simple agreement measure is sketched below). Compared to other fields, such as natural language processing or speech, music-related data collection or creation does not follow dedicated protocols. One of the major issues in the MIR field will be to better define protocols for building reliable annotated MIR corpora.

Another important aspect is how our research community relates to initiatives aiming at unifying data formats on the World Wide Web. Initiatives that come to mind are e.g. Linked Data, which is a collection of best practices for publishing and connecting structured data on the Web, and, especially relevant for MIR, MusicBrainz, which strives to become the ultimate source of music information or even the universal lingua franca of music. It should also be clear that the diverse forms of data important for MIR are very much "live data", i.e. many data sets are constantly changing over time and need to be updated accordingly. Additionally, our community should strive to create data repositories which allow open access for the research community and possibly even the general public.
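
As noted above, inter-annotator agreement is rarely assessed in MIR. The following toy sketch computes Cohen's kappa, a standard chance-corrected agreement measure, for two hypothetical annotators labelling the same ten clips; the labels are invented for illustration only:

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators assigning genre tags to the same ten clips.
annotator_1 = ["rock", "rock", "jazz", "pop", "jazz", "rock", "pop", "pop", "jazz", "rock"]
annotator_2 = ["rock", "pop",  "jazz", "pop", "jazz", "rock", "pop", "rock", "jazz", "rock"]
print("kappa =", round(cohen_kappa(annotator_1, annotator_2), 2))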


References


Challenges



Back to → Roadmap:Technological perspective
