The Million Song Dataset: Giving Back to Music Research

March 4, 2011

For far too long, researchers and engineers working on Music Information Retrieval (MIR) have been forced to pay a hefty ante before being able to conduct their research: namely, they’ve had to build a set of data on which to test their theories and hone their algorithms.

It may have started as a flippant suggestion for how to solve that problem, but The Million Song Dataset is now real, and anyone can download it. A collaboration between The Echo Nest and Columbia University’s LabROSA (Laboratory for the Recognition and Organization of Speech and Audio), The Million Song Dataset has four main objectives:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • To provide a shortcut alternative to creating a large dataset with The Echo Nest’s API
  • To help new researchers get started in the MIR field

The Million Song Dataset offers researchers, engineers and commercial developers detailed sonic and cultural attributes for each song, as well as extensive metadata, both provided by The Echo Nest.

We’re hoping that this will not only give MIR researchers plenty of data to work with, but also strengthen the connection between academic research and commercial development. The Million Song Dataset includes mappings to 30-second samples on 7digital, allowing music recommendation and other algorithms to produce a commercially demonstrable output with a minimum of effort.

The Million Song Dataset was developed mainly by Brian Whitman and Paul Lamere of The Echo Nest and Daniel P.W. Ellis and Thierry Bertin-Mahieux of LabROSA, with hosting by Infochimps and funding from the National Science Foundation.