Mixlists

Cleaning MusicBrainz Data

First, set up the system:

  1. Download the MusicBrainz database dump
  2. Run load_mbdump.py mbdump.tar.bz2 to generate a MySQL import
  3. Create a database: echo "create database musicbrainz" | mysql -u $USER -p
  4. Load data: mysql -u $USER -p musicbrainz < mbdump/mbdump_import.sql
  5. View database statistics

The data is "dirty" because it is only partially normalized. My primary interest is an interface that, when I select an artist, shows me every song that artist has performed. In the MusicBrainz data, each song has an artist string associated with it. Some of these strings are the plain name of a single artist; others are composites of several artist names, which means a lookup on the artist string alone misses songs credited to a composite.
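
As a rough illustration of the problem, the sketch below flags artist strings that look like composites by splitting on common joining words. The separator list and the function names (split_artist_credit, is_composite) are assumptions made for this example, not part of the MusicBrainz data or of load_mbdump.py, and the heuristic is deliberately naive.

```python
import re

# Hypothetical separator list; real data would need a longer, hand-tuned one.
SEPARATORS = re.compile(
    r"\s+(?:&|and|feat\.?|featuring|vs\.?|with)\s+|\s*,\s+",
    re.IGNORECASE,
)


def split_artist_credit(credit):
    """Return candidate atomic artist names for one artist string."""
    parts = [part.strip() for part in SEPARATORS.split(credit)]
    return [part for part in parts if part]


def is_composite(credit):
    """True if the artist string splits into more than one candidate name."""
    return len(split_artist_credit(credit)) > 1


if __name__ == "__main__":
    for credit in ["Radiohead", "Queen & David Bowie", "Simon & Garfunkel"]:
        print(credit, "->", split_artist_credit(credit))
```

A split like this can only propose candidates. It also breaks up names such as "Simon & Garfunkel" that are better treated as a single artist, which is why a person needs to confirm each decomposition.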

When an artist name is a composite of multiple atomic artists, this fact is represented as an "advanced relationship," but that relationship is missing for many composite artists.

My first goal is a simple interface for creating the appropriate decomposition of unrecognized composite artists.
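
One possible shape for the data behind such an interface is sketched below: an in-memory store of confirmed decompositions, plus a lookup that answers "which songs has this artist performed?". The class name DecompositionStore, its methods, and the sample songs are all hypothetical; a real version would presumably read and write the musicbrainz database instead of a dictionary.

```python
class DecompositionStore:
    """Holds user-confirmed decompositions of composite artist strings."""

    def __init__(self):
        # Maps an artist string to a tuple of atomic artist names.
        self._confirmed = {}

    def record(self, credit, atomic_artists):
        """Store the decomposition a user confirmed for one artist string."""
        names = [name.strip() for name in atomic_artists if name.strip()]
        if not names:
            raise ValueError("a decomposition needs at least one artist name")
        self._confirmed[credit] = tuple(names)

    def members(self, credit):
        """Return the atomic artists for a string (itself if undecomposed)."""
        return self._confirmed.get(credit, (credit,))

    def songs_for(self, artist, songs):
        """Return titles from (title, credit) pairs that involve the artist."""
        return [title for title, credit in songs if artist in self.members(credit)]


if __name__ == "__main__":
    store = DecompositionStore()
    store.record("Queen & David Bowie", ["Queen", "David Bowie"])
    # Illustrative sample rows, not taken from the database dump.
    songs = [
        ("Under Pressure", "Queen & David Bowie"),
        ("Bohemian Rhapsody", "Queen"),
        ("Heroes", "David Bowie"),
    ]
    print(store.songs_for("Queen", songs))
```

Artist strings with no recorded decomposition are treated as atomic, so the lookup works before any cleaning has been done and improves as decompositions are confirmed.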