
Migrating Data Into Rails

Often when building a Rails application we need to migrate data from an existing database into our new one. There are two main ways to do this, and your situation will determine which is better for you.

The good news is that ActiveRecord is easy to work with outside Rails.

In both options below, the first CSV step assumes you just do a simple ‘export to CSV’ from your existing database rather than writing a script to pull out only the data you want. If you do the latter instead, you can skip the first CSV step.

I have also assumed that your new database is MySQL. If not, substitute your own.

Option 1

Legacy DB -> CSV (legacy schema) -> CSV (new schema) -> MySQL -> validate with ActiveRecord

You write a script to convert the first set of CSV files into the second set. You then use MySQL’s batch loading facility (LOAD DATA INFILE, or the mysqlimport command-line tool) to lever the data into the database. Finally you run a short script to validate all the data via ActiveRecord in one fell swoop.
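Here is a minimal sketch of such a validation pass; the Customer and Order models are placeholders for your own:

# Walk every migrated model and report any records that fail
# their ActiveRecord validations.
[Customer, Order].each do |model|
  model.find(:all).each do |record|
    unless record.valid?
      puts "#{model.name} #{record.id}: #{record.errors.full_messages.join(', ')}"
    end
  end
end

Note that find(:all) loads every record into memory, so for very big tables you would want to page through them in batches.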

Option 2

Legacy DB -> CSV (legacy schema) -> ActiveRecord

You write the script to read the legacy CSV files and create ActiveRecord objects. The data are validated along the way.
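For example, a bare-bones version might look like this; the filename, column order and Customer model are all placeholders for your own:

require 'csv'

# Build and save one new record per row of the legacy export.
# Rows that fail validation are reported rather than silently dropped.
CSV.foreach('legacy_customers.csv') do |row|
  customer = Customer.new(:name => row[1], :email => row[2])
  unless customer.save
    puts "Row #{row[0]} rejected: #{customer.errors.full_messages.join(', ')}"
  end
end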

In order for your script to work with your ActiveRecord models, include your app/models directory on the load path. For example:

$ ruby -I lib:/path/to/your/app/models lib/data_migration.rb

Your script also needs to connect to your database via ActiveRecord. This does the trick:

require 'rubygems'      # needed to load gems on older Rubies
require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => settings['adapter'],
  :host     => settings['host'],
  :username => settings['username'],
  :password => settings['password'],
  :database => settings['database'])

where settings are pulled by your script out of your Rails app’s config/database.yml for the appropriate environment.
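One way to pull them, assuming the script runs from your app’s root directory and targets the development environment:

require 'yaml'

# database.yml is keyed by environment name.
settings = YAML.load_file('config/database.yml')['development']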

Trade-Offs

Option 1 is far more scalable as your data volume grows. Option 2 can be sped up with ActiveRecord::Extensions, but Option 1 is still faster.
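As a sketch of the bulk-import approach: the import method shown here is the one provided by ar-extensions and its modern successor, activerecord-import, and the Customer model and columns are placeholders:

columns = [:name, :email]
values  = [['Alice', 'alice@example.com'],
           ['Bob',   'bob@example.com']]

# One multi-row INSERT instead of one INSERT (and one object) per record.
Customer.import(columns, values, :validate => true)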

Option 2 is DRY with respect to your new schemas. In Option 1 you specify your new schemas twice: once in the new CSV files and again in your ActiveRecord model objects. In Option 2 you specify them only in your model objects.

Option 1 allows, and requires, you to set primary and foreign keys explicitly, which enables you to preserve the original data’s keys. Option 2 lets ActiveRecord set the keys automatically. This is simpler, though it requires an extra field if you want to preserve the original primary key of each record.
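Inside the CSV-reading loop sketched earlier, that might look like this; legacy_id is a hypothetical extra column on the new table:

# Keep the legacy primary key in its own column so cross-references
# can be rebuilt after the migration.
customer = Customer.new(:name => row[1], :email => row[2])
customer.legacy_id = row[0]
customer.save!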

Option 2 catches invalid records much earlier.

Conclusion

Option 1 is the only realistic option for ‘large’ volumes of data. Option 2 is more elegant and easier to use, so I would prefer it where practical (‘small’ volumes of data).

The boundary between small and large volumes of data depends on the speed of your CPU, your patience, and the length of time you would spend writing your migration script. I have found in the past that Option 2 loads data more slowly, but that is outweighed by the time saved by having fewer hoops to jump through.

Tip

Assuming you’re cleaning your data along the way, write unit tests for every one of your cleaning methods. It’ll save you time and give you confidence. And write some unit tests for your post-migration MySQL database: choose a few records before you start and assert they’re present and correct (via ActiveRecord) in your new database.
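A minimal sketch with Test::Unit; the CleaningMethods module, clean_phone_number method and Customer model are all hypothetical:

require 'test/unit'

class CleaningTest < Test::Unit::TestCase
  include CleaningMethods  # hypothetical module holding your cleaning methods

  def test_clean_phone_number_strips_punctuation
    assert_equal '01234567890', clean_phone_number('(01234) 567-890')
  end
end

class MigrationSpotCheckTest < Test::Unit::TestCase
  # Assert that a record chosen before the migration arrived intact.
  def test_known_customer_present_and_correct
    customer = Customer.find_by_email('alice@example.com')
    assert_not_nil customer
    assert_equal 'Alice', customer.name
  end
end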

Further Reading

If you are pulling data from many sources then you may like to consider Extract-Transform-Load from the ActiveWarehouse project.

Final Words

After all the hard work of concocting, testing and debugging your cleaning and migration scripts, it is most satisfying to start them running and watch your CPU grind away for a while — about time the computer did some work!

Andrew Stewart • 5 December 2006 • Rails