Data Migrations in Rails

Every now and again we need to change actual data in the production database. The first obvious option that comes to mind is to use a Rails migration, especially since the word “migration” is already in the task at hand, a “data migration.” But let’s talk about it some more and let me try to dissuade you from doing so.

Looking the at Rails Guides for Active Records Migration, the first section starts by saying:

Migrations are a feature of Active Record that allows you to evolve your database schema over time. Rather than write schema modifications in pure SQL, migrations allow you to use an easy Ruby DSL to describe changes to your tables.

Have you noticed that the word “data” is absent in the above paragraph? By definition, Rails migrations should only be used for schema changes and not for actual data changes in the database.

Generally speaking, manipulating data in migrations is a bad idea for a few reasons. For one, data migrations files will stay in the db/migrate directory for posterity, and will run whenever a new developer sets their local development environment. This is not very future proof. For example, future changes to a class and its logic can easily break the migration later on. Furthermore, this is not business logic and thus should not stay forever in the code base.

A second issue is that those data migrations might be ignored by future developers if instead of running rake db:migrate the developers run rake db:schema:load or rake db:reset. Both commands merely load the latest version of the database structure using the schema.rb file without touching the migrations.

A third issue is that your application deployment is now dependent on the data migration to be completed. This might not be a problem when your application is new and your database is small. But what about large databases with millions of records? Your deployment will now have to wait for the data manipulation to be finished and that is just asking for trouble, with possible hanging or failed migrations.

I would like to suggest a better alternative using temporary rake tasks. Temporary rake tasks allow us to decouple a deployment from completed migrations. It gives us more control of the data manipulation process by encapsulating it in one place. The downside is that we need remember to either add this rake task to our deployment script or run the rake task manually after deployment. We will also need to clean up after ourselves and remove the temporary rake task once the changes have been deployed and implemented.

When creating a temporary rake task for a data migration, you might be tempted to write something like the following code:

# lib/tasks/temporary/users.rake
namespace :users do
  task :set_newsletter => :environment do
    User.all.each do |user|
      if user.confirmed?
        user.receive_newsletter = true
        user.save
      end
    end
  end
end

There are five problems with the code above:

the task goes through every single user;
it invokes validations and callbacks, which may have unintended consequences;
it uses an if block to check whether a user needs to be updated;
it does not give us a visual indication that it is actually working; and
it does not include a description and thus we will not see the task when running rake -T.

Here is a suggestion for a better rake task:

# lib/tasks/temporary/users.rake
namespace :users do
  desc "Update confirmed users to receive newsletter"
  task set_newsletter: :environment do
    users = User.confirmed
    puts "Going to update #{users.count} users"

    ActiveRecord::Base.transaction do
      users.each do |user|
        user.mark_newsletter_received!
        print "."
      end
    end

    puts " All done now!"
  end
end

In this case:

it includes a description so we will be able to see the task and its description when running rake -T;
it uses a scope to fetch only the records that need to be changed, thus removing the need for the if block and limiting the number of records that need to be touched;
it tells us in advance how many records will be manipulated, gives us a visual indication as it is working, and when it is done; and
it wraps the change in a transaction.

If your database supports transactions, it is always a good idea to wrap the code that actually changes the data with a transaction. Transactions help to deal with crashes, failures, and data consistency. They are extremely important when working with multiple objects to be changed and we want to ensure data integrity. For example when moving money around, we want to guarantee that there will not be a situation where money is withdrawn from one account, but not deposited to another account.

The examples above are very short. If the data change requires more actions, consider pulling out that behaviour into its own method.

If our database is quite large, it is also recommended to run the data migration in batches. You can read more about batches on the Rails Guides at Retrieving Multiple Objects in Batches.

Last but not least, if you ever use the Suspenders gem to generate your base Rails application with thoughtbot’s standard defaults, you might have noticed that it adds a lib/tasks/dev.rake file and a rake dev:prime to your rake tasks. This task should be used to add data to the development environment so that seeds.rb can be kept for data that is required in any environment.

In summary, by creating a temporary rake task, we get the benefits of more robust deployments, non-destructive migrations, and more control over the data manipulation process. And did I remember to mention that we should always do a dry run of our data migration rake task on development and staging first?

Sign up to receive a weekly recap from thoughtbot