Active Record batch processing in parallel processes
Published over 2 years ago

Active Record provides find_each for batch processing of large numbers of records. However, when you are dealing with a REALLY large number of records (I'm talking millions here), find_each can become quite slow, because every record is still processed one at a time in a single process.
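For context, find_each avoids loading the whole table at once: it fetches records in batches (1,000 by default) and yields them to your block one at a time. Conceptually, over a stand-in array of ids, it behaves like this sketch (the real method issues one SELECT per batch, keyed on the primary key):

```ruby
# Conceptual sketch of find_each's batching over a stand-in id list;
# the real method fetches each slice with a separate query, so only
# batch_size records are instantiated at a time.
ids = (1..2500).to_a
batch_size = 1000

yielded = []
ids.each_slice(batch_size) do |batch|   # one "query" per batch
  batch.each { |id| yielded << id }     # records yielded one at a time
end

yielded.size   # 2500 -- every record is visited exactly once
```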

One obvious solution is to use something like Resque:

User.find_each {|user| Resque.enqueue(MyJob, user.id) } # pass the id; Resque serializes job arguments to JSON

But this solution can feel a little heavy in certain cases. Enter forking!

# On REE, make the GC copy-on-write friendly so forked
# children can share memory pages with the parent
if GC.respond_to?(:copy_on_write_friendly=)
  GC.copy_on_write_friendly = true
end

jobs_per_process = 100
process_count = 10

User.find_in_batches(:batch_size => jobs_per_process * process_count) do |group|
  batches = group.in_groups(process_count)

  batches.each do |batch|
    Process.fork do
      # Each forked child needs its own db connection; sharing
      # the parent's connection across processes corrupts it
      ActiveRecord::Base.establish_connection

      # Do the actual work
      batch.each {|user| .. }
    end
  end

  Process.waitall
end

The above code fetches 1,000 records from the database, splits them into 10 groups, and forks 10 processes that each handle 100 records in parallel, before moving on to the next 1,000 records. This should be significantly faster than the usual sequential processing.
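The fork-and-wait pattern can be exercised without Active Record at all. Here is a minimal sketch of the same idea: split a batch into groups, fork a worker per group, collect each child's output through a pipe, and block on Process.waitall. The "work" here (squaring numbers) is a stand-in for real per-record processing.

```ruby
records = (1..20).to_a
process_count = 4

# Split the batch into one group per worker (in_groups comes from
# Active Support; plain Ruby's each_slice does the job here)
groups = records.each_slice(records.size / process_count).to_a

pipes = groups.map do |group|
  reader, writer = IO.pipe
  Process.fork do
    reader.close
    # Do the actual work in the child, then report back to the parent
    squares = group.map { |n| n * n }
    writer.write(squares.join(","))
    writer.close
    exit!(0)   # skip at_exit handlers in the child
  end
  writer.close   # parent keeps only the read end
  reader
end

Process.waitall   # block until every child has finished

results = pipes.flat_map do |reader|
  data = reader.read
  reader.close
  data.split(",").map(&:to_i)
end
```

Since the pipes are read back in group order, results comes out in the same order as the input, just as the find_in_batches version processes groups of a batch side by side before moving on.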

It’s advisable to use REE (Ruby Enterprise Edition) for something like this when memory usage is a concern, since its copy-on-write friendly GC lets the forked children share most of the parent’s memory pages.