Active Record batch processing in parallel processes
Published over 1 year ago
Active Record provides find_each for batch processing of large numbers of records. However, when you are dealing with a REALLY large number of records (I'm talking millions here), find_each can become quite slow, because it still processes every record sequentially in a single process.
One obvious solution is to use something like Resque:
    # Enqueue the id rather than the record: Resque serializes job arguments as JSON
    User.find_each {|user| Resque.enqueue(MyJob, user.id) }
But this solution can feel a little heavy in certain cases. Enter forking!
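    # Only Ruby Enterprise Edition responds to this setting
    if GC.respond_to?(:copy_on_write_friendly=)
      GC.copy_on_write_friendly = true
    end

    jobs_per_process = 100
    process_count = 10

    User.find_in_batches(:batch_size => jobs_per_process * process_count) do |group|
      # Pass false so a short final batch is not padded with nils
      batches = group.in_groups(process_count, false)

      batches.each do |batch|
        Process.fork do
          # Each child needs its own database connection
          ActiveRecord::Base.establish_connection

          # Do the actual work
          batch.each {|user| .. }
        end
      end

      # Wait for every child to exit before fetching the next batch
      Process.waitall
    end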
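The above code fetches 1000 records from the database, forks 10 processes, and handles 100 records in each process, in parallel. Process.waitall then blocks until all 10 children have exited before the loop moves on to the next 1000 records. Provided the machine has spare cores and the database can cope with the extra connections, this should be significantly faster than the usual sequential processing.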
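To make the fork-and-wait pattern easier to try out, here is a minimal sketch that uses a plain array instead of Active Record. The helper names split_into_groups and process_in_forks are made up for this illustration, and the pipe used to collect results is an addition that the original code does not have. It assumes a platform where Process.fork is available (so not Windows).

```ruby
# Split items into at most group_count groups without nil padding.
# Plain-Ruby stand-in for ActiveSupport's in_groups(n, false);
# hypothetical helper, not part of any library.
def split_into_groups(items, group_count)
  return [] if items.empty?
  slice_size = (items.size / group_count.to_f).ceil
  items.each_slice(slice_size).to_a
end

# Fork one child per group, run the block on every item in the child,
# and send each result back to the parent over a pipe.
# Returns [results_as_strings, exit_statuses].
def process_in_forks(items, process_count, &work)
  reader, writer = IO.pipe

  split_into_groups(items, process_count).each do |group|
    Process.fork do
      reader.close
      writer.sync = true # one write call per line, no user-space buffering
      group.each { |item| writer.write("#{work.call(item)}\n") }
      writer.close
      exit!(0) # leave the child without running the parent's at_exit hooks
    end
  end

  # The parent must close its write end, otherwise read never sees EOF.
  writer.close
  output = reader.read # returns once every child has closed its write end
  reader.close

  statuses = Process.waitall.map { |_pid, status| status }
  [output.split("\n"), statuses]
end
```

For example, `process_in_forks((1..10).to_a, 3) { |n| n * 2 }` runs the block in three child processes. Results arrive in whatever order the children finish, so sort them if order matters. Reading the pipe before calling Process.waitall keeps the children from blocking on a full pipe.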
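When memory usage is a concern, it's advisable to use Ruby Enterprise Edition (REE) for something like this. Its copy-on-write friendly garbage collector, switched on by the GC.copy_on_write_friendly setting above, lets forked processes share memory with the parent instead of each holding a full copy.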