Script Persistence: Writing fail-safe, idempotent Ruby scripts with PStore


Nothing makes me happier as a programmer than automating monotonous and incredibly time-consuming tasks with a quick script.

Increasingly these scripts tend to involve interacting with external (read: unreliable) systems. Oftentimes this unreliability can suck the fun right out of scripting. Run a script. Watch it fail halfway through. Go clean up after the partial execution. Run script again. Watch it fail at 99%...

So how can we make potentially unreliable automation scripts fun again? Idempotency.

If I run an idempotent script and it fails halfway through, I just run it again and end up in exactly the same state as if the script had executed perfectly the first time. Beauteous.

Enter PStore

Idempotency in scripts of this nature requires durable persistence of execution state. One could of course use a database to achieve this persistence, but connecting and writing to an external database is a bit heavy-handed for simple scripts. It adds dependencies and makes the script more complex and less portable.

In these cases, PStore is a perfect, lightweight database alternative. PStore is a class included in the Ruby Standard Library (since v1.8.7) that implements file-based persistence, and better yet, it feels just like a Hash.

To get started with PStore, first instantiate a PStore object.

require "pstore"

execution_state = PStore.new("execution_state")

This names the file to which store contents will be written. The file itself is created the first time a writable transaction is opened.


To read from or write to your PStore, you must open a transaction. PStore operations are transactional to prevent the store from ever representing a transitory state. Opening a transaction uses an intuitive block syntax:

execution_state.transaction do
  # Read and write here.
end

At the end of the transaction block, all changes to the PStore are committed to disk. The #commit and #abort methods allow you to programmatically exit the transaction before reaching the end, saving or discarding any changes respectively.
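As a rough sketch of how the exits from a transaction behave, the snippet below exercises #commit, #abort, and an unhandled exception in turn. The file name and the :status key are made up for the demo, and the store lives in a temporary directory.

```ruby
require "pstore"
require "tmpdir"

# Hypothetical store path, placed in a temp directory for this demo.
store = PStore.new(File.join(Dir.mktmpdir, "commit_abort_demo.pstore"))

# #commit saves what has been written so far and leaves the block early.
store.transaction do
  store[:status] = "saved"
  store.commit
  store[:status] = "never reached"
end

# #abort leaves the block early and discards this transaction's changes.
store.transaction do
  store[:status] = "discarded"
  store.abort
end

# An unhandled exception also leaves the file untouched.
begin
  store.transaction do
    store[:status] = "lost"
    raise "simulated failure"
  end
rescue RuntimeError
  # The error reaches the caller; nothing from this transaction was written.
end

# Passing true opens a read-only transaction.
final_status = store.transaction(true) { store[:status] }
puts final_status # prints "saved"
```

The last case is the one that matters most for fail-safe scripts: if a step blows up inside a transaction, the store still reflects the last successful commit.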

Within a transaction, you can interact with a PStore just as if it were a Hash with #fetch, #delete, #[], and #[]=. PStore dumps objects saved to the store using Marshal, meaning they are written to the store file as binary. (If you'd like a more human-readable persistence file, check out YAML::Store, which provides the same utility as PStore but dumps objects using YAML instead of Marshal.)

execution_state.transaction do
  execution_state[:steps_completed] ||= []
  execution_state[:steps_completed] << 1
  execution_state.fetch(:steps_completed)
  execution_state.delete(:steps_completed)
end
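For comparison, here is a minimal sketch of the same kind of bookkeeping with YAML::Store. The file name and the string key are assumptions for the demo; the point is only that the resulting file is plain text you can open and read.

```ruby
require "yaml/store"
require "tmpdir"

# Hypothetical file name, placed in a temp directory for this demo.
path = File.join(Dir.mktmpdir, "execution_state.yml")
yaml_state = YAML::Store.new(path)

# Same transaction API as PStore.
yaml_state.transaction do
  yaml_state["steps_completed"] ||= []
  yaml_state["steps_completed"] << 1
end

# The store file is ordinary YAML, so it can be inspected directly.
yaml_text = File.read(path)
puts yaml_text

# Reading back goes through a (read-only) transaction as usual.
steps = yaml_state.transaction(true) { yaml_state.fetch("steps_completed") }
puts steps.inspect
```

Being able to read the state file with a text editor is handy when a long-running script stops and you want to see how far it got.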

So using PStore is easy, but how might we use it to achieve idempotency?

Example: Big Data Migrations

Our work on projects at Viget often concludes with large data migrations off of existing systems into the new systems we've just finished constructing. Recently this took the form of a massive and complex data migration out of a Drupal installation and into a new CMS via API requests.

The migration involved several layers of computationally expensive data transformations, and two separate API requests into the new system for each data entry (one to insert the entry, and a second to build its relationships to other entries). Over a massive dataset, even with a high degree of parallelism, the migration script took nearly an hour to fully execute.

Reperforming the entire migration in the event of a failure wasn't an attractive proposition, but more importantly, we needed the ability to use the same script to quickly migrate new records after the initial import had been performed.

In this case, we used PStore to persist 1) objects resulting from expensive processes, and 2) an ID dictionary that maps between old and new CMS entry IDs. The process is actually very straightforward: before you do anything, check to see if you've done it before.
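The "check before you do anything" idea can be sketched as a small helper. The name `once`, the STATE constant, and the :expensive_step key are illustrative only and are not part of PStore; the helper simply stores a block's result under a key and returns the stored value on later calls.

```ruby
require "pstore"
require "tmpdir"

# Hypothetical store location for this demo.
STATE = PStore.new(File.join(Dir.mktmpdir, "once_demo.pstore"))

# Runs the block only if nothing is stored under `key` yet; otherwise
# returns the stored result. `once` is an illustrative name, not a PStore API.
def once(key)
  STATE.transaction do
    # Already done on an earlier call or run: hand back the stored result.
    next STATE[key] if STATE.root?(key)

    # Not done yet: do the work and persist the result.
    STATE[key] = yield
  end
end

calls = 0
first  = once(:expensive_step) { calls += 1; "result" }
second = once(:expensive_step) { calls += 1; "result" }

puts calls # prints 1
```

If the block raises, the transaction is abandoned and nothing is stored, so the next run simply tries that step again.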

The steps for this particular migration were as follows:

  1. Read a data entry from Drupal.
  2. Check the converted_entries PStore to see if we already have a transformed version of that entry.
  3. If we do, move along. If we don't, create a transformed version.
  4. Check the id_dictionary PStore to see if we have a new CMS ID for the entry.
  5. If we do, that means the entry is already in the new CMS; move along. If we don't, POST the entry into the CMS.
  6. Check the modified_entries PStore to see if we have already updated the new CMS entry with its relationships to other entries.
  7. If we have, move along. If we haven't, PUT an update to the entry to add its relationships.

The code example below illustrates these steps.

require "pstore"
require "httparty"
# A quick aside:
#
# HTTParty is good for code examples.
# Typhoeus is good for reality.
# https://github.com/typhoeus/typhoeus#making-parallel-requests

# The "data" directory must already exist.
$execution_state = PStore.new("data/execution_state")

# Each collection is looked up by ID, so use Hashes rather than Arrays.
$execution_state.transaction do
  $execution_state[:converted_entries] ||= {}
  $execution_state[:entry_id_dictionary] ||= {}
  $execution_state[:modified_entries] ||= {}
end

def convert_to_entries(drupal_nodes)
  drupal_nodes.map do |drupal_node|
    $execution_state.transaction do
      # Reuse the stored conversion if one exists; otherwise convert the
      # node and store the result so that later runs can skip this step.
      $execution_state[:converted_entries][drupal_node.id] ||= convert_to_entry(drupal_node)
    end
  end
end

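def import_entries(entries)
  entries.each do |entry|
    $execution_state.transaction do
      # Skip the import if it has already occurred.
      next if $execution_state[:entry_id_dictionary][entry.drupal_id]

      entry_json = new_cms_entry_json(entry)
      response = HTTParty.post("https://new.cms.viget.com/api/v1/entries", body: entry_json)
      # Raising inside the transaction discards its changes, so a failed
      # request is retried on the next run.
      raise "Import failed for Drupal entry #{entry.drupal_id}" unless response.success?

      # Assumes the API responds with a JSON body that holds the new ID under "id".
      $execution_state[:entry_id_dictionary][entry.drupal_id] = response["id"]
    end
  end
end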

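def modify_imported_entries(entries)
  entries.each do |entry|
    $execution_state.transaction do
      new_cms_id = $execution_state[:entry_id_dictionary][entry.drupal_id]

      # Skip the update if it has already occurred.
      next if $execution_state[:modified_entries][new_cms_id]

      response = HTTParty.put("https://new.cms.viget.com/api/v1/entries/#{new_cms_id}", body: cms_entry_modification_json(entry))
      raise "Update failed for CMS entry #{new_cms_id}" unless response.success?

      # Record the status code rather than the whole response object, which
      # is not guaranteed to be serializable with Marshal.
      $execution_state[:modified_entries][new_cms_id] = response.code
    end
  end
end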

def migrate(drupal_nodes)
  entries = convert_to_entries(drupal_nodes)
  import_entries(entries)
  modify_imported_entries(entries)
end

def convert_to_entry(drupal_node)
  # An expensive transformation process is performed here.
end

def new_cms_entry_json(entry)
  # Make the JSON for the new CMS entry here.
end

def cms_entry_modification_json(entry)
  # Build the JSON for the relationship-adding update here.
  # This will often involve mapping old related Drupal IDs to the IDs of
  # entries in the new system using $execution_state[:entry_id_dictionary]
end

drupal_nodes = read_entries_from_drupal_database
migrate(drupal_nodes)
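To see the rerun behaviour without a real CMS, here is a self-contained sketch that simulates a failure partway through an import and then runs the import again. Everything in it is a stand-in: FakeCms, its fail_on: option, the :ids key, and the "cms-" ID format are invented for the demo and have nothing to do with the actual migration or any real API.

```ruby
require "pstore"
require "tmpdir"

# A fake API client that fails on a chosen request (hypothetical stand-in).
class FakeCms
  attr_reader :posts

  def initialize(fail_on:)
    @fail_on = fail_on # 1-based request number that fails, or nil for never
    @requests = 0
    @posts = []
  end

  def post(entry)
    @requests += 1
    raise IOError, "simulated timeout" if @requests == @fail_on

    @posts << entry
    "cms-#{entry}" # made-up ID format
  end
end

# Imports each entry at most once, recording new IDs in the store.
def import(store, cms, entries)
  entries.each do |entry|
    store.transaction do
      store[:ids] ||= {}
      next if store[:ids].key?(entry) # already imported on an earlier run

      store[:ids][entry] = cms.post(entry)
    end
  end
end

store = PStore.new(File.join(Dir.mktmpdir, "rerun_demo.pstore"))
entries = %w[a b c d]

# First run: the third request fails and the script stops.
first_cms = FakeCms.new(fail_on: 3)
begin
  import(store, first_cms, entries)
rescue IOError => e
  puts "first run failed: #{e.message}"
end

# Second run: same entries, same store; only the missing work is done.
second_cms = FakeCms.new(fail_on: nil)
import(store, second_cms, entries)

ids = store.transaction(true) { store[:ids] }
puts first_cms.posts.inspect  # prints ["a", "b"]
puts second_cms.posts.inspect # prints ["c", "d"]
```

No entry is posted twice across the two runs, and the store ends up with an ID for all four entries, which is the property the migration script relies on.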

There you have it. PStore is a fantastic tool for durably persisting data for quick, idempotent Ruby scripts. So next time you find yourself frustratedly running, resetting, and rerunning your scripts, reach for PStore.

Lawson is a neuroscientist-turned-developer who works in our Boulder, CO, office. He builds sophisticated software for clients such as Discovery and Shure.
