The ‘Data Migration’ Fallacy: How Changing Our Mental Model Can Make Data Re-Creation Easier

While Viget regularly works on from-scratch products, many of our projects involve rebuilding a client's existing website. This often means data in the old application also needs to exist in the new application.

In these cases the project typically includes a "data migration" effort. I've found that clients often misunderstand what data migration entails, which can lead to a number of pain points. So I've come up with a different way of thinking about the concept, as well as an approach for handling data migration throughout a project, to hopefully make the process clearer and more predictable for everyone involved.

Re-Creation, Not Migration

The data migration pain points ultimately stem from the term itself, which instills two flawed mental models of the work that actually needs to be done.

In the first flawed mental model, data is a physical thing that we pick up from the old app and plunk down in the new app. In the second flawed mental model, we hook up the old app to the new app via the equivalent of a USB cable and transfer the data in the same way you'd transfer photos from a laptop to a backup external hard drive.

These mental models are flawed because the process does not involve transferring data in either the physical or transfer-cable senses. Rather, the process entails re-creating data: Using one app's data as a reference in order to create similar data in another app.

If we start to think in those terms, we can more clearly see the potential pain points in the data re-creation process and devise strategies to mitigate them.

The Implications of Manual Data Re-Creation

To think about the potential pain points, consider the manual end of the re-creation spectrum: one person — or 1,000 monkeys, or 1,000 interns — re-creating the data by hand.

Say the data to be re-created is 500 articles, with an article's data in the new app separated into these fields: headline, author, publication date, body, image. For this example, assume that "author" is itself a structured object; when creating an article, the user selects a single author from a list that's managed elsewhere in the new app, rather than entering free-form text.

Here are some pain points that could arise as the data re-creator hand-types 500 old articles into the new CMS:

  • No Client Data: The client has not provided the existing article data. The data re-creator would have to click through 500 article pages on the old website, copying and pasting text into the new CMS.
  • Unopenable/Unreadable Client Data: The client has provided the data as .wtf files — the proprietary export format of the client's old CMS, natch. The data re-creator either is unable to open the files or opens them to reveal glyph-filled gobbledygook.
  • Unstructured Client Data: In the client-provided data, each article is a single block of text with no delineation between the different elements. The data re-creator has to go through each entry to figure out which text should go in each CMS field.
  • Client Data Full of Code: The client-provided data has HTML, JavaScript, inline styles, miscellaneous code from the old CMS, etc. The data re-creator has to delete the code from each article to prevent weird/ugly things from happening if that code were saved in the new app.
  • Client Wants to Re-Create Files, Too: The client wants all of the old app's image files to be re-created in the new app. The data re-creator has to manually attach the images to articles in the new CMS.
  • Client Data Does Not Match New App's Structure/Functionality:
    • The data reflects an old structure that does not match the new app's fields, e.g. there's a "Category" field that does not exist in the new app. Additional development would be needed to add the field.
    • There is data that ostensibly corresponds to a new app field but does not match the new app's functionality, e.g. some articles in the old app have multiple authors. Additional development would be needed to update the field functionality.
    • There is data that corresponds to an associated object that has not yet been created in the new app's CMS, e.g. the data includes 75 author names but none of those authors have been created in the new CMS. The data re-creator would need to first manually create the 75 authors.

The Implications of Partially Automated Data Re-Creation

Thankfully, we rarely have to go through a manual data re-creation process. Nobody has the time, budget, or patience for that. Instead, we go through what I consider a partially automated process, in which a developer writes and runs a script that automatically parses data files (e.g. a CSV) and creates articles in the new app's database from the parsed data. The script basically automates the copy/paste part of the manual process.
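To make "automating the copy/paste" concrete, here is a minimal sketch of such a script in Python. It is an illustration, not the script from any real project: the column names, the Article class, and the sample rows are all assumptions, and a real script would save to the new app's database rather than build a list in memory.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """Hypothetical stand-in for the new app's article record."""
    headline: str
    author: str
    publication_date: date
    body: str
    image: str


def recreate_articles(csv_file):
    """Read each CSV row and build the matching Article for the new app."""
    articles = []
    for row in csv.DictReader(csv_file):
        articles.append(
            Article(
                headline=row["headline"].strip(),
                author=row["author"].strip(),
                # Assumes dates were exported as YYYY-MM-DD.
                publication_date=date.fromisoformat(row["publication_date"]),
                body=row["body"],
                image=row["image"].strip(),
            )
        )
    return articles


# Made-up sample data standing in for the client's export file.
SAMPLE_CSV = """headline,author,publication_date,body,image
First Post,Ada Writer,2015-03-02,Hello world.,first.jpg
Second Post,Bo Editor,2015-04-10,More text here.,second.jpg
"""

articles = recreate_articles(io.StringIO(SAMPLE_CSV))
```

Notice that the script only works because the file already has one clearly labeled column per field. Everything that follows is about what happens when that isn't true.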

This can be fairly straightforward and a relatively low level of effort. But the process can become wildly complicated, and clients typically don't understand why.

Here's why: The exact same pain points apply in the partially automated scenario as in the manual scenario.

For example:

  • No Client Data: If the client does not provide data files, the developer would have to write a script that scrapes the old site's public article pages and attempts to map the unstructured scraped data to the new app fields. This takes an extremely high level of effort and may not be feasible at all.
  • Unopenable/Unreadable Client Data: If a non-developer can't open the client-provided file or can't read what's in the file when opened, neither can a developer.
  • Unstructured Client Data: A developer can't write a script to parse a single blob of text into multiple structured fields because nothing denotes blob line 2 as the author or blob lines 5-150 as the body.
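The structure mismatches from the manual scenario (an extra "Category" column, multiple authors, authors that don't exist yet) show up in the scripted scenario too, and a script can at least surface them early. Below is a rough sketch of a pre-flight check. The expected column names, the semicolon as a multi-author separator, and the sample data are assumptions made for illustration.

```python
import csv
import io

# Assumed field names for the new app's article structure.
EXPECTED_COLUMNS = {"headline", "author", "publication_date", "body", "image"}


def audit_export(csv_file, known_authors):
    """Return a list of human-readable problems found in the export."""
    reader = csv.DictReader(csv_file)
    columns = set(reader.fieldnames or [])
    problems = []

    for name in sorted(columns - EXPECTED_COLUMNS):
        problems.append(f"column '{name}' has no field in the new app")
    for name in sorted(EXPECTED_COLUMNS - columns):
        problems.append(f"column '{name}' is missing from the export")

    # Row numbers assume the header is line 1 and each row is one line.
    for line_number, row in enumerate(reader, start=2):
        author = (row.get("author") or "").strip()
        if ";" in author:
            problems.append(f"row {line_number}: multiple authors ({author})")
        elif author not in known_authors:
            problems.append(f"row {line_number}: author '{author}' not created yet")
    return problems


# Made-up export with one extra column and two problem rows.
SAMPLE_CSV = """headline,author,publication_date,body,image,category
A,Ada Writer,2015-03-02,Text,a.jpg,News
B,Ada Writer; Bo Editor,2015-03-03,Text,b.jpg,News
C,Cy Newhire,2015-03-04,Text,c.jpg,Sports
"""

problems = audit_export(io.StringIO(SAMPLE_CSV), known_authors={"Ada Writer"})
for problem in problems:
    print(problem)
```

A check like this doesn't fix anything. It just turns "the script blew up on row 312" into a list the PM can bring to the client while there's still time to decide what to do.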

I think it's pretty clear why all of these pain points exist in the manual scenario. But the flawed mental models make it easy to misunderstand (or avoid thinking deeply about) the partially automated scenario. If we recognize that this scenario has the same pain points, we can more effectively mitigate them.
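The "Client Data Full of Code" pain point is a good example. A person would delete tags by hand; a script has to be told how. Here is one possible sketch using Python's built-in html.parser module. The class and function names are my own, and a real project would still need to agree with the client on which formatting to keep and which to strip.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text, skipping the contents of script/style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_markup(html):
    """Return the text of an HTML fragment with tags and scripts removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())


cleaned = strip_markup(
    '<p style="color:red">Hello <b>world</b></p><script>alert(1)</script>'
)
```

This throws away all formatting, including links, bold text, and paragraph breaks the client may want to keep. Preserving some markup while stripping the rest is exactly the kind of decision that adds effort.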

An Approach for Data Re-Creation

So how do we do that? Here's an approach that can guide data re-creation efforts:

  1. Walk client through potential pain points upfront (hey, there's a handy blog post for that!)
  2. Client must provide files for the data to be re-created, using a defined set of standard formats, e.g. XML, CSV, or JSON
  3. These files should be provided early enough — ideally during the sales phase — to be considered during the UX/definition phase(s) and for PM to flag any potential pain points
  4. As soon as the new app's data structures are defined, PM revisits the client-provided files and works with client to define exactly which data is desired for re-creation, what’s acceptable/realistic in terms of preserving or stripping formatting given remaining time/budget, etc.
  5. PM creates a reference spreadsheet or similar document for developers, detailing how the data can be mapped to the new app's data structure as well as any relevant notes ("Ignore data in the CSV's 'Image Caption' column")
  6. Devs create and run script; devs, PM, and client test the results
  7. Client is at the mercy of time/budget/reality if:
    1. Client provides data in non-standard formats, or in other ways that cause the previously warned-against pain points
    2. After UX and/or development is done, client provides data that implies new functionality or updated data structure in the new app
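The reference document from step 5 can translate almost directly into the script from step 6. As a hypothetical sketch, the mapping below plays the role of the PM's spreadsheet. The old column names are invented for illustration, apart from "Image Caption", which comes from the example note above.

```python
# Hypothetical mapping: old export column -> new app field.
# None means "deliberately ignore this column".
COLUMN_MAP = {
    "Title": "headline",
    "Byline": "author",
    "Published": "publication_date",
    "Content": "body",
    "Image File": "image",
    "Image Caption": None,  # per the PM's note: ignore this column
}


def map_row(old_row):
    """Translate one exported row into the new app's field names."""
    new_row = {}
    unmapped = []
    for column, value in old_row.items():
        if column not in COLUMN_MAP:
            unmapped.append(column)
            continue
        field = COLUMN_MAP[column]
        if field is not None:
            new_row[field] = value
    if unmapped:
        # Fail loudly instead of silently dropping data nobody decided on.
        raise KeyError(f"No mapping decision for columns: {sorted(unmapped)}")
    return new_row


mapped = map_row(
    {
        "Title": "First Post",
        "Byline": "Ada Writer",
        "Published": "2015-03-02",
        "Content": "Hello world.",
        "Image File": "first.jpg",
        "Image Caption": "A caption nobody asked to keep",
    }
)
```

Raising an error on an unmapped column is a design choice: it forces every column in the client's file to have an explicit decision, which is the whole point of step 4.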

Each data re-creation effort will have its own unique challenges. But with greater client understanding and a well-defined process, we can more effectively identify those challenges with plenty of time left to solve them.

Josh Korr
