Close and Go BackBack to Viget

Backup your Database in Git

David Eisinger
David Eisinger, Web Developer, May 08, 2009 20

Short version: dump your production database into a git repository for an instant backup solution.

Long version: keeping backups of production data is fundamental for a well-run web application, but it’s tricky to maintain history while keeping disk usage at a reasonable level. You could continually overwrite the backup with the latest data, but you risk automatically replacing good data with bad. You could save each version in a separate, timestamped file, but since most of the data is static, you would end up wasting a lot of disk space.

When you think about it, a database dump is just SQL code, so why not manage it the same way you manage the rest of your code — in a source code manager? Setting such a scheme up is dead simple. On your production server, with git installed:

mkdir -p /path/to/backup
cd /path/to/backup
mysqldump -u [user] -p[pass] --skip-extended-insert [database] > [database].sql
git init
git add [database].sql
git commit -m "Initial commit"

The --skip-extended-insert option tells mysqldump to give each table row its own insert statement. This creates a larger initial commit than the default bulk insert, but makes future commits much easier to read and (I suspect) keeps the overall repository size smaller, since each patch only includes the individual records added/updated/deleted.

From here, all we have to do is set up a cronjob to update the backup:

0 * * * * cd /path/to/backup && \
mysqldump -u [user] -p[pass] --skip-extended-insert [database] > [database].sql && \
git commit -am "Updating DB backup"

You may want to add another entry to run git gc every day or so in order to keep disk space down and performance up.

Now that you have all of your data in a git repo, you’ve got a lot of options. Easily view activity on your site with git whatchanged -p. Update your staging server to the latest data with git clone ssh://[hostname]/path/to/backup. Add a remote on Github and get offsite backups with a simple git push.

This technique might fall down if your app approaches Craigslist-level traffic, but it’s working flawlessly for us on SpeakerRate, and should work well for your small- to medium-sized web application.

Ren Provey said on 05/08 at 01:50 PM

This is a beautiful solution.  Thanks David!

Brandon said on 05/08 at 03:57 PM

Very, very nice.  For some reason that never occurred to me.  Bookmarked - thanks.

Tony Pitale said on 05/08 at 04:36 PM

If you’re using PostgreSQL you can use pg_dump to achieve the same thing. The command would be something like this: pg_dump -ROx > [database].sql. User permissions are defined in a more complex fashion but the options for host, user, and password are -h, -U, -W (case sensitive), respectively.

David Eisinger said on 05/08 at 04:43 PM

@Ren & Brandon: thanks, guys!

@Tony: good point – this technique is by no means specific to MySQL and git, although you’d miss out the flexible push/pull nature of a DVCS doing this with something like SVN.

Matt Graham said on 05/08 at 06:11 PM

Does MySQL always dump rows in stable order or does it move them around from day to day?

Joaquin Bravo Contreras said on 05/08 at 07:38 PM

Have you tried using the --tab option of mysqldump. Its output is much clearer and makes for very nice change history when using git or any other version control software.

Ask Brian Aker =P http://krow.livejournal.com/593424.html

Dan Grossman said on 05/08 at 08:50 PM

I tried this some years ago, and it was a horrible solution. The repository grew much faster than the database, since the repository tracks every row that ever was, while in reality, some data gets deleted each day.

Keeping a rotating daily, weekly and monthly backup would’ve used a constant couple gigs of space, while keeping a single backup on a code repository would grow by gigs each week.

Preston Marshall said on 05/08 at 09:59 PM

@Dan Grossman:
This would definitely be the case with SVN, but Git’s repositories only track deltas from commit to commit.  The git-gc command can also be run to collect random garbage in the repository.  Your experience would most definitely be different with Git than with SVN, or another VCS like CVS.

Dan Grossman said on 05/08 at 10:22 PM

1GB a day in new, changed or deleted rows is a 1GB delta from the previous day’s dump. The repository grows by 30GB a month, where rotating backups are constant size.

Dan Grossman said on 05/08 at 10:26 PM

By the way, CVS and Subversion also only store deltas. Git didn’t invent that.

David Eisinger said on 05/08 at 10:38 PM

@Matt Graham: for the databases I work with, which have auto-increment primary keys on all tables, mysqldump outputs the rows in a predictable order.

Jonathan Wright said on 05/09 at 03:44 AM

I stumbled upon this approach a while ago, and have been storing backups of a little database for six months or so. It works wonderfully (for this database).

The system was setup to scrape a website, store some data and trace a few things. Initially the git repo was setup to store the daily scrapes. I was worried that a month or two down the road I’d discover a bug in the data extraction from the pages, and have lost a few months of data. By keeping the original scrapes, I could rerun the analysis after fixing problems. It quickly became obvious that storing a dump of the database didn’t add to the size much.

The cool thing is six months of history is stored in less size than the single uncompressed database dump and the last scrape. That’s just cool.

As Dan Grossman points out, this solution isn’t for everyone. The repo size grows over time, rather than being fixed size. For databases that are big, or have large frequent changes, a growing repo size isn’t acceptable. However, for small, yet important, databases where the compressed/delta’ed size grows a small amount each day, it’s a wonderful solution.

Alexandre Dulaunoy said on 05/09 at 08:55 AM

I like the idea. A small but important note : Git is very good when you have a collection of content, git is not tracking files or metadata (http://git.or.cz/gitwiki/ContentLimitations). If you want an optimal approach for storing database,IMHO it’s better to dump each table in a file. Like that, if you have only updates on a specific table, git is only storing/updating the updated tables.  But only the mysqldump need to be updated to make a dump per table of each database.

Ryan Findley said on 05/09 at 10:38 AM

Careful guys, Git doesn’t store diffs / deltas, it stores the whole file bzipped (this is a blob). See here for more info: http://eagain.net/articles/git-for-computer-scientists/
Blobs are stored in Git’s internal DB using the SHA-1 hash of the blob, and can be inspected using “git show”. I highly recommend the Git Internals PDF Peepcode.

Jonathan Wright said on 05/09 at 10:45 AM

@Ryan Findley: The simple Git repository format does store it’s blobs in gzipped files, but the packed files are more sophisticated. David mentioned running git gc every day or so. Git gc packs up this simple gzip blob format into the deltified pack format where the real savings are made.

Tom Anderson said on 05/09 at 03:58 PM

To address the issue pointed out by @Dan Grossman, you might want to prune old revisions out of your branch.  For example, this command should remove the commit from 7 days ago:

git rebase --onto master~8 master~7

Ryan Findley said on 05/09 at 06:41 PM

Oops, sorry for the misinfo. Thanks Jonathan!

Strony_Internetowe_Szczecin said on 05/10 at 03:17 PM

By the way, CVS and Subversion also only store deltas. I dont know that Git invent that.

sanatate said on 05/11 at 05:26 PM

Nice website! :)

Bram Schoenmakers said on 06/09 at 03:36 AM

There are some configuration options for git which should decrease the disk usage of the repository:

* core.compression = 9 : Flag for gzip to specify the compression level for blobs and packs. Level 1 is fast with larger file sizes, level 9 takes more time but results in better compression.
* repack.usedeltabaseoffset = true : Defaults to false for compatibility reasons, but is supported with Git >=1.4.4.
* pack.windowMemory = 100m : (Re)packing objects may consume lots of memory. To prevent all your resources go down the drain it’s useful to put some limits on that. There is also pack.deltaCacheSize.
* pack.window = 15 : Defaults to 10. With a higher value, Git tries harder to find similar blobs.
* gc.auto = 1000 : Defaults to 6700. As indicated in the article it is recommended to run git gc every once in a while. Personally I run git gc --auto everyday, so only pack things when there’s enough garbage. git gc --auto normally only triggers the packing mechanism when there are 6700 loose objects around. This flag lowers this amount.
* gc.autopacklimit = 10: Defaults to 50. Every time you run git gc, a new pack is generated of the loose objects. Over time you get too many packs which waste space. It is a good idea to combine all packs once in a while into a single pack, so all objects can be combined and deltified. By default git gc does this when there are 50 packs around. But for this situation a lower number may be better.

Commenting is not available in this weblog entry.

We're the Developers

at Viget Labs. We write about web development trends, tips, best practices, industry events, and our projects — all with an emphasis on Ruby on Rails.

Contact Us

Have any questions, comments, ideas, or secrets to share? Let us know.


What is the third letter in apple?

Sorry, you need to have Javascript enabled to use this form. (Don't blame us, blame the spammers!) If you'd like to contact us, please visit our Contact page.