Tuesday, September 30, 2008

My Thoughts on Creating a Backup Storage System

I have worked with backups and network backup systems for about 13 years now. I decided to investigate what it would take to write such a system, using MySQL for the backend database.

The first thing I discovered is that traversing directories is a slow process. It is many times faster than actually backing up the files, but still slow. Then I discovered a nifty function called nftw that recurses into directories and hands you a stat structure and other information for each file it finds. Much, much faster, and you get the stat structure to boot.
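
To give an idea of what an nftw-based walk can look like, here is a minimal sketch. It is not the code from my experiments; the names walk_tree, visit, walk_files and walk_bytes are made up for illustration, and the choice of FTW_PHYS (do not follow symlinks) and 20 open descriptors is just an assumption that seems sensible for a backup tool.

```c
#define _XOPEN_SOURCE 700   /* must come first so <ftw.h> declares nftw() */
#include <assert.h>
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical running totals. nftw() has no user-data argument,
   so the callback reports through file-scope variables. */
static uintmax_t walk_files;
static uintmax_t walk_bytes;

/* Called by nftw() once per entry, with that entry's stat data filled in. */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)ftwbuf;   /* depth and basename offset, unused in this sketch */
    if (typeflag == FTW_F && S_ISREG(sb->st_mode)) {
        walk_files++;
        walk_bytes += (uintmax_t)sb->st_size;
        printf("%ju\t%s\n", (uintmax_t)sb->st_size, path);
    }
    return 0;       /* returning 0 tells nftw() to keep walking */
}

/* Walk the tree under root. Returns 0 on success, -1 on error. */
int walk_tree(const char *root)
{
    assert(root != NULL);
    walk_files = 0;
    walk_bytes = 0;
    /* 20 = maximum directory descriptors nftw() may hold open at once.
       FTW_PHYS = report symlinks themselves instead of following them. */
    return nftw(root, visit, 20, FTW_PHYS);
}
```

Calling walk_tree("/some/dir") prints one line per regular file (size, then path) and leaves the totals in walk_files and walk_bytes. The point is that the size, mode and timestamps arrive in the callback for free, with no separate stat call per file.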

The other thing I have been investigating is a poor man's data de-duplication. I've run some experiments on servers here at work, creating a hash of each file. After looking at millions of files, I settled on the MD5 hash, with no collisions (different files producing the same hash) found so far.
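
The idea of hashing each file and comparing hashes can be sketched as below. Be aware that this sketch does NOT compute MD5: to keep it free of external libraries it swaps in a 64-bit FNV-1a hash, which is far weaker and only a stand-in. A real version would presumably take MD5 from a crypto library. The function names hash_file and same_content_hash are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hash a file's contents with 64-bit FNV-1a (a stand-in, NOT MD5).
   Returns 0 and stores the hash in *out, or -1 if the file can't be read. */
int hash_file(const char *path, uint64_t *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;           /* FNV-1a prime */
        }
    }

    int failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;
    *out = h;
    return 0;
}

/* 1 = hashes match (candidate duplicates), 0 = they differ, -1 = read error. */
int same_content_hash(const char *a, const char *b)
{
    uint64_t ha, hb;
    if (hash_file(a, &ha) != 0 || hash_file(b, &hb) != 0)
        return -1;
    return ha == hb;
}
```

In a backup system the hash would more likely be stored per file in the database and looked up there, rather than compared pairwise like this. The pairwise function is only here to show the principle: equal hash, treat the content as already stored.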

I will post later about the code to recurse into directories, the hashing, some timings, and why I chose MD5.