Thursday, October 16, 2008

Books that I Use

I thought it might be interesting to list the books I use as reference while programming for this project. In no particular order:

SQL in a Nutshell - Kevin E. Kline - This book was not what I expected, but it is a good SQL reference and points out the differences between the major flavors, mostly DB2, MySQL, Oracle, and SQL Server.

High Performance MySQL - Jeremy D. Zawodny & Derek J. Balling - Good book on getting performance out of MySQL.

Advanced Programming in the UNIX Environment - W. Richard Stevens

UNIX Network Programming - W. Richard Stevens

The C Programming Language - Kernighan and Ritchie

Programming with POSIX Threads - David R. Butenhof

I'm sure there's more, but that's all I can think of right now.

Tuesday, September 30, 2008

My Thoughts on Creating a Backup Storage System

I have worked with backups and network backup systems for about 13 years now. I decided to investigate what it would take to write such a system, using MySQL for the backend database.

The first thing I discovered is that traversing the directories is a slow process. It is many times faster than actually backing up the files, but still slow. I then discovered a nifty function called nftw that recurses into directories and, for each file found, calls a function you supply with the stat structure and other information. It is much, much faster than my first attempt, and you get the stat structure to boot.
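For illustration, here is a minimal sketch of an nftw walk that counts directories, regular files, and bytes. It is not the code from this project; the names walk_tree, visit, and walk_totals are made up for the example, and the choice of FTW_PHYS (do not follow symbolic links) and a limit of 32 open directory descriptors are assumptions.

```c
#define _XOPEN_SOURCE 700   /* must precede every #include so <ftw.h> declares nftw() */
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Totals for one walk (hypothetical helper type for this sketch). */
struct walk_totals {
    uintmax_t files;   /* regular files seen */
    uintmax_t dirs;    /* readable directories seen, including the root */
    uintmax_t bytes;   /* sum of st_size over the regular files */
};

/* nftw() has no user-data argument, so the callback reports through
 * file-scope state. That makes this sketch unsafe for concurrent walks. */
static struct walk_totals totals;

/* Called by nftw() once per object, with the stat structure already filled in. */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)ftwbuf;   /* depth and basename offset, unused here */

    switch (typeflag) {
    case FTW_F:     /* any non-directory, non-symlink object */
        if (S_ISREG(sb->st_mode)) {
            totals.files++;
            totals.bytes += (uintmax_t)sb->st_size;
        }
        break;
    case FTW_D:     /* directory, reported before its contents */
        totals.dirs++;
        break;
    case FTW_DNR:
        fprintf(stderr, "cannot read directory: %s\n", path);
        break;
    case FTW_NS:
        fprintf(stderr, "cannot stat: %s\n", path);
        break;
    default:        /* symbolic links and anything else are skipped */
        break;
    }
    return 0;       /* a nonzero return would stop the walk */
}

/* Walk the tree under root and copy the totals to *out.
 * Returns whatever nftw() returns: 0 on success, -1 on error. */
int walk_tree(const char *root, struct walk_totals *out)
{
    totals = (struct walk_totals){0, 0, 0};
    int rc = nftw(root, visit, 32, FTW_PHYS);
    *out = totals;
    return rc;
}
```

Calling walk_tree("/some/dir", &t) fills t with the counts for that tree. A backup program would presumably do its real work inside visit, where the path and the stat structure arrive together with no extra stat call.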

The other thing I have been investigating is a poor man's data de-duplication. The idea is that files with the same hash have the same contents and only need to be stored once. I've run some experiments on servers here at work, creating a hash of each file. After looking at millions of files, I settled on the MD5 hash: so far, no two different files have produced the same hash.
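A rough sketch of the hash-then-compare idea follows. Two loud assumptions: MD5 is not part of the C standard library, so this example swaps in 64-bit FNV-1a purely as a self-contained stand-in (it is far too short and too weak for real de-duplication), and the small in-memory array stands in for wherever the hashes would really be kept. The names hash_file, seen_set, and seen_check_and_add are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hash a file's contents with 64-bit FNV-1a.
 * STAND-IN for MD5: a real system would call an MD5 routine from a
 * crypto library here. Returns 0 on success, -1 on open or read error. */
int hash_file(const char *path, uint64_t *out)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a 64-bit offset basis */
    unsigned char buf[65536];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;           /* FNV 64-bit prime */
        }
    }

    int failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;

    *out = h;
    return 0;
}

/* Tiny fixed-size record of hashes already stored (illustration only;
 * the linear search would not scale to millions of files). */
#define MAX_SEEN 1024

struct seen_set {
    uint64_t hashes[MAX_SEEN];
    size_t count;
};

/* Returns 1 if h was already recorded (contents are a duplicate, so only
 * a reference needs storing), 0 if h is new and was added, -1 if full. */
int seen_check_and_add(struct seen_set *s, uint64_t h)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->hashes[i] == h)
            return 1;
    }
    if (s->count == MAX_SEEN)
        return -1;
    s->hashes[s->count++] = h;
    return 0;
}
```

For each file, the backup would call hash_file and then seen_check_and_add: a return of 0 means the contents are new and must be stored, while 1 means an identical file is already in the backup. With the database backend, the lookup would presumably be a query against a hash column rather than an array scan.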

I will post later about the code to recurse into directories, the hashing, some timings, and why I chose MD5.