I thought it might be interesting to list the books I use as reference while programming for this project. In no particular order:
SQL in a Nutshell - Kevin E. Kline - This book was not what I thought it was, but it is a good reference for SQL and distinguishes between the various flavors: mostly DB2, MySQL, Oracle, and SQL Server.
High Performance MySQL - Jeremy D. Zawodny & Derek J. Balling - A good book on getting performance out of MySQL.
Advanced Programming in the Unix Environment - W. Richard Stevens
Unix Network Programming - W. Richard Stevens
The C Programming Language - Kernighan and Ritchie
Programming with POSIX Threads - David R. Butenhof
I'm sure there's more, but that's all I can think of right now.
Thursday, October 16, 2008
Tuesday, September 30, 2008
My Thoughts on Creating a Backup Storage System
I have worked with backups and network backup systems for about 13 years now. I decided to investigate what it would take to write such a system, using MySQL for the backend database.
The first thing I discovered is that traversing the directories is slow; it is many times faster than actually backing up the files, but slow all the same. Then I discovered a nifty subroutine called nftw that recurses into directories and hands you a stat structure and other information for each file it finds. Much, much faster, and you get the stat structure to boot.
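Here is a minimal sketch of what an nftw-based walker might look like. The printf body is just a stand-in; in the real system the callback would feed the path and stat data into the database.

/* sketch of a file-tree walk with nftw(3); compile with: cc -o walk walk.c */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>

/* Called once per entry; nftw hands us the stat buffer for free. */
static int visit(const char *fpath, const struct stat *sb, int typeflag,
                 struct FTW *ftwbuf)
{
    (void)ftwbuf;           /* depth/offset info, unused in this sketch */
    if (typeflag == FTW_F)  /* regular file */
        printf("%12lld %s\n", (long long)sb->st_size, fpath);
    return 0;               /* returning non-zero stops the walk */
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return EXIT_FAILURE;
    }
    /* FTW_PHYS: do not follow symlinks; 20 is the fd budget for the walk */
    if (nftw(argv[1], visit, 20, FTW_PHYS) == -1) {
        perror("nftw");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}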
The other thing I have been investigating is a poor man's data de-duplication. I've run some experiments on servers here at work, creating a hash of each file. After hashing millions of files, I settled on MD5, with no colliding hashes for distinct files found so far.
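For the hashing itself, here is a sketch of computing a per-file MD5 digest with OpenSSL's MD5 routines; that library choice is my assumption, and any MD5 implementation would do. De-duplication then amounts to looking the hex digest up in the database and storing the file contents only when the digest is new.

/* sketch of per-file MD5 hashing; compile with: cc -o md5file md5file.c -lcrypto */
#include <openssl/md5.h>
#include <stdio.h>

/* Hash one file into digest[MD5_DIGEST_LENGTH]; returns 0 on success. */
static int md5_file(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char buf[65536];   /* arbitrary read size */
    size_t n;
    MD5_CTX ctx;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        MD5_Update(&ctx, buf, n);
    MD5_Final(digest, &ctx);
    fclose(fp);
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    int i;

    if (argc != 2 || md5_file(argv[1], digest) != 0) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);
    printf("  %s\n", argv[1]);
    return 0;
}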
I will post later about the code that recurses into directories, the hashing, some timings, and why I chose MD5.