Monday, April 6, 2009

Initial Stats from 15 Servers

I picked 15 servers at my workplace to gather some initial data on. I tried to pick servers that differ from one another, rather than multiple copies of the same server (such as those kept for high availability). The data I gathered showed fairly good savings from my poor man's data dedup.

My "poor man's dedup" consists of storing each backup file keyed on its filename, file size, and MD5 hash of the file contents. This provides savings even on a single server, since a system often has a number of identically named files under different paths.
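
As a rough illustration of that keying scheme, here is a minimal Python sketch. It is not my actual backup code: the function names (`md5_of_file`, `dedup_stats`) and the choice to skip symlinks are assumptions made just for this example. It walks a directory tree, treats every distinct (filename, file size, MD5) combination as one stored object, and reports the same four totals I list below.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def dedup_stats(root):
    """Count files and unique objects under root (hypothetical helper).

    An "object" is one distinct (filename, file size, MD5) combination;
    the directory path is deliberately left out of the key.
    """
    objects = {}  # (name, size, md5) -> size of the single stored copy
    total_files = 0
    total_size = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption for this sketch: only regular files are counted.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            key = (name, size, md5_of_file(path))
            objects.setdefault(key, size)
            total_files += 1
            total_size += size
    return {
        "total_files": total_files,
        "total_objects": len(objects),
        "total_file_size": total_size,
        "total_object_size": sum(objects.values()),
    }
```

Because the filename is part of the key, two files with identical contents but different names still count as two objects, so this approach gives up some possible savings in exchange for simplicity.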

Here are my numbers from the 15 servers:

Total Files: 13753410
Total Objects: 7037848
File Savings: 6715562
Percentage Savings: 48.8283414804

Total File Size: 1747437157724
Total Object Size: 1134826850745
Size Savings: 612610306979
Percentage Savings: 35.0576445208
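
For anyone who wants to check the arithmetic, the savings figures follow directly from the totals: savings is total minus objects, and the percentage is savings divided by total. A small Python sketch (the `savings` helper is just a name made up for this example) reproduces the numbers above:

```python
def savings(total, unique):
    """Return (absolute savings, percentage savings) for a total and its deduplicated count."""
    saved = total - unique
    return saved, 100.0 * saved / total


# Figures copied from the 15-server run above.
file_saved, file_pct = savings(13753410, 7037848)
size_saved, size_pct = savings(1747437157724, 1134826850745)

print("File Savings:", file_saved, "Percentage Savings:", file_pct)
print("Size Savings:", size_saved, "Percentage Savings:", size_pct)
```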

A 35% space savings seems pretty incredible to me, for as little as I had to put into making it happen. Some of the more aggressive data dedup algorithms search files for long common strings, which certainly might improve the amount of savings, but at the cost of more CPU and I/O processing.
