Monday, April 6, 2009

Performance Problems

I am having MySQL performance problems with my current schema. Here is a rundown of what the inserter does (a rough sketch in code follows the list):

- gets the filename, filesize, and hash for a file
- checks the objects table to see if that combination already exists
- if it exists, skips to the next step
- if it does not exist, inserts it
- inserts the file's information into the files table
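
Roughly, that check-then-insert flow looks like the sketch below. This is just an illustration in Python with MySQLdb, not the actual inserter; the helper name, the connection handling, and the 'Active' status value are placeholders.

import MySQLdb

def record_file(db, path, filename, filesize, filehash, backuptime):
    # One pass of the inserter flow described above (hypothetical helper).
    cur = db.cursor()

    # Does this filename/filesize/hash combination already exist in objects?
    cur.execute(
        "SELECT objectid FROM objects"
        " WHERE filename=%s AND filesize=%s AND hash=%s",
        (filename, filesize, filehash))
    row = cur.fetchone()

    if row:
        # It exists: skip the object insert and reuse the existing objectid.
        objectid = row[0]
    else:
        # It does not exist: insert it and grab the new objectid.
        cur.execute(
            "INSERT INTO objects (filename, filesize, hash)"
            " VALUES (%s, %s, %s)",
            (filename, filesize, filehash))
        objectid = cur.lastrowid

    # Insert the file's information into the (per-node) files table.
    cur.execute(
        "INSERT INTO files"
        " (path, filename, filesize, hash, backuptime, status, objectid)"
        " VALUES (%s, %s, %s, %s, %s, 'Active', %s)",
        (path, filename, filesize, filehash, backuptime, objectid))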

Here are the create table definitions for the two tables:

CREATE TABLE `objects` (
`objectid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`filename` varchar(256) COLLATE latin1_bin DEFAULT NULL,
`filesize` bigint(20) unsigned DEFAULT NULL,
`hash` varchar(32) COLLATE latin1_bin DEFAULT NULL,
PRIMARY KEY (`objectid`),
UNIQUE KEY `nsh` (`filename`,`filesize`,`hash`)
) ENGINE=MyISAM AUTO_INCREMENT=7037849 DEFAULT CHARSET=latin1 COLLATE=latin1_bin

CREATE TABLE `files` (
`path` varchar(4096) COLLATE latin1_bin DEFAULT NULL,
`filename` varchar(256) COLLATE latin1_bin DEFAULT NULL,
`filesize` bigint(20) unsigned DEFAULT NULL,
`hash` varchar(32) COLLATE latin1_bin DEFAULT NULL,
`backuptime` datetime DEFAULT NULL,
`status` enum('Active','Inactive','Deleted') COLLATE latin1_bin DEFAULT NULL,
`objectid` bigint(20) unsigned DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_bin

I currently have a files table per backup client node. On my test system, I can run the inserts sequentially for 15 nodes (37 or so processes, since some clients have multiple filesystems), and it finishes in about 50 minutes. When I run them all at the same time, it takes many times longer. The bottleneck looks like the objects table, since switching to "insert delayed" cut the time back down to around the sequential time. The downside to delayed inserts is that a crash can lose data.
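
For the curious, the "insert delayed" change was only on the object insert. A sketch of that variant (one catch: as far as I can tell, a delayed insert does not hand back the new auto-increment value, so the objectid has to be looked up afterwards):

# Delayed variant of the object insert (sketch). MySQL queues the row and
# returns immediately, which eases MyISAM lock contention, but the row only
# hits disk later, so a crash can lose it. cur.lastrowid is not usable here;
# the objectid would need a follow-up SELECT on filename/filesize/hash.
cur.execute(
    "INSERT DELAYED INTO objects (filename, filesize, hash)"
    " VALUES (%s, %s, %s)",
    (filename, filesize, filehash))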

Does anyone have any ideas as to what I am doing wrong? If it matters, I am running a dual-core AMD 5800+, 4GB of memory, and a 4-disk RAID-0 SSD array.

Thanks for any input anyone may have.

Initial Stats from 15 Servers

I picked 15 servers at my workplace to get some initial data from. I tried to pick a variety of servers rather than multiple copies of the same server (for example, high-availability pairs). The data I gathered shows a fairly good poor man's data-dedup savings.

My "poor man's dedup" consists of saving the backup files based on filename/filesze/md5 hash of the file contents. This provided savings, even on single servers, as there are some number of the same named files on the same system, with different paths.

Here are my numbers from the 15 servers:

Total Files: 13753410
Total Objects: 7037848
File Savings: 6715562
Percentage Savings: 48.8283414804

Total File Size: 1747437157724 bytes
Total Object Size: 1134826850745 bytes
Size Savings: 612610306979 bytes
Percentage Savings: 35.0576445208
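
For reference, the savings and percentages above fall straight out of the totals:

total_files, total_objects = 13753410, 7037848
file_savings = total_files - total_objects               # 6715562
print(100.0 * file_savings / total_files)                # 48.8283...%

total_file_size, total_object_size = 1747437157724, 1134826850745
size_savings = total_file_size - total_object_size       # 612610306979
print(100.0 * size_savings / total_file_size)            # 35.0576...%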

A 35% space savings seems pretty incredible to me for how little effort I had to put into making it happen. Some of the more aggressive data-dedup algorithms search files for long common strings, which would certainly improve the savings, but at the cost of more CPU and I/O.