Wednesday, December 9, 2009

Linux Filesystem Benchmarks

I wanted to see how various filesystems in Linux stacked up to each other. So, I decided to benchmark them.

The filesystems I am benchmarking are: ext3, ext4, xfs, jfs, reiserfs, btrfs, and nilfs2.

The system I am using to do the benchmarking:

Gigabyte MA790FXT-UD5P Motherboard
AMD Phenom II 955 Quad Core 3.2GHz
4 x Mushkin Silverline PC3-10666 9-9-9-24 1333 2GB Ram
Gigabyte GV-N210OC-512I Geforce 210 512MB 64-bit PCIE Video Card
LG 22x DVD Sata Burner
2 x WD Caviar Blue 320GB 7200RPM Sata Drives (OS/Other)
4 x WD Caviar Blue 80GB 7200RPM Sata Drives (Data)
4 x Patriot Memory 32GB Sata SSD (Database)

Gentoo Linux 10.1

The disk space used is a software RAID 0 array made up of four partitions of 4864 cylinders (37.3GB) each, one from each of the 80GB hard drives.

I used the fileio benchmarks in Sysbench 0.4.10 to do these tests.

I created a script that formats the filesystem, mounts it, runs the sysbench prepare step, clears the file cache, and runs the benchmark. This is done 5 times for each combination of filesystem and I/O operation mode, and the results are averaged. Each filesystem is created with its default values.
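The steps above can be sketched roughly as follows. This is not the original script, only a minimal illustration that builds the command sequence for one pass without executing anything. The device name, mount point, and total file size are assumptions, and some mkfs tools need an extra force or quiet flag to run unattended.

```python
from statistics import mean

FILESYSTEMS = ["ext3", "ext4", "xfs", "jfs", "reiserfs", "btrfs", "nilfs2"]
MODES = ["seqwr", "seqrewr", "seqrd", "rndrd", "rndwr", "rndrw"]
RUNS_PER_COMBINATION = 5


def plan_run(fs, mode, device="/dev/md0", mountpoint="/mnt/bench", total_size="16G"):
    """Return the shell commands for one benchmark pass, in order.

    device, mountpoint and total_size are hypothetical placeholders.
    """
    sysbench = (
        f"sysbench --test=fileio --file-total-size={total_size} "
        f"--file-test-mode={mode}"
    )
    return [
        f"mkfs -t {fs} {device}",                      # format with default options
        f"mount -t {fs} {device} {mountpoint}",        # mount the fresh filesystem
        f"cd {mountpoint} && {sysbench} prepare",      # create the test files
        "sync && echo 3 > /proc/sys/vm/drop_caches",   # clear the file cache
        f"cd {mountpoint} && {sysbench} run",          # the measured benchmark
        f"umount {mountpoint}",
    ]


def average_runs(results):
    """Average the per-run results for one filesystem and mode."""
    return mean(results)


if __name__ == "__main__":
    for command in plan_run("xfs", "seqwr"):
        print(command)
```

Reformatting before every pass keeps each run independent of the previous one, at the cost of a longer total benchmark time.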

SEQWR

SEQWR is the sequential write benchmark.
[Chart: SEQWR]

XFS is the clear winner here, with ext4 following closely. NILFS2 was really bad, but I attribute this to its newness and to the fact that it is not production ready. It performed poorly in every test except one odd exception (SEQRD), which I will discuss later. Ignoring NILFS2, JFS was the worst at 2.4x the best.

SEQREWR

SEQREWR is the sequential rewrite benchmark.
[Chart: SEQREWR]
JFS just beat out XFS on this one, with EXT3 having a particularly bad showing at 7x the best.

SEQRD

SEQRD is the sequential read benchmark.
[Chart: SEQRD]
I cannot explain why NILFS2 chose this test to shine on. Either I am missing something, or NILFS2 is just really good at sequential reads. All the other filesystems were virtually equal here. If your workload is sequential read, it seems any filesystem would do.

RNDRD

RNDRD is the random read benchmark.
[Chart: RNDRD]
The winner here is REISERFS, with JFS and EXT3 close behind. XFS was the worst at 2.8x the best.

RNDWR

RNDWR is the random write benchmark.
[Chart: RNDWR]
I can't explain the fantastic showing by REISERFS here, unless it buffers the write and returns before it has been synced to disk. Setting that aside, EXT3 showed well here, followed by BTRFS and JFS. EXT4 was the worst at 3.8x the best.

RNDRW

RNDRW is the combined random read/write benchmark.
[Chart: RNDRW]
REISERFS won here, followed closely by JFS. The loser here is XFS at 4.9x the best.

Conclusion: REISERFS and JFS are pretty close contenders for first place, followed by BTRFS and EXT4. Good old EXT3 would be my pick for fifth, leaving XFS and the still immature but interesting log-structured filesystem NILFS2 in last place.

As always, your mileage may vary. I could have done some tuning, as some of the filesystems have parameters for stride and stripe width on RAID devices, but once I started tuning I wouldn't know where to stop, so I thought it was fairer to compare them with their default values.

I plan on testing these same filesystems and I/O patterns on SSDs next. I am also going to test BTRFS compression, but I am not sure yet whether that is interesting enough to post about.


Monday, November 30, 2009

ZFS on SAN - multiple slices from the same raid group?

Until recently, I never had much to do with Solaris. I had heard a lot about ZFS and its many advantages, but when I read about the upcoming deduplication in ZFS, I had to try it out. I talked to some people here who are more familiar with ZFS than I am, and they indicated that they were moving away from it because high I/O rates were bringing it to its knees. We are using an EMC SAN here, and while reading about ZFS it occurred to me that the cause might be that no one planned the raid groups so that each slice of the SAN came from a separate raid group. I decided to test this out.

I did not have a spare system to run OpenSolaris on, and I needed a system with more SATA ports so I could have more hard disk drives. Here is what I came up with:

Gigabyte MA790FXT-UD5P Motherboard
AMD Phenom II 955 Quad Core 3.2GHz
4 x Mushkin Silverline PC3-10666 9-9-9-24 1333 2GB Ram
Gigabyte GV-N210OC-512I Geforce 210 512MB 64-bit PCIE Video Card
LG 22x DVD Sata Burner
2 x WD Caviar Blue 320GB 7200RPM Sata Drives (OS/Other)
4 x WD Caviar Blue 80GB 7200RPM Sata Drives (Data)
4 x Patriot Memory 32GB Sata SSD (Database)

Gentoo Linux 10.1
OpenSolaris 200906


I decided to test this theory with one 4864 cylinder partition on one of the 80GB drives, and two 2432 cylinder partitions on another 80GB drive. Here are the two pools:

  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          c7t1d0p3  ONLINE       0     0     0

  pool: storages
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storages    ONLINE       0     0     0
          c7t0d0p3  ONLINE       0     0     0
          c7t0d0p4  ONLINE       0     0     0

I decided to use iozone as the test. I was particularly interested in the filesizes of 8GB and 16GB, since I have 8GB of memory and wanted to take the filecache out of the picture as much as possible. The results were a little surprising:

                            One Partition              Two Partitions
                            8GB          16GB          8GB          16GB
Writer Report (avg)         53484 KB/s   49143 KB/s    49781 KB/s   45725 KB/s
Reader Report (avg)         38038 KB/s   36979 KB/s    98214 KB/s   45676 KB/s
Random Read Report (avg)    38869 KB/s   12755 KB/s    24957 KB/s   11710 KB/s
Random Write Report (avg)   30250 KB/s   25258 KB/s    27607 KB/s   23635 KB/s
Stride Read Report (avg)    76695 KB/s   31510 KB/s    57693 KB/s   34278 KB/s
Fwrite Report (avg)         53784 KB/s   48968 KB/s    29115 KB/s   28515 KB/s
Fread Report (avg)          68970 KB/s   49275 KB/s    98283 KB/s   45997 KB/s


Here is what jumps out at me:

- Sequential writing was approximately 7% slower with 2 partitions than with 1 partition, at both filesizes

- Sequential read was a clear win for 2 partitions, with about 2.6x the performance at the 8GB filesize and about a 23.5% increase at 16GB

- Random Read showed about a 36% decrease in performance with 2 partitions at the 8GB filesize, while the 16GB filesize only showed about an 8% decrease

- Random Write showed about a 9% decrease in performance with 2 partitions at the 8GB filesize, vs about 6% at the 16GB filesize

- Stride Read showed the opposite of Sequential Read: at the 8GB filesize, 2 partitions were about 25% slower than 1 partition (at 16GB they were about 9% faster), maybe because the stride read masked the OS filecache?

- Fwrite showed almost a 2 to 1 advantage for the 1 partition configuration

- Fread showed about a 42% increase for 2 partitions at the 8GB filesize, but about a 7% decrease at the 16GB filesize
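For anyone who wants to check the percentages, here is a small sketch that recomputes the change from one partition to two for every row of the table. The throughput figures are copied from the table; the function and variable names are just for illustration.

```python
# Average throughput in KB/s from the iozone table:
# (test, filesize) -> (one partition, two partitions)
RESULTS = {
    ("Writer", "8GB"): (53484, 49781),
    ("Writer", "16GB"): (49143, 45725),
    ("Reader", "8GB"): (38038, 98214),
    ("Reader", "16GB"): (36979, 45676),
    ("Random Read", "8GB"): (38869, 24957),
    ("Random Read", "16GB"): (12755, 11710),
    ("Random Write", "8GB"): (30250, 27607),
    ("Random Write", "16GB"): (25258, 23635),
    ("Stride Read", "8GB"): (76695, 57693),
    ("Stride Read", "16GB"): (31510, 34278),
    ("Fwrite", "8GB"): (53784, 29115),
    ("Fwrite", "16GB"): (48968, 28515),
    ("Fread", "8GB"): (68970, 98283),
    ("Fread", "16GB"): (49275, 45997),
}


def percent_change(one_partition, two_partitions):
    """Change in throughput going from one partition to two, in percent."""
    return (two_partitions - one_partition) / one_partition * 100.0


for (test, size), (one, two) in RESULTS.items():
    print(f"{test:13s} {size:>4s}: {percent_change(one, two):+7.1f}%")
```

A negative number means the two-partition pool was slower than the one-partition pool for that test.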

I am guessing that some of the advantage in sequential read has to do with the OS doing more readahead for what it sees as two "disks" instead of one.

This test was not exhaustive and did not include multiple readers and writers, but my gut feeling is to stay away from multiple slices in one raid group for write-heavy workloads, and to stay away from them for read-heavy workloads as well. For sequential reads, it looks better to tune the OS readahead to get the extra read performance.

I am planning to do more benchmarks on ZFS, with SSD disk and spinning disk, with record sizes, compression, and deduplication when it is available.

Monday, April 6, 2009

Performance Problems

I am having MySQL performance problems with my current schema. Here is a rundown of what the inserter does:

- gets the filename, filesize and hash for a file
- checks the objects table to see if that combination already exists
- if it exists, skip to the next step
- if it does not exist, insert it
- insert the information in the file table
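The inserter logic above can be sketched as follows. This is only an illustration, not my actual inserter: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the tables are cut down to the columns that matter for the lookup, and the function names are made up for the example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with simplified objects and files tables."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE objects (
               objectid INTEGER PRIMARY KEY AUTOINCREMENT,
               filename TEXT, filesize INTEGER, hash TEXT,
               UNIQUE (filename, filesize, hash))"""
    )
    db.execute(
        """CREATE TABLE files (
               path TEXT, filename TEXT, filesize INTEGER,
               hash TEXT, objectid INTEGER)"""
    )
    return db


def record_file(db, path, filename, filesize, digest):
    """Look up (or create) the object row, then insert the file row."""
    row = db.execute(
        "SELECT objectid FROM objects WHERE filename=? AND filesize=? AND hash=?",
        (filename, filesize, digest),
    ).fetchone()
    if row is None:
        # Combination not seen before: insert a new object.
        cur = db.execute(
            "INSERT INTO objects (filename, filesize, hash) VALUES (?, ?, ?)",
            (filename, filesize, digest),
        )
        objectid = cur.lastrowid
    else:
        objectid = row[0]
    db.execute(
        "INSERT INTO files (path, filename, filesize, hash, objectid) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, filename, filesize, digest, objectid),
    )
    return objectid
```

Note that a check followed by an insert is two separate statements, so concurrent inserters can both miss the same object and then race to insert it; the unique key is what catches the duplicate in that case.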

Here is the create table definition for the tables:

CREATE TABLE `objects` (
  `objectid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `filename` varchar(256) COLLATE latin1_bin DEFAULT NULL,
  `filesize` bigint(20) unsigned DEFAULT NULL,
  `hash` varchar(32) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`objectid`),
  UNIQUE KEY `nsh` (`filename`,`filesize`,`hash`)
) ENGINE=MyISAM AUTO_INCREMENT=7037849 DEFAULT CHARSET=latin1 COLLATE=latin1_bin;

CREATE TABLE `files` (
  `path` varchar(4096) COLLATE latin1_bin DEFAULT NULL,
  `filename` varchar(256) COLLATE latin1_bin DEFAULT NULL,
  `filesize` bigint(20) unsigned DEFAULT NULL,
  `hash` varchar(32) COLLATE latin1_bin DEFAULT NULL,
  `backuptime` datetime DEFAULT NULL,
  `status` enum('Active','Inactive','Deleted') COLLATE latin1_bin DEFAULT NULL,
  `objectid` bigint(20) unsigned DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_bin;

I currently have a files table per backup client node. On my test system, I can run the inserts sequentially for 15 nodes (37 or so processes for multiple filesystem clients), and it runs in about 50 minutes. When I run them at the same time, it takes many times longer. The bottleneck looks like the objects table, since I tried "insert delayed" and that cut the time back down to around the sequential time. The downside to delayed inserts is that a crash can lose data.

Does anyone have any ideas as to what I am doing wrong? If it matters, I am running a dual core AMD 5800+, 4GB memory, and a 4 disk raid-0 SSD Array.

Thanks for any input anyone may have.

Initial Stats from 15 Servers

I picked 15 servers at my workplace to get some initial data on. I tried to pick distinct servers rather than multiple copies of the same server, such as high-availability pairs. The data I gathered showed fairly good savings from a poor man's data dedup.

My "poor man's dedup" consists of saving the backup files based on the filename, filesize, and md5 hash of the file contents. This provided savings even on single servers, as there are some number of identically named files on the same system under different paths.
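As a rough illustration of the idea (not the actual backup code), here is a minimal sketch that builds the filename/filesize/md5 key for each file and groups paths that share a key. The function names are made up for the example.

```python
import hashlib
import os


def object_key(path, chunk_size=1 << 20):
    """Return (filename, filesize, md5 hex digest) for the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return (os.path.basename(path), os.path.getsize(path), digest.hexdigest())


def dedup_index(paths):
    """Group paths by object key; each key only needs to be stored once."""
    index = {}
    for path in paths:
        index.setdefault(object_key(path), []).append(path)
    return index
```

Two files only share an object when all three parts of the key match, so identical contents under different filenames are still stored separately.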

Here are my numbers from the 15 servers:

Total Files: 13753410
Total Objects: 7037848
File Savings: 6715562
Percentage Savings: 48.8283414804

Total File Size: 1747437157724
Total Object Size: 1134826850745
Size Savings: 612610306979
Percentage Savings: 35.0576445208
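The savings figures above follow directly from the totals. A short sketch of the arithmetic, with the totals copied from the numbers above and a helper name chosen just for this example:

```python
def savings(total, unique):
    """Return (amount saved, percentage saved) for a total and its deduplicated size."""
    saved = total - unique
    return saved, saved / total * 100.0


file_savings, file_pct = savings(13753410, 7037848)
size_savings, size_pct = savings(1747437157724, 1134826850745)

print(f"File savings: {file_savings} ({file_pct:.2f}%)")
print(f"Size savings: {size_savings} bytes ({size_pct:.2f}%)")
```

The file count drops by almost half while the byte count drops by about a third, which suggests the duplicated files are on average smaller than the unique ones.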

A 35% space savings seems pretty incredible to me, given how little I had to put into making it happen. Some of the more aggressive data dedup algorithms search files for long common strings, which certainly might improve the amount of savings, but at the cost of more CPU and I/O processing.

Monday, March 2, 2009

Recursing into subdirectories

One of the things that needs to be done when backing up files is to determine what files exist on the server. I had written code to recurse into subdirectories, but it was very slow. It also basically only gave the filename, which necessitated a call to stat to get the important information. Then I discovered the library functions ftw and nftw. These calls do the work of recursing into the subdirectories, and also provide the stat structure for each file. They are also very much faster than my original code. Here is the code I wrote to do that:


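#define _XOPEN_SOURCE 500
#define _FILE_OFFSET_BITS 64    /* 64-bit off_t, so nftw() copes with large files */
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Called by nftw() once for every entry it visits. */
static int
display_info(const char *fpath, const struct stat *sb,
             int tflag, struct FTW *ftwbuf)
{
    (void) ftwbuf;              /* unused */

    printf("p:%s ", fpath);
    printf("tf:%d ", tflag);
    printf("dev:%lu ", (unsigned long) sb->st_dev);
    printf("ino:%llu ", (unsigned long long) sb->st_ino);
    printf("m:%o ", (unsigned int) sb->st_mode);
    printf("nl:%lu ", (unsigned long) sb->st_nlink);
    printf("u:%u ", (unsigned int) sb->st_uid);
    printf("g:%u ", (unsigned int) sb->st_gid);
    printf("rd:%lx ", (unsigned long) sb->st_rdev);
    printf("s:%lld ", (long long) sb->st_size);
    printf("bs:%ld ", (long) sb->st_blksize);
    printf("b:%lld ", (long long) sb->st_blocks);
    printf("ta:%ld ", (long) sb->st_atime);
    printf("tm:%ld ", (long) sb->st_mtime);
    printf("tc:%ld ", (long) sb->st_ctime);
    printf("\n");
    return 0;                   /* To tell nftw() to continue */
}

int
main(int argc, char *argv[])
{
    int flags = FTW_MOUNT;      /* stay within one filesystem */

    /* Optional second argument: 'd' = depth-first, 'p' = do not follow symlinks */
    if (argc > 2 && strchr(argv[2], 'd') != NULL)
        flags |= FTW_DEPTH;
    if (argc > 2 && strchr(argv[2], 'p') != NULL)
        flags |= FTW_PHYS;

    /* Walk the tree with at most 80 directory file descriptors open at once */
    if (nftw((argc < 2) ? "." : argv[1], display_info, 80, flags) == -1) {
        perror("nftw");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}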