Monday, November 30, 2009

ZFS on SAN - multiple slices from the same raid group?

I have never had much to do with Solaris until recently. I have heard a lot about ZFS and its many advantages, but when I read about the upcoming deduplication in ZFS, I had to try it out. I talked to some people here more familiar with ZFS than me, but the indication from them was that they were moving away from ZFS: high I/O rates were bringing it to its knees. We are using an EMC SAN here, and while reading about ZFS it occurred to me that what may be happening is that no one planned out the raid groups so that each SAN slice came from a separate raid group. I decided to test this out.

I did not have a spare system to run OpenSolaris on, and I needed a system with more SATA ports so I could attach more drives. Here is what I came up with:

Gigabyte MA790FXT-UD5P Motherboard
AMD Phenom II 955 Quad Core 3.2GHz
4 x Mushkin Silverline PC3-10666 9-9-9-24 1333 2GB RAM
Gigabyte GV-N210OC-512I GeForce 210 512MB 64-bit PCIe Video Card
LG 22x DVD SATA Burner
2 x WD Caviar Blue 320GB 7200RPM SATA Drives (OS/Other)
4 x WD Caviar Blue 80GB 7200RPM SATA Drives (Data)
4 x Patriot Memory 32GB SATA SSD (Database)

Gentoo Linux 10.1
OpenSolaris 2009.06


I decided to test this theory with a 4864 cylinder partition on one of the 80GB drives, and two 2432 cylinder partitions on another 80GB drive; the two partitions on one drive stand in for two SAN slices carved from the same raid group. Here are the two pools:

  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          c7t1d0p3  ONLINE       0     0     0

  pool: storages
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storages    ONLINE       0     0     0
          c7t0d0p3  ONLINE       0     0     0
          c7t0d0p4  ONLINE       0     0     0
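
Recreating this layout only takes a couple of zpool commands, roughly like the following (no redundancy; the partitions are simply used as plain striped vdevs):

# one partition from the first 80GB drive
zpool create storage c7t1d0p3

# two partitions from the same 80GB drive, striped into a single pool
zpool create storages c7t0d0p3 c7t0d0p4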

I decided to use iozone as the test. I was particularly interested in file sizes of 8GB and 16GB, since I have 8GB of memory and wanted to take the file cache out of the picture as much as possible.
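
Something along these lines will drive iozone through the tests reported below; the flags here are an approximation rather than a recorded command line, using full automatic mode, an Excel-style report, file sizes from 8GB to 16GB, and the test file placed on each pool's default mountpoint:

# one-partition pool
iozone -Ra -n 8g -g 16g -f /storage/iozone.tmp -b one_partition.xls

# two-partition pool
iozone -Ra -n 8g -g 16g -f /storages/iozone.tmp -b two_partitions.xls

The results were a little surprising: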

                           One Partition              Two Partitions
                           8GB         16GB           8GB         16GB
Writer Report (avg)        53484KB/s   49143KB/s      49781KB/s   45725KB/s
Reader Report (avg)        38038KB/s   36979KB/s      98214KB/s   45676KB/s
Random Read Report (avg)   38869KB/s   12755KB/s      24957KB/s   11710KB/s
Random Write Report (avg)  30250KB/s   25258KB/s      27607KB/s   23635KB/s
Stride Read Report (avg)   76695KB/s   31510KB/s      57693KB/s   34278KB/s
Fwrite Report (avg)        53784KB/s   48968KB/s      29115KB/s   28515KB/s
Fread Report (avg)         68970KB/s   49275KB/s      98283KB/s   45997KB/s


Here is what jumps out at me:

- Sequential writes with 2 partitions versus 1 partition were approximately 7% slower

- Sequential reads were the big win for 2 partitions, with about 2.5x the throughput at the 8GB file size and about a 23.5% increase at 16GB

- Random reads with 2 partitions showed almost a 37% decrease in performance at the 8GB file size, while the 16GB file size only showed about an 8% decrease

- Random writes showed an 8% decrease in performance at the 8GB file size, vs 6% at the 16GB file size.

- Stride reads showed the opposite of sequential reads, with the 8GB file size about 25% slower with 2 partitions than with 1, maybe because the stride read masked the OS file cache?

- Fwrite showed almost a 2 to 1 advantage for the 1 partition configuration

- Fread showed about a 42% increase for 2 partitions at the 8GB file size, but only a small difference at the 16GB file size.

I am guessing that some of the advantage in sequential reads has to do with the OS doing more readahead for what it sees as two "disks" instead of one.

This test was not exhaustive, and it did not include multiple readers and writers, but my gut feeling is to stay away from multiple slices in one raid group for writers, and to still stay away from it for readers; for sequential read workloads, it may be worth tweaking the OS readahead to get better read performance.
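
As a minimal sketch of where that tweaking could start, assuming the OpenSolaris-era ZFS tunables (the names come from the usual ZFS tuning guides, the values are experiments rather than recommendations, and /etc/system changes require a reboot):

* /etc/system - readahead/prefetch knobs to experiment with
* toggle ZFS file-level prefetch (1 = off, 0 = on, the default)
set zfs:zfs_prefetch_disable = 1
* inflate small device-level reads to 128KB (2^17) instead of the 64KB default
set zfs:zfs_vdev_cache_bshift = 17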

I am planning to do more ZFS benchmarks, on both SSDs and spinning disks, looking at record sizes, compression, and deduplication when it becomes available.
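
For those follow-ups the knobs are just ZFS dataset properties, along these lines (the dataset name is only a placeholder, and dedup needs a build recent enough to include the feature):

zfs set recordsize=8k storage/db      # placeholder dataset name
zfs set compression=on storage/db
zfs set dedup=on storage/db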