Page alignment on SSDs

In our previous post we discussed the optimal block size for B-trees on solid-state drives. A few people mentioned page alignment – an issue that can cause serious performance hits on SSDs if unaccounted for. It’s a complex topic, so we will dedicate two posts to it. In this post we’ll address alignment behavior when reading directly from the block device. In the next post, we’ll talk about partitioning the drive, and the effects of reading through the filesystem instead of reading from the device directly.

For this test we ran Rebench in random read mode, with block sizes ranging from 512B to 4KB in 512B increments. We also set the stride parameter to values ranging from 512B to 4KB, again in 512B increments. In random read mode, the stride parameter simply aligns each random offset to a multiple of the stride. This lets us test how different combinations of block sizes and alignment values affect performance. Here are the results for the 16GB SUPER TALENT MasterDrive OCX (MLC):
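To make the stride behavior concrete, here is a minimal Python sketch of how stride-aligned random offsets could be generated. It is not Rebench's actual code: the function name `aligned_random_offsets`, its parameters, and the fixed seed are our own illustrative assumptions.

```python
import random

def aligned_random_offsets(device_size, block_size, stride, count, seed=0):
    """Return `count` random read offsets (hypothetical helper, not Rebench).

    Every offset is a multiple of `stride`, and a read of `block_size`
    bytes starting at that offset stays inside the device.
    """
    if block_size > device_size:
        raise ValueError("block_size is larger than the device")
    rng = random.Random(seed)
    # Index of the last stride-aligned slot whose read still fits on the device.
    last_slot = (device_size - block_size) // stride
    return [rng.randint(0, last_slot) * stride for _ in range(count)]

# The sweep described above: 512B to 4KB in 512B steps, for both parameters.
SIZES = list(range(512, 4096 + 1, 512))
GRID = [(block, stride) for block in SIZES for stride in SIZES]
```

Each `(block, stride)` pair in `GRID` corresponds to one point on the graphs: the benchmark reads `block` bytes at each generated offset and reports the resulting IOPS.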

graph-sdb-3d

At a glance we can see from the mesh on top that performance spikes whenever the alignment is a power of two. The heatmap shows that performance quickly drops off for larger blocks, and that the best performing workload reads 512B blocks from 4KB-aligned offsets. An open question remains: if we align our blocks at 4KB boundaries and can read the first 512B chunk very quickly, how can we read the rest of the chunks without performance loss? We know from previous testing on our rotational drive that reading larger blocks did not result in a performance drop-off, which means the problem isn’t likely to be in the kernel configuration or the data channel. Perhaps it’s a problem with the drive’s firmware, or the driver, or perhaps it’s an inherent limitation of the drive. We’ll be posting results on the Intel X-25M G2 MLC and X-25E SLC drives soon; we’re looking forward to comparing the results.

Stay tuned for information on how block size and alignment behave with different partitioning and file system schemes. In the meantime, if you’d like more precise information on how the drive behaves, here’s a 2D visualization:

graph-sdb-2d


10 Responses to “Page alignment on SSDs”

  1. Why do you only test up to a 4kb block size and stride? It is pretty common for databases even on hard disks to use block sizes of 16kb. Since SSDs normally have to read 128kb at a time for read/erase/write, I’d guess they’d be very good at 128kb aligned reads.

    Also, is your test utilizing NCQ? I would guess that if you quickly sent 32 4K read requests (on the same aligned 128kb block), you would get a nice boost over sending a 4K read, waiting for a response, sending the next one, etc.

  2. You noticed that the IOPS numbers from your last post match up exactly against the numbers for *unaligned* reads from this post, right?

    BTW, I second Brian’s request for larger block sizes.

  3. I third (?) Brian’s request for larger block sizes.

  4. Sure, here’s a graph for block sizes and alignments up to 250KB:

    http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph-blk-alg-25k.png

    The highest performance still comes from 512B block sizes aligned to 4K boundaries (it gets just a little higher for larger boundaries, but the delta is so small it’s uninteresting). After 4K, the optimal alignment is the block size itself, but the number of IOPS never comes close to 512B block sizes.

    The B-tree block size post did use unaligned reads (well, aligned to 512B). The idea was to show the process we go through, not to provide actual numbers for people to use. The numbers are too dependent on the drive, the workload, and the database to give a single useful result.

  5. I woke up this morning and realized that what I wrote above didn’t really make sense. First of all, SSDs can read individual NAND pages–now usually 4K, in the near future 8K–even though they cannot write individual pages. So, the block size shouldn’t really matter much.

    The first thing I would check is what Linux is using as minimum_io_size for the drive. If minimum_io_size is 512 then that would explain why 512-byte reads are faster than 4KB reads. That shouldn’t be the case, though, because there’s no reason (AFAICT) for an SSD to report a minimum_io_size less than 4096, even in 512/4096 emulation mode [1].

    The second graph has a non-zero origin. So, graphically it looks like 4K reads are twice as slow as 512 byte reads, but really they are only ~20% slower. And, all SSD drive makers (AFAIK) recommend doing all I/O in blocks the size of a NAND page (4K or 8K). I’d really like to see why there seems to be this 20% performance difference. I look forward to seeing your future posts with results for server-class SSDs.

    [1] http://mkp.net/pubs/storage-topology.pdf

  6. Brian, which kernel version are you using? minimum_io_size isn’t exposed by sysfs on our 2.6.28-14-server #47-Ubuntu SMP UTC 2009 x86_64 (at least it’s not in /sys/block/sdb/queue/minimum_io_size).

    You’re right about the non-zero origin, it can be a bit confusing. You get slightly better resolution this way, so we chose a narrower range to present the data.

  7. I read that minimum_io_size and others were added in 2.6.31 [1]. I’m not sure how you retrieve those values in previous kernels.

    [1] http://www.redhat.com/archives/dm-devel/2009-June/msg00297.html

  8. Great blog guys! I’m not familiar with the term “stride” in the context of SSDs.

    I thought stride was the size of an individual chunk striped across multiple RAID disks.

    What does stride mean in the context of DBs on SSDs?

    Thanks, Gabor

  9. In this context stride is simply the alignment of the read offset. So, if the stride is 4k, all reads will be performed at a 4k boundary (i.e. an offset into the drive divisible by 4096).

  10. [...] Page alignment on SSDs [...]
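
For readers following the minimum_io_size discussion in the comments above, here is a small Python sketch for checking what a Linux kernel reports for a drive. It assumes the attributes live under /sys/block/<device>/queue/, as in the path quoted in the comments, and it returns None for anything the running kernel does not expose (as on the 2.6.28 kernel mentioned above). The helper names and the `sysfs_root` parameter are our own, added so the code can be pointed at a test directory.

```python
import os

def read_queue_limit(device, name, sysfs_root="/sys/block"):
    """Return the integer in <sysfs_root>/<device>/queue/<name>,
    or None if the attribute is missing or not a number."""
    path = os.path.join(sysfs_root, device, "queue", name)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

def io_topology(device, sysfs_root="/sys/block"):
    """Collect the block-size hints the kernel reports for one device."""
    names = (
        "logical_block_size",
        "physical_block_size",
        "minimum_io_size",
        "optimal_io_size",
    )
    return {name: read_queue_limit(device, name, sysfs_root) for name in names}
```

Calling `io_topology("sdb")` returns a dictionary of the four values for the drive at /dev/sdb; entries that come back as None mean the kernel is not reporting that hint.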
