In our previous post we touched on alignment issues on solid-state drives. Our test read different-sized blocks from various random points on a raw device, aligned to a particular boundary. Today we’d like to expand on that work, and discuss how other factors affect SSD read performance. In addition to testing different block sizes and alignment boundaries, we tested two other factors: how the drive is partitioned, and what filesystem is used.
We decided to test different partitioning schemes because they can profoundly affect alignment. By default, today’s partitioning tools use 63 sectors per track. Each sector is 512B, so a track contains 63 * 512 = 32256B. Unfortunately this value is not 4K aligned (32256 is not divisible by 4096). Since the first partition starts on the second track, the default partition is not 4K aligned. We wanted to test whether this affects performance. We used three partitioning schemes: no partition (reading from a raw block device), the default partitioning scheme used by fdisk (not 4K aligned), and a 4K aligned partitioning scheme (we tell fdisk to start on sector 128 instead).
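The arithmetic above is easy to check. The following is a minimal sketch, not part of our benchmark; the function name `partition_start_is_aligned` is made up for illustration and does not come from fdisk or any other tool.

```python
SECTOR_SIZE = 512   # bytes per sector, as stated above
PAGE_SIZE = 4096    # the 4K boundary we care about


def partition_start_is_aligned(start_sector, sector_size=SECTOR_SIZE, page_size=PAGE_SIZE):
    """Return True if a partition starting at start_sector begins on a page_size boundary."""
    return (start_sector * sector_size) % page_size == 0


# Sector 63 is the default fdisk start, sector 128 is the aligned start we chose.
for start in (63, 128):
    print(start, start * SECTOR_SIZE, partition_start_is_aligned(start))
```

Sector 63 gives a byte offset of 32256, which is not a multiple of 4096, while sector 128 gives 65536, which is.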
In addition to testing partitioning schemes, we wanted to test how adding a filesystem on top of the device affects performance. We tested reading from the device (or partition) directly, vs. reading a 1GB file from ext2 (created with standard options).
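To make the access pattern concrete, here is a rough sketch of a random aligned read loop. This is not the code of our benchmark tool; the function names are hypothetical, and the sketch goes through the page cache, which a real benchmark would typically bypass (for example with direct I/O and aligned buffers). The same loop works whether `path` is a block device, a partition, or a file on ext2.

```python
import os
import random


def random_aligned_offsets(size, block_size, alignment, count, rng=random):
    """Yield `count` random offsets that are multiples of `alignment`
    and leave room for a full `block_size` read within `size` bytes."""
    last_slot = (size - block_size) // alignment
    if last_slot < 0:
        raise ValueError("target is smaller than one block")
    for _ in range(count):
        yield rng.randint(0, last_slot) * alignment


def read_blocks(path, block_size, alignment, count):
    """Read `count` blocks of `block_size` bytes from random aligned offsets.
    Returns the total number of bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of both files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        total = 0
        for offset in random_aligned_offsets(size, block_size, alignment, count):
            total += len(os.pread(fd, block_size, offset))
        return total
    finally:
        os.close(fd)
```

Timing a call such as `read_blocks(path, 512, 4096, count)` for each block size and alignment boundary would give one data point per combination.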
For each of these configurations we ran random reads for block sizes from 512B to 4096B (in 512B increments), with alignment boundaries from 512B to 4096B (also in 512B increments). That’s 3 * 2 * 8 * 8 = 384 different combinations, so it’s not immediately clear how to visualize the data. The first thing we did was plot six graphs of block size vs. alignment boundary (one graph for each combination of partitioning scheme and filesystem). We hoped this would let us pick out some interesting trends:

On these graphs the red line represents a 512B block size, the blue line represents a 4096B block size, and the other colors represent block sizes in between. The x-axis is the alignment boundary, and the y-axis is performance.
Glancing at these graphs we can see some clear trends.
- The red line is always highest (except for a couple of small anomalies), which means reading 512B chunks is always fastest on every setup.
- The graphs for unpartitioned devices and the graphs for devices with an aligned partition are roughly the same.
- Default partition graphs look inverted from their counterparts.
From this, we can reach two important conclusions:
- For drilldown visualizations, we don’t need to worry about the block size since curves that represent different block sizes look the same. We’ll focus on 512B block sizes.
- Graph inversion for unaligned partitions is shifted by 512B, which makes perfect sense: when we add an extra 512 to 32256, we get to a 4KB boundary on the drive ((63*512 + 512) / 4096 = 8).
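The 512B shift can be computed directly from the partition's starting sector. This is a small illustrative sketch; `shift_to_next_boundary` is a name we made up for this example.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096


def shift_to_next_boundary(start_sector, sector_size=SECTOR_SIZE, page_size=PAGE_SIZE):
    """Bytes from the partition start to the next page_size boundary on the drive."""
    start_offset = start_sector * sector_size
    return (-start_offset) % page_size


print(shift_to_next_boundary(63))   # default fdisk start: 512
print(shift_to_next_boundary(128))  # aligned start: 0
```

For the default partition, an offset of 512B inside the partition corresponds to byte 32768 on the drive, which is exactly 8 * 4096.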
Let’s take a closer look at the drilled down visualization:

From this graph we can deduce a few interesting things:
- Partition misalignment can cause a 15% drop in performance.
- Reading from the raw device with no file system and reading from an aligned partition with no file system are the two fastest modes of operation. The raw device is occasionally a little faster.
- Reading from ext2 causes a 2% drop in performance compared to reading from the raw device.
Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.





Please include the exact command line you used to run rebench in your posts. Also, what kind of hardware configuration are you using?
I think it is really strange that 4096-byte reads are slower than 512-byte reads. I think it is so strange I dug into the code.
I notice you hard-code HARDWARE_BLOCK_SIZE to 512 and then use this for buffer alignment with posix_memalign. I recommend aligning on max(8192, max(VMM page size, block size being tested)) boundaries instead. I’m not sure it would make a difference but it may affect DMA transfers from the disk–especially if the OS/driver is trying to align all reads on memory page boundaries AND disk page boundaries. Imagine you have a 4K read aligned on a disk page but not aligned on a RAM page. I could easily see this requiring a DMA transfer into a kernel buffer, and then a memcpy into your buffer. If everything is aligned exactly right, you should be able to get a DMA directly into your buffer. However, historically there’ve been all kinds of issues in Linux that have prevented zero-copy I/O like this; I don’t know if all of them have been resolved.
Brian, I tried various memory alignments (from 512B to 128K) and it didn’t make a difference. It’s probably a good idea to align it to max(VM page size, block size tested) – I’ll push the code to do it in a bit.
In the meantime, if you have an SSD, would you mind trying the following:
# rebench /dev/sdb # this sets block size and stride to 512B
# rebench -b 4096 -s 4096 /dev/sdb
What results do you get (on what hardware and kernel)? When I move from 512B block size to 4K I see a performance drop on SuperTalent SSD, as well as Intel SLC. On the current test machine they’re connected via SATA directly to the motherboard (I’m not sure what board and SATA controller this machine has, I’ll find out tonight). I’m on Ubuntu 2.6.28-14-server.
I did some more research. It seems that 2.6.28 definitely doesn’t have support for 4K logical sectors. Patches were sent to LKML in the 2.6.29/2.6.30 timeframe but I don’t know if they were ever integrated in any release. In 2.6.31, Linux can interrogate the disk to determine what its native sector size is, but that doesn’t mean 2.6.31 supports 4K sector sizes fully enough to affect performance. And, I couldn’t find anything in the 2.6.29, 2.6.30, or 2.6.31 changelogs indicating 4K sector support at all.
I did find some discussions that were inconclusive about whether it was *safe* to use 512-byte logical sectors on a disk with 4096-byte physical sectors. A few people are saying that 512-on-4096 emulation is NOT safe for magnetic disks. Whether 512-on-4096 emulation is safe for an SSD depends on the SSD controller. It seems most SSD controllers are using a (mostly-)append-only log-structured filesystem internally; in that case, it seems like 512-on-4096 emulation is safe. I think that safety-critical applications should always write using the physical sector size (4096 bytes for most SSDs today) for maximum atomicity/safety. I also heard someone mention that many SSDs, especially laptop/desktop class SSDs, can’t/won’t expose 4096-byte sector I/O operations to the OS, only doing 512-on-4096 emulation. I don’t know if that is true.
It seems to me that, for most SSDs, a 4K write is always going to be non-destructive if it fails. IMO, ATA should have a mechanism for the disk to say “writes of XXXXk (e.g. 128k) or smaller are non-destructive if they fail” and/or expose a non-destructive write operation. That would enable *huge* performance optimizations for VMMs, filesystems, and databases. Maybe ATA already has one or both of those already; it is something that should be looked into.
I don’t have a machine that I can reformat appropriately to run those tests. That’s why I’m so interested in running your tests.