One of the first questions to answer when running databases on SSDs is what B-tree block size to use. There are a number of factors that affect this decision:
- The type of workload
- I/O time to read and write the block size
- The size of the cache
That’s a lot of variables to consider. For this blog post we assume a fairly common OLTP scenario – a database that’s dominated by random point queries. We will also sidestep some of the more subtle caching effects by treating the caching algorithm as perfectly optimal, and assuming the cost of lookup in RAM is insignificant.
Even with these restrictions it isn’t immediately obvious what is the optimal block size. Before discussing SSDs, let’s quickly address this problem on rotational drives. If we benchmark the number of IOPS for different block sizes on a typical rotation drive we get the following graph:

There are two things to note. The first, is that the random distribution makes a big difference, resulting in a 25% speedup between uniform and power distributions. The curves, however, are roughly the same, which means that ignoring caching, the ideal block size isn’t dependent on the distribution. The second, is that the number of IOPS is effectively constant for all blocks before 16KB. This is supported by the assumption that the time it takes to read extra information once the arm is properly positioned is insignificant compared to the seek latency and rotational delays. So, for a rotational drive, I/O read time changes are not a significant factor – we should design the block size completely based on the caching effects. But what about solid state drives?
The first natural thing to do is to benchmark the number of IOPS for different block sizes. A couple of runs of Rebench fed into gnuplot give us the following results:

That’s a very different curve! The first thing that jumps out is that random distributions have almost no effect on the results. But what about block size? Given this curve, it isn’t immediately clear what the ideal block size is. Fortunately, we can easily figure it out with a little math. The depth of the B-tree is logb (N) – this is how many hops we need to make to satisfy a given point query. Let’s perform some back of the envelope calculations for a database of one billion rows. Assuming we can fit a single key into the B-tree node in 32 bytes, we can easily figure out the value of B for each block size. Now, all we need to do is plug in N (we use one billion rows) and B into the formula to figure out how many hops we need to make. We simply divide the number of IOPS for each block size from the experimental data above, and we see how many queries per second we can perform with a given block size. We then pick the block size that lets us perform the maximum number of queries (part of the table removed in the interest of brevity):
| 1kb (32 keys) 4579 IOPS |
2kb (64 keys) 4254 IOPS |
4kb (128 keys) 3780 IOPS |
8kb (256 keys) 3197 IOPS |
16kb (512 keys) 2186 IOPS |
32kb (1024 keys) 1769 IOPS |
64kb (2048 keys) 1334 IOPS |
| 5.98 hops 765 q./sec |
4.98 hops 854 q./sec |
4.27 hops 885 q./sec |
3.74 hops 854 q./sec |
3.32 hops 658 q./sec |
2.98 hops 593 q./sec |
2.72 hops 490 q./sec |
So, if we have no cache the optimal block size is 4KB.
There are a number of other factors we didn’t consider here. The most important one is caching. A complete analysis would account for the size of the block cache and how many hops we can avoid by storing some of the tree in memory (naturally this is affected by the block size). Another important factor is write performance. Because RethinkDB makes no in-place modifications, we can safely ignore write-heavy workloads – a scenario that can radically affect the calculations above for traditional databases. Finally, we ignore page read boundaries – a factor that can give a significant boost to performance on solid-state drives. More on that later.
Of course, we wouldn’t ask our customers to go through these calculations. RethinkDB will perform these tests on target hardware automatically and suggest the optimal page size, so you never have to guess.
Edit: A few people e-mailed us to let us know that there are some assumptions our computations rely on that weren’t mentioned in the post. For example, B-Tree nodes might not always be full, which might significantly impact the ideal block size. I want to note that we did not intend to say that 4KB blocks are an ideal size on SSDs. The size of the database, the size of the cache, the means by which the data is inserted, and the performance of the drive (given your file system and RAID configuration) are all crucial factors. In order to determine the ideal size it’s necessary to test the performance of the particular hardware and to plug it into a more complete model. Alternatively, you can switch to RethinkDB when it’s ready.
Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.





Thanks for the write-up. Very informative.
Your graph of the SSD performance reflects non-optimal partitioning.
Every modern OS uses a cluster size that is a multiple of 4KB, so in theory there shouldn’t be a difference in IOPS between a 512B write and a 4KB write.
SSDs also have internal clusters that are multiples of 4KB, and the blocks that get erased for read/erase/write cycles are at least 128KB. But for the sake of backwards compatibility, 512B sectors are still emulated.
Unfortunately, most partitioning tools still start the 1st partition on sector 63 (counting 512B sectors), which is not a multiple of 8 to have it 4KB aligned. If you partition using Windows 7 or other smarter tool to keep your partitions 4KB aligned, you will not see a difference in IOPS between 512B operations and 4KB operations. (and since 4KB will be higher, the dropoff afterwards will be steeper).
Hey glugglug, this test and analysis has nothing to do with disk formatting. It has to do with the choice of page size for storing btree indices. This is a variable chosen by the database server, and is independent of the cluster size on the disk.
Admin: Edited to tone down a little.
gugglug: all tests were performed on the raw block device, so I don’t think they’re affected by partitioning artifacts. In this context we only tested reads, so the 128KB erase cycles aren’t relevant either. It’s very difficult to obtain information on what happens within the drives themselves, so unfortunately we have to resort to building up models of how they operate based on extensive testing. We’re talking with a couple of manufacturers, so hopefully we’ll have more definitive information soon.
Looking closely at those performance numbers, I think glugglug may be right — either you’re making unaligned requests, or the disk is optimized for unaligned requests. I’m getting a very strong fit to a model with 16 kiB internal blocks and a shuffling cost for misaligned reads.
Could you measure E(time to read 16 kiB from offset (16 kB) * A + B for random positive integer A) for B = 0, 0.5 kiB, 1 kiB, … 15.5 kiB? Mean and standard deviation would both be useful.
We did much more alignment related testing. I’ll post the results soon, but in general they’re consistent with these numbers. It might be that the SUPER TALENT MasterDrive OCX we’re testing on behaves differently from other drives. We’ll retest on Intel drives soon.
[...] Rethinking B-tree block sizes on SSDs [...]
[...] RethinkDB: Rethinking B-tree block sizes on SSDs [...]
Hi. I’m great follower of http://www.defmacro.org/ and just found this blog.
Maybe you are interested on this:
http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/
Bye,
Iosif.