Archive for October, 2009

Help make RethinkDB great

We know all too well that everyone has problems with their databases. We want to hear about your database problems, and if you tell us about them, you could win a RethinkDB sweatshirt.
We’re working hard to make RethinkDB as fast, scalable, and flexible as possible. We don’t do this with magic, but by designing RethinkDB for today’s hardware and access patterns. You can help by giving us a clearer picture of what kind of hardware you’re using and what kind of workloads you’re experiencing.
You can find the survey at http://www.rethinkdb.com/survey/. The best responses will get an awesome new RethinkDB sweatshirt, so you can be the envy of all your friends!
If you’d like us to follow up with you, make sure to check off the last question in the survey. If you’re in the Bay Area, we might even buy you lunch.
Scalability demands, rising hardware costs, administration woes, and performance concerns are all fair game. We’re here for you, so tell us your problems.

We know all too well that most companies face database scalability and administration problems. We’re working hard to make RethinkDB as fast, scalable, and flexible as possible. We use a little bit of magic and a lot of engineering for today’s hardware and access patterns. You can help by telling us what your infrastructure and workloads look like. We want you to tell us about your database problems, and in exchange we’ll send a RethinkDB sweatshirt to the best responses.

You can find the survey at http://www.rethinkdb.com/survey/. If you’d like us to follow up with you, make sure to check off the last checkbox. If you’re in the Bay Area, we’ll be happy to buy you lunch for your troubles.

Scalability demands, rising hardware costs, administration struggles, and performance concerns are all fair game. We’re here for you – tell us about your database woes.

More on alignment, ext2, and partitioning on SSDs

In our previous post we touched on alignment issues on solid-state drives. Our test read different-sized blocks from various random points on a raw device, aligned to a particular boundary. Today we’d like to expand on that work, and discuss how other factors affect SSD read performance. In addition to testing different block sizes and alignment boundaries, we tested two other factors: how the drive is partitioned, and what filesystem is used.

We decided to test different partitioning schemes because they can profoundly affect alignment. By default, today’s partitioning tools use 63 sectors per track. Each sector is 512B, so a sector contains 32256B. Unfortunately this value is not 4K aligned (32256 is not divisible by 4096). Since the first partition starts on the second track, the default partition is not 4K aligned. We wanted to test whether this affects performance. We used three partitioning schemes: no partition (reading from a raw block device), default partitioning scheme used by fdisk (not 4K aligned), and a 4K aligned partitioning scheme (we tell fdisk to start on sector 128 instead).

In addition to testing partitioning schemes, we wanted to test how adding a filesystem on top of the device affects performance. We tested reading from the device (or partition) directly, vs. reading a 1GB file from ext2 (created with standard options).

For each of these configurations we ran random reads for block sizes from 512B to 4096B (at 512B increments), and 512B to 4096B aligned boundaries (also at 512B increments). That’s 3 * 2 * 8 * 8 = 384 different combinations, so it’s not immediately clear how to visualize the data. The first thing we did, was to plot six different graphs that visualize block size vs. alignment boundary (one graph for each partitioning and file system combination). We hoped that it would let us pick out some interesting trends:

multiplot-2d

On these graphs the red line represents a 512B block size, the blue line represents a 4096 block size, and the other colors represent block sizes in between. The x-axis is the alignment boundary, and the y-axis is performance.

Glancing at these graphs we can see some clear trends.

  • The red line is always highest (except for a couple of small anomalies), which means reading 512B chunks is always fastest on every setup.
  • The graphs that display runs that ran on unpartitioned devices, and the graphs that display runs on aligned partitioned devices are roughly the same.
  • Default partition graphs look inverted from their counterparts.

From this, we can reach two important conclusions:

  • For drilldown visualizations, we don’t need to worry about the block size since curves that represent different block sizes look the same. We’ll focus on 512B block sizes.
  • Graph inversion for unaligned partitions is shifted by 512B, which makes perfect sense: when we add an extra 512 to 32256, we get to a 4KB boundary on the drive ((63*512 + 512) / 4096 = 8).

Let’s take a closer look at the drilled down visualization:

graph

From this graph we can deduce a few interesting things:

  • Partition misalignment can cause a 15% drop in performance.
  • Reading from the raw device with no file system is occasionally a little faster than reading from an aligned partition with no file system – two fastest modes of operation.
  • Reading from ext2 causes a 2% drop in performance compared to reading from the raw device.

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Page alignment on SSDs

In our previous post we discussed the optimal block-size for B-trees on solid-state drives. A few people mentioned page alignment – an issue that can cause serious performance hits on SSDs if unaccounted for. It’s a complex topic, and we will dedicate two posts to its discussion. In this post we’ll address alignment behavior while reading directly from the block device. In the next post, we’ll talk about partitioning the drive, and the effects of reading from the filesystem instead of reading from the device directly.

For this test we ran Rebench in random read mode, with block sizes ranging from 512B to 4KB, with a 512B increment. We also set the stride parameter to values ranging from 512B to 4KB, with a 512B increment. In the random read mode, the stride parameter simply aligns random offsets to the boundary. This lets us test how different combinations of block sizes and alignment values affect performance. Here are the results for the 16GB SUPER TALENT MasterDrive OCX (MLC):

graph-sdb-3d

In one glance we can see from the mesh on top that performance spikes whenever the alignment is a power of two. The heatmap shows that performance quickly drops off for larger blocks, and that the best performing workload reads 512B blocks from 4KB-aligned offsets. An open question remains: if we align our blocks at 4KB boundaries and can read the first 512B chunk very quickly, how can we read the rest of the chunks without performance loss? We know from previous testing on our rotational drive that reading larger blocks did not result in a performance drop-off, which means the problem isn’t likely to be in the kernel configuration or the data channel. Perhaps it’s a problem with the drive’s firmware, or the driver, or perhaps it’s an inherent limitation of the drive. We’ll be posting results on the Intel X-25M G2 MLC and X-25E SLC drives soon; we’re looking forward to comparing the results.

Stay tuned for information on how the block size and alignment behaves with different partitioning and file system schemes. In the meantime, if you’d like more precise information on how the drive behaves, here’s a 2D visualization:

graph-sdb-2d

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Rethinking B-tree block sizes on SSDs

One of the first questions to answer when running databases on SSDs is what B-tree block size to use. There are a number of factors that affect this decision:

  • The type of workload
  • I/O time to read and write the block size
  • The size of the cache

That’s a lot of variables to consider. For this blog post we assume a fairly common OLTP scenario – a database that’s dominated by random point queries. We will also sidestep some of the more subtle caching effects by treating the caching algorithm as perfectly optimal, and assuming the cost of lookup in RAM is insignificant.

Even with these restrictions it isn’t immediately obvious what is the optimal block size. Before discussing SSDs, let’s quickly address this problem on rotational drives. If we benchmark the number of IOPS for different block sizes on a typical rotation drive we get the following graph:

graph-sda-log

There are two things to note. The first, is that the random distribution makes a big difference, resulting in a 25% speedup between uniform and power distributions. The curves, however, are roughly the same, which means that ignoring caching, the ideal block size isn’t dependent on the distribution. The second, is that the number of IOPS is effectively constant for all blocks before 16KB. This is supported by the assumption that the time it takes to read extra information once the arm is properly positioned is insignificant compared to the seek latency and rotational delays. So, for a rotational drive, I/O read time changes are not a significant factor – we should design the block size completely based on the caching effects. But what about solid state drives?

The first natural thing to do is to benchmark the number of IOPS for different block sizes. A couple of runs of Rebench fed into gnuplot give us the following results:

graph-sdb-log

That’s a very different curve! The first thing that jumps out is that random distributions have almost no effect on the results. But what about block size? Given this curve, it isn’t immediately clear what the ideal block size is. Fortunately, we can easily figure it out with a little math. The depth of the B-tree is logb (N) – this is how many hops we need to make to satisfy a given point query. Let’s perform some back of the envelope calculations for a database of one billion rows. Assuming we can fit a single key into the B-tree node in 32 bytes, we can easily figure out the value of B for each block size. Now, all we need to do is plug in N (we use one billion rows) and B into the formula to figure out how many hops we need to make. We simply divide the number of IOPS for each block size from the experimental data above, and we see how many queries per second we can perform with a given block size. We then pick the block size that lets us perform the maximum number of queries (part of the table removed in the interest of brevity):

1kb (32 keys)
4579 IOPS
2kb (64 keys)
4254 IOPS
4kb (128 keys)
3780 IOPS
8kb (256 keys)
3197 IOPS
16kb (512 keys)
2186 IOPS
32kb (1024 keys)
1769 IOPS
64kb (2048 keys)
1334 IOPS
5.98 hops

765 q./sec
4.98 hops

854 q./sec
4.27 hops

885 q./sec
3.74 hops

854 q./sec
3.32 hops

658 q./sec
2.98 hops

593 q./sec
2.72 hops

490 q./sec

So, if we have no cache the optimal block size is 4KB.

There are a number of other factors we didn’t consider here. The most important one is caching. A complete analysis would account for the size of the block cache and how many hops we can avoid by storing some of the tree in memory (naturally this is affected by the block size). Another important factor is write performance. Because RethinkDB makes no in-place modifications, we can safely ignore write-heavy workloads – a scenario that can radically affect the calculations above for traditional databases. Finally, we ignore page read boundaries – a factor that can give a significant boost to performance on solid-state drives. More on that later.

Of course, we wouldn’t ask our customers to go through these calculations. RethinkDB will perform these tests on target hardware automatically and suggest the optimal page size, so you never have to guess.

Edit: A few people e-mailed us to let us know that there are some assumptions our computations rely on that weren’t mentioned in the post. For example, B-Tree nodes might not always be full, which might significantly impact the ideal block size. I want to note that we did not intend to say that 4KB blocks are an ideal size on SSDs. The size of the database, the size of the cache, the means by which the data is inserted, and the performance of the drive (given your file system and RAID configuration) are all crucial factors. In order to determine the ideal size it’s necessary to test the performance of the particular hardware and to plug it into a more complete model. Alternatively, you can switch to RethinkDB when it’s ready.

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Rebench: cutting through the myths of I/O performance

A very wise systems programmer once told me: “Don’t guess. Measure.” Since then, I’ve learned the hard way that guessing too much about performance is death by a thousand cuts. For RethinkDB, dozens of factors for I/O alone affect performance (not to mention memory, buses, caches, and CPU cores). In order to design the fastest database on Earth, we constantly test the following factors:

  • Performance of read and write operations.
  • Behavior for random and sequential workloads:
    • For random workloads, the behavior of uniform, normal, and power distributions (with different distribution parameters).
    • For sequential workloads, the seek direction and various strides .
  • Performance changes for different block sizes.
  • Type of I/O calls (pread/pwrite vs. read/write vs. aio_read/aio_write vs. mmap).
  • Page cache performance on different workloads compared to direct I/O.
  • Different flushing strategies for write operations.
  • Splitting a given workload across multiple threads, and running multiple different workloads concurrently:
    • For concurrent workloads, different file descriptor sharing strategies.
  • Space utilization of the drive.
  • Different flags (O_APPEND, O_NOATIME, etc.)
  • Different filesystems and mount flags.
  • Performance differences across drives, RAID controllers, and operating systems.

There are a number of existing tools designed to test I/O performance (hdparm, sysbench, IOBench), but none of them gave us the high precision control and number of options we needed. So we wrote our own – Rebench. Rebench is designed to perform precision drilldown tests for different I/O workloads, and combine workloads in order to give an idea of how a system will behave in complex situations. We designed Rebench to be flexible, so every one of the factors we measure can be mixed and matched. With Rebench, if we wonder about a particular aspect of I/O performance, we don’t have to guess – it only takes a couple of seconds to come up with a test that verifies our assumptions.

Here is the default run of Rebench (mode information removed for clarity). /dev/sda is a Western Digital 80GB 7200RPM rotational drive:

$ sudo rebench /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 87 (0.04 MB/sec)

We know that a random, uniform distribution workload with 512 byte block size results in 87 I/O operations per second on our rotational drive. Let’s try sequential reads:

$ sudo rebench -w seq /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 9778 (4.77 MB/sec)

The number of operations per second jumps up to nearly 10,000! What about our solid-state drive? /dev/sdb is a 16GB SUPER TALENT MasterDrive OCX (MLC).

$ sudo rebench -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4682 (2.29 MB/sec)

So it doesn’t perform as well as the rotational drive on sequential read access. How about random reads?

$ sudo rebench /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4923 (2.40 MB/sec)

Ah! We blow the rotational drive out of the water at a factor of 50 improvement. And finally, how does the solid-state drive perform for random writes?

$ sudo rebench -o write /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 16 (0.01 MB/sec)

Not well, at only 16 random write operations per second! How about sequential writes?

$ sudo rebench -o write -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 6576 (3.21 MB/sec)

Basically the same as reads, which means the SSD translation layer for random writes on this drive needs some work.

Finally, if you don’t pass any flags to Rebench on the command line, it accepts them on standard input and treats each line as a separate workload to be run concurrently:

$ sudo rebench
/dev/sdb
/dev/sda
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4636 (2.26 MB/sec)
---
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 85 (0.04 MB/sec)

Rebench is work in progress – we combined dozens of smaller programs we wrote into a unified tool just a few days ago. There’s some spaghetti code involved, and probably some lurking bugs, but in the meantime it gets the job done. You can get it at GitHub:

git clone git://github.com/coffeemug/rebench.git

or download the source directly:

http://github.com/coffeemug/rebench/tarball/master

To build Rebench, simply run make. You may need to install GNU Scientific Library, if you don’t have it already.

Rebench is released under the GPL license, so we welcome improvements, bug fixes, and ports to other operating systems. Last but not least, we welcome hardware donations. Happy benchmarking!