Archive for the ‘Benchmarks’ Category

More on alignment, ext2, and partitioning on SSDs

In our previous post we touched on alignment issues on solid-state drives. Our test read different-sized blocks from various random points on a raw device, aligned to a particular boundary. Today we’d like to expand on that work, and discuss how other factors affect SSD read performance. In addition to testing different block sizes and alignment boundaries, we tested two other factors: how the drive is partitioned, and what filesystem is used.

We decided to test different partitioning schemes because they can profoundly affect alignment. By default, today’s partitioning tools use 63 sectors per track. Each sector is 512B, so a sector contains 32256B. Unfortunately this value is not 4K aligned (32256 is not divisible by 4096). Since the first partition starts on the second track, the default partition is not 4K aligned. We wanted to test whether this affects performance. We used three partitioning schemes: no partition (reading from a raw block device), default partitioning scheme used by fdisk (not 4K aligned), and a 4K aligned partitioning scheme (we tell fdisk to start on sector 128 instead).

In addition to testing partitioning schemes, we wanted to test how adding a filesystem on top of the device affects performance. We tested reading from the device (or partition) directly, vs. reading a 1GB file from ext2 (created with standard options).

For each of these configurations we ran random reads for block sizes from 512B to 4096B (at 512B increments), and 512B to 4096B aligned boundaries (also at 512B increments). That’s 3 * 2 * 8 * 8 = 384 different combinations, so it’s not immediately clear how to visualize the data. The first thing we did, was to plot six different graphs that visualize block size vs. alignment boundary (one graph for each partitioning and file system combination). We hoped that it would let us pick out some interesting trends:

multiplot-2d

On these graphs the red line represents a 512B block size, the blue line represents a 4096 block size, and the other colors represent block sizes in between. The x-axis is the alignment boundary, and the y-axis is performance.

Glancing at these graphs we can see some clear trends.

  • The red line is always highest (except for a couple of small anomalies), which means reading 512B chunks is always fastest on every setup.
  • The graphs that display runs that ran on unpartitioned devices, and the graphs that display runs on aligned partitioned devices are roughly the same.
  • Default partition graphs look inverted from their counterparts.

From this, we can reach two important conclusions:

  • For drilldown visualizations, we don’t need to worry about the block size since curves that represent different block sizes look the same. We’ll focus on 512B block sizes.
  • Graph inversion for unaligned partitions is shifted by 512B, which makes perfect sense: when we add an extra 512 to 32256, we get to a 4KB boundary on the drive ((63*512 + 512) / 4096 = 8).

Let’s take a closer look at the drilled down visualization:

graph

From this graph we can deduce a few interesting things:

  • Partition misalignment can cause a 15% drop in performance.
  • Reading from the raw device with no file system is occasionally a little faster than reading from an aligned partition with no file system – two fastest modes of operation.
  • Reading from ext2 causes a 2% drop in performance compared to reading from the raw device.

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Page alignment on SSDs

In our previous post we discussed the optimal block-size for B-trees on solid-state drives. A few people mentioned page alignment – an issue that can cause serious performance hits on SSDs if unaccounted for. It’s a complex topic, and we will dedicate two posts to its discussion. In this post we’ll address alignment behavior while reading directly from the block device. In the next post, we’ll talk about partitioning the drive, and the effects of reading from the filesystem instead of reading from the device directly.

For this test we ran Rebench in random read mode, with block sizes ranging from 512B to 4KB, with a 512B increment. We also set the stride parameter to values ranging from 512B to 4KB, with a 512B increment. In the random read mode, the stride parameter simply aligns random offsets to the boundary. This lets us test how different combinations of block sizes and alignment values affect performance. Here are the results for the 16GB SUPER TALENT MasterDrive OCX (MLC):

graph-sdb-3d

In one glance we can see from the mesh on top that performance spikes whenever the alignment is a power of two. The heatmap shows that performance quickly drops off for larger blocks, and that the best performing workload reads 512B blocks from 4KB-aligned offsets. An open question remains: if we align our blocks at 4KB boundaries and can read the first 512B chunk very quickly, how can we read the rest of the chunks without performance loss? We know from previous testing on our rotational drive that reading larger blocks did not result in a performance drop-off, which means the problem isn’t likely to be in the kernel configuration or the data channel. Perhaps it’s a problem with the drive’s firmware, or the driver, or perhaps it’s an inherent limitation of the drive. We’ll be posting results on the Intel X-25M G2 MLC and X-25E SLC drives soon; we’re looking forward to comparing the results.

Stay tuned for information on how the block size and alignment behaves with different partitioning and file system schemes. In the meantime, if you’d like more precise information on how the drive behaves, here’s a 2D visualization:

graph-sdb-2d

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Rethinking B-tree block sizes on SSDs

One of the first questions to answer when running databases on SSDs is what B-tree block size to use. There are a number of factors that affect this decision:

  • The type of workload
  • I/O time to read and write the block size
  • The size of the cache

That’s a lot of variables to consider. For this blog post we assume a fairly common OLTP scenario – a database that’s dominated by random point queries. We will also sidestep some of the more subtle caching effects by treating the caching algorithm as perfectly optimal, and assuming the cost of lookup in RAM is insignificant.

Even with these restrictions it isn’t immediately obvious what is the optimal block size. Before discussing SSDs, let’s quickly address this problem on rotational drives. If we benchmark the number of IOPS for different block sizes on a typical rotation drive we get the following graph:

graph-sda-log

There are two things to note. The first, is that the random distribution makes a big difference, resulting in a 25% speedup between uniform and power distributions. The curves, however, are roughly the same, which means that ignoring caching, the ideal block size isn’t dependent on the distribution. The second, is that the number of IOPS is effectively constant for all blocks before 16KB. This is supported by the assumption that the time it takes to read extra information once the arm is properly positioned is insignificant compared to the seek latency and rotational delays. So, for a rotational drive, I/O read time changes are not a significant factor – we should design the block size completely based on the caching effects. But what about solid state drives?

The first natural thing to do is to benchmark the number of IOPS for different block sizes. A couple of runs of Rebench fed into gnuplot give us the following results:

graph-sdb-log

That’s a very different curve! The first thing that jumps out is that random distributions have almost no effect on the results. But what about block size? Given this curve, it isn’t immediately clear what the ideal block size is. Fortunately, we can easily figure it out with a little math. The depth of the B-tree is logb (N) – this is how many hops we need to make to satisfy a given point query. Let’s perform some back of the envelope calculations for a database of one billion rows. Assuming we can fit a single key into the B-tree node in 32 bytes, we can easily figure out the value of B for each block size. Now, all we need to do is plug in N (we use one billion rows) and B into the formula to figure out how many hops we need to make. We simply divide the number of IOPS for each block size from the experimental data above, and we see how many queries per second we can perform with a given block size. We then pick the block size that lets us perform the maximum number of queries (part of the table removed in the interest of brevity):

1kb (32 keys)
4579 IOPS
2kb (64 keys)
4254 IOPS
4kb (128 keys)
3780 IOPS
8kb (256 keys)
3197 IOPS
16kb (512 keys)
2186 IOPS
32kb (1024 keys)
1769 IOPS
64kb (2048 keys)
1334 IOPS
5.98 hops

765 q./sec
4.98 hops

854 q./sec
4.27 hops

885 q./sec
3.74 hops

854 q./sec
3.32 hops

658 q./sec
2.98 hops

593 q./sec
2.72 hops

490 q./sec

So, if we have no cache the optimal block size is 4KB.

There are a number of other factors we didn’t consider here. The most important one is caching. A complete analysis would account for the size of the block cache and how many hops we can avoid by storing some of the tree in memory (naturally this is affected by the block size). Another important factor is write performance. Because RethinkDB makes no in-place modifications, we can safely ignore write-heavy workloads – a scenario that can radically affect the calculations above for traditional databases. Finally, we ignore page read boundaries – a factor that can give a significant boost to performance on solid-state drives. More on that later.

Of course, we wouldn’t ask our customers to go through these calculations. RethinkDB will perform these tests on target hardware automatically and suggest the optimal page size, so you never have to guess.

Edit: A few people e-mailed us to let us know that there are some assumptions our computations rely on that weren’t mentioned in the post. For example, B-Tree nodes might not always be full, which might significantly impact the ideal block size. I want to note that we did not intend to say that 4KB blocks are an ideal size on SSDs. The size of the database, the size of the cache, the means by which the data is inserted, and the performance of the drive (given your file system and RAID configuration) are all crucial factors. In order to determine the ideal size it’s necessary to test the performance of the particular hardware and to plug it into a more complete model. Alternatively, you can switch to RethinkDB when it’s ready.

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Rebench: cutting through the myths of I/O performance

A very wise systems programmer once told me: “Don’t guess. Measure.” Since then, I’ve learned the hard way that guessing too much about performance is death by a thousand cuts. For RethinkDB, dozens of factors for I/O alone affect performance (not to mention memory, buses, caches, and CPU cores). In order to design the fastest database on Earth, we constantly test the following factors:

  • Performance of read and write operations.
  • Behavior for random and sequential workloads:
    • For random workloads, the behavior of uniform, normal, and power distributions (with different distribution parameters).
    • For sequential workloads, the seek direction and various strides .
  • Performance changes for different block sizes.
  • Type of I/O calls (pread/pwrite vs. read/write vs. aio_read/aio_write vs. mmap).
  • Page cache performance on different workloads compared to direct I/O.
  • Different flushing strategies for write operations.
  • Splitting a given workload across multiple threads, and running multiple different workloads concurrently:
    • For concurrent workloads, different file descriptor sharing strategies.
  • Space utilization of the drive.
  • Different flags (O_APPEND, O_NOATIME, etc.)
  • Different filesystems and mount flags.
  • Performance differences across drives, RAID controllers, and operating systems.

There are a number of existing tools designed to test I/O performance (hdparm, sysbench, IOBench), but none of them gave us the high precision control and number of options we needed. So we wrote our own – Rebench. Rebench is designed to perform precision drilldown tests for different I/O workloads, and combine workloads in order to give an idea of how a system will behave in complex situations. We designed Rebench to be flexible, so every one of the factors we measure can be mixed and matched. With Rebench, if we wonder about a particular aspect of I/O performance, we don’t have to guess – it only takes a couple of seconds to come up with a test that verifies our assumptions.

Here is the default run of Rebench (mode information removed for clarity). /dev/sda is a Western Digital 80GB 7200RPM rotational drive:

$ sudo rebench /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 87 (0.04 MB/sec)

We know that a random, uniform distribution workload with 512 byte block size results in 87 I/O operations per second on our rotational drive. Let’s try sequential reads:

$ sudo rebench -w seq /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 9778 (4.77 MB/sec)

The number of operations per second jumps up to nearly 10,000! What about our solid-state drive? /dev/sdb is a 16GB SUPER TALENT MasterDrive OCX (MLC).

$ sudo rebench -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4682 (2.29 MB/sec)

So it doesn’t perform as well as the rotational drive on sequential read access. How about random reads?

$ sudo rebench /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4923 (2.40 MB/sec)

Ah! We blow the rotational drive out of the water at a factor of 50 improvement. And finally, how does the solid-state drive perform for random writes?

$ sudo rebench -o write /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 16 (0.01 MB/sec)

Not well, at only 16 random write operations per second! How about sequential writes?

$ sudo rebench -o write -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 6576 (3.21 MB/sec)

Basically the same as reads, which means the SSD translation layer for random writes on this drive needs some work.

Finally, if you don’t pass any flags to Rebench on the command line, it accepts them on standard input and treats each line as a separate workload to be run concurrently:

$ sudo rebench
/dev/sdb
/dev/sda
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4636 (2.26 MB/sec)
---
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 85 (0.04 MB/sec)

Rebench is work in progress – we combined dozens of smaller programs we wrote into a unified tool just a few days ago. There’s some spaghetti code involved, and probably some lurking bugs, but in the meantime it gets the job done. You can get it at GitHub:

git clone git://github.com/coffeemug/rebench.git

or download the source directly:

http://github.com/coffeemug/rebench/tarball/master

To build Rebench, simply run make. You may need to install GNU Scientific Library, if you don’t have it already.

Rebench is released under the GPL license, so we welcome improvements, bug fixes, and ports to other operating systems. Last but not least, we welcome hardware donations. Happy benchmarking!

RethinkDB performance data.

It’s been a busy and exciting week since we announced RethinkDB. Of all the feedback we received, the most common request was for performance numbers. Before the launch our top priority was correctness. We spent most of our time testing RethinkDB with Wordpress and adding the missing features. As a result, performance suffered. In the past week we tuned the engine back up to high performance. We’re still far from finished with the improvements we want to make, but we feel that we’ve reached a level of performance we can be proud to display.

We wrote our original benchmarking tool in Python, but during our latest benchmarks, we noticed that it was taking about as much time as the engine itself, hiding our real performance numbers. We now have a very small Objective-C program (<900 lines) that uses prepared statements in a tight loop, and times only across the mysql_stmt_execute() call.  For inserts, the benchmark creates a table with three INT columns, two being indexed, and performs N random (non-duplicate) INSERTs (k,k,k) in a loop. For selects, it performs N random indexed point queries. An optional number of SELECT threads run as well, each thread doing repeated indexed point queries throughout execution of the main timed thread.

The benchmarks were run on a 2.5 GHz Pentium Core 2 Duo machine with 2 GB RAM, on a 16 GB SUPER TALENT MasterDrive, an MLC solid state drive, connected via a 3 GB SATA II bus. RethinkDB and MyISAM were run with the stock config options.  We ran the InnoDB test by starting the server with --innodb_flush_log_at_trx_commit=0 --innodb_support_xa=0 --innodb_buffer_pool_size=1536M.

Here are the results:

Insert benchmark with no readers.

For insert performance, RethinkDB maintains a 10x improvement in throughput over MyISAM, with an average of 24534.597 rows/sec up to 2,000,000 rows, while InnoDB handles 8527.424 rows/sec, and MyISAM manages only 2483.277 rows/sec. With more frequent measurements, we can see that InnoDB and MyISAM maintain generally high throughput, but pause periodically for long stretches of time. We believe that this is due to their B-tree structure, which need to expand once in a while, a time-consuming operation that greatly undermines their overall performance.

The threaded benchmark is a bit different:

Multi-threaded insert benchmark.

We’ve also benchmarked selects with no writers:

Single-threaded select graph.

RethinkDB’s select performance is on par with MyISAM and InnoDB for threaded and non-threaded benchmarks. The performance bottleneck for short selects is in the network stack, and while we have plans to tackle this problem, we won’t get to it for a while. However, our algorithms significantly improve RethinkDB performance on long selects and joins — we will write a blog post soon with more detailed results.

As always, comments and concerns are welcome, on our blog, twitter feed, or at info@rethinkdb.com.