<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RethinkDB - Musings on Database Technologies</title>
	<atom:link href="http://www.rethinkdb.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rethinkdb.com/blog</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Fri, 05 Mar 2010 04:32:13 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>High scalability: SQL and computational complexity</title>
		<link>http://www.rethinkdb.com/blog/2010/03/high-scalability-sql-and-computational-complexity/</link>
		<comments>http://www.rethinkdb.com/blog/2010/03/high-scalability-sql-and-computational-complexity/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 04:12:13 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Database Theory]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=300</guid>
		<description><![CDATA[Interested in working at RethinkDB? We&#8217;re hiring &#8211; please see our jobs page for more details.
Recently there has been a lot of discussion on fundamental scalability of traditional relational database systems. Many of the blog posts on this topic give a great overview of some of the immediate issues faced by engineers while scaling relational [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;"><em>Interested in working at RethinkDB? We&#8217;re hiring &#8211; please see our </em><a href="http://www.rethinkdb.com/jobs"><em>jobs</em></a><em> page for more details.</em></p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Recently there has been <a href="http://www.yafla.com/dforbes/Getting_Real_about_NoSQL_and_the_SQL_Isnt_Scalable_Lie/">a lot</a> <a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext">of</a> <a href="http://blogs.computerworld.com/15510/the_end_of_sql_and_relational_databases_part_1_of_3">discussion</a> on the fundamental scalability of traditional relational database systems. Many of the blog posts on this topic give a great overview of some of the immediate issues engineers face while scaling relational databases, but they don’t dissect the problem systematically or in enough depth to get to the core issues. I’d like to dedicate a series of blog posts to the problem of scalability and how it pertains to relational databases.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">There are a number of aspects of an RDBMS that are relevant to high scalability:</p>
<ul style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; font-family: sans-serif; font-size: small; list-style-type: none; padding: 0px;">
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">Relational data model</li>
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">Constraint enforcement (including referential integrity)</li>
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">SQL and its operational semantics</li>
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">ACID compliance</li>
</ul>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">In addition, every aspect above must be discussed in the context of the following attributes:</p>
<ul style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; font-family: sans-serif; font-size: small; list-style-type: none; padding: 0px;">
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">A specific usage pattern</li>
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">The implementation (a concrete system vs. a theoretically possible ideal)</li>
<li style="margin-top: 0.1em; margin-right: 20px; margin-bottom: 0.1em; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 5px; font-family: sans-serif; font-size: small; list-style-type: disc;">Hardware platform (few expensive machines vs. many cheap machines)</li>
</ul>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">In this series I will deal exclusively with OLTP (realtime) use cases. I will not discuss various forms of analytics, data mining problems, etc. For this post, let’s define realtime as O(k * log N), where k is some small constant that represents a well-defined number of queries, and N is the total size of the data (in rows). In other words, every operation of a realtime database must completely evaluate in logarithmic time relative to the full size of the dataset, and the number of such operations required to make a logically complete business transaction is small and independent of the size of the dataset. I chose this definition because it seems intuitively interesting – other functions that we see in practice are unlikely to satisfy the realtime demands of real-world systems. In future blog posts I’ll restrict this definition even further to account for constant factors, but for now reasoning in terms of complexity theory is sufficient.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">There are two aspects of high scalability I&#8217;d like to cover &#8211; specific issues and problems relevant today, and what we can expect from data management systems in the future. In order to cover these aspects, I will focus both on concrete systems commonly used in production and on theoretical ideals. For completeness, I will also discuss most issues in terms of horizontal and vertical scaling. Every time I talk about an RDBMS, I will define these contexts explicitly.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Since I’m a big fan of theory of computation and programming language theory, I’ll kick off the series with a discussion on scalability of SQL from the perspective of theory of computation (here I use the acronym ‘SQL’ in its strictest sense and mean the actual query language as defined by the ANSI standard and implemented by most vendors).</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">From a purely theoretical, computational perspective, ANSI SQL-92 is equivalent to a primitive recursive language. Most real-world SQL implementations are even more expressive, and are Turing-complete. As far as vertical scalability is concerned, SQL is simply too expressive. Even if we restrict ourselves to SQL-92, it is possible to write queries of polynomial, or even exponential complexity – a far cry from the logarithmic requirement we established earlier. This means that according to our definition of realtime, SQL is fundamentally not a vertically scalable language.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">What about horizontal scalability? Reasoning about it is more difficult because it involves somewhat esoteric computational classes, and requires additional assumptions. To simplify the reasoning, we make one key assumption. We assume that we can only have a polynomial number of machines (and cores) – this appears to hold true because the number of machines we can manufacture is dwarfed by the astronomical amounts of information we consume. If this holds true, even if we can trivially parallelize each query, an exponential function (the amount of information) divided by a polynomial (number of machines) still dominates the logarithmic function we defined earlier as acceptable. This means that given modern trends, if a given query isn’t scalable vertically, it also isn’t scalable horizontally, which makes SQL fundamentally unscalable, period.</p>
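<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">To make the growth argument concrete, here is a rough numerical sketch. The specific growth rates are assumptions chosen purely for illustration (data doubling at every step <code>t</code>, machines growing as <code>t</code> cubed), and the function names are hypothetical. It compares the work each machine must do for a perfectly parallelized linear-time query against the logarithmic budget defined earlier.</p>

```python
import math


def work_per_machine(t, degree=3):
    """Rows each machine must touch for a linear-time query at step t."""
    rows = 2 ** t            # assumption: the dataset doubles at every step
    machines = t ** degree   # assumption: machine count grows polynomially
    return rows / machines


def realtime_budget(t):
    """Logarithmic budget from the realtime definition: log2 of the row count."""
    return math.log2(2 ** t)


# Print step, per-machine work, and the logarithmic budget side by side.
for t in (10, 20, 40):
    print(t, work_per_machine(t), realtime_budget(t))
```

<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Under these assumptions the per-machine work starts below the budget but overtakes it quickly and keeps pulling away, which is the behavior the argument above relies on.</p>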
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Of course, so far we’ve shown what we already know – that it is possible to write SQL queries that will likely never be fast enough to evaluate in practice. At first glance this doesn’t appear very useful – all we have to do is avoid writing such queries and use the subset of SQL that can be evaluated in logarithmic time. Unfortunately, from a theoretical (and far too often practical) perspective, doing this is impossible.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">The culprit is SQL’s lack of operational semantics. Even a simple point query can (and often does) run in O(1) time for hash indexes, O(log N) for tree indexes, or O(N) for a linear scan. For more complicated queries, there are too many edge cases where the optimizer might magically switch from logarithmic to linear execution on a whim, despite having an index available. In practice, these changes result in expensive downtime, and hours of debugging and rearchitecting. For massively scalable realtime systems, this is SQL’s Achilles’ heel – you can’t use a subset of SQL that runs in logarithmic time – some of the time (in practice, far more often than you’d like), you’ll end up writing queries that don’t satisfy your requirements. If we do settle on a declarative query language (and I believe that anything else is a huge step backward) for massively scalable systems, it must have the property that any query you could express in this language is guaranteed to evaluate in logarithmic time.</p>
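<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">The gap between the three execution plans for a point query is easy to underestimate. The following is a simplified cost model, not a description of any real optimizer: the plan names and the one-step-per-probe accounting are assumptions made for illustration.</p>

```python
import math


def point_query_steps(n_rows, plan):
    """Approximate number of probes for one point query under a given plan."""
    if plan == "hash_index":
        return 1                               # O(1): one bucket probe on average
    if plan == "tree_index":
        return math.ceil(math.log2(n_rows))    # O(log N): one probe per tree level
    if plan == "full_scan":
        return n_rows                          # O(N): every row is examined
    raise ValueError("unknown plan: " + plan)


for plan in ("hash_index", "tree_index", "full_scan"):
    print(plan, point_query_steps(10 ** 9, plan))
```

<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">On a billion rows, the same query text costs either a handful of probes or a billion of them, depending on a choice the query author does not control. A query language with the logarithmic guarantee described above would, in this model, only ever admit the first two plans.</p>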
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Of course such a language has a significantly limited purpose. It cannot be used for most analytics problems, and more importantly for realtime systems, cannot be used for realtime problems which involve polynomial islands of data in the exponential universe. Facebook may some day have billions of users, but any given user is unlikely to have more than a thousand friends. In this scenario there are realtime subproblems where linear and loglinear queries are perfectly acceptable, and our language can’t handle them. This means that for these subproblems, one must use a different system, which may or may not be an acceptable solution. Perhaps a better solution is to design a database system that only allows provably logarithmic queries to run against massive datasets, but relaxes the requirement for smaller subsets of data.</p>
<p style="margin-top: 10px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; font-family: sans-serif; font-size: small; line-height: 1.4em; padding: 0px;">Unfortunately it would be extremely difficult to modify SQL to satisfy this behavior because it isn’t modular – almost all additions are special forms, and it is so far removed from any theoretical model (including relational algebra), that reasoning about it in a rigorous way is extremely difficult for both humans and compilers. My prediction is that systems of the future will use a modular, verifiable, higher order query language capable of enforcing various complexity requirements at compile time, and that it will not look very much like SQL.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2010/03/high-scalability-sql-and-computational-complexity/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Help make RethinkDB great</title>
		<link>http://www.rethinkdb.com/blog/2009/10/help-make-rethinkdb-great/</link>
		<comments>http://www.rethinkdb.com/blog/2009/10/help-make-rethinkdb-great/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 11:48:37 +0000</pubDate>
		<dc:creator>Michael</dc:creator>
				<category><![CDATA[Announcements]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=277</guid>
		<description><![CDATA[We know all too well that everyone has problems with their databases. We want to hear about your database problems, and if you tell us about them, you could win a RethinkDB sweatshirt.
We&#8217;re working hard to make RethinkDB as fast, scalable, and flexible as possible. We don&#8217;t do this with magic, but by designing RethinkDB [...]]]></description>
			<content:encoded><![CDATA[
<p><a href="http://www.rethinkdb.com/survey/"><img class="alignnone" src="http://www.rethinkdb.com/images/survey/survey-header.png" alt="" width="557" height="221" /></a></p>
<p>We know all too well that most companies face database scalability and administration problems. We&#8217;re working hard to make RethinkDB as fast, scalable, and flexible as possible. We use a little bit of magic and a lot of engineering for today&#8217;s hardware and access patterns. You can help by telling us what <em>your</em> infrastructure and workloads look like. We want you to tell us about your database problems, and in exchange we&#8217;ll send a RethinkDB sweatshirt to the best responses.</p>
<p>You can find the survey at <a href="http://www.rethinkdb.com/survey/">http://www.rethinkdb.com/survey/</a>. If you&#8217;d like us to follow up with you, make sure to check off the last checkbox. If you&#8217;re in the Bay Area, we&#8217;ll be happy to buy you lunch for your troubles.</p>
<p>Scalability demands, rising hardware costs, administration struggles, and performance concerns are all fair game. We&#8217;re here for you &#8211; tell us about your database woes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/10/help-make-rethinkdb-great/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More on alignment, ext2, and partitioning on SSDs</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/</link>
		<comments>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 10:20:22 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Benchmarks]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247</guid>
		<description><![CDATA[In our previous post we touched on alignment issues on solid-state drives. Our test read different-sized blocks from various random points on a raw device, aligned to a particular boundary. Today we&#8217;d like to expand on that work, and discuss how other factors affect SSD read performance. In addition to testing different block sizes and [...]]]></description>
			<content:encoded><![CDATA[<p>In our previous <a href="http://www.rethinkdb.com/blog/2009/10/page-alignment-on-ssds/">post</a> we touched on alignment issues on solid-state drives. Our test read different-sized blocks from various random points on a raw device, aligned to a particular boundary. Today we&#8217;d like to expand on that work, and discuss how other factors affect SSD read performance. In addition to testing different block sizes and alignment boundaries, we tested two other factors: how the drive is partitioned, and what filesystem is used.</p>
<p>We decided to test different partitioning schemes because they can profoundly affect alignment. By default, today&#8217;s partitioning tools use 63 sectors per track. Each sector is 512B, so a track contains 32256B. Unfortunately this value is not 4K aligned (32256 is not divisible by 4096). Since the first partition starts on the second track, the default partition is not 4K aligned. We wanted to test whether this affects performance. We used three partitioning schemes: no partition (reading from a raw block device), the default partitioning scheme used by fdisk (not 4K aligned), and a 4K aligned partitioning scheme (we tell fdisk to start on sector 128 instead).</p>
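<p>The alignment arithmetic is simple enough to check in a few lines. This is only a sketch of the calculation described above; the constant and function names are ours and are not part of fdisk or any other tool.</p>

```python
SECTOR_BYTES = 512
PAGE_BYTES = 4096


def partition_start_bytes(start_sector):
    """Byte offset on the device where a partition begins."""
    return start_sector * SECTOR_BYTES


def is_4k_aligned(start_sector):
    """True when the partition begins on a 4096-byte boundary."""
    return partition_start_bytes(start_sector) % PAGE_BYTES == 0


for label, sector in (("default layout", 63), ("aligned layout", 128)):
    print(label, partition_start_bytes(sector), is_4k_aligned(sector))
```

<p>Starting at sector 63 puts the partition at byte 32256, which leaves a remainder when divided by 4096, while starting at sector 128 puts it at byte 65536, an exact multiple.</p>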
<p>In addition to testing partitioning schemes, we wanted to test how adding a filesystem on top of the device affects performance. We tested reading from the device (or partition) directly, vs. reading a 1GB file from ext2 (created with standard options).</p>
<p>For each of these configurations we ran random reads for block sizes from 512B to 4096B (in 512B increments), aligned to boundaries from 512B to 4096B (also in 512B increments). That&#8217;s 3 * 2 * 8 * 8 = 384 different combinations, so it&#8217;s not immediately clear how to visualize the data. The first thing we did was plot six different graphs of block size vs. alignment boundary (one graph for each partitioning and file system combination). We hoped that it would let us pick out some interesting trends:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-249" title="multiplot-2d" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/multiplot-2d.png" alt="multiplot-2d" width="512" height="384" /></p>
<p>On these graphs the red line represents a 512B block size, the blue line represents a 4096B block size, and the other colors represent block sizes in between. The x-axis is the alignment boundary, and the y-axis is performance.</p>
<p>Glancing at these graphs we can see some clear trends.</p>
<ul>
<li>The red line is always highest (except for a couple of small anomalies), which means reading 512B chunks is always fastest on every setup.</li>
<li>The graphs for runs on unpartitioned devices and the graphs for runs on aligned partitions are roughly the same.</li>
<li>Default partition graphs look inverted from their counterparts.</li>
</ul>
<p>From this, we can reach two important conclusions:</p>
<ul>
<li>For drilldown visualizations, we don&#8217;t need to worry about the block size since curves that represent different block sizes look the same. We&#8217;ll focus on 512B block sizes.</li>
<li>Graph inversion for unaligned partitions is shifted by 512B, which makes perfect sense: when we add an extra 512 to 32256, we get to a 4KB boundary on the drive ((63*512 + 512) / 4096 = 8).</li>
</ul>
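<p>The 512B shift in the last point can be verified directly. The sketch below assumes the default partition begins at byte 63 * 512, as described earlier, and looks for the smallest 512B-multiple offset inside the partition that lands on a 4KB boundary of the underlying drive. The names are illustrative only.</p>

```python
PARTITION_START = 63 * 512   # default partition start, in bytes
PAGE_BYTES = 4096


def device_offset(partition_offset):
    """Translate an offset inside the partition to an offset on the drive."""
    return PARTITION_START + partition_offset


# Walk the 512B-multiple offsets within one page and keep the first one
# that falls on a 4KB boundary of the drive itself.
first_aligned = next(
    offset
    for offset in range(0, PAGE_BYTES, 512)
    if device_offset(offset) % PAGE_BYTES == 0
)
print(first_aligned)  # 512
```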
<p>Let&#8217;s take a closer look at the drilled down visualization:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-255" title="graph" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph.png" alt="graph" width="512" height="384" /></p>
<p>From this graph we can deduce a few interesting things:</p>
<ul>
<li>Partition misalignment can cause a 15% drop in performance.</li>
<li>Reading from the raw device with no file system is occasionally a little faster than reading from an aligned partition with no file system. These are the two fastest modes of operation.</li>
<li>Reading from ext2 causes a 2% drop in performance compared to reading from the raw device.</li>
</ul>
<p><em>Interested in working at RethinkDB? We&#8217;re hiring &#8211; please see our <a href="http://www.rethinkdb.com/jobs">jobs</a> page for more details.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Page alignment on SSDs</title>
		<link>http://www.rethinkdb.com/blog/2009/10/page-alignment-on-ssds/</link>
		<comments>http://www.rethinkdb.com/blog/2009/10/page-alignment-on-ssds/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 11:44:16 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Benchmarks]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=223</guid>
		<description><![CDATA[In our previous post we discussed the optimal block-size for B-trees on solid-state drives. A few people mentioned page alignment &#8211; an issue that can cause serious performance hits on SSDs if unaccounted for. It&#8217;s a complex topic, and we will dedicate two posts to its discussion. In this post we&#8217;ll address alignment behavior while [...]]]></description>
			<content:encoded><![CDATA[<p>In our previous <a href="http://www.rethinkdb.com/blog/2009/10/rethinking-b-tree-block-sizes-on-ssds/">post</a> we discussed the optimal block-size for B-trees on solid-state drives. A few people mentioned page alignment &#8211; an issue that can cause serious performance hits on SSDs if unaccounted for. It&#8217;s a complex topic, and we will dedicate two posts to its discussion. In this post we&#8217;ll address alignment behavior while reading directly from the block device. In the next post, we&#8217;ll talk about partitioning the drive, and the effects of reading from the filesystem instead of reading from the device directly.</p>
<p>For this test we ran <a href="http://www.rethinkdb.com/blog/2009/10/rebench-cutting-through-the-myths-of-io-performance/">Rebench</a> in random read mode, with block sizes ranging from 512B to 4KB in 512B increments. We also set the <span style="font-family: 'Courier New';">stride</span> parameter to values ranging from 512B to 4KB, also in 512B increments. In random read mode, the <span style="font-family: 'Courier New';">stride</span> parameter simply aligns random offsets to the given boundary. This lets us test how different combinations of block sizes and alignment values affect performance. Here are the results for the 16GB SUPER TALENT MasterDrive OCX (MLC):</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-225" title="graph-sdb-3d" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph-sdb-3d.png" alt="graph-sdb-3d" width="512" height="384" /></p>
<p>At a glance, the mesh on top shows that performance spikes whenever the alignment is a power of two. The heatmap shows that performance quickly drops off for larger blocks, and that the best performing workload reads 512B blocks from 4KB-aligned offsets. An open question remains: if we align our blocks at 4KB boundaries and can read the first 512B chunk very quickly, how can we read the rest of the chunks without performance loss? We know from previous testing on our rotational drive that reading larger blocks did not result in a performance drop-off, which means the problem isn&#8217;t likely to be in the kernel configuration or the data channel. Perhaps it&#8217;s a problem with the drive&#8217;s firmware, or the driver, or perhaps it&#8217;s an inherent limitation of the drive. We&#8217;ll be posting results on the Intel X-25M G2 MLC and X-25E SLC drives soon; we&#8217;re looking forward to comparing the results.</p>
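<p>For readers who want to reproduce this kind of test, here is one way the offset alignment described earlier could be implemented. This is a hypothetical sketch and not Rebench&#8217;s actual code: the function name, its parameters, and the 16 * 2**30 device size are all assumptions.</p>

```python
import random


def aligned_random_offset(device_bytes, block_bytes, stride_bytes, rng=random):
    """Pick a random read offset that is a multiple of stride_bytes.

    The offset is chosen so that a read of block_bytes starting there
    still fits inside the device.
    """
    # Count the stride-aligned positions that leave room for one full block.
    slots = (device_bytes - block_bytes) // stride_bytes + 1
    return rng.randrange(slots) * stride_bytes


DEVICE_BYTES = 16 * 2 ** 30   # assumed capacity, for illustration only
offset = aligned_random_offset(DEVICE_BYTES, 512, 4096)
print(offset, offset % 4096)
```

<p>Each call returns the start of one read; varying the block size and the stride independently gives the grid of combinations plotted above.</p>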
<p>Stay tuned for information on how the block size and alignment behaves with different partitioning and file system schemes. In the meantime, if you&#8217;d like more precise information on how the drive behaves, here&#8217;s a 2D visualization:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-230" title="graph-sdb-2d" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph-sdb-2d.png" alt="graph-sdb-2d" width="512" height="384" /></p>
<p><em>Interested in working at RethinkDB? We&#8217;re hiring &#8211; please see our <a href="http://www.rethinkdb.com/jobs">jobs</a> page for more details.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/10/page-alignment-on-ssds/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Rethinking B-tree block sizes on SSDs</title>
		<link>http://www.rethinkdb.com/blog/2009/10/rethinking-b-tree-block-sizes-on-ssds/</link>
		<comments>http://www.rethinkdb.com/blog/2009/10/rethinking-b-tree-block-sizes-on-ssds/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 23:04:34 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Benchmarks]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=153</guid>
		<description><![CDATA[One of the first questions to answer when running databases on SSDs is what B-tree block size to use. There are a number of factors that affect this decision:

The type of workload
I/O time to read and write the block size
The size of the cache

That’s a lot of variables to consider. For this blog post we [...]]]></description>
			<content:encoded><![CDATA[<p>One of the first questions to answer when running databases on SSDs is what B-tree block size to use. There are a number of factors that affect this decision:</p>
<ul>
<li>The type of workload</li>
<li>I/O time to read and write the block size</li>
<li>The size of the cache</li>
</ul>
<p>That’s a lot of variables to consider. For this blog post we assume a fairly common OLTP scenario – a database that’s dominated by random point queries. We will also sidestep some of the more subtle caching effects by treating the caching algorithm as perfectly optimal, and assuming the cost of lookup in RAM is insignificant.</p>
<p>Even with these restrictions it isn’t immediately obvious what the optimal block size is. Before discussing SSDs, let’s quickly address this problem on rotational drives. If we benchmark the number of IOPS for different block sizes on a typical rotational drive, we get the following graph:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-167" title="graph-sda-log" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph-sda-log.png" alt="graph-sda-log" width="512" height="384" /></p>
<p>There are two things to note. The first is that the random distribution makes a big difference, resulting in a 25% speedup between uniform and power distributions. The curves, however, are roughly the same, which means that, ignoring caching, the ideal block size doesn’t depend on the distribution. The second is that the number of IOPS is effectively constant for all block sizes below 16KB. This is consistent with the assumption that the time it takes to read extra information once the arm is properly positioned is insignificant compared to the seek latency and rotational delays. So, for a rotational drive, I/O read time changes are not a significant factor – we should choose the block size based entirely on caching effects. But what about solid-state drives?</p>
<p>The first natural thing to do is to benchmark the number of IOPS for different block sizes. A couple of runs of <a href="http://www.rethinkdb.com/blog/2009/10/rebench-cutting-through-the-myths-of-io-performance/">Rebench</a> fed into gnuplot give us the following results:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-166" title="graph-sdb-log" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/10/graph-sdb-log.png" alt="graph-sdb-log" width="512" height="384" /></p>
<p>That’s a very different curve! The first thing that jumps out is that the random distribution has almost no effect on the results. But what about block size? Given this curve, it isn’t immediately clear what the ideal block size is. Fortunately, we can easily figure it out with a little math. The depth of the B-tree is log<sub>B</sub>(N), where B is the number of keys per block and N is the number of rows &#8211; this is how many hops we need to make to satisfy a given point query. Let’s perform some back-of-the-envelope calculations for a database of one billion rows. Assuming a single key takes 32 bytes in a B-tree node, we can easily figure out the value of B for each block size. Now, all we need to do is plug N (one billion rows) and B into the formula to figure out how many hops we need to make. We then divide the number of IOPS measured for each block size in the experiment above by the number of hops to see how many queries per second we can perform with a given block size. Finally, we pick the block size that lets us perform the maximum number of queries (part of the table removed in the interest of brevity):</p>
<div style='width: 100%;'>
<table style='border-collapse: collapse; text-align: center; margin-left: auto; margin-right: auto;'>
<tbody>
<tr>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>1KB (32 keys)<br />
4579 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>2KB (64 keys)<br />
4254 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>4KB (128 keys)<br />
3780 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>8KB (256 keys)<br />
3197 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>16KB (512 keys)<br />
2186 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>32KB (1024 keys)<br />
1769 IOPS</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>64KB (2048 keys)<br />
1334 IOPS</td>
</tr>
<tr>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>5.98 hops<br />
<hr/>
765 q./sec</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>4.98 hops<br />
<hr/>
854 q./sec</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>4.27 hops<br />
<hr/>
<strong>885 q./sec</strong></td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>3.74 hops<br />
<hr/>
854 q./sec</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>3.32 hops<br />
<hr/>
658 q./sec</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>2.98 hops<br />
<hr/>
593 q./sec</td>
<td style='border: solid 1px black; padding: 2px; font-size: x-small;'>2.72 hops<br />
<hr/>
490 q./sec</td>
</tr>
</tbody>
</table>
</div>
<p>So, if we have no cache, the optimal block size for this drive and database size is 4KB.</p>
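<p>The calculation behind the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming full B-tree nodes and 32-byte keys as stated above; the IOPS figures are copied from the table, and the function names are ours. Because it uses unrounded hop counts, some results may differ from the table by a query or two per second.</p>

```python
import math

N_ROWS = 1_000_000_000   # one billion rows, as in the post
KEY_BYTES = 32           # assumed bytes per key in a B-tree node

# Measured IOPS per block size in bytes, copied from the table above.
IOPS_BY_BLOCK = {
    1024: 4579,
    2048: 4254,
    4096: 3780,
    8192: 3197,
    16384: 2186,
    32768: 1769,
    65536: 1334,
}

def hops(block_bytes, n_rows=N_ROWS, key_bytes=KEY_BYTES):
    """Depth of a B-tree with full nodes: log_B(N), with B keys per block."""
    keys_per_block = block_bytes // key_bytes
    return math.log(n_rows) / math.log(keys_per_block)

def queries_per_sec(block_bytes):
    """Point queries per second with no cache: IOPS divided by hops."""
    return IOPS_BY_BLOCK[block_bytes] / hops(block_bytes)

if __name__ == "__main__":
    for size in IOPS_BY_BLOCK:
        print(f"{size // 1024:2d}KB  {hops(size):.2f} hops  "
              f"{queries_per_sec(size):6.1f} q/sec")
    # Pick the block size that maximizes queries per second.
    best = max(IOPS_BY_BLOCK, key=queries_per_sec)
    print(f"best block size: {best // 1024}KB")
```

<p>Swapping in IOPS numbers measured on a different drive, or a different row count, is enough to redo the comparison for another setup.</p>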
<p>There are a number of other factors we didn’t consider here. The most important one is caching. A complete analysis would account for the size of the block cache and how many hops we can avoid by storing some of the tree in memory (naturally this is affected by the block size). Another important factor is write performance. Because RethinkDB makes no in-place modifications, we can safely ignore write-heavy workloads – a scenario that can radically affect the calculations above for traditional databases. Finally, we ignore page read boundaries – a factor that can give a significant boost to performance on solid-state drives. More on that later.</p>
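<p>To give a feel for how caching could enter the model, here is a rough sketch. It is not the complete analysis mentioned above: it assumes full nodes, 32-byte keys, and a cache that holds only whole top levels of the tree, and the cache size in the example is a hypothetical value. All function names are ours.</p>

```python
import math

def tree_depth(n_rows, keys_per_block):
    """Depth of a B-tree with full nodes: log_B(N)."""
    return math.log(n_rows) / math.log(keys_per_block)

def cached_levels(cache_blocks, keys_per_block):
    """Number of whole top levels of the tree that fit in the cache.

    Level i (the root is level 0) holds about keys_per_block ** i nodes.
    """
    levels, used, level_nodes = 0, 0, 1
    while used + level_nodes <= cache_blocks:
        used += level_nodes
        levels += 1
        level_nodes *= keys_per_block
    return levels

def disk_hops(n_rows, block_bytes, cache_bytes, key_bytes=32):
    """Hops that still reach the disk once cached top levels are skipped."""
    keys_per_block = block_bytes // key_bytes
    depth = tree_depth(n_rows, keys_per_block)
    levels = cached_levels(cache_bytes // block_bytes, keys_per_block)
    return max(0.0, depth - levels)

if __name__ == "__main__":
    cache = 2**30  # hypothetical 1 GiB block cache
    for block in (1024, 4096, 16384, 65536):
        print(f"{block // 1024:2d}KB blocks: "
              f"{disk_hops(10**9, block, cache):.2f} disk hops")
```

<p>Dividing the measured IOPS by these reduced hop counts, instead of the full depth, shows how the preferred block size can shift once part of the tree lives in memory.</p>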
<p>Of course, we wouldn’t ask our customers to go through these calculations. RethinkDB will perform these tests on the target hardware automatically and suggest the optimal page size, so you never have to guess.</p>
<p><em>Edit:</em> A few people e-mailed us to let us know that there are some assumptions our computations rely on that weren&#8217;t mentioned in the post. For example, B-Tree nodes might not always be full, which might significantly impact the ideal block size. I want to note that we did <em>not</em> intend to say that 4KB blocks are an ideal size on SSDs. The size of the database, the size of the cache, the means by which the data is inserted, and the performance of the drive (given your file system and RAID configuration) are all crucial factors. In order to determine the ideal size it&#8217;s necessary to test the performance of the particular hardware and to plug it into a more complete model. Alternatively, you can switch to RethinkDB when it&#8217;s ready.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/10/rethinking-b-tree-block-sizes-on-ssds/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Rebench: cutting through the myths of I/O performance</title>
		<link>http://www.rethinkdb.com/blog/2009/10/rebench-cutting-through-the-myths-of-io-performance/</link>
		<comments>http://www.rethinkdb.com/blog/2009/10/rebench-cutting-through-the-myths-of-io-performance/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 07:22:40 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Benchmarks]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=81</guid>
		<description><![CDATA[A very wise systems programmer once told me: &#8220;Don&#8217;t guess. Measure.&#8221; Since then, I&#8217;ve learned the hard way that guessing too much about performance is death by a thousand cuts. For RethinkDB, dozens of factors for I/O alone affect performance (not to mention memory, buses, caches, and CPU cores). In order to design the fastest [...]]]></description>
			<content:encoded><![CDATA[<p>A very wise systems programmer once told me: &#8220;Don&#8217;t guess. Measure.&#8221; Since then, I&#8217;ve learned the hard way that guessing too much about performance is death by a thousand cuts. For RethinkDB, dozens of factors for I/O alone affect performance (not to mention memory, buses, caches, and CPU cores). In order to design the fastest database on Earth, we constantly test the following factors:</p>
<ul>
<li>Performance of <strong>read</strong> and <strong>write</strong> operations.</li>
<li>Behavior for <strong>random</strong> and <strong>sequential</strong> workloads:
<ul>
<li>For random workloads, the behavior of <strong>uniform</strong>, <strong>normal</strong>, and <strong>power</strong> distributions (with different distribution parameters).</li>
<li>For sequential workloads, the seek <strong>direction</strong> and various <strong>strides</strong>.</li>
</ul>
</li>
<li>Performance changes for different <strong>block sizes</strong>.</li>
<li>Type of I/O calls (<strong>pread</strong>/<strong>pwrite</strong> vs. <strong>read</strong>/<strong>write</strong> vs. <strong>aio_read</strong>/<strong>aio_write</strong> vs. <strong>mmap</strong>).</li>
<li><strong>Page cache</strong> performance on different workloads compared to <strong>direct I/O</strong>.</li>
<li>Different <strong>flushing</strong> strategies for write operations.</li>
<li>Splitting a given workload across <strong>multiple threads</strong>, and running multiple different workloads <strong>concurrently</strong>:
<ul>
<li>For concurrent workloads, different file descriptor <strong>sharing</strong> strategies.</li>
</ul>
</li>
<li><strong>Space utilization</strong> of the drive.</li>
<li>Different file open flags (<strong>O_APPEND</strong>, <strong>O_NOATIME</strong>, etc.).</li>
<li>Different <strong>filesystems</strong> and <strong>mount flags</strong>.</li>
<li>Performance differences across <strong>drives</strong>, <strong>RAID controllers</strong>, and <strong>operating systems</strong>.</li>
</ul>
<p>There are a number of existing tools designed to test I/O performance (<strong>hdparm</strong>, <strong>sysbench</strong>, <strong>IOBench</strong>), but none of them gave us the high-precision control and the number of options we needed. So we wrote our own &#8211; <strong>Rebench</strong>. Rebench is designed to perform precise drill-down tests for different I/O workloads, and to combine workloads in order to give an idea of how a system will behave in complex situations. We designed Rebench to be flexible, so every one of the factors we measure can be mixed and matched. With Rebench, if we wonder about a particular aspect of I/O performance, we don&#8217;t have to guess &#8211; it only takes a couple of seconds to come up with a test that verifies our assumptions.</p>
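<p>To make the idea concrete, here is a minimal sketch of one such measurement: uniformly random, block-aligned reads issued with <code>pread</code>. This is not Rebench's implementation; the function name and parameters are ours. It works on a regular file (block devices report their size differently), and since it does not bypass the page cache, the number it prints for a small temporary file says nothing about the drive itself.</p>

```python
import os
import random
import tempfile
import time

def random_read_ops_per_sec(path, block_size=512, duration=1.0, seed=0):
    """Issue uniformly random, block-aligned pread() calls; return ops/sec."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        blocks = os.fstat(fd).st_size // block_size
        if blocks == 0:
            raise ValueError("file is smaller than one block")
        ops = 0
        start = time.perf_counter()
        deadline = start + duration
        while time.perf_counter() < deadline:
            # Uniform distribution over block-aligned offsets.
            offset = rng.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)
            ops += 1
        return ops / (time.perf_counter() - start)
    finally:
        os.close(fd)

if __name__ == "__main__":
    # Demo on a 1 MiB scratch file; reads will be served from the page cache.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(2**20))
        scratch = f.name
    try:
        print(f"{random_read_ops_per_sec(scratch):.0f} ops/sec")
    finally:
        os.remove(scratch)
```

<p>Changing the offset generator (sequential, strided, or drawn from another distribution) or the block size turns this into a different workload, which is the kind of mixing and matching described above.</p>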
<p>Here is the default run of Rebench (mode information removed for clarity). /dev/sda is a Western Digital 80GB 7200RPM rotational drive:</p>
<pre>$ sudo rebench /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 87 (0.04 MB/sec)</pre>
<p>We know that a random, uniform distribution workload with 512 byte block size results in 87 I/O operations per second on our rotational drive. Let&#8217;s try sequential reads:</p>
<pre>$ sudo rebench -w seq /dev/sda
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 9778 (4.77 MB/sec)</pre>
<p>The number of operations per second jumps up to nearly 10,000! What about our solid-state drive? /dev/sdb is a 16GB SUPER TALENT MasterDrive OCX (MLC).</p>
<pre>$ sudo rebench -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4682 (2.29 MB/sec)</pre>
<p>So it doesn&#8217;t perform as well as the rotational drive on sequential read access. How about random reads?</p>
<pre>$ sudo rebench /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4923 (2.40 MB/sec)</pre>
<p>Ah! We blow the rotational drive out of the water, improving on it by a factor of more than 50. And finally, how does the solid-state drive perform for random writes?</p>
<pre>$ sudo rebench -o write /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 16 (0.01 MB/sec)</pre>
<p>Not well, at only 16 random write operations per second! How about sequential writes?</p>
<pre>$ sudo rebench -o write -w seq /dev/sdb
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 6576 (3.21 MB/sec)</pre>
<p>Sequential writes are in the same range as sequential reads, which suggests that the SSD translation layer for random writes on this drive needs some work.</p>
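<p>The MB/sec figures in the runs above follow directly from the operation counts. The sketch below assumes every run used the default 512-byte block size and that Rebench counts a megabyte as 2<sup>20</sup> bytes; under those assumptions the reported numbers line up. The helper name is ours.</p>

```python
BLOCK_BYTES = 512  # assumed: the default block size for all runs above

def mb_per_sec(ops_per_sec, block_bytes=BLOCK_BYTES):
    """Convert operations/sec to MB/sec, taking 1 MB as 2**20 bytes."""
    return ops_per_sec * block_bytes / 2**20

# Operations/sec as reported in the runs above.
RUNS = {
    "sda random read": 87,
    "sda sequential read": 9778,
    "sdb sequential read": 4682,
    "sdb random read": 4923,
    "sdb random write": 16,
    "sdb sequential write": 6576,
}

if __name__ == "__main__":
    for name, ops in RUNS.items():
        print(f"{name:22s} {ops:5d} ops/sec  {mb_per_sec(ops):.2f} MB/sec")
    # Random-read advantage of the SSD over the rotational drive.
    ratio = RUNS["sdb random read"] / RUNS["sda random read"]
    print(f"random read speedup: {ratio:.1f}x")
```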
<p>Finally, if you don&#8217;t pass any flags to Rebench on the command line, it accepts them on standard input and treats each line as a separate workload to be run concurrently:</p>
<pre>$ sudo rebench
/dev/sdb
/dev/sda
Benchmarking results for [/dev/sdb] (15GB)
Operations/sec: 4636 (2.26 MB/sec)
---
Benchmarking results for [/dev/sda] (74GB)
Operations/sec: 85 (0.04 MB/sec)</pre>
<p>Rebench is a work in progress &#8211; we combined dozens of smaller programs we wrote into a unified tool just a few days ago. There&#8217;s some spaghetti code involved, and probably some lurking bugs, but in the meantime it gets the job done. You can get it from GitHub:</p>
<pre>git clone git://github.com/coffeemug/rebench.git</pre>
<p>or download the source directly:</p>
<pre>http://github.com/coffeemug/rebench/tarball/master</pre>
<p>To build Rebench, simply run <code>make</code>. You may need to install the GNU Scientific Library if you don&#8217;t have it already.</p>
<p>Rebench is released under the GPL, so we welcome improvements, bug fixes, and ports to other operating systems. Last but not least, we welcome hardware donations. Happy benchmarking!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/10/rebench-cutting-through-the-myths-of-io-performance/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>RethinkDB performance data.</title>
		<link>http://www.rethinkdb.com/blog/2009/08/rethinkdb-performance-data/</link>
		<comments>http://www.rethinkdb.com/blog/2009/08/rethinkdb-performance-data/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 09:30:50 +0000</pubDate>
		<dc:creator>Leif</dc:creator>
				<category><![CDATA[Benchmarks]]></category>

		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=43</guid>
		<description><![CDATA[It&#8217;s been a busy and exciting week since we announced RethinkDB.  Of all the feedback we received, the most common request was for performance numbers.  Before the launch our top priority was correctness. We spent most of our time testing RethinkDB with Wordpress and adding the missing features. As a result, performance suffered. [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a busy and exciting week since we announced RethinkDB. Of all the feedback we received, the most common request was for performance numbers. Before the launch our top priority was correctness. We spent most of our time testing RethinkDB with WordPress and adding the missing features. As a result, performance suffered. In the past week we tuned the engine back up to high performance. We&#8217;re still far from finished with the improvements we want to make, but we feel that we&#8217;ve reached a level of performance we can be proud to display.</p>
<p>We wrote our original benchmarking tool in Python, but during our latest benchmarks, we noticed that it was taking about as much time as the engine itself, hiding our real performance numbers. We now have a very small Objective-C program (under 900 lines) that uses prepared statements in a tight loop, and times only the <code>mysql_stmt_execute()</code> call. For inserts, the benchmark creates a table with three <code>INT</code> columns, two of them indexed, and performs N random (non-duplicate) INSERTs of <code>(k,k,k)</code> in a loop. For selects, it performs N random indexed point queries. Optionally, a number of SELECT threads run as well, each thread doing repeated indexed point queries throughout the execution of the main timed thread.</p>
<p>The benchmarks were run on a 2.5 GHz Intel Core 2 Duo machine with 2 GB RAM, on a 16 GB SUPER TALENT MasterDrive, an MLC solid state drive, connected via a 3 Gb/s SATA II bus. RethinkDB and MyISAM were run with the stock config options. We ran the InnoDB test by starting the server with <code>--innodb_flush_log_at_trx_commit=0 --innodb_support_xa=0 --innodb_buffer_pool_size=1536M</code>.</p>
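<p>The shape of the insert benchmark can be sketched in a few lines. This is not our Objective-C tool: it uses Python's built-in <code>sqlite3</code> module with an in-memory database as a stand-in for MySQL, and the table name, column names, and choice of which two columns to index are assumptions made for illustration. What it does show is the key point of the method: only the time spent inside the statement execution call is counted.</p>

```python
import random
import sqlite3
import time

def insert_benchmark(n_rows, seed=0):
    """Insert n_rows random, non-duplicate (k, k, k) rows.

    Returns (rows_in_table, rows_per_sec). Only time spent inside
    execute() for the INSERTs is counted, not key generation.
    """
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")

    conn = sqlite3.connect(":memory:")
    # Three INT columns, two of them indexed (names are hypothetical).
    conn.execute("CREATE TABLE t (a INT, b INT, c INT)")
    conn.execute("CREATE INDEX t_a ON t (a)")
    conn.execute("CREATE INDEX t_b ON t (b)")

    # random.sample over a range yields distinct keys in random order.
    keys = random.Random(seed).sample(range(10 * n_rows), n_rows)

    elapsed = 0.0
    for k in keys:
        start = time.perf_counter()
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", (k, k, k))
        elapsed += time.perf_counter() - start
    conn.commit()

    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count, n_rows / elapsed

if __name__ == "__main__":
    rows, rate = insert_benchmark(100_000)
    print(f"{rows} rows inserted at {rate:.0f} rows/sec")
```

<p>Numbers from this sketch are not comparable to the results below, since the engine, the client library, and the storage are all different.</p>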
<p>Here are the results:</p>
<p style="text-align: center;"><img class="size-full wp-image-54 aligncenter" title="insert" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/08/insert.png" alt="Insert benchmark with no readers." width="504" height="353" /></p>
<p>For insert performance, RethinkDB maintains a roughly 10x improvement in throughput over MyISAM: it averages 24534.597 rows/sec up to 2,000,000 rows, while InnoDB handles 8527.424 rows/sec and MyISAM manages only 2483.277 rows/sec. With more frequent measurements, we can see that InnoDB and MyISAM maintain generally high throughput, but pause periodically for long stretches of time. We believe that this is due to their B-tree structures, which need to expand once in a while, a time-consuming operation that greatly undermines their overall performance.</p>
<p>The threaded benchmark is a bit different:</p>
<p style="text-align: center;"><img class="size-full wp-image-61 aligncenter" title="insert2" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/08/insert2.png" alt="Multi-threaded insert benchmark." width="504" height="353" /></p>
<p>We&#8217;ve also benchmarked selects with no writers:</p>
<p style="text-align: center;"><img class="size-full wp-image-53 aligncenter" title="select" src="http://www.rethinkdb.com/blog/wp-content/uploads/2009/08/select.png" alt="Single-threaded select graph." width="504" height="353" /></p>
<p>RethinkDB&#8217;s select performance is on par with MyISAM and InnoDB for threaded and non-threaded benchmarks. The performance bottleneck for short selects is in the network stack, and while we have plans to tackle this problem, we won&#8217;t get to it for a while. However, our algorithms significantly improve RethinkDB performance on long selects and joins &#8212; we will write a blog post soon with more detailed results.</p>
<p>As always, comments and concerns are welcome, on our <a href="http://rethinkdb.com/blog/">blog</a>, <a href="http://twitter.com/rethinkdb">twitter feed</a>, or at <span class="mh-hyperlinked"><a href='http://mailhide.recaptcha.net/d?k=01mEopN6xz3PUFbr4Ij8gv5A==&c=xs4mCQCFJVEbredaqSKzi_WrPO8mkEYB5nVWD9UEOKo=' onclick="window.open('http://mailhide.recaptcha.net/d?k=01mEopN6xz3PUFbr4Ij8gv5A==&amp;c=xs4mCQCFJVEbredaqSKzi_WrPO8mkEYB5nVWD9UEOKo=', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;">info@rethinkdb.com</a></span>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/08/rethinkdb-performance-data/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>RethinkDB – A New Kind of Database</title>
		<link>http://www.rethinkdb.com/blog/2009/07/rethinkdb-a-new-kind-of-database/</link>
		<comments>http://www.rethinkdb.com/blog/2009/07/rethinkdb-a-new-kind-of-database/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 12:49:28 +0000</pubDate>
		<dc:creator>Slava</dc:creator>
				<category><![CDATA[Announcements]]></category>

		<guid isPermaLink="false">http://rethinkdb.no-ip.org/blog/?p=3</guid>
		<description><![CDATA[Today, we’re ready to announce RethinkDB — a new kind of database. It’s been a winding road. For two years, Mike, Leif, and I have been thinking independently on how to bring a breath of fresh air to the database world. Three months ago, we came together to form a company and bring our ideas [...]]]></description>
			<content:encoded><![CDATA[<p>Today, we’re ready to announce <a href="http://www.rethinkdb.com">RethinkDB</a> — a new kind of database. It’s been a winding road. For two years, Mike, Leif, and I have been thinking independently about how to bring a breath of fresh air to the database world. Three months ago, we came together to form a company and bring our ideas to reality. In these three months, we’ve raised seed funding from <a title="Y Combinator" href="http://www.ycombinator.com">Y Combinator</a>, moved to California, and built a MySQL plugin that implements the core of our vision — a storage engine redesigned for the modern world. With the exception of storage technology, database design has always been beautiful. Now, with dropping costs of storage, the advent of solid state drives, and advances in functional data structures theory, we can finally replace that last messy component of database management systems with an elegant, beautiful solution.</p>
<p>Much work remains to be done. RethinkDB isn’t ready for general production use. So, why release it today? At a recent Y Combinator dinner, Reid Hoffman (the founder of LinkedIn) said: “If you&#8217;re not embarrassed by the first version of your product, you’ve launched too late.” We’re launching too late. The article you’re reading now is served by a WordPress installation running live on RethinkDB. Many of our internal benchmarks outperform a stock MySQL setup. We’re no longer terrified of data corruption (though we still keep our fingers crossed). We’re using RethinkDB for painless hot backups. The time is long overdue for us to share our work with you.</p>
<p>We are committed to building an open, socially responsible company. In the coming weeks we will be releasing as much information about the RethinkDB internals as possible without compromising its commercial success. In the meantime, we’d like to welcome your <a href="http://www.rethinkdb.com/wiki/">feedback</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rethinkdb.com/blog/2009/07/rethinkdb-a-new-kind-of-database/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>
