<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: More on alignment, ext2, and partitioning on SSDs</title>
	<atom:link href="http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sun, 05 Sep 2010 13:51:34 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: chris</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/comment-page-1/#comment-114459</link>
		<dc:creator>chris</dc:creator>
		<pubDate>Sat, 28 Aug 2010 19:35:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247#comment-114459</guid>
		<description>First off, let me say I love your blog. I find myself asking the same questions every day. Have you looked into the parallel capabilities of modern flash drives? I don&#039;t have access to many modern SSDs, but to get the r/w speeds they claim, there must be some parallelization going on. Chances are, however, that even if parallelization (r/w to multiple chips at the same time, like RAID) is happening, the wear-leveling will nullify any intelligent decisions made.</description>
		<content:encoded><![CDATA[<p>First off, let me say I love your blog. I find myself asking the same questions every day. Have you looked into the parallel capabilities of modern flash drives? I don&#8217;t have access to many modern SSDs, but to get the r/w speeds they claim, there must be some parallelization going on. Chances are, however, that even if parallelization (r/w to multiple chips at the same time, like RAID) is happening, the wear-leveling will nullify any intelligent decisions made.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Smith</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/comment-page-1/#comment-7158</link>
		<dc:creator>Brian Smith</dc:creator>
		<pubDate>Wed, 21 Oct 2009 14:45:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247#comment-7158</guid>
		<description>I did some more research. It seems that 2.6.28 definitely doesn&#039;t have support for 4K logical sectors. Patches were sent to LKML in the 2.6.29/2.6.30 timeframe but I don&#039;t know if they were ever integrated in any release. In 2.6.31, Linux can interrogate the disk to determine what its native sector size is, but that doesn&#039;t mean 2.6.31 supports 4K sector sizes fully enough to affect performance. And I couldn&#039;t find anything in the 2.6.29, 2.6.30, or 2.6.31 changelogs indicating 4K sector support at all.

I did find some discussions that were inconclusive about whether it was *safe* to use 512-byte logical sectors on a disk with 4096-byte physical sectors. A few people are saying that 512-on-4096 emulation is NOT safe for magnetic disks. Whether 512-on-4096 emulation is safe for an SSD depends on the SSD controller. It seems most SSD controllers are using a (mostly-)append-only log-structured filesystem internally; in that case, it seems like 512-on-4096 emulation is safe. I think that safety-critical applications should always write using the physical sector size (4096 bytes for most SSDs today) for maximum atomicity/safety. I also heard someone mention that many SSDs--especially laptop/desktop class SSDs--can&#039;t/won&#039;t expose 4096-byte sector I/O operations to the OS, only doing 512-on-4096 emulation. I don&#039;t know if that is true.

It seems to me that, for most SSDs, a 4K write is always going to be non-destructive if it fails. IMO, ATA should have a mechanism for the disk to say &quot;writes of XXXXk (e.g. 128k) or smaller are non-destructive if they fail&quot; and/or expose a non-destructive write operation. That would enable *huge* performance optimizations for VMMs, filesystems, and databases. Maybe ATA already has one or both of those; it is something that should be looked into.

I don&#039;t have a machine that I can reformat appropriately to run those tests. That&#039;s why I&#039;m so interested in the results of your tests.</description>
		<content:encoded><![CDATA[<p>I did some more research. It seems that 2.6.28 definitely doesn&#8217;t have support for 4K logical sectors. Patches were sent to LKML in the 2.6.29/2.6.30 timeframe but I don&#8217;t know if they were ever integrated in any release. In 2.6.31, Linux can interrogate the disk to determine what its native sector size is, but that doesn&#8217;t mean 2.6.31 supports 4K sector sizes fully enough to affect performance. And I couldn&#8217;t find anything in the 2.6.29, 2.6.30, or 2.6.31 changelogs indicating 4K sector support at all.</p>
<p>I did find some discussions that were inconclusive about whether it was *safe* to use 512-byte logical sectors on a disk with 4096-byte physical sectors. A few people are saying that 512-on-4096 emulation is NOT safe for magnetic disks. Whether 512-on-4096 emulation is safe for an SSD depends on the SSD controller. It seems most SSD controllers are using a (mostly-)append-only log-structured filesystem internally; in that case, it seems like 512-on-4096 emulation is safe. I think that safety-critical applications should always write using the physical sector size (4096 bytes for most SSDs today) for maximum atomicity/safety. I also heard someone mention that many SSDs&#8211;especially laptop/desktop class SSDs&#8211;can&#8217;t/won&#8217;t expose 4096-byte sector I/O operations to the OS, only doing 512-on-4096 emulation. I don&#8217;t know if that is true.</p>
<p>It seems to me that, for most SSDs, a 4K write is always going to be non-destructive if it fails. IMO, ATA should have a mechanism for the disk to say &#8220;writes of XXXXk (e.g. 128k) or smaller are non-destructive if they fail&#8221; and/or expose a non-destructive write operation. That would enable *huge* performance optimizations for VMMs, filesystems, and databases. Maybe ATA already has one or both of those; it is something that should be looked into.</p>
<p>I don&#8217;t have a machine that I can reformat appropriately to run those tests. That&#8217;s why I&#8217;m so interested in the results of your tests.</p>
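<p>For anyone who wants to check what sector sizes the kernel reports for a disk, here is a minimal sketch in C. It assumes a Linux kernel that exposes the logical_block_size and physical_block_size attributes under /sys/block/DEV/queue/. The helper names (read_ulong_file, logical_sector_size, physical_sector_size, uses_sector_emulation) are hypothetical and only for illustration. It reports what the kernel believes, which may differ from what the SSD controller does internally.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one unsigned integer from a sysfs-style
 * text file. Returns 0 if the file is missing or unparsable. */
static unsigned long read_ulong_file(const char *path)
{
    unsigned long value = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%lu", &value) != 1)
        value = 0;
    fclose(f);
    return value;
}

/* Build /sys/block/<dev>/queue/<attr> and read it. 0 means "unknown". */
static unsigned long queue_attr(const char *dev, const char *attr)
{
    char path[256];
    int n;

    assert(dev != NULL && attr != NULL);
    n = snprintf(path, sizeof path, "/sys/block/%s/queue/%s", dev, attr);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0;
    return read_ulong_file(path);
}

/* dev is a bare device name such as "sdb" (an assumption of this sketch). */
static unsigned long logical_sector_size(const char *dev)
{
    return queue_attr(dev, "logical_block_size");
}

static unsigned long physical_sector_size(const char *dev)
{
    return queue_attr(dev, "physical_block_size");
}

/* Returns 1 if the kernel reports smaller logical than physical sectors
 * (e.g. 512-on-4096 emulation), 0 otherwise or if either size is unknown. */
static int uses_sector_emulation(const char *dev)
{
    unsigned long logical = logical_sector_size(dev);
    unsigned long physical = physical_sector_size(dev);
    return logical != 0 && physical != 0 && logical < physical;
}
```

<p>A safety-critical writer could then size and align its writes to physical_sector_size("sdb"), falling back to a conservative 4096 bytes when the function returns 0.</p>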
]]></content:encoded>
	</item>
	<item>
		<title>By: uberVU - social comments</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/comment-page-1/#comment-7098</link>
		<dc:creator>uberVU - social comments</dc:creator>
		<pubDate>Wed, 21 Oct 2009 06:43:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247#comment-7098</guid>
		<description>&lt;strong&gt;Social comments and analytics for this post...&lt;/strong&gt;

This post was mentioned on Twitter by patrichards: RT @rethinkdb We&#039;ve released more benchmarks that test alignment, ext2, and partitioning on SSDs. Read more: http://bit.ly/1x92Zd...</description>
		<content:encoded><![CDATA[<p><strong>Social comments and analytics for this post&#8230;</strong></p>
<p>This post was mentioned on Twitter by patrichards: RT @rethinkdb We&#8217;ve released more benchmarks that test alignment, ext2, and partitioning on SSDs. Read more: <a href="http://bit.ly/1x92Zd..." rel="nofollow">http://bit.ly/1x92Zd...</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slava</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/comment-page-1/#comment-7002</link>
		<dc:creator>Slava</dc:creator>
		<pubDate>Tue, 20 Oct 2009 18:59:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247#comment-7002</guid>
		<description>Brian, I tried various memory alignments (from 512B to 128K) and it didn&#039;t make a difference. It&#039;s probably a good idea to align it to max(VM page size, block size tested) - I&#039;ll push the code to do it in a bit.

In the meantime, if you have an SSD, would you mind trying the following:
&lt;br/&gt;
# rebench /dev/sdb                             # this sets block size and stride to 512B
# rebench -b 4096 -s 4096 /dev/sdb
&lt;br/&gt;
What results do you get (on what hardware and kernel)? When I move from 512B block size to 4K I see a performance drop on SuperTalent SSD, as well as Intel SLC. On the current test machine they&#039;re connected via SATA directly to the motherboard (I&#039;m not sure what board and SATA controller this machine has, I&#039;ll find out tonight). I&#039;m on Ubuntu 2.6.28-14-server.</description>
		<content:encoded><![CDATA[<p>Brian, I tried various memory alignments (from 512B to 128K) and it didn&#8217;t make a difference. It&#8217;s probably a good idea to align it to max(VM page size, block size tested) &#8211; I&#8217;ll push the code to do it in a bit.</p>
<p>In the meantime, if you have an SSD, would you mind trying the following:<br />
<br />
# rebench /dev/sdb                             # this sets block size and stride to 512B<br />
# rebench -b 4096 -s 4096 /dev/sdb<br />
<br />
What results do you get (on what hardware and kernel)? When I move from 512B block size to 4K I see a performance drop on SuperTalent SSD, as well as Intel SLC. On the current test machine they&#8217;re connected via SATA directly to the motherboard (I&#8217;m not sure what board and SATA controller this machine has, I&#8217;ll find out tonight). I&#8217;m on Ubuntu 2.6.28-14-server.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Smith</title>
		<link>http://www.rethinkdb.com/blog/2009/10/more-on-alignment-ext2-and-partitioning-on-ssds/comment-page-1/#comment-6984</link>
		<dc:creator>Brian Smith</dc:creator>
		<pubDate>Tue, 20 Oct 2009 16:52:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.rethinkdb.com/blog/?p=247#comment-6984</guid>
		<description>Please include the exact command line you used to run rebench in your posts. Also, what kind of hardware configuration are you using?

I think it is really strange that 4096-byte reads are slower than 512-byte reads. I think it is so strange I dug into the code. 

I notice you hard-code HARDWARE_BLOCK_SIZE to 512 and then use this for buffer alignment with posix_memalign. I recommend aligning on max(8192, max(VMM page size, block size being tested)) boundaries instead. I&#039;m not sure it would make a difference but it may affect DMA transfers from the disk--especially if the OS/driver is trying to align all reads on memory page boundaries AND disk page boundaries. Imagine you have a 4K read aligned on a disk page but not aligned on a RAM page. I could easily see this requiring a DMA transfer into a kernel buffer, and then a memcpy into your buffer. If everything is aligned exactly right, you should be able to get a DMA directly into your buffer. However, historically there&#039;ve been all kinds of issues in Linux that have prevented zero-copy I/O like this; I don&#039;t know if all of them have been resolved.</description>
		<content:encoded><![CDATA[<p>Please include the exact command line you used to run rebench in your posts. Also, what kind of hardware configuration are you using?</p>
<p>I think it is really strange that 4096-byte reads are slower than 512-byte reads. I think it is so strange I dug into the code. </p>
<p>I notice you hard-code HARDWARE_BLOCK_SIZE to 512 and then use this for buffer alignment with posix_memalign. I recommend aligning on max(8192, max(VMM page size, block size being tested)) boundaries instead. I&#8217;m not sure it would make a difference but it may affect DMA transfers from the disk&#8211;especially if the OS/driver is trying to align all reads on memory page boundaries AND disk page boundaries. Imagine you have a 4K read aligned on a disk page but not aligned on a RAM page. I could easily see this requiring a DMA transfer into a kernel buffer, and then a memcpy into your buffer. If everything is aligned exactly right, you should be able to get a DMA directly into your buffer. However, historically there&#8217;ve been all kinds of issues in Linux that have prevented zero-copy I/O like this; I don&#8217;t know if all of them have been resolved.</p>
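<p>A minimal sketch of the alignment suggested above, assuming a POSIX system. The helper names io_buffer_alignment and alloc_io_buffer are hypothetical and are not part of rebench. Rounding the result up to a power of two is an added assumption, since posix_memalign rejects any other alignment.</p>

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: round n up to the next power of two.
 * The guard stops the shift before it could overflow size_t. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n && p <= SIZE_MAX / 2)
        p <<= 1;
    return p;
}

/* Computes max(8192, max(VM page size, block size being tested)),
 * rounded up to a power of two so posix_memalign will accept it. */
static size_t io_buffer_alignment(size_t block_size)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t align = 8192;

    if (page > 0 && (size_t)page > align)
        align = (size_t)page;
    if (block_size > align)
        align = block_size;

    align = round_up_pow2(align);
    assert((align & (align - 1)) == 0); /* must be a power of two */
    return align;
}

/* Allocates block_size bytes aligned as computed above.
 * Returns NULL on failure; release the buffer with free(). */
static void *alloc_io_buffer(size_t block_size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, io_buffer_alignment(block_size), block_size) != 0)
        return NULL;
    return buf;
}
```

<p>On a machine with 4K pages, io_buffer_alignment(4096) yields 8192, so a buffer from alloc_io_buffer(4096) starts on both a RAM page boundary and a multiple of the tested block size. Whether that actually lets the kernel DMA straight into the buffer still depends on the driver and on how the file is opened.</p>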
]]></content:encoded>
	</item>
</channel>
</rss>
