What's interesting to me is that if you read the summary (http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-s...), most of the recommendations are exactly the same as when programming against a traditional spinning-oxide hard drive: read and write entire blocks if you can, combine block writes, &c.
It's nice to know that all that work on high-performance IO in databases doesn't need to be thrown away just yet.
SSDs do behave like complex block devices from the perspective of a database engine. However, SSDs are sufficiently different from spinning disk that optimal I/O patterns are not the same. Some high-performance database I/O schedulers have separate storage behavioral modes (approximately based on common SSD and HDD behavior) depending on the storage characteristics.
Everything on a computer approximates a block device. Even RAM is treated as a type of complex block device in sophisticated, high-performance databases, because it is. Scheduling operations to optimally match the characteristics of block devices is a (the?) primary optimization mechanism.
Exactly. Random 4K reads at a queue depth of 32 are much, much faster than at a queue depth of 1 on SSDs. On regular hard drives you might see a 200% speedup with a high queue depth; on SSDs you could see a 1000-3000% speedup.
SSDs live off their internal parallelism; if you aren't using a queue depth > 16 you are not making real use of an SSD. Q=32 or even Q=64 are usually the right settings.
It all depends on whether you want throughput or latency, though. If you really, really care about latency you should balance the queue depth, probably keeping it between 16 and 32. The reason is that with deeper queues you get more collisions on the same die, and then latency suffers. There are read-read, write-read, erase-write and all the other combinations, but those three are the interesting ones.
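A minimal sketch of what "driving an SSD at depth" looks like from user space, with a thread pool standing in for queue depth. All the sizes, the scratch file, and the thread-pool approach are assumptions for illustration; a real benchmark would use O_DIRECT (or io_uring/AIO) to keep the page cache out of the measurement:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096        # assumed I/O size
QUEUE_DEPTH = 32    # outstanding requests to keep in flight
NUM_BLOCKS = 256

# Scratch file standing in for the device under test.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(BLOCK * NUM_BLOCKS))
os.fsync(fd)

def read_block(offset):
    # os.pread carries its own offset, so threads never contend
    # on a shared file position and reads can overlap freely.
    return os.pread(fd, BLOCK, offset)

offsets = [i * BLOCK for i in range(NUM_BLOCKS)]

# Pool width approximates queue depth: up to QUEUE_DEPTH reads
# are in flight at once, which is what lets the SSD's internal
# parallelism actually do something.
with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
    blocks = list(pool.map(read_block, offsets))

os.close(fd)
os.unlink(path)
```

At QUEUE_DEPTH = 1 this degenerates into a serial loop, which is the case the parent comments say leaves most of the device idle.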
This is true because the NAND has been crippled by imposing the FTL on it. If you could bypass the FTL, the programming advice would be quite different and, properly applied, would result in much better long-term performance.
Yeah, you can do some pretty wild things if you have direct access to the flash array.
But then you need to handle stuff like wear leveling, transaction management, ECC and other forms of recovery. And a lot of the stuff you need to do is probably flash-part specific (e.g., read disturbance, probably stuff around channel management and throughput, etc.).
I actually proposed allowing the firmware of a recent consumer product have such access to the flash (because I didn't trust the flash vendor's translation layer), but got shot down. I don't know how that turned out; probably they spent a bunch of time doing qualification (code for: "Fix your damned FTL bugs or we find another vendor. Wait. We don't have time for that. Fix as many as you can, or we'll be mad ... or something. Here, have some money.").
In what way would you do it differently than the FTL? Are you familiar with all the restrictions and limitations of using NAND and how they impact the FTL algorithms?
It's less that I'd do things not currently done in the FTL, and more that it would be deterministic. My algorithm is a cyclic cache that's write-heavy, at least initially targeted at low-end hardware. If I could bypass the FTL, I could ensure that my algorithm wasn't amplifying writes. But with an FTL, which varies from drive to drive, my usage pattern could result in a great deal of write amplification.
I'm sure that for any given controller's FTL (and this article claims that there really are only a couple on the market), I could tweak my algorithm to work reasonably well. But that's a sign of a leaky abstraction.
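For what it's worth, the erase-friendly core of a scheme like that can be sketched as pure bookkeeping. The sizes are assumptions, the "device" is just counters, and `CyclicLog` is a made-up name, not anyone's real product:

```python
ERASE_BLOCK = 256 * 1024           # assumed NAND erase-block size
PAGE = 4096                        # assumed NAND page size
PAGES_PER_EB = ERASE_BLOCK // PAGE

class CyclicLog:
    """Cyclic, write-heavy log that only ever writes whole, aligned
    erase blocks in ring order, so even a dumb FTL (or raw flash)
    never has to relocate live data to make room."""

    def __init__(self, num_erase_blocks):
        self.num_eb = num_erase_blocks
        self.buf = bytearray()      # accumulates one full erase block
        self.head = 0               # next ring slot to overwrite
        self.erases = [0] * num_erase_blocks  # per-slot erase counts

    def append(self, record):
        assert len(record) == PAGE  # one page per record, for simplicity
        self.buf += record
        if len(self.buf) == ERASE_BLOCK:
            self._flush()

    def _flush(self):
        # One aligned, erase-block-sized write: deterministic cost,
        # no background copying, write amplification of exactly 1.
        slot = self.head % self.num_eb
        self.erases[slot] += 1      # the erase this overwrite implies
        # (a real implementation would issue the device write here)
        self.head += 1
        self.buf = bytearray()

log = CyclicLog(num_erase_blocks=4)
for _ in range(PAGES_PER_EB * 8):   # enough records to lap the ring twice
    log.append(b"\x00" * PAGE)
```

Because the ring advances uniformly, every slot ends up erased the same number of times, so wear leveling falls out of the layout for free; that determinism is exactly what an opaque FTL in the middle takes away.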
I'd also like access to the small SLC portion of the drive, though I'm working around that for now with journaling.
I'm not an expert in flash memory. My model is basically a block device with larger erase blocks, where the number of erasures each block can handle is limited.