I have followed him for a long time and learned a lot too. I always wonder the same thing about the “tech influencers” and I’d love to know more about how they structure their days.
I've found it difficult recently to sit down and complete a meaningful piece of work without being distracted by notifications and questions. In the last year this has been exacerbated by waiting for LLMs to finish.
I would love to know how top performers organise their time.
To be perfectly fair though, this isn't a failure mode that's new since SSDs arrived on the scene.
Drive controllers on HDDs just suddenly go to shit and drop off the bus, too.
I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.
> I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.
This is exactly the opposite of my lived experience. Spinners fail more often than SSDs, but I don't remember any sudden failures with spinners; as far as I can recall, they all had pre-failure indicators, like terrible noises (doesn't help for remote disks), SMART indicators, failed reads/writes on a couple of sectors here and there, etc. If you don't have backups, but you notice in a reasonable amount of time, you can salvage most of your data. Certainly, sometimes a drive just won't spin up because of a bearing/motor issue; but sometimes you can rotate the drive manually to get it started and capture some data.
The vast majority of my SSD failures have been the drive disappearing from the bus; lots of people say they should fail read-only, but I've not seen it. If you don't have backups, your data is all gone.
Perhaps I missed the pre-failure indicators from SMART, but it's easier when drives fail but remain available for inspection --- look at a healthy drive, look at a failed drive, see what's different, look at all your drives, predict which one fails next. For drives that disappear, you've got to read and collect the stats regularly and then go back and see if there was anything... I couldn't find anything particularly predictive. I feel "disappearing from the bus" is more in the firmware-error category than a physical storage problem, so there may not be real indications, unless it's a power-on-time-based failure...
For what it is worth the SMART diagnostics and health indicators have rarely been useful for me, either on SSDs or HDDs. I don't think I've ever had a SMART health warning before a drive dies. Although I did have one drive that gave a "This drive is on DEATH'S DOOR! Replace it IMMEDIATELY!" error for 3 years before I finally got around to replacing it, mostly to avoid having my OS freak out every time it booted up.
Oh, the overall smart status is mostly useless. But some of the individual fields are helpful.
The ones for reallocated sectors, pending sectors, etc. When those add up to N, it's time to replace, and you can calibrate N based on your monitoring cycle and backup needs. For a look-every-once-in-a-while, single-copy use case, I'd replace at around 10 sectors; for daily monitoring with multiple copies, I'd replace closer to 100 sectors. You probably won't get warranty coverage at those numbers, though.
Mostly I've only seen the SMART status warning fire for too many power-on hours, which isn't very useful. Power-on hours isn't a good indicator of impending doom (unless there's a firmware error at specific values, which can happen for SSDs or spinners).
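For anyone who wants to automate that kind of check, here's a minimal sketch (my own illustration, assuming smartmontools is installed and uses its usual attribute names and column layout; the device list and the threshold of 10 are placeholders):

    # Sketch only: parse `smartctl -A` output and flag drives whose reallocated
    # or pending sector counts pass a threshold. Assumes smartmontools' usual
    # column layout (RAW_VALUE is the 10th column); run as root.
    import subprocess

    WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")
    THRESHOLD = 10  # ~10 for occasional checks / single copy, ~100 for daily monitoring / multiple copies

    def sector_counts(device):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        counts = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in WATCHED:
                counts[fields[1]] = int(fields[9])  # RAW_VALUE column
        return counts

    for dev in ("/dev/sda", "/dev/sdb"):  # placeholder device list
        counts = sector_counts(dev)
        if sum(counts.values()) >= THRESHOLD:
            print(f"{dev}: time to replace ({counts})")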
> Spinners fail more often than SSDs, but I don't remember any sudden failures with spinners
I've had a fair number of HDDs throughout the years. My first one, well my dad's, was a massive 20 MB. I've had a 6+ disk ZFS pool going 24/7 since 2007. The oldest disks had over 7 years of on-time according to SMART data; I replaced them due to capacity.
Out of all that I've only had one HDD go poof gone. The infamous IBM Deathstar[1].
I've had some develop a few bad blocks and that's it, and one which just got worse and worse. But only one which died a sudden death.
Meanwhile I've had multiple SSDs that just stopped working suddenly. Articles talk about them going into read-only mode, but the ones I've had go bad just stopped working.
My experience has been the same. Hard drives fail more gracefully than SSDs.
> The vast majority of my SSD failures have been disappear from the bus; lots of people say they should fail read only, but I've not seen it. If you don't have backups, your data is all gone.
I just recovered data a couple weeks ago from my boss's SATA SSD that gave out and went read-only.
I don't know how true this is, but it seems to me that SSD firmware has to be more complex than HDD firmware and I've seen far more SSDs die due to firmware failure than HDDs. I've seen HDDs with corrupt firmware (junk strings and nonsense values in the SMART data for example), but usually the drive still reads and writes data. In contrast I've had multiple SSDs, often with relatively low power-on hours, just suddenly die with no warning. Some of them even show up as a completely different (and totally useless) device on the bus. Drives with Sandforce controllers used to do this all of the time, which was a problem because Sandforce hardware was apparently quite affordable and many third party drives used their chips.
I have had a few drives go completely read only on me, which is always a surprise to the underlying OS when it happens. What is interesting is you can't predict when a drive might go read-only on you. I've had a system drive that was only a couple of years old and running on a lightly loaded system claim to have exhausted the write endurance and go read only, although to be fair that drive was a throwaway Inland brand one I got almost for free at Microcenter.
If you really want to see this happen try setting up a Raspberry Pi or similar SBC off of a micro-SD card and leave it running for a couple of years. There is a reason people who are actually serious about those kinds of setups go to great lengths to put the logging on a ramdisk and shut off as much stuff as possible that might touch the disk.
I worked on SSD firmware for more than a decade, from the early days of SLC memory to TLC memory. SLC memory was so rock solid that you hardly needed any ECC protection. You could go months of use without any errors. And the most common error was an erase error, which just means you no longer use that block.
But then as the years progressed, the transistors were made smaller and MLC and TLC were introduced, all to increase capacity, but it made the NAND worse in every other way: endurance, retention, write/erase performance, read disturb. It also makes the algorithms and error recovery process more complicated.
Another difficult thing is recovering the FTL mapping tables from a sudden power loss. Having those power loss protection capacitors makes it so much more robust in every way. I wish more consumer drives included them. It probably just adds $2-3 to the product cost.
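To make the recovery problem concrete, here's a toy sketch (purely illustrative, nothing like real firmware): the logical-to-physical table lives in RAM, updates to it are journaled to NAND as they happen, and recovery after power loss is just a replay of that journal.

    # Toy model, not real firmware: the L2P table is volatile, every mapping
    # update is also journaled to persistent media, and recovery replays the
    # journal in order (last write wins).
    class ToyFTL:
        def __init__(self):
            self.l2p = {}      # logical block -> physical page (held in RAM)
            self.journal = []  # mapping updates persisted as they happen

        def write(self, logical, physical):
            self.journal.append((logical, physical))  # persisted record
            self.l2p[logical] = physical              # volatile table

        def power_loss(self):
            self.l2p = {}      # RAM contents are gone

        def recover(self):
            for logical, physical in self.journal:
                self.l2p[logical] = physical

    ftl = ToyFTL()
    ftl.write(0, 100); ftl.write(1, 101); ftl.write(0, 102)
    ftl.power_loss()
    ftl.recover()
    assert ftl.l2p == {0: 102, 1: 101}

The capacitors help because they buy the firmware enough time to flush whatever in-flight state hasn't hit NAND yet, instead of having to scan and reconstruct on the next boot.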
That's kind of what ZNS is for: make the SSD dumb but in exchange predictable; let the database on top, which already uses some type of CoW structure, handle the quantization of erase blocks; expose all overprovisioning from the start and just hand back less usable capacity once an erase block fails to erase, skipping over any read-access-sized blocks that got killed off there when mapping logical addresses to physical ones.
That has to exist anyway because, for yield reasons, some percentage of blocks is expected to be dead from the factory.
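A rough sketch of the interface being described, as I understand it (simplified; real ZNS goes through NVMe zone commands, and all names here are made up for illustration):

    # Simplified sketch: append-only zones sized to erase blocks, explicit
    # resets instead of a hidden FTL, and factory-dead blocks simply not
    # exposed, so the drive reports less usable capacity up front.
    class Zone:
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.write_pointer = 0
            self.data = {}

        def append(self, block):
            # Writes are sequential; the host (e.g. a CoW database) decides
            # what goes into a zone and when the whole zone gets reclaimed.
            if self.write_pointer >= self.capacity:
                raise IOError("zone full, reset required")
            self.data[self.write_pointer] = block
            self.write_pointer += 1
            return self.write_pointer - 1

        def reset(self):
            # Corresponds to erasing the underlying erase block.
            self.data.clear()
            self.write_pointer = 0

    class ZonedDevice:
        def __init__(self, zone_count, zone_capacity, dead_zones=()):
            # Zones that are bad from the factory just aren't exposed.
            self.zones = [Zone(zone_capacity)
                          for i in range(zone_count) if i not in dead_zones]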
> it seems to me that SSD firmware has to be more complex than HDD firmware
I think they're complicated in different ways. A hard disk drive has to power up an electromagnet in a motor, move an arm, read the magnetic state of the part of the platter under the read head, and correlate that to something? Oh, and there are multiple read heads. Seems ridiculously complex!
We have a fleet of a few hundred HDDs that is basically being replaced "on next failure" with SSDs, and sudden death is BY FAR rarer on HDDs; maybe one out of 100 "just dies".
Usually it either starts returning media errors or slows down (and if it's not replaced in time, a slowing-down drive usually turns into a media-error one).
SSDs (at least the big fleet of Samsung ones we had) are much worse: they just turn off, not even going read-only. Of course we have redundancy, so it's not really a problem, but if the same happened on someone's desktop they'd be screwed if they didn't have backups.
> I just put it in a plastic bag in the freezer for 15 minutes, and it works.
What's that supposed to do for a SSD?
It was a trick for hard disks because on ancient drives the heads could get stuck to the platter, and that might help sometimes. But even for HDDs that's dubiously useful these days.
> It was a trick for hard disks because on ancient drives the heads could get stuck to the platter, and that might help sometimes.
Stuck heads were/are part of the freezing trick.
Another part of that trick has to do with printed circuit boards and their myriad of connections -- you know, the stuff that both HDDs and SSDs have in common.
Freezing them makes things on the PCB contract, sometimes at different rates, and sometimes that change makes things better-enough, long-enough to retrieve the data.
I've recovered data from a few (non-ancient) hard drives that weren't stuck at all by freezing them. Previous to being frozen, they'd spin up fine at room temperature and sometimes would even work well-enough to get some data off of them (while logging a ton of errors). After being frozen, they became much more complacent.
A couple of them would die again after warming back up, and only really behaved while they were continuously frozen. But that was easy enough, too: Just run the USB cable from the adapter through the door seal on the freezer and plug it into a laptop.
This would work about the same for an SSD, in that: If it helps, then it is helpful.
Semiconductors generally work better the colder they are. Extreme overclockers don't use liquid nitrogen primarily to keep chips at room temperature at extreme power consumption, but to actually run them at temperatures far below freezing.
Complex issue: analog NAND doesn't work anything like the logic in CPUs.
Far more often it's the act of simply letting the device sit unpowered that "fixes" the issue. Speculation on what changed invariably goes on indefinitely.
I don't think one should worry as much about what media they are backing up to as about whether they are answering the question "does my data resiliency match my retention needs?"
And regularly test that restores actually work; nothing is worse than thinking you had backups and then finding they don't restore right.
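Even something this small goes a long way; a bare-bones sketch (restore_backup is a stand-in for whatever restore command your backup tool actually provides):

    # Bare-bones restore test: restore into a scratch directory and compare
    # file hashes against the live copy. restore_backup is a placeholder for
    # your backup tool's restore step.
    import hashlib, tempfile
    from pathlib import Path

    def file_hashes(root):
        root = Path(root)
        return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in root.rglob("*") if p.is_file()}

    def verify_restore(source_dir, restore_backup):
        with tempfile.TemporaryDirectory() as scratch:
            restore_backup(scratch)  # e.g. shell out to your backup tool here
            return file_hashes(source_dir) == file_hashes(scratch)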
We left Backbone for Vue in 2018 because of Vue's simplicity and how easy it makes building components. I'm surprised to learn that Backbone still exists. Maybe I should revisit it.
I just remembered why we left: I hate handling 'this' in JavaScript :)
I really liked this article. I got to meet the new kid on the block in the DBMS neighborhood. I also didn't know that the Zig programming language existed. So many new things. Congratulations to TigerBeetle! I'm going to tell my team about it and try it out on an interesting project.