It is so wildly more expensive than traditional development that it is simply not feasible to apply it anywhere but absolutely the most critical paths, and even then, the properties asserted by formal verification are often quite a bit less powerful than necessary to truly guarantee something useful.
I want formal verification everywhere. I believe in provable correctness. I wish we could hire people capable of always writing software to that standard and maintaining those proofs alongside their work.
We really can’t, though. It’s a frustrating reality of being human — we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.
> we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.
This seems like a contradiction. If the smartest engineers you can hire are not smart enough to work within formal verification constraints then we in fact do not know how to do this.
If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.
It’s not that we can’t do it, it’s that higher-velocity occasionally buggy code has proven time and time again to be significantly more profitable than formally verified. The juice is rarely worth the squeeze.
I generally agree with your assessment. But frumplestlatz also says that literally their smartest engineers are not smart enough to do formal verification.
Agreed. Further, this has been true even ignoring formal verification. Who has been in the situation of making the choice to ship known-buggy code to make a release date or satisfy a customer demand for other functionality? All of us, I suspect, if we’re being honest. I certainly have.
> If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.
It doesn’t hinge on having perfect engineers.
It hinges on engineers being able to model problems algebraically and completely, prove the equivalence of multiple such models at different layers of abstraction (including equivalence to the original code), and then prove that useful properties hold across those models.
If the smartest engineers cannot do it, it doesn’t work.
This isn’t even getting to the practical question of whether it’s worth doing, given the significant additional cost. If the smartest folks you can find are not smart enough to use the framework then it’s useless.
Maybe this means the tooling is insufficient. Maybe it means the field isn’t mature enough. Whatever the reason, if you need an IQ two standard deviations above normal and 10x as much time, it’s not real-world usable today.
> If the smartest engineers cannot do it, it doesn’t work.
[FIX:] ..., it doesn't work universally.
And the answer to that is pretty clear. It does not work universally. If every developer started only shipping code they had credibly formally verified, the vast majority of developers would go into shock at the scale of work to be done. Even the best "validators" would fall into career shredding pits, due to "minor" but now insurmountable dependencies in previously unverified projects. The vast majority of projects would go into unrecoverable stalls.
But formal validation can still work some of the time with the right people, on the right scale and kind of project, with the right amount of resources/time expended.
It isn't as if regular "best practices" programming works universally either. But validation is much harder.
> But formal validation can still work some of the time with the right people, on the right scale and kind of project, with the right amount of resources/time expended.
The problem is, it’s unclear exactly what those situations are or even should be. That lack of clarity causes us to fail to recognize when we could have applied these methods and so we just don’t. As much as I see value in formal methods, I’ve never worked with a team that has employed them. And I don’t think I’m at all unique in that.
I don’t understand why anyone should want this. Why should normal engineering efforts be held to the same standards as life-critical systems? Why would anyone expect that CloudFlare DDoS protection be built to the standards of avionics equipment?
Also if we’re being fair, avionics software is far narrower in scope than just “software in general”. And even with that Boeing managed to kill a bunch of people with shitty software.
That's ok, but then you should bow out of the conversation, which is between people that do understand why anyone should want this.
To have predictable behavior is a must-have in some industries, less so in others. At the level of infrastructure that is deemed critical by some - and I'm curious what JGC's position on this is - the ability to avoid this kind of outage carries a lot of value. The fact that you do not see that CF has achieved life-critical reach tells me that most of this effort is probably going to waste, but I trust that John does see it my way and realizes that if there are ways to avoid these kinds of issues they should be researched. Because service uptime is something very important to companies like Cloudflare.
Boeing managed to kill a bunch of people with shitty business practices, not with shitty software, the software did what it was built to do. It is the whole process around that software as well as the type certification process and regulatory oversight that failed dramatically.
> That's ok, but then you should bow out of the conversation, which is between people that do understand why anyone should want this.
I was not making a statement that I am ignorant. I was saying I believe the proposal to model general software engineering after avionics is misguided and inviting you to clarify your position.
It is certainly valid to ask what CloudFlare or anyone else for that matter could learn from avionics engineering or from NASA or from civil engineering focused on large scale projects or anywhere else that good engineering practices might come from. However, there is a persistent undercurrent in discussions around software reliability and general software engineering that ignores the fact that there are major trade-offs made for different engineering efforts.
“Oh, look how reliable avionics are. We should just copy that.”
Cool, except I would bet avionics cost 100 times as much to build per line of code as anything CloudFlare has ever shipped. The design constraints are just fundamentally different. Avionics are built for a specific purpose in an effectively unchanging environment. If Cloudflare built their offerings in the same way, they would never ship new features, the quality of their request filtering would plummet as adversaries adjusted faster than CloudFlare could react, and realistically they would be overtaken by a competitor within a few years at most. They aren’t building avionics, so they shouldn’t engineer as if they are. Their engineering practices should reflect the reality of the environment in which they are building a product.
This is no different than people who ask, “Why don’t we build software the way we build bridges?” Because we’re not building bridges. Most bridges look exactly like some other bridge that was built 10 miles away. That’s nothing like building new software. That’s far more like deploying a new instance of existing software with slightly different config. And this is not to say that there is nothing for software engineers to learn from bridge building, but reductive “just do it like them” thinking is not useful.
> Boeing managed to kill a bunch of people with shitty business practices, not with shitty software, the software did what it was built to do.
The software was poorly designed. No doubt it was implemented to the spec. Does that change the fact that the sum total of the engineering yielded a deadly result? There is no papering over the fact that “building to avionics standards” led directly to the deaths of 346 people in this case.
> I was not making a statement that I am ignorant.
ok.
> I was saying I believe the proposal to model general software engineering after avionics is misguided and inviting you to clarify your position.
But we are not talking about 'general software engineering', we are talking about Cloudflare specifically and that makes a massive difference.
> It is certainly valid to ask what CloudFlare or anyone else for that matter could learn from avionics engineering or from NASA or from civil engineering focused on large scale projects or anywhere else that good engineering practices might come from. However, there is a persistent undercurrent in discussions around software reliability and general software engineering that ignores the fact that there are major trade-offs made for different engineering efforts.
I think we are all aware of those trade offs. We are focusing on a specific outage here that cost an absolute fortune and that used some very specific technical constructs and we are wondering if there would have been better alternatives either by using different constructs or by using different engineering principles.
> “Oh, look how reliable avionics are. We should just copy that.”
> Cool, except I would bet avionics cost 100 times as much to build per line of code as anything CloudFlare has ever shipped.
And there is a pretty good chance that had they done that, they would have come out ahead.
> The design constraints are just fundamentally different.
Yes, but not so different that lessons learned cannot be transferred. The main reason why aviation is different is because it is a regulated industry and - at least in the past - regulators have teeth, and without their stamp of approval you are simply not taking off with passengers on board.
> Avionics are built for a specific purpose in an effectively unchanging environment.
That is very much not the case. The environments aircraft are subject to are - and increasingly so due to climate change - dynamic to a point that would probably surprise you.
What is not changing is this: the price for unexpected outcomes in that industry is that at some point global air travel will no longer be seen as safe and that once that happens one of the engines behind our economies will start failing. In that sense the differences with Cloudflare are in fact not that large.
> If Cloudflare built their offerings in the same way, they would never ship new features, the quality of their request filtering would plummet as adversaries adjusted faster than CloudFlare could react, and realistically they would be overtaken by a competitor within a few years at most. They aren’t building avionics, so they shouldn’t engineer as if they are. Their engineering practices should reflect the reality of the environment in which they are building a product.
I do not believe that you are correct here. They could, they can afford it and they have reached a scale at which the door is firmly closed against competitors, this is not a two bit start-up anymore.
> This is no different than people who ask, “Why don’t we build software the way we build bridges?” Because we’re not building bridges. Most bridges look exactly like some other bridge that was built 10 miles away. That’s nothing like building new software. That’s far more like deploying a new instance of existing software with slightly different config.
This too does not show deep insight into the kind of engineering that goes into any particular bridge. That they look the same to you is just the outside, the interface. But how a particular bridge is anchored and engineered can be a world of difference from another bridge in a different soil situation, even if they look identical. The big trick is that they all look like simple constructs, but they're not.
> The software was poorly designed. No doubt it was implemented to the spec. Does that change the fact that the sum total of the engineering yielded a deadly result? There is no papering over the fact that “building to avionics standards” led directly to the deaths of 346 people in this case.
That is not what happened, and it is not what the accident investigation concluded.
Boeing fucked up, not some software engineer taking a short-cut. This was a top down managed disaster with multiple attempts to cover up the root cause and a complete failure of regulatory oversight.
I'm not sure about that. This type of conversation tends toward "shit's easy syndrome" with complexities hand waved away and real trade offs given lip service consideration only. With respect to CloudFlare you specifically said "as soon as they become the cause of an outage they have invalidated their whole reason for existence". I don't know how to square black and white statements like that with an understanding of tradeoffs. A lot of companies would (and do) trade the potential for an outage against the ongoing value of CloudFlare's offerings.
> we are wondering if there would have been better alternatives either by using different constructs or by using different engineering principles.
I think what was actually said was "let's start off with holding them to the same standards as avionics software development". Not so much inquisitive as "shit's easy".
> And there is a pretty good chance that had they done that, they would have come out ahead.
How did you reach that conclusion? CloudFlare has taken a stock hit recently. Even if we attribute that 100% to their outage, they are still up 92% over the last year.
For comparison's sake, CloudFlare was founded after the 737 Max started development. I seriously doubt CloudFlare would have achieved its current success by attempting to ape avionics engineering.
> That is very much not the case. The environments aircraft are subject to are - and increasingly so due to climate change - dynamic to a point that would probably surprise you.
Did you honestly think I was referring to the actual weather? A plane built in 1970 will (assuming it's been maintained) still fly today just fine. The design constraints today are essentially the same and there are no adversaries out there changing the weather in a way that Boeing needs to continuously account for.
This is wholly different from CloudFlare, who is actively fighting botnets and other adversaries who are continuously adapting and changing tactics. The closest analog for avionics would probably be nation states that can scramble GPS.
> In that sense the differences with Cloudflare are in fact not that large.
In the sense that both are important and both happen to involve software, sure. In most other ways the differences are in fact very large.
> I do not believe that you are correct here. They could, they can afford it and they have reached a scale at which the door is firmly closed against competitors, this is not a two bit start-up anymore.
You are ignoring the reality of the situation, and it surfaces in self-contradictory statements like this. They have closed the door firmly on competition so now they need to focus on avionics-like engineering? Why? If their moat is impassable they should just stop development and keep raking in money. The only reason that they even experienced this outage was because they are in continuous development.
The reality is that their moat is not that wide. If their adversaries or their competition outpace them, they could easily lose their customers to AWS or Azure or someone else.
> This too does not show deep insight into the kind of engineering that goes into any particular bridge. That they look the same to you is just the outside, the interface. But how a particular bridge is anchored and engineered can be a world of difference from another bridge in a different soil situation, even if they look identical. The big trick is that they all look like simple constructs, but they're not.
Forest for the trees... I did not claim that the bridges are actually the same. But how to build foundations, how to span supports, how thick concrete needs to be and how much rebar, these are well established. Yes, there are calculations and designs but civil engineers have done an excellent job of building a large corpus of practical information that allows them to build bridges with confidence. (And this is definitely something we could learn from them.) Rarely are bridges built mostly with custom components that have never been used before.
> Boeing fucked up, not some software engineer taking a short-cut. This was a top down managed disaster with multiple attempts to cover up the root cause and a complete failure of regulatory oversight.
You're trying to hand wave this away as if I am blaming some individual Boeing engineer, but I'm not.
Engineering isn't just coding. Engineering is the planning and the designing and the building and the testing and everything else that makes the product what it is. Boeing created a system to mask the flight characteristics of their new plane, except it didn't actually work. (And also yes they lied to regulators about it.) If it had actually worked, those two planes wouldn't have crashed. A product intended to make planes easier to fly is poorly engineered if it actually crashes planes.
The vast majority of Cloudflare's "customers" are paying 0 to 20 dollars a month, for virtually the same protection coverage and features as most of their 200 dollars/mo customers. That's not remotely in the realm of avionics price structure, be it software or hardware.
It is the aggregate they pay that counts here, not the individual payments.
A better comparison would be to compare this to airline passengers paying for their tickets, they pay a few hundred bucks in the expectation that they will arrive at their destination.
Besides, it is not the customers that determine Cloudflare's business model, Cloudflare does. Note that their whole business is to prevent outages and that as soon as they become the cause of an outage they have invalidated their whole reason for existence. Of course you could then turn this into a statistical argument that as long as they prevent more outages than they cause that they are a net benefit but that's not what this discussion is about, it is first and foremost about the standard of development they are held up against.
Ericsson identified similar issues in their offering long ago and created a very capable solution and I'm wondering if that would not have been a better choice for this kind of project, even if it would have resulted in more resource consumption.
> as soon as they become the cause of an outage they have invalidated their whole reason for existence
This is a bar no engineering effort has ever met. “If you ever fail, even for a moment, there’s no reason for you to even exist.”
There have been 6 fatal passenger airplane crashes in the US this year alone. NASA only built 6 shuttles and 2 of those were destroyed in flight, killing their crews. And these were life-preserving systems that failed.
Discussions around software engineering quality always seem to veer into spaces where we assign almost mythic properties to other engineering efforts in an attempt to paint software engineering as lazy or careless.
The NASA example should highlight the normalisation of deviance. The Challenger o-rings had failed before and while engineers were very vocal about that, management overruled them. The foam impacts and tile loss were also a known factor in the Columbia disaster, but the abort window is very small. Both point to perverse incentives: maintaining the gravy train. One comment made the point earlier that if Cloudflare were more thorough they would not have captured the market because they would be slower. Slow is smooth and smooth is fast, but YMMV. At the end of the day everything can be traced back to a system that incentivizes wealth accumulation over capability, with the fixation that capability can be bought, which is a lie.
Boeing only builds software to this class of quality because they are forced to by law. No one does it unless there is a big expensive legal reason to do so.
Indeed. But: if we want to call this level of infrastructural work 'software engineering' and the impact of failure is as large as it is then that's an argument for either voluntary application of a higher standard or eventual regulation and I'm pretty sure CF would prefer the former over the latter.
> Anyone in avionics software dev to give an opinion?
I've done some for fuel estimation of freighter jets (not quite avionics but close enough to get a sense for the development processes) and the amount of rigor involved in that one project made me a better developer for the rest of my career. Was it slow? Yes, it was very slow. A couple of thousand lines of code and a multiple of that in tests, over a very long time compared to what it would normally take me.
But within the full envelope of possible inputs it performed exactly as advertised. The funny thing is that I'm not particularly proud of it, it was the process that kept things running even when my former games programmer mentality would have long ago said 'ship it'.
Some things you just need to do properly, or not at all.
Compare the prices for type-certified parts on aircraft to comparable (but not proven) parts in the automotive space. It's crazy how much more expensive it is to actually prove these parts perform to spec to the level required by aviation law.
It wouldn't surprise me to find doing the same kind of certifications for complex avionics software to be the same.
I left out any commentary on `.unwrap()` from my original comment, but it’s an obvious example of something that should never have appeared in critical code.
On HN, just a couple of years ago, a famous Rust programmer said that it is OK to use unwrap. Rustaceans supported this position. Cloudflare merely followed the community standard.
Result declares a type-level invariant — an assertion enforced by the compiler, not runtime — that the operation can fail.
Ignoring that is bypassing the type system. It means your types are either wrong, or your type system is incapable of modeling your true invariants.
In the case of the cloudflare error, their types were wrong. That was an avoidable failure. They needed to fix their type-level invariants, not yolo the issue with `.unwrap()`.
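To make that concrete, here is a minimal sketch of what "fixing the type-level invariant" can look like. The names and the limit are hypothetical, not Cloudflare's actual code: the point is that the "at most N features" rule lives in a fallible constructor, so the caller must decide what a violation means and later code never has a reason to reach for `.unwrap()`.

    // Hypothetical sketch: the limit is enforced at construction time.
    const MAX_FEATURES: usize = 200;

    #[derive(Debug)]
    enum FeatureSetError {
        TooMany { got: usize, max: usize },
    }

    struct FeatureSet(Vec<f64>);

    impl FeatureSet {
        fn new(values: Vec<f64>) -> Result<Self, FeatureSetError> {
            if values.len() > MAX_FEATURES {
                return Err(FeatureSetError::TooMany { got: values.len(), max: MAX_FEATURES });
            }
            Ok(FeatureSet(values))
        }
    }

    fn load(values: Vec<f64>) {
        match FeatureSet::new(values) {
            Ok(set) => println!("loaded {} features", set.0.len()),
            // The failure path is explicit; nothing panics here.
            Err(e) => eprintln!("rejecting feature file: {e:?}"),
        }
    }

    fn main() {
        load(vec![0.5; 10]);     // accepted
        load(vec![0.5; 10_000]); // rejected, process keeps running
    }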
Your willful persistent lack of understanding doesn’t mean my philosophy is incoherent. Using `.unwrap()` is always an example of a failure to accurately model your invariants in the type system.
Your definition of "correct" is completely incoherent. Just because an invariant that could be modeled by a type system is not modeled by the type system in any given scenario does not make it incorrect.
You can't engage with my examples and you provide none of your own. So continuing discussion with you is a waste of time.
This is literally what “invariant” means, and what a type system is built to model.
Declaring an invariant in the type system that you then violate is not correct code. I truly can’t even begin to guess at why you’re so vociferous in your defense of this particularly poor practice.
[edit]
HN rate limits kicking in, so here’s my reply. I work for a FAANG but I’m not going to say which one. You or a relative are, with almost 100% certainty, relying on code written to that philosophy, by me, daily and widely.
Show me code you've published that is used by real people in real systems that follows the philosophy you've espoused here. Otherwise I'm calling shenanigans.
You are unnecessarily combative in this thread. I don't know what about the GP it is that ticks you off but they're making a lot of sense to me and I don't see why you would be loudly demanding published code when you are having a conversation about an abstract device.
If you haven't read my blog on this topic, I suggest you do so before replying further: https://burntsushi.net/unwrap
It should very clearly state my position. And it provides the examples that I previously referenced.
The GP got a link to this blog in the previous HN thread. They dismissed it out-of-hand without engaging with it at all. And tossed in an ad hominem for good measure. So your issue with me specifically here seems completely inappropriate.
I've presented examples. They haven't. They haven't even bothered to engage with the examples I've provided. I want to read code they've written using this philosophy so that I can see what it looks like in real world usage. Otherwise, the only code I've seen that does something similar uses formal methods. So I simply do not believe that this is practical advice for most programming.
Insisting on examples and evidence to support an argument isn't combative. It's appropriate when extraordinary claims are being made.
If you've published code using this philosophy that is used by real people in real systems, then I would be happy to take a look at that as well. If it exists, I would bet it's in a niche of a niche.
I've had these arguments before about this very topic. Some people have taken me up on this request and actually provided examples. And in 100% of those cases, it turned out there was a mismatch between what they were saying and what the code was doing.
> I work for a FAANG but I’m not going to say which one. You or a relative are, with almost 100% certainty, relying on code written to that philosophy, by me, daily and widely.
Cool story bro.
Like, even interpreted maximally charitably, your statement still doesn’t provide GP’s requested published code. Not “take my word for it” ostensibly-deployed software, but code; the discussion here is about code constructs for modeling invariants, not solely about runtime behavior.
I’d be interested to see that code discussed in context of the blog post GP linked, which seems to make a compelling argument.
I am enjoined from providing that, and it’d be idiotic to risk my career for an HN ****-measuring contest. If one can’t understand these concepts without example code then this probably isn’t a discussion one can meaningfully contribute to.
Not being able to envision how it is in fact possible to write code with these invariants encoded in the type system is a fundamental fault in one’s ability to reason about this topic, and software correctness in general, in the first place.
> Not being able to envision how it is in fact possible to write code with these invariants encoded in the type system is a fundamental fault in one’s ability to reason about this topic, and software correctness in general, in the first place.
Code proving that it’s possible to avoid branching into an abort (the concept, not necessarily the syscall) was not what the original GP requested. Nor was a copy of your employer’s IP. Published examples which demonstrate how real-world code which intentionally calls panic() could be better written otherwise was my interpretation of the request.
And I’m requesting that, too, because I am interested in learning more about it! Please don’t assume I’m asking out of inexperience with safety critical systems, dick-measuring, faulty reasoning ability, or unfamiliarity with using type systems to avoid runtime errors (where—and this is the source of this discussion—practical and appropriate). If you work on your tone, that would make it much easier to have educating discussions in contexts like this.
> Result declares a type-level invariant — an assertion enforced by the compiler, not runtime — that the operation can fail.
“Can do X” is not an invariant. “Will never do X” (or “Will always do Y”) is an invariant. “Can do X” is the absence of the invariant “Will never do X”.
> Using `.unwrap()` is always an example of a failure to accurately model your invariants in the type system.
No, using .unwrap() provides a narrower invariant to subsequent code by choosing to crash the process via a panic if the Result contains an Error.
It may be a poor choice in some circumstances, and it may be the result of mistakenly believing that the code returning the Result had failed to represent its invariants fully, such that the .unwrap() would be a no-op. But even there it respects and narrows the declared invariant rather than ignoring it, and, in any case, since it has well-defined behavior in either of the possible input cases, it is silly to describe using it as a failure to accurately model invariants in the type system.
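For readers following along, here is roughly the behavior being described, written out as an explicit match. It is a sketch of unwrap's semantics, not the exact standard-library source:

    // Both arms are well-defined; one of them just happens to be a panic.
    fn unwrap_sketch<T, E: std::fmt::Debug>(r: Result<T, E>) -> T {
        match r {
            Ok(value) => value, // code after the call sees the narrowed invariant: a T
            Err(e) => panic!("called `unwrap()` on an `Err` value: {e:?}"),
        }
    }

    fn main() {
        let parsed: Result<u32, std::num::ParseIntError> = "7".parse();
        println!("{}", unwrap_sketch(parsed)); // prints 7
        // unwrap_sketch("oops".parse::<u32>()); // would panic, by design
    }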
“Narrowing” a compile-time invariant without a corresponding proof is formally unsound and does not “respect” the declared invariant in any reasonable sense.
What’s silly is the desire to pretend otherwise because it’s easier.
> “Narrowing” a compile-time invariant without a corresponding proof is formally unsound and does not “respect” the declared invariant in any reasonable sense
The invariant is that either condition X applies or condition Y applies. "Panic and stop execution if X, continue execution with the invariant Y if Y" is not unsound and does respect the original invariant in every possible sense.
It may be the wrong choice of behavior given the frequency of X occurring and the costs incurred by the decision to panic, but that’s not a type-level problem.
Formal verification is well and good, but that is not what unsoundness means.
If a proof trivially demonstrated that a given program’s behavior was indeed “proceed if a condition is satisfied, crash otherwise”, then what? Or do we not trust the verifier with branching code all of a sudden?
Also, in this use case, catching the panic and completely forgetting that the function was ever called in the first place is completely acceptable. In web frameworks such as Dioxus/Axum, if your user's request causes a panic it does not bring down the whole web server; it just invalidates that specific request.
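For illustration, a minimal, framework-free sketch of that isolation using only std's catch_unwind; the handler and messages here are made up, and it assumes the default panic=unwind strategy (frameworks like the ones mentioned do roughly this in middleware):

    use std::panic::{self, AssertUnwindSafe};

    // Hypothetical handler that may panic somewhere deep inside.
    fn handle_request(input: &str) -> String {
        let n: u64 = input.parse().unwrap(); // panics on bad input
        format!("parsed {n}")
    }

    // Isolate the panic to this one request instead of the whole process.
    fn serve(input: &str) -> String {
        match panic::catch_unwind(AssertUnwindSafe(|| handle_request(input))) {
            Ok(body) => body,
            Err(_) => "500 Internal Server Error".to_string(),
        }
    }

    fn main() {
        println!("{}", serve("42"));
        println!("{}", serve("not a number")); // this request fails, the process keeps running
    }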
Rust needs to get rid of .unwrap() and its kin. They're from pre-1.0 Rust, before many of the type system features and error handling syntax sugar were added.
There's no reason to use them as the language provides lots of safer alternatives. If you do want to trigger a panic, you can, but I'd also ask - why?
Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures. Any function that then calls a panicky function needs to be similarly marked. In doing this, we can statically be certain no such methods are called if we want to be rid of the behavior.
Perhaps something like:
    panic fn my_panicky_function() {
        None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib
    }

    fn my_safe_function() {
        // with a certain compiler or Crates flag, this would fail to compile
        // as my_safe_function isn't annotated as `panic`
        my_panicky_function()
    }
The ideal future would be to have code that is 100% panic free.
> There's no reason to use [panics] as the language provides lots of safer alternatives.
Dunno ... I think runtime assertions and the ability to crash a misbehaving program are a pretty important part of the toolset. If rust required `Result`s to be wired up and down the entire call tree for the privilege of using a runtime assertion, I think it would be a lot less popular, and probably less safe in practice.
> Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures.
I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.
> a mechanism to prove that code can or cannot panic would be great
As appealing as the idea of a #[cfg(nopanic)] enforcement mechanism is, I think linting for panic() is the optimum, actually.
With a more rigidly enforced nopanic guarantee, I worry that some code and coders would start to rely on it (informally, accidentally, or out of ignorance) as a guarantee of completion, not return behavior. And that’s bad; adding language features which can easily be misconstrued to obscure the fact that all programs can terminate at any time is dangerous.
Lints, on the other hand, can be loud and enforced (and tools to recursively lint source-available dependencies exist), but few people mistake them for runtime behavior enforcement.
> I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.
In one of the domains I work in, a malloc failure and OOMkill are equivalent. We just restart the container. I've done all the memory pressure measurement ahead of time and reasonably understand how the system will behave under load. Ideally it should never happen because we pay attention to this and provision with lots of overhead capacity, failover, etc. We have slow spillover rather than instantaneous catastrophe. Then there's instrumentation, metrics, and alerting.
A surprise bug in my code or a dependency that causes an unexpected panic might cause my application or cluster to restart in ways we cannot predict or monitor. And it can happen across hundreds of application instances all at once. There won't be advance notice, and we won't have a smoking gun. We might waste hours looking for it. It could be as simple as ingesting a pubsub message and calling unwrap(). Imagine a critical service layer doing this all at once, which in turn kills downstream services, thundering herds of flailing services, etc. - now your entire company is on fire, everyone is being paged, and folks are just trying to make sense of it.
The fact is that the type of bugs that might trigger a user-induced panic might be hidden for a long time and then strike immediately with millions of dollars of consequences.
Maybe the team you implemented an RPC for six months ago changes their message protocol by flipping a flag. Or maybe you start publishing keys with encoded data center affinity bytes, but the schema changed, and the library that is supposed to handle routing did an unwrap() against a topology it doesn't understand - oops! Maybe the new version handles it, but you have older versions deployed that won't handle it gracefully.
These failures tend to sneak up on you, then happen all at once, across the entire service, leaving you with no redundancy. If you ingest a message that causes every instance to death spiral, you're screwed. Then you've got to hope your logging can help you find it quickly. And maybe it's not a simple roll back to resolve. And we know how long Rust takes to build...
The best tool for this surely can't be just a lint? In a supposedly "safe" language? And with no way to screen dependencies?
Just because somebody's use case for Rust is okay with this behavior doesn't mean everyone's tolerates this. Distributed systems folks would greatly appreciate some control over this.
All I'm asking for is tools to help us minimize the surface area for panics. We need as much control over this as we can get.
If you replace panic with a bespoke fallback or retry, have you really gained anything? You can still have all your services die at the same time, and you'll have even less of a smoking gun since you won't have a thousand stack traces pointing at the same line.
The core issue is that resilience to errors is hard, and you can't avoid that via choice of panic versus non-panic equivalents.
All that means is that the `Failure` bubbles up to the very top of `main` (in this scenario) because we're only caring about the happy path (because we can't conceive of what the unhappy path should be other than "crash") and then hits the `panic("Well, that's unexpected")` explicitly in Place B rather than Place A (the `.unwrap`). I'm not sure how that's _better_.
It would not because it would be a compile time error rather than run time error which is a completely different beast if I understand the argument correctly.
What would be a compile time error? The compiler rejecting unwrap? And then you fix that by bubbling the error case up, which fixes the compiler error and leaves you with a runtime error again. But one that's less ergonomic.
You can't force a config file loaded at run time to be correct at compile time. You can only decide what you're going to do about the failure.
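To illustrate, a small hypothetical sketch of the "bubble it up" alternative: the `?` operator moves the decision to main, but when nothing along the way knows what else to do, the process still stops; the failure moved, it didn't disappear.

    use std::num::ParseIntError;

    fn parse_threshold(raw: &str) -> Result<u32, ParseIntError> {
        raw.trim().parse()
    }

    // Returning Result from main prints the error and exits non-zero,
    // which is operationally similar to the panic from an unwrap.
    fn main() -> Result<(), ParseIntError> {
        let threshold = parse_threshold("not a number")?;
        println!("threshold = {threshold}");
        Ok(())
    }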
The point they are apparently trying to make is that if you had a flag or an annotation declaring that you do not want a function to be built on top of anything that can 'unwrap', you could rule out some of these cases of unexpected side effects.
Not really. Handler and middleware can handle this without much ceremony. The user gets to, and is informed of and encouraged to, choose.
We also don't get surprised at runtime. It's in the AST and we know at compile time.
The right API signature helps the engineer think about things and puts them in the correct headspace for systems thinking. If something is panicking under the hood, the thought probably doesn't even occur to them.
Yes, but my point is that without a reasonable supervision tree and crash boundary the difference between a composition of Result-returning functions that bottoms out in main's implicit panic and an explicit panic is nil operationally.
While lexically the unwrap actually puts the unhandledness of the error case as close to the source of the issue as possible. In order to get that lexical goodness you'd need something much more fine-grained than Result.
> I’ve been seeing you blazing this trail since the incident and it feels short-sighted and reductive.
Why is it inappropriate to be able to statically label the behavior?
Maybe I don't want my failure behavior dictated by a downstream dependency or distracted engineer.
The subject of how to fail is a big topic and is completely orthogonal to the topic of how can we know about this and shape our outcomes.
I would rather the policy be encoded with first class tools rather than engineering guidelines and runbooks. Let me have some additional control at what looks like to me not a great expense.
It doesn't feel "safe" to me to assume the engineer meant to do exactly this and all of the upstream systems accounted for it. I would rather the code explicitly declare this in a policy we can enforce, in an AST we can shallowly reason about.
How deep do you go? Being forced to label any function that allocates memory with ”panic”?
Right now, all the instances where the code can panic are labeled. Grep for unwrap, panic, expect, etc.
In all my years of professional Rust development I’ve never seen a potential panic pass code review without a discussion. Unless it was trivial like trying to build an invalid Regex from a static string.
You probably know about these, but for the benefit of folks who don't, you can forbid slice access and direct unwraps with clippy. Obviously this only lints your own code and not dependencies.
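For reference, a sketch of what that looks like at the crate root. These are real clippy lint names; whether to deny or merely warn is a team policy choice, and they only fire under `cargo clippy`:

    // In lib.rs or main.rs.
    #![deny(clippy::unwrap_used)]      // .unwrap() on Option/Result
    #![deny(clippy::expect_used)]      // .expect(...)
    #![deny(clippy::indexing_slicing)] // slice[i] and friends
    #![deny(clippy::panic)]            // explicit panic!()

    fn parse(input: &str) -> Option<u32> {
        // input.parse().unwrap() // would now fail `cargo clippy`
        input.parse().ok()
    }

    fn main() {
        println!("{:?}", parse("42"));
    }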
So slicing is forbidden in this scheme? But not malloc?
This doesn’t seem to be a principled stance on making the language safer. It feels a bit whack-a-mole. “Unwrap is pretty easy to give up. I could live without slicing. Malloc seems hard though. I don’t want to give that up.”
Malloc is fine. We can and do monitor that. It's these undetectable runtime logic problems that are land mines.
In distributed systems, these can cause contagion and broad outages. Recovering can be very difficult and involve hours of complex steps across dozens of teams. Meanwhile you're losing millions, or even hundreds of billions, of dollars for you and your customers.
Someone unwrapping() a Serde wire message or incorrectly indexing a payload should not cause an entire fleet to crash. The tools should require the engineer handle these problems with language features such as Result<>.
Presently, who knows if your downstream library dependency unwrap()s under the hood?
This is a big deal and there could be a very simple and effective fix.
The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.
I don’t actually work in Rust. I think I understand what you’re going for, though. The choice to use panic as a way of propagating errors is fundamentally problematic when it can arise from code you don’t control and potentially cannot even inspect.
I don’t necessarily agree that malloc should be okay (buggy code could try to allocate a TB of memory and OOMKiller won’t fix it) but I can understand that it’s probably workable in most cases.
Unfortunately I think the fix here would require a compatibility break.
The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.
Several of the outages I've been involved in were the result of NPEs or incorrectly processing runtime data. Rust has tools to enforce safety here, but it doesn't have tools to enforce your use of them. It also doesn't have a way to safeguard you from others deciding the behavior for you.
There is potentially a very easy set of non-onerous features we could build that allow us to prevent this.
Except that the outage would still have happened without that .unwrap(). So go ahead and build those features, they sound useful, but don't think that they'd save you from a failure like this.
As the poster here said, the place to build in features that would have prevented this from happening is the DB schema and queries. 5NF would be onerous overkill here, but it seems reasonable to have some degree of forced normalization for something that could affect this much.
(Requiring formal verification of everything involved here would be overkilling the overkill, otoh.)
Unwrap is not only fine, it's a valuable part of the language. Getting rid of it would be a horrible change. What needs to happen is not using an assert (which is really what unwrap is) if an application can't afford to crash.
That's fair, but even there the roll-back would be a lot smoother, besides the supervisor trees are a lot more fine grained than restarting entire containers when they fail.
I’m not sure that panic (speaking generally about the majority of its uses and the spirit of the law; obviously 100% of code does not obey this) is the equivalent of an Erlang process crash in most cases. Rather, I think unwrap()/panic are usually used in ways more similar to erlang:halt/1.
Exactly, but that is kind of the point here. An Erlang 'halt' is something that most Erlang programmers would twig is not what you want in most cases, in most cases you want your process to crash and for the supervisor to restart it if the error is recoverable.
What happened here is systemic: the config file contained an issue severe enough that it precluded the system from running in the first place and unfortunately that caused a runtime error when in fact that validation should have been separate from the actual use. This is where I see the problem with this particular outage. And that makes it an engineering issue much more than a language issue.
Bad configuration files can and do happen, so you take that eventuality into account during systems design.
I'm on libs-api. We will never get rid of unwrap(). It is absolutely okay to use unwrap(). It's just an assertion. Assertions appear in critical code all the time, including the standard library. Just like it's okay to use `slice[i]`.
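A small sketch of the kind of use being defended here (a hypothetical function, not anything from std): the invariant is established two lines earlier, so the panic branch is unreachable and the unwrap acts as a runtime assertion of that reasoning.

    fn first_word(s: &str) -> &str {
        let trimmed = s.trim();
        if trimmed.is_empty() {
            return "";
        }
        // Non-empty after trim, so split_whitespace yields at least one item;
        // the unwrap documents and enforces that reasoning.
        trimmed.split_whitespace().next().unwrap()
    }

    fn main() {
        assert_eq!(first_word("  hello world "), "hello");
        assert_eq!(first_word("   "), "");
    }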
You can keep unwrap() and panics. I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.
I have personally been involved in nearly a billion dollars of outages myself and am telling you there are simple things the language can do to help users purge their code of this.
This is a Rust foot gun.
A simple annotation and compiler flag to disallow would suffice. It needs to handle both my code and my dependencies. We can build it ourselves as a hack, but it will never be 100% correct.
> I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.
Now this is absolutely a reasonable request. But it's not an easy one to provide depending on how you go about it. For example, I'd expect your suggestion in your other comment to be a non-starter because of the impact it will have on language complexity. But that doesn't mean there isn't a better way. (I just don't know what it is.)
This is a classic motte and bailey. You come out with a bombastic claim like "remove unwrap and its ilk," but when confronted, you retreat to the far more reasonable, "I just want tools to detect and prevent panicking branches." If you had said the latter, I wouldn't have even responded to you. I wouldn't have even batted an eye.
> This is the Hundred Billion Dollar unwrap() Bug.
The Cloudflare bug wasn't even caused by unwrap(). unwrap() is just its manifestation. From a Cloudflare employee:
> In this case the unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.
> unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.
This sounds like the kind of failure Bobby Tables warned about a long time ago. An entire new, safe language was developed to prevent these kinds of failures. “If it compiles it’s probably correct” seems to be the mantra of rust. Nuts.
The fact that this wasn't RCE or anything other than denial of service is a raging success of Rust.
“If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration. I heard it among Haskell programmers long before I heard it in the context of Rust. And guess what? Haskell programs have bugs too.
> “If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration.
If you say so, I believe you. That isn’t how it comes across in daily, granted pithy, discourse around here.
I have a lot of respect for you Andrew, not meaning to attack you per se. You surely can see the irony in the internet falling over because of an app written in rust, and all that comes with this whole story, no?
Nope. Because you've completely mischaracterized not only the actual problem here, but the value proposition of Rust. You're tilting at windmills.
Nobody credible has ever said that Rust will fix all your problems 100% of the time. If that's what you inferred was being sold based on random HN commentary, then you probably want to revisit how you absorb information.
Rust has always been about reducing bugs, with a specific focus on bugs as a result of undefined behavior. It has never, isn't and will never be able to eliminate all bugs. At minimum, you need formal methods for that.
Rust programs can and will have bugs as a result of undefined behavior. The value proposition is that their incidence should be markedly lower than programs written in C or C++ (i.e., implementations of languages that are memory unsafe by default).
What about “detect when the content isn’t correct and take protective measures so that a core service of the global internet _doesn’t_ crash?” Wasn’t that the whole point of rust? I’ll repeat again “if it compiles it is almost absolutely correct” is a mantra I see on hn daily.
Apparently that isn’t true.
Edit: isn’t the whole idea of C/C++ being flawed pivoted around memory management and how flawed the languages are? Wasn’t the whole point of rust to eliminate that whole class of errors? XSS and buffer overflows are almost always caused by “malformed” outside input. Rust apparently doesn’t protect against that.
If you corrupt memory, a huge variety of unpredictable bad things can happen.
If you exit, a known bad thing happens.
No language can protect you from a program's instructions being broken. What protective measures do you have in mind? Do they still result in the service ceasing to process data and reporting a problem to the central controller? The difference between "stops working and waits" and "stops working and calls abort()" is not much, and usually the latter is preferred because it sets off the alarms faster.
Tell me what specifically you want as correct behavior in this situation.
I would expect such a critical piece of code to be able to hot-load and validate a new configuration before it is put into action. I would expect such a change to be rolled out gradually, or at least as gradually as required to ensure that it functions properly before it is able to crash the system wholesale.
I can't say without a lot more knowledge about the implementation and the context what the best tools would be to achieve this but I can say that crashing a presently working system because of a config fuckup should not be in the range of possible expected outcomes.
Because config fuckups are a fact of life so config validation before release is normal.
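A hedged sketch of that shape, with all names hypothetical: parse and validate the new config as a separate step, and keep serving with the last known-good one if the new file is rejected.

    #[derive(Debug)]
    struct Config {
        threshold: u32,
    }

    fn parse_config(raw: &str) -> Result<Config, String> {
        let threshold: u32 = raw.trim().parse().map_err(|e| format!("bad threshold: {e}"))?;
        if threshold == 0 {
            return Err("threshold must be non-zero".to_string());
        }
        Ok(Config { threshold })
    }

    // Validation is separate from use: a bad file never reaches the hot path.
    fn reload(current: Config, raw: &str) -> Config {
        match parse_config(raw) {
            Ok(new) => new,
            Err(err) => {
                eprintln!("rejecting new config, keeping last known-good: {err}");
                current
            }
        }
    }

    fn main() {
        let mut cfg = Config { threshold: 100 };
        cfg = reload(cfg, "250");     // accepted
        cfg = reload(cfg, "garbage"); // rejected, still running with threshold 250
        println!("active config: {cfg:?}");
    }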
> I would expect such a critical piece of code to be able to hot-load and validate a new configuration before it is put into action.
And if that config doesn't validate, what should the process do? Maybe it had a previous config, maybe it didn't. And if it keeps running the old config, that adds extra complication to gradual rollout and makes it harder to understand what state the system is in.
> I would expect such a change to be rolled out gradually, or at least as gradually as required to ensure that it functions properly before it is able to crash the system wholesale.
Me too. Note that doing a gradual rollout doesn't care whether the process uses unwrap or uses something gentler to reject a bad config.
> I can say that crashing a presently working system because of a config fuckup should not be in the range of possible expected outcomes.
By "working system" do you mean the whole thing shouldn't go down, or the single process shouldn't go down? I agree with the former but not the latter.
But the operative point in this sub thread is whether unwrap() specifically is load bearing.
If instead they bubbled up the error, printed it and then exited the program---without ever using unwrap---then presumably they still would have had a denial of service problem as a result of OOM.
And even if unwrap were load bearing here, then we would be in agreement that it was an inappropriate use of unwrap. But we are still nowhere near saying "unwrap should literally never be used in production."
The big trick is - as far as I understand it - to acknowledge that systems fail and to engineer for dealing with those failures.
I support your efforts downthread for at least knowing whether or not underlying abstractions are able to generate a panic (which is a massive side effect) or are only able to return valid results or error flags. The higher level the abstraction the bigger the chance that there is a module somewhere in the stack that is able to blow it all up, at the highest level you can pretty much take it as read that this is the case.
So unless you engineer the whole thing from the ground up without any library modules, it is impossible to guarantee that this is not the case. As far as I understand your argument, you want to at least be informed when that is the case, or, alternatively, to have the compiler flag the situation as incompatible with the guarantees you are asking for from your code. Is that a correct reading?
I would guess at the individual team level they probably still behave like any other tech shop. When the end of the year comes the higher-ups still expect fancy features and accomplishments and saying "well, we spent months writing a page of TLA+ code" is not going to look as "flashy" as another team who delivered 20 new features. It would take someone from above to push and ask that other team who delivered 20 features, where is their TLA+ code verifying their correctness. But, how many people in the middle management chain would do that?
We need modern programming languages with formal verification built in - it should be applicable to specially demarcated functions/modules. It is a headache to write TLA+ and keep the independent spec up to date with the production code.
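The closest approximations today bolt a verifier onto the language rather than building it in. As a hedged sketch, this is the shape of a Kani-style proof harness sitting next to ordinary code; treat the attributes and `kani::any()` as illustrative of the approach rather than a tutorial:

    // Ordinary production code.
    fn clamp_to_limit(n: u32, limit: u32) -> u32 {
        if n > limit { limit } else { n }
    }

    // The demarcated proof is only compiled for the verifier, which checks
    // the assertion over every possible pair of inputs.
    #[cfg(kani)]
    #[kani::proof]
    fn clamp_never_exceeds_limit() {
        let n: u32 = kani::any();
        let limit: u32 = kani::any();
        assert!(clamp_to_limit(n, limit) <= limit);
    }

    fn main() {
        println!("{}", clamp_to_limit(500, 200)); // 200
    }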