Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> This is the multi-million dollar .unwrap() story.

That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.



>If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.

If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.


> If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.

It wouldn't, not meaningfully. The outage was caused by change in how they processed the queries. They had no way to observe the changes, nor canaries to see that change is killing them. Plus, they would still need to manually feed and restart services that ingested bad configs.

`expect` would shave a few minutes; you would still spend hours figuring out and fixing it.

Granted, using expect is better, but it's not a silver bullet.


A billion alerts in DD/Sentry/whatever saying the exact problem that coincide with the exact graph of failures would probably be helpful if someone looked at them.


Not if everyone decides to 's/.unwrap()/.expect("Shouldn't happen")/g' in the code base.

Or the good old:

    let x = match res { 
       Ok(x) => x,
       Err(_) => unreachable!(),
    }


If the `.unwrap()` was replaced with

1:

  ?
2:

  map_err, or, or_else, etc.
3:

  match ... { 
    Ok(..) => {}, 
    Err(..) => {},
  }
4:

  if let ... {
  } 
Then it would have been idiomatic Rust code and wouldn't have failed at all.

The function signature returned a `Result<(), (ErrorFlags, i32)>`

Seems like it should have returned an Err((ErrorFlags, i32)) here. Case 2 or 3 above would have done nicely.

Removing unwrap() from Rust would have forced the proper handling of the function call and would have prevented this.

Unwrap() is Rust's original sin.


In general for unexpected errors like these the internal function should log the error, and I assume it was either logged or they can quickly deduce reason based on the line number.


And dare I say, an exhibition of hindsight bias.


semantic? or pedantic?


[flagged]


I'm not sure if this is serious or not, but to take it at face value: the value of this sort of thing in Rust is not that it prevents crashes altogether but rather that it prevents _implicit_ failures. It forces a programmer to make the explicit choice of whether to crash.

There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.


The language semantics do not surface `unwrap` usage or make any guarantees. It should be limited to use in `unsafe` blocks.

> There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.

I would prefer the boiler plate of a match / if-else / if let, etc. to call attention to it. If you absolutely must explicitly panic. Or better - just return an error Result.

It doesn't matter how smart your engineers are. A bad unwrap can sneak in through refactors, changing business logic, changing preconditions, new data, etc.


Restricting unwrap to unsafe blocks adds negative value to the language. It won't prevent unwrap mistakes (people who play fast and loose with it today will just switch to "foo = unsafe { bar.unwrap() };" instead). And it'll muddy the purpose of unsafe by adding in a use that has nothing to do with memory safety. It's not a good idea.


> And it'll muddy the purpose of unsafe by adding in a use that has nothing to do with memory safety.

Then we need more safety semantics around panic behavior. A panic label or annotation that infects every call.

Moreover, I want a way of statically guaranteeing none of my dependencies do this.


> It should be limited to use in `unsafe` blocks.

That would be a fairly significant expansion of what `unsafe` means in Rust, to put it lightly. Not to mention that I think doing so would not really accomplish anything; marking unwrap() `unsafe` would not "surface `unwrap` usage" or "make any guarantees", as it's perfectly fine for safe functions to contain `unsafe` blocks with zero indication of such in the function signature and.


> fairly significant expansion of what `unsafe` means in Rust

I want an expansion of panic free behavior. We'll never get all the way there due to allocations etc., but this is the class of error the language is intended to fix.

This turned into a null pointer, which is exactly what Rust is supposed to quench.

I'll go as far as saying I would like to statically guarantee none of my dependencies use the unwrap() methods. We should be able to design libraries that provably avoid panics to the greatest extent possible.

Unwrap is an easy loss on a technicality.


> I want an expansion of panic free behavior.

Sure, and I'd hardly be one to disagree that a first-party method to guarantee no panics would be nice, but marking unwrap() `unsafe` is definitely not an effective way to go about it.

> but this is the class of error the language is intended to fix.

Is it? I certainly don't see any memory safety problems here.

> This turned into a null pointer, which is exactly what Rust is supposed to quench.

There's some subtlety here - Rust is intended to eliminate UB due to null pointer dereferences. I don't think Rust was ever intended to eliminate panics. A panic may still be undesirable in some circumstances, but a panic is not the same thing as unrestricted UB.

> We should be able to design libraries that provably avoid panics to the greatest extent possible.

Yes, this would be nice indeed. But again, marking unwrap() `unsafe` is not an effective way to do so.

dtolnay's no_panic is the best we have right now IIRC, and there are some prover-style tools in an experimental stage which can accomplish something similar. I don't think either of those are polished enough for first-party adoption, though.


blaming the language is not the way to approach this. if an engineer writes bad code that’s the engineers fault, not the languages.

this was bad code that should have never hit production, it is not a rust language issue.


No. Don't say "you're holding it wrong". The language says "safe" on the tin. It advertises safety. This shouldn't be possible.

This is a null pointer. In Rust.

Unwrap needs to die. We should all fight to remove it.


> The language says "safe" on the tin. It advertises safety.

Rust advertises memory safety (and other closely related things, like no UB, data race safety, etc.). I don't think it's made any promises about hard guarantees of other kinds of safety.


panics are safe, what are you talking about? It is nothing like a null pointer.


you either misunderstand the rust ethos or are intentionally misrepresenting it.

safe refers to memory safety.

once again, if you write bad code, that’s your fault, not the languages. this is a feature of rust that was used incorrectly.


Rust has grown beyond its original design as a "memory safe" language. People are using this as an HTTP/RPC server programming language now. WASM serverless jobs, etc. Rust has found itself deployed in a lot of unexpected places.

These folks are not choosing Rust for the memory safety guarantees. They're choosing Rust for being a fast language with a nice type system that produces "safe" code.

Rust is widely known for producing relatively defect-free code on account of its strong type system and ergonomics. Safety beyond memory safety.

Unwrap(), expect(), and their kin are a direct affront to this.

There are only two uses cases for these: (1) developer laziness, (2) the engineer spent time proving the method couldn't fail, but unfortunately they're not using language design features that allow this to be represented in the AST with static guarantees.

In both of these cases, the engineer should instead choose to (1) pass the Result<T,E> or Option<T> to the caller and let the caller decide what to do, (2) do the same, but change the type to be more appropriate to the caller, (3) handle it locally so the caller doesn't have to deal with it, (4) silently turn it into a success. That's it. That's idiomatic Rust.

This should be concerning to everyone:

https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

I'm now panicked (hah) that some dependency of mine will unwrap something and panic at runtime. That's entirely invisible to users. It's extremely dangerous.

Today a billion people saw the result of this laziness. It won't be the last time. And hopefully it never happens in safety-critical applications like aircraft. But the language has no say in this because it isn't taking a stand against this unreasonably sharp edge yet. Hopefully it will. It's a (relatively) easy fix.


that's actually a very good point, fair enough, i agree.

regretfully i'm not sure if such a big language change can be made; though it would be nice.

here's to hoping!


>> This is the multi-million dollar .unwrap() story.

> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.

Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.


> Instead, the routine likely should have worked within `Result`.

But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.

Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.

But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.


>> Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.

> But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.

I agree it is an error, but disagree that it should be a fatal error at that location. The reason being is the method defining the offending `unwrap` construct produces a `Result`, which is fully capable of representing any error `features.append_with_names` could produce.

> But that has nothing to do with the actual bug here, which was that the invariant failed.

The bug is by invoking `unwrap` the process crashed. To the degree that Cloudfare had a massive outage.

Had the logic been such that a `Result` representing this error condition activated an alternate workflow to handle the error (perhaps by logging it, emitting a notification event alerting SRE's, transitioning into a failure mode, or all of these options), then a global outage might have been averted.

Which makes:

> How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.

Very relevant indeed.


A failed config load probably shouldn't be a fatal error if a valid config is already loaded?


Hard to say. Why would you load a new config if a valid config is already loaded?

Maybe the new config has a new update. Who knows? Do we want to keep operating on the old config? Maybe maybe not.

But operating on old config when you don't want to is definitely worse.


Of course it depends on the situation. But I don't see how you could think that in this case, crashing is better than stale config.

Crashing on a config update is usually only done if it could cause data corruption if the configs aren't in sync. That's obviously not the case here since the updates (although distributed in real time) are not coupled between hosts. Such systems usually are replicated state machines where config is totally ordered relative to other commands. Example: database schema and write operations (even here the way many databases are operated they don't strongly couple the two).


Because stale config could easily go unnoticed for a long time.

Crashing is generally better than behaving incorrectly due to stale configs. Because the problem would get fixed faster.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: