> I wonder if there's a concern that staggering the malware signatures would open them up to lawsuits if somebody was hacked in between other customers getting the data and them getting the data.
I'd assume that sort of thing would be covered in the EULA and contract -- but even if it weren't, it seems like allowing customers to define their own definition update strategy would give them a pretty compelling avenue to claim non-liability. If CrowdStrike can credibly claim "hey, we made the definitions available, you chose to wait for 2 weeks to apply them, that's on you", then it becomes much less of a concern.
Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.
But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them into production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT-type environment. Core OS updates, firmware updates, third-party software, whatever -- all of it would get at least some cursory smoke testing before being allowed to hit production.
On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.
I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?
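For illustration, the "kick the tires first" posture described above boils down to a simple gate: apply the update to UAT hosts, smoke test, and only then promote to production. A minimal Python sketch, where `apply_update` and `health_check` are stand-ins for whatever tooling a shop actually uses -- this is not any real CrowdStrike or vendor API:

```python
def promote_update(update_id, uat_hosts, prod_hosts, apply_update, health_check):
    """Apply an update to UAT first; promote to prod only if every UAT host passes.

    apply_update(host, update_id) and health_check(host) are caller-supplied
    hooks (hypothetical -- whatever your deployment tooling provides).
    Returns the list of prod hosts actually updated.
    """
    for host in uat_hosts:
        apply_update(host, update_id)

    # The gate: a bad update never leaves UAT.
    if not all(health_check(h) for h in uat_hosts):
        return []

    for host in prod_hosts:
        apply_update(host, update_id)
    return prod_hosts


# Usage sketch: simulate a broken update that fails the UAT smoke test.
applied = []
result = promote_update(
    "defs-2024-07-19",
    uat_hosts=["uat-a", "uat-b"],
    prod_hosts=["prod-a"],
    apply_update=lambda h, u: applied.append(h),
    health_check=lambda h: False,  # every UAT host is unhealthy
)
# result == [] and "prod-a" is never touched
```

The value isn't in the code, which is trivial; it's that the gate exists at all. The whole problem in this incident is that no such hook was available for channel files.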
> But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them into production
In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.
In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
"In which case I think the blame ultimately falls almost entirely on CrowdStrike"
I would say on the client for buying into CrowdStrike.
And also the client for having no contingencies and just accepting a vendor pinky-swear as meaningful.
CrowdStrike failed at their responsibilities too, I just mean that so did everyone else.
When you cede your own responsibilities to someone else without contractually enforced liability to make you whole when they fuck up, and without your own contingency so that what a vendor does doesn't really matter, that's on you. That's 100% entirely on you, and it doesn't matter if a million other people also did the same utterly thoughtless and lazy thing.
> I would say on the client for buying into CrowdStrike.
I understand this perspective, but I think it misses the forest for the trees. You have to evaluate this kind of stuff in context. Purity tests play well on tech message boards, where nobody has any accountability to any kind of business requirements, but basically no real-world organization operates that way, so it's all a bit irrelevant.
> When you cede your own responsibilities to someone else ...
This framing is a bit naive, I think. It isn't a boolean. Everything is about risk management, cost/benefit analysis.
> From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
Honestly, it hadn't even occurred to me that software like this, marketed at enterprise customers, wouldn't already have this kind of control available. It seems like such an obvious thing for any big organization to insist on that I just took it for granted it existed.
It seems nuts to me too - MS Defender has this out of the box. From looking at sysadmins on reddit, it seems that CS has a tiered update mechanism, but didn’t use it for this change.
> Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.
I used to work with regional parks and recreation departments, and they would not approve any updates that did not go through the UAT environments we had set up. All updates had to be deployed to their UAT and thoroughly tested before going to their production environment.
I get that this is slightly different, but I'd imagine airlines, banks, and hospitals would have far stricter UAT policies to keep a single vendor from kneecapping operations.
Checks out - my company had lots of issues on Friday afternoon, and when it first happened I wondered who on Earth decided to roll out updates to prod systems on Friday afternoon.
Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign. That isn't true if there's a bug in your code that only gets triggered by the right virus definition.
I don't see how you could assert that's impossible, hence channel files should be treated as code.
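The "data can trigger a code bug" point can be made concrete with a toy example. Nothing below reflects CrowdStrike's actual channel-file format; it's an invented comma-separated record format, sketched only to show how a malformed data file crashes a naive parser, and why validating definitions deserves the same rigor as testing code:

```python
def parse_signature(record: bytes) -> dict:
    """Naive parser: trusts that every record has at least 5 fields.

    A record that violates that assumption -- pure data, no code --
    raises IndexError, the scripting-language analogue of the
    out-of-bounds read that a kernel driver cannot survive.
    """
    fields = record.split(b",")
    return {"name": fields[0], "pattern": fields[4]}


def parse_signature_safe(record: bytes) -> dict:
    """Defensive parser: rejects malformed records instead of crashing."""
    fields = record.split(b",")
    if len(fields) < 5:
        raise ValueError(f"malformed signature record: {record!r}")
    return {"name": fields[0], "pattern": fields[4]}
```

In user space the difference is an exception versus an error message; in a boot-critical kernel component, the naive version is a machine that won't start.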
I think point 3 of the grandparent indicates admins were not given an opportunity to test this.
My company had a lot of Azure VMs impacted by this, and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with CrowdStrike software on our VMs. (I think - I'm sure I'll find out this week.)
Edit: I just learned the Azure central region failure wasn't related to the larger event - and we weren't impacted by the CrowdStrike issue - I didn't know it was two different things. So the second part of my comment is irrelevant.
Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.
Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.
It's not an option. While the admins at the customer can control when/how revisions of the client software go out (and thus can, and generally do, do their own testing, can choose to stay one rev back by default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.
Which is also why you see every single customer affected - what you are suggesting simply isn't available to them at present.
At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.
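A staggered/ringed option like the one suggested above can be as simple as deploying to progressively larger rings and halting when the failure rate in the already-deployed set crosses a threshold. A hypothetical sketch -- the ring fractions and failure threshold are invented numbers, not anything CrowdStrike has announced:

```python
def ringed_rollout(hosts, deploy, failed,
                   max_failure_rate=0.01,
                   ring_fracs=(0.01, 0.1, 1.0)):
    """Deploy to progressively larger rings; halt if failures exceed threshold.

    deploy(host) pushes the update; failed(host) reports whether that host
    is now unhealthy. Both are caller-supplied hooks (hypothetical).
    Returns ("halted", n) or ("complete", n), where n is hosts deployed.
    """
    done = 0
    for frac in ring_fracs:
        target = max(1, int(len(hosts) * frac))
        for host in hosts[done:target]:
            deploy(host)
        done = target

        # Check the blast radius so far before widening it.
        bad = sum(1 for h in hosts[:done] if failed(h))
        if bad / done > max_failure_rate:
            return ("halted", done)
    return ("complete", done)


# Usage sketch: a universally bad update is stopped at the first ring,
# i.e. roughly 1% of the fleet instead of all of it.
fleet = [f"host-{i}" for i in range(100)]
outcome = ringed_rollout(fleet, deploy=lambda h: None, failed=lambda h: True)
# outcome == ("halted", 1)
```

Even the crudest version of this would have turned a global outage into a bad day for a few percent of machines, which is presumably why commenters above expected it to exist already.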