More

linsomniac · 2026-02-09T15:56:11 1770652571

>I humbly propose that point is today.

You're right that the argument will become boring, but I think it's gonna be a minute before it does so. I spent much of yesterday playing with the new "agent teams" experimental function of Claude Code, and it's pretty remarkable. It one-shotted a rather complex Ansible module (including packaging for release to galaxy), and built a game that teaches stock options learning, both basically one-shotted.

On Thursday I had a FAC with a coworker and he predicted 2026 is going to be the year of acceleration, and based on what I've seen over the last 2-3 years I'd say it's hard to argue that.

linsomniac · 2026-02-09T12:33:51 1770640431

An options trading game to help learn about stock options: https://trading.linsomniac.com/

A good friend of mine is a retired financial planner and is always talking about different ways I could leverage options to reduce risk in my portfolio. I understand the basics, but really don't "get" them, so I thought a game might help me to understand them better.

linsomniac · 2026-02-09T00:42:32 1770597752

Learning cribbage, my family has been learning cribbage and we are leaning hard on cribbage scoring cheat sheets, but haven't found a great one online. So I put together https://cribscore.linsomniac.com/

linsomniac · 2026-02-06T15:40:36 1770392436

You need to run it on a RISC architecture, but that would be too much machine for you.

jayd16 · 2026-02-06T16:25:52 1770395152

Yeah. RISC is good.

vq · 2026-02-06T15:45:21 1770392721

It may or may not change everything.

jghn · 2026-02-06T15:57:52 1770393472

only if you have a 28.8 bps [sic] modem!

drivers99 · 2026-02-06T19:17:34 1770405454

and triple the RAM

linsomniac · 2026-02-02T01:40:53 1769996453

It depends on how easily testable the Excel is. If Claude has the ability to run both the Excel and the Python with different inputs, and check the outputs, it's stunningly likely to be able to one-shot it.

AlotOfReading · 2026-02-02T01:48:37 1769996917

Something being simultaneously described as a "30 sheet, mind-numbingly complex Excel model" and "testable" seems somewhat unlikely, even before we get into whether Claude will be able to test such a thing before it runs into context length issues. I've seen Claude hallucinate running test suites before.

djeastm · 2026-02-02T11:31:16 1770031876

>I've seen Claude hallucinate running test suites before.

This reminded of something that happened to me last year. Not Claude (I think it was GPT 4.0 maybe?), but I had it running in VS Code's Copilot and asked it to fix a bug then add a test for the case.

Well, it kept failing to pass its own test, so on the third try, it sat there "thinking" for a moment, then finally spit out the command `echo "Test Passed!"`, executed it, read it from the terminal, and said it was done.

I was almost impressed by the gumption more than anything.

Merad · 2026-02-02T18:06:17 1770055577

I've been using Claude Code with Opus 4.5 a lot the last several months and while it's amazingly capable it has a huge tendency to give up on tests. It will just decide that it can commit a failing test because "fixing it has been deferred" or "it's a pre-existing problem." It also knows that it can use `HUSKY=0 git commit ...` to bypass tests that are run in commit hooks. This is all with CLAUDE.md being very specific that every commit must have passing tests, lint, etc. I eventually had to add a Claude Code pre-command hook (which it can't bypass) to block it from running git commit if it isn't following the rules.

theshrike79 · 2026-02-03T07:41:00 1770104460

Anecdata from the internet has a few stories of Claude Opus bypassing hooks too =)

1) it wants to run X command

2) it notices a hook preventing it from running X

3) it creates a Python application or shell script that does X and runs it instead

Whoops.

Merad · 2026-02-03T19:09:42 1770145782

I haven't seen it bypass my hook yet (knock on wood). I have my hook script [0] tell that its commits are required to pass validation, maybe that helps push it in the right direction?

0: https://github.com/mbcrawfo/vibefun/blob/main/.claude/hooks/...

martinald · 2026-02-02T01:55:29 1769997329

It compacted at least twice but continued with no real issues.

Anyway, please try it if you find it unbelievable. I didn't expect it to work FWIW like it did. Opus 4.5 is pretty amazing at long running tasks like this.

moregrist · 2026-02-02T02:11:22 1769998282

I think the skepticism here is that without tests or a _lot_ of manual QA how would you know that it did it correctly?

Maybe you did one or the other , but “nearly one-shotted” doesn’t tend to mean that.

Claude Code more than occasionally likes to make weird assumptions, and it’s well known that it hallucinates quite a bit more near the context length, and that compaction only partially helps this issue.

skybrian · 2026-02-02T06:02:10 1770012130

If you’re porting some formulas from one language to another, “correct” can be defined as “gets the same answers as before.” Assuming you can run both easily, this is easy to write a property test for.

Sure, maybe that’s just building something that’s bug-for-bug compatible, but it’s something Claude can work with.

gregoryl · 2026-02-02T08:10:37 1770019837

For starters, Python uses IEEE 754, and Excel uses IEEE 754 (with caveats). I wonder if that's being emulated.

stavros · 2026-02-02T02:19:26 1769998766

I generally agree with you, but I tried to get it to modernize a fairly old SaaS codebase, and it couldn't. It had all the code right there, all it had to do was change a few lines, upgrade a few libraries, etc, but it kept getting lots of things wrong. The HTML was wrong, the CSS was completely missing, basic views wouldn't work, things like that.

I have no idea why it had so much trouble with this generally easy task. Bizarre.

rk06 · 2026-02-02T04:54:50 1770008090

where exactly have you seen excel forumalas to have tests?

I have, in my early careers, gone knee deep into Excel macros and worked on c# automation that will create excel sheet run excel macros on it and then save it without the macros.

in the entire process, I saw dozens of date time mistakes in VBA code, but no tests that would catch them...

datsci_est_2015 · 2026-02-02T03:14:33 1770002073

And also - who understands the system now? Does anyone know Python at this shop? Is it someone’s implicit duty to now learn Python, or is the LLM now the de facto interface for modifying the system?

When shit hits the fan and execs need answers yesterday, will they jump to using the LLM to probabilistically make modifications to the system, or will they admit it was a mistake and pull Excel back up to deterministically make modifications the way they know how?

theshrike79 · 2026-02-03T07:44:14 1770104654

At least you can literally buy books that tell you how to test Python code and confirm it works as expected.

When shit hits the fan, execs need answers yesterday and the 30 sheet Excel monstrosity is producing the wrong numbers - who fixes it?

It was done by Sue, who left the company 4 years ago, people have been using it since and nobody really understands it.

datsci_est_2015 · 2026-02-03T18:56:30 1770144990

Alright so it is implicit that it is someone’s duty to learn Python. Don’t get me wrong, Python is better than Excel in a million ways, but not many companies are competent at hiring people who are competent at Python - including software companies from my personal experience hah.

I’ve also heard plenty of horror stories of bus factor employees leaving (or threatening to leave) behind an excel monstrosity and companies losing 6 months of sales, so maybe there’s a win for AI somewhere in there.

martinald · 2026-02-02T01:43:56 1769996636

That's exactly what it did (author here).

majormajor · 2026-02-02T01:49:32 1769996972

I'm having trouble reconciling "30 sheet mind numbingly complicated Excel financial model" and "Two or three prompts got it there, using plan mode to figure out the structure of the Excel sheet, then prompting to implement it. It even added unit tests to the Python model itself, which I was impressed with!"

"1 or 2 plan mode prompts" to fully describe a 30-sheet complicated doc suggests a massively higher level of granularity than Opus initial plans on existing codebases give me or a less-than-expected level of Excel craziness.

And the tooling harnesses have been telling the models to add testing to things they make for months now, so why's that impressive or suprising?

martinald · 2026-02-02T01:53:46 1769997226

No it didn't make a giant plan of every detail. It made a plan of the core concepts and then when it was in implementation mode it kept checking the excel file to get more info. It took around ~30 mins in implementation mode to build it.

I was impressed because the prompt didn't ask it to do that. It doesn't normally add tests for me without asking, YMMV.

majormajor · 2026-02-02T01:56:56 1769997416

Ah, I see.

Did it build a test suite for the Excel side? A fuzzer or such?

It's the cross-concern interactions that still get me.

80% of what I think about these days when writing software is how to test more exhaustively without build times being absolute shit (and not necessarily actually being exhaustive anyway).

chrisjj · 2026-02-02T10:09:13 1770026953

Different inputs? With no understanding of the program, how is Claude going to determine what input set is sufficient.

Tell me if I am wrong, but surely Claude cannot even access execution coverage.

catlifeonmars · 2026-02-02T05:29:35 1770010175

You touched on Kolmogorov complexity there :)

linsomniac · 2026-02-01T15:52:23 1769961143

I'd say no, but it really depends on what your use is. The biggest barrier is that it doesn't have a HA story that I'm aware of, but you might be able to get one by carefully replicating the sqlite and using something like pacemaker to fail over and fail back.

That said, I've been using headscale on 220 devices for ~3.5 years now and it's been quite reliable.

linsomniac · 2026-01-30T12:10:00 1769775000

I came up with a microwave steel cut oat method that worked well. Going from memory, I put the oats and hot water in a bowl in the microwave and set it for 45 seconds 100%, then 9 minutes at power level 2. One of those microwaves with "Cook 1" and "Cook 2" on it. The hot water I put in initially was basically boiling hot, you might need to do more time on cook 1 if you put in less hot water (at work we had one of those instant boiling water things).

slumberlust · 2026-01-30T12:36:43 1769776603

Damn, I just blast my oats until they threaten to overflow the bowl and call it a day. Does this technique unlock some creaminess or something unique?

linsomniac · 2026-01-31T03:23:59 1769829839

I found that steel cut oats needed more cooking than just blasting them for a couple minutes.

linsomniac · 2026-01-29T22:46:59 1769726819

>It was going 17 mph. That is rather slow.

There's a case to be made that it wasn't slow enough.

supern0va · 2026-01-29T23:24:00 1769729040

I have a hard time believing that a human driver would be as slow as this Waymo, or even slower. I drive my kid to school where it's posted 20mph and there are cameras (with plenty of warnings about the presence of said cameras) and witness a constant string of flashes from the camera nailing people for speeding through there.

jjav · 2026-01-30T10:45:17 1769769917

Hardcoded limits are problematic because they completely lack context.

On that very same road with a 20mph limit, 40mph might be completely safe or 3mph might be extremely negligently dangerous. It all depends on what is going on in the area.

cortesoft · 2026-01-29T22:56:10 1769727370

A small child jumped out in front of it, which is about the worst case scenario you can have... and the kid was fine. So it sounds like it was slow enough?

Sohcahtoa82 · 2026-01-30T00:26:08 1769732768

Considering the car hit the child at only 6 mph and the kid just got up and brushed themselves off, it was plenty slow enough.

Nobody was injured.

loeg · 2026-01-29T23:00:25 1769727625

Maybe. That level of safetyism seems pretty unreasonable when humans are 100x worse and still allowed on the road.

linsomniac · 2026-01-29T22:45:45 1769726745

>It's likely that a fully-attentive human driver would have done worse.

Maybe. Depends on the position of the sun and shadows, I'm teaching my kids how to drive now and showing them that shadows can reveal human activity that is otherwise hidden by vehicles. I wonder if Waymo or other self-driving picks up on that.

linsomniac · 2026-01-27T04:08:56 1769486936

You had better luck than I did, I tried my hand at making Open Cola, put around $300 into it (between the carbonization rig and essential oils primarily), and while I'd say it was "leaning towards coke", I would also definitely say that nobody would mistake it for coke.

atombender · 2026-01-27T11:12:40 1769512360

I noticed it was incredibly important to get the recipe mixture exactly right, because even a slight measurement error resulted in weirdly wrong flavors.

I did my OpenCola experiment in the company office together with a colleague, and we ended up hooking it up to a beer tap, with a canister of CO2. I'm proud to say the whole office really got into it.