Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all the websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs, replacing things like domain.com/this-article-is-about-foobar-123456/ with domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well as (or better than) humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption, because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I could focus my attention on higher-level concerns, like does the new API design make sense? But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
I was about to write my own tool for this but then I discovered:
git diff --color-moved=dimmed-zebra
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. Can also be enabled by default in the git config.
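If I remember the config key correctly, enabling it by default looks like this (double-check against your git version's docs):

  git config --global diff.colorMoved dimmed-zebra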
I used autochrome[0] for Clojure code to do this. (I also made some improvements to show added/removed comments, top-level form moves, and within-string/within-comment edits the way GitHub does.)
At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly; it makes it easier to spot the diffs.
That's a great solution and I'm adding it to my fallback. But also, people might be interested in diff-so-fancy[0]. I also like using batcat as a pager.
When using a reasonably smart LLM, code moves are usually fine, but you have to pay attention whenever uncommon strings (like URLs or numbers) are involved.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the llm is making sure that I'm still paying attention.
Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it, but it taught me that you should never blindly trust LLM output, even with a super simple task, negligible context size, and a clear, one-sentence prompt.
LLMs do the most amazing things, but they also sometimes screw up the simplest of tasks in the most unexpected ways.
>Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back. Just before sending I noticed that it had moved the event date by one day.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still makes this kind of mistake -- let's just say -- 5% of the time.
The thing is, it's almost more dangerous when it makes the mistake only rarely. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
I've had similar experience both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory; they just process tokens, so as they keep going over the list, the context grows with more and more irrelevant information and they can lose track of why they are doing what they are doing.
It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, we could imagine chunking up the work by processing a few items, then reverting to a previously saved LLM checkpoint of state, and repeating until the list is complete.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
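A rough sketch of the chunking idea in Python (call_llm here is a stand-in for whatever client you use, not a real API):

  # hypothetical sketch: feed a long list to the model in small chunks so every
  # call starts from the same short prompt instead of an ever-growing context
  def process_in_chunks(items, call_llm, task_prompt, chunk_size=5):
      results = []
      for i in range(0, len(items), chunk_size):
          chunk = items[i:i + chunk_size]
          # each call sees only the instructions plus this chunk,
          # never the whole conversation so far
          results.extend(call_llm(task_prompt, chunk))
      return results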
What I assume he's talking about is internal activations, such as those stored in the KV cache, that have the same lifetime as the tokens in the input; but this really isn't the same as "working memory" since these are tied to the input and don't change.
What it seems an LLM would need to do better at these sorts of iterative/sequencing tasks is a real working memory with a more arbitrary, task-duration lifetime that could be updated (vs the fixed KV cache), and that would allow it to track progress or, more generally, maintain context (English usage - not LLM) over the course of a task.
I'm a bit surprised that this type of working memory hasn't been added to the transformer architecture. It seems it could be as simple as a fixed (non-shifting) region of the context that the LLM could learn to read/write during training to assist on these types of tasks.
An alternative to having embeddings as working memory is to use an external text file (cf. a TODO list, or working notes) for this purpose, which is apparently what Claude Code uses to maintain focus over long periods of time, and I recently saw mentioned that the Claude model itself has been trained to read/write this sort of text memory file.
Which is annoying, because that is precisely the kind of boring, rote programming task I want an LLM to do for me, to free up my time for more interesting problems.
5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff but can be harder to spot in larger changes.
I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand, kind of cool. It looked at the MQTT code and the update code, the platform (ESP), and generated all kinds of code. It recommended platform settings that could enable more detailed information, which checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again, e.g. 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually doing anything with the system.
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code, which did some wonky, hacky-looking stuff to achieve something simple. I checked the docs and the feature in question (URL rewriting) was obviously not deprecated. When I asked how they knew it was deprecated, they said ChatGPT told them. So now they are fixing the fix ChatGPT provided.
Claude (possibly all LLMs, but I mostly use Claude) LOVES this pattern for some reason. "If <thing> fails/does not exist I'll just silently return a placeholder, that way things break silently and you'll tear your hair out debugging it later!" Thanks Claude
I will also add checks to make sure the data that I get is there even though I checked 8 times already and provide loads of logging statements and error handling. Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh also with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25 please.
Hah, I also happened to use Claude recently to write basic MQTT code to expose some data on a couple of Orange Pis I wanted to view in Home Assistant. And it one-shot this super cool mini Python MQTT client I could drop wherever I needed it, which was amazing, having never worked with MQTT in Python before.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data, but when I re-read my original prompt it was somewhat baffling to think it could have been interpreted as wanting fake data, given I explicitly asked it to use real data from the device.
I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
I asked it to change some networking code, which it did perfectly, but I noticed some diffs in another file and found it had just randomly expanded some completely unrelated abbreviations in strings which are specifically shortened because of the character limit of the output window.
It was a fairly big refactoring basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLMs and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests', so when I looked at the agent console in the IDE everything seemed fine when collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
I think it's something that model providers don't want to fix. The number of times Claude Code just decided to delete tests that were not passing, before I added a memory saying it needs to ask for my permission to do that, was staggering. It stopped happening after the memory, so I believe it could easily be fixed by a system prompt.
this is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
This and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex it would probably be more work to untangle the diff than just do it yourself right away.
I truly wonder how much time we have before some spectacular failure happens because an LLM was asked to rewrite a file with a bunch of constants in it in critical software and silently messed them up or inverted them in a way that looks reasonable, works in your QA environment, and then leads to a spectacular failure in the field.
Not related to code... But when I use an LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Far fewer hallucinations, and very cheap in token generation.
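Something like this, where source_text is whatever you're copying from and ask_model_for_slice stands in for the actual LLM call (hypothetical name):

  # sketch: number the lines, let the model return only two indices, slice locally
  lines = source_text.splitlines()
  numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines))
  # the model sees `numbered` and replies with start/stop integers
  start_index, stop_index = ask_model_for_slice(numbered)
  extracted = "\n".join(lines[start_index:stop_index])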
Incorrect data is a hard one to catch, even with automated tests (even in your tests, you're probably only checking the first link, if you're even doing that).
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (Claude Code and Codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging its head against the wall trying to get its bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile, or just delete a bunch of code and paper over it... but it's generally very obvious when it does this, as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
Reminds me of when I asked Claude (through Windsurf) to create an S3 Lambda trigger to resize images (as soon as a PNG image appears in S3, resize it). The code looked flawless and I deployed... only to learn that I had introduced an infinite loop :) For every image resized, a new one would be created and resized. In 5 min, the trigger created hundreds of thousands of images... what a joy it was to clean that up in S3
Interesting, I've seen similar looking behavior in other forms of data extraction. I took a picture of a bookshelf and asked it to list the books. It did well in the beginning but by the middle, it had started making up similar books that were not actually there.
My custom prompt instructs GPT to output changes to code as a diff/git-patch. I don’t use agents because it makes it hard to see what’s happening and I don’t trust them yet.
I’ve tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase.
That said, your comment made me realize I could be using “git apply” more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
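In case it helps anyone, the flow I have in mind is just standard git:

  git apply --check llm.patch   # dry run: does the patch apply cleanly?
  git apply llm.patch           # apply it, then review with git diff as usual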
"very subtle and silently introduced mistakes" - that's the biggest bottleneck I think; as long as it's true, we need to validate LLMs outputs; as long as we must validate LLMs outputs, our own biological brains are the ultimate bottleneck
"...very subtle and silently introduced mistakes are quite dangerous..."
In my view these perfectly serve the purpose of encouraging you to keep burning tokens for immediate revenue as well as potentially using you to train their next model at your expense.
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah, so, the reason people use various tools and machines in the first place is to simplify work or everyday tasks by: 1) making the tasks execute faster, 2) getting more reliable outputs than doing it yourself, 3) making it repeatable. The LLMs obviously don't check any of these boxes, so why don't we stop pretending that we as users are stupid and don't know how to use them, and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work, really?
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from my own experience where they also appear to be the wrong tool, for example for working with Terraform, or for keeping secrets out of the frontend (they hardcode them right in). Et cetera. Many other people could contribute thousands, if not more, of similar but different cases. So what good are these tools really for, then? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no one seems to be able to identify one for sure. The problem is, the folks pushing or better said, shoving these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? A PhD-level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (I count some IT jobs as such too), but not much more.
> Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on.
If you're trying to one-shot stuff with a few sentences then yes, you might be using these things wrong. I've seen people with PhDs fail to use Google successfully to find things. Were they idiots? If you're using them wrong, you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities, then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro-related Python for an art project, to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
Again mate, stop making arrogant assumptions and read some of my previous comments. My team and I have been early adopters for about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.
I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since ChatGPT launched, which, if you'd been in the AI sphere as long as I have (my god, it's 20 years now), absolutely was revolutionary. Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately, as long as you didn't care about cost, to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot from hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea that I can get a summary from input without training is already impressive, and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Stop what, mate? My words are not the words of someone who occasionally dabbles in the free ChatGPT tier - I've been paying for premium-tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, i.e. because the agent mode has gone from somewhat useful to a complete waste of time. We are now back to using them as a replacement for the now-enshittified search. But as you can see from my early adoption of these crap-tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when LLMs can actually do my job to the level I expect. LLMs can do some of my work, but they are tiring, they make mistakes, and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Why is the bar for it to do your job or completely replace you? It's a tool. If it makes you 5% better at your job, then great. There's a recent study showing it has 15-20% productivity benefits: not completely useless, not 10x. I hope we can have nuance in the conversation.
...and then there was also a recent MIT study showing it was making everyone less productive. The bar is there because this is how all the AI grifters have been selling this technology - as no less than the end of work itself. Why should we not hold them accountable for over-promising and under-delivering? Or is that reserved just for the serfs?
Read about the jagged frontier. IanCal is right: this is a perfect example of using the tool wrong; you've focused on a very narrow use case (which is surprisingly hard for the matmuls to not mess up) and extrapolated from it, but extrapolation is incorrect here because the capability frontier is fractal, not continuous.
It’s not surprisingly hard at all, when you consider they have no understanding of the tasks they do nor of the subject material. It’s just a good example of the types of tasks (anything requiring reliability or correct results) that they are fundamentally unsuited to.
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
When you take a step back, it's surprising that these tools can be useful at all in nontrivial tasks, but being surprised doesn't matter in the grand scheme of things. Bamboozling rarely enough that harnesses can keep them in line, plus the ability to self-correct at inference time when bamboozling is detected (either by the model itself or by the harness), is very useful, at least in my work. It's a question of using the tool correctly and understanding its limitations, which is hard if you aren't willing to explore the boundaries and commit to doing it every month, basically.
Well, you see it hallucinates on long precise strings, but if we ignore that and focus on what it's powerful at, we can do something powerful. In this case, by the time it gets to outputting the URL, it has already determined the correct intent or next action (print out a URL). You use this intent to do a tool call to generate the URL. Small aside: its ability to figure out the what and the why is pure magic, for those still peddling the glorified-autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
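A minimal sketch of what I mean, assuming you keep the canonical URLs in your own data so the model only ever picks a key (all names here are made up):

  # the model chooses a key/intent; the exact URL comes from data it never retypes
  CANONICAL_URLS = {
      "foobar-article": "https://domain.com/this-article-is-about-foobar-123456/",
  }

  def resolve_link(key: str) -> str:
      # fails loudly on an unknown key instead of letting a plausible URL slip through
      return CANONICAL_URLS[key]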
I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.
They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
> I don't think they were ever really sold as that, and we have better tools for that.
We have OpenAI describing gpt5 as having PhD-level intelligence, and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%.
I say they are being sold as a magical do everything tool.
Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
What you are describing is "dead reasoning zones".[0]
"This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."
They are lying, because their salary depends on them lying about it. Why does it even matter what they're saying? Why don't we listen to scientists, researchers, practitioners, and the real users of the technology, and stop repeating what the CEOs are saying?
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
> When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
It's a new technology which lends itself well to outrageous claims and marketing, but the analogy stands. The CEOs don't get to define the narrative or stand as strawman targets for anti-AI folks to dunk on, sorry. Elon has been repeating "self driving next year" for a decade+ at this point, that doesn't make what Waymo did unimpressive. This level of cynicism is unwarranted is what I'm saying.
Not what I said at all. Question it all you want. But disproving outrageous CEO claims doesn't get you there. Whether LLMs are AGI/ASI that will replace everyone is separate from whether they are useful today as tools. Attacking the first claim doesn't mean much for the second claim, which is the more interesting one.
I'm questioning the basic utility. They are text generation machines. This makes them unsuitable for any work that requires accuracy or understanding, which is the vast majority of knowledge work.
LLMs aren’t high school students, they’re blobs of numbers which happen to speak English if you poke them right. Use the tool when it’s good at what it does.
I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.
Edit: I think I'm just regurgitating the article here.
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
> If you use them regularly you wouldn't see a set of urls without thinking...
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
This isn't ai maximalist though, it's explicitly pointing out something that regularly does not work!
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
In these cases I explicitly tell the llm to make as few changes as possible and I also run a diff. And then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
I have a project that I've leaned heavily on LLM help for which I consider to embody good quality control practices. I had to get pretty creative to pull it off: spent a lot of time working on this sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these) and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps).
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
This is of course bad, but: humans also make (different) mistakes all the time. We could account for the risk of mistakes being introduced and build more tools that validate things for us. In a way, LLMs encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes, without the solution adding a bunch of tedious overhead.
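Even a tiny link check in CI goes a long way. A minimal sketch, assuming the requests library and a list of the URLs under test:

  import requests

  URLS = [
      "https://domain.com/this-article-is-about-foobar-123456/",  # the links to verify
  ]

  for url in URLS:
      # some servers dislike HEAD; swap in requests.get if needed
      status = requests.head(url, allow_redirects=True, timeout=10).status_code
      assert status < 400, f"{url} returned {status}"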
In the kind of situation described above, a meticulous coder actually makes no mistakes. They will, however, make a LOT more mistakes if they use LLMs to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
> I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
A meticulous coder probably wouldn't have typed out 40 URLs just because they wanted to move them from one file to another. They would copy-paste them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
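For example, something as simple as this pulls the exact URLs out of the old file so nothing has to retype them (exact grep flags may vary by platform):

  grep -o 'href="[^"]*"' old.html > urls.txt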
I think it goes without saying that we need to be sceptical when to use and not use LLM. The point I'm trying to make is more that we should have more validations and not that we should be less sceptical about LLMs.
Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
I agree, these kinds of stories should encourage us to setup more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
Your point to not rely on good intentions and have systems in place to ensure quality is good - but your comparison to humans didn't go well with me.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.