Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all the websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs, replacing things like domain.com/this-article-is-about-foobar-123456/ with domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well as (or better than) humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption, because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I could focus my attention on higher-level concerns, like does the new API design make sense? But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
I was about to write my own tool for this but then I discovered:
git diff --color-moved=dimmed-zebra
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. Can also be enabled by default in the git config.
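If I remember the config key correctly, enabling it by default looks like this (double-check against your git version's docs):

  git config --global diff.colorMoved dimmed-zebra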
I used autochrome[0] for Clojure code to do this. (I also made some improvements to show added/removed comments, top-level form moves, and within-string/within-comment edits the way GitHub does.)
At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly; it makes it easier to spot the diffs.
That's a great solution and I'm adding it to my fallback. But also, people might be interested in diff-so-fancy[0]. I also like using batcat as a pager.
When using a reasonably smart LLM, code moves are usually fine, but you have to pay attention whenever uncommon strings (like URLs or numbers) are involved.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the llm is making sure that I'm still paying attention.
Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it, but it taught me that you should never blindly trust LLM output, even with a super simple task, negligible context size, and a clear, one-sentence prompt.
LLMs do the most amazing things, but they also sometimes screw up the simplest of tasks in the most unexpected ways.
>Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back. Just before sending I noticed that it had moved the event date by one day.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still makes this kind of mistake -- let's just say -- 5% of the time.
The thing is, it's almost more dangerous when it makes the mistake only rarely. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
I've had similar experience both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory; they just process tokens, so as they keep going over the list, the context grows with more and more irrelevant information and they can lose track of why they are doing what they are doing.
It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, we could imagine chunking up the work by processing a few items, then reverting to a previously saved LLM checkpoint of state, and repeating until the list is complete.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
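A rough sketch of the chunking idea in Python (call_llm here is a stand-in for whatever client you use, not a real API):

  # hypothetical sketch: feed a long list to the model in small chunks so every
  # call starts from the same short prompt instead of an ever-growing context
  def process_in_chunks(items, call_llm, task_prompt, chunk_size=5):
      results = []
      for i in range(0, len(items), chunk_size):
          chunk = items[i:i + chunk_size]
          # each call sees only the instructions plus this chunk,
          # never the whole conversation so far
          results.extend(call_llm(task_prompt, chunk))
      return results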
What I assume he's talking about is internal activations, such as those stored in the KV cache, that have the same lifetime as the tokens in the input; but this really isn't the same as "working memory" since these are tied to the input and don't change.
What it seems an LLM would need to do better at these sorts of iterative/sequencing tasks is a real working memory with a more arbitrary, task-duration lifetime that could be updated (vs the fixed KV cache), and that would allow it to track progress or, more generally, maintain context (English usage - not LLM) over the course of a task.
I'm a bit surprised that this type of working memory hasn't been added to the transformer architecture. It seems it could be as simple as a fixed (non-shifting) region of the context that the LLM could learn to read/write during training to assist on these types of tasks.
An alternative to having embeddings as working memory is to use an external text file (cf. a TODO list, or working notes) for this purpose, which is apparently what Claude Code uses to maintain focus over long periods of time, and I recently saw mentioned that the Claude model itself has been trained to read/write this sort of text memory file.
Which is annoying, because that is precisely the kind of boring, rote programming task I want an LLM to do for me, to free up my time for more interesting problems.
5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff but can be harder to spot in larger changes.
I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand, kind of cool. It looked at the MQTT code and the update code, the platform (ESP), and generated all kinds of code. It recommended platform settings that could enable more detailed information, which checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again, e.g. 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually doing anything with the system.
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code, which did some wonky, hacky-looking stuff to achieve something simple. I checked the docs and the feature in question (URL rewriting) was obviously not deprecated. When I asked how they knew it was deprecated, they said ChatGPT told them. So now they are fixing the fix ChatGPT provided.
Claude (possibly all LLMs, but I mostly use Claude) LOVES this pattern for some reason. "If <thing> fails/does not exist I'll just silently return a placeholder, that way things break silently and you'll tear your hair out debugging it later!" Thanks Claude
I will also add checks to make sure the data that I get is there even though I checked 8 times already and provide loads of logging statements and error handling. Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh also with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25 please.
Hah, I also happened to use Claude recently to write basic MQTT code to expose some data on a couple of Orange Pis I wanted to view in Home Assistant. And it one-shot this super cool mini Python MQTT client I could drop wherever I needed it, which was amazing, having never worked with MQTT in Python before.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data, but when I re-read my original prompt it was somewhat baffling to think it could have been interpreted as wanting fake data, given I explicitly asked it to use real data from the device.
I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
I asked it to change some networking code, which it did perfectly, but I noticed some diffs in another file and found it had just randomly expanded some completely unrelated abbreviations in strings which are specifically shortened because of the character limit of the output window.
It was a fairly big refactoring basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLMs and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests', so when I looked at the agent console in the IDE everything seemed fine when collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
I think it's something that model providers don't want to fix. The number of times Claude Code just decided to delete tests that were not passing, before I added a memory saying it needs to ask for my permission to do that, was staggering. It stopped happening after the memory, so I believe it could easily be fixed by a system prompt.
this is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
This and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex it would probably be more work to untangle the diff than just do it yourself right away.
I truly wonder how much time we have before some spectacular failure happens because an LLM was asked to rewrite a file with a bunch of constants in it in critical software and silently messed them up or inverted them in a way that looks reasonable, works in your QA environment, and then leads to a spectacular failure in the field.
Not related to code... But when I use an LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Far fewer hallucinations, and very cheap in token generation.
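Something like this, where source_text is whatever you're copying from and ask_model_for_slice stands in for the actual LLM call (hypothetical name):

  # sketch: number the lines, let the model return only two indices, slice locally
  lines = source_text.splitlines()
  numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines))
  # the model sees `numbered` and replies with start/stop integers
  start_index, stop_index = ask_model_for_slice(numbered)
  extracted = "\n".join(lines[start_index:stop_index])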
Incorrect data is a hard one to catch, even with automated tests (even in your tests, you're probably only checking the first link, if you're even doing that).
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (Claude Code and Codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging its head against the wall trying to get its bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile, or just delete a bunch of code and paper over it... but it's generally very obvious when it does this, as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
Reminds me of when I asked Claude (through Windsurf) to create an S3 Lambda trigger to resize images (as soon as a PNG image appears in S3, resize it). The code looked flawless and I deployed... only to learn that I had introduced an infinite loop :) For every image resized, a new one would be created and resized. In 5 min, the trigger created hundreds of thousands of images... what a joy it was to clean that up in S3
Interesting, I've seen similar looking behavior in other forms of data extraction. I took a picture of a bookshelf and asked it to list the books. It did well in the beginning but by the middle, it had started making up similar books that were not actually there.
My custom prompt instructs GPT to output changes to code as a diff/git-patch. I don’t use agents because it makes it hard to see what’s happening and I don’t trust them yet.
I’ve tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase.
That said, your comment made me realize I could be using “git apply” more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
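In case it helps anyone, the flow I have in mind is just standard git:

  git apply --check llm.patch   # dry run: does the patch apply cleanly?
  git apply llm.patch           # apply it, then review with git diff as usual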
"very subtle and silently introduced mistakes" - that's the biggest bottleneck I think; as long as it's true, we need to validate LLMs outputs; as long as we must validate LLMs outputs, our own biological brains are the ultimate bottleneck
"...very subtle and silently introduced mistakes are quite dangerous..."
In my view these perfectly serve the purpose of encouraging you to keep burning tokens for immediate revenue as well as potentially using you to train their next model at your expense.
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah, so, the reason people use various tools and machines in the first place is to simplify work or everyday tasks by: 1) making the tasks execute faster, 2) getting more reliable outputs than doing it yourself, 3) making it repeatable. The LLMs obviously don't check any of these boxes, so why don't we stop pretending that we as users are stupid and don't know how to use them, and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work, really?
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from my own experience where they also appear to be the wrong tool, for example for working with Terraform, or for keeping secrets out of the frontend (they hardcode them right in). Et cetera. Many other people could contribute thousands, if not more, of similar but different cases. So what good are these tools really for, then? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no one seems to be able to identify one for sure. The problem is, the folks pushing or better said, shoving these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? A PhD-level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (I count some IT jobs as such too), but not much more.
> Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on.
If you're trying to one-shot stuff with a few sentences then yes, you might be using these things wrong. I've seen people with PhDs fail to use Google successfully to find things. Were they idiots? If you're using them wrong, you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities, then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro-related Python for an art project, to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
Again mate, stop making arrogant assumptions and read some of my previous comments. My team and I have been early adopters for about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.
I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since ChatGPT launched, which, if you'd been in the AI sphere as long as I have (my god, it's 20 years now), absolutely was revolutionary. Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately, as long as you didn't care about cost, to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot from hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea that I can get a summary from input without training is already impressive, and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Stop what, mate? My words are not the words of someone who occasionally dabbles in the free ChatGPT tier - I've been paying for premium-tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, i.e. because the agent mode has gone from somewhat useful to a complete waste of time. We are now back to using them as a replacement for the now-enshittified search. But as you can see from my early adoption of these crap-tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when LLMs can actually do my job to the level I expect. LLMs can do some of my work, but they are tiring, they make mistakes, and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Why is the bar for it to do your job or completely replace you? It's a tool. If it makes you 5% better at your job, then great. There's a recent study showing it has 15-20% productivity benefits: not completely useless, not 10x. I hope we can have nuance in the conversation.
...and then there was also a recent MIT study showing it was making everyone less productive. The bar is there because this is how all the AI grifters have been selling this technology - as no less than the end of work itself. Why should we not hold them accountable for over-promising and under-delivering? Or is that reserved just for the serfs?
Read about the jagged frontier. IanCal is right: this is a perfect example of using the tool wrong; you've focused on a very narrow use case (which is surprisingly hard for the matmuls to not mess up) and extrapolated from it, but extrapolation is incorrect here because the capability frontier is fractal, not continuous.
It’s not surprisingly hard at all, when you consider they have no understanding of the tasks they do nor of the subject material. It’s just a good example of the types of tasks (anything requiring reliability or correct results) that they are fundamentally unsuited to.
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
When you take a step back, it's surprising that these tools can be useful at all in nontrivial tasks, but being surprised doesn't matter in the grand scheme of things. Bamboozling rarely enough that harnesses can keep them in line, plus the ability to self-correct at inference time when bamboozling is detected (either by the model itself or by the harness), is very useful, at least in my work. It's a question of using the tool correctly and understanding its limitations, which is hard if you aren't willing to explore the boundaries and commit to doing it every month, basically.
Well, you see it hallucinates on long precise strings, but if we ignore that and focus on what it's powerful at, we can do something powerful. In this case, by the time it gets to outputting the URL, it has already determined the correct intent or next action (print out a URL). You use this intent to do a tool call to generate the URL. Small aside: its ability to figure out the what and the why is pure magic, for those still peddling the glorified-autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
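A minimal sketch of what I mean, assuming you keep the canonical URLs in your own data so the model only ever picks a key (all names here are made up):

  # the model chooses a key/intent; the exact URL comes from data it never retypes
  CANONICAL_URLS = {
      "foobar-article": "https://domain.com/this-article-is-about-foobar-123456/",
  }

  def resolve_link(key: str) -> str:
      # fails loudly on an unknown key instead of letting a plausible URL slip through
      return CANONICAL_URLS[key]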
I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.
They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
> I don't think they were ever really sold as that, and we have better tools for that.
We have OpenAI describing gpt5 as having PhD-level intelligence, and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%.
I say they are being sold as a magical do everything tool.
Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
What you are describing is "dead reasoning zones".[0]
"This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."
They are lying, because their salary depends on them lying about it. Why does it even matter what they're saying? Why don't we listen to scientists, researchers, practitioners, and the real users of the technology, and stop repeating what the CEOs are saying?
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
> When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
It's a new technology which lends itself well to outrageous claims and marketing, but the analogy stands. The CEOs don't get to define the narrative or stand as strawman targets for anti-AI folks to dunk on, sorry. Elon has been repeating "self driving next year" for a decade+ at this point, that doesn't make what Waymo did unimpressive. This level of cynicism is unwarranted is what I'm saying.
Not what I said at all. Question it all you want. But disproving outrageous CEO claims doesn't get you there. Whether LLMs are AGI/ASI that will replace everyone is separate from whether they are useful today as tools. Attacking the first claim doesn't mean much for the second claim, which is the more interesting one.
I'm questioning the basic utility. They are text generation machines. This makes them unsuitable for any work that requires accuracy or understanding, which is the vast majority of knowledge work.
LLMs aren’t high school students, they’re blobs of numbers which happen to speak English if you poke them right. Use the tool when it’s good at what it does.
I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.
Edit: I think I'm just regurgitating the article here.
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
> If you use them regularly you wouldn't see a set of urls without thinking...
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
This isn't ai maximalist though, it's explicitly pointing out something that regularly does not work!
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
In these cases I explicitly tell the llm to make as few changes as possible and I also run a diff. And then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
I have a project that I've leaned heavily on LLM help for which I consider to embody good quality control practices. I had to get pretty creative to pull it off: spent a lot of time working on this sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these) and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps).
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
This is of course bad, but: humans also make (different) mistakes all the time. We could account for the risk of mistakes being introduced and build more tools that validate things for us. In a way, LLMs encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes, without the solution adding a bunch of tedious overhead.
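Even a tiny link check in CI goes a long way. A minimal sketch, assuming the requests library and a list of the URLs under test:

  import requests

  URLS = [
      "https://domain.com/this-article-is-about-foobar-123456/",  # the links to verify
  ]

  for url in URLS:
      # some servers dislike HEAD; swap in requests.get if needed
      status = requests.head(url, allow_redirects=True, timeout=10).status_code
      assert status < 400, f"{url} returned {status}"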
In the kind of situation described above, a meticulous coder actually makes no mistakes. They will, however, make a LOT more mistakes if they use LLMs to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
> I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
A meticulous coder probably wouldn't have typed out 40 URLs just because they wanted to move them from one file to another. They would copy-paste them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
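For example, something as simple as this pulls the exact URLs out of the old file so nothing has to retype them (exact grep flags may vary by platform):

  grep -o 'href="[^"]*"' old.html > urls.txt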
I think it goes without saying that we need to be sceptical when to use and not use LLM. The point I'm trying to make is more that we should have more validations and not that we should be less sceptical about LLMs.
Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
I agree, these kinds of stories should encourage us to setup more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
Your point to not rely on good intentions and have systems in place to ensure quality is good - but your comparison to humans didn't go well with me.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.