I also had a good friend who was an absolute wizard with early Stable Diffusion. He could make the model do things that were supposedly impossible at the time. His prompts were works of art. Now any of the commercial image models go far beyond what he could do. It's interesting to think about how there was this ephemeral art form of manipulating image models that existed for about a year.
The same could be said of prompt engineering. Gone are the days of telling the model that it is an expert software engineer with a PhD in the most relevant subtopic. These days the common wisdom is to just clearly articulate what you want it to do. Huge amounts of energy put into prompt engineering are now completely swept away by incremental model advances.
I have similar usage habits. Not only has nothing like this ever happened to me, but I don’t think it has ever deleted anything that I didn’t want deleted. Files only get deleted if I ask for a “cleanup” or something similar.
It has deleted a config directory of a system program I was having it troubleshoot, which was definitely not required, requested or helpful. The deleted files were in my home directory and not the "sandbox" directory I was running it from.
I knew the risks and accepted them, but it is more than capable of doing system actions you can regret.
Anybody talking about AI safety not being an issue, and about how people will be able to use it responsibly, should study comments such as these in this thread. Even if one knows better than to do that, people on your team or at an important public facility will go about using AI like this...
I'm on the same page here. I have seen this sentiment about Codex suddenly being good a few times now, so I booted Codex CLI thinking-high back up after a break and asked it to look for bugs. It promptly found five bugs that didn't actually exist. It was the kind of truly, impressively stupid mistake that I've essentially never seen Claude Code make, and it made me wonder whether this is the sort of thing that's making people downplay the power of LLMs for agentic coding.
I asked Sonnet 4.5 to find bugs in the code, and it found five high-impact bugs that, when I prompted it a second time, it admitted weren't actually bugs. It's definitely not just Codex.
Perhaps surprisingly, considering the current stratospheric prices of GPUs, the performance-per-dollar of compute is still rising faster than exponentially. In a handful of years it will be cheap to train something as powerful as the models that cost millions to train today. Algorithmic efficiencies also stack up and make it cheaper to build and serve older models even on the same hardware.
It’s underappreciated that we would already be in a pretty absurdly wild tech trajectory just due to compute hyperabundance even without AI.
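To make the compounding concrete, here is a toy sketch; the 2.5x/year rate and the $10M starting cost are assumed numbers for illustration, not measurements:

    # Toy sketch of compounding performance-per-dollar gains. The 2.5x/year
    # rate and the $10M starting cost are assumptions for illustration only.
    def cost_after_years(initial_cost_usd: float, yearly_gain: float, years: int) -> float:
        """Cost to reproduce the same training run after `years` of compounding gains."""
        return initial_cost_usd / (yearly_gain ** years)

    for years in range(7):
        print(f"year {years}: ~${cost_after_years(10_000_000, 2.5, years):,.0f}")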
Broadly, the critique is valid where it applies; I don’t know if it accurately captures the way most people are using LLMs to code, so I don’t know that it applies in most cases.
My one concrete pushback to the article is that it states the inevitable end result of vibe coding is a messy unmaintainable codebase. This is empirically not true. At this point I have many vibecoded projects that are quite complex but work perfectly. Most of these are for my private use but two of them serve in a live production context. It goes without saying that not only do these projects work, but they were accomplished 100x faster than I could have done by hand.
Do I also have vibecoded projects that went off the rails? Of course. I had to build those to learn where the edges of the model’s capabilities are, and what its failure modes are, so I can compensate. Vibecoding a good codebase is a skill. I know how to vibecode a good, maintainable codebase. Perhaps this violates your definition of vibecoding; my definition is that I almost never need to actually look at the code. I am just serving as a very hands-on manager. (Though I can look at the code if I need to; I have 20 years of coding experience. But if I find that I need to look at the code, something has already gone badly wrong.)
Relevant anecdote: A couple of years ago I had a friend who was incredibly skilled at getting image models to do things that serious people asserted image models definitely couldn’t do at the time. At that time there were no image models that could get consistent text to appear in the image, but my friend could always get exactly the text you wanted. His prompts were themselves incredible works of art and engineering, directly grabbing hold of the fundamental control knobs of the model that most users are fumbling at.
Here’s the thing: any one of us can now make an image that is better than anything he was making at the time. Better compositionality, better understanding of intent, better text accuracy. We do this out of the box and without any attention paid to prompting voodoo at all. The models simply got that much better.
In a year or two, my carefully cultivated expertise around vibecoding will be irrelevant. You will get results like mine by just telling the model what you want. I assert this with high confidence. This is not disappointing to me, because I will be taking full advantage of the bleeding edge of capabilities throughout that period of time. Much like my friend, I don’t want to be good at managing AIs, I want to realize my vision.
100x is such a crazy claim to me - you’re saying you can do in 4 days what would have previously taken over a year. 5 weeks and you can accomplish what would have taken you a decade without LLMs.
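Just to spell out the arithmetic I'm objecting to:

    # The literal arithmetic of a 100x multiplier (calendar time, not effort).
    print(4 * 100 / 365)   # 4 days  * 100 ≈ 1.1 years
    print(5 * 100 / 52)    # 5 weeks * 100 ≈ 9.6 years, roughly a decade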
In most cases I would never have undertaken those projects at all without AI. One of the projects that is currently live and making me money took about 1 working day with Claude Code. It’s not something I ever would have started without Claude Code, because I know I wouldn’t have the time for it. I have built websites of similar complexity in the past, and since they were free-time type endeavors, they never quite crossed the finish line into commerciality even after several years of on-again-off-again work. So how do you account for that with a time multiplier? 100x? An infinite speedup? The counterfactual is a world where the product doesn’t exist at all.
This is where most of the “speedup” happens. It’s more a speedup in overall effectiveness than raw “coding speed.” Another example is a web API for which I was able to very quickly release comprehensive client-side SDKs in multiple languages. This is exactly the kind of deterministic boilerplate work LLMs are ideal for, work that would take a human a lot of typing and a lot of looking up details in unfamiliar languages. How long would it have taken me to write SDKs in all those languages by hand? I don’t really know; I simply wouldn’t have done it. I would have just done one SDK in Python and called it good enough.
If you really twist my arm and ask me to estimate the speedup on some task that I would have done either way, then yeah, I still think a 100x speedup is the right order of magnitude, if we’re talking about Claude Code with Opus 4.1 specifically. In the past I spent about five years very carefully building a suite of tools for managing my simulation work and serving as a pre/post-processor. Obviously this wasn’t full-time work on the code itself, but the development progressed across that timeframe. I recently threw all that out and replaced it with stuff I rebuilt in about a week with AI. In this case I was leveraging a lot of the learnings I gleaned from the first time I built it, so it’s not a fair one-to-one comparison, but you’re really never going to see a pure natural experiment for this sort of thing.
I think most people are in a professional position where they are sort of externally rate limited. They can’t imagine being 100x more effective. There would be no point to it. In many cases they already sit around doing nothing all day, because they are waiting for other people or processes. I’m lucky to not be in such a position. There’s always somewhere I can apply energy and see results, and so AI acts as an increasingly dramatic multiplier. This is a subtle but crucial point: if you never try to use AI in a way that would even hypothetically result in a big productivity multiplier (doing things you wouldn’t have otherwise done, doing a much more thorough job on the things you need to do, and trying to intentionally speed up your work on core tasks) then you can’t possibly know what the speedup factor is. People end up sounding like a medieval peasant suddenly getting access to a motorcycle and complaining that it doesn’t get them to the market faster, and then you find out that they never actually ride it.
I wonder, have you sat down and tried to vibecode something with Claude Code? If so, what kind of multiplier would you find plausible?
I had the AI implement the same thing twice in parallel in one project. It was lots of fun when it was 'fixing' the version that wasn't being used. So yeah, it can definitely muck up your codebase.
Hah, today I discovered Claude Code has been copy/pasting gigantic blocks of conditions and styling every time I ask it to add a "--new" flag or whatever in a once-tiny, now-gigantic script I've been adding features to.
It worked fine until recently; now when I ask it to tweak some behavior of a command with a flag, it produces a diff with hundreds of lines. So now it's struggling to catch every place it needs to change some hardcoded duplicate values it decided to copy/paste into two dozen random places in the code.
To be fair, it is doing a decent job unfucking it now that I noticed and started explicitly showing it, with specific examples and refactorings, how ridiculously cumbersome and unmaintainable it had made things. But if I hadn't bothered to finally sit down and read through it thoroughly, it would have just become more broken and inconsistent as it grew.
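The shape of the fix is just consolidating the duplicated flag checks and hardcoded values in one place; a rough sketch with made-up names (not my actual script):

    # Rough sketch of the consolidation: define the flags and shared constants
    # once, instead of re-checking and re-hardcoding them in every code path.
    # All names are made up for illustration.
    import argparse

    DEFAULT_STYLE = {"color": "cyan", "indent": 2}  # previously copy/pasted all over the script

    def build_parser() -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser(description="example script")
        parser.add_argument("--new", action="store_true", help="enable the new behavior")
        parser.add_argument("--color", default=DEFAULT_STYLE["color"])
        return parser

    def run(args: argparse.Namespace) -> None:
        mode = "new" if args.new else "legacy"  # one branch on the flag, not dozens
        print(f"running in {mode} mode with color={args.color}")

    if __name__ == "__main__":
        run(build_parser().parse_args())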
Yeah, it will definitely do dumb stuff if you don’t keep an eye on it and intervene when you see the signs that it’s heading in the wrong direction. But it’s very good at course correcting, and if you end up in a truly disastrous state, you can almost always fix it by reverting to the last working commit and starting a fresh context.
Here's one. https://doofmovies.com/
With this project I'm sort of playing a game where I want to see how long I can go without finding out what language the backend is written in. I still don't know.
From experience, it seems like handing context-scoping and routing decisions off to smaller models just results in those models making bad judgements at very high speed.
Whenever I experiment with agent frameworks that spawn subagents with scoped subtasks and restricted context, things go off the rails very quickly. A subagent with reduced context makes poorer choices, hallucinates assumptions about the greater codebase, and very often lacks a basic sense of the point of the work. This lack of situational awareness is where you are most likely to encounter JS scripts suddenly appearing in your Python repo.
I don’t know if there is a “fix” for this or if I even want one. Perhaps the solution, in the limit, actually will be to just make the big-smart models faster and faster, so they can chew on the biggest and most comprehensive context possible, and use those exclusively.
eta: The big models have gotten better and better at longer-running tasks because they are less likely to make a stupid mistake that derails the work at any given moment. More nines of reliability, etc. By introducing dumber models into this workflow, and restricting the context that you feed to the big models, you are pushing things back in the wrong direction.
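To be concrete about what I mean by scoped subtasks and restricted context, here is a minimal sketch; call_model and the other names are hypothetical stand-ins for whatever a given framework actually wraps:

    # Minimal sketch of the scoped-subagent pattern. `call_model` stands in for
    # whatever LLM API the framework wraps; the key detail is that the subagent
    # only ever sees the slice of context the router decided was relevant.
    from dataclasses import dataclass

    @dataclass
    class Subtask:
        instruction: str
        context_files: list[str]  # the "scoped" context; everything else is hidden

    def route(task: str, repo_files: list[str]) -> list[Subtask]:
        # A small, fast model decides how to split the work and which files each
        # subagent may read. Bad judgements here happen at very high speed.
        return [Subtask(instruction=task, context_files=repo_files[:3])]

    def run_subagent(subtask: Subtask, call_model) -> str:
        prompt = subtask.instruction + "\n\nContext:\n" + "\n".join(subtask.context_files)
        # With no view of the rest of the codebase, the subagent fills the gaps
        # with assumptions -- which is where the off-the-rails behavior starts.
        return call_model(prompt)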
"True" multi-objective optimization can be not only solve traditional optimization problems, but can act as a control system for dynamic time-varying multi-objective problems.