
My friends at Google are some of the most negative about the potential of AI to improve software development. I was always surprised by this and assumed that Google, internally, would be one of the first places to adopt these tools.




I've generally found an inverse correlation between "understands AI" and "exuberance for AI".

I'm the only person at my current company who has had experience at multiple AI companies (the rest have never worked on it in a production environment; one of our projects is literally something I got paid to deliver to customers at another startup), who has written professionally about the topic, and who has worked directly with some big names in the space. Unsurprisingly, I have nothing to do with any of our AI efforts.

One of the members of our leadership team, who I don't believe understands matrix multiplication, genuinely believes he's about to transcend human identity by merging with AI. He's publicly discussed how hard it is to maintain friendship with normal humans who can't keep up.

Now I absolutely think AI is useful, but these people don't want AI to be useful; they want it to be something that anyone who understands it knows it can't be.

It's getting to the point where I genuinely feel I'm witnessing some sort of mass hysteria event. I keep getting introduced to people who have almost no understanding of the fundamentals of how LLMs work, yet who have the most radically fantastic ideas about what they are capable of, on a level beyond anything I have experienced in my fairly long technical career.


Personally, I don't understand how LLMs work. I know some ML math and certainly could learn, and probably will, soon.

But my opinions about what LLMs can do are based on... what LLMs can do. What I can see them doing. With my eyes.

The right answer to the question "What can LLMs do?" is... looking... at what LLMs can do.


I'm sure you're already familiar with the ELIZA effect [0], but you should be a bit skeptical of what you are seeing with your eyes, especially when it comes to language. Humans have an incredible weakness to be tricked by language.

You should be doubly skeptical ever since RLHF became standard, as the model has literally been optimized to give you the answers you find most pleasing.

The best way to measure of course is with evaluations, and I have done professional LLM model evaluation work for about 2 years. I've seen (and written) tons of evals and they both impress me and inform my skepticism about the limitations of LLMs. I've also seen countless times where people are convinced "with their eyes" they've found a prompt trick that improves the results, only to be shown that this doesn't pan out when run on a full eval suite.
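To make that concrete, "running it on a full eval suite" means something like this toy sketch, where callModel and grade are hypothetical placeholders rather than any particular vendor's API:

    // Toy eval harness: measure a prompt variant over a whole test set,
    // rather than eyeballing a couple of chats.
    // callModel and grade are hypothetical placeholders, not a vendor API.
    async function runEval(prompt, cases, callModel, grade) {
      let passed = 0;
      for (const c of cases) {
        const output = await callModel(prompt, c.input);
        if (grade(output, c.expected)) passed++;
      }
      return passed / cases.length; // aggregate pass rate
    }

    // A "prompt trick" only counts if it moves this number across enough
    // cases to drown out the noise:
    //   const baseline = await runEval(basePrompt, cases, callModel, grade);
    //   const tricked  = await runEval(trickPrompt, cases, callModel, grade);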

As an aside: what's fascinating is that our visual system seems much more skeptical. A slightly off eyeball created by a diffusion model will immediately set off alarms, while enough clever wordplay from an LLM will make us drop our guard.

0. https://en.wikipedia.org/wiki/ELIZA_effect


We get around this a bit when using it to write code since we have unit tests and can verify that it's making correct changes and adhering to an architecture. It has truly become much more capable in the last year. This technology is so flexible that it can be used in ways no eval will ever touch and still perform well. You can't just rely on what the labs say about it, you have to USE it.

Interesting observation about the visual system. Truth be told, we get visual feedback about the world at a much higher data rate, AND the visual signal is usually much more highly correlated with reality, whereas language is largely a byproduct of cognition and communication.

No one understands how LLMs work. But some people manage to delude themselves into thinking that they do.

One key thing that people prefer not to think about is that LLMs aren't created by humans. They are created by an inhuman optimization algorithm that humans have learned to invoke and feed with data and computation.

Humans have a say in what it does and how, but "a say" is about the extent of it. The rest is a black box - incomprehensible products of a poorly understood mathematical process. The kind of thing you have to research just to get some small glimpses of how it does what it does.

Expecting those humans to understand how LLMs work is a bit like expecting a woman to know how humans work because she made a human once.


Bro- do you even matrix multiply?

Spot on in my experience.

I work in a space where I get to build and optimise AI tools for my own and my team's use pretty much daily. As such I focus mainly on AI'ing the crap out of boring & time-consuming stuff that doesn't interest any of us any more, and luckily enough there's a whole lot of low hanging fruit in that space where AI is a genuine time, cost and sanity saver.

However any activity that requires directed conscious thought and decision making where the end state isn't clearly definable up front tends to be really difficult for AI. So much of that work relies on a level of intuition and knowledge that is very hard to explain to a layman - let alone eidetic idiots like most AIs.

One example is trying to get AI to identify IT security incidents in real time and take proactive action. Skilled practitioners can fairly easily use AI to detect anomalous events in near real time, but getting AI to take the next step and work out which combinations of "anomalous" activities equate to "likely security incident" is much harder. A reasonably competent human can usually do that relatively quickly, but often can't explain how they do it.

Working out what action is appropriate once the "likely security incident" has been identified is another task that a reasonably competent human can do, but where AIs are hopeless. In most cases, a competent human is WAAAY better at identifying a reasonable way forward based on insufficient knowledge. In those cases, a good decision made quickly is preferable to a perfect decision made slowly, and humans understand this fairly intuitively.


I think there is a difference between what you can expect from something when you know its internals versus when you don't, but it's not like the person who knows the internals is always much, much better at judging its potential.

Example: many people created websites without a clue how they really work. And got millions of people on them. Or had crazy ideas of what to do with them.

At the same time there are devs who know how the internals work but can't get a single user.

PC manufacturers were never able to even imagine what random people would end up doing with their PCs.

This is to say that even if you know the internals you can claim to know better, but that doesn't make it absolute.

Sometimes knowing the fundamentals is a limitation. It will limit your imagination.


I'm a big fan of the concept of 初心 (Japanese: Shoshin, aka "beginner's mind" [0]) and largely agree with Suzuki's famous quote:

> “In the beginner’s mind there are many possibilities, but in the expert’s there are few”

Experts do tend to be limited in what they see as possible. But I don't think that grants carte blanche to believe a fancy Markov chain will let you transcend humanity. I would argue one of the key concepts of "beginner's mind" is not radical assurance about what's possible but unbounded curiosity and a willingness to explore with an open mind.

Right now we see this in the Stable Diffusion community: there are tons of people who also don't understand matrix multiplication who are doing incredible work through pure experimentation. There's a huge gap between "I wonder what will happen if I just mix these models together" and "we're just a few years from surrendering our will to AI".

None of the people I'm concerned about have what I would consider an "open mind" about the topic of AI. They are sure of what they know, and to disagree is to invite complete rejection. Hardly a principle of beginner's mind.

Additionally:

> PC manufacturers were never able to even imagine what random people would end up doing with their PCs.

betrays a deep ignorance of the history of personal computing. Honestly, I don't think modern computing has ever returned to the ambition of what was being dreamt up, by experts, at Xerox PARC. The demos on the Xerox Alto in the early 1970s are still ambitious in some senses. And, as much as I'm not a huge fan, Gates and Jobs absolutely had grand visions for what the PC would be.

0. https://en.wikipedia.org/wiki/Shoshin


I think this is exactly what gets blunted by mass education and most textbooks. We need to rediscover it if we want to enjoy our profession amid all the signals flowing from social media about the great things other people are achieving. Staying stupid and hungry really helps.

I think this is more of a mechanistic-understanding-versus-fundamental-insight kind of situation. The linear algebra picture is currently very mechanistic, since it only tells us what the computations are. There are research groups trying to go beyond that, but the insights from these efforts are currently very limited. However, the probabilistic view is much clearer. You can have many explorable insights, both potentially true and false, just by understanding the loss functions, what the model is sampling from, what the marginal or conditional distributions are, and so on. Generative AI models are beautiful at that level. It is truly mind-blowing that in 2025 we are able to sample from megapixel image distributions conditioned on natural-language text prompts.

If that were true, then people could have predicted this AI many years ago.

If you dig into old ML/vision papers, you will see that, formulation-wise, they actually did; they just lacked the data, the compute, and the mechanistic machinery provided by the transformer architecture. The wheels of progress turn slowly and require many rotations to finally get somewhere.

> I've generally found an inverse correlation between "understands AI" and "exuberance for AI".

A few years ago I had this exact observation regarding self-driving cars. Non- and semi-technical people who worked in the tech industry were very bullish about self-driving cars, believing every ETA spewed by Musk, while engineers were cautiously optimistic or pessimistic depending on their understanding of AI, LiDAR, etc.


This completely explains why so many engineers are skeptical of AI while so many managers embrace it: The engineers are the ones who understand it.

(BTW, if you're an engineer who thinks you don't understand AI or are not qualified to work on it, think again. It's just linear algebra, and linear algebra is not that hard. Once you spend a day studying it, you'll think "Is that all there is to it?" The only difficult part of AI is learning PyTorch, since all the AI papers are written in terms of Python nowadays instead of -- you know -- math.)

I've been building neural net systems since the late 1980s. And yes they work and they do useful things when you have modern amounts of compute available, but they are not the second coming of $DEITY.


Linear algebra cannot be learned in a day. Maybe multiplying matrices when the dimensions allow, but there is far more to linear algebra than knowing how to multiply matrices. Knowing when and why is far more interesting. Knowing how to decompose them. Knowing what a non-singular matrix is and why it's special, and so on. Once you know what's covered in a basic lower-division linear algebra class, you can move on to linear programming and learn about cost functions and optimization, or numerical analysis. PyTorch is just a calculator. If I handed someone a TI-84 they wouldn't magically know how to bust out statistics on it…

> This completely explains why so many engineers are skeptical of AI while so many managers embrace it: The engineers are the ones who understand it.

Curiously some Feynman chap reported that several NASA engineers put the chance of the Challenger going kablooie—an untechnical term for rapid unscheduled deconstruction, which the Challenger had then just recently exhibited—at 1 in 200, or so, while the manager said, after some prevarications—"weaseled" is Feynman's term—that the chance was 1 in 100,000 with 100% confidence.


I mostly disagree with this. Lots of things correlate weakly with other things, often in confusing and overlapping ways. For instance, expertise can also correlate with resistance to change. Ego can correlate with protection of the status quo and dismissal of people who don't have the "right" credentials. Love of craft can correlate with distaste for automation of said craft (regardless of the effectiveness of the automation). Threat to personal financial stability can correlate with resistance (regardless of technical merit). Potential for personal profit can correlate with support (regardless of technical merit). Understanding neural nets can correlate both with exuberance and skepticism in slightly different populations.

Correlations are interesting but when examined only individually they are not nearly as meaningful as they might seem. Which one you latch onto as "the truth" probably says more about what tribe you value or want to be part of than anything fundamental about technology or society or people in general.


It's definitely interesting to look at people's mental models around AI.

I don't know shit about the math that makes it work, but my mental model is basically - "A LLM is an additional tool in my toolbox which performs summarization, classification and text transformation tasks for me imperfectly, but overall pretty well."

Probably lots of flaws in that model but I just try to think like an engineer who's attempting to get a job done and staying up to date on his tooling.

But as you say there are people who have been fooled by the "AI" angle of all this, and they think they're witnessing the birth of a machine god or something. The example that really makes me throw up my hands is r/MyBoyfriendIsAI where you have women agreeing to marry the LLM and other nonsense that is unfathomable to the mentally well.

There's always been a subset of humans who believe unimaginably stupid things, like that there's a guy in the sky who throws lightning bolts when he's angry, or whatever. The interesting (as in frightening) trend in modernity is that instead of these moron cults forming around natural phenomena we're increasingly forming them around things that are human made. Sometimes we form them around the state and human leaders, increasingly we're forming them around technologies, in line with Arthur C. Clarke's third law - that "Any sufficiently advanced technology is indistinguishable from magic."

If I sound harsh it's because I am, we don't want these moron cults to win, the outcome would be terrible, some high tech version of the Dark Ages. Yet at this moment we have business and political leaders and countless run-of-the-mill tech world grifters who are leaning into the moron cult version of AI rather than encouraging people to just see it as another tool in the box.


Google has good engineers. Generally I've noticed the better someone is at coding, the more critical they are of AI-generated code. Which makes sense, honestly. It's easier to spot flaws the more expert you are. This doesn't mean they don't use AI-generated code, just that they are more careful about when and where.

Yes, because they're more likely to understand that the computer isn't this magical black box, and that just because we've made ELIZA marginally better, doesn't mean it's actually good. Anecdata, but the people I've seen be dazzled by AI the most are people with little to no programming experience. They're also the ones most likely to look on computer experts with disdain.

Well yeah. And because when an expert looks at the code chatgpt produces, the flaws are more obvious. It programs with the skill of the median programmer on GitHub. For beginners and people who do cookie cutter work, this can be incredible because it writes the same or better code they could write, fast and for free. But for experts, the code it produces is consistently worse than what we can do. At best my pride demands I fix all its flaws before shipping. More commonly, it’s a waste of time to ask it to help, and I need to code the solution from scratch myself anyway.

I use it for throwaway prototypes and demos. And whenever I’m thrust into a language I don’t know that well, or to help me debug weird issues outside my area of expertise. But when I go deep on a problem, it’s often worse than useless.


This is why AI is the perfect management Rorschach test.

To management (out of IC roles for long enough to lose their technical expertise), it looks perfect!

To ICs, the flaws are apparent!

So inevitably management greenlights new AI projects* and behaviors, and then everyone is in the 'This was my idea, so it can't fail' CYA scenario.

* Add in a dash of management consulting advice here, and note that management consultants' core product was already literally 'something that looks plausible enough to make execs spend money on it'


My experience (with ChatGPT 5.1 as of late) is that the AI follows a problem->solution internal logic and doesn't stop to think about and structure its code.

If you ask for an endpoint to a CRUD API, it'll make one. If you ask for 5, it'll repeat the same code 5 times and modify it for the use case.

A dev wouldn't do this, they would try to figure out the common parts of code, pull them out into helpers, and try to make as little duplicated code as possible.
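For example (a toy Express-style sketch of what I mean, not what any model actually emitted), rather than five copy-pasted endpoint blocks you'd want one generic helper:

    // One generic CRUD registration helper instead of five near-identical blocks.
    // Assumes an Express app (with JSON body parsing) and a store exposing
    // list/get/create/update/remove - names here are purely illustrative.
    function registerCrud(app, name, store) {
      app.get(`/${name}`, (req, res) => res.json(store.list()));
      app.get(`/${name}/:id`, (req, res) => res.json(store.get(req.params.id)));
      app.post(`/${name}`, (req, res) => res.status(201).json(store.create(req.body)));
      app.put(`/${name}/:id`, (req, res) => res.json(store.update(req.params.id, req.body)));
      app.delete(`/${name}/:id`, (req, res) => { store.remove(req.params.id); res.status(204).end(); });
    }

    // registerCrud(app, 'users', userStore);
    // registerCrud(app, 'orders', orderStore);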

I feel like the AI has a strong bias towards adding things, and not removing them. The most obviously wrong thing is with CSS - when I try to do some styling, it gets 90% of the way there, but there's almost always something that's not quite right.

Then I tell the AI to fix a style, since that div is getting clipped or not correctly centered etc.

It almost always keeps adding properties, and after 2-3 tries and an incredibly bloated style, I delete the thing and take a step back and think logically about how to properly lay this out with flexbox.


> If you ask for an endpoint to a CRUD API, it'll make one. If you ask for 5, it'll repeat the same code 5 times and modify it for the use case.

> A dev wouldn't do this, they would try to figure out the common parts of code, pull them out into helpers, and try to make as little duplicated code as possible.

> I feel like the AI has a strong bias towards adding things, and not removing them.

I suspect this is because an LLM doesn't build a mental model of the code base like a dev does. It can decide to look at certain files, and maybe you can improve this by putting a broad architecture overview of a system in an agents.md file, I don't have much experience with that.

But for now, I'm finding it most useful to still think in terms of code architecture myself, give it small steps that are part of that architecture, and then iterate based on my own review of the AI-generated code. I don't have the confidence in it to just let some agent plan and then run for tens of minutes or even hours building out a feature. I want to be in the loop earlier to set the direction.


A good system prompt goes a long way with the latest models. Even just something as simple as "use DRY principles whenever possible." or prompting a plan-implement-evaluate cycle gets pretty good results, at least for tasks that are doing things that AI is well trained on like CRUD APIs.

> If you ask for an endpoint to a CRUD API, it'll make one. If you ask for 5, it'll repeat the same code 5 times and modify it for the use case.

I don't think this is an issue inherent to the technology. Duplicate code detectors have been around for ages. Give an AI agent a tool which calls one, ask it to reduce duplication, and it will start refactoring.

Of course, there is a risk of going too far in the other direction: refactorings which technically reduce duplication but which have unacceptable costs (you can be too DRY). But there are some possible solutions: (a) ask it to judge whether the refactoring is worth it or not - if it judges no, just ignore the duplication and move on; (b) get a human to review the decision in (a); (c) if the AI repeatedly makes the wrong decision (according to the human), apply prompt engineering, or maybe even just some hardcoded heuristics.


It actually is somewhat a limit of the technology. LLMs can't go back and modify their own output, later tokens are always dependent on earlier tokens and they can't do anything out of order. "Thinking" helps somewhat by allowing some iteration before they give the user actual output, but that requires them to write it the long way and THEN refactor it without being asked, which is both very expensive and something they have to recognize the user wants.

Coding agents can edit their own output - because their output is tool calls to read and write files, and so it can write a file, run some check on it, modify the file to try to make it pass, run the check again, etc
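A minimal sketch of that loop, with callModel standing in as a hypothetical placeholder for whatever model API the agent actually uses:

    // Sketch of an agent edit loop: write the file, run a check, feed failures back.
    const { readFileSync, writeFileSync } = require('node:fs');
    const { execSync } = require('node:child_process');

    async function agentLoop(task, file, callModel, maxTries = 3) {
      for (let i = 0; i < maxTries; i++) {
        const current = readFileSync(file, 'utf8');
        const revised = await callModel(task, current); // hypothetical model call
        writeFileSync(file, revised);                   // the agent "edits its own output"
        try {
          execSync(`node --check ${file}`);             // stand-in for tests/linter/compiler
          return true;                                  // check passed, stop iterating
        } catch (err) {
          task += `\nThe check failed with:\n${err.message}`; // feed the failure back
        }
      }
      return false; // gave up after maxTries
    }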

Sorry, but from where I sit, this only marginally closes the gap between AI and truly senior engineers.

Basically human junior engineers start by writing code in a very procedural and literal style with duplicate logic all over the place because that's the first step in adapting human intelligence to learning how to program. Then the programmer realizes this leads to things becoming unmaintainable and so they start to learn the abstraction techniques of functions, etc. An LLM doesn't have to learn any of that, because they already know all languages and mechanical technique in their corpus, so this beginning journey never applies.

But what the junior programmer has that the LLM doesn't is an innate, common-sense understanding of the human goals driving the creation of the code to begin with, and that serves them through their entire progression from junior to senior. As you point out, code can be "too DRY", but why? Senior engineers understand that DRYing up code is not a style issue; it's about maintainability, understanding what is likely to change, and what the apparent effects will be to the human stakeholders who depend on the software. Basically: do these things map to concepts that are the same for human users and unlikely to diverge in the future? This is a surprisingly deep question, as perhaps every human stakeholder will swear up and down they are the same, but nevertheless 6 months from now a problem arises that requires them to diverge. At this point there is a cognitive overhead and dissonance in explaining that divergence to users who were heretofore perfectly satisfied with one domain concept.

Ultimately the value function for success of a specific code factoring style depends on a lot of implicit context and assumptions that are baked into the heads of various stakeholders for the specific use case and can change based on myriad outside factors that are not visible to an LLM. Senior engineers understand the map is not the territory, for LLMs there is no territory.


I’m not suggesting AIs can replace senior engineers (I don’t want to be replaced!)

But, senior engineers can supervise the AI, notice when it makes suboptimal decisions, intervene to address that somehow (by editing prompts or providing new tools)… and the idea is gradually the AI will do better.

Rather than replacing engineers with AIs, engineers can use AIs to deliver more in the same amount of time


Which I think points out the biggest issue with current AI: knowledge workers in any profession, at any skill level, tend to get the impression that AI is very impressive but prone to failing at real-world tasks unpredictably. Thus the mental model of a 'junior engineer', or of any human who does simple tasks reliably on their own, is wrong.

AI operating at all levels needs to be constantly supervised.

Which would still make AI a worthwhile technology, as a tool, as many have remarked before me.

The problem is, companies are pushing for agentic AI instead of AI that can do repetitive, short-horizon tasks in a fast and reliable manner.


Sure. My point was AI was already 25% of the way there even with their verbose messy style. I think with your suggestions (style guidance, human in the loop, etc) we get at most 30% of the way there.

Bad code is only really bad if it needs to be maintained.

If your AI reliably generates working code from a detailed prompt, the prompt is now the source that needs to be maintained. There is no important reason to even look at the generated code


> the prompt is now the source that needs to be maintained

The inference response to the prompt is not deterministic. In fact, it’s probably chaotic since small changes to the prompt can produce large changes to the inference.


> The inference response to the prompt is not deterministic.

So? Nobody cares.

Is the output of your C compiler the same every time you run it? How about your FPGA synthesis tool? Is that deterministic? Are you sure?

What difference does it make, as long as the code works?


> Is the output of your C compiler the same every time you run it?

Yes? Because of actual engineering, mind you, and not rolling the dice until the lucky number comes up.

https://reproducibility.nixos.social/evaluations/2/2d293cbfa...


It's not true for a place-and-route engine, so why does it have to be true for a C compiler?

Nobody else cares. If you do, that's great, I guess... but you'll be outcompeted by people who don't.



That's an advertisement, not an answer.

Did you really read and understand this page in the 1 minute between my post and your reply or did you write a dismissive answer immediately?

Eh, I'll get an LLM to give me a summary later.

In the meantime: no, deterministic code generation isn't necessary, and anyone who says it is is wrong.


The C compiler will still make working programs every time, so long as your code isn’t broken. But sometimes the code chatgpt produces won’t work. Or it'll kinda work but you’ll get weird, different bugs each time you generate it. No thanks.

Nothing matters but d/dt. It's so much better than it was a year ago, it's not even funny.

How weird would it be if something like this worked perfectly out of the box, with no need for further improvement and refinement?


> So? Nobody cares

Yeah the business surely won't care when we rerun the prompt and the server works completely differently.

> Is the output of your C compiler the same every time you run it

I've never, in my life, had a compiler generate instructions that do something completely different from what my code specifies.

That you would suggest we will reach a level where an English language prompt will give us deterministic output is just evidence you've drank the kool-aid. It's just not possible. We have code because we need to be that specific, so the business can actually be reliable. If we could be less specific, we would have done that before AI. We have tried this with no code tools. Adding randomness is not going to help.


> I've never, in my life, had a compiler generate instructions that do something completely different from what my code specifies.

Nobody is saying it should. Determinism is not a requirement for this. There are an infinite number of ways to write a program that behaves according to a given spec. This is equally true whether you are writing the source code, an LLM is writing the source code, or a compiler is generating the object code.

All that matters is that the program's requirements are met without undesired side effects. Again, this condition does not require deterministic behavior on the author's part or the compiler's.

To the extent it does require determinism, the program was poorly- or incompletely-specified.

> That you would suggest we will reach a level where an English language prompt will give us deterministic output is just evidence you've drank the kool-aid.

No, it's evidence that you're arguing with a point that wasn't made at all, or that was made by somebody else.


You're on the wrong axis. You have to be deterministic about following the spec, or it's a BUG in the compiler. Whether or not you actually have the exact same instructions, a compiler will always do what the code says or it's bugged.

LLMs do not and cannot follow the spec of English reliably, because English is open to interpretation, and that's a feature. It makes LLMs good at some tasks, but terrible for what you're suggesting. And it's weird because you have to ignore the good things about LLMs to believe what you wrote.

> There are an infinite number of ways to write a program that behaves according to a given spec

You're arguing for more abstractions on top of an already leaky abstraction. English is not an appropriate spec. You can write 50 pages of what an app should do and somebody will get it wrong. It's good for ballparking what an app should do, and LLMs can make that part faster, but it's not good for reliably plugging into your business. We don't write vars, loops, and ifs for no reason. We do it because, at the end of the day, an English spec is meaningless until someone actually encodes it into rules.

The idea that this will be AI, and we will enjoy the same reliability we get with compilers, is absurd. It's also not even a conversation worth having when LLMs hallucinate basic linux commands.


People are betting trillions that you're the one who's "on the wrong axis." Seems that if you're that confident, there's money to be made on the other side of the market, right? Got any tips?

Essentially all of the drawbacks to LLMs you're mentioning are either already obsolete or almost so, or are solvable by the usual philosopher's stone in engineering: negative feedback. In this case, feedback from carefully-structured tests. Safe to say that we'll spend more time writing tests and less time writing original code going forward.


> People are betting trillions that you're the one who's "on the wrong axis."

People are betting trillions of dollars that AI agents will do a lot of useful economic work in 10 years. But if you take the best LLMs in the world, and ask them to make a working operating system, C compiler or web browser, they fail spectacularly.

The insane investment in AI isn't because today's agents can reliably write software better than senior developers. The investment is a bet that they'll be able to reliably solve some set of useful problems tomorrow. We don't know which problems they'll be able to reliably solve, or when. They're already doing some useful economic work. And AI agents will probably keep getting smarter over time. Thats all we know.

Maybe in a few years LLMs will be reliable enough to do what you're proposing. But neither I - nor most people in this thread - think they're there yet. If you think we're wrong, prove us wrong with code. Get ChatGPT - or whichever model you like - to actually do what you're suggesting. Nobody is stopping you.


> Get ChatGPT - or whichever model you like - to actually do what you're suggesting. Nobody is stopping you.

I do, all the time.

> But if you take the best LLMs in the world, and ask them to make a working operating system, C compiler or web browser, they fail spectacularly.

Like almost any powerful tool, there are a few good ways to use LLM technology and countless bad ways. What kind of moron would expect "Write an operating system" or "Write a compiler" or "Write a web browser" to yield anything but plagiarized garbage? A high-quality program starts with a high-quality specification, same as always. Or at least with carefully-considered intent.

The difference is, given a sufficiently high-quality specification, an LLM can handle the specification->source step, just as a compiler or assembler relieves you of having to micromanage the source->object code step.

IMHO, the way it will shake out is that LLMs as we know them today will be only components, perhaps relatively small ones, of larger systems that translate human intent to machine function. What we call "programming" today is only one implementation of a larger abstraction.


I think this might be plausible in the future, but it needs a lot more tooling. For starters you need to be able to run the prompt through the exact same model so you can reproduce a "build".

Even the exact same model isn't enough. There are several sources of nondeterminism in LLMs. These would all need to be squashed or seeded - which as far as I know isn't a feature that openai / anthropic / etc provide.

OK, then the current models aren't as good as I thought/hoped.

I guess one thing it means is that we still need extensive test suites. I suppose an LLM can write those too.


Well.. except the AI models are nondeterministic. If you ask an AI the same prompt 20 times, you'll get 20 different answers. Some of them might work, some probably won't. It usually takes a human to tell which are which and fix problems & refactor. If you keep the prompt, you can't manually modify the generated code afterwards (since it'll be regenerated). Even if you get the AI to write all the code correctly, there's no guarantee it'll do the same thing next time.

> It programs with the skill of the median programmer on GitHub

This is a common intuition but it's provably false.

The fact that LLMs are trained on a corpus does not mean their output represents the median skill level of the corpus.

Eighteen months ago GPT-4 was outperforming 85% of human participants in coding contests. And people who participate in coding contests are already well above the median skill level on Github.

And capability has gone way up in the last 18 months.


The best argument I've yet heard against the effectiveness of AI tools for SW dev is the absence of an explosion of shovelware over the past 1-2 years.

https://mikelovesrobots.substack.com/p/wheres-the-shovelware...

Basically, if the tools are even half as good as some proponents claim, wouldn't you expect at least a significant increase in simple games on Steam or apps in app stores over that time frame? But we're not seeing that.


Are you sure we aren't seeing an increase in steam games?

Charts I'm looking at show a mild exponential around 2024 https://www.statista.com/statistics/552623/number-games-rele...

Also, there's probably a bottleneck in manual review time.


The shovelware is the companies getting funded…

https://docs.google.com/spreadsheets/d/1Uy2aWoeRZopMIaXXxY2E...

The shovelware software is coming…


Interesting approach. I can think of one more explanation the author didn't consider: what if software development time wasn't the bottleneck to what he analyzed? The chart for Google Play app submissions, for example, goes down because Google made it much more difficult to publish apps on their store in ways unrelated to software quality. In that case, it wouldn't matter whether AI tools could write a billion production-ready apps, because the limiting factor is Google's submission requirements.

There are other charts besides Google Play. Particularly insightful is the Steam chart, as Steam is already full of shovelware and, in my experience, many developers wish they were making games but the pay is bad.

The GitHub repos chart is pretty interesting too, but it could be that people just aren't committing this stuff. Showing zero increase is unexpected though.


I've had this same thought for some time. There should have been an explosion in startups, new product from established companies, new apps by the dozen every day. If LLMs can now reliably turn an idea into an application, where are they?

There is a deluge, every day. Just nobody notices or uses them.

Still figuring out if they’re adding value to 200 customers or not.

The argument against this is that shovelware has a distinctly different distribution model now.

App stores have quality hurdles that didn’t exist in the diskette days. The types of people making low quality software now can self publish (and in fact do, often), but they get drowned out by established big dogs or the ever-shifting firehose of our social zeitgeist if you are not where they are.

Anyone who has been on Reddit this year in any software adjacent sub has seen hundreds (at minimum) of posts about “feedback on my app” or slop posts doing a god awful job of digging for market insights on pain points.

The core problem with this guy’s argument is that he’s looking in the wrong places - where a SWE would distribute their stuff, not a normie - and then drawing the wrong conclusions. And I am telling you, normies are out there, right now, upchucking some of the sloppiest of slop software you could ever imagine with wanton abandon.


Interesting, I would make the exact opposite conclusion from the same data: if AI coding was that bad, we'd see more crapware.

Coding Contest != Software Engineering

Or even solving problems that business need to solve, generally speaking.

This complete misunderstanding of what software engineering even is is the major reason so many engineers are fed up with clueless leaders foisting AI tools upon their orgs: leaders who apparently lack the critical reasoning skills to distinguish marketing speak from reality.


Algorithmic coding contests are not an equivalent skillset to professional software development

Trying to figure out how to align this with my experiences (which match the parent comment), I have an idea:

Coding contests are not like my job at all.

My job is taking fuzzy human things and making code that solves it. Frankly AI isn’t good at closing open issues on open source projects either.


I don't think this disproves my claim, for several reasons.

First, I don't know where those human participants came from, but if you pick people off the street or from a college campus, they aren't going to be the world's best programmers. On the other hand, GitHub users are on average more skilled than the average CS student. Even students and beginners who use GitHub usually don't have much code there. If the LLMs are weighted to treat every line of code about the same, they'd pick up more lines of code from prolific developers (who are often more experienced) than they would from beginners.

Also in a coding contest, you're under time pressure. Even when your code works, it's often ugly and thrown together. On GitHub, the only code I check in is code that solves whatever problem I set out to solve. I suspect everyone writes better code on GitHub than we do in programming competitions. I suspect if you gave the competitors functionally unlimited time to do the programming competition, many more would outperform GPT-4.

Programming contests also usually require that you write a fully self contained program which has been very well specified. The program usually doesn't need any error handling, or need to be maintained. (And if it does need error handling, the cases are all fully specified in the problem description). Relatively speaking, LLMs are pretty good at these kind of problems - where I want some throwaway code that'll work today and get deleted tomorrow.

But most software I write isn't like that. And LLMs struggle to write maintainable software in large projects. Most problems aren't so well specified. And for most code, you end up spending more effort maintaining the code over its lifetime than it takes to write in the first place. Chatgpt usually writes code that is a headache to maintain. It doesn't write or use local utility functions. It doesn't factor its code well. The code is often overly verbose. It often writes code that's very poorly optimized. Or the code contains quite obvious bugs for unexpected input - like overflow errors or boundary conditions. And the code it produces very rarely handles errors correctly. None of these problems really matter in programming competitions. But it does matter a lot more when writing real software. These problems make LLMs much less useful at work.


Chess AI trained at specific human levels performs better than any humans at those levels, because the random mistakes get averaged out.

https://www.maiachess.com


> The fact that LLMs are trained on a corpus does not mean their output represents the median skill level of the corpus.

It does, by default. Try asking ChatGPT to implement quicksort in JavaScript, the result will be dogshit. Of course it can do better if you guide it, but that implies you recognize dogshit, or at least that you use some sort of prompting technique that will veer it off the beaten path.


I asked the free version of ChatGPT to implement quicksort in JS. I can't really see much wrong with it, but maybe I'm missing something? (Ugh, I just can't get HN to format code right... pastebin here: https://pastebin.com/tjaibW1x)

----

    function quickSortInPlace(arr, left = 0, right = arr.length - 1) {
      if (left < right) {
        const pivotIndex = partition(arr, left, right);
        quickSortInPlace(arr, left, pivotIndex - 1);
        quickSortInPlace(arr, pivotIndex + 1, right);
      }
      return arr;
    }

    function partition(arr, left, right) {
      const pivot = arr[right];
      let i = left;

      for (let j = left; j < right; j++) {
        if (arr[j] < pivot) {
          [arr[i], arr[j]] = [arr[j], arr[i]];
          i++;
        }
      }

      [arr[i], arr[right]] = [arr[right], arr[i]]; // Move pivot into place
      return i;
    }

This is exactly the level of code I've come to expect from chatgpt. It's about the level of code I'd want from a smart CS student. But I'd hope to never use this in production:

- It always uses the last item as a pivot, which will give it pathological O(n^2) performance if the list is sorted. Passing an already sorted list to a sort function is a very common case. Good quicksort implementations will use a random pivot, or at least the middle pivot so re-sorting lists is fast.

- If you pass already sorted data, the recursive call to quickSortInPlace will take up stack space proportional to the size of the array. So if you pass a large sorted array, not only will the function take n^2 time, it might also generate a stack overflow and crash.

- This code: ... = [arr[j], arr[i]]; Creates an array and immediately destructures it. This is - or at least used to be - quite slow. I'd avoid doing that in the body of quicksort's inner loop.

- There's no way to pass a custom comparator, which is essential in real code.

I just tried in firefox:

    // Sort an array of 1 million sorted elements
    arr = Array(1e6).fill(0).map((_, i) => i)
    console.time('x')
    quickSortInPlace(arr)
    console.timeEnd('x')
My computer ran for about a minute then the javascript virtual machine crashed:

    Uncaught InternalError: too much recursion
This is about the quality of quicksort implementation I'd expect to see in a CS class, or in a random package in npm. If someone on my team committed this, I'd tell them to go rewrite it properly. (Or just use a standard library function - which wouldn't have these flaws.)
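For reference, the kind of rewrite I'd ask for is roughly this - random pivot, caller-supplied comparator, bounded stack depth - though this is just a sketch off the top of my head, not production code either:

    function quickSort(arr, cmp = (a, b) => a - b, left = 0, right = arr.length - 1) {
      while (left < right) {
        const p = partition(arr, cmp, left, right);
        // Recurse into the smaller half and loop on the larger one,
        // keeping stack depth O(log n) even on sorted input.
        if (p - left < right - p) {
          quickSort(arr, cmp, left, p - 1);
          left = p + 1;
        } else {
          quickSort(arr, cmp, p + 1, right);
          right = p - 1;
        }
      }
      return arr;
    }

    function partition(arr, cmp, left, right) {
      // Random pivot avoids the O(n^2) behavior on already-sorted input.
      const pivotIndex = left + Math.floor(Math.random() * (right - left + 1));
      swap(arr, pivotIndex, right);
      const pivot = arr[right];
      let i = left;
      for (let j = left; j < right; j++) {
        if (cmp(arr[j], pivot) < 0) {
          swap(arr, i, j);
          i++;
        }
      }
      swap(arr, i, right);
      return i;
    }

    function swap(arr, a, b) {
      const tmp = arr[a]; arr[a] = arr[b]; arr[b] = tmp; // no per-swap array allocation
    }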

OK, you just added requirements the previous poster had not mentioned. Firstly, how often do you really need to sort a million elements in a browser anyway? I expect that sort of heavy lifting would usually be done on the server, where you'd also want to do things like paging.

Secondly, if a standard implementation was to be used, that's essentially a No-Op. AI will reuse library functions where possible by default and agents will even "npm install" them for you. This is purely the result of my prompt, which was simply "Can you write a QuickSort implementation in JS?"

In any case, to incorporate your feedback, I simply added "that needs to sort an array of a million elements and accepts a custom comparator?" to the initial prompt and reran in a new session, and this is what I got in less than 5 seconds. It runs in about 160ms on Chrome:

https://pastebin.com/y2jbtLs9

How long would your team-mate have taken? What else would you change? If you have further requirements, seriously, you can just add those to the prompt and try it for yourself for free. I'd honestly be very curious to see where it fails.

However, this exchange is very illustrative: I feel like a lot of the negativity is because people expect AI to read their minds and then hold it against it when it doesn't.


> OK, you just added requirements the previous poster had not mentioned.

Lol of course! The real requirements for a piece of software are never specified in full ahead of time. Figuring out the spec is half the job.

> Firstly, how often do you really need to sort a million elements in a browser anyway? I expect that sort of heavy lifting would usually be done on the server

Who said anything about the browser? I run javascript on the server all the time.

Don't defend these bugs. 1 million items just isn't very many items for a sort function. On my computer, the built in javascript sort function can sort 1 million sorted items in 9ms. I'd expect any competent quicksort implementation to be able to do something similar. Hanging for 1 minute then crashing is a bug.

If you want a use case, consider the very common case of sorting user-supplied data. If I can send a JSON payload to your server and make it hang for 1 minute then crash, you've got a problem.

> If you have further requirements, seriously, you can just add those to the prompt and try it for yourself for free. [..] How long would your team-mate have taken?

We've gotta compare like for like here. How long does it take to debug code like this when an AI generates it? It took me about 25 minutes to discover & verify those problems. That was careful work. Then you reprompted it, and then you tested the new code to see if it fixed the problems. How long did that take, all added together? We also haven't tested the new code for correctness or to see if it has new bugs. Given its a complete rewrite, there's a good chance chatgpt introduced new issues. I've also had plenty of instances where I've spotted a problem and chatgpt apologises then completely fails to fix the problem I've spotted. Especially lifetime issues in rust - its really bad at those!

The question is this: Is this back and forth process faster or slower than programming quicksort by hand? I'm really not sure. Once we've reviewed and tested this code, and fixed any other problems in it, we're probably looking at about an hour of work all up. I could probably implement quicksort at a similar quality in a similar amount of time. I find writing code is usually less stressful than reviewing code, because mistakes while programming are usually obvious. But mistakes while reviewing are invisible. Neither you nor anyone else in this thread spotted the pathological behavior this implementation had with sorted data. Finding problems like that by just looking is hard.

Quicksort is also the best case for LLMs. Its a well understood, well specified problem with a simple, well known solution. There isn't any existing code it needs to integrate with. But those aren't the sort of problems I want chatgpt's help solving. If I could just use a library, I'm already doing that. I want chatgpt to solve problems its probably never seen before, with all the context of the problem I'm trying to solve, to fit in with all the code we've already written. It often takes 5-10 minutes of typing and copy+pasting just to write a suitable prompt. And in those cases, the code chatgpt produces is often much, much worse.

> I feel like a lot of the negativity is because people expect AI to read their minds and then hold it against it when it doesn't.

Yes exactly! As a senior developer, my job is to solve the problem people actually have, not the problem they tell me about. So yes of course I want it to read my mind! Actually turning a clear spec into working software is the easy part. ChatGPT is increasingly good at doing the work of a junior developer. But as a senior dev / tech lead, I also need to figure out what problems we're even solving, and what the best approach is. ChatGPT doesn't help much when it comes to this kind of work.

(By the way, that is basically a perfect definition of the difference between a junior and senior developer. Junior devs are only responsible for taking a spec and turning it into working software. Senior devs are responsible for reading everyone's mind, and turning that into useful software.)

And don't get me wrong. I'm not anti chatgpt. I use it all the time, for all sorts of things. I'd love to use it more for production grade code in large codebases if I could. But bugs like this matter. I don't want to spend my time babysitting chatgpt. Programming is easy. By the time I have a clear specification in my head, its often easier to just type out the code myself.


> Figuring out the spec is half the job.

That's where we come in of course! Look into spec-driven development. You basically encourage the LLM to ask questions and hash out all these details.

> Who said anything about the browser?... Don't defend these bugs.

A problem of insufficient specification... didn't expect an HN comment to turn into an engineering exercise! :-) But these are the kinds of things you'd put in the spec.

> How long does it take to debug code like this when an AI generates it? It took me about 25 minutes to discover & verify those problems.

Here's where it gets interesting: before reviewing any code, I basically ask it to generate tests, which always all pass. Then I review the main code and test code, at which point I usually add even more test cases (e.g. https://news.ycombinator.com/item?id=46143454). And, because codegen is so cheap, I can even include performance tests (which, statistically speaking, nobody ever does)!
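For instance, a trivial performance test of the kind I mean - just a sketch, assuming the quickSortInPlace from upthread:

    // Guard against the pathological sorted-input case discussed upthread.
    const sorted = Array.from({ length: 1e6 }, (_, i) => i);
    const start = performance.now();
    quickSortInPlace(sorted);
    const elapsed = performance.now() - start;
    // The built-in sort handles a sorted million-element array in milliseconds,
    // so anything taking seconds (or overflowing the stack) is a regression.
    console.assert(elapsed < 1000, `sorted-input case took ${elapsed}ms`);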

Here's a one-shot result of that approach (I really don't mean to take up more of your time, this is just so you can see what it is capable of): https://pastebin.com/VFbW7AKi

While I do review the code (a habit -- I always review my own code before a PR), I review the tests more closely because, while boring, I find them a) much easier to review, and b) more confidence-inspiring than manual review of intricate logic.

> I want chatgpt to solve problems its probably never seen before, with all the context of the problem I'm trying to solve, to fit in with all the code we've already written.

Totally, and again, this is where we come in! Still, it is a huge productivity booster even in uncommon contexts. E.g. I'm using it to do computer vision stuff (where I have no prior background!) with opencv.js for a problem not well-represented in the literature. It still does amazingly well... with the right context. Its initial suggestions were overindexed on the common case, but over many conversations, it "understood" my use-case and consistently gives appropriate suggestions. And because it's vision stuff, I can instantly verify results by sight.

Big caveat: success depends heavily on the use-case. I have had more mixed results in other cases, such as concurrency issues in an LD_PRELOAD library in C. One reason for the mixed sentiments we see.

> ChatGPT is increasingly good at doing the work of a junior developer.

Yes, and in fact, I've been rather pessimistic about the prospects of junior developers, a personal concern given I have a kid who wants to get into software engineering...

I'll end with a note that my workflow today is extremely different from before AI, and it took me many months of experimentation to figure out what worked for me. Most engineers simply haven't had the time to do so, which is another reason we see so many mixed sentiments. But I would strongly encourage everybody to invest the time and effort because this discipline is going to change drastically really quickly.


By volume, the vast majority of code on GitHub is from students. Think about that when you average GitHub for AI.

> By volume, the vast majority of code on GitHub is from students.

Who determined this? How?


I saw an ad for Lovable. The very first thing I noticed was an exchange where the promoter asked the AI to fix a horizontal scroll bar that was present on his product listing page. This is a common issue with web development, especially so for beginners. The AI’s solution? Hide overflow on the X axis. Probably the most common incorrect solution used by new programmers.

But to the untrained eye the AI did everything correctly.


Yes. The people who are amazed by AI were never that good at a particular subject area in the first place - I don't care who you are, you were not good enough. How do I know this? Well, I know economics, corporate finance, accounting et al. very deeply. I've engaged with LLMs for years now and still they cannot get below the surface level, and they are not improving past this.

It's easy to recall information, but something entirely different to do something with that information. Which is what those subject areas are all about: taking something (like a theory) and applying it in a disciplined manner given the context.

That's not to diminish what LLMs can do. But let's get real.


It works both ways. If you are good, it's also easier to spot moments of brilliance from an AI agent, when it saves you hours of googling, reading docs, and trial and error while you pour yourself a cup of coffee and ponder the next steps. You can spot when a single tab press saved you minutes.

Yes. Love it for quick explorations of available options, reviewing my work, having it propose tests, getting its help with debugging, and all kinds of general subject matter questions. I don’t trust it to write anything important but it can help with a sketch.

I am not a great (some would argue, not even good) programmer, and I find a lot of issues with LLM generated code. Even Claude pro does really weird dumb stuff.

It starts to make you realize how unaware many people must be of what their programs are doing to accept AI stuff wholesale.

> Generally I've noticed the better someone is at coding the more critical they are of AI generated code.

I'm having a dejavu of yesterday's discussion: https://news.ycombinator.com/item?id=46126988


IMO this is mostly just an ego thing. I often see staff+ engineers make up reasons why AI is bad, when really it’s just a prompting skill issue.

When something threatens a thing that gives you value, people tend to hate it


This seems to be the overall trend in AI. If you're an expert in something, you can see where it's wrong. If you're not, you can't.

Engineers at Google are much less likely to be doing green-field generation of large amounts of code. It's much more incremental, carefully measured changes to mature, complex software stacks, and done within the Google ecosystem, which is heavily divergent from the OSS-focused world of startups, where most training data comes from.

That is the problem.

AI is optimized to solve a problem no matter what it takes. It will try to solve one problem by creating 10 more.

I think long-running, long-horizon agentic AI is just snake oil at this point. AI works best if you can segment your task into 5-10 minute chunks, including the AI generation time, correction time, and engineer review time. To put it another way, a 10-minute sync with a human is necessary, otherwise it will go astray.

Then it just turns software engineering into a bothersome supervision job. Yes, I typed less, but I didn't feel the thrill of doing it myself.


> it just turns software engineering into a bothersome supervision job.

I'm pretty sure this is the entire C-level enthusiasm for AI in a nutshell. Until AI came along, SWE resisted being mashed into a replaceable-cog job that they don't have to think or care about. AI is the magic beans that are just tantalizingly out of reach, and boy do they want it.


But every version of AI for almost a century has had this property, right down from the first vocoders that were going to replace entire call centers to the convolutional AI that was going to give us self-driving cars. Yes, a century: vocoders were 1930s technology, but they could essentially read the time aloud.

... except they didn't. In fact most AI tech were good for a nice demo and little else.

In some cases, really unfairly. For instance, convnet map matching doesn't work well not because it doesn't work well, but because you can't explain to humans when it won't work well. It's unpredictable, like a human. If you ask a human to map a building in heavy fog they may come back with "sorry". SLAM with lidar is "better", except no, it's a LOT worse. But when it fails it's very clear why it fails because it's a very visual algorithm. People expect of AIs that they can replace humans but that doesn't work, because people also demand AIs never say no, never fail, like the Star Trek computer (the only problem the star trek computer ever has is that it is misunderstood or follows policy too well). If you have a delivery person occasionally they will radically modify the process, or refuse to deliver. No CEO is ever going to allow an AI drone to change the process and No CEO will ever accept "no" from an AI drone. More generally, no business person seems to ever accept a 99% AI solution, and all AI solutions are 99%, or actually mostly less.

AI winters. I get the impression another one is coming, and I can feel it's going to be a cold one. But in 10 years LLMs will be in a lot of stuff, just as after every other AI winter. A lot of stuff... but a lot less than CEOs are declaring it will be in today.


Luckily for us, technologies like SQL made similar promises (for more limited domains) and C-suites couldn't be bothered to learn that stuff either.

Ultimately they are mostly just clueless, so we will either end up with legions of far shittier companies than we have today (because we let them get away with offloading a bunch of work onto tools they don't understand and accepting low-quality output), or we will eventually realize the continued importance of human expertise.


There are plenty of good tasks left, but they're often one-off/internal tooling.

Last one at work: "Hey, here are the symptoms for a bug, they appeared in <release XYZ> - go figure out the CL range and which 10 CLs I should inspect first to see if they're the cause"

(Well suited to AI, because the worst case is I've looked at 10 CLs in vain, and the best case is it saved me from manually scanning through several thousand CLs - the EV is net positive)
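Roughly, the pattern is something like this sketch (using git and a public LLM client as stand-ins for the internal CL tooling and model; every name here is made up for illustration, not what we actually run):

    # Hypothetical triage helper: narrow a regression down to a short list of suspect changes.
    # Assumes tags for the last-good and first-bad releases and an OpenAI-compatible endpoint.
    import subprocess
    from openai import OpenAI

    client = OpenAI()  # stand-in for whatever model endpoint is actually available

    def suspect_changes(bug_report: str, good_tag: str, bad_tag: str, top_n: int = 10) -> str:
        # Collect the change range between the release that worked and the one that didn't.
        log = subprocess.run(
            ["git", "log", "--oneline", f"{good_tag}..{bad_tag}"],
            capture_output=True, text=True, check=True,
        ).stdout

        # Ask the model to rank the changes most likely to have caused the bug.
        prompt = (
            f"Bug symptoms:\n{bug_report}\n\n"
            f"Changes between {good_tag} and {bad_tag}:\n{log}\n"
            f"Rank the {top_n} changes most likely to have introduced this bug, "
            "with a one-line reason for each."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

The model only has to narrow a search that a human still verifies, which is exactly why the worst case stays cheap.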

It works for code generation as well, but not in a "just do my job" way, more in a "find which haystack the needle is in, and what the rough shape of the new needle is". Blind vibecoding is a non-starter. But... it's a non-starter for greenfields too, it's just that the FO of FAFO is a bit more delayed.


My internal mnemonic for targeting AI correctly is 'It's easier to change a problem into something AI is good at, than it is to change AI into something that fits every problem.'

But unfortunately the nuances in the former require understanding strengths and weaknesses of current AI systems, which is a conversation the industry doesn't want to have while it's still riding the froth of a hype cycle.

Aka 'any current weaknesses in AI systems are just temporary growing pains before an AGI future'


> 'any current weaknesses in AI systems are just temporary growing pains before an AGI future'

I see we've met the same product people :)


I had a VP of a revenue cycle team tell me that his expectation was that they could fling their spreadsheets and Word docs on how to do the calculations at an AI-powered vendor, and the AI would be able to (and I quote directly) "just figure it all out."

That's when I realized how far down the rabbit hole marketing to non-technical folks on this was.


I think it’s a fair point that Google has more stakeholders with a serious investment in some flubbed AI-generated code not tanking their share value, but I’m not sure the rest of it is all that different from what an engineer at $SOME_STARTUP does after the first ~8 months the company is around. Maybe some folks throwing shit at a wall to find PMF are really getting a lot out of this, but most of us are maintaining and augmenting something we don’t want to break.

Yeah, but Google won’t expect you to use AI tools developed outside Google and trained primarily on OSS code. It would expect you to use the internal Google AI tools trained on google3, no?

Excuse the throwaway. It's not even just the employees; it doesn't seem like the technical leadership seriously cares about internal AI use either. Before I left, all they pushed was code generation, but my work was 80% understanding 5- to 20-year-old code and 20% actual development. If they had put any noticeable effort into an LLM that could answer "show me all users of Proto.field that would be affected by X", my life would have changed for the better, but I don't think the technical leadership understands this, or they don't want to spare the TPUs.
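For what it's worth, a crude approximation of that query is just code search plus an LLM pass over the hits. A minimal sketch, with grep and a public LLM client as stand-ins and every name invented for illustration - no claim this resembles any internal tooling:

    # Hypothetical sketch: find call sites that reference a proto field, then ask an
    # LLM which of them a described change would affect. grep gives recall, the model
    # supplies the judgement a human would otherwise apply by hand.
    import subprocess
    from openai import OpenAI

    client = OpenAI()  # stand-in for whatever model endpoint is available

    def affected_users(field_name: str, change_description: str, repo_root: str) -> str:
        # Cheap recall step: every line in the tree that mentions the field.
        hits = subprocess.run(
            ["grep", "-rn", field_name, repo_root],
            capture_output=True, text=True,
        ).stdout

        # Precision step: let the model reason about which call sites the change breaks.
        prompt = (
            f"Proposed change: {change_description}\n\n"
            f"Call sites referencing {field_name}:\n{hits}\n"
            "Which of these call sites would be affected by the change, and why?"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

At google3 scale the grep step would obviously need a real code-search index, but the shape of the tool is that simple.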

When I started at my post-Google job, I felt so vindicated when my new TL recommended that I use an LLM to catch up if no one was available to answer my questions.


Googler, opinion is my own.

Working in our mega-huge codebase, with lots of custom tooling and bleeding-edge stuff, hasn't been the best fit for AI-generated code compared to most companies.

I do think AI in a rubber-ducky / research-assistant role has been helpful overall for me as a SWE.


Makes sense to me.

From the outside, the AI push at Google very closely resembles the death march that Google+ was, but immensely more intense, with the entire tech ecosystem following suit.


Being forced to adopt tools regardless of fit to workflow (and being smart enough to understand the limitations of the tools despite management's claims) correlates very well with being negative on them.

I notice that expert taste tends to be pretty bimodal, e.g. chefs either enjoy really well-made food or some version of the scrappy fast-food comfort they grew up eating.

Bimodal here suggests either/or, which I don’t think is correct for either chefs or code enjoyers. I think experts tend to eschew snobbery more and can see the value in comfort food, quick-and-dirty AI prototypes or boilerplate, or, say, cheap and drinkable wine, while also being able to appreciate what the truly high-end looks like.

It’s the mid-range with pretensions that gets squeezed out. I absolutely do not need a $40 bottle of wine to accompany my takeout curry, I definitely don’t need truffle slices added to my carbonara, and I don’t need to hand-roll conceptually simple code.


You cannot trust someone’s judgement on something if that something can result in them being unemployed.

Or if they stand to make a lot of money.

See, both sides can be pithy.


Because autocorrect and predictive text don't help when half your job is revisions.

So I would love to be a fly on the wall in their office and hear all their conversations.

People who've spent their life perfecting a craft are exactly the people you'd expect would be most negative about something genuinely disrupting that craft. There is significant precedent for this. It's happened repeatedly in history. Really smart, talented people routinely and in fact quite predictably resist technology that disrupts their craft, often even at great personal cost within their own lifetime.

I don't know that I consider recognizing the limitations of a tool to be resistance to the idea. It makes sense that experts would recognize those limitations most acutely -- my $30 Harbor Freight circular saw is a lifesaver when I'm doing slapdash work in my shed, but it'd be a critical liability for a professional carpenter needing precision cuts. That doesn't mean the professional carpenter is resistant to the idea of using power saws, just that they necessarily must be more discerning than I am.

Yes, you get it. Obviously “writing code” will die. It will hold on in legacy systems that need bespoke maintenance, like COBOL systems do today. There will be artisanal coders, like there are artisanal blacksmiths, who do it the old-fashioned way, and we will smile and encourage them. Within 20 years, writing code syntax will be like writing assembly: something they make you do in school, something your dad brings up when reminiscing about the good old days.

I talked to someone who was in denial about this, until he realized he had conflated writing code with solving problems. Solving problems isn’t going anywhere! Solving problems means: you observe a problem, write out a solution, implement that solution, measure the problem again, consider your metrics, then iterate.

“Implement it” can mean writing code, as it has for the past 40 years, but it hasn’t always meant that. Before coding, it was economics and physics majors who studied and implemented scientific management. For the next 20 years, it will be “describe the tool to Claude Code and use the result”.


But Claude cannot code at all; it's going to shit the bed, and it only learns from human coders to even be able to tell that an example is a solution rather than malware...

Every greenfield project uses Claude Code to write 90+% of its code. Every YC startup for the past six months says AI writes 90+% of their code. Claude Code writes 90+% of my code. That’s today.

It works great. I have a faster iteration cycle. For existing large codebases, AI modifications will continue to be okay-ish. But new companies with a faster iteration cycle will outcompete old ones, and so in the long run most codebases will converge on the same “in-distribution” tech stacks, architectures, and design principles that AI is good at.


> Every greenfield project uses Claude Code to write 90+% of its code.

Who determined this? How?



