Denzel's comments | Hacker News

First, very cool! Thank you for sharing some actual projects with the prompts logged.

I think you and I have different definitions of “one-shotting”. If the model has to be steered, I don’t consider that a one-shot.

And you clearly “broke” the model a few times based on your prompt log where the model was unable to solve the problem given with the spec.

Honestly, your experience in these repos matches my daily experience with these models almost exactly.

I want to see good/interesting work where the model is going off and doing its thing for multiple hours without supervision.


> I want to see good/interesting work where the model is going off and doing its thing for multiple hours without supervision.

I'd be hesitant to use that as a way to evaluate things. Different systems run at different speeds. I want to see how much it can get done before it breaks, in different scenarios.


I never claimed Opus 4.5 can one-shot things? Even human-written software takes a few iterations to add/polish new features as they come to mind.

> And you clearly “broke” the model a few times based on your prompt log where the model was unable to solve the problem given with the spec.

That's less due to the model being wrong and more due to myself not knowing what I wanted because I am definitely not a UI/UX person. See my reply in the sibling thread.


Apologies, I may have misinterpreted the passage below from your repo:

> This crate was developed with the assistance of Claude Opus 4.5 initially to answer the shower thought "would the Braille Unicode trick work to visually simulate complex ball physics in a terminal?" Opus 4.5 one-shot the problem, so I decided to further experiment to make it more fun and colorful.

Also, yes, I don’t dispute that human written software takes iteration as well. My point is that the significance of autonomous agentic coding feels exaggerated if I’m holding the LLM’s hand more than I have to hold a senior engineer’s hand.

That doesn’t mean the tech isn’t valuable. The claims just feel exaggerated.


If you click the video that line links to, it one-shot the original problem, which was very explicitly defined as a PoC, not the entire project. The final project shipped is substantially different, and that's the difference between YOLO vibecoding and creating something useful.

There are also the embarrassing corner-physics bugs present in that video, which required a fix in the first few prompts.


Weird, I broke Opus 4.5 pretty easily by giving some code, a build system, and integration tests that demonstrate the bug.

CC confidently iterated until it discovered the issue. CC confidently communicated exactly what the bug was, with a detailed step-by-step deep dive into all the sections of the code that contributed to it. CC confidently suggested a fix that it then implemented. CC declared victory after 10 minutes!

The bug was still there.

I’m willing to admit I might be “holding it wrong”. I’ve had some successes and failures.

It’s all very impressive, but I still have yet to see how people are consistently getting CC to work for hours on end to produce good work. That still feels far fetched to me.


Are you a one-person shop? How do you find clients?


Almost always they start by having connections that hire them (old colleagues, former friends, etc.), building out those connections (conference talks, doing really good work, writing high quality blogs), and then if you're lucky, some word of mouth.


Good points - admittedly, I didn’t put enough effort into building connections through different pipelines back when I was contracting. Upwork and a few personal connections were my sole sources.

It just felt really difficult to do the engineering work while trying to do customer development at the same time.

The fact that OP has been able to do this for so long, while supporting a family, piqued my interest.


Can confirm.


Would you mind sharing the repo?


Sure, why not. I'll drop a repo over the weekend.


Thank you. I'll keep an eye out too. I've not seen any good examples of 'good vibe coded products' yet.

Good being a difficult term to define, but most if not all of us here know what I mean.


https://github.com/bishonen/plusone

Expectation management: Please remember that this is a result of 8 vibe coding sessions (~15 minutes each).


Very interesting, thanks for sharing! Looks like you have considerable experience with vibe coding to be able to produce that in 2 hours.


While I do have some experience with vibe coding, this could've been done by my wife who has little tech knowledge. That's the scary part.

My flow was to open 3 terminals, ask AI to work on some feature in each, check how it looked in the frontend, and if it didn't look/work quite right, ask it again. Once I deemed the feature OK, I just cleared the context and went on to a new task. The 3 terminals ate through the Claude $20 plan within 10-15 minutes.

I wonder if this breaks at some point when the codebase is more complex/large, or not. If it doesn't, then the future is scary, because everyone can recreate many of the SaaS products within hours. What's the moat for Todoist, for example? Without AI, it would've taken quite some effort, know-how and time to get something similar up and running. I reckon that with the $100 plan, I could have made it almost identical to it. Perhaps I could even create mobile app builds as well (React Native perhaps). What stops me from then offering this for 1/5 the cost of the real app?

And that's established apps. Imagine how easy/trivial it is to clone something that's new, and that was possibly vibe coded itself. E.g. someone posts to HN "Show HN: I made xyz". It looks great, it works great, it has a great idea. Then we take LLMs and recreate it within 4 hours. Poof! There's no reason to pay for it instantly.

That's what I find depressing, though: having a great idea and using an LLM to create a great product will not be enough. People will be able to clone everything. At least that's what my little experience with Claude tells me. And now let's just wait 1 more year and see how good Claude Code 2.0 and co will be. I reckon sooner rather than later, 0 tech knowledge will be needed to get apps up and running.

That's why it's time to pivot to some other role in the near future ;)


> It's that we're paying more for objectively worse service than we had a decade ago.

> I'm not asking for magic, I'm asking where went the reliability we already had, at the prices we're already paying.

My god thank you! My partner and I have been talking about this for the past 2 years in the context of food service and delivery service industry.

Greater than 50% of all our restaurant orders are straight up wrong or missing items, whether it’s from local places, chains, or fast food restaurants.

The unreliability is staggering, especially because we’re paying so much more!

It’s gotten so bad that we’re done with certain services and establishments for good now, or we make sure to QC before leaving the restaurant to ensure everything is in the bag.

Even more ironic, this happened a couple weeks ago at Texas Roadhouse — the same restaurant I worked in decades ago as a teenager, so I remember the process we had to go through for to-go orders.

First, we’d take the order over the phone. We’d repeat the order back to the customer to confirm everything (1st QC). When the food came up in the window, we’d pack the food in bags, crossing off every item on the receipt before stapling it to the bag (2nd QC). When the customer came to pick up their food, we’d have to take every box out of the bag, show the customer the food, and confirm that everything they expected in their order was there (3rd QC).

No customer. Ever left. With an incorrect order. Simple.

That process is gone now. We paid more and came home missing my partner’s meal. Wtf.


I hear a lot of stories like this; but the question I always come back to is: what incentive is there to discard a working system? In your case it's the 3-step QC process; why is that just gone now?

The more I look into these systematic changes the less sense it makes.


I don't know anything about Texas Roadhouse but in general I'd say a lot of processes got sloppy after technological changes repeatedly added complexity.

Decades ago, when OP worked there, I'm guessing Texas Roadhouse only took takeout orders over the phone and in person, and didn't receive that many. It was less common to order takeout from sit-down restaurants. There was one procedure, the steps made logical sense, it could be implemented entirely within the restaurant without a lot of IT. And it worked, and it probably didn't take that much staff time on a normal night.

Now, you can still order by phone, but I see you can also at least order online for pickup, order via UberEats for delivery, and order via DoorDash for pickup or delivery. They've likely added these various modes over time, and I'm sure each has its own subtly different procedure reflecting various IT systems nobody in the restaurant has any control over.

The three-part QC process might still work for phone orders but those are probably rare. Orders picked up by the actual end customer could use a two-part QC process, verifying the items against the receipt and presenting them for visual inspection. But orders getting picked up by a delivery person can practically only get a quick check as they're loaded in the bag, because the delivery person is in a hurry and won't want to stand there and help check the receipt against what's in the bag. They also may not be able to effectively do so if they don't know the menu (for instance, there are several sides that could be "steamed vegetables" at a quick glance https://www.texasroadhouse.com/location/457-countrysideil/di...) and aren't sufficiently fluent in English.

Rather than have a complex flowchart for the overworked staff maximizing QC for every case, it's very easy to default to the minimum which works in every case, which is hurriedly comparing the menu items that come out of the kitchen to what's on the receipt as they're loaded into the bag. It's very easy to get this wrong, especially if you're overworked and distracted and loading multiple bags at once.


It’s more expensive labor (for the same quality) and fewer redundant staff. I suspect demographics trends are part of the reason.


Uhm, you actually just proved their point if you run the numbers.

For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.

In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh

The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.

You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and fewer features.

And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.

Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
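For anyone who wants to play with those parameters, here's a rough sketch of the arithmetic above. Every input is the comment's own assumption ($20k hardware, 2 kW draw, $0.20/kWh, 5-10 tok/s, 3 years of continuous operation), not a measured number:

```python
# Back-of-the-envelope cost model for the home-rig scenario above.
# All inputs are assumptions from the comment, not benchmarks.

HOURS = 3 * 365 * 24            # 3 years of continuous operation

hardware_usd = 20_000           # assumed system cost
power_usd = 2 * HOURS * 0.20    # 2 kW draw * hours * $0.20/kWh
total_usd = hardware_usd + power_usd

def usd_per_mtok(tok_per_sec):
    """Total cost divided by total tokens generated, in $/Mtok."""
    tokens = tok_per_sec * HOURS * 3600
    return total_usd / (tokens / 1e6)

print(f"total 3-year cost: ${total_usd:,.0f}")
print(f"at 10 tok/s: ${usd_per_mtok(10):.0f}/Mtok")
print(f"at  5 tok/s: ${usd_per_mtok(5):.0f}/Mtok")
```

Tweaking electricity price, hardware cost, or throughput shifts the break-even point, which is the whole argument: the conclusion is sensitive to the inputs.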


5 to 10 tokens per second is a bungus-tier rate.

https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...

NVIDIA's 8xB200 gets you 30k tok/s on DeepSeek 671B; at maximum utilization that's 1 trillion tokens per year. At a dollar per million tokens, that's $1 million.

The hardware costs around $500k.

Now ideal throughput is unlikely, so let's say you get half that. It's still 500B tokens per year.

Gemini 3 Flash is like $3/million tokens, and I assume it's a fair bit bigger, maybe 1 to 2T parameters. I can sort of see how you can get this to work with the margins the AI companies repeatedly assert.
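The datacenter side of that math can be sketched the same way. Inputs here are again the comment's assumptions (~$500k for an 8xB200 node, ~30k tok/s aggregate on DeepSeek 671B, selling at $1/Mtok):

```python
# Revenue sketch for the 8xB200 scenario above. All inputs are the
# comment's assumptions, not vendor-verified numbers.

SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_revenue_usd(tok_per_sec, usd_per_mtok=1.0, utilization=1.0):
    """Tokens served per year, priced at usd_per_mtok."""
    tokens = tok_per_sec * SECONDS_PER_YEAR * utilization
    return tokens / 1e6 * usd_per_mtok

ideal = annual_revenue_usd(30_000)                    # ~1T tokens/yr
realistic = annual_revenue_usd(30_000, utilization=0.5)

print(f"ideal utilization: ${ideal:,.0f}/yr")
print(f"half utilization:  ${realistic:,.0f}/yr")
```

Against an assumed $500k hardware cost, even the half-utilization case covers the node in roughly a year, before power, networking, staffing, and the other DC costs the sibling comments raise.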


Cool, that potential 5x cost improvement just got delivered this year. A company can continue running the previous generation until EOL, or take a hit by writing off the residual value - either way they’ll have a mixed cost model that puts their token cost somewhere in the middle between previous and current gens.

Also, you’re missing material capex and opex costs from a DC perspective. Certain inputs exhibit diseconomies of scale when your demand outstrips market capacity. You do notice electricity costs are rising and companies are chomping at the bit to build out more power plants, right?

Again, I ran the numbers for simplicity’s sake to show it’s not clear cut that these models are profitable. “I can sort of see how you can get this to work” agrees with exactly what I said: it’s unclear, certainly not a slam dunk.

Especially when you factor in all the other real-world costs.

We’ll find out soon enough.


Google runs everything on their TPUs, which are substantially less costly to make and use less energy to run. While I'm sure OpenAI and others are bleeding money by subsidizing things, I'm not entirely sure that's true for Google (despite it actually being easier for them to do so if they wanted to).


I’m well aware of what Google does and their AI strategy ;)


We probably work at the same company, given you used MAANG instead of FAANG.

As one of the WAU (really DAU) you’re talking about, I want to call out a few things:

1) The LOC metrics are flawed, and anyone using the agents knows this - e.g., ask CC to rewrite the 1 commit you wrote into 5 different commits; now you have 5 100% AI-written commits.

2) Total speed-up across the entire dev lifecycle is far below 10x, most likely below 2x - but I don’t see any evidence of anyone measuring the counterfactuals to prove speed-up anyway, so there’s no clear data.

3) Look at token spend for power users; you might be surprised by how many SWE-years they’re spending.

Overall it’s unclear whether LLM-assisted coding is ROI-positive.


To add to your point:

If the M stands for Meta, I would also like to note that as a user, I have been seeing increasingly poor UI, of the sort I'd expect from people committing code that wasn't properly checked before going live, as I would expect from vibe coding in the original sense of "blindly accept without review". Like, some posts have two copies of the sender's name in the same location on screen with slightly different fonts going out of sync with each other.

I can easily believe the metrics that all [MF]AANG bonuses are denominated in are going up, our profession has had jokes about engineers gaming those metrics even back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d


Oh yes, all of this I agree with. I had tried to clarify this above, but your examples are clearer. My point is: all measures and studies I have personally seen of AI impact on productivity have been deeply flawed for one reason or another.

Total speed up is WAY less than 10x by any measure. 2x seems too high too.

By the data alone, the impact is a bit unclear, I agree. But I will say there seems to be a clear picture that, to me, starting from a prior formed from personal experience, indicates some real productivity impact today, with a trajectory suggesting that these claims of a lot of SWE work being offloaded to agents over the next few years are not that far-fetched.

- Adoption and retention numbers, internally and externally. You can argue this is driven by perverse incentives and/or the perception-performance mismatch, but I’m highly skeptical of this; even though the effects of both are probably real, it would be truly extraordinary to me if there weren’t at least a ~10-20% bump in productivity today, with a lot of headroom to go as integration gets better, user skill gets better, and model capabilities grow.

- Benchmark performance. Again, benchmarks are really problematic, but there are a lot of them, and all of them together paint a pretty clear picture of capabilities truly growing, and growing quickly.

- There are clearly biases we can think of that would cause us to overestimate AI impact, but there are also biases that may cause us to underestimate it: e.g., I’m now able to do work that I would never have attempted before. Multitasking is easier. Experiments are quicker and easier. That may not be captured well by, e.g., task completion time or other metrics.

I even agree: quality of agentic code can be a real risk, but:

- I think this ignores the fact that humans have also always written shitty code and always will; there is lots of garbage in production, believe me, and that predates agentic code

- as models improve, they can correct earlier mistakes

- it’s also a muscle to grow: how to review and use humans in the loop to improve quality and set a high bar


Great response, we’re like 98% aligned at a high-level. :) These next few years will be interesting.


Asking as an eng that's starting to drive daily with CC:

- How much has your TTM reduced by? How did you measure?

- What's the net difference when you factor in token spend expenses?

- By how much can Anthropic increase prices before crossing over your break-even point?


‘Desirable difficulty’ is the research term. To solve your problem, first understand your users need a mindset change. We need to connect their action to a “satisfying feeling” as you said.

You want your users to be like weightlifters. No lifter comes out of the gym saying, “Man, that was the best workout, felt so easy.” To the contrary, lifters use progressive overload to induce difficulty, because that difficulty connects to the results they want.

For your users, you need some way to measure the outcome, so that you can show them, “hey look, that mild discomfort led to more progress on what you care about,” and then you need to consistently message that some difficulty is good.

Mindset change takes consistency and time. It won’t happen overnight. You’ll know you succeeded when students become aware that “hey, I’m not learning as well if it doesn’t feel difficult,” and then react by increasing the challenge.


Weightlifters use weight they can lift and feel good after the session. Literally. They may feel tired, but they feel good. They see the weights go up and feel like they're progressing (unless they are in fact stagnating).

That is the literal opposite of what OP describes. What OP describes is a weightlifter taking on weight they can't lift and constantly feeling like a failure after each training session.


That's a great analogy. I'll need to think about how to message that difficulty is good. It's a tricky proposition.


https://www.pnas.org/doi/10.1073/pnas.1320040111

In 2014, Facebook published a paper showing how they can manipulate users’ emotions with their news feed algorithm.

Facebook ran this test on 700k users without consent.

I deactivated my account the day I read that paper and never looked back.

