
Why do we have guides and lessons on how to use a chainsaw when we could just hack at the tree with an axe?


The chainsaw doesn't sometimes chop off your arm when you are using it correctly.


If you swing an axe without much hand-eye coordination, you don't think it's possible to seriously injure yourself?


Was the axe or the chainsaw designed in such a way that guarantees it will miss the log and hit your hand a fair amount of the time you use it? If it were, would you still use it? Yes, these hand tools are dangerous, but they were not designed so that they would probably cut off your hand even 1% of the time. "Accidents happen" and "AI slop" are not even remotely the same.

So with "AI" we're taking a tool that is known to "hallucinate", and not infrequently, and putting it in charge of whatever-the-fuck we can?

I have no doubt "AI" will someday be embedded inside a "smart chainsaw", because we as humans are far more stupid than we think we are.


https://scale.com/leaderboard/swe_bench_pro_commercial

I definitely trust the totally private dataset more.


I'm a little skeptical of a full-on, 2008-style 'burst'. I imagine it'll be closer to a slow deflation as these companies need to turn a profit.

Fundamentally, serving a model via API is profitable (re: Dario, OpenAI), and inference costs come down drastically over time.

The main expenses are twofold: 1. Training a new model is extremely expensive: GPUs, yolo runs, data.

2. Newer models tend to churn through more tokens and be more expensive to serve in the beginning before optimizations are made.

(not including payroll)

OpenAI and Anthropic can become money printers once they downgrade the free tiers, add ads or other attention-monetizing methods, and lean on usage-based pricing as people and businesses become more and more integrated with LLMs, which are undoubtedly useful.



Not really sure how this article refutes what I said?

He defines it as "everything that happens from when you put a prompt in to generate an output", but he seems to conflate inference with a query. Feeding the input through the model to generate the next single token is inference; a query or response just means the LLM repeats this until a stop token is emitted. (Happy to be corrected here.)

The cost of inference per token is going down; the cost per query goes up because newer models consume more tokens, which was my point.
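
To make that concrete, here's a toy sketch in Python (the numbers, the stand-in model, and the helper names are all made up for illustration; they're not any provider's real prices or API):

    # A "query" is just next-token inference repeated until a stop token,
    # so cost per query = tokens consumed x price per token.

    def decode(model_step, prompt_tokens, stop_token, max_tokens=4096):
        """Autoregressive loop: one inference (forward pass) per generated token."""
        tokens = list(prompt_tokens)
        while len(tokens) < max_tokens:
            tok = model_step(tokens)          # a single next-token inference
            tokens.append(tok)
            if tok == stop_token:
                break
        return tokens

    def cost_per_query(tokens_used, price_per_million_tokens):
        return tokens_used * price_per_million_tokens / 1_000_000

    dummy_step = lambda toks: "<eos>"         # stand-in model that stops immediately
    print(decode(dummy_step, ["Hi"], "<eos>"))       # ['Hi', '<eos>']

    # The per-token price can fall while the per-query cost rises, if the
    # newer model burns more tokens (e.g. long chain-of-thought):
    print(cost_per_query(2_000, 30.0))        # 0.06 -- older, terser model
    print(cost_per_query(20_000, 10.0))       # 0.2  -- newer, cheaper per token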

Either way, charging consumers per token pretty much guarantees that serving models is profitable (each of Anthropic's prior models turns a profit). The consumer-friendly flat $20 subscription is not sustainable in the long run.

https://epoch.ai/data-insights/llm-inference-price-trends

https://www.snellman.net/blog/archive/2025-06-02-llms-are-ch...

https://x.com/eladgil/status/1827521805755806107


This article is interesting but pretty shallow.

0(?): There's no definition provided of what a 'world model' is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT (https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c) has a world model, right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board, or the notation, and ask it about the position. It might not give you GM-level analysis, but it definitely has a model of what's going on.

2. Without saying which LLM they used or sharing the chats, these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and/or physics and ask GPT-5 Thinking. It's got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358


A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.

The examples are from all the major commercial American LLMs as listed in a sister comment.

You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that absent a board to look at it would be playing blindfold chess, which it isn't good at, and ask you to list the position after every move to make it fair; otherwise it doesn't know what it's doing, and it's demonstrably the latter.


If you train an LLM on chess, it will learn that too. You don't need to explain the rules; just feed it chess games, and at some point it will stop making illegal moves. It's a clear example of a world model inferred from training.
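
One way to sanity-check that claim yourself (not necessarily the linked paper's exact setup) is to measure the legal-move rate of games the model generates; a rough sketch using the python-chess library, with the hard-coded games standing in for model output:

    # Rough sketch: what fraction of generated moves are legal?
    # The two hard-coded games below are placeholders for moves an LLM produced.
    # Requires python-chess (pip install chess) for the rules bookkeeping.

    import chess

    def legal_move_rate(games):
        """games: list of move lists in SAN, e.g. [["e4", "e5", "Nf3"], ...]."""
        legal = total = 0
        for moves in games:
            board = chess.Board()
            for san in moves:
                total += 1
                try:
                    board.push_san(san)   # raises ValueError if illegal here
                    legal += 1
                except ValueError:
                    break                 # stop this game at the first illegal move
        return legal / total if total else 0.0

    print(legal_move_rate([
        ["e4", "e5", "Nf3", "Nc6"],   # all legal
        ["e4", "e5", "Ke3"],          # Ke3 is illegal for White's second move
    ]))                               # ~0.86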

https://arxiv.org/abs/2501.17186

PS "Major commercial American LLM" is not very meaningful, you could be using GPT4o with that description.


In my opinion, the author is referring to an LLM's inability to create an inner world, a world model.

That means it does not build a mirror of a system based on its interactions.

It just outputs fragments of the world models it was built on and tries to give you a string of fragments that matches the fragment of your world model that you provided through some input method.

It cannot abstract over the code base fragments you share, and it cannot extend them with details using a model of the whole project.


The article just isn't that coherent for me.

> when a new model is released as the SOTA, 99% of the demand immediately shifts over to it

99% is in the wrong ballpark. Lots of users use Sonnet 4 over Opus 4, despite Opus being 'more' SOTA. Lots of users use 4o over o3, or Gemini over Claude. In fact, it's never been a closer race for who is the 'best': https://openrouter.ai/rankings

>switch from opus ($75/m tokens) to sonnet ($15/m) when things get heavy. optimize with haiku for reading. like aws autoscaling, but for brains.

they almost certainly built this behavior directly into the model weights

???

Overall, the article seems to argue that companies are running into issues because consumers don't accept, or aren't used to, usage-based pricing, and it's difficult to be the first to crack and switch to it.

I don't think it's as big of an issue as the author makes it out to be. We've seen this play out before in cloud hosting.

- Lots of consumers are OK with paying a flat fee per month and using an inferior model. 4o is objectively inferior to o3, but millions of people use it (or don't know any better). The free ChatGPT is even worse than 4o, and the vast majority of ChatGPT visitors use it!

- Heavy users and businesses consume via the API with usage-based pricing (see cloud). This is almost certainly profitable.

- Fundamentally most of these startups are B2B, not B2C


> Lots of users use 4o over o3

How much of that is the naming?

Personally, I just avoid OpenAI's models entirely because I have absolutely no way of telling how their products stack up against one another or which to use for what. In what world does o3 sort higher than 4o?

If I have to research your products by name to determine what to use for something that is already a commodity, you've already lost and are ruled out.


It's the naming. He is confusing 4o/4o-mini with o4-mini; the latter is a pretty strong model, and it's also one of the newest. Oh, and it's cheaper than the non-mini 4o.


There's both a 4o and an o4? And they're different?


Yes. 4o is a non-CoT model that is the continuation of the GPT-4 line, itself superseded by 4.1. o4 is the continuation of the CoT model line.

There's also 4o-mini and o4-mini...


No, I meant 4o over o3. For a ton of people, a reasoning model's latency is overkill when they're just asking for inspiration on what to make for dinner.


o4-mini isn’t really that great in comparison to o3, and I still use o3 as my daily driver for reasoning tasks. I don’t really have a purpose for o4-mini, not even for coding tasks.


> In fact it's never been a closer race on who is the 'best'

Thank you for pointing out that fact. Sometimes it's very hard to keep perspective.

Sometimes I use Mistral as my main LLM. I know it's not lauded as the top-performing LLM, but the truth of the matter is that its results are just as useful as what the best ChatGPT/Gemini/Claude models output, and it is way faster.

There are indeed diminishing returns on the current blend of commercial LLMs. DeepSeek already proved that cost can be a major factor and quality can even improve. I think we're very close to seeing competition based on price, which might be why there is so much talk about mixture-of-experts approaches and how specialized models can drive down cost while improving targeted output.


If you're after speed, Groq is excellent. They've recently added Kimi K2.


Yeah, my biggest problem with CC is that it's slow, prone to generating tons of bullshit exposition, and often goes down paths that I can tell almost immediately will yield no useful result.

It's great if you can leave it unattended, but personally, coding's an active thing for me, and watching it go is really frustrating.


I can't deal with any of the in-editor tools. I'd love something that handled inputting changes (with manual review!) while still giving me 100% control over the context and actually doing as it's told.


Not having a stake in something currently rocketing up in value is certainly a cause for FOMO and/or an incentive to disparage it.


Can you share the chats? I tried with o3 and it gave a pretty reasonable answer on the first try.

https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...


Contracts are higher margin and give more consistent ARR than data labelling does.

