
Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.




They mentioned LMArena; you can get the results for that here: https://lmarena.ai/leaderboard/text

Mistral Large 3 is ranked 28th, behind all the other major SOTA models. The gap between Mistral and the leader is only 1418 vs. 1491 Elo, though. I *think* that means the difference is relatively small.


1491 vs. 1418 Elo means the stronger model wins about 60% of the time.
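You can check that yourself with the standard Elo expected-score formula. A minimal Python sketch, plugging in the two leaderboard ratings quoted above:

    # Probability that the higher-rated model wins a head-to-head comparison
    def elo_expected_score(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    print(elo_expected_score(1491, 1418))  # ~0.60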

Probably naive questions:

Does that also mean that Gemini 3 (the top-ranked model) loses to Mistral 3 about 40% of the time?

Does that make Gemini 1.5x better, or Mistral two-thirds as good as Gemini, or can we not quantify the difference like that?


Yes, of course.

Wow. If all the trillions only produce that small a diff... that's shocking. That's the sort of knowledge that could pop the bubble.

I wouldn't trust LMArena results much. They measure user preference, and users are heavily swayed by style, tone, etc.

You can literally "improve" your model on LMArena just by adding a bunch of emojis.


I guess that could be considered comparative advertising then and companies generally try to avoid that scrutiny.

The lack of a comparison (which absolutely was done) tells you exactly what you need to know.

I think people from the US often aren't aware how many companies in the EU simply won't risk losing their data to the providers you have in mind: OpenAI, Anthropic, and Google. They are simply not an option at all.

The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral will certainly be an option, alongside the Qwen family and DeepSeek.

Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.


We're seeing the same thing at many companies, even in the US. Exposing your entire codebase to an unreliable third party is not exactly SOC/ISO compliant. This is one of the core things that motivated us to develop cortex.build, so we could put the model on the developer's machine and completely isolate the code, without complicated model deployment and maintenance.

Does your company use Microsoft Teams?

Mistral was founded by multiple Meta engineers, no?

Funded mostly by US VCs?

Hosted primarily on Azure?

Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?


Mistral are mostly focused on B2B, and on customers that want to self-host (banks and the like). So their founders being from Meta, or where their cloud platform is hosted, is entirely irrelevant to the story.

The fact they would not exist without the leeches and built their business on the leeches is irrelevant.

Pan-nationalism is a hell of a drug: a company that does not know you exist puts out an objectively awful release, and people take frank discussion of it as a personal slight.


Those who crawled the web without consent and then put their LLM in a black box without attribution, with a secret prompt and secret weights -- i.e. all of this without giving back, while creating tons of CO2. Those are the leeches.

Ah, so "crawled the web without consent, and then put their LLM in a blackbox without attribution" is not being a leech once you release the weights of an underperforming model using someone else's arch.

I knew y'all's standards were lower but geez!


At the very least it is a step in the right direction. Can't say the same for these proprietary models. And guess which country has all these proprietary models? USA.

Thank goodness for that, otherwise all we might have is useless copies of Deepseek.

If you want to allocate capital efficiently at planet scale, you have to ignore nations to the largest extent possible.

> The fact they would not exist without the leeches and built their business on the leeches is irrelevant.

How so?


It's wayyyy too early in the game to say who is out-executing whom.

I mean why do you think those guys left Meta? It reminds me of a time ten years ago I was sitting on a flight with a guy who works for the natural gas industry. I was (cough still am) a pretty naive environmentalist, so I asked him what he thought of solar, wind, etc. and why should we be investing in natural gas when there are all these other options. His response was simple. Natural gas can serve as a bridge from hydrocarbons to true green energy sources. Leverage that dense energy to springboard the other sources in the mix and you build a path forward to carbon free energy.

I see Mistral's use of US VCs the same way. Those VCs are hedging their bets and maybe hoping to make a few bucks. A few of them are probably involved because they were buddies with the former Meta guys back in the day. If Mistral executes on its plan of being a transparent B2B option with solid data protections, then it used those VCs the way they deserve to be used, and the VCs make a few bucks. If Europe ever catches up to the US in terms of data centers, would Mistral move off of Azure? I'd bet $5 that they would.


I didn't mean to imply US bad, EU good. As such, this isn't about which passport the VCs have, but about local hosting and open-weight models. A closed model from a US company always comes with the risk of data exfiltration, either for training or thanks to the CLOUD Act etc. (i.e. industrial espionage).

And personally I don't care at all about the performance delta - we are talking about a difference of 6 to at most 12 months here, between closed source SOTA and open weight models.


They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.


Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to their performance on similar but held-out questions. Generally the closed-source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332

??? Closed US frontier models are vastly more effective than anything OSS right now. The reason they didn’t compare is that they’re in a different weight class (and therefore a different product), and it’s a bit unfair.

We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models, and unless you have unique requirements some way down the price/perf curve, I would not even look at this release (which is fine!)


If someone is using these models, they probably can't or won't use the existing SOTA models, so not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad from a model you can't use on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

Completely agree that there are legitimate reasons to prefer comparison to e.g. DeepSeek models. But that doesn't change my point: we both agree that the comparisons would be extremely unfavorable.

> that the comparisons would be extremely unfavorable.

Why should they compare apples to oranges? Mistral Large 3 costs ~1/10th of Sonnet 4.5. They clearly target different users. If you want a coding assistant you probably wouldn't choose this model, for various reasons. There is room for more than just the benchmark king.


Come on. Do you just not read posts at all?

Which lightweight models do these compare unfavorably with?

Here's what I understood from the blog post:

- Mistral Large 3 is comparable with the previous Deepseek release.

- Ministral 3 LLMs are comparable with older open LLMs of similar sizes.


And implicit in this is that it compares very poorly to SOTA models. Do you disagree with that? Do you think these models are beating SOTA and they did not include the benchmarks because they forgot?

Those are SOTA for open models. It's a separate league from closed models entirely.

> It's a separate league from closed models entirely.

To be fair, the SOTA models aren't even a single LLM these days. They are doing all manner of tool use and specialised submodel calls behind the scenes - a far cry from in-model MoE.


> Do you disagree with that?

I think that Qwen3 8B and 4B are SOTA for their size. The GPQA Diamond accuracy chart is weird: both Qwen3 8B and 4B have higher scores, so they used this weird chart where the x-axis shows the number of output tokens. I missed the point of this.


Generation time is more or less proportional to tokens * model size, so if you can get the same quality result with fewer tokens from the same size of model, then you save time and money.
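A rough back-of-the-envelope sketch in Python (the per-token decode time is a made-up placeholder, not a measurement) shows why fewer output tokens at equal quality matter:

    # Hypothetical decode speed for a fixed model size
    seconds_per_output_token = 0.02

    verbose_tokens = 4000   # long reasoning trace
    concise_tokens = 1500   # same answer quality, fewer tokens

    print(verbose_tokens * seconds_per_output_token)  # 80.0 seconds
    print(concise_tokens * seconds_per_output_token)  # 30.0 seconds

The same proportionality holds for API cost, since providers bill per output token.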

Thanks. That was not obvious to me either.

> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

Why would they? They know they can't compete against the closed-source heavyweights.

They are not even comparing against GPT-OSS.

That is absolutely and shockingly bearish.



