
On flights with shitty wifi I have been running gpt-oss:120b on my macbook using ollama. Ok model for coding if you can't reach a good one.


GPT-OSS-120b/20b is probably the best you can run on your own hardware today. Be careful with the quantized versions though, as they're really horrible compared to the native MXFP4. I haven't looked in this particular case, but Ollama tends to hide their quantizations for some reason, so most people who could be running 20B with MXFP4 are still on Q8 and getting much worse results than they need to.


The gpt-oss weights on Ollama are native mxfp4 (the same weights provided by OpenAI). No additional quantization is applied, so let me know if you're seeing any strange results with Ollama.

Most gpt-oss GGUF files online have parts of their weights quantized to q8_0, and we've seen folks get some strange results from these models. If you're importing these to Ollama to run, the output quality may decrease.


What’s the distinction between MXFP4 and Q8 exactly?


It's a different way of doing quantization (https://huggingface.co/docs/transformers/en/quantization/mxf...) but I think the most important thing is that OpenAI delivered their own quantization (the MXFP4 from OpenAI/GPT-OSS on HuggingFace, guaranteed correct) whereas all the Q8 and other quantizations you see floating around are community efforts, with somewhat uneven results depending on who did them.

Concretely from my testing, both 20B and 120B have a much higher refusal rate with Q8 than with MXFP4, and lower quality responses overall. But don't take my word for it: the 20B weights are small, so it's relatively effortless to try both versions and compare for yourself.
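
For a concrete sense of what each format costs per weight, the back-of-the-envelope math is simple (a rough sketch using the standard block layouts: 4-bit elements plus one shared 8-bit scale per 32 weights for MXFP4, 8-bit elements plus one fp16 scale per 32 weights for q8_0):

    def bits_per_param(element_bits, scale_bits, block_size=32):
        # Each block of `block_size` weights shares one scale factor.
        return element_bits + scale_bits / block_size

    mxfp4 = bits_per_param(4, 8)    # 4.25 bits/param
    q8_0 = bits_per_param(8, 16)    # 8.5 bits/param
    print(f"MXFP4: {mxfp4} bits/param, q8_0: {q8_0} bits/param")

Note that q8_0 actually spends twice the bits per weight; the quality gap people report presumably comes from converting away from the format the weights were post-trained in, not from the bit budget.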


Wow, thanks for the info. I'm planning on testing this on my M4 Max w/ 36 GB today.

edit:

So looking here https://ollama.com/library/gpt-oss/tags it seems ollama doesn't even provide the MXFP4 variants, much less hide them.

Is the best way to run these variants via llama.cpp or...?


on the model description page they claim they support it:

Quantization - MXFP4 format

OpenAI utilizes quantization to reduce the memory footprint of the gpt-oss models. The models are post-trained with quantization of the mixture-of-experts (MoE) weights to MXFP4 format, where the weights are quantized to 4.25 bits per parameter. The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory, and the larger model to fit on a single 80GB GPU.

Ollama is supporting the MXFP4 format natively without additional quantizations or conversions. New kernels are developed for Ollama’s new engine to support the MXFP4 format.

Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.
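
If you want to sanity-check those memory claims, here's a rough weight-only estimator (a sketch; the parameter counts and the 95% MoE fraction below are my own placeholders, so plug in the real numbers from the model cards, and note this ignores KV cache and runtime overhead):

    def approx_weight_gb(total_params, moe_fraction, moe_bits=4.25, other_bits=16.0):
        # MoE weights stored at ~4.25 bits/param (MXFP4), everything else kept at BF16.
        moe_bytes = total_params * moe_fraction * moe_bits / 8
        other_bytes = total_params * (1 - moe_fraction) * other_bits / 8
        return (moe_bytes + other_bytes) / 1e9

    print(f"20B-class:  ~{approx_weight_gb(21e9, 0.95):.0f} GB")   # ~13 GB -> fits a 16GB system
    print(f"120B-class: ~{approx_weight_gb(117e9, 0.95):.0f} GB")  # ~71 GB -> fits a single 80GB GPU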


Can you link to that page? I’m not finding these variants.


as far as I can tell that is the only variant.

https://ollama.com/library/gpt-oss


The default ones on Ollama are MXFP4 for the feed-forward network and use BF16 for the attention weights. The default weights for llama.cpp quantize those tensors as q8_0, which is why llama.cpp can eke out a little more performance at the cost of worse output. If you are using this for coding, you definitely want better output.

You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
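
If you're checking a raw GGUF file before importing it (rather than an Ollama model), the gguf Python package from the llama.cpp project can do roughly the same inspection — a sketch, with the file path as a placeholder:

    # pip install gguf
    from collections import Counter
    from gguf import GGUFReader

    reader = GGUFReader("gpt-oss-20b.gguf")  # placeholder path

    # Tally tensors by quantization type to spot anything re-quantized to q8_0.
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    for dtype, n in counts.most_common():
        print(f"{dtype:>8}: {n} tensors")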


LM Studio


Can you be more specific? I've got LM Studio downloaded, but it's not clear where the official OpenAI releases are. Are they all only available via transformers? The only one that shows up in search appears to be the distilled gpt-oss 20B...


they support that format according to the model page on their site:

https://ollama.com/library/gpt-oss


Should be a bit faster if you run an MLX version of the model with LM Studio instead. Ollama doesn't support MLX.

Qwen3-Coder is in the same ballpark and maybe a bit better at coding
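
If you'd rather skip the GUI entirely, the mlx-lm Python package will run MLX conversions directly — a minimal sketch; the repo id is a placeholder, so look up the actual conversion on the mlx-community Hugging Face org:

    # pip install mlx-lm  (Apple silicon only)
    from mlx_lm import load, generate

    # Placeholder repo id -- substitute the real MLX conversion of gpt-oss.
    model, tokenizer = load("mlx-community/gpt-oss-20b-mlx")

    messages = [{"role": "user", "content": "Write a function that merges two sorted lists."}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    print(generate(model, tokenizer, prompt=prompt, max_tokens=256))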


LM Studio will run dynamic quants from Unsloth too. Much nicer than Ollama.


The key thing I'm confident in is that 2-3 years from now there's going to be a model (or models) and a workflow with comparable accuracy, and perhaps noticeably (but tolerably) higher latency, that can be run locally. There's just no reason to believe this isn't achievable.

Hard to understand how this won't make all of the solutions for existing use cases a commodity. I'm sure 2-3 years from now there'll be stuff that seems like magic to us now -- but it will be more meta, more "here's a hypothesis of a strategically valuable outcome, and here's a solution (with market research and user testing done)".

I think current performance and leading models will turn out to have been terrible indicators of the future market leader (and my money will remain on the incumbents with the largest cash reserves, namely Google, that have invested in fundamental research and scaling).


Could you share which MacBook model? And what context size you're getting?


I just checked gpt-oss:20b on my M4 Pro 24GB, and got 400.67 tokens/s on input and 46.53 tokens/s on output. That's for a tiny context of 72 tokens.
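
For anyone who wants to reproduce that kind of measurement, Ollama's HTTP API reports token counts and timings with every response, so the math is trivial (a sketch assuming a local server on the default port and that the tag is already pulled):

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gpt-oss:20b", "prompt": "Explain MXFP4 in one paragraph.", "stream": False},
    ).json()

    # Ollama reports durations in nanoseconds.
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"input: {prompt_tps:.2f} tok/s, output: {gen_tps:.2f} tok/s")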


This message was amazing and I was about to hit [New Tab] and purchase one myself until the penultimate word.


Are you running the full 65GB model on a MacBook Pro? What tokens per second do you get? What specs? M5?


If they're running 120B on an M5 (32GB max of memory today), I'd like to know how.


Probably an M4 which has up to 128GB currently


I am running the full model on an 128GB M3 Max.


On an M4 Pro 128GB: 75 t/s.

Caveat: That's just for the first prompt.


That must be a beefed up MacBook (or you must be quite patient).

gpt-oss:20b on my M1 MBP is usable but quite slow.



