30B-A3B works extremely well as a generalist chat model when paired with scaffolding such as web search. It's fast (for me) on my home workstation with a 5070 + 128GB of DDR4-3200 RAM, at ~28 tok/s. Love MoE models.
Sadly it falls short during real world coding usage, but fingers crossed that a similarly sized coder variant of Qwen 3 can fill in that gap for me.
This is my script for the Q4_K_XL version from unsloth at 45k context:
llama-server.exe --host 0.0.0.0 --no-webui --alias "Qwen3-30B-A3B-Q4_K_XL" --model "F:\models\unsloth\Qwen3-30B-A3B-128K-GGUF\Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf" --ctx-size 45000 --n-gpu-layers 99 --slots --metrics --batch-size 2048 --ubatch-size 2048 --temp 0.6 --top-p 0.95 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.1 --jinja --reasoning-format deepseek --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn --no-mmap --threads 8 --cache-reuse 256 --override-tensor "blk\.([0-9][02468])\.ffn_.*_exps\.=CPU"
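The `--override-tensor` regex at the end is what makes this fit: it pins the expert FFN tensors of even-numbered layers to CPU RAM while everything else stays on the GPU. A quick Python sketch of which tensor names it catches, assuming the intended pattern is `blk\.([0-9][02468])\.ffn_.*_exps\.` (note the two-digit layer requirement means layers 0-9 stay fully on GPU):

```python
import re

# llama.cpp matches tensor names against this regex (search, not full match).
# [0-9][02468] = two digits ending in an even digit, i.e. layers 10, 12, ... 98.
pattern = re.compile(r"blk\.([0-9][02468])\.ffn_.*_exps\.")

tensors = [
    "blk.0.ffn_up_exps.weight",     # single-digit layer -> stays on GPU
    "blk.12.ffn_gate_exps.weight",  # even two-digit layer, expert FFN -> CPU
    "blk.13.ffn_down_exps.weight",  # odd layer -> GPU
    "blk.24.attn_q.weight",         # not an expert tensor -> GPU
]

for name in tensors:
    device = "CPU" if pattern.search(name) else "GPU"
    print(f"{name:32s} -> {device}")
```

Since the experts in an MoE are only sparsely activated per token, parking half of them in system RAM costs far less speed than offloading dense layers would.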