Well, the seemingly cheap option comes with significantly degraded performance, particularly for agentic use. Have you tried replacing Claude Code with a locally deployed model, say on a 4090 or 5090? I have. It is not usable.
> Deepseek on openrouter is still 25x cheaper than claude
Is it? Or only when you don't factor in Claude's cached context? I've consistently found it pointless to use open models, because the price of the good ones is so close to Claude's cached-context pricing that I don't need them.
DeepSeek's API also has context caching, although the tokens/s was much lower than Claude's when I tried it. But for background agents the price difference makes it absolutely worth it.
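For anyone who hasn't used it: Claude's caching is opt-in per content block rather than automatic. A minimal sketch with the Anthropic Python SDK (the model id and document are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

reference_doc = open("reference.txt").read()  # the long, unchanging context

# Mark the big stable prefix as cacheable; later calls that reuse the exact
# same prefix are billed at the much cheaper cached-read rate.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model you run
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": reference_doc,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.usage)  # cache_creation_input_tokens vs cache_read_input_tokens
```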
Well, those cards also have very limited VRAM and wouldn't be able to run anything in the ~70B parameter range. (Can you even run 30B?)
Things get a lot easier at lower quantisation, which buys you a higher parameter count, and there are a lot of people whose AI jobs are "extract sentiment from text" or "bin into one of these 5 categories", where that's probably fine.
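The back-of-envelope math answers the 30B question; a rough sketch (the 20% overhead factor for KV cache and activations is a ballpark assumption, not a measured number):

```python
def est_vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Weights-only estimate: parameters x bytes/param, plus ~20% headroom
    for KV cache and activations (ballpark assumption, not measured)."""
    return params_b * bits / 8 * overhead

for params_b in (7, 30, 70):
    for bits in (16, 8, 4):
        print(f"{params_b}B @ {bits}-bit: ~{est_vram_gb(params_b, bits):.0f} GB")

# 70B needs ~42 GB even at 4-bit, beyond a 24 GB 4090 or a 32 GB 5090;
# 30B at 4-bit (~18 GB) fits on either card.
```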
Strictly speaking, you have not deployed any model on a 5090 because a 5090 card has never been produced.
And without specifying your quantization level it's hard to know what you mean by "not usable"
Anyway, if you really wanted to try cheap distilled/quantized models locally, you would be using used V100 Teslas, not 4-year-old single-chip gaming GPUs.
they took the already ridiculous v3.1 terminus model, added this new deepseek sparse attention thing, and suddenly it’s doing 128k context at basically half the inference cost of the old version with no measurable drop in reasoning or multilingual quality. like, imo gold medal level math and code, 100+ languages, all while sipping tokens at 14 cents per million input. that’s stupid cheap.
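for anyone wondering what the sparse attention bit actually buys: instead of every query scoring all 128k keys, each query attends to only a small selected subset, so cost stops growing quadratically with context. a toy top-k sketch of the idea (illustration only; deepseek's dsa picks keys with a learned indexer, not raw score top-k like this):

```python
import torch

def topk_sparse_attention(q, k, v, keep=64):
    """toy sparse attention: each query keeps only its `keep` highest-scoring
    keys and softmaxes over those, masking everything else to -inf."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # full (T, T) scores
    top = scores.topk(keep, dim=-1)                         # best `keep` keys per query
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top.indices, top.values)            # -inf everywhere else
    return torch.softmax(masked, dim=-1) @ v                # mix only selected values

T, d = 4096, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
out = topk_sparse_attention(q, k, v)  # each output row mixes just 64 of 4096 values
```

note the toy still computes the full score matrix, so it saves nothing by itself; the real win is selecting keys cheaply before doing the expensive attention math, which is where the indexer comes in.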
the rl recipe they used this time also seems way more stable. no more endless repetition loops or random language switching you sometimes got with the earlier open models. it just works.
what really got me is how fast the community moved. vllm support landed the same day, huggingface space was up in hours, and people are already fine-tuning it for agent stuff and long document reasoning.
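if anyone wants to reproduce, the vllm route is just an openai-compatible server; a minimal sketch (the huggingface model id is my guess at the checkpoint name, double-check it):

```python
# first: vllm serve deepseek-ai/DeepSeek-V3.2-Exp   (model id is my assumption)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # vllm default port
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Exp",
    messages=[{"role": "user", "content": "summarize the sparse attention trick in two sentences"}],
)
print(resp.choices[0].message.content)
```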
i’ve been playing with it locally and the speed jump on long prompts is night and day. feels like the gap to the closed frontier models just shrank again.
anyone else tried it yet?
The one thing I wish it had is a 3.5mm audio jack. Both the Xbox controller and Sony's DualSense have one, but Sony doesn't support audio over Bluetooth, the Xbox controller needs a USB adapter (and its build isn't as good as Sony's), and Sony doesn't offer a USB adapter at all. Given that the Steam Controller already uses a USB puck, it should be able to support this.
Infuse is good, but it doesn't feel well polished on the desktop; for example, some dialogs that could have been real windows are instead pop-ups that block the main player.
I want to note that long prompts are only good if the model is optimized for them. I have tried swapping the underlying model for Claude Code. Most local models, even those that claim to handle long context and tool use, don't work well once the instructions get too long. This matters for tool use: it works fine in small chatbot-style demos, but once the prompt reaches Claude Code's length, the model just fails, either forgetting what tools exist, forgetting to use them, or returning results in the wrong format. Only OpenAI's models and Google's Gemini kind of work, though not as well as Anthropic's own models, and they feel much slower.
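If you want to see this degradation for yourself, the quickest harness I can think of is: define one tool, pad the system prompt to increasing lengths, and check whether the model still emits a well-formed tool call. A sketch (endpoint, model name, and the padding trick are all illustrative, not from any real harness):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# Pad the system prompt and watch for the point where the model stops
# emitting a structured tool call and starts rambling instead.
for pad in (1_000, 10_000, 100_000):
    system = "You are a coding agent. Use the tools when needed.\n" + "filler " * pad
    resp = client.chat.completions.create(
        model="local-model",  # placeholder id
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "Open README.md and tell me the title."},
        ],
        tools=tools,
    )
    calls = resp.choices[0].message.tool_calls
    print(f"~{pad} filler words -> {'tool call OK' if calls else 'no tool call'}")
```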
I do Japanese transcription + Gemini translation. It's worse than fansubs, but it's much, much better than nothing. The first thing that can struggle is actually the VAD; then it's special names and places, where prompting can help but not always. Finally it's uniformity (or style): I still feel I can't control the punctuation well.
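Roughly, the shape of such a pipeline, sketched with faster-whisper plus the google-generativeai SDK (these library choices, file names, and model ids are illustrative, not necessarily what I run; initial_prompt is where the special names go):

```python
from faster_whisper import WhisperModel
import google.generativeai as genai

genai.configure(api_key="...")  # your key
gemini = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model id

whisper = WhisperModel("large-v3")
segments, _ = whisper.transcribe(
    "episode.mkv",
    language="ja",
    vad_filter=True,          # the VAD stage that sometimes drops quiet lines
    initial_prompt="ナルト、サスケ、木ノ葉隠れの里",  # seed special names/places
)

for seg in segments:
    prompt = ("Translate this Japanese subtitle line into natural English. "
              "Keep names as-is and do not end the line with a period:\n" + seg.text)
    print(f"{seg.start:6.1f}-{seg.end:6.1f}  {gemini.generate_content(prompt).text.strip()}")
```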
Recently, I visited the Pennsylvania Railroad Museum and was fascinated to learn that when steel railcars were first introduced (despite being far safer than their wooden predecessors, which could easily be crushed), many people feared they might attract lightning. It's such a good analogy for our move into the AI era.