
I imagine it's a pretty bad risk-to-reward ratio for most companies, especially when just tossing some stuff into your system prompt is an option.


Yeah, that's my assumption too. Fine-tuning is really expensive in terms of skills and time needed to attempt it, and there's a very real chance that your attempts will fail to make a meaningful improvement over being smarter with your prompts.

Even worse, even if you DO get an improvement you are likely to find that it was a waste of time in a month or two when the next upgraded version of the underlying model is released.

The places where it makes sense, from what I can tell, are mainly when you are running so many prompts that the savings from running a smaller, cheaper model outweigh the labor and infrastructure costs involved in getting it to work. If your token spend isn't in the tens (probably hundreds) of thousands of dollars, you're unlikely to save money this way.
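Very rough back-of-envelope version of that break-even math (every number below is a made-up placeholder, not a real price; plug in your own):

    # Break-even sketch for "fine-tune a small model to replace a frontier model".
    # All figures are hypothetical placeholders.
    frontier_cost_per_m = 10.00       # $ per 1M tokens on a hosted frontier model
    finetuned_cost_per_m = 1.00       # $ per 1M tokens on a small self-hosted model
    monthly_tokens_m = 5_000          # 5B tokens of traffic per month

    one_off_finetune_cost = 50_000    # data prep + training runs + eval labor
    monthly_serving_overhead = 5_000  # GPUs / ops for the self-hosted model

    monthly_saving = monthly_tokens_m * (frontier_cost_per_m - finetuned_cost_per_m) \
        - monthly_serving_overhead
    print(f"monthly saving: ${monthly_saving:,.0f}")
    print(f"months to break even: {one_off_finetune_cost / monthly_saving:.1f}")

At a much smaller token spend the monthly saving shrinks toward zero (or goes negative) and the one-off cost never pays itself back.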

If it's not about cost saving, the other reasons are latency and being able to achieve something that the base model just couldn't do.

Datadog reported a latency improvement because fine-tuning let them run a much smaller (and hence faster) model. That's a credible reason if you are building high-value features that a human being is waiting on, like live-typing features.

The most likely cases I've heard of for getting the model to do something it just couldn't do before mainly involve vision LLMs, which makes sense to me - training a model to classify images that weren't in the training set might make more sense than stuffing more example images into the prompt (though models like Gemini will accept dozens if not hundreds of comparable images in the prompt, which can then benefit from prompt caching).
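For comparison, the prompt-stuffing alternative looks something like this with the google-generativeai Python SDK (the model name, file paths, and the defect-classification task are all just placeholders):

    # Few-shot image classification by stuffing example images into the prompt,
    # instead of fine-tuning. Paths and class labels are hypothetical.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    examples = []
    for path, label in [("scratch1.jpg", "scratch"), ("dent1.jpg", "dent")]:
        examples += [f"Example of class '{label}':", Image.open(path)]

    response = model.generate_content(
        ["You classify product photos into defect classes.", *examples,
         "Classify this image, answering with just the class name:",
         Image.open("new_photo.jpg")]
    )
    print(response.text)

If the same pile of example images is reused across many requests, it can live in a cached prompt prefix so you aren't paying to re-process it every time.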

The last category is actually teaching the model a new skill. The best examples here are low-resource programming languages - Jane Street with OCaml, or Morgan Stanley with Q.

Jane Street OCaml: https://www.youtube.com/watch?v=0ML7ZLMdcl4

Morgan Stanley Q: https://huggingface.co/morganstanley/qqWen-1.5B-SFT
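The mechanics of that kind of "teach it a new language" run are mostly plain supervised fine-tuning on a corpus of code. A minimal sketch with Hugging Face TRL, assuming a JSONL file of examples with a "text" field (the file name, base model, and hyperparameters are placeholders, not what either company actually used):

    # Supervised fine-tuning on a low-resource language corpus (sketch).
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Each line is assumed to look like {"text": "...prompt + Q/OCaml code..."}
    dataset = load_dataset("json", data_files="q_examples.jsonl", split="train")

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-1.5B",   # small base model, in the spirit of qqWen-1.5B-SFT
        train_dataset=dataset,
        args=SFTConfig(
            output_dir="q-sft",
            per_device_train_batch_size=2,
            num_train_epochs=1,
        ),
    )
    trainer.train()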


Have you heard of any attempts to bake MCP definitions into LoRA adapters? I've been wondering if that's a viable approach, so you don't have to put them all in context, and toggling them on and off would just be a matter of applying or unapplying the weights. That seems like it'd be more robust than putting "enable FooMCP", "disable FooMCP", etc. in the context, which I'd think would trip up the LLM eventually. And it would avoid the full rebuild of the KV cache that'd be required if you fully removed FooMCP from the context prefix.

Depending on the use case you could insert the LoRA weights as their own layers at runtime (no setup time, but extra layers to compute for each token), merge them into the existing layers (an initial delay to do the merge, but no runtime penalty after that), or keep pre-merged models around for common cases (no performance penalty, but you have to reserve more storage).
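Rough sketch of those three options with Hugging Face PEFT (the model ID and the "foo_mcp" adapter path are placeholders; the calls exist, but check them against your PEFT version):

    # Option 1: attach a LoRA adapter at runtime (extra adapter matmuls per token).
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
    model = PeftModel.from_pretrained(base, "adapters/foo_mcp", adapter_name="foo_mcp")

    # Toggle the adapter off temporarily without touching the prompt prefix.
    with model.disable_adapter():
        pass  # generation here behaves like the plain base model

    # Option 2: merge the adapter into the base weights (one-off cost,
    # then no per-token overhead).
    merged = model.merge_and_unload()

    # Option 3: persist the merged weights so common cases skip both steps
    # (no perf penalty, at the cost of extra storage per variant).
    merged.save_pretrained("models/base-plus-foo_mcp")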


I've not heard of anyone trying that, but I don't think I've been looking in the right kinds of places.

My current mental model of LoRA is that this would be unlikely to work, but I've never used them so I don't really know what I'm talking about. Would be a very interesting experiment!



