It absolutely can in theory, but afaik neither V8 nor the JVM can actually do it to a level where they outperform the static optimisations of GCC or LLVM.
Is this still the case, or am I going off outdated info here?
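The classic example of where a JIT theoretically wins is speculative devirtualization: a call site that is polymorphic as far as any static compiler can prove, but monomorphic in practice at runtime. A minimal sketch in TypeScript (illustrative only, not a benchmark; the names are made up):

  // To an AOT compiler, s.area() is a virtual call through an interface:
  // any Shape implementation could arrive here, so it can't inline.
  // V8 profiles the site, sees only one concrete class ever shows up,
  // speculatively inlines that path, and deopts if the guess is violated.
  interface Shape { area(): number }
  class Circle implements Shape {
    constructor(private r: number) {}
    area() { return Math.PI * this.r * this.r; }
  }
  function totalArea(shapes: Shape[]): number {
    let sum = 0;
    for (const s of shapes) sum += s.area(); // monomorphic at runtime
    return sum;
  }
  // In practice the program only ever passes Circles:
  console.log(totalArea(Array.from({ length: 1_000_000 }, (_, i) => new Circle(i % 10))));

Whether that profile-guided inlining ever beats what LLVM does with whole-program devirtualization plus PGO is exactly the question above.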
The reports I remember showed that each model is profitable over its lifetime, but R&D for the next model overlaps with it, so the company runs at a loss overall. Which implies they'd turn a massive profit if they just stopped making new models.
You can absolutely fit the weights plus a tiny context window into 24GB. But you can't fit a context window of any reasonable size. Either that or Ollama's implementation is broken: when I last tried it, the context had to be restricted beyond usability to keep it from freezing up the entire machine.
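For anyone who wants the arithmetic: the squeeze mostly comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming a hypothetical ~32B Llama-style model at Q4 with an fp16 cache (every number here is an assumption, nothing measured from Ollama):

  const GiB = 2 ** 30;
  const weights = 32e9 * 0.5;                        // Q4 ≈ 0.5 bytes/param ≈ 14.9 GiB
  const nLayers = 60, nKvHeads = 8, headDim = 128;   // assumed architecture
  const kvPerToken = 2 * nLayers * nKvHeads * headDim * 2; // K+V, fp16 ≈ 240 KiB/token
  const budget = (tokens: number) =>
    ((weights + tokens * kvPerToken) / GiB).toFixed(1) + " GiB";
  console.log(budget(2_048));   // ~15.4 GiB: fits in 24GB with room to spare
  console.log(budget(32_768));  // ~22.4 GiB: at the edge before activations/overhead

So the weights alone fit fine; it's the cache plus runtime overhead that pushes you over, which matches the "tiny context or it freezes" behaviour.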
LLMs do typically encode a confidence level in their internal representations, they just don't surface it when asked. There were multiple papers on this a few years back that got reasonable results out of probing for it. I think that was in the GPT-3.5 era, though.
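IIRC the technique in those papers was a linear probe: take hidden-state vectors, train a logistic regression to predict whether the model's answer was actually correct. A minimal sketch with synthetic stand-in data (the real versions use actual model activations; everything below is illustrative):

  type Vec = number[];
  const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));
  const dot = (a: Vec, b: Vec) => a.reduce((s, v, i) => s + v * b[i], 0);

  // Plain SGD logistic regression: learn a direction w in activation space
  // that separates "model was right" from "model was wrong".
  function trainProbe(xs: Vec[], ys: number[], dim: number, epochs = 200, lr = 0.1): Vec {
    const w: Vec = new Array(dim).fill(0);
    for (let e = 0; e < epochs; e++) {
      for (let i = 0; i < xs.length; i++) {
        const err = sigmoid(dot(w, xs[i])) - ys[i]; // gradient of log loss
        for (let j = 0; j < dim; j++) w[j] -= lr * err * xs[i][j];
      }
    }
    return w;
  }

  // Synthetic stand-in for hidden states: "correct" runs point one way,
  // "incorrect" runs the other, plus noise.
  const dim = 8;
  const sample = (correct: number): Vec =>
    Array.from({ length: dim }, () => (correct ? 1 : -1) * 0.5 + (Math.random() - 0.5));
  const xs = Array.from({ length: 200 }, (_, i) => sample(i % 2));
  const ys = Array.from({ length: 200 }, (_, i) => i % 2);
  const w = trainProbe(xs, ys, dim);
  console.log("p(correct) for a confident-looking state:", sigmoid(dot(w, sample(1))));

The interesting finding was that such a probe works at all, i.e. the information is linearly decodable even though the model's verbal self-reports don't use it.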