Fully depends on the model, how much conversational context you provide, but if ...

Fully depends on the model, how much conversational context you provide, but if you keep things to a bare minimum, ~< 5 seconds from message received to starting the response using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds so the next idea is to take a more basic image captioning model and insert it's output into context and try to cut that time down even more.

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than normal CPU for Llama CPP.