
It depends entirely on the model and how much conversational context you provide, but if you keep things to a bare minimum, it's under ~5 seconds from message received to the start of the response using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds, so the next idea is to use a more basic image captioning model, insert its output into the context, and try to cut that time down even more.
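A rough sketch of the "caption first, then insert into context" idea, assuming a plain-text chat prompt. The function name `build_prompt`, the `MAX_HISTORY_TURNS` cap and the `[Image description: ...]` wording are all hypothetical; the caption string would come from whatever lightweight captioning model you pick, and the resulting prompt would be passed to Llama 3 8B however you normally call it.

```python
from typing import List, Optional, Tuple

# Hypothetical cap: keep only the last few turns so the prompt stays short,
# which is what keeps time-to-first-token low on CPU.
MAX_HISTORY_TURNS = 2


def build_prompt(
    user_message: str,
    history: List[Tuple[str, str]],
    image_caption: Optional[str] = None,
    max_turns: int = MAX_HISTORY_TURNS,
) -> str:
    """Build a minimal text prompt, optionally injecting an image caption.

    `history` is a list of (user, assistant) pairs; only the most recent
    `max_turns` pairs are kept. The caption replaces running a full vision
    language model: the text LLM only ever sees a short description.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for user_turn, bot_turn in recent:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {bot_turn}")
    if image_caption:
        # The caption goes in as ordinary context text.
        lines.append(f"[Image description: {image_caption}]")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [("hello", "hi there"), ("how are you?", "good, thanks")]
    print(build_prompt("what is in this photo?", history, "a cat sitting on a sofa"))
```

Trimming history and swapping the image for a one-line caption both shrink the prompt the model has to process before it can emit its first token.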

I also tried Vulkan, which is supposedly faster, but the times were a bit slower than plain CPU inference with llama.cpp.
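To compare backends such as Vulkan and plain CPU on the metric that matters here (time from message received to the first token of the response), a small timing harness like the one below could help. This is a hedged sketch: `time_to_first_token` and `compare_backends` are hypothetical helpers, not part of llama.cpp, and each backend is assumed to be wrapped in your own function that takes a prompt and yields tokens as they are streamed.

```python
import statistics
import time
from typing import Callable, Dict, Iterable

# A "backend" here is any callable that takes a prompt and streams tokens.
Generate = Callable[[str], Iterable[str]]


def time_to_first_token(generate: Generate, prompt: str) -> float:
    """Seconds from issuing the request until the first token arrives."""
    start = time.perf_counter()
    stream = iter(generate(prompt))
    next(stream, None)  # block until the first token (or end of stream)
    return time.perf_counter() - start


def compare_backends(backends: Dict[str, Generate], prompt: str, runs: int = 3) -> Dict[str, float]:
    """Median time-to-first-token per backend over `runs` attempts."""
    results: Dict[str, float] = {}
    for name, generate in backends.items():
        timings = [time_to_first_token(generate, prompt) for _ in range(runs)]
        results[name] = statistics.median(timings)
    return results
```

Taking the median over several runs smooths out one-off effects such as the first request paying for model loading or cache warm-up, which can otherwise make one backend look slower than it is.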