Yeah, that latency makes sense; "listening" covers turn detection and STT, and "thinking" covers the LLM + TTS _and then_ our model, so the pipeline latency stacks up pretty quickly. The video model itself starts streaming out frames <500ms after TTS generation begins, but we're still working on shaving latency off the parts of the pipeline we use off the shelf.
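To make the stacking concrete, here's the back-of-envelope version. Every number below except the ~500ms video figure is a made-up placeholder, not a measurement; the point is just that time-to-first-frame is roughly the sum of each stage's time-to-first-output when the stages run strictly one after another:

    # Placeholder per-stage time-to-first-output, in ms (only the video
    # number is real, measured from the start of TTS audio).
    ttfo_ms = {
        "turn_detection": 250,     # placeholder
        "stt_finalize": 300,       # placeholder
        "llm_first_token": 350,    # placeholder
        "tts_first_audio": 200,    # placeholder
        "video_first_frame": 500,  # our model
    }
    print(sum(ttfo_ms.values()), "ms to first frame if nothing overlaps")  # 1600 ms
    # Streaming each stage into the next (partial STT -> LLM, LLM tokens -> TTS,
    # TTS audio -> video) is what pulls the end-to-end number back down.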
We have a high-level blog post about the video model's architecture here: https://www.keyframelabs.com/blog/persona-1. The WebRTC "agent" stack is Livekit plus a few backend components hosted on Modal.
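Structurally the agent worker is the standard LiveKit Agents pipeline. A minimal sketch of that shape (the specific STT/LLM/TTS plugins below are stand-ins rather than what we actually run, and the hook from TTS audio into our video model is elided):

    from livekit import agents
    from livekit.agents import Agent, AgentSession
    from livekit.plugins import deepgram, openai, silero

    async def entrypoint(ctx: agents.JobContext):
        # STT -> LLM -> TTS pipeline; turn detection is driven by the VAD here.
        session = AgentSession(
            vad=silero.VAD.load(),
            stt=deepgram.STT(),   # stand-in STT plugin
            llm=openai.LLM(),     # stand-in LLM plugin
            tts=openai.TTS(),     # stand-in TTS plugin
        )
        await session.start(
            room=ctx.room,
            agent=Agent(instructions="You are a conversational language partner."),
        )
        await ctx.connect()
        # In our setup the TTS audio also feeds the video model (served from
        # Modal), which publishes frames back into the room as a video track.

    if __name__ == "__main__":
        agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))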
We've been tinkering with realtime talking-head models (avatar models, etc.) for a while now, and finally have something that works (well enough)! It runs at ~2x realtime on a 4090, and significantly faster than that on enterprise-grade GPUs.
The main use case we designed for was language learning, particularly having a conversational partner -- we've generally found that adding a face to the voice helps trigger the same fight-or-flight response you get talking to a real person, and getting past that response is, in our experience, the hardest part of speaking a new language with confidence.
But having built out the system around the model to support that use case (tool use on a canvas for speaking prompts and images, memory to keep conversations from going stale, etc.), we think there's potential for other use cases too.