I initially explored a fully client-side approach (including WebAssembly), but it didn’t hold up in practice. Real-time audio transcription and multi-language translation are both compute-intensive, and browser-only solutions hit performance and reliability limits, especially on longer videos and live streams.
Moving that work to a dedicated backend gave more consistent latency and accuracy across browsers, and made features like multiple simultaneous subtitle languages and searchable transcripts feasible.
Installation is the standard unpacked Chrome extension flow (developer mode → load unpacked). Happy to answer any technical questions about the pipeline or trade-offs.