> Can you give some more info on how you generated the models?

I used Glow-TTS and MelGAN, which are somewhat unpopular choices given the proliferation of Tacotron 2/WaveGlow. I chose them for their sparsity and speed.

> I'm also interested in the tech stack you're using to implement this webapp... Would love some details!

It's a Rust microservice architecture. A proxy layer decodes each request and forwards it to the appropriate backend. Behind it, the TTS service is horizontally scaled; it loads the model pipeline and turns requests into audio.
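To make the proxy's role concrete, here is a minimal sketch of "decode the request, pick a backend". It is not the actual service code. The `voice|text` wire format, routing keyed by voice, the round-robin choice among replicas, and every name (`TtsRequest`, `decode_request`, `Proxy`, `register`, `route`) are assumptions made for illustration, and real networking is left out entirely.

```rust
use std::collections::HashMap;

/// A decoded request. The fields are hypothetical; the real payload is not described.
#[derive(Debug, Clone, PartialEq)]
pub struct TtsRequest {
    pub voice: String,
    pub text: String,
}

/// Decode an assumed "voice|text" wire format. Returns None on malformed input.
pub fn decode_request(raw: &str) -> Option<TtsRequest> {
    let (voice, text) = raw.split_once('|')?;
    let (voice, text) = (voice.trim(), text.trim());
    if voice.is_empty() || text.is_empty() {
        return None;
    }
    Some(TtsRequest {
        voice: voice.to_string(),
        text: text.to_string(),
    })
}

/// The replicas serving one voice, plus a cursor for round-robin selection.
#[derive(Debug, Default)]
struct Pool {
    replicas: Vec<String>,
    cursor: usize,
}

/// Maps a voice to the pool of TTS backends assumed to serve it.
#[derive(Debug, Default)]
pub struct Proxy {
    pools: HashMap<String, Pool>,
}

impl Proxy {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register one more backend replica (address as a plain string) for a voice.
    pub fn register(&mut self, voice: &str, addr: &str) {
        self.pools
            .entry(voice.to_string())
            .or_default()
            .replicas
            .push(addr.to_string());
    }

    /// Pick the next replica for the request's voice, cycling through the pool.
    /// Returns None when no backend is registered for that voice.
    pub fn route(&mut self, req: &TtsRequest) -> Option<String> {
        let pool = self.pools.get_mut(&req.voice)?;
        // A pool is only created by `register`, which always pushes one address,
        // so `replicas` is never empty here.
        let addr = pool.replicas[pool.cursor].clone();
        pool.cursor = (pool.cursor + 1) % pool.replicas.len();
        Some(addr)
    }
}
```

In this sketch, scaling out a voice is just another `register` call with a new address; `route` then spreads requests evenly across whatever replicas exist. A production proxy would also need health checks and real load information, which are out of scope here.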

> ..What's next?

For me? Voice conversion in the near term. This takes microphone input and turns it into the target speaker's voice.

I'm also spending a lot of time on photogrammetry. I have a 3D volumetric webcam system right now, and I have much bigger plans for it.