XTTSv2 is released. I’d say it’s a big jump in quality.
- Better voice cloning
- Better audio
- Impressive prosody and expressiveness
- More languages added, 16 in total I think.
- Non-EN languages sound way better
- Streaming latency under 200 ms (on my 3090)
- Finetuning code
You can try it here: https://huggingface.co/spaces/coqui/xtts
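If you want to check the streaming number on your own hardware, here is a minimal sketch. I'm assuming "under 200 ms" means time to the first audio chunk, which may not be exactly how it was measured. `measure_stream` and `fake_stream` are hypothetical helpers, not part of any TTS library; swap `fake_stream()` for whatever chunk generator your streaming setup gives you.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of audio chunks and return timing stats in seconds.

    Assumption: latency here means time from starting to iterate until the
    first chunk arrives.
    """
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for _ in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "total_s": total_s, "chunks": count}


def fake_stream(n_chunks=3, delay_s=0.05):
    """Hypothetical stand-in for a real streaming TTS generator."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # pretend the model is generating audio
        yield b"\x00" * 1024     # dummy chunk of silence


if __name__ == "__main__":
    stats = measure_stream(fake_stream())
    print(f"first chunk after {stats['first_chunk_s'] * 1000:.0f} ms, "
          f"{stats['chunks']} chunks in {stats['total_s'] * 1000:.0f} ms")
```

The timer starts before the first chunk is requested, so with a lazy generator the first-chunk figure includes the generation time for that chunk. It does not include model loading or speaker conditioning if those happen before you create the generator.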
Does anyone know if there is a detailed model description somewhere? They don’t seem to have a full technical report anywhere and the documentation just describes the model API.