Introducing Distil-Whisper: 6x faster than Whisper while performing to within 1% WER of Whisper on out-of-distribution test data.
Through careful data selection and filtering, Whisper's robustness to noise is maintained and hallucinations are reduced.
For more information, refer to:
- 👨‍💻 The GitHub repo: https://github.com/huggingface/distil-whisper
- 📚 The official paper: https://arxiv.org/abs/2311.00430
Here’s a quick overview of how it works:
1. Distillation
The Whisper encoder performs a single forward pass, while the decoder performs one forward pass for every token generated. As a result, the decoder accounts for >90% of total inference time, so reducing the number of decoder layers is more effective than reducing encoder layers.
With this in mind, we keep the whole encoder but only 2 decoder layers. The resulting model is 6x faster. A weighted distillation loss is used to train the model, keeping the encoder frozen 🔒. This ensures we inherit Whisper's robustness to noise and different audio distributions. A rough sketch of such a training step is shown below.
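As a hedged illustration (not the official training code), a weighted distillation step could combine a cross-entropy term on the pseudo-labels with a KL term pulling the student towards the teacher. The `student`/`teacher` models and the loss weights below are illustrative assumptions:

```python
# Illustrative sketch of a weighted distillation step; not the official training code.
# `student` and `teacher` are assumed to be Whisper-style seq2seq models (e.g. from 🤗 Transformers).
import torch
import torch.nn.functional as F

alpha_ce, alpha_kl = 1.0, 1.0  # assumed weights for the two loss terms

def freeze_encoder(student):
    # Keep the (teacher-initialised) encoder frozen 🔒 so only the 2 decoder layers train.
    for p in student.get_encoder().parameters():
        p.requires_grad = False

def distillation_loss(student, teacher, input_features, labels):
    # The teacher is only used for inference and is never updated.
    with torch.no_grad():
        teacher_logits = teacher(input_features=input_features, labels=labels).logits

    student_out = student(input_features=input_features, labels=labels)

    # 1) Cross-entropy against the Whisper pseudo-labels.
    ce_loss = student_out.loss

    # 2) KL divergence between the student's and teacher's token distributions.
    #    (Label padding/masking is ignored here for brevity.)
    kl_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    return alpha_ce * ce_loss + alpha_kl * kl_loss
```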
2. Data
Distil-Whisper is trained on a diverse corpus of 22,000 hours of audio from 9 open-source datasets with permissive licenses. The training labels are pseudo-labels generated by Whisper. Importantly, a WER filter is applied so that only pseudo-labels scoring below 10% WER against the ground-truth transcripts are kept. This is key to maintaining performance! 🔑
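A minimal sketch of what such a WER filter could look like (the `jiwer` package, the `keep_example` helper, and the example records are illustrative assumptions, not the project's actual data pipeline):

```python
# Hedged sketch of a 10% WER pseudo-label filter.
from jiwer import wer

WER_THRESHOLD = 0.10  # keep a sample only if pseudo-label WER vs. ground truth is <= 10%

def keep_example(ground_truth: str, pseudo_label: str) -> bool:
    if not ground_truth.strip():
        return False  # nothing to compare against
    return wer(ground_truth, pseudo_label) <= WER_THRESHOLD

examples = [
    {"text": "the cat sat on the mat", "pseudo_label": "the cat sat on the mat"},      # kept
    {"text": "the cat sat on the mat", "pseudo_label": "the cat sat on a hat today"},  # dropped
]
filtered = [ex for ex in examples if keep_example(ex["text"], ex["pseudo_label"])]
print(len(filtered))  # -> 1
```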
3. Results
Distil-Whisper is 6x faster than Whisper while sacrificing only 1% WER on short-form evaluation. On long-form evaluation, Distil-Whisper beats Whisper; we show that this is because Distil-Whisper hallucinates less.
4. Usage
Checkpoints are released under the Distil-Whisper repository with a direct integration in 🤗 Transformers and an MIT license.
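For example, transcribing an audio file should look roughly like the following sketch, built on the standard 🤗 Transformers speech-recognition pipeline. The checkpoint name is assumed from the release (check the repository for the current list), and "sample.mp3" is a placeholder:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"  # assumed checkpoint name; see the repo for others
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.mp3")["text"])
```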
5. Training Code
Training code will be released in the Distil-Whisper repository this week, enabling anyone in the community to distill a Whisper model in their choice of language!
How does the speedup compare to faster-whisper?