Hey r/MachineLearning!

At Hugging Face, we’ve worked hard over the last few months to create a powerful yet fast distilled version of Whisper. We’re excited to share our work with you now!

Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.

For more information, please have a look:

- GitHub page: https://github.com/huggingface/distil-whisper/tree/main

- Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf

Quick summary:

  1. Distillation Process

We’ve kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, decoding takes O(N). To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. We train with both a KL-divergence loss against the teacher and a next-word-prediction loss on pseudo-labels.
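Here is a minimal sketch of how the two loss terms could be combined per decoded token. It is plain Python for readability, not the training code: the function names and the weights `alpha_kl` and `alpha_pl` are placeholders I picked for illustration, and the pseudo-labels are assumed to be token ids already produced by the teacher.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    # KL(teacher || student): how far the student distribution is from the teacher's.
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def cross_entropy(student_probs, target_id):
    # Next-word-prediction loss on a single pseudo-label token.
    return -math.log(student_probs[target_id])

def distillation_loss(teacher_logits, student_logits, pseudo_label_ids,
                      alpha_kl=0.8, alpha_pl=1.0):
    # alpha_kl and alpha_pl are hypothetical weights, not values stated in this post.
    kl_total, ce_total = 0.0, 0.0
    for t, s, y in zip(teacher_logits, student_logits, pseudo_label_ids):
        p_teacher, p_student = softmax(t), softmax(s)
        kl_total += kl_divergence(p_teacher, p_student)
        ce_total += cross_entropy(p_student, y)
    # Average the weighted sum over the decoded positions.
    return (alpha_kl * kl_total + alpha_pl * ce_total) / len(pseudo_label_ids)
```

When the student matches the teacher exactly, the KL term vanishes and only the pseudo-label term remains, which is a quick sanity check for an implementation like this.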

  2. Data

We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out.
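A rough sketch of what a WER filter can look like, assuming it compares each pseudo-label against the dataset's reference transcript and drops examples above a threshold. The `reference` / `pseudo_label` field names, the lowercase-only normalization, and the `max_wer=0.1` cutoff are all illustrative assumptions, not details taken from this post.

```python
def word_error_rate(reference, hypothesis):
    # Word-level Levenshtein distance divided by the reference length.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def filter_by_wer(examples, max_wer=0.1):
    # Keep only examples whose pseudo-label is close to the reference transcript.
    # max_wer is a hypothetical threshold chosen for this sketch.
    return [ex for ex in examples
            if word_error_rate(ex["reference"], ex["pseudo_label"]) <= max_wer]
```

A real pipeline would normally normalize punctuation and number formatting before scoring, otherwise harmless formatting differences get counted as errors.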

  3. Results

We’ve evaluated the model only on out-of-distribution datasets, and it is within 1% WER of Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.

  4. Robust to noise

Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.

  5. Pushing for max inference speed

Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding, which help us achieve a real-time factor of 0.01!
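To put the real-time factor in concrete terms: RTF is processing time divided by audio duration, so an RTF of 0.01 means one hour of audio is transcribed in about 36 seconds. The sketch below shows that arithmetic, plus one simple way long audio can be cut into overlapping chunks for chunked decoding. The 15 s chunk length and 2.5 s overlap are placeholder values assumed for illustration, not settings confirmed in this post.

```python
def real_time_factor(processing_seconds, audio_seconds):
    # RTF < 1 means faster than real time.
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def processing_time(audio_seconds, rtf):
    # Expected wall-clock time to transcribe audio at a given RTF.
    return audio_seconds * rtf

def chunk_spans(audio_seconds, chunk_seconds=15.0, overlap_seconds=2.5):
    # Split long audio into overlapping (start, end) windows, in seconds.
    # The default chunk and overlap lengths are hypothetical.
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must exceed overlap_seconds")
    step = chunk_seconds - overlap_seconds
    spans = []
    start = 0.0
    while start < audio_seconds:
        end = min(start + chunk_seconds, audio_seconds)
        spans.append((start, end))
        if end >= audio_seconds:
            break
        start += step
    return spans

# One hour of audio at RTF 0.01 takes about 36 seconds.
one_hour_seconds = processing_time(3600, 0.01)
```

Because the chunks are independent, they can be batched through the model together, which is where most of the long-form speedup in a chunked setup comes from.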

  6. Checkpoints?!

Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.

  • blackkettle@alien.top · 11 months ago

    Why did you bury the RTF results in the appendix? These are way more interesting than relative latency compared to vanilla whisper!

    One thing I would question: an A100 GPU might be ‘typical’ for a lot of cloud offerings like what you see coming from DeepGram and the like, but it’s definitely not typical of what you’d find or have access to for any sort of on-site client. Here CPU is still king.

    Have you run any experiments with CPU-only inference? What does that look like? Here you can still achieve 0.01x-0.06x RT with CPU-only inference and basically the same accuracy with a fine-tuned model utilizing the latest production releases from K2/icefall. This looks like it’s getting closer, but I’d still be inclined to recommend using this distilled model to pre-generate a bunch of pseudo-labeled training data for a smaller, dedicated K2/sherpa production system for anything on-site.