vLLM Whisper 🔈


We forked vLLM to support better throughput and memory-efficient inference for Whisper. On top of that, the fork is able to stream output tokens like any other LLM serving and supports the SRT format for both streaming and batch serving. You can check out the fork at https://github.com/mesolitica/vllm-whisper

How to start?

Simply git clone and install!
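
A minimal sketch of the commands, assuming the standard vLLM source-install flow; check the repository README for the exact steps:

```bash
# Clone the fork and install it from source (requires the CUDA toolkit / nvcc)
git clone https://github.com/mesolitica/vllm-whisper
cd vllm-whisper
pip install -e .
```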

Make sure `nvcc` is available on your machine or the installation will fail. We will provide a Docker image soon so developers can get started easily.

Run OpenAI compatible server

Starting it is very simple; we can follow the exact CLI stated in the vLLM OpenAI Compatible Server documentation,
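
A hedged example using the standard vLLM OpenAI-compatible server CLI; the Whisper model name, host, and port below are placeholders:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model openai/whisper-large-v3 \
    --host 0.0.0.0 \
    --port 8000
```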

If you see the usual vLLM server startup logs without any errors, you are good to go!

Run Whisper API using OpenAI python library

It is very simple: you just have to insert a dummy API key and replace the base URL with the vLLM Whisper host and port. I am going to transcribe Lex Fridman on Grigori Perelman turning away $1 million and the Fields Medal, from Lex Clips,
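
A sketch using the OpenAI Python client; the audio file name and model name are placeholders, and the dummy API key is never validated by the local server:

```python
from openai import OpenAI

# Point the client at the vLLM Whisper server instead of api.openai.com
client = OpenAI(
    api_key="EMPTY",  # dummy API key, the local server does not check it
    base_url="http://localhost:8000/v1",
)

with open("grigori-perelman-lex-clips.mp3", "rb") as audio_file:  # placeholder file name
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # placeholder model name
        file=audio_file,
        response_format="text",
    )

print(transcription)
```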

The output is the transcription of the clip.

Feel free to change `response_format` to any supported format.

Run Whisper streaming JSON format

Simply pass `true` to the `stream` parameter. Unfortunately, the OpenAI library does not support Whisper streaming, so you have to use cURL, Python, or any other HTTP request library. Below is a cURL example,
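
A hedged cURL sketch; the endpoint follows the OpenAI audio transcription API, and the `stream` form field is assumed to be the extra parameter this fork accepts:

```bash
# -N disables curl output buffering so streamed chunks print as they arrive
curl -N http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer EMPTY" \
  -F file=@audio.mp3 \
  -F model=openai/whisper-large-v3 \
  -F response_format=json \
  -F stream=true
```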

Or you can use AIOHTTP Python,
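
A hedged AIOHTTP sketch under the same assumptions, printing streamed chunks as they arrive:

```python
import asyncio
import aiohttp

async def main():
    form = aiohttp.FormData()
    form.add_field("file", open("audio.mp3", "rb"), filename="audio.mp3")  # placeholder file
    form.add_field("model", "openai/whisper-large-v3")  # placeholder model name
    form.add_field("response_format", "json")
    form.add_field("stream", "true")

    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/v1/audio/transcriptions", data=form
        ) as response:
            # Print each streamed chunk as soon as it arrives
            async for chunk in response.content.iter_any():
                print(chunk.decode(), end="", flush=True)

asyncio.run(main())
```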

Run Whisper streaming SRT format

We also support the SRT streaming format; simply pass `srt` to the `response_format` parameter,
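
The same hedged cURL sketch as before, with `response_format` switched to `srt`:

```bash
curl -N http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer EMPTY" \
  -F file=@audio.mp3 \
  -F model=openai/whisper-large-v3 \
  -F response_format=srt \
  -F stream=true
```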

Or you can use AIOHTTP Python,
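
The AIOHTTP sketch is the same as the JSON one, only with the `response_format` field set to `srt`:

```python
import asyncio
import aiohttp

async def main():
    form = aiohttp.FormData()
    form.add_field("file", open("audio.mp3", "rb"), filename="audio.mp3")  # placeholder file
    form.add_field("model", "openai/whisper-large-v3")  # placeholder model name
    form.add_field("response_format", "srt")  # ask for SRT subtitle blocks
    form.add_field("stream", "true")

    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/v1/audio/transcriptions", data=form
        ) as response:
            # SRT blocks are streamed as they are produced
            async for chunk in response.content.iter_any():
                print(chunk.decode(), end="", flush=True)

asyncio.run(main())
```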

Process any length of audio using Torchaudio

This fork uses Torchaudio to chunk audio in a real-time manner, appending frames until a 30-second chunk is reached, which is then fed into Whisper for prediction.

Each request also stores its own timestamp to keep track of the last predicted audio timestamp, so we can give accurate timestamps on longer audio.
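
A hedged sketch of the idea (not the fork's actual implementation), using Torchaudio's StreamReader to decode the audio incrementally and emit 30-second chunks together with the running offset used to keep timestamps accurate:

```python
import torch
from torchaudio.io import StreamReader

SAMPLE_RATE = 16000               # Whisper expects 16 kHz mono audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30 seconds per chunk

def iter_whisper_chunks(path):
    streamer = StreamReader(path)
    streamer.add_basic_audio_stream(
        frames_per_chunk=SAMPLE_RATE,  # decode roughly one second at a time
        sample_rate=SAMPLE_RATE,
    )
    buffer = torch.empty(0)
    offset = 0.0  # seconds already predicted, carried across chunks
    for (frames,) in streamer.stream():
        buffer = torch.cat([buffer, frames.mean(dim=-1)])  # downmix to mono and append
        while buffer.shape[0] >= CHUNK_SAMPLES:
            yield offset, buffer[:CHUNK_SAMPLES]
            offset += CHUNK_SAMPLES / SAMPLE_RATE
            buffer = buffer[CHUNK_SAMPLES:]
    if buffer.shape[0] > 0:
        yield offset, buffer  # final partial chunk

# Each yielded (offset, chunk) pair is fed to Whisper, and the offset is added
# to the predicted timestamps so long audio keeps accurate timestamps.
```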

Pull Request

We have already created a pull request at https://github.com/vllm-project/vllm/pull/5964. There are still many things we need to do; the Encoder-Decoder architecture is still a WIP in vLLM, and we got feedback to upstream NeuralMagic's Encoder-Decoder progress back.

Can we improve more?

We need to cache Encoder hidden states on the first step

Right now, on every step, vLLM always recomputes the Encoder hidden states and passes them to the Decoder model. If we are able to cache this, we can save roughly a quarter of the computation time on subsequent steps.
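
A hedged sketch of the idea, not the fork's actual code: the encoder forward pass runs once per request and its hidden states are cached for all subsequent decode steps.

```python
import torch

# Cache of encoder hidden states, keyed by request id (illustrative only)
_encoder_cache: dict[str, torch.Tensor] = {}

def get_encoder_hidden_states(request_id, encoder, mel_features):
    if request_id not in _encoder_cache:
        # First decode step: the only time the encoder forward pass is needed
        with torch.no_grad():
            _encoder_cache[request_id] = encoder(mel_features)
    # Later steps reuse the cached hidden states instead of recomputing them
    return _encoder_cache[request_id]
```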

We need to cache KV Cross Attention on the first step

The cross-attention K and V are always the same because they are just the K and V projection layers matmul-ed with the Encoder hidden states, yet right now vLLM recomputes this KV Cross Attention on every step.
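
A hedged sketch of the same idea for cross-attention; since the Encoder hidden states are fixed for a request, the K/V projections only need to be computed on the first step:

```python
import torch

class CachedCrossAttentionKV:
    """Illustrative only: compute cross-attention K/V once and reuse them."""

    def __init__(self, hidden_size):
        self.k_proj = torch.nn.Linear(hidden_size, hidden_size)
        self.v_proj = torch.nn.Linear(hidden_size, hidden_size)
        self.k_cache = None
        self.v_cache = None

    def get_kv(self, encoder_hidden_states):
        # encoder_hidden_states: (num_audio_frames, hidden_size), fixed per request
        if self.k_cache is None:
            # First step: project the encoder hidden states into K and V
            self.k_cache = self.k_proj(encoder_hidden_states)
            self.v_cache = self.v_proj(encoder_hidden_states)
        # Subsequent steps: reuse the cached K/V instead of recomputing them
        return self.k_cache, self.v_cache
```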