Our North Star at Cerebrium has always been to help companies implement ML-based products as quickly as possible. We are therefore always looking to the community for products to build, and for ways to deliver them quickly. In this case, the idea came from a tweet by the founder of Scale AI, Alex Wang.
Given the hype around Whisper and this practical use case for the model, we were excited to tackle it. Besides, we don't want anyone feeling like an A.I. peasant. Below we walk through the steps we took to implement it.
You can test the finished product by forwarding or sending a voice note to the number: +14245442827
Initially, we were worried about the verification time for a WhatsApp application, so we thought it best to start by creating our WhatsApp bot. We knew Twilio had a WhatsApp integration, was developer friendly, and had reasonable usage-based pricing, so we picked them as our underlying provider. You can follow the steps below to set up your Twilio WhatsApp account:
To implement our messaging logic, we looked at using Twilio Functions, Twilio's equivalent of AWS Lambda for hosting and running our code. However, it has a major caveat: functions cannot run for longer than 10 seconds. A quick test of Whisper on Hugging Face will show you that even a short audio file (5 seconds) takes around 6 seconds to transcribe, and we knew the 10-second limit was cutting it too fine. After all, a WhatsApp transcriber becomes useful precisely for those long 2-minute voice notes. Based on our experience, the fastest route was to use AWS Lambda.
We knew AWS Lambda supports 15-minute execution times, which was more than suitable for our use case. We weren't going to do anything fancy with the Serverless Framework or AWS SAM; we would simply upload our function as a zip file to AWS through the UI.
First, we created an index.js file with the following code. This is the end product that we will explain step by step below.
We import the necessary libraries as well as our Twilio account credentials. You can find your credentials, the Account SID and Auth Token, on your Twilio console.
The Twilio docs aren't great at showing what an example request to your function looks like, so we started by logging the event and context objects to see what we were working with.
We can see that event.body is a base64-encoded string which we need to decode. Decoded, it looks like this:
This is a query-parameter string, so we need to extract a few key values here, namely:
You might be wondering why the media variables end in ‘0’. This is because you can send multiple media items in a WhatsApp message, and this is how Twilio separates the files. The first media URL is MediaUrl0 and the second is MediaUrl1, which correspond to MediaContentType0 and MediaContentType1. You can see in the query string there is a variable called ‘NumMedia’ that indicates how many media files were sent.
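The decode-and-extract step can be sketched with Node's built-ins alone; the parameter names (From, To, NumMedia, MediaUrl0, MediaContentType0) are Twilio's webhook fields, while the function name is our own:

```javascript
// Decode the base64-encoded webhook body and pull out the fields we need.
function parseTwilioBody(base64Body) {
  const decoded = Buffer.from(base64Body, 'base64').toString('utf-8');
  const params = new URLSearchParams(decoded); // form-encoded query string
  return {
    from: params.get('From'),                      // sender, e.g. "whatsapp:+27..."
    to: params.get('To'),                          // our Twilio WhatsApp number
    numMedia: Number(params.get('NumMedia') || 0), // how many media files were sent
    mediaUrl: params.get('MediaUrl0'),             // URL of the first media item
    mediaContentType: params.get('MediaContentType0'),
  };
}
```

In the handler you would call this as `parseTwilioBody(event.body)`.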
Next we have to download the media file in order to send it to our Whisper model.
We create an async function since, depending on how big the media file is, it might take some time to download. We pass it the media URL we received earlier as well as a filename. You will see we download the file to the Lambda function's tmp storage. This is deliberate: once a Lambda function terminates, everything in tmp is deleted. Voice notes are personal and could contain sensitive information, so we didn't want to store them anywhere. We generate a unique UUID for the filename because, if the function stays warm, other voice notes could still be sitting in the tmp folder, and we don't want to accidentally pick up another user's file.
Deploying Whisper on HuggingFace has a few issues
With Cerebrium, you can run inference on voice notes of 5+ minutes and you only pay for inference time. We also automatically handle the scaling of your infrastructure to handle traffic spikes.
In order to deploy Whisper on Cerebrium, you first need to install our framework. Note, the following has to be done in Python.
You can then create a simple Python file or execute everything in the Python shell. We will use the Whisper Tiny model since it is suitable for our use case. In order to get an API key you will need to sign up at https://dashboard.cerebrium.ai. This is so we can associate a model with your account. We don't require a credit card for sign-up, and we give you $50 in free credits to test out the service.
Once you run the above, you should get back an endpoint that you can use to send requests. If you want to read more about deploying with our framework, you can take a look at our docs here.
We created a function in our index.js that is responsible for sending data through to our deployed Whisper model.
We read the downloaded audio file from tmp and encode it as base64, since that is how our deployed model expects the file. We then make a POST request to our API endpoint, using the API key from our dashboard to authorize the request.
Once we get a response from our deployed model, we use the Twilio client to send the transcription back to the user, using the mobile numbers we received in the request earlier.
Once you have finished creating the index.js file, zip its contents: both the index.js file and the node_modules folder. The node_modules folder will appear after installing the required third-party libraries such as node-fetch (fs is a built-in Node module and doesn't need to be installed).