Creating a Whatsapp voicenote transcriber using Whisper in 1 hour

Michael Louis
Co-Founder & CEO

Our North Star at Cerebrium has always been to help companies implement ML-based products as quickly as possible, so we are always looking to the community for products to build and to see how quickly we can deliver them. In this case, the idea came from a tweet by the Founder of Scale AI, Alex Wang.

Given the hype around Whisper and this practical use case for the model, we were excited to tackle it (also, we don't want anyone feeling like an A.I. peasant). Below we walk through the steps we took to implement it.

You can test the finished product by forwarding/sending a voicenote to the number: +14245442827

Setup Whatsapp bot

Initially, we were worried about the verification time for a Whatsapp application, so we thought it best to start by creating our Whatsapp bot. We knew Twilio had a Whatsapp integration, was developer friendly, and had reasonable usage-based pricing, so we picked them as our underlying provider. You can follow the steps below to set up your Twilio Whatsapp account:

  1. Sign up at https://www.twilio.com/messaging/whatsapp
  2. Fill in your information and create an account. The free trial comes with $15 of credits to start.
  3. You are then automatically taken to the sandbox for Whatsapp and are asked to send a message to the Twilio test bot number. Send it a message as this is how we will be doing our testing.
  4. Great! You now have your test bot setup that you can run tests with.

To implement our messaging logic, we looked at using Twilio Functions, Twilio's serverless equivalent of AWS Lambda, to host and run our code. However, it has a huge caveat: it can't run functions that last longer than 10 seconds. A quick test of Whisper on Hugging Face will show you that even a short audio file (5 seconds) takes about 6 seconds to transcribe, so the 10 second limit was cutting it too fine. After all, a Whatsapp transcriber becomes useful precisely for those long 2 minute voice notes. In our experience, the fastest alternative was AWS Lambda.

AWS Lambda function

We knew AWS Lambda supports execution times of up to 15 minutes, which was more than enough for our use case. We weren't going to do anything fancy with the Serverless Framework or AWS SAM; we would simply upload our function as a zip file to AWS through the console UI.

First, we created an index.js file with the following code. This is the end product that we will explain step by step below.


const fs = require('fs');
const path = require('path');
const fetch = require('node-fetch');
// Twilio credentials from the Twilio console. Read them from environment
// variables rather than hardcoding them in the source.
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const { v4: uuidv4 } = require('uuid');
const client = require('twilio')(accountSid, authToken);

exports.handler = async function (event, context) {
  const buff = Buffer.from(event.body, 'base64');
  const formEncodedParams = buff.toString('utf-8');
  const urlSearchParams = new URLSearchParams(formEncodedParams);

  const contentType = urlSearchParams.get('MediaContentType0');
  const to = urlSearchParams.get('To');
  const from = urlSearchParams.get('From');
  const mediaUrl = urlSearchParams.get('MediaUrl0');

  if (mediaUrl && contentType.includes('audio')) {
  
    await client.messages.create({
        from: `${to}`,
        body: "I'm working on your VN and will post the text back here. I usually take 20 seconds per minute of a VN.",
        to: `${from}`
        })
        .then(message => {
        console.log(message.sid)
    });
    const filename = uuidv4();
    await downloadFile(mediaUrl, filename);
    const modelResponse = await whisperModel(`/tmp/${filename}.ogg`);

    await client.messages.create({ from: `${to}`, body: modelResponse, to: `${from}` }).then((message) => {
      console.log(message.sid);
    });
  } else {
    await client.messages
      .create({
        from: `${to}`,
        body: 'Unfortunately I can only transcribe Voicenotes - I want to be part of the conversation.',
        to: `${from}`
      })
      .then((message) => console.log(message.sid));
  }
};

async function downloadFile(url, filename) {
  const fullPath = path.resolve(`/tmp/${filename}.ogg`);

  if (!fs.existsSync(fullPath)) {
    const response = await fetch(url);
    const fileStream = fs.createWriteStream(fullPath);

    response.body.pipe(fileStream);
    
    await new Promise((resolve, reject) => {
      fileStream.on('finish', resolve);
      fileStream.on('error', reject);
    });
  }
}

async function whisperModel(filename) {
  const data = fs.readFileSync(filename, 'base64');
  const response = await fetch('https://inference.cerebrium.ai/runs/pXXXXX_whisper-tiny', {
    headers: { Authorization: 'c_api_key-xxxxxxxxxxxxxxxxxxxxxxxxxxx' },
    method: 'POST',
    body: JSON.stringify([[{
      audio_in: { b64_encoded: data, filetype: 'mp3' }
    }]])
  });
  const result = await response.json();
  return result.result[0][0].text_out;
}


We import the necessary libraries as well as our Twilio account credentials. You can find your credentials, the Account SID and Auth Token, on your Twilio console.

The Twilio docs don't clearly show what an example request to your function looks like, so we started by simply logging the event and context objects to see what we were working with.

We can see that event.body contains a base64-encoded string, which we need to decode. Decoded, it looks like this:


MediaContentType0=audio%2Fogg&SmsMessageSid=MMd7e0b115043ee61f38756f5d67b3a1e2&NumMedia=1&ProfileName=Michael+Louis&SmsSid=MMd7e0b115043ee61f38756f5d67b3a1e2&WaId=27848401480&SmsStatus=received&Body=&To=whatsapp%3A%2B141XXXX&NumSegments=1&ReferralNumMedia=0&MessageSid=MMd7e0b115043ee61f38756f5d67b3a1e2&AccountSid=ACad55ef868ab1325595b95e32e3da4958&From=whatsapp%3A%2B27XXXXXXXX&MediaUrl0=https%3A%2F%2Fapi.twilio.com%2F2010-04-01%2FAccounts%2FACad55ef868ab1325595b95e32e3da4958%2FMessages%2FMMd7e0b115043ee61f38756f5d67b3a1e2%2FMedia%2FME7bbd5a9c70e191edab17e89XXXXX&ApiVersion=2010-04-01

This is a query parameter string and so we need to extract a few key values here, namely:

  • MediaContentType0: This tells us what type of content was sent. Since our bot only transcribes voice notes, we want to return an error message if a user sends us text or images.
  • To: Who the message was sent to. This is your number, since users send the voicenote to you.
  • From: The number of the user, i.e. who the message was from.
  • MediaUrl0: The Twilio URL of the media file. We will need this later to download the file.

const buff = Buffer.from(event.body, "base64");
const formEncodedParams = buff.toString("utf-8");
const urlSearchParams = new URLSearchParams(formEncodedParams);

const contentType = urlSearchParams.get("MediaContentType0");
const to = urlSearchParams.get("To");
const from = urlSearchParams.get("From");
const mediaUrl = urlSearchParams.get("MediaUrl0");

You might be wondering why the media variables end in ‘0’. This is because you can send multiple media items in a single Whatsapp message, and this is how Twilio separates the files. The first media URL is MediaUrl0 and the second is MediaUrl1, corresponding to MediaContentType0 and MediaContentType1. You can also see in the query string a variable called ‘NumMedia’ that indicates how many media files were sent.
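If you ever wanted to handle messages with more than one attachment, the indexed parameters can be collected in a loop. This helper is a sketch (our bot only ever looks at index 0):

```javascript
// Collect every MediaUrlN / MediaContentTypeN pair from a parsed Twilio
// webhook body. NumMedia tells us how many media items the message carried.
function collectMedia(params) {
  const count = parseInt(params.get('NumMedia') || '0', 10);
  const media = [];
  for (let i = 0; i < count; i++) {
    media.push({
      url: params.get(`MediaUrl${i}`),
      contentType: params.get(`MediaContentType${i}`),
    });
  }
  return media;
}
```

For example, `collectMedia(new URLSearchParams('NumMedia=1&MediaUrl0=https://...'))` returns a one-element array.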

Next we have to download the media file in order to send it to our Whisper model.


async function downloadFile(url, filename) {
  
  const fullPath = path.resolve(`/tmp/${filename}.ogg`);

  if (!fs.existsSync(fullPath)) {
    const response = await fetch(url);
    const fileStream = fs.createWriteStream(fullPath);

    response.body.pipe(fileStream);
    await new Promise((resolve, reject) => {
        fileStream.on('finish', resolve);
        fileStream.on('error', reject);
      });
  }

}

We create an async function since, depending on how big the media file is, it might take some time to download. We pass it the media URL we received earlier as well as a filename. You will notice we download the file to the tmp storage of the Lambda function. This is deliberate: once a Lambda function terminates, everything in tmp is deleted. Voice notes are personal and could contain sensitive information, so we didn't want to store them anywhere. We generate a unique UUID for the filename because, if the function remains warm, other voice notes could still be sitting in the tmp folder, and we don't want to pick up another user's file by mistake.

Deploying Whisper on Cerebrium

Deploying Whisper on HuggingFace has a few issues:

  1. Your model can only do inference on 30 second voice notes
  2. You are charged by the hour for your compute instances
  3. You have to set up the number of min and max replicas

With Cerebrium, you can do inference on voice notes that are 5+ minutes long and you only pay for inference time. We also automatically scale your infrastructure to handle traffic spikes.

In order to deploy Whisper on Cerebrium, you first need to install our framework. Note, the following has to be done in Python.


pip install cerebrium

You can then create a simple Python file or execute everything in the Python shell. We will use the Whisper Tiny model since it is suitable for our use case. To get an API key, you will need to sign up at https://dashboard.cerebrium.ai; this is so we can associate a model with your account. We don't require a credit card for sign-up, and we give you $50 in free credits to test out the service.


from cerebrium import deploy, ModelType

model_pipeline = (ModelType.PREBUILT, 'whisper-tiny')
endpoint = deploy(model_pipeline, 'whatsapp-transcriber', "c_api_key-XXXXXXXXXXXXXXXXXXXX")

Once you run the above, you will be returned an endpoint that you can use to send requests. If you want to read more about deploying with our framework, you can take a look at our docs here.

We created a function in our index.js that is responsible for sending data through to our deployed Whisper model.

We read the downloaded audio file from tmp and encode it as base64, since this is the format our deployed model expects. We then make a POST request to our API endpoint, using the API key from our dashboard to authorize the request.

Once we get a response from our deployed model, we use the Twilio client to send the message to the user, using the mobile numbers we received from the request previously.
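Since the reply depends on `result.result[0][0].text_out` existing, a small guard (a sketch based on the response shape shown in our whisperModel function) keeps the bot from crashing if the endpoint returns an error shape:

```javascript
// Pull the transcription out of the nested response shape our deployed model
// returns, falling back to an apology if anything is missing or malformed.
function extractTranscript(result) {
  try {
    const text = result.result[0][0].text_out;
    if (typeof text === 'string' && text.length > 0) return text;
  } catch (e) {
    // Shape didn't match; fall through to the fallback below.
  }
  return "Sorry, I couldn't transcribe that voice note. Please try again.";
}
```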

Publish Lambda

Once you have finished creating the index.js file, you should zip its contents: both the index.js file and the node_modules folder. The node_modules folder will appear after installing the required third-party libraries (twilio, node-fetch and uuid; fs and path are built into Node).
