Voice Activity Detection with WebRTC

Implementing VAD with WebRTC using the non-standard RTCAudioSink component

Demonstrating voice activity detection

While implementing voice calls for Matokai, I needed a way to know which user is speaking. WebRTC does not handle this by default, so it was time to get creative again.

What I needed to achieve

To implement voice activity detection properly, I needed to have a way to:

  1. send data other than video and audio over RTC
  2. access an audio track’s PCM data that we receive from a peer
  3. identify whether a PCM frame contains speech
  4. broadcast a packet to all connected peers to let them know who is speaking

Sending packets using an unordered/unreliable data channel

In WebRTC, there are two types of data channels: an ordered one and an unordered one. While WebRTC runs entirely over UDP, an ordered data channel guarantees reliable, in-order delivery much like TCP would, at the cost of latency and performance.

For our use case, we'll need an unordered data channel: it's okay if some packets never arrive, and the extra performance means the activity packets stay relatively lined up with the audio data.

Creating an unordered data channel

Start by creating an unordered data channel on the client side like so:

/* Build an unordered data channel */
const unorderedDataChannel = this.peerConnection.createDataChannel('channel', {
    ordered: false
});
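Note that ordered: false by itself only relaxes ordering; lost messages will still be retransmitted. If you also want the channel to be unreliable, as described above, you can additionally cap retransmissions. A minimal sketch:

/* Unordered and unreliable: give up on lost messages instead of retransmitting them */
const unorderedDataChannel = this.peerConnection.createDataChannel('channel', {
    ordered: false,     /* messages may arrive out of order */
    maxRetransmits: 0   /* never retransmit a lost message */
});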

Receiving data channels from clients

On the other end, we will receive this data channel once the connection has been negotiated. We can retrieve it like so:

/**
 * @author Icseon
 * @description This callback is invoked once a peer sends a data channel to us
 * @param channel
 */
peerConnection.ondatachannel = async ({ channel }) => {
  
  /* Tell ourselves about the fact we have received a data channel from a client */
  debug(`received a data channel labeled: ${channel.label}`);
    
  /* If this is the data channel we expect, register it for our peer somehow. That's up to you. */
  if (channel.label === 'channel')
  {
    
      /* Add the data channel to our peerConnection somehow for later access */
      peerConnection.someClass.dataChannel = channel;
        
  }
  
};

Listening for packets from the server

Now we have a one-sided communication channel between the client and the server (server to client only), but we are not yet handling any packets on the client. Listen to packets from the server like this:

/* Listen for data from the server. */
unorderedDataChannel.onmessage = (event) => {
    
    /* We are going to do something with this data later on. For now, let's do a simple console.log */
    console.log(event.data);
    
};

We can now send data to the client that is not audio or video, which means we can send voice activity packets later!

Accessing the audio tracks

I am assuming that you have already added your media streams to the peer connection. If not, go back and implement that first.
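If you need a refresher, a minimal client-side sketch could look like this (assuming this.peerConnection already exists and the microphone is captured through getUserMedia):

/* Capture the microphone and add its audio track(s) to the peer connection */
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });

mediaStream.getAudioTracks().forEach((track) => {

    /* Associate the track with the stream so the remote side receives both */
    this.peerConnection.addTrack(track, mediaStream);

});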

Receiving media tracks from clients

Let’s start simple and create a way to echo back audio data to our client. Make sure to check if we are dealing with an audio track because clients can also send video tracks. We are going to expand on this very soon:

/**
 * @author Icseon
 * @description This callback is invoked once a peer sends a track to us
 * @param track
 * @param streams
 */
peerConnection.addEventListener('track', async({ track, streams }) => {

    /* Let's know what we have received */
    debug(`got track of kind: ${track.kind}`);
    
    /* Check to see if we got an audio track */
    if (track.kind === 'audio')
    {
    
        /* Add a transceiver to our peer connection which will transmit audio data back to our client */
        peerConnection.addTransceiver(track, {
            direction: 'sendonly',
            streams
        });
    
    }

});

WebRTC is now sending back your own audio. However, you can't hear yourself yet. We'll need to handle tracks on the client side as well and play back the media stream once we receive it.

Receiving media tracks from the server

That is done by listening for an audio track, exactly the same way we have just done on the server side - except we also create a new audio element and play the stream through it.

/**
 * @author Icseon
 * @description Process incoming tracks
 * @param RTCTrackEvent
 */
this.peerConnection.ontrack = (RTCTrackEvent) => {

    /* Are we dealing with an audio track? */
    if (RTCTrackEvent.track.kind === 'audio')
    {
    
        /* Create a new audio element and begin playing the media stream */
        const audioElement = document.createElement('audio');
        audioElement.srcObject = RTCTrackEvent.streams[0]; /* A track may contain many streams - we only care about the first one */
        audioElement.play();
    
    }

}

After handling the ontrack event on the client side, we should be able to hear ourselves! We are not listening to our own microphone directly, rather, we are listening to our microphone through WebRTC.

Reading PCM audio data using RTCAudioSink

We are now handling audio data from client peers on the server side. However, as it stands, we do not have a way to access PCM data yet.

To begin receiving PCM audio data from a remote audio track, we are going to be using the non-standard RTCAudioSink component WebRTC provides. This component will allow us to very easily access raw PCM data from any audio track.
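RTCAudioSink is not available in browsers; it is part of the server-side WebRTC implementation. If you are using the node-webrtc (wrtc) library, which is what I'm assuming here, it is exposed through the nonstandard namespace:

/* With node-webrtc (wrtc), the non-standard components live under the 'nonstandard' namespace */
import wrtc from 'wrtc';
const { RTCAudioSink } = wrtc.nonstandard;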

/* Construct a new RTCAudioSink using the audio track we have received */
const audioSink = new RTCAudioSink(track);

/* Handle audio data */
audioSink.ondata = (data) => {

    /* Read PCM data from the samples */
    const pcm = data.samples;
    
    /* This will spam your console every 10ms with raw PCM data. We now have access to PCM audio data! */
    console.log(pcm);

}

At this point, we have successfully implemented a way to receive raw PCM audio data from an RTC peer and can start to use this data to see if there is speech in it.

Using VAD to detect speech

Installing & Initializing VAD

We now have the ability to access raw PCM audio frames and can use this alongside VAD to detect if the audio frame contains speech. For this, we can use the @ozymandiasthegreat/vad npm package. Let’s start by constructing VAD:

/* Import the VAD building blocks (assuming the @ozymandiasthegreat/vad package mentioned above) */
import { VADBuilder, VADMode, VADEvent } from '@ozymandiasthegreat/vad';

/* Retrieve VAD through the VADBuilder */
const VAD = await VADBuilder();
const vad = new VAD(VADMode.VERY_AGGRESSIVE, 48000); /* WebRTC audio has a sample rate of 48000 Hz */

Using VAD to detect voice activity

Right now, we have access to VAD and can start using it to detect speech in audio frames. We can do this by using the processFrame method VAD provides. Let’s go back to our RTCAudioSink and add the logic required for identifying speech.

/* Construct a new RTCAudioSink using the audio track we have received */
const audioSink = new RTCAudioSink(track);

/* Handle audio data */
audioSink.ondata = (data) => {

    /* Read PCM data from the samples */
    const pcm = data.samples;
    
    /* Determine if the PCM data contains speech */
    const vadResult = vad.processFrame(pcm);
    
    /* If the vadResult indicates we have speech, log a message to the console indicating such */
    if (vadResult === VADEvent.VOICE)
    {
        console.log('speech detected!');
    }

}

Awesome. We now have a way to detect speech from audio. We’re almost there, we only need to notify all peers that somebody is speaking.

Sending packets to all peers to notify them of voice activity

Note: to avoid overcomplicating this, we are going to be using a simple array of RTCPeerConnection instances. I'll assume this array is named peers; a sketch of what that bookkeeping might look like follows below.
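Throughout the snippets, someClass is simply whatever per-peer state object you keep around; it is not part of any API. A purely illustrative sketch of that bookkeeping (the helper name registerPeer is made up) could be:

/* All connected peers, purely for demonstration purposes */
const peers = [];

/* Whenever a peer connects, remember its connection together with some per-peer state */
const registerPeer = (peerConnection, username) => {

    /* 'someClass' is just an arbitrary bag of per-peer state used throughout this post */
    peerConnection.someClass = {
        username,          /* who this peer is, resolved through your own authentication */
        dataChannel: null  /* filled in once the peer sends us its data channel */
    };

    peers.push(peerConnection);

};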

Defining the voice activity packet

In my opinion, a clean approach to building a packet is to abstract its structure away into a class. Let's start by building the VoiceActivityPacket class, which we are going to send to all peers.

export default class VoiceActivityPacket {

    /**
     * @author Icseon
     * @description VoiceActivityPacket constructor
     * @param username
     */
    constructor(username)
    {
        
        /* For easy packet identification, I am choosing to add the packet type in the constructor */
        this.packetId = 'VoiceActivity';
        
        /* We really just need to know who is speaking. That's all. */
        this.username = username;
        
    }

}

Broadcasting the voice activity packet

Now that we have defined the voice activity packet, we can start sending it to all peers and handle it. Let’s start by sending it to everyone:

/* Construct a new RTCAudioSink using the audio track we have received */
const audioSink = new RTCAudioSink(track);

/* Handle audio data */
audioSink.ondata = (data) => {

    /* Read PCM data from the samples */
    const pcm = data.samples;
    
    /* Determine if the PCM data contains speech */
    const vadResult = vad.processFrame(pcm);
    
    /* If the vadResult indicates we have speech, log a message to the console indicating such */
    if (vadResult === VADEvent.VOICE)
    {
    
        /* Build the voice activity packet */
        const packet = new VoiceActivityPacket(peerConnection.someClass.username); /* You need to deal with authentication somehow, I'll assume the username is accessible like this. */
        
        /* Loop through every peer in the peers array */
        peers.forEach((peer) => {
        
            /* We can only send arrayBuffers, blobs and strings. That's why JSON.stringify() is required */
            peer.someClass.dataChannel.send(JSON.stringify(packet));
        
        });
    }

}

We are now sending the voice activity packet to all peers. Obviously, this is a very primitive way of broadcasting packets but for demonstration purposes it should suffice. All that’s left to be done is handle the packet on the client side.

Handling the voice activity packet

The voice activity packet is now being sent to clients and received by them. It's time to start handling it.

/* Listen for data from the server. */
unorderedDataChannel.onmessage = (event) => {

    /* Parse JSON */
    const data = JSON.parse(event.data);
    
    /* Determine what packet we have received */
    switch(data.packetId)
    {
        
        /* Handle voice activity packets */
        case 'VoiceActivity':
            
            /* You can handle this in any way you'd like. In this post, we are just going to log who is speaking. */
            console.log(`${data.username} is speaking!`);
            break;
    }
    
};

In this snippet, we check the packetId of the data we receive and handle the VoiceActivityPacket by logging the username of the speaker. You may do anything with this information, like triggering a UI transition to indicate that a participant is speaking.

That’s a wrap!

You have now read how I deal with voice activity detection with WebRTC. I left out a lot of implementation-specific details because I do not know your use case; if you're going to use this knowledge, apply it within the scope of your own project.

Keep in mind that SDP negotiation is required after adding a new track/transceiver to peers and that it needs to be handled through your signaling server(s) accordingly.
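As a rough idea of what that renegotiation could look like on the side that adds the transceiver (the signaling.send helper is hypothetical and stands in for whatever signaling mechanism you use):

/* Fired whenever adding a track/transceiver requires a new round of SDP negotiation */
peerConnection.onnegotiationneeded = async () => {

    /* Create and apply a fresh offer */
    const offer = await peerConnection.createOffer();
    await peerConnection.setLocalDescription(offer);

    /* Forward the offer to the other side through your own signaling server */
    signaling.send({ type: 'offer', sdp: peerConnection.localDescription });

};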

Thank you for reading my post, and I hope that this helps someone out. I’ll be writing more technical posts like this one in the near future as I have more to write about.

— Icseon