How to build large-scale end-to-end encrypted group video calls

Signal launched end-to-end encrypted group calls a year ago, and since then we’ve scaled from support for 5 participants all the way to 40. There is no off-the-shelf software that would allow us to support calls of that size while ensuring that all communication is end-to-end encrypted, so we built our own open source Signal Calling Service to do the job. This post describes how it works in more detail.

Selective Forwarding Units (SFUs)

In a group call, each party needs to get their audio and video to every other participant in the call. There are 3 possible general architectures for doing so:

  • Full mesh: Each call participant sends its media (audio and video) directly to every other call participant. This works for very small calls, but doesn’t scale to many participants. Most people just don’t have an Internet connection fast enough to send 40 copies of their video at the same time.
  • Server mixing: Each call participant sends its media to a server. The server “mixes” the media together and sends it to each participant. This works with many participants, but is not compatible with end-to-end encryption because it requires that the server be able to view and alter the media.
  • Selective Forwarding: Each participant sends its media to a server. The server “forwards” the media to other participants without viewing or altering it. This works with many participants, and is compatible with end-to-end encryption.

Because Signal must have end-to-end encryption and scale to many participants, we use selective forwarding. A server that does selective forwarding is usually called a Selective Forwarding Unit or SFU.
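To make the scaling difference concrete, here is a rough sketch of how many copies of its own media one participant uploads under each architecture. The function names are illustrative, and the sketch assumes a single stream per sender (no simulcast layers).

```rust
/// Copies of its own media one participant uploads in a full mesh:
/// one per other participant in the call.
fn full_mesh_uploads(participants: u32) -> u32 {
    participants.saturating_sub(1)
}

/// Copies uploaded when a server distributes the media on the
/// participant's behalf (server mixing or selective forwarding).
fn via_server_uploads(participants: u32) -> u32 {
    if participants > 1 { 1 } else { 0 }
}

fn main() {
    for n in [2u32, 5, 40] {
        println!(
            "{} participants: full mesh uploads {}, via server uploads {}",
            n,
            full_mesh_uploads(n),
            via_server_uploads(n)
        );
    }
}
```

The upload cost of a full mesh grows with the size of the call, while the upload cost through a server stays constant. That is why only the two server-based designs scale, and only selective forwarding also keeps the media opaque to the server.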

If we focus on the flow of media from a single sending participant through an SFU to multiple receiving participants, it looks like this:

SFU diagram

A simplified version of the main loop in the code in an SFU looks like this:

let socket = std::net::UdpSocket::bind(config.server_addr)?;
let mut clients = ...;  // changes over time as clients join and leave
loop {
  let mut incoming_buffer = [0u8; 1500];
  let (incoming_size, sender_addr) = socket.recv_from(&mut incoming_buffer)?;
  let incoming_packet = &incoming_buffer[..incoming_size];

  for receiver in &clients {
    // Don't send to yourself
    if sender_addr != receiver.addr {
      // Rewriting the packet is needed for reasons we'll describe later.
      let outgoing_packet = rewrite_packet(incoming_packet, receiver);
      socket.send_to(&outgoing_packet, receiver.addr)?;
    }
  }
}

Signal’s Open Source SFU

When building support for group calls, we evaluated many open source SFUs, but only two had adequate congestion control (which, as we’ll see shortly, is critical). We launched group calls using a modified version of one of them, but soon found that even with heavy modifications, we couldn’t reliably scale past 8 participants because of high server CPU usage. To scale to more participants, we wrote a new SFU from scratch in Rust. It has now been serving all Signal group calls for 9 months, scales to 40 participants with ease (perhaps more in the future), and is readable enough to serve as a reference implementation for an SFU based on the WebRTC protocols (ICE, SRTP, transport-cc, and googcc).

Let’s now take a deeper dive into the hardest part of an SFU. As you might have guessed, it’s more complicated than the simplified loop above.

The Hardest Part of an SFU

The hardest part of an SFU is forwarding the right video resolutions to each call participant while network conditions are constantly changing.

This difficulty is a combination of the following fundamental problems:

  1. The capacity of each participant’s Internet connection is constantly changing and hard to know. If the SFU sends too much, it will cause additional latency. If the SFU sends too little, the quality will be low. So the SFU must constantly and carefully adjust how much it sends each participant to be “just right”.
  2. The SFU cannot modify the media it forwards; to adjust how much it sends, it must select from the media sent to it. If the “menu” to select from were limited to sending either the highest resolution available or nothing at all, it would be difficult to adjust to a wide variety of network conditions. So each participant must send the SFU multiple resolutions of video and the SFU must constantly and carefully switch between them.

The solution is to combine several techniques which we will discuss individually:

  • Simulcast and Packet Rewriting allow switching between different video resolutions.
  • Congestion Control determines the right amount to send.
  • Rate Allocation determines what to send within that budget.

Simulcast and Packet Rewriting

In order for the SFU to be able to switch between different resolutions, each participant must send the SFU many layers (resolutions) simultaneously. This is called simulcast. If we focus on just one sender’s media being forwarded to two receivers, it looks like this, where each receiver switches between small and medium layers but at different times:

Simulcast diagram

But what does the receiving participant see as the SFU switches between different layers? Does it see one layer switching resolutions, or does it see multiple layers switching on and off? This might seem like a minor distinction, but it has major implications for the role the SFU must play. Some video codecs, such as VP9 or AV1, make this easy: switching layers is built into the video codec in a way called SVC. Because we’re still using VP8 to support a wide range of devices, and since VP8 doesn’t support SVC, the SFU must do something to transform 3 layers into 1.

This is similar to how video streaming apps stream different quality video to you depending on how fast your Internet connection is. You view a single video stream switching between different resolutions, and in the background, you are receiving different encodings of the same video stored on the server. Like a video streaming server, the SFU sends you different resolutions of the same video. But unlike a video streaming server, there’s nothing stored and it must do this completely on the fly. It does so via a process called packet rewriting.

Packet rewriting is the process of altering the timestamps, sequence numbers, and similar IDs contained in a media packet that indicate where on a media timeline the packet belongs. It transforms packets from many independent media timelines (one for each layer) into one unified media timeline (one layer). The IDs that must be rewritten when using RTP and VP8 are the following:

  • RTP SSRC: Identifies a stream of consecutive RTP packets. Each simulcast layer is identified by a unique SSRC. To convert from many layers (for example, 1, 2, and 3) to 1 layer, we must change (rewrite) this value to the same value (say, 1).
  • RTP sequence number: Orders the RTP packets that share an SSRC. Because each layer has a different number of packets, it’s not possible to forward packets from multiple layers without changing (rewriting) the sequence numbers. For example, if we want to forward sequence numbers [7, 8, 9] from one layer followed by [8, 9, 10, 11] from another layer, we can’t send them as [7, 8, 9, 8, 9, 10, 11]. Instead we’d have to rewrite them as something like [7, 8, 9, 10, 11, 12, 13].
  • RTP timestamp: Indicates when a video should be rendered relative to a base time. Because the WebRTC library we use chooses a different base time for each layer, the timestamps are not compatible between layers, and we must change (rewrite) the timestamps of one layer to match those of another.
  • VP8 Picture ID and TL0PICIDX: Identifies a group of packets which make up a video frame, and the dependencies between video frames. The receiving participant needs this information to decode the video frame before rendering. Similar to RTP timestamps, the WebRTC library we use chooses different sets of PictureIDs for each layer, and we must rewrite them when combining layers.
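The sequence number example in the list above can be worked through in code. This is a minimal sketch covering sequence numbers only; the `SeqnumRewriter` type and its fields are hypothetical names for illustration, and it places a switched-to packet directly after the highest number sent so far.

```rust
/// Maps RTP sequence numbers from several layers onto one outgoing timeline.
struct SeqnumRewriter {
    current_ssrc: Option<u32>, // layer forwarded most recently
    first_incoming: u16,       // first seqnum seen after the latest switch
    first_outgoing: u16,       // outgoing seqnum assigned to that packet
    max_outgoing: u16,         // highest outgoing seqnum sent so far
}

impl SeqnumRewriter {
    fn new() -> Self {
        SeqnumRewriter { current_ssrc: None, first_incoming: 0, first_outgoing: 0, max_outgoing: 0 }
    }

    fn rewrite(&mut self, ssrc: u32, seqnum: u16) -> u16 {
        let outgoing = match self.current_ssrc {
            // Very first packet: keep its number as the start of the timeline.
            None => seqnum,
            // Same layer as before: keep the same offset from the switch point.
            Some(current) if current == ssrc => self
                .first_outgoing
                .wrapping_add(seqnum.wrapping_sub(self.first_incoming)),
            // Different layer: continue right after the highest number sent.
            Some(_) => self.max_outgoing.wrapping_add(1),
        };
        if self.current_ssrc != Some(ssrc) {
            self.current_ssrc = Some(ssrc);
            self.first_incoming = seqnum;
            self.first_outgoing = outgoing;
        }
        // Simplification: a plain max ignores 16-bit wraparound, which a
        // real implementation would have to handle.
        self.max_outgoing = self.max_outgoing.max(outgoing);
        outgoing
    }
}

/// Rewrites a list of (SSRC, sequence number) pairs in arrival order.
fn rewrite_all(packets: &[(u32, u16)]) -> Vec<u16> {
    let mut rewriter = SeqnumRewriter::new();
    packets.iter().map(|&(ssrc, seqnum)| rewriter.rewrite(ssrc, seqnum)).collect()
}

fn main() {
    // [7, 8, 9] from layer 1 followed by [8, 9, 10, 11] from layer 2.
    let packets = [(1, 7), (1, 8), (1, 9), (2, 8), (2, 9), (2, 10), (2, 11)];
    println!("{:?}", rewrite_all(&packets));
}
```

Tracking only an offset per switch, rather than a table of every packet, is what keeps this kind of rewriting cheap enough to do for every forwarded packet.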

It would be theoretically possible to rewrite just the RTP SSRCs and sequence numbers if we altered the WebRTC library to use consistent timestamps and VP8 PictureIDs across layers. However, we already have many clients in use generating inconsistent IDs, so we need to rewrite all of these IDs to remain backwards compatible. And since the code to rewrite the various IDs is nearly identical to the code for rewriting RTP sequence numbers, it’s not difficult to do so.

To produce a single outgoing layer from multiple incoming layers for a given video stream, the SFU rewrites packets according to the following rules:

  1. The outgoing SSRC is always the incoming SSRC of the smallest layer.
  2. If the incoming packet has an SSRC other than the one currently selected, don’t forward it.
  3. If the incoming packet is the first after a switch between layers, alter the IDs to represent the latest position on the outgoing timeline (one position after the maximum position forwarded so far).
  4. If the incoming packet is a continuation of packets after a switch (it hasn’t just switched), alter the IDs to represent the same relative position on the timeline based on when the switch occurred in the previous rule.

For example, if we had two input layers with SSRCs A and B and a switch occurred after two packets, packet rewriting may look something like this:

Packet rewriting

A simplified version of the code looks something like this:

let mut selected_ssrc = ...;  // Changes over time as bitrate allocation happens
let mut previously_forwarded_incoming_ssrc = None;
// (RTP seqnum, RTP timestamp, VP8 Picture ID, VP8 TL0PICIDX)
// Arithmetic on these tuples is element-wise shorthand.
let mut max_outgoing_ids = (0, 0, 0, 0);
let mut first_incoming_ids = (0, 0, 0, 0);
let mut first_outgoing_ids = (0, 0, 0, 0);
for incoming in incoming_packets {
  if selected_ssrc == incoming.ssrc {
    let just_switched = Some(incoming.ssrc) != previously_forwarded_incoming_ssrc;
    let outgoing_ids = if just_switched {
      // There is a gap of 1 seqnum to signify to the decoder that the
      // previous frame was (probably) incomplete.
      // That's why there's a 2 for the seqnum.
      let outgoing_ids = max_outgoing_ids + (2, 1, 1, 1);
      first_incoming_ids = incoming.ids;
      first_outgoing_ids = outgoing_ids;
      outgoing_ids
    } else {
      first_outgoing_ids + (incoming.ids - first_incoming_ids)
    };

    yield outgoing_ids;

    previously_forwarded_incoming_ssrc = Some(incoming.ssrc);
    max_outgoing_ids = std::cmp::max(max_outgoing_ids, outgoing_ids);
  }
}

Packet rewriting is compatible with end-to-end encryption because the rewritten IDs and timestamps are added to the packet by the sending participant after the end-to-end encryption is applied to the media (more on that below). It’s similar to how TCP sequence numbers and timestamps are added to packets after encryption when using TLS. This means the SFU can view these timestamps and IDs, but these values are no more interesting than TCP sequence numbers and timestamps. In other words, the SFU doesn’t learn anything from these values except that the participant is still sending media.

Congestion Control

Congestion control is a mechanism to determine how much to send over a network: not too much and not too little. It has a long history, mostly in the form of TCP’s congestion control. Unfortunately, TCP’s congestion control algorithms generally don’t work well for video calls because they tend to cause increases in latency that result in a poor call experience (sometimes called “lag”). To provide good congestion control for video calls, the WebRTC team created googcc, a congestion control algorithm which can determine the right amount to send without causing large increases in latency.

Congestion control mechanisms generally depend on some kind of feedback mechanism sent from the packet receiver to the packet sender. googcc is designed to work with transport-cc, a protocol in which the receiver sends periodic messages back to the sender saying, for example, “I received packet X1 at time Z1; packet X2 at time Z2, …”. The sender then combines this information with its own timestamps to know, for example, “I sent packet X1 at time Y1 and it was received at Z1; I sent packet X2 at time Y2 and it was received at Z2…”.
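As a rough illustration of that combining step, the sketch below joins the sender’s record of send times with the receiver’s feedback. The `Feedback` and `Ack` types, the millisecond units, and the function names are assumptions made for this example and are not the wire format of transport-cc.

```rust
use std::collections::HashMap;

/// One entry of receiver feedback: "I received packet `seqnum` at `received_ms`".
struct Feedback {
    seqnum: u64,
    received_ms: u64,
}

/// The sender's combined view of a single packet.
#[derive(Debug, PartialEq)]
struct Ack {
    seqnum: u64,
    sent_ms: u64,
    received_ms: u64,
}

/// Joins feedback with the sender's own send times. Feedback for packets
/// the sender has no record of is skipped.
fn combine(sent_ms_by_seqnum: &HashMap<u64, u64>, feedback: &[Feedback]) -> Vec<Ack> {
    feedback
        .iter()
        .filter_map(|f| {
            sent_ms_by_seqnum.get(&f.seqnum).map(|&sent_ms| Ack {
                seqnum: f.seqnum,
                sent_ms,
                received_ms: f.received_ms,
            })
        })
        .collect()
}

/// Receive time minus send time. The two clocks are not synchronized, so
/// only the change in this value from packet to packet is meaningful.
fn relative_delay_ms(ack: &Ack) -> i64 {
    ack.received_ms as i64 - ack.sent_ms as i64
}

fn main() {
    let sent = HashMap::from([(1, 100), (2, 120)]);
    let feedback = [
        Feedback { seqnum: 1, received_ms: 5030 },
        Feedback { seqnum: 2, received_ms: 5060 },
    ];
    for ack in combine(&sent, &feedback) {
        println!("{:?} relative delay {} ms", ack, relative_delay_ms(&ack));
    }
}
```

In this example the second packet’s relative delay is 10 ms higher than the first packet’s. A growing delay of this kind is the signal a delay-based congestion controller reacts to.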

In the Signal Calling Service, we have implemented googcc and transport-cc in the form of stream processing. The inputs into the stream pipeline are the aforementioned data about when packets were sent and received, which we call acks. The outputs of the pipeline are changes in how much should be sent over the network, which we call the target send rates.

The first few steps of the flow plot the acks on a graph of delay vs. time and then calculate a slope to determine if the delay is increasing, decreasing, or steady. The last step decides what to do based on the current slope. A simplified version of the code looks like this:

let mut target_send_rate = config.initial_target_send_rate;
for direction in delay_directions {
  match direction {
    DelayDirection::Decreasing => {
      // While the delay is decreasing, hold the target rate to let the queues drain.
    }
    DelayDirection::Steady => {
      // While delay is steady, increase the target rate.
      let increase = ...;
      target_send_rate += increase;
      yield target_send_rate;
    }
    DelayDirection::Increasing => {
      // If the delay is increasing, decrease the rate.
      let decrease = ...;
      target_send_rate -= decrease;
      yield target_send_rate;
    }
  }
}

This is the crux of googcc: If latency is increasing, stop sending so much. If latency is decreasing, let it continue. If latency is steady, try sending more. The result is a send rate which closely approximates the actual network capacity while adjusting to changes and keeping latency low.

Of course, the “…” in the code above about how much to increase or decrease is complicated. Congestion control is hard. But now you can see how it generally works for video calls:

  1. The sender picks an initial rate and starts sending packets.
  2. The receiver sends back feedback about when it received the packets.
  3. The sender uses that feedback to adjust the send rate with the rules described above.
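The slope step mentioned earlier in this section can be sketched as a least-squares fit over recent (time, delay) samples. This is an illustration only: the fixed `threshold` parameter and the function names are assumptions made here, and the smoothing and thresholds in a production congestion controller are more involved.

```rust
#[derive(Debug, PartialEq)]
enum DelayDirection {
    Decreasing,
    Steady,
    Increasing,
}

/// Least-squares slope of delay over time for (time_ms, delay_ms) samples.
/// Returns 0.0 when there are too few samples to fit a line.
fn slope(samples: &[(f64, f64)]) -> f64 {
    let n = samples.len() as f64;
    let mean_t = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_d = samples.iter().map(|s| s.1).sum::<f64>() / n;
    let numerator: f64 = samples.iter().map(|s| (s.0 - mean_t) * (s.1 - mean_d)).sum();
    let denominator: f64 = samples.iter().map(|s| (s.0 - mean_t).powi(2)).sum();
    if denominator == 0.0 { 0.0 } else { numerator / denominator }
}

/// Classifies the trend using a fixed threshold (a hypothetical tuning value).
fn classify(samples: &[(f64, f64)], threshold: f64) -> DelayDirection {
    let s = slope(samples);
    if s > threshold {
        DelayDirection::Increasing
    } else if s < -threshold {
        DelayDirection::Decreasing
    } else {
        DelayDirection::Steady
    }
}

fn main() {
    let rising = [(0.0, 10.0), (100.0, 20.0), (200.0, 30.0)];
    let flat = [(0.0, 10.0), (100.0, 10.0), (200.0, 10.0)];
    println!("{:?}", classify(&rising, 0.01));
    println!("{:?}", classify(&flat, 0.01));
}
```

Fitting a line over several samples, rather than comparing only the last two, makes the decision less sensitive to jitter on any single packet.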

Rate Allocation

Once the SFU knows how much to send, it must determine what to send (which layers to forward). This process, which we call rate allocation, is like the SFU choosing from a menu of layers constrained by a send rate budget. For example, if each participant is sending 2 layers and there are 3 other participants, there would be 6 total layers on the menu.

If the budget is big enough, we can send everything we want (up to the largest layer for each participant). But if not, we must prioritize. To aid in prioritization, each participant tells the server what resolutions it needs by requesting a maximum resolution. Using that information, we use the following rules for rate allocation:

  1. Layers larger than the requested maximum are excluded. For example, there’s no need to send you high resolutions of every video if you’re only viewing a grid of small videos.
  2. Smaller layers are prioritized over larger layers. For example, it is better to view everyone in low resolution than to view some in high resolution and others not at all.
  3. Larger requested resolutions are prioritized before smaller requested resolutions. For example, once you can see everyone, the video that appears largest to you will fill in with higher quality before the others.

A simplified version of the code looks like the following.

// The input: a menu of video options.
// Each has a set of layers to choose from and a requested maximum resolution.
let mut videos = ...;

// The output: for each video above, which layer to forward, if any
let mut allocated_by_id = HashMap::new();
let mut allocated_rate = 0;

// Biggest first
videos.sort_by_key(|video| Reverse(video.requested_height));

// Lowest layers for each before the higher layer for any
for layer_index in 0..=2 {
  for video in &videos {
    if video.requested_height > 0 {
      // The first layer which is "big enough", or the biggest layer if none are.
      let requested_layer_index = video.layers.iter().position(
         |layer| layer.height >= video.requested_height).unwrap_or(video.layers.len()-1);
      if layer_index <= requested_layer_index {
        let layer = &video.layers[layer_index];
        let (_, allocated_layer_rate) = allocated_by_id.get(&video.id).copied().unwrap_or_default();
        let increased_rate = allocated_rate + layer.rate - allocated_layer_rate;
        if increased_rate < target_send_rate {
          allocated_by_id.insert(video.id, (layer_index, layer.rate));
          allocated_rate = increased_rate;
        }
      }
    }
  }
}
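For readers who want to experiment with these rules, here is a self-contained sketch of the same loop. The `Layer` and `Video` types, the bits-per-second units, and the sample numbers are assumptions for illustration; layers are assumed to be ordered from smallest to largest.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

struct Layer {
    height: u32,
    rate: u64, // bits per second
}

struct Video {
    id: u32,
    requested_height: u32, // 0 means the video is not wanted at all
    layers: Vec<Layer>,    // ordered from smallest to largest
}

/// Returns, per video id, the (layer index, layer rate) chosen within the budget.
fn allocate(videos: &mut [Video], target_send_rate: u64) -> HashMap<u32, (usize, u64)> {
    let mut allocated_by_id: HashMap<u32, (usize, u64)> = HashMap::new();
    let mut allocated_rate = 0u64;
    // Largest requested resolution first.
    videos.sort_by_key(|video| Reverse(video.requested_height));
    let max_layers = videos.iter().map(|v| v.layers.len()).max().unwrap_or(0);
    // Lowest layer for every video before a higher layer for any video.
    for layer_index in 0..max_layers {
        for video in videos.iter() {
            if video.requested_height == 0 || layer_index >= video.layers.len() {
                continue;
            }
            // The first layer that is big enough, or the biggest if none are.
            let requested_layer_index = video
                .layers
                .iter()
                .position(|layer| layer.height >= video.requested_height)
                .unwrap_or(video.layers.len() - 1);
            if layer_index > requested_layer_index {
                continue;
            }
            let layer = &video.layers[layer_index];
            let previous_rate = allocated_by_id.get(&video.id).map_or(0, |&(_, rate)| rate);
            let increased_rate = allocated_rate + layer.rate - previous_rate;
            if increased_rate < target_send_rate {
                allocated_by_id.insert(video.id, (layer_index, layer.rate));
                allocated_rate = increased_rate;
            }
        }
    }
    allocated_by_id
}

/// Three videos with two layers each; video 3 is not requested.
fn example_videos() -> Vec<Video> {
    [(1, 360), (2, 180), (3, 0)]
        .into_iter()
        .map(|(id, requested_height)| Video {
            id,
            requested_height,
            layers: vec![
                Layer { height: 180, rate: 100_000 },
                Layer { height: 360, rate: 400_000 },
            ],
        })
        .collect()
}

fn main() {
    let mut videos = example_videos();
    println!("{:?}", allocate(&mut videos, 600_000));
}
```

With a 600,000 bps budget, video 1 is upgraded to its larger layer and video 2 stays on the small layer it asked for. With a 450,000 bps budget, both stay on the small layer because the upgrade no longer fits.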

Putting it all together

By combining these thr
