How WebRTC Works

WebRTC (Web Real-Time Communication) enables peer-to-peer audio, video, and arbitrary data transfer directly between web browsers, with no plugins or downloads. It is the protocol stack behind Google Meet, Discord voice chat, Facebook Messenger video calls, and countless other applications. But beneath the simple getUserMedia() and RTCPeerConnection APIs lies a remarkably complex system of NAT traversal, encryption, codec negotiation, and congestion control — all happening in real time.

This article explains the full WebRTC stack: how two browsers discover each other, punch through NATs and firewalls, negotiate codecs, encrypt everything end-to-end, and adapt to changing network conditions — all typically within a second of setup.

Why Peer-to-Peer in Browsers?

Traditional web communication follows a client-server model. Your browser sends an HTTP request to a server, the server responds, and the connection is done. Even WebSockets, which keep a connection open, still route everything through a server. This works well for web pages, APIs, and chat messages, but it adds unacceptable latency for real-time media. A voice call routed through a server on another continent adds hundreds of milliseconds of round-trip delay, making natural conversation impossible.

WebRTC solves this by establishing direct connections between browsers. Once the connection is set up, audio and video packets flow directly from one user's browser to another — no server in the middle. This minimizes latency, reduces server bandwidth costs, and keeps media data off third-party infrastructure.

But there is a fundamental problem: browsers do not have public IP addresses. Most users sit behind NAT devices that hide their real address. Two browsers behind two different NATs cannot simply open a socket to each other. The majority of WebRTC's complexity exists to solve this single problem.

The Signaling Channel

WebRTC deliberately does not define how two peers find each other. The specification covers everything after initial contact — the media stack, encryption, NAT traversal — but the mechanism by which two browsers exchange their initial connection parameters is left to the application developer. This is called signaling.

In practice, signaling is almost always done via WebSockets. Both browsers connect to a signaling server, which relays messages between them. These messages contain two types of information: SDP (Session Description Protocol) offers and answers describing media capabilities, and ICE candidates describing network addresses the peer can be reached at.

The signaling server is not in the media path. Once the WebRTC connection is established, the signaling server could disappear and the call would continue. It is only needed to bootstrap the connection and, optionally, to renegotiate if conditions change.

SDP: The Offer/Answer Model

Before two peers can exchange media, they need to agree on formats, codecs, encryption parameters, and transport details. This negotiation uses the Session Description Protocol (SDP), originally defined for describing multimedia sessions (RFC 4566, since revised as RFC 8866) and adapted for WebRTC by the JSEP specification (RFC 8829).

The flow follows an offer/answer model:

  1. Peer A creates an offer — This SDP blob describes everything Peer A supports: which audio codecs (Opus, G.711), which video codecs (VP8, VP9, H.264, AV1), supported RTP extensions, DTLS fingerprint, ICE credentials, and more.
  2. Peer A sends the offer to Peer B via the signaling channel.
  3. Peer B creates an answer — Peer B examines the offer, selects the codecs and parameters it also supports, and generates an SDP answer.
  4. Peer B sends the answer back via the signaling channel.

An SDP message is a text blob of key-value lines. A simplified example:

v=0
o=- 4625943528 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=ice-ufrag:aB3d
a=ice-pwd:xYz9kL2mN4pQ7rS1tU5vW8
a=fingerprint:sha-256 AB:CD:EF:01:23:...
m=video 9 UDP/TLS/RTP/SAVPF 96 97 98
a=rtpmap:96 VP8/90000
a=rtpmap:97 VP9/90000
a=rtpmap:98 H264/90000

The m= lines define media sections (audio, video, data). Each section lists supported payload types, and a=rtpmap lines map payload numbers to codec names. The a=ice-ufrag and a=ice-pwd lines provide credentials for ICE connectivity checks, and the a=fingerprint line carries the DTLS certificate fingerprint used to verify the encrypted connection.
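Extracting the rtpmap lines is straightforward. As a rough illustration (parse_rtpmap is a hypothetical helper, not a real library function; full SDP parsing per RFC 8866 handles far more structure):

```python
# Minimal sketch: map RTP payload type numbers to codec strings
# from the a=rtpmap lines of an SDP blob. Illustrative only.

def parse_rtpmap(sdp: str) -> dict[int, str]:
    """Return {payload_type: 'codec/clockrate[/channels]'}."""
    codecs = {}
    for line in sdp.splitlines():
        if line.startswith("a=rtpmap:"):
            payload, codec = line[len("a=rtpmap:"):].split(" ", 1)
            codecs[int(payload)] = codec.strip()
    return codecs
```

Run against the example above, this yields entries such as 111 -> opus/48000/2 and 96 -> VP8/90000.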

ICE: Finding a Path Through the NAT

The Interactive Connectivity Establishment (ICE) framework (RFC 8445) is WebRTC's solution to NAT traversal. Its job is to find a working network path between two peers, even when both are behind NATs, firewalls, or restrictive corporate networks.

ICE works by gathering candidates — potential network addresses that a peer could be reached at — and then systematically testing pairs of candidates to find which ones can actually exchange packets.

Candidate Types

ICE defines three types of candidates, in order of preference:

  1. Host candidates — addresses taken directly from the peer's local network interfaces (e.g., 192.168.1.42:54321). These work only when both peers are on the same network or have public addresses.
  2. Server-reflexive (srflx) candidates — the public address:port that the peer's NAT presents to the outside world (e.g., 203.0.113.5:30001), discovered by querying a STUN server.
  3. Relay candidates — addresses allocated on a TURN server (e.g., 198.51.100.10:3478), which forwards traffic when no direct path works.
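This preference ordering is encoded numerically. RFC 8445 §5.1.2 gives the formula; a direct translation, using the RFC's recommended type-preference values:

```python
# ICE candidate priority per RFC 8445 section 5.1.2:
# priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type: str, local_pref: int = 65535,
                       component: int = 1) -> int:
    """Higher value = preferred. Component 1 is RTP, 2 is RTCP."""
    return ((2**24) * TYPE_PREFERENCE[cand_type]
            + (2**8) * local_pref
            + (256 - component))
```

With these weights, any host candidate outranks any server-reflexive candidate, which in turn outranks any relay candidate.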

The ICE Process

Once both peers have gathered their candidates, ICE pairs them up and tests connectivity:

  1. Candidate gathering — Each peer collects host candidates from local interfaces, queries STUN servers for server-reflexive candidates, and allocates TURN relay candidates.
  2. Candidate exchange — Candidates are sent to the remote peer via the signaling channel (as a=candidate lines in SDP, or via trickle ICE).
  3. Connectivity checks — ICE forms candidate pairs (one local, one remote) and sends STUN Binding Requests on each pair. A pair is valid if both peers receive the other's check and respond.
  4. Candidate pair selection — The controlling agent (typically the offerer) nominates the best working pair based on priority. Priority favors host > srflx > relay.

Trickle ICE is an optimization where candidates are sent to the remote peer as they are discovered, rather than waiting for all gathering to complete. This significantly reduces connection setup time because connectivity checks can begin while TURN allocation (the slowest step) is still in progress.
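The order in which candidate pairs are checked also comes from a formula. RFC 8445 §6.1.2.3 combines the controlling agent's candidate priority G with the controlled agent's priority D:

```python
# Candidate pair priority per RFC 8445 section 6.1.2.3:
# 2^32 * MIN(G, D) + 2 * MAX(G, D) + (1 if G > D else 0)
def pair_priority(g: int, d: int) -> int:
    """g: controlling agent's candidate priority, d: controlled agent's."""
    return (2**32) * min(g, d) + 2 * max(g, d) + (1 if g > d else 0)
```

Weighting MIN by 2^32 means the weaker side of the pair dominates: a host-to-relay pair sorts below a srflx-to-srflx pair, because a pair is only as good as its worst candidate.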

STUN: Discovering Your Public Address

A STUN (Session Traversal Utilities for NAT) server has a simple but essential job: tell a client what its public IP address and port are. When your browser is behind a NAT, it knows its local address (e.g., 192.168.1.42) but has no idea what address the outside world sees. STUN provides this information.

The STUN protocol (RFC 8489) works as follows:

  1. The client sends a Binding Request to the STUN server (typically on UDP port 3478).
  2. The STUN server examines the source IP and port of the incoming packet — this is the address the NAT has assigned.
  3. The STUN server sends a Binding Response containing this observed address in an XOR-MAPPED-ADDRESS attribute.

The client now knows its server-reflexive address and can share it as a candidate. STUN servers are lightweight and stateless — Google runs public STUN servers at stun.l.google.com:19302 that handle millions of requests. STUN is also used during ICE connectivity checks: the STUN Binding Request/Response exchange verifies that two candidates can exchange packets and simultaneously measures round-trip time.
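The XOR in XOR-MAPPED-ADDRESS exists because some NATs rewrite any address they spot inside packet payloads; XOR-ing with the STUN magic cookie hides it from such inspection. A sketch of decoding the IPv4 attribute value (decode_xor_mapped_address is illustrative; real STUN parsing also handles IPv6 and attribute framing):

```python
import struct

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value from RFC 8489

def decode_xor_mapped_address(attr: bytes) -> tuple[str, int]:
    """Decode an IPv4 XOR-MAPPED-ADDRESS attribute value (RFC 8489 sec 14.2)."""
    _, family, xport = struct.unpack("!BBH", attr[:4])
    assert family == 0x01, "this sketch handles IPv4 only"
    port = xport ^ (STUN_MAGIC_COOKIE >> 16)          # un-XOR the port
    xaddr = struct.unpack("!I", attr[4:8])[0]
    addr = xaddr ^ STUN_MAGIC_COOKIE                  # un-XOR the address
    ip = ".".join(str((addr >> s) & 0xFF) for s in (24, 16, 8, 0))
    return ip, port
```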

STUN works for most NAT types but fails with symmetric NATs, which assign a different external port for every unique destination. The server-reflexive address discovered via STUN will be different from the address seen by the remote peer, so the reflexive candidate is useless. In this case, TURN is the fallback.

TURN: Relay When Direct Fails

TURN (Traversal Using Relays around NAT, RFC 8656) provides a relay server that forwards packets between peers when direct connectivity is impossible. The client allocates a relay address on the TURN server, and all media packets are forwarded through it.

When both peers sit behind symmetric NATs, the direct path is blocked entirely and media flows Peer A -> NAT A -> TURN relay -> NAT B -> Peer B, at the cost of roughly 20-80ms of extra latency per hop.

TURN is the only option that works in every network configuration, including symmetric NATs and strict corporate firewalls that block all UDP. TURN can operate over UDP, TCP, or even TLS-over-TCP (port 443) to bypass firewalls that inspect traffic. However, TURN is expensive: all media traffic passes through the relay server, consuming bandwidth and adding latency. In practice, about 86% of WebRTC connections succeed with direct connectivity (host or server-reflexive candidates), with only 14% requiring TURN — predominantly from corporate and institutional networks.

NAT Traversal in Detail

Understanding why NAT traversal is hard requires understanding NAT behavior. A NAT creates a mapping between an internal address:port and an external address:port. The critical question is how restrictive the NAT is about which external hosts can send packets to that mapped port:

  1. Full-cone NAT — once a mapping exists, any external host can send packets to the mapped port.
  2. Address-restricted cone NAT — only hosts the internal peer has previously sent to can send back, from any port.
  3. Port-restricted cone NAT — only the exact address:port the internal peer has sent to can send back.
  4. Symmetric NAT — a different external mapping is created for every destination, so the mapping a STUN server observes is useless for reaching any other host.

ICE's connectivity checks handle all of these except symmetric NATs. The simultaneous STUN binding requests from both sides effectively "punch holes" in the NATs — each side sends a packet to the other's reflexive address, creating the necessary mapping for return traffic. This is why ICE sends checks from both peers: the check from A opens A's NAT for B's response, and vice versa.

DTLS: Encrypting the Connection

Once ICE establishes a working path, WebRTC uses DTLS (Datagram Transport Layer Security, RFC 9147) to encrypt the connection. DTLS is essentially TLS adapted for UDP — it provides the same authentication and encryption but handles packet loss and reordering that are normal on UDP.

The DTLS handshake runs over the ICE-established path. Both peers exchange certificates and negotiate an encryption key. The certificate fingerprint is verified against the fingerprint in the SDP — this is how WebRTC ensures you are talking to the peer you negotiated with, not a man-in-the-middle. The DTLS handshake produces a shared SRTP master key that is used to encrypt all media.

This is a critical design decision: the encryption keys are derived from a handshake between the two peers, not provisioned by a server. Even if the signaling server is compromised, an attacker cannot decrypt the media without intercepting the DTLS handshake itself. However, if the signaling server substitutes its own SDP fingerprint during the offer/answer exchange, it could mount a man-in-the-middle attack — which is why some applications verify fingerprints out-of-band.
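The fingerprint in the SDP's a=fingerprint line is simply a hash of the DER-encoded certificate, formatted as colon-separated uppercase hex. A sketch of that formatting (dtls_fingerprint is a hypothetical helper; the input here stands in for real certificate bytes):

```python
import hashlib

def dtls_fingerprint(cert_der: bytes) -> str:
    """Format a certificate digest the way an SDP a=fingerprint line does:
    SHA-256 over the DER bytes, as colon-separated uppercase hex pairs."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

During the handshake, each side hashes the certificate the peer actually presented and compares it to the fingerprint received over signaling; any mismatch aborts the connection.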

SRTP: Encrypted Media Transport

Audio and video packets are sent using SRTP (Secure Real-time Transport Protocol, RFC 3711). RTP is the standard protocol for carrying real-time media — it adds timestamps, sequence numbers, and payload type identifiers that receivers need to play back media smoothly. SRTP adds encryption and message authentication on top of RTP using the keys derived from the DTLS handshake.

Each SRTP packet contains:

  - The RTP header in the clear — version, payload type, sequence number, timestamp, and SSRC (stream identifier) — so that receivers and SFUs can make routing and playback decisions without decrypting anything.
  - The encrypted media payload.
  - An authentication tag covering header and payload, so tampered packets are rejected.
WebRTC also uses RTCP (RTP Control Protocol) for reporting reception statistics, packet loss rates, and round-trip times. These reports drive the congestion control algorithms that adapt quality to network conditions. RTCP is also encrypted as SRTCP.
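The RTP header format is fixed and worth knowing, since it stays unencrypted in SRTP. A minimal parser for the 12-byte fixed header (ignoring extensions and CSRC lists for simplicity):

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550 sec 5.1).
    In SRTP this header stays in the clear; only the payload is encrypted."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,            # always 2
        "marker": (b1 >> 7) & 1,       # e.g., end of a video frame
        "payload_type": b1 & 0x7F,     # maps to a codec via SDP rtpmap
        "sequence": seq,               # for loss detection and reordering
        "timestamp": ts,               # media clock, codec-dependent rate
        "ssrc": ssrc,                  # identifies the stream
    }
```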

Codec Negotiation

WebRTC supports multiple audio and video codecs. During SDP negotiation, peers agree on which codecs to use based on mutual support, quality requirements, and hardware capabilities.

Audio Codecs

  - Opus — the primary WebRTC audio codec, mandatory to implement (RFC 7874). Variable bitrate from roughly 6 to 510 kbps at up to 48 kHz, with built-in FEC; used for essentially all calls.
  - G.711 (PCMU/PCMA) — also mandatory to implement, mainly for interoperability with legacy telephony. Fixed 64 kbps, narrowband.

Video Codecs

  - VP8 and H.264 — both mandatory to implement (RFC 7742); H.264 benefits from widespread hardware encoding support.
  - VP9 — better compression than VP8 plus native SVC support, making it popular in SFU deployments.
  - AV1 — the newest option, with the best compression efficiency but the highest encoding cost.

The SDP offer lists codecs in preference order. The answerer selects from the intersection of supported codecs. Codec selection can change during a call through renegotiation — generating a new offer/answer exchange via the signaling channel.
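At its core, the answerer's selection is an ordered intersection. A sketch (negotiate_codecs is illustrative; real implementations also match fmtp parameters, profiles, and clock rates):

```python
def negotiate_codecs(offered: list[str], supported: set[str]) -> list[str]:
    """Answer-side codec selection sketch: keep the offerer's preference
    order, dropping anything the answerer does not support."""
    return [codec for codec in offered if codec in supported]
```

For example, if the offer prefers AV1 but the answerer only supports VP8 and H264, the negotiated list is ["VP8", "H264"] and the first entry is typically used.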

Data Channels: Arbitrary Data over WebRTC

WebRTC is not just for audio and video. Data channels provide a general-purpose, bidirectional data transport between peers. You can send text, files, game state, screen-sharing coordinates — any arbitrary data — with configurable reliability and ordering.

Data channels are built on SCTP (Stream Control Transmission Protocol) running over DTLS. SCTP provides:

  - Multiple independent streams over one connection, so one slow channel does not block another.
  - Message-oriented delivery — you send and receive whole messages, not a byte stream.
  - Configurable reliability — per channel, delivery can be fully reliable or limited to a maximum number of retransmissions or a maximum lifetime.
  - Configurable ordering — messages can be delivered in order or as they arrive.
An unreliable, unordered data channel gives you UDP-like semantics — perfect for game state updates or mouse cursor positions where old data is useless. A reliable, ordered channel gives you TCP-like semantics — suitable for file transfer or chat messages. This flexibility, combined with NAT traversal and encryption that come for free with the WebRTC connection, makes data channels a powerful building block.

The protocol stack for data channels is: Application Data -> SCTP -> DTLS -> ICE -> UDP. This means data channel traffic benefits from the same NAT traversal and encryption as media traffic, sharing the same ICE candidate pair.

Simulcast and SVC

In a group video call, different participants have different bandwidth constraints and screen sizes. Sending the same high-resolution stream to someone on a mobile network and someone on fiber is wasteful. WebRTC addresses this with simulcast and SVC (Scalable Video Coding).

Simulcast means the sender encodes the same video at multiple resolutions and bitrates simultaneously — typically three layers (e.g., 180p, 360p, 720p). The SFU (Selective Forwarding Unit) then forwards the appropriate layer to each receiver based on their available bandwidth and the size of their video display.

SVC encodes multiple quality layers into a single stream. A base layer provides low-quality video, and enhancement layers add resolution, frame rate, or fidelity. An SFU can strip enhancement layers for bandwidth-constrained receivers without re-encoding. VP9 has native SVC support, making it popular for SFU deployments. AV1 also supports SVC and is expected to replace VP9 as the preferred SVC codec.

The key advantage of both approaches is that the SFU never needs to decode and re-encode video — it operates on encrypted packets and makes forwarding decisions based on RTP headers and RTCP feedback. This keeps SFU CPU usage low and latency minimal.
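Per receiver, the forwarding decision reduces to picking the best layer that fits the estimated bandwidth. A toy sketch (the layer names and bitrates here are assumptions for illustration, not standardized values):

```python
# Hypothetical simulcast layer table; real SFUs learn actual bitrates
# from the encoder configuration and RTCP feedback.
LAYERS = [("180p", 150_000), ("360p", 500_000), ("720p", 1_500_000)]

def select_layer(available_bps: int, layers=LAYERS) -> str:
    """Pick the highest layer whose bitrate fits the receiver's estimated
    bandwidth; fall back to the lowest layer if nothing fits."""
    best = layers[0][0]
    for name, bitrate in layers:  # layers sorted low to high
        if bitrate <= available_bps:
            best = name
    return best
```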

SFU vs MCU: Server Architectures

While WebRTC enables peer-to-peer connections, group calls require a server architecture. Two models exist:

SFU (Selective Forwarding Unit): the server forwards packets without transcoding. Low CPU, low latency (~50-100ms typical). Each peer uploads one stream and downloads N-1. Used by Google Meet, Zoom, and Discord; scales to hundreds of participants.

MCU (Multipoint Control Unit): the server decodes all streams, mixes them into one, and re-encodes. Very high CPU and added latency (~200-500ms typical). Each peer uploads one stream and downloads one. Found in legacy conferencing and recording pipelines; limited by transcoding capacity.

Most modern platforms use an SFU, sometimes in a hybrid arrangement.

SFU (Selective Forwarding Unit)

An SFU receives each participant's media stream and forwards it to every other participant — but it does not decode, mix, or re-encode anything. It operates on encrypted RTP packets, making forwarding decisions based on RTP headers, RTCP feedback, and simulcast/SVC layer information. This makes SFUs lightweight: a single SFU server can handle hundreds of participants.

The downside is download bandwidth: each participant receives a separate stream from every other participant. For a 10-person call, each person downloads 9 streams. SFUs mitigate this by selecting the appropriate simulcast layer per receiver and by muting video for off-screen participants. All major platforms — Google Meet, Zoom (for most calls), Discord, Microsoft Teams — use SFU architectures.

MCU (Multipoint Control Unit)

An MCU decodes all incoming streams, composites them into a single mixed stream (e.g., a grid layout), re-encodes the result, and sends one stream to each participant. This minimizes downstream bandwidth but requires enormous server-side compute. An MCU must decode and re-encode every stream in real time, introducing both latency and cost. MCUs are rarely used for live calls today but remain relevant for server-side recording and transcription.
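The bandwidth trade-offs between topologies fit in a few lines. A sketch comparing per-peer stream counts (per_peer_streams is illustrative; full mesh is included for contrast even though the text focuses on SFU and MCU):

```python
def per_peer_streams(n_participants: int, topology: str) -> tuple[int, int]:
    """Return (upload, download) stream counts per participant."""
    n = n_participants
    return {
        "mesh": (n - 1, n - 1),  # every peer connects to every other peer
        "sfu":  (1, n - 1),      # upload once; receive everyone else's stream
        "mcu":  (1, 1),          # upload once; receive one mixed stream
    }[topology]
```

For the 10-person call mentioned above, an SFU participant uploads 1 stream and downloads 9, while an MCU participant uploads 1 and downloads 1 at the cost of server-side transcoding.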

Insertable Streams and End-to-End Encryption

Standard WebRTC encrypts media between each peer and the server (hop-by-hop). In an SFU topology, the SFU could theoretically inspect the media. Insertable Streams (also called Encoded Transform) allow applications to apply a custom transform to encoded frames before they are encrypted with SRTP.

This enables true end-to-end encryption (E2EE) for group calls: the application encrypts each frame with a key shared only among the call participants. The SFU forwards the doubly-encrypted packets without being able to read the inner encryption layer. The protocol stack becomes:

Raw frame -> Codec encode -> E2EE encrypt (app key) -> SRTP encrypt (DTLS key) -> Network
Network -> SRTP decrypt -> E2EE decrypt (app key) -> Codec decode -> Render

The SFU can still strip SRTP, read RTP headers for routing decisions, and select simulcast layers — but the encoded media payload is opaque. Signal, Google Meet, and Zoom all offer E2EE modes using this mechanism. The main challenge is key management: distributing and rotating the E2EE key among participants requires a secure side channel, and adding or removing participants requires a key rotation to maintain forward secrecy.

Performance Tuning

Real-time media is unforgiving. Unlike file downloads or web pages where a brief stall is acceptable, a 200ms delay in a voice call is noticeable and a 500ms delay makes conversation difficult. WebRTC includes several mechanisms to maintain quality under adverse conditions.

Bandwidth Estimation

WebRTC uses GCC (Google Congestion Control) to estimate available bandwidth. GCC monitors packet arrival times at the receiver and detects congestion by looking for increasing inter-packet delays. When delay increases, the estimated bandwidth is reduced; when delay is stable, bandwidth is gradually probed upward. The sender adjusts the video encoder bitrate in real time based on these estimates.

This is fundamentally different from TCP's congestion control, which reacts to packet loss. TCP's loss-based approach is too aggressive for real-time media — by the time packets are being dropped, the user is already experiencing degraded quality. GCC's delay-based approach detects congestion earlier, before the buffer overflows and packets are lost.
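The real GCC algorithm filters delay gradients with a Kalman-like estimator; its core reaction can be caricatured in a few lines (all constants here are made up for illustration and are not GCC's actual parameters):

```python
def adjust_bitrate(current_bps: float, delay_gradient_ms: float,
                   threshold_ms: float = 1.0) -> float:
    """Toy delay-based controller in the spirit of GCC: back off
    multiplicatively when queuing delay grows, probe gently when stable."""
    if delay_gradient_ms > threshold_ms:  # inter-packet delays increasing:
        return current_bps * 0.85         # queues are building, so back off
    return current_bps * 1.05             # network stable: probe for more
```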

Jitter Buffers

Packets arrive at irregular intervals due to network jitter. A jitter buffer collects incoming packets and plays them out at a steady rate, absorbing timing variations. The buffer must be large enough to smooth jitter but small enough to minimize latency. WebRTC uses adaptive jitter buffers that grow when jitter is high and shrink when the network is stable. Typical jitter buffer sizes range from 20ms to 200ms.
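A toy sizing rule in that spirit (the mean-plus-three-sigma heuristic is an assumption for illustration, not the actual algorithm in WebRTC's NetEQ):

```python
import statistics

def jitter_buffer_target_ms(interarrival_ms: list[float],
                            min_ms: float = 20.0,
                            max_ms: float = 200.0) -> float:
    """Size the buffer from observed inter-arrival variation: mean plus
    headroom for jitter, clamped to the 20-200ms range mentioned above."""
    jitter = statistics.pstdev(interarrival_ms)
    target = statistics.mean(interarrival_ms) + 3 * jitter
    return max(min_ms, min(max_ms, target))
```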

Forward Error Correction (FEC)

Rather than waiting for retransmission (which adds latency), WebRTC can send redundant data using FEC. If a packet is lost, the receiver can reconstruct it from the FEC data without requesting a retransmission. Opus has built-in FEC (useinbandfec=1 in the SDP), and video FEC is available via FlexFEC (RFC 8627). FEC trades bandwidth for reliability — the sender uses extra bandwidth to send redundant data, but the receiver experiences fewer gaps.
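The simplest FEC scheme is a single XOR parity packet over a group of media packets, which is the core idea behind FlexFEC's simplest protection mask. A sketch assuming equal-length packets:

```python
def fec_parity(packets: list[bytes]) -> bytes:
    """XOR parity over equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost(received: list[bytes], parity: bytes) -> bytes:
    """Reconstruct exactly one missing packet from the survivors plus parity:
    XOR-ing everything that arrived cancels out, leaving the lost packet."""
    return fec_parity(received + [parity])
```

One parity packet can repair any single loss within its group; losing two packets from the same group defeats it, which is where NACK takes over.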

NACK and Retransmission

For video keyframes and other critical data, WebRTC supports NACK (Negative Acknowledgment) — the receiver detects a missing packet by its sequence number and requests retransmission. This only works when the round-trip time is short enough that the retransmitted packet arrives before its playout deadline. NACK and FEC are complementary: FEC handles isolated losses instantly, while NACK handles burst losses that overwhelm FEC.
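Gap detection is straightforward given RTP sequence numbers. A sketch that ignores 16-bit sequence wraparound for simplicity:

```python
def missing_sequences(received: list[int]) -> list[int]:
    """Return the sequence numbers a receiver would NACK: every gap
    between the lowest and highest sequence number seen so far."""
    if not received:
        return []
    seen = set(received)
    return [s for s in range(min(received), max(received) + 1) if s not in seen]
```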

PLI and FIR

When a video decoder loses its reference frame (due to heavy packet loss), it cannot decode subsequent frames. The receiver sends a PLI (Picture Loss Indication) or FIR (Full Intra Request) to request that the sender generate a new keyframe. This causes a momentary bitrate spike but recovers the video stream. Minimizing PLI frequency is important for bandwidth stability.

The Complete WebRTC Connection Flow

Putting it all together, establishing a WebRTC connection involves these steps, most of which happen in parallel:

  1. getUserMedia() — The browser requests camera and microphone access from the user.
  2. RTCPeerConnection created — The application creates a peer connection object configured with ICE servers (STUN and TURN URLs).
  3. Offer created — The offerer calls createOffer(), generating an SDP describing its media capabilities.
  4. ICE gathering begins — The browser starts gathering host, server-reflexive, and relay candidates in parallel.
  5. Offer sent via signaling — The SDP offer and initial ICE candidates are sent to the remote peer via the application's signaling channel (typically WebSockets).
  6. Answer created — The answerer calls createAnswer() with the received offer, producing an SDP answer.
  7. ICE connectivity checks — Both sides test candidate pairs with STUN binding requests. The first working pair is used.
  8. DTLS handshake — Certificates are exchanged, fingerprints verified against SDP, and SRTP keys are derived.
  9. Media flows — Encoded audio and video are encrypted with SRTP and sent over the selected candidate pair.
  10. Ongoing adaptation — GCC adjusts bitrate, jitter buffers adapt, FEC and NACK handle packet loss, simulcast layers are switched as bandwidth changes.

This entire process typically completes in under one second on a good network — from calling createOffer() to hearing the remote peer's audio.

WebRTC and the Network

From a network perspective, WebRTC traffic appears as UDP packets (or TCP/TLS if UDP is blocked and TURN-over-TCP is used). The packets carry DTLS during the handshake and SRTP during the media phase. Because WebRTC uses UDP, the traffic does not appear in TCP connection tracking — firewalls that only allow established TCP connections will block WebRTC unless TURN-over-TCP is available.

WebRTC connections traverse the same BGP routes and autonomous systems as any other internet traffic. The latency of a WebRTC call is bounded by the physical distance between peers (or between each peer and the TURN server). You can explore the BGP paths between networks to understand why a call between two specific locations might have high latency — the traffic may be traversing multiple exchange points and transit providers.
