How WebRTC Works
WebRTC (Web Real-Time Communication) enables peer-to-peer audio, video, and arbitrary data transfer directly between web browsers, with no plugins or downloads. It is the protocol stack behind Google Meet, Discord voice chat, Facebook Messenger video calls, and countless other applications. But beneath the simple getUserMedia() and RTCPeerConnection APIs lies a remarkably complex system of NAT traversal, encryption, codec negotiation, and congestion control — all happening in real time.
This article explains the full WebRTC stack: how two browsers discover each other, punch through NATs and firewalls, negotiate codecs, encrypt everything end-to-end, and adapt to changing network conditions — all within milliseconds.
Why Peer-to-Peer in Browsers?
Traditional web communication follows a client-server model. Your browser sends an HTTP request to a server, the server responds, and the exchange is complete. Even WebSockets, which maintain a persistent connection, still route everything through a server. This works well for web pages, APIs, and chat messages, but it adds unacceptable latency for real-time media. A voice call routed through a server on another continent adds hundreds of milliseconds of round-trip delay, making natural conversation impossible.
WebRTC solves this by establishing direct connections between browsers. Once the connection is set up, audio and video packets flow directly from one user's browser to another — no server in the middle. This minimizes latency, reduces server bandwidth costs, and keeps media data off third-party infrastructure.
But there is a fundamental problem: browsers do not have public IP addresses. Most users sit behind NAT devices that hide their real address. Two browsers behind two different NATs cannot simply open a socket to each other. The majority of WebRTC's complexity exists to solve this single problem.
The Signaling Channel
WebRTC deliberately does not define how two peers find each other. The specification covers everything after initial contact — the media stack, encryption, NAT traversal — but the mechanism by which two browsers exchange their initial connection parameters is left to the application developer. This is called signaling.
In practice, signaling is almost always done via WebSockets. Both browsers connect to a signaling server, which relays messages between them. These messages contain two types of information: SDP (Session Description Protocol) offers and answers describing media capabilities, and ICE candidates describing network addresses the peer can be reached at.
The signaling server is not in the media path. Once the WebRTC connection is established, the signaling server could disappear and the call would continue. It is only needed to bootstrap the connection and, optionally, to renegotiate if conditions change.
SDP: The Offer/Answer Model
Before two peers can exchange media, they need to agree on formats, codecs, encryption parameters, and transport details. This negotiation uses the Session Description Protocol (SDP), originally defined in RFC 4566 (and long used in SIP telephony), adapted for WebRTC by the JSEP specification (RFC 8829).
The flow follows an offer/answer model:
- Peer A creates an offer — This SDP blob describes everything Peer A supports: which audio codecs (Opus, G.711), which video codecs (VP8, VP9, H.264, AV1), supported RTP extensions, DTLS fingerprint, ICE credentials, and more.
- Peer A sends the offer to Peer B via the signaling channel.
- Peer B creates an answer — Peer B examines the offer, selects the codecs and parameters it also supports, and generates an SDP answer.
- Peer B sends the answer back via the signaling channel.
An SDP message is a text blob of key-value lines. A simplified example:
v=0
o=- 4625943528 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=ice-ufrag:aB3d
a=ice-pwd:xYz9kL2mN4pQ7rS1tU5vW8
a=fingerprint:sha-256 AB:CD:EF:01:23:...
m=video 9 UDP/TLS/RTP/SAVPF 96 97 98
a=rtpmap:96 VP8/90000
a=rtpmap:97 VP9/90000
a=rtpmap:98 H264/90000
The m= lines define media sections (audio, video, data). Each section lists supported payload types, and a=rtpmap lines map payload numbers to codec names. The a=ice-ufrag and a=ice-pwd lines provide credentials for ICE connectivity checks, and the a=fingerprint line carries the DTLS certificate fingerprint used to verify the encrypted connection.
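To make this concrete, here is a minimal sketch of extracting the payload-type-to-codec mapping from the a=rtpmap lines. Real SDP parsers must handle many more attribute types, so this is illustrative only:

```javascript
// Minimal sketch: extract payload-type -> codec mappings from an SDP blob.
// Only a=rtpmap lines are read; real parsers handle far more attributes.
function parseRtpmap(sdp) {
  const codecs = {};
  for (const line of sdp.split(/\r?\n/)) {
    // a=rtpmap:<payload> <codec>/<clock rate>[/<channels>]
    const m = line.match(/^a=rtpmap:(\d+) ([^/]+)\/(\d+)(?:\/(\d+))?/);
    if (m) {
      codecs[Number(m[1])] = {
        name: m[2],
        clockRate: Number(m[3]),
        channels: m[4] ? Number(m[4]) : 1,
      };
    }
  }
  return codecs;
}

const sdp = [
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

parseRtpmap(sdp)[111].name; // "opus"
```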
ICE: Finding a Path Through the NAT
The Interactive Connectivity Establishment (ICE) framework (RFC 8445) is WebRTC's solution to NAT traversal. Its job is to find a working network path between two peers, even when both are behind NATs, firewalls, or restrictive corporate networks.
ICE works by gathering candidates — potential network addresses that a peer could be reached at — and then systematically testing pairs of candidates to find which ones can actually exchange packets.
Candidate Types
ICE defines three types of candidates, in order of preference:
- Host candidates — The device's local IP addresses (e.g., 192.168.1.42:54321). These work when both peers are on the same local network.
- Server-reflexive candidates (srflx) — The public IP and port as seen by a STUN server. This is the address the NAT has mapped your internal address to. These work when the NAT allows incoming packets from any address on the mapped port.
- Relay candidates — An address on a TURN server that will relay packets. This always works but adds latency and server cost. It is the fallback when direct connectivity fails.
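This order of preference is encoded numerically. A sketch of the candidate priority formula from RFC 8445 (section 5.1.2.1), using the recommended type-preference values:

```javascript
// Candidate priority per RFC 8445, section 5.1.2.1:
//   priority = 2^24 * typePref + 2^8 * localPref + (256 - componentId)
// Recommended type preferences: host 126, prflx 110, srflx 100, relay 0.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPref = 65535, componentId = 1) {
  return (
    (1 << 24) * TYPE_PREFERENCE[type] +
    (1 << 8) * localPref +
    (256 - componentId)
  );
}

candidatePriority("host"); // 2130706431
```

The large multipliers guarantee that candidate type always dominates the comparison: any host candidate outranks any server-reflexive candidate, which outranks any relay candidate.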
The ICE Process
Once both peers have gathered their candidates, ICE pairs them up and tests connectivity:
- Candidate gathering — Each peer collects host candidates from local interfaces, queries STUN servers for server-reflexive candidates, and allocates TURN relay candidates.
- Candidate exchange — Candidates are sent to the remote peer via the signaling channel (as a=candidate lines in SDP, or via trickle ICE).
- Connectivity checks — ICE forms candidate pairs (one local, one remote) and sends STUN Binding Requests on each pair. A pair is valid if both peers receive the other's check and respond.
- Candidate pair selection — The controlling agent (typically the offerer) nominates the best working pair based on priority. Priority favors host > srflx > relay.
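The pair priority is itself a formula (RFC 8445, section 6.1.2.3) that combines the two candidates' individual priorities so that both agents sort pairs identically. A sketch:

```javascript
// Candidate pair priority per RFC 8445, section 6.1.2.3:
//   priority = 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0)
// G is the controlling agent's candidate priority, D the controlled agent's.
// BigInt is required because the result exceeds Number.MAX_SAFE_INTEGER.
function pairPriority(g, d) {
  const G = BigInt(g), D = BigInt(d);
  const lo = G < D ? G : D;
  const hi = G < D ? D : G;
  return (1n << 32n) * lo + 2n * hi + (G > D ? 1n : 0n);
}
```

Using MIN and MAX (rather than "local" and "remote") makes the value symmetric, so both peers compute the same ranking; the final +1 term merely breaks ties consistently in the controlling agent's favor.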
Trickle ICE is an optimization where candidates are sent to the remote peer as they are discovered, rather than waiting for all gathering to complete. This significantly reduces connection setup time because connectivity checks can begin while TURN allocation (the slowest step) is still in progress.
STUN: Discovering Your Public Address
A STUN (Session Traversal Utilities for NAT) server has a simple but essential job: tell a client what its public IP address and port are. When your browser is behind a NAT, it knows its local address (e.g., 192.168.1.42) but has no idea what address the outside world sees. STUN provides this information.
The STUN protocol (RFC 8489) works as follows:
- The client sends a Binding Request to the STUN server (typically on UDP port 3478).
- The STUN server examines the source IP and port of the incoming packet — this is the address the NAT has assigned.
- The STUN server sends a Binding Response containing this observed address in an XOR-MAPPED-ADDRESS attribute.
The client now knows its server-reflexive address and can share it as a candidate. STUN servers are lightweight and stateless — Google runs public STUN servers at stun.l.google.com:19302 that handle millions of requests. STUN is also used during ICE connectivity checks: the STUN Binding Request/Response exchange verifies that two candidates can exchange packets and simultaneously measures round-trip time.
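The XOR encoding exists because some NATs rewrite IP addresses they find in packet payloads; XORing the address with the STUN magic cookie hides it from such inspection. A sketch of decoding an IPv4 XOR-MAPPED-ADDRESS value (the sample inputs are illustrative):

```javascript
// Decode a STUN XOR-MAPPED-ADDRESS value (RFC 8489, section 14.2).
// The port is XORed with the top 16 bits of the magic cookie (0x2112);
// an IPv4 address is XORed with the full 32-bit cookie.
const MAGIC_COOKIE = 0x2112a442;

function decodeXorMappedIPv4(xPort, xAddr) {
  const port = xPort ^ (MAGIC_COOKIE >>> 16);
  const addr = (xAddr ^ MAGIC_COOKIE) >>> 0; // >>> 0 forces unsigned 32-bit
  const ip = [addr >>> 24, (addr >>> 16) & 0xff, (addr >>> 8) & 0xff, addr & 0xff].join(".");
  return { ip, port };
}

decodeXorMappedIPv4(0xf523, 0xea12d547); // { ip: "203.0.113.5", port: 54321 }
```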
STUN works for most NAT types but fails with symmetric NATs, which assign a different external port for every unique destination. The server-reflexive address discovered via STUN will be different from the address seen by the remote peer, so the reflexive candidate is useless. In this case, TURN is the fallback.
TURN: Relay When Direct Fails
TURN (Traversal Using Relays around NAT, RFC 8656) provides a relay server that forwards packets between peers when direct connectivity is impossible. The client allocates a relay address on the TURN server, and all media packets are forwarded through it.
TURN is the only option that works in every network configuration, including symmetric NATs and strict corporate firewalls that block all UDP. TURN can operate over UDP, TCP, or even TLS-over-TCP (port 443) to bypass firewalls that inspect traffic. However, TURN is expensive: all media traffic passes through the relay server, consuming bandwidth and adding latency. In practice, commonly cited measurements put direct connectivity (host or server-reflexive candidates) at roughly 85% of WebRTC connections, with the remainder requiring TURN — predominantly from corporate and institutional networks.
NAT Traversal in Detail
Understanding why NAT traversal is hard requires understanding NAT behavior. A NAT creates a mapping between an internal address:port and an external address:port. The critical question is how restrictive the NAT is about which external hosts can send packets to that mapped port:
- Full cone (endpoint-independent mapping) — Any external host can send packets to the mapped port. STUN reflexive candidates work perfectly.
- Address-restricted cone — Only hosts that the internal device has previously sent to can reply. Both peers need to send packets first, which ICE's simultaneous connectivity checks accomplish.
- Port-restricted cone — Like address-restricted, but also checks the source port. ICE still works because the connectivity checks are sent to specific address:port pairs.
- Symmetric — A different external mapping is created for each destination. STUN discovers one mapping, but the remote peer sees a different one. Only TURN works reliably.
ICE's connectivity checks handle all of these except symmetric NATs. The simultaneous STUN binding requests from both sides effectively "punch holes" in the NATs — each side sends a packet to the other's reflexive address, creating the necessary mapping for return traffic. This is why ICE sends checks from both peers: the check from A opens A's NAT for B's response, and vice versa.
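A toy model makes the symmetric-NAT failure mode concrete. The hostnames and ports below are illustrative, and real NATs track full 5-tuples, but the mapping behavior is the essential difference:

```javascript
// Toy NAT model: a cone NAT reuses one external port per internal socket,
// while a symmetric NAT allocates a fresh external port per
// (internal socket, destination) pair. That per-destination mapping is
// why the address STUN observes differs from the one the peer sees.
function makeNat(symmetric) {
  const mappings = new Map();
  let nextPort = 40000;
  return function externalPort(internalSocket, destination) {
    const key = symmetric ? `${internalSocket}->${destination}` : internalSocket;
    if (!mappings.has(key)) mappings.set(key, nextPort++);
    return mappings.get(key);
  };
}

const cone = makeNat(false);
const sym = makeNat(true);

// Cone NAT: the port STUN observes is the same port the peer will see.
cone("10.0.0.2:5000", "stun.example.org") === cone("10.0.0.2:5000", "peer.example.net"); // true

// Symmetric NAT: STUN sees one mapping, the peer another -> srflx candidate useless.
sym("10.0.0.2:5000", "stun.example.org") === sym("10.0.0.2:5000", "peer.example.net"); // false
```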
DTLS: Encrypting the Connection
Once ICE establishes a working path, WebRTC uses DTLS (Datagram Transport Layer Security; version 1.2 is defined in RFC 6347, version 1.3 in RFC 9147) to encrypt the connection. DTLS is essentially TLS adapted for UDP — it provides the same authentication and encryption but tolerates the packet loss and reordering that are normal on UDP.
The DTLS handshake runs over the ICE-established path. Both peers exchange certificates and negotiate an encryption key. The certificate fingerprint is verified against the fingerprint in the SDP — this is how WebRTC ensures you are talking to the peer you negotiated with, not a man-in-the-middle. The DTLS handshake produces a shared SRTP master key that is used to encrypt all media.
This is a critical design decision: the encryption keys are derived from a handshake between the two peers, not provisioned by a server. Even if the signaling server is compromised, an attacker cannot decrypt the media without intercepting the DTLS handshake itself. However, if the signaling server substitutes its own SDP fingerprint during the offer/answer exchange, it could mount a man-in-the-middle attack — which is why some applications verify fingerprints out-of-band.
SRTP: Encrypted Media Transport
Audio and video packets are sent using SRTP (Secure Real-time Transport Protocol, RFC 3711). RTP is the standard protocol for carrying real-time media — it adds timestamps, sequence numbers, and payload type identifiers that receivers need to play back media smoothly. SRTP adds encryption and message authentication on top of RTP using the keys derived from the DTLS handshake.
Each SRTP packet contains:
- Sequence number — For detecting packet loss and reordering
- Timestamp — For synchronizing playback timing
- SSRC — Synchronization source identifier, unique per media stream
- Encrypted payload — The actual audio or video data
- Authentication tag — HMAC to verify integrity
WebRTC also uses RTCP (RTP Control Protocol) for reporting reception statistics, packet loss rates, and round-trip times. These reports drive the congestion control algorithms that adapt quality to network conditions. RTCP is also encrypted as SRTCP.
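Note that the RTP header fields travel in the clear even in SRTP (only the payload is encrypted and the packet authenticated), which is what lets SFUs route packets without decrypting media. A minimal parser for the fixed 12-byte RTP header (RFC 3550), ignoring CSRC lists and header extensions:

```javascript
// Parse the fixed 12-byte RTP header (RFC 3550). CSRC lists and header
// extensions are ignored for brevity. In SRTP these fields are readable
// on the wire; only the payload that follows is encrypted.
function parseRtpHeader(buf) {
  return {
    version: buf[0] >> 6,                 // always 2 for RTP
    marker: (buf[1] & 0x80) !== 0,        // e.g. last packet of a video frame
    payloadType: buf[1] & 0x7f,           // maps to a codec via a=rtpmap
    sequenceNumber: (buf[2] << 8) | buf[3],
    timestamp: ((buf[4] << 24) | (buf[5] << 16) | (buf[6] << 8) | buf[7]) >>> 0,
    ssrc: ((buf[8] << 24) | (buf[9] << 16) | (buf[10] << 8) | buf[11]) >>> 0,
  };
}
```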
Codec Negotiation
WebRTC supports multiple audio and video codecs. During SDP negotiation, peers agree on which codecs to use based on mutual support, quality requirements, and hardware capabilities.
Audio Codecs
- Opus — The mandatory-to-implement codec for WebRTC. Opus handles everything from narrow-band voice (6 kbps) to full-band stereo music (510 kbps). It combines SILK (voice) and CELT (music) algorithms and adapts its bitrate dynamically. Virtually every WebRTC implementation uses Opus for audio.
- G.711 (PCMU/PCMA) — Legacy telephone codec, 64 kbps, required for interoperability with traditional telephony systems.
Video Codecs
- VP8 — The original mandatory WebRTC video codec, developed by Google. Good compression, low complexity, universally supported. Typically used at 500 kbps to 2 Mbps.
- VP9 — 30-50% better compression than VP8 at the same quality. Supports SVC (scalable video coding) layers, which is crucial for SFU architectures. Well-supported in Chrome and Firefox.
- H.264 — The most widely deployed video codec globally. Required by the WebRTC specification for interoperability. Hardware encoding/decoding support is near-universal, reducing CPU usage on mobile devices.
- AV1 — The newest option, offering ~30% better compression than VP9. AV1 is increasingly supported in browsers and is the future of WebRTC video, especially for bandwidth-constrained scenarios. However, encoding is computationally expensive, so hardware encoder support (which is appearing in newer chips) is important.
The SDP offer lists codecs in preference order. The answerer selects from the intersection of supported codecs. Codec selection can change during a call through renegotiation — generating a new offer/answer exchange via the signaling channel.
Data Channels: Arbitrary Data over WebRTC
WebRTC is not just for audio and video. Data channels provide a general-purpose, bidirectional data transport between peers. You can send text, files, game state, screen-sharing coordinates — any arbitrary data — with configurable reliability and ordering.
Data channels are built on SCTP (Stream Control Transmission Protocol) running over DTLS. SCTP provides:
- Multiple streams — Many independent data channels can run over a single SCTP association, each with its own ordering and reliability settings.
- Configurable reliability — Channels can be reliable (like TCP, retransmitting lost packets), unreliable (like UDP, dropping lost packets), or partially reliable (retransmit for up to N attempts or T milliseconds).
- Configurable ordering — Channels can be ordered (packets delivered in sequence) or unordered (packets delivered as they arrive).
An unreliable, unordered data channel gives you UDP-like semantics — perfect for game state updates or mouse cursor positions where old data is useless. A reliable, ordered channel gives you TCP-like semantics — suitable for file transfer or chat messages. This flexibility, combined with NAT traversal and encryption that come for free with the WebRTC connection, makes data channels a powerful building block.
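These semantics map directly onto the options accepted by createDataChannel(). A sketch (the kind names here are invented for illustration; the option fields ordered, maxRetransmits, and maxPacketLifeTime are the real RTCDataChannelInit fields, of which at most one of the latter two may be set):

```javascript
// Map desired channel semantics onto RTCDataChannelInit options, as passed
// to pc.createDataChannel(label, init) in the browser. The "kind" names
// are invented for this sketch.
function channelInit(kind) {
  switch (kind) {
    case "tcp-like": // reliable, ordered: file transfer, chat messages
      return { ordered: true };
    case "udp-like": // unreliable, unordered: game state, cursor positions
      return { ordered: false, maxRetransmits: 0 };
    case "partial": // retransmit only while the data is still fresh
      return { ordered: false, maxPacketLifeTime: 150 }; // milliseconds
    default:
      throw new Error(`unknown channel kind: ${kind}`);
  }
}
```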
The protocol stack for data channels is: Application Data -> SCTP -> DTLS -> ICE -> UDP. This means data channel traffic benefits from the same NAT traversal and encryption as media traffic, sharing the same ICE candidate pair.
Simulcast and SVC
In a group video call, different participants have different bandwidth constraints and screen sizes. Sending the same high-resolution stream to someone on a mobile network and someone on fiber is wasteful. WebRTC addresses this with simulcast and SVC (Scalable Video Coding).
Simulcast means the sender encodes the same video at multiple resolutions and bitrates simultaneously — typically three layers (e.g., 180p, 360p, 720p). The SFU (Selective Forwarding Unit) then forwards the appropriate layer to each receiver based on their available bandwidth and the size of their video display.
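The per-receiver forwarding decision can be sketched as picking the highest layer that fits both the receiver's bandwidth estimate and its display size. The layer bitrates below are illustrative, not taken from any particular SFU:

```javascript
// Sketch of an SFU's per-receiver simulcast layer selection. The rid
// values follow a common three-layer convention (quarter/half/full);
// the bitrates are illustrative.
const LAYERS = [
  { rid: "q", height: 180, bitrate: 150_000 },
  { rid: "h", height: 360, bitrate: 500_000 },
  { rid: "f", height: 720, bitrate: 1_500_000 },
];

function selectLayer(availableBps, displayHeight) {
  let best = LAYERS[0]; // always fall back to the lowest layer
  for (const layer of LAYERS) {
    // Don't send more bits than the link fits, and don't send resolution
    // far beyond what the receiver's display can show.
    if (layer.bitrate <= availableBps && layer.height <= displayHeight * 1.5) {
      best = layer;
    }
  }
  return best;
}
```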
SVC encodes multiple quality layers into a single stream. A base layer provides low-quality video, and enhancement layers add resolution, frame rate, or fidelity. An SFU can strip enhancement layers for bandwidth-constrained receivers without re-encoding. VP9 has native SVC support, making it popular for SFU deployments. AV1 also supports SVC and is expected to replace VP9 as the preferred SVC codec.
The key advantage of both approaches is that the SFU never needs to decode and re-encode video — it operates on encrypted packets and makes forwarding decisions based on RTP headers and RTCP feedback. This keeps SFU CPU usage low and latency minimal.
SFU vs MCU: Server Architectures
While WebRTC enables peer-to-peer connections, group calls require a server architecture. Two models exist:
SFU (Selective Forwarding Unit)
An SFU receives each participant's media stream and forwards it to every other participant — but it does not decode, mix, or re-encode anything. It operates on encrypted RTP packets, making forwarding decisions based on RTP headers, RTCP feedback, and simulcast/SVC layer information. This makes SFUs lightweight: a single SFU server can handle hundreds of participants.
The downside is download bandwidth: each participant receives a separate stream from every other participant. For a 10-person call, each person downloads 9 streams. SFUs mitigate this by selecting the appropriate simulcast layer per receiver and by muting video for off-screen participants. All major platforms — Google Meet, Zoom (for most calls), Discord, Microsoft Teams — use SFU architectures.
MCU (Multipoint Control Unit)
An MCU decodes all incoming streams, composites them into a single mixed stream (e.g., a grid layout), re-encodes the result, and sends one stream to each participant. This minimizes downstream bandwidth but requires enormous server-side compute. An MCU must decode and re-encode every stream in real time, introducing both latency and cost. MCUs are rarely used for live calls today but remain relevant for server-side recording and transcription.
Insertable Streams and End-to-End Encryption
Standard WebRTC encrypts media between each peer and the server (hop-by-hop). In an SFU topology, the SFU could theoretically inspect the media. Insertable Streams (also called Encoded Transform) allow applications to apply a custom transform to encoded frames before they are encrypted with SRTP.
This enables true end-to-end encryption (E2EE) for group calls: the application encrypts each frame with a key shared only among the call participants. The SFU forwards the doubly-encrypted packets without being able to read the inner encryption layer. The protocol stack becomes:
Raw frame -> Codec encode -> E2EE encrypt (app key) -> SRTP encrypt (DTLS key) -> Network
Network -> SRTP decrypt -> E2EE decrypt (app key) -> Codec decode -> Render
The SFU can still strip SRTP, read RTP headers for routing decisions, and select simulcast layers — but the encoded media payload is opaque. Signal, Google Meet, and Zoom all offer E2EE modes using this mechanism. The main challenge is key management: distributing and rotating the E2EE key among participants requires a secure side channel, and adding or removing participants requires a key rotation to maintain forward secrecy.
Performance Tuning
Real-time media is unforgiving. Unlike file downloads or web pages where a brief stall is acceptable, a 200ms delay in a voice call is noticeable and a 500ms delay makes conversation difficult. WebRTC includes several mechanisms to maintain quality under adverse conditions.
Bandwidth Estimation
WebRTC uses GCC (Google Congestion Control) to estimate available bandwidth. GCC monitors packet arrival times at the receiver and detects congestion by looking for increasing inter-packet delays. When delay increases, the estimated bandwidth is reduced; when delay is stable, bandwidth is gradually probed upward. The sender adjusts the video encoder bitrate in real time based on these estimates.
This is fundamentally different from TCP's congestion control, which reacts to packet loss. TCP's loss-based approach is too aggressive for real-time media — by the time packets are being dropped, the user is already experiencing degraded quality. GCC's delay-based approach detects congestion earlier, before the buffer overflows and packets are lost.
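The core delay-based signal can be sketched in a few lines. Real GCC uses a trendline (Kalman-style) filter with adaptive thresholds, so this is a simplification of the idea, not the production algorithm:

```javascript
// Sketch of GCC's delay-based congestion signal: compare the growth of
// arrival times against the growth of send times. A persistently positive
// gradient means packets are queuing somewhere, so back off before loss
// occurs. All times are in milliseconds.
function delayGradients(sendTimes, arrivalTimes) {
  const gradients = [];
  for (let i = 1; i < sendTimes.length; i++) {
    const sendDelta = sendTimes[i] - sendTimes[i - 1];
    const arrivalDelta = arrivalTimes[i] - arrivalTimes[i - 1];
    gradients.push(arrivalDelta - sendDelta); // > 0 means a queue is building
  }
  return gradients;
}

function isOverusing(gradients, thresholdMs = 1) {
  const avg = gradients.reduce((a, b) => a + b, 0) / gradients.length;
  return avg > thresholdMs;
}
```

When isOverusing() fires, the sender lowers the encoder's target bitrate; when the gradient stays near zero, it probes upward again.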
Jitter Buffers
Packets arrive at irregular intervals due to network jitter. A jitter buffer collects incoming packets and plays them out at a steady rate, absorbing timing variations. The buffer must be large enough to smooth jitter but small enough to minimize latency. WebRTC uses adaptive jitter buffers that grow when jitter is high and shrink when the network is stable. Typical jitter buffer sizes range from 20ms to 200ms.
Forward Error Correction (FEC)
Rather than waiting for retransmission (which adds latency), WebRTC can send redundant data using FEC. If a packet is lost, the receiver can reconstruct it from the FEC data without requesting a retransmission. Opus has built-in FEC (useinbandfec=1 in the SDP), and video FEC is available via FlexFEC (RFC 8627). FEC trades bandwidth for reliability — the sender uses extra bandwidth to send redundant data, but the receiver experiences fewer gaps.
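The core mechanism is XOR parity: a parity packet is the XOR of a group of media packets, so any single loss within the group is recoverable. A sketch assuming equal-length payloads (real FlexFEC handles variable lengths and adds its own headers):

```javascript
// XOR-based FEC sketch. The sender transmits parity = p1 ^ p2 ^ ... ^ pn
// alongside the media packets; XOR's self-inverse property means any one
// lost packet equals the XOR of the parity with all surviving packets.
function xorPackets(packets) {
  const parity = new Uint8Array(packets[0].length);
  for (const p of packets) {
    for (let i = 0; i < parity.length; i++) parity[i] ^= p[i];
  }
  return parity;
}

// Recover the single missing packet from the parity and the survivors.
function recover(parity, received) {
  return xorPackets([parity, ...received]);
}
```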
NACK and Retransmission
For video keyframes and other critical data, WebRTC supports NACK (Negative Acknowledgment) — the receiver detects a missing packet by its sequence number and requests retransmission. This only works when the round-trip time is short enough that the retransmitted packet arrives before its playout deadline. NACK and FEC are complementary: FEC handles isolated losses instantly, while NACK handles burst losses that overwhelm FEC.
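Detecting what to NACK is simple, with one subtlety: RTP sequence numbers are 16-bit and wrap around, so gaps must be computed modulo 2^16. A sketch:

```javascript
// Find the sequence numbers to request in a NACK, given the sequence
// numbers actually received in order of arrival. RTP sequence numbers
// are 16-bit, so all arithmetic is modulo 2^16.
function missingSequences(received) {
  const missing = [];
  for (let i = 1; i < received.length; i++) {
    // Distance from the previous packet to this one, modulo 2^16.
    const gap = (received[i] - received[i - 1]) & 0xffff;
    for (let d = 1; d < gap; d++) {
      missing.push((received[i - 1] + d) & 0xffff);
    }
  }
  return missing;
}

missingSequences([10, 11, 14]); // [12, 13]
```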
PLI and FIR
When a video decoder loses its reference frame (due to heavy packet loss), it cannot decode subsequent frames. The receiver sends a PLI (Picture Loss Indication) or FIR (Full Intra Request) to request that the sender generate a new keyframe. This causes a momentary bitrate spike but recovers the video stream. Minimizing PLI frequency is important for bandwidth stability.
The Complete WebRTC Connection Flow
Putting it all together, establishing a WebRTC connection involves these steps, most of which happen in parallel:
- getUserMedia() — The browser requests camera and microphone access from the user.
- RTCPeerConnection created — The application creates a peer connection object configured with ICE servers (STUN and TURN URLs).
- Offer created — The offerer calls createOffer(), generating an SDP describing its media capabilities.
- ICE gathering begins — The browser starts gathering host, server-reflexive, and relay candidates in parallel.
- Offer sent via signaling — The SDP offer and initial ICE candidates are sent to the remote peer via the application's signaling channel (typically WebSockets).
- Answer created — The answerer applies the received offer with setRemoteDescription(), then calls createAnswer(), producing an SDP answer.
- ICE connectivity checks — Both sides test candidate pairs with STUN binding requests. The first working pair is used.
- DTLS handshake — Certificates are exchanged, fingerprints verified against SDP, and SRTP keys are derived.
- Media flows — Encoded audio and video are encrypted with SRTP and sent over the selected candidate pair.
- Ongoing adaptation — GCC adjusts bitrate, jitter buffers adapt, FEC and NACK handle packet loss, simulcast layers are switched as bandwidth changes.
This entire process typically completes in under one second on a good network — from calling createOffer() to hearing the remote peer's audio.
WebRTC and the Network
From a network perspective, WebRTC traffic appears as UDP packets (or TCP/TLS if UDP is blocked and TURN-over-TCP is used). The packets carry DTLS during the handshake and SRTP during the media phase. Because WebRTC uses UDP, the traffic does not appear in TCP connection tracking — firewalls that only allow established TCP connections will block WebRTC unless TURN-over-TCP is available.
WebRTC connections traverse the same BGP routes and autonomous systems as any other internet traffic. The latency of a WebRTC call is bounded by the physical distance between peers (or between each peer and the TURN server). You can explore the BGP paths between networks to understand why a call between two specific locations might have high latency — the traffic may be traversing multiple exchange points and transit providers.