How SIP and VoIP Work: Telephony Signaling Over IP Networks

SIP (Session Initiation Protocol) is the signaling protocol that establishes, modifies, and terminates voice and video calls over IP networks. Defined in RFC 3261, SIP is the backbone of modern telephony: it is what makes your VoIP phone ring, connects your video conference, routes calls through enterprise PBX systems, and carries billions of minutes of voice traffic daily across carrier networks. When you hear "VoIP," SIP is almost always the signaling protocol underneath. But SIP only handles the call setup — the actual voice and video data flows over RTP (Real-time Transport Protocol), a separate media plane that SIP negotiates using SDP (Session Description Protocol). Understanding the interplay between SIP, SDP, and RTP is essential to understanding how modern telephony works.

This article explains the full VoIP stack: how SIP establishes calls, how SDP negotiates codecs and media parameters, how RTP carries voice and video in real time, how NAT traversal works for VoIP, and how modern networks like VoLTE bring carrier-grade voice to LTE and 5G.

SIP: The Signaling Layer

SIP is a text-based, request-response protocol modeled after HTTP. Like HTTP, SIP messages have headers, a method line (or status line), and an optional body. Like HTTP, SIP uses URIs to identify resources — in this case, users. A SIP address looks like sip:[email protected] or sips:[email protected]:5061 (for TLS-secured SIP). And like HTTP, SIP is designed to be extensible — new headers and methods can be added without breaking existing implementations.

But SIP differs from HTTP in fundamental ways. SIP is a peer-to-peer protocol: any SIP endpoint can be both client and server. When Alice calls Bob, Alice's phone is a client (sending an INVITE request) and Bob's phone is a server (sending back a response). When Bob calls Alice later, the roles reverse. SIP also supports forking — a single INVITE can ring multiple devices simultaneously — and dialog state, where a sequence of related messages (like an INVITE, ACK, re-INVITE, and BYE) are tracked as a single logical session.

SIP typically runs over UDP on port 5060 for unencrypted signaling, or over TLS on port 5061 for encrypted signaling. TCP is also supported, and RFC 3261 requires a congestion-controlled transport such as TCP when a request approaches the path MTU (SIP messages can be large when carrying SDP with many codec options). In modern deployments, especially VoLTE, SIP signaling is normally protected with TLS or IPsec.

SIP Messages: Requests and Responses

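SIP defines six core request methods in RFC 3261: INVITE (start a session), ACK (confirm the final response to an INVITE), BYE (end a session), CANCEL (abort a pending request), REGISTER (bind a user's address to a contact), and OPTIONS (query capabilities).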

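SIP responses use numeric status codes, grouped by class, again mirroring HTTP: 1xx (provisional, such as 100 Trying and 180 Ringing), 2xx (success), 3xx (redirection), 4xx (client error), 5xx (server error), and 6xx (global failure).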

A SIP INVITE message looks like this:

INVITE sip:[email protected] SIP/2.0
Via: SIP/2.0/UDP 192.168.1.100:5060;branch=z9hG4bK776asdhds
Max-Forwards: 70
To: Bob <sip:[email protected]>
From: Alice <sip:[email protected]>;tag=1928301774
Call-ID: [email protected]
CSeq: 314159 INVITE
Contact: <sip:[email protected]:5060>
Content-Type: application/sdp
Content-Length: 193

v=0
o=alice 2890844526 2890844526 IN IP4 192.168.1.100
s=-
c=IN IP4 192.168.1.100
t=0 0
m=audio 49170 RTP/AVP 0 8 97
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 opus/48000/2

The headers carry the dialog identifiers (Call-ID and the From tag; Bob's user agent adds the To tag in its response) and routing information (Via, Contact), and the body carries an SDP offer listing supported audio codecs.
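Because SIP is text-based, pulling the dialog identifiers out of a message takes only a few lines. The following is a minimal sketch, not a conforming parser: it ignores header folding and compact header names, and the example.com addresses and the shortened Call-ID are placeholders chosen for illustration.

```python
def parse_sip_message(raw):
    """Split a SIP message into start line, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive; repeated headers (e.g. Via) accumulate.
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return lines[0], headers, body


def dialog_id(headers):
    """Return (Call-ID, From tag, To tag). The To tag is None in an initial INVITE."""
    def tag_of(value):
        for param in value.split(";")[1:]:
            key, _, val = param.partition("=")
            if key.strip() == "tag":
                return val.strip()
        return None
    return (headers["call-id"][0],
            tag_of(headers["from"][0]),
            tag_of(headers["to"][0]))


# Placeholder message; addresses and Call-ID are hypothetical.
raw = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.168.1.100:5060;branch=z9hG4bK776asdhds\r\n"
    "To: Bob <sip:bob@example.com>\r\n"
    "From: Alice <sip:alice@example.com>;tag=1928301774\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n"
    "\r\n"
)
start_line, headers, body = parse_sip_message(raw)
print(start_line)
print(dialog_id(headers))
```

The missing To tag in the result is expected: the callee's user agent supplies it in its response, at which point all three identifiers are known and the dialog can be matched.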

The INVITE/ACK/BYE Call Flow

The basic SIP call flow is a three-way handshake for setup, followed by a BYE and its 200 OK for teardown:

[Figure: SIP call flow (INVITE / 200 OK / ACK / BYE) between Alice (UAC), a SIP proxy, and Bob (UAS), with RTP media flowing between pickup and hangup.]

  1. INVITE — Alice sends an INVITE to Bob (possibly via one or more SIP proxies). The INVITE carries an SDP offer describing Alice's media capabilities: codecs, IP address, port, and protocol.
  2. 100 Trying — The proxy immediately responds with 100 Trying to tell Alice it received the request and is working on it. This stops Alice's retransmission timer.
  3. 180 Ringing — Bob's phone starts ringing and sends 180 Ringing back. Alice hears a ringback tone.
  4. 200 OK — Bob picks up. His phone sends 200 OK with an SDP answer describing Bob's chosen codec and media address.
  5. ACK — Alice confirms receipt of the 200 OK. The three-way handshake is complete and the SIP dialog is established.
  6. RTP media flows — Voice (and/or video) packets flow directly between Alice and Bob using the addresses negotiated in SDP. The SIP proxy is typically not in the media path.
  7. BYE — When either party hangs up, they send a BYE within the dialog. The other side responds with 200 OK and the session ends.

The three-way handshake (INVITE / 200 OK / ACK) exists because SIP was designed for unreliable transports like UDP. Unlike TCP, which guarantees delivery at the transport layer, SIP must handle retransmissions at the application layer. The UAS retransmits the 200 OK until it receives an ACK, so a lost 200 OK and a lost ACK are repaired the same way: the next 200 OK arrives and the UAC answers it with another ACK. The three-way exchange ensures both sides know the session is established even over lossy networks.
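To make the application-layer retransmission concrete, here is a small sketch of the schedule on which a UAS re-sends an unacknowledged 200 OK. It assumes the RFC 3261 default timers (T1 = 500 ms, T2 = 4 s): the interval starts at T1, doubles up to T2, and the UAS gives up after 64*T1. The function name is made up for this example.

```python
def ok_retransmit_times(t1=0.5, t2=4.0):
    """Seconds after the first 200 OK at which it is re-sent if no ACK arrives."""
    times = []
    elapsed, interval = 0.0, t1
    while True:
        elapsed += interval
        if elapsed >= 64 * t1:   # give up after 64*T1 (32 s by default)
            break
        times.append(elapsed)
        interval = min(2 * interval, t2)  # back off exponentially, capped at T2
    return times


print(ok_retransmit_times())
# [0.5, 1.5, 3.5, 7.5, 11.5, 15.5, 19.5, 23.5, 27.5, 31.5]
```

In a real stack the schedule is cut short as soon as the ACK arrives; the sketch only shows the worst case.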

SIP Server Architecture

SIP defines several logical server roles, though in practice a single server process often implements multiple roles:

Registrar Server

A registrar accepts REGISTER requests and maintains a location database mapping SIP URIs (sip:[email protected]) to contact addresses (sip:[email protected]:5060). When Alice's phone boots up or connects to the network, it sends a REGISTER to its domain's registrar. The registration has a limited lifetime (typically 3600 seconds) and must be refreshed before it expires. If Alice has multiple devices — a desk phone, a softphone, and a mobile app — each sends its own REGISTER, and the registrar stores multiple contact bindings for the same URI.
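The location database can be pictured as a map from each address-of-record to a set of contact bindings, each with its own expiry. The sketch below is an in-memory illustration only; the class and method names are hypothetical, and a real registrar would also authenticate requests and honour Call-ID/CSeq ordering.

```python
import time


class LocationService:
    """Toy registrar store: address-of-record -> {contact URI: expiry time}."""

    def __init__(self):
        self._bindings = {}

    def register(self, aor, contact, expires=3600, now=None):
        now = time.time() if now is None else now
        contacts = self._bindings.setdefault(aor, {})
        if expires == 0:
            contacts.pop(contact, None)       # Expires: 0 removes the binding
        else:
            contacts[contact] = now + expires  # new binding or refresh

    def lookup(self, aor, now=None):
        """Return the contacts whose registrations have not yet expired."""
        now = time.time() if now is None else now
        contacts = self._bindings.get(aor, {})
        return sorted(c for c, expiry in contacts.items() if expiry > now)


# Two devices registering for the same (placeholder) user.
db = LocationService()
db.register("sip:alice@example.com", "sip:alice@192.168.1.100:5060", now=0)
db.register("sip:alice@example.com", "sip:alice@10.0.0.7:5060", expires=600, now=0)
print(db.lookup("sip:alice@example.com", now=10))    # both contacts
print(db.lookup("sip:alice@example.com", now=1000))  # only the 3600 s binding
```

A proxy that forks an INVITE would send it to every contact returned by `lookup`.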

Proxy Server

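A proxy server routes SIP requests on behalf of clients. When Alice sends INVITE to sip:[email protected], her phone may not know Bob's current IP address. It sends the INVITE to the proxy for example.com, which looks up Bob in the location database (populated by the registrar) and forwards the INVITE to Bob's current contact address. Proxies can operate in two modes: stateless, where the proxy forwards each message without remembering it, and stateful, where the proxy tracks each transaction so that it can retransmit, fork requests, and generate its own responses (such as 100 Trying).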

Proxies add Via headers to requests and use Record-Route headers if they need to stay in the signaling path for subsequent messages in the dialog. A proxy that Record-Routes itself will see all re-INVITEs, UPDATEs, and BYEs for the dialog — important for billing, policy enforcement, and call recording.

Redirect Server

A redirect server does not forward requests. Instead, it responds to requests with a 3xx response containing the target's current address, telling the client to try that address directly. This is useful for load distribution and for letting clients discover the optimal path without proxies staying in the signaling path.

Back-to-Back User Agent (B2BUA)

A B2BUA is not technically a SIP server type defined in RFC 3261 but is extremely common in practice. Unlike a proxy (which forwards requests), a B2BUA terminates the incoming SIP dialog and creates an entirely new outgoing dialog. The two dialogs have different Call-IDs, tags, and SDP. B2BUAs are used in Session Border Controllers (SBCs), PBX systems, and application servers because they provide complete control over both legs of the call — they can modify SDP, transcode media, enforce security policies, and hide internal network topology.

SDP: Session Description Protocol

SDP (RFC 8866, which obsoletes RFC 4566) is the language SIP uses to describe media sessions. An SDP body is included in INVITE requests (as an offer) and 200 OK responses (as an answer). SDP itself is not a protocol; it is a data format that describes the parameters of a media session: what codecs are supported, where media should be sent, and what protocol to use.

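The key SDP fields for VoIP are v= (protocol version), o= (origin and session identifier), s= (session name), c= (connection address for media), t= (session timing), m= (media type, port, transport, and payload types), and a= (attributes such as rtpmap codec mappings and the media direction).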

The Offer/Answer Model

SDP negotiation follows the offer/answer model defined in RFC 3264. The offerer (usually the caller) lists all codecs it supports in preference order. The answerer (the callee) selects the codecs it wants to use from the offered list and responds. The answerer must not add codecs that were not in the offer. After the exchange, both sides know exactly which codecs to use, what port to send media to, and in which direction media flows.
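The core rule of the answer, that it may only contain codecs from the offer, amounts to an ordered intersection. The sketch below assumes codecs are identified by their rtpmap encoding names and keeps the offerer's preference order, which RFC 3264 recommends but does not require; the function name is illustrative.

```python
def answer_codecs(offered, supported):
    """Pick the codecs for an SDP answer: offered codecs the answerer also supports.

    The result keeps the offerer's preference order and never contains
    a codec that was absent from the offer.
    """
    supported_names = {name.lower() for name in supported}
    return [codec for codec in offered if codec.lower() in supported_names]


offer = ["PCMU/8000", "PCMA/8000", "opus/48000/2"]
local = ["opus/48000/2", "G729/8000", "PCMU/8000"]
print(answer_codecs(offer, local))  # ['PCMU/8000', 'opus/48000/2']
```

An empty result means the two sides share no codec, in which case a real endpoint would reject the offer (commonly with a 488 Not Acceptable Here response).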

If the call parameters need to change mid-session — for example, adding video to a voice call, or switching codecs — either party sends a re-INVITE with a new SDP offer. The other side responds with a new SDP answer, and the media session is updated without dropping the call.

Codec Negotiation

Codec selection is one of the most important aspects of VoIP quality. The SDP offer/answer exchange determines which codec both sides will use, and different codecs make dramatically different tradeoffs between bandwidth, latency, and audio quality.

Common VoIP codecs and their characteristics:

Codec | Bitrate | Sample Rate | Frame Size | Use Case
G.711 (PCMU/PCMA) | 64 kbps | 8 kHz | 20 ms | PSTN interop, LAN calls
G.729 | 8 kbps | 8 kHz | 10 ms | Low-bandwidth WAN links
G.722 | 64 kbps | 16 kHz | 20 ms | Wideband (HD Voice)
Opus | 6-510 kbps | 8-48 kHz | 2.5-60 ms | Modern VoIP, WebRTC
AMR-WB (G.722.2) | 6.6-23.85 kbps | 16 kHz | 20 ms | VoLTE, mobile carriers
EVS | 5.9-128 kbps | 8-48 kHz | 20 ms | Next-gen VoLTE, 5G voice

G.711 is the universal baseline, supported by virtually every SIP device. It applies only logarithmic companding rather than frame-based compression, transmitting 64 kbps of PCM audio (8000 samples/second, 8 bits/sample). This means minimal encoding delay and high quality for narrowband (300-3400 Hz) audio, but it consumes significant bandwidth. G.711 comes in two variants: PCMU (mu-law, used in North America and Japan) and PCMA (A-law, used everywhere else).

Opus is the modern codec of choice. Developed by the IETF (RFC 6716), Opus is royalty-free and combines the SILK speech codec with the CELT audio codec. It adapts dynamically to network conditions, adjusting bitrate from 6 kbps (acceptable speech) to 510 kbps (transparent full-bandwidth audio). Opus is mandatory in WebRTC and increasingly supported in SIP endpoints.

AMR-WB (Adaptive Multi-Rate Wideband) is the codec used in VoLTE networks worldwide. It operates at nine different bitrates (6.6 to 23.85 kbps) and can adapt to changing network conditions by switching between modes. AMR-WB gives HD Voice quality — noticeably better than the narrowband G.711 of traditional phone calls — while using a fraction of the bandwidth.

RTP: The Media Transport

Once SIP and SDP have negotiated the session parameters, the actual voice and video data flows over RTP (Real-time Transport Protocol, RFC 3550). RTP provides the framing, sequencing, and timing information needed to deliver real-time media over an unreliable network.

RTP runs over UDP — not TCP. Real-time media requires low latency above all else. TCP's retransmission and ordering guarantees add delays that are unacceptable for voice: a retransmitted packet that arrives 200ms late is useless for a voice conversation. Better to lose a packet entirely and conceal the gap than to delay the entire stream waiting for one packet. UDP lets the application make this tradeoff.

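Each RTP packet carries a 12-byte fixed header followed by the media payload. The header holds the payload type (identifying the codec), a sequence number (for detecting loss and reordering), a timestamp (for playout timing), and the SSRC (identifying the stream's source).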

A typical voice call generates 50 RTP packets per second (20ms per packet). At G.711 rates, each packet carries 160 bytes of audio payload plus 12 bytes of RTP header, 8 bytes of UDP header, and 20 bytes of IP header: 200 bytes per packet, or 80 kbps per direction including all headers. With Opus at 20 kbps, the same packet rate yields roughly 50 bytes of payload per packet, or about 36 kbps per direction with headers.
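The per-call bandwidth arithmetic generalizes to any codec bitrate and packetization interval. The helper below is a sketch that assumes IPv4 with no link-layer overhead (12 + 8 + 20 = 40 header bytes) and a constant codec bitrate; the function name is made up for this example.

```python
def rtp_bandwidth_kbps(codec_bitrate_bps, ptime_ms=20, header_bytes=40):
    """On-the-wire IP bandwidth per direction for one RTP stream, in kbps."""
    packets_per_second = 1000 / ptime_ms
    payload_bytes = codec_bitrate_bps / 8 / packets_per_second
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


print(rtp_bandwidth_kbps(64_000))  # G.711: 80.0
print(rtp_bandwidth_kbps(20_000))  # Opus at 20 kbps: 36.0
print(rtp_bandwidth_kbps(8_000))   # G.729: 24.0
```

The G.729 line shows why header overhead matters on low-bitrate codecs: two thirds of the 24 kbps is IP, UDP, and RTP headers rather than audio.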

RTCP: Control and Quality Reporting

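RTCP (RTP Control Protocol) runs alongside RTP, conventionally on the next higher port number (if RTP uses port 49170, RTCP uses 49171). RTCP provides out-of-band feedback about the quality of the RTP stream. Every few seconds, each participant sends RTCP reports containing the fraction and cumulative count of packets lost, the highest sequence number received, the interarrival jitter, and timestamps from which round-trip time can be computed; senders also report how many packets and bytes they have sent.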

RTCP reports are critical for VoIP quality monitoring. They provide the raw data needed to calculate MOS scores, detect network problems, and trigger codec adaptation. Modern SIP endpoints and media gateways use RTCP-XR (Extended Reports, RFC 3611) for even more detailed metrics including burst/gap loss patterns, round-trip delay, and signal/noise levels.

VoIP Quality Metrics

Voice quality in VoIP is ultimately about the human experience: can you understand the other person, and does the conversation feel natural? Several measurable network metrics determine this:

Latency (One-Way Delay)

End-to-end latency is the total time from when the speaker says a word to when the listener hears it. This includes codec encoding delay, packetization delay, network transit time, jitter buffer delay, and decoding delay. The ITU-T G.114 recommendation states that one-way delay should be below 150ms for acceptable conversational quality. Above 150ms, speakers begin talking over each other. Above 400ms, conversation becomes very difficult. Traditional PSTN calls typically have 25-50ms of one-way delay; a VoIP call over a well-connected network adds 60-100ms; a call traversing multiple continents via a NATted path may reach 200ms or more.

Jitter

Jitter is the variation in packet arrival times. If packets are sent every 20ms but arrive at intervals of 15ms, 25ms, 18ms, 22ms, jitter measures how far those intervals deviate from the 20ms sending interval; RTP receivers report it as a running average of that deviation. High jitter means packets arrive erratically, some too early and some too late. VoIP endpoints use a jitter buffer to absorb this variation: incoming packets are held in a buffer for a short period (typically 20-60ms) and played out at regular intervals. A larger jitter buffer absorbs more variation but adds latency. An adaptive jitter buffer adjusts its size based on observed network conditions.
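One common way to turn arrival times into a single jitter figure is the interarrival jitter estimator from RFC 3550, which RTCP receiver reports carry. The sketch below applies that estimator to the arrival pattern from the example above; it works in milliseconds for readability, whereas RTP itself measures in timestamp units, and the function name is illustrative.

```python
def interarrival_jitter(send_times, recv_times):
    """RFC 3550 style running jitter estimate over a sequence of packets.

    For each consecutive pair, D is the change in transit time; the
    estimate moves 1/16 of the way toward |D| on every packet.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


send = [0, 20, 40, 60, 80]   # sent every 20 ms
recv = [0, 15, 40, 58, 80]   # arrival gaps of 15, 25, 18, 22 ms
print(round(interarrival_jitter(send, recv), 3))  # 0.774
```

The 1/16 gain smooths out single outliers, so the estimate rises slowly and reflects sustained variation rather than one late packet.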

Packet Loss

When RTP packets are lost in transit (dropped by congested routers, for example), the receiver must conceal the gap. Packet Loss Concealment (PLC) algorithms typically replay the last received audio frame at reduced amplitude, or interpolate between the surrounding frames. G.711 with PLC can tolerate up to about 1% random packet loss without noticeable degradation. Opus handles packet loss better due to built-in redundancy features. Above 5% packet loss, voice quality degrades significantly regardless of codec.

MOS (Mean Opinion Score)

The Mean Opinion Score is a numerical measure of voice quality on a scale from 1 (bad) to 5 (excellent). Originally determined by human listeners rating call quality, MOS can now be estimated algorithmically using the E-model (ITU-T G.107). The E-model takes latency, jitter, packet loss, and codec characteristics as inputs and produces an R-factor (0-100), which maps to MOS. As a rough guide, an R-factor of 80 or more (MOS of about 4.0 and above) leaves most users satisfied, while an R-factor below 60 (MOS below about 3.1) leaves nearly all users dissatisfied.
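The conversion from R-factor to MOS is a fixed polynomial given in ITU-T G.107. The sketch below implements only that final mapping and assumes the R-factor has already been computed from the network and codec impairments; the function name is illustrative.

```python
def r_to_mos(r):
    """Map an E-model R-factor to an estimated MOS (ITU-T G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)


for r in (93.2, 80, 70, 60, 50):
    print(r, round(r_to_mos(r), 2))
```

An R-factor of 93.2 is the E-model's default for a narrowband connection with no added impairments, and it maps to a MOS of about 4.41, which is why estimated scores for G.711 calls top out below 4.5.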

NAT Traversal for VoIP

NAT is the single biggest operational challenge in VoIP deployment. SIP and RTP were designed in an era when endpoints had public IP addresses. SIP and SDP both embed IP addresses in their payloads: SIP puts addresses in Contact and Via headers, and SDP puts the media address in the c= line and the media port in the m= line. A NAT device rewrites the IP header but does not touch the SIP/SDP payload, creating a mismatch: the SDP says "send media to 192.168.1.100:49170" but that address is unreachable from the public internet.

Several techniques address this problem:

STUN (Session Traversal Utilities for NAT)

STUN (RFC 8489) allows an endpoint to discover its public IP address and port by querying an external STUN server. The SIP endpoint sends a STUN Binding Request to the server; the server responds with the observed public address. The endpoint then uses this public address in its SDP c= line and SIP Contact header. STUN works for most NAT types but fails with symmetric NATs, which assign a different public port for every destination.
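On the wire, a Binding Request is just a 20-byte header, and the public address comes back in an XOR-MAPPED-ADDRESS attribute. The sketch below builds the request and decodes that attribute's value for IPv4 only. It does not send anything or parse a full response, and the helper names are hypothetical; the constants come from RFC 8489.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value in every STUN message header
BINDING_REQUEST = 0x0001       # message type
XOR_MAPPED_ADDRESS = 0x0020    # attribute type carrying the reflexive address


def build_binding_request(transaction_id=None):
    """Return a 20-byte STUN Binding Request with no attributes."""
    transaction_id = transaction_id or os.urandom(12)
    # Fields: message type, message length (0, no attributes), magic cookie.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def decode_xor_mapped_ipv4(value):
    """Decode the value of an XOR-MAPPED-ADDRESS attribute (IPv4 only)."""
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    port = xport ^ (MAGIC_COOKIE >> 16)   # port is XORed with the cookie's top 16 bits
    addr = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return addr, port


request = build_binding_request()
print(len(request), request[:8].hex())
```

The address is XORed with the magic cookie so that NATs and application-layer gateways that rewrite anything resembling their own address do not corrupt the answer.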

TURN (Traversal Using Relays around NAT)

TURN (RFC 8656) provides a relay server that forwards media packets. When direct connectivity is impossible (symmetric NAT, restrictive firewalls), the endpoint allocates a relay address on the TURN server and puts that address in its SDP. All media then flows through the TURN server, adding latency and server cost but guaranteeing connectivity.

ICE (Interactive Connectivity Establishment)

ICE (RFC 8445) is the framework that ties STUN and TURN together. It gathers multiple candidate addresses (local, server-reflexive via STUN, and relay via TURN), exchanges them via SDP, and systematically tests connectivity between all candidate pairs to find the best working path. ICE is standard in WebRTC and increasingly supported in SIP via the SDP offer/answer procedures for ICE (RFC 8839). For a deeper explanation of the ICE candidate gathering and connectivity check process, see how NAT traversal works.
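How ICE decides which candidates to try first comes down to a priority formula in RFC 8445. The sketch below uses the type preference values the RFC recommends (host 126, peer-reflexive 110, server-reflexive 100, relay 0) and assumes a single local interface, so the local preference is left at its maximum; the names are illustrative.

```python
# Recommended type preferences from RFC 8445; higher is tried first.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


for cand_type in ("host", "srflx", "relay"):
    print(cand_type, candidate_priority(cand_type))
# host 2130706431
# srflx 1694498815
# relay 16777215
```

Because the type preference occupies the top bits, a direct host path always outranks a STUN-derived one, and a TURN relay is tried last, which matches the cost ordering described above.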

Session Border Controllers (SBCs)

In carrier and enterprise VoIP, the most common NAT traversal solution is the Session Border Controller. An SBC sits at the network edge, terminates SIP sessions (acting as a B2BUA), and rewrites SDP addresses. It also media-anchors the RTP streams — forcing media to flow through the SBC so it can NAT-traverse on behalf of endpoints. SBCs also provide security (topology hiding, rate limiting, protocol normalization), regulatory compliance (lawful intercept), and interoperability between different vendor implementations.

[Figure: VoIP NAT traversal via a Session Border Controller. Phones on a private network (192.168.1.x) sit behind a NAT (203.0.113.5); the SBC (198.51.100.1) acts as SIP B2BUA and media relay toward the carrier's SIP proxy/registrar, PSTN gateway, and a remote phone. The SBC rewrites SDP addresses, anchors media, and hides topology.]

SRTP: Securing the Media

Standard RTP sends media in the clear — anyone who can capture the packets can listen to the call. SRTP (Secure RTP, RFC 3711) encrypts and authenticates RTP packets using symmetric-key cryptography, typically AES-128 in counter mode for encryption and HMAC-SHA1 for authentication.

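SRTP does not define how keys are exchanged; that is handled by a key management protocol. Several mechanisms exist, including SDES (RFC 4568), which carries keys in SDP a=crypto attributes and therefore relies on TLS-protected signaling; DTLS-SRTP (RFC 5763 and RFC 5764), which derives keys from a DTLS handshake on the media path and is mandatory in WebRTC; and ZRTP (RFC 6189), which performs a Diffie-Hellman exchange in the media stream itself.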

The SRTP header is identical to RTP in structure, but the payload is encrypted and a configurable-length authentication tag is appended. The overhead is minimal — typically 10 bytes for the authentication tag — so SRTP adds negligible latency and bandwidth.

SIP Trunking

A SIP trunk is an IP-based connection between an enterprise phone system (IP-PBX) and an Internet Telephony Service Provider (ITSP) that replaces traditional PSTN trunks (T1/E1 lines, PRI circuits). Instead of physical connections carrying 23 or 30 voice channels, a SIP trunk carries voice as RTP over the enterprise's internet connection or a dedicated MPLS circuit.

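A SIP trunking architecture typically involves the enterprise IP-PBX, an SBC at the enterprise edge, the IP link to the provider (public internet or a dedicated circuit), and the ITSP's own SBCs and PSTN gateways.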

SIP trunking offers significant advantages over traditional PSTN trunks: capacity can scale dynamically (no physical line limits), calls between sites can stay on-net (free), long-distance and international rates are typically lower, and the same network infrastructure carries voice, video, and data. The tradeoff is that voice quality depends on the quality of the IP network — jitter, packet loss, and congestion directly affect calls in ways they never could on a dedicated TDM circuit.

QoS (Quality of Service) is therefore critical for SIP trunking. VoIP packets should be marked with DSCP EF (Expedited Forwarding, 46) for priority queuing. Enterprise networks typically implement strict priority queuing for voice traffic and bandwidth reservation on WAN links to ensure that data traffic cannot starve voice of bandwidth.
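An application can request EF marking on its own packets by setting the IP TOS byte on the socket. The sketch below assumes a Linux or similar POSIX system where the `IP_TOS` socket option is honoured; some platforms ignore it, and network devices may re-mark packets regardless. The helper names are made up for this example.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding


def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the TOS byte; the low two bits are ECN."""
    return dscp << 2


def open_marked_udp_socket(dscp=DSCP_EF):
    """Create a UDP socket whose outgoing IPv4 packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(hex(dscp_to_tos(DSCP_EF)))  # 0xb8
```

Marking at the endpoint only helps if the switches and routers along the path are configured to trust and act on the DSCP value.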

VoLTE: Voice over LTE

Before VoLTE, mobile voice calls on LTE networks used Circuit-Switched Fallback (CSFB): when you made a phone call, your phone dropped from the LTE data network to the legacy 2G/3G circuit-switched network. This caused a multi-second delay before the call connected, wasted spectrum, and prevented the carrier from eventually shutting down legacy networks.

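VoLTE solves this by carrying voice calls as VoIP (SIP signaling and RTP media) over the LTE data network. The architecture leverages the 3GPP IMS (IP Multimedia Subsystem), whose core elements include the P-CSCF (the SIP proxy that is the phone's first point of contact), the I-CSCF and S-CSCF (which locate and serve the subscriber, with the S-CSCF acting as registrar), the HSS (the subscriber database), and the telephony application server that implements call features.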

VoLTE uses the AMR-WB codec for HD Voice (wideband audio at 16 kHz), delivering significantly better audio quality than the narrowband AMR codec of 2G/3G calls. The LTE network provides dedicated QoS bearers for voice traffic — a dedicated data pipe with guaranteed bandwidth and low latency — so voice quality is consistent even when the data network is congested.

VoLTE also enables ViLTE (Video over LTE) using the same IMS infrastructure but with H.264 or H.265 video codecs in addition to AMR-WB audio. And it enables RCS (Rich Communication Services), the carrier messaging standard that uses SIP for session setup.

As carriers deploy 5G, VoNR (Voice over New Radio) extends the same IMS/SIP architecture to 5G networks, with the Evolved Packet Core replaced by the 5G Core but the SIP signaling and RTP media planes remaining largely the same.

SIP Security Considerations

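SIP is a frequent target of abuse because it controls resources with real monetary value: phone calls cost money. Common SIP attacks include toll fraud (placing expensive calls through compromised accounts or open relays), brute-force password guessing against registrars, registration hijacking, INVITE and REGISTER floods, caller ID spoofing, and eavesdropping on unencrypted signaling or media.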

Defenses include TLS for SIP signaling, SRTP for media encryption, SIP digest authentication with strong passwords, SBC rate limiting, fail2ban-style intrusion detection on SIP registrars, and network segmentation to isolate voice infrastructure from general internet traffic.

SIP and the Network Layer

From a network perspective, SIP and RTP traffic traverses the same BGP routes and autonomous systems as any other IP traffic. VoIP quality is directly tied to the network path: a call routed through a congested peering link or across many AS hops will have higher latency and more packet loss than one that stays on a single well-provisioned network.

Major VoIP providers operate their own autonomous systems and peer heavily at internet exchange points to minimize the number of network hops voice traffic must traverse. SIP trunking providers often offer direct interconnects or dedicated circuits to avoid traversing the public internet entirely.

VoIP traffic is also sensitive to route changes. A BGP convergence event — where routes shift after a link failure — can cause packet loss and reordering that momentarily degrades call quality. Unlike bulk data transfers that simply slow down during a routing event, voice traffic produces audible glitches: clicks, gaps, or momentary silence when packets are lost or arrive out of order during convergence.

Explore the Network

Use the god.ad BGP Looking Glass to explore the routing paths that VoIP traffic traverses. Look up the IP addresses of SIP providers and VoIP infrastructure to see their BGP routes, origin AS, and AS paths. Understanding the network path between endpoints helps explain call quality — every AS hop, peering point, and transit link adds latency that directly affects voice conversations.
