How Container Networking Works: Docker, Kubernetes, and CNI

Containers transformed how software is deployed, but every container still needs to communicate over a network. Under the hood, container networking is built on the same Linux primitives that power routers and firewalls: network namespaces, virtual interfaces, routing tables, and iptables rules. Understanding these building blocks explains how Docker connects containers on a single host, how Kubernetes gives every pod a routable IP, and why some organizations use BGP to distribute pod routes across their infrastructure.

Linux Network Namespaces: The Foundation

Every container gets its own isolated network stack through a Linux kernel feature called network namespaces. A network namespace is a complete, independent copy of the networking state: its own interfaces, routing tables, iptables rules, and sockets. Processes inside one namespace cannot see or interact with the network interfaces of another namespace, even though they share the same physical hardware.

When the host kernel boots, everything runs in the default (root) network namespace. Creating a new namespace is a single system call (unshare(CLONE_NEWNET) or ip netns add). The new namespace starts with nothing but a loopback interface. It has no connectivity to the outside world until you explicitly wire it up.

This isolation is the reason containers feel like lightweight VMs from a networking perspective. Each container has its own eth0, its own IP address, its own routing table, and its own port space. Two containers can both listen on port 80 without conflicting, because their port 80s exist in different namespaces.

Veth Pairs: Wiring Namespaces Together

A veth pair (virtual Ethernet pair) is a pair of virtual network interfaces connected to each other like a virtual cable. Any packet sent into one end appears on the other. One end lives in the container's namespace; the other end lives in the host's namespace (or in a bridge).

[Diagram: the host's root network namespace holds the docker0 bridge (172.17.0.1); Container A's eth0 (172.17.0.2) and Container B's eth0 (172.17.0.3) connect to it through veth pairs (veth-a0/veth-a1 and veth-b0/veth-b1).]

The veth pair is the fundamental plumbing for all container networking on Linux. When Docker creates a container, it:

  1. Creates a new network namespace for the container
  2. Creates a veth pair
  3. Moves one end (renamed eth0) into the container's namespace
  4. Attaches the other end to a bridge (like docker0) in the host namespace
  5. Assigns an IP address to the container's eth0 from the bridge's subnet

The result: the container has its own interface with its own IP, connected through a virtual cable to the host's bridge. Packets flow between them at kernel speed with no copying overhead.
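The five steps above map almost one-to-one onto ip(8) commands. A minimal sketch that generates the equivalent commands (the namespace, interface, and bridge names are illustrative, not Docker's internal ones; actually running these requires root):

```python
def docker_style_wiring(ns, veth_host, bridge, container_ip):
    """Return the ip(8) commands that conceptually mirror Docker's
    per-container network setup (illustrative names, not Docker's own)."""
    return [
        f"ip netns add {ns}",                                 # 1. new namespace
        f"ip link add {veth_host} type veth peer name ceth",  # 2. veth pair
        f"ip link set ceth netns {ns}",                       # 3. move one end in...
        f"ip netns exec {ns} ip link set ceth name eth0",     #    ...and rename to eth0
        f"ip link set {veth_host} master {bridge}",           # 4. attach host end to bridge
        f"ip netns exec {ns} ip addr add {container_ip} dev eth0",  # 5. assign IP
        f"ip netns exec {ns} ip link set eth0 up",
        f"ip link set {veth_host} up",
    ]

for cmd in docker_style_wiring("c1", "veth-a0", "docker0", "172.17.0.2/16"):
    print(cmd)
```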

Docker Networking Modes

Docker offers four networking modes, each making different tradeoffs between isolation, performance, and connectivity.

Bridge Mode (Default)

When you run docker run without specifying a network, the container connects to the default docker0 bridge. Docker creates a subnet (typically 172.17.0.0/16), assigns each container an IP from that range, and uses NAT (via iptables masquerade rules) to allow containers to reach the outside world. Incoming traffic requires explicit port mapping (-p 8080:80), which creates DNAT rules to forward traffic from the host's port to the container's port.

The bridge itself is a Linux kernel bridge, functionally identical to a layer-2 Ethernet switch. It learns MAC addresses, forwards frames between ports, and maintains a forwarding database. Containers on the same bridge can communicate directly at layer 2. The host can also communicate with containers because the bridge interface (docker0) has an IP address on the same subnet.

Host Mode

With --network host, Docker skips namespace creation entirely. The container shares the host's network namespace, sees all host interfaces, and uses the host's IP address directly. There is no NAT, no bridge, and no port mapping. This offers the best performance (no veth overhead) but sacrifices isolation. Two containers cannot both bind to the same port, and the container can see and modify all host network configuration.

None Mode

With --network none, Docker creates a network namespace but does not add any interfaces except loopback. The container has no external connectivity. This is used for workloads that process data without any network access, providing the strongest isolation.

Overlay Mode

Overlay networks span multiple Docker hosts, enabling containers on different machines to communicate as if they were on the same layer-2 network. Docker's overlay driver uses VXLAN (Virtual Extensible LAN) tunnels to encapsulate container traffic inside UDP packets that travel across the host network. We will examine VXLAN in detail when we discuss Kubernetes overlay networks below.

Docker Bridge Internals: iptables and NAT

Docker's bridge networking relies heavily on iptables for traffic management. Understanding these rules demystifies common Docker networking problems.

When Docker starts, it creates several iptables chains:

  - DOCKER (nat table) -- holds the DNAT rules created by port mappings such as -p 8080:80
  - DOCKER (filter table) -- accept rules for published container ports
  - DOCKER-USER -- evaluated before Docker's own rules and left empty for user-defined rules that Docker will not overwrite
  - DOCKER-ISOLATION-STAGE-1 and DOCKER-ISOLATION-STAGE-2 -- rules that block traffic between containers on different bridge networks

Docker also installs a MASQUERADE rule in the nat table's POSTROUTING chain so outbound container traffic leaves with the host's source address. This is the same NAT mechanism that home routers use to share a single public IP among multiple devices. Docker brings it to the container level, using private address space (172.17.0.0/16) internally and translating to the host's address externally.

Container-to-Container Communication

How containers talk to each other depends on their network configuration:

  - Same bridge network: direct layer-2 communication through the bridge, with no NAT involved
  - Default bridge (docker0): containers reach each other by IP address only; automatic name resolution is not provided
  - User-defined bridge networks: containers can additionally resolve each other by name through Docker's embedded DNS server
  - Different bridge networks: traffic is blocked by Docker's isolation rules unless a container is attached to both networks
  - Host mode: the container communicates like any other host process, with no container-specific plumbing

The Kubernetes Networking Model

Kubernetes imposes a fundamentally different networking model than Docker's default bridge-and-NAT approach. The Kubernetes networking model has three rules:

  1. Every pod gets its own IP address -- no sharing, no port conflicts between pods
  2. All pods can communicate with all other pods without NAT -- a pod's IP is routable across the entire cluster
  3. The IP that a pod sees for itself is the same IP that other pods see for it -- no address translation tricks

This model is radically simpler than Docker's NAT-based approach. There are no port mappings to manage, no DNAT rules to debug, and no confusion about which address a service thinks it has. Every pod is a first-class citizen on the network with a real, routable IP address. This is closer to how physical hosts work on a traditional network.

Kubernetes does not implement networking itself. Instead, it defines the model and delegates implementation to CNI plugins.

[Diagram: Node A (10.0.1.1, pod CIDR 10.244.1.0/24) runs Pod 1 (nginx + sidecar, 10.244.1.2) and Pod 2 (app, 10.244.1.3) on a cbr0/cni0 bridge; Node B (10.0.1.2, pod CIDR 10.244.2.0/24) runs Pod 3 (postgres, 10.244.2.4) and Pod 4 (redis, 10.244.2.5) on its own bridge. Cross-node pod traffic crosses the underlay network (physical / VXLAN / BGP) with no NAT.]

Notice how each node gets its own pod CIDR -- a /24 subnet carved from the cluster's larger pod address space. Pods on the same node share a local bridge and communicate directly. Pods on different nodes communicate through the underlay, which is where CNI plugins come in.
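Carving per-node pod CIDRs out of the cluster range is ordinary subnetting, which Python's ipaddress module can illustrate (the 10.244.0.0/16 cluster CIDR matches the example above; the node names are arbitrary):

```python
import ipaddress

cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
# Each node gets one /24 from the cluster range -- 256 nodes fit in a /16.
node_cidrs = list(cluster_cidr.subnets(new_prefix=24))

for node, cidr in zip(["node-a", "node-b", "node-c"], node_cidrs):
    print(f"{node}: {cidr}  ({cidr.num_addresses - 2} usable pod IPs)")
    # → node-a: 10.244.0.0/24  (254 usable pod IPs), and so on
```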

CNI: The Container Network Interface

CNI (Container Network Interface) is a specification that defines how container runtimes configure networking for containers. When Kubernetes creates a pod, it calls a CNI plugin binary with a simple JSON configuration. The plugin sets up the veth pair, assigns an IP, configures routes, and returns. When the pod is destroyed, the plugin is called again to clean up.

CNI's simplicity is its power. The specification is minimal: ADD (set up networking for a container), DEL (tear it down), and CHECK (verify it). This clean interface means that wildly different networking implementations -- from simple bridges to BGP-routed fabrics to eBPF dataplanes -- all plug into Kubernetes identically.
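The JSON handed to a plugin is correspondingly small. A minimal network configuration for the reference bridge plugin might look like this (values are illustrative; field names follow the CNI spec's network configuration format):

```json
{
  "cniVersion": "1.0.0",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

The runtime passes this JSON to the plugin binary on stdin and selects the operation (ADD, DEL, or CHECK) through the CNI_COMMAND environment variable.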

The major CNI plugins differ in how they solve the key problem: making pod IPs routable across nodes.

Flannel

Flannel is the simplest of the major CNI plugins. It assigns each node a subnet from a larger address space (e.g., 10.244.0.0/16, with each node getting a /24). For cross-node traffic, Flannel's default backend uses VXLAN to encapsulate pod packets in UDP. It is easy to set up and works in almost any environment, but it adds encapsulation overhead and does not implement Kubernetes NetworkPolicy.

Calico

Calico takes a fundamentally different approach: instead of overlay encapsulation, it uses BGP to distribute pod routes. Each node runs a BGP speaker (BIRD) that announces its pod CIDR to other nodes. The physical network routes pod traffic natively at layer 3, with no encapsulation overhead. Calico also supports VXLAN mode for environments where BGP is not feasible (like public clouds that don't support custom routing). We will examine Calico's BGP mode in detail below.

Cilium

Cilium uses eBPF (extended Berkeley Packet Filter) programs attached to the kernel's networking stack. Instead of relying on iptables for packet filtering and NAT, Cilium compiles network policy and service routing into eBPF bytecode that runs directly in the kernel. This bypasses the iptables chain evaluation entirely, providing significant performance improvements in clusters with thousands of services and network policies. Cilium supports VXLAN, Geneve, and native routing as its data plane.

Weave Net

Weave creates a mesh overlay network where every node maintains encrypted tunnels to every other node. It can traverse NAT and firewalls using a gossip protocol to discover peers. Weave's strength is simplicity and the ability to work in hostile network environments, but the full-mesh topology limits scalability compared to other options.

VXLAN: How Overlay Networks Work

VXLAN (Virtual Extensible LAN) is the dominant overlay technology for container networking. It solves a specific problem: making containers on different hosts appear to be on the same layer-2 network, even when the hosts themselves are connected through a routed layer-3 network.

VXLAN encapsulation wraps the original pod packet (inner Ethernet, inner IP with pod addresses, TCP/UDP, payload) in a new set of outer headers. The encapsulated frame on the wire: Outer Eth | Outer IP (host) | UDP | VXLAN | Inner Eth | Inner IP (pod) | TCP/UDP | Payload

VXLAN works by wrapping the entire original Ethernet frame inside a UDP packet. The outer IP header uses the host IPs (the underlay), while the inner frame carries the container's pod IPs (the overlay). The VXLAN header itself is 8 bytes and includes a 24-bit VNI (VXLAN Network Identifier) that allows up to 16 million separate virtual networks -- far more than the 4,094 VLAN limit that VXLAN was designed to replace.

The tradeoff is overhead. Each VXLAN packet adds 50 bytes of headers (outer Ethernet + outer IP + UDP + VXLAN), which reduces the effective MTU. If the underlay MTU is 1500, the overlay MTU drops to 1450. Jumbo frames (MTU 9000) on the underlay mitigate this, but not all environments support them.

VXLAN endpoints are called VTEPs (VXLAN Tunnel Endpoints). In container networking, each node runs a VTEP. When pod A on node 1 sends a packet to pod B on node 2, the VTEP on node 1 looks up which node owns pod B's IP, encapsulates the packet with node 2's IP as the outer destination, and sends it across the underlay. Node 2's VTEP decapsulates it and delivers the inner packet to pod B.
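The VXLAN header itself is simple enough to build by hand. A sketch using Python's struct module (the VNI value 42 is arbitrary):

```python
import struct

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes
VXLAN_OVERHEAD = 14 + 20 + 8 + 8

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags byte with the I bit set,
    then a 24-bit VNI; remaining bits are reserved (zero)."""
    assert 0 <= vni < 2**24
    return struct.pack("!II", 0x08 << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    _flags, vni_field = struct.unpack("!II", header)
    return vni_field >> 8

hdr = vxlan_header(42)
print(len(hdr), parse_vni(hdr))  # 8 42
print(1500 - VXLAN_OVERHEAD)     # effective overlay MTU: 1450
```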

BGP-Based Pod Networking with Calico

Calico's BGP mode eliminates overlay overhead entirely by using the real network to route pod traffic. This is the same BGP protocol that routes traffic between autonomous systems on the internet, now applied within a data center to distribute container routes.

In Calico's BGP mode, each Kubernetes node runs a BGP agent (historically BIRD, now also the calico-node daemon). Each node is assigned a pod CIDR (e.g., 10.244.1.0/24). The node's BGP speaker announces this CIDR to its BGP peers -- either the other nodes directly (full mesh) or a pair of route reflectors (for larger clusters).

The beauty of this approach is that pod IPs become native routes in the data center's routing table. A top-of-rack switch running BGP can learn the pod CIDRs just as it learns any other route. There is no encapsulation, no tunneling, and no MTU penalty. Traffic from pod to pod takes the same path as traffic from host to host.

For a cluster with nodes at 10.0.1.1 and 10.0.1.2, the BGP routes might look like:

10.244.1.0/24 via 10.0.1.1   # pods on node A
10.244.2.0/24 via 10.0.1.2   # pods on node B
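The forwarding decision these routes enable is ordinary longest-prefix matching, which Python's ipaddress module can sketch (the route table mirrors the two entries above):

```python
import ipaddress

# Pod routes as Calico's BGP speakers would announce them
routes = {
    ipaddress.ip_network("10.244.1.0/24"): "10.0.1.1",  # pods on node A
    ipaddress.ip_network("10.244.2.0/24"): "10.0.1.2",  # pods on node B
}

def next_hop(dst: str) -> str:
    """Longest-prefix match over the announced pod CIDRs."""
    dst_ip = ipaddress.ip_address(dst)
    matches = [net for net in routes if dst_ip in net]
    return routes[max(matches, key=lambda n: n.prefixlen)]

print(next_hop("10.244.2.17"))  # traffic for a pod on node B → 10.0.1.2
```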

This is conceptually identical to how a large network operator like Google (AS15169) announces its IP prefixes via BGP. The difference is scale and scope: internet BGP operates across autonomous systems with complex policy, while Calico BGP operates within a single data center with simple next-hop routing.

Calico also supports BGP peering with physical routers. In large deployments, Calico nodes peer with the top-of-rack (ToR) switch via eBGP, and the switches propagate pod routes through the data center's existing BGP fabric. This integrates container networking seamlessly with the physical network infrastructure.

kube-proxy: Service Routing Inside the Cluster

While CNI plugins handle pod-to-pod connectivity, kube-proxy handles Kubernetes Services -- the stable, virtual IPs that load balance traffic across a set of pod backends. kube-proxy runs on every node and programs the data plane to intercept traffic destined for Service IPs and redirect it to a healthy backend pod.

This service data plane can be implemented in one of three ways:

iptables Mode (Default)

kube-proxy writes iptables rules that DNAT traffic from the Service IP (ClusterIP) to a randomly selected backend pod IP. For a Service with three backends, there are three iptables rules with probability-based matching (1/3, 1/2, 1/1) that achieve roughly equal distribution. This is stateless random load balancing at the kernel level.

The problem with iptables mode is scale. iptables rules are evaluated linearly. With thousands of Services, each having dozens of endpoints, the rule chains grow to tens of thousands of entries. Every new connection traverses these chains, adding latency. Rule updates require rewriting the entire chain atomically, which causes brief CPU spikes.
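The probability cascade (1/3, then 1/2, then 1/1) yields an even split, which is easy to check in simulation (a conceptual sketch of the rule ordering, not of iptables internals):

```python
import random
from collections import Counter

def pick_backend(backends, rng):
    """Mimic kube-proxy's iptables rules: rule i matches with
    probability 1/(n - i), so the last rule always matches."""
    n = len(backends)
    for i, backend in enumerate(backends):
        if rng.random() < 1.0 / (n - i):
            return backend
    return backends[-1]  # unreachable: the final rule has probability 1

rng = random.Random(0)
counts = Counter(pick_backend(["pod-a", "pod-b", "pod-c"], rng)
                 for _ in range(30_000))
print(counts)  # roughly 10,000 connections per backend
```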

IPVS Mode

IPVS (IP Virtual Server) is a kernel-level load balancer that uses hash tables instead of linear chain evaluation. kube-proxy in IPVS mode creates a virtual server for each Service and adds real servers for each backend pod. IPVS supports multiple load-balancing algorithms (round-robin, least connections, weighted, source hash) and handles thousands of services with O(1) connection cost instead of O(n).
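Source-hash scheduling, one of those algorithms, pins each client to a backend deterministically. A conceptual sketch (IPVS's in-kernel hash differs; the IPs here are illustrative):

```python
import hashlib

def source_hash(client_ip: str, backends: list) -> str:
    """Map a client IP to a backend deterministically -- a sketch of
    IPVS 'sh' (source hashing) scheduling, not the kernel's actual hash."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]

backends = ["10.244.1.2", "10.244.2.4", "10.244.2.5"]
# The same client always lands on the same backend pod:
print(source_hash("192.0.2.10", backends))
```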

eBPF Mode (Cilium)

Cilium replaces kube-proxy entirely with eBPF programs. Service routing is compiled into BPF maps and programs attached to the network interfaces. Connection tracking, NAT, and load balancing all happen in eBPF, bypassing both iptables and IPVS. This is the highest-performance option, with the added benefit that eBPF programs can implement more sophisticated logic like socket-level load balancing (connecting sockets directly to backend pods, avoiding DNAT entirely for intra-node traffic).

Kubernetes Service Types

Kubernetes defines several Service types that control how a Service is exposed:

ClusterIP

The default Service type. A virtual IP (e.g., 10.96.0.1) is allocated from the Service CIDR and is only reachable from within the cluster. kube-proxy (or its replacement) programs every node to DNAT traffic destined for this IP to a backend pod. This is the type used for internal service-to-service communication.

NodePort

Exposes the Service on every node's IP at a static port (range 30000-32767). External clients can reach the Service by connecting to any node's IP on the NodePort. The node receiving the traffic performs DNAT to a backend pod, which may be on a different node, adding an extra network hop.

LoadBalancer

In cloud environments, this type provisions an external load balancer (like an AWS NLB or GCP load balancer) that distributes traffic to the NodePorts on each node. The cloud load balancer has a public IP and health-checks the nodes. This is how most Kubernetes applications are exposed to the internet -- and the external load balancer's IP is routed via BGP like any other public address.

ExternalName

Maps a Service to an external DNS name (CNAME). No proxying occurs. This is simply a DNS alias.

NetworkPolicy: Container Firewalling

Kubernetes NetworkPolicy resources define firewall rules for pods. By default, all pods can communicate with all other pods (the Kubernetes networking model). NetworkPolicy allows you to restrict this by specifying ingress and egress rules based on pod labels, namespace selectors, and CIDR blocks.

A NetworkPolicy looks like:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-api
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web
    ports:
    - port: 8080

This policy allows only pods labeled app: web to reach pods labeled app: api on port 8080. All other ingress to the api pods is denied.
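The selector logic behind this is plain label matching, which a few lines of Python can make concrete (a conceptual model of single-policy evaluation, not a CNI implementation):

```python
def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present."""
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(policy, src_labels, dst_labels, port):
    """Evaluate one policy. If it does not select the destination pod,
    it imposes no restriction here (other policies still might)."""
    if not matches(policy["podSelector"], dst_labels):
        return True
    for rule in policy["ingress"]:
        from_ok = any(matches(f["podSelector"], src_labels) for f in rule["from"])
        port_ok = port in [p["port"] for p in rule["ports"]]
        if from_ok and port_ok:
            return True
    return False

policy = {
    "podSelector": {"app": "api"},
    "ingress": [{"from": [{"podSelector": {"app": "web"}}],
                 "ports": [{"port": 8080}]}],
}
print(ingress_allowed(policy, {"app": "web"}, {"app": "api"}, 8080))  # True
print(ingress_allowed(policy, {"app": "db"},  {"app": "api"}, 8080))  # False
```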

NetworkPolicy enforcement is handled by the CNI plugin, not by Kubernetes itself. Calico implements policies as iptables rules or eBPF programs on each node. Cilium uses eBPF exclusively. Flannel does not implement NetworkPolicy at all, which is why many clusters use Flannel for networking paired with Calico for policy (a combination called "Canal").

Service Mesh: The Networking Layer Above

A service mesh adds another layer of networking abstraction on top of Kubernetes CNI. Meshes like Istio, Linkerd, and Cilium Service Mesh inject a sidecar proxy (typically Envoy) into every pod. All traffic in and out of the pod flows through this proxy.

The service mesh data plane provides capabilities that CNI and kube-proxy do not:

  - Mutual TLS (mTLS): automatic encryption and cryptographic workload identity for all pod-to-pod traffic
  - Layer-7 routing: HTTP-aware traffic splitting, retries, timeouts, and canary releases
  - Observability: per-request metrics, distributed tracing, and access logs with no application changes
  - Resilience: circuit breaking and outlier detection beyond what connection-level load balancing can offer

Each proxy in the mesh maintains connections to the other proxies, creating an overlay of application-aware connections on top of the CNI network. The control plane (like Istio's istiod) distributes configuration, certificates, and service discovery information to all the proxies.

Newer service meshes are moving away from sidecar proxies. Cilium uses eBPF to implement mesh functionality directly in the kernel, avoiding the memory and CPU overhead of sidecar containers. Istio's "ambient mesh" mode uses per-node proxies instead of per-pod sidecars.

How It All Fits Together

Container networking is a stack of abstractions, each solving a different problem:

  1. Network namespaces provide isolation -- each container has its own network stack
  2. Veth pairs provide connectivity -- virtual cables between namespaces
  3. Bridges provide local switching -- containers on the same host can communicate
  4. CNI plugins (VXLAN, BGP, eBPF) provide cross-node routing -- pods on different hosts can communicate
  5. kube-proxy (iptables, IPVS, eBPF) provides service abstraction -- stable virtual IPs load-balanced across pods
  6. NetworkPolicy provides segmentation -- firewall rules between pods
  7. Service mesh provides application-layer networking -- mTLS, L7 routing, observability

At the bottom of this stack, everything reduces to IP packets flowing over Ethernet and being routed by the same protocols that power the internet. When Calico announces a pod CIDR via BGP, it is doing fundamentally the same thing that Cloudflare (AS13335) does when it announces its anycast prefixes. When kube-proxy creates a DNAT rule for a ClusterIP Service, it is doing the same thing a home router does for port forwarding. Container networking is not magic -- it is well-understood networking primitives composed into a system that handles the unique challenges of dynamic, ephemeral workloads.
