How Kubernetes Networking Works: Pods, Services, CNI, and Ingress
Kubernetes networking is the set of rules, abstractions, and implementations that govern how pods, services, and external clients communicate in a Kubernetes cluster. Kubernetes imposes a fundamental networking model: every pod gets its own IP address, every pod can reach every other pod without NAT, and agents on a node can communicate with all pods on that node. This "flat network" model is deliberately simple at the specification level, but implementing it requires sophisticated networking -- CNI plugins that build overlay or underlay networks, kube-proxy rules that implement service load balancing, Ingress controllers that route external HTTP traffic, and NetworkPolicy enforcement that provides segmentation. Understanding how each of these layers works -- from the packet level through the kernel data path to the API abstractions -- is essential for anyone running production workloads on Kubernetes.
The Kubernetes Networking Model
Kubernetes defines three fundamental networking requirements (formalized in the Kubernetes networking design document):
- Pod-to-pod: Every pod can communicate with every other pod in the cluster without NAT. Each pod has a unique cluster-wide IP address. A process in pod A can connect to pod B's IP on any port and reach the process inside pod B directly.
- Pod-to-service: Kubernetes Services provide stable virtual IPs (ClusterIPs) and DNS names that map to a set of backend pods. Kube-proxy or its replacement implements load balancing across the pod endpoints.
- External-to-service: Traffic from outside the cluster can reach services via NodePort, LoadBalancer, or Ingress resources.
This model is intentionally simple: no NAT between pods, no port mapping, no network address translation. A pod's IP address is the same whether seen from inside the pod, from another pod, or from the node. This simplifies application design because services can use standard networking -- bind to a port, connect to an IP -- without worrying about container networking translations.
The Kubernetes project does not implement this model itself. Instead, it delegates pod networking to plugins that conform to the Container Network Interface (CNI) specification -- a CNCF project that defines how container runtimes invoke network plugins -- and leaves the actual implementation to those plugins.
CNI Plugins: Building the Pod Network
A CNI plugin is responsible for assigning IP addresses to pods, configuring network interfaces in pod network namespaces, and establishing connectivity between pods on the same node and across nodes. Different CNI plugins use different techniques to achieve this.
Overlay Networks (VXLAN, Geneve)
Overlay CNI plugins (Flannel, Calico in VXLAN mode, Weave Net) encapsulate pod traffic in an outer packet (VXLAN or Geneve encapsulation) to tunnel it across the underlying node network. Each node gets a pod subnet (e.g., 10.244.1.0/24 for node 1, 10.244.2.0/24 for node 2), and cross-node pod traffic is encapsulated in a UDP packet addressed to the destination node's IP.
The advantage of overlays is simplicity and portability: the underlay network (physical switches, cloud VPC) needs no modification and no awareness of pod IPs. The disadvantage is overhead: encapsulation adds ~50 bytes per packet (VXLAN header), reduces the effective MTU, and requires encap/decap processing on each node. For high-throughput workloads, overlay overhead can reduce throughput by 5-15%.
Underlay/Routed Networks (BGP)
Routed CNI plugins (Calico in BGP mode, Cilium in native routing mode) avoid encapsulation entirely. Instead, they announce pod subnets to the network via BGP, making pod IPs routable on the physical network. Each node runs a BGP speaker (Calico uses BIRD) that peers with the top-of-rack (ToR) switch or a BGP route reflector, announcing the node's pod CIDR range.
With BGP-routed networking, pod packets traverse the physical network natively with no encapsulation overhead. The ToR switch knows that 10.244.2.0/24 is reachable via node 2's IP (192.168.1.11) because Calico's BIRD daemon announced that route. This approach provides the best performance (no encapsulation overhead, full MTU available) but requires network infrastructure that supports BGP peering with cluster nodes -- not possible in all environments (some cloud VPCs do not allow BGP).
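As a sketch of how this is configured, Calico models BGP peering as Kubernetes resources. The manifests below are illustrative only -- the peer IP and AS numbers are placeholders, and the BGPConfiguration shown disables Calico's default full node-to-node mesh in favor of ToR peering:

```yaml
# Illustrative Calico BGP setup: peer every node with a ToR switch
# (peerIP/asNumber values are placeholders for this example).
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-rack1
spec:
  peerIP: 192.168.1.1      # the ToR switch's address
  asNumber: 64512          # the ToR switch's AS
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # rely on the ToR fabric, not a full mesh
  asNumber: 64513                # the AS the cluster nodes announce from
```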
eBPF-Based Networking (Cilium)
Cilium uses eBPF programs attached to network interfaces to implement pod networking, load balancing, and network policy enforcement entirely in the Linux kernel. Instead of iptables rules (which are evaluated sequentially and scale poorly), Cilium's eBPF programs use hash maps for O(1) lookups and can make forwarding decisions at the earliest possible point in the kernel's network stack (XDP for ingress, tc for egress).
Cilium can operate in overlay mode (VXLAN/Geneve) or native routing mode. In native routing mode with BGP (using Cilium's built-in BGP speaker or MetalLB), it provides the same routed networking as Calico but with eBPF's performance advantages for service load balancing and network policy enforcement.
Cloud-Native CNI (AWS VPC CNI, Azure CNI)
Cloud-specific CNI plugins assign pods real VPC IP addresses from the cloud provider's IPAM system. AWS VPC CNI assigns secondary IP addresses from the node's VPC subnet to each pod, making pod IPs native VPC addresses that are routable across VPC peering connections, transit gateways, and VPNs without any overlay or encapsulation.
The limitation is IP address consumption: each pod consumes a VPC IP address. In large clusters, this can exhaust the subnet's address space. AWS mitigates this with prefix delegation (assigning /28 blocks to ENIs instead of individual IPs) and secondary CIDR ranges.
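As a rough sketch, prefix delegation is toggled through an environment variable on the VPC CNI's aws-node DaemonSet; a fragment of that container spec might look like this:

```yaml
# Illustrative fragment of the aws-node DaemonSet container spec.
# With prefix delegation enabled, the CNI attaches /28 prefixes
# (16 addresses each) to ENIs instead of individual secondary IPs.
env:
- name: ENABLE_PREFIX_DELEGATION
  value: "true"
```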
Kubernetes Services: ClusterIP, NodePort, LoadBalancer
A Kubernetes Service is an abstraction that provides a stable network identity (a ClusterIP and a DNS name) for a set of pods selected by a label selector. When pods are created, destroyed, or rescheduled, the Service's endpoints are automatically updated, and clients connecting to the Service's ClusterIP are transparently load-balanced across the current set of healthy pods.
ClusterIP
The default Service type. Kubernetes assigns a virtual IP address from the Service CIDR range (e.g., 10.96.0.0/12). This IP does not correspond to any network interface -- it exists only as a set of iptables DNAT rules or IPVS virtual servers on each node. When a pod sends a packet to the ClusterIP, kube-proxy's rules intercept the packet and rewrite the destination to one of the Service's backend pod IPs.
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  type: ClusterIP
  selector:
    app: payment
  ports:
  - port: 80
    targetPort: 8080
ClusterIP Services are only reachable from within the cluster. They provide the stable endpoint that other services (and service mesh sidecars) use to reach backend pods.
NodePort
Extends ClusterIP by allocating a port (default range: 30000-32767) on every node in the cluster. Traffic arriving at any node's IP on the NodePort is forwarded to the Service's backend pods, even if the pod is running on a different node. This enables external access without a cloud load balancer but exposes services on high-numbered ports and requires clients to know node IP addresses.
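Continuing the earlier example, a NodePort variant of the payment Service might look like this (the explicit nodePort is optional; Kubernetes picks one from the range if it is omitted):

```yaml
# Illustrative NodePort Service: reachable at <any-node-ip>:30080
apiVersion: v1
kind: Service
metadata:
  name: payment-nodeport
spec:
  type: NodePort
  selector:
    app: payment
  ports:
  - port: 80           # ClusterIP port (still allocated)
    targetPort: 8080   # container port
    nodePort: 30080    # must fall in the 30000-32767 range
```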
LoadBalancer
Extends NodePort by provisioning an external load balancer from the cloud provider (AWS NLB/ALB, GCP Network LB, Azure LB). The cloud load balancer receives external traffic and distributes it to the NodePorts on cluster nodes. For on-premises clusters, MetalLB provides LoadBalancer Service support by announcing the Service IP via BGP or L2 ARP/NDP.
Each LoadBalancer Service gets its own external IP, which is expensive in cloud environments (AWS charges per NLB per hour plus per LB capacity unit). This is why Ingress controllers exist: they consolidate many Services behind a single LoadBalancer.
kube-proxy: Implementing Service Load Balancing
kube-proxy is the component that implements Kubernetes Service semantics on each node. It watches the Kubernetes API for Service and Endpoints objects and configures the node's network stack to intercept traffic to ClusterIPs and forward it to backend pods.
iptables Mode (Default)
In iptables mode, kube-proxy creates iptables rules in the nat table that match packets destined for a ClusterIP and DNAT them to a randomly selected backend pod IP. The selection uses iptables' --probability flag in a cascade: the first rule matches 1/3 of connections, the second matches 1/2 of the remaining 2/3, and the final rule catches the rest, so each of the three endpoints receives an equal share:
# Simplified iptables rules for a 3-endpoint Service
-A KUBE-SERVICES -d 10.96.0.100/32 -p tcp --dport 80 \
    -j KUBE-SVC-PAYMENT
-A KUBE-SVC-PAYMENT -m statistic --mode random \
    --probability 0.33333 -j KUBE-SEP-POD1
-A KUBE-SVC-PAYMENT -m statistic --mode random \
    --probability 0.50000 -j KUBE-SEP-POD2
-A KUBE-SVC-PAYMENT -j KUBE-SEP-POD3
-A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.5:8080
-A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.244.2.8:8080
-A KUBE-SEP-POD3 -p tcp -j DNAT --to-destination 10.244.3.2:8080
iptables mode works well for small to medium clusters but has scalability issues: iptables rules are evaluated sequentially, so a cluster with 10,000 Services and 100,000 endpoints generates hundreds of thousands of rules that add measurable latency to every connection setup. Endpoint updates require regenerating the entire iptables rule set, which can take seconds in large clusters and block new connections during the update.
IPVS Mode
IPVS (IP Virtual Server) is a Linux kernel module purpose-built for load balancing. In IPVS mode, kube-proxy creates IPVS virtual servers for each Service and real servers for each endpoint. IPVS uses hash tables for O(1) lookups instead of iptables' linear rule traversal, making it dramatically more scalable for large clusters.
IPVS also supports more load balancing algorithms than iptables' random selection: round-robin, least connections, destination hashing, source hashing, shortest expected delay, and never queue. The algorithm is selected cluster-wide via kube-proxy's --ipvs-scheduler flag (round-robin, rr, is the default).
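Both the mode and the scheduler can also be set in kube-proxy's configuration file; a minimal sketch selecting IPVS with the least-connections scheduler:

```yaml
# Illustrative kube-proxy configuration fragment (passed via --config)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"   # least connections; "rr" (round-robin) is the default
```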
eBPF-Based kube-proxy Replacement (Cilium)
Cilium can completely replace kube-proxy by implementing Service load balancing with eBPF programs. Cilium's eBPF service handler runs at the socket layer (using cgroup/connect4 eBPF hooks), intercepting connections before they even enter the TCP/IP stack. This eliminates the DNAT step entirely -- the application's connect() syscall is transparently rewritten to the backend pod's IP, and the packet is created with the correct destination from the start.
This approach has lower overhead than both iptables and IPVS because there is no per-packet header rewriting, no connection tracking table entry for the DNAT, and no reverse NAT on return packets. Benchmarks show 10-20% throughput improvement for Service-to-Service traffic compared to iptables mode.
DNS for Services
Kubernetes runs a DNS server (CoreDNS, the successor to kube-dns) that provides name resolution for Services and Pods. Every Service gets a DNS record: <service-name>.<namespace>.svc.cluster.local. A ClusterIP Service gets an A/AAAA record pointing to its ClusterIP. A headless Service (clusterIP: None) gets A records for each individual pod IP, enabling client-side load balancing (used by StatefulSets and gRPC clients that need direct pod connections).
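A headless variant of the earlier payment Service is a one-line change -- setting clusterIP to None skips virtual-IP allocation entirely:

```yaml
# Illustrative headless Service: DNS returns the pod IPs directly,
# and no ClusterIP or kube-proxy load balancing is involved.
apiVersion: v1
kind: Service
metadata:
  name: payment-headless
spec:
  clusterIP: None
  selector:
    app: payment
  ports:
  - port: 8080
```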
SRV records provide port discovery: _http._tcp.payment-service.default.svc.cluster.local returns the port number alongside the address. ExternalName Services create CNAME records that alias to an external DNS name, useful for referencing external databases or services within Kubernetes DNS.
Ingress Controllers
An Ingress resource defines rules for routing external HTTP/HTTPS traffic to internal Services based on hostname and URL path. The Ingress resource itself is just a specification -- an Ingress controller watches for Ingress resources and configures its own reverse proxy to implement the routing rules.
Common Ingress controllers:
- NGINX Ingress Controller -- Runs NGINX as a reverse proxy, configuring it via dynamically generated nginx.conf. The most widely deployed Ingress controller.
- Traefik -- Auto-discovers Ingress resources and configures routing dynamically. Supports automatic TLS via Let's Encrypt.
- Envoy-based (Contour, Emissary-ingress) -- Uses Envoy Proxy as the data plane with xDS for dynamic configuration. Contour also implements the Kubernetes Gateway API natively.
- Cloud provider (ALB Ingress Controller, GCE Ingress) -- Provisions cloud load balancers (AWS ALB, GCP HTTP(S) LB) and configures their target groups based on Ingress resources.
The Ingress controller itself is typically deployed as a Deployment or DaemonSet behind a LoadBalancer Service. It consolidates routing for many internal Services behind a single external IP, reducing the number of cloud load balancers needed.
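To make the routing rules concrete, a minimal Ingress sending one host and path prefix to the payment-service from earlier might look like this (the hostname and class name are placeholders, and assume an NGINX Ingress controller is installed):

```yaml
# Illustrative Ingress: api.example.com/payments -> payment-service:80
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-ingress
spec:
  ingressClassName: nginx      # which controller should implement this
  rules:
  - host: api.example.com      # placeholder hostname
    http:
      paths:
      - path: /payments
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 80
```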
Gateway API: The Ingress Successor
The Kubernetes Gateway API (which reached general availability with its v1.0 release in 2023; it is versioned independently of Kubernetes releases) is the successor to Ingress. It separates concerns into three resources: GatewayClass (the infrastructure provider), Gateway (the listener configuration -- ports, protocols, TLS), and HTTPRoute (the routing rules). This separation lets infrastructure teams manage Gateways while application teams independently configure routes, and it supports features that Ingress cannot express: traffic splitting, header-based matching, request mirroring, and gRPC routing.
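As an illustration of the split in responsibilities, an application team's HTTPRoute might attach to a Gateway owned by the platform team and split traffic between a stable and a canary Service (the resource names and the 90/10 split are invented for this sketch):

```yaml
# Illustrative HTTPRoute: weighted traffic split, which plain Ingress
# cannot express. "shared-gateway" and "payment-canary" are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payment-route
spec:
  parentRefs:
  - name: shared-gateway       # Gateway managed by the infrastructure team
  hostnames:
  - api.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /payments
    backendRefs:
    - name: payment-service
      port: 80
      weight: 90               # 90% of matching requests
    - name: payment-canary
      port: 80
      weight: 10               # 10% to the canary
```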
NetworkPolicy: Kubernetes Firewalling
By default, Kubernetes pods can communicate freely with all other pods in the cluster -- there is no network segmentation. NetworkPolicy resources provide pod-level firewalling, specifying which pods can communicate with which other pods and on which ports.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: order-service
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
This policy restricts the payment pods: they can only receive traffic from order-service pods on port 8080, and they can only send traffic to database pods on port 5432. All other traffic is denied.
NetworkPolicy enforcement depends entirely on the CNI plugin. The Kubernetes API server stores the policy, but the CNI plugin must implement it. Flannel does not support NetworkPolicy at all. Calico implements it via iptables or eBPF. Cilium implements it via eBPF with the additional capability of L7 network policies (filtering by HTTP path, method, or headers -- not just L3/L4 tuples). If your CNI does not support NetworkPolicy, the policies are silently ignored -- this is a common operational surprise.
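A common companion to per-service policies like the one above is a namespace-wide default-deny, so that any pod not explicitly covered by an allow rule is isolated. A minimal sketch:

```yaml
# Illustrative default-deny policy: the empty podSelector matches every
# pod in the namespace, and listing both policyTypes with no allow rules
# denies all ingress and egress until other policies open specific paths.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```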
Service Mesh Integration
A service mesh like Istio or Linkerd adds another networking layer on top of Kubernetes Services and CNI. The mesh's sidecar proxies intercept traffic after the CNI has established pod connectivity, wrapping connections in mTLS and providing L7 load balancing, observability, and traffic management.
This creates a layered networking architecture: the CNI provides L3 pod-to-pod connectivity, kube-proxy (or its replacement) provides L4 Service load balancing, and the service mesh provides L7 application-aware routing and security. Each layer serves a distinct purpose, and understanding the interactions between them is essential for debugging network issues in mesh-enabled clusters.
For example, when a mesh is present, kube-proxy's Service load balancing is often redundant -- the sidecar proxy handles load balancing at L7 with more sophisticated algorithms (least requests, consistent hashing). Cilium in kube-proxy replacement mode + Istio ambient mode creates an optimized stack where eBPF handles L3/L4 and Envoy waypoint proxies handle L7, with no per-pod sidecars.
External Connectivity: BGP and MetalLB
For bare-metal Kubernetes clusters (not running on a cloud provider), exposing Services externally requires announcing Service IPs to the network. MetalLB fills this gap by providing LoadBalancer Service support for bare-metal clusters:
- L2 mode -- MetalLB responds to ARP/NDP requests for the Service IP, directing all traffic to a single leader node. Simple but limited: no true load balancing, and failover requires a new leader election (seconds of downtime).
- BGP mode -- MetalLB peers with the network's BGP router and announces the Service IP. The router distributes traffic across nodes via ECMP. This is the production-grade approach: true load balancing, sub-second failover via BGP withdrawal, and integration with the physical network fabric.
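A BGP-mode MetalLB setup is typically expressed as three resources: an address pool, a BGP peer, and an advertisement tying them together. The manifests below are a sketch -- pool ranges, AS numbers, and the peer address are placeholders:

```yaml
# Illustrative MetalLB BGP configuration (all addresses/ASNs are placeholders)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: external-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.0.2.0/24           # pool to allocate LoadBalancer Service IPs from
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512             # AS the cluster speaks from
  peerASN: 64513           # AS of the upstream router
  peerAddress: 192.168.1.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: advertise-external
  namespace: metallb-system
spec:
  ipAddressPools:
  - external-pool          # announce IPs from this pool to the peer
```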
Cilium has also incorporated BGP peering functionality (via GoBGP), allowing it to announce Service IPs and pod CIDRs via BGP without requiring MetalLB as an additional component.
Pod Networking Internals
At the Linux kernel level, each pod runs in its own network namespace. The CNI plugin creates a veth (virtual ethernet) pair: one end is placed in the pod's network namespace (typically named eth0) and the other end is attached to a bridge or routing table in the host namespace. The pod's IP address is configured on its eth0 interface, and a default route points to the host-side veth peer.
For intra-node pod-to-pod traffic, packets traverse the veth pair into the host namespace and are then forwarded to the destination pod's veth pair via the bridge or routing table. For cross-node traffic, the packet exits the host's physical interface (with or without encapsulation, depending on the CNI) and is routed to the destination node.
This architecture means that all pod traffic passes through the host's network stack, where it is subject to iptables rules (kube-proxy Services, NetworkPolicy), eBPF programs (Cilium), and any other kernel networking features. The host's network namespace is the choke point through which all pod traffic flows.
Kubernetes Networking and BGP
BGP plays multiple roles in Kubernetes networking. CNI plugins like Calico use BGP to distribute pod routes across the cluster. MetalLB uses BGP to announce LoadBalancer Service IPs to the upstream network. Multi-cluster networking solutions use BGP to establish routing between clusters. And the cloud networks that host Kubernetes clusters are themselves interconnected via BGP peering between autonomous systems.
For bare-metal Kubernetes deployments, the BGP integration between the cluster and the physical network fabric is a critical design decision. Calico's BGP peering with ToR switches, combined with MetalLB's Service IP announcements, creates a fully routed architecture where pod IPs and Service IPs are first-class citizens on the network -- no overlays, no NAT, and sub-second failover via BGP route withdrawal.
Explore the BGP routing behind your Kubernetes cluster's network with the god.ad BGP Looking Glass. Look up your cloud provider's ASN -- AWS (AS16509), Google Cloud (AS15169), or Azure (AS8075) -- to see how the underlying network infrastructure routes traffic to and from your clusters.