How Cilium and eBPF Networking Works: Replacing iptables in Kubernetes

Cilium is an open-source networking, observability, and security platform for Kubernetes and other container orchestrators, built on eBPF (extended Berkeley Packet Filter). Instead of relying on iptables or kernel modules for packet processing, Cilium attaches eBPF programs directly to kernel hooks at the socket, traffic control (TC), and XDP layers, creating a fully programmable datapath that can enforce network policies, perform load balancing, and provide deep observability -- all without leaving kernel space. Cilium powers the dataplane of several managed Kubernetes offerings, including GKE's Dataplane V2 and Azure CNI Powered by Cilium on AKS, and it is the foundation of projects like Tetragon (runtime security) and Cilium Service Mesh.

Why eBPF Changes Everything

Traditional Linux networking relies on a stack designed decades ago: Netfilter hooks, iptables chains, conntrack tables, and kube-proxy's iptables or IPVS rules. This architecture has fundamental scaling problems. Every Kubernetes Service creates multiple iptables rules, and in a cluster with 10,000 Services, the iptables chain can exceed 100,000 rules, each evaluated linearly for every packet. Rule updates require a full table rewrite, which can take seconds and cause packet drops during the update window.

eBPF solves this by moving packet processing logic into small, verified programs that the kernel JIT-compiles and attaches to specific hooks. These programs run in kernel space with near-native performance, can access hash maps and arrays for O(1) lookups (replacing iptables' O(n) chain traversal), and can be updated atomically without disrupting traffic. Cilium exploits eBPF at multiple attachment points:

[Diagram: the Cilium eBPF datapath]

  - XDP hook at the NIC/driver: DDoS mitigation, load balancing, early drops
  - TC ingress: policy enforcement, DNAT, decapsulation, routing; TC egress: policy enforcement, SNAT, encapsulation
  - Pod-to-pod (east-west) traffic: a cgroup/connect4 hook rewrites the ClusterIP to a pod IP, TC programs on the veth check policy, and bpf_redirect delivers the packet directly to the destination veth (e.g. Pod A, identity 12345, 10.0.1.42 -> Pod B, identity 67890, 10.0.2.17)
  - eBPF maps (shared kernel state): identity map (IP -> security identity), policy map (src_id + dst_id -> verdict), service map (ClusterIP:port -> backends), CT map (connection tracking)
  - Cilium agent (userspace): watches the Kubernetes API, compiles eBPF programs, updates maps, and manages endpoints
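The difference between linear chain traversal and hash-map lookup can be illustrated with a toy sketch (plain Python, not Cilium or iptables code; the rule structure is deliberately simplified to a destination IP and port):

```python
# Toy comparison: linear rule-list matching, as in an iptables chain,
# versus a single hash-map lookup, as in an eBPF hash map.

# A "rule" matches a (dst_ip, dst_port) pair; real rules are far richer.
rules = [(f"10.0.0.{i}", 8080, "ALLOW") for i in range(1, 10_000)]

def linear_verdict(dst_ip, dst_port):
    """O(n): walk the chain until a rule matches, like iptables."""
    for ip, port, verdict in rules:
        if ip == dst_ip and port == dst_port:
            return verdict
    return "DROP"  # default policy when no rule matches

# The same rules as a hash map: one O(1) lookup regardless of rule count.
rule_map = {(ip, port): verdict for ip, port, verdict in rules}

def hashed_verdict(dst_ip, dst_port):
    return rule_map.get((dst_ip, dst_port), "DROP")

print(linear_verdict("10.0.0.9999", 8080))  # scans ~10,000 entries
print(hashed_verdict("10.0.0.9999", 8080))  # one lookup
```

Both return the same verdict; the point is that the map's cost does not grow with the number of rules, which is what makes atomic, large-scale updates practical.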

Identity-Based Security

Traditional network security policies use IP addresses and port numbers to define who can talk to whom. This breaks down in Kubernetes, where pod IPs are ephemeral -- a pod might get IP 10.0.1.42 now and 10.0.1.99 after a restart. Iptables-based CNIs handle this by regenerating rules every time a pod is created or destroyed, which is both slow and error-prone at scale.

Cilium takes a fundamentally different approach: identity-based security. Every endpoint (pod, external CIDR, or service) is assigned a numeric security identity derived from its Kubernetes labels. For example, all pods with labels app=frontend, env=production share the same identity (say, 12345). When a packet is sent from one pod to another, Cilium does not check the source IP against a list of allowed IPs. Instead, it:

  1. Looks up the source IP in the identity map (an eBPF hash map) to find its security identity
  2. Looks up the (source identity, destination identity, port, protocol) tuple in the policy map
  3. Returns an ALLOW or DROP verdict in O(1) time
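The three steps above can be sketched with ordinary dictionaries standing in for the eBPF maps (the IPs and identity numbers are the illustrative values used in this article, not Cilium's real map layouts):

```python
# Sketch of identity-based policy lookup; hypothetical values throughout.

identity_map = {            # eBPF hash map: pod IP -> security identity
    "10.0.1.42": 12345,     # e.g. app=frontend, env=production
    "10.0.2.17": 67890,     # e.g. app=backend, env=production
}

policy_map = {              # (src_id, dst_id, port, proto) -> verdict
    (12345, 67890, 8080, "TCP"): "ALLOW",
}

def verdict(src_ip, dst_ip, port, proto):
    src_id = identity_map.get(src_ip)     # step 1: IP -> identity
    dst_id = identity_map.get(dst_ip)
    key = (src_id, dst_id, port, proto)   # step 2: policy-map lookup
    # step 3: default deny once a policy selects the endpoint
    return policy_map.get(key, "DROP")

print(verdict("10.0.1.42", "10.0.2.17", 8080, "TCP"))  # ALLOW
print(verdict("10.0.1.42", "10.0.2.17", 9090, "TCP"))  # DROP
```

Note that neither lookup depends on how many pods carry each label: adding a thousand more frontend pods adds entries to the identity map but leaves the policy map unchanged.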

This model scales independently of the number of pods. Whether you have 10 pods or 10,000 pods with the label app=frontend, the policy map has the same number of entries because policy is defined in terms of identities, not individual endpoints. Identity allocation is coordinated cluster-wide by the Cilium operator, which ensures the same labels always produce the same identity on every node.

For traffic crossing node boundaries, the source identity must be communicated to the destination node. Cilium supports multiple mechanisms: it can encode the identity in the VXLAN or Geneve tunnel header (using unused bits in the VNI field), or for direct-routing mode, it uses a per-node eBPF map that maps source IPs to identities, synchronized across the cluster.
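As a rough illustration of the tunnel-header approach: a 24-bit identity fits exactly in the VXLAN VNI field (8-byte header per RFC 7348). The packing below is a sketch of the idea, not Cilium's actual datapath code:

```python
import struct

# Sketch: carrying a 24-bit security identity in the VXLAN VNI field.
# RFC 7348 header: flags (1 byte), reserved (3), VNI (3), reserved (1).

def pack_vxlan_header(identity: int) -> bytes:
    assert identity < (1 << 24), "identity must fit in the 24-bit VNI"
    # 0x08 sets the "valid VNI" flag bit; VNI occupies bytes 4-6,
    # so shift it left by 8 within the final 32-bit word.
    return struct.pack("!B3xI", 0x08, identity << 8)

def unpack_identity(header: bytes) -> int:
    (word,) = struct.unpack("!4xI", header)  # skip flags/reserved
    return word >> 8

hdr = pack_vxlan_header(12345)
print(len(hdr), unpack_identity(hdr))  # 8 12345
```

The receiving node's TC ingress program reads the identity straight out of the decapsulated header, so no extra lookup or control-plane round trip is needed on the hot path.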

Kube-Proxy Replacement

Kubernetes' built-in service load balancing is handled by kube-proxy, which traditionally uses iptables or IPVS to redirect traffic from ClusterIP:port to backend pod IPs. Cilium can fully replace kube-proxy with eBPF-based service load balancing that is faster, more scalable, and more feature-rich.

In kube-proxy replacement mode, Cilium handles service resolution at the socket level using cgroup hooks. When a pod calls connect() to a ClusterIP, the eBPF program intercepts the system call, looks up the ClusterIP in the service map, selects a backend using Maglev consistent hashing (or random/round-robin), and rewrites the destination address before the connection is established. The kernel creates the TCP connection directly to the backend pod, completely bypassing conntrack, NAT, and iptables.
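Maglev hashing builds a fixed-size lookup table from per-backend preference lists, so a flow hash maps to a backend in O(1) and backend churn disturbs few existing assignments. A simplified sketch follows (after the published Maglev algorithm; the backend addresses are made up, and this is not Cilium's implementation):

```python
import hashlib

def _h(s: str, salt: str) -> int:
    """Stand-in hash function for the sketch."""
    return int(hashlib.sha256((salt + s).encode()).hexdigest(), 16)

def maglev_table(backends, m=251):
    """Fill an m-slot table (m prime, m >> len(backends)) from each
    backend's preference sequence offset + i*skip mod m."""
    offsets = {b: _h(b, "offset") % m for b in backends}
    skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
    table, filled = [None] * m, 0
    nxt = {b: 0 for b in backends}   # next preference index per backend
    while filled < m:
        for b in backends:
            # walk b's preference list until an empty slot is found
            while True:
                slot = (offsets[b] + nxt[b] * skips[b]) % m
                nxt[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

def pick_backend(table, flow: str):
    """O(1): hash the flow tuple into a table slot."""
    return table[_h(flow, "flow") % len(table)]

table = maglev_table(["10.0.2.17:8080", "10.0.3.8:8080", "10.0.4.5:8080"])
print(pick_backend(table, "10.0.1.42:53211->172.20.0.10:80/TCP"))
```

Because m is prime and each skip is in [1, m-1], every backend's preference sequence eventually visits every slot, so the fill loop always terminates; and because each backend claims slots in its own fixed order, removing one backend leaves most other flows mapped where they were.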

The performance implications are significant: translation happens once per connection at connect() time rather than once per packet, so the per-packet overhead of NAT and connection tracking disappears entirely, and service lookup cost stays constant no matter how many Services the cluster defines.

NodePort and LoadBalancer services are also handled in eBPF: XDP programs on the host's physical interface perform DNAT for incoming NodePort traffic, and DSR (Direct Server Return) mode allows backend pods to reply directly to the client without going through the original node, reducing latency and bandwidth on the ingress node.

Cilium Network Policies

Kubernetes defines a basic NetworkPolicy resource, but it only covers L3/L4 filtering (IP, port, protocol). Cilium implements the standard Kubernetes NetworkPolicy API and extends it with the CiliumNetworkPolicy (CNP) and CiliumClusterwideNetworkPolicy (CCNP) custom resources, which add, among other things:

  - L7-aware rules: filter HTTP traffic by method and path, Kafka by topic, and gRPC by method
  - DNS-aware egress rules (toFQDNs) that permit traffic to destinations by domain name rather than IP
  - Entity-based rules targeting built-in entities such as cluster, host, and world
  - Cluster-wide, non-namespaced policies via CCNP

Hubble: eBPF-Powered Observability

Hubble is Cilium's observability layer. It taps into the eBPF datapath to provide flow-level visibility without any instrumentation in the application. Because Hubble observes traffic at the kernel level, it sees all flows -- even those that are dropped by policy, which traditional monitoring tools miss entirely.

Hubble operates at two levels: the per-node Hubble server runs as part of the Cilium agent on each node and exports flows over a local Unix socket or gRPC endpoint, while Hubble Relay aggregates flows from all nodes and provides a cluster-wide API. The Hubble CLI (hubble observe) and the Hubble UI consume this API to display real-time flow data.

Each Hubble flow record includes: source/destination pod name, namespace, labels, security identity, IP, port, L4 protocol, L7 protocol details (HTTP method, URL, status code, gRPC method, Kafka topic), verdict (forwarded, dropped, error), drop reason (if applicable), and the network policy that caused the verdict. This level of detail makes Hubble invaluable for debugging connectivity issues, verifying policy behavior, and building service dependency maps.
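As a rough picture of how such records are queried, the snippet below models flow records as plain dictionaries (field names paraphrased from the list above, not Hubble's actual schema) and filters them the way a query like hubble observe --verdict DROPPED would:

```python
# Illustrative flow records; values and field names are made up.
flows = [
    {"src": "frontend-7d9f", "dst": "backend-5c2a", "namespace": "prod",
     "l4": "TCP", "port": 8080, "verdict": "FORWARDED"},
    {"src": "frontend-7d9f", "dst": "db-0", "namespace": "prod",
     "l4": "TCP", "port": 5432, "verdict": "DROPPED",
     "drop_reason": "Policy denied"},
]

# Keep only flows the datapath dropped, with their drop reason.
dropped = [f for f in flows if f["verdict"] == "DROPPED"]
for f in dropped:
    print(f'{f["src"]} -> {f["dst"]}:{f["port"]} ({f["drop_reason"]})')
```

This is exactly the class of event that IP-level monitoring misses: the denied connection never produces application logs, but the kernel-level flow record still names the policy verdict.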

Hubble also exports metrics in Prometheus format. You can build Grafana dashboards showing request rates, error rates, and latency distributions per service -- similar to what a service mesh provides, but without sidecar proxies. Hubble metrics are based on eBPF-observed flows, so they capture all traffic including UDP, TCP, and non-HTTP protocols that service mesh telemetry typically misses.

Networking Modes: Overlay, Direct Routing, and DSR

Cilium supports multiple networking modes, each with different tradeoffs:

  - Overlay (VXLAN or Geneve): encapsulates pod traffic in tunnels between nodes. It works on almost any underlying network, but adds roughly 50 bytes of encapsulation overhead per packet.
  - Direct (native) routing: pod IPs are routable on the underlying network, so packets are forwarded without encapsulation. This gives the best datapath performance, but requires the network to carry pod routes (for example via BGP or cloud-provider route tables).
  - DSR (Direct Server Return): for NodePort and LoadBalancer traffic, backend pods reply directly to the client instead of returning through the ingress node, preserving the client source IP and saving a hop.

ClusterMesh: Multi-Cluster Connectivity

Cilium ClusterMesh connects multiple Kubernetes clusters into a unified networking domain. Pods in different clusters can communicate directly using pod IPs, and services can be shared across clusters for global load balancing and failover.

ClusterMesh works by connecting the Cilium agents in each cluster to a shared etcd (or kvstore) that synchronizes endpoint identity information across cluster boundaries. When a pod in Cluster A needs to reach a pod in Cluster B, Cilium knows the remote pod's identity and the tunnel endpoint (node IP) to reach it. Policy enforcement works across cluster boundaries -- you can write a CiliumNetworkPolicy in Cluster A that allows traffic from identities in Cluster B.

Shared services in ClusterMesh are annotated with service.cilium.io/global: "true". When a pod resolves a global service, Cilium returns backends from all clusters, optionally weighted by annotation. If all backends in the local cluster are unhealthy, traffic fails over to remote clusters. This provides active-active multi-cluster load balancing without requiring a service mesh or external load balancer.
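The failover behavior described above can be sketched as a backend-selection function (the cluster names, addresses, and health field are illustrative stand-ins, not Cilium's data structures):

```python
# Hypothetical backend set for a global service spanning two clusters.
backends = [
    {"addr": "10.0.2.17:8080", "cluster": "cluster-a", "healthy": False},
    {"addr": "10.0.3.8:8080",  "cluster": "cluster-a", "healthy": False},
    {"addr": "10.8.1.4:8080",  "cluster": "cluster-b", "healthy": True},
]

def eligible_backends(backends, local_cluster):
    """Prefer healthy local backends; fail over to remote clusters
    only when none remain locally."""
    local = [b for b in backends
             if b["cluster"] == local_cluster and b["healthy"]]
    if local:
        return local
    return [b for b in backends if b["healthy"]]

# Both cluster-a backends are down, so cluster-a pods get the remote one.
print([b["addr"] for b in eligible_backends(backends, "cluster-a")])
```

In the healthy case the same function returns only local backends, which is the active-active behavior the article describes; the failover path only engages when the local set is empty.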

Cilium Service Mesh

Cilium includes built-in service mesh capabilities that operate without sidecar proxies. Traditional service meshes like Istio inject an Envoy sidecar into every pod, creating two additional network hops per request (ingress proxy and egress proxy) and consuming significant CPU and memory. Cilium's approach is different: L4 features (mTLS, retries, circuit breaking at the connection level) are handled entirely in eBPF in the kernel, while L7 features (HTTP routing, header manipulation, gRPC load balancing) use a per-node Envoy instance instead of per-pod sidecars.

This architecture provides several advantages: reduced resource overhead (one Envoy per node instead of one per pod), lower latency (eBPF processing avoids the cost of two extra TCP connections per request), and simpler operations (no sidecar injection, no pod restart required to enable mesh features). Cilium Service Mesh integrates with the Gateway API for ingress and traffic management, and supports mTLS using SPIFFE identities for workload authentication.

Bandwidth Management and Rate Limiting

Cilium provides pod-level bandwidth management using eBPF-based Earliest Departure Time (EDT) rate limiting, which is more efficient than the traditional Token Bucket Filter (TBF) qdisc. You annotate pods with kubernetes.io/egress-bandwidth and kubernetes.io/ingress-bandwidth annotations, and Cilium's eBPF programs enforce the limits by scheduling packet departure times. EDT-based rate limiting achieves smoother throughput, lower jitter, and better burst handling compared to TBF, because it does not queue packets -- it simply stamps each packet with the earliest time it should be transmitted, and the kernel's FQ (Fair Queue) scheduler handles the rest.
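The core EDT idea -- stamp, don't queue -- can be shown in a few lines (a toy pacer in Python, not Cilium's eBPF implementation; timestamps are in microseconds to keep the arithmetic exact):

```python
# Toy EDT pacer: instead of queueing packets, stamp each one with the
# earliest time it may depart, spaced by its serialization time at `rate`.

def edt_stamps_us(packet_sizes, rate_bps, now_us=0):
    """Return a departure timestamp (microseconds) for each packet."""
    stamps, next_departure = [], now_us
    for size in packet_sizes:
        departure = max(now_us, next_departure)
        stamps.append(departure)
        # advance the horizon by this packet's wire time at the rate limit
        next_departure = departure + (size * 8 * 1_000_000) // rate_bps
    return stamps

# Four 1500-byte packets at 12 Mbit/s: 1 ms of wire time each.
print(edt_stamps_us([1500] * 4, 12_000_000))  # [0, 1000, 2000, 3000]
```

In the real datapath the eBPF program writes this timestamp into the packet's skb, and the FQ qdisc releases each packet at its stamped time, so no rate-limiting queue ever builds up in the eBPF layer itself.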

Troubleshooting Cilium

Cilium provides comprehensive debugging tools. Commands commonly run from the Cilium agent pod include:

  - cilium status: overall health of the agent, datapath, and connectivity to the Kubernetes API
  - cilium endpoint list: every local endpoint with its security identity and policy enforcement state
  - cilium policy get: the policy currently loaded into the agent
  - cilium monitor: a live stream of datapath events, including drops and policy verdicts
  - cilium bpf: subcommands for inspecting the underlying eBPF maps (services, connection tracking, policy)
  - hubble observe: flow-level queries against the local node or, via Hubble Relay, the whole cluster

The most common issues in Cilium deployments are: identity conflicts (two unrelated sets of pods accidentally sharing the same identity due to identical labels), policy maps reaching capacity (the default eBPF map size may be too small for clusters with many identities), and MTU misconfigurations (VXLAN overhead requires reducing the container MTU by at least 50 bytes from the host MTU, or enabling eBPF-based MTU clamping).

Cilium, eBPF, and the Network

Cilium represents a fundamental shift in how container networking works -- from static iptables rules to a programmable, identity-aware datapath. By building on eBPF, it achieves performance approaching kernel-bypass solutions like DPDK while remaining fully integrated with the Linux kernel's networking stack. For Kubernetes clusters at scale, Cilium's combination of identity-based security, kube-proxy replacement, and Hubble observability provides a networking layer that is both more performant and more operationally manageable than traditional alternatives.
