How Cilium and eBPF Networking Works: Replacing iptables in Kubernetes

Cilium is an open-source networking, observability, and security platform for Kubernetes and other container orchestrators, built on eBPF (extended Berkeley Packet Filter). Instead of relying on iptables or kernel modules for packet processing, Cilium attaches eBPF programs directly to kernel hooks at the socket, traffic control (TC), and XDP layers, creating a fully programmable datapath that can enforce network policies, perform load balancing, and provide deep observability -- all without leaving kernel space. Cilium powers the dataplane of several managed Kubernetes offerings, including GKE's Dataplane V2 and Azure CNI Powered by Cilium on AKS, and it is the foundation of projects like Tetragon (runtime security) and Cilium Service Mesh.

Why eBPF Changes Everything

Traditional Linux networking relies on a stack designed decades ago: Netfilter hooks, iptables chains, conntrack tables, and kube-proxy's iptables or IPVS rules. This architecture has fundamental scaling problems. Every Kubernetes Service creates multiple iptables rules, and in a cluster with 10,000 Services, the iptables chain can exceed 100,000 rules, each evaluated linearly for every packet. Rule updates require a full table rewrite, which can take seconds and cause packet drops during the update window.

eBPF solves this by moving packet processing logic into small, verified programs that the kernel JIT-compiles and attaches to specific hooks. These programs run in kernel space with near-native performance, can access hash maps and arrays for O(1) lookups (replacing iptables' O(n) chain traversal), and can be updated atomically without disrupting traffic. Cilium exploits eBPF at multiple attachment points:

[Diagram: the Cilium eBPF datapath]

  - XDP hook at the NIC/driver: DDoS mitigation, load balancing, early drops
  - TC ingress: policy enforcement, DNAT, decapsulation, routing; TC egress: policy enforcement, SNAT, encapsulation
  - Pod-to-pod (east-west) traffic: a cgroup/connect4 hook rewrites the ClusterIP to a pod IP, TC programs on the veth check policy, and bpf_redirect delivers the packet directly to the destination veth (e.g. Pod A, identity 12345, 10.0.1.42 -> Pod B, identity 67890, 10.0.2.17)
  - eBPF maps (shared kernel state): identity map (IP -> security identity), policy map (src_id + dst_id -> verdict), service map (ClusterIP:port -> backends), CT map (connection tracking)
  - Cilium agent (userspace): watches the Kubernetes API, compiles eBPF programs, updates maps, and manages endpoints
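The difference between linear chain traversal and hash-map lookup can be illustrated with a toy sketch (plain Python, not Cilium or iptables code; the rule structure is deliberately simplified to a destination IP and port):

```python
# Toy comparison: linear rule-list matching, as in an iptables chain,
# versus a single hash-map lookup, as in an eBPF hash map.

# A "rule" matches a (dst_ip, dst_port) pair; real rules are far richer.
rules = [(f"10.0.0.{i}", 8080, "ALLOW") for i in range(1, 10_000)]

def linear_verdict(dst_ip, dst_port):
    """O(n): walk the chain until a rule matches, like iptables."""
    for ip, port, verdict in rules:
        if ip == dst_ip and port == dst_port:
            return verdict
    return "DROP"  # default policy when no rule matches

# The same rules as a hash map: one O(1) lookup regardless of rule count.
rule_map = {(ip, port): verdict for ip, port, verdict in rules}

def hashed_verdict(dst_ip, dst_port):
    return rule_map.get((dst_ip, dst_port), "DROP")

print(linear_verdict("10.0.0.9999", 8080))  # scans ~10,000 entries
print(hashed_verdict("10.0.0.9999", 8080))  # one lookup
```

Both return the same verdict; the point is that the map's cost does not grow with the number of rules, which is what makes atomic, large-scale updates practical.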

Identity-Based Security

Traditional network security policies use IP addresses and port numbers to define who can talk to whom. This breaks down in Kubernetes, where pod IPs are ephemeral -- a pod might get IP 10.0.1.42 now and 10.0.1.99 after a restart. Iptables-based CNIs handle this by regenerating rules every time a pod is created or destroyed, which is both slow and error-prone at scale.

Cilium takes a fundamentally different approach: identity-based security. Every endpoint (pod, external CIDR, or service) is assigned a numeric security identity derived from its Kubernetes labels. For example, all pods with labels app=frontend, env=production share the same identity (say, 12345). When a packet is sent from one pod to another, Cilium does not check the source IP against a list of allowed IPs. Instead, it:

  1. Looks up the source IP in the identity map (an eBPF hash map) to find its security identity
  2. Looks up the (source identity, destination identity, port, protocol) tuple in the policy map
  3. Returns an ALLOW or DROP verdict in O(1) time
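The three steps above can be sketched with ordinary dictionaries standing in for the eBPF maps (the IPs and identity numbers are the illustrative values used in this article, not Cilium's real map layouts):

```python
# Sketch of identity-based policy lookup; hypothetical values throughout.

identity_map = {            # eBPF hash map: pod IP -> security identity
    "10.0.1.42": 12345,     # e.g. app=frontend, env=production
    "10.0.2.17": 67890,     # e.g. app=backend, env=production
}

policy_map = {              # (src_id, dst_id, port, proto) -> verdict
    (12345, 67890, 8080, "TCP"): "ALLOW",
}

def verdict(src_ip, dst_ip, port, proto):
    src_id = identity_map.get(src_ip)     # step 1: IP -> identity
    dst_id = identity_map.get(dst_ip)
    key = (src_id, dst_id, port, proto)   # step 2: policy-map lookup
    # step 3: default deny once a policy selects the endpoint
    return policy_map.get(key, "DROP")

print(verdict("10.0.1.42", "10.0.2.17", 8080, "TCP"))  # ALLOW
print(verdict("10.0.1.42", "10.0.2.17", 9090, "TCP"))  # DROP
```

Note that neither lookup depends on how many pods carry each label: adding a thousand more frontend pods adds entries to the identity map but leaves the policy map unchanged.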

This model scales independently of the number of pods. Whether you have 10 pods or 10,000 pods with the label app=frontend, the policy map has the same number of entries because policy is defined in terms of identities, not individual endpoints. Identity allocation is coordinated cluster-wide by the Cilium operator, which ensures the same labels always produce the same identity on every node.

For traffic crossing node boundaries, the source identity must be communicated to the destination node. Cilium supports multiple mechanisms: it can encode the identity in the VXLAN or Geneve tunnel header (using unused bits in the VNI field), or for direct-routing mode, it uses a per-node eBPF map that maps source IPs to identities, synchronized across the cluster.
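As a rough illustration of the tunnel-header approach: a 24-bit identity fits exactly in the VXLAN VNI field (8-byte header per RFC 7348). The packing below is a sketch of the idea, not Cilium's actual datapath code:

```python
import struct

# Sketch: carrying a 24-bit security identity in the VXLAN VNI field.
# RFC 7348 header: flags (1 byte), reserved (3), VNI (3), reserved (1).

def pack_vxlan_header(identity: int) -> bytes:
    assert identity < (1 << 24), "identity must fit in the 24-bit VNI"
    # 0x08 sets the "valid VNI" flag bit; VNI occupies bytes 4-6,
    # so shift it left by 8 within the final 32-bit word.
    return struct.pack("!B3xI", 0x08, identity << 8)

def unpack_identity(header: bytes) -> int:
    (word,) = struct.unpack("!4xI", header)  # skip flags/reserved
    return word >> 8

hdr = pack_vxlan_header(12345)
print(len(hdr), unpack_identity(hdr))  # 8 12345
```

The receiving node's TC ingress program reads the identity straight out of the decapsulated header, so no extra lookup or control-plane round trip is needed on the hot path.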

Kube-Proxy Replacement

Kubernetes' built-in service load balancing is handled by kube-proxy, which traditionally uses iptables or IPVS to redirect traffic from ClusterIP:port to backend pod IPs. Cilium can fully replace kube-proxy with eBPF-based service load balancing that is faster, more scalable, and more feature-rich.

In kube-proxy replacement mode, Cilium handles service resolution at the socket level using cgroup hooks. When a pod calls connect() to a ClusterIP, the eBPF program intercepts the system call, looks up the ClusterIP in the service map, selects a backend using Maglev consistent hashing (or random/round-robin), and rewrites the destination address before the connection is established. The kernel creates the TCP connection directly to the backend pod, completely bypassing conntrack, NAT, and iptables.
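Maglev hashing builds a fixed-size lookup table from per-backend preference lists, so a flow hash maps to a backend in O(1) and backend churn disturbs few existing assignments. A simplified sketch follows (after the published Maglev algorithm; the backend addresses are made up, and this is not Cilium's implementation):

```python
import hashlib

def _h(s: str, salt: str) -> int:
    """Stand-in hash function for the sketch."""
    return int(hashlib.sha256((salt + s).encode()).hexdigest(), 16)

def maglev_table(backends, m=251):
    """Fill an m-slot table (m prime, m >> len(backends)) from each
    backend's preference sequence offset + i*skip mod m."""
    offsets = {b: _h(b, "offset") % m for b in backends}
    skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
    table, filled = [None] * m, 0
    nxt = {b: 0 for b in backends}   # next preference index per backend
    while filled < m:
        for b in backends:
            # walk b's preference list until an empty slot is found
            while True:
                slot = (offsets[b] + nxt[b] * skips[b]) % m
                nxt[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

def pick_backend(table, flow: str):
    """O(1): hash the flow tuple into a table slot."""
    return table[_h(flow, "flow") % len(table)]

table = maglev_table(["10.0.2.17:8080", "10.0.3.8:8080", "10.0.4.5:8080"])
print(pick_backend(table, "10.0.1.42:53211->172.20.0.10:80/TCP"))
```

Because m is prime and each skip is in [1, m-1], every backend's preference sequence eventually visits every slot, so the fill loop always terminates; and because each backend claims slots in its own fixed order, removing one backend leaves most other flows mapped where they were.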

The performance implications are significant: translation happens once per connection at connect() time rather than once per packet, so the per-packet overhead of NAT and connection tracking disappears entirely, and service lookup cost stays constant no matter how many Services the cluster defines.

NodePort and LoadBalancer services are also handled in eBPF: XDP programs on the host's physical interface perform DNAT for incoming NodePort traffic, and DSR (Direct Server Return) mode allows backend pods to reply directly to the client without going through the original node, reducing latency and bandwidth on the ingress node.

Cilium Network Policies

Kubernetes defines a basic NetworkPolicy resource, but it only covers L3/L4 filtering (IP, port, protocol). Cilium implements the standard Kubernetes NetworkPolicy API and extends it with the CiliumNetworkPolicy (CNP) and CiliumClusterwideNetworkPolicy (CCNP) custom resources, which add, among other things:

  - L7-aware rules: filter HTTP traffic by method and path, Kafka by topic, and gRPC by method
  - DNS-aware egress rules (toFQDNs) that permit traffic to destinations by domain name rather than IP
  - Entity-based rules targeting built-in entities such as cluster, host, and world
  - Cluster-wide, non-namespaced policies via CCNP

Hubble: eBPF-Powered Observability

Hubble is Cilium's observability layer. It taps into the eBPF datapath to provide flow-level visibility without any instrumentation in the application. Because Hubble observes traffic at the kernel level, it sees all flows -- even those that are dropped by policy, which traditional monitoring tools miss entirely.

Hubble operates at two levels: the per-node Hubble server runs as part of the Cilium agent on each node and exports flows over a local Unix socket or gRPC endpoint, while Hubble Relay aggregates flows from all nodes and provides a cluster-wide API. The Hubble CLI (hubble observe) and the Hubble UI consume this API to display real-time flow data.

Each Hubble flow record includes: source/destination pod name, namespace, labels, security identity, IP, port, L4 protocol, L7 protocol details (HTTP method, URL, status code, gRPC method, Kafka topic), verdict (forwarded, dropped, error), drop reason (if applicable), and the network policy that caused the verdict. This level of detail makes Hubble invaluable for debugging connectivity issues, verifying policy behavior, and building service dependency maps.
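As a rough picture of how such records are queried, the snippet below models flow records as plain dictionaries (field names paraphrased from the list above, not Hubble's actual schema) and filters them the way a query like hubble observe --verdict DROPPED would:

```python
# Illustrative flow records; values and field names are made up.
flows = [
    {"src": "frontend-7d9f", "dst": "backend-5c2a", "namespace": "prod",
     "l4": "TCP", "port": 8080, "verdict": "FORWARDED"},
    {"src": "frontend-7d9f", "dst": "db-0", "namespace": "prod",
     "l4": "TCP", "port": 5432, "verdict": "DROPPED",
     "drop_reason": "Policy denied"},
]

# Keep only flows the datapath dropped, with their drop reason.
dropped = [f for f in flows if f["verdict"] == "DROPPED"]
for f in dropped:
    print(f'{f["src"]} -> {f["dst"]}:{f["port"]} ({f["drop_reason"]})')
```

This is exactly the class of event that IP-level monitoring misses: the denied connection never produces application logs, but the kernel-level flow record still names the policy verdict.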

Hubble also exports metrics in Prometheus format. You can build Grafana dashboards showing request rates, error rates, and latency distributions per service -- similar to what a service mesh provides, but without sidecar proxies. Hubble metrics are based on eBPF-observed flows, so they capture all traffic including UDP, TCP, and non-HTTP protocols that service mesh telemetry typically misses.

Networking Modes: Overlay, Direct Routing, and DSR

Cilium supports multiple networking modes, each with different tradeoffs:

  - Overlay (VXLAN or Geneve): encapsulates pod traffic in tunnels between nodes. It works on almost any underlying network, but adds roughly 50 bytes of encapsulation overhead per packet.
  - Direct (native) routing: pod IPs are routable on the underlying network, so packets are forwarded without encapsulation. This gives the best datapath performance, but requires the network to carry pod routes (for example via BGP or cloud-provider route tables).
  - DSR (Direct Server Return): for NodePort and LoadBalancer traffic, backend pods reply directly to the client instead of returning through the ingress node, preserving the client source IP and saving a hop.

ClusterMesh: Multi-Cluster Connectivity

Cilium ClusterMesh connects multiple Kubernetes clusters into a unified networking domain. Pods in different clusters can communicate directly using pod IPs, and services can be shared across clusters for global load balancing and failover.

ClusterMesh works by connecting the Cilium agents in each cluster to a shared etcd (or kvstore) that synchronizes endpoint identity information across cluster boundaries. When a pod in Cluster A needs to reach a pod in Cluster B, Cilium knows the remote pod's identity and the tunnel endpoint (node IP) to reach it. Policy enforcement works across cluster boundaries -- you can write a CiliumNetworkPolicy in Cluster A that allows traffic from identities in Cluster B.

Shared services in ClusterMesh are annotated with service.cilium.io/global: "true". When a pod resolves a global service, Cilium returns backends from all clusters, optionally weighted by annotation. If all backends in the local cluster are unhealthy, traffic fails over to remote clusters. This provides active-active multi-cluster load balancing without requiring a service mesh or external load balancer.
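The failover behavior described above can be sketched as a backend-selection function (the cluster names, addresses, and health field are illustrative stand-ins, not Cilium's data structures):

```python
# Hypothetical backend set for a global service spanning two clusters.
backends = [
    {"addr": "10.0.2.17:8080", "cluster": "cluster-a", "healthy": False},
    {"addr": "10.0.3.8:8080",  "cluster": "cluster-a", "healthy": False},
    {"addr": "10.8.1.4:8080",  "cluster": "cluster-b", "healthy": True},
]

def eligible_backends(backends, local_cluster):
    """Prefer healthy local backends; fail over to remote clusters
    only when none remain locally."""
    local = [b for b in backends
             if b["cluster"] == local_cluster and b["healthy"]]
    if local:
        return local
    return [b for b in backends if b["healthy"]]

# Both cluster-a backends are down, so cluster-a pods get the remote one.
print([b["addr"] for b in eligible_backends(backends, "cluster-a")])
```

In the healthy case the same function returns only local backends, which is the active-active behavior the article describes; the failover path only engages when the local set is empty.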

Cilium Service Mesh

Cilium includes built-in service mesh capabilities that operate without sidecar proxies. Traditional service meshes like Istio inject an Envoy sidecar into every pod, creating two additional network hops per request (ingress proxy and egress proxy) and consuming significant CPU and memory. Cilium's approach is different: L4 features (mTLS, retries, circuit breaking at the connection level) are handled entirely in eBPF in the kernel, while L7 features (HTTP routing, header manipulation, gRPC load balancing) use a per-node Envoy instance instead of per-pod sidecars.

This architecture provides several advantages: reduced resource overhead (one Envoy per node instead of one per pod), lower latency (eBPF processing avoids the cost of two extra TCP connections per request), and simpler operations (no sidecar injection, no pod restart required to enable mesh features). Cilium Service Mesh integrates with the Gateway API for ingress and traffic management, and supports mTLS using SPIFFE identities for workload authentication.

Bandwidth Management and Rate Limiting

Cilium provides pod-level bandwidth management using eBPF-based Earliest Departure Time (EDT) rate limiting, which is more efficient than the traditional Token Bucket Filter (TBF) qdisc. You annotate pods with kubernetes.io/egress-bandwidth and kubernetes.io/ingress-bandwidth annotations, and Cilium's eBPF programs enforce the limits by scheduling packet departure times. EDT-based rate limiting achieves smoother throughput, lower jitter, and better burst handling compared to TBF, because it does not queue packets -- it simply stamps each packet with the earliest time it should be transmitted, and the kernel's FQ (Fair Queue) scheduler handles the rest.
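The core EDT idea -- stamp, don't queue -- can be shown in a few lines (a toy pacer in Python, not Cilium's eBPF implementation; timestamps are in microseconds to keep the arithmetic exact):

```python
# Toy EDT pacer: instead of queueing packets, stamp each one with the
# earliest time it may depart, spaced by its serialization time at `rate`.

def edt_stamps_us(packet_sizes, rate_bps, now_us=0):
    """Return a departure timestamp (microseconds) for each packet."""
    stamps, next_departure = [], now_us
    for size in packet_sizes:
        departure = max(now_us, next_departure)
        stamps.append(departure)
        # advance the horizon by this packet's wire time at the rate limit
        next_departure = departure + (size * 8 * 1_000_000) // rate_bps
    return stamps

# Four 1500-byte packets at 12 Mbit/s: 1 ms of wire time each.
print(edt_stamps_us([1500] * 4, 12_000_000))  # [0, 1000, 2000, 3000]
```

In the real datapath the eBPF program writes this timestamp into the packet's skb, and the FQ qdisc releases each packet at its stamped time, so no rate-limiting queue ever builds up in the eBPF layer itself.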

Troubleshooting Cilium

Cilium provides comprehensive debugging tools. Commands commonly run from the Cilium agent pod include:

  - cilium status: overall health of the agent, datapath, and connectivity to the Kubernetes API
  - cilium endpoint list: every local endpoint with its security identity and policy enforcement state
  - cilium policy get: the policy currently loaded into the agent
  - cilium monitor: a live stream of datapath events, including drops and policy verdicts
  - cilium bpf: subcommands for inspecting the underlying eBPF maps (services, connection tracking, policy)
  - hubble observe: flow-level queries against the local node or, via Hubble Relay, the whole cluster

The most common issues in Cilium deployments are: identity conflicts (two unrelated sets of pods accidentally sharing the same identity due to identical labels), policy maps reaching capacity (the default eBPF map size may be too small for clusters with many identities), and MTU misconfigurations (VXLAN overhead requires reducing the container MTU by at least 50 bytes from the host MTU, or enabling eBPF-based MTU clamping).

Cilium, eBPF, and the Network

Cilium represents a fundamental shift in how container networking works -- from static iptables rules to a programmable, identity-aware datapath. By building on eBPF, it achieves performance approaching kernel-bypass solutions like DPDK while remaining fully integrated with the Linux kernel's networking stack. For Kubernetes clusters at scale, Cilium's combination of identity-based security, kube-proxy replacement, and Hubble observability provides a networking layer that is both more performant and more operationally manageable than traditional alternatives.
