How Service Meshes Work: Istio, Linkerd, and Envoy Sidecars
A service mesh is a dedicated infrastructure layer for managing service-to-service communication in microservice architectures. It works by deploying a network proxy (called a sidecar) alongside each service instance, forming a mesh of proxies that intercept all inter-service traffic. The mesh provides mutual TLS encryption, load balancing, traffic routing, observability, and resilience features without requiring application code changes. The architecture splits into two planes: the data plane, consisting of sidecar proxies (typically Envoy) that handle every byte of inter-service traffic, and the control plane, a centralized management system that configures the proxies, distributes certificates, and enforces policy. Istio, Linkerd, and Consul Connect are the dominant open-source service mesh implementations, each making different architectural tradeoffs. Understanding how service meshes work requires examining both planes in detail.
The Problem Service Meshes Solve
In a monolithic application, function calls between components are in-process -- they are fast, reliable, and require no network awareness. When you decompose a monolith into microservices, those in-process calls become network calls: HTTP requests, gRPC RPCs, or message queue publishes. Each network call introduces failure modes that did not exist before: the target service may be down, overloaded, or experiencing transient errors. The network between services may be congested, partitioned, or adding unpredictable latency.
Without a service mesh, each service must independently implement:
- Service discovery -- Finding the IP addresses and ports of upstream services
- Load balancing -- Distributing requests across multiple instances of an upstream service
- Retries and timeouts -- Handling transient failures with exponential backoff and deadline propagation
- Circuit breaking -- Stopping requests to a failing upstream to prevent cascade failures
- Mutual TLS -- Authenticating and encrypting every service-to-service connection
- Observability -- Emitting metrics, traces, and access logs for every request
- Traffic control -- Canary deployments, traffic splitting, header-based routing
Implementing these features in application code leads to "fat client" libraries: Netflix's Hystrix (circuit breaking), Ribbon (load balancing), and Eureka (service discovery) were the canonical examples. But these libraries must be reimplemented in every language your services use, must be upgraded in lockstep across hundreds of services, and couple your application to a specific networking framework. A service mesh extracts all of this into infrastructure, making it language-agnostic and operationally independent of application deployments.
The Data Plane: Sidecar Proxies
The data plane is the layer that actually processes traffic. In a sidecar-based service mesh, each service instance (typically a Kubernetes pod) is paired with a proxy process that intercepts all inbound and outbound network traffic. Envoy Proxy is the dominant data plane implementation, used by Istio, Consul Connect, and AWS App Mesh.
Traffic Interception
In Kubernetes, the sidecar proxy intercepts traffic using iptables rules (installed by an init container during pod startup) or, increasingly, eBPF programs. All outbound TCP connections from the application container are transparently redirected to the sidecar's outbound-capture listener (port 15001 in Istio's case). The sidecar determines the intended destination (recovering the original destination IP from the iptables REDIRECT or TPROXY target), applies routing rules, performs TLS operations, and forwards the request to the upstream service's sidecar.
This transparent interception is what makes service meshes "application transparent" -- the application opens a connection to payments-service:8080 as if connecting directly, unaware that the sidecar is handling the connection, wrapping it in mTLS, load-balancing across replicas, and collecting telemetry.
What the Sidecar Does Per-Request
For each outbound request from the application, the Envoy sidecar performs a series of operations:
- Service discovery -- Resolves the service name to a set of endpoint IP addresses using data pushed from the control plane via the xDS API.
- Load balancing -- Selects an endpoint using the configured algorithm (round-robin, least request with power-of-two-choices, ring hash, or random).
- mTLS handshake -- Establishes a mutually authenticated TLS connection to the destination sidecar using SPIFFE-based identities (certificates issued and rotated by the control plane). If the connection is already established from a previous request, Envoy reuses it from the connection pool.
- Authorization check -- Evaluates whether the source service is authorized to call the destination service, based on authorization policies distributed by the control plane.
- Request forwarding -- Forwards the request to the destination sidecar, which forwards it to the local application container.
- Observability -- Emits metrics (request count, latency histogram, error rate), records a distributed trace span, and writes an access log entry.
- Resilience -- If the request fails, applies retry policy (configurable retry count, retry budget, retryable status codes). If the endpoint is consistently failing, applies circuit breaking to stop sending traffic to it (outlier ejection).
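The retry behavior described above is configured declaratively. A minimal sketch of an Istio VirtualService retry policy -- the attempt count, per-try timeout, and retry conditions shown are illustrative values, not recommendations:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-retries   # hypothetical resource name
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
    retries:
      attempts: 3             # illustrative: retry up to 3 times
      perTryTimeout: 2s       # deadline for each individual attempt
      retryOn: 5xx,reset,connect-failure   # Envoy retry conditions
```

The sidecar enforces this policy for every caller of payment-service, so retry behavior changes without redeploying any application.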
The Control Plane
The control plane is the brain of the service mesh. It does not touch any data traffic -- instead, it configures the data plane proxies by pushing configuration, certificates, and routing rules.
Istio's Control Plane (istiod)
Istio is the most widely deployed service mesh. Its control plane, istiod, runs as a single binary (since Istio 1.5; earlier versions split it into Pilot, Citadel, and Galley as separate processes) that performs three functions:
- Pilot (traffic management) -- Watches Kubernetes API server for Service, Endpoint, and Istio custom resources (VirtualService, DestinationRule, Gateway). Translates these into Envoy configuration and pushes it to all sidecar proxies via the xDS gRPC streaming API. When a new pod starts, Pilot immediately pushes a full configuration snapshot. When services change, Pilot pushes incremental updates via delta xDS.
- Citadel (security) -- Acts as a certificate authority (CA) that issues X.509 certificates to each sidecar proxy. Certificates encode the service's SPIFFE identity (e.g., spiffe://cluster.local/ns/default/sa/payment-service) and are rotated automatically every 24 hours by default. This is how mTLS between services is established without any application configuration.
- Galley (configuration validation) -- Validates Istio custom resource definitions and provides configuration processing. In modern Istio, this is integrated into istiod's admission webhook.
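To see how little configuration the mesh operator writes, mesh-wide strict mTLS can be sketched as a single PeerAuthentication resource in Istio's root namespace (resource name and namespace follow Istio's defaults):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace: policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject plaintext inter-service traffic
```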
Linkerd's Control Plane
Linkerd takes a deliberately simpler approach. Instead of Envoy, Linkerd uses its own purpose-built Rust proxy (linkerd2-proxy) that is lighter weight (under 10MB, sub-millisecond p99 latency overhead) but supports fewer features than Envoy. Linkerd's control plane consists of a destination controller (service discovery and routing), an identity controller (mTLS certificate management), and a proxy injector (automatic sidecar injection). Linkerd emphasizes operational simplicity -- it has fewer knobs to turn than Istio but also fewer failure modes.
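As an illustration of the proxy injector's trigger, Linkerd injects linkerd2-proxy into workloads whose namespace (or pod template) carries the linkerd.io/inject annotation; the namespace name here is hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                   # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled     # injector adds the sidecar to new pods here
```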
Mutual TLS in the Mesh
Mutual TLS is arguably the most valuable feature a service mesh provides. Without a mesh, implementing mTLS between all services requires each service to manage its own certificates, configure TLS in its server and client code, implement certificate rotation, and verify peer identities. In a mesh with 200 services, this means managing 200+ certificates, each with rotation schedules, revocation procedures, and trust chain configurations.
The service mesh automates this entirely:
- The control plane runs a certificate authority that issues short-lived certificates (valid for 24 hours or less) to each sidecar.
- Each sidecar receives a certificate with its SPIFFE identity embedded as the SAN (Subject Alternative Name).
- When sidecar A connects to sidecar B, both present their certificates. Each verifies the other's certificate against the mesh CA's root certificate.
- Certificates are automatically rotated before expiration -- no human intervention, no coordination between teams.
- Authorization policies can reference SPIFFE identities: "only the order-service identity can call the payment-service on the /charge endpoint."
This is zero-trust networking in practice: every service-to-service connection is authenticated (via certificate identity), encrypted (via TLS), and authorized (via policy). The network itself is untrusted -- even if an attacker gains access to the pod network, they cannot intercept or forge inter-service traffic without valid mesh certificates.
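The order-service/payment-service rule quoted above could be expressed as an Istio AuthorizationPolicy roughly like this (service names follow the document's example; the selector label and rule shape are a sketch):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-service-charge   # hypothetical resource name
  namespace: default
spec:
  selector:
    matchLabels:
      app: payment-service       # assumed workload label
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity of the calling service, minus the scheme
        principals: ["cluster.local/ns/default/sa/order-service"]
    to:
    - operation:
        paths: ["/charge"]
```

The sidecar in front of payment-service evaluates this policy for every request; callers with any other certificate identity are rejected before the request reaches the application.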
Traffic Management
Service meshes provide sophisticated traffic routing capabilities that enable deployment patterns impossible with traditional load balancers.
Traffic Splitting (Canary Deployments)
Route a percentage of traffic to a new version of a service while the rest continues to the stable version. In Istio, this is configured via VirtualService:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
        subset: v1
      weight: 90
    - destination:
        host: payment-service
        subset: v2
      weight: 10
```
This splits traffic 90/10 between v1 and v2. Unlike DNS-based traffic splitting (which is statistical and TTL-bounded), mesh traffic splitting is applied per request: the sidecar proxy makes a routing decision for every single request, so the observed split tracks the configured ratio closely and weight changes take effect immediately.
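The v1 and v2 subsets referenced by the VirtualService must be defined in a companion DestinationRule; a minimal sketch, assuming the pods carry a version label:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-subsets   # hypothetical resource name
spec:
  host: payment-service
  subsets:
  - name: v1
    labels:
      version: v1          # assumed pod label on stable replicas
  - name: v2
    labels:
      version: v2          # assumed pod label on canary replicas
```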
Header-Based Routing
Route specific requests to specific service versions based on HTTP headers. A common pattern is to route internal test traffic to a canary version:
```yaml
http:
- match:
  - headers:
      x-canary:
        exact: "true"
  route:
  - destination:
      host: payment-service
      subset: v2
- route:
  - destination:
      host: payment-service
      subset: v1
```
This is powerful for testing: QA engineers set a header to route their traffic through the new version while all production traffic goes to the stable version. Combined with distributed tracing, you can follow a tagged request through the entire call graph to verify that the new version behaves correctly.
Fault Injection
Service meshes can inject faults into the data plane to test resilience. You can inject HTTP error responses (e.g., return 500 for 5% of requests to payment-service) or latency (e.g., add 3 seconds of delay to 10% of requests). This enables chaos engineering without modifying application code -- the sidecar proxy injects the fault transparently, and you observe how downstream services handle it.
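Both fault types from this example can be sketched in an Istio VirtualService fault block (percentages and status code mirror the figures above; the resource itself is illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-faults   # hypothetical resource name
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 5           # return 500 for 5% of requests
        httpStatus: 500
      delay:
        percentage:
          value: 10          # add 3s of delay to 10% of requests
        fixedDelay: 3s
    route:
    - destination:
        host: payment-service
```

Removing the resource removes the faults; the application under test is never modified or redeployed.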
Circuit Breaking and Outlier Detection
Circuit breaking prevents a failing service from taking down the entire system. When a service starts returning errors, the circuit breaker "opens" and stops sending traffic to it, allowing it time to recover. This is critical in microservice architectures where a single failing service can cause cascading failures through retry storms.
Envoy's circuit breaking operates on two levels:
- Connection-level circuit breaking -- Limits the maximum number of connections, pending requests, and concurrent retries to any single upstream cluster. When these limits are hit, Envoy immediately returns an error rather than queuing more work.
- Outlier detection (per-endpoint) -- Monitors each individual endpoint within a cluster. If an endpoint returns too many errors (configurable threshold, e.g., 5 consecutive 5xx responses), it is ejected from the load balancing pool for a configurable duration (e.g., 30 seconds). The ejection duration increases exponentially for repeated failures. This is analogous to a per-endpoint circuit breaker.
In Istio, circuit breaking is configured via DestinationRule:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 50
        http2MaxRequests: 200
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
Observability
Because every inter-service request passes through a sidecar proxy, the mesh automatically generates comprehensive telemetry without any application instrumentation:
- Metrics -- Request count, request duration (p50/p90/p99), request size, response size, error rates (4xx, 5xx), retry counts. These are emitted as Prometheus metrics by default in both Istio and Linkerd, with source and destination service labels. This gives you a complete service-to-service traffic matrix.
- Distributed tracing -- The sidecar generates trace spans for each request, annotated with service identity, request path, response code, and latency. Combined with application-level trace context propagation (the application must forward trace headers like x-request-id and traceparent), this provides end-to-end request tracing across the entire call graph.
- Access logs -- Each sidecar can emit structured access logs in JSON or text format for every request, including source/destination identity, request path, response code, latency, and mTLS details.
The observability data enables powerful operational tools: service dependency graphs (auto-generated from traffic metrics), error rate dashboards per service pair, latency heat maps, and golden signal alerting (traffic, errors, latency, saturation) without any application-specific configuration.
Service Mesh Performance Overhead
The sidecar proxy adds latency and resource overhead to every request. This is the most common objection to service mesh adoption, and understanding the actual overhead is critical for deciding whether a mesh is appropriate.
Measured overhead varies by implementation and workload:
- Istio (Envoy sidecar) -- p50 latency overhead: ~0.5-1ms, p99: ~2-5ms. Memory per sidecar: 40-100MB. CPU per sidecar: 0.01-0.1 cores depending on request rate. The overhead is primarily from iptables redirection (two extra kernel-userspace transitions per request), TLS handshakes (amortized over connection reuse), and protocol parsing.
- Linkerd (linkerd2-proxy) -- p50 latency overhead: ~0.3-0.5ms, p99: ~1-2ms. Memory per sidecar: 10-20MB. Linkerd's Rust proxy is lighter weight than Envoy and optimized for the specific sidecar use case, at the cost of fewer features.
The overhead matters most for latency-sensitive, high-throughput services with deep call graphs. If a request traverses 10 services (each with two sidecar hops), the mesh adds 10-20ms of total latency. For services with 100ms+ inherent latency (database queries, external API calls), this is negligible. For sub-millisecond in-memory services, it may be unacceptable.
Sidecarless Service Mesh (Ambient Mode)
Istio's "ambient mesh" (introduced in 2022, reaching general availability in Istio 1.24) eliminates the sidecar proxy for most traffic. Instead of a per-pod sidecar, ambient mode uses a per-node ztunnel (zero-trust tunnel) DaemonSet for L4 traffic (mTLS encryption and basic authorization) and optional per-service waypoint proxies (Envoy instances) for L7 features (HTTP routing, traffic splitting, header-based policies).
This architecture reduces the resource overhead dramatically: instead of N sidecars (one per pod), you have M ztunnels (one per node, where M << N) plus optional waypoint proxies only for services that need L7 features. The tradeoff is that L4-only mode provides encryption and identity but not per-request routing, retries, or L7 observability. Services that need L7 features opt in to waypoint proxies, which are shared across multiple services rather than per-pod.
Service Mesh Implementations Compared
The three major service mesh implementations make distinct architectural choices:
- Istio -- The most feature-rich mesh. Uses Envoy as its data plane, providing the full breadth of Envoy's features (rate limiting, WASM extensions, external authorization, complex routing). Istio's control plane is more complex and resource-intensive than alternatives, and its configuration surface (VirtualService, DestinationRule, Gateway, AuthorizationPolicy, PeerAuthentication, etc.) has a steep learning curve. Best for organizations that need advanced traffic management and are willing to invest in operational expertise.
- Linkerd -- Prioritizes simplicity and operational lightness. Its Rust proxy is purpose-built for the sidecar use case, with lower overhead than Envoy but fewer features. Linkerd auto-detects protocols (HTTP/1.1, HTTP/2, gRPC) and does not require explicit protocol declaration. Configuration is simpler than Istio's. Best for teams that want mTLS, observability, and basic traffic management without the operational weight of Istio.
- Consul Connect -- Part of HashiCorp's Consul service mesh. Works across Kubernetes and non-Kubernetes environments (VMs, bare metal), making it the best choice for hybrid deployments. Uses Envoy as its data plane but with Consul's service discovery and intention-based authorization model. Integrates with HashiCorp Vault for certificate management.
When to Use (and Not Use) a Service Mesh
A service mesh adds significant operational complexity. It is justified when:
- You have 20+ microservices with complex inter-service communication patterns
- You need zero-trust security (mTLS everywhere) without modifying application code
- You need uniform observability across services written in multiple languages
- You need advanced deployment patterns (canary, traffic splitting, fault injection)
- Your compliance requirements mandate encrypted service-to-service communication
A service mesh is overkill when:
- You have fewer than 10 services and simple communication patterns
- Your services already implement their own mTLS and observability
- The latency overhead of sidecar proxies is unacceptable for your workload
- You do not have the team expertise to operate and debug the mesh infrastructure
Service Mesh and Network Infrastructure
A service mesh operates at the application layer within a cluster, but it does not exist in isolation from the underlying network. The mesh's mTLS connections traverse the Kubernetes CNI network, which traverses the physical or virtual network fabric, which ultimately connects to the broader internet via BGP routing.
For multi-cluster service mesh deployments (Istio's multi-cluster mode, Linkerd's multi-cluster gateway), inter-cluster traffic flows over the wide-area network. This traffic is subject to the same BGP routing, peering relationships, and latency constraints as any other internet traffic. Understanding the AS paths and peering topology between your clusters' networks is essential for designing multi-cluster mesh architectures that meet latency requirements.
Explore the routing infrastructure underlying your service mesh's network paths with the god.ad BGP Looking Glass. Look up your cloud provider's ASN -- AWS (AS16509), Google Cloud (AS15169), or Azure (AS8075) -- to see the BGP routes that carry your inter-cluster mesh traffic.