How Service Meshes Work: Istio, Linkerd, and Envoy Sidecars

A service mesh is a dedicated infrastructure layer for managing service-to-service communication in microservice architectures. It works by deploying a network proxy (called a sidecar) alongside each service instance, forming a mesh of proxies that intercept all inter-service traffic. The mesh provides mutual TLS encryption, load balancing, traffic routing, observability, and resilience features without requiring application code changes. The architecture splits into two planes: the data plane, consisting of sidecar proxies (typically Envoy) that handle every byte of inter-service traffic, and the control plane, a centralized management system that configures the proxies, distributes certificates, and enforces policy. Istio, Linkerd, and Consul Connect are the dominant open-source service mesh implementations, each making different architectural tradeoffs. Understanding how service meshes work requires examining both planes in detail.

The Problem Service Meshes Solve

In a monolithic application, function calls between components are in-process -- they are fast, reliable, and require no network awareness. When you decompose a monolith into microservices, those in-process calls become network calls: HTTP requests, gRPC RPCs, or message queue publishes. Each network call introduces failure modes that did not exist before: the target service may be down, overloaded, or experiencing transient errors. The network between services may be congested, partitioned, or adding unpredictable latency.

Without a service mesh, each service must independently implement:

  1. Service discovery -- locating healthy instances of each dependency.
  2. Load balancing -- spreading requests across those instances.
  3. Retries, timeouts, and circuit breaking -- containing transient failures before they cascade.
  4. TLS -- encrypting traffic and verifying peer identity.
  5. Observability -- emitting consistent metrics, traces, and logs for every call.

Implementing these features in application code leads to "fat client" libraries: Netflix's Hystrix (circuit breaking), Ribbon (load balancing), and Eureka (service discovery) were the canonical examples. But these libraries must be implemented in every language your services use, must be upgraded in lockstep across hundreds of services, and couple your application to a specific networking framework. A service mesh extracts all of this into infrastructure, making it language-agnostic and operationally independent of application deployments.

The Data Plane: Sidecar Proxies

The data plane is the layer that actually processes traffic. In a sidecar-based service mesh, each service instance (typically a Kubernetes pod) is paired with a proxy process that intercepts all inbound and outbound network traffic. Envoy Proxy is the dominant data plane implementation, used by Istio, Consul Connect, and AWS App Mesh.

Traffic Interception

In Kubernetes, the sidecar proxy intercepts traffic using iptables rules configured by an init container (istio-init) before the application starts, or increasingly by a CNI plugin or eBPF programs. All outbound TCP connections from the application container are transparently redirected to the sidecar's listener on port 15001. The sidecar recovers the intended destination (the original destination IP from the iptables REDIRECT or TPROXY target, via getsockopt SO_ORIGINAL_DST), applies routing rules, performs TLS operations, and forwards the request to the upstream service's sidecar.

This transparent interception is what makes service meshes "application transparent" -- the application opens a connection to payments-service:8080 as if connecting directly, unaware that the sidecar is handling the connection, wrapping it in mTLS, load-balancing across replicas, and collecting telemetry.

[Diagram: Service mesh data plane, sidecar architecture. Pod A (order-service) runs the application on :8080 next to an Envoy sidecar; an istio-init container configures iptables to REDIRECT all TCP to :15001 before the app starts. Pod B (payment-service) mirrors this layout. The two sidecars communicate over auto-rotated mTLS using SPIFFE identities. Per-request, the sidecar provides mTLS encryption and identity verification, load balancing and circuit breaking, retries, timeouts, and deadline propagation, metrics, tracing, and access logs, and traffic splitting and canary routing -- all applied transparently while the application speaks plain HTTP/gRPC.]

What the Sidecar Does Per-Request

For each outbound request from the application, the Envoy sidecar performs a series of operations:

  1. Service discovery -- Resolves the service name to a set of endpoint IP addresses using data pushed from the control plane via the xDS API.
  2. Load balancing -- Selects an endpoint using the configured algorithm (round-robin, least requests with power-of-two-choices sampling, ring hash, or random).
  3. mTLS handshake -- Establishes a mutually authenticated TLS connection to the destination sidecar using SPIFFE-based identities (certificates issued and rotated by the control plane). If the connection is already established from a previous request, Envoy reuses it from the connection pool.
  4. Authorization check -- Evaluates whether the source service is authorized to call the destination service, based on authorization policies distributed by the control plane.
  5. Request forwarding -- Forwards the request to the destination sidecar, which forwards it to the local application container.
  6. Observability -- Emits metrics (request count, latency histogram, error rate), records a distributed trace span, and writes an access log entry.
  7. Resilience -- If the request fails, applies retry policy (configurable retry count, retry budget, retryable status codes). If the endpoint is consistently failing, applies circuit breaking to stop sending traffic to it (outlier ejection).
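The resilience step maps onto the retry and timeout fields of an Istio VirtualService. A minimal sketch, with illustrative values and service name:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
    timeout: 10s            # overall deadline, including all retries
    retries:
      attempts: 3           # up to 3 retries after the initial attempt
      perTryTimeout: 2s     # deadline for each individual attempt
      retryOn: 5xx,connect-failure,reset

The retryOn values are passed through to Envoy's retry policy, so retries apply only to the listed failure classes rather than to every error.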

The Control Plane

The control plane is the brain of the service mesh. It does not touch any data traffic -- instead, it configures the data plane proxies by pushing configuration, certificates, and routing rules.

Istio's Control Plane (istiod)

Istio is the most widely deployed service mesh. Its control plane, istiod, runs as a single binary (since Istio 1.5; earlier versions split it into Pilot, Citadel, and Galley as separate processes) that performs three functions:

  1. Configuration distribution (formerly Pilot) -- translates high-level routing rules (VirtualService, DestinationRule) into concrete Envoy configuration and pushes it to every sidecar over the xDS API.
  2. Certificate authority (formerly Citadel) -- issues and rotates the SPIFFE workload certificates the sidecars use for mTLS.
  3. Configuration validation and sidecar injection (formerly Galley and the injector webhook) -- validates mesh configuration and automatically injects the sidecar container into pods via a mutating admission webhook.

Linkerd's Control Plane

Linkerd takes a deliberately simpler approach. Instead of Envoy, Linkerd uses its own purpose-built Rust proxy (linkerd2-proxy) that is lighter weight (under 10MB, sub-millisecond p99 latency overhead) but supports fewer features than Envoy. Linkerd's control plane consists of a destination controller (service discovery and routing), an identity controller (mTLS certificate management), and a proxy injector (automatic sidecar injection). Linkerd emphasizes operational simplicity -- it has fewer knobs to turn than Istio but also fewer failure modes.
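The proxy injector works through a single annotation on the workload (or its namespace). A minimal sketch, with an illustrative Deployment name and image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
      annotations:
        linkerd.io/inject: enabled   # injector webhook adds the linkerd2-proxy container
    spec:
      containers:
      - name: app
        image: example/payment-service:latest   # illustrative image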

Mutual TLS in the Mesh

Mutual TLS is arguably the most valuable feature a service mesh provides. Without a mesh, implementing mTLS between all services requires each service to manage its own certificates, configure TLS in its server and client code, implement certificate rotation, and verify peer identities. In a mesh with 200 services, this means managing 200+ certificates, each with rotation schedules, revocation procedures, and trust chain configurations.

The service mesh automates this entirely:

  1. The control plane runs a certificate authority that issues short-lived certificates (valid for 24 hours or less) to each sidecar.
  2. Each sidecar receives a certificate with its SPIFFE identity embedded as the SAN (Subject Alternative Name).
  3. When sidecar A connects to sidecar B, both present their certificates. Each verifies the other's certificate against the mesh CA's root certificate.
  4. Certificates are automatically rotated before expiration -- no human intervention, no coordination between teams.
  5. Authorization policies can reference SPIFFE identities: "only the order-service identity can call the payment-service on the /charge endpoint."

This is zero-trust networking in practice: every service-to-service connection is authenticated (via certificate identity), encrypted (via TLS), and authorized (via policy). The network itself is untrusted -- even if an attacker gains access to the pod network, they cannot intercept or forge inter-service traffic without valid mesh certificates.
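In Istio, this model is expressed as two resources: a PeerAuthentication that requires mTLS, and an AuthorizationPolicy that references SPIFFE identities. A sketch of the "only order-service can call /charge" policy above, assuming illustrative orders and payments namespaces and an order-service service account:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT            # reject any plaintext connection to workloads here
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-order-service
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/orders/sa/order-service"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/charge"]

The principal string is the SPIFFE identity derived from the caller's Kubernetes service account, verified from its mTLS certificate rather than from anything the caller asserts.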

Traffic Management

Service meshes provide sophisticated traffic routing capabilities that enable deployment patterns impossible with traditional load balancers.

Traffic Splitting (Canary Deployments)

Route a percentage of traffic to a new version of a service while the rest continues to the stable version. In Istio, this is configured via VirtualService:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
        subset: v1
      weight: 90
    - destination:
        host: payment-service
        subset: v2
      weight: 10

This splits traffic 90/10 between v1 and v2. Unlike DNS-based traffic splitting (which is statistical and TTL-bounded), mesh traffic splitting is per-request: the sidecar proxy makes a routing decision for every single request, so the observed split tracks the configured ratio closely and takes effect immediately when the weights change.
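The v1 and v2 subsets referenced above are not defined in the VirtualService itself; they come from a companion DestinationRule that maps subset names to pod labels (label values are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  subsets:
  - name: v1
    labels:
      version: v1          # pods labeled version=v1 form the stable subset
  - name: v2
    labels:
      version: v2          # pods labeled version=v2 form the canary subset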

Header-Based Routing

Route specific requests to specific service versions based on HTTP headers. A common pattern is to route internal test traffic to a canary version:

  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: payment-service
        subset: v2
  - route:
    - destination:
        host: payment-service
        subset: v1

This is powerful for testing: QA engineers set a header to route their traffic through the new version while all production traffic goes to the stable version. Combined with distributed tracing, you can follow a tagged request through the entire call graph to verify that the new version behaves correctly.

Fault Injection

Service meshes can inject faults into the data plane to test resilience. You can inject HTTP error responses (e.g., return 500 for 5% of requests to payment-service) or latency (e.g., add 3 seconds of delay to 10% of requests). This enables chaos engineering without modifying application code -- the sidecar proxy injects the fault transparently, and you observe how downstream services handle it.
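In Istio, both fault types from the example above are expressed in a VirtualService fault block. A sketch, reusing the illustrative service name:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 5         # return HTTP 500 for 5% of requests
        httpStatus: 500
      delay:
        percentage:
          value: 10        # add a fixed 3s delay to 10% of requests
        fixedDelay: 3s
    route:
    - destination:
        host: payment-service

Because the sidecar injects the fault, the upstream service never sees the affected requests at all; only its callers experience the failure.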

Circuit Breaking and Outlier Detection

Circuit breaking prevents a failing service from taking down the entire system. When a service starts returning errors, the circuit breaker "opens" and stops sending traffic to it, allowing it time to recover. This is critical in microservice architectures where a single failing service can cause cascading failures through retry storms.

Envoy's circuit breaking operates on two levels:

  1. Connection pool limits -- caps on concurrent connections, pending requests, and in-flight requests to an upstream cluster; requests beyond the limits fail fast instead of queuing.
  2. Outlier detection -- per-endpoint health tracking that ejects an endpoint from the load-balancing pool after a configured number of consecutive errors, for a configurable ejection period.

In Istio, circuit breaking is configured via DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 50
        http2MaxRequests: 200
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Observability

Because every inter-service request passes through a sidecar proxy, the mesh automatically generates comprehensive telemetry without any application instrumentation:

  1. Metrics -- request rate, error rate, and latency histograms for every service pair, typically exported in Prometheus format.
  2. Distributed traces -- the sidecar records a span for each hop (the application must still propagate trace context headers, such as traceparent or the x-b3-* set, for spans to join into a single trace).
  3. Access logs -- a structured log entry per request with source, destination, path, status, and timing.

The observability data enables powerful operational tools: service dependency graphs (auto-generated from traffic metrics), error rate dashboards per service pair, latency heat maps, and golden signal alerting (traffic, errors, latency, saturation) without any application-specific configuration.

Service Mesh Performance Overhead

The sidecar proxy adds latency and resource overhead to every request. This is the most common objection to service mesh adoption, and understanding the actual overhead is critical for deciding whether a mesh is appropriate.

Measured overhead varies by implementation and workload:

  1. Latency -- Envoy-based sidecars typically add on the order of 0.5-1ms per proxy traversal at typical percentiles (each service-to-service call crosses two sidecars); Linkerd's Rust proxy advertises sub-millisecond p99 overhead.
  2. Resources -- each sidecar consumes memory (tens to hundreds of MB, growing with the size of the mesh's endpoint set) and CPU proportional to request throughput.
  3. Operations -- the control plane itself must be deployed, upgraded, and monitored alongside the workloads.

The overhead matters most for latency-sensitive, high-throughput services with deep call graphs. If a request traverses 10 services (each with two sidecar hops), the mesh adds 10-20ms of total latency. For services with 100ms+ inherent latency (database queries, external API calls), this is negligible. For sub-millisecond in-memory services, it may be unacceptable.

Sidecarless Service Mesh (Ambient Mode)

Istio's "ambient mesh" (introduced in 2022, reaching beta in Istio 1.22 and GA in Istio 1.24) eliminates the sidecar proxy for most traffic. Instead of a per-pod sidecar, ambient mode uses a per-node ztunnel (zero-trust tunnel) DaemonSet for L4 traffic (mTLS encryption and basic authorization) and optional per-service waypoint proxies (Envoy instances) for L7 features (HTTP routing, traffic splitting, header-based policies).

This architecture reduces the resource overhead dramatically: instead of N sidecars (one per pod), you have M ztunnels (one per node, where M << N) plus optional waypoint proxies only for services that need L7 features. The tradeoff is that L4-only mode provides encryption and identity but not per-request routing, retries, or L7 observability. Services that need L7 features opt in to waypoint proxies, which are shared across multiple services rather than per-pod.
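Opting a namespace into ambient mode is a single label (namespace name illustrative); pods in it are then captured by the node's ztunnel with no sidecar injection or pod restarts:

apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    istio.io/dataplane-mode: ambient   # ztunnel handles L4 mTLS for all pods here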

Service Mesh Implementations Compared

The three major service mesh implementations make distinct architectural choices:

  1. Istio -- Envoy data plane, the richest feature set (traffic management, policy, WebAssembly extensibility), the largest configuration surface, and now an optional sidecarless ambient mode.
  2. Linkerd -- purpose-built Rust proxy, a deliberately minimal feature set, the lowest resource overhead, and the simplest operational model.
  3. Consul Connect -- Envoy data plane tied to HashiCorp Consul's service catalog; its main differentiator is first-class support for workloads outside Kubernetes (VMs, bare metal).

When to Use (and Not Use) a Service Mesh

A service mesh adds significant operational complexity. It is justified when:

  1. You run tens or hundreds of services in multiple languages, making per-language networking libraries impractical.
  2. You have compliance or zero-trust requirements to encrypt and authenticate all internal traffic.
  3. You need uniform observability and traffic control (canaries, fault injection) across many teams.
  4. A platform team can own the mesh's deployment, upgrades, and failure modes.

A service mesh is overkill when:

  1. You run a monolith or a handful of services -- a load balancer and a shared library cover the same ground with far less machinery.
  2. All services share one language and already use a mature networking framework.
  3. No team owns the mesh operationally -- an unowned mesh becomes a component in the failure path of every request.

Service Mesh and Network Infrastructure

A service mesh operates at the application layer within a cluster, but it does not exist in isolation from the underlying network. The mesh's mTLS connections traverse the Kubernetes CNI network, which traverses the physical or virtual network fabric, which ultimately connects to the broader internet via BGP routing.

For multi-cluster service mesh deployments (Istio's multi-cluster mode, Linkerd's multi-cluster gateway), inter-cluster traffic flows over the wide-area network. This traffic is subject to the same BGP routing, peering relationships, and latency constraints as any other internet traffic. Understanding the AS paths and peering topology between your clusters' networks is essential for designing multi-cluster mesh architectures that meet latency requirements.

Explore the routing infrastructure underlying your service mesh's network paths with the god.ad BGP Looking Glass. Look up your cloud provider's ASN -- AWS (AS16509), Google Cloud (AS15169), or Azure (AS8075) -- to see the BGP routes that carry your inter-cluster mesh traffic.
