How DPDK Works: Kernel-Bypass Packet Processing at 100 Gbps

DPDK (Data Plane Development Kit) is an open-source framework that bypasses the operating system kernel to process network packets directly in userspace, achieving throughput rates that the kernel networking stack cannot match. Originally developed by Intel and released in 2010, DPDK provides a set of libraries and drivers that allow applications to poll network interface cards (NICs) directly, eliminating the interrupt overhead, context switches, and memory copies that make the kernel's networking path too slow for workloads demanding tens of millions of packets per second. If you have used a high-performance virtual switch, a telecom network function, or a software-defined router that handles traffic for networks visible in a BGP looking glass, you have likely used infrastructure built on DPDK.

The core insight behind DPDK is that the Linux kernel's networking stack was designed for generality, not raw speed. The kernel handles every packet through a layered architecture of interrupts, soft IRQs, socket buffers, protocol processing, and context switches between kernel and userspace. Each of these layers adds latency and consumes CPU cycles. For a web server handling a few thousand requests per second, this overhead is negligible. For a network appliance forwarding 100 Gbps of traffic — roughly 150 million packets per second of minimum-size Ethernet frames — the kernel stack becomes the bottleneck. DPDK removes that bottleneck entirely.

Why the Kernel Networking Stack Is Slow

To understand DPDK, you must first understand what it bypasses. When a packet arrives at a NIC in a standard Linux system, the following sequence occurs:

  1. Hardware interrupt — The NIC signals the CPU via an interrupt. The CPU stops whatever it was doing, saves its register state, and jumps to the interrupt handler. This context switch takes hundreds of nanoseconds and pollutes CPU caches.
  2. Driver processing — The kernel's NIC driver reads the packet descriptor from the NIC's receive ring, allocates an sk_buff (socket buffer) structure, and copies or maps the packet data into kernel memory. Each sk_buff allocation involves the kernel's memory allocator (SLAB/SLUB), which takes locks and can contend across cores.
  3. NAPI / softIRQ — Linux uses NAPI (New API) to batch packet processing. After the initial hardware interrupt, the kernel schedules a soft interrupt (softIRQ) to process queued packets. The softIRQ runs in a special context with its own scheduling constraints, adding another layer of indirection.
  4. Network stack traversal — Each packet passes through netfilter hooks (iptables/nftables rules), routing table lookups, protocol handling (IP, TCP, UDP), and connection tracking. Each layer examines the packet headers and makes forwarding decisions. For a forwarding appliance that does not need TCP termination, this work is wasted.
  5. Socket buffer copy — If the packet is destined for a userspace application, the data is copied from the kernel's sk_buff into the application's buffer via a recv() or read() system call. This system call involves a context switch from userspace to kernel space and back, plus a memory copy.

Each of these steps costs time. Hardware interrupts cost 5-10 microseconds including cache effects. System calls cost 1-2 microseconds. Memory copies cost time proportional to packet size. Multiply these costs by millions of packets per second, and the kernel stack consumes entire CPU cores just moving data. Benchmarks consistently show that the Linux kernel stack tops out at roughly 1-3 million packets per second per core for forwarding workloads, depending on packet size and configuration. A 100 Gbps NIC receiving minimum-size 64-byte frames requires processing approximately 148.8 million packets per second — far beyond what a single core running the kernel stack can handle.
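
The 148.8 Mpps figure falls out of simple wire arithmetic: each 64-byte frame also occupies a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap on the wire, 84 bytes (672 bits) in total. A small helper (hypothetical, not a DPDK API) makes the calculation explicit:

```c
#include <stdint.h>

/* Packets per second achievable at a given line rate for a given frame
 * size. On the wire, each Ethernet frame carries a 7-byte preamble, a
 * 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap in
 * addition to the frame itself. */
static inline double line_rate_pps(double bits_per_sec, uint32_t frame_bytes)
{
    const uint32_t wire_overhead = 7 + 1 + 12;  /* preamble + SFD + IFG */
    double bits_per_frame = (frame_bytes + wire_overhead) * 8.0;
    return bits_per_sec / bits_per_frame;
}
```

line_rate_pps(100e9, 64) evaluates to 100e9 / 672 ≈ 148.8 million packets per second, matching the figure above.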

[Figure: Kernel stack vs DPDK packet path. Kernel path: the NIC raises a hardware IRQ, the interrupt handler and NAPI/softIRQ process the packet, it traverses netfilter/routing/protocol layers, and a syscall copies the sk_buff to the userspace application, yielding roughly 1-3 Mpps per core. DPDK path: the NIC DMAs packets into hugepage memory, a userspace PMD poll loop picks up mbufs, and application logic processes them with the kernel bypassed entirely (no IRQs, no sk_buffs, no copies), yielding roughly 40-80+ Mpps per core, a 10-40x improvement.]

DPDK Architecture: How Kernel Bypass Works

DPDK achieves its performance by taking the NIC away from the kernel entirely. When a DPDK application starts, it unbinds the NIC from the kernel's network driver (e.g., ixgbe, i40e, mlx5_core) and binds it to a userspace-compatible driver — either UIO (Userspace I/O) or VFIO (Virtual Function I/O). From that point on, the kernel has no visibility into the NIC. There are no interrupts, no kernel driver, no sk_buff allocations, and no system calls in the packet processing path. The NIC's registers and DMA rings are memory-mapped directly into the DPDK application's address space.

Environment Abstraction Layer (EAL)

The EAL is DPDK's initialization and platform abstraction layer. When you call rte_eal_init() at program startup, the EAL performs a series of critical setup steps:

  - Parses the EAL command-line arguments (core list, memory options, PCI allow/block lists) before the application sees the remaining argc/argv.
  - Reserves hugepage memory and maps it into the process, recording physical addresses so buffers can be handed to the NIC's DMA engine.
  - Scans the PCI bus and probes devices bound to UIO or VFIO, mapping their registers and rings into the application's address space.
  - Launches one worker thread per logical core and pins each thread to its core with CPU affinity.
  - Detects the NUMA topology so that later allocations can be made socket-local.

Poll Mode Drivers (PMDs)

PMDs are the heart of DPDK. Instead of waiting for the NIC to interrupt the CPU when packets arrive, a PMD continuously polls the NIC's receive descriptor ring in a tight loop. The main loop of a DPDK application looks conceptually like this:

struct rte_mbuf *mbufs[BURST_SIZE];
uint16_t nb_rx, nb_tx, i;

while (1) {
    // Poll the NIC directly — no syscall, no interrupt
    nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST_SIZE);

    if (nb_rx > 0) {
        // Process packets
        for (i = 0; i < nb_rx; i++) {
            process_packet(mbufs[i]);
        }

        // Transmit processed packets
        nb_tx = rte_eth_tx_burst(port_id, queue_id, mbufs, nb_rx);

        // Free any unsent packets
        for (i = nb_tx; i < nb_rx; i++) {
            rte_pktmbuf_free(mbufs[i]);
        }
    }
}

This loop burns 100% of the CPU core. There is no idle state, no sleep, no waiting. The core is dedicated solely to checking for packets and processing them. This is DPDK's fundamental tradeoff: it trades CPU efficiency for latency and throughput. An idle DPDK application consumes the same CPU as a fully loaded one. In return, it processes packets with single-digit microsecond latency and can sustain line rate on 100 Gbps interfaces.

The rte_eth_rx_burst() function does not invoke a system call. It directly reads the NIC's receive descriptor ring from mapped memory, checks which descriptors have been filled by the NIC's DMA engine, and returns pointers to the corresponding mbufs. This entire operation happens in userspace, with no kernel involvement whatsoever.
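
A rough sketch of what that descriptor-ring read looks like, with a simulated "done" flag standing in for the descriptor-done bit the NIC's DMA engine sets (the struct and function names here are illustrative, not a real PMD):

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 16   /* power of two; real NIC rings are larger */

/* Simplified RX descriptor. Real NICs use hardware-defined layouts;
 * 'done' stands in for the bit the DMA engine sets after writing a
 * received packet into the buffer the descriptor points at. */
struct rx_desc {
    volatile uint32_t done;
    void *pkt;               /* stand-in for the mbuf for this slot */
};

/* Poll the ring the way an rx_burst conceptually does: scan forward
 * from the software tail, harvest completed descriptors, and hand the
 * slots back to the "NIC". No syscall, no interrupt — just memory reads. */
static uint16_t rx_burst_sim(struct rx_desc *ring, uint32_t *tail,
                             void **pkts, uint16_t max)
{
    uint16_t n = 0;
    while (n < max && ring[*tail & (RING_SIZE - 1)].done) {
        struct rx_desc *d = &ring[*tail & (RING_SIZE - 1)];
        pkts[n++] = d->pkt;
        d->done = 0;          /* descriptor is now owned by the NIC again */
        (*tail)++;
    }
    return n;
}
```

A real PMD additionally refills descriptors with fresh mbufs and issues the memory barriers the hardware requires; the volatile read of the done flag is the essence of polling.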

DPDK provides PMDs for all major NIC vendors: Intel (ixgbe, i40e, ice), Mellanox/NVIDIA (mlx4, mlx5), Broadcom (bnxt), Amazon (ena for EC2 instances), and many others. Each PMD is tuned for its specific NIC hardware, exploiting hardware-specific features like multiple receive queues, RSS (Receive Side Scaling), and hardware offloads.

Memory Buffers: mbufs and Mempools

In the kernel stack, each packet gets an sk_buff that is individually allocated and freed. This per-packet allocation is a major source of overhead: the kernel's memory allocator must maintain free lists, handle locking for multi-core access, and manage slab caches. DPDK replaces this with mempools — pre-allocated, fixed-size pools of packet buffers called mbufs.

An mbuf pool is created at initialization time with a call like rte_pktmbuf_pool_create("MBUF_POOL", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket_id). This allocates 8,192 mbufs on the specified NUMA socket, with each mbuf containing a fixed-size buffer for packet data. The pool uses a lock-free ring buffer internally, so allocating or freeing an mbuf is an atomic compare-and-swap operation — typically 10-20 nanoseconds versus hundreds of nanoseconds for kernel sk_buff allocation.

Each mbuf has a structure optimized for cache efficiency. The first cache line (64 bytes) contains the most frequently accessed fields: buffer address, data offset, packet length, and port number. Metadata that is accessed less frequently sits in subsequent cache lines. This layout means the common case — read the packet, check the length, forward it — touches only one or two cache lines per packet.
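
A hypothetical struct (illustrative field names, not the real struct rte_mbuf) shows how the hot-fields-first layout can be enforced at compile time:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Hot fields first, padded so that rarely-touched metadata starts on
 * the second cache line. Layout idea only; the real rte_mbuf differs. */
struct pkt_buf {
    /* --- first cache line: touched on every packet --- */
    void     *buf_addr;      /* offset 0:  start of the data buffer   */
    uint32_t  pkt_len;       /* offset 8:  total length, all segments */
    uint16_t  data_off;      /* offset 12: where packet data begins   */
    uint16_t  data_len;      /* offset 14: length in this segment     */
    uint16_t  nb_segs;       /* offset 16: segments in the chain      */
    uint16_t  port;          /* offset 18: input port                 */
    uint8_t   pad0[CACHE_LINE - 20];    /* fill out to 64 bytes       */
    /* --- second cache line: rarely touched metadata --- */
    uint64_t  timestamp;
    struct pkt_buf *next;    /* next segment for chained packets      */
};

_Static_assert(offsetof(struct pkt_buf, timestamp) == CACHE_LINE,
               "cold fields must start on the second cache line");
```

The _Static_assert turns the layout guarantee into a compile error if a field is added carelessly, which is how this kind of cache discipline survives refactoring.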

For packets larger than a single mbuf buffer, DPDK chains mbufs into a linked list, similar to scatter-gather I/O. Jumbo frames or reassembled packets span multiple mbufs without requiring contiguous memory allocation.
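
The chaining idea can be sketched with a minimal segment header (again illustrative, not the real mbuf):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal segment header: a large packet is a chain of fixed-size
 * segments linked by 'next', with no contiguous allocation needed. */
struct seg {
    uint16_t data_len;   /* bytes of packet data in this segment */
    struct seg *next;    /* next segment, NULL on the last one    */
};

/* Total packet length is the sum of segment lengths along the chain.
 * (DPDK caches this sum in the first segment's pkt_len field so the
 * common case never walks the list.) */
static uint32_t chain_len(const struct seg *s)
{
    uint32_t len = 0;
    for (; s != NULL; s = s->next)
        len += s->data_len;
    return len;
}
```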

rte_ring: Lock-Free Inter-Core Communication

When DPDK applications run across multiple cores (which they almost always do), they need to pass packets between cores without locks. The rte_ring data structure is a fixed-size, lock-free FIFO queue implemented with atomic operations. It supports single-producer/single-consumer, multi-producer/single-consumer, and multi-producer/multi-consumer modes.

The ring uses a simple algorithm: it maintains head and tail indexes, and producers and consumers advance these indexes with atomic compare-and-swap (CAS) operations. In the single-producer/single-consumer case, no atomic read-modify-write operations are needed at all — only memory ordering barriers. This gives inter-core packet passing latencies measured in tens of nanoseconds.

rte_ring is used throughout DPDK: it backs mempool allocations, connects pipeline stages across cores, and provides the abstraction for virtual device I/O. The ring size must be a power of two, which allows the index-to-slot mapping to use a bitwise AND instead of a modulo operation — a micro-optimization that matters at 100 million operations per second.
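
A toy single-producer/single-consumer ring shows the masking trick. Unlike the real rte_ring, this sketch omits the memory barriers needed for cross-core use, so it is only correct single-threaded:

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SZ 8            /* must be a power of two */

/* Head and tail are free-running counters; the slot is found with a
 * bitwise AND instead of a modulo, which only works because RING_SZ
 * is a power of two. Unsigned wraparound keeps head - tail correct. */
struct spsc_ring {
    uint32_t head;           /* next slot to write (producer) */
    uint32_t tail;           /* next slot to read (consumer)  */
    void *slots[RING_SZ];
};

static bool ring_enqueue(struct spsc_ring *r, void *obj)
{
    if (r->head - r->tail == RING_SZ)
        return false;                        /* full */
    r->slots[r->head & (RING_SZ - 1)] = obj; /* AND, not % */
    r->head++;
    return true;
}

static bool ring_dequeue(struct spsc_ring *r, void **obj)
{
    if (r->head == r->tail)
        return false;                        /* empty */
    *obj = r->slots[r->tail & (RING_SZ - 1)];
    r->tail++;
    return true;
}
```

The real rte_ring adds release/acquire barriers (and CAS loops in multi-producer mode) around exactly these index updates; the power-of-two masking is the same.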

Hugepages and Memory Architecture

DPDK's reliance on hugepages is not a convenience — it is a hard requirement. Packet processing accesses memory at rates that overwhelm the CPU's TLB when standard 4 KB pages are used. Consider a DPDK application processing 80 million packets per second. Each packet requires at least one mbuf access (packet data), one descriptor access (NIC ring), and often one or more lookup table accesses (flow tables, routing tables). That is at least 240 million memory accesses per second, each requiring a virtual-to-physical address translation.

A modern x86 CPU has a two-level TLB: the L1 DTLB holds 64-72 entries, and the L2 TLB holds 1,536-2,048 entries. With 4 KB pages, these TLBs can cover at most 8 MB of memory — far less than the gigabytes of memory a DPDK application uses for packet buffers, flow tables, and lookup structures. TLB misses trigger a page table walk that takes 10-100 nanoseconds (or more if page table entries are not in cache). At 240 million accesses per second, even a 1% TLB miss rate produces 2.4 million page table walks per second, costing 24-240 milliseconds of CPU time per second.
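
The coverage arithmetic is worth making explicit (the helper is illustrative):

```c
#include <stdint.h>

/* Memory reachable without a TLB miss: entries * page size. */
static uint64_t tlb_coverage_bytes(uint32_t entries, uint64_t page_bytes)
{
    return (uint64_t)entries * page_bytes;
}
```

tlb_coverage_bytes(2048, 4096) gives the 8 MB ceiling above, while tlb_coverage_bytes(1536, 2 << 20) gives the 3 GB figure quoted below for 2 MB hugepages: same TLB, 512x the reach.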

With 2 MB hugepages, the same TLB entries cover 3 GB of memory. With 1 GB hugepages (available on modern x86 CPUs), a single TLB entry covers an entire gigabyte. This effectively eliminates TLB misses for DPDK's working set. The EAL configures hugepages at startup by mounting the hugetlbfs filesystem and mapping the required number of pages:

# Reserve 1024 hugepages of 2MB each (2GB total)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugetlbfs
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages

NUMA Awareness

Modern multi-socket servers have Non-Uniform Memory Access (NUMA) architectures where each CPU socket has its own local memory. Accessing local memory takes roughly 80 nanoseconds, while accessing memory on a remote NUMA node takes 130-200 nanoseconds — a 60-150% penalty. For PCIe devices like NICs, the NUMA effect is even more pronounced: a NIC is physically attached to one CPU socket, and DMA operations between the NIC and memory on the remote socket traverse the inter-socket interconnect (Intel UPI or AMD Infinity Fabric), adding latency and consuming interconnect bandwidth.

DPDK is deeply NUMA-aware. The EAL detects which NUMA node each NIC is attached to and ensures that:

  - Mbuf pools serving a NIC can be created on that NIC's local node by passing the right socket_id to rte_pktmbuf_pool_create().
  - RX and TX descriptor rings land on the local node: rte_eth_rx_queue_setup() takes a socket argument for exactly this reason.
  - rte_eth_dev_socket_id() reports which node a port is attached to, so the application can pin its polling lcores to cores on the same socket.

Violating NUMA locality in a high-rate packet processing application can reduce throughput by 30-50%. DPDK makes NUMA placement explicit: almost every allocation function takes a socket_id parameter, and the application developer is expected to maintain NUMA locality throughout the processing pipeline.

Processing Models: Run-to-Completion vs Pipeline

DPDK applications use two primary architectural models for distributing work across cores:

Run-to-Completion

In the run-to-completion model, each core handles a packet from start to finish. A core receives a packet from the NIC, performs all processing (parsing, classification, lookup, modification, encapsulation), and transmits the result — without passing the packet to any other core. Each core operates on its own set of NIC queues using RSS (Receive Side Scaling) to distribute packets across queues based on flow hashing.

Run-to-completion is the simplest model and often the fastest for straightforward forwarding applications. Each packet touches one core's cache, there is no inter-core communication overhead, and the application scales linearly by adding more cores with more NIC queues. The l3fwd example application in DPDK uses this model, and achieves near-line-rate forwarding on 100 Gbps NICs.
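
The flow-to-queue mapping behind this model can be sketched as follows. Real NICs use a Toeplitz hash and a hardware indirection table, so this stand-in shows only the shape of RSS, not its actual hash:

```c
#include <stdint.h>

#define RETA_SIZE 128        /* redirection table entries (illustrative) */

/* Stand-in flow hash over the 5-tuple fields (not Toeplitz). The only
 * property that matters: the same flow always produces the same hash,
 * so all its packets land on the same queue and therefore one core. */
static uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                          uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip * 2654435761u;
    h ^= dst_ip * 2246822519u;
    h ^= ((uint32_t)src_port << 16 | dst_port) * 3266489917u;
    return h;
}

/* Map the hash to a queue through a redirection table, the way the
 * NIC's RSS indirection table spreads flows across RX queues. */
static uint16_t rss_queue(uint32_t hash, const uint16_t *reta)
{
    return reta[hash % RETA_SIZE];
}
```

Filling the table round-robin over N queues spreads flows evenly while keeping each flow sticky to one queue, which is what makes lock-free per-core processing possible.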

Pipeline Model

In the pipeline model, packet processing is divided into stages, and each stage runs on a different core. For example: core 0 receives packets and classifies them, core 1 performs route lookups, core 2 applies security policies, and core 3 transmits. Packets move between stages via rte_ring queues.

The pipeline model suits complex processing chains where different stages have different computational costs. It also allows stages to be shared across flows — a single classification stage can serve multiple forwarding stages. However, the pipeline model adds inter-core latency (one rte_ring enqueue/dequeue per stage) and increases cache pressure because each packet's data must be loaded into a new core's cache at each stage. For simple forwarding, run-to-completion outperforms pipeline. For complex processing (deep packet inspection, encryption, protocol reassembly), pipeline can better utilize cores by keeping each core's instruction cache warm with a single stage's code.

[Figure: Run-to-completion vs pipeline. Run-to-completion: each core owns one NIC queue and runs the full RX, process, TX path for every packet. Pipeline: core 0 (RX + classify) feeds core 1 (lookup + modify) feeds core 2 (TX) through rte_rings.]

Run-to-completion tradeoffs: no inter-core latency, cache-friendly (one core per packet), linear scaling with cores; but each core needs the full forwarding state, and complex code creates I-cache pressure.

Pipeline tradeoffs: stage specialization, shared state between flows, a warm I-cache per stage; but each hop pays inter-core ring latency, and throughput is capped by the slowest stage.

DPDK vs Kernel Networking: When to Use Each

DPDK is not a replacement for the kernel networking stack. It is a specialized tool for a specific class of problems. The tradeoffs are significant:

  - CPU cost: PMD poll loops consume 100% of their cores whether traffic is flowing or not.
  - Lost kernel integration: a bypassed NIC has no kernel interface, so tcpdump, iptables, routing daemons, and other standard tools no longer see its traffic.
  - No protocol stack: DPDK delivers raw frames, so applications that need TCP must bring a userspace stack or terminate connections elsewhere.
  - Operational complexity: hugepages, NUMA pinning, and driver binding add deployment and debugging burden.

Use DPDK when you need sustained packet rates above 5-10 million packets per second, deterministic sub-10-microsecond latency, or when the kernel stack measurably cannot keep up. For lower-rate workloads, the kernel stack's generality, security isolation, and ecosystem integration outweigh DPDK's raw performance. Many modern kernels with XDP and eBPF can achieve 20-30 million packets per second per core while remaining integrated with the kernel's networking features — a middle ground that has significantly narrowed DPDK's advantage for many use cases.

DPDK in Network Function Virtualization (NFV)

DPDK's most significant deployment is in NFV — the telecom industry's shift from proprietary hardware appliances to software running on commodity servers. Functions that traditionally ran on purpose-built hardware — firewalls, load balancers, routers, session border controllers, deep packet inspection engines — now run as Virtual Network Functions (VNFs) on standard x86 servers. DPDK is the foundation that makes this economically viable.

OVS-DPDK: The Virtual Switch

Open vSwitch (OVS) is the dominant open-source virtual switch in data centers and telecom networks. In its default mode, OVS uses the kernel's datapath for forwarding, which inherits all of the kernel stack's performance limitations. OVS-DPDK replaces the kernel datapath with DPDK, moving all packet forwarding into userspace.

The performance difference is dramatic. Kernel-based OVS can forward 1-2 million packets per second. OVS-DPDK, on the same hardware, can forward 10-20 million packets per second with VXLAN encapsulation, and even more for simple L2 forwarding. For telecom operators running thousands of virtual machines with latency-sensitive VNFs, this difference determines whether software can replace hardware. OVS-DPDK manages the virtual port connections between VMs and the physical NICs. Each virtual machine connects to OVS-DPDK via vhost-user — a userspace implementation of the virtio transport that uses shared memory and eventfd for high-performance VM-to-switch communication, bypassing the kernel in both the host and guest.

VPP (Vector Packet Processing)

FD.io VPP (part of the Linux Foundation) is a DPDK-based packet processing framework developed originally by Cisco. VPP's innovation is vector processing: instead of processing packets one at a time through the entire graph, VPP processes a vector of packets (typically 256) through each graph node before moving to the next node. This keeps the instruction cache hot — each graph node's code is loaded once and applied to all 256 packets — and amortizes the cost of function calls and pipeline stalls across the vector. VPP achieves multi-terabit forwarding rates on commodity hardware and powers the CSIT (Continuous System Integration Testing) benchmarks used by the telecom industry to validate NFV performance.
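
The stage-major iteration order is the whole trick, and a toy version makes it concrete (a "packet" here is just an int, and the stage functions are placeholders for graph nodes):

```c
/* Vector processing sketch: apply each stage to the whole batch before
 * moving to the next stage, so a stage's code stays hot in the I-cache
 * while it runs over all packets (VPP uses vectors of up to 256). */
typedef void (*stage_fn)(int *pkt);     /* toy "packet" = an int */

static void stage_a(int *pkt) { *pkt += 1; }   /* placeholder node work */
static void stage_b(int *pkt) { *pkt *= 2; }   /* placeholder node work */

static void run_vector(int *pkts, int n, stage_fn *stages, int n_stages)
{
    for (int s = 0; s < n_stages; s++)   /* stage-major, not packet-major */
        for (int i = 0; i < n; i++)
            stages[s](&pkts[i]);
}
```

Swapping the two loops gives the per-packet model; keeping them in this order is what amortizes instruction fetch and branch prediction across the vector.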

Userspace TCP Stacks

For applications that need TCP but also need DPDK's performance, several userspace TCP stacks have been built on DPDK:

  - F-Stack: a port of the FreeBSD TCP/IP stack running on DPDK, originally from Tencent.
  - mTCP: a research-born scalable userspace TCP stack with DPDK support.
  - Seastar: the C++ framework behind ScyllaDB, which includes its own TCP/IP stack with a DPDK backend.

SR-IOV and DPDK

Single Root I/O Virtualization (SR-IOV) is a hardware technology that allows a single physical NIC to present multiple virtual NICs (Virtual Functions, or VFs) to the host operating system. Each VF has its own set of queues, interrupts, and DMA channels, and can be assigned directly to a virtual machine or container using PCI passthrough. DPDK supports both the Physical Function (PF) and Virtual Functions (VFs). In a typical NFV deployment:

  - The PF remains under host control and is used to configure and manage the VFs.
  - Each VF is passed through to a VM or container via VFIO, with DMA isolation enforced by the IOMMU.
  - The DPDK application inside the guest binds the VF with the corresponding VF-aware PMD and polls it directly, never touching the host's datapath.

This architecture allows each VNF to achieve near-bare-metal packet processing performance while maintaining the isolation and management benefits of virtualization. Mellanox/NVIDIA ConnectX and Intel E810 NICs support hundreds of VFs, enabling dense VNF deployments with full DPDK performance.

AF_XDP: The Kernel's Answer to DPDK

AF_XDP (Address Family XDP) is a socket type introduced in Linux 4.18 that provides a middle ground between full kernel networking and DPDK-style kernel bypass. AF_XDP uses XDP (eXpress Data Path) to redirect packets from the NIC driver directly to a userspace socket, bypassing most of the kernel networking stack while keeping the NIC under kernel control.

AF_XDP works by creating a shared memory region (UMEM) between kernel and userspace, with ring buffers for packet descriptors. An XDP program attached to the NIC redirects selected packets to the AF_XDP socket via XDP_REDIRECT. The userspace application polls the ring buffers, similar to DPDK's PMD model. The critical difference is that the NIC remains bound to a kernel driver — it still has an IP address, still works with standard tools, and still participates in the kernel's networking for non-AF_XDP traffic.

Performance-wise, AF_XDP achieves 20-30 million packets per second per core — significantly better than the kernel stack's 1-3 Mpps and approaching DPDK's 40-80+ Mpps. For applications that need better-than-kernel performance but cannot tolerate DPDK's operational model (dedicated cores, no kernel integration, no standard tools), AF_XDP is increasingly compelling.

DPDK itself now includes an AF_XDP PMD, so applications can use AF_XDP as a backend while keeping the DPDK programming model. This allows gradual migration or hybrid deployments where some ports use full kernel bypass and others use AF_XDP for flexibility.

io_uring and DPDK

Linux's io_uring interface (introduced in 5.1) provides asynchronous, batched system calls that reduce the per-operation overhead of kernel I/O. While io_uring was designed primarily for storage I/O, its networking support has improved significantly. For TCP-oriented workloads, io_uring networking can approach the throughput of userspace stacks without leaving the kernel, because it batches system calls and reduces context switch overhead.

However, io_uring does not compete with DPDK for raw packet-forwarding workloads. It still uses the kernel's TCP/IP stack, still allocates sk_buff structures, and still processes packets through netfilter. For applications that need line-rate L2/L3 forwarding without TCP termination — the core DPDK use case — io_uring does not change the equation. The two technologies address different layers of the stack.

Performance Numbers in Practice

DPDK's performance has been extensively benchmarked. Representative numbers from published benchmarks and real-world deployments include:

Configuration                                    | Throughput            | Latency
L2 forwarding, 64B frames, single core, 25 GbE   | ~37 Mpps              | <3 us
L3 forwarding, 64B frames, single core, 100 GbE  | ~60 Mpps              | <5 us
OVS-DPDK, VXLAN encap, 64B, 4 cores              | ~15 Mpps              | <20 us
VPP L2 bridge, 64B, 2 cores, 100 GbE             | ~148 Mpps (line rate) | <10 us
IPsec tunnel (AES-GCM-128), 512B, 4 cores        | ~40 Gbps              | <30 us
Linux kernel forwarding (comparison)             | ~1-3 Mpps/core        | ~50-100 us
AF_XDP forwarding (comparison)                   | ~24 Mpps/core         | <10 us

These numbers depend heavily on hardware (NIC model, CPU generation, memory speed, PCIe generation), packet size (smaller packets are harder because the per-packet overhead dominates), and the complexity of processing. The general pattern is consistent: DPDK delivers 10-40x the throughput of kernel networking for forwarding workloads, with 10-50x lower latency.

DPDK in the Real World

DPDK powers critical infrastructure across telecom, cloud, and enterprise networks:

  - Virtual switching: OVS-DPDK carries VM traffic in OpenStack and telecom clouds.
  - Telecom packet cores: 4G/5G user-plane functions and other VNFs built on DPDK or VPP.
  - Load balancers and gateways: high-rate L4 load balancing, NAT, and VPN concentrators.
  - DDoS mitigation: scrubbing appliances that must absorb line-rate floods of small packets.

The DPDK Ecosystem and Alternatives

DPDK is maintained by the Linux Foundation as part of the DPDK project (dpdk.org). Releases follow a quarterly cadence (YY.MM format: 24.03, 24.07, 24.11, etc.), and each release supports an explicit set of NIC hardware. The project includes:

  - The core libraries: EAL, ethdev, mempool, mbuf, and ring, plus higher-level libraries for hash tables, LPM route lookup, crypto, and event scheduling.
  - Poll mode drivers for physical, virtual, and paravirtualized devices.
  - testpmd, a reference application for exercising and benchmarking PMDs.
  - Example applications (l2fwd, l3fwd, and others) and the dpdk-devbind.py utility for moving NICs between kernel and userspace drivers.

Several alternative approaches to high-performance networking compete with or complement DPDK:

  - XDP and AF_XDP: in-kernel fast paths that trade some throughput for kernel integration.
  - netmap and PF_RING ZC: earlier kernel-bypass frameworks with similar memory-mapped ring designs.
  - Snabb: a userspace networking toolkit written in Lua/LuaJIT.
  - SmartNICs and P4-programmable hardware: moving the datapath below software entirely.

Writing Your First DPDK Application

A minimal DPDK application that receives and drops packets (useful for benchmarking NIC receive performance) requires about 100 lines of C code. The essential structure is:

#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS    8191
#define MBUF_CACHE   250
#define BURST_SIZE   32

int main(int argc, char *argv[]) {
    struct rte_mempool *mbuf_pool;
    struct rte_eth_conf port_conf = {0};
    uint16_t port_id = 0;

    // Initialize EAL — parses --lcores, --socket-mem, etc.
    rte_eal_init(argc, argv);

    // Create mbuf pool on NUMA socket 0
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, MBUF_CACHE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    // Configure the Ethernet port
    rte_eth_dev_configure(port_id, 1, 0, &port_conf);

    // Set up RX queue with mbufs from our pool
    rte_eth_rx_queue_setup(port_id, 0, RX_RING_SIZE,
        rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);

    // Start the port
    rte_eth_dev_start(port_id);

    // Poll loop — runs forever on this core
    struct rte_mbuf *bufs[BURST_SIZE];
    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0,
            bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}

Compile this with the DPDK build system (meson/ninja) or pkg-config: gcc -o rxdrop rxdrop.c $(pkg-config --cflags --libs libdpdk). Bind the NIC to a userspace driver first (e.g., dpdk-devbind.py --bind=vfio-pci 0000:03:00.0), then run sudo ./rxdrop -l 0 -a 0000:03:00.0, where -a restricts DPDK to the NIC at that PCI address. The EAL will map the hugepages, pin the main thread to core 0, and the loop will begin polling.

DPDK and Container Networking

DPDK's integration with container networking presents unique challenges. Containers expect standard Linux networking — veth pairs, network namespaces, iptables rules — which conflicts with DPDK's kernel-bypass model. Several approaches bridge this gap:

  - SR-IOV device plugins: Kubernetes device plugins hand VFs into pods, where the DPDK application binds them directly.
  - vhost-user/virtio-user: a containerized DPDK application connects to OVS-DPDK over shared-memory virtio, mirroring the VM model.
  - Dual-interface pods: a secondary attachment (e.g., via Multus with a userspace CNI) carries the DPDK datapath while a standard interface handles control traffic.
  - AF_XDP backends: the AF_XDP PMD lets containers keep kernel-managed NICs while retaining the DPDK API.

The Future of Kernel Bypass

DPDK established kernel bypass as a mainstream approach to high-performance networking. But the landscape continues to evolve. AF_XDP and XDP/eBPF are narrowing the performance gap while maintaining kernel integration. SmartNICs are moving packet processing off the host CPU entirely. Hardware P4 switches allow custom forwarding pipelines at ASIC speeds. And DPDK itself continues adding features — hardware offloads, crypto acceleration, regex matching, and machine learning inference on SmartNIC accelerators.

The underlying trend is clear: the boundary between hardware and software networking continues to blur. DPDK was the first widely adopted framework to make this boundary programmable for commodity hardware. Whether future packet processing happens in userspace (DPDK), in the kernel (XDP/eBPF), or on SmartNIC hardware, the ideas DPDK pioneered — hugepages, poll-mode drivers, NUMA-aware allocation, lock-free data structures, and batch processing — remain foundational to how high-performance networking works.

Explore Network Infrastructure

The networks that deploy DPDK at scale — telecom operators, cloud providers, CDN edges — are all visible in the global BGP routing table. Use the god.ad BGP Looking Glass to examine their route announcements, AS paths, and peering relationships. You can look up any network that handles high-throughput traffic to see how it connects to the rest of the internet, trace the autonomous systems in its forwarding path, and understand the routing infrastructure that DPDK-accelerated appliances serve.
