How DPDK Works: Kernel-Bypass Packet Processing at 100 Gbps
DPDK (Data Plane Development Kit) is an open-source framework that bypasses the operating system kernel to process network packets directly in userspace, achieving throughput rates that the kernel networking stack cannot match. Originally developed by Intel and released in 2010, DPDK provides a set of libraries and drivers that allow applications to poll network interface cards (NICs) directly, eliminating the interrupt overhead, context switches, and memory copies that make the kernel's networking path too slow for workloads demanding tens of millions of packets per second. If you have used a high-performance virtual switch, a telecom network function, or a software-defined router that handles traffic for networks visible in a BGP looking glass, you have likely used infrastructure built on DPDK.
The core insight behind DPDK is that the Linux kernel's networking stack was designed for generality, not raw speed. The kernel handles every packet through a layered architecture of interrupts, soft IRQs, socket buffers, protocol processing, and context switches between kernel and userspace. Each of these layers adds latency and consumes CPU cycles. For a web server handling a few thousand requests per second, this overhead is negligible. For a network appliance forwarding 100 Gbps of traffic — roughly 150 million packets per second of minimum-size Ethernet frames — the kernel stack becomes the bottleneck. DPDK removes that bottleneck entirely.
Why the Kernel Networking Stack Is Slow
To understand DPDK, you must first understand what it bypasses. When a packet arrives at a NIC in a standard Linux system, the following sequence occurs:
- Hardware interrupt — The NIC signals the CPU via an interrupt. The CPU stops whatever it was doing, saves its register state, and jumps to the interrupt handler. This context switch takes hundreds of nanoseconds and pollutes CPU caches.
- Driver processing — The kernel's NIC driver reads the packet descriptor from the NIC's receive ring, allocates an sk_buff (socket buffer) structure, and copies or maps the packet data into kernel memory. Each sk_buff allocation involves the kernel's memory allocator (SLAB/SLUB), which takes locks and can contend across cores.
- NAPI / softIRQ — Linux uses NAPI (New API) to batch packet processing. After the initial hardware interrupt, the kernel schedules a soft interrupt (softIRQ) to process queued packets. The softIRQ runs in a special context with its own scheduling constraints, adding another layer of indirection.
- Network stack traversal — Each packet passes through netfilter hooks (iptables/nftables rules), routing table lookups, protocol handling (IP, TCP, UDP), and connection tracking. Each layer examines the packet headers and makes forwarding decisions. For a forwarding appliance that does not need TCP termination, this work is wasted.
- Socket buffer copy — If the packet is destined for a userspace application, the data is copied from the kernel's sk_buff into the application's buffer via a recv() or read() system call. This system call involves a context switch from userspace to kernel space and back, plus a memory copy.
Each of these steps costs time. Hardware interrupts cost 5-10 microseconds including cache effects. System calls cost 1-2 microseconds. Memory copies cost time proportional to packet size. Multiply these costs by millions of packets per second, and the kernel stack consumes entire CPU cores just moving data. Benchmarks consistently show that the Linux kernel stack tops out at roughly 1-3 million packets per second per core for forwarding workloads, depending on packet size and configuration. A 100 Gbps NIC receiving minimum-size 64-byte frames requires processing approximately 148.8 million packets per second — far beyond what a single core running the kernel stack can handle.
DPDK Architecture: How Kernel Bypass Works
DPDK achieves its performance by taking the NIC away from the kernel entirely. Before a DPDK application starts, the NIC is unbound from the kernel's network driver (e.g., ixgbe, i40e, mlx5_core) and bound to a userspace-compatible driver — either UIO (Userspace I/O) or VFIO (Virtual Function I/O). From that point on, the kernel has no visibility into the NIC. There are no interrupts, no kernel driver, no sk_buff allocations, and no system calls in the packet processing path. The NIC's registers and DMA rings are memory-mapped directly into the DPDK application's address space.
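In practice the rebinding is usually performed before the application launches, using the dpdk-devbind.py script shipped with DPDK. A typical sequence (the PCI address is an example; vfio-pci assumed available) looks like:

```shell
# Load the userspace I/O driver (VFIO, the recommended option)
modprobe vfio-pci

# Show which driver each NIC is currently bound to
dpdk-devbind.py --status

# Unbind the NIC from its kernel driver and hand it to vfio-pci
# (0000:03:00.0 is an example PCI address)
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
```

After this, the interface disappears from the kernel's view and is available only to DPDK applications.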
Environment Abstraction Layer (EAL)
The EAL is DPDK's initialization and platform abstraction layer. When you call rte_eal_init() at program startup, the EAL performs a series of critical setup steps:
- Hugepage allocation — The EAL reserves large pages of memory (2 MB or 1 GB hugepages) from the operating system. These hugepages reduce TLB (Translation Lookaside Buffer) misses dramatically: a system with 16 GB of memory needs 4 million TLB entries with standard 4 KB pages, but only 8,192 entries with 2 MB hugepages. Since TLB misses cost 10-100 nanoseconds each, this alone provides a significant performance improvement for memory-intensive packet processing.
- CPU core assignment — The EAL pins DPDK threads to specific CPU cores using pthread_setaffinity_np(). Each lcore (logical core) runs a single DPDK thread that never migrates between cores and, with core isolation configured (e.g., isolcpus), is never preempted by unrelated work. This eliminates context-switch overhead and ensures CPU caches remain warm.
- PCI device scanning — The EAL probes the PCI bus for NICs, binds them to UIO or VFIO drivers, and maps their registers and memory regions into the application's address space.
- Memory channel detection — DPDK detects memory channel configuration and distributes allocations across channels to maximize memory bandwidth, which matters when DMA engines are filling buffers at 100 Gbps.
- NUMA topology — The EAL identifies the NUMA (Non-Uniform Memory Access) topology of the system and ensures that memory allocations for a given NIC happen on the same NUMA node as the NIC. Accessing memory on the wrong NUMA node adds 40-100 nanoseconds of latency per access — at 100 million packet accesses per second, this would waste entire cores.
Poll Mode Drivers (PMDs)
PMDs are the heart of DPDK. Instead of waiting for the NIC to interrupt the CPU when packets arrive, a PMD continuously polls the NIC's receive descriptor ring in a tight loop. The main loop of a DPDK application looks conceptually like this:
```c
struct rte_mbuf *mbufs[BURST_SIZE];
uint16_t nb_rx, nb_tx, i;

while (1) {
    // Poll the NIC directly — no syscall, no interrupt
    nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST_SIZE);
    if (nb_rx > 0) {
        // Process packets
        for (i = 0; i < nb_rx; i++) {
            process_packet(mbufs[i]);
        }
        // Transmit processed packets
        nb_tx = rte_eth_tx_burst(port_id, queue_id, mbufs, nb_rx);
        // Free any packets the NIC could not queue for transmit
        for (i = nb_tx; i < nb_rx; i++) {
            rte_pktmbuf_free(mbufs[i]);
        }
    }
}
```
This loop burns 100% of the CPU core. There is no idle state, no sleep, no waiting. The core is dedicated solely to checking for packets and processing them. This is DPDK's fundamental tradeoff: it trades CPU efficiency for latency and throughput. An idle DPDK application consumes the same CPU as a fully loaded one. In return, it processes packets with single-digit microsecond latency and can sustain line rate on 100 Gbps interfaces.
The rte_eth_rx_burst() function does not invoke a system call. It directly reads the NIC's receive descriptor ring from mapped memory, checks which descriptors have been filled by the NIC's DMA engine, and returns pointers to the corresponding mbufs. This entire operation happens in userspace, with no kernel involvement whatsoever.
DPDK provides PMDs for all major NIC vendors: Intel (ixgbe, i40e, ice), Mellanox/NVIDIA (mlx4, mlx5), Broadcom (bnxt), Amazon (ena for EC2 instances), and many others. Each PMD is tuned for its specific NIC hardware, exploiting hardware-specific features like multiple receive queues, RSS (Receive Side Scaling), and hardware offloads.
Memory Buffers: mbufs and Mempools
In the kernel stack, each packet gets an sk_buff that is individually allocated and freed. This per-packet allocation is a major source of overhead: the kernel's memory allocator must maintain free lists, handle locking for multi-core access, and manage slab caches. DPDK replaces this with mempools — pre-allocated, fixed-size pools of packet buffers called mbufs.
An mbuf pool is created at initialization time with a call like rte_pktmbuf_pool_create("MBUF_POOL", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket_id). This allocates 8,192 mbufs on the specified NUMA socket, with each mbuf containing a fixed-size buffer for packet data. The pool uses a lock-free ring buffer internally, so allocating or freeing an mbuf is an atomic compare-and-swap operation — typically 10-20 nanoseconds versus hundreds of nanoseconds for kernel sk_buff allocation.
Each mbuf has a structure optimized for cache efficiency. The first cache line (64 bytes) contains the most frequently accessed fields: buffer address, data offset, packet length, and port number. Metadata that is accessed less frequently sits in subsequent cache lines. This layout means the common case — read the packet, check the length, forward it — touches only one or two cache lines per packet.
For packets larger than a single mbuf buffer, DPDK chains mbufs into a linked list, similar to scatter-gather I/O. Jumbo frames or reassembled packets span multiple mbufs without requiring contiguous memory allocation.
rte_ring: Lock-Free Inter-Core Communication
When DPDK applications run across multiple cores (which they almost always do), they need to pass packets between cores without locks. The rte_ring data structure is a fixed-size, lock-free FIFO queue implemented with atomic operations. It supports single-producer/single-consumer, multi-producer/single-consumer, and multi-producer/multi-consumer modes.
The ring uses a simple algorithm: it maintains head and tail indexes, and producers and consumers advance these indexes atomically using __sync_bool_compare_and_swap() (CAS). In the single-producer/single-consumer case, no atomic operations are needed at all — only memory ordering barriers. This gives inter-core packet passing latencies measured in tens of nanoseconds.
rte_ring is used throughout DPDK: it backs mempool allocations, connects pipeline stages across cores, and provides the abstraction for virtual device I/O. The ring size must be a power of two, which allows the index-to-slot mapping to use a bitwise AND instead of a modulo operation — a micro-optimization that matters at 100 million operations per second.
Hugepages and Memory Architecture
DPDK's reliance on hugepages is not a convenience — it is a hard requirement. Packet processing accesses memory at rates that overwhelm the CPU's TLB when standard 4 KB pages are used. Consider a DPDK application processing 80 million packets per second. Each packet requires at least one mbuf access (packet data), one descriptor access (NIC ring), and often one or more lookup table accesses (flow tables, routing tables). That is at least 240 million memory accesses per second, each requiring a virtual-to-physical address translation.
A modern x86 CPU has a two-level TLB: the L1 DTLB holds 64-72 entries, and the L2 TLB holds 1,536-2,048 entries. With 4 KB pages, these TLBs can cover at most 8 MB of memory — far less than the gigabytes of memory a DPDK application uses for packet buffers, flow tables, and lookup structures. TLB misses trigger a page table walk that takes 10-100 nanoseconds (or more if page table entries are not in cache). At 240 million accesses per second, even a 1% TLB miss rate produces 2.4 million page table walks per second, costing 24-240 milliseconds of CPU time per second.
With 2 MB hugepages, the same TLB entries cover 3 GB of memory. With 1 GB hugepages (available on modern x86 CPUs), a single TLB entry covers an entire gigabyte. This effectively eliminates TLB misses for DPDK's working set. The EAL configures hugepages at startup by mounting the hugetlbfs filesystem and mapping the required number of pages:
```bash
# Reserve 1024 hugepages of 2MB each (2GB total)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugetlbfs
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages
```
NUMA Awareness
Modern multi-socket servers have Non-Uniform Memory Access (NUMA) architectures where each CPU socket has its own local memory. Accessing local memory takes roughly 80 nanoseconds, while accessing memory on a remote NUMA node takes 130-200 nanoseconds — a 60-150% penalty. For PCIe devices like NICs, the NUMA effect is even more pronounced: a NIC is physically attached to one CPU socket, and DMA operations between the NIC and memory on the remote socket traverse the inter-socket interconnect (Intel UPI or AMD Infinity Fabric), adding latency and consuming interconnect bandwidth.
DPDK is deeply NUMA-aware. The EAL detects which NUMA node each NIC is attached to and ensures that:
- mbuf pools are allocated on the same NUMA node as the NIC that will fill them
- Descriptor rings (shared between the NIC's DMA engine and the PMD) are on the NIC's local NUMA node
- Worker threads processing packets from a NIC run on cores belonging to the same NUMA node
- Lookup tables and flow state accessed during packet processing reside in local memory
Violating NUMA locality in a high-rate packet processing application can reduce throughput by 30-50%. DPDK makes NUMA placement explicit: almost every allocation function takes a socket_id parameter, and the application developer is expected to maintain NUMA locality throughout the processing pipeline.
Processing Models: Run-to-Completion vs Pipeline
DPDK applications use two primary architectural models for distributing work across cores:
Run-to-Completion
In the run-to-completion model, each core handles a packet from start to finish. A core receives a packet from the NIC, performs all processing (parsing, classification, lookup, modification, encapsulation), and transmits the result — without passing the packet to any other core. Each core operates on its own set of NIC queues using RSS (Receive Side Scaling) to distribute packets across queues based on flow hashing.
Run-to-completion is the simplest model and often the fastest for straightforward forwarding applications. Each packet touches one core's cache, there is no inter-core communication overhead, and the application scales linearly by adding more cores with more NIC queues. The l3fwd example application in DPDK uses this model, and achieves near-line-rate forwarding on 100 Gbps NICs.
Pipeline Model
In the pipeline model, packet processing is divided into stages, and each stage runs on a different core. For example: core 0 receives packets and classifies them, core 1 performs route lookups, core 2 applies security policies, and core 3 transmits. Packets move between stages via rte_ring queues.
The pipeline model suits complex processing chains where different stages have different computational costs. It also allows stages to be shared across flows — a single classification stage can serve multiple forwarding stages. However, the pipeline model adds inter-core latency (one rte_ring enqueue/dequeue per stage) and increases cache pressure because each packet's data must be loaded into a new core's cache at each stage. For simple forwarding, run-to-completion outperforms pipeline. For complex processing (deep packet inspection, encryption, protocol reassembly), pipeline can better utilize cores by keeping each core's instruction cache warm with a single stage's code.
DPDK vs Kernel Networking: When to Use Each
DPDK is not a replacement for the kernel networking stack. It is a specialized tool for a specific class of problems. The tradeoffs are significant:
- Dedicated cores — DPDK claims CPU cores exclusively. A 4-core DPDK application consumes all four cores at 100% utilization, whether processing packets or idle. On a general-purpose server, this is wasteful. On a dedicated network appliance, it is the intended design.
- No kernel networking — A NIC bound to DPDK is invisible to the kernel. It has no IP address in ifconfig, cannot be used for SSH, does not participate in the kernel's routing table, and cannot run iptables rules. Applications must implement any needed protocol logic themselves. DPDK provides libraries for common tasks (ARP handling, IP fragmentation, flow classification), but the application is responsible for the logic.
- No socket API — Applications cannot use socket(), bind(), listen(), accept(), or any POSIX networking API on DPDK ports. If you need to handle TCP connections, you must use a userspace TCP stack (like F-Stack, mTCP, or Seastar) or implement the protocol yourself.
- Operational complexity — DPDK requires hugepage configuration, NIC driver binding, NUMA-aware deployment, and careful core assignment. The operational overhead is substantial compared to a standard Linux networking application.
- Security surface — Running as root (or with specific capabilities) and mapping NIC hardware directly into userspace increases the security surface. A bug in the DPDK application can corrupt NIC DMA descriptors or access arbitrary physical memory. The kernel's isolation guarantees are gone.
Use DPDK when you need sustained packet rates above 5-10 million packets per second, deterministic sub-10-microsecond latency, or when the kernel stack measurably cannot keep up. For lower-rate workloads, the kernel stack's generality, security isolation, and ecosystem integration outweigh DPDK's raw performance. Many modern kernels with XDP and eBPF can achieve 20-30 million packets per second per core while remaining integrated with the kernel's networking features — a middle ground that has significantly narrowed DPDK's advantage for many use cases.
DPDK in Network Function Virtualization (NFV)
DPDK's most significant deployment is in NFV — the telecom industry's shift from proprietary hardware appliances to software running on commodity servers. Functions that traditionally ran on purpose-built hardware — firewalls, load balancers, routers, session border controllers, deep packet inspection engines — now run as Virtual Network Functions (VNFs) on standard x86 servers. DPDK is the foundation that makes this economically viable.
OVS-DPDK: The Virtual Switch
Open vSwitch (OVS) is the dominant open-source virtual switch in data centers and telecom networks. In its default mode, OVS uses the kernel's datapath for forwarding, which inherits all of the kernel stack's performance limitations. OVS-DPDK replaces the kernel datapath with DPDK, moving all packet forwarding into userspace.
The performance difference is dramatic. Kernel-based OVS can forward 1-2 million packets per second. OVS-DPDK, on the same hardware, can forward 10-20 million packets per second with VXLAN encapsulation, and even more for simple L2 forwarding. For telecom operators running thousands of virtual machines with latency-sensitive VNFs, this difference determines whether software can replace hardware. OVS-DPDK manages the virtual port connections between VMs and the physical NICs. Each virtual machine connects to OVS-DPDK via vhost-user — a userspace implementation of the virtio transport that uses shared memory and eventfd for high-performance VM-to-switch communication, bypassing the kernel in both the host and guest.
VPP (Vector Packet Processing)
FD.io VPP (part of the Linux Foundation) is a DPDK-based packet processing framework developed originally by Cisco. VPP's innovation is vector processing: instead of processing packets one at a time through the entire graph, VPP processes a vector of packets (typically 256) through each graph node before moving to the next node. This keeps the instruction cache hot — each graph node's code is loaded once and applied to all 256 packets — and amortizes the cost of function calls and pipeline stalls across the vector. VPP achieves multi-terabit forwarding rates on commodity hardware and powers the CSIT (Continuous System Integration Testing) benchmarks used by the telecom industry to validate NFV performance.
Userspace TCP Stacks
For applications that need TCP but also need DPDK's performance, several userspace TCP stacks have been built on DPDK:
- F-Stack — Ports FreeBSD's TCP/IP stack to userspace on DPDK. Provides a POSIX-like socket API so existing applications can be adapted with minimal changes. Used in production by Tencent for high-performance proxy and CDN servers.
- mTCP — A research TCP stack built for multi-core scalability. mTCP achieves 25 million small HTTP transactions per second on an 8-core machine by eliminating lock contention through per-core TCP state and batched system call processing.
- Seastar — A C++ framework (used by ScyllaDB) that provides a share-nothing architecture where each core runs its own TCP stack with its own memory allocator, avoiding all inter-core coordination. Seastar achieves 12 million IOPS for database workloads on DPDK.
SR-IOV and DPDK
Single Root I/O Virtualization (SR-IOV) is a hardware technology that allows a single physical NIC to present multiple virtual NICs (Virtual Functions, or VFs) to the host operating system. Each VF has its own set of queues, interrupts, and DMA channels, and can be assigned directly to a virtual machine or container using PCI passthrough. DPDK supports both the Physical Function (PF) and Virtual Functions (VFs). In a typical NFV deployment:
- The PF is managed by OVS-DPDK on the host for control plane operations
- VFs are passed through to VMs or containers, each running its own DPDK application
- The NIC's hardware switching (e-switch) handles forwarding between VFs at hardware speed, without involving the host CPU
This architecture allows each VNF to achieve near-bare-metal packet processing performance while maintaining the isolation and management benefits of virtualization. Mellanox/NVIDIA ConnectX and Intel E810 NICs support hundreds of VFs, enabling dense VNF deployments with full DPDK performance.
AF_XDP: The Kernel's Answer to DPDK
AF_XDP (Address Family XDP) is a socket type introduced in Linux 4.18 that provides a middle ground between full kernel networking and DPDK-style kernel bypass. AF_XDP uses XDP (eXpress Data Path) to redirect packets from the NIC driver directly to a userspace socket, bypassing most of the kernel networking stack while keeping the NIC under kernel control.
AF_XDP works by creating a shared memory region (UMEM) between kernel and userspace, with ring buffers for packet descriptors. An XDP program attached to the NIC redirects selected packets to the AF_XDP socket via XDP_REDIRECT. The userspace application polls the ring buffers, similar to DPDK's PMD model. The critical difference is that the NIC remains bound to a kernel driver — it still has an IP address, still works with standard tools, and still participates in the kernel's networking for non-AF_XDP traffic.
Performance-wise, AF_XDP achieves 20-30 million packets per second per core — significantly better than the kernel stack's 1-3 Mpps and approaching DPDK's 40-80+ Mpps. For applications that need better-than-kernel performance but cannot tolerate DPDK's operational model (dedicated cores, no kernel integration, no standard tools), AF_XDP is increasingly compelling.
DPDK itself now includes an AF_XDP PMD, so applications can use AF_XDP as a backend while keeping the DPDK programming model. This allows gradual migration or hybrid deployments where some ports use full kernel bypass and others use AF_XDP for flexibility.
io_uring and DPDK
Linux's io_uring interface (introduced in 5.1) provides asynchronous, batched system calls that reduce the per-operation overhead of kernel I/O. While io_uring was designed primarily for storage I/O, its networking support has improved significantly. For TCP-oriented workloads, io_uring networking can approach the throughput of userspace stacks without leaving the kernel, because it batches system calls and reduces context switch overhead.
However, io_uring does not compete with DPDK for raw packet-forwarding workloads. It still uses the kernel's TCP/IP stack, still allocates sk_buff structures, and still processes packets through netfilter. For applications that need line-rate L2/L3 forwarding without TCP termination — the core DPDK use case — io_uring does not change the equation. The two technologies address different layers of the stack.
Performance Numbers in Practice
DPDK's performance has been extensively benchmarked. Representative numbers from published benchmarks and real-world deployments include:
| Configuration | Throughput | Latency |
|---|---|---|
| L2 forwarding, 64B frames, single core, 25 GbE | ~37 Mpps | <3 us |
| L3 forwarding, 64B frames, single core, 100 GbE | ~60 Mpps | <5 us |
| OVS-DPDK, VXLAN encap, 64B, 4 cores | ~15 Mpps | <20 us |
| VPP L2 bridge, 64B, 2 cores, 100 GbE | ~148 Mpps (line rate) | <10 us |
| IPsec tunnel (AES-GCM-128), 512B, 4 cores | ~40 Gbps | <30 us |
| Kernel Linux forwarding (comparison) | ~1-3 Mpps/core | ~50-100 us |
| AF_XDP forwarding (comparison) | ~24 Mpps/core | <10 us |
These numbers depend heavily on hardware (NIC model, CPU generation, memory speed, PCIe generation), packet size (smaller packets are harder because the per-packet overhead dominates), and the complexity of processing. The general pattern is consistent: DPDK delivers 10-40x the throughput of kernel networking for forwarding workloads, with 10-50x lower latency.
DPDK in the Real World
DPDK powers critical infrastructure across telecom, cloud, and enterprise networks:
- Telecom core networks — 4G EPC (Evolved Packet Core) and 5G UPF (User Plane Function) implementations from vendors like Ericsson, Nokia, and Samsung use DPDK for user-plane packet processing. Every mobile data packet you send passes through DPDK-accelerated VNFs. These network functions are deployed in data centers that advertise routes visible in the global routing table.
- Cloud provider virtual networking — AWS uses DPDK in its Elastic Network Adapter (ENA) host-side processing. Azure's SmartNIC/FPGA stack interfaces with DPDK-based software. Google's Andromeda virtual networking uses comparable kernel-bypass techniques.
- Content delivery — CDN providers use DPDK for edge proxy servers that terminate millions of TLS connections and forward cached content at line rate. The CDN networks that serve cached content from edge locations rely on DPDK to handle the aggregate traffic volume.
- High-frequency trading — Financial firms use DPDK (and the similar Solarflare OpenOnload) to achieve sub-microsecond network latencies for market data feeds and order execution.
- DDoS mitigation — Scrubbing centers use DPDK to inspect and filter hundreds of gigabits of attack traffic in real time. When a load balancer or DDoS filter needs to process volumetric attacks, DPDK provides the packet budget to analyze every packet at line rate.
The DPDK Ecosystem and Alternatives
DPDK is maintained by the Linux Foundation as part of the DPDK project (dpdk.org). Releases follow a time-based cadence of three per year (YY.MM format: 24.03, 24.07, 24.11, etc.), and each release supports an explicit set of NIC hardware. The project includes:
- Core libraries — EAL, mbuf, ring, mempool, timer, hash, LPM (Longest Prefix Match), ACL (Access Control List), cryptodev, eventdev
- PMDs — Drivers for physical NICs (Intel, NVIDIA, Broadcom, Amazon, Marvell, etc.), virtual devices (virtio, vhost, PCAP, AF_XDP), and crypto accelerators (Intel QAT, ARM CE)
- Sample applications — l2fwd, l3fwd, testpmd, ipsec-secgw, flow_classify, and many others that serve as both examples and production starting points
Several alternative approaches to high-performance networking compete with or complement DPDK:
- AF_XDP — As discussed above, offers 60-80% of DPDK's performance with kernel integration. Best for applications that need fast networking without fully abandoning the kernel.
- XDP/eBPF — eBPF programs attached at the XDP hook process packets in the NIC driver before they enter the kernel stack. XDP achieves 20-30+ Mpps per core and can drop, redirect, or modify packets without userspace involvement. Cloudflare's DDoS mitigation and Meta's Katran load balancer use XDP.
- Netmap — An older kernel-bypass framework that maps NIC rings into userspace. Netmap preceded DPDK and influenced its design, but has a smaller ecosystem and fewer supported NICs.
- PF_RING ZC — A commercial kernel-bypass solution from ntop that provides zero-copy packet capture. Used primarily in network monitoring rather than forwarding.
- RDMA/RoCE — Remote Direct Memory Access over Converged Ethernet provides kernel bypass for point-to-point communication, commonly used in storage (NVMe-oF) and HPC clusters rather than general packet forwarding.
Writing Your First DPDK Application
A minimal DPDK application that receives and drops packets (useful for benchmarking NIC receive performance) requires about 100 lines of C code. The essential structure is:
```c
#include <stdlib.h>

#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE 250
#define BURST_SIZE 32

int main(int argc, char *argv[]) {
    struct rte_mempool *mbuf_pool;
    struct rte_eth_conf port_conf = {0};
    uint16_t port_id = 0;

    // Initialize EAL — parses --lcores, --socket-mem, etc.
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialization failed\n");

    // Create an mbuf pool on the local NUMA socket
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, MBUF_CACHE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    // Configure the Ethernet port: 1 RX queue, 0 TX queues
    rte_eth_dev_configure(port_id, 1, 0, &port_conf);

    // Set up the RX queue with mbufs from our pool
    rte_eth_rx_queue_setup(port_id, 0, RX_RING_SIZE,
        rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);

    // Start the port
    rte_eth_dev_start(port_id);

    // Poll loop — runs forever on this core, dropping every packet
    struct rte_mbuf *bufs[BURST_SIZE];
    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0,
                                          bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```
Compile this with the DPDK build system (meson/ninja) or pkg-config: gcc -o rxdrop rxdrop.c $(pkg-config --cflags --libs libdpdk). Before running, bind the NIC to a userspace driver (for example with dpdk-devbind.py) and reserve hugepages. Then run sudo ./rxdrop --lcores=0 -a 0000:03:00.0, where -a restricts the EAL to the NIC at the given PCI address. At startup the EAL maps the device into the process, allocates hugepage memory, pins the thread to core 0, and begins polling.
DPDK and Container Networking
DPDK's integration with container networking presents unique challenges. Containers expect standard Linux networking — veth pairs, network namespaces, iptables rules — which conflicts with DPDK's kernel-bypass model. Several approaches bridge this gap:
- DPDK-accelerated CNIs — Kubernetes Container Network Interface plugins like Userspace CNI and Multus CNI can assign DPDK-compatible interfaces (SR-IOV VFs or vhost-user sockets) to pods alongside the standard pod network. The DPDK interface provides a fast data plane while the standard interface handles control traffic.
- SmartNIC offload — Modern SmartNICs (NVIDIA BlueField, Intel IPU) run OVS-DPDK or VPP on embedded ARM cores, performing virtual switching in hardware. The container host sees standard networking, but the data plane runs DPDK on the SmartNIC — combining kernel-bypass performance with transparent container networking.
- Service mesh acceleration — Projects like FD.io's Ligato and Calico-VPP use DPDK/VPP as the data plane for Kubernetes service meshes, replacing kube-proxy's iptables rules with DPDK-accelerated forwarding. This benefits large clusters where iptables rules grow linearly with the number of services.
The Future of Kernel Bypass
DPDK established kernel bypass as a mainstream approach to high-performance networking. But the landscape continues to evolve. AF_XDP and XDP/eBPF are narrowing the performance gap while maintaining kernel integration. SmartNICs are moving packet processing off the host CPU entirely. Hardware P4 switches allow custom forwarding pipelines at ASIC speeds. And DPDK itself continues adding features — hardware offloads, crypto acceleration, regex matching, and machine learning inference on SmartNIC accelerators.
The underlying trend is clear: the boundary between hardware and software networking continues to blur. DPDK was the first widely adopted framework to make this boundary programmable for commodity hardware. Whether future packet processing happens in userspace (DPDK), in the kernel (XDP/eBPF), or on SmartNIC hardware, the ideas DPDK pioneered — hugepages, poll-mode drivers, NUMA-aware allocation, lock-free data structures, and batch processing — remain foundational to how high-performance networking works.
Explore Network Infrastructure
The networks that deploy DPDK at scale — telecom operators, cloud providers, CDN edges — are all visible in the global BGP routing table. Use the god.ad BGP Looking Glass to examine their route announcements, AS paths, and peering relationships. You can look up any network that handles high-throughput traffic to see how it connects to the rest of the internet, trace the autonomous systems in its forwarding path, and understand the routing infrastructure that DPDK-accelerated appliances serve.