How BGP Graceful Restart Works: Preserving Forwarding Across Restarts
BGP Graceful Restart (GR) is a mechanism defined in RFC 4724 that allows a BGP speaker to preserve its forwarding state across a restart of the BGP process. Without Graceful Restart, when a BGP session drops, the peer immediately withdraws all routes learned from the restarting speaker, triggering a ripple of convergence events across the internet. With GR, the peer (called the "helper" or "receiving speaker") continues to use the previously learned routes for a negotiated period, giving the restarting speaker time to re-establish the session and re-advertise its routes without causing a traffic disruption.
This matters enormously in production networks. Software upgrades, process crashes, and configuration reloads are routine. Each of these events can cause a BGP session reset. In a network carrying thousands of prefixes across dozens of peering sessions, a restart without GR sends a burst of withdrawals to every peer, and from there onward to their peers, shifting traffic across alternate paths, creating congestion, and potentially causing packet loss far from the network that restarted. Graceful Restart transforms a potentially service-affecting event into one that is transparent to traffic.
The Problem: What Happens Without Graceful Restart
To understand why GR exists, consider what happens when a BGP speaker restarts without it:
- The BGP process terminates (planned upgrade, crash, or configuration reload).
- The TCP connection underlying the BGP session drops. The peer detects this immediately (via TCP RST), within milliseconds if BFD is in use, or otherwise only when the BGP hold timer expires (default 90 seconds).
- The peer removes all routes learned from the restarting speaker from its Adj-RIB-In, runs the best-path algorithm, and withdraws those routes from all of its own peers.
- Those withdrawals propagate through the internet, causing every AS along each AS path to reconverge.
- When the restarting speaker comes back up, it re-establishes the BGP session, exchanges OPEN messages, and re-advertises all of its routes.
- The peer re-installs the routes, runs best-path selection again, and re-announces them to its own peers.
- The rest of the internet converges back to the original state.
The entire sequence -- withdraw, converge to backup paths, re-advertise, converge back -- can take minutes, during which traffic takes suboptimal paths and may be dropped entirely if no alternate path exists. All of this churn is unnecessary if the restarting router's forwarding plane (the hardware or kernel-level data path) never actually stopped forwarding packets. The control plane (BGP process) went away, but the forwarding table in the router's line cards or kernel remained intact.
Graceful Restart exploits this separation between control plane and forwarding plane. If the router can keep forwarding packets using its existing FIB while the BGP process restarts, and the peer can keep its routes in the RIB during that window, traffic continues to flow as if nothing happened.
RFC 4724: The Graceful Restart Mechanism
RFC 4724, published in January 2007, defines the core Graceful Restart mechanism for BGP. It introduces three key concepts: the Graceful Restart capability (negotiated in OPEN messages), the restart timer, and the End-of-RIB marker.
GR Capability Negotiation
Graceful Restart must be negotiated between peers. A BGP speaker advertises its willingness to support GR by including the Graceful Restart Capability (capability code 64) in its OPEN message during session establishment. This capability contains:
- Restart Flags (4 bits) -- The most significant bit is the Restart State (R) bit. When set, it indicates that the speaker has restarted, and tells the peer not to wait for an End-of-RIB marker from this speaker before advertising routes to it. RFC 4724 leaves the remaining bits reserved; RFC 8538 later assigned one of them as the Notification (N) bit, discussed later.
- Restart Time (12 bits) -- The number of seconds the speaker expects to take to restart its BGP process. The peer uses this as the upper bound for how long it will retain stale routes. Maximum value: 4095 seconds (~68 minutes).
- Address Family tuples -- For each AFI/SAFI (Address Family Identifier / Subsequent Address Family Identifier) the speaker supports, a tuple of AFI (16 bits), SAFI (8 bits), and a Flags byte. The critical flag here is the Forwarding State (F) bit: when set, it indicates that the forwarding state for that address family has been preserved across the restart.
Both peers must include the GR capability in their OPEN messages for Graceful Restart to be active on the session. However, the roles are asymmetric: at any given restart event, one peer is the restarting speaker and the other is the helper (or receiving speaker).
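The field layout described above can be sketched as a small encoder and decoder. This is a minimal illustration rather than code from any BGP implementation: the class name `GracefulRestartCapability` and its fields are hypothetical, and the bytes produced are only the capability TLV (code, length, value), not the full OPEN optional parameter that wraps it.

```python
import struct
from dataclasses import dataclass, field
from typing import Dict, Tuple

GR_CAPABILITY_CODE = 64
RESTART_STATE_BIT = 0x8       # R bit: most significant of the 4 Restart Flags bits
FORWARDING_STATE_BIT = 0x80   # F bit: most significant bit of the per-family flags byte


@dataclass
class GracefulRestartCapability:
    """Hypothetical container for the fields of the GR capability."""
    restart_time: int                      # seconds, must fit in 12 bits
    restarted: bool = False                # R bit
    # (AFI, SAFI) -> was forwarding state preserved (F bit)?
    families: Dict[Tuple[int, int], bool] = field(default_factory=dict)

    def encode(self) -> bytes:
        if not 0 <= self.restart_time <= 0x0FFF:
            raise ValueError("Restart Time must be in 0..4095")
        flags = RESTART_STATE_BIT if self.restarted else 0
        # 4 flag bits and 12 bits of Restart Time share one 16-bit word.
        value = struct.pack("!H", (flags << 12) | self.restart_time)
        for (afi, safi), preserved in self.families.items():
            fam_flags = FORWARDING_STATE_BIT if preserved else 0
            value += struct.pack("!HBB", afi, safi, fam_flags)
        return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value

    @classmethod
    def decode(cls, data: bytes) -> "GracefulRestartCapability":
        code, length = struct.unpack_from("!BB", data, 0)
        value = data[2:2 + length]
        if code != GR_CAPABILITY_CODE:
            raise ValueError("not a Graceful Restart capability")
        if length < 2 or len(value) != length or (length - 2) % 4:
            raise ValueError("malformed Graceful Restart capability")
        (head,) = struct.unpack_from("!H", value, 0)
        families = {}
        for offset in range(2, length, 4):
            afi, safi, fam_flags = struct.unpack_from("!HBB", value, offset)
            families[(afi, safi)] = bool(fam_flags & FORWARDING_STATE_BIT)
        return cls(
            restart_time=head & 0x0FFF,
            restarted=bool((head >> 12) & RESTART_STATE_BIT),
            families=families,
        )


# A restarted speaker: 120 s Restart Time, IPv4 unicast preserved, IPv6 unicast not.
cap = GracefulRestartCapability(120, True, {(1, 1): True, (2, 1): False})
wire = cap.encode()
```

Packing the flags and the Restart Time into one 16-bit word is what limits the timer to 4095 seconds: only the low 12 bits are available for it.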
Restarting Speaker Behavior
When the BGP process restarts on a router that supports GR:
- Preserve forwarding state -- Before or during the restart, the router preserves its Forwarding Information Base (FIB). On modern routers, the FIB lives in hardware (ASICs, line cards) or in the kernel, separate from the BGP process. The BGP process signals the forwarding plane to mark existing entries as stale but continue using them -- this is called Non-Stop Forwarding (NSF) in many vendor implementations.
- Re-establish TCP and BGP session -- The restarting speaker opens a new TCP connection to the peer and sends an OPEN message with the Graceful Restart capability. It sets the Restart State (R) bit to 1, indicating that this is a post-restart session establishment. For each AFI/SAFI where forwarding state was preserved, it sets the Forwarding State (F) bit to 1.
- Defer best-path selection -- The restarting speaker should defer running the decision process for routes received from the helper until it has received the End-of-RIB marker from the helper, or until a local selection deferral timer (the Selection_Deferral_Timer in RFC 4724) expires. This prevents premature route selection with incomplete information.
- Send End-of-RIB -- After sending all UPDATE messages to re-advertise its routes, the restarting speaker sends an End-of-RIB marker to signal that its initial routing table dump is complete.
Helper (Receiving Speaker) Behavior
The helper is the peer that did not restart. When it detects that the BGP session has dropped (via TCP RST, hold timer expiry, or BFD):
- Check GR capability -- If the peer had previously advertised the GR capability with a non-zero Restart Time, the helper enters Graceful Restart mode instead of immediately withdrawing routes.
- Mark routes as stale -- All routes received from the restarting peer are marked as stale in the Adj-RIB-In. These routes remain in the RIB and continue to be used for forwarding and announced to other peers.
- Start the Restart Timer -- The helper starts a timer set to the Restart Time value from the peer's most recent OPEN message. If this timer expires before the restarting speaker re-establishes the session, the helper deletes all stale routes and performs normal withdrawal procedures. The restart has failed.
- Accept new session -- When the restarting speaker reconnects, the helper checks the R bit and F bits in the new OPEN message. If the F bit is set for an address family, the helper continues to retain the stale routes for that AFI/SAFI. If F is not set, or the address family is missing from the new capability, the helper immediately deletes stale routes for that address family -- the restarting speaker is signaling that its forwarding state was not preserved.
- Process End-of-RIB -- When the helper receives the End-of-RIB marker from the restarting speaker, it deletes any routes still marked as stale that were not refreshed by the new UPDATE messages. These routes were present before the restart but are no longer being advertised, so they should be withdrawn.
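The helper steps above amount to a small state machine over the routes learned from one peer. The sketch below is a simplified model under stated assumptions: `HelperSession` and its method names are hypothetical, address families are plain strings, time is passed in explicitly, and a family absent from the new OPEN is treated the same as F=0.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class HelperSession:
    """Helper-side view of one peer (illustrative only, not a real BGP stack)."""
    restart_time: int                 # Restart Time from the peer's last OPEN
    gr_families: Set[str]             # families the peer listed in its GR capability
    # (family, prefix) -> is the route currently marked stale?
    routes: Dict[Tuple[str, str], bool] = field(default_factory=dict)
    restart_deadline: Optional[float] = None

    def on_update(self, family: str, prefix: str) -> None:
        # A fresh advertisement replaces any stale copy of the same route.
        self.routes[(family, prefix)] = False

    def on_session_down(self, now: float) -> None:
        if self.restart_time == 0:
            self.routes.clear()       # GR not usable: normal withdrawal
            return
        # Keep only routes for families covered by GR, and mark them stale.
        self.routes = {k: True for k in self.routes if k[0] in self.gr_families}
        self.restart_deadline = now + self.restart_time

    def on_tick(self, now: float) -> None:
        # Restart Timer expiry: the restart failed, so stale routes go away.
        if self.restart_deadline is not None and now >= self.restart_deadline:
            self._drop_stale(lambda family: True)
            self.restart_deadline = None

    def on_open(self, forwarding_preserved: Dict[str, bool]) -> None:
        self.restart_deadline = None
        # F bit clear (or family missing): drop that family's stale routes now.
        self._drop_stale(lambda family: not forwarding_preserved.get(family, False))

    def on_end_of_rib(self, family: str) -> None:
        # Anything still stale was not re-advertised after the restart.
        self._drop_stale(lambda f: f == family)

    def _drop_stale(self, family_matches: Callable[[str], bool]) -> None:
        self.routes = {k: stale for k, stale in self.routes.items()
                       if not (stale and family_matches(k[0]))}


# Walk through a successful restart in which one route is not re-advertised.
helper = HelperSession(restart_time=120, gr_families={"ipv4-unicast"})
helper.on_update("ipv4-unicast", "10.0.0.0/8")
helper.on_update("ipv4-unicast", "192.0.2.0/24")
helper.on_session_down(now=0.0)
helper.on_open({"ipv4-unicast": True})
helper.on_update("ipv4-unicast", "10.0.0.0/8")
helper.on_end_of_rib("ipv4-unicast")
```

After the walk-through, only 10.0.0.0/8 remains and it is no longer stale; 192.0.2.0/24 was retained during the restart and removed at End-of-RIB because it was never refreshed.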
The End-of-RIB Marker
RFC 4724 introduces the End-of-RIB marker (also called the "initial update completion marker") as a signal that a speaker has finished sending its initial routing table after session establishment. The marker is simply an UPDATE message with no reachable NLRI and no withdrawn routes -- an empty UPDATE for the address family in question.
For IPv4 unicast, the End-of-RIB marker is an UPDATE message with no NLRI field and no Withdrawn Routes field. For other address families using multiprotocol extensions (RFC 4760), it is an UPDATE with an empty MP_UNREACH_NLRI attribute for the appropriate AFI/SAFI.
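Both forms of the marker are short enough to build by hand. The following sketch constructs them as raw BGP messages; the function name `end_of_rib` is a hypothetical helper, and the code assumes the standard 19-byte BGP header (16-byte marker, 2-byte length, 1-byte type) and MP_UNREACH_NLRI as attribute type 15.

```python
import struct

BGP_MARKER = b"\xff" * 16
BGP_HEADER_LEN = 19
MSG_TYPE_UPDATE = 2
ATTR_FLAG_OPTIONAL = 0x80
ATTR_TYPE_MP_UNREACH_NLRI = 15


def end_of_rib(afi: int = 1, safi: int = 1) -> bytes:
    """Build an End-of-RIB UPDATE for the given address family."""
    if (afi, safi) == (1, 1):
        # IPv4 unicast: no withdrawn routes, no path attributes, no NLRI.
        body = struct.pack("!HH", 0, 0)
    else:
        # Other families: a single MP_UNREACH_NLRI attribute that names
        # the AFI/SAFI and carries no withdrawn prefixes.
        attr_value = struct.pack("!HB", afi, safi)
        attr = struct.pack("!BBB", ATTR_FLAG_OPTIONAL,
                           ATTR_TYPE_MP_UNREACH_NLRI, len(attr_value)) + attr_value
        body = struct.pack("!HH", 0, len(attr)) + attr
    header = BGP_MARKER + struct.pack("!HB", BGP_HEADER_LEN + len(body), MSG_TYPE_UPDATE)
    return header + body


ipv4_marker = end_of_rib()        # 23 bytes, the minimum-length UPDATE
ipv6_marker = end_of_rib(2, 1)    # 29 bytes for IPv6 unicast
```

The IPv4 unicast marker is simply the smallest legal UPDATE, which is why a receiver can recognize it without any new message type.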
The End-of-RIB marker serves two purposes:
- Stale route cleanup -- The helper uses it to know when it can safely delete stale routes that the restarting speaker did not re-advertise. Without this marker, the helper would have to rely on a timer, which is less precise.
- Best-path deferral -- The restarting speaker uses the End-of-RIB markers it receives from its peers to know when it has complete routing information from each peer and can run the best-path decision process.
The End-of-RIB marker has proven so useful that it is now commonly implemented even outside the GR context. Many BGP implementations send an End-of-RIB after the initial route exchange on any new session, regardless of whether GR is negotiated. Route reflectors and other aggregation points use it to sequence their own best-path calculations.
Timers and Their Interaction
Getting the timers right is critical to a successful Graceful Restart. There are several timers involved, and their relationships determine whether the restart is transparent or causes a traffic disruption:
Restart Timer
Advertised in the GR capability, this is the time the restarting speaker expects to need to come back up and re-establish the BGP session. The helper uses this as the deadline. Typical values range from 120 to 300 seconds. Setting it too low risks the helper deleting routes before the restart completes. Setting it too high means the helper retains potentially invalid routes for too long if the restart fails entirely.
Stale Path Timer (Selection Deferral Timer)
This is a local timer on the restarting speaker. After re-establishing sessions, the restarting speaker waits for End-of-RIB markers from its peers before running the best-path decision process. The timer sets an upper bound on this wait. RFC 4724 calls it the Selection_Deferral_Timer and does not fix a value, requiring only that it be large enough for all peers to send their routes; 360 seconds is a commonly used value. If the timer expires before all peers have sent End-of-RIB, the restarting speaker runs best-path with whatever information it has and deletes any remaining stale routes from its own FIB.
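The deferral logic reduces to "run best-path when every peer has sent End-of-RIB, or when the timer runs out, whichever comes first." A minimal sketch follows; the class `SelectionDeferral` and its method names are hypothetical, and the 360-second limit is just the value mentioned above.

```python
from typing import Iterable


class SelectionDeferral:
    """Tracks which peers the restarting speaker is still waiting on."""

    def __init__(self, peers: Iterable[str], limit_seconds: float, started_at: float):
        self.waiting_on = set(peers)
        self.deadline = started_at + limit_seconds

    def end_of_rib_received(self, peer: str) -> None:
        # discard() tolerates duplicate markers and unknown peers.
        self.waiting_on.discard(peer)

    def ready(self, now: float) -> bool:
        # Either condition alone is enough to start the decision process.
        return not self.waiting_on or now >= self.deadline


deferral = SelectionDeferral(["192.0.2.1", "192.0.2.2"], limit_seconds=360, started_at=0.0)
deferral.end_of_rib_received("192.0.2.1")
still_waiting = not deferral.ready(now=30.0)   # one peer outstanding, timer not expired
timed_out = deferral.ready(now=360.0)          # timer expiry overrides the missing marker
```

Bounding the wait matters because a single peer that never sends End-of-RIB would otherwise stall route selection indefinitely.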
BGP Hold Timer and BFD
The BGP hold timer (default 90 seconds per RFC 4271) governs how long a speaker waits for a KEEPALIVE or UPDATE before declaring the session dead. BFD (Bidirectional Forwarding Detection) can detect link or neighbor failures in milliseconds. When GR is enabled, the interaction between BFD and GR requires careful consideration: BFD can trigger a session reset very quickly, which starts the GR process. This is usually desirable -- fast detection of a restart event means the helper enters GR mode promptly and the restart timer starts early, giving the restarting speaker the full window to recover.
However, BFD and GR can conflict if BFD detects a failure that is not a graceful restart -- for example, a link failure where the restarting speaker's forwarding plane is also down. In that case, retaining stale routes for the full restart timer causes traffic to be black-holed. Many implementations allow configuring BFD to be "GR-aware" so it can distinguish between a control-plane restart (where GR should activate) and a forwarding-plane failure (where routes should be withdrawn immediately). RFC 5882 describes this interaction: when BFD runs independently of the control plane (signaled by the Control Plane Independent, or C, bit), a BFD failure indicates a forwarding-plane problem and GR should be aborted.
Forwarding State Preservation: The Hard Part
The GR mechanism in RFC 4724 assumes that the restarting router's forwarding plane continues to work during the restart. This assumption is the foundation of the entire mechanism -- if the forwarding plane goes down, retaining stale routes on the helper just causes traffic to be black-holed instead of being rerouted.
How forwarding state is preserved depends on the router platform:
- Distributed forwarding architectures -- On chassis-based routers with separate route processor (RP) and line cards, the line cards maintain their own copy of the FIB. When only the RP (which runs BGP) restarts, the line cards continue forwarding using their existing FIB. This is the cleanest form of NSF and is supported by all major chassis platforms (Cisco ASR/NCS, Juniper MX, Nokia SR, Arista 7500R).
- Centralized platforms -- On fixed-form-factor routers where the CPU runs both BGP and the forwarding plane (common in software routers and some lower-end hardware), preserving the FIB across a process restart requires the forwarding plane to be decoupled from the BGP process -- for instance, by using a separate kernel-level forwarding table (as in Linux with FRR or BIRD) or a separate forwarding process or forwarding ASIC that keeps running while BGP restarts.
- Software routers -- Platforms like FRR, BIRD, and GoBGP running on Linux can leverage the kernel's routing table. The kernel FIB persists even when the BGP daemon restarts, providing natural forwarding continuity. The key challenge is ensuring the BGP daemon can recover its state and reconcile with the kernel FIB.
Long-Lived Graceful Restart (LLGR)
RFC 4724's Graceful Restart has a fundamental limitation: the Restart Time field is only 12 bits, capping the maximum at 4095 seconds (~68 minutes). For many operational scenarios, this is insufficient. A major software upgrade, a complex debugging session, or a hardware replacement can take hours. If the BGP process does not return within the restart timer, the helper withdraws all stale routes, triggering full reconvergence.
Long-Lived Graceful Restart (LLGR), defined in RFC 9494 (published 2023), extends the GR mechanism to support much longer restart periods, on the order of hours or days. LLGR adds a second phase after the regular GR timer expires:
- The regular GR procedure runs as defined in RFC 4724.
- When the Restart Timer expires without the restarting speaker reconnecting, instead of deleting stale routes, the helper transitions them to LLGR stale state.
- LLGR stale routes are kept in the RIB but with a dramatically reduced priority: the LLGR_STALE community (65535:6) is attached, and the routes are treated as less preferred than any non-stale route in the best-path algorithm. They are also re-advertised to other LLGR-aware peers with the LLGR_STALE community, so those peers also de-prefer them.
- A new Long-Lived Stale Time (24 bits, supporting up to 16,777,215 seconds, roughly 194 days) governs how long LLGR stale routes are retained. Practical values are typically hours to days.
- When the restarting speaker finally reconnects, the LLGR stale routes are refreshed or deleted, just as in regular GR.
The key insight of LLGR is that a de-preferred route is better than no route. If the only path to a destination goes through the restarting speaker, an LLGR stale route will still be used -- but if any alternative path exists, that path will be preferred. This provides a graceful degradation: traffic shifts to alternate paths during an extended outage but falls back to the stale route as a last resort.
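The "last resort" behavior can be illustrated with a toy best-path comparison in which LLGR staleness is checked before anything else. This is a deliberately reduced sketch: the route dictionaries and the function names `preference_key` and `best_path` are hypothetical, and only LOCAL_PREF and AS path length stand in for the full BGP decision process.

```python
from typing import Dict, List, Tuple

LLGR_STALE = (65535, 6)


def preference_key(route: Dict) -> Tuple[bool, int, int]:
    """Sort key: smaller is better. Staleness dominates every other criterion."""
    is_llgr_stale = LLGR_STALE in route["communities"]
    return (is_llgr_stale, -route["local_pref"], len(route["as_path"]))


def best_path(candidates: List[Dict]) -> Dict:
    return min(candidates, key=preference_key)


stale_route = {
    "next_hop": "192.0.2.1",
    "communities": {LLGR_STALE},
    "local_pref": 200,
    "as_path": [64500],
}
fresh_route = {
    "next_hop": "198.51.100.1",
    "communities": set(),
    "local_pref": 100,
    "as_path": [64501, 64502, 64503],
}

chosen_with_alternative = best_path([stale_route, fresh_route])   # the fresh route
chosen_without_alternative = best_path([stale_route])             # the stale route
```

Even though the stale route has a higher LOCAL_PREF and a shorter AS path, it loses to any non-stale alternative, and it is still selected when it is the only candidate.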
LLGR Communities
RFC 9494 defines two well-known communities for LLGR:
- LLGR_STALE (65535:6) -- Attached to routes that have entered the LLGR stale phase. Signals to all LLGR-aware recipients that this route should be de-preferred. By default, routes carrying this community should not be advertised to peers that have not advertised the LLGR capability, since those peers would not know to de-prefer them.
- NO_LLGR (65535:7) -- Can be attached to routes by the originator to indicate that LLGR should not be applied. If a route carries NO_LLGR, the helper deletes it when the regular GR timer expires, even if LLGR is negotiated on the session.
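How the two communities steer the transition at Restart Timer expiry can be sketched as a filter over the stale routes. The function `on_restart_timer_expiry` and the route dictionaries are hypothetical; communities are modeled as the usual 32-bit values, with the high 16 bits holding 65535 and the low 16 bits holding 6 or 7.

```python
from typing import Dict, List


def community(high: int, low: int) -> int:
    """Encode a standard community written as high:low into its 32-bit value."""
    return (high << 16) | low


LLGR_STALE = community(65535, 6)   # 0xFFFF0006
NO_LLGR = community(65535, 7)      # 0xFFFF0007


def on_restart_timer_expiry(stale_routes: List[Dict], llgr_negotiated: bool) -> List[Dict]:
    """Return the routes that survive into the LLGR stale phase."""
    survivors = []
    for route in stale_routes:
        if not llgr_negotiated or NO_LLGR in route["communities"]:
            continue  # plain GR behavior: the stale route is deleted
        tagged = dict(route, communities=route["communities"] | {LLGR_STALE})
        survivors.append(tagged)
    return survivors


stale_routes = [
    {"prefix": "10.0.0.0/8", "communities": set()},
    {"prefix": "192.0.2.0/24", "communities": {NO_LLGR}},
]
kept = on_restart_timer_expiry(stale_routes, llgr_negotiated=True)
```

Here only 10.0.0.0/8 survives, now tagged with LLGR_STALE; the route carrying NO_LLGR is removed exactly as it would be under plain GR.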
The Notification (N) Bit
Standard BGP Graceful Restart as defined in RFC 4724 only activates when the TCP session drops unexpectedly. If the BGP session is closed with a NOTIFICATION message (the BGP error mechanism), GR does not apply -- the assumption being that a NOTIFICATION indicates a protocol error that should trigger full reconvergence.
However, many operational scenarios involve NOTIFICATION messages that are not protocol errors: a hard reset triggered by a configuration change, a session reset for maintenance, or a peer sending a Cease NOTIFICATION with a subcode such as "Administrative Reset." In these cases, preserving forwarding state during the restart is just as desirable as during a crash.
RFC 8538, which updates RFC 4724, introduces the Notification (N) bit in the Graceful Restart capability flags. When both peers set the N bit, GR procedures apply even when the session is terminated by a NOTIFICATION message or by hold timer expiry. The F bits in the subsequent OPEN still indicate, per address family, whether forwarding state was preserved. For cases where a full reset really is intended, RFC 8538 also defines a Hard Reset subcode of the Cease NOTIFICATION, which bypasses GR.
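The resulting decision can be written as a short predicate. This is a sketch that assumes the GR capability itself has already been negotiated; the enum `SessionEnd` and the function `gr_applies` are hypothetical names, and the `HARD_RESET` case models the Hard Reset subcode that RFC 8538 defines for forcing a full reset even when the N bit is in use.

```python
from enum import Enum


class SessionEnd(Enum):
    TCP_FAILURE = "tcp_failure"      # connection lost without a NOTIFICATION
    NOTIFICATION = "notification"    # ordinary NOTIFICATION sent or received
    HARD_RESET = "hard_reset"        # Cease NOTIFICATION with the Hard Reset subcode


def gr_applies(end: SessionEnd, local_n_bit: bool, peer_n_bit: bool) -> bool:
    """Should the helper retain routes after this kind of session termination?"""
    if end is SessionEnd.TCP_FAILURE:
        return True                  # the original RFC 4724 trigger
    if end is SessionEnd.HARD_RESET:
        return False                 # explicit request to skip GR
    # An ordinary NOTIFICATION keeps GR only if both sides set the N bit.
    return local_n_bit and peer_n_bit


without_n_bit = gr_applies(SessionEnd.NOTIFICATION, local_n_bit=True, peer_n_bit=False)
with_n_bit = gr_applies(SessionEnd.NOTIFICATION, local_n_bit=True, peer_n_bit=True)
```

Requiring the N bit from both peers keeps the behavior backward compatible: a peer that only implements RFC 4724 continues to see NOTIFICATION as a full reset.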
GR Roles: Restarting Speaker vs. Helper
The asymmetry between the restarting speaker and the helper is a common source of confusion. A few important clarifications:
- Both sides must advertise GR capability, but they take different roles at restart time. Either peer can be the restarting speaker at any given event.
- Capability without address families -- A speaker can advertise the GR capability with a Restart Time of 0 and no address families listed. This means "I support the End-of-RIB marker and will act as a helper for you, but I cannot preserve my own forwarding state, so do not retain routes for me." This is called being a GR helper only.
- Helper-only mode -- Many operators configure their route servers (at IXPs) and route reflectors in helper-only mode. These devices retain stale routes for restarting peers but never expect to restart themselves with forwarding preservation (because they don't forward traffic at all -- they are purely control-plane devices).
- Dual restarts -- If both speakers restart simultaneously (e.g., both sides of a link reboot), GR cannot help because there is no helper to retain routes. This is an inherent limitation.
Operational Considerations
Deploying Graceful Restart in production involves several practical considerations beyond the protocol mechanics.
When GR Helps
- Software upgrades -- ISSU (In-Service Software Upgrade) on routers with redundant route processors is the canonical use case. The standby RP takes over, the BGP process restarts, and GR prevents route churn during the switchover.
- BGP daemon restarts -- On Linux-based routers running FRR or BIRD, restarting the BGP daemon for configuration changes or version upgrades while the kernel continues forwarding.
- Process crashes -- If the BGP process crashes and auto-restarts quickly enough, GR can mask the event entirely from the rest of the network.
- Route reflector maintenance -- Taking a route reflector offline for maintenance while clients retain their reflected routes.
When GR Hurts
- Forwarding plane failures -- If the router's forwarding plane is also down (hardware failure, line card crash, fiber cut), GR causes the helper to retain routes pointing to a black hole. Traffic is silently dropped instead of being rerouted to an alternate path. This is the most dangerous failure mode of GR.
- Slow restarts -- If the restart consistently takes longer than the configured Restart Timer, GR provides no benefit. The helper expires the timer, deletes stale routes, and the network reconverges anyway -- just with a delay.
- Security concerns -- GR requires the helper to trust that the restarting peer's forwarding state is actually preserved. A malicious peer could exploit this by deliberately restarting to keep stale (and potentially hijacked) routes in the helper's RIB for an extended period.
Tuning Recommendations
- Restart Timer -- Set to 2-3x your measured worst-case restart time. Too short and GR will fail on slow restarts; too long and you retain stale routes for extended periods during real failures. 120-180 seconds is common for hardware platforms; 30-60 seconds for software routers with fast startup.
- Stale Path Timer -- A value around 360 seconds works for most deployments. Increase it if you have many peers that are slow to send End-of-RIB.
- LLGR Stale Time -- If you deploy LLGR, typical values range from 3600 seconds (1 hour) to 86400 seconds (1 day). The right value depends on your maintenance windows and how long you're willing to carry de-preferred routes.
- BFD interaction -- If you use BFD for fast failure detection, ensure your implementation supports GR-aware BFD. Otherwise, BFD may trigger a GR event for every forwarding-plane failure, leading to black-holing.
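The Restart Timer rule of thumb above is simple arithmetic and can be captured in a few lines. The function name `recommended_restart_time` and the sample measurements are hypothetical; the multiplier of 2.5 is just one point inside the 2-3x range, and the ceiling reflects the 12-bit Restart Time field.

```python
import math
from typing import Iterable

MAX_RESTART_TIME = 4095  # largest value that fits in the 12-bit Restart Time field


def recommended_restart_time(measured_seconds: Iterable[float], factor: float = 2.5) -> int:
    """Scale the worst observed restart by a safety factor, capped at the field maximum."""
    worst_case = max(measured_seconds)
    return min(MAX_RESTART_TIME, math.ceil(worst_case * factor))


# Restart durations observed in a lab (illustrative numbers, not real measurements).
observed = [35.0, 42.0, 58.0]
restart_time = recommended_restart_time(observed)
```

With these sample numbers the worst case is 58 seconds, so the suggested Restart Timer is 145 seconds, inside the 120-180 second range the list above describes as common for hardware platforms.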
GR in the Real World: Vendor and Software Support
Graceful Restart is supported by all major BGP implementations, though the depth of support varies:
- Cisco IOS/IOS-XR/NX-OS -- Full GR and NSF support on all modern platforms. LLGR support was added in IOS-XR 7.x. Typically requires explicit configuration ("bgp graceful-restart").
- Juniper Junos -- Full GR and NSF support. LLGR has been supported since Junos 15.1. GR is enabled by default on many platforms but can be tuned. Junos uses "graceful-restart" at the routing-options and protocol levels.
- Arista EOS -- Full GR support. LLGR support in recent releases. Configurable per peer group or globally.
- FRRouting (FRR) -- Full GR support as both restarting speaker and helper. LLGR support was added in FRR 8.x. GR leverages the Linux kernel's FIB for forwarding continuity. Configured via "bgp graceful-restart" in the "router bgp" context.
- BIRD -- Full GR and LLGR support. BIRD is widely used on IXP route servers, where helper-only mode is common.
- OpenBGPD -- GR support as a helper. Does not support being a restarting speaker with forwarding preservation (consistent with its use case as a route server / control-plane-only device).
GR and Route Server Operations at IXPs
GR is particularly important at Internet Exchange Points where route servers aggregate hundreds of peers. When a route server restarts for maintenance or upgrades, without GR, all peers would lose all routes learned via the route server simultaneously -- potentially disrupting peering traffic across the entire exchange.
With GR (and especially LLGR), a route server restart is transparent to the participants. The peers (acting as helpers) retain the routes learned from the route server during the restart, and traffic continues to flow over the peering LAN. Major IXP route server platforms like BIRD and OpenBGPD support GR precisely for this reason.
Conversely, when a peer restarts, the route server acts as a helper, retaining that peer's routes so other participants continue to send traffic to the restarting peer (whose forwarding plane is presumably still active).
Relationship to Other High-Availability Mechanisms
Graceful Restart does not exist in isolation. It is part of a broader set of high-availability mechanisms in modern networks:
- Non-Stop Routing (NSR) -- An alternative to GR where the standby route processor maintains a fully synchronized copy of the BGP RIB. When the active RP fails, the standby takes over without dropping the TCP session or requiring the peer to act as a helper. NSR is transparent to the peer but requires more memory and CPU on the local router. NSR and GR can coexist as complementary mechanisms.
- BFD -- Bidirectional Forwarding Detection provides fast failure detection (milliseconds) that can trigger GR. BFD and GR interact closely: BFD detects the event, GR manages the recovery.
- ECMP and multipath -- Equal-Cost Multipath provides redundancy at the forwarding level. If traffic is spread across multiple next hops, losing one (during a GR event) reduces capacity but does not cause a complete outage. GR and ECMP complement each other.
- Add-Path -- BGP Add-Path (RFC 7911) allows a speaker to advertise multiple paths for the same prefix. This gives the receiving speaker backup paths in its RIB, reducing the impact of any single path withdrawal. Combined with GR, this provides defense in depth.
Common Pitfalls and Debugging
When GR does not work as expected, the following are common causes:
- GR not negotiated -- Check that both peers include the GR capability in their OPEN messages. Many implementations require explicit configuration to enable GR. Use "show bgp neighbor" output to verify GR state.
- Restart takes too long -- If the BGP process takes longer to restart than the advertised Restart Time, the helper expires the timer and deletes stale routes. Monitor actual restart times and adjust the timer accordingly.
- Forwarding state not preserved -- The restarting speaker sends F=0, telling the helper that forwarding state was lost. The helper immediately deletes stale routes. Check NSF/NSR configuration on the restarting router and ensure the forwarding plane is decoupled from the control plane.
- Route policy changes -- If routing policy changed during the restart (e.g., new filters were applied), the routes re-advertised after restart may differ from the stale routes. This is correct behavior but can cause unexpected route changes.
- NOTIFICATION clears GR state -- If the N bit is not negotiated and the session is cleared with a NOTIFICATION (e.g., "clear bgp neighbor" with a hard reset), GR will not activate. Use soft reset or ensure the N bit is negotiated.
- Max-prefix limits -- If the helper has max-prefix limits configured and the restarting speaker sends more prefixes than the limit during route refresh, the session may be torn down again, defeating GR.
Explore BGP Sessions and Routes
Understanding Graceful Restart helps explain why BGP convergence events sometimes cause traffic shifts and sometimes do not. Use the god.ad BGP Looking Glass to examine real AS paths and observe how routes propagate through the global routing table. When you see a route through a particular autonomous system, consider whether that AS's routers support GR -- and whether a restart on that path would be transparent or disruptive.
- AS13335 -- Cloudflare: extensive use of GR and NSR across its global backbone
- AS15169 -- Google: one of the largest BGP deployments, heavy use of GR for maintenance
- AS6939 -- Hurricane Electric: major transit provider with hundreds of IXP peerings
- 1.1.1.1 -- Examine AS path stability for a heavily anycast prefix
- 8.8.8.8 -- Check how Google DNS routes propagate across transit and peering