Optus BGP Outage 2023: How a Routing Change Took Down an Entire National Carrier
On November 8, 2023, Optus -- Australia's second-largest telecommunications provider and a subsidiary of Singapore Telecommunications (Singtel) -- suffered a nationwide network outage that lasted over 14 hours. Approximately 10 million customers lost access to mobile, fixed-line, and internet services. The outage knocked out emergency triple-zero (000) calls for hundreds of thousands of people, disrupted hospitals, public transit payment systems, and businesses across the country. The root cause was a BGP routing change that propagated through Optus's network in a way that the company's routers could not handle, triggering a cascading failure of its core routing infrastructure.
The Optus outage of 2023 stands as one of the most severe telecommunications failures in Australian history. It prompted a formal investigation by the Australian Communications and Media Authority (ACMA), a Senate inquiry, and significant regulatory action. The incident demonstrates how a single BGP routing event can take down an entire national carrier, and it exposed critical gaps in Australia's telecommunications resilience -- particularly regarding access to emergency services during network failures.
Background: Optus and the Australian Telecommunications Landscape
Optus operates AS4804 (Optus Internet) and AS7474 (Optus Backbone), among other AS numbers. It is the second-largest carrier in Australia after Telstra and serves roughly 10 million mobile subscribers and 1.5 million fixed broadband customers. Optus's network spans the continent, providing mobile coverage in urban and regional areas, fixed-line services via its own fiber and HFC infrastructure, and enterprise connectivity.
Optus is a wholly owned subsidiary of Singtel, Singapore's largest telecommunications company, which operates AS7473 and other AS numbers. The Singtel/Optus network is interconnected with the global internet through peering relationships at major Internet Exchange Points and through transit from upstream providers. This international connectivity becomes relevant to understanding how the outage-triggering event entered Optus's network.
What Happened: The Technical Root Cause
According to Optus's public statements and the subsequent ACMA investigation, the outage was triggered by changes to routing information received from an international peering network, reportedly following a routine software upgrade. Specifically, one of Singtel's international internet exchanges propagated BGP routing table changes that included an unusually large number of route updates, and these updates were passed on to Optus's core routers.
The key failure was in how Optus's routers handled this routing information:
- Route update propagation. On the morning of November 8, Optus's border routers received a set of BGP route updates from the Singtel peering exchange. The nature of these updates -- likely involving a large number of prefixes or AS path changes -- was unusual but not inherently malicious.
- Routing table overflow / safety limit breach. The volume of route updates exceeded preset safety limits on Optus's Cisco-based routers -- by Optus's own later account, roughly 90 provider edge routers were affected. Optus had not adjusted these default thresholds for its network, nor had it implemented filters to stop such updates from propagating inward. When the limits were breached, the affected routers disconnected themselves from the IP core network to protect themselves, leaving them in an effectively failed, isolated state.
- Cascading router failures. As routers failed or self-isolated, they tore down their BGP sessions, forcing adjacent routers to recalculate routes and propagating further instability. This positive feedback loop -- routers failing, causing route withdrawals, causing other routers to fail -- spread across Optus's entire national backbone within minutes (a toy simulation of this feedback loop follows the list).
- Complete network isolation. With the core routing fabric down, Optus's mobile base stations, fixed-line aggregation points, and internet gateways all lost connectivity to each other and to the internet. The network was effectively partitioned into isolated segments with no ability to route traffic between them.
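Optus has not published the precise failure mechanics, but the feedback loop described above is straightforward to illustrate. The Python sketch below uses an entirely hypothetical topology and made-up capacity numbers -- it is not a reconstruction of Optus's network -- and simply shows how a burst of updates that trips per-router safety limits on the smallest routers can cascade upward until the whole routing fabric is down.

```python
# Toy model of a cascading BGP control-plane failure.
# Topology, capacities, and numbers are hypothetical illustrations,
# not a reconstruction of Optus's actual network.
from collections import deque

# Simplified topology: one border router, two core routers, four
# provider-edge routers. Values are each router's BGP neighbours.
TOPOLOGY = {
    "border-1": ["core-1", "core-2"],
    "core-1":   ["border-1", "core-2", "edge-1", "edge-2"],
    "core-2":   ["border-1", "core-1", "edge-3", "edge-4"],
    "edge-1":   ["core-1"],
    "edge-2":   ["core-1"],
    "edge-3":   ["core-2"],
    "edge-4":   ["core-2"],
}

# Route-update "pressure" each router can absorb before its safety limits
# trip and it tears down its sessions (edge routers are the smallest).
CAPACITY = {"border-1": 1_000_000, "core-1": 800_000, "core-2": 800_000,
            "edge-1": 400_000, "edge-2": 400_000,
            "edge-3": 400_000, "edge-4": 400_000}

# Extra churn a failure dumps on each neighbour (withdrawals + reconvergence).
CHURN_PER_FAILURE = 300_000


def simulate(burst: int) -> set[str]:
    """Flood `burst` route updates through the network and return the set of
    routers that have failed once the cascade settles."""
    # Step 1 (simplification): the burst is re-advertised hop by hop,
    # so every router ends up seeing it.
    pressure = {router: burst for router in TOPOLOGY}
    failed = {r for r in TOPOLOGY if pressure[r] > CAPACITY[r]}
    queue = deque(failed)

    # Step 2: each failure adds withdrawal churn on its neighbours, which can
    # push them over their own limits -- the positive feedback loop.
    while queue:
        down = queue.popleft()
        for neighbour in TOPOLOGY[down]:
            if neighbour in failed:
                continue
            pressure[neighbour] += CHURN_PER_FAILURE
            if pressure[neighbour] > CAPACITY[neighbour]:
                failed.add(neighbour)
                queue.append(neighbour)
    return failed


if __name__ == "__main__":
    print("Normal churn:", simulate(burst=100_000))   # -> set(): nothing fails
    print("Update storm:", simulate(burst=600_000))   # -> all seven routers fail
```

With these made-up numbers, normal routing churn leaves every router standing, while a single oversized burst trips the smallest routers first and their withdrawal churn then drags down the core and border routers.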
Timeline of the Outage
All times are in Australian Eastern Daylight Time (AEDT), which is UTC+11.
- ~04:05 AEDT (17:05 UTC, Nov 7) -- Optus's network begins experiencing failures. Mobile, fixed broadband, and enterprise services start going offline across the country. The outage affects all states and territories.
- ~05:00 AEDT -- Optus publicly acknowledges the outage, stating that it is "aware of an issue" and is "working to restore services as quickly as possible." No technical details are provided.
- ~06:00-08:00 AEDT -- The full scope of the outage becomes clear. Hospital communication systems, public transit card readers (Opal in Sydney, Myki in Melbourne), EFTPOS payment terminals, and business phone systems are all affected. Reports emerge of people unable to call 000 for emergencies.
- ~09:00 AEDT -- The Australian government holds an emergency briefing. The Minister for Communications contacts Optus CEO Kelly Bayer Rosmarin demanding an explanation. The Optus Network Operations Center (NOC) is working to manually rebuild the routing fabric.
- ~12:00-14:00 AEDT -- Services begin to be partially restored in some areas as engineers rebuild BGP sessions and restore routing between network segments. Recovery is gradual and uneven.
- ~18:30 AEDT -- Optus declares that services have been substantially restored. Some residual issues continue for hours afterward.
Why Recovery Took 14 Hours
A critical question raised by the Optus outage is why recovery took so long. A BGP misconfiguration can typically be rolled back in minutes -- simply filter the offending routes, reset the BGP sessions, and let the routing table reconverge. The extended recovery time in Optus's case was due to several compounding factors:
- Widespread router instability. The initial BGP event did not just cause incorrect routes -- it caused core routers to become unstable or unresponsive. Simply rolling back the triggering routes was not enough; the routers needed to be individually stabilized and their BGP sessions manually re-established.
- Loss of management plane access. When the core routing fabric failed, Optus engineers lost remote access to many routers. The management network shared infrastructure with the data plane, meaning engineers could not SSH into routers to diagnose and fix the problem. Physical access to equipment or out-of-band management was needed in some cases.
- Controlled recovery to avoid re-triggering. Engineers could not simply bring all routers back online simultaneously. Each segment had to be restored carefully, with routes verified and filters applied before re-establishing BGP sessions, to prevent the cascade from recurring (a generic sketch of this staged bring-up follows the list).
- Scale of the network. Optus's network spans an entire continent. Core routers in Sydney, Melbourne, Brisbane, Perth, Adelaide, and regional locations all needed to be individually addressed. The geographic dispersion of equipment added time to the recovery process.
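Optus has not published its actual recovery runbook, so the following is only a generic, hypothetical sketch of the staged, verify-before-enable bring-up that a controlled recovery like this implies. Segment names, helper functions, and checks are placeholders.

```python
# Generic sketch of a staged bring-up: restore one network segment at a time,
# and only re-enable BGP sessions after filters are in place and the routing
# table has been verified. Segment names and helpers are hypothetical
# placeholders, not Optus's actual runbook or tooling.
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    routers: list[str]


SEGMENTS = [
    Segment("sydney-core", ["syd-cr1", "syd-cr2"]),
    Segment("melbourne-core", ["mel-cr1", "mel-cr2"]),
    Segment("regional-aggregation", ["reg-ar1", "reg-ar2"]),
]


def apply_inbound_filters(router: str) -> bool:
    """Placeholder: push max-prefix limits and prefix filters to the router."""
    print(f"  [{router}] applying inbound route filters")
    return True


def verify_routing_table(router: str) -> bool:
    """Placeholder: confirm the table contains only expected prefixes/sizes."""
    print(f"  [{router}] verifying routing table")
    return True


def enable_bgp_sessions(router: str) -> None:
    """Placeholder: bring the router's iBGP/eBGP sessions back up."""
    print(f"  [{router}] re-establishing BGP sessions")


def staged_recovery(segments: list[Segment]) -> None:
    for segment in segments:
        print(f"Restoring segment: {segment.name}")
        for router in segment.routers:
            # Filters and verification come first, so the triggering update
            # storm cannot cascade a second time once sessions come back.
            if apply_inbound_filters(router) and verify_routing_table(router):
                enable_bgp_sessions(router)
            else:
                print(f"  [{router}] verification failed -- pausing this segment")
                break


if __name__ == "__main__":
    staged_recovery(SEGMENTS)
```

The essential point is the ordering: inbound filters and table verification come before any BGP session is re-enabled, so the condition that caused the original cascade cannot immediately replay itself.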
Impact on Emergency Services
The most serious consequence of the Optus outage was the disruption to Australia's emergency call service (Triple Zero / 000). Under Australian law, all mobile phones -- even those without an active SIM card -- must be able to call 000. The system works by routing the call through whatever mobile network is available, not just the subscriber's own carrier.
However, when Optus's entire mobile network was down, Optus subscribers' phones could not connect to the Optus network at all. In theory, these handsets should have been able to camp on Telstra's or Vodafone's networks to place emergency calls. In practice, this fallback did not work reliably across every handset and network combination, and Optus fixed-line customers had no mobile fallback at all. The result was that hundreds of thousands of Australians could not call emergency services for hours.
Optus initially reported that approximately 228 calls to 000 failed to connect during the outage; the ACMA's subsequent investigation put the number of people unable to reach Triple Zero considerably higher. While those numbers may seem small relative to 10 million affected customers, each failed call could have been a life-threatening emergency. The Australian government subsequently moved to mandate that all carriers implement emergency call roaming capabilities to prevent a similar failure from endangering lives.
Regulatory Response
The Optus outage triggered significant regulatory and political action:
- ACMA Investigation. The Australian Communications and Media Authority launched a formal investigation into whether Optus had met its obligations under the Telecommunications Act and the Emergency Call Service Determination. The investigation focused on the failure to provide access to emergency services and the adequacy of Optus's network resilience measures.
- Senate Inquiry. The Australian Senate's Environment and Communications References Committee conducted a public inquiry into the outage, calling Optus executives and technical experts to testify. The inquiry examined the technical root cause, Optus's incident response, and systemic issues in Australian telecommunications resilience.
- CEO Resignation. Optus CEO Kelly Bayer Rosmarin resigned on November 20, 2023, twelve days after the outage. This was her second major crisis at Optus in just over a year, following the September 2022 data breach that exposed personal information of 9.8 million customers.
- Mandatory Network Resilience Standards. The Australian government announced plans to develop mandatory network resilience standards for telecommunications carriers, including requirements for redundant routing, emergency call roaming, and out-of-band management access.
Technical Lessons
BGP Route Filtering Is Not Optional
The Optus outage was preventable. Standard BGP best practices -- max-prefix limits, route filtering, and prefix validation via RPKI -- exist precisely to prevent scenarios where an unexpected volume or type of route updates can destabilize a network. The MANRS (Mutually Agreed Norms for Routing Security) initiative, which Optus had not fully implemented, provides a framework for these protections.
At a minimum, every BGP session should have the following protections (a vendor-neutral sketch of the policy logic follows the list):
- Max-prefix limits -- Automatically shut down a BGP session if the number of received prefixes exceeds a configured threshold. This prevents a peer from flooding the routing table.
- Prefix filters -- Explicitly list the prefixes expected from each customer and peer, built from IRR (Internet Routing Registry) data, and reject everything else. On sessions that carry larger or full tables, at least drop bogons, your own prefixes, and routes that fail RPKI ROA validation.
- Route damping or rate limiting -- Limit the rate at which route updates are processed to prevent a burst of updates from overwhelming the control plane.
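Router configuration syntax for these protections varies by vendor, so rather than quote any particular CLI, here is a vendor-neutral Python sketch of the inbound policy logic they implement. The thresholds and the sample allow-list are made-up illustrative values, and real routers pace update processing rather than failing the session the way the simplified rate limiter below does.

```python
# Vendor-neutral sketch of the inbound BGP session protections listed above:
# max-prefix limits, a per-peer prefix allow-list, and update rate limiting.
# Thresholds and the sample allow-list are illustrative values only.
import ipaddress
import time

MAX_PREFIXES = 10_000          # shut the session if the peer sends more than this
MAX_UPDATES_PER_SECOND = 500   # simple control-plane rate limit

# Prefixes expected from this peer (built from IRR/RPKI data in practice).
PEER_ALLOW_LIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


class SessionShutdown(Exception):
    """Raised when a safety limit trips and the BGP session must be closed."""


class InboundPolicy:
    def __init__(self):
        self.accepted = set()
        self.window_start = time.monotonic()
        self.updates_in_window = 0

    def _rate_limit(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            self.window_start, self.updates_in_window = now, 0
        self.updates_in_window += 1
        if self.updates_in_window > MAX_UPDATES_PER_SECOND:
            # Real routers pace processing; failing closed keeps the sketch simple.
            raise SessionShutdown("update rate exceeded")

    def receive(self, prefix: str) -> bool:
        """Return True if the announced prefix is accepted into the table."""
        self._rate_limit()
        network = ipaddress.ip_network(prefix, strict=False)

        # Prefix filter: accept only routes covered by the peer's allow-list.
        if not any(network.subnet_of(allowed) for allowed in PEER_ALLOW_LIST):
            return False

        # Max-prefix limit: tear down the session before the table overflows.
        self.accepted.add(network)
        if len(self.accepted) > MAX_PREFIXES:
            raise SessionShutdown("max-prefix limit exceeded")
        return True


if __name__ == "__main__":
    policy = InboundPolicy()
    print(policy.receive("192.0.2.128/25"))   # True: inside the allow-list
    print(policy.receive("203.0.113.0/24"))   # False: filtered out
```

On real equipment the same logic is expressed as max-prefix settings on the neighbor, prefix-lists or route filters applied inbound, and control-plane policing on the BGP session itself.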
Management Plane Separation
When the core routing fabric failed, Optus lost the ability to remotely manage its routers. This is a critical design flaw. The management plane (SSH, SNMP, telemetry) should be accessible via an out-of-band network that does not depend on the production routing fabric. Many large carriers use dedicated management networks, console servers with cellular backup, or satellite-based out-of-band access to ensure they can always reach their equipment -- even during a total routing failure.
Emergency Service Resilience
The failure of emergency call access exposed a systemic vulnerability in how mobile networks handle emergency roaming. When a subscriber's home network is completely unavailable, the handset should automatically attempt to register on any available network for emergency calls. This capability exists in the GSM/LTE specifications (emergency call without registration), but it was not reliably implemented across all handset and network combinations in Australia at the time of the outage.
Comparison with Other BGP Incidents
The Optus outage shares characteristics with several other major BGP-related incidents:
- Facebook October 2021 -- A faulty command issued during routine maintenance disconnected Facebook's backbone, after which its authoritative DNS servers withdrew their BGP routes, causing a roughly 6-hour global outage of Facebook, Instagram, and WhatsApp. Like Optus, Facebook's management tools depended on the same network, hampering recovery.
- Route leaks -- While the Optus outage was caused by received routes overwhelming internal infrastructure rather than routes being leaked to the internet, the root cause is the same family of problems: inadequate BGP filtering and a failure to validate routing information at network boundaries.
- CenturyLink/Level3 August 2020 -- A BGP flowspec (traffic filtering) rule propagation error caused a cascading failure across CenturyLink's network, resulting in a 5+ hour outage affecting 911 services in multiple US states. Like Optus, the CenturyLink incident demonstrated how BGP control plane failures can have life-safety implications.
Explore the Routing Infrastructure
The BGP routing tables that Optus's routers process are the same routing tables visible through looking glass tools. You can examine Optus's AS4804 and AS7474 routes, explore their peering relationships, and look at how Australian internet traffic is routed today. Use the god.ad BGP Looking Glass to look up any autonomous system, IP address, or network prefix to see how the global routing table connects networks around the world.
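If you prefer to pull the same data programmatically, public BGP data services expose it via HTTP APIs. The sketch below queries the RIPEstat data API for the prefixes currently announced by Optus's AS numbers; the endpoint name and response layout are believed correct but should be checked against the RIPEstat documentation before relying on them.

```python
# Sketch: list the prefixes currently originated by an AS using the public
# RIPEstat data API. Endpoint and response layout are believed correct but
# should be verified against the RIPEstat documentation.
import json
import urllib.request

RIPESTAT_URL = "https://stat.ripe.net/data/announced-prefixes/data.json"


def announced_prefixes(asn: str) -> list[str]:
    """Return the prefixes RIPEstat currently sees announced by `asn`."""
    url = f"{RIPESTAT_URL}?resource={asn}"
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return [entry["prefix"] for entry in payload["data"]["prefixes"]]


if __name__ == "__main__":
    # AS4804 (Optus Internet) and AS7474 (Optus backbone), per the article.
    for asn in ("AS4804", "AS7474"):
        prefixes = announced_prefixes(asn)
        print(f"{asn}: {len(prefixes)} announced prefixes")
        for prefix in prefixes[:5]:
            print("  ", prefix)
```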