Azure DNS Outage 2023: How a Misconfigured Deployment Took Down Microsoft Services

On July 30, 2023, Microsoft Azure experienced a widespread outage that disrupted access to a broad range of Microsoft services and third-party applications hosted on Azure. The root cause was a misconfigured deployment to Azure's DNS infrastructure that triggered a cascading failure across multiple Azure regions. For approximately two hours, DNS resolution for Azure-hosted services degraded severely, making Azure Portal, Microsoft 365, Teams, Power Platform, and hundreds of thousands of customer applications intermittently unreachable.

This outage is notable because it was not caused by a hardware failure, a DDoS attack, or a third-party dependency. It was a self-inflicted wound: a configuration change deployed to Azure's own DNS servers caused those servers to return malformed responses, which triggered retries and amplification that overwhelmed the DNS infrastructure further. The incident illustrates how even the most sophisticated cloud providers remain vulnerable to DNS as a single point of failure -- a lesson reinforced repeatedly by events like the Facebook DNS outage of 2021 and the Akamai DNS outage of 2021.

What Is Azure DNS?

Azure DNS serves two critical functions within Microsoft's cloud infrastructure:

  1. Authoritative hosting of customer DNS zones, answering public queries for domains that customers host in Azure.
  2. Name resolution for Azure's own platform: the portal, the management plane, and services such as Azure Active Directory, Storage, and SQL Database use the same DNS infrastructure to locate their endpoints and backends.

Azure DNS is built on a globally distributed anycast network, with DNS servers deployed in every Azure region. Queries are routed to the nearest available DNS server based on BGP anycast routing. This architecture provides low latency and high availability under normal conditions, but it also means that a configuration change deployed globally can affect all regions simultaneously.
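
Which anycast instance answered a given query can sometimes be checked directly: many DNS operators expose an instance identifier through the CHAOS-class id.server TXT name or the EDNS NSID option. A minimal sketch using the dnspython package is below; ns1-01.azure-dns.com is one of the nameserver names Azure DNS assigns to hosted zones, and whether those servers answer id.server queries is an assumption rather than a documented guarantee.

    # Minimal sketch: ask an anycast nameserver to identify which instance
    # answered. Assumes the dnspython package; support for CHAOS-class
    # "id.server" queries on Azure's nameservers is an assumption.
    import dns.message
    import dns.query
    import dns.rdataclass
    import dns.rdatatype
    import dns.resolver

    NS_HOST = "ns1-01.azure-dns.com"   # a nameserver name Azure DNS assigns to hosted zones
    ns_ip = dns.resolver.resolve(NS_HOST, "A")[0].to_text()

    query = dns.message.make_query("id.server", dns.rdatatype.TXT,
                                   rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(query, ns_ip, timeout=3)

    for rrset in resp.answer:
        print(rrset)   # instance identifier, if the operator exposes one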

Timeline of the Outage

Azure DNS Outage Timeline -- July 30, 2023

  ~14:35 UTC -- Configuration change deployed to Azure DNS infrastructure. A code change affecting DNS name resolution is pushed globally.
  ~14:40 UTC -- DNS anomalies begin. Azure DNS returns malformed responses; recursive resolvers begin retrying failed queries aggressively.
  ~14:45 UTC -- Cascading failures spread across multiple Azure regions. Query volume spikes as retries and timeouts compound.
  ~15:10 UTC -- Microsoft acknowledges the issue on the Azure Status page. Teams, Microsoft 365, the Azure Portal, and Power Platform are confirmed affected.
  ~15:30 UTC -- Rollback of the DNS configuration change initiated. Engineering teams begin reverting the deployment globally.
  ~16:10 UTC -- Rollback complete. DNS resolution begins recovering as cached entries expire and clients receive correct responses.
  ~17:00 UTC -- Full recovery confirmed. All services restored to normal, with some residual impact from DNS cache propagation delays.

The Root Cause: A Misconfigured DNS Deployment

According to Microsoft's post-incident review, a code change was deployed to Azure DNS servers that contained a defect affecting how DNS queries were processed. The change was intended to improve DNS infrastructure resilience, but it introduced a bug that caused DNS servers to return invalid responses for a subset of queries.

The specific failure mechanism involved the following chain:

  1. The deployment rolled out globally. Unlike application deployments that typically follow a staged rollout (canary region -> ring 0 -> ring 1 -> global), this DNS infrastructure change was deployed more broadly before the defect was detected. Azure DNS is a foundational service -- it cannot easily be tested in isolation because virtually every other Azure service depends on it.
  2. DNS servers began returning malformed or erroneous responses. The exact nature of the malformed responses has not been fully disclosed, but the effect was that recursive resolvers and clients received responses they could not use -- SERVFAIL, truncated responses, or responses with incorrect records.
  3. Retry storms amplified the problem. When DNS resolution fails, clients and recursive resolvers retry. A single failed query can generate 3-5 retries, each of which hits the already-struggling DNS servers. This positive feedback loop -- failure causes retries, retries increase load, increased load causes more failures -- is a well-known pattern in DNS outages (illustrated in the sketch after this list). The Dyn DNS attack of 2016 demonstrated how retry amplification can overwhelm DNS infrastructure even beyond the initial trigger.
  4. Dependent services cascaded. Azure services that rely on DNS for service discovery, endpoint resolution, and health checking all began experiencing failures. Azure Active Directory, Key Vault, Storage, SQL Database, and dozens of other services use DNS internally to locate their own infrastructure. When DNS failed, these services could not connect to their backends, and they began returning errors to customers.
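
A minimal sketch of that feedback loop, using illustrative numbers rather than figures from the incident: as the success rate of the DNS tier falls, client retries multiply the offered load.

    # Minimal sketch of retry amplification. The base rate, retry count, and
    # success rates are assumptions for illustration, not measurements.
    def offered_load(base_qps: float, success_rate: float, max_retries: int) -> float:
        """Total queries/sec once every failed attempt is retried up to max_retries times."""
        load = 0.0
        attempt_qps = base_qps
        for _ in range(max_retries + 1):          # initial attempt plus retries
            load += attempt_qps
            attempt_qps *= (1.0 - success_rate)   # only failed attempts are retried
        return load

    base = 1_000_000  # assumed organic demand: 1M queries/sec
    for success in (0.99, 0.75, 0.50, 0.25):
        print(f"success={success:.0%}  offered load={offered_load(base, success, 4):,.0f} qps")

With a 25% success rate and up to four retries per failure, the offered load roughly triples -- and that extra load lands on servers that are already failing.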

The Blast Radius

The outage affected services across multiple categories:

  1. Microsoft first-party services, including Microsoft 365, Teams, the Azure Portal, and Power Platform.
  2. Azure platform services, including Azure Active Directory, Key Vault, Storage, and SQL Database, which could not reliably resolve their own backends.
  3. Customer applications hosted on Azure, which inherited the failures of the platform services they depend on.

The geographic scope was broad. While not every region was equally affected (the severity depended on the specific DNS servers each client was routed to via anycast), reports of degradation came from North America, Europe, Asia Pacific, and other regions. The anycast architecture that normally provides resilience meant that the faulty configuration was serving queries globally.

Why DNS Outages Cascade

[Diagram: DNS failure cascade in cloud infrastructure. Azure DNS returns malformed responses or SERVFAIL; Azure AD / Entra ID, Storage / SQL / Key Vault, and the Azure Portal / management APIs cannot resolve their endpoints; SSO and auth flows fail, data access fails, and operators cannot mitigate because the portal is inaccessible; retry amplification (each failed query retried 3-5 times) compounds the load; the end result is customer-facing disruption across M365, Teams, customer apps, and APIs.]

DNS is the most deeply embedded dependency in modern cloud infrastructure. Every service-to-service call, every API request, every database connection begins with a DNS lookup. When DNS fails, the failure does not manifest as "DNS is down" -- it manifests as "everything is down," because applications see timeout errors, connection refused errors, and SERVFAIL responses without understanding that the root cause is DNS resolution.
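
This is why the first diagnostic question in such an incident is whether name resolution itself is failing. The sketch below separates a resolution failure from a connect failure; the hostname at the bottom is a hypothetical placeholder, not a real endpoint.

    # Minimal sketch: classify whether a symptom is a DNS failure or a connect
    # failure. The hostname queried at the bottom is a hypothetical placeholder.
    import socket

    def classify_failure(host: str, port: int = 443, timeout: float = 3.0) -> str:
        try:
            addrinfo = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror as exc:
            return f"DNS resolution failed: {exc}"   # SERVFAIL/NXDOMAIN/resolver timeouts surface here
        sockaddr = addrinfo[0][4]
        try:
            with socket.create_connection(sockaddr[:2], timeout=timeout):
                return f"resolved to {sockaddr[0]} and connected"
        except OSError as exc:
            return f"resolved to {sockaddr[0]} but connect failed: {exc}"

    print(classify_failure("example-app.azurewebsites.net"))  # hypothetical hostname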

The cascade pattern repeats in every major DNS outage:

  1. DNS resolution fails or becomes unreliable
  2. Services that depend on DNS (which is everything) begin failing their health checks and timing out on backend connections
  3. Load balancers, service meshes, and orchestrators detect unhealthy instances and attempt to route around them -- but the "healthy" instances are also failing DNS resolution
  4. Retry logic in applications amplifies the DNS query load by 3-5x
  5. Monitoring and alerting systems, which also depend on DNS, may fail to notify operators
  6. Management consoles (Azure Portal, AWS Console) become unreachable, hampering incident response

Comparison with Other DNS Outages

The Azure DNS outage of 2023 fits a well-established pattern of DNS-related incidents at major providers:

  1. Facebook, October 2021 -- a BGP configuration change withdrew the routes to Facebook's authoritative DNS servers, taking Facebook, Instagram, and WhatsApp offline for roughly six hours.
  2. Akamai, July 2021 -- a bug triggered by a configuration update in Akamai's Edge DNS service caused widespread resolution failures for major sites that relied on it.
  3. Dyn, October 2016 -- a Mirai-botnet DDoS against Dyn's managed DNS knocked major sites offline across much of the US East Coast, with retry amplification compounding the attack load.

The common thread is clear: DNS is the Achilles heel of internet services. It is a single point of failure that, when it breaks, takes down everything that depends on name resolution -- which is essentially everything.

Microsoft's Response and Remediation

Microsoft's post-incident response included several commitments to prevent similar outages, centered on the deployment process that allowed the defect to reach production globally: more conservative staged rollouts for DNS infrastructure changes, expanded pre-deployment validation, and faster detection and rollback when DNS anomalies appear.

Lessons for Infrastructure Operators

The Azure DNS outage of 2023 reinforces several critical lessons for anyone operating internet infrastructure:

DNS Must Be Treated as Critical Infrastructure

DNS changes should be subject to the most rigorous deployment procedures in the organization -- equal to or exceeding those for the network fabric itself. A DNS failure is not a localized service disruption; it is a total infrastructure failure. Canary deployments, automated regression testing, and kill switches for DNS changes are not optional.
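
One way to make that concrete is a regression gate that replays a fixed set of known-good queries against a canary server before a change is promoted any further. A minimal sketch, assuming the dnspython package; the canary address (documentation range) and the query list are hypothetical placeholders.

    # Minimal sketch of a DNS canary gate: require clean answers to a golden
    # query set before a rollout proceeds. Server IP and query names are
    # hypothetical; assumes the dnspython package.
    import dns.message
    import dns.query
    import dns.rcode

    CANARY_SERVER = "203.0.113.53"
    GOLDEN_QUERIES = [
        ("portal.example.com", "A"),
        ("api.example.com", "AAAA"),
        ("db.internal.example.com", "A"),
    ]

    def canary_passes() -> bool:
        for name, rtype in GOLDEN_QUERIES:
            query = dns.message.make_query(name, rtype)
            try:
                resp = dns.query.udp(query, CANARY_SERVER, timeout=2)
            except Exception as exc:
                print(f"FAIL {name}/{rtype}: no response ({exc})")
                return False
            if resp.rcode() != dns.rcode.NOERROR or not resp.answer:
                print(f"FAIL {name}/{rtype}: rcode={dns.rcode.to_text(resp.rcode())}")
                return False
        return True

    if not canary_passes():
        raise SystemExit("DNS canary failed; halt the rollout")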

Retry Amplification Is a Force Multiplier

Any system that triggers client retries must be designed to handle the amplified load. DNS is particularly vulnerable because retries happen at multiple layers: the application retries, the stub resolver retries, the recursive resolver retries, and each retry generates multiple queries (AAAA and A, for example). A DNS failure that doubles query latency can easily generate a 5x increase in query volume.
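
A back-of-the-envelope illustration of that layering, with assumed rather than measured counts:

    # Layered retry multiplication -- every count here is an assumption for
    # illustration: the application retries, the stub resolver tries two
    # servers, and each lookup asks for both A and AAAA records.
    app_attempts = 3      # original attempt plus two application-level retries
    stub_attempts = 2     # stub resolver tries two configured servers per attempt
    record_types = 2      # A and AAAA are queried separately
    print(app_attempts * stub_attempts * record_types)   # 12 queries vs. 2 on a healthy path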

Management Plane Independence

If the management console goes down when the data plane goes down, operators cannot fix the problem through normal channels. Monitoring, alerting, and management tools should have independent DNS resolution (or use hardcoded IP addresses for critical endpoints) to remain functional during DNS outages. This is the same lesson the Facebook outage taught: when Facebook's engineers needed to fix the BGP configuration, their remote access tools also used Facebook DNS and were therefore also unreachable.
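
A minimal sketch of that independence for one critical endpoint: resolve normally, and fall back to a pinned address only when resolution itself fails. The hostname and pinned address (documentation range) are hypothetical placeholders, and pinned addresses need their own review process so they do not go stale.

    # Minimal sketch: DNS-independent fallback for a critical ops endpoint.
    # Hostname and pinned IP are hypothetical placeholders.
    import socket

    PINNED_ADDRESSES = {
        "alerts.example-ops.net": "198.51.100.17",
    }

    def resolve_with_fallback(host: str) -> str:
        try:
            return socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
        except socket.gaierror:
            pinned = PINNED_ADDRESSES.get(host)
            if pinned is None:
                raise
            return pinned   # last-resort path so paging and alerting keep working

    print(resolve_with_fallback("alerts.example-ops.net"))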

Multi-Provider DNS Resilience

Organizations hosting critical services on any single cloud provider should consider secondary DNS providers or local DNS caching that can continue to serve cached records during an upstream outage. While this adds complexity, it provides a buffer during the critical window between a DNS failure and its remediation. DNS supports multiple authoritative nameservers for a zone precisely for this reason -- distributing authority across independently operated servers is a well-established resilience pattern.
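
The "serve stale" behavior standardized in RFC 8767 captures the same idea at the resolver level: keep expired answers around and hand them out when the upstream path stops responding. A minimal in-process sketch, assuming the dnspython package; a real deployment would enable this feature in a caching resolver rather than in application code.

    # Minimal "serve stale" sketch: return an expired cached answer when the
    # live lookup fails. The grace window and the in-process dict cache are
    # illustrative assumptions; assumes the dnspython package.
    import time
    import dns.resolver

    _cache: dict[tuple[str, str], tuple[float, list[str]]] = {}

    def lookup(name: str, rtype: str = "A", stale_grace: float = 3600.0) -> list[str]:
        key = (name, rtype)
        now = time.time()
        try:
            answer = dns.resolver.resolve(name, rtype, lifetime=2.0)
            records = [r.to_text() for r in answer]
            _cache[key] = (now + answer.rrset.ttl, records)
            return records
        except Exception:
            expires, records = _cache.get(key, (0.0, []))
            if records and now < expires + stale_grace:
                return records   # stale but usable while the upstream is down
            raise

    print(lookup("example.com"))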

The Bigger Picture

The Azure DNS outage of July 2023 is one data point in a pattern that repeats every year or two at a major provider. DNS remains the most fragile dependency in the internet's architecture -- a globally shared namespace that everything queries and nothing can function without. As infrastructure grows more complex and more services are layered atop cloud platforms, the blast radius of a DNS failure only increases.

Exploring the routing infrastructure that connects DNS servers, cloud regions, and end users is essential for understanding these failure modes. Use the god.ad BGP Looking Glass to examine the BGP routes and autonomous systems behind Microsoft Azure's network, or look up any IP address to see how traffic is routed across the internet.
