How DNS Fails: 12 Failure Modes That Take Down Websites

Why DNS Failures Are Uniquely Difficult

DNS failures are harder to diagnose than failures at any other layer of the web stack for three reasons. First, DNS is a distributed caching system, so the same query can produce different results depending on where and when you ask. A site might be unreachable from one geographic location but work perfectly from another, or fail for one ISP's customers while working for others.

Second, DNS failures often present as something else entirely. When DNS resolution fails, browsers show "site can't be reached" or "DNS_PROBE_FINISHED_NXDOMAIN" — error messages that most users and many engineers interpret as the server being down, when the server is running perfectly and the problem is purely in name resolution.

Third, DNS changes propagate at the speed of cache expiration, not at the speed of deployment. When you push a code change, it takes effect immediately. When you change a DNS record, the change reaches different resolvers at different times depending on their cache state, the TTL of the previous record, and the behavior of intermediate caching resolvers. This means DNS failures can appear, disappear, and reappear over hours or days as caches expire and refresh.

Failure Mode 1: Expired Domain Registration

This is the simplest failure mode and arguably the most embarrassing. When a domain registration expires, the registrar typically suspends the domain by pulling its delegation or repointing it to a parking page, so the domain stops resolving to your servers and the entire site goes offline. Despite being entirely preventable, domain expiration continues to cause outages because registration is typically handled by a person or team separate from the engineering organization, payment methods on file expire, renewal notification emails go to mailboxes that are no longer monitored, or organizational changes result in nobody knowing which team owns the domain registration.

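Diagnostic approach: Run a WHOIS lookup on the domain. If the registration has expired, the WHOIS record will show an expiration date in the past, and the domain status will typically show "redemptionPeriod" or "pendingDelete." The fix requires contacting the registrar. Recovery windows vary by registrar and TLD: many registrars allow a late renewal for some weeks after expiry (often around 30-45 days), and a domain that has entered the redemption period can usually still be restored, but at a much higher fee. Once the status reaches "pendingDelete," the domain generally can no longer be recovered and will be released.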

Prevention: Enable auto-renewal with a payment method that does not expire (or update it proactively). Set domain expiration monitoring in at least two independent systems. Ensure multiple people in the organization have registrar account access, and document the registrar, account, and recovery procedures.
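The expiration monitoring recommended above can be sketched in a few lines. This is a minimal illustration, not a complete monitor: it assumes you have already obtained the expiry timestamp (from WHOIS or RDAP) as an ISO 8601 string, and the function names and the 30-day warning threshold are hypothetical choices.

```python
from datetime import datetime, timezone


def days_until_expiry(expiry_iso: str, now: datetime) -> int:
    """Whole days from `now` until expiry (rounded down); negative once expired."""
    # Accept the trailing "Z" commonly used for UTC timestamps.
    expiry = datetime.fromisoformat(expiry_iso.replace("Z", "+00:00"))
    return (expiry - now).days


def expiry_alert(expiry_iso: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a registration as 'expired', 'renew-soon', or 'ok'."""
    remaining = days_until_expiry(expiry_iso, now)
    if remaining < 0:
        return "expired"
    if remaining <= warn_days:
        return "renew-soon"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # The expiry value here is a placeholder, not real registration data.
    print(expiry_alert("2030-01-15T00:00:00Z", now))
```

Running the same check from two independent systems, as suggested above, guards against the monitor itself being the thing that silently stops working.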

Failure Mode 2: Nameserver Delegation Mismatch

A domain has two sets of nameserver records: the delegation at the registrar (the NS records in the parent zone) and the NS records in the domain's own zone file. These must match. When they do not — because someone updated nameservers at the registrar but not in the zone file, or vice versa — the result is intermittent resolution failures.

The symptoms of this mismatch are confusing because queries sometimes succeed and sometimes fail, depending on which nameserver the resolver contacts. If the registrar delegates to nameservers A and B, but the zone file lists nameservers B and C, resolvers that follow the parent's delegation will query A or B, while resolvers that prefer the zone's own NS set will query B or C. Queries to B work either way. Queries to A succeed only if A still serves the zone (and may return stale data if it is no longer being updated), and queries to C succeed only if C actually has the zone loaded.

Diagnostic approach: Compare the NS records from the parent zone (using dig NS example.com @parent-ns) with the NS records from the zone itself (using dig NS example.com @zone-ns). If they differ, you have found the problem.

Prevention: After any nameserver change, verify that the delegation at the registrar and the NS records in the zone match. Include this verification in your DNS change checklist.
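The comparison itself is a set difference once both NS lists are in hand. The sketch below assumes you have collected the two lists yourself (for example from the dig commands above); the function names are illustrative, and the nameserver names in the usage example are placeholders.

```python
def normalize(name: str) -> str:
    """Lowercase a hostname and drop the trailing root dot so names compare equal."""
    return name.strip().rstrip(".").lower()


def compare_delegation(parent_ns, zone_ns):
    """Compare NS names from the parent zone with NS names from the zone itself."""
    parent = {normalize(n) for n in parent_ns}
    zone = {normalize(n) for n in zone_ns}
    return {
        "only_in_parent": sorted(parent - zone),
        "only_in_zone": sorted(zone - parent),
        "match": parent == zone,
    }


if __name__ == "__main__":
    report = compare_delegation(
        ["ns-a.example.net.", "ns-b.example.net."],  # from the parent zone
        ["ns-b.example.net.", "ns-c.example.net."],  # from the zone's own NS set
    )
    print(report)
```

Any name that appears in only one of the two lists is a candidate for the intermittent failures described above.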

Failure Mode 3: Missing or Incorrect Glue Records

Glue records are address records (A, or AAAA for IPv6) published in the parent zone for nameservers whose names sit inside the domain they serve. If your domain is example.com and your nameservers are ns1.example.com and ns2.example.com, the parent zone needs glue records that provide the IP addresses of ns1 and ns2. Without these glue records, resolvers face a circular dependency: they need to resolve ns1.example.com to find the nameserver for example.com, but to resolve ns1.example.com they need to query the nameserver for example.com.

This failure mode is relatively rare with modern DNS hosting providers (most people use external nameservers like ns1.cloudflare.com that do not require glue records), but it still occurs when organizations run their own nameservers with in-domain names. It also appears when migrating between DNS providers if the old glue records are removed before the new nameserver configuration is complete.

Diagnostic approach: Check whether your nameservers are within your own domain. If they are, verify that glue records exist at the registrar. The command dig +trace example.com will show whether glue records are being returned in the additional section of the referral response.
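The first step of that check, deciding whether a nameserver name sits inside the domain it serves, can be expressed as a suffix test on DNS labels. A minimal sketch, with a hypothetical function name:

```python
def needs_glue(domain: str, nameserver: str) -> bool:
    """True if `nameserver` is at or below `domain`, so the parent must supply glue."""
    d = domain.strip().rstrip(".").lower()
    ns = nameserver.strip().rstrip(".").lower()
    # Match on a label boundary: "ns1.notexample.com" must not match "example.com".
    return ns == d or ns.endswith("." + d)


if __name__ == "__main__":
    for ns in ["ns1.example.com.", "ns1.cloudflare.com."]:
        print(ns, needs_glue("example.com", ns))
```

Nameservers for which this returns True are the ones whose glue records should be verified at the registrar.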

Failure Mode 4: TTL-Related Propagation Delays

Time To Live values control how long resolvers cache DNS responses. When you change a DNS record, resolvers that have the old record cached will continue serving the old data until the TTL expires. If the TTL was set to 86400 seconds (24 hours) — a common default — it can take up to 24 hours for all resolvers to see the new record.

The failure pattern here is not that DNS is broken, but that the change is not propagating as expected. This becomes a critical issue during incident response. If your server's IP address changes (due to a failover, migration, or infrastructure change) and the A record has a 24-hour TTL, resolvers drop the old record only as their individual cache entries expire. Assuming those entries were cached at evenly spread times, roughly two-thirds of resolvers will still direct traffic to the old IP address 8 hours after the change and about one-third after 16 hours; it takes the full 24 hours for all of them to catch up.
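The arithmetic behind this kind of estimate is simple under one simplifying assumption: that resolvers cached the old record at evenly spread times, so their entries expire at a uniform rate over one TTL. Real traffic is rarely that tidy, so treat the result as a rough planning figure. The function name is illustrative.

```python
def stale_fraction(elapsed_seconds: float, ttl_seconds: float) -> float:
    """Estimated share of caches still holding the old record after a change.

    Assumes cache entries were created uniformly over the TTL window, so
    they expire at a constant rate until none remain at elapsed == TTL.
    """
    if ttl_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - elapsed_seconds / ttl_seconds)


if __name__ == "__main__":
    ttl = 86400  # 24-hour TTL
    for hours in (0, 8, 16, 24):
        print(hours, "h:", round(stale_fraction(hours * 3600, ttl), 2))
```

Under this model, a 24-hour TTL leaves about two-thirds of caches stale after 8 hours and about one-third after 16 hours.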

Some resolvers — notably some ISP resolvers — ignore TTL values and impose their own minimum caching time, sometimes as long as 48 hours. This means even after the TTL expires, a small percentage of users may still see the old records.

Prevention: Before any planned DNS change, reduce the TTL to a short value (60-300 seconds) at least one full old-TTL period in advance; 48 hours gives a comfortable margin for a 24-hour TTL. This ensures that by the time you make the actual change, every cached copy carries the short TTL, so the old records are cached for at most a few minutes rather than hours. After the change has propagated and is verified, increase the TTL back to its normal value to reduce query load on your nameservers.
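The timing rule can be written down as a small calculation: the record change is safe only once the old TTL has fully elapsed since the TTL was lowered. The sketch below assumes resolvers honor TTLs (which, as noted above, not all do), and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def plan_ttl_change(ttl_lowered_at: datetime, old_ttl_s: int, new_ttl_s: int) -> dict:
    """Return the earliest safe time for the record change and the worst-case staleness."""
    return {
        # Caches filled just before the TTL was lowered still hold the old TTL.
        "earliest_change": ttl_lowered_at + timedelta(seconds=old_ttl_s),
        # After that point, no TTL-honoring cache keeps the old data longer than this.
        "max_stale_after_change_s": new_ttl_s,
    }


if __name__ == "__main__":
    lowered = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(plan_ttl_change(lowered, old_ttl_s=86400, new_ttl_s=300))
```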

Failure Mode 5: CNAME at Zone Apex

The DNS specification (RFC 1034) effectively prohibits CNAME records at the zone apex, that is, at the bare domain (example.com as opposed to www.example.com). This is because a CNAME cannot coexist with other record types at the same name (DNSSEC signing records aside), and the zone apex must have SOA and NS records.

Despite this prohibition, people regularly try to create CNAME records at the zone apex, usually because their hosting provider gives them a CNAME target (like myapp.herokuapp.com) rather than a static IP address. Some DNS providers allow this and silently "flatten" the CNAME into an A record at query time. Others reject the configuration. Still others accept it and serve it, relying on the fact that most resolvers tolerate the technically-invalid response.

The failure occurs when a resolver strictly follows the RFC and rejects the CNAME response at the zone apex, when the CNAME target changes IP and the flattening is not updated quickly enough, or when MX or TXT records at the same apex conflict with the CNAME.

Diagnostic approach: Query the apex domain for CNAME records. If one exists, check whether your DNS provider supports CNAME flattening (also called ANAME or ALIAS records). Verify that the flattened IP is current and correct.
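If you have the zone's records available as data, the apex conflict can also be detected offline. The sketch below assumes records are supplied as (owner, type) pairs, which is a hypothetical input shape; RRSIG and NSEC are treated as allowed alongside a CNAME because DNSSEC-signed zones need them.

```python
# Record types that may legitimately sit beside a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def apex_cname_conflicts(zone_name: str, records) -> list:
    """Return the record types at the apex that conflict with an apex CNAME.

    `records` is an iterable of (owner_name, record_type) pairs.
    An empty list means there is no apex CNAME or nothing conflicts with it.
    """
    apex = zone_name.strip().rstrip(".").lower()
    apex_types = {
        rtype.upper()
        for owner, rtype in records
        if owner.strip().rstrip(".").lower() == apex
    }
    if "CNAME" not in apex_types:
        return []
    return sorted(apex_types - ALLOWED_WITH_CNAME)


if __name__ == "__main__":
    zone = [
        ("example.com.", "SOA"),
        ("example.com.", "NS"),
        ("example.com.", "CNAME"),
        ("www.example.com.", "A"),
    ]
    print(apex_cname_conflicts("example.com", zone))
```

Because every zone apex carries SOA and NS records, a real apex CNAME will always report at least those two types as conflicts.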

Failure Mode 6: DNSSEC Validation Failure

DNSSEC adds cryptographic signatures to DNS responses, allowing resolvers to verify that responses have not been tampered with. When DNSSEC is correctly configured, it provides strong protection against DNS spoofing. When it is misconfigured, it causes hard failures for users whose resolvers perform DNSSEC validation.

DNSSEC failures are particularly disorienting because they affect only a subset of users. Resolvers that do not validate DNSSEC (which still includes many ISP resolvers) work perfectly. Resolvers that do validate (including Google's 8.8.8.8 and Cloudflare's 1.1.1.1) will return SERVFAIL for any domain with a DNSSEC validation error, making the site completely unreachable for users of those resolvers.

Common causes of DNSSEC validation failure include: expired RRSIG records (signatures have a validity period similar to certificates), mismatched DS records at the registrar after a key rollover, zone signing errors after a zone transfer or provider migration, and algorithm incompatibilities between the signing system and resolvers.

Diagnostic approach: Use dig +dnssec example.com to check for DNSSEC records and delv example.com to perform DNSSEC validation locally. Online tools like DNSViz provide visual chain validation. If DNSSEC is misconfigured and you cannot fix it quickly, removing the DS record at the registrar will turn off DNSSEC validation for the domain (at the cost of removing the security protection). The removal takes effect only as cached copies of the DS record expire, so recovery is not instant.
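Expired signatures are the easiest of these causes to catch ahead of time. In the output of dig +dnssec, an RRSIG record shows its expiration as a 14-digit UTC timestamp (YYYYMMDDHHmmSS, per RFC 4034). The sketch below parses that field and reports the time remaining; it assumes you have extracted the field yourself, and the function names are illustrative.

```python
from datetime import datetime, timezone


def rrsig_expiry(field: str) -> datetime:
    """Parse an RRSIG expiration field in YYYYMMDDHHmmSS form (always UTC)."""
    return datetime.strptime(field, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def seconds_until_rrsig_expiry(field: str, now: datetime) -> int:
    """Seconds until the signature expires; negative if it already has."""
    return int((rrsig_expiry(field) - now).total_seconds())


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Placeholder timestamp, not taken from a real zone.
    print(seconds_until_rrsig_expiry("20300115120000", now))
```

Alerting when the remaining time drops below a few days gives room to re-sign the zone before validating resolvers start returning SERVFAIL.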

Failure Mode 7: Recursive Resolver Outage

Your authoritative DNS can be perfectly configured and your server can be running flawlessly, but if the recursive resolver your users depend on is experiencing problems, your site will appear to be down. Most users rely on their ISP's resolver or a public resolver like Google DNS or Cloudflare DNS. If that resolver goes down, users cannot resolve any domain, not just yours.

This failure mode is outside your control, but it is important to recognize it during incident triage so you do not chase a problem in your own infrastructure that does not exist. If you receive reports that your site is down, but you can reach it from your own network, and your monitoring shows it as up, the problem might be a resolver outage affecting a subset of users.

Diagnostic approach: Ask affected users to try a different DNS resolver. If switching from their ISP's resolver to 8.8.8.8 or 1.1.1.1 fixes the problem, the issue is with their resolver, not your DNS. Check public DNS status pages (Google and Cloudflare both publish status information) and social media for reports of resolver outages.
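The triage reasoning above amounts to a small decision rule: if some resolvers answer and others do not, suspect the failing resolvers; if none answer, suspect your own DNS. A sketch of that rule follows, assuming the per-resolver results have been gathered separately. The input shape and the label strings are hypothetical.

```python
def classify_resolution(results: dict) -> tuple:
    """Classify lookup results keyed by resolver.

    `results` maps a resolver label to the list of addresses it returned,
    with an empty list or None meaning the lookup failed.
    Returns (verdict, failing_resolvers).
    """
    failed = sorted(name for name, answer in results.items() if not answer)
    if not failed:
        return ("all-resolvers-ok", [])
    if len(failed) == len(results):
        # Nothing resolves anywhere: look at the authoritative side first.
        return ("all-resolvers-failing", failed)
    # Mixed results point at the specific resolvers that failed.
    return ("resolver-specific", failed)


if __name__ == "__main__":
    observed = {
        "isp-resolver": None,
        "8.8.8.8": ["192.0.2.10"],
        "1.1.1.1": ["192.0.2.10"],
    }
    print(classify_resolution(observed))
```

A "resolver-specific" verdict is a hint rather than proof: a DNSSEC validation failure (Failure Mode 6) produces the same split between validating and non-validating resolvers.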

Failure Mode 8: Geographic DNS Routing Errors

GeoDNS (geographic DNS routing) directs users to the nearest server or CDN edge based on their geographic location. This works by returning different IP addresses in response to DNS queries based on the source IP of the resolver. The failure occurs when the geographic database is incorrect, when a resolver serves users from a different region than its own location, or when the GeoDNS configuration has gaps that leave some regions unserved.

A common variant of this failure is when a large public resolver (like Google DNS) sends queries from a resolver location that does not match the geographic location of the actual user. Google mitigates this with the EDNS Client Subnet extension, which includes the user's subnet in the DNS query, but not all authoritative servers support this extension. Without it, all Google DNS users in a region might be routed to a server optimized for the location of Google's resolver rather than the user's actual location.

Diagnostic approach: Query your domain from resolvers in different geographic locations (or use online tools that query from multiple locations). Compare the returned IP addresses with your expected geographic routing configuration. Check whether the IP addresses returned correspond to the correct regional server or CDN edge.
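Once answers have been collected per region, comparing them against the intended routing is mechanical. The sketch below assumes two hypothetical mappings, one from region to the addresses you expect and one from region to the addresses actually observed. The region labels are placeholders and the addresses come from documentation ranges.

```python
def geo_mismatches(expected: dict, observed: dict) -> dict:
    """Return a {region: problem} mapping for regions whose answers look wrong."""
    problems = {}
    for region, ips in observed.items():
        allowed = expected.get(region)
        if allowed is None:
            problems[region] = "no routing rule for region"
        elif not ips:
            problems[region] = "empty answer"
        else:
            unexpected = sorted(set(ips) - set(allowed))
            if unexpected:
                problems[region] = "unexpected addresses: " + ", ".join(unexpected)
    return problems


if __name__ == "__main__":
    expected = {"eu": {"192.0.2.10"}, "us": {"198.51.100.10"}}
    observed = {
        "eu": ["192.0.2.10"],
        "us": ["192.0.2.10"],        # US users sent to the EU address
        "apac": ["198.51.100.10"],   # region missing from the configuration
    }
    print(geo_mismatches(expected, observed))
```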

Failure Mode 9: DNS Amplification and DDoS

DNS is frequently used as an amplification vector in DDoS attacks because DNS responses are typically much larger than DNS queries, and because DNS uses UDP, which makes source address spoofing trivial. An attacker sends DNS queries with a spoofed source IP (the victim's address) to servers that will answer them, and the responses, which can be tens of times larger than the queries, are delivered to the victim, overwhelming it. Your nameservers can be on either side of this: flooded directly as the target, or abused as reflectors against someone else.

Even if you are not the target of the DDoS attack, your authoritative nameservers can become collateral damage. If your nameservers are hosted on shared infrastructure (which is common with managed DNS providers), a DDoS attack targeting another customer on the same infrastructure can degrade or disable DNS resolution for your domain as well.

Prevention: Use a managed DNS provider with DDoS mitigation capabilities (Cloudflare, AWS Route 53, Google Cloud DNS). Configure rate limiting on your authoritative nameservers if you run your own. Use multiple DNS providers (multi-provider DNS) so that if one provider is attacked, the other can still serve responses.

Failure Mode 10: Stale Negative Cache

When a resolver queries for a record that does not exist, it caches that negative response for a duration taken from the zone's SOA record: the lower of the SOA record's own TTL and its MINIMUM field (RFC 2308), often called the negative cache TTL. If you accidentally delete a DNS record, the negative response gets cached. When you recreate the record, resolvers that have the negative response cached will continue to report the record as non-existent until their negative cache expires.

This is particularly problematic because the negative cache TTL is often set to a long duration, commonly 3600 seconds (one hour), and is controlled by the SOA record rather than the individual record's TTL. Even if you set a 60-second TTL on your A record, deleting and recreating it can leave the name unresolvable for up to an hour on any resolver that cached the negative response.

Prevention: Never delete and recreate DNS records. Instead, update them in place. If you must delete a record, understand that the negative cache TTL in your SOA record determines how long the gap will persist. Set the SOA minimum TTL to a reasonable value (300-600 seconds is a good balance between cache efficiency and recovery speed).
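To see how long a gap your zone would currently produce, you can compute the negative cache TTL from the SOA record. Under RFC 2308 it is the smaller of the SOA record's own TTL and the last (MINIMUM) field of its data. The sketch below assumes the SOA data is given in the seven-field form printed by dig +short SOA; the function name and the sample values are illustrative.

```python
def negative_cache_ttl(soa_rdata: str, soa_record_ttl: int) -> int:
    """Return the negative caching TTL implied by an SOA record.

    `soa_rdata` has seven whitespace-separated fields:
    mname rname serial refresh retry expire minimum
    """
    fields = soa_rdata.split()
    if len(fields) != 7:
        raise ValueError("expected 7 SOA fields, got %d" % len(fields))
    minimum = int(fields[6])
    # RFC 2308: negative answers are cached for min(SOA TTL, SOA MINIMUM).
    return min(minimum, soa_record_ttl)


if __name__ == "__main__":
    rdata = "ns1.example.com. hostmaster.example.com. 2024010101 7200 3600 1209600 3600"
    print(negative_cache_ttl(rdata, soa_record_ttl=86400))
```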

Failure Mode 11: Provider Migration Data Loss

Migrating from one DNS provider to another is one of the highest-risk operations in DNS management. The typical process involves exporting zone data from the old provider, importing it into the new provider, verifying the records, and then changing the nameserver delegation at the registrar. Each step has failure potential.

Zone exports may be incomplete — some providers do not export all record types, or they export records in a format that the new provider cannot import cleanly. Records that were created through the provider's UI (like aliased records or provider-specific features) may not have standard DNS equivalents and are silently dropped during export.

The most dangerous moment is when the nameserver delegation changes. If the new provider does not have all records configured correctly, the moment the delegation switches, users will start getting incorrect or empty responses. Because of caching, rolling back the nameserver delegation does not immediately fix the problem — resolvers that already queried the new (incorrect) nameservers have cached the wrong data.

Prevention: Before changing delegation, reduce TTLs on all records at the old provider to 60 seconds (at least 48 hours before the switch). Verify every record at the new provider by querying the new nameservers directly (dig @new-ns1 example.com). Run both providers in parallel during the transition. Test from multiple locations using online DNS checking tools.
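Record-by-record verification is tedious by hand, so it helps to diff the two providers' data as sets. The sketch below assumes each zone has already been fetched into (name, type, value) tuples, for instance by querying each provider's nameservers directly. The function names are hypothetical, and SOA is ignored by default because it legitimately differs between providers.

```python
def canon(record):
    """Normalize a (name, type, value) tuple so equivalent records compare equal."""
    name, rtype, value = record
    # Record values are left case-sensitive on purpose (TXT data can be).
    return (name.strip().rstrip(".").lower(), rtype.upper(), value.strip())


def diff_zones(old_records, new_records, ignore_types=("SOA",)):
    """Report records present at only one provider."""
    ignore = {t.upper() for t in ignore_types}
    old = {c for c in map(canon, old_records) if c[1] not in ignore}
    new = {c for c in map(canon, new_records) if c[1] not in ignore}
    return {
        "missing_at_new": sorted(old - new),
        "extra_at_new": sorted(new - old),
    }


if __name__ == "__main__":
    old = [
        ("www.example.com.", "A", "192.0.2.1"),
        ("example.com.", "MX", "10 mail.example.com."),
    ]
    new = [("www.example.com.", "A", "192.0.2.1")]
    print(diff_zones(old, new))
```

Apex NS records are also expected to differ after a migration; pass ignore_types=("SOA", "NS") if they add noise, keeping in mind that this also hides NS records for delegated subdomains.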

Failure Mode 12: Subdomain Takeover via Dangling DNS

A dangling DNS record is a CNAME or A record that points to a resource that no longer exists. The classic example is a CNAME record pointing to a cloud service (like an S3 bucket, a Heroku app, or an Azure resource) that has been decommissioned. The DNS record still exists, but the target has been released and can potentially be claimed by an attacker.

When an attacker claims the orphaned resource, they gain control of the content served on your subdomain. This allows them to host phishing pages, steal cookies (if the subdomain shares a cookie scope with your main domain), and abuse your domain's reputation for spam or malware distribution.

This failure mode is not a traditional outage — the subdomain resolves and serves content — but the content is controlled by an attacker rather than you. It is included in this list because it represents a DNS configuration failure with severe consequences, and because it is increasingly common as organizations use and decommission cloud services without cleaning up the associated DNS records.

Prevention: Audit DNS records regularly for dangling records. When decommissioning any cloud service, remove the corresponding DNS record before releasing the resource. Use monitoring that checks not just whether a domain resolves, but whether the content served matches expected content (subdomain takeover will resolve successfully but serve unexpected content).
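One inexpensive audit signal is a CNAME whose target no longer resolves. The sketch below flags those records; it assumes the zone's CNAMEs are available as a mapping of owner name to target, and it takes the resolution check as a parameter so it can be tested without network access. All names here are hypothetical, and a target that still resolves is not proof of safety, since some providers answer for unclaimed names.

```python
import socket


def resolves_via_system(hostname: str) -> bool:
    """Return True if the system resolver returns any address for `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def find_dangling_cnames(cnames: dict, resolves=resolves_via_system) -> list:
    """Return owner names whose CNAME target does not resolve.

    `cnames` maps an owner name to its CNAME target.
    `resolves` is any callable taking a hostname and returning a bool.
    """
    return sorted(owner for owner, target in cnames.items() if not resolves(target))


if __name__ == "__main__":
    zone_cnames = {
        "shop.example.com": "old-app.example-cloud.test",
        "www.example.com": "www.example.net",
    }
    print(find_dangling_cnames(zone_cnames))
```

Flagged records still need a human decision: confirm the resource is gone, then remove the record rather than leaving it for someone else to claim.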

Building a DNS Resilience Strategy

No single measure protects against all twelve failure modes. A comprehensive DNS resilience strategy combines several practices:

Multi-provider DNS: Configure your domain with nameservers from two independent DNS providers. If one provider experiences an outage or DDoS attack, the other continues to serve responses. This requires keeping zone data synchronized across both providers, which adds operational complexity but dramatically improves resilience.

External DNS monitoring: Use monitoring that queries your DNS from multiple geographic locations and validates not just resolution success but response correctness. Alert on unexpected changes in resolved IP addresses, missing records, and DNSSEC validation failures.

Change management: Treat DNS changes with the same rigor as production deployments. Use a change checklist, verify changes from multiple locations before and after, and maintain rollback procedures. Never make DNS changes on a Friday afternoon.

Regular audits: Review your complete DNS zone quarterly. Look for records that point to decommissioned services, records with excessively long TTLs, and records that are no longer needed. Each unnecessary record is a potential future failure point.

DNS is simultaneously the simplest and most consequential layer of web infrastructure. Every failure mode described in this article has caused real outages at real organizations. The good news is that every one of them is preventable with proper monitoring, documentation, and operational discipline. The bad news is that most organizations do not invest in DNS resilience until after they experience their first DNS-related outage. The purpose of this article is to help you be proactive rather than reactive — to understand the failure modes before you encounter them in production.