SSL Certificate Expiry Patterns: Why Sites Still Go Down in 2026
Let's Encrypt made free certificates ubiquitous. The ACME protocol automated renewals. Yet expired certificates remain one of the top five causes of website outages. This article examines why the problem persists and what the failure patterns reveal about infrastructure management.
The Paradox of Automated Certificate Management
The SSL/TLS ecosystem in 2026 is fundamentally different from what it was a decade ago. Let's Encrypt issues certificates that are valid for 90 days and provides tools to renew them automatically. Most hosting providers handle certificate management transparently. Cloud platforms like AWS, Google Cloud, and Azure offer managed certificate services that handle provisioning and renewal without any manual intervention.
Despite all of this, certificate expiration continues to cause outages at organizations of every size. In 2025, several high-profile incidents were traced directly to expired certificates — not at small shops running outdated infrastructure, but at companies with dedicated security teams and substantial engineering resources.
The paradox exists because automation solves the simple case perfectly while creating new categories of failure that are harder to detect and prevent. Understanding these failure patterns is essential for anyone responsible for website reliability.
Failure Pattern 1: The Automation Gap
The most common pattern is not "we forgot to renew" — it is "we thought renewal was automated, but it was not for this specific certificate." Organizations typically have multiple certificate types across different systems: web server certificates managed by Let's Encrypt or a cloud provider, internal certificates for service-to-service communication, wildcard certificates purchased from commercial CAs, and certificates embedded in client applications or IoT devices.
Automation is usually configured for the primary web-facing certificates. The certificates that expire are the ones that fall outside the automated system — an intermediate certificate on a load balancer that was configured manually two years ago, a certificate pinned in a mobile app that requires an app store update to replace, or a certificate used by a third-party integration that nobody remembers configuring.
The gap exists because certificate inventories are rarely comprehensive. Most organizations cannot answer the question "how many certificates do we have, where are they installed, and when do they expire?" with confidence. Without a complete inventory, you cannot verify that automation covers everything. The certificates you do not know about are the ones that will expire unexpectedly.
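One way to make the gap visible is to compare what is actually serving certificates against what automation claims to cover. The sketch below assumes two hypothetical inputs: a list of endpoints discovered by a network scan or asset export, and a list of endpoints your ACME client or cloud provider reports as managed. The function name and the `(hostname, port)` shape are illustrative, not taken from any particular tool.

```python
from typing import Iterable, Set, Tuple

Endpoint = Tuple[str, int]  # (hostname, port); an illustrative shape


def _normalize(endpoint: Endpoint) -> Endpoint:
    """Lower-case the hostname and drop a trailing dot so equal names compare equal."""
    host, port = endpoint
    return host.lower().rstrip("."), port


def find_unmanaged(discovered: Iterable[Endpoint],
                   automated: Iterable[Endpoint]) -> Set[Endpoint]:
    """Return endpoints that serve a certificate but are not covered by automation."""
    return {_normalize(e) for e in discovered} - {_normalize(e) for e in automated}
```

Anything this returns is a candidate for the "we thought it was automated" failure and deserves an explicit owner and renewal method.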
Failure Pattern 2: Silent Renewal Failures
Automated renewal systems fail silently more often than most people realize. The ACME protocol used by Let's Encrypt requires the applicant to prove control of the domain by answering a validation challenge, most commonly by serving a specific file over HTTP (the HTTP-01 challenge) or by creating a DNS TXT record (the DNS-01 challenge). Both methods can fail without producing obvious errors.
HTTP-01 challenges fail when a reverse proxy, CDN, or WAF is reconfigured in a way that blocks the challenge path. This happens frequently when infrastructure teams make changes without awareness of how certificate validation works. A new firewall rule that blocks requests to /.well-known/acme-challenge/ will silently prevent certificate renewal. The existing certificate continues to work until it expires, which might be weeks or months after the change that broke renewal.
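A blocked challenge path can be caught long before renewal time by probing it on a schedule. The sketch below assumes you have placed a small canary file yourself under /.well-known/acme-challenge/ on the web server; the file name, its contents, and the function names are all hypothetical. It only proves reachability from wherever the script runs, so it is most useful when run from outside your network.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

ACME_PATH = "/.well-known/acme-challenge/"


def http_get(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status, body); status 0 means the connection failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""   # e.g. 403 from a WAF or firewall rule
    except OSError:
        return 0, ""          # DNS failure, refused connection, timeout


def challenge_path_reachable(domain: str, canary_name: str, canary_body: str,
                             fetch: Callable[[str], Tuple[int, str]] = http_get) -> bool:
    """True if the canary file under the ACME challenge path is served unchanged."""
    url = f"http://{domain}{ACME_PATH}{canary_name}"
    status, body = fetch(url)
    return status == 200 and body.strip() == canary_body
```

The `fetch` parameter exists so the check can be exercised without a network; in production the default `http_get` would be used, for example `challenge_path_reachable("example.com", "probe.txt", "canary-123")`.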
DNS-01 challenges fail when DNS credentials are rotated, when the DNS provider changes its API, or when rate limits are exceeded. These failures produce log entries that are easy to miss among the noise of typical server logs. Certbot or an equivalent ACME client logs the error, but if nobody is monitoring that output (and most automated setups have no alerting on renewal failures), it goes unnoticed.
The fundamental issue is that certificate automation is configured once and then assumed to work indefinitely. But the environment around it changes: DNS providers update APIs, firewall rules change, server configurations are modified, and containers are rebuilt from updated base images. Each change has the potential to break renewal without breaking the running service, creating a time bomb that detonates when the current certificate runs out. For a 90-day certificate renewed 30 days before expiry, that is anywhere from 30 to 90 days after the change.
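One signal that does not depend on reading the ACME client's logs is the age of the certificate actually being served. If renewal normally happens at a known point in the lifetime and the served certificate is older than that, renewal has been failing. The sketch below assumes renewal at two thirds of the lifetime plus a few days of grace; both defaults are assumptions to tune for your client.

```python
from datetime import datetime, timedelta


def renewal_overdue(not_before: datetime, not_after: datetime, now: datetime,
                    renew_fraction: float = 2 / 3,
                    grace: timedelta = timedelta(days=3)) -> bool:
    """True if the certificate should already have been replaced by a renewed one.

    not_before and not_after come from the certificate being served.
    All three datetimes must be either naive or timezone-aware, not mixed.
    """
    lifetime = not_after - not_before
    expected_renewal = not_before + lifetime * renew_fraction
    return now > expected_renewal + grace
```

For a 90-day certificate issued on January 1, this starts returning True shortly after day 63, which leaves most of the final month to investigate.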
Failure Pattern 3: Certificate Chain Breakage
Even when a certificate is renewed successfully, the deployment can fail if the certificate chain is incomplete or incorrect. A complete TLS certificate chain requires three components: the server's leaf certificate, one or more intermediate certificates, and trust in the root certificate (which is pre-installed in browsers and operating systems).
Chain issues are insidious because they often do not affect all clients equally. Many desktop browsers, Chrome among them, can download a missing intermediate certificate automatically using the URL in the leaf certificate's Authority Information Access (AIA) extension. Not every client does this: Firefox relies on preloaded and cached intermediates instead, and most non-browser TLS libraries do no fetching at all. This means a site with an incomplete chain can work perfectly in Chrome on a developer's laptop but fail completely on older Android devices, embedded systems, API clients, or command-line tools like curl.
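The expiration of the DST Root CA X3 root certificate on September 30, 2021 was a large-scale example of this pattern. Let's Encrypt had relied on a cross-signature from that root for compatibility with clients that did not yet trust its own ISRG Root X1. When the old root expired, millions of devices running older operating systems or older TLS libraries lost the ability to validate Let's Encrypt certificates, even though the certificates themselves were perfectly valid. The affected clients included older macOS and iOS versions, software built on OpenSSL 1.0.2, some smart TVs, and various IoT devices that could not be easily updated. Older Android phones were largely spared only because Let's Encrypt had arranged a special cross-sign that outlived the root.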
Chain problems can also emerge during certificate transitions. When a certificate authority changes its intermediate certificate — which happens periodically for security reasons — servers that have the old intermediate certificate cached or hardcoded will start serving an incorrect chain. The new leaf certificate is valid, but the chain leading to it is broken for clients that do not have the new intermediate certificate cached.
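A cheap structural check catches many of these deployment mistakes: in the chain a server sends, each certificate's issuer should match the subject of the next certificate. The sketch below assumes the subject and issuer names have already been extracted in order (for example from the output of `openssl s_client -showcerts`). It compares names only and does not verify signatures, so it complements real validation rather than replacing it. The certificate names in the usage note are illustrative.

```python
from typing import List, Tuple

CertNames = Tuple[str, str]  # (subject, issuer), in the order the server sent them


def chain_problems(chain: List[CertNames]) -> List[str]:
    """Return human-readable problems with the served chain; empty list if none found."""
    if not chain:
        return ["server sent no certificates"]
    problems = []
    for i in range(len(chain) - 1):
        subject, issuer = chain[i]
        next_subject = chain[i + 1][0]
        if issuer != next_subject:
            problems.append(
                f"certificate {i} ({subject}) was issued by {issuer!r}, "
                f"but certificate {i + 1} is {next_subject!r}"
            )
    leaf_subject, leaf_issuer = chain[0]
    if len(chain) == 1 and leaf_subject != leaf_issuer:
        # Publicly trusted leaves are practically always issued by an intermediate.
        problems.append("only the leaf was sent; intermediate certificates are missing")
    return problems
```

A server still sending a hardcoded old intermediate shows up as a mismatch: `chain_problems([("CN=www.example.com", "CN=R11"), ("CN=R3", "CN=ISRG Root X1")])` reports that certificate 0 names an issuer the next certificate does not match.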
Failure Pattern 4: The Multi-Environment Drift
Organizations running multiple environments — staging, production, disaster recovery, blue-green deployments — face a unique challenge: certificates need to be valid and current in every environment, but automation is often configured only for the primary production environment.
The disaster recovery environment is the most common victim. It exists specifically for the situation where primary production fails, but because it is rarely activated, its certificates quietly expire. When a genuine disaster occurs and traffic is redirected to the DR environment, users encounter certificate errors — adding a security warning to an already stressful situation and potentially preventing the DR environment from functioning as intended.
Blue-green deployments create a similar risk. The inactive ("blue") environment might have certificates that were valid when it was last deployed but have since expired. When a deployment switches traffic to the blue environment, the expired certificates cause an immediate outage that is often misdiagnosed as a deployment problem rather than a certificate problem.
Development and staging environments that use real domain names (rather than localhost or self-signed certificates) also contribute to this pattern. When staging certificates expire, development workflows break, and engineers waste time debugging certificate issues instead of building features. Some teams respond by disabling certificate validation in development, which then masks legitimate certificate issues that propagate to production.
Failure Pattern 5: Organizational Knowledge Loss
Certificates are typically configured by one person at one point in time. If that person leaves the organization, the institutional knowledge of how certificates are managed goes with them. This is especially problematic for certificates that are not managed by standard automation — custom configurations, certificates for legacy systems, certificates stored in hardware security modules (HSMs), or certificates used by third-party integrations.
The knowledge loss problem is compounded by the infrequency of certificate-related tasks. If a certificate has a one-year validity period, the person who configured it might only think about it once. If they leave the company six months later, there are six more months before anyone discovers that nobody knows how to renew that particular certificate.
Documentation helps but does not solve the problem entirely. Certificate management documentation is rarely kept current because the process seems simple ("just run certbot renew") and the details that matter — where the certificate is installed, what validation method it uses, what credentials are needed, what services need to be restarted — are often omitted because they seem obvious to the person writing the documentation.
Detection: Why Monitoring Must Be External
Internal monitoring systems have a fundamental limitation when it comes to certificate checking: they are typically inside the network boundary and may not see the same certificate chain that external users see. A load balancer might terminate TLS and serve internal traffic over HTTP, meaning internal health checks never validate the certificate at all.
External certificate monitoring — checking the certificate as seen by an actual browser connecting from the public internet — catches problems that internal monitoring misses. This includes chain issues (since external clients do not have access to internally-cached intermediate certificates), CDN-level certificate problems (where the CDN serves a different certificate than the origin server), and geographic variations (where different CDN edge nodes might serve different certificates).
Effective certificate monitoring checks not just whether the certificate is valid today, but how many days remain until expiration. Alerting at 30, 14, and 7 days before expiration provides enough warning to address both automated renewal failures and manually-managed certificates. The 30-day alert catches problems early enough to investigate calmly. The 14-day alert escalates urgency. The 7-day alert triggers emergency procedures.
Monitoring should also validate the complete certificate chain, not just the leaf certificate. A chain validation failure is a leading indicator of problems that may not manifest as a user-visible error immediately but will cause issues for specific client populations.
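A minimal external check along these lines can be sketched with Python's standard library. It connects the way an ordinary non-browser client would, so chain and hostname verification happen during the handshake, then maps the remaining days onto the 30, 14, and 7 day tiers described above. The function names and the tier labels are illustrative choices, not part of any monitoring product.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Return the served leaf certificate's expiry time in UTC.

    The default context verifies the chain and hostname and does not fetch
    missing intermediates, so an incomplete chain or an expired certificate
    normally raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days left before expiry (negative once expired)."""
    return (not_after - now).days


def alert_level(days: int) -> Optional[str]:
    """Map days remaining onto the alert tiers; None means no alert."""
    if days <= 7:
        return "emergency"
    if days <= 14:
        return "urgent"
    if days <= 30:
        return "warning"
    return None
```

A scheduled job could call `alert_level(days_remaining(fetch_not_after("example.com"), datetime.now(timezone.utc)))` from several locations and treat a verification exception as an alert in its own right.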
Prevention: Building a Certificate Management Practice
Preventing certificate-related outages requires treating certificate management as an ongoing practice rather than a one-time configuration task. The practice has four components:
Inventory: Maintain a complete list of every certificate in use across all environments, including certificates used by third-party services that you depend on. This inventory should include the certificate's subject, issuer, expiration date, where it is installed, how it is renewed, and who is responsible for it. Review the inventory quarterly.
Automation verification: Do not assume automated renewal works — verify it. For each certificate managed by automation, confirm that a recent renewal has actually completed successfully. Check certbot logs, ACME client logs, or cloud provider certificate status regularly. Set up alerts on renewal failures, not just on certificate expiration.
External monitoring: Use an external service that checks your certificates from multiple geographic locations at regular intervals. This catches chain issues, CDN misconfigurations, and geographic routing problems that internal monitoring cannot detect. Set expiration alerts at 30, 14, and 7 days.
Runbook maintenance: For every certificate, document the complete renewal procedure, including prerequisites, credentials needed, validation method, installation steps, and verification procedure. Test the runbook by having someone other than the original author execute it. Update it after every renewal.
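The inventory does not need special tooling to be useful. As a rough sketch, each entry can be a small record carrying the fields listed above, with a helper that flags entries needing attention at the quarterly review. The field names, the 90-day review interval, and the 30-day expiry threshold are assumptions chosen to mirror the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class CertRecord:
    subject: str
    issuer: str
    expires: date
    installed_on: str    # where the certificate is installed
    renewal_method: str  # for example "acme-http-01" or "manual"
    owner: str           # person or team responsible
    last_reviewed: date


def needs_attention(records: List[CertRecord], today: date,
                    review_interval: timedelta = timedelta(days=90),
                    expiry_warning: timedelta = timedelta(days=30)) -> List[CertRecord]:
    """Return records with no owner, an overdue review, or an approaching expiry."""
    flagged = []
    for record in records:
        no_owner = not record.owner.strip()
        review_overdue = today - record.last_reviewed > review_interval
        expiring_soon = record.expires - today <= expiry_warning
        if no_owner or review_overdue or expiring_soon:
            flagged.append(record)
    return flagged
```

An entry with an empty owner is flagged regardless of its expiry date, which targets the knowledge-loss pattern directly.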
The Shorter Validity Period Trend
The industry is moving toward shorter certificate validity periods. Let's Encrypt certificates are valid for 90 days. In April 2025 the CA/Browser Forum approved Ballot SC-081v3, proposed by Apple, which lowers the maximum validity of publicly trusted TLS certificates from 398 days in steps: 200 days from March 2026, 100 days from March 2027, and 47 days from March 2029. Let's Encrypt has separately announced plans to shorten its default lifetime to 45 days by 2028.
Shorter validity periods improve security by limiting the window during which a compromised certificate can be used and by making certificate revocation less critical (since certificates expire quickly anyway). However, they make the failure patterns described above more frequent and more urgent. If renewal is first attempted 30 days before expiration, a renewal failure with a 90-day certificate leaves about 30 days to notice and fix the problem. With a 45-day certificate renewed at the same two-thirds point of its lifetime, that window shrinks to roughly 15 days.
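The arithmetic is simple enough to sketch. The time available after a first failed renewal attempt is whatever part of the lifetime remains at the renewal point. The function below assumes renewal is first attempted once two thirds of the lifetime has elapsed, which corresponds to 30 days before expiry for a 90-day certificate; the 47-day figure is the maximum the CA/Browser Forum has scheduled for 2029.

```python
def reaction_window_days(lifetime_days: float, renew_fraction: float = 2 / 3) -> float:
    """Days between the first renewal attempt and expiry, rounded to one decimal."""
    return round(lifetime_days * (1 - renew_fraction), 1)


for lifetime in (90, 47, 45):
    print(f"{lifetime}-day certificate: {reaction_window_days(lifetime)} days to react")
# 90-day certificate: 30.0 days to react
# 47-day certificate: 15.7 days to react
# 45-day certificate: 15.0 days to react
```

Alert thresholds tuned for 90-day certificates need revisiting as lifetimes shrink, since the whole reaction window ends up shorter than the earliest alert tier in common use today.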
This trend makes robust certificate monitoring not optional but essential. Organizations that have been relying on annual manual renewals will need to adopt automation, and organizations that have automation will need to invest in monitoring that automation's health. The cost of getting certificate management wrong increases as the margin for error decreases.
Conclusion
SSL certificate expiration is not a technical problem — it is an organizational one. The tools to prevent it exist and are mature. The failures occur in the gaps between tools: the certificates that fall outside automation, the renewals that fail silently, the chains that break partially, the environments that drift, and the knowledge that walks out the door with departing team members. Solving it requires acknowledging that certificate management is an ongoing practice, not a set-and-forget configuration, and investing in the inventory, monitoring, and documentation practices that make expiration-related outages preventable.