After the Let’s Encrypt CAA rechecking incident, we have to say that the certificate revocation system is broken not only in theory but also in practice. Four days after the incident was published, 3 million certificates should have been revoked, but Let’s Encrypt changed course on revocation because of the minimal security risk. Now that all the affected certificates have expired, let us look behind the problems of the certificate revocation system.
First of all, let us be clear that Let’s Encrypt has a very important mission: making encryption free and easy to use on the internet. Their effort was essential in making it possible for even the smallest internet sites to protect their customers’ data and privacy with Transport Layer Security (TLS), but now they have fallen into the deep hole of the broken revocation system.
It is known that no certificate revocation mechanism can satisfy the requirements of every stakeholder, namely client application developers, server administrators, middleware solution manufacturers and certificate authorities. It must be stated, however, that when a certificate is compromised, then until it is revoked and the clients have learned of the revocation, it should be assumed that a malicious third party with no authority over the host(s) the certificate belongs to could impersonate any kind of service (web, mail, remote access, etc.) under any domain name the certificate is valid for. The fact that a certificate was issued during the period in which the CAA rechecking issue persisted does not mean that the certificate has been compromised. It only means there was a possibility that Let’s Encrypt issued a certificate for a domain whose CAA record disallowed issuance by Let’s Encrypt, and even then a malicious third party who wanted to exploit the issue still had to have verified control over the domain.
Without going into the technical details, which Scott Helme discusses in depth, the three revocation checking mechanisms in use today must be summarized (skip this part if you want).
The oldest mechanism is the certificate revocation list (CRL), which is nothing more than a collection of the revoked but not yet expired certificates, maintained by the certificate authority. The most important disadvantage of the CRL is its size. As the Let’s Encrypt CAA rechecking issue showed, it can happen that millions of certificates must be revoked at the same time. In that case the CRL grows dramatically, and every client (e.g. browsers, mail clients) that wants to check the revocation state of a certificate must download the CRL file. This may cause significant delays before the website content can be shown to the user. As if that were not enough, the CRL also has to be updated regularly. You can imagine what that means on mobile or embedded devices with a limited amount of memory or bandwidth.
That is why browsers do not use this mechanism at all. Chrome and Firefox have their own mechanisms, CRLSets and OneCRL respectively, to solve the problem. Both are CRL collections, maintained by Google and Mozilla, which are updated regularly in the background without the user needing to restart or update the browser. That is fair enough as long as we are talking about Chrome/Chromium and Firefox, but what about the many other browsers, other types of clients (e.g. mail clients), the other protocols that use TLS, or the organisations that want to extend the trust store with their private CAs?
The Online Certificate Status Protocol (OCSP) solves the size issue that the CRL suffers from by making it possible to ask a central server (the OCSP responder) whether a given certificate is revoked or not. The problem is that OCSP responders are operated by the CAs, and if you ask for the revocation state of a website's certificate it is quite likely that you are visiting that site at that moment, so you can be tracked in that way.
OCSP stapling is currently the best method for checking certificate validity. It solves both the size issue of CRL and the privacy issue of OCSP by transferring the responsibility to the server. With this mechanism, the server proves to the client that the certificate it presents is not revoked. The server obtains the proof from the certificate authority that issued the certificate and caches it. When a client connects, the server offers the cached proof to the client during the TLS handshake. The bitter pill we have to swallow is that, prior to TLS 1.3, the server can send only one revocation state, so in contrast to CRL and OCSP, only the server certificate can be validated and not the full certificate chain.
What should we expect from a certificate or a certificate authority? A certificate, at least a domain-validated one, proves that you are communicating with a server operated by the authoritative owner of the domain (e.g. google.com) and nothing else. Simply put, you visit the page you expected to visit. The certificate authority is the trusted third party responsible for checking the legitimacy of a certificate request for a given domain. If the request is valid, the certificate authority signs it to prove that. When a certificate authority learns that a certificate has been compromised, it must revoke the certificate, regardless of who is responsible, the certificate authority or the certificate requester. In this way the visitor can recognize, using one or more of the revocation checking mechanisms mentioned above, that there is an issue with the certificate.
After the Let’s Encrypt CAA rechecking issue, we have to ask whether revocation is a must or whether a certificate authority may weigh arguments and counter-arguments. In this case the question is theoretical, as the issue carried minimal security risk, but what if a serious issue happens in the future?
On the one hand, one could say that in this case 3 million internet sites would have stopped working after the revocation. That might be a reason for a certificate authority to grant an unspecified grace period to the millions of certificates that were not renewed by the domain owners after the issue had been announced. On the other hand, it needs to be emphasized that all revocation checking mechanisms depend on the fact of revocation. Until the grace period runs out, millions of certificates remain valid, and the customers of the services that use these certificates cannot recognize that there is any security issue with these sites. There would be a chance for a malicious party to collect sensitive information (e.g. passwords, card numbers) while we are visiting an exploited site that we believe to be safe. The customers’ trust is violated because they cannot decide whether they really want to visit a site that has a potentially compromised certificate.
The problem is that if we do not know the reason for the revocation, that information is lost. There is no such thing as a moderately compromised certificate. Whether there is only a small chance or it is 100% sure that the certificate is compromised makes no difference to what the certificate authority should do; the protocol is the same. They have to revoke the certificate, and there is no room to distinguish between those cases. On the other side of the story we have an ordinary customer who does not know what is happening. A certificate compromise is not their responsibility, as they do not have the skills to make that kind of decision. That is what a certificate authority is for.
It needs to be clarified to whom the certificate authority is responsible. To the domain owner who paid for the service? To the customer who trusts that the certificate authority will revoke a certificate if there is any chance of compromise? We might have to ask whether this is the right question at all. What is the certificate authority responsible for? I believe the answer should be security first, and unbroken confidence in the whole PKI system. There is an obvious similarity between the responsibility of public health systems in the event of an epidemic and that of certificate authorities in the event of a security incident: assess the severity of the problem, mitigate the adverse effects, keep up trust in the system, avoid panic. There is also an obvious dissimilarity. While there is no cure for a certificate compromise, there is a cure for the running service that uses the compromised certificate, namely certificate renewal.
What could be the unfavorable effect of a certificate compromise? Clients cannot tell whether the service is run by the authoritative owner of the domain or by a malicious third party, so a man-in-the-middle attack can be performed, and all the data transferred between the client and the service running with the compromised certificate can potentially be intercepted by a malicious third party. The users' credentials (e.g. passwords) should be considered compromised and should be changed as soon as possible. The data transferred before the certificate was renewed may have to be considered leaked or altered, depending on the likelihood of the compromise.
The trolley problem arises here, just as it does with self-driving cars. How should the car decide when an accident cannot be avoided and either the car owner or somebody outside the car will die? If the car prefers the outcome where the owner dies, the question is who will pay for that type of car when the situation is definitely no-win. In our case the consequences are not nearly as serious, but the question is the same.
If a certificate authority tends to revoke a certificate at the slightest suspicion of compromise, the service will stop, which causes financial loss to the service provider and inconvenience to the client. Why should the domain owner pay for that? A domain owner should pay because the real value of a certificate is the customers’ trust in the PKI system, a trust that is represented by the certificate authority.
If the protocol depended on the number of affected certificates, where would the boundary be beyond which a certificate authority may postpone revocation? Security issues like the one in question, or more serious ones, may and always will happen. It is not great, not terrible. The problem is the delay in information and the risk it carries. Should anybody trust the PKI system if certificate authorities can hide security issues from internet users by not revoking possibly compromised certificates? This is a question even if the hiding is temporary, caused only by a postponement of the revocation. In the case of Let’s Encrypt the situation resolves itself in a relatively short time anyway: validity is only 90 days, so certificates that might pose a threat to clients expire within 3 months. With other certificate authorities the validity can be up to 27 months.
The only way to mitigate long turnaround times is to create an environment where quick and efficient reactions are possible through automation. That is what Let’s Encrypt excels at, as all the processes around certificate revocation and renewal are fully automated. Let us imagine a situation in which a poorly automated certificate authority, without support for the Automatic Certificate Management Environment (ACME) protocol, announces that 3 million, or just 300 thousand, certificates are compromised. How much time would be needed to renew all of them: 1 week, 1 month or more? Should we tolerate services running with possibly compromised certificates during that time if the domain owners were given every opportunity to renew them? Actually, that was the case here. It is just icing on the cake that Let’s Encrypt issues certificates with the shortest validity period (90 days).
Without questioning the significance of business continuity, the position that possibly compromised certificates should always be revoked can hardly be relaxed. Without this step, neither the standard certificate checking mechanisms (CRL, OCSP, OCSP stapling) nor the implementations that depend on them can work. Both client applications (e.g. browsers, mail clients) and middleware solutions become powerless, as they cannot recognize the problem if the compromised certificates are not revoked. The only workaround would be for vendors to implement custom, probably non-automated mechanisms to collect and maintain the certificates that are potentially compromised but officially not revoked, which would waste significant resources and would be highly vendor specific, just like CRLSets and OneCRL. We should not think about extending the grace period during which the certificates are not revoked. We should think about how the costs and risks of troubleshooting can be minimized on the business side.
Let’s Encrypt showed a clear way to minimize cost and risk. Nothing can be cheaper and more reliable than a fully automated system. Security issues are and always will be part of the IT industry, but reaction times can be reduced. The case in question resolves itself within 90 days, as all the affected certificates expire in at most 90 days. If a serious issue happened at most other certificate authorities, the problem would not resolve itself within 90 days but within 27 months (!), as the ballot for a 1-year maximum validity failed at the CA/B Forum. Let’s Encrypt voted yes, by the way. Under such circumstances there is no pressure to automate, so there is a huge number of certificates that are not checked and renewed automatically, which may cause revocation to be postponed. What could the practical solutions be?
Both CRL and OCSP entries contain a field holding the time of revocation. If in such cases this time were set in the near future, it would give businesses a sufficient but fixed-term grace period, and it would also make it possible to check certificates against the future revocation, leaving each party the right to decide how to handle that situation. Client applications may warn their users that the certificate of the visited site will be revoked in the near future, so visiting it carries a risk. Middleware solutions may apply strict rules to such a future revocation according to their configuration.
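How a client or a middleware device might act on a revocation time that lies in the future can be sketched in a few lines of Python. This is only an illustration of the idea above, not an existing API: the names `RevocationState`, `classify` and `client_action`, as well as the warn-or-block policy, are assumptions made for the example, and the revocation time is assumed to have already been read from a CRL or OCSP entry.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class RevocationState(Enum):
    GOOD = "good"        # no revocation entry was found
    PENDING = "pending"  # the revocation time lies in the future
    REVOKED = "revoked"  # the revocation time has already passed


def classify(revocation_time: Optional[datetime], now: datetime) -> RevocationState:
    """Classify a certificate by the revocation time of its CRL/OCSP entry."""
    if revocation_time is None:
        return RevocationState.GOOD
    if revocation_time > now:
        return RevocationState.PENDING
    return RevocationState.REVOKED


def client_action(state: RevocationState, strict: bool) -> str:
    """Hypothetical policy: lenient clients warn, strict middleware blocks."""
    if state is RevocationState.REVOKED:
        return "block"
    if state is RevocationState.PENDING:
        return "block" if strict else "warn"
    return "allow"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    state = classify(now + timedelta(days=5), now)
    print(state.value, client_action(state, strict=False))
```

A browser would run with `strict=False` and show a warning, while a middleware device configured with `strict=True` would already refuse the connection during the grace period.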
Software that implements the ACME protocol may check whether a certificate will be revoked in the near future, just as it checks whether a certificate will expire in the near future. In both cases the certificate would be renewed, which would solve the problem within a day, as this kind of software typically runs daily. There are already ACME-compatible tools, libraries and certificate authorities in the wild that we can use to implement our automated certificate renewal process.
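The renewal decision described above could look roughly like the following Python sketch. It does not use any real ACME client library: `should_renew` and the 30-day `RENEWAL_WINDOW` are hypothetical, and both the expiry time and the announced revocation time are assumed to have been obtained elsewhere.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold for the example; real ACME clients use their own defaults.
RENEWAL_WINDOW = timedelta(days=30)


def should_renew(
    not_after: datetime,
    revocation_time: Optional[datetime],
    now: datetime,
    window: timedelta = RENEWAL_WINDOW,
) -> bool:
    """Renew when expiry or an announced revocation falls inside the window."""
    deadline = not_after
    if revocation_time is not None:
        # A pending (or past) revocation ends the useful life of the certificate early.
        deadline = min(not_after, revocation_time)
    return deadline - now <= window


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Valid for 60 more days, but revocation announced for 5 days from now.
    print(should_renew(now + timedelta(days=60), now + timedelta(days=5), now))
```

Treating the announced revocation time as an earlier expiry date means the daily renewal job needs no separate code path for the two cases.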
Many systems cannot be automated, for instance where the certificate authority does not support the ACME protocol, or where an extended validation (EV) certificate is used and automation is not an option. In those cases monitoring and alerting systems could provide an appropriate solution. Just as in the other cases, certificates should be checked against the future revocation, but instead of renewal, which is not possible in an automated way, alerts would be sent with the right severity to the right person.
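For such non-automated systems, a monitoring check might map the time left until the announced revocation to an alert severity. The sketch below is only one possible shape of that check: the function name `alert_severity`, the severity labels and the one-day and seven-day thresholds are all assumptions, not part of any monitoring product.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def alert_severity(revocation_time: Optional[datetime], now: datetime) -> Optional[str]:
    """Return an alert level for an announced revocation, or None if there is none."""
    if revocation_time is None:
        return None  # nothing announced, nothing to alert on
    remaining = revocation_time - now
    if remaining <= timedelta(0):
        return "critical"  # already revoked, the service is at risk right now
    if remaining <= timedelta(days=1):
        return "high"      # someone has to replace the certificate today
    if remaining <= timedelta(days=7):
        return "warning"   # schedule the manual replacement this week
    return "info"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(alert_severity(now + timedelta(days=3), now))
```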
Unfortunately, the revocation checking mechanisms have several theoretical and practical difficulties. Although the foundation of these mechanisms is the fact of revocation, this alone is far from sufficient. Client applications should have a revocation checking mechanism that works under real circumstances, but they do not have one.
- for client software, revocation lists are an unsuitable mechanism as
  - the size may grow suddenly, as the case in question shows
  - even small revocation lists would cause significant memory usage on mobile devices, because the revocation lists of a significant number of intermediate certificates would all have to be kept in memory
  - regular updates of the revocation lists would cause considerable traffic, which may have financial implications on mobile devices
  - CRL distribution points might be inaccessible to a client, as the list comes from another server (maintained by the CA) over a different protocol (HTTP instead of HTTPS)
  - neither the major browsers nor other client software (for example email clients) use the CRL mechanism
- for middleware devices, revocation lists are almost the optimal solution as
  - middleware devices can easily prefetch, regularly update and cache the revocation lists
  - the vast majority of certificate authorities support revocation lists in the intermediate certificates that issue the server certificates directly
  - revocation lists can be used to check a full certificate chain, but
  - there is a huge deficiency, namely that the most popular certificate authority does not support CRL, so almost 200 million certificates cannot be checked that way
- for certificate authorities, it is a comfortable mechanism as
  - it is not used by a large number of client applications, so there is no huge load on the CRL distribution points coming from distributed sources, which would make defense against a DDoS attack difficult
- for client software, OCSP is an unsuitable mechanism as
  - the OCSP server response is small, so the size problem of the CRL has been solved, but
  - OCSP responders may or may not be accessible to a client, as the response comes from another server (maintained by the CA) over a different protocol (HTTP instead of HTTPS)
  - all the other problems of revocation lists still exist in the case of OCSP, and
  - it causes another problem, namely that the client can be tracked by the OCSP responder: when client software checks a certain certificate, it can be concluded that the client communicates with the holder of that certificate
- for middleware devices, OCSP is an imperfect mechanism as
  - OCSP responses cannot be prefetched and used as a database like a CRL, but
  - they can be fetched on demand and cached for later use for a moderate amount of time
- for certificate authorities, it is a comfortable mechanism as
  - in practice there is no client that uses this mechanism to check certificate revocation, so it causes minimal load on the OCSP responder
- for client software, OCSP stapling is a suitable mechanism as
  - it solves all the problems of CRL and OCSP, but
  - it is supported by only a third of the top 150,000 sites, and the dynamic growth of recent years (3 to 5 percentage points per year) seems to have come to a halt
  - furthermore, a new problem arises that had previously been solved by CRL and OCSP, namely that only the leaf certificate can be checked by an OCSP staple
- for middleware devices, OCSP stapling could be a perfect mechanism as
  - leaf certificates could be checked by OCSP stapling
  - intermediate certificates could be checked by the CA
- for certificate authorities, OCSP stapling is a demanding mechanism as
  - if a certificate authority had 200 million active certificates, which is actually the case for Let’s Encrypt, only a third of the servers supported OCSP stapling, and the validity period of a staple were 5 minutes, this would result in roughly 200,000 requests per second on the OCSP responder, which would clearly cause difficulties
- for server software, OCSP stapling is also a demanding mechanism as
  - servers have to take over part of the burden from the OCSP responder, as they have to be reliable sources of the staple
- for server maintainers, OCSP stapling is a risky mechanism as
  - for security reasons OCSP stapling should go together with the OCSP Must-Staple extension in the certificate, which tells the client that a certificate carrying the extension must be served together with an OCSP staple. If there is an outage on the OCSP responder side or any fetching problem on the server side, this causes a validation failure on the client side. Browsers, however, soft-fail the revocation check, meaning that if the requested staple is not served, or not served fast enough, they ignore it
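The load estimate given in the list above for certificate authorities can be reproduced with a short calculation. The sketch assumes, as the estimate implicitly does, one certificate per server and exactly one staple refresh per validity period; the function name `staple_refresh_rate` is made up for the example, and the 5-minute validity is the figure used in the estimate, not a measured value.

```python
def staple_refresh_rate(
    active_certificates: int,
    stapling_share: float,
    staple_validity_seconds: int,
) -> float:
    """Average OCSP requests per second if every stapling server refreshes once per validity period."""
    stapling_servers = active_certificates * stapling_share
    return stapling_servers / staple_validity_seconds


if __name__ == "__main__":
    rate = staple_refresh_rate(
        active_certificates=200_000_000,  # active certificates, as in the list above
        stapling_share=1 / 3,             # a third of the servers staple
        staple_validity_seconds=5 * 60,   # 5-minute staple validity
    )
    print(f"{rate:,.0f} requests per second")  # prints "222,222 requests per second"
```

The result is slightly above the rounded figure of 200,000 requests per second quoted in the list. A longer staple validity lowers the load proportionally, at the price of a longer window in which a revoked certificate still carries a valid staple.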
Nowadays the focus is on the major browsers, but the problem is the same when we use a minor text-based web browser or a popular command-line tool in an automated environment. Not to mention that, although the majority of internet traffic is HTTPS, we should not forget the several other protocols (FTP, IMAP, LDAP, OpenVPN, POP3, RDP, SIP, SMTP, XMPP, …) that use TLS, and some others (IPSec, WireGuard, QUIC, ...) that do not but may also be affected by the problems of certificate revocation.
To make a long story short, the revocation checking system has some theoretical and several practical problems. Now a brand new question has arisen about the fundamental operation of certificate authorities. May a certificate authority postpone the revocation of possibly compromised certificates because revocation would make a huge number of internet sites inaccessible?
What would be the threshold, 1 thousand or 1 million sites? In my humble opinion it is a qualitative question rather than a quantitative one. Certificate authorities have to be strict and consistent when they decide on revocation in order to preserve trust in the PKI system, especially since in the last few years there have been several issues (PROCERT, Symantec, WoSign, …) that tore down the reputation of certificate authorities, although the case in question cannot be compared to them. In parallel, all certificate authorities must make certificate handling as easy and as automated as Let’s Encrypt does, to minimize the risk and the cost of certificate renewal after a security issue at the certificate authority. This helps to avoid the situation in which millions of possibly compromised certificates are still running on servers, and it makes it possible to declare that, if a certificate owner had an easy way to avoid the consequences of a sudden revocation, the business need comes after security.
Minority Report on Let’s Encrypt CAA Rechecking is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.