Interesting findings and learnings as we navigated the problems when Let’s Encrypt’s SSL Root Certificate expired on Sept 30, 2021.
SSL at AgroStar
We use SSL to secure all our API endpoints. There are many SSL certificate providers, but we use Let’s Encrypt because:
It is free, and,
We can automate certificate renewal.
People often lose track of when a certificate expires and forget to renew it in time. The website or endpoint then stays inaccessible until someone notices and renews it. Thanks to Let’s Encrypt’s automatic renewal mechanism, that has never happened at AgroStar.
Chain of Trust
Before I get into the actual problems we faced, let’s recap a bit about SSL certificates and the chain of trust.
Every secure website presents a certificate to the browser. For the browser to trust that certificate directly, it would have to be in the browser’s trust store. The trust store is a collection of certificates that the browser makers have validated and marked as trusted. But with over a billion websites, it would be nearly impossible to validate each one and add it to the trust store.
So, instead, the browser relies on a chain of trust. It works like this: if I trust A, and A says, “I vouch for B,” then I can trust B too. And further on, if B says, “I vouch for C”, I can trust C as well.
In certificates, the statement “I vouch for…” is carried out by signing (or more commonly, issuing) the certificate. Now, to check if the browser can trust a website’s certificate, it looks at the issuer of the website’s certificate. Suppose the issuer is already trusted (i.e., exists in the trust store), well and good. If not, it looks at the issuer’s issuer and further up the chain until it finds a certificate in the trust store.
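The walk up the chain can be sketched in a few lines of Python. This is only a toy model built on the A, B, C example above: certificates are plain names, and the `issued_by` mapping and `find_trust_anchor` function are made up for illustration. A real validator also verifies signatures, expiry dates, and more.

```python
def find_trust_anchor(cert, issued_by, trust_store):
    """Follow issuer links upward from `cert`.

    Returns the first certificate name found in the trust store,
    or None if the chain ends (or loops) without reaching one.
    """
    seen = set()
    current = cert
    while current is not None and current not in seen:
        if current in trust_store:
            return current
        seen.add(current)
        current = issued_by.get(current)  # None when there is no known issuer
    return None


# "A vouches for B, and B vouches for C"
issued_by = {"C": "B", "B": "A"}
trust_store = {"A"}

print(find_trust_anchor("C", issued_by, trust_store))  # A
print(find_trust_anchor("X", issued_by, trust_store))  # None
```

C is trusted because the walk reaches A, which is in the trust store. X has no issuer we know of, so the walk ends without finding a trusted certificate.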
Let’s take medium.com’s certificate as an example. You can inspect this chain by clicking on the lock icon in your browser’s URL bar and then on Connection is Secure > Certificate is valid in Chrome or the Show Certificate button in Safari.
In medium.com’s certificate chain, Baltimore CyberTrust is the root certificate, Cloudflare Inc is an intermediate certificate, and the medium.com website certificate is the leaf certificate.
We can trust medium.com’s certificate because:
Cloudflare Inc ECC CA-3 issued (vouches for) it, and,
Baltimore CyberTrust issued (vouches for) Cloudflare Inc ECC CA-3’s certificate, and,
Baltimore CyberTrust exists in the browser’s trust store.
You can verify the last fact by navigating through your browser’s security settings. You can find it under Preferences > Privacy and Security > Manage Certificates in Chrome. Safari does not let you access this from the browser, but you can see it from Finder under Applications > Utilities > Keychain Access.
Certificate Expiry
All certificates have an expiry date to protect their secret key. Anyone can guess the key by brute force, i.e., try all possible values. It just takes a very, very long time to do that, and the expiry date ensures that before someone tries all the combinations, they have run out of time.
When a website’s certificate expires, we need to renew it or get a new one. It’s that simple. But what happens when a root certificate expires? Again, create a new one. Simple?
In an ideal world, it would have been, but the world is not ideal. A root certificate gets its value because it is part of the browser’s trust store. Getting a new certificate into all browsers’ trust stores is not very easy. That’s because there are many operating systems and browsers, and all of them have to update their trust stores. Most operating systems and browsers automatically update themselves with the new root certificates, but not all.
In particular, in Android versions before 7.1.1, it is impossible to update the trust store. The reason is beyond the scope of this post, but you could read about it in this article. And many devices are still running Android versions older than 7.1.1.
And that is precisely the problem Let’s Encrypt faced: their root certificate expired on 30th September 2021. But they came up with an innovative solution.
The Let’s Encrypt Solution
Let’s Encrypt’s old certificate chain looked like this:
DST Root CA X3 (expired) > Let’s Encrypt R3 > Website
Since DST Root CA X3 was expiring, they got a new root certificate called ISRG Root X1. They could have issued new website certificates using this and a new intermediate certificate. It would have looked like this:
ISRG Root X1 (new) > Let’s Encrypt R3 (new) > Website
As we know, this wouldn’t work on the old Android devices, because they cannot add ISRG Root X1 to their trust store. However, due to a quirk in certificate validation on these devices, the old chain would continue to work: this OS doesn’t care about the expiry date of the root certificate. (This is a debatable point: is it OK to trust a certificate issued by someone, but that someone is no more? Different clients deal with it differently, but practically, the laxity proved very useful for Let’s Encrypt.)
It would have been simple to issue two certificates for the websites — one for old Android devices and another for newer clients. But a website can have only one certificate. It looks like a dead-end, but they solved it with innovative thinking.
They arrived at the following new chain:
DST Root CA X3 (expired) > ISRG Root X1 > Let’s Encrypt R3 > Website
Having two root certificates in one chain is a bit unusual because (a) a root certificate is signing another root certificate, and (b) the validity of the issued certificate (ISRG Root X1) is longer than the validity of the issuer (DST Root CA X3). But it would work, and that’s all that mattered.
New clients would have ISRG Root X1 in their trust store and needn’t look beyond that in the chain of trust.
Old Android devices would not have ISRG Root X1, but they would trust DST Root CA X3 and would thus validate the chain.
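Here is a rough Python model of those two behaviours. Everything in it is a simplification for illustration: certificates are just names, `FAR_FUTURE` is a placeholder expiry and not a real certificate date, and the `check_root_expiry` switch stands in for the old Android quirk of ignoring the root’s expiry date.

```python
from datetime import date

FAR_FUTURE = date(2030, 1, 1)  # placeholder expiry, not a real certificate date

# The chain as served: leaf first, each certificate issued by the next one.
CHAIN = ["Website", "Let's Encrypt R3", "ISRG Root X1", "DST Root CA X3"]
EXPIRES = {
    "Website": FAR_FUTURE,
    "Let's Encrypt R3": FAR_FUTURE,
    "ISRG Root X1": FAR_FUTURE,
    "DST Root CA X3": date(2021, 9, 30),
}


def validate(chain, trust_store, today, check_root_expiry=True):
    """Return the trust anchor's name if the chain validates, else None."""
    for name in chain:
        is_anchor = name in trust_store
        expired = EXPIRES[name] < today
        # An expired certificate fails validation, unless it is the trust
        # anchor and this client ignores expiry on trust anchors.
        if expired and (check_root_expiry or not is_anchor):
            return None
        if is_anchor:
            return name  # stop at the first trusted certificate
    return None  # ran out of chain without reaching a trusted certificate


today = date(2021, 10, 1)

# New client: has ISRG Root X1 and never looks past it.
print(validate(CHAIN, {"ISRG Root X1"}, today))  # ISRG Root X1

# Old Android: only has DST Root CA X3, but ignores its expiry.
print(validate(CHAIN, {"DST Root CA X3"}, today, check_root_expiry=False))  # DST Root CA X3

# A strict client that only has DST Root CA X3 would reject the chain.
print(validate(CHAIN, {"DST Root CA X3"}, today))  # None
```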
Brilliant? Yes! So, all Let’s Encrypt issued certificates have used this chain since January 2021. AgroStar’s certificate too looked like this, and we thought we were good.
October Mayhem — Episode 1
Late night on 30th September, one of our services started failing. The error was, of course, “Invalid SSL certificate” while accessing an API of an upstream service. But some other services were accessing the same API without any problem. Damn! This was confusing.
The result was inconclusive when we tried other clients such as curl to access the upstream service. From some systems, it would work, but it failed from some others. The biggest problem was with Postman; it did not work for anyone. We were stumped as to why it failed, that too only for some clients.
When we compared the clients’ libraries, we found that the version of the Python requests module was outdated on the service that was failing. We upgraded the requests module from 2.11 to 2.18, and the error went away! We assumed that 2.11 had a bug and gave it no further thought.
As for Postman, a bit of Googling told us that others were facing the same problem. That convinced us that our Let’s Encrypt certificate was valid, just that some clients were buggy. And that included Postman. As expected, Postman released a fixed version in two days (that was quick!).
Learning: certificate validation is not standardized across clients. Every client has its own trust store, and also its own algorithm for validation. Ensure that all your software is up to date.
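When comparing a client that works with one that fails, it helps to print what each one is actually running. A small sketch using only Python’s standard library is below; the `tls_environment` helper is our own name, not part of any library. It reports the OpenSSL build that Python is linked against and the default CA locations that build uses. Note that libraries such as requests usually bring their own CA bundle (via certifi), so this shows only part of the picture.

```python
import ssl
import sys


def tls_environment():
    """Collect the TLS-related versions and default CA locations of this Python."""
    paths = ssl.get_default_verify_paths()
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        "cafile": paths.cafile,  # None if the default bundle file does not exist
        "capath": paths.capath,  # None if the default CA directory does not exist
    }


for key, value in tls_environment().items():
    print(f"{key}: {value}")
```

Running this on both machines and comparing the output is a quick way to spot an outdated OpenSSL or a different trust store.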
October Mayhem — Episode 2
The next day, another internal batch job reported certificate validation failures while trying to access other endpoints secured by the same Let’s Encrypt certificate. This failure, too, came from a Python script, so our first thought was to upgrade the requests module. But it didn’t help.
A few checks using curl and other clients from the system showed that the problem was with the OS in this case. More research on the internet led us to this brilliant post in the Geek Culture blog with a very in-depth analysis of the issue. The server was running Ubuntu 14.04, with OpenSSL 1.0.1. This (old) version of OpenSSL had a quirk: it continued up the trust chain even after finding a recognized root certificate! Despite seeing a trusted certificate (ISRG Root X1), it traversed the chain and found the expired DST Root CA X3 certificate. It then declared the leaf certificate invalid.
The solution, in this case, was to delete the expired DST Root CA X3 from the trust store. With that certificate gone, OpenSSL 1.0.1 had no choice but to stop traversing the chain at ISRG Root X1, so it never reached an expired certificate further up.
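To see why the deletion helps, here is a deliberately simplified Python model of the quirk as described above. It is not OpenSSL’s actual path-building algorithm. The `legacy_validate` function and the `EXPIRED` set are made up for illustration; the only idea it captures is that the old validator keeps climbing and uses the last trusted certificate in the chain as its anchor.

```python
# The chain as served: leaf first, each certificate issued by the next one.
CHAIN = ["Website", "Let's Encrypt R3", "ISRG Root X1", "DST Root CA X3"]
EXPIRED = {"DST Root CA X3"}  # state of the world after 30th September 2021


def legacy_validate(chain, trust_store):
    """Model of the quirk: keep climbing and anchor on the LAST trusted cert."""
    anchor = None
    for name in chain:
        if name in trust_store:
            anchor = name  # do not stop here; a later match overrides this one
    if anchor is None or anchor in EXPIRED:
        return None  # no trusted certificate at all, or the anchor has expired
    return anchor


# Before the fix: both roots are in the trust store, so the expired one wins.
print(legacy_validate(CHAIN, {"ISRG Root X1", "DST Root CA X3"}))  # None

# After the fix: DST Root CA X3 is gone, so the walk anchors on ISRG Root X1.
print(legacy_validate(CHAIN, {"ISRG Root X1"}))  # ISRG Root X1
```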
To remove the offending certificate, we removed the entry in the file /etc/ca-certificates.conf and ran the command update-ca-certificates. Upgrading OpenSSL on the system was another alternative, but it was too complicated due to its dependencies.
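To double-check that the certificate is really gone, you can load the CA bundle and search it by name. The sketch below uses Python’s standard ssl module. The bundle path is the one update-ca-certificates writes on Debian and Ubuntu and is an assumption for any other system; the helper names (`common_name`, `find_roots`, `is_expired`, `load_bundle`) are our own.

```python
import os
import ssl
import time

# Bundle generated by update-ca-certificates on Debian/Ubuntu; adjust elsewhere.
BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def common_name(cert):
    """Extract the subject commonName from a certificate dict, if present."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def find_roots(ca_certs, name):
    """Return the certificates whose subject commonName equals `name`."""
    return [cert for cert in ca_certs if common_name(cert) == name]


def is_expired(cert, now=None):
    """True if the certificate's notAfter time lies before `now` (epoch seconds)."""
    now = time.time() if now is None else now
    return ssl.cert_time_to_seconds(cert["notAfter"]) < now


def load_bundle(path):
    """Load a PEM bundle and return its CA certificates as dicts."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations(cafile=path)
    return ctx.get_ca_certs()


if os.path.exists(BUNDLE):
    matches = find_roots(load_bundle(BUNDLE), "DST Root CA X3")
    if not matches:
        print("DST Root CA X3 is not in the bundle")
    for cert in matches:
        state = "expired" if is_expired(cert) else "still valid"
        print(f"DST Root CA X3 is in the bundle ({state})")
else:
    print(f"No bundle found at {BUNDLE}")
```

We load the bundle file explicitly because certificates that live only in a CA directory are not listed by `get_ca_certs()` until they have been used.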
Learning: keep operating systems up to date too. Ubuntu 14.04 was way too old; we should have moved off it much earlier.
October Mayhem — Episode 3
By now, we were confident that there was no problem with the Let’s Encrypt issued certificate and that the clients out there were buggy or not up to date. The only recourse (or so it appeared) was to update the clients.
But when the clients are not under your control, you have a complicated problem at hand. An AgroStar partner accessed our external API endpoints to send us some data. We depend on them because we get valuable customer leads through this partnership.
On day 3, they reported that their API calls failed with certificate validation issues. They insisted that our certificate was invalid and showed us a screenshot to back up their claim.
We knew that the validator should stop at ISRG Root X1, and it was apparent to us that the problem was with the client. But we couldn’t convince them. It ceased to be a technical problem at this point. After a day of back-and-forth, we realized it was not worth the trouble and the loss of leads. So we purchased a new certificate from another security vendor and used that for this API endpoint alone.
Learning: There is no point in fighting a battle you cannot win, even if you know that you are right. Battles can wear you down.
Also, with the new certificate, we now had something to fall back on quickly if something else broke. But fortunately, that was the last problem. All our APIs are now running smoothly and securely.