Analyzing Let's Encrypt statistics via Map/Reduce

Published 2017-05-16

I've been supplying the statistics for Let's Encrypt since they've launched. In Q4 of 2016 their volume of certificates exceeded the ability of my database server to cope, and I moved it to an Amazon RDS instance.

Ow.

Amazon's RDS service is really excellent, but paying out of pocket hurts.

I've been re-building my existing Golang/MySQL tools into a Golang/Python toolchain at slowly over the past few months. This switches from a SQL database with flexible, queryable columns to a EBS volume of folders containing certificates.

The general structure is now:

/ct/state/Y3QuZ29vZ2xlYXBpcy5jb20vaWNhcnVz
/ct/2017-08-13/qEpqYwR93brm0Tm3pkVl7_Oo7KE=.pem
/ct/2017-08-13/oracle.out
/ct/2017-08-13/oracle.out.offsets
/ct/2017-08-14/qEpqYwR93brm0Tm3pkVl7_Oo7KE=.pem

Fetching CT

Underneath /ct/state exists the state of the log-fetching utility, which is now a Golang tool named ct-fetch. It's mostly the same as the ct-sql tool I have been using, but rather than interact with SQL, it simply writes to disk.

This program creates folders for each notAfter date seen in a certificate. It appends each certificate to a file named for its issuer, so the path structure for cert data looks like:

/BASE_PATH/NOT_AFTER_DATE/ISSUER_BASE64.pem

The Map step

A Python 3 script ct-mapreduce-map.py processes each date-named directory. Inside, it reads each .pem and .cer file, decoding all the certificates within and tallying them up. When it's done in the directory, it writes out a file named oracle.out containing the tallies, and also a file named oracle.out.offsets with information so it can pick back up later without starting over.

The tallies contain, for each issuer:

The number of certificates issued each day
The set of all FQDNs issued
The set of all Registered Domains (eTLD + 1 label) issued

The resulting oracle.out is large, but much smaller than the input data.

The Reduce step

Another Python 3 script ct-mapreduce-reduce.py finds all oracle.out files in directories whose names aren't in the past. It reads each of these in and merges all the tallies together.

The result is the aggregate of all of the per-issuer data from the Map step.

This will get converted then into the data sets currently available at https://ct.tacticalsecret.com/

State

I'm not yet to the step of synthesizing the data sets; each time I compare this data with my existing data sets there are discrepancies - but I believe I've improved my error reporting enough that after another re-processing, the data will match.

Performance

Right now I'm trying to do all this with a free-tier AWS EC2 instance and 100 GB of EBS storage; it's not currently clear to me whether that will be able to keep up with Let's Encrypt's issuance volume. Nevertheless, even an EBS-optimized instance is much less expensive than an RDS instance, even if the approach is less flexible.

☕️ Insufficient Coffee