Going Nuclear: Modeling Threats to Distributed Systems
Another Brief Look (at our Data)
The data collected by our agents drives this whole project. At a minimum, we need the raw count data from the Geiger counter; without it this whole project is a non-starter. At best, we can collect other pieces of data that provide better insight into the surrounding environment. The bigger picture is often important when doing remote monitoring, so we should consider our requirements carefully. For the sake of argument, let’s say we want to collect the following pieces of data:
- Counts per minute
- Temperature
- Humidity
- Barometric pressure
These allow us to measure the current levels of radiation while also capturing the weather conditions, which can provide insight when investigating radiation spikes or patterns of fluctuation.
This data should be collected every minute and shipped off to one of our servers. What kinds of risks do we need to consider here?
Threats to Data Availability
At first glance you might think that there aren’t a lot of threats to our data. I mean, it’s just raw numbers that get collected and sent to a bunch of servers, right? Big whoop. Consider what happens when someone starts fiddling with that data though, unbeknownst to anyone.
Part of our project is to alert people when we see dangerous levels of radiation. Creating positive trends can create false alarms and cause undue panic. Creating negative trends can mask dangerous radiation levels and prevent necessary alerts from being dispatched.
These threats can manifest in a number of different ways.
Malicious Agents
In a perfect world only our agents would try and talk with our servers. Unfortunately, that’s not the case and we have to deal with rogues trying to talk to our servers without our permission. We can simplify this into an idea of a good agent (ours) and evil agents (theirs). Evil agents can be any number of things, but the gist is that they’re conceptually emulating the behavior of a real agent, but to do evil things.
As such, we should only be accepting data from trusted (good) agents. That means we need to authenticate all requests coming from all agents. Each agent should be known by a given ID and should have to provide proof that it really is who it claims to be. There are quite a number of ways we can do this, but they all mostly boil down to two methods: shared key (symmetric key, e.g. passwords) or public key (asymmetric, e.g. certificates). Authentication is the idea that only our trusted agent has a known key, and it does something to prove it has that key.
The easiest thing to do is to just send the key with every request like you do with passwords, but that means anyone listening in can snatch up that key and impersonate our trusted agent. We need to use a function that creates a value based on our key, but doesn’t contain the key. In this case, we want to generate a signature.
Our idea of authentication works by creating a body of data, running a hash function over the data, encrypting the hash against a key (technically you would be keying the hash if it’s a symmetric key), and attaching the encrypted value (our signature) to the end of the data, thus creating a token. You can validate the token by following the exact same process and comparing the results. If you don’t have the right key then your signature won’t match, and if the data has been modified then the hash won’t match.
Most tokens also contain additional information besides just an identifier like an issuance and expiry timestamp showing that you can only use the token for a limited period of time.
Once we know how we’re going to authenticate our agents we need to pick the type of key to use.
Shared keys are small, lightweight, and orders of magnitude faster to use. They are however significantly more difficult to protect because every party to the authentication process needs a copy of the whole key. This issue is compounded by the fact that you need to transfer those keys between each party at least once. Transferring keys is never a good thing to do either. What about keeping keys synchronized across all datacenters? How do you protect the key in transit? You can encrypt it, but against what key? You could encrypt that key against another key and repeat ad infinitum. Referring to our process from earlier we would see something like this:
id = "TK-421"
Client:
key = "12345"
signature = hash(id, key) = ".abc"
token = id + signature = "TK-421.abc"
Server:
key = "12345"
signature = hash(id, key) = ".abc"
token = id + signature = "TK-421.abc"
"TK-421.abc" == "TK-421.abc" => authenticated!
See how both sides need the key? The comparison would fail if the keys were different because the hash would produce a different result.
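The shared-key walkthrough above can be sketched with Python's standard `hmac` module. This is a minimal illustration, not the project's actual scheme: the `sign` and `verify` helper names and the choice of HMAC-SHA256 are assumptions, and the key is the toy value from the walkthrough.

```python
import hashlib
import hmac


def sign(agent_id: str, key: bytes) -> str:
    """Return a token of the form '<id>.<hex signature>'."""
    signature = hmac.new(key, agent_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{agent_id}.{signature}"


def verify(token: str, key: bytes) -> bool:
    """Rebuild the token from the claimed ID with our copy of the key and compare."""
    agent_id, _, _signature = token.rpartition(".")
    expected = sign(agent_id, key)
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))


shared_key = b"12345"  # toy key from the walkthrough; far too weak for real use
token = sign("TK-421", shared_key)
print(verify(token, shared_key))  # both sides hold the same key
print(verify(token, b"54321"))    # a different key produces a different signature
```

Both functions need the full key, which is exactly the storage and distribution problem described above.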
Alternatively you also have public keys, which are part of an asymmetric pair of keys. The idea is simple (ish). You can encrypt data against one key and decrypt the data with the other key, but you cannot encrypt and decrypt the same data with the same key. The benefit of using asymmetric keys is that you only need one of the keys, the public key, to authenticate a party. The party being authenticated only needs the other key, the private key, to authenticate themselves. That means the private key never needs to be transferred anywhere and can stay safely locked away.
The process is a bit more complex though.
id = "TK-421"
Client:
clientHash = hash(id)
signature = privateKey.encrypt(clientHash) = ".abc123"
token = id + signature = "TK-421.abc123"
Server:
hashDecrypted = pubKey.decrypt(signature)
serverHash = hash(id)
serverHash == hashDecrypted => authenticated!
The value of this method is significant because the private key used for authentication never leaves the client. The public key is available to anyone that wants to verify the token (hence the name), but that’s the only thing you can do with the public key. There are some negative aspects to using asymmetric keys though, and the biggest is performance. The signature is considerably larger than a symmetric signature, and the cost of signing data and verifying signatures is an order of magnitude higher.
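To make the private/public split concrete, here is a toy sketch of the same flow using textbook RSA arithmetic in plain Python. Everything in it is an assumption for illustration: the key pair is built from two well-known Mersenne primes, there is no padding, and the helper names are made up. It is not secure and only shows which side needs which key; a real agent would use a vetted crypto library.

```python
import hashlib

# Toy key pair built from two well-known Mersenne primes. Illustration only:
# real keys are randomly generated, much larger, and used with padding.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest(agent_id: str) -> int:
    """Hash the ID and reduce it so it fits under the toy modulus."""
    raw = hashlib.sha256(agent_id.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign(agent_id: str) -> str:
    """Client side: transform the hash with the private exponent D."""
    signature = pow(digest(agent_id), D, N)
    return f"{agent_id}.{signature:x}"


def verify(token: str) -> bool:
    """Server side: undo the transform with the public values (E, N) and compare."""
    agent_id, _, sig_hex = token.rpartition(".")
    try:
        signature = int(sig_hex, 16)
    except ValueError:
        return False
    return pow(signature, E, N) == digest(agent_id)


token = sign("TK-421")
print(verify(token))
```

Note that `verify` never touches `D`: the server can check tokens while holding only the public half of the pair.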
Consider some measurements I made comparing symmetric vs asymmetric keys for some unrelated benchmarking. These were made using an off-the-shelf JWT library instead of using the raw crypto primitives, but the comparison is still valid because it’s the same basic process. I measured this data by performing 10,000 iterations of generating a token and then validating the token.
These are my results.
| Algorithm | Type | Average Signing Time (ms) | Average Validation Time (ms) | Token Size (chars) | Key Strength (bits) |
| --- | --- | --- | --- | --- | --- |
| HMAC 256 | Symmetric | 0.049 | 0.0564 | 445 | 256 |
| EC 256 | Asymmetric | 0.8115 | 2.3534 | 488 | 128 |
| RSA 2048 | Asymmetric | 5.2612 | 0.1927 | 744 | 112 |
This is just a subset of the results, but as you can see the differences are obvious. You can get better results by choosing different algorithms, but there will always be a tendency of symmetric keys being faster. You have to balance the risk of storing keys in multiple places with system performance.
Let’s assume that we’ve picked asymmetric keys for authenticating our agents. We need to store the private key in such a way that the agent itself can get access to it when it needs to, but also so the keys can’t be stolen by anyone prying open the enclosure and picking at the components.
Large centralized environments like datacenters can benefit from devices called Hardware Security Modules (HSM). They are designed to protect keys by embedding them in physical memory and taking over the crypto work that dependent servers would otherwise do with those keys. That way the keys never leave the safety of the HSM. The controlled environment also allows for more complex protection mechanisms that will zero out or destroy the memory if signs of device tampering are detected.
It’s impractical to use an HSM for every single agent out there though, so we need to look at something a bit more portable. Most laptops these days actually have a similar device built into their motherboards called a Trusted Platform Module (TPM). They are used to store sensitive keys in a similar manner. Alternatively there are also manufactured chips dedicated to protecting keys that are compact versions of HSMs too. We should consider storing our keys in a similar manner.
We can now start sending data to our servers and can be fairly confident the data is coming from an agent we trust, almost. If we look back at our authentication check above the only thing we’re authenticating is the ID of the agent, not whether the data has been modified. We trust it’s been sent by someone we know, but if we consider that our data is going to be flowing over the internet, that means any number of devices will sit between our agent and our servers. Any one of those devices can twiddle bits maliciously or inadvertently, and our servers would have no idea something has changed.
What we need is proof of authenticity of the data we’re sending. We can do this easily enough by just sticking the data in the token with our device ID, or we can use a similar mechanism and sign the data with another key. The signature now provides evidence the data hasn’t been modified, but what happens if someone keeps sending the same signed data over and over again? It might mean our servers collect hundreds or thousands or millions of spurious records that influence data analysis, or mask alarming data.
We can use a value called a nonce to detect duplicate submissions. It’s just a non-repeating counter in the signed data packet that increments every time the agent checks in. The server is already storing the data from the check-in, so all it needs to do is grab the last checked-in value and verify it’s less than the new value. Alternatively you can use a timestamp as well. If the value has been seen already then discard the request and drop the connection.
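A minimal sketch of that server-side check, assuming the nonce is an integer counter and that the last accepted value is kept per agent. The `ReplayGuard` class and its in-memory dictionary are hypothetical stand-ins for whatever storage the servers actually use.

```python
class ReplayGuard:
    """Tracks the highest nonce accepted from each agent."""

    def __init__(self):
        self._last_seen = {}  # agent ID -> highest nonce accepted so far

    def accept(self, agent_id, nonce):
        """Record the nonce and return True only if it is strictly newer."""
        last = self._last_seen.get(agent_id)
        if last is not None and nonce <= last:
            return False  # duplicate or stale check-in: discard it
        self._last_seen[agent_id] = nonce
        return True


guard = ReplayGuard()
print(guard.accept("tk-421", 10801543))  # first time this value is seen
print(guard.accept("tk-421", 10801543))  # same packet replayed
```

The nonce has to be inside the signed portion of the packet, otherwise an attacker could simply bump the counter on a captured check-in.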
By this point we have a pretty good idea of what we actually need to send to the server. We also know how JWTs perform, so why not use one for our check in?
{
"did":"tk-421",
"jti":"10801543",
"exp":1488513872,
"cpm":"16.23",
"tmp":"72.04",
"hum":"0.87",
"bar":"101.2"
}
We see our device ID (‘did’), the nonce (‘jti’ – JWT ID), the token expiry (‘exp’), and all our data. Running that through our signing process will provide us a small blob of data we can send to the servers.
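As a rough sketch of that signing process, the claims above can be turned into a compact JWT with nothing but the standard library. One loud assumption: this uses HS256 (the symmetric HMAC option) because it needs no third-party crypto package, whereas the design discussed here assumes asymmetric keys, which would use an algorithm such as RS256 or ES256 from a JWT library. The helper names and the demo key are made up.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_jwt(claims: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def decode_jwt(token: str, key: bytes, now: int) -> dict:
    """Return the claims, or raise ValueError if the token is bad or expired."""
    try:
        signing_input, signature = token.rsplit(".", 1)
        payload_b64 = signing_input.split(".")[1]
    except (ValueError, IndexError):
        raise ValueError("malformed token") from None
    # Always verify as HS256; the header's "alg" value is deliberately not trusted.
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims


claims = {"did": "tk-421", "jti": "10801543", "exp": 1488513872,
          "cpm": "16.23", "tmp": "72.04", "hum": "0.87", "bar": "101.2"}
demo_key = b"demo-key-not-for-production"  # hypothetical key
token = encode_jwt(claims, demo_key)
print(token)
```

Because the readings sit inside the signed payload, changing any value in transit invalidates the signature, and the `exp` check limits how long a captured token stays useful.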
Lastly, we need to make sure our agent is sending data to our servers and not someone else’s. This is possibly one of the easiest solutions: use SSL/TLS. We don’t even need a certificate signed by a CA either. We just need each agent to know what the expected certificate is, and validate the provided certificate matches. This is called certificate pinning. It’s magical.
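A minimal sketch of certificate pinning with Python's `ssl` module, assuming the agent ships with the SHA-256 fingerprint of the server's certificate. The function names and the fingerprint-based pin are assumptions; pinning the public key instead of the whole certificate is another common choice.

```python
import hashlib
import socket
import ssl


def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_cert).hexdigest()


def is_pinned(der_cert: bytes, pinned_fingerprint: str) -> bool:
    return fingerprint(der_cert) == pinned_fingerprint.lower()


def connect_pinned(host: str, port: int, pinned_fingerprint: str) -> ssl.SSLSocket:
    """Open a TLS connection and keep it only if the server cert matches the pin."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # No CA involved: chain and hostname checks are off, so the pin is the only check.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    raw = socket.create_connection((host, port), timeout=10)
    sock = context.wrap_socket(raw, server_hostname=host)
    der_cert = sock.getpeercert(binary_form=True)
    if der_cert is None or not is_pinned(der_cert, pinned_fingerprint):
        sock.close()
        raise ssl.SSLError("server certificate does not match the pinned fingerprint")
    return sock
```

The trade-off is operational: when the server certificate is rotated, every agent needs the new pin before the old certificate is retired.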
A side effect of using SSL/TLS is that we also get encrypted communications for free, meaning nobody can see the raw data as it’s transiting the internet.
This would be obviously useful if we’re sending sensitive data in our check-ins, but presently we’re not doing that. There are some subtle operational protections we get out of this though: it blocks wholesale replaying of check-ins because attackers just can’t see the data. They’d have to guess at the structure of the data.
Malicious Actors (everything else)
Thankfully machines have not become sentient and still require evil people to do bad things. With any luck this’ll be true for quite some time.
I do however fear the day my Xbox realizes it’s had enough of me repeatedly swearing at it.
Previously we talked about the idea of a malicious agent, but there really is no such thing as a malicious agent because it still requires an actor controlling the agent. In the previous section it was implied someone was forcing an agent to do bad things. Everything else is getting categorized here.
What happens if someone disables or destroys an agent? At some point we have to assume that agent is out of commission.
First of all, we should be monitoring the agent and making sure it’s checking in, and alert when it’s been offline for a given period of time, or when it starts sending in strange data. Second, we need to make sure all the other agents around it can pick up the slack (sound familiar?) and can fill in enough information to cover the gap of missing data.
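The offline check can be as simple as comparing each agent's last check-in time against a threshold. In this sketch the function name, the in-memory mapping, and the 300 second default (five missed one-minute check-ins) are all assumptions.

```python
def find_silent_agents(last_check_in, now, max_silence=300.0):
    """Return the IDs of agents whose last check-in is older than max_silence seconds."""
    return sorted(
        agent_id
        for agent_id, seen_at in last_check_in.items()
        if now - seen_at > max_silence
    )


# Timestamps are seconds since some epoch; the values are made up.
last_check_in = {"tk-421": 1000.0, "tk-422": 1290.0}
print(find_silent_agents(last_check_in, now=1310.0))
```

Detecting strange data is a separate problem and would need per-sensor rules, such as bounds or comparison against neighboring agents.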
In earlier sections we looked at distributing the databases across datacenters in case one or more datacenters was destroyed. This doesn’t solve all the problems though.
Suppose for a moment that someone was able to tamper with the data in the database. First of all, data should be immutable, meaning that once it’s received and stored in the database it should never be modified. We can be careful to block update statements to the database, or rely on database technologies that support read-only fields. We also have to block delete statements.
Supposing someone was able to run an update or delete statement, we need auditing to track that those events occurred. Databases have transaction logs and triggers for this sort of thing, and we can track changes to data fairly easily, assuming the attacker doesn’t have rights to delete the log, or disable the triggers.
Assuming the attacker does have the ability to delete logs and disable triggers, we need a way to roll back the changes. Distributed database systems actually do help solve part of this problem because an attack on a single database might not replicate to any other databases. We can run consistency checks across each database and compare data. When things are out of whack we investigate. Unfortunately we can’t rely on this alone, so we need one last thing: backups.
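One way to sketch such a consistency check is to compute a digest over each database's copy of the data and flag any copy that disagrees with the most common digest. The record shape, replica names, and helper functions here are hypothetical; a real system would likely digest per time window rather than over the whole dataset.

```python
import hashlib
from collections import Counter


def dataset_digest(records) -> str:
    """Order-independent SHA-256 digest over an iterable of (id, payload) records."""
    digest = hashlib.sha256()
    for record_id, payload in sorted(records):
        digest.update(f"{record_id}\x00{payload}\n".encode("utf-8"))
    return digest.hexdigest()


def find_divergent(replicas: dict) -> list:
    """Return the names of replicas whose digest differs from the most common one."""
    digests = {name: dataset_digest(rows) for name, rows in replicas.items()}
    if not digests:
        return []
    most_common, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != most_common)


good = [(1, "cpm=16.23"), (2, "cpm=16.40")]
tampered = [(1, "cpm=16.23"), (2, "cpm=2.00")]
replicas = {"dc-east": good, "dc-west": list(reversed(good)), "dc-north": tampered}
print(find_divergent(replicas))
```

This only tells us which copies disagree. If the copies split evenly, or the attacker reached most of them, the most common digest is not necessarily the honest one, which is why someone still has to investigate.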
All data should be backed up. All databases should be backed up. Everything should be backed up.
 
Everything!
Backups are wonderful. Of course, backups need to be up to date. Stale backups are not wonderful. That means all data needs to be backed up regularly. Additionally, all backups need to be validated that they 1) succeeded, and 2) haven’t been tampered with themselves. It’s a vicious rabbit hole we descend.
Validating that the backups haven’t been tampered with means knowing that they are in the same state as they were when they were created. We can rely on signatures (or something like a MAC specifically) to create a known good fingerprint of the backup, and then we can make copies of the fingerprint and give them to a whole bunch of people. You can be reasonably certain a backup is untouched if enough people say that fingerprint matches.
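A sketch of the fingerprinting idea, assuming an HMAC-SHA256 over the backup contents and a simple quorum of independently held copies of the fingerprint. The function names, the key, and the quorum size are assumptions, and `io.BytesIO` stands in for an open backup file.

```python
import hashlib
import hmac
import io


def backup_fingerprint(stream, key: bytes) -> str:
    """HMAC-SHA256 over a binary stream, read in 1 MiB chunks."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for chunk in iter(lambda: stream.read(1024 * 1024), b""):
        mac.update(chunk)
    return mac.hexdigest()


def backup_is_trusted(fingerprint: str, held_copies: list, quorum: int) -> bool:
    """True if at least `quorum` independently held fingerprints match ours."""
    matches = sum(1 for copy in held_copies if hmac.compare_digest(copy, fingerprint))
    return matches >= quorum


key = b"backup-mac-key"                     # hypothetical key
original = io.BytesIO(b"check-in records")  # stand-in for open(path, "rb")
fp = backup_fingerprint(original, key)

# Later: recompute over the stored backup and ask the fingerprint holders.
current = backup_fingerprint(io.BytesIO(b"check-in records"), key)
print(backup_is_trusted(current, [fp, fp, "something-else"], quorum=2))
```

A keyed MAC means an attacker who can alter the backup cannot also produce a matching fingerprint without the key, and the copies held by other people cover the case where one stored fingerprint is swapped.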
Developing the Model
This model is incomplete. Partly because I haven’t gone through every single component bit by bit, but also because threat models are constantly changing, because threats are constantly changing. It’s important to understand that this model is living. It needs to be reviewed often enough to compensate for changes to the system that will inevitably occur when theory meets reality.
The point of this post was to go over the process involved in developing a threat model, not to develop an actual threat model. Hopefully a few things were learned along the way, but if nothing else I have one key lesson: it makes quite a bit more sense to start this whole process looking at the agent first.
You can’t build an accurate model that describes the threats to a system if you don’t understand the nature of the system you’re building.
Follow Ups
I make no claim that this post contains everything you need to know about threat modeling. There are quite a few resources out there on the topic and I can’t list them all, but I can provide a starting point.
- OWASP Page on Threat Modeling
- Microsoft’s Threat Modeling tool
- Threat Modeling: Designing for Security (book, 2014)
- Threat Modeling (book, 2004)