It probably won’t come as a shock to you that as I was writing up my last post on IoT and my new Geiger counter I was mentally reviewing all the things that scared the crap out of me (sorry: had me concerned) security-wise. I don’t mean apocalyptic visions of Fallout. I mean the fact that I have a device I don’t necessarily trust sitting on my network, constantly feeding data to a remote server without much control by me.
I’m predictable like that.
Upon further review I realized I wanted to write up my thoughts on how I would protect against such an unknown, but really… you just need to stick it on an isolated network. You can’t see me, and I can’t see you. I’ll leave it to the networking wonks to decide what isolated really means in this case.
That of course left me with a very short and very boring post.
It then occurred to me that I’ve never really written in detail about different types of threats or risks or how you model them when developing a system. Of course, I don’t want to write in detail about different types of threats — there’s plenty of content out there already about this sort of thing. What I do want to do though is talk about how you model the risks so you can work through all the potential threats that might exist.
In keeping with the theme (because it’s fun and highly improbable) we’re going to build a fictitious radiation monitoring system just like the uRADMonitor, but we’re going to look at how you would evaluate the risks to such a system and how you might mitigate them.
I don’t have any affiliation with the uRADMonitor project aside from the fact that I have a device of theirs in my office on my network. I can’t speak to their motivation for building the system and I can’t say what they’ve done to protect their system. This is purely a thought experiment on how I would build and protect a similar system.
Additionally, this is not necessarily a model where I’ve listed every possible risk or threat and rigorously verified solutions work. In fact, I’ve left out a number of things to consider.
Remember: it’s just for fun.
Projects are created to solve particular problems (also, sky is blue: report). This monitoring network is no different. We’ll review the problem as we see it, and then we’ll create a design to solve the problem. We’ll then review the design and see what kinds of risks we might find and how we can solve for them as well. We’ll iterate on this a few times and hopefully come up with a secure solution.
Not surprisingly I’ll go off on tangents about things I find interesting and gloss over other things. They may not be pertinent to the discussion at all.
Radiation levels affect the health and well being of the population. Sudden significant changes in levels can have dangerous health effects and indicate signs of environmental accidents or geopolitical unrest. An unbiased real time record of radiation levels needs to be created and monitored so appropriate actions can be taken in the event of serious danger.
Build and deploy a distributed network of radiation monitoring agents that persist data to a central store so the data can be reviewed by analysts and automatically notify the appropriate people when significant events occur.
If you’ve ever done product design that cryptic answer might be all you get from the product owner, so we need to break that down a bit. We’ve got a whole bunch of agents distributed throughout the world sending data back to a central server. This data should be consumable by analysts (scientists, safety authorities, etc.) either through an API (a program might be analyzing the data), or through a graphical interface (an actual person wanting to review data). These are easy enough to develop: create writable web APIs for the agents, a web portal for the people, and read-only web APIs for the programs.
I’d say we glossed over the solution a bit, and we did that purposefully because investing a lot of effort into designing a system without understanding the risks is, well, a waste of time. We just need enough to get things moving. We have a good idea of what we want to build for now, so let’s try and understand some of the risks. Our initial design looks something like this:
Figure 1: Basic Data Flow Diagram
When we consider the risk of something, we consider the chance that thing might occur. We need to weigh the chance of that thing occurring with the effect it would have on the system. We then compare the cost of the effect with the cost of mitigating the effect and pick whichever one is lower. We generally want the lowest possible level of risk achievable based on the previous formula. In our case it’s actually pretty easy to work out because anything that disrupts the capabilities of the network is considered high risk. Let me see if I can explain why.
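To make that comparison concrete, here’s a minimal sketch. The function names (`expected_loss`, `should_mitigate`) and the dollar figures are made up for illustration; they don’t come from any real assessment.

```python
def expected_loss(probability: float, impact_cost: float) -> float:
    """Chance of the event (0..1) weighed against what it would cost us."""
    return probability * impact_cost


def should_mitigate(probability: float, impact_cost: float, mitigation_cost: float) -> bool:
    """Mitigate only when that is cheaper than absorbing the expected loss."""
    return mitigation_cost < expected_loss(probability, impact_cost)


# Hypothetical figures: a 2% chance of a 500,000 outage, expected loss 10,000.
print(should_mitigate(0.02, 500_000, 4_000))   # True: the fix is cheaper
print(should_mitigate(0.02, 500_000, 25_000))  # False: the fix costs more
```

For our monitoring network the impact is effectively unbounded, which is why nearly every mitigation ends up looking worthwhile.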
The idea of having a distributed network of things connecting back to a central server is not novel. The fairly detailed diagram above could easily be confused with a thousand other diagrams for a thousand other systems doing a thousand different things. There are a lot of commonly considered risks related to this and they can probably be summed up into these properties:
- Server availability – Is the server online and accepting requests whenever an agent wants to check in, or when an analyst wants to review the data?
- Agent availability – Is an agent online, able to measure the things it needs to measure, and have a connection back to the server to check in?
- Data accuracy – Do agents check in data that is measured properly and hasn’t been fiddled with?
This is a pretty broad but reasonable assessment for a number of systems. The risk is that any of these properties answer false because a failure in any one of these properties means the system can’t function as intended. We need to consider what sorts of things can affect these properties.
Here’s the thing: it needs to be available 100% of the time and absolutely accurate. It has the potential to influence the actions of many people regarding things that can kill them. This is one of those things that most security people really only dream about building (we have weird dreams). There can be serious consequences if we fail. Therefore we need to consider even the tiniest chance of something occurring that could affect those three properties and mitigate.
As system designers we need to go nuclear. (yes I know that was a bad pun. I’ll show myself out.)
In our case the system needs to withstand serious catastrophic events simply because it’s intended to monitor serious catastrophic events. If such an event occurs and the system can’t withstand the effects then we’ve failed and the whole point of the project can’t be realized. Conversely, it also needs to withstand all the usual threats that might exist for a run-of-the-mill distributed system.
Normally we would need to do an analysis on the cost of failure so we could get a baseline for how much we should invest, but this is my model and I’m an impossible person. The cost of failure is your life. (muahaha?)
Threats to Server Availability
Interestingly this property is easiest to review, relatively speaking. Our goal is to make sure that everything on the right side of that internet boundary is always available so everything on the left side of the boundary can access it.
Let’s consider for a moment that we want all the data stored in one single location so we can get a holistic view of the world. Easy enough, right? Just write it all to one database on a server in a datacenter somewhere near the office that has public internet access. We want this server to always be available. That means 100% up-time for a single server. I’ll just go right out and say it: that’s impossible. Consider some of the more mundane things that can cause a server to go offline. We deal with these things every day and because of that there’s always something complaining that it can’t connect.
- Datacenter network failure
- Server updates requiring a reboot
- Power failure
- Hardware failure
Now, imagine we’ve got about a thousand agents checking in every minute. Assume 5 minutes of downtime for just one of these events and we’re talking 4,000-5,000 data points lost. That’s a low assumption though. Hardware failure might mean the server is out of commission for hours or days. And don’t forget: hardware reliability is measured in mean time to failure, so odds are pretty good something will fail at some point.
Let’s continue. Consider what happens when we start adding more agents. We’re talking global coverage, so a thousand agents probably aren’t going to cut it. We’ll need a few more.
I make no guarantees my math is accurate.
We can take a WAG at the number of agents we’ll need based on a few existing data sets provided by various agencies. The US EPA has their own monitoring network of 130 agents. The US is about 3,805,927 mi² in area. Assuming each monitoring agent is spaced out evenly (they aren’t) we see an average coverage area of 29,276 mi² for each agent. They also check in once every hour. The resolution of the data isn’t great. It’s useless to anyone more than a few miles away from an agent, and nobody would know something serious has happened within enough time to do something about it.
Conversely the USGS NURE program over a period of years measured the radiation levels in great detail by collecting hundreds of thousands of samples across the country, but they were for a single point in time. The resolution is fantastic, but it’s not real-time. We want something in between. Let’s try and get a more concrete number.
According to the 2007 census there were 39,044 local governments (let’s call them cities for simplicity) in the US. Let’s assume we want 4 agents per city. As of July 1st, 2013 there were 295 cities (again, for simplicity) with a population greater than 100,000 people. Let’s also assume we want 10 agents for each one of those larger cities. (Remember, it’s a WAG.)
[(39,044 − 295) × 4] + (295 × 10) = 154,996 + 2,950 = 157,946
So that means based on our very rough estimates we need 157,946 agents for just the United States. Each one of those agents would check in every minute, so that means we’re talking 227,442,240 check-ins every single day. Now extrapolate for the rest of the world.
Let’s assume we need 22 times as many agents (world population / US population). That’s 3,474,812 agents, or 5 billion check-ins a day.
This would of course require more analysis. We don’t necessarily want to deploy agents based solely on population density because there would be significant fluctuations in the number of agents per city vs rural areas. The numbers are good enough for our fake model though.
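If you’d rather check the WAG than trust my arithmetic, the estimate fits in a few lines. The constant names are mine; the numbers are the ones used above.

```python
LOCAL_GOVERNMENTS = 39_044    # "cities", per the 2007 figure above
LARGE_CITIES = 295            # population over 100,000
AGENTS_PER_CITY = 4
AGENTS_PER_LARGE_CITY = 10
CHECKINS_PER_DAY = 24 * 60    # one check-in per minute
WORLD_TO_US_RATIO = 22        # world population / US population, rounded

us_agents = (
    (LOCAL_GOVERNMENTS - LARGE_CITIES) * AGENTS_PER_CITY
    + LARGE_CITIES * AGENTS_PER_LARGE_CITY
)
us_checkins = us_agents * CHECKINS_PER_DAY
world_agents = us_agents * WORLD_TO_US_RATIO
world_checkins = world_agents * CHECKINS_PER_DAY

print(f"{us_agents:,}")       # 157,946
print(f"{us_checkins:,}")     # 227,442,240
print(f"{world_agents:,}")    # 3,474,812
print(f"{world_checkins:,}")  # 5,003,729,280
```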
Now, back to our server. We didn’t consider load as a risk, but now we need to, so we add a few more servers. We’ll revisit this in a second. What else might affect the availability of our servers?
Regional Infrastructure Failures
Suppose the internet backbone providing service to our datacenter craps out. What happens if a hurricane comes through and floods the datacenter? What happens if a trigger-happy organization nukes the general area? We need to assume our datacenter will simply cease to exist one day without warning. This is conceptually easy enough to solve for — add more datacenters. If we need more datacenters, we therefore need datacenters spread far enough apart that whatever took out one datacenter won’t take out another.
Determining the best place for a datacenter is not an easy task. There are lots of things to take into consideration. A short list might look something like this:
- Year-round weather conditions
- Access to internet backbones
- Geopolitical stability of the region
- Local economic stability affecting cost of operations
- Relative distance to other datacenters
A few of these are easy enough to reconcile. A map of the internet shows that there are plenty of hubs where we could stick a datacenter.
We could overlay such maps, find cities where there’s usually nice weather, investigate the geopolitical and economic stability of each, and pick a bunch of candidates.
Geopolitical stability is a tricky thing to measure. For instance, Canada is stable, and predictably so for the foreseeable future. It’s highly unlikely the government would boot us out of the country or destroy our datacenter. However, if nuclear war struck, it might not be wise to build a datacenter near Ottawa because the city would be a prime target for attack.
Then again, winter storms in Ottawa suck so I wouldn’t want to build a datacenter there anyway.
Once we have our candidates we need to consider the desired distribution throughout the world. Since we’ve got agents all around the world, we need a pretty even distribution of these datacenters. It seems logical that we’d want agents to check in to the nearest datacenter, but in reality we don’t necessarily care where the agent checks in. What we care about is that if a datacenter is taken out of commission, no others are also taken out, and that the agents can fail over to other datacenters. What this does mean though is that if an agent checks into one datacenter then the submitted data needs to be replicated immediately to all the other datacenters before it can be considered committed.
Actually, it doesn’t require immediate replication to all datacenters at once; it might just need replication to a quorum of datacenters before the data is considered committed.
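Here’s a rough sketch of what "committed once a quorum has it" could look like. I’m assuming quorum means a simple majority, and the names (`quorum_size`, `is_committed`, the `dc-N` labels) are hypothetical; a real replication protocol has a lot more moving parts.

```python
def quorum_size(datacenter_count: int) -> int:
    """Smallest number of datacenters that forms a strict majority."""
    return datacenter_count // 2 + 1


def is_committed(acks: set[str], datacenters: set[str]) -> bool:
    """A check-in counts as committed once a majority has acknowledged it."""
    valid_acks = acks & datacenters  # ignore acks from unknown sources
    return len(valid_acks) >= quorum_size(len(datacenters))


datacenters = {"dc-1", "dc-2", "dc-3", "dc-4", "dc-5"}  # hypothetical names
print(is_committed({"dc-1", "dc-2"}, datacenters))          # False: 2 of 5
print(is_committed({"dc-1", "dc-2", "dc-4"}, datacenters))  # True: 3 of 5
```

The appeal of a majority is that any two quorums overlap in at least one datacenter, so a committed check-in can’t be silently forgotten by the next quorum.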
Another thing to consider is the total number of datacenters required. If one goes down the rest have to pick up the slack. That means each datacenter cannot be running at full capacity. For the sake of argument let’s say we choose to build 16 datacenters. I’ve picked this number because that’s the number of Microsoft Azure datacenters worldwide. They’ve already done their due diligence picking good locations (and I already have a pretty map of it).
Assuming a roughly even distribution of load per datacenter, that’s approximately 312,500,000 check-ins a day per datacenter. Assuming total nuclear war (that’s a phrase I never thought I’d use) and we lose all but one datacenter, that one datacenter needs to absorb the load of the other 15, which is another 4.7 billion check-ins a day on top of its own.
Practically speaking if 15 datacenters went offline suddenly I suspect quite a few agents would also go offline, so the total number of check-ins would decrease considerably.
What does that mean for the size of a single datacenter then? We haven’t really looked at that yet. If we based our assumptions on the diagram at the beginning we could conceivably have 1-4 servers. One for each element, or one for all elements. It would be really nice if it took more than a single server going down to take out a datacenter, so we need at least a set of servers.
Let’s assume we’re working with commodity hardware, and each web server can handle an arbitrary load of 5000 requests a second. Now, since our agents have a fixed schedule of checking in every minute, we can easily figure out there would be around 217,000 requests a minute or 3616 a second, with a compensation of some sort for time skew of each agent. A single server could handle the load fairly easily at 72% capacity, and doubling the server count for redundancy means we’re looking at only ~36% capacity (admittedly, there’s a bit more involved so it’s not simply half, but we’re keeping the math simple).
That’s actually surprising to me. I assumed we’d need a few servers to handle the load. Of course, this isn’t taking into account the overhead of persisting and replicating the data across datacenters.
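For the skeptical, the per-datacenter load works out like this. The constant names are mine, and I’m using the rounded 5 billion figure, which is where the 3616 comes from, give or take rounding.

```python
CHECKINS_PER_DAY = 5_000_000_000  # rounded worldwide figure from above
DATACENTERS = 16
SERVER_CAPACITY = 5_000           # requests/second, the arbitrary figure above

per_dc_per_day = CHECKINS_PER_DAY / DATACENTERS
per_dc_per_minute = per_dc_per_day / (24 * 60)
per_dc_per_second = per_dc_per_minute / 60

print(f"{per_dc_per_day:,.0f} per day")        # 312,500,000 per day
print(f"{per_dc_per_minute:,.0f} per minute")  # 217,014 per minute
print(f"{per_dc_per_second:,.0f} per second")  # 3,617 per second

for servers in (1, 2):
    utilization = per_dc_per_second / (servers * SERVER_CAPACITY)
    print(f"{servers} server(s): {utilization:.0%}")  # 72%, then 36%
```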
But then we’ve got the analysis side to consider. We don’t know how many people want to review the data, and report generation on millions of data points isn’t computationally cheap. For now we can assume we only want to report on significant changes, so the computational overhead probably isn’t a lot. We can dedicate one server to that.
So let’s assume each datacenter looks the same with the following servers:
- 2 agent check-in web servers
- 2 database servers
- 1 compute server
Our connection to the internet should be redundant as well.
Therefore overall we’ve got 32 servers for agents checking in, 32 database servers, and 16 compute servers throughout the world, communicating across 32 internet backbones. In contrast, Microsoft has at least a million servers worldwide in all of its datacenters for all of its services, and Netflix uses 500-1000 servers just for its front-end services.
Of course, we’re forgetting the total nuclear war scenario again.
Our datacenters currently have a maximum load of 10,000 requests a second, so that means we could take on the load of a smidge more than one other datacenter. That’s actually pretty good for most scenarios, but not so good for ours because we still need 8 times the capacity in a single datacenter for our worst case scenario. If we keep the math simple and just multiply the current number of servers by 8, we get 640 servers worldwide. It actually might be lower because we probably wouldn’t need that many database and compute servers in one location. Additionally we don’t need all those servers running at once. We only need them when another datacenter goes offline.
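One way to sanity-check a multiplier like 8 is sketched below. Dividing the worst-case load by a single datacenter’s capacity gives a bit under 6; padding that to keep the same ~72% utilization as before lands at about 8. The headroom target is an assumption made for this sketch, and the constant names are hypothetical.

```python
WORLD_CHECKINS_PER_DAY = 5_000_000_000  # rounded worldwide figure from earlier
DATACENTER_CAPACITY = 2 * 5_000         # two web servers at 5,000 requests/second
SERVERS_WORLDWIDE = 80                  # 32 web + 32 database + 16 compute
TARGET_UTILIZATION = 0.72               # assumption: keep the earlier headroom

worst_case_rps = WORLD_CHECKINS_PER_DAY / 86_400
raw_multiple = worst_case_rps / DATACENTER_CAPACITY
padded_multiple = raw_multiple / TARGET_UTILIZATION

print(f"{worst_case_rps:,.0f} requests/second")     # 57,870 requests/second
print(f"{raw_multiple:.1f}x raw")                   # 5.8x raw
print(f"{padded_multiple:.1f}x with headroom")      # 8.0x with headroom
print(SERVERS_WORLDWIDE * round(padded_multiple))   # 640
```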
I think we’ve covered off enough for now to keep our servers online and our data available in case of catastrophe. What about risk to the agents?
A Brief Look at Our Agent
The agents are the lifeblood of our network. Without them we’d be dead (in the water). They provide a global health check by measuring the levels of radiation in their immediate areas and reporting their findings back to a central location for analysis. The term radiation is a bit vague though. It’s simply defined as the emission of energy through something. Radio waves and light are forms of radiation and they are, of course, not harmful.
What we really want to measure is ionizing radiation, which is an energy emission powerful enough to create ions by kicking electrons loose from nearby atoms. There are two types of ionizing radiation: wave-based (electromagnetic) and particle-based.
Electromagnetic radiation becomes problematic when its photons carry enough energy to ionize atoms (somewhere between 10 eV and 33 eV, which falls in the ultraviolet part of the spectrum). Wave-based ionizing radiation occurs in two forms we care about: X-rays and gamma rays. For our purposes the two differ mainly in wavelength, with gamma rays being considerably shorter.
Figure 2: Electromagnetic Radiation Spectrum (source: Wikipedia)
Particle-based radiation occurs in three forms: alpha, beta, and neutron radiation. Alpha radiation is a bastardized form of helium, beta radiation is a wild electron, and neutron radiation is, well, a wild neutron. These particles become problematic when they have very high momentum.
Ionizing radiation can have an effect on nearby atoms because it has enough energy to cause the atoms around it to dislodge electrons. Suffice to say the atoms don’t like that because it can mean breaking apart existing chemical bonds, or causing new bonds to form. As you can imagine that might not go so well in living organisms.
Monitoring ionizing radiation levels can be done by various types of sensors like Geiger counters, proportional counters, or scintillators by watching changes in voltage (or luminescence in the case of scintillators) caused by ionizing events. These sensors can monitor most forms of ionizing radiation, but their use cases depend on what exactly you’re wanting to measure. Geiger counters are pretty good at measuring overall counts of ionizing events, but they don’t necessarily distinguish between forms of radiation. Conversely, proportional counters are capable of making distinctions.
Now, what we’re wanting to do is measure ionizing radiation within the bounds of it having an effect on human health. That means we’re actually wanting to measure dose rate, not just a simple count of events. That means we also don’t necessarily care what kind of radiation is present, just that radiation is present.
That’s actually not true. If something serious occurred that produced large amounts of radiation, it would be useful to know what kind was present so we know how to protect ourselves.
For instance, a high level of alpha radiation is manageable because it can’t pass through the thin layer of dead skin on our bodies, but we need to be careful because particles emitting alpha radiation can be inhaled or swallowed.
Beta radiation is trickier to manage because it can pass through thicker materials, and in doing so emit x-rays if it’s the wrong type of material.
Gamma radiation is even trickier to manage because it can easily pass through thick layers of shielding without much effort.
Lastly, neutron radiation is the hardest to manage because it’s the most powerful. It can radiate through thick concrete or metal, and it can actually cause other materials to become irradiated through neutron activation. Thankfully it’s also extraordinarily rare, and only found in the middle of nuclear reactors (or explosions). It’s a safe bet we don’t need to monitor for this, but we can monitor it indirectly because the byproduct of neutron radiation is other forms of radiation.
The effects radiation has on health are measured through effective dose over a period of time. Large doses over short periods can harm us, but so can low doses over longer periods. We can see this in examples of exposure:
Figure 3: Dose rate affecting health
Given this, we can use a Geiger counter acting as a dosimeter for the basis of the sensor in our agent. The agent would poll the Geiger counter and send the results back to the servers for analysis.
Threats to Agent Availability
Earlier we estimated that we would need 3,474,812 agents distributed globally to create our network. They will live in areas with any number of adverse conditions so we need to consider what could interfere with the operations of an agent.
Environmental conditions are probably the most obvious thing to think about. We have to worry about serious weather conditions like torrential rain, extreme heat, freezing, dust, etc. There are also other things to consider like natural disasters causing flooding or fire, or stranger things like wild animals making nests in or around the agents.
We can compensate for these conditions by building durable enclosures that are weatherproofed for the intended climates.
Not surprisingly we’ve also got to consider the condition that the internet or power will disappear from time to time.
Figure 4: Reliable internet, isn’t.
As it happens I’m all too familiar with this condition, though sometimes my preference for resolution doesn’t help anything.
Figure 5: A preferred solution
Losing power is potentially an easy thing to solve for in the short term as we can power our agents by battery. If we design it right, we can drive our agents on less than 5V @ 500mA power (a figure gathered from the design of the uRADMonitor), or 2.5W. For comparison, the lithium-ion battery in my laptop has the capacity for 100Wh. That means it can drive 100 watts for an hour, or 1 watt for 100 hours (or in the case of my laptop, 3-4 hours on a good day). Therefore it can power our agent for 40 hours before needing a recharge.
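The battery math is simple enough to put in a function. This is a back-of-the-envelope sketch that ignores conversion losses and battery aging, and `runtime_hours` is just a name I picked.

```python
def runtime_hours(battery_wh: float, volts: float, amps: float) -> float:
    """Hours a battery lasts at a constant draw, ignoring conversion losses."""
    watts = volts * amps
    return battery_wh / watts


# A 100 Wh laptop battery driving an agent that draws 5 V at 500 mA (2.5 W).
print(runtime_hours(100, 5, 0.5))  # 40.0
```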
If we’re concerned with running our agents off grid we could potentially power them with solar panels too. In our case the required power is fairly low so we could conceivably use a small panel to run the agent and charge a battery during the day, and let the battery power the agent at night.
Getting back to the internet connection though, that might be a bit trickier. Populated areas often have connections to the internet of some sort, whether it be wired, cellular, or satellite. Our agent needs to support one of these methods, otherwise it’s not phoning home. These connections aren’t necessarily the most stable though, as I mentioned above. We can compensate by requiring our agents to have redundant connections to the internet, or even create wireless mesh networks between each agent. That would allow agents to redirect connections to other agents and piggy-back off their active connection.
We are forgetting one thing though. What happens in the nuclear scenario? Not surprisingly, we really kind of need our agents to withstand the effects they’re trying to measure. The absolute worst case scenario is that a bomb is dropped right on top of our agent. The short answer is that we can’t do much about that. It will be obliterated in microseconds. The second worst-case scenario on the other hand is slightly more workable, and that’s if a bomb is dropped just outside the blast range from our agent.
Figure 6: This is exactly how I picture our bomb being dropped.
Within seconds our agent will be awash in different types of radiation. This actually has a curious effect on electronics. The blast will cause an electromagnetic pulse (EMP) of radiation that will interact with the metal conductors in our electronic circuits and force electrons to start flowing causing spikes in current. If there’s enough current in those spikes our electronics will be fried. Best case the spikes just affect data by doing things like altering values being read from or written to memory.
That’s nothing compared to what happens if our agent starts getting bombarded with particle radiation though. The particles will actually knock loose atoms from the silicon chips as they skip across the substrates, kind of like a boulder taking out chunks of the ground as it careens down a hill. That of course can cause very strange behaviors in the chips rendering them inoperable.
There are things we can do to protect our agents from this sort of thing of course. Best case is that we can use rad-hardened components for everything, but there probably aren’t rad-hardened versions of every component we’d need to use, and we need to be careful because they are only guaranteed to certain levels of radiation. That won’t necessarily solve all of our problems though. We’d need to build properly shielded enclosures to protect any exposed components as well as shield all wiring, since it would act as an antenna for the EMP. We might also consider doubling up on circuit components and running them in parallel so we can have fault tolerance. This is common practice in hostile environments for devices like satellites in space.
You can use a voting system to determine whether your data has been altered. Two processors compute a value and compare it to each other’s value. If both agree then you can be reasonably certain it’s fine, but if they disagree then you get them to restart the process until you’re certain things are kosher, or you shut down the processor and wait for things to blow over.
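In software terms the compare-and-retry idea looks roughly like this. It’s a toy simulation with plain functions standing in for the two processors; `redundant_compute` and its retry limit are assumptions for illustration, not how any particular flight computer does it.

```python
from typing import Callable, Optional


def redundant_compute(
    primary: Callable[[], int],
    secondary: Callable[[], int],
    max_attempts: int = 3,
) -> Optional[int]:
    """Run the same computation on two units; accept only when they agree."""
    for _ in range(max_attempts):
        a, b = primary(), secondary()
        if a == b:
            return a
    return None  # never agreed: shut down and wait for things to blow over


# Simulate one unit whose first result has a flipped bit.
readings = iter([42 ^ 0b100, 42])
flaky = lambda: next(readings)
steady = lambda: 42
print(redundant_compute(flaky, steady))  # 42, agreed on the second attempt
```

With only two units you can detect a disagreement but not tell which side is wrong, which is why the fallback is to retry or stop rather than pick a winner.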
It’s reasonable to assume the blast would have taken out our connection to the internet. The internet itself was never actually intended to withstand nuclear attacks, as some myths might lead you to believe. The underlying protocol, TCP, was built to withstand quite a number of failure modes, and designers of networks that could withstand nuclear war co-opted the protocol in their own designs. The internet as a whole was however designed to be extraordinarily resilient to failure, so odds are pretty good only our little corner of the internet was taken out.
Satellites flying above probably wouldn’t be affected by the blast (maybe, I’m assuming). Our backup connection could be satellite-based. Once our agent recognizes its primary connection is toast it can just switch modes... eventually. One side effect of a nuclear blast is that it ionizes the air around it so much that radio communications can’t operate for a number of hours afterward. That means our mesh network wouldn’t work either.
Our agent would unfortunately be stranded there isolated from the rest of the world for a few hours before it can start phoning home. There isn’t a lot we can do at this point. Our agent needs to wait it out. Therefore, we need our agents to be smart enough to queue up measurements for an indefinite period of time.
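A store-and-forward queue along these lines would do the job. This is a minimal in-memory sketch: the class name, the `send` callback, and the size cap are all assumptions (real agent storage is finite, so something has to give eventually), and a real agent would want to persist the queue across reboots.

```python
from collections import deque
from typing import Callable


class MeasurementQueue:
    """Buffers readings while offline and flushes them, oldest first."""

    def __init__(self, max_items: int = 100_000) -> None:
        # Assumption: storage is finite, so the oldest readings drop when full.
        self._pending: deque = deque(maxlen=max_items)

    def record(self, timestamp: int, cpm: int) -> None:
        self._pending.append((timestamp, cpm))

    def flush(self, send: Callable[[tuple], bool]) -> int:
        """Send what we can; stop at the first failure and keep the rest."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


queue = MeasurementQueue()
for minute in range(3):
    queue.record(minute * 60, 20 + minute)
print(queue.flush(lambda reading: False))  # 0: link still down, nothing lost
print(queue.flush(lambda reading: True))   # 3: link is back
```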
If we’ve designed and built our agents properly we can now begin to start collecting data.
Another Brief Look (at our Data)
The data collected by our agents drives this whole project. At worst, we need the raw counter data from the Geiger counter, otherwise this whole project is a non-starter. At best, we can collect other pieces of data that provide better insight into the surrounding environment. The bigger picture is often important when doing remote monitoring so we should consider our requirements carefully. For the sake of argument let’s say we want to collect the following pieces of data:
- Counts per minute
- Barometric pressure
These allow us to measure the current levels of radiation while also providing us weather conditions which can provide insight during investigations into radiation spikes or patterns of fluctuations.
This data should be collected every minute and shipped off to one of our servers. What kinds of risks do we need to consider here?
Threats to Data Accuracy
At first glance you might think that there aren’t a lot of threats to our data. I mean, it’s just raw numbers that get collected and sent to a bunch of servers, right? Big whoop. Consider what happens when someone starts fiddling with that data though, unbeknownst to anyone.
Part of our project is to alert people when we see dangerous levels of radiation. Creating positive trends can create false alarms and cause undue panic. Creating negative trends can mask dangerous radiation levels and prevent necessary alerts from being dispatched.
These threats can manifest in a number of different ways.
In a perfect world only our agents would try and talk with our servers. Unfortunately, that’s not the case and we have to deal with rogues trying to talk to our servers without our permission. We can simplify this into an idea of a good agent (ours) and evil agents (theirs). Evil agents can be any number of things, but the gist is that they’re conceptually emulating the behavior of a real agent, but to do evil things.
As such, we should only be accepting data from trusted (good) agents. That means we need to authenticate all requests coming from all agents. Each agent should be known with a given ID and should have to provide proof they’re really them. There are quite a number of ways we can do this, but they all mostly boil down into two methods: shared key (symmetric key, e.g. passwords) or public key (asymmetric, e.g. certificates). Authentication is the idea that only our trusted agent has a known key, and they do something to prove they have that key.
The easiest thing to do is to just send the key with every request like you do with passwords, but that means anyone listening in can snatch up that key and impersonate our trusted agent. We need to use a function that creates a value based on our key, but doesn’t contain the key. In this case, we want to generate a signature.
Our idea of authentication works by creating a body of data, running a hash function over the data, encrypting the hash against a key (technically you would be keying the hash if it’s a symmetric key), and attaching the encrypted value (our signature) to the end of the data, thus creating a token. You can validate the token by following the exact same process and comparing the results. If you don’t have the right key then your signature won’t match, or if the data has been modified then the hash won’t match.
Most tokens also contain additional information besides just an identifier like an issuance and expiry timestamp showing that you can only use the token for a limited period of time.
Once we know how we’re going to authenticate our agents we need to pick the type of key to use.
Shared keys are small, lightweight, and orders of magnitude faster to use. They are however significantly more difficult to protect because every party to the authentication process needs a copy of the whole key. This issue is compounded by the fact that you need to transfer those keys between each party at least once. Transferring keys is never a good thing to do either. What about keeping keys synchronized across all datacenters? How do you protect the key in transit? You can encrypt it, but against what key? You could encrypt that key against another key and repeat ad infinitum. Referring to our process from earlier we would see something like this:
# Client
id = "TK-421"
key = "12345"
signature = hash(id, key) = ".abc"
token = id + signature = "TK-421.abc"
# Server (holds its own copy of the key)
key = "12345"
signature = hash(id, key) = ".abc"
token = id + signature = "TK-421.abc"
# Compare the token received with the token recomputed
"TK-421.abc" == "TK-421.abc" => authenticated!
See how both sides need the key? The comparison would fail if the keys were different because the hash would produce a different result.
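The shared-key exchange above can be sketched with nothing but Python’s standard library. This is a minimal illustration, not a production design: the names `sign_token` and `verify_token` are made up for this post, HMAC-SHA256 stands in for the generic "keyed hash", and the key is the same joke value used in the example.

```python
import hashlib
import hmac


def sign_token(agent_id: str, key: bytes) -> str:
    """Build 'id.signature', where the signature is a keyed hash of the id."""
    mac = hmac.new(key, agent_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{agent_id}.{mac}"


def verify_token(token: str, key: bytes) -> bool:
    """Recompute the token with our own copy of the key and compare."""
    agent_id, _, _signature = token.rpartition(".")
    expected = sign_token(agent_id, key)
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))


shared_key = b"12345"  # placeholder from the example; far too weak for real use
token = sign_token("TK-421", shared_key)
print(verify_token(token, shared_key))    # same key on both sides
print(verify_token(token, b"wrong-key"))  # a different key fails the comparison
```

Both functions take the key as a parameter, which is exactly the weakness being described: whoever verifies can also sign.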
Alternatively you also have public keys, which are part of an asymmetric pair of keys. The idea is simple (ish). You can encrypt data against one key and decrypt the data with the other key, but you cannot encrypt and decrypt the same data with the same key. The benefit of using asymmetric keys is that you only need one of the keys, the public key, to authenticate a party. The party being authenticated only needs the other key, the private key, to authenticate themselves. That means the private key never needs to be transferred anywhere and can stay safely locked away.
The process is a bit more complex though.
# Client
id = "TK-421"
clientHash = hash(id)
signature = privateKey.encrypt(clientHash) = ".abc123"
token = id + signature = "TK-421.abc123"
# Server
hashDecrypted = pubKey.decrypt(signature)
serverHash = hash(id)
serverHash == hashDecrypted => authenticated!
The value of this method is significant because the private key used for authentication never leaves the client. The public key is available to anyone that wants to verify the token (hence the name), but that’s the only thing you can do with the public key. There are some negative aspects to using asymmetric keys though, and the biggest is performance. The signature is considerably larger than a symmetric signature, and the cost of signing data and verifying signatures is an order of magnitude higher.
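To make the asymmetric steps concrete, here is a toy sketch using textbook RSA arithmetic and only the standard library. Everything about it is an assumption made for illustration: the key is built from two small, publicly known Mersenne primes, there is no padding, and the function names are invented for this post. A real agent would use a vetted crypto library and a padded scheme (RSA-PSS, ECDSA, or similar).

```python
import hashlib

# Toy key pair. ILLUSTRATION ONLY: the primes are public knowledge, the
# modulus is far too small, and there is no signature padding.
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest(agent_id: str) -> int:
    """Hash the id and reduce it so it fits under the toy modulus."""
    raw = hashlib.sha256(agent_id.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign_with_private_key(agent_id: str) -> str:
    """Client side: 'encrypt' the hash with the private exponent."""
    signature = pow(digest(agent_id), D, N)
    return f"{agent_id}.{signature:x}"


def verify_with_public_key(token: str) -> bool:
    """Server side: 'decrypt' with the public exponent and compare hashes."""
    agent_id, _, sig_hex = token.rpartition(".")
    try:
        signature = int(sig_hex, 16)
    except ValueError:
        return False
    return pow(signature, E, N) == digest(agent_id)


rsa_token = sign_with_private_key("TK-421")
print(verify_with_public_key(rsa_token))
```

Note that `verify_with_public_key` never touches `D`. The server only needs `N` and `E`, which is the whole point.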
Consider some measurements I made comparing symmetric vs asymmetric keys for some unrelated benchmarking. These were made using an off-the-shelf JWT library instead of using the raw crypto primitives, but the comparison is still valid because it’s the same basic process. I measured this data by performing 10,000 iterations of generating a token and then validating the token.
These are my results.
| Algorithm | Type | Average Signing Time (ms) | Average Validation Time (ms) | Token Size (chars) | Key Strength (bits) |
This is just a subset of the results, but as you can see the differences are obvious. You can get better results by choosing different algorithms, but symmetric keys will always tend to be faster. You have to balance the risk of storing keys in multiple places against system performance.
Let’s assume that we’ve picked asymmetric keys for authenticating our agents. We need to store the private key in such a way that the agent itself can get access to it when it needs to, but also so the keys can’t be stolen by anyone prying open the enclosure and picking at the components.
Large centralized environments like datacenters can benefit from devices called Hardware Security Modules (HSM). They are designed to protect keys by embedding them in physical memory and offloading the crypto work required of the keys from dependent servers, that way the keys never leave the safety of the HSM. The controlled environment allows for more complex protection mechanisms as well that will zero out or destroy the memory if signs of device tampering are detected.
It’s impractical to use an HSM for every single agent out there though, so we need to look at something a bit more portable. Most laptops these days actually have a similar device built into their motherboards called a Trusted Platform Module (TPM). They are used to store sensitive keys in a similar manner. Alternatively there are also manufactured chips dedicated to protecting keys that are compact versions of HSMs too. We should consider storing our keys in a similar manner.
We can now start sending data to our servers and can be fairly confident the data is coming from an agent we trust, almost. If we look back at our authentication check above the only thing we’re authenticating is the ID of the agent, not whether the data has been modified. We trust it’s been sent by someone we know, but if we consider that our data is going to be flowing over the internet, that means any number of devices will sit between our agent and our servers. Any one of those devices can twiddle bits maliciously or inadvertently, and our servers would have no idea something has changed.
What we need is proof of authenticity of the data we’re sending. We can do this easily enough by just sticking the data in the token with our device ID, or we can use a similar mechanism and sign the data with another key. The signature now provides evidence the data hasn’t been modified, but what happens if someone keeps sending the same signed data over and over again? It might mean our servers collect hundreds or thousands or millions of spurious records that influence data analysis, or mask alarming data.
We can use a value called a nonce to detect duplicate submissions. It’s just a non-repeating counter in the signed data packet that increments every time the agent checks in. The server is already storing the data from the check-in, so all it needs to do is grab the last checked-in value and verify it’s less than the new value. Alternatively you can use a timestamp as well. If the value has been seen already then discard the request and drop the connection.
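The counter check is small enough to sketch in a few lines. The `ReplayGuard` class and its in-memory dictionary are hypothetical stand-ins for whatever the server already stores about each agent, and the check is assumed to run only after the signature has been verified.

```python
class ReplayGuard:
    """Remember the highest counter accepted from each device."""

    def __init__(self):
        self._last_seen = {}  # device id -> last accepted counter

    def accept(self, device_id: str, counter: int) -> bool:
        """Return True only if this counter is newer than anything seen so far."""
        last = self._last_seen.get(device_id, -1)
        if counter <= last:
            return False  # duplicate or replayed check-in: discard it
        self._last_seen[device_id] = counter
        return True


guard = ReplayGuard()
first = guard.accept("TK-421", 1)    # new counter value
replay = guard.accept("TK-421", 1)   # the same packet sent again
later = guard.accept("TK-421", 2)    # the next legitimate check-in
```

Counters are tracked per device, so one agent’s check-ins never invalidate another’s.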
By this point we have a pretty good idea of what we actually need to send to the server. We also know how JWTs perform, so why not use one for our check in?
We see our device ID (‘did’), the nonce (‘jti’ – JWT ID), the token expiry (‘exp’), and all our data. Running that through our signing process will provide us a small blob of data we can send to the servers.
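A check-in token along those lines might be built like the sketch below. Several things here are assumptions for illustration: the `cpm` (counts per minute) field is a made-up stand-in for "all our data", the function names are invented, and the token is signed with HS256 (a shared key) purely so the example runs on the standard library, even though we settled on asymmetric keys above. A real agent would use a JWT library with an asymmetric algorithm.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_checkin(device_id, counter, cpm, key, ttl=300, now=None):
    """Build a signed token carrying the device id, nonce, expiry, and data."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"did": device_id, "jti": str(counter), "exp": now + ttl, "cpm": cpm}
    signing_input = ".".join(
        b64url(json.dumps(part).encode("utf-8")) for part in (header, claims)
    )
    sig = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def read_checkin(token, key, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = int(time.time()) if now is None else now
    try:
        signing_input, sig = token.rsplit(".", 1)
        expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig)):
            return None
        claims = json.loads(b64url_decode(signing_input.split(".")[1]))
    except (ValueError, IndexError):
        return None  # malformed token
    if claims.get("exp", 0) <= now:
        return None  # expired
    return claims


demo_key = b"demo-key-not-for-production"
checkin = make_checkin("TK-421", 7, 12, demo_key, now=1000)
claims = read_checkin(checkin, demo_key, now=1100)
```

Because the data sits inside the signed portion, changing a single reading in transit invalidates the whole token.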
Lastly, we need to make sure our agent is sending data to our servers and not someone else’s. This is possibly one of the easiest solutions: use SSL/TLS. We don’t even need a certificate signed by a CA either. We just need each agent to know what the expected certificate is, and validate the provided certificate matches. This is called certificate pinning. It’s magical.
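Pinning can be sketched with Python’s `ssl` module. This is a rough outline under a few assumptions: the pin is the SHA-256 fingerprint of the server’s DER-encoded certificate, the helper names are invented for this post, and normal CA validation is switched off on purpose because the pinned fingerprint is the only thing we trust.

```python
import hashlib
import hmac
import socket
import ssl


def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_cert).hexdigest()


def matches_pin(der_cert: bytes, pinned_hex: str) -> bool:
    actual = fingerprint(der_cert).encode("ascii")
    return hmac.compare_digest(actual, pinned_hex.lower().encode("utf-8"))


def connect_pinned(host: str, port: int, pinned_hex: str) -> ssl.SSLSocket:
    """Open a TLS connection and refuse it unless the certificate matches the pin."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False       # must be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # no CA chain; the pin is our trust anchor
    raw = socket.create_connection((host, port), timeout=10)
    try:
        tls = ctx.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
    der = tls.getpeercert(binary_form=True)
    if der is None or not matches_pin(der, pinned_hex):
        tls.close()
        raise ssl.SSLError("server certificate does not match the pinned fingerprint")
    return tls


sample_cert = b"stand-in bytes, not a real certificate"
sample_pin = fingerprint(sample_cert)
```

The agent would ship with `pinned_hex` baked in, which also means rotating the server certificate requires a plan for updating every agent.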
A side effect of using SSL/TLS is that we also get encrypted communications for free, meaning nobody can see the raw data as it’s transiting the internet.
This would obviously be useful if we were sending sensitive data in our check-ins, but presently we’re not doing that. There are some subtle operational protections we get out of this though: it blocks wholesale replaying of check-ins because attackers just can’t see the data. They’d have to guess at the structure of the data.
Malicious Actors (everything else)
Thankfully machines have not become sentient and still require evil people to do bad things. With any luck this’ll be true for quite some time.
I do however fear the day my Xbox realizes it’s had enough of me repeatedly swearing at it.
Previously we talked about the idea of a malicious agent, but there really is no such thing as a malicious agent because it still requires an actor controlling the agent. In the previous section it was implied someone was forcing an agent to do bad things. Everything else is getting categorized here.
What happens if someone disables or destroys an agent? At some point we have to assume that agent is out of commission.
First of all, we should be monitoring the agent to make sure it’s checking in, and alert when it’s been offline for a given period of time or when it starts sending in strange data. Second, we need to make sure all the other agents around it can pick up the slack (sound familiar?) and can fill in enough information to cover the gap of missing data.
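The "has it gone quiet?" half of that monitoring is simple to sketch. The function name, the dictionary of last check-in times, and the five-minute threshold are all hypothetical; the right threshold depends on how often agents are expected to report.

```python
def find_silent_agents(last_checkin, now, max_silence):
    """Return ids of agents whose last check-in is older than max_silence seconds."""
    return sorted(
        agent_id
        for agent_id, seen_at in last_checkin.items()
        if now - seen_at > max_silence
    )


# agent id -> unix time of the most recent accepted check-in
last_checkin = {"TK-421": 1_000, "TK-422": 1_590}
overdue = find_silent_agents(last_checkin, now=1_600, max_silence=300)
print(overdue)
```

Anything returned here would feed an alert, and would also tell us which neighbouring agents need to cover the gap.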
In earlier sections we looked at distributing the databases across datacenters in case one or more datacenters was destroyed. This doesn’t solve all the problems though.
Suppose for a moment that someone was able to tamper with the data in the database. First of all, data should be immutable, meaning that once it’s received and stored in the database it should never be modified. We can be careful to block update statements to the database, or rely on database technologies that support read-only fields. We have to block delete statements as well.
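As one way to picture "block updates and deletes", here is a sketch using SQLite triggers from Python’s standard library. The table layout, column names, and function name are assumptions for illustration, and SQLite is only a convenient stand-in for whatever database the real system would use.

```python
import sqlite3


def open_append_only_db(path=":memory:"):
    """Create a readings table that accepts inserts but rejects changes."""
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS readings (
            device_id TEXT    NOT NULL,
            counter   INTEGER NOT NULL,
            cpm       INTEGER NOT NULL,
            PRIMARY KEY (device_id, counter)
        );
        CREATE TRIGGER IF NOT EXISTS readings_no_update
        BEFORE UPDATE ON readings
        BEGIN
            SELECT RAISE(ABORT, 'readings are immutable');
        END;
        CREATE TRIGGER IF NOT EXISTS readings_no_delete
        BEFORE DELETE ON readings
        BEGIN
            SELECT RAISE(ABORT, 'readings are immutable');
        END;
        """
    )
    return db


db = open_append_only_db()
db.execute("INSERT INTO readings VALUES ('TK-421', 1, 12)")
db.commit()
try:
    db.execute("UPDATE readings SET cpm = 0 WHERE device_id = 'TK-421'")
    update_blocked = False
except sqlite3.DatabaseError:
    update_blocked = True  # the trigger aborted the statement
```

Triggers only stop someone who can’t drop them, so this is a first line of defence and not a guarantee.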
Supposing someone was able to run an update or delete statement, we need auditing to track that those events occurred. Databases have transaction logs and triggers for this sort of thing, and we can track changes to data fairly easily, assuming the attacker doesn’t have rights to delete the log, or disable the triggers.
Assuming the attacker does have the ability to delete logs and disable triggers, we need a way to roll back the changes. Distributed database systems actually do help solve part of this problem because an attack on a single database might not replicate to any other databases. We can run consistency checks across each database and compare data. When things are out of whack we investigate. Unfortunately we can’t rely on this alone, so we need one last thing: backups.
All data should be backed up. All databases should be backed up. Everything should be backed up.
Backups are wonderful. Of course, backups need to be up to date. Stale backups are not wonderful. That means all data needs to be backed up regularly. Additionally, we need to validate that every backup 1) succeeded, and 2) hasn’t been tampered with itself. It’s a vicious rabbit hole we descend.
Validating that the backups haven’t been tampered with means knowing that they are in the same state as they were when they were created. We can rely on signatures (or something like a MAC specifically) to create a known good fingerprint of the backup, and then we can make copies of the fingerprint and give them to a whole bunch of people. You can be reasonably certain a backup is untouched if enough people say that fingerprint matches.
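The fingerprint-and-quorum idea might look something like this sketch. It assumes an HMAC-SHA256 over the backup’s bytes as the fingerprint, a made-up key, and hypothetical helper names; how the fingerprint copies get to "a whole bunch of people" is left out entirely.

```python
import hashlib
import hmac
import io


def backup_fingerprint(stream, key, chunk_size=1 << 20):
    """Compute an HMAC-SHA256 over a binary stream, reading it in chunks."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        mac.update(chunk)
    return mac.hexdigest()


def backup_is_trusted(stream, key, held_copies, quorum):
    """Trust the backup only if enough independently held fingerprints match."""
    actual = backup_fingerprint(stream, key).encode("ascii")
    votes = sum(
        hmac.compare_digest(actual, copy.encode("utf-8")) for copy in held_copies
    )
    return votes >= quorum


mac_key = b"backup-mac-key"           # hypothetical key, kept apart from the backups
backup_bytes = b"radiation readings dump"
known_good = backup_fingerprint(io.BytesIO(backup_bytes), mac_key)

# Three people hold a copy of the fingerprint; one copy has gone bad.
held = [known_good, known_good, "0" * 64]
trusted = backup_is_trusted(io.BytesIO(backup_bytes), mac_key, held, quorum=2)
```

In practice the stream would be the backup file opened in binary mode, and the quorum would be chosen based on how many fingerprint holders we’re willing to see compromised.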
Developing the Model
This model is incomplete. Partly because I haven’t gone through every single component bit by bit, but also because threat models are constantly changing, because threats are constantly changing. It’s important to understand that this model is living. It needs to be reviewed often enough to compensate for changes to the system that will inevitably occur when theory meets reality.
The point of this post was to go over the process involved in developing a threat model, not to develop an actual threat model. Hopefully a few things were learned along the way, but if nothing else I have one key lesson: it makes quite a bit more sense to start this whole process looking at the agent first.
You can’t build an accurate model that describes the threats to a system if you don’t understand the nature of the system you’re building.
I make no claim that this post contains everything you need to know about threat modeling. There are quite a few resources out there on the topic and I can’t list them all, but I can provide a starting point.