I’d say we glossed over the solution a bit, and we did that purposefully because investing a lot of effort into designing a system without understanding the risks is, well, a waste of time. We just need enough to get things moving. We have a good idea of what we want to build for now, so lets try and understand some of the risks. Our initial design looks something like this:
When we consider the risk of something, we consider the chance that thing might occur. We need to weigh the chance of that thing occurring with the effect it would have on the system. We then compare the cost of the effect with the cost of mitigating the effect and pick whichever one is lower. We generally want the lowest possible level of risk achievable based on the previous formula. In our case it’s actually pretty easy to work out because anything that disrupts the capabilities of the network is considered high risk. Let me see if I can explain why.
The idea of having a distributed network of things connecting back to a central server is not novel. The fairly detailed diagram above could easily be confused with a thousand other diagrams for a thousand other systems doing a thousand different things. There are a lot of commonly considered risks related to this and they can probably be summed up into these properties:
- Server availability – Is the server online and accepting requests whenever an agent wants to check in, or when an analyst wants to review the data?
- Agent availability – Is an agent online, able to measure the things it needs to measure, and have a connection back to the server to check in?
- Data accuracy – Do agents check in data that is measured properly and hasn’t been fiddled with?
This is a pretty broad but reasonable assessment for a number of systems. The risk is that any of these properties answer false because a failure in any one of these properties means the system can’t function as intended. We need to consider what sorts of things can affect these properties.
Here’s the thing: it needs to be available 100% of the time and absolutely accurate. It has the potential to influence the actions of many people regarding things that can kill them. This is one of those things that most security people really only dream about building (we have weird dreams). There can be serious consequences if we fail. Therefore we need to consider even the tiniest chance of something occurring that could affect those three properties and mitigate.
As system designers we need to go nuclear. (yes I know that was a bad pun. I’ll show myself out.)
In our case the system needs to withstand serious catastrophic events simply because it’s intended to monitor serious catastrophic events. If such an event occurs and the system can’t withstand the effects then we’ve failed and the whole point of the project can’t be realized. Conversely, it also needs to withstand all the usual threats that might exist for a run-of-the-mill distributed system.
Normally we would need to do an analysis on the cost of failure so we could get a baseline for how much we should invest, but this is my model and I’m an impossible person. The cost of failure is your life. (muahaha?)
Threats to Server Availability
Interestingly this property is easiest to review, relatively speaking. Our goal is to make sure that everything on the right side of that internet boundary is always available so everything on the left side of the boundary can access it.
Let’s consider for a moment that we want all the data stored in one single location so we can get a holistic view of the world. Easy enough, right? Just write it all to one database on a server in a datacenter somewhere near the office that has public internet access. We want this server to always be available. That means 100% up-time for a single server. I’ll just go right out and say it: that’s impossible. Consider some of the more mundane things that can cause a server to go offline. We deal with these things every day and because of that there’s always something complaining that it can’t connect.
- Datacenter network failure
- Server updates requiring a reboot
- Power failure
- Hardware failure
Now, imagine we’ve got thousands of agents checking in every minute. Assuming 5 minutes of downtime for just one of these events and we’re talking 4000-5000 points of data lost. That’s a low assumption though. Hardware failure might mean the server is out of commission for hours or days. And don’t forget: hardware is measured in mean time to failure — odds are pretty good something will fail at some point.
Let’s continue. Consider what happens we start adding more agents. We’re talking global coverage, so a thousand agents probably aren’t going to cut it. We’ll need a few more.
I make no guarantees my math is accurate.
We can take a WAG at the number of agents we’ll need based on a few existing data sets provided by various agencies. The US EPA has their own monitoring network of 130 agents. The US is about 3,805,927 mi² in area. Assuming each monitoring agent is spaced out evenly (they aren’t) we see an average coverage area of 29,276 mi² for each agent. They also check in once every hour. The resolution of the data isn’t great. It’s useless to anyone more than a few miles away from an agent, and nobody would know something serious has happened within enough time to do something about it.
Conversely the USGS NURE program over a period of years measured the radiation levels in great detail by collecting hundreds of thousands of samples across the country, but they were for a single point in time. The resolution is fantastic, but it’s not real-time. We want something in between. Let’s try and get a more concrete number.
According to 2007 census there were 39,044 local governments (lets call them cities for simplicity) in the US. Let’s assume we want 4 agents per city. As of July 1st, 2013 there were 295 cities (again, for simplicity) with a population greater than 100,000 people. Lets also assume we want 10 agents for each one of those larger cities. (Remember, its a WAG)
[(39,044 – 295) x 4] + (295 x 10) = 154,996 + 2950 = 157,946
So that means based on our very rough estimates we need 157,946 agents for just the United States. Each one of those agents would check in every minute, so that means we’re talking 227,442,240 check-ins every single day. Now extrapolate for the rest of the world.
Let’s assume we need 22 times as many agents (world population / US population). That’s 3,474,812 agents, or 5 billion check-ins a day.
This would of course require more analysis. We don’t necessarily want to deploy agents based solely on population density because there would be significant fluctuations in the number of agents per city vs rural areas. The numbers are good enough for our fake model though.
Now, back to our server. We didn’t consider load as a risk, but now we need to, so we add a few more servers. We’ll revisit this in a second. What else might affect the availability of our servers?
Regional Infrastructure Failures
Suppose the internet backbone providing service to our datacenter craps out. What happens if a hurricane comes through and floods the datacenter? What happens if a trigger-happy organization nukes the general area? We need to assume our datacenter will simply cease to exist one day without warning. This is conceptually easy enough to solve for — add more datacenters. If we need more datacenters, we therefore need datacenters spread far enough apart that whatever took out one datacenter won’t take out another.
Determining the best place for a datacenter is not an easy task. There are lots of things to take into consideration. A short list might look something like this:
- Year-round weather conditions
- Availability to the internet backbones
- Geopolitical stability of the region
- Local economic stability affecting cost of operations
- Relative distance to other datacenters
A few of these are easy enough to reconcile. A map of the internet shows that there are plenty of hubs where we could stick a datacenter.
We could overlay such maps, find cities where there’s usually nice weather, investigate the various required stability (geopolitical/economic/etc.), and pick a bunch of candidates.
Geopolitical stability is a tricky thing to measure. For instance, Canada is stable, and predictably so for the foreseeable future. It’s highly unlikely the government would boot us out of the country or destroy our datacenter. However, if nuclear war struck, it might not be wise to build a datacenter near Ottawa because the city would be a prime target for attack.
Then again, winter storms in Ottawa suck so I wouldn’t want to build a datacenter there anyway.
Once we have our candidates we need to consider the desired distribution throughout the world. Since we’ve got agents all around world, we need a pretty even distribution of these datacenters. It seems logical that we’d want agents to check in to the nearest datacenter, but in reality we don’t necessarily care where the agent checks in. What we care about is that if a datacenter is taken out of commission, no others are also taken out, and that the agents can fail over to other datacenters. What this does mean though is that if an agent checks into one datacenter then the submitted data needs to be replicated immediately to all the other datacenters before it can be considered committed.
Actually, it doesn’t require immediate replication to all datacenters at once; it might just need replication to a quorum of datacenters before the data is considered committed.
Another thing to consider is the total number of datacenters required. If one goes down the rest have to pick up the slack. That means each datacenter cannot be running at full capacity. For the sake of argument lets say we choose to build 16 datacenters. I’ve picked this number because that’s the number of Microsoft Azure datacenters worldwide. They’ve already done their due-diligence picking good locations (and I already have a pretty map of it).
Assuming a roughly even distribution of load per datacenter, that’s approximately 312,500,000 check-ins a day per datacenter. Assuming total nuclear war (that’s a phrase I never thought I’d use) and we lose all but one datacenter, that means that one datacenter needs to take on 15 times the load, which is 4.7 billion check-ins a day.
Practically speaking if 15 datacenters went offline suddenly I suspect quite a few agents would also go offline, so the total number of check-ins would decrease considerably.
What does that mean for the size of a single datacenter then? We haven’t really looked at that yet. If we based our assumptions on the diagram at the beginning we could conceivably have 1-4 servers. One for each element, or one for all elements. It would be really nice if it took more than a single server going down to take out a datacenter, so we need at least a set of servers.
Let’s assume we’re working with commodity hardware, and each web server can handle an arbitrary load of 5000 requests a second. Now, since our agents have a fixed schedule of checking in every minute, we can easily figure out there would be around 217,000 requests a minute or 3616 a second, with a compensation of some sort for time skew of each agent. A single server could handle the load fairly easily at 72% capacity, and doubling the server count for redundancy means we’re looking at only ~36% capacity (admittedly, there’s a bit more involved so its not simply half, but we’re keeping the math simple).
That’s actually surprising to me. I assumed we’d need a few servers to handle the load. Of course, this isn’t taking into account the overhead of persisting and replicating the data across datacenters.
But then we’ve got the analysis side to consider. We don’t know how many people want to review the data, and report generation on millions of data points isn’t computationally cheap. For now we can assume we only want to report on significant changes, so the computational overhead probably isn’t a lot. We can dedicate one server to that.
So lets assume each datacenter looks the same with the following servers:
- 2 agent check-in web servers
- 2 database servers
- 1 compute server
Our connection to the internet should be redundant as well.
Therefore overall we’ve got 32 servers for agents checking in, 32 database servers, and 16 compute servers throughout the world, communicating across 32 internet backbones. In contrast, Microsoft has at least a million servers worldwide in all of it’s datacenters for all of its services, and Netflix uses 500-1000 servers just for it’s front-end services.
Of course, we’re forgetting the total nuclear war scenario again.
Our datacenters currently have a maximum load of 10,000 requests a second, so that means we could take on the load of a smidge more than one other datacenter. That’s actually pretty good for most scenarios, but not so good for ours because we still need 8 times the capacity in a single datacenter for our worst case scenario. If we keep the math simple and just multiply the current number of servers by 8, we get 640 servers worldwide. It actually might be lower because we probably wouldn’t need that many database and compute servers in one location. Additionally we don’t need all those servers running at once. We only need them when another datacenter goes offline.
I think we’ve covered off enough for now to keep our servers online and our data available in case of catastrophe. What about risk to the agents?