How Azure AD Windows Sign-in Works
Let's talk Azure AD join and what that means to a Windows device. What's it mean to be joined to something?
— Steve Syfuhs (@SteveSyfuhs) September 22, 2020
Twitter warning: Like all good things this is mostly correct, with a few details fuzzier than others for reasons: a) details are hard on twitter; b) details are fudged for greater clarity; c) maybe I'm just dumb.
Back in the early days of the universe PCs were workgroup machines, meaning everything stayed local to it, or they were domain joined, meaning they belonged to a domain.
Local is pretty self explanatory. Domain join is where a Domain Controller dictated things such as authentication, authorization, policy, and what not. This allows for centralized management of two or more machines. Neat.
I've gone into great detail about how authentication works on domain join.
A useful model to think about is the idea of an authority. A local workgroup machine is its own authority, meaning the local machine stores the passwords and does the auth. Domain join has the Domain Controller as the authority, meaning it needs a DC to bless the logon.
This authority more or less has final say over everything on that machine. There is exactly one authority in Windows.
If we jump ahead a decade or two we come across The Cloud and it forever changed how everything everywhere did things. For better or for worse.
Domain Joined machines didn't exactly fit well into this new world because of technical limitations of how authentication and management worked. They always need line of sight to a domain controller to get anything interesting done.
With The Cloud you don't need line of sight to your internal servers anymore because everything is out on the internet.
So we introduced Azure AD Join. That means we changed the authority from your on-prem domain controller to Azure AD. When you type in your password it gets verified by AAD, not AD. Let's talk about how that works.
Well, it turns out it works almost identically to domain join. Windows is kinda predictable like that.
Where it diverges is in which packages get used. Instead of msv1_0 and Kerberos, we have a new package: CloudAP. CloudAP is the thing that talks to AAD and MSA (formerly live[.]com and Passport).
Each of those exist as separate, but internal plugin implementations to CloudAP. We're focusing on the AAD plugin.
So you've typed your password in, the credential providers do their thing and fire the credentials off to the LSA, and the LSA iterates through all the authentication packages (APs).
It hits msv1_0 and Kerberos and both say 🤷‍♂️ "not our problem". It then hits CloudAP and it says "heck yes I can do something with this." So off CloudAP goes.
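The hand-off above can be sketched as a simple chain: each package either takes the credential or passes. This is a toy model, not how LSA is actually implemented; the `Credential`, `AuthPackage`, `make_package`, and `lsa_logon` names and the `account_kind` label are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Credential:
    username: str
    password: str
    account_kind: str  # hypothetical label: "local", "domain", or "aad"


@dataclass
class AuthPackage:
    name: str
    # Returns a description of the logon, or None for "not our problem".
    try_logon: Callable[[Credential], Optional[str]]


def make_package(name: str, handles: str) -> AuthPackage:
    """Build a toy package that only accepts one kind of account."""
    def try_logon(cred: Credential) -> Optional[str]:
        if cred.account_kind != handles:
            return None
        return f"{name} handled logon for {cred.username}"
    return AuthPackage(name, try_logon)


def lsa_logon(packages: List[AuthPackage], cred: Credential) -> Optional[str]:
    """Offer the credential to each package in order; the first taker wins."""
    for package in packages:
        result = package.try_logon(cred)
        if result is not None:
            return result
    return None


PACKAGES = [
    make_package("msv1_0", "local"),
    make_package("Kerberos", "domain"),
    make_package("CloudAP", "aad"),
]
```

Calling `lsa_logon(PACKAGES, Credential("alice", "pw", "aad"))` walks past msv1_0 and Kerberos and lands on CloudAP.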
Now CloudAP determines it's AAD, loads up that plugin, and begins the authentication dance. Of course, the first thing it does is check the cache, because that's how logon works. Do a fast check to get to the desktop, then a long check in the background to whatever authority.
However, if the last logon timestamp is less than a few hours ago, the long check is skipped because frankly it's kinda unnecessary every. single. time. But let's say this is the first logon, or it's been more than a few hours. How does the client authenticate to AAD?
Through OAuth! Believe it or not it's OAuth all the way down. A customized form of OAuth, but at its core it's completely compatible and per-spec.
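Here's a rough sketch of that decision. The thread only says "a few hours", so the four-hour `REFRESH_INTERVAL` is a placeholder, and `CacheEntry` and `plan_logon` are hypothetical names rather than anything in Windows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Assumption: the real threshold is only described as "a few hours".
REFRESH_INTERVAL = timedelta(hours=4)


@dataclass
class CacheEntry:
    username: str
    last_online_logon: datetime


def plan_logon(entry: Optional[CacheEntry], now: datetime) -> Tuple[bool, bool]:
    """Return (use_cache_for_fast_logon, contact_authority)."""
    if entry is None:
        # First logon: nothing cached, so the authority must be contacted.
        return (False, True)
    recently_verified = now - entry.last_online_logon < REFRESH_INTERVAL
    # Cached users get to the desktop fast; the long check runs only
    # when the last online verification is too old.
    return (True, not recently_verified)
```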
Anyway, the first thing the plugin does is figure out where AAD lives. It turns out we have more than one AAD: public, regional, and government. The client was stamped with this information long ago, so in the end it knows it needs to hit https://login.microsoftonline.com/tid/token.
From there it determines if it can authenticate directly to AAD, or if it's a federated user and needs to go elsewhere. We'll come back to federated. Let's assume it's a regular managed user.
So the client knows where to go, and it first requests a nonce from AAD. This acts as a liveness check to make sure the request isn't a replay. It's a short randomly generated value.
The client takes the nonce plus the user's username and password and signs it with a device key that was registered when the machine was first joined.
It then takes that signed blob and fires it off to that AAD /token endpoint. AAD looks up the device, verifies the blob, validates the username and password (and makes sure they all live in the same tenant), and if all goes well forms a response.
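To make the nonce-and-signature step concrete, here's a toy version. Big assumption: the real device key is registered at join time and the wire format is AAD's own; this sketch swaps in an HMAC over a JSON body purely so it runs on the standard library. `ToyAuthority` and `build_request` are invented names.

```python
import hashlib
import hmac
import json
import secrets


class ToyAuthority:
    """Stand-in for AAD; everything here is illustrative."""

    def __init__(self):
        self.device_keys = {}           # device_id -> (tenant, key)
        self.users = {}                 # username -> (tenant, password)
        self.outstanding_nonces = set()

    def issue_nonce(self) -> str:
        nonce = secrets.token_hex(16)
        self.outstanding_nonces.add(nonce)
        return nonce

    def token(self, device_id: str, body: bytes, signature: bytes) -> bool:
        device = self.device_keys.get(device_id)
        if device is None:
            return False
        device_tenant, key = device
        # HMAC stands in for the device-key signature (assumption).
        expected = hmac.new(key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            return False
        request = json.loads(body)
        # A nonce is accepted once, so a captured request cannot be replayed.
        if request["nonce"] not in self.outstanding_nonces:
            return False
        self.outstanding_nonces.remove(request["nonce"])
        user = self.users.get(request["username"])
        if user is None:
            return False
        user_tenant, password = user
        # User and device must live in the same tenant.
        return user_tenant == device_tenant and hmac.compare_digest(
            password.encode(), request["password"].encode()
        )


def build_request(key: bytes, nonce: str, username: str, password: str):
    """Client side: bundle nonce and credentials, then sign the bundle."""
    body = json.dumps(
        {"nonce": nonce, "username": username, "password": password},
        sort_keys=True,
    ).encode()
    return body, hmac.new(key, body, hashlib.sha256).digest()
```

Sending the same signed blob twice fails the second time, which is the whole point of the nonce.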
This response includes a Primary Refresh Token (PRT), an encrypted session key, and an ID Token. The PRT is kinda like your TGT. You exchange it for tokens to other resources. The ID token is like that workstation ticket that tells the machine all about the user.
And that session key is special. The session key is encrypted to a device key that was registered way back when the device was first set up. This key is used to bind the PRT to the device because the session key is used when exchanging the PRT.
Now the client has a useful PRT so it stuffs it into the cache, decrypts the session key and stuffs that into the cache too, and then validates the ID token to log the user on. Within the ID token is useful information like user SID and what not.
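Why does the session key bind the PRT to the device? A minimal sketch, assuming the exchange carries a proof computed with the session key. The HMAC proof, `ToyTokenService`, and `sign_exchange` are illustrative stand-ins; the actual protocol messages aren't covered in this thread.

```python
import hashlib
import hmac
import secrets


class ToyTokenService:
    """Stand-in for the token endpoint; not the real protocol."""

    def __init__(self):
        self._session_keys = {}  # prt -> session key

    def issue_prt(self):
        prt = secrets.token_urlsafe(24)
        session_key = secrets.token_bytes(32)
        self._session_keys[prt] = session_key
        return prt, session_key

    def exchange(self, prt: str, resource: str, proof: bytes):
        session_key = self._session_keys.get(prt)
        if session_key is None:
            return None
        expected = hmac.new(session_key, resource.encode(), hashlib.sha256).digest()
        # Without the session key the PRT alone is useless.
        if not hmac.compare_digest(expected, proof):
            return None
        return f"access-token-for-{resource}"


def sign_exchange(session_key: bytes, resource: str) -> bytes:
    """Client side: prove possession of the session key for this request."""
    return hmac.new(session_key, resource.encode(), hashlib.sha256).digest()
```

Someone who copies the PRT off the machine but can't get at the session key can't produce a valid proof, so the exchange returns nothing.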
All this bubbles up out of CloudAP, through to LSA so it can fill in all the session details, and off you go.
As I've said, at a high enough level this is identical to the other flows.
Now supposing you're an enterprise customer and you live in both AD and AAD. You need access to on-prem resources, how does this work?
It's kinda easy actually. Once Windows has proof from AAD that your credentials are good, LSA opens up the Kerberos AP, hands those creds to Kerberos and says "have at it", and then it's Kerberos all the way down again.
But here's where it diverges a bit more. When an application needed a Kerberos ticket it would call into the SSPI (née GSS) library and ask for a ticket to some resource. We don't have an SSPI plugin for OAuth.
The reason is that SSPIs don't allow for UI (ish), whereas OAuth is inherently UI-driven for things like Consent, or in our case Conditional Access.
It also happens that SSPIs are error prone and kind of a PITA to use if you've never touched them before. So we built a new thing: Web Account Manager (WAM). It's more or less like SSPI, except it has a different API model and handles UI natively.
So when an application like Office or Teams or Edge or whatever needs an OAuth token it asks WAM for one, and if the same application needs a Kerberos ticket it asks SSPI.
Now the interesting thing is, what happens when an AADJ machine needs access to another AADJ machine for something like file shares, or RDP?
Well, it turns out we can't do Kerberos because the on-prem KDC doesn't know anything about the other machine (because AAD is the authority). So what do?
We do PKU2U! Which I discussed in the RDP thread.
Tl;dr; It's kinda like Kerberos (it's actually a copy), but instead of symmetric secrets it uses certificates, and instead of three parties it's two. Those certificates are issued by AAD. Go read the other thread for more information.
So now machines A and B are talking to each other using PKU2U. What's in these tickets so things like authorization can happen?
Let's talk about Windows' authorization for a moment. For the last thirty plus years Windows has relied on the same model, more or less. It all relies on this thing called the SID -- the security identifier. Everything has one: users, groups, computers, domains, etc.
A SID has a special form of S-1-AuthorityIdentifier-Authority1-Authority2-Authority3-Authority4-RelativeIdentifier. In other words S-1-5-21-111-222-333-555, where 111-222-333 represent your domain, and 555 represents you the user.
In on-prem Active Directory the form is S-1-{Domain}-{User}. This makes it super easy to identify things later on, and you immediately know what domain a user belongs to. This, however, is an incredibly painful design for AD internals because of how those RIDs are allocated.
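Splitting a SID string into its pieces is a few lines of code. This sketch uses the example SID from above; `Sid` and `parse_sid` are names made up here, not a Windows API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sid:
    revision: int
    identifier_authority: int
    sub_authorities: Tuple[int, ...]

    @property
    def domain_part(self) -> str:
        """Everything except the trailing relative identifier (RID)."""
        fields = [self.revision, self.identifier_authority, *self.sub_authorities[:-1]]
        return "-".join(["S", *map(str, fields)])

    @property
    def rid(self) -> int:
        """The last sub-authority, which identifies the user in the domain."""
        return self.sub_authorities[-1]


def parse_sid(text: str) -> Sid:
    parts = text.split("-")
    if len(parts) < 4 or parts[0] != "S":
        raise ValueError(f"not a SID: {text!r}")
    revision, authority, *subs = (int(p) for p in parts[1:])
    return Sid(revision, authority, tuple(subs))
```

For `S-1-5-21-111-222-333-555` the `rid` is 555 and `domain_part` is everything before it, which is why two users in the same on-prem domain share a visible prefix.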
Tl;dr; each DC gets a pool of RIDs within a range. DC1 gets RIDs 1000-1500, DC2 gets RIDs 1501-2000, etc. Eventually you run out because of this allocation mechanism and it's a bad bad bad bad bad day for anyone needing to recover from it.
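A toy allocator shows why this runs dry. The numbers are tiny on purpose and the blocks come out as 1000-1499 and 1500-1999 rather than the exact ranges above; `RidMaster` is a hypothetical class, not AD's actual allocation logic.

```python
class RidMaster:
    """Hands out contiguous RID blocks until the space is exhausted."""

    def __init__(self, first_rid: int, last_rid: int, block_size: int):
        self.next_rid = first_rid
        self.last_rid = last_rid
        self.block_size = block_size

    def allocate_block(self) -> range:
        if self.next_rid > self.last_rid:
            raise RuntimeError("RID space exhausted")
        end = min(self.next_rid + self.block_size - 1, self.last_rid)
        block = range(self.next_rid, end + 1)
        # Blocks are never handed back, so unused RIDs in a pool are lost.
        self.next_rid = end + 1
        return block
```

With `RidMaster(1000, 1999, 500)`, two DCs each get a block and the third request raises, even if neither DC has created a single account yet.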
This simply wouldn't scale in AAD, so we make up a SID and the entire thing represents the user. There is zero relationship to the domain or tenant. It's a fixed value so it's consistent for each user, but there's no relationship between users anymore.
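The thread doesn't say how that SID gets made up. One widely circulated conversion, treated here as an assumption rather than a spec, reads the object ID GUID as four little-endian 32-bit integers and appends them to `S-1-12-1`. The function name `aad_object_id_to_sid` is invented for this sketch.

```python
import struct
import uuid


def aad_object_id_to_sid(object_id: str) -> str:
    """Derive a SID-shaped string from an AAD object ID (assumed mapping)."""
    # bytes_le matches the byte order .NET uses for Guid.ToByteArray().
    raw = uuid.UUID(object_id).bytes_le
    parts = struct.unpack("<4I", raw)
    return "S-1-12-1-" + "-".join(str(p) for p in parts)
```

The same object ID always gives the same SID, and two users in the same tenant share nothing beyond the fixed prefix, which is the "no relationship between users" problem in a nutshell.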
This kinda makes it difficult to manage authorization rules. Mea culpa -- we're working on making it better, promise!
Anyway, as it turns out, we store these SIDs in the certificate, not the PKU2U ticket, so the far machine must parse the certificate, extract these details and we now have something that can turn into an NT token.
As you might imagine this gets a bit complicated when connecting from domain-joined to AADJ machines. That's because the AADJ machine has no knowledge of the SIDs in the Kerberos ticket PAC, and in fact doesn't even have a key to decrypt the ticket, so there's no ticket at all.
The domain-joined machine might not be aware of PKU2U so, depending on a whole bunch of conditions, the connection might succeed or might not. Either way the SIDs don't match anything, so you're back to maybe authenticating but not being authorized.
Wheeeeeeee. We're working on it.
Hybrid Join
So that in a nutshell is AADJ. Let's look at the other thing that people tend to get confused about, which is Hybrid Join.
Remember those authority things? We have a domain authority, and we have an AAD authority. Hybrid join uses the *domain* authority. Full stop. End of the line. What makes it hybrid?
The hybrid story here is about management and SSO. The hybrid joined machine can be managed by Intune/MDM or Group Policy. It gets SSO support to cloud resources AND on-prem resources, but no matter what the domain is the authorizing thing.
The way this works is you push a group policy configuration down to each of your domain joined machines, and this tells them to start connecting to AAD and registering themselves. This registration creates the device in AAD, which registers some keys.
Then the next time the user logs on it goes through the same old dance. Credential goes to credential provider, CP goes to LSA, LSA says "who can do something about this??" MSV1_0 and Kerberos say "ayyyye" and do their thing.
But now once those are done CloudAP jumps up and exclaims it too can do something!!! And so it does. The difference is if Kerberos fails, it doesn't move on to AAD, there's no cache involved for CloudAP, etc.
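In code form the ordering looks something like this. It's a sketch of the decision only: `LogonResult` and `hybrid_join_logon` are made-up names, and the idea that a CloudAP failure costs you cloud SSO but not the logon itself is my reading of "the domain is the authority", not something spelled out above.

```python
from dataclasses import dataclass


@dataclass
class LogonResult:
    logged_on: bool
    cloud_sso: bool


def hybrid_join_logon(domain_auth_ok: bool, cloudap_ok: bool) -> LogonResult:
    # msv1_0/Kerberos go first: the domain is the authority.
    if not domain_auth_ok:
        # No fallback to AAD when the domain says no.
        return LogonResult(logged_on=False, cloud_sso=False)
    # Assumption: a CloudAP failure only costs cloud SSO, not the logon.
    return LogonResult(logged_on=True, cloud_sso=cloudap_ok)
```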
So hybrid gets your WAM for SSO, but you're still relying on your on-prem domain to do things.
Why don't we make hybrid allow you to log in with AAD (as described by authority)? Well, it's a little more mundane than folks would think: because it would be impossible for everyone to reason about or manage.
Have you ever had two bosses? They have competing agendas and priorities and many times they're in conflict. Trying to unravel it is an exercise in madness. So if you want cloud to be your authority you should consider switching to AADJ. Your stress levels will thank you.
Speaking of bosses, here's Bruce reviewing these threads and wondering why I'm not coding.