I spent the better part of the last two years building the authentication stack used by FSLogix in Azure Virtual Desktop for AADJ machines.
— Steve Syfuhs (@SteveSyfuhs) December 1, 2021
Twitter warning: Like all good things this is mostly correct, with a few details fuzzier than others for reasons: a) details are hard on twitter; b) details are fudged for greater clarity; c) maybe I'm just dumb.
It turns out building an entirely new form of hybrid authentication is complicated. How did we make a platform protocol like SMB work for Windows machines without requiring Active Directory?
Easy (welllll): we turned Azure AD into a KDC for Kerberos.
What do I mean by that? You turned AAD into a KDC? Huh? You did what?!
Let's break this down a bit. You have Azure Virtual Desktop, and you want to stash user profiles into a centralized place. Voila: FSLogix. Not new. But you don't want to host your own file share.
Okay, we have this thing called Azure Files which is SMB over Azure Storage, so stuff the profiles there. That's...actually really cool.
How do you authenticate to it? Easy! NTLM. Wait. No, crap. Bad. Bad. Bad.
So AzFiles introduced Active Directory support.
In principle this is somewhat simple. You create a service principal in your own AD and give it an SPN of cifs/your.file.core.windows.net, and whenever you type
\\your.file.core.windows.net your Windows client will dutifully ask AD for a ticket and away you go.
"All" Azure Files needs to know is the service principal password and it can decrypt the ticket, get your identity, get your SIDs, and everyone is happy.
Except how does the AVD host contact AD? It needs to be domain joined or AADJ'ed. And it needs line of sight to a DC.
Okay, well your AVD hosts are in the cloud, and your DCs are in your on-prem network and that means VPN or moving DCs into the cloud AND still needing a VPN for replication yadda yadda yadda. Why can't Azure Files just use AAD to authenticate users?
Let's call it impedance mismatch. Windows understands SMB. SMB understands NTLM or Kerberos. AAD understands OAuth.
Well, what if we make AAD understand Kerberos? How do you make the world's largest [web-based] identity provider natively understand the world's most-used [not-web-based] authentication protocol?
And so begins a two-year journey.
First of all, let's set some ground rules. Kerberos is great for many reasons, but it's also janky for others. It has two legs: AS and TGS.
AS exchanges primary creds for tickets.
TGS exchanges tickets for tickets.
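To make the two legs concrete, here is a deliberately tiny sketch. It assumes a toy KDC with no cryptography, no ticket lifetimes, and plaintext passwords; `ToyKdc`, `Ticket`, and the `EXAMPLE.TEST` realm are all made up for illustration and are not how any real KDC stores or checks anything.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Ticket:
    client: str   # who the ticket was issued to
    service: str  # who the ticket is for
    realm: str    # which realm issued it


class ToyKdc:
    """Toy KDC: no crypto, no lifetimes, just the shape of the two exchanges."""

    def __init__(self, realm: str, passwords: Dict[str, str]):
        self.realm = realm
        self._passwords = passwords

    def as_exchange(self, client: str, password: str) -> Ticket:
        # AS leg: primary credentials in, ticket-granting ticket (TGT) out.
        if self._passwords.get(client) != password:
            raise PermissionError("bad primary credentials")
        return Ticket(client, f"krbtgt/{self.realm}", self.realm)

    def tgs_exchange(self, tgt: Ticket, spn: str) -> Ticket:
        # TGS leg: a TGT in, a service ticket out. No password involved.
        if tgt.service != f"krbtgt/{self.realm}" or tgt.realm != self.realm:
            raise PermissionError("not a TGT for this realm")
        return Ticket(tgt.client, spn, self.realm)


kdc = ToyKdc("EXAMPLE.TEST", {"alice": "correct horse"})
tgt = kdc.as_exchange("alice", "correct horse")           # creds -> ticket
service_ticket = kdc.tgs_exchange(tgt, "cifs/files.example.test")  # ticket -> ticket
```

The thing to notice is that only `as_exchange` ever touches a password; everything after that is tickets in, tickets out.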
The AS leg can be a bit janky.
AS requires both parties have knowledge of the client's password (or asymmetric keys) and AAD doesn't necessarily know either of those, at least not in the forms necessary to make the protocol work. Also, what about support for new authenticators like FIDO? Hmm.
Wait a minute. We've solved this problem already. What if AAD gives us a TGT during logon? That's how we made FIDO work. Hybrid Authentication with FIDO (syfuhs.net)
As it turns out that wasn't quite good enough. The TGT we issue for FIDO targets your on-prem servers, and doesn't contain things like a PAC, and is intended to be exchanged for an on-prem TGT the minute it's received so it's just...weird.
Alright, what if we give you a separate TGT? A Cloud TGT, if you will. I've logged into Windows. It's fired off a request to AAD to get my PRT, and with my PRT comes back this FIDO TGT (optionally) as well as a Cloud TGT.
And so it gets interesting.
We have some curious problems to contend with now. A user belongs to a single realm. When you log in you get a single primary TGT, and that more or less dictates who you contact to get tickets. Well, my realm is on-prem. The cloud TGT can't trample that lest we break the world.
So I now belong to two realms: on-prem and the meta-realm
KERBEROS.MICROSOFTONLINE.COM. Everyone belongs to it. To clarify up front: this is not an isolation thing, it's just a global name. There's still tenant isolation.
This meta-realm is conceptually simple: when you want to get a Kerberos ticket to a cloud resource you ask the
KERBEROS.MICROSOFTONLINE.COM realm. Easy.
Okay, I've logged in, gotten my PRT, gotten my cloud TGT, and now AVD asks FSLogix to load my profile from
\\mystuff.file.core.windows.net. FSLogix doesn't know any better so it asks Windows, Windows asks SMB, SMB says "heeeeeey I need a ticket?" Good old SSO. How Windows Single Sign-On Works (syfuhs.net)
Now the Kerberos stack has a request for
cifs/mystuff.file.core.windows.net. How does it know what realm to use, what TGT to use? It's so confusing. It turns out to be quite simple: during that PRT thing at logon we return a mapping of name suffixes to realms.
What does that mean? An example:
*.windows.net => KERBEROS.MICROSOFTONLINE.COM
If the SPN ends with the value on the left, then that means you should use the TGT from the realm on the right.
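A minimal sketch of that lookup, assuming the mapping is a plain dictionary of wildcard suffixes to realm names. The function name, the longest-suffix tie-break, and the "return None and fall back to the normal realm" behavior are my assumptions for illustration, not the actual Windows logic.

```python
from typing import Dict, Optional

# Mapping in the shape shown above: SPN host suffix -> realm.
REALM_MAP: Dict[str, str] = {
    "*.windows.net": "KERBEROS.MICROSOFTONLINE.COM",
}


def realm_for_spn(spn: str, realm_map: Dict[str, str]) -> Optional[str]:
    """Return the mapped realm for an SPN, or None if no suffix matches."""
    # An SPN looks like "service/host" (optionally "host:port"); pull out the host.
    parts = spn.split("/")
    if len(parts) < 2 or not parts[1]:
        return None
    host = parts[1].split(":")[0].lower()

    best_suffix = ""
    best_realm = None
    for pattern, realm in realm_map.items():
        # "*.windows.net" -> ".windows.net", compared case-insensitively.
        suffix = pattern.lstrip("*").lower()
        # Assumption: if several suffixes match, the longest one wins.
        if host.endswith(suffix) and len(suffix) > len(best_suffix):
            best_suffix, best_realm = suffix, realm
    return best_realm


cloud = realm_for_spn("cifs/mystuff.file.core.windows.net", REALM_MAP)
on_prem = realm_for_spn("cifs/fileserver.corp.example.com", REALM_MAP)
```

Here `cloud` comes back as the meta-realm, and `on_prem` comes back as `None`, meaning "nothing special, use the realm you'd normally use".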
Okay, so the Kerberos stack now knows cifs/mystuff.file.core.windows.net belongs to the cloud
KERBEROS.MICROSOFTONLINE.COM realm. How do you get a ticket from this nebulously named thing?
Back during logon we contacted AAD and did the PRT thing. In response we got:
- The PRT
- Cloud TGT
- FIDO TGT (optionally)
- Realm mapping
- AAD tenant details
Aha, here we go. Tenant details.
The bit of Windows that did the AAD authentication knows enough about Kerberos to say "It's dangerous to go alone! Take this!", bundles up the TGT, mapping, and tenant info and hands it off to the Kerberos stack.
The Kerberos stack is enlightened enough to see that if it gets this bundle of stuff that it should do something special with it. What does it do?
- Stuff the TGT into the cache
- Cache the realm mapping
- Add a KDC Proxy map between the realm in that mapping and the endpoint in the tenant details
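Those three steps can be sketched roughly like this. Everything here is hypothetical scaffolding: `CloudLogonBundle`, `ToyKerberosState`, the field names, and the placeholder endpoint URL are invented to show the bookkeeping, and the real Kerberos stack's data structures are not described in this post.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CloudLogonBundle:
    """Hypothetical shape of what the AAD logon code hands to Kerberos."""
    cloud_tgt: bytes               # opaque TGT blob
    tgt_realm: str                 # realm the TGT belongs to
    realm_mapping: Dict[str, str]  # SPN suffix -> realm
    kdc_proxy_endpoint: str        # taken from the tenant details


@dataclass
class ToyKerberosState:
    tgt_cache: Dict[str, bytes] = field(default_factory=dict)    # realm -> TGT
    realm_map: Dict[str, str] = field(default_factory=dict)      # suffix -> realm
    kdc_proxy_map: Dict[str, str] = field(default_factory=dict)  # realm -> URL

    def accept_bundle(self, bundle: CloudLogonBundle) -> None:
        # 1. Stuff the TGT into the cache, alongside (not replacing) others.
        self.tgt_cache[bundle.tgt_realm] = bundle.cloud_tgt
        # 2. Cache the realm mapping.
        self.realm_map.update(bundle.realm_mapping)
        # 3. Point every mapped realm at the tenant's KDC proxy endpoint.
        for realm in set(bundle.realm_mapping.values()):
            self.kdc_proxy_map[realm] = bundle.kdc_proxy_endpoint


state = ToyKerberosState(tgt_cache={"CORP.EXAMPLE.TEST": b"on-prem-tgt"})
state.accept_bundle(CloudLogonBundle(
    cloud_tgt=b"cloud-tgt",
    tgt_realm="KERBEROS.MICROSOFTONLINE.COM",
    realm_mapping={"*.windows.net": "KERBEROS.MICROSOFTONLINE.COM"},
    kdc_proxy_endpoint="https://kdcproxy.example.invalid/tenant",  # placeholder
))
```

Keying the TGT cache by realm is the point of the sketch: the cloud TGT sits next to the on-prem one instead of trampling it.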
Oh hey, that's how we magically transit Kerberos over the internet! KDC Proxy. Or more specifically the [MS-KKDCP] protocol. Neeeeeeat.
Okay, back to the ticket request. Kerberos sees
*.windows.net maps to
KERBEROS.MICROSOFTONLINE.COM realm and sees a KDC proxy mapping of that realm to
Incidentally you can find that endpoint in the .well-known endpoints list:
https://login.microsoftonline.com/common/.well-known/openid-configuration (replace common with your tenant ID).
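If you want to poke at it yourself, a small sketch like the following would do. It assumes the relevant entry has "kerberos" somewhere in its key name, which is a guess on my part, so it filters by substring instead of hard-coding a key. The function names are made up.

```python
import json
from typing import Dict
from urllib.request import urlopen


def openid_configuration_url(tenant: str = "common") -> str:
    # Swap "common" for your tenant ID to get tenant-specific values.
    return f"https://login.microsoftonline.com/{tenant}/.well-known/openid-configuration"


def kerberos_entries(config: dict) -> Dict[str, str]:
    """Pick out string-valued entries whose key mentions Kerberos."""
    # Assumption: the endpoint's key contains "kerberos" (case-insensitive).
    return {
        key: value
        for key, value in config.items()
        if "kerberos" in key.lower() and isinstance(value, str)
    }


def fetch_kerberos_entries(tenant: str = "common", timeout: float = 10.0) -> Dict[str, str]:
    # Needs network access; returns an empty dict if nothing matches.
    with urlopen(openid_configuration_url(tenant), timeout=timeout) as response:
        return kerberos_entries(json.load(response))
```

Calling `fetch_kerberos_entries("your-tenant-id")` should print out whatever Kerberos-related endpoint the discovery document advertises for that tenant.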
So we fire off a TGS request over the KDC Proxy protocol to AAD and now it's real good old-fashioned Kerberos and we're cooking with fire.
It turns out this is the easy part. AAD receives the TGS-REQ, verifies the cloud TGT matches the tenant in the endpoint, looks up the user, looks up the requested SPN, generates a ticket, encrypts it to the service principal key, bundles it all up in a TGS-REP and returns it.
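The order of those checks can be sketched as a toy handler. To be loud about it: this is a simulation with no real cryptography (a "sealed" ticket just remembers which key it was sealed to), and `ToyCloudKdc`, `CloudTgt`, `SealedTicket`, and the tenant and key names are all hypothetical, not AAD's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CloudTgt:
    user: str
    tenant_id: str


@dataclass
class SealedTicket:
    """Stand-in for an encrypted ticket: records which key it was sealed to."""
    spn: str
    sealed_to_key: str
    identity: dict


class ToyCloudKdc:
    def __init__(self, users: Dict[str, Dict[str, List[str]]],
                 service_keys: Dict[str, Dict[str, str]]):
        self._users = users                # tenant -> user -> group list
        self._service_keys = service_keys  # tenant -> SPN -> key name

    def handle_tgs_req(self, endpoint_tenant: str, tgt: CloudTgt, spn: str) -> SealedTicket:
        # 1. The TGT must belong to the tenant whose endpoint was called.
        if tgt.tenant_id != endpoint_tenant:
            raise PermissionError("TGT does not match endpoint tenant")
        # 2. Look up the user.
        groups = self._users.get(endpoint_tenant, {}).get(tgt.user)
        if groups is None:
            raise LookupError("unknown user")
        # 3. Look up the requested SPN.
        key = self._service_keys.get(endpoint_tenant, {}).get(spn)
        if key is None:
            raise LookupError("unknown service principal")
        # 4. Generate the ticket and "encrypt" it to the service principal key.
        return SealedTicket(spn, key, {"user": tgt.user, "groups": list(groups)})


kdc = ToyCloudKdc(
    users={"tenant-a": {"alice": ["profile-users"]}},
    service_keys={"tenant-a": {"cifs/mystuff.file.core.windows.net": "storage-key-1"}},
)
ticket = kdc.handle_tgs_req(
    "tenant-a", CloudTgt("alice", "tenant-a"), "cifs/mystuff.file.core.windows.net"
)
```

The tenant check coming first is the tenant isolation mentioned earlier: a global realm name, but a TGT only works against its own tenant's endpoint.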
Oh right, the service principal. It's just an AD service principal. It has keys. You configured it when you set it up. Azure Files has its storage keys, those keys are synced with AAD, and when you generate a ticket, it gets encrypted to those keys.
Anyway, the Kerberos stack receives the TGS-REP, strips out the ticket, generates an AP-REQ, hands it back to SMB, SMB stuffs it into a header, sends the SMB hello, Azure Files receives the hello, decrypts the ticket, and hey look at that, an identity.
SMB didn't know any better. Azure Files mostly didn't know any better. FSLogix definitely didn't know any better, and now here you are, logged into AVD with a centralized and roaming profile secured without Active Directory.