Killing NTLM is Hard
EDIT 2: Oh hey, we announced our strategy.
EDIT: Good news. Deprecating NTLM is Easy and Other Lies We Tell Ourselves (syfuhs.net)
So I joked earlier today that the reason we can't kill NTLM is because folks turn off telemetry. That's false.
Mostly.
Here's why.
— Steve Syfuhs (@SteveSyfuhs) May 5, 2021
Twitter warning: Like all good things this is mostly correct, with a few details fuzzier than others for reasons: a) details are hard on twitter; b) details are fudged for greater clarity; c) maybe I'm just dumb.
There's one sure-fire way to kill NTLM: switch to Kerberos or a similar modern authentication protocol. Alright, done and done -- what more is there? Well, like all things, it's never that easy. If it were that easy we'd have solved this 15 years ago.
That's because NTLM is both a blessing and a burden. It has some properties that modern protocols don't. Namely that it doesn't require line of sight to a domain controller and that it doesn't enforce server authentication. Can you guess which bit is great and which is terrible?
Line of sight is maybe relatively straightforward to understand. In the Kerberos world the client (you) needs to speak to a domain controller to get a ticket before you're allowed to access a resource. This sucks majorly when you're outside the boundaries of your network.
That means you need to set up a VPN or KDC Proxy or some such thing. But that's a chicken and egg problem: to connect to the VPN you need to authenticate, to authenticate you need to connect to the DC, to connect to the DC you need...line of sight. Doh!
NTLM doesn't have that problem. You authenticate to the service in question and that service fires the request to the DC. The service is usually in a fixed location so enforcing line of sight there is usually quite easy.
This is why RDP with NLA works from outside the network. You're not doing Kerberos anymore; instead you're doing NTLM, where the client fires the NTLM request at the target directly, and the target forwards it off to the DC.
Fundamentally NTLM gets the same information as if you're doing Kerberos. Kerberos just bundles everything up ahead of time, where NTLM needs to go off and get it every time.
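To make the line-of-sight difference concrete, here's a rough sketch. Everything in it (the `Network` class, the host names, the result strings) is made up for illustration and models only who needs to reach whom, not the actual protocols.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    # Hypothetical model: a set of (source, destination) pairs that can connect.
    reachable: set = field(default_factory=set)

    def can_reach(self, src: str, dst: str) -> bool:
        return (src, dst) in self.reachable


def kerberos_auth(net: Network, client: str, server: str, dc: str) -> str:
    # Kerberos: the client itself must reach the DC to get a ticket first.
    if not net.can_reach(client, dc):
        return "fail: client has no line of sight to the DC"
    if not net.can_reach(client, server):
        return "fail: client cannot reach the server"
    return "ok: client got a ticket from the DC and presented it to the server"


def ntlm_auth(net: Network, client: str, server: str, dc: str) -> str:
    # NTLM: the client only talks to the server; the server talks to the DC.
    if not net.can_reach(client, server):
        return "fail: client cannot reach the server"
    if not net.can_reach(server, dc):
        return "fail: server has no line of sight to the DC"
    return "ok: server forwarded the client's response to the DC"


# A laptop outside the network: it can reach the RDP host, but not the DC.
net = Network(reachable={("laptop", "rdp-host"), ("rdp-host", "dc")})
print(kerberos_auth(net, "laptop", "rdp-host", "dc"))
print(ntlm_auth(net, "laptop", "rdp-host", "dc"))
```

With this topology the Kerberos attempt fails at the first check, while the NTLM attempt succeeds because only the server needs to see the DC.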
So what's the big deal?
Property 2: NTLM doesn't do server auth.
Server auth is fundamentally just the idea that you can [somehow] guarantee the server you're communicating with is in fact the server you want to communicate with.
As authentication protocols go, not supporting server auth sucks majorly.
The reason is that you just fired a credential-thingy at a server that says it's the server you want to talk to, but you can't really verify that at all. What's to stop that server from forwarding it off to the real server and doing evil things while pretending to be you?
This has made the news recently as a new form of attack. It's not particularly new: NTLM has never supported server auth. The flaw here is really just that things are using NTLM when they shouldn't.
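Here's a toy sketch of why a missing server check makes relaying possible. It is not the real NTLM message format: the HMAC-SHA256 response, the class names, and the in-process "real server" that verifies the response itself (instead of forwarding to a DC) are all simplifying assumptions.

```python
import hashlib
import hmac
import os


def response(secret: bytes, challenge: bytes) -> bytes:
    # Toy challenge-response: prove knowledge of the secret without sending it.
    return hmac.new(secret, challenge, hashlib.sha256).digest()


class RealServer:
    def __init__(self, user_secrets: dict):
        self.user_secrets = user_secrets
        self.challenge = b""

    def new_challenge(self) -> bytes:
        self.challenge = os.urandom(8)
        return self.challenge

    def verify(self, user: str, resp: bytes) -> bool:
        expected = response(self.user_secrets[user], self.challenge)
        return hmac.compare_digest(expected, resp)


class Client:
    def __init__(self, user: str, secret: bytes):
        self.user = user
        self.secret = secret

    def answer(self, challenge: bytes) -> bytes:
        # Nothing in the challenge tells the client who actually sent it.
        return response(self.secret, challenge)


def relay_attack(victim: Client, target: RealServer) -> bool:
    # The attacker never learns the secret. It just shuttles messages:
    # it takes the target's challenge, hands it to the victim, and passes
    # the victim's answer back to the target.
    challenge = target.new_challenge()
    relayed = victim.answer(challenge)
    return target.verify(victim.user, relayed)


secret = os.urandom(16)
server = RealServer({"alice": secret})
alice = Client("alice", secret)
print("attacker authenticated as alice:", relay_attack(alice, server))
```

The attacker in the middle gets accepted as the victim purely by forwarding bytes, which is the scenario the missing server authentication allows.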
And why are things using NTLM when they shouldn't? Well, for the two properties I just described.
1. If a client doesn't have line of sight to a DC it can't do Kerberos, so it falls back to NTLM.
2. Server auth fails, which forces a downgrade to NTLM.
Wait, isn't that bad? Why would a server downgrade?
Well, it's not the server doing it. It's the client or DC doing it, because they can't guarantee server auth.
See, the guarantee of server auth is by virtue of a trusted third party saying 'yeeeeep, I vouch that server has the same name as the one you requested'. Kerberos does this with SPNs.
When you ask for a ticket to a service you use the SPN to identify it. The DC looks up the SPN and gives you a ticket to it. If that ticket can be decrypted by the server, then the DC has proven the server is who they say they are. Kerberos Explained in a Little Too Much Detail (syfuhs.net)
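A rough sketch of that idea: the DC seals a fresh session key under the key registered for the SPN, and only a server holding that key can recover it and prove so. The `toy_seal` cipher, the class and function names, and the SPN string are all invented for illustration; this is not the real Kerberos ticket format and the cipher is not secure.

```python
import hashlib
import hmac
import os


def toy_seal(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # XOR with a SHA-256-derived keystream. Illustration only, NOT real crypto.
    # Works for data up to 32 bytes; applying it twice restores the input.
    stream = hashlib.sha256(key + nonce).digest()
    return bytes(a ^ b for a, b in zip(data, stream))


class DomainController:
    def __init__(self):
        self.spn_keys = {}

    def register(self, spn: str, key: bytes) -> None:
        self.spn_keys[spn] = key

    def issue_ticket(self, spn: str):
        key = self.spn_keys.get(spn)
        if key is None:
            return None  # SPN not found: server auth cannot succeed
        session_key = os.urandom(32)
        nonce = os.urandom(16)
        ticket = (nonce, toy_seal(key, nonce, session_key))
        # The client gets the session key plus a ticket it cannot open itself.
        return session_key, ticket


def server_proof(server_key: bytes, ticket, challenge: bytes) -> bytes:
    # The server opens the ticket with its own key and proves it got the
    # session key by keying an HMAC over the client's challenge with it.
    nonce, sealed = ticket
    session_key = toy_seal(server_key, nonce, sealed)
    return hmac.new(session_key, challenge, hashlib.sha256).digest()


def client_accepts(session_key: bytes, challenge: bytes, proof: bytes) -> bool:
    expected = hmac.new(session_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)


dc = DomainController()
real_key = os.urandom(32)
dc.register("cifs/files.example.test", real_key)

session_key, ticket = dc.issue_ticket("cifs/files.example.test")
challenge = os.urandom(16)
print("real server:", client_accepts(session_key, challenge, server_proof(real_key, ticket, challenge)))
print("impostor:", client_accepts(session_key, challenge, server_proof(os.urandom(32), ticket, challenge)))
print("unknown SPN:", dc.issue_ticket("cifs/unknown.example.test"))
```

An impostor without the registered key recovers a different session key and its proof does not check out, and an unregistered SPN yields no ticket at all.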
TLS does this through certificates. The server presents a certificate containing domain names. If the domain name you're contacting the server by is in the list, then you can reasonably say the CA that issued the certificate guarantees that server is who they say they are.
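The name check itself is simple enough to sketch. This is a simplified stand-in, not what any real TLS stack does in full: it ignores chain validation entirely and only handles exact names plus a single left-most wildcard label. The function names and host names are made up.

```python
def name_matches(hostname: str, pattern: str) -> bool:
    # Compare case-insensitively and ignore a trailing dot.
    hostname = hostname.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if pattern.startswith("*."):
        # A wildcard stands in for exactly one left-most label.
        first, _, rest = hostname.partition(".")
        return bool(first) and rest == pattern[2:]
    return hostname == pattern


def server_name_verified(hostname: str, cert_names: list) -> bool:
    # Server auth succeeds only if the name we dialed is in the certificate.
    return any(name_matches(hostname, name) for name in cert_names)


cert_names = ["files.example.test", "*.apps.example.test"]
for host in ["files.example.test", "wiki.apps.example.test", "10.0.0.5"]:
    print(host, "->", server_name_verified(host, cert_names))
```

Note the last case: connecting by an IP address that isn't in the certificate fails the check even though it may well be the right machine.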
So what happens if the SPN isn't found, or the domain name isn't in the certificate? Well, server authentication CANNOT succeed.
In the TLS case this is catastrophic. It's like always clicking the "yes I know this is dangerous and stupid" button in your browser when you get a cert warning. It means you have no root of trust, so there are zero guarantees of privacy of information passed in the TLS channel.
NTLM can be similarly catastrophic, but in different ways. NTLM, much like Kerberos, can be used to authenticate clients, or to optionally provide a secure channel for communications similar to TLS.
The way this works is basically that either protocol will produce a session key that both the client and server know, that can be used for encrypting traffic between the two. This is how protocols like RPC or SMB work.
Other protocols like HTTP don't need NTLM or Kerberos to provide a secure channel because HTTP can rely on TLS to do that. So NTLM is only used for authenticating the user.
So in the NTLM via HTTP over TLS case, you have some measure of server authentication through TLS. Not quite the end of the world.
So if Kerberos can't happen for whatever reason, then the client will fall back to NTLM.
This is the crux of the problem. If server auth fails then you must fall back to a protocol that doesn't do server auth. You can replace the protocol with something that does do server auth, but suddenly you're back to the original problem: server auth *already* failed.
This is why killing NTLM is so hard. We can replace the protocol with something better, but that something is starting its life right out of the gate fighting with basic fundamentals.
This is also why the approach to date has been more about remediating scenarios that are falling back to NTLM vs replacing NTLM outright: replacing it outright doesn't necessarily buy us anything.
Alternatively we can just turn it off. That actually just makes things worse.
Obviously security is important, but continuity of business is important-er. Turning off business critical services is a dangerous game.
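The fallback decision, and what happens when NTLM is simply switched off, can be sketched like this. The function name, parameters, and result strings are hypothetical; real negotiation has more inputs than two booleans.

```python
def choose_protocol(has_dc_line_of_sight: bool, spn_found: bool,
                    ntlm_allowed: bool = True) -> str:
    # Kerberos needs both: a reachable DC and an SPN the DC can resolve.
    if has_dc_line_of_sight and spn_found:
        return "kerberos"
    reason = "SPN not found" if has_dc_line_of_sight else "no line of sight to DC"
    if ntlm_allowed:
        # Server auth could not be established, so fall back to a
        # protocol that does not require it.
        return f"ntlm (fallback: {reason})"
    # With NTLM disabled there is nothing left to fall back to.
    return f"fail ({reason}, NTLM disabled)"


print(choose_protocol(True, True))
print(choose_protocol(False, True))
print(choose_protocol(True, False))
print(choose_protocol(True, False, ntlm_allowed=False))
```

The last call is the "just turn it off" case: the underlying cause is still there, only now the connection breaks instead of quietly downgrading.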
So the path forward is through information gathering. You need to know what's doing NTLM, and you need to know why it's doing it. There's an audit policy for that: Network security: Restrict NTLM: Audit incoming NTLM traffic (Windows 10) - Windows security | Microsoft Docs
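Once you have audit data, the useful thing is to group it by what is using NTLM and why. Here's a minimal sketch of that tallying. The records and their field names (`client`, `server`, `reason`) are hand-written and hypothetical; they are not the actual schema of the audit events, which you'd have to map into this shape yourself.

```python
from collections import Counter

# Hypothetical, hand-written audit records for illustration only.
records = [
    {"client": "laptop-17", "server": "files01", "reason": "no line of sight to DC"},
    {"client": "laptop-22", "server": "files01", "reason": "no line of sight to DC"},
    {"client": "pc-03", "server": "print02", "reason": "SPN not found"},
    {"client": "pc-09", "server": "app03", "reason": "connected by IP address"},
]


def summarize(records: list) -> list:
    # Count NTLM usage per (server, reason), most frequent first.
    counts = Counter((r["server"], r["reason"]) for r in records)
    return counts.most_common()


for (server, reason), count in summarize(records):
    print(f"{count:3d}  {server:10s}  {reason}")
```

Each row is a scenario to remediate. When the list for a server goes empty and stays empty, that's the signal it no longer depends on NTLM.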
Which brings us to the original joke: we can't deprecate NTLM because folks don't turn on telemetry.
That's false. We have ways of killing it, but doing so doesn't necessarily buy us much.
What telemetry does tell us though is *when* we can turn it off.