WHOIS: identification or correlation?
Recently, WHOIS data was used to uncover a large cluster of domains used for a fake URL-shortener scheme and a massive SMS phishing operation, known as Prolific Puma. Spamhaus Technology's Head of Data Carel Bitter explores why this case is particularly interesting and the role of WHOIS data in identification and correlation.
Recently, an industry peer pointed out that WHOIS data made it possible to uncover a large cluster of domains. The domains were used for a fake URL-shortener scheme and a massive SMS phishing operation, known as Prolific Puma. Of course, this particular method of correlation is not new. Since the arrival of GDPR, however, the technique has lost much of its power because registries now redact ownership records. And this is why she mentioned it: WHOIS correlation is becoming so rare that any success deserves a mention.
WHOIS correlation: A success story
Let’s take a deeper dive into the specifics of this case. The original research from Infoblox on Prolific Puma highlights a powerful case of correlating a large number of malicious domains via WHOIS domain owner records. Unfortunately, correlation of this kind is far less common these days.
In this particular case, the Prolific Puma operator's choice of TLD definitely helped. The domains were all registered under the .us TLD, in theory the official TLD for the United States. Compared to many other TLDs, .us has two things that set it apart. First, there is a policy that forbids WHOIS proxy services, meaning whatever registrant info is on file will appear in the public record. Second, and often overlooked but almost equally important in a case involving thousands of domain names, the data is reasonably accessible for research: the WHOIS service has usable rate limits and responds quickly with the data you want.
Why mention this? Because this certainly isn’t the case for every TLD or registrar that maintains and provides thick WHOIS data.
Using WHOIS data for correlation
When talking about WHOIS, policy debate typically focuses on identification, considering things like GDPR and the privacy implications of publishing ownership data. The fact that this same data allows for large-scale correlation regrettably receives much less airtime.
When researching cybercrime, it is often the case that the ownership data of malicious domain names is fake (the ownership data is made up) or stolen (the owner may exist, but they have not purchased that specific domain). While there is attribution value in some of the data, the real value is in the correlation or clustering that WHOIS data can fuel. Once you can achieve this at scale, preventive left-of-bang action becomes a reality for most types of online crime that rely on multiple domain names.
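To illustrate why fake or stolen ownership data can still be useful, here is a minimal sketch of clustering in Python. It assumes WHOIS records have already been collected and parsed into dictionaries. The field names (`registrant_name`, `registrant_email`), the sample records, and the normalization rule are all hypothetical, not taken from the Infoblox research or any particular tool.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def cluster_by_registrant(records, fields=("registrant_name", "registrant_email")):
    """Group domains whose chosen WHOIS fields are identical after normalization."""
    clusters = defaultdict(set)
    for record in records:
        key = tuple(normalize(record.get(field, "")) for field in fields)
        if any(key):  # skip records where every chosen field is empty
            clusters[key].add(record["domain"])
    return dict(clusters)


# Hypothetical, made-up records for illustration only.
records = [
    {"domain": "aa1.example", "registrant_name": "Jane  Doe", "registrant_email": "JD@mail.example"},
    {"domain": "bb2.example", "registrant_name": "jane doe", "registrant_email": "jd@mail.example"},
    {"domain": "cc3.example", "registrant_name": "Other Person", "registrant_email": "op@mail.example"},
]

clusters = cluster_by_registrant(records)
largest = max(clusters.values(), key=len)
print(sorted(largest))  # ['aa1.example', 'bb2.example']
```

The registrant does not need to be real for this to work. The same made-up details reused across registrations are enough to tie the domains together.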
Correlating new domains to ‘good’ clusters
Using WHOIS data for correlation rather than identification has another use case. While we care about finding malicious domain names, we are also interested in identifying benign ones. After all, domain reputation is a spectrum which has a good end, too.
Established businesses can register new domains for a variety of reasons. Over time, this may end up generating a portfolio of thousands, or even tens of thousands, of domains. Being able to easily correlate a new domain name to a cluster of existing benign domains is incredibly valuable, allowing defenders to focus on finding potentially malicious domains in the middle of the spectrum.
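A rough sketch of that triage step follows, assuming a defender already holds a list of trusted domains with their registrant organization. The function names, labels, and sample data are hypothetical. Because ownership data can be stolen, a match is best read as one signal among several rather than as proof that a domain is benign.

```python
def normalize_org(name: str) -> str:
    """Lowercase, trim, and collapse whitespace in an organization name."""
    return " ".join(name.lower().split())


def build_portfolio_index(known_good):
    """Map each normalized registrant organization to its known benign domains."""
    index = {}
    for domain, org in known_good:
        index.setdefault(normalize_org(org), set()).add(domain)
    return index


def triage(index, registrant_org: str) -> str:
    """Label a new domain by whether its registrant matches an existing benign cluster."""
    if normalize_org(registrant_org) in index:
        return "known-portfolio"
    return "needs-review"


# Hypothetical portfolio data for illustration only.
known_good = [("shop.example", "Example Corp"), ("mail.example", "Example Corp")]
index = build_portfolio_index(known_good)

print(triage(index, "example  corp"))      # known-portfolio
print(triage(index, "Unseen Registrant"))  # needs-review
```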
Downfall of WHOIS data collection
In light of the above, ICANN's recently launched RDRS is of questionable use. Because it requires manual work per domain, it is irrelevant to the large-scale processing workflows that are often used to identify security threats within the domain name space. That said, it is not unlike the current state of WHOIS data collection, where policy and technical implementation make it harder, not easier, to get to the valuable data in the registry.
In the absence of at-scale access to this data, those who need it have developed different ways to do correlation. While some of these methods can help identify relationships that can’t be found via WHOIS, they are often slower and much more computationally expensive. Unfortunately, these approaches are not true replacements, as there is simply no good alternative to a comprehensive domain ownership registry.
Towards a solution for correlation
As you might imagine, it is beyond frustrating for researchers that a treasure trove of useful data is still out there but is, in a practical sense, inaccessible. Yes, RDRS is a positive step forward; however, it does not address the scale issue. Implementing a public identifier accessible at scale that uniquely correlates an owner across a registrar, while not perfect, would go a long way. It would enable correlation without revealing actual PII, helping prevent cybercrime damage instead of cleaning it up afterwards.
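One way such an identifier could be derived is with a keyed hash of the registrant's contact details, sketched below. This is an assumption for illustration, not a scheme proposed by ICANN or any registrar. The function name `owner_token`, the choice of the email address as input, and the key are all hypothetical.

```python
import hashlib
import hmac


def owner_token(secret_key: bytes, registrant_email: str) -> str:
    """Derive a stable pseudonymous identifier from a registrant email.

    The key stays with the registrar (an assumption of this sketch), so the
    published token cannot be reversed by simply hashing guessed addresses.
    """
    normalized = registrant_email.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key and addresses for illustration only.
key = b"registrar-private-key"
token_a = owner_token(key, "Owner@Mail.Example")
token_b = owner_token(key, "owner@mail.example ")
token_c = owner_token(key, "someone.else@mail.example")

print(token_a == token_b)  # True
print(token_a == token_c)  # False
```

Publishing only the token would let researchers group domains by owner without seeing the underlying address. A keyed hash is used rather than a plain hash because an unkeyed hash of an email address can be matched against a list of candidate addresses.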
To make this happen, the security, fraud prevention and IP fields need to work together to drive the necessary change in policies and practices. It will not be easy, but it can be done.