Inside Cato: How a Data Driven Approach Improves Client Identification in Enterprise Networks

Identification of OS-level client types over IP networks has become crucial for network security vendors. With this information, security administrators can gain greater visibility into their networks, differentiate between legitimate human activity and suspicious bot activity, and identify potentially unwanted software. The process of identifying clients by their network traces is, however, very challenging. Most...
Inside Cato: How a Data Driven Approach Improves Client Identification in Enterprise Networks Identification of OS-level client types over IP networks has become crucial for network security vendors. With this information, security administrators can gain greater visibility into their networks, differentiate between legitimate human activity and suspicious bot activity, and identify potentially unwanted software. The process of identifying clients by their network traces is, however, very challenging. Most of the common methods being applied today require a great deal of manual work, advanced domain expertise, and are prone to misclassification. Using a data-driven approach based on machine learning, Cato recently developed a new technology that addresses these problems, enabling accurate and scalable identification of network clients. Going “old school” with manual fingerprinting One of the most common methods to passively identify network clients, without requiring access to either endpoint, is fingerprinting. Imagine you are a police investigator arriving at a crime scene for forensics. Lucky for you, the suspect left an intact fingerprint. Since he is a well-known criminal with previous offenses, his fingerprints are already in the police database, and you can use the one you found to trace back to him. Like humans, network clients also leave unique traces that can be used to identify them. In your network, combinations of attributes such as HTTP headers order, user-agents, and TLS cipher suites, are unique to certain clients. [boxlink link="https://catonetworks.easywebinar.live/registration-ransomware-chokepoints"] Ransomware Chokepoints: Disrupt the Attack | Watch Webinar [/boxlink] In recent years, fingerprints relying solely on unsecured network traffic attributes (e.g., the HTTP user-agent value) have become obsolete, since they are easy to spoof, and are not applicable when using secured connections. TLS fingerprints, that rely on attributes from the Client Hello message of the TLS handshake, on the other hand, do not suffer from these drawbacks and are slowly gaining larger adaptation from security vendors. Below is an example of a TLS fingerprint that identifies a OneDrive client. Caption: TLS header fingerprint of a One Drive client (source) However, manually mapping network clients to unique identifiers is not a simple task; It requires in-depth domain knowledge and expertise. Without them, the method is prone to misclassifications. In a shared community effort to address this issue, some open-source projects (e.g. sslhaf and JA3) were created, but they provide low coverage and are not updated frequently. An even greater issue with manual fingerprinting is scalability. Accurately classifying client types requires manually analyzing traffic captures, a labor-intensive process that does not scale for enterprise networks. Taking such an approach at Cato wasn’t feasible. Each day the Cato SASE Cloud must inspect millions of unique TLS handshake values. The large number of values is due not only to the number of network clients connected to Cato SASE Cloud but also to the number of different versioning and updates that alter the TLS behavior of each client. Clearly, we needed a better solution. Clustering – An automated and robust approach With great amounts of data, comes great amounts of capabilities. With the use of machine learning clustering algorithms, we’ve managed to reduce millions of unique TLS handshake values to a subset of a few hundred clusters, representing unique network clients that share similar values. After creating the clusters, a single fingerprint can be generated for each one, using the longest common substring (LCS) from the Client Hello message of all the samples in the cluster. Finding the common denominator of several samples makes the approach more robust to small variations in the message values. Caption: Similar values from the TLS Client Hello message are clustered together and the LCS is used to generate a fingerprint. Each colored cluster represents a different client. The next step of the process is to identify and label the client of the fingerprint. To do so, we search for the fingerprint in our data lake, containing terabytes of daily network flows generated from different clients, and look for common attributes such as domains or applications contacted, and HTTP user-agent (visible from TLS traffic interception and inspection). For example, in the plot above, a group of TLS network flows, containing similar Client Hello messages, were clustered together by the algorithm (see the 3-point “Java/1.8.0_211” cluster colored in light blue). The resulting TLS fingerprint matched a group of inspected TLS flows in our data lake, with visible HTTP headers; all of which had a common user-agent that belongs to the Java Standard Library; a library used to perform HTTP requests. Wrapping up Accurately identifying client types has become crucial for network security vendors. With this new data-driven approach, we’ve managed to develop a fully automated and continuous pipeline that generates TLS fingerprints. The new method scales to large enterprise networks and is more robust to variation in the client's network activity.    

Cato Networks Adds Protection from the Perils of Cybersquatting

A technique long used for profiting from the brand strength of popular domain names is finding increased use in phishing attacks. Cybersquatting (also called domain squatting) is the use of a domain name with the intent to profit from the goodwill of a trademark belonging to someone else. Increasingly, attackers are tapping cybersquatting to harvest...
Cato Networks Adds Protection from the Perils of Cybersquatting A technique long used for profiting from the brand strength of popular domain names is finding increased use in phishing attacks. Cybersquatting (also called domain squatting) is the use of a domain name with the intent to profit from the goodwill of a trademark belonging to someone else. Increasingly, attackers are tapping cybersquatting to harvest user credentials. Last month, one such campaign targeted 1,000 users at a high-profile communications company with an email containing a supposed secure file link from an email security vendor. Once clicked, the link led to a spoofed Proofpoint page with login links for different email providers. So prevalent are these threats that Cato Networks has added cybersquatting protection to our service. Over the past month, we’ve detected 5,000 unique squatted domains for more than 50 well-known trademarks. These domains follow certain patterns. By understanding these patterns, you’ll be more likely to protect your organization from this new threat. [boxlink link="https://www.catonetworks.com/resources/ransomware-is-on-the-rise-catos-security-as-a-service-can-help?utm_source=blog&utm_medium=top_cta&utm_campaign=ransomware_ebook"] Ransomware is on the Rise | Download eBook [/boxlink] Types of Cybersquatting There are several techniques for creating domains that may trick unsuspecting users. Here are four of the most common: Typosquatting Typosquatting creates domain names that incorporate typical typos users input when attempting to access a legitimate site. A perfect example is catonetwrks.com, which leaves out the “o” in networks. The user mistypes Cato’s Web site and ends up interacting with another site used to spread misinformation, redirect the user, or download malware to the user’s system. Combosquatting Combosquatting creates a domain that combines the legitimate domain with additional words or letters. For example, cato-networks.com adds a hyphen to Cato’s URL catonetworks.com. Combosquatting is often used for links in phishing emails. Here are two examples of counterfeit websites that use combosquatting to prompt the user to submit sensitive information. The domain names, amazon-verifications[.]com and amazonverification[.]tk, make the user think they are interacting with a legitimate website owned by Amazon.   [caption id="attachment_20714" align="alignnone" width="860"] Figure 1- Examples of combosquatting[/caption] Levelsquatting Levelsquatting inserts the target domain into the subdomain of the cybersquatting URL. This attack usually targets mobile device users who may only see part of the URL displayed in the small mobile-device browser window. A perfect example of levelsquatting would be login.catonetworks.com.fake.com. The user may only see the prefix of login.catonetworks.com on his Apple or Android screen and thinks it’s a legitimate Cato Networks login site. Homographsquatting Homographsquatting uses various character combinations that resemble the target domain visually. One example is catonet0rks.com, which uses a zero digit that looks like the letter “o” or ccitonetworks.com, where the combination of “c” and “i” after the initial “c” looks to users like the letter “a.” Homographsquatting can also use Punycode to include non-ASCII characters in international domain names (IDN). An example would be cаtonetworks.com (xn--ctonetworks-yij.com in Punycode). In this case the “a” is a non-ASCII character from the Cyrillic alphabet. Here is a non-malicious example of a Facebook homograph domain (xn--facebok-y0a[.]com in Punycode) offered for sale. The squatted domain is used for the owner’s personal profit. [caption id="attachment_20716" align="alignnone" width="1600"] Figure 2- Example of homographsquatting targeting Facebook users[/caption] And here is another use of homographsquatting, this time going after Microsoft users. The domain name - nnicrosoft[.]online – uses double “n”s to look like the “m” in “microsoft.” [caption id="attachment_20718" align="alignnone" width="1600"] Figure 3- Example of homographsquatting targeting Microsoft users[/caption] How to Detect Cybersquatting To detect cybersquatted domains, Cato Networks uses a method called Damerau-Levenshtein distance. This approach counts the minimum number of operations (insertions, deletions, substitution, or transposition of two adjacent characters) needed to change one word into the other. For example, netflex.com has an edit distance of 1 from the legitimate site, netflix.com via the substitution the “i” character with “e”. [caption id="attachment_20722" align="alignnone" width="861"] Figure 4 – Substitution of the “i” character with an “e” in the netflix domain.[/caption] Cato Networks configures the edit distance used to classify squatted domains dynamically for each squatted trademark, taking into consideration the length and word similarity. Think of the words that can be generated with a 2 edit distance from the name Instagram or DHL, for example. We also look at who registered the domain. You might be surprised to learn that many domains of trademarks with common typos are registered by the trademark owner to redirect the user to the correct site. Detecting a domain registered by anyone other than the trademark owner arouses suspicion. Checking the domain age and registrar also turns up clues. Newly registered domains and domains from low-reputation registrars are more likely to be associated with unwanted and malicious activity than others. Separating Squatted from Non-squatted Domains In October 2021 alone, Cato Networks used these methods to detect more than 5,000 unique squatted domains for more than 50 well-known trademarks. The graphic below shows that fewer than 20% were owned by the legitimate trademark owner. [caption id="attachment_20724" align="alignnone" width="1200"] Figure 5 – Distribution of ownership in the detected domains.[/caption] Additionally, Cato’s data shows that legitimate companies tend to register domains that include their trademark with combinations of other characters and typical typos. Domains that are not registered by trademark owners tend to have a higher percentage of trademarks in the subdomain level, i.e. levelsquatting. [caption id="attachment_20726" align="alignnone" width="1200"] Figure 6 - Distribution of squatting techniques in domains not registered by trademark owners.[/caption] Finally, this graphic of Cato Networks data shows that many of the squatted domains target search engines, social media, Office suites and e-commerce websites. [caption id="attachment_20728" align="alignnone" width="1200"] Figure 7- Top targeted trademarks.[/caption] Don’t Wait to Identify Cybersquatting There is no doubt that cybersquatting can be used in a variety of ways to target unsuspecting users and companies for a data breach. Organizations need to educate themselves on the perils of cybersquatting and incorporate tools and techniques for detecting phishing and other attacks that use this method for nefarious purposes. The good news is that Cato customers can now take advantage of Cato’s cybersquatting detection to protect their users and precious data.