Cato Application Catalog – How we supercharged application categorization with AI/ML

New applications emerge at an almost impossible to keep-up-with pace, creating a constant challenge and blind spot for IT and security teams in the form...
Cato Application Catalog – How we supercharged application categorization with AI/ML New applications emerge at an almost impossible to keep-up-with pace, creating a constant challenge and blind spot for IT and security teams in the form of Shadow IT. Organizations must keep up by using tools that are automatically updated with latest developments and changes in the applications landscape to maintain proper security. An integral part of any SASE product is its ability to accurately categorize and map user traffic to the actual application being used. To manage sanctioned/unsanctioned applications, apply security policies across the network based on the application or category of applications, and especially for granular application controls using CASB, a comprehensive application catalog must be maintained. At Cato, keeping up required building a process that is both highly automated and just as importantly, data-driven, so that we focus on the applications most in-use by our customers and be able to separate the wheat from the chaff.In this post we’ll detail how we supercharged our application catalog updates from a labor-intensive manual process to an AI/ML based process that is fully automated in the form of a data-driven pipeline, growing our rate of adding new application by an order of magnitude, from tens of application to hundreds added every week. What IS an application in the catalog? Every application in our Application Catalog has several characteristics: General – what the company does, employees, where it’s headquartered, etc. Compliance – certifications the application holds and complies with. Security – features supported by the application such as if it supports TLS or Two-Factor authentication, SSO, etc. Risk score – a critical field calculated by our algorithms based on multiple heuristics (detailed here later) to allow IT managers and CISOs focus on actual possible threats to their network. Down to business, how it actually gets done We refer to the process of adding an application as “signing” it, that is, starting from the automated processes up to human analysts going over the list of apps to be released in the weekly release cycle and giving it a final human verification (side note: this is also presently a bottleneck in the process, as we want the highest control and quality when publishing new content to our production environment, though we are working on ways to improve this part of the process as well). As mentioned, first order of business is picking the applications that we want to add, and for that we use our massive data lake in which we collect all the metadata from all traffic that flows through our network.We identify these by looking at the most used domains (FQDNs) in our entire network, repeating across multiple customer accounts, which are yet to be signed and are not in our catalog. [boxlink link="https://catonetworks.easywebinar.live/registration-everything-you-wanted-to-know-about-ai-security"] Everything You Wanted To Know About AI Security But Were Afraid To Ask | Watch the Webinar [/boxlink] The automation is done end-to-end using “Shinnok”, our in-house tool developed and maintained by our Security Research team, taking the narrowed down list of unsigned apps Shinnok begins compiling the 4 fields (description, compliance, security & risk score) for every app. Description – This is the most straightforward part, and based on info taken via API from Crunchbase Compliance – Using a combination of online lookups and additional heuristics for every compliance certification we target; we compile the list of supported certifications by the app.For example by using Google’s query API for a given application + “SOC2”, and then further filtering the results for false positives from unreliable sources we can identify support for the SOC2 compliance. Security – Similar to compliance, with the addition of using our data lake to identify certain security features being used by the app that we observe over the network. Risk Score – Being the most important field, we take a combination of multiple data points to calculate the risk score: Popularity: This is based on multiple data points including real-time traffic data from our network to measure occurrences of the application across our own network and correlated with additional online sources. Typically, an app that is more popular and more well-known poses a lower risk than a new obscure application. CVE analysis: We collect and aggregate all known CVEs of the application, obviously the more high-severity CVEs an application has means it has more opening for attackers and increases the risk to the organization. Sentiment score: We collect news, mentions and any articles relating to the company/application, we then build a dataset with all mentions about the application.We then pass this dataset through our advanced AI deep learning model, for every mention outputting whether it is a positive or negative article/mentions, generating a final sentiment score and adding it as a data point for the overall algorithm. Distilling all the different data points using our algorithms we can calculate the final Risk Score of an app. WIIFM? The main advantage of this approach to application categorization is that it is PROACTIVE, meaning network administrators using Cato receive the latest updates for all the latest applications automatically. Based on the data we collect we evaluate that 80% - 90% of all HTTP traffic in our network is covered by a known application categorization.Admins can be much more effective with their time by looking at data that is already summarized giving them the top risks in their organization that require attention. Use case example #1 – Threads by Meta To demonstrate the proactive approach, we can take a look at a recent use case of the very public and explosive launch of the Threads platform by Meta, which anecdotally regardless of its present success was recorded as the largest product launch in history, overtaking ChatGPT with over 100M user registrations in 5 days.In the diagram below we can see this from the perspective of our own network, checking all the boxes for a new application that qualifies to be added to our app catalog. From the numbers of unique connections and users to the numbers of different customer accounts in total that were using Threads. Thanks to the automated process, Threads was automatically included in the upcoming batch of applications to sign. Two weeks after its release it was already part of the Cato App Catalog, without end users needing to perform any actions on their part. Use case example #2 – Coverage by geographical region As part of an analysis done by our Security Research team we identified a considerable gap in our coverage of application coverage for the Japanese market, and this coincided with feedback received from the Japan sales teams on lacking coverage.Using the same automated process, this time limiting the scope of the data from our data lake being inputted to Shinnok only from Japanese users we began a focused project of augmenting the application catalog with applications specific to the Japanese market, we were able to add more than 600 new applications over a period of 4 months. Following this we’ve measured a very substantial increase in the coverage of apps going from under 50% coverage to over 90% of all inspected HTTP traffic to Japanese destinations. To summarize We’ve reviewed how by leveraging our huge network and data lake, we were able to build a highly automated process, using real-time online data sources, coupled with AI/ML models to categorize applications with very little human work involved.The main benefits are of course that Cato customers do not need to worry about keeping up-to-date on the latest applications that their users are using, instead they know they will receive the updates automatically based on the top trends and usage on the internet.