Cato Application Catalog – How we supercharged application categorization with AI/ML

New applications emerge at a pace that is almost impossible to keep up with, creating a constant challenge and blind spot for IT and security teams in the form of Shadow IT. Organizations must keep up by using tools that are automatically updated with the latest developments and changes in the application landscape to maintain proper security.

An integral part of any SASE product is its ability to accurately categorize and map user traffic to the actual application being used. To manage sanctioned and unsanctioned applications, to apply security policies across the network based on the application or category of applications, and especially to enforce granular application controls using CASB, a comprehensive application catalog must be maintained.

At Cato, keeping up required building a process that is both highly automated and, just as importantly, data-driven, so that we focus on the applications most used by our customers and can separate the wheat from the chaff.

In this post we’ll detail how we supercharged our application catalog updates, moving from a labor-intensive manual process to a fully automated, AI/ML-based, data-driven pipeline. This grew our rate of adding new applications by an order of magnitude, from tens of applications to hundreds added every week.

What IS an application in the catalog?

Every application in our Application Catalog has several characteristics:

General – what the company does, number of employees, where it’s headquartered, etc.
Compliance – certifications the application holds and complies with.
Security – features supported by the application, such as TLS, two-factor authentication, SSO, etc.
Risk score – a critical field calculated by our algorithms based on multiple heuristics (detailed later in this post) to allow IT managers and CISOs to focus on actual possible threats to their network.
Down to business, how it actually gets done

We refer to the process of adding an application as “signing” it. The process starts with the automated steps and ends with human analysts going over the list of apps to be released in the weekly release cycle and giving it a final human verification. (Side note: this is also presently a bottleneck in the process, as we want the highest control and quality when publishing new content to our production environment, though we are working on ways to improve this part of the process as well.)

As mentioned, the first order of business is picking the applications that we want to add, and for that we use our massive data lake, in which we collect the metadata from all traffic that flows through our network. We identify candidates by looking at the most used domains (FQDNs) across our entire network that repeat across multiple customer accounts, are yet to be signed, and are not in our catalog.

The automation is done end-to-end using “Shinnok”, our in-house tool developed and maintained by our Security Research team. Taking the narrowed-down list of unsigned apps, Shinnok begins compiling the 4 fields (description, compliance, security and risk score) for every app.

Description – This is the most straightforward part, and is based on information taken via API from Crunchbase.
Compliance – Using a combination of online lookups and additional heuristics for every compliance certification we target, we compile the list of certifications supported by the app. For example, by using Google’s query API for a given application + “SOC2”, and then further filtering the results for false positives from unreliable sources, we can identify support for SOC2 compliance.
Security – Similar to compliance, with the addition of using our data lake to identify certain security features used by the app that we observe over the network.
Risk Score – This is the most important field, and we take a combination of multiple data points to calculate it:

Popularity: This is based on multiple data points, including real-time traffic data from our network that measures occurrences of the application across our own network, correlated with additional online sources. Typically, an app that is more popular and more well-known poses a lower risk than a new, obscure application.
CVE analysis: We collect and aggregate all known CVEs of the application. The more high-severity CVEs an application has, the more openings it gives attackers and the greater the risk to the organization.
Sentiment score: We collect news, mentions and any articles relating to the company/application, and build a dataset of all mentions of the application. We then pass this dataset through our AI deep learning model, which outputs for every mention whether it is positive or negative, generating a final sentiment score that is added as a data point for the overall algorithm.

Distilling all the different data points using our algorithms, we can calculate the final Risk Score of an app.

WIIFM?

The main advantage of this approach to application categorization is that it is PROACTIVE, meaning network administrators using Cato receive the latest updates for all the latest applications automatically. Based on the data we collect, we estimate that 80%-90% of all HTTP traffic in our network is covered by a known application categorization.

Admins can be much more effective with their time by looking at data that is already summarized, giving them the top risks in their organization that require attention.
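To make the Risk Score combination described earlier more concrete, here is a minimal sketch of how popularity, CVE counts and a sentiment score might be folded into one number. The post does not publish the actual algorithm, so everything here is an assumption for illustration: the 0-10 scale, the weights, the CVE cap, and the names `AppSignals`, `sentiment_score` and `risk_score`.

```python
from dataclasses import dataclass


@dataclass
class AppSignals:
    """Hypothetical per-application inputs (not Cato's actual schema)."""
    popularity: float        # 0.0 (obscure) .. 1.0 (very widely used)
    high_severity_cves: int  # number of known high-severity CVEs
    sentiment: float         # -1.0 (all negative) .. 1.0 (all positive)


def sentiment_score(labels):
    """Average per-mention labels (+1 positive, -1 negative) into one score."""
    if not labels:
        return 0.0  # no mentions: treat as neutral
    return sum(labels) / len(labels)


def risk_score(signals, w_pop=0.4, w_cve=0.4, w_sent=0.2, cve_cap=10):
    """Return a risk value in [0, 10]; higher means riskier.

    The weights and the cap are illustrative assumptions only.
    """
    pop_risk = 1.0 - signals.popularity                       # popular apps score lower
    cve_risk = min(signals.high_severity_cves, cve_cap) / cve_cap
    sent_risk = (1.0 - signals.sentiment) / 2.0               # map [-1, 1] to [1, 0]
    combined = w_pop * pop_risk + w_cve * cve_risk + w_sent * sent_risk
    return round(10.0 * combined, 2)


well_known = AppSignals(0.95, 0, sentiment_score([1, 1, 1, -1]))
obscure = AppSignals(0.05, 8, sentiment_score([-1, -1, 1]))
print(risk_score(well_known), risk_score(obscure))
```

In this sketch a widely used app with no high-severity CVEs and mostly positive mentions lands near the bottom of the scale, while an obscure app with many CVEs lands near the top, which matches the direction of each heuristic described above.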
Use case example #1 – Threads by Meta

To demonstrate the proactive approach, we can look at the recent, very public and explosive launch of the Threads platform by Meta, which, regardless of its present success, was recorded as the largest product launch in history, overtaking ChatGPT with over 100M user registrations in 5 days. In the diagram below we can see this from the perspective of our own network, checking all the boxes for a new application that qualifies to be added to our app catalog, from the numbers of unique connections and users to the total number of different customer accounts that were using Threads.

Thanks to the automated process, Threads was automatically included in the upcoming batch of applications to sign. Two weeks after its release it was already part of the Cato App Catalog, without end users needing to take any action on their part.

Use case example #2 – Coverage by geographical region

As part of an analysis done by our Security Research team, we identified a considerable gap in our application coverage for the Japanese market, which coincided with feedback received from the Japan sales teams on lacking coverage. Using the same automated process, this time limiting the data lake input to Shinnok to traffic from Japanese users only, we began a focused project of augmenting the application catalog with applications specific to the Japanese market, and were able to add more than 600 new applications over a period of 4 months. Following this, we measured a very substantial increase in coverage, going from under 50% to over 90% of all inspected HTTP traffic to Japanese destinations.
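The candidate selection step described earlier (most used FQDNs, seen across multiple customer accounts, not yet in the catalog), together with the regional scoping used for the Japanese market, could be sketched roughly as follows. The flow record schema (`fqdn`, `account`, `country`), the `min_accounts` threshold and the function name are assumptions for illustration, not Shinnok's actual interface.

```python
from collections import defaultdict


def pick_unsigned_candidates(flows, catalog, min_accounts=2, country=None, top_n=100):
    """Rank FQDNs that are not yet in the catalog by how often they are used.

    flows:   iterable of dicts with 'fqdn', 'account' and 'country' keys
             (a hypothetical flow-metadata schema).
    catalog: set of FQDNs that are already signed.
    country: optional filter, e.g. "JP" to scope the run to one market.
    """
    hits = defaultdict(int)      # fqdn -> number of flows
    accounts = defaultdict(set)  # fqdn -> distinct customer accounts
    for flow in flows:
        if country is not None and flow["country"] != country:
            continue
        fqdn = flow["fqdn"].lower()
        if fqdn in catalog:
            continue  # already signed
        hits[fqdn] += 1
        accounts[fqdn].add(flow["account"])

    # Keep only domains that repeat across several customer accounts.
    candidates = [f for f in hits if len(accounts[f]) >= min_accounts]
    # Most used first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda f: (-hits[f], f))
    return candidates[:top_n]


flows = [
    {"fqdn": "app.example", "account": "acme", "country": "JP"},
    {"fqdn": "app.example", "account": "globex", "country": "JP"},
    {"fqdn": "niche.example", "account": "acme", "country": "JP"},
    {"fqdn": "known.example", "account": "globex", "country": "US"},
]
print(pick_unsigned_candidates(flows, catalog={"known.example"}, country="JP"))
```

Requiring several distinct accounts is what filters out domains that are heavily used by a single customer only, which is one plausible reading of "repeating across multiple customer accounts".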
To summarize

We’ve reviewed how, by leveraging our huge network and data lake, we were able to build a highly automated process that uses real-time online data sources coupled with AI/ML models to categorize applications with very little human work involved. The main benefit is, of course, that Cato customers do not need to worry about keeping up to date on the latest applications their users are using. Instead, they know they will receive the updates automatically, based on the top trends and usage on the internet.

Shrinking a Machine Learning Pipeline for AWS Lambda

Using AWS Lambda for deploying machine learning algorithms is on the rise. You may ask yourself, “What is the benefit of using Lambda over deploying the model to an AWS EC2 server?” The answer: enabling higher throughput of queries. This scale-up may challenge an EC2 server, but not Lambda, which enables up to 2,000 parallel queries.

However, troubles begin when it comes to the Lambda deployment package. The code, dependencies and artifacts needed for the application, which in our case came to nearly 500MB, must sum up to no more than 256MB (unzipped).

From 500MB to 256MB, that’s a significant difference. Let me show you how our team approached this challenge, which ended with deploying a complex machine learning (ML) model to AWS Lambda.

The Problem: The Size of the Deployment Package

We wanted to deploy a tree-based classification model for detecting malicious domains and IPs. This detection is based on multiple internal and external threat intelligence sources that are transformed into a feature vector. This feature vector is fed into the classification model, and the model, in turn, outputs a risk score. The pipeline includes the data pre-processing and feature extraction phases.

Our tree-based classification model required installing Python XGBoost 1.2.1, which takes 417MB. The data pre-processing phase requires installing Pandas 1.1.4, which takes 41.9MB, and Numpy 1.19.4, which adds another 23.2MB. Moreover, the trained, pickled XGBoost model weighs 16.3MB. All of this totals 498.4MB. So how do we shrink all of that, together with our code, to meet the 256MB deployment package limitation?

ML Lambda

The first step is to shrink the XGBoost package and its dependencies. Do we really need all 417MB of that package?
Removing the distribution info directories (*.egg-info and *.dist-info) and the testing directories, together with stripping the .so files, reduces the XGBoost package to 147MB. Summing up the packages with joblib, numpy and scipy results in 254.2MB, which is suitable for a Lambda layer.

This shrunken Lambda layer will serve our first Lambda function, which mainly queries the ML model with a feature vector and returns a classification score. The feature vector is generated from multiple internal and external threat intelligence data sources. The data is pre-processed and transformed into a vector of decimal values, and then fed to our classification model, generating a risk score. But who is responsible for generating this feature vector?

Feature Extraction Lambda

To generate the feature vector, you need to build a feature extraction pipeline. Since we’re out of space in the first Lambda function, we’ll create another Lambda function for it. The Feature Extraction (FE) Lambda gets an entity as an input, queries the relevant sources, then weighs and transforms this data into a feature vector. This feature vector is the input for the first Lambda function, which in turn returns the classification result.

The FE Lambda function imports some third-party packages for the data retrieval. Also, since we gather information from various databases, we’ll need PyMySQL and PyMongo. Finally, for the data cleaning and feature extraction we’ll need Pandas and Numpy. All this sums up to 111MB, which comfortably fits within the 256MB per-Lambda deployment package limitation.

But you may ask yourself, where is the model? Don’t you add it to the deployment package? Actually, it’s not needed. Since the model is trained once and then executed on each function call, we can store the pickled trained model on S3 and download it using boto3 on each function call.
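A minimal sketch of that download step is shown below. The function name `load_model` and the injectable `s3_client` parameter are assumptions added for illustration (the latter makes the function testable without AWS credentials); `get_object` is the standard boto3 S3 client call. The bucket and key in the commented usage are placeholders, not values from the original pipeline.

```python
import pickle


def load_model(bucket, key, s3_client=None):
    """Fetch a pickled model object from S3 and unpickle it.

    Only unpickle objects from a bucket you control: unpickling
    untrusted data can execute arbitrary code.
    """
    if s3_client is None:
        import boto3  # imported lazily so a stub client can be used in tests
        s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response["Body"].read())


# Hypothetical usage inside the ML Lambda handler:
# model = load_model("my-model-bucket", "models/classifier.pkl")
```

Because the model is addressed only by bucket and key, swapping in a newly trained model becomes a matter of uploading a new object rather than rebuilding the deployment package.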
That way we separate the model from the business logic, and we can swap between models easily without changing the deployment package.

Lambda Functions Inter-communication

Another concern is how the two Lambda functions communicate with each other. We used REST APIs in the API Gateway service, with GET for the FE Lambda, since its input is just a single string entity. For the ML Lambda we created a REST API using POST, since this Lambda’s input is a long feature vector. That way, the FE Lambda gets an entity as an input and queries third-party data sources. After the data retrieval is finished, it cleans the data and extracts the feature vector, which in turn is sent to the ML Lambda for prediction.

[caption id="attachment_17785" align="alignnone" width="300"] Figure 1. Lambda function inter-communication using API Gateway[/caption]

Modularity

Another positive side effect of splitting the process into two Lambda functions is modularity. This split enables you to integrate additional ML models to work in parallel with the original ML model. Let’s assume we decide to transform our single ML model pipeline into an ensemble of ML models, which outputs a result based on the aggregation of their stand-alone results. This becomes much easier when the FE pipeline is completely separated from the ML pipeline, and that modularity can save much effort in the future.

Wrapping up

So, we have two main conclusions. The first is that moving an ML pipeline to a serverless application starts with understanding the hardware limitations of the platform. The second is that these limitations may require ML model adaptations and must be considered as early as possible. I hope you find the story of our struggles useful when moving your ML to a serverless application. The efforts needed for such a transition will pay off.
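The GET/POST split described under "Lambda Functions Inter-communication" can be illustrated with a short sketch that only builds the two HTTP requests. The endpoint URLs, the `entity` query parameter and the `features` JSON field are hypothetical placeholders; the post does not specify the actual API Gateway routes or payload shapes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical API Gateway endpoints (placeholders, not real URLs).
FE_URL = "https://example.invalid/fe"
ML_URL = "https://example.invalid/ml"


def build_fe_request(entity):
    """GET request: the FE Lambda input is a single string entity."""
    return Request(f"{FE_URL}?{urlencode({'entity': entity})}", method="GET")


def build_ml_request(feature_vector):
    """POST request: the ML Lambda input is a long feature vector, sent as JSON."""
    body = json.dumps({"features": list(feature_vector)}).encode("utf-8")
    return Request(
        ML_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


fe_req = build_fe_request("example.com")
ml_req = build_ml_request([0.1, 2.0, 0.0])
print(fe_req.get_method(), fe_req.full_url)
print(ml_req.get_method(), ml_req.data)
```

Each request could then be sent with `urllib.request.urlopen(...)`. A short entity fits naturally in a query string, while a long feature vector is better carried in a request body, which is the reasoning behind the GET/POST choice described above.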

Security Testing Shows How SASE Hones Threat Intelligence Feeds, Eliminates False Positives

Threat Intelligence (TI) feeds provide critical information about attacker behavior for adapting an enterprise's defenses to the threat landscape. Without these feeds, your security tools, and those used by your security provider, would lack the raw intelligence needed to defend cyber operations and assets.

But because they come from open-source projects, shared communities, and commercial providers, TI feeds vary greatly in quality. They don't encompass every known threat and often contain false positives, which lead to the blocking of legitimate network traffic and negatively impact the business. Our security team found that even after applying industry best practice, 30 percent of TI feeds will contain false positives or miss malicious Indicators of Compromise (IoCs).

To address this challenge, Cato developed a purpose-built reputation assessment system. Statistically, it eliminates all false positives by using machine learning models and AI to correlate readily available networking and security information. Here's what we did. While you might not have the time and resources to build such a system yourself, here's the process for how you can do something similar in your network.

TI Feeds: Key to Accurate Detection

The biggest challenge facing any security team is identifying and stopping threats with minimal disruption to the business process. The sheer scope and rate of innovation of attackers put the average enterprise on the defensive. IT teams often lack the necessary skills and tools to stop threats. Even when they do have those raw ingredients, enterprises only see a small part of the overall threat landscape.

Threat intelligence services claim to fill this gap, providing the information needed to detect and stop threats. TI feeds consist of lists of IoCs, such as potentially malicious IP addresses, URLs, and domains. Many will also include the severity and frequency of threats.
To date, the market has hundreds of paid and unpaid TI feeds. Determining feed quality is difficult without knowing the complete scope of the threat landscape. Accuracy is particularly important to keep false positives to a minimum. Too many false positives result in unnecessary alerts that overwhelm security teams, preventing them from spotting legitimate threats. False positives also disrupt the business, preventing users from accessing legitimate resources.

Security analysts have tried to prevent false positives by looking at the IoCs common among multiple feeds. Feeds with more shared IoCs have been thought to be more authoritative. However, using this approach with 30 TI feeds, Cato's security team still found that 78 percent of the feeds that would be considered accurate continued to include many false positives.

[caption id="attachment_11731" align="aligncenter" width="895"] Figure 1. In this matrix, we show the degree of IoC overlap between TI feeds. Lighter color indicates more overlaps and higher feed accuracy. Overall, 75% of the TI feeds showed a significant degree of overlaps.[/caption]

Networking Data Helps Isolate False Positives

To further refine security feeds, we found that augmenting our security data with network flow data can dramatically improve feed accuracy. In the past, taking advantage of network flow data would have been impractical for many organizations. Significant investment would have been required to extract event data from security and networking appliances, normalize the data, store the data, and then have the necessary query tools to interrogate that datastore.

The shift to Secure Access Service Edge (SASE) solutions, however, converges networking and security. Security analysts can now leverage previously unavailable networking event data to enrich their security analysis. Particularly helpful in this area is the popularity of a given IoC among real users.
In our experience, legitimate traffic overwhelmingly terminates at domains or IP addresses frequently visited by users. We intuitively understand this. The sites frequented by users have typically been operational for some time. (Unless you're dealing with research environments, which frequently instantiate new servers.) By contrast, attackers will often instantiate new servers and domains to avoid being categorized as malicious, and hence being blocked, by URL filters.

As such, by determining the frequency with which real users visit IoC targets, what we call the popularity score, security analysts can identify IoC targets that are likely to be false positives. The less user traffic destined for an IoC target, the lower the popularity score, and the greater the probability that the target is malicious.

At Cato, we derive popularity scoring by running machine learning algorithms against a data warehouse, which is built from the metadata of every flow from all our customers' users. You could do something similar by pulling in networking information from various logs and equipment on your network.

Popularity and Overlap Scores to Improve Feed Effectiveness

To isolate the false positives found in TI feeds, we scored the feeds in two ways: an "Overlap Score" that indicates the number of overlapping IoCs between feeds, and a "Popularity Score." Ideally, we'd like TI feeds to have a high Overlap Score and a low Popularity Score. Truly malicious IoCs tend to be identified by multiple threat intelligence services and, as noted, are infrequently accessed by actual users. However, what we found was just the opposite. Many TI feeds (30 percent) had IoCs with low Overlap Scores and high Popularity Scores. Blocking the IoCs in these TI feeds would lead to unnecessary security alerts and frustrated users.

[caption id="attachment_11730" align="aligncenter" width="637"] Figure 2.
By factoring in networking information, we could eliminate false positives typically found in threat intelligence feeds. In this example, we see the average score of 30 threat intelligence feeds (names removed). Those above the line are considered accurate. The score is a ratio of the feed's popularity to the number of overlaps with other TI feeds. Overall, 30% of feeds were found to contain false positives.[/caption]

The TI Feeds Are Finely Tuned – Now What?

Using networking information, we could eliminate most false positives, which alone is beneficial to the organization. Results are further improved, though, by feeding this insight back into the security process. Once an asset is known to be compromised, external threat intelligence can be enriched automatically with every communication the host creates, generating novel intelligence. The domains and IPs the infected host contacted, and the files it downloaded, can be automatically marked as malicious and added to the IoCs going into security devices for even greater protection.
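For readers who want to try the Overlap Score and Popularity Score idea on their own logs, here is a minimal sketch under stated assumptions. The article defines the Overlap Score as a count of shared IoCs and derives popularity with machine learning over a data warehouse; this sketch instead normalizes both to simple fractions, and the thresholds (`popular_threshold`, `max_overlap`, `min_popularity`) and function names are hypothetical.

```python
def overlap_score(feed, other_feeds):
    """Fraction of a feed's IoCs that also appear in at least one other feed."""
    if not feed:
        return 0.0
    seen_elsewhere = set().union(*other_feeds)
    return len(feed & seen_elsewhere) / len(feed)


def popularity_score(feed, visits, popular_threshold=100):
    """Fraction of a feed's IoCs that real users visit frequently.

    visits maps an IoC (domain or IP) to a visit count taken from
    network flow logs; the threshold is an illustrative assumption.
    """
    if not feed:
        return 0.0
    popular = sum(1 for ioc in feed if visits.get(ioc, 0) >= popular_threshold)
    return popular / len(feed)


def suspect_feeds(feeds, visits, max_overlap=0.5, min_popularity=0.3):
    """Flag feeds with a low Overlap Score and a high Popularity Score."""
    flagged = []
    for name, iocs in feeds.items():
        others = [v for n, v in feeds.items() if n != name]
        low_overlap = overlap_score(iocs, others) < max_overlap
        high_popularity = popularity_score(iocs, visits) > min_popularity
        if low_overlap and high_popularity:
            flagged.append(name)
    return sorted(flagged)


feeds = {
    "A": {"bad1.example", "bad2.example"},
    "B": {"bad1.example", "bad2.example", "bad3.example"},
    "C": {"news.example", "shop.example", "bad3.example"},
}
visits = {"news.example": 5000, "shop.example": 900, "bad1.example": 2}
print(suspect_feeds(feeds, visits))
```

Here feed "C" is flagged because most of its entries are unique to it and are heavily visited by real users, which is the combination the article associates with false positives.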