Shrinking a Machine Learning Pipeline for AWS Lambda

Using AWS Lambda for deploying machine learning algorithms is on the rise. You may ask yourself, “What is the benefit of using Lambda over deploying the model to an AWS EC2 server?” The answer: enabling higher throughput of queries. This scale-up may challenge an EC2 server, but not Lambda. It enables up to 2,000 parallel queries.
However, troubles begin with the Lambda deployment package. The code, dependencies, and artifacts needed for the application, which in our case added up to nearly 500MB, must sum up to no more than 256MB (unzipped).
From 500MB to 256MB, that’s a significant difference.
Let me show you how our team approached this challenge and ultimately deployed a complex machine learning (ML) model to AWS Lambda.
The Problem: The Size of the Deployment Package
We wanted to deploy a tree-based classification model for detecting malicious domains and IPs. This detection is based on multiple internal and external threat intelligence sources that are transformed into a feature vector. This feature vector is fed into the classification model and the model, in turn, outputs a risk score. The pipeline includes the data pre-processing and feature extraction phases.
Our tree-based classification model required installing Python XGBoost 1.2.1, which takes up 417MB. The data pre-processing phase requires Pandas 1.1.4 (41.9MB) and NumPy 1.19.4 (an additional 23.2MB). Moreover, the trained, pickled XGBoost model weighs 16.3MB. All of which totals 498.4MB.
So how do we shrink all of that with our code, to meet the 256MB deployment package limitation?
ML Lambda
The first step is to shrink the XGBoost package and its dependencies. Do we really need all 417MB of that package?
Removing the distribution info directories (*.egg-info and *.dist-info) and the testing directories, together with stripping the .so files, reduces the XGBoost package's footprint to 147MB. Summing up the packages with joblib, NumPy, and SciPy results in 254.2MB, which fits into a Lambda layer.
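For illustration, here is a minimal sketch of the kind of cleanup that gets a layer under the limit. The site-packages path is a placeholder, and the directories you can safely delete depend on the packages you ship; this is not the exact script we ran.

```python
import shutil
import subprocess
from pathlib import Path

# Path to the layer's installed packages (placeholder; adjust to your layer layout).
SITE_PACKAGES = Path("python/lib/python3.8/site-packages")

# Remove distribution metadata and test suites, which are not needed at runtime.
for pattern in ("*.egg-info", "*.dist-info", "tests", "test"):
    for path in SITE_PACKAGES.rglob(pattern):
        if path.is_dir():
            shutil.rmtree(path, ignore_errors=True)

# Strip debug symbols from the compiled extensions to reclaim additional space.
for so_file in SITE_PACKAGES.rglob("*.so"):
    subprocess.run(["strip", str(so_file)], check=False)
```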
This shrunken Lambda layer backs our first Lambda function, which mainly queries the ML model with a feature vector and returns a classification score. The feature vector is generated from multiple internal and external threat intelligence data sources. The data is pre-processed and transformed into a vector of decimal values, and then fed to our classification model, generating a risk score. But who is responsible for generating this feature vector?
Feature Extraction Lambda
To generate the feature vector, you need to build a feature extraction pipeline. Since we're out of space in the first Lambda function, we'll create another Lambda function for it. The Feature Extraction (FE) Lambda gets an entity as input, queries the relevant sources, then weighs and transforms this data into a feature vector. This feature vector is the input for the first Lambda function, which, in turn, returns the classification result.
The FE Lambda function imports some third-party packages for data retrieval. Since we gather information from various databases, we also need PyMySQL and PyMongo. Finally, for the data cleaning and feature extraction we need Pandas and NumPy.
All this sums up to 111MB, which fits comfortably within the 256MB per-Lambda deployment package limitation.
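To make the flow concrete, here is a simplified sketch of what such an FE handler could look like, assuming an API Gateway proxy integration. The endpoints, database schema, credentials, and feature names are placeholders, not our actual sources or features.

```python
import json

import pandas as pd
import pymysql
import requests  # assumed third-party package for querying external TI APIs


def lambda_handler(event, context):
    # With a GET request through API Gateway, the entity arrives as a query parameter.
    entity = event["queryStringParameters"]["entity"]  # e.g. a domain or an IP

    # Query external and internal threat intelligence sources (all placeholders).
    external = requests.get(f"https://ti.example.com/lookup/{entity}", timeout=5).json()
    conn = pymysql.connect(host="internal-ti-db", user="reader", password="secret", db="ti")
    internal = pd.read_sql("SELECT * FROM indicators WHERE entity = %s", conn, params=[entity])

    # Clean, weigh and transform the raw data into a numeric feature vector.
    features = [
        float(external.get("reputation", 0.0)),
        float(len(internal)),
        float(internal["severity"].mean()) if not internal.empty else 0.0,
    ]

    return {"statusCode": 200, "body": json.dumps({"features": features})}
```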
But you may ask yourself: where is the model? Don't you add it to the deployment package? Actually, it's not needed. Since the model is trained once and then executed on each function call, we can store the pickled, trained model on S3 and download it using boto3 on each function call. That way we separate the model from the business logic, and we can swap models easily without changing the deployment package.
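Putting these pieces together, a minimal sketch of the ML Lambda handler might look like the following. The bucket and key are placeholders, and the sketch assumes the model was saved with the scikit-learn-style XGBClassifier interface; it downloads the pickled model from S3 on each call, as described above.

```python
import json
import pickle

import boto3
import numpy as np

s3 = boto3.client("s3")

# Placeholder location of the pickled, trained model.
MODEL_BUCKET = "my-ml-models-bucket"
MODEL_KEY = "risk-model/xgboost_model.pkl"


def lambda_handler(event, context):
    # The feature vector arrives in the POST body (API Gateway proxy integration).
    body = json.loads(event["body"])
    features = np.array(body["features"]).reshape(1, -1)

    # Download and unpickle the model on each invocation, keeping it out of the package.
    obj = s3.get_object(Bucket=MODEL_BUCKET, Key=MODEL_KEY)
    model = pickle.loads(obj["Body"].read())

    # Assumes an XGBClassifier-style model exposing predict_proba.
    risk_score = float(model.predict_proba(features)[0][1])
    return {"statusCode": 200, "body": json.dumps({"risk_score": risk_score})}
```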
Lambda Functions Inter-communication
Another concern is the inter-communication between the two Lambda functions. We used the REST API feature of the API Gateway service: GET for the FE Lambda, since its input is just a single string entity, and POST for the ML Lambda, since its input is a long feature vector.
That way, the FE Lambda gets an entity as input and queries the third-party data sources. Once the data retrieval is finished, it cleans the data and extracts the feature vector, which in turn is sent to the ML Lambda for prediction.
[caption id="attachment_17785" align="alignnone" width="300"] 1 Lambda function inter-communication using API Gateway[/caption]
Modularity
Another positive side effect of splitting the process into two Lambda functions is modularity. This split lets you integrate additional ML models to work in parallel with the original ML model.
Let's assume we decide to transform our single ML model pipeline into an ensemble of ML models, which outputs a result based on the aggregation of their stand-alone results. That becomes much easier when the FE pipeline is fully decoupled from the ML pipeline, and this modularity can save much effort in the future.
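As a hypothetical illustration of that modularity, the FE Lambda could fan the same feature vector out to several model endpoints and aggregate their scores. The endpoints and the simple averaging rule below are assumptions, not our production design.

```python
import requests  # assumed third-party HTTP client

# One API Gateway endpoint per model Lambda (placeholder URLs).
MODEL_ENDPOINTS = [
    "https://abc123.execute-api.us-east-1.amazonaws.com/prod/model-a",
    "https://def456.execute-api.us-east-1.amazonaws.com/prod/model-b",
]


def ensemble_score(features):
    """Query each model Lambda with the same feature vector and aggregate the scores."""
    scores = []
    for url in MODEL_ENDPOINTS:
        response = requests.post(url, json={"features": features}, timeout=10)
        response.raise_for_status()
        scores.append(response.json()["risk_score"])
    # A plain average is used here only as an example aggregation rule.
    return sum(scores) / len(scores)
```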
Wrapping up
So, we have two main conclusions. The first: the initial step of moving an ML pipeline to a serverless application is understanding the hardware limitations of the platform.
The second conclusion is that these limitations may require ML model adaptations and must be considered as early as possible.
I hope you find the story of our struggles useful when moving your ML pipeline to a serverless application; the effort needed for such a transition will pay off.
Security Testing Shows How SASE Hones Threat Intelligence Feeds, Eliminates False Positives

Threat Intelligence (TI) feeds provide critical information about attacker behavior for adapting an enterprise's defenses to the threat landscape. Without these feeds, your security tools, and those used by your security provider, would lack the raw intelligence needed to defend cyber operations and assets.
But coming from open-source projects, shared communities, and commercial providers, TI feeds vary greatly in quality. They don't encompass every known threat and often contain false positives, leading to the blocking of legitimate network traffic and negatively impacting the business. Our security team found that even after applying industry best practices, 30 percent of TI feeds will contain false positives or miss malicious Indicators of Compromise (IoCs).
To address this challenge, Cato developed a purpose-built reputation assessment system. Statistically, it eliminates all false positives by using machine learning models and AI to correlate readily available networking and security information. Here's what we did. While you might not have the time and resources to build such a system yourself, the process shows how you can do something similar in your network.
TI Feeds: Key to Accurate Detection
The biggest challenge facing any security team is identifying and stopping threats with minimal disruption to the business process. The sheer scope and rate of innovation of attackers put the average enterprise on the defensive. IT teams often lack the necessary skills and tools to stop threats. Even when they do have those raw ingredients, enterprises only see a small part of the overall threat landscape.
Threat intelligence services claim to fill this gap, providing the information needed to detect and stop threats. TI feeds consist of lists of IoCs, such as potentially malicious IP addresses, URLs, and domains. Many will also include the severity and frequency of threats.
To date, the market offers hundreds of paid and free TI feeds. Determining feed quality is difficult without knowing the complete scope of the threat landscape. Accuracy is particularly important to ensure minimum false positives. Too many false positives result in unnecessary alerts that overwhelm security teams, preventing them from spotting legitimate threats. False positives also disrupt the business, preventing users from accessing legitimate resources.
Security analysts have tried to prevent false positives by looking at the IoCs common among multiple feeds. Feeds with more shared IoCs have been thought to be more authoritative. However, using this approach with 30 TI feeds, Cato's security team still found that 78 percent of the feeds that would be considered accurate continued to include many false positives.
[caption id="attachment_11731" align="aligncenter" width="895"] Figure 1. In this matrix, we show the degree of IoC overlap between TI feeds. Lighter color indicates more overlaps and higher feed accuracy. Overall, 75% of the TI feeds showed a significant degree of overlaps.[/caption]
Networking Data Helps Isolate False Positives
To further refine security feeds, we found that augmenting our security data with network flow data can dramatically improve feed accuracy. In the past, taking advantage of networking flow data would have been impractical for many organizations. Significant investment would have been required to extract event data from security and networking appliances, normalize the data, store it, and then build the query tools needed to interrogate that datastore.
The shift to Secure Access Service Edge (SASE) solutions, however, converges networking and security. Security analysts can now leverage previously unavailable networking event data to enrich their security analysis. Particularly helpful in this area is the popularity of a given IoC among real users.
In our experience, legitimate traffic overwhelmingly terminates at domains or IP addresses frequently visited by users. We intuitively understand this. The sites frequented by users have typically been operational for some time. (Unless you're dealing with research environments, which frequently instantiate new servers.) By contrast, attackers will often instantiate new servers and domains to avoid being categorized as malicious – and hence being blocked – by URL filters.
As such, by determining the frequency with which real users visit IoC targets – what we call the popularity score – security analysts can identify IoC targets that are likely to be false positives. The less user traffic destined for an IoC target, the lower the popularity score, and the greater the probability that the target is actually malicious.
At Cato, we derive popularity scoring by running machine learning algorithms against a data warehouse, which is built from the metadata of every flow from all our customers' users. You could do something similar by pulling in networking information from various logs and equipment on your network.
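As a toy illustration of the idea (Cato's production system runs machine learning over a far larger flow-metadata warehouse), a popularity score could be approximated as the share of distinct users that visit each destination. The column names below are assumptions.

```python
import pandas as pd


def popularity_scores(flows: pd.DataFrame) -> pd.Series:
    """Approximate popularity: share of distinct users that visited each destination.

    Assumes `flows` has one row per network flow with 'user_id' and
    'destination' (domain or IP) columns.
    """
    total_users = flows["user_id"].nunique()
    users_per_destination = flows.groupby("destination")["user_id"].nunique()
    return users_per_destination / total_users
```

An IoC target with a score near zero is rarely visited by real users, so blocking it is unlikely to disrupt the business; a score near one suggests a probable false positive.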
Popularity and Overlap Scores to Improve Feed Effectiveness
To isolate the false positives found in TI feeds, we scored the feeds in two ways: an "Overlap Score," which indicates the number of overlapping IoCs between feeds, and a "Popularity Score." Ideally, we'd like TI feeds to have a high Overlap Score and a low Popularity Score. Truly malicious IoCs tend to be identified by multiple threat intelligence services and, as noted, are infrequently accessed by actual users.
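A simplified sketch of how the two scores could be computed per feed, reusing the popularity idea from the previous sketch; the data structures and averaging are illustrative only, not Cato's actual scoring model.

```python
def feed_scores(feeds, popularity):
    """Score each TI feed by IoC overlap with other feeds and by average IoC popularity.

    `feeds` maps a feed name to a set of IoCs; `popularity` maps an IoC to its
    popularity score (see the previous sketch). Both structures are illustrative.
    """
    results = {}
    for name, iocs in feeds.items():
        other_iocs = set().union(*(v for k, v in feeds.items() if k != name))
        overlap_score = len(iocs & other_iocs) / len(iocs) if iocs else 0.0
        popularity_score = (
            sum(popularity.get(ioc, 0.0) for ioc in iocs) / len(iocs) if iocs else 0.0
        )
        # A trustworthy feed should show high overlap and low popularity.
        results[name] = {"overlap": overlap_score, "popularity": popularity_score}
    return results
```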
However, what we found was just the opposite. Many TI feeds (30 percent) had IoCs with low Overlap Scores and high Popularity Scores. Blocking the IoCs in these TI feeds would lead to unnecessary security alerts and frustrated users.
[caption id="attachment_11730" align="aligncenter" width="637"] Figure 2. By factoring in networking information, we could eliminate false positives typically found in threat intelligence feeds. In this example, we see the average score of 30 threat intelligence feeds (names removed). Those above the line are considered accurate. The score is a ratio of the feed's popularity to the number of overlaps with other TI feeds. Overall, 30% of feeds were found to contain false positives.[/caption]
The TI Feeds Are Finely Tuned – Now What?
Using networking information, we could eliminate most false positives, which alone is beneficial to the organization. Results are further improved, though, by feeding this insight back into the security process. Once an asset is known to be compromised, external threat intelligence can be enriched automatically with every communication the host creates, generating novel intelligence. The domains and IPs the infected host contacted, and the files it downloaded, can be automatically marked as malicious and added to the IoCs fed into security devices for even greater protection.
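A rough sketch of that enrichment loop, under the assumption that flow logs carry the source host and destination of each communication; in practice every candidate would still be vetted before being pushed to enforcement points.

```python
import pandas as pd


def derive_new_iocs(flows: pd.DataFrame, compromised_hosts) -> set:
    """Collect destinations contacted by known-compromised hosts as candidate IoCs.

    Assumes `flows` has 'source_host' and 'destination' columns;
    `compromised_hosts` is a set of host identifiers already flagged as compromised.
    """
    suspect_flows = flows[flows["source_host"].isin(compromised_hosts)]
    return set(suspect_flows["destination"].unique())
```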