Two Remote Code Execution (RCE) vulnerabilities have been discovered in the Java Spring framework used in AWS serverless and many other Java applications. At least...
Cato Patches Vulnerabilities in Java Spring Framework in Record Time Two Remote Code Execution (RCE) vulnerabilities have been discovered in the Java Spring framework used in AWS serverless and many other Java applications. At least one of the vulnerabilities has been currently assigned a critical severity level and is already being targeted by threat actors. Within 20 hours of the disclosure, Cato Networks customers were already protected from attacks on both vulnerabilities as Cato Networks security researchers researched, signed and enforced virtual patching across Cato SASE Cloud. No Cato Networks systems are affected by this vulnerability.
The two vulnerabilities come following a recent release of a Spring Cloud Function. One vulnerability, Spring4Shell, is very severe and exploited in the wild. No patch has been issued. The second vulnerability, CVE-2022-22963, in the Spring Cloud Function has now been patched by the Spring team who issued Spring Cloud Function 3.1.7 and 3.2.3.
Within 20 hours of the discovery, Cato customers were already protected against attacks against the vulnerabilities through virtual patches deployed across Cato. Cato researchers had already identified multiple exploitation attacks by threat actors. While no further action is needed, Cato customers are advised to patch any affected systems.
Similar to the Log4j vulnerability, CVE-2022-22963 already has multiple POCs and explanations on GitHub – making it easy to utilize by attackers. As part of the mitigation efforts by Cato’s security and engineering team we verified that no Cato Networks system has been affected by this vulnerability.
As we have witnessed with Log4j, vulnerabilities such as these can take organizations a very long time to patch. Log4j exploitations are still observed to this day, four months after its initial disclosure. Subsequent vulnerabilities may also be discovered and Cato’s security researchers will continue to monitor and research for these and other CVEs – ensuring customers are protected.
The cat-and-mouse game between threat actors and security researchers is ever-evolving. With every new threat comes a security solution that in turn triggers the threat’s...
Attackers are Zeroing in On Trust with New Device ID Twist The cat-and-mouse game between threat actors and security researchers is ever-evolving. With every new threat comes a security solution that in turn triggers the threat’s evolution. It’s an ongoing process whose most recent twist in the Device ID game was documented in our just-released SASE report on enterprise security.
Device ID’s Led to Spoofing-as-a-Service
Device ID is an authentication measure that identifies attackers using compromised credentials by inspecting the device used and comparing it to previous devices used by this account. Most Device ID solutions use a weighted average formula to determine if the incoming device is risky. For example, hardware and screen resolution rarely change, so they get a significant “weight” in the overall calculation. At the same time, a browser version may receive lower priority as it gets patched and changed more often.
Threat actors have used a variety of tools and techniques over the years to fool Device ID solutions. One of the first tools was software that would allow the attacker to spoof several of the many hardware and software components installed on their system (for example – changing the hardware versions). Other early techniques included using an iPhone’s browser to perform a login. All iPhones have the same hardware, and the iOS also tends to be consistent thanks to Apple alerting users to an out-of-date operating system. In other words – the lack of entropy served the attackers well.
One of the latest iterations of Device ID evasion comes in the form of Spoofing-as-a-Service – online dark web services where threat actors send service providers details of the device they are trying to spoof. The service provider creates a VM (Virtual Machine) that would pass a device ID check and can be accessed over the Web. These services come at a significant price jump compared to “on-prem” solutions but provide an extra layer of security for the attacker.
The Latest Twist to Device ID
One question remains – how does the threat actor collect all the system data needed for spoofing? Be it spoofing as a service or a local tool, the attacker must know the victim’s device’s details to spoof them. One of the principles of zero trust architectures is continuous and contextual authentication in which Device ID plays a role. So how can the attacker collect all the necessary data for circumventing these systems?
Threat actors have been collecting Device ID data for at least 15 years now. Researchers have seen this type of data (such as operating system, screen resolution, and time zone) collected by some of the earliest phishing attacks. Later came malware that collected this data upon infection and immediately sent it to its C&C (Command and Control) server. But as we mentioned at the start – every “solution” leads to evolution, and security researchers have upped their game at identifying the exfiltration of Device ID parameters.
So what did the attackers do? They hid this data in areas they thought would be ignored or would generate false positives in security logs. Such is the case with the Houdini malware, which Cato Networks researchers have hunted down and analyzed in our 2021 Q2 report.
For more information, check out the complete report.
Amazon has recently enabled Sidewalk on its devices, raising security and privacy concerns for consumers. But those devices also lurk in many enterprise networks. How...
The New Shadow IT: How Will You Defend Against Threats from Amazon Sidewalk and Other “Unknown Unknowns” on Your Network? Amazon has recently enabled Sidewalk on its devices, raising security and privacy concerns for consumers. But those devices also lurk in many enterprise networks. How can your organization protect itself from these unknown threats?
Security discussions usually revolve around known or perceived threats – ransomware, phishing, social engineering — and which security practices can address them. Which is well and good, but, as the recently deceased Donald Rumsfield put it, what about the unknown unknowns of cyber security?
A recent example is the launch of Amazon Sidewalk – a new feature for Amazon devices that constructs a shared network between devices such as Amazon Echo devices, Ring Security Cams, outdoor lights and more. Sidewalk is meant to be a low-bandwidth wireless network that can connect devices in hard-to-reach places.
[boxlink link="https://catonetworks.easywebinar.live/registration-88?utm_source=blog&utm_medium=top_cta&utm_campaign=masterclass_4"] Cybersecurity Master Class: Episode 4: Supply chain attacks & Critical infrastructure [/boxlink]
But numerous security concerns have also been raised about Sidewalk. While some of the consumer security concerns that have been raised seem like hyped security FUD, others raise valid alarms for any security-minded individual:
Sidewalk’s automatic opt-in – Dot, Ring, Echo, Tile and other devices enable Sidewalk automatically unless the user explicitly opted out. This is not behavior enterprises would expect from their devices.
Sidewalk’s new encryption method – Doing encryption right is not an easy task, especially when introducing new methods. While Amazon promises its triple encryption will make data transfer safe, their new methodology has not been battle tested yet.
Updating and patching – Devices always need to be managed and kept patched. Even if you patch all your connected devices, Sidewalk is now connecting devices to your network that may not be up to date.
Third-party data access – Sidewalk devices will share data with third parties in the future as well as integrate with third party compatible devices.
How can any enterprise make statements its risk preparedness when risk-filled consumer devices might be inhabiting the organization’s network??
This is far from an academic exercise. The Cato Networks Research Team analyzed the network traffic of dozens of enterprises and found hundreds of thousands of flows generated by Amazon Alexa-enabled devices.
And Alexa isn’t the only consumer device or application lurking on your company’s networks. After analyzing 190B network flows as part of Cato Networks SASE Threat Research Report for Q1 of 2021, we discovered flows from many consumer applications and services, such as TikTok, TOR nodes, and Robinhood, on enterprise networks.
Shadow IT: A More Urgent Problem with Work from Anywhere Policies
Unidentified applications and devices and Shadow IT in general have always posed a security risk. What has changed now, however, is the shift to remote work. With the introduction of the home office into your corporate network, home-connected devices with all of their vulnerabilities suddenly become part of your network. And to make matters worse, visibility into those networks diminishes leaving you with a blind spot in your security defenses.
These risks can only be identified if the organization has complete visibility to ANYTHING connecting to its corporate network. This is where SASE comes into the picture. By operating at the network level for all devices and users — in the office and at home — you can see every network flow and every device on your network, giving you visibility to the unknown unknowns of your network.
The recent attack on the Colonial Pipeline. Russian and Chinese election meddling. The exotic and spectacular threats grab popular headlines, but it’s the everyday challenges that plague enterprise networks. Unpatched legacy systems, software long exploited by attackers, banned consumer applications, and more...
New Cato Networks SASE Report Identifies Age-Old Threats Lurking on Enterprise Networks The recent attack on the Colonial Pipeline. Russian and Chinese election meddling. The exotic and spectacular threats grab popular headlines, but it’s the everyday challenges that plague enterprise networks. Unpatched legacy systems, software long exploited by attackers, banned consumer applications, and more leave enterprises exposed to attack.
SASE Platform Gathers Networking and Security Information
Those were just some of the key findings emerging from our analysis of 850 enterprise networks in the Cato Networks SASE Threat Research Report. From January 1 till March 31, Cato used our global SASE platform to gather and analyze network flows from hundreds of enterprises worldwide. We captured the metadata of every traffic flow from every customer site, mobile user, and cloud resource connected to the Cato global private backbone was in a massive data warehouse. All totaled more than 200 billion network flows and 100 Terabytes of data per day were stored in the data warehouse.
With this massive repository, we gathered insights into which applications and security threats operate on enterprise networks. Network information was derived using Cato’s own internal tools. Data modeling was used to identify applications by unique traffic identifiers and then looked up in the Cato library of application signatures. Security information was derived by feeding network flow data into the Cato Threat Hunting System (CTHS), a proprietary machine learning platform that identifies threats through contextual network and security analysis. This highly efficient, automated platform eliminated more than 99% of the flows, leaving the Cato security team to analyze 181,000 high-risk flows. The result was 19,000 verified threats.
Key Findings Highlight Risks and Key Applications
Combining network flow data and accurate threat information provides a unique perspective into today’s enterprise networks. Read the report and learn:
The most popular corporate applications on enterprise networks. Some applications, like Microsoft Office and Google, you will already know, but other applications that have been a source of significant vulnerabilities also lurk on many networks.
Is video-sharing consuming your network bandwidth? A popular video-sharing platform was surprisingly common on enterprise networks, generating even more flows than Google Mail and LinkedIn.
The most common exploits. The report identifies the most common Common Vulnerability and Exposures (CVEs); many were still found in essential enterprise software packages.
The source of most threats. While the news focuses on Russia and China, most threats originate closer to home. “Blocking network traffic to and from ‘the usual suspects may not necessarily make your organization more secure,” says Etay Maor, our senior director of security strategy.
To learn more, check out the Cato Networks SASE Threat Research Report.
The ransomware group that targeted the Colonial Pipeline claims they are in it for the money, not for a political reason. An interesting predicament for...
Targeting critical infrastructure has critical implications The ransomware group that targeted the Colonial Pipeline claims they are in it for the money, not for a political reason. An interesting predicament for the attackers, defenders and the future of such attacks.
The recent ransomware attack on Colonial Pipeline is yet another brutal reminder of the implications of cyber security breaches. The attack’s effect was immediate – the pipeline was shutdown resulting in fear of a spike in gas prices. In addition, a state of emergency declaration for 17 states allows truck drivers to do more extra hours and get less sleep.
While the investigation of the attack is still ongoing, there has been an interesting development in the form of an announcement by the threat actor, DarkSide. In a statement by the group, they claim they are an apolitical group interested in money and not social disruption. In addition, they promise to start checking every company prior to targeting and ransoming it to avoid such cases in the future.
his announcement is interesting for several reasons. First off, it is almost a standard procedure for organizations that have been breached to claim it was a nation state actor that conducted the attack. This helps with mitigating calls for taking full responsibility over the attack because it widely agreed upon that a nation state actor is nearly impossible to stop and prevent their actions. With DarkSide announcing they are an “apolitical” group and specifically saying “do not need to tie us with a defined government” (sic) using this excuse goes out the window. This begs the question – why did DarkSide feel the need to announce this? This ransomware group has targeted dozens of companies in the past, what is different now?
The answer to that is – the target! DarkSide targeted a critical infrastructure and they might have just realized what that means. It means they are not up against an organization’s SOC, they are up against the federal government. Multiple three letter agencies will now be involved in the investigation, and as demonstrated in the past, this can have consequences to the ransomware operation and operators. It seems the group is genuinely concerned as the announcement also states they will make sure that in the future they don’t target systems that which have “social consequences.”
So where do things stand now? While DarkSide claims they will avoid similar targets in the future, it is clear that critical systems will be targeted not just by nation states, but by groups looking to make monetary gains. The “it’s a nation state actor” approach might be off the table for some types of attacks, on the other hand, this also looks like a golden chance for nation state actors to hide their operations under the mascaraed of a ransomware attack. This has already been done in the past…The Liar’s Dividend is something threat actors are happy to take advantage of.
Several new CVEs targeting MS Exchange servers have been discovered and shared by Microsoft. Attacks using these CVEs include manipulation of domain admin accounts, deployment...
New Microsoft Exchange Vulnerability Disclosed Several new CVEs targeting MS Exchange servers have been discovered and shared by Microsoft. Attacks using these CVEs include manipulation of domain admin accounts, deployment of a web shell and exfiltration of data.
Cato Networks security team has already developed and deployed the proper defenses for this new threat.
Earlier this week Microsoft disclosed a set of new 0-day CVEs (CVE-2021-26855, CVE-2021-26858, CVE-2021-26857, CVE-2021-27065) which were (and still are) used by the HALFIUM group to target Microsoft Exchange servers. These vulnerabilities were used by the attackers to create files and web shells using privileged users accounts which gave the attackers persistent access to the vulnerable servers as well as RCE (Remote Code Execution) capabilities and data exfiltration. According to available forensics and security investigations the affected entities were mainly US government entities and retailers.
As part of our research on new and emerging threats, Cato Networks researchers have found evidence of attackers (not necessarily the HALFIUM group) running a script to scan for vulnerable servers.
[caption id="attachment_13703" align="aligncenter" width="1548"] A code snippet from the scanning script[/caption]
Microsoft official release: https://www.microsoft.com/security/blog/2021/03/02/hafnium-targeting-exchange-servers/
While security researchers constantly try to identify and report zero-day vulnerabilities, if those vulnerabilities are not patched and if security controls are not updated, the...
The Biggest Misconception About Zero-Day Attacks While security researchers constantly try to identify and report zero-day vulnerabilities, if those vulnerabilities are not patched and if security controls are not updated, the threat remains real. Cato Networks MDR team investigated patching adoption rates and how to mitigate the risk of vulnerable systems in your network.
Not every software is vulnerable, not every vulnerability is exploitable, and not every exploit is usable – but when everything aligns and a patch has yet to be released, you have a zero-day attack in the making. A lot has been written about zero-day attacks and their potentially devastating outcome, from attacks targeting critical infrastructure to telcos to operating systems. However, a common misconception about zero-day attacks is that once they are disclosed and a patch is made available – the problem is solved, and we can move on. Unfortunately, that is not true.
The MDR team at Cato Networks recently investigated browser patching by pulling stats from our customers’ traffic flows. Based on data from more than 8,000 customers the team noticed that 38% of all Firefox users are not using the latest version. Even more interesting, 20% of Chrome users (which is hands down the most popular browser) have yet to patch the recent Chrome zero-day exploit. This exploit, which was likely used to target security researchers by the North Korean Lazarus group, has been identified and a patch was released almost two weeks ago. Yet, it remains an attack vector since some have not updated their browser to the latest version.
Frank Abagnale, who’s life story was portrayed in the movie ‘Catch Me If You Can’ and who is an FBI agent investigating cyber breaches, has said in one of his talks that “Every breach occurs because somebody in that company did something they weren’t supposed to do or somebody in that company failed to do something they were supposed to do.” Companies still fail at updating and patching systems.
To make things worse, attackers don’t need the resources of a nation state actor to identify vulnerable systems. In my last blog, I showed examples of vulnerable systems found through Shodan, a great search engine for connected devices. And Shodan isn’t the only tool attackers can use to easily identify vulnerable systems. With Censys, for example, even simple queries can show attackers systems vulnerable to attack exploiting POODLE, for example, a 6-year-old TLS vulnerability. With six-years under our collective belts, one would expect POODLE to be long gone, but alas below are just some of the POODLE vulnerable systems found on Censys:
[caption id="attachment_13679" align="aligncenter" width="1024"] A remote display system[/caption]
[caption id="attachment_13675" align="aligncenter" width="1920"] A system at a hospital[/caption]
[caption id="attachment_13675" align="aligncenter" width="1920"] A system at a hospital[/caption]
While some systems patching and updating processes are solely controlled by the IT team, in some cases operating systems and browsers rely on user’s actions. Patching can be a tough, lengthy process and so security teams do not rely on it alone.
This is where Cato’s Managed Detection and Response (MDR) service comes into play. Attackers have multiple advantages today, the most obvious one is they have the initiative; they make the first move. As complete prevention is impossible, security teams have come to rely on fast (as in minutes and hours, not days and months) detection and response once IOCs of a new attack become available.
Within hours, if not minutes, of learning about a threat Cato updates all customer defenses. Think about that for a moment. Even if the threat is discovered by assisting a different customer, your network would be updated and protected. One recent such example is the case of Sunburst, where customer defenses were updated within a couple of hours of the IOCs release. Contrast that with industry expectations that Sunburst vulnerabilities will linger for months and years in enterprise defenses. This means that even if an enterprise fails to patch a system, the system would still be protected against a Sunburst exploit by Cato defenses.
As we encounter more attacks, some in the form of supply chain attacks, others in the form of nation state zero-day vulnerabilities, we need to remember that knowing about the vulnerability isn’t enough. We must patch our systems — ASAP.
But given the impossibility of ever getting patching perfect, we need to ensure our security stacks can detect and prevent attacks on our connected systems. Failure to do so is, in Mr. Abagnale’s words, becoming the “somebody” who “failed to do something they were supposed to do” and we all know how that ends.
A bit more than a year ago, Cato introduced, Instant*Insight (also called “Event Discovery”) to the Cato Secure Access Service Edge (SASE) platform. Instant*Insight are...
February 14, 2021
How to Improve Elasticsearch Performance by 20x for Multitenant, Real-Time Architectures A bit more than a year ago, Cato introduced, Instant*Insight (also called “Event Discovery”) to the Cato Secure Access Service Edge (SASE) platform. Instant*Insight are SIEM-like capabilities that improve our customers’ visibility and investigation capabilities into their Cato account. They can now mine millions of events for insights, returning the results to their console in under a second.
In short, Instant*Insight provides developers with a textbook example for how to develop a high-performance real-time, multitenant Elasticsearch (ES) cluster architecture. More specifically, once Instant*Insight was finished, we had improved ES performance by 20x, efficiency by 72%, and created a platform that could scale horizontally and vertically.
Here’s our story.
To understand the challenge facing our development team, you first need to understand Instant*Insight and Cato. When we set out to develop Instant*Insight, we had this vision for a SIEM-like capability that would allow our customers to query and filter networking and security events instantly from across their global enterprise networks built on Cato Cloud, Cato’s SASE platform.
Customers rely on Cato Cloud to connect all enterprise network resources, including branch locations, the mobile workforce, and physical and cloud datacenters, into a global and secure, cloud-native network service that spans more than 60 PoPs worldwide. With all WAN and Internet traffic consolidated in the cloud, Cato applies a suite of security services to protect traffic at all times.
As part of this process, Cato logs all the connectivity, security, networking, and heath events across a customer’s network in a massive data warehouse. Instant*Insight was going to be the tool by which we allow our users to filter and query that data for improved problem diagnostics, planning, and more.
Instant*Insight needed to be a fully multitenant architecture that would meet several functional requirements:
Scale — We needed a future-proof architecture that would easily scale to support Cato’s growth. This meant horizontal growth in terms of customer base and PoPs, and vertical growth in the continuously increasing number of networks and security events. Already, Cato tracks events with more than 60 different variable types for which there are more than 250 fields.
Price — The design and build had to use a cost optimal architecture because this feature is being offered as part of Cato at no additional price to our customers. Storage had to be optimized as developing per customer storage was not an option. Machine power should also be utilized as much as possible.
UX — The front-end had to be easy to use. IT professionals would need to see and investigate millions of events.
Performance — Along with being usable, Instant*Insight would need to be responsive. It had to serve up raw and aggregated data in seconds. Such investigations are required to support queries for time periods such as a day, a month, or even a quarter.
To tackle these challenges, we sought to design a new, real-time index multitenant architecture from the bottom up. The architecture had to have an intuitive front-end user, storage that would be able to both index and serve tremendous amounts of data, and a back-end implementation that should be able to process real-time data while supporting high concurrency on sample and aggregation queries.
After some sketches and brainstorming, the team started working on the implementation. We decided to use Apache Storm to process the incoming metadata fed by the 60+ Cato PoPs across the globe. The data would be indexed by Elasticsearch (ES). ES queries would be generated at a Kibana-like interface and first validated by a lightweight backend. Let’s take a closer look.
[caption id="attachment_13022" align="aligncenter" width="610"] The Instant*Insight Architecture[/caption]
As noted, Cato has 60+ PoPs located around the globe providing us with real-time, multitenant metadata. To keep up with the increasing pace of incoming data, we needed a real time processing system that could scale horizontally to accommodate additional events from new PoPs. Apache Storm was chosen for the task.
Storm is a highly scalable, distributed stream processing computation framework. With Storm, we can increase the number of sources (called “spouts” in Apache Storm) of incoming metadata and operations (called “bolts” in Apache Storm) that are performed at each processing step all of which can be eventually scaled on top of a cluster.
In this architecture, PoP metadata is transferred to a Storm cluster (using Storm spouts) via enrichment and aggregation bolts that eventually output events to an ES cluster and queueing alerts (emails) to be sent.
Enrichment bolts are a key part in making the UX responsive. Instead of enriching data when reading events (an expensive operation) from the ES cluster, we enrich the data at write time. This both saves query time and is better for using ES abilities to aggregate data (instead of enriching when querying from ES). The enrichment adds metadata that is taken from other sources such as our internal configurations database, IP2Location database, and merges more than ~250 fields to ~60 common fields for all event types.
ES Indexing, Search and Cluster Architecture
As one might assume, the ES cluster should continuously index and serve the events to Instant*Insight users on demand. This allows for real-time visibility of the SASE platform. It was clear from the beginning that we needed ES time-based indices and we started with the widely recommended hot — warm architecture. Such an approach optimizes hardware and machine types for their tasks. An index per day approach is an intuitive structure to choose with this architecture.
The types of AWS machine configurations that we chose were:
Hot machines — These machines have optimal disks for fast paced indexing of large volumes of data. We chose AWS r5.2xlarge machines with io1 SSD EBS volumes.
Warm machines — These machines provide large amounts of data that need to be optimized for throughput. We chose AWS r5.2xlarge machines with st1 HDD EBS volumes.
To achieve fast-paced indexing, we gave up on replicas during the indexing process. To avoid potentiation data loss, we put some effort on a proprietary recovery path in case of a hot machine failure. We only created replicas when we moved the data to warm machines. We achieved resiliency in-house, keeping the incoming metadata as long as the data is in a hot machine (~1-day old metadata) and was not replicated on the warm machines. This means that in case of a hot machine failure, we would have to re-index the lost data (events of 1 day).
The Instant*Insight backend bridges between the ES cluster and the front-end. The back-end is a very lightweight stateless implementation that does a few simple tasks:
Authorization enforcements to prevent non-authorized users from accessing data.
Translation of front-end queries into ES’s Query DSL (Domain Specific Language).
Recording traffic levels and performance for further UX improvements.
For the front-end, we wanted to provide a Kibana-like discovery screen since it is intuitive and the unofficial UX standard for SIEMs. Initially, we checked if and how Kibana could be integrated into our management console. We soon realized that it would be easier to develop our own custom Kibana-like UX for several reasons. For one, we need our own security and permission protocols in place. Another reason is because fields on the side of our screen have different cardinality and aggregations from those in Kibana’s UX. Our aggregations show full aggregative data instead of counting the sample query results. Also, there are differences in requirements of a few minor features like the behavior that the histogram should be aligned across other places in our application.
[caption id="attachment_13020" align="aligncenter" width="1509"] Figure 2 Instant*Insight is based on the familiar Kibana interface. Both use (1) time-based histograms, (2) display the current query in a search bar, (3) show event fields that can be selected for analysis, and (4) display the top results.[/caption]
Development of the front end was not difficult, and within a few working days we had a functional SIEM screen that worked as expected on our staging environment. Ready to go to production, code was merged to master, and we started playing internally with our new amazing feature!
Not so surprisingly, the production environment tends to be a bit different from the staging one. Even though we performed some load and pressure tests on our staging during development, in production, a much greater data volume led the screen to behave worse and the user experience wasn’t as expected.
ES, UX and How They Interact
To understand what happened in production, one must better understand the UX requirements and behavior. When the screen loads, the Instant*Insight front-end should show the previous day’s events and aggregations for all fields for some tenant (Kibana Discover style). In this, the front-end performed well. But when we tried loading a few weeks' worth of data, the screen could take more than a minute to render.
The default approach, having an index in ES comprised of multiple shards (self-contained Lucene indices), with each capable of storing data on all tenants, was found to be far from optimal. This is because requests were likely to be spread across the entire ES cluster. A one-month request, for example, requires fetching data from 30 daily indices multiplied by the number of shards in each index. Assuming there are 3 shards per index, 90 shards will need to be fetched (If we have less than 90 machines — it is the entire cluster). Such an approach is not only inefficient but also fails to capitalize on ES cache optimizations, which provide a considerable performance boost. Having an entire cluster that is busy with one tenant’s request obviously makes no sense. We had to change the way we index.
Developing a More Responsive UX and a Better Cluster Architecture
After further research on ES optimizations and more board sketches, we made some minor UX changes allowing users to operate on an initial batch of data while delivering additional data in the background. We also developed a new ES cluster architecture.
While the front-end had to be responsive, we understood that some delay was inevitable. There’s no avoiding the fact that ES needs time to return query results. To minimize the impact, we divided data queries into smaller requests, each causing different work to be done on the ES cluster.
For example, suppose a user requests event data from a given timeframe, say the past month. Previously, the entire operation would be processed by the cluster. Now, however, that operation is divided into four requests fulfilled according to optimum efficiency. The top 100 samples, the simplest request to fulfil is fulfilled first. In less than a second, users see the top 100 samples on their screens from which they can start working. Meanwhile, the next two requests — 60 fields cardinality and time histogram with groups for five different types of events — are processed in the background. The last one, field aggregations, is only queried when the user expands the field to reduce load on the cluster — there is no point fetching top usages of all fields we have while user will be interested on only few of them and this is one of the more expensive operation we have.
ES Cluster Architecture Improvements
To improve cluster performance, we needed to do two things — reduce the number of shards per search for maximum performance and improve cache usage. While the best approach would probably be to create a separate shard for each tenant, there is a memory footprint cost (~20 shards in a node for a 1GB heap). To minimize that impact, we first turned to a routing feature that lets us divide data within index shards by tenant. As a result, the number of queried shards is reduced to the number of queried indices. This way instead of fetching 90 shards to serve one month’s query, we would be able to fetch 30 shards, a massive improvement but one that would be less than optimal for anything less than 30 machines.
The next thing we did was to extend the indices’ duration with more shards without increasing (actually decreasing) the number of shards in the cluster. For example, we had a total amount of 21 shards (7 indices * 3 shards each) in the cluster for one week, by changing to weekly indices with 7 shards each, we end up with 1 weekly index) * 7 shards = 7 shards only for a week. A query of one month will now end up fetching 5 shards (in 5 indices) only, which requires a much smaller amount of computational power. One thing to note: querying a day’s data we’ll scan one-week of shards but since our users tend to fetch extended periods of data (weeks), it is very suitable for our use case (see Figure 3).
[caption id="attachment_13018" align="aligncenter" width="457"] Figure 3 By refining how the ES cluster architecture, we improved retrieval efficiency as can be seen later in figure 4[/caption]
We need to find and measure the minimum number of shards that will provide the best performance given our requirements. Another powerful benefit of separating tenants into different shards is that it leads to better utilization of the ES cache. Instead of trying to fetch data from all shards, which constantly reloads the cache and degrades cluster performance, now we only need to fetch data from necessary shards, improving ES’s cache performance.
Addressing New Architecture Challenges
Optimizing the index structure for performance and improving the user experience, introduced new availability challenges. Our selected ES hot-warm architecture became less suitable for two reasons:
Optimization of hot and warm machines from a hardware perspective is not a clear cut as before. Previously, “hot” machines performed most of the indexing and, as such, were optimized for network performance. The “warm” machines performed most of querying and, as such, were optimized for throughput. With the new approach, all machines are doing both functions — indexing and querying.
Since we must index multiple days on the same index, our own recovery path is not a good option anymore because we’ll have to store data of days or weeks in-house. It will also be very hard to recover because of the amount of data we would need to re-index in the event of a failure.
As a result, the new architecture machines are now both indexing and handling queries while also writing replicas.
Results and details of the benchmarking we performed on real production data with respect to what is required from the UI behavior perspective can be found at the tables below. The benchmarking was always done when changing one parameter, what we called the “Measured Subject.” Each was evaluated by the time (in seconds) need to retrieve five types of measurement:
Sample — A query for 100 sample results, which may be sorted and filtered per a user’s requirements.
Histogram — A query for a histogram of events grouped by types.
Cardinality — An amount of unique values of a field.
High Cardinality (top 5) — The top 5 usages of a high cardinality field.
Low Cardinality (top 5) — The top 5 usages of a low cardinality field.
While all queried information is important from a UX perspective, our focus is on allowing the user to start interacting and investigating as fast as possible. As such, our approach prioritized fulfilling sampling and histogram queries.
Below in Figures 4-6 are the detailed results. We define a “Measured Section” to be a measured subject with one of two period of times, two or three weeks. The volume column represents event amounts in millions. The orange cells in this column mark cached requests. Each of the figures represent one measured Subject. Each Subject is evaluated on two axes, for example, in figure 4 we compare daily index and weekly index performance while still using remote throughput optimized disks. Results are color coded from low response times (green) to high response times (red). Combining this information with the prior (cache-colored rows shown in orange) we can see how ES cache performs and (usually) improves performance. The “Diff” section compares the Measured Subjects. Those scores highlighted in green indicate that the Weekly Index is better; those scores highlighted in red indicate that the Daily Index is better. The numbers represent the percentage of increase in performance when moved from daily to weekly index. percent of change.
The first thing to do was to check how ES index structure (and routing) can impact performance. As can be seen in figure 4, the ES index structure gave a tremendous boost. For example, a histogram query for 3 weeks was reduced from 46.55 seconds to only 4.06, and with the cache in play, from 32.57 to only 1.26 seconds as can be seen on lines 1 and 3.
Both daily & weekly indices were tested running on the Hot-Warm architecture while having two Hot (r5.2xlarge + io1 SSD) machines and three Warm (r5.2xlarge + st1 HDD) machines. Understanding the big effect of how we index the events, we moved to benchmark other things. However, we will have to return to see if these are the most suitable index time frames and number of shards per index.
Next step was to check how moving from Hot-Warm architecture can impact performance. Running benchmarks led us to understand that even with lower throughput SSD disks, performance is still better when disks are local rather than remote. We’ve compared the previous Hot-Warm architecture to 5 Hot (i3.2xlarge) machines each having a local 1.9TB SSD. While performance is better, this still leads to another limitation to be considered: We can’t dynamically increase storage on machines.
[caption id="attachment_13014" align="aligncenter" width="468"] Figure 5 comparing performance of remote throughput optimized disk vs local disks[/caption]
So now knowing that local disks perform better, we had to select what kind of machines are the most suitable. The 10 new generation (I3en.xlarge) small machines with a 2.5TB local SSD didn’t perform better than the previous 5 (i3.2xlarge) machines.
Five new generation machines (I3en.2xlarge) with a 2x2.5TB local SSD didn’t perform any better than the previous 5 for the most valuable query (top 100 events). Remember, this query response allows the end-user to start working.
[caption id="attachment_13012" align="aligncenter" width="468"] Figure 6 comparing performance of different machine types[/caption]
After defining our new architecture, we also benchmarked a different number of shards and time frames per index while making sure the number of total shards in the cluster will remain close between the benchmarks. We increased from one weekly index with seven shards to bi-weekly index with 14 shards, three weeks with 21 shards etc. (We added 7 shards for each week addition to an index.) Eventually, after checking a few different index time periods and shard amounts (up to a 1-month index) we found out that the (original) intuition of weekly index performed best for us and probably also the best with respect to our events retention policy (We need to be able to delete historical data with ease and weekly indices allows deletion of entire weeks.)
The measurements performed brought us to select a weekly index approach to be managed on top of i3.2xlarge machines with local 1.9TB NVMe SSDs. The table below shows how the changes impacted our ES cluster:
Metric / Index
Second (improved) architecture
Index time range
Shards in index
Total shards for year
1095 + replica = 2190
~365 + replica = ~730
5 + coordinator
10 + coordinator
Between 10GB and 20GB
Shards per node for year
We dramatically decreased the amount of shards (by 3 times) while also increasing one shard size. Decreasing the amount of shards with more data in each one allows us to increase stored data at one node by up to 200% . (There is a limit of ~20 shards per 1GB of heap memory and we now have less shards with more data.). Bigger shards also improved overall performance in our multitenant environment because querying smaller number of shards for the same data as time range as before will release resources for next queries faster, queuing fewer operations. . Shard sizes varies because it depends on the number of events of the accounts it contains and unfortunately our control on how to route is very limited with ES. (ES decides by itself how to route between shards in index with some hash function on configured field.)
Since building the Instant*Insight feature, we already had to increase to 20 shards and added few machines to the cluster to accommodate expansions in the Cato network and increased the number of events. Responsiveness continues to be unchanged.
One should note that what is described in this article is the most suitable design for our needs; it might perform differently for your requirements. Still, we believe that what we have learned from the process can give some points of how to approach a design of ES cluster and what things are to be considered when doing so. For further information about optimizing ES design, read the Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide.
Earlier this week, the city of Oldsmar, FL reported a breach of their water supply system resulting in a water poisoning attempt that was luckily...
February 12, 2021
Threat actors are testing the waters with (not so) new attacks against ICS systems Earlier this week, the city of Oldsmar, FL reported a breach of their water supply system resulting in a water poisoning attempt that was luckily detected and mitigated. ICS (Industrial Control Systems) have been the target of threat actors for years now due to their remote connectivity needs combined with the lack of security monitoring, detection and management controls modern IT infrastructures utilize. Such “low risk, high reward” targets will continue to draw adversarial attention, and unfortunately, it is not hard to find these vulnerable critical infrastructure systems.
“I told you this would happen!” While some people take joy in pointing out they were right (siblings and parents come to mind first), this sentence is said in disappointment and sadness when it comes from most cybersecurity experts. It points to a failure, an educational, procedural, or technological one —it doesn’t matter. The press conference held by the city officials of Oldsmar confirmed an attack cyber security experts have been discussing and demoing for years now – a remote access to an ICS that could result in the loss of life. City officials assured reporters that there are redundancies in place that would have prevented this attack from materializing, but the threat is clear.
Let’s start with some cold, hard facts: Operating and managing an ICS (and SCADA systems even more so) is hard work. As with any business – there are financial constraints, talent shortage, regulations and more. Add to that mix remote locations of systems and software and hardware that may be older than some of its operators and you get a volatile mix. Some of the tasks enterprises perform each week are mind boggling for me when it comes to ICS. Just thinking of how one would patch operating systems at an active working oil refinery or deploy a new security policy on a remote gas line makes me sweat. Some of these systems were designed and deployed well before any Internet connectivity was an option. Since then, the need for easy management grew and remote administration tools, usually in the form of VNC (Virtual Network Connection) and RDP (Remote Desktop Protocol) (which is based on RFB – Remote Framebuffer), have been deployed. This was also the case at Oldsmar, where TeamViewer was used for remote control of the water supply systems.
Gaining unauthorized access to resources simply by finding them exposed online, was popularized in 2005 by Johnny Long at DEFCON’s “Google Hacking for Penetration Testers.”. Since then new tools and services have been made publicly accessible. One such tool is Shodan. Shodan allows users to search for specific devices and services using a GUI and an easy querying language. For example, if a certain facility wishes to use RFB (Remote Framebuffer) for remote administration of systems but has failed to secure that access, the facility can be easily found on Shodan and other similar services. Below are just three screenshots of such systems that can currently be found on Shodan. These are just screenshots of what is currently displayed on the system’s screen, but simply pointing software that supports RFB to their IP will give anyone control over it (Needless to say, this would be deemed an unauthorized access attempt and is a punishable felony. Simply put – don’t do it).
Industrial control systems for various facilities:
Unfortunately, many of these systems have minimal, if any, security monitoring, detection and response software and services. With a growing need for remote administration, be it due to the pandemic or due to the distance between the operator and the system, more and more ICS systems become available for anyone who knows how to search for them. Such facilities may be forced to rethink remote access, not just in how to deploy and use it but also in how to monitor and alert on suspicious activities as you would with any IT infrastructure. Lives may literally depend on it.
The Emotet botnet was taken down last week thanks to a coordinated international effort. Considered one of the most prolific malware botnets, Emotet evolved from...
Emotet Botnet: What It Means for You The Emotet botnet was taken down last week thanks to a coordinated international effort. Considered one of the most prolific malware botnets, Emotet evolved from a banking trojan to a pay-per-infection business, showcasing advanced spreading techniques. While we might see a dip in global malware infections in the short term due to the takedown of the backbone infrastructure, there is little to no doubt that the operators and masterminds behind Emotet will return in some form.
A coordinated multinational Europol effort has successfully taken down the Emotet infrastructure, the backbone of a botnet operation that infected millions of computers worldwide and has caused damages estimated from several hundreds of millions of dollars to $2.5B. In a video released from the raid of a Ukrainian Emotet operation center, viewers can see the hardware used by the criminals as well as cash from different countries, passports and bars of gold and silver.
Emotet has emerged as a banking trojan in 2014 but has since evolved, both on the technical as well as the business side. While maintaining its data stealing capabilities, Emotet’s business has expanded to a pay-per-infection service. Once Emotet infected a large enough number of computers it became a loader for other types of malware (think about it like a NASCAR racer who gets paid for putting stickers from different sponsors on their car – other cybercrime groups pay the Emotet team to spread their malware) and Emotet’s operations changed from pure banking malware to Malware as a Service (MaaS).
How Emotet Was Removed
Emotet’s takedown is not the first, nor the last, botnet takedown operation. It takes a significant amount of time and effort to takedown a botnet as well as a decision that taking it down will be more beneficial than monitoring and studying it. From past experience with botnets and cybercrime forums we know that a takedown ultimately results in an evolution in the operator’s capabilities, such was the case with multiple dark markets like Deep Dot Web as well as with malware like Trickbot.
When a law enforcement agency decides to take down such an operation it targets one (or more) of three components – People, Process and Technology – which are common to any business. In Emotet’s case the main target was the technology, the infrastructure used by the botnet for command and control. Other botnets were stopped by having their operators arrested, such was the case with Mirai, Satori and multiple DDoS botnets. It is worth noting that two Emotet operators have been arrested and are facing 12 years in prison.
The third component is the process, several botnets have been taken down by creating a kill switch for the malware that was distributed. These sinkhole tactics for botnet takedown are a good example of malware evolution as they were the trigger for the creation of P2P botnets (a decentralized method for botnet operation, GameOver Zeus being a prime example of one such botnet).
The Impact of Emotet Removal
For all the above reasons the biggest effect of the Emotet takedown will be in the short term. With the Emotet MaaS business gone, enterprises and individuals will suffer less malware infections. This means less info stealers, less Ransomware and spam bots. The flip side, in the short term, is that the operators of the malware that was spread by Emotet may fear their malware will be identified as well and may shift gears with their attacks and Ransomware demands. However, in the long term, it is highly unlikely that the masterminds behind this operation will decide to change their ways and become law abiding citizens. Chances are they will take their time to regroup and prepare their future criminal activities.
Emotet and other malware use various techniques to evade detection. These includes polymorphism (the process of having a different signature for every infected bot), WiFi infection vectors and malicious attachments amongst others. EDR alone is no match for advanced attacks and malware. Cato Network’s Shay Siksik has detailed the need for MDR in his blog about Sunburst. The future of fighting these types of threats starts with changing the current, siloed, point solution approach for cyber security to a converged, shared context solution architecture.