Lessons I’ve Learned While Scaling Up a Data Warehouse

Building and maintaining a data warehouse is not an easy task; many questions must be answered before choosing a technology. For example: What are your use cases? These may change over time, for instance involving on-demand aggregations, ease of search, and data retention. What type of business-critical questions will you need to answer? How many users will it serve?

In this post, we will cover the main scale obstacles you might face when using a data warehouse, what you can do to overcome these challenges in terms of technological tools, and whether it pays to build these tools in-house or to use a managed service. Addressing these challenges can be very important for a young startup, whose data is just starting to pile up while questions from different stakeholders pop up, or for an existing data warehouse that has reached its infrastructure limit.

[boxlink link="https://www.catonetworks.com/resources/migrating-your-datacenter-firewall-to-the-cloud/?utm_source=blog&utm_medium=top_cta&utm_campaign=migrating_data_center?utm_source=blog&utm_medium=top_cta&utm_campaign=Cato_SASE_Cloud"] Migrating your Datacenter Firewall to the Cloud | Whitepaper [/boxlink]

Comparing ELK vs. Parquet, S3, Athena, and EMR in AWS

To set the scene: while using ELK, we reached a 90TB cluster of multiple data nodes, a master, and a coordinator. Those 90TB represented 21 days of data. Aggregations took a long time to run and, most of the time, failed completely. ELK's disks were the best, and most expensive, AWS had to offer. Moving to Parquet, S3, Athena, and EMR allowed us to store more than double the timeframe for the same storage volume, while dramatically reducing costs and extending our capabilities. I will explain the differences between these technologies and why you should consider choosing one over another.
[caption id="attachment_18272" align="alignnone" width="1228"] Figure 1: Benchmark comparing ELK vs. a Parquet-based data warehouse. Our conclusion: with Parquet, we could achieve more and pay less, while having more data when needed.[/caption]

Self-Managed ELK – The Classic Choice

Many will choose ELK as their data warehouse technology. The initial setup is fairly easy, as is data ingestion. Kibana can help you explore the data, its different data types and values, and create informative aggregate dashboards to present ideas and stats. But when it comes to scale, this technology can become challenging and create a great deal of overhead and frustration.

Scale Problems

The problems with ELK start with aggregations. As data volumes grow, aggregations become heavy tasks, because Elasticsearch calculates an aggregation on a single node, making it harder for ELK to deal with large amounts of data. This means that if you need aggregate tables over time, you must aggregate during processing.

Overhead of managing a cluster on your own

As data volumes grow, managing a cluster on your own can become a very big headache. It requires manual work from your SRE team and can sometimes lead to the worst outcome: downtime. Managing a cluster on your own may include the following:

- Managing the disks and their volume types
- Adding capacity, which requires adding nodes to your ELK cluster
- Rebalancing skewed data: depending on your original partitioning scheme, more data may reach one node than another, forcing you to manually configure data balancing between shards

A managed service, rather than a self-managed one, can be considered expensive, but it can also save you these efforts and their price (financial or mental).

The Alternative: Parquet and Why It's So Important

When it comes to scale, the Parquet file format can save the day.
Parquet is a columnar file format: you can read each column on its own instead of reading the entire file. Reading just a column lets the query engine invest fewer resources by scanning less data. Parquet is also compressed, and can get as low as 10% of the size of the equivalent JSON, which matters greatly when it comes to storage. Scanning a Parquet file does not force the query engine to decompress the entire file in advance; compression is applied per column, so your storage keeps compressed Parquet files and the query engine decompresses only the selected columns.

For us at Cato Networks, moving to Parquet meant we could use the same storage volume to store up to three times more data, in terms of timeframe, than we could with ELK, while reducing our costs by 50%. Many distributed query engines now support Parquet; for instance, Presto and Druid both work with it.

[caption id="attachment_18279" align="alignnone" width="876"] Figure 2: The Parquet file format structure is essential in gaining scaling efficiencies. Data is divided into row groups and columns, with respective metadata parts used for efficient file scanning[/caption]

Our Approach: S3, Athena, and EMR

We gave up our self-managed ELK for a combination of S3 with Athena and EMR in AWS. We converted the JSON files that used to head for ELK into Parquet and uploaded them to S3. AWS then offers a few methods for accessing the data.

Athena

Athena is a managed query service offered by AWS. It uses Presto as its underlying engine and lets you query files you store on S3. Athena also works with many other file formats, such as CSV or JSON, but these can introduce serialization overhead; using Athena with Parquet yields optimal query results. Every query you execute gets the computing resources it needs.
The data is automatically distributed among nodes behind the scenes, so you don't have to worry about configuring anything manually.

EMR

EMR is another managed service offered by AWS that lets you instantly create clusters to execute Spark applications, without any configuration overhead. Since your data is on S3, it even saves you the overhead of configuring and managing HDFS storage. EMR is a great option if you've ever considered Spark but couldn't, or wouldn't, invest the resources required to bring up such a heavy cluster. Spark is a great addition to a data warehouse, yet it is relatively hard, or even impossible, to use while your data sits in Elasticsearch storage.

Utilization

Athena and EMR can sometimes serve the same use cases, but they differ in many ways. With EMR, data can be persisted on disk or in memory to avoid reading it more than once. This is not an option in Athena, so multiple queries result in recurring API calls for the same Parquet files. Another difference, in terms of usability, is that Athena is driven purely by SQL syntax, while Spark on EMR requires writing code, in Python, Java, or Scala, all of which demand a wider context than SQL. Additionally, Spark requires some configuration of nodes, executors, memory, and so on; if these are not selected correctly, they can lead to out-of-memory (OOM) errors. Athena queries can also end with an "exhaustion" message, but that only means you need to scan less data; there is nothing else you can do about it.

Wrapping Up

Using all of the above, we overcame many scaling problems, extended our research and data mining capabilities, and saved many hours of manual work thanks to a fully managed service. We can store our data for longer periods, touch old data only when needed, and do practically whatever we want with the newest data at all times.
Moving to a new technology and retiring old code and infrastructure can be challenging; it calls into question why the manual work was done in the first place. Although challenging, it is an effort worth finding resources for. The technology you use should evolve together with the scale of your data.
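The columnar layout and per-column compression that made Parquet such a win can be illustrated with a small, self-contained sketch. To be clear, this is not real Parquet (in practice you would write it with a library such as pyarrow); it simply mimics the idea that each column is stored and compressed independently, so reading one column never touches the bytes of the others. The field names and record shape are invented for illustration:

```python
import json
import zlib

# Row-oriented baseline: one JSON document per record, like the files we shipped to ELK.
rows = [{"src_ip": f"10.0.0.{i % 256}", "bytes_sent": i * 100, "action": "allow"}
        for i in range(1000)]
row_blob = "\n".join(json.dumps(r) for r in rows).encode()

# Column-oriented: each field is stored and compressed on its own,
# mimicking how Parquet lays out and compresses column chunks.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_blobs = {name: zlib.compress(json.dumps(vals).encode())
             for name, vals in columns.items()}

def read_column(name):
    """Decompress only the requested column; the other blobs stay untouched."""
    return json.loads(zlib.decompress(col_blobs[name]))

total_columnar = sum(len(blob) for blob in col_blobs.values())
print(total_columnar < len(row_blob))  # -> True: columnar + compressed is far smaller
print(read_column("action")[0])        # -> allow
```

The repetitive values within a column (here, a constant "allow") compress far better together than when interleaved with other fields, which is the same effect that let us keep triple the timeframe in the same storage volume.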

The Latest Cyber Attacks Demonstrate the Need to Rethink Cybersecurity

Cyberattacks are on the rise, and more and more enterprises fall victim to attacks every day. Take, for example, the recent high-profile attacks on Gedia, a German automotive parts manufacturer, and Travelex, a foreign currency exchange enterprise. Both businesses experienced disruption and claimed the attacks came from a known criminal group, the same group behind a series of attacks on companies using sophisticated file-encrypting malware known as Sodinokibi or REvil. The criminal group also threatened to publish sensitive data from the car parts supplier on the internet unless a ransom was paid. Simply put, businesses are becoming attractive targets for digital extortion and are now on the radar of organized crime and criminal groups looking to make a quick buck off the misery they can create.

Both attacks demonstrate how vulnerable today's businesses are when they are connected to the public internet and adequate protection is not deployed. It is speculated that the attack on Travelex became possible because the company had failed to patch vulnerable VPN servers. This is important to note, especially since NIST's National Vulnerability Database has published over 100 new CVEs (Common Vulnerabilities and Exposures) for VPNs since January 2019, indicating that there may be many unpatched VPN servers in use today. Even more troubling is the fact that the root cause of the Gedia attack has yet to be discovered, which means the security flaw may still exist.

Speculation aside, the root cause of most cyberattacks can be traced back to unpatched systems, phishing, malicious code, or some other weak link in the security stack, such as compromised credentials. Knowing the root cause is only one part of the cybersecurity puzzle. The real question becomes: what can be done to prevent such attacks?
The Cato Approach

Here at Cato Networks, we have developed a solution to the security problem of unpatched VPN servers. Remote users are just another "edge", along with branches, datacenters, and cloud resources, all connected and secured by the Cato global network. As the first implementation of Gartner's SASE (Secure Access Service Edge) architecture, the Cato infrastructure is kept up to date, and customers do not have to worry about applying patches or installing server-side security software; we take care of all of that.

We also address the shortcomings of VPNs. Our client software replaces a traditional VPN and uses a Software-Defined Perimeter (SDP) to allow only authorized users to access the private backbone. Once connected to Cato, users are protected from attack, as Cato inspects all network traffic, attachments, and files. Since most ransomware enters a network after a phishing attack or when a user downloads software from an embedded link, our platform detects the malicious code and the associated SMB traffic and prevents the lateral movement of the malicious code. We can also detect and block traffic flows that attempt to contact the external addresses used by malicious software.

Ransomware and other malicious code can only impact an organization if it has a way to get onto the network. Our platform eliminates the ability of malicious code to enter the network, thus defeating ransomware and other threats. Our SASE platform identifies known malicious code while it is in transit and blocks it, and since our platform is a cloud service, it is constantly updated with the latest cybersecurity information; we take care of everything on the backend, including any patching or other updates. In addition, Cato protects against the spread of ransomware and other types of malware introduced into a host by means outside the secured Cato network.
For example, users may introduce malware into their systems by connecting across a public Wi-Fi network (without using the Cato mobile access solution) or by inserting an infected thumb drive. Regardless of how a host becomes infected, Cato detects and blocks the spread of malware by spotting anomalous traffic patterns. Cato monitors normal user behavior, such as the number of servers commonly accessed, typical traffic load, regular file extensions, and more. Once an anomaly occurs, Cato can either notify the customer or block various types of traffic from the abnormal host.

Cato also offers an additional layer of protection: since our server-side software is not available to the public (unlike VPN server software), hackers do not have access to the code to create exploits for the system. And since we handle everything on the backend, administrators no longer have to worry about maintaining firewalls, setting up secure branch access, or deploying secure web gateways; all of those elements are part of Cato's service offering, helping to further reduce administrative overhead.

By moving to an SD-WAN built with SASE, enterprises can make most of their cybersecurity problems disappear. For more information, check out our advanced threat protection and get additional information on our next generation firewall.

Cato Develops Groundbreaking Method for Automatic Application Identification

New applications are identified faster and more efficiently by using data science and Cato's data warehouse

Identifying applications has become a crucial part of network operations. Quickly and reliably identifying unknown applications is essential to everything from enforcing QoS rules and setting application policies to preventing malicious communications. However, legacy approaches to application classification have become ineffective, too expensive, or both.

In the past, SD-WAN appliances and firewalls identified applications largely by relying on transport-layer information, such as the port number. This approach is no longer sufficient, as applications today employ multiple port numbers, run over their own protocols, or both. As a result, accurately classifying applications has required reconstruction of application flows. Indeed, next-generation firewalls have become application-aware, identifying applications by their protocol structure or other application-layer headers to permit or deny unwanted traffic. Reconstructing application flows, though, is a processor-intensive process that does not scale. Many vendors have resorted to manual classification, a labor-intensive process that involves hiring many engineers. It's costly, lengthy, and limited in accuracy. Ultimately that impacts product costs, the customer experience, or both.

Cato Uses Data Science to Automatically Classify New Applications

Cato has developed a new approach for automatically identifying the toughest types of applications to classify: new apps running over their own protocols. We do this by running machine learning algorithms against our data warehouse of flows, a repository built from the billions of traffic flows crossing the Cato private backbone every day. We use that repository to classify and label applications based on thousands of datapoints derived from flow characteristics.
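To make the idea concrete, here is a toy sketch of the kind of unsupervised grouping involved: flows reduced to feature vectors and clustered, so flows from the same unknown application fall into the same group. The features (average packet size, average inter-packet gap), the data, and the choice of a simple k-means are illustrative assumptions, not Cato's actual feature set or pipeline:

```python
def kmeans(points, centroids, iters=10):
    """Naive k-means: assign each point to its nearest centroid, then move centroids."""
    groups = {}
    for _ in range(iters):
        groups = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep old centroid if empty).
        centroids = [tuple(sum(vals) / len(g) for vals in zip(*g)) if g else centroids[c]
                     for c, g in groups.items()]
    return centroids, groups

# One (avg packet size in bytes, avg inter-packet gap in ms) vector per flow.
flows = [(1400, 5), (1380, 7), (1420, 6),   # bulk-transfer-like application
         (80, 200), (95, 190), (70, 210)]   # chatty keep-alive-style application
centroids, groups = kmeans(flows, centroids=[flows[0], flows[3]])
print(len(groups[0]), len(groups[1]))  # -> 3 3
```

Each resulting cluster can then be inspected and labeled once, and the label applied to every flow that lands in it.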
Those application labels, or AppIDs, are fed into our management system. With them, customers can categorize what was once uncategorized traffic, gaining deeper visibility into their network usage and, in the process, creating more accurate network rules for managing traffic flows.

To learn more about our approach and the data science behind it, click here to read the paper. For insight into our security services, click here, or click here to learn about the Cato Managed Threat Detection and Response service.

How to Identify Malicious Bots on your Network in 5 Steps

It's no secret that malicious bots play a crucial role in the security breaches of enterprise networks. Bots are often used by malware for propagation across the enterprise network. But identifying and removing malicious bots is complicated by the fact that many routine processes in an operating environment, such as software updaters, are also bots. Until recently there hasn't been an effective way for security teams to distinguish between such "bad" bots and "good" bots. Open source feeds and community rules purporting to identify bots are of little help; they contain far too many false positives. In the end, security analysts wind up fighting alert fatigue from analyzing and chasing down all of the irrelevant security alerts triggered by good bots.

At Cato, we faced a similar problem in protecting our customers' networks, and to solve it we developed a new approach: a multi-dimensional methodology, implemented in our security as a service, that identifies 72% more malicious incidents than would have been possible using open source feeds or community rules alone. Best of all, you can implement a similar strategy on your network. Your tools will be the stock-in-trade of any network engineer: access to your network, a way to capture traffic, such as a tap sensor, and enough disk space to store a week's worth of packets. Here's how to analyze those packet captures to better protect your network.

The Five Vectors for Identifying Malicious Bot Traffic

As I said, we use a multi-dimensional approach. Although no single variable can accurately identify malicious bots, the aggregate insight from evaluating multiple vectors will pinpoint them. The idea is to gradually narrow the field from sessions generated by people to those sessions likely to indicate a risk to your network.
Our process was simple:

1. Separate bots from people
2. Distinguish between browsers and other clients
3. Distinguish between bots within browsers
4. Analyze the payload
5. Determine a target's risk

Let's dive into each of those steps.

Separate Bots from People by Measuring Communication Frequency

Bots of all types tend to communicate continuously with their targets, since bots need to receive commands, send keep-alive signals, or exfiltrate data. A first step in distinguishing between bots and humans, then, is to identify those machines repeatedly communicating with a target. And that's what you want to find: the hosts that communicate with many targets periodically and continuously. In our experience, a week's worth of traffic is sufficient to determine the nature of client-target communications. Statistically, the more uniform these communications, the greater the chance that they are generated by a bot (see Figure 1).

Figure 1: This frequency graph shows bot communication in mid-May of this year; notice the completely uniform distribution of communications, a strong indicator of bot traffic.

Distinguish Between Browsers and Other Clients

Simply knowing a bot exists on a machine won't help very much: as we said, most machines generate some bot traffic. You then need to look at the type of client communicating on the network. Typically, "good" bots live within browsers while "bad" bots operate outside the browser. Operating systems have different types of clients and libraries generating traffic; for example, "Chrome," "WinInet," and "Java Runtime Environment" are all different client types. At first, client traffic may look the same, but there are ways to distinguish between clients and enrich our context. Start by looking at application-layer headers.
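Before digging into headers, the communication-frequency test above can be sketched in a few lines: compute the gaps between consecutive requests from a client to a target, and flag pairs whose gaps are near-uniform. The function name, the coefficient-of-variation measure, and the 0.1 cutoff are illustrative assumptions, not Cato's implementation:

```python
import statistics

def looks_periodic(timestamps, cv_threshold=0.1):
    """Flag a client-target pair whose inter-request gaps are near-uniform.

    A low coefficient of variation (stdev/mean of the gaps) means clockwork-like
    timing, a strong bot indicator. The 0.1 cutoff is an illustrative choice.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    cv = statistics.pstdev(gaps) / statistics.mean(gaps)
    return cv < cv_threshold

bot_like = [t * 60 for t in range(100)]      # one request a minute, like clockwork
human_like = [0, 7, 95, 120, 610, 640, 900]  # bursty, irregular browsing (seconds)
print(looks_periodic(bot_like), looks_periodic(human_like))  # -> True False
```

Run over a week of captures, this narrows your week's worth of sessions down to the repeatedly communicating hosts worth examining further.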
Since most firewall configurations allow HTTP and TLS to any address, many bots use these protocols to communicate with their targets. You can identify bots operating outside of browsers by identifying groups of client-configured HTTP and TLS features. Every HTTP session has a set of request headers defining the request and how the server should handle it. These headers, their order, and their values are set when the HTTP request is composed (see Figure 2). Similarly, TLS session attributes, such as cipher suites, the extensions list, ALPN (Application-Layer Protocol Negotiation), and elliptic curves, are established in the initial TLS packet, the "client hello," which is unencrypted. Clustering the different sequences of HTTP and TLS attributes will likely reveal different bots. Doing so will, for example, let you spot TLS traffic with unusual cipher suites, a good indicator that the traffic is being generated outside the browser, a very non-human-like pattern and hence a good indicator of bot traffic.

Figure 2: Here's an example of a sequence of packet headers (separated by commas) generated by a cryptographic library in Windows. Changes to the sequence, keys, and values of the headers can help you classify bots.

Distinguish Between Bots within Browsers

Another method for identifying malicious bots is to look at specific information contained in HTTP headers. Internet browsers usually present a clear, standard header profile. In a normal browsing session, clicking a link within a browser generates a "referrer" header that is included in the next request for that URL. Bot traffic will usually lack a "referrer" header or, worse, forge it. Bots that look identical in every traffic flow are likely malicious.

Figure 3: Here's an example of referrer header usage within the headers of a browsing session.

The user-agent is the best-known string representing the program initiating a request.
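The header-sequence clustering described above can be sketched simply: fingerprint each session by the ordered tuple of header names it sends, then count how many sessions share each fingerprint. The session data below is invented for illustration; real fingerprints would also fold in TLS attributes such as cipher suites and extensions:

```python
from collections import Counter

def fingerprint(headers):
    """Fingerprint = the ordered tuple of header names. Clients built on the
    same library emit the same header sequence on every request."""
    return tuple(name for name, _ in headers)

sessions = [
    [("Host", "a.example"), ("Connection", "Keep-Alive"), ("User-Agent", "x")],
    [("Host", "b.example"), ("Connection", "Keep-Alive"), ("User-Agent", "x")],
    [("User-Agent", "Mozilla/5.0"), ("Accept", "*/*"), ("Host", "c.example")],
]
clusters = Counter(fingerprint(s) for s in sessions)
# Two sessions share one non-browser-looking fingerprint; one stands alone.
print(sorted(clusters.values()))  # -> [1, 2]
```

A fingerprint that never matches any known browser profile, yet recurs across many sessions, is a strong candidate for a bot running outside the browser.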
Various sources, such as fingerbank.org, match user-agent values with known program versions, and this information can help identify abnormal bots. For example, most recent browsers use the "Mozilla/5.0" string in the user-agent field; seeing a lower Mozilla version, or its complete absence, indicates an abnormal bot user-agent string. No trustworthy browser will create traffic without a user-agent value.

Analyze the Payload

That said, we don't want to limit our search for bots to the HTTP and TLS protocols. Consider, for example, the IRC protocol: IRC bots have long played a part in malicious botnet activity. We have also observed known malware samples using proprietary, unknown protocols over known ports; these can be flagged using application identification.

In addition, the traffic direction (inbound or outbound) carries significant value here. Devices connected directly to the Internet are constantly exposed to scanning operations, so such bots should be treated as inbound scanners. Outbound scanning activity, on the other hand, indicates a device infected with a scanning bot. This can be harmful to the target being scanned and puts the organization's IP address reputation at risk. The graph below shows spikes in traffic flows within a short timeframe, which can indicate scanning bot activity; it can be analyzed using a flows-per-second calculation.

Figure 4: Here's an example of a high-frequency outbound scanning operation.

Target Analysis: Know Your Destinations

Until now we've looked for bot indicators in the frequency of client-server communications and in the type of clients. Now, let's pull in another dimension: the destination, or target. To determine malicious targets, consider two factors: target reputation and target popularity. Target reputation estimates the likelihood of a domain being malicious based on the experience gathered from many flows.
Reputation is determined either by third-party services or through self-calculation, noting whenever users report a target as malicious. All too often, though, simple sources for determining a target's reputation, such as URL reputation feeds, are insufficient on their own. Every month, millions of new domains are registered. With so many new domains, domain reputation mechanisms lack sufficient context to categorize them properly, delivering a high rate of false positives.

Putting It All Together

Putting all of what we learned together, the likely suspicious sessions are those:

1. Created by a machine rather than a human
2. Generated outside of the browser, or browser traffic with anomalous metadata
3. Communicating with low-popularity targets, particularly those that are uncategorized or marked as malicious

Your legitimate, good bots should not be communicating with low-popularity targets.

Practice: Under the Network Hood of Andromeda Malware

You can use a combination of these methods to discover various types of threats on your network. Let's look at one example: detecting the Andromeda bot. Andromeda is a very common downloader for other types of malware. We identify Andromeda by analyzing data using four of the five approaches we've discussed.

Target Reputation

We noticed communication with "disorderstatus[.]ru", a domain identified as malicious by several reputation services. Various sources categorize this site as a "known infection source" and part of "bot networks." As noted, however, that alone is insufficient: it doesn't indicate whether the specific host is infected by Andromeda; a user could simply have browsed to that site. What's more, as noted, many such URLs will be categorized as "unknown" or "not malicious."

Target Popularity

Out of ten thousand users, only one user's machine communicates with this target, which is very unusual. This gives the target a low popularity score.
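A target-popularity score like the one above can be computed directly from flow records: the fraction of your user base seen talking to each destination. This is a minimal sketch under the assumption that flows are available as (user, target) pairs; the function and data are illustrative:

```python
def popularity(flows, total_users):
    """Score each target by the fraction of the user base seen talking to it.

    flows: iterable of (user, target) pairs drawn from captured traffic.
    """
    seen = {}
    for user, target in flows:
        seen.setdefault(target, set()).add(user)
    return {target: len(users) / total_users for target, users in seen.items()}

# 9,999 users talk to a popular SaaS target; a single user talks to the odd one out.
flows = [(u, "office365.com") for u in range(9999)] + [(1, "disorderstatus[.]ru")]
pop = popularity(flows, total_users=10000)
print(pop["disorderstatus[.]ru"])  # -> 0.0001, i.e. one user in ten thousand
```

Targets scoring near zero, especially when also uncategorized or flagged by reputation feeds, are the ones to investigate first.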
Communication Frequency

Over one week, we saw continuous traffic between the client and the target for three days. This repetitive communication is yet another indicator of a bot.

Figure 5: Client-target communication between the user and disorderstatus[.]ru. Frequency is shown over three days in one-hour buckets.

Header Analysis

The requesting user-agent is "Mozilla/4.0", not a valid modern browser version, indicating the user agent is probably a bot.

Figure 6: Above is the HTTP header image from the traffic we captured with disorderstatus[.]ru. Notice there is no 'referrer' header in any of these requests, and the User-Agent value is set to Mozilla/4.0. Both are indicators of an Andromeda session.

Summary

Bot detection over IP networks is not an easy task, but it's becoming a fundamental part of network security practice, and of malware hunting specifically. By combining the five techniques we've presented here, you can detect malicious bots more efficiently. Follow the links to learn more about our security services, the Cato Managed Threat Detection and Response service, and the SASE platform.