By Dr. Spark Tsao (Data Scientist)
Long before it became a buzzword, machine learning had proven its ability to find hard-to-spot patterns in vast datasets, classify and cluster data, and make predictions. Among its many real-life applications, cybersecurity remains one of the most prominent: it gives traditional cybersecurity solutions the edge they need to catch destructive threats such as ransomware before they are deployed in a system, which saves organizations time, money, and reputation.
Traditional machine learning largely deals with historical knowledge. It allows computers to make inferences based on datasets that have been previously labeled by humans. In cybersecurity, training a machine learning model to learn what malicious files and programs look like can help in the discovery of new, emerging, or unclassified threats via correlation.
To further push the boundaries of how machine learning can be used for a more effective cybersecurity solution, we at Trend Micro have developed a machine learning model that uses two training phases — pre-training and training — to improve detection rates and reduce false positives. This machine learning model, called TrendX Hybrid Model, allows us to not just identify malware but also, and more importantly, predict its behaviors.
The pros and cons of detecting malware using either the static or the dynamic method
Typically, machine learning models in security solutions categorize unknown files as either malicious or benign using one of two methods: static analysis or dynamic (behavior-based) analysis.
Static method
An email with a malicious executable attachment is received
The malicious executable’s static features are extracted
Features are run through the static model for correlation and prediction
The basic static approach allows for the quick analysis of a file without needing it to be executed within a system. Instead, a machine learning model can decipher whether a file is malicious or not based on its static information or attributes. A file’s technical indicators, such as its hashes, header information, printable strings, and its file type and size, serve as the bases of a file’s basic signature.
The static approach does not always work, especially for more sophisticated attacks. What it lacks in comprehensiveness, however, it makes up for in speed. Because a file doesn’t have to be executed, information can be extracted faster.
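As a rough illustration of the static approach, the sketch below pulls a few of the attributes mentioned above (a hash, the file size, a header check, and printable strings) out of raw bytes without executing anything. The function name, the chosen attributes, and the sample bytes are assumptions made for this example, not the feature set of any Trend Micro model.

```python
import hashlib
import re


def extract_static_features(data: bytes, min_len: int = 4) -> dict:
    """Collect basic static attributes of a file without executing it."""
    # Runs of printable ASCII characters that are at least `min_len` bytes long.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    strings = [match.decode("ascii") for match in pattern.findall(data)]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "is_pe": data[:2] == b"MZ",  # DOS/PE executables start with "MZ"
        "strings": strings,
    }


# Made-up bytes standing in for an email attachment.
sample = b"MZ\x90\x00\x03\x00CreateFileW\x00\x01\x02http://example.test/x\x00ab"
features = extract_static_features(sample)
print(features["strings"])
```

Because nothing is run, extraction like this is fast, but it only sees what the file chooses to expose in its bytes.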
Dynamic or behavior-based method
An email with a malicious executable attachment is received
The malicious executable’s behavioral features are observed and analyzed inside a sandbox
The malicious behaviors of the executable are identified and extracted to improve the dynamic model and detection capabilities
Dynamic analysis involves executing files inside sandboxes to determine a file’s behavioral features, making for a more detailed malware analysis. Through dynamic analysis, technical indicators such as application program interfaces (APIs), registry keys, domain names, IP addresses, file path locations, and other files added to the system or network are identified. It can also detect connections to command-and-control (C&C) servers.
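To make the kinds of indicators listed above more concrete, here is a minimal sketch that groups a sandbox trace into behavioral features. The trace format, the event names, and every value in it are hypothetical; real sandboxes produce their own report formats.

```python
from collections import defaultdict

# Hypothetical sandbox trace: (indicator type, observed value) pairs.
TRACE = [
    ("api", "CreateFileW"),
    ("registry", r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run"),
    ("dns", "c2.example.test"),
    ("connect", "203.0.113.7:443"),
    ("file_write", r"C:\Users\victim\AppData\Local\Temp\payload.bin"),
    ("api", "CreateFileW"),
]


def summarize_trace(trace):
    """Group observed events by indicator type, dropping duplicates."""
    indicators = defaultdict(set)
    for kind, value in trace:
        indicators[kind].add(value)
    return {kind: sorted(values) for kind, values in indicators.items()}


report = summarize_trace(TRACE)
print(report["dns"])
```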
Because dynamic analysis needs to run a file inside a sandbox to determine its behaviors, and the file may or may not start its malicious activity immediately after it runs, getting a result can take longer. That is not always ideal, especially with the emergence of more complex threats. Some threats evade sandboxes by pretending to be benign when they detect that they’re inside a virtualized environment. Examples of these complex threats are the Locky ransomware and fileless threats such as the Angler exploit kit.
Running a file inside a sandbox not only takes a lot of time but also requires more computing power and incurs higher costs. Even so, combining the speed of static analysis with the depth of dynamic analysis is critical to keeping systems protected against sophisticated threats at time zero.
TrendX Hybrid Model: Pre-training, training, and predicting phases
The typical machine learning model has two phases — training and prediction. The TrendX Hybrid Model has two training phases, the pre-training and training phases, which aim to take advantage of both static and dynamic analyses in identifying malicious files swiftly, as well as a third phase, the predicting phase.
Pre-training
Samples are collected and analyzed to get both static and dynamic features
The static and dynamic features are mapped out. The static/dynamic features are paired and collected
In the pre-training phase, large volumes of known samples gathered from the Trend Micro™ Smart Protection Network™ infrastructure are used. Static features are extracted, and the samples are also run in a sandbox to extract their dynamic features.
The extracted features are then mapped out to determine which static features are related to the dynamic or behavioral features. The pairs of features collected are then used to train a machine learning model called Network 1.
Producing Network 1 is the goal of the pre-training stage: it captures which static features, e.g., patterns and hashes, correlate with which dynamic or behavioral features, e.g., file encryption and file deletion.
During this phase, labeling the samples is not as important as identifying which behavior is associated with which static features.
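The pairing idea behind pre-training can be sketched in a few lines. The article does not describe how Network 1 is built internally, so the example below swaps in a plain co-occurrence table: for each static feature, it estimates how often each behavior was observed alongside it. The function name, feature names, and sample pairs are all invented for illustration. Note that no malicious or benign label appears anywhere.

```python
from collections import defaultdict


def pretrain_network1(pairs):
    """Estimate how often each behavior co-occurs with each static feature.

    `pairs` holds (static_features, behaviors) sets, one per sample.
    """
    feature_counts = defaultdict(int)
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for static_features, behaviors in pairs:
        for feature in static_features:
            feature_counts[feature] += 1
            for behavior in behaviors:
                cooccurrence[feature][behavior] += 1
    # Convert raw counts into per-feature frequencies between 0 and 1.
    return {
        feature: {b: n / feature_counts[feature] for b, n in by_behavior.items()}
        for feature, by_behavior in cooccurrence.items()
    }


# Hypothetical samples: static features paired with sandbox-observed behaviors.
pairs = [
    ({"str:CryptEncrypt", "packed"}, {"file_encryption", "file_deletion"}),
    ({"str:CryptEncrypt"}, {"file_encryption"}),
    ({"packed", "str:InternetOpen"}, {"network_connection"}),
    ({"str:InternetOpen"}, set()),
]
network1 = pretrain_network1(pairs)
print(network1["str:CryptEncrypt"])
```

A production model would presumably generalize to unseen feature combinations rather than look up exact features, but the input and output are the same in spirit: static features in, likely behaviors out.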
Training
A sample's static features are extracted
The static features are compared to the static features of the files in Network 1
The dynamic features associated with the static features of Network 1 are considered to be the pseudo-behaviors for the sample's static features
Based on the sample's pseudo-behaviors, it gets labeled either malicious or benign
The training phase is the machine learning stage of typical hybrid models. Network 1, which contains the correlations between static and dynamic features, is used to analyze a sample. Samples in training and pre-training don’t need to be the same files.
In this stage, the known sample used for training no longer needs to be executed in a sandbox to identify its behavior. Its static features and the corresponding behaviors predicted by Network 1 are used to predict whether the file is malicious or benign, and the file is labeled accordingly. The result of this stage is called Network 2.
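The training stage can be illustrated under similar assumptions. In the sketch below, a hand-written table stands in for Network 1, its output is treated as pseudo-behaviors, and a simple perceptron plays the role of Network 2. The perceptron, the threshold, the feature names, and the labeled samples are all choices made for this example; the article does not say which learning algorithm the TrendX Hybrid Model uses.

```python
# Hypothetical Network 1 output: static feature -> {behavior: correlation}.
NETWORK1 = {
    "str:CryptEncrypt": {"file_encryption": 0.9, "file_deletion": 0.6},
    "str:InternetOpen": {"network_connection": 0.8},
    "signed": {},
}


def featurize(static_features, network1, threshold=0.5):
    """Static features plus the pseudo-behaviors Network 1 associates with them."""
    behaviors = {
        "beh:" + behavior
        for feature in static_features
        for behavior, score in network1.get(feature, {}).items()
        if score >= threshold
    }
    return set(static_features) | behaviors


def train_network2(samples, network1, epochs=10):
    """Fit perceptron weights on labeled samples; no sandbox run is needed."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for static_features, label in samples:
            active = featurize(static_features, network1)
            score = bias + sum(weights.get(f, 0.0) for f in active)
            error = label - (1 if score > 0 else 0)
            if error:  # update only on a misclassified sample
                bias += error
                for f in active:
                    weights[f] = weights.get(f, 0.0) + error
    return weights, bias


def classify(static_features, network1, network2):
    weights, bias = network2
    active = featurize(static_features, network1)
    score = bias + sum(weights.get(f, 0.0) for f in active)
    return "malicious" if score > 0 else "benign"


# Hypothetical labeled training set: 1 = malicious, 0 = benign.
samples = [
    ({"str:CryptEncrypt", "str:InternetOpen"}, 1),
    ({"str:CryptEncrypt"}, 1),
    ({"str:InternetOpen", "signed"}, 0),
    ({"signed"}, 0),
]
network2 = train_network2(samples, NETWORK1)
print(classify({"str:CryptEncrypt"}, NETWORK1, network2))
```

The training samples here differ from anything used to build the stand-in Network 1, which mirrors the point above that the two phases do not need the same files.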
Prediction
An unknown file is delivered via email
The unknown file's static features are extracted and run through Network 1 to obtain its predicted dynamic features
Based on correlation from Networks 1 and 2, the unknown file is predicted to be malicious in nature
In the prediction phase, Network 1 and Network 2 are used to predict if an unknown file is malicious or benign. This is especially helpful since the concept of what is malicious can be environment-dependent: adware can be considered malicious in certain environments, but not in others.
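One way to picture the prediction phase, and the environment-dependence just mentioned, is to keep a single Network 1 and pair it with a different Network 2 per environment. This pairing is an assumption made for illustration; the article does not state how environment-specific verdicts are produced. All feature names, weights, and environment names below are invented.

```python
# Hypothetical Network 1: static feature -> {behavior: correlation}.
NETWORK1 = {
    "str:ShowAdPopup": {"displays_ads": 0.9},
    "str:CryptEncrypt": {"file_encryption": 0.9},
}

# Hypothetical Network 2 per environment: (feature weights, bias).
NETWORK2_BY_ENV = {
    "corporate": ({"beh:displays_ads": 1.0, "beh:file_encryption": 2.0}, -0.5),
    "home": ({"beh:displays_ads": 0.25, "beh:file_encryption": 2.0}, -0.5),
}


def predict(static_features, environment, threshold=0.5):
    """Classify an unknown file from its static features alone."""
    # Step 1: Network 1 supplies pseudo-behaviors, so no sandbox run is needed.
    behaviors = {
        "beh:" + behavior
        for feature in static_features
        for behavior, score in NETWORK1.get(feature, {}).items()
        if score >= threshold
    }
    active = set(static_features) | behaviors
    # Step 2: the environment's Network 2 turns the features into a verdict.
    weights, bias = NETWORK2_BY_ENV[environment]
    score = bias + sum(weights.get(f, 0.0) for f in active)
    return "malicious" if score > 0 else "benign"


adware = {"str:ShowAdPopup"}
ransomware = {"str:CryptEncrypt"}
print(predict(adware, "corporate"), predict(adware, "home"))
```

With these made-up weights, the adware sample is flagged in the stricter environment and allowed in the other, while the ransomware sample is flagged in both.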
The TrendX Hybrid Model, a patent-pending innovation (U.S. Patent App. 15/659,403), aims to provide a faster and smarter solution to detecting malware using machine learning technologies. Because it uses both static and dynamic features to analyze unknown files, accuracy is not sacrificed for the sake of efficiency.
Both pre-training and training phases can be independently developed, maintained, and strengthened by different teams of experts who specialize in each phase. The phases are also stable — they need not be replaced each and every time to maintain the model’s efficacy.
Machine learning is a great tool for fighting an ever-evolving threat landscape. It is not a cybersecurity silver bullet, but a helpful layer in a comprehensive cross-generational threat defense strategy. Cybersecurity solutions that feature machine learning technologies can help detect malware faster and more accurately, fight ransomware at time zero, and help foil email threats that make use of social engineering and zero-day exploits.