Clustering Malware based on Printable Strings and Dynamic API Calls

In my experience of analyzing malware, the high-level process has always been straightforward: examine static attributes, run the sample in a sandbox, then move on to code analysis and debugging. This is quite lengthy, and depending on the sophistication of the sample and the purpose of analysis, the procedure may take anywhere from a few hours to weeks. In general, without a memory of historical samples, it is difficult to relate the sample at hand to other known samples. One method to build that relationship is malware clustering.

Credits: My mentor, Scott Lambert, suggested I look into TA505 as well as the book Malware Data Science. Another mentor, Libra, provided all the malware sample hashes that I’ve used in this project. Much of the concepts and principles used in the code are derived heavily from the book.

Prerequisite: You need to have a Cuckoo sandbox set up. I have a Cuckoo VM set up within a Xenial VM on VMware Workstation 15 Pro. The code that I use to dispatch jobs to Cuckoo and retrieve dynamic API call information assumes this setup. I used this article to set up Cuckoo.

Code: https://github.com/nikhilh-20/malwarecluster

Benefits of Clustering Malware

The following are some of the advantages of building relationships between malware samples:

  • A malware network for a threat actor gives insight into the sophistication of the deployed malware.
  • A malware sample can be compared with other malware in the network to determine any relationship with existing families.
  • If a newly acquired malware sample is strongly related to other malware in a particular threat actor’s network, it can be argued that it might be sourced from that threat actor.

Feature Selection

Printable Strings for Non-PE Samples

Linux’s strings command prints sequences of printable characters that are four characters or longer. In this project, I’ve used printable strings as the feature for non-PE samples – macros, HTML, etc. If you have the MS Office suite available in your Cuckoo setup, you can test the effectiveness of using dynamic features for macro samples.
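As an illustration, here is a minimal sketch in Python of how such strings can be pulled out of a file. It mimics the strings command (runs of printable ASCII that are at least four characters long); the project code may extract strings differently, for example by invoking strings itself.

import re

def extract_strings(path, min_len=4):
    # Return the set of printable ASCII strings of length >= min_len,
    # similar to what the Linux strings command reports.
    with open(path, "rb") as f:
        data = f.read()
    # Printable ASCII runs: space (0x20) through tilde (0x7e)
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {s.decode("ascii") for s in re.findall(pattern, data)}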

Dynamic API Calls for PE Samples

Printable strings of PE samples vary with compiler options and packers (if any). So, two malware samples may have the same functionality, but their printable strings may differ because they were compiled using different options. Thus, I’ve used dynamic API calls as the feature for PE samples (EXEs and DLLs). Irrespective of packer and compiler options, if two malware samples have similar functionality, the sequences of their API calls will likely be similar. To extract dynamic API call information, I’ve set up a Cuckoo sandbox within a Xenial VM on VMware Workstation 15 Pro.
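The sketch below shows one way to flatten the API call names out of a Cuckoo report into an ordered list. It assumes the standard Cuckoo report.json layout (behavior → processes → calls → api); if your Cuckoo version structures the report differently, the keys will need adjusting.

import json

def api_call_sequence(report_path):
    # Flatten the API call names from a Cuckoo report.json into one ordered list.
    # Assumes the standard behavior section: behavior -> processes -> calls -> api.
    with open(report_path) as f:
        report = json.load(f)
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            calls.append(call["api"])
    return calls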

Similarity Measure

Jaccard Index

The Jaccard Index is a measure of similarity between two sets. It is used as a similarity measure for both non-PE and PE samples. Consider the Venn diagram below:

Venn Diagram for Two Sets of Strings

The left circle represents a malware sample containing seven strings. The right circle represents a malware sample containing six strings. They have four strings in common and the Jaccard Index is calculated as follows:

(number of common elements) / (number of total unique elements)

In the above case, there are four common strings and nine unique strings in total. Thus, their Jaccard Index is 4/9, which is about 0.44.
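In Python, the calculation is a short operation over sets. The sketch below reproduces the Venn-diagram example, with placeholder strings s1 through s9 standing in for the actual strings.

def jaccard_index(set_a, set_b):
    # Jaccard Index = |A intersect B| / |A union B|
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# The Venn-diagram example above: 4 common strings out of 9 unique strings.
left = {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}
right = {"s4", "s5", "s6", "s7", "s8", "s9"}
print(jaccard_index(left, right))  # 0.444...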

Jaccard Index Threshold

File types like XLS have many strings that are common across files; this commonality comes from XLS files sharing a common base. Other file types like HTML may have very little code in them, and the Jaccard Index falls significantly when just a few characters change, such as the domain name. These differences warranted the use of a different Jaccard Index threshold for each file type.
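One simple way to express this is a per-file-type lookup table. The thresholds below are purely illustrative assumptions for this sketch; the values actually used in the project were tuned empirically and may differ.

# Illustrative per-file-type thresholds (assumed values, not the project's).
JACCARD_THRESHOLDS = {
    "xls": 0.9,   # lots of shared boilerplate, so demand high similarity
    "doc": 0.9,
    "html": 0.5,  # tiny files, a few changed characters move the index a lot
    "exe": 0.7,   # PE samples are compared on API-call n-grams instead of strings
}

def related(file_type, score, default=0.8):
    # Decide whether two samples of the same type should be linked.
    return score >= JACCARD_THRESHOLDS.get(file_type, default)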

Subtleties in PE Similarity Measure

N-gram API Sequence

Each malware sample results in thousands of Windows API calls. Determining common API calls just by their names is ineffective: it is not the API name itself that holds significance, but rather the sequence of API calls. Thus, the basic form of the Jaccard Index over individual API names is not effective.

Taking all API calls into account at once would also be ineffective, for three reasons:

  • minor variants of a malware family will likely produce a few different API calls,
  • different packers may result in different API calls at the beginning of execution, and
  • malware authors might introduce a few random, useless API calls to evade AVs and make manual reverse engineering difficult.

Therefore, taking N-gram API calls into account is necessary. This groups every N consecutive API calls of a sample into a bag and compares those bags against the bags of other malware. Consider the following two sets of API calls:

API Call Sets for Two Malware Samples

Sample 1: API1, API2, API3, API4, API5, API6, API7, API8
Sample 2: API1, API2, API3, API5, API6, API7, API4

Comparing all the API calls at once would definitely not show any similarity between these two samples. The following shows how the N-gram concept works:

For N = 4, the following would be the bags of API calls:

Sample 1: (API1, API2, API3, API4), (API5, API6, API7, API8)
Sample 2: (API1, API2, API3, API5), (API6, API7, API4)

We can see that there are no bags which have the same API calls. The Jaccard Index would be calculated as:

(number of common elements) / (number of total unique elements) = 0 / 4 = 0.

The total number of for loop iterations to complete comparisons is 2.

For N = 2:

Sample 1: (API1, API2), (API3, API4), (API5, API6), (API7, API8)
Sample 2: (API1, API2), (API3, API5), (API6, API7), (API4)

The first bag of API calls is the same while the rest are different. The Jaccard Index would be calculated as:

(number of common elements) / (number of total unique elements) = 1 / 7 = 0.14.

The total number of for loop iterations to complete comparisons is 4.

For N = 1:

Sample 1: (API1), (API2), (API3), (API4), (API5), (API6), (API7), (API8)
Sample 2: (API1), (API2), (API3), (API5), (API6), (API7), (API4)

There are three common bags while the rest are different. The Jaccard Index would be calculated as:

(number of common elements) / (number of total unique elements) = 3 / 12 = 0.25.

The total number of for loop iterations to complete comparisons is 8.

In the first column, API4 appears at the fourth position, while it appears at the seventh position in the second column. Even though the two instances share the same name, each is treated as unique because of its position in the call sequence; the same holds for API5, API6 and API7, which is why only the first three bags count as common.

It is important to note the inverse relationship between the value of N and the number of iterations required to complete the comparisons. Malware samples call thousands of APIs, so it is important to strike a balance between the number of required iterations and the value of N. In my project, I usually use a value of N = 3.
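Putting the pieces together, the sketch below builds positional N-gram bags from an API call sequence and scores two samples with the Jaccard Index. The positional indexing mirrors the walkthrough above (the same API name at a different position is a different bag); the actual project code may construct the bags differently.

def ngram_bags(api_calls, n=3):
    # Split an ordered API call list into consecutive, non-overlapping n-gram bags.
    # Each bag keeps its starting position so identical names at different
    # positions stay distinct, as in the walkthrough above.
    return {(i, tuple(api_calls[i:i + n])) for i in range(0, len(api_calls), n)}

def api_similarity(calls_a, calls_b, n=3):
    # Jaccard Index over the positional n-gram bags of two API call sequences.
    bags_a, bags_b = ngram_bags(calls_a, n), ngram_bags(calls_b, n)
    union = bags_a | bags_b
    return len(bags_a & bags_b) / len(union) if union else 0.0

# The N = 1 example above: 3 common bags out of 12 unique bags -> 0.25
sample1 = ["API1", "API2", "API3", "API4", "API5", "API6", "API7", "API8"]
sample2 = ["API1", "API2", "API3", "API5", "API6", "API7", "API4"]
print(api_similarity(sample1, sample2, n=1))  # 0.25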

Malware Cluster Network for TA505

Libra provided malware hashes for TA505 and the code produced the following malware cluster network:

Malware Cluster Network for TA505

In the above network, we can see that there are many related malware samples. The big cluster at the center of the network (containing hash 8cec63) relates XLS samples. Another cluster (containing f51b30), connected to the central cluster, contains DOC samples. A third cluster (containing e72a88), also connected to the central cluster, contains HTML samples. A separate sub-cluster shows the relationships between the PE samples.
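For reference, the sketch below shows one way such a cluster network can be assembled from pairwise Jaccard scores using networkx; this is an assumption about the approach, and the project’s actual graph-building and rendering code may differ.

import itertools
import networkx as nx

def build_cluster_network(samples, similarity, threshold):
    # samples: dict mapping a sample hash to its feature set
    #          (printable strings or API-call n-gram bags).
    # similarity: function returning the Jaccard Index of two feature sets.
    # An edge is added whenever two samples meet the threshold.
    graph = nx.Graph()
    graph.add_nodes_from(samples)
    for (hash1, feats1), (hash2, feats2) in itertools.combinations(samples.items(), 2):
        score = similarity(feats1, feats2)
        if score >= threshold:
            graph.add_edge(hash1, hash2, weight=score)
    return graph

# The connected components of the graph are the malware clusters:
# clusters = list(nx.connected_components(graph))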

Thanks for reading!

In this article, I described a malware clustering principle / algorithm that I learned from the Malware Data Science book. Future work lies in improving clustering features for PE samples.

Thank you for reading! If you have any questions, leave them in the comments section below and I’ll get back to you as soon as I can!
