Binary Similarity Measure using ssdeep


We have all used cryptographic hashes to verify the integrity of files. Whether you prefer MD5, SHA256, or SHA512, the principle is the same: changing a single bit of the input drastically changes the hash output. Criminals know this too. It is often possible to change a single bit in a piece of malware while keeping its functionality intact. A forensics professional searching for the cryptographic hash of the original malware would then be unable to find the modified copy.
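To make the single-bit point concrete, here is a minimal Python sketch. The input bytes and the helper names (`flip_bit`, `differing_bits`) are made up for illustration; `hashlib.sha256` is the only library call involved.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Number of bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical stand-in for the contents of a sample.
original = b"stand-in for the bytes of a malware sample"
modified = flip_bit(original, 0)

digest_a = sha256_hex(original)
digest_b = sha256_hex(modified)

print(digest_a)
print(digest_b)
print("digest bits that differ:", differing_bits(digest_a, digest_b), "of 256")
```

The two inputs differ in one bit, yet the digests should look unrelated, so a lookup by the original hash misses the modified file.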

In this article, we will look at a concept called Context Triggered Piecewise Hashes (CTPH) and a CTPH-based application called ssdeep.

Basic Principle of CTPH

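When using traditional cryptographic hashes, a single hash is created for the entire file, and a single bit change has an avalanche effect on the output hash value. CTPH instead computes a traditional hash for each of many segments of the file and records a small part of each one in the signature. The segment boundaries are not fixed offsets. A rolling hash is computed over the input, and a segment ends wherever that rolling hash hits a trigger value, which is what makes the scheme context triggered.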

A rolling hash algorithm produces a pseudo-random value based only on the current context of the input. The rolling hash works by maintaining a state based solely on the last few bytes from the input. Each byte is added to the state as it is processed and removed from the state after a set number of other bytes have been processed.

The current context can be imagined as a moving window across the input. The window length (number of bytes) depends on the implementation of CTPH.
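The moving window can be sketched in a few lines of Python. This is a toy rolling hash written for this article, not the one ssdeep uses: the window length of 7 bytes, the class name `RollingHash` and the way the two sums are combined are all assumptions of the sketch.

```python
from collections import deque

WINDOW = 7  # window length in bytes; an assumption of this sketch


class RollingHash:
    """Toy rolling hash whose value depends only on the last WINDOW bytes."""

    def __init__(self, window: int = WINDOW) -> None:
        self.recent = deque([0] * window, maxlen=window)
        self.h1 = 0  # plain sum of the bytes in the window
        self.h2 = 0  # position-weighted sum (the newest byte weighs most)

    def update(self, byte: int) -> int:
        size = len(self.recent)
        # Age every byte by one position and add the newcomer at full weight.
        self.h2 += size * byte - self.h1
        # Add the new byte and remove the one that falls out of the window.
        self.h1 += byte - self.recent[0]
        self.recent.append(byte)  # maxlen discards the oldest entry
        return (self.h1 + self.h2) & 0xFFFFFFFF


def rolling_values(data: bytes) -> list:
    """Rolling hash value after each byte of data."""
    hasher = RollingHash()
    return [hasher.update(byte) for byte in data]


# Same last seven bytes, different history: the final values agree.
print(rolling_values(b"ABCDEFG")[-1])
print(rolling_values(b"some unrelated prefix ABCDEFG")[-1])
```

Because each byte is added when it enters the window and subtracted when it leaves, whatever came before the last seven bytes has no influence on the current value.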

Each recorded value in the CTPH signature depends only on part of the input, and changes to the input will result in only localized changes in the CTPH signature.

Two files that are similar to each other will share long sequences of identical bytes in the same order. The main aim of CTPH is to detect this kind of similarity between binaries.

If a byte of the input is changed, at most two, and in many cases, only one of the traditional hash values will be changed; the majority of the CTPH signature will remain the same. Because the majority of the signature remains the same, files with modifications can still be associated with the CTPH signatures of known files. 
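The localized effect of a change can be seen in a small Python sketch of the idea. It is a simplified illustration under stated assumptions, not ssdeep's algorithm: the rolling hash is a toy, `BLOCK_SIZE` is fixed at 64 instead of being derived from the file size, SHA256 stands in for the per-piece hash, and the input is pseudo-random bytes.

```python
import hashlib
import random
from collections import deque

WINDOW = 7       # rolling-hash window in bytes (assumed)
BLOCK_SIZE = 64  # trigger modulus (assumed; fixed here for simplicity)
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/")


def trigger_points(data: bytes) -> list:
    """Indices where the rolling hash of the last WINDOW bytes hits the trigger."""
    recent = deque([0] * WINDOW, maxlen=WINDOW)
    h1 = h2 = 0
    points = []
    for i, byte in enumerate(data):
        h2 += WINDOW * byte - h1   # weighted sum, uses h1 before its update
        h1 += byte - recent[0]     # plain sum over the window
        recent.append(byte)
        if (h1 + h2) % BLOCK_SIZE == BLOCK_SIZE - 1:
            points.append(i)
    return points


def pieces(data: bytes) -> list:
    """(start, end) slices of data; each piece ends just after a trigger point."""
    bounds, start = [], 0
    for i in trigger_points(data):
        bounds.append((start, i + 1))
        start = i + 1
    if start < len(data):
        bounds.append((start, len(data)))  # trailing piece without a trigger
    return bounds


def piece_char(chunk: bytes) -> str:
    """One signature character per piece: the low 6 bits of its hash."""
    return ALPHABET[hashlib.sha256(chunk).digest()[-1] & 0x3F]


def signature(data: bytes) -> str:
    return "".join(piece_char(data[s:e]) for s, e in pieces(data))


# Hypothetical input: 8 KiB of pseudo-random bytes with one byte changed.
rng = random.Random(2019)
original = bytes(rng.randrange(256) for _ in range(8192))
position = 4000
changed = bytearray(original)
changed[position] ^= 0xFF
modified = bytes(changed)

print(signature(original))
print(signature(modified))
```

Pieces that end before the changed byte, or that start more than one window after it, are cut at the same places and hash to the same characters in both files, so the two printed signatures should differ only in a short stretch.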

Installing ssdeep

The ssdeep program was released along with the associated paper. Its GitHub repository is located at

Download the Latest Release

itachi@kali:~$ wget
--2019-04-07 23:01:22--
HTTP request sent, awaiting response... 200 OK
Length: 408831 (399K) [application/octet-stream]
Saving to: ‘ssdeep-2.14.1.tar.gz’
ssdeep-2.14.1.tar.gz 100%[========================>] 399.25K --.-KB/s in 0.06s
2019-04-07 23:01:23 (6.59 MB/s) - ‘ssdeep-2.14.1.tar.gz’ saved [408831/408831]

Install ssdeep

itachi@kali:~$ tar xf ssdeep-2.14.1.tar.gz
itachi@kali:~$ cd ssdeep-2.14.1/
itachi@kali:~/ssdeep-2.14.1$ ./configure && make && sudo make install
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
/usr/bin/mkdir -p '/usr/local/include'
/usr/bin/install -c -m 644 fuzzy.h edit_dist.h '/usr/local/include'
/usr/bin/mkdir -p '/usr/local/share/man/man1'
/usr/bin/install -c -m 644 ssdeep.1 '/usr/local/share/man/man1'
make[1]: Leaving directory '/home/itachi/ssdeep-2.14.1'

ssdeep is now installed.

itachi@kali:~$ ssdeep -h
ssdeep version 2.14.1 by Jesse Kornblum and the ssdeep Project
For copyright information, see man page or README.TXT.
Usage: ssdeep [-m file] [-k file] [-dpgvrsblcxa] [-t val] [-h|-V] [FILES]
-m - Match FILES against known hashes in file
-k - Match signatures in FILES against signatures in file
-d - Directory mode, compare all files in a directory
-p - Pretty matching mode. Similar to -d but includes all matches
-g - Cluster matches together
-v - Verbose mode. Displays filename as its being processed
-r - Recursive mode
-s - Silent mode; all errors are suppressed
-b - Uses only the bare name of files; all path information omitted
-l - Uses relative paths for filenames
-c - Prints output in CSV format
-x - Compare FILES as signature files
-a - Display all matches, regardless of score
-t - Only displays matches above the given threshold
-h - Display this help message
-V - Display version number and exit


Note: At the time, I couldn’t acquire multiple malware samples of a specific APT. In the demonstration below, I’ve created three sample (non-malicious) binaries which serve as an alternative to real malware samples.

itachi@kali:~$ cat sample1.c
#include <stdio.h>
void main() {
    printf ("Hello World");
}

itachi@kali:~$ cat sample2.c
#include <stdio.h>
int main(int argc, char *argv[]) {
    printf ("Number of arguments: %d\n", argc);
    printf ("Program name: %s\n", argv[0]);
    return 0;
}

itachi@kali:~$ cat sample3.c
#include <stdio.h>
void main() {
    int a = 5;
    printf ("Hello World: %d\n", a);
}

itachi@kali:~$ gcc -o sample1 sample1.c
itachi@kali:~$ gcc -o sample2 sample2.c
itachi@kali:~$ gcc -o sample3 sample3.c

itachi@kali:~/sample$ ls
sample1 sample2 sample3 

Looking at the source, sample1.c is clearly more similar to sample3.c than to sample2.c. We can expect a similar relationship between the compiled binaries. Let’s see what ssdeep has to say:

itachi@kali:~/sample$ ssdeep -s * > sample_ctph.ssd

itachi@kali:~/sample$ cat sample_ctph.ssd
ssdeep,1.1--blocksize:hash:hash,filename

itachi@kali:~/sample$ ssdeep -m sample_ctph.ssd -s *
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample1 (100)
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample2 (65)
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample3 (71)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample1 (65)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample2 (100)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample3 (63)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample1 (71)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample2 (63)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample3 (100) 
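With more than a handful of files, these match lines get hard to read by eye. Here is a minimal Python sketch that folds them into one score per pair. It assumes the line format shown above; the names `MATCH_LINE` and `pairwise_scores` are made up for this example.

```python
import re
from pathlib import PurePosixPath

# "<file> matches <known file>:<matched entry> (<score>)"
MATCH_LINE = re.compile(
    r"^(?P<file>.+) matches (?P<known>[^:]+):(?P<match>.+) \((?P<score>\d+)\)$"
)

SSDEEP_OUTPUT = """\
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample1 (100)
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample2 (65)
/home/itachi/sample/sample1 matches sample_ctph.ssd:/home/itachi/sample/sample3 (71)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample1 (65)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample2 (100)
/home/itachi/sample/sample2 matches sample_ctph.ssd:/home/itachi/sample/sample3 (63)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample1 (71)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample2 (63)
/home/itachi/sample/sample3 matches sample_ctph.ssd:/home/itachi/sample/sample3 (100)
"""


def pairwise_scores(output: str) -> dict:
    """Map each unordered pair of distinct file names to its match score."""
    scores = {}
    for line in output.splitlines():
        found = MATCH_LINE.match(line.strip())
        if not found:
            continue  # skip anything that is not a match line
        first = PurePosixPath(found["file"]).name
        second = PurePosixPath(found["match"]).name
        if first != second:  # a file always matches itself with 100
            scores[tuple(sorted((first, second)))] = int(found["score"])
    return scores


scores = pairwise_scores(SSDEEP_OUTPUT)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

Self-matches are dropped and each pair is kept once, which leaves three scores for the three sample binaries.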

ssdeep gave sample1 a match score of 65 against sample2 and 71 against sample3, on a scale from 0 to 100. This suggests that the sample1 and sample3 binaries have more identical sequences of bytes in common than sample1 and sample2 do. Consider this concept when applied to Threat Intelligence. If two (or more) malware samples are similar, we can theorize that the samples are:

  1. from the same threat group,
  2. the result of collaboration between threat groups, or
  3. built from the same malware builder program.
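The match score behind this reasoning is computed from the two signatures rather than from the files themselves, using an edit distance between the signature strings. The Python sketch below shows the general idea only. The cost model (a substitution costs as much as one deletion plus one insertion), the scaling and the example strings are assumptions of this sketch, and ssdeep's actual formula differs in its details.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance where insert and delete cost 1 and a substitution costs 2."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitute = previous[j - 1] + (0 if char_a == char_b else 2)
            current.append(min(previous[j] + 1,     # delete from a
                               current[j - 1] + 1,  # insert into a
                               substitute))
        previous = current
    return previous[-1]


def match_score(sig_a: str, sig_b: str) -> int:
    """Scale the distance to 0..100, where 100 means identical signatures."""
    if not sig_a and not sig_b:
        return 100
    distance = indel_distance(sig_a, sig_b)
    return 100 - (100 * distance) // (len(sig_a) + len(sig_b))


# Hypothetical signature fragments, not real ssdeep output.
print(match_score("abcdef", "abcdef"))
print(match_score("abcdef", "abcxef"))
print(match_score("aaaa", "bbbb"))
```

Identical signatures score 100 and signatures with nothing in common score 0, so a higher score means fewer edits are needed to turn one signature into the other.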

Drawbacks with Images

ssdeep is not effective when comparing two images, because two similar-looking images may have vastly different binary data. Formats such as JPEG store compressed data, so editing an image and exporting it again re-encodes the file instead of changing a few bytes in place. Consider the two images below. The first is the original image, which I picked up from Google. For the second, I opened the original in GIMP, added the text ‘Hisoka’, and exported it with the default GIMP settings.

itachi@kali:~/Pictures$ ls
hisoka_1.jpg hisoka_2.jpg

itachi@kali:~/Pictures$ ssdeep -s hisoka_* > hisoka.ssd

itachi@kali:~/Pictures$ ssdeep -m hisoka.ssd -s -a *
/home/itachi/Pictures/hisoka_1.jpg matches hisoka.ssd:/home/itachi/Pictures/hisoka_1.jpg (100)
/home/itachi/Pictures/hisoka_1.jpg matches hisoka.ssd:/home/itachi/Pictures/hisoka_2.jpg (0)
/home/itachi/Pictures/hisoka_2.jpg matches hisoka.ssd:/home/itachi/Pictures/hisoka_1.jpg (0)
/home/itachi/Pictures/hisoka_2.jpg matches hisoka.ssd:/home/itachi/Pictures/hisoka_2.jpg (100)

The results show that ssdeep considers the two images completely dissimilar. They score 0 against each other, even though they look almost identical to a human viewer.

Thanks for reading!

In this article, we looked at Context Triggered Piecewise Hashes (CTPH), which can be used to determine the similarity between binaries, and at ssdeep, a program that implements the technique. We also learnt that while ssdeep is good at determining similarity between code binaries, it is not so good at comparing images.

If you have any questions, leave them in the comments section below and I’ll get back to you as soon as I can!
