String Deobfuscation using SMT Solver

Posted by

Most malware authors employ string obfuscation techniques to hide important strings from malware analysts. Usually, my approach to deobfuscating these strings would be to either execute the malware sample under a debugger or codify the decoding scheme in a high-level language like Python.

These methods work well. Even if the malware employs anti-debugging techniques, they can be bypassed by carefully stepping through instructions in debuggers like OllyDbg, x64dbg, etc. However, it cannot be denied that it is hard work. Codifying the decoding scheme in a high-level language by studying the assembly code is quite prone to errors. If a mistake does occur, it is again hard work to find the faulty statement(s).

Enter, Satisfiability Modulo Theories (SMT). In simple words, it is a computer science concept which involves building a formula (or expression) and finding inputs that satisfy the said formula. With respect to string deobfuscation, it has the potential to find the input to an algebraic-like (or combinatoric) encoding scheme, provided the obfuscated string is available. Also, I don’t have to codify the decoding scheme, instead I codify the encoding scheme which is relatively easier because I just have to copy the assembly instruction’s functionality line by line.

Sample Encoding Scheme

Consider the simple encoding scheme shown in the snap below. It is not very sophisticated, but it serves the purpose for now.

0x0000000000000844 <+0>: push rbp
0x0000000000000845 <+1>: mov rbp,rsp
0x0000000000000848 <+4>: push rbx
0x0000000000000849 <+5>: sub rsp,0x28
0x000000000000084d <+9>: mov QWORD PTR [rbp-0x28],rdi
0x0000000000000851 <+13>: mov DWORD PTR [rbp-0x14],0x0
0x0000000000000858 <+20>: jmp 0x8d3 <encode+143>
0x000000000000085a <+22>: mov eax,DWORD PTR [rbp-0x14]
0x000000000000085d <+25>: movsxd rdx,eax
0x0000000000000860 <+28>: mov rax,QWORD PTR [rbp-0x28]
0x0000000000000864 <+32>: add rax,rdx
0x0000000000000867 <+35>: movzx eax,BYTE PTR [rax]
0x000000000000086a <+38>: add eax,0x4
0x000000000000086d <+41>: mov ecx,eax
0x000000000000086f <+43>: mov eax,DWORD PTR [rbp-0x14]
0x0000000000000872 <+46>: movsxd rdx,eax
0x0000000000000875 <+49>: mov rax,QWORD PTR [rbp-0x28]
0x0000000000000879 <+53>: add rax,rdx
0x000000000000087c <+56>: xor ecx,0xc
0x000000000000087f <+59>: mov edx,ecx
0x0000000000000881 <+61>: mov BYTE PTR [rax],dl
0x0000000000000883 <+63>: mov eax,DWORD PTR [rbp-0x14]
0x0000000000000886 <+66>: movsxd rdx,eax
0x0000000000000889 <+69>: mov rax,QWORD PTR [rbp-0x28]
0x000000000000088d <+73>: add rax,rdx
0x0000000000000890 <+76>: movzx eax,BYTE PTR [rax]
0x0000000000000893 <+79>: movsx eax,al
0x0000000000000896 <+82>: mov esi,0x3
0x000000000000089b <+87>: mov edi,eax
0x000000000000089d <+89>: call 0x73a <logical_shift_right>
0x00000000000008a2 <+94>: mov esi,eax
0x00000000000008a4 <+96>: mov eax,DWORD PTR [rbp-0x14]
0x00000000000008a7 <+99>: movsxd rdx,eax
0x00000000000008aa <+102>: mov rax,QWORD PTR [rbp-0x28]
0x00000000000008ae <+106>: add rax,rdx
0x00000000000008b1 <+109>: movzx eax,BYTE PTR [rax]
0x00000000000008b4 <+112>: movsx eax,al
0x00000000000008b7 <+115>: shl eax,0x5
0x00000000000008ba <+118>: mov ecx,eax
0x00000000000008bc <+120>: mov eax,DWORD PTR [rbp-0x14]
0x00000000000008bf <+123>: movsxd rdx,eax
0x00000000000008c2 <+126>: mov rax,QWORD PTR [rbp-0x28]
0x00000000000008c6 <+130>: add rax,rdx
0x00000000000008c9 <+133>: or esi,ecx
0x00000000000008cb <+135>: mov edx,esi
0x00000000000008cd <+137>: mov BYTE PTR [rax],dl
0x00000000000008cf <+139>: add DWORD PTR [rbp-0x14],0x1
0x00000000000008d3 <+143>: mov eax,DWORD PTR [rbp-0x14]
0x00000000000008d6 <+146>: movsxd rbx,eax
0x00000000000008d9 <+149>: mov rax,QWORD PTR [rbp-0x28]
0x00000000000008dd <+153>: mov rdi,rax
0x00000000000008e0 <+156>: call 0x5f0 <strlen@plt>
0x00000000000008e5 <+161>: cmp rbx,rax
0x00000000000008e8 <+164>: jb 0x85a <encode+22>
0x00000000000008ee <+170>: nop
0x00000000000008ef <+171>: add rsp,0x28
0x00000000000008f3 <+175>: pop rbx
0x00000000000008f4 <+176>: pop rbp
0x00000000000008f5 <+177>: ret

From the assembly, we can observe the following:

  1. IDA makes it simple to see that there exists a loop. The index loops until the end of string – function argument.
  2. Each character in the function argument is accessed through indexing.
  3. The encoding scheme consists of addition, circular shifts and XOR.

SMT solving using angr

“angr is a platform-agnostic binary analysis framework.”

I’d been studying angr docs over the past few days and I came across its solver engine. angr internally utilizes Z3 – a SMT solver by Microsoft.

Studying the assembly code, we can theorize that the core encoding is taken care by the following statements:

...
0x000000000000086a <+38>: add eax,0x4
...
0x000000000000087c <+56>: xor ecx,0xc
...
0x0000000000000896 <+82>: mov esi,0x3
...
0x000000000000089d <+89>: call 0x73a <logical_shift_right>
...
0x00000000000008b7 <+115>: shl eax,0x5
...
0x00000000000008c9 <+133>: or esi,ecx
...

The encoding scheme can be described as:

  1. 0x4 is added to a character.
  2. The result is xor’d with 0xc
  3. The result is shifted right circularly by 3 bits.

This encoding scheme can be codified in angr as follows:

#!/usr/bin/python

import angr

def rshift(val, n):
"""
Logical right shift
"""
    return (val % 0x100000000) >> n

# Hex values of the obfuscated string
encoded_string_hex = [0xffffff8a, 0xc, 0x25, 0x6f, 0x5, 0xffffffaf, \
                      0xffffffae, 0xffffff84, 0xffffff8e, 0x5, 0x4d, \
                      0x67, 0x5, 0xffffffac, 0xffffffcf, 0x6d, 0x7, \
                      0xffffff8c, 0x67, 0xffffff8c]

decoded_string = ""

proj = angr.Project("encode_decode")
state = proj.factory.entry_state()

for hex_code in encoded_string_hex:
    ch = state.solver.BVS("ch", 8)

    # Construct encoding expression
    base = (ch + 0x4) ^ 0xC
    expr = rshift(base, 3) | base << 5

    # Constrain the input to generate the required output
    state.solver.add(expr == hex_code)

    try:
        # Find inputs to satisfy the constraint.
        value_ch = state.solver.eval(ch)
        decoded_string += chr(value_ch)
    except angr.errors.SimUnsatError:
        print ("Unsat result for " + hex(hex_code))
        state.solver._stored_solver = None
        decoded_string += "."

print (decoded_string)

When the above code is executed, Z3 finds the input (deobfuscated characters) which satisfies the constraint (output = obfuscated characters).

(angr) angr@ubuntu:~/Downloads/angr/custom_encoding$ python solver.py 
WARNING | 2019-05-11 18:57:20,539 | cle.loader | The main binary is a position-independent executable. It is being loaded with a base address of 0x400000.
Th!s mu$t b3 enc0d3d

This was a very simple encoding scheme. If you’ve a real malware sample that uses a sophisticated algebraic-like (or combinatoric) encoding scheme, please let me know and I’ll definitely apply SMT solving concepts to it.

Thanks for reading!

In this article, I demonstrated ONE application of SMT solvers in reverse engineering. I’ve not yet explored other applications but according to Thaís Moreira Hamasaki, they include garbage code elimination, packing, crypto analysis etc. Here’s her talk on SMT solvers in security: https://www.youtube.com/watch?v=cyr_4_rN4pY

Thank you for reading! If you have any questions, leave them in the comments section below and I’ll get back to you as soon as I can!

Feature image credit: https://www.researchgate.net/figure/Schematic-view-of-an-SMT-solver_fig2_236147244

Leave a Reply

Your email address will not be published.