Most malware authors employ string obfuscation techniques to hide important strings from malware analysts. Usually, my approach to deobfuscating these strings would be to either execute the malware sample under a debugger or codify the decoding scheme in a high-level language like Python.
These methods work well. Even if the malware employs anti-debugging techniques, they can be bypassed by carefully stepping through instructions in debuggers like
x64dbg, etc. However, it cannot be denied that it is hard work. Codifying the decoding scheme in a high-level language by studying the assembly code is quite prone to errors. If a mistake does occur, it is again hard work to find the faulty statement(s).
Satisfiability Modulo Theories (SMT). In simple words, it is a computer science concept which involves building a formula (or expression) and finding inputs that satisfy the said formula. With respect to string deobfuscation, it has the potential to find the input to an algebraic-like (or combinatoric) encoding scheme, provided the obfuscated string is available. Also, I don’t have to codify the decoding scheme, instead I codify the encoding scheme which is relatively easier because I just have to copy the assembly instruction’s functionality line by line.
Sample Encoding Scheme
Consider the simple encoding scheme shown in the snap below. It is not very sophisticated, but it serves the purpose for now.
0x0000000000000844 <+0>: push rbp 0x0000000000000845 <+1>: mov rbp,rsp 0x0000000000000848 <+4>: push rbx 0x0000000000000849 <+5>: sub rsp,0x28 0x000000000000084d <+9>: mov QWORD PTR [rbp-0x28],rdi 0x0000000000000851 <+13>: mov DWORD PTR [rbp-0x14],0x0 0x0000000000000858 <+20>: jmp 0x8d3 <encode+143> 0x000000000000085a <+22>: mov eax,DWORD PTR [rbp-0x14] 0x000000000000085d <+25>: movsxd rdx,eax 0x0000000000000860 <+28>: mov rax,QWORD PTR [rbp-0x28] 0x0000000000000864 <+32>: add rax,rdx 0x0000000000000867 <+35>: movzx eax,BYTE PTR [rax] 0x000000000000086a <+38>: add eax,0x4 0x000000000000086d <+41>: mov ecx,eax 0x000000000000086f <+43>: mov eax,DWORD PTR [rbp-0x14] 0x0000000000000872 <+46>: movsxd rdx,eax 0x0000000000000875 <+49>: mov rax,QWORD PTR [rbp-0x28] 0x0000000000000879 <+53>: add rax,rdx 0x000000000000087c <+56>: xor ecx,0xc 0x000000000000087f <+59>: mov edx,ecx 0x0000000000000881 <+61>: mov BYTE PTR [rax],dl 0x0000000000000883 <+63>: mov eax,DWORD PTR [rbp-0x14] 0x0000000000000886 <+66>: movsxd rdx,eax 0x0000000000000889 <+69>: mov rax,QWORD PTR [rbp-0x28] 0x000000000000088d <+73>: add rax,rdx 0x0000000000000890 <+76>: movzx eax,BYTE PTR [rax] 0x0000000000000893 <+79>: movsx eax,al 0x0000000000000896 <+82>: mov esi,0x3 0x000000000000089b <+87>: mov edi,eax 0x000000000000089d <+89>: call 0x73a <logical_shift_right> 0x00000000000008a2 <+94>: mov esi,eax 0x00000000000008a4 <+96>: mov eax,DWORD PTR [rbp-0x14] 0x00000000000008a7 <+99>: movsxd rdx,eax 0x00000000000008aa <+102>: mov rax,QWORD PTR [rbp-0x28] 0x00000000000008ae <+106>: add rax,rdx 0x00000000000008b1 <+109>: movzx eax,BYTE PTR [rax] 0x00000000000008b4 <+112>: movsx eax,al 0x00000000000008b7 <+115>: shl eax,0x5 0x00000000000008ba <+118>: mov ecx,eax 0x00000000000008bc <+120>: mov eax,DWORD PTR [rbp-0x14] 0x00000000000008bf <+123>: movsxd rdx,eax 0x00000000000008c2 <+126>: mov rax,QWORD PTR [rbp-0x28] 0x00000000000008c6 <+130>: add rax,rdx 0x00000000000008c9 <+133>: or esi,ecx 0x00000000000008cb <+135>: mov edx,esi 0x00000000000008cd <+137>: mov BYTE PTR [rax],dl 0x00000000000008cf <+139>: add DWORD PTR [rbp-0x14],0x1 0x00000000000008d3 <+143>: mov eax,DWORD PTR [rbp-0x14] 0x00000000000008d6 <+146>: movsxd rbx,eax 0x00000000000008d9 <+149>: mov rax,QWORD PTR [rbp-0x28] 0x00000000000008dd <+153>: mov rdi,rax 0x00000000000008e0 <+156>: call 0x5f0 <strlen@plt> 0x00000000000008e5 <+161>: cmp rbx,rax 0x00000000000008e8 <+164>: jb 0x85a <encode+22> 0x00000000000008ee <+170>: nop 0x00000000000008ef <+171>: add rsp,0x28 0x00000000000008f3 <+175>: pop rbx 0x00000000000008f4 <+176>: pop rbp 0x00000000000008f5 <+177>: ret
From the assembly, we can observe the following:
IDAmakes it simple to see that there exists a loop. The index loops until the end of string – function argument.
- Each character in the function argument is accessed through indexing.
- The encoding scheme consists of addition, circular shifts and
SMT solving using
“angr is a platform-agnostic binary analysis framework.”
Studying the assembly code, we can theorize that the core encoding is taken care by the following statements:
... 0x000000000000086a <+38>: add eax,0x4 ... 0x000000000000087c <+56>: xor ecx,0xc ... 0x0000000000000896 <+82>: mov esi,0x3 ... 0x000000000000089d <+89>: call 0x73a <logical_shift_right> ... 0x00000000000008b7 <+115>: shl eax,0x5 ... 0x00000000000008c9 <+133>: or esi,ecx ...
The encoding scheme can be described as:
0x4is added to a character.
- The result is xor’d with
- The result is shifted right circularly by 3 bits.
This encoding scheme can be codified in
angr as follows:
#!/usr/bin/python import angr def rshift(val, n): """ Logical right shift """ return (val % 0x100000000) >> n # Hex values of the obfuscated string encoded_string_hex = [0xffffff8a, 0xc, 0x25, 0x6f, 0x5, 0xffffffaf, \ 0xffffffae, 0xffffff84, 0xffffff8e, 0x5, 0x4d, \ 0x67, 0x5, 0xffffffac, 0xffffffcf, 0x6d, 0x7, \ 0xffffff8c, 0x67, 0xffffff8c] decoded_string = "" proj = angr.Project("encode_decode") state = proj.factory.entry_state() for hex_code in encoded_string_hex: ch = state.solver.BVS("ch", 8) # Construct encoding expression base = (ch + 0x4) ^ 0xC expr = rshift(base, 3) | base << 5 # Constrain the input to generate the required output state.solver.add(expr == hex_code) try: # Find inputs to satisfy the constraint. value_ch = state.solver.eval(ch) decoded_string += chr(value_ch) except angr.errors.SimUnsatError: print ("Unsat result for " + hex(hex_code)) state.solver._stored_solver = None decoded_string += "." print (decoded_string)
When the above code is executed, Z3 finds the input (deobfuscated characters) which satisfies the constraint (output = obfuscated characters).
(angr) angr@ubuntu:~/Downloads/angr/custom_encoding$ python solver.py WARNING | 2019-05-11 18:57:20,539 | cle.loader | The main binary is a position-independent executable. It is being loaded with a base address of 0x400000. Th!s mu$t b3 enc0d3d
This was a very simple encoding scheme. If you’ve a real malware sample that uses a sophisticated algebraic-like (or combinatoric) encoding scheme, please let me know and I’ll definitely apply SMT solving concepts to it.
Thanks for reading!
In this article, I demonstrated ONE application of SMT solvers in reverse engineering. I’ve not yet explored other applications but according to Thaís Moreira Hamasaki, they include garbage code elimination, packing, crypto analysis etc. Here’s her talk on SMT solvers in security: https://www.youtube.com/watch?v=cyr_4_rN4pY
Thank you for reading! If you have any questions, leave them in the comments section below and I’ll get back to you as soon as I can!
Feature image credit: https://www.researchgate.net/figure/Schematic-view-of-an-SMT-solver_fig2_236147244