Introduction to Decompilation vs. Disassembly

A decompiler represents executable binary files in a readable form. More precisely, it transforms binary code into text that software developers can read and modify. The software security industry relies on this transformation to analyze and validate programs. The analysis is performed on the binary code because the source code (the text form of the software) is usually not available: vendors consider it a commercial secret.

Programs that transform binary code into text form have always existed. Disassemblers perform a simple one-to-one mapping of processor instruction codes into instruction mnemonics. Many disassemblers are available on the market, both free and commercial. The most powerful disassembler is our own IDA Pro. It can handle binary code for a huge number of processors and has an open architecture that allows developers to write add-on analytic modules.

Decompilers differ from disassemblers in one very important respect. While both generate human-readable text, decompilers generate much higher-level text, which is more concise and much easier to read.

Compared to low level assembly language, high level language representation has several advantages:

  • It is concise.

  • It is structured.

  • It doesn't require developers to know the assembly language.

  • It recognizes and converts low level idioms into high level notions.

  • It is less confusing and therefore easier to understand.

  • It is less repetitive and less distracting.

  • It uses data flow analysis.

Let's consider these points in detail.

Usually the decompiler's output is five to ten times shorter than the disassembler's output. For example, a typical modern program contains from 400KB to 5MB of binary code. The disassembler's output for such a program will include around 5-100MB of text, which can take anything from several weeks to several months to analyze completely. Analysts cannot spend this much time on a single program for economic reasons.

The decompiler's output for a typical program will be from 400KB to 10MB. Although this is still a big volume to read and understand (about the size of a thick book), the time needed for analysis is divided by 10 or more.

The second big difference is that the decompiler output is structured. Instead of a linear flow of instructions where each line is similar to all the others, the text is indented to make the program logic explicit. Control flow constructs such as conditional statements, loops, and switches are marked with the appropriate keywords.

The decompiler's output is easier to understand than the disassembler's output because it is high level. To be able to use a disassembler, an analyst must know the target processor's assembly language. Mainstream programmers do not use assembly languages for everyday tasks, but virtually everyone uses high level languages today. Decompilers remove the gap between the typical programming languages and the output language. More analysts can use a decompiler than a disassembler.

Decompilers convert assembly-level idioms into high-level abstractions. Some idioms can be quite long and time-consuming to analyze. The following single line of code

x = y / 2;

can be transformed by the compiler into a series of 20-30 processor instructions. It takes at least 15-30 seconds for an experienced analyst to recognize the pattern and mentally replace it with the original line. If the code includes many such idioms, an analyst is forced to take notes and mark each pattern with its short representation. All this slows down the analysis tremendously. Decompilers remove this burden from the analysts.
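To make the idiom concrete, here is a minimal sketch in C of why even a signed division by two is more than a single shift. The function names (`sar1`, `shift_only`, `div2_idiom`) are hypothetical, and the instruction sequence mentioned in the comments is only one common x86 encoding, not necessarily the one any particular compiler emits. The sketch assumes a mainstream compiler where converting an out-of-range unsigned value to `int32_t` wraps around.

```c
#include <stdint.h>

/* Arithmetic shift right by one, written portably: C leaves ">>" on
   negative signed values implementation-defined. */
static int32_t sar1(int32_t v) {
    return v < 0 ? ~(~v >> 1) : v >> 1;
}

/* A plain shift rounds toward minus infinity, so it disagrees with C's
   "/" (which rounds toward zero) for odd negative inputs. */
static int32_t shift_only(int32_t y) {
    return sar1(y);
}

/* The idiom: add the sign bit (1 for negative y, 0 otherwise) first,
   then shift. On x86 this is roughly "cdq / sub eax, edx / sar eax, 1". */
static int32_t div2_idiom(int32_t y) {
    uint32_t sign = (uint32_t)y >> 31;
    int32_t biased = (int32_t)((uint32_t)y + sign);
    return sar1(biased);
}
```

For `y = -7`, `shift_only` yields -4 while `div2_idiom` yields -3, which matches `-7 / 2` in C. An analyst reading the disassembly has to recognize the bias-then-shift pattern before concluding that it is just a division.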

The number of assembly instructions to analyze is huge. They look very similar to each other and their patterns are very repetitive. Reading disassembler output is nothing like reading a captivating story. In a compiler-generated program, 95% of the code will be really boring to read and analyze. It is extremely easy for an analyst to confuse two similar-looking snippets of code and simply get lost in the output. These two factors (the size and the boring nature of the text) lead to the following phenomenon: binary programs are never fully analyzed. Analysts try to locate suspicious parts by using some heuristics and some automation tools. Exceptions happen when the program is extremely small or an analyst devotes a disproportionately large amount of time to the analysis. Decompilers alleviate both problems: their output is shorter and less repetitive. The output still contains some repetition, but it is manageable by a human being. Besides, this repetition can be addressed by automating the analysis.

Repetitive patterns in the binary code call for a solution. One obvious solution is to employ the computer to find patterns and somehow reduce them into something shorter and easier for human analysts to grasp. Some disassemblers (including IDA Pro) provide a means to automate analysis. However, the number of available analytical modules stays low, so repetitive code continues to be a problem. The main reason is that recognizing binary patterns is a surprisingly difficult task. Any "simple" action, including basic arithmetic operations such as addition and subtraction, can be represented in an endless number of ways in binary form. The compiler might use the addition operator for subtraction and vice versa. It can store constant numbers somewhere in its memory and load them when needed. It can use the fact that, after some operations, the register value can be proven to be a known constant, and just use the register without reinitializing it. The diversity of methods used explains the small number of available analytical modules.
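As a small illustration of the "addition for subtraction" point, the following C sketch shows two encodings of the same arithmetic on 32-bit values. The function names are hypothetical; the only assumption is ordinary modular (wrap-around) arithmetic on `uint32_t`, which the C standard guarantees.

```c
#include <stdint.h>

/* x - 5 expressed as an addition: 0xFFFFFFFB is 2^32 - 5, so adding it
   modulo 2^32 has the same effect as subtracting 5. */
static uint32_t sub5_via_add(uint32_t x) {
    return x + 0xFFFFFFFBu;
}

/* x + 5 expressed as a subtraction of the same constant. */
static uint32_t add5_via_sub(uint32_t x) {
    return x - 0xFFFFFFFBu;
}
```

A pattern matcher that looks only for a subtract instruction with the constant 5 would miss the first form entirely, which is one reason binary-level pattern recognition is harder than it looks.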

The situation is different with a decompiler. Automation becomes much easier because the decompiler provides the analyst with high level notions. Many patterns are automatically recognized and replaced with abstract notions. The remaining patterns can be detected easily because of the formalisms the decompiler introduces. For example, the notions of function parameters and calling conventions are strictly formalized. Decompilers make it extremely easy to find the parameters of any function call, even if those parameters are initialized far away from the call instruction. With a disassembler, this is a daunting task, which requires handling each case individually.

Decompilers, in contrast with disassemblers, perform extensive data flow analysis on the input. This means that questions such as "Where is the variable initialized?" and "Is this variable used?" can be answered immediately, without doing any extensive search over the function. Analysts routinely pose and answer these questions, and having the answers immediately increases their productivity.

Side-by-side comparisons of disassembly and decompilation

Below you will find side-by-side comparisons of disassembly and decompilation outputs.

Division by two

Just note the difference in size! With the disassembler output, you not only need to know that compilers generate such convoluted code for signed divisions and modulo operations, but you also have to spend time recognizing the patterns. Needless to say, the decompiler makes things really simple.

Simple enough?

Questions like

  • What are the possible return values of the function?

  • Does the function use any strings?

  • What does the function do?

can be answered almost instantly by looking at the decompiler output. Needless to say, it looks better because I renamed the local variables. In the disassembler, registers are rarely renamed because renaming hides the register use and can lead to confusion.

; =============== S U B R O U T I N E =======================================
; int __cdecl sub_4061C0(char *Str, char *Dest)
sub_4061C0      proc near               ; CODE XREF: sub_4062F0+15p
                                        ; sub_4063D4+21p ...
Str             = dword ptr  4
Dest            = dword ptr  8
                push    esi
                push    offset aSmtp_   ; "smtp."
                push    [esp+8+Dest]    ; Dest
                call    _strcpy
                mov     esi, [esp+0Ch+Str]
                push    esi             ; Str
                call    _strlen
                add     esp, 0Ch
                xor     ecx, ecx
                test    eax, eax
                jle     short loc_4061ED
loc_4061E2:                             ; CODE XREF: sub_4061C0+2Bj
                cmp     byte ptr [ecx+esi], 40h
                jz      short loc_4061ED
                inc     ecx
                cmp     ecx, eax
                jl      short loc_4061E2
loc_4061ED:                             ; CODE XREF: sub_4061C0+20j
                                        ; sub_4061C0+26j
                dec     eax
                cmp     ecx, eax
                jl      short loc_4061F6
                xor     eax, eax
                pop     esi
                retn
; ---------------------------------------------------------------------------
loc_4061F6:                             ; CODE XREF: sub_4061C0+30j
                lea     eax, [ecx+esi+1]
                push    eax             ; Source
                push    [esp+8+Dest]    ; Dest
                call    _strcat
                pop     ecx
                pop     ecx
                push    1
                pop     eax
                pop     esi
                retn
sub_4061C0      endp
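For comparison, the listing above can be rewritten by hand as a few lines of C. This is a hand-made reconstruction for illustration, not actual decompiler output: the function name `build_smtp_host` and the local variable names are hypothetical, and `Str` is declared `const` here only so that string literals can be passed safely.

```c
#include <string.h>

/* Hand-written C equivalent of sub_4061C0. Like the original, it does no
   bounds checking on Dest. */
static int build_smtp_host(const char *Str, char *Dest) {
    int len, i;

    strcpy(Dest, "smtp.");
    len = (int)strlen(Str);

    /* Find the first '@' (0x40); i == len if there is none. */
    for (i = 0; i < len; ++i) {
        if (Str[i] == '@')
            break;
    }

    /* Nothing after the '@', or no '@' at all: report failure. */
    if (i >= len - 1)
        return 0;

    strcat(Dest, Str + i + 1);   /* append the part after '@' */
    return 1;
}
```

Given the input "user@example.com", the function fills `Dest` with "smtp.example.com" and returns 1; for input without a usable "@" it returns 0 and leaves "smtp." in `Dest`. The three questions above (return values, strings used, purpose) can be read directly off this form.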

Where's my variable?

IDA highlights the current identifier. This feature turns out to be much more useful with high level output. In this sample, I tried to trace how the retrieved function pointer is used by the function. In the disassembly output, many wrong eax occurrences are highlighted while the decompiler did exactly what I wanted.

; =============== S U B R O U T I N E =======================================
; int __cdecl myfunc(wchar_t *Str, int)
myfunc          proc near               ; CODE XREF: sub_4060+76p
                                        ; .text:42E4p
Str             = dword ptr  4
arg_4           = dword ptr  8
                mov     eax, dword_1001F608
                cmp     eax, 0FFFFFFFFh
                jnz     short loc_10003AB6
                push    offset aGetsystemwindo ; "GetSystemWindowsDirectoryW"
                push    offset aKernel32_dll ; "KERNEL32.DLL"
                call    ds:GetModuleHandleW
                push    eax             ; hModule
                call    ds:GetProcAddress
                mov     dword_1001F608, eax
loc_10003AB6:                           ; CODE XREF: myfunc+8j
                test    eax, eax
                push    esi
                mov     esi, [esp+4+arg_4]
                push    edi
                mov     edi, [esp+8+Str]
                push    esi
                push    edi
                jz      short loc_10003ACA
                call    eax ; dword_1001F608
                jmp     short loc_10003AD0
; ---------------------------------------------------------------------------
loc_10003ACA:                           ; CODE XREF: myfunc+34j
                call    ds:GetWindowsDirectoryW
loc_10003AD0:                           ; CODE XREF: myfunc+38j
                sub     esi, eax
                cmp     esi, 5
                jnb     short loc_10003ADD
                pop     edi
                add     eax, 5
                pop     esi
                retn
; ---------------------------------------------------------------------------
loc_10003ADD:                           ; CODE XREF: myfunc+45j
                push    offset aInf_0   ; "\\inf"
                push    edi             ; Dest
                call    _wcscat
                push    edi             ; Str
                call    _wcslen
                add     esp, 0Ch
                pop     edi
                pop     esi
                retn
myfunc          endp

Arithmetic is not rocket science

Arithmetic is not rocket science, but it is always better if someone handles it for you. You have more important things to focus on.

; =============== S U B R O U T I N E =======================================
; Attributes: bp-based frame
; sgell(__int64, __int64)
                public @sgell$qjj
@sgell$qjj      proc near
arg_0           = dword ptr  8
arg_4           = dword ptr  0Ch
arg_8           = dword ptr  10h
arg_C           = dword ptr  14h
                push    ebp
                mov     ebp, esp
                mov     eax, [ebp+arg_0]
                mov     edx, [ebp+arg_4]
                cmp     edx, [ebp+arg_C]
                jnz     short loc_10226
                cmp     eax, [ebp+arg_8]
                setnb   al
                jmp     short loc_10229
; ---------------------------------------------------------------------------
loc_10226:                          ; CODE XREF: sgell(__int64,__int64)+Cj
                setnl   al
loc_10229:                          ; CODE XREF: sgell(__int64,__int64)+14j
                and     eax, 1
                pop     ebp
                retn
@sgell$qjj      endp
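The listing above compares two 64-bit signed integers on a 32-bit processor. As a sketch of what it computes, the C below gives both a one-line version and a version that mirrors the instruction sequence half by half. This is a manual reading of the listing rather than decompiler output; the helper name `sgell_by_halves` is hypothetical, and the code assumes the usual wrap-around behavior when converting an unsigned value to `int32_t`.

```c
#include <stdint.h>

/* What the function boils down to: signed "greater than or equal". */
static int sgell(int64_t a, int64_t b) {
    return a >= b;
}

/* The same test as the listing performs it, on 32-bit halves. */
static int sgell_by_halves(int64_t a, int64_t b) {
    uint32_t a_lo = (uint32_t)(uint64_t)a;
    uint32_t b_lo = (uint32_t)(uint64_t)b;
    int32_t  a_hi = (int32_t)(uint32_t)((uint64_t)a >> 32);
    int32_t  b_hi = (int32_t)(uint32_t)((uint64_t)b >> 32);

    if (a_hi != b_hi)
        return a_hi >= b_hi;   /* setnl: signed compare of high halves */
    return a_lo >= b_lo;       /* setnb: unsigned compare of low halves */
}
```

Note the mix of a signed comparison for the high halves and an unsigned one for the low halves. Getting that detail right by eye is exactly the kind of work that is better delegated.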

Sample window procedure

The decompiler recognized a switch statement and nicely represented the window procedure. Without this little help the user would have to calculate the message numbers herself. Nothing particularly difficult, just time consuming and boring. What if she makes a mistake?...

; =============== S U B R O U T I N E =======================================
wndproc         proc near               ; DATA XREF: sub_4010E0+21o
Paint           = tagPAINTSTRUCT ptr -0A4h
Buffer          = byte ptr -64h
hWnd            = dword ptr  4
Msg             = dword ptr  8
wParam          = dword ptr  0Ch
lParam          = dword ptr  10h
                mov     ecx, hInstance
                sub     esp, 0A4h
                lea     eax, [esp+0A4h+Buffer]
                push    64h             ; nBufferMax
                push    eax             ; lpBuffer
                push    6Ah             ; uID
                push    ecx             ; hInstance
                call    ds:LoadStringA
                mov     ecx, [esp+0A4h+Msg]
                mov     eax, ecx
                sub     eax, 2
                jz      loc_4013E8
                sub     eax, 0Dh
                jz      loc_4013B2
                sub     eax, 102h
                jz      short loc_401336
                mov     edx, [esp+0A4h+lParam]
                mov     eax, [esp+0A4h+wParam]
                push    edx             ; lParam
                push    eax             ; wParam
                push    ecx             ; Msg
                mov     ecx, [esp+0B0h+hWnd]
                push    ecx             ; hWnd
                call    ds:DefWindowProcA
                add     esp, 0A4h
                retn    10h
; ---------------------------------------------------------------------------
loc_401336:                             ; CODE XREF: wndproc+3Cj
                mov     ecx, [esp+0A4h+wParam]
                mov     eax, ecx
                and     eax, 0FFFFh
                sub     eax, 68h
                jz      short loc_40138A
                dec     eax
                jz      short loc_401371
                mov     edx, [esp+0A4h+lParam]
                mov     eax, [esp+0A4h+hWnd]
                push    edx             ; lParam
                push    ecx             ; wParam
                push    111h            ; Msg
                push    eax             ; hWnd
                call    ds:DefWindowProcA
                add     esp, 0A4h
                retn    10h
; ---------------------------------------------------------------------------
loc_401371:                             ; CODE XREF: wndproc+7Aj
                mov     ecx, [esp+0A4h+hWnd]
                push    ecx             ; hWnd
                call    ds:DestroyWindow
                xor     eax, eax
                add     esp, 0A4h
                retn    10h
; ---------------------------------------------------------------------------
loc_40138A:                             ; CODE XREF: wndproc+77j
                mov     edx, [esp+0A4h+hWnd]
                mov     eax, hInstance
                push    0               ; dwInitParam
                push    offset DialogFunc ; lpDialogFunc
                push    edx             ; hWndParent
                push    67h             ; lpTemplateName
                push    eax             ; hInstance
                call    ds:DialogBoxParamA
                xor     eax, eax
                add     esp, 0A4h
                retn    10h
; ---------------------------------------------------------------------------
loc_4013B2:                             ; CODE XREF: wndproc+31j
                push    esi
                mov     esi, [esp+0A8h+hWnd]
                lea     ecx, [esp+0A8h+Paint]
                push    ecx             ; lpPaint
                push    esi             ; hWnd
                call    ds:BeginPaint
                push    eax             ; HDC
                push    esi             ; hWnd
                call    my_paint
                add     esp, 8
                lea     edx, [esp+0A8h+Paint]
                push    edx             ; lpPaint
                push    esi             ; hWnd
                call    ds:EndPaint
                pop     esi
                xor     eax, eax
                add     esp, 0A4h
                retn    10h
; ---------------------------------------------------------------------------
loc_4013E8:                             ; CODE XREF: wndproc+28j
                push    0               ; nExitCode
                call    ds:PostQuitMessage
                xor     eax, eax
                add     esp, 0A4h
                retn    10h
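To show the kind of calculation the reader is spared, the sketch below models only the message dispatch at the top of the listing, in portable C with no Windows headers. The `enum action` type and both function names are hypothetical. The numeric values are the standard Win32 message numbers (WM_DESTROY is 0x0002, WM_PAINT is 0x000F, WM_COMMAND is 0x0111), written as literals so the example compiles anywhere.

```c
enum action { ACT_DEFAULT, ACT_DESTROY, ACT_PAINT, ACT_COMMAND };

/* Mirrors the listing: each "sub eax, N / jz" tests the running value,
   so the message number is the sum of all constants subtracted so far. */
static enum action dispatch_by_subtraction(unsigned msg) {
    unsigned eax = msg;
    eax -= 0x2;   if (eax == 0) return ACT_DESTROY;  /* 0x2          */
    eax -= 0x0D;  if (eax == 0) return ACT_PAINT;    /* 0x2 + 0xD    */
    eax -= 0x102; if (eax == 0) return ACT_COMMAND;  /* 0xF + 0x102  */
    return ACT_DEFAULT;
}

/* The same logic the way a decompiler would present it. */
static enum action dispatch_by_switch(unsigned msg) {
    switch (msg) {
    case 0x0002: return ACT_DESTROY;   /* WM_DESTROY */
    case 0x000F: return ACT_PAINT;     /* WM_PAINT   */
    case 0x0111: return ACT_COMMAND;   /* WM_COMMAND */
    default:     return ACT_DEFAULT;
    }
}
```

Both functions return the same result for every message number. The second form requires no running sums, which is where mistakes tend to creep in when the work is done by hand.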