Here are some side-by-side comparisons of disassembly and decompiler for PowerPC. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
This simple function calculates the sum of the squares of the first N natural numbers. While the function logic is obvious from the decompiler output alone, the assembly listing has too much noise and requires careful study. The decompiler saves you time and allows you to concentrate on more exciting aspects of reverse engineering.
The PowerPC processor has a number of instructions which can be used to avoid branches (for example cntlzw). The decompiler restores the conditional logic and makes code easier to understand.
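To illustrate the idea, here is a minimal C++ sketch of the classic cntlzw idiom for a branch-free zero test (the helper and function names are ours, and cntlzw is modelled in software):

```cpp
#include <cstdint>

// Software model of cntlzw: counts the leading zero bits of a 32-bit word.
// The result is 32 only when x == 0.
static uint32_t cntlzw(uint32_t x) {
    uint32_t n = 0;
    while (n < 32 && !(x & (0x80000000u >> n)))
        ++n;
    return n;
}

// What the branchless machine code effectively computes:
uint32_t is_zero_branchless(uint32_t x) {
    return cntlzw(x) >> 5;   // 32 >> 5 == 1 when x == 0, otherwise 0
}

// The conditional logic the decompiler restores:
bool is_zero_readable(uint32_t x) {
    return x == 0;
}
```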
64-bit comparisons usually involve several compare and branch instructions, which do not improve code readability.
A system call always looks mysterious, but the decompiler helps you by recovering its name and arguments.
Compilers sometimes use helper functions; the decompiler knows the meaning of many of them and uses this knowledge to simplify the code.
The PowerPC processor contains a number of complex floating point instructions which perform several operations at once. Recovering the original expression from the assembler code is not easy for a human, but it is no problem for the decompiler.
Compilers can decompose a multiplication/division instruction into a sequence of cheaper instructions (additions, shifts, etc). This example demonstrates how the decompiler recognizes such sequences and collapses them back into the original operation.
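As a hedged illustration (our own example, not taken from the listing above), here is the well-known "multiply by a magic constant" form of an unsigned division by 10, and the original operation the decompiler restores:

```cpp
#include <cstdint>

// Division by a constant is often decomposed into a multiply-high plus a shift.
// This is the classic sequence for unsigned division by 10:
uint32_t div10_decomposed(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 35);  // == x / 10 for 32-bit x
}

// The operation the decompiler collapses the sequence back into:
uint32_t div10_original(uint32_t x) {
    return x / 10;
}
```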
This example demonstrates that the decompiler can handle VLE code without problems.
The pseudocode is not static: the decompiler is interactive, just like IDA. You can change variable types and names, change function prototypes, add comments, and more. The example above shows the result after such modifications.
Surely the result is not ideal, and there is a lot of room for improvement, but we hope that you got the idea.
And you can compare the result with the original: http://lxr.free-electrons.com/source/fs/fat/namei_msdos.c#L224
Here are some side-by-side comparisons of disassembly and decompiler for ARM. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
Let's start with a very simple function. It accepts a pointer to a structure and zeroes out its first three fields. While the function logic is obvious by just looking at the decompiler output, the assembly listing has too much noise and requires studying it.
The decompiler saves you time and allows you to concentrate on more exciting aspects of reverse engineering.
Sorry for a long code snippet, ARM code tends to be longer compared to x86 code. This makes our comparison even more impressive: look at how concise the decompiler output is!
The ARM processor has conditional instructions that can shorten the code but require close attention from the reader. The case above is very simple; just note that there is a pair of instructions, MOVNE and LDREQSH. Only one of them will be executed at a time. This is how a simple if-then-else looks in ARM.
The pseudocode shows it much better and does not require any explanations.
A quiz question: did you notice that MOVNE loads zero to R0? (because I didn't :) Also note that in the disassembly listing we see var_8, but the location really used is var_A, which corresponds to v4.
Look, the decompiler output is longer! This is a rare case when the pseudocode is longer than the disassembly listing, but it is for a good cause: to keep it readable. There are so many conditional instructions here that it is very easy to misunderstand the dependencies. For example, did you notice that the first MOVEQ may use the condition codes set by CMP? The subtle detail is that CMPNE may be skipped, so the condition codes set by CMP may reach the MOVEQs.
The decompiler represented it perfectly well. I renamed some variables and set their types, but this was an easy task.
Conditional instructions are just part of the story. ARM is also famous for having a plethora of data movement instructions. They come with a set of possible suffixes that subtly change the meaning of the instruction. Take STMCSIA, for example. It is a STM instruction, but then you have to remember that CS means "carry set" and IA means "increment after".
In short, the disassembly listing is like Chinese. The pseudocode is longer but requires much less time to understand.
Sorry for another long code snippet. Just wanted to show you that the decompiler can handle compiler helper functions (like __divdi3) and handles 64-bit arithmetic quite well.
Since ARM instructions cannot have big immediate constants, sometimes they are loaded with two instructions. There are many 0xFA (250 decimal) constants in the disassembly listing, but all of them are shifted to the left by 2 before use. The decompiler saves you from these petty details.
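A small sketch of the pattern (the surrounding arithmetic is invented purely for illustration):

```cpp
#include <cstdint>

// What the disassembly suggests: the immediate 0xFA (250) is materialized and
// then shifted left by 2 at the point of use.
uint32_t scaled_disassembly_view(uint32_t n) {
    uint32_t imm = 0xFA;          // the constant visible in the listing
    return n * (imm << 2);        // 0xFA << 2 == 1000
}

// What the pseudocode shows after folding the shift into the constant:
uint32_t scaled_pseudocode_view(uint32_t n) {
    return n * 1000;
}
```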
Also, a side note: the decompiler can handle ARM mode as well as Thumb mode instructions. It simply does not care about the instruction encoding, because that is already handled by IDA.
In some cases the disassembly listing can be misleading, especially with PIC (position independent code). While the address of a constant string is loaded into R12, the code does not care about it; it is just how variable addresses are calculated in PIC code (it is .got minus some offset). Such calculations are very frequent in shared objects and unfortunately IDA cannot handle all of them. But the decompiler did a great job of tracing R12.
Hex-Rays' support for exceptions in Microsoft Visual C++/x64 incorporates the C++ exception metadata for functions into their decompilation and presents the results via built-in constructs (`try`, `catch`, `__wind`, `__unwind`). When the results cannot be presented entirely with these constructs, they are presented via helper calls in the decompilation.
The documentation describes:
# TRY, CATCH, AND THROW
The C++ language provides the `try` scoped construct in which the developer expects that an exception might occur. `try` blocks must be followed by one or more scoped `catch` constructs for catching exceptions that may occur within. `catch` blocks may use `...` to catch any exception. Alternatively, `catch` blocks may name the type of an exception, such as `std::bad_alloc`. `catch` blocks with named types may or may not also catch the exception object itself. For example, `catch(std::bad_alloc *v10)` and `catch(std::bad_alloc *)` are both valid. The former can access the exception object through variable `v10`, whereas the latter cannot access the exception object.
C++ provides the `throw` keyword for throwing an exception, as in `std::bad_alloc ba; throw ba;`. This is represented in the output as (for example) `throw v10;`. C++ also allows code to rethrow the current exception via `throw;`. This is represented in the output as `throw;`.
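For reference, here is a small self-contained C++ example of these constructs (the decompilation uses the same keywords, typically with pointer-style catches such as `catch (std::bad_alloc *v10)`):

```cpp
#include <cstdio>
#include <new>
#include <string>

void demo(std::size_t n) {
    try {
        std::string s(n, 'x');            // may throw std::bad_alloc
        std::puts(s.c_str());
    }
    catch (const std::bad_alloc &e) {     // named type: the exception object is accessible as e
        std::puts(e.what());
        throw;                            // re-throw the current exception
    }
    catch (...) {                         // "..." catches any other exception
        std::puts("unknown exception");
    }
}
```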
# WIND AND UNWIND
Exception metadata in C++ binaries is split into two categories: `try` and `catch` blocks, as discussed above, and so-called `wind` and `unwind` blocks. C++ does not have `wind` and `unwind` keywords, but the compiler creates these blocks implicitly. In most binaries, they outnumber `try` and `catch` blocks by about 20 to 1.
Consider the following code, which may or may not throw an `int` as an exception at three places:
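The original listing is not reproduced here; in essence, the code being described looks like the following sketch (the helper name `MaybeThrow` and the string values are our assumptions):

```cpp
#include <string>

void MaybeThrow(int n);              // assumed helper: may or may not throw an int

void Example()
{
    MaybeThrow(-1);                  // point -1: nothing constructed yet
    std::string s0("first");
    MaybeThrow(0);                   // point 0: s0 must be destroyed on an exception
    std::string s1("second");
    MaybeThrow(1);                   // point 1: s1 and then s0 must be destroyed
}                                    // normal path: s1 and s0 destroyed here, at the end of the scope
```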
If an exception is thrown at point -1, the function exits early without executing any of its remaining code. As no objects have been created on the stack, nothing needs to be cleaned up before the function returns.
If an exception is thrown at point 0, the function exits early as before. However, since `string s0` has been created on the stack, it needs to be destroyed before exiting the function. Similarly, if an exception is thrown at point 1, both `string s1` and `string s0` must be destroyed.
These destructor calls would normally happen at the end of their enclosing scope, i.e. the bottom of the function, where the compiler inserts implicitly-generated destructor calls. However, since the function does not have any `try` blocks, none of the function's remaining code will execute after the exception is thrown. Therefore, the destructor calls at the bottom will not execute. If there were no other mechanism for destructing `s0` and/or `s1`, the result would be memory leaks or other state management issues involving those objects. Therefore, the C++ exception management runtime provides another mechanism to invoke their destructors: `wind` blocks and their corresponding `unwind` handlers.
`wind` blocks are effectively `try` blocks that are inserted invisibly by the compiler. They begin immediately after constructing some object, and end immediately before destructing that object. Their `unwind` blocks play the role of `catch` handlers, calling the destructor upon the object when exceptional control flow would otherwise cause the destructor call to be skipped.
Microsoft Visual C++ effectively transforms the previous example as follows:
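Conceptually, the transformed function looks like the sketch below. It uses the `__wind`/`__unwind` notation of the decompilation and is not compilable C++:

```cpp
void Example()
{
    MaybeThrow(-1);                      // state -1: no unwind action registered yet
    std::string s0("first");
    __wind                               // state 0: begins right after s0 is constructed
    {
        MaybeThrow(0);
        std::string s1("second");
        __wind                           // state 1: begins right after s1 is constructed
        {
            MaybeThrow(1);
        }
        __unwind                         // unwind handler for state 1
        {
            std::string::~string(&s1);   // destroy s1, then re-throw to state 0
        }
        std::string::~string(&s1);       // normal-path destructor
    }
    __unwind                             // unwind handler for state 0
    {
        std::string::~string(&s0);       // destroy s0, then re-throw out of the function
    }
    std::string::~string(&s0);           // normal-path destructor
}
```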
`unwind` blocks always re-throw the current exception, unlike `catch` handlers, which may or may not re-throw it. Re-throwing the exception ensures that prior `wind` blocks will have a chance to execute. So, for example, if an exception is thrown at point 1, after the `unwind` handler destroys `string s1`, re-throwing the exception causes the unwind handler for point 0 to execute, thereby allowing it to destroy `string s0` before re-throwing the exception out of the function.
# STATE NUMBERS AND INSTRUCTION STATES
As we have discussed, the primary components of Microsoft Visual C++ x64 exception metadata are `try` blocks, `catch` handlers, `wind` blocks, and `unwind` handlers. Generally speaking, these elements can be nested within one another. For example, in C++ code, it is legal for one `try` block to contain another, and a `catch` handler may contain `try` blocks of its own. The same is true for `wind` and `unwind` constructs: `wind` blocks may contain other `wind` blocks (as in the previous example) or `try` blocks, and `try` and `catch` blocks may contain `wind` blocks.
Exceptions must be processed in a particular sequence: namely, the most nested handlers must be consulted first. For example, if a `try` block contains another `try` block, any exceptions occurring within the latter region must be processed by the innermost `catch` handlers first. Only if none of the inner `catch` handlers can handle the exception should the outer `try` block's catch handlers be consulted. Similarly, as in the previous example, `unwind` handlers must destruct their corresponding objects before passing control to any previous exception handlers (such as `string s1`'s `unwind` handler passing control to `string s0`'s `unwind` handler).
Microsoft's solution to ensure that exceptions are processed in the proper sequence is simple. It assigns a "state number" to each exception-handling construct. Each exception state has a "parent" state number whose handler will be consulted if the current state's handler is unable to handle the exception. In the previous example, what we called "point 0" is assigned the state number 0, while "point 1" is assigned the state number 1. State 1 has a parent of 0. (State 0's parent is a dummy value, -1, that signifies that it has no parent.) Since `unwind` handlers always re-throw exceptions, if state 1's `unwind` handler is ever invoked, the exception handling machinery will always invoke state 0's `unwind` handler afterwards. Because state 0 has no parent, the exception machinery will re-throw the exception out of the current function. This same machinery ensures that the catch handlers for inner `try` blocks are consulted before outer `try` blocks.
There is only one more piece to the puzzle: given that an exception could occur anywhere, how does the exception machinery know which exception handler should be consulted first? I.e., for every address within a function with C++ exception metadata, what is the current exception state? Microsoft C++/x64 binaries provide this information in the `IPtoStateMap` metadata table, which is an array of address ranges and their corresponding state numbers.
# GUI OPERATION
This support is fully automated and requires no user interaction. However, the user can customize the display of C++ exception metadata elements for the global database, as well as for individual functions.
# GLOBAL SETTINGS
Under the `Edit->Other->C++ exception display settings` menu item, the user can edit the default settings to control which exception constructs are shown in the listing. These are saved persistently in the database (i.e., the user's choices are remembered after saving, closing, and re-opening), and can also be adjusted on a per-function basis (described later).
The settings on the dialog are as follows:
* Default output mode. When the plugin is able to represent C++ exception constructs via nice constructs like `try`, `catch`, `__wind`, and `__unwind` in the listings, these are called "structured" exception states. The plugin is not always able to represent exception metadata nicely, and may instead be forced to represent the metadata via helper calls in the listing (these are called "unstructured" states). As these can be messy and distracting, users may prefer not to see them by default. Alternatively, the user may prefer to see no exception metadata whatsoever, not even the structured ones. This setting allows the user to specify which types of metadata will be shown in the listing.
* Show wind states. We discussed wind states and unwind handlers in the background material. Although these states can be very useful when reverse engineering C++ binaries (particularly when analyzing constructors), displaying them increases the amount of code in the listing, and sometimes the information they provide is more redundant than useful. Therefore, this checkbox allows the user to control whether they are shown by default.
* Inform user of hidden states. The two settings just discussed can cause unstructured and/or wind states to be omitted from the default output. If this checkbox is enabled, then the plugin will inform the user of these omissions via messages at the top of the listing, such as this message indicating that one unstructured wind state was omitted:

```
// Hidden C++ exception states: #wind_helpers=1
```
There are three more elements on the settings dialog; most users should never have to use them. However, for completeness, we will describe them now.
* Warning behavior. When internal warnings occur, they will either be printed to the output window at the bottom, or shown as a pop-up warning message box, depending on this setting.
* Reset per-function settings. The next section will discuss how the display settings described above can be customized on a per-function basis. This button allows the user to erase all such saved settings, such that all functions will use the global display settings the next time they are decompiled.
* Rebuild C++ metadata caches. Before the plugin can show C++ exception metadata in the output, it must pre-process the metadata across the whole binary. Doing so crucially relies upon the ability to recognize the `__CxxFrameHandler3` and `__CxxFrameHandler4` unwind handler functions when they are referenced by the binary's unwind metadata. If the plugin fails to recognize one of these functions, then it will be unable to display C++ exception metadata for any function that uses the unrecognized unwind handler(s).
If the user suspects that a failure like this has taken place -- say, because they expect to see a `try`/`catch` in the output and it is missing, and they have confirmed that the output was not simply hidden due to the display settings above -- then this button may help them diagnose and repair the issue. Pressing this button flushes the existing caches from the database and rebuilds them. It also prints output to tell the user which unwind handlers were recognized and which ones were not. The user can use these messages to confirm whether the function's corresponding unwind handler was recognized. If it was not, the user can rename the unwind handler function to something that contains one of the two aforementioned names, and then rebuild the caches again.
Note that users should generally not need to use this button, as the plugin tries several methods to recognize the unwind handlers (such as FLIRT signatures, recognizing import names, and looking at the destination of "thunk" functions with a single `jmp` to a destination function). If the user sees any C++ exception metadata in the output, this almost always means that the recognition worked correctly. This button should only be used by experienced users as a last resort. Users are advised to save their database before pressing this button, and only proceed with the changes if renaming unwind handlers and rebuilding the cache addresses missing metadata in the output.
# CONFIGURATION
The default options for the settings just described are controlled via the `%IDADIR%/cfg/eh34.cfg` configuration file. Editing this file will change the defaults for newly-created databases (but not affect existing databases).
# PER-FUNCTION SETTINGS
As just discussed, the user can control which C++ exception metadata is displayed in the output via the global menu item. Users can also customize these settings on a per-function basis (say, by enabling display of wind states for selected functions only), and they will be saved persistently in the database.
When a function has C++ exception metadata, one or more items will appear on Hex-Rays' right-click menu. The most general one is "C++ exception settings...". Selecting this menu item will bring up a dialog similar to the global settings dialog, with the following settings:
* Use global settings. If the user previously changed the settings for the function, but wishes that the function be shown via the global settings in the future, they can select this item and press "OK". This will delete the saved settings for the function, causing future decompilations to use the global settings.
* This function's output mode. This functions identically to "Default output mode" from the global settings dialog, but only affects the current function.
* Show wind states. Again, identical to the global settings dialog item.
There is a button at the bottom, "Edit global settings", which is simply a shortcut to the same global settings dialog from the `Edit->Other->C++ exception display settings` menu item.
The listing will automatically refresh if the user changes any settings.
Additionally, there are four other menu items that may or may not appear, depending upon the metadata present and whether the settings caused any metadata to be hidden. These menu items are shortcuts to editing the corresponding fields in the per-function settings dialog just discussed. They are:
* Show unstructured C++ states. If the global or per-function default output setting was set to "Structured only", and the function had unstructured states, this menu item will appear. Clicking it will enable display of unstructured states for the function and refresh the decompilation.
* Hide unstructured C++ states. Similar to the above.
* Show wind states. If the global or per-function "Show wind states" setting was disabled, and the function had wind states, this menu item will appear. Clicking it will enable display of wind states for the function and refresh the decompilation.
* Hide wind states. Similar to the above.
# KEYBOARD SHORTCUTS
The user can change (add, remove, or edit) the keyboard shortcuts for the per-function settings right-click menu items from the `Edit->Options->Shortcuts` dialog. The names of the corresponding actions are:
* "C++ exception settings": `eh34:func_settings` * "Show unstructured C++ states": `eh34:enable_unstructured` * "Hide unstructured C++ states": `eh34:disable_unstructured` * "Show wind states": `eh34:enable_wind` * "Hide wind states": `eh34:disable_wind` * The global settings dialog: `eh34:config_menu`
# HELPER CALLS
Hex-Rays' Microsoft C++ x64 exception support tries to hide the details about exception state numbers as much as possible. However, compiler optimizations can cause binaries to diverge from the original source code. For example, inlined functions can produce `goto` statements in the decompilation despite there being none in the source. Optimizations can also cause C++ exception metadata to differ from the original code. As a result, it is not always possible to represent `try`, `catch`, `wind`, and `unwind` constructs as scoped regions that hide the low-level details.
In these cases, Hex-Rays' Microsoft C++ x64 exception support will produce helper calls with informative names to indicate when exception states are entered and exited, and to ensure that the user can see the bodies of `catch` and `unwind` handlers in the output. The user can hover their mouse over those calls to see their descriptions. They are also catalogued below.
The following helper calls are used when exception states have multiple entrypoints, or multiple exits:
The following helper calls are used when exception states had single entry and exit points, but could not be represented via `try` or `__wind` keywords:
The following helper calls are used to display `catch` handlers for exception states that could not be represented via the `catch` keyword:
The following helper calls should normally have been removed from the output; if you do see them, they signify the boundary of a `catch` handler:
The following helper calls are used to display `unwind` handlers for exception states that could not be represented via the `__unwind` keyword:
The following helper calls are used to signify that an `unwind` handler has finished executing, and will transfer control to a parent exception state (or outside of the function):
The following helper call is used when the exception metadata did not specify a function pointer for an `unwind` handler, which causes program termination:
The following helper calls are used to signify that Hex-Rays was unable to display an exception handler in the decompilation:
Starting from MSVC 2017 Service Pack 3 (version 14.13), the compiler began applying optimizations to reduce the size of the C++ exception metadata. An official Microsoft blog entry entitled ["Making C++ Exception Handling Smaller on x64"](https://devblogs.microsoft.com/cppblog/making-cpp-exception-handling-smaller-x64/) describes these optimizations in detail.
As a result of these changes, the C++ exception metadata in MSVC 14.13+ binaries is no longer fully precise. Exception states are frequently reported as beginning physically after where the source code would indicate. In order to produce usable output, Hex-Rays employs mathematical optimization algorithms to reconstruct more detailed C++ exception metadata configurations that can be displayed in a nicer format in the decompilation. These algorithms improve the listings by producing more structured regions and fewer helper calls in the output, but they introduce further imprecision as to the true starting and ending locations of exception regions when compared to the source code. They are an integral part of Hex-Rays C++/x64 Windows exception metadata support and cannot be disabled.
The takeaway is that, when processing MSVC 14.13+ binaries, Hex-Rays C++/x64 Windows exception support frequently produces `try` and `__unwind` blocks that begin and/or end earlier and/or later than what the source code would indicate, were it available. This has important consequences for vulnerability analysis.
For example, given accurate exception boundary information, the destructor for a local object would ordinarily be situated after the end of that object's `__wind` and `__unwind` blocks, as in:
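For example (a sketch in the decompiler's notation; `v14` and its type are illustrative):

```cpp
__wind                                   // exception state covering v14's lifetime
{
    // ... code that uses v14 ...
}
__unwind                                 // exceptional path only
{
    std::string::~string(&v14);
}
std::string::~string(&v14);              // normal-path destructor, after the blocks
```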
Yet, due to the imprecise boundary information, Hex-Rays might display the destructor as being inside of the `__wind` block:
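Again a sketch, with the same illustrative names:

```cpp
__wind
{
    // ... code that uses v14 ...
    std::string::~string(&v14);          // normal-path destructor displayed *inside* the block
}
__unwind
{
    std::string::~string(&v14);          // exceptional-path destructor
}
```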
The latter output might indicate that `v14`'s destructor would be called twice if the destructor were to throw an exception. However, this indication is simply the result of imprecise exception region boundary information. In short, users should be wary of diagnosing software bugs or security issues based upon the positioning of statements near the boundaries of `try` and `__wind` blocks. The example above indicates something that might appear to be a bug in the code -- a destructor being called twice -- but is in fact not one.
These considerations primarily apply when analyzing C++ binaries compiled with MSVC 14.13 or greater. They do not apply as much to binaries produced by MSVC 14.12 or earlier, when the compiler emitted fully precise information about exception regions.
Although Hex-Rays may improve its detection of exception region boundaries in the future, because modern binaries lack the ground truth of older binaries, the results will never be fully accurate. If the imprecision is unacceptable to you, we recommend permanently disabling C++ metadata display via the `eh34.cfg` file discussed previously.
# MISCELLANEOUS
Hex-Rays' support for exceptions in Microsoft Visual C++/x64 only works after auto-analysis has completed. Until then, users can explore the database and decompile functions as usual, but no C++ exception metadata will be shown. Users are advised to refresh any decompilation windows after auto-analysis has completed.
If users have enabled display of wind states, they may see empty `__wind` or `__unwind` constructs in the output. Usually, this does not indicate that an error occurred; it usually means that the region of code corresponding to the `wind` state was very small or contained dead code, and Hex-Rays' normal analysis and transformation made it empty.
Starting in IDA 9.0, IDA's auto-analysis preprocesses C++ exception metadata differently than in previous versions. In particular, on MSVC/x64 binaries, `__unwind` and `catch` handlers are created as standalone functions, not as chunks of their parent function as in earlier versions. This is required to display the exception metadata correctly in the decompilation. For databases created with older versions, the plugin will still show the outline of the exception metadata, but the bodies of the `__unwind` and `catch` handlers will be displayed via the helper calls `__eh34_unwind_handler_absent` and `__eh34_catch_handler_absent`, respectively. The plugin will also print a warning at the top of the decompilation such as `Absent C++ exception handlers: #catch=1 (pre-9.0 IDB)` in these situations. Re-creating the IDB with a newer version will solve those issues, although users might still encounter absent handlers in new databases (rarely, and under different circumstances).
Here are some side-by-side comparisons of disassembly and decompiler for MIPS. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
This is very simple code to decompile and the output is perfect. The only minor obstacle is the references through the global offset table, but both IDA and the decompiler handle them well. Please note the difference in the number of lines to read on the left and on the right.
Sorry for another long assembler listing. It shows that for MIPS, as for other platforms, the decompiler can recognize 64-bit operations and collapse them into very readable constructs.
We recognize magic divisions for MIPS the same way as for other processors. Note that this listing has a non-trivial delay slot.
The previous example was a piece of cake. This one is a tougher nut to crack: there is a jump to a delay slot. A decent decompiler must handle these cases too and produce correct output without misleading the user. This is what we do. (We spent quite a long time inventing and testing various scenarios with delay slots.)
We support both big-endian and little-endian code. Usually they look the same but there may be subtle differences in the assembler. The decompiler keeps track of the bits involved and produces human-readable code.
MicroMIPS, as you have probably guessed, is supported too, with its special instructions and quirks.
The MIPS processor contains a number of complex floating point instructions, which perform several operations at once. It is not easy to decipher the meaning of the assembler code but the pseudocode is the simplest possible.
Compilers sometimes use helper functions; our decompiler knows the meaning of many of them and uses this knowledge to simplify the code.
A decompiler represents executable binary files in a readable form. More precisely, it transforms binary code into text that software developers can read and modify. The software security industry relies on this transformation to analyze and validate programs. The analysis is performed on the binary code because the source code (the text form of the software) is traditionally not available: it is considered a trade secret.
Programs to transform binary code into text form have always existed. Simple one-to-one mapping of processor instruction codes into instruction mnemonics is performed by disassemblers. Many disassemblers are available on the market, both free and commercial. The most powerful disassembler is our own IDA Pro. It can handle binary code for a huge number of processors and has an open architecture that allows developers to write add-on analytic modules.
Decompilers are different from disassemblers in one very important aspect. While both generate human readable text, decompilers generate much higher level text which is more concise and much easier to read.
Compared to low level assembly language, high level language representation has several advantages:
It is concise.
It is structured.
It doesn't require developers to know the assembly language.
It recognizes and converts low level idioms into high level notions.
It is less confusing and therefore easier to understand.
It is less repetitive and less distracting.
It uses data flow analysis.
Let's consider these points in detail.
Usually the decompiler's output is five to ten times shorter than the disassembler's output. For example, a typical modern program contains from 400KB to 5MB of binary code. The disassembler's output for such a program will include around 5-100MB of text, which can take anything from several weeks to several months to analyze completely. Analysts cannot spend this much time on a single program for economic reasons.
The decompiler's output for a typical program will be from 400KB to 10MB. Although this is still a big volume to read and understand (about the size of a thick book), the time needed for analysis is divided by 10 or more.
The second big difference is that the decompiler output is structured. Instead of a linear flow of instructions where each line is similar to all the others, the text is indented to make the program logic explicit. Control flow constructs such as conditional statements, loops, and switches are marked with the appropriate keywords.
The decompiler's output is easier to understand than the disassembler's output because it is high level. To be able to use a disassembler, an analyst must know the target processor's assembly language. Mainstream programmers do not use assembly languages for everyday tasks, but virtually everyone uses high level languages today. Decompilers remove the gap between the typical programming languages and the output language. More analysts can use a decompiler than a disassembler.
Decompilers convert assembly level idioms into high-level abstractions. Some idioms can be quite long and time-consuming to analyze. The following one-line statement
x = y / 2;
can be transformed by the compiler into a series of 20-30 processor instructions. It takes at least 15-30 seconds for an experienced analyst to recognize the pattern and mentally replace it with the original line. If the code includes many such idioms, an analyst is forced to take notes and mark each pattern with its short representation. All this slows down the analysis tremendously. Decompilers remove this burden from the analysts.
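As a short sketch of one such idiom (our own example): signed division by two must round toward zero, so the compiler adds the sign bit before the arithmetic shift.

```cpp
#include <cstdint>

// What the compiled code effectively does for "x = y / 2" with a signed y
// (assuming the usual arithmetic right shift for signed values):
int32_t div2_idiom(int32_t y) {
    int32_t corrected = y + (int32_t)((uint32_t)y >> 31);  // add 1 only when y < 0
    return corrected >> 1;          // arithmetic shift == truncating division by 2
}

// The expression a decompiler collapses the pattern back into:
int32_t div2_plain(int32_t y) {
    return y / 2;
}
```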
The number of assembler instructions to analyze is huge. They look very similar to each other and their patterns are very repetitive. Reading disassembler output is nothing like reading a captivating story. In a compiler-generated program, 95% of the code will be really boring to read and analyze. It is extremely easy for an analyst to confuse two similar looking snippets of code, and simply lose their way in the output. These two factors (the size and the boring nature of the text) lead to the following phenomenon: binary programs are never fully analyzed. Analysts try to locate suspicious parts by using some heuristics and some automation tools. Exceptions happen when the program is extremely small or an analyst devotes a disproportionately huge amount of time to the analysis. Decompilers alleviate both problems: their output is shorter and less repetitive. The output still contains some repetition, but it is manageable by a human being. Besides, this repetition can be addressed by automating the analysis.
Repetitive patterns in the binary code call for a solution. One obvious solution is to employ the computer to find patterns and somehow reduce them into something shorter and easier for human analysts to grasp. Some disassemblers (including IDA Pro) provide a means to automate analysis. However, the number of available analytical modules stays low, so repetitive code continues to be a problem. The main reason is that recognizing binary patterns is a surprisingly difficult task. Any "simple" action, including basic arithmetic operations such as addition and subtraction, can be represented in an endless number of ways in binary form. The compiler might use the addition operator for subtraction and vice versa. It can store constant numbers somewhere in its memory and load them when needed. It can use the fact that, after some operations, the register value can be proven to be a known constant, and just use the register without reinitializing it. The diversity of methods used explains the small number of available analytical modules.
The situation is different with a decompiler. Automation becomes much easier because the decompiler provides the analyst with high level notions. Many patterns are automatically recognized and replaced with abstract notions. The remaining patterns can be detected easily because of the formalisms the decompiler introduces. For example, the notions of function parameters and calling conventions are strictly formalized. Decompilers make it extremely easy to find the parameters of any function call, even if those parameters are initialized far away from the call instruction. With a disassembler, this is a daunting task, which requires handling each case individually.
Decompilers, in contrast with disassemblers, perform extensive data flow analysis on the input. This means that questions such as "Where is the variable initialized?" and "Is this variable used?" can be answered immediately, without doing any extensive search over the function. Analysts routinely pose and answer these questions, and having the answers immediately increases their productivity.
Below you will find side-by-side comparisons of disassembly and decompilation outputs.
The following examples are displayed on this page:
Just note the difference in size! The disassembly output requires you not only to know that compilers generate such convoluted code for signed division and modulo operations, but also to spend your time recognizing the patterns. Needless to say, the decompiler makes things really simple.
Questions like
What are the possible return values of the function?
Does the function use any strings?
What does the function do?
can be answered almost instantaneously by looking at the decompiler output. Needless to say, it looks better because I renamed the local variables. In the disassembler, registers are renamed very rarely because renaming hides the register use and can lead to confusion.
IDA highlights the current identifier. This feature turns out to be much more useful with high level output. In this sample, I tried to trace how the retrieved function pointer is used by the function. In the disassembly output, many wrong eax occurrences are highlighted while the decompiler did exactly what I wanted.
Arithmetic is not rocket science, but it is always better if someone handles it for you. You have more important things to focus on.
The decompiler recognized a switch statement and nicely represented the window procedure. Without this little help the user would have to calculate the message numbers herself. Nothing particularly difficult, just time consuming and boring. What if she makes a mistake?...
The decompiler tries to recognize frequently inlined string functions such as strcmp, strchr, strlen, etc. In this code snippet, calls to the strlen function have been recognized.
Here are some side-by-side comparisons of decompilations for v7.3 and v7.4. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
The text produced by v7.3 is not quite correct because the array at [ebp-128] was not recognized. In general, determining arrays is a tough task, but we can now handle simple cases automatically.
On the left there is a mysterious call to _extendsfdf2. In fact, this is a compiler helper function that just converts a single precision floating point value into a double precision value. However, we do not want to see this call as is; it is much better to translate it into code that looks more like C. Besides, there is a special treatment for printf-like functions.
In some cases we can easily prove that one variable can be mapped into another. The new version automatically creates a variable mapping in such cases. This makes the output shorter and easier to read. Needless to say that the user can revert the mapping if necessary.
The new version automatically applies symbolic constants when necessary. Less manual work.
This is not the longest C++ function name one may encounter but just compare the left and right sides. In fact the right side could even fit into one line easily, we just kept it multiline to be consistent. By the way, all names in IDA benefit from this simplification, not only the ones displayed by the decompiler. And it is configurable!
The battle is long but we do not give up. More 64-bit patterns are recognized now.
Yet another example of 64-bit arithmetics. The code on the left is correct but not useful at all. It can and should be converted into the simple equivalent text on the right.
Currently we support only GetProcAddress but we are sure that we will expand this feature in the future.
Below you will find side-by-side comparisons of v7.2 and v7.3 decompilations. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
NOTE: these are just some selected examples that can be illustrated as side-by-side differences. There are many other improvements and new features that are not mentioned on this page. We just got tired selecting them. Some of the improvements that did not make it to this page:
objc-related improvements
value range analysis can eliminate more useless code
better resolving of got-relative memory references
too big shift amounts are converted to lower values (e.g. 33->1)
more for-loops
better handling of fragmented variables
many other things...
When a constant looks nicer as a hexadecimal number, we print it as a hexadecimal number by default. Naturally, beauty is in the eye of the beholder, but the new behavior will produce more readable code, and you will less frequently feel compelled to change the number representation. By the way, this tiny change is just one of numerous improvements that we keep adding in each release. Most of them go literally unnoticed. It is just that this time we decided to talk about them.
EfiBootRecord points to a structure that has RecordExtents[0] as the last member. Such structures are considered variable-size structures in C/C++. Now we handle them nicely.
We were already printing UTF-8 and other string types, but UTF-32 was not supported yet. Now we print it with the 'U' prefix.
The difference between these outputs is subtle but pleasant. The new version managed to determine the variable types based on the printf format string. While the old version ended up with int a2, int a3, the new version correctly determined them as one __int64 a2.
A similar logic works for scanf-like functions. Please note that the old version was misdetecting the number of arguments. It was possible to correct the misdetected arguments using the Numpad-Minus hotkey but it is always better when there is less routine work on your shoulders, right?
While seasoned reversers know what is located at fs:0, it is still better to have it spelled out. Besides, the type of v15 is automatically detected as struct _EXCEPTION_REGISTRATION_RECORD *.
Again, the user can specify the union field that should be used in the output (the hotkey is Alt-Y) but there are situations when it can be automatically determined based on the access type and size. The above example illustrates this point. JFYI, the type of entry is:
While we cannot handle bitfields yet, their presence does not prevent using other, regular fields of the structure.
I could not resist the temptation to include one more example of automatic union selection. How beautiful the code on the right is!
No comments needed, we hope. The new decompiler managed to fold constant expressions after replacing EABI helpers with corresponding operators.
Now it works better especially in complex cases.
In this case too, the user could set the prototype of sub_1135FC as accepting a char * and this would be enough to reveal string references in the output, but the new decompiler can do it automatically.
The code on the left had a very awkward sequence to copy a structure. The code on the right eliminates it as unnecessary and useless.
Do you care about this improvement? Probably not, because the difference is tiny. However, in addition to being simpler, the code on the right eliminated a temporary variable, v5. A tiny improvement, but an improvement it is.
Another tiny improvement made the output considerably shorter. We like it!
This is a very special case: a division that uses the rcr instruction. Our microcode does not have an opcode for it, but we implemented the logic to handle some special cases, just so you do not waste your time trying to decipher the meaning of convoluted code (yes, rcr means code that is difficult to understand).
Well, we cannot say that we produce fewer gotos in all cases, but there is some improvement for sure. Also, note that the return type got improved too: now it is immediately visible that the function returns a boolean (0/1) value.
What a surprise, the code on the right is longer and more complex! Indeed it is, and it is because the decompiler is now more careful with division instructions. They may potentially generate a division-by-zero exception, and completely hiding them from the output may be misleading. If you prefer the old behaviour, turn off division preserving in the configuration file.
Do you notice the difference? If not, here is a hint: the order of arguments of sub_88 is different. The code on the right is more correct because the format specifiers match the variable types. For example, %f matches float a. At first sight the code on the left looks completely wrong but (surprise!) it works correctly on x64 machines. This is because floating point and integer arguments are passed in different locations, so the relative order of floating/integer arguments in the call does not matter much. Nevertheless, the code on the right causes less confusion.
This is a never ending battle, but we advance!
Below you will find side-by-side comparisons of v7.1 and v7.2 decompilations. Please maximize the window to see both columns simultaneously.
The following examples are displayed on this page:
NOTE: these are just some selected examples that can be illustrated as side-by-side differences. There are many other improvements and new features that are not mentioned on this page.
In the past the Decompiler was able to recognize magic divisions in 32-bit code. We now support magic divisions in 64-bit code too.
More aggressive folding of if_one_else_zero constructs; the output is much shorter and easier to grasp.
The decompiler tries to guess the type of the first argument of a constructor. This leads to an improved listing.
The decompiler has a better algorithm to find the correct union field. This reduces the number of casts in the output.
We improved recognition of 'for' loops; they are now shorter and much easier to understand.
Please note that the code on the left is completely illegible; the assembler code is probably easier to work with in this case. However, the code on the right is very neat. JFYI, below is the class hierarchy for this example:
Also please note that the source code had
but at the assembler level we have
Visual Studio plays such tricks.
Yes, the code on the left and on the right do the same thing. We prefer the right side, very much.
Minor stuff, one would say, and we'd completely agree. However, these minor details make reading the output a pleasure.
This is a rare addressing mode that is nevertheless used by compilers. Now we support it nicely.
The new decompiler managed to disentangle the obfuscation code and convert it into a nice strcpy().
The new version knows about ObjC blocks and can represent them correctly in the output. See the Edit, Other, Objective-C submenu in IDA; it contains the necessary actions to analyze the blocks.
We continue to improve recognition of 64-bit arithmetics. While it is impossible to handle all cases, we do not give up.
Yet another optimization rule that lifts common code from 'if' branches. We made it even more aggressive.
In the sample above the slot of the p_data_format variable is reused. Initially it holds a pointer to an integer (data_format) and then it holds a simple integer (errcode). Previous versions of the decompiler could not handle this situation nicely: the output would necessarily have casts and be quite difficult to read. The two different uses of the slot would be represented by just one variable. You can see it in the left listing.
The new version produces clean code and displays two variables. Naturally, this happens after applying the force new variable command.
Well, these listings require no comments, the new version apparently wins!
The decompiler requires the latest version of IDA. While it may work with older versions (we try to ensure compatibility with a couple of previous versions), the best results are obtained with the latest version: first, IDA analyses files better; second, the decompiler can use additional available functionality.
The decompiler runs on MS Windows, Linux, and Mac OS X. It can decompile programs for other operating systems, provided they have been built using GCC/Clang/Visual Studio/Borland compilers.
IDA loads appropriate decompilers depending on the input file. If it cannot find any decompiler for the current input file, no decompilers will be loaded at all.
Let's start with a very short and simple function:
We decompile it with View, Open subviews, Pseudocode (hotkey F5):
While the generated C code makes sense, it is not pretty. There are many cast operations cluttering the text. The reason is that the decompiler does not perform the type recovery yet. Apparently, the a1 argument points to a structure but the decompiler missed it. Let us add some type information to the database and see what happens. For that we will open the Local Types window (Shift-F1) and add a new structure type:
After that, we switch back to the pseudocode window and specify the type of a1. We can do it by positioning the cursor on any occurrence of a1 and pressing Y:
When we press Enter, the decompilation output becomes much better:
But there is some room for improvement. We could rename the structure fields and specify their types. For example, field_6B1 seems to be used as a counter and field_6B5 is obviously a function pointer. We can do all this without switching windows now. Here is how we specify the type of the function pointer field:
The final result looks like this:
Please note that there are no cast operations in the text and overall it looks much better than the initial version.
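For reference, here is a sketch of the structure type entered in the Local Types window (the offsets follow the field names seen in the pseudocode; the field sizes and the callback prototype are assumptions):

```cpp
struct my_struct
{
    char gap0[0x6B1];            // not-yet-explored bytes before the first known field
    int counter;                 // was field_6B1, used as a counter
    void (*callback)(void);      // was field_6B5, a function pointer
};
```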
This is an excerpt from a big function. Complex things happen in long functions, and it is very handy to have the decompiler represent them in a human way. Please note how code that was scattered over the address space is concisely displayed in two if statements.
Sometimes compilers reuse the same stack slot for different purposes. Many of our users asked us to add a feature to handle this situation. The new decompiler addresses this issue by adding a command to force creation of a new variable at the specified point. Currently we support only aliasable stack variables because this is the most common case.
The GUI version of IDA is required for interactive operation. In the text mode version, only batch operation is supported.
Hotkeys:
H - toggle between hexadecimal and decimal representations
R - switch to character constant representation
M - switch to enumeration (symbolic constant) representation
_ - invert sign
T - apply struct offset
This command allows the user to specify the desired form of a numeric constant. Please note that some constants have a fixed form and cannot be modified. This mainly includes constants generated by the decompiler on the fly.
The decompiler ties the number format information to the instruction that generated the constant. The instruction address and the operand number are used for that. If a constant, which was generated by a single instruction, is used in many different locations in the pseudocode, all these locations will be modified at once.
Using the 'invert sign' command negates the constant and resets the enum/char flag if it was set.
When this command is applied the first time to a negative constant, the output will seemingly stay the same. However, the list of symbolic constants available to the M hotkey changes. For example, if the constant is '-2', then before inverting the sign the symbolic constants corresponding to '-2' are available. After inverting the sign the symbolic constants corresponding to '2' are available.
The T hotkey applies the structure offset to the number. For positive numbers, it usually converts the number into an offsetof() macro. For negative numbers, it usually converts the whole (var-num) expression into the CONTAINING_RECORD macro. By the way, the decompiler tries to use other hints to detect this macro: it checks whether the number corresponds to a structure offset in the disassembly listing. For example, an expression of the form sketched below can be converted into a CONTAINING_RECORD expression, where structype * is the type of v1 and offsetof(structype, fieldname) == num. Please note that v2 must be declared as a pointer to the corresponding structure field, otherwise the conversion may fail.
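A hypothetical illustration (the variable, structure, and field names are placeholders):

```cpp
#include <cstddef>

// Equivalent of the Windows CONTAINING_RECORD macro, shown here for reference.
#define CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

struct structype {
    int  other;
    char fieldname;                    // offsetof(structype, fieldname) == num
};

structype *demo(char *v2)              // v2 points to the 'fieldname' field
{
    const std::size_t num = offsetof(structype, fieldname);
    // Before applying T: raw pointer arithmetic with the numeric offset.
    structype *v1 = (structype *)((char *)v2 - num);
    // After applying T: the same computation expressed via the macro.
    return CONTAINING_RECORD(v2, structype, fieldname) == v1 ? v1 : nullptr;
}
```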
See also:
Hotkey: Y
The SetType command sets the type of the current item. It can be applied to the following things:
Function
Local variable
Global item (function or data)
If the command is applied to the very first line of the output text, the decompiler will try to detect the current function argument. If the cursor is on an argument declaration, then the argument type will be modified. Otherwise, the current function type will be modified.
In all other cases the item under the cursor will be modified.
When modifying the prototype of the current function you may add or remove function arguments, change the return type, and change the calling convention. If you see that the decompiler wrongly created too many function arguments, you can remove them.
The item type must be specified as a C type declaration. All types defined in the loaded type libraries, as well as all structure and enum definitions in the Local Types window, can be used.
This is a very powerful command. It can change the output dramatically. Use it to remove cast operations from the output and to make it more readable. In some cases, you will need to define structure types in the local types window and only after that use them in the pseudocode window.
NOTE: since the arguments of indirect calls are collected before defining variables, specifying the type of the function pointer may not be enough. Please read this for more info.
Since variable and function types are essential, the decompiler uses colors to display them. By default, definite types (set by the user, for example) are displayed in blue while guessed types are displayed in gray. Please note that guessed types may change if the circumstances change. For example, if the prototype of a called function is changed, the type of the variable that holds its return value may change automatically, unless that type was set by the user.
This command does not rename the operated item, even if you specify the name in the declaration. Please use the rename command for that.
See also: interactive operation
Hotkey: N
The rename command renames the current item. It can be applied to the following things:
Function
Local variable
Global item (function or data)
Structure field
Statement label
Normally the item under the cursor will be renamed. If the command is applied to the very first line of the output text and the decompiler cannot determine the item under the cursor, the current function will be renamed.
See also: interactive operation
Hotkey: /
This command edits the indented comment for the current line or the current variable. It can be applied to the local variable definition area (at the top of the output) and to the function statement area (at the bottom of the output).
If applied to the local variable definition area, this command edits the comment for the current local variable. Otherwise the comment for the current line will be edited.
Please note that due to the highly dynamic nature of the output, the decompiler uses a rather complex coordinate system to attach comments. Some output lines will not have a coordinate in this system. You cannot edit comments for these lines. We will try to overcome this limitation in the future but it might take some time and currently we do not have a clear idea how to improve the existing coordinate system.
Each time the output text changes the decompiler will rearrange the entered comments so they are displayed close to their original locations. However, if the output changes too much, the decompiler could fail to display some comments. Such comments are called "orphan comments". All orphan comments are printed at the very end of the output text.
You can cut and paste them to the correct locations or you can delete them with the "Delete orphan comments" command using the right-click menu.
The starting line position for indented comments can be configured by the user. Please check the COMMENT_INDENT parameter in the configuration file.
See also: Edit block comment | Interactive operation
Hotkey: Ins
This command edits the block comment for the current line. The entered comment will be displayed before the current line.
Please note that due to the highly dynamic nature of the output, the decompiler uses a rather complex coordinate system to attach comments. Some output lines will not have a coordinate in this system. You cannot edit comments for these lines. Also, some lines share the same coordinate; in this case, the comment will be attached to the first line with that coordinate. We will try to overcome this limitation in the future but it might take some time and currently we do not have a clear idea how to improve the existing coordinate system.
Each time the output text changes the decompiler will rearrange the entered comments so they are displayed close to their original locations. However, if the output changes too much, the decompiler could fail to display some comments. Such comments are called "orphan comments". All orphan comments are printed at the very end of the output text.
You can cut and paste them to the correct locations or you can delete them with the "Delete orphan comments" command using the right-click menu.
If applied to the function declaration line, this command edits the function comment. This comment is shared with IDA: it is the same as the function comment in IDA.
See also: Edit indented comment | Interactive operation
The decompiler adds the following commands to the menus:
This command decompiles the current function. If the decompilation is successful, it opens a new window titled "Pseudocode" and places the generated C text in this window.
The following commands can be used in the pseudocode window:
If the current item is a local variable, additional items may appear in the context menu:
If the current item is a union field, an additional item may appear in the context menu:
If the current item is a parenthesis, bracket, or a curly brace, the following hotkey is available:
If the current item is a C statement keyword, an additional item may appear in the context menu:
The user can also select text and copy it to the clipboard with the Ctrl-C combination.
Pressing Enter on a function name will decompile it. Pressing Esc will return to the previously decompiled function. If there is no previously decompiled function, the pseudocode window will be closed.
Ctrl-Enter or Ctrl-double click on a function name will open a new pseudocode window for it.
Pressing F5 while staying in a pseudocode window will refresh its contents. Please note that the decompiler never refreshes pseudocode by itself because it can take really long.
The user can use the mouse right click or keyboard hotkeys to access the commands. Please check the command descriptions for the details.
This command toggles between the disassembly view and pseudocode view. If there is no pseudocode window, a new window will be created.
Pressing Tab while staying in the pseudocode window will switch to the disassembly window. The Tab key can be used to toggle pseudocode and disassembly views.
See above the Open pseudocode command for more details.
This command decompiles the selected functions or the whole application. It will ask for the name of the output .c file.
If there is a selected area in the disassembly view, only the selected functions will be decompiled. Otherwise, the whole application will be decompiled.
When the whole application is decompiled, the following rules apply:
the order of decompilation is determined by the decompiler. It will start with the leaf functions and proceed in postorder on the call graph. This order ensures that when a function is decompiled, all information about the called functions is already available. Obviously, for recursive functions some information will still be missing.
the library (light blue) functions will not be decompiled. By the way, this is a handy feature to exclude functions from the output.
A decompilation failure will not stop the analysis, but an internal error will. The decompiler generates #error directives for failed functions.
This command decompiles the current function and copies the pseudocode to the disassembly listing in the form of anterior comments. If the current function already has a pseudocode window, its contents are used instead of decompiling the function anew.
This menu item performs exactly the same actions as the Copy to assembly command.
This command deletes all anterior comments created by the previous command. Its name is a slight misnomer because it does not verify the comment origin. In fact, all anterior comments within the current function are deleted.
This command marks/unmarks instructions to be skipped by the decompiler. It is useful if some prolog/epilog instructions were missed by IDA. If such instructions were not detected and marked, the decompilation may fail (most frequently the call analysis will fail).
The decompiler skips the prolog, epilog, and switch instructions. It relies on IDA to mark these instructions. Sometimes IDA fails to mark them, and this command can be used to correct the situation.
If the command is applied to marked instructions, it will unmark them.
By default, the skipped instructions are not visualized. To make them visible, edit the IDA.CFG file and uncomment the following lines:
This command deletes decompiler information.
It can delete information about global objects (functions, static data, structure/enum types) and/or information local to the current function.
Use this command if you inadvertently made some change that made decompilation impossible.
It can also be used to reset other information types used by the decompiler. For example, forced variadic argument counts or split expressions can be reset.
This command specifies a function call that will replace the current instruction in the pseudocode output.
Special names can be used to access operands of the current instruction: __OP1, __OP2, ... for the first, second, etc. operands. Each function argument with such a name will be replaced in the call by the value of the corresponding operand of the instruction. If the function name itself has this format, a call to the location pointed to by the corresponding operand will be generated. Other arguments and the return value will be placed into locations derived from the function prototype according to the current compiler, calling convention, argument and return types. You can use the IDA-specific __usercall calling convention to specify arbitrary locations independently of the platform and argument/return types (read the IDA help pages about user defined calling conventions for more info).
Examples
We could ask to replace the following instruction:
out 2b, ax
by specifying the following prototype:
void OUT(unsigned __int8 __OP1, __int16 __OP2)
which would lead to the following decompiler output:
OUT(0x2b, v1);
where v1 is mapped to ax.
The following prototype: int __usercall syscall@<R0>(int code@<R12>, void *a1@<R0>, void *a2@<R1>)
applied to the second instruction in the following piece of code:
mov r12, #0x148
svc 0x801
will generate the following pseudocode:
v3 = syscall(328, v1, v2);
where v1 and v2 are mapped to R0 and R1, and v3 (the return value) is mapped to R0.
This command packs and sends the current database to our server. The user can specify their email address and add notes about the error. This is the preferred way of filing bug reports because it is virtually impossible to do anything without a database. The database will also contain the internal state of the decompiler, which is necessary to reproduce the bug.
The database is sent in compressed form to save bandwidth. An encrypted connection (SSL) is used for the transfer.
This command deletes all code and data from the current idb except the current function. It can be used to reduce the database size before sending a bug report. Please note that deleting information from the database may make the bug irreproducible, so please verify it after applying this command.
Hotkeys:
Keypad -: Hide current statement
Keypad +: Unhide current statement
This command collapses the current statement into one line. It can be applied to multiline statements (if, while, for, do, switch, blocks).
The hidden item can be uncollapsed using the unhide command.
See also: interactive operation
Hotkey: Ctrl-Shift-R
This command removes the return type from the function prototype. It is applied to the prototype of the current function.
It is available anywhere in the pseudocode window, regardless where exactly the cursor is positioned. This command is not visible in the context sensitive popup menu.
If applied to a function without the return type, it will add the previously removed return type to the function prototype.
This command is available starting from v7.5.
Hotkey: Shift-Del
This command removes an argument or the return type from a function prototype. It can be applied to the prototype of the current function as well as to any called function.
It is available only when the cursor is on a function argument or on the return type. As a result of this command, the function prototype is modified: the selected argument is removed from the argument list. If necessary, the calling convention is replaced by a new one.
Please note that other register arguments do not change their locations. This logic ensures that a stray argument in the argument list can be deleted with a keypress.
When applied to the function return type it will convert it to "void".
This command is available starting from v7.5.
Hotkeys:
None: Split current expression
None: Unsplit current expression
This command splits the current expression into multiple expressions. It is available only for int16, int32, or int64 assignments or for expressions which were combined by the decompiler (e.g. a 64-bit comparison on a 32-bit platform). Splitting an assignment breaks it into two assignments: one for the low part and one for the high part. Other expressions can be split into more than two expressions.
This command is useful if the decompiler erroneously combines multiple unrelated expressions into one. In some cases the types of the new variables should be explicitly specified to get a nice listing. For example:
can be split into two assignments:
by right clicking on the 64-bit assignment operation (the '=' sign) and selecting the 'Split' command.
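The example code of this section is not reproduced above, so here is a hypothetical illustration (all names and values are invented). Suppose the decompiler combined two unrelated 32-bit stack slots into one 64-bit assignment:
v5 = 0x500000001LL;                // one 64-bit assignment covering both slots
After applying 'Split' to the '=' sign and setting int types on the two new variables, the output could become:
v5_low = 1;                        // low 32 bits
v5_high = 5;                       // high 32 bits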
The split expression can be unsplit using the unsplit command. Unsplitting removes all effects of the previous Split commands.
In some cases, especially for indirect calls, the decompiler cannot correctly detect call arguments. For a call like
it is very difficult to determine where the input arguments are. For example, it is unclear if ECX is used by the call or not.
However, the number of arguments and their types can become available at later stages of decompilation. For example, the decompiler may determine that ECX points to a class with a table of virtual functions. If the user specifies the vtable layout, the output may become similar to
If the user declares somefunc as a pointer to a function like this:
then the code is incorrect. The decompiler detected only one argument and missed the one in ECX.
The 'force call type' command instructs the decompiler not to perform the call argument analysis but just use the type of the call object. For the above example, the call will be transformed into something like
In other words, this command copies the call type from the call object to the call instruction. The call object may be any expression, the only requirement is that it must be a pointer to a function.
NOTE: Behind the scenes the 'force call' command copies the desired type to the operand of the call instruction. To revert the effects of 'force call' or to fine tune the forced type please use the Edit, Operand type, Set operand type in the disassembly view while staying on the call instruction.
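Since the code fragments of this example are not shown above, here is a hedged sketch of the idea (the declaration and argument names are assumptions; only somefunc and ECX come from the text). Assume the call object is declared as:
int (__thiscall *somefunc)(void *this_ptr, int a1);   // hypothetical declaration
Plain argument analysis of the indirect call may miss the argument passed in ECX and produce:
somefunc(a1);                                          // the ECX argument is lost
After 'Force call type', the decompiler uses the declared type of the call object verbatim:
somefunc(v1, a1);                                      // both arguments are present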
Hotkeys:
Numpad+: Add variadic argument
Numpad-: Delete variadic argument
This command adds or removes an argument of a variadic call. It is impossible to detect the correct number of variadic arguments in all cases, and this command can be used to fix wrongly detected arguments. It is available only when the cursor is located on a call to a variadic function (like printf). The decompiler automatically detects the argument locations, the user can only increase or decrease their number.
This command is useful if the decompiler determines the number of arguments incorrectly. For example:
apparently lacks an argument. Pressing Numpad+ modifies it:
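Since the original call is not shown above, here is a hypothetical one (the format string and variable names are invented):
printf("%d %d\n", v1);             // before: one variadic argument is missing
printf("%d %d\n", v1, v2);         // after pressing Numpad+ once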
If too many arguments are added to a variadic call, decompilation may fail. Three methods to correct this situation exist:
undo the last action (hotkey Ctrl-Z)
position the cursor on the wrongly modified call and press Numpad-
use the Reset decompiler information command to reset the forced variadic argument counts
There is also a more general command that allows the user to set any type for a call instruction; it is described next.
See also: interactive operation
In some cases, especially for indirect calls, the decompiler cannot correctly detect call arguments. The 'Set call type' command sets the type of the function call at the current item without changing the prototype of the called function itself. So there is a difference between 'Set call type' and the commands that change the type of the called function. Let us assume that there is a call
and that the decompiler erroneously detected one argument whereas four arguments actually are present. If the user sets the new call type as
then the call will be transformed into
and the type of off_5C6E4 will remain unchanged. Note that in this case the user can revert the call to its previous state by resetting the operand type of the call instruction (see the NOTE below).
Changing the type of off_5C6E4 itself would have a different effect: it sets the new type for off_5C6E4, which causes changes to all places where off_5C6E4 is called, including the current call.
This command can also be used to specify the __noreturn attribute of a call.
NOTE: Behind the scenes the 'Set call type' command, like 'Force call type', copies the entered type to the operand of the call instruction. Actually it is a shortcut to Edit, Operand type, Set operand type in the disassembly view while staying on the call instruction.
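Since the code fragments of this example are not shown above, here is a hedged sketch (the argument lists are assumptions; only the name off_5C6E4 comes from the text):
(*off_5C6E4)(a1);                  // before: a single detected argument
After 'Set call type' with, for example, int (__cdecl *)(int, int, int, int), only this call site changes:
(*off_5C6E4)(a1, a2, a3, a4);      // off_5C6E4 keeps its previous type
Changing the type of off_5C6E4 itself would instead update every call site.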
This command decompiles all non-trivial functions in the database and looks for xrefs in them. Library and thunk functions are skipped. The decompilation results are cached in memory, so only the first invocation of this command is slow.
Cross references to the current item are looked up in the decompilation results. A list of such xrefs is formed and displayed on the screen. Currently the following item types are supported:
a structure field
an enumeration member (symbolic constant)
This action is also available (only by hotkey) in the local types view.
See also: interactive operation
This command marks the current function as decompiled. It is a convenient way to track decompiled functions. Feel free to use it any way you want.
Marking a function as decompiled will change its background color to the value specified by the MARK_BGCOLOR parameter in the configuration file. The background color will be used in the pseudocode window, in the disassembly listing, and in the function list.
See also: interactive operation
This command copies the pseudocode text to the disassembly window. It is available from the popup right-click menu.
Please note that only "meaningful" lines are copied. Lines containing curly braces or else/do keywords will be omitted.
The copied text is represented as anterior comments in the disassembly. Feel free to edit them the way you want. The copied text is static and will not change if the pseudocode text changes.
See also: interactive operation
This command generates an HTML file with the pseudocode of the current function. It is available from the popup menu if the mouse is clicked on the very first line of the pseudocode text.
This command also works on a selected area. The user can select the area that will be saved into an HTML file. This is useful if only a small code snippet needs to be saved instead of the entire function body.
See also: interactive operation
Hotkey: \
This command hides all cast operators from the output listing. Please note that the output may become more difficult to understand or even lose its meaning without cast operators. However, since in some cases it is desirable to temporarily hide them, we provide the end user with this command.
The initial display of cast operators can be configured by the user. Please check the HO_DISPLAY_CASTS bit in the HEXOPTIONS parameter in the configuration file.
See also: interactive operation
This command opens the standard dialog box with the cross references to the current item. The user may select a cross reference and jump to it. If the cross-reference address belongs to a function, it will be decompiled. Otherwise, IDA will switch to the disassembly view.
For local variables, the following cross reference types are defined:
It is also possible to jump to structure fields. All local references to a field of a structure type will be displayed.
If the item under the cursor is a label, a list of all references to the label will be displayed.
Finally, xrefs to statement types are possible too. For example, a list of all return statements of the current function can be obtained by pressing X on a return statement. All statements with keywords are supported.
See also: interactive operation
Hotkey: Alt-Y
This command allows the user to select the desired union field. In the presence of unions, the decompiler cannot always detect the correct union field.
The decompiler tries to reuse the union selection information from the disassembly listing. If there is no information in the disassembly listing, the decompiler uses a heuristic rule to choose the most probable union field based on the field types. However, it may easily fail in the presence of multiple union fields with the same type or when there is no information about how the union field is used.
If both the above methods of selecting the union field fail, then this command can be used to specify the desired field. It is especially useful for analyzing device drivers (I/O request packets are represented with a long union), or COM+ code that uses VARIANT data types.
See also: interactive operation
Hotkey: none
This convenience command allows the user to specify a pointer to structure type in a quick and efficient manner. The list of the local structure types will be displayed. The type of the current variable will be set as a pointer to the selected structure type.
This is just a convenience command. Please use the set type command in order to specify arbitrary variable types.
This command is available only when the decompiler is used with recent IDA versions.
See also: interactive operation
Hotkey: none
This convenience command allows the user to convert the current local variable from a non-pointer type to a pointer to a newly created structure type. It is available from the context menu if the current variable is used as a pointer in the pseudocode.
The decompiler scans the pseudocode for all references to the variable and tries to deduce the type of the pointed object. Then the deduced type is displayed on the screen and the user may modify it to his taste before accepting it. When the user clicks OK, the new type is created and the type of the variable is set as a pointer to the newly created type.
In simple cases (for example, when the variable is used as a simple character pointer), the decompiler does not display any dialog box but directly changes the variable type. In such cases, no new type will be created.
This command is available only when the decompiler is used with recent IDA versions.
This is just a convenience command. Please use the set type command in order to specify arbitrary variable types.
See also: interactive operation
Hotkey: none
This command resets the type of the current local variable from a pointer type to an integer type. This is just a convenience command. Please use the set type command in order to specify arbitrary variable types.
See also: interactive operation
Hotkey: Shift-S
Sometimes a stack slot is used for two completely different purposes during the lifetime of a function. While for the unaliased part of the stack frame the decompiler can usually sort things out, it cannot do much for the aliased part of the stack frame. For the aliased part, it will create just one variable even if the corresponding stack slot is used for multiple different purposes. It happens so because the decompiler cannot prove that the variable is used for a different purpose, starting from a certain point.
The split variable command is designed to solve exactly this problem.
This command allows the user to force the decompiler to allocate a new variable starting from the current point. If the current expression is a local variable, all its subsequent occurrences will be replaced by a new variable up to the end of the function or the next split variable at the same stack slot. If the cursor does not point to a local variable, the decompiler will ask the user about the variable to replace.
In the current statement, only the write accesses to the variable will be replaced. In the subsequent statements, all occurrences of the variable will be replaced. We need this logic to handle the following situation:
where only the second occurrence of the variable should be replaced. Please note that in some cases it makes sense to click on the beginning of the line with the function call, rather than on the variable itself.
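The statement of this example is not shown above; a hypothetical one (the function name is invented) could be:
parse_line(v1, &v1);               // v1 is read by value and written through &v1
Here only the write access through &v1 should start the new variable; the read keeps the old one, which is why only write accesses are replaced in the current statement.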
Please note that in the presence of loops in the control flow graph it is possible that even the occurrences before the current expression will be replaced by the new variable. If this is not desired, the user should split the variable somewhere else.
The very first and the very last occurrences of a variable cannot be used to split it, because splitting there would not be useful.
The decompiler does not verify the validity of the new variable. A wrong variable allocation point may render the decompiler output incorrect.
Currently, only aliasable stack variables can be split.
A split variable can be deleted by right clicking on it and selecting 'Unsplit variable'.
See also: interactive operation
This command jumps to the matching parenthesis. It is available only when the cursor is positioned on a parenthesis, bracket, or curly brace.
The default hotkey is '%'.
See also: interactive operation
This command collapses the selected multiline C statement into one line. It can be applied to if, while, for, switch, do keywords. The collapsed item will be replaced by its keyword and "..."
It can also be applied to the local variable declarations. This can be useful if there are too many variables and they make the output too long. All variable declarations will be replaced by just one line:
See also: interactive operation
Hotkey: =
This command allows the user to replace all occurrences of a variable by another variable. The decompiler will propose a list of variables that may replace the current variable. The list will include all variables that have exactly the same type as the current variable. Variables that are assigned to/from the current variable will be included too.
Please note that the decompiler does not verify the mapping. A wrong mapping may render the decompiler output incorrect.
The function arguments and the return value cannot be mapped to other variables. However, other variables can be mapped to them.
A mapping can be undone by right clicking on the target variable and using the 'unmap variable' command.
See also: interactive operation
The decompiler supports batch mode operation with both the text and GUI versions of IDA. All you need to do is specify the -Ohexrays switch on the command line. The format of this switch is:
The valid options are:
-new decompile only if output file does not exist
-nosave do not save the database (idb) file after decompilation
-errs send problematic databases to hex-rays.com
-lumina use Lumina server
-mail=my@mail.com your email (meaningful if -errs option is used)
The output file name can be prefixed with + to append to an existing file. If the specified file extension is invalid, .c will be used.
The functions to decompile can be specified by their addresses or names. The ALL keyword means all non-library functions. For example:
will decompile all nonlibrary functions to outfile.c. In the case of an error, the .idb file will be sent to hex-rays.com. The -A switch is necessary to avoid the initial dialog boxes.
Below is the list of noteworthy public third-party plugins for the decompiler.
HexRaysCodeXplorer by Aleksandr Matrosov and Eugene Rodionov
Hex-Rays Decompiler plugin for better code navigation. Here is the feature list for the first release:
navigation through virtual function calls in Hex-Rays Decompiler window;
automatic type reconstruction for C++ constructor object;
useful interface for working with objects & classes;
A simple list of various IDA and Decompiler plugins
More to come...
Happy analysis!
The current release of the x86 decompiler supports floating point instructions. While everything works automatically, the following points are worth noting:
IDA v5.5 or higher is required for floating point support. Earlier versions do not have the required functionality and the decompiler represents fpu instructions using inline assembler statements.
The decompiler knows about all floating point types, including: float, double, long double, and _TBYTE. We introduced _TBYTE because sizeof(long double) is often different from sizeof(tbyte). While the size of long double can be configured (it is implicitly set to a reasonable value when the compiler is set), the size of tbyte is always equal to 10 bytes.
Casts from integer types to floating point types and vice versa are always displayed in the listing, even if the output has the same meaning without them.
The decompiler performs fpu stack analysis, which is similar to the simplex method performed by IDA. If it fails, the decompiler represents fpu instructions using inline assembler statements. In this case the decompiler adds one more prefix column to the disassembly listing, next to the stack pointer values. This column shows the calculated state of the fpu stack and may help to determine where exactly the fpu stack tracing went wrong.
The decompiler ignores all manipulations with the floating point control word. In practice this means that it may miss an unusual rounding mode. We will address this issue in the future, as soon as we find a robust method to handle it.
SSE floating point instructions are represented by intrinsic functions; scalar SSE instructions, however, are directly mapped to floating point operations in the pseudocode.
Feel free to report all anomalies and problems with floating point support using the Send database command. This will help us to improve the decompiler and make it more robust. Thank you!
See also: Failures and troubleshooting
The decompiler has a configuration file. It is installed into the 'cfg' subdirectory of the IDA installation. The configuration file is named 'hexrays.cfg'. It is a simple text file, which can be edited to your taste. Currently the following keywords are defined:
Background color of local type declarations. Currently this color is not used. Default: default background of the disassembly view
Background color of local variable declarations. It is specified as a hexadecimal number 0xBBGGRR where BB is the blue component, GG is the green component, and RR is the red component. Color -1 means the default background color (usually white). Default: default background of the disassembly view
Background color of the function body. It is specified the same way as VARDECL_BGCOLOR. Default: default background of the disassembly view
Number of spaces to use for block indentations. Default: 2
The position to start indented comments. Default: 48
As soon as the line length approaches this value, the decompiler will try to split it. However, in some cases the line may be longer. Default: 120
In order to keep the expressions relatively simple, the decompiler limits the number of comma operators in an expression. If there are too many of them, the decompiler will add a goto statement and replace the expression with a block statement. For example, instead of
we may end up with:
Default: 8
Specifies the default radix for numeric constants. Possible values: 0, 10, 16. Zero means "decimal for signed, hex for unsigned". Default: 0
Specifies the maximal decompilable function size, in KBs. Only reachable basic blocks are taken into consideration. Default: 64
Combination of various analysis and display options:
If enabled, the decompiler will handle out-of-function jumps by generating a call to the JUMPOUT() function. If disabled, such functions will not be decompiled. Default: enabled
If enabled, the decompiler will display cast operators in the output listing. Default: enabled
If enabled, the decompiler will hide unordered floating point comparisons. If this option is turned off, unordered comparisons will be displayed as calls to a helper function: __UNORDERED__(a, b) Default: enabled
If enabled, fast structural analysis will be used. It generates fewer nested if-statements but may occasionally produce some unnecessary gotos. It is much faster on huge functions.
Only print string literals if they reside in read-only memory (e.g. the .rodata segment). When off, all strings are printed as literals. You can override the decompiler's decision by adding 'const' or 'volatile' to the string variable's type declaration.
Convert signed comparisons of unsigned variables with zero into bit checks (a hedged sketch is given after this option list). For signed variables, perform the opposite conversion.
Reverse the effects of branch tail optimizations: reduce the number of gotos by duplicating code
Keep curly braces for single-statement blocks
Optimize away address comparisons. For example, a comparison of a fixed address against a constant will be replaced by 0 or 1. This optimization works only for non-relocatable files.
Print casts from string literals to pointers to char/uchar, for example a cast of the form (char *)"..."
Pressing Esc closes the pseudocode view
Assume all functions spoil flag registers ZF,CF,SF,OF,PF (including functions with explicitly specified spoiled lists)
Keep all indirect memory reads (even with unused results) so as not to lose possible invalid address access
Keep exception related code (e.g. calls to _Unwind_SjLj_Register)
Translate ARMv8.3 Pointer Authentication instructions into intrinsic function calls (otherwise ignore all PAC instructions)
Preserve potential divisions by zero (if not set, all unused divisions will be deleted)
Generate the integer overflow trap call for 'add', 'sub', 'neg' insns
Ignore the division by zero trap generated by the compiler (only for MIPS)
Consider __readflags as depending on cpu flags. Default: off, because the result is correct but awfully unreadable
Permit decompilation after an internal error (normally the decompiler does not permit new decompilations after an internal error in the current session)
Never use multiline function declarations, even for functions with a long argument list
Decompile library functions too (in batch mode)
Propagate ldx instructions without checking for volatile memory access
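Returning to the bit-check conversion option above: its before/after snippets are not shown in this text, so here is a hedged illustration (the variable name is invented):
if ( (signed int)v1 < 0 )          // before: an unsigned variable compared as signed
if ( (v1 & 0x80000000) != 0 )      // after: converted to a bit check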
Specifies the warning messages that should be displayed after decompilation. Please refer to hexrays.cfg file for the details. Default: all warnings are on
Specifies the list of function names that are considered "strcmp-like". For them the decompiler will prefer to use a comparison against zero like
as a condition. Underscores, j_ prefixes, and _NN suffixes will be ignored when comparing function names.
Name of Control Flow Guard check function. Calls of this function will not be included into the pseudocode. Default: "guard_check_icall_fptr"
Name of Control Flow Guard dispatch function. Each call of this function will be replaced by 'call rax' instruction when generating pseudocode. Default: "guard_dispatch_icall_fptr"
Background color of the function if it is marked as decompiled. It is specified the same way as VARDECL_BGCOLOR. Default: very light green
If enabled, the decompiler will generate intrinsic functions for SSE instructions that use XMM/MMX registers. If this option is turned off, these instructions will be displayed using inline assembly. Default: enabled
If enabled, the decompiler will produce output even if the local variable allocation has failed. In this case the output may be wrong and will contain some overlapped variables. Default: enabled
The current release of the decompiler supports intrinsic functions. Instructions that cannot be directly mapped to high level languages can very often be represented by special intrinsic functions. All simple Microsoft and Intel intrinsic functions up to SSE4a are supported, with some exceptions. While everything works automatically, the following points are worth noting:
SSE intrinsic functions require IDA v5.6 or higher. Older versions of IDA do not have the necessary functionality and register definitions.
Some intrinsic functions work with XMM constant values (16 bytes long). Modern compilers do not accept 16-byte constants yet, but the decompiler may generate them when needed.
Sometimes it is better to represent SSE code using inline assembly rather than with intrinsic functions. If the decompiler detects SSE instructions in the current function, it adds one more item to the popup menu. This item allows the user to enable or disable SSE intrinsic functions for the whole database. This setting is remembered in the database. It can also be modified in the configuration file for new databases.
The decompiler knows about all MMX/XMM built-in types. If the current database does not define these types, they are automatically added to the local types as soon as an SSE instruction is decompiled.
Scalar SSE instructions are never converted to intrinsic functions. Instead, they are directly mapped to floating point operations. This usually produces much better output, especially for Mac OS X binaries.
The scalar SSE instructions that cannot be mapped into simple floating point operations (like sqrtss) are mapped into simple functions from math.h.
The decompiler uses intrinsic function names as defined by Microsoft and Intel.
The decompiler does not track the state of the x87 and mmx registers. It is assumed that the compiler generated code correctly handles transitions between x87 and mmx registers.
Some intrinsic functions are not supported because of their prototype. For example, the __cpuid(int a[4], int b) function is not handled because it requires an array of 4 integers. We assume that most cpuid instructions will be used without any arrays, so adding such an intrinsic function would obscure things rather than make the code more readable.
Feel free to report all anomalies and problems with intrinsic functions using the Send database command. This will help us to improve the decompiler and make it more robust. Thank you!
See also: Failures and troubleshooting
In some cases the decompiler cannot produce nice output because the variable allocation fails. It happens because the input contains overlapped variables (or the decompiler mistakenly lumps together memory reads and writes). Overlapped variables are displayed in red so they are conspicuously visible. Let us consider some typical situations.
For example, consider the following output:
The last assignment to v1 reads beyond v1 boundaries. In fact, it also reads v2. See the assembly code:
Unfortunately, the decompiler cannot handle this case and reports overlapped variables.
Arrays cannot be passed to functions by value in C, so an array in the argument area will lead to a warning. Just get rid of such an array (embed it into a structure type, for example).
The decompiler can handle up to 64 function arguments. It is very unlikely to encounter a function with a bigger number of arguments. If so, just embed some of them into a structure passed by value.
The corrective actions include:
Check the stack variables and fix them if necessary. A wrongly defined variable can easily lead to an lvar allocation failure.
Define a big structure that covers the entire stack frame or part of it. Such a big variable will essentially turn off variable lumping (if you are familiar with compiler jargon: the decompiler builds a web of lvars during lvar allocation, and some web elements become too big, which is why variable allocation fails). Instead, all references will be done using the structure fields.
Check the function argument area of the stack frame and fix any wrong variables. For example, this area should not contain any arrays (arrays cannot be passed by value in C). It is OK to pass structures by value; the decompiler accepts it.
Currently the list is very short but it will grow with time.
The output is excessively short for the input function. Some code which was present in the assembly form is not visible in the output.
This can happen if the decompiler decided that the result of these computations is not used (so-called dead code). The dead code is not included in the output.
One very common case of this is a function that returns the result in an unusual register, e.g. ECX. Please explicitly specify the function type and tell IDA the exact location of the return value. For example:
Read IDA help about the user defined calling conventions for more info.
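The original example declaration is not shown above; a hypothetical one for a function that returns its result in ECX (the function name and argument are invented) could be:
int __usercall sub_401000@<ecx>(int a1);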
Another quite common case is a function whose type has been guessed incorrectly by IDA or the decompiler. For example, if the guessed type is
but the correct function type is
then all computations of the function arguments will be removed from the output. The remedy is very simple: tell IDA the correct function type and the argument computations will appear in the output.
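A hedged sketch of this situation (the names are invented): if the guessed type is
int sub_401000(void);
but the correct type is
int sub_401000(int a1, char *a2);
then the code in the caller that computes a1 and a2 looks dead and disappears from the pseudocode until the correct prototype is specified.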
In general, if the input information (function types) is incorrect, the output will be incorrect too. So please check them!
The following code
is being translated into:
This does not look correct. Can this be fixed?
This happens because the decompiler does not perform type recovery. To correct the output, modify the definition of CommandLine in IDA. For that, open the stack frame (Edit, Functions, Open stack frame), locate CommandLine, and set its type to be an array (Edit, Functions, Set function type). The end result will be:
Old databases do not contain some essential information. If you want to decompile them, first let IDA reanalyze the database (right click on the lower left corner of the main window and select Reanalyze). You will also need to recreate indirect (table) jump instructions, otherwise the switch idioms will not be recognized and decompilation of the functions containing them will fail.
In general, there is no need to file a bug report if the decompiler gracefully fails. A failure is not necessarily a bug. Please read the graceful failures section to learn how to proceed.
Sure, it can be improved. However, given that many decompilation subproblems are still open, even simple things can take enormous time. Meanwhile, we recommend using a text editor to modify the pseudocode.