gooMBA

Hands-Free Binary Deobfuscation with gooMBA

At Hex-Rays SA, we are constantly looking for ways to improve the usefulness of our state-of-the-art decompiler solution. We achieve this by monitoring for new trends in anti-reversing technology, keeping up with cutting-edge research, and brainstorming ways to innovate on existing solutions.

Today we are excited to introduce a new Hex-Rays decompiler feature, gooMBA, which should greatly simplify the workflow of reverse-engineers working with obfuscated binaries, especially those using Mixed Boolean-Arithmetic (MBA) expressions. Our solution combines algebraic and program synthesis techniques with heuristics for best-in-class performance, integrates directly into the Hex-Rays decompiler, and provides a bridge to an SMT-solver to prove the correctness of simplifications.

MBA Obfuscation Overview

What Is MBA?

A Mixed Boolean-Arithmetic (MBA) expression combines arithmetic (e.g. addition and multiplication) and boolean operations (e.g. bitwise OR, AND, XOR) into a single expression. These expressions are often made extremely complex in order to make it difficult for reverse-engineers to determine their true meaning.

For instance, here is an example of an MBA obfuscation found in a decompilation listing. Note the combination of bitshift, addition, subtraction, multiplication, XOR, OR, and comparison operators within one expression.

v1 = 715827883LL * (char)((((unsigned __int64)(-424194301LL * (a1 >> 4)) >> 35)+(-424194301LL * (a1 >> 4) < 0)) *  a1);
v2 = (char)(((((((unsigned  __int64)(-424194301LL * (a1 >> 4)) >> 35) + (-424194301LL * (a1  >> 4) < 0)) * a1 - 48 * ((v1 >> 35) + (v1  < 0))) ^ 0x28) + 111) | 0x33);
v3 = 818089009LL * (char)(((((((unsigned  __int64)(-424194301LL * (a1 >> 4)) >> 35) + (-424194301LL * (a1  >> 4) < 0)) * a1 - 48 * ((v1 >> 35) + (v1  < 0))) ^ 0x28) + 111) | 0x33);
v4 = (4 * (v2 - 21 * ((v3 >> 34) + (v3  >> 63)))) & 0xF4 | 8;
return (v4 - ((v4 / 0x81) & 0x7F | ((v4 /  0x81) << 7))) ^ 0xE;

For reference, the above code always returns 0x89.

MBA is also used as a name for a semantics-preserving obfuscation technique, which replaces simple expressions found in the source program with much more complicated MBA expressions. MBA obfuscation is called semantics-preserving since it only changes the syntax of the expression, not the underlying semantics — the input/output behavior of the expression should remain the same before and after.

Why is MBA Reversing Difficult?

A decompiler can be thought of as a massive simplification engine — it reduces the mental load of the reverse engineer by transforming a complex binary program into a vastly simplified higher-level readable format. It partially achieves this through equivalences, special pattern-matching rules derived from mathematical properties such as the commutativity, distributivity, and identity. For instance, the following simplification can be performed by applying the distributive property and identity property.

 2a + 3(a+0) = 5a

Both boolean functions and arithmetic functions on integers are very well studied, and there is an abundance of simplification techniques and algorithms developed for each. MBA obfuscators exploit the fact that many of these equivalences and techniques break down when the two function types are combined. For instance, we all know that integer multiplication distributes over addition, but note that the same does not hold over the bitwise XOR:

 3·(2 ⊕ 1) = 3·3 = 9
 (3 ⊕ 2)·(3⊕1) = 1·2 = 2

Advanced Computer Algebra Systems (CAS) such as Sage and Mathematica allow users to simplify arithmetic expressions, but their algorithms break down when we start introducing bitwise operations into our inputs.

Furthermore, although Satisfiability Modulo Theories (SMT) solvers such as z3 do often support both arithmetic and boolean operations on computer integers, they do not perform simplification — at least not for any human definition of "simplicity." Rather, their only goal is to prove or disprove the input formula; as a result, they are useful in proving a simplification correct, but not in deriving the simplification to begin with.

MBA Obfuscation Techniques

The core idea behind MBA obfuscation is that a complex, but semantically equivalent, MBA expression can be substituted for simpler expressions in the source program. For instance, one technique that can be used for MBA generation is the repeated application of simple MBA identities, such as:


x+y=(x|y)+(x&y)

x+y=2(x|y)-(x⊕y)

x|y=(¬x|y)-¬x

x-y=x+¬y+1

Many of these identities are available in the classic book Hacker’s Delight, but there are an effectively unbounded number of them. For instance, Reichenwallner et al. easily generated 1,000 distinct MBA substitutions for x+y alone.

There are also many more sophisticated techniques that can be used for MBA generation, such as applying invertible functions and point functions. The number of invertible functions in computer integers is similarly unbounded. By simply choosing and applying any invertible function followed by its inverse, then applying rewriting rules to mix up the order of operations, an MBA generator can create extremely complex expressions effortlessly.

Effects of MBA Obfuscation

Besides the obvious effect of making decompilation listings longer and more complex for humans to understand, there are a few other effects which this form of obfuscation can have on the binary analysis process.

For instance, dataflow/taint analysis is a static analysis technique that can be used to automatically search for potentially exploitable parts of a program (such as an unsanitized dataflow from untrusted user input into a SQL query). MBA obfuscation can be used to complicate dataflow analysis, by introducing arbitrary unrelated variables into the MBA expression without modifying its semantics. It then becomes extremely difficult to deduce whether or not the newly introduced variable has an effect on the expression’s final value.

An extreme example of this false dataflow technique is known as opaque predicates, whose original expressions have no semantic data inflows (i.e. they are constant). In other words, they always evaluate to a constant, regardless of their (potentially many) inputs. These opaque predicates can then be used for branching, creating false connections in the control-flow graph in addition to the dataflow graph.

Prior Work

Over the years, many algorithms have been developed to simplify MBA expressions. These include pattern matching, algebraic methods, program synthesis, and machine learning methods.

Pattern Matching

Since one of the core techniques involved in MBA generation is the application of rewrite rules, it seems natural to simply match and apply the same rewrite rules in the reverse direction. Indeed, this is precisely what earlier tools such as SSPAM did.

There are several issues with pattern matching methods. Firstly, there are a massive number of possible rewrite rules, and proprietary binary obfuscators are unlikely to reveal what rules they use. In addition, at any given moment an expression might contain multiple subexpressions that each match a pattern, and the order in which we perform these simplifications matters! Performing one simplification might prevent a more optimal simplification from appearing down the line. If we were to attempt every possible ordering of optimizations, our search space quickly becomes exponential. As a result, we considered pure pattern-matching methods to be infeasible for our purposes of simplifying complex MBA expressions.

Algebraic Methods

Arybo is an example of an MBA simplifier that relies entirely on algebraic methods. It splits both inputs and outputs into their individual bits and simplifies each bit of the output individually. It’s clear that this method comes with some limitations. For a 64-bit expression, the program outputs 64 individual boolean functions, and it then becomes quite difficult for a human to combine these functions back into a single simplified expression. Notably, the built-in z3 bitvector simplifier also outputs a vector of boolean functions, since this representation is more useful for its main goal of proving whether or not a statement holds.

Other algebraic algorithms for solving MBA expressions which do not split the expression into individual bits also exist. For instance, MBA-Blast and MBA-Solver use a transformation between n-bit MBA expressions and 1-bit boolean expressions. For linear MBAs (which we will describe in more detail later), this transformation is well-behaved, and a lookup table can trivially be used to simplify the corresponding boolean expression.

SiMBA, another algorithm published by Denuvo researchers in 2022, uses a similar approach to MBA-Blast and MBA-Solver, but additionally makes the observation that the transformation to 1-bit boolean expressions is not necessary for correctness; rather, the authors prove that it is sufficient to simply limit the domains of all input variables to 0/1. As a result, their algorithm yields much better performance; however, it’s important to note that the algorithm still relies on the algebraic structure of linear MBA expressions, and as a result will not work on all MBA expressions found in the wild.

Program Synthesis

Program synthesis is the act of generating programs that provably fulfill some useful criteria. In the case of MBA-deobfuscation, our task is to generate simpler programs that are provably semantically equivalent to the provided obfuscated program. In short, two programs are considered semantically equivalent if they yield identical side effects and identical outputs on every possible set of inputs. For the MBA expressions we consider, the expressions have no side effects or branching, so we are just left with the requirement that the simplified expression must yield the same output for every possible set of inputs.

One core observation made by synthesis-based tools such as Syntia, QSynthesis, and msynth is that for many real-world programs, the underlying semantics are relatively simple. After all, it is much more common to calculate the sum of two numbers x+y, than the result of say, 4529*(x>>(y^(11-~x))). Thus, for the most part, we only need to consider synthesizing relatively simple programs. To be clear, this is still a massive number of programs, but it at least makes the problem tractable.

The main technique used by QSynth and msynth is an offline enumerate synthesis primitive guided by top-down breadth-first search. In simpler terms, these tools take advantage of precomputation, generating and storing a massive database of candidate expressions known as an oracle, searchable by their input/output behavior. Then, when asked to simplify a new expression, they analyze its input/output behavior and use it to perform a lookup in the oracle.

Essentially, the input/output behavior of any expression is summarized by running the candidate expression with various inputs (some random, some specially chosen like 0 or 0xffffffff), collecting the resulting outputs, and hashing them into a single number. We refer to this number as a fingerprint, and the oracle can be thought of as a multimap from fingerprints to expressions. The simplification is then performed by calculating the fingerprint of the expression to be simplified, then looking up the fingerprint in the oracle for simpler equivalent expressions.

Machine Learning

Tools such as Syntia and NeuReduce use machine learning and reinforcement learning techniques to search for semantically equivalent expressions on the spot. However, we found that Syntia’s success rate was quite low — only around 15% on linear MBA expressions, and NeuReduce appeared to only have been evaluated on linear MBA expressions (on which it reported a 75% success rate), which are already solvable 100% of the time through algebraic approaches such as MBA-Blast and SiMBA.

Goals for gooMBA

When designing gooMBA, we had the following goals in mind:

  • Correctness — Obviously, a tool that outputs nonsense is useless, so we should strive to generate correct simplifications whenever feasible. When a true proof of correctness is infeasible, the tool should try to verify the results to a reasonable degree of certainty.

  • SpeedThe Hex-Rays decompiler is well-known in the industry for its speed. Likewise, the tool should strive for the highest performance possible. However, we are obviously willing to sacrifice a couple of seconds in machine-computation time if it means saving a human analyst hours of work.

  • Integration — The decompiler plugin should be able to optionally disappear into the background. Ideally, the user should be able to forget that they are even analyzing an obfuscated program and focus only on the work at hand.

Our Approach

Since there is no single way to generate MBA expressions, we decided to incorporate multiple deobfuscation algorithms into our final design and leave room for more in the future. Our tool, gooMBA, can be split into the following parts: microcode tree walking, simplification engine, SMT proofs of correctness, and heuristics.

Below is a drawing of our overall approach:

Since we found the SMT stage to be the most time-consuming, we run several hundred random test cases on candidate simplifications before attempting a proof.

Microcode Tree Walking

Before we can attempt simplification, we must first find potential MBA-obfuscated expressions in the binary. The Hex-Rays decompiler converts binaries into an intermediate form known as microcode, and continuously propagates variable values downward until a certain complexity limit is reached. Since MBA-expressions can be extremely complex (but notably, not so complex that they hinder performance), we increase the complexity limit when the MBA deobfuscator is invoked in order to maximize the complexity of expressions we can analyze. We then perform a simple tree-search through all expressions found in the program, starting with the most complex top-level expressions, and falling through to simpler subexpressions if they fail to simplify.

Simplification Engine

Our MBA simplification engine is split into three parts, each handling a subset of MBA expressions. We refer to these three parts as the Simple Linear Algorithm, Advanced Linear Algorithm, and the Synthesis Oracle Lookup.

We can think of each one of these three parts as a self-contained module: the obfuscated expression goes in one end, and a set of candidate expressions (each simpler than the obfuscated expression) comes out of the other end. At this stage, these expressions are simply guesses, and may or may not be correct.

One important thing to note is that all three of our subengines are considered black-box, i.e. they do not care about the syntactic characteristics of the expression being simplified, only its semantic properties — i.e. how the outputs change depending on the input values.

Simple Linear Algorithm

One of the fastest and easiest types of expressions we can simplify are those that reduce to a linear equation, i.e.

Note that constants fall under this category as well. We can simplify these easily by emulating the expression we are trying to simplify, first using zeroes for every input variable. This would tell us the value of a0. We can then emulate the instruction once again, this time using zeroes for every input variable except x1. Combined with the previously found value, this tells us the value of a1. We can repeat the process until we’ve obtained all the necessary coefficients. Note that the algorithm can also efficiently detect when a variable needs to be zero- or sign- extended; we can simply try the value -1 for each variable and see which of the zero- or sign-extended versions of the linear equation matches the output value. It can be shown in this case that both checks will succeed if and only if both sign- and zero-extension are semantically acceptable.

Advanced Linear Algorithm

Reichenwallner et al. showed that there is also a fast algorithm, namely SiMBA, to simplify linear MBA expressions, defined as those which can be written as a linear combination of bitwise expressions, i.e.

Where each ei(x1,...,xn)is a bitwise expression. For instance, 2*(x&y) is a linear MBA expression, but neither (x & 0x7) nor (x>>3) are linear MBA expressions, since neither (x & 0x7) nor (x >> 3) are bitwise or can be written as the linear combination of bitwise expressions.

Essentially, the algorithm works by deriving an equivalent representation consisting of linear combinations of only bitwise conjunctions, e.g. 4 + 2*x + 3*x + 5*(x&y). Without going into too much detail, we can recall that every boolean function has a single canonical full DNFform (i.e. it can be written as an OR of ANDs formula), which can then be easily translated into a linear combination of conjunctions. Therefore, every linear MBA expression can be written as a linear combination of conjunctions by simply applying the aforementioned transformation to each individual bitwise function, then combining terms.

Now, this linear combination of ANDs can be easily solved using a similar technique we described in the previous section, with the difference being that we must evaluate every possible combination of 0/1 value inputs, not just the inputs containing zero or one 1-values. Without going into too much detail, the coefficients can then be solved through a system of 2n linear equations of 2n variables, where each variable in the linear system represents one of the conjunctions of original variables, and each equation represents a possible 0/1 assignment to the original variables. We improve upon the algorithm proposed by Reichenwallner et al. by making further observations on the structure of the coefficients in the system and applying the forward substitution technique, yielding a simpler and faster solver.

Finally, Reichenwallner et al. apply an 8-step refinement procedure to find simpler representations, involving more bitwise operations than just conjunction. We found this refinement procedure reasonable and only applied a few tweaks in our implementation.

Synthesis Oracle Lookup

The algebraic engines are great for deriving constants when the expression’s semantics fulfill a certain structural quality, namely that they are equivalent to a linear combination of bitwise functions. However, we found that non-linear MBAs are also common in real-world binaries. In order to handle these cases, it is necessary to implement a more general algorithm that does not rely on algebraic properties of the input expression.

QSynth (2020, David, et al.) and later msynth (2021, Blazytko, et al.) both rely on a precomputed oracle which contains an indexed list of expressions generated through an enumerative search procedure. These expressions are searchable by what we refer to as fingerprints, which can intuitively be understood as a numeric representation of a function’s I/O behavior.

In order to generate a function fingerprint, we begin by generating test cases, which are assignments of possible inputs to the function. For instance, if we had three variables, a possible test case would be (x=100, y=0, z=-1). Then, we feed each one of these test cases into the function being analyzed; for instance, the expression "x - y + z" would yield the output value 99 for the previous test case. Finally, we collect all the outputs and hash them into a single number to get the fingerprint. Now we can look up the fingerprint in the oracle and find a function that is possibly semantically equivalent to the analyzed function.

Note that two functions that are indeed semantically equivalent will always yield the same fingerprints (since they will give the same outputs on the test cases). Therefore, if our oracle is exhaustive enough, it should be possible to find equivalences for many MBA-obfuscated expressions. A large precomputed oracle which can be used with goomba is available here: https://hex-rays.com/products/ida/support/freefiles/goomba-oracle.7z

SMT Proofs of Correctness

In order to have full confidence in the correctness of our simplifications, we feed both the simplified and original expressions into a satisfiability modulo theories (SMT) solver. Without going into too much detail, we translate IDA’s internal intermediate representation into the SMT language, then confirm that there is no value assignment that causes the two expressions to differ. (In other words, a != b is UNSAT.) If the proof succeeds, then we have full faith that the substitution can be performed without changing the semantics of the decompilation. We use the z3 theorem prover provided by Microsoft Research for this purpose.

Heuristics

We found that invoking the SMT solver leads to unreliable performance, since the solver often times out or takes an unreasonable amount of time to prove equivalences. In order to avoid invoking the solver too often, we use heuristics at various points in our analysis. For instance, we detect whether an expression appears to be an MBA expression before trying to simplify it. In addition, every time before we invoke the SMT solver, we generate random test cases and emulate both the input and simplified expressions to ensure they return the same values. We found the latter heuristic to improve performance up to 1,000x in many cases.

Evaluation

We evaluated gooMBA on the dataset of linear MBA-obfuscated expressions on MBA-Solver’s GitHub repository, an assortment of real-world examples from VirusTotal that appeared to be MBA-obfuscated, and an MBA-obfuscated sample object file from Apple’s FairPlay DRM solution. In terms of correctness, we find what we expect — gooMBA, being a combination of multiple algorithms, is able to cover more cases than each algorithm individually.

In terms of performance, we find that gooMBA competes very favorably against state-of-the-art linear MBA solvers, and is able to simplify all of the 1,000 or so examples from MBA-Solver much faster than SiMBA. Note that the comparison is not strictly fair, since SiMBA accepts input expressions as a string, and gooMBA accepts them as decompilation IR; regardless, we claim that accepting decompilation IR leads to a superior user experience with less possibility for human error.

Compared to msynth, the difference is even more dramatic. On the mba_challenge file provided on msynth’s GitHub repo, we measured the runtime to take around 1.87s per expression. In contrast, our equivalent algorithm took just 0.0047s to run, with the z3 proof taking 0.1s.

Future Work

We have presented gooMBA, a deobfuscator that integrates directly into the Hex-Rays decompiler in IDA Pro. This is a meaningful usability trait, since competing tools are typically standalone and require inputting the expression manually or interpreting obtuse outputs. However, this feature also presents some difficulties. For instance, we do not yet perform any use-def analysis or variable propagation beyond what’s already performed by the decompiler. The plugin also currently operates in a purely non-interactive manner, and we believe that adding some interactivity (e.g. allowing the user to choose from a list of simplifications, runs proofs in the background, etc.) would greatly benefit usability.

Some potential areas of improvement for gooMBA are: sign extensions are not handled uniformly across all simplification strategies, point function analysis is limited, the simplification oracle is limited by necessity, and use-def analysis can be strengthened to extract expressions spread across basic blocks.

Finally, it’s important to note that MBA obfuscation and deobfuscation are constantly evolving. We based our algorithm choices and implementations on the most promising research on the cutting-edge, but acknowledge that more effective solutions may appear in the future. For instance, though we found that machine learning techniques for MBA-solving have historically underperformed competing methods, machine learning seems like a good candidate for NP-hard problems such as MBA simplification, and we are watching this space for new solutions.

References

  1. Blazytko, Tim, et al. "Syntia: Synthesizing the semantics of obfuscated code." 26th USENIX Security Symposium (USENIX Security 17). 2017.

  2. Blazytko, Tim, et al. "msynth." https://github.com/mrphrazer/msynth. 2021.

  3. David, Robin, Luigi Coniglio, and Mariano Ceccato. "Qsynth-a program synthesis based approach for binary code deobfuscation." BAR 2020 Workshop. 2020.

  4. Feng, Weijie, et al. "Neureduce: Reducing mixed boolean-arithmetic expressions by recurrent neural network." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.

  5. Liu, Binbin, et al. "MBA-Blast: Unveiling and Simplifying Mixed Boolean-Arithmetic Obfuscation." 30th USENIX Security Symposium (USENIX Security 21). 2021.

  6. Quarkslab. "SSPAM: Symbolic Simplification with PAttern Matching." https://github.com/quarkslab/sspam. 2016.

  7. Quarkslab. "Arybo." https://github.com/quarkslab/arybo. 2016.

  8. Reichenwallner, Benjamin, and Peter Meerwald-Stadler. "Efficient Deobfuscation of Linear Mixed Boolean-Arithmetic Expressions." Proceedings of the 2022 ACM Workshop on Research on offensive and defensive techniques in the context of Man At The End (MATE) attacks. 2022.

  9. Xu, Dongpeng, et al. "Boosting SMT solver performance on mixed-bitwise-arithmetic expressions." Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 2021.

Last updated