30 Years of Decompilation and the Unsolved Structuring Problem: Part 2

TL;DR

A two-part series on the history of decompiler research and the fight against the unsolved control flow structuring problem. In part 1, we revisit the history of foundational decompilers and techniques, concluding with a look at modern works. In part 2, we deep-dive into the fundamentals of modern control flow structuring techniques and their limitations, and look to the future.

A Recap

In Part 1 of this series, we looked at the past 30 years of decompilation, providing a broad overview of how the field has been shaped. We also briefly covered how Cifuentes-style structuring works: graph-schema matching, where an algorithm identifies known patterns in the CFG and turns them into code. Part 1 also garnered some interesting discussion across the web, on Hacker News, Reddit, and Twitter, among other places. Generally, I think the community is very alive and excited about research and posts in this area, which is encouraging :).

In this post, we focus on only four papers from the last 11 years:

2024: USENIX   - Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation
2020: Asia CCS - A Comb for Decompiled C Code
2015: NDSS     - No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations. 
2013: USENIX   - Native x86 Decompilation Using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring.

These papers, together, make up the entire history of decompilation structuring research in the security community. We will briefly overview the methods proposed in each of these works and how they influenced the next paper. Admittedly, the final paper in this series is SAILR, my own paper. As such, I'll be using many of the same arguments and questions from my paper to examine the previous works mentioned here. As noted in the TL;DR, Part 2 is more technical than Part 1, but it should still be accessible to newcomers.

As a reminder, though these are all of the decompilation structuring papers in security, many structuring techniques were informed by work from other fields and conferences, including the compilers, static analysis, and programming languages research communities. Here are some notable ones, listed for completeness, that will not be covered in this post:

2022: Beyond Relooper: recursive translation of unstructured control flow to structured control flow (functional pearl) 
2011: Emscripten: an LLVM-to-JavaScript compiler
1997: Making Graphs Reducible with Controlled Node Splitting
1973: On the capabilities of while, repeat, and exit statements
1970: Control flow analysis

Phoenix: Condition-Aware Schema Matching

The 2013 work Phoenix came at an interesting time: by the time it was published, IDA Pro had had a publicly available, working decompiler for over 6 years. IDA Pro had been generating better code than other decompilers, but no one in the academic community knew exactly why, because it is closed-source. In many ways, Phoenix was an attempt to make public the techniques that IDA Pro had been using to beat others, though Phoenix itself was never open-sourced. As such, the Phoenix decompiler core is much like Cifuentes' work on schema matching, but with improved condition awareness and more schema.
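To give a sense of what "condition awareness" buys you, here is a hedged sketch of my own (not an excerpt from the Phoenix paper): a CFG produced by short-circuit evaluation can be structured either as nested ifs or as a single compound condition, and a condition-aware schema prefers the latter.

/* Without a short-circuit schema, the structurer emits nested ifs: */
if (x) {
  if (y)
    do_thing();   /* do_thing() is a placeholder call */
}

/* With a condition-aware schema, the same CFG becomes the likelier source form: */
if (x && y)
  do_thing();

Both fragments are semantically identical; the second simply recovers the compound condition a programmer probably wrote.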

Phoenix elaborated on many of the modern mechanics of structuring, such as loop refinement, switch refinement, condition handling, and even some special if-else case structuring. This was important enough that some hacker decompilers, like Reko, implemented much of the Phoenix paper. However, the largest impacts Phoenix had on the future of decompiler research can be summarized in three proposed points:

  1. When you run out of good schema to match, you must use a goto. The process of turning an edge into a goto is called virtualization.
  2. Gotos in decompilation are bad; reduce them.
  3. Coreutils is a good dataset for the evaluation of decompilers.

To understand some of those claims, let’s look back at our example graph from Part 1:

Example Graph

Recall that B-E makes a good if-else, but A->C violates that structure. In Cifuentes' work, you're also allowed to remove A->B, which will result in something structurable. Phoenix introduces more rules for choosing which edge to virtualize (make into a goto):

“… remove edges whose source does not dominate its target, nor whose target dominates its source.”

Under these rules, you must choose to virtualize A->C. After virtualizing, the graph loses that edge, and the edge is recorded as a goto. The resulting graph looks like this:

Example Virtualized Graph

Now the graph contains only schema we can match, resulting in structured code with only 1 goto. Had you chosen to virtualize A->B instead, you could have ended up with 2 gotos. Indeed, for this CFG, 1 goto was better than 2. According to Phoenix, this holds in general as you reduce gotos: gotos are considered harmful.
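For concreteness, here is a sketch (mine, not code from the Phoenix paper) of what the structured output could look like after virtualizing A->C; the label name is made up:

A();
if (x)
  goto label_c;   /* the virtualized A->C edge */
B();
if (y) {
label_c:
  C();
}
else {
  D();
}
E();

The virtualized edge survives only as that one forward goto; every other edge is consumed by an if-else schema.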

The most popular decompilers today follow the same principles outlined in Phoenix. That list includes Ghidra, IDA Pro, and Binary Ninja. I largely attribute their differences in output to better schema (in IDA's case) or better lifting (in Ghidra's case).

The idea that you should reduce the number of gotos emitted in decompilation had a profound effect on the research direction future academics took. Goto count became the prime metric for judging whether a decompiler got better. With a metric to min-max, the next two papers focused on doing just that: driving the number of emitted gotos to 0.

DREAM: Schemaless Condition-Based Structuring

2015 was an active year for decompilation, producing many hacker decompilers and a new structuring algorithm. The paper "No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations", colloquially referred to as DREAM, boasted a novel algorithm that could output decompilation with 0 gotos. This is a significant feat considering that, according to the DREAM paper, Phoenix emitted over 4,231 gotos in its output of Coreutils.

To make such a quantitatively large jump required a drastically different structuring algorithm. DREAM proposed the idea that you don't need schema to produce decompilation. Instead, you can use the reaching condition of each statement to generate semantically equivalent code. Looking at our example CFG, DREAM would initially generate the following code from those conditions:

A();
if (!x)
  B();
if ((!x && y) || x)
  C();
if (!x && !y)
  D();
E();

Note that DREAM avoids emitting a condition for E, since every path reaches it. After that, DREAM simplifies the boolean expressions to get the following:

A();
if (!x)
  B();
if (x || y) {
  C();
}
else {
  D();
}
E();

DREAM demonstrated an interesting and effective way to produce code that is always goto-less. However, the approach came with two fundamental drawbacks:

  1. Simplifying arbitrary boolean expressions is NP-hard
  2. Destroying every goto means gotos that existed in the source get eliminated

First, in practice, it's easy to find code with many conditions. For instance, here is some decompiled code from tail in Coreutils, produced by DREAM as reimplemented in angr:

tail: ignore_fifo_and_pipe

        if (!v2 && !a0->field_34 && a0->field_38 >= 0 && (a0->field_30 & 0xf000) == 0x1000)
        {
            a0->field_38 = -1;
            a0->field_34 = 1;
        }
        if (a0->field_38 < 0 || v2 || a0->field_34 || (a0->field_30 & 0xf000) != 0x1000)
            v1 += 1;

If the SAT problem were efficiently solvable, those two expressions could have been reduced. Because you can't guarantee that an expression will simplify, you are left with many overlapping booleans. In a slide from my Ohio State talk, I counted the number of boolean operators in the output of DREAM and the other algorithms: DREAM had around 9,600 booleans, Phoenix had 342, and the source code (Coreutils) had 1,256.
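Second, some gotos are deliberate. A hypothetical example (my own, not taken from Coreutils) is the classic error-handling cleanup pattern:

#include <stdio.h>
#include <stdlib.h>

/* An intentional source-level goto: jump to a single cleanup point on failure. */
int load_file(const char *path)
{
  FILE *f = NULL;
  char *buf = NULL;
  int ret = -1;

  f = fopen(path, "rb");
  if (!f)
    goto out;
  buf = malloc(4096);
  if (!buf)
    goto out;
  if (fread(buf, 1, 4096, f) == 0)
    goto out;
  ret = 0;      /* success */
out:            /* shared by the success path and every failure path */
  free(buf);    /* free(NULL) is a no-op */
  if (f)
    fclose(f);
  return ret;
}

A structurer that refuses to emit gotos has to rewrite code like this into duplicated cleanup blocks or condition-guarded cleanup, drifting away from what the programmer actually wrote.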

Despite its flaws, DREAM still had an important effect on decompilation. Namely, it introduced the importance of Single-Entry Single-Exit (SESE) regions, which played a fundamental role in future papers. It also had a follow-up paper that explored, with a human study, how this algorithm might work on C++.

So, what could goto-less output look like if we got around the boolean expression simplification problem?

rev.ng: Code Cloning Schema-Matching

About five years after DREAM, in 2020, the paper "A Comb for Decompiled C Code" was published, describing rev.ng, a closed-source commercial decompiler. The rev.ng decompiler targeted the same end goal as DREAM: 0 gotos in decompilation. However, it achieved this through different means, sticking with schema-based algorithms as its base. Learning from DREAM, the rev.ng approach avoids simplifying booleans and instead uses a new, but intuitive, algorithm called Combing.

Where DREAM ends up duplicating conditions to eliminate gotos, Combing duplicates actual code, rewriting the CFG before it is structured. In this way, rev.ng uses the same structuring base as Phoenix but fixes up the CFG first. Combing attempts to turn every graph into a series of layered diamonds, which results in nicely structured if-else trees. Using our example CFG once again, we can see the result of "Combing" the CFG:

Combing Example Graph

Take note that the original graph had only one C node, but now there are two. Previously, the edge A->C caused a goto, so Combing copied the code at C into a new node. The new structure is a perfect layering of diamonds, which Phoenix-style structuring turns into the following code:

A();
if (x) {
  C();
}
else {
  B();
  if (y)
    C();
  else
    D();
}
E();

As some Redditors on r/Compilers discussed, this approach can lead to an exponential increase in code size. It also duplicates code in cases where more schema could have structured the graph without duplication. Similar to DREAM, the largest issue with this approach is its divergence from source code that contains gotos.


At this point in the post, we arrive at the final paper in this series on structuring, SAILR. As the first author of that paper, I'll be writing this section a little differently from the others. Much of the progression of this story, as well as the questions fueling it, came out of my work on SAILR.

SAILR: Compiler-Aware Structuring

After understanding the fundamental gains of each of these papers, you may still have some lingering questions. In Phoenix, we reduced gotos with more schema. In DREAM, we reduced gotos at the expense of duplicating conditions. In rev.ng, we reduced gotos at the expense of duplicating code. But one question was never answered: why do gotos exist in decompilation at all? And is goto reduction really the best way to measure whether decompilers are improving?

In our 2024 work “Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation”, referred to as SAILR, we explored these questions. We also open-sourced the entire decompiler, evaluation pipeline, and reimplementations of all previous work mentioned here.

Back to our first question: why do gotos exist in decompilation? In schema-based structuring algorithms, we know that gotos appear when no schema matches a section of the CFG. But why do CFGs contain unstructurable patterns in the first place? Surely, if you compile code with no gotos, you should always be able to structure the resulting graph without gotos?

As you likely guessed, when you compile goto-free code you can get decompilation with gotos. Take this simplified example based on the SAILR paper:

Comparing Code and Decompilers

// ==================== Source Code ====================
if (a && b)
  puts("first");

puts("second");
if (b)
  puts("third");

sleep(1);
puts("fourth");

// ==================== IDA Pro (and others)* ====================
if ( a && b )
{
  puts("first");
  puts("second");
  goto LABEL_6;
}
puts("second");
if ( b ) {
LABEL_6:
  puts("third");
}
sleep(1);
puts("fourth");

// ==================== DREAM ====================
if (a && b) {
    puts("first");
    puts("second");
}
if (!a || !b)
    puts("second");
if (b)
    puts("third");
sleep(1);
puts("fourth");


*Others include any Phoenix-based decompiler, like Ghidra and Binja. SAILR is the exception; in this example, it reproduces the source code exactly.

The code was compiled with gcc -O2, the default compiler optimization level for packages in Linux. Indeed, IDA Pro's output has both a goto and duplicated code (puts("second")). The shape is consistent with the compiler noticing that a && b implies b, duplicating puts("second") into the taken branch, and jumping straight to puts("third"). In SAILR, we found that the cause of most gotos is just 9 compiler optimizations, most of which are enabled at O2. For decompiler-savvy readers, these findings suggest that graph reducibility is not the core problem in decompilation structuring. The only way to get rid of these gotos, while keeping structure that matches the source, is to precisely revert those optimizations.

This realization was important because up until this point, we had been removing gotos from decompilation without knowing why. Doing something without understanding the root cause always results in side effects, as seen in DREAM and rev.ng. However, doing nothing results in decompilation with gotos that did not exist in the source, as seen in Phoenix.

With this in mind, we fought on two fronts. First, we made our decompiler precisely revert all 9 compiler optimizations, undoing the transformations that introduce gotos. Second, we measured which decompiler was doing best by comparing its decompilation to the source using a sped-up version of Graph Edit Distance, which we call Control-Flow Graph Edit Distance (CFGED). With CFGED, a decompiler can still have gotos in its output and get a good score, as long as its structure matches the source.

A perfect decompiler should produce a CFGED of 0, meaning no graph edit distance from the source, and the same gotos as the source. Here is how everyone did, in a condensed graph derived from the paper's evaluation on 26 C packages:

As you might’ve guessed, goto-less algorithms do the worst when you compare them to their original source. IDA Pro has the best CFGED score, but still produces a huge amount of fake gotos. Our algorithm to produce a CFGED score on source for your own decompiler is open-sourced, so get yours evaluated today!

In conclusion, SAILR can be summarized into three points:

  1. Gotos are caused by compiler optimizations, many of which you can revert.
  2. CFGED is a new algorithm to fairly measure your distance from source.
  3. Not all gotos are evil.

For more information on how CFGED works, how you might revert optimizations, or how we discovered which optimizations cause gotos, read our paper. The appendix also has a nifty step-by-step of the CFGED algorithm in practice. Additionally, if you want to try out any of these previous works in the real world, check out our sailr-eval repo for how to use angr with previous structuring algorithms.

Wrap Up

And here we are, at the end of this whirlwind of a post series. Many issues besides structuring still exist in decompilation, such as low fidelity to source, weak editability, lack of recompilability, and other sub-topics such as poor symbol recovery. Of these issues, I find, perhaps with some bias, that control flow structuring is the most interesting. I hope after reading this, you find it interesting too, since it's still unsolved!

Even with our work on SAILR, which got us much closer to source, there is still so much more the community needs to discover. Notice that in the CFGED-versus-gotos graph, SAILR still emits nearly double the gotos of the source and is nowhere near a CFGED of 0. The future of perfect decompilation is tied to improving both of these scores. I believe with open and community-driven work, we will achieve near-perfect decompilation.

I’d like to acknowledge the DARPA AMP program for funding this research. Lastly, I’d like to thank you, the reader. Your support motivates us to keep pushing the boundaries of what is possible.