
Finding Input for Unknown Binaries

Written by bcopos on May 26, 2015.

Software testing is a tedious and expensive process. Academics and security researchers have explored methods for program testing for decades. To date, there are two main options for (mostly) automatically testing programs and finding bugs: symbolic execution and fuzzing.

Symbolic execution is a program analysis technique introduced in the 1970s. In the last decade, thanks to advances in computational power and solver efficiency, the technique has regained popularity. Symbolic execution usually works as follows. First, the memory buffer associated with the user input is marked as symbolic. This step is often done manually by the tester and usually involves editing the source code (as is the case with KLEE and FuzzBALL). Once the instrumentation is done, the program is executed. During execution, the constraints imposed on the input along the execution path are gathered. These constraints are then solved to generate new inputs, which are used to further explore the program and other execution paths. The goal is to find input which will crash the program or cause unwanted behavior.
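
To make the constraint-solving step concrete, here is a toy sketch (my own illustration, not taken from any of the tools above) using the Z3 solver's Python bindings. Suppose the path we want to reach is guarded by a branch like `if (x * 3 + 7 == 43)`; the gathered path constraint is handed to the solver, which produces a concrete input that drives execution down that path:

```python
from z3 import BitVec, Solver, sat

# Symbolic variable standing in for one byte of user input.
x = BitVec("x", 8)

# Path constraint gathered along the branch we want to reach,
# e.g. the true side of: if (x * 3 + 7 == 43)
solver = Solver()
solver.add(x * 3 + 7 == 43)

if solver.check() == sat:
    model = solver.model()
    # The concrete input value that takes this execution path.
    print("input byte:", model[x])  # prints 12
```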

Fuzzing is the other option for automatic program analysis. Fuzzing comes in three flavors: whitebox, blackbox, and greybox. Whitebox fuzzing is very similar to symbolic execution and actually uses symbolic execution as part of its process. Blackbox fuzzing randomly generates input and feeds it to the program under test. Greybox fuzzing uses some information about the structure of the input, either known in advance or gathered during testing, to guide the random generation of new inputs.
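
As an illustration of the blackbox case, here is a minimal sketch; the `./target` binary and crash detection via signals are my assumptions, not part of the original discussion:

```python
import random
import subprocess

def blackbox_fuzz(program, iterations=1000, max_len=64):
    """Feed random byte strings to a program and report crashing inputs."""
    for _ in range(iterations):
        data = bytes(random.randrange(256)
                     for _ in range(random.randrange(1, max_len)))
        proc = subprocess.run([program], input=data,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        # A negative return code means the process died from a signal,
        # e.g. -11 for SIGSEGV; treat that as a crash worth saving.
        if proc.returncode < 0:
            print("crash with input:", data.hex())

# blackbox_fuzz("./target")  # hypothetical binary reading from stdin
```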

Both approaches have had considerable success and have found numerous bugs in various pieces of software. However, they also have disadvantages and limitations. Constraint solving, an essential component of symbolic execution, is efficient only with linear constraints; in general, the constraint satisfaction problem over a finite domain is NP-complete, and only restricting the domain or the type of constraints yields polynomial-time solving. There are also complications when it comes to symbolic execution of floating point computations. To overcome some of these limitations, concolic execution, which combines symbolic and concrete execution, was introduced (in the paper "CUTE: A concolic unit testing engine for C" by Sen et al.). Fuzzing has its own limitations: it is often slow, achieves low code coverage (especially blackbox fuzzing), and can take a long time to produce results.

While working on a project which involved program analysis, I came up with another method. The method I am about to describe DOES have limitations, and it is not the silver bullet security experts are looking for. However, it has some advantages and it is rather interesting (at least IMO). The method finds input (both benign and malicious) by exploiting information from hardware performance counters during the execution of a program (see my previous post about hardware performance counters).

DISCLAIMER: In this post, I use "instructions executed" and "instructions retired" interchangeably. This is not strictly correct; the two values are not always equal.

The main idea is this: programs usually contain some input validation logic. When a program is executed with valid input (vs. invalid/malformed input), it executes a different number of instructions (more or fewer, depending on the implementation). NOTE: this assumes that for a given index of the input string, most printable characters are not valid at that index. The number of instructions executed can be measured using hardware performance counters and exploited to incrementally (one character at a time) generate valid input strings for the program. The idea essentially exploits this side channel (the number of instructions executed) to determine whether an input string or character is valid.

How does it actually work?

First, execute the program once for each character in the set of printable characters, using that character as the input. Using the perf utility, record the number of user-space instructions retired for each of these executions. After trying all printable characters, find the mode of the instruction counts. Since most characters are assumed to be invalid, the characters whose instruction count deviates from the mode are likely to be valid characters at the first index of the input string. Now test the program in the same manner, except with input strings of length 2 whose first character is one previously discovered. Repeating this process allows one to incrementally build valid input strings.
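
Below is a minimal sketch of this loop (my reconstruction; the target `./prog`, which takes the guess as its first argument, is a hypothetical stand-in). It uses `perf stat -e instructions:u` to count user-space instructions retired and flags characters whose count deviates from the mode:

```python
import string
import subprocess
from collections import Counter

def instructions_retired(program, guess):
    """Run the program under perf and return user-space instructions retired."""
    # perf stat prints its counters to stderr; -x, selects CSV output.
    proc = subprocess.run(
        ["perf", "stat", "-x,", "-e", "instructions:u", program, guess],
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if (len(fields) >= 3 and fields[2].startswith("instructions")
                and fields[0].isdigit()):
            return int(fields[0])
    raise RuntimeError("could not parse perf output")

def likely_next_chars(program, prefix):
    """Return characters whose instruction count deviates from the mode,
    i.e. the likely valid characters at the next index of the input."""
    counts = {c: instructions_retired(program, prefix + c)
              for c in string.printable if c.isprintable()}
    mode = Counter(counts.values()).most_common(1)[0][0]
    return [c for c, n in counts.items() if n != mode]

# Incrementally build an input string for the hypothetical ./prog binary:
# prefix = ""
# while True:
#     candidates = likely_next_chars("./prog", prefix)
#     if not candidates:
#         break
#     prefix += candidates[0]
#     print("current guess:", prefix)
```

In practice the counts can be a little noisy from run to run; averaging several runs per character makes the mode more stable.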

Does it always work?

No, this does not always work. The main assumption is that most characters are not valid at a given index of the input string. If that assumption does not hold (e.g. more than half of the printable characters are valid at index 0), this method will not work. Furthermore, if the input is first hashed and then compared, this will not work. I also have not tested it with non-printable characters. Additionally, the method works when the input is compared byte by byte using normal string or character comparison functions (e.g. strcmp()), but not when the input is interpreted as hex values (unless you try 2 characters at once).
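
To see why hashing defeats the technique, contrast these two toy validation routines (mine, for illustration; the secret is hypothetical). The first leaks one loop iteration per correct prefix byte, the second does a fixed amount of work regardless of how close the guess is:

```python
import hashlib

SECRET = "s3cret"  # hypothetical password baked into a target

def check_early_exit(guess):
    # Leaks via the side channel: each correct prefix character costs
    # one more loop iteration, so the instruction count reveals progress.
    for a, b in zip(guess, SECRET):
        if a != b:
            return False
    return len(guess) == len(SECRET)

def check_hashed(guess):
    # Does not leak per-character information: the comparison happens on
    # digests, so every wrong guess executes roughly the same instructions.
    return (hashlib.sha256(guess.encode()).digest()
            == hashlib.sha256(SECRET.encode()).digest())
```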

So then what's the point?

"All that glitters is not gold." (Shakespeare)

This is definitely not gold, and one could argue about the method's performance on large, complex software systems... but there is still some value here.

This method was tested on several binaries, including a binary from a cyber-security job interview (where the interviewee was asked to crack it), and it actually DOES work. Reverse engineering a program is usually complicated and requires quite a bit of knowledge about binary formats, debuggers, assembly, and many other things. In contrast, this method is rather quick and can lead to similar results: in the case of the interview challenge binary, it cracked the password in less than 5 minutes.

Perhaps the main takeaway from this experiment is to not ignore side effects and side channels when testing programs. After realizing the limitations of this technique, I looked further at the cmp instruction. If you look at its documentation, the values of the flags it sets can give you information about the expected value (e.g. whether the input was less than or greater than it, not merely unequal). I think there is a lot of information available from the hardware that can be used to help with program analysis. At Black Hat 2015, Herath and Fogh presented work which also explores the use of hardware performance counters for security purposes. Their work is very interesting, and the idea is a lot more mature than what I described in this blog post. I suggest reading it if you have the time!