Reverse-engineering part 1: the toolchain workflow

Before diving into reverse-engineering programs, we need to understand how programs are engineered in the first place. Only then does it become clear what reversing that process actually means.

This is a simplified, high-level overview of the traditional compilation and linkage toolchain workflow. In this part, we will use an imaginary implementation of the C programming language as an example, so as not to get distracted by unnecessary details.

Obligatory car analogy

Let’s imagine a car company. The company produces cars, which are assembled from parts, which are themselves manufactured from blueprints. To produce a car, the company uses the following process:

First, the automotive engineer creates a set of blueprints for parts;
Then, the manufacturing technician takes each blueprint and manufactures the corresponding part;
Finally, the assembly line worker takes all the parts and assembles the car.

Computer programs are similarly produced, by designing components that are then built and linked together to form an executable.

Toolchain: a theoretical model

The toolchain is the toolkit used by a programmer to transform a set of source code files written by a human into an executable that can be run by a computer. It includes many tools, in particular:

The compiler: a program that turns a source code file into an object file.
The linker: a program that turns a set of object files into an executable.

Figure 1: Diagram of the toolchain workflow.

Comparing this toolchain model with the car analogy, we can draw the following equivalencies:

Car company employee	Produces	Toolchain model	Creates
Automotive engineer	Blueprint for part	Programmer	Source code file
Manufacturing technician	Part	Compiler	Object file
Assembly line worker	Car	Linker	Executable file

Source code files

A source code file is the output of the programmer. It contains definitions of variables and functions written in plain text. The computer can't execute a program written in this format: the source code must first be compiled and linked to generate a file that is suitable for execution.

Object files

An object file is the output of the compiler, built from a source code file. It is said to be relocatable because its contents can be relocated anywhere in memory. It contains sections, symbols and relocations:

A section is an array of bytes with a name. These bytes can represent machine code or data. Sections in relocatable object files do not have fixed addresses because the linker will decide later where they will be located in memory.
A symbol is a symbolic name for an address. It can be defined, meaning the symbol refers to a particular location of a section of this object file, or undefined, meaning the symbol is used by this object file but not provided by it.
A relocation is a request to fix up a spot inside a section with the final address of a symbol. Relocations can be applied to machine code or to data; they must all be resolved before the executable can be executed.
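To make these three notions concrete, here is a minimal sketch of how a relocatable object file could be modeled in C. Every name in it (section, symbol, relocation, object_file, UNDEFINED_SECTION, count_undefined_symbols) is invented for illustration and does not correspond to any real object file format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: the symbol is used here but provided elsewhere. */
#define UNDEFINED_SECTION (-1)

/* A named array of bytes. It has no address yet. */
struct section {
    const char *name;      /* for example ".text" */
    const uint8_t *bytes;  /* machine code or data */
    size_t size;           /* number of bytes */
};

/* A symbolic name for a location inside a section. */
struct symbol {
    const char *name;
    int section;           /* index into sections, or UNDEFINED_SECTION */
    size_t offset;         /* position inside that section, if defined */
};

/* A request to patch a spot with the final address of a symbol. */
struct relocation {
    int section;           /* section that contains the spot to patch */
    size_t offset;         /* position of the spot inside that section */
    int symbol;            /* index of the symbol whose address is needed */
};

struct object_file {
    const struct section *sections;
    size_t section_count;
    const struct symbol *symbols;
    size_t symbol_count;
    const struct relocation *relocations;
    size_t relocation_count;
};

/* A symbol is defined when it refers to a section of this object file. */
int symbol_is_defined(const struct symbol *sym)
{
    return sym->section != UNDEFINED_SECTION;
}

/* Counts the symbols that another object file will have to provide. */
size_t count_undefined_symbols(const struct object_file *obj)
{
    size_t count = 0;
    for (size_t i = 0; i < obj->symbol_count; i++) {
        /* A defined symbol must point at an existing section. */
        assert(obj->symbols[i].section < (int)obj->section_count);
        if (!symbol_is_defined(&obj->symbols[i]))
            count++;
    }
    return count;
}
```

With this model, an object file that defines main and merely uses write would report exactly one undefined symbol.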

Executable files

An executable is the output of the linker, linked from a set of relocatable object files. Unlike object files, it is not relocatable because it has a fixed memory layout, created as part of the linking process.

In order to produce a working executable from object files, the linker lays out all the sections of the input object files in memory (assigning addresses to sections in the process) and proceeds to fix up all of the relocations. Once an executable is linked, its symbols and relocations are no longer needed and can be omitted.
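The two steps described above, laying out the sections and then fixing up the relocations, can be sketched as follows. This is a toy model built on several assumptions: addresses are 32 bits wide and stored in little-endian order, sections are placed back to back without any alignment, and every symbol is already defined. The names layout_sections, symbol_address and apply_relocation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct section {
    uint8_t *bytes;    /* contents, patched in place by relocations */
    size_t size;
    uint32_t address;  /* unknown until the layout step has run */
};

struct symbol {
    int section;       /* index of the section that contains the symbol */
    uint32_t offset;   /* position of the symbol inside that section */
};

struct relocation {
    int section;       /* section that contains the spot to patch */
    uint32_t offset;   /* position of the spot inside that section */
    int symbol;        /* symbol whose final address must be written */
};

/* Step 1: place the sections one after another, starting at base. */
void layout_sections(struct section *sections, size_t count, uint32_t base)
{
    uint32_t next = base;
    for (size_t i = 0; i < count; i++) {
        sections[i].address = next;
        next += (uint32_t)sections[i].size;
    }
}

/* Once sections have addresses, so does every defined symbol. */
uint32_t symbol_address(const struct section *sections,
                        const struct symbol *sym)
{
    return sections[sym->section].address + sym->offset;
}

/* Step 2: write the final address of the symbol into the requested
   spot, as 4 little-endian bytes (an assumption of this sketch). */
void apply_relocation(struct section *sections,
                      const struct symbol *symbols,
                      const struct relocation *rel)
{
    struct section *target = &sections[rel->section];
    assert(rel->offset + 4 <= target->size);

    uint32_t address = symbol_address(sections, &symbols[rel->symbol]);
    for (size_t i = 0; i < 4; i++)
        target->bytes[rel->offset + i] = (uint8_t)(address >> (8 * i));
}
```

A linker following this sketch would call layout_sections() once, then apply_relocation() for every relocation of every input object file. Real linkers also handle alignment, symbol lookup across object files and several kinds of relocations, all of which are left out here.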

Toolchain: an imaginary example

We will illustrate this model by walking through a “Hello, world!” program written in the C language and built with an imaginary implementation of C that is suitable for illustration purposes.

Source code file

This is the source code of the program we’ll use as an example:

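#include <unistd.h>

/* Writable pointer to a read-only string. */
const char* MESSAGE = "Hello, world!\n";

int main() {
    /* 1 is the standard output; 14 is the length of the message. */
    write(1, MESSAGE, 14);
    return 0;
}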

This program writes the message Hello, world! to the standard output stream using the standard POSIX function write(). It defines the variable MESSAGE as a pointer to a constant character array, initialized to the address of the greeting message. Inside the main() function, the write() function is called with the value of MESSAGE given as a parameter; main() then returns the integer 0, indicating a successful execution of the program.

Note that MESSAGE is a pointer to a constant character array, not a constant pointer. Its value can be modified to point somewhere else, so the MESSAGE variable itself is writable.
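The difference between a pointer to constant characters and a constant pointer can be shown with a short sketch. FIXED_MESSAGE and retarget_message() are hypothetical names added for contrast; they are not part of the example program.

```c
#include <assert.h>
#include <string.h>

/* Pointer to constant characters: the characters are read-only,
   but the pointer itself may be changed. */
const char *MESSAGE = "Hello, world!\n";

/* Constant pointer to constant characters (hypothetical, for contrast):
   neither the characters nor the pointer may be changed. */
const char *const FIXED_MESSAGE = "Hello, world!\n";

/* Legal: only the pointer stored in MESSAGE is overwritten. */
void retarget_message(const char *replacement)
{
    assert(replacement != NULL);
    MESSAGE = replacement;
}

/* Both of the following lines would be rejected by the compiler:
 *
 *     MESSAGE[0] = 'J';             (the characters are const)
 *     FIXED_MESSAGE = "Goodbye!";   (the pointer is const)
 */
```

Because MESSAGE can be overwritten at run time, the variable has to live in writable memory, while the string it initially points to can stay in read-only memory.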

Object file

After compiling the source code file, the compiler will emit a relocatable object file:

This object file contains the following data:

Three sections:
- .text contains the machine instructions of the main() function in an executable section;
- .rodata contains the constant string Hello, world!\n in a read-only section;
- .data contains the variable MESSAGE in a writable section.
Three defined symbols:
- main is located inside the .text section;
- $LC0 for the string Hello, world!\n is located inside the .rodata section;
- MESSAGE is located inside the .data section.
One undefined symbol:
- write is used by the main function.
Three relocations:
- A relocation within the .data section to fix up MESSAGE with the address of $LC0;
- A relocation within the .text section to fix up the address of MESSAGE;
- A relocation within the .text section to fix up the address of write.

The symbols (highlighted in blue) do not have addresses at this stage because the sections that contain them haven’t been laid out in memory yet. Therefore, every reference to a symbol in an object file is annotated with a relocation (highlighted in red), a request to patch in the final address of that symbol at a later time. The linker will fill in these relocations after laying out the sections in memory, which assigns an address to every symbol in the object file.

Executable file

After linking the object file, the linker will emit an executable file:

Once linked, the executable still contains the three sections .text, .rodata and .data from the relocatable object file, but laid out in memory: every byte in a section now has an address. Since the relocations have been applied (highlighted in green) and are no longer useful, they are not included in the executable. Similarly, the symbols are no longer needed and can be omitted too, but were kept in this example for clarity.

Given its fixed memory layout and lack of relocations, the executable file is not relocatable. Moving sections around would break the addresses embedded within the executable code and data.

Conclusion

We have learned how a toolchain works and how a programmer can use one to create an executable file from a set of source code files. Next time, we will start our case study by building a program that we will then reverse-engineer.
