Reverse-engineering part 1: the toolchain workflow
Before diving into reverse-engineering programs, we need to look at how programs are engineered in the first place, in order to understand what reversing this process actually means.
This is a simplified explanation intended to give a basic, high-level overview of the traditional compilation and linkage toolchain workflow. In particular, this part uses an imaginary implementation of the C programming language as an example, so that we are not distracted by unnecessary details.
Obligatory car analogy
Let’s imagine a car company. The company produces cars, which are assembled from parts, which are themselves manufactured from blueprints. To produce a car, the company uses the following process:
- First, the automotive engineer creates a set of blueprints for parts;
- Then, the manufacturing technician takes each blueprint and manufactures the corresponding part;
- Finally, the assembly line worker takes all the parts and assembles the car.
Computer programs are similarly produced, by designing components that are then built and linked together to form an executable.
Toolchain: a theoretical model
The toolchain is the toolkit used by a programmer to transform a set of source code files written by a human into an executable that can be run by a computer. It includes many tools, in particular:
- The compiler: a program that turns a source code file into an object file.
- The linker: a program that turns a set of object files into an executable.
Comparing this toolchain model with the car analogy, we can draw the following equivalencies:
Car company employee | Produces | Toolchain model | Creates |
---|---|---|---|
Automotive engineer | Blueprint for part | Programmer | Source code file |
Manufacturing technician | Part | Compiler | Object file |
Assembly line worker | Car | Linker | Executable file |
Source code files
A source code file is the output of the programmer. It contains definitions of variables and functions written in plain text. The computer can’t execute a program written in this format: it must first be compiled and linked to generate a file that is suitable for execution on a computer.
Object files
An object file is the output of the compiler, built from a source code file. It is said to be relocatable because its contents can be relocated anywhere in memory. It contains sections, symbols and relocations:
- A section is an array of bytes with a name. These bytes can represent machine code or data. Sections in relocatable object files do not have fixed addresses because the linker will decide later where they will be located in memory.
- A symbol is a symbolic name for an address. It can be defined, meaning the symbol refers to a particular location of a section of this object file, or undefined, meaning the symbol is used by this object file but not provided by it.
- A relocation is a request to fix up a spot inside a section with the final address of a symbol. Relocations can be applied to machine code or to data; they must be resolved before the executable can be executed.
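To make these three concepts more concrete, here is a minimal sketch in C of how a relocatable object file could be modeled in memory. Every type, field and function name below (Section, Symbol, Relocation, ObjectFile, SECTION_UNDEFINED, and so on) is hypothetical and chosen for illustration only; real object file formats carry considerably more information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A section: a named array of bytes (machine code or data).
 * It has no address yet; the linker decides that later. */
typedef struct {
    const char *name;
    const uint8_t *bytes;
    size_t size;
} Section;

/* Hypothetical marker for a symbol that is used by this object file
 * but not provided by it. */
#define SECTION_UNDEFINED ((size_t)-1)

/* A symbol: a symbolic name for a location inside a section. */
typedef struct {
    const char *name;
    size_t section; /* index into ObjectFile.sections, or SECTION_UNDEFINED */
    size_t offset;  /* offset inside that section (unused if undefined) */
} Symbol;

/* A relocation: a request to patch a spot inside a section
 * with the final address of a symbol. */
typedef struct {
    size_t section; /* index of the section containing the spot to patch */
    size_t offset;  /* where the spot is inside that section */
    size_t symbol;  /* index of the symbol whose address is requested */
} Relocation;

/* The whole object file: sections, symbols and relocations. */
typedef struct {
    Section *sections;
    size_t section_count;
    Symbol *symbols;
    size_t symbol_count;
    Relocation *relocations;
    size_t relocation_count;
} ObjectFile;

/* Returns non-zero if the symbol refers to a location in this object file. */
static int symbol_is_defined(const Symbol *sym)
{
    assert(sym != NULL);
    return sym->section != SECTION_UNDEFINED;
}

/* Counts the symbols that something else will have to provide. */
static size_t count_undefined_symbols(const ObjectFile *obj)
{
    size_t count = 0;

    assert(obj != NULL);
    for (size_t i = 0; i < obj->symbol_count; i++) {
        if (!symbol_is_defined(&obj->symbols[i])) {
            count++;
        }
    }
    return count;
}
```

In this model, sections deliberately have no address field, which mirrors the fact that a relocatable object file leaves the memory layout up to the linker.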
Executable files
An executable is the output of the linker, linked from a set of relocatable object files. Unlike object files, it is not relocatable because it has a fixed memory layout, created as part of the linking process.
In order to produce a working executable from object files, the linker lays out all the sections of the input object files in memory (assigning addresses to sections in the process) and proceeds to fix up all of the relocations. Once an executable is linked, its symbols and relocations are no longer needed and can be omitted.
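The two steps described above (laying out sections, then fixing up relocations) can be sketched as a toy linker in C. This is only an illustration under strong assumptions that are not part of the model itself: addresses are 32 bits wide and stored in the host's byte order, sections are packed back to back without alignment, every symbol is defined, and all names (Section, Symbol, Relocation, layout_sections, apply_relocations) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *name;
    uint8_t *bytes;   /* writable so that relocations can be patched in place */
    size_t size;
    uint32_t address; /* assigned by layout_sections() */
} Section;

typedef struct {
    const char *name;
    size_t section;   /* index of the section that defines the symbol */
    uint32_t offset;  /* offset of the symbol inside that section */
} Symbol;

typedef struct {
    size_t section;   /* index of the section containing the spot to patch */
    uint32_t offset;  /* offset of the spot inside that section */
    size_t symbol;    /* index of the symbol whose address is patched in */
} Relocation;

/* Step 1: place the sections one after another, starting at `base`.
 * Returns the first address past the last section. */
static uint32_t layout_sections(Section *sections, size_t count, uint32_t base)
{
    uint32_t address = base;

    for (size_t i = 0; i < count; i++) {
        sections[i].address = address;
        address += (uint32_t)sections[i].size;
    }
    return address;
}

/* Once sections have addresses, every defined symbol has one too. */
static uint32_t symbol_address(const Section *sections, const Symbol *sym)
{
    return sections[sym->section].address + sym->offset;
}

/* Step 2: patch each relocated spot with the final address of its symbol. */
static void apply_relocations(Section *sections, const Symbol *symbols,
                              const Relocation *relocs, size_t reloc_count)
{
    for (size_t i = 0; i < reloc_count; i++) {
        const Relocation *r = &relocs[i];
        Section *target = &sections[r->section];
        uint32_t value = symbol_address(sections, &symbols[r->symbol]);

        /* The patched spot must lie entirely inside the section. */
        assert(r->offset + sizeof value <= target->size);
        memcpy(target->bytes + r->offset, &value, sizeof value);
    }
}
```

After both steps have run, the symbol and relocation tables are no longer consulted: the section bytes contain plain addresses, which is why an executable can omit them.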
Toolchain: an imaginary example
We will illustrate this model by walking through a “Hello, world!” program written in C and built with an imaginary implementation of the language, suitable for illustration purposes.
Source code file
This is the source code of the program we’ll use as an example:

    #include <unistd.h>

    const char* MESSAGE = "Hello, world!\n";

    int main() {
        write(1, MESSAGE, 14);
        return 0;
    }

This program writes the message Hello, world! to the standard output stream using the standard POSIX function write().
It defines the variable MESSAGE as a pointer to a constant character array, initialized to an address representing the greeting message.
Inside the main() function, the write() function is called with the value of MESSAGE given as a parameter; main() then returns the integer 0, indicating a successful execution of the program.
Note that MESSAGE is a pointer to a constant character array.
It is not a constant pointer: its value can be modified to point somewhere else, so the MESSAGE variable itself is writable.
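The distinction between a pointer to constant data and a constant pointer can be shown with a short C sketch. MESSAGE is declared as in the program above; FIXED_MESSAGE and change_message() are hypothetical names added only for comparison.

```c
#include <assert.h>
#include <string.h>

/* Pointer to constant characters: the characters cannot be modified
 * through it, but the pointer itself is an ordinary writable variable. */
const char *MESSAGE = "Hello, world!\n";

/* Hypothetical variant: constant pointer to constant characters.
 * Neither the characters nor the pointer can be modified. */
const char *const FIXED_MESSAGE = "Hello, world!\n";

/* Makes MESSAGE point somewhere else, which is legal. */
void change_message(const char *other)
{
    MESSAGE = other;

    /* FIXED_MESSAGE = other;  would not compile: the pointer is const */
    /* MESSAGE[0] = 'J';       would not compile: the characters are const */
}
```

Because MESSAGE can be reassigned at run time, the variable has to live in writable memory, whereas an implementation would typically be free to place something like FIXED_MESSAGE in read-only memory.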
Object file
After compiling the source code file, the compiler will emit a relocatable object file:
This object file contains the following data:
- Three sections:
  - .text contains the machine instructions of the main() function in an executable section;
  - .rodata contains the constant string Hello, world!\n in a read-only section;
  - .data contains the variable MESSAGE in a writable section.
- Three defined symbols:
  - main is located inside the .text section;
  - $LC0 for the string Hello, world!\n is located inside the .rodata section;
  - MESSAGE is located inside the .data section.
- One undefined symbol:
  - write is used by the main function.
- Three relocations:
  - A relocation within the .data section to fix up MESSAGE with the address of $LC0;
  - A relocation within the .text section to fix up the address of MESSAGE;
  - A relocation within the .text section to fix up the address of write.
The symbols (highlighted in blue) do not have addresses at this stage because the sections that contain them haven’t been laid out in memory yet. Therefore, any references to symbols in object files are annotated with a relocation (highlighted in red), a request to patch in the final address of a given symbol at a later time. The linker will fill in these relocations after laying out the sections in memory, which assigns an address to every symbol in the object file.
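As a rough illustration, the symbols and relocations listed above can be written down as two small tables in C. The table layout and the names SymbolEntry, RelocationEntry, find_symbol() and count_unresolved_relocations() are hypothetical; offsets and section contents are left out on purpose, since they depend on the imaginary implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *section; /* defining section, or NULL if undefined */
} SymbolEntry;

typedef struct {
    const char *section; /* section containing the spot to fix up */
    const char *target;  /* symbol whose address is requested */
} RelocationEntry;

/* The three defined symbols and the undefined one. */
static const SymbolEntry SYMBOLS[] = {
    {"main", ".text"},
    {"$LC0", ".rodata"},
    {"MESSAGE", ".data"},
    {"write", NULL},
};

/* The three relocations. */
static const RelocationEntry RELOCATIONS[] = {
    {".data", "$LC0"},
    {".text", "MESSAGE"},
    {".text", "write"},
};

#define COUNT_OF(array) (sizeof(array) / sizeof((array)[0]))

/* Looks up a symbol by name; returns NULL if the table has no such entry. */
static const SymbolEntry *find_symbol(const char *name)
{
    for (size_t i = 0; i < COUNT_OF(SYMBOLS); i++) {
        if (strcmp(SYMBOLS[i].name, name) == 0) {
            return &SYMBOLS[i];
        }
    }
    return NULL;
}

/* Counts the relocations that cannot be resolved from this object file
 * alone, because their target symbol is missing or undefined. */
static size_t count_unresolved_relocations(void)
{
    size_t count = 0;

    for (size_t i = 0; i < COUNT_OF(RELOCATIONS); i++) {
        const SymbolEntry *sym = find_symbol(RELOCATIONS[i].target);

        if (sym == NULL || sym->section == NULL) {
            count++;
        }
    }
    return count;
}
```

In this sketch, the only relocation that cannot be resolved from the object file alone is the one targeting write, since that symbol is used by main but not provided here.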
Executable file
After linking the object file, the linker will emit an executable file:
Once linked, the executable still contains the three sections .text, .rodata and .data from the relocatable object file, but laid out in memory: every byte in a section now has an address.
Since the relocations have been applied (highlighted in green) and are no longer useful, they are not included in the executable.
Similarly, the symbols are no longer needed and can be omitted too, but were kept in this example for clarity.
Given its fixed memory layout and lack of relocations, the executable file is not relocatable. Moving sections around would break the addresses embedded within the executable code and data.
Conclusion
We have learned how a toolchain works and how a programmer can use one to create an executable file from a set of source code files. Next time, we will start our case study by building a program that we will then reverse-engineer.