Exploring ELF files interactively
Introduction
ELF stands for Executable and Linkable Format. That is, it is used to define the structure and shape of two types of files:
- Executable files
- Linkable files
Executables
Put simply, executables are files that consist of code and data. They are intended to be loaded by an operating system so that their code can be executed by a processing unit. For instance, a desktop application is a collection of one or more executable files, which contain the application's code and data. Executing an executable is often a two-phase process.
- The first phase is called loading which is a multi-staged action by itself. Here are the stages:
  - Creation of a new process.
  - Allocation of a sufficient amount of memory.
  - Copying of the executable's code and data from the file to the memory space of the new process.
  - Resolving the dependencies of the executable.
  - Initialization of the executable.
- The second phase is called execution. The operating system creates a new execution unit (often called a thread), which starts executing the executable's code from a specific point called the entry point.
Now, for loading to occur, the loader must be able to find every piece of information needed by the loading sequence described above. To achieve that, there must be an agreement between the file to be executed and the OS loader. That is, the information (code, data, dependencies, etc.) has to be organized in a certain way known to the loader. A method of organizing information in a file is called a format. ELF is such a format.
Linkables
Programming languages such as C and C++ were designed to support modularity. A C/C++ program can be divided into distinct units called objects, represented as (*.o) files. The separation into objects is accomplished by dividing the source code into separate translation units, represented as (*.c/*.cpp) files. These objects are later linked together into a single functioning executable in a process called linkage, performed by a program called the linker. Like executables, object files organize their data in a certain way, known to the linker, so that it can extract the information needed for the linkage. ELF is used as the format of object files.
Executables and linkables are the two faces of the ELF format. ELF is used mostly by Unix- and Linux-based operating systems across various architectures.
In this article, x86_64 ELF executables are explored interactively using elfy, a web-based ELF editor. The syntax and semantics of the format are demonstrated by patching a sample ELF file.
Preparations
For the upcoming demonstration of the loading process and of the format itself, the following program, elf.c, will be used:
#include <stdio.h>
#include <stdlib.h>
int stage = 0;
__attribute__((constructor))
void initializer() {
    stage += 1;
    printf("initializer function (stage: %d) \n", stage);
}
__attribute__((destructor))
void finalizer() {
    stage += 1;
    printf("finalizer function (stage: %d) \n", stage);
}
int main(void) {
    stage += 1;
    printf("main function (stage: %d) \n", stage);
    return 0;
}
void alternativeEntryPoint() {
    stage += 1;
    printf("This is an alternative entry point to the program (stage: %d) \n", stage);
    exit(0);
}
The above is compiled by GCC using the following command:
$ gcc -lc -O0 elf.c -o elf.out
The output executable is elf.out. Execution of elf.out gives an output of the following form (Parts of it might vary):
$ ./elf.out
initializer function (stage: 1)
main function (stage: 2)
finalizer function (stage: 3)
In order to study elf.out with elfy, it has to be loaded first by clicking the Open button.
Fundamental types
As a format, ELF is basically a collection of smaller data structures called headers. A header is a contiguous sequence of simpler (primitive) data types: integers. There are different kinds of headers, and each kind describes a different type of information. Together, they expose the shape and structure of the file to execute.
Primitive data types
ELF defines several primitive types:
typedef uint16_t Elf64_Half;    /* General purpose 16 bit unsigned integer */
typedef uint32_t Elf64_Word;    /* General purpose 32 bit unsigned integer */
typedef int32_t  Elf64_Sword;   /* General purpose 32 bit signed integer */
typedef uint64_t Elf64_Xword;   /* General purpose 64 bit unsigned integer */
typedef int64_t  Elf64_Sxword;  /* General purpose 64 bit signed integer */
typedef uint64_t Elf64_Addr;    /* Memory address: 64 bit unsigned integer */
typedef uint64_t Elf64_Off;     /* Offset: 64 bit unsigned integer */
The primitives are simply typedefs of stdint.h's type definitions. Some of them are general purpose types while others are dedicated for specific types of data. Any other structure defined by ELF (Header) is a composition of the primitives described above.
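The widths listed above can be checked directly against the system header. The following is a minimal sketch, assuming a Linux system where <elf.h> (shipped with glibc) is available; the function name elf64_primitive_widths_ok is made up for this illustration.

```c
#include <elf.h>     /* Elf64_* typedefs; assumed to be the glibc/Linux header */
#include <stddef.h>

/* Returns 1 when every Elf64 primitive has the width listed in the table,
 * 0 otherwise. The function name is hypothetical. */
int elf64_primitive_widths_ok(void) {
    return sizeof(Elf64_Half)   == 2 &&
           sizeof(Elf64_Word)   == 4 &&
           sizeof(Elf64_Sword)  == 4 &&
           sizeof(Elf64_Xword)  == 8 &&
           sizeof(Elf64_Sxword) == 8 &&
           sizeof(Elf64_Addr)   == 8 &&
           sizeof(Elf64_Off)    == 8;
}
```

Because the types are fixed-width, the result should not depend on whether the host itself is a 32 bit or a 64 bit machine.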
ELF Header
One of the headers defined by ELF is the primary one. The primary header serves as a main map to every other part of the format. Sure enough, its name is the ELF Header. The header and its fields can be shown by clicking on the "Elf Header" button in the main menu. A hexadecimal representation of the binary data can be shown by clicking on the Elf_Ehdr structure itself.
The structure type used to describe a 64 bit ELF header is the following:
typedef struct
{
    unsigned char e_ident[EI_NIDENT];
    Elf64_Half    e_type;       /* Object file type */
    Elf64_Half    e_machine;    /* Architecture */
    Elf64_Word    e_version;    /* Object file version */
    Elf64_Addr    e_entry;      /* Entry point virtual address */
    Elf64_Off     e_phoff;      /* Program header table file offset */
    Elf64_Off     e_shoff;      /* Section header table file offset */
    Elf64_Word    e_flags;      /* Processor-specific flags */
    Elf64_Half    e_ehsize;     /* ELF header size in bytes */
    Elf64_Half    e_phentsize;  /* Program header table entry size */
    Elf64_Half    e_phnum;      /* Program header table entry count */
    Elf64_Half    e_shentsize;  /* Section header table entry size */
    Elf64_Half    e_shnum;      /* Section header table entry count */
    Elf64_Half    e_shstrndx;   /* Section header string table index */
} Elf64_Ehdr;
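When reading the hexadecimal view, it helps to know at which byte each field starts. The sketch below derives a few of these positions from the structure itself, assuming <elf.h> from glibc is available; the helper names are invented for this example.

```c
#include <elf.h>     /* Elf64_Ehdr; assumed to be the glibc/Linux header */
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical helpers: each returns a position inside the ELF header. */
size_t ehdr_size(void)         { return sizeof(Elf64_Ehdr); }
size_t ehdr_type_offset(void)  { return offsetof(Elf64_Ehdr, e_type); }
size_t ehdr_entry_offset(void) { return offsetof(Elf64_Ehdr, e_entry); }
size_t ehdr_phoff_offset(void) { return offsetof(Elf64_Ehdr, e_phoff); }
size_t ehdr_shoff_offset(void) { return offsetof(Elf64_Ehdr, e_shoff); }
```

With the 64 bit layout shown above, the header occupies the first 64 bytes of the file, e_type starts right after the 16 bytes of e_ident, and e_entry, e_phoff and e_shoff start at bytes 24, 32 and 40.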
ELF Identity
The first field, e_ident, is an array of 16 bytes used to identify the file. The first 4 bytes of e_ident, which are the very first 4 bytes of the file itself, are the famous sequence (0x7f, 'E', 'L', 'F') called the ELF magic bytes. The ELF magic bytes mark the file as an ELF file and let the OS identify it as one. The 5 subsequent bytes encode general information about the file.
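A magic-byte check of this kind can be sketched as follows. This only illustrates the idea and is not how any particular loader implements it; the name has_elf_magic is an assumption of this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with the ELF magic bytes 0x7f 'E' 'L' 'F',
 * 0 otherwise (including when the buffer is shorter than 4 bytes).
 * The function name is hypothetical. */
int has_elf_magic(const unsigned char *buf, size_t len) {
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

Passing the first bytes of elf.out to has_elf_magic should yield 1, while a text file or a truncated buffer yields 0.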
e_ident[EI_CLASS]
As mentioned earlier, ELF supports multiple architectures. In particular, ELF supports 32 bit and 64 bit architectures. In order to support both of them, every structure defined by ELF is actually duplicated: there is a 32 bit version and a 64 bit version of every structure. The 5th byte (e_ident[EI_CLASS], at index 4) determines whether the file is formatted using the 32 bit versions of the structures or the 64 bit ones. It does so by attributing the file to one of two classes: {ELFCLASS32, ELFCLASS64} for 32 bit and 64 bit respectively. elf.out belongs to the 64 bit class.
e_ident[EI_DATA]
This byte indicates the endianness of the processor-specific data found in the file, that is, the order in which the bytes of a numeric value are laid out in memory. elf.out, which is built for the x86_64 architecture, uses the little-endian method. In little-endian, less significant bytes occupy lower addresses, where the significance of a byte is determined by its contribution to the overall numeric value. Therefore, ELFDATA2LSB (LSB for Least Significant Byte) is used rather than ELFDATA2MSB (MSB for Most Significant Byte).
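To make the difference concrete, here is a small sketch that decodes the same 8 bytes under both conventions. The function names are invented for this illustration and are not part of any ELF library.

```c
#include <stdint.h>

/* Decode 8 bytes stored least significant byte first (the ELFDATA2LSB order). */
uint64_t read_u64_lsb(const unsigned char *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)   /* start from the highest address */
        v = (v << 8) | p[i];
    return v;
}

/* Decode 8 bytes stored most significant byte first (the ELFDATA2MSB order). */
uint64_t read_u64_msb(const unsigned char *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)    /* start from the lowest address */
        v = (v << 8) | p[i];
    return v;
}
```

For the byte sequence 34 12 00 00 00 00 00 00, the little-endian reading is 0x1234, whereas the big-endian reading is 0x3412000000000000.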
e_ident[EI_VERSION]
ELF defines a single version: version 1 (EV_CURRENT). Extensions to the format may use version numbers higher than 1.
e_ident[EI_OSABI]
Once again, because ELF supports multiple platforms, the file has to state which platform conventions it follows. One such specification is the file's ABI. ABI stands for Application Binary Interface. An ABI includes function calling conventions, the OS system call interface, and more. Basically, an ABI specifies the way in which two pieces of a program should communicate with each other.
e_ident[EI_ABIVERSION]
A complementary field to the previous one. It specifies an ABI version.
The remaining 7 bytes are padding. They are currently unused and reserved for future or specific use.
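The identity bytes described above can be decoded with a few lookups. The following is a rough sketch that relies on the EI_* and ELF* constants from <elf.h> (assumed to be the glibc header); the function names are hypothetical.

```c
#include <elf.h>  /* EI_CLASS, EI_DATA, EI_PAD, EI_NIDENT, ELFCLASS*, ELFDATA2* */

/* Human-readable name of the file class stored in e_ident[EI_CLASS]. */
const char *elf_class_name(const unsigned char *e_ident) {
    switch (e_ident[EI_CLASS]) {
    case ELFCLASS32: return "32 bit";
    case ELFCLASS64: return "64 bit";
    default:         return "invalid";
    }
}

/* Human-readable name of the byte order stored in e_ident[EI_DATA]. */
const char *elf_data_name(const unsigned char *e_ident) {
    switch (e_ident[EI_DATA]) {
    case ELFDATA2LSB: return "little-endian";
    case ELFDATA2MSB: return "big-endian";
    default:          return "invalid";
    }
}

/* Number of reserved padding bytes at the end of e_ident. */
int elf_ident_padding(void) {
    return EI_NIDENT - EI_PAD;
}
```

For elf.out, the first two helpers should report "64 bit" and "little-endian". The padding count follows from the indices alone: the meaningful bytes occupy indices 0 to 8, which leaves 7 of the 16 bytes unused.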
Target machine
e_machine describes the platform the executable was compiled for. elf.out is compiled for x86_64; hence, the value of e_machine is the constant EM_X86_64, which is 0x3E (62 in decimal). There is a constant dedicated to each supported platform.
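As an illustration, a parser might translate e_machine into a readable name. The sketch below covers only a handful of the constants defined in <elf.h> (assumed to be the glibc header), and the function name machine_name is made up for this example.

```c
#include <elf.h>     /* EM_* constants */
#include <stdint.h>

/* Maps a few well-known e_machine values to names; everything else is
 * reported as "unknown". The selection of platforms here is arbitrary. */
const char *machine_name(uint16_t e_machine) {
    switch (e_machine) {
    case EM_386:     return "x86";
    case EM_ARM:     return "ARM";
    case EM_X86_64:  return "x86_64";
    case EM_AARCH64: return "AArch64";
    default:         return "unknown";
    }
}
```

Calling machine_name with the e_machine value of elf.out (0x3E) should return "x86_64".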
Version
e_version is a duplication of e_ident[EI_VERSION]. Their values are the same and so is their meaning.
Entry point
Executables contain code. The contained code is imperative machine code, that is, a sequence of instructions intended to be executed by a processing unit one by one. The CPU executes code from a dedicated memory (RAM). It does so by repeating a routine of the following form:
- Fetch - Read the next instruction from memory
- Decode - Parse the instruction
- Execute - Execute the instruction
- Repeat
This is the well-known Fetch-Decode-Execute routine. A processing unit that uses such a routine has to constantly track the memory address of the next instruction to fetch. Generally speaking, modern processing units use a dedicated register named the Program Counter to achieve that. The Program Counter stores the memory address of the next instruction to be fetched. After an executable is loaded, its code occupies a region of the memory dedicated to execution (referred to as the CPU's memory in this article), and every instruction has a unique memory address. Now, every executable has an instruction that is used as a starting point for the Fetch-Decode-Execute routine. e_entry is the address of that instruction, the starting point of the program.
The rest of the (significant) fields in the ELF Header describe other headers found in the file. As mentioned earlier, the ELF Header is a main map which instructs a parser how to find the needed information scattered over the file. The ELF Header describes two types of headers: Section headers and Program headers.
Section headers
e_shoff is an offset to an array of headers called Section Headers. There are e_shnum headers in that array. Each header is e_shentsize bytes long. As one would expect, section headers describe sections. Sections are a logical partitioning of an ELF file's data. Sections are used mostly to partition the data of object files (linkables). They are not needed for loading an executable, and even though they exist in executables, a loader may ignore them. Executables are partitioned differently, by using Segments. Segments are described by Program headers, which will be explored later. By clicking on "Section Headers" in the left main menu, one can see the sections found in elf.out.
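Since the section headers form a plain array, the file offset of any single header follows from the three fields just mentioned. A minimal sketch, assuming <elf.h> from glibc; the function name and the convention of returning 0 for an out-of-range index are choices made for this example.

```c
#include <elf.h>     /* Elf64_Ehdr */
#include <stdint.h>

/* File offset of section header number `index`, computed as
 * e_shoff + index * e_shentsize. Returns 0 when the index is out of range
 * (a hypothetical convention; offset 0 always holds the ELF header itself). */
uint64_t section_header_offset(const Elf64_Ehdr *eh, uint16_t index) {
    if (index >= eh->e_shnum)
        return 0;
    return eh->e_shoff + (uint64_t)index * eh->e_shentsize;
}
```

For example, with e_shoff = 0x1000 and e_shentsize = 64, the third header (index 2) starts at file offset 0x1080.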
Program headers
e_phoff is an offset to an array of headers called Program headers. There are e_phnum headers in that array. Each header is e_phentsize bytes long. Program headers describe segments. Segments form a logical partitioning of an executable's data. As opposed to Section headers, Program headers do have a purpose in the context of executables and are therefore discussed in detail. elf.out's Program headers can be seen by clicking on the "Program Headers" button in the left main menu.
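Before walking the program header array, a parser would typically verify that the whole table lies inside the file. The sketch below shows one way to write such a check; it assumes <elf.h> from glibc, and the function name is hypothetical.

```c
#include <elf.h>     /* Elf64_Ehdr */
#include <stdint.h>

/* Returns 1 if the program header table described by e_phoff, e_phnum and
 * e_phentsize fits entirely within a file of `file_size` bytes, else 0.
 * The subtraction form avoids overflow in e_phoff + table_size. */
int program_headers_fit(const Elf64_Ehdr *eh, uint64_t file_size) {
    uint64_t table_size = (uint64_t)eh->e_phnum * eh->e_phentsize;
    return eh->e_phoff <= file_size && table_size <= file_size - eh->e_phoff;
}
```

For instance, a table of 9 entries of 56 bytes each that starts at offset 64 ends at byte 568, so it fits in any file of at least 568 bytes.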
Patching the ELF Header
In this section, the semantics of the ELF header are demonstrated by patching its fields and observing the outcomes.
* The constants used in this section might be different for you. In that case, use elfy to figure them out on your own.
Patching the entry point
As explained earlier, the entry point is the memory address at which the first instruction to execute is found. The entry point of elf.out is 0x5b0 (*). Note that 0x5b0 (*) is the memory address of a function called "_start". The function _start can be observed as a symbol in the Symbol Table (go to "Symbols", choose .symtab and then filter by 5b0 (*)). The Symbol Table is a data structure whose syntax and semantics are covered later in this article (Dynamic link). For now, it is enough to say that a symbol is a named chunk of bytes that represents code or data.
Using elfy, the value of e_entry can be patched to point to a different location. For this experiment, we choose the function alternativeEntryPoint as an alternative (see elf.out's source code under Preparations). In order to determine the memory address of alternativeEntryPoint, the Symbol Table is used again (go to "Symbols", choose .symtab and then filter by alternativeEntryPoint). Now it's easy to see that the memory address of alternativeEntryPoint is 0x74b (*). The next step is changing the value of e_entry to 0x74b (*). In order to do so, go to "ELF Header", click on Elf_Ehdr and then on the "edit" button on the right. Note that the bytes are ordered according to the little-endian scheme. In the initial state, the 8 bytes of e_entry should read b0 05 00 00 00 00 00 00 (*). In the final state, they should read 4b 07 00 00 00 00 00 00 (*).
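The same patch can be expressed in code. The sketch below overwrites e_entry in a 64 bit little-endian ELF image held in memory; it is an illustration and not how elfy works internally. The function name is hypothetical, and the offset 24 follows from the Elf64_Ehdr layout shown earlier (16 bytes of e_ident, then 2 + 2 + 4 bytes).

```c
#include <stddef.h>
#include <stdint.h>

enum {
    E_ENTRY_OFFSET = 24,  /* position of e_entry inside Elf64_Ehdr */
    E_ENTRY_SIZE   = 8    /* e_entry is an Elf64_Addr: 8 bytes */
};

/* Writes new_entry into the e_entry field, least significant byte first.
 * Returns 0 on success, -1 if the image is too short to hold the field. */
int patch_entry_lsb(unsigned char *image, size_t len, uint64_t new_entry) {
    if (len < E_ENTRY_OFFSET + E_ENTRY_SIZE)
        return -1;
    for (int i = 0; i < E_ENTRY_SIZE; i++)
        image[E_ENTRY_OFFSET + i] = (unsigned char)(new_entry >> (8 * i));
    return 0;
}
```

Applied to an image of elf.out with new_entry set to 0x74b (*), this writes the bytes 4b 07 followed by six zero bytes at offset 24, which matches the edit made by hand in elfy.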
Finally, the outcomes can be observed by downloading and executing the new patched elf.out. Click on the Save button on the left menu, then execute the file. The results should be the following:
$ ./elf.out
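This is an alternative entry point to the program (stage: 1)
We have changed elf.out's entry point successfully. Note that neither the initializer nor main appears in the output: both are normally reached through _start, which the patched entry point bypasses, and alternativeEntryPoint terminates the process by calling exit.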