Angel Bandres

Posted on Apr 2

How I built a binary ELF/PE analyzer (Phase 2)

#lowlevel #reversing #cybersecurity #python

Introduction and Motivation

If you've read my previous post on Phase 1, you probably already have an idea of how this project has evolved. If you haven't, let me explain briefly.

I've been wanting to learn about Cybersecurity, Reversing and Low-level development, and this is the very first step, the very tip of the iceberg if you will; to have some fun and hopefully land a job in this vast and fascinating area. I've named this project "Binalyzer" as a spiritual successor to my first junior Full Stack project "Textalyzer" (completely vibecoded and barely knowing a thing or two about HTML).

This time around, I wanted to do things on my own, with little to no use of AI, so I can properly learn everything relevant to this project (namely, binary file structure and Python). I've decided to do this in Python since I didn't know much about the language and development is overall much faster than in C/C++. Also, it's a surprisingly powerful and expressive language and there's always more to learn about it. The documentation is interesting and relatively easy to digest too.

Now, what is Binalyzer?

As the name might imply, it's simply a CLI binary file analyzer. As of right now, you need Python to run it, and the command to run it is:

python main.py filepath_to_bin -a

What's with all these phases?

Each phase represents a new functionality of the script. Phase 1 involved basic detection of the file type (ELF or PE, for the time being). Now, phase 2 involves header parsing of these files, which I'm very proud of.

Now now, what are PE and ELF? Do ELF files have ears? Is PE the Pocket Edition of Minecraft?

Answering the last two questions: no, unfortunately. But Binalyzer kinda "hears" the inner structure and, since you can run Python on your phone, it's technically "pocketable" ig.

Damn... what is ELF then?

According to the Linux manpages (which I recommend reading for further information, though not as your only source), it is a:

format of Executable and Linking Format files

This structure is fairly straightforward, defined almost entirely by unsigned integers (size depending on the architecture), and maybe a few arrays here and there.

Alright then, what about PE?

PE stands for Portable Executable. And that's all you need to know.

Both files are dynamic in size (shocking!) and are structured in very different ways (what?!), but PE is more complicated (even more than I can think of).

How can you tell which one is which?

ELF

Check the first 4 bytes of the binary. It's a magic number. If it's "\x7F E L F" (the byte 0x7F followed by the ASCII letters E, L and F), then it's an ELF file.

PE

Check the first 2 bytes. If they're "MZ", then the file starts with a DOS header. This is done to preserve compatibility with older DOS programs. Then, read bytes 61 to 64 (indices 60 to 63, i.e. offset 0x3C): they hold a 4-byte little-endian offset that points to the signature. If the 4 bytes at that offset are "P E \0 \0", then it's a valid PE file.
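Here's a minimal sketch of both checks, not necessarily how Binalyzer does it. The function name `detect_format` and the constants are made up for illustration, and it assumes the whole file has already been read into a `bytes` object. Note that in a PE file the bytes at offset 0x3C store the location of the signature, not the signature itself.

```python
import struct

ELF_MAGIC = b"\x7fELF"
PE_SIGNATURE = b"PE\x00\x00"


def detect_format(data: bytes) -> str:
    """Return "ELF", "PE" or "unknown" (hypothetical helper)."""
    if data[:4] == ELF_MAGIC:
        return "ELF"
    if data[:2] == b"MZ" and len(data) >= 64:
        # 4-byte little-endian offset stored at indices 60..63 (0x3C)
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == PE_SIGNATURE:
            return "PE"
    return "unknown"
```

A file that starts with "MZ" but has no "PE\0\0" at the stored offset comes back as "unknown" here; a real tool might want to report it as a plain DOS executable instead.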

Thought: there are many other types of binaries and I assume that each and every one of them has a different way of being read. This may be important if I want to scale my detection capabilities to other types of binaries.

Alright. Now, tell me more about the structure of these "headers".

ELF Header

Made of

e_ident: a 16-byte (unsigned char) array that specifies how the file must be interpreted.
e_type: Object file type
e_machine: Required architecture of an individual file
e_version: File version
e_entry: Entry point. If you know a thing or two about operating systems and files, it's the virtual address to which the system first transfers control when it starts the process.
e_phoff: Program Header table OFFset (address)
e_shoff: Section Header table OFFset
e_flags: processor-specific flags associated with the file
e_ehsize: ELF Header SIZE (in bytes)
e_phentsize: Program Header table ENTry SIZE
e_phnum: Program Header table's NUMber of entries
e_shentsize: Section Header table ENTry size
e_shnum: Section Header table's NUMber of entries
e_shstrndx: Section Header table's section name STRing table iNDeX. Basically, the section names are stored in a string table that lives in its own section, and this field is the index of that section's entry in the section header table.

PE header

This one is more complicated, as it contains multiple "sub-headers". Fortunately, the names of these fields within these sub-headers are clearer. Here are some of the sub-headers that I've managed to parse as of right now in phase 2:

File Header

Contains static information (metadata) recorded when the binary was compiled. These fields are:

Machine: The type of CPU the binary was made for.
NumberOfSections
TimeDateStamp
PointerToSymbolTable
NumberOfSymbols
SizeOfOptionalHeader
Characteristics: flags in hex that determine additional aspects of the binary
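As a rough sketch of how these seven fields could be read in Python: the `FileHeader` namedtuple and the `unpack_file_header` helper are names invented for this example, and `pe_offset` is assumed to be the position of the "PE\0\0" signature found during detection. The File Header starts right after those 4 signature bytes, and the format string "<HHIIIHH" means little endian, two 2-byte fields, three 4-byte fields, then two more 2-byte fields.

```python
import struct
from collections import namedtuple

FileHeader = namedtuple(
    "FileHeader",
    "Machine NumberOfSections TimeDateStamp PointerToSymbolTable "
    "NumberOfSymbols SizeOfOptionalHeader Characteristics",
)


def unpack_file_header(data: bytes, pe_offset: int) -> FileHeader:
    """Read the 20-byte File Header that follows the 4-byte PE signature."""
    return FileHeader._make(struct.unpack_from("<HHIIIHH", data, pe_offset + 4))
```

Using a namedtuple keeps the official field names attached to the values, so `header.Machine` reads the same as the spec.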

Optional Header

This header is not fixed in size per se, since it holds the information the system needs to load the binary into RAM and run it. However, it starts with some standard fields that are nearly the same in every PE file:

Standard Fields

Magic: The architecture of the binary, might be 32-bit (0x10b) or 64-bit (0x20b).
MajorLinkerVersion and MinorLinkerVersion
SizeOfCode (bytes)
SizeOfInitializedData
SizeOfUninitializedData
AddressOfEntryPoint
BaseOfCode (address): Start of the code section when the file is loaded into memory.
BaseOfData (address, 32-bit Magic only): Same as BaseOfCode, but for the data section.

These fields are based on Windows integer types called BYTEs (1 byte, used for the linker versions), WORDs (2 bytes) and DWORDs (4 bytes).
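In Python's struct module these sizes correspond to the format codes B, H and I when a byte-order prefix such as "<" is used. A quick way to convince yourself (the byte string below is just a hand-made example, not taken from a real file):

```python
import struct

# Size in bytes of each format code when using the little-endian prefix
SIZES = {code: struct.calcsize("<" + code) for code in "BHI"}
print(SIZES)  # {'B': 1, 'H': 2, 'I': 4}

# A WORD stored as the bytes 0x0b 0x02 decodes to 0x20b in little endian
(magic,) = struct.unpack("<H", b"\x0b\x02")
print(hex(magic))  # 0x20b
```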

How did you implement this in Python?

Now we're talking. I first took the file and read the header as a string of bytes. For both file types, I used the struct module and its function struct.unpack(), with a specific format string depending on the architecture and endianness:

ELF Header Unpacking String Formats

32-bit, little endian: "<16BHHIIIIIHHHHHH"
32-bit, big endian: ">16BHHIIIIIHHHHHH"
64-bit, little endian: "<16BHHIQQQIHHHHHH"
64-bit, big endian: ">16BHHIQQQIHHHHHH"
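To show what one of these format strings does in practice, here's a small sketch. The `ElfHeader` namedtuple and `unpack_elf_header` are hypothetical names, not Binalyzer's actual code. One detail worth knowing: "16B" gives you 16 separate integers, so the sketch regroups them into a single e_ident value.

```python
import struct
from collections import namedtuple

ElfHeader = namedtuple(
    "ElfHeader",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)


def unpack_elf_header(data: bytes, fmt: str) -> ElfHeader:
    """Unpack an ELF header using one of the four format strings above."""
    size = struct.calcsize(fmt)  # 52 bytes for 32-bit, 64 bytes for 64-bit
    values = struct.unpack(fmt, data[:size])
    # The first 16 integers are the e_ident bytes; join them back together
    return ElfHeader(bytes(values[:16]), *values[16:])
```

If `data` is shorter than the header, struct.unpack raises struct.error, which is a handy signal that the file is truncated.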

Note: Since I treated the header as a string of bytes, I had some trouble with the indices when determining the architecture and endianness, so make sure you're using the right ones! Here are the ones I used:

architecture = header[4] # fifth byte (EI_CLASS): 1 = 32-bit, 2 = 64-bit
endianness = header[5] # sixth byte (EI_DATA): 1 = little endian, 2 = big endian
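Putting those two bytes to work, picking the right format string could look roughly like this. The helper `elf_format` is an illustrative name, and it assumes `header` is a `bytes` object, so indexing returns an integer. In the ELF spec, a class value of 1 means 32-bit and 2 means 64-bit, while a data value of 1 means little endian and 2 means big endian.

```python
def elf_format(header: bytes) -> str:
    """Choose the struct format string from the e_ident bytes (sketch)."""
    ei_class = header[4]  # 1 = 32-bit, 2 = 64-bit
    ei_data = header[5]   # 1 = little endian, 2 = big endian
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("invalid class or data encoding in e_ident")
    prefix = "<" if ei_data == 1 else ">"
    body = "16BHHIIIIIHHHHHH" if ei_class == 1 else "16BHHIQQQIHHHHHH"
    return prefix + body
```

Rejecting unknown values up front avoids silently parsing a corrupted header with the wrong layout.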

PE Header Unpacking String Formats

They are all little endian.

File Header: architecture-independent

"<HHIIIHH"

Optional Header: Magic-dependent

32-bit: "<HBBIIIIII"
64-bit: "<HBBIIIII"

32-bit uses an additional field for BaseOfData.
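A possible way to handle the Magic-dependent layout is to peek at the first 2 bytes and then unpack with the matching format. This is only a sketch: `unpack_standard_fields` and the constant names are made up here, and `offset` is assumed to be where the Optional Header begins (right after the 20-byte File Header).

```python
import struct

PE32_MAGIC, PE32_PLUS_MAGIC = 0x10B, 0x20B
STANDARD_FIELDS = [
    "Magic", "MajorLinkerVersion", "MinorLinkerVersion", "SizeOfCode",
    "SizeOfInitializedData", "SizeOfUninitializedData",
    "AddressOfEntryPoint", "BaseOfCode",
]


def unpack_standard_fields(data: bytes, offset: int) -> dict:
    """Read the Optional Header standard fields for PE32 or PE32+."""
    (magic,) = struct.unpack_from("<H", data, offset)
    if magic == PE32_MAGIC:
        fmt, names = "<HBBIIIIII", STANDARD_FIELDS + ["BaseOfData"]
    elif magic == PE32_PLUS_MAGIC:
        fmt, names = "<HBBIIIII", STANDARD_FIELDS
    else:
        raise ValueError(f"unexpected Optional Header magic: {magic:#x}")
    return dict(zip(names, struct.unpack_from(fmt, data, offset)))
```

With this approach the 64-bit result simply has no "BaseOfData" key, which mirrors the fact that the field does not exist in that layout.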

To whom it may concern, this is the structure of the project, pretty self-explanatory tbh.

No pics no clicks! Where are the examples?

Here they are!

PE: Windows Notepad

ELF: ls (Linux)

Compared with the output of a standard tool, readelf -h, the resulting values are the same:

And now what?

I will be very soon starting development on phase 3, which will include a listing of the sections of the binary, so stay tuned for more!

If you want to check out the project, it's on GitHub

I am open to any questions, comments, suggestions and constructive criticism.

DEV Community