Travis Matthews

Understanding the Mach-O File Format

When code is compiled for systems built on the Mach kernel (macOS, iOS, etc.), the result is organized using the Mach object (Mach-O) file format. An executable format determines the order in which code and data in a binary file are read into memory. Files in this format include compiled executables as well as object files (.o), dynamic libraries (.dylib), and bundles (.bundle).
A Mach-O file consists of three major regions: a header, load commands, and segments. Each segment contains zero or more sections, and each section holds code or data of a particular type. Segments start on page boundaries; sections are not necessarily page-aligned. By convention, segment names are uppercase and prefixed with two underscores (e.g., __TEXT), while section names are lowercase and prefixed with two underscores (e.g., __text). For paging purposes, the header and load commands are considered part of the first segment that contains data. In an executable, this means they live at the start of the __TEXT segment, because __PAGEZERO contains no data and is neither readable nor writable.
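To make the page-boundary rule concrete, here is a minimal sketch of the rounding arithmetic. The helper name `round_up_to_page` is hypothetical, and the page size is a parameter because it varies by target (4096 is the value used in this post).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round a size or address up to the next page boundary.
   page_size must be a non-zero power of two. */
uint64_t round_up_to_page(uint64_t value, uint64_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (value + page_size - 1) & ~(page_size - 1);
}
```

With 4096-byte pages, a segment holding 5000 bytes of content would end at offset 8192, so the next segment can start there.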

Header

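A structure at the start of the file that identifies it as a Mach-O file and records general information about it, such as the target CPU, the file type, and the number and total size of the load commands.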

/* 32-bit header as declared in <mach-o/loader.h>;
   struct mach_header_64 adds a trailing uint32_t reserved field. */
struct mach_header {
    uint32_t magic; /* Mach magic number identifier */
    cpu_type_t cputype; /* cpu specifier */
    cpu_subtype_t cpusubtype; /* machine specifier */
    uint32_t filetype; /* type of file */
    uint32_t ncmds; /* number of load commands */
    uint32_t sizeofcmds; /* size of all load commands */
    uint32_t flags; /* flags */
};
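As a rough illustration of how the magic field is used, the sketch below classifies the first four bytes of a file. The magic values are the ones declared in <mach-o/loader.h> and <mach-o/fat.h>, copied under a SKETCH_ prefix so the code compiles on any platform; the enum and the function name `classify_magic` are hypothetical, not part of any Apple API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from <mach-o/loader.h> and <mach-o/fat.h>, renamed for this sketch. */
#define SKETCH_MH_MAGIC    0xfeedfaceu  /* 32-bit, host byte order */
#define SKETCH_MH_CIGAM    0xcefaedfeu  /* 32-bit, byte-swapped */
#define SKETCH_MH_MAGIC_64 0xfeedfacfu  /* 64-bit, host byte order */
#define SKETCH_MH_CIGAM_64 0xcffaedfeu  /* 64-bit, byte-swapped */
#define SKETCH_FAT_MAGIC   0xcafebabeu  /* universal (fat) wrapper */
#define SKETCH_FAT_CIGAM   0xbebafecau  /* universal wrapper, byte-swapped */

enum macho_kind { KIND_NOT_MACHO, KIND_MACHO_32, KIND_MACHO_64, KIND_FAT };

/* Hypothetical helper: inspect the first four bytes of a file image. */
enum macho_kind classify_magic(const unsigned char *buf, size_t len) {
    uint32_t magic;
    assert(buf != NULL);
    if (len < sizeof magic) {
        return KIND_NOT_MACHO;
    }
    memcpy(&magic, buf, sizeof magic);  /* memcpy avoids unaligned reads */
    switch (magic) {
    case SKETCH_MH_MAGIC:
    case SKETCH_MH_CIGAM:
        return KIND_MACHO_32;
    case SKETCH_MH_MAGIC_64:
    case SKETCH_MH_CIGAM_64:
        return KIND_MACHO_64;
    case SKETCH_FAT_MAGIC:
    case SKETCH_FAT_CIGAM:
        return KIND_FAT;  /* same magic as Java class files; real tools check further */
    default:
        return KIND_NOT_MACHO;
    }
}
```

A byte-swapped magic means the file was written for a CPU of the opposite endianness, so every other multi-byte header field would need swapping before use.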

Load Commands:

Variable-size commands that specify the layout and linkage characteristics of the file. They can specify the initial layout of the file in virtual memory, the location of the symbol table, the initial execution state of the main thread, and the names of the shared libraries that contain imported symbols. Load commands found in a typical executable:
__PAGEZERO segment load command
__TEXT segment load command
__DATA segment load command
__LINKEDIT segment load command
DYLD_INFO_ONLY load command: describes the internal structure of the __LINKEDIT segment, giving the size and offset of the symbol export trie and of the bytecode (rebase and bind information) interpreted by the macOS dynamic linker.
SYMTAB load command: locates the symbol table (a list of nlist_64 structs) and the string table. The symbol table is largely vestigial, but the string table is still used.
DYSYMTAB load command: specifies the offset of the indirect symbol table.
LOAD_DYLINKER load command: specifies the path of the dynamic linker (/usr/lib/dyld).
UUID load command: unique identifier for the executable.
VERSION_MIN_MACOSX load command: minimum version of macOS compatible with the executable (e.g., 10.13.0).
SOURCE_VERSION load command: version of the source code used to generate the executable.
MAIN load command: file offset of the program's entry point (the main function).
LOAD_DYLIB load command: one LOAD_DYLIB load command for every library to which the executable is dynamically linked.
FUNCTION_STARTS load command: offset and size of the function starts table. Used by tools to determine whether a given address falls inside a function. The table is a zero-terminated sequence of DWARF-style ULEB128 values. The first value is the offset from the start of the __TEXT segment to the start of the first function. Each remaining value is the offset from the start of the previous function to the start of the next one.
DATA_IN_CODE load command: offset and size of a table that records the locations of certain pieces of data that are inlined in the __TEXT segment.
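The FUNCTION_STARTS encoding described above can be sketched as a small decoder. This is a minimal illustration under the assumption that the raw table bytes have already been read into a buffer; the function names `read_uleb128` and `decode_function_starts` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one ULEB128 value at *pos and advance *pos.
   Returns 1 on success, 0 if the buffer ends mid-value. */
int read_uleb128(const uint8_t *data, size_t len, size_t *pos, uint64_t *out) {
    uint64_t result = 0;
    unsigned shift = 0;
    while (*pos < len) {
        uint8_t byte = data[(*pos)++];
        if (shift < 64) {
            result |= (uint64_t)(byte & 0x7f) << shift;  /* low 7 bits carry data */
        }
        shift += 7;
        if ((byte & 0x80) == 0) {  /* high bit clear marks the last byte */
            *out = result;
            return 1;
        }
    }
    return 0;
}

/* Turn the delta sequence into offsets relative to the start of __TEXT.
   Returns the number of entries written to starts. */
size_t decode_function_starts(const uint8_t *data, size_t len,
                              uint64_t *starts, size_t max_starts) {
    size_t pos = 0;
    size_t count = 0;
    uint64_t offset = 0;
    uint64_t delta = 0;
    assert(data != NULL && starts != NULL);
    while (count < max_starts && read_uleb128(data, len, &pos, &delta)) {
        if (delta == 0) {
            break;  /* zero terminator ends the table */
        }
        offset += delta;  /* each value is relative to the previous function */
        starts[count++] = offset;
    }
    return count;
}
```

For the bytes 0x90 0x1e 0x20 0x80 0x01 0x00, the decoded deltas are 3856, 32, and 128, giving function starts at offsets 3856, 3888, and 4016 from the start of __TEXT.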

Segments

__PAGEZERO: One full VM page (4096 bytes, addresses 0x0 up to 0x1000) located at address 0 with no protection rights assigned, so any access through a C NULL pointer crashes. (On 64-bit targets the linker makes this segment much larger by default.) Because it contains no data, it occupies no space in the file: its file size is 0.
__TEXT Segment: Read-only area containing executable code and constant data. The compiler tools create every executable with at least one read-only __TEXT segment. Because the segment is read-only, the kernel can map it into memory just once and all processes can share it safely (mostly useful for frameworks and shared libraries, but also when the same executable runs multiple times simultaneously). Major sections:
__TEXT,__text: executable machine code.
__TEXT,__stubs and __TEXT,__stub_helper: helpers involved in calls to dynamically linked functions.
__TEXT,__cstring: constant C-style (null-terminated) strings. Duplicate strings are removed by the static linker when building the final file.
__TEXT,__picsymbol_stub: position-independent symbol stubs, which allow the dynamic linker to load a region of code at non-fixed virtual memory addresses.
__TEXT,__symbol_stub: indirect symbol stubs.
__TEXT,__const: initialized constant variables. All nonrelocatable const variables are placed here. Uninitialized constant variables are placed in a zero-filled section.
__TEXT,__literal4: 4-byte literal values, i.e. single-precision floating-point constants.
__TEXT,__literal8: 8-byte literal values, i.e. double-precision floating-point constants. The compiler may use immediate load instructions instead when that is more efficient.
__DATA Segment: Contains writable data. The static linker sets the virtual memory permissions to allow both reading and writing. Because it is writable, the segment is logically copied for each process linking with the library and marked copy-on-write: when a process writes to one of these pages, it receives its own private copy of the page. Major sections:
__DATA,__data: initialized mutable variables.
__DATA,__la_symbol_ptr: lazy symbol pointers, which are indirect references to functions imported from a different file.
__DATA,__dyld: placeholder section used by the dynamic linker.
__DATA,__const: initialized relocatable constant variables.
__DATA,__mod_init_func: module initialization functions (e.g., C++ static constructors).
__DATA,__mod_term_func: module termination functions.
__DATA,__bss: uninitialized static variables (e.g., static int i;).
__DATA,__common: uninitialized imported symbol definitions (e.g., int i;) located in the global scope.
__OBJC Segment: Contains data used by the Objective-C language runtime support library.
__IMPORT Segment: contains symbol stubs and non-lazy pointers to symbols not defined in the executable. Generated only for executables targeting the IA-32 architecture.
__IMPORT,__jump_table: stubs for calls to functions in a dynamic library.
__IMPORT,__pointers: non-lazy symbol pointers, which are direct references to functions imported from a different file.
__LINKEDIT Segment: contains raw data used by the dynamic linker, such as symbol, string, and relocation table entries.
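To tie the section list back to source code, here is a small sketch of C declarations annotated with the section each would typically land in, following the descriptions above. The identifiers are made up for illustration, and actual placement is an assumption that depends on the compiler, its flags, and the target.

```c
/* Section names in the comments follow the list above; actual placement
   depends on the compiler, its flags, and the target. */
int initialized_counter = 42;          /* __DATA,__data: initialized, mutable */
static int private_total;              /* __DATA,__bss: uninitialized static */
int shared_total;                      /* __DATA,__common: uninitialized global */
const int fixed_limit = 10;            /* __TEXT,__const: nonrelocatable const */
const char *const greeting = "hello";  /* pointer: __DATA,__const (relocatable);
                                          string bytes: __TEXT,__cstring */

/* The machine code for this function goes in __TEXT,__text. */
int add_to_total(int amount) {
    private_total += amount;
    shared_total = private_total;
    return private_total + fixed_limit;
}
```

The pointer `greeting` is const but still needs the address of the string filled in at link or load time, which is why it counts as a relocatable constant rather than a __TEXT,__const one.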

Concepts:

Position-Independent Code (PIC): code that can execute at any memory address without modification, so the same library code can be loaded into each program's address space at a location where it does not overlap other uses of memory. This contrasts with absolute code (which must be loaded at a specific location) and load-time locatable (LTL) code (which the linker or loader modifies so that it runs only from a particular memory location).
Indirect addressing: a code generation technique that allows symbols to be defined in a file separate from the files that reference them, so that each can be modified independently. Symbol references come in two types: non-lazy (a symbol pointer resolved by the dynamic linker when the module is loaded) and lazy (resolved the first time the symbol is used: the dynamic linker overwrites the lazy symbol pointer with the address of the function, and subsequent calls jump directly to the definition). A lazy symbol reference is composed of a symbol pointer and a symbol stub, a small piece of code that dereferences and jumps through the symbol pointer. The compiler generates lazy symbol references when it encounters calls to functions defined in other files.
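The lazy binding mechanism can be imitated in plain C with a function pointer. This is only a simulation of the idea, not how dyld is implemented: `real_square` stands in for a function defined in another file, `binding_helper` plays the role of the dynamic linker, and all names are hypothetical.

```c
#include <assert.h>

typedef int (*square_fn)(int);

int resolve_count = 0;  /* how many times the "dynamic linker" ran */

/* Stand-in for a function that lives in a dynamic library. */
int real_square(int x) {
    return x * x;
}

int binding_helper(int x);

/* Lazy symbol pointer: starts out pointing at the binding helper. */
square_fn lazy_square_ptr = binding_helper;

/* Plays the dynamic linker: resolve the symbol, overwrite the lazy
   pointer with the real address, then complete the original call. */
int binding_helper(int x) {
    resolve_count++;
    lazy_square_ptr = real_square;
    return lazy_square_ptr(x);
}

/* Symbol stub: the only thing callers reference. It dereferences the
   lazy pointer and jumps through it. */
int square_stub(int x) {
    assert(lazy_square_ptr != 0);
    return lazy_square_ptr(x);
}
```

The first call to `square_stub` goes through `binding_helper`; every later call reaches `real_square` directly, so the resolution cost is paid once and only for functions that are actually called.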

TODO: finish adrummond notes on (6: Lazy symbol pointer section) to finish
