Introduction to Mach-O Format

The Mach-O (Mach Object) file format is the binary format Apple uses across all its platforms for intermediate build products (such as .o object files) and final build products (such as executable binaries). Understanding it is invaluable for low-level debugging, reverse engineering, and security analysis.

This guide walks through the structure of a Mach-O file: the header, the Load Commands, and the segments and sections they describe. The goal is to make binary-related issues on Apple platforms easier to analyze and troubleshoot.

Overall Architecture

The Mach-O file structure follows a well-defined hierarchical organization. At the very top of every Mach-O file resides the header information.

The header defines fundamental information about the Mach-O file, including the CPU architecture it targets, file type, and various flags controlling its behavior.

The header structure is defined in mach-o/loader.h, which ships with the XNU source code:

struct mach_header_64 {
    uint32_t      magic;        /* mach magic number identifier */
    cpu_type_t    cputype;      /* cpu specifier */
    cpu_subtype_t cpusubtype;   /* machine specifier */
    uint32_t      filetype;     /* type of file */
    uint32_t      ncmds;        /* number of load commands */
    uint32_t      sizeofcmds;   /* the size of all load commands */
    uint32_t      flags;        /* flags */
    uint32_t      reserved;     /* reserved */
};

Header Field Analysis

Magic Number:

The magic field holds one of two constants, MH_MAGIC_64 (0xfeedfacf) or MH_CIGAM_64 (0xcffaedfe), which are byte-swapped versions of each other. A common misconception is that MH_MAGIC_64 means big-endian byte order while MH_CIGAM_64 means little-endian byte order.

In reality, the magic number is written in the file's own byte order, which is normally that of the CPU named by the subsequent cputype field. A reader whose byte order matches the file's sees MH_MAGIC_64. A reader whose byte order is the opposite sees MH_CIGAM_64 and must byte-swap every other field it reads.
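The magic check can be sketched in a few lines of C. This is a minimal illustration rather than how any Apple tool does it: the enum and the function names classify_magic and swap32 are hypothetical, and the two constants are repeated from mach-o/loader.h so the sketch also compiles on hosts that lack that header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as published in mach-o/loader.h, repeated here so the
 * sketch also builds on non-Apple hosts. */
#define MH_MAGIC_64 0xfeedfacfu
#define MH_CIGAM_64 0xcffaedfeu

/* Hypothetical result type for this sketch. */
enum macho_order { ORDER_NOT_MACHO64, ORDER_SAME_AS_READER, ORDER_SWAPPED };

/* Classify the first four bytes of a file, taken exactly as stored. */
static enum macho_order classify_magic(const unsigned char *bytes)
{
    uint32_t magic;
    assert(bytes != NULL);
    memcpy(&magic, bytes, sizeof magic);  /* interpreted in the reader's byte order */
    if (magic == MH_MAGIC_64)
        return ORDER_SAME_AS_READER;      /* fields can be used as-is */
    if (magic == MH_CIGAM_64)
        return ORDER_SWAPPED;             /* every later field needs swap32() */
    return ORDER_NOT_MACHO64;
}

/* Reverse the byte order of a 32-bit field read from a swapped file. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}
```

Note that the function never asks whether the host is big-endian or little-endian. It only asks whether the file agrees with the host, which is all the two constants encode.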

CPU Type:

The cputype field indicates the CPU type on which this Mach-O file is intended to run, such as ARM64 or X86_64.

Note that this field's type is cpu_type_t, defined as follows:

// mach/machine.h
typedef integer_t cpu_type_t;
typedef integer_t cpu_subtype_t;
typedef integer_t cpu_threadtype_t;

// mach/arm/vm_types.h
typedef int integer_t;

From the code above, we can see that cpu_type_t is essentially an integer type.

CPU Subtype:

Defines the CPU subtype, such as CPU_SUBTYPE_ARM64_ALL, providing more granular specification of the target processor.

File Type:

Common values include:

  • MH_OBJECT: Indicates the current file is an .o object file.
  • MH_EXECUTE: Indicates the current file is an executable file.
  • MH_DYLIB: Indicates the current file is a dynamic library.
  • MH_PRELOAD: Indicates a preloaded executable, a special-purpose type (for example firmware images) that ordinary applications do not use.
  • MH_CORE: Indicates the current file is a core file. Core files are generated after program crashes and can be directly debugged to locate issues, but iOS applications do not produce this file type.
  • MH_DYLINKER: Indicates the current file is a dynamic linker (dyld uses this type).
  • MH_DSYM: Indicates the current file is the debug symbol file inside a .dSYM bundle.

Number of Commands:

The ncmds field specifies the count of Load Commands following the header.

Size of Commands:

The sizeofcmds field indicates the total byte size occupied by all Load Commands.

Flags:

Common flag values include:

  • MH_NOUNDEFS: Indicates the current Mach-O file has no undefined references.
  • MH_DYLDLINK: Indicates the current Mach-O file can only serve as input to the dynamic linker and cannot undergo further static linking. Executable files carry this flag.
  • MH_TWOLEVEL: Indicates the current Mach-O file uses two-level namespace, meaning each external symbol records which library it originates from, avoiding name conflicts.
  • MH_PIE: Indicates the current Mach-O file will use Address Space Layout Randomization (ASLR).

Load Commands

Immediately following the Mach-O header comes a series of Load Commands. There are many types of Load Commands, and all Load Commands share the same structure at their beginning:

// mach-o/loader.h
struct load_command {
    uint32_t cmd;      /* type of load command */
    uint32_t cmdsize;  /* total size of command in bytes */
};

Command Type:

The cmd field indicates the type of the current Load Command.

Command Size:

The way cmdsize is calculated differs from one Load Command to another. In every case, however, the cmdsize of a Load Command in a 64-bit Mach-O file must be a multiple of 8 bytes.
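Because every command starts with the same cmd/cmdsize prefix, a reader can walk the whole list without understanding each command type. The following is a minimal sketch of that walk. The function name count_commands is hypothetical, the struct repeats the layout shown above, and the code assumes the file uses the reader's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic prefix shared by all load commands (layout from mach-o/loader.h). */
struct load_command {
    uint32_t cmd;
    uint32_t cmdsize;
};

/* Count the commands of type `wanted` in a load-command area.
 * `cmds` points just past the Mach-O header; `ncmds` and `sizeofcmds`
 * are the header fields of the same name.
 * Returns -1 if the area is truncated or a cmdsize is invalid. */
static int count_commands(const unsigned char *cmds, uint32_t ncmds,
                          uint32_t sizeofcmds, uint32_t wanted)
{
    size_t offset = 0;
    int found = 0;

    assert(cmds != NULL);
    for (uint32_t i = 0; i < ncmds; i++) {
        struct load_command lc;

        if (sizeofcmds - offset < sizeof lc)
            return -1;                       /* no room for another prefix */
        memcpy(&lc, cmds + offset, sizeof lc);

        /* cmdsize must cover the prefix, be a multiple of 8,
         * and stay inside the area described by sizeofcmds. */
        if (lc.cmdsize < sizeof lc || lc.cmdsize % 8 != 0 ||
            lc.cmdsize > sizeofcmds - offset)
            return -1;

        if (lc.cmd == wanted)
            found++;
        offset += lc.cmdsize;                /* cmdsize is the stride to the next command */
    }
    return found;
}
```

The validation matters in practice: a malformed cmdsize of 0 would otherwise make the loop read the same command forever.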

Let us now examine common Load Commands in detail.

Segment Command (segment_command_64)

The segment_command_64 defines a segment's offset within the Mach-O file and its address after loading into virtual memory:

// mach-o/loader.h
struct segment_command_64 {
    uint32_t  cmd;        /* LC_SEGMENT_64 */
    uint32_t  cmdsize;    /* includes sizeof section_64 structs */
    char      segname[16];/* segment name */
    uint64_t  vmaddr;     /* memory address of this segment */
    uint64_t  vmsize;     /* memory size of this segment */
    uint64_t  fileoff;    /* file offset of this segment */
    uint64_t  filesize;   /* amount to map from the file */
    vm_prot_t maxprot;    /* maximum VM protection */
    vm_prot_t initprot;   /* initial VM protection */
    uint32_t  nsects;     /* number of sections in segment */
    uint32_t  flags;      /* flags */
};

Command: Set to LC_SEGMENT_64.

Command Size: The size of the current Load Command is calculated as:

cmdsize = sizeof(segment_command_64) + sizeof(section_64) × nsects

From this formula, we can see that LC_SEGMENT_64's size includes not only its own structure but also the size of the section_64 structures beneath it.
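The formula can be checked mechanically. The sketch below repeats the two structures in compact form (with vm_prot_t replaced by a plain int stand-in, an assumption made so it builds without the Mach headers) and uses two hypothetical helpers, segment_cmdsize and section_at, to compute the expected size and to step to the i-th trailing section_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int vm_prot_t;   /* stand-in for the Mach typedef */

struct segment_command_64 {
    uint32_t  cmd, cmdsize;
    char      segname[16];
    uint64_t  vmaddr, vmsize, fileoff, filesize;
    vm_prot_t maxprot, initprot;
    uint32_t  nsects, flags;
};

struct section_64 {
    char     sectname[16], segname[16];
    uint64_t addr, size;
    uint32_t offset, align, reloff, nreloc, flags;
    uint32_t reserved1, reserved2, reserved3;
};

/* Expected cmdsize for a segment command that declares `nsects` sections. */
static uint64_t segment_cmdsize(uint32_t nsects)
{
    return sizeof(struct segment_command_64) +
           (uint64_t)sizeof(struct section_64) * nsects;
}

/* Pointer to the i-th section_64 that trails `seg`, or NULL when the
 * index is out of range or cmdsize disagrees with nsects. */
static const struct section_64 *
section_at(const struct segment_command_64 *seg, uint32_t i)
{
    assert(seg != NULL);
    if (i >= seg->nsects || seg->cmdsize != segment_cmdsize(seg->nsects))
        return NULL;
    /* The section array starts immediately after the fixed-size command. */
    return (const struct section_64 *)(seg + 1) + i;
}
```

With these layouts the fixed part is 72 bytes and each section_64 is 80 bytes, so a segment with two sections has a cmdsize of 232.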

Segment Name:

The segname field contains the segment name. By convention, segment names are uppercase, such as __TEXT.

Virtual Memory Address:

The vmaddr field specifies the address where this segment loads into virtual memory. Due to ASLR, the actual address becomes vmaddr + ASLR offset.

Virtual Memory Size:

The vmsize field indicates the virtual memory size occupied by the segment and must be aligned to memory pages.

For iOS, the memory page size is 16KB (0x4000 bytes). This means the lowest 14 bits of a segment's vmaddr must be 0.
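As a small worked example of the page-alignment rule, the sketch below tests and rounds values against a 16KB page. The macro and function names are hypothetical, and the page size is hard-coded as an assumption for iOS.

```c
#include <assert.h>
#include <stdint.h>

/* 16KB page, assumed here for iOS; 0x4000 == 1 << 14. */
#define PAGE_SIZE_16K 0x4000ull

/* True when `value` sits on a page boundary (low 14 bits all zero). */
static int is_page_aligned(uint64_t value)
{
    return (value & (PAGE_SIZE_16K - 1)) == 0;
}

/* Round a size up to the next page boundary, as vmsize must be. */
static uint64_t round_up_to_page(uint64_t size)
{
    assert(size <= UINT64_MAX - (PAGE_SIZE_16K - 1));  /* avoid wrap-around */
    return (size + PAGE_SIZE_16K - 1) & ~(PAGE_SIZE_16K - 1);
}
```

For instance, 0x100000000 passes the check, while 0x100000010 fails it even though its lowest 4 bits are zero.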

File Offset:

The fileoff field specifies the offset of the segment corresponding to LC_SEGMENT_64 within the Mach-O file.

File Size:

The filesize field indicates the disk size occupied by the segment corresponding to LC_SEGMENT_64 in the Mach-O file.

Maximum Protection:

The maxprot field defines the maximum protection settings for the segment in memory, such as read-only VM_PROT_READ.

Initial Protection:

The initprot field specifies the initial protection settings for the segment in memory.

Number of Sections:

The nsects field indicates the quantity of sections contained within the segment corresponding to LC_SEGMENT_64.

Flags:

Typically set to 0.

Common Segments in Mach-O Files

A Mach-O file commonly contains the following segments:

__TEXT Segment: Contains program code or read-only data, such as C strings.

__DATA Segment: Contains program data.

__LINKEDIT Segment: Contains symbol tables, string tables, indirect tables, and other linking-related information.

In an .o object file, all section_64 structures reside under an anonymous LC_SEGMENT_64. The static linker ultimately places different sections into their corresponding segments.

Section Structure (section_64)

Following an LC_SEGMENT_64 structure, there may be zero or more section_64 structures:

struct section_64 {
    char     sectname[16];  /* name of this section */
    char     segname[16];   /* segment this section goes in */
    uint64_t addr;          /* memory address of this section */
    uint64_t size;          /* size in bytes of this section */
    uint32_t offset;        /* file offset of this section */
    uint32_t align;         /* section alignment (power of 2) */
    uint32_t reloff;        /* file offset of relocation entries */
    uint32_t nreloc;        /* number of relocation entries */
    uint32_t flags;         /* flags (section type and attributes) */
    uint32_t reserved1;     /* reserved (for offset or index) */
    uint32_t reserved2;     /* reserved (for count or sizeof) */
    uint32_t reserved3;     /* reserved */
};

Section Name:

The sectname field contains the section name. By convention, section names are lowercase, such as __text.

Segment Name:

The segname field indicates the segment to which this section belongs.

Address:

The addr field specifies the address where the section loads into virtual memory. The actual address becomes addr + ASLR offset.

Size:

The size field indicates the virtual memory size occupied by the section. For special sections like __bss, the disk occupancy size may be 0 while the memory size remains non-zero.

Offset:

The offset field specifies the current section's offset within the Mach-O file.

Alignment:

The align field indicates the section's memory alignment requirement. If the value is 3, it represents 2^3, meaning 8-byte alignment.

Relocation Offset:

The reloff field relates to Relocation. In .o files, there exists a relocation table. The reloff indicates the offset within the Mach-O file of the first item in the relocation table that requires relocation for this section.

In other words, through this field, one can locate all items requiring relocation for this section within the relocation table.

We frequently encounter three related concepts: Relocation, Rebase, and Bind. The distinctions among these three are:

Relocation: Occurs during static linking, completed by the static linker ld. When one .o file calls a function in another .o file, the compiler does not know the correct address of the latter during compilation and uses a placeholder address instead. The static linker ld, when merging multiple .o files into an executable file, replaces these placeholder addresses with correct addresses.

Rebase: Occurs when the executable file loads, completed by the dynamic linker dyld. This exists because of ASLR—dyld must repair addresses within the executable file.

Bind: Completed by the dynamic linker dyld, it replaces placeholder addresses in the executable file pointing to functions in external dynamic libraries with actual addresses.

Number of Relocations:

The nreloc field indicates the count of items requiring relocation for this section.

Flags:

This field is divided into two parts. The low 8 bits define the section type, while the high 24 bits define section attributes.
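Splitting the flags field is a pair of mask operations. The sketch below repeats the relevant masks and a few constants from mach-o/loader.h so it is self-contained; the helper names section_type and section_has_attr are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Masks and a few values as published in mach-o/loader.h. */
#define SECTION_TYPE             0x000000ffu  /* low 8 bits  */
#define SECTION_ATTRIBUTES       0xffffff00u  /* high 24 bits */
#define S_REGULAR                0x0u
#define S_SYMBOL_STUBS           0x8u
#define S_ATTR_PURE_INSTRUCTIONS 0x80000000u
#define S_ATTR_SOME_INSTRUCTIONS 0x00000400u

/* Section type: exactly one value, stored in the low byte. */
static uint32_t section_type(uint32_t flags)
{
    return flags & SECTION_TYPE;
}

/* Section attributes: independent bits, so test them one at a time. */
static int section_has_attr(uint32_t flags, uint32_t attr)
{
    assert((attr & SECTION_TYPE) == 0);   /* attributes never use the low byte */
    return (flags & SECTION_ATTRIBUTES & attr) != 0;
}
```

As an illustration, a flags value of 0x80000408 decodes to type S_SYMBOL_STUBS with both instruction attributes set, which is a combination one would expect for a stubs section.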

Common Section Types

S_REGULAR: Indicates this section is an ordinary section, such as __TEXT,__text.

S_ZEROFILL: Indicates this section will be filled with zeros initially, such as __DATA,__bss.

S_CSTRING_LITERALS: Indicates this section contains only C strings.

S_LITERAL_POINTERS: Indicates this section contains only constant pointers, such as __DATA,__objc_selrefs.

S_LAZY_SYMBOL_POINTERS: Indicates this section contains lazily bound pointers, meaning the dynamic linker dyld binds the actual address only upon first access.

The need for binding by the dynamic linker dyld arises from the existence of dynamic libraries. If an executable file calls an external function in a dynamic library, such as the printf function in the system C library, the address of printf cannot be known when the executable file is compiled and linked.

This is because the dynamic library's address in virtual memory is not fixed: it can differ each time the operating system loads it. Therefore, the static linker ld can only assign a placeholder address to printf at link time and mark it for runtime binding by the dynamic linker dyld.

Since the dynamic linker dyld binding process requires symbol lookup, to accelerate App startup speed, lazy binding technology was developed.

However, on iOS >= 15, due to dyld's adoption of Chained Fixup technology, lazy binding has been eliminated. All bindings are now non-lazy.

Non-lazy binding means that at App startup, all external addresses have already been bound by dyld, without waiting until the first access to that external address.

S_NON_LAZY_SYMBOL_POINTERS: Indicates this section contains non-lazy bound pointers. These pointers' addresses have already been bound by the dynamic linker dyld at startup.

S_SYMBOL_STUBS: Indicates this section contains only stub functions, such as __TEXT,__stubs.

Each stub is a short assembly snippet that supports lazy binding. A stub loads a pointer from an S_LAZY_SYMBOL_POINTERS section and jumps to it. Before the symbol is bound, that pointer leads to helper code in the __TEXT,__stub_helper section, which calls dyld to bind the symbol.

However, on iOS >= 15, since lazy binding has been eliminated, the __TEXT,__stub_helper section no longer exists. On iOS >= 15, stubs are therefore all non-lazily bound and jump directly to the corresponding external function.

S_COALESCED: Used for handling duplicate symbol definitions, ensuring that after static linking and merging of multiple .o files, only one copy of the code remains.

For example, in C++, different source files may perform identical instantiation of the same template, resulting in different .o files containing duplicate code after compilation. After marking them as S_COALESCED, the static linker performs coalescing, retaining only one copy of the code.

Common Section Attributes

S_ATTR_PURE_INSTRUCTIONS: Indicates this section contains only executable code, such as __TEXT,__text.

S_ATTR_SOME_INSTRUCTIONS: Indicates this section contains some executable code.

S_ATTR_NO_DEAD_STRIP: Informs the static linker ld that regardless of whether this section's content is referenced, it must not be deleted.

S_ATTR_LIVE_SUPPORT: Informs the static linker ld that this section's content survives dead stripping only if the code it references is itself "live"; otherwise, it is deleted.

S_ATTR_STRIP_STATIC_SYMS: Instructs the static linker ld to remove symbols declared with the static keyword. Since they are only visible within the current file, removing them reduces the size of the symbol table.

Reserved Fields

reserved1: This field is typically 0 and only useful for certain special sections:

  • S_SYMBOL_STUBS sections, such as __TEXT,__stubs
  • S_LAZY_SYMBOL_POINTERS sections, such as __DATA,__la_symbol_ptr
  • S_NON_LAZY_SYMBOL_POINTERS sections, such as __DATA_CONST,__got

In these sections, reserved1 indicates the index of the first stub or pointer in the current section within the indirect symbol table.

The indirect symbol table stores indices into the symbol table. Through the symbol table, one can find relevant symbol information.

The stubs and pointers of the above three types exist consecutively in the indirect symbol table. For example, if __TEXT,__stubs contains 3 stubs and the first stub's index in the indirect symbol table is 25, then the next 2 stubs' indices in the indirect symbol table are 26 and 27.

Since the global symbol table contains all symbol information, the indirect symbol table conveniently allows the dynamic linker dyld to know which symbols require binding.

reserved2: This field is typically 0 and only useful for certain special sections:

For S_SYMBOL_STUBS sections, such as __TEXT,__stubs, reserved2 indicates the byte size occupied by one stub. Therefore, the number of stubs in this section equals the current section's size divided by reserved2.

For S_LAZY_SYMBOL_POINTERS and S_NON_LAZY_SYMBOL_POINTERS sections, which store pointers, the pointer size is always 8 bytes. Therefore, the number of pointers in S_LAZY_SYMBOL_POINTERS and S_NON_LAZY_SYMBOL_POINTERS sections equals the current section's size divided by 8.
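The two reserved fields combine into a simple calculation: reserved2 (or the fixed pointer size) gives the number of entries, and reserved1 gives the indirect symbol table index of the first one. The following is a sketch under those rules, with hypothetical function names and the section type constants repeated from mach-o/loader.h.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_TYPE               0x000000ffu
#define S_NON_LAZY_SYMBOL_POINTERS 0x6u
#define S_LAZY_SYMBOL_POINTERS     0x7u
#define S_SYMBOL_STUBS             0x8u

/* Number of stubs or pointers in a section. Returns 0 for section types
 * without indirect-symbol entries, or for a stub section whose
 * reserved2 (stub size) is 0. */
static uint64_t indirect_entry_count(uint32_t flags, uint64_t size,
                                     uint32_t reserved2)
{
    switch (flags & SECTION_TYPE) {
    case S_SYMBOL_STUBS:
        return reserved2 ? size / reserved2 : 0;  /* size / bytes per stub */
    case S_LAZY_SYMBOL_POINTERS:
    case S_NON_LAZY_SYMBOL_POINTERS:
        return size / 8;                          /* 64-bit pointers */
    default:
        return 0;
    }
}

/* Indirect symbol table index of entry `i`; entries are consecutive,
 * starting at reserved1. */
static uint64_t indirect_index(uint32_t reserved1, uint64_t i)
{
    return (uint64_t)reserved1 + i;
}
```

Using the example from the text, a stubs section with reserved1 = 25 and three stubs maps to indirect symbol table indices 25, 26, and 27. The 12-byte stub size used to exercise the sketch is only an illustrative value.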

reserved3: Reserved and unused, set to 0.

Section Numbering

Sections in Mach-O files all have sequence numbers. Section numbering starts from 1 and spans across segments. In other words, if the first segment contains 10 sections, then sections in the second segment start from 11.
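Going from a file-wide section number back to a particular segment takes one pass over the nsects values. The sketch below assumes the caller has already collected the nsects field of each LC_SEGMENT_64 in file order; the function name locate_section is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a 1-based, file-wide section number to a segment index and a
 * 0-based section index inside that segment.
 * nsects[i] is the nsects field of the i-th LC_SEGMENT_64.
 * Returns 0 on success, -1 if the number refers to no section. */
static int locate_section(const uint32_t *nsects, size_t nsegs,
                          uint32_t ordinal,
                          size_t *seg_out, uint32_t *sect_out)
{
    uint32_t first = 1;   /* number of the first section in the current segment */

    assert(nsects != NULL && seg_out != NULL && sect_out != NULL);
    if (ordinal == 0)
        return -1;        /* numbering starts at 1 */

    for (size_t s = 0; s < nsegs; s++) {
        if (ordinal - first < nsects[s]) {
            *seg_out = s;
            *sect_out = ordinal - first;
            return 0;
        }
        first += nsects[s];   /* numbering continues across segments */
    }
    return -1;
}
```

With section counts of 0, 10, and 5 (a __PAGEZERO-like segment with no sections followed by two ordinary segments), section 11 resolves to the first section of the third segment.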

Special Segments

__PAGEZERO Segment

__PAGEZERO is a special segment. Its LC_SEGMENT_64 Load Command describes it as follows:

When loaded into virtual memory, its starting address is 0x0, and it occupies a virtual memory size of 0x100000000, exactly 4GB.

This means __PAGEZERO's address space spans 0x0 to 0xffffffff. If a program accesses an address within this address space, a null pointer crash will occur.

The reason for the 4GB size is that 4GB is exactly the range a 32-bit pointer can address. Any 32-bit pointer value that a 64-bit program uses by mistake therefore lands inside __PAGEZERO and faults immediately.

This segment also occupies no disk space in the Mach-O file: its filesize is 0. The segment contains no data, so nothing needs to be stored for it.

For iOS executable files, due to the ASLR mechanism, one might ask: is __PAGEZERO segment's space still 0x0 to 0xffffffff?

The answer is: under the ASLR mechanism, __PAGEZERO's starting virtual address remains 0x0, but the ending address will have a corresponding offset added to 0xffffffff.

__TEXT Segment Details

A question arises: will the Header and Load Commands in Mach-O be loaded into virtual memory?

The answer is yes.

Then, where in virtual memory will the Header and Load Commands be loaded?

The answer is: Header and Load Commands will be placed into the __TEXT segment.

As the first segment containing data, the __TEXT segment sits adjacent to the __PAGEZERO segment.

However, the __TEXT segment's starting position in virtual memory is not code but Header and Load Command data.

After the Header and Load Commands comes the actual executable code.

For example, in one binary the __TEXT segment's virtual starting address is 0x100000000, while the first code in virtual memory is at 0x1000011B8.

The difference between these two values, 0x11B8, is exactly the size occupied by the Header and Load Commands.

__LINKEDIT Segment

The static linker sorts various segments. The __TEXT segment is always placed first. The __LINKEDIT segment is always the last segment in Mach-O.

The __LINKEDIT segment contains considerable content, typically including:

  • Chained Fixups
  • Exports Trie
  • Function Starts
  • Symbol Table
  • Data In Code Entries
  • Indirect Symbol Table
  • String Table
  • Code Signature

Except for the string table, all other parts have corresponding Load Commands describing their attributes.

The string table's position is described by the symbol table's Load Command LC_SYMTAB.

UUID Command

The uuid_command stores the 128-bit UUID corresponding to the Mach-O file:

// mach-o/loader.h
struct uuid_command {
    uint32_t cmd;      /* LC_UUID */
    uint32_t cmdsize;  /* sizeof(struct uuid_command) */
    uint8_t  uuid[16]; /* the 128-bit uuid */
};

Command: Set to LC_UUID.

Command Size: sizeof(uuid_command).

UUID: 128-bit universally unique identifier.

Dynamic Library Command

The dylib_command stores dynamic library information used by the Mach-O file:

// mach-o/loader.h
struct dylib_command {
    uint32_t       cmd;
    uint32_t       cmdsize;
    struct dylib   dylib;
};

Command: The cmd can be set to:

  1. LC_ID_DYLIB: If the current Mach-O is a dynamic library, cmd is set to LC_ID_DYLIB. At this point, LC_ID_DYLIB identifies this dynamic library's install name, telling others where to find it.
  2. LC_LOAD_DYLIB: When an App links to this dynamic library, the dynamic library's LC_ID_DYLIB is copied into the App's LC_LOAD_DYLIB, enabling the App to locate this dynamic library at runtime.
  3. LC_LOAD_WEAK_DYLIB: Indicates weak linking. Weak linking means that even if the App cannot find the corresponding dynamic library at runtime, or the library exists but a certain interface in the library has been deleted, no crash will occur.
  4. LC_REEXPORT_DYLIB: Functions to disguise one dynamic library's interfaces as another dynamic library's interfaces.

For example, suppose there exists a huge dynamic library A that we want to split into two smaller dynamic libraries B and C. Since other Apps depend on dynamic library A, to avoid affecting App operation, dynamic library A can use LC_REEXPORT_DYLIB to export dynamic libraries B and C's interfaces.

In other words, dynamic library A becomes an empty shell. iOS's Umbrella Framework employs LC_REEXPORT_DYLIB.

Command Size: Besides sizeof(dylib_command), cmdsize must also include the length of the string referenced by struct dylib. In addition, the overall size must be a multiple of 8 bytes.

Dylib Structure:

// mach-o/loader.h
struct dylib {
    union lc_str name;
    uint32_t     timestamp;
    uint32_t     current_version;
    uint32_t     compatibility_version;
};

Name: The name is a union lc_str, defined as:

// mach-o/loader.h
union lc_str {
    uint32_t offset;
#ifndef __LP64__
    char *ptr;
#endif
};

In 64-bit environments, this union is actually:

union lc_str {
    uint32_t offset;
};

The lc_str's offset stores the offset of the string from the start of the current Load Command. The string itself is stored directly after the command's fixed-size structure, inside the bytes covered by cmdsize.
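Both rules, the 8-byte rounding of cmdsize and the offset-based string lookup, fit in a short sketch. The function names are hypothetical, and the code assumes the string is NUL-terminated inside the command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* cmdsize for a command made of `fixed_size` bytes of structure followed
 * by `name`, rounded up to the 8-byte multiple required in 64-bit files. */
static uint32_t padded_cmdsize(uint32_t fixed_size, const char *name)
{
    uint32_t raw;
    assert(name != NULL);
    raw = fixed_size + (uint32_t)strlen(name) + 1;   /* +1 for the NUL */
    return (raw + 7u) & ~7u;
}

/* Resolve an lc_str. `cmd` points at the first byte of the load command
 * and `offset` is the lc_str offset, measured from that same byte.
 * Returns NULL if the offset or the string would leave the command. */
static const char *lc_str_resolve(const unsigned char *cmd,
                                  uint32_t cmdsize, uint32_t offset)
{
    assert(cmd != NULL);
    if (offset >= cmdsize)
        return NULL;
    if (memchr(cmd + offset, '\0', cmdsize - offset) == NULL)
        return NULL;                 /* unterminated string */
    return (const char *)(cmd + offset);
}
```

For a dylib_command, whose fixed part is 24 bytes, the name @rpath/A.dylib needs 24 + 14 + 1 = 39 bytes, which rounds up to a cmdsize of 40, and the name is found at offset 24.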

Timestamp: The dynamic library's build time.

Current Version: The current dynamic library's version number.

Compatibility Version: The minimum version with which the dynamic library maintains backward compatibility.

When the App runs, the dynamic linker uses the path pointed to by the name field in struct dylib to load the dynamic library.

To successfully start the App, the dynamic library's compatibility_version must be greater than or equal to the compatibility_version recorded by the App.

If not, the dynamic linker will directly crash the program at startup. This is because, for Apple, if the App's compatibility_version is larger, Apple assumes the App will definitely use the newer dynamic library interface; otherwise, the App would have no reason to link to the new version dynamic library.

Therefore, if a lower version dynamic library is provided, for safety, Apple directly prevents the App from starting.

However, even if the dynamic library's compatibility_version is updated, it only represents that the App can start normally—it does not guarantee no crashes during runtime.

The App might use an interface that has been deleted in the new version dynamic library. At this point, when the App runs to this interface, it will crash.

However, having the dynamic linker confirm whether the App uses interfaces already deleted in the dynamic library incurs too much overhead, so Apple does not perform related checks that would cause the App to crash at startup.

Although there is a risk of runtime crashes, in most cases, dynamic libraries can still upgrade smoothly without affecting existing Apps' operation.

Dynamic Linker Command

The dylinker_command stores information related to the dynamic linker:

struct dylinker_command {
    uint32_t     cmd;
    uint32_t     cmdsize;
    union lc_str name;
};

Command: The cmd value can be set to:

  1. LC_LOAD_DYLINKER: If the Mach-O file is an App, cmd is set to LC_LOAD_DYLINKER. At this point, this command stores the dynamic linker's disk path. When the kernel loads this App, it tells the kernel where to find the dynamic linker.
  2. LC_ID_DYLINKER: If the Mach-O file is the dynamic linker itself, cmd is set to LC_ID_DYLINKER. At this point, this command also stores the dynamic linker's own path on disk.

Dynamic libraries have neither LC_LOAD_DYLINKER commands nor LC_ID_DYLINKER commands. This is because dynamic libraries are loaded by the dynamic linker when executable files start. At this point, the kernel has already read the dynamic linker's position from the executable file.

  3. LC_DYLD_ENVIRONMENT: Sometimes we want to pass some environment variables to the dynamic linker when running an executable file. For example, we can pass DYLD_PRINT_LIBRARIES=1 in Xcode → Scheme → Run → Arguments → Environment Variables, making the dynamic linker print the dynamic libraries currently loaded by the executable program.

Or use the command line:

DYLD_PRINT_LIBRARIES=1 ./executable

However, sometimes we hope the executable file can automatically pass environment variables to the dynamic linker when loading. This is where LC_DYLD_ENVIRONMENT becomes useful.

When the dynamic linker loads an executable file, it searches for LC_DYLD_ENVIRONMENT and reads the environment variables inside.

There are two ways to embed LC_DYLD_ENVIRONMENT in Mach-O:

Method 1: In Xcode → Build Settings → Other Linker Flags, add:

-Wl,-dyld_env,DYLD_PRINT_LIBRARIES=1

-Wl is a clang parameter that passes the comma-separated parameters following it to the static linker. -dyld_env is the static linker's parameter, instructing the static linker to write the corresponding environment variable into Mach-O.

Method 2: Pass the same flag directly on the command line:

clang main.m -o MyApp -Wl,-dyld_env,DYLD_PRINT_STATISTICS=1

Command Size: sizeof(dylinker_command) plus the name string length.

Name: Either the dynamic linker's path or the environment variable's value.

Framework-Related Commands

Sub Framework Command

A framework may be a member of an umbrella framework. For example, if A.framework is a member of an umbrella framework, then the sub_framework_command in A.framework records the umbrella framework's name:

struct sub_framework_command {
    uint32_t     cmd;
    uint32_t     cmdsize;
    union lc_str umbrella;
};

Command: Set to LC_SUB_FRAMEWORK.

Command Size: sizeof(sub_framework_command) plus the umbrella framework name length.

Umbrella: Records the umbrella framework's name.

If a framework contains sub_framework_command, it can only be linked by the umbrella framework or other umbrella framework members.

Sub Umbrella Command

An umbrella framework may contain other sub-umbrella frameworks. At this point, the umbrella framework uses sub_umbrella_command to record the sub-umbrella framework's name:

struct sub_umbrella_command {
    uint32_t         cmd;
    uint32_t         cmdsize;
    union lc_str     sub_umbrella;
};

Command: Set to LC_SUB_UMBRELLA.

Command Size: sizeof(sub_umbrella_command) plus the sub-umbrella framework name length.

Sub Umbrella: The sub-umbrella framework's name.

If sub_umbrella_command is used, then symbols exported from the sub-umbrella framework are treated as if exported from the main umbrella framework. This is very similar to LC_REEXPORT_DYLIB.

Perhaps precisely because of this, sub_umbrella_command is rarely used. At least, searching several iOS system libraries did not find any using this command.

Sub Library Command

The sub_library_command is the counterpart of sub_umbrella_command: sub_umbrella_command applies to frameworks, while sub_library_command applies to plain dynamic libraries (.dylib):

struct sub_library_command {
    uint32_t     cmd;
    uint32_t     cmdsize;
    union lc_str sub_library;
};

Command: Set to LC_SUB_LIBRARY.

Command Size: sizeof(sub_library_command) plus the sub-library name length.

Sub Library: The sub-library's name.

Sub Client Command

Sometimes a framework should be linkable only by specific programs, which is what sub_client_command is for. The sub_client_command functions as a whitelist:

struct sub_client_command {
    uint32_t     cmd;
    uint32_t     cmdsize;
    union lc_str client;
};

Command: Set to LC_SUB_CLIENT.

Command Size: sizeof(sub_client_command) plus the name of the program allowed to link.

Client: The name of the program allowed to link to this framework.

Build Version Command

The build_version_command records the current program's build information:

struct build_version_command {
    uint32_t cmd;
    uint32_t cmdsize;
    uint32_t platform;
    uint32_t minos;
    uint32_t sdk;
    uint32_t ntools;
};

Command: Set to LC_BUILD_VERSION.

Command Size: Following build_version_command come ntools structures, each corresponding to a tool used. Therefore, cmdsize equals sizeof(build_version_command) plus ntools × sizeof(build_tool_version).

The build_tool_version is the structure defining tools.

Platform: Defines the platform on which this program runs:

#define PLATFORM_MACOS         1
#define PLATFORM_IOS           2
#define PLATFORM_TVOS          3
#define PLATFORM_WATCHOS       4
#define PLATFORM_BRIDGEOS      5
#define PLATFORM_MACCATALYST   6
#define PLATFORM_IOSSIMULATOR  7
#define PLATFORM_TVOSSIMULATOR 8
#define PLATFORM_WATCHOSSIMULATOR 9
#define PLATFORM_DRIVERKIT     10
#define PLATFORM_MAX           PLATFORM_DRIVERKIT

Minimum OS: Defines the minimum system version on which this program can run.

SDK: Defines the SDK version used to build this program.

Number of Tools: Defines the quantity of tools used to build this program.

Tool information is defined by the build_tool_version structure, following build_version_command. If multiple tools are used, there will be multiple build_tool_version structures:

struct build_tool_version {
    uint32_t tool;
    uint32_t version;
};

Tool: Defines tool type:

#define TOOL_CLANG    1
#define TOOL_SWIFT    2
#define TOOL_LD       3

Version: Defines tool version information. The version format is XXXX.YY.ZZ. The high 16 bits define XXXX, the middle 8 bits define YY, and the final 8 bits define ZZ.
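Unpacking the XXXX.YY.ZZ layout is a matter of shifts and masks. The sketch below uses a hypothetical struct and function names; to the best of my knowledge the minos and sdk fields use the same packing, but treat that as an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for the three decoded components. */
struct packed_version {
    unsigned major;   /* XXXX: high 16 bits  */
    unsigned minor;   /* YY:   middle 8 bits */
    unsigned patch;   /* ZZ:   low 8 bits    */
};

static struct packed_version decode_version(uint32_t v)
{
    struct packed_version out;
    out.major = v >> 16;
    out.minor = (v >> 8) & 0xffu;
    out.patch = v & 0xffu;
    return out;
}

/* Write "major.minor.patch" into buf; returns what snprintf returns. */
static int format_version(uint32_t v, char *buf, size_t buflen)
{
    struct packed_version p = decode_version(v);
    assert(buf != NULL);
    return snprintf(buf, buflen, "%u.%u.%u", p.major, p.minor, p.patch);
}
```

For example, the raw value 0x000F0200 decodes to 15.2.0.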

Source Version Command

The source_version_command contains source code version information used to build the current binary program:

struct source_version_command {
    uint32_t cmd;
    uint32_t cmdsize;
    uint64_t version;
};

Command: Set to LC_SOURCE_VERSION.

Command Size: sizeof(source_version_command).

Version: Source code version information. The version format is a.b.c.d.e, where a occupies 24 bits, b occupies 10 bits, c occupies 10 bits, d occupies 10 bits, and e occupies 10 bits.
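The 24.10.10.10.10 bit layout can be decoded and re-encoded as follows. This is a minimal sketch; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the five components of a.b.c.d.e. */
struct source_version {
    uint32_t a, b, c, d, e;
};

static struct source_version decode_source_version(uint64_t v)
{
    struct source_version s;
    s.a = (uint32_t)(v >> 40);             /* top 24 bits */
    s.b = (uint32_t)((v >> 30) & 0x3ff);   /* 10 bits each below */
    s.c = (uint32_t)((v >> 20) & 0x3ff);
    s.d = (uint32_t)((v >> 10) & 0x3ff);
    s.e = (uint32_t)(v & 0x3ff);
    return s;
}

static uint64_t encode_source_version(struct source_version s)
{
    /* Each component must fit its bit field. */
    assert(s.a < (1u << 24));
    assert(s.b < 1024 && s.c < 1024 && s.d < 1024 && s.e < 1024);
    return ((uint64_t)s.a << 40) | ((uint64_t)s.b << 30) |
           ((uint64_t)s.c << 20) | ((uint64_t)s.d << 10) | s.e;
}
```

Under this layout, version 1.2.3.4.5 is stored as 0x10080301005.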

Entry Point Command

The entry_point_command defines the executable program's main function entry position:

struct entry_point_command {
    uint32_t cmd;
    uint32_t cmdsize;
    uint64_t entryoff;
    uint64_t stacksize;
};

Command: Set to LC_MAIN.

Command Size: sizeof(entry_point_command).

Entry Offset: The main function entry's offset. Note that this offset is relative to the __TEXT segment's starting address.

Stack Size: The main thread's initial stack size.

Encryption Information Command

The encryption_info_command_64 defines the encryption scope within the program. The encryption_info_command_64 exists only on real device programs, not on simulator programs. Typically, the __TEXT segment is encrypted:

struct encryption_info_command_64 {
    uint32_t cmd;
    uint32_t cmdsize;
    uint32_t cryptoff;
    uint32_t cryptsize;
    uint32_t cryptid;
    uint32_t pad;
};

Command: Set to LC_ENCRYPTION_INFO_64.

Command Size: sizeof(encryption_info_command_64).

Encryption Offset: The starting offset of the encryption scope.

Encryption Size: The size of the encryption scope.

Encryption ID: Whether encrypted. If 0, indicates unencrypted.

Pad: Padding field, because command size must be a multiple of 8 bytes.

RPath Command

Sometimes we do not want the path of a linked dynamic library recorded in the executable program to be an absolute path. This is where rpath_command becomes useful.

For example, suppose we wrote a dynamic library A.dylib and want to package and distribute it together with the executable program. If the executable program records A.dylib's absolute path at link time, the program will crash at launch on users' machines, because users install it in different locations and the library is no longer at the recorded path.

To avoid this, we can use install_name_tool to first change A.dylib's install name to an @rpath-relative path:

install_name_tool -id @rpath/A.dylib A.dylib

Then re-link A.dylib to the executable program. Finally, use the install_name_tool tool to add rpath_command to the executable program:

install_name_tool -add_rpath @executable_path/Frameworks MyApp

When the dynamic linker dyld loads a dynamic library according to LC_LOAD_DYLIB in the executable program and finds that the recorded path begins with @rpath (for example @rpath/A.dylib), it consults the rpath_command entries to resolve @rpath. The structure is:

struct rpath_command {
    uint32_t     cmd;
    uint32_t     cmdsize;
    union lc_str path;
};

Command: Set to LC_RPATH.

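Command Size: sizeof(rpath_command) plus the length of the path string, including its terminating NUL and any padding needed to keep the command size a multiple of 8 bytes.

Path: The directory that replaces @rpath. It is stored as an lc_str, that is, an offset from the start of the command to a NUL-terminated string. Besides an absolute path, it commonly begins with one of two variables:

  1. @executable_path: Represents the directory containing the main executable program.
  2. @loader_path: Represents the directory containing the image (executable program or dynamic library) whose load commands are being processed.

If that image is the executable program, then @loader_path is the executable program's directory. If it is another dynamic library B.dylib, then @loader_path is B.dylib's directory.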
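The substitution described above can be illustrated with plain string handling. This is only a sketch of the idea, not dyld's actual code: replace_prefix is a hypothetical helper, and the directory /Applications/MyApp.app/Contents/MacOS is an assumed install location chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if path starts with the variable `prefix`
 * (for example "@rpath"), writes `replacement` followed by the rest of
 * path into out. Returns 0 on success, -1 if the prefix does not match
 * or the result does not fit into out. */
int replace_prefix(const char *path, const char *prefix,
                   const char *replacement, char *out, size_t out_size) {
    assert(path && prefix && replacement && out);

    size_t prefix_len = strlen(prefix);
    if (strncmp(path, prefix, prefix_len) != 0) {
        return -1;
    }
    /* The variable must be a whole path component. */
    char next = path[prefix_len];
    if (next != '/' && next != '\0') {
        return -1;
    }
    int n = snprintf(out, out_size, "%s%s", replacement, path + prefix_len);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return 0;
}
```

Applied twice, the helper mirrors the two stages of the lookup. Replacing @rpath in @rpath/A.dylib with the LC_RPATH value @executable_path/Frameworks gives @executable_path/Frameworks/A.dylib. Replacing @executable_path with the assumed directory then gives /Applications/MyApp.app/Contents/MacOS/Frameworks/A.dylib. A real loader would repeat the first stage for every LC_RPATH entry and stop at the first candidate that exists on disk.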

Linkedit Data Command

Earlier, when discussing the __LINKEDIT segment, we mentioned that except for the string table, all content in the __LINKEDIT segment has its own corresponding Load Command. The Load Commands for the indirect symbol table and the symbol table are covered separately. The Load Commands for the remaining parts are all defined by linkedit_data_command:

struct linkedit_data_command {
    uint32_t cmd;
    uint32_t cmdsize;
    uint32_t dataoff;
    uint32_t datasize;
};

Command: One of the following:

  1. LC_DYLD_CHAINED_FIXUPS: Related to dynamic linker symbol address binding, covered separately.
  2. LC_DYLD_EXPORTS_TRIE: Related to dynamic linker exported symbol lookup, covered separately.
  3. LC_FUNCTION_STARTS: Records where each function in the Mach-O binary begins, covered separately.
  4. LC_LINKER_OPTIMIZATION_HINT: Exists in object (.o) files and tells the static linker ld which instruction sequences can be replaced with more efficient ones.
  5. LC_DYLIB_CODE_SIGN_DRS: Records the code signing requirements of the dynamic libraries a Mach-O binary depends on, so that the libraries can be verified as genuine and not maliciously replaced when loaded.
  6. LC_SEGMENT_SPLIT_INFO: Gradually being superseded since the introduction of LC_DYLD_CHAINED_FIXUPS.
  7. LC_DATA_IN_CODE: For lookup efficiency or alignment reasons, the compiler sometimes places data inside the __TEXT segment. Without this command, a disassembler would misinterpret that data as instructions.
  8. LC_CODE_SIGNATURE: Locates the Mach-O binary's digital signature.

Command Size: sizeof(linkedit_data_command).

Data Offset: The file offset of the related data, measured from the start of the Mach-O file. The data itself lies inside the __LINKEDIT segment.

Data Size: The size of related data.
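As a rough sketch of how dataoff and datasize might be used, the helpers below treat dataoff as an offset from the start of the file, which is how the comment in loader.h describes it. The struct is redeclared locally so the example builds on any platform. The function names and the linkedit_fileoff and linkedit_filesize parameters, which stand for the fileoff and filesize fields of the __LINKEDIT segment command, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the structure shown above. */
struct linkedit_data_command {
    uint32_t cmd;
    uint32_t cmdsize;
    uint32_t dataoff;   /* file offset of the data */
    uint32_t datasize;  /* size of the data in bytes */
};

static_assert(sizeof(struct linkedit_data_command) == 16,
              "unexpected layout");

/* Hypothetical check: does the described range lie entirely inside the
 * file range occupied by the __LINKEDIT segment? */
bool linkedit_data_in_segment(const struct linkedit_data_command *c,
                              uint64_t linkedit_fileoff,
                              uint64_t linkedit_filesize) {
    uint64_t start = c->dataoff;
    uint64_t end = start + c->datasize;
    return start >= linkedit_fileoff &&
           end <= linkedit_fileoff + linkedit_filesize;
}

/* Hypothetical accessor: pointer to the data inside a buffer holding the
 * whole file, or NULL if the range runs past the end of the buffer. */
const uint8_t *linkedit_data_ptr(const uint8_t *file, size_t file_size,
                                 const struct linkedit_data_command *c) {
    uint64_t end = (uint64_t)c->dataoff + c->datasize;
    if (end > file_size) {
        return NULL;
    }
    return file + c->dataoff;
}
```

A parser would typically run the range check once per command before handing the pointer to a format-specific decoder, for example one for the function starts table.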

Fat Mach-O

Apple supports combining Mach-O binaries compiled from the same source code for different CPU architectures into a single file. Such files are called Fat Mach-O (also known as universal binaries).

A Fat Mach-O is laid out as a Fat Header, followed by one Fat Arch entry per architecture, followed by the Mach-O file for each architecture.

The Fat Header indicates how many per-architecture Mach-O files are contained. Each Fat Arch identifies one CPU architecture and gives the offset of that architecture's Mach-O relative to the start of the Fat Mach-O file.

The Mach-O for each CPU architecture is a complete Mach-O file, containing Header, Load Commands, and Segments.

When the system loads a Fat Mach-O, it selects and loads only the Mach-O matching the current CPU rather than the entire Fat Mach-O.

Fat Header Definition

// mach-o/fat.h
struct fat_header {
    uint32_t magic;
    uint32_t nfat_arch;
};

Magic: FAT_MAGIC (0xcafebabe). A variant, FAT_MAGIC_64, marks files whose architecture entries use 64-bit offsets and sizes.

Number of Fat Architectures: Indicates how many per-architecture Mach-O files are contained.

Regardless of whether the machine that built the Fat Mach-O is big-endian or little-endian, all fields of the Fat Header and of each Fat Arch are stored in big-endian format.

Fat Arch Definition

struct fat_arch {
    cpu_type_t    cputype;
    cpu_subtype_t cpusubtype;
    uint32_t      offset;
    uint32_t      size;
    uint32_t      align;
};

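CPU Type: The CPU type of this architecture's Mach-O, matching the cputype field in its header.

CPU Subtype: The CPU subtype of this architecture's Mach-O, matching the cpusubtype field in its header.

Offset: The offset of this architecture's Mach-O relative to the start of the Fat Mach-O file.

Size: The size, in bytes, of this architecture's Mach-O.

Alignment: The alignment of the offset, expressed as a power of 2. The offset must be a multiple of 2^align.

The alignment requirement exists because the system loads a Fat Mach-O by mapping disk pages one-to-one onto memory pages, so each architecture's Mach-O has to start on a page boundary.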
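The two fat structures can be read from a raw byte buffer as sketched below. The example decodes every field as big-endian, as noted above, and checks the alignment rule. It is a minimal sketch under stated assumptions: struct fat_slice and the function names are hypothetical, cputype and cpusubtype are kept as unsigned 32-bit values for simplicity, and only the 32-bit fat_arch layout (20 bytes per entry) is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAT_MAGIC 0xcafebabeu

/* Decodes a big-endian 32-bit value regardless of host byte order. */
uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical host-order copy of one fat_arch entry. */
struct fat_slice {
    uint32_t cputype;
    uint32_t cpusubtype;
    uint32_t offset;
    uint32_t size;
    uint32_t align;
};

/* Reads the fat header. Returns false if the buffer is too short or
 * does not start with FAT_MAGIC. */
bool fat_read_header(const uint8_t *file, size_t size, uint32_t *nfat_arch) {
    assert(nfat_arch != NULL);
    if (size < 8 || read_be32(file) != FAT_MAGIC) {
        return false;
    }
    *nfat_arch = read_be32(file + 4);
    return true;
}

/* Reads the fat_arch entry at index, which follows the 8-byte header. */
bool fat_read_slice(const uint8_t *file, size_t size, uint32_t index,
                    struct fat_slice *out) {
    size_t pos = 8 + (size_t)index * 20;
    if (size < pos + 20) {
        return false;
    }
    out->cputype    = read_be32(file + pos);
    out->cpusubtype = read_be32(file + pos + 4);
    out->offset     = read_be32(file + pos + 8);
    out->size       = read_be32(file + pos + 12);
    out->align      = read_be32(file + pos + 16);
    return true;
}

/* True if offset is a multiple of 2^align. */
bool fat_slice_is_aligned(const struct fat_slice *s) {
    if (s->align >= 32) {
        return false;  /* cannot be satisfied by a 32-bit offset */
    }
    uint32_t mask = ((uint32_t)1 << s->align) - 1;
    return (s->offset & mask) == 0;
}
```

Reading the bytes one at a time avoids any dependence on the host's byte order, so the same code works on both little-endian and big-endian machines.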

Conclusion

Understanding the Mach-O file format provides crucial insights for anyone working with Apple platforms at a low level. From debugging crashes to analyzing binary security, from optimizing load times to understanding dynamic linking behavior, the knowledge of Mach-O structure proves invaluable.

This comprehensive exploration has covered the header structure, all major Load Commands, special segments, and the Fat Mach-O format. With this foundation, you are now equipped to delve deeper into binary analysis, reverse engineering, or low-level system debugging on Apple platforms.