Bartosz Wójcik

Posted on Jul 5, 2020 • Edited on Jan 2, 2021

Assembly code size optimization tricks

#assembler #assembly #shellcode #reversing

There are many different types of code optimization when it comes to assembly or assembler code.

The most popular is, of course, speed optimization, which focuses on the fastest possible code, often using MMX, SSE, or AVX instructions to process as much data as possible at once.

But there is one particular area of assembly programming that focuses on size optimization. I have used this knowledge in many of my software reverse engineering projects, either to modify compiled binaries where only a limited amount of space is available for the modified code, or to develop shellcode for 0-day exploits, where the size of the shellcode is again limited.

Programmers who write in assembly tend to think that their code is already optimized to the maximum (after all, it's assembly!), but as I've found out, there are many tricks that can be used to achieve even better results in terms of minimizing the code size.

Zeroing of CPU registers

mov eax,0 ; 5 bytes -> B8 00 00 00 00
xor eax,eax ; 2 bytes -> 33 C0
sub eax,eax ; 2 bytes -> 2B C0
and eax,0 ; 3 bytes -> 83 E0 00

As it turns out, even the simplest operation can take up to 5 bytes, but if we use the xor instruction instead, the same operation takes only 2 bytes in the resulting program code. The value 0 is often passed as a default parameter to WinAPI functions.

Example with the standard version of code

    push    offset szSansSerif      ; lpFace                        ; 5 bytes
    push    0                       ; pitch and family              ; 2 bytes
    push    0                       ; output quality                ; 2 bytes
    push    0                       ; clipping precision            ; 2 bytes
    push    0                       ; output precision              ; 2 bytes
    push    1                       ; char set identifier           ; 2 bytes
    push    0                       ; strikeout attribute flag      ; 2 bytes
    push    1                       ; underline attribute flag      ; 2 bytes
    push    0                       ; italic attribute flag         ; 2 bytes
    push    400                     ; font weight(normal)           ; 5 bytes
    push    0                       ; base-line orientation angle   ; 2 bytes
    push    0                       ; angle of escapement           ; 2 bytes
    push    0                       ; logical average character     ; 2 bytes
    push    0Dh                     ; logical height of font        ; 2 bytes
    call    CreateFontA

The instructions needed to push the parameters for the CreateFontA call take 34 bytes in total in this case.

Size optimized version

    sub     eax,eax                                                 ; 2 bytes
    push    offset szSansSerif      ; lpFace                        ; 5 bytes
    push    eax                     ; pitch and family              ; 1 byte
    push    eax                     ; output quality                ; 1 byte
    push    eax                     ; clipping precision            ; 1 byte
    push    eax                     ; output precision              ; 1 byte
    push    1                       ; char set identifier           ; 2 bytes
    push    eax                     ; strikeout attribute flag      ; 1 byte
    push    1                       ; underline attribute flag      ; 2 bytes
    push    eax                     ; italic attribute flag         ; 1 byte
    push    400                     ; font weight(normal)           ; 5 bytes
    push    eax                     ; base-line orientation angle   ; 1 byte
    push    eax                     ; angle of escapement           ; 1 byte
    push    eax                     ; logical average character     ; 1 byte
    push    0Dh                     ; logical height of font        ; 2 bytes
    call    CreateFontA

This time it's 27 bytes, a small gain compared to the previous version, but sometimes these few bytes can be useful for something else.
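The two totals above can be rechecked with a short tally. This is only a sketch in Python, not part of the original listings: the function names are made up for illustration, and the sizes are the 32-bit encodings quoted in the comments of the two listings.

```python
# Byte sizes of the 32-bit instruction forms used in the listings above.
PUSH_IMM8 = 2    # push 0, push 1, push 0Dh   (6A ib)
PUSH_IMM32 = 5   # push 400, push offset ...  (68 id)
PUSH_REG = 1     # push eax                   (50)
SUB_REG_REG = 2  # sub eax,eax                (2B C0)

# The 14 CreateFontA arguments in push order.
# "offset" is a placeholder for the address of szSansSerif.
args = ["offset", 0, 0, 0, 0, 1, 0, 1, 0, 400, 0, 0, 0, 0x0D]

def immediate_size(arg):
    """Size of pushing one argument as an immediate value."""
    if arg == "offset" or not -128 <= arg <= 127:
        return PUSH_IMM32
    return PUSH_IMM8

def standard_size(args):
    """Every argument is pushed as an immediate."""
    return sum(immediate_size(a) for a in args)

def optimized_size(args):
    """Zero eax once, then use push eax for every argument equal to 0."""
    total = SUB_REG_REG
    for a in args:
        total += PUSH_REG if a == 0 else immediate_size(a)
    return total

print(standard_size(args), optimized_size(args))  # prints "34 27"
```

The saving grows by one byte for every additional zero argument, after the 2 bytes paid for zeroing eax.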

Passing series of the same values

If we need to pass the same parameters to the function, it's usually done like this:

    push    0               ; 2 bytes
    push    0               ; 2 bytes
    push    0               ; 2 bytes
    push    0               ; 2 bytes
    push    0               ; 2 bytes
    push    0               ; 2 bytes
    push    0               ; 2 bytes
    ================================
                        = 14 bytes

Or more size optimized, like this:

    sub     eax,eax         ; 2 bytes
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    push    eax             ; 1 byte
    ===============================
                        = 9 bytes

But it can be further size optimized using a simple loop:

    sub     eax,eax         ; 2 bytes

    push    7               ; 2 bytes
    pop     ecx             ; 1 byte

@save_args:
    push    eax             ; 1 byte
    loop    @save_args      ; 2 bytes
    ================================
                        = 8 bytes

I haven't seen this type of size optimization in either GCC or LLVM generated code (even with size optimizations enabled), so it's a trick strictly reserved for hand-optimized assembly code.
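To see when the loop starts to pay off, the three variants can be compared for an arbitrary number of pushes. The Python sketch below uses hypothetical helper names and assumes the count fits in a signed byte (1 to 127), so that push n stays 2 bytes.

```python
def immediate_pushes(n):
    """n times 'push 0' (2 bytes each)."""
    return 2 * n

def register_pushes(n):
    """'sub eax,eax' (2 bytes) plus n times 'push eax' (1 byte each)."""
    return 2 + n

def loop_pushes(n):
    """sub eax,eax / push n / pop ecx / push eax / loop: constant size."""
    return 2 + 2 + 1 + 1 + 2

for n in (4, 6, 7, 10):
    print(n, immediate_pushes(n), register_pushes(n), loop_pushes(n))

# Smallest count for which the loop is strictly smaller than plain 'push eax'.
break_even = min(n for n in range(1, 128) if loop_pushes(n) < register_pushes(n))
print(break_even)  # prints 7
```

With 6 values the loop only ties with the push eax sequence, and below that it is larger. Keep in mind that the loop version also overwrites ecx and must never be entered with a count of 0, because loop decrements ecx before testing it.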

Zeroing EDX register

If we intend to zero the edx register, we normally do so with e.g. xor edx,edx, but it can be done in even fewer bytes with the cdq instruction (it stands for Convert Doubleword to Quadword).

The cdq instruction fills the edx register with the sign bit of the eax register (the sign bit is the most significant bit of the register value, so in this case it's bit 31).

So if we know that eax holds a non-negative value, e.g. 1, then executing the cdq instruction will reset edx to zero.

If you are not sure about the content of the eax register (for example, after a function call), you shouldn't use it, because it can lead to errors:

eax=80000001h = 10000000000000000000000000000001b
                ^ most significant bit of the EAX register is set to 1

Executing cdq in this case fills edx with the sign bit of eax, which is 1, so edx will hold 0FFFFFFFFh.

cdq instruction takes only one byte.
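The effect of cdq on edx can be modelled in a few lines. This is a Python sketch of the semantics only (the function name is chosen for illustration), not something that runs on the CPU.

```python
def cdq(eax):
    """Return the value that cdq leaves in edx for a given 32-bit eax."""
    eax &= 0xFFFFFFFF
    # Every bit of edx becomes a copy of bit 31 of eax.
    return 0xFFFFFFFF if eax & 0x80000000 else 0

print(hex(cdq(1)))           # 0x0
print(hex(cdq(0x7FFFFFFF)))  # 0x0
print(hex(cdq(0x80000001)))  # 0xffffffff
```

So cdq is a safe one-byte replacement for xor edx,edx only when eax is known to be below 80000000h.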

Loading values from the 0-255 range into 32-bit CPU registers

    mov     eax,7Fh         ; 5 bytes -> B8 7F 00 00 00

    sub     eax,eax         ; 4 bytes -> 2B C0
    mov     al,7Fh          ;            B0 7F

    push    7Fh             ; 3 bytes -> 6A 7F
    pop     eax             ;            58

It is often necessary to load a value from the 0-255 range into a 32-bit register. We can do it like this:

        mov     eax,4           ; B8 04 00 00 00

This instruction takes 5 bytes. The value 4 is treated as a full 32-bit value that needs 4 bytes to encode. The most size-optimized solution is to push this value onto the stack and pop it back into the CPU register:

    push    4                       ; 6A 04
    pop     eax                     ; 58

This time it takes only 3 bytes. Even though it takes up more space in the source code, it takes up fewer bytes on the disk!

It should be mentioned that the assembler emits the short form of the push instruction only if the value fits in a signed byte, that is, in the range -128 to 127.

If you want to use the short form of the push instruction for a byte value in the 128-255 range, you need to either write it as the equivalent negative number:

    push -128               ; 6A 80

or use a helper macro that emits the bytes directly:

    pushb   macro   byteval
    db      06Ah,byteval
    endm

    pushb   080h    ; push the byte value 80h (128)
    pop     eax

After these instructions are executed, eax will hold the value 0FFFFFF80h (-80h). But why not 00000080h?

The numbers in the range 128-255 in the short version of push instruction are treated as negative numbers (aka sign-extended).

The sign bit from the short encoded integer value is then copied to the upper bits of the CPU register:

    00000000 00000000 00000000 10000000 = 00000080h
                               ^integer sign bit

    11111111 11111111 11111111 10000000 = FFFFFF80h
                               ^signed integer
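The sign extension shown in the diagram can be reproduced with a small model. The Python function below is a sketch with a made-up name; it returns the 32-bit value that ends up in a register after the short push (6A ib) followed by pop.

```python
def push_imm8_pop(byte):
    """32-bit register value after 'push imm8' followed by 'pop reg'."""
    byte &= 0xFF
    if byte & 0x80:
        # Bit 7 is set, so it is copied into bits 8..31.
        return byte | 0xFFFFFF00
    return byte

print(hex(push_imm8_pop(0x04)))  # 0x4
print(hex(push_imm8_pop(0x7F)))  # 0x7f
print(hex(push_imm8_pop(0x80)))  # 0xffffff80
print(hex(push_imm8_pop(0xFF)))  # 0xffffffff
```

Only the bytes 00h to 7Fh survive unchanged, which is why the push/pop trick is limited to the 0-127 range when a zero-extended value is needed.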

There is another trick to make the code a little shorter in case you want to load a value from the 128-255 range as a full 32-bit value:

Standard way:

    mov     eax,255 ; 5 bytes

Size optimized way:

    xor     eax,eax ; 2 bytes
    mov     al,255  ; 2 bytes

The use of error codes returned by functions

This is another of the tricks often overlooked by HLL compilers.

Functions by definition return some values. In the case of WinAPI functions, the returned value is always stored in the eax register.

Depending on the function, returned values can differ and it could be 0, -1, file handle, etc.

For example CreateFileA function returns -1 in eax register when we don't have access to the file we just wanted to open.

But another WinAPI function like CreateIcon returns in eax 0 if there is an error.

Once we have checked the MSDN documentation, we can use those values to our advantage:

    push    ...
    call    LoadBitmapA

The documentation for the LoadBitmapA function says that it returns the handle to the bitmap on success and 0 on error.

    push    ..
    call    LoadBitmapA
    cmp     eax,0           ; 83 F8 00
    jz      @error

cmp eax,0 instruction takes 3 bytes. Can't we do it better? Of course, we can by using logical operations like or or test:

    call    LoadBitmapA
    or      eax,eax         ; 0B C0
    jz      @error

or:

    call    LoadBitmapA
    test    eax,eax         ; 85 C0
    jz      @error

Both the or and test instructions set the CPU zero flag if the eax register holds 0. This gives us the same result as the cmp eax,0 instruction, but with 1 byte less in the output code.

We can optimize it even further by using xchg instruction:

    call    LoadBitmapA
    xchg    eax,ecx         ; 1 byte
    jecxz   @error          ; jecxz instruction takes 2 bytes (the same as jxx short range branches)

The jecxz instruction jumps to the provided label if the ecx register is set to 0.

But there is a catch! The jecxz instruction exists only as a short conditional branch: its target must lie within -128 to +127 bytes of the end of the instruction in the compiled code.

So if your destination, in our case the @error label, is further away than that in the compiled code, you will get an error message from the assembler.

Some assemblers, like the old-school TASM, will automatically translate a jecxz whose destination is out of the short range into:

    call    LoadBitmapA
    xchg    eax,ecx
    jecxz   @dummy

    jmp     @next

@dummy:
    jmp     @error

@next:

Many WinAPI functions return -1 (0FFFFFFFFh) on error. How can we check for it? The simplest way is of course:

    call    CreateFileA
    cmp     eax,-1          ; 83 F8 FF
    je      @error

We can get the same result using much more size optimized code:

    call    CreateFileA
    inc     eax             ; if -1 was returned, the inc instruction sets the EAX register to 0
    je      @error          ; and we can detect it with a conditional JE/JZ instruction
    dec     eax             ; if there wasn't an error, restore the originally returned value

In this case, the resulting code will be 1 byte smaller than the one using cmp eax,-1.
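The control flow of the inc / je / dec sequence can be traced with a small model. This Python sketch uses a hypothetical function name and treats eax as an unsigned 32-bit value; it returns which path is taken and what eax holds at that point.

```python
def check_minus_one(eax):
    """Model of 'inc eax / je @error / dec eax' on a 32-bit register."""
    eax = (eax + 1) & 0xFFFFFFFF   # inc eax: 0FFFFFFFFh wraps around to 0
    if eax == 0:                   # je @error is taken when ZF is set
        return "error", eax
    eax = (eax - 1) & 0xFFFFFFFF   # dec eax restores the returned value
    return "ok", eax

print(check_minus_one(0xFFFFFFFF))  # ('error', 0)
print(check_minus_one(0x1234))      # ('ok', 4660)
```

One side effect worth noting: on the @error path eax holds 0 rather than the original -1, which matters only if the error handler inspects the returned value.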

Exchanging CPU registers values

Say you have a value of 4 stored in the eax register and a value of 98 stored in the edx register. How do we exchange those two registers?

We can do it like this:

    push    eax
    push    edx
    pop     eax
    pop     edx

This takes 4 bytes. We can use a temporary register like this:

    mov     ebx,eax
    mov     eax,edx
    mov     edx,ebx

But this one is even bigger with 6 bytes.

Or we can use this one clever trick using the logical xor instruction:

    xor     edx,eax
    xor     eax,edx
    xor     edx,eax

Still 6 bytes in output code. But there is one overlooked instruction, not used by HLL compilers anymore.

It's called xchg (from eXCHanGe), its size is just 1 byte in output code and it does just what we need:

    xchg    eax,edx         ; 92h

This takes 1 byte, but:

    xchg    edx,esi         ; 87h 0D6h

The xchg instruction takes only 1 byte in output code, but only if one of the exchanged registers is eax. Otherwise it's encoded as 2 bytes.
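The rule behind the two encodings can be written down as a tiny encoder. This is a Python sketch for 32-bit register operands only, with a function name chosen for illustration. The one-byte form is 90h plus the register number; the two-byte form is 87h followed by a ModRM byte. For the two-byte form the operand order placed into the ModRM byte is an assumption here, since an assembler may pick either order for the same exchange.

```python
# 32-bit registers in their hardware encoding order (eax = 0 ... edi = 7).
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def encode_xchg(a, b):
    """Encode 'xchg a,b' for two 32-bit registers as raw bytes."""
    ra, rb = REGS.index(a), REGS.index(b)
    if ra == 0 or rb == 0:
        # Short form: 90h + number of the register that is not eax.
        return bytes([0x90 + (ra or rb)])
    # Long form: opcode 87h, ModRM with mod=11, reg=a, rm=b.
    return bytes([0x87, 0xC0 | (ra << 3) | rb])

print(encode_xchg("eax", "edx").hex())  # 92
print(encode_xchg("eax", "ecx").hex())  # 91
print(encode_xchg("edx", "esi").hex())  # 87d6
```

The short form with eax as both operands gives 90h, which is the encoding of nop.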

You will learn that many other instructions are smaller if you use the eax register e.g.:

    add     edi,400000h     ; 6 bytes -> 81 C7 00 00 40 00
    add     eax,400000h     ; 5 bytes -> 05 00 00 40 00

So it's the same instruction add, but if the eax is used - the output code is 1 byte smaller. Keep that in mind.

CPU string instructions

There is a separate set of string instructions in CPUs. They operate on esi and edi registers only.

Some of those instructions are rarely used by modern compilers, but they have one advantage to us - the size of the output code.

Let's look at this example. We have a simple loop and after each iteration, we increase the value of the esi pointer by 4.

_loop_label:
    ...
    ...
    ...
    add     esi,4
    loop    _loop_label

Easy & simple. But the:

    add esi,4               ; 83 C6 04

instruction takes 3 bytes. We can use the string instruction lodsd instead to make our code shorter, and its effect on esi is exactly the same:

    lodsd                   ; AD     = add esi,4
    lodsw                   ; 66 AD  = add esi,2
    lodsb                   ; AC     = add esi,1

There are 3 variants of this instruction, operating on 32 bit, 16 bit and 8 bit values:

    lodsd                   ; mov eax,dword ptr[esi]
                            ; add esi,4

    lodsw                   ; mov ax,word ptr[esi]
                            ; add esi,2

    lodsb                   ; mov al,byte ptr[esi]
                            ; inc esi

So the optimized loop could look like this:

_loop_label:
    ...
    ...
    ...
    lodsd                   ; mov eax,dword ptr[esi]
                            ; add esi,4
    loop    _loop_label

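So we can use it as a short version of the add esi,4 instruction. Just keep in mind that it reads memory through the pointer in the esi register (so esi cannot hold an arbitrary value, it must point to readable data) and that it overwrites the eax register.

If you need to preserve the value of the eax register you can do it like this, although the three single-byte instructions then take the same 3 bytes as add esi,4, so nothing is saved: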

_loop_label:
    ...
    push    eax

    lodsd

    pop     eax

    loop    _loop_label

There is also a scasX instruction. It compares the value in the eax register with the value pointed to by the edi register and then increases the edi register (if the direction flag DF is 0, which the cld instruction ensures) or decreases it (if DF is 1, which the std instruction sets). It also comes in 3 variants, for 32-bit, 16-bit and 8-bit comparisons. To use it, you need to make sure the edi register points to a valid data buffer. Again, it cannot hold an arbitrary number, because the memory read would then end with an exception (access violation).

So if the register you want to increase is edi, instead of this:

    add     edi,4           ; 83 C7 04

it's better to use:

    scasd                   ; AF
    scasw                   ; 66 AF
    scasb                   ; AE

and it works like this:

    scasd                   ; cmp eax,dword ptr[edi]
                            ; add edi,4

    scasw                   ; cmp ax,word ptr[edi]
                            ; add edi,2

    scasb                   ; cmp al,byte ptr[edi]
                            ; inc edi

The CPU direction flag decides if the value of the edi register is increased or decreased:

    std                     ; set DF (Direction Flag), 1 byte
    scasd                   ; cmp eax,dword ptr[edi]
                            ; sub edi,4

Keep in mind that the direction flag (DF) is always cleared when the application starts, at least for Windows PE executables, and it is also expected to be clear whenever a WinAPI function is called.

So if you ever set it with the std instruction, make sure to reset it afterward with cld, otherwise you might end up with hard-to-find bugs in other parts of the application or in OS components.

    std                     ; set DF (Direction Flag), 1 byte
    lodsd                   ; mov eax,dword ptr[esi]
                            ; sub esi,4
    ...
    ...
    cld                     ; restore DF to its expected default state
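How the direction flag changes the pointer update can be illustrated with a model of lodsd. The Python sketch below is an assumption-laden simplification: memory is a plain bytes object, esi is an index into it, and the function name and the df parameter are made up for the example.

```python
import struct

def lodsd(memory, esi, df=0):
    """Model of lodsd: return (eax, new_esi) for a little-endian buffer."""
    (eax,) = struct.unpack_from("<I", memory, esi)  # mov eax,dword ptr[esi]
    step = -4 if df else 4                          # DF=1 walks backwards
    return eax, esi + step

data = struct.pack("<3I", 0x11111111, 0x22222222, 0x33333333)

print(lodsd(data, 0))        # DF=0: reads the first dword, esi moves to 4
print(lodsd(data, 8, df=1))  # DF=1: reads the last dword, esi moves to 4
```

The same stepping applies to edi with scasd, with a step of 2 or 1 for the word and byte variants.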

The final word

It may seem that all this size optimization doesn't make sense nowadays, but it can come in handy when, for example, you write shellcode or need to modify compiled code using as few instructions as possible because the space available is very limited.

If you want to learn more, you can read my free articles about programming (assembler, C/C++), malware analysis, and reverse engineering.