About compiler optimizations

I came into some interesting stuff some time ago, while trying to detect performance issues in some ANSI-C software.

Sometimes, while coding, it may be difficult to decide whether to use a recursive function or a loop, and if so, what kind of loop.
The best example, is the factorial computation. That example is typically used to teach recursive function.

I always felt something strange with those examples. Why the hell using a recursive function for a factorial computation, since we can use a while loop, for instance?

If you think about machine code, or assembly, calling a procedure implies to copy the values of some registers (instruction pointer, etc), to execute a jump into another portion of memory (maybe in another segment), to execute the code, to restore the registers, and finally to return to the previous memory location (RETF).
In pure assembly, a simple loop, using the CX register, on Intel, is far more efficient, as there's no jump, nor stack operation to copy the register values.

So what about C? Take a look at this typical example:

unsigned long factorial( unsigned long n )
{
    if( n == 0 )
    {
        return 1;
    }

    return n * factorial( n - 1 );
}

The same can be achieved with a while loop:

unsigned long x = n;

if( n == 0 )
{
    x = 1;
}
else
{
    while( n > 1 )
    {
        x *= --n;
    }
}

We can also choose to use a for loop:

unsigned long x = n;

if( n == 0 )
{
    x = 1;
}

for( ; n > 1 ; )
{
    x = --n * x;
}

In pure assembly, we would also write a loop. So with GCC inline assembly:

unsigned long x;

if( n == 0 )
{
    x = 1;
}

__asm__
(
    "    movq %[n], %%rax \n\t"
    "    movq %[n], %%rcx \n\t"
    "    decq       %%rcx \n\t"
    "FACTORIAL:"
    "    mulq %%rcx \n\t"
    "    decq %%rcx \n\t"
    "    jnz FACTORIAL \n\t"
    "    movq %%rax, %[x]"

    : [ x ] "=m" ( x )
    : [ n ] "m"  ( n )
);

Which one will be the fastest? At first sight, I would have said inline assembly, while, for, and finally the recursive function.

But that's not the case.
Running each case 10000000 time gives the following results:

Recursive:  0.64492511749267578125
While:      1.47788429260253906250
Assembler:  0.62227106094360351562
For:        1.47960233688354492188

Inline assembly is the fastest, but the recursive function is very very close.
The while and for loops takes much longer.

I was very surprised. Why such results?
The answer is in the way GCC optimizes to source code to generate intermediate object code.

If we take a look closer at what GCC does, we can see it optimizes for us a lot of things.

For instance, our recusrisve function, in pure assembly, generated by GCC:

LFE7:
.globl _factorial_recursive
_factorial_recursive:
LFB8:
    pushq	%rbp
LCFI3:
    movq	%rsp, %rbp
LCFI4:
    subq	$16, %rsp
LCFI5:
    movq	%rdi, -8(%rbp)
    cmpq	$0, -8(%rbp)
    jne	L4
    movq	$1, -16(%rbp)
    jmp	L6
L4:
    movq	-8(%rbp), %rdi
    decq	%rdi
    call	_factorial_recursive
    movq	%rax, %rdx
    imulq	-8(%rbp), %rdx
    movq	%rdx, -16(%rbp)
L6:
    movq	-16(%rbp), %rax
    leave
    ret

And now the while loop:

LFB9:
    pushq	%rbp
LCFI6:
    movq	%rsp, %rbp
LCFI7:
    movq	%rdi, -24(%rbp)
    movq	-24(%rbp), %rax
    movq	%rax, -8(%rbp)
    cmpq	$0, -24(%rbp)
    jne	L12
    movq	$1, -32(%rbp)
    jmp	L11
L13:
    decq	-24(%rbp)
    movq	-8(%rbp), %rax
    imulq	-24(%rbp), %rax
    movq	%rax, -8(%rbp)
L12:
    cmpq	$1, -24(%rbp)
    ja	L13
    movq	-8(%rbp), %rax
    movq	%rax, -32(%rbp)
L11:
    movq	-32(%rbp), %rax

The same applies for the for loop.

We can see here the compiler (GCC for instance) did a HUGE optimization of our source code.
It means actually that, except if you're a assembly genius, you probably won't code that will run faster, because you did it with inline assembly.
Nowadays, compilers are so efficient that some optimizations just won't work.

In doubt, always remember to benchmark your code, and always remember that compiler can usually generate intermediate assembly code, so you can check what your code will really look like, when executed by the CPU.

Comments

Author
Ossama
Date
02/04/2011 16:41
Great article, thanks !
Author
Jean-David Gadina
Date
02/04/2011 20:08
You're welcome ; )