This week-end I spent a lot of time trying to optimize the code of my incoming video game. The game engine features a fluid simulation algorithm based on this excellent paper by Jos Stam.

The algorithm is CPU intensive, and so I tried my best to make the code as fast as possible, which eventually means having to write the code in assembly.

Since I couldn’t find much information about how to write assembly code for iOS on XCode, here is a small tutorial I wrote about it.

GCC inline assembly

XCode recognises any file ending with “.s” in a project as assembly. Any function declared in the file can then be called from the C code. For small chunk of assembly, though, there is a better way: using the C asm statement, that allows to embed and call assembly code directly from C. For an overview of the syntax you can refer to this documentation.

Let’s write a basic example: a function that adds two integers together using the ARM add instruction. Needless to say, for such a simple example there is no advantage to writing assembly, the C version would result in exactly the same binary code.

static int add_two_int(int x, int y) __attribute__((noinline));

#ifdef __arm__

static int add_two_int(int x, int y) {
    int ret;
    asm volatile (
        "add %[ret], %[x], %[y]"
        // outputs
        : [ret]"=r"(ret)
        // inputs
        : [x]"r"(x), [y]"r"(y)
    return ret;


static int add_two_int(int x, int y) {
    return x + y;


A few notes about this code:

  • Why the __attribute__((noinline))? I marked the function as static - as you always should for function local to a file. But since I want to check the produced asm code in XCode, it is nice to force the compiler not to inline the function. Clang recognises most of GNU gcc attributes, so we can use it the noinline attribute to mark the function as not to be inlined.

  • I wrote a fall-back version of the code in plain C so that I support not only ARM but also other architectures. For example the new iPhone 5 are now using arm64, so in that example XCode would compile the arm64 version using the fallback C code (if we wanted to also provide an arm64 version of the asm function we could have tested the __arm64__ macro.

  • In the assembly code, I never actually use a named register. Instead I aliased ret, x and y in the output and input lists, then I can refer to them in the asm as %[ret], % , %[y]. The compiler will automatically assign them to one of the ‘r’ registers. This is the nice thing about inline assembly.

  • The volatile keyword is used to let clang knows that it should not try to optimize the assembly code (yes it would do that!) Of course in that case I don’t see how we could optimize such a simple function.

  • Some people would use __asm__ instead of asm, and __volatile__ instead of volatile. Those are exactly the same. the versions with underscores may be used to prevent clash with already defined functions or variables (and are mandatory if you want to code to compile in strict ansi mode.) I prefer to use the short versions here for simplicity.

OK, so we can put this in our code and then use the function add_two_int as if it was a normal static C function (and in fact it is!)

XCode is nice enough to let us see what the compiled version looks like. For that we have to click on the tuxedo looking like button in the top right of the IDE, then in the drop down menu select “assembly”. A quick search for add_two_int reveals the actual assembly code produced by the compiler:

    add r0, r0, r1
    bx lr

As we see, the compiler decided to use the r0 register for both ret and x, and the r1 register for y. This is a clever choice, since the C function call convention is using r0 and r1 to pass the two first parameters, and r0 as the output. Also note that the actual function name starts with an underscore, as specified by the C ABI.

A bigger example, with NEON instructions

Now let’s try with a slightly more interesting example. We are going to use the vector instructions of the NEON arm extension to quickly divide all the values of a 16 bit integer array by 2. To make things really simple, we are going to assume that the size of the array is a multiple of 8 and that it is memory aligned to 128 bit.

In this new version of the code, we are using the NEON instruction vst1.16 that can right shift eight 16bit integers in parallel.

#if defined __arm__ && defined __ARM_NEON__

static void div_by_2(int16_t *x, int n)
    assert(n % 8 == 0);
    assert(x % 16 == 0);
    asm volatile (
        "Lloop:                         \n\t"
        "vld1.16    {q0}, [%[x]:128]    \n\t"
        "vshr.s16   q0, q0, #1          \n\t"
        "vst1.16    {q0}, [%[x]:128]!   \n\t"
        "sub        %[n], #8            \n\t"
        "cmp        %[n], #0            \n\t"
        "bne Lloop                      \n\t"
        // output
        : [x]"+r"(x), [n]"+r"(n)
        // clobbered registers
        : "q0", "memory"


static void div_by_2(int16_t *x, int n)
    int i;
    for (i = 0; i < n; i++)
        x[i] /= 2;


A few notes:

  • Since we are using the ARM NEON extension, we have to check for __ARM_NEON__ as well as __arm__. Just testing for __ARM_NEON__ would generate an error when compiling for arm64.

  • The loop label has to start with an upper case ‘L’ to tell the compiler that it is a local label. Apparently, if you don’t do it and the compiler decides to duplicate the assembly code, you might get a compilation error.

  • Don’t forget to specify the output as read/write (with "+r"). If you do not, you might run into strange run-time bugs because the compiler will assume that x and n have not changed after your function call and might optimize the calling function accordingly.

  • All the lines end with “\n” (end of line) followed by a “\t” (tab). This make sure that your assembly code will be correctly indented.


Being able to write assembly code is actually the easy part. What is difficult it to write efficient assembly code. By efficient I mean that the code should be faster than what the compiler would have produced from the C code. I am not that good at assembly, and that’s why I am not going to claim that the assembly code I put in this example will actually be faster than the C version. In fact, with the memory access time being so slow on iPhone, I wouldn’t be surprised if both versions have similar performances (which is why measuring the performance is so important).

In a follow up post I will try to compare the speed we can achieve with NEON assembly compared to plain C version on an iPhone 4.

And if you want more information about ARM and NEON on iOS, here is a detailed article I found online that goes much deeper into the topic: