This week-end I spent a lot of time trying to optimize the code of my incoming video game. The game engine features a fluid simulation algorithm based on this excellent paper by Jos Stam.
The algorithm is CPU intensive, and so I tried my best to make the code as fast as possible, which eventually means having to write the code in assembly.
Since I couldn’t find much information about how to write assembly code for iOS on XCode, here is a small tutorial I wrote about it.
GCC inline assembly
XCode recognises any file ending with “.s” in a project as assembly. Any
function declared in the file can then be called from the C code. For small
chunk of assembly, though, there is a better way: using the C asm
statement,
that allows to embed and call assembly code directly from C. For an overview
of the syntax you can refer to this documentation.
Let’s write a basic example: a function that adds two integers together using
the ARM add
instruction. Needless to say, for such a simple example there is
no advantage to writing assembly, the C version would result in exactly the
same binary code.
static int add_two_int(int x, int y) __attribute__((noinline));
#ifdef __arm__
static int add_two_int(int x, int y) {
int ret;
asm volatile (
"add %[ret], %[x], %[y]"
// outputs
: [ret]"=r"(ret)
// inputs
: [x]"r"(x), [y]"r"(y)
);
return ret;
}
#else
static int add_two_int(int x, int y) {
return x + y;
}
#endif
A few notes about this code:
Why the
__attribute__((noinline))
? I marked the function as static - as you always should for function local to a file. But since I want to check the produced asm code in XCode, it is nice to force the compiler not to inline the function. Clang recognises most of GNU gcc attributes, so we can use it thenoinline
attribute to mark the function as not to be inlined.I wrote a fall-back version of the code in plain C so that I support not only ARM but also other architectures. For example the new iPhone 5 are now using arm64, so in that example XCode would compile the arm64 version using the fallback C code (if we wanted to also provide an arm64 version of the asm function we could have tested the
__arm64__
macro.In the assembly code, I never actually use a named register. Instead I aliased
ret
,x
andy
in the output and input lists, then I can refer to them in the asm as %[ret], % , %[y]. The compiler will automatically assign them to one of the ‘r’ registers. This is the nice thing about inline assembly.The volatile keyword is used to let clang knows that it should not try to optimize the assembly code (yes it would do that!) Of course in that case I don’t see how we could optimize such a simple function.
Some people would use
__asm__
instead ofasm
, and__volatile__
instead ofvolatile
. Those are exactly the same. the versions with underscores may be used to prevent clash with already defined functions or variables (and are mandatory if you want to code to compile in strict ansi mode.) I prefer to use the short versions here for simplicity.
OK, so we can put this in our code and then use the function add_two_int
as
if it was a normal static C function (and in fact it is!)
XCode is nice enough to let us see what the compiled version looks like. For
that we have to click on the tuxedo looking like button in the top right of the
IDE, then in the drop down menu select “assembly”. A quick search for
add_two_int
reveals the actual assembly code produced by the compiler:
_add_two_int:
add r0, r0, r1
bx lr
As we see, the compiler decided to use the r0 register for both ret
and x
,
and the r1 register for y
. This is a clever choice, since the C function
call convention is using r0 and r1 to pass the two first parameters, and r0 as
the output. Also note that the actual function name starts with an underscore,
as specified by the C ABI.
A bigger example, with NEON instructions
Now let’s try with a slightly more interesting example. We are going to use the vector instructions of the NEON arm extension to quickly divide all the values of a 16 bit integer array by 2. To make things really simple, we are going to assume that the size of the array is a multiple of 8 and that it is memory aligned to 128 bit.
In this new version of the code, we are using the NEON instruction vst1.16 that can right shift eight 16bit integers in parallel.
#if defined __arm__ && defined __ARM_NEON__
static void div_by_2(int16_t *x, int n)
{
assert(n % 8 == 0);
assert(x % 16 == 0);
asm volatile (
"Lloop: \n\t"
"vld1.16 {q0}, [%[x]:128] \n\t"
"vshr.s16 q0, q0, #1 \n\t"
"vst1.16 {q0}, [%[x]:128]! \n\t"
"sub %[n], #8 \n\t"
"cmp %[n], #0 \n\t"
"bne Lloop \n\t"
// output
: [x]"+r"(x), [n]"+r"(n)
:
// clobbered registers
: "q0", "memory"
);
}
#else
static void div_by_2(int16_t *x, int n)
{
int i;
for (i = 0; i < n; i++)
x[i] /= 2;
}
#endif
A few notes:
Since we are using the ARM NEON extension, we have to check for
__ARM_NEON__
as well as__arm__
. Just testing for__ARM_NEON__
would generate an error when compiling for arm64.The loop label has to start with an upper case ‘L’ to tell the compiler that it is a local label. Apparently, if you don’t do it and the compiler decides to duplicate the assembly code, you might get a compilation error.
Don’t forget to specify the output as read/write (with
"+r"
). If you do not, you might run into strange run-time bugs because the compiler will assume thatx
andn
have not changed after your function call and might optimize the calling function accordingly.All the lines end with “\n” (end of line) followed by a “\t” (tab). This make sure that your assembly code will be correctly indented.
Conclusion
Being able to write assembly code is actually the easy part. What is difficult it to write efficient assembly code. By efficient I mean that the code should be faster than what the compiler would have produced from the C code. I am not that good at assembly, and that’s why I am not going to claim that the assembly code I put in this example will actually be faster than the C version. In fact, with the memory access time being so slow on iPhone, I wouldn’t be surprised if both versions have similar performances (which is why measuring the performance is so important).
In a follow up post I will try to compare the speed we can achieve with NEON assembly compared to plain C version on an iPhone 4.
And if you want more information about ARM and NEON on iOS, here is a detailed article I found online that goes much deeper into the topic: