This document is Copyright 1994 ARM Ltd, and has been included on this disc with their kind permission. This manual is supplied "as is"; ARM Limited ("ARM") makes no warranty, express or implied, of the merchantability of this document or its fitness for any particular purpose. In no circumstances shall ARM be liable for any damage, loss of profits, or any indirect or consequential loss arising out of the use of these recipes or inability to use these recipes, even if ARM has been advised of the possibility of such loss.

---------------------------------------------------------------------------

3. Exploring ARM Assembly Language
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3.1 Integer to String conversion
--------------------------------

3.1.1 About this Recipe
-----------------------
This recipe shows you:

  how to convert an integer to a string in ARM assembly language;
  how to use a stack in an ARM assembly language program;
  how to write a recursive function in ARM assembly language.

This recipe refers to the program utoa1.s in the examples directory. Its dtoa entry point converts a signed integer to a string of decimal digits (possibly with a leading '-'); its utoa entry point converts an unsigned integer to a string of decimal digits.

3.1.2 The Algorithm
-------------------
To convert a signed integer to a decimal string: generate a '-' and negate the number if it is negative; then convert the remaining unsigned value.

To convert a given unsigned integer to a decimal string, divide it by 10, yielding a quotient and a remainder. The remainder is in the range 0-9 and is used to create the last digit of the decimal representation. If the quotient is non-zero, it is dealt with in the same way as the original number, creating the leading digits of the decimal representation; otherwise the process has finished.
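The algorithm can be sketched in C (a hypothetical helper for illustration, not the toolkit's utoa; the remainder is formed with shifts and subtracts, just as the assembly version does):

```c
/* Recursive unsigned-to-ASCII conversion: convert the quotient first so
   the leading digits land first, then append the remainder digit.
   Returns a pointer just past the last digit written, as utoa does in a1. */
char *utoa_sketch(char *buf, unsigned int n)
{
    unsigned int q = n / 10;                   /* what udiv10 computes */
    unsigned int r = n - (q << 3) - (q << 1);  /* n - 8q - 2q = n mod 10 */
    if (q != 0)
        buf = utoa_sketch(buf, q);             /* leading digits */
    *buf++ = (char)('0' + r);                  /* final digit */
    return buf;
}
```

Calling utoa_sketch with a buffer and 105 writes the three characters '1', '0', '5' and returns a pointer three places past the start of the buffer.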
3.1.3 The Implementation
------------------------
The implementation of utoa sketched below employs the register naming and usage conventions of the ARM Procedure Call Standard: a1-a4 are argument or scratch registers and a1 is the function result register; v1-v5 are 'variable' registers (preserved across function calls); sp is the stack pointer; at routine entry, lr holds the subroutine call return address; and pc is the program counter.

  utoa  STMFD   sp!, {v1, v2, lr}   ; function entry - save some v-registers
                                    ; and the return address
        MOV     v1, a1              ; preserve arguments over following
        MOV     v2, a2              ; function calls
        MOV     a1, a2
        BL      udiv10              ; a1 = a1 / 10
        SUB     v2, v2, a1, LSL #3  ; number - 8*quotient
        SUB     v2, v2, a1, LSL #1  ;        - 2*quotient = remainder
        CMP     a1, #0              ; quotient non-zero?
        MOVNE   a2, a1              ; quotient to a2...
        MOV     a1, v1              ; buffer pointer unconditionally to a1
        BLNE    utoa                ; conditional recursive call to utoa
        ADD     v2, v2, #'0'        ; final digit
        STRB    v2, [a1], #1        ; store digit at end of buffer
        LDMFD   sp!, {v1, v2, pc}   ; function exit - restore and return

Explanation
-----------
On entry, a2 contains the unsigned integer to be converted and a1 addresses a buffer to hold the character representation of it. On exit, a1 points immediately after the last digit written.

Both the buffer pointer and the original number have to be saved across the call to udiv10. This could be done by saving the values to memory. However, it turns out to be more efficient to use two 'variable' registers, v1 and v2 (which, in turn, have to be saved to memory). (An instructive exercise for the reader is to rework this example with a1 and a2 saved to memory in the initial STMFD, rather than v1 and v2.)

Because utoa calls other functions (udiv10 and utoa) it must save its return link address passed in lr. The function therefore begins by stacking v1, v2 and lr using STMFD sp!, {v1,v2,lr}.
In the next block of code, a1 and a2 are saved (across the call to udiv10) in v1 and v2 respectively, and the given number (a2) is moved to the first argument register (a1) before calling udiv10 with a BL (Branch with Link) instruction.

On return from udiv10, 10 times the quotient is subtracted from the original number (preserved in v2) by two SUB instructions. The remainder (in v2) is ready to be converted to character form (by adding ASCII '0') and to be stored into the output buffer. But first, utoa has to be called to convert the quotient, unless that is zero. The next four instructions do this, comparing the quotient (in a1) with 0, moving the quotient to the second argument register (a2) if not zero, moving the buffer pointer to the first argument/result register (a1), and calling utoa if the quotient is not zero.

Note that the buffer pointer is moved to a1 unconditionally: if utoa is called recursively then a1 will be updated, but it will still identify the next free buffer location; if utoa is not called recursively, the next free buffer location is still needed in a1 by the following code which plants the remainder digit and returns the updated buffer location (via a1).

The remainder (in a2) is converted to character form by adding '0' and is then stored in the location addressed by a1. A post-incrementing STRB is used, which stores the character and increments the buffer pointer in a single instruction, leaving the result value in the result register a1.

Finally, the function is exited by restoring the saved values of v1 and v2 from the stack, loading the stacked link address into pc and popping the stack, all with a single multiple load instruction LDMFD sp!, {v1,v2,pc}.

3.1.4 Creating a Runnable Example
---------------------------------
You can run the utoa routine described here under armsd.
To do this, you must assemble the example and the udiv10 function, compile a simple test harness written in C, and link the resulting objects together to create a runnable program. Begin by setting your current directory to the examples directory, then use the following commands:

  armasm utoa1.s -o utoa1.o -li
  armasm udiv10.s -o udiv10.o -li
  armcc -c utoatest.c -li -apcs 3/32bit
  armlink -o utoatest utoa1.o udiv10.o utoatest.o somewhere/armlib.32l

where somewhere is the directory in which armlib.32l can be found.

Explanation
-----------
The first two armasm commands assemble the utoa function and the udiv10 function, creating relocatable object files utoa1.o and udiv10.o. The -li flag tells armasm to assemble for a little-endian memory. You can omit this flag if your armasm has been configured for this default (see The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for how to configure the ARM development tools).

The armcc command compiles the test harness. The -c flag tells armcc not to link its output with the C library; the -li flag tells armcc to compile for a little-endian memory (as with armasm).

The armlink command links your three relocatable objects with the ARM C library to create a runnable program (here called utoatest). If you have installed your ARM development tools in a standard way, then you could use the following shorter command to do the compilation and linking:

  armcc utoatest.c utoa1.o udiv10.o -apcs 3/32bit -li

3.1.5 Running the Example
-------------------------
You can run your example program under armsd using:

  armsd -li utoatest

Note that the -li and -apcs 3/32bit options can be omitted if the tools have been configured appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for details.

3.1.6 Stacks in Assembly Language
---------------------------------
In this example, three words are pushed on to the stack on entry to utoa and popped off again on exit.
By convention, ARM software uses r13, usually called sp, as a stack pointer pointing to the last-used word of a downward growing stack (a so-called 'full, descending' stack). However, this is only a convention, and the ARM instruction set supports equally all four stacking possibilities: {full or empty} x {ascending or descending}.

The instruction used to push values on the stack was:

  STMFD sp!, {v1, v2, lr}

The action of this instruction is as follows:

  subtract 4 * number-of-registers from sp;
  store the registers named in {...} in ascending register number order
  to memory at [sp], [sp,4], [sp,8] ...

The matching pop instruction was:

  LDMFD sp!, {v1, v2, pc}

Its action is:

  load the registers named in {...} in ascending register number order
  from memory at [sp], [sp,4], [sp,8] ...
  add 4 * number-of-registers to sp.

Discussion
----------
Many, if not most, register-save requirements in simple assembly language programs can be met using this approach to stacks. A more complete treatment of run-time stacks requires a discussion of:

  stack-limit checking (and extension);
  local variables and stack frames.

In the utoa program, you must assume the stack is big enough to deal with the maximum depth of recursion, as the code makes no check of this. In practice, this assumption is sound. The biggest 32-bit unsigned integer is about four billion, or ten decimal digits, so the recursion is at most ten deep. This means that at most 10 x 3 registers x 4 bytes = 120 bytes have to be stacked. Because the ARM Procedure Call Standard (APCS) guarantees that there are at least 256 bytes of stack available when a function is called, and because we can guess (or know) that udiv10 uses no stack space, we can be confident that utoa is quite safe if called by an APCS-conforming caller such as a compiled-C test harness.

This discussion raises another delicacy. The stacking technique illustrated here conforms to the ARM Procedure Call Standard only if the function using it makes no function calls.
utoa calls both udiv10 and itself, so it really ought to establish a proper stack frame (see ARM Procedure Call Standard starting on page 38 of the Technical Specifications). If you really want to be safe and write functions that can 'plug and play together', you have to follow the APCS exactly. However, when writing a whole program in assembly language you often know much more than when writing a program fragment for general, robust service. This allows you to gently break the APCS in the following way:

  Any chain of function/subroutine calls can be considered compatible
  with the APCS provided it uses less than 256 bytes of stack space.

So the utoa example is compatible with the APCS even though it doesn't conform to the APCS. Note however: if you call any function whose stack use is unknown (but which is believed to be APCS-conforming), you court disaster unless you establish a proper APCS call frame and perform APCS stack limit checking on function entry. Please refer to ARM Procedure Call Standard starting on page 38 of the Technical Specifications for further details.

3.1.7 Related Topics
--------------------
For more information about stacks, and conforming to the ARM Procedure Call Standard, see:

  Flexibility of Load and Store Multiple starting on page 13;
  Register Usage under the ARM Procedure Call Standard starting on page 62;
  Stack Overflow Checking starting on page 93.

3.2 Multiply by a Constant
--------------------------

3.2.1 About this Recipe
-----------------------
This recipe shows you how to construct a sequence of ARM instructions to multiply by a constant. For some applications multiply is used extensively, so it is important to make the application run as fast as possible. For instance, most DSP (Digital Signal Processing) applications perform a lot of multiplication. In many cases where a multiply is used, one of the values is a constant (eg. weeks*7).
A naive programmer might assume that the only way to calculate this would be to use the MUL instruction, but there is an alternative: this recipe demonstrates how to improve the speed of multiply-by-constant by using a sequence of arithmetic instructions instead of the general-purpose multiplier.

3.2.2 Introduction
------------------
Throughout this recipe, registers are referred to using register names (eg. Rd, Rm, Rs), but in assembler source code you should use only register names which have been declared using the RN directive (eg. a1, r4 etc.). This recipe does not refer to any example programs; it should be viewed as an explanation of the multiply-by-constant technique.

MUL has the following syntax:

  MUL Rd, Rm, Rs

The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that a MUL with Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive takes 1S + mI cycles. For more details on the multiplier, see ARM6 Multiplier Performance Issues starting on page 38. There is, of course, no guarantee that MUL will not be implemented differently (possibly faster) in the future...

When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:

  ADD Rd, Rm, Rm, LSL #2    ; Rd = Rm + (Rm * 4) = Rm * 5

This ADD version is obviously better than the MUL version below:

  MOV Rs, #5
  MUL Rd, Rm, Rs

The 'cost' of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself.

3.2.3 The problem of finding the optimum sequence
-------------------------------------------------
The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each.
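The equivalence behind such sequences is easy to convince yourself of in C; for instance, the multiply-by-5 ADD above computes (a quick check, not toolkit code):

```c
/* Multiply by 5 with one shift-and-add, mirroring
   ADD Rd, Rm, Rm, LSL #2: x*5 = x + x*4. */
unsigned int mul5(unsigned int x)
{
    return x + (x << 2);
}
```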
Consider multiply by 105. This could be achieved by decomposing thus:

  105 == 128 - 23 == 128 - (16 + 7) == 128 - (16 + (8 - 1))

  RSB Rd, Rm, Rm, LSL #3    ; Rd = Rm*7
  ADD Rd, Rd, Rm, LSL #4    ; Rd = Rm*7 + Rm*16 = Rm*23
  RSB Rd, Rd, Rm, LSL #7    ; Rd = -Rm*23 + Rm*128 = Rm*105

Or, decomposing differently:

  105 == 15 * 7 == (16 - 1) * (8 - 1)

  RSB Rt, Rm, Rm, LSL #4    ; Rt = Rm*15 (tmp reg)
  RSB Rd, Rt, Rt, LSL #3    ; Rd = Rt*7 = Rm*105

The second method is the optimal solution (fairly easy to find for small values such as 105). However, the problem of finding the optimum becomes much more difficult for larger constant values. A program can be written to search exhaustively for the optimum, but it may take a long time to execute. There are no known algorithms which solve this problem quickly.

Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a large constant, more than one temporary may be needed, otherwise the sequence will be longer. The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off: it uses a normal MUL if the number of instructions in the multiply-by-constant sequence would exceed some number N. This avoids the sequence becoming too long.

3.2.4 Experimenting with armcc assembly output
----------------------------------------------
When writing a speed-critical ARM assembler program, it is a good idea to code it in C first (to check the algorithm) before converting it to hand-tuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work. Invoking armcc with the -S flag will generate an assembly file instead of an object file. For example, consider the following simple C code:

  int mulby105( int num )
  {
      return num * 105;
  }

Compile this using:

  armcc -li -S mulby105.c

Now, examine the file mulby105.s which has been created.
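Both decompositions can themselves be verified with a quick C check (shifts, adds and reverse subtracts only, mirroring the two ARM sequences above):

```c
/* First decomposition: 105 = 128 - (16 + (8 - 1)). */
unsigned int mul105_first(unsigned int x)
{
    unsigned int d = (x << 3) - x;      /* RSB: x*7 */
    d = d + (x << 4);                   /* ADD: x*7 + x*16 = x*23 */
    return (x << 7) - d;                /* RSB: x*128 - x*23 = x*105 */
}

/* Second (optimal) decomposition: 105 = (16 - 1) * (8 - 1). */
unsigned int mul105_second(unsigned int x)
{
    unsigned int t = (x << 4) - x;      /* RSB: x*15 */
    return (t << 3) - t;                /* RSB: t*8 - t = x*105 */
}
```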
The version of armcc in this release may not produce precisely this output, although it should be very similar:

  ; generated by Norcroft ARM C vsn 4.41 (Advanced RISC Machines)

          AREA |C$$code|, CODE, READONLY

  |x$codeseg|

          EXPORT mulby105
  mulby105
          RSB     a1,a1,a1,LSL #4
          RSB     a1,a1,a1,LSL #3
          MOV     pc,lr

          AREA |C$$data|,DATA

  |x$dataseg|

          END

Notice that the compiler has found the short multiply-by-constant sequence.

3.2.5 Discussion of speed improvement
-------------------------------------
To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 (which is approximately 8192*sqrt(2)) as an example. A normal multiply consists of:

  MOV Rs, #0x2D << 8        ; load constant
  ORR Rs, Rs, #0x41         ; load constant, now Rs = 11585
  MUL Rd, Rm, Rs            ; do the multiply

The load-a-constant stage may take up to four instructions (in this case 2) or an LDR,=, and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the product (on ARM6-based processors).

The optimal multiply-by-constant sequence consists of:

  ADD Rd, Rm, Rm, LSL #1    ; Rd = Rm*3
  RSB Rd, Rd, Rd, LSL #4    ; Rd = Rd*15 = Rm*45
  ADD Rd, Rm, Rd, LSL #8    ; Rd = Rm + Rd*256 = Rm*11521
  ADD Rd, Rd, Rm, LSL #6    ; Rd = Rd + Rm*64 = Rm*11585

This is just 4 data processing instructions.

  Method                Cycles
  ------                ------
  Normal multiply       3 instructions + MUL internal cycles
  Multiply-by-constant  4 instructions

Considering the ARM6 family, the 2-bit Booth's Multiplier used by MUL takes a number of I-cycles depending on the value in Rs (in this case m=8, as Rs lies between 8192 and 32767). Hence multiply-by-constant looks to be the winner in this case. An instruction fetch is an external memory S-cycle on the ARM60, or a cache F-cycle (if there is a cache hit) on cached processors like the ARM610. With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core.
This means that the general multiply can sometimes be the fastest option (for large constants where an efficient solution cannot be found); it should also use less memory. If the load-a-constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply, as there is only the MUL to execute.

  Method                Cycles on ARM60   Cycles on ARM610
  ------                ---------------   ----------------
  Normal multiply       3S + 8I           11F
  Multiply-by-constant  4S                4F

3.2.6 Related Topics
--------------------
  Digital Signal Processing on the ARM starting on page 35;
  ARM6 Multiplier Performance Issues starting on page 38;
  Multiplication - Returning a 64-bit Result starting on page 69;
  ARM Assembly Programming Performance Issues starting on page 55.

3.3 Division by a Constant
--------------------------

3.3.1 About this Recipe
-----------------------
This recipe shows you:

  how to improve on the general divide code for the case when the divisor is a constant;
  the simple case for divide-by-2^n using the barrel shifter;
  how to use divc.c to generate ARM code for divide-by-constant.

3.3.2 Introduction
------------------
The ARM instruction set was designed following a RISC philosophy. One of the consequences of this is that the ARM core has no divide instruction, so divides must be performed using a subroutine. This means that divide can be quite slow, but this is not a major issue as divide performance is rarely critical for applications.

It is possible to do better than the general divide in the special case when the divisor is a constant. This recipe explains how the divide-by-constant technique works, and how to generate ARM assembler code for divide-by-constant.

3.3.3 Special case for divide-by-2^n
------------------------------------
In the special case when dividing by 2^n, a simple right shift is all that is required (instead of a left shift, which multiplies by a power of 2). There is a small caveat which concerns the handling of signed and unsigned numbers.
For signed numbers, an arithmetic right shift is required, as this performs sign extension (to handle negative numbers correctly). In contrast, unsigned numbers require a 0-filled logical shift right. Please refer to an ARM datasheet for more details of the difference between arithmetic and logical shifts.

  MOV a2, a1, lsr #5        ; unsigned division by 32
  MOV a2, a1, asr #10       ; signed division by 1024

3.3.4 Explanation of divide-by-constant ARM code
------------------------------------------------
The divide-by-constant technique basically does a multiply in place of the divide, but it is somewhat more complicated than multiply-by-constant (see Multiply by a Constant starting on page 24):

  x/y == x * (1/y)
           ^^^^^
           consider the underlined portion as a 0.32 fixed-point number
           (truncating any bits past the most significant 32). 0.32 means
           0 bits before the decimal point and 32 after it.

      == (x * (2^32/y)) / 2^32
              ^^^^^^^^
              the underlined portion here is a 32.0 bit fixed-point number

      == (x * (2^32/y)) >> 32

This is effectively returning the top 32 bits of the 64-bit product of x and (2^32/y). If y is a constant, then (2^32/y) is obviously also a constant.
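The identity can be demonstrated in C using a 64-bit intermediate product (here for y=10; a sketch for checking, not toolkit code). Because the reciprocal is truncated to 32 bits, the result can be one too small, which is why the divide-by-constant code needs a fix-up step:

```c
#include <stdint.h>

/* x/10 as a multiply by the 0.32 fixed-point reciprocal 2^32/10,
   keeping the top 32 bits of the 64-bit product. Truncating the
   reciprocal (429496729 = trunc(2^32/10)) loses a little precision,
   so the result is either x/10 or one less. */
uint32_t div10_approx(uint32_t x)
{
    const uint32_t recip = 429496729u;
    return (uint32_t)(((uint64_t)x * recip) >> 32);
}
```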
For certain y, the reciprocal (2^32/y) is a repeating pattern in binary:

   y  (2^32/y)
   2  10000000000000000000000000000000 #
   3  01010101010101010101010101010101 *
   4  01000000000000000000000000000000 #
   5  00110011001100110011001100110011 *
   6  00101010101010101010101010101010 *
   7  00100100100100100100100100100100 *
   8  00100000000000000000000000000000 #
   9  00011100011100011100011100011100 *
  10  00011001100110011001100110011001 *
  11  00010111010001011101000101110100
  12  00010101010101010101010101010101 *
  13  00010011101100010011101100010011
  14  00010010010010010010010010010010 *
  15  00010001000100010001000100010001 *
  16  00010000000000000000000000000000 #
  17  00001111000011110000111100001111 *
  18  00001110001110001110001110001110 *
  19  00001101011110010100001101011110
  20  00001100110011001100110011001100 *
  21  00001100001100001100001100001100 *
  22  00001011101000101110100010111010
  23  00001011001000010110010000101100
  24  00001010101010101010101010101010 *
  25  00001010001111010111000010100011

The lines marked with a # are the special cases 2^n, which have already been dealt with. The lines marked with a * have a simple repeating pattern. Note how regular the patterns are for y=2^n+2^m or y=2^n-2^m (for n>m)...

  n  m  (2^n+2^m)        n  m  (2^n-2^m)
  1  0       3           1  0       1
  2  0       5           2  1       2
  2  1       6           2  0       3
  3  0       9           3  2       4
  3  1      10           3  1       6
  3  2      12           3  0       7
  4  0      17           4  3       8
  4  1      18           4  2      12
  4  2      20           4  1      14
  4  3      24           4  0      15
  5  0      33           5  4      16
  5  1      34           5  3      24
  5  2      36           5  2      28
  5  3      40           5  1      30
  5  4      48           5  0      31

For the repeating patterns, it is a relatively easy matter to calculate the product by using a multiply-by-constant method. The result can be calculated in a small number of instructions by taking advantage of the repetition in the pattern; this corresponds to the optimal solution in the multiply-by-constant problem (see Multiply by a Constant starting on page 24). The actual multiply is slightly unusual due to the need to return the top 32 bits of the 64-bit result.
It is efficient to calculate just the top 32 bits; this can be achieved by modifying the multiply-by-constant sequence so that the input value is shifted right rather than left. Consider this fragment of the divide-by-ten code (x is the input dividend as used in the above equations):

  SUB a1, x, x, lsr #2      ; a1 = x*%0.11000000000000000000000000000000
  ADD a1, a1, a1, lsr #4    ; a1 = x*%0.11001100000000000000000000000000
  ADD a1, a1, a1, lsr #8    ; a1 = x*%0.11001100110011000000000000000000
  ADD a1, a1, a1, lsr #16   ; a1 = x*%0.11001100110011001100110011001100
  MOV a1, a1, lsr #3        ; a1 = x*%0.00011001100110011001100110011001

The SUB calculates (for example):

  a1 = x - x/4 = x - x*%0.01 = x*%0.11

Hence, just 5 instructions are needed to perform the multiply.

A small problem is caused by calculating just the top 32 bits, as this ignores any carry from the low 32 bits of the 64-bit product. Fortunately, this can be corrected. A correct divide would round down, so the remainder can be calculated by:

  x - (x/10)*10 = 0..9

It takes just 2 ARM instructions to perform this multiply-by-10 and subtract, by making good use of the ARM's barrel shifter. In the case when (x/10) is too small by 1 (if carry has been lost), the remainder will be in the range 10..19, in which case corrections must be applied. This test would require a compare-with-10 instruction, but it can be combined with other operations to save an instruction (see below). When a lost carry is detected, both the quotient and remainder must be fixed up (1 instruction each).
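Putting the pieces together, the whole scheme (shift/add reciprocal multiply followed by the fix-up) can be sketched in C, assuming 32-bit types; this is a model for checking the algorithm, not the toolkit's code:

```c
#include <stdint.h>

/* Divide-by-10 by reciprocal multiply on the top 32 bits, followed by
   the lost-carry fix-up, mirroring the steps of the div10 routine. */
void div10_sketch(uint32_t x, uint32_t *quot, uint32_t *rem)
{
    uint32_t a = x - (x >> 2);           /* x * %0.11 */
    a = a + (a >> 4);                    /* x * %0.110011... */
    a = a + (a >> 8);
    a = a + (a >> 16);
    a = a >> 3;                          /* ~x/10, possibly 1 too small */
    /* (x-10) - quotient*10, computed mod 2^32 as the flags-based
       assembly does; 5*a doubled gives 10*a */
    int32_t t = (int32_t)(x - 10u - ((a + (a << 2)) << 1));
    if (t >= 0)
        a = a + 1;                       /* lost carry: fix up quotient */
    else
        t = t + 10;                      /* no lost carry: fix up remainder */
    *quot = a;
    *rem = (uint32_t)t;
}
```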
This should explain the full divide-by-10 code:

  div10
  ; takes argument in a1
  ; returns quotient in a1, remainder in a2
  ; cycles could be saved if only divide or remainder is required
        SUB     a2, a1, #10         ; keep (x-10) for later
        SUB     a1, a1, a1, lsr #2
        ADD     a1, a1, a1, lsr #4
        ADD     a1, a1, a1, lsr #8
        ADD     a1, a1, a1, lsr #16
        MOV     a1, a1, lsr #3
        ADD     a3, a1, a1, asl #2
        SUBS    a2, a2, a3, asl #1  ; calc (x-10) - (x/10)*10
        ADDPL   a1, a1, #1          ; fix-up quotient
        ADDMI   a2, a2, #10         ; fix-up remainder
        MOV     pc, lr

The optimisation which eliminates the compare-with-10 instruction is to keep (x-10) for use in the subtraction that calculates the remainder. This means that a compare-with-0 is required instead, which is easily achieved by adding an S (to set the flags) to the SUB opcode. It also means that the subtraction has to be undone if no rounding error occurred (which is why the ADDMI instruction is used).

3.3.5 How to generate divide-by-constant sequences
--------------------------------------------------
For suitable numbers, the details of the divide-by-constant technique can be avoided completely by using the divc program. This is supplied in ANSI C source form in the examples directory. Naturally, it must be compiled in order to use it; use your host system's C compiler, or armcc (in which case the executable must be run using armsd). The divc command-line help can be obtained by running divc with no arguments:

  Usage: divc <number>
  Generates optimal ARM code for divide-by-constant
  where <number> is one of (2^n-2^m) or (2^n+2^m) eg. 10
  Advanced RISC Machines [01 Jul 92]

Type "divc 10" to generate the ARM assembler code for divide-by-10. The output is suitable for immediate use as an armasm source file. The routine is called 'udiv10' for unsigned divide-by-10 (for example). It takes the unsigned argument in a1, and returns the quotient in a1 and the remainder in a2. It conforms fully to the APCS, but the remainder may not be available when called from C.
The range of values covered by (2^n-2^m) and (2^n+2^m) contains some useful numbers such as 7, 10, 24 and 60.

3.3.6 Related Topics
--------------------
  Multiply by a Constant starting on page 24;
  C Programming for Deeply Embedded Applications starting on page 87, for information about the division routines to which armcc generates references.

3.4 Choosing a division implementation
--------------------------------------
This recipe shows you:

  how to select a divide implementation for the C-library;
  how to use the fast divide routines from the examples directory;
  a comparison of the speeds of the divide routines.

The ARM instruction set does not have a divide instruction. In some applications it is important that a general-purpose divide executes as quickly as possible. This recipe shows how to choose between different divide implementations for the ARM.

3.4.1 Divide implementations in the C-library
---------------------------------------------
The C-library offers a choice between two divide variants. This choice is basically a speed vs. space tradeoff; the options are 'unrolled' and 'small'. In the C-library build directory (eg. directory semi for the semi-hosted library), the file options is used to select variants of the C-library. The supplied file contains the following:

  memcpy              = fast
  divide              = unrolled
  stdfile_redirection = off
  fp_type             = module
  stack               = contiguous
  backtrace           = off

The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes (55 instructions for the signed version plus 49 instructions for the unsigned version). This is an appropriate default for most toolkit users, who are interested in obtaining maximum performance. Alternatively, you can change this file to select the 'small' divide, which is more compact at 136 bytes (20 instructions for signed plus 14 instructions for unsigned) but somewhat slower, as there is considerable looping overhead.
For a comparison of the speed difference between these two routines, consult the following table (the speed of divide is data-dependent):

  Signed division example timings
  -------------------------------
  Cycle times are F-cycles on a cached ARM6 series processor,
  excluding call & return.

  Calc           unrolled cycles   rolled cycles
  ----           ---------------   -------------
  0/100                22                19
  9/7                  22                19
  4096/2               70               136
  1000000/3            99               240
  1000000000/1        148               370

If you have a specific requirement, you can modify the supplied routines to suit your application better. For instance, you could write an unrolled-2-times version, or you could combine the signed and unsigned versions to save more space.

3.4.2 Guaranteed-performance divide routines for real-time applications
-----------------------------------------------------------------------
The C-library also contains two fully unwound divide routines. These have been carefully implemented for maximum speed. They are useful when a guaranteed performance is required, eg. for real-time feedback control routines or DSP. The long maximum division time of the standard C-library functions may make them unsuitable for some real-time applications. The supplied routines implement signed division only; it would be possible to modify them for unsigned division if required.
The routines are described by the standard header file "stdlib.h", which contains the following declarations:

  /* ARM real-time divide functions for guaranteed performance */
  typedef struct __sdiv32by16 { int quot, rem; } __sdiv32by16;
      /* used int so that values return in regs, although 16-bit */
  typedef struct __sdiv64by32 { int rem, quot; } __sdiv64by32;

  __value_in_regs extern __sdiv32by16 __rt_sdiv32by16(
      int /*numer*/, short int /*denom*/);
  /*
   * Signed div: (16-bit quot), (16-bit rem) = (32-bit) / (16-bit)
   */
  __value_in_regs extern __sdiv64by32 __rt_sdiv64by32(
      int /*numer_h*/, unsigned int /*numer_l*/, int /*denom*/);
  /*
   * Signed div: (32-bit quot), (32-bit rem) = (64-bit) / (32-bit)
   */

These routines both have guaranteed performance:

  108 cycles for __rt_sdiv64by32 (excluding call & return)
  44 cycles for __rt_sdiv32by16

Currently the C-compiler does not automatically use these routines, as the default routines have early-termination, which yields good performance for small values. In order to use these new divide routines, you should explicitly call the relevant function. Use of the __rt_sdiv64by32 function is complicated by the fact that our C-compiler does not currently support 64-bit integers, so you have to split a 64-bit value between two 32-bit variables.

The following example program shows how to use these routines. This program is available as dspdiv.c in the examples directory. Once the program has been compiled and linked, type

  armsd dspdiv 1000 3

to calculate 1000/3.

dspdiv.c source code:

  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
      if (argc != 3)
          puts("needs 2 numeric args");
      else {
          __sdiv32by16 result;
          result = __rt_sdiv32by16(atoi(argv[1]), atoi(argv[2]));
          printf("quotient %d\n", result.quot);
          printf("remainder %d\n", result.rem);
      }
      return 0;
  }

3.4.3 Summary
-------------
The standard division routine used by the C-library can be selected by using the options file in the C-library build area.
If the supplied variants are not suitable, you can write your own. For real-time applications, the maximum division time must be as short as possible to ensure that the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and __rt_sdiv32by16 are useful.

3.4.4 Related Topics
--------------------
  Division by a Constant starting on page 28;
  C Programming for Deeply Embedded Applications starting on page 87, for information about the division routines to which armcc generates references.

3.5 Digital Signal Processing on the ARM
----------------------------------------

3.5.1 About this Recipe
-----------------------
This recipe is designed to explain the issues involved in performing digital signal processing (DSP) on the ARM. DSP often needs to be performed in real-time, so it is important to achieve the highest possible throughput. In such instances, careful speed optimisation of ARM assembly code is often necessary to achieve the performance required.

3.5.2 Introduction
------------------
DSP is finding a growing number of applications as all kinds of signals are now processed digitally, eg. Compact Disc, telephone speech compression (GSM, G.721 etc.), PhotoCD, JPEG, MPEG... The ARM cannot match a dedicated DSP chip for raw performance, but not all applications require ultra-high performance. The ARM processor can also perform other tasks; DSP chips are severely limited in their range of functions due to their specialised architecture.

Features of most DSP chips:

  single-cycle multiply-accumulate;
  no-overhead loops (a dedicated loop counter register is decremented in
  parallel with executing the body of the loop);
  address generators (with circular buffer support, and bit-reversed
  addressing);
  Harvard architecture to funnel data into the multiply-accumulate.

There is just one main operation which DSP chips perform very fast: the weighted sum.
This is a scalar product, where each element of one array of data is multiplied by the corresponding element in another array, and the total is accumulated. In any book on digital signal processing there are hundreds of equations using sigma notation, which denotes a weighted sum. This single operation is required for all the major DSP functions: correlation, autocorrelation, FIR filters, IIR filters, convolution, DCT etc.

Features of the ARM which are advantageous for DSP
--------------------------------------------------
barrel shifter in parallel with data processing instructions;
MUL instruction;
auto-update load/store instructions;
auto-update load/store multiple (quick sequential addresses).

3.5.3 Examples of some DSP code on the ARM
------------------------------------------
These examples demonstrate typical DSP code. The MLA accumulates the products in a single 32-bit register, so care must be taken to ensure that the value will not overflow. For example, if 1024 products are to be accumulated, the number of bits in each product should not exceed 22, otherwise the total may overflow.

Naive version of weighted-sum ARM code: This is the obvious version of the weighted-sum code, which uses 2 load instructions and a MLA.

        MOV     r10, #0                 ; initialise total
naive_sigma_loop
        LDR     r0, [r8], #4            ; load data A & update pointer
        LDR     r1, [r9], #4            ; load data B & update pointer
        MLA     r10, r0, r1, r10
        SUBS    r11, r11, #1            ; decrement counter
        BNE     naive_sigma_loop

Faster version of weighted-sum ARM code: This shows how to unwind the loop 4 times (to lower the branch overhead). Load-multiple (LDM) is used instead of single register loads; this improves performance significantly. It is possible to use more registers and unwind the loop further.
        MOV     r10, #0                 ; initialise total
faster_sigma_loop
        LDMIA   r8!, {r0-r3}            ; load 4 data A values & update pointer
        LDMIA   r9!, {r4-r7}            ; load 4 data B values & update pointer
        MLA     r10, r0, r4, r10
        MLA     r10, r1, r5, r10
        MLA     r10, r2, r6, r10
        MLA     r10, r3, r7, r10
        SUBS    r11, r11, #1            ; decrement counter
        BNE     faster_sigma_loop

Cross-correlation: This example performs cross-correlation; this particular code was written for a telephone-quality speech compressor. It demonstrates careful optimisation for this specific function; there are 10 MLA instructions to every 1 LDM instruction. All the registers (apart from r15) are used in order to reduce load operations. Cross-correlation involves multiplying every element of one list with the corresponding element in another list and accumulating the total (a weighted sum). To calculate the cross-correlation function, the offset is varied as in this example:

    i1 i2 i3 i4
     *  *  *  *                     = cross_corr_0
    j1 j2 j3 j4 j5 j6 j7 j8 j9

       i1 i2 i3 i4
        *  *  *  *                  = cross_corr_1
    j1 j2 j3 j4 j5 j6 j7 j8 j9

and so on until:

                   i1 i2 i3 i4
                    *  *  *  *      = cross_corr_5
    j1 j2 j3 j4 j5 j6 j7 j8 j9

Notice that 'i' has 4 elements and 'j' has 9 elements, so the cross_corr list has (9-4+1)=6 elements. The routine given here is designed to process 4 elements from 'i' and 4 elements from 'j' per block. The block can be repeated to process the entire 'i' list (which should be a multiple of 4). The 'i'-list should be the smaller of the two, so that it is traversed completely. This yields 5 totals which are cross-correlation results. An outer loop can be used to update the start position in 'j' in order to calculate the full cross-correlation function.
        LDMIA   r8!, {r0-r3}            ; initialise: load j1-j4 (1)
        LDMIA   r9!, {r4-r7}            ; repeating block start, load i1-i4 (2)
        MLA     r10, r0, r4, r10
        MLA     r10, r1, r5, r10
        MLA     r10, r2, r6, r10
        MLA     r10, r3, r7, r10
        MLA     r11, r1, r4, r11
        MLA     r11, r2, r5, r11
        MLA     r11, r3, r6, r11
        MLA     r12, r2, r4, r12
        MLA     r12, r3, r5, r12
        MLA     r13, r3, r4, r13
        LDMIA   r8!, {r0-r3}            ; load j5-j8 (3)
        MLA     r14, r0, r4, r14
        MLA     r14, r1, r5, r14
        MLA     r14, r2, r6, r14
        MLA     r14, r3, r7, r14
        MLA     r13, r0, r5, r13
        MLA     r13, r1, r6, r13
        MLA     r13, r2, r7, r13
        MLA     r12, r0, r6, r12
        MLA     r12, r1, r7, r12
        MLA     r11, r0, r7, r11        ; repeating block end
        ; use loop (or repeat block) in order to traverse 'i'-list

The 22-instruction block calculates 5 cross-correlation sums (in r10-r14), according to the following diagram:

    i1 i2 i3 i4 |
     *  *  *  * |                 Total in r10
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
       i1 i2 i3 | i4
        *  *  * |  *              Total in r11
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
          i1 i2 | i3 i4
           *  * |  *  *           Total in r12
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
             i1 | i2 i3 i4
              * |  *  *  *        Total in r13
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
                | i1 i2 i3 i4
                |  *  *  *  *     Total in r14
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
  First half of | Second half of
 repeating block| repeating block

The j1-j4 values are loaded into r0-r3 by (1). In the repeating block, the next i1-i4 values are loaded by (2). This data is sufficient to calculate the products on the left side of the dividing line; this is done by the first 10 MLA instructions. The number of registers required can be minimised by reusing r0-r3 to hold the j5-j8 values, which are loaded in (3). By arranging the MLA instructions into 2 groups (left and right of the dividing line), it is possible to use the j1-j4 values first, and then use j5-j8 second. At the end of the block, r0-r3 (j5-j8) are used as j1-j4 for the next block (because the pointers have moved on by 4). This technique could also be applied to reduce the memory load traffic of other DSP functions.
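The arithmetic performed by the repeating block can be captured in a few lines of C, which is convenient for checking an assembly implementation against. This is an illustrative sketch only (the function name is ours, not part of the toolkit): it accumulates the five totals exactly as the diagram describes, for a 4-element 'i'-list against eight 'j' values.

```c
#include <stdint.h>
#include <assert.h>

/* Model of the 22-instruction repeating block: accumulate the five
 * cross-correlation totals (r10-r14 in the assembly) for a 4-element
 * i-list against eight j values.  acc[k] is the total for offset k. */
static void cross_corr_block(const int *i, const int *j, int acc[5])
{
    for (int k = 0; k < 5; k++)          /* five offsets per block   */
        for (int n = 0; n < 4; n++)      /* four products per offset */
            acc[k] += i[n] * j[k + n];
}
```

With i = {1,0,0,0} the five totals are simply j1..j5, which makes the register-to-offset mapping easy to verify by hand.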
Fixed Point Arithmetic
----------------------
Fixed-point arithmetic is an important part of DSP (for example, weighting coefficients are often in the range -1..1). The MUL instruction is an integer multiplier, so shifting will be necessary to justify the result correctly. Fortunately, the ARM barrel shifter makes this very easy. When a single MUL is being used to multiply fixed-point numbers, it may be necessary to right-shift the multiply operands so that the result fits in 32 bits, to avoid overflow. As you would expect, add and subtract are unaffected if the operands are fixed-point numbers, provided that both operands are in the same fixed-point format. Naturally, the barrel shifter can be applied to the second operand (with no overhead) if it is necessary to align the formats.

3.5.4 ARM6 Multiplier Performance Issues
----------------------------------------
The performance of the MUL instruction is important for many DSP applications where multiply is used extensively (eg. digital filters, correlation etc.). This section explains how to predict the timing of a MUL instruction, and suggests some ideas to improve the performance of speed-critical programs. This section is specific to the 2-bit Booth's multiplier in the ARM6, ARM2 and ARM3.

Explanation of Booth's multiplication
-------------------------------------
Consider the instruction:

    MUL Rd, Rm, Rs
                ^^

The speed of the multiply depends on the value in Rs. It is important to place the smallest number in Rs so that the multiply takes the least number of cycles. The rest of this section explains how the value in Rs affects the time taken to perform the multiplication. In the ARM6 core, the value in Rs is transferred to the Booth's multiplier register during the first cycle of the instruction. Thereafter, a number of internal I-cycles are required to perform the multiplication. The Booth's multiplier operates in the following way: a 32-bit multiplier register is initialised with the second operand of the multiplication.
This register is extended at the low end with an extra bit, which is initialised to zero. So, the register's contents after initialisation are:

    M31 M30 M29 M28 ... M5 M4 M3 M2 M1 M0 0

On each iteration, the bottom 3 bits of this register are used to generate a Booth digit, which controls what is done on the datapath with the destination register and the first operand register. Then this register is shifted right by 2 bits, losing the two bits at the right-hand end; the 2 leftmost bits are filled with zeros. Early termination occurs if and when the entire multiplier register is all zeros, with the process terminating after 16 iterations in any case. So, after the first iteration the multiplier register's contents are:

    0 0 M31 M30 ... M7 M6 M5 M4 M3 M2 M1

and the Booth digit used on the first iteration was based on the three bits "M1 M0 0". The second Booth digit will hence be "M3 M2 M1". Each Booth digit takes an I-cycle to process, as the ARM datapath is involved in accumulating the partial product. The total time for a MUL is thus 1S + nI cycles, where n depends on the value in Rs. From the above explanation it can be shown that n has the following relationship to the value in Rs:

    Multiplication by a value between 2^(2n-3) and 2^(2n-1)-1 inclusive
    takes 1S + nI-cycles (n>1).
    (Multiplication by 0 or 1 takes 1S + 1I-cycle.)

(These speeds are taken straight from the ARM6 datasheet.)

Overflow issues
---------------
It is common to use the current multiply instruction in (for instance) a 16-bit x 16-bit -> 32-bit mode to avoid overflow. This situation can be generalised, as the number of result bits is just the sum of the operand bits. Thus, the MUL can perform 16-bit x 16-bit -> 32-bit, 8-bit x 24-bit -> 32-bit etc., all without overflow. For MLA, the total is accumulated (overflow of the total must be avoided). Hence, MLA would be used as (for example) 12x12->24, leaving 8 bits to accumulate up to 256 values without the possibility of overflow.
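The relationship between the value in Rs and n can be modelled directly from the description above. The following C sketch (our own illustration, not ARM code) shifts the 33-bit multiplier register right by 2 bits per iteration and counts iterations until the register is all zeros, capped at 16; the count it returns agrees with the formula just given, and also shows why a negative Rs always needs the full 16 iterations.

```c
#include <stdint.h>
#include <assert.h>

/* Count the I-cycles a 2-bit Booth multiplier (ARM6-class) needs for
 * a given Rs value.  The 33-bit register holds Rs extended with a
 * zero at the low end; each iteration consumes one Booth digit and
 * shifts right by 2, filling with zeros.  Early termination occurs
 * when the register is all zeros, with a hard cap of 16 iterations. */
static int booth_icycles(uint32_t rs)
{
    uint64_t reg = (uint64_t)rs << 1;   /* Rs : 0  (33 bits)       */
    int n = 0;
    do {
        n++;                            /* one I-cycle per digit   */
        reg >>= 2;                      /* zeros fill from the top */
    } while (reg != 0 && n < 16);
    return n;
}
```

For example, any Rs that fits in 16 bits gives at most 9, matching the 1S + 9I worst case quoted below, while any value with the top bit set gives 16.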
So, although the worst-case multiply is 1S + 16I-cycles, in practice it is possible to arrange for a worst case which is at most 1S + 9I-cycles (by putting the operand with the least number of bits into Rs, so that Rs <= 16 bits), and often considerably better.

Negative operands
-----------------
The multiplier yields correct results for negative operand values, so the sign of the operands can be ignored. For positive values of Rs, a 16x16->32 MUL takes at most 1S + 9I-cycles (the average should be better than this). But the MUL *always* takes 1S + 16I-cycles if Rs is negative: early termination cannot take place because the top bit of Rs is a 1, so the Booth's multiplier register never contains all 0s (the maximum-of-16 limit is reached instead). To reduce the number of MUL cycles required, ensure that the Rs value is positive:

    CMP     Rs, #0
    RSBMI   Rs, Rs, #0      ; make Rs positive
    MUL     Rd, Rm, Rs
    RSBMI   Rd, Rd, #0      ; correct the sign of the result

but this does not really improve things unless Rs is very small, so that the gain exceeds the (3-instruction) overhead. It is sometimes possible to do away with the CMP by incorporating it into another instruction (see below). In the special case of squaring, the result does not need to be negated after the multiplication, as it will always be positive (thus the second RSB instruction can be eliminated). For example, consider this critical bit of code, which uses MUL to square the difference of two 5-bit input values. It demonstrates the importance of ensuring that Rs is positive, as the worst-case performance is improved to just 1S + 3I-cycles for the MUL or MLA.

    AND     r1, r8, #31     ; extract 5-bit field of interest
    AND     r2, r9, #31     ; extract 5-bit field of interest
    SUBS    r1, r1, r2
    RSBMI   r1, r1, #0
    MUL     r0, r1, r1

As you can see, ensuring that the multiply is fast has been arranged to cost only 1 S-cycle (for the RSBMI instruction).
A similar method can be used to ensure MLA is always fast, without affecting the total:

    CMP     Rs, #0
    RSBMI   Rs, Rs, #0
    RSBMI   Rm, Rm, #0
    MLA     Rd, Rm, Rs, Rn

Multiplication by constant
--------------------------
This technique replaces a MUL by a sequence of arithmetic instructions which are equivalent to multiplying by a constant. The gains depend on the value of the constant (smaller constants are generally faster). For more details, please see Multiply by a Constant starting on page 24.

3.5.5 Related Topics
--------------------
Multiply by a Constant starting on page 24;
Division by a Constant starting on page 28;
Using 16-bit Data on the ARM starting on page 41;
ARM Assembly Programming Performance Issues starting on page 55.

3.6 Using 16-bit Data on the ARM
--------------------------------

3.6.1 About this Recipe
-----------------------
This recipe covers several different approaches to 16-bit data manipulation:

converting the 16-bit data to 32-bit data, and from then on treating it as 32-bit data;
converting 16-bit data into 32-bit data when loading and storing, but using 32-bit data within the ARM's registers;
loading 16-bit data into the top 16 bits of ARM registers, and processing it as 16-bit data (ie. keeping the bottom 16 bits clear at all times).

Useful code fragments are given which can be used to help implement these different approaches efficiently.

3.6.2 Introduction
------------------
Since the ARM is a 32-bit processor, and does not have half-word load and store instructions in its instruction set, at first glance the ARM may look unsuitable for processing 16-bit data. This recipe is intended to show that the ARM is quite capable of handling 16-bit data efficiently, in several different ways, depending on what is needed for a particular application.

3.6.3 How "16-bit" is my data ?
-------------------------------
Just because data is 16-bit in size does not mean that it cannot be considered as 32-bit data by the ARM, and thus be manipulated using the ARM instruction set in the normal way. Clearly any unsigned 16-bit value can be held as a 32-bit value in which the top 16 bits are all zero. Similarly, any signed 16-bit value can be held as a 32-bit value with the top 16 bits sign extended (ie. copied from the top bit of the 16-bit value). The main disadvantage of storing 16-bit data as 32-bit data in this way for ARM-based systems is that it takes up twice as much space in memory or on disk.

If the amount of memory taken up by the 16-bit data is small, then simply treating it as 32-bit data is likely to be the easiest and most efficient technique, ie. converting the data to 32-bit format and from then on treating it as 32-bit data.

When the space taken by 16-bit data in memory or on disk is not small, an alternative method can be used: the 16-bit data is loaded and converted to 32-bit data for use within the ARM, and when processed it can be output as either 32-bit or 16-bit data. Useful code fragments to perform the necessary conversions for this approach are given in sections Little Endian Loading Recipes starting on page 42 to Big Endian Storing Recipes starting on page 45.

An issue which may arise when 16-bit data is converted to 32-bit data for use in the ARM and then stored back out as 16-bit data is detecting whether the data is still 16-bit data, ie. whether it has 'overflowed' into the top 16 bits of the ARM register. Code fragments which detect this are given in section Detecting Overflow into the Top 16 Bits starting on page 46.

Another approach, which avoids having to use explicit code to check whether results have overflowed into the top 16 bits, is to keep 16-bit data as 16-bit data all the time, by loading it into the top half of ARM registers and ensuring that the bottom 16 bits are always 0.
Useful code sequences, and the issues involved when taking this approach, are described in Using ARM Registers as 16-bit Registers starting on page 46.

3.6.4 Little Endian Loading Recipes
-----------------------------------
Code fragments in this section which transfer a single 16-bit data item transfer it to the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment loads a 16-bit value into a register, whether the data is byte, half-word or word aligned in memory, by using the ARM's load byte instruction. This code is also optimal for the common case where the 16-bit data is half-word aligned, ie. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimisations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

    LDRB    R0, [R2, #0]            ; 16-bit value is loaded from the
    LDRB    R1, [R2, #1]            ; address in R2, and put in R0
    ORR     R0, R0, R1, LSL #8      ; R1 is required as a
    MOV     R0, R0, LSL #16         ; temporary register
    MOV     R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data processing operation by specifying the second argument as "R0, ASR #16" rather than just R0.

One Data Item - Byte Offset 2
-----------------------------
If the data is aligned on a half-word boundary, but not a word boundary, ie. the byte offset is 2, then the following code fragment can be used (which is clearly much more efficient than the general case given above):

    LDR     R0, [R2, #-2]           ; 16-bit data is loaded from address in
    MOV     R0, R0, LSR #16         ; R2 into R0 (R2 has byte offset 2)

The "LSR" should be replaced with "ASR" if the data is signed.
Note that, as in the previous example, it may be possible to combine the MOV with another data processing operation.

One Data Item - Byte Offset 0
-----------------------------
If the data is on a word boundary, then the following code fragment will load a 16-bit value (again a significant improvement over the general case):

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSL #16         ; aligned address in R2 into R0.
    MOV     R0, R0, LSR #16

As before, "LSR" should be replaced with "ASR" if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation. This code can be further optimised if non-word-aligned word loads are permitted (ie. alignment faults are not enabled). This makes use of the way the ARM rotates data into a register for non-word-aligned word loads; see the appropriate ARM datasheet for more information:

    LDR     R0, [R2, #2]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSR #16         ; aligned address in R2 into R0.

Two Data Items - Byte Offset 0
------------------------------
Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

    LDR     R0, [R2, #0]            ; 2 unsigned 16-bit values are loaded
    MOV     R1, R0, LSR #16         ; from one word of memory [R2]. The
    BIC     R0, R0, R1, LSL #16     ; 1st is put in R0, and the 2nd in R1.

The version of this for signed data is:

    LDR     R0, [R2, #0]            ; 2 signed 16-bit values are loaded
    MOV     R1, R0, ASR #16         ; from one word of memory [R2]. The
    MOV     R0, R0, LSL #16         ; 1st is put in R0, and the 2nd in R1.
    MOV     R0, R0, ASR #16

The address in R2 should be word aligned (byte offset 0), in which case these code fragments load the data item in bytes 0-1 into R0, and the data item in bytes 2-3 into R1.
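The little-endian recipes above can be mirrored in C as a cross-check. This sketch is our own (the function names are invented); the shift-left-then-arithmetic-shift-right idiom is the C equivalent of the MOV LSL #16 / MOV ASR #16 pair.

```c
#include <stdint.h>
#include <assert.h>

/* Any-alignment load of a signed 16-bit little-endian value, built
 * from two byte loads as in the LDRB/LDRB/ORR sequence. */
static int32_t load_s16_le(const uint8_t *p)
{
    uint32_t v = p[0] | ((uint32_t)p[1] << 8);  /* LDRB, LDRB, ORR   */
    return ((int32_t)(v << 16)) >> 16;          /* MOV LSL / MOV ASR */
}

/* Unpack two unsigned 16-bit items from one little-endian word, as
 * in the LDR / MOV LSR #16 / BIC sequence: the 1st item comes from
 * bits 0-15, the 2nd from bits 16-31. */
static void unpack_u16_pair(uint32_t word, uint32_t *first, uint32_t *second)
{
    *second = word >> 16;               /* MOV R1, R0, LSR #16     */
    *first  = word & 0xFFFF;            /* BIC R0, R0, R1, LSL #16 */
}
```

(The right shift of a negative int32_t is arithmetic on all common compilers, matching ASR.)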
3.6.5 Little Endian Storing Recipes
-----------------------------------
The code fragment in this section which transfers a single 16-bit data item transfers it from the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset from a word address of the store address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address, by using the ARM's byte saving instructions:

    STRB    R0, [R2, #0]            ; 16-bit value is stored to the
    MOV     R0, R0, ROR #8          ; address in R2.
    STRB    R0, [R2, #1]
    MOV     R0, R0, ROR #24

The second MOV instruction is not needed if the data is no longer needed after it has been stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.

Two Data Items - Byte Offset 0
------------------------------
Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

    ORR     R3, R0, R1, LSL #16     ; Two unsigned 16-bit values
    STR     R3, [R2, #0]            ; in R0 and R1 are packed into
                                    ; the word addressed by R2.
                                    ; R3 is used as a temporary register

If the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (one of R0 or R1 can be used instead). The version for signed data is:

    MOV     R3, R0, LSL #16         ; Two signed 16-bit values
    MOV     R3, R3, LSR #16         ; in R0 and R1 are packed into
    ORR     R3, R3, R1, LSL #16     ; the word addressed by R2.
    STR     R3, [R2, #0]            ; R3 is a temporary register

Again, if the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (R0 can be used instead).
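The packing step of the storing recipe has a direct C equivalent, given here as a sketch with an invented name; the cast to uint16_t plays the role of the LSL #16 / LSR #16 masking needed in the signed case.

```c
#include <stdint.h>
#include <assert.h>

/* Pack two 16-bit values into one word for a little-endian store,
 * as in the ORR R3, R0, R1, LSL #16 / STR sequence.  Passing signed
 * values through uint16_t discards any sign-extended top bits, just
 * as the MOV LSL #16 / MOV LSR #16 pair does in the assembly. */
static uint32_t pack_u16_pair(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}
```

The word produced is what a little-endian STR would place in memory, with the first item in the low half.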
3.6.6 Big Endian Loading Recipes
--------------------------------
Code fragments in this section which transfer a single 16-bit data item transfer it to the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment loads a 16-bit value into a register, whether the data is byte, half-word or word aligned in memory, by using the ARM's load byte instruction. This code is also optimal for the common case where the 16-bit data is half-word aligned, ie. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimisations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

    LDRB    R0, [R2, #0]            ; 16-bit value is loaded from the
    LDRB    R1, [R2, #1]            ; address in R2, and put in R0
    ORR     R0, R1, R0, LSL #8      ; R1 is a temporary register
    MOV     R0, R0, LSL #16
    MOV     R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data processing operation by specifying the second argument as "R0, ASR #16" rather than just R0.

One Data Item - Byte Offset 0
-----------------------------
If the data is aligned on a word boundary, then the following code fragment can be used (which is clearly much more efficient than the general case given above):

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSR #16         ; aligned address in R2 into R0.

The "LSR" should be replaced with "ASR" if the data is signed. Note that, as in the previous example, it may be possible to combine the MOV with another data processing operation.
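The only difference between the little-endian and big-endian byte-wise loads is the operand order of the final ORR. The two variants side by side in C (our own sketch, invented names) make the swap explicit:

```c
#include <stdint.h>
#include <assert.h>

/* Assemble an unsigned 16-bit value from the two bytes at p. */
static uint32_t load_u16_le(const uint8_t *p)
{
    return p[0] | ((uint32_t)p[1] << 8);    /* ORR R0, R0, R1, LSL #8 */
}

static uint32_t load_u16_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 8) | p[1];    /* ORR R0, R1, R0, LSL #8 */
}
```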
One Data Item - Byte Offset 2
-----------------------------
If the data is aligned on a half-word boundary, but not a word boundary, ie. the byte offset is 2, then the following code fragment can be used (again a significant improvement over the general case):

    LDR     R0, [R2, #-2]           ; 16-bit value is loaded from the
    MOV     R0, R0, LSL #16         ; address in R2 into R0. R2 is
    MOV     R0, R0, LSR #16         ; aligned to byte offset 2

As before, "LSR" should be replaced with "ASR" if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation. This code can be further optimised if non-word-aligned word loads are permitted (ie. alignment faults are not enabled). This makes use of the way the ARM rotates data into a register for non-word-aligned word loads; see the appropriate ARM datasheet for more information:

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the
    MOV     R0, R0, LSR #16         ; address in R2 into R0. R2 is
                                    ; aligned to byte offset 2

Two Data Items - Byte Offset 0
------------------------------
Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

    LDR     R0, [R2, #0]            ; 2 unsigned 16-bit values are
    MOV     R1, R0, LSR #16         ; loaded from one word of memory.
    BIC     R0, R0, R1, LSL #16     ; The 1st is put in R1, the 2nd in R0.

The version of this for signed data is:

    LDR     R0, [R2, #0]            ; 2 signed 16-bit values are loaded from
    MOV     R1, R0, ASR #16         ; one word of memory (address in R2).
    MOV     R0, R0, LSL #16         ; The 1st is put in R1, and the 2nd
    MOV     R0, R0, ASR #16         ; in R0.

(Note that with big-endian data the first item occupies the most significant half of the word, so it ends up in R1 and the second item in R0 - the opposite of the little-endian case.)

3.6.7 Big Endian Storing Recipes
--------------------------------
The code fragment in this section which transfers a single 16-bit data item transfers it from the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset from a word address of the store address. eg. the address 0x4321 has a byte offset of 1.
One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address:

    STRB    R0, [R2, #1]            ; 16-bit value is stored to the
    MOV     R0, R0, ROR #8          ; address in R2.
    STRB    R0, [R2, #0]
    MOV     R0, R0, ROR #24

The second MOV instruction is not needed if the data is no longer needed after it has been stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.

Two Data Items - Byte Offset 0
------------------------------
Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

    ORR     R3, R0, R1, LSL #16     ; Two unsigned 16-bit values in
    STR     R3, [R2, #0]            ; R0 and R1 are packed into the
                                    ; word addressed by R2.
                                    ; R3 is used as a temporary register

If the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (one of R0 or R1 can be used instead). The version for signed data is:

    MOV     R3, R0, LSL #16         ; Two signed 16-bit values in
    MOV     R3, R3, LSR #16         ; R0 and R1 are packed into the
    ORR     R3, R3, R1, LSL #16     ; word addressed by R2.
    STR     R3, [R2, #0]            ; R3 is a temporary register

Again, if the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (R0 can be used instead).

3.6.8 Detecting Overflow into the Top 16 Bits
---------------------------------------------
If 16-bit data is converted to 32-bit data for use in the ARM, it may sometimes be necessary to check explicitly whether the result of a calculation has 'overflowed' into the top 16 bits of an ARM register. This is likely to be necessary because the ARM does not set its processor status flags when this happens. The following instruction sets the Z flag if the value in R0 is a 16-bit unsigned value. R1 is used as a temporary register.
    MOVS    R1, R0, LSR #16

The following instructions set the Z flag if the value in R0 is a valid 16-bit signed value (ie. bit 15 is the same as the sign-extended bits). R1 is used as a temporary register.

    MOVS    R1, R0, ASR #15
    CMNNE   R1, #1

3.6.9 Using ARM Registers as 16-bit Registers
---------------------------------------------
Instead of holding 16-bit data as 32-bit data within the ARM, it can be held as 16-bit data, by positioning it in the top 16 bits of the ARM registers, as opposed to the bottom 16 bits as has been described so far.

The advantages of this approach are:

Some 16-bit data load instruction sequences are shorter. The loading and storing sequences shown above will have to be modified, and in some cases shorter instruction sequences will be possible. In particular, handling signed data will often be more efficient, as the top bit does not have to be copied into the top 16 bits of the register. Note that it is, however, necessary to ensure that the bottom 16 bits are all clear.

The ARM processor status flags will be set if the 'S' bit of a data processing instruction is set and overflow or carry occurs out of the 16-bit value. Thus explicit 'overflow' checking instructions are not needed.

Pairs of signed 16-bit integers can be saved more efficiently than in the previous approach, since the sign-extended bits do not have to be cleared out before the two values are combined.

The disadvantages of this approach are:

Instructions such as add-with-carry cannot be used directly. eg. the instruction ADC R0, R0, #0, to increment R0 if Carry is set, should be replaced by ADDCS R0, R0, #&10000. Having to use this form of instruction reduces the chance of being able to combine several data processing operations into one by making use of the barrel shifter.

Before multiplying two 16-bit values in the top half of the registers, the values must be shifted into the bottom half of the register.
Before combining a 16-bit value with a 32-bit value, the 16-bit value must be shifted into the bottom half of the register. Note, however, that this may cost nothing if the barrel shifter can be used in parallel.

The recipes given above for loading and storing 16-bit data in the bottom half of ARM registers can easily be adapted to load the data into the top half of the registers (and ensure the bottom half is all zero), or save the data from the top half of the registers.

3.6.10 Related Topics
---------------------
Using the Barrel Shifter starting on page 10;
Fixed Point Arithmetic starting on page 38.

3.7 ARM600 Page Table Generation
--------------------------------

3.7.1 About This Recipe
-----------------------
This recipe tells you how to use:

armasm repetitive assembly;
armasm conditional assembly;
armasm macros;
armasm and armlink to produce a plain binary output file (containing only the bytes you describe in your source program).

This recipe refers to the program pagetab.s in the examples directory. Although pagetab.s generates ARM600 page tables, it is not the format of the page tables which is being discussed in this recipe, but how they are generated using armasm. To find out more about ARM600 page tables, please refer to the ARM600 datasheet.

3.7.2 Repetitive Assembly
-------------------------
The following code fragment is taken from pagetab.s:

        GBLA    counter
counter SETA    0
        WHILE   counter <= 3
        L1Entry SECTION, (counter:SHL:20), 0, U_BIT+C_BIT+B_BIT, SVC_RW
counter SETA    counter+1
        WEND

Explanation
-----------
GBLA counter declares a global numeric variable called counter, which is initialised to zero using the SETA directive. The WHILE ... WEND construct is then used to repeatedly assemble the lines between WHILE and WEND. In this example, the loop body is assembled for counter = 0, 1, 2 and 3, but because the looping condition is checked at the top of the loop, it is possible for code between a WHILE and a WEND never to be assembled.
For example, if counter were initialised to 4, the body of the WHILE ... WEND loop would not be assembled at all. Each time around the loop, the macro L1Entry is called (with 5 arguments), and then counter is incremented.

3.7.3 Macro Usage and Conditional Assembly
------------------------------------------
The following code fragment, the definition of L1Entry, is also taken from pagetab.s:

        MACRO
$L      L1Entry $type, $addr, $dom, $ucb, $acc
$L
        IF ($type=SECTION)
        DCD ((($addr):AND:&FFF00000):OR:(($acc):SHL:10)         \
            :OR:(($dom):SHL:5):OR:($ucb):OR:($type))
        MEXIT
        ENDIF
        IF ($type=PAGE)
        DCD ((($addr):AND:&FFFFFC00):OR:(($dom):SHL:5)          \
            :OR:(($ucb):AND:U_BIT):OR:$type)
        ELSE
        DCD 0                   ; Invalid Level 1 Page Table Entry
        ENDIF
        MEND

Note that a backslash breaks a logical line of assembly language across two physical lines; however, there must be no character after the backslash on the line.

Explanation
-----------
The macro definition is enclosed between MACRO and MEND. The first line of the definition gives the macro's name and lists its parameters; $L is a special parameter, called the label parameter. The body of the macro illustrates the use of IF ... ENDIF and IF ... ELSE ... ENDIF to assemble different code conditional on a value known at assembly time. In this example, the controlling expressions of the IFs involve a macro parameter ($type), which gets its value when the macro is called. This macro definition also shows how the MEXIT directive can be used to exit from processing a macro before the MEND directive is reached. You can think of MEXIT as being like a return statement in a C function. The label parameter is used to allow labels to be set up within a macro and then be accessed from outside that macro. In this macro, the label parameter is set up to be the address of this page table entry.
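The field packing performed by the SECTION branch of the macro has a direct C equivalent, which can be handy for checking generated table entries by hand. This is our own sketch: the field positions come straight from the DCD expression above, and the values of constants such as SECTION, U_BIT and the access bits are deliberately left as parameters rather than guessed here.

```c
#include <stdint.h>
#include <assert.h>

/* Mirror of the SECTION branch of the L1Entry macro:
 * (addr AND &FFF00000) OR (acc SHL 10) OR (dom SHL 5) OR ucb OR type */
static uint32_t l1_section_entry(uint32_t addr, uint32_t acc,
                                 uint32_t dom, uint32_t ucb, uint32_t type)
{
    return (addr & 0xFFF00000u)   /* section base: top 12 address bits */
         | (acc << 10)            /* access permission field           */
         | (dom << 5)             /* domain field                      */
         | ucb                    /* U/C/B control bits                */
         | type;                  /* descriptor type bits              */
}
```

The test below checks only the arithmetic of the packing, using arbitrary field values; consult the ARM600 datasheet for the real bit encodings.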
3.7.4 Assembling the Page Tables in Plain Binary Format
-------------------------------------------------------

This section tells you how to create a file containing a plain binary
image of the page tables. In other words, a file containing just the bytes
you would need to load into memory, and nothing else by way of symbolic
information, file content descriptors, load addresses, etc.

You create a plain binary image in two steps: first you create a
relocatable object file from your source file; then you use armlink to
make a plain binary version of the relocatable object.

Method
------

Set your current directory to that containing the pagetab.s program, then
assemble pagetab.s and link pagetab.o as follows:

  armasm pagetab.s -o pagetab.o -li
  armlink -bin -o pagetab pagetab.o

Explanation
-----------

As in earlier examples, the -li option tells armasm to assemble code for a
little-endian memory. This need not be specified if the tools have been
configured for little-endian operation (see The ARM Tool Reconfiguration
Utility (reconfig) starting on page 45 of the User Manual for details).

The -bin option tells armlink to make a plain binary output file
containing nothing but the bytes you described in your source program.

Because pagetab contains no position-dependent data, you do not need to
tell armlink where to base its output. If there had been
position-dependent data or code, you would have had to use the
-base address option to tell armlink where to base its output and, of
course, you would only be able to use the output at that memory address.

3.7.5 Related Topics
--------------------

Loading Constants into Registers starting on page 15.

3.8 Pseudo Random Number Generation
-----------------------------------

3.8.1 About This Recipe
-----------------------

This recipe describes a 32-bit pseudo random number generator implemented
efficiently in ARM Assembly Language.
3.8.2 The Details
-----------------

It is often necessary to generate pseudo random numbers, and the most
efficient algorithms are based on shift generators with exclusive-or
feedback (rather like a cyclic redundancy check generator). Unfortunately
the sequence of a 32-bit generator needs more than one feedback tap to be
maximal length (ie. 2^32-1 cycles before repetition), so this example uses
a 33 bit register with taps at bits 33 and 20.

The basic algorithm is newbit:=bit33 EOR bit20, shift left the 33 bit
number and put in newbit at the bottom; this operation is performed for
all the newbits needed (ie. 32 bits). The entire operation can be coded
compactly by making maximal use of the ARM's barrel shifter:

  ; enter with seed in R0 (32 bits), R1 (1 bit in least significant bit)
  ; R2 is used as a temporary register.
  ; on exit the new seed is in R0 and R1 as before
  ; Note that a seed of 0 will always produce a new seed of 0.
  ; All other values produce a maximal length sequence.
  ;
        TST     R1, R1, LSR #1          ; top bit into Carry
        MOVS    R2, R0, RRX             ; 33 bit rotate right
        ADC     R1, R1, R1              ; carry into lsb of R1
        EOR     R2, R2, R0, LSL #12     ; (involved!)
        EOR     R0, R2, R2, LSR #20     ; (similarly involved!)

3.8.3 Using This Example
------------------------

This random number generation code is provided as random.s in the examples
directory. It is provided as ARM Assembly Language source which can be
assembled and then linked with C modules (see recipes in chapter
Interfacing Assembly Language and C starting on page 62 for more
information). The C test program randtest.c (also in the examples
directory) can be used to demonstrate this.
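Before building the example, it may help to see the algorithm modelled
bit-by-bit in C. The following is a slow reference sketch of the shift
generator described above, not a transcription of the optimised
five-instruction sequence; lo holds the low 32 bits of the 33-bit register
and hi holds bit 33 in its least significant bit:

```c
#include <stdint.h>

/* One step of the 33-bit shift generator with taps at bits 33 and 20
   (bits numbered from 1). Each call generates 32 new bits. */
void next_rand(uint32_t *lo, uint32_t *hi)
{
    for (int i = 0; i < 32; i++) {
        uint32_t bit33  = *hi & 1;
        uint32_t bit20  = (*lo >> 19) & 1;
        uint32_t newbit = bit33 ^ bit20;    /* newbit := bit33 EOR bit20 */
        *hi = (*lo >> 31) & 1;              /* old bit 32 becomes bit 33 */
        *lo = (*lo << 1) | newbit;          /* shift left, insert newbit */
    }
}
```

As with the assembly version, a seed of 0 maps to 0; every other state
lies on the maximal length sequence.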
First set the examples directory to be your current directory, and execute
the following commands to build an executable suitable for armsd:

  armasm random.s -o random.o -li
  armcc -c randtest.c -li -apcs 3/32bit
  armlink randtest.o random.o -o randtest somewhere/armlib.32l

where somewhere is the path to the lib directory of the ARM Software
Development Toolkit release.

Note that in both the above commands, and the armsd command below, -li
indicates that the target ARM is little endian, and -apcs 3/32bit
specifies that the 32 bit variant of the ARM Procedure Call Standard
should be used. These options can be omitted if the tools have already
been configured appropriately - see The ARM Tool Reconfiguration Utility
(reconfig) starting on page 45 of the User Manual for details.

armsd can be used to run this program as follows:

  > armsd -li randtest
  A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
  ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
  Object program file randtest
  armsd: go
  randomnumber() returned 0b3a9965
  randomnumber() returned ac0b1672
  randomnumber() returned 6762ad4f
  randomnumber() returned 1965a731
  randomnumber() returned d6c1cef4
  randomnumber() returned f78fa802
  randomnumber() returned 8147fc15
  randomnumber() returned 3f62adfc
  randomnumber() returned b56e9da8
  randomnumber() returned b36dc5e2
  Program terminated normally at PC = 0x000082c8
        0x000082c8: 0xef000011 ....    : >       swi     0x11
  armsd: quit
  Quitting
  >

3.8.4 Related Topics
--------------------

Please refer to the index for other topics of interest.

3.9 Loading a Word from an Unknown Alignment
--------------------------------------------

3.9.1 About This Recipe
-----------------------

In this recipe a code sequence which loads a word from memory at any byte
alignment is described. Although loading 32-bit data from non word aligned
addresses should be avoided whenever possible, it may sometimes be
necessary.
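The technique described below reads the two aligned words containing the
data and merges them with shifts. As a preview, the same merge can be
sketched in C (little-endian byte order assumed, as in the ARM code; the
helper name and the separate base/offset parameters are illustrative
only):

```c
#include <stdint.h>

/* C sketch of the recipe: read the two aligned words that straddle
   the address and merge them with shifts. Like the ARM sequence, it
   may read up to three bytes beyond the unaligned word, so base[1]
   must be a readable location. Little-endian layout assumed. */
uint32_t load_word_any_alignment(const uint32_t *base, unsigned byte_offset)
{
    uint32_t lo = base[0], hi = base[1];   /* get 64 bits containing data */
    unsigned bits = (byte_offset & 3) * 8; /* offset in bits */
    if (bits == 0)
        return lo;                         /* already aligned */
    return (lo >> bits) | (hi << (32 - bits));
}
```
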
3.9.2 The Details
-----------------

The ARM Load and Store (single and multiple) instructions are designed to
load word aligned data. Unless there is a very good reason it is best to
avoid having to load or store word-sized data to or from non word aligned
addresses, as neither the Load nor the Store instruction can do this
unaided.

To deal with misaligned word fetches, two words must be read and the
required data extracted from these two words. The code below performs this
operation for a little endian ARM, making good use of the barrel shifter
in the process:

  ; enter with address in R0
  ; R2 and R3 are used as temporary registers
  ; the word is loaded into R1
  ;
        BIC     R2, R0, #3              ; Get word aligned address
        LDMIA   R2, {R1, R3}            ; Get 64 bits containing data
        AND     R2, R0, #3              ; Get offset in bytes
        MOVS    R2, R2, LSL #3          ; Get offset in bits
        MOVNE   R1, R1, LSR R2          ; Extract data from bottom 32 bits
        RSBNE   R2, R2, #32             ; Get 32 - offset in bits
        ORRNE   R1, R1, R3, LSL R2      ; Extract data from top 32 bits
                                        ; and combine with the other data

This code can easily be modified for use on a big endian ARM - the LSR R2
and LSL R2 just have to be swapped over.

For details of what the Load and Store instructions do if used with non
word aligned addresses refer to the appropriate datasheet. Note that non
word aligned word loads are also used in Using 16-bit Data on the ARM
starting on page 41.

3.9.3 Related Topics
--------------------

Using 16-bit Data on the ARM starting on page 41.

3.10 Byte Order Reversal
------------------------

3.10.1 About This Recipe
------------------------

This recipe gives a compact ARM instruction sequence to perform byte order
reversal, ie. reversing the endianness of a word.

3.10.2 The Details
------------------

Changing the endianness of a word can be a common operation in certain
applications, for example when communicating word sized data as a stream
of bytes to a receiver of the opposite endianness.
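In C the operation is a simple byte permutation. The sketch below shows a
portable reference version, together with a C transcription of the
four-instruction barrel-shifter sequence presented below (rotates written
out as shift pairs); the function names are illustrative only:

```c
#include <stdint.h>

/* Portable reference: swap the byte order of a 32-bit word,
   eg. 0xAABBCCDD -> 0xDDCCBBAA */
uint32_t byterev_ref(uint32_t x)
{
    return (x << 24) | ((x & 0x0000ff00u) << 8) |
           ((x >> 8) & 0x0000ff00u) | (x >> 24);
}

/* Transcription of the four-instruction ARM sequence */
uint32_t byterev_arm(uint32_t x)
{
    uint32_t t = x ^ ((x >> 16) | (x << 16)); /* EOR ip, a1, a1, ROR #16 */
    t &= ~0x00ff0000u;                        /* BIC ip, ip, #&ff0000   */
    x = (x >> 8) | (x << 24);                 /* MOV a1, a1, ROR #8     */
    return x ^ (t >> 8);                      /* EOR a1, a1, ip, LSR #8 */
}
```
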
This operation can be performed efficiently on the ARM, using just four
instructions. The word to be reversed is held in a1 both on entry to and
exit from this instruction sequence. ip is used as a temporary register
(for more information about these register names see Register Usage under
the ARM Procedure Call Standard starting on page 62):

        EOR     ip, a1, a1, ror #16
        BIC     ip, ip, #&ff0000
        MOV     a1, a1, ror #8
        EOR     a1, a1, ip, lsr #8

A demonstration program which should help explain how this works has been
provided in source form in the examples directory. To compile this program
and run it under armsd, first set your current directory to be examples
and then use the following commands:

  > armcc bytedemo.c -o bytedemo -li -apcs 3/32bit
  > armsd -li bytedemo
  A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
  ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
  Object program file bytedemo
  armsd: go
  ... demonstration program executes ...

Note that this program uses ANSI control codes, so should work on most
terminal types under Unix and also on the PC.

3.10.3 Related Topics
---------------------

Please refer to the index for further topics of interest.

3.11 ARM Assembly Programming Performance Issues
------------------------------------------------

3.11.1 About This Recipe
------------------------

This recipe outlines many performance related issues which the ARM
Assembly Language programmer should be aware of. Many of these issues are
dealt with more fully elsewhere in the Cookbook, but are mentioned here
for completeness.

This recipe is also useful background for C programmers using armcc, as
some of these issues can also apply to programming in C.

3.11.2 Performance Issues
-------------------------

Not all of the issues in this recipe apply to every ARM processor based
system. However, unless otherwise stated, all issues relate to processors
based on ARM6 and ARM7, with or without cache and/or write buffer.
LDM / STM
---------

Use LDM and STM instead of a sequence of LDR or STR instructions wherever
possible. This provides several benefits:

  the code is smaller (and thus will cache better on an ARM processor
  with a cache);

  an instruction fetch cycle and a register copy back cycle are saved for
  each LDR or STR eliminated;

  on an uncached ARM processor (for LDM) or an unbuffered ARM processor
  (for STM), non-sequential memory cycles can be turned into faster
  sequential memory cycles.

For a more detailed discussion of LDM and STM see Flexibility of Load and
Store Multiple starting on page 13.

Conditional Execution
---------------------

In many situations, branches around short pieces of code can be avoided by
using conditionally executed instructions. This reduces the size of code
and may avoid a pipeline break. For a more detailed explanation of the
conditional execution of ARM instructions see Making the Most of
Conditional Execution starting on page 7.

Using the Barrel Shifter
------------------------

Combining shift operations with other operations can significantly
increase the code density (and thus performance) of much ARM code. An
explanation of how to use the ARM's barrel shifter is given in Using the
Barrel Shifter starting on page 10. Many other recipes also demonstrate
good use of the barrel shifter.

Addressing Modes
----------------

The ARM instruction set provides a useful selection of addressing modes,
which can often be used to improve the performance of code. eg. using LDR
or STR pre- or post-indexed with a non-zero offset both updates the base
register and performs the data transfer in a single instruction. For full
details of the addressing modes available refer to the appropriate ARM
datasheet.

Multiplication
--------------

Be aware of the time taken by the ARM multiply and multiply accumulate
instructions. Making the best use of the multiplier is discussed in ARM6
Multiplier Performance Issues starting on page 38.
When multiplying by a constant value, note that using the multiply
instruction is often not the optimal solution. The issues involved are
discussed in Multiply by a Constant starting on page 24.

Optimise Register Usage
-----------------------

Examine your code and see if registers can be reused for another value
during parts of a long calculation which uses many registers. Doing this
will reduce the amount of 'register spillage', ie. the number of times a
value has to be reloaded or an intermediate value saved and then reloaded.

Notice that because data processing is cheap (much can be achieved in a
single instruction), keeping a calculated result in a register for use
some considerable time later may be less efficient than recalculating it
when it is next needed: recalculation frees the register for another
purpose in the meantime, which reduces the amount of register spillage.

For example code which takes great care to optimise register usage, see
some of the examples in Digital Signal Processing on the ARM starting on
page 35.

Loop Unrolling
--------------

Loop unrolling involves using more than one copy of the inner loop of an
algorithm. The following benefits may be gained by loop unrolling:

  the branch back to the beginning of the loop is executed less
  frequently;

  it may be possible to combine some of one iteration with some of the
  next iteration, and thereby significantly reduce the cost of each
  iteration. A common case of this is combining LDR or STR instructions
  from two or more iterations into single LDM or STM instructions. This
  reduces code size, the number of instruction fetches, and in the case
  of LDM, the number of register writeback cycles.

As an example to illustrate the issues involved in loop unrolling, let us
consider calculating the following over an array: x[i] = y[i] - y[i+1].
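In C, the calculation is simply (y having one more element than x):

```c
#include <stdint.h>

/* x[i] = y[i] - y[i+1] for i = 0 .. n-1; y must have n+1 elements */
void diff(int32_t *x, const int32_t *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] - y[i + 1];
}
```
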
Below is a code fragment which performs this:

        LDR     R2, [R0]        ; Preload y[i]
Loop    LDR     R3, [R0, #4]!   ; Load y[i+1]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        STR     R2, [R1], #4    ; Store x[i]
        MOV     R2, R3          ; y[i+1] is the next y[i]
        CMP     R0, R4          ; Finished ?
        BLT     Loop

Let us consider the number of execution cycles this will take on an ARM6
based processor. IF stands for Instruction Fetch, WB stands for Register
Write Back, R stands for Read, and W stands for Write. The loop will
execute in the following cycles: 6 IF, 1 R, 1 WB, 1 W, and the branch
costs an additional 2 IF cycles. Therefore the total cycle count for
processing a 100 element x[] array is:

  799 IF (198 caused by branching), 101 R, 101 WB, 100 W (1198 cycles)
  Code size: 7 instructions

Now consider the effects of unrolling this loop:

Branch overhead cycles
----------------------

In the above example there are 198 IFs caused by branching. Unrolling the
loop can clearly reduce this, and the table below shows how progressively
unrolling the loop gives diminishing returns for the increase in code
size:

  Times Unrolled    IFs caused by Branching    IF saving
  --------------    -----------------------    ---------
        2                     98                  100
        3                     66                  134
        4                     48                  150
        5                     38                  160
       10                     18                  180
      100                      0                  198

Therefore if code size is an issue of any importance, unrolling more than
around 3 times is unlikely to pay off with regard to branch overhead
elimination.

Combining LDRs and STRs into LDM and STM
----------------------------------------

The number of LDRs or STRs which can be combined into a single LDM or STM
is restricted by the number of available registers. In this instance 10
registers is the most which are likely to be usable; this would result in
unrolling the loop 10 times for the above example. Another case to
consider is unrolling 3 times, as this seems to be a good compromise for
branch overhead reduction.
Other optimisations
-------------------

Upon examining the unrolled code below, it can be seen that it is only
necessary to execute the MOV once per loop, thus saving another 2 IF
cycles per loop for the 3 times unrolled code, and another 9 IF cycles per
loop for the 10 times unrolled code.

Here is the code unrolled three times and then optimised as described
above:

        LDR     R2, [R0], #4    ; Preload y[i]
Loop    LDMIA   R0!, {R3-R5}    ; Load y[i+1] to y[i+3]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        SUB     R3, R3, R4      ; x[i+1] = y[i+1] - y[i+2]
        SUB     R4, R4, R5      ; x[i+2] = y[i+2] - y[i+3]
        STMIA   R1!, {R2-R4}    ; Store x[i] to x[i+2]
        MOV     R2, R5          ; y[i+3] is the next y[i]
        CMP     R0, R6          ; Finished ?
        BLT     Loop

Analysing how this code executes for a y[] array of size 100, as described
above for the original code, produces the following results:

  339 IF (66 caused by branching), 101 R, 34 WB, 100 W (574 cycles)
  Code size: 9 instructions
  Saving over the original code: 460 IF, 67 WB

Here is the code unrolled ten times and then optimised in the same way:

        LDR     R2, [R0], #4    ; Preload y[i]
Loop    LDMIA   R0!, {R3-R12}   ; Load y[i+1] to y[i+10]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        SUB     R3, R3, R4      ; x[i+1] = y[i+1] - y[i+2]
        SUB     R4, R4, R5      ; x[i+2] = y[i+2] - y[i+3]
        SUB     R5, R5, R6      ; x[i+3] = y[i+3] - y[i+4]
        SUB     R6, R6, R7      ; x[i+4] = y[i+4] - y[i+5]
        SUB     R7, R7, R8      ; x[i+5] = y[i+5] - y[i+6]
        SUB     R8, R8, R9      ; x[i+6] = y[i+6] - y[i+7]
        SUB     R9, R9, R10     ; x[i+7] = y[i+7] - y[i+8]
        SUB     R10, R10, R11   ; x[i+8] = y[i+8] - y[i+9]
        SUB     R11, R11, R12   ; x[i+9] = y[i+9] - y[i+10]
        STMIA   R1!, {R2-R11}   ; Store x[i] to x[i+9]
        MOV     R2, R12         ; y[i+10] is the next y[i]
        CMP     R0, R13         ; Finished ?
        BLT     Loop

Analysing how this code executes for a y[] array of size 100 produces the
following results:

  169 IF (18 caused by branching), 101 R, 10 WB, 100 W (380 cycles)
  Code size: 16 instructions
  Saving over the original code: 630 IF, 91 WB

Thus for this problem, unless the extra seven instructions make the code
too large, unrolling ten times is likely to be the optimum solution.
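For comparison, the three-times-unrolled loop above corresponds to this C
sketch (n is assumed to be a multiple of 3, mirroring the assembly's
assumption that the array length divides evenly):

```c
#include <stdint.h>

/* Three-times-unrolled x[i] = y[i] - y[i+1]; n must be a multiple
   of 3 and y must have n+1 elements. Each iteration loads only
   y[i+1]..y[i+3] and reuses the previously loaded y[i], as the
   LDMIA-based assembly does. */
void diff3(int32_t *x, const int32_t *y, int n)
{
    int32_t a = y[0];                    /* preload y[i] */
    for (int i = 0; i < n; i += 3) {
        int32_t b = y[i + 1], c = y[i + 2], d = y[i + 3];
        x[i]     = a - b;
        x[i + 1] = b - c;
        x[i + 2] = c - d;
        a = d;                           /* y[i+3] is the next y[i] */
    }
}
```
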
However, loop unrolling is not always a good idea, especially when the
optimisation between one iteration and the next is small. Consider the
following loop which copies an area of memory:

Loop    LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        CMP     R0, #LimitAddress
        BNE     Loop

This could be unrolled as follows:

Loop    LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        CMP     R0, #LimitAddress
        BLT     Loop

In this code the CMP and the branch will be executed only a quarter as
often, but this will give only a small saving. Moreover, other issues
should be taken into account:

If in the above case the amount of data to be transferred is not a
multiple of 48 words (the amount copied by one unrolled iteration), then
this amount of loop unrolling will copy too much data. This may be
catastrophic, or may merely be inefficient.

On a cached ARM processor, the larger the inner loop, the more likely it
is that the loop will not stay entirely in the cache. In this case, it is
not obvious at what point the performance gain due to unrolling is offset
by the performance loss due to cache misses, or the disadvantage of larger
code.

On an ARM processor with a write buffer, the loop unrolling in the above
example is unlikely to help. If the data being copied is not in the cache,
then every LDMIA will be stalled while the write buffer empties. Thus the
time the CMP and the branch take is irrelevant, as the processor will be
stalled on the following LDMIA.

Loop unrolling can be a useful technique, but be warned that it doesn't
always gain anything, and in some situations can reduce performance -
detailed analysis is often necessary.

Write Buffer Stalling
---------------------

On ARM processors with a write buffer, performance can be maximised by
writing code which avoids stalling due to the write buffer.
For a write buffer with 2 tags and 8 words, such as the ARM610's, no three
STR or STM instructions should be close together (as the third will be
stalled until the first has finished). Similarly, no two STR or STM
instructions which together store more than 8 words should be close
together, as the second will be stalled until there is space in the write
buffer.

Rearranging code so that the write buffer does not cause a stall in this
way is often hard, but is frequently worth the effort, and in any case it
is always wise to be aware of this performance factor.

16-bit Data
-----------

If possible treat 16-bit data as 32-bit data. However, if this is not
possible, then be aware that it is possible to make use of the barrel
shifter and non word-aligned LDRs in order to make working with 16-bit
data more efficient than might initially be imagined. See Using 16-bit
Data on the ARM starting on page 41 for a full discussion of this topic.

8-bit Data
----------

When processing a sequence of byte sized objects (eg. strings), the number
of loads and stores can be reduced if the data is loaded a word at a time
and then processed a byte at a time by extracting the bytes using the
barrel shifter.

The Floating Point Emulator
---------------------------

If the software-only floating point emulator is being used, then floating
point instructions should be placed sequentially, as the floating point
emulator will detect that the next instruction is also a floating point
instruction, and will emulate it without leaving the undefined instruction
handler code. Note that this advice is not applicable to systems which use
the ARM FPA co-processor.

Make Full Use of Cache Lines
----------------------------

In order to help the cache on a cached ARM processor maintain a high hit
rate for data, place frequently accessed data values together so that they
are loaded into the same cache line, rather than scattering them around
memory, as scattering will require more cache lines to be loaded and kept
in the cache.
In a similar vein, commonly called subroutines (especially short ones)
will often run more quickly on a cached ARM processor if the entry address
is aligned so that it will be loaded into the first word of a cache line.
On the ARM610, for example, this means quad-word aligned. This ensures
that all four words of the first line fetch will subsequently be used by
instruction fetches before another line fetch is caused by an instruction
fetch. This technique is most useful for large programs which do not cache
well, as the number of times the code will have to be fetched from memory
is not likely to be significant if the program does cache well.

Minimise Non-sequential Cycles
------------------------------

This technique is only appropriate to uncached ARM processors, and is
intended for memory systems in which non-sequential memory accesses take
longer than sequential memory accesses.

Consider such a system in which the length of memory bursts is B. That is,
if executing a long sequence of data operations, the memory accesses which
result are: one non-sequential memory cycle followed by B-1 sequential
memory cycles. eg. DRAM controlled by the ARM memory manager MEMC1a.

This sequence of memory accesses will be broken up by several ARM
instruction types: Load or Store (single or multiple), Data Swap, Branch
instructions, SWIs and other instructions which modify the PC. By placing
these instructions carefully, so that they break up the normal sequence of
memory cycles only where a non-sequential cycle was about to occur anyway,
the number of sequential cycles which are turned into longer
non-sequential cycles can be minimised.

For a memory system which has memory bursts of length B, the optimal
position for instructions which break up the memory cycle sequence is 3
words before the next B-word boundary. To help explain this, consider a
memory system with memory bursts of length 4 (ie. quad word bursts); here
the optimal position for these break-up instructions is 16-12=4 bytes
past a quad-word boundary. To demonstrate this is the case, imagine
executing the following code in this system:

  0x0000  Data Instr 1
  0x0004  STR
  0x0008  Data Instr 2
  0x000C  Data Instr 3
  0x0010  Data Instr 4

The memory cycles produced by executing this code are (taking into account
the ARM instruction pipeline):

  Instruction Fetch 0x0000 (Non Seq)
  Instruction Fetch 0x0004 (Seq)
  Instruction Fetch 0x0008 (Seq)      + Execute Data Instr 1
  Instruction Fetch 0x000C (Seq)      + Execute STR
  Data Write (Non Seq)
  Instruction Fetch 0x0010 (Non Seq)  + Execute Data Instr 2
  Instruction Fetch 0x0014 (Seq)      + Execute Data Instr 3

The instruction fetch after the Data Write cycle had to be a
non-sequential cycle, but since that instruction fetch was of a quad-word
aligned address it would have been non-sequential anyway. Hence the STR is
optimally positioned to avoid changing sequential instruction fetches into
non-sequential instruction fetches.

Use an Efficient Algorithm
--------------------------

Despite all these techniques for optimising ARM Assembly Language, it is
important to take care to start off with an efficient algorithm - all the
optimisations in the world won't turn bubble sort into an n log n sorting
algorithm!

3.11.3 Related Topics
---------------------

Most of the recipes in this chapter (Exploring ARM Assembly Language
starting on page 20) are likely to have some relevance. Specific
references have been indicated above for particular topics.