This document is Copyright 1994 ARM Ltd, and has been included on this disc with their kind permission. This manual is supplied "as is"; ARM Limited ("ARM") makes no warranty, express or implied, of the merchantability of this document or its fitness for any particular purpose. In no circumstances shall ARM be liable for any damage, loss of profits, or any indirect or consequential loss arising out of the use of these recipes or inability to use these recipes, even if ARM has been advised of the possibility of such loss.

---------------------------------------------------------------------------

3. Exploring ARM Assembly Language
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3.1 Integer to String conversion
--------------------------------

3.1.1 About this Recipe
-----------------------
This recipe shows you:

  how to convert an integer to a string in ARM assembly language;
  how to use a stack in an ARM assembly language program;
  how to write a recursive function in ARM assembly language.

This recipe refers to the program utoa1.s in the examples directory. Its dtoa entry point converts a signed integer to a string of decimal digits (possibly with a leading '-'); its utoa entry point converts an unsigned integer to a string of decimal digits.

3.1.2 The Algorithm
-------------------
To convert a signed integer to a decimal string: generate a '-' and negate the number if it is negative; then convert the remaining unsigned value.

To convert a given unsigned integer to a decimal string, divide it by 10, yielding a quotient and a remainder. The remainder is in the range 0-9 and is used to create the last digit of the decimal representation. If the quotient is non-zero, it is dealt with in the same way as the original number, creating the leading digits of the decimal representation; otherwise the process has finished.
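The algorithm can be sketched in C (a hypothetical helper for illustration, not the toolkit's utoa; the remainder is formed with shifts and subtracts, just as the assembly version does):

```c
/* Recursive unsigned-to-ASCII conversion: convert the quotient first so
   the leading digits land first, then append the remainder digit.
   Returns a pointer just past the last digit written, as utoa does in a1. */
char *utoa_sketch(char *buf, unsigned int n)
{
    unsigned int q = n / 10;                   /* what udiv10 computes */
    unsigned int r = n - (q << 3) - (q << 1);  /* n - 8q - 2q = n mod 10 */
    if (q != 0)
        buf = utoa_sketch(buf, q);             /* leading digits */
    *buf++ = (char)('0' + r);                  /* final digit */
    return buf;
}
```

Calling utoa_sketch with a buffer and 105 writes the three characters '1', '0', '5' and returns a pointer three places past the start of the buffer.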
3.1.3 The Implementation
------------------------
The implementation of utoa sketched below employs the register naming and usage conventions of the ARM Procedure Call Standard: a1-a4 are argument or scratch registers and a1 is the function result register; v1-v5 are 'variable' registers (preserved across function calls); sp is the stack pointer; at routine entry, lr holds the subroutine call return address; and pc is the program counter.

  utoa  STMFD   sp!, {v1, v2, lr}   ; function entry - save some v-registers
                                    ; and the return address
        MOV     v1, a1              ; preserve arguments over following
        MOV     v2, a2              ; function calls
        MOV     a1, a2
        BL      udiv10              ; a1 = a1 / 10
        SUB     v2, v2, a1, LSL #3  ; number - 8*quotient
        SUB     v2, v2, a1, LSL #1  ;        - 2*quotient = remainder
        CMP     a1, #0              ; quotient non-zero?
        MOVNE   a2, a1              ; quotient to a2...
        MOV     a1, v1              ; buffer pointer unconditionally to a1
        BLNE    utoa                ; conditional recursive call to utoa
        ADD     v2, v2, #'0'        ; final digit
        STRB    v2, [a1], #1        ; store digit at end of buffer
        LDMFD   sp!, {v1, v2, pc}   ; function exit - restore and return

Explanation
-----------
On entry, a2 contains the unsigned integer to be converted and a1 addresses a buffer to hold the character representation of it. On exit, a1 points immediately after the last digit written.

Both the buffer pointer and the original number have to be saved across the call to udiv10. This could be done by saving the values to memory. However, it turns out to be more efficient to use two 'variable' registers, v1 and v2 (which, in turn, have to be saved to memory). (An instructive exercise for the reader is to rework this example with a1 and a2 saved to memory in the initial STMFD, rather than v1 and v2.)

Because utoa calls other functions (udiv10 and utoa) it must save its return link address passed in lr. The function therefore begins by stacking v1, v2 and lr using STMFD sp!, {v1,v2,lr}.
In the next block of code, a1 and a2 are saved (across the call to udiv10) in v1 and v2 respectively, and the given number (a2) is moved to the first argument register (a1) before calling udiv10 with a BL (Branch with Link) instruction.

On return from udiv10, 10 times the quotient is subtracted from the original number (preserved in v2) by two SUB instructions. The remainder (in v2) is ready to be converted to character form (by adding ASCII '0') and to be stored into the output buffer. But first, utoa has to be called to convert the quotient, unless that is zero. The next four instructions do this, comparing the quotient (in a1) with 0, moving the quotient to the second argument register (a2) if not zero, moving the buffer pointer to the first argument/result register (a1), and calling utoa if the quotient is not zero.

Note that the buffer pointer is moved to a1 unconditionally: if utoa is called recursively then a1 will be updated, but it will still identify the next free buffer location; if utoa is not called recursively, the next free buffer location is still needed in a1 by the following code which plants the remainder digit and returns the updated buffer location (via a1).

The remainder (in a2) is converted to character form by adding '0' and is then stored in the location addressed by a1. A post-incrementing STRB is used, which stores the character and increments the buffer pointer in a single instruction, leaving the result value in the result register a1.

Finally, the function is exited by restoring the saved values of v1 and v2 from the stack, loading the stacked link address into pc and popping the stack, all with a single multiple load instruction LDMFD sp!, {v1,v2,pc}.

3.1.4 Creating a Runnable Example
---------------------------------
You can run the utoa routine described here under armsd.
To do this, you must assemble the example and the udiv10 function, compile a simple test harness written in C, and link the resulting objects together to create a runnable program. Begin by setting your current directory to the examples directory, then use the following commands:

  armasm utoa1.s -o utoa1.o -li
  armasm udiv10.s -o udiv10.o -li
  armcc -c utoatest.c -li -apcs 3/32bit
  armlink -o utoatest utoa1.o udiv10.o utoatest.o somewhere/armlib.32l

where somewhere is the directory in which armlib.32l can be found.

Explanation
-----------
The first two armasm commands assemble the utoa function and the udiv10 function, creating relocatable object files utoa1.o and udiv10.o. The -li flag tells armasm to assemble for a little-endian memory. You can omit this flag if your armasm has been configured for this default (see The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for how to configure the ARM development tools).

The armcc command compiles the test harness. The -c flag tells armcc not to link its output with the C library; the -li flag tells armcc to compile for a little-endian memory (as with armasm).

The armlink command links your three relocatable objects with the ARM C library to create a runnable program (here called utoatest). If you have installed your ARM development tools in a standard way, then you could use the following shorter command to do the compilation and linking:

  armcc utoatest.c utoa1.o udiv10.o -apcs 3/32bit -li

3.1.5 Running the Example
-------------------------
You can run your example program under armsd using:

  armsd -li utoatest

Note that the -li and -apcs 3/32bit options can be omitted if the tools have been configured appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for details.

3.1.6 Stacks in Assembly Language
---------------------------------
In this example, three words are pushed on to the stack on entry to utoa and popped off again on exit.
By convention, ARM software uses r13, usually called sp, as a stack pointer pointing to the last-used word of a downward growing stack (a so-called 'full, descending' stack). However, this is only a convention, and the ARM instruction set supports equally all four stacking possibilities: {full or empty} x {ascending or descending}.

The instruction used to push values on the stack was:

  STMFD sp!, {v1, v2, lr}

The action of this instruction is as follows:

  subtract 4 * number-of-registers from sp;
  store the registers named in {...} in ascending register number order
  to memory at [sp], [sp,4], [sp,8] ...

The matching pop instruction was:

  LDMFD sp!, {v1, v2, pc}

Its action is:

  load the registers named in {...} in ascending register number order
  from memory at [sp], [sp,4], [sp,8] ...
  add 4 * number-of-registers to sp.

Discussion
----------
Many, if not most, register-save requirements in simple assembly language programs can be met using this approach to stacks. A more complete treatment of run-time stacks requires a discussion of:

  stack-limit checking (and extension);
  local variables and stack frames.

In the utoa program, you must assume the stack is big enough to deal with the maximum depth of recursion, as the code makes no check of this. In practice, this assumption is sound. The biggest 32-bit unsigned integer is about four billion, or ten decimal digits, so the recursion is at most ten deep. This means that at most 10 x 3 registers x 4 bytes = 120 bytes have to be stacked. Because the ARM Procedure Call Standard (APCS) guarantees that there are at least 256 bytes of stack available when a function is called, and because we can guess (or know) that udiv10 uses no stack space, we can be confident that utoa is quite safe if called by an APCS-conforming caller such as a compiled-C test harness.

This discussion raises another delicacy. The stacking technique illustrated here conforms to the ARM Procedure Call Standard only if the function using it makes no function calls.
utoa calls both udiv10 and itself, so it really ought to establish a proper stack frame (see ARM Procedure Call Standard starting on page 38 of the Technical Specifications). If you really want to be safe and write functions that can 'plug and play together', you have to follow the APCS exactly. However, when writing a whole program in assembly language you often know much more than when writing a program fragment for general, robust service. This allows you to gently break the APCS in the following way:

  Any chain of function/subroutine calls can be considered compatible
  with the APCS provided it uses less than 256 bytes of stack space.

So the utoa example is compatible with the APCS even though it doesn't conform to the APCS. Note however: if you call any function whose stack use is unknown (but which is believed to be APCS-conforming), you court disaster unless you establish a proper APCS call frame and perform APCS stack limit checking on function entry. Please refer to ARM Procedure Call Standard starting on page 38 of the Technical Specifications for further details.

3.1.7 Related Topics
--------------------
For more information about stacks, and conforming to the ARM Procedure Call Standard, see:

  Flexibility of Load and Store Multiple starting on page 13;
  Register Usage under the ARM Procedure Call Standard starting on page 62;
  Stack Overflow Checking starting on page 93.

3.2 Multiply by a Constant
--------------------------

3.2.1 About this Recipe
-----------------------
This recipe shows you how to construct a sequence of ARM instructions to multiply by a constant. For some applications multiply is used extensively, so it is important to make the application run as fast as possible. For instance, most DSP (Digital Signal Processing) applications perform a lot of multiplication. In many cases where a multiply is used, one of the values is a constant (eg. weeks*7).
A naive programmer might assume that the only way to calculate this would be to use the MUL instruction, but there is an alternative: this recipe demonstrates how to improve the speed of multiply-by-constant by using a sequence of arithmetic instructions instead of the general-purpose multiplier.

3.2.2 Introduction
------------------
Throughout this recipe, registers are referred to using register names (eg. Rd, Rm, Rs), but in assembler source code you should use only register names which have been declared using the RN directive (eg. a1, r4 etc.). This recipe does not refer to any example programs; it should be viewed as an explanation of the multiply-by-constant technique.

MUL has the following syntax:

  MUL Rd, Rm, Rs

The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that a MUL with Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive takes 1S + mI cycles. For more details on the multiplier, see ARM6 Multiplier Performance Issues starting on page 38. There is, of course, no guarantee that MUL will not be implemented differently (possibly faster) in the future...

When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:

  ADD Rd, Rm, Rm, LSL #2    ; Rd = Rm + (Rm * 4) = Rm * 5

This ADD version is obviously better than the MUL version below:

  MOV Rs, #5
  MUL Rd, Rm, Rs

The 'cost' of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself.

3.2.3 The problem of finding the optimum sequence
-------------------------------------------------
The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each.
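The equivalence behind such sequences is easy to convince yourself of in C; for instance, the multiply-by-5 ADD above computes (a quick check, not toolkit code):

```c
/* Multiply by 5 with one shift-and-add, mirroring
   ADD Rd, Rm, Rm, LSL #2: x*5 = x + x*4. */
unsigned int mul5(unsigned int x)
{
    return x + (x << 2);
}
```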
Consider multiply by 105. This could be achieved by decomposing thus:

  105 == 128 - 23 == 128 - (16 + 7) == 128 - (16 + (8 - 1))

  RSB Rd, Rm, Rm, LSL #3    ; Rd = Rm*7
  ADD Rd, Rd, Rm, LSL #4    ; Rd = Rm*7 + Rm*16 = Rm*23
  RSB Rd, Rd, Rm, LSL #7    ; Rd = -Rm*23 + Rm*128 = Rm*105

Or, decomposing differently:

  105 == 15 * 7 == (16 - 1) * (8 - 1)

  RSB Rt, Rm, Rm, LSL #4    ; Rt = Rm*15 (tmp reg)
  RSB Rd, Rt, Rt, LSL #3    ; Rd = Rt*7 = Rm*105

The second method is the optimal solution (fairly easy to find for small values such as 105). However, the problem of finding the optimum becomes much more difficult for larger constant values. A program can be written to search exhaustively for the optimum, but it may take a long time to execute. There are no known algorithms which solve this problem quickly.

Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a large constant, more than one temporary may be needed, otherwise the sequence will be longer. The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off: it uses a normal MUL if the number of instructions in the multiply-by-constant sequence would exceed some number N. This avoids the sequence becoming too long.

3.2.4 Experimenting with armcc assembly output
----------------------------------------------
When writing a speed-critical ARM assembler program, it is a good idea to code it in C first (to check the algorithm) before converting it to hand-tuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work. Invoking armcc with the -S flag will generate an assembly file instead of an object file. For example, consider the following simple C code:

  int mulby105( int num )
  {
      return num * 105;
  }

Compile this using:

  armcc -li -S mulby105.c

Now, examine the file mulby105.s which has been created.
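Both decompositions can themselves be verified with a quick C check (shifts, adds and reverse subtracts only, mirroring the two ARM sequences above):

```c
/* First decomposition: 105 = 128 - (16 + (8 - 1)). */
unsigned int mul105_first(unsigned int x)
{
    unsigned int d = (x << 3) - x;      /* RSB: x*7 */
    d = d + (x << 4);                   /* ADD: x*7 + x*16 = x*23 */
    return (x << 7) - d;                /* RSB: x*128 - x*23 = x*105 */
}

/* Second (optimal) decomposition: 105 = (16 - 1) * (8 - 1). */
unsigned int mul105_second(unsigned int x)
{
    unsigned int t = (x << 4) - x;      /* RSB: x*15 */
    return (t << 3) - t;                /* RSB: t*8 - t = x*105 */
}
```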
The version of armcc in this release may not produce precisely this output, although it should be very similar:

  ; generated by Norcroft ARM C vsn 4.41 (Advanced RISC Machines)

          AREA |C$$code|, CODE, READONLY

  |x$codeseg|

          EXPORT mulby105
  mulby105
          RSB     a1,a1,a1,LSL #4
          RSB     a1,a1,a1,LSL #3
          MOV     pc,lr

          AREA |C$$data|,DATA

  |x$dataseg|

          END

Notice that the compiler has found the short multiply-by-constant sequence.

3.2.5 Discussion of speed improvement
-------------------------------------
To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 (which is approximately 8192*sqrt(2)) as an example. A normal multiply consists of:

  MOV Rs, #0x2D << 8        ; load constant
  ORR Rs, Rs, #0x41         ; load constant, now Rs = 11585
  MUL Rd, Rm, Rs            ; do the multiply

The load-a-constant stage may take up to four instructions (in this case 2) or an LDR,=, and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the product (on ARM6-based processors).

The optimal multiply-by-constant sequence consists of:

  ADD Rd, Rm, Rm, LSL #1    ; Rd = Rm*3
  RSB Rd, Rd, Rd, LSL #4    ; Rd = Rd*15 = Rm*45
  ADD Rd, Rm, Rd, LSL #8    ; Rd = Rm + Rd*256 = Rm*11521
  ADD Rd, Rd, Rm, LSL #6    ; Rd = Rd + Rm*64 = Rm*11585

This is just 4 data processing instructions.

  Method                Cycles
  ------                ------
  Normal multiply       3 instructions + MUL internal cycles
  Multiply-by-constant  4 instructions

Considering the ARM6 family, the 2-bit Booth's Multiplier used by MUL takes a number of I-cycles depending on the value in Rs (in this case m=8, as Rs lies between 8192 and 32767). Hence multiply-by-constant looks to be the winner in this case. An instruction fetch is an external memory S-cycle on the ARM60, or a cache F-cycle (if there is a cache hit) on cached processors like the ARM610. With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core.
This means that the general multiply can sometimes be the fastest option (for large constants where an efficient solution cannot be found); it should also use less memory. If the load-a-constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply, as there is only the MUL to execute.

  Method                Cycles on ARM60   Cycles on ARM610
  ------                ---------------   ----------------
  Normal multiply       3S + 8I           11F
  Multiply-by-constant  4S                4F

3.2.6 Related Topics
--------------------
  Digital Signal Processing on the ARM starting on page 35;
  ARM6 Multiplier Performance Issues starting on page 38;
  Multiplication - Returning a 64-bit Result starting on page 69;
  ARM Assembly Programming Performance Issues starting on page 55.

3.3 Division by a Constant
--------------------------

3.3.1 About this Recipe
-----------------------
This recipe shows you:

  how to improve on the general divide code for the case when the divisor is a constant;
  the simple case for divide-by-2^n using the barrel shifter;
  how to use divc.c to generate ARM code for divide-by-constant.

3.3.2 Introduction
------------------
The ARM instruction set was designed following a RISC philosophy. One of the consequences of this is that the ARM core has no divide instruction, so divides must be performed using a subroutine. This means that divide can be quite slow, but this is not a major issue as divide performance is rarely critical for applications.

It is possible to do better than the general divide in the special case when the divisor is a constant. This recipe explains how the divide-by-constant technique works, and how to generate ARM assembler code for divide-by-constant.

3.3.3 Special case for divide-by-2^n
------------------------------------
In the special case when dividing by 2^n, a simple right shift is all that is required (instead of a left shift, which multiplies by a power of 2). There is a small caveat which concerns the handling of signed and unsigned numbers.
For signed numbers, an arithmetic right shift is required, as this performs sign extension (to handle negative numbers correctly). In contrast, unsigned numbers require a 0-filled logical shift right. Please refer to an ARM datasheet for more details of the difference between arithmetic and logical shifts.

  MOV a2, a1, lsr #5        ; unsigned division by 32
  MOV a2, a1, asr #10       ; signed division by 1024

3.3.4 Explanation of divide-by-constant ARM code
------------------------------------------------
The divide-by-constant technique basically does a multiply in place of the divide, but it is somewhat more complicated than multiply-by-constant (see Multiply by a Constant starting on page 24):

  x/y == x * (1/y)
           ^^^^^
           consider the underlined portion as a 0.32 fixed-point number
           (truncating any bits past the most significant 32). 0.32 means
           0 bits before the decimal point and 32 after it.

      == (x * (2^32/y)) / 2^32
              ^^^^^^^^
              the underlined portion here is a 32.0 bit fixed-point number

      == (x * (2^32/y)) >> 32

This is effectively returning the top 32 bits of the 64-bit product of x and (2^32/y). If y is a constant, then (2^32/y) is obviously also a constant.
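The identity can be demonstrated in C using a 64-bit intermediate product (here for y=10; a sketch for checking, not toolkit code). Because the reciprocal is truncated to 32 bits, the result can be one too small, which is why the divide-by-constant code needs a fix-up step:

```c
#include <stdint.h>

/* x/10 as a multiply by the 0.32 fixed-point reciprocal 2^32/10,
   keeping the top 32 bits of the 64-bit product. Truncating the
   reciprocal (429496729 = trunc(2^32/10)) loses a little precision,
   so the result is either x/10 or one less. */
uint32_t div10_approx(uint32_t x)
{
    const uint32_t recip = 429496729u;
    return (uint32_t)(((uint64_t)x * recip) >> 32);
}
```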
For certain y, the reciprocal (2^32/y) is a repeating pattern in binary:

   y  (2^32/y)
   2  10000000000000000000000000000000 #
   3  01010101010101010101010101010101 *
   4  01000000000000000000000000000000 #
   5  00110011001100110011001100110011 *
   6  00101010101010101010101010101010 *
   7  00100100100100100100100100100100 *
   8  00100000000000000000000000000000 #
   9  00011100011100011100011100011100 *
  10  00011001100110011001100110011001 *
  11  00010111010001011101000101110100
  12  00010101010101010101010101010101 *
  13  00010011101100010011101100010011
  14  00010010010010010010010010010010 *
  15  00010001000100010001000100010001 *
  16  00010000000000000000000000000000 #
  17  00001111000011110000111100001111 *
  18  00001110001110001110001110001110 *
  19  00001101011110010100001101011110
  20  00001100110011001100110011001100 *
  21  00001100001100001100001100001100 *
  22  00001011101000101110100010111010
  23  00001011001000010110010000101100
  24  00001010101010101010101010101010 *
  25  00001010001111010111000010100011

The lines marked with a # are the special cases 2^n, which have already been dealt with. The lines marked with a * have a simple repeating pattern. Note how regular the patterns are for y=2^n+2^m or y=2^n-2^m (for n>m)...

  n  m  (2^n+2^m)        n  m  (2^n-2^m)
  1  0       3           1  0       1
  2  0       5           2  1       2
  2  1       6           2  0       3
  3  0       9           3  2       4
  3  1      10           3  1       6
  3  2      12           3  0       7
  4  0      17           4  3       8
  4  1      18           4  2      12
  4  2      20           4  1      14
  4  3      24           4  0      15
  5  0      33           5  4      16
  5  1      34           5  3      24
  5  2      36           5  2      28
  5  3      40           5  1      30
  5  4      48           5  0      31

For the repeating patterns, it is a relatively easy matter to calculate the product by using a multiply-by-constant method. The result can be calculated in a small number of instructions by taking advantage of the repetition in the pattern; this corresponds to the optimal solution in the multiply-by-constant problem (see Multiply by a Constant starting on page 24). The actual multiply is slightly unusual due to the need to return the top 32 bits of the 64-bit result.
It is efficient to calculate just the top 32 bits; this can be achieved by modifying the multiply-by-constant sequence so that the input value is shifted right rather than left. Consider this fragment of the divide-by-ten code (x is the input dividend as used in the above equations):

  SUB a1, x, x, lsr #2      ; a1 = x*%0.11000000000000000000000000000000
  ADD a1, a1, a1, lsr #4    ; a1 = x*%0.11001100000000000000000000000000
  ADD a1, a1, a1, lsr #8    ; a1 = x*%0.11001100110011000000000000000000
  ADD a1, a1, a1, lsr #16   ; a1 = x*%0.11001100110011001100110011001100
  MOV a1, a1, lsr #3        ; a1 = x*%0.00011001100110011001100110011001

The SUB calculates (for example):

  a1 = x - x/4 = x - x*%0.01 = x*%0.11

Hence, just 5 instructions are needed to perform the multiply.

A small problem is caused by calculating just the top 32 bits, as this ignores any carry from the low 32 bits of the 64-bit product. Fortunately, this can be corrected. A correct divide would round down, so the remainder can be calculated by:

  x - (x/10)*10 = 0..9

It takes just 2 ARM instructions to perform this multiply-by-10 and subtract, by making good use of the ARM's barrel shifter. In the case when (x/10) is too small by 1 (if carry has been lost), the remainder will be in the range 10..19, in which case corrections must be applied. This test would require a compare-with-10 instruction, but it can be combined with other operations to save an instruction (see below). When a lost carry is detected, both the quotient and remainder must be fixed up (1 instruction each).
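Putting the pieces together, the whole scheme (shift/add reciprocal multiply followed by the fix-up) can be sketched in C, assuming 32-bit types; this is a model for checking the algorithm, not the toolkit's code:

```c
#include <stdint.h>

/* Divide-by-10 by reciprocal multiply on the top 32 bits, followed by
   the lost-carry fix-up, mirroring the steps of the div10 routine. */
void div10_sketch(uint32_t x, uint32_t *quot, uint32_t *rem)
{
    uint32_t a = x - (x >> 2);           /* x * %0.11 */
    a = a + (a >> 4);                    /* x * %0.110011... */
    a = a + (a >> 8);
    a = a + (a >> 16);
    a = a >> 3;                          /* ~x/10, possibly 1 too small */
    /* (x-10) - quotient*10, computed mod 2^32 as the flags-based
       assembly does; 5*a doubled gives 10*a */
    int32_t t = (int32_t)(x - 10u - ((a + (a << 2)) << 1));
    if (t >= 0)
        a = a + 1;                       /* lost carry: fix up quotient */
    else
        t = t + 10;                      /* no lost carry: fix up remainder */
    *quot = a;
    *rem = (uint32_t)t;
}
```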
This should explain the full divide-by-10 code:

  div10
  ; takes argument in a1
  ; returns quotient in a1, remainder in a2
  ; cycles could be saved if only divide or remainder is required
        SUB     a2, a1, #10         ; keep (x-10) for later
        SUB     a1, a1, a1, lsr #2
        ADD     a1, a1, a1, lsr #4
        ADD     a1, a1, a1, lsr #8
        ADD     a1, a1, a1, lsr #16
        MOV     a1, a1, lsr #3
        ADD     a3, a1, a1, asl #2
        SUBS    a2, a2, a3, asl #1  ; calc (x-10) - (x/10)*10
        ADDPL   a1, a1, #1          ; fix-up quotient
        ADDMI   a2, a2, #10         ; fix-up remainder
        MOV     pc, lr

The optimisation which eliminates the compare-with-10 instruction is to keep (x-10) for use in the subtraction that calculates the remainder. This means that a compare-with-0 is required instead, which is easily achieved by adding an S (to set the flags) to the SUB opcode. It also means that the subtraction has to be undone if no rounding error occurred (which is why the ADDMI instruction is used).

3.3.5 How to generate divide-by-constant sequences
--------------------------------------------------
For suitable numbers, the details of the divide-by-constant technique can be avoided completely by using the divc program. This is supplied in ANSI C source form in the examples directory. Naturally, it must be compiled in order to use it; use your host system's C compiler, or armcc (in which case the executable must be run using armsd). The divc command-line help can be obtained by running divc with no arguments:

  Usage: divc <number>
  Generates optimal ARM code for divide-by-constant
  where <number> is one of (2^n-2^m) or (2^n+2^m) eg. 10
  Advanced RISC Machines [01 Jul 92]

Type "divc 10" to generate the ARM assembler code for divide-by-10. The output is suitable for immediate use as an armasm source file. The routine is called 'udiv10' for unsigned divide-by-10 (for example). It takes the unsigned argument in a1, and returns the quotient in a1 and the remainder in a2. It conforms fully to the APCS, but the remainder may not be available when called from C.
The range of values covered by (2^n-2^m) and (2^n+2^m) contains some useful numbers such as 7, 10, 24 and 60.

3.3.6 Related Topics
--------------------
  Multiply by a Constant starting on page 24;
  C Programming for Deeply Embedded Applications starting on page 87, for information about the division routines to which armcc generates references.

3.4 Choosing a division implementation
--------------------------------------
This recipe shows you:

  how to select a divide implementation for the C-library;
  how to use the fast divide routines from the examples directory;
  a comparison of the speeds of the divide routines.

The ARM instruction set does not have a divide instruction. In some applications it is important that a general-purpose divide executes as quickly as possible. This recipe shows how to choose between different divide implementations for the ARM.

3.4.1 Divide implementations in the C-library
---------------------------------------------
The C-library offers a choice between two divide variants. This choice is basically a speed vs. space tradeoff; the options are 'unrolled' and 'small'. In the C-library build directory (eg. directory semi for the semi-hosted library), the file options is used to select variants of the C-library. The supplied file contains the following:

  memcpy              = fast
  divide              = unrolled
  stdfile_redirection = off
  fp_type             = module
  stack               = contiguous
  backtrace           = off

The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes (55 instructions for the signed version plus 49 instructions for the unsigned version). This is an appropriate default for most toolkit users, who are interested in obtaining maximum performance. Alternatively, you can change this file to select the 'small' divide, which is more compact at 136 bytes (20 instructions for signed plus 14 instructions for unsigned) but somewhat slower, as there is considerable looping overhead.
For a comparison of the speed difference between these two routines, consult the following table (the speed of divide is data-dependent):

  Signed division example timings
  -------------------------------
  Cycle times are F-cycles on a cached ARM6 series processor,
  excluding call & return.

  Calc           unrolled cycles   rolled cycles
  ----           ---------------   -------------
  0/100                22                19
  9/7                  22                19
  4096/2               70               136
  1000000/3            99               240
  1000000000/1        148               370

If you have a specific requirement, you can modify the supplied routines to suit your application better. For instance, you could write an unrolled-2-times version, or you could combine the signed and unsigned versions to save more space.

3.4.2 Guaranteed-performance divide routines for real-time applications
-----------------------------------------------------------------------
The C-library also contains two fully unwound divide routines. These have been carefully implemented for maximum speed. They are useful when a guaranteed performance is required, eg. for real-time feedback control routines or DSP. The long maximum division time of the standard C-library functions may make them unsuitable for some real-time applications. The supplied routines implement signed division only; it would be possible to modify them for unsigned division if required.
The routines are described by the standard header file "stdlib.h", which contains the following declarations:

  /* ARM real-time divide functions for guaranteed performance */
  typedef struct __sdiv32by16 { int quot, rem; } __sdiv32by16;
      /* used int so that values return in regs, although 16-bit */
  typedef struct __sdiv64by32 { int rem, quot; } __sdiv64by32;

  __value_in_regs extern __sdiv32by16 __rt_sdiv32by16(
      int /*numer*/, short int /*denom*/);
  /*
   * Signed div: (16-bit quot), (16-bit rem) = (32-bit) / (16-bit)
   */
  __value_in_regs extern __sdiv64by32 __rt_sdiv64by32(
      int /*numer_h*/, unsigned int /*numer_l*/, int /*denom*/);
  /*
   * Signed div: (32-bit quot), (32-bit rem) = (64-bit) / (32-bit)
   */

These routines both have guaranteed performance:

  108 cycles for __rt_sdiv64by32 (excluding call & return)
  44 cycles for __rt_sdiv32by16

Currently the C-compiler does not automatically use these routines, as the default routines have early-termination, which yields good performance for small values. In order to use these new divide routines, you should explicitly call the relevant function. Use of the __rt_sdiv64by32 function is complicated by the fact that our C-compiler does not currently support 64-bit integers, so you have to split a 64-bit value between two 32-bit variables.

The following example program shows how to use these routines. This program is available as dspdiv.c in the examples directory. Once the program has been compiled and linked, type

  armsd dspdiv 1000 3

to calculate 1000/3.

dspdiv.c source code:

  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
      if (argc != 3)
          puts("needs 2 numeric args");
      else {
          __sdiv32by16 result;
          result = __rt_sdiv32by16(atoi(argv[1]), atoi(argv[2]));
          printf("quotient %d\n", result.quot);
          printf("remainder %d\n", result.rem);
      }
      return 0;
  }

3.4.3 Summary
-------------
The standard division routine used by the C-library can be selected by using the options file in the C-library build area.
If the supplied variants are not suitable, you can write your own. For real-time applications, the maximum division time must be as short as possible to ensure that the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and __rt_sdiv32by16 are useful.

3.4.4 Related Topics
--------------------
  Division by a Constant starting on page 28;
  C Programming for Deeply Embedded Applications starting on page 87, for information about the division routines to which armcc generates references.

3.5 Digital Signal Processing on the ARM
----------------------------------------

3.5.1 About this Recipe
-----------------------
This recipe is designed to explain the issues involved in performing digital signal processing (DSP) on the ARM. DSP often needs to be performed in real-time, so it is important to achieve the highest possible throughput. In such instances, careful speed optimisation of ARM assembly code is often necessary to achieve the performance required.

3.5.2 Introduction
------------------
DSP is finding a growing number of applications as all kinds of signals are now processed digitally, eg. Compact Disc, telephone speech compression (GSM, G.721 etc.), PhotoCD, JPEG, MPEG... The ARM cannot match a dedicated DSP chip for raw performance, but not all applications require ultra-high performance. The ARM processor can also perform other tasks; DSP chips are severely limited in their range of functions due to their specialised architecture.

Features of most DSP chips:

  single-cycle multiply-accumulate;
  no-overhead loops (a dedicated loop counter register is decremented in
  parallel with executing the body of the loop);
  address generators (with circular buffer support, and bit-reversed
  addressing);
  Harvard architecture to funnel data into the multiply-accumulate.

There is just one main operation which DSP chips perform very fast: the weighted sum.
This is a scalar product, where each element of one array of data is multiplied by the corresponding element in another array, and the total is accumulated. In any book on digital signal processing there are hundreds of equations using sigma notation, which denotes a weighted sum. This single operation is required for all the major DSP functions: correlation, autocorrelation, FIR filters, IIR filters, convolution, DCT etc.

Features of the ARM which are advantageous for DSP
--------------------------------------------------
barrel shifter in parallel with data processing instructions;
MUL instruction;
auto-update load/store instructions;
auto-update load/store multiple (quick sequential addresses).

3.5.3 Examples of some DSP code on the ARM
------------------------------------------
These examples demonstrate typical DSP code. The MLA accumulates the products in a single 32-bit register, so care must be taken to ensure that the value will not overflow. For example, if 1024 products are to be accumulated, the number of bits in each product should not exceed 22, otherwise the total may overflow.

Naive version of weighted-sum ARM code: This is the obvious version of the weighted-sum code, which uses 2 load instructions and a MLA.

        MOV     r10, #0                 ; initialise total
naive_sigma_loop
        LDR     r0, [r8], #4            ; load data A & update pointer
        LDR     r1, [r9], #4            ; load data B & update pointer
        MLA     r10, r0, r1, r10
        SUBS    r11, r11, #1            ; decrement counter
        BNE     naive_sigma_loop

Faster version of weighted-sum ARM code: This shows how to unwind the loop 4 times (to lower the branch overhead). Load-multiple (LDM) is used instead of single register loads; this improves performance significantly. It is possible to use more registers and unwind the loop further.
        MOV     r10, #0                 ; initialise total
faster_sigma_loop
        LDMIA   r8!, {r0-r3}            ; load 4 data A values & update pointer
        LDMIA   r9!, {r4-r7}            ; load 4 data B values & update pointer
        MLA     r10, r0, r4, r10
        MLA     r10, r1, r5, r10
        MLA     r10, r2, r6, r10
        MLA     r10, r3, r7, r10
        SUBS    r11, r11, #1            ; decrement counter
        BNE     faster_sigma_loop

Cross-correlation: This example performs cross-correlation; this particular code was written for a telephone-quality speech compressor. It demonstrates careful optimisation for this specific function; there are 10 MLA instructions to every 1 LDM instruction. All the registers (apart from r15) are used in order to reduce load operations. Cross-correlation involves multiplying every element of one list with the corresponding element in another list and accumulating the total (a weighted sum). To calculate the cross-correlation function, the offset is varied as in this example:

    i1 i2 i3 i4
     *  *  *  *                     = cross_corr_0
    j1 j2 j3 j4 j5 j6 j7 j8 j9

       i1 i2 i3 i4
        *  *  *  *                  = cross_corr_1
    j1 j2 j3 j4 j5 j6 j7 j8 j9

and so on until:

                   i1 i2 i3 i4
                    *  *  *  *      = cross_corr_5
    j1 j2 j3 j4 j5 j6 j7 j8 j9

Notice that 'i' has 4 elements and 'j' has 9 elements, so the cross_corr list has (9-4+1)=6 elements. The routine given here is designed to process 4 elements from 'i' and 4 elements from 'j' per block. The block can be repeated to process the entire 'i' list (which should be a multiple of 4). The 'i'-list should be the smaller of the two, so that it is traversed completely. This yields 5 totals which are cross-correlation results. An outer loop can be used to update the start position in 'j' in order to calculate the full cross-correlation function.
        LDMIA   r8!, {r0-r3}            ; initialise: load j1-j4 (1)
        LDMIA   r9!, {r4-r7}            ; repeating block start, load i1-i4 (2)
        MLA     r10, r0, r4, r10
        MLA     r10, r1, r5, r10
        MLA     r10, r2, r6, r10
        MLA     r10, r3, r7, r10
        MLA     r11, r1, r4, r11
        MLA     r11, r2, r5, r11
        MLA     r11, r3, r6, r11
        MLA     r12, r2, r4, r12
        MLA     r12, r3, r5, r12
        MLA     r13, r3, r4, r13
        LDMIA   r8!, {r0-r3}            ; load j5-j8 (3)
        MLA     r14, r0, r4, r14
        MLA     r14, r1, r5, r14
        MLA     r14, r2, r6, r14
        MLA     r14, r3, r7, r14
        MLA     r13, r0, r5, r13
        MLA     r13, r1, r6, r13
        MLA     r13, r2, r7, r13
        MLA     r12, r0, r6, r12
        MLA     r12, r1, r7, r12
        MLA     r11, r0, r7, r11        ; repeating block end
        ; use loop (or repeat block) in order to traverse 'i'-list

The 22-instruction block calculates 5 cross-correlation sums (in r10-r14), according to the following diagram:

    i1 i2 i3 i4 |
     *  *  *  * |                 Total in r10
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
       i1 i2 i3 | i4
        *  *  * |  *              Total in r11
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
          i1 i2 | i3 i4
           *  * |  *  *           Total in r12
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
             i1 | i2 i3 i4
              * |  *  *  *        Total in r13
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
                | i1 i2 i3 i4
                |  *  *  *  *     Total in r14
    j1 j2 j3 j4 | j5 j6 j7 j8
                |
  First half of | Second half of
 repeating block| repeating block

The j1-j4 values are loaded into r0-r3 by (1). In the repeating block, the next i1-i4 values are loaded by (2). This data is sufficient to calculate the products on the left side of the dividing line; this is done by the first 10 MLA instructions. The number of registers required can be minimised by reusing r0-r3 to hold the j5-j8 values, which are loaded in (3). By arranging the MLA instructions into 2 groups (left and right of the dividing line), it is possible to use the j1-j4 values first, and then use j5-j8 second. At the end of the block, r0-r3 (j5-j8) are used as j1-j4 for the next block (because the pointers have moved on by 4). This technique could also be applied to reduce the memory load traffic of other DSP functions.
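The arithmetic performed by the repeating block can be captured in a few lines of C, which is convenient for checking an assembly implementation against. This is an illustrative sketch only (the function name is ours, not part of the toolkit): it accumulates the five totals exactly as the diagram describes, for a 4-element 'i'-list against eight 'j' values.

```c
#include <stdint.h>
#include <assert.h>

/* Model of the 22-instruction repeating block: accumulate the five
 * cross-correlation totals (r10-r14 in the assembly) for a 4-element
 * i-list against eight j values.  acc[k] is the total for offset k. */
static void cross_corr_block(const int *i, const int *j, int acc[5])
{
    for (int k = 0; k < 5; k++)          /* five offsets per block   */
        for (int n = 0; n < 4; n++)      /* four products per offset */
            acc[k] += i[n] * j[k + n];
}
```

With i = {1,0,0,0} the five totals are simply j1..j5, which makes the register-to-offset mapping easy to verify by hand.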
Fixed Point Arithmetic
----------------------
Fixed-point arithmetic is an important part of DSP (for example, weighting coefficients are often in the range -1..1). The MUL instruction is an integer multiplier, so shifting will be necessary to justify the result correctly. Fortunately, the ARM barrel shifter makes this very easy. When a single MUL is being used to multiply fixed-point numbers, it may be necessary to right-shift the multiply operands so that the result fits in 32 bits, to avoid overflow. As you would expect, add and subtract are unaffected if the operands are fixed-point numbers, provided that both operands are in the same fixed-point format. Naturally, the barrel shifter can be applied to the second operand (with no overhead) if it is necessary to align the formats.

3.5.4 ARM6 Multiplier Performance Issues
----------------------------------------
The performance of the MUL instruction is important for many DSP applications where multiply is used extensively (eg. digital filters, correlation etc.). This section explains how to predict the timing of a MUL instruction, and suggests some ideas to improve the performance of speed-critical programs. This section is specific to the 2-bit Booth's multiplier in the ARM6, ARM2 and ARM3.

Explanation of Booth's multiplication
-------------------------------------
Consider the instruction:

    MUL Rd, Rm, Rs
                ^^

The speed of the multiply depends on the value in Rs. It is important to place the smallest number in Rs so that the multiply takes the least number of cycles. The rest of this section explains how the value in Rs affects the time taken to perform the multiplication. In the ARM6 core, the value in Rs is transferred to the Booth's multiplier register during the first cycle of the instruction. Thereafter, a number of internal I-cycles are required to perform the multiplication. The Booth's multiplier operates in the following way: a 32-bit multiplier register is initialised with the second operand of the multiplication.
This register is extended at the low end with an extra bit, which is initialised to zero. So, the register's contents after initialisation are:

    M31 M30 M29 M28 ... M5 M4 M3 M2 M1 M0 0

On each iteration, the bottom 3 bits of this register are used to generate a Booth digit, which controls what is done on the datapath with the destination register and the first operand register. Then this register is shifted right by 2 bits, losing the two bits at the right-hand end; the 2 leftmost bits are filled with zeros. Early termination occurs if and when the entire multiplier register is all zeros, with the process terminating after 16 iterations in any case. So, after the first iteration the multiplier register's contents are:

    0 0 M31 M30 ... M7 M6 M5 M4 M3 M2 M1

and the Booth digit used on the first iteration was based on the three bits "M1 M0 0". The second Booth digit will hence be "M3 M2 M1". Each Booth digit takes an I-cycle to process, as the ARM datapath is involved in accumulating the partial product. The total time for a MUL is thus 1S + nI cycles, where n depends on the value in Rs. From the above explanation it can be shown that n has the following relationship to the value in Rs:

    Multiplication by a value between 2^(2n-3) and 2^(2n-1)-1 inclusive
    takes 1S + nI-cycles (n>1).
    (Multiplication by 0 or 1 takes 1S + 1I-cycle.)

(These speeds are taken straight from the ARM6 datasheet.)

Overflow issues
---------------
It is common to use the current multiply instruction in (for instance) a 16-bit x 16-bit -> 32-bit mode to avoid overflow. This situation can be generalised, as the number of result bits is just the sum of the operand bits. Thus, the MUL can perform 16-bit x 16-bit -> 32-bit, 8-bit x 24-bit -> 32-bit etc., all without overflow. For MLA, the total is accumulated (overflow of the total must be avoided). Hence, MLA would be used as (for example) 12x12->24, leaving 8 bits to accumulate up to 256 values without the possibility of overflow.
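The relationship between the value in Rs and n can be modelled directly from the description above. The following C sketch (our own illustration, not ARM code) shifts the 33-bit multiplier register right by 2 bits per iteration and counts iterations until the register is all zeros, capped at 16; the count it returns agrees with the formula just given, and also shows why a negative Rs always needs the full 16 iterations.

```c
#include <stdint.h>
#include <assert.h>

/* Count the I-cycles a 2-bit Booth multiplier (ARM6-class) needs for
 * a given Rs value.  The 33-bit register holds Rs extended with a
 * zero at the low end; each iteration consumes one Booth digit and
 * shifts right by 2, filling with zeros.  Early termination occurs
 * when the register is all zeros, with a hard cap of 16 iterations. */
static int booth_icycles(uint32_t rs)
{
    uint64_t reg = (uint64_t)rs << 1;   /* Rs : 0  (33 bits)       */
    int n = 0;
    do {
        n++;                            /* one I-cycle per digit   */
        reg >>= 2;                      /* zeros fill from the top */
    } while (reg != 0 && n < 16);
    return n;
}
```

For example, any Rs that fits in 16 bits gives at most 9, matching the 1S + 9I worst case quoted below, while any value with the top bit set gives 16.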
So, although the worst-case multiply is 1S + 16I-cycles, in practice it is possible to arrange for a worst case which is at most 1S + 9I-cycles (by putting the operand with the least number of bits into Rs, so that Rs <= 16 bits), and often considerably better.

Negative operands
-----------------
The multiplier yields correct results for negative operand values, so the sign of the operands can be ignored. For positive values of Rs, a 16x16->32 MUL takes at most 1S + 9I-cycles (the average should be better than this). But the MUL *always* takes 1S + 16I-cycles if Rs is negative: early termination cannot take place because the top bit of Rs is a 1, so the Booth's multiplier register never contains all 0s (the maximum-of-16 limit is reached instead). To reduce the number of MUL cycles required, ensure that the Rs value is positive:

    CMP     Rs, #0
    RSBMI   Rs, Rs, #0      ; make Rs positive
    MUL     Rd, Rm, Rs
    RSBMI   Rd, Rd, #0      ; correct the sign of the result

but this does not really improve things unless Rs is very small, so that the gain exceeds the (3-instruction) overhead. It is sometimes possible to do away with the CMP by incorporating it into another instruction (see below). In the special case of squaring, the result does not need to be negated after the multiplication, as it will always be positive (thus the second RSB instruction can be eliminated). For example, consider this critical bit of code, which uses MUL to square the difference of two 5-bit input values. It demonstrates the importance of ensuring that Rs is positive, as the worst-case performance is improved to just 1S + 3I-cycles for the MUL or MLA.

    AND     r1, r8, #31     ; extract 5-bit field of interest
    AND     r2, r9, #31     ; extract 5-bit field of interest
    SUBS    r1, r1, r2
    RSBMI   r1, r1, #0
    MUL     r0, r1, r1

As you can see, ensuring that the multiply is fast has been arranged to cost only 1 S-cycle (for the RSBMI instruction).
A similar method can be used to ensure MLA is always fast, without affecting the total:

    CMP     Rs, #0
    RSBMI   Rs, Rs, #0
    RSBMI   Rm, Rm, #0
    MLA     Rd, Rm, Rs, Rn

Multiplication by constant
--------------------------
This technique replaces a MUL by a sequence of arithmetic instructions which are equivalent to multiplying by a constant. The gains depend on the value of the constant (smaller constants are generally faster). For more details, please see Multiply by a Constant starting on page 24.

3.5.5 Related Topics
--------------------
Multiply by a Constant starting on page 24;
Division by a Constant starting on page 28;
Using 16-bit Data on the ARM starting on page 41;
ARM Assembly Programming Performance Issues starting on page 55.

3.6 Using 16-bit Data on the ARM
--------------------------------

3.6.1 About this Recipe
-----------------------
This recipe covers several different approaches to 16-bit data manipulation:

converting the 16-bit data to 32-bit data, and from then on treating it as 32-bit data;
converting 16-bit data into 32-bit data when loading and storing, but using 32-bit data within the ARM's registers;
loading 16-bit data into the top 16 bits of ARM registers, and processing it as 16-bit data (ie. keeping the bottom 16 bits clear at all times).

Useful code fragments are given which can be used to help implement these different approaches efficiently.

3.6.2 Introduction
------------------
Since the ARM is a 32-bit processor, and does not have half-word load and store instructions in its instruction set, at first glance the ARM may look unsuitable for processing 16-bit data. This recipe is intended to show that the ARM is quite capable of handling 16-bit data efficiently, in several different ways, depending on what is needed for a particular application.

3.6.3 How "16-bit" is my data ?
-------------------------------
Just because data is 16-bit in size does not mean that it cannot be considered as 32-bit data by the ARM, and thus be manipulated using the ARM instruction set in the normal way. Clearly any unsigned 16-bit value can be held as a 32-bit value in which the top 16 bits are all zero. Similarly, any signed 16-bit value can be held as a 32-bit value with the top 16 bits sign extended (ie. copied from the top bit of the 16-bit value). The main disadvantage of storing 16-bit data as 32-bit data in this way for ARM-based systems is that it takes up twice as much space in memory or on disk.

If the amount of memory taken up by the 16-bit data is small, then simply treating it as 32-bit data is likely to be the easiest and most efficient technique, ie. converting the data to 32-bit format and from then on treating it as 32-bit data.

When the space taken by 16-bit data in memory or on disk is not small, an alternative method can be used: the 16-bit data is loaded and converted to 32-bit data for use within the ARM, and when processed it can be output as either 32-bit or 16-bit data. Useful code fragments to perform the necessary conversions for this approach are given in sections Little Endian Loading Recipes starting on page 42 to Big Endian Storing Recipes starting on page 45.

An issue which may arise when 16-bit data is converted to 32-bit data for use in the ARM and then stored back out as 16-bit data is detecting whether the data is still 16-bit data, ie. whether it has 'overflowed' into the top 16 bits of the ARM register. Code fragments which detect this are given in section Detecting Overflow into the Top 16 Bits starting on page 46.

Another approach, which avoids having to use explicit code to check whether results have overflowed into the top 16 bits, is to keep 16-bit data as 16-bit data all the time, by loading it into the top half of ARM registers and ensuring that the bottom 16 bits are always 0.
Useful code sequences, and the issues involved when taking this approach, are described in Using ARM Registers as 16-bit Registers starting on page 46.

3.6.4 Little Endian Loading Recipes
-----------------------------------
Code fragments in this section which transfer a single 16-bit data item transfer it to the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment loads a 16-bit value into a register, whether the data is byte, half-word or word aligned in memory, by using the ARM's load byte instruction. This code is also optimal for the common case where the 16-bit data is half-word aligned, ie. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimisations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

    LDRB    R0, [R2, #0]            ; 16-bit value is loaded from the
    LDRB    R1, [R2, #1]            ; address in R2, and put in R0
    ORR     R0, R0, R1, LSL #8      ; R1 is required as a
    MOV     R0, R0, LSL #16         ; temporary register
    MOV     R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data processing operation by specifying the second argument as "R0, ASR #16" rather than just R0.

One Data Item - Byte Offset 2
-----------------------------
If the data is aligned on a half-word boundary, but not a word boundary, ie. the byte offset is 2, then the following code fragment can be used (which is clearly much more efficient than the general case given above):

    LDR     R0, [R2, #-2]           ; 16-bit data is loaded from address in
    MOV     R0, R0, LSR #16         ; R2 into R0 (R2 has byte offset 2)

The "LSR" should be replaced with "ASR" if the data is signed.
Note that, as in the previous example, it may be possible to combine the MOV with another data processing operation.

One Data Item - Byte Offset 0
-----------------------------
If the data is on a word boundary, then the following code fragment will load a 16-bit value (again a significant improvement over the general case):

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSL #16         ; aligned address in R2 into R0.
    MOV     R0, R0, LSR #16

As before, "LSR" should be replaced with "ASR" if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation. This code can be further optimised if non-word-aligned word loads are permitted (ie. alignment faults are not enabled). This makes use of the way the ARM rotates data into a register for non-word-aligned word loads; see the appropriate ARM datasheet for more information:

    LDR     R0, [R2, #2]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSR #16         ; aligned address in R2 into R0.

Two Data Items - Byte Offset 0
------------------------------
Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

    LDR     R0, [R2, #0]            ; 2 unsigned 16-bit values are loaded
    MOV     R1, R0, LSR #16         ; from one word of memory [R2]. The
    BIC     R0, R0, R1, LSL #16     ; 1st is put in R0, and the 2nd in R1.

The version of this for signed data is:

    LDR     R0, [R2, #0]            ; 2 signed 16-bit values are loaded
    MOV     R1, R0, ASR #16         ; from one word of memory [R2]. The
    MOV     R0, R0, LSL #16         ; 1st is put in R0, and the 2nd in R1.
    MOV     R0, R0, ASR #16

The address in R2 should be word aligned (byte offset 0), in which case these code fragments load the data item in bytes 0-1 into R0, and the data item in bytes 2-3 into R1.
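The little-endian recipes above can be mirrored in C as a cross-check. This sketch is our own (the function names are invented); the shift-left-then-arithmetic-shift-right idiom is the C equivalent of the MOV LSL #16 / MOV ASR #16 pair.

```c
#include <stdint.h>
#include <assert.h>

/* Any-alignment load of a signed 16-bit little-endian value, built
 * from two byte loads as in the LDRB/LDRB/ORR sequence. */
static int32_t load_s16_le(const uint8_t *p)
{
    uint32_t v = p[0] | ((uint32_t)p[1] << 8);  /* LDRB, LDRB, ORR   */
    return ((int32_t)(v << 16)) >> 16;          /* MOV LSL / MOV ASR */
}

/* Unpack two unsigned 16-bit items from one little-endian word, as
 * in the LDR / MOV LSR #16 / BIC sequence: the 1st item comes from
 * bits 0-15, the 2nd from bits 16-31. */
static void unpack_u16_pair(uint32_t word, uint32_t *first, uint32_t *second)
{
    *second = word >> 16;               /* MOV R1, R0, LSR #16     */
    *first  = word & 0xFFFF;            /* BIC R0, R0, R1, LSL #16 */
}
```

(The right shift of a negative int32_t is arithmetic on all common compilers, matching ASR.)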
3.6.5 Little Endian Storing Recipes
-----------------------------------
The code fragment in this section which transfers a single 16-bit data item transfers it from the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset from a word address of the store address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address, by using the ARM's byte saving instructions:

    STRB    R0, [R2, #0]            ; 16-bit value is stored to the
    MOV     R0, R0, ROR #8          ; address in R2.
    STRB    R0, [R2, #1]
    MOV     R0, R0, ROR #24

The second MOV instruction is not needed if the data is no longer needed after it has been stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.

Two Data Items - Byte Offset 0
------------------------------
Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

    ORR     R3, R0, R1, LSL #16     ; Two unsigned 16-bit values
    STR     R3, [R2, #0]            ; in R0 and R1 are packed into
                                    ; the word addressed by R2.
                                    ; R3 is used as a temporary register

If the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (one of R0 or R1 can be used instead). The version for signed data is:

    MOV     R3, R0, LSL #16         ; Two signed 16-bit values
    MOV     R3, R3, LSR #16         ; in R0 and R1 are packed into
    ORR     R3, R3, R1, LSL #16     ; the word addressed by R2.
    STR     R3, [R2, #0]            ; R3 is a temporary register

Again, if the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (R0 can be used instead).
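The packing step of the storing recipe has a direct C equivalent, given here as a sketch with an invented name; the cast to uint16_t plays the role of the LSL #16 / LSR #16 masking needed in the signed case.

```c
#include <stdint.h>
#include <assert.h>

/* Pack two 16-bit values into one word for a little-endian store,
 * as in the ORR R3, R0, R1, LSL #16 / STR sequence.  Passing signed
 * values through uint16_t discards any sign-extended top bits, just
 * as the MOV LSL #16 / MOV LSR #16 pair does in the assembly. */
static uint32_t pack_u16_pair(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}
```

The word produced is what a little-endian STR would place in memory, with the first item in the low half.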
3.6.6 Big Endian Loading Recipes
--------------------------------
Code fragments in this section which transfer a single 16-bit data item transfer it to the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. eg. the address 0x4321 has a byte offset of 1.

One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment loads a 16-bit value into a register, whether the data is byte, half-word or word aligned in memory, by using the ARM's load byte instruction. This code is also optimal for the common case where the 16-bit data is half-word aligned, ie. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimisations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

    LDRB    R0, [R2, #0]            ; 16-bit value is loaded from the
    LDRB    R1, [R2, #1]            ; address in R2, and put in R0
    ORR     R0, R1, R0, LSL #8      ; R1 is a temporary register
    MOV     R0, R0, LSL #16
    MOV     R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data processing operation by specifying the second argument as "R0, ASR #16" rather than just R0.

One Data Item - Byte Offset 0
-----------------------------
If the data is aligned on a word boundary, then the following code fragment can be used (which is clearly much more efficient than the general case given above):

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the word
    MOV     R0, R0, LSR #16         ; aligned address in R2 into R0.

The "LSR" should be replaced with "ASR" if the data is signed. Note that, as in the previous example, it may be possible to combine the MOV with another data processing operation.
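The only difference between the little-endian and big-endian byte-wise loads is the operand order of the final ORR. The two variants side by side in C (our own sketch, invented names) make the swap explicit:

```c
#include <stdint.h>
#include <assert.h>

/* Assemble an unsigned 16-bit value from the two bytes at p. */
static uint32_t load_u16_le(const uint8_t *p)
{
    return p[0] | ((uint32_t)p[1] << 8);    /* ORR R0, R0, R1, LSL #8 */
}

static uint32_t load_u16_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 8) | p[1];    /* ORR R0, R1, R0, LSL #8 */
}
```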
One Data Item - Byte Offset 2
-----------------------------
If the data is aligned on a half-word boundary, but not a word boundary, ie. the byte offset is 2, then the following code fragment can be used (again a significant improvement over the general case):

    LDR     R0, [R2, #-2]           ; 16-bit value is loaded from the
    MOV     R0, R0, LSL #16         ; address in R2 into R0. R2 is
    MOV     R0, R0, LSR #16         ; aligned to byte offset 2

As before, "LSR" should be replaced with "ASR" if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation. This code can be further optimised if non-word-aligned word loads are permitted (ie. alignment faults are not enabled). This makes use of the way the ARM rotates data into a register for non-word-aligned word loads; see the appropriate ARM datasheet for more information:

    LDR     R0, [R2, #0]            ; 16-bit value is loaded from the
    MOV     R0, R0, LSR #16         ; address in R2 into R0. R2 is
                                    ; aligned to byte offset 2

Two Data Items - Byte Offset 0
------------------------------
Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

    LDR     R0, [R2, #0]            ; 2 unsigned 16-bit values are
    MOV     R1, R0, LSR #16         ; loaded from one word of memory.
    BIC     R0, R0, R1, LSL #16     ; The 1st is put in R1, the 2nd in R0.

The version of this for signed data is:

    LDR     R0, [R2, #0]            ; 2 signed 16-bit values are loaded from
    MOV     R1, R0, ASR #16         ; one word of memory (address in R2).
    MOV     R0, R0, LSL #16         ; The 1st is put in R1, and the 2nd
    MOV     R0, R0, ASR #16         ; in R0.

(Note that with big-endian data the first item occupies the most significant half of the word, so it ends up in R1 and the second item in R0 - the opposite of the little-endian case.)

3.6.7 Big Endian Storing Recipes
--------------------------------
The code fragment in this section which transfers a single 16-bit data item transfers it from the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset from a word address of the store address. eg. the address 0x4321 has a byte offset of 1.
One Data Item - Any Alignment (Byte offsets 0,1,2 or 3)
-------------------------------------------------------
The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address:

    STRB    R0, [R2, #1]            ; 16-bit value is stored to the
    MOV     R0, R0, ROR #8          ; address in R2.
    STRB    R0, [R2, #0]
    MOV     R0, R0, ROR #24

The second MOV instruction is not needed if the data is no longer needed after it has been stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.

Two Data Items - Byte Offset 0
------------------------------
Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

    ORR     R3, R0, R1, LSL #16     ; Two unsigned 16-bit values in
    STR     R3, [R2, #0]            ; R0 and R1 are packed into the
                                    ; word addressed by R2.
                                    ; R3 is used as a temporary register

If the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (one of R0 or R1 can be used instead). The version for signed data is:

    MOV     R3, R0, LSL #16         ; Two signed 16-bit values in
    MOV     R3, R3, LSR #16         ; R0 and R1 are packed into the
    ORR     R3, R3, R1, LSL #16     ; word addressed by R2.
    STR     R3, [R2, #0]            ; R3 is a temporary register

Again, if the values in R0 and R1 are not needed after they are saved, then R3 need not be used as a temporary register (R0 can be used instead).

3.6.8 Detecting Overflow into the Top 16 Bits
---------------------------------------------
If 16-bit data is converted to 32-bit data for use in the ARM, it may sometimes be necessary to check explicitly whether the result of a calculation has 'overflowed' into the top 16 bits of an ARM register. This is likely to be necessary because the ARM does not set its processor status flags when this happens. The following instruction sets the Z flag if the value in R0 is a 16-bit unsigned value. R1 is used as a temporary register.
    MOVS    R1, R0, LSR #16

The following instructions set the Z flag if the value in R0 is a valid 16-bit signed value (ie. bit 15 is the same as the sign-extended bits). R1 is used as a temporary register.

    MOVS    R1, R0, ASR #15
    CMNNE   R1, #1

3.6.9 Using ARM Registers as 16-bit Registers
---------------------------------------------
Instead of holding 16-bit data as 32-bit data within the ARM, it can be held as 16-bit data, by positioning it in the top 16 bits of the ARM registers, as opposed to the bottom 16 bits as has been described so far.

The advantages of this approach are:

Some 16-bit data load instruction sequences are shorter. The loading and storing sequences shown above will have to be modified, and in some cases shorter instruction sequences will be possible. In particular, handling signed data will often be more efficient, as the top bit does not have to be copied into the top 16 bits of the register. Note that it is, however, necessary to ensure that the bottom 16 bits are all clear.

The ARM processor status flags will be set if the 'S' bit of a data processing instruction is set and overflow or carry occurs out of the 16-bit value. Thus explicit 'overflow' checking instructions are not needed.

Pairs of signed 16-bit integers can be saved more efficiently than in the previous approach, since the sign-extended bits do not have to be cleared out before the two values are combined.

The disadvantages of this approach are:

Instructions such as add-with-carry cannot be used directly. eg. the instruction ADC R0, R0, #0, to increment R0 if Carry is set, should be replaced by ADDCS R0, R0, #&10000. Having to use this form of instruction reduces the chance of being able to combine several data processing operations into one by making use of the barrel shifter.

Before multiplying two 16-bit values in the top half of the registers, the values must be shifted into the bottom half of the register.
Before combining a 16-bit value with a 32-bit value, the 16-bit value must be shifted into the bottom half of the register. Note, however, that this may cost nothing if the barrel shifter can be used in parallel.

The recipes given above for loading and storing 16-bit data in the bottom half of ARM registers can easily be adapted to load the data into the top half of the registers (and ensure the bottom half is all zero), or save the data from the top half of the registers.

3.6.10 Related Topics
---------------------
Using the Barrel Shifter starting on page 10;
Fixed Point Arithmetic starting on page 38.

3.7 ARM600 Page Table Generation
--------------------------------

3.7.1 About This Recipe
-----------------------
This recipe tells you how to use:

armasm repetitive assembly;
armasm conditional assembly;
armasm macros;
armasm and armlink to produce a plain binary output file (containing only the bytes you describe in your source program).

This recipe refers to the program pagetab.s in the examples directory. Although pagetab.s generates ARM600 page tables, it is not the format of the page tables which is being discussed in this recipe, but how they are generated using armasm. To find out more about ARM600 page tables, please refer to the ARM600 datasheet.

3.7.2 Repetitive Assembly
-------------------------
The following code fragment is taken from pagetab.s:

        GBLA    counter
counter SETA    0
        WHILE   counter <= 3
        L1Entry SECTION, (counter:SHL:20), 0, U_BIT+C_BIT+B_BIT, SVC_RW
counter SETA    counter+1
        WEND

Explanation
-----------
GBLA counter declares a global numeric variable called counter, which is initialised to zero using the SETA directive. The WHILE ... WEND construct is then used to repeatedly assemble the lines between WHILE and WEND. In this example, the loop body is assembled for counter = 0, 1, 2 and 3, but because the looping condition is checked at the top of the loop, it is possible for code between a WHILE and a WEND never to be assembled.
For example, if counter were initialised to 4, the body of the WHILE ... WEND loop would not be assembled at all. Each time around the loop, the macro L1Entry is called (with 5 arguments), and then counter is incremented.

3.7.3 Macro Usage and Conditional Assembly
------------------------------------------
The following code fragment, the definition of L1Entry, is also taken from pagetab.s:

        MACRO
$L      L1Entry $type, $addr, $dom, $ucb, $acc
$L
        IF ($type=SECTION)
        DCD ((($addr):AND:&FFF00000):OR:(($acc):SHL:10)         \
            :OR:(($dom):SHL:5):OR:($ucb):OR:($type))
        MEXIT
        ENDIF
        IF ($type=PAGE)
        DCD ((($addr):AND:&FFFFFC00):OR:(($dom):SHL:5)          \
            :OR:(($ucb):AND:U_BIT):OR:$type)
        ELSE
        DCD 0                   ; Invalid Level 1 Page Table Entry
        ENDIF
        MEND

Note that a backslash breaks a logical line of assembly language across two physical lines; however, there must be no character after the backslash on the line.

Explanation
-----------
The macro definition is enclosed between MACRO and MEND. The first line of the definition gives the macro's name and lists its parameters; $L is a special parameter, called the label parameter. The body of the macro illustrates the use of IF ... ENDIF and IF ... ELSE ... ENDIF to assemble different code conditional on a value known at assembly time. In this example, the controlling expressions of the IFs involve a macro parameter ($type), which gets its value when the macro is called. This macro definition also shows how the MEXIT directive can be used to exit from processing a macro before the MEND directive is reached. You can think of MEXIT as being like a return statement in a C function. The label parameter is used to allow labels to be set up within a macro and then be accessed from outside that macro. In this macro, the label parameter is set up to be the address of this page table entry.
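The field packing performed by the SECTION branch of the macro has a direct C equivalent, which can be handy for checking generated table entries by hand. This is our own sketch: the field positions come straight from the DCD expression above, and the values of constants such as SECTION, U_BIT and the access bits are deliberately left as parameters rather than guessed here.

```c
#include <stdint.h>
#include <assert.h>

/* Mirror of the SECTION branch of the L1Entry macro:
 * (addr AND &FFF00000) OR (acc SHL 10) OR (dom SHL 5) OR ucb OR type */
static uint32_t l1_section_entry(uint32_t addr, uint32_t acc,
                                 uint32_t dom, uint32_t ucb, uint32_t type)
{
    return (addr & 0xFFF00000u)   /* section base: top 12 address bits */
         | (acc << 10)            /* access permission field           */
         | (dom << 5)             /* domain field                      */
         | ucb                    /* U/C/B control bits                */
         | type;                  /* descriptor type bits              */
}
```

The test below checks only the arithmetic of the packing, using arbitrary field values; consult the ARM600 datasheet for the real bit encodings.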
3.7.4 Assembling the Page Tables in Plain Binary Format
-------------------------------------------------------

This section tells you how to create a file containing a plain binary
image of the page tables. In other words, a file containing just the bytes
you would need to load into memory, and nothing else by way of symbolic
information, file content descriptors, load addresses, etc.

You create a plain binary image in two steps: first you create a
relocatable object file from your source file; then you use armlink to
make a plain binary version of the relocatable object.

Method
------

Set your current directory to that containing the pagetab.s program, then
assemble pagetab.s and link pagetab.o as follows:

  armasm pagetab.s -o pagetab.o -li
  armlink -bin -o pagetab pagetab.o

Explanation
-----------

As in earlier examples, the -li option tells armasm to assemble code for a
little-endian memory. This need not be specified if the tools have been
configured for little-endian operation (see The ARM Tool Reconfiguration
Utility (reconfig) starting on page 45 of the User Manual for details).

The -bin option tells armlink to make a plain binary output file
containing nothing but the bytes you described in your source program.

Because pagetab contains no position-dependent data, you do not need to
tell armlink where to base its output. If there had been
position-dependent data or code, you would have had to use the
-base address option to tell armlink where to base its output and, of
course, you would only be able to use the output at that memory address.

3.7.5 Related Topics
--------------------

Loading Constants into Registers starting on page 15.

3.8 Pseudo Random Number Generation
-----------------------------------

3.8.1 About This Recipe
-----------------------

This recipe describes a 32-bit pseudo random number generator implemented
efficiently in ARM Assembly Language.
3.8.2 The Details
-----------------

It is often necessary to generate pseudo random numbers, and the most
efficient algorithms are based on shift generators with exclusive-or
feedback (rather like a cyclic redundancy check generator). Unfortunately
the sequence of a 32-bit generator needs more than one feedback tap to be
maximal length (ie. 2^32-1 cycles before repetition), so this example uses
a 33 bit register with taps at bits 33 and 20.

The basic algorithm is newbit:=bit33 EOR bit20, shift left the 33 bit
number and put in newbit at the bottom; this operation is performed for
all the newbits needed (ie. 32 bits). The entire operation can be coded
compactly by making maximal use of the ARM's barrel shifter:

  ; enter with seed in R0 (32 bits), R1 (1 bit in least significant bit)
  ; R2 is used as a temporary register.
  ; on exit the new seed is in R0 and R1 as before
  ; Note that a seed of 0 will always produce a new seed of 0.
  ; All other values produce a maximal length sequence.
  ;
        TST     R1, R1, LSR #1          ; top bit into Carry
        MOVS    R2, R0, RRX             ; 33 bit rotate right
        ADC     R1, R1, R1              ; carry into lsb of R1
        EOR     R2, R2, R0, LSL #12     ; (involved!)
        EOR     R0, R2, R2, LSR #20     ; (similarly involved!)

3.8.3 Using This Example
------------------------

This random number generation code is provided as random.s in the examples
directory. It is provided as ARM Assembly Language source which can be
assembled and then linked with C modules (see recipes in chapter
Interfacing Assembly Language and C starting on page 62 for more
information). The C test program randtest.c (also in the examples
directory) can be used to demonstrate this.
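Before building the example, it may help to see the algorithm modelled
bit-by-bit in C. The following is a slow reference sketch of the shift
generator described above, not a transcription of the optimised
five-instruction sequence; lo holds the low 32 bits of the 33-bit register
and hi holds bit 33 in its least significant bit:

```c
#include <stdint.h>

/* One step of the 33-bit shift generator with taps at bits 33 and 20
   (bits numbered from 1). Each call generates 32 new bits. */
void next_rand(uint32_t *lo, uint32_t *hi)
{
    for (int i = 0; i < 32; i++) {
        uint32_t bit33  = *hi & 1;
        uint32_t bit20  = (*lo >> 19) & 1;
        uint32_t newbit = bit33 ^ bit20;    /* newbit := bit33 EOR bit20 */
        *hi = (*lo >> 31) & 1;              /* old bit 32 becomes bit 33 */
        *lo = (*lo << 1) | newbit;          /* shift left, insert newbit */
    }
}
```

As with the assembly version, a seed of 0 maps to 0; every other state
lies on the maximal length sequence.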
First set the examples directory to be your current directory, and execute
the following commands to build an executable suitable for armsd:

  armasm random.s -o random.o -li
  armcc -c randtest.c -li -apcs 3/32bit
  armlink randtest.o random.o -o randtest somewhere/armlib.32l

where somewhere is the path to the lib directory of the ARM Software
Development Toolkit release.

Note that in both the above commands, and the armsd command below, -li
indicates that the target ARM is little endian, and -apcs 3/32bit
specifies that the 32 bit variant of the ARM Procedure Call Standard
should be used. These options can be omitted if the tools have already
been configured appropriately - see The ARM Tool Reconfiguration Utility
(reconfig) starting on page 45 of the User Manual for details.

armsd can be used to run this program as follows:

  > armsd -li randtest
  A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
  ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
  Object program file randtest
  armsd: go
  randomnumber() returned 0b3a9965
  randomnumber() returned ac0b1672
  randomnumber() returned 6762ad4f
  randomnumber() returned 1965a731
  randomnumber() returned d6c1cef4
  randomnumber() returned f78fa802
  randomnumber() returned 8147fc15
  randomnumber() returned 3f62adfc
  randomnumber() returned b56e9da8
  randomnumber() returned b36dc5e2
  Program terminated normally at PC = 0x000082c8
        0x000082c8: 0xef000011 ....    : >       swi     0x11
  armsd: quit
  Quitting
  >

3.8.4 Related Topics
--------------------

Please refer to the index for other topics of interest.

3.9 Loading a Word from an Unknown Alignment
--------------------------------------------

3.9.1 About This Recipe
-----------------------

In this recipe a code sequence which loads a word from memory at any byte
alignment is described. Although loading 32-bit data from non word aligned
addresses should be avoided whenever possible, it may sometimes be
necessary.
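The technique described below reads the two aligned words containing the
data and merges them with shifts. As a preview, the same merge can be
sketched in C (little-endian byte order assumed, as in the ARM code; the
helper name and the separate base/offset parameters are illustrative
only):

```c
#include <stdint.h>

/* C sketch of the recipe: read the two aligned words that straddle
   the address and merge them with shifts. Like the ARM sequence, it
   may read up to three bytes beyond the unaligned word, so base[1]
   must be a readable location. Little-endian layout assumed. */
uint32_t load_word_any_alignment(const uint32_t *base, unsigned byte_offset)
{
    uint32_t lo = base[0], hi = base[1];   /* get 64 bits containing data */
    unsigned bits = (byte_offset & 3) * 8; /* offset in bits */
    if (bits == 0)
        return lo;                         /* already aligned */
    return (lo >> bits) | (hi << (32 - bits));
}
```
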
3.9.2 The Details
-----------------

The ARM Load and Store (single and multiple) instructions are designed to
load word aligned data. Unless there is a very good reason it is best to
avoid having to load or store word-sized data to or from non word aligned
addresses, as neither the Load nor the Store instruction can do this
unaided.

To deal with misaligned word fetches, two words must be read and the
required data extracted from these two words. The code below performs this
operation for a little endian ARM, making good use of the barrel shifter
in the process:

  ; enter with address in R0
  ; R2 and R3 are used as temporary registers
  ; the word is loaded into R1
  ;
        BIC     R2, R0, #3              ; Get word aligned address
        LDMIA   R2, {R1, R3}            ; Get 64 bits containing data
        AND     R2, R0, #3              ; Get offset in bytes
        MOVS    R2, R2, LSL #3          ; Get offset in bits
        MOVNE   R1, R1, LSR R2          ; Extract data from bottom 32 bits
        RSBNE   R2, R2, #32             ; Get 32 - offset in bits
        ORRNE   R1, R1, R3, LSL R2      ; Extract data from top 32 bits
                                        ; and combine with the other data

This code can easily be modified for use on a big endian ARM - the LSR R2
and LSL R2 just have to be swapped over.

For details of what the Load and Store instructions do if used with non
word aligned addresses refer to the appropriate datasheet. Note that non
word aligned word loads are also used in Using 16-bit Data on the ARM
starting on page 41.

3.9.3 Related Topics
--------------------

Using 16-bit Data on the ARM starting on page 41.

3.10 Byte Order Reversal
------------------------

3.10.1 About This Recipe
------------------------

This recipe gives a compact ARM instruction sequence to perform byte order
reversal, ie. reversing the endianness of a word.

3.10.2 The Details
------------------

Changing the endianness of a word can be a common operation in certain
applications, for example when communicating word sized data as a stream
of bytes to a receiver of the opposite endianness.
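In C the operation is a simple byte permutation. The sketch below shows a
portable reference version, together with a C transcription of the
four-instruction barrel-shifter sequence presented below (rotates written
out as shift pairs); the function names are illustrative only:

```c
#include <stdint.h>

/* Portable reference: swap the byte order of a 32-bit word,
   eg. 0xAABBCCDD -> 0xDDCCBBAA */
uint32_t byterev_ref(uint32_t x)
{
    return (x << 24) | ((x & 0x0000ff00u) << 8) |
           ((x >> 8) & 0x0000ff00u) | (x >> 24);
}

/* Transcription of the four-instruction ARM sequence */
uint32_t byterev_arm(uint32_t x)
{
    uint32_t t = x ^ ((x >> 16) | (x << 16)); /* EOR ip, a1, a1, ROR #16 */
    t &= ~0x00ff0000u;                        /* BIC ip, ip, #&ff0000   */
    x = (x >> 8) | (x << 24);                 /* MOV a1, a1, ROR #8     */
    return x ^ (t >> 8);                      /* EOR a1, a1, ip, LSR #8 */
}
```
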
This operation can be performed efficiently on the ARM, using just four
instructions. The word to be reversed is held in a1 both on entry to and
exit from this instruction sequence. ip is used as a temporary register
(for more information about these register names see Register Usage under
the ARM Procedure Call Standard starting on page 62):

        EOR     ip, a1, a1, ror #16
        BIC     ip, ip, #&ff0000
        MOV     a1, a1, ror #8
        EOR     a1, a1, ip, lsr #8

A demonstration program which should help explain how this works has been
provided in source form in the examples directory. To compile this program
and run it under armsd, first set your current directory to be examples
and then use the following commands:

  > armcc bytedemo.c -o bytedemo -li -apcs 3/32bit
  > armsd -li bytedemo
  A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
  ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
  Object program file bytedemo
  armsd: go
  ... demonstration program executes ...

Note that this program uses ANSI control codes, so should work on most
terminal types under Unix and also on the PC.

3.10.3 Related Topics
---------------------

Please refer to the index for further topics of interest.

3.11 ARM Assembly Programming Performance Issues
------------------------------------------------

3.11.1 About This Recipe
------------------------

This recipe outlines many performance related issues which the ARM
Assembly Language programmer should be aware of. Many of these issues are
dealt with more fully elsewhere in the Cookbook, but are mentioned here
for completeness.

This recipe is also useful background for C programmers using armcc, as
some of these issues can also apply to programming in C.

3.11.2 Performance Issues
-------------------------

Not all of the issues in this recipe apply to every ARM processor based
system. However, unless otherwise stated, all issues relate to processors
based on ARM6 and ARM7, with or without cache and/or write buffer.
LDM / STM
---------

Use LDM and STM instead of a sequence of LDR or STR instructions wherever
possible. This provides several benefits:

  the code is smaller (and thus will cache better on an ARM processor
  with a cache);

  an instruction fetch cycle and a register copy back cycle are saved for
  each LDR or STR eliminated;

  on an uncached ARM processor (for LDM) or an unbuffered ARM processor
  (for STM), non-sequential memory cycles can be turned into faster
  sequential memory cycles.

For a more detailed discussion of LDM and STM see Flexibility of Load and
Store Multiple starting on page 13.

Conditional Execution
---------------------

In many situations, branches around short pieces of code can be avoided by
using conditionally executed instructions. This reduces the size of code
and may avoid a pipeline break. For a more detailed explanation of the
conditional execution of ARM instructions see Making the Most of
Conditional Execution starting on page 7.

Using the Barrel Shifter
------------------------

Combining shift operations with other operations can significantly
increase the code density (and thus performance) of much ARM code. An
explanation of how to use the ARM's barrel shifter is given in Using the
Barrel Shifter starting on page 10. Many other recipes also demonstrate
good use of the barrel shifter.

Addressing Modes
----------------

The ARM instruction set provides a useful selection of addressing modes,
which can often be used to improve the performance of code. eg. using LDR
or STR pre- or post-indexed with a non-zero offset both updates the base
register and performs the data transfer in a single instruction. For full
details of the addressing modes available refer to the appropriate ARM
datasheet.

Multiplication
--------------

Be aware of the time taken by the ARM multiply and multiply accumulate
instructions. Making the best use of the multiplier is discussed in ARM6
Multiplier Performance Issues starting on page 38.
When multiplying by a constant value, note that using the multiply
instruction is often not the optimal solution. The issues involved are
discussed in Multiply by a Constant starting on page 24.

Optimise Register Usage
-----------------------

Examine your code and see if registers can be reused for another value
during parts of a long calculation which uses many registers. Doing this
will reduce the amount of 'register spillage', ie. the number of times a
value has to be reloaded or an intermediate value saved and then reloaded.

Notice that because data processing is cheap (much can be achieved in a
single instruction), keeping a calculated result in a register for use
some considerable time later may be less efficient than recalculating it
when it is next needed: recalculation frees the register for another
purpose in the meantime, which reduces the amount of register spillage.

For example code which takes great care to optimise register usage, see
some of the examples in Digital Signal Processing on the ARM starting on
page 35.

Loop Unrolling
--------------

Loop unrolling involves using more than one copy of the inner loop of an
algorithm. The following benefits may be gained by loop unrolling:

  the branch back to the beginning of the loop is executed less
  frequently;

  it may be possible to combine some of one iteration with some of the
  next iteration, and thereby significantly reduce the cost of each
  iteration. A common case of this is combining LDR or STR instructions
  from two or more iterations into single LDM or STM instructions. This
  reduces code size, the number of instruction fetches, and in the case
  of LDM, the number of register writeback cycles.

As an example to illustrate the issues involved in loop unrolling, let us
consider calculating the following over an array: x[i] = y[i] - y[i+1].
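In C, the calculation is simply (y having one more element than x):

```c
#include <stdint.h>

/* x[i] = y[i] - y[i+1] for i = 0 .. n-1; y must have n+1 elements */
void diff(int32_t *x, const int32_t *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] - y[i + 1];
}
```
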
Below is a code fragment which performs this:

        LDR     R2, [R0]        ; Preload y[i]
Loop    LDR     R3, [R0, #4]!   ; Load y[i+1]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        STR     R2, [R1], #4    ; Store x[i]
        MOV     R2, R3          ; y[i+1] is the next y[i]
        CMP     R0, R4          ; Finished ?
        BLT     Loop

Let us consider the number of execution cycles this will take on an ARM6
based processor. IF stands for Instruction Fetch, WB stands for Register
Write Back, R stands for Read, and W stands for Write. The loop will
execute in the following cycles: 6 IF, 1 R, 1 WB, 1 W, and the branch
costs an additional 2 IF cycles. Therefore the total cycle count for
processing a 100 element x[] array is:

  799 IF (198 caused by branching), 101 R, 101 WB, 100 W (1198 cycles)
  Code size: 7 instructions

Now consider the effects of unrolling this loop:

Branch overhead cycles
----------------------

In the above example there are 198 IFs caused by branching. Unrolling the
loop can clearly reduce this, and the table below shows how progressively
unrolling the loop gives diminishing returns for the increase in code
size:

  Times Unrolled    IFs caused by Branching    IF saving
  --------------    -----------------------    ---------
        2                     98                  100
        3                     66                  134
        4                     48                  150
        5                     38                  160
       10                     18                  180
      100                      0                  198

Therefore if code size is an issue of any importance, unrolling more than
around 3 times is unlikely to pay off with regard to branch overhead
elimination.

Combining LDRs and STRs into LDM and STM
----------------------------------------

The number of LDRs or STRs which can be combined into a single LDM or STM
is restricted by the number of available registers. In this instance 10
registers is the most which are likely to be usable; this would result in
unrolling the loop 10 times for the above example. Another case to
consider is unrolling 3 times, as this seems to be a good compromise for
branch overhead reduction.
Other optimisations
-------------------

Upon examining the unrolled code below, it can be seen that it is only
necessary to execute the MOV once per loop, thus saving another 2 IF
cycles per loop for the 3 times unrolled code, and another 9 IF cycles per
loop for the 10 times unrolled code.

Here is the code unrolled three times and then optimised as described
above:

        LDR     R2, [R0], #4    ; Preload y[i]
Loop    LDMIA   R0!, {R3-R5}    ; Load y[i+1] to y[i+3]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        SUB     R3, R3, R4      ; x[i+1] = y[i+1] - y[i+2]
        SUB     R4, R4, R5      ; x[i+2] = y[i+2] - y[i+3]
        STMIA   R1!, {R2-R4}    ; Store x[i] to x[i+2]
        MOV     R2, R5          ; y[i+3] is the next y[i]
        CMP     R0, R6          ; Finished ?
        BLT     Loop

Analysing how this code executes for a y[] array of size 100, as described
above for the original code, produces the following results:

  339 IF (66 caused by branching), 101 R, 34 WB, 100 W (574 cycles)
  Code size: 9 instructions
  Saving over the original code: 460 IF, 67 WB

Here is the code unrolled ten times and then optimised in the same way:

        LDR     R2, [R0], #4    ; Preload y[i]
Loop    LDMIA   R0!, {R3-R12}   ; Load y[i+1] to y[i+10]
        SUB     R2, R2, R3      ; x[i] = y[i] - y[i+1]
        SUB     R3, R3, R4      ; x[i+1] = y[i+1] - y[i+2]
        SUB     R4, R4, R5      ; x[i+2] = y[i+2] - y[i+3]
        SUB     R5, R5, R6      ; x[i+3] = y[i+3] - y[i+4]
        SUB     R6, R6, R7      ; x[i+4] = y[i+4] - y[i+5]
        SUB     R7, R7, R8      ; x[i+5] = y[i+5] - y[i+6]
        SUB     R8, R8, R9      ; x[i+6] = y[i+6] - y[i+7]
        SUB     R9, R9, R10     ; x[i+7] = y[i+7] - y[i+8]
        SUB     R10, R10, R11   ; x[i+8] = y[i+8] - y[i+9]
        SUB     R11, R11, R12   ; x[i+9] = y[i+9] - y[i+10]
        STMIA   R1!, {R2-R11}   ; Store x[i] to x[i+9]
        MOV     R2, R12         ; y[i+10] is the next y[i]
        CMP     R0, R13         ; Finished ?
        BLT     Loop

Analysing how this code executes for a y[] array of size 100 produces the
following results:

  169 IF (18 caused by branching), 101 R, 10 WB, 100 W (380 cycles)
  Code size: 16 instructions
  Saving over the original code: 630 IF, 91 WB

Thus for this problem, unless the extra seven instructions make the code
too large, unrolling ten times is likely to be the optimum solution.
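For comparison, the three-times-unrolled loop above corresponds to this C
sketch (n is assumed to be a multiple of 3, mirroring the assembly's
assumption that the array length divides evenly):

```c
#include <stdint.h>

/* Three-times-unrolled x[i] = y[i] - y[i+1]; n must be a multiple
   of 3 and y must have n+1 elements. Each iteration loads only
   y[i+1]..y[i+3] and reuses the previously loaded y[i], as the
   LDMIA-based assembly does. */
void diff3(int32_t *x, const int32_t *y, int n)
{
    int32_t a = y[0];                    /* preload y[i] */
    for (int i = 0; i < n; i += 3) {
        int32_t b = y[i + 1], c = y[i + 2], d = y[i + 3];
        x[i]     = a - b;
        x[i + 1] = b - c;
        x[i + 2] = c - d;
        a = d;                           /* y[i+3] is the next y[i] */
    }
}
```
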
However, loop unrolling is not always a good idea, especially when the
optimisation between one iteration and the next is small. Consider the
following loop which copies an area of memory:

Loop    LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        CMP     R0, #LimitAddress
        BNE     Loop

This could be unrolled as follows:

Loop    LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        CMP     R0, #LimitAddress
        BLT     Loop

In this code the CMP and the branch will be executed only a quarter as
often, but this will give only a small saving. Moreover, other issues
should be taken into account:

If in the above case the amount of data to be transferred is not a
multiple of 48 words (the amount copied by one unrolled iteration), then
this amount of loop unrolling will copy too much data. This may be
catastrophic, or may merely be inefficient.

On a cached ARM processor, the larger the inner loop, the more likely it
is that the loop will not stay entirely in the cache. In this case, it is
not obvious at what point the performance gain due to unrolling is offset
by the performance loss due to cache misses, or the disadvantage of larger
code.

On an ARM processor with a write buffer, the loop unrolling in the above
example is unlikely to help. If the data being copied is not in the cache,
then every LDMIA will be stalled while the write buffer empties. Thus the
time the CMP and the branch take is irrelevant, as the processor will be
stalled on the following LDMIA.

Loop unrolling can be a useful technique, but be warned that it doesn't
always gain anything, and in some situations can reduce performance -
detailed analysis is often necessary.

Write Buffer Stalling
---------------------

On ARM processors with a write buffer, performance can be maximised by
writing code which avoids stalling due to the write buffer.
For a write buffer with 2 tags and 8 words, such as the ARM610's, no three
STR or STM instructions should be close together (as the third will be
stalled until the first has finished). Similarly, no two STR or STM
instructions which together store more than 8 words should be close
together, as the second will be stalled until there is space in the write
buffer.

Rearranging code so that the write buffer does not cause a stall in this
way is often hard, but is frequently worth the effort, and in any case it
is always wise to be aware of this performance factor.

16-bit Data
-----------

If possible treat 16-bit data as 32-bit data. However, if this is not
possible, then be aware that it is possible to make use of the barrel
shifter and non word-aligned LDRs in order to make working with 16-bit
data more efficient than might initially be imagined. See Using 16-bit
Data on the ARM starting on page 41 for a full discussion of this topic.

8-bit Data
----------

When processing a sequence of byte sized objects (eg. strings), the number
of loads and stores can be reduced if the data is loaded a word at a time
and then processed a byte at a time by extracting the bytes using the
barrel shifter.

The Floating Point Emulator
---------------------------

If the software-only floating point emulator is being used, then floating
point instructions should be placed sequentially, as the floating point
emulator will detect that the next instruction is also a floating point
instruction, and will emulate it without leaving the undefined instruction
handler code. Note that this advice is not applicable to systems which use
the ARM FPA co-processor.

Make Full Use of Cache Lines
----------------------------

In order to help the cache on a cached ARM processor maintain a high hit
rate for data, place frequently accessed data values together so that they
are loaded into the same cache line, rather than scattering them around
memory, as scattering will require more cache lines to be loaded and kept
in the cache.
In a similar vein, commonly called subroutines (especially short ones)
will often run more quickly on a cached ARM processor if the entry address
is aligned so that it will be loaded into the first word of a cache line.
On the ARM610, for example, this means quad-word aligned. This ensures
that all four words of the first line fetch will subsequently be used by
instruction fetches before another line fetch is caused by an instruction
fetch. This technique is most useful for large programs which do not cache
well, as the number of times the code will have to be fetched from memory
is not likely to be significant if the program does cache well.

Minimise Non-sequential Cycles
------------------------------

This technique is only appropriate to uncached ARM processors, and is
intended for memory systems in which non-sequential memory accesses take
longer than sequential memory accesses.

Consider such a system in which the length of memory bursts is B. That is,
if executing a long sequence of data operations, the memory accesses which
result are: one non-sequential memory cycle followed by B-1 sequential
memory cycles. eg. DRAM controlled by the ARM memory manager MEMC1a.

This sequence of memory accesses will be broken up by several ARM
instruction types: Load or Store (single or multiple), Data Swap, Branch
instructions, SWIs and other instructions which modify the PC. By placing
these instructions carefully, so that they break up the normal sequence of
memory cycles only where a non-sequential cycle was about to occur anyway,
the number of sequential cycles which are turned into longer
non-sequential cycles can be minimised.

For a memory system which has memory bursts of length B, the optimal
position for instructions which break up the memory cycle sequence is 3
words before the next B-word boundary. To help explain this, consider a
memory system with memory bursts of length 4 (ie. quad word bursts); here
the optimal position for these break-up instructions is 16-12=4 bytes
past a quad-word boundary. To demonstrate this is the case, imagine
executing the following code in this system:

  0x0000  Data Instr 1
  0x0004  STR
  0x0008  Data Instr 2
  0x000C  Data Instr 3
  0x0010  Data Instr 4

The memory cycles produced by executing this code are (taking into account
the ARM instruction pipeline):

  Instruction Fetch 0x0000 (Non Seq)
  Instruction Fetch 0x0004 (Seq)
  Instruction Fetch 0x0008 (Seq)      + Execute Data Instr 1
  Instruction Fetch 0x000C (Seq)      + Execute STR
  Data Write (Non Seq)
  Instruction Fetch 0x0010 (Non Seq)  + Execute Data Instr 2
  Instruction Fetch 0x0014 (Seq)      + Execute Data Instr 3

The instruction fetch after the Data Write cycle had to be a
non-sequential cycle, but since that instruction fetch was of a quad-word
aligned address it would have been non-sequential anyway. Hence the STR is
optimally positioned to avoid changing sequential instruction fetches into
non-sequential instruction fetches.

Use an Efficient Algorithm
--------------------------

Despite all these techniques for optimising ARM Assembly Language, it is
important to take care to start off with an efficient algorithm - all the
optimisations in the world won't turn bubble sort into an n log n sorting
algorithm!

3.11.3 Related Topics
---------------------

Most of the recipes in this chapter (Exploring ARM Assembly Language
starting on page 20) are likely to have some relevance. Specific
references have been indicated above for particular topics.