Date : Thu, 17 Jun 1993 09:48:19 BST
From : matthew@... (Matthew Sweet)
Subject: Unrolling loops
Mik Davis wrote:
> I need a *fast* routine to move 1024 bytes of RAM across the memory of my
>machine. Must be compatible with all the 6502 based machines (B, B+, Master etc)
David Andrew Sainty replied:
>Fastest method is using an index on self modifying absolute address
>instructions. eg.
>
>lda#start DIV256:sta adr+2
>lda#end DIV256:sta adr2+2
>ldy#3:ldx#0
>.adr lda &FF00,X
>.adr2 sta &FF00,X
>inx:bne adr
>inc adr+2
>inc adr2+2
>dey:bpl adr
>
>This works fine if the blocks you want to move are on page boundaries.
>If they are not, it still works but you have to program both bytes of the
>address fields. You will also lose some speed as the lda instruction takes
>an extra cycle if the address +X crosses a page boundary (I don't think
>the sta takes an extra cycle though). In this case, you should move the
>first few bytes separately to bring the load address up to a page
>boundary, then use this type of routine for the rest.
>
>If my memory serves correct, this routine takes 4+5+2+3=14 cycles for
>each byte on the inner loop, which is what really counts. total cost
>is 2+4+2+4+2+2+4*(256*(4+5+2+3)-1+6+6+2+3)-1=14415 I think.
>
>Which is 7.2 milliseconds. Hmm, not too bad....
>
>This is if the sta takes a consistent 5 cycles, which may be wrong, so
>adjust accordingly.... (If it's not 5 cycles, it's 4).
>
>Dave.
This can be sped up still further by copying more bytes
each time through the loop. This takes more program bytes,
but more time is spent transferring memory, and less
is spent looping. Also, there is no need to use Y, since each lda/sta
pair is patched to its own page and one pass of X covers all four:
ldx#start DIV256:stx adr1+2
inx:stx adr3+2
inx:stx adr5+2
inx:stx adr7+2
ldx#end DIV256:stx adr2+2
inx:stx adr4+2
inx:stx adr6+2
inx:stx adr8+2
ldx#0
.adr1 lda &FF00,X
.adr2 sta &FF00,X
.adr3 lda &FF00,X
.adr4 sta &FF00,X
.adr5 lda &FF00,X
.adr6 sta &FF00,X
.adr7 lda &FF00,X
.adr8 sta &FF00,X
inx:bne adr1
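As a sanity check on the unrolled routine above, here is a minimal Python
sketch of what it does to memory. It is only a model, not 6502 code: the
name unrolled_copy, the 64K bytearray and the example pages &30 and &40
are made up for illustration, and it assumes both blocks are page aligned
and do not overlap.

```python
def unrolled_copy(mem, start_page, end_page):
    """Model of the four-way unrolled loop: X runs 0..255 and each
    lda/sta pair handles one of four consecutive pages."""
    for x in range(256):                  # ldx#0 ... inx:bne
        for page in range(4):             # the four lda/sta pairs
            src = ((start_page + page) << 8) + x
            dst = ((end_page + page) << 8) + x
            mem[dst] = mem[src]           # lda src,X : sta dst,X
    return mem


# Hypothetical example: copy 1024 bytes from &3000 to &4000.
mem = bytearray(65536)
for i in range(1024):
    mem[0x3000 + i] = (i * 7 + 3) & 0xFF  # arbitrary test pattern

unrolled_copy(mem, 0x30, 0x40)
copied_ok = mem[0x4000:0x4400] == mem[0x3000:0x3400]
print("1024 bytes copied correctly:", copied_ok)
```

Note that the bytes are not moved in address order: each pass of the loop
touches one byte in each of the four pages, which is why a single 8-bit
index is enough.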
If Dave's memory serves him right about timing issues...
Inner loop: 4+5+4+5+4+5+4+5+2+3=41 cycles for four bytes.
Total cost is: 2+4+2+4+2+4+2+4+2+4+2+4+2+4+2+4+2+256*41 = 10546
Which is about 5.27 milliseconds....
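The two cycle counts can be rechecked with a few lines of Python. This is
a rough sketch that simply repeats the sums from the two posts: the
constant names are invented here, the 2 MHz clock is an assumption
inferred from Dave's "14415 cycles is 7.2 milliseconds", and the lda is
assumed never to cross a page boundary.

```python
CLOCK_HZ = 2_000_000  # assumption: 2 MHz clock, implied by 14415 cycles = 7.2 ms

# 6502 cycle counts as used in the posts (no page crossing on the lda)
LDA_ABS_X, STA_ABS_X = 4, 5
INX, DEY, BRANCH_TAKEN = 2, 2, 3
LD_IMM, ST_ABS, INC_ABS = 2, 4, 6

# Dave's loop: one lda/sta pair, outer loop over 4 pages
simple_setup = 2 * (LD_IMM + ST_ABS) + 2 * LD_IMM            # 16
simple_inner = LDA_ABS_X + STA_ABS_X + INX + BRANCH_TAKEN    # 14 per byte
simple_page = 256 * simple_inner - 1 + 2 * INC_ABS + DEY + BRANCH_TAKEN
simple_total = simple_setup + 4 * simple_page - 1            # last bpl not taken

# Unrolled loop: four lda/sta pairs, single pass of X
unrolled_setup = 3 * LD_IMM + 8 * ST_ABS + 6 * INX           # 50
unrolled_inner = 4 * (LDA_ABS_X + STA_ABS_X) + INX + BRANCH_TAKEN  # 41 per 4 bytes
unrolled_total = unrolled_setup + 256 * unrolled_inner

simple_ms = simple_total / CLOCK_HZ * 1000
unrolled_ms = unrolled_total / CLOCK_HZ * 1000
print(simple_total, "cycles =", round(simple_ms, 2), "ms")
print(unrolled_total, "cycles =", round(unrolled_ms, 2), "ms")
```

The unrolled total is counted as in the post above, without subtracting
the one cycle saved when the final bne falls through.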
Matthew
matthew@...