Subject: Unrolling loops


Date   : Thu, 17 Jun 1993 23:45:36 +1200
From   : David Andrew Sainty <David.Sainty@...>
Subject: Unrolling loops

   From: matthew@... (Matthew Sweet)

   David Andrew Sainty replied:
   >Fastest method is using an index on self modifying absolute address
   >instructions. eg.
   >
   >lda#start DIV256:sta adr+2
   >lda#end DIV256:sta adr2+2
   >ldy#3:ldx#0
   >.adr lda &FF00,X
   >.adr2 sta &FF00,X
   >inx:bne adr
   >inc adr+2
   >inc adr2+2
   >dey:bpl adr
   >
   >Which is 7.2 milliseconds. Hmm, not too bad....
   >
   >This is if the sta takes a consistent 5 cycles, which may be wrong, so
   >adjust accordingly.... (If it's not 5 cycles, it's 4).

   This can be sped up still further by copying more bytes
   each time through the loop. This takes more program bytes,
   but more time is spent transferring memory, and less
   is spent looping. Also, there is no need to use Y:

   ldx#start DIV256:stx adr1+2
   inx:stx adr3+2
   inx:stx adr5+2
   inx:stx adr7+2
   ldx#end DIV256:stx adr2+2
   inx:stx adr4+2
   inx:stx adr6+2
   inx:stx adr8+2
   ldx#0
   .adr1 lda &FF00,X
   .adr2 sta &FF00,X
   .adr3 lda &FF00,X
   .adr4 sta &FF00,X
   .adr5 lda &FF00,X
   .adr6 sta &FF00,X
   .adr7 lda &FF00,X
   .adr8 sta &FF00,X
   inx:bne adr1

   If Dave's memory serves him right about timing issues...
   Inner loop: 4+5+4+5+4+5+4+5+2+3=41 cycles for four bytes.
   Total cost is: 2+4+2+4+2+4+2+4+2+4+2+4+2+4+2+4+2+256*41 = 10546

   Which is about 5.27 milliseconds at 2 MHz....

Quite right! Of course, you can always speed things up by unravelling loops,
but this particular case suits unravelling so well I was stupid to miss it!
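
To see why this case unrolls so neatly, here is a rough Python model of the
two loops (not the 6502 code itself). It assumes four whole pages are copied,
that source and destination do not overlap, and that memory is a flat 64K
byte array; the function names are made up for illustration.

```python
PAGE = 256  # bytes per 6502 page


def copy_rolled(mem, src_page, dst_page, pages=4):
    """One lda/sta pair; the high address bytes are bumped after each page."""
    for p in range(pages):
        for x in range(PAGE):
            mem[(dst_page + p) * PAGE + x] = mem[(src_page + p) * PAGE + x]


def copy_unrolled(mem, src_page, dst_page, pages=4):
    """One lda/sta pair per page, all sharing the same X index."""
    for x in range(PAGE):
        for p in range(pages):
            mem[(dst_page + p) * PAGE + x] = mem[(src_page + p) * PAGE + x]
```

Both move the same 1024 bytes; only the order of the individual byte copies
differs, which is why the result is the same as long as the two regions do
not overlap.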

At least I got the timing right! :-)

LDA abs,X is 4 cycles (+1 if the index crosses a page boundary, which never
happens here because the base addresses are page-aligned)
STA abs,X is 5 cycles in all cases.
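
For anyone who wants to check the sums, a small Python tally using the cycle
counts quoted in this thread. The 2 MHz clock is an assumption inferred from
the millisecond figures, the names are made up for illustration, and like the
figures above it charges 3 cycles for every bne, including the final one that
is not taken.

```python
# Cycle counts as quoted in this thread.
LDA_ABS_X = 4   # no page crossing: the base addresses are page-aligned
STA_ABS_X = 5
INX = 2
BNE_TAKEN = 3
LDX_IMM = 2
STX_ABS = 4
CLOCK_HZ = 2_000_000  # assumed 2 MHz clock


def unrolled_cycles():
    """Setup plus 256 passes of the four-pair unrolled loop."""
    setup = 2 * (LDX_IMM + STX_ABS + 3 * (INX + STX_ABS)) + LDX_IMM
    inner = 4 * (LDA_ABS_X + STA_ABS_X) + INX + BNE_TAKEN
    return setup + 256 * inner


def rolled_inner_cycles():
    """Inner loop of the single-pair version; per-page overhead is ignored."""
    return 1024 * (LDA_ABS_X + STA_ABS_X + INX + BNE_TAKEN)


def to_ms(cycles):
    """Convert a cycle count to milliseconds at the assumed clock rate."""
    return cycles * 1000 / CLOCK_HZ
```

unrolled_cycles() comes to 10546 (about 5.27 ms), and rolled_inner_cycles()
to 14336 (about 7.17 ms), in line with the figures quoted above.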

Dave.
