<< Previous Message Main Index Next Message >>
<< Previous Message in Thread This Month Next Message in Thread >>
Date   : Wed, 15 Jun 2005 15:03:52 +0100
From   : Richard_Talbot-Watkins@...
Subject: Faster 6502 multiplication

Here's a little trick I'd never come across before: doubtless it's well
known, but thought I'd share the goodies.  Normally, multiplication of two
8-bit numbers can be achieved with a little routine such as this:

.multiply
\ Multiplies A*X, and stores result in product/product+1

CPX #0:BEQ zero
DEX:STX product+1
LSR A:STA product
LDA #0
BCC s1:ADC product+1:.s1 ROR A:ROR product
BCC s2:ADC product+1:.s2 ROR A:ROR product
BCC s3:ADC product+1:.s3 ROR A:ROR product
BCC s4:ADC product+1:.s4 ROR A:ROR product
BCC s5:ADC product+1:.s5 ROR A:ROR product
BCC s6:ADC product+1:.s6 ROR A:ROR product
BCC s7:ADC product+1:.s7 ROR A:ROR product
BCC s8:ADC product+1:.s8 ROR A:ROR product
STA product+1
RTS
.zero
STX product:STX product+1
RTS

This completes in an average of 113 cycles (excluding the multiply by zero
special case) - not bad, but it's possible to do better, provided you're
happy to set aside some space for some tables...


There is a method which yields the product of two values from the
difference between two squares.

Mathematically:

  (a+b)^2 = a^2 + b^2 + 2ab     --> (I)
  (a-b)^2 = a^2 + b^2 - 2ab     --> (II)

(I) minus (II) gives:

  (a+b)^2 - (a-b)^2 = 4ab

or, in other words:

        (a+b)^2     (a-b)^2
  ab =  -------  -  -------
           4           4

So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and
then can achieve multiplication via a single 16-bit subtract!  In reality,
we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is
less than 256, and a further 2 for when n is 256 or greater.

Curiously, the integer division by 4 - and hence, truncation of the values
stored in the tables - doesn't appear to sacrifice any accuracy.  It's as
if there's never any 'borrow' from bit 2 to bit 1 in any of the
subtractions it ever performs, but I haven't investigated *why* this should
be.  Perhaps someone else can explain?  I honestly expected to have to
implement some fiddle factor to make up for truncation when I was writing
this, and yet it seems totally unnecessary.

Here's some code:

  10 sqrlo256%=&900
  20 sqrlo512%=&A00
  30 sqrhi256%=&B00
  40 sqrhi512%=&C00
  50 :
  60 FOR N%=0 TO 255
  70 s256%=(N%*N%) DIV 4
  80 s512%=((N%+256)*(N%+256)) DIV 4
  90 N%?sqrlo256%=s256% AND 255
 100 N%?sqrhi256%=s256% DIV 256
 110 N%?sqrlo512%=s512% AND 255
 120 N%?sqrhi512%=s512% DIV 256
 130 NEXT
 140 :
 150 num1=&70:num2=&71:result=&72
 160 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
 170 .mult
 180 SEC:LDA num1:SBC num2:BCS positive
 190 EOR #255:ADC #1:.positive TAY
 200 CLC:LDA num1:ADC num2:TAX
 210 BCS morethan256:SEC
 220 LDA sqrlo256%,X:SBC sqrlo256%,Y:STA result
 230 LDA sqrhi256%,X:SBC sqrhi256%,Y:STA result+1
 240 RTS
 250 .morethan256
 260 LDA sqrlo512%,X:SBC sqrlo256%,Y:STA result
 270 LDA sqrhi512%,X:SBC sqrhi256%,Y:STA result+1
 280 RTS
 290 ]:NEXT
 300 :
 310 !result=0
 320 REPEAT
 330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
 340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
 350 IF ?num1*?num2<>!result:END
 360 UNTIL FALSE

So there's a routine which achieves an 8 bit * 8 bit multiplication in just
55 cycles - over twice the speed of the original routine.  The only problem
with this is the amount of table space required - 1k.

It turns out that we can reduce the table space by 256 bytes by unifying
sqrlo256 and sqrlo512.  They are virtually the same, apart from that
sqrlo512(n) = sqrlo256(n) EOR &80 when n is odd.

Which leads us to the following space optimisation:

  10 sqrlo%=&900
  20 sqrhi256%=&A00
  30 sqrhi512%=&B00
  40 :
  50 FOR N%=0 TO 255
  60 s256%=(N%*N%) DIV 4
  70 s512%=((N%+256)*(N%+256)) DIV 4
  80 N%?sqrlo%=s256% AND 255
  90 N%?sqrhi256%=s256% DIV 256
 100 N%?sqrhi512%=s512% DIV 256
 110 NEXT
 120 :
 130 num1=&70:num2=&71:result=&72
 140 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
 150 .mult
 160 SEC:LDA num1:SBC num2:BCS positive
 170 EOR #255:ADC #1:.positive TAY
 180 CLC:LDA num1:ADC num2:TAX
 190 BCS morethan256
 200 LDA sqrhi256%,X:STA result+1
 210 LDA sqrlo%,X:SEC:BCS lessthan256
 220 .morethan256
 230 LDA sqrhi512%,X:STA result+1
 240 TXA:AND #1:BEQ skip:LDA #&80:.skip
 250 EOR sqrlo%,X:.lessthan256
 260 SBC sqrlo%,Y:STA result
 270 LDA result+1:SBC sqrhi256%,Y:STA result+1
 280 RTS
 290 ]:NEXT
 300 :
 310 !result=0
 320 REPEAT
 330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
 340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
 350 IF ?num1*?num2<>!result:END
 360 UNTIL FALSE

...which is an average of about 67 cycles, i.e. still substantially better
than the original routine.

I wish I'd thought of this trick when I was writing
multiplication-intensive demos years ago...

Cheers,
Rich


**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
postmaster@...

This footnote also confirms that this email message has been checked
for all known viruses.

**********************************************************************
Sony Computer Entertainment Europe
<< Previous Message Main Index Next Message >>
<< Previous Message in Thread This Month Next Message in Thread >>