Date : Wed, 15 Jun 2005 15:03:52 +0100
From : Richard_Talbot-Watkins@...
Subject: Faster 6502 multiplication
Here's a little trick I'd never come across before: doubtless it's well
known, but thought I'd share the goodies. Normally, multiplication of two
8-bit numbers can be achieved with a little routine such as this:
.multiply
\ Multiplies A*X, and stores result in product/product+1
CPX #0:BEQ zero
DEX:STX product+1
LSR A:STA product
LDA #0
BCC s1:ADC product+1:.s1 ROR A:ROR product
BCC s2:ADC product+1:.s2 ROR A:ROR product
BCC s3:ADC product+1:.s3 ROR A:ROR product
BCC s4:ADC product+1:.s4 ROR A:ROR product
BCC s5:ADC product+1:.s5 ROR A:ROR product
BCC s6:ADC product+1:.s6 ROR A:ROR product
BCC s7:ADC product+1:.s7 ROR A:ROR product
BCC s8:ADC product+1:.s8 ROR A:ROR product
STA product+1
RTS
.zero
STX product:STX product+1
RTS
This completes in an average of 113 cycles (excluding the multiply by zero
special case) - not bad, but it's possible to do better, provided you're
happy to set aside some space for some tables...
There is a method which yields the product of two values from the
difference between two squares.
Mathematically:
(a+b)^2 = a^2 + b^2 + 2ab --> (I)
(a-b)^2 = a^2 + b^2 - 2ab --> (II)
(I) minus (II) gives:
(a+b)^2 - (a-b)^2 = 4ab
or, in other words:
(a+b)^2 (a-b)^2
ab = ------- - -------
4 4
So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and
then can achieve multiplication via a single 16-bit subtract! In reality,
we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is
less than 256, and a further 2 for when n is 256 or greater.
Curiously, the integer division by 4 - and hence, truncation of the values
stored in the tables - doesn't appear to sacrifice any accuracy. It's as
if there's never any 'borrow' from bit 2 to bit 1 in any of the
subtractions it ever performs, but I haven't investigated *why* this should
be. Perhaps someone else can explain? I honestly expected to have to
implement some fiddle factor to make up for truncation when I was writing
this, and yet it seems totally unnecessary.
Here's some code:
10 sqrlo256%=&900
20 sqrlo512%=&A00
30 sqrhi256%=&B00
40 sqrhi512%=&C00
50 :
60 FOR N%=0 TO 255
70 s256%=(N%*N%) DIV 4
80 s512%=((N%+256)*(N%+256)) DIV 4
90 N%?sqrlo256%=s256% AND 255
100 N%?sqrhi256%=s256% DIV 256
110 N%?sqrlo512%=s512% AND 255
120 N%?sqrhi512%=s512% DIV 256
130 NEXT
140 :
150 num1=&70:num2=&71:result=&72
160 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
170 .mult
180 SEC:LDA num1:SBC num2:BCS positive
190 EOR #255:ADC #1:.positive TAY
200 CLC:LDA num1:ADC num2:TAX
210 BCS morethan256:SEC
220 LDA sqrlo256%,X:SBC sqrlo256%,Y:STA result
230 LDA sqrhi256%,X:SBC sqrhi256%,Y:STA result+1
240 RTS
250 .morethan256
260 LDA sqrlo512%,X:SBC sqrlo256%,Y:STA result
270 LDA sqrhi512%,X:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE
So there's a routine which achieves an 8 bit * 8 bit multiplication in just
55 cycles - over twice the speed of the original routine. The only problem
with this is the amount of table space required - 1k.
It turns out that we can reduce the table space by 256 bytes by unifying
sqrlo256 and sqrlo512. They are virtually the same, apart from that
sqrlo512(n) = sqrlo256(n) EOR &80 when n is odd.
Which leads us to the following space optimisation:
10 sqrlo%=&900
20 sqrhi256%=&A00
30 sqrhi512%=&B00
40 :
50 FOR N%=0 TO 255
60 s256%=(N%*N%) DIV 4
70 s512%=((N%+256)*(N%+256)) DIV 4
80 N%?sqrlo%=s256% AND 255
90 N%?sqrhi256%=s256% DIV 256
100 N%?sqrhi512%=s512% DIV 256
110 NEXT
120 :
130 num1=&70:num2=&71:result=&72
140 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
150 .mult
160 SEC:LDA num1:SBC num2:BCS positive
170 EOR #255:ADC #1:.positive TAY
180 CLC:LDA num1:ADC num2:TAX
190 BCS morethan256
200 LDA sqrhi256%,X:STA result+1
210 LDA sqrlo%,X:SEC:BCS lessthan256
220 .morethan256
230 LDA sqrhi512%,X:STA result+1
240 TXA:AND #1:BEQ skip:LDA #&80:.skip
250 EOR sqrlo%,X:.lessthan256
260 SBC sqrlo%,Y:STA result
270 LDA result+1:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE
...which is an average of about 67 cycles, i.e. still substantially better
than the original routine.
I wish I'd thought of this trick when I was writing
multiplication-intensive demos years ago...
Cheers,
Rich
**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
postmaster@...
This footnote also confirms that this email message has been checked
for all known viruses.
**********************************************************************
Sony Computer Entertainment Europe