Date   : Mon, 25 Jul 1994 12:56:32 WET DST
From   : Bonfield James <jkb@...>
Subject: Re: binary search opcode (was: Re: my 6502 now dead fast!)

Stephen Quan wrote:

>In real practice, I suspect that in any general code that is being
>executed, there are more than 15 instructions frequently used,
>meaning that if the lower entries of the binary search tree are
>hit often enough then this C program will perform worse than the
>standard switch statement.  Because the time spent on an instruction
>depends on where it is placed in the tree, this may yield non-smooth
>emulation.
>
>As a proof of concept, I can demonstrate that this program will perform
>quite well for the 256^3 loop instruction, by moving DEX, DEY and BNE
>to the top 2 levels of the search tree, and having TXA, TAX, LDX #
>(or was it TYA, TYA and LDY #) at the 3rd level.  This shows nothing
>more than an example program that can perform well under benchtests
>but perhaps worse in general application.

Someone (you, I think) mentioned Huffman encoding. This would allow the
most frequently used instructions to sit at the top of the tree and hence
dispatch fastest. However, it is often the case that a tight loop, run many
times, uses opcodes from the lower levels of the tree. In these cases we would
like to adjust the tree. A method known as Adaptive Huffman coding allows for
this. The downside is that it requires computation at runtime rather than at
'tree generation' time, and I'm unsure how much overhead that adds. For
starters, it requires keeping counts of the instructions being run.

I'm still unsure why your tree method should be any faster. There must be
some way, somehow, of getting the compiler to optimise a switch statement
better. I've run some tests, very similar to yours (except with a bit more
trickery to dodge over-optimisation), on our Alpha. The number of
instructions involved in the indirect jump was as follows:

gcc -O2             10
cc -O2               7
cc -migrate -O4      6

(Note: "cc -O4" can't produce assembly so I couldn't test the best
optimisation available there.)

401FF7BC     0050               cmpule  r0, 255, r28   
E7800203     0054               beq     r28, L$5       
A79D8010     0058               ldq     r28, (gp)      
401C065C     005C               s8addq  r0, r28, r28   
A79C0000     0060               ldq     r28, (r28)     
6BFC0000     0064               jmp     r28            

The six instructions in the best case seem fairly minimal, as far as I can
tell (which isn't much, as I understand little of Alpha assembly language).

L$5 is the 'increment our loop counter' code. It doesn't need to be tested
for, as every 'case' is accounted for, so there's one extra instruction that
ought to have been junked. Another thing I noticed is that my code of:

    *PC = rand() & 0xff;

    for (l=0; l<33685245L; l++) {
        opcode = *PC;
        switch (opcode) {
        case 0x00: A++; break;
        case 0x01: X--; break;
...

produced, when PC was of type "unsigned char *", code doing 'mov r3,opcode' on
EVERY loop iteration! This didn't happen when PC was of type "int *". It seems
to me that the choice of a decent compiler makes a huge difference. This is
obviously somewhere assembly programmers win out.

       James