Subject: binary search opcode (was: Re: my 6502 now dead fast!)

<< Previous Message	Main Index	Next Message >>
<< Previous Message in Thread	This Month	Next Message in Thread >>
Date   : Fri, 22 Jul 1994 09:31:11 EST
From   : Stephen Quan <quan@...>
Subject: binary search opcode (was: Re: my 6502 now dead fast!)

Stephen Quan wrote (that's me) :

> I have recently found that this switch statement costs nearly the
> equivalent of 8 if statements, and have thought about using if
> statements instead of a switch statement. Something like a binary
> search.

James Fidell writes :

> Would you post your evidence for this ? It would be interesting to see
> it. Are you saying that 8 compare/branch instruction pairs could be
> executed in the time it takes to reach the code associated with any of
> the individual cases in a swtich statement implementation ?

The numbers I got were rough as guts and were based on a sparc II.
It does, of course, depend on your compiler and platform what the
equivalent time a switch statements takes, but in my opinion it was
unnecessary slow for our purposes.

Since we are doing switch (opcode), and list 256 possible entries the
compiler should use a register as an index variable into a table of
256-locations to jump to.  Using the right compiler options some
compilers will do this, however, I've noted if you have a 4-case
branch (even if it is 0, 1, 2 and 3) it still builds a 256 sized table.
This is because the base type is still an unsigned char.

The test program I wrote was simple, and I could easily test the
performance of the whole 8 levels of the tree if I wanted to!
(BTW, James F you were right about that if statement, I copied it
wrong).

The actual fragment of code I use is

    {
      register unsigned char opcode;
      register          char findopcode;  /* note that this is signed. */

      /* opcode = *PC++; - this is the real fetch */
      opcode = *PC + 0x41; /* this is the fake fetch to test 2 deep search. */

      if ((findopcode=opcode-0x80) == 0)
        goto opcode_0x80;
      else if (findopcpode>0)
        goto try_above_0x80;

      if ((findopcode=opcode-0x40) == 0)
        goto opcode_0x40;
      else if (findopcpode>0)
        goto try_above_0x40;
  
      if ((findopcode=opcode-0x40) == 0)
        goto opcode_0x40;
      else if (findopcpode>0)
        goto try_above_0x40;

      if ((findopcode=opcode-0x20) == 0)
        goto opcode_0x20;
      else if (findopcpode>0)
        goto try_above_0x20;

      if ((findopcode=opcode-0x10) == 0)
        goto opcode_0x10;
      else if (findopcpode>0)
        goto try_above_0x10;

      if ((findopcode=opcode-0x08) == 0)
        goto opcode_0x08;
      else if (findopcpode>0)
        goto try_above_0x08;

      if ((findopcode=opcode-0x04) == 0)
        goto opcode_0x04;
      else if (findopcpode>0)
        goto try_above_0x04;

      if ((findopcode=opcode-0x02) == 0)
        goto opcode_0x02;
      else if (findopcpode>0)
        goto opcode_0x03;
      else
        goto opcode_0x01;

      try_above_0x80:
        goto opcode_0x81;

      try_above_0x40:
        goto opcode_0x41;

      try_above_0x20:
        goto opcode_0x21;

      try_above_0x10:
        goto opcode_0x11;

      try_above_0x08:
        goto opcode_0x09;

      try_above_0x04:
        goto opcode_0x05;

      opcode_0x00: A++; goto finish;
      opcode_0x01: X--; goto finish;
      opcode_0x02: Y--; goto finish;
      opcode_0x03: A++; goto finish;
      opcode_0x04: X--; goto finish;
      opcode_0x05: Y--; goto finish;
      opcode_0x06: A++; goto finish;
      opcode_0x07: X--; goto finish;
      opcode_0x08: Y--; goto finish;
        /* ...  */
      opcode_0xff: Y--; goto finish;
      finish :
        ;
    }

For comparing the above with the switch case I wrote

    {
      register unsigned char opcode;

      /* opcode = *PC++; - this is the real fetch */
      opcode = *PC + 0x41; /* this is the fake fetch to test 2 deep search. */

      switch (opcode)
      {
        case 0x00: A++; break;
        case 0x01: X--; break;
          /* ... */
      }
    }

A few key notes.  I used opcode = *PC + 0x41, rather than straight
opcode = 0x41, because some compilers optimise the value of opcode
out and optimise all the redundant cases of my code out!  So I had
to do a few things to defeat the optimiser.  (BTW *PC == 0).  Those
dummy execution instructions where necessary to defeat the optimiser
too, otherwise if they were blank some C compilers optimise the entire
tree out.

My statement, 1 switch statement is approximately equal to 8 if
statements was based only on the scenario above.  Meaning that if
I had selected opcodes 0x81,0x80,0x41,0x40,0x21,0x20 I would get
the entire loop executing faster than the switch statement.  If
I had used opcodes 0x11,0x10 the speed would be on par or just below
the switch statement, and of course 0x09,0x08,0x05,0x04,0x03,0x02,0x01
generated the worse cases for the binary tree and used the most time.

It was from this observation I stated that if you traverse the
tree 3 levels deep (1+2+4 = 7 instructions) you could get performance
greater than the switch statement.  If you traverse the tree 4 levels
deep (1+2+4+8 = 15 instructions) you will approach the same speed or
slightly less than that of a switch statement.  The tests I ran were
solely on one platform, the sparc II and running with gcc with every
optimisation flag I could think off.  Basically this system was going
to be near the real system I would run.  I did extrapolate my times,
my multiply a ratio to quote figures for a 386DX-40, but I did  not
repeat the test on a 386DX-40.  Next time I post, I will do a complete
test on all platforms I get my hands on with the above code.

In generating performance figures for the 256^3 loop, I wrap the
whole lot in a loop and timed it.

   for (l=0; l<33685245L; l++)
   {
   }

This figure is really meaningless as it does not fully implement
the 256^3 loop, but as the fake instructions (A++, X--) consume
some time the figure could be taken to a certain degree.

I ran the tests at 1-level, 2-level, 3-levels and 4-levels deep on
a sparc II and I think I had times like 3s, 7s, 15s, and 23s respectively.
The switch statement executed at about 21s.  Sorry, I don't think these
figures are accurate, I will rerun my tests and quote the actual figures
later on.

In real practice, I suspect that in any general code that is being
executed, there are more than 15 instructions frequently used,
meaning that if the lower entries of the binary search tree are
hit often enough then this C program will perform worse than the
standard switch statement.  Because the time spent on the instruction
depends on where it is placed in the tree this may yield to a non-smooth
emulation.

As a proof of concept I can demonstrate that this program will perform
quite well for the 256^3 loop instruction, my moving DEX DEY and BNE
to the top 2 levels of the search tree, and have TXA, TAX, LDX#
(or was it TYA, TYA and LDY #) at the 3rd level.  This shows nothing
more than an example program that can perform well under benchtests
but perhaps worse in general application.
-- 
Stephen Quan (quan@...                 ), SysAdmin, Analyst/Programmer.
Centre for Spatial Information Studies, University of Tasmania, Hobart.
GPO BOX 252C, Australia, 7001.  Local Tel: (002) 202898 Fax: (002) 240282
International Callers use +6102 instead of (002).
<< Previous Message	Main Index	Next Message >>
<< Previous Message in Thread	This Month	Next Message in Thread >>