Date : Wed, 20 Jul 1994 08:24:04 +0100 (BST)
From : amh15@... (Alan Hart)
Subject: Fast assembler emulation techniques
I apologise for the length of this article. It might be very useful to those
writing assembler emulators, though.
It is an article I saw by Peter McGavin who wrote the Amiga Spectrum
Emulator, followed by a discussion between him and me about how his emulation
worked. Method 4 is very, very interesting, I think.
Anyway, read on...
>From lyra.csx.cam.ac.uk!doc.ic.ac.uk!daresbury!keele!uknet!bt!pipex!howland.reston.ans.net!usc!elroy.jpl.nasa.gov!decwrl!waikato!comp.vuw.ac.nz!zephyr.grace.cri.nz!maths!peterm
Thu Jun 2 09:06:09 1994
Path: lyra.csx.cam.ac.uk!doc.ic.ac.uk!daresbury!keele!uknet!bt!pipex!howland.reston.ans.net!usc!elroy.jpl.nasa.gov!decwrl!waikato!comp.vuw.ac.nz!zephyr.grace.cri.nz!maths!peterm
From: peterm@... (Peter McGavin)
Newsgroups: comp.sys.mac.games,comp.sys.sinclair
Subject: Re: Here is a (Mac) ZX Spectrum emulator
Date: 19 May 1994 02:59:11 GMT
Organization: Applied Maths, Industrial Research Ltd, NZ
Lines: 98
Message-ID: <PETERM.94May19145911@...>
References: <1994May9.194248.5896@...> <CpK0uw.GK@...>
<allen_marshall-100594195055@...> <CpnLJz.AMx@...>
<djkeogan.768743479@...> <1994May16.145910.17678@...>
NNTP-Posting-Host: kea.grace.cri.nz
In-reply-to: amd@...'s message of Mon, 16 May 1994 14:59:10 GMT
Xref: lyra.csx.cam.ac.uk comp.sys.mac.games:43163 comp.sys.sinclair:4292
amd@... (amd) writes:
>I really don't know what is the interpretation (instruction dispatching)
>technique
>you are using but, accordingly to my experience, choosing an efficient
>interpretation technique is more important to the performance of
>an emulator than the use of assembly language. In case you are using the
>method 1, I recommend you to switch to method 2 and to forget assembly
>language altogether.
>
>Method 1:
>
>The most inefficient technique I know, although a popular one, involves
>the association of a C function to each machine instruction, and the creation
>of
>an array of pointers to those functions. This method is nice and modular but
>is slow because the dispatching and execution of each instruction involves too
>many
>overheads: indexing the array to get the right function, calling it,
>creating the function activation record (frame), deleting the frame and
>returning. Furthermore this method forbids declaring the registers of the
>emulated processor
>as registers of the host machine (the C language does not support global
>'register' variables).
>
>Method 2:
>
>A much faster method is to write a HUGE switch statement (inside a loop), that
>accounts for all the instructions, and to declare some critical variables as
>'register'. The overheads are now less important involving only indexation
>of the jump table of the switch statement (that has been created by the
>compiler) and performing the jump itself.
The fastest methods I've discovered for the 680x0 are much faster, but
involve assembly. They are:
Method 3: (to continue your numbering)
Space out the 256 instruction routines on 256-byte boundaries, and use
the following code at the end of each routine. This is threaded code.
There is no main loop.
move.b (a1)+,(1$+2-base,a4) ; self-modify the next instruction
1$: jmp ($xx00,a4) ; jump to the next routine
This won't work on 68020+ with the instruction cache on, unless you
change it to something like:
move.b (a1)+,d0
lsl.w #8,d0
jmp (a4,d0.w) ; jump to the next routine
or:
move.b (a1)+,(a0)
move.w (a0),d0
jmp (a4,d0.w) ; jump to the next routine
if you have very fast memory.
Method 4:
This one is hard to describe, but extremely fast, especially for
prefixed opcodes. I use it in the Amiga Spectrum emulator v1.7, and
JPP uses a very similar technique.
Maintain a table of 65536 words, the vector table. Each word is the
offset to the instruction routine for the corresponding byte in Z80
address space. If the routine is unknown, the entry is 0. Initially
the table is all 0s.
Maintain a pseudo-pc pointing into this table, A3 say. Each instruction
routine ends with:
move.w (a3)+,d0
jmp (a4,d0.w)
There is a special routine (I call it "patch") corresponding to
offset 0. "Patch" does the following 3 things:
Calculate the offset of the routine for the current
instruction using a traditional table lookup.
Stuff the offset into the table so the next time this
instruction is executed control jumps directly to the
right routine.
Jump to the routine for this instruction.
Finally, every instruction that writes to Z80 memory must also zero
the corresponding word(s) in the vector table.
This method has the advantage that "patch" can look ahead,
semi-compiling prefixed instructions to handled as a single jump,
rather than the usual 2 (or 3) steps. It is also possible to code up
common sequences of instructions to be handled as a single jump.
Great care must be taken, however, to check for self-modifying code.
--
Peter McGavin. (peterm@... )
I then sent some mail to Peter McGavin asking some questions. I appear to
have lost this so please deduce my questions from his quotes of me, in his
reply (below).
>From peterm@... Thu Jun 2 23:28:49 1994
Received: from kea.grace.cri.nz [131.203.4.51]
by bootes.cus.cam.ac.uk with smtp
(Smail-3.1.28.1 #136) id m0q9LFj-000BzIC; Thu, 2 Jun 94 23:28 BST
Received: by kea.grace.cri.nz id <AA00452-5.65c+IDA/2.1.1>; Fri, 3 Jun 1994
10:28:36 +1200
Date: Fri, 3 Jun 1994 10:28:36 +1200
From: Peter McGavin <peterm@... >
Message-Id: <199406022228.AA00452@... >
To: amh15@...
Subject: Re: Coding emulators
Status: RO
Hi Alan,
>I was very impressed with your ideas for fast emulators you posted recently,
>particularly "method 4" which you use in your Amiga Spectrum Emulator.
Thanks... :)
>When your 'patch' routine first decodes an instruction in order to add an
>entry into the 65536-word-table, do you actually compile the instruction as
>it stands, there, in the code (e.g. 'jmp $8000' (OK so I can't remember my
>Z80)) and generate new code for every table entry (i.e. MASSES of code) or do
>you generate code that says 'jump indirect through Z80 program counter + 1'?
>The latter would involve 256 fixed pieces of code, which would need to access
>the Z80 program in order to determine arguments e.g. where to jump or what
>value to load; the former would generate a huge volume of very similar code,
>but the emulator wouldn't access the program memory at all (except when an
>address got invalidated). The '256 fixed pieces of code' seems more probable.
>I wonder if it would work for emulating a 68000.
Err, I'm not sure I understand your question. Currently, when 'patch'
compiles an instruction, it updates only 1 entry in the vector table.
Actually that's not completely true, but it's easiest to understand it
that way to begin with.
For example, suppose the Z80 code is:
$1233 LD A,B
$1234 LDIR
$1236 JP $3355
$1239 ...
and the corresponding entries in the vector table are:
$1233 0000
$1234 0000 0000
$1236 0000 0000 0000
(There is 1 word in the vector table for each byte of Z80 address space.)
All the words are 0000, i.e, they all point to 'patch'. After LD A,B
has been compiled, this changes to:
$1233 0354
$1234 0000 0000
$1236 0000 0000 0000
where $0354 is the offset to the LD A,B routine in 680x0 address
space. Then, after LDIR (which is a 2-byte 'prefixed' opcode) is
compiled, the table becomes:
$1233 0354
$1234 3008 fffe
$1236 0000 0000 0000
where $3008 is the offset to the LDIR routine in 680x0 address space.
Finally, after the JP $3355 is compiled, the table becomes:
$1233 0354
$1234 3008 fffe
$1236 0578 fffe fffe
where $0578 is the offset to the JP routine in 680x0 address space.
I haven't explained about the $fffes before. They are to handle self-
modifying code. The worst thing that could happen is that the Z80
program could change the jump address without poking the JP opcode
itself. In that case the vector table changes to:
$1233 0354
$1234 3008 fffe
$1236 0578 0000 0000
because every write to Z80 address space sets the corresponding entry
in the vector table to $0000 (with, e.g, 'clr.w (a3,d6.w*2)' where d6
is the Z80 address being poked and a3 points to the vector table).
The first thing the JP routine at offset $0578 does is check that the
next 2 vectors are still $fffe, e.g,
; code to handle JP
base+$0578 tst.w (a3)+
bmi.w selfmodifydetected1
tst.w (a3)+
bmi.w selfmodifydetected2
...
Also, note what happens if the Z80 program jumps to $1235, say. The
vector offset there is $fffe, so the 680x0 jumps to a location 2 bytes
preceding 'patch'. The emulator has a 680x0 'nop' instruction there,
so it drops through to 'patch'.
Now, on to something more advanced. Say the Z80 sequence:
$1233 LD A,B
$1234 LDIR
$1236 JP $3355
is incredibly common. There is nothing to stop us immediately
compiling this to:
$1233 5322
$1234 fffe fffe
$1236 fffe fffe fffe
where $5322 is the offset of a 680x0 routine which handles the whole
sequence at once. In other words, common sequences can be handled
like single instructions. I haven't done this, apart from a few
sequences like 'ADD A,B; DAA', which I emulate with 'abcd'.
Finally, there is nothing to stop 'patch' from generating and
optimising 680x0 routines for new sequences on-the-fly. I haven't
tried this at all yet though.
>Where did you get the idea for this excellent algorithm?
I can't remember now. It evolved out of some other ideas I had for
vector tables. Richard Carlsson(sp?) told me how to handle prefixed
opcodes and self-modifying code long after I released a version of my
emulator which used a slightly slower algorithm and didn't handle
self-modifying code correctly. I doubt I am the first to do this.
Research papers have been written on the topic, usually in abstract
terms though, I think. Digital have perfected machine translation
from VAX and MIPS to Alpha.
Regards, Peter.
>From amh15 Fri Jun 3 09:31:08 1994
Subject: Re: Coding emulators
To: peterm@... (Peter McGavin)
Date: Fri, 3 Jun 1994 09:31:08 +0100 (BST)
In-Reply-To: <199406022228.AA00452@... > from "Peter McGavin"
at Jun 3, 94 10:28:36 am
X-Mailer: ELM [version 2.4 PL21]
Status: RO
Hi Peter,
Thanks for your reply. It was really interesting to get some more insights
into your method.
> >When your 'patch' routine first decodes an instruction in order to add an
> >entry into the 65536-word-table, do you actually compile the instruction as
> >it stands, there, in the code (e.g. 'jmp $8000' (OK so I can't remember my
> >Z80)) and generate new code for every table entry (i.e. MASSES of code) or do
> >you generate code that says 'jump indirect through Z80 program counter + 1'?
> >The latter would involve 256 fixed pieces of code, which would need to access
> >the Z80 program in order to determine arguments e.g. where to jump or what
> >value to load; the former would generate a huge volume of very similar code,
> >but the emulator wouldn't access the program memory at all (except when an
> >address got invalidated). The '256 fixed pieces of code' seems more probable.
> >I wonder if it would work for emulating a 68000.
>
> Err, I'm not sure I understand your question. Currently, when 'patch'
> compiles an instruction, it updates only 1 entry in the vector table.
I was wondering whether I was being clear about what I wanted to know -
obviously not!
What I was actually getting at was the following.
The very fastest way that you could implement your idea would be to have a
separate 680x0 routine for _every_separate_Z80_address_, not just one for
each of 256 or so Z80 machine instructions. You would need a huge amount of
memory for this. In your example 'JP $3355', your patch routine would
generate a piece of 680x0 code saying 'jump to the address contained in
offset 3355 in the vector table' and set up the vector table entry for that
location to point to this new routine. The number $3355 would be hard-coded
into the 680x0 code. There would be many JP instructions in the Z80 program,
and each would have its own routine in 680x0 code. Each Jp instruction would
have a _different_ entry in the vector table.
Alternatively, and the way I interpret your method [I could be wrong], you
could use a less memory-hogging but slightly slower method [I would guess].
In this method, the vector table would contain the same address for every
occurrence of a JP instruction. Obviously the 680x0 code for that instruction
would have to know how to look up the address to jump to, which would mean
looking at the Z80 memory image (the couple of bytes after the 'JP'
instruction). Presumably byte order is reversed as well, so that's an extra
instruction.
Now that I think about it, the former method would probably give a marginal
speed improvement for a _lot_ of effort. The 'patch' routine would be
significantly slower in the this version because it would actually have to
write a small piece of code, but the resulting code would surely be
faster because all the instruction arguments (immediate data values, jump
addresses and so on) would be hard coded.
> ; code to handle JP
> base+$0578 tst.w (a3)+
> bmi.w selfmodifydetected1
> tst.w (a3)+
> bmi.w selfmodifydetected2
> ...
I'm not sure why self-modifying needs to be checked here. Maybe you could
enlighten me. Surely it will be caught by invalidating the vector table when
the write to memory occurs - or are you worrying about an instruction
overwriting itself, e.g.
label mov data,label+1
or something (please excuse my pidgin 680x0) ?
> Finally, there is nothing to stop 'patch' from generating and
> optimising 680x0 routines for new sequences on-the-fly. I haven't
> tried this at all yet though.
I think you've answered my earlier question here. Sort of.
Yours still somewhat in awe :)
Alan
| | amh15@... --- Work Tel: +44 223 337493 |
| Alan | Fax: +44 223 337706 --- Home Tel: +44 223 244595 |
| Hart | Microelectronics Research Centre, Cavendish Lab, |
| | Madingley Road, Cambridge, ENGLAND. CB3 0HE |
>From peterm@... Sat Jun 4 00:55:29 1994
Received: from kea.grace.cri.nz [131.203.4.51]
by bootes.cus.cam.ac.uk with smtp
(Smail-3.1.28.1 #136) id m0q9j55-000BzFC; Sat, 4 Jun 94 00:55 BST
Received: by kea.grace.cri.nz id <AA01822-5.65c+IDA/2.1.1>; Sat, 4 Jun 1994
11:55:12 +1200
Date: Sat, 4 Jun 1994 11:55:12 +1200
From: Peter McGavin <peterm@... >
Message-Id: <199406032355.AA01822@... >
To: amh15@...
Subject: Re: Coding emulators
Status: RO
Hi Alan,
>The very fastest way that you could implement your idea would be to have a
>separate 680x0 routine for _every_separate_Z80_address_, not just one for
>each of 256 or so Z80 machine instructions. You would need a huge amount of
>memory for this. In your example 'JP $3355', your patch routine would
>generate a piece of 680x0 code saying 'jump to the address contained in
>offset 3355 in the vector table' and set up the vector table entry for that
>location to point to this new routine. The number $3355 would be hard-coded
>into the 680x0 code. There would be many JP instructions in the Z80 program,
>and each would have its own routine in 680x0 code. Each Jp instruction would
>have a _different_ entry in the vector table.
That's a good idea, but it's not the way I do it.
>Alternatively, and the way I interpret your method [I could be wrong], you
>could use a less memory-hogging but slightly slower method [I would guess].
>In this method, the vector table would contain the same address for every
>occurrence of a JP instruction. Obviously the 680x0 code for that instruction
>would have to know how to look up the address to jump to, which would mean
>looking at the Z80 memory image (the couple of bytes after the 'JP'
>instruction). Presumably byte order is reversed as well, so that's an extra
>instruction.
Yep, that's it.
>Now that I think about it, the former method would probably give a marginal
>speed improvement for a _lot_ of effort. The 'patch' routine would be
>significantly slower in the this version because it would actually have to
>write a small piece of code, but the resulting code would surely be
>faster because all the instruction arguments (immediate data values, jump
>addresses and so on) would be hard coded.
I've been thinking about a method similar to your former method for
some time. I call it incremental translation, but Digital call it
something else (can't remember right now --- there was an article
about translation in one of the IEEE journals when Alpha was
announced). The main difference from your suggestion is that there is
no need to reserve space for 65536 fixed-size routines.
The idea is to translate as much code as possible to optimised native
680x0 code, link it to already-translated code, then execute it.
Memory for new 680x0 code is allocated on-the-fly and routines are
concatinated together tightly in memory. Optimisations would include
not bothering to set condition flags that are overwritten by the next
instruction. In fact the code generator could use any optimisation
methods used by good compilers (e.g, peep-hole optimisation, dynamic
register allocation).
As soon as we get to something that hasn't been translated yet, simply
return to the beginning of the algorithm (i.e, translate as much as
possible, link it to already-translated code, then execute it).
Examples of things that prevent us from translating everything at once
are:
o anything that loads the PC from data space
(e.g, indirect jump; return from subroutine)
o conditional branches (because 1 of the branches could
never be taken and followed by data)
Writes to absolute memory locations are easy. We can generate direct
code for them. Indirect writes, however, must check in a table
whether already-compiled code is being clobbered. In that case, a
segment of already-compiled code must be unlinked, invalidated, and
the memory it occupied freed.
Obviously it is necessary to maintain extensive tables. Every
reference of anything to anything-else must be tracked with both
forwards and backwards links. I think this can be done efficiently
with a couple of binary trees. Also there needs to be a 65536-element
map of what is data and what is code.
Anyway, I'm sure it would work. I just don't have time to do it.
>> ; code to handle JP
>> base+$0578 tst.w (a3)+
>> bmi.w selfmodifydetected1
>> tst.w (a3)+
>> bmi.w selfmodifydetected2
>> ...
>
>I'm not sure why self-modifying needs to be checked here. Maybe you could
>enlighten me. Surely it will be caught by invalidating the vector table when
>the write to memory occurs - or are you worrying about an instruction
>overwriting itself, e.g.
>
> label mov data,label+1
>
>or something (please excuse my pidgin 680x0) ?
Oops, sorry... <red face>. You're absolutely right and my example is
wrong. The JP instruction is perfectly ok without the check (and my
emulator doesn't check either). Let's try a different example.
; code to handle LDIR
base+$3008 tst.w (a3)+
bmi.w selfmodifydetected1
...
The check copes with self-modifying code which modifies the 2nd byte
of the opcode without touching the first. For a perfect emulation it
is required for all prefixed instructions. So far, the only real
examples I've found are some graphics routines in games that change
SET to RES, but there could be others that I've fixed without knowing.
Regards, Peter.
Alan Hart - amh15@... - University of Cambridge, UK