I’ve been looking at an instruction-level simulator for a processor with a simple instruction set. The first version was written with an eye towards being turned into a hardware-level emulator with basic instruction-level timing. This led to a simple switch-based dispatch loop:
uint8_t op = pc_cache.read(pc++);
switch (op) {
case 0: // LOADI, #0
    acc = 0;
    break;
...
}
This achieved a quite reasonable 17.4 million instructions/s on a Core 2 Duo P8600.
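For concreteness, here is a minimal self-contained sketch of this style of loop. The opcodes, the toy program, and the flat memory vector are all invented for illustration; they stand in for the real instruction set and the pc_cache object, which aren't shown here in full.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Invented toy program: LOADI #0, INC, INC, INC, HALT.
    std::vector<uint8_t> mem = {0, 1, 1, 1, 255};
    uint32_t pc = 0;
    uint32_t acc = 0;

    for (;;) {
        uint8_t op = mem[pc++];      // stands in for pc_cache.read(pc++)
        switch (op) {
        case 0:   // LOADI, #0
            acc = 0;
            break;
        case 1:   // INC (hypothetical opcode)
            acc += 1;
            break;
        case 255: // HALT (hypothetical opcode)
            std::printf("acc = %u\n", acc);
            return 0;
        default:  // unknown opcode
            return 1;
        }
    }
}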
When you look at the assembled code there is quite a bit of overhead, such as:
- Checking the value in the switch statement for a valid range
- The lookup and then indirect jump in the switch
- The jump caused by the final break
Threaded code can get rid of quite a bit of this. The first step is to switch to computed gotos, which changes the code to:
static const void* targets[] = { &&l0, &&l1, &&l2... };

top:
    uint8_t op = pc_cache.read(pc++);
    goto *targets[op];
l0: // LOADI, #0
    acc = 0;
    goto top;
...
This gets rid of the range checking but keeps the final jump. Since we don’t care about code size, we can pull the dispatch into the tail of every instruction. This removes the jump back to the top, and as a bonus gives each opcode its own indirect branch, which gives the branch predictor a separate prediction point per instruction:
static const void* targets[] = { &&l0, &&l1, &&l2... };

top:
    uint8_t op = pc_cache.read(pc++);
    goto *targets[op];
l0: // LOADI, #0
    acc = 0;
    op = pc_cache.read(pc++);
    goto *targets[op];
...
This gets the simulator up to 26.7 million ops/s.
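Putting both changes together on the same toy machine gives something like the following. Computed goto is a GCC/Clang extension rather than standard C++, and as before the opcodes are invented; note also that the dispatch has no range check, so a real table needs an entry for every possible opcode byte.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Same invented program as before, with HALT renumbered to 2 so the
    // opcodes index the table densely.
    std::vector<uint8_t> mem = {0, 1, 1, 1, 2};
    uint32_t pc = 0;
    uint32_t acc = 0;

    static const void* targets[] = { &&l_loadi, &&l_inc, &&l_halt };

    uint8_t op = mem[pc++];
    goto *targets[op];

l_loadi: // LOADI, #0
    acc = 0;
    op = mem[pc++];
    goto *targets[op];
l_inc:   // INC
    acc += 1;
    op = mem[pc++];
    goto *targets[op];
l_halt:  // HALT
    std::printf("acc = %u\n", acc);
    return 0;
}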
Next is to inline the cache access code. As the simulator is written in C++, this just means pulling the definition of Cache::read() up into the class body so the compiler can inline it at each call site. Nice and easy, and it brings us up to 40.2 M/s.
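The shape of that change is roughly the following; the Cache internals here are invented, since the real class isn't shown. The point is just that a member function defined inside the class body is implicitly inline, so the optimizer can fold the read into each handler instead of emitting a call into another translation unit.

#include <array>
#include <cstdint>

class Cache {
public:
    // Defined in the class body, so implicitly inline and visible to the
    // optimizer at every call site. The direct-mapped lookup below is a
    // placeholder for the real implementation.
    uint8_t read(uint32_t addr) const {
        return data_[addr % data_.size()];
    }

private:
    std::array<uint8_t, 64 * 1024> data_{};
};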
One thing that made these changes easier was code generation. The instruction set is described in Python and converted into C++ using a template, meaning that these structural changes can be made quickly and easily.
The next step is to compile the instructions into something even easier to dispatch. I’m hoping to get things twice as fast again.
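One plausible shape for that step, though this is my guess rather than anything stated above, is classic direct threading: pre-decode the byte stream once into an array of handler addresses, so dispatch no longer needs the opcode fetch or the table lookup:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> mem = {0, 1, 1, 1, 2};  // same toy program as above

    static const void* targets[] = { &&l_loadi, &&l_inc, &&l_halt };

    // "Compile" pass: translate each opcode byte into the address of its
    // handler. This runs once, before execution starts.
    std::vector<const void*> code(mem.size());
    for (std::size_t i = 0; i < mem.size(); ++i)
        code[i] = targets[mem[i]];

    const void** ip = code.data();
    uint32_t acc = 0;
    goto *(*ip++);

l_loadi: // LOADI, #0
    acc = 0;
    goto *(*ip++);
l_inc:   // INC
    acc += 1;
    goto *(*ip++);
l_halt:  // HALT
    std::printf("acc = %u\n", acc);
    return 0;
}

With this scheme each dispatch is a load plus an indirect jump, which is about as cheap as interpreted dispatch gets without generating machine code.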