A bit more work on the threaded version brings us up to 54.9 million instructions/s. The interesting thing is that, on average, 6.75 instructions are executed in a row before some type of branch. That seems high enough to justify a simple JIT that converts each branchless sequence into native code…
These changes, however, break the profiling and breakpoint support in the simulator. Shifting to a more complex but faster compiled version could bring both back in for free.
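As a rough illustration of how a figure like the 6.75 average above could be measured, here is a minimal sketch that walks an executed-instruction trace and averages the length of each run up to and including the branch that ends it. The opcode names, the `BRANCH_OPS` set, and the `average_run_length` helper are all hypothetical, since the simulator's real instruction set and counters are not shown here, and whether the branch itself counts toward the run is an assumption.

```python
# Hypothetical opcode names; the simulator's actual instruction set is not shown.
BRANCH_OPS = frozenset({"jmp", "beq", "bne", "call", "ret"})


def average_run_length(trace, branch_ops=BRANCH_OPS):
    """Average number of instructions executed per run, where a run ends
    at (and includes) a branch. A trailing run with no branch still counts."""
    runs = []
    current = 0
    for op in trace:
        current += 1
        if op in branch_ops:
            # The branch closes the current straight-line run.
            runs.append(current)
            current = 0
    if current:
        # Trace ended mid-run; keep the partial run.
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0


# Two runs: 4 instructions ending in "beq", then 6 ending in "jmp".
trace = ["ld", "add", "st", "beq", "ld", "ld", "add", "mul", "st", "jmp"]
print(average_run_length(trace))  # prints 5.0
```

The longer the average run, the more work a JIT can emit as one straight-line native block before it has to hand control back to the dispatcher.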