High-Level Cycle Accurate Original VIP CHIP-8 #

(Version 1.1, 2025-01-06, by Steffen “Gulrak” Schümann)

Normally CHIP-8 interpreters don’t strive to be cycle accurate, as the timing depends heavily on the variant. Still, it is possible to build a cycle accurate interpreter whose timing matches the original interpreter on a COSMAC VIP, that is, each opcode starts at the same point relative to the VIP frame timing as it would on a real system.

This document provides the opcode execution timing and information about the overall timing. It does not show the derivation of those values, but they were generated by following the execution paths of the original interpreter, adding up cycles on the way, and finding which opcodes have fixed times, which have variable times, and what the variable times depend on. I created this document from notes I made when I implemented this in an interpreter (possibly the first implementation overall?) in June 2023: the CHIP-8-STRICT core inside Cadmium. To verify that my implementation stays in sync with a VIP, I ran various programs on it side by side with a COSMAC VIP emulation executing the original interpreter.

Disclaimer: While keyboard input opcodes behave as close to the original as possible, they can’t access a real keyboard directly. In theory, pressing a key at the same point in a frame on a real COSMAC VIP and in the emulation will therefore still not lead to the same recognition time, as the operating system’s event system imposes its own granularity constraints.

Also, while I tried hard and checked with a bunch of programs, there still can be bugs, so please don’t implement critical technology with it, or use it at your own risk. But if you find a bug please let me know, so it can be fixed here.

Startup #

The interpreter starts executing the loaded program 3250 cycles in, as that is how long the monitor startup (checking the RAM size) and the initialization of the interpreter take. If one also wants to count CHIP-8 instructions, the count starts at two, as the four bytes before 512/0x200 are two already executed opcodes that clear (00E0) and enable (004B) the display.

Frame Timing #

Each instruction takes some time in machine cycles to execute. The VIP timing is influenced by the interrupt/DMA timing, so the concept of frame cycles is important. A frame has 3668 machine cycles. Within it runs the interrupt routine that is responsible for the timers and the video display. The cycle time of the next interrupt is calculated, using integer division, by:

((machineCycles + 2572) / 3668) * 3668 + 1096

If the current machine cycle count is greater than or equal to the next interrupt time, an interrupt call is to be simulated. It takes

1832 + (soundTimer ? 4 : 0) + (delayTimer ? 8 : 0)

machine cycles to complete and in this “time” the timers need to be decremented and the screen updated.
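The two expressions above can be written down as small helper functions. This is a minimal sketch; the function names are hypothetical, and since the text does not spell out whether the timer values before or after the decrement feed the interrupt cost, the sketch simply takes them as arguments.

```python
FRAME_CYCLES = 3668  # machine cycles per frame


def next_interrupt(machine_cycles: int) -> int:
    """Cycle time of the next interrupt (integer division on purpose)."""
    return ((machine_cycles + 2572) // FRAME_CYCLES) * FRAME_CYCLES + 1096


def interrupt_cycles(sound_timer: int, delay_timer: int) -> int:
    """Machine cycles one simulated interrupt call takes."""
    return 1832 + (4 if sound_timer else 0) + (8 if delay_timer else 0)


# At program start (3250 cycles in) the next interrupt is due at cycle 4764.
print(next_interrupt(3250), interrupt_cycles(0, 0))
```

Note that the result of next_interrupt() is always strictly greater than its argument, so recalculating it after an interrupt has been handled always yields a time in the future.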

NOTE: When incrementing the machine cycles, a check for the interrupt needs to be done and the interrupt inserted. The pseudo-code always uses a function addCycles() to increment machine cycles, and it is assumed that this function does the incrementing and handles interrupts. If your emulator needs an outer frame control (so you trigger it once per frame to execute one frame), you need additional logic to return to the outer emulation loop and resume on the next frame (typically by decrementing the PC to stay at the opcode and using an additional waiting state to know where to continue inside the opcode). If you instead run the emulation in its own thread that signals the frontend when a frame has ended so it can update the screen (and possibly push audio data), or if you use a language that can yield, you might be able to do that from inside addCycles() instead. Either approach is outside the scope of this document, as too many ways to implement this are possible.
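One possible shape for such an addCycles() is sketched below. It is only an illustration under assumptions: the class and attribute names are hypothetical, the interrupt cost is taken from the timer values before they are decremented, and the frame handover to a frontend is reduced to a counter.

```python
FRAME_CYCLES = 3668


class VipTiming:
    """Hypothetical helper; names are illustrative only."""

    def __init__(self, start_cycles: int = 3250):
        self.machine_cycles = start_cycles  # program start after VIP startup
        self.delay_timer = 0
        self.sound_timer = 0
        self.frames = 0  # number of simulated interrupts so far
        self.next_interrupt = self._calc_next_interrupt()

    def _calc_next_interrupt(self) -> int:
        return ((self.machine_cycles + 2572) // FRAME_CYCLES) * FRAME_CYCLES + 1096

    def add_cycles(self, cycles: int) -> None:
        self.machine_cycles += cycles
        if self.machine_cycles >= self.next_interrupt:
            # Assumption: cost is based on the timer values before decrementing.
            self.machine_cycles += (1832 + (4 if self.sound_timer else 0)
                                    + (8 if self.delay_timer else 0))
            if self.delay_timer:
                self.delay_timer -= 1
            if self.sound_timer:
                self.sound_timer -= 1
            self.frames += 1  # a real emulator would present the screen here
            self.next_interrupt = self._calc_next_interrupt()


timing = VipTiming()
timing.add_cycles(80)    # e.g. a 1nnn jump, no interrupt yet
timing.add_cycles(1500)  # crosses cycle 4764, so one interrupt is inserted
print(timing.machine_cycles, timing.frames)
```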

Opcodes #

For this type of CHIP-8 emulation, what matters is when each opcode starts, not the cycle-by-cycle behavior inside it. The exceptions are 00E0, Dxyn and Fx0A, which are described in more detail later.

NOTE on fetch and decode: All cycle numbers are in machine cycles. The Detailed Cycles column lists the fetch and decode time separately. The first summand is the common fetch and decode time, which is 40 machine cycles for every opcode in the 0nnn range and 68 machine cycles for all others. Fnnn opcodes use a second dispatch stage, which adds another 4 machine cycles. The last summand is the actual opcode execution time. The Total Cycles column gives the sum of all of them.

Opcode	Detailed Cycles	Total Cycles	Notes
`0nnn`	-	-	undefined, as it depends on the machine code called; this needs backend emulation, and only COSMAC VIP and DREAM6800 support it
`00E0`	40 + 3078	3118	see below for an additional hint
`00EE`	40 + 10	50
`1nnn`	68 + 12	80
`2nnn`	68 + 26	94
`3xnn`	68 + 10	78	+4 if skipping
`4xnn`	68 + 10	78	+4 if skipping
`5xy0`	68 + 14	82	+4 if skipping
`6xnn`	68 + 6	74
`7xnn`	68 + 10	78
`8xy0`	68 + 12	80
`8xy1`	68 + 44	112
`8xy2`	68 + 44	112
`8xy3`	68 + 44	112
`8xy4`	68 + 44	112
`8xy5`	68 + 44	112
`8xy6`	68 + 44	112
`8xy7`	68 + 44	112
`8xyE`	68 + 44	112
`9xy0`	68 + 14	82	+4 if skipping
`Annn`	68 + 12	80
`Bnnn`	68 + 22	90	+2 on `PC` high byte change
`Cxnn`	68 + 36	104
`Dxyn`	*	*	see below
`Ex9E`	68 + 14	82	+4 if skipping
`ExA1`	68 + 14	82	+4 if skipping
`Fx07`	68 + 4 + 6	78
`Fx0A`	*	*	see below
`Fx15`	68 + 4 + 6	78
`Fx18`	68 + 4 + 6	78
`Fx1E`	68 + 4 + 12	84	+6 on `I` high byte change
`Fx29`	68 + 4 + 16	88
`Fx33`	68 + 4 + 80	152	`+ (digit sum) * 16`
`Fx55`	68 + 4 + 14	86	`+ 14 * (number of registers)`
`Fx65`	68 + 4 + 14	86	`+ 14 * (number of registers)`
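As an illustration of how the table rows combine, the following sketch computes the totals for the skip opcodes and for Fx55/Fx65. The function names are hypothetical. It reads the Fx55/Fx65 note as 14 additional cycles per register and assumes that the number of registers is x + 1 (V0 up to and including Vx); both are interpretations of the table, not statements taken from it.

```python
def fetch_decode_cycles(opcode: int) -> int:
    """Common fetch and decode time, including the second Fnnn dispatch stage."""
    top = opcode >> 12
    if top == 0x0:
        return 40
    if top == 0xF:
        return 68 + 4
    return 68


def skip_opcode_cycles(opcode: int, skipping: bool) -> int:
    """Total for 3xnn, 4xnn, 5xy0, 9xy0, Ex9E and ExA1."""
    execute = 10 if (opcode >> 12) in (0x3, 0x4) else 14
    return fetch_decode_cycles(opcode) + execute + (4 if skipping else 0)


def load_store_cycles(opcode: int) -> int:
    """Total for Fx55 and Fx65, assuming x + 1 registers are transferred."""
    x = (opcode >> 8) & 0xF
    return fetch_decode_cycles(opcode) + 14 + 14 * (x + 1)


print(skip_opcode_cycles(0x3A00, skipping=True))  # 3xnn that skips
print(load_store_cycles(0xF255))                  # stores V0..V2
```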

00E0: The Expensive Clear Screen #

If one emulates the clear screen opcode following the simple pattern of the other opcodes, the visible deletion of the screen content typically happens at least one frame too early. So for this opcode it is important to first increment the cycles by calling addCycles(3118), which emits the current frame, and only then erase the screen content; otherwise flickering could be much worse than on the real machine.
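The ordering can be illustrated with a stripped-down sketch. Everything here is hypothetical scaffolding: the Machine class, the presented list standing in for a frontend, a 256 byte screen buffer (64x32 pixels), and timers assumed to be 0 so that the interrupt costs a flat 1832 cycles.

```python
class Machine:
    def __init__(self):
        self.screen = bytearray(256)  # 64x32 pixels, 8 bytes per row
        self.presented = []           # frames handed to a hypothetical frontend
        self.machine_cycles = 0
        self.next_interrupt = 1096

    def add_cycles(self, cycles: int) -> None:
        self.machine_cycles += cycles
        if self.machine_cycles >= self.next_interrupt:
            self.presented.append(bytes(self.screen))  # emit the current frame
            self.machine_cycles += 1832                # timers assumed to be 0
            self.next_interrupt = (
                ((self.machine_cycles + 2572) // 3668) * 3668 + 1096)

    def op_00e0(self) -> None:
        self.add_cycles(3118)        # account for the time first ...
        self.screen[:] = bytes(256)  # ... then erase the screen content


m = Machine()
m.screen[0] = 0xFF
m.op_00e0()
# The frame emitted during the opcode still shows the old content.
print(m.presented[0][0], m.screen[0])
```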

Dxyn: The Complicated One #

The timing of Dxyn is quite complex. It is made up of preparation time, waiting time and drawing time. We look at all of these:

Preparation Time #

Dxyn first draws the sprite into a buffer that is two bytes wide and sixteen rows high. The time needed for this is 136 + lines * (46 + 20 * (x&7)), so it depends heavily on the amount of shifting needed.

prepareCycles = 68 + 68 + lines * (46 + 20 * (x&7))
while prepareCycles > 0:
    cyclesLeft = cycles left in frame
    addCycles(cyclesLeft)
    prepareCycles -= cyclesLeft

The first 68 cycles are the fetch and decode part for Dxyn.
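To get a feeling for how strongly the shift amount influences the preparation time, the formula can be evaluated for a few cases. This is a small sketch with a hypothetical function name; x is taken to be the horizontal pixel coordinate and lines the sprite height.

```python
def dxyn_prepare_cycles(x: int, lines: int) -> int:
    """Fetch/decode (68) plus preparation of the shifted sprite buffer."""
    return 68 + 68 + lines * (46 + 20 * (x & 7))


# Byte-aligned sprites are cheap, a shift by 7 pixels is the worst case.
for x, lines in ((0, 1), (8, 5), (3, 5), (7, 15)):
    print(x, lines, dxyn_prepare_cycles(x, lines))
```

The worst case of a 15 line sprite shifted by 7 pixels comes to 2926 machine cycles, which is a large part of the 3668 cycles of a frame.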

Drawing Time #

The time needed to copy the sprite into the screen buffer is then calculated during drawing:

In pseudo-code:

drawingCycles = 26
for each line not clipped:
    col1 = col2 = 0
    if first byte of line collides:
        col1 = 4
    if second byte of line collides:
        col2 = 4
    drawingCycles += (34 + col1 + (x < 56 ? 16 : 0) + col2)
addCycles(drawingCycles)

The collision indicators col1 and col2 in there depend on emulating as if the sprite bytes are actually shifted into a two byte buffer and each byte is then XORed to the screen memory. This can still be done by keeping track of pixel offset, but it might be easiest to actually implement the byte splitting.
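A byte-splitting variant could look like the following sketch. It rests on several assumptions that are not stated above: the screen is a 256 byte buffer with 8 bytes per row, the start coordinates are masked to the 64x32 screen, lines below the bottom edge are clipped, and for x >= 56 the second byte is not written at all, so col2 stays 0. The function name and return value are hypothetical.

```python
def dxyn_draw(screen: bytearray, x: int, y: int, sprite: bytes):
    """XOR a sprite into the screen; returns (drawing_cycles, collided)."""
    x &= 63
    y &= 31
    shift = x & 7
    cycles = 26
    collided = False
    for row, data in enumerate(sprite):
        if y + row >= 32:
            break  # remaining lines are clipped at the bottom
        left = data >> shift                   # part landing in the first byte
        right = (data << (8 - shift)) & 0xFF   # part spilling into the second
        offset = (y + row) * 8 + (x >> 3)
        col1 = 4 if screen[offset] & left else 0
        screen[offset] ^= left
        col2 = 0
        second_byte = 0
        if x < 56:  # assumption: second byte only exists left of the last column
            col2 = 4 if screen[offset + 1] & right else 0
            screen[offset + 1] ^= right
            second_byte = 16
        cycles += 34 + col1 + second_byte + col2
        collided = collided or bool(col1 or col2)
    return cycles, collided


screen = bytearray(256)
print(dxyn_draw(screen, 0, 0, bytes([0xFF])))  # first draw, no collision
print(dxyn_draw(screen, 0, 0, bytes([0xFF])))  # second draw erases and collides
```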

Fx0A: Waiting for a Key #

Waiting for a key depends on key input, so there is no fixed timing. It has, of course, a fetch and decode prefix of 68 + 4 machine cycles. After that it first loops until a key is pressed. The first key it sees as pressed is then the one whose release it waits for, constantly setting the sound timer to 4 in that release wait loop. When the key is released, it waits for the sound timer to run down to 0 and then takes at most 10 machine cycles after the interrupt that decrements the sound timer to 0 before it continues. As the outer influence (key activity) hugely dominates and randomizes its timing, in practice it will not matter much whether one emulates those 10 cycles or not, but they are there.
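The three phases can be modeled as a small state machine. The sketch below is a simplification under assumptions: the state names and the function are hypothetical, cycle accounting is left out entirely, and picking the lowest pressed key number stands in for "the first key it sees as pressed", since the scan order is not described here.

```python
def fx0a_step(state: str, key, pressed: set, sound_timer: int):
    """Advance the key wait by one poll; returns (state, key, sound_timer)."""
    if state == "wait_press":
        if pressed:
            # Assumption: lowest key number wins when several are down.
            return "wait_release", min(pressed), sound_timer
        return state, key, sound_timer
    if state == "wait_release":
        if key in pressed:
            return state, key, 4  # keep the sound timer at 4 while held
        return "wait_sound", key, sound_timer
    if state == "wait_sound":
        if sound_timer == 0:
            return "done", key, 0  # Vx would now receive the key
        return state, key, sound_timer
    return state, key, sound_timer


step = ("wait_press", None, 0)
for keys in (set(), {5}, {5}, set()):
    step = fx0a_step(step[0], step[1], keys, step[2])
print(step)  # key 5 was released, now waiting for the sound timer
```

In a full emulator the sound timer would be decremented by the interrupt handling between polls, which is what eventually moves the state from wait_sound to done.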

Fx33: BCD Conversion #

The note in the table mentions a (digit sum). This means the sum of the decimal digits of the conversion result, so if the number is 123 then the digit sum is 1 + 2 + 3 = 6.
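Put into code, the total for Fx33 could be computed as in this sketch. It reads the table note as 16 additional machine cycles per unit of the digit sum on top of the 152 base cycles; the function name is hypothetical.

```python
def fx33_cycles(value: int) -> int:
    """Total machine cycles for Fx33 converting an 8-bit value to BCD."""
    digit_sum = value // 100 + (value // 10) % 10 + value % 10
    return 68 + 4 + 80 + digit_sum * 16


for value in (0, 123, 255):
    print(value, fx33_cycles(value))
```

Under that reading, converting 123 takes 152 + 6 * 16 = 248 machine cycles.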

Acknowledgements #

This work, and a lot of my other work related to CHIP-8, builds on the work of others, and I want to thank them for their groundwork that made my life so much easier:

Gooitzen S. van der Wal and J. W. Wentworth, who analyzed and documented the working of the CHIP-8 interpreter on the COSMAC VIP and the operating system in its 512 byte ROM in VIPER Volume I, Issue 2, August 1978 and VIPER Volume I, Issue 3, September 1978. (And thanks to Matt Mikolay for putting up the scans for non-commercial use.)

Laurence Scotford for his work on Chip-8 on the COSMAC VIP, where he explains in detail the inner workings of the original CHIP-8 interpreter as published for the COSMAC VIP. He also did cycle analyses. However, because of some inaccuracies and because, e.g., Dxyn was not detailed enough, I recalculated the cycles for all opcodes myself. Admittedly, I would not have started the endeavor of making a cycle accurate high-level emulated VIP CHIP-8 if it weren’t for his work.

And all the people I had fruitful discussions with, on the Emulation Development Discord.

Changelog #

1.1, 2025-01-06 #

Added fetch and decode time details.

1.0, 2025-01-05 #

Initial Publish