rep lodsb

The 286's internal registers

The LOADALL structure as described by Intel is this:

Physical Address (Hex)    Associated CPU Register
    800-805                     None
    806-807                     MSW
    808-815                     None
    816-817                     TR
    818-819                     Flag word
    81A-81B                     IP
    81C-81D                     LDT
    81E-81F                     DS
    820-821                     SS
    822-823                     CS
    824-825                     ES
    826-827                     DI
    828-829                     SI
    82A-82B                     BP
    82C-82D                     SP
    82E-82F                     BX
    830-831                     DX
    832-833                     CX
    834-835                     AX
    836-83B                     ES descriptor cache
    83C-841                     CS descriptor cache
    842-847                     SS descriptor cache
    848-84D                     DS descriptor cache
    84E-853                     GDTR
    854-859                     LDT descriptor cache
    85A-85F                     IDTR
    860-865                     TSS descriptor cache
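
To make the table easier to work with, here is a minimal Python sketch that picks the named fields out of a 102-byte image of the area 800h-865h. The names WORD_FIELDS, CACHE_FIELDS and parse_image are my own for illustration, and it assumes you already have such an image as a bytes object (how you obtain it is up to you). Words are stored little-endian, as everywhere on x86.

```python
import struct

# Offsets relative to physical address 800h, copied from the table above.
WORD_FIELDS = {
    "MSW": 0x06, "TR": 0x16, "FLAGS": 0x18, "IP": 0x1A, "LDT": 0x1C,
    "DS": 0x1E, "SS": 0x20, "CS": 0x22, "ES": 0x24,
    "DI": 0x26, "SI": 0x28, "BP": 0x2A, "SP": 0x2C,
    "BX": 0x2E, "DX": 0x30, "CX": 0x32, "AX": 0x34,
}

# Six-byte descriptor cache entries, also relative to 800h.
CACHE_FIELDS = {
    "ES": 0x36, "CS": 0x3C, "SS": 0x42, "DS": 0x48,
    "GDTR": 0x4E, "LDT": 0x54, "IDTR": 0x5A, "TSS": 0x60,
}

IMAGE_SIZE = 0x66  # 800h through 865h inclusive


def parse_image(image: bytes) -> dict:
    """Split a LOADALL image into 16-bit registers and raw cache entries."""
    if len(image) != IMAGE_SIZE:
        raise ValueError(f"expected {IMAGE_SIZE} bytes, got {len(image)}")
    words = {name: struct.unpack_from("<H", image, off)[0]
             for name, off in WORD_FIELDS.items()}
    caches = {name: bytes(image[off:off + 6])
              for name, off in CACHE_FIELDS.items()}
    return {"words": words, "caches": caches}
```

The areas marked "None" are deliberately left out here, since the table gives them no name.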

The normally visible registers aren't of much interest. That includes the MSW and flags: LOADALL can't change any of the reserved bits, and can't clear the protected mode bit once it has been set. There was even a bug in early steppings of the chip, where the word preceding the MSW (at 804h) would be mistakenly loaded into that register during a memory wait state. And if bit 0 happened to be set by this, it could not be cleared again!

What remains are the descriptor caches and those mysterious gaps. All of these were previously write-only, but with STOREALL we can look at what gets loaded into them under various conditions.

Descriptor caches

The term is somewhat misleading from a modern perspective, but this is what Intel called them.

They can be better understood as being the part of the segment registers which actually matters for the addressing and protection logic. As far as that unit is concerned, the programmer-visible segment values might as well contain any random 16 bits.

Only when a segment register gets loaded do the value (and the operating mode) make any difference. There are not that many opcodes that do this, and they each have two entries in the instruction decoder PLA¹, so that they can be directed to different microcode entry points depending on the mode².

The layout of the 8 internal segment registers (including GDTR and IDTR) is the same:

3 BYTE base address
1 BYTE access rights
1 WORD limit
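
As a small illustration, this Python sketch packs and unpacks one such 6-byte entry. It assumes the fields appear in memory in the order listed above and that multi-byte values are little-endian; DescriptorCache, pack_cache and unpack_cache are names I made up for this example.

```python
from typing import NamedTuple


class DescriptorCache(NamedTuple):
    base: int    # 24-bit physical base address
    access: int  # access rights byte
    limit: int   # 16-bit segment limit


def unpack_cache(raw: bytes) -> DescriptorCache:
    """Decode a 6-byte entry: 3 bytes base, 1 byte access rights, 1 word limit."""
    if len(raw) != 6:
        raise ValueError("a descriptor cache entry is exactly 6 bytes")
    base = int.from_bytes(raw[0:3], "little")
    access = raw[3]
    limit = int.from_bytes(raw[4:6], "little")
    return DescriptorCache(base, access, limit)


def pack_cache(entry: DescriptorCache) -> bytes:
    """Encode an entry back into the 6-byte in-memory form."""
    return (entry.base.to_bytes(3, "little")
            + bytes([entry.access])
            + entry.limit.to_bytes(2, "little"))
```

For example, pack_cache(DescriptorCache(0xFF0000, 0x82, 0xFFFF)) gives the bytes 00 00 FF 82 FF FF.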

By making every segment load cause a protection fault, and using LOADALL to update the descriptor caches, an operating system could in theory emulate the real mode behaviour. But performance would be bad, since "large model" programs typically load segment registers every time they dereference a pointer.

Also, there was no paging on the 286, so the only address translation possible would be to move the base of the emulated address space. Every "virtual machine" would have to be in its own contiguous memory block.

More useful is the ability to load any arbitrary base address for the segment registers without entering protected mode. Some versions of Microsoft's HIMEM.SYS did this to copy data between real and extended memory.

This new segment base would only be in effect until the next time that segment register is reloaded. That could happen unexpectedly if your code got interrupted, but there was a clever trick to detect this situation: also set a non-standard base of CS, so that when the interrupt handler returned it would go to somewhere else in your code (since CS would have been reloaded to its normal base). That way, a REP MOVS instruction could run with interrupts enabled and be restarted.
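
To see why an arbitrary base matters, here is a rough Python model of the address calculation. It is a simplification under my own assumptions: it ignores expand-down segments and access rights, and the function names are hypothetical. A normal real-mode segment load can only produce bases of the form segment * 16, while a base loaded with LOADALL can be any 24-bit value.

```python
ADDRESS_MASK = 0xFFFFFF  # the 286 has 24 address lines


def real_mode_base(segment: int) -> int:
    """Base that an ordinary real-mode segment load puts into the cache."""
    return (segment & 0xFFFF) << 4


def physical_address(base: int, offset: int, limit: int = 0xFFFF) -> int:
    """Base plus offset, with a simple limit check (expand-up segments only)."""
    if offset > limit:
        raise ValueError("offset beyond segment limit (the CPU would fault)")
    return (base + offset) & ADDRESS_MASK


# Highest address reachable through a normally loaded real-mode segment:
top_normal = physical_address(real_mode_base(0xFFFF), 0xFFFF)

# With a LOADALL-supplied base, any location in the 16 MB space is reachable:
extended = physical_address(0x200000, 0x1234)
```

Here top_normal comes out as 10FFEFh, just under 64 KB above the 1 MB mark, while extended is 201234h.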

¹ basically a ROM, addressed by the opcode and mode bits, with some of them being ignored. More on this at the end!

² because this decoding can happen while a previous instruction is still executing, the LMSW instruction used to enter protected mode should be followed by a (near) jump so that the decoded instruction queue is flushed.

The access rights byte

bit
  7  : valid
6-5  : DPL
  4  : ignored?
  3  : code segment
  2  : expand-down / conforming
  1  : writable / readable
  0  : accessed (only set on pm descriptor load)

In real mode, every segment register load sets this to 82h. This is the value for a writable ring 0 data segment, except with bit 4 cleared (which seems to have no effect). The value for CS is the same, so it is also writable - using LOADALL, it can be made expand-down as well.

When loaded in protected mode, the byte will match the descriptor table entry, and bit 0 (accessed) will always be set.
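
The bit layout above can be decoded with a few masks. This is only a sketch of the table as I have described it; decode_access is a made-up helper name, and bit 4 is reported as-is since its function is unclear.

```python
def decode_access(ar: int) -> dict:
    """Decode a descriptor cache access rights byte into named fields."""
    is_code = bool(ar & 0x08)
    fields = {
        "valid": bool(ar & 0x80),
        "dpl": (ar >> 5) & 0x03,
        "bit4": bool(ar & 0x10),   # apparently ignored
        "code": is_code,
        "accessed": bool(ar & 0x01),
    }
    # Bits 2 and 1 change meaning depending on the code segment bit.
    if is_code:
        fields["conforming"] = bool(ar & 0x04)
        fields["readable"] = bool(ar & 0x02)
    else:
        fields["expand_down"] = bool(ar & 0x04)
        fields["writable"] = bool(ar & 0x02)
    return fields
```

Decoding the real-mode value 82h gives a valid, writable, expand-up data segment with DPL 0 and the accessed bit clear.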

The Current Privilege Level (CPL) is always determined by the DPL field of the stack segment.

If the valid bit on a normal segment (or LDTR) is clear, any access causes a protection fault. For GDTR, IDTR and TSS the access rights byte does not exist, and reads as FFh. LDTR only has bit 7, with the others reading as set.

While TSS can't be marked as invalid, its limit is checked just as for every other segment.

Temporary registers

There are 10 registers which haven't been described so far. The values loaded into them don't have any effect, but they are used by the microcode as places for temporary data.

Some documents about early SMM on the 386 give the names "tmpa", "tmpb" ... "tmph", as well as "tst" and "idx". Not very meaningful, and the order in which these registers would appear on the 286 isn't clear since it could be the reverse. A diagram in the 286 patent shows similar names.

I will just refer to them as X0 through X9, in the order that they appear in memory.
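
For reference, this naming can be turned into addresses with a short Python sketch. It assumes that the two gaps in the LOADALL table (800h-805h and 808h-815h) consist entirely of word-sized registers, which gives exactly ten slots; the helper name is my own.

```python
# The two unnamed areas from the LOADALL table (inclusive byte ranges).
GAPS = [(0x800, 0x805), (0x808, 0x815)]


def temp_register_addresses() -> dict:
    """Map X0..X9 to the physical address of each word, in memory order."""
    addresses = []
    for first, last in GAPS:
        addresses.extend(range(first, last + 1, 2))
    return {f"X{i}": addr for i, addr in enumerate(addresses)}


TEMP_ADDRESSES = temp_register_addresses()
```

Under that assumption X0 to X2 sit below the MSW (X2 being the word at 804h), and X3 to X9 occupy 808h through 814h.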

One of the first things I tried was to test whether all of these can in fact be loaded with arbitrary values. Only two couldn't, and that is because they are used by the LOADALL instruction itself (but interestingly not by STOREALL): X1, for some reason, gets the access rights of one of the segment descriptors (usually ES? dependent on some random timing?), and X8 is used as an address register. When LOADALL is finished, it will always have the value 864h, pointing to the last word loaded.

X1 generally seems to be used for protection checks: it gets loaded with either the word containing a descriptor's access rights or with the MSW (by floating point opcodes). It may be the one shown in the patent as connected to the "TEST PLA".

Simple instructions

These don't use any of the temporaries:

  • MOV (except to segment register in protected mode)
  • INC, DEC, shift/rotate and other ALU operations (except for immediate operands)
  • conditional jumps
  • other jumps (except inter-segment in prot. mode)
  • IN, OUT

If there is an immediate operand to ADD, SUB, etc., it will be loaded into X9. That register is also used for memory-to-memory compares (CMPS): it is loaded with the byte/word at ES:[DI].

Unlike the 8086 (and 186), other operands don't have to first be loaded into temporary ALU registers.

Remember to back up your data

Some instructions normally don't do much, but may cause exceptions, in which case all of their effects will need to be reverted. So they will either save the old register values in temporaries, or hold updated ones there until they can be sure to complete successfully:

  • PUSH, POP save the previous stack pointer in X3. REPeated string instructions do this too for some reason?
  • LOOP puts the decremented CX value in X2 (while unusual, like any conditional jump it can also go forward, potentially causing a CS limit violation)
  • CALL, RET put the new IP into X2 (near) or X9 (far), and SP into X8
  • probably many similar ones

Others

X0, X5 and X6 seem to be only used in protected mode, which I didn't test extensively. Task switching probably uses all 10, given how complicated an operation it is.

X4 is also mostly for protected mode and a specific purpose: it contains a copy of the error code pushed on some exceptions.

X7 gets the start offset of a floating point operand, with the segment limit in X8. This must be passed to the FPU interface, which acts somewhat like a DMA controller and contains its own base and limit registers.

State after reset

MSW    FFF0
FLAGS  0002  all defined bits are cleared

X2     002A  answer to life, the universe & everything?

ES     0000  base=000000 limit=FFFF attr=82
CS     F000  base=FF0000 limit=FFFF attr=82
SS     0000  base=000000 limit=FFFF attr=82
DS     0000  base=000000 limit=FFFF attr=82

IDTR         base=000000 limit=FFFF
TR     0000  -- -- --  descriptor cache preserved
LDTR   0000  -- -- --  descriptor cache preserved

All other registers keep the value they had before reset. The value of X2 is the most unusual thing here, and might even be an easter egg?
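
One detail in the table is worth spelling out: the CS base after reset is not what a real-mode load of the selector F000 would produce. A quick check in Python, using only the values listed above (the constant and function names are mine):

```python
RESET_CS_SELECTOR = 0xF000
RESET_CS_BASE = 0xFF0000
RESET_CS_LIMIT = 0xFFFF


def real_mode_base(selector: int) -> int:
    """Base that a normal real-mode load of this selector would give."""
    return (selector & 0xFFFF) << 4


def reset_code_window() -> tuple:
    """First and last physical address covered by CS right after reset."""
    return (RESET_CS_BASE, RESET_CS_BASE + RESET_CS_LIMIT)


normal_base = real_mode_base(RESET_CS_SELECTOR)  # 0F0000h, not FF0000h
window = reset_code_window()                     # FF0000h .. FFFFFFh
```

So execution starts in the top 64 KB of the 16 MB address space, and the base only drops to 0F0000h once CS is reloaded, for example by a far jump.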

Reading out all of the values immediately after power-up would require a custom ROM, instead of the cheap trick I used (overwriting the BIOS entry point in shadow RAM).

A normal BIOS will overwrite most of the registers during a cold boot, except for X4, X6 and X7. There are some differences between chips that can be observed in these:

  X4   X6   X7     chip
FFFF FFFF FFFF   N80L286-10/S  003D42S
0000 0000 0000   N80C286-12 ET 037E6KX
5EEC CD8D 8BFC   HARRIS CS80C286-16 F3360

The last one could be some kind of early CPUID?

After "triple-fault" shutdown

After shutdown & reset from protected mode, some information about what caused the exception is preserved:

X0   0040

X1   source of limit violation on first exception
     6CFF GDT
     6DFF LDT
     6EFF IDT
     6FFF TSS
     70FF ES
     71FF should be CS? never seen
     72FF should be SS? never seen
     73FF DS


X4   0000 (error code from double fault)

X8   selector if exception was from segment load, else zero

This might be a side effect of the exception handling rather than something intentional, because if the source is CS or SS, X1 gets loaded with the access rights for CS instead.

In real mode, only X4 will be set to zero.
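
The X1 values listed above can be put into a small lookup table, for example when inspecting a register dump taken after such a reset. This is just a sketch of my observations: the entries for CS and SS are the values I would expect but never saw, and describe_x1 is a hypothetical helper.

```python
# X1 values observed after shutdown, keyed by the word read back.
X1_SOURCES = {
    0x6CFF: "GDT",
    0x6DFF: "LDT",
    0x6EFF: "IDT",
    0x6FFF: "TSS",
    0x70FF: "ES",
    0x71FF: "CS (expected, never observed)",
    0x72FF: "SS (expected, never observed)",
    0x73FF: "DS",
}


def describe_x1(x1: int) -> str:
    """Name the source of the limit violation for a given X1 value."""
    # Anything else may be the CS access rights word (seen for CS/SS faults).
    return X1_SOURCES.get(x1, "unknown, possibly CS access rights")
```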

What more is there?

Honestly, most of this is not all that useful, but it offers some insights into how the chip works.

Using STOREALL to single step through code is only possible in ring 0, since it is privileged. The instruction is also 3 bytes long, and has to be inserted after every instruction you want to step through. I don't think there is any equivalent of the trap flag to trigger it automatically, since the ICE hardware wouldn't have needed it.

With custom hardware, it might be possible to blindly put opcodes on the bus and have them executed in ICE mode. Even if it works, it is somewhat unlikely that there are any extra registers that can be accessed this way. Another idea is to run one instruction again and again, starting from a defined state and each time resetting it at a different clock cycle to observe what happens when.

Getting the actual microcode bits certainly isn't possible without some very high-res photos of the die, and possibly delayering.

The best I could find is on visual6502.org. You can see how the ROM is split in two, with the upper half providing control lines and 3 x 6 bits in the lower half selecting which registers are put on the internal buses. Unfortunately, it seems too dense to get anything out of it, but I would be happy to be proven wrong by some deep learning or signal processing wizardry!

The entry point PLA is in the two blocks below, and is similar to the decoder in the 8086 and 186. Staring at this for a long time, I could make out some of it, enough to confirm that there are no more hidden opcodes: all of the remaining undefined ones match patterns that go to the same entry point, which must be the one that generates the "invalid opcode" exception.