The LOADALL structure as described by Intel is this:
Physical Address (Hex) Associated CPU Register
800-805 None
806-807 MSW
808-815 None
816-817 TR
818-819 Flag word
81A-81B IP
81C-81D LDT
81E-81F DS
820-821 SS
822-823 CS
824-825 ES
826-827 DI
828-829 SI
82A-82B BP
82C-82D SP
82E-82F BX
830-831 DX
832-833 CX
834-835 AX
836-83B ES descriptor cache
83C-841 CS descriptor cache
842-847 SS descriptor cache
848-84D DS descriptor cache
84E-853 GDTR
854-859 LDT descriptor cache
85A-85F IDTR
860-865 TSS descriptor cache
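To make the table easier to work with, here is a minimal Python sketch that pulls the 16-bit registers out of a 102-byte memory image of this structure. The names `LOADALL_BASE`, `WORD_REGISTERS` and `read_word_registers` are made up for this illustration; only the addresses come from the table, and the little-endian word order is the usual x86 convention.

```python
import struct

LOADALL_BASE = 0x800  # physical address of the first byte in the table
LOADALL_SIZE = 0x66   # 0x800 through 0x865 inclusive, 102 bytes

# 16-bit registers and their physical addresses, copied from the table.
WORD_REGISTERS = {
    "MSW": 0x806, "TR": 0x816, "FLAGS": 0x818, "IP": 0x81A, "LDT": 0x81C,
    "DS": 0x81E, "SS": 0x820, "CS": 0x822, "ES": 0x824,
    "DI": 0x826, "SI": 0x828, "BP": 0x82A, "SP": 0x82C,
    "BX": 0x82E, "DX": 0x830, "CX": 0x832, "AX": 0x834,
}


def read_word_registers(image):
    """Return {name: value} for the 16-bit registers in a 102-byte image."""
    if len(image) != LOADALL_SIZE:
        raise ValueError("expected a 102-byte LOADALL image")
    return {
        name: struct.unpack_from("<H", image, addr - LOADALL_BASE)[0]
        for name, addr in WORD_REGISTERS.items()
    }
```

The image would be whatever was copied out of 800h-865h after a STOREALL; the gaps and descriptor caches are simply skipped here.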
The normally visible registers aren't of much interest. That includes the MSW and flags: LOADALL can't change any of the reserved bits, and can't clear the protected mode bit once it has been set. There was even a bug in early steppings of the chip, where the word preceding the MSW (at 804h) would be mistakenly loaded into that register during a memory wait state. And if bit 0 happened to be set by this, it could not be cleared again!
What remains are the descriptor caches and those mysterious gaps. All of these were previously write-only, but with STOREALL we can look at what gets loaded into them under various conditions.
Descriptor caches
The term is somewhat misleading from a modern perspective, but this is what Intel called them.
They can be better understood as being the part of the segment registers which actually matters for the addressing and protection logic. As far as that unit is concerned, the programmer-visible segment values might as well contain any random 16 bits.
It is only when a segment register gets loaded that the value (and operating mode) makes any difference. There are not that many opcodes that do this, and they each have two entries in the instruction decoder PLA¹, so that they can be directed to different microcode entry points depending on the mode².
The layout of the 8 internal segment registers (including GDTR and IDTR) is the same:
3 BYTE base address
1 BYTE access rights
1 WORD limit
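As a rough sketch of that layout, the following Python function splits one 6-byte entry into its three fields. `DescriptorCache` and `decode_descriptor_cache` are hypothetical names, and I am assuming the base and limit are stored least significant byte first, like everything else on x86.

```python
from typing import NamedTuple


class DescriptorCache(NamedTuple):
    base: int    # 24-bit physical base address
    access: int  # access rights byte
    limit: int   # 16-bit segment limit


def decode_descriptor_cache(raw):
    """Split a 6-byte entry: 3 bytes base, 1 byte access rights, 1 word limit."""
    if len(raw) != 6:
        raise ValueError("a descriptor cache entry is 6 bytes")
    return DescriptorCache(
        base=int.from_bytes(raw[0:3], "little"),  # assumed little-endian
        access=raw[3],
        limit=int.from_bytes(raw[4:6], "little"),
    )
```

For example, the bytes `00 00 FF 82 FF FF` would decode to base FF0000h, access rights 82h and limit FFFFh.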
By making every segment load cause a protection fault, and using LOADALL to update the descriptor caches, an operating system could in theory emulate the real mode behaviour. But performance would be bad, since "large model" programs typically load segment registers every time they dereference a pointer.
Also, there was no paging on the 286, so the only address translation possible would be to move the base of the emulated address space. Every "virtual machine" would have to be in its own contiguous memory block.
More useful is the ability to load any arbitrary base address for the segment registers without entering protected mode. Some versions of Microsoft's HIMEM.SYS did this to copy data between real and extended memory.
This new segment base would only be in effect until the next time that segment register is reloaded. That could happen unexpectedly if your code got interrupted, but there was a clever trick to detect this situation: also set a non-standard base of CS, so that when the interrupt handler returned it would go to somewhere else in your code (since CS would have been reloaded to its normal base). That way, a REP MOVS instruction could run with interrupts enabled and be restarted.
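To illustrate why an arbitrary base matters, here is a small Python sketch of the address arithmetic. The function names are invented for this example, and masking the result to 24 bits is my assumption, based on the 286 having a 24-bit physical address bus.

```python
ADDRESS_MASK = 0xFFFFFF  # assumption: addresses wrap at the 24-bit bus width


def real_mode_base(segment):
    """Base that an ordinary real-mode segment load puts in the cache."""
    return (segment & 0xFFFF) << 4


def physical_address(base, offset):
    """Address generated from a cached base and a 16-bit offset."""
    return (base + (offset & 0xFFFF)) & ADDRESS_MASK


# Highest address reachable with normal real-mode segment loads.
highest_real = physical_address(real_mode_base(0xFFFF), 0xFFFF)

# With a base loaded directly into the cache, any 24-bit address is
# reachable, for example the start of the third megabyte.
extended = physical_address(0x200000, 0x0000)
```

A normal segment load can never produce a base above FFFF0h, so everything past the first 64K or so of extended memory is out of reach without a directly loaded base.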
¹ basically a ROM, addressed by the opcode and mode bits, with some of them being ignored. More on this at the end!
² because this decoding can happen while a previous instruction is still executing, the LMSW instruction used to enter protected mode should be followed by a (near) jump so that the decoded instruction queue is flushed.
The access rights byte
bit
7 : valid
6-5 : DPL
4 : ignored?
3 : code segment
2 : expand-down / conforming
1 : writable / readable
0 : accessed (only set on pm descriptor load)
In real mode, every segment register load sets this to 82h. This is the value for a writable ring 0 data segment, except with bit 4 cleared (which seems to have no effect). The value for CS is the same, so it is also writable - using LOADALL, it can be made expand-down as well.
When loaded in protected mode, the byte will match the descriptor table entry, and bit 0 (accessed) will always be set.
The Current Privilege Level (CPL) is always determined by the DPL field of the stack segment.
If the valid bit on a normal segment (or LDTR) is clear, any access causes a protection fault. For GDTR, IDTR and TSS the access rights byte does not exist, and reads as FFh. LDTR only has bit 7, with the others reading as set.
While TSS can't be marked as invalid, the limit is checked like for every other segment.
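The bit layout above can be decoded with a few shifts and masks. This is only a sketch in Python; `decode_access_rights` and the dictionary keys are names I made up, and bit 2 and bit 1 are reported under neutral names because their meaning depends on whether bit 3 marks a code segment.

```python
def decode_access_rights(value):
    """Break an access rights byte into the fields listed above."""
    code = bool(value & 0x08)
    return {
        "valid": bool(value & 0x80),
        "dpl": (value >> 5) & 0x3,
        "bit4": bool(value & 0x10),      # appears to be ignored
        "code": code,
        # bit 2: conforming for code segments, expand-down for data
        "conforming" if code else "expand_down": bool(value & 0x04),
        # bit 1: readable for code segments, writable for data
        "readable" if code else "writable": bool(value & 0x02),
        "accessed": bool(value & 0x01),
    }
```

Feeding it 82h, the value set by every real-mode segment load, gives a valid, writable, DPL 0 data segment with the accessed bit clear.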
Temporary registers
There are 10 registers which haven't been described so far. The values loaded into them don't have any effect, but they are used by the microcode as places for temporary data.
Some documents about early SMM on the 386 give the names "tmpa", "tmpb" ... "tmph", as well as "tst" and "idx". Not very meaningful, and the order in which these registers would appear on the 286 isn't clear since it could be the reverse. A diagram in the 286 patent shows similar names.
I will just refer to them as X0 through X9, in the order that they appear in memory.
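For reference, this is how the X numbers map back to addresses, as a short Python sketch. It assumes each temporary is one word and that they fill the two gaps in the table (800h-805h and 808h-815h) in ascending order; `temp_address` is just an illustrative name.

```python
def temp_address(n):
    """Physical address of temporary Xn in the LOADALL structure."""
    if not 0 <= n <= 9:
        raise ValueError("temporaries are numbered X0 through X9")
    if n < 3:
        return 0x800 + 2 * n      # X0..X2 sit before the MSW
    return 0x808 + 2 * (n - 3)    # X3..X9 sit between the MSW and TR
```

So X2 is the word at 804h directly before the MSW, and X9 is the last word before TR.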
One of the first things I tried is to test if all of these can in fact be loaded with arbitrary values. Only two couldn't, and that is because they are used by the LOADALL instruction itself (but interestingly not by STOREALL): X1, for some reason, gets the access rights of one of the segment descriptors (usually ES? dependent on some random timing?), and X8 is used as an address register. When LOADALL is finished, it will always have the value 864h, pointing to the last word loaded.
X1 generally seems to be used for protection checks: it gets loaded with either the word containing a descriptor's access rights or with the MSW (by floating point opcodes). It may be the one shown in the patent as connected to the "TEST PLA".
Simple instructions
These don't use any of the temporaries:
- MOV (except to segment register in protected mode)
- INC, DEC, shift/rotate and other ALU operations (except for immediate operands)
- conditional jumps
- other jumps (except inter-segment in prot. mode)
- IN, OUT
If there is an immediate operand to ADD, SUB, etc., it will be loaded into X9. That register is also used for memory-to-memory compares (CMPS): it is loaded with the byte/word at ES:[DI].
Unlike the 8086 (and 186), other operands don't have to first be loaded into temporary ALU registers.
Remember to back up your data
Some instructions normally don't do much, but may cause exceptions in which case all of their effects will need to be reverted. So they will either save the old register values in temporaries, or hold updated ones there until they can be sure to complete successfully:
- PUSH, POP save the previous stack pointer in X3.
- REPeated string instructions do this too for some reason?
- LOOP puts the decremented CX value in X2 (while unusual, like any conditional jump it can also go forward, potentially causing a CS limit violation)
- CALL, RET put the new IP into X2 (near) or X9 (far), and SP into X8
- probably many similar ones
Others
X0, X5 and X6 seem to be only used in protected mode, which I didn't test extensively. Task switching probably uses all 10, given how complicated an operation it is.
X4 is also mostly for protected mode and a specific purpose: it contains a copy of the error code pushed on some exceptions.
X7 gets the start offset of a floating point operand, with the segment limit in X8. This must be passed to the FPU interface, which acts somewhat like a DMA controller and contains its own base and limit registers.
State after reset
MSW FFF0
FLAGS 0002 all defined bits are cleared
X2 002A answer to life, the universe & everything?
ES 0000 base=000000 limit=FFFF attr=82
CS F000 base=FF0000 limit=FFFF attr=82
SS 0000 base=000000 limit=FFFF attr=82
DS 0000 base=000000 limit=FFFF attr=82
IDTR base=000000 limit=FFFF
TR 0000 -- -- -- descriptor cache preserved
LDTR 0000 -- -- -- descriptor cache preserved
All other registers keep the value they had before reset. The value of X2 is the most unusual thing here, and might even be an easter egg?
Reading out all of the values immediately after power-up would require a custom ROM, instead of the cheap trick I used (overwriting the BIOS entry point in shadow RAM).
A normal BIOS will overwrite most of the registers during a cold boot, except for X4, X6 and X7. There are some differences between chips that can be observed in these:
X4 X6 X7 chip
FFFF FFFF FFFF N80L286-10/S 003D42S
0000 0000 0000 N80C286-12 ET 037E6KX
5EEC CD8D 8BFC HARRIS CS80C286-16 F3360
The last one could be some kind of early CPUID?
After "triple-fault" shutdown
After shutdown & reset from protected mode, some information about what caused the exception is preserved:
X0 0040
X1 source of limit violation on first exception
6CFF GDT
6DFF LDT
6EFF IDT
6FFF TSS
70FF ES
71FF should be CS? never seen
72FF should be SS? never seen
73FF DS
X4 0000 (error code from double fault)
X8 selector if exception was from segment load, else zero
This might be a side effect of the exception handling rather than something intentional, because if the source is CS or SS, X1 gets loaded with the access rights for CS instead.
In real mode, only X4 will be set to zero.
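The X1 values in the table follow a simple pattern, with the source encoded in the high byte. A small Python sketch of a decoder, with `SHUTDOWN_SOURCES` and `shutdown_source` as made-up names; the CS and SS entries are the guessed ones that I never actually observed.

```python
# High byte of X1 after shutdown, mapped to the source of the limit violation.
SHUTDOWN_SOURCES = {
    0x6C: "GDT", 0x6D: "LDT", 0x6E: "IDT", 0x6F: "TSS",
    0x70: "ES",
    0x71: "CS",  # guessed, never seen
    0x72: "SS",  # guessed, never seen
    0x73: "DS",
}


def shutdown_source(x1):
    """Return the table name for an X1 value, or None if it doesn't match."""
    if x1 & 0xFF != 0xFF:
        return None  # every value in the table has FFh in the low byte
    return SHUTDOWN_SOURCES.get(x1 >> 8)
```

Anything that doesn't fit the pattern, such as an access rights word left there by a CS or SS fault, comes back as `None`.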
What more is there?
Honestly, most of this is not all that useful, but it offers some insights into how the chip works.
Using STOREALL to single step through code is only possible in ring 0, since it is privileged. It is also 3 bytes that have to be inserted after every instruction to step through. I don't think there is any equivalent of the trap flag to trigger it automatically, since the ICE hardware wouldn't have needed it.
With custom hardware, it might be possible to blindly put opcodes on the bus and have them executed in ICE mode. Even if it works, it is somewhat unlikely that there are any extra registers that can be accessed this way. Another idea is to run one instruction again and again, starting from a defined state and each time resetting it at a different clock cycle to observe what happens when.
Getting the actual microcode bits certainly isn't possible without some very high-res photos of the die, and possibly delayering.
The best I could find is on visual6502.org. You can see how the ROM is split in two, with the upper half providing control lines and 3 x 6 bits in the lower half selecting which registers are put on the internal buses. Unfortunately, it seems too dense to get anything out of it, but I would be happy to be proven wrong by some deep learning or signal processing wizardry!
The entry point PLA is in the two blocks below, and similar to the decoder in the 8086 and 186. Staring at this for a long time, I could make out some of it, enough to confirm that there are no more hidden opcodes: all of them have matching patterns that go to the same entry point, which must be the one that generates the "invalid opcode" exception.