Math Coprocessor Interface Evolution
From Open Watcom
Unlike many other aspects of the x86 architecture, the interface between the CPU and FPU has changed quite significantly since the original Intel 8088/86 (and the IBM PC). Many of these changes are invisible to software, but not all.
Why the Changes?
Today's CPUs retain near-100% compatibility with the original 8088/86 processors, and the differences are obscure. The x87 FPU, originally called the NDP (Numeric Data Processor), is different: it has evolved in incompatible ways that math libraries, and sometimes even application software, need to be aware of.
Two major factors drove the changes. The 8087 was developed in parallel with the IEEE 754 standard for binary floating-point arithmetic (the IEEE 754 standard was in some ways a product of Intel's work on the 8087). Several changes were made prior to the final adoption of IEEE 754, and the 287 and 387 coprocessors reflected the evolving standard. The changes notably affected infinities, denormals, and NaNs, as well as the related unordered comparisons. As a consequence, the 8087, 80287, and 80387 each behave slightly differently in certain "corner" cases, and the same code may deliver different results.
The other driving force was tighter integration of the CPU and FPU, which reached its final stage when the i486 CPU shipped with an integrated FPU. These changes did not affect the results of computations, but they did have an effect on passing data between the CPU and FPU, as well as error reporting. See Math error handling on x87 for additional details on the latter topic.
The CU and NEU
For understanding some aspects of x87 operation, it is important to realize that the 8087 and the follow-up models in fact contain two units which are capable of parallel execution: The Control Unit (CU) and Numeric Execution Unit (NEU). The CU fetches instructions, reads and writes memory operands, and directly executes the 'control class' x87 instructions; the CU also maintains the x87 control registers. The NEU executes all numeric instructions and maintains the x87 register stack.
It is possible for the CU to execute a control instruction while the NEU is executing a numeric instruction. At the same time, the CPU may be executing yet another instruction!
The 8086 and 8087
Because the 8086 was a very simple CPU (by today's standards), the 8087 interfaced with it in a way that wasn't possible with later, more sophisticated processors. After reset, the 8086 and 8087 synchronized and the 8087 piggy-backed on the 8086's instruction fetches and kept track of its prefetch queue. The two processors effectively executed the same code in lock-step. The 8087 ignored all instructions without the ESC prefix, and the 8086 did not execute instructions with the prefix.
However, when the 8086 encountered an ESC instruction which accessed memory, it would always perform a single bus word read cycle at the effective address. The 8087 intercepted the bus cycle and stored the address. If the x87 instruction indeed read from memory and the operand was larger than one word, the 8087 issued back-to-back bus read cycles to consecutive addresses to fetch the entire operand. This mechanism ensured that the CPU always performed the effective address calculation and the 8087 would simply use the result.
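The operand fetch described above can be sketched as a list of bus read cycles. This is only an illustrative model: the function name is hypothetical, and it assumes a 16-bit bus, a word-aligned operand, and no segment wraparound, none of which are spelled out in the text.

```python
def operand_read_cycles(effective_address, operand_bytes):
    """List (bus master, address) pairs for reading one x87 memory operand.

    Assumptions (not from the original text): 16-bit data bus,
    word-aligned operand, no segment wraparound.
    """
    if operand_bytes <= 0 or operand_bytes % 2:
        raise ValueError("operand size must be a positive even byte count")
    # The 8086 always performs one word read at the effective address.
    cycles = [("8086", effective_address)]
    # The 8087 stored that address and reads the following words itself.
    for offset in range(2, operand_bytes, 2):
        cycles.append(("8087", effective_address + offset))
    return cycles
```

For an 8-byte operand this yields one cycle issued by the 8086 followed by three issued by the 8087, all derived from the single effective address computed by the CPU.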
If the instruction really needed to read from memory, the 8087 would atomically read the entire operand before the next 8086 instruction could execute, and synchronization was not an issue. However, for math instructions which wrote back a result to memory, the synchronization was not transparent and software had to be aware of it. Because 8087 instructions could easily take 100 or more cycles to execute, lack of synchronization was quite likely to cause trouble.
In addition, the 8087 NEU was only capable of executing one numeric instruction at a time. A numeric instruction could not be started while the NEU was executing a previous instruction. On the hardware level, when the 8087 NEU was executing an instruction, it activated the BUSY signal, which was connected to the 8086's /TEST input. The WAIT instruction simply caused the CPU to wait while the /TEST input was active.
The WAIT/FWAIT instruction thus implements the software-visible portion of the synchronization mechanism. Note that FWAIT is the same instruction as WAIT, and it is a CPU instruction despite the 'F' prefix. The difference between WAIT and FWAIT is that the latter can be eliminated by the linker if an emulation library is used. In addition, an 8086/8087 assembler normally automatically inserts a (F)WAIT before all numeric instructions (FMUL, FDIV, etc.). Some control instructions also have two forms, waiting and non-waiting, such as FDISI and FNDISI.
The automatic WAIT insertion ensures that the NEU executes only one instruction at a time. However, the problem of the 8087 writing back results must be handled explicitly. A (F)WAIT must be coded before the CPU can access a memory operand written by the 8087. That enables parallel execution between the 8086 and 8087. While the NEU is executing an instruction, the 8086 can execute unrelated instructions and only perform a WAIT before accessing memory written by the 8087.
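The write-back hazard can be illustrated with a toy timing model. Everything here is hypothetical (class name, method names, cycle counts, the "stale" marker); it only mimics the rule that a result stored by the 8087 is safe to read once BUSY has been deactivated.

```python
class Ndp8087Model:
    """Toy model: a store result reaches memory only when the NEU finishes."""

    def __init__(self):
        self.memory = {}
        self.busy_cycles = 0   # greater than zero means BUSY is active
        self.pending = None    # (address, value) not yet written back

    def start_store(self, address, value, cycles):
        # The assembler-inserted WAIT guarantees the NEU is idle here.
        assert self.busy_cycles == 0, "NEU runs one numeric instruction at a time"
        self.pending = (address, value)
        self.busy_cycles = cycles

    def tick(self, cycles=1):
        # Advance time and commit the pending store once BUSY drops.
        self.busy_cycles = max(0, self.busy_cycles - cycles)
        if self.busy_cycles == 0 and self.pending is not None:
            address, value = self.pending
            self.memory[address] = value
            self.pending = None

    def wait(self):
        # Model of WAIT: stall the CPU until BUSY (the /TEST input) is inactive.
        stalled = self.busy_cycles
        self.tick(self.busy_cycles)
        return stalled


def cpu_reads(ndp, address, use_wait, unrelated_cycles):
    # The CPU does unrelated work in parallel, then reads the result.
    ndp.tick(unrelated_cycles)
    if use_wait:
        ndp.wait()
    return ndp.memory.get(address, "stale")
```

In this model, a store that takes 100 cycles followed by a read after only 10 cycles of unrelated work returns the stale marker unless `use_wait` is set, which is the situation the explicit (F)WAIT is meant to prevent.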
The 8087 reported errors through an interrupt pin. The FENI/FDISI instructions could be used to enable or disable the interrupt. The 8086 did not have a dedicated math error interrupt pin. Intel recommended connecting the 8087 interrupt to the 8259A or similar part. IBM decided to route the 8087 interrupt to the CPU's NMI pin, even though Intel explicitly discouraged that. See Math error handling on x87 for additional details.
In the 8086 days, systems without an 8087 were very common. A need arose to write application code which could take advantage of an 8087 when present, but still execute on a lone 8086. A more or less standardized mechanism was used on the IBM PC by many programming languages. Interrupt vectors 34h to 3Bh were used for emulating ESC opcodes 0D8h to 0DFh. Interrupt vectors 3Ch to 3Eh were used for instructions with segment overrides and WAITs. Assemblers and compilers used symbols such as FIDRQQ, FICRQQ, FIWRQQ, FJARQQ, FJCRQQ, FJSRQQ, and similar as part of the mechanism. Instead of 8087 instructions, the application code would contain these software interrupts.
At run-time, the emulation library first detected the presence of an 8087. If no NDP was present, the code would execute unchanged and every FPU instruction would be handled as a software interrupt into the emulation library. If an 8087 was present, the emulation interrupt would patch the location where it was called from to contain the "original" 8087 instruction. The application code would thus eliminate calls into the emulation library at run-time and turn itself into 8087 code executing at full speed.
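The self-patching step can be sketched as follows. The vector-to-opcode mapping comes from the text above; the exact byte layout is an assumption made for illustration, namely that the two-byte `INT nn` instruction (`CD nn`) is overwritten in place by a WAIT byte (`9B`) followed by the ESC opcode. The function names are hypothetical.

```python
INT_OPCODE = 0xCD         # x86 "INT imm8"
WAIT_OPCODE = 0x9B        # x86 WAIT/FWAIT
FIRST_EMU_VECTOR = 0x34   # emulates ESC opcode 0D8h
FIRST_ESC_OPCODE = 0xD8


def esc_opcode_for_vector(vector):
    """Map emulation interrupt 34h..3Bh to ESC opcode 0D8h..0DFh."""
    if not 0x34 <= vector <= 0x3B:
        raise ValueError("not a plain ESC emulation vector")
    return FIRST_ESC_OPCODE + (vector - FIRST_EMU_VECTOR)


def patch_call_site(code, offset):
    """Rewrite 'INT 34h..3Bh' at offset into 'WAIT; ESC opcode' in place.

    Assumed layout (not from the original text): the INT instruction
    occupies two bytes, and the remaining instruction bytes (ModRM,
    displacement) already follow it unchanged.
    """
    if code[offset] != INT_OPCODE:
        raise ValueError("call site does not start with INT")
    # Validate and translate the vector before modifying anything.
    code[offset + 1] = esc_opcode_for_vector(code[offset + 1])
    code[offset] = WAIT_OPCODE
```

For example, the bytes `CD 35 06 00 10` would become `9B D9 06 00 10` under these assumptions. The patched sequence has the same length, so no other code needs to move.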
Using this mechanism, application size was somewhat larger (the emulation library had to be included), but the advantage was that the application could execute on systems with no 8087, but if an 8087 was present, it would take advantage of it and execute with negligible performance penalty.
The 80286 and 80287
The interface between the 80286 and 80287 was significantly changed. The 8087 interface was inadequate for several reasons. The 287 could run at a frequency different from the 286 and the longer prefetch queue of the 80286 was probably more difficult to track. More importantly, the simple memory operand access mechanism used by the 8087 could not handle protected mode operation. The 286/287 integration had to be much tighter and for that reason, the 287 was no longer called an NDP but rather an NPX (Numeric Processor Extension).
Instead of simply sharing the data and address bus, there was a dedicated interface which used I/O ports 0F8h, 0FAh, and 0FCh. The system board logic signaled accesses to these ports to the 287's CMD0 and CMD1 input pins. Intel reserved those ports and warned users not to access them directly via IN/OUT instructions. In a protected-mode operating system, this was typically not an issue because user code did not have sufficient IOPL (I/O Privilege Level) and could not access I/O ports anyway.
The CU/NEU division still existed in the 287, only the CU was now called a BIU, or Bus Interface Unit. The BIU (formerly the CU) still executed control instructions in parallel with the NEU.
Unlike the 8086, the 286 executed ESC instructions. The ESC instructions also included built-in WAIT functionality. The 286 monitored the BUSY signal and would not execute the ESC instruction while BUSY was active.
A dedicated Processor Extension Data Channel was used to transfer memory operands between the 286 and 287. That way, all transfers were subject to the 286's memory management and protection mechanism. Note that unlike the 8087, the 287 did not issue its own bus cycles to access memory. Instead, all memory accesses were handled by the 286 and moved between the 286 and 287 in additional cycles. Transfers could only occur while the BUSY signal was active, which effectively ensured that once BUSY was deactivated, the 287 had completed memory accesses.
Old 8087 code with explicit WAITs before each ESC instruction worked without change, but these WAIT instructions were no longer required. However, explicit WAIT instructions (or waits executed as part of another ESC instruction) still needed to be inserted after a 287 memory write or read and before a 286 access to the same memory. Exceptions to this rule were some 287 control instructions, notably FSTSW, FSTCW (writes), FLDENV, FLDCW, and FRSTOR (reads), which did not need explicit WAITs and were guaranteed to complete before the next 286 instruction could execute.
The 286 defined an exception (vector 16, Processor Extension Error or #MF) which was triggered in response to the CPU's /ERROR input going active. On the IBM PC/AT, IRQ 13 was used instead for providing backwards compatible math error processing (see Math error handling on x87 for details).
The "native" math error handling via /ERROR and #MF differed in one aspect from the 8087: The exception processing was not quite asynchronous. While the 8087 could raise its interrupt signal and interrupt the CPU execution asynchronously like any other device (even more so on the IBM PC due to the use of NMI), the 286 only checked the /ERROR signal at the beginning of some floating-point instructions and when executing a WAIT instruction. The delivery of an #MF exception could thus be delayed quite significantly. The new arrangement was a logical consequence of the protected mode task switching mechanism. An asynchronous exception could arrive in the middle of or just after a task switch, potentially arriving in a task which did not use the 287 at all. Delaying the exception delivery ensured that the exception could be properly routed to its originator.
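The delayed delivery can be illustrated with a small trace model. The instruction categories and the function name are hypothetical; the model only encodes the rule stated above, that the /ERROR signal is checked at the start of (waiting) floating-point instructions and at WAIT, and nowhere else.

```python
# Hypothetical instruction categories used only in this sketch:
#   "esc"         floating-point instruction that checks /ERROR first
#   "esc_nowait"  non-waiting control instruction (no check)
#   "wait"        the WAIT instruction
#   "cpu"         any ordinary CPU instruction
CHECKS_ERROR = {"esc", "wait"}


def mf_delivery_index(instructions, faulting_index):
    """Return the index of the instruction at which #MF is delivered.

    The error is caused by instructions[faulting_index]; delivery happens
    at the next instruction that checks /ERROR, or never (None) if the
    trace contains no such instruction.
    """
    for index in range(faulting_index + 1, len(instructions)):
        if instructions[index] in CHECKS_ERROR:
            return index
    return None
```

With a trace such as `["esc", "cpu", "esc_nowait", "cpu", "wait"]` and the error caused by the first instruction, the model reports delivery only at the final WAIT, after three unrelated instructions have already executed.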
The 286 contained built-in support for emulating a 287 NPX. The 286 could be programmed to raise a dedicated exception (vector 7, Processor Extension Not Available or #NM) whenever an ESC opcode was encountered. This was accomplished by setting the EM (emulate) bit of the Machine Status Word.
Using this mechanism, an operating system could emulate a 287 in a transparent way. Applications for the 286/287 no longer needed to link with an emulation library and unmodified 287 code could be executed with help of an emulator.
The 80386 and 80387
The 386 was not significantly different in how it interfaced to an NPX. However, matters were complicated somewhat by the fact that a 386 could work with either a 287 or a 387 NPX (that fact alone ensured that the interface could not be radically different).
The 386 still used reserved I/O ports to communicate with the NPX. However, the ports used were 800000F8h (for writing commands) and 800000FCh (for reading/writing data), which entirely avoided potential conflicts with CPU port I/O instructions (those were limited to a 64K address space).
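A quick check makes the address-space argument concrete. The port numbers are taken from the text; the helper name is hypothetical.

```python
IO_SPACE_LIMIT = 0x10000  # IN/OUT instructions can address 64K ports

NPX_PORTS_286 = (0x00F8, 0x00FA, 0x00FC)
NPX_PORTS_386 = (0x800000F8, 0x800000FC)


def reachable_by_in_out(port):
    """True if a port number fits in the 64K space used by IN/OUT."""
    return 0 <= port < IO_SPACE_LIMIT
```

The 286-era ports fall inside the range that software could reach with IN/OUT, which is why Intel had to reserve them. The 386 ports lie far outside it, so no IN/OUT instruction can generate those addresses.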
The ET (extension type) bit in control register CR0 determined whether the 387 or 287 communications protocol was used by the CPU. In some systems the type was detected automatically. In PC/AT compatibles, however, the PC/AT compatible math error handling interfered with this auto-detection, so the BIOS had to detect the NPX type and set the ET bit appropriately. The 387 was of course a 32-bit part with a wider data bus than the 287, but this difference was transparent to software.
The 387 had several significant differences from the 287, including additional instructions and modifications required by changes to the now-finalized IEEE 754 standard. The interface between the CPU and NPX, as well as the resulting error handling and emulation details, were essentially unchanged.
However, there was one notable change between the 286 and 386 which affected CPU/NPX data synchronization. In a 286/287 system, the programmer had to insert explicit WAITs in cases where the CPU accessed data that was also accessed by a preceding numeric instruction. The 386 maintained synchronization implicitly by not allowing further instructions to execute until the NPX signaled that it was done with data transfers.
Thus the only remaining use of the WAIT instruction was exception synchronization. A WAIT still had to be inserted in situations where restartable math exceptions were required and a follow-up CPU instruction might change the operands of a preceding numeric instruction.
The i486
The i486 CPU significantly changed the math coprocessor interface by integrating a Floating-Point Unit (FPU) on chip together with the CPU (in the i486DX; the i486SX had no FPU and is irrelevant for the purpose of this discussion). On the other hand, the FPU instruction set and the results of instruction execution changed very little since the 387.
The FPU integration obviously eliminated the visible physical interface between the CPU and FPU; all data and signals were now handled within a single chip. However, for backwards compatibility purposes, the i486 could signal math errors externally and emulate a 386/387 in cooperation with external circuitry on the logic board.
The ET bit of CR0 was no longer used (it was hardwired to 1). The i486 could only use the integrated FPU or no math co-processor at all.
The NE (Numeric Error) bit in CR0 selected between native error reporting through #MF (NE set) and backwards compatible error handling (NE clear). The IGNNE# (Ignore Numeric Error) input signal and the FERR# (FPU Error) output signal were used for the backwards compatible mode. The motherboard chipset used FERR# to trigger IRQ 13 and allow old 287/8087 code to run unchanged.
The FERR# signal essentially externalized the #MF exception. The IGNNE# input made it possible to emulate the case where an exception is unmasked in the FPU but masked between the CPU and FPU. In PC compatible systems using the 287 and 387, the motherboard logic could (and did) intercept the signals between the CPU and NPX without explicit support in either in order to emulate an IBM PC with an 8087. With the i486 that was no longer possible and Intel had to provide these signals for PC compatibility.
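The interaction of NE and IGNNE# can be summarized as a small decision function. This is a simplified sketch based on a general reading of the i486 documentation rather than on the text above; the function name and the returned strings are hypothetical, and timing details are ignored.

```python
def on_next_fpu_instruction(ne, ignne_asserted, unmasked_error_pending):
    """Describe what the CPU does at the next waiting FPU or WAIT instruction.

    Simplified model (an assumption, not from the original text):
    - NE set: errors are reported natively as #MF, IGNNE# plays no role.
    - NE clear: the error is signaled externally through FERR#; the CPU
      either ignores it (IGNNE# asserted) or stops until an external
      interrupt arrives, which on a PC is IRQ 13.
    """
    if not unmasked_error_pending:
        return "execute"
    if ne:
        return "raise #MF"
    if ignne_asserted:
        return "execute, error ignored"
    return "stop until external interrupt"
```

The third case is the one that corresponds to an exception being unmasked in the FPU but masked between the CPU and FPU.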
As in the 386, the WAIT instruction was only used and required for exception synchronization; all FPU instruction and data synchronization was handled automatically.
Emulating 487 instructions was accomplished in the same way as on the 386 (the basic mechanism was unchanged since the 286). The MP and EM bits were used to make FPU instructions raise exception 7. On the i486SX that was the only possibility; on the i486DX it was an option for testing software emulators.
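The effect of these bits can be sketched as a predicate over CR0. The bit positions of MP and EM are the architectural ones. The TS (Task Switched) bit is not discussed in the text above but is included because WAIT only raises exception 7 when both MP and TS are set. The function name and the instruction labels are hypothetical.

```python
CR0_MP = 1 << 1  # Monitor coProcessor
CR0_EM = 1 << 2  # EMulate
CR0_TS = 1 << 3  # Task Switched (not covered in the text above)


def raises_nm(cr0, instruction):
    """True if the instruction raises exception 7 (#NM) for this CR0 value.

    instruction is "esc" for any ESC (FPU) opcode or "wait" for WAIT/FWAIT.
    """
    em = bool(cr0 & CR0_EM)
    mp = bool(cr0 & CR0_MP)
    ts = bool(cr0 & CR0_TS)
    if instruction == "esc":
        return em or ts
    if instruction == "wait":
        return mp and ts
    raise ValueError("instruction must be 'esc' or 'wait'")
```

An operating system that wants every FPU instruction routed to a software emulator sets EM, after which `raises_nm(CR0_EM, "esc")` is true regardless of the other bits.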
References
Intel iAPX 86,88 User's Manual, 1981, Order no. 210201-001
Intel iAPX 286 Hardware Reference Manual, 1983, Order no. 210768-001
Intel iAPX 286 Programmer's Reference Manual Including the iAPX 286 Numeric Supplement, 1985, Order no. 210498-003
Intel 80386 Hardware Reference Manual, 1987, Order no. 231732-002
Intel 80386 Programmer's Reference Manual, 1987, Order no. 230985-002
Intel486(TM) Microprocessor Family Programmer's Reference Manual, 1992, Order no. 240486-002