Pentium II Math Bug?

It would appear that there may be a bug in the floating point unit of the new Pentium II Processor, as well as the current Pentium Pro Processor. Is it real? Is it serious? It appears to be real. The observed behavior contradicts the IEEE Floating Point Specifications, and Intel's printed documentation. However, I'm not a numerical analyst, and therefore I'm not qualified to comment on its seriousness or its implications. Instead, I'll present the facts herein, and leave the determination to you.


The Facts

I received email from "Dan," who asked if I could reproduce what he thought was a bug in the Pentium Pro processor. I wrote an assembly language program to check into the problem. I also ran the test on a Pentium II processor that I had recently bought at Fry's Electronics, as well as on an Intel Pentium Processor (P54C), an Intel Pentium Processor with MMX Technology (P55C), and an AMD K6. Sure enough, I came to the same conclusion as Dan: it looks like a bug to me.


What do we call this bug?

These days, astronomers name new stars and comets by combining the discoverer's name and some number. Why should microprocessor bugs be any different? In this case, "Dan" is the discoverer of the bug, and 04-11 (1997) is the date on which I got my first email about it. So I've named the bug "Dan-0411" after its discoverer and the date he first reported it to me.


What is the bug, and what does it affect?

The bug relates to operations that convert floating point numbers into integers. Inside the microprocessor, floating point numbers are stored in an 80-bit format. The integer formats relevant here come in two sizes: a short integer is stored in 16 bits, and a long integer in 32 bits. It is often desirable to store an 80-bit floating point number as an integer. Sometimes the converted number won't fit into the smaller integer format. This is when the bug occurs.

The host software is supposed to be warned by the microprocessor when such a floating point conversion error occurs: a specific error flag is supposed to be set in a floating point status register. If the microprocessor fails to set this flag, it is not in compliance with the IEEE floating point standard, which mandates such behavior. For the Dan-0411 bug, the Pentium II and Pentium Pro processors fail to set this error flag in many cases.

When storing 16-bit integers, the chance of randomly hitting the bug is 2^47/2^80, or 1 in 8,589,934,592 (1 in 8.6 billion). When storing 32-bit integers, the chance is 2^31/2^80 (1 in 562,950 billion). Together, that's approximately 140,739,635,839,000 different floating point numbers that result in the incorrect behavior. The Pentium, Pentium with MMX Technology, and AMD K6 microprocessors do not appear to have this problem.
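
The odds quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as the figures themselves do, that every one of the 2^80 possible 80-bit patterns is equally likely; the constant names are illustrative only.

```python
from fractions import Fraction

TOTAL_PATTERNS = 2**80   # every possible 80-bit pattern, as counted above
FAILING_16 = 2**47       # failing operands for 16-bit stores
FAILING_32 = 2**31       # failing operands for 32-bit stores

# Fraction reduces each ratio, so the denominator is the "1 in N" figure.
odds_16 = Fraction(FAILING_16, TOTAL_PATTERNS)
odds_32 = Fraction(FAILING_32, TOTAL_PATTERNS)
total_failing = FAILING_16 + FAILING_32

print(f"16-bit store: 1 in {odds_16.denominator:,}")   # 1 in 8,589,934,592
print(f"32-bit store: 1 in {odds_32.denominator:,}")   # 1 in 562,949,953,421,312
print(f"failing operands: {total_failing:,}")          # 140,739,635,838,976
```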

It might be interesting to note that a launch failure of the Ariane 5 rocket, which happened less than a minute into the launch, was traced to behavior around an overflow condition (in this case, it was software, not hardware, that was the problem). One of the computers on board had a floating point to integer conversion that overflowed, but because the overflow was not handled by the software the computer did a dump of its memory. Unfortunately, this memory dump was interpreted by the rocket as instructions to its rocket nozzles. Result--boom!


Why wasn't this bug detected before?

I'm not exactly sure why this bug wasn't detected sooner, but there are a few clues that could help provide an explanation. There appears to be a bug in a popular floating point test program. If Intel relied on this program, its bug may have inadvertently allowed the Dan-0411 bug to slip by undetected. Professor William Kahan of Berkeley has written a suite of floating point test programs in the FORTRAN programming language (please refer to Dr. Kahan's home page). These programs are commonly used to test the Float-to-Integer Store instructions (FIST and FISTP).

FORTRAN compilers may differ in how they handle bit-wise expressions, and those differences could make the test behave differently from one compiler to the next. It looks like Dr. Kahan's original intent was to use a bit-wise AND rather than a logical AND in his FORTRAN source code. This is a potential portability issue, as I'm not sure how AND is defined by the FORTRAN standard. The non-portable code was discovered when Dan converted Dr. Kahan's FORTRAN source code to the C programming language, which has separate bit-wise and logical AND operators. Dan recognized Dr. Kahan's original intent and used the bit-wise AND operator in his C source code. That is when the bug in the chip appeared. So in the end, either a bug in the test software or a bug in a FORTRAN compiler may have hidden a bug in the chip.
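
To see how the choice of AND operator could mask the problem, here is a minimal sketch. It is not Dr. Kahan's code or Dan's C translation; the function names are hypothetical, and Python's `&` and `and` stand in for a bit-wise and a logical AND. The only facts taken from the hardware are the flag positions (IE is bit 0 and PE is bit 5 of the FPU status word).

```python
IE_MASK = 0x0001  # Invalid Operation flag, bit 0 of the FPU status word
PE_MASK = 0x0020  # Precision flag, bit 5 of the FPU status word


def ie_set_bitwise(status_word: int) -> bool:
    """Intended test: isolate the IE bit and see whether it is set."""
    return (status_word & IE_MASK) != 0


def ie_set_logical(status_word: int) -> bool:
    """Mistaken test: true whenever both operands are merely non-zero."""
    return bool(status_word and IE_MASK)


# A status word with PE set and IE clear, the pattern Dan reports on the P6.
buggy_status = PE_MASK

print(ie_set_bitwise(buggy_status))  # False: the missing IE flag is noticed
print(ie_set_logical(buggy_status))  # True: any non-zero status word passes
```

With the logical form, a status word that has only PE set still passes the check, so a test written that way would not notice that IE is missing.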

That's the end of the non-technical discussion. For further technical details, continue reading.


How did I get involved?

"Dan," who wishes to remain anonymous, sent me the following email on April 11, 1997 (reprinted with permission):

        Robert,

        There seems to be a bug in the FIST[P] m16int and FIST[P]
        m32int instructions for the P6 (Pentium Pro).  Some (perhaps all)
        values in the following ranges fail to set the IE (Invalid
        operation Exception) flag as required for integer overflow.

        FIST[P] m32int: [ c05e80000000000000001, c05e8000000080000000 ] (~-2^95)
        FIST[P] m16int: [ c06e80000000000000001, c06e8000800000000000 ] (~-2^111)

        (Number of failing mantissas = 2^31 + 2^47)

        Example on P6 (Pentium Pro):
          fcw = 0x37f
          FIST[P] m16int c06e80000000000000001 -> 8000 (stored in memory)
          FPU status word:  B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                            0  0 000  0  0  0  0  0  1  0  0  0  0  0
          ***FAIL***


        Example on P5 (Pentium):
          fcw = 0x37f
          FIST[P] m16int c06e80000000000000001 -> 8000 (stored in memory)
          FPU status word:  B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                            0  0 000  0  0  0  0  0  0  0  0  0  0  1


        -- Dan
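
The operands in Dan's email are raw 80-bit patterns: a sign bit, a 15-bit biased exponent, and a 64-bit significand. The sketch below decodes such a pattern and counts the significands in each range. One assumption is labeled loudly: the lower endpoints in the email are printed with 21 hex digits, one more than an 80-bit value can hold, so the sketch assumes the intended values are the 20-digit forms c06e8000000000000001 and c05e8000000000000001. The function names are illustrative, and only normal numbers are handled.

```python
from fractions import Fraction

EXPONENT_BIAS = 16383  # bias of the 15-bit exponent in the 80-bit format


def decode_extended(hex_digits: str):
    """Split a 20-hex-digit extended-precision value into its three fields."""
    if len(hex_digits) != 20:
        raise ValueError("an 80-bit value needs exactly 20 hex digits")
    raw = int(hex_digits, 16)
    sign = raw >> 79
    exponent = ((raw >> 64) & 0x7FFF) - EXPONENT_BIAS
    significand = raw & 0xFFFFFFFFFFFFFFFF  # bit 63 is the explicit integer bit
    return sign, exponent, significand


def extended_value(hex_digits: str) -> Fraction:
    """Exact value of a normal (non-NaN, non-infinite) extended number."""
    sign, exponent, significand = decode_extended(hex_digits)
    magnitude = Fraction(significand) * Fraction(2) ** (exponent - 63)
    return -magnitude if sign else magnitude


def significands_in_range(low_hex: str, high_hex: str) -> int:
    """Count significands between two endpoints sharing sign and exponent."""
    _, _, low = decode_extended(low_hex)
    _, _, high = decode_extended(high_hex)
    return high - low + 1


# Assumed 20-digit form of the m16int operand from the email.
sign, exponent, _ = decode_extended("c06e8000000000000001")
print(sign, exponent)  # 1 111

print(extended_value("c06e8000000000000001") == -(2**111 + 2**48))  # True
print(significands_in_range("c06e8000000000000001", "c06e8000800000000000") == 2**47)  # True
print(significands_in_range("c05e8000000000000001", "c05e8000000080000000") == 2**31)  # True
```

Under that assumption the decoded exponents (111 and 95) and the range sizes (2^47 and 2^31) agree with the figures Dan gives.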

Dan wanted to make sure that there wasn't a bug in his C source code, or his C compiler. That's when he contacted me. Dan wanted me to write assembly language source code on his behalf. By writing in assembly language, the floating point hardware may be tested directly and queried directly for its response without the possible influence of compiler bugs and such.

Normally I don't get involved in debugging other people's problems or writing source code on their behalf. But Dan was persistent. Within a day or two, Dan had come up with some very concrete examples of the bug and instructions which I could use as guidelines for reproducing it. I still wasn't convinced that I wanted to be involved (not being a floating point expert). But after 10 days or so, I finally became convinced, and that's when I wrote the first piece of assembly language source code to detect the Dan-0411 bug.


The Nature of the Bug

This bug occurs when a large negative floating point number is stored to memory in an integer format. Under normal operation, when a floating point number is too large to fit in the integer format, the most negative integer (8000h for a 16-bit store, 80000000h for a 32-bit store) is stored in memory, and the FPU Status Word indicates that an Invalid Operation Exception (IE) occurred (FSW.IE = 1).
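
The status-word rows in Dan's email can be read mechanically. The following sketch decodes a 16-bit FPU status word using the standard x87 bit layout; the two constants are simply the encodings of the rows shown in the email (only PE set, and only IE set), and the function names are illustrative.

```python
# Bit positions in the 16-bit x87 FPU status word.
# TOP is a 3-bit field occupying bits 11-13 and is handled separately.
SINGLE_BIT_FLAGS = {
    "IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5, "SF": 6, "ES": 7,
    "C0": 8, "C1": 9, "C2": 10, "C3": 14, "B": 15,
}


def decode_status_word(fsw: int) -> dict:
    """Return every status-word field as a name -> value mapping."""
    fields = {name: (fsw >> bit) & 1 for name, bit in SINGLE_BIT_FLAGS.items()}
    fields["TOP"] = (fsw >> 11) & 0b111
    return fields


def raised_exceptions(fsw: int) -> list:
    """List the exception flags that are set, in bit order."""
    fields = decode_status_word(fsw)
    return [name for name in ("IE", "DE", "ZE", "OE", "UE", "PE") if fields[name]]


PENTIUM_PRO_FSW = 0x0020  # the P6 row in the email: only PE = 1
PENTIUM_FSW = 0x0001      # the P5 row in the email: only IE = 1

print(raised_exceptions(PENTIUM_PRO_FSW))  # ['PE']
print(raised_exceptions(PENTIUM_FSW))      # ['IE']
```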

Stores that overflow a "real number" format are supposed to behave differently from stores that overflow an "integer number" format. A real-format overflow sets the overflow flag (FSW.OE = 1), not the Invalid Operation Exception flag (FSW.IE). An integer-format overflow is supposed to set FSW.IE. With the Dan-0411 bug, the processor does not set FSW.IE; it sets the Precision Exception flag (FSW.PE = 1) instead. The Pentium Pro Family Developer's Manual, Volume 2, section 7.8.4 makes this difference quite clear:

"The FPU reports a floating-point numeric overflow exception (#O) whenever the rounded result of an arithmetic instruction exceeds the largest allowable finite value that will fit into the real format of the destination operand. For example, if the destination format is extended-real (80 bits), overflow occurs when the rounded result falls outside the unbiased range of -1.0 * 2^16384 to 1.0 * 2^16384 (exclusive). Numeric overflow can occur on arithmetic operations where the result is stored in an FPU data register. It can also occur on store-real operations (with the FST and FSTP instructions), where a within-range value in a data register is stored in memory in a single- or double-real format. The overflow threshold range for the single-real format is -1.0 * 2^128 to 1.0 * 2^128; the range for the double-real format is -1.0 * 2^1024 to 1.0 * 2^1024."

That explains how float-to-real overflows are supposed to be handled. But the Pentium Pro manual explicitly distinguishes between float-to-real overflows and float-to-integer overflows. In fact, the very next paragraph in the Pentium Pro manual describes the behavior for the exact conditions exposed by Dan-0411.

"The numeric overflow exception cannot occur when overflow occurs when storing values in an integer or BCD integer format. Instead, the invalid-arithmetic-operand exception is signaled."

As I said, this is the precise condition which is not being met by the Pentium Pro and Pentium II microprocessors. The programs that demonstrate Dan-0411 will set up these conditions and test whether or not the proper error condition codes are set by the microprocessor.
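
The documented behavior can be summarized as a small reference model. This is only a sketch of what the manual calls for, not the assembly test program itself: it assumes masked exceptions and round-to-nearest (the settings implied by fcw = 0x37f in Dan's email), ignores NaNs and infinities, and uses a hypothetical function name. The sample operand is the assumed 20-digit value c06e8000000000000001 written out exactly.

```python
from fractions import Fraction


def fist_expected(value: Fraction, bits: int):
    """Return (stored_integer, flags) per the documented masked response.

    Rounding is round-to-nearest-even, which is what Python's round()
    performs on a Fraction when called without a digit count.
    """
    lowest = -(2 ** (bits - 1))
    highest = 2 ** (bits - 1) - 1
    rounded = round(value)
    if rounded < lowest or rounded > highest:
        # Integer overflow: store the most negative integer and flag IE only.
        return lowest, {"IE"}
    # In range: PE is set only if rounding changed the value.
    flags = {"PE"} if rounded != value else set()
    return rounded, flags


operand = Fraction(-(2**111 + 2**48))  # exact value of c06e8000000000000001

stored, flags = fist_expected(operand, 16)
print(hex(stored & 0xFFFF), flags)      # 0x8000 {'IE'}
print(fist_expected(Fraction(5, 2), 16))  # (2, {'PE'})
```

Dan's Pentium example matches this model (8000 stored, IE set). His Pentium Pro example stores the same 8000 but reports PE instead of IE, which is the discrepancy the test programs look for.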


Is this already a known bug?

Part of the process of disclosing this bug was ensuring that it hadn't already been reported in any of Intel's errata documents. Because Intel provides electronic versions of its errata for the Pentium and Pentium Pro microprocessors, it is easy to search them to see whether this bug has been previously reported. Using this technique, I could not find any documentation disclosing the Dan-0411 bug on either the Pentium or Pentium Pro microprocessors.


The Results

I ran a test on various Pentiums and other microprocessors. For the purposes of this article, I will show the results for the Intel 486, Pentium (P54C), Pentium with MMX Technology (P55C), AMD K6, Pentium Pro, and Pentium II microprocessors. These results demonstrate that the bug is present only on the Pentium Pro and Pentium II microprocessors. None of the other processors I tested demonstrated the Dan-0411 bug.


Conclusion

After reading this, I'm sure that many people will work vigorously to verify or refute my test results. Since I'm not a numerical analyst, you should draw your own conclusions or rely on the conclusions of a qualified expert as to the significance of the Dan-0411 bug. One thing I can say conclusively: the Pentium Pro and Pentium II processors behave differently from their predecessors.



Text retrieved from the Internet by Cédric / QueST


