The IEEE-754 Standard for Floating-Point Arithmetic (“the standard”) specifies how to represent real numbers (i.e., positive and negative numbers that can have fractional parts) using binary digits, both for computer calculations and for exchanging data values between digital systems. Virtually all processors and virtually all programming languages use this standard for floating-point operations and data types. As the standard’s documentation says:
“This standard provides a method for computation with floating-point numbers that will yield the same result whether the processing is done in hardware, software, or a combination of the two. The results of the computation will be identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing will be reported in a consistent manner regardless of implementation.”
The standard specifies four binary formats for floating-point numbers: binary16, binary32, binary64, and binary128, which are the focus of this document. The standard also specifies the operations that must be supported (addition, subtraction, multiplication, division, and a special combined operation called fusedMultiplyAdd), how exceptions are managed, and various conversion operations. In this document, however, we are dealing only with the parts of the standard that specify the ways floating-point values are encoded using binary digits.
This section reviews properties of binary “numbers” that underlie the IEEE-754 standard. The first property is the reason numbers was in quotes in the previous sentence: in information theory, the amount of uncertainty that can be reduced by answering one yes-no question is defined as one bit of information. Before I flip a coin, my uncertainty about which of two ways it will land is defined to be one bit of information. After I see the result, I have no more uncertainty about the outcome: my uncertainty has been reduced by one bit to zero. Claude Shannon formalized information theory in the 1940s, and his colleague at Bell Telephone Laboratories, John Tukey, coined the term “bit” as the name for the unit of information. Since “bit” is a contraction of “binary digit” it might be reasonable to think of information in numerical terms, and Shannon’s work certainly used mathematical rigor to provide a workable model for measuring information. But the genius of Shannon’s work is that bits can be used to measure any kind of information. The immediate application of Shannon’s work, for example, was engineering the transmission of speech information between telephones.
A second principle of information theory is that binary digits can be used not only to measure information, they can also be used to encode information. If I want to talk about the two possible outcomes of a coin toss, I can use the words “heads” and “tails”, or I can encode them using the symbol “1” to represent “heads” and “0” to represent “tails”, or vice-versa. How the values of the bits are assigned to the two outcomes has nothing to do with their numerical values. Just think of “0” and “1” as a set of two symbols that can be used to encode information.
In addition, several features of using just two symbols to encode information make information theory fundamental to the design of computing devices.
The crucial point here is that binary digits can be used to encode any kind of information (including numbers!) rather than that binary “numbers” are the fundamental concept. Electric circuits can be used to perform logic operations, using bits to represent truth values, and in turn, logic circuits can be used to perform numerical operations. For example, the logical operators AND and exclusive-OR can be used to calculate the sum and carry bits when adding two binary digits together:
| A | B | Exclusive OR (Sum) | AND (Carry) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
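As a quick sketch in Python, the truth table above maps directly onto the bitwise `^` (XOR) and `&` (AND) operators:

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two single bits: XOR gives the sum bit, AND gives the carry bit."""
    return a ^ b, a & b

# Reproduce the truth table above.
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} {b} -> sum={s} carry={c}")
```

This is exactly the “half adder” circuit at the bottom of every hardware integer adder; chaining such stages with carry propagation adds numbers of any width.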
The 754 Standard specifies how binary digits are used to encode up to five pieces of information about a floating-point value. In most cases there are just three, based on scientific notation: the fractional part of the value, an exponent for scaling the value, and the sign of the value. But the standard also specifies how to encode values that can arise as the result of computation but which can’t be represented using scientific notation: infinity, for results too large to fit in the number of bits available, and NaN (Not a Number), for results such as 0/0 that have no numerical value at all.
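These special encodings are easy to see in practice. Here is a small illustration in Python, whose float type is an IEEE-754 binary64 value:

```python
import math

huge = 1.0e308
overflow = huge * 10.0        # too large for binary64, so the result is +infinity
print(math.isinf(overflow))   # True

invalid = overflow - overflow # infinity minus infinity has no numerical answer
print(math.isnan(invalid))    # True
print(invalid == invalid)     # False: NaN compares unequal even to itself
```

The last line shows one of NaN’s defining quirks: it is the only value that is not equal to itself, which the standard specifies deliberately so that invalid results can be detected.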
Before looking at the encodings in detail, it helps to review how the number of bits determines the number of possible combinations, how positional number systems work, the trade-off between range and precision in fixed-point representation, and how scientific notation “optimizes” both.
Using logic circuits to perform calculations with binary floating-point numbers is very complicated: just to add two numbers together, the exponents must first be adjusted to get the two fixed-point parts to line up with each other properly, and the exponent of the result must then be determined after the addition is complete. The circuits to do all this are so complex that when Intel Corporation introduced the 8086 microprocessor in 1978, there was no hardware support for floating-point operations until 1980, when Intel introduced a separate processing circuit called the 8087 to handle all that complexity. The details of the way bits were used to represent the exponent and fixed-point parts of floating-point values in the 8087 became an “industry standard,” and in 1985 a professional organization, the Institute of Electrical and Electronics Engineers (IEEE), formalized the 8087’s model as IEEE-754, the “IEEE Standard for Floating-Point Arithmetic.” The standard has been updated since 1985, most recently in 2008 and again in 2019, and it’s the 2019 revision that this web site is based on.
There are three parts to an IEEE-754 encoded number: the sign bit that tells whether the value is positive or negative, the exponent value, and the fixed-point part, called the significand in the standard. The four formats (binary16, binary32, binary64, and binary128) each use a fixed number of bits (16, 32, 64, or 128) to represent a value. For a given number of bits, there is a limit to how many different values can be encoded, i.e., how many different combinations of 1’s and 0’s are possible. That means, for example, that binary16 can represent only 65,536 (i.e., 2^16) different values. Even at the other extreme, binary128 can represent over 3×10^38 (i.e., 2^128) different bit patterns; but since there is an unlimited number of real values between any two numbers, even here the number of possible values is gigantic, but limited. All four of the formats use one bit to represent the sign of the encoded value: 0 for positive; 1 for negative. This is called sign-magnitude encoding. The remaining bits are divided into the exponent and significand fields:
| format | sign | exponent | significand |
|---|---|---|---|
| binary16 | 1 | 5 | 11 |
| binary32 | 1 | 8 | 24 |
| binary64 | 1 | 11 | 53 |
| binary128 | 1 | 15 | 113 |
Did you notice that the numbers don’t add up? The binary16 format somehow uses 16 binary digits to represent 17 (1+5+11) bits of information, for example. This brings up two important concepts about IEEE-754 encoding: the implicit (“hidden”) bit, and the unique representation of values.
Using : rather than . to represent the position that separates the whole-number part from the fractional part of binary numbers, the numbers 1111×2^0, 111:1×2^1, 11:11×2^2, 1:111×2^3, and 0:1111×2^4 all have the same value (decimal 15.0), but only the fourth one is in what’s called “normalized” or “canonical” form, where there is exactly one bit to the left of the binary point, and it’s a 1. The IEEE-754 standard always represents the significand in the canonical form, with the exponent adjusted to give the correct value. Since the bit to the left of the binary point is always a 1, it’s redundant information, and is not stored as part of the encoded value.
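The hidden bit can be seen by pulling the raw bits of an encoded value apart. Here is a sketch in Python for binary32, using the standard struct module to get at the underlying bit pattern (the exponent field is stored with a bias of 127, a scheme explained below):

```python
import struct

def decode_binary32(x: float) -> tuple[int, int, float]:
    """Split a binary32 encoding into its sign, unbiased exponent, and significand."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    # For normal values, restore the implicit leading 1 and remove the bias.
    significand = 1 + fraction / 2**23
    return sign, exponent - 127, significand

print(decode_binary32(15.0))  # (0, 3, 1.875), i.e. +1.875 × 2^3
```

Note that this sketch handles only normal values; zeros, subnormals, infinities, and NaNs (discussed below) need extra cases.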
But the word “always” in the previous paragraph isn’t quite right: the standard specifies a set of values for each format called subnormals where the bit to the left of the binary point is 0. To explain that, we need to look at how exponents are defined.
Using binary16 as an example, there are 5 bits for the exponent field, giving 2^5 (i.e., 32) possible combinations: 00000, 00001, 00010, … 11111. As unsigned numbers, these represent the decimal values from zero to 31. But IEEE-754 exponents have to be signed in order to represent real values between 0.0 and 1.0. For example, 0.5 in decimal (i.e., 1/2) is 2^-1. There are several commonly used ways to use binary digits to encode signed values. For example, we already mentioned that the standard uses sign-magnitude encoding for the floating-point value as a whole. You probably also know that two’s complement encoding is typically used for signed integer (fixed-point) values. The IEEE-754 standard uses biased notation for encoding the exponent part of a value, which we will explore further below. First, here is a small table comparing five-bit values in sign-magnitude, one’s complement, two’s complement, and biased notation. One’s complement and two’s complement are not used anywhere in the IEEE-754 standard, but I’ve included them here to show how the two encodings used by the standard compare to some alternate possibilities.
| Bit Pattern | One’s Complement | Two's Complement | Sign-Magnitude | Biased (Bias = 15) |
|---|---|---|---|---|
| 00000 | 0 | 0 | 0 | -15 |
| 00001 | 1 | 1 | 1 | -14 |
| … | … | … | … | … |
| 01110 | 14 | 14 | 14 | -1 |
| 01111 | 15 | 15 | 15 | 0 |
| 10000 | -15 | -16 | -0 | 1 |
| … | … | … | … | … |
| 11110 | -1 | -2 | -14 | 15 |
| 11111 | -0 | -1 | -15 | 16 |
The logic circuits for addition and subtraction are simplest with two’s complement, but biased notation is simpler for comparison and counting operations, such as counting how many positions to shift a significand value to normalize it.
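Biased encoding itself is nothing more than adding a fixed offset. A sketch for a 5-bit exponent field with bias 15, matching the table above:

```python
BIAS = 15  # binary16 exponent bias: 2**(5 - 1) - 1

def encode_exponent(e: int) -> int:
    """Store a signed exponent as an unsigned 5-bit field by adding the bias."""
    assert -15 <= e <= 16, "out of range for a 5-bit biased field"
    return e + BIAS

def decode_exponent(field: int) -> int:
    """Recover the signed exponent from the stored field."""
    return field - BIAS

print(format(encode_exponent(-1), "05b"))  # 01110, as in the table
print(decode_exponent(0b11110))            # 15
```

(The all-zeros and all-ones patterns at the extremes of this range are accepted here for completeness, but as the next paragraph explains, the standard reserves them for special values.)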
The IEEE-754 standard reserves the all-zeros and all-ones bit patterns in the exponent field to represent special values rather than actual exponent values.
If the exponent and significand fields are both all zeros, the floating-point value is zero. A quirk is that the sign bit can be either 0 or 1, giving rise to the possibility of having the value negative zero in addition to zero. Given that 2^n is always an even number for any number of bits n, the value zero always causes problems: for two’s complement, there is one more negative value than there are positive values; for one’s complement and sign-magnitude notations, it’s the possibility of negative zero; and for biased notation, as the table above shows, the range is asymmetric, with one more positive value than negative. In all of these cases, computations have to take this factor into account.
When the exponent field is all zeros but the significand is not zero, the value is said to be subnormal or denormal. This pattern allows for a feature called gradual underflow.
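Gradual underflow is easy to observe. A short illustration in Python, whose float type is binary64:

```python
import sys

smallest_normal = sys.float_info.min          # 2^-1022, the smallest normal binary64 value
print(smallest_normal)                        # 2.2250738585072014e-308

subnormal = smallest_normal / 2**10           # smaller still, but not rounded to zero
print(subnormal > 0.0)                        # True: underflow is gradual, not a cliff
print(subnormal * 2**10 == smallest_normal)   # True: this particular value is exact
```

Without subnormals, any result smaller than the smallest normal value would have to be flushed abruptly to zero; subnormal values fill that gap, trading away precision bit by bit as the values approach zero.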