
Floating Point Background Material

About the Standard

The IEEE-754 Standard for Floating-Point Arithmetic (“the standard”) specifies how to represent real numbers (i.e., positive and negative numbers that can have fractional parts) using binary digits, both for computer calculations and for exchanging data values between digital systems. Virtually all processors and virtually all programming languages use this standard for floating-point operations and data types. As the standard’s documentation says:

“This standard provides a method for computation with floating-point numbers that will yield the same result whether the processing is done in hardware, software, or a combination of the two. The results of the computation will be identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing will be reported in a consistent manner regardless of implementation.”#

The standard specifies four different binary formats for floating-point numbers: binary16, binary32, binary64, and binary128, which are the focus of this document. The standard also specifies the operations that must be supported (addition, subtraction, multiplication, division, square root, and a special combined operation called fusedMultiplyAdd), how exceptions are managed, and various conversion operations. In this document, however, we are dealing only with the parts of the standard that specify the ways floating-point values are encoded using binary digits.

Binary Information

This section reviews properties of binary “numbers” that underlie the IEEE-754 standard. The first property is the reason numbers was in quotes in the previous sentence: in information theory, the amount of uncertainty that can be reduced by answering one yes-no question is defined as one bit of information. Before I flip a coin, my uncertainty about which of two ways it will land is defined to be one bit of information. After I see the result, I have no more uncertainty about the outcome: my uncertainty has been reduced by one bit to zero. Claude Shannon formalized information theory in the 1940s, and his colleague at Bell Telephone Laboratories, John Tukey, coined the term “bit” as the name for the unit of information. Since “bit” is a contraction of “binary digit”, it might be reasonable to think of information in numerical terms, and Shannon’s work certainly used mathematical rigor to provide a workable model for measuring information. But the genius of Shannon’s work is that bits can be used to measure any kind of information. The immediate application of Shannon’s work, for example, was engineering the transmission of speech information between telephones.

A second principle of information theory is that binary digits can be used not only to measure information, they can also be used to encode information. If I want to talk about the two possible outcomes of a coin toss, I can use the words “heads” and “tails”, or I can encode them using the symbol “1” to represent “heads” and “0” to represent “tails”, or vice-versa. How the values of the bits are assigned to the two outcomes has nothing to do with their numerical values. Just think of “0” and “1” as a set of two symbols that can be used to encode information.

In addition, there are two features of using just two symbols to encode information that make information theory so fundamental to the design of computing devices.

  1. Electrical and electronic circuits that can be in either of two states can be built readily. A switch can be on or off; a magnet has a north and south pole; a battery has a positive and negative side, etc. This property makes it possible to encode and manipulate information efficiently using switches, magnets, and/or voltages to represent the bits of information.
  2. Electrical and electronic circuits can implement the rules of Boolean algebra (or propositional logic) using bits to represent the two values true and false. For example, if you want to control a light in a stairway using one switch at the top of the stairs and another at the bottom, you would use two “single pole double throw” light switches (also called “three-way switches” in North America and “two-way” switches elsewhere) to implement Boolean algebra’s exclusive OR function: the light is on if exactly one of the switches is in the “on” position, but is off if both switches are on or both switches are off.
    Two switches controlling a light. In the left panel both switches are off and the light is off; in the middle two panels just one switch is on and the light is on; in the right panel both switches are on and the light is off. The red lines show the connected part of the path from the plug on the left towards the lightbulb on the right.

The crucial point here is just that binary digits can be used to encode any kind of information (including numbers!) rather than that binary “numbers” are the fundamental concept. The second point above made the point that electric circuits can be used to perform logic operations, using bits to represent truth values. In turn, logic circuits can be used to perform numerical operations. For example, the logical operators AND and exclusive OR can be used to calculate the sum and carry when adding two binary digits together:

Logic truth table for computing the sum and carry when adding two binary digits, A and B.

A  B  Exclusive OR (Sum)  AND (Carry)
0  0  0                   0
0  1  1                   0
1  0  1                   0
1  1  0                   1
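
The truth table above can be checked directly in code. Here is a minimal Python sketch (the function name half_add is mine, not from the standard), using the bitwise XOR (^) and AND (&) operators:

```python
# Half adder: for one column of binary addition, the sum bit is the
# XOR of the two inputs and the carry bit is the AND of the two inputs.
def half_add(a: int, b: int) -> tuple[int, int]:
    return a ^ b, a & b  # (sum, carry)

# Reproduce the truth table above.
for a in (0, 1):
    for b in (0, 1):
        s, c = half_add(a, b)
        print(a, b, s, c)
```

Chaining such adders, with each carry feeding the next column, is exactly how logic circuits add multi-bit binary numbers.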

The 754 Standard specifies how binary digits are used to encode several pieces of information about a floating-point value. In most cases there are just three, based on scientific notation: the fractional part of the value, an exponent for scaling the value, and the sign of the value. But the standard also specifies how to encode values that can arise as the result of computation but can’t be represented using scientific notation: infinity, for results too large in magnitude to fit in the number of bits available, and NaN (Not a Number), for results that are not real numbers at all.
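
These special values are easy to produce in Python, whose float type is a binary64 value:

```python
import math

# Overflow past the largest representable binary64 value (about 1.8e308)
# produces infinity rather than a number.
huge = 1.0e308 * 10.0
print(huge)  # inf

# Some operations have no meaningful numeric result; they produce NaN.
indeterminate = huge - huge  # inf - inf
print(indeterminate)  # nan

print(math.isinf(huge), math.isnan(indeterminate))  # True True
```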

Range and Position

This section covers how the number of bits determines the number of possible combinations, how positional number systems work, the range and precision of fixed-point encodings, and how scientific notation is used to “optimize” both range and precision at once.

Using logic circuits to perform calculations with binary floating-point numbers is very complicated: for example, just to add two numbers together the exponents must first be adjusted to get the two fixed-point parts to line up with each other properly, and the exponent of the result must then be determined after the addition is complete. The circuits to do all this are so complex that when Intel Corporation introduced the 8086 microprocessor in 1978, there was no hardware support for floating-point operations until 1980, when they introduced a separate processing circuit called the 8087 to handle all that complexity. The details of the way bits were used to represent the exponent and fixed-point parts of floating-point values in the 8087 became an “industry standard,” and in 1985 a professional organization, the Institute of Electrical and Electronics Engineers (IEEE), formalized the 8087’s model as IEEE-754, the “IEEE Standard for Floating-Point Arithmetic.” The standard has been updated since 1985, in 2008 and again in 2019, and it’s the 2019 revision that this web site is based on.
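
The exponent-alignment step described above can be sketched with a toy model in which each value is a pair (integer significand, exponent), representing significand × 2^exponent. The function name and approach here are illustrative only (real hardware shifts the smaller value right and must round; this sketch shifts left, which is exact but would not fit in fixed-width registers):

```python
def align_and_add(sig_a: int, exp_a: int, sig_b: int, exp_b: int):
    """Add sig_a * 2**exp_a and sig_b * 2**exp_b by first rewriting
    both values with a common exponent, then adding the significands."""
    common = min(exp_a, exp_b)
    # Shifting a significand left by k multiplies it by 2**k, which
    # compensates exactly for lowering its exponent by k.
    total = (sig_a << (exp_a - common)) + (sig_b << (exp_b - common))
    return total, common

# 1.5 = 3 * 2**-1 and 0.25 = 1 * 2**-2; their sum is 7 * 2**-2 = 1.75
print(align_and_add(3, -1, 1, -2))  # (7, -2)
```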

There are three parts of an IEEE-754 encoded number: the sign bit that tells whether the value is positive or negative, the exponent value, and the fixed-point part, called the significand in the standard. The four formats (binary16, binary32, binary64, and binary128) each use a fixed number of bits (16, 32, 64, or 128) to represent a value. For a given number of bits, there is a limit to how many different values can be encoded, i.e., how many different combinations of 1’s and 0’s are possible. That means, for example, that binary16 can represent only 65,536 (i.e., 2^16) different values. Even at the other extreme, binary128 has about 3.4×10^38 (i.e., 2^128) possible bit patterns; since there is an unlimited number of real values between any two numbers, even here the number of representable values is gigantic, but limited. All four of the formats use one bit to represent the sign of the encoded value: 0 for positive; 1 for negative. This is called sign-magnitude encoding. The remaining bits are divided into the exponent and significand fields:

Number of bits of information for floating-point number parts.

format     sign  exponent  significand
binary16   1     5         11
binary32   1     8         24
binary64   1     11        53
binary128  1     15        113

Did you notice that the numbers don’t add up? The binary16 format somehow uses 16 binary digits to represent 17 (1+5+11) bits of information, for example. This brings up two important concepts about IEEE-754 encoding: the implicit (“hidden”) bit, and the unique representation of values.

Normalized Significand Form

Using : rather than . to represent the position that separates the whole number part from the fractional part of binary numbers, the numbers 1111×2^0, 111:1×2^1, 11:11×2^2, 1:111×2^3, and 0:1111×2^4 all have the same value (decimal 15.0), but only the fourth one is in what’s called “normalized” or “canonical” form, where there is exactly one bit to the left of the binary point, and it’s a 1. The IEEE-754 standard always represents the significand in the canonical form, with the exponent adjusted to give the correct value. Since the bit to the left of the binary point is always a 1, it’s redundant information, and is not stored as part of the encoded value.
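
The hidden bit can be made visible by pulling a binary64 value apart with Python’s struct module. This is a sketch (the helper name fields is mine, not the standard’s); the field widths 1, 11, and 52 and the bias 1023 come from the binary64 format:

```python
import struct

def fields(x: float):
    """Split a binary64 value into its sign, exponent, and significand fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign        = bits >> 63              # 1 bit
    exponent    = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)  # 52 stored bits
    return sign, exponent, significand

sign, exponent, significand = fields(6.5)
# 6.5 = +1.625 * 2**2; the 52 stored significand bits encode only the
# fractional part 0.625 -- the leading 1 of 1.625 is the implicit
# "hidden" bit, which must be added back when decoding:
value = (-1)**sign * (1 + significand / 2**52) * 2.0**(exponent - 1023)
print(sign, exponent, significand)  # 0 1025 2814749767106560
print(value)                        # 6.5
```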

But the word “always” in the previous paragraph isn’t quite right: the standard specifies a set of values for each format called subnormals where the bit to the left of the binary point is 0. To explain that, we need to look at how exponents are defined.
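
Subnormals can be observed directly in Python, whose float.hex() method displays the bit to the left of the binary point. A sketch, assuming binary64 (Python’s float format), where the smallest normal value is about 2.2×10^-308:

```python
import sys

# Normal values always show a leading 1 before the (hexadecimal) point.
print((1.0).hex())               # 0x1.0000000000000p+0
print(sys.float_info.min.hex())  # 0x1.0000000000000p-1022 (smallest normal)

# Below that, subnormal values show a leading 0 instead.
print((5e-324).hex())            # 0x0.0000000000001p-1022 (smallest subnormal)
```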

Using binary16 as an example, there are 5 bits for the exponent field, giving 2^5 (i.e., 32) possible combinations: 00000, 00001, 00010, … 11111. As unsigned numbers, these represent the decimal values from zero to 31. But IEEE-754 exponents have to be signed in order to represent real values between 0.0 and 1.0. For example, 0.5 (i.e., 1/2) is 2^-1. There are several commonly-used ways to use binary digits to encode signed values. For example, we already mentioned that the standard uses sign-magnitude encoding for the entire floating-point value. You probably also know that two’s complement encoding is typically used for signed integer (fixed-point) values. The IEEE-754 standard uses biased notation for encoding the exponent part of a value, which we will explore further below. First, here is a small table comparing five-bit values in one’s complement, two’s complement, sign-magnitude, and biased notation. One’s complement and two’s complement are not used anywhere in the IEEE-754 standard, but I’ve included them here to show how the two encodings used by the standard compare to some alternate possibilities.

Some 5-bit Signed Notations
Bit Pattern  One’s Complement  Two’s Complement  Sign-Magnitude  Biased (Bias = 15)
00000         0                 0                 0              -15
00001         1                 1                 1              -14
01110        14                14                14               -1
01111        15                15                15                0
10000       -15               -16                -0                1
11110        -1                -2               -14               15
11111        -0                -1               -15               16
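
Biased notation is simple arithmetic: the stored bit pattern is just the true exponent plus the bias. A small sketch of the table’s last column (the function names are mine), using the binary16 bias of 15:

```python
BIAS = 15  # binary16 exponent bias

def encode(exponent: int) -> str:
    """True exponent -> 5-bit stored pattern (exponent + bias, in binary)."""
    return format(exponent + BIAS, '05b')

def decode(pattern: str) -> int:
    """5-bit stored pattern -> true exponent (unsigned value - bias)."""
    return int(pattern, 2) - BIAS

print(encode(-1))       # 01110
print(decode('11110'))  # 15
```

Notice that with this encoding, larger bit patterns (read as unsigned integers) always mean larger exponents, which is what makes biased notation convenient for comparisons.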

The logic circuits for doing arithmetic (addition and subtraction) are simpler with two’s complement, but biased notation is simpler for comparison and counting operations, such as counting how many positions to shift a significand value to normalize it.

The IEEE-754 standard reserves the all-zeros and all-ones bit patterns in the exponent field to represent special values rather than actual exponent values.
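
These reserved patterns are easy to observe in Python, again assuming the binary64 layout (11-bit exponent field, so the all-ones pattern is 2047). The helper name raw_exponent is mine:

```python
import struct

def raw_exponent(x: float) -> int:
    """Return the stored (biased) 11-bit exponent field of a binary64 value."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return (bits >> 52) & 0x7FF

print(raw_exponent(0.0))           # 0    : all zeros (zero and subnormals)
print(raw_exponent(5e-324))        # 0    : smallest subnormal
print(raw_exponent(float('inf')))  # 2047 : all ones (infinity)
print(raw_exponent(float('nan')))  # 2047 : all ones (NaN)
print(raw_exponent(1.0))           # 1023 : an ordinary biased exponent
```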


#. IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019.