An Introduction to the IEEE 754 Standard for Binary Floating-Point Arithmetic

IEEE 754

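The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation and is followed by many CPUs and floating-point units. The standard defines formats for representing floating-point numbers (including negative zero, −0, and denormal numbers), special values (infinities and NaNs, "not a number"), and a set of floating-point operations on these values. It also specifies four rounding modes and five exceptions (including when they occur and how they are handled).

IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (43 bits or more, rarely used), and double-extended precision (79 bits or more, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Most programming languages provide IEEE formats and arithmetic, though some list them as optional. For example, the C language, which predates IEEE 754, now includes IEEE arithmetic but does not make it mandatory (C's float typically denotes IEEE single precision, and double denotes double precision).

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985). It is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally numbered IEC 559:1989) ^[1]. A later standard, IEEE 854-1987, covers "radix-independent floating point" and specifies the cases of radix 2 and radix 10.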

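Anatomy of a floating-point number

The following describes the floating-point formats defined by the standard.

Bit conventions used in this article

The bits of a computer word of width W are numbered with the integers 0 to W−1. The rightmost bit is usually numbered 0, so that the lowest-numbered bit coincides with the least significant bit (lsb), the bit whose change affects the value the least.

General layout

Binary floating-point numbers are stored in sign-magnitude form. The most significant bit is the sign bit; the "exponent" field, the next e most significant bits, holds the exponent after exponent biasing; and the "fraction" field, the remaining f bits, holds the significand without its most significant bit.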
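As a rough illustration of this layout, the sketch below splits a value into its three fields for the single and double formats. It assumes Python's standard `struct` module as the way to reach the raw bits; the `FORMATS` table and the `split_fields` helper are names made up for this example and are not part of the standard.

```python
import struct

# Hypothetical helper table: (float pack code, integer pack code, e, f)
FORMATS = {
    "single": (">f", ">I", 8, 23),
    "double": (">d", ">Q", 11, 52),
}

def split_fields(x, fmt="single"):
    """Return (sign, exponent field, fraction field) of x in the given format."""
    float_code, int_code, e, f = FORMATS[fmt]
    # Reinterpret the stored bytes as an unsigned integer of the same width.
    (bits,) = struct.unpack(int_code, struct.pack(float_code, x))
    sign = bits >> (e + f)                    # most significant bit
    exponent = (bits >> f) & ((1 << e) - 1)   # next e bits, still biased
    fraction = bits & ((1 << f) - 1)          # remaining f bits
    return sign, exponent, fraction

print(split_fields(1.0))             # (0, 127, 0)
print(split_fields(-2.0, "double"))  # (1, 1024, 0)
```

The exponent values returned here are the stored (biased) fields, not the mathematical exponents.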


Exponent biasing

The exponent is stored with a bias of 2^(e−1) − 1 added to it (see Excess-N under signed number representations). Biasing is done because exponents have to be signed values in order to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored, adjusting its value to put it within an unsigned range suitable for comparison.

For example, a number with an exponent of 17 is stored with an exponent field of 17 + 2^(e−1) − 1; in single precision (e = 8) this is 17 + 127 = 144.
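The bias arithmetic can be sketched in a few lines. The function names `bias`, `stored_exponent`, and `actual_exponent` are illustrative choices for this example only.

```python
def bias(e):
    """Exponent bias for a format with e exponent bits: 2^(e-1) - 1."""
    return 2 ** (e - 1) - 1

def stored_exponent(actual, e):
    """Exponent field value that encodes the given actual exponent."""
    return actual + bias(e)

def actual_exponent(stored, e):
    """Actual exponent encoded by a given exponent field value."""
    return stored - bias(e)

print(bias(8), bias(11))        # 127 1023
print(stored_exponent(17, 8))   # 144
# Biased fields are unsigned, so they compare in the same order
# as the signed exponents they encode.
print(stored_exponent(-3, 8) < stored_exponent(6, 8))  # True
```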

Cases

The most significant bit of the significand (which is not stored) is determined by the value of the exponent field. If 0 < exponent < 2^e − 1, the most significant bit of the significand is 1, and the number is said to be normalized. If the exponent is 0, the most significant bit of the significand is 0 and the number is said to be denormalized. Three special cases arise:

1. if exponent is 0 and fraction is 0, the number is ±0 (depending on the sign bit)

2. if exponent = 2^e − 1 and fraction is 0, the number is ±infinity (again depending on the sign bit), and

3. if exponent = 2^e − 1 and fraction is not 0, the number being represented is not a number (NaN).

This can be summarized as:

Type	Exponent	Fraction
Zeroes	0	0
Denormalized numbers	0	non zero
Normalized numbers	1 to 2^e − 2	any
Infinities	2^e − 1	0
NaNs	2^e − 1	non zero
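The table translates directly into a small classifier. This is a minimal sketch that takes the raw exponent and fraction fields as integers; `classify` and its parameter names are hypothetical.

```python
def classify(exponent, fraction, e_bits=8):
    """Classify a value from its raw exponent and fraction fields."""
    all_ones = (1 << e_bits) - 1   # 2^e - 1, the largest exponent field value
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"            # exponent in 1 .. 2^e - 2, any fraction

print(classify(0, 0))       # zero
print(classify(0, 1))       # denormalized
print(classify(127, 0))     # normalized
print(classify(255, 0))     # infinity
print(classify(255, 1))     # NaN
```

For double precision the same function applies with `e_bits=11`, where the all-ones exponent is 2047.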

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.

Figure: bit values for the IEEE 754 32-bit float 0.15625 (sign 0, exponent 01111100, fraction 01000000000000000000000)

The exponent is biased by 2⁸⁻¹ − 1 = 127 in this case (Exponents in the range −126 to +127 are representable. See the above explanation to understand why biasing is done). An exponent of −127 would be biased to the value 0 but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255 but this is reserved to encode an infinity or not a number (NaN). See the chart above.

For normalized numbers, the most common case, Exp is the biased exponent and fraction is the significand minus its most significant bit.

The number has value v:

v = s × 2^e × m

Where

s = +1 (positive numbers) when the sign bit is 0

s = −1 (negative numbers) when the sign bit is 1

e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, 1 ≤ m < 2.

In the example shown above, the sign is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2⁻³, which is +0.15625.
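The formula for normalized single-precision values can be checked with a short sketch. The helper `decode_normalized` is a made-up name; it takes the three fields as integers and deliberately rejects exponent fields 0 and 255, which follow other rules.

```python
def decode_normalized(sign_bit, exp_field, fraction_field):
    """Evaluate v = s * 2^e * m for a normalized single-precision number."""
    if not 0 < exp_field < 255:
        raise ValueError("exponent field 0 and 255 are not normalized numbers")
    s = -1 if sign_bit else 1
    e = exp_field - 127                  # remove the bias
    m = 1 + fraction_field / 2 ** 23     # 1.fraction in binary
    return s * 2.0 ** e * m

# Fields of the 0.15625 example: sign 0, Exp 124, fraction .01 in binary.
print(decode_normalized(0, 0b01111100, 0b010_0000_0000_0000_0000_0000))  # 0.15625
```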

Notes:

1. Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127 : The fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)

2. −126 is the smallest exponent for a normalized number

3. There are two Zeroes, +0 (s is 0) and −0 (s is 1)

4. There are two Infinities +∞ (s is 0) and −∞ (s is 1)

5. NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish signaling NaNs from quiet NaNs

6. NaNs and Infinities have all 1s in the Exp field.

7. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2⁻¹⁴⁹ ≈ ±1.4012985×10⁻⁴⁵

8. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

±2⁻¹²⁶ ≈ ±1.175494351×10⁻³⁸

9. The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are

±(1 − 2⁻²⁴) × 2¹²⁸ ≈ ±3.4028235×10³⁸ ^[2]
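The three limits in notes 7 to 9 can be reproduced by building the bit patterns described and reading them back as single-precision values. This sketch assumes Python's `struct` module for the reinterpretation; `single_from_bits` is an illustrative helper name.

```python
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 single."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

smallest_denormal = single_from_bits(0x00000001)  # Exp = 0, fraction = 1
smallest_normal = single_from_bits(0x00800000)    # Exp = 1, fraction = 0
largest_finite = single_from_bits(0x7F7FFFFF)     # Exp = 254, fraction all 1s

print(smallest_denormal == 2.0 ** -149)                  # True
print(smallest_normal == 2.0 ** -126)                    # True
print(largest_finite == (1 - 2.0 ** -24) * 2.0 ** 128)   # True
```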

Here is the summary table from the previous section with some 32-bit single-precision examples:

Type	Exponent	Fraction	Value
Zero	0000 0000	000 0000 0000 0000 0000 0000	0.0
One	0111 1111	000 0000 0000 0000 0000 0000	1.0
Denormalized number	0000 0000	100 0000 0000 0000 0000 0000	5.9×10⁻³⁹
Large normalized number	1111 1110	111 1111 1111 1111 1111 1111	3.4×10³⁸
Small normalized number	0000 0001	000 0000 0000 0000 0000 0000	1.18×10⁻³⁸
Infinity	1111 1111	000 0000 0000 0000 0000 0000	Infinity
NaN	1111 1111	non zero	NaN

A more complex example

Figure: bit values for the IEEE 754 32-bit float −118.625

Let us encode the decimal number −118.625 using the IEEE 754 system.

1. First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".

2. Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

3. Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2⁶. This is a normalized floating point number. The fraction is the part at the right of the radix point, filled with 0 on the right until we get all 23 bits. That is 11011010100000000000000.

4. The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.

5. Putting the three fields together in the order sign, exponent, fraction gives 1 10000101 11011010100000000000000.
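The steps above can be followed mechanically. The sketch below is a simplified encoder for nonzero finite values in the normalized range that are exactly representable; it truncates rather than rounds, so it is not a general conversion routine. The name `encode_single` is hypothetical, and the last line cross-checks the result against Python's `struct` module.

```python
import math
import struct

def encode_single(x):
    """Return 'sign exponent fraction' bit strings for a normalized value."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only nonzero finite values are handled here")
    sign = 1 if x < 0 else 0                 # step 1: the sign bit
    magnitude = abs(x)                       # step 2: work on the unsigned value
    exponent = 0
    while magnitude >= 2:                    # step 3: move the radix point left
        magnitude /= 2
        exponent += 1
    while magnitude < 1:                     # ... or right, until 1 <= m < 2
        magnitude *= 2
        exponent -= 1
    fraction = int((magnitude - 1) * 2 ** 23)  # 23 bits after the radix point
    biased = exponent + 127                  # step 4: add the bias
    return f"{sign:01b} {biased:08b} {fraction:023b}"

print(encode_single(-118.625))   # 1 10000101 11011010100000000000000
print(struct.pack(">f", -118.625).hex())  # c2ed4000
```

Grouping the 32 bits into hexadecimal digits gives C2ED4000, which matches the bytes produced by `struct.pack`.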

Double-precision 64 bit

Figure: the three fields in a 64-bit IEEE 754 float

Double precision is essentially the same except that the fields are wider: 1 sign bit, 11 exponent bits, and 52 fraction bits.

The fraction part is much larger, while the exponent is only slightly larger. The standard creators believed precision is more important than range.

NaNs and Infinities are represented with Exp being all 1s (2047).

For normalized numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalized numbers the exponent is −1022, the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not. As before, both infinity and zero are signed.

Notes:

1. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2⁻¹⁰⁷⁴ ≈ ±5×10⁻³²⁴

2. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

±2⁻¹⁰²² ≈ ±2.2250738585072014×10⁻³⁰⁸

3. The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are

±(1 − 2⁻⁵³) × 2¹⁰²⁴ ≈ ±1.7976931348623157×10³⁰⁸ ^[2]
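The double-precision limits can be checked the same way, by building the bit patterns and comparing them with the values Python reports in `sys.float_info`. `double_from_bits` is an illustrative helper, and `math.ldexp(1.0, n)` is used here simply as a way to write 2ⁿ.

```python
import math
import struct
import sys

def double_from_bits(bits):
    """Interpret a 64-bit unsigned integer as an IEEE 754 double."""
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value

smallest_denormal = double_from_bits(0x0000000000000001)  # Exp = 0, fraction = 1
smallest_normal = double_from_bits(0x0010000000000000)    # Exp = 1, fraction = 0
largest_finite = double_from_bits(0x7FEFFFFFFFFFFFFF)     # Exp = 2046, fraction all 1s

print(smallest_denormal == math.ldexp(1.0, -1074))   # True
print(smallest_normal == sys.float_info.min)         # True
print(largest_finite == sys.float_info.max)          # True
print(0x7FEFFFFFFFFFFFFF >> 52)                      # 2046
```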

Comparing floating-point numbers

IEEE floating-point numbers are laid out so that the lexicographic ordering of their bits matches their numeric ordering. If NaNs are excluded, IEEE floating-point numbers can be compared as sign-magnitude integers.
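A small sketch of this idea for single precision: each value is turned into a signed integer whose magnitude is the low 31 bits and whose sign comes from the sign bit. The helper name `sign_magnitude_key` is made up for this example, and NaNs are assumed to be excluded by the caller.

```python
import struct

def sign_magnitude_key(x):
    """Map a single-precision value to an integer with the same ordering."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    magnitude = bits & 0x7FFFFFFF          # exponent and fraction fields
    return -magnitude if bits >> 31 else magnitude

values = [1.0, -118.625, 0.15625, float("inf"), -1.0,
          0.0, float("-inf"), 2.0 ** -149]
# Sorting by the integer key gives the same order as sorting numerically.
print(sorted(values, key=sign_magnitude_key) == sorted(values))  # True
```

Both +0 and −0 map to the integer 0 under this key, which agrees with the two zeros comparing equal.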

Rounding floating-point numbers

The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.

· Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754r this mode is called roundTiesToEven to distinguish it from another round-to-nearest mode)

· Round toward 0 – directed rounding towards zero

· Round toward +∞ – directed rounding towards positive infinity

· Round toward −∞ – directed rounding towards negative infinity.
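Python does not expose the binary rounding mode for its floats, so the sketch below uses the standard `decimal` module as an analogy: its `ROUND_HALF_EVEN`, `ROUND_DOWN`, `ROUND_CEILING`, and `ROUND_FLOOR` constants behave like the four modes listed above, applied here to rounding a decimal value to an integer. The `MODES` table and `round_to_integer` are names invented for this example. The last lines show round-to-nearest-even on actual doubles, relying on the fact that converting a large integer to a float rounds to the nearest representable value.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_DOWN,
                     ROUND_CEILING, ROUND_FLOOR)

MODES = {
    "nearest (ties to even)": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}

def round_to_integer(text):
    """Round a decimal string to an integer under each of the four modes."""
    value = Decimal(text)
    return {name: int(value.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}

print(round_to_integer("2.5"))
# {'nearest (ties to even)': 2, 'toward 0': 2, 'toward +inf': 3, 'toward -inf': 2}
print(round_to_integer("-2.5"))
# {'nearest (ties to even)': -2, 'toward 0': -2, 'toward +inf': -2, 'toward -inf': -3}

# Ties on real doubles: 2^53 + 1 lies midway between 2^53 and 2^53 + 2.
print(float(2 ** 53 + 1) == 2 ** 53)       # True, the even neighbour wins
print(float(2 ** 53 + 3) == 2 ** 53 + 4)   # True, again the even neighbour
```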

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, the projective mode was dropped, however. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.^[3][4][5]

Recommended functions and predicates

· Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). Note that this is one of the few operations which operates on a NaN in a way resembling arithmetic. Note that copysign is a new function under the C99 standard.

· −x returns x with the sign reversed. Note that this is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.

· scalb (y, N)

· logb (x)

· finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf

· isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"

· x <> y which turns out to have different exception behavior than NOT(x = y).

· unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.

· class (x)

· nextafter(x,y) returns the next representable value from x in the direction towards y
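Several of these operations have close counterparts in Python's standard `math` module, which makes them easy to try out. The pairing of `scalb` with `math.ldexp` and of `logb` with `math.frexp` is an approximation chosen for this sketch, and `unordered` is a hypothetical helper written out by hand.

```python
import math

nan = float("nan")
inf = float("inf")

def unordered(x, y):
    """True when x is unordered with y, i.e. either one is a NaN."""
    return math.isnan(x) or math.isnan(y)

# copysign(x, y) returns x with the sign of y, including the sign of a zero.
print(math.copysign(3.0, -0.0))        # -3.0
# Negation flips the sign bit, so -(0.0) is -0.0 ...
print(math.copysign(1.0, -(0.0)))      # -1.0
# ... while 0.0 - 0.0 is +0.0 under the default round-to-nearest mode.
print(math.copysign(1.0, 0.0 - 0.0))   # 1.0
# scalb(y, N) scales y by 2^N; logb(x) extracts the exponent of x.
print(math.ldexp(1.5, 4))              # 24.0
print(math.frexp(24.0)[1] - 1)         # 4
# finite(x), isnan(x), and the "x != x" test for NaN.
print(math.isfinite(inf), math.isnan(nan), nan != nan)  # False True True
print(unordered(1.0, nan), unordered(1.0, 2.0))         # True False
```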

References

1. Codes (in English)

2. Prof. W. Kahan. "Lecture Notes on the Status of IEEE 754" (PDF). October 1, 1997 3:36 am. Elect. Eng. & Computer Science, University of California. Retrieved on 2007-04-12.

3. John R. Hauser (March 1996). "Handling Floating-Point Exceptions in Numeric Programs" (PDF). ACM Transactions on Programming Languages and Systems 18 (2).

4. David Stevenson (March 1981). "IEEE Task P754: A proposed standard for binary floating-point arithmetic". Computer 14 (3): 51–62.

5. Kahan, W. and Palmer, J. (1979). "On a proposed floating-point standard". SIGNUM Newsletter 14 (Special): 13–21.

· Floating Point Unit by Jidan Al-Eryani

Revision of the standard

Note that the IEEE 754 standard is currently under revision. See: IEEE 754r

See also

· −0 (negative zero)

· IEEE 754r working group to revise IEEE 754-1985.

· NaN (Not a Number)

· minifloat for simple examples of properties of IEEE 754 floating point numbers

· Intel 8087 (early implementation effort)

· Q (number format) For constant resolution

External links

· IEEE 754 references

· Let's Get To The (Floating) Point by Chris Hecker

· What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg - a good introduction and explanation.

· IEEE 854-1987 History and minutes

· Converter

· Another Converter

· Converter as MS-Windows program

· Comparing doubles in C++

· An Interview with the Old Man of Floating-Point

· Coprocessor.info : x87 FPU pictures, development and manufacturer information
