Floating Point

References

Slides adapted from CMU

Outline

Background: fractional binary numbers
IEEE floating point standard
Example and properties
Rounding, addition, and multiplication
Floating point in C
Summary

Fractional Binary Numbers

Representation
- Bits to the right of “binary point” represent fractional powers of 2
- Represents rational number: $\sum_{k=-j}^{i} b_k \cdot 2^{k}$
Fractional Binary

Fractional Binary Number Examples

Value	Representation
23/4	101.11 = 4 + 1 + 1/2 + 1/4
23/8	10.111 = 2 + 1/2 + 1/4 + 1/8
23/16	1.0111 = 1 + 1/4 + 1/8 + 1/16

Observations
- Divide by 2 by shifting right (unsigned)
- Multiply by 2 by shifting left
- Numbers of form $0.1111 \ldots_{2}$ are just below 1.0

Representable Numbers

Limitation 1
- Can only exactly represent numbers of the form $\frac{x}{2^k}$
  - Other rational numbers have repeating bit representations
- Example
  - 1/3 = 0.01010101[01] $\ldots_{2}$
Limitation 2
- Just one setting of binary point within the $w$ bits
  - limited range of numbers

IEEE Floating Point

IEEE Standard 754
- Established in 1985 as uniform standard for floating point arithmetic
- Supported by all major CPUs
Driven by numerical concerns
- Nice standards for rounding, overflow, underflow
- Difficult to make fast in hardware

Floating Point Representation

Numerical Form: $(-1)^s \cdot M \cdot 2^E$
- sign bit $s$ determines whether number is negative or positive
- significand $M$ normally a fractional value in range $[1.0, 2.0)$
- exponent $E$ weights value by power of two
Encoding:
- most significant bit is sign bit $s$
- exp field encodes $E$ (but is not equal to $E$)
- frac field encodes $M$ (but is not equal to $M$)

Precision options

Single precision: 32 bits
- exp field is 8 bits
- frac field is 23 bits
Double precision: 64 bits
- exp field is 11 bits
- frac field is 52 bits
Floating Point Precision

Floating Point Numbers

Three different “kinds” of floating point numbers based on the exp field:
- normalized: exp bits are not all ones and not all zeros
- denormalized: exp bits are all zero
- special: exp bits are all one

Normalized Values

Exponent coded as a biased value: $E = exp - bias$
- $exp$: unsigned value of exp field
- $bias = 2^{k-1} -1$, where $k$ is number of exponent bits
Significand coded with implied leading 1: $M = 1.xx \ldots x_2$
- $xxx \ldots x$: bits of frac field
- minimum when $frac = 000 \ldots 0$ ($M = 1.0$)
- maximum when $frac = 111 \ldots 1$ ($M = 2.0 - \epsilon$)
- get extra leading bit for “free”

Normalized Encoding Example

Value: float F = 15213.0;
- $15213_{10} = 11101101101101_{2} = 1.1101101101101_{2} \times 2^{13}$
Significand
- $M = 1.1101101101101$
- $frac = 11011011011010000000000$
Exponent
- $E = 13$
- $bias = 127$
- $exp = 140 = 10001100_{2}$

Denormalized Values

Exponent value: $E = 1 - bias$ (instead of $exp - bias$)
Significand coded with implied leading 0: $M = 0.xxx \ldots x_{2}$
- $xxx \ldots x$: bits of frac
Cases
- $exp = 000 \ldots 0, frac = 000 \ldots 0$
  - represents zero value
  - Note distinct values: $+0$ and $-0$
- $exp = 000 \ldots 0, frac \neq 000 \ldots 0$
  - numbers closest to 0.0
  - equally spaced

Special Values

Case: $exp = 111 \ldots 1, frac = 000 \ldots 0$
- represents value $\infty$ (infinity)
- operation that overflows
- both positive and negative
- examples: 1.0/0.0 = -1.0/-0.0 = $+\infty$, 1.0/-0.0 = $-\infty$
Case: $exp = 111 \ldots 1, frac \neq 000 \ldots 0$
- Not-a-Number (NaN)
- represents case when no numeric value can be determined
- examples: sqrt(-1), $\infty = \infty$, $\infty \times 0$

C `float` Decoding Example 1

float value = 0xC0A00000
binary: 1100 0000 1010 0000 0000 0000 0000 0000
$E = exp - bias = 129 - 127 = 2_{10}$
$s = 1$ negative number
$M = 1.010 0000 0000 0000 0000 0000 = 1 + 1/4 = 1.25_{10}$
$v = (-1)^s \cdot M \cdot 2^E = (-1)^1 \cdot 1.25 \cdot 2^2 = -5_{10}$

C `float` Decoding Example 2

float value = 0x001C0000
binary: 0000 0000 0001 1100 0000 0000 0000 0000
$E = exp - bias = 1 - 127 = -126_{10}$
$s = 0$ positive number
$M = 0.001 1100 0000 0000 0000 0000 = 1/8 + 1/16 + 1/32 = 7 \cdot 2^{-5}$
$v = (-1)^s \cdot M \cdot 2^E = (-1)^0 \cdot 7 \cdot 2^{-5} \cdot 2^{-126} = 7 \cdot 2^{-131}$

Tiny Floating Point Example

8-bit floating point representation
- the sign bit is the most significant bit
- the next four bits are the $exp$, with a bias of 7
- the last three bits are the $frac$
Same general form as IEEE format
- normalized, denormalized
- representation of 0, NaN, infinity

Dynamic Range ($s = 0$)

	exp	frac	E	value
	0000	000	-6	0
closest to zero	0000	001	-6	1/512
largest denorm	0000	111	-6	7/512
smallest norm	0001	000	-6	8/512
closest to 1 below	0110	111	-1	15/16
	0111	000	0	1
closest to 1 above	0111	001	0	9/8
largest norm	1110	111	7	240
	1111	000	-	`inf`

Special Properties of the IEEE Encoding

Floating point zero same as integer zero
Can (almost) use unsigned integer comparison
- must first compare sign bits
- must consider -0 = 0
- NaNs are problematic
- Otherwise OK
  - Denormalized vs. normalized
  - Normalized vs. infinity

Floating Point Operations: Basic Idea

$x +_{f} y = round(x + y)$
$x \times_{f} y = round(x \times y)$
Basic idea
- first compute exact result
- make it fit into the desired precision
  - possibly overflow if exponent is too large
  - possibly round to fit into $frac$

Rounding

Rounding modes (illustrate with rounding USD)

$1.40 $1.60 $1.50 $2.50 -$1.50

towards zero $1 $1 $1 $2 -$1

round down ($-\infty$) $1 $1 $1 $2 -$2

round up ($\infty$) $2 $2 $2 $3 -$1

nearest even $1 $2 $2 $2 -$2
Nearest even rounds to the nearest, but if half-way in-between then round to nearest even

	$1.40	$1.60	$1.50	$2.50	-$1.50
towards zero	$1	$1	$1	$2	-$1
round down (\(-\infty\))	$1	$1	$1	$2	-$2
round up (\(\infty\))	$2	$2	$2	$3	-$1
nearest even	$1	$2	$2	$2	-$2

Closer Look at Round-To-Even

Default Rounding Mode
- Difficult to get any other kind without dropping into assembly
- All others are statistically biased
  - sum of set of positive numbers will consistently be over- or under- estimated
- Applying to other decimal places / bit positions
  - when exactly halfway between two possible values, round so that least significant digit is even
  - Example round to the nearest hundreth
    - 7.8950000 = 7.90 (halfway – round up)
    - 7.8850000 = 7.88 (halfway – round down)

Rounding Binary Numbers

Binary Fractional Numbers
- “even” when least significant bit is 0
- “half way” when bits to right of rounding position $= 100 \ldots_{2}$

Examples: round to the nearest 1/4 (2 bits right of binary point)

value	binary	rounded	action	rounded value
$2 \frac{3}{32}$	10.00011	10.00	down	$2$
$2 \frac{3}{16}$	10.00110	10.01	up	$2 \frac{1}{4}$
$2 \frac{7}{8}$	10.11100	11.00	up	$3$
$2 \frac{5}{8}$	10.10100	10.10	down	$2 \frac{1}{2}$

Rounding

Terminology
- guard bit: least significant bit of result
- round bit: the first bit removed
- sticky bit: OR of remaining bits
Round up conditions
- round = 1, sticky = 1 $\rightarrow > 0.5$
- guard = 1, round = 1, sticky = 0 $\rightarrow$ round to even

Rounding Example

Round to three bits after the binary point

fraction GRS Incr? Rounded

1.0000000 000 N 1.000

1.1010000 100 N 1.101

1.0001000 010 N 1.000

1.0011000 110 Y 1.010

1.0001010 011 Y 1.001

1.1111100 111 Y 10.000

fraction	GRS	Incr?	Rounded
1.0000000	000	N	1.000
1.1010000	100	N	1.101
1.0001000	010	N	1.000
1.0011000	110	Y	1.010
1.0001010	011	Y	1.001
1.1111100	111	Y	10.000

Floating Point Multiplication

$(-1)^{s1} \cdot M1 \cdot 2^{E1} \times (-1)^{s2} \cdot M2 \cdot 2^{E2}$
Exact result: $(-1)^{s} \cdot M \cdot 2^{E}$
- sign $s$: $s1$ ^ $s2$
- significand $M$: $M1 \times M2$
- exponent $E$: $E1 + E2$
Fixing
- If $M \geq 2$, shift $M$ right, increment $E$
- If $E$ out of range, overflow
- Round $M$ to fit $frac$ precision

Floating Point Addition

$(-1)^{s1} \cdot M1 \cdot 2^{E1} + (-1)^{s2} \cdot M2 \cdot 2^{E2}$, Assume $E1 > E2$
Exact result: $(-1)^{s} \cdot M \cdot 2^{E}$
- sign $s$, significand $M$
  - result of signed align and add, that is align at binary point
- exponent $E$: $E1$
Fixing
- If $M \geq 2$, shift $M$ right, increment $E$
- If $M < 1$, shift $M$ left $k$ positions, decrement $E$ by $k$
- If $E$ out of range, overflow
- Round $M$ to fit $frac$ precision

Properties of Floating Point Addition

Compare to those of Abelian Group
- Closed under addition, but may generate infinity or NaN
- Commutative
- Not associative
- 0 is additive identity
- Almost every element has an additive inverse, except for infinities and NaNs
Monotonicity
- $a \geq b \rightarrow a + c \geq b + c$ except for infinities and NaNs

Properties of Floating Point Multiplication

Compare to Commutative Ring
- Closed under multiplication, but may generate infinity or NaN
- Commutative
- Not associative: possibility of overflow, inexactness of rounding
- 1 is multiplicative identity
- Multiplication does not distribute over addition
Monotonicity
- $a \geq b \land c \geq 0 \rightarrow a * c \geq b * c$ except for infinities and NaNs

Floating Point in C

C guarantees two levels
- float: single precision
- double: double precision
Conversions / Casting
- Casting between int, float, and double changes bit representation
- double/float to int
  - truncates fractional part (like rounding to zero)
  - not defined when out of range or NaN
- int to double
  - exact conversion, as long as int has $\leq 53$ bit word size
- int to float
  - will round according to rounding mode

Summary

IEEE Floating Point has clear mathematical properties
Represents numbers of form $M \times 2^{E}$
One can reason about operations independent of implementation
- as if computed with perfect precision and then rounded
Not the same as real arithmetic
- violates associativity and distributivity
- makes life difficult for compilers and serious numerical applications programmers

value	binary	rounded	action	rounded value
\(2 \frac{3}{32}\)	10.00011	10.00	down	\(2\)
\(2 \frac{3}{16}\)	10.00110	10.01	up	\(2 \frac{1}{4}\)
\(2 \frac{7}{8}\)	10.11100	11.00	up	\(3\)
\(2 \frac{5}{8}\)	10.10100	10.10	down	\(2 \frac{1}{2}\)

Floating Point

References

Outline

Fractional Binary Numbers

Fractional Binary Number Examples

Representable Numbers

IEEE Floating Point

Floating Point Representation

Precision options

Floating Point Numbers

Normalized Values

Normalized Encoding Example

Denormalized Values

Special Values

C float Decoding Example 1

C float Decoding Example 2

Tiny Floating Point Example

Dynamic Range (\(s = 0\))

Special Properties of the IEEE Encoding

Floating Point Operations: Basic Idea

Rounding

Closer Look at Round-To-Even

Rounding Binary Numbers

Rounding

Rounding Example

Floating Point Multiplication

Floating Point Addition

Properties of Floating Point Addition

Properties of Floating Point Multiplication

Floating Point in C

Summary

C `float` Decoding Example 1

C `float` Decoding Example 2