- i.e. floating point numbers (real / float / double).
*Will take about two lectures*.

- <exponent, mantissa>

- limited numerical accuracy
- e.g. 1/3 and 1/5 are not represented exactly

- perhaps 6 or 16 decimal digits

(cf. Babbage's *difference engine* had (has) 30 decimal digits of precision and his *Analytical Engine* was to have 40 decimal digits of precision.)

**Problem**: Solve equation f(x)=0.

**Problem**: Integrate function . . .

e.g.

Bits | 0 | 1 to 8 | 9 to 31
---|---|---|---
Field | S | E, exponent (8) | F, mantissa or fraction (23)

- normal value (0 < E < 255) = $(-1)^{S} \times 2^{E-127} \times (1.F)_{2}$
- +0: E=0, F=0, S=0
- -0: E=0, F=0, S=1
- NaN, not a number: $E = \mathrm{FF}_{16} = 255_{10}$, F<>0
- +oo: $E = \mathrm{FF}_{16} = 255_{10}$, F=0, S=0
- -oo: $E = \mathrm{FF}_{16} = 255_{10}$, F=0, S=1
- unnormalised if E=0 & F<>0: value = $(-1)^{S} \times 2^{-126} \times (0.F)_{2}$
- least +ve value = $2^{-126} \times (0.00000000000000000000001)_{2} = 2^{-149}$
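The field layout above can be checked by pulling S, E and F out of a stored value. The following is a minimal sketch, assuming that `float` on the machine in use is the 32-bit format described here; the names `FloatFields` and `decode_float` are made up for illustration. Note that the shifts count from the least significant bit, whereas the layout above numbers the sign bit as bit 0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit float. */
typedef struct {
    unsigned sign;      /* S, 1 bit   */
    unsigned exponent;  /* E, 8 bits  */
    uint32_t fraction;  /* F, 23 bits */
} FloatFields;

/* Assumes float is a 32-bit IEEE-style single precision value. */
FloatFields decode_float(float x)
{
    uint32_t bits;
    FloatFields f;
    memcpy(&bits, &x, sizeof bits);      /* copy the bit pattern, no conversion */
    f.sign     = bits >> 31;             /* top bit                 */
    f.exponent = (bits >> 23) & 0xFFu;   /* next 8 bits             */
    f.fraction = bits & 0x7FFFFFu;       /* low 23 bits             */
    return f;
}
```

For example, `decode_float(1.0f)` should give S=0, E=127, F=0, since 1.0 = 2^{127-127} x (1.0)_2.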

It is generally better to combine small numbers first before combining them with large numbers.

Consider SUM_{i=1..} ( 1 / i )
-- sum to infinity is a *divergent* series

Also see [oneOverN.c].
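The effect of summation order on SUM 1/i can be sketched as follows. This is an illustration only and is not the contents of oneOverN.c; the function names are made up, and `volatile` is used on the assumption that it forces each partial sum to be rounded to single precision.

```c
/* Sum 1/1 + 1/2 + ... + 1/n in single precision, largest terms first. */
float sum_large_first(int n)
{
    volatile float sum = 0.0f;   /* volatile: round to float after every add */
    int i;
    for (i = 1; i <= n; i++)
        sum += 1.0f / (float)i;
    return sum;
}

/* The same terms, smallest first. */
float sum_small_first(int n)
{
    volatile float sum = 0.0f;
    int i;
    for (i = n; i >= 1; i--)
        sum += 1.0f / (float)i;
    return sum;
}

/* Double precision, smallest first, as a reference value. */
double sum_reference(int n)
{
    double sum = 0.0;
    int i;
    for (i = n; i >= 1; i--)
        sum += 1.0 / (double)i;
    return sum;
}
```

With n = 10000000, say, the large-first sum stops growing once 1/i is too small to change the running total, whereas the small-first sum stays close to the double precision reference.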

There is a number `delta' s.t.

[lecturer: use the demo; class: note the value of delta & its probable representation.]
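The condition on delta is not spelled out above. One common reading, which is an assumption here, is the smallest power of two that still makes 1 + delta differ from 1. A sketch under that assumption, with a made-up function name:

```c
#include <float.h>

/* Halve delta until adding half of it to 1.0f no longer changes 1.0f.
   Assumption: "delta" means the smallest power of two with 1 + delta != 1. */
float smallest_visible_delta(void)
{
    float delta = 1.0f;
    for (;;) {
        volatile float probe = 1.0f + delta / 2.0f;  /* rounded to float */
        if (probe == 1.0f)
            break;              /* delta/2 is lost, so delta is the answer */
        delta /= 2.0f;
    }
    return delta;
}
```

On a machine with 32-bit floats as described earlier, the result can be compared with `FLT_EPSILON` from `<float.h>`.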

In some cases limited numerical accuracy
can cause severe

**?** Big - ( Big' - small ) = ( Big - Big' ) + small **?**

if Big = Big'

- then Big - ( Big' - small ) = Big - ( Big - small ), which *may* equal Big - Big = 0

- but ( Big - Big' ) + small = 0 + small = small <> 0
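The two groupings can be tried directly. A minimal sketch in single precision; the function names are made up, and `volatile` is used on the assumption that it forces every intermediate result to be rounded to `float`.

```c
/* Big - ( Big' - small ) */
float grouped_left(float big, float big2, float small)
{
    volatile float inner = big2 - small;   /* small may be lost here */
    volatile float result = big - inner;
    return result;
}

/* ( Big - Big' ) + small */
float grouped_right(float big, float big2, float small)
{
    volatile float diff = big - big2;      /* exact when big == big2 */
    volatile float result = diff + small;
    return result;
}
```

With Big = Big' = 1.0e8f and small = 1.0f, neighbouring floats near 1.0e8 are 8 apart, so Big' - small rounds back to Big' and the first grouping gives 0 while the second gives 1.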

© L. Allison

**NB.** requires f(x) to be **continuous**

**PRE**: f(lo) < 0 and f(hi) > 0, or vice versa.

- loSign := sign( f( lo ) );
- **repeat**
  - mid := (lo + hi) / 2;
  - midSign := sign( f( mid ) );
  - **if** midSign = loSign **then**
    - lo := mid
  - **else if** midSign = 0 **then**
    - lo := hi := mid
  - **else** -- assert midSign = sign(f(hi))
    - hi := mid
  - **end_if**
- **until** hi - lo is "small enough"

[lecturer: draw illustration; class: take notes!]
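The algorithm above might be written in C roughly as follows. This is a sketch, not the lecture's own code: `tol` stands in for "small enough", a `while` loop replaces repeat/until, lo < hi is assumed, and `sign_of`, `bisect` and `cube_minus_two` are names made up for illustration.

```c
/* -1, 0 or +1 according to the sign of v */
static int sign_of(double v) { return (v > 0.0) - (v < 0.0); }

/* PRE: f is continuous on [lo, hi], lo < hi, and f(lo), f(hi) have
   opposite signs.  tol plays the role of "small enough". */
double bisect(double f(double x), double lo, double hi, double tol)
{
    int loSign = sign_of(f(lo));
    while (hi - lo > tol) {
        double mid = lo + (hi - lo) / 2.0;
        int midSign;
        if (mid <= lo || mid >= hi)
            break;                 /* interval cannot be split any further */
        midSign = sign_of(f(mid));
        if (midSign == loSign)
            lo = mid;              /* root lies in [mid, hi] */
        else if (midSign == 0)
            lo = hi = mid;         /* hit the root exactly */
        else
            hi = mid;              /* root lies in [lo, mid] */
    }
    return lo + (hi - lo) / 2.0;
}

/* Example function for trying it out: x^3 - 2, root at the cube root of 2. */
double cube_minus_two(double x) { return x * x * x - 2.0; }
```

For instance, `bisect(cube_minus_two, 1.0, 2.0, 1e-9)` should return a value near 1.259921.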

Solving f(x)=0

- Binary search for f(x)=0 is "like" binary search in a lookup table

**But** here we should not stop on hi = lo or similar

- because [______________________]

- [lecturer: use the demo; class: take notes!]

See algorithm . . .

    double rectangle(double f(double x), double lo, double hi, int N)
    {
        double width = (hi - lo) / N;   /* width of each of the N intervals */
        double sum = 0.0;
        int i;
        for (i = 0; i < N; i++)
            sum += f(lo + (i + 0.5) * width);   /* f() at centre of i-th interval */
        return sum * width;
    }

See algorithm . . .

    double trapezoidal(double f(double x), double lo, double hi, int N)
    {
        double width = (hi - lo) / N;
        double sum = (f(lo) + f(hi)) / 2;   /* the two end points count half */
        int i;
        for (i = 1; i < N; i++)
            sum += f((lo * (N - i) + hi * i) / N);   /* i-th interior point */
        return sum * width;
    }

See algorithm . . .

    double Simpson(double f(double x), double lo, double hi, int N)
    /* PRE: N is even */
    {
        double width = (hi - lo) / N;
        double sum = f(lo) + f(hi);
        int i, odd = 1;   /* i starts at 1, which is odd */
        for (i = 1; i < N; i++) {
            sum += f(lo + i * width) * (odd ? 4 : 2);   /* weights 4, 2, 4, ..., 4 */
            odd = !odd;
        }
        return sum * width / 3;
    }

- [lecturer: use the demo; class: take notes!]

- [discuss results]

- [__________] rule is best

- [__________] rule is next best

- [__________] rule is worst

- (of the methods discussed)
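One way to compare the three rules without the demo is to integrate a function whose integral is known. The sketch below repeats the three routines in compact form and adds two made-up helpers, `quartic` and `abs_error`. The test function x^4 on [0, 1], with exact integral 1/5, is an arbitrary choice for illustration.

```c
double rectangle(double f(double x), double lo, double hi, int N)
{
    double width = (hi - lo) / N, sum = 0.0;
    int i;
    for (i = 0; i < N; i++)
        sum += f(lo + (i + 0.5) * width);      /* centre of i-th interval */
    return sum * width;
}

double trapezoidal(double f(double x), double lo, double hi, int N)
{
    double width = (hi - lo) / N, sum = (f(lo) + f(hi)) / 2;
    int i;
    for (i = 1; i < N; i++)
        sum += f((lo * (N - i) + hi * i) / N); /* interior points */
    return sum * width;
}

double Simpson(double f(double x), double lo, double hi, int N) /* PRE: N is even */
{
    double width = (hi - lo) / N, sum = f(lo) + f(hi);
    int i;
    for (i = 1; i < N; i++)
        sum += f(lo + i * width) * (i % 2 ? 4 : 2); /* weights 4, 2, 4, ... */
    return sum * width / 3;
}

/* Hypothetical test function: integral over [0, 1] is exactly 1/5. */
double quartic(double x) { return x * x * x * x; }

/* Absolute difference between an estimate and the exact value. */
double abs_error(double estimate, double exact)
{
    double d = estimate - exact;
    return d < 0.0 ? -d : d;
}
```

Calling, for example, `abs_error(Simpson(quartic, 0.0, 1.0, 10), 0.2)` and the same for the other two rules gives three errors to set against the results of the demo.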

real / float and double.

- limited numerical accuracy

- errors can accumulate

- will a test for equality *ever* succeed?

- a strictly monotonic series may have a bound yet never reach it (re program termination).
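The point about equality tests can be illustrated with a small sketch. The function names and the tolerance are made up for illustration, and `double` is assumed to be the usual 64-bit binary format, in which 0.1 is not represented exactly.

```c
/* Add 0.1 ten times; mathematically the result is 1. */
double ten_tenths(void)
{
    volatile double sum = 0.0;   /* volatile: round to double after each add */
    int i;
    for (i = 0; i < 10; i++)
        sum += 0.1;
    return sum;
}

/* Compare with a tolerance instead of testing for exact equality. */
int nearly_equal(double a, double b, double tol)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff <= tol;
}
```

A loop written as "repeat x += 0.1 until x == 1.0" would therefore be at risk of never terminating, whereas a test such as `nearly_equal(x, 1.0, 1e-12)`, or an inequality such as `x >= 1.0`, does eventually succeed.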