1/40
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced | Call with Kai |
|---|
No analytics yet
Send a link to your students to track their progress
a numerical method usually produces an approximation, not the exact answer
if the error becomes too large:
the result becomes unreliable
engineering conclusions may be incorrect
main types of numerical error:
roundoff error
truncation error
also important:
error accumulation
numerical stability
convergence testing
Numerical Error
computers store real numbers using floating point representation
a floating point number contains:
sign bit
exponent
significand (mantissa)
common precisions:
Precision | Bits | Approx. Accuracy |
|---|---|---|
Single precision | 32-bit | ~7 decimal digits |
Double precision | 64-bit | ~16 decimal digits |
python/NumPy usually use:
float32 → single precision
float64 → double precision
floating point representation
some numbers can be represented exactly
example:
6.5 is exact in binary floating point
however, many decimal numbers cannot be represented exactly
example:
0.1
stored approximately as:
0.100000001490116119…
this creates representation error before calculations even begin
representation error
absolute error:
measures the direct difference:
|x−fl(x)|
where:
x = exact value
fl(x) = floating point approximation
relative error
scales the error relative to the size of the number:
|x−fl(x)∣|/|x|
relative error is usually more useful because it behaves like a percentage error
absolute and relative error
machine epsilon (ϵm) is the maximum relative error caused by floating point representation
|x−fl(x)∣|/|x| <= ϵm
IEE 754 Values:
Precision | Machine Epsilon |
|---|---|
Single precision | (2^{-23} \approx 1.19 \times 10^{-7}) |
Double precision | (2^{-52} \approx 2.22 \times 10^{-16}) |
smaller epsilon = higher precision
machine epsilon
normal arithmetic laws do not always hold in floating point arithmetic
associativity failure:
normally:
(a+b)+c=a+(b+c)
but floating point rounding can make the results different
important implication
never compare floating point numbers using:
a == binstead use:
abs(a-b) < tolerancefloating point arithmetic problems
occurs when ading number with very different magnitudes
example:
1.2257 + 249.237
the smaller number loses significant digits when exponents are aligned
result:
small contributions may partially disappear
this is called absorption
absorption error
after arithmetic operations, the result may require more digits than available precision
the extra digits must be rounded away
example:
425 + 670 = 1095
with only 3-digit precision:
1.095×10³
becomes: 1.09×10³
information is lost due to rounding
roundoff error
multiplication often requires approximately double the significand bits for exact storage
example:
6.954 × 8.212 = 57.106248
if only 4-digit precision is available:
57.10
57.11
must be used instead
this introduces rounding error
multiplication error
occurs when subtracting two nearly equal numbers
example:
102.0577 - 102.055 = 0.0027
many leading digits cancel out
small representation errors because huge relative errors
why it is dangerous:
the final answer may have:
very few accurate digits
extremely large relative error
catastrophic cancellation
round by chop:
simply cuts off extra digits
example: 1.299 → 1.2
round to nearest:
rounds to the closest machine number
1.299 → 1.3
comparison:
more accurate
slightly more computationally expensive
used in most modern systems
rounding strategies
truncation error occurs because numerical methods simplify or approximate mathematics
even with perfect arithmetic, the approximation itself causes error
example: Leibniz series for pi
pi = 4∑∞k=0 (-1)k/(2k+1)
the series is infinite, so we truncate it after N terms:
π≈4∑k=0N (−1)k/(2k+1)
because the inifinite series is cut short:
approximation error remains
this is truncation error
truncation error
errors build up during repeated calculations
sources include:
representation error
absorption
roundoff
cancellation
truncation
iterative algorithms are especially vulnerable because they perform many operations
lower precision (e.g. half precision) accumulates error faster
error accumulation
a good iterative numerical method should:
gradually approach the true solution
reduce error as iterations increase
example:
increasing N in the Leibniz series improves approximation of pi
convergence of numerical methods
used to determine whether iterations are approaching a stable solution
absolute convergence test:
∣yk−yk−1∣<ϵt
characteristics:
sensitive for large values
relative convergence test:
∣yk−yk−1∣/∣yk∣<ϵt
characteristics:
sensitive for small values
uniform convergence test:
∣yk−yk−1∣/1+∣yk∣<ϵt
characteristics:
combines benefits of both tests
often the most robust
where:
k = iteration number
ϵt = tolerance
numerical convergence tests
many numerical methods are iterative (they repeatedly improve an approximation)
a method is said to converge if the solution approaches the true answer as:
the number of iterations increases, or
the step size decreases
example: Leibniz series for pi
pi = 4∑∞k=0 (-1)k/(2k+1)
the approximation gets closer to the true value of pi
why convergence matters
error can be checked directly:
|π−πN|≤ϵt
where:
ϵt= chosen tolerance
when the exact solution is known
for many engineering ODEs the true solution is unavailable
instead:
solve using step size h
solve again using a smaller step size
compare the results
if the difference becomes sufficiently small, convergence is assumed
when the exact solution is unknown
absolute test
|yk−yk−1|<ϵt
relative test
|yk−yk−1|/|yk|<ϵt
uniform test
|yk−yk-1|/1+|yk|
where:
k = iteration number
ϵt = tolerance
common convergence tests
absolute test:
good for: small-valued solution
problem: sensitive to large values
relative test:
good for: large-valued solutions
problem: unstable when solution approaches zero
uniform test:
combines both approaches:
behaves like absolute test for small values
behaves like relative test for large values
most robust convergence test
advantages and disadvantages of convergence tests
General form:
dy/dt=f(t,y)
where:
t = independent variable
y = dependent variable
f(t,y) = derivative function
Goal:
Determine y(t)
Initial condition y(0)
Symbol | Meaning |
|---|---|
(t) | Independent variable |
(y) | Dependent variable |
(h) | Step size |
(k) | Iteration number |
(k=0) | Initial condition |
(t^{(k+1)}) | Next time step |
(y^{(k)}) | Numerical solution at step k |
first-order ODEs
Runge-Kutta methods are numerical techniques used to solve ODEs
they estimate the solution by evaluating derivatives at several locations within each step
general form:
y(k+1)=y(k)+h∑αifi
where:
fi = derivative evaluations
αi= weights
two categories:
Explicit RK Methods:
current derivative evaluations depend only on previous ones
examples:
euler
improved euler
RK4
Implicit RK Methods:
derivative evaluations depend on future unknown values
Runge-Kutta (RK) Methods
The simplest RK Method
Formula:
y(k+1)=y(k)+hf0
where:
interpretation:
uses:
one derivative evaluation
evaluated at the start of the step
essentially assumes:
slope remains constant throughout the step
characteristics:
advantages:
very simple
very cheap computationally
disadvantages:
least accurate
larger truncation error
can become unstable
Euler Method
also called a predictor-corrector method
uses two derivative evaluations
step 1: predictor
calculate initial slope:
f0=f(t(k),y(k))
predict endpoint:
y∗=y(k)+hf0
step 2: corrector
evaluate slope at predicted endpoint
f1=f(t(k)+h,y∗)
final update:
y(k+1)=y(k)+h(f0+f1/2)
characteristics:
advantages:
more accurate than Euler
better estimate of average slope
disadvantages:
twice the computational cost
Improved Euler Method
uses four derivative evaluations
RK4 Equations:
first slope:
f0=f(t,y)
second slope:
f1=f(t+h/2,y+hf0/2)
third slope:
f2 = f(t + h/2, y + hf1/2)
fourth slope:
f3 = f(t + h, y + hf2)
final update:
y(k+1) = y(k) + h(f0 + 2f1 + 2f2 + f3/6)
why RK4 works well:
uses:
start of interval
middle of interval
middle again
end of interval
to estimate the average slope
produces very high accuracy for reasonable cost
Classical RK4 Method
a compact way of writing Runge-Kutta methods
contains:
β values:
nodes: wher derivatives are evaluated
γ values:
RK matrix: how derivative evaluations depend on each other
α values:
weights: used in final weighted average
important property:
for explicit RK methods:
the RK matrix must be:
lower triangular
zero along diagonal
reason: each derivaive can only depend on previously computed derivatives
Butcher Tableau
error caused by simplifying mathematics
occurs even with perfect arithmetic
RK methods only sample the derivative at finite locations
the actual derivative may vary continuously between those locations
therefore some information is lost
truncation error
Taylor Series:
y(t+h)=y(t)+hdt/dy+(h2/2!)(d²y/dt²)+⋯
euler keeps only the first derivative term
everything else is ignored
therefore:
local error = O(h²)
Taylor Series and Euler Error
local error
error introduced in ONE step
for euler:
ϵl∝h2
halving h:
local error reduced by factor 4
global error
total accumulated error after many steps
for euler:
ϵg∝h
halving h:
global error reduced by factor 2
local vs global error
the higher the order:
faster convergence
greater accuracy
for the same step size
Method | Derivatives | Order | Local Error |
|---|---|---|---|
Euler | 1 | 1st | (O(h^2)) |
Improved Euler | 2 | 2nd | (O(h^3)) |
RK4 | 4 | 4th | (O(h^5)) |
more derivative evaluations usually improve accuracy
however:
number of derivatives does not equal order
higher-order methods can be constructed in many ways
order of accuracy
two appraoches:
method 1: smaller step size
reduce h
advantages:
more accurate
disadvantages:
more iterations
more computation time
method 2: higher-order method
use:
improved euler
RK4
advantages:
greater accuracy
disadvantages:
more derivative evaluations per step
reducing numerical error
a method is stable if its numerical solution behaves similarly to the true solution
stable solution
if exact solution decays:
numerical solution also decays
unstable solution:
if exact solution decays:
numerical solution grows
this causes error to dominate
numerical stability
consider:
dt/dy=ay
exact solution:
y=Ceat
if a<0:
the true solution decays toward zero
euler update:
y(k+1)=(1+ha)y(k)
after k steps:
y(k)=(1+ha)ky(0)
stability analysis of euler method
for stability:
|1+ha|<1
which gives:
−2<ha<0
therefore:
h< -2/a
(for a<0)
stability regions:
Condition | Behaviour |
|---|---|
(-2<ha<0) | Stable |
(ha=-2) | Neutral |
(ha<-2) | Unstable |
euler stability criterion
sampled data: measurements of a continuous process taken at discrete points
independent variable: t (e.g. time)
dependent variable: y
for n data points: ti are measurement points, yi = y(ti) are the measured values
common questions answered using sampled data:
what is the relationship between y and y?
whast is y at an unmeasured point (interpolation)?
what is y outside the measured range (extrapolation)?
three key tools: regression, interpolation, extrapolation
sampled data
concept:
regression fits a single function (commonly a polynomial) that approximate the overall trend of the data
general m-order polynomial:
f(t)=a0+a1t+a2t2+⋯+amtm=∑mk=0 aktk
linear regression = special case where m = 1
goal: choose the lowest-order polynomial that fits the data reasonably well
goodness of fit — sum of squared differences (R²):
R2=i=0∑n−1(y(ti)−f(ti))2
smaller R² = better fit
least-squares regression = finding the m + 1 coefficients a0, …, am that minimise R²
minimising R²:
treat R² as a multivariable function of the coefficients of a0, … am
find stationary points by setting all partial derivatives to zero:
∂(R2)/∂a0=∂(R2)/∂a1=⋯=∂(R2)/∂am=0
the gives m + 1 equations for m + 1 unknown coefficients
Matrix Form — The Vandermonde System:
the resulting system can be written as Ax = b:
A = Vandermonde (coefficient) matrix
b = vector built from sums involved yi
solve Ax = b with a linear algebra solver → get coefficients a0, …, am. Once found, evaluate f(tj) to approximate y(tj) at any desired point.
Worked Example (8 data points):
first-order (linear) fit: poor fit, large R² = 277
higher-order fit (m = n - 1 = 7): passes through every data point exactly, R² = 0
Overfitting:
R² = 0 does not mean the model is “best” — it may just be overfitting
warning signs/checks for overfitting:
does the function make physical sense?
are the polynomial coefficients unusually large?
are there many parameters relative to the number of data points
aim: a model that is reasonable, not just one that minimises R².
regression
General Idea:
interpolation fits n - 1 piecewise functions fi(t), each exactly matching the data at the endpoints of its subinterval:
f0(t0) = y(t0), f0(t1) = y(t1)
f1(t1) = y(t1), f1(t2) = y(t2)
… and so on
to estimate y(tj) for ti <= tj <= ti+1, use fi(tj).
Linear Interpolation:
fits a first-order polynomial (straight line) in each subinterval:
fi(t) = yi + (yi+1 - yi/ti+1 - ti)(t - ti)
simple but only C0 continuous (line has “kinks” at data points)
Natural Cubic Splines:
fits a cubic polynomial in each of the n - 1 subintervals:
fi(t)=ai+bi(t−ti)+ci(t−ti)²+di(t−ti)3
first derivative:
fi′(t)=bi+2ci(t−ti)+3di(t−ti)2
second derivative:
fi′′(t)=2ci+6di(t−ti)
“natural” condition: second derivative = 0 at the two endpoints of the dataset
provides C0, C1, and C2 continuity → much smoother curve than linear interpolation
Building the System of 4(n-1) equations
for n-1 cubics, there are 4(n-1) unknown coefficients (ai, bi, ci, di), the equations come from:
function value matching (2(n-1) equations)
left end: ai = yi
right end: ai + bihi + cihi² + dihi³ = yi+1, where hi = ti+1 - ti
first derivative continuity (n-2 equations) — between adjacent cubics at shared points:
bi+2cihi+3dihi2−bi+1=0
second derivative continuity (n-2 equations):
2ci + 6dihi - 2ci+1 = 0
natural boundary conditions (2 equations) — zero curvature at first and last points:
2c0 = 0
2cn-2 + 6dn-2hn-2 = 0
total: 4(n-1) equations → solved as Ax = b for all the cubic coefficients
Interpolation
extrapolation: estimating y outside the range of the sampled data (vs. interpolation, which is within the range).
any interpolating or regression function g(x) can in principle be used as the extrapolating function
key danger: accuracy of extrapolation is not bounded by known data — be very cautious
example: a linear fit extrapolates “reasonably”, but the high-order (order-7) polynomial that fit the data perfectly (R² = 0) blows up wildly outside the data range — a classic overfitting/extrapolation hazard
extrapolation
5.1 Overview
Used to numerically approximate I = ba∫y(t) dt using sampled data
closed rules: integration limits are known data points (t0 = a, tn-1 = b) — the focus of this course
open rules: limits are not known data points (not covered here)
Trapezoid Rule:
approximates the integrand with a straight line between a and b
ba∫y(t)dt≈h/2(y(a)+y(b)),h=b−a
geometrically: area of the trapezoid under the line connecting the endpoints
5.3 Simpson’s 1/3 Rule
fits a quadratic through a, midpoint, and b:
∫aby(t)dt≈h/3[y(a)+4y(a+b/2)+y(b)],h=(b−a)/2
requires the function value at the midpoint
5.4 Simpson’s 3/8 Rule
fits a cubic through 4 points (a, two interior points, b):
ba∫y(t)dt≈3h/8[y(a)+3y(2a+b/3)+3y(a+2b/3)+y(b)],h=3b−a
requires two additonal interior points
5.5 Limitations:
assumes the integrand is known at specific points — may not match available sampled data
accuracy suffers when:
h (step size) is large, or
the function varies significantly between the limits
Newton-Cotes Rules (Numerical Integration)
6.1 Idea
split [a,b] into multiple subintervals and sum the integral approximation over each:
I = I1 + I2 + I3 + …
increasing the number of subintervals increases accuracy (e.g. example showed % error dropping from -26.3% with 2 subintervals to -2.5% with 7 subintervals)
6.2 Composite Trapezoidal Rule
each subinterval contributes:
Ii=(ti+1−ti)(y(ti)+y(ti+1))/2
sum over all n-1 subintervals:
ba∫y(t)dt≈ n−2i=0∑ti+1−ti/2(y(ti)+y(ti+1))
Composite Simpson’s 1/3 Rule
Requires fitting a quadratic per subinterval → needs 3 points per subinterval
preconditions:
uniform step size
odd number of data points (so subintervals can pair up correctly)
adjacent subintervals share only their endpoints:
subinterval 0: y(t0), y(t1), y(t2)
subinterval 1: y(t2), y(t3), y(t4)
etc.
formula:
ba∫y(t)dt≈hi/3=0 ⌊n/2⌋−1i=0∑[y(t2i)+4y(t2i+1)+y(t2i+2)]
a composite Simpson’s 3/8 rule also exists (not derived in these slides)
Composite Rules
7.1 Overview
approximates an integral via:
normalising (change of variable) the integral to the interval [-1,1]
evaluating the integrand at specific Gauss points, weight appropriately
for low-order polynomial, this can give an exact result (zero error)
7.2 General Formula
1-1∫f(ξ)dξ=ϵ+ G−1g=0∑wgf(ξg)
ϵ = numerical error
ξg = Gauss points
wg = weights (sum to 2, the width of [−1,1])
G = number of Gauss points
7.3 Three-point Guass Quadrature (example scheme)
ξ0=−sqrt(3/5),ξ1=0,ξ2=sqrt(3/5) w0=5/9,w1=8/9,w2=5/9
7.4 Worked Example
exact integral: 1-1∫3x²dx=2
using 3-point quadrature with f(x) = 3x²:
I = 5/9(3)(sqrt(3/5))² + 8/3(3)(0)² +5/9(3)(sqrt(3/5))² = 1 + 0
result is exact (ϵ=0) — demonstrates Gaussian quadrature’s strength for polynomial integrands
Gaussian Quadrature