Course Notes: ITCS122 Numerical Methods - Curve Fitting and Descriptive Statistics

Introduction to Curve Fitting

  • Definition: Curve fitting is a technique used when data are provided for discrete values along a continuum.
  • Primary Objectives:
    * Estimation: To obtain intermediate estimates, that is, values at points between the existing discrete data values.
    * Simplification: To replace a mathematically complicated function with a simpler one. This is achieved by computing values of the complicated function at discrete points along the range of interest, and then deriving a simpler function that fits those discrete values.

General Approaches to Curve Fitting

  • Least-Squares Regression:
    * Used when data contain significant error or when working with scatter data.
    * The goal is to identify a single curve that represents the general trend or pattern of the data rather than hitting every point exactly.
  • Interpolation:
    * Used when data are considered very precise.
    * The goal is to find a curve that passes directly through every single data point in the set.
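
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration rather than a method taken from these notes: the straight-line model, the helper names `fit_line` and `interpolate_linear`, and the four sample points are all assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a0 + a1*x (assumed model for this sketch)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a0 = sy / n - a1 * sx / n                       # intercept
    return a0, a1


def interpolate_linear(xs, ys, x):
    """Piecewise-linear interpolation; xs must be sorted in ascending order."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            y0, y1 = ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the data range")


# Hypothetical measurements, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

a0, a1 = fit_line(xs, ys)
line_at_2_5 = a0 + a1 * 2.5
interp_at_2_5 = interpolate_linear(xs, ys, 2.5)

print(f"regression line: y = {a0:.2f} + {a1:.2f} x")
print(f"estimate at x = 2.5 (regression):    {line_at_2_5:.3f}")
print(f"estimate at x = 2.5 (interpolation): {interp_at_2_5:.3f}")
```

The fitted line follows the overall trend and generally misses the individual points, whereas the interpolant reproduces every data point exactly and only estimates between them.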

Descriptive Statistics and Data Summary

  • Definition: Descriptive statistics are summary statistics that quantitatively describe or summarize features from a collection of information.
  • Three Main Types of Descriptive Statistics:
    1. Central Tendency: Information concerning the averages of the values.
    2. Variability or Dispersion: Information concerning how spread out the values are.
    3. Distribution: Information concerning the frequency of each specific value.

Measures of Central Tendency

  • Arithmetic Mean ($\mu$, $\bar{y}$):
    * Calculated as the sum of the individual data points ($y_i$) divided by the number of points ($n$).
    * Formula: $\bar{y} = \frac{\sum y_i}{n}$
  • Median:
    * The midpoint of a group of data (the 50th percentile).
    * Calculation Process:
        1. Arrange the data in ascending order.
        2. If $n$ is odd, the median is the middle value.
        3. If $n$ is even, the median is the arithmetic mean of the two middle values.
  • Mode:
    * The value that occurs most frequently within the data set.
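
The three measures above can be sketched in plain Python as follows. The function names and the small `sample` list are hypothetical, chosen only to illustrate the definitions; `modes` returns a list because a data set may have more than one most-frequent value.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of the data divided by the number of points."""
    return sum(values) / len(values)


def median(values):
    """Median following the three-step process: sort, then pick the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values


def modes(values):
    """All values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


sample = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical data
print(mean(sample), median(sample), modes(sample))
```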

Measures of Variability (Spread)

  • Range: The difference between the largest value and the smallest value in the data set.
  • Standard Deviation ($s_y$):
    * Represents the average amount of variability in the data; it indicates, on average, how far each data point lies from the mean.
    * A larger standard deviation indicates the data set is more variable (spread out widely around the mean).
    * A smaller standard deviation indicates data points are grouped tightly around the mean.
    * Formula: $s_y = \sqrt{\frac{S_t}{n - 1}}$
    * Variable definition ($S_t$): The total sum of the squares of the residuals between the data points and the mean.
    * $S_t = \sum (y_i - \bar{y})^2$
  • Variance: The square of the standard deviation ($s_y^2$).
    * General Formula: $s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
    * Computational Formula (does not require predetermining $\bar{y}$): $s_y^2 = \frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}$
  • Degrees of Freedom:
    * The quantity $n - 1$ is called the degrees of freedom.
    * $S_t$ and $s_y$ are based on $n - 1$ degrees of freedom.
    * Justification for $n - 1$:
        * The residuals on which $S_t$ is based always sum to zero: $(y_1 - \bar{y}) + (y_2 - \bar{y}) + \dots + (y_n - \bar{y}) = 0$. If $\bar{y}$ and $n - 1$ of the $y_i$ values are known, the final value of $y$ is therefore predetermined. Thus, only $n - 1$ values are freely determined.
        * The spread of a single data point ($n = 1$) does not exist; applying the formula for $n = 1$ requires dividing by zero, which gives a meaningless result.
  • Coefficient of Variation (c.v.):
    * The ratio of the standard deviation to the mean, providing a normalized measure of spread.
    * Formula (expressed as percentage): $\text{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$
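
The general and computational variance formulas are equivalent. A short derivation, using only the definition $\bar{y} = \frac{\sum y_i}{n}$, expands the sum of squared residuals:

```latex
\begin{aligned}
S_t = \sum (y_i - \bar{y})^2
  &= \sum y_i^2 - 2\bar{y} \sum y_i + n\bar{y}^2 \\
  &= \sum y_i^2 - \frac{2\left(\sum y_i\right)^2}{n} + \frac{\left(\sum y_i\right)^2}{n} \\
  &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}
\end{aligned}
```

Dividing both sides by $n - 1$ gives the computational formula for $s_y^2$. Because it subtracts two nearly equal large numbers, this form can lose precision to round-off when the spread is small relative to the mean.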

Detailed Simple Statistics Example

  • Data Summary ($n = 24$ entries):
    * Sum of values ($\sum y_i$): 158.400
    * Sum of squared residuals ($\sum (y_i - \bar{y})^2$): 0.21700
    * Sum of squares ($\sum y_i^2$): 1045.657
  • Specific Indexed Data Points:

     $i$ | $y_i$ | $(y_i - \bar{y})^2$ | $y_i^2$
      1  | 6.395 | 0.04203 | 40.896
      2  | 6.435 | 0.02723 | 41.409
      3  | 6.485 | 0.01323 | 42.055
      4  | 6.495 | 0.01103 | 42.185
      5  | 6.505 | 0.00903 | 42.315
      6  | 6.515 | 0.00723 | 42.445
      7  | 6.555 | 0.00203 | 42.968
      8  | 6.555 | 0.00203 | 42.968
      9  | 6.565 | 0.00123 | 43.099
     10  | 6.575 | 0.00063 | 43.231
     11  | 6.595 | 0.00003 | 43.494
     12  | 6.605 | 0.00002 | 43.626
     13  | 6.615 | 0.00022 | 43.758
     14  | 6.625 | 0.00062 | 43.891
     15  | 6.625 | 0.00062 | 43.891
     16  | 6.635 | 0.00122 | 44.023
     17  | 6.655 | 0.00302 | 44.289
     18  | 6.655 | 0.00302 | 44.289
     19  | 6.665 | 0.00422 | 44.422
     20  | 6.685 | 0.00722 | 44.689
     21  | 6.715 | 0.01322 | 45.091
     22  | 6.715 | 0.01322 | 45.091
     23  | 6.755 | 0.02402 | 45.630
     24  | 6.775 | 0.03062 | 45.901
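
As a check on the tabulated sums, the statistics defined earlier can be recomputed directly from the 24 values. The following is a minimal sketch; the variable names are assumptions, and only the data values come from the table.

```python
import math

# The 24 tabulated values of y_i.
y = [6.395, 6.435, 6.485, 6.495, 6.505, 6.515, 6.555, 6.555,
     6.565, 6.575, 6.595, 6.605, 6.615, 6.625, 6.625, 6.635,
     6.655, 6.655, 6.665, 6.685, 6.715, 6.715, 6.755, 6.775]

n = len(y)
total = sum(y)                                # sum of y_i
mean = total / n                              # arithmetic mean
s_t = sum((yi - mean) ** 2 for yi in y)       # sum of squared residuals
variance = s_t / (n - 1)                      # general formula
std_dev = math.sqrt(variance)
cv_percent = std_dev / mean * 100             # coefficient of variation
data_range = max(y) - min(y)

# Computational formula: no need to know the mean in advance.
sum_sq = sum(yi ** 2 for yi in y)
variance_alt = (sum_sq - total ** 2 / n) / (n - 1)

print(f"mean      = {mean:.3f}")
print(f"range     = {data_range:.3f}")
print(f"S_t       = {s_t:.5f}")
print(f"variance  = {variance:.6f} (computational: {variance_alt:.6f})")
print(f"std. dev. = {std_dev:.6f}")
print(f"c.v.      = {cv_percent:.2f}%")
```

With these data the mean is 6.6, $S_t$ is 0.217, the variance is about 0.009435, the standard deviation is about 0.097133, and the coefficient of variation is about 1.47%.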

Data Distribution

  • Definition: Data distribution describes the shape with which the data are spread around the mean.
  • Histogram: A graphical representation constructed by sorting measurements into specific intervals or "bins."
  • Normal Distribution:
    * Also known as the Gaussian distribution or bell curve.
    * Characteristics:
        * Data are symmetrically distributed with no skew.
        * Follows a bell shape when plotted.
        * Most values cluster around a central region.
        * Values taper off as they move further away from the center.
    * The Empirical Rule: Known as the 68-95-99.7 rule. For normally distributed data, approximately 68%, 95%, and 99.7% of the values fall within 1, 2, and 3 standard deviations of the mean, respectively.
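
Both ideas can be tried on the 24 values from the statistics example. The sketch below sorts the data into bins and then counts how many values lie within 1, 2, and 3 standard deviations of the mean. The bin layout (11 bins of width 0.04 starting at 6.36) is an assumption made for this illustration, as are the variable names.

```python
import statistics

# The 24 values from the statistics example.
y = [6.395, 6.435, 6.485, 6.495, 6.505, 6.515, 6.555, 6.555,
     6.565, 6.575, 6.595, 6.605, 6.615, 6.625, 6.625, 6.635,
     6.655, 6.655, 6.665, 6.685, 6.715, 6.715, 6.755, 6.775]

mean = statistics.mean(y)
s_y = statistics.stdev(y)  # sample standard deviation (divides by n - 1)

# Histogram: count the measurements that fall in each bin (assumed layout).
low, width, n_bins = 6.36, 0.04, 11
counts = [0] * n_bins
for value in y:
    counts[int((value - low) / width)] += 1

for k, count in enumerate(counts):
    left = low + k * width
    print(f"{left:.2f}-{left + width:.2f} | {'#' * count}")

# Empirical rule: fraction of values within k standard deviations of the mean.
within = {}
for k in (1, 2, 3):
    inside = sum(1 for value in y if abs(value - mean) <= k * s_y)
    within[k] = inside
    print(f"within {k} s_y: {inside}/{len(y)} = {inside / len(y):.1%}")
```

For this sample, 16 of 24 values (about 66.7%) lie within one standard deviation, 23 of 24 (about 95.8%) within two, and all 24 within three, which is close to the 68-95-99.7 pattern even for so small a data set.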