Course Notes: ITCS122 Numerical Methods - Curve Fitting and Descriptive Statistics

Introduction to Curve Fitting

Definition: Curve fitting is the process of finding a curve (a function) that describes data given only at discrete values along a continuum.
Primary Objectives:
* Estimation: To estimate values at points between the existing discrete data values.
* Intermediate Estimates: To fit curves to data specifically to obtain these intermediate values.
* Simplification: To replace a mathematically complicated function with a simpler one. This is achieved by computing values of the complicated function at discrete points along a range, and then computing a simpler function that fits those discrete values.
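
As a minimal sketch of the simplification objective, the snippet below tabulates a function at a few discrete points and then uses straight-line segments between neighboring points as the simpler stand-in. The choice of `math.exp`, the interval from 0 to 1, the five sample points, and the helper name `linear_estimate` are illustrative assumptions, not part of the course material.

```python
import math

# Tabulate the "complicated" function at discrete points
# (illustrative choice: exp on the interval [0, 1]).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.exp(x) for x in xs]

def linear_estimate(x, xs, ys):
    """Estimate y at x by joining the two neighboring data points with a straight line."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the tabulated range")
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)  # fractional position between x0 and x1
            return ys[i] + t * (ys[i + 1] - ys[i])

estimate = linear_estimate(0.6, xs, ys)
print(f"estimate = {estimate:.4f}, exact = {math.exp(0.6):.4f}")
```

The estimate is close to the exact value but not equal to it; a finer table or a higher-order fitting function would reduce the gap.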

General Approaches to Curve Fitting

Least-Squares Regression:
* Used when data contain significant error or when working with scatter data.
* The goal is to identify a single curve that represents the general trend or pattern of the data rather than hitting every point exactly.
Interpolation:
* Used when data are considered very precise.
* The goal is to find a curve that passes directly through every single data point in the set.
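
The contrast between the two approaches can be sketched in a few lines. The example below fits a least-squares straight line and, separately, evaluates the interpolating polynomial in Lagrange form. The data values and the function names `fit_line` and `lagrange` are hypothetical, and the straight line and the Lagrange form are only one possible choice for each approach.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a0 + a1*x (closed-form solution)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a0 = sy / n - a1 * sx / n                       # intercept
    return a0, a1

def lagrange(x, xs, ys):
    """Evaluate at x the polynomial that passes through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

a0, a1 = fit_line(xs, ys)
for x, y in zip(xs, ys):
    print(f"x={x}: data={y}, line={a0 + a1 * x:.3f}, interpolant={lagrange(x, xs, ys):.3f}")
```

The regression line follows the overall trend and misses the individual points slightly, while the interpolant reproduces each data value.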

Descriptive Statistics and Data Summary

Definition: Descriptive statistics are summary statistics that quantitatively describe or summarize features from a collection of information.
Three Main Types of Descriptive Statistics:
1. Central Tendency: Information concerning the averages of the values.
2. Variability or Dispersion: Information concerning how spread out the values are.
3. Distribution: Information concerning the frequency of each specific value.

Measures of Central Tendency

Arithmetic Mean ($\mu$, $\bar{y}$):
* Calculated as the sum of the individual data points ($y_i$) divided by the number of points ($n$).
* Formula: $\bar{y} = \frac{\sum y_i}{n}$
Median:
* The midpoint of a group of data (the 50th percentile).
* Calculation Process:
  1. Arrange the data in ascending order.
  2. If $n$ is odd, the median is the middle value.
  3. If $n$ is even, the median is the arithmetic mean of the two middle values.
Mode:
* The value that occurs most frequently within the data set.
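
The three measures can be sketched directly from their definitions. The sample values and the function names below are hypothetical. Returning a list from `modes` is a design choice of this sketch, since a data set can have more than one most-frequent value.

```python
from collections import Counter

def mean(values):
    """Sum of the data points divided by the number of points."""
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)   # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # step 2: odd n, take the middle value
        return ordered[mid]
    # step 3: even n, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values that share the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

sample = [3, 7, 7, 2, 9, 4]  # hypothetical data
print(mean(sample), median(sample), modes(sample))
```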

Measures of Variability (Spread)

Range: The difference between the largest value and the smallest value in the data set.
Standard Deviation ($s_y$):
* Measures the typical variability in the data; it indicates roughly how far, on average, each data point lies from the mean.
* A larger standard deviation indicates the data set is more variable (spread out widely around the mean).
* A smaller standard deviation indicates data points are grouped tightly around the mean.
* Formula: $s_y = \sqrt{\frac{S_t}{n - 1}}$
* Variable definition ($S_t$): The total sum of the squares of the residuals between the data points and the mean: $S_t = \sum (y_i - \bar{y})^2$
Variance: The square of the standard deviation ($s_y^2$).
* General Formula: $s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
* Computational Formula (does not require predetermining $\bar{y}$): $s_y^2 = \frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}$
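
As a quick check that the general and computational formulas agree, the sketch below evaluates both on a small data set and takes the square root to obtain the standard deviation. The sample values and function names are hypothetical.

```python
import math

def variance_general(values):
    """Sum of squared residuals about the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    s_t = sum((y - y_bar) ** 2 for y in values)
    return s_t / (n - 1)

def variance_computational(values):
    """Same quantity from the running sums of y and y^2, without computing the mean first."""
    n = len(values)
    sum_y = sum(values)
    sum_y2 = sum(y * y for y in values)
    return (sum_y2 - sum_y ** 2 / n) / (n - 1)

sample = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical data
var_a = variance_general(sample)
var_b = variance_computational(sample)
s_y = math.sqrt(var_a)
print(f"general = {var_a:.4f}, computational = {var_b:.4f}, s_y = {s_y:.4f}")
```

The computational form needs only one pass over the data, but it subtracts two nearly equal numbers when the mean is large relative to the spread, so in floating-point arithmetic the general form is usually the safer choice.
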
Degrees of Freedom:
* The quantity $n - 1$ is called the degrees of freedom.
* $S_t$ and $s_y$ are based on $n - 1$ degrees of freedom.
* Justification for $n - 1$:
  * The deviations from the mean always sum to zero: $(y_1 - \bar{y}) + (y_2 - \bar{y}) + \dots + (y_n - \bar{y}) = 0$. If $\bar{y}$ and $n - 1$ of the $y_i$ values are known, the final value of $y$ is predetermined. Thus, only $n - 1$ values are freely determined.
  * The spread of a single data point ($n = 1$) does not exist; applying the formula for $n = 1$ gives $0/0$, which is undefined and therefore meaningless.
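
The constraint behind the $n - 1$ count is that the deviations from the mean sum to zero, so the mean together with any $n - 1$ values fixes the remaining one. A small sketch with hypothetical numbers:

```python
values = [4.0, 7.0, 1.0, 8.0]   # hypothetical data
n = len(values)
y_bar = sum(values) / n

# The deviations from the mean sum to zero (up to round-off).
deviation_sum = sum(y - y_bar for y in values)

# Knowing y_bar and the first n - 1 values pins down the last one.
known = values[:-1]
recovered_last = n * y_bar - sum(known)

print(deviation_sum, recovered_last)
```
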
Coefficient of Variation (c.v.):
* The ratio of the standard deviation to the mean, providing a normalized measure of spread.
* Formula (expressed as percentage): $\text{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$
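
Because the c.v. divides out the mean, it is unaffected by a change of units. The sketch below illustrates this with the same hypothetical measurements expressed in millimetres and in metres; `statistics.stdev` from the Python standard library uses the $n - 1$ divisor.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0

lengths_mm = [10.0, 12.0, 11.0, 13.0]          # hypothetical data
lengths_m = [v / 1000.0 for v in lengths_mm]   # same data in different units

cv_mm = coefficient_of_variation(lengths_mm)
cv_m = coefficient_of_variation(lengths_m)
print(f"c.v. (mm) = {cv_mm:.2f}%, c.v. (m) = {cv_m:.2f}%")
```

The standard deviations of the two lists differ by a factor of 1000, yet the two c.v. values match.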

Detailed Simple Statistics Example

Data Summary ($n = 24$ entries):
* Sum of values ($\sum y_i$): $158.400$
* Sum of squared residuals ($\sum (y_i - \bar{y})^2$): $0.21700$
* Sum of squares ($\sum y_i^2$): $1045.657$
Specific Indexed Data Points:

| $i$ | $y_i$ | $(y_i - \bar{y})^2$ | $y_i^2$ |
|---|---|---|---|
| 1 | 6.395 | 0.04203 | 40.896 |
| 2 | 6.435 | 0.02723 | 41.409 |
| 3 | 6.485 | 0.01323 | 42.055 |
| 4 | 6.495 | 0.01103 | 42.185 |
| 5 | 6.505 | 0.00903 | 42.315 |
| 6 | 6.515 | 0.00723 | 42.445 |
| 7 | 6.555 | 0.00203 | 42.968 |
| 8 | 6.555 | 0.00203 | 42.968 |
| 9 | 6.565 | 0.00123 | 43.099 |
| 10 | 6.575 | 0.00063 | 43.231 |
| 11 | 6.595 | 0.00003 | 43.494 |
| 12 | 6.605 | 0.00002 | 43.626 |
| 13 | 6.615 | 0.00022 | 43.758 |
| 14 | 6.625 | 0.00062 | 43.891 |
| 15 | 6.625 | 0.00062 | 43.891 |
| 16 | 6.635 | 0.00122 | 44.023 |
| 17 | 6.655 | 0.00302 | 44.289 |
| 18 | 6.655 | 0.00302 | 44.289 |
| 19 | 6.665 | 0.00422 | 44.422 |
| 20 | 6.685 | 0.00722 | 44.689 |
| 21 | 6.715 | 0.01322 | 45.091 |
| 22 | 6.715 | 0.01322 | 45.091 |
| 23 | 6.755 | 0.02402 | 45.630 |
| 24 | 6.775 | 0.03062 | 45.901 |
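
Applying the formulas from the earlier sections to these 24 values can be sketched as follows. The variable names are illustrative; the data are the $y_i$ values listed above.

```python
import math

# The 24 y_i values from the example.
y = [
    6.395, 6.435, 6.485, 6.495, 6.505, 6.515, 6.555, 6.555,
    6.565, 6.575, 6.595, 6.605, 6.615, 6.625, 6.625, 6.635,
    6.655, 6.655, 6.665, 6.685, 6.715, 6.715, 6.755, 6.775,
]

n = len(y)
sum_y = sum(y)
y_bar = sum_y / n                        # arithmetic mean
s_t = sum((v - y_bar) ** 2 for v in y)   # sum of squared residuals
variance = s_t / (n - 1)                 # n - 1 degrees of freedom
s_y = math.sqrt(variance)                # standard deviation
cv_percent = s_y / y_bar * 100           # coefficient of variation
data_range = max(y) - min(y)             # largest minus smallest value

print(f"mean = {y_bar:.3f}, S_t = {s_t:.5f}")
print(f"s_y = {s_y:.6f}, variance = {variance:.6f}")
print(f"c.v. = {cv_percent:.2f}%, range = {data_range:.3f}")
```

With the summary values above, this gives a mean of $158.400 / 24 = 6.6$, a standard deviation of $\sqrt{0.217 / 23} \approx 0.0971$, a variance of about $0.00943$, and a c.v. of about $1.47\%$.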

Data Distribution

Definition: Data distribution describes the shape with which the data are spread around the mean.
Histogram: A graphical representation constructed by sorting measurements into specific intervals or "bins."
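
A minimal sketch of the binning step behind a histogram is shown below. The function name `histogram`, its parameters, the bin edges, and the measurements are all hypothetical. Each bin here includes its lower edge and excludes its upper edge, which is one common convention.

```python
def histogram(values, start, width, bins):
    """Count the values falling in each of `bins` equal-width intervals
    [start + k*width, start + (k+1)*width); values outside are ignored."""
    counts = [0] * bins
    for v in values:
        k = int((v - start) // width)  # index of the bin containing v
        if 0 <= k < bins:
            counts[k] += 1
    return counts

measurements = [1.2, 1.7, 2.1, 2.4, 2.6, 3.3, 3.9, 2.2]  # hypothetical data
counts = histogram(measurements, start=1.0, width=1.0, bins=3)

# Text rendering: one '#' per measurement in the bin.
for k, c in enumerate(counts):
    lo = 1.0 + k * 1.0
    print(f"[{lo:.1f}, {lo + 1.0:.1f}): {'#' * c}")
```
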
Normal Distribution:
* Also known as the Gaussian distribution or bell curve.
* Characteristics:
  * Data are symmetrically distributed with no skew.
  * Follows a bell shape when plotted.
  * Most values cluster around a central region.
  * Values taper off as they move further away from the center.
* The Empirical Rule: Known as the 68-95-99.7 rule, it states that approximately 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively.
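
The percentages in the empirical rule can be reproduced from the normal distribution itself. For a normal distribution, the fraction of values within $k$ standard deviations of the mean equals $\operatorname{erf}(k / \sqrt{2})$; the sketch below evaluates this with `math.erf` from the Python standard library. The function name `fraction_within` is illustrative.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k) * 100:.1f}%")
```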