Visual Display of Data & Statistics

Visual Display of Data

Frequency Distribution

Definition: A frequency distribution is a method of organizing data to show how often each value or group of values occurs. Data is grouped into categories or intervals, and the frequency (the number of observations in each category) is recorded, replacing the raw list with a compact summary.

Types of Frequency Distribution

Tabular Form

Data is presented in a table showing categories (or intervals) and their corresponding frequencies.
Example: Test scores of 20 students
- Score Range: 0–10, Frequency: 2
- Score Range: 11–20, Frequency: 4
- Score Range: 21–30, Frequency: 6
- Score Range: 31–40, Frequency: 5
- Score Range: 41–50, Frequency: 3
- This table indicates, for instance, that $6$ students scored between $21$ and $30$ .

Graphical Form

Frequencies are displayed using visual representations.

Bar Chart

A bar chart graphically displays a frequency distribution using rectangular bars of equal width.
The height (or length) of each bar represents the frequency of a category or group.
Best suited for categorical data (e.g., favorite colors, fruits, movie genres).
Categories are placed along the x-axis, and frequencies are shown on the y-axis.
Bars are separated by gaps, unlike histograms.
Example: Favorite fruits of 10 students
- Survey list: Apple, Orange, Banana, Apple, Mango, Apple, Banana, Mango, Apple, Orange
- Frequency Table:
  - Fruit: Apple, Frequency: 4
  - Fruit: Orange, Frequency: 2
  - Fruit: Banana, Frequency: 2
  - Fruit: Mango, Frequency: 2
- A bar chart would show four distinct bars for Apple ( $4$ ), Orange ( $2$ ), Banana ( $2$ ), and Mango ( $2$ ) with gaps between them.
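The tally behind this frequency table can be sketched in Python. This is a minimal illustration using the standard library's `collections.Counter`; the variable names (`survey`, `frequencies`) are chosen for this example and do not come from the notes.

```python
from collections import Counter

# Survey responses from the favorite-fruit example
survey = ["Apple", "Orange", "Banana", "Apple", "Mango",
          "Apple", "Banana", "Mango", "Apple", "Orange"]

# Count how often each category occurs
frequencies = Counter(survey)

# Print a rough text version of the bar chart, one row per category
for fruit, count in frequencies.most_common():
    print(f"{fruit:<7} {'#' * count} ({count})")
```

Each row plays the role of one bar: the number of `#` marks is the frequency of that category.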

Histogram

A histogram graphically displays a frequency distribution for numerical data.
It groups numbers into intervals (called bins or classes), and the bins touch each other to emphasize the continuous nature of the data.
The x-axis represents the intervals (e.g., score ranges).
The y-axis represents the frequency (the number of data points within each interval).
Bins are adjacent (no gaps).
How to construct bin size or range for a Histogram:
1. Decide on the number of classes (bins):
  - For large datasets: $10$ to $20$ classes.
  - For small datasets: $4$ to $6$ classes.
  - Thumb rule: Number of bins $= \frac{\text{number of observations}}{\text{desired class size (at least 4)}}$ .
2. Compute the width of each class:
  - $\text{Class width} = \frac{\text{Range of data}}{\text{Number of classes from Step 1}}$
  - Always round this result up to a convenient number.
3. Select lower limits:
  - Choose the smallest data value (or a convenient smaller value) as the lower limit of the first class.
  - Add multiples of the class width (from Step 2) to generate the lower limits of the remaining classes.
4. Find upper class limits:
  - Rule: Upper limit $=$ Lower limit $+$ width $−$ smallest significant unit in the data (e.g., for whole numbers, subtract $1$ ).
  - This prevents overlap of class intervals.
5. Define class boundaries:
  - Take the midpoint between the upper limit of one class and the lower limit of the next class (e.g., apply a $\pm 0.5$ rule for whole numbers).
  - This ensures intervals touch and all data values are included without ambiguity.
Example: Test scores of 20 students (Dataset: $12, 25, 33, 41, 27, 38, 45, 21, 29, 19, 10, 30, 22, 36, 40, 14, 18, 32, 47, 24$ )
- Step 1: Number of classes. There are $n = 20$ observations. Choose $5$ classes (since the data set is small).
- Step 2: Class width.
  - Range $= 47 - 10 = 37$ .
  - Class width $= \frac{37}{5} = 7.4$ . Round up to $8$ .
- Step 3: Lower limits. Starting at the minimum value $10$ , successive lower limits are: $10, 18, 26, 34, 42$ .
- Step 4: Upper class limits. Using the rule (lower limit $+$ width $- 1$ ):
  - $10 \rightarrow 17$ ( $10 + 8 - 1 = 17$ )
  - $18 \rightarrow 25$ ( $18 + 8 - 1 = 25$ )
  - $26 \rightarrow 33$ ( $26 + 8 - 1 = 33$ )
  - $34 \rightarrow 41$ ( $34 + 8 - 1 = 41$ )
  - $42 \rightarrow 49$ ( $42 + 8 - 1 = 49$ )
- Step 5: Class boundaries. Applying the $\pm 0.5$ rule (halfway between limits) to the sorted data $10, 12, 14, 18, 19, 21, 22, 24, 25, 27, 29, 30, 32, 33, 36, 38, 40, 41, 45, 47$ :
  - $[9.5, 17.5)$ (contains the data points $10, 12, 14$ )
  - $[17.5, 25.5)$ (e.g., $17.5$ is the midpoint between $17$ and $18$ )
  - $[25.5, 33.5)$
  - $[33.5, 41.5)$
  - $[41.5, 49.5)$
- Frequency Table:
  - Class Interval (limits): 10–17, Class Boundaries: 9.5–17.5, Frequency: 3
  - Class Interval (limits): 18–25, Class Boundaries: 17.5–25.5, Frequency: 6
  - Class Interval (limits): 26–33, Class Boundaries: 25.5–33.5, Frequency: 5
  - Class Interval (limits): 34–41, Class Boundaries: 33.5–41.5, Frequency: 4
  - Class Interval (limits): 42–49, Class Boundaries: 41.5–49.5, Frequency: 2
- A histogram would show these bins with their corresponding frequencies, with no gaps between bars.
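The construction steps can be sketched in Python as follows. This is a minimal sketch, not a general-purpose binning routine: the helper names `build_classes` and `frequency_table` are hypothetical, "round up to a convenient number" is interpreted here as rounding up to the next whole number, and the data are assumed to be whole numbers (smallest unit $1$).

```python
import math

scores = [12, 25, 33, 41, 27, 38, 45, 21, 29, 19,
          10, 30, 22, 36, 40, 14, 18, 32, 47, 24]

def build_classes(data, num_classes, unit=1):
    """Return (lower, upper) class limits following Steps 2 to 4."""
    data_range = max(data) - min(data)
    width = math.ceil(data_range / num_classes)      # Step 2: round up
    lowers = [min(data) + i * width for i in range(num_classes)]  # Step 3
    return [(lo, lo + width - unit) for lo in lowers]             # Step 4

def frequency_table(data, classes):
    """Count the values that fall inside each (lower, upper) class."""
    return [sum(lo <= x <= hi for x in data) for lo, hi in classes]

classes = build_classes(scores, num_classes=5)
counts = frequency_table(scores, classes)

for (lo, hi), freq in zip(classes, counts):
    # Step 5: boundaries lie half a unit outside the class limits
    print(f"{lo}-{hi}  boundaries {lo - 0.5}-{hi + 0.5}  frequency {freq}")
```

One caveat of this simple version: if the range divides evenly by the number of classes, the last class may stop just short of the maximum, so the width should then be increased by hand.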

Scatter Plots

A scatter plot displays the relationship between two variables.
Each data point is represented as a dot on a coordinate plane, with the x-axis for one variable and the y-axis for the other.
The pattern of points reveals if the variables are related.
Uses of scatter plots:
- To check for a positive relationship (one variable increases as the other increases).
- To check for a negative relationship (one variable increases as the other decreases).
- To detect no clear relationship (points scattered randomly).
- To identify possible outliers (points significantly far from the rest).

Linear Regression

Bivariate Data

Definition: Bivariate data consists of two variables measured on the same individual, object, or event. It is used to analyze the relationship between these two variables, often through graphs, correlation, or regression.
If both variables are numerical, they are typically plotted on a scatter diagram to study their relationship (positive, negative, or no correlation).
If one variable depends on the other, linear regression is often used.
General Representation of Bivariate Data: For $n$ observations, data is represented as a set of ordered pairs: $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)\}$ . For $i = 1, 2, \dots, n$ :
- $x_i$ is the value of the first variable (independent variable).
- $y_i$ is the corresponding value of the second variable (dependent variable).
Tabular Representation: Can also be shown in a two-column table (Variable 1: $x_i$ , Variable 2: $y_i$ ).
Example: Hours studied (x) and test score (y)
- Hours Studied (x): 2, Test Score (y): 55
- Hours Studied (x): 4, Test Score (y): 65
- Hours Studied (x): 6, Test Score (y): 72
- Hours Studied (x): 8, Test Score (y): 85
- Hours Studied (x): 10, Test Score (y): 92
- This data checks if increased study hours lead to increased test scores.

Regression and Linear Regression

Definition of Regression: A statistical method to study the relationship between a dependent variable and one or more independent variables. It helps predict the dependent variable's value based on independent variable values.
Definition of Linear Regression: The simplest form of regression, assuming a linear relationship between the dependent variable $y$ and the independent variable $x$ . It is modeled by the equation: $y = mx + b$
- $m$ is the slope of the line, indicating the rate of change of $y$ with respect to $x$ .
- $b$ is the intercept, representing the value of $y$ when $x = 0$ .
Graphically, data points are plotted on a scatter plot, and the straight line $y = mx + b$ is drawn to best describe the data trend.

Examples of Linear Relationship

Example 1: Positive Linear Relationship
- Dataset: $\{(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)\}$
- Variables increase together.
- Best-fit line: $y = 2x + 1$ (slope $m = 2$ , intercept $b = 1$ ).
Example 2: Negative Linear Relationship
- Dataset: $\{(1, 10), (2, 8), (3, 6), (4, 4), (5, 2)\}$
- As $x$ increases, $y$ decreases.
- Best-fit line: $y = -2x + 12$ (slope $m = -2$ , intercept $b = 12$ ).

Slope of a Line Through Two Points

If a straight line passes through two distinct points $(x_1, y_1)$ and $(x_2, y_2)$ , its slope is given by:
- $m = \frac{y_2 - y_1}{x_2 - x_1}$ (where $x_1 \neq x_2$ ).
Once $m$ is found, the equation of the line can be written using the point-slope form:
- $y - y_1 = m(x - x_1)$
This can be rearranged into the slope-intercept form:
- $y = mx + b$ where $b = y_1 - m x_1$ .
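As a quick illustration, the two formulas can be wrapped in a small Python function. The name `line_through` is hypothetical, and the two sample points are taken from the positive linear relationship example.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two distinct points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("slope is undefined when x1 == x2 (vertical line)")
    m = (y2 - y1) / (x2 - x1)   # slope formula
    b = y1 - m * x1             # intercept from the point-slope form
    return m, b

m, b = line_through((1, 3), (5, 11))
print(f"y = {m}x + {b}")
```

Any other pair of points from the same dataset should give the same slope and intercept, since all five points lie on one line.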

Residuals in Linear Regression

Definition: A residual is the difference between the observed value of the dependent variable ( $y_i$ ) and the value predicted by the regression line ( $\hat{y}_i$ ) for a data point $(x_i, y_i)$ .
- $\text{Residual} = y_i - \hat{y}_i$
Residuals measure how far each data point lies from the fitted line.
Graphical Representation: The residuals are the vertical distances from each observed data point to the regression line.
Note: When the line represents the linear regression line, the slope $m$ and the y-intercept $b$ are known as regression coefficients.
Example: Farmer's Fertilizer and Crop Yield
- Data from 6 plots of land for fertilizer used ( $x$ , in kg) and crop yield ( $y$ , in quintals). The predicted yields listed here correspond to the line $\hat{y} = 5.2x + 29.3$ , which is close to, but not the same as, the least-squares line $y = 5.29x + 31.33$ fitted to these data in the interpolation example:
  - x: 2, y (Observed Yield): 40, $\hat{y}$ (Predicted Yield): 39.7, Residual $e_i = y_i - \hat{y}_i$ : 0.3
  - x: 4, y (Observed Yield): 55, $\hat{y}$ (Predicted Yield): 50.1, Residual $e_i = y_i - \hat{y}_i$ : 4.9
  - x: 6, y (Observed Yield): 65, $\hat{y}$ (Predicted Yield): 60.5, Residual $e_i = y_i - \hat{y}_i$ : 4.5
  - x: 8, y (Observed Yield): 70, $\hat{y}$ (Predicted Yield): 70.9, Residual $e_i = y_i - \hat{y}_i$ : -0.9
  - x: 10, y (Observed Yield): 85, $\hat{y}$ (Predicted Yield): 81.3, Residual $e_i = y_i - \hat{y}_i$ : 3.7
  - x: 12, y (Observed Yield): 95, $\hat{y}$ (Predicted Yield): 91.7, Residual $e_i = y_i - \hat{y}_i$ : 3.3
Note on Residual Squares:
- For each data point $(x_i, y_i)$ , the residual is $e_i = y_i - \hat{y}_i$ .
- The residual square is $e_i^2 = (y_i - \hat{y}_i)^2$ .
- The sum of these across all points gives the Residual Sum of Squares ( $R^2$ ): $R^2 = \sum_{i=1}^n e_i^2$ .
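A minimal Python sketch of these definitions, using the observed and predicted yields from the fertilizer table. The variable names are illustrative only, and `rss` stands for the residual sum of squares that these notes call $R^2$.

```python
# Observed and predicted yields from the fertilizer example table
observed = [40, 55, 65, 70, 85, 95]
predicted = [39.7, 50.1, 60.5, 70.9, 81.3, 91.7]

# Residual for each point: observed minus predicted
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

# Residual sum of squares: add up the squared residuals
rss = sum(e ** 2 for e in residuals)

print([round(e, 1) for e in residuals])
print(round(rss, 2))
```

A smaller residual sum of squares means the line lies closer to the data overall; the least-squares line is the one that makes this sum as small as possible.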

Interpolation, Extrapolation, and Correlation

Interpolation and Extrapolation

Definition (Interpolation): The process of estimating or predicting the value of a dependent variable for an independent variable that lies within the range of the observed data points.
Definition (Extrapolation): The process of estimating or predicting the value of a dependent variable for an independent variable that lies outside the range of the observed data points.
Example: Farmer's Fertilizer and Crop Yield (using the fitted regression line $y = 5.29x + 31.33$ )
- Data:
  - x (Fertilizer, kg): 2, y (Observed Yield): 40
  - x (Fertilizer, kg): 4, y (Observed Yield): 55
  - x (Fertilizer, kg): 6, y (Observed Yield): 65
  - x (Fertilizer, kg): 8, y (Observed Yield): 70
  - x (Fertilizer, kg): 10, y (Observed Yield): 85
  - x (Fertilizer, kg): 12, y (Observed Yield): 95
- The regression coefficients for the least squares fitted line are: $m \approx 5.29$ , $b \approx 31.33$ . The fitted line is $y = 5.29x + 31.33$ .
- Question 1: Interpolate the yield when the farmer uses $x = 7$ kg of fertilizer.
  - $y(7) = 5.29(7) + 31.33 = 37.03 + 31.33 = 68.36$
  - The required yield is $68.36$ quintals.
- Question 2: Extrapolate the yield when the farmer uses $x = 15$ kg of fertilizer.
  - $y(15) = 5.29(15) + 31.33 = 79.35 + 31.33 = 110.68$
  - The required yield is $110.68$ quintals.
- Question 3: If the farmer obtains a crop yield of $y = 90$ quintals, estimate the amount of fertilizer used (x).
  - $90 = 5.29x + 31.33$
  - $5.29x = 90 - 31.33$
  - $5.29x = 58.67$
  - $x = \frac{58.67}{5.29} \approx 11.09$
  - The estimated amount of fertilizer used is approximately $11.09$ kg.
Note: Interpolation is generally more reliable than extrapolation because interpolation predicts values within the observed data range (where the trend is established), while extrapolation predicts values outside this range (where the trend may not hold).
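The fitted coefficients and the three predictions can be reproduced with a short Python sketch. The notes do not state how $m$ and $b$ were obtained; the code assumes the standard least-squares formulas $m = S_{xy}/S_{xx}$ and $b = \bar{y} - m\bar{x}$ , and the function names (`least_squares`, `predict`, `fertilizer_for`) are hypothetical.

```python
def least_squares(xs, ys):
    """Slope and intercept of the least-squares line (standard formulas)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    b = y_bar - m * x_bar
    return m, b

fertilizer = [2, 4, 6, 8, 10, 12]          # x, in kg
crop_yield = [40, 55, 65, 70, 85, 95]      # y, in quintals
m, b = least_squares(fertilizer, crop_yield)

def predict(x):
    """Estimated yield for x kg of fertilizer."""
    return m * x + b

def fertilizer_for(y):
    """Invert the fitted line: fertilizer amount giving yield y."""
    return (y - b) / m

print(f"m = {m:.2f}, b = {b:.2f}")
print(f"x = 7  (interpolation): {predict(7):.2f}")
print(f"x = 15 (extrapolation): {predict(15):.2f}")
print(f"y = 90 -> x = {fertilizer_for(90):.2f}")
```

Because the code keeps the unrounded coefficients, its outputs can differ in the second decimal place from the hand calculations, which use $m$ and $b$ rounded to two decimals.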

Correlation and Correlation Coefficient

Definition (Correlation): A statistical measure describing the strength and direction of the linear relationship between two variables. It indicates whether an increase in one variable is consistently associated with an increase (positive), decrease (negative), or no consistent change (no correlation) in the other variable.
Definition (Correlation Coefficient): Denoted by $\omega$ (or $r$ ), it is a numerical value that quantifies the degree of linear correlation between two variables.
Properties:
- The correlation coefficient ranges from $-1$ to $1$ : $-1 \leq \omega \leq 1$ .
- $\omega = 1$ : Perfect positive correlation.
- $\omega = -1$ : Perfect negative correlation.
- $\omega = 0$ : No linear correlation.
- Generally, an absolute value less than $0.5$ is considered too weak to suggest a meaningful correlation.
Rule of Thumb for Strength of Correlation:
- Weak: $|\omega| < 0.5$
- Moderate: $0.5 \leq |\omega| < 0.7$
- Strong: $|\omega| \geq 0.7$
- This can be visualized on a scale from $-1$ (Perfect Negative) through $0$ (No Correlation) to $1$ (Perfect Positive).
Graphical Representation of Correlation Coefficients: Scatter plots can visually depict strong positive ( $\omega \approx 0.9$ ), weak positive ( $\omega \approx 0.3$ ), strong negative ( $\omega \approx -0.9$ ), and no correlation ( $\omega \approx 0$ ).
Relation Between Correlation Coefficient and Slope:
- The slope $m$ of the regression line and the correlation coefficient $\omega$ are related in terms of sign only:
- If $\omega > 0$ , then the slope $m > 0$ (the line rises from left to right).
- If $\omega < 0$ , then the slope $m < 0$ (the line falls from left to right).
- If $\omega = 0$ , then the slope $m = 0$ (no linear relationship).
- Important: The magnitude of the slope cannot be read off from the value of the correlation coefficient, because the slope also depends on the units and spread of the two variables. Correlation measures the strength of linear association, while slope measures the rate of change.
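The difference between slope and correlation can be checked numerically. The sketch below assumes the standard Pearson formula for the correlation coefficient (the notes do not give a formula) and uses the hours-studied data from the bivariate example; rescaling the scores by a factor of $10$ stands in for a change of units.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard formula, assumed here)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

hours = [2, 4, 6, 8, 10]
scores = [55, 65, 72, 85, 92]
scores_x10 = [10 * y for y in scores]    # same data in different units

r1, r2 = pearson_r(hours, scores), pearson_r(hours, scores_x10)
m1, m2 = slope(hours, scores), slope(hours, scores_x10)
print(f"correlation: {r1:.4f} vs {r2:.4f}")
print(f"slope:       {m1:.2f} vs {m2:.2f}")
```

The slope grows tenfold with the change of units while the correlation coefficient stays the same, and both keep the same sign.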
Correlation vs. Causation:
- Correlation measures the strength and direction of a linear relationship.
- Causation means that changes in one variable directly cause changes in the other.
- Correlation does not imply causation.
- Examples:
  - Ice cream sales and drowning incidents may be positively correlated, but both are caused by hot summer weather, not a direct causal link between ice cream and drowning.
  - Shoe size and reading ability in children may be correlated, but the common underlying cause is age.
- Therefore, while correlation and regression suggest patterns, they should not be interpreted as proof of cause-and-effect without further evidence.

Outliers

Definition (Outlier): A data point that lies significantly far from the overall pattern of the data.
Outliers can result from unusual conditions, measurement errors, or genuinely rare events, and they can substantially affect correlation and regression analysis.
Example (Car Age vs. Resale Value):
- x (Car Age, years): 1, y (Resale Value, \$1000): 25
- x (Car Age, years): 2, y (Resale Value, \$1000): 22
- x (Car Age, years): 3, y (Resale Value, \$1000): 20
- x (Car Age, years): 4, y (Resale Value, \$1000): 18
- x (Car Age, years): 5, y (Resale Value, \$1000): 15
- x (Car Age, years): 6, y (Resale Value, \$1000): 13
- x (Car Age, years): 7, y (Resale Value, \$1000): 11
- x (Car Age, years): 8, y (Resale Value, \$1000): 50 (Outlier)
- Here, the 8-year-old car's resale value of \$50,000 is unusually high compared to the general decreasing trend, suggesting it might be an outlier (e.g., a rare vintage model).
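To see how strongly a single outlier can affect correlation, the coefficient can be computed with and without the last point. This is a sketch under the assumption that the standard Pearson formula is used; `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard formula, assumed here)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

age = [1, 2, 3, 4, 5, 6, 7, 8]               # car age in years
value = [25, 22, 20, 18, 15, 13, 11, 50]     # resale value in $1000; last is the outlier

r_with = pearson_r(age, value)
r_without = pearson_r(age[:-1], value[:-1])  # drop the 8-year-old car

print(f"with outlier:    {r_with:.3f}")
print(f"without outlier: {r_without:.3f}")
```

Without the outlier the correlation is strongly negative; with it, the sign of the coefficient flips, even though seven of the eight cars follow a clear decreasing trend.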

Exponential, Logarithms, and Half-Life

Exponential Function

Definition: An exponential function is of the form $y = a^x$ , where the base $a$ is a positive real number ( $a > 0$ , $a \neq 1$ ) and the exponent $x$ is a real number.
Types of Exponential Functions:
- Exponential Growth: If $a > 1$ , the function $y = a^x$ increases rapidly as $x$ increases.
- Exponential Decay: If $0 < a < 1$ , the function $y = a^x$ decreases rapidly as $x$ increases.
- Note: If $a = 1$ , the function becomes $y = 1^x = 1$ , which is a constant function (a horizontal line at $y = 1$ ) and therefore not considered exponential growth or decay.
Real-World Examples of Exponential Functions:
- Finance: Calculating compound interest over time.
- Biology: Studying population growth of bacteria or viruses.
- Chemistry: Measuring acidity levels using the pH scale.
- Physics: Analyzing radioactive decay of unstable elements.
Laws of Exponents:
1. $a^{x+y} = a^x \cdot a^y$
2. $a^{x-y} = a^x \cdot a^{-y} = \frac{a^x}{a^y}$
3. $(a^x)^y = a^{xy}$
4. $a^x b^x = (ab)^x$
5. $a^0 = 1$ (provided $a \neq 0$ )
6. $a^{-x} = \frac{1}{a^x}$
Note: The general form of an exponential function is $y = b \cdot a^x$ , where $b > 0$ is the initial value and $a > 0$ , $a \neq 1$ is the base.
Examples:
- Population growth: $P(t) = P_0 \cdot e^{rt}$ , where $r > 0$ is the growth constant.
- Radioactive decay: $N(t) = N_0 \cdot e^{-\omega t}$ , where $\omega > 0$ is the decay constant.

Logarithm

Definition: The logarithm of a number $x$ to the base $a$ (with $a > 0$ , $a \neq 1$ ) is the exponent $y$ such that $a^y = x$ . It is written as $\log_a x = y$ .
Note: The logarithm is the inverse of the exponential function. That is, if $y = a^x$ , then $x = \log_a y$ .
Laws of Logarithms:
1. $\log_a(xy) = \log_a x + \log_a y$
2. $\log_a(\frac{x}{y}) = \log_a x - \log_a y$
3. $\log_a(x^k) = k \log_a x$ (also valid if $k$ is a variable)
4. $\log_a(a) = 1$
5. $\log_a x = \frac{\log_{10} x}{\log_{10} a}$ (change of base formula, base $10$ )
6. $\log_a x = \frac{\ln x}{\ln a}$ (change of base formula, base $e$ )
Inverse Function Property: If $f(x)$ and $g(x)$ are inverse functions, then $f(g(x)) = g(f(x)) = x$ .
- For example, let $f(x) = a^x$ and $g(x) = \log_a x$ . Then, $a^{\log_a x} = \log_a(a^x) = x$ .
Examples:
1. Solve $5^{2x} = 0.23$ for $x$ .
 - Take the natural logarithm of both sides: $\ln(5^{2x}) = \ln(0.23)$
 - Apply logarithm law $\log_a(x^k) = k \log_a x$ : $2x \ln(5) = \ln(0.23)$
 - Solve for $x$ : $x = \frac{\ln(0.23)}{2 \ln(5)} = \frac{-1.4697}{2 \cdot 1.6094} \approx \frac{-1.4697}{3.2188} \approx -0.4566$
2. If $\log_a x = 2.1$ and $\log_a y = 0.45$ , compute $\log_a(x^3y)$ .
 - Apply logarithm law $\log_a(xy) = \log_a x + \log_a y$ : $\log_a(x^3y) = \log_a(x^3) + \log_a y$
 - Apply logarithm law $\log_a(x^k) = k \log_a x$ : $= 3 \log_a x + \log_a y$
 - Substitute given values: $= 3(2.1) + 0.45 = 6.3 + 0.45 = 6.75$
3. Solve $\log_4(3x) = 1.4$ for $x$ .
 - Convert to exponential form ( $\log_a y = x \implies a^x = y$ ): $3x = 4^{1.4}$
 - Calculate $4^{1.4}$ : $3x \approx 6.9644$
 - Solve for $x$ : $x = \frac{6.9644}{3} \approx 2.3215$
4. Simplify: $\log_2 8 + \log_2 4$ .
 - Method 1 (using $\log_a(xy) = \log_a x + \log_a y$ ):
 - $\log_2(8 \cdot 4) = \log_2(32)$
 - Since $2^5 = 32$ , then $\log_2(32) = 5$ .
 - Method 2 (evaluating each logarithm):
 - Since $2^3 = 8$ , $\log_2 8 = 3$ .
 - Since $2^2 = 4$ , $\log_2 4 = 2$ .
 - $3 + 2 = 5$ .
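The numerical answers in Examples 1, 3, and 4 can be checked with Python's standard `math` module. This is only a verification sketch; the variable names are chosen for illustration.

```python
import math

# Example 1: solve 5^(2x) = 0.23 by taking natural logs of both sides
x1 = math.log(0.23) / (2 * math.log(5))
print(f"Example 1: x = {x1:.4f}, check 5^(2x) = {5 ** (2 * x1):.2f}")

# Example 3: solve log_4(3x) = 1.4 by converting to exponential form
x3 = 4 ** 1.4 / 3
print(f"Example 3: x = {x3:.4f}, check log_4(3x) = {math.log(3 * x3, 4):.1f}")

# Example 4: log_2(8) + log_2(4), evaluated term by term
total = math.log2(8) + math.log2(4)
print(f"Example 4: {total}")
```

Substituting each solution back into its original equation, as the check values do, is a simple way to catch arithmetic slips.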

Half-Life and Doubling Time

Half-Life

Definition: The half-life ( $T_{1/2}$ ) of a substance is the time required for its quantity to decrease to half of its initial value.
If the decay is exponential, the half-life is given by: $T_{1/2} = \frac{\ln(2)}{\omega}$ , where $\omega > 0$ is the decay constant.
Example: The half-life of Carbon-14 is about $5730$ years, meaning that after $5730$ years, only half of the initial Carbon-14 atoms remain.

Doubling Time

Definition: The doubling time ( $T_d$ ) is the time required for a quantity to double its initial value under exponential growth.
If the growth is exponential, the doubling time is given by: $T_d = \frac{\ln(2)}{r}$ , where $r > 0$ is the growth rate.
Example: If a population of bacteria doubles every $30$ minutes, its doubling time is $T_d = 30$ minutes.

Example: Drug Decay in the Bloodstream

Let $C(t)$ be the amount of drug (in milligrams) at time $t$ (in days), and $C_0$ be the initial amount. The decay is modeled by $C(t) = C_0 e^{-kt}$ , where $k > 0$ is the decay constant.
- (a) If the drug has a half-life of 10 days, what is the value of $k$ ?
 - At half-life, $C(T_{1/2}) = \frac{1}{2} C_0$ .
 - $\frac{1}{2} C_0 = C_0 e^{-k(10)}$
 - $\frac{1}{2} = e^{-10k}$
 - Take natural logarithm: $\ln(\frac{1}{2}) = -10k$
 - $- \ln(2) = -10k$
 - $k = \frac{\ln(2)}{10} \approx \frac{0.6931}{10} \approx 0.06931$
 - The decay constant $k$ is approximately $0.06931 \text{ days}^{-1}$ .
- (b) What percent of the administered amount of drug remains in the bloodstream after 4 hours?
 - First, convert $4$ hours to days: $4 \text{ hours} = \frac{4}{24} \text{ days} = \frac{1}{6} \text{ days} \approx 0.1667 \text{ days}$ .
 - Use the decay function: $C(t) = C_0 e^{-kt}$ .
 - $C(\frac{1}{6}) = C_0 e^{-(0.06931)(\frac{1}{6})}$
 - $C(\frac{1}{6}) = C_0 e^{-0.01155}$
 - $C(\frac{1}{6}) \approx C_0 (0.9885)$
 - The percentage remaining is approximately $0.9885 \times 100\% = 98.85\%$ . Approximately $99\%$ remains.
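Both parts of this example can be reproduced in a few lines of Python. This is a sketch of the same calculation; the function name `fraction_remaining` is hypothetical.

```python
import math

half_life_days = 10

# Part (a): decay constant from the half-life, k = ln(2) / T_half
k = math.log(2) / half_life_days

def fraction_remaining(t_days):
    """Fraction C(t)/C0 of the drug left after t_days days."""
    return math.exp(-k * t_days)

# Part (b): 4 hours expressed in days, then converted to a percentage
percent_after_4h = 100 * fraction_remaining(4 / 24)

print(f"k = {k:.5f} per day")
print(f"remaining after 4 hours: {percent_after_4h:.2f}%")
```

Evaluating `fraction_remaining(10)` should return one half, which confirms that the computed $k$ matches the stated half-life.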

Example: Oxygen Consumption of Salmon

Oxygen consumption of yearling salmon increases exponentially with swimming speed according to $f(x) = 100e^{0.6x}$ , where $x$ is speed in ft/s.
- (a) What is the amount of oxygen consumption when the fish are not moving?
  - Not moving means $x = 0$ ft/s.
  - $f(0) = 100e^{0.6(0)} = 100e^0 = 100(1) = 100$
  - Oxygen consumption is $100$ mg.
- (b) What is the oxygen consumption at a speed of 2 ft/s?
  - $f(2) = 100e^{0.6(2)} = 100e^{1.2}$
  - $f(2) \approx 100(3.3201) \approx 332.01$
  - Oxygen consumption is approximately $332.01$ mg.
- (c) If a salmon is swimming at 2 ft/s, how much faster does it need to swim in order to double its oxygen consumption?
  - Current consumption at $2$ ft/s is $332.01$ mg (from part b).
  - Double consumption would be $2 \times 332.01 = 664.02$ mg.
  - Set $f(x) = 664.02$
  - $664.02 = 100e^{0.6x}$
  - $6.6402 = e^{0.6x}$
  - Take natural logarithm: $\ln(6.6402) = 0.6x$
  - $1.8931 \approx 0.6x$
  - $x = \frac{1.8931}{0.6} \approx 3.155 \text{ ft/s}$
  - The additional speed needed is $3.155 - 2 = 1.155$ ft/s.
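The three parts can be checked with a short Python sketch of the model $f(x) = 100e^{0.6x}$ . The function names `oxygen` and `speed_for` are hypothetical; `speed_for` simply inverts the model using the natural logarithm.

```python
import math

def oxygen(speed):
    """Oxygen consumption at a given swimming speed (ft/s)."""
    return 100 * math.exp(0.6 * speed)

def speed_for(consumption):
    """Invert the model: speed at which the given consumption occurs."""
    return math.log(consumption / 100) / 0.6

at_rest = oxygen(0)                             # part (a)
at_two = oxygen(2)                              # part (b)
extra_speed = speed_for(2 * at_two) - 2         # part (c)

print(f"(a) {at_rest:.2f}")
print(f"(b) {at_two:.2f}")
print(f"(c) additional speed: {extra_speed:.3f} ft/s")
```

The extra speed in part (c) equals $\ln(2)/0.6$ , so for this model the increase needed to double consumption is the same at any starting speed.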

Allometric or Power Laws, Rescaling, and Log Plots

Power Law or Allometry

Definition: A power law function (or allometric function in biology) is of the form $y = ax^k$ , where:
- $a > 0$ is a constant of proportionality.
- $k \in \mathbb{R}$ is the power (or scaling exponent).
- $x > 0$ is the independent variable.
Note: $y$ is an allometric function of $x$ , meaning $x$ and $y$ are allometrically related.
Properties:
- If $k > 1$ , the function grows faster than linear (superlinear growth).
- If $0 < k < 1$ , the function grows slower than linear (sublinear growth).
- If $k = 1$ , the function reduces to a linear function $y = ax$ .
- If $k < 0$ , the function represents a decreasing relationship, such as inverse proportionality.
Real-World Examples of Power Laws (Allometry):
- Biology (Allometry): Metabolic rate of animals often scales as body mass to the power of $3/4$ (e.g., $B = aM^{3/4}$ ).
- Physics: Gravitational force follows an inverse-square law ( $F = G\frac{m_1 m_2}{r^2}$ ).
- Economics: Wealth distributions often follow a Pareto power law.
- Engineering: Stress or fracture strength scaling with material size.
Example: Elephant Surface Area (Allometry)
- Surface area ( $S$ ) of an African elephant's body is an allometric function of trunk length ( $L$ ) with an exponent of $0.74$ . So, $S = aL^{0.74}$ .
- An elephant has a surface area of $200 \text{ ft}^2$ and a trunk length of $6 \text{ ft}$ .
- Find $a$ : $200 = a(6)^{0.74}$
  - $200 = a(3.765)$
  - $a = \frac{200}{3.765} \approx 53.12$
- So, the specific allometric equation is: $S = 53.12L^{0.74}$ .
- What is the expected surface area of an elephant with a trunk length of 7 ft?
  - $S(7) = 53.12(7)^{0.74}$
  - $S(7) = 53.12(4.221)$ (since $7^{0.74} \approx 4.221$ )
  - $S(7) \approx 224.2$
  - The expected surface area is approximately $224.2 \text{ ft}^2$ .
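The calibration of $a$ and the prediction for a longer trunk can be recomputed with a minimal Python sketch. The function name `surface_area` is hypothetical, and the code keeps $a$ unrounded, so its output may differ slightly from a hand calculation with rounded intermediate values.

```python
exponent = 0.74

# Calibrate a from the known elephant: S = 200 ft^2 at L = 6 ft
a = 200 / 6 ** exponent

def surface_area(trunk_length):
    """Allometric model S = a * L^0.74 (areas in ft^2, lengths in ft)."""
    return a * trunk_length ** exponent

print(f"a = {a:.2f}")
print(f"S(6) = {surface_area(6):.2f}")   # recovers the calibration point
print(f"S(7) = {surface_area(7):.2f}")
```

Because the exponent is below $1$ , the growth is sublinear: a trunk about $17\%$ longer gives a surface area only about $12\%$ larger.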

Rescaling Data

Rescaling replaces one or both of the variables $x$ and $y$ (here, biological measurements) by their logarithms before plotting.
Definition (Log-Log Graph): A graph where the horizontal axis is labeled as $\ln(x)$ and the vertical axis is labeled as $\ln(y)$ .
Definition (Semi-Log Graph): A graph where the horizontal axis is labeled as $x$ and the vertical axis is labeled as $\ln(y)$ .
Note: Rescaling data using log or semi-log axes is particularly useful for:
- Exponential functions: They appear as straight lines on a semi-log plot.
- Power-law (allometric) functions: They appear as straight lines on a log-log plot.
Transformation of functions by taking natural logarithm:
- For an exponential function $f(x) = ab^x$ :
  - $\ln(f(x)) = \ln(ab^x)$
  - $\ln(f(x)) = \ln(a) + \ln(b^x)$
  - $\ln(f(x)) = \ln(a) + x\ln(b)$
  - This is in the form $Y = A + Bx$ (where $Y = \ln(f(x))$ , $A = \ln(a)$ , $B = \ln(b)$ ), which is a linear equation with respect to $x$ and $\ln(f(x))$ . So it's linear on a semi-log plot.
- For a power-law function $g(x) = cx^k$ :
  - $\ln(g(x)) = \ln(cx^k)$
  - $\ln(g(x)) = \ln(c) + \ln(x^k)$
  - $\ln(g(x)) = \ln(c) + k\ln(x)$
  - This is in the form $Y = A + BX$ (where $Y = \ln(g(x))$ , $A = \ln(c)$ , $B = k$ , $X = \ln(x)$ ), which is a linear equation with respect to $\ln(x)$ and $\ln(g(x))$ . So it's linear on a log-log plot.

Examples: Rescaling Data

Exponential Function (Semi-Log Plot)

Consider the function $y = 2e^{0.5x}$ for $x = 0, 1, 2, 3, 4, 5$ .
Data values:
- x: 0, y: 2.00
- x: 1, y: 3.30
- x: 2, y: 5.44
- x: 3, y: 8.96
- x: 4, y: 14.78
- x: 5, y: 24.36
Observation: On ordinary axes, the curve rises exponentially. On a semi-log plot (x vs. $\ln(y)$ ), the points will fall on a straight line.

Allometric Function (Log-Log Plot)

Consider the function $y = 3x^{0.75}$ for $x = 1, 2, 3, 4, 5, 6$ .
Data values:
- x: 1, y: 3.00
- x: 2, y: 5.05
- x: 3, y: 6.84
- x: 4, y: 8.49
- x: 5, y: 10.03
- x: 6, y: 11.50
Observation: On ordinary axes, the curve increases sublinearly. On a log-log plot ( $\ln(x)$ vs. $\ln(y)$ ), the points will fall on a straight line with a slope of $0.75$ .
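Both rescaling examples can be checked numerically: if the transformed points lie on a straight line, the slope between every pair of neighboring points is the same. The sketch below uses a hypothetical helper, `consecutive_slopes`, written for this check.

```python
import math

def consecutive_slopes(xs, ys):
    """Slopes between neighboring points; equal slopes mean a straight line."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]

# Exponential example y = 2 e^(0.5 x): semi-log plot, x against ln(y)
xs_exp = [0, 1, 2, 3, 4, 5]
ys_exp = [2 * math.exp(0.5 * x) for x in xs_exp]
semi_log = consecutive_slopes(xs_exp, [math.log(y) for y in ys_exp])

# Power-law example y = 3 x^0.75: log-log plot, ln(x) against ln(y)
xs_pow = [1, 2, 3, 4, 5, 6]
ys_pow = [3 * x ** 0.75 for x in xs_pow]
log_log = consecutive_slopes([math.log(x) for x in xs_pow],
                             [math.log(y) for y in ys_pow])

print("semi-log slopes:", [round(s, 4) for s in semi_log])
print("log-log slopes: ", [round(s, 4) for s in log_log])
```

The semi-log slopes all equal the growth rate $0.5$ and the log-log slopes all equal the exponent $0.75$ , which is how such plots are used to read off these parameters from data.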

Basic Descriptive Statistics

Types of Data

Ratio Scale

Definition: A measurement scale with a constant interval size and a true zero point (meaning the absence of the quantity).
Examples: Age, height, distance, weight. (e.g., $0$ height means no height).

Interval Scale

Definition: A measurement scale with a constant interval size but no true zero point.
Examples: Temperature (Celsius/Fahrenheit), dates on a calendar, time on a watch. (e.g., $0^\circ\text{C}$ does not mean no temperature).

Ordinal Scale

Definition: Data can be ordered or ranked according to some measurement, but the intervals between ranks may not be equal or meaningful.
Examples: Education level (primary, secondary, post-secondary), income levels (low, middle, high), satisfaction ratings (poor, good, excellent).

Nominal Scale

Definition: Data is classified by an attribute or category rather than by a quantity measurement. Categories have no inherent order.
Examples: Gender (Female, Male), blood group (A, B, AB, O), species (bird, mammal).

Continuous and Discrete Data

Continuous data: Can take any value within a given range (e.g., decimals).
- Examples: Height (e.g., $1.75 \text{ m}$ ), temperature (e.g., $25.3^\circ\text{C}$ ).
Discrete data: Consists of distinct, separate values that can be counted (usually whole numbers).
- Examples: Number of students in a classroom, number of cars in a parking lot, books on a shelf.

Tools used to describe and summarize data

Measures of Central Tendency

These summarize a data set by a single value, typically representing the center of the data.
Arithmetic Mean (Mean):
- Let $\{x_1, x_2, \dots, x_n\}$ denote a data set with $n$ data points. The arithmetic mean is defined as:
 - $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^n x_i}{n}$ .
- Example: For marks $\{50, 60, 70, 80, 90\}$ , the mean is:
 - $\bar{x} = \frac{50 + 60 + 70 + 80 + 90}{5} = \frac{350}{5} = 70$ .
Median:
- The median is the middle value of an ordered dataset.
- If the number of data points $n$ is odd, the median is the value at position $\frac{(n+1)}{2}$ .
- If $n$ is even, the median is the average of the values at positions $\frac{n}{2}$ and $(\frac{n}{2}) + 1$ .
- Example (odd n): Dataset $\{3, 13, 2, 34, 11, 26, 47\}$ ( $n=7$ ).
  - Ordered set: $\{2, 3, 11, 13, 26, 34, 47\}$ .
  - Position: $\frac{(7+1)}{2} = 4$ , i.e. the $4^{\text{th}}$ value. The median is $13$ .
- Example (even n): Dataset $\{3, 13, 2, 34, 11, 17, 27, 47\}$ ( $n=8$ ).
  - Ordered set: $\{2, 3, 11, 13, 17, 27, 34, 47\}$ .
  - Positions: $\frac{8}{2} = 4$ (the $4^{\text{th}}$ value, $13$ ) and $(\frac{8}{2})+1 = 5$ (the $5^{\text{th}}$ value, $17$ ).
  - The median is the average: $\frac{13+17}{2} = \frac{30}{2} = 15$ .
Difference between mean and median:
- Example: Dataset $\{0, 0, 0, 1, 1, 2, 10, 10\}$ .
  - Mean: $\bar{x} = \frac{0+0+0+1+1+2+10+10}{8} = \frac{24}{8} = 3$ .
  - Median: Ordered set is already given. $n=8$ (even).
    - $\frac{n}{2} = 4$ , the $4^{\text{th}}$ position (value is $1$ ).
    - $(\frac{n}{2})+1 = 5$ , the $5^{\text{th}}$ position (value is $1$ ).
    - Median $= \frac{1+1}{2} = 1$ .
  - In this case, the mean ( $3$ ) is higher than the median ( $1$ ) due to the presence of larger values ( $10, 10$ ) tugging the mean upwards, illustrating the mean's sensitivity to outliers/skewness.
Mode:
- The mode is the value (or values) that occurs most frequently in a dataset.
- A dataset can have:
  - One mode (unimodal): e.g., $\{2, 4, 4, 6, 7\}$ has mode $4$ .
  - Two modes (bimodal): e.g., $\{1, 2, 2, 3, 3, 4\}$ has modes $2$ and $3$ .
  - More than two modes (multimodal): e.g., $\{5, 6, 6, 7, 7, 8, 8\}$ has modes $6, 7, 8$ .
  - No mode: If all values occur with the same frequency.
Midrange:
- The midrange is the value halfway between the minimum and maximum data values.
- $\text{Midrange} = \frac{\text{Minimum value} + \text{Maximum value}}{2}$ .
- Example: For dataset $\{3, 7, 10, 15, 18\}$ , Midrange $= \frac{3+18}{2} = \frac{21}{2} = 10.5$ .
Geometric Mean (GM):
- For a dataset of $n$ positive numbers $\{x_1, x_2, \dots, x_n\}$ , the geometric mean is the $n^{\text{th}}$ root of the product of those $n$ points.
- $GM = (x_1 \cdot x_2 \cdot \dots \cdot x_n)^{\frac{1}{n}} = (\prod_{i=1}^n x_i)^{\frac{1}{n}}$ .
- Example: For dataset $\{4, 16\}$ , $GM = (4 \cdot 16)^{\frac{1}{2}} = (64)^{\frac{1}{2}} = 8$ .
- Example: For dataset $\{2, 8, 18\}$ ( $n=3$ ), $GM = (2 \cdot 8 \cdot 18)^{\frac{1}{3}} = (288)^{\frac{1}{3}} \approx 6.60$ .
Harmonic Mean (HM):
- For a dataset of $n$ positive numbers $\{x_1, x_2, \dots, x_n\}$ , the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data points.
- $HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}$ .
- Example: For dataset $\{4, 8, 16\}$ ( $n=3$ ), $HM = \frac{3}{\frac{1}{4} + \frac{1}{8} + \frac{1}{16}} = \frac{3}{\frac{4}{16} + \frac{2}{16} + \frac{1}{16}} = \frac{3}{\frac{7}{16}} = 3 \cdot \frac{16}{7} = \frac{48}{7} \approx 6.86$ .
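Most of these measures are available in Python's standard `statistics` module, so the worked examples can be checked directly. The sketch below assumes Python 3.8 or later (for `multimode` and `geometric_mean`); the midrange has no built-in function and is computed by hand.

```python
import statistics as st

mean_marks = st.mean([50, 60, 70, 80, 90])
median_odd = st.median([3, 13, 2, 34, 11, 26, 47])
median_even = st.median([3, 13, 2, 34, 11, 17, 27, 47])
modes = st.multimode([1, 2, 2, 3, 3, 4])        # all most-frequent values

data = [3, 7, 10, 15, 18]
midrange = (min(data) + max(data)) / 2           # no built-in for midrange

gm = st.geometric_mean([4, 16])
hm = st.harmonic_mean([4, 8, 16])

print("mean:", mean_marks)
print("medians:", median_odd, median_even)
print("modes:", modes)
print("midrange:", midrange)
print(f"geometric mean: {gm:.2f}, harmonic mean: {hm:.2f}")
```

Note that `st.median` sorts the data internally, so the datasets can be passed in their original, unordered form.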

Measures of Dispersion

These describe the spread of data points around the central tendency.
Range:
- The range of a dataset is the difference between the maximum and minimum values.
- $\text{Range} = \text{Maximum value} - \text{Minimum value}$ .
- Example: For dataset $\{5, 8, 12, 20, 25\}$ , Range $= 25 - 5 = 20$ .
Variance:
- Measures the average of the squared deviations from the mean.
- For a sample dataset, variance ( $s^2$ ) is computed using $(n-1)$ in the denominator (Bessel's correction) to provide an unbiased estimate of the population variance.
- $s^2 = \frac{1}{(n-1)} \sum_{i=1}^n (x_i - \bar{x})^2$ .
- Example: For dataset $\{2, 4, 6\}$ ( $n=3$ ):
 - Mean: $\bar{x} = \frac{2+4+6}{3} = \frac{12}{3} = 4$ .
 - Deviations: $(2-4) = -2$ , $(4-4) = 0$ , $(6-4) = 2$ .
 - Squared deviations: $(-2)^2 = 4$ , $(0)^2 = 0$ , $(2)^2 = 4$ .
 - Sum of squared deviations: $4 + 0 + 4 = 8$ .
 - Variance: $s^2 = \frac{1}{(3-1)} (8) = \frac{8}{2} = 4$ .
Standard Deviation:
- Indicates how much data values deviate, on average, from the mean. It is the square root of the variance.
- $s = \sqrt{\frac{1}{(n-1)} \sum_{i=1}^n (x_i - \bar{x})^2}$ .
- Example: For dataset $\{2, 4, 6\}$ , with $\bar{x} = 4$ and $s^2 = 4$ , the standard deviation is $s = \sqrt{4} = 2$ .
Coefficient of Variation (CV):
- A standardized measure of dispersion that expresses the standard deviation as a percentage of the mean. It allows comparison of variability between datasets with different units or vastly different means.
- $CV = \frac{s}{\bar{x}} \times 100\%$ , where $s$ is the standard deviation and $\bar{x}$ is the arithmetic mean.
- Example: For dataset $\{2, 4, 6\}$ , with $\bar{x} = 4$ and $s = 2$ , CV $= \frac{2}{4} \times 100\% = 0.5 \times 100\% = 50\%$ .
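The dispersion measures for the dataset $\{2, 4, 6\}$ can be verified with the standard `statistics` module. In this sketch, `st.variance` and `st.stdev` use the sample formulas with $(n-1)$ in the denominator, while `st.pvariance` divides by $n$ and is included only for contrast; the variable names are illustrative.

```python
import statistics as st

data = [2, 4, 6]

data_range = max(data) - min(data)
sample_var = st.variance(data)       # divides by n - 1 (Bessel's correction)
sample_sd = st.stdev(data)           # square root of the sample variance
cv_percent = sample_sd / st.mean(data) * 100

population_var = st.pvariance(data)  # divides by n, shown for contrast

print("range:", data_range)
print("sample variance:", sample_var)
print("sample standard deviation:", sample_sd)
print(f"coefficient of variation: {cv_percent:.0f}%")
print(f"population variance: {population_var:.3f}")
```

For such a small dataset the choice of denominator matters a great deal: the sample variance is $4$ , while dividing by $n$ gives $8/3 \approx 2.667$ .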