L9 - Exploratory Data Analysis: Degrees of Freedom, Standard Error, and Normality

Secrets to Success in Data Analysis and Lab Work

  • Use Comments for Learning: Annotate R Markdown documents and provided lecture notes. Remind yourself what specific arguments and functions are doing. This bridges the gap between high-level statistical concepts and the technical execution in labs.

  • Active Inquiry: Explicitly document concepts that are not fully understood during lectures to ask about them later. Interrupting the lecturer for clarification is encouraged.

  • Problem-Solving Tracking: Keep track of the steps taken during troubleshooting. Even if code is deleted, notes like "I tried to use this wrong data frame and it broke the code, so I changed it to this" should be preserved to understand the reasoning behind specific steps.

Revisit: Degrees of Freedom (DoF)

  • Definition: Degrees of Freedom represent the number of independent pieces of information available for analysis.

  • Formula for Categorical Variables:

    • $DoF = k - 1$

    • Where $k$ is the number of categories.

  • Formula for Continuous Variables:

    • Initially, for the first parameter estimated, $DoF = n - 1$ (where $n$ is the number of observations/sample size).

    • If subsequent estimations rely on previously estimated parameters, the degrees of freedom may change.

  • General Principle: The more parameters calculated before a specific statistic, the lower the degrees of freedom. Calculating each parameter uses up a portion of the data, leaving less independent information for following steps.

  • The Elevator Vignette (Analogy):

    • Imagine three people in an elevator and someone farts. Only the person who did it knows for sure who it was; each of the other two still has two suspects.

    • If only two people are in the elevator, both know for sure who it was, because if it wasn't you, it must be the other person. In the two-person scenario there are no degrees of freedom: you are not "free" from knowing the source. This illustrates the loss of independent information as constraints (or parameters) are added.

Practice Scenarios: Calculating Degrees of Freedom

  • Case 1: pH Values

    • Dataset: 1128 observations of pH values.

    • Question: What is the DoF for the first parameter estimate?

    • Calculation: $n - 1 = 1128 - 1 = 1127$.

  • Case 2: Buoy Locations

    • Dataset: 1128 observations distributed across 10 different buoy sites (Auckland, Stewart Island, Wellington, etc.).

    • Question: What is the DoF when considering the buoy location (site) as the variable of interest?

    • Calculation: $k - 1 = 10 - 1 = 9$ (where $k$ is the number of categories/sites).
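
The two cases can be checked in R. This is a minimal sketch, assuming a hypothetical data frame called buoys with a numeric pH column and a site column; the values are simulated placeholders, not the lecture data.

```r
# Hypothetical stand-in for the lecture data: 1128 pH readings across 10 sites.
set.seed(1)
buoys <- data.frame(
  pH   = rnorm(1128, mean = 8.1, sd = 0.1),   # simulated values
  site = factor(sample(rep(paste0("site_", 1:10), length.out = 1128)))
)

# Case 1: continuous variable, first parameter estimated -> n - 1
dof_ph <- nrow(buoys) - 1

# Case 2: categorical variable -> k - 1, where k is the number of categories
dof_site <- nlevels(buoys$site) - 1

dof_ph    # 1127
dof_site  # 9
```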

Standard Error of the Mean (SEM)

  • Concept: While measures of spread (like standard deviation) summarize variation in the actual dataset, the Standard Error of the Mean summarizes the uncertainty in our estimate of the true population mean.

  • Hypothetical Basis: Imagine resampling a population 100 times. The standard deviation of those 100 resulting sample means is essentially what the SEM aims to capture.

  • In Practice: Since we cannot sample a population an infinite number of times, we use information from null distributions to estimate how close our sample mean is to the true population mean.

  • Mathematical Formula:

    • The speaker initially misspoke and then corrected the formula to the square root of the variance divided by the sample size.

    • $\text{SEM} = \sqrt{\frac{\text{Variance of } y}{n}}$

    • Notation: Often written as SEM or $SE_{\bar{y}}$.
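
The resampling idea behind the SEM can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is simulated (exponential with rate 1, so its true variance is 1), the sample size of 39 is an arbitrary choice, and far more than 100 resamples are used so the comparison is stable.

```r
set.seed(42)
n <- 39               # size of each sample (hypothetical choice)
n_resamples <- 10000  # many more than 100, to reduce simulation noise

# Simulated population: exponential with rate 1, so the true variance is 1.
sample_means <- replicate(n_resamples, mean(rexp(n, rate = 1)))

# Standard deviation of the sample means across all resamples
sd_of_means <- sd(sample_means)

# SEM estimated from a single sample using sqrt(variance / n)
one_sample <- rexp(n, rate = 1)
sem_one_sample <- sqrt(var(one_sample) / n)

# Theoretical value for this simulated population: sqrt(1 / n)
sem_theory <- sqrt(1 / n)

c(sd_of_means = sd_of_means, sem_one_sample = sem_one_sample, sem_theory = sem_theory)
```

The first and third values should agree closely; the single-sample estimate varies more because it relies on one sample's variance.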

Confidence Intervals (CI)

  • Definition: A range around the sample mean built at a chosen confidence level (usually 95%), meaning that across repeated samples about 95% of intervals built this way would contain the true population mean.

  • Relationship to SEM: SEM is used to calculate these intervals.

  • Function: Confidence intervals help visualize and quantify the precision of our mean estimate rather than the spread of the raw data.

  • Visualization: Often plotted as a mean point with error bars representing the 95% CI.

The Central Limit Theorem (CLT)

  • Core Principle: With a large enough sample size, sample means will be approximately normally distributed regardless of the distribution of the underlying variable being sampled.

  • Example Visualization: If you sample from a non-normal distribution (e.g., asking people how many cups of coffee they drink per week), and you repeatedly take samples of a certain size and calculate their means, the resulting distribution of those means will be roughly normal.

  • Importance: This theorem allows for the use of normal-based statistical distributions even when raw data is not normal, provided the sample size is sufficient.
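
A short simulation makes the theorem concrete. This is a sketch only: the "cups of coffee per week" values are simulated from a Poisson distribution with mean 2 (an arbitrary, right-skewed stand-in), and the sample size of 50 is likewise an assumption.

```r
set.seed(7)

# Hypothetical "cups of coffee per week": a right-skewed count variable.
raw_cups <- rpois(5000, lambda = 2)

# Repeatedly take samples of size 50 and store each sample mean.
sample_means <- replicate(5000, mean(rpois(50, lambda = 2)))

# Simple moment-based skewness (0 for a perfectly symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

skew_raw   <- skewness(raw_cups)
skew_means <- skewness(sample_means)
c(skew_raw = skew_raw, skew_means = skew_means)

# Side-by-side histograms: raw values versus sample means
par(mfrow = c(1, 2))
hist(raw_cups, main = "Raw values", xlab = "Cups per week")
hist(sample_means, main = "Sample means (n = 50)", xlab = "Mean cups per week")
par(mfrow = c(1, 1))
```

The skewness of the sample means should be much closer to zero than that of the raw values, and their histogram should look roughly bell-shaped.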

The Student’s t-Distribution

  • Definition: A probability distribution developed to describe how sample means vary when the standard deviation must be estimated from the sample; its shape depends on the sample size ($n$) through the degrees of freedom.

  • Historical Context: Developed by a brewer at the Guinness Brewery in Dublin (William Sealy Gosset). Guinness prohibited him from publishing under his name to keep the statistical brewing techniques private, so he used the pseudonym "Student." Hence the names "Student's t-distribution" and "Student's t-test."

  • Comparison to Normal Distribution:

    • Tails: The t-distribution has "fatter" tails than the standard normal distribution. This means extreme values have a higher probability of occurring, which accounts for the uncertainty in smaller sample sizes.

    • Convergence: As the degrees of freedom increase toward infinity ($DoF \rightarrow \infty$), the t-distribution converges to the standard normal distribution; the two become identical only in the limit.

    • Visual Case: At $DoF = 1$ (sample size of 2), the distribution deviates most from normal. At $DoF = 83$, it looks very similar to a standard normal curve.
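
The fatter tails and the convergence can be seen numerically by comparing upper-tail cutoffs. This is a minimal sketch using base R's qt and qnorm; the particular DoF values are chosen for illustration.

```r
# 97.5th percentile of the t-distribution for increasing degrees of freedom
dofs <- c(1, 5, 38, 83, 1000)
t_cutoffs <- qt(0.975, df = dofs)

# Same percentile of the standard normal distribution
z_cutoff <- qnorm(0.975)

# The t cutoff is always larger (fatter tails) and shrinks toward the normal one
data.frame(DoF = dofs,
           t_cutoff = t_cutoffs,
           excess_over_normal = t_cutoffs - z_cutoff)
```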

R Implementation: Managing the Black Corals Dataset

  • Context: Data from the MPI database regarding weight (in kg) of black corals caught as bycatch in fishing nets around New Zealand.

  • Loading Data:

    • corals <- read.csv("scratch/projects/RCH209/data/black_corals.csv")

    • The dataset contains 39 observations of 3 variables.

  • Exploration Functions:

    • head(corals): Shows the first six rows of the data frame to understand the table structure (columns: NFPS_weight_kg, fishing_year, region).

    • str(corals): Checks variable types.

  • Calculating SEM in R:

    1. Save sample size: n <- nrow(corals) (n = 39).

    2. Calculate variance: coral_var <- var(corals$NFPS_weight_kg) (variance = 15.67).

    3. Calculate SEM: coral_sem <- sqrt(coral_var / n) (SEM = 0.633879).
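
The exploration and SEM steps can be tried without the CSV file. This sketch assumes a simulated data frame, corals_sim, that borrows the three column names from the notes; the weights, years, and region labels are made up, and the sem helper is a hypothetical convenience function rather than part of the lab code.

```r
set.seed(209)
# Simulated stand-in for black_corals.csv (39 rows, 3 columns); not the MPI data.
corals_sim <- data.frame(
  NFPS_weight_kg = rlnorm(39, meanlog = 1, sdlog = 1),      # made-up weights
  fishing_year   = sample(2008:2020, 39, replace = TRUE),   # made-up years
  region         = sample(c("A", "B", "C"), 39, replace = TRUE)
)

head(corals_sim)  # first six rows
str(corals_sim)   # variable types

# Hypothetical helper wrapping the three steps: n, variance, sqrt(variance / n)
sem <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before counting
  sqrt(var(x) / length(x))
}

coral_sem_sim <- sem(corals_sim$NFPS_weight_kg)
coral_sem_sim
```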

Calculating 95% Confidence Intervals in R

  • Symmetry: Because the t-distribution is symmetrical, the interval is defined by an upper and lower bound equidistant from the mean.

  • Quantile Calculation (qt):

    • To find the middle 95%, we must account for the remaining 5%. This is split into 2.5% in the lower tail and 2.5% in the upper tail.

    • ts <- qt(0.975, df = n - 1): We use 0.975 because qt takes a cumulative probability and returns the point that has that much of the distribution to its left. At the point where 97.5% of the distribution is to the left, exactly 2.5% remains in the upper tail.

    • Result: $ts \approx 2.02439$ for 38 degrees of freedom.

  • Bounds Formulas:

    • coral_mean <- mean(corals$NFPS_weight_kg)

    • Upper Bound: $Mean + (t_{estimate} \times SEM)$

    • Lower Bound: $Mean - (t_{estimate} \times SEM)$

  • Example Results for Corals:

    • Mean: 4.6524 (Upper estimation after calculation).

    • The t-value scales the standard error to account for the variation and confidence level.
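
The pieces above can be assembled into the half-width of the interval. This sketch starts from the sample size and variance reported in the notes (39 and 15.67, the latter rounded); ci_bounds is a hypothetical helper introduced here for illustration.

```r
# Values reported in the notes above (variance is rounded to two decimals)
n         <- 39
coral_var <- 15.67

coral_sem <- sqrt(coral_var / n)      # standard error of the mean
ts        <- qt(0.975, df = n - 1)    # t quantile for a 95% interval
margin    <- ts * coral_sem           # half-width of the interval

# Hypothetical helper returning both bounds for any mean
ci_bounds <- function(m, margin) c(lower = m - margin, upper = m + margin)

round(c(sem = coral_sem, t = ts, margin = margin), 4)
```

With the sample mean stored in coral_mean, ci_bounds(coral_mean, margin) would return the lower and upper bounds.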

Assessing Normality in Exploratory Data Analysis

  • Visualizing the Coral Data:

    • hist(corals$NFPS_weight_kg) creates a histogram.

    • Visual check: The coral distribution was found to be highly skewed (not normal), resembling a log-normal distribution.

  • The QQ Plot (Quantile-Quantile Plot):

    • Compares expected quantiles of a normal distribution to the actual quantiles of the dataset.

    • Points falling along a straight line indicate a normal distribution.

    • Common deviations include the "hockey stick effect," where points curve away from the line at the tails.

  • Formal Hypothesis Testing (Normality Tests):

    • Null Hypothesis ($H_0$): The data are not different from a normal distribution (i.e., the data are normal).

    • Shapiro-Wilk Test: Common but sensitive to small sample sizes and repeated values.

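    • Kolmogorov-Smirnov Test: Measures the largest gap between the data's cumulative distribution and that of a normal distribution (loosely, the distance from the ideal QQ line); less robust.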

    • D’Agostino Test: The preferred test in this course. It incorporates:

      1. Skewness: Measures symmetry (excess of positive or negative values).

      2. Kurtosis: Measures the weight of the tails (how skinny or wide the curve is compared to normal).
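
The visual checks and one of the formal tests are available in base R. This is a sketch on simulated log-normal weights (a stand-in, not the coral data); it uses shapiro.test because it ships with base R, not because it is the course's preferred test.

```r
set.seed(3)
# Simulated right-skewed weights (log-normal); a stand-in, not the coral data.
weights_sim <- rlnorm(39, meanlog = 1, sdlog = 1)

# Histogram: look for a long right tail
hist(weights_sim, main = "Simulated skewed weights", xlab = "Weight (kg)")

# QQ plot against a normal distribution, with the reference line added
qqnorm(weights_sim)
qqline(weights_sim)

# Shapiro-Wilk test from base R; H0 is that the data are normal
sw <- shapiro.test(weights_sim)
sw$p.value
```

A small p-value here would lead to rejecting the null hypothesis of normality, matching the curved pattern expected in the QQ plot.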

R Function: normal_test

  • Library: Requires the installation and loading of specific statistical packages (e.g., install.packages("package_name") and library(package_name)).

  • Code: normal_test(corals$NFPS_weight_kg, method = "dagostino").

  • Output Interpretation:

    • The output provides p-values for Kurtosis, Skewness, and the Omnibus test.

    • Omnibus p-value: The primary value to check. It combines all deviations.

    • Example Result: $p = 0.00000015$.

    • Decision: Since $p < 0.05$, we reject the null hypothesis and conclude that the coral data are not normally distributed.
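
The notes do not name the package that provides normal_test, so the following sketch does not call it. Instead it computes, in base R, the two quantities the D'Agostino approach is described as combining. The function names and simulated data are assumptions for illustration, and no p-values are produced.

```r
# Moment-based sample skewness: 0 for a symmetric distribution
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution, positive for heavier tails
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(11)
skewed_sim <- rlnorm(5000, meanlog = 0, sdlog = 1)  # simulated, right-skewed
normal_sim <- rnorm(5000)                           # simulated, normal

c(skew = sample_skewness(skewed_sim), kurt = sample_excess_kurtosis(skewed_sim))
c(skew = sample_skewness(normal_sim), kurt = sample_excess_kurtosis(normal_sim))
```

Both values should sit near zero for the normal sample and well above zero for the skewed one.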

Data Transformation Techniques

  • Log Transformation: Most common for skewed data (log-normal). It compresses the upper tail of the distribution to make it more symmetrical.

    • Note: Cannot take the log of zero; a small constant may need to be added.

  • Power Transformations: Squaring or raising data to a power (e.g., variance calculation utilizes a power transformation).

  • Reciprocal Transformation: Turning values into fractions (e.g., $1/x$).

  • Reversibility: Log, Power, and Reciprocal transformations are back-transformable, meaning you can return to the original scale while retaining the information about variance.

  • Rank Transformation: Ordering data from lowest to highest and assigning a rank (1, 2, 3, ...).

    • Warning: Rank transformation cannot be back-transformed and loses information about the actual variance between points.
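
Each transformation is a one-liner in R. This is a sketch on a small hypothetical vector of positive values; the constant added before taking the log of data containing zeros (here 1) is an arbitrary choice for illustration.

```r
x <- c(0.5, 1.2, 3.4, 8.9, 40.0)   # hypothetical positive, right-skewed values

log_x   <- log(x)       # natural log; requires x > 0
recip_x <- 1 / x        # reciprocal; requires x != 0
sqrt_x  <- x^0.5        # a power transformation
rank_x  <- rank(x)      # ranks 1..n (ties receive averaged ranks)

# Back-transformations recover the original values
all.equal(exp(log_x), x)
all.equal(1 / recip_x, x)
all.equal(sqrt_x^2, x)

# Ranks keep only the ordering: two very different vectors share the same ranks
same_ranks <- identical(rank(c(1, 2, 3)), rank(c(1, 50, 1000)))
same_ranks

# With zeros present, one option is log(x + c) for a constant c (here c = 1)
with_zero <- c(0, 1, 10)
log(with_zero + 1)
```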

Questions & Discussion

  • Student Question: Can you repeat why we used 0.975 instead of 0.95 in the qt function?

  • Lecturer Response: We want the middle 95% of the distribution. This leaves 5% "outside," split into 2.5% in the bottom tail and 2.5% in the top tail. Because the qt function works with cumulative probability from the left, we need to find the point where all of the lower 2.5% and all of the middle 95% have passed. Therefore, $0.025 + 0.95 = 0.975$. This value gives us the cutoff for the upper tail. Because the distribution is symmetrical around zero, this value (e.g., 2.02) has the same magnitude as the lower cutoff (e.g., -2.02).
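
The lecturer's answer can be verified directly with qt and pt. This is a minimal sketch assuming n = 39, as in the coral example.

```r
n <- 39
upper_cut <- qt(0.975, df = n - 1)   # 97.5% of the distribution lies to the left
lower_cut <- qt(0.025, df = n - 1)   # 2.5% of the distribution lies to the left

# Symmetry around zero: the two cutoffs differ only in sign
all.equal(lower_cut, -upper_cut)

# Probability between the cutoffs is the middle 95%
middle <- pt(upper_cut, df = n - 1) - pt(lower_cut, df = n - 1)
middle

# Using 0.95 instead leaves 5% in the upper tail alone,
# so a symmetric interval built from it covers only 90%
one_sided_cut <- qt(0.95, df = n - 1)
covered <- pt(one_sided_cut, df = n - 1) - pt(-one_sided_cut, df = n - 1)
covered
```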