Unit 1: Examining Distributions Complete Study Notes

Core Statistical Concepts and Terminology

Statistics Defined: Statistics refers to the set of methods used for obtaining, organizing, summarizing, presenting, and analyzing data.
Data Sources: Data is derived from characteristics measured on individuals or units. Examples of units include people, animals, places, or objects (things).
Population: This is defined as the totality of individuals about which information is desired.
Sample: A sample is a subset of the units in a population that are actually examined to gather information. * Example 1: 1,000 voters are asked which candidate they support in an upcoming election. The population would be all eligible voters in that election. * Example 2: 50 insomnia patients are given a new treatment. The population would be all patients suffering from insomnia. * Example 3: 200 Canada geese are tagged to study migration. The population would be all Canada geese.

Variables and Data Classifications

Variable: A characteristic or property of an individual. Examples include: * Time until a light bulb burns out. * Distance traveled by a taxi driver in one day. * Number of Heads in five tosses of a quarter. * Hair colour. * A student's grade in a course.
Categorical Data: Represents values of categorical variables that place individuals into one of several groups. * Categorical and Nominal: Used when there is no natural ordering to the groups. Examples: Gender of a newborn, eye colour, favourite television show, or reason for taking a specific course. * Categorical and Ordinal: Used when there is a logical ordering to the values. Examples: Letter grades ( $A+$ , $A$ , $B+$ , etc.), service ratings (Good, Fair, Poor), or placing in a tournament ( $1^{st}$ , $2^{nd}$ , $3^{rd}$ , etc.).
Quantitative Data: Represents values of quantitative variables for which arithmetic operations (like adding and averaging) make sense. * Discrete Variables: Quantitative variables that can only take certain values. Examples: The number of children in a family, number of rainy days in a month, or the highest denomination of bill in a wallet. * Continuous Variables: Quantitative variables that can take any value within a given range. Examples: Weight, age, and distance.
Classification Logic: * If the data takes numerical values for which arithmetic makes sense, it is Quantitative. * If not, it is Categorical. * If categorical data has a sensible order, it is Categorical and Ordinal; otherwise, it is Categorical and Nominal.
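The classification logic above can be mirrored in R's data types. The following is a minimal sketch with made-up values (none of them come from the notes): nominal data as a plain factor, ordinal data as an ordered factor, and quantitative data as numeric vectors.

```r
# Hypothetical example values, invented for illustration only.
eye_colour <- factor(c("Blue", "Brown", "Green", "Brown"))   # nominal: no natural order
rating <- factor(c("Good", "Poor", "Fair", "Good"),
                 levels = c("Poor", "Fair", "Good"),
                 ordered = TRUE)                              # ordinal: Poor < Fair < Good
children  <- c(0, 2, 1, 3)                                    # quantitative, discrete
weight_kg <- c(61.2, 74.8, 68.0, 80.5)                        # quantitative, continuous

is.ordered(eye_colour)   # FALSE
is.ordered(rating)       # TRUE
rating[2] < rating[1]    # TRUE: "Poor" ranks below "Good"
mean(children)           # 1.5; averaging makes sense for quantitative data
```

Comparisons such as `<` are only defined for the ordered factor, which matches the idea that ordering is meaningful for ordinal data but not for nominal data.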

Data Distribution and Categorical Visualization

Definition of Distribution: The distribution of a data set identifies what values a variable takes and how often it takes these values. "Value" can be non-quantitative (e.g., "Blue" is a value for eye colour).
Bar Charts: * Displays categorical values on one axis and frequencies on the other. * Formatting: There must be spaces between bars to indicate that the data is not continuous.
Pie Charts: Provide a visual representation of the relative frequency (proportions) of observed values for a categorical variable.
R Implementation for Bar Charts:
    # Category labels, vote shares (%), and one bar colour per party
    party <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
    vote <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
    colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")
    # Draw one bar per party; barplot() leaves spaces between bars by default
    barplot(vote, names.arg = party, col = colours, ylab = "% Vote")
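The same vote shares could also be drawn as a pie chart. This is a sketch that reuses the values from the bar chart example; the `slice_labels` helper and the chart title are illustrative choices, not part of the notes.

```r
# Same values as the bar chart example in the notes.
party <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
vote <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")

# Label each slice with its party and percentage (slice_labels is a made-up name).
slice_labels <- paste0(party, " (", vote, "%)")

# Slice areas are proportional to the relative frequencies.
pie(vote, labels = slice_labels, col = colours, main = "% Vote")
```

Because the percentages cover every category, the slices make up the whole circle.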

Frequency Distributions for Quantitative Data

Frequency Distribution: A count of how many data values fall into predetermined classes or intervals.
Interval Construction Rules: * Intervals are chosen by the researcher; usually $5$ to $10$ intervals are used. * The first interval must include the minimum value; the last must include the maximum. * All intervals must be of equal length. * Endpoint Convention: Intervals typically include the left endpoint but not the right ( $[left, right)$ ). For example, in the interval $70 - 80$ , the value $70$ is included, but $80$ is not.
Relative Frequency Distribution: Calculated by dividing the number of data values in an interval by the total number of data values ( $n$ ). * Proportions: Values between $0$ and $1$ . Proportions for all intervals must sum to $1$ . * Percentages: Calculated as $\text{proportion} \times 100$ .
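Continuity Concern: Intervals like $30-39$ and $40-49$ avoid overlap but leave gaps, so a value like $39.5$ has no interval to fall into. Contiguous intervals (e.g., $30-40$ , $40-50$ ), read with the left-endpoint convention, are preferred because every possible decimal value then falls in exactly one interval.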
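The interval rules above can be tried out in R. This is a minimal sketch on made-up scores, assuming equal-width classes of length 10; `cut()` with `right = FALSE` applies the $[left, right)$ convention.

```r
# Hypothetical test scores, invented for illustration.
scores <- c(52, 58, 61, 65, 67, 70, 70, 73, 78, 80, 84, 89, 91, 95, 99.5)

# Five equal-length classes covering the minimum and the maximum.
breaks <- seq(50, 100, by = 10)

# right = FALSE includes the left endpoint and excludes the right: [50,60), [60,70), ...
classes <- cut(scores, breaks = breaks, right = FALSE)

freq     <- table(classes)        # frequency distribution
rel_freq <- prop.table(freq)      # proportions
pct      <- 100 * rel_freq        # percentages

print(freq)       # counts per class: 2 3 4 3 3
sum(rel_freq)     # 1
```

Both scores of 70 land in $[70, 80)$ rather than $[60, 70)$, and the decimal value 99.5 still has a class.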

Histograms and Shape of Distributions

Histogram: A graphical display of the count or proportion of data values falling into intervals. * It is a form of bar graph with no spaces between bars to reflect data continuity. * The height of the rectangle represents frequency or relative frequency.
Symmetry: A distribution is symmetric if its center divides it into two approximate mirror images.
Skewness: * Right Skewed: The right side (tail) of the histogram (larger values) extends much further than the left. * Left Skewed: The left side (tail) extends further than the right.
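A histogram of a right-skewed data set can be sketched in R as follows. The waiting times are made up for illustration, and the class width of 5 is an arbitrary choice.

```r
# Hypothetical waiting times (minutes), invented for illustration.
waits <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 19)

# Equal-width classes; right = FALSE keeps the left endpoint in each class.
h <- hist(waits, breaks = seq(0, 20, by = 5), right = FALSE,
          xlab = "Waiting time (minutes)", main = "Right-skewed distribution")

h$counts        # 8 3 1 1: most values are small, with a long tail to the right
mean(waits)     # about 5.54, pulled toward the long right tail
median(waits)   # 4
```

The bars touch because `hist()` treats the horizontal axis as a continuous scale, unlike `barplot()`.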
Time Plots: Used for time series data. Time is plotted on the x-axis, and variable values on the y-axis. Points are connected to show trends. * Seasonal Variation: A pattern that repeats at regular intervals (e.g., average monthly temperatures in Winnipeg).
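A time plot with seasonal variation can be drawn from a time series object. The notes do not include the Winnipeg data, so this sketch substitutes `nottem`, a data set that ships with R (average monthly temperatures at Nottingham, 1920 to 1939).

```r
# nottem is part of R's built-in datasets package: average monthly air
# temperatures at Nottingham, 1920-1939, in degrees Fahrenheit.
# It is a stand-in here; it is not the Winnipeg data mentioned in the notes.
plot(nottem, xlab = "Year", ylab = "Average monthly temperature (F)",
     main = "Time plot with seasonal variation")

frequency(nottem)   # 12 observations per year
```

Calling `plot()` on a time series puts time on the x-axis and connects consecutive points, so the yearly cycle shows up as a repeating wave.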

Measures of Location (Center)

Mode: The most frequently observed data value. A data set can have more than one mode.
Median: The middle value in an ordered data set. * Calculation: Order the data and compute the position using $\frac{n+1}{2}$ . * If $n$ is odd, the median is the value in that position. * If $n$ is even, the position falls halfway between two values, and the median is the average of those two middle values. * Resistance: The median is resistant to outliers: extreme values have little or no effect on it.
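The position rule can be written as a short R function. This is a sketch on made-up data; `median_by_position` is a hypothetical helper name, and base R's `median()` is shown for comparison.

```r
# Median via the position rule (n + 1) / 2.
median_by_position <- function(x) {
  x <- sort(x)
  n <- length(x)
  pos <- (n + 1) / 2
  # Odd n: floor(pos) and ceiling(pos) are the same position, giving the middle value.
  # Even n: they are the two middle positions, whose values are averaged.
  (x[floor(pos)] + x[ceiling(pos)]) / 2
}

odd  <- c(7, 3, 9, 4, 5)          # sorted: 3 4 5 7 9
even <- c(7, 3, 9, 4, 5, 100)     # sorted: 3 4 5 7 9 100

median_by_position(odd)                         # 5
median_by_position(even)                        # 6
median(even)                                    # 6, base R agrees
median_by_position(c(7, 3, 9, 4, 5, 1000))      # still 6: resistant to the extreme value
```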
Mean ( $\bar{x}$ ): The arithmetic average, calculated by summing all data values and dividing by the sample size $n$ . $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ * Center of Mass: The mean is the "balance point" of the data. If data points were weights on a teeter-totter, the mean is where it would balance. * Sensitivity: The mean is not resistant to outliers; extreme values pull the mean toward them.
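Both properties of the mean can be checked numerically. The sketch below uses made-up values: the deviations from the mean cancel out (the balance point), and adding one extreme value moves the mean far more than the median.

```r
# Hypothetical data, invented for illustration.
x <- c(2, 4, 5, 7, 12)

x_bar <- sum(x) / length(x)   # 6
sum(x - x_bar)                # 0: deviations below and above the mean balance

# Add one extreme value and compare the two measures of center.
x_out <- c(x, 60)
mean(x_out)                   # 15: pulled strongly toward the outlier
median(x)                     # 5
median(x_out)                 # 6: barely changes
```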
Weighted Mean ( $\bar{x}_w$ ): Used when some data values carry more weight than others. $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$ * Example 1 (GPA): Credit hours serve as the weights ( $w_i$ ) for grades ( $x_i$ ). * Example 2 (Combined Group Mean): To find the mean age of a class with $6$ males (mean $23.2$ ) and $4$ females (mean $21.7$ ): $\bar{x}_c = \frac{(6 \times 23.2) + (4 \times 21.7)}{10} = 22.6$
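Both examples can be reproduced with base R's `weighted.mean()`. The combined-group numbers are the ones from the notes; the grade points and credit hours in the GPA part are made up for illustration.

```r
# Example 2 from the notes: group means weighted by group sizes.
group_means <- c(23.2, 21.7)   # males, females
group_sizes <- c(6, 4)
combined <- weighted.mean(group_means, w = group_sizes)
combined                                                # 22.6

# The same result from the formula sum(w * x) / sum(w).
sum(group_sizes * group_means) / sum(group_sizes)       # 22.6

# Example 1 style (GPA), with hypothetical grade points and credit hours.
grade_points <- c(4.0, 3.0, 3.5)
credit_hours <- c(3, 6, 3)
gpa <- weighted.mean(grade_points, w = credit_hours)
gpa                                                     # 3.375
```

The unweighted mean of the three hypothetical grades would be 3.5; the 6-credit course pulls the weighted result down toward its grade of 3.0.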

Measures of Variability (Spread)

Range ( $R$ ): The difference between the maximum and minimum values ( $R = \text{max} - \text{min}$ ). It is highly sensitive to outliers.
Interquartile Range ( $IQR$ ): Measures the length of the interval covering the middle $50\%$ of observations. $IQR = Q_3 - Q_1$
Quartiles: * First Quartile ( $Q_1$ ): The $25^{th}$ percentile. The median of the lower half of the data. * Third Quartile ( $Q_3$ ): The $75^{th}$ percentile. The median of the upper half of the data.
Five-Number Summary: Consists of [Min, $Q_1$ , Median, $Q_3$ , Max]. This describes the center, shape, and spread.
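The range, quartiles, IQR, and five-number summary can be computed by hand in R. This sketch uses made-up data with an even number of observations, because the notes do not say whether the median itself belongs to either half when $n$ is odd.

```r
# Hypothetical data with an even n, so the two halves are unambiguous.
x <- sort(c(12, 15, 16, 18, 21, 22, 25, 27, 30, 41))
n <- length(x)
half <- n %/% 2

q1  <- median(x[1:half])          # median of the lower half: 16
q3  <- median(x[(half + 1):n])    # median of the upper half: 27
med <- median(x)                  # 21.5

r    <- max(x) - min(x)           # range: 29
iqr  <- q3 - q1                   # interquartile range: 11
five <- c(min(x), q1, med, q3, max(x))

five          # 12 16 21.5 27 41
fivenum(x)    # base R's five-number summary gives the same values here
```

Base R's `quantile()` and `IQR()` interpolate between data values by default, so they can return slightly different quartiles than the median-of-halves rule.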

Boxplots and Outlier Detection

Quantile Boxplots: A visual representation of the five-number summary: a box from $Q_1$ to $Q_3$ (so its length is the IQR), a line inside the box at the median, and whiskers reaching the min and max.
Outlier Boxplots (Modified Boxplots): Used to identify extreme values. * Fences: $\text{Lower Fence (LF)} = Q_1 - 1.5 \times IQR$ and $\text{Upper Fence (UF)} = Q_3 + 1.5 \times IQR$ . * Outliers: Any data point below the lower fence or above the upper fence. * Whiskers: Extend to the lowest and highest data values that are inside the fences (the "new" min and max). Outliers are plotted as individual points.
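The fence calculation can be sketched in R on made-up data. The quartiles are taken as medians of the lower and upper halves (even $n$ assumed); `boxplot()` applies the same 1.5 × IQR rule by default, though its quartile rule can differ for other sample sizes.

```r
# Hypothetical data with one unusually large value.
x <- sort(c(12, 15, 16, 18, 21, 22, 25, 27, 30, 60))
n <- length(x)
half <- n %/% 2

q1  <- median(x[1:half])          # 16
q3  <- median(x[(half + 1):n])    # 27
iqr <- q3 - q1                    # 11

lf <- q1 - 1.5 * iqr              # lower fence: -0.5
uf <- q3 + 1.5 * iqr              # upper fence: 43.5

outliers <- x[x < lf | x > uf]                # 60
whiskers <- range(x[x >= lf & x <= uf])       # 12 and 30: the "new" min and max

# Outlier boxplot; the point at 60 is drawn individually.
boxplot(x, horizontal = TRUE)
```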
Comparison: Side-by-side boxplots allow comparison of different populations (e.g., heights of Blue Jays pitchers vs. fielders) with respect to center, shape, and spread using a uniform scale.

Variance and Standard Deviation

Sample Variance ( $s^2$ ): Loosely, the average squared deviation from the mean. $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
Sample Standard Deviation ( $s$ ): The positive square root of the variance. $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
Calculation Steps: 1. Find the mean $\bar{x}$ . 2. Calculate deviations $(x_i - \bar{x})$ . Note: $\sum (x_i - \bar{x})$ always equals $0$ . 3. Square those deviations. 4. Sum the squared deviations. 5. Divide by $n - 1$ to obtain the variance $s^2$ . 6. Take the positive square root to obtain the standard deviation $s$ .
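The calculation steps, followed by the square root that turns the variance into the standard deviation, can be traced in R. The values are made up for illustration; `var()` and `sd()` are base R functions that use the same $n - 1$ divisor.

```r
# Hypothetical observations (say, in dollars).
x <- c(4, 7, 8, 10, 11)
n <- length(x)

x_bar <- mean(x)            # step 1: mean = 8
dev   <- x - x_bar          # step 2: deviations -4 -1 0 2 3
sum(dev)                    #         always 0
sq    <- dev^2              # step 3: squared deviations 16 1 0 4 9
ss    <- sum(sq)            # step 4: sum of squares = 30
s2    <- ss / (n - 1)       # step 5: variance = 7.5 (dollars squared)
s     <- sqrt(s2)           # standard deviation, about 2.74 (dollars)

var(x)                      # 7.5, matches s2
sd(x)                       # matches s
```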
Units: * The mean and standard deviation share the same units as the individual observations (e.g., dollars). * Variance is expressed in squared units (e.g., $\text{dollars}^2$ ).

Choosing Numerical Summaries

Symmetric Distributions (no outliers): Report the mean and standard deviation because they utilize all data in the sample.
Skewed Distributions or Outliers present: Report the five-number summary, relying on the median for center and the IQR for spread, because these two measures are resistant to extreme values and so describe the typical observation more accurately.