Unit 1: Examining Distributions Complete Study Notes

Core Statistical Concepts and Terminology

  • Statistics Defined: Statistics refers to the set of methods used for obtaining, organizing, summarizing, presenting, and analyzing data.

  • Data Sources: Data is derived from characteristics measured on individuals or units. Examples of units include people, animals, places, or objects (things).

  • Population: This is defined as the totality of individuals about which information is desired.

  • Sample: A sample is a subset of the units in a population that are actually examined to gather information.
     * Example 1: 1,000 voters are asked which candidate they support in an upcoming election. The population would be all eligible voters in that election.
     * Example 2: 50 insomnia patients are given a new treatment. The population would be all patients suffering from insomnia.
     * Example 3: 200 Canada geese are tagged to study migration. The population would be all Canada geese.

Variables and Data Classifications

  • Variable: A characteristic or property of an individual. Examples include:
     * Time until a light bulb burns out.
     * Distance traveled by a taxi driver in one day.
     * Number of Heads in five tosses of a quarter.
     * Hair colour.
     * A student's grade in a course.

  • Categorical Data: Represents values of categorical variables that place individuals into one of several groups.
     * Categorical and Nominal: Used when there is no natural ordering to the groups. Examples: Gender of a newborn, eye colour, favourite television show, or reason for taking a specific course.
     * Categorical and Ordinal: Used when there is a logical ordering to the values. Examples: Letter grades (A+, A, B+, etc.), service ratings (Good, Fair, Poor), or placing in a tournament (1st, 2nd, 3rd, etc.).

  • Quantitative Data: Represents values of quantitative variables for which arithmetic operations (like adding and averaging) make sense.
     * Discrete Variables: Quantitative variables that can only take certain values. Examples: The number of children in a family, number of rainy days in a month, or the highest denomination of bill in a wallet.
     * Continuous Variables: Quantitative variables that can take any value within a given range. Examples: Weight, age, and distance.

  • Classification Logic:
     * If the data takes numerical values for which arithmetic makes sense, it is Quantitative.
     * If not, it is Categorical.
     * If categorical data has a sensible order, it is Categorical and Ordinal; otherwise, it is Categorical and Nominal.
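As a rough illustration of the classification logic, the sketch below stores one variable of each type in R. All observations and variable names here are hypothetical, invented only for this example.

```r
# Hypothetical observations, used only for illustration
eye_colour <- factor(c("Blue", "Brown", "Brown", "Green"))            # nominal
rating <- factor(c("Good", "Poor", "Fair", "Good"),
                 levels = c("Poor", "Fair", "Good"), ordered = TRUE)  # ordinal
children <- c(0L, 2L, 1L, 3L)                                         # discrete
weight_kg <- c(61.3, 72.8, 80.1, 55.0)                                # continuous

is.ordered(eye_colour)  # no natural order, so this is FALSE
is.ordered(rating)      # logical order Poor < Fair < Good, so this is TRUE
below_good <- rating < "Good"   # comparisons are defined for ordered factors
mean_children <- mean(children) # arithmetic makes sense for quantitative data
```

Storing ordinal data as an ordered factor keeps the order of the groups without pretending that arithmetic on them is meaningful.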

Data Distribution and Categorical Visualization

  • Definition of Distribution: The distribution of a data set identifies what values a variable takes and how often it takes these values. "Value" can be non-quantitative (e.g., "Blue" is a value for eye colour).

  • Bar Charts:
     * Display categorical values on one axis and frequencies on the other.
     * Formatting: There must be spaces between bars to indicate that the data is not continuous.

  • Pie Charts: Provide a visual representation of the relative frequency (proportions) of observed values for a categorical variable.

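  • R Implementation for Bar Charts:
       # Party labels and each party's share of the vote (in percent)
       party <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
       vote <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
       # One bar colour per party, in the same order as `party`
       colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")
       # Draw the chart; barplot() leaves gaps between bars by default
       barplot(vote, names.arg = party, col = colours, ylab = "% Vote")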
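A pie chart of the same election figures could be sketched along the following lines. This is one possible approach using base R's pie(); the label format and the `share` variable are choices made for this example, not part of the original notes.

```r
# Same data as the bar chart example
party <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
vote <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")

# Convert percentages to proportions (relative frequencies)
share <- vote / sum(vote)

# Each slice's angle is proportional to its relative frequency
pie(share, labels = paste0(party, " (", vote, "%)"), col = colours)
```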

Frequency Distributions for Quantitative Data

  • Frequency Distribution: A count of how many data values fall into predetermined classes or intervals.

  • Interval Construction Rules:
     * Intervals are chosen by the researcher; usually 5 to 10 intervals are used.
     * The first interval must include the minimum value; the last must include the maximum.
     * All intervals must be of equal length.
     * Endpoint Convention: Intervals typically include the left endpoint but not the right, written [left, right). For example, in the interval 70-80, the value 70 is included, but 80 is not.

  • Relative Frequency Distribution: Calculated by dividing the number of data values in an interval by the total number of data values ($n$).
     * Proportions: Values between 0 and 1. Proportions for all intervals must sum to 1.
     * Percentages: Calculated as proportion × 100.

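  • Continuity Concern: While intervals like 30-39 and 40-49 avoid overlap, they leave gaps between classes. Contiguous intervals that share endpoints (e.g., 30-40, 40-50) are preferred so that every possible decimal value has a place; a value like 59.5, for instance, would otherwise fall between the classes 50-59 and 60-69.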
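The interval rules above can be sketched in R with cut() and table(). The scores below are hypothetical, invented for illustration; setting right = FALSE applies the [left, right) endpoint convention.

```r
# Hypothetical test scores, invented for illustration
scores <- c(32, 38, 40, 45, 47, 51, 55, 59.5, 60, 63, 68, 70, 74, 79)

# Five equal-length intervals covering the minimum and the maximum
breaks <- seq(30, 80, by = 10)

# right = FALSE gives intervals of the form [left, right)
classes <- cut(scores, breaks = breaks, right = FALSE)

freq <- table(classes)        # frequency distribution
rel_freq <- prop.table(freq)  # proportions, which sum to 1
percent <- 100 * rel_freq     # percentages

freq
round(percent, 1)
```

Note that 59.5 lands in [50,60) and 60 lands in [60,70), so no value is left without a class.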

Histograms and Shape of Distributions

  • Histogram: A graphical display of the count or proportion of data values falling into intervals.
     * It is a form of bar graph with no spaces between bars to reflect data continuity.
     * The height of the rectangle represents frequency or relative frequency.
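A histogram of this kind might be drawn in R with hist(), as in the sketch below. The scores are hypothetical. One detail worth knowing: with right = FALSE, hist() still includes the right endpoint of the last interval by default (include.lowest = TRUE), which only matters if a value equals the largest break.

```r
# Hypothetical test scores, invented for illustration
scores <- c(32, 38, 40, 45, 47, 51, 55, 59.5, 60, 63, 68, 70, 74, 79)

# Bars touch each other; heights are the counts in each interval
h <- hist(scores, breaks = seq(30, 80, by = 10), right = FALSE,
          main = "Hypothetical scores", xlab = "Score")

h$counts  # the frequencies used for the bar heights
```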

  • Symmetry: A distribution is symmetric if its center divides it into two approximate mirror images.

  • Skewness:
     * Right Skewed: The right side (tail) of the histogram (larger values) extends much further than the left.
     * Left Skewed: The left side (tail) extends further than the right.

  • Time Plots: Used for time series data. Time is plotted on the x-axis, and variable values on the y-axis. Points are connected to show trends.
     * Seasonal Variation: A pattern that repeats at regular intervals (e.g., average monthly temperatures in Winnipeg).
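As an illustration of a time plot with seasonal variation, the sketch below uses `nottem`, a data set that ships with R and records average monthly temperatures at Nottingham (not Winnipeg) in degrees Fahrenheit. It is a stand-in chosen for convenience, not the data referred to in these notes.

```r
# nottem is a monthly time series from R's built-in datasets package
# Plotting a ts object puts time on the x-axis and connects the points
plot(nottem, xlab = "Year", ylab = "Mean monthly temperature (degrees F)")

obs_per_year <- frequency(nottem)  # number of observations per year
```

The yearly rise and fall in the plot is the kind of regularly repeating pattern described as seasonal variation.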

Measures of Location (Center)

  • Mode: The most frequently observed data value. A data set can have more than one mode.

  • Median: The middle value in an ordered data set.
     * Calculation: Order the data and compute the position $\frac{n+1}{2}$.
     * If $n$ is odd, the median is the value in that position.
     * If $n$ is even, the position falls halfway between two values, and the median is the average of those two middle values.
     * Resistance: The median is resistant to outliers; extreme values have little or no effect on it.
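The position rule for the median can be sketched as a small R function. The function name `median_by_position` and the data are hypothetical; in practice R's built-in median() gives the same result.

```r
# Median via the (n + 1) / 2 position rule
median_by_position <- function(x) {
  x <- sort(x)           # order the data first
  n <- length(x)
  pos <- (n + 1) / 2     # whole number if n is odd, ends in .5 if n is even
  # For odd n both indices are the same; for even n they are the two middle values
  mean(x[c(floor(pos), ceiling(pos))])
}

odd_data <- c(7, 3, 9, 1, 5)        # n = 5, position 3
even_data <- c(7, 3, 9, 1, 5, 11)   # n = 6, position 3.5

median_by_position(odd_data)   # 5
median_by_position(even_data)  # 6, the average of 5 and 7
```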

  • Mean ($\bar{x}$): The arithmetic average, calculated by summing all data values and dividing by the sample size $n$.
       $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
     * Center of Mass: The mean is the "balance point" of the data. If data points were weights on a teeter-totter, the mean is where it would balance.
     * Sensitivity: The mean is not resistant to outliers; extreme values pull the mean toward them.
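To see how an extreme value affects the mean but barely moves the median, consider the following sketch. The incomes are hypothetical figures (in thousands of dollars) invented for this example.

```r
# Hypothetical incomes in thousands of dollars
incomes <- c(30, 35, 40, 45, 50)
with_outlier <- c(incomes, 500)   # add one extreme value

# The mean is the sum of the values divided by n
mean_plain <- sum(incomes) / length(incomes)            # 40
mean_outlier <- sum(with_outlier) / length(with_outlier) # about 116.67

median_plain <- median(incomes)         # 40
median_outlier <- median(with_outlier)  # 42.5
```

A single extreme value pulls the mean from 40 to roughly 117, while the median only shifts from 40 to 42.5.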

  • Weighted Mean ($\bar{x}_w$): Used when some data values carry more weight than others.
       $$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$
     * Example 1 (GPA): Credit hours serve as the weights ($w_i$) for grades ($x_i$).
     * Example 2 (Combined Group Mean): To find the mean age of a class with 6 males (mean 23.2) and 4 females (mean 21.7):
       $$\bar{x}_c = \frac{(6 \times 23.2) + (4 \times 21.7)}{10} = 22.6$$
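The combined group mean from Example 2 can be checked in R, either directly from the formula or with the built-in weighted.mean(). The variable names are chosen for this sketch.

```r
# Group means and group sizes from Example 2
group_means <- c(males = 23.2, females = 21.7)
group_sizes <- c(males = 6, females = 4)   # the weights

# Directly from the weighted mean formula
by_formula <- sum(group_sizes * group_means) / sum(group_sizes)

# Same calculation using base R
built_in <- weighted.mean(group_means, w = group_sizes)

by_formula  # 22.6
```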

Measures of Variability (Spread)

  • Range ($R$): The difference between the maximum and minimum values ($R = \text{max} - \text{min}$). It is highly sensitive to outliers.

  • Interquartile Range ($IQR$): Measures the length of the interval covering the middle 50% of observations.
       $$IQR = Q_3 - Q_1$$

  • Quartiles:
     * First Quartile ($Q_1$): The 25th percentile. The median of the lower half of the data.
     * Third Quartile ($Q_3$): The 75th percentile. The median of the upper half of the data.

  • Five-Number Summary: Consists of [Min, $Q_1$, Median, $Q_3$, Max]. This describes the center, shape, and spread.
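A minimal sketch of the five-number summary follows, using the "median of each half" rule for quartiles. It assumes that when $n$ is odd the overall median is left out of both halves, which these notes do not specify. The function `five_number` and the data are hypothetical.

```r
# Hypothetical data, invented for illustration
x <- c(12, 5, 8, 2, 15, 7, 4, 10)

five_number <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)               # size of each half (median excluded if n is odd)
  lower <- x[seq_len(half)]          # lower half of the ordered data
  upper <- x[seq(n - half + 1, n)]   # upper half of the ordered data
  c(Min = x[1], Q1 = median(lower), Median = median(x),
    Q3 = median(upper), Max = x[n])
}

s <- five_number(x)
iqr <- s[["Q3"]] - s[["Q1"]]
spread <- s[["Max"]] - s[["Min"]]    # the range R

s     # Min 2, Q1 4.5, Median 7.5, Q3 11, Max 15
iqr   # 6.5
```

R's fivenum() agrees with this result for these data, but quantile() uses a different interpolation rule by default and may report slightly different quartiles.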

Boxplots and Outlier Detection

  • Quantile Boxplots: A visual representation of the five-number summary: a line at the median, a box for the IQR, and whiskers reaching the min and max.

  • Outlier Boxplots (Modified Boxplots): Used to identify extreme values.
     * Fences:
       $$\text{Lower Fence (LF)} = Q_1 - 1.5 \times IQR$$
       $$\text{Upper Fence (UF)} = Q_3 + 1.5 \times IQR$$
     * Outliers: Any data point outside these fences.
     * Whiskers: Extend to the lowest and highest data values that are inside the fences (the "new" min and max). Outliers are plotted as individual points.
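The fence calculation can be sketched as follows. The data are hypothetical, and the quartiles are computed as medians of the lower and upper halves; other quartile conventions could move the fences slightly.

```r
# Hypothetical data with one suspiciously large value
x <- sort(c(2, 4, 5, 7, 8, 10, 12, 40))
n <- length(x)
half <- floor(n / 2)

q1 <- median(x[seq_len(half)])          # median of the lower half: 4.5
q3 <- median(x[seq(n - half + 1, n)])   # median of the upper half: 11
iqr <- q3 - q1                          # 6.5

lower_fence <- q1 - 1.5 * iqr           # -5.25
upper_fence <- q3 + 1.5 * iqr           # 20.75

# Points outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers reach the most extreme values still inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# boxplot() applies the 1.5 * IQR rule by default and plots outliers as points
boxplot(x, horizontal = TRUE)
```

Here 40 is the only outlier, and the whiskers extend from 2 to 12.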

  • Comparison: Side-by-side boxplots allow comparison of different populations (e.g., heights of Blue Jays pitchers vs. fielders) with respect to center, shape, and spread using a uniform scale.

Variance and Standard Deviation

  • Sample Variance ($s^2$): Loosely, the average squared deviation from the mean.
       $$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

  • Sample Standard Deviation ($s$): The positive square root of the variance.
       $$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$

  • Calculation Steps:
     1. Find the mean $\bar{x}$.
     2. Calculate deviations $(x_i - \bar{x})$. Note: $\sum (x_i - \bar{x})$ always equals 0.
     3. Square those deviations.
     4. Sum the squared deviations.
     5. Divide by $n - 1$ to obtain the variance $s^2$; take the positive square root to obtain $s$.
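The calculation steps can be traced in R on a small data set. The values below are hypothetical and chosen so the arithmetic is easy to follow; the result is compared against R's built-in var() and sd().

```r
# Hypothetical data, invented for illustration
x <- c(2, 4, 4, 6, 9)
n <- length(x)

x_bar <- mean(x)               # Step 1: the mean is 5
deviations <- x - x_bar        # Step 2: -3, -1, -1, 1, 4 (these sum to 0)
squared <- deviations^2        # Step 3: 9, 1, 1, 1, 16
sum_sq <- sum(squared)         # Step 4: 28
s2 <- sum_sq / (n - 1)         # Step 5: variance is 28 / 4 = 7
s <- sqrt(s2)                  # standard deviation, about 2.65

c(variance = s2, sd = s)
```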

  • Units:
     * The mean and standard deviation share the same units as the individual observations (e.g., dollars).
     * Variance is expressed in squared units (e.g., $\text{dollars}^2$).

Choosing Numerical Summaries

  • Symmetric Distributions (no outliers): Report the mean and standard deviation because they utilize all data in the sample.

  • Skewed Distributions or Outliers present: Report the five-number summary (specifically median and IQR) because they are resistant to extreme values and provide a more accurate description of the typical observation.