Unit 1: Examining Distributions Complete Study Notes
Core Statistical Concepts and Terminology
Statistics Defined: Statistics refers to the set of methods used for obtaining, organizing, summarizing, presenting, and analyzing data.
Data Sources: Data is derived from characteristics measured on individuals or units. Examples of units include people, animals, places, or objects (things).
Population: This is defined as the totality of individuals about which information is desired.
Sample: A sample is a subset of the units in a population that are actually examined to gather information. * Example 1: 1,000 voters are asked which candidate they support in an upcoming election. The population would be all eligible voters in that election. * Example 2: 50 insomnia patients are given a new treatment. The population would be all patients suffering from insomnia. * Example 3: 200 Canada geese are tagged to study migration. The population would be all Canada geese.
Variables and Data Classifications
Variable: A characteristic or property of an individual. Examples include: * Time until a light bulb burns out. * Distance traveled by a taxi driver in one day. * Number of Heads in five tosses of a quarter. * Hair colour. * A student's grade in a course.
Categorical Data: Represents values of categorical variables that place individuals into one of several groups. * Categorical and Nominal: Used when there is no natural ordering to the groups. Examples: Gender of a newborn, eye colour, favourite television show, or reason for taking a specific course. * Categorical and Ordinal: Used when there is a logical ordering to the values. Examples: Letter grades (A, B, C, etc.), service ratings (Good, Fair, Poor), or placing in a tournament (1st, 2nd, 3rd, etc.).
Quantitative Data: Represents values of quantitative variables for which arithmetic operations (like adding and averaging) make sense. * Discrete Variables: Quantitative variables that can only take certain values. Examples: The number of children in a family, number of rainy days in a month, or the highest denomination of bill in a wallet. * Continuous Variables: Quantitative variables that can take any value within a given range. Examples: Weight, age, and distance.
Classification Logic: * If the data takes numerical values for which arithmetic makes sense, it is Quantitative. * If not, it is Categorical. * If categorical data has a sensible order, it is Categorical and Ordinal; otherwise, it is Categorical and Nominal.
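The classification logic above maps loosely onto R's vector types. The following is a minimal sketch, assuming made-up values (none of them come from the notes): unordered factors for nominal data, ordered factors for ordinal data, and numeric vectors for quantitative data.

```r
# Made-up illustrative values, not data from the notes.
eye_colour <- factor(c("Blue", "Brown", "Green", "Brown"))   # categorical, nominal
rating     <- factor(c("Good", "Poor", "Fair", "Good"),
                     levels = c("Poor", "Fair", "Good"),
                     ordered = TRUE)                         # categorical, ordinal
children   <- c(0L, 2L, 1L, 3L)                              # quantitative, discrete
weight_kg  <- c(61.2, 80.5, 72.0, 55.9)                      # quantitative, continuous

is.ordered(eye_colour)   # FALSE: the groups have no natural ordering
is.ordered(rating)       # TRUE: Poor < Fair < Good
rating[1] > rating[2]    # comparisons are meaningful for ordered factors
mean(children)           # arithmetic makes sense for quantitative data
```

Calling mean() on either factor would only produce a warning and NA, which mirrors the rule that arithmetic does not make sense for categorical data.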
Data Distribution and Categorical Visualization
Definition of Distribution: The distribution of a data set identifies what values a variable takes and how often it takes these values. "Value" can be non-quantitative (e.g., "Blue" is a value for eye colour).
Bar Charts: * Displays categorical values on one axis and frequencies on the other. * Formatting: There must be spaces between bars to indicate that the data is not continuous.
Pie Charts: Provide a visual representation of the relative frequency (proportions) of observed values for a categorical variable.
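R Implementation for Bar Charts:
# Party labels and their vote shares (in percent)
party <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
vote <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
# One fill colour per bar, in the same order as party
colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")
# Draw the bar chart: categories on the x-axis, percentages on the y-axis
barplot(vote, names.arg = party, col = colours, ylab = "% Vote")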
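The same vote shares can also be drawn as a pie chart. This is a sketch using base R's pie(), reusing the vectors from the bar chart code; the `share` and `slice_labels` names are introduced here for illustration only.

```r
# Same values as in the bar chart code from the notes.
party   <- c("Bloc", "CPC", "Green", "Liberal", "NDP", "PPC", "Other")
vote    <- c(7.64, 33.74, 2.33, 32.62, 17.83, 4.94, 0.90)
colours <- c("deepskyblue", "blue", "green3", "red", "orange", "purple4", "black")

# Relative frequencies (proportions); these sum to 1.
share <- vote / sum(vote)

# Label each slice with the party name and its percentage.
slice_labels <- paste0(party, " (", round(100 * share, 1), "%)")

pie(share, labels = slice_labels, col = colours)
```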
Frequency Distributions for Quantitative Data
Frequency Distribution: A count of how many data values fall into predetermined classes or intervals.
Interval Construction Rules: * The intervals, and how many to use, are chosen by the researcher. * The first interval must include the minimum value; the last must include the maximum. * All intervals must be of equal length. * Endpoint Convention: Intervals typically include the left endpoint but not the right, written $[a, b)$. For example, in the interval $[10, 20)$, the value 10 is included, but 20 is not.
Relative Frequency Distribution: Calculated by dividing the number of data values in an interval by the total number of data values ($n$). * Proportions: Values between 0 and 1. Proportions for all intervals must sum to 1. * Percentages: Calculated as proportion × 100.
Continuity Concern: While intervals like 10-19 and 20-29 avoid overlap, continuous intervals (e.g., $[10, 20)$, $[20, 30)$) are preferred to ensure every possible decimal value (like 19.5) has a place.
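A frequency and relative frequency distribution can be built in base R with cut() and table(). This is a minimal sketch on made-up scores (the data and the interval width of 10 are assumptions chosen for illustration); right = FALSE gives intervals that include the left endpoint but not the right.

```r
# Made-up illustrative data; not from the notes.
scores <- c(52, 67, 70, 71, 75, 78, 80, 83, 88, 95)

# Equal-width intervals that cover the minimum and the maximum.
breaks <- seq(50, 100, by = 10)

# right = FALSE gives left-closed intervals: [50,60), [60,70), ...
classes <- cut(scores, breaks = breaks, right = FALSE)

freq     <- table(classes)       # counts per interval
rel_freq <- prop.table(freq)     # proportions, which sum to 1
percent  <- 100 * rel_freq       # percentages

data.frame(interval   = names(freq),
           frequency  = as.vector(freq),
           proportion = as.vector(rel_freq),
           percent    = as.vector(percent))
```

With right = FALSE, a value exactly equal to the last break (100 here) would not be assigned to any interval unless include.lowest = TRUE is also set, so the breaks should extend past the maximum.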
Histograms and Shape of Distributions
Histogram: A graphical display of the count or proportion of data values falling into intervals. * It is a form of bar graph with no spaces between bars to reflect data continuity. * The height of the rectangle represents frequency or relative frequency.
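A histogram of the same kind of data can be sketched with base R's hist(). The scores below are made up for illustration, and the breaks are an assumed choice of equal-width intervals.

```r
# Made-up illustrative data; not from the notes.
scores <- c(52, 67, 70, 71, 75, 78, 80, 83, 88, 95)

# right = FALSE makes each bar cover [left, right), matching the endpoint convention.
h <- hist(scores,
          breaks = seq(50, 100, by = 10),
          right  = FALSE,
          col    = "grey80",
          xlab   = "Score",
          main   = "Histogram of scores")

h$counts                    # bar heights as frequencies
h$counts / sum(h$counts)    # the same heights as relative frequencies
```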
Symmetry: A distribution is symmetric if its center divides it into two approximate mirror images.
Skewness: * Right Skewed: The right side (tail) of the histogram (larger values) extends much further than the left. * Left Skewed: The left side (tail) extends further than the right.
Time Plots: Used for time series data. Time is plotted on the x-axis, and variable values on the y-axis. Points are connected to show trends. * Seasonal Variation: A pattern that repeats at regular intervals (e.g., average monthly temperatures in Winnipeg).
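A time plot with seasonal variation can be sketched using `nottem`, a monthly temperature series that ships with base R (Nottingham, 1920 to 1939). It stands in for the Winnipeg example here, since the notes do not supply that data.

```r
# nottem is a built-in monthly time series (datasets package).
# Keep five years so the repeating yearly pattern is easy to see.
temps <- window(nottem, start = c(1930, 1), end = c(1934, 12))

# Time on the x-axis, the variable on the y-axis, points connected by lines.
plot(temps, type = "o",
     xlab = "Year",
     ylab = "Average monthly temperature (degrees F)",
     main = "Time plot with seasonal variation")
```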
Measures of Location (Center)
Mode: The most frequently observed data value. A data set can have more than one mode.
Median: The middle value in an ordered data set. * Calculation: Order the data and compute the position using $(n + 1)/2$. * If $n$ is odd, the median is the value in that position. * If $n$ is even, the position falls halfway between two values, and the median is the average of those two middle values. * Resistance: The median is resistant to outliers. Extreme values have little or no effect on it.
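The median calculation can be sketched in R as follows, assuming the usual position rule $(n + 1)/2$. The function name `median_by_position` and the data are hypothetical; base R's median() gives the same results.

```r
# Median by position, following the (n + 1) / 2 rule.
median_by_position <- function(x) {
  x   <- sort(x)
  n   <- length(x)
  pos <- (n + 1) / 2
  if (n %% 2 == 1) {
    x[pos]                                  # odd n: the value in the middle position
  } else {
    mean(x[c(floor(pos), ceiling(pos))])    # even n: average the two middle values
  }
}

odd_data  <- c(7, 3, 9, 1, 5)          # made-up values
even_data <- c(7, 3, 9, 1, 5, 100)     # same values plus one extreme value

median_by_position(odd_data)     # 5
median_by_position(even_data)    # 6
```

Adding the extreme value 100 moves the median only from 5 to 6, which illustrates its resistance to outliers.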
Mean ($\bar{x}$): The arithmetic average, calculated by summing all data values and dividing by the sample size $n$: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. * Center of Mass: The mean is the "balance point" of the data. If data points were weights on a teeter-totter, the mean is where it would balance. * Sensitivity: The mean is not resistant to outliers; extreme values pull the mean toward them.
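The sensitivity of the mean, and its balance-point property, can be checked with a few lines of R. The salary figures are made up for illustration.

```r
# Made-up values, in thousands of dollars.
salaries     <- c(40, 42, 45, 47, 50)
with_outlier <- c(salaries, 500)

mean(salaries)            # 44.8
median(salaries)          # 45

mean(with_outlier)        # about 120.67: pulled toward the extreme value
median(with_outlier)      # 46: barely moves

# Balance point: deviations from the mean sum to zero (up to rounding error).
sum(salaries - mean(salaries))
```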
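Weighted Mean ($\bar{x}_w$): Used when some data values carry more weight than others: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$. * Example 1 (GPA): Credit hours serve as the weights ($w_i$) for grades ($x_i$). * Example 2 (Combined Group Mean): To find the mean age of a class with $n_1$ males (mean $\bar{x}_1$) and $n_2$ females (mean $\bar{x}_2$): $\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$.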
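Both weighted-mean examples can be sketched in R. All numbers below are made up for illustration (the notes' original figures are not available); weighted.mean() is the base R function for this calculation.

```r
# Example 1 (GPA): made-up grade points and credit hours.
grade_points <- c(4.0, 3.0, 3.5)
credit_hours <- c(3, 6, 3)
gpa <- weighted.mean(grade_points, w = credit_hours)     # (12 + 18 + 10.5) / 12

# Example 2 (combined group mean): made-up group sizes and group mean ages.
n_group    <- c(males = 12, females = 18)
mean_group <- c(males = 20, females = 19)
class_mean <- sum(n_group * mean_group) / sum(n_group)   # (240 + 342) / 30

gpa          # 3.375
class_mean   # 19.4
```

The combined mean (19.4) sits closer to the mean of the larger group, not at the simple average of 20 and 19.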
Measures of Variability (Spread)
Range ($R$): The difference between the maximum and minimum values ($R = \max - \min$). It is highly sensitive to outliers.
Interquartile Range (IQR): Measures the length of the interval covering the middle 50% of observations: $\text{IQR} = Q_3 - Q_1$.
Quartiles: * First Quartile ($Q_1$): The 25th percentile. The median of the lower half of the data. * Third Quartile ($Q_3$): The 75th percentile. The median of the upper half of the data.
Five-Number Summary: Consists of [Min, $Q_1$, Median, $Q_3$, Max]. This describes the center, shape, and spread.
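The five-number summary can be sketched in R with quartiles computed as medians of the lower and upper halves. The function `five_number` and the data are hypothetical; the sketch assumes that, when the sample size is odd, the overall median is left out of both halves (the notes do not say which convention applies).

```r
# Five-number summary with quartiles as medians of the two halves.
# Assumes at least two values; for odd n the middle value is excluded from both halves.
five_number <- function(x) {
  x     <- sort(x)
  n     <- length(x)
  half  <- floor(n / 2)
  lower <- x[seq_len(half)]          # lower half of the ordered data
  upper <- x[(n - half + 1):n]       # upper half of the ordered data
  c(Min = x[1], Q1 = median(lower), Median = median(x),
    Q3 = median(upper), Max = x[n])
}

ages <- c(21, 18, 30, 25, 19, 22, 27, 35)   # made-up values

summary5 <- five_number(ages)
summary5                                    # 18, 20, 23.5, 28.5, 35
unname(summary5["Q3"] - summary5["Q1"])     # IQR = 8.5
unname(summary5["Max"] - summary5["Min"])   # range = 17
```

R's built-in quantile() interpolates by default and can return slightly different quartiles (20.5 instead of 20 for the first quartile here), so results may not match hand calculations exactly.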
Boxplots and Outlier Detection
Quantile Boxplots: A visual representation of the five-number summary: a line at the median, a box for the IQR, and whiskers reaching the min and max.
Outlier Boxplots (Modified Boxplots): Used to identify extreme values. * Fences: Lower fence $= Q_1 - 1.5 \times \text{IQR}$; upper fence $= Q_3 + 1.5 \times \text{IQR}$. * Outliers: Any data point outside these fences. * Whiskers: Extend to the lowest and highest data values that are inside the fences (the "new" min and max). Outliers are plotted as individual points.
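Fences, outliers, and whisker ends can be computed step by step in R, assuming the usual $1.5 \times \text{IQR}$ rule. The data are made up for illustration. The quartiles come from fivenum(), which returns Tukey's hinges; for this even-sized sample they equal the medians of the lower and upper halves.

```r
x <- c(2, 18, 19, 21, 22, 25, 27, 30, 35, 60)   # made-up values

hinges <- fivenum(x)            # Min, Q1, Median, Q3, Max (Tukey's hinges)
q1  <- hinges[2]                # 19
q3  <- hinges[4]                # 30
iqr <- q3 - q1                  # 11

lower_fence <- q1 - 1.5 * iqr   # 2.5
upper_fence <- q3 + 1.5 * iqr   # 46.5

outliers <- x[x < lower_fence | x > upper_fence]     # 2 and 60
inside   <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)                            # 18 and 35: the "new" min and max

# Base R's boxplot() applies the same 1.5 * IQR rule by default.
boxplot(x, horizontal = TRUE)
```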
Comparison: Side-by-side boxplots allow comparison of different populations (e.g., heights of Blue Jays pitchers vs. fielders) with respect to center, shape, and spread using a uniform scale.
Variance and Standard Deviation
Sample Variance ($s^2$): Loosely, the average squared deviation from the mean: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Sample Standard Deviation ($s$): The positive square root of the variance: $s = \sqrt{s^2}$.
Calculation Steps: 1. Find the mean $\bar{x}$. 2. Calculate deviations $x_i - \bar{x}$. Note: $\sum (x_i - \bar{x})$ always equals 0. 3. Square those deviations. 4. Sum the squared deviations. 5. Divide by $n - 1$ to get $s^2$; take the square root to get $s$.
Units: * The mean and standard deviation share the same units as the individual observations (e.g., dollars). * Variance is expressed in squared units (e.g., dollars$^2$).
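The calculation steps for the variance and standard deviation can be traced in R on a small made-up data set and checked against the built-in var() and sd(), which also divide by $n - 1$.

```r
x <- c(2, 4, 4, 6, 9)          # made-up values
n <- length(x)

x_bar  <- mean(x)              # step 1: the mean (5)
dev    <- x - x_bar            # step 2: deviations (-3, -1, -1, 1, 4)
sum(dev)                       #         deviations sum to 0
sq_dev <- dev^2                # step 3: squared deviations (9, 1, 1, 1, 16)
ss     <- sum(sq_dev)          # step 4: sum of squared deviations (28)
s2     <- ss / (n - 1)         # step 5: sample variance (7)
s      <- sqrt(s2)             # standard deviation (about 2.65)

all.equal(s2, var(x))          # TRUE
all.equal(s, sd(x))            # TRUE
```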
Choosing Numerical Summaries
Symmetric Distributions (no outliers): Report the mean and standard deviation because they utilize all data in the sample.
Skewed Distributions or Outliers present: Report the five-number summary (specifically median and IQR) because they are resistant to extreme values and provide a more accurate description of the typical observation.