Introduction to Statistics
Definition of Statistics: The science of collecting, organizing, analyzing, and interpreting data.
Types of Statistics:
Descriptive Statistics: Summarizes and organizes data (e.g., graphs, charts, averages).
Inferential Statistics: Makes predictions and inferences about a population based on a sample.
Data Types
Qualitative (Categorical) Data: Non-numeric categories (e.g., gender, color).
Quantitative Data: Numeric data that can be measured.
Discrete Data: Countable values (e.g., number of students).
Continuous Data: Measurable values (e.g., height, weight).
Data Visualization
Graphs:
Bar Graphs: Represent categorical data with rectangular bars.
Histograms: Represent quantitative data with bars showing frequency distribution.
Box Plots: Show distribution through quartiles and medians.
Scatter Plots: Display the relationship between two quantitative variables.
Shapes of Distributions:
Look at the Shape, Outliers, Center, Variability
Shape: Determine if the distribution is symmetric, skewed, or uniform.
Outliers: Identify any data points that lie far outside the overall pattern.
Center: Analyze measures of central tendency such as mean and median.
Variability: Assess the spread of the data using range, interquartile range, and standard deviation.
Skewed Left: Distribution tail extends to the left, indicating that the majority of values are concentrated on the right. The tail pulls the mean toward it. (Usually Mean < Median)
Skewed Right: Distribution tail extends to the right, indicating that most values are concentrated on the left. The tail pulls the mean toward it. (Usually Mean > Median)
Roughly Symmetric: Distribution is balanced on both sides of the center, with similar frequencies on either side. (Mean ≈ Median)
Approximately Uniform: Distribution shows equal frequency across the range of values, indicating no apparent peaks.
Bimodal: Distribution has two distinct peaks, suggesting the presence of two different groups within the data set.
Mean: The average value of a dataset, calculated by summing all the observations and dividing by the number of observations.
Median: The middle value of the distribution when the data are arranged in ascending order (the average of the two middle values when the number of observations is even). Unlike the mean, the median is resistant to outliers.
Range: Maximum - Minimum
Standard Deviation: The typical distance of values from the mean.
IQR Rule (IQR = Q3 - Q1):
Lower fence: Q1 - 1.5(IQR)
Anything lower than this number is an outlier
Upper fence: Q3 + 1.5(IQR)
Anything higher than this number is an outlier
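The IQR rule can be sketched in a few lines of Python. This is only an illustration: the data are made up, the helper names quartiles and iqr_outliers are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data (other software may use a slightly different quartile convention). The sketch needs at least two data values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the number of values is odd, the overall median
    is left out of both halves.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]    # smallest half of the data
    upper = values[-half:]   # largest half of the data
    return median(lower), median(upper)

def iqr_outliers(data):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

scores = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # hypothetical data
print(quartiles(scores))      # (4.5, 8.5)
print(iqr_outliers(scores))   # [30]
```

Here IQR = 8.5 - 4.5 = 4, so the fences are -1.5 and 14.5, and only 30 lies outside them.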
Add/Subtract Constant Values
Mean: If you add (or subtract) a constant value (c) to every data point, the mean will increase (or decrease) by that constant.
Standard Deviation: The standard deviation remains unchanged when a constant is added or subtracted from data points because it measures how spread out the values are around the mean, which does not change from merely shifting the dataset up or down.
Multiply/Divide Constant Values
Mean: If you multiply (or divide) all data points by a constant value (k), the mean will also be multiplied (or divided) by that constant.
Standard Deviation: The standard deviation will also be multiplied (or divided) by the same constant value (by its absolute value if the constant is negative, since a standard deviation is never negative).
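A quick numerical check of both transformation rules, using made-up data. The sketch uses the population standard deviation (pstdev) for simplicity; the same pattern holds for the sample standard deviation.

```python
from statistics import mean, pstdev

data = [2.0, 4.0, 6.0, 8.0]        # hypothetical data
shifted = [x + 10 for x in data]   # add a constant c = 10
scaled = [x * 3 for x in data]     # multiply by a constant k = 3

# Adding a constant shifts the mean but leaves the spread alone.
print(mean(data), pstdev(data))
print(mean(shifted), pstdev(shifted))

# Multiplying by a constant scales both the mean and the spread.
print(mean(scaled), pstdev(scaled))
```

The mean goes from 5 to 15 in both cases, but only the multiplied data has a larger standard deviation (three times the original).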
Exploring Two-Variable Data
Percentiles: The pth percentile is the value with p% of the observations in the dataset less than or equal to it, indicating the relative standing of a value within a distribution.
Z-score: How many standard deviations from the mean an observation falls, and in what direction. z = (Value - Mean) / Standard Deviation
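As a small worked example of the z-score calculation, assume a hypothetical test with mean 70 and standard deviation 10. The function name z_score is just for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

print(z_score(85, 70, 10))   # 1.5  (1.5 standard deviations above the mean)
print(z_score(60, 70, 10))   # -1.0 (1 standard deviation below the mean)
```

Note that the subtraction happens before the division; a positive result means the value is above the mean and a negative result means it is below.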
Continuous Probability Distribution:
Describes probabilities of continuous random variables using probability density functions (PDF).
Normal Distribution: A common continuous probability distribution characterized by its bell-shaped curve and defined by its mean (µ) and standard deviation (σ).
Empirical Rule: Approximately 68% of data falls within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within 3 standard deviations from the mean.
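The percentages in the Empirical Rule can be checked against an exact Normal distribution. The sketch below uses NormalDist from Python's standard statistics module; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three results come out near 0.68, 0.95, and 0.997, which is where the rule's rounded percentages come from.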
Two-Way Table: A table of counts or relative frequencies that summarizes data on the relationship between two categorical variables for some group of individuals.
Joint Relative Frequency: The proportion of individuals in a two-way table that have a specific value for one variable and a specific value for the other variable, relative to the total number of individuals in the table.
Conditional Relative Frequency: The ratio of a joint relative frequency to the marginal relative frequency of the given category, providing insight into the likelihood of one event occurring given the occurrence of another.
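A small two-way table makes these relative frequencies concrete. The counts, category names, and function names below are all hypothetical; the table is stored as a nested dictionary with one row per grade level.

```python
# Hypothetical counts: rows are grade level, columns are preferred subject.
table = {
    "junior": {"math": 30, "art": 20},
    "senior": {"math": 10, "art": 40},
}

total = sum(sum(row.values()) for row in table.values())  # 100 individuals

def joint(row, col):
    """Proportion of all individuals in both the given row and the given column."""
    return table[row][col] / total

def marginal_row(row):
    """Proportion of all individuals in the given row."""
    return sum(table[row].values()) / total

def conditional(col, given_row):
    """Proportion in the given column among individuals in the given row."""
    return joint(given_row, col) / marginal_row(given_row)

print(joint("junior", "math"))         # 30 out of 100
print(marginal_row("junior"))          # 50 out of 100
print(conditional("math", "junior"))   # 30 out of the 50 juniors
```

The conditional relative frequency (30 of 50 juniors) uses the row total as its denominator, while the joint relative frequency (30 of 100) uses the table total.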
Bivariate Data: Data that involves two different variables.
Scatter Plots:
Used to visualize the relationship between two quantitative variables.
Each point represents an individual observation with its values on both axes.
Correlation:
Measures the strength and direction of the linear relationship between two quantitative variables.
Correlation Coefficient (r):
Ranges from -1 to 1.
r > 0 indicates a positive correlation; r < 0 indicates a negative correlation; r = 0 indicates no linear correlation. Values closer to -1 or 1 indicate a stronger linear relationship.
Response Variable: Measures an outcome of a study.
Explanatory Variable: May help predict or explain changes in a response variable.
Least-Squares Regression Line:
The line that minimizes the sum of the squares of the vertical distances between the points and the line.
Regression Line: ŷ = a + bx (a = y-intercept, b = slope of the regression line, and ŷ represents the predicted value of y for a given x.)
b = r(s_y / s_x), where s_x and s_y are the standard deviations of x and y.
Interpretation of Slope and Intercept:
Slope (b): Predicted change in the response variable for each one-unit increase in the explanatory variable.
y-intercept (a): Predicted value of the response variable when the explanatory variable is zero.
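The slope formula can be tried on a tiny made-up dataset. In this sketch a is the y-intercept and b is the slope, matching the regression line equation. Two things here go beyond these notes and are stated as assumptions: r is computed from the usual sum-of-products formula, and the intercept is found from a = (mean of y) - b(mean of x), which follows from the least-squares line passing through the point of means.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values

n = len(x)
mean_x, mean_y = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)   # sample standard deviations

# Correlation coefficient r from the standardized products.
r = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x           # slope
a = mean_y - b * mean_x     # y-intercept

def predict(x_value):
    """Predicted response for a given explanatory value."""
    return a + b * x_value

print(round(r, 4), round(b, 4), round(a, 4))
print(round(predict(6), 4))
```

For this data the slope is 0.6, so each one-unit increase in x goes with a predicted increase of 0.6 in y.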
Residuals:
The differences between observed values and predicted values from the regression line.
Residual Plot: A scatter plot of the residuals against the explanatory variable; used to check the appropriateness of the linear model.
RAP: Residual = Actual - Predicted
If the residual plot shows only random scatter, the regression model is appropriate; however, if there is a discernible pattern, it indicates that a linear model may not be the best fit for the data.
Coefficient of Determination (R²):
Represents the proportion of the variance for the response variable that is explained by the explanatory variable in the regression model.
R² closer to 1 indicates a better fit of the model to the data.
s: The standard deviation of the residuals. A smaller value of s suggests that the data points are closer to the regression line, indicating a more accurate model.
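Residuals, R², and s can all be computed from the same fitted line. The sketch below uses made-up data and fits the line directly from sums of squares. It assumes the common convention of dividing by n - 2 when computing s, and it computes R² as 1 - (sum of squared residuals) / (total sum of squares); neither formula is spelled out in these notes.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]   # actual - predicted

sse = sum(e ** 2 for e in residuals)          # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)     # total variation in y
r_squared = 1 - sse / sst
s = sqrt(sse / (n - 2))                       # standard deviation of the residuals

print([round(e, 2) for e in residuals])
print(round(r_squared, 4), round(s, 4))
```

Plotting the residuals against x would give the residual plot; for a least-squares line the residuals always sum to zero, so the plot is centered on the horizontal axis.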
Causation vs. Correlation:
Correlation does not imply causation; additional research is necessary to establish a cause-and-effect relationship.
Collecting Data
Population and Sample:
Population: The entire group that is the subject of study.
Sample: A subset of the population used to collect data.
Census: Collects data from every individual in the population.
Sampling Techniques:
Random Sampling:
Each member of the population has an equal chance of being selected.
Stratified Sampling:
The population is divided into subgroups (strata) and random samples are taken from each stratum.
Cluster Sampling:
Entire groups (clusters) are randomly selected (e.g., select entire classrooms).
Systematic Sampling:
Select every nth member from a list.
Convenience Sampling:
Using an easily accessible group, which can lead to bias and unrepresentative samples.
Simple Random Sample (SRS)
Every group of n individuals in the population has an equal chance of being selected as the sample.
With Replacement: An individual can be chosen more than once, ensuring that each selection does not affect the probability of subsequent selections.
Without Replacement: An individual can be chosen only once, which means that the probability of selection changes after each individual is selected.
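Several of the sampling methods above can be imitated with Python's random module. This is a rough sketch on a hypothetical population of 100 ID numbers; the strata are an arbitrary split chosen for illustration, and the seed is fixed only so the run is repeatable. Cluster sampling is not shown.

```python
import random

rng = random.Random(42)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))    # hypothetical ID numbers 1 to 100

# Simple random sample without replacement: no individual appears twice.
srs_without = rng.sample(population, k=10)

# Sampling with replacement: the same individual may be chosen again.
srs_with = rng.choices(population, k=10)

# Systematic sample: random starting point, then every 10th individual.
step = 10
start = rng.randrange(step)
systematic = population[start::step]

# Stratified sample: a separate random sample from each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [person for group in strata.values() for person in rng.sample(group, k=5)]

print(srs_without)
print(srs_with)
print(systematic)
print(stratified)
```

Here sample draws without replacement and choices draws with replacement, which mirrors the distinction made above.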
Voluntary Response Sample:
A sample that consists of individuals who choose to participate, often leading to bias as those with strong opinions are more likely to respond.
Experimental Design:
Response Variable: An outcome of a study.
Explanatory Variable: A variable that is manipulated or categorized to observe its effect on the response variable.
Confounding: A situation in which the effects of two or more explanatory variables are not separated, making it difficult to determine the individual effects on the response variable.
Placebo: A treatment that has no active ingredient and is used as a control in experiments to compare against the effects of the actual treatment.
Factor: An explanatory variable that is manipulated and may cause a change in the response variable.
Block: A group of experimental units that are similar in some way that is expected to affect the response to the treatment, thus allowing for more accurate comparisons between treatments.
Matched Pairs Experiment: A type of experimental design that involves comparing two treatments by using pairs of similar experimental units, where each pair receives different treatments, allowing for a direct comparison of the effects.
Treatment: A condition applied to subjects in an experiment.
Control Group: A group that receives no treatment or a standard treatment for comparison.
Randomization: Randomly assigning subjects to different treatment groups to reduce bias.
Replication: Using enough subjects in each treatment group that real differences in the response can be distinguished from chance variation.
Blinding: Keeping participants or staff unaware of which treatment is given to reduce bias.
Single-Blind: Participants are unaware of treatment.
Double-Blind: Both participants and researchers are unaware.
Undercoverage: A sampling error that occurs when some members of the population are inadequately represented in the sample.
Observational Studies:
Studies in which researchers observe individuals without manipulating variables, often to find associations.
Retrospective: Examines existing data on individuals.
Prospective: Follows individuals forward in time and collects data as events unfold.
Bias in Sampling and Experiments:
Selection Bias: When the sample is not representative of the population.
Response Bias: When participants give inaccurate responses due to various factors (e.g., wording of questions).
Nonresponse Bias: When a significant number of selected individuals do not respond or participate.
Key Principles:
Differentiate between observational studies and experimental designs.
Understand the impact of different sampling techniques on data validity and bias.
Probability
Definition of Probability: The measure of the likelihood that an event will occur. Value ranges from 0 (impossible event) to 1 (certain event).
Probability Rules:
The Sum Rule: The probabilities of all possible outcomes must sum to 1.
Complement Rule: The probability of an event not occurring is 1 minus the probability of the event occurring: P(Aᶜ) = 1 - P(A), where Aᶜ (also written A') is the complement of event A, meaning all outcomes in the sample space that are not part of event A.
Addition Rule: P(A or B) = P(A) + P(B) - P(A and B). This rule accounts for the overlap between events A and B, ensuring that we do not double-count the probability of both events occurring simultaneously.
Conditional Probability: P(A | B) = P(A and B) / P(B), which represents the probability of event A occurring given that event B has already occurred.
Multiplication Rule: P(A and B) = P(A) * P(B | A), which is used to find the probability of both events A and B occurring together, taking into account the likelihood of A occurring first and then B given that A has occurred. This general rule works for any two events. (If A and B are independent, P(B | A) = P(B), so the rule simplifies to P(A and B) = P(A) * P(B).)
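The rules above can be verified by counting outcomes in a standard 52-card deck. In this sketch event A is "the card is a heart" and event B is "the card is a face card (J, Q, K)"; both events and the helper names are chosen only for illustration. Fractions are used so the results are exact.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely outcomes

def prob(event):
    """Probability of an event when one card is drawn at random."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):
    return card[1] == "hearts"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

p_a = prob(is_heart)
p_b = prob(is_face)
p_a_and_b = prob(lambda card: is_heart(card) and is_face(card))

p_not_a = 1 - p_a                     # complement rule
p_a_or_b = p_a + p_b - p_a_and_b      # addition rule
p_b_given_a = p_a_and_b / p_a         # conditional probability

print(p_not_a, p_a_or_b, p_b_given_a)
print(p_a * p_b_given_a == p_a_and_b)   # multiplication rule holds
```

For this particular pair of events, P(B | A) equals P(B) (both are 3/13), so drawing a heart and drawing a face card happen to be independent.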
Random Variables
Definition: A variable that takes on numerical values based on the outcome of a random phenomenon.
Types of Random Variables:
Discrete Random Variables: Take on a countable number of distinct values (e.g., number of students in a class).
Continuous Random Variables: Can take any value within a given interval, so there are infinitely many possible values (e.g., height, weight).
Expected Value: The average value of a random variable over many, many trials; it measures the center of the variable's distribution.
Probability Distributions
Definition: A function that describes the likelihood of obtaining the possible values that a random variable can take.
Discrete Probability Distribution:
Lists all possible values of a discrete random variable and their corresponding probabilities.
Expected Value (E(X)): The long-term average or mean of a random variable calculated as E(X) = Σ [x * P(x)] for all possible values of x.
Continuous Probability Distribution:
Describes probabilities of continuous random variables using probability density functions (PDF).
Normal Distribution: A common continuous probability distribution characterized by its bell-shaped curve and defined by its mean (µ) and standard deviation (σ).
Empirical Rule: Approximately 68% of data falls within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within 3 standard deviations from the mean.
Law of Large Numbers: States that as the number of trials increases, the experimental probability approaches the theoretical probability.
Mutually Exclusive: Two events are mutually exclusive if they cannot occur at the same time; that is, the occurrence of one event means the other cannot happen. These events do not have an overlap.
Independent: Two events are independent if knowing that one has occurred does not change the probability of the other, that is, P(B | A) = P(B).
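The Law of Large Numbers stated above can be seen in a short simulation. This sketch flips a simulated fair coin (theoretical probability of heads 0.5) and reports the experimental proportion of heads for larger and larger numbers of flips. The seed is fixed only so the run is repeatable, and the function name is hypothetical.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable

def proportion_heads(num_flips):
    """Experimental probability of heads in num_flips simulated fair coin flips."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

for num_flips in (10, 100, 10_000, 100_000):
    print(num_flips, proportion_heads(num_flips))
```

With only 10 flips the proportion can be far from 0.5, while with 100,000 flips it is typically very close.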
Discrete Random Variable
Mean: E(X) = µ_X = Σ [x_i * P(x_i)]
Standard Deviation: σ_X = sqrt( Σ [(x_i - µ_X)² * P(x_i)] )
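As a worked example, take X to be the number of heads in two fair coin flips (a hypothetical distribution chosen for illustration). The mean is the probability-weighted sum of the values, and the standard deviation is the square root of the probability-weighted sum of squared deviations from that mean.

```python
from math import sqrt

# Hypothetical distribution: X = number of heads in two fair coin flips.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}   # value -> probability

# Mean: sum of each value times its probability.
expected_value = sum(x * p for x, p in distribution.items())

# Variance: sum of each squared deviation from the mean times its probability.
variance = sum((x - expected_value) ** 2 * p for x, p in distribution.items())
std_dev = sqrt(variance)

print(expected_value)       # 1.0
print(round(std_dev, 4))    # 0.7071
```

The probabilities must sum to 1 for the table to be a valid probability distribution.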
Binomial Distribution: A discrete probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success (p).
B: Each trial is a “success” or “failure”
I: The trials must be independent
N: The number of trials (must be a specific number)
S: Each trial has the same probability of success (p)
Binompdf: Use for the probability of exactly a specific number of successes (Ex: P(X = 3) = binompdf(n, p, 3)).
Binomcdf: Use for cumulative probabilities (Ex: P(X ≤ 3) = binomcdf(n, p, 3)).
Useful for finding probabilities of X being less than or equal to a specific number.
For "at least" or "greater than or equal to," use 1 - binomcdf with x - 1 (Ex: P(X ≥ 3) = 1 - binomcdf(n, p, 2)).
To find probabilities for ranges (Ex: P(2 < X < 5)), use binomcdf to calculate P(X ≤ 4) - P(X ≤ 2).
Mean: Number of Trials x Probability (n x p)
Standard Deviation: sqrt(n * p * (1 - p))
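The calculator commands binompdf and binomcdf can be imitated in Python to see what they compute. The functions binom_pdf and binom_cdf below are hypothetical stand-ins written for this sketch (they are not library functions), and the example assumes n = 10 trials with p = 0.5.

```python
from math import comb, sqrt

def binom_pdf(n, p, x):
    """P(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): probability of at most x successes in n trials."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

n, p = 10, 0.5   # hypothetical: 10 fair coin flips, success = heads

exactly_3 = binom_pdf(n, p, 3)                        # P(X = 3)
at_most_3 = binom_cdf(n, p, 3)                        # P(X <= 3)
at_least_3 = 1 - binom_cdf(n, p, 2)                   # P(X >= 3) = 1 - P(X <= 2)
between = binom_cdf(n, p, 4) - binom_cdf(n, p, 2)     # P(2 < X < 5) = P(X = 3 or 4)

mean = n * p
std_dev = sqrt(n * p * (1 - p))

print(exactly_3, at_most_3, at_least_3, between)
print(mean, round(std_dev, 4))
```

Notice that "at least 3" subtracts the cumulative probability up to 2, not 3, because the outcome X = 3 must stay in the answer.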
Geometric Distribution: A probability distribution that models the number of trials needed to get the first success in a sequence of independent Bernoulli trials.
Each trial has the same probability of success. The expected number of trials until the first success can be calculated using the formula E(X) = 1/p, where p is the probability of success in each trial.
Mean: 1 / p (Probability)
Standard Deviation: sqrt(1 - p) / p
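A short sketch of the geometric distribution, assuming a hypothetical success probability of p = 0.2. The probability that the first success comes on trial x is (1 - p)^(x - 1) * p (x - 1 failures followed by one success), the mean is 1 / p, and the standard deviation is sqrt(1 - p) / p. The function name geom_pdf is made up for this example.

```python
from math import sqrt

def geom_pdf(p, x):
    """P(X = x): the first success occurs on trial x (x = 1, 2, 3, ...)."""
    return (1 - p) ** (x - 1) * p

p = 0.2   # hypothetical probability of success on each trial

mean = 1 / p                 # expected number of trials until the first success
std_dev = sqrt(1 - p) / p

print(round(geom_pdf(p, 3), 4))        # two failures, then a success
print(mean, round(std_dev, 4))
```

With p = 0.2, it takes 5 trials on average to see the first success, with a standard deviation of about 4.47 trials.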