Statistics: The science of collecting, organizing, and interpreting data.
Individuals: The objects on which data are collected (e.g., students, states, hospitals).
Variables: Characteristics recorded about individuals.
Quantitative Variables: Numeric values with meaningful operations (e.g., height, weight).
Categorical Variables: Groups or categories (e.g., gender, college type).
Identifier Variables: Unique values assigned to individuals (e.g., ID numbers).
Bar Charts & Pie Charts: Represent categorical data.
Histograms: Display quantitative data distributions.
Boxplots: Compare distributions and identify outliers.
Dotplots & Density Plots: Represent distributions and trends.
Mean (x̄): Sum of all values divided by the number of values.
Median (m): The middle value when data is ordered.
Range: Difference between the largest and smallest values.
Interquartile Range (IQR): The difference between Q3 (75th percentile) and Q1 (25th percentile).
Standard Deviation (S): Measures variation around the mean.
Z-Score: $Z = \frac{X - \mu}{\sigma}$ (Measures how far a value is from the mean, in standard deviations)
68-95-99.7 Rule: In a Normal distribution, about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively.
Explanatory Variable: The variable suspected to influence another.
Response Variable: The variable that is measured as an outcome.
Simpson's Paradox: When the direction of a relationship between two variables reverses once a lurking variable is taken into account.
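# Load the mosaic package (provides bargraph, favstats, and tally, and loads lattice for histogram and bwplot)
library(mosaic)
# Bar Chart
bargraph(~Variable, data = Dataset)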
# Histogram
histogram(~Variable, data = Dataset)
# Boxplot
bwplot(Variable ~ Category, data = Dataset)
# Mean and Median
mean(Dataset$Variable)
median(Dataset$Variable)
# Standard Deviation
sd(Dataset$Variable)
# Interquartile Range
IQR(Dataset$Variable)
# Probability below a value
pnorm(value, mean = mu, sd = sigma)
# Probability above a value
1 - pnorm(value, mean = mu, sd = sigma)
# Finding percentiles
qnorm(percentile, mean = mu, sd = sigma)
STAT 118 Exam Study Sheet (Chapters 1-5)
Categorical (Qualitative) Variables: Describe qualities or categories (e.g., gender, college type).
Quantitative Variables: Numeric values with meaningful operations (e.g., height, weight).
Nominal Variables: Categories without a meaningful order (e.g., colors, names).
Ordinal Variables: Categories with a meaningful order but no consistent difference (e.g., ranking, education level).
Natural Variables: Ordered with meaningful differences (e.g., temperature, income).
Proportion: A fraction representing part of a whole (e.g., 0.25 or 1/4).
Percent: A proportion multiplied by 100 (e.g., 0.25 = 25%).
Mean (x̄): Average of data.
Median (m): Middle value when ordered.
Range: Max - Min.
Interquartile Range (IQR): Q3 - Q1 (middle 50% of data).
Standard Deviation (S): Measures spread around the mean.
Issues with SD for Outliers: SD is sensitive to outliers; extreme values heavily influence it.
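A quick way to see this sensitivity is to compare SD and IQR on a small data set with and without one extreme value. The numbers below are made up for illustration only.

```r
# Hypothetical data: five typical values, then the same values plus one extreme value
typical      <- c(10, 12, 13, 15, 16)
with_outlier <- c(typical, 100)

sd(typical)         # SD of the typical values
sd(with_outlier)    # SD grows sharply once the extreme value is included

IQR(typical)        # quartile-based spread
IQR(with_outlier)   # IQR changes much less than the SD does
```

This is why the IQR (with the median) is usually preferred over the SD (with the mean) when outliers are present.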
Symmetric (Bell-shaped): Mean ≈ Median.
Right-skewed: Mean > Median.
Left-skewed: Mean < Median.
Uniform: Equal frequency across bins.
Bimodal: Two peaks.
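The mean/median comparisons for skewed shapes can be checked on small made-up data sets. The vectors below are hypothetical; the left-skewed one is simply a mirror image of the right-skewed one.

```r
# Hypothetical right-skewed data: most values are small, one long tail to the right
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
# Mirror image of the same data, so the long tail is on the left
left_skewed <- 21 - right_skewed

c(mean = mean(right_skewed), median = median(right_skewed))  # mean is pulled above the median
c(mean = mean(left_skewed),  median = median(left_skewed))   # mean is pulled below the median
```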
Standardizing (Z-score): $Z = \frac{X - \mu}{\sigma}$ (Tells how many SDs a value is from the mean)
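Shifting: Adding or subtracting a constant changes the mean (and median) by that constant but leaves the spread (range, IQR, SD) unchanged.
Scaling: Multiplying or dividing by a constant changes both the center and the spread by that factor.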
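The effect of shifting, scaling, and standardizing can be sketched on a small made-up vector. One assumption to note: the sample mean and sample SD stand in here for μ and σ in the z-score formula.

```r
# Hypothetical values chosen so that the mean is 5 and the SD is 3
x <- c(2, 4, 4, 5, 10)

shifted <- x + 10                  # shifting: add a constant
scaled  <- x * 3                   # scaling: multiply by a constant
z       <- (x - mean(x)) / sd(x)   # standardizing (sample mean and SD used for mu and sigma)

c(mean(x), mean(shifted), mean(scaled))   # center moves under both shifting and scaling
c(sd(x), sd(shifted), sd(scaled))         # spread changes only under scaling
c(mean(z), sd(z))                         # z-scores have mean 0 and SD 1
```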
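68-95-99.7 Rule (Normal distributions only):
About 68% of values fall within 1 SD of the mean
About 95% of values fall within 2 SDs of the mean
About 99.7% of values fall within 3 SDs of the mean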
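These percentages can be recovered from pnorm on the standard Normal distribution (mean 0, SD 1), as in the short sketch below.

```r
# Number of SDs from the mean
k <- 1:3

# Area between -k and +k under the standard Normal curve
within_k_sd <- pnorm(k) - pnorm(-k)

# Expressed as percentages: roughly 68, 95, and 99.7
100 * within_k_sd
```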
pnorm(x, mean, sd): Finds the probability below a value.
1 - pnorm(x, mean, sd): Finds the probability above a value.
qnorm(percentile, mean, sd): Finds the value corresponding to a given percentile.
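A worked sketch of these three calls, assuming exam scores that follow a Normal distribution with mean 70 and SD 10 (made-up numbers). Note that qnorm expects the percentile as a proportion between 0 and 1, so the 90th percentile is entered as 0.90.

```r
# Hypothetical Normal model for exam scores
mu    <- 70
sigma <- 10

p_below_85 <- pnorm(85, mean = mu, sd = sigma)     # proportion scoring below 85
p_above_85 <- 1 - p_below_85                       # proportion scoring above 85
cutoff_90  <- qnorm(0.90, mean = mu, sd = sigma)   # score at the 90th percentile

# qnorm undoes pnorm: feeding the probability back in returns the original value
qnorm(p_below_85, mean = mu, sd = sigma)

# The same probability via the z-score on the standard Normal curve
z <- (85 - mu) / sigma
pnorm(z)
```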
Correlation (R): Measures the strength and direction of the linear relationship between two quantitative variables.
Ranges from -1 to 1:
R = 1: Perfect positive correlation.
R = -1: Perfect negative correlation.
R = 0: No linear correlation.
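The three cases can be illustrated with the base R function cor() on made-up data. The last example is a reminder that R = 0 only rules out a linear relationship: the points follow a clear curve, yet the correlation is zero.

```r
x <- 1:10   # hypothetical x values

r_pos  <- cor(x, 2 * x + 3)      # points on a line with positive slope
r_neg  <- cor(x, -0.5 * x + 7)   # points on a line with negative slope
r_zero <- cor(x, (x - 5.5)^2)    # symmetric U-shaped curve centered at the mean of x

c(r_pos, r_neg, r_zero)
```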
Simpson's Paradox: A trend that holds within each group reverses when the groups are combined, because of a lurking variable.
Example: A hospital appears to have a higher death rate overall, but when split by patient condition, it actually has a lower death rate in each category.
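The hospital example can be reproduced with made-up counts. In the sketch below, all numbers are hypothetical: Hospital A treats mostly patients in poor condition, which is the lurking variable that drives its overall rate up.

```r
# Hypothetical counts for two hospitals, split by patient condition
deaths <- data.frame(
  hospital  = c("A", "A", "B", "B"),
  condition = c("good", "poor", "good", "poor"),
  patients  = c(100, 900, 900, 100),
  died      = c(1, 36, 18, 5)
)

# Death rate within each condition: A is lower than B in both rows
deaths$rate <- deaths$died / deaths$patients
deaths

# Death rate after combining conditions: A is now higher than B
overall <- aggregate(cbind(died, patients) ~ hospital, data = deaths, FUN = sum)
overall$rate <- overall$died / overall$patients
overall
```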
Summary Statistics:
mean(dataset$variable)
median(dataset$variable)
sd(dataset$variable)
IQR(dataset$variable)
Histograms & Boxplots:
histogram(~ variable, data = dataset)
bwplot(variable ~ category, data = dataset)
Normal Distribution Calculations:
pnorm(x, mean, sd) # Probability below x
1 - pnorm(x, mean, sd) # Probability above x
qnorm(percentile, mean, sd) # Value at given percentile
Mean (Average): The sum of all values in a dataset divided by the number of values.
Median: The middle value in an ordered dataset; if even, the average of the two middle values.
Mode: The most frequently occurring value(s) in a dataset.
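A small sketch of all three measures on made-up data with an even number of values. Base R has no built-in function for the statistical mode (its mode() function reports a storage type instead), so the version below, built on table(), is just one possible approach.

```r
# Hypothetical data set with an even number of values
scores <- c(3, 5, 5, 8, 9, 12)

mean(scores)     # 42 / 6 = 7
median(scores)   # average of the two middle values, 5 and 8

# Mode: tabulate the values and keep the one(s) with the highest count
counts     <- table(scores)
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value
```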
Range: The difference between the maximum and minimum values in a dataset.
Interquartile Range (IQR): The range of the middle 50% of data, calculated as Q3 - Q1.
Quartiles: Values that divide a dataset into four equal parts:
Q1 (First Quartile): 25th percentile
Q2 (Median): 50th percentile
Q3 (Third Quartile): 75th percentile
Standard Deviation (SD): A measure of how spread out the data is around the mean. A higher SD indicates more variability.
Five-Number Summary: A set of five values (Min, Q1, Median, Q3, Max) that summarize a dataset.
Outliers: Data points that are significantly higher or lower than the rest of the dataset.
Lower Fence: $Q1 - 1.5 \times IQR$, used to detect low-end outliers.
Upper Fence: $Q3 + 1.5 \times IQR$, used to detect high-end outliers.
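The fences can be computed in base R as sketched below on made-up data. One caveat: quantile() and IQR() use R's default quartile method, which can differ slightly from quartiles found by hand, so the fence values may not match a textbook calculation exactly.

```r
# Hypothetical data with one suspiciously large value
values <- c(2, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(values, 0.25))   # first quartile (R's default method)
q3  <- unname(quantile(values, 0.75))   # third quartile
iqr <- IQR(values)                      # same as q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- values[values < lower_fence | values > upper_fence]
outliers
```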
Proportion: A fraction representing a part of a whole, often converted into a percentage.
Percentage: A way to express a proportion out of 100, calculated as $\frac{\text{part}}{\text{whole}} \times 100$.
Conditional Probability: The likelihood of an event occurring given that another event has already occurred (e.g., percentage of Obama supporters who were male).
Frequency Table: A table that lists the number of times different categories occur in a dataset.
Contingency Table: A table that shows the frequency distribution of variables to examine relationships between them.
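Conditional percentages come from dividing a cell of a contingency table by its row or column total. The sketch below uses made-up vote counts for two hypothetical candidates, A and B, and base R's prop.table(). Which total is used matters: "share of A's supporters who are male" and "share of males who support A" are different quantities.

```r
# Hypothetical contingency table of gender by candidate
votes <- matrix(c(120, 80,
                   40, 60),
                nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("female", "male"),
                                candidate = c("A", "B")))

# Proportions within each candidate (columns sum to 1)
by_candidate <- prop.table(votes, margin = 2)
by_candidate["male", "A"]   # share of A's supporters who are male: 40 / 160

# Proportions within each gender (rows sum to 1)
by_gender <- prop.table(votes, margin = 1)
by_gender["male", "A"]      # share of males who support A: 40 / 100
```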
Gender Gap in Voting: A phenomenon where voting preferences differ significantly between males and females.
Boxplot: A graphical representation of the five-number summary, useful for comparing distributions.
Histogram: A graph of a quantitative variable in which each bar's height shows how many values fall in that interval (bin).
Symmetric Distribution: A dataset where the left and right sides of the histogram are roughly mirror images.
Skewed Distribution:
Right-Skewed (Positive Skew): Tail is longer on the right side.
Left-Skewed (Negative Skew): Tail is longer on the left side.
Spread/Variability: The extent to which data values differ, measured by range, IQR, and standard deviation.
Bar Chart: A chart that uses bars to represent categorical data.
Scatterplot: A graph of plotted points that show the relationship between two variables.
Alternative Graphical Representations: Other ways to display data, such as side-by-side boxplots for comparing distributions.
favstats(): An R function from the mosaic package that returns summary statistics for a quantitative variable (min, Q1, median, Q3, max, mean, sd, n, and the number of missing values).
histogram(): An R function that generates a histogram to visualize numerical data distributions.
tally(): An R function that creates frequency tables for categorical data.
bwplot(): An R function that generates boxplots to compare distributions of a variable across different categories.