Categorical Variables
Variables that take on values as category names or group labels, organized into frequency tables or represented by displays like bar graphs, dot plots, and pie charts.
Quantitative Variables
Variables with numerical values for measured quantities, organized into frequency tables or represented by displays like histograms, dot plots, and box plots.
Discrete Quantitative Variable
Takes on a countable number of values with gaps between them.
Continuous Quantitative Variable
Can take on any value within an interval, with no gaps between possible values, like heights and weights.
Center
The value that separates the data roughly in half, indicating the middle.
Spread
The range of values from smallest to largest, showing the variability.
Clusters
Natural subgroups in the data, where values concentrate.
Gaps
Intervals in the distribution where no data values fall.
Unimodal Distribution
Distribution with one peak.
Bimodal Distribution
Distribution with two peaks.
Skewed Distribution
Distribution with a longer tail stretching toward higher values (right-skewed) or lower values (left-skewed).
Bell-shaped Distribution
Symmetric with a center mound and sloping tails.
Descriptive Statistics
Data presentation including average values, variability measures, and distribution shape.
Inferential Statistics
Drawing inferences from limited data, discussed in later units.
Median
Middle number in an ordered set.
Mean
Average found by summing the values and dividing by the number of items.
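As a small illustration of the two measures of center, the sketch below uses Python's standard `statistics` module. The data set `scores` is made up for the example and does not come from these notes.

```python
from statistics import mean, median

# Hypothetical data set with one unusually large value
scores = [2, 3, 3, 4, 5, 6, 40]

average = mean(scores)    # sum of the values divided by how many there are
middle = median(scores)   # middle value once the data are sorted

print(average, middle)
```

The single large value pulls the mean (9) well above the median (4), which is why the median is often preferred for skewed data.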
Variability
Key concept in statistics, described by range, interquartile range, variance, and standard deviation.
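The four measures of variability named above can be computed as in the following sketch. The sample `data` is hypothetical, and quartile conventions differ between textbooks and software, so the interquartile range may not match every hand method; the `statistics.quantiles` default is used here.

```python
from statistics import quantiles, variance, stdev

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

data_range = max(data) - min(data)      # largest value minus smallest value
q1, q2, q3 = quantiles(data, n=4)       # quartiles (default "exclusive" method)
iqr = q3 - q1                           # interquartile range
sample_variance = variance(data)        # divides by n - 1
sample_sd = stdev(data)                 # square root of the sample variance

print(data_range, iqr, sample_variance, sample_sd)
```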
Parallel Boxplots
Graphical representation showing the comparison of stock price statistics across different years, including median, quartiles, yearly low, and interquartile range.
Normal Distribution
A bell-shaped and symmetric distribution used to model various natural phenomena, with the mean equal to the median and points of inflection at one standard deviation from the mean.
Empirical Rule
Also known as the 68-95-99.7 rule, states the percentage of values within 1, 2, and 3 standard deviations from the mean in a normal distribution.
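The three percentages in the rule can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are chosen for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution within k standard deviations of the mean
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")
```

The printed proportions round to about 0.68, 0.95, and 0.997.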
Two-Way Table
A table displaying qualitative data from two categorical variables, often used to calculate marginal frequencies and distributions.
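Marginal frequencies are the row and column totals of a two-way table, and marginal distributions are those totals divided by the grand total. The sketch below assumes a made-up table of class year by preferred study method; none of the counts come from these notes.

```python
# Hypothetical counts: rows = class year, columns = preferred study method
table = {
    "Junior": {"Flashcards": 30, "Practice tests": 20},
    "Senior": {"Flashcards": 10, "Practice tests": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal frequencies: totals for each row and each column
row_totals = {year: sum(row.values()) for year, row in table.items()}
col_totals = {}
for row in table.values():
    for method, count in row.items():
        col_totals[method] = col_totals.get(method, 0) + count

# Marginal distributions: totals expressed as proportions of the grand total
row_distribution = {year: n / grand_total for year, n in row_totals.items()}
col_distribution = {method: n / grand_total for method, n in col_totals.items()}

print(row_totals, col_totals)
print(row_distribution, col_distribution)
```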
Scatterplot
A visual representation of the relationship between two quantitative variables, showing form, direction, strength, and unusual features like outliers.
Correlation
A measure (r) of the strength of a linear relationship between two variables, ranging from -1 to +1, with r^2 indicating the proportion of variance explained by the relationship.
Coefficient of Determination (r^2)
The percentage of variation in the response variable explained by the linear regression model, derived from the correlation coefficient.
Least Squares Regression
A method to find the best-fitting line through a set of points by minimizing the sum of squared vertical differences between observed and predicted values, with the slope equal to the correlation coefficient times the ratio of the standard deviations (b = r · s_y / s_x).
Residuals
The differences between observed and predicted values (observed minus predicted) in a regression model; for a least squares regression line, the residuals always sum to zero.
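The links among correlation, the least squares line, r^2, and residuals can be seen in one small calculation. The sketch below uses hypothetical data and the standard relationships b = r · s_y / s_x and a = ȳ − b · x̄; it is an illustration, not a substitute for a statistics package.

```python
from statistics import mean, stdev

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation: sum of products of z-scores, divided by n - 1
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

b = r * s_y / s_x        # slope of the least squares line
a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]  # observed - predicted

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"sum of residuals = {sum(residuals):.10f}")
```

For this data set the line is ŷ = 2.2 + 0.6x, r^2 is 0.6, and the residuals sum to zero up to floating-point rounding.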
Outliers
Data points that significantly deviate from the overall pattern in a scatterplot, often identified by large discrepancies in the response variable compared to predicted values.
Influential Scores
Scores whose removal would sharply change the regression line, especially points with extreme x-values.
High Leverage
Points with x-values far from the mean x-value, having the potential to strongly influence the regression line.
Regression Outlier
A point whose residual is large compared to the others; it may or may not be influential on the regression line.
Correlation Coefficient (r)
Indicates the strength and direction of a linear relationship between two variables.
Simple Random Sampling (SRS)
A sampling method where every possible sample of the desired size has an equal chance of being selected.
Stratified Sampling
Involves dividing the population into homogeneous groups (strata) and selecting random samples from each stratum.
Cluster Sampling
Divides the population into heterogeneous groups (clusters) and selects entire clusters randomly.
Systematic Sampling
Involves selecting every kth individual from a list after choosing a random starting point.
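The four sampling methods above can be sketched with Python's `random` module. The population of 100 ID numbers, the two strata, and the ten clusters are all hypothetical groupings invented for this example; a fixed seed is used only so the sketch is repeatable.

```python
import random

rng = random.Random(0)              # fixed seed for repeatability
population = list(range(1, 101))    # hypothetical population of 100 ID numbers

# Simple random sample: every possible set of 10 IDs is equally likely
srs = rng.sample(population, 10)

# Systematic sample: random starting point, then every k-th individual
k = len(population) // 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: a random sample taken from within each stratum
strata = {"low IDs": population[:50], "high IDs": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]

# Cluster sample: whole clusters chosen at random, every member included
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [unit for cluster in rng.sample(clusters, 2) for unit in cluster]

print(srs, systematic, stratified, cluster_sample, sep="\n")
```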
Sampling Variability
The natural presence of sampling error in a sample, which can be described using probability and tends to decrease with larger sample sizes.
Observational Studies
Studies where observations and measurements are made without influencing the subjects, aiming to show associations between variables.
Experiments
Studies where treatments are imposed on subjects to measure responses, aiming to establish cause-and-effect relationships.
Experimental Units
The objects or individuals on which an experiment is performed; when the units are people, they are called subjects.
Explanatory Variables
Factors in an experiment believed to affect the response variables, with different levels of treatment applied to groups.
Control Group
A group in an experiment that does not receive the treatment of interest, or receives a placebo, to determine the treatment's effect.
Placebo Effect
The phenomenon where individuals respond to any perceived treatment, even if it is inactive.
Blinding
When subjects are unaware of the treatment they are receiving in an experiment.
Double-blinding
When both subjects and evaluators are unaware of the treatment assignments in an experiment.
Matched Pairs Design
A design where two treatments are compared based on responses from paired subjects, often involving single subjects receiving both treatments in random order.
Guess Strategy
A strategy in a standard literacy test where the test taker selects answers randomly when the correct answer is unknown.
Score 60-79
A range of scores in a standard literacy test considered passing but not superior, falling between 60 and 79.
Does not score 60-79
The probability of a test taker not achieving a score between 60 and 79 in a standard literacy test.
Strategy "Answer (c)" and Scores 80-100
The joint probability of a test taker choosing answer (c) and scoring between 80 and 100 in a standard literacy test.
Strategy "Longest Answer" or Scores 0-59
The probability of a test taker either choosing the longest answer or scoring between 0 and 59 in a standard literacy test.
Guess Strategy given Score 0-59
The probability of a test taker using the guess strategy given that their score falls between 0 and 59 in a standard literacy test.
Scored 80-100 given Strategy "Longest Answer"
The probability of a test taker scoring between 80 and 100 given that they chose the strategy "longest answer" in a standard literacy test.
Guess Strategy and Scoring 0-59 Independence
The assessment of whether the strategy "guess" and scoring between 0 and 59 are independent events in a standard literacy test.
Strategy "Longest Answer" and Scoring 80-100 Mutual Exclusivity
The evaluation of whether the strategy "longest answer" and scoring between 80 and 100 are mutually exclusive events in a standard literacy test.
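The literacy-test probabilities above all come from one two-way table of strategy by score range. The original counts are not given in these notes, so the sketch below uses invented counts purely to show how each probability would be computed; the helper names `p_strategy`, `p_score`, and `p_both` are also made up for the example.

```python
from fractions import Fraction

# Hypothetical counts (NOT from the original problem):
# rows are strategies, columns are score ranges
counts = {
    "guess":          {"0-59": 20, "60-79": 30, "80-100": 10},
    "answer (c)":     {"0-59": 10, "60-79": 20, "80-100": 10},
    "longest answer": {"0-59": 30, "60-79": 50, "80-100": 20},
}
total = sum(sum(row.values()) for row in counts.values())

def p_strategy(s):
    return Fraction(sum(counts[s].values()), total)

def p_score(c):
    return Fraction(sum(row[c] for row in counts.values()), total)

def p_both(s, c):
    return Fraction(counts[s][c], total)

# Complement rule
p_not_60_79 = 1 - p_score("60-79")
# Joint probability
p_c_and_high = p_both("answer (c)", "80-100")
# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_longest_or_low = (p_strategy("longest answer") + p_score("0-59")
                    - p_both("longest answer", "0-59"))
# Conditional probability: P(A | B) = P(A and B) / P(B)
p_guess_given_low = p_both("guess", "0-59") / p_score("0-59")
p_high_given_longest = p_both("longest answer", "80-100") / p_strategy("longest answer")
# Independent if P(A | B) equals P(A); mutually exclusive if P(A and B) is 0
independent = p_guess_given_low == p_strategy("guess")
mutually_exclusive = p_both("longest answer", "80-100") == 0

print(p_not_60_79, p_c_and_high, p_longest_or_low)
print(p_guess_given_low, p_high_given_longest, independent, mutually_exclusive)
```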
Cumulative Probability Distribution
A function, table, or graph linking each outcome with the probability of a result less than or equal to that outcome.
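A cumulative distribution is built by adding up probabilities from the smallest outcome upward. The sketch below assumes a simple example distribution, the number of heads in three fair coin flips, which is not taken from these notes.

```python
from fractions import Fraction
from itertools import accumulate

# Example distribution: number of heads in 3 fair coin flips
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

# Running total of the probabilities gives P(X <= x) for each outcome x
cdf = dict(zip(pmf, accumulate(pmf.values())))

for outcome, prob in cdf.items():
    print(f"P(X <= {outcome}) = {prob}")
```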
Normal Distribution
Provides a model for how sample statistics vary under random sampling, often calculated using z-scores.
Central Limit Theorem
States that for sufficiently large sample sizes, the sampling distribution of the mean will be approximately normal.
Biased and Unbiased Estimators
Bias indicates the sampling distribution is not centered on the population parameter; unbiased estimators are centered on the population parameter.
Sampling Distribution for Sample Proportions
Focuses on the proportion of successes in a sample, approximating a normal distribution for large sample sizes.
Sampling Distribution for Differences in Sample Proportions
Deals with differences obtained by subtracting sample proportions of one population from another.
Sampling Distribution for Sample Means
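The variance of sample means is the population variance divided by the sample size, so the standard deviation of sample means is the population standard deviation divided by the square root of the sample size.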
Sampling Distribution
The distribution of a statistic, such as a sample mean or sample proportion, over all possible samples of a given size from a population. For sample means, the mean equals the population mean and the standard deviation equals the population standard deviation divided by the square root of the sample size.
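A short simulation can make the sampling distribution of the sample mean concrete. The sketch below assumes a hypothetical population (rolls of a fair six-sided die), a sample size of 25, and 20,000 repeated samples; these choices are illustrative only. The standard deviation σ/√n stated above corresponds to a variance of σ²/n.

```python
import random
from statistics import mean, pvariance, variance

rng = random.Random(1)        # fixed seed for repeatability
faces = [1, 2, 3, 4, 5, 6]    # hypothetical population: a fair die
n = 25                        # size of each sample
num_samples = 20000           # number of repeated samples

pop_mean = mean(faces)        # population mean
pop_var = pvariance(faces)    # population variance

# Draw many samples and record each sample mean
sample_means = [sum(rng.choices(faces, k=n)) / n for _ in range(num_samples)]

print("mean of sample means:", mean(sample_means), "vs population mean:", pop_mean)
print("variance of sample means:", variance(sample_means), "vs sigma^2 / n:", pop_var / n)
```

The simulated values should land close to, but not exactly on, the theoretical ones, because the simulation itself is subject to sampling variability.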
Confidence Interval
A range of values that is likely to contain the true population parameter with a certain level of confidence, typically expressed as (point estimate ± margin of error).
Standard Error
A measure of how much the sample statistic typically varies from the population parameter, calculated as the standard deviation of the sampling distribution.
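The form point estimate ± margin of error can be illustrated with a one-proportion z-interval, where the margin of error is a critical value times the standard error. The counts below (120 successes in 200 trials) are hypothetical, and the sketch assumes the sample is large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 120, 200      # hypothetical sample results
confidence = 0.95

p_hat = successes / n                                      # point estimate
standard_error = sqrt(p_hat * (1 - p_hat) / n)             # SE of the sample proportion
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # critical value
margin_of_error = z_star * standard_error

interval = (p_hat - margin_of_error, p_hat + margin_of_error)
print(f"{p_hat:.3f} ± {margin_of_error:.3f} -> ({interval[0]:.3f}, {interval[1]:.3f})")
```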
Normality Assumption
The assumption that the sampling distribution of sample means or proportions is approximately normal if certain conditions are met, like the sample size being large enough.
Type I Error
Mistakenly rejecting a true null hypothesis in hypothesis testing, with a probability denoted as α (alpha).
Type II Error
Mistakenly failing to reject a false null hypothesis in hypothesis testing, with a probability denoted as β (beta).
Power of a Test
The probability of correctly rejecting a false null hypothesis, influenced by the sample size and significance level chosen for the test.
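The relationship among α, β, and power can be sketched for the simplest case, a one-sided z test for a mean with known population standard deviation. The function name `power_one_sided_z` and all the numbers are assumptions made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha):
    """Power of a z test of H0: mu = mu0 against Ha: mu > mu0 (sigma known)."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability of landing in the rejection region when the true mean is mu_true
    return 1 - NormalDist(mu_true, se).cdf(cutoff)

# Hypothetical scenario
power = power_one_sided_z(mu0=100, mu_true=106, sigma=15, n=36, alpha=0.05)
beta = 1 - power   # probability of a Type II error

print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Calling the function with a larger `n` or a larger `alpha` gives a higher power, and setting `mu_true` equal to `mu0` returns the Type I error probability α.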
P-value
A measure that helps determine the significance of results in a hypothesis test; a small P-value indicates strong evidence against the null hypothesis.
Type I error
Occurs when the null hypothesis is rejected when it is actually true, leading to a false positive conclusion.
Type II error
Occurs when the null hypothesis is not rejected when it is false, resulting in a false negative conclusion.
Confidence Interval
A range of values that is likely to contain the true parameter being estimated, with a specified level of confidence.
Difference of Two Proportions
Refers to the contrast between two population proportions, often analyzed using hypothesis tests or confidence intervals.
t-distribution
A bell-shaped probability distribution with heavier tails than the normal distribution, used for inference about means when the population standard deviation is unknown; it matters most for small sample sizes and approaches the normal distribution as the sample size grows.
Standard Error
An estimate of the standard deviation of a sampling distribution, often used to calculate confidence intervals and conduct hypothesis tests for means.
Significance Test
A statistical method used to determine whether there is enough evidence to reject the null hypothesis in favor of an alternative hypothesis.
Type I Error
Mistakenly rejecting a true null hypothesis, leading to the consumer agency discouraging customers from purchasing a new brand of air-conditioning unit that could actually save on electricity consumption.
Confidence Interval
A range of values that is likely to contain the true parameter, such as the 95% confidence interval for the mean difference in accidents per month between two departments.
Type II Error
Mistakenly failing to reject a false null hypothesis, potentially resulting in a company not making necessary fixes, affecting future sales.
Paired Data
Involves one-sample analysis on the differences from paired data, like finding a 90% confidence interval of the mean improvement in test scores for a SAT preparation class.
P-Value
A measure that helps determine the strength of the evidence against the null hypothesis, as seen in the simulation example where a recalibration of machinery was deemed necessary based on the P-value.
Power
The probability of correctly rejecting a false null hypothesis, contrasting with Type II error, as illustrated in the scenario where the candidate's true support was 63% but might not be recognized due to a Type II error.
Hypothesis Test
Involves making a claim about a population parameter and testing it, like the significance test for the difference of two means in the example of comparing computer downtimes.
Parameter
A characteristic of a population, such as the mean electricity usage of a new brand of air-conditioning units, denoted by μ in hypothesis testing.
Chi-Square Statistic
The sum, over all categories or cells, of (observed − expected)² / expected, used in Chi-Square tests and denoted χ².
P-value
The probability of obtaining a Chi-Square value at least as extreme as the one observed, assuming the null hypothesis is true.
Degrees of Freedom (df)
The value that determines which Chi-Square distribution applies: the number of categories minus one for a goodness-of-fit test, or (rows − 1)(columns − 1) for a test on a two-way table.
Goodness-of-Fit Test
A test to determine if a given theoretical distribution correctly describes a situation, problem, or activity.
Chi-Square Test for Independence
A test to determine if there is a significant association between two categorical variables.
Chi-Square Test for Homogeneity
A test to compare samples from two or more populations to see if they are homogeneous.
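The Chi-Square statistic is computed the same way in all three tests; only the expected counts and the degrees of freedom differ. The sketch below uses hypothetical counts (120 die rolls and a 2-by-2 table) and stops at the statistic, since finding the P-value needs a Chi-Square table or a statistics package.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories or cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Goodness of fit: do 120 hypothetical die rolls fit a fair-die model?
rolls = [18, 22, 20, 25, 15, 20]
expected_rolls = [sum(rolls) / len(rolls)] * len(rolls)   # 20 per face if fair
gof_stat = chi_square(rolls, expected_rolls)
gof_df = len(rolls) - 1

# Independence or homogeneity: hypothetical 2-by-2 table of counts
table = [[30, 20],
         [10, 40]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)
# Expected count for each cell = (row total * column total) / grand total
expected_cells = [[r * c / grand_total for c in col_totals] for r in row_totals]
table_stat = chi_square(
    [o for row in table for o in row],
    [e for row in expected_cells for e in row],
)
table_df = (len(table) - 1) * (len(table[0]) - 1)

print(gof_stat, gof_df)
print(table_stat, table_df)
```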
Sampling Distribution for the Slope
The distribution of the sample slope b over repeated samples, with mean μb equal to the true population slope and standard deviation σb.
Confidence Interval for the Slope
An interval estimate for the slope of the regression line using t-scores with degrees of freedom n-2.
Confidence Interval
A range of values that is likely to contain the true slope of the regression line with a certain level of confidence.
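A confidence interval for the slope has the form b ± t* · SE_b with n − 2 degrees of freedom. The sketch below uses hypothetical data and the usual formulas s = √(SSE / (n − 2)) and SE_b = s / √Σ(x − x̄)²; the critical value t* = 3.182 is entered by hand from a t-table (95% confidence, 3 degrees of freedom) because the standard library has no t-distribution.

```python
from math import sqrt

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares slope and intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

# Standard deviation of the residuals, using n - 2 degrees of freedom
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

se_b = s / sqrt(s_xx)   # standard error of the slope
t_star = 3.182          # from a t-table: 95% confidence, df = n - 2 = 3

interval = (b - t_star * se_b, b + t_star * se_b)
print(f"b = {b:.2f}, SE_b = {se_b:.4f}, interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```

Because this interval contains 0, data like these would not give convincing evidence of a linear relationship at the 5% significance level.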
Null Hypothesis (H0)
The assumption that there is no relationship or no effect in a statistical test.
Residuals Plot
A graph that shows the differences between observed values and predicted values in a regression analysis.
P-Value
The probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is true.
Least Squares Regression Line
The line that minimizes the sum of the squared differences between the observed values and the values predicted by the line.
Slope
The measure of the steepness of a line, indicating the rate of change of the dependent variable with respect to the independent variable.
Linear Relationship
A relationship between two variables that can be represented by a straight line.
Scatterplot
A graph that shows the relationship between two variables by displaying data points on a two-dimensional plane.