Statistics
The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.
Descriptive Statistics
Methods of organizing, summarizing, and presenting data in an informative way.
Inferential Statistics
A decision, estimate, prediction, or generalization about a population, based on a sample.
Population
The entire set of individuals or objects of interest, or the measurements obtained from all individuals or objects of interest.
Sample
A portion, or part, of the population of interest.
Qualitative or categorical variable
The characteristic being studied is nonnumeric.
Quantitative variable
Information is reported numerically.
Discrete variables
Can only assume certain values, and there are usually 'gaps' between values.
Continuous variable
Can assume any value within a specified range.
Parameter
A measurable characteristic of a population.
Statistic
A measurable characteristic of a sample.
Measures of Location
Purpose is to pinpoint the center of a distribution of data.
Mean
(Population mean) The sum of all population values divided by the total number of population values.
Median
The middle value in a data set that has been arranged in ascending or descending order.
Mode
The value of the observation that appears most frequently.
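A minimal R sketch of the three measures of location above, using an arbitrary made-up vector; base R has no built-in mode function for data, so the last line is one common workaround:

```r
x <- c(2, 3, 3, 5, 7, 8, 3, 9, 5)   # hypothetical sample

mean(x)                      # arithmetic mean
median(x)                    # middle value of the sorted data
names(which.max(table(x)))   # mode: the value that appears most frequently
```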
Dispersion
Describes spread around the center.
Range
Largest - Smallest value.
Mean Deviation
The arithmetic mean of the absolute values of the deviations from the arithmetic mean.
Variance
The arithmetic mean of the squared deviations from the mean. For populations whose values are near the mean, the variance will be small; For populations whose values are dispersed from the mean, the variance will be large.
Standard Deviation
The square root of the variance. For populations whose values are near the mean, the standard deviation will be small; For populations whose values are dispersed from the mean, the standard deviation will be large.
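The dispersion measures above, sketched in R with the same hypothetical vector; note that var() and sd() use the sample formulas (dividing by n - 1), whereas the definitions above describe the population versions:

```r
x <- c(2, 3, 3, 5, 7, 8, 3, 9, 5)

diff(range(x))           # range: largest value minus smallest value
mean(abs(x - mean(x)))   # mean deviation: average absolute deviation from the mean
var(x)                   # sample variance
sd(x)                    # sample standard deviation = sqrt(var(x))
```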
Normal Probability Distribution
Bell-shaped with a single peak; symmetrical about the mean; the total area under the curve is 1.00; the mean, median, and mode are equal; asymptotic; the area to each side of the mean is 0.50.
Standard Normal Probability Distribution
A normal distribution with a mean of 0 and a standard deviation of 1.
z-value
The signed distance between a selected value, designated X, and the population mean, divided by the population standard deviation.
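A short sketch of the z-value calculation, assuming a known population mean and standard deviation; the numbers here are made up:

```r
mu    <- 100   # assumed population mean
sigma <- 15    # assumed population standard deviation
x     <- 130   # selected value

z <- (x - mu) / sigma   # signed distance from the mean in standard-deviation units
z                       # 2
pnorm(z)                # area under the standard normal curve to the left of z
```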
Central Limit Theorem
If all samples of a particular size are selected from any population, the sampling distribution of the sample mean is approximately a normal distribution. This approximation improves with larger samples.
Normal Distribution
If the population does not follow the normal distribution, but the sample is of at least 30 observations, the distribution of the sample means will follow the normal distribution.
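A quick simulation sketch of the Central Limit Theorem: sample means drawn from a clearly non-normal (exponential) population look roughly normal once each sample has about 30 observations.

```r
set.seed(1)
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))   # 5000 means of n = 30
hist(sample_means, breaks = 50)   # approximately bell-shaped around the true mean of 1
```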
Point estimate
A single value (point) derived from a sample and used to estimate a population value.
Confidence interval estimate
A range of values constructed from sample data so that the population parameter is likely to occur within that range at a specified probability.
Level of confidence
The specified probability for a confidence interval estimate.
Margin of error
The ± value added/subtracted from the point estimate to form a confidence interval.
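A sketch of a confidence interval for a population mean from sample data, shown both with t.test() and by computing the margin of error directly; the data vector is an arbitrary example:

```r
x <- c(12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9)

t.test(x, conf.level = 0.95)$conf.int   # 95% confidence interval estimate

# By hand: point estimate ± margin of error (t-based)
me <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
c(mean(x) - me, mean(x) + me)
```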
Hypothesis Testing
An objective method of making decisions or inferences from sample data (evidence) by comparing what is observed to what would be expected if the null hypothesis were true.
Statistical inference
Generalizing from a sample to a population with a calculated degree of certainty.
Hypothesis
A statement of the researcher's idea or guess.
Null Hypothesis (H0)
The hypothesis that is actually tested; what we assume is true to begin with, often the opposite of the researcher's guess.
Alternative Hypothesis (HA)
The hypothesis assumed true if the null is false; what we aim to gather evidence of, typically that there is a difference/effect/relationship.
Test statistic
A quantity, calculated based on a sample, whose value is the basis for deciding whether or not to reject the null hypothesis.
P-value
The probability of observing data as extreme as the sample data, assuming the null hypothesis is true. A low p-value (e.g., less than the significance level) suggests evidence against the null hypothesis.
Significance level (alpha)
The predetermined probability of making a Type I error (rejecting a true null hypothesis).
Type I error
Rejecting a true null hypothesis. Occurs when sample data incorrectly suggest a treatment effect or relationship exists.
Type II error
Failing to reject a false null hypothesis.
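A concrete hypothesis-testing sketch in R: a one-sample t-test of H0: μ = 12 against HA: μ ≠ 12, using the same example data as above:

```r
x   <- c(12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9)
out <- t.test(x, mu = 12)

out$statistic   # test statistic
out$p.value     # p-value; reject H0 if it falls below the chosen alpha (e.g. 0.05)
```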
Bivariate data
Data relating to the relationship between two variables.
Scatter diagram
A graphical technique used to show the relationship between two variables by plotting one on the horizontal axis and the other on the vertical axis.
Covariance
A measure indicating how two variables are linearly related (tend to move in the same or opposite directions) but does not measure the strength of the relationship.
Correlation
A measure of the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
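Covariance, correlation, and a scatter diagram for two simulated variables; the data are made up so that y tends to move with x:

```r
set.seed(2)
x <- rnorm(50)
y <- 2 * x + rnorm(50)   # y moves in the same direction as x

cov(x, y)    # positive: the variables tend to move together (direction only)
cor(x, y)    # strength and direction of the linear relationship, between -1 and +1
plot(x, y)   # scatter diagram: x on the horizontal axis, y on the vertical axis
```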
Linear Regression Model
A statistical model postulating a linear relationship between a dependent variable (Y) and one or more independent variables (X), represented by the equation Y = β₀ + β₁X + u for simple linear regression.
Dependent variable (Y)
The variable being predicted or explained in a regression model.
Independent variable (X)
The predictor or explanatory variable used to predict or explain the dependent variable in a regression model.
Population regression line
The true, unknown linear relationship between X and Y in the population, defined by the population parameters β₀ and β₁ as E(Y | X) = β₀ + β₁X.
Intercept (𝛽₀)
The true population y-intercept in the regression model, representing the expected mean value of Y when all independent variables are zero.
Slope (𝛽₁)
The true population slope coefficient in the regression model, representing the expected change in the mean of Y for a one-unit increase in the independent variable X, holding other variables constant (in multiple regression). Estimated by the sample slope 𝛽̂₁.
Error term (u or ε)
The random component in the regression model representing all factors other than the independent variable(s) that influence the dependent variable. It is the difference between the actual Y and the value predicted by the population regression line.
Ordinary Least Squares (OLS)
The standard method for estimating the coefficients (𝛽₀ and 𝛽₁) of a linear regression model by minimizing the sum of the squared differences between the observed dependent variable values and the values predicted by the estimated regression line.
OLS Estimator
The values of the regression coefficients (𝛽̂₀ and 𝛽̂₁) calculated using the OLS method from sample data, serving as estimates for the true population coefficients.
OLS regression line
The estimated linear relationship based on sample data, represented by the equation Ŷ = 𝛽̂₀ + 𝛽̂₁𝑋, also known as the estimated regression equation.
Predicted value (Ŷ)
The value of the dependent variable (Y) estimated by the OLS regression line for a given value of the independent variable(s) (X).
Residual (𝑢̂ or ε̂)
The difference between the actual observed value of the dependent variable (Yᵢ) and its predicted value (Ŷᵢ) from the OLS regression line for a specific observation, representing the unexplained part of the variation.
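A simple OLS sketch in R using lm(); the data frame and variable names are hypothetical, and the same fitted object (fit) is reused in the sketches further below:

```r
set.seed(3)
df   <- data.frame(X = runif(40, 0, 10))
df$Y <- 5 + 2 * df$X + rnorm(40, sd = 2)   # simulated population line plus error

fit <- lm(Y ~ X, data = df)   # OLS estimates of beta0-hat and beta1-hat
coef(fit)                     # estimated intercept and slope
head(fitted(fit))             # predicted values Y-hat
head(residuals(fit))          # residuals u-hat = Y - Y-hat
```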
Sum of Squares (SS)
Measures the total deviation of data points away from a mean value, used to quantify variability.
Total Sum of Squares (SST or TSS)
The total variability in the dependent variable (Y), calculated as the sum of the squared differences between each observed Yᵢ and the mean of Y (Ȳ). It is the sum of the explained variation (SSR) and the unexplained variation (SSE).
Sum of Squares Regression (SSR or ESS)
The portion of the total variability in the dependent variable (Y) that is explained by the regression model. Calculated as the sum of the squared differences between the predicted Y values (Ŷᵢ) and the mean of Y (Ȳ).
Sum of Squared Errors (SSE or RSS)
The portion of the total variability in the dependent variable (Y) that is not explained by the regression model, representing the residual or unexplained variation. Calculated as the sum of the squared differences between the actual Y values (Yᵢ) and the predicted Y values (Ŷᵢ). Minimizing this sum is the goal of OLS.
Coefficient of Determination (R-squared)
A statistic representing the proportion (or percentage) of the variance in the dependent variable (Y) that is predictable from the independent variable(s) (X) in the regression model. Calculated as SSR/SST. In simple linear regression, it is the square of the correlation coefficient (r²).
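The sums of squares and R-squared computed by hand from the hypothetical model fit above, checked against lm()'s own output:

```r
sst <- sum((df$Y - mean(df$Y))^2)          # total sum of squares (SST)
ssr <- sum((fitted(fit) - mean(df$Y))^2)   # explained sum of squares (SSR)
sse <- sum(residuals(fit)^2)               # unexplained sum of squared errors (SSE)

sst - (ssr + sse)        # ~0: SST = SSR + SSE
ssr / sst                # R-squared = SSR / SST
summary(fit)$r.squared   # matches lm's reported R-squared
```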
Multiple R
In regression output, for simple linear regression, it is the absolute value of the correlation coefficient between the dependent and independent variables.
Adjusted R-squared
A modified version of R-squared that accounts for the number of predictors and sample size, providing a less biased estimate of the population R-squared, useful for comparing regression models with different numbers of independent variables.
Standard Error of Estimate (SEE or s)
A measure of the typical distance between the observed Y values and the values predicted by the regression line. It quantifies the accuracy of predictions and is the square root of the Mean Square Error (MSE). A larger SEE indicates more scatter and less accurate predictions.
Analysis of Variance (ANOVA)
A statistical method used in regression to assess the overall significance of the model by partitioning the total variability of the dependent variable into explained (regression) and unexplained (error) components.
ANOVA Table
A standard table in regression output that presents the sums of squares, degrees of freedom, mean squares, and F-statistic along with its p-value, summarizing the results of the ANOVA.
Degrees of Freedom (df)
The number of independent values that can vary in a statistical calculation. In regression, df are associated with the sums of squares and mean squares. Regression df = number of independent variables (k); Residual df = n - k - 1 (sample size minus number of estimated coefficients, including intercept); Total df = n - 1 (sample size minus one).
Mean Squares (MS)
Calculated in ANOVA by dividing a sum of squares (SS) by its corresponding degrees of freedom (df). They represent the average variability for a source of variation; Regression MS = SSR / Regression df; Residual MS (MSE) = SSE / Residual df.
Mean Square Error (MSE)
The Residual Mean Square in the ANOVA table; it is an estimate of the variance of the error term (σ²). The Standard Error of Estimate is the square root of MSE.
F-test (ANOVA F test)
A statistical test evaluating the overall significance of the regression model. It tests the null hypothesis that all slope coefficients are simultaneously equal to zero (H0: β₁=β₂=...=β_k=0). In simple linear regression, this is equivalent to testing H0: β₁=0. The test statistic is the ratio of the Regression MS to the Residual MS (F = MSR/MSE). A significant F-test (low p-value) indicates that the model explains a statistically significant portion of the variation in the dependent variable.
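The ANOVA table and overall F-test for the same hypothetical model fit above:

```r
anova(fit)                # sums of squares, df, mean squares, F value, and p-value
summary(fit)$fstatistic   # F statistic with its numerator and denominator df
```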
Regression Assumptions
Fundamental conditions about the data and the error term that must be reasonably satisfied for the statistical inferences (t-tests, F-tests, confidence intervals) from a linear regression model to be valid. Includes Linearity, Independence of errors, Homoscedasticity, and Normality of errors.
Linearity (Assumption)
Assumes that the mean of the dependent variable is a linear function of the independent variable(s). The relationship between the variables can be represented by a straight line. Violation is often checked with residual plots.
Independence (of errors) (Assumption)
Assumes that the error terms for different observations are uncorrelated with each other. Absence of this is Autocorrelation.
Homoscedasticity
The assumption that the variance of the error term is the same for all values of the independent variable(s). This is the condition of equal variance of residuals.
Heteroscedasticity
The condition where the variance of the error term is not constant across all levels of the independent variable(s). It is the violation of the homoscedasticity assumption and leads to unreliable standard errors and significance tests. Checked with residual plots.
Normality (of Errors) (Assumption)
Assumes that the distribution of the error terms is normal, implied for small sample inference validity. Can be checked visually using density plots and QQ plots of the residuals.
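Common visual checks of these assumptions for the hypothetical model fit above:

```r
plot(fitted(fit), residuals(fit))   # curvature suggests non-linearity; a funnel shape suggests heteroscedasticity
abline(h = 0, lty = 2)

qqnorm(residuals(fit))              # QQ plot to check normality of the errors
qqline(residuals(fit))
```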
Robust standard errors
Standard error estimates for regression coefficients that are computed in a way that makes them less sensitive to violations of the assumptions of homoscedasticity or independence of errors. They can provide more reliable inference (t-tests, p-values) when these assumptions are not met.
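A sketch of robust standard errors in R, assuming the sandwich and lmtest packages are installed (they are not part of base R):

```r
library(sandwich)
library(lmtest)

coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # heteroscedasticity-robust t-tests and p-values
```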
Multicollinearity
A phenomenon in multiple regression where two or more independent variables are highly correlated with each other. This complicates the interpretation of individual coefficient estimates and can lead to unstable standard errors.
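A multicollinearity check using variance inflation factors, assuming the car package is installed; X2 is a hypothetical second predictor built to be almost identical to X:

```r
library(car)

df$X2 <- df$X + rnorm(40, sd = 0.1)   # nearly a copy of X, so highly collinear
vif(lm(Y ~ X + X2, data = df))        # large VIFs (e.g. > 10) flag multicollinearity
```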
attach() function (R)
A function in R that adds a data frame to the search path, allowing variables within that data frame to be accessed by their names without explicitly specifying the data frame using the $ operator.
str() function (R)
A function in R used to display the compact internal structure of an R object, such as a data frame, showing the names and types of variables.
describe() function (R)
A function that computes and displays basic descriptive statistics (e.g., mean, median, standard deviation, min, max, skew, kurtosis) for variables in a data set.
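Typical use of the three functions above on a placeholder data frame; describe() here assumes the psych package is installed:

```r
mydata <- data.frame(height = c(170, 165, 180), weight = c(68, 59, 81))

str(mydata)      # compact structure: variable names and types
attach(mydata)   # puts height and weight on the search path
mean(height)     # works without writing mydata$height
detach(mydata)

library(psych)
describe(mydata)   # mean, sd, median, min, max, skew, kurtosis, ...
```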