Key Concepts in Statistics and Data Analysis


80 Terms

1
New cards

Statistics

The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.

2
New cards

Descriptive Statistics

Methods of organizing, summarizing, and presenting data in an informative way.

3
New cards

Inferential Statistics

A decision, estimate, prediction, or generalization about a population, based on a sample.

4
New cards

Population

The entire set of individuals or objects of interest, or the measurements obtained from all individuals or objects of interest.

5
New cards

Sample

A portion, or part, of the population of interest.

6
New cards

Qualitative or categorical variable

The characteristic being studied is nonnumeric.

7
New cards

Quantitative variable

Information is reported numerically.

8
New cards

Discrete variables

Can only assume certain values, and there are usually 'gaps' between values.

9
New cards

Continuous variable

Can assume any value within a specified range.

10
New cards

Parameter

A measurable characteristic of a population.

11
New cards

Statistic

A measurable characteristic of a sample.

12
New cards

Measures of Location

Purpose is to pinpoint the center of a distribution of data.

13
New cards

Mean

(Population Mean) The sum of all population values divided by the total number of population values.

14
New cards

Median

The middle value in a data set that has been arranged in ascending or descending order.

15
New cards

Mode

The value of the observation that appears most frequently.
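
The three measures of location above can be sketched with Python's standard `statistics` module (the data values are illustrative):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of values / count -> 5.0
median = statistics.median(data)  # middle of the sorted values -> 4.0
mode = statistics.mode(data)      # most frequent value -> 3
```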

16
New cards

Dispersion

Describes spread around the center.

17
New cards

Range

Largest - Smallest value.

18
New cards

Mean Deviation

The arithmetic mean of the absolute values of the deviations from the arithmetic mean.

19
New cards

Variance

The arithmetic mean of the squared deviations from the mean. For populations whose values are near the mean, the variance will be small; For populations whose values are dispersed from the mean, the variance will be large.

20
New cards

Standard Deviation

The square root of the variance. For populations whose values are near the mean, the standard deviation will be small; For populations whose values are dispersed from the mean, the standard deviation will be large.
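
Population variance and standard deviation can be computed directly from the definitions above (a sketch with illustrative data whose population mean is 5):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # mean of squared deviations from the mean -> 4.0
std_dev = statistics.pstdev(data)      # square root of the variance -> 2.0
```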

21
New cards

Normal Probability Distribution

It is bell-shaped and has a single peak; it is symmetrical about the mean; the total area under the curve is 1.00; the mean, median, and mode are equal; it is asymptotic; the area to each side of the mean is 0.50.

22
New cards

Standard Normal Probability Distribution

A normal distribution with a mean of 0 and a standard deviation of 1.

23
New cards

z-value

The signed distance between a selected value, designated X, and the population mean, divided by the population standard deviation.
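The definition above corresponds to z = (X − μ) / σ; a minimal sketch (the numbers are hypothetical):

```python
def z_value(x, mu, sigma):
    # signed distance of x from the population mean,
    # measured in standard-deviation units
    return (x - mu) / sigma

z = z_value(130, 100, 15)  # x = 130 lies 2 standard deviations above the mean
```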

24
New cards

Central Limit Theorem

If all samples of a particular size are selected from any population, the sampling distribution of the sample mean is approximately a normal distribution. This approximation improves with larger samples.

25
New cards

Normal Distribution

If the population does not follow the normal distribution, but the sample is of at least 30 observations, the distribution of the sample means will follow the normal distribution.

26
New cards

Point estimate

A single value (point) derived from a sample and used to estimate a population value.

27
New cards

Confidence interval estimate

A range of values constructed from sample data so that the population parameter is likely to occur within that range at a specified probability.

28
New cards

Level of confidence

The specified probability for a confidence interval estimate.

29
New cards

Margin of error

The ± value added/subtracted from the point estimate to form a confidence interval.
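For a mean with known population standard deviation, the margin of error is z × σ/√n; a sketch with hypothetical sample numbers:

```python
import math

x_bar, sigma, n = 50.0, 10.0, 100  # hypothetical sample results
z = 1.96                           # z-value for 95% confidence

margin = z * sigma / math.sqrt(n)            # margin of error -> 1.96
interval = (x_bar - margin, x_bar + margin)  # 95% confidence interval
```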

30
New cards

Hypothesis Testing

An objective method of making decisions or inferences from sample data (evidence) by comparing what is observed to what is expected if the Null Hypothesis were true.

31
New cards

Statistical inference

Generalizing from a sample to a population with calculated degree of certainty.

32
New cards

Hypothesis

A statement of the researcher's idea or guess.

33
New cards

Null Hypothesis (H0)

The hypothesis that is actually tested; what we assume is true to begin with, often the opposite of the researcher's guess.

34
New cards

Alternative Hypothesis (HA)

The hypothesis assumed true if the null is false; what we aim to gather evidence of, typically that there is a difference/effect/relationship.

35
New cards

Test statistic

A quantity, calculated based on a sample, whose value is the basis for deciding whether or not to reject the null hypothesis.

36
New cards

P-value

The probability of observing data at least as extreme as the observed sample data, assuming the null hypothesis is true. A low p-value (e.g., less than the significance level) suggests evidence against the null hypothesis.

37
New cards

Significance level (alpha)

The predetermined probability of making a Type I error (rejecting a true null hypothesis).

38
New cards

Type I error

Rejecting a true null hypothesis. Occurs when sample data incorrectly suggest a treatment effect or relationship exists.

39
New cards

Type II error

Failing to reject a false null hypothesis.
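The decision rule that ties together the p-value, the significance level, and the two error types can be sketched as:

```python
def decide(p_value, alpha=0.05):
    # Reject H0 when the p-value falls below the significance level alpha.
    # Rejecting a true H0 is a Type I error (probability alpha);
    # failing to reject a false H0 is a Type II error.
    return "reject H0" if p_value < alpha else "fail to reject H0"

decide(0.03)  # -> 'reject H0'
decide(0.20)  # -> 'fail to reject H0'
```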

40
New cards

Bivariate data

Data relating to the relationship between two variables.

41
New cards

Scatter diagram

A graphical technique used to show the relationship between two variables by plotting one on the horizontal axis and the other on the vertical axis.

42
New cards

Covariance

A measure indicating how two variables are linearly related (tend to move in the same or opposite directions) but does not measure the strength of the relationship.

43
New cards

Correlation

A measure of the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
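Covariance and correlation can be computed directly from their definitions; a self-contained sketch (the data are illustrative):

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # covariance: average product of paired deviations from the means
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)  # scale covariance to the range [-1, +1]

pearson_r([1, 2, 3], [2, 4, 6])  # perfect positive linear relationship (r = 1)
pearson_r([1, 2, 3], [6, 4, 2])  # perfect negative linear relationship (r = -1)
```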

44
New cards

Linear Regression Model

A statistical model postulating a linear relationship between a dependent variable (Y) and one or more independent variables (X), represented by the equation 𝑌 = 𝛽0 + 𝛽1 × 𝑋 + 𝑢 for simple linear regression.

45
New cards

Dependent variable (Y)

The variable being predicted or explained in a regression model.

46
New cards

Independent variable (X)

The predictor or explanatory variable used to predict or explain the dependent variable in a regression model.

47
New cards

Population regression line

The true, unknown linear relationship between X and Y in the population, defined by 𝐸(𝑌|𝑋) = 𝛽0 + 𝛽1 × 𝑋 with population parameters 𝛽0 and 𝛽1.

48
New cards

Intercept (𝛽₀)

The true population y-intercept in the regression model, representing the expected mean value of Y when all independent variables are zero.

49
New cards

Slope (𝛽₁)

The true population slope coefficient in the regression model, representing the expected change in the mean of Y for a one-unit increase in the independent variable X, holding other variables constant (in multiple regression). Estimated by the sample slope 𝛽̂₁.

50
New cards

Error term (u or ε)

The random component in the regression model representing all factors other than the independent variable(s) that influence the dependent variable. It is the difference between the actual Y and the value predicted by the population regression line.

51
New cards

Ordinary Least Squares (OLS)

The standard method for estimating the coefficients (𝛽₀ and 𝛽₁) of a linear regression model by minimizing the sum of the squared differences between the observed dependent variable values and the values predicted by the estimated regression line.

52
New cards

OLS Estimator

The values of the regression coefficients (𝛽̂₀ and 𝛽̂₁) calculated using the OLS method from sample data, serving as estimates for the true population coefficients.

53
New cards

OLS regression line

The estimated linear relationship based on sample data, represented by the equation Ŷ = 𝛽̂₀ + 𝛽̂₁𝑋, also known as the estimated regression equation.

54
New cards

Predicted value (Ŷ)

The value of the dependent variable (Y) estimated by the OLS regression line for a given value of the independent variable(s) (X).

55
New cards

Residual (𝑢̂ or e)

The difference between the actual observed value of the dependent variable (Yᵢ) and its predicted value (Ŷᵢ) from the OLS regression line for a specific observation, representing the unexplained part of the variation.
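The OLS estimates, predicted values, and residuals above can be sketched directly from the formulas (illustrative data chosen to lie exactly on the line y = 1 + 2x):

```python
def ols_fit(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    # slope: sum of cross-deviations over sum of squared x-deviations
    b1 = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
          / sum((x - xb) ** 2 for x in xs))
    b0 = yb - b1 * xb  # the OLS line passes through (x_bar, y_bar)
    return b0, b1

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]               # lie exactly on y = 1 + 2x
b0, b1 = ols_fit(xs, ys)                          # -> (1.0, 2.0)
y_hat = [b0 + b1 * x for x in xs]                 # predicted values
residuals = [y - yh for y, yh in zip(ys, y_hat)]  # all zero for a perfect fit
```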

56
New cards

Sum of Squares (SS)

Measures the total deviation of data points away from a mean value, used to quantify variability.

57
New cards

Total Sum of Squares (SST or TSS)

The total variability in the dependent variable (Y), calculated as the sum of the squared differences between each observed Yᵢ and the mean of Y (Ȳ). It is the sum of the explained variation (SSR) and the unexplained variation (SSE).

58
New cards

Sum of Squares Regression (SSR or ESS)

The portion of the total variability in the dependent variable (Y) that is explained by the regression model. Calculated as the sum of the squared differences between the predicted Y values (Ŷᵢ) and the mean of Y (Ȳ).

59
New cards

Sum of Squared Errors (SSE or RSS)

The portion of the total variability in the dependent variable (Y) that is not explained by the regression model, representing the residual or unexplained variation. Calculated as the sum of the squared differences between the actual Y values (Yᵢ) and the predicted Y values (Ŷᵢ). Minimizing this sum is the goal of OLS.

60
New cards

Coefficient of Determination (R-square)

A statistic representing the proportion (or percentage) of the variance in the dependent variable (Y) that is predictable from the independent variable(s) (X) in the regression model. Calculated as SSR/SST. In simple linear regression, it is the square of the correlation coefficient (r²).
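The sum-of-squares decomposition and R² can be sketched directly from the definitions above (illustrative values):

```python
def r_squared(ys, y_hats):
    yb = sum(ys) / len(ys)
    sst = sum((y - yb) ** 2 for y in ys)                   # total variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained variation
    ssr = sst - sse                                        # explained (SST = SSR + SSE)
    return ssr / sst

r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])  # close fit -> about 0.98
```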

61
New cards

Multiple R

In regression output, for simple linear regression, it is the absolute value of the correlation coefficient between the dependent and independent variables.

62
New cards

Adjusted R-squared

A modified version of R-squared that accounts for the number of predictors and sample size, providing a less biased estimate of the population R-squared, useful for comparing regression models with different numbers of independent variables.

63
New cards

Standard Error of Estimate (SEE or s)

A measure of the typical distance between the observed Y values and the values predicted by the regression line. It quantifies the accuracy of predictions and is the square root of the Mean Square Error (MSE). A larger SEE indicates more scatter and less accurate predictions.

64
New cards

Analysis of Variance (ANOVA)

A statistical method used in regression to assess the overall significance of the model by partitioning the total variability of the dependent variable into explained (regression) and unexplained (error) components.

65
New cards

ANOVA Table

A standard table in regression output that presents the sums of squares, degrees of freedom, mean squares, and F-statistic along with its p-value, summarizing the results of the ANOVA.

66
New cards

Degrees of Freedom (df)

The number of independent values that can vary in a statistical calculation. In regression, df are associated with the sums of squares and mean squares. Regression df = number of independent variables (k); Residual df = n - k - 1 (sample size minus number of estimated coefficients, including intercept); Total df = n - 1 (sample size minus one).

67
New cards

Mean Squares (MS)

Calculated in ANOVA by dividing a sum of squares (SS) by its corresponding degrees of freedom (df). They represent the average variability for a source of variation; Regression MS = SSR / Regression df; Residual MS (MSE) = SSE / Residual df.

68
New cards

Mean Square Error (MSE)

The Residual Mean Square in the ANOVA table; it is an estimate of the variance of the error term (σ²). The Standard Error of Estimate is the square root of MSE.

69
New cards

F-test (ANOVA F test)

A statistical test evaluating the overall significance of the regression model. It tests the null hypothesis that all slope coefficients are simultaneously equal to zero (H0: β₁=β₂=...=β_k=0). In simple linear regression, this is equivalent to testing H0: β₁=0. The test statistic is the ratio of the Regression MS to the Residual MS (F = MSR/MSE). A significant F-test (low p-value) indicates that the model explains a statistically significant portion of the variation in the dependent variable.
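The F-statistic follows directly from the mean squares and their degrees of freedom; a sketch with hypothetical sums of squares:

```python
def f_statistic(ssr, sse, k, n):
    msr = ssr / k            # regression mean square (df = k)
    mse = sse / (n - k - 1)  # residual mean square, MSE (df = n - k - 1)
    return msr / mse

f = f_statistic(ssr=80.0, sse=20.0, k=1, n=12)  # (80/1) / (20/10) = 40.0
```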

70
New cards

Regression Assumptions

Fundamental conditions about the data and the error term that must be reasonably satisfied for the statistical inferences (t-tests, F-tests, confidence intervals) from a linear regression model to be valid. Includes Linearity, Independence of errors, Homoscedasticity, and Normality of errors.

71
New cards

Linearity (Assumption)

Assumes that the mean of the dependent variable is a linear function of the independent variable(s). The relationship between the variables can be represented by a straight line. Violation is often checked with residual plots.

72
New cards

Independence (of errors) (Assumption)

Assumes that the error terms for different observations are uncorrelated with each other. Violation of this assumption is autocorrelation.

73
New cards

Homoscedasticity

The assumption that the variance of the error term is the same for all values of the independent variable(s). This is the condition of equal variance of residuals.

74
New cards

Heteroscedasticity

The condition where the variance of the error term is not constant across all levels of the independent variable(s). It is the violation of the homoscedasticity assumption and leads to unreliable standard errors and significance tests. Checked with residual plots.

75
New cards

Normality (of Errors) (Assumption)

Assumes that the distribution of the error terms is normal; this is needed for valid inference in small samples. Can be checked visually using density plots and QQ plots of the residuals.

76
New cards

Robust standard errors

Standard error estimates for regression coefficients that are computed in a way that makes them less sensitive to violations of the assumptions of homoscedasticity or independence of errors. They can provide more reliable inference (t-tests, p-values) when these assumptions are not met.

77
New cards

Multicollinearity

A phenomenon in multiple regression where two or more independent variables are highly correlated with each other. This complicates the interpretation of individual coefficient estimates and can lead to unstable standard errors.

78
New cards

attach() function (R)

A function in R that adds a data frame to the search path, allowing variables within that data frame to be accessed by their names without explicitly specifying the data frame using the $ operator.

79
New cards

str() function (R)

A function in R used to display the compact internal structure of an R object, such as a data frame, showing the names and types of variables.

80
New cards

describe() function (R)

A function (e.g., from the psych package) that computes and displays basic descriptive statistics (e.g., mean, median, standard deviation, min, max, skew, kurtosis) for variables in a data set.