Flashcards covering statistical concepts, probability, hypothesis testing, regression models, and data management based on the Economic Data Analytics lecture notes.
Inferential Statistics
Drawing conclusions about the population based on the sample
Population
The entire group we are studying
Sample
A smaller subgroup drawn from the population
Parameter
A numerical measurement of the population
Statistic
A numerical measurement of the sample
Constant
The true value of a population parameter
Sampling Distribution
The probability distribution of a statistic that would be found if one selected all random samples of size n from a population.
Standard Error
The standard deviation of a statistic or the sampling distribution.
Sampling Error
The difference between the values of the sample statistic and the population parameter.
Simple Random Sampling
Picking items at random where each item has an equal chance of getting picked
Stratified Sampling
A method where the population is broken into homogenous subgroups called strata which are mutually exclusive and exhaustive.
Cluster Sampling
A method where the population is broken into natural groupings called clusters, such as towns within a state, and clusters are selected at random.
Normal Curve
Symmetric
Not Skewed
Its tails come close to but never touch the axis (asymptotic)
Central Limit Theorem
The principle that the sampling distribution of the mean will be approximately normal if the sample size is large enough, regardless of the shape of the population distribution. Almost all values fall between -3 and +3 standard deviations.
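The theorem can be illustrated with a small simulation: even when the population is skewed (here an exponential distribution, chosen as an illustrative example), the means of repeated samples cluster symmetrically around the population mean.

```python
import random
import statistics

# Illustrative sketch: draw many samples from a skewed (exponential) population
# and watch the distribution of the sample means concentrate around the
# population mean, as the Central Limit Theorem predicts.
random.seed(0)

sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))  # n = 50 per sample
    for _ in range(2000)
]

# Population mean is 1.0; the SD of the sample means should be near
# sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.14.
print(round(statistics.mean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```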
Z-score
A standard measure that expresses scores in units of standard deviations, making it possible to compare different distributions: z = (X − x̄) / s
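A minimal sketch of the z-score formula, using made-up illustration numbers:

```python
# z-score: how many standard deviations a score sits from the mean.
def z_score(x, mean, sd):
    return (x - mean) / sd

# A score of 85 in a class with mean 70 and SD 10 is 1.5 SDs above the mean.
print(z_score(85, 70, 10))  # -> 1.5
```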
Null Hypothesis (H0)
A statement of equality that assumes no difference between groups or variables until evidence proves otherwise.
Research Hypothesis (H1)
A definite statement that a difference exists between variables, which can be directional or nondirectional.
One-tailed Test
A statistical test that reflects a directional hypothesis and posits a difference in a particular direction.
Two-tailed Test
A statistical test that reflects a nondirectional hypothesis and does not specify the direction of the difference.
Statistically Significant
A finding that a difference is not due to chance, but rather due to some systematic influence.
Difference b/n null hypothesis and research hypothesis
Null hypothesis: 1. statement of equality 2. related to the population 3. indirectly related to the sample 4. uses μ (mu)
Research hypothesis: 1. statement of inequality 2. related to the sample 3. uses x̄ (x-bar)
Features of a good hypothesis
Declaration not question
Expected relationship between variables
Reflects theory
Brief
Testable
Statistically Significant
A difference is not due to chance but actually has a systematic reason
Type 1 Error (Alpha)
Incorrectly rejecting a true null hypothesis, also known as a "false positive."
Type 2 Error (Beta)
Incorrectly accepting a false null hypothesis, also known as a "false negative."
Power
The probability of detecting an effect if one is present, or the probability of avoiding a Type 2 error. 1-beta
Factors that affect power
Alpha - increase in alpha, decreases beta and increases power
Sample size - larger sample size increases power
Variability - greater variability decreases power
Magnitude of the effect of a variable - higher magnitude makes detection easier which increases the power
Difference b/n significant and meaningful
Significant - something happens for a reason not by chance. Determined by p value and alpha
Meaningful - the result actually matters. Determined by effect size, impact etc.
Confidence Interval
An estimated range of values that is likely to include the unknown population value.
Z statistic
z statistic = (x̄ − μ) / SEM
The z test checks whether the population mean is equal to some specified value.
If |z statistic| > z critical value, we reject H0; otherwise we fail to reject.
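A sketch of the one-sample z test with hypothetical numbers (the data values and the 1.96 two-tailed critical value at alpha = 0.05 are standard illustration choices, not from the notes):

```python
import math

# H0: population mean = 100. Sample of n = 36 with x-bar = 104, known sigma = 12.
x_bar, mu0, sigma, n = 104, 100, 12, 36

sem = sigma / math.sqrt(n)      # standard error of the mean: 12 / 6 = 2
z_stat = (x_bar - mu0) / sem    # (104 - 100) / 2 = 2.0

z_critical = 1.96               # two-tailed critical value at alpha = 0.05
reject_h0 = abs(z_stat) > z_critical
print(z_stat, reject_h0)        # -> 2.0 True
```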
SEM (Standard Error of the Mean)
Measure of variability between sample means, calculated as SEM = σ / √n; it estimates how accurately the sample mean represents the population mean.
Effect Size
A measure of how different two groups are from one another that helps determine the "meaningfulness" of a result.
Simple effect size = (x̄1 − x̄2) / s
Pooled effect size = (x̄1 − x̄2) / √((s1² + s2²) / 2)
0.0 - 0.2 effect size is small, 0.2 - 0.5 is medium, 0.5 and above is large.
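The pooled effect size formula can be sketched on two made-up groups:

```python
import math
import statistics

# Pooled effect size: mean difference divided by the root-mean-square of the
# two sample SDs (made-up illustration data).
group1 = [12, 14, 15, 13, 16, 14]
group2 = [10, 11, 12, 10, 13, 11]

s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
pooled_sd = math.sqrt((s1**2 + s2**2) / 2)
d = (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd
print(round(d, 2))  # above 0.5, so a large effect by the rule of thumb
```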
T test for independent groups
t statistic = (x̄1 − x̄2) / √[ (1/n1 + 1/n2) × ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]
df= n1+n2 -2
This checks whether the population mean of group 1 = population mean of group 2
If |t statistic| > t critical value we reject; otherwise we fail to reject.
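A sketch of the pooled independent-groups t statistic on made-up data (the 2.306 critical value is the standard two-tailed table value at alpha = 0.05, df = 8):

```python
import math
import statistics

# Independent-groups t test with pooled variance (illustration data).
g1 = [23, 25, 28, 30, 26]
g2 = [20, 19, 24, 21, 22]
n1, n2 = len(g1), len(g2)

pooled_var = ((n1 - 1) * statistics.variance(g1) +
              (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
t_stat = (statistics.mean(g1) - statistics.mean(g2)) / math.sqrt(
    pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

t_critical = 2.306  # two-tailed, alpha = 0.05, df = 8
print(round(t_stat, 2), df, abs(t_stat) > t_critical)
```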
T test for dependent groups
t statistic = ΣD / √[ (nΣD² − (ΣD)²) / (n − 1) ]
df = n-1
This checks whether the average difference = 0
Same decision rule as the independent-groups test for rejecting or failing to reject.
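The dependent-groups formula above can be sketched with made-up before/after scores for the same five subjects:

```python
import math

# Paired (dependent) t test: work directly with the difference scores D.
before = [80, 75, 90, 85, 70]
after  = [85, 79, 92, 88, 75]

D = [a - b for a, b in zip(after, before)]    # difference scores
n = len(D)
sum_d  = sum(D)
sum_d2 = sum(d * d for d in D)

t_stat = sum_d / math.sqrt((n * sum_d2 - sum_d**2) / (n - 1))
df = n - 1
print(round(t_stat, 2), df)
```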
Degrees of Freedom (df)
The number of values in the final calculation of a statistic that are free to vary; for an independent t-test, it is (N1−1)+(N2−1), and for a one-way ANOVA, the numerator (k−1) and denominator (N−k).
Correlation Coefficient
Measures the strength and direction of the linear relationship between two variables. Ranges from -1 to +1; a coefficient of 0 means no relationship.
r = [nΣxy − (Σx)(Σy)] / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]
Strength: 0.0-0.2 is very weak, 0.2-0.4 is weak, 0.4-0.6 is moderate, 0.6-0.8 is strong, 0.8-1.0 is very strong.
Direction: x and y both increase or both decrease (direct, positive); x increases while y decreases, or x decreases while y increases (indirect, negative).
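The computational formula for r can be sketched on made-up (x, y) pairs:

```python
import math

# Pearson correlation coefficient via the computational formula
# (illustration data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
print(round(r, 2))  # positive: as x increases, y tends to increase
```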
Coefficient of Determination (r2)
The percentage of variance in one variable that is shared with or explained by the variance of another variable.
Coefficient of Nondetermination
The proportion of the variance in variable Y that is not explained by variable X; equal to 1 − r².
Correlation Coefficient Test
H0: ρ = 0 and H1: ρ ≠ 0
The test statistic is either the correlation coefficient value itself or the t statistic.
Find the t critical value and compare.
df = n - 2
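One common t form of this test is t = r√(n − 2) / √(1 − r²); a sketch with hypothetical numbers (the 2.060 critical value is the standard two-tailed table value at alpha = 0.05, df = 25):

```python
import math

# Significance test for a correlation coefficient (hypothetical r and n).
r, n = 0.60, 27

t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)  # 0.6 * 5 / 0.8 = 3.75
df = n - 2

t_critical = 2.060  # two-tailed, alpha = 0.05, df = 25
print(round(t_stat, 2), abs(t_stat) > t_critical)
```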
Simple Linear Regression
The line of best fit that shows the linear relationship between two variables.
Independent variable
Variable that helps determine another variable (X)
Dependent Variable
Variable that is the outcome which depends on another variable for its values
Prediction error
The difference between actual and predicted values. Total prediction error, Σ(y actual − y predicted)², is lowest at the line of best fit.
Limits of R2
Only focuses on the linear relationship and neglects all others
Influenced by outliers
Analysis of Variance (ANOVA)
A statistical test used to determine if there are significant differences between the means of more than two groups.
F-statistic
The ratio of the variability between groups to the variability within groups, calculated as F = Mean Squares Between / Mean Squares Within.
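The F ratio can be sketched by hand on three made-up groups:

```python
import statistics

# One-way ANOVA: F = MS between / MS within (illustration data).
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = statistics.mean(x for g in groups for x in g)

ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)   # numerator df = k - 1
ms_within = ss_within / (N - k)     # denominator df = N - k
f_stat = ms_between / ms_within
print(round(f_stat, 2))
```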
Factorial Analysis of Variance
A variation of ANOVA that explores more than one treatment factor and identifies main effects and interaction effects.
Nonparametric Tests
Statistical tests used when assumptions like normal distribution or homogeneity of variance are violated.
Chi-square Goodness-of-fit Test
A nonparametric test used to determine if the observed frequency of occurrences in categories matches what is expected by chance.
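The goodness-of-fit statistic is Σ(observed − expected)² / expected, summed over categories; a sketch with made-up counts (the 7.815 critical value is the standard chi-square table value at alpha = 0.05, df = 3):

```python
# Chi-square goodness of fit: do the observed category counts match the
# frequencies expected by chance? (Illustration data: 100 responses, 4 options.)
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]   # equal frequencies expected under H0

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical = 7.815  # alpha = 0.05, df = 3
print(round(chi_sq, 2), chi_sq > critical)
```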
Linear Regression
A predictive tool that estimates a statistical relationship between a continuous independent variable (X) and a continuous dependent variable (Y) using a line of best fit.
R-squared (R2)
A measure of how much variation in the dependent variable is explained by the variation in the independent variable(s).
Multiple Regression
A regression model with more than one predictor (independent) variable to explain the outcome of a dependent variable.
Logistic Regression
A predictive tool used when the dependent variable is binary rather than continuous, utilizing maximum likelihood estimation (MLE).
Analysis File
A static version of the data that is fully cleaned and prepped before starting the formal analysis.
Appending
The process of stacking one data file on top of another to combine observations.
Merging
The process of combining files by adding variables to observations using a unique identifier.
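Appending and merging can be sketched with hypothetical pandas DataFrames (the column names here are made up for illustration):

```python
import pandas as pd

# Two quarterly files with the same variables, plus a lookup table.
q1 = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [120, 130]})
regions = pd.DataFrame({"id": [1, 2, 3, 4],
                        "region": ["East", "West", "East", "West"]})

# Appending: stack one file on top of the other to combine observations.
appended = pd.concat([q1, q2], ignore_index=True)

# Merging: add variables to observations via the unique identifier "id".
merged = appended.merge(regions, on="id")
print(len(appended), list(merged.columns))
```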
Data Science
A field combining statistics, computer science, and programming used to extract information from big or unstructured data.
Natural Language Processing (NLP)
A data science technique focused on extracting the fuller meaning from free text, including grammar and parts of speech.
Pivot Tables
Excel tools used to summarize and analyze large data files through functions like Sum, Count, Average, Max, and Min.
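A pandas analogue of the Excel pivot table described above, on made-up data:

```python
import pandas as pd

# Summarize a long data file by category, as an Excel pivot table would.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "sales":  [100, 150, 120, 130],
})

pivot = pd.pivot_table(df, index="region", values="sales", aggfunc="sum")
print(pivot)
# Other aggfuncs ("mean", "count", "max", "min") mirror the Excel functions.
```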