Data Analysis - Midterm 2 Review

Description and Tags

Statistics

University/Undergrad

Last updated 7:14 PM on 11/30/22

75 Terms

New cards

Hypothesis

Proposed explanation made on the basis of limited evidence that serves as a starting point for further investigation

Formal statement that predicts the relationship between 2+ variables

New cards

Null Hypothesis (H0)

Hypothesis that there is no effect, assumed to be true throughout the testing procedure

New cards

Alternative Hypothesis (H1)

Hypothesis that is contrary to the null hypothesis, suggests an effect to be observed

New cards

Type 1 Error

False Positive

Sample results lead to rejection of null hypothesis when it is true

New cards

Type 2 Error

False Negative

Sample results lead to the null hypothesis not being rejected when it is false

New cards

Test Statistic

Contains important info about the data for deciding whether to reject H0

Depending on the probability model, a different test stat may be used

New cards

One tailed test

Critical area of the distribution is one-sided so that it is either greater or less than a certain value

New cards

Two tailed test

Critical area of the distribution is two-sided so that it is greater than or less than a certain range of values

New cards

Critical Area

Set of values of the test stat for which H0 is rejected

New cards

Level of Significance (a)

Probability of making Type 1 Error

100(1-a)% confidence level

New cards

Critical Values

Cut-off point between accepting and rejecting the null hypothesis
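
As a rough illustration, critical values for a z-based test can be read off the standard normal distribution. This is a minimal sketch that assumes a standard normal test stat; the function name `z_critical` is made up for this example.

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Critical z value for significance level alpha (standard normal assumed)."""
    # A two-tailed test splits alpha evenly between both tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(round(z_critical(0.05), 3))                    # two-tailed cut-off
print(round(z_critical(0.05, two_tailed=False), 3))  # one-tailed cut-off
```

For a = 0.05 this gives roughly 1.96 (two-tailed) and 1.645 (one-tailed), the values usually found in z tables.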

New cards

Area Under the Curve

AUC = 1
1-a is the white area

New cards

Confidence Level

confidence is how sure we are of getting the same results if the survey were repeated

0-100%

higher = more certain that the interval contains the true value

New cards

Confidence Coefficient

confidence level as stated as a proportion rather than a percentage

0.00-1.00

New cards

Confidence Interval

Null hypothesis acceptance region, range of values for test stats for which null hypothesis is accepted

New cards

p-value

Evaluates how well the sample data support the null hypothesis being true

Value close to 1 is consistent with H0

Value close to 0 suggests little faith in H0

New cards

Hypothesis Testing Process

- Set up null and alt hypothesis
- Choose the appropriate sample test statistic
- Select the level of significance
- Construct a decision rule
- Compute the test stat
- Decision
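
The six steps above can be sketched for one simple case. This assumes a two-tailed one-sample z test with a known population SD; the function name, the sample, and the values mu0 = 50 and sigma = 6 are hypothetical and only there to make the example run.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample z test. H0: population mean == mu0."""
    n = len(sample)
    # Compute the test stat: distance of the sample mean from mu0 in standard errors.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Decision rule: reject H0 when |z| exceeds the critical value for alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # p-value: probability of a result at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"z": z, "z_crit": z_crit, "p_value": p_value, "reject_h0": abs(z) > z_crit}

# Hypothetical sample of 36 observations with mean 53.
sample = [52] * 18 + [54] * 18
print(one_sample_z_test(sample, mu0=50, sigma=6))
```

Here the z stat is 3, which is beyond the two-tailed cut-off of about 1.96, so H0 is rejected at a = 0.05.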

New cards

One sample difference of means

classical hypothesis testing

compares the mean of a sample to a pre-specified value and tests for a deviation from that value

New cards

Normal distribution

Symmetric, bell-shaped distribution where the relative frequency distribution is deemed continuous

characterized by the equality of the mean, median, mode

New cards

Z score

standard score

number of standard deviations an observation is from the mean

New cards

z score formula

z = (x - x(bar)) / SD

x = observed value
x(bar) = mean
SD = standard deviation
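
A quick worked example of the formula, using made-up numbers (observed value 85, mean 70, SD 10):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (x - mean) / sd

print(z_score(85, 70, 10))  # 1.5, i.e. 1.5 SDs above the mean
```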

New cards

standard deviation

indicates how values are spread around the average, provides a range in which values are dispersed

New cards

Sampling distribution

the probability distribution of a sample statistic

found by calculating all stats from all samples of n size, drawing the distribution of sample values

New cards

degrees of freedom

measure of sample size, number of observations that are free to vary when the mean is given

df = n - 1

New cards

z-stat is small, then

difference between the sample mean and population mean is small

H0 is not rejected

New cards

z-stat is large, then

difference btw the sample mean and population mean is large

H0 is rejected

New cards

two sample difference of means

compare sample means with a test stat and p-value to determine if there is enough evidence to suggest a difference between the two population means

classical hypothesis testing

New cards

z stat

used when n >= 30 and variance/SD is known, based on normal distribution

New cards

t test stat

used when n < 30 and/or variance/SD is unknown, based on the student t-distribution

New cards

limitations of hypothesis testing

choice for the level of significance is arbitrary

decision to accept/reject the H0 is binary

limited info is gathered

New cards

p-value hypothesis testing

more exact than classical hypothesis testing as probability values are estimated

critical values that separate rejection and non-rejection regions are now based on the location of the sample mean relative to the population mean

New cards

one sample difference of proportions, one tailed test

critical area of distribution is either greater or less than a certain value, not both

use p-value

New cards

rejection of H0

the effect is statistically significant

New cards

parametric tests

population parameters are known or can be calculated

an assumption is made about the data's underlying distribution

New cards

non-parametric test

no knowledge about population parameters is required

no distributional assumptions are made

New cards

Mann-Whitney Test (U test)

non-parametric equivalent to the two sample difference of means test

H0: sum of ranks A = sum of ranks B
H1: sum of ranks A ≠ sum of ranks B

- set up null and alt hypothesis
- rank the data (both samples)
- sum of ranks of the smaller group
- calculate test stat

New cards

Wilcoxon signed-rank test (W test)

non-parametric equivalent to the paired comparison test

- set up null and alt hypothesis
- calculate differences
- calculate absolute values of differences
- rank absolute values (lowest rank = smallest difference), then sign each rank + or - based on pos or neg difference
- calculate the test stat
- interpret test stat
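
The steps above (differences, absolute values, signed ranks) can be sketched as below. It assumes zero differences are dropped and that the test stat is the smaller of the positive and negative rank sums, which is one common convention. The names and data are hypothetical.

```python
def average_ranks(values):
    """Rank from smallest (rank 1) to largest; tied values share the average rank."""
    ordered = sorted(values)
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]

def signed_rank_statistic(before, after):
    """Sums of positive and negative ranks of the paired differences."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])            # rank absolute differences
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return {"w_plus": w_plus, "w_minus": w_minus, "statistic": min(w_plus, w_minus)}

print(signed_rank_statistic([10, 12, 9, 11], [12, 11, 13, 11]))
```

In this example the differences are 2, -1, 4 and 0; the zero is dropped, the ranks are 2, 1 and 3, so the positive ranks sum to 5 and the negative ranks to 1.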

New cards

chi-square test (x^2)

goodness of fit tests

compares the observed frequency counts with the expected frequency counts of a variable over the same categories to determine if there is a significant difference

cannot be used for percentages, proportions, rates...

H0 = no difference btw observed and expected frequency counts
H1 = magnitude of differences btw frequencies is large in at least one category

New cards

Kolmogorov-Smirnov test (K-S test)

goodness of fit tests

compares observed distribution of the sample data to an expected distribution

H0 = there is no difference btw the observed and expected distribution
H1 = observed distribution of the sample data differs from the expected distribution

D = max |S(x) - F(x)|
D = largest absolute deviation btw two distributions
S(x) = cumulative relative frequency (observed) for x
F(x) = cumulative relative frequency (expected) for x
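
A small sketch of the D calculation, assuming the data are already grouped into the same ordered classes for both distributions. The counts are made up and the function name is hypothetical.

```python
from itertools import accumulate

def ks_statistic(observed_counts, expected_counts):
    """D = max |S(x) - F(x)| over ordered classes."""
    n_obs, n_exp = sum(observed_counts), sum(expected_counts)
    # Cumulative relative frequencies for the observed and expected distributions.
    s = [c / n_obs for c in accumulate(observed_counts)]
    f = [c / n_exp for c in accumulate(expected_counts)]
    return max(abs(si - fi) for si, fi in zip(s, f))

# Observed counts vs. a uniform expectation across four classes.
print(round(ks_statistic([10, 20, 30, 40], [25, 25, 25, 25]), 3))  # 0.2
```

The largest gap occurs at the second class, where S(x) = 0.3 and F(x) = 0.5.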

New cards

Contingency table

a cross-classified set of frequency counts for two nominally or ordinally scaled variables, each with 2+ classes

New cards

Marginal totals

frequency count totals of a row or column

New cards

X^2 test assumptions

entire contingency table is evaluated statistically as a single entity

test can inform you if there is a stat difference btw expected and observed frequency counts but cannot tell you where this difference occurs

locating the difference is no problem w 2 variables w 2 categories each

it is a problem when a variable has 3+ categories

New cards

x^2 test restrictions

data must be absolute frequency counts

at least 2 samples and 2 categories

no category should have an expected frequency less than 2

no more than 20% of categories should have expected frequencies of < 5

New cards

X^2 test stat

X^2 = sum of [(observed - expected)^2 / expected]

do this for every category (cell) and sum to find test stat
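
A worked example of the test stat with hypothetical counts for three categories:

```python
def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# (18-20)^2/20 + (22-20)^2/20 + (20-20)^2/20 = 0.2 + 0.2 + 0 = 0.4
print(round(chi_square_stat([18, 22, 20], [20, 20, 20]), 3))  # 0.4
```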

New cards

contingency table analysis

- set up null and alt hypothesis
- verify assumptions and specify the appropriate sample stat
- select level of significance
- construct decision rule
- decision

- construct contingency table
- determine the row totals and column totals
- calculate the expected frequencies
- calculate X^2 test stat
- determine df
- compare the x^2 test stat to the x^2 critical value
- write statement about your findings

test stat < x^2 critical, fail to reject H0
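
The calculation steps above can be sketched for a small table. This assumes the usual formulas expected = (row total * column total) / n and df = (rows - 1)(columns - 1), which the card does not spell out. The 2x2 counts are made up.

```python
def contingency_chi_square(table):
    """Return (X^2 test stat, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = contingency_chi_square([[20, 30], [30, 20]])
print(chi2, df)  # 4.0 1
```

With df = 1 and a = 0.05 the tabled critical value is 3.841, so a test stat of 4.0 leads to rejecting H0.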

New cards

scatterplot

a graph that shows the relationship between two variables, does not provide a numerical measure of the strength of the relationship

relationships btw variables can be
- positive
- negative
- linear
- nonlinear
- strong
- weak
- null
- none

can reveal patterns such as
- data clusters
- gap in values
- outliers

New cards

ways to add third variable to scatterplot

- through sized bubbles
- through a sequence of colours with varying hues

New cards

Outlier

data point that differs significantly from the other observations

New cards

covariance

measure of linear association between variables

quantifies the strength of the relationship

not a standardized measure

Cov(X, Y) = [sum of (x - x(bar))*(y - y(bar))] / (n - 1)

positive if most points lie in Q1&3

negative if most points lie in Q2&4

New cards

correlation

standardized measure of linear association btw variables
- between -1 and +1
- larger absolute value indicates a stronger association btw variables
- 0 is no correlation

New cards

monotonic association

order of paired values is preserved when this association is observed

functions and associations btw variables are known as monotonic if they are entirely increasing or entirely decreasing over the whole domain

variables tend to move in a consistent direction but not necessarily at a constant rate

New cards

Spearman's rank correlation

non-parametric test
assumes variables have monotonic association
computes a correlation on the ranks of x and y
continuous numerical or ordinal

- set up null and alt hypothesis
- choose appropriate sample test stat and its distribution
- select level of significance
- construct decision rule
- compute test stat
- decision
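
To illustrate "a correlation on the ranks", here is a minimal sketch that ranks both variables (ties share the average rank) and then applies the ordinary correlation formula to the ranks. The function names and data are hypothetical, and the significance test itself is left out.

```python
from math import sqrt

def average_ranks(values):
    """Rank from smallest (rank 1) to largest; tied values share the average rank."""
    ordered = sorted(values)
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]

def correlation(x, y):
    """Pearson-style correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rank correlation: correlation computed on the ranks."""
    return correlation(average_ranks(x), average_ranks(y))

# y = x^2 is nonlinear but monotonic, so the rank correlation is perfect.
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```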

New cards

Pearson's correlation

parametric test
assumes variables have a linear relationship
requires numerical data wherein the data pairs are randomly selected from the population

rp = Cov(X, Y) / (Sx * Sy)
Sx = SD of x
Sy = SD of y
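
A short sketch of the formula, assuming the sample covariance and sample SDs (both dividing by n - 1). The data are made up.

```python
from statistics import mean, stdev

def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

x, y = [1, 2, 3, 4], [2, 4, 6, 8]
print(round(covariance(x, y), 3))  # 3.333
print(round(pearson_r(x, y), 3))   # 1.0 (perfect positive linear relationship)
```

Dividing by Sx * Sy is what removes the units, so r always falls between -1 and +1 while the covariance does not.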

New cards

Kendall's T correlation

non-parametric test
assumes variables have a monotonic association
measures the degree of correspondence between two rankings
- perfect agreement = +1
- disagreement is perfect = -1

T = [2P / (0.5n(n-1))] - 1
P = number of concordant pairs
n = number of paired observations
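
As a sketch of "degree of correspondence between two rankings", the code below counts concordant and discordant pairs and divides their difference by the total number of pairs, 0.5n(n-1). This is the tau-a form and assumes no ties. The function name and data are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / (0.5 * n * (n - 1))."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables order this pair the same way
        elif direction < 0:
            discordant += 1   # the two variables order this pair oppositely
    n = len(x)
    return (concordant - discordant) / (0.5 * n * (n - 1))

print(round(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]), 3))  # 0.667
```

In the example, 5 of the 6 pairs are concordant and 1 is discordant, giving (5 - 1) / 6.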

New cards

Cramer's V Correlation

standardizes the X^2 test to provide a correlation coefficient

the X^2 test alone is a significance test and does not inform of the strength of association; Cramer's V does

V = √[X^2 / (n * min(r - 1, c - 1))]
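
A small numeric sketch, assuming n is the total number of observations and r and c are the number of rows and columns of the contingency table. The input values are made up.

```python
from math import sqrt

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from an X^2 test stat, total count n, and table dimensions."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# X^2 = 4.0 from a 2x2 table with 100 observations.
print(round(cramers_v(4.0, 100, 2, 2), 3))  # 0.2
```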

New cards

Choosing between Pearsons, Spearmans, Kendalls, Cramers

Pearsons
- data is numeric
- linear relationship

Spearmans
- numeric or ordinal
- non linear or not normally distributed
- monotonic

Kendalls
- numeric or ordinal
- non linear or not normally distributed
- monotonic
- has extreme outliers
- n>10

Cramers
- categorical

New cards

Challenges of correlation

correlation does not imply causality
strength of association is affected by outliers
association may be due to chance
different scale of analysis may produce different degrees of correlation

New cards

linear regression

assumes linear relationship exists btw the dependent and independent variable

Y = a + Bx + E

New cards

ordinary least squares (OLS)

minimizes the sum of squared deviations between line and each data point
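
For a single independent variable, the line that minimizes the sum of squared deviations has a closed-form solution. The sketch below uses that standard result (slope = sum of cross-deviations / sum of squared x-deviations), which the card does not state. The function name and data are hypothetical.

```python
def ols_fit(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx  # the fitted line passes through the point of means
    return a, b

a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0, i.e. y = 1 + 2x
```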

New cards

residuals

difference between predicted values and observed values of y

better model has a lower residual standard error

New cards

regression coefficient

slope of the regression line

shows change of the line in the y direction associated with a unit increase in the x direction

+ or - provides insight to the direction of relationship btw variables

magnitude of the slope indicates the flatness/steepness of the regression line

New cards

R-squared

quantifies goodness of fit, how well does x explain variation in y

bound by 0 and 1

higher R^2 value is indicative of a better fitting model
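
One common way to compute it is R^2 = 1 - (residual sum of squares / total sum of squares). That formula is an assumption here since the card does not give one, and the observed and predicted values are made up.

```python
def r_squared(y, y_pred):
    """Share of the variation in y explained by the predictions."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained variation
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total variation
    return 1 - ss_res / ss_tot

print(round(r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]), 3))  # 0.8
```

Here the residuals sum to 1.0 in squared terms against a total of 5.0, so the model explains 80% of the variation.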

New cards

F-test

model fit

compares the joint effect of all the variables together

determines if the new variables you added to the regression equation improved the model

read alongside the p-value; only meaningful if H0 was rejected (small p-value)

higher value is better

New cards

t-test on individual coefficients

t-test on each individual regression coefficient to determine whether it is statistically different from 0

t-stat is compared with the t-critical
- |t-stat| > t-crit, then coefficient is statistically significant

New cards

multiple regression

Y_i = a + B_1 X_1 + B_2 X_2 + B_3 X_3 +...+ B_n X_n + E

used when there is more than one independent variable that might influence the dependent variable

one equation for all variables being examined

New cards

adjusted R^2

used for multiple regression

accounts for the number of variables in the regression equation by reducing the effect of adding a new variable

acts as a control
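
The usual adjustment is 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables. The card does not give a formula, so treat this as the standard textbook version. The numbers are made up.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 = 0.8 with 11 observations and 2 independent variables.
print(round(adjusted_r_squared(0.8, 11, 2), 3))  # 0.75
```

Adding a variable raises k, so the adjusted value only goes up if the new variable improves R^2 by enough to offset the penalty.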

New cards

multiple regression variable selection

is choice of dependent and independent variables correct?

does hypothesized relationship make sense?

are the variables highly correlated?

number of variables in the equation must be less than the sample size

New cards

parsimony

2+ models explain data equally well, use simpler model

New cards

multicollinearity

several independent variables are correlated

sign reversals occur

redundant info is provided and significance levels are misleading

New cards

dummy variables

allow us to analyze categorical variables

numerical value is assigned to a class level
- reference category is assigned a value of 0
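
One common way to do this is a separate 0/1 column for each non-reference class, so the reference category is 0 in every column. This is a sketch of that scheme only; the function name and category labels are hypothetical.

```python
def dummy_encode(values, reference):
    """One 0/1 indicator per non-reference class; the reference is all zeros."""
    levels = sorted(set(values) - {reference})
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

for row in dummy_encode(["urban", "rural", "suburban"], reference="rural"):
    print(row)
```

Each coefficient on an indicator is then read as the difference from the reference category.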

New cards

non-linear regression

often an independent variable has a nonlinear relationship with a dependent variable

important to plot data before creating a regression-line in order to establish what type of relationship exists

non-linear regression transforms data so a regression equation can be developed

New cards

logistic regression

Y is now dichotomous (yes/no, 0/1)

finding the probability of y being 1 or 0

a + Bx = 0, predicted prob of Y = 1 is 0.5
a + Bx = large pos no., predicted prob of Y = 1 ~ 1
a + Bx = large neg no., predicted prob of Y = 1 ~ 0

B tells us how much the log-odds of Y = 1 change with a 1 unit increase in x (the odds are multiplied by e^B)
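
A minimal numeric sketch, assuming the standard logistic form p = 1 / (1 + e^-(a + Bx)) for the probability that Y = 1. The card does not write the function out, so this form and the coefficient values are assumptions.

```python
from math import exp

def predicted_probability(a, b, x):
    """Probability that Y = 1 under the standard logistic function."""
    return 1 / (1 + exp(-(a + b * x)))

def odds_ratio(b):
    """Factor by which the odds of Y = 1 are multiplied per 1-unit increase in x."""
    return exp(b)

print(predicted_probability(0, 1, 0))              # 0.5 when a + Bx = 0
print(round(predicted_probability(0, 1, 10), 4))   # close to 1 for a large positive a + Bx
print(round(predicted_probability(0, 1, -10), 4))  # close to 0 for a large negative a + Bx
```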

New cards

maximum likelihood estimation

logistic regression

predicts odds of 0/1 by identifying the coefficient values that make the observed data most probable (i.e. that maximize the likelihood)

New cards

goodness of fit measure for logistic regression

likelihood ratio (similar to F-test)
Rho-squared or Pseudo R^2 (R^2)
t-tests to determine stat difference from 0

New cards

regression use

used to relate variables to one another through the use of mathematical functions

allows examination of variability in one variable as a function of the variability in another

good for forecasting