Data Analysis - Midterm 2 Review

Last updated 7:14 PM on 11/30/22

75 Terms

1
Hypothesis
Proposed explanation made on the basis of limited evidence that serves as a starting point for further investigation

Formal statement that predicts the relationship between 2+ variables
2
Null Hypothesis (H0)
Hypothesis that there is no effect, assumed to be true throughout the testing procedure
3
Alternative Hypothesis (H1)
Hypothesis that is contrary to the null hypothesis; states that there is an effect to be observed
4
Type 1 Error
False Positive

Sample results lead to rejection of null hypothesis when it is true
5
Type 2 Error
False Negative

Sample results lead to the null hypothesis not being rejected when it is false
6
Test Statistic
Contains important info about the data for deciding whether to reject H0

Based on probability model, a different test stat may be used
7
One tailed test
Critical area of the distribution is one-sided, so H0 is rejected only if the test stat is greater than a certain value (or only if it is less than a certain value)
8
Two tailed test
Critical area of the distribution is two-sided, so H0 is rejected if the test stat is either greater than the upper critical value or less than the lower critical value
9
Critical Area
Set of values of the test stat for which H0 is rejected
10
Level of Significance (a)
Probability of making Type 1 Error

100(1-a)% confidence level
11
Critical Values
Cut-off point between accepting and rejecting the null hypothesis
12
Area Under the Curve
AUC = 1
1-a is the non-rejection (unshaded) area under the curve; a is the rejection area
13
Confidence Level
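share of repeated samples (e.g. repeated surveys) whose confidence intervals would contain the true population value

0-100%

higher = more certain that the interval captures the true value, but the interval is wider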
14
Confidence Coefficient
confidence level stated as a proportion rather than a percentage

0.00-1.00
15
Confidence Interval
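Range of values, built from the sample, that is expected to contain the population parameter at the chosen confidence level; corresponds to the null hypothesis acceptance region, the range of values for which the null hypothesis is not rejected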
16
p-value
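Probability of getting a test stat at least as extreme as the one observed, assuming H0 is true; evaluates how well the sample data are consistent with H0

Value close to 1: data are consistent with H0 (this does not prove H0)

Value close to 0 suggests little faith in H0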
17
Hypothesis Testing Process
- Set up null and alt hypothesis
- Choose the appropriate sample test statistic
- Select the level of significance
- Construct a decision rule
- Compute the test stat
- Decision
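
The steps above can be sketched for a two-tailed, one-sample z test. This is a minimal illustration, assuming a known population SD and made-up numbers; the function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z test; sigma is the (assumed known) population SD.

    H0: population mean = mu0, H1: population mean != mu0.
    """
    # Decision rule: reject H0 if |z| exceeds the critical value for alpha
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Test statistic: distance of the sample mean from mu0 in standard errors
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # p-value: probability of a statistic at least this extreme if H0 is true
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, z_crit, p_value, abs(z) > z_crit


# Made-up example values
z, z_crit, p_value, reject_h0 = one_sample_z_test(
    sample_mean=52.0, mu0=50.0, sigma=6.0, n=36
)
print(round(z, 2), round(z_crit, 2), round(p_value, 4), reject_h0)
```

Comparing |z| with the critical value is the classical approach; comparing the p-value with the level of significance leads to the same decision.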
18
One sample difference of means
classical hypothesis testing

compares the mean of a sample to a pre-specified value and tests for a deviation from that value
19
Normal distribution
Symmetric, bell-shaped distribution where the relative frequency distribution is deemed continuous

characterized by the equality of the mean, median, mode
20
Z score
standard score

number of standard deviations an observation lies from the mean
21
z score formula
z = (x - x(bar)) / SD

x = observed value
x(bar) = mean
SD = standard deviation
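
A short sketch of the formula with made-up observations. It assumes the sample standard deviation (n - 1 divisor) is the SD intended; the card does not say which one.

```python
from statistics import mean, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations


def z_score(x, data):
    """Number of standard deviations that x lies from the mean of data."""
    return (x - mean(data)) / stdev(data)


print(z_score(5, values))  # an observation equal to the mean has z = 0
print(z_score(9, values))  # positive z: observation is above the mean
```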
22
standard deviation
indicates how values are spread around the average, provides a range in which values are dispersed
23
Sampling distribution
the probability distribution of a sample statistic

found by calculating the statistic for every possible sample of size n and plotting the distribution of those sample values
24
degrees of freedom
measure of sample size, number of observations that are free to vary when the mean is given

df = n - 1
25
z-stat is small, then
difference between the sample mean and population mean is small

H0 is not rejected
26
z-stat is large, then
difference btw the sample mean and population mean is large

H0 is rejected
27
two sample difference of means
compare sample means with a test stat and p-value to determine if there is enough evidence to suggest a difference between the two population means

classical hypothesis testing
28
z stat
used when n >= 30 and variance/SD is known, based on normal distribution
29
t test stat
used when n < 30 and/or variance/SD is unknown, based on the student t-distribution
30
limitations of hypothesis testing
choice for the level of significance is arbitrary

decision to accept/reject the H0 is binary

limited info is gathered
31
p-value hypothesis testing
more exact than classical hypothesis testing as probability values are estimated

critical values that separate rejection and non-rejection regions are now based on the location of the sample mean relative to the population mean
32
one sample difference of proportions, one tailed test
critical area of distribution is either greater or less than a certain value, not both

use p-value
33
rejection of H0
the effect is statistically significant
34
parametric tests
population parameters are known or can be estimated

make an assumption about the data's underlying distribution
35
non-parametric test
no knowledge about population parameters is required

no distributional assumptions are made
36
Mann-Whitney Test (U test)
non-parametric equivalent to two sample tests

H0 = sum of ranks A = sum of ranks B
H1 = sum of ranks A ≠ sum of ranks B

- set up null and alt hypothesis
- rank the data (both samples)
- sum of ranks of the smaller group
- calculate test stat
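
The ranking steps above can be sketched as follows. This is one common way to get the U statistic from rank sums, not necessarily the exact form used in the course; the data and function names are made up.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """U statistic: the smaller of U_A and U_B, computed from rank sums."""
    ranks = average_ranks(list(sample_a) + list(sample_b))  # rank both samples together
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


# Made-up samples with no overlap: every value in A is below every value in B
u = mann_whitney_u([1.1, 2.3, 3.0], [4.2, 5.5, 6.1, 7.0])
print(u)
```

A small U means the two samples' ranks barely overlap; U is then compared with a tabled critical value for the two sample sizes.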
37
Wilcoxon signed-rank test (W test)
non-parametric equivalent to the paired comparison test

- set up null and alt hypothesis
- calculate differences between the paired values
- take the absolute value of each difference
- rank the absolute values (lowest rank = smallest difference), then attach + or - to each rank based on pos or neg difference
- calculate the test stat
- interpret test stat
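
The paired-difference steps above (the procedure usually called the Wilcoxon signed-rank test) can be sketched as below. This is a simplified illustration that assumes no zero differences and no tied absolute differences; the data are made up.

```python
def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus, W) for paired samples.

    Assumes no zero differences and no tied absolute differences.
    """
    diffs = [a - b for b, a in zip(before, after)]
    # Sort by absolute difference so the smallest |d| receives rank 1
    ordered = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)
    # Test stat: the smaller of the two signed rank sums
    return w_plus, w_minus, min(w_plus, w_minus)


before = [10, 12, 9, 15, 11]   # made-up paired observations
after = [13, 11, 14, 19, 13]
w_plus, w_minus, w = wilcoxon_signed_rank(before, after)
print(w_plus, w_minus, w)
```

The two rank sums always add up to n(n+1)/2, which is a quick check on the ranking.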
38
chi-square test (x^2)
goodness of fit tests

compares the observed frequency counts with the expected frequency counts of a variable over the same categories to determine if there is a significant difference

cannot be used for percentages, proportions, rates...

H0 = no difference btw observed and expected frequency counts
H1 = magnitude of differences btw frequencies is large in at least one category
39
Kolmogorov-Smirnov test (K-S test)
goodness of fit tests

compares observed distribution of the sample data to an expected distribution

H0 = there is no difference btw the observed and expected distribution
H1 = observed distribution of the sample data differs from the expected distribution

D = max |S(x) - F(x)|
D = largest absolute deviation btw two distributions
S(x) = cumulative relative frequency (observed) for x
F(x) = cumulative relative frequency (expected) for x
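
A minimal sketch of the D statistic for data grouped into ordered classes. The counts are made up and the function name is hypothetical.

```python
def ks_statistic(observed_counts, expected_counts):
    """D = max |S(x) - F(x)| over ordered classes."""
    n_obs, n_exp = sum(observed_counts), sum(expected_counts)
    s = f = 0.0  # running cumulative relative frequencies
    d = 0.0
    for o, e in zip(observed_counts, expected_counts):
        s += o / n_obs   # S(x): observed cumulative relative frequency
        f += e / n_exp   # F(x): expected cumulative relative frequency
        d = max(d, abs(s - f))
    return d


observed = [5, 10, 20, 15]           # made-up observed counts per class
expected = [12.5, 12.5, 12.5, 12.5]  # uniform expected counts
d = ks_statistic(observed, expected)
print(round(d, 3))
```

D is then compared with a critical value for the sample size and level of significance.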
40
Contingency table
a cross-classified set of frequency counts for two nominally or ordinally scaled variables, each with 2+ classes
41
Marginal totals
frequency count totals of a row or column
42
X^2 test assumptions
entire contingency table is evaluated statistically as a single entity

test can inform you if there is a stat difference btw expected and observed frequency counts but cannot tell you where this difference occurs

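no problem with 2 variables that each have 2 categories (the source of the difference is clear)

problem when a variable has 3+ categories (the test cannot tell which category drives the difference)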
43
x^2 test restrictions
data must be absolute frequency counts

at least 2 samples and 2 categories

no category should have an expected frequency less than 2

no more than 20% of categories should have expected frequencies of < 5
44
X^2 test stat
X^2 = sum of (observed - expected)^2 / expected

do this for every cell (category) and sum to find the test stat
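
A short sketch of the test stat for a goodness of fit case, using made-up frequency counts.

```python
def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [18, 22, 20]  # made-up observed frequency counts
expected = [20, 20, 20]  # expected counts under H0
x2 = chi_square_stat(observed, expected)
print(round(x2, 3))
```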
45
contingency table analysis
- set up null and alt hypothesis
- verify assumptions and specify the appropriate sample stat
- select level of significance
- construct decision rule
- decision

- construct contingency table
- determine the row totals and column totals
- calculate the expected frequencies
- calculate X^2 test stat
- determine df
- compare the x^2 test stat to the x^2 critical value
- write statement about your findings

test stat < x^2 critical, fail to reject H0
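
The calculation steps above can be sketched for a made-up 2x2 table. Two details are assumptions not stated on the card: expected frequency = row total * column total / grand total, and df = (rows - 1) * (columns - 1). The critical value 3.841 is the standard table value for df = 1 at a = 0.05.

```python
def contingency_chi_square(table):
    """Return (chi-square test stat, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of each cell under H0
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


table = [[30, 10], [20, 40]]  # made-up observed frequency counts
chi2, df, expected = contingency_chi_square(table)
critical = 3.841  # chi-square critical value for df = 1, a = 0.05
decision = "reject H0" if chi2 > critical else "fail to reject H0"
print(round(chi2, 2), df, decision)
```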
46
scatterplot
a graph that shows the relationship between two variables, does not provide a numerical measure of the strength of the relationship

relationships btw variables can be
- positive
- negative
- linear
- nonlinear
- strong
- weak
- null
- none

can reveal patterns such as
- data clusters
- gap in values
- outliers
47
ways to add third variable to scatterplot
- through sized bubbles
- through a sequence of colours with varying hues
48
Outlier
data point that differs significantly from the other observations
49
covariance
measure of linear association between variables

quantifies the direction and strength of the relationship, but the magnitude depends on the units of measurement

not a standardized measure

Cov(X, Y) = sum of (x-x(bar))*(y-y(bar)) / (n - 1)

positive if most points lie in Q1&3

negative if most points lie in Q2&4
50
correlation
standardized measure of linear association btw variables
- between -1 and +1
- larger absolute value indicates a stronger association btw variables
- 0 is no correlation
51
monotonic association
association in which the order of the paired values is preserved (or fully reversed)

functions and associations btw variables are known as monotonic if they are increasing or decreasing for the entirety of the domain

variables tend to move in the same relative direction but not necessarily at a constant rate
52
Spearman's rank correlation
non-parametric test
assumes variables have monotonic association
computes a correlation on the ranks of x and y
continuous numerical or ordinal

- set up null and alt hypothesis
- choose appropriate sample test stat and its distribution
- select level of significance
- construct decision rule
- compute test stat
- decision
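
A minimal sketch of the "compute test stat" step, using the common shortcut formula rs = 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference between the ranks of each pair. The formula is not on the card and is only valid when there are no tied values; the data are made up.

```python
def simple_ranks(values):
    """Rank 1 = smallest value. Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    rank_x, rank_y = simple_ranks(x), simple_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but nonlinear in x
print(spearman_rs(x, y))
```

Because only the ranks are used, a perfectly monotonic but curved relationship still gives the maximum value.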
53
Pearson's correlation
parametric test
assumes variables have a linear relationship
requires numerical data wherein the data pairs are randomly selected from the population

rp = Cov(X, Y) / (Sx * Sy)
Sx = SD of x
Sy = SD of y
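
A short sketch of covariance and Pearson's correlation on made-up data. It assumes the sample versions of covariance and SD (n - 1 divisor).

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance (n - 1 divisor)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """rp = Cov(X, Y) / (Sx * Sy)."""
    s_x = sqrt(covariance(x, x))  # SD of x
    s_y = sqrt(covariance(y, y))  # SD of y
    return covariance(x, y) / (s_x * s_y)


x = [1, 2, 3, 4, 5]  # made-up paired data
y = [2, 4, 5, 4, 5]
print(covariance(x, y), round(pearson_r(x, y), 3))
```

Rescaling x or y changes the covariance but not rp, which is why the correlation is the standardized measure.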
54
Kendall's tau (T) correlation
non-parametric test
assumes variables have a monotonic association
measures the degree of correspondence between two rankings
- perfect agreement = +1
- disagreement is perfect = -1

T = [2P / (0.5n(n-1))] - 1
P = number of concordant pairs (pairs ranked in the same order on both variables)
n = number of paired observations
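
A minimal sketch that counts concordant and discordant pairs directly. It uses the equivalent form T = (concordant - discordant) / (0.5n(n-1)), which is an assumption about the intended formula, and it assumes no ties; the data are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau from pair counts. Assumes no tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # pair is ordered the same way on both variables
        elif sign < 0:
            discordant += 1  # pair is ordered in opposite ways
    n = len(x)
    return (concordant - discordant) / (0.5 * n * (n - 1))


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]  # one swapped pair
print(round(kendall_tau(x, y), 3))
```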
55
Cramer's V Correlation
standardizes the X^2 test to provide a correlation coefficient

the X^2 test alone is a significance test and does not inform of the strength of association; Cramer's V adds that measure

V = √[X^2 / (n * min(r-1, c-1))]
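
A short sketch of the formula, reading it as V = sqrt(X^2 / (n * min(r-1, c-1))) with r rows and c columns. The X^2 value and table size below are made up.

```python
from math import sqrt


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from a chi-square test stat, sample size n and table dimensions."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical 2x2 table with n = 100 and a chi-square test stat of 50/3
v = cramers_v(chi2=50 / 3, n=100, n_rows=2, n_cols=2)
print(round(v, 3))
```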
56
Choosing between Pearsons, Spearmans, Kendalls, Cramers
Pearsons
- data is numeric
- linear relationship

Spearmans
- numeric or ordinal
- non linear or not normally distributed
- monotonic

Kendalls
- numeric or ordinal
- non linear or not normally distributed
- monotonic
- has extreme outliers
- n>10

Cramers
- categorical
57
Challenges of correlation
correlation does not imply causality
strength of association is affected by outliers
association may be due to chance
different scale of analysis may produce different degrees of correlation
58
linear regression
assumes linear relationship exists btw the dependent and independent variable

Y = a + Bx + E
59
ordinary least squares (OLS)
minimizes the sum of squared deviations between line and each data point
60
residuals
difference between observed values and predicted values of y (observed - predicted)

better model has a lower residual standard error
61
regression coefficient
slope of the regression line

shows change of the line in the y direction associated with a unit increase in the x direction

+ or - provides insight to the direction of relationship btw variables

magnitude of the slope indicates the flatness/steepness of the regression line
62
R-squared
quantifies goodness of fit, how well does x explain variation in y

bound by 0 and 1

higher R^2 value is indicative of a better fitting model
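
The OLS, residuals, regression coefficient and R-squared cards above can be tied together in one sketch for a single independent variable. The closed-form slope and intercept are the standard OLS solution; the data and function names are made up.

```python
def ols_fit(x, y):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of squared residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx          # regression coefficient (slope)
    a = mean_y - b * mean_x  # intercept
    return a, b


def r_squared(x, y, a, b):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # observed - predicted
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
a, b = ols_fit(x, y)
r2 = r_squared(x, y, a, b)
print(round(a, 2), round(b, 2), round(r2, 2))
```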
63
F-test
model fit

compares the joint effect of all the variables together

determines if the new variables you added to the regression equation improved the model

read alongside the p-value; only meaningful if H0 is rejected (small p-value)

higher value is better
64
t-test on individual coefficients
t-test on each regression coefficient to determine whether it is statistically different from 0

t-stat is compared with the t-critical
- |t-stat| > t-crit, then coefficient is statistically significant
65
multiple regression
Y_i = a + B_1 X_1 + B_2 X_2 + B_3 X_3 +...+ B_n X_n + E

used when there is more than one independent variable that might influence the dependent variable

one equation for all variables being examined
66
adjusted R^2
used for multiple regression

accounts for the number of variables in the regression equation by reducing the effect of adding a new variable

acts as a control
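
A short sketch of the adjustment, using the usual formula adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the sample size and k the number of independent variables. The formula is not on the card and the input values are made up.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k independent variables fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R^2, more variables: the adjusted value drops
adj_3 = adjusted_r_squared(r2=0.6, n=25, k=3)
adj_6 = adjusted_r_squared(r2=0.6, n=25, k=6)
print(round(adj_3, 3), round(adj_6, 3))
```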
67
multiple regression variable selection
is the choice of dependent and independent variables correct?

does hypothesized relationship make sense?

are the variables highly correlated?

the number of variables in the equation must be smaller than the sample size
68
parsimony
2+ models explain data equally well, use simpler model
69
multicollinearity
several independent variables are correlated

sign reversals occur

redundant info is provided and significance levels are misleading
70
dummy variables
allow us to analyze categorical variables

numerical value is assigned to a class level
- reference category is assigned a value of 0
71
non-linear regression
often an independent variable has a nonlinear relationship with a dependent variable

important to plot data before creating a regression-line in order to establish what type of relationship exists

non-linear regression transforms data so a regression equation can be developed
72
logistic regression
Y is now dichotomous (yes/no, 0/1)

finding the probability of y being 1 or 0

a + Bx = 0, predicted prob of y = 0.5
a + Bx = large pos no., Y ~ 1
a + Bx = large neg no., Y ~ 0

B tells us how much the log-odds of Y = 1 change when there is a 1 unit increase in x (the odds are multiplied by e^B)
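
A minimal sketch of the predicted probability, assuming the standard logistic function p = 1 / (1 + e^-(a + Bx)). Under that form a large positive a + Bx gives a probability near 1 and a large negative one gives a probability near 0. The coefficient values are hypothetical.

```python
from math import exp


def predicted_probability(a, b, x):
    """P(Y = 1) under the standard logistic model."""
    return 1 / (1 + exp(-(a + b * x)))


def odds(p):
    """Odds of Y = 1 for a probability p."""
    return p / (1 - p)


a, b = -3.0, 1.5  # hypothetical coefficients
p_at_2 = predicted_probability(a, b, 2.0)  # a + b*x = 0 here
p_at_3 = predicted_probability(a, b, 3.0)
odds_ratio = odds(p_at_3) / odds(p_at_2)   # effect of a 1 unit increase in x
print(p_at_2, round(odds_ratio, 3), round(exp(b), 3))
```

The ratio of the odds for a 1 unit increase in x matches e^B, which is why B is usually read on the log-odds scale.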
73
maximum likelihood estimation
logistic regression

estimation method that chooses the coefficient values (a, B) under which the observed 0/1 outcomes are the most probable
74
goodness of fit measure for logistic regression
likelihood ratio (similar to F-test)
Rho-squared or Pseudo R^2 (R^2)
t-tests to determine stat difference from 0
75
regression use
used to relate variables to one another through the use of mathematical functions

allows examination of variability in one variable as a function of the variability in another

good for forecasting