Positive skew
When the mean is larger than the median and mode
E.g., income is typically positively skewed
The tail is to the right
Negative skew
When the mode is larger than the median and the mean
The mean is smaller than the median
Tail is to the left
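A quick numeric check of the positive-skew card, as a minimal sketch. The income figures are made up for illustration; the one large value stands in for a long right tail.

```python
import statistics

# Hypothetical incomes (in thousands); one large value creates a right tail.
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 40, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# In a positively skewed sample the mean is pulled toward the long right tail,
# so it ends up above the median.
positively_skewed = mean_income > median_income
print(mean_income, median_income, positively_skewed)
```

Reversing the pattern (one very small value instead of one very large one) would pull the mean below the median, which is the negative-skew case.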
What are the different types of univariate graphs?
Histograms
Frequency distributions (kernel density plots)
(Normal) quantile comparison plots
Box plots
Histograms
Divide the range of the variable into intervals of equal width, called bins
Count the number of observations within each bin
Display the frequency counts in a bar graph
CONS of histograms
The visual representation of data depends on the arbitrary origin of bins
Shape of histogram depends on arbitrary width of bins
Histograms appear discontinuous, even when they display continuous data
Bins narrow enough to show detail where data are dense may be too narrow to avoid “noise” where data are thinly dispersed
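The dependence on the bin origin can be seen with a small sketch. The helper `bin_counts` and the data are hypothetical, written only for this illustration; empty bins are simply left out of the result.

```python
from collections import Counter

def bin_counts(values, width, origin=0.0):
    """Count observations in equal-width bins [origin + k*width, origin + (k+1)*width)."""
    counts = Counter(int((v - origin) // width) for v in values)
    # Key each bin by its left edge; bins with no observations are omitted.
    return {origin + k * width: counts[k] for k in sorted(counts)}

data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 9.0]  # made-up observations

# Same data, same bin width, different (arbitrary) origin.
bins_from_0 = bin_counts(data, width=2.0, origin=0.0)
bins_from_1 = bin_counts(data, width=2.0, origin=1.0)
print(bins_from_0)
print(bins_from_1)
```

With the origin at 0 the tallest bar is the second bin; with the origin at 1 it is the first. Nothing about the data changed, only where the bins start.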
Kernel Density Plots
Non-parametric way of smoothing histograms
Alternative to histograms by averaging and smoothing them
Continuously moves a window of fixed width across the data, calculating a locally weighted average of the number of observations falling within the window
Choosing the window width is largely a matter of trial and error, although statistical theory offers guidance on what works
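A minimal sketch of the idea, assuming a Gaussian kernel in place of a hard-edged window (a common choice, but an assumption here; the notes do not name a kernel). The function name, data, and the two bandwidths are illustrative only.

```python
import math

def gaussian_kde(data, x, bandwidth):
    """Density estimate at x: the average of Gaussian kernels centered on each observation."""
    total = 0.0
    for xi in data:
        z = (x - xi) / bandwidth
        total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return total / (len(data) * bandwidth)

data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0]        # made-up observations
grid = [i * 0.1 for i in range(-50, 121)]    # evaluation points from -5.0 to 12.0

# Trial and error over the window width: a narrow window gives a spiky
# estimate, a wide window gives a smoother, flatter one.
narrow = [gaussian_kde(data, x, bandwidth=0.2) for x in grid]
wide = [gaussian_kde(data, x, bandwidth=1.0) for x in grid]
print(max(narrow), max(wide))
```

Both curves still integrate to roughly 1 over the grid; the bandwidth changes only how the density is spread out.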
Quantile comparison plots
Helps to compare the distribution with the theoretical distribution
One kind of data:
How closely do our data follow the normal curve?
It doesn’t use arbitrary bins or averages
The continuity of data is preserved!
The more the data points deviate from the comparison line, the more it deviates from the normal curve
Quantile comparison plots allow us to look at the _____ of the distribution
tails
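The points of a normal quantile comparison plot can be computed without any binning, roughly as follows. This is a sketch: the plotting position `(i - 0.5) / n` is one common convention among several, and the data are invented.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Pair each ordered observation with the normal quantile at the same cumulative proportion."""
    xs = sorted(data)
    n = len(xs)
    # Reference normal curve with the sample's own mean and SD.
    reference = NormalDist(mean(xs), stdev(xs))
    # (i - 0.5) / n is an assumed plotting position; other conventions exist.
    return [(reference.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

data = [2.1, 2.5, 2.9, 3.0, 3.2, 3.4, 3.9, 4.4, 5.8, 9.7]  # made-up, right-skewed
points = normal_qq_points(data)

# Each pair is (theoretical quantile, observed value). If the data were
# normal the pairs would lie near the line y = x.
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

Here the largest observation sits well above its theoretical quantile, which is the kind of departure in the tails the plot is meant to expose.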
Boxplots
Shows summary information on the center, spread, and skewness
Show individual observations in tails and potential outliers
Useful for comparing several distributions, or for checking whether a transformation makes data more symmetrical
We use box plots when we look at multiple variables
In boxplots, when the median is ____ in the middle of the box, the distribution is most likely ____
not, skewed
What are the main components of a boxplot?
Minimum
Q1
Median
Q3
Maximum
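These five components can be computed directly, as in the sketch below. Quartile conventions differ between textbooks and software; the `"inclusive"` method of `statistics.quantiles` is used here as one reasonable choice, and the data are invented.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 splits the data into quartiles; the method is an assumed convention.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

data = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # made-up observations
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # the length of the box
print(minimum, q1, median, q3, maximum, iqr)
```

The maximum lies far above Q3 compared with how far the minimum lies below Q1, so a boxplot of these values would show a long upper whisker or a flagged outlier.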
Skewness
In which direction does the longer tail of the distribution extend?
Center
Where are the mean, median, and mode?
Spread
Where is most of the data contained, and what is the range of data
The difference between Q3 and Q1 (IQR = Q3 - Q1)
The distance between the minimum and maximum data points (range)
Scatterplots
Display the relationship between two quantitative variables
Does not work well for discrete variables, or for variables that take only a few distinct values relative to the sample size
In a scatterplot, watch out for skewed data. Data that are skewed need to be _____ !
transformed
Multivariate graphs are helpful to examine ______ for all pairs of variables
bivariate scatterplots
Non-normality
When data is not normal
Heteroskedasticity
Variance is not constant
Non-linearity
The relationship is not linear
Linear transformation
Goal is to keep the spacing the same
E.g., inches → cm / Fahrenheit → Celsius / American dollars → Canadian dollars
Values that are _____ before transformation will still be evenly spaced afterwards
evenly spaced
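A small check of this property using the Fahrenheit to Celsius example, as a sketch with made-up temperatures:

```python
def fahrenheit_to_celsius(f):
    """Linear transformation: subtract a constant, then multiply by a constant."""
    return (f - 32) * 5 / 9

temps_f = [32, 50, 68, 86]  # evenly spaced, 18 degrees F apart
temps_c = [fahrenheit_to_celsius(f) for f in temps_f]

# Gaps between neighbouring values after the transformation.
gaps_c = [b - a for a, b in zip(temps_c, temps_c[1:])]
print(temps_c, gaps_c)
```

The size of the gap changes (18 becomes 10) but every gap changes by the same factor, so the shape of the distribution is untouched.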
Nonlinear transformation
Change spacing and shape, but keep data in order
E.g., log, powers, roots
Helpful for fixing regression issues
Monotonic increasing function
It maintains the order of data
If a > b then f(a) > f(b)
Monotonic decreasing function
Reverses the order of data
If a > b then f(a) < f(b)
Descending powers (log, roots, reciprocals) ____ large values and _____ small ones
shrink, spread
Descending powers can fix _____
Positive skew
Ascending powers (x²) have the opposite effect, they fix ____
Negative skew
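To see a descending power pull in a right tail, the sketch below compares a moment-based skewness measure before and after a log transformation. The `skewness` helper and the data are illustrative assumptions, not part of the original notes.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed (population) SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

raw = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120]   # made-up, strongly right-skewed
logged = [math.log10(v) for v in raw]       # descending power: log

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(round(skew_raw, 2), round(skew_logged, 2))
```

The log shrinks the gap between 40 and 120 far more than the gap between 1 and 2, which is what "shrink large values and spread small ones" means in practice.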
We must only have ______ in a Box-Cox family of transformation
positive values
How to make positive values in Box-Cox
Add a constant (start)
e.g., transform X + 3 instead of X, so the power is applied to (X + 3)
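A minimal sketch of a Box-Cox transformation with a start, using the standard form (x^λ - 1)/λ for λ ≠ 0 and log(x) for λ = 0. The function name, the data, and the start value of 3 are hypothetical.

```python
import math

def box_cox(x, lam, start=0.0):
    """Box-Cox power transformation applied to x + start, which must be positive."""
    shifted = x + start
    if shifted <= 0:
        raise ValueError("x + start must be positive")
    if lam == 0:
        return math.log(shifted)
    return (shifted ** lam - 1) / lam

data = [-2.0, 0.0, 1.0, 6.0]   # made-up values, including non-positive ones
start = 3.0                    # assumed constant, chosen so every x + start > 0

# lam = 0.5 is a square-root-type transformation (a descending power).
transformed = [box_cox(x, lam=0.5, start=start) for x in data]
print(transformed)
```

The constant is added before the power is applied, and the order of the data is preserved afterwards.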
Power transformations are effective ONLY when ratio of _______ is sufficiently large
highest to lowest data values
Positive skew (right tail too long) use ____ transformations to pull the tail in
log or root
Negative skew (left tail too long) use ____ to stretch the right side of the distribution
powers (x²)
Transformation can help _____ and make data ______
stabilize variance, easier to analyze
Mosteller and Tukey’s bulging rule
It gives guidance on which transformations to try
Nominal variables
Simple categories with no inherent order (e.g., gender)
Ordinal variables
Categories can be ranked, but the distances between them cannot be quantified (e.g., education level)
Interval variable
Categories can be ranked and the distances between them quantified (e.g., temperature)
Dichotomous variable
Works with only two categories. It can be nominal or ordinal.
Interval variables use measures of dispersion:
Range
Variance
Standard Deviation
Sample
Subset of the population
Population parameters
Information we want to know
Sample distribution
The distribution within a sample
Descriptive statistics
Describe the traits of a population/sample
Inferential statistics
Make predictions about a population derived from our sample
Theoretical distribution of sample means
Take all possible random samples
Calculate the mean for each sample
Plot the distribution of those means
Sample mean should congregate around the ______
population mean
The ____ the sample size, the _____ the sample mean tends to be to the population mean
larger, closer
Central Limit Theorem
If all possible random samples of size n are drawn from a population with a mean and a SD then as n gets larger the distribution of sample means becomes approximately normal, with mean equal to the population mean and a SD equal to the standard error (SE).
CLT tells us three things:
Shape
Central tendency
Variability
Mean of the distribution of sample means is ____ to the true population mean
equal
If sample is big enough the SE will be very _____ and means cluster around the true pop mean
small
When we ____ sample size (n), we ____ standard error
increase, decrease
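The relationship between n and the standard error can be checked by simulation. This is a sketch under stated assumptions: it draws a finite number of random samples (with replacement) rather than all possible samples, and the skewed population is generated artificially.

```python
import random
import statistics

def sample_means(population, n, draws, rng):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Artificial, positively skewed population (exponential), not real data.
population = [rng.expovariate(1.0) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

results = {}
for n in (5, 50):
    means = sample_means(population, n, draws=2000, rng=rng)
    # (mean of sample means, SD of sample means, theoretical SE = SD / sqrt(n))
    results[n] = (statistics.fmean(means), statistics.stdev(means), pop_sd / n ** 0.5)

for n, (center, spread, theoretical_se) in results.items():
    print(n, round(center, 3), round(spread, 3), round(theoretical_se, 3))
```

The sample means center on the population mean for both sample sizes, while their spread tracks SD / sqrt(n) and shrinks as n grows.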
Standard error
The standard deviation of the distribution of sample means: roughly, how far a sample mean typically falls from the population mean.
Sample mean is an _____ point estimate of the real pop mean
unbiased
Standard deviation
How far the scores in a distribution typically deviate from the mean of the distribution. It describes the spread of the scores.
Null hypothesis
No association between two variables or conditions
Statistical independence
H0
Alternative hypothesis
Research hypothesis
There IS an association between two variables or conditions
Statistical dependence
We can only ____ or ____ the null hypothesis
reject, fail to reject
We ____ prove the alternative hypothesis to be true
cannot
Falsifiability
A single study can never prove something to be true. We can only fail to prove that it is false.
Type I error
Reject null hypothesis when it is actually true
E.g., conclude the treatment is effective when it actually has no impact
False positive
Probability of making Type I error is alpha
Type II error
Fail to reject the null hypothesis when it is actually false
E.g., conclude the treatment is NOT effective when it actually does have an impact
False negative
Probability of making Type II error is beta
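Both error rates can be estimated by simulation. The sketch below assumes a two-tailed z-test with a known population SD; the sample size of 25, the true effect of 0.3, and the helper name are all made up for illustration.

```python
import random
from statistics import NormalDist, fmean

def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-tailed z-test of H0: mu = mu0, assuming the population SD sigma is known."""
    se = sigma / len(sample) ** 0.5
    z = (fmean(sample) - mu0) / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical

rng = random.Random(1)  # fixed seed so the run is repeatable
trials = 4000

# Null is TRUE (true mean is 0): every rejection is a Type I error.
false_positives = sum(
    z_test_rejects([rng.gauss(0.0, 1.0) for _ in range(25)], mu0=0.0, sigma=1.0)
    for _ in range(trials)
)
type_i_rate = false_positives / trials

# Null is FALSE (true mean is 0.3): every failure to reject is a Type II error.
misses = sum(
    not z_test_rejects([rng.gauss(0.3, 1.0) for _ in range(25)], mu0=0.0, sigma=1.0)
    for _ in range(trials)
)
type_ii_rate = misses / trials
print(type_i_rate, type_ii_rate)
```

The Type I rate lands near the chosen alpha of 0.05, while the Type II rate depends on the sample size and on how large the true effect is.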
We focus on ____ type I errors
decreasing
If the sample mean falls in the critical region, then we must _____ the null
reject
T-test
Test of a null hypothesis about a population mean, used when the population SD is unknown and is estimated by the sample standard deviation. The t-distribution has heavier tails than the normal distribution.
We use the t-distribution when the population standard deviation is _____
unknown
We use the Z distribution when the population standard deviation (and so the standard error) is _____
known
When sample size (n) is greater than ____ the t-distribution is roughly the same as z-distribution (normal distribution)
120
One-tail test
Tests for a difference in one specified direction (e.g., women’s GPA is higher than men’s).
Two-tailed test
Is the population mean equal to or not equal to a predetermined value? The difference could fall in either direction (e.g., women’s GPA differs from men’s).
Steps for hypothesis testing:
State null and alternative hypotheses
Set alpha level
Find critical regions
Collect data and compute the test statistic
Based on the test statistic, decide whether to reject or fail to reject the null
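The steps above can be sketched for a one-sample, two-tailed t-test. The scores and the hypothesized mean of 100 are made up; the critical value 2.262 is the usual t-table entry for df = 9 at alpha = 0.05 (two-tailed).

```python
import statistics

# Step 1: H0: mu = 100, H1: mu != 100 (hypothetical value).
mu0 = 100

# Step 2: alpha = 0.05, two-tailed.
# Step 3: critical region for df = 9 is |t| > 2.262 (from a t-table).
t_critical = 2.262

# Step 4: collect data (made-up scores) and compute the test statistic.
scores = [104, 98, 110, 107, 101, 99, 112, 105, 108, 106]
n = len(scores)
df = n - 1
se = statistics.stdev(scores) / n ** 0.5   # estimated standard error
t_stat = (statistics.mean(scores) - mu0) / se

# Step 5: reject H0 only if the statistic falls in the critical region.
reject_null = abs(t_stat) > t_critical
print(df, round(t_stat, 3), reject_null)
```

If the statistic had fallen inside the critical values, the conclusion would be "fail to reject", not "accept".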
Alpha
Probability that hypothesis test will result in Type I error
The most common alpha level:
95% confidence level, alpha = 0.05
Degrees of freedom
df = n - 1
Statistical significance is _______ practical importance
not the same as
P-value
Probability of obtaining results at least as extreme as those observed if the null hypothesis were true. It is calculated from the test statistic for your data. A small p-value (p < 0.05) indicates that the observed results would be unlikely if chance alone were at work.
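A minimal sketch of how a two-tailed p-value comes out of a test statistic, assuming a z-test with a known population SD. The function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def two_tailed_p_value(sample_mean, mu0, sigma, n):
    """P(|Z| >= |z observed|) under H0: mu = mu0, with population SD sigma assumed known."""
    se = sigma / n ** 0.5
    z = (sample_mean - mu0) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up example: sample mean 103 from n = 100, H0 mean 100, SD 15.
p = two_tailed_p_value(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
significant = p < 0.05
print(round(p, 4), significant)
```

Here the sample mean is two standard errors from the hypothesized value, which puts the p-value just under 0.05.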