Contingency Table
A table that displays the frequency distribution of one or more categorical variables
Bar Plot
A graph that represents the count (or proportion) in each category of a categorical variable with rectangular bars
Making a bar plot in R
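ggplot(Dataset) + aes(x = Variable1, fill = Variable2) + geom_bar() (geom_bar() counts the rows itself, so it takes no y; use geom_col() when y holds precomputed bar heights)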
Making a 2 way contingency table in R
Dataset %>% count(Variable1, Variable2) %>% mutate(prop = n / sum(n)) %>% select(-n) %>% pivot_wider(names_from = Variable1, values_from = prop)
2 way contingency table
Displays frequencies for combinations of two categorical variables. Outcomes of one variable are listed in the rows and outcomes of the other in the columns
Overall 2 way contingency table
Each cell count is divided by the overall total, so all cells together sum to 1
Conditional, row or column, 2 way contingency table
Each cell count is divided by its row total (row proportions) or its column total (column proportions), so each row or each column sums to 1
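The difference between overall and conditional proportions is easiest to see on a small table. The sketch below uses base R's table() and prop.table() instead of the dplyr pipeline, purely to stay self-contained; the survey data frame and its values are made up for illustration.

```r
# Made-up data: two categorical variables (illustrative only)
survey <- data.frame(
  group  = c("A", "A", "A", "B", "B", "B", "B", "B"),
  answer = c("yes", "no", "yes", "yes", "no", "no", "no", "yes")
)

# Two-way table of counts: group in rows, answer in columns
counts <- table(survey$group, survey$answer)

# Overall proportions: every cell divided by the grand total
overall <- prop.table(counts)

# Row proportions: every cell divided by its row total
by_row <- prop.table(counts, margin = 1)

# Column proportions: every cell divided by its column total
by_col <- prop.table(counts, margin = 2)
```

In this sketch the overall table sums to 1, each row of by_row sums to 1, and each column of by_col sums to 1.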
Stacked bar plot
Showing multiple (raw data) values for each category by stacking bars atop each other. Total bar height represents the sum
Dodged bar plot
Places the bars for each subgroup side by side within a category so they can be compared directly
Standardized stacked bar plot
Like a regular stacked bar plot but with proportions
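In ggplot2 the three bar plot styles differ only in the position argument of geom_bar(). This is a minimal sketch assuming ggplot2 is installed; the survey data frame and the object names are hypothetical.

```r
library(ggplot2)

# Made-up data: two categorical variables (illustrative only)
survey <- data.frame(
  group  = c("A", "A", "A", "B", "B", "B", "B", "B"),
  answer = c("yes", "no", "yes", "yes", "no", "no", "no", "yes")
)

# Shared setup: one bar per group, coloured by answer
base_plot <- ggplot(survey) + aes(x = group, fill = answer)

# Stacked: counts piled on top of each other (the default position)
p_stacked <- base_plot + geom_bar(position = "stack")

# Dodged: one bar per answer, side by side within each group
p_dodged <- base_plot + geom_bar(position = "dodge")

# Standardized stacked: every bar rescaled to height 1 (proportions)
p_standardized <- base_plot + geom_bar(position = "fill") + labs(y = "proportion")
```

Printing any of the three objects, for example print(p_dodged), draws the plot.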
Unimodal distribution
Distribution with one clear peak
Bimodal distribution
Has two clear peaks
Multimodal distribution
Having two or more peaks
Uniform distribution
Every outcome is equally likely to occur
Symmetric distribution
Distribution where the left and right sides mirror each other
Left skewed distribution
Having a tail on the left
Right skewed distribution
Having a tail on the right
Skewness
A measure of asymmetry of a distribution
Histogram
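A plot of one numerical variable that groups the values into bins and draws a bar for the count in each bin; unlike a bar plot, adjacent bars have no gaps between them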
Density plot
Like a histogram except it uses a smooth curve to represent data distribution
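A histogram and a density plot of the same variable can be drawn with the same ggplot2 pattern used for bar plots. This is a sketch only: the scores data frame is simulated, and the choice of 20 bins is arbitrary.

```r
library(ggplot2)

# Simulated numerical variable (illustrative only)
set.seed(1)
scores <- data.frame(value = rnorm(200, mean = 70, sd = 10))

# Histogram: counts of values falling into each of 20 bins
p_hist <- ggplot(scores) + aes(x = value) + geom_histogram(bins = 20)

# Density plot: a smooth curve estimating the same distribution
p_density <- ggplot(scores) + aes(x = value) + geom_density()
```

Changing bins makes the histogram coarser or finer, which can hide or exaggerate peaks.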
Box plots are best for…?
For comparing data across different groups
Mean
The sum of all observations in a dataset divided by the number of observations
Median
The middle value of the sorted observations (or the average of the two middle values when the number of observations is even)
Mode
The most common value in a dataset
Standard deviation
Roughly the typical distance of the values from the mean; the square root of the variance
Variance
The expected squared deviation of a random variable from its mean
Quartile 1
The 25th percentile: 25% of the data lie at or below it
Quartile 3
The 75th percentile: 75% of the data lie at or below it
IQR(interquartile range)
Q3 - Q1, the width of the interval that holds the middle 50% of the data
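The summary statistics above all have base R functions, except the mode. The sketch below computes each on a small made-up sample; the variable names are hypothetical, and quantile() uses R's default method, so other textbooks or tools may report slightly different quartiles.

```r
# Small made-up sample (illustrative values only)
x <- c(2, 4, 4, 5, 7, 9, 11)

x_mean   <- mean(x)                    # sum of values / number of values
x_median <- median(x)                  # middle value of the sorted data
x_var    <- var(x)                     # sample variance (divides by n - 1)
x_sd     <- sd(x)                      # square root of the sample variance
x_q1     <- unname(quantile(x, 0.25))  # first quartile (25th percentile)
x_q3     <- unname(quantile(x, 0.75))  # third quartile (75th percentile)
x_iqr    <- IQR(x)                     # Q3 - Q1

# Base R has no built-in function for the statistical mode, so tabulate
# the values and take the most frequent one.
x_mode <- as.numeric(names(which.max(table(x))))
```

For this sample the mean is 6, the median is 5, the mode is 4, the variance is 10, Q1 is 4, Q3 is 8, and the IQR is 4.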
What data shows
Raw values (examples: 1, 2, 3, 4, 5)
What numerical data shows
Summary statistics (examples: mean = 5.5, median = 4.5)
What graphical data shows
Shape, spread, and outliers (examples: histogram, box plot, density plot)
What verbal data shows
Conceptual summary (examples: “right skewed with one high outlier, median around 5”)
Mean = median
Roughly symmetric graph
Mean > median
Graph shows a right skew
Mean < median
Graph shows a left skew
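The link between skew and the order of mean and median can be checked on tiny samples. The three vectors below are made up so that one is symmetric, one has a long right tail, and one has a long left tail.

```r
# Made-up samples (illustrative only)
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 50)        # one very large value
left_skewed  <- c(1, 47, 48, 48, 49, 49, 49, 50)  # one very small value

# The mean is pulled toward the tail; the median barely moves
c(mean = mean(symmetric),    median = median(symmetric))     # equal
c(mean = mean(right_skewed), median = median(right_skewed))  # mean > median
c(mean = mean(left_skewed),  median = median(left_skewed))   # mean < median
```

This is a rule of thumb: it holds for typical one-peaked data like these samples, but it is not guaranteed for every distribution.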
Large SD or IQR
Graph shows a wide spread
Graph shows outliers
Isolated points in boxplots or gaps in histograms
IQR = 0 means…?
Q1 equals Q3, so the middle 50% of the data are all the same value
Standard deviation equals..?
The square root of the variance
1.5 IQR method
Method for identifying outliers
1.5 x IQR rule: a value is an outlier if it is…?
Below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
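Under the 1.5 x IQR rule, a value counts as an outlier when it falls below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR. A minimal sketch in base R follows; the sample and the names lower_fence and upper_fence are made up for illustration, and the quartiles use quantile()'s default method.

```r
# Small made-up sample with one suspiciously large value
x <- c(2, 4, 4, 5, 7, 9, 11, 40)

q1  <- unname(quantile(x, 0.25))  # first quartile
q3  <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                    # interquartile range

# Cutoffs ("fences") for the 1.5 x IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
```

Here Q1 is 4 and Q3 is 9.5, so the fences are -4.25 and 17.75, and only the value 40 is flagged.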
Use Mean and SD for…?
For symmetric data
Use Median and IQR for..?
Skewed data
When to use mean and SD
Symmetric data with no outliers; interval data with equal spacing. Examples: heights of students in a class, daily high temperatures in June, normally distributed test scores
When to use Median and IQR
Skewed (asymmetric) data, data with outliers, small samples, ordinal ratings. Examples: satisfaction from 1-5 stars, salaries of employees (the CEO makes $10m), number of emergency room visits per patient, response times for a computer program (some fast, some very slow)
Use a histogram when:
You want to see the shape of distribution, you need frequency of data points for specific intervals, you are analyzing a single numerical dataset with many data points, you need to determine if a process is meeting customer requirements based on its distribution
Use a boxplot when:
You are comparing distributions across multiple groups, you want a quick summary of the key statistics (median, quartiles, and range), you have a large amount of data and need a way to visualize it without showing every data point, you want to identify potential outliers.