1/77
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced |
|---|
No study sessions yet.
What is Exploratory Data Analysis (EDA)?
A set of statistical and visualization techniques used to understand data before preprocessing and modeling
What are 5 purposes of EDA?
Understand data and summarize key properties, Discover noisy data and outliers, Comprehend data distribution, Decide which cleaning techniques to apply, Guide preprocessing decisions
How is EDA cross-classified?
By method type (non-graphical or graphical) and by scope (univariate or multivariate, usually bivariate)
What is the difference between Population and Sample?
Population: the entire group we want conclusions about. Sample: subset of population used when size is too large
What should a sample be?
An unbiased subset that best represents the entire population
What is the Mean (average)?
An algebraic measure of central tendency calculated by summing all values and dividing by count
What is the difference between sample mean and population mean formulas?
Sample mean uses n (sample size), Population mean uses N (population size)
What is Weighted arithmetic mean?
A mean where different values have different weights or importance in the calculation
What is Trimmed mean?
A mean calculated by chopping extreme values (e.g., Olympics gymnastics score computation)
What is the Median?
The middle value in a data set when values are ordered
How do you calculate median for odd number of data points?
The median is the middle value after sorting
How do you calculate median for even number of data points?
The median is the average of the two middle values after sorting
Why use Median instead of Mean?
Median is resistant to extreme outliers and useful for skewed distributions
What is the formula for median in grouped data?
Median = L + ((n/2 - B) / G) × W
What does L represent in grouped median formula?
Lower class boundary of the median bin
What does B represent in grouped median formula?
Cumulative frequency of the bins before the median bin
What does G represent in grouped median formula?
Frequency of the median bin
What does W represent in grouped median formula?
Median bin width
What is the Mode?
The value that occurs most frequently in the data
What is the empirical formula relating mean, median, and mode?
Mean - Mode ≈ 3(Mean - Median)
What is Unimodal distribution?
Distribution with one mode (one peak)
What is Bimodal distribution?
Distribution with two modes (two peaks)
What is Trimodal distribution?
Distribution with three modes (three peaks)
In symmetric (normal) distribution, how are mean, median, and mode related?
They are all equal and located at the center
In positively skewed distribution, what is the order of mean, median, and mode?
Mode < Median < Mean (mean is pulled toward the tail)
In negatively skewed distribution, what is the order of mean, median, and mode?
Mean < Median < Mode (mean is pulled toward the tail)
What does the center (μ) of a normal distribution represent?
Central tendency (mean, median, mode are all equal)
What does sigma (σ) in normal distribution represent?
Data dispersion or spread
After z-score normalization, what is the mean?
μ = 0
After z-score normalization, what is the standard deviation?
σ = 1
What is Variance?
A measure of dispersion around the mean
What is the sample variance formula?
s² = Σ(xi - x̄)² / (n-1)
What is the population variance formula?
σ² = Σ(xi - μ)² / N
Why do we divide by n-1 for sample variance?
To account for bias in the estimation
What is Standard Deviation?
The square root of variance, measuring dispersion in the same units as the value
What does Covariance measure?
The relationship between two numerical variables, showing how they change together
What does positive covariance indicate?
Both variables move together (increase together or decrease together)
What does negative covariance indicate?
Variables move in opposite directions (one increases while other decreases)
What does zero covariance indicate?
No clear pattern in variable movements
Why is covariance sensitive to scale?
Because it's calculated using the actual values without standardization
What is a Covariance Matrix?
A matrix summarizing variance and covariance information for variables (variance on diagonal, covariance in off-diagonal entries)
What is the Correlation formula?
ρ12 = cov(X1, X2) / (σ1 × σ2), where σ represents standard deviation
What does correlation measure?
Standard covariance obtained by normalizing covariance with standard deviation of each variable
What does ρ12 > 0 indicate?
A and B are positively correlated (X1's values increase as X2's increase)
What does ρ12 < 0 indicate?
A and B are negatively correlated (X1's values increase as X2's decrease)
What does ρ12 = 0 indicate?
Variables are independent (no linear relationship)
What is a Correlation Matrix (Correlation Heatmap)?
A matrix showing correlations between each pair of variables in a dataset
What is the range of correlation coefficient values?
[-1,