What is statistics?
Science of collecting, organizing, presenting and interpreting data.
Main steps of statistics
Explore → Summarize → Model → Estimate → Test.
Descriptive statistics
Describes the given data using tables, charts and summary statistics.
Inductive / inferential statistics
Uses sample data to draw conclusions about an unknown population.
Population
The whole group we are interested in.
Sample
A subset of the population.
Observational unit
One object/case/subject whose characteristics are measured.
Attribute
Characteristic measured on observational units, e.g. age, grade, color.
Attribute value
Concrete value of an attribute, e.g. age = 20, color = red.
Parameter
Information about the population, e.g. true mean μ.
Statistic
Information calculated from the sample, e.g. sample mean x̄.
Raw data list
Data in original, ungrouped form, listed value by value.
Representative sample
Sample that reflects the population well.
Simple random sample
Every object has an equal chance of being selected.
Random error
Random difference between sample and population.
Systematic bias
Non-random sampling error; hard to fix statistically.
Nominal scale
Categories without natural order. Example: blood type, cuisine, ticket type.
Ordinal scale
Categories with natural order, but distances are not objectively interpretable. Example: pain rating, satisfaction 1–5.
Metric discrete
Countable numerical values. Example: number of children, website visits.
Metric continuous
Measurable values on a continuum. Example: time, income, volume, length.
Quantitative data
Numerical data where distances are meaningful.
Categorical data
Category labels, e.g. color, gender, blood type.
Metric scale includes ordinal property?
Yes, metric values can be ordered.
Nominal scale includes ordinal property?
No, nominal categories have no natural order.
Frequency distribution
Shows how often each value/class occurs.
Frequency table
Tabular summary of counts and percentages.
Cross tabulation
Table describing relationship between two categorical variables.
Class limits
Boundaries of intervals for grouped data.
Why use classes?
Continuous data often has too many different values, so grouping helps.
Measures of location
Mean, median, mode, quantiles.
Measures of variability
Range, IQR, variance, standard deviation, CV.
Mean is sensitive to outliers?
Yes. Extreme values can pull the mean strongly.
Median is robust?
Yes. Median is less affected by outliers.
Range weakness
Uses only min and max, very sensitive to outliers.
IQR advantage
More robust because it focuses on middle 50%.
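The robustness cards above can be checked on a small made-up data set. This is only a sketch: the values are hypothetical, and the quartiles use the `inclusive` method of Python's `statistics.quantiles`, which is one of several common quartile conventions.

```python
import statistics

# Hypothetical data: five typical values, then the same values plus one outlier.
clean = [10, 12, 13, 14, 16]
with_outlier = clean + [100]

def iqr(values):
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(label,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "range:", max(data) - min(data),
          "IQR:", iqr(data))
```

The mean and the range jump once the outlier is added, while the median and the IQR move only slightly.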
Variance meaning
Average squared distance from the mean.
Standard deviation meaning
Typical distance of observations from the mean.
Population variance vs sample variance
Population: divide by N. Estimator from sample: divide by n−1.
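A minimal sketch of the two divisors, using hypothetical values. The hand-computed results are compared with `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n−1) from the Python standard library.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
mean = statistics.mean(sample)
squared_dev = sum((x - mean) ** 2 for x in sample)

pop_var = squared_dev / len(sample)            # population formula: divide by N
sample_var = squared_dev / (len(sample) - 1)   # sample estimator: divide by n-1

print(pop_var, statistics.pvariance(sample))
print(sample_var, statistics.variance(sample))
```

The sample version is always slightly larger, and the gap shrinks as n grows.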
Boxplot shows
Minimum, Q1, median, Q3, maximum, and sometimes outliers.
Bivariate data
Data with two paired variables (x_i, y_i).
Scatterplot
Visualizes relationship between two variables.
Covariance sign
Positive = variables move together; negative = one increases while other decreases.
Covariance weakness
Not standardized, depends on units.
Pearson correlation
Standardized measure of linear relationship.
r=1
Perfect positive linear relationship.
r=−1
Perfect negative linear relationship.
r=0
No linear relationship, but nonlinear relationship may still exist.
Pearson affected by outliers?
Yes. Use carefully if scatterplot has outliers.
Spearman correlation
Correlation based on ranks; useful for ordinal data or outliers.
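To see the difference between the two coefficients, here is a small sketch with made-up data that is monotone but not linear. The helper names `pearson`, `ranks` and `spearman` are illustrative only, and the ranking helper assumes there are no ties.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    # Rank 1 = smallest value; this simple version assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    # Spearman is Pearson applied to the ranks.
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]  # monotone, but clearly not a straight line

print("Pearson: ", pearson(xs, ys))
print("Spearman:", spearman(xs, ys))
```

Spearman comes out as 1 (up to rounding) because the ordering is perfectly preserved, while Pearson stays below 1 because the relationship is not linear.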
Regression goal
Predict Y from X using a line.
Slope interpretation
Expected change in Y if X increases by 1.
Intercept interpretation
Predicted Y when X=0.
Extrapolation danger
Prediction far outside observed X-range can be unreliable.
R^2 meaning
Share of variation in Y explained by the regression model.
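The regression cards can be tied together with a short least-squares sketch. The data points are hypothetical, and `fit_line` and `r_squared` are illustrative helper names, not functions from a library.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot  # share of variation explained by the line

xs = [1, 2, 3, 4, 5]                 # hypothetical X values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical Y values

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("R^2:", r_squared(xs, ys, slope, intercept))
```

Here the slope is the expected change in Y per unit of X, and the intercept is the prediction at X=0, which lies outside the observed range 1 to 5 and should be read with the extrapolation warning in mind.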
Random experiment
Experiment with uncertain outcome.
Sample space Ω
Set of all possible outcomes.
Event
Subset of the sample space.
Atomic event
Event with one outcome only.
Impossible event
Event that cannot happen, probability 0.
Certain event
Event that always happens, probability 1.
Disjoint events
Events that cannot happen together.
Independent events
Occurrence of one event does not change probability of the other.
Disjoint vs independent
Disjoint events with positive probability are always dependent: if one happens, the other cannot.
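A fair six-sided die makes the distinction concrete. In this sketch the events `even`, `odd` and `low` are made up for illustration, and independence is checked through the product rule P(A∩B) = P(A)·P(B).

```python
from fractions import Fraction

omega = set(range(1, 7))  # sample space of a fair six-sided die

def prob(event):
    # Equally likely outcomes: favourable / possible.
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
odd = {1, 3, 5}
low = {1, 2}  # hypothetical event "at most 2"

# Disjoint pair: the intersection is empty, so the product rule fails.
print(prob(even & odd), prob(even) * prob(odd))   # 0 vs 1/4

# Independent pair: the intersection probability equals the product.
print(prob(even & low), prob(even) * prob(low))   # 1/6 vs 1/6
```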
Theoretical probability
Based on model/formula.
Empirical probability
Based on observed data or simulations.
Law of large numbers
With many repetitions, empirical average/probability approaches theoretical value.
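A quick simulation sketch of this idea, assuming a fair coin with theoretical probability 0.5. The function name `empirical_probability` and the seed are arbitrary choices for illustration.

```python
import random

def empirical_probability(n_flips, rng):
    # Share of simulated fair-coin flips that come up heads.
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    print(n, empirical_probability(n, rng))
```

With few flips the empirical share can be far from 0.5; with many flips it typically settles close to it.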
Conditional probability
Probability of A, given that B happened.
Bayes idea
Updates probability after receiving new evidence.
Base-rate problem
Even with an accurate test, the posterior probability after a positive result can be low if the event is very rare.
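The base-rate effect can be worked through with Bayes' rule. The numbers below (prevalence 0.1%, sensitivity and specificity 99%) are hypothetical, and `posterior_positive` is an illustrative helper name.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    # P(condition | positive test) via Bayes' rule.
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

p = posterior_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(condition | positive) = {p:.3f}")
```

Under these assumptions only about 9% of positive results are true positives, because the few true cases are outnumbered by false positives from the much larger group without the condition.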
Random variable
Function that assigns a number to each outcome of a random experiment.
Discrete random variable
Countable possible values.
Continuous random variable
Uncountable possible values, e.g. any value in an interval.
PMF for discrete variables
Gives probabilities P(X=x).
Density for continuous variables
Area under curve gives probability.
Why P(X=c)=0 for continuous X?
A single point has zero area.
CDF
F(x)=P(X≤x).
Expected value
Long-run average/theoretical mean.
Variance
Theoretical spread around expected value.
Quantile
Value below which a certain probability lies.
Bernoulli distribution
One trial, two outcomes: success/failure.
Binomial distribution
Number of successes in n independent Bernoulli trials.
Hypergeometric distribution
Sampling without replacement.
Binomial vs hypergeometric
Binomial = with replacement/independent; hypergeometric = without replacement/dependent.
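A sketch comparing the two models on the same hypothetical urn: 20 balls, 8 of them red, 5 draws. The function and parameter names are illustrative; only `math.comb` comes from the standard library.

```python
from math import comb

def binomial_pmf(k, n, p):
    # k successes in n independent draws (with replacement).
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def hypergeometric_pmf(k, N, K, n):
    # k successes in n draws without replacement from N items, K of them successes.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 20, 8, 5   # hypothetical urn: 20 balls, 8 red, draw 5
p = K / N            # success probability per draw when replacing

for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4), round(hypergeometric_pmf(k, N, K, n), 4))
```

The two columns differ noticeably here because 5 draws are a large share of 20 balls; for a population much larger than the sample they would be close.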
Poisson distribution
Counts rare independent events in fixed time/area.
Normal distribution
Symmetric distribution; used for measurement errors and many natural processes.
Standard normal
Normal distribution with mean 0 and variance 1.
z-transformation purpose
Converts any normal variable to standard normal.
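A minimal sketch of the transformation z = (x − μ)/σ, using `statistics.NormalDist` from the Python standard library. The values μ = 100, σ = 15 and x = 130 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical mean and standard deviation
x = 130

z = (x - mu) / sigma                      # standardized value
p_direct = NormalDist(mu, sigma).cdf(x)   # P(X <= x) on the original scale
p_standard = NormalDist().cdf(z)          # P(Z <= z) on the standard normal scale

print(z, p_direct, p_standard)
```

Both probabilities agree, which is why a single standard normal table is enough for any normal variable.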
Central limit theorem
Sums/means of many independent variables tend to normal distribution.
Chi-square distribution
Sum of squared independent standard normal variables.
Point estimator
One numerical estimate of an unknown population parameter.
Weakness of point estimator
Very precise, but low reliability: for a continuous parameter, a single value almost never hits the true value exactly.
Confidence interval
Interval of plausible values for unknown parameter.
Confidence level 1−α
Long-run probability that the method captures the true parameter.
Precision of CI
Shorter interval = higher precision.
Confidence vs precision
Higher confidence usually means wider interval.
Larger sample size effect
More precision and/or more confidence.
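The trade-offs in the last few cards can be sketched with a confidence interval for a mean. This assumes a normal model with known standard deviation σ, which is a simplification; `mean_ci` and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def mean_ci(sample_mean, sigma, n, confidence):
    # Two-sided interval for the mean, assuming known sigma (z-interval).
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

def width(interval):
    return interval[1] - interval[0]

ci_95 = mean_ci(50, sigma=10, n=100, confidence=0.95)
ci_99 = mean_ci(50, sigma=10, n=100, confidence=0.99)
ci_95_large = mean_ci(50, sigma=10, n=400, confidence=0.95)

print("95%, n=100:", ci_95, "width", width(ci_95))
print("99%, n=100:", ci_99, "width", width(ci_99))
print("95%, n=400:", ci_95_large, "width", width(ci_95_large))
```

Raising the confidence level widens the interval, while quadrupling the sample size halves its width at the same confidence level.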
Unbiased estimator
Hits the true parameter on average.
Consistent estimator
Converges to the true parameter as sample size increases.
Efficient estimator
Among unbiased estimators, has smallest variance.