Flashcards covering fundamental concepts in data types, exploratory data analysis, population vs. sample, and measures of central tendency and spread based on lecture notes from STSTA 198CNL / Duke University / Fall 2025.
Exploratory data analysis (EDA)
An initial data analysis that summarizes main characteristics, often done through visual means or basic summary statistics.
Nominal data
Named categories with no inherent order or numeric meaning; when there are only two categories, the data are often called binary or dichotomous.
Binary data
A type of nominal data with only two categories.
Dichotomous data
Another term for binary data, referring to data with only two categories.
Ordinal data
Ordered categories whose levels can be ranked relative to one another, but where the size of the difference between levels is not easily measured.
Categorical data
A broad type of data that includes nominal and ordinal data, consisting of categories.
Count data
Data representing counts of occurrences, taking non-negative whole-number values (e.g., number of alcoholic drinks consumed).
Rank data
Data representing a position in a sequence, derived from ordering a set of items by some characteristic.
Continuous data
Measurable quantities where the difference between possible values can be arbitrarily small; values may lie within a bounded range or be unbounded.
Numeric data
A broad type of data that includes count/rank data and continuous data, consisting of numerical values.
Population
The entire group of individuals or items that the research question is interested in.
Sample
A subset of the population from which data is collected for analysis.
Parameters
Attributes of the population of interest, usually written in Greek letters; they cannot be computed directly unless the entire population is perfectly measured.
Statistics
Attributes of a sample, computed as functions of the observed values at hand, usually written in Roman letters.
Sample mean
The arithmetic average of values in a sample, calculated as the sum of all values divided by the sample size.
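As a minimal sketch of the calculation described above, the following uses a small hypothetical sample (the values are invented for illustration and do not come from the lecture notes):

```python
# Hypothetical sample of five observations (values invented for illustration).
sample = [2, 4, 4, 5, 10]

n = len(sample)                # sample size
sample_mean = sum(sample) / n  # sum of all values divided by the sample size

print(sample_mean)  # 5.0
```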
Population mean
The arithmetic average of values in an entire population.
Point estimate
A single value used to estimate an unknown population parameter, such as the sample mean estimating the population mean.
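To illustrate the parameter/statistic distinction, the sketch below assumes a toy population that is fully known (the integers 1 to 1000), which is rarely the case in practice. The names `population`, `mu`, and `x_bar` are illustrative choices, not notation from the notes.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: the integers 1..1000 (assumed fully observed here).
population = list(range(1, 1001))
mu = mean(population)  # parameter: the population mean

# Simple random sample of 100 items, drawn without replacement.
sample = random.sample(population, 100)
x_bar = mean(sample)   # statistic: the sample mean, a point estimate of mu

print(mu, x_bar)
```

The sample mean will generally land near, but not exactly on, the population mean; a different seed gives a different point estimate.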
Sample median
The 50th percentile of a sample: the middle value when observations are ranked numerically, so that 50% of values fall below it.
Percentile
The numeric value below which a specified percentage of values fall.
Robust to extreme values
Describes a statistic (like the median) that is less affected by outliers or extreme values in a dataset compared to others (like the mean).
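A quick way to see robustness is to add one extreme value to a small dataset and compare how far the mean and median move. The data here are hypothetical:

```python
from statistics import mean, median

values = [3, 4, 5, 6, 7]       # hypothetical observations
with_outlier = values + [100]  # same data plus one extreme value

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean shifts substantially; the median moves only slightly.
print(mean_before, median_before)
print(mean_after, median_after)
```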
Sample mode
The most frequent value in a dataset, corresponding to 'peaks' in distributions.
Multimodal distribution
A distribution that has multiple peaks, indicating several frequent values.
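For a small discrete dataset, the modes can be sketched with `statistics.multimode` from the Python standard library (Python 3.8+), which returns every value tied for the highest frequency. The data are hypothetical:

```python
from statistics import multimode

data = [1, 2, 2, 3, 5, 5, 6]  # hypothetical observations with two equally frequent values

modes = multimode(data)  # all values tied for most frequent, in order first seen
print(modes)  # [2, 5]
```

For continuous data, peaks are usually judged from a histogram or density plot rather than from exact repeated values.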
Sample minimum
The smallest observation in a dataset.
Sample maximum
The largest observation in a dataset.
Sample range
The difference between the sample maximum and minimum.
Quantiles
Cutpoints that divide data into equal-sized groups (e.g., tertiles, quartiles, quintiles, percentiles).
Interquartile range (IQR)
The width of the middle 50% of the data; the difference between the third and first quartiles.
Five-number summary
A set of five descriptive statistics for a dataset: the sample minimum, first quartile (Q1), median (Q2), third quartile (Q3), and sample maximum.
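The five-number summary, sample range, and IQR can be sketched together. This example assumes the "inclusive" quartile method of `statistics.quantiles`; other software may use a different quartile convention and return slightly different Q1 and Q3 values. The data are hypothetical.

```python
from statistics import quantiles

data = sorted([7, 1, 12, 3, 20, 4, 9, 6, 15])  # hypothetical observations

# Quartiles split the ordered data into four equal-sized groups.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_number = (data[0], q1, q2, q3, data[-1])  # min, Q1, median, Q3, max
sample_range = data[-1] - data[0]              # maximum minus minimum
iqr = q3 - q1                                  # width of the middle 50%

print(five_number)   # (1, 4.0, 7.0, 12.0, 20)
print(sample_range)  # 19
print(iqr)           # 8.0
```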
Outliers
Observations numerically distant from others in a dataset, which should be noted and handled carefully.
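The notes do not specify a rule for identifying outliers. One widely used convention, assumed here purely for illustration, flags values more than 1.5 times the IQR below Q1 or above Q3:

```python
from statistics import quantiles

data = sorted([7, 1, 12, 3, 60, 4, 9, 6, 15])  # hypothetical observations

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the notes): fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = [x for x in data if x < lower_fence or x > upper_fence]
print(flagged)  # [60]
```

Flagging a value is only a prompt to investigate it; whether to keep, correct, or exclude it depends on context.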
Sample variance
Approximately the average squared deviation from the sample mean: the sum of squared deviations divided by n - 1 rather than n. Used to estimate the population variance.
Population variance
The average squared deviation from the mean for an entire population.
Sample standard deviation (SD)
The square root of the sample variance, providing a measure of spread in the same units as the original dataset.
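The sketch below computes the variance by hand on a small hypothetical dataset, using the standard n - 1 divisor for the sample variance and the n divisor when the same values are treated as an entire population:

```python
import math

data = [2, 4, 4, 5, 10]  # hypothetical observations
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

sample_variance = ss / (n - 1)         # divisor n - 1: estimates the population variance
population_variance = ss / n           # divisor n: if the data were the whole population
sample_sd = math.sqrt(sample_variance) # same units as the original data

print(sample_variance, population_variance, sample_sd)  # 9.0 7.2 3.0
```

The standard library functions `statistics.variance`, `statistics.pvariance`, and `statistics.stdev` give the same quantities.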
Skewed distribution
A distribution that is not symmetric, characterized by a 'tail' on either the right (right-skewed) or left (left-skewed) side.
Right-skewed distribution
A distribution with a tail extending to the right, meaning the majority of data points are concentrated on the lower end.
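As a rough illustration with hypothetical data, a long right tail tends to pull the mean above the median. This is a common tendency rather than a guarantee for every right-skewed dataset.

```python
from statistics import mean, median

# Hypothetical values: most are clustered at the low end, one lies far to the right.
values = [20, 22, 25, 27, 30, 35, 40, 200]

skew_mean = mean(values)
skew_median = median(values)

print(skew_mean, skew_median)  # 49.875 28.5
```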
Chebyshev's inequality
A theorem stating that for any distribution with a finite mean and standard deviation, the proportion of values within k standard deviations of the mean is at least 1 - 1/k^2, for any k > 1.
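The bound can be checked empirically. The sketch below assumes a skewed synthetic dataset drawn from an exponential distribution (an arbitrary choice for illustration) and uses the divide-by-n standard deviation, under which the inequality holds exactly for the data's own empirical distribution.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic, right-skewed data (assumed for illustration only).
data = [random.expovariate(1.0) for _ in range(10_000)]
m = mean(data)
s = pstdev(data)  # standard deviation with divisor n

results = {}
for k in (1.5, 2, 3):
    # Observed proportion of values within k standard deviations of the mean.
    within = sum(1 for x in data if abs(x - m) <= k * s) / len(data)
    bound = 1 - 1 / k**2  # Chebyshev's lower bound
    results[k] = (within, bound)
    print(f"k={k}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed proportions are typically well above the bound, since Chebyshev's inequality is a worst-case guarantee that makes no assumption about the distribution's shape.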