Agresti & Franklin - Chapter 2
Variable: any characteristic observed in a study
Categorical variable: variable whose observations each belong to one of a set of categories. For example: did it rain (yes/no), sex (male/female) and so on.
Quantitative variable: variable whose observations take numerical values that represent different magnitudes.
Discrete variable: quantitative variable whose possible values form a set of separate numbers, such as 0, 1, 2, 3. For example: number of pets, number of houses owned etc.
- ‘The number of’ → discrete.
- Any variable with a finite number of possible values is discrete.
Continuous variable: quantitative variable whose possible values form an interval. For example: height, weight, age etc.
- Any variable with an infinite continuum of possible values is continuous. For example: you can weigh 50 kg, but also 50.032354 kg.
Distribution: the distribution of a variable describes how the observations fall (are distributed) across the range of possible values.
Modal category: category with the highest frequency
Proportion: number of observations in a particular category / total number of observations in all categories
→ Percentage: proportion x 100
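The definitions of proportion, percentage and modal category can be sketched in Python as follows. This is only an illustration: the weather observations are made-up data, and names such as `observations` and `proportions` are hypothetical.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (made-up data).
observations = ["sunny", "rain", "sunny", "cloudy", "sunny", "rain", "sunny", "cloudy"]

counts = Counter(observations)      # frequency of each category
total = sum(counts.values())        # total observations in all categories

# Proportion = category frequency / total; percentage = proportion x 100.
proportions = {category: n / total for category, n in counts.items()}
percentages = {category: 100 * p for category, p in proportions.items()}

# Modal category = category with the highest frequency.
modal_category = counts.most_common(1)[0][0]

print(proportions)
print(percentages)
print(modal_category)
```

The proportions always add up to 1 and the percentages to 100, which is a quick check on a frequency table.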
Categorical:
- Pie chart
- Bar graph
- Pareto chart
Pareto principle: a small subset of categories often contains most of the observations.
Pareto chart: bar graph in which the categories are ordered by frequency, from highest to lowest.
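The ordering behind a Pareto chart can be sketched without any plotting library. The complaint categories below are made-up data, and the cumulative percentage column is an addition for illustration (it shows how quickly a few categories cover most observations).

```python
from collections import Counter

# Hypothetical complaint categories (made-up data).
complaints = ["late"] * 12 + ["damaged"] * 5 + ["wrong item"] * 2 + ["other"] * 1

counts = Counter(complaints)
total = sum(counts.values())

# Pareto ordering: categories sorted from highest to lowest frequency.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Running total of the frequencies, expressed as a percentage of all observations.
cumulative = 0
pareto_table = []
for category, n in pareto_order:
    cumulative += n
    pareto_table.append((category, n, 100 * cumulative / total))

for category, n, cum_pct in pareto_table:
    print(f"{category:<12}{n:>4}{cum_pct:>8.1f}%")
```

In this made-up example the two most frequent categories already account for 85% of the observations, which is the pattern the Pareto principle describes.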
Quantitative:
- Dot plot
- Stem-and-leaf plot
- Histogram
Histogram: graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable.
- Discrete variable: one bar per possible value, bars have space in between
- Continuous variable: values are grouped into intervals, bars have no space in between.
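How the bars of a histogram are obtained for a continuous variable can be sketched as a counting step. The heights are made-up data, and the interval width of 10 cm and the starting edge of 150 cm are assumptions chosen for this example.

```python
# Hypothetical heights in cm (made-up data).
heights = [152, 158, 161, 163, 165, 166, 168, 170, 171, 174, 179, 183]

bin_width = 10   # assumed interval width
start = 150      # assumed lower edge of the first interval
n_bins = 4       # intervals: [150,160), [160,170), [170,180), [180,190)

# Frequency = number of observations falling in each interval.
frequencies = [0] * n_bins
for h in heights:
    index = (h - start) // bin_width   # which interval the value falls in
    frequencies[index] += 1

# Relative frequency = frequency / total number of observations.
relative = [f / len(heights) for f in frequencies]

# Text version of the histogram: one bar per interval.
for i, f in enumerate(frequencies):
    low = start + i * bin_width
    print(f"[{low}, {low + bin_width}): {'#' * f} ({relative[i]:.2f})")
```

Each interval includes its lower edge and excludes its upper edge, so every observation is counted exactly once.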
Mode: value that occurs most often.
Distribution has one peak → unimodal
Distribution has two distinct peaks → bimodal
Shape of distribution:
- Unimodal or bimodal
- Skewed (to the left or to the right) or symmetric
Time series: dataset collected over time
→ displayed with a time plot
Mean: sum of observations / number of observations.
- The mean is the balance point of the distribution
Median: middle value of the ordered observations; half of the observations are smaller than it, half of the observations are larger than it
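A minimal sketch of both summaries, using Python's standard `statistics` module on made-up values. The even-sized example relies on the usual convention that the median is then the average of the two middle values.

```python
import statistics

# Hypothetical observations (made-up data).
values = [2, 3, 5, 7, 13]

# Mean = sum of observations / number of observations.
mean = sum(values) / len(values)

# Median = middle value of the ordered observations.
median = statistics.median(values)

# Balance point: deviations from the mean cancel out.
deviations_sum = sum(v - mean for v in values)

# With an even number of observations, the median is the
# average of the two middle values.
even_values = [2, 3, 5, 7]
even_median = statistics.median(even_values)

print(mean, median, deviations_sum, even_median)
```

The deviations from the mean summing to zero is what "balance point" means: the observations below the mean exactly offset those above it.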
Outlier: observation that falls well above or well below the overall bulk of the data.
Mean compared to median:
- Symmetric: mean = median
- Skewed to the left: mean < median
- Skewed to the right: mean > median
Resistant: a numerical summary is resistant if extreme observations have little or no influence on its value.
Median → resistant
Mean → not resistant
Mean preferred → when distribution is symmetric or only mildly skewed
Median preferred → when distribution is highly skewed
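The difference in resistance can be seen by adding one extreme observation to a small dataset. The incomes below are made-up numbers (in thousands), and the value 400 is an arbitrary outlier chosen for illustration.

```python
import statistics

# Hypothetical incomes in thousands (made-up data).
incomes = [28, 30, 31, 33, 35]
with_outlier = incomes + [400]   # add one extreme observation

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The single outlier pulls the mean far to the right while the median moves by only one unit, so the skewed-to-the-right data end up with mean > median.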
Range: difference between largest and smallest observations
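As a small illustration of the range on made-up data:

```python
# Hypothetical observations (made-up data).
observations = [4, 8, 15, 16, 23, 42]

# Range = largest observation - smallest observation.
data_range = max(observations) - min(observations)

print(data_range)
```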