Agresti & Franklin - Chapter 2

Variable: any characteristic observed in a study

Categorical variable: variable whose observations each belong to one of a set of categories. For example: did it rain (yes/no), sex (male/female), and so on.

Quantitative variable: variable whose observations take numerical values that represent different magnitudes.

Discrete variable: quantitative variable whose possible values form a set of separate numbers, such as 0, 1, 2, 3. For example: number of pets, number of houses owned, etc.

Continuous variable: quantitative variable whose possible values form an interval. For example: height, weight, age, etc.

Any variable with an infinite continuum of possible values is continuous. For example: you can weigh 50 kg, but a more precise measurement could give 50.032354 kg.

Distribution: the distribution of a variable describes how the observations fall (are distributed) across the range of possible values.

Modal category: category with the highest frequency

Proportion: number of observations in a particular category / total number of observations in all categories

→ percentage: proportion × 100
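
As a minimal sketch of the definitions above, the following Python snippet counts a small categorical variable and derives the proportions, percentages, and modal category. The data and the variable names (weather, counts, and so on) are made up for illustration and do not come from the book.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (made-up data).
weather = ["rain", "dry", "dry", "rain", "dry", "dry", "cloudy", "dry"]

counts = Counter(weather)      # frequency of each category
total = sum(counts.values())   # total number of observations

# Proportion = frequency of the category / total number of observations.
proportions = {cat: n / total for cat, n in counts.items()}

# Percentage = proportion x 100.
percentages = {cat: 100 * p for cat, p in proportions.items()}

# Modal category = category with the highest frequency.
modal_category = counts.most_common(1)[0][0]

print(proportions)
print(percentages)
print(modal_category)
```

The proportions always add up to 1 and the percentages to 100, which is a quick check that no observation was dropped.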

Categorical:

Pareto principle: a small subset of categories often contains most of the observations.

Pareto chart: bar chart in which categories are ordered by their frequency, from highest to lowest.
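
The ordering behind a Pareto chart can be sketched in a few lines of Python. The complaint data below are invented, and the cumulative percentage column is an extra that these notes do not mention; it is included only because it makes the Pareto principle visible in the numbers.

```python
from collections import Counter

# Hypothetical complaint categories (made-up data).
complaints = ["late"] * 12 + ["damaged"] * 5 + ["wrong item"] * 2 + ["other"] * 1

counts = Counter(complaints)
total = sum(counts.values())

# most_common() sorts categories from highest to lowest frequency,
# which is the bar order used in a Pareto chart.
pareto_order = counts.most_common()

# Running total of the frequencies, expressed as a percentage.
cumulative = 0
pareto_rows = []
for category, n in pareto_order:
    cumulative += n
    pareto_rows.append((category, n, 100 * cumulative / total))

for category, n, cum_pct in pareto_rows:
    print(f"{category:<12}{n:>4}{cum_pct:>8.1f}%")
```

In this invented example the first two of four categories already cover 85% of the observations.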

Quantitative:

Histogram: graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable.
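
A rough text version of a histogram, assuming intervals of equal width, could look like the sketch below. The heights, the interval width of 10, and the starting edge of 150 are all assumptions chosen for illustration.

```python
# Hypothetical quantitative observations (made-up data), e.g. heights in cm.
heights = [152, 158, 161, 163, 165, 166, 168, 170, 171, 174, 179, 185]

bin_width = 10   # assumed interval width
start = 150      # assumed left edge of the first interval

# Count how many observations fall in each interval [left, left + bin_width).
frequencies = {}
for h in heights:
    left = start + ((h - start) // bin_width) * bin_width
    frequencies[left] = frequencies.get(left, 0) + 1

n = len(heights)
for left in sorted(frequencies):
    freq = frequencies[left]
    rel = freq / n   # relative frequency of the interval
    print(f"[{left}, {left + bin_width}): {'#' * freq}  freq={freq}  rel={rel:.2f}")
```

Each row of # characters plays the role of one bar; a plotting library would draw the same counts as vertical bars.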

Mode: value that occurs most often.

Distribution has one peak → unimodal

Distribution has two distinct peaks → bimodal
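
To illustrate the mode, here is a small sketch using Python's standard statistics module on made-up data. Note that statistics.multimode only reports values tied for the highest count in the sample; that is not the same thing as judging the number of distinct peaks in the shape of a distribution, which is usually done by looking at a graph.

```python
import statistics

# Hypothetical data: number of pets per household (made-up data).
pets = [0, 1, 1, 2, 2, 2, 3, 0, 2, 1]

# Value that occurs most often.
mode_value = statistics.mode(pets)

# All values tied for the highest frequency (a list with one entry here).
all_modes = statistics.multimode(pets)

print(mode_value)
print(all_modes)
```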

Shape of distribution

Time series: dataset collected over time

→ time plot

Mean: sum of observations / number of observations.

Median: middle value of the observations when they are ordered from smallest to largest; half of the observations are smaller than it and half are larger.
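
The two definitions above can be written out directly, as in the sketch below. The function names mean and median are illustrative, and the rule of averaging the two middle values when the number of observations is even is the usual convention, stated here as an assumption because the notes do not spell it out.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered observations.

    With an even number of observations, the two middle values are averaged
    (assumed convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up data for illustration.
odd_sample = [3, 5, 1, 8, 4]
even_sample = [3, 5, 1, 8, 4, 9]

print(mean(odd_sample), median(odd_sample))
print(mean(even_sample), median(even_sample))
```

Python's standard library offers statistics.mean and statistics.median, which should give the same results for these samples.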

Outlier: observation that falls well above or well below the overall bulk of the data.

Mean compared to median:

Resistant: a numerical summary is resistant if extreme observations have little or no influence on its value.

Median → resistant

Mean → not resistant

Mean preferred → when distribution is symmetrical/mildly skewed

Median preferred → when distribution is highly skewed
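
The comparison above can be checked with a small experiment: add one extreme observation to a sample and see how far each summary moves. The income figures below are invented for illustration.

```python
import statistics

# Hypothetical incomes in thousands (made-up data).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [500]   # same sample plus one extreme observation

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this example the mean jumps from 35 to 112.5, while the median only moves from 35 to 36.5, which is why the median is the safer summary for a highly skewed distribution.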

Range: difference between largest and smallest observations
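
A minimal sketch of the range in Python, on made-up data. Because the range uses only the two most extreme observations, a single outlier changes it directly.

```python
# Hypothetical exam scores (made-up data).
scores = [55, 62, 70, 71, 78, 84, 91]

# Range = largest observation minus smallest observation.
data_range = max(scores) - min(scores)

print(data_range)
```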