Agresti & Franklin - Chapter 2 

Variable: any characteristic observed in a study

Categorical variable: a variable whose observations each belong to one of a set of categories. For example: did it rain (yes/no), sex (male/female) and so on.

Quantitative variable: a variable whose observations take numerical values that represent different magnitudes.

Discrete variable: a quantitative variable whose possible values form a set of separate numbers, such as 0, 1, 2, 3. For example: number of pets, number of houses owned etc.

  • ‘The number of’ → discrete.
  • Any variable with a finite number of possible values is discrete.

Continuous variable: a quantitative variable whose possible values form an interval. For example: height, weight, age etc.

  • Any variable with an infinite continuum of possible values is continuous. For example: you can weigh 50 kg, but your weight can also be 50.032354 kg.

Distribution: the distribution of a variable describes how the observations fall (are distributed) across the range of possible values.

Modal category: category with the highest frequency

Proportion: number of observations in a particular category / total number of observations in all categories

→ Percentage: proportion × 100
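As a minimal sketch of the definitions above, the following computes frequencies, proportions, percentages and the modal category. The `weather` data and all variable names are made up for illustration and do not come from the book.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (weather on 10 days).
weather = ["sun", "rain", "sun", "cloud", "sun",
           "rain", "sun", "cloud", "sun", "rain"]

counts = Counter(weather)        # frequency of each category
total = sum(counts.values())     # total observations in all categories

# Proportion = frequency of the category / total number of observations.
proportions = {category: n / total for category, n in counts.items()}

# Percentage = proportion x 100.
percentages = {category: 100 * p for category, p in proportions.items()}

# The modal category is the category with the highest frequency.
modal_category = max(counts, key=counts.get)

print(counts)
print(proportions)
print(percentages)
print("Modal category:", modal_category)
```

The proportions always add up to 1 and the percentages to 100, which is a quick check on a frequency table.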

Graphs for categorical variables:

  • Pie chart
  • Bar graph
  • Pareto chart

Pareto principle: a small subset of categories often contains most of the observations.

Pareto chart: bar graph in which the categories are ordered by their frequency, from highest to lowest.
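The ordering behind a Pareto chart can be sketched in a few lines. The complaint categories below are hypothetical, and the cumulative percentage column is an addition for illustration: it shows how quickly the first few categories account for most of the observations.

```python
from collections import Counter

# Hypothetical categorical data: reasons for customer complaints.
complaints = (["late"] * 12 + ["damaged"] * 5
              + ["wrong item"] * 2 + ["other"] * 1)

counts = Counter(complaints)
total = sum(counts.values())

# most_common() returns (category, frequency) pairs, highest frequency first.
# This is the left-to-right order of the bars in a Pareto chart.
ordered = counts.most_common()

cumulative = 0
for category, n in ordered:
    cumulative += n
    print(f"{category:<12}{n:>4}{100 * cumulative / total:>8.1f}%")
```

In this made-up example the first two of the four categories already cover 17 of the 20 observations, which is the pattern the Pareto principle describes.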

Graphs for quantitative variables:

  • Dot plot
  • Stem-and-leaf plot
  • Histogram

Histogram: graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable.

  • Discrete variable: bars have space in between
  • Continuous variable: bars have no space in between; each bar covers an interval of values.
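A rough sketch of what a histogram counts for a continuous variable: the values are grouped into intervals of equal width and the frequency of each interval gives the bar height. The heights, the starting point and the interval width are all assumed values chosen for illustration.

```python
# Hypothetical heights in cm (a continuous variable).
heights = [158.2, 161.5, 163.0, 165.4, 167.8,
           168.1, 170.3, 171.9, 174.6, 181.2]

# Assumed intervals: 155-160, 160-165, ..., 180-185.
start, width, n_bins = 155, 5, 6

frequencies = [0] * n_bins
for h in heights:
    index = int((h - start) // width)   # interval the value falls in
    frequencies[index] += 1

# Relative frequency = frequency / number of observations.
relative = [count / len(heights) for count in frequencies]

# Text version of the histogram: one '#' per observation.
for i, count in enumerate(frequencies):
    low = start + i * width
    print(f"{low}-{low + width}: {'#' * count}")
```

Plotting `relative` instead of `frequencies` changes the scale of the vertical axis but not the shape of the histogram.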

Mode: value that occurs most often.

Distribution has one peak → unimodal

Distribution has two distinct peaks → bimodal

Shape of distribution

  • Unimodal:
    • Skewed or symmetric.

Time series: dataset collected over time

→ time plot

Mean: sum of observations / number of observations.

  • The mean is the balance point of the distribution

Median: middle value of the ordered observations; half of the observations are smaller than it, half of the observations are larger than it

Outlier: observation that falls well above or well below the overall bulk of the data.

Mean compared to median:

  • Symmetric: mean = median
  • Skewed to the left: mean < median
  • Skewed to the right: mean > median
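The relation between mean and median can be checked on small data sets. This sketch uses Python's standard `statistics` module. Both data sets are invented for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, a few are large
# (for example incomes in thousands).
incomes = [20, 22, 25, 27, 30, 32, 35, 60, 120]

# Hypothetical left-skewed data: most values are large, a few are small
# (for example exam scores).
scores = [35, 70, 80, 85, 88, 90, 92, 95, 97]

for name, data in [("incomes", incomes), ("scores", scores)]:
    m = mean(data)      # sum of observations / number of observations
    med = median(data)  # middle value of the ordered observations
    print(f"{name}: mean = {m:.2f}, median = {med}")
```

For `incomes` the long right tail pulls the mean above the median, and for `scores` the long left tail pulls it below.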

Resistant: a numerical summary is resistant if extreme observations have little or no influence on its value.

Median → resistant

Mean → not resistant
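A small illustration of resistance, using made-up numbers: adding one extreme observation moves the mean a lot but barely changes the median.

```python
from statistics import mean, median

values = [3, 4, 5, 6, 7]          # hypothetical data without outliers
with_outlier = values + [100]     # the same data plus one extreme value

print("mean:  ", mean(values), "->", round(mean(with_outlier), 2))
print("median:", median(values), "->", median(with_outlier))
```

Here the mean jumps from 5 to about 20.83, while the median only moves from 5 to 5.5.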

Mean preferred → when the distribution is symmetric or mildly skewed

Median preferred → when the distribution is highly skewed

Range: difference between largest and smallest observations
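As a one-line check of the definition, with hypothetical observations:

```python
# Hypothetical observations of a quantitative variable.
observations = [12, 7, 19, 3, 15]

# Range = largest observation - smallest observation.
data_range = max(observations) - min(observations)

print(data_range)
```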