Design, Data and Decisions Lecture 1 Notes

Module 1 – Descriptive Statistics

Lecture Overview

  • Focus on key concepts:

    • Graphical summaries of a single variable (Categorical and Continuous)

    • Numerical summaries of a single variable

      • Location of the data

      • Spread of the data


Structure of Data

  • Data Forms:

    • Transactional: Records interactions between systems

    • Images: Digital files like MRIs

    • Ongoing Data: Continuous data generation

  • Data Structure:

    • Organized into observations (rows) and attributes (columns)

    • Example:

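      | Obs | Attribute 1 | Attribute 2 | Attribute 3 | ... | Attribute p |
      |-----|-------------|-------------|-------------|-----|-------------|
      | 1   | 22.3        | Aus         | 4           | ... | Low         |
      | 2   | 41.7        | Overseas    | 7           | ... | High        |
      | ... | ...         | ...         | ...         | ... | ...         |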


Types of Variables

Categorical (Qualitative)

  • Nominal: No order (e.g., nationality, gender)

  • Ordinal: Natural ordering (e.g., age group, level of education)

Quantitative (Numerical)

  • Discrete: Whole number values (e.g., number of birds)

  • Continuous: Any real number (e.g., height, temperature)


Categorical Variables

Frequency Reporting

  • Count observations per category and report frequencies.

  • Use relative frequencies as percentages:

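    • Formula: \[ p_a = \frac{f_a}{f_t} \times 100 \] where \( p_a \) is the percentage for category a, \( f_a \) is the number of observations in category a and \( f_t \) is the total number of observations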

    • Self-reported activity level example:

      | Activity Level | Frequency | %      |
      |----------------|-----------|--------|
      | Slight         | 10        | 10.9%  |
      | Moderate       | 61        | 66.3%  |
      | High           | 21        | 22.8%  |
      | Total          | 92        | 100.0% |
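As a quick check of the percentages in the activity-level table, the relative frequencies can be recomputed from the counts. This is a minimal sketch; the variable names are illustrative only.

```python
# Counts copied from the self-reported activity level table.
counts = {"Slight": 10, "Moderate": 61, "High": 21}

total = sum(counts.values())  # total number of observations

# Relative frequency of each category as a percentage, rounded to 1 decimal place.
percentages = {level: round(100 * f / total, 1) for level, f in counts.items()}

for level, pct in percentages.items():
    print(f"{level:<10}{counts[level]:>4}{pct:>7.1f}%")
print(f"{'Total':<10}{total:>4}")
```

Each percentage is the category count divided by the total of 92, multiplied by 100.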


Graphical Summaries for Categorical Variables

Bar Chart and Pie Chart

  • Bar Chart:

    • Uses bars to show frequency/relative frequency

  • Pie Chart:

    • Slices represent relative frequencies

    • Angle of slice proportional to frequency
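To make "angle of slice proportional to frequency" concrete, the sketch below converts the activity-level counts into slice angles. It assumes a full circle of 360 degrees shared out in proportion to each count; the names are illustrative.

```python
# Frequencies from the self-reported activity level example.
counts = {"Slight": 10, "Moderate": 61, "High": 21}
total = sum(counts.values())

# Each slice's angle is its share of the full 360-degree circle.
angles = {level: 360 * f / total for level, f in counts.items()}

for level, angle in angles.items():
    print(f"{level}: {angle:.1f} degrees")
```

The angles always sum to 360, so a pie chart can only show how one whole is divided.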

Choosing Between Charts

  • Bar Charts are suitable for:

    • Ordinal variables

    • Nominal variables

    • Comparing distributions of two categorical variables

  • Pie Charts should be used cautiously:

    • Only for single nominal variables with few categories

    • Avoid 3D as it distorts size perception


Quantitative Variables – Graphical Summaries

Stem-and-Leaf Plot Example

  • Example data set, height in cm:

      | Stem | Leaves      |
      |------|-------------|
      | 15   | 4           |
      | 16   | 677779      |
      | 17   | 00000002222 |
      | 18   | 0000001222  |
      | 19   | 000         |

    • Key: stem 15 with leaf 4 represents 154 cm

  • Other plotting options include histograms
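The sketch below shows how a stem-and-leaf plot relates to the raw values. It decodes the rows of the height plot into individual heights, then rebuilds the plot from those heights. It assumes each leaf is the final digit of a whole-centimetre height; the function name `stem_and_leaf` is made up for this illustration.

```python
from collections import defaultdict

# Rows copied from the stem-and-leaf plot: stem -> string of leaf digits.
plot = {15: "4", 16: "677779", 17: "00000002222", 18: "0000001222", 19: "000"}

# Decode: each leaf is the final digit, so stem 15 with leaf 4 is 154 cm.
heights = [stem * 10 + int(leaf) for stem, leaves in plot.items() for leaf in leaves]

def stem_and_leaf(values):
    """Group whole-number values by stem (value // 10) and list the sorted leaves (value % 10)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves) for stem, leaves in sorted(rows.items())}

rebuilt = stem_and_leaf(heights)
for stem, leaves in rebuilt.items():
    print(f"{stem} | {leaves}")
```

The number of leaves in a row is the frequency for that stem, so the plot doubles as a sideways histogram while still showing every value.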


Quantitative Variables – Center and Spread

Center Measurement Stats

  • Mode: Most common value; most useful for discrete or ordinal variables

  • Median: Middle value of the ordered data; not pulled by extreme values, so preferred for skewed distributions

  • Mean: Sum of the observations divided by the number of observations, i.e. the value each observation would have if the total were shared evenly
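The three measures of centre can be compared on a small data set using Python's standard `statistics` module. The ages below are hypothetical, made up for illustration, with one large value to show how skew affects the mean.

```python
import statistics

# Hypothetical ages in years, made up for illustration (not from the lecture).
ages = [17, 21, 21, 24, 30, 52]

mode = statistics.mode(ages)      # most common value
median = statistics.median(ages)  # middle of the ordered data
mean = statistics.mean(ages)      # total divided by the number of observations

print(mode, median, mean)  # 21 22.5 27.5
```

The single large value (52) pulls the mean above the median, which is why the median is often preferred for skewed data.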

Spread Measurement Stats

  • Range: Difference between the largest and smallest values (e.g., 52 - 17 = 35 years)

  • Inter-Quartile Range (IQR): Width of the middle 50% of the ordered data (Q3 - Q1)

  • Standard Deviation: Measure of how spread out observations are from the mean

  • Calculating the median:

    1. Order the dataset

    2. Find the middle value: with an odd number of observations n, take the value at position (n + 1)/2; with an even number, average the values at positions n/2 and n/2 + 1
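The range and the odd/even rule for the median can be sketched as follows. The ages are hypothetical, chosen so that the range reproduces the 52 - 17 = 35 example; the function name `median` is illustrative.

```python
def median(values):
    """Middle value for odd n, or the mean of the two middle values for even n."""
    ordered = sorted(values)  # step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair

# Hypothetical ages in years (unordered on purpose).
ages = [21, 52, 17, 30, 24, 21]

data_range = max(ages) - min(ages)
print(data_range)         # 35
print(median(ages))       # 22.5 (even n: average of 21 and 24)
print(median(ages[:5]))   # 24   (odd n: middle of 17, 21, 24, 30, 52)
```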


Inter-Quartile Range Calculation Steps

  1. Order the data

  2. Identify the first quartile \( Q_1 \) (25% of the ordered data lie at or below it) and the third quartile \( Q_3 \) (75% lie at or below it)

  3. Calculate the IQR as \( Q_3 - Q_1 \)
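The steps above can be sketched in code. Textbooks and software use slightly different conventions for locating quartiles; this sketch assumes the "median of each half" convention, leaving out the middle value when the number of observations is odd, which may not be the convention used in the lecture. The data and the function name `quartiles` are hypothetical.

```python
import statistics

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the ordered data."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)      # step 1: order the data
    half = len(ordered) // 2      # size of each half (middle value dropped if n is odd)
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

# Hypothetical ages in years, made up for illustration.
ages = [17, 21, 21, 24, 30, 52]

q1, q3 = quartiles(ages)          # step 2: locate Q1 and Q3
iqr = q3 - q1                     # step 3: IQR = Q3 - Q1
print(q1, q3, iqr)                # 21 30 9
```

Unlike the range (35 for these ages), the IQR is unaffected by the single large value of 52.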

Box and Whisker Plot Example

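  • Composed of:

    • Lower Quartile (one edge of the box)

    • Median (line inside the box)

    • Upper Quartile (other edge of the box)

    • Whiskers: lines extending from the box towards the smallest and largest observations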
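A box and whisker plot is usually drawn from a five-number summary: minimum, lower quartile, median, upper quartile and maximum. The sketch below computes one with the standard library's `statistics.quantiles`, using its `"inclusive"` method; this is one of several quartile conventions, so the quartiles may differ slightly from values found by hand. The data are hypothetical.

```python
import statistics

# Hypothetical ages in years, made up for illustration.
ages = [17, 21, 21, 24, 30, 52]

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

five_number_summary = (min(ages), q1, median, q3, max(ages))
print(five_number_summary)
```

The box spans `q1` to `q3`, the line inside it sits at `median`, and in the simplest version of the plot the whiskers reach `min(ages)` and `max(ages)`.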


Standard Deviation Calculation

  • Measure the distance of each observation from the sample mean

  • Calculate the sample variance by summing the squared deviations and dividing by n - 1

    • Formula for Sample Variance: \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2 \]

    • The Sample Standard Deviation is its square root: \[ s = \sqrt{s^2} \]
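The calculation can be followed step by step on a small data set. The values below are hypothetical, picked so the arithmetic is easy to check by hand; the result is compared against `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Hypothetical observations, made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                                   # sample mean: 40 / 8 = 5.0
squared_deviations = [(x - mean) ** 2 for x in data]   # 9, 1, 1, 1, 0, 0, 4, 16
variance = sum(squared_deviations) / (n - 1)           # s^2 = 32 / 7
std_dev = math.sqrt(variance)                          # s = square root of the variance

print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
print(f"statistics.stdev gives {statistics.stdev(data):.3f}")
```

Squaring the deviations stops positive and negative distances from cancelling, and taking the square root at the end returns the result to the original units.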

Z-scores

  • Useful for comparing values across different variables

    • Example: Comparing scores across two subjects

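  • Z-score Calculation: \[ z = \frac{x - \mu}{\sigma} \] where \( x \) is the observation, \( \mu \) is the mean and \( \sigma \) is the standard deviation

    • General rule of thumb: for roughly bell-shaped data, about 95% of observations lie within +/- 2 standard deviations of the mean.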
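As an illustration of comparing scores across two subjects, the sketch below standardises one student's marks. All scores, means and standard deviations are hypothetical, and the function name `z_score` is made up for this example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical class results: (student's score, class mean, class standard deviation).
maths_z = z_score(75, mean=65, sd=10)
english_z = z_score(80, mean=75, sd=10)

print(maths_z, english_z)  # 1.0 0.5
```

Although the raw English score is higher, the maths score sits further above its class mean, so it is the stronger result relative to the class.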