Design, Data and Decisions Lecture 1 Notes

Module 1 – Descriptive Statistics

Lecture Overview

Focus on key concepts:
- Graphical summaries of a single variable (Categorical and Continuous)
- Numerical summaries of a single variable
  - Location of the data
  - Spread of the data

Structure of Data

Data Forms:
- Transactional: Records interactions between systems
- Images: Digital files like MRIs
- Ongoing Data: Continuous data generation
Data Structure:
- Organized into attributes
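- Example:

  | Obs | Attribute 1 | Attribute 2 | Attribute 3 | ... | Attribute p |
  |-----|-------------|-------------|-------------|-----|-------------|
  | 1   | 22.3        | Aus         | 4           | ... | Low         |
  | 2   | 41.7        | Overseas    | 7           | ... | High        |
  | ... | ...         | ...         | ...         | ... | ...         |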

Types of Variables

Categorical (Qualitative)

Nominal: No order (e.g., nationality, gender)
Ordinal: Natural ordering (e.g., age group, level of education)

Quantitative (Numerical)

Discrete: Whole number values (e.g., number of birds)
Continuous: Any real number (e.g., height, temperature)

Categorical Variables

Frequency Reporting

Count observations per category and report frequencies.
Report relative frequencies as percentages: divide the frequency of category a, f_a, by the total frequency f_t.
- Example:[ \text{relative frequency}_a = \frac{f_a}{f_t} \times 100 ]
- Self-reported activity level example:

  | Activity Level | Frequency | %      |
  |----------------|-----------|--------|
  | Slight         | 10        | 10.9%  |
  | Moderate       | 61        | 66.3%  |
  | High           | 21        | 22.8%  |
  | Total          | 92        | 100.0% |
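
The percentages in the activity level table can be reproduced with a short sketch. The function name `relative_frequencies` is an illustrative choice, not something from the lecture; the counts are the ones reported above.

```python
# Frequencies from the self-reported activity level example.
frequencies = {"Slight": 10, "Moderate": 61, "High": 21}


def relative_frequencies(freqs):
    """Return each category's share of the total as a percentage."""
    total = sum(freqs.values())
    return {category: count / total * 100 for category, count in freqs.items()}


percentages = relative_frequencies(frequencies)

# Print a small frequency table: category, count, percentage to 1 d.p.
for category, pct in percentages.items():
    print(f"{category:<10}{frequencies[category]:>5}{pct:>8.1f}%")
```

Rounded to one decimal place, the results match the table (10.9%, 66.3%, 22.8%).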

Graphical Summaries for Categorical Variables

Bar Chart and Pie Chart

Bar Chart:
- Uses bars to show frequency/relative frequency
Pie Chart:
- Slices represent relative frequencies
- Angle of slice proportional to frequency
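
As a small illustration of "angle proportional to frequency", the sketch below converts counts into slice angles out of 360 degrees. It reuses the activity level counts from the frequency table; the function name `slice_angles` is hypothetical.

```python
# Counts from the self-reported activity level example.
frequencies = {"Slight": 10, "Moderate": 61, "High": 21}


def slice_angles(freqs):
    """Return the pie-slice angle (in degrees) for each category."""
    total = sum(freqs.values())
    return {category: count / total * 360 for category, count in freqs.items()}


angles = slice_angles(frequencies)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

The angles always sum to 360 degrees, because the relative frequencies sum to 1.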

Choosing Between Charts

Bar Charts are suitable for:
- Ordinal variables
- Nominal variables
- Comparing distributions of two categorical variables
Pie Charts should be used cautiously:
- Only for single nominal variables with few categories
- Avoid 3D as it distorts size perception

Quantitative Variables – Graphical Summaries

Stem-and-Leaf Plot Example

Example data set: height in cm. Each stem holds the leading digits and each leaf is the final digit, so 15 | 4 represents 154 cm.

| Stem | Leaf        |
|------|-------------|
| 15   | 4           |
| 16   | 677779      |
| 17   | 00000002222 |
| 18   | 0000001222  |
| 19   | 000         |

(Other plotting options include histograms.)
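
A stem-and-leaf plot like the height example can be built by splitting each value into its leading digits and its final digit. The sketch below assumes whole-number heights in cm and reads the data back from the plot (so 15 | 4 becomes 154); the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves as a string."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 167 -> stem 16, leaf 7
        groups[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}


# Heights reconstructed from the plot, assuming each leaf is a units digit.
heights = ([154] + [166] + [167] * 4 + [169]
           + [170] * 7 + [172] * 4
           + [180] * 6 + [181] + [182] * 3
           + [190] * 3)

plot = stem_and_leaf(heights)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

The length of each leaf string is the number of observations on that stem, so the plot doubles as a sideways histogram.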

Quantitative Variables – Center and Spread

Center Measurement Stats

Mode: Most common value, applicable for discrete/ordinal variables
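Median: Middle value when the data are ordered; not pulled around by extreme values, so preferred for skewed distributions
Mean: Arithmetic average, the sum of the observations divided by the number of observations (the value each observation would take if the total were shared evenly)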
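
To see how the three measures behave on skewed data, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical, chosen so that one large observation pulls the mean upwards.

```python
import statistics

# Hypothetical data: one large value (25) skews the distribution to the right.
data = [2, 3, 3, 4, 5, 7, 25]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # middle value of the ordered data
mean = statistics.mean(data)      # sum of the values divided by their count

print(f"mode={mode}, median={median}, mean={mean}")
```

Here the mode is 3, the median is 4 and the mean is 7: the single large value drags the mean well above the median, which is why the median is often reported for skewed data.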

Spread Measurement Stats

Range: Difference between the largest and smallest values (e.g., 52 - 17 = 35 years)
Inter-Quartile Range (IQR): Width of the middle 50% of the ordered data, calculated as Q3 - Q1
Standard Deviation: Measure of how spread out observations are from the mean
- Example Calculation of Median:
  1. Order the dataset
  2. Find the middle value: with an odd number of observations it is the single middle value; with an even number it is the average of the two middle values
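
The two median steps can be sketched as a short function. The name `median_by_hand` and the sample inputs are illustrative only.

```python
def median_by_hand(values):
    """Return the median: order the data, then take the middle value(s)."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)              # step 1: order the dataset
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_hand([5, 1, 3]))     # odd number of observations
print(median_by_hand([4, 1, 3, 2]))  # even number of observations
```

For `[5, 1, 3]` the ordered data are 1, 3, 5 and the median is 3; for `[4, 1, 3, 2]` the two middle values are 2 and 3, giving 2.5.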

Inter-Quartile Range Calculation Steps

1. Order the data
2. Identify the first quartile ( Q1 ), below which 25% of the ordered data lie, and the third quartile ( Q3 ), below which 75% lie
3. Calculate IQR as ( Q3 - Q1 )
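
The IQR steps can be sketched with `statistics.quantiles` from the Python standard library (Python 3.8+). Several conventions exist for locating quartiles and the notes do not say which one the lecture uses, so the `method="inclusive"` choice below is an assumption; other software may return slightly different quartiles for the same data. The data values are hypothetical.

```python
import statistics

# Hypothetical data; quantiles() orders the data internally.
data = [9, 1, 5, 3, 7, 2, 8, 4, 6]

# n=4 splits the data into quarters and returns the three cut points.
# method="inclusive" is an assumed convention, not one taken from the lecture.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

For this data the sketch gives Q1 = 3, Q3 = 7 and an IQR of 4.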

Box and Whisker Plot Example

The box shows:
- Lower Quartile
- Median
- Upper Quartile
The whiskers extend from the box towards the smallest and largest observations.

Standard Deviation Calculation

Evaluate the distance of each observation from the sample mean
Calculate the sample variance by summing the squared deviations and dividing by n - 1; the standard deviation is its square root
- Formula for Sample Standard Deviation:[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2} ]
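
The calculation can be followed step by step in a short sketch. The data values are hypothetical, and the result is checked against `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                              # sample mean
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)      # sample variance, s^2
std_dev = math.sqrt(variance)                     # sample standard deviation, s

print(f"mean={mean}, variance={variance:.3f}, sd={std_dev:.3f}")

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(std_dev, statistics.stdev(data))
```

Here the mean is 5, the squared deviations sum to 32, and the sample variance is 32 / 7.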

Z-scores

Useful for comparing values across different variables
- Example: Comparing scores across two subjects
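Z-score Calculation:[ z = \frac{x - \mu}{\sigma} ]
- Here x is the observation, \mu is the mean and \sigma is the standard deviation of the variable, so z counts how many standard deviations x lies from the mean
- General rule of thumb: for roughly bell-shaped distributions, about 95% of observations lie within +/- 2 standard deviations of the mean.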
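
A minimal sketch of comparing scores across two subjects with z-scores follows. The marks, means and standard deviations are hypothetical numbers made up for illustration, and `z_score` is an illustrative function name.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical marks for one student in two subjects with different spreads.
z_maths = z_score(75, mean=65, sd=5)
z_english = z_score(80, mean=70, sd=10)

print(f"Maths z = {z_maths}, English z = {z_english}")
```

Although the raw English mark is higher (80 against 75), the Maths mark is two standard deviations above its class mean while the English mark is only one, so relative to each class the Maths result is the stronger one.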