Design, Data and Decisions Lecture 1 Notes
Module 1 – Descriptive Statistics
Lecture Overview
Focus on key concepts:
Graphical summaries of a single variable (Categorical and Continuous)
Numerical summaries of a single variable
Location of the data
Spread of the data
Structure of Data
Data Forms:
Transactional: Records interactions between systems
Images: Digital files like MRIs
Ongoing Data: Continuous data generation
Data Structure:
Organized as a table: each row is an observation and each column is an attribute (variable)
Example:
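| Obs | Attribute 1 | Attribute 2 | Attribute 3 | ... | Attribute p |
|-----|-------------|-------------|-------------|-----|-------------|
| 1 | 22.3 | Aus | 4 | ... | Low |
| 2 | 41.7 | Overseas | 7 | ... | High |
| ... | ... | ... | ... | ... | ... |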
Types of Variables
Categorical (Qualitative)
Nominal: No order (e.g., nationality, gender)
Ordinal: Natural ordering (e.g., age group, level of education)
Quantitative (Numerical)
Discrete: Whole number values (e.g., number of birds)
Continuous: Any real number (e.g., height, temperature)
Categorical Variables
Frequency Reporting
Count observations per category and report frequencies.
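Report relative frequencies as percentages of the total:
Formula: [ r_a = \frac{f_a}{f_t} \times 100 ], where f_a is the number of observations in category a, f_t is the total number of observations, and r_a is the relative frequency of category a in percent.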
Self-reported activity level example:
| Activity Level | Frequency | % |
|----------------|-----------|---|
| Slight | 10 | 10.9% |
| Moderate | 61 | 66.3% |
| High | 21 | 22.8% |
| Total | 92 | 100.0% |
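As a quick check of the percentages in the activity-level table, the following minimal sketch recomputes them from the raw counts. The function name `relative_frequencies` is illustrative only and is not part of the lecture.

```python
# Counts taken from the self-reported activity level table.
counts = {"Slight": 10, "Moderate": 61, "High": 21}


def relative_frequencies(counts):
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    return {category: 100 * f / total for category, f in counts.items()}


percentages = relative_frequencies(counts)
for category, pct in percentages.items():
    # One row per category: name, frequency, percentage to one decimal place.
    print(f"{category:<10}{counts[category]:>5}{pct:>8.1f}%")
```

Rounded to one decimal place, the results agree with the table (10.9%, 66.3%, 22.8%).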
Graphical Summaries for Categorical Variables
Bar Chart and Pie Chart
Bar Chart:
Uses bars to show frequency/relative frequency
Pie Chart:
Slices represent relative frequencies
Angle of slice proportional to frequency
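Because a full circle is 360 degrees, "proportional to frequency" means each slice angle is the category's share of the total multiplied by 360. A small sketch using the activity-level counts from the frequency table (the variable names are illustrative):

```python
# Counts from the self-reported activity level example.
counts = {"Slight": 10, "Moderate": 61, "High": 21}
total = sum(counts.values())

# Slice angle in degrees = (category frequency / total frequency) * 360.
angles = {category: 360 * f / total for category, f in counts.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:6.1f} degrees")
```

The angles always sum to 360, whatever the counts are.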
Choosing Between Charts
Bar Charts are suitable for:
Ordinal variables
Nominal variables
Comparing distributions of two categorical variables
Pie Charts should be used cautiously:
Only for single nominal variables with few categories
Avoid 3D as it distorts size perception
Quantitative Variables – Graphical Summaries
Stem-and-Leaf Plot Example
Example data set: height in cm
| Stem | Leaf |
|------|------|
| 15 | 4 |
| 16 | 677779 |
| 17 | 00000002222 |
| 18 | 0000001222 |
| 19 | 000 |
Key: 15 | 4 represents 154 cm.
(Other plotting options include histograms.)
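The stem-and-leaf display above can be rebuilt from raw values with a few lines of code. In this sketch the `heights` list is read back off the display (stem 15, leaf 4 becomes 154 cm), and `stem_and_leaf` is an illustrative helper that assumes positive integers and takes the last digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integers by stem (all digits but the last) and join the last digits as leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    # Stems appear in ascending order because the values were sorted first.
    return {stem: "".join(str(leaf) for leaf in leaves) for stem, leaves in groups.items()}


# Heights in cm, reconstructed from the display in the notes.
heights = (
    [154, 166, 167, 167, 167, 167, 169]
    + [170] * 7 + [172] * 4
    + [180] * 6 + [181] + [182] * 3
    + [190] * 3
)

plot = stem_and_leaf(heights)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

One simplification to note: a stem with no observations is skipped here, whereas a hand-drawn plot would normally show it as an empty row.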
Quantitative Variables – Center and Spread
Center Measurement Stats
Mode: Most common value, applicable for discrete/ordinal variables
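Median: Middle value when data is ordered; not affected by extreme values, so it is the preferred measure of center for skewed distributions
Mean: Sum of the observations divided by the number of observations, [ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ]; the value each observation would take if the total were shared evenly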
Spread Measurement Stats
Range: Difference between the largest and smallest values (e.g., 52 - 17 = 35 years)
Inter-Quartile Range (IQR): Width of the interval containing the middle 50% of the ordered data (Q3 - Q1)
Standard Deviation: Measure of how spread out observations are from the mean
Example Calculation of Median:
Order the dataset
Find the middle value: with an odd number of observations n, take the value in position (n + 1)/2; with an even number, average the two middle values
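The two steps can be sketched as follows. The function name `median_of` and the small data sets are made up for illustration; the even case uses the usual convention of averaging the two middle values.

```python
import statistics


def median_of(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # Step 1: order the dataset
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair


odd_example = median_of([3, 1, 4, 1, 5])    # ordered: 1, 1, 3, 4, 5
even_example = median_of([3, 1, 4, 1])      # ordered: 1, 1, 3, 4
print(odd_example, even_example)

# Cross-check against the standard library implementation.
assert odd_example == statistics.median([3, 1, 4, 1, 5])
assert even_example == statistics.median([3, 1, 4, 1])
```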
Inter-Quartile Range Calculation Steps
Order the data
Identify the positions of the first quartile (Q1) and the third quartile (Q3) in the ordered data
Calculate IQR = Q3 - Q1
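The steps above can be sketched in code, with one caveat: textbooks and software use several slightly different quartile conventions, and the notes do not say which one the lecture uses. This sketch assumes the "median of each half" rule (the overall median is left out of both halves when n is odd). The `ages` values are hypothetical, chosen only so that the smallest and largest match the range example (52 - 17 = 35 years).

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)              # Step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]      # skip the middle value when n is odd
    return median(lower), median(upper)   # Step 2: locate Q1 and Q3


def iqr(values):
    """Step 3: IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical ages in years (illustrative only).
ages = [17, 21, 22, 25, 30, 34, 41, 52]
q1, q3 = quartiles(ages)
data_range = max(ages) - min(ages)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr(ages)}, range = {data_range}")
```

Other conventions (for example the interpolation methods offered by `statistics.quantiles`) can give slightly different Q1 and Q3 for small samples.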
Box and Whisker Plot Example
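Composed of:
Lower Quartile (Q1): lower edge of the box
Median: line inside the box
Upper Quartile (Q3): upper edge of the box
Whiskers: lines extending from the box toward the smallest and largest values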
Standard Deviation Calculation
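Compute the deviation of each observation from the sample mean: ( x_i - \overline{x} )
Calculate the variance by summing the squared deviations and dividing by n - 1
Formula for Sample Variance:[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2 ]
Formula for Sample Standard Deviation:[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2} ]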
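A minimal sketch of the calculation, following the steps literally: deviations from the mean, squared, summed, divided by n - 1, then square-rooted to get the standard deviation. The data values and function names are made up for illustration.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)   # 32 / 7
std_dev = sample_std(data)
print(f"variance = {variance:.4f}, standard deviation = {std_dev:.4f}")

# Cross-check against the standard library (which also divides by n - 1).
assert math.isclose(std_dev, statistics.stdev(data))
```

The standard deviation is in the same units as the data, while the variance is in squared units.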
Z-scores
Useful for comparing values across different variables
Example: Comparing scores across two subjects
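Z-score Calculation:[ z = \frac{x - \mu}{\sigma} ], where x is the observed value, \mu is the mean and \sigma is the standard deviation of that variable (for a sample, use \overline{x} and s)
General rule of thumb: for roughly symmetric, bell-shaped data, about 95% of observations lie within +/- 2 standard deviations of the mean, i.e. have z-scores between -2 and 2.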
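To illustrate the two-subject comparison, the sketch below standardizes one student's score in each subject. All marks, means and standard deviations are hypothetical, as are the subject names; only the z-score formula comes from the notes.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical class results: the raw English mark is higher,
# but the Maths mark is further above its class mean in standard-deviation units.
maths_z = z_score(75, mean=65, sd=10)
english_z = z_score(80, mean=74, sd=12)

print(f"Maths z = {maths_z:.2f}, English z = {english_z:.2f}")

# Rule-of-thumb check: is each score within 2 standard deviations of its mean?
for subject, z in [("Maths", maths_z), ("English", english_z)]:
    print(subject, "within +/- 2 SD:", abs(z) <= 2)
```

Under these assumed numbers the student did relatively better in Maths, even though the raw English mark is higher.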