Week 1: Exploring Data
Introduction to Statistics
Statistics: Quantifying the world in an objective way for good decision-making.
Modules:
Descriptive Statistics and data communication (Module 1).
Inferential Statistics (Module 2).
Hypothesis Testing (Module 3).
Regression Analysis (Module 4).
Data ethics (Module 5).
Week 1: Exploring Data - Foundations of Descriptive Statistics
Data Types and Visualization
Preliminary Terms and Definitions
Variable: A characteristic, number, or quantity that can be measured or counted.
Random variable: A variable whose outcome is unknown before data collection
Example: income of an Australian household.
Population: The complete pool of a particular random variable
Example: income of all Australian households.
Sample: A subset of the population
Example: income of 100 households.
Goal: Describe and visualize information contained in different types of variables.
Types of Data
Variables are broadly classified as qualitative/categorical or quantitative/numerical.
Qualitative/Categorical:
Nominal: Categories with no natural ordering
Example: 0/1 variable for male/female.
Ordinal: Categories with a natural order, but numbers are meaningless
Example: Agreement scale (don't agree=-1/somewhat agree=0/completely agree=1).
Quantitative/Numerical:
Discrete: Values can be listed (not infinitely divisible), often from counting
Example: Number of children in a household (0, 1, 2, 3, ...).
Continuous: Can take an infinite number of values within a range, often from measurement.
Example: Heights of professional basketball players.
Frequency Distributions
For qualitative/categorical data, visualize via a table displaying frequencies.
Example Table:
Marital Status of home loan applicants:
Single: Frequency 102, Relative Frequency 0.1262, Percent Frequency 12.62
Married: Frequency 341, Relative Frequency 0.4220, Percent Frequency 42.20
Widowed: Frequency 155, Relative Frequency 0.1918, Percent Frequency 19.18
De Facto: Frequency 50, Relative Frequency 0.0619, Percent Frequency 6.19
Separated: Frequency 40, Relative Frequency 0.0495, Percent Frequency 4.95
Divorced: Frequency 120, Relative Frequency 0.1485, Percent Frequency 14.85
Total: Frequency 808, Relative Frequency 1, Percent Frequency 100
Key Terms
Frequency counts: Total occurrences for each category.
Relative frequency: Fraction/proportion of total data items in a category.
Percent frequency: Relative frequency × 100 (%).
Excel Function
Use COUNTIF(range, criteria) to obtain frequency counts.
Example Formula:
=COUNTIF(I$10:I$389, $D2)
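As a cross-check outside Excel, the three columns of the table can be reproduced in a few lines of Python. This is only a minimal sketch: the dictionary `counts` re-enters the frequencies from the marital status table, and all variable names are illustrative.

```python
# Frequencies copied from the marital status table above.
counts = {
    "Single": 102,
    "Married": 341,
    "Widowed": 155,
    "De Facto": 50,
    "Separated": 40,
    "Divorced": 120,
}

total = sum(counts.values())  # total number of applicants

# Relative frequency = count / total; percent frequency = relative frequency * 100.
relative = {status: count / total for status, count in counts.items()}
percent = {status: 100 * share for status, share in relative.items()}

for status, count in counts.items():
    print(f"{status:<10} {count:>4} {relative[status]:.4f} {percent[status]:6.2f}")
print(f"{'Total':<10} {total:>4} {sum(relative.values()):.4f}")
```

Relative frequencies always sum to 1 (and percent frequencies to 100), which is a quick sanity check on any frequency table.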
Data Visualization: Histograms
Commonly used for continuous variables.
Steps:
Choose a bandwidth/bin size to group incomes into equally spaced categories
e.g., $0-$100, $100-$200, $200-$300, etc., with each boundary value assigned to exactly one bin.
Plot frequencies for each group in a bar chart
Frequencies on the y-axis, categories on the x-axis.
Histogram Example
Income brackets and Frequencies:
(6, 526]: Frequency near 35
(526, 1046]: Frequency around 25
(1046, 1566]: Frequency around 15
(1566, 2086]: Frequency around 10
(2086, 2606]: Frequency around 5
(2606, 3126]: Frequency around 2
Histogram Considerations
In histograms for numerical data, ensure that the income ranges are contiguous.
Gaps or overlaps between income ranges misrepresent the data.
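The binning step can be sketched in Python as follows. The bin edges follow the histogram example, where each bin is open on the left and closed on the right, such as (6, 526]. The `incomes` list is hypothetical and is not the data behind the lecture's histogram.

```python
import math

# Hypothetical weekly incomes in dollars (illustrative only, not the lecture's data).
incomes = [120, 450, 480, 530, 800, 1040, 1046, 1200, 1900, 2500, 3000]

lower, width, n_bins = 6, 520, 6  # bins (6, 526], (526, 1046], ..., (2606, 3126]

# Contiguous, non-overlapping bins: each bin is open on the left, closed on the right.
edges = [lower + k * width for k in range(n_bins + 1)]
frequencies = [0] * n_bins
for income in incomes:
    # ceil places a value exactly on an edge (e.g. 1046) into the bin it closes.
    k = math.ceil((income - lower) / width) - 1
    if 0 <= k < n_bins:
        frequencies[k] += 1

# Text version of the histogram: one row per bin.
for k, freq in enumerate(frequencies):
    print(f"({edges[k]}, {edges[k + 1]}]: {'#' * freq} {freq}")
```

Because the bins share their edges, every income in the covered range lands in exactly one bin, so the frequencies add up to the number of observations.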
Data Visualization: Bar Chart
Visually similar to a histogram, but:
Categories need not be equally ranged continuous values.
The y-axis can represent things other than frequency.
Usually whitespace between the bars.
Example: Marital Status of home loan applicants:
Single: ~100
Married: ~350
Widowed: ~150
De Facto: ~50
Separated: ~40
Divorced: ~120
Data Visualization: Pie Chart
A way to visualize categorical data: frequencies are shown as segments of a circle.
Example: Marital Status of home loan applicants (same categories as bar chart)
Tip: Pie charts are rarely a good idea, and never when there are a large number of categories
Summary Statistics: Central Tendency
Describing Data: Central Tendency, Variability, Skewness
Notation
Random variables: Denoted by capital letters ($X$, $Y$).
$X$: Number of children in a household.
$Y$: Amount of time spent by the husband on housework per day.
Realizations/observations of a random variable: Lowercase letters with subscript ($x_i$, $y_i$).
$x_1$: Number of children in household 1.
$y_{137}$: Amount of time spent by husband 137 on housework per day.
$N$ and $n$: Denote the size or number of observations.
$N$: Population size (usually very large, can be infinite).
$n$: Sample size, i.e., the number of data points collected in a sample.
Central Tendency
Definition: Measures of central tendency provide information about the center of the distribution of a random variable; they indicate a typical, middle, or average value. They are also called measures of location.
Mean: Arithmetic average value.
Mode: Most commonly occurring value.
Median: Middle value in an ordered array.
Central Tendency: Mean
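Population Mean: Denoted by $\mu$ or $E(X)$, the expectation of $X$. Computed by: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample Mean: Denoted by $\bar{x}$, called X bar. Computed by: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$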
Example
Random variable: Height of females aged between 25 and 40.
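John has a sample of randomly chosen females aged between 25 and 40:
Heights are 157cm, 163cm, 166cm, 148cm, 174cm, 165cm, 168cm.
Sample size $n = 7$.
Sample mean: $\bar{x} = \frac{157 + 163 + 166 + 148 + 174 + 165 + 168}{7} = \frac{1141}{7} = 163$ cm.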
Example: Gamble
Gamble: Tossing a fair coin.
Heads: Receive $10.
Tails: Pay $10.
Scenario: Play the gamble 100 times; 60 heads, 40 tails.
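Sample mean: $\bar{x} = \frac{60 \times 10 + 40 \times (-10)}{100} = 2$ (an average gain of 2 dollars per play).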
Example: Population Mean
Consider the same gamble: fair coin toss, receive/pay $10 for heads/tails
Population mean: $\mu = E(X) = 0.5 \times 10 + 0.5 \times (-10) = 0$.
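The three means in these examples can be checked with a short Python sketch. The data come from the examples above; the variable names are illustrative, and the population mean is computed as a probability-weighted average under the stated assumption of a fair coin.

```python
# Heights (cm) from John's sample above.
heights = [157, 163, 166, 148, 174, 165, 168]
sample_mean_height = sum(heights) / len(heights)

# Gamble: +10 for heads, -10 for tails; observed 60 heads and 40 tails.
payoffs = [10] * 60 + [-10] * 40
sample_mean_gamble = sum(payoffs) / len(payoffs)

# Population mean (expectation) for a fair coin: each outcome has probability 0.5.
population_mean_gamble = 0.5 * 10 + 0.5 * (-10)

print(sample_mean_height)      # 163.0
print(sample_mean_gamble)      # 2.0
print(population_mean_gamble)  # 0.0
```

The sample mean of the gamble (2) differs from its population mean (0) because a particular run of 100 tosses need not split exactly 50/50.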
Central Tendency: Mode
The mode is the most commonly occurring value.
Example: Waiting times of people in a queue (minutes):
2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)
What is the mode? 3 (occurs six times).
Random variables with two modes are bimodal; with multiple modes, multimodal.
Central Tendency: Median
The median is the middle value in an ordered array.
Example: Waiting times (minutes):
2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)
What is the median? Median is 2.5 minutes (lies in the middle of the 14 numbers).
Median = $\frac{2 + 3}{2} = 2.5$ minutes (the average of the 7th and 8th ordered values).
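The mode and median of the waiting times can be verified with Python's standard `statistics` module. This is a minimal sketch; `multimode` is included only to show how tied modes (bimodal or multimodal data) would be reported.

```python
import statistics

waiting_times = [2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1]

mode = statistics.mode(waiting_times)        # most frequent value
modes = statistics.multimode(waiting_times)  # list of all modes (more than one if tied)
median = statistics.median(waiting_times)    # averages the two middle values when n is even

print(mode, modes, median)  # 3 [3] 2.5
```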
Central Tendency: Qualitative Data
Example: University major of employees (1=marketing, 2=finance, 3=economics, 4=law, 5=others).
Recorded data: 2, 5, 3, 1, 4, 2, 5, 3, 4, 2, 1 (ordered: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5).
Mode is 2, median is 3, mean is 2.9091.
Question: Which measure of central tendency is the most appropriate?
Summary Statistics: Variability
Descriptive Statistic: Variability
Definition: Measures of variability provide information about how dispersed the values of a random variable are around the mean (measures of scale, spread, dispersion, or risk).
Variance (Var): Average of squared distance from the mean.
Standard deviation (std): Square root of variance.
Coefficient of variation (CV): $\sigma / \mu$, the standard deviation divided by the mean. Measures risk per unit of expected return.
Variability: Example
Question: Which stock would you invest in, based on data on their weekly returns?
Stock X and Stock Y have the same mean weekly return of 1.5%, meaning that every week both stocks are expected to grow by 1.5% (on average). But which one do you prefer?
Variability: Formulas
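Population Variance: Denoted by $\sigma^2$ or $\mathrm{Var}(X)$. Computed by: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Sample Variance: Denoted by $s^2$. Computed by: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$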
Variability: Variance
Computes the average squared distance between data points and their mean.
Given: data points +2, +4, -6 and mean 0, the squared distances from the mean are 4, 16, and 36.
Example: Waiting Time
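$X$: waiting time of people in a queue (in minutes)
Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14
Population: $\mu = \frac{149}{15} \approx 9.93$, $\sigma^2 = \frac{1}{15} \sum_{i=1}^{15} (x_i - \mu)^2 \approx 3.93$.
Sample: $\bar{x} = \frac{149}{15} \approx 9.93$, $s^2 = \frac{1}{14} \sum_{i=1}^{15} (x_i - \bar{x})^2 \approx 4.21$.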
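The two variances for the waiting-time observations can be computed step by step in Python. This sketch assumes the usual conventions: divide by $N$ when the data are treated as the whole population, and by $n - 1$ when they are treated as a sample. The standard-library functions `statistics.pvariance` and `statistics.variance` are shown alongside for comparison.

```python
import statistics

waiting_times = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]
n = len(waiting_times)
mean = sum(waiting_times) / n

# Sum of squared distances from the mean.
squared_distances = sum((x - mean) ** 2 for x in waiting_times)

population_variance = squared_distances / n    # data treated as the population
sample_variance = squared_distances / (n - 1)  # data treated as a sample

print(round(mean, 4), round(population_variance, 4), round(sample_variance, 4))
# 9.9333 3.9289 4.2095

# The standard library implements the same two definitions.
print(round(statistics.pvariance(waiting_times), 4))
print(round(statistics.variance(waiting_times), 4))
```

The sample variance is always slightly larger than the population variance computed from the same numbers, because the same sum is divided by a smaller number.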
Variance: Remarks
Q1: Why sum up or average out squared distance instead of distance?
Distance in different directions may cancel out, not suitable for measuring variability.
Q2: What is the unit?
Distance such as $x_i - \mu$ is in the unit of the data.
Squared distance such as $(x_i - \mu)^2$ is in the unit of the data squared!
Example: For waiting times, distance such as $x_i - \mu$ is in minutes. Squared distance such as $(x_i - \mu)^2$ is in minutes squared.
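The cancellation described in Q1 is easy to see numerically. The short Python sketch below uses the three data points +2, +4, -6 from the variance illustration; the variable names are illustrative.

```python
# The three data points from the variance illustration above.
data = [2, 4, -6]
mean = sum(data) / len(data)

distances = [x - mean for x in data]
squared = [d ** 2 for d in distances]

print(sum(distances))  # 0.0: positive and negative distances cancel
print(sum(squared))    # 56.0: squared distances cannot cancel
```

Distances from the mean always sum to zero, whatever the data, so their plain sum or average carries no information about spread.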
Standard Deviation
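Population Standard Deviation: Denoted by $\sigma$ or $\mathrm{std}(X)$. Computed by: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$
Sample Standard Deviation: Denoted by $s$. Computed by: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$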
Standard Deviation
Standard deviation solves the problem of squared units.
Has the same units as the original data.
In the waiting time example:
Population: $\sigma = \sqrt{3.93} \approx 1.98$ minutes.
Sample: $s = \sqrt{4.21} \approx 2.05$ minutes.
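For the waiting-time observations, both standard deviations can be obtained directly from Python's `statistics` module. A minimal sketch: `pstdev` treats the data as the population and `stdev` treats them as a sample.

```python
import statistics

waiting_times = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

population_sd = statistics.pstdev(waiting_times)  # square root of the population variance
sample_sd = statistics.stdev(waiting_times)       # square root of the sample variance

print(round(population_sd, 4), round(sample_sd, 4))  # 1.9821 2.0517
```

Both results are in minutes, the same unit as the observations.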
Standard Deviation - Example
Variance and standard deviation measure how spread out the distribution of a random variable is.
X: time spent on work, Y: time spent on leisure (per day) with 5 observations.
Means are the same ($\bar{x} = \bar{y}$). Variances are different ($s_X^2 = 2.5 < s_Y^2 = 12.5$).
Coefficient of Variation
Population CV(%): $\frac{\sigma}{\mu} \times 100$. Sample CV(%): $\frac{s}{\bar{x}} \times 100$.
It is unit free because both the numerator and denominator have the same unit as the original data.
Coefficient of Variation - Example
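$X$: waiting time of people in a queue (in minutes)
Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14
Population CV(%): $\frac{\sigma}{\mu} \times 100 = \frac{1.9821}{9.9333} \times 100 \approx 19.95$.
Sample CV(%): $\frac{s}{\bar{x}} \times 100 = \frac{2.0517}{9.9333} \times 100 \approx 20.65$.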
CV: Interpretation
CV is unit free. It measures standard deviation per unit of mean.
Example: Time on leisure per day in hours vs in minutes (same coefficient of variation if regarding the data as a sample).
Example: In finance, when the random variable denotes asset returns, CV measures risk per unit of expected return.
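The unit-free property can be illustrated with a short Python sketch. The leisure times below are hypothetical values chosen only for illustration, and `sample_cv_percent` is a helper defined here, not a library function.

```python
import statistics

def sample_cv_percent(data):
    """Sample coefficient of variation in percent: s / x-bar * 100."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical leisure time per day, in hours (illustrative values only).
leisure_hours = [1, 2, 3, 4, 5]
leisure_minutes = [60 * h for h in leisure_hours]

cv_hours = sample_cv_percent(leisure_hours)
cv_minutes = sample_cv_percent(leisure_minutes)

print(round(cv_hours, 2), round(cv_minutes, 2))  # 52.7 52.7
```

Converting hours to minutes multiplies both the standard deviation and the mean by 60, so the factor cancels in the ratio.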
Variability: Excel
Excel is our friend for assignment and your future career. Google is our friend for learning Excel.
Summary Statistics: Skewness
Descriptive Statistics: Shape
Central tendency and variability are useful to describe and summarise data.
They cannot summarise asymmetry.
Skewness is a measure of asymmetry (Calculating skewness will not be examined).
Skewness
Symmetric distribution (skewness = 0): median = mean
Right-skewed distribution (skewness > 0, positively skewed): median < mean
Left-skewed distribution (skewness < 0, negatively skewed): median > mean
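The relationship between the mean and the median under right skew can be seen in a small Python sketch. The data are hypothetical: most values are small and one is very large, which produces a long right tail.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large.
incomes = [30, 35, 40, 45, 50, 55, 400]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# The large value pulls the mean up; the median depends only on the middle position.
print(median < mean)  # True
```

Replacing 400 with 60 would make the data nearly symmetric and bring the mean and the median close together.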
Summary for week 1
Categorical data is summarised using tables and frequency counts, and visualised using bar charts or pie charts; histograms are used for continuous data.
Distribution is the general shape that shows the probability that a random variable takes a certain value.
Central tendency includes mean, mode (most commonly occurring value in an array of numbers) and median (the middle number if you sort the array)
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variability includes variance, standard deviation and coefficient of variation
Measure of shape: skewness
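Population Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$
Coefficient of variation: $\frac{\sigma}{\mu}$, $\frac{s}{\bar{x}}$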