Week 1: Exploring Data

Introduction to Statistics

Statistics: Quantifying the world in an objective way for good decision-making.
Modules:
- Descriptive Statistics and data communication (Module 1).
- Inferential Statistics (Module 2).
- Hypothesis Testing (Module 3).
- Regression Analysis (Module 4).
- Data ethics (Module 5).

Week 1: Exploring Data - Foundations of Descriptive Statistics

Data Types and Visualization

Preliminary Terms and Definitions

Variable: A characteristic, number, or quantity that can be measured or counted.
Random variable: A variable whose outcome is unknown before data collection
- Example: income of an Australian household.
Population: The complete set of all observations of a particular random variable
- Example: income of all Australian households.
Sample: A subset of the population
- Example: income of 100 households.
Goal: Describe and visualize information contained in different types of variables.

Types of Data

Variables are broadly classified as qualitative/categorical or quantitative/numerical.
Qualitative/Categorical:
- Nominal: Categories with no natural ordering
  - Example: 0/1 variable for male/female.
- Ordinal: Categories with a natural order, but the numeric codes themselves carry no quantitative meaning
  - Example: Agreement scale (don’t agree=-1/somewhat agree=0/completely agree=1).
Quantitative/Numerical:
- Discrete: Values can be listed (not infinitely divisible), often from counting
  - Example: Number of children in a household (0, 1, 2, 3, …).
- Continuous: Can take an infinite number of values within a range, often from measurement.
  - Example: Heights of professional basketball players.

Frequency Distributions

For qualitative/categorical data, visualize via a table displaying frequencies.
Example Table:
- Marital Status of home loan applicants:
  - Single: Frequency 102, Relative Frequency 0.1262, Percent Frequency 12.62
  - Married: Frequency 341, Relative Frequency 0.4220, Percent Frequency 42.20
  - Widowed: Frequency 155, Relative Frequency 0.1918, Percent Frequency 19.18
  - De Facto: Frequency 50, Relative Frequency 0.0619, Percent Frequency 6.19
  - Separated: Frequency 40, Relative Frequency 0.0495, Percent Frequency 4.95
  - Divorced: Frequency 120, Relative Frequency 0.1485, Percent Frequency 14.85
  - Total: Frequency 808, Relative Frequency 1, Percent Frequency 100

Key Terms

Frequency counts: Total occurrences for each category.
Relative frequency: Fraction/proportion of total data items in a category.
Percent frequency: Relative frequency × 100 (%).
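
The three quantities above can be reproduced from the counts in the marital status table. The following is a minimal Python sketch; the function name `frequency_table` is illustrative and not part of the course material.

```python
def frequency_table(counts):
    """Return {category: (frequency, relative frequency, percent frequency)}."""
    total = sum(counts.values())
    table = {}
    for category, freq in counts.items():
        rel = freq / total  # fraction of all observations in this category
        table[category] = (freq, round(rel, 4), round(rel * 100, 2))
    return table

# Frequency counts copied from the marital status table above.
marital_status = {
    "Single": 102, "Married": 341, "Widowed": 155,
    "De Facto": 50, "Separated": 40, "Divorced": 120,
}

table = frequency_table(marital_status)
for category, (freq, rel, pct) in table.items():
    print(f"{category:<10} {freq:>4} {rel:.4f} {pct:6.2f}")
```

The relative frequencies sum to 1 (up to rounding) because every observation belongs to exactly one category.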

Excel Function

Use COUNTIF(range, criteria) to obtain frequency counts, where criteria is the category value to count.
Example Formula: =COUNTIF(I$10:I$389, $D2)

Data Visualization: Histograms

Commonly used for continuous variables.
Steps:
1. Choose a bandwidth/bin size to group the data (here, incomes) into equally spaced, contiguous categories
  - e.g., in dollars: (0, 100], (100, 200], (200, 300], etc.
2. Plot frequencies for each group in a bar chart
  - Frequencies on the y-axis, categories on the x-axis.
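
The two steps can be sketched in Python as follows. This is only an illustration: the helper `bin_counts` and the income values are hypothetical, and the bins are taken to be half-open intervals of the form (lo, hi], as in the histogram example in these notes.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values falling into contiguous bins (lo, hi] of equal width."""
    counts = [0] * n_bins
    for v in values:
        # Bin k (0-based) covers (start + k*width, start + (k+1)*width].
        k = math.ceil((v - start) / width) - 1
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical weekly incomes in dollars, for illustration only.
incomes = [50, 100, 101, 180, 250, 299, 300]
counts = bin_counts(incomes, start=0, width=100, n_bins=3)

# Step 2: a text "bar chart" with categories listed and frequencies as bars.
for k, c in enumerate(counts):
    lo, hi = k * 100, (k + 1) * 100
    print(f"({lo}, {hi}]: {'#' * c} ({c})")
```

Because each bin starts exactly where the previous one ends, every income in the covered range falls into exactly one bin.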

Histogram Example

Income brackets and Frequencies:
- (6, 526]: Frequency near 35
- (526, 1046]: Frequency around 25
- (1046, 1566]: Frequency around 15
- (1566, 2086]: Frequency around 10
- (2086, 2606]: Frequency around 5
- (2606, 3126]: Frequency around 2

Histogram Considerations

In histograms for numerical data, ensure that the bins (here, income ranges) are contiguous.
Gaps or overlaps between bins do not accurately represent the data.

Data Visualization: Bar Chart

Visually similar to a histogram, but:
- Categories need not be equally ranged continuous values.
- The y-axis can represent things other than frequency.
- Usually whitespace between the bars.
Example: Marital Status of home loan applicants:
- Single: ~100
- Married: ~350
- Widowed: ~150
- De Facto: ~50
- Separated: ~40
- Divorced: ~120

Data Visualization: Pie Chart

A way to visualize categorical data, with frequencies shown as segments of a circle.
Example: Marital Status of home loan applicants (same categories as bar chart)
Tip: Pie charts are rarely a good idea, and never when there are a large number of categories

Summary Statistics: Central Tendency

Describing Data: Central Tendency, Variability, Skewness

Notation

Random variables: Denoted by capital letters (𝑋, 𝑌).
- 𝑋: Number of children in a household.
- 𝑌: Amount of time spent by the husband on housework per day.
Realizations/observations of a random variable: Lowercase letters with subscript ($x_i$, $y_i$).
- $x_1$: Number of children in household 1.
- $y_{137}$: Amount of time spent by husband 137 on housework per day.
𝑁 and 𝑛: Denote the size or number of observations.
- 𝑁: Population size (usually very large, can be infinite).
- 𝑛: Sample size, i.e., the number of data points collected in a sample.

Central Tendency

Definition: Measures of central tendency provide information about the center of the distribution of a random variable; they indicate a typical, middle, or average value (also called measures of location).
1. Mean: Arithmetic average value.
2. Mode: Most commonly occurring value.
3. Median: Middle value in an ordered array.

Central Tendency: Mean

Population Mean: Denoted by $\mu$ or $E(X)$, the expectation of $X$. Computed by:
- $\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Mean: Denoted by $\bar{X}$, called X bar. Computed by:
- $\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Example

Random variable: Height of females aged between 25 and 40.
John has a sample of randomly chosen females aged between 25 and 40:
- Heights are 157cm, 163cm, 166cm, 148cm, 174cm, 165cm, 168cm.
Sample size $n = 7$. ($x_1 = 157\text{cm}, \dots, x_7 = 168\text{cm}$)
Sample mean: $\bar{X} = \frac{157 + 163 + 166 + 148 + 174 + 165 + 168}{7} = 163 \text{cm}$
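
As a quick check of the arithmetic, the sample mean of John's seven heights can be computed directly. This is a minimal sketch; the variable names are illustrative.

```python
# Heights (cm) from John's sample above.
heights_cm = [157, 163, 166, 148, 174, 165, 168]

n = len(heights_cm)               # sample size
sample_mean = sum(heights_cm) / n  # (1/n) * sum of x_i
print(n, sample_mean)
```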

Example: Gamble

Gamble: Tossing a fair coin.
- Heads: Receive $10.
- Tails: Pay $10.
Scenario: Play the gamble 100 times; 60 heads, 40 tails.
Sample mean: $\bar{X} = \frac{60 \times 10 + 40 \times (-10)}{100} = \frac{600 - 400}{100} = 2$

Example: Population Mean

Consider the same gamble: fair coin toss, receive/pay $10 for heads/tails
Population mean:
- $E(X) = 0.5 \times 10 + 0.5 \times (-10) = 0$
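
The contrast between the sample mean (what happened in 100 plays) and the population mean (what a fair coin delivers on average) can be sketched as below. The variable names are illustrative; the numbers come from the gamble example.

```python
# Sample: 60 heads (+$10 each) and 40 tails (-$10 each) in 100 plays.
payoffs = [10] * 60 + [-10] * 40
sample_mean = sum(payoffs) / len(payoffs)

# Population: each outcome weighted by its probability under a fair coin.
outcomes = {10: 0.5, -10: 0.5}
population_mean = sum(x * p for x, p in outcomes.items())

print(sample_mean, population_mean)
```

The sample mean depends on the particular run of tosses, whereas the population mean is a fixed property of the gamble.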

Central Tendency: Mode

The mode is the most commonly occurring value.
Example: Waiting times of people in a queue (minutes):
- 2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)
What is the mode? 3 (occurs six times).
Distributions with two modes are called bimodal; those with more than two modes are called multimodal.
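
A small sketch of finding the mode by counting occurrences is given below. The helper `modes` is a hypothetical name; it returns a list so that bimodal or multimodal data are handled as well.

```python
from collections import Counter

def modes(data):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, c in counts.items() if c == top)

# Waiting times (minutes) from the example above.
waiting = [2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1]
print(modes(waiting), Counter(waiting)[3])
```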

Central Tendency: Median

The median is the middle value in an ordered array.
Example: Waiting times (minutes):
- 2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)
What is the median? Median is 2.5 minutes (lies in the middle of the 14 numbers).
- Median = $\frac{2+3}{2} = 2.5$
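
The rule used here (sort, then take the middle value, or the average of the two middle values when the count is even) can be written as a short function. This is a sketch with an illustrative function name.

```python
def median(data):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Waiting times (minutes) from the example above: 14 values, so n is even.
waiting = [2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1]
print(median(waiting))
```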

Central Tendency: Qualitative Data

Example: University major of employees (1=marketing, 2=finance, 3=economics, 4=law, 5=others).
Recorded data: 2, 5, 3, 1, 4, 2, 5, 3, 4, 2, 1 (ordered: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5).
Mode is 2, median is 3, mean is 2.9091.
Question: Which measure of central tendency is the most appropriate?

Summary Statistics: Variability

Descriptive Statistic: Variability

Definition: Measures of variability provide information about how dispersed the values of a random variable are around the mean (measures of scale, spread, dispersion, or risk).
1. Variance (Var): Average of squared distance from the mean.
2. Standard deviation (std): Square root of variance.
3. Coefficient of variation: $\frac{\text{std}}{\text{mean}} \times 100\%$ . Measures risk per unit of expected return.

Variability: Example

Question: Which stock should you invest in, based on data on their weekly returns?
Stock X and Stock Y have $E(X) = E(Y) = 1.5\%$, meaning that both stocks are expected to grow by 1.5% per week on average. But which one do you prefer?

Variability: Formulas

Population Variance: Denoted by $\sigma^2$ or $Var(X)$. Computed by:
- $\sigma^2 = Var(X) = \frac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample Variance: Denoted by $s^2$. Computed by:
- $s^2 = \frac{(x_1 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{X})^2$

Variability: Variance

Computes the average squared distance between data points and their mean.
Given: data points 14, 6, 16 with $\bar{X} = 12$ (deviations from the mean: +2, -6, +4)
$x_1 = 14, x_1 - \bar{X} = 2, (x_1 - \bar{X})^2 = 4$
$x_2 = 6, x_2 - \bar{X} = -6, (x_2 - \bar{X})^2 = 36$
$x_3 = 16, x_3 - \bar{X} = 4, (x_3 - \bar{X})^2 = 16$
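
To finish this small calculation, the squared distances can be summed and divided by $N$ (population variance) or by $n - 1$ (sample variance). The sketch below uses the three data points 14, 6 and 16 implied by the deviations shown; the variable names are illustrative.

```python
data = [14, 6, 16]

mean = sum(data) / len(data)
squared_dev = [(x - mean) ** 2 for x in data]  # squared distance of each point from the mean

population_var = sum(squared_dev) / len(data)    # divide by N
sample_var = sum(squared_dev) / (len(data) - 1)  # divide by n - 1

print(mean, squared_dev, population_var, sample_var)
```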

Example: Waiting Time

$X$ : waiting time of people in a queue (in minutes)
Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14
Population: $N = 15, \mu = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93 \text{ minutes}$. $\sigma^2 = \frac{1}{15}\sum_{i=1}^{15} (x_i - \mu)^2 = 3.929$
Sample: $n = 15, \bar{X} = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93 \text{ minutes}$. $s^2 = \frac{1}{14}\sum_{i=1}^{15} (x_i - \bar{X})^2 = 4.210$
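
These figures can be checked with Python's standard `statistics` module, where `pvariance` divides by $N$ and `variance` divides by $n - 1$. This is a sketch for verification only; the notes themselves use Excel.

```python
import statistics

# Waiting times (minutes) from the example above.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

mean = statistics.mean(waiting)
population_var = statistics.pvariance(waiting)  # treats the data as the whole population
sample_var = statistics.variance(waiting)       # treats the data as a sample

print(round(mean, 2), round(population_var, 3), round(sample_var, 3))
```

The same 15 numbers give two different variances because only the divisor changes: 15 for the population formula and 14 for the sample formula.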

Variance: Remarks

Q1: Why sum up or average out squared distance instead of distance?
- Signed distances in opposite directions cancel out (around the mean they always sum to zero), so they are not suitable for measuring variability.
Q2: What is the unit?
- Distance such as $x_1 - \mu$ is in the unit of the data.
- Squared distance such as $(x_1 - \mu)^2$ is in the unit of the data squared!
- Example: Distance such as $x_1 - \mu = (12-9.93)$ is in minutes. Squared distance such as $(x_1 - \mu)^2 = (12-9.93)^2$ is in minutes squared.

Standard Deviation

Population Standard Deviation: Denoted by $\sigma$ or $std(X)$. Computed by:
- $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Sample Standard Deviation: Denoted by $s$. Computed by:
- $s = \sqrt{s^2} = \sqrt{\frac{(x_1 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{X})^2}$

Standard Deviation

Standard deviation solves the problem of squared units.
Has the same units as the original data.
In the waiting time example:
- Population: $\sigma = \sqrt{3.929 \text{ minutes}^2} = 1.982 \text{ minutes}$
- Sample: $s = \sqrt{4.210 \text{ minutes}^2} = 2.052 \text{ minutes}$
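
The square-root step can be written out from the formulas as follows. This is a minimal sketch; the function `std_dev` and its `sample` flag are illustrative names, not part of the notes.

```python
import math

def std_dev(data, sample=True):
    """Standard deviation: divide by n - 1 for a sample, by N for a population."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_sq / divisor)

# Waiting times (minutes) from the example above.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

sigma = std_dev(waiting, sample=False)
s = std_dev(waiting, sample=True)
print(round(sigma, 3), round(s, 3))
```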

Standard Deviation - Example

Variance and standard deviation measure how spread out the distribution of a random variable is.
X: time spent on work, Y: time spent on leisure (per day) with 5 observations.
Means are the same $(\bar{X} = \bar{Y} = 6)$. Variances are different $(s_X^2 = 2.5 < s_Y^2 = 12.5)$.
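
The notes do not list the five observations, so the values below are hypothetical, chosen only so that the means and sample variances match the figures stated above. The sketch shows how two variables with the same mean can differ in spread.

```python
import statistics

# Hypothetical hours per day (NOT the original data), picked to reproduce
# mean 6 for both and sample variances 2.5 and 12.5.
work = [4, 5, 6, 7, 8]
leisure = [2, 3, 6, 9, 10]

for name, data in (("work", work), ("leisure", leisure)):
    print(name, statistics.mean(data), statistics.variance(data))
```

The leisure values lie further from 6 on both sides, which is what the larger variance records.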

Coefficient of Variation

Population CV(%): $CV = \frac{\sigma}{\mu} \times 100\%$ . Sample CV(%): $CV = \frac{s}{\bar{X}} \times 100\%$ .
It is unit free because both the numerator and denominator have the same unit as the original data.

Coefficient of Variation - Example

$X$ : waiting time of people in a queue (in minutes)
Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14
Population CV(%): $CV = \frac{1.982 \text{ minutes}}{9.93 \text{ minutes}} \times 100\% = 19.96\%$ .
Sample CV(%): $CV = \frac{2.052 \text{ minutes}}{9.93 \text{ minutes}} \times 100\% = 20.66\%$ .
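
A sketch of the same calculation from the raw observations is shown below, with `cv_percent` as an illustrative helper name. Because the notes divide rounded values (1.982, 2.052 and 9.93), a full-precision computation may differ from the figures above in the last decimal place.

```python
import statistics

def cv_percent(data, sample=True):
    """Coefficient of variation in percent: standard deviation / mean * 100."""
    sd = statistics.stdev(data) if sample else statistics.pstdev(data)
    return sd / statistics.mean(data) * 100

# Waiting times (minutes) from the example above.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

population_cv = cv_percent(waiting, sample=False)
sample_cv = cv_percent(waiting, sample=True)
print(f"{population_cv:.2f}% {sample_cv:.2f}%")
```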

CV: Interpretation

$CV$ is unit free. It measures standard deviation per unit of mean.
Example: Time on leisure per day measured in hours vs in minutes gives the same coefficient of variation, because the change of unit scales the standard deviation and the mean by the same factor.
Example: In finance when the random variable $X$ denotes asset returns, $CV$ measures risk per unit of expected return.
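
The unit-free property can be illustrated with a short sketch. The leisure times below are hypothetical values used only for illustration, and `cv_percent` is an illustrative helper name.

```python
import math
import statistics

def cv_percent(data):
    """Sample coefficient of variation in percent."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical leisure time per day, for illustration only.
leisure_hours = [1.5, 2.0, 2.5, 3.0, 4.0]
leisure_minutes = [h * 60 for h in leisure_hours]

# The standard deviation changes with the unit, the CV does not.
print(statistics.stdev(leisure_hours), statistics.stdev(leisure_minutes))
print(cv_percent(leisure_hours), cv_percent(leisure_minutes))
```

Converting hours to minutes multiplies both the standard deviation and the mean by 60, so the factor cancels in the ratio.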

Variability: Excel

Excel is our friend for the assignment and for your future career. Google is our friend for learning Excel.

Summary Statistics: Skewness

Descriptive Statistics: Shape

Central tendency and variability are useful to describe and summarise data.
They cannot summarise asymmetry.
Skewness is a measure of asymmetry (Calculating skewness will not be examined).

Skewness

Symmetric distribution (skewness = 0): median = mean
Right-skewed distribution (skewness > 0, positively skewed): typically median < mean
Left-skewed distribution (skewness < 0, negatively skewed): typically median > mean
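
The link between the sign of skewness and the ordering of median and mean can be explored with a small sketch. It assumes the moment-based definition of skewness (third central moment divided by the 1.5 power of the second), which is one of several conventions and is not specified in the notes; the data sets are hypothetical.

```python
import statistics

def skewness(data):
    """Moment-based skewness m3 / m2**1.5 (an assumed convention)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value pulls the mean up
left_skewed = [-x for x in right_skewed]   # mirror image

for name, data in (("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)):
    print(name, skewness(data), statistics.median(data), statistics.mean(data))
```

In the right-skewed set the single large value raises the mean above the median, while the median stays near the bulk of the data.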

Summary for week 1

Categorical data is summarised using tables and frequency counts, and visualised using bar charts or pie charts; histograms are used for continuous data.
Distribution is the general shape that shows the probability that a random variable takes a certain value.
Central tendency includes mean, mode (most commonly occurring value in an array of numbers) and median (the middle number if you sort the array)
- Population mean: $\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Sample mean: $\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variability includes variance, standard deviation and coefficient of variation
Measure of shape: skewness
Population Variance: $\sigma^2 = Var(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$
Standard deviation: $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$, $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2}$
Coefficient of variation: $\frac{\sigma}{\mu} \times 100\%$, $\frac{s}{\bar{X}} \times 100\%$