Week 1: Exploring Data

Introduction to Statistics

  • Statistics: Quantifying the world in an objective way for good decision-making.

  • Modules:

    • Descriptive Statistics and data communication (Module 1).

    • Inferential Statistics (Module 2).

    • Hypothesis Testing (Module 3).

    • Regression Analysis (Module 4).

    • Data ethics (Module 5).

Week 1: Exploring Data - Foundations of Descriptive Statistics

  • Data Types and Visualization

Preliminary Terms and Definitions

  • Variable: A characteristic, number, or quantity that can be measured or counted.

  • Random variable: A variable whose outcome is unknown before data collection

    • Example: income of an Australian household.

  • Population: The complete set of all possible observations of a particular random variable

    • Example: income of all Australian households.

  • Sample: A subset of the population

    • Example: income of 100 households.

  • Goal: Describe and visualize information contained in different types of variables.

Types of Data

  • Variables are broadly classified as qualitative/categorical or quantitative/numerical.

  • Qualitative/Categorical:

    • Nominal: Categories with no natural ordering

      • Example: 0/1 variable for male/female.

    • Ordinal: Categories with a natural order, but the numeric codes carry no meaning beyond that order

      • Example: Agreement scale (don’t agree=-1/somewhat agree=0/completely agree=1).

  • Quantitative/Numerical:

    • Discrete: Values can be listed (not infinitely divisible), often from counting

      • Example: Number of children in a household (0, 1, 2, 3, …).

    • Continuous: Can take an infinite number of values within a range, often from measurement.

      • Example: Heights of professional basketball players.

Frequency Distributions

  • For qualitative/categorical data, visualize via a table displaying frequencies.

  • Example Table:

    • Marital Status of home loan applicants:

      • Single: Frequency 102, Relative Frequency 0.1262, Percent Frequency 12.62

      • Married: Frequency 341, Relative Frequency 0.4220, Percent Frequency 42.20

      • Widowed: Frequency 155, Relative Frequency 0.1918, Percent Frequency 19.18

      • De Facto: Frequency 50, Relative Frequency 0.0619, Percent Frequency 6.19

      • Separated: Frequency 40, Relative Frequency 0.0495, Percent Frequency 4.95

      • Divorced: Frequency 120, Relative Frequency 0.1485, Percent Frequency 14.85

      • Total: Frequency 808, Relative Frequency 1, Percent Frequency 100

Key Terms

  • Frequency counts: Total occurrences for each category.

  • Relative frequency: Fraction/proportion of total data items in a category.

  • Percent frequency: Relative frequency × 100 (%).
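The three key terms can be checked against the marital status table with a short Python sketch. The counts are copied from the table; the variable names are illustrative only.

```python
# Frequency counts copied from the marital status table above.
counts = {
    "Single": 102,
    "Married": 341,
    "Widowed": 155,
    "De Facto": 50,
    "Separated": 40,
    "Divorced": 120,
}

total = sum(counts.values())  # total number of applicants

# Relative frequency = count / total; percent frequency = relative frequency * 100.
relative = {status: n / total for status, n in counts.items()}
percent = {status: 100 * share for status, share in relative.items()}

for status, n in counts.items():
    print(f"{status:<10} {n:>4} {relative[status]:.4f} {percent[status]:6.2f}")
print(f"{'Total':<10} {total:>4} {sum(relative.values()):.4f}")
```

The relative frequencies sum to 1 (up to floating-point rounding), which is a quick sanity check on any frequency table.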

Excel Function

  • Use COUNTIF(range, criteria) to obtain the frequency count of the cells in a range that match a given category

  • Example Formula: =COUNTIF(I$10:I$389, $D2)
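For readers working outside Excel, a rough Python analogue of COUNTIF is sketched below. The list of responses is hypothetical (it is not the lecture's data set), and `collections.Counter` is one of several ways to do the count.

```python
from collections import Counter

# Hypothetical raw responses, one entry per applicant (not the lecture's data).
responses = ["Married", "Single", "Married", "Divorced", "Single", "Married"]

# Counter maps each category to its number of occurrences.
frequency = Counter(responses)

# Plays the role of =COUNTIF(range, "Married") for this list.
print(frequency["Married"])

# A category that never occurs has a count of zero rather than raising an error.
print(frequency["Widowed"])
```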

Data Visualization: Histograms

  • Commonly used for continuous variables.

  • Steps:

    1. Choose a bin width (bin size) to group the values (here, incomes) into equally spaced, contiguous intervals

      • e.g., $0-$100, $101-$200, $201-$300, etc.

    2. Plot frequencies for each group in a bar chart

      • Frequencies on the y-axis, categories on the x-axis.

Histogram Example

  • Income brackets and Frequencies:

    • (6, 526]: Frequency near 35

    • (526, 1046]: Frequency around 25

    • (1046, 1566]: Frequency around 15

    • (1566, 2086]: Frequency around 10

    • (2086, 2606]: Frequency around 5

    • (2606, 3126]: Frequency around 2
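The binning step behind such a histogram can be sketched in Python. The bin edges below match the example (start 6, width 520, six bins, each interval open on the left and closed on the right). The income values are invented for illustration, and `bin_counts` is a hypothetical helper, not a library function.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values in n_bins contiguous intervals (lo, hi] of equal width."""
    counts = [0] * n_bins
    for v in values:
        # ceil maps a value in (start + k*width, start + (k+1)*width] to index k.
        k = math.ceil((v - start) / width) - 1
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical weekly incomes, invented for illustration only.
incomes = [120, 300, 526, 527, 900, 1046, 1300, 2000, 2500, 3126]

# Bin edges follow the example above: 6, 526, 1046, ..., 3126.
edges = [6 + 520 * k for k in range(7)]
counts = bin_counts(incomes, start=6, width=520, n_bins=6)

for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"({lo}, {hi}]: {c}")
```

Because each upper edge is also the next lower edge, every value falls into exactly one bin, so there are no gaps or overlaps.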

Histogram Considerations

  • In histograms for numerical data, ensure that the bins (here, income ranges) are contiguous.

  • Gaps or overlaps between bins misrepresent the data.

Data Visualization: Bar Chart

  • Visually similar to a histogram, but:

    • Categories need not be equally ranged continuous values.

    • The y-axis can represent things other than frequency.

    • Usually whitespace between the bars.

  • Example: Marital Status of home loan applicants:

    • Single: ~100

    • Married: ~350

    • Widowed: ~150

    • De Facto: ~50

    • Separated: ~40

    • Divorced: ~120

Data Visualization: Pie Chart

  • A way to visualize categorical data, with frequencies shown as segments of a circle

  • Example: Marital Status of home loan applicants (same categories as bar chart)

  • Tip: Pie charts are rarely a good idea, and never when there are a large number of categories

Summary Statistics: Central Tendency

  • Describing Data: Central Tendency, Variability, Skewness

Notation

  • Random variables: Denoted by capital letters (𝑋, 𝑌).

    • 𝑋: Number of children in a household.

    • 𝑌: Amount of time spent by the husband on housework per day.

  • Realizations/observations of a random variable: Lowercase letters with subscript (𝑥𝑖, 𝑦𝑖).

    • 𝑥1: Number of children in household 1.

    • 𝑦137: Amount of time spent by husband 137 on housework per day.

  • 𝑁 and 𝑛: Denote the size or number of observations.

    • 𝑁: Population size (usually very large, can be infinite).

    • 𝑛: Sample size, i.e., the number of data points collected in a sample.

Central Tendency

  • Definition: Measures of central tendency provide information about the center of the distribution of a random variable; indicate a typical, middle, or average value. (Measures of location)

    1. Mean: Arithmetic average value.

    2. Mode: Most commonly occurring value.

    3. Median: Middle value in an ordered array.

Central Tendency: Mean

  • Population Mean: Denoted by $\mu$ or $E(X)$, the expectation of $X$. Computed by:

    • $\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$

  • Sample Mean: Denoted by $\bar{X}$, called X bar. Computed by:

    • $\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Example

  • Random variable: Height of females aged between 25 and 40.

  • John has a sample of randomly chosen females aged between 25 and 40:

    • Heights are 157cm, 163cm, 166cm, 148cm, 174cm, 165cm, 168cm.

  • Sample size $n = 7$ ($x_1 = 157\text{ cm}, \dots, x_7 = 168\text{ cm}$).

  • Sample mean: $\bar{X} = \frac{157 + 163 + 166 + 148 + 174 + 165 + 168}{7} = 163 \text{ cm}$
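The same arithmetic as a minimal Python sketch, using the seven heights listed in the example (variable names are illustrative):

```python
# Heights in cm, copied from the example above.
heights_cm = [157, 163, 166, 148, 174, 165, 168]

n = len(heights_cm)               # sample size
sample_mean = sum(heights_cm) / n # sum of observations divided by n

print(n, sample_mean)
```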

Example: Gamble

  • Gamble: Tossing a fair coin.

    • Heads: Receive $10.

    • Tails: Pay $10.

  • Scenario: Play the gamble 100 times; 60 heads, 40 tails.

  • Sample mean: $\bar{X} = \frac{60 \times 10 + 40 \times (-10)}{100} = \frac{600 - 400}{100} = 2$

Example: Population Mean

  • Consider the same gamble: fair coin toss, receive/pay $10 for heads/tails

  • Population mean:

    • $E(X) = 0.5 \times 10 + 0.5 \times (-10) = 0$
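The contrast between the two gamble examples can be sketched in Python: the population mean weights each payoff by its probability, while the sample mean weights each payoff by how often it was actually observed. The dictionary names are illustrative.

```python
# Payoff in dollars for each outcome of the fair coin toss.
payoffs = {"heads": 10, "tails": -10}

# Population mean: payoffs weighted by their probabilities.
probabilities = {"heads": 0.5, "tails": 0.5}
population_mean = sum(probabilities[side] * payoffs[side] for side in payoffs)

# Sample mean: payoffs weighted by the observed counts over 100 plays.
observed = {"heads": 60, "tails": 40}
n = sum(observed.values())
sample_mean = sum(observed[side] * payoffs[side] for side in payoffs) / n

print(population_mean, sample_mean)
```

The sample mean (2) differs from the population mean (0) only because this particular sample happened to contain more heads than tails.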

Central Tendency: Mode

  • The mode is the most commonly occurring value.

  • Example: Waiting times of people in a queue (minutes):

    • 2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)

  • What is the mode? 3 (occurs six times).

  • Random variables with two modes are bimodal; with multiple modes, multimodal.

Central Tendency: Median

  • The median is the middle value in an ordered array.

  • Example: Waiting times (minutes):

    • 2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1 (ordered: 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4)

  • What is the median? Median is 2.5 minutes (lies in the middle of the 14 numbers).

    • Median = $\frac{2+3}{2} = 2.5$
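Both answers for the waiting time data can be reproduced with Python's standard `statistics` module. This is a sketch; `multimode` is used rather than `mode` so that ties between several modes would be visible.

```python
from statistics import median, multimode

# Waiting times in minutes, in the order recorded above.
waiting = [2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1]

modes = multimode(waiting)  # list of all most frequent values
middle = median(waiting)    # averages the two middle values when n is even

print(sorted(waiting))
print(modes, waiting.count(modes[0]), middle)
```

With 14 observations there is no single middle value, so the median is the average of the 7th and 8th ordered values.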

Central Tendency: Qualitative Data

  • Example: University major of employees (1=marketing, 2=finance, 3=economics, 4=law, 5=others).

  • Recorded data: 2, 5, 3, 1, 4, 2, 5, 3, 4, 2, 1 (ordered: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5).

  • Mode is 2, median is 3, mean is 2.9091.

  • Question: Which measure of central tendency is the most appropriate?
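One way to explore the question is to recode the categories and see which measures survive. In the sketch below, `recode` is a hypothetical alternative numbering of the same five majors. Because the codes are nominal labels, any numbering is equally valid, so a measure that changes under recoding is arguably not meaningful for this data.

```python
from statistics import mean, median, mode

labels = {1: "marketing", 2: "finance", 3: "economics", 4: "law", 5: "others"}
majors = [2, 5, 3, 1, 4, 2, 5, 3, 4, 2, 1]

# Hypothetical alternative coding of the same categories (old code -> new code).
recode = {1: 3, 2: 1, 3: 5, 4: 2, 5: 4}
relabeled = {recode[code]: name for code, name in labels.items()}
recoded = [recode[m] for m in majors]

# Original coding: mode category, median category, mean of the codes.
print(labels[mode(majors)], labels[median(majors)], round(mean(majors), 4))

# Alternative coding: only the mode still points at the same major.
print(relabeled[mode(recoded)], relabeled[median(recoded)], round(mean(recoded), 4))
```

Under both codings the mode is finance, while the median category and the mean change, which suggests the mode is the appropriate measure for nominal data.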

Summary Statistics: Variability

  • Describing Data: Central Tendency, Variability, Skewness

Descriptive Statistic: Variability

  • Definition: Measures of variability provide information about how dispersed the values of a random variable are around the mean (measures of scale, spread, dispersion, or risk).

    1. Variance (Var): Average of squared distance from the mean.

    2. Standard deviation (std): Square root of variance.

    3. Coefficient of variation: $\frac{\text{std}}{\text{mean}} \times 100\%$. Measures risk per unit of expected return.

Variability: Example

  • Question: Which stock should you invest in, based on data on their weekly returns?

  • Stock X and Stock Y have $E(X) = E(Y) = 1.5\%$, meaning that both stocks are expected to grow by 1.5% per week on average. But which one do you prefer?

Variability: Formulas

  • Population Variance: Denoted by $\sigma^2$ or $Var(X)$. Computed by:

    • $\sigma^2 = Var(X) = \frac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

  • Sample Variance: Denoted by $s^2$. Computed by:

    • $s^2 = \frac{(x_1 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{X})^2$

Variability: Variance

  • Computes the average squared distance between data points and their mean.

  • Given: data points 14, 6, 16 with $\bar{X} = 12$, so the deviations from the mean are +2, -6, +4.

  • $x_1 = 14,\ x_1 - \bar{X} = 2,\ (x_1 - \bar{X})^2 = 4$

  • $x_2 = 6,\ x_2 - \bar{X} = -6,\ (x_2 - \bar{X})^2 = 36$

  • $x_3 = 16,\ x_3 - \bar{X} = 4,\ (x_3 - \bar{X})^2 = 16$
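The three-point example can be carried through to a variance in a few lines of Python. This is a sketch that treats the three points ($x_1 = 14$, $x_2 = 6$, $x_3 = 16$) as a sample, so it divides by $n - 1$; dividing by $n$ instead would give the population formula.

```python
# The three data points from the example above, in index order x1, x2, x3.
data = [14, 6, 16]

n = len(data)
x_bar = sum(data) / n                    # sample mean

deviations = [x - x_bar for x in data]   # signed distance from the mean
squared = [d ** 2 for d in deviations]   # squaring removes the sign

sample_variance = sum(squared) / (n - 1) # divide by n - 1 for a sample

print(x_bar, deviations, squared, sample_variance)
```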

Example: Waiting Time

  • $X$: waiting time of people in a queue (in minutes)

  • Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14

  • Population: $N = 15$, $\mu = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93 \text{ minutes}$, $\sigma^2 = \frac{1}{15}\sum_{i=1}^{15} (x_i - \mu)^2 = 3.929$

  • Sample: $n = 15$, $\bar{X} = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93 \text{ minutes}$, $s^2 = \frac{1}{14}\sum_{i=1}^{15} (x_i - \bar{X})^2 = 4.210$
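Both variances can be reproduced with Python's standard `statistics` module, where `pvariance` divides by $N$ and `variance` divides by $n - 1$. A sketch using the 15 observations above:

```python
from statistics import fmean, pvariance, variance

# Waiting times in minutes, copied from the observations above.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

mean_wait = fmean(waiting)   # the same value serves as mu and as X-bar
pop_var = pvariance(waiting) # divides the squared deviations by N = 15
samp_var = variance(waiting) # divides the squared deviations by n - 1 = 14

print(round(mean_wait, 2), round(pop_var, 3), round(samp_var, 3))
```

The only difference between the two results is the divisor, so the sample variance is always slightly larger for the same data.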

Variance: Remarks

  • Q1: Why sum up or average out squared distance instead of distance?

    • Distances in opposite directions cancel out when summed, so the plain sum is not suitable for measuring variability.

  • Q2: What is the unit?

    • Distance such as $x_1 - \mu$ is in the unit of the data.

    • Squared distance such as $(x_1 - \mu)^2$ is in the unit of the data squared!

    • Example: Distance such as $x_1 - \mu = (12 - 9.93)$ is in minutes. Squared distance such as $(x_1 - \mu)^2 = (12 - 9.93)^2$ is in minutes squared.
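The cancellation in Q1 can be seen numerically with the waiting time data. In this sketch, the signed deviations sum to zero (up to floating-point rounding), while the squared deviations do not.

```python
# Waiting times in minutes, from the waiting time example.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

mu = sum(waiting) / len(waiting)

# Positive and negative deviations offset each other exactly.
signed_sum = sum(x - mu for x in waiting)

# Squared deviations are all non-negative, so nothing cancels.
squared_sum = sum((x - mu) ** 2 for x in waiting)

print(round(signed_sum, 10), round(squared_sum, 3))
```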

Standard Deviation

  • Population Standard Deviation: Denoted by $\sigma$ or $std(X)$. Computed by:

    • $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$

  • Sample Standard Deviation: Denoted by $s$. Computed by:

    • $s = \sqrt{s^2} = \sqrt{\frac{(x_1 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{X})^2}$

Standard Deviation

  • Standard deviation solves the problem of squared units.

  • Has the same units as the original data.

  • In the waiting time example:

    • Population: $\sigma = \sqrt{3.929 \text{ minutes}^2} = 1.982 \text{ minutes}$

    • Sample: $s = \sqrt{4.210 \text{ minutes}^2} = 2.052 \text{ minutes}$
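These two values can be checked with `pstdev` and `stdev` from Python's standard `statistics` module, which take the square root of the population and sample variance respectively. A sketch using the waiting time observations:

```python
from statistics import pstdev, stdev

# Waiting times in minutes, from the waiting time example.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

sigma = pstdev(waiting) # square root of the population variance
s = stdev(waiting)      # square root of the sample variance

print(round(sigma, 3), round(s, 3))
```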

Standard Deviation - Example

  • Variance and standard deviation measure how spread out the distribution of a random variable is.

  • X: time spent on work, Y: time spent on leisure (per day) with 5 observations.

  • Means are the same ($\bar{X} = \bar{Y} = 6$). Variances are different ($s_X^2 = 2.5 < s_Y^2 = 12.5$).
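The notes do not list the five observations behind this example, so the sketch below uses hypothetical values chosen only to reproduce the stated means and sample variances. It illustrates how two samples with the same centre can differ in spread.

```python
from statistics import mean, variance

# Hypothetical observations (hours per day), chosen to match the stated
# summary statistics; they are not the lecture's actual data.
work = [4, 5, 6, 7, 8]      # X: tightly clustered around 6
leisure = [2, 3, 6, 9, 10]  # Y: same centre, more spread out

print(mean(work), mean(leisure))
print(variance(work), variance(leisure))
```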

Coefficient of Variation

  • Population CV(%): $CV = \frac{\sigma}{\mu} \times 100\%$. Sample CV(%): $CV = \frac{s}{\bar{X}} \times 100\%$.

  • It is unit free because both the numerator and denominator have the same unit as the original data.

Coefficient of Variation - Example

  • $X$: waiting time of people in a queue (in minutes)

  • Observations: 12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14

  • Population CV(%): $CV = \frac{1.982 \text{ minutes}}{9.93 \text{ minutes}} \times 100\% = 19.96\%$.

  • Sample CV(%): $CV = \frac{2.052 \text{ minutes}}{9.93 \text{ minutes}} \times 100\% = 20.66\%$.
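A small Python sketch of the same calculation follows; `cv_percent` is a hypothetical helper name. The figures above are computed from rounded inputs (1.982, 2.052 and 9.93), so carrying full precision through the calculation shifts the last digit slightly.

```python
from statistics import fmean, pstdev, stdev

def cv_percent(std, mean):
    """Coefficient of variation in percent: standard deviation per unit of mean."""
    return std / mean * 100

# From the rounded values used in the example above.
cv_pop_rounded = cv_percent(1.982, 9.93)
cv_sample_rounded = cv_percent(2.052, 9.93)

# From the raw observations, without intermediate rounding.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]
cv_pop = cv_percent(pstdev(waiting), fmean(waiting))
cv_sample = cv_percent(stdev(waiting), fmean(waiting))

print(round(cv_pop_rounded, 2), round(cv_sample_rounded, 2))
print(round(cv_pop, 2), round(cv_sample, 2))
```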

CV: Interpretation

  • $CV$ is unit free. It measures standard deviation per unit of mean.

  • Example: Time on leisure per day measured in hours vs in minutes gives the same coefficient of variation (treating the data as a sample in both cases).

  • Example: In finance, when the random variable $X$ denotes asset returns, $CV$ measures risk per unit of expected return.
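The unit-free property can be illustrated with a short Python sketch. The leisure times are hypothetical; converting them from hours to minutes multiplies both the standard deviation and the mean by 60, so the ratio is unchanged.

```python
from math import isclose
from statistics import fmean, stdev

# Hypothetical leisure times per day, in hours (invented for illustration).
leisure_hours = [2, 3, 4, 5, 6]

# The same observations expressed in minutes.
leisure_minutes = [60 * h for h in leisure_hours]

cv_hours = stdev(leisure_hours) / fmean(leisure_hours) * 100
cv_minutes = stdev(leisure_minutes) / fmean(leisure_minutes) * 100

# The standard deviation changes with the unit; the CV does not.
print(stdev(leisure_hours), stdev(leisure_minutes))
print(isclose(cv_hours, cv_minutes))
```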

Variability: Excel

  • Excel is our friend for assignments and your future career. Google is our friend for learning Excel.

Summary Statistics: Skewness

  • Describing Data: Central Tendency, Variability, Skewness

Descriptive Statistics: Shape

  • Central tendency and variability are useful to describe and summarise data.

  • They cannot summarise asymmetry.

  • Skewness is a measure of asymmetry (Calculating skewness will not be examined).

Skewness

  • Symmetric distribution (skewness = 0): median = mean

  • Right-skewed distribution (skewness > 0, positively skewed): median < mean

  • Left-skewed distribution (skewness < 0, negatively skewed): median > mean
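The relationship between skewness and the ordering of median and mean can be illustrated with a Python sketch. The three data sets are hypothetical, and `skewness` below uses one common definition (the third central moment divided by the 1.5 power of the second central moment); statistical software may apply a small-sample adjustment, so its values can differ.

```python
from statistics import mean, median

def skewness(xs):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets, invented for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 12]   # one large value pulls the mean up
left_skewed = [1, 10, 11, 11, 12] # one small value pulls the mean down

for name, xs in [("symmetric", symmetric),
                 ("right-skewed", right_skewed),
                 ("left-skewed", left_skewed)]:
    print(name, median(xs), mean(xs), round(skewness(xs), 3))
```

The mean is pulled toward the long tail while the median stays with the bulk of the data, which is why the two separate in skewed distributions.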

Summary for week 1

  • Categorical data is summarised using tables and frequency counts, and visualised using bar charts or pie charts. Histograms are used for continuous data.

  • Distribution is the general shape that shows the probability that a random variable takes a certain value.

  • Central tendency includes mean, mode (most commonly occurring value in an array of numbers) and median (the middle number if you sort the array)

    • Population mean: $\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$

    • Sample mean: $\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

  • Variability includes variance, standard deviation and coefficient of variation

  • Measure of shape: skewness

  • Population Variance: $\sigma^2 = Var(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

  • Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$

  • Standard deviation: $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$, $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2}$

  • Coefficient of variation: $\frac{\sigma}{\mu} \times 100\%$, $\frac{s}{\bar{X}} \times 100\%$