Lecture-3 Visualisation and Presentation of Data 061124

Page 1: Lecture Information

  • Lecture Title: Visualization and Presentation of Data (Continued)

  • Lecturer: Dr. Lei Xu

  • Office Location: BE.203 in Sir Richard Morris

  • Feedback and Consultation Hours:

    • Tuesday 16:00 - 17:00

    • Wednesday 09:30 - 10:30

  • Email: L.Xu2@lboro.ac.uk

Page 2: Learning Objectives and Readings

Learning Objectives

  • Measure of Dispersion (Variability)

  • Presentation and Descriptive Analysis

  • Correlation

  • Causal Analysis

Recommended Readings

  1. Koop, G. (2013). Analysis of Economic Data. John Wiley & Sons. Chapter 3.

  2. Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.

    • Sections: 'Introduction', 'Directed Acyclic Graphs', and 'Potential Outcomes Causal Model'.

  3. Anderson, D. R., Williams, T. A., & Cochran, J. J. (2020). Statistics for Business & Economics. Cengage Learning. Chapters 2-3.

Page 3: Central Tendency and Measures of Dispersion

Measure of Location (Central Tendency)

  1. Mean

  2. Median

  3. Mode

Measure of Dispersion (Variability)

  1. Range and Percentile

  2. Quartiles and Interquartile Range (IQR)

  3. Mean Deviation

  4. Variance

  5. Standard Deviation

Page 4: Range and Percentile

Range

  • Definition: The simplest measure of dispersion, calculated as the difference between the maximum and minimum values in a data set.

    Formula: Range = Maximum value - Minimum value

Example 1
  • Data Set: 1000, 1050, 3000, 2500, 1780, 2210, 2540, 1980, 3650, 4970, 5000, 8500, 7010

  • Solution: Range = 8500 - 1000 = 7500

Percentile

  • Definition: The 𝑝th percentile is a value such that at least 𝑝 percent of observations are less than or equal to this value and at least (100 - 𝑝) percent are greater than or equal to it.

Calculation Steps for Percentile
  1. Arrange data in ascending order.

  2. Compute an index: 𝑖 = 𝑝/100 * 𝑛, where 𝑛 is the number of observations.

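  3. If 𝑖 is not an integer, round up: the next integer above 𝑖 gives the position of the percentile. If 𝑖 is an integer, the percentile is the average of the values in positions 𝑖 and 𝑖 + 1.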

Page 5: Exercise on Percentile Calculation

  • Goal: Find the 75th percentile of the data set: 1000, 1050, 3000, 2500, 1780, 2210, 2540, 1980, 3650, 4970, 5000, 8500, 7010.

Solution Steps

  1. Arrange Data: 1000, 1050, 1780, 1980, 2210, 2500, 2540, 3000, 3650, 4970, 5000, 7010, 8500

  2. Calculate Index: 𝑖 = 75/100 * 13 = 9.75.

  3. Determine Position: Because 𝑖 = 9.75 is not an integer, round up to position 10.

    • 75th Percentile: 4970 (10th position).
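
The solution steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `percentile` is chosen here for illustration, and the branch for an integer index (averaging positions 𝑖 and 𝑖 + 1) follows the common textbook convention, since the worked exercise only exercises the non-integer case.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0 < p < 100) using the index i = p/100 * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)           # Step 1: arrange in ascending order
    n = len(ordered)
    i = p * n / 100                    # Step 2: compute the index
    if i != int(i):
        # Step 3 (non-integer index): round up to the next position (1-based)
        return ordered[math.ceil(i) - 1]
    # Step 3 (integer index, assumed convention): average positions i and i + 1
    position = int(i)
    return (ordered[position - 1] + ordered[position]) / 2

data = [1000, 1050, 3000, 2500, 1780, 2210, 2540, 1980,
        3650, 4970, 5000, 8500, 7010]

print(percentile(data, 75))  # 4970, the value in the 10th position
print(percentile(data, 50))  # 2540, the median
```

Spreadsheet functions such as PERCENTILE.INC interpolate between positions, so their results can differ from this index rule on other data sets.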

Note

  • The median represents the 50th percentile.

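  • Excel Formula: =PERCENTILE.INC(array, k), where k is the percentile expressed as a decimal between 0 and 1 (e.g., 0.75 for the 75th percentile).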

Page 6: Quartiles

  • Definition: Data is divided into four parts, each containing approximately 25% of observations.

  • Quartiles are Defined as:

    • Q1: First quartile (25th percentile)

    • Q2: Second quartile (50th percentile, median)

    • Q3: Third quartile (75th percentile)

  • Calculation: Same method as percentiles.

  • Excel Formula: =QUARTILE.INC(array, quart)

Page 7: Interquartile Range (IQR)

  • Definition: Difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1.

  • Significance: Measures variability, focusing on the middle 50% of the data.

  • Illustration: See Figure 3.2.

  • Excel Formula: =Q3 - Q1 or =QUARTILE(array,3) - QUARTILE(array,1)

Page 8: Box Plot

  • Overview: A box plot (or box-and-whisker plot) is used in descriptive statistics for data visualization.

  • Purpose: Shows the distribution and skewness of numerical data through the quartiles and the median.

  • Five-Number Summary Components:

    • Minimum value

    • First quartile (Q1)

    • Median (Q2)

    • Third quartile (Q3)

    • Maximum value
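
As a rough sketch, the five-number summary and the IQR can be computed for the data set from Example 1 with Python's standard library. The choice of `statistics.quantiles` with `method="inclusive"` is an assumption made here; it interpolates between positions in the way Excel's QUARTILE.INC is understood to, and it happens to agree with the index rule on this data set.

```python
import statistics

data = [1000, 1050, 3000, 2500, 1780, 2210, 2540, 1980,
        3650, 4970, 5000, 8500, 7010]
ordered = sorted(data)

# Quartiles by inclusive interpolation (assumed to mirror QUARTILE.INC)
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

five_number = (ordered[0], q1, median, q3, ordered[-1])
iqr = q3 - q1

# A box plot hints at skewness: compare the two halves of the box
lower_box = median - q1   # distance from Q1 to the median
upper_box = q3 - median   # distance from the median to Q3

print(five_number)            # (1000, 1980.0, 2540.0, 4970.0, 8500)
print(iqr)                    # 2990.0
print(lower_box, upper_box)   # 560.0 2430.0
```

Here the upper part of the box is much longer than the lower part, which is the pattern a box plot shows for right-skewed data.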

Page 9: Percentile Point Visualization

  • Source: ONS

  • Graphical Representation: Percentile points related to total income before tax with intervals.

Page 10: Central Tendency and Distributional Measures

Central Tendency

  • Definition: A single value representing the "center" or typical value of a dataset.

  • Common Measures:

    • Mean: Average of all data points.

    • Median: Middle value when data is ordered.

    • Mode: Most frequently occurring value.

Distributional Measures

  • Definition: Describe the spread or dispersion of the data across its values.

  • Common Measures:

    • Range: Difference between max and min values.

    • Variance: Average of squared differences from the mean.

    • Standard Deviation: Square root of variance.

    • Skewness: Measure of data distribution asymmetry.

    • Kurtosis: Measure of how heavy the distribution's tails are (often described as its peakedness).

Purpose of Measures

  • Central Tendency: Identifies a typical value.

  • Distributional Measures: Describe the variability and shape of the dataset.
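
The measures listed on this page can be illustrated with Python's `statistics` module. The small data set below is hypothetical and chosen only so the arithmetic is easy to follow. Both the population variance (the plain average of squared deviations) and the sample variance (divided by n - 1) are shown, since either may be meant depending on context.

```python
import statistics

# Hypothetical observations, not taken from the lecture
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)        # average of all data points
median = statistics.median(values)    # middle value of the ordered data
mode = statistics.mode(values)        # most frequent value
data_range = max(values) - min(values)

pop_variance = statistics.pvariance(values)    # sum of squared deviations / n
sample_variance = statistics.variance(values)  # sum of squared deviations / (n - 1)
pop_std = statistics.pstdev(values)            # square root of population variance

print(mean, median, mode, data_range)  # 5 4.5 4 7
print(pop_variance, pop_std)           # 4 2.0
print(round(sample_variance, 3))       # 4.571
```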

Page 11: Excel Formulas for Measures

Measures of Location and Shape

  1. Mean: =AVERAGE(number1, number2,…)

  2. Percentile: =PERCENTILE(array, k)

  3. Quartile: =QUARTILE(array, quart)

  4. Median: =MEDIAN(number1, number2,…)

  5. Skewness: =SKEW(number1, number2,…)

  6. Kurtosis: =KURT(number1, number2,…)

Measures of Dispersion

  1. Standard Deviation: =STDEV.S(number1, number2,…)

  2. Variance: =VAR.S(number1, number2,…)

  3. Range: =MAX(number1, number2,…)-MIN(number1, number2,…)

  4. Interquartile Range: =QUARTILE(array,3)-QUARTILE(array,1)
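
For readers working outside Excel, the sketch below gives Python counterparts for the two formulas that have no direct standard-library equivalent. The function names are illustrative, and the bias-adjusted formulas are assumed to be the ones Excel documents for SKEW and KURT (KURT reports excess kurtosis, so a normal distribution scores about 0).

```python
import statistics

def sample_skewness(values):
    """Bias-adjusted sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation, like STDEV.S
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

symmetric = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric data
print(round(sample_skewness(symmetric), 6))         # 0.0
print(round(sample_excess_kurtosis(symmetric), 6))  # -1.2
```

The remaining entries map directly onto `statistics.mean`, `statistics.median`, `statistics.stdev`, and `statistics.variance`.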

Page 12: Summary Statistics

  • Definition of Summary Statistics: Provide a summary of data on a numerical variable.

  • Sample Insight: Of nearly 181,000 male employees, 95.3% are white; this group has the lowest years of schooling yet the highest potential work experience.

  • Wage Statistics: Average hourly wages in January 2018 prices are:

    • Whites: £19.5

    • Minority Natives: £18.7

    • Minority Immigrants: £18.3

Page 13: Tabular and Graphical Methods for Summarizing Data

Data Types

  • Qualitative Data

  • Quantitative Data

Tabular Methods

  • Frequency Distribution

  • Relative Frequency Distribution

  • Percentage Frequency Distribution
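
The three tabular methods differ only in how the counts are scaled, which a short sketch makes concrete. The category labels below are hypothetical placeholders, not data from the lecture.

```python
from collections import Counter

# Hypothetical qualitative data: one category label per observation
responses = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]

counts = Counter(responses)   # frequency distribution
n = len(responses)

# For each category: (frequency, relative frequency, percentage frequency)
table = {
    category: (freq, freq / n, 100 * freq / n)
    for category, freq in sorted(counts.items())
}

for category, (freq, rel, pct) in table.items():
    print(f"{category}: frequency={freq}, relative={rel:.2f}, percent={pct:.0f}%")
```

Relative frequencies sum to 1 and percentage frequencies sum to 100, which is a quick check on any table built this way.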

Graphical Methods

  • Bar Chart

  • Pie Chart

  • Histogram

  • Scatter Diagram

  • Ogive

Page 14: Scatter Plot

  • Definition: A two-dimensional visualization that uses dots to represent values of two different variables.

  • Purpose: Shows the relationship between two variables.

Page 15: Line Graph

  • Definition: A graph for visualizing values over time, using a horizontal axis (x-axis) and a vertical axis (y-axis).

  • Axes: x-axis for time intervals and y-axis for corresponding values (e.g., revenue).

  • Data Representation: Data points connected in a "dot-to-dot" fashion.

Page 16: Presentation Example

  • Context: Create a figure depicting life expectancy trends of white and Black males over time.

Page 17: Life Expectancy Presentation Data

Observations by Year

  • Data for White Males:

  • Data for Black Males:

  • Life expectancy trends displayed graphically.

  • Notable dip in 1918 attributed to the influenza pandemic.

Page 18: Wage Presentation

  • Sector Analysis: Overview of average hourly wage across various sectors, illustrated graphically.

Page 19: Average Hourly Wage Presentation

  • Detailed distribution of hourly wages for women aged 34 to 46 across various sectors, correlating with job types.

Page 20: Research Question

  • Inquiry: Does attending university lead to higher earnings?

Page 21: Earnings by Graduation Cohort

  • Data Analysis: Real earnings tracked over time after graduation for female and male cohorts.

  • Cohorts Analyzed: 2008, 2009, 2010, 2011, 2012.

Page 22: Earnings by Education Level

  • Graphical Analysis: Real earnings distinguished by GCSE results and higher education attendance.

Page 23: Mean Earnings Post-Graduation

  • Summary of Findings: Mean earnings corresponding to education levels 5 years post-graduation.

Page 24: Earnings by Subject Studied

  • Context: Average earnings by subject studied, taking dropouts into account.

Page 25: Ice Cream Consumption Case Study

  • Scenario: Matt's ice-cream sales were recorded over 30 days to assess the effect of temperature on sales.

Page 26: Univariate Description

  • Visual graphs showing frequency of ice cream consumption against temperature.

Page 27: Bivariate Descriptive Statistics

  • Summary statistics for ice cream consumption and associated temperatures, detailing means, standard errors, medians, modes, etc.

Page 28: Bivariate Description: Scatter Plot

  • Plot showing relationship between temperature and ice cream consumption, indicating a strong positive correlation.

Page 29: Covariance Analysis

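  • Interpretation of covariance using the four quadrants formed by the means of the two variables: points in the upper-right and lower-left quadrants contribute to a positive relationship, while points in the upper-left and lower-right quadrants contribute to a negative one.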

Page 30: Covariance Formula

  • Statistical formulas for calculating the covariance between two variables with explanations regarding population and sample covariance.
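
For reference, the standard textbook forms of the two formulas are written out below. The symbols (N and n for population and sample sizes, μ and x̄ for population and sample means) are the usual conventions and are assumed here, since the slide's own notation is not reproduced in these notes.

```latex
% Population covariance
\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}

% Sample covariance
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```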

Page 31: Covariance Calculation for Ice Cream Sales

  • Detailed calculations presented from the collected data on ice-cream sales and temperature.
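
The calculation can be sketched in a few lines of Python. The five temperature and sales pairs below are hypothetical stand-ins, because the 30-day data set itself is not reproduced in these notes.

```python
# Hypothetical (temperature in °C, ice-cream sales) pairs, not the lecture data
temperature = [14, 16, 18, 20, 22]
sales = [200, 240, 300, 340, 420]

n = len(temperature)
mean_temp = sum(temperature) / n
mean_sales = sum(sales) / n

# Sum of the products of paired deviations from the means
cross_products = sum(
    (t - mean_temp) * (s - mean_sales) for t, s in zip(temperature, sales)
)

sample_cov = cross_products / (n - 1)   # divide by n - 1 for a sample
population_cov = cross_products / n     # divide by n for a population

print(sample_cov)      # 270.0
print(population_cov)  # 216.0
```

A positive result means that days with above-average temperature tend to have above-average sales; the magnitude depends on the units, which is why correlation is introduced next.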

Page 32: Covariance Properties

  • Key properties of covariance explaining relationships: positive, negative, or no relationship.

Page 33: Correlation Definition

  • Coefficient of Correlation (r): Normalized index indicating the strength and direction of a linear relationship between two variables.
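
In the usual textbook notation (assumed here), the sample correlation coefficient divides the sample covariance by the product of the two sample standard deviations, which removes the units and bounds the result between -1 and +1:

```latex
r_{xy} = \frac{s_{xy}}{s_x \, s_y},
\qquad -1 \le r_{xy} \le 1
```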

Page 34: Correlation Calculation for Ice Cream Sales

  • Data-driven analysis of correlation using previously mentioned dataset, illustrating calculation steps and results.
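
A minimal sketch of the calculation follows, using hypothetical temperature and sales pairs because the lecture's 30-day data set is not reproduced in these notes. The function name `correlation` is illustrative.

```python
import math

def correlation(x, y):
    """Sample correlation: covariance divided by the product of std deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)

# Hypothetical data, not the lecture's
temperature = [14, 16, 18, 20, 22]
sales = [200, 240, 300, 340, 420]

r = correlation(temperature, sales)
print(round(r, 3))  # 0.993, a strong positive linear relationship
```

Unlike the covariance, this value would not change if temperature were recorded in Fahrenheit or sales in a different currency.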

Page 35: Correlation Properties

  • Interpretation of correlation values: r ranges from -1 to +1; values near +1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little or no linear relationship.

Page 36: Correlation Examples

  • Graphical examples demonstrating varying correlation strengths and their interpretations from scattered data plots.

Page 37: Causal Relationships

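  • Correlation measures the strength of a relationship between variables but does not by itself establish causation; causal inference asks whether a change in one variable actually causes a change in another, which is difficult to establish.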

Page 38: Rubin Causal Model

  • Explanation of the model where treatment variables and potential outcomes are linked to a causal effect analysis based on individual observations.
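
The core objects of the model can be written compactly. The notation below follows the potential-outcomes convention used in Cunningham (2021) and is an assumption about the slide's symbols: D_i is the treatment indicator, and Y_i^1 and Y_i^0 are the potential outcomes with and without treatment.

```latex
% Individual causal effect
\delta_i = Y_i^{1} - Y_i^{0}

% Observed outcome: only one potential outcome is ever seen per individual
Y_i = D_i \, Y_i^{1} + (1 - D_i) \, Y_i^{0}
```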

Page 39: Average Treatment Effect (ATE)

  • Importance of understanding average effects across populations through conditional expectations, focusing on wage premiums related to educational attainment.
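
In potential-outcomes notation (assumed here, following Cunningham, 2021), the average treatment effect is the population mean of the individual effects:

```latex
\text{ATE} = E[\delta_i] = E[Y_i^{1} - Y_i^{0}] = E[Y_i^{1}] - E[Y_i^{0}]
```

In the wage example, Y_i^1 would be an individual's wage with a degree and Y_i^0 the same individual's wage without one.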

Page 40: Challenges in Causal Inference

  • Discussion of the limitations of observational data and the missing data problem: only one potential outcome is observed for each individual, which complicates the estimation of causal effects.

Page 41: Wage Differential Analysis

  • Key Comparisons: College vs. Non-College wage differentials with focus on selection bias and how this affects causal interpretations.
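
The role of selection bias can be made explicit with the standard decomposition of the simple difference in mean outcomes. The notation is assumed (D = 1 for college, D = 0 for non-college, with Y^1 and Y^0 the potential wages); adding and subtracting E[Y^0 | D = 1] gives:

```latex
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{observed wage differential}}
= \underbrace{E[Y^{1} \mid D=1] - E[Y^{0} \mid D=1]}_{\text{average effect on the treated}}
+ \underbrace{E[Y^{0} \mid D=1] - E[Y^{0} \mid D=0]}_{\text{selection bias}}
```

The selection-bias term is non-zero whenever those who attend college would have earned different wages from non-attendees even without attending.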

Page 42: Random Assignment in Causal Studies

  • Definition and importance of random assignment in minimizing selection bias within experimental causal inference.
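
A small simulation can illustrate why random assignment matters. Everything in this sketch is hypothetical: the true effect of 3, the "ability" variable, and the wage equation are invented for illustration and are not estimates from the lecture.

```python
import random

random.seed(0)        # fixed seed so the run is reproducible
TRUE_EFFECT = 3.0     # hypothetical causal effect of treatment on the outcome
N = 20000

def mean(xs):
    return sum(xs) / len(xs)

def naive_difference(assign):
    """Difference in mean observed outcomes between treated and control."""
    treated, control = [], []
    for _ in range(N):
        ability = random.gauss(0, 1)
        y0 = 10 + 2 * ability + random.gauss(0, 1)  # outcome without treatment
        y1 = y0 + TRUE_EFFECT                       # outcome with treatment
        if assign(ability):
            treated.append(y1)    # only y1 is observed for the treated
        else:
            control.append(y0)    # only y0 is observed for the controls
    return mean(treated) - mean(control)

# Self-selection: higher-ability individuals choose treatment
self_selected = naive_difference(lambda ability: ability > 0)

# Random assignment: a coin flip, unrelated to ability
randomized = naive_difference(lambda ability: random.random() < 0.5)

print(f"self-selected: {self_selected:.2f}")  # well above the true effect
print(f"randomized:    {randomized:.2f}")     # close to the true effect of 3
```

Under self-selection the comparison mixes the treatment effect with pre-existing differences in ability; under random assignment the two groups are alike on average, so the simple difference recovers the effect.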

Page 43: Randomized Controlled Trials (RCT)

  • Overview of RCT as an essential method for determining causal relationships, including its design and execution fundamentals.

Page 44: Boxplot Analysis of Returns by Subject

  • Visualization and discussion of earnings based on academic institutions.

Page 45: Distribution of UCAS Points by University

  • Comparative analysis of UCAS tariff across various recognized UK universities.

Page 46: Subject Coefficients by A-Level Impact

  • Assessment of adjusted coefficients corresponding to subject intake based on mathematical A-levels.

Page 47: Adjusted Coefficients Summary

  • Detailed examination of subject-wise coefficients among various UK universities, with statistical significance metrics.

Page 48: Statistical Summary of Gender-Stratified Earnings

  • Data presenting overall impacts of higher education on earnings, segmenting by gender and number of individuals observed.