Week 1 - Descriptive Statistics: Principles of Data Cleaning and Organization

Introduction to Descriptive Statistics and Data Management

  • Definition and Importance of Descriptive Statistics: Descriptive statistics is the process of moving from raw data to organized information. This transformation is essential for effective data analysis in industries such as hospitality and tourism.
  • The Data Narrative (Visual Case Studies):
    • Cruise Industry Turbulence: In Q2 2019, the revenue for major cruise lines (including Carnival Corporation & PLC and RCCL) was approximately $4.84 billion. By Q2 2020, revenue plummeted to $0.1 billion, representing a decrease of approximately 85%.
    • Hotel and Resort Industry Market Size:
      • The global market size in 2013 was approximately $0.97 trillion.
      • By 2019, the market grew to $1.478 trillion.
      • In 2023 (estimate), the market size was $1.21 trillion.
      • Projections for 2030 (estimate) suggest a market size of $1.5 trillion.
    • Hotel Distribution Channels: Gross bookings in the global hotel industry involve a complex mix of direct channels (online and offline), online travel platforms, offline travel agents, and wholesale. Recorded values include segments at $570 billion, $400 billion, and $310 billion.
    • COVID-19 Impact on Tourism: International tourism saw significant declines due to the pandemic and border closures. For example, in the Americas, international tourist arrivals decreased by 55% through June 2020.
    • Airlines Sector Performance: Financial metrics for airlines such as Delta and Southwest include figures like $5.9 billion, $1.5 billion, and $1.0 billion related to equity and returns during the pandemic period.

Learning Objectives

  • Component Identification: Students will be able to identify and explain the main components of descriptive statistics.
  • Data Organization Skills: Students will be able to successfully organize categorical and numerical data into clear, structured frequency tables.

The Data Analysis Process

  • Raw Data: This is the starting point of any analysis. It consists of potentially very large and messy files, such as Excel spreadsheets.
  • Data Management: Before performing statistical analysis or creating graphs, data must be cleaned and organized. Essential steps include:
    1. Organize: Making tables.
    2. Describe: Qualitative assessment of the data.
    3. Compute: Calculating statistical indicators.
    4. Visualize: Creating graphs and charts.
  • Instructional Lead: These techniques are outlined by Pr. Petar Zivkovic.

Data Cleaning and Transformation

  • Data Cleaning Definition: The process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. Working with incorrect data leads to unreliable results and poor decision-making.
  • Data Transformation Definition: This process converts data from one format or structure into another (e.g., normalizing values, changing data types, or aggregating data). Transformation focuses on preparing cleaned data for analysis/modeling rather than fixing errors.
  • Steps to Clean Data:
    • Remove Duplicates: Ensures the same observation is not counted multiple times.
    • Fix Errors: Correcting typos and ensuring consistent naming and capitalization.
    • Filter Unwanted Outliers: Remove clear mistakes; however, the analyst must evaluate whether extreme but valid values should remain in the dataset.
    • Handle Missing Data: Deciding whether to delete affected rows/columns or fill in missing values appropriately.
    • Validate: Ensuring that the cleaned data makes sense through logic checks, range checks, and category validation.
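The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed workflow: the records, the field names (hotel, location, rooms), the list of valid locations, and the decision to drop rows with missing values are all assumptions made for the example.

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"hotel": "Alpina", "location": "mountain", "rooms": 42},
    {"hotel": "Alpina", "location": "mountain", "rooms": 42},    # exact duplicate
    {"hotel": "Bellevue", "location": " City ", "rooms": 120},   # inconsistent naming
    {"hotel": "Seeblick", "location": "Lake", "rooms": None},    # missing value
    {"hotel": "Central", "location": "City", "rooms": -15},      # clear mistake
    {"hotel": "Grand", "location": "Lake", "rooms": 496},        # extreme but valid
]

VALID_LOCATIONS = {"City", "Lake", "Mountain"}  # assumed category list


def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Fix errors: trim whitespace and use consistent capitalization.
        fixed = dict(row, location=row["location"].strip().title())
        # Remove duplicates: keep only the first copy of each observation.
        key = (fixed["hotel"], fixed["location"], fixed["rooms"])
        if key in seen:
            continue
        seen.add(key)
        # Handle missing data: dropping the row is one possible choice.
        if fixed["rooms"] is None:
            continue
        # Filter clear mistakes: a hotel cannot have zero or negative rooms.
        if fixed["rooms"] <= 0:
            continue
        cleaned.append(fixed)
    return cleaned


def validate(rows):
    # Range check and category validation on the cleaned data.
    return all(r["rooms"] > 0 and r["location"] in VALID_LOCATIONS for r in rows)


cleaned_rows = clean(raw_rows)
print(len(cleaned_rows), validate(cleaned_rows))
```

Note that the 496-room hotel is kept: it is extreme but plausible, whereas the negative room count is removed as a clear mistake.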

Organizing Categorical Data

  • Standard Procedures: For categorical data, the following six steps are used to create structured information:
    1. List Unique Categories: Identify each distinct value or define classes.
    2. Count Occurrences: Determine the frequency of each value.
    3. Absolute Frequency (n_i): Record the exact number of occurrences per category.
    4. Relative Frequency (f_i): Calculate the proportion or percentage, where f_i = n_i / n.
    5. Cumulative Frequency (F_i): Add the relative frequencies successively from the first category onward; the value for the last category reaches 100%.
    6. Present in a Table: Organize categories into columns (Category, n_i, f_i, F_i).
  • Notation: The symbol n denotes the total sample size.
  • Table Types:
    • Summary Table: Used for a single categorical variable (e.g., data showing the location of 2530 hotels in Switzerland).
    • Contingency Table: Used to show the relationship between two categorical variables (e.g., data showing both the Location and Operation of the 2530 hotels in Switzerland).
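As a rough sketch of both table types, the snippet below builds a summary table (Category, n_i, f_i, F_i) for one variable and a contingency table of counts for two variables. The eight observations and the category labels ("City", "Year-round", and so on) are invented for illustration and are not taken from the 2530-hotel dataset.

```python
from collections import Counter

# Hypothetical sample: one location and one operation type per hotel.
locations = ["City", "Lake", "City", "Mountain", "City", "Lake", "Mountain", "City"]
operations = ["Year-round", "Seasonal", "Year-round", "Seasonal",
              "Year-round", "Year-round", "Seasonal", "Seasonal"]


def frequency_table(values):
    counts = Counter(values)      # unique categories with absolute frequencies n_i
    n = len(values)               # total sample size n
    table, running_count = [], 0
    for category in sorted(counts):
        n_i = counts[category]
        running_count += n_i
        f_i = n_i / n             # relative frequency
        F_i = running_count / n   # cumulative relative frequency
        table.append((category, n_i, f_i, F_i))
    return table


def contingency_table(first, second):
    # Count how often each pair of categories occurs together.
    return Counter(zip(first, second))


summary = frequency_table(locations)
for category, n_i, f_i, F_i in summary:
    print(f"{category:<10}{n_i:>4}{f_i:>8.0%}{F_i:>8.0%}")

crosstab = contingency_table(locations, operations)
for (location, operation), count in sorted(crosstab.items()):
    print(location, operation, count)
```

The cumulative column is computed from the running count rather than by summing the f_i values, which keeps the last entry at exactly 100% despite floating-point rounding.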

Organizing Numerical Data

  • Challenge of Raw Numerical Data: Numerical datasets often have a wide range of values. For example, in a dataset of 2530 Swiss hotels, the number of rooms per hotel varies from 6 to 496.
  • Frequency Distribution: This method summarizes numerical values by grouping them into a set of numerically ordered classes.
  • Classes and Class Intervals:
    • Classes are groups representing a specific value or a range (class interval).
    • Mutual Exclusivity: Each value can belong to only one class.
    • Collective Exhaustivity: Every value in the dataset must be contained within one of the defined classes.
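A frequency distribution with class intervals might be built as in the sketch below. The room counts are hypothetical values inside the 6 to 496 range mentioned above, and the class boundaries and the half-open convention [lower, upper) are assumptions chosen for the example; other boundary conventions are equally valid as long as they are applied consistently.

```python
# Hypothetical room counts within the 6 to 496 range.
rooms = [6, 18, 25, 49, 50, 75, 120, 199, 230, 496]

# Assumed class boundaries; each class is the half-open interval [lower, upper).
boundaries = [0, 50, 100, 200, 500]


def frequency_distribution(values, edges):
    classes = list(zip(edges[:-1], edges[1:]))
    counts = {interval: 0 for interval in classes}
    for value in values:
        matches = [(lo, hi) for lo, hi in classes if lo <= value < hi]
        # Zero matches would break collective exhaustivity; more than one
        # would break mutual exclusivity. Exactly one match is required.
        if len(matches) != 1:
            raise ValueError(f"{value} falls in {len(matches)} classes")
        counts[matches[0]] += 1
    return counts


distribution = frequency_distribution(rooms, boundaries)
for (lo, hi), n_i in distribution.items():
    print(f"[{lo}, {hi}): {n_i}")
```

Because the upper bound is excluded, a hotel with exactly 50 rooms is counted in [50, 100) and not in [0, 50), which is what keeps the classes mutually exclusive.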