Week 1 - Descriptive Statistics: Principles of Data Cleaning and Organization

Introduction to Descriptive Statistics and Data Management

Definition and Importance of Descriptive Statistics: Descriptive statistics is the process of turning raw data into organized, summarized information. This transformation is essential for effective data analysis in industries such as hospitality and tourism.
The Data Narrative (Visual Case Studies):
- Cruise Industry Turbulence: In Q2 2019, the revenue for major cruise lines (including Carnival Corporation & PLC and RCCL) was approximately $\$4.84$ billion. By Q2 2020, revenue had plummeted to $\$0.1$ billion, a decrease reported at approximately $85\%$ .
- Hotel and Resort Industry Market Size:
  - The global market size in 2013 was approximately $\$0.97$ trillion.
  - By 2019, the market had grown to $\$1.478$ trillion.
  - In 2023 (estimate), the market size was $\$1.21$ trillion.
  - Projections for 2030 (estimate) suggest a market size of $\$1.5$ trillion.
- Hotel Distribution Channels: Gross bookings in the global hotel industry are spread across a mix of direct channels (online and offline), online travel platforms, offline travel agents, and wholesale. Recorded segment values include $\$570$ billion, $\$400$ billion, and $\$310$ billion.
- COVID-19 Impact on Tourism: International tourism declined sharply because of the pandemic and border closures. For example, in the Americas, international tourist arrivals decreased by $55\%$ through June 2020.
- Airlines Sector Performance: Financial metrics for airlines such as Delta and Southwest include figures like $\$5.9$ billion, $\$1.5$ billion, and $\$1.0$ billion related to equity and returns during the pandemic period.
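
Worked Example (Percent Change): The growth and decline figures in these case studies are relative changes, computed as (new - old) / old. The sketch below applies that formula to the hotel and resort market sizes quoted above. The function name `percent_change` and the dictionary layout are illustrative assumptions, not part of the course material.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100


# Hotel and resort market size in trillions of USD, as quoted in the notes.
# The 2023 and 2030 values are estimates.
market_size = {2013: 0.97, 2019: 1.478, 2023: 1.21, 2030: 1.5}

years = sorted(market_size)
changes = {}
for start, end in zip(years, years[1:]):
    changes[(start, end)] = percent_change(market_size[start], market_size[end])

for (start, end), change in changes.items():
    print(f"{start} -> {end}: {change:+.1f}%")
```

With these inputs the 2013 to 2019 step comes out at roughly +52%, and the 2019 to 2023 step at roughly -18%.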

Learning Objectives

Component Identification: Students will be able to identify and explain the main components of descriptive statistics.
Data Organization Skills: Students will be able to successfully organize categorical and numerical data into clear, structured frequency tables.

The Data Analysis Process

Raw Data: This is the starting point of any analysis. It consists of potentially very large and messy files, such as Excel spreadsheets.
Data Management: Before performing statistical analysis or creating graphs, data must be cleaned and organized. Essential steps include:
1. Organize: Making tables.
2. Describe: Qualitative assessment of the data.
3. Compute: Calculating statistical indicators.
4. Visualize: Creating graphs and charts.
Instructional Lead: These techniques are outlined by Prof. Petar Zivkovic.

Data Cleaning and Transformation

Data Cleaning Definition: Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. Working with incorrect data leads to unreliable results and poor decision-making.
Data Transformation Definition: This process converts data from one format or structure into another (e.g., normalizing values, changing data types, or aggregating data). Transformation focuses on preparing cleaned data for analysis/modeling rather than fixing errors.
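Transformation Sketch: The three transformations named above (changing data types, normalizing values, aggregating data) can be illustrated with a few lines of standard-library Python. The records, region names, and variable names below are hypothetical, and min-max scaling is only one of several possible normalizations.

```python
# Hypothetical, already-cleaned records: room counts are stored as text.
hotels = [
    {"region": "Alps", "rooms": "6"},
    {"region": "Alps", "rooms": "120"},
    {"region": "Lakes", "rooms": "496"},
    {"region": "Lakes", "rooms": "251"},
]

# 1. Change data type: text -> integer.
rooms = [int(h["rooms"]) for h in hotels]

# 2. Normalize: min-max scaling maps the smallest value to 0 and the largest to 1.
#    (Assumes at least two distinct values; otherwise hi - lo would be 0.)
lo, hi = min(rooms), max(rooms)
normalized = [(r - lo) / (hi - lo) for r in rooms]

# 3. Aggregate: total number of rooms per region.
rooms_per_region = {}
for h, r in zip(hotels, rooms):
    rooms_per_region[h["region"]] = rooms_per_region.get(h["region"], 0) + r

print(rooms)
print(normalized)
print(rooms_per_region)
```

None of these steps corrects an error in the data; each one only changes the form the data takes, which is the distinction between transformation and cleaning.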
Steps to Clean Data:
- Remove Duplicates: Ensure the same observation is not counted more than once.
- Fix Errors: Correct typos and enforce consistent naming and capitalization.
- Filter Unwanted Outliers: Remove clear mistakes, but evaluate whether extreme yet valid values should remain in the dataset.
- Handle Missing Data: Decide whether to delete the affected rows or columns or to fill in the missing values appropriately.
- Validate: Confirm that the cleaned data makes sense through logic checks, range checks, and category validation.
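
Cleaning Sketch: The five steps above can be sketched in plain Python as follows. The hotel records, the set of valid locations, and the function names `clean` and `validate` are all hypothetical. Dropping rows with missing values is only one of the options mentioned above; filling them in would be an equally valid choice depending on the analysis.

```python
# Hypothetical raw records illustrating each kind of problem.
raw = [
    {"hotel": "Bellevue", "location": "City", "rooms": 120},
    {"hotel": "Bellevue", "location": "City", "rooms": 120},      # exact duplicate
    {"hotel": "Alpenblick", "location": " mountain ", "rooms": 45},  # inconsistent naming
    {"hotel": "Seehof", "location": "LAKE", "rooms": -12},        # impossible value
    {"hotel": "Post", "location": "City", "rooms": None},         # missing value
    {"hotel": "Grand Palace", "location": "City", "rooms": 496},  # extreme but valid
]

VALID_LOCATIONS = {"City", "Mountain", "Lake"}  # assumed category list


def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # Fix errors: trim whitespace and use consistent capitalization.
        rec = {**rec, "location": rec["location"].strip().capitalize()}
        # Remove duplicates: skip a record already seen after normalization.
        key = (rec["hotel"], rec["location"], rec["rooms"])
        if key in seen:
            continue
        seen.add(key)
        # Handle missing data: this sketch drops rows without a room count.
        if rec["rooms"] is None:
            continue
        # Filter clear mistakes: a hotel cannot have zero or negative rooms.
        # Large but plausible values (such as 496) are kept.
        if rec["rooms"] <= 0:
            continue
        cleaned.append(rec)
    return cleaned


def validate(records):
    """Return a list of problems found by category and range checks."""
    problems = []
    for rec in records:
        if rec["location"] not in VALID_LOCATIONS:
            problems.append(f"{rec['hotel']}: unknown location {rec['location']!r}")
        if not isinstance(rec["rooms"], int) or rec["rooms"] <= 0:
            problems.append(f"{rec['hotel']}: invalid room count {rec['rooms']!r}")
    return problems


cleaned = clean(raw)
problems = validate(cleaned)
print(cleaned)
print(problems)
```

Capitalization is fixed before duplicates are removed so that records differing only in spelling style are recognized as the same observation.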

Organizing Categorical Data

Standard Procedures: For categorical data, the following six steps are used to create structured information:
1. List Unique Categories: Identify each distinct value or define classes.
2. Count Occurrences: Determine the frequency of each value.
3. Absolute Frequency ( $n_i$ ): Record the exact number of occurrences per category.
4. Relative Frequency ( $f_i$ ): Calculate the proportion or percentage, where $f_i = \frac{n_i}{n}$ .
5. Cumulative Frequency ( $F_i$ ): Add the relative frequencies successively, $F_i = f_1 + f_2 + \dots + f_i$ , so that the last category reaches $100\%$ .
6. Present in a Table: List one category per row, with columns for Category, $n_i$ , $f_i$ , and $F_i$ .
Notation: The symbol $n$ denotes the total sample size.
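Frequency Table Sketch: The six steps can be traced in a short Python example. The sample of ten hotel locations and the function name `frequency_table` are hypothetical. Categories are listed alphabetically here, which is an arbitrary choice; cumulative frequencies are most meaningful when the categories have a natural order.

```python
from collections import Counter

# Hypothetical sample of hotel locations (n = 10).
locations = ["City", "Mountain", "City", "Lake", "City",
             "Mountain", "Lake", "City", "Mountain", "City"]


def frequency_table(values):
    counts = Counter(values)          # steps 1-3: unique categories and n_i
    n = sum(counts.values())          # total sample size
    rows = []
    cumulative = 0
    for category in sorted(counts):
        n_i = counts[category]
        cumulative += n_i             # running total of absolute frequencies
        rows.append({
            "category": category,
            "n_i": n_i,               # absolute frequency
            "f_i": n_i / n,           # step 4: relative frequency
            "F_i": cumulative / n,    # step 5: cumulative relative frequency
        })
    return rows


# Step 6: present the result as a table.
table = frequency_table(locations)
print(f"{'Category':<10}{'n_i':>5}{'f_i':>8}{'F_i':>8}")
for row in table:
    print(f"{row['category']:<10}{row['n_i']:>5}{row['f_i']:>8.0%}{row['F_i']:>8.0%}")
```

The cumulative column is computed from the running count rather than by summing the rounded percentages, so the last row is exactly $100\%$ .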
Table Types:
- Summary Table: Used for a single categorical variable (e.g., data showing the location of $2530$ hotels in Switzerland).
- Contingency Table: Used to show the relationship between two categorical variables (e.g., data showing both the Location and Operation of the $2530$ hotels in Switzerland).
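
Contingency Table Sketch: A contingency table counts how often each combination of two categories occurs. The example below cross-tabulates Location against Operation for eight hypothetical hotels. The operation labels "Chain" and "Independent" are assumptions made for illustration, since the notes do not list the actual categories.

```python
from collections import Counter

# Hypothetical (location, operation) pairs, one per hotel.
hotels = [
    ("City", "Chain"), ("City", "Independent"), ("Mountain", "Independent"),
    ("Lake", "Chain"), ("City", "Chain"), ("Mountain", "Independent"),
    ("Lake", "Independent"), ("City", "Chain"),
]

cell_counts = Counter(hotels)                 # frequency of each combination
rows = sorted({loc for loc, _ in hotels})     # row categories (Location)
cols = sorted({op for _, op in hotels})       # column categories (Operation)

# Counter returns 0 for combinations that never occur.
table = {loc: {op: cell_counts[(loc, op)] for op in cols} for loc in rows}
row_totals = {loc: sum(table[loc].values()) for loc in rows}
col_totals = {op: sum(table[loc][op] for loc in rows) for op in cols}

print(f"{'':<10}" + "".join(f"{op:>13}" for op in cols) + f"{'Total':>8}")
for loc in rows:
    cells = "".join(f"{table[loc][op]:>13}" for op in cols)
    print(f"{loc:<10}{cells}{row_totals[loc]:>8}")
totals = "".join(f"{col_totals[op]:>13}" for op in cols)
print(f"{'Total':<10}{totals}{len(hotels):>8}")
```

The row totals and column totals are themselves summary tables for Location and Operation, so the grand total in the corner must equal $n$ .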

Organizing Numerical Data

Challenge of Raw Numerical Data: Numerical datasets often have a wide range of values. For example, in a dataset of $2530$ Swiss hotels, the number of rooms per hotel varies from $6$ to $496$ .
Frequency Distribution: This method summarizes numerical values by grouping them into a set of numerically ordered classes.
Classes and Class Intervals:
- Classes are groups representing a specific value or a range (class interval).
- Mutual Exclusivity: Each value can belong to only one class.
- Collective Exhaustivity: Every value in the dataset must be contained within one of the defined classes.
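
Frequency Distribution Sketch: The example below groups room counts into equal-width classes. The ten room counts, the class width of 100, and the function name `frequency_distribution` are hypothetical; only the range of 6 to 496 rooms comes from the notes. Half-open intervals of the form [lower, upper) are one common way to make the classes mutually exclusive, and the range check enforces collective exhaustivity.

```python
# Hypothetical room counts spanning the range quoted in the notes (6 to 496).
rooms = [6, 18, 25, 42, 60, 99, 100, 130, 275, 496]


def frequency_distribution(values, lower, width, num_classes):
    """Group values into num_classes half-open classes [a, a + width)."""
    upper = lower + width * num_classes
    counts = [0] * num_classes
    for v in values:
        # Collective exhaustivity: every value must fall inside some class.
        if not lower <= v < upper:
            raise ValueError(f"{v} is outside [{lower}, {upper})")
        # Mutual exclusivity: integer division assigns exactly one class,
        # so a boundary value such as 100 belongs to [100, 200) only.
        counts[int((v - lower) // width)] += 1

    n = len(values)
    rows = []
    cumulative = 0
    for i, n_i in enumerate(counts):
        cumulative += n_i
        rows.append({
            "lower": lower + i * width,
            "upper": lower + (i + 1) * width,
            "n_i": n_i,
            "f_i": n_i / n,
            "F_i": cumulative / n,
        })
    return rows


distribution = frequency_distribution(rooms, lower=0, width=100, num_classes=5)
for row in distribution:
    label = f"[{row['lower']}, {row['upper']})"
    print(f"{label:<12}{row['n_i']:>4}{row['f_i']:>8.0%}{row['F_i']:>8.0%}")
```

An empty class, such as 300 to 400 rooms in this sample, still appears in the table with a frequency of 0, which keeps the classes numerically ordered without gaps.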