Week 1 - Descriptive Statistics: Principles of Data Cleaning and Organization
Introduction to Descriptive Statistics and Data Management
- Definition and Importance of Descriptive Statistics: Descriptive statistics is the process of turning raw data into organized information. This transformation is essential for effective data analysis in industries such as hospitality and tourism.
- The Data Narrative (Visual Case Studies):
- Cruise Industry Turbulence: In Q2 2019, revenue for major cruise lines (including Carnival Corporation & PLC and RCCL) was approximately $4.84b. By Q2 2020, revenue had plummeted to $0.1b, reported as a change of approximately −85%.
- Hotel and Resort Industry Market Size:
- The global market size in 2013 was approximately $0.97 trillion.
- By 2019, the market had grown to $1.478 trillion.
- In 2023 (estimate), the market size was $1.21 trillion.
- Projections for 2030 (estimate) suggest a market size of $1.5 trillion.
- Hotel Distribution Channels: Gross bookings in the global hotel industry come from a complex mix of direct channels (online and offline), online travel platforms, offline travel agents, and wholesale. Recorded segment values include $570b, $400b, and $310b.
- COVID-19 Impact on Tourism: International tourism declined sharply because of the pandemic and border closures. For example, in the Americas, international tourist arrivals fell by 55% through June 2020.
- Airlines Sector Performance: Financial metrics for airlines such as Delta and Southwest include figures like $5.9b, $1.5b, and $1.0b related to equity and returns during the pandemic period.
Learning Objectives
- Component Identification: Students will be able to identify and explain the main components of descriptive statistics.
- Data Organization Skills: Students will be able to successfully organize categorical and numerical data into clear, structured frequency tables.
The Data Analysis Process
- Raw Data: This is the starting point of any analysis. It consists of potentially very large and messy files, such as Excel spreadsheets.
- Data Management: Before performing statistical analysis or creating graphs, data must be cleaned and organized. Essential steps include:
- Organize: Making tables.
- Describe: Qualitative assessment of the data.
- Compute: Calculating statistical indicators.
- Visualize: Creating graphs and charts.
- Instructional Lead: These techniques are outlined by Pr. Petar Zivkovic.
- Data Cleaning Definition: The process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. Working with incorrect data leads to unreliable results and poor decision-making.
- Data Transformation Definition: This process converts data from one format or structure into another (e.g., normalizing values, changing data types, or aggregating data). Transformation focuses on preparing cleaned data for analysis/modeling rather than fixing errors.
- Steps to Clean Data:
- Remove Duplicates: Ensures the same observation is not counted multiple times.
- Fix Errors: Correcting typos and ensuring consistent naming and capitalization.
- Filter Unwanted Outliers: Remove clear mistakes; however, the analyst must evaluate whether extreme but valid values should remain in the dataset.
- Handle Missing Data: Deciding whether to delete affected rows/columns or fill in missing values appropriately.
- Validate: Ensuring that the cleaned data makes sense through logic checks, range checks, and category validation.
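The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the list of valid locations, and the rule that a room count must be positive are all assumptions made up for the example, and dropping rows with missing values is only one of the options mentioned above.

```python
# Hypothetical raw records; hotel names, locations, and room counts are invented.
raw_rows = [
    {"hotel": "Alpine Lodge", "location": "Mountain", "rooms": "42"},
    {"hotel": "Alpine Lodge", "location": "Mountain", "rooms": "42"},  # exact duplicate
    {"hotel": "Lakeview Inn", "location": " city", "rooms": "120"},   # inconsistent spelling
    {"hotel": "Grand Palace", "location": "City", "rooms": "-5"},      # clear mistake
    {"hotel": "Station Hotel", "location": "City", "rooms": ""},       # missing value
    {"hotel": "Resort Royal", "location": "Lake", "rooms": "496"},     # extreme but valid
]

VALID_LOCATIONS = {"City", "Mountain", "Lake"}  # assumed category list


def clean(rows):
    """Return cleaned copies of the rows, applying the five steps in order."""
    cleaned = []
    seen = set()
    for row in rows:
        # Fix errors: trim whitespace and use consistent capitalization.
        hotel = row["hotel"].strip()
        location = row["location"].strip().capitalize()
        rooms_text = row["rooms"].strip()

        # Remove duplicates: skip an observation that was already kept.
        key = (hotel, location, rooms_text)
        if key in seen:
            continue
        seen.add(key)

        # Handle missing data: here the affected row is simply dropped.
        if rooms_text == "":
            continue
        rooms = int(rooms_text)

        # Filter unwanted outliers: remove only clear mistakes;
        # a large but plausible value such as 496 is kept.
        if rooms <= 0:
            continue

        # Validate: category check against the assumed list of locations.
        if location not in VALID_LOCATIONS:
            raise ValueError(f"Unexpected location: {location!r}")

        cleaned.append({"hotel": hotel, "location": location, "rooms": rooms})
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

In this sketch the duplicate, the negative room count, and the row with a missing value are removed, while the extreme but valid value of 496 rooms stays in the dataset.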
Organizing Categorical Data
- Standard Procedures: For categorical data, the following six steps are used to create structured information:
- List Unique Categories: Identify each distinct value or define classes.
- Count Occurrences: Determine the frequency of each value.
- Absolute Frequency (n_i): Record the exact number of occurrences per category.
- Relative Frequency (f_i): Calculate the proportion or percentage, where f_i = n_i / n.
- Cumulative Frequency (F_i): Add the relative frequencies successively, category by category; the final value is 100%.
- Present in a Table: Organize the results into columns (Category, n_i, f_i, F_i).
- Notation: The symbol n denotes the total sample size.
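As an illustration of the six steps, the following sketch builds a frequency table with Python's standard library. The sample values are hypothetical, and the function name frequency_table is introduced only for this example. The cumulative column is computed from cumulative counts divided by n, which gives the same result as adding the relative frequencies one by one.

```python
from collections import Counter

# Hypothetical sample of hotel locations (invented for illustration).
locations = ["City", "Mountain", "City", "Lake", "City", "Mountain", "Lake", "City"]


def frequency_table(values):
    """Build rows of (category, n_i, f_i, F_i) for one categorical variable."""
    counts = Counter(values)          # steps 1-3: unique categories and their counts
    n = len(values)                   # total sample size
    table = []
    cumulative_count = 0
    for category in sorted(counts):   # alphabetical order, an arbitrary choice
        n_i = counts[category]
        cumulative_count += n_i
        table.append({
            "category": category,
            "n_i": n_i,                    # absolute frequency
            "f_i": n_i / n,                # relative frequency
            "F_i": cumulative_count / n,   # cumulative relative frequency
        })
    return table


table = frequency_table(locations)
print(f"{'Category':<10}{'n_i':>5}{'f_i':>8}{'F_i':>8}")
for row in table:
    print(f"{row['category']:<10}{row['n_i']:>5}{row['f_i']:>8.0%}{row['F_i']:>8.0%}")
```

The last value of F_i is always 100%, which serves as a quick check that no observation was lost while counting.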
- Table Types:
- Summary Table: Used for a single categorical variable (e.g., data showing the location of 2530 hotels in Switzerland).
- Contingency Table: Used to show the relationship between two categorical variables (e.g., data showing both the Location and Operation of the 2530 hotels in Switzerland).
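A contingency table can be sketched in the same way by counting pairs of categories. The (location, operation) pairs below are invented, and the operation labels "Chain" and "Independent" are assumptions for the example rather than the categories used in the Swiss hotel data.

```python
from collections import Counter

# Hypothetical (location, operation) pairs; all values are invented.
hotels = [
    ("City", "Chain"), ("City", "Independent"), ("Mountain", "Independent"),
    ("Lake", "Chain"), ("City", "Chain"), ("Mountain", "Independent"),
]


def contingency_table(pairs):
    """Cross-tabulate two categorical variables into a nested dict of counts."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    # A Counter returns 0 for combinations that never occur.
    table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}
    return table, row_labels, col_labels


table, row_labels, col_labels = contingency_table(hotels)

print(f"{'Location':<10}" + "".join(f"{c:>13}" for c in col_labels) + f"{'Total':>7}")
for r in row_labels:
    cells = "".join(f"{table[r][c]:>13}" for c in col_labels)
    print(f"{r:<10}{cells}{sum(table[r].values()):>7}")
```

Each cell holds the number of observations sharing one row category and one column category, so the cell counts add up to the sample size n.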
Organizing Numerical Data
- Challenge of Raw Numerical Data: Numerical datasets often have a wide range of values. For example, in a dataset of 2530 Swiss hotels, the number of rooms per hotel varies from 6 to 496.
- Frequency Distribution: This method summarizes numerical values by grouping them into a set of numerically ordered classes.
- Classes and Class Intervals:
- Classes are groups representing a specific value or a range (class interval).
- Mutual Exclusivity: Each value can belong to only one class.
- Collective Exhaustivity: Every value in the dataset must be contained within one of the defined classes.
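The two requirements can be illustrated with a small sketch that groups room counts into classes. The room counts and the class width of 100 are assumptions chosen for the example; only the overall range of 6 to 496 comes from the notes. Half-open intervals [lower, upper) make the classes mutually exclusive, and covering 0 to 500 makes them collectively exhaustive for this range.

```python
# Hypothetical room counts within the 6-496 range mentioned above.
rooms = [6, 18, 42, 75, 99, 100, 120, 250, 310, 496]

LOWER, UPPER, WIDTH = 0, 500, 100  # assumed class limits and class width


def frequency_distribution(values, lower, upper, width):
    """Count values in ordered half-open classes [start, start + width)."""
    counts = {(start, start + width): 0 for start in range(lower, upper, width)}
    for value in values:
        # Collective exhaustivity: every value must fall inside some class.
        if not lower <= value < upper:
            raise ValueError(f"{value} falls outside all classes")
        # Mutual exclusivity: integer division assigns exactly one class,
        # so a boundary value such as 100 belongs only to [100, 200).
        start = lower + ((value - lower) // width) * width
        counts[(start, start + width)] += 1
    return counts


distribution = frequency_distribution(rooms, LOWER, UPPER, WIDTH)
for (start, end), n_i in distribution.items():
    print(f"{start:>3} to under {end:<3}: {n_i}")
```

The class counts sum to the number of observations, which confirms that each value was placed in exactly one class.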