INFS 343 1/26/26

Course Announcements and Homework Details

  • Questions and Help

    • The instructor is available to help with individual projects and HTML file corrections after class.

    • Assignments are due Monday at 11:59 PM; late submissions can earn at most three points.

  • Current Chapter

    • This week focuses on Chapter 3, titled "Data Cleaning and Manipulation."

    • No homework is due this week; the next assignments are due next week.

    • Students may choose to complete homework ahead of the first midterm to keep up with the coursework.

    • Extra credit opportunities may be available.

Chapter Overview: Data Cleaning and Manipulation

  • Introduction

    • Content is derived from textbooks but customized for in-class presentation.

    • Emphasis on data cleaning and manipulation as an essential part of data analysis.

    • Insufficient attention to detail can lead to disastrous outcomes, as the loss of the Mars Climate Orbiter (1999) shows.

Case Study: Mars Climate Orbiter

  • Background

    • NASA worked with its Jet Propulsion Laboratory (JPL) and Lockheed Martin to create the Mars Climate Orbiter.

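    • The mishap resulted from inconsistent units: one team's software reported thruster impulse in pound-force seconds while the navigation software expected newton-seconds. This underlines the importance of clear data specifications.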

  • Lessons Learned

    • Attention to detail is paramount in engineering and business.

    • Data must be cleaned and manipulated accurately before analytics can commence.
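
To make the unit issue concrete, here is a minimal Python sketch of one defensive habit: keep the unit attached to each value and convert explicitly at the boundary between systems. The function name `to_newton_seconds`, the unit labels, and the sample reading are illustrative assumptions and are not taken from the mission software. The conversion factor is the standard definition of the pound-force.

```python
# 1 pound-force = 4.4482216152605 newtons (exact by definition).
LBF_TO_NEWTON = 4.4482216152605

def to_newton_seconds(value: float, unit: str) -> float:
    """Return an impulse in newton-seconds, refusing unknown or unlabeled units."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_TO_NEWTON
    raise ValueError(f"unknown impulse unit: {unit!r}")

# Hypothetical reading reported in pound-force seconds.
impulse_si = to_newton_seconds(10.0, "lbf*s")
print(round(impulse_si, 3))  # 44.482
```

Raising an error on an unrecognized unit is a deliberate choice in this sketch: a loud failure is easier to catch than a silently wrong number.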

Data Analysis Stages

  • Phases in Business Analytics

    • The process consists of three main stages: framing, solving, and reporting.

    • This course focuses on the "solving" phase, in particular the analysis that follows data collection.

Data Preparation and Familiarization

  • Before analysis, become familiar with the data: where it came from, how it is updated, and why it was collected.

  • The key questions are referred to as the Six W's (five W's plus How):

    • Who: Creator of the data

    • What: Details/type of data

    • Where: Source of the data

    • When: Last updated

    • Why: Purpose of data collection

    • How: Method of connection or collection

Extract, Transform, Load (ETL) Pipeline

  • The ETL process is critical when handling data from various sources:

    • Extract: Data is gathered from multiple databases or spreadsheets.

    • Transform: Requires cleaning, standardizing, enriching, and manipulating the data, including type conversion and structuring or restructuring.

    • Load: The transformed data is written to its destination so it is ready for analysis, with a final check for accuracy and relevance.
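
The three ETL steps can be sketched in a few lines of Python using only the standard library. This is a simplified illustration under stated assumptions: the CSV text, the column names, the `homes` table, and the in-memory SQLite database are all hypothetical stand-ins for real sources and destinations.

```python
import csv
import io
import sqlite3

# Hypothetical source data; an in-memory CSV stands in for a real file or database.
RAW_CSV = """id,city,price
1, Chicago ,250000
2,Evanston,not available
3,Chicago,310000
"""

def extract(text: str) -> list[dict]:
    """Extract: read every row from the source as a dictionary of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: trim whitespace, convert types, skip rows with a non-numeric price."""
    cleaned = []
    for row in rows:
        price_text = row["price"].strip()
        if not price_text.isdigit():
            continue  # a real pipeline would log or impute rather than skip silently
        cleaned.append((int(row["id"]), row["city"].strip(), int(price_text)))
    return cleaned

def load(rows: list[tuple]) -> sqlite3.Connection:
    """Load: write the cleaned rows into a table that analysis can query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE homes (id INTEGER, city TEXT, price INTEGER)")
    conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

conn = load(transform(extract(RAW_CSV)))
print(conn.execute("SELECT city, AVG(price) FROM homes GROUP BY city").fetchall())
```

Keeping each stage in its own function makes it easier to test the transform step in isolation, which is where most cleaning mistakes tend to hide.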

Cleaning and Manipulating Data

  • Key procedures during the cleaning phase include:

    • Missing Values: Decide whether to delete or impute them before analysis.

    • Incorrect Data: May result from manual input errors or systemic faults.

    • Inaccurate Data: Must be identified and corrected before analysis.

  • Early Transformation Strategies:

    • Type Conversion: For example, converting string representations of numbers into numerical format.

    • Data Subsetting: Keep only the relevant subset of the data by removing irrelevant or outdated rows.

    • Aggregating Data: Group together related variables (e.g., bedrooms and bathrooms).

    • Sanitization: Remove leading or trailing spaces and entries that may interfere with analysis.

    • Standardization: Align data formats across different sets (e.g., currency and temperature units) to ensure consistency.
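
The transformation strategies above can be illustrated with a short Python sketch. Everything here is hypothetical: the listing records, the field names, the 2015 cutoff, and the choice to aggregate bedrooms and bathrooms by summing them into one `rooms` value are assumptions made for the example, not details from the lecture.

```python
# Hypothetical listing records; every field arrives as text, as it would from a CSV.
listings = [
    {"id": "101", "price": " 250,000 ", "bedrooms": "3", "bathrooms": "2", "year_sold": "2024"},
    {"id": "102", "price": "310000", "bedrooms": "4", "bathrooms": "3", "year_sold": "2009"},
]

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Standardization helper: convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32) * 5 / 9

def clean_listing(raw: dict) -> dict:
    """Sanitize text fields, convert types, and add a combined room count."""
    # Sanitization (strip spaces, drop thousands separators), then type conversion.
    price = int(raw["price"].strip().replace(",", ""))
    bedrooms = int(raw["bedrooms"].strip())
    bathrooms = int(raw["bathrooms"].strip())
    return {
        "id": raw["id"].strip(),
        "price": price,
        "year_sold": int(raw["year_sold"].strip()),
        "rooms": bedrooms + bathrooms,  # aggregation: one combined variable (assumed)
    }

cleaned = [clean_listing(r) for r in listings]

# Subsetting: keep only sales from 2015 onward (the cutoff is an arbitrary example).
recent = [r for r in cleaned if r["year_sold"] >= 2015]

# Standardization: bring mixed temperature readings onto one scale (Celsius).
readings = [(68.0, "F"), (20.0, "C")]
celsius = [fahrenheit_to_celsius(v) if unit == "F" else v for v, unit in readings]

print(recent)
print(celsius)
```

Currency standardization would follow the same pattern as the temperature step, but it needs an exchange rate for a specific date, so it is left out of this sketch.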

Dealing with Missing Data

  • Strategies:

    • Delete: Remove rows with missing values if they constitute a small percentage (<5%) of the dataset.

    • Imputation: Replace missing values with estimates (using mean, median, or mode).

    • Both approaches can distort the data's distribution, so weigh carefully how much is deleted or imputed.
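
The two strategies can be combined in a small Python sketch: delete when missing values are rare, otherwise impute. The function name `handle_missing`, the use of `None` to mark a missing value, and the sample `ages` column are assumptions for illustration; the 5% threshold comes from the notes above.

```python
from statistics import mean, median, mode

def handle_missing(values: list, strategy: str = "median", drop_threshold: float = 0.05) -> list:
    """Drop missing entries when they are rare; otherwise impute with the chosen statistic."""
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)
    if missing_share < drop_threshold:
        return present  # deletion: few enough entries to discard
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

# Hypothetical column with 2 of 6 values missing (about 33%, so imputation applies).
ages = [34, None, 29, 41, None, 29]
print(handle_missing(ages, "median"))  # [34, 31.5, 29, 41, 31.5, 29]
print(handle_missing(ages, "mode"))    # [34, 29, 29, 41, 29, 29]
```

Filling every gap with the same value narrows the spread of the column, which is one way imputation can distort a distribution. The median is often preferred over the mean when the column is skewed.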

  • Consideration of Outliers:

    • Outliers can carry important information but may also signal data entry errors, so investigate them before deciding whether to keep or remove them.
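
One common way to surface candidate outliers for review is the 1.5 × IQR rule. The lecture does not name a specific method, so treat this Python sketch as one reasonable option; the function name `flag_outliers` and the sample values are hypothetical.

```python
from statistics import quantiles

def flag_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; they are flagged, not deleted."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious entry.
sample = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers(sample))  # [95]
```

The function only reports the suspicious values. Whether 95 is a typo for 9.5 or a genuine extreme observation is a judgment that requires going back to the source.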

  • Data Reference Values: Be aware that different systems use different reference points (e.g., Excel counts dates as days from 1900, while Unix time counts seconds from January 1, 1970).
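
As an illustration of differing reference points, the following Python sketch converts the same calendar date from two representations: an Excel serial number (1900 date system) and a Unix timestamp. The helper names are hypothetical, and the Excel conversion assumes serials of 61 or greater, which sidesteps Excel's well-known treatment of 1900 as a leap year.

```python
from datetime import date, datetime, timedelta, timezone

# Offset for Excel's 1900 date system; valid for serials from 61 (March 1, 1900) onward.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel day count to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)

def unix_seconds_to_date(seconds: int) -> date:
    """Convert seconds since January 1, 1970 (UTC) to a calendar date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()

print(excel_serial_to_date(45000))       # 2023-03-15
print(unix_seconds_to_date(1678838400))  # 2023-03-15
```

The two raw numbers look nothing alike, yet they describe the same day, which is why the reference system has to be known before dates from different sources are merged.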

  • Categorical Conversion: Use functions to convert categorical data into a numerical format for analysis.
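
A minimal Python sketch of categorical conversion, assuming simple integer codes are acceptable for the analysis at hand. The function name `encode_categories` and the property-type labels are hypothetical.

```python
def encode_categories(labels: list[str]) -> tuple[list[int], dict[str, int]]:
    """Map each distinct label to an integer code, assigned in alphabetical order."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes

# Hypothetical categorical column.
property_types = ["condo", "house", "condo", "townhome"]
encoded, mapping = encode_categories(property_types)
print(encoded)  # [0, 1, 0, 2]
print(mapping)  # {'condo': 0, 'house': 1, 'townhome': 2}
```

Integer codes imply an ordering (0 < 1 < 2) that the categories may not actually have. When that matters, a separate 0/1 indicator column per category is a common alternative.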

Time Commitment in Data Cleaning

  • A significant portion of data analysis time (over 60%) may be spent on cleaning and manipulation.

Class Interaction and Next Steps

  • The instructor asks if there are any questions about the discussed concepts.

  • A quiz assessing understanding of concepts is scheduled to follow the lecture.