INFS 343 1/26/26
Course Announcements and Homework Details
Questions and Help
The instructor is available to help with individual projects and HTML file corrections after class.
Assignments are due Monday at 11:59 PM; late submissions can earn at most three points.
Current Chapter
This week focuses on Chapter 3, titled "Data Cleaning and Manipulation."
No homework is due this week; the current assignments are due next week.
Students may complete homework ahead of the first midterm to keep up with the coursework.
Extra credit opportunities may be available.
Chapter Overview: Data Cleaning and Manipulation
Introduction
Content is derived from textbooks but customized for in-class presentation.
Emphasis on data cleaning and manipulation as an essential part of data analysis.
Insufficient attention to detail can lead to disastrous outcomes, exemplified by the Mars Climate Orbiter (1999).
Case Study: Mars Climate Orbiter
Background
NASA partnered with the Jet Propulsion Laboratory and Lockheed Martin to build the Mars Climate Orbiter.
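The mishap resulted from inconsistent units of measurement: one team's software reported thruster impulse in pound-force seconds while the other's expected newton-seconds. This underlines the importance of clear data specifications.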
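To make the unit mismatch concrete, here is a minimal Python sketch. The orbiter loss is commonly attributed to impulse values produced in pound-force seconds being read as newton-seconds. The function name and the sample value below are illustrative assumptions, not the actual flight or ground software.

```python
# 1 pound-force = 4.4482216152605 newtons (follows from the definitions
# of the pound and standard gravity).
LBF_TO_NEWTON = 4.4482216152605


def impulse_lbf_s_to_n_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTON


# Hypothetical value produced by software that works in pound-force seconds.
reported = 100.0

# If the receiving system assumes newton-seconds, it uses the raw number as is.
misread_as_n_s = reported

# The value the receiving system should have used.
correct_n_s = impulse_lbf_s_to_n_s(reported)

# The raw number understates the true impulse by a factor of about 4.45.
ratio = correct_n_s / misread_as_n_s
print(f"misread: {misread_as_n_s:.1f} N*s, correct: {correct_n_s:.1f} N*s")
```

The same number means two different things depending on the unit attached to it, which is why a data specification should always state units explicitly.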
Lesson Learned
Attention to detail is paramount in engineering and business.
Data must be cleaned and manipulated accurately before analytics can commence.
Data Analysis Stages
Phases in Business Analytics
The process consists of three main stages:
Framing
Solving
Reporting
In this course, the focus is on the "solving" phase, particularly the analysis that follows data collection.
Data Preparation and Familiarization
Before analysis, familiarize yourself with the data, considering its origin, how recently it was updated, and its purpose.
Key questions to ask, referred to as the Six W's (five W's plus How):
Who: Creator of the data
What: Details/type of data
Where: Source of the data
When: Last updated
Why: Purpose of data collection
How: Method of connection or collection
Extract, Transform, Load (ETL) Pipeline
The ETL process is critical when handling data from various sources:
Extract: Data is gathered from multiple databases or spreadsheets.
Transform: Data is cleaned, standardized, enriched, and manipulated, including:
Type conversion
Structuring and restructuring data
Load: The finalized data is loaded into the destination used for analysis, with checks for accuracy and relevance.
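The three ETL steps can be sketched in a few lines of Python using only the standard library. Everything here is a hypothetical illustration: the two inline CSV sources, the column names, and the in-memory list standing in for the analysis destination are assumptions, not part of the course material.

```python
import csv
import io

# Hypothetical source extracts; a real pipeline would read databases or files.
SALES_CSV = "order_id,amount\n1, 19.99 \n2,5.00\n"
REGION_CSV = "order_id,region\n1,east\n2,WEST\n"


def extract(text: str) -> list[dict]:
    """Extract: read rows from one source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(sales: list[dict], regions: list[dict]) -> list[dict]:
    """Transform: clean, convert types, standardize, and enrich with region."""
    # Standardize region labels (trim spaces, upper-case) and index by order id.
    region_by_id = {
        int(r["order_id"]): r["region"].strip().upper() for r in regions
    }
    cleaned = []
    for row in sales:
        order_id = int(row["order_id"])              # type conversion
        cleaned.append({
            "order_id": order_id,
            "amount": float(row["amount"].strip()),  # sanitize, then convert
            "region": region_by_id.get(order_id),    # enrich from second source
        })
    return cleaned


def load(rows: list[dict], target: list) -> None:
    """Load: write the prepared rows to the destination (a list stands in here)."""
    target.extend(rows)


warehouse: list[dict] = []
load(transform(extract(SALES_CSV), extract(REGION_CSV)), warehouse)
print(warehouse)
```

Keeping each stage in its own function makes it easier to test the transformation rules separately from where the data comes from or where it ends up.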
Cleaning and Manipulating Data
Key procedures during the cleaning phase include:
Handling Missing, Incorrect, and Inaccurate Values:
Incorrect Data: May result from manual input errors or systemic faults.
Inaccurate Data: Must be identified before analysis begins.
Early Transformation Strategies:
Type Conversion: For example, converting string representations of numbers into numerical format.
Data Subsetting: Filter out irrelevant or outdated rows so that only the relevant subset remains.
Aggregating Data: Combine related variables (e.g., bedrooms and bathrooms).
Sanitization: Remove leading or trailing spaces and entries that may interfere with analysis.
Standardization: Align data formats across different sets (e.g., currency and temperature units) to ensure consistency.
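The transformation strategies listed above can be sketched on a tiny example. The housing records, column names, the 2020 cutoff, and the choice to combine bedrooms and bathrooms into a single room count are all illustrative assumptions.

```python
# Hypothetical housing records; every value arrives as text, as in a raw CSV.
raw_rows = [
    {"city": "  Austin ", "price": "350000", "bedrooms": "3",
     "bathrooms": "2", "year_listed": "2025"},
    {"city": "Dallas", "price": "275,000", "bedrooms": "2",
     "bathrooms": "1", "year_listed": "2012"},
]


def clean_row(row: dict) -> dict:
    """Apply sanitization, type conversion, and aggregation to one record."""
    city = row["city"].strip()                    # sanitization: trim spaces
    price = float(row["price"].replace(",", ""))  # type conversion: text -> number
    bedrooms = int(row["bedrooms"])
    bathrooms = int(row["bathrooms"])
    return {
        "city": city,
        "price": price,
        "rooms": bedrooms + bathrooms,            # aggregation: combine variables
        "year_listed": int(row["year_listed"]),
    }


cleaned = [clean_row(r) for r in raw_rows]

# Subsetting: keep only recent listings (the cutoff year is an assumption).
recent = [r for r in cleaned if r["year_listed"] >= 2020]


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Standardization: express every temperature in one unit."""
    return (temp_f - 32) * 5 / 9


print(recent)
print(fahrenheit_to_celsius(212))
```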
Dealing with Missing Data
Strategies:
Delete: Remove rows with missing values if they constitute a small percentage (<5%) of the dataset.
Imputation: Replace missing values with estimates (using mean, median, or mode).
Both approaches need care: deleting too many rows or imputing too many values can distort the data's distribution.
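A minimal sketch of the delete-or-impute decision, using Python's standard library. The sample ages are made up, and the choice of the median as the fill value is an assumption for illustration; the mean or mode could be used in the same way.

```python
from statistics import mean, median, pstdev

# Hypothetical column of ages; None marks a missing value.
ages = [34, None, 29, 41, None, 38, 29, 52, 47, 33]

observed = [a for a in ages if a is not None]
missing_share = (len(ages) - len(observed)) / len(ages)


def impute(values, fill):
    """Replace each missing value with the given estimate."""
    return [fill if v is None else v for v in values]


# Rule of thumb from the notes: delete only when under 5% of rows are missing.
if missing_share < 0.05:
    prepared = observed                        # delete rows with missing values
else:
    prepared = impute(ages, median(observed))  # impute with the median

# Imputation keeps every row but pulls values toward the center, so the
# spread of the imputed column is smaller than the spread of the observed data.
mean_imputed = impute(ages, mean(observed))
print(f"missing share: {missing_share:.0%}")
print(f"spread before: {pstdev(observed):.2f}, after: {pstdev(mean_imputed):.2f}")
```

The shrinking spread is one concrete way imputation can distort a distribution, which is why the share of imputed values should stay small.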
Consideration of Outliers:
Outliers can indicate important information but may also signal data entry errors.
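The notes do not name a specific detection method. One common convention, used here as an assumption, is the 1.5 x IQR rule: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. The sample values are hypothetical.

```python
from statistics import quantiles

# Hypothetical delivery times in minutes; 95 looks suspicious.
delivery_minutes = [12, 14, 15, 15, 16, 17, 18, 95]

# Quartiles of the data (statistics.quantiles with n=4 returns Q1, Q2, Q3).
q1, _q2, q3 = quantiles(delivery_minutes, n=4)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside the bounds instead of deleting them automatically:
# an outlier may be a data entry error or a genuinely important observation.
outliers = [v for v in delivery_minutes if v < lower_bound or v > upper_bound]
print(outliers)
```

Flagged values still need a human decision: check the source record before correcting, removing, or keeping them.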
Data Reference Values: Be aware that systems use different reference points (e.g., Excel's default date system counts days from 1900, while Unix timestamps count seconds from January 1, 1970).
Categorical Conversion: Use functions to convert categorical data into a numerical format for analysis.
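The last two points can be illustrated with a short Python sketch. The date part assumes Excel's default 1900 date system, for which day zero is conventionally taken as 1899-12-30 when converting serial numbers after February 1900. The size labels and their integer codes are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Reference point commonly used to convert Excel serial day numbers.
EXCEL_DAY_ZERO = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel serial day number to a calendar date."""
    return EXCEL_DAY_ZERO + timedelta(days=serial)


def unix_seconds_to_date(seconds: int) -> date:
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to a date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# The same number means different dates under different reference points.
print(excel_serial_to_date(45000))
print(unix_seconds_to_date(45000))

# Categorical conversion: map each label to an integer code.
sizes = ["small", "large", "medium", "small"]       # hypothetical column
size_codes = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
encoded = [size_codes[s] for s in sizes]
print(encoded)
```

Integer codes like these suit categories with a natural order; for unordered categories, a separate indicator column per category is often a safer choice.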
Time Commitment in Data Cleaning
A significant portion of data analysis time (over 60%) may be invested in cleaning and manipulation.
Class Interaction and Next Steps
The instructor asks if there are any questions about the discussed concepts.
A quiz assessing understanding of concepts is scheduled to follow the lecture.