E. Unit 2.3: Data Preparation Phase
🛠 2.3 – Phase 2: Data Preparation (Full Breakdown with Examples)
Theme: “Prepare your data like you're preparing a meal—clean, portion, and plate it before serving analysis.”
🎯 Purpose of This Phase
Set up a safe space (sandbox) for exploring data.
Perform ETLT (Extract, Transform, Load, Transform).
Understand and clean the data.
Prepare data structures, fill in gaps, and visualize data to find issues.
Most time-consuming phase (can take up 50%+ of total project effort).
🔲 2.3.1 – Preparing the Analytics Sandbox (In Simple Terms)
What It Is
A sandbox is a safe, separate workspace where data analysts and scientists can explore and experiment with data without touching live systems (like finance or sales databases).
Why It Matters
It protects real systems from accidental changes or crashes during testing.
Example
A fraud analyst wants to test models using customer transaction data. Instead of using the live payment system, they use a sandbox copy of the data.
✅ This keeps real customer data safe while allowing deep analysis.
Key Points
Load all types of data: raw, structured (like spreadsheets), aggregated (summarized), and unstructured (like call logs or clickstreams).
Conflict: IT teams want to limit access for security; Data Science teams want full access to explore.
Solution: Work together—follow IT rules but allow enough freedom to explore.
📌 Tip: Plan for 5–10x more storage than the size of your raw data to handle all the extra copies and experiments.
🔄 2.3.2 – Performing ETLT (Extract, Transform, Load, Transform)
What It Is
ETLT is a process for preparing data. It stands for:
| Step | What It Means | Why It Matters |
|---|---|---|
| Extract | Pull data from different systems | From databases, APIs like Twitter or Salesforce |
| Load | Put raw data into the sandbox | Keeps the original data untouched |
| Transform | Clean and reshape the data | Makes it usable for analysis or modeling |
| (Transform again) | Do extra formatting if needed | For special analysis or custom models |
Why ETLT > ETL
ETLT keeps a copy of the raw data, which is useful in cases like fraud detection, where unusual data (outliers) might be important, not just noise.
📌 Tip: For big data, use tools like Hadoop or MapReduce to load and process data in parallel.
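To make the ETLT flow concrete, here is a minimal R sketch. The file paths and column names are hypothetical; the point is that the raw copy is preserved while a cleaned version is built alongside it:

```r
# Extract + Load: bring the raw extract into the sandbox and keep it untouched
raw_orders <- read.csv("sandbox/raw_orders.csv")        # hypothetical file
saveRDS(raw_orders, "sandbox/raw_orders_backup.rds")    # preserve the raw copy

# Transform: build a cleaned version without modifying the raw data
clean_orders <- raw_orders
clean_orders$order_date <- as.Date(clean_orders$order_date, format = "%Y-%m-%d")
clean_orders <- clean_orders[!is.na(clean_orders$amount), ]  # drop rows missing amount

# (Transform again): extra reshaping for a specific analysis, e.g. keeping
# outliers separate instead of deleting them (useful for fraud detection)
outliers <- clean_orders[clean_orders$amount > quantile(clean_orders$amount, 0.99), ]
```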
🧠 2.3.3 – Learning About the Data (In Simple Terms)
What It’s For
Before analysis, you need to understand what data you have, what’s missing, and who controls it.
| Activity | Purpose | Example |
|---|---|---|
| Catalog access | Know what data is available | Internal logs, survey results, web analytics |
| Identify gaps | Spot missing or restricted data | HR data exists, but you can’t see salary info |
| Plan partnerships | Talk to data owners to get access | Ask Finance or Sales teams for needed data |
📌 Tip: Start looking for external data too—like open APIs or purchased datasets—that could add value to your project.
🧽 2.3.4 – Data Conditioning (a.k.a. Cleaning Your Data)
Think of this like prepping ingredients before cooking. You want everything clean, organized, and ready to go!
| Step | What It Means | Example |
|---|---|---|
| Clean | Fix or remove bad data | Fill in missing ages with the average age |
| Normalize | Make data consistent | Change height from inches to centimeters |
| Transform | Change or split data into useful parts | Split “Full Name” into First and Last names |
| Validate | Double-check that everything makes sense | Make sure income is a number and not negative |
🔍 Tips:
Work with your team—what one person thinks is “bad data” might be useful to someone else.
Don’t delete weird data too fast—it might reveal something important (like fraud!).
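As a small illustration of the four steps above, here is an R sketch on a made-up customer data frame (all column names and values are assumptions):

```r
customers <- data.frame(
  full_name = c("Ana Diaz", "Bo Chen", "Cara Voss"),
  age       = c(34, NA, 29),
  height_in = c(64, 70, 61),
  income    = c(52000, 61000, -100)
)

# Clean: fill missing ages with the average age
customers$age[is.na(customers$age)] <- mean(customers$age, na.rm = TRUE)

# Normalize: convert height from inches to centimeters
customers$height_cm <- customers$height_in * 2.54

# Transform: split "Full Name" into first and last names
parts <- strsplit(customers$full_name, " ")
customers$first_name <- sapply(parts, `[`, 1)
customers$last_name  <- sapply(parts, `[`, 2)

# Validate: flag (rather than delete) incomes that are not positive,
# echoing the tip above about not discarding "weird" data too fast
customers$income_valid <- customers$income > 0
```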
📊 2.3.5 – Survey and Visualize the Data
This is like looking at your ingredients before cooking—see what you’ve got!
| Technique | Why Use It | Example |
|---|---|---|
| Overview first | Get the big picture | A chart shows most customers are under 30 |
| Zoom & filter | Focus on specific groups | Look at customers from one zip code |
| Detail on demand | Check individual data points | Read comments from people who gave low ratings |
🧩 Use this step to:
Check if numbers are in the right range (e.g., no one is 500 years old).
Make sure time data is consistent (daily vs. hourly).
Spot weird or inconsistent entries (like different ways of writing the same state).
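Here is a minimal R sketch of the overview → zoom → detail pattern, using simulated ages (the data is invented purely for illustration):

```r
# Overview first: distribution of customer ages, with one impossible age planted
set.seed(42)
ages <- c(round(rnorm(950, mean = 27, sd = 5)), 500)
hist(ages, main = "Customer Ages: Overview")

# Zoom & filter: summary() immediately flags the max of 500 as suspicious
summary(ages)
young <- ages[ages >= 18 & ages <= 35]
hist(young, main = "Customers Aged 18-35")

# Detail on demand: inspect the suspicious records individually
ages[ages > 120]   # no one is 500 years old
```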
🧰 2.3.6 – Common Tools for Data Prep
Here are some tools that help you clean and organize your data:
| Tool | What It Does | Use It For |
|---|---|---|
| Hadoop | Handles huge amounts of data | Load millions of GPS or website clicks |
| Alpine Miner | Drag-and-drop tool for building workflows | Find top 100 customers and group them |
| OpenRefine | Clean messy data with a simple interface | Fix date formats or merge duplicates |
| Data Wrangler | Try out data changes and export the code | Test changes, then use the code in Python later |
📌 Pro Tip:
Wrangler is great because you can play with data visually, then export the logic to use in your code later!
🧾 Summary Table: Phase 2 – Data Preparation
| Step | What Happens | Example |
|---|---|---|
| Prepare Sandbox | Create isolated workspace | Copy production sales DB to cloud sandbox |
| Perform ETLT | Load raw → Transform → Clean | Import 1M Twitter posts, keep raw and clean versions |
| Learn the Data | Catalog access, find gaps | You have customer age but no income info |
| Condition the Data | Clean, format, validate | Income must be numeric and > 0 |
| Visualize & Explore | Spot trends/outliers/skew | Bar chart of survey results shows bias |
| Choose Tools | Use right tech stack | Use OpenRefine + Hadoop for scaling |
📘 Chapter 3.1 – Introduction to R
Sections Covered:
🔹 3.1: Introduction to R (pp. 64–67)
🔹 3.1.1: R Graphical User Interfaces (pp. 67–69)
🔹 3.1.2: Data Import and Export (p. 69)
🧠 What Is R?
R is a free, open-source programming language designed for:
Statistical computing
Data visualization
Modeling
Exploratory data analysis
It’s maintained through CRAN (Comprehensive R Archive Network).
R is widely used by data scientists, statisticians, and researchers.
🔹 Key Actions in R (with Code Examples)
| Task | Code | What It Does |
|---|---|---|
| Import data | `read.csv()` | Loads a CSV file |
| View dataset | `head()` | Displays the first 6 rows |
| Descriptive stats | `summary()` | Min, max, mean, median, quartiles |
| Create scatterplot | `plot()` | Plots number of orders vs. total sales |
| Linear regression | `lm()` | Fits a line to data |
| View regression output | `summary(results)` | Shows coefficients, R², p-values |
| Check residuals | `hist(results$residuals)` | Visualizes model errors |
📊 Example: Retail Sales Dataset
Scenario:
10,000 customer records, each with:
- `sales_total`: Annual sales in USD
- `num_of_orders`: Number of orders
- `gender`: F or M
📋 Step-by-step Actions:
1. Import data:

```r
sales <- read.csv("C:/data/yearly_sales.csv")
```

2. Inspect the first few rows:

```r
head(sales)
```

3. Summary statistics:

```r
summary(sales)
```

Output includes:
- Min/Max/Median of sales and orders
- Gender distribution: 5035 F, 4965 M

4. Scatterplot of sales vs. orders:

```r
plot(sales$num_of_orders, sales$sales_total,
     main = "Number of Orders vs. Sales")
```

5. Linear regression:

```r
results <- lm(sales_total ~ num_of_orders, data = sales)
summary(results)
```

Key output:
- Intercept: -154.1
- Slope: 166.2
- R-squared: 0.5637 → the model explains ~56% of the variation
- p-values: very small (< 2e-16) → statistically significant

6. Histogram of residuals:

```r
hist(results$residuals, breaks = 800)
```
🔍 Understanding Functions
Generic functions adapt based on input:
`summary()`, `plot()`, and `hist()` behave differently for different data types.
Regression output includes:
Intercept and slope
Error stats (Std. Error, t-value, p-value)
Residuals (errors between predicted and actual)
📌 You don’t need to fully interpret regression output here — it will be explained in more depth later in the book.
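A quick sketch of this generic behavior, safe to run in any R session (the sample values are made up):

```r
nums   <- c(12, 15, 9, 22)
gender <- factor(c("F", "M", "F", "F"))

summary(nums)    # numeric input: min, quartiles, mean, max
summary(gender)  # factor input: counts per category

plot(nums)       # numeric vector: index vs. value scatterplot
plot(gender)     # factor: bar chart of category counts
```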
💻 3.1.1 – R Graphical User Interfaces (GUIs)
R is primarily command-line-based, but GUIs make R more accessible.
🔧 Common GUIs:
| GUI | Purpose |
|---|---|
| RGui | The default R interface (basic) |
| RStudio | Most popular IDE for R (highly recommended) |
| Rattle | GUI for data mining workflows |
| R Commander | Menu-based GUI for stats and plots |
| JGR / Deducer | Java-based spreadsheet-style R GUIs |
🖼 RStudio Layout (Figure 3-3)
RStudio has 4 key panels:
| Pane | What It Does |
|---|---|
| Scripts | Write and save R code |
| Console | Run commands and see output |
| Workspace | View variables and datasets |
| Plots | View and export data visualizations |
💡 Use `?lm` or `help(lm)` in the Console to get function documentation.

You can also use:
- `edit()` or `fix()` to manually edit data
- `save.image()` to save your session
- `.RData` files to store your environment
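For example, in the Console (the object and file names here are hypothetical):

```r
?lm                             # open the documentation for lm()
help(lm)                        # the same lookup, written as a function call

sales <- data.frame(x = 1:3)    # a hypothetical object in the workspace
sales <- edit(sales)            # edit() returns an edited copy
fix(sales)                      # fix() edits the object in place
save.image("mysession.RData")   # write all workspace objects to an .RData file
load("mysession.RData")         # restore them in a later session
```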
🔄 3.1.2 – Data Import and Export
You can load data from several formats, not just CSV.
🧾 File Handling Functions
| Function | Description | Example |
|---|---|---|
| `setwd()` | Set working directory | `setwd("c:/data/")` |
| `read.csv()` | Read comma-separated files | `read.csv("yearly_sales.csv")` |
| `read.table()` | Read tables with custom delimiters | `read.table("sales.txt", sep = ";")` |
| `read.delim()` | Read tab-separated files | `read.delim("sales.txt")` |
📌 R uses forward slashes (/) for file paths — even in Windows.
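Export works the same way in reverse. A small round-trip sketch (the paths and column name are assumptions):

```r
setwd("c:/data/")                       # forward slashes, even on Windows

sales <- read.csv("yearly_sales.csv")   # import
sales$sales_total <- round(sales$sales_total, 2)

# export the cleaned copy; row.names = FALSE avoids an extra index column
write.csv(sales, "yearly_sales_clean.csv", row.names = FALSE)
```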
✅ Final Takeaways
R is a powerful analytics tool, especially for statistical work.
You can:
Import and clean data
Run basic models (like linear regression)
Create quick plots
Work with GUIs like RStudio for ease of use
Understanding output and workflow early on helps with later lifecycle phases (like model planning and building).
📚 Key Terms – Data Analytics (with Examples & Explanations)
| Term | What It Means (Easy) | Example | Why It Matters |
|---|---|---|---|
| Corporate Data Warehouse | A central hub for storing a company’s data, usually historical and structured. | All sales, customer, and inventory data from across departments stored in one place. | Ideal for data mining, reporting, and cross-functional analysis. |
| Data Availability | How easy it is to get data from different sources (internal or external). | Data from HR, CRM systems, or 3rd-party APIs. | Influences what data can be used and how reliable your analysis will be. |
| Data Cleaning (Cleansing) | The process of fixing or removing errors in data, like blanks, typos, or duplicates. | Fixing missing income fields or removing duplicate records. | Ensures data is accurate, which is critical before modeling. |
| Data Complexity | How complicated or structured the data is, including data types, relationships, and volume. | Nested tables, multi-level categories, time-series logs. | High complexity may require more prep time and resources. |
| Qualitative Data Types | Data that describes categories or labels, not numbers. Nominal = no order (e.g., gender, product type); Ordinal = has order (e.g., satisfaction rating: low, medium, high). | Nominal: “Red”, “Blue”; Ordinal: “Beginner”, “Intermediate”, “Expert” | Useful for categorical analysis, visualizations, or filtering groups. |
| Quantitative Data Types | Data that is measurable and numeric. Interval = numbers without a true zero (e.g., temperature in °C); Ratio = has a true zero (e.g., sales, height, weight). | Interval: 30°C; Ratio: $1000 revenue | Used in statistical models and calculations (like mean, regression). |
| R | A programming language and environment for data analysis, stats, and visualization. | Using `ggplot2` for charts or `lm()` for linear regression. | Widely used in analytics and open source (free). Essential tool in data science. |
| Type I Error | A false positive: saying there is an effect when there isn’t. | A test says a medicine works, but it actually doesn’t. | Controlled by alpha (α). Must be minimized in sensitive tests (e.g., fraud detection). |
| Type II Error | A false negative: saying there is no effect when there is one. | A test says the medicine doesn’t work, but it does. | Controlled by beta (β). Risky in critical systems like disease detection. |
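Several of these terms map directly onto R’s data types. A brief illustrative sketch (all values are made up):

```r
# Nominal: labels with no inherent order
product <- factor(c("Red", "Blue", "Red"))

# Ordinal: labels with an order, modeled as an ordered factor
skill <- factor(c("Beginner", "Expert", "Intermediate"),
                levels  = c("Beginner", "Intermediate", "Expert"),
                ordered = TRUE)
skill[1] < skill[2]   # TRUE: ordered factors support comparisons

# Ratio: numeric with a true zero (e.g., revenue in dollars)
revenue <- c(1000, 250, 0)
mean(revenue)         # arithmetic is meaningful for quantitative data
```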