E. Unit 2.3: Data Preparation Phase

🛠 2.3 – Phase 2: Data Preparation (Full Breakdown with Examples)

Theme: “Prepare your data like you're preparing a meal—clean, portion, and plate it before serving analysis.”


🎯 Purpose of This Phase

  • Set up a safe space (sandbox) for exploring data.

  • Perform ETLT (Extract, Transform, Load, Transform).

  • Understand and clean the data.

  • Prepare data structures, fill in gaps, and visualize data to find issues.

  • This is the most time-consuming phase (often 50% or more of total project effort).


🔲 2.3.1 – Preparing the Analytics Sandbox (In Simple Terms)

What It Is

A sandbox is a safe, separate workspace where data analysts and scientists can explore and experiment with data without touching live systems (like finance or sales databases).

Why It Matters

It protects real systems from accidental changes or crashes during testing.

Example

A fraud analyst wants to test models using customer transaction data. Instead of using the live payment system, they use a sandbox copy of the data.
This keeps real customer data safe while allowing deep analysis.

Key Points
  • Load all types of data: raw, structured (like spreadsheets), aggregated (summarized), and unstructured (like call logs or clickstreams).

  • Conflict: IT teams want to limit access for security; Data Science teams want full access to explore.

  • Solution: Work together—follow IT rules but allow enough freedom to explore.

📌 Tip: Plan for 5–10x more storage than the size of your raw data to handle all the extra copies and experiments.
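To make the sandbox idea concrete, here's a tiny Python sketch (not from the book; the table name and rows are invented) that copies a "production" table into a separate sandbox database, so experiments never touch the live data:

```python
import sqlite3

# Hypothetical setup: a "production" database with a transactions table.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL)")
prod.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [(1, "alice", 120.0), (2, "bob", 75.5), (3, "carol", 310.25)])
prod.commit()

# Sandbox: a separate database the analyst is free to modify.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL)")
rows = prod.execute("SELECT * FROM transactions").fetchall()
sandbox.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
sandbox.commit()

# Experiments in the sandbox never touch production.
sandbox.execute("DELETE FROM transactions WHERE amount < 100")
print(prod.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])     # production intact: 3
print(sandbox.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])  # sandbox copy: 2
```

In practice the "sandbox" would be a separate cluster or cloud workspace rather than SQLite, but the principle is the same: copy first, experiment second.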


🔄 2.3.2 – Performing ETLT (Extract, Transform, Load, Transform)

What It Is

ETLT is a process for preparing data. It stands for:

| Step | What It Means | Why It Matters |
| --- | --- | --- |
| Extract | Pull data from different systems | From databases, APIs like Twitter or Salesforce |
| Load | Put raw data into the sandbox | Keeps the original data untouched |
| Transform | Clean and reshape the data | Makes it usable for analysis or modeling |
| (Transform again) | Do extra formatting if needed | For special analysis or custom models |

Why ETLT > ETL

ETLT keeps a copy of the raw data, which is useful in cases like fraud detection, where unusual data (outliers) might be important, not just noise.

📌 Tip: For big data, use tools like Hadoop or MapReduce to load and process data in parallel.
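The key habit in ETLT is keeping the raw copy alongside the cleaned one. A minimal Python sketch (illustrative only; the records and field names are made up, and real pipelines would use a database or Hadoop rather than lists):

```python
# Extract: pretend these records came from a source system or API.
extracted = [
    {"customer": "alice", "amount": "120.0"},
    {"customer": "bob", "amount": "-75.5"},      # an outlier: negative amount
    {"customer": "carol", "amount": "310.25"},
]

# Load: store the raw data untouched in the sandbox.
raw = list(extracted)

# Transform: build a cleaned copy for modeling, but keep `raw` around,
# since "bad" records (like negative amounts) may matter for fraud detection.
clean = [{"customer": r["customer"], "amount": float(r["amount"])}
         for r in raw if float(r["amount"]) > 0]

print(len(raw))    # 3 (raw copy preserved, outlier included)
print(len(clean))  # 2 (cleaned view for modeling)
```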


🧠 2.3.3 – Learning About the Data (In Simple Terms)

What It’s For

Before analysis, you need to understand what data you have, what’s missing, and who controls it.

| Activity | Purpose | Example |
| --- | --- | --- |
| Catalog access | Know what data is available | Internal logs, survey results, web analytics |
| Identify gaps | Spot missing or restricted data | HR data exists, but you can’t see salary info |
| Plan partnerships | Talk to data owners to get access | Ask Finance or Sales teams for needed data |

📌 Tip: Start looking for external data too—like open APIs or purchased datasets—that could add value to your project.


🧽 2.3.4 – Data Conditioning (a.k.a. Cleaning Your Data)

Think of this like prepping ingredients before cooking. You want everything clean, organized, and ready to go!

| Step | What It Means | Example |
| --- | --- | --- |
| Clean | Fix or remove bad data | Fill in missing ages with the average age |
| Normalize | Make data consistent | Change height from inches to centimeters |
| Transform | Change or split data into useful parts | Split “Full Name” into First and Last names |
| Validate | Double-check that everything makes sense | Make sure income is a number and not negative |

🔍 Tips:

  • Work with your team—what one person thinks is “bad data” might be useful to someone else.

  • Don’t delete weird data too fast—it might reveal something important (like fraud!).
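Here's what those four steps might look like in a pandas sketch (hedged: the column names and values below are invented for illustration, not taken from the book):

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with the kinds of problems described above.
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "age": [36, np.nan, 45],
    "height_in": [65, 70, 63],
    "income": [52000, 61000, -1],
})

# Clean: fill the missing age with the average age.
df["age"] = df["age"].fillna(df["age"].mean())

# Normalize: convert height from inches to centimeters.
df["height_cm"] = df["height_in"] * 2.54

# Transform: split "Full Name" into first and last names.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

# Validate: income must be numeric and non-negative; keep the rest for review
# rather than deleting it outright (it might be a fraud signal!).
valid = df[df["income"] >= 0]
print(len(valid))  # 2 rows pass validation
```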


📊 2.3.5 – Survey and Visualize the Data

This is like looking at your ingredients before cooking—see what you’ve got!

| Technique | Why Use It | Example |
| --- | --- | --- |
| Overview first | Get the big picture | A chart shows most customers are under 30 |
| Zoom & filter | Focus on specific groups | Look at customers from one zip code |
| Detail on demand | Check individual data points | Read comments from people who gave low ratings |

🧩 Use this step to:

  • Check if numbers are in the right range (e.g., no one is 500 years old).

  • Make sure time data is consistent (daily vs. hourly).

  • Spot weird or inconsistent entries (like different ways of writing the same state).
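The checks above don't need fancy tooling; a few lines of Python can cover them (a sketch with invented sample values, following the overview → zoom → detail pattern):

```python
from collections import Counter

# Hypothetical survey records to sanity-check before deeper analysis.
ages = [23, 27, 31, 500, 29, 24]           # one impossible value
states = ["CA", "California", "NY", "CA"]  # inconsistent spellings

# Overview first: range check, since no one is 500 years old.
out_of_range = [a for a in ages if not 0 < a < 120]
print(out_of_range)   # [500]

# Zoom & filter: focus on one group (plausible ages under 30).
under_30 = [a for a in ages if a < 30]
print(len(under_30))  # 4

# Detail on demand: spot different ways of writing the same state.
print(Counter(states))  # 'CA' and 'California' should be unified
```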


🧰 2.3.6 – Common Tools for Data Prep

Here are some tools that help you clean and organize your data:

| Tool | What It Does | Use It For |
| --- | --- | --- |
| Hadoop | Handles huge amounts of data | Load millions of GPS points or website clicks |
| Alpine Miner | Drag-and-drop tool for building workflows | Find the top 100 customers and group them |
| OpenRefine | Clean messy data with a simple interface | Fix date formats or merge duplicates |
| Data Wrangler | Try out data changes and export the code | Test changes, then use the code in Python later |

📌 Pro Tip:

Wrangler is great because you can play with data visually, then export the logic to use in your code later!


🧾 Summary Table: Phase 2 – Data Preparation

| Step | What Happens | Example |
| --- | --- | --- |
| Prepare Sandbox | Create isolated workspace | Copy production sales DB to a cloud sandbox |
| Perform ETLT | Load raw → transform → clean | Import 1M Twitter posts; keep both raw and cleaned versions |
| Learn the Data | Catalog access, find gaps | You have customer age but no income info |
| Condition the Data | Clean, format, validate | Income must be numeric and > 0 |
| Visualize & Explore | Spot trends, outliers, skew | Bar chart of survey results shows bias |
| Choose Tools | Use the right tech stack | Use OpenRefine + Hadoop for scaling |

📘 Chapter 3.1 – Introduction to R

Sections Covered:

  • 🔹 3.1: Introduction to R (pp. 64–67)

  • 🔹 3.1.1: R Graphical User Interfaces (pp. 67–69)

  • 🔹 3.1.2: Data Import and Export (p. 69)


🧠 What Is R?

  • R is a free, open-source programming language designed for:

    • Statistical computing

    • Data visualization

    • Modeling

    • Exploratory data analysis

  • It’s maintained through CRAN (Comprehensive R Archive Network).

  • R is widely used by data scientists, statisticians, and researchers.


🔹 Key Actions in R (with Code Examples)

| Task | Code | What It Does |
| --- | --- | --- |
| Import data | `read.csv("C:/data/yearly_sales.csv")` | Loads a CSV file |
| View dataset | `head(sales)` | Displays the first 6 rows |
| Descriptive stats | `summary(sales)` | Min, max, mean, median, quartiles |
| Create scatterplot | `plot(sales$num_of_orders, sales$sales_total)` | Plots number of orders vs. total sales |
| Linear regression | `lm(sales_total ~ num_of_orders, data = sales)` | Fits a line to the data |
| View regression output | `summary(results)` | Shows coefficients, R², p-values |
| Check residuals | `hist(results$residuals, breaks = 800)` | Visualizes model errors |


📊 Example: Retail Sales Dataset

Scenario:
10,000 customer records, each with:

  • sales_total: Annual sales in USD

  • num_of_orders: Number of orders

  • gender: F or M

📋 Step-by-step Actions:

  1. Import data:

     ```r
     sales <- read.csv("C:/data/yearly_sales.csv")
     ```

  2. Inspect the first few rows:

     ```r
     head(sales)
     ```

  3. Summary statistics:

     ```r
     summary(sales)
     ```

     Output includes:

     • Min/Max/Median of sales and orders

     • Gender distribution: 5035 F, 4965 M

  4. Scatterplot of sales vs. orders:

     ```r
     plot(sales$num_of_orders, sales$sales_total, main = "Number of Orders vs. Sales")
     ```

  5. Linear regression:

     ```r
     results <- lm(sales_total ~ num_of_orders, data = sales)
     summary(results)
     ```

     • Intercept: -154.1

     • Slope: 166.2

     • R-squared: 0.5637 → the model explains ~56% of the variation

     • p-values: very small (< 2e-16) → statistically significant

  6. Histogram of residuals:

     ```r
     hist(results$residuals, breaks = 800)
     ```


🔍 Understanding Functions

  • Generic functions adapt based on input:

    • summary(), plot(), and hist() behave differently for different data types.

  • Regression output includes:

    • Intercept and slope

    • Error stats (Std. Error, t-value, p-value)

    • Residuals (errors between predicted and actual)

📌 You don’t need to fully interpret regression output here — it will be explained deeper later in the book.


💻 3.1.1 – R Graphical User Interfaces (GUIs)

R is primarily command-line-based, but GUIs make R more accessible.

🔧 Common GUIs:

| GUI | Purpose |
| --- | --- |
| RGui | The default R interface (basic) |
| RStudio | Most popular IDE for R (highly recommended) |
| Rattle | GUI for data mining workflows |
| R Commander | Menu-based GUI for stats and plots |
| JGR / Deducer | Java-based, spreadsheet-style R GUIs |


🖼 RStudio Layout (Figure 3-3)

RStudio has 4 key panels:

| Pane | What It Does |
| --- | --- |
| Scripts | Write and save R code |
| Console | Run commands and see output |
| Workspace | View variables and datasets |
| Plots | View and export data visualizations |

💡 Use ?lm or help(lm) in Console to get function documentation.

You can also use:

  • edit() or fix() to manually edit data

  • save.image() to save your session

  • .RData files to store your environment


🔄 3.1.2 – Data Import and Export

You can load data from several formats, not just CSV.

🧾 File Handling Functions

| Function | Description | Example |
| --- | --- | --- |
| `setwd()` | Set the working directory | `setwd("C:/data/")` |
| `read.csv()` | Read comma-separated files | `read.csv("file.csv")` |
| `read.table()` | Read tables with custom delimiters | `read.table("file.txt", sep = "\t")` |
| `read.delim()` | Read tab-separated files | `read.delim("file.txt")` |

📌 R uses forward slashes (/) for file paths — even in Windows.


Final Takeaways

  • R is a powerful analytics tool, especially for statistical work.

  • You can:

    • Import and clean data

    • Run basic models (like linear regression)

    • Create quick plots

    • Work with GUIs like RStudio for ease of use

  • Understanding output and workflow early on helps with later lifecycle phases (like model planning and building).

📚 Key Terms – Data Analytics (with Examples & Explanations)

Term

What It Means (Easy)

Example

Why It Matters

Corporate Data Warehouse

A central hub for storing a company’s data, usually historical and structured.

All sales, customer, and inventory data from across departments stored in one place.

Ideal for data mining, reporting, and cross-functional analysis.

Data Availability

How easy it is to get data from different sources (internal or external).

Data from HR, CRM systems, or 3rd-party APIs.

Influences what data can be used and how reliable your analysis will be.

Data Cleaning (Cleansing)

The process of fixing or removing errors in data, like blanks, typos, or duplicates.

Fixing missing income fields or removing duplicate records.

Ensures data is accurate, which is critical before modeling.

Data Complexity

How complicated or structured the data is — includes data types, relationships, and volume.

Nested tables, multi-level categories, time-series logs.

High complexity may require more prep time and resources.

Qualitative Data Types

Data that describes categories or labels, not numbers. Includes:

  • Nominal = No order (e.g., gender, product type)

  • Ordinal = Has order (e.g., satisfaction rating: low, medium, high) | Nominal: “Red”, “Blue”
    Ordinal: “Beginner”, “Intermediate”, “Expert” | Useful for categorical analysis, visualizations, or when filtering groups. |
    | Quantitative Data Types | Data that is measurable and numeric. Includes:

  • Interval = Numbers without true zero (e.g., temperature in °C)

  • Ratio = Has true zero (e.g., sales, height, weight) | Interval: 30°C
    Ratio: $1000 revenue | Used in statistical models and calculations (like mean, regression). |
    | R | A programming language and environment for data analysis, stats, and visualization. | Using ggplot2 for charts or lm() for linear regression. | Widely used in analytics and open-source (free). Essential tool in data science. |
    | Type I Error | A false positive – saying there is an effect when there isn’t. | Test says a medicine works, but it actually doesn't. | Controlled by alpha (α). Must minimize in sensitive tests (e.g., fraud detection). |
    | Type II Error | A false negative – saying there is no effect when there is one. | Test says the medicine doesn’t work, but it does. | Controlled by beta (β). Risky in critical systems like disease detection. |
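To see the last two terms side by side, here's a tiny Python sketch (the truth/prediction values are invented) that counts Type I errors (false positives) and Type II errors (false negatives) against known ground truth:

```python
# truth = whether an effect really exists; predicted = what the test concluded.
truth     = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

# Type I error: the test claimed an effect that isn't there (false positive).
type_1 = sum(1 for t, p in zip(truth, predicted) if not t and p)

# Type II error: the test missed a real effect (false negative).
type_2 = sum(1 for t, p in zip(truth, predicted) if t and not p)

print(type_1)  # 1
print(type_2)  # 1
```

Lowering the threshold for declaring an effect trades one error for the other, which is why alpha (α) and beta (β) are tuned per application.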