E. Unit 2.3: Data Preparation Phase
🛠 2.3 – Phase 2: Data Preparation (Full Breakdown with Examples)
Theme: “Prepare your data like you're preparing a meal—clean, portion, and plate it before serving analysis.”
🎯 Purpose of This Phase
Set up a safe space (sandbox) for exploring data.
Perform ETLT (Extract, Transform, Load, Transform).
Understand and clean the data.
Prepare data structures, fill in gaps, and visualize data to find issues.
Most time-consuming phase (can take up 50%+ of total project effort).
🔲 2.3.1 – Preparing the Analytics Sandbox (In Simple Terms)
What It Is
A sandbox is a safe, separate workspace where data analysts and scientists can explore and experiment with data without touching live systems (like finance or sales databases).
Why It Matters
It protects real systems from accidental changes or crashes during testing.
Example
A fraud analyst wants to test models using customer transaction data. Instead of using the live payment system, they use a sandbox copy of the data.
✅ This keeps real customer data safe while allowing deep analysis.
Key Points
Load all types of data: raw, structured (like spreadsheets), aggregated (summarized), and unstructured (like call logs or clickstreams).
Conflict: IT teams want to limit access for security; Data Science teams want full access to explore.
Solution: Work together—follow IT rules but allow enough freedom to explore.
📌 Tip: Plan for 5–10x more storage than the size of your raw data to handle all the extra copies and experiments.
🔄 2.3.2 – Performing ETLT (Extract, Transform, Load, Transform)
What It Is
ETLT is a process for preparing data. It stands for:
| Step | What It Means | Why It Matters |
|---|---|---|
| Extract | Pull data from different systems | From databases, APIs like Twitter or Salesforce |
| Load | Put raw data into the sandbox | Keeps the original data untouched |
| Transform | Clean and reshape the data | Makes it usable for analysis or modeling |
| (Transform again) | Do extra formatting if needed | For special analysis or custom models |
Why ETLT > ETL
ETLT keeps a copy of the raw data, which is useful in cases like fraud detection, where unusual data (outliers) might be important, not just noise.
📌 Tip: For big data, use tools like Hadoop or MapReduce to load and process data in parallel.
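To make the ETLT flow concrete, here is a minimal R sketch. The file paths and column names are hypothetical; the point is that the raw copy is preserved while a cleaned version is built alongside it:

```r
# Extract + Load: bring the raw extract into the sandbox and keep it untouched
raw_orders <- read.csv("sandbox/raw_orders.csv")        # hypothetical file
saveRDS(raw_orders, "sandbox/raw_orders_backup.rds")    # preserve the raw copy

# Transform: build a cleaned version without modifying the raw data
clean_orders <- raw_orders
clean_orders$order_date <- as.Date(clean_orders$order_date, format = "%Y-%m-%d")
clean_orders <- clean_orders[!is.na(clean_orders$amount), ]  # drop rows missing amount

# (Transform again): extra reshaping for a specific analysis, e.g. keeping
# outliers separate instead of deleting them (useful for fraud detection)
outliers <- clean_orders[clean_orders$amount > quantile(clean_orders$amount, 0.99), ]
```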
🧠 2.3.3 – Learning About the Data (In Simple Terms)
What It’s For
Before analysis, you need to understand what data you have, what’s missing, and who controls it.
| Activity | Purpose | Example |
|---|---|---|
| Catalog access | Know what data is available | Internal logs, survey results, web analytics |
| Identify gaps | Spot missing or restricted data | HR data exists, but you can’t see salary info |
| Plan partnerships | Talk to data owners to get access | Ask Finance or Sales teams for needed data |
📌 Tip: Start looking for external data too—like open APIs or purchased datasets—that could add value to your project.
🧽 2.3.4 – Data Conditioning (a.k.a. Cleaning Your Data)
Think of this like prepping ingredients before cooking. You want everything clean, organized, and ready to go!
| Step | What It Means | Example |
|---|---|---|
| Clean | Fix or remove bad data | Fill in missing ages with the average age |
| Normalize | Make data consistent | Change height from inches to centimeters |
| Transform | Change or split data into useful parts | Split “Full Name” into First and Last names |
| Validate | Double-check that everything makes sense | Make sure income is a number and not negative |
🔍 Tips:
Work with your team—what one person thinks is “bad data” might be useful to someone else.
Don’t delete weird data too fast—it might reveal something important (like fraud!).
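As a small illustration of the four steps above, here is an R sketch on a made-up customer data frame (all column names and values are assumptions):

```r
customers <- data.frame(
  full_name = c("Ana Diaz", "Bo Chen", "Cara Voss"),
  age       = c(34, NA, 29),
  height_in = c(64, 70, 61),
  income    = c(52000, 61000, -100)
)

# Clean: fill missing ages with the average age
customers$age[is.na(customers$age)] <- mean(customers$age, na.rm = TRUE)

# Normalize: convert height from inches to centimeters
customers$height_cm <- customers$height_in * 2.54

# Transform: split "Full Name" into first and last names
parts <- strsplit(customers$full_name, " ")
customers$first_name <- sapply(parts, `[`, 1)
customers$last_name  <- sapply(parts, `[`, 2)

# Validate: flag (rather than delete) incomes that are not positive,
# echoing the tip above about not discarding "weird" data too fast
customers$income_valid <- customers$income > 0
```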
📊 2.3.5 – Survey and Visualize the Data
This is like looking at your ingredients before cooking—see what you’ve got!
| Technique | Why Use It | Example |
|---|---|---|
| Overview first | Get the big picture | A chart shows most customers are under 30 |
| Zoom & filter | Focus on specific groups | Look at customers from one zip code |
| Detail on demand | Check individual data points | Read comments from people who gave low ratings |
🧩 Use this step to:
Check if numbers are in the right range (e.g., no one is 500 years old).
Make sure time data is consistent (daily vs. hourly).
Spot weird or inconsistent entries (like different ways of writing the same state).
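Here is a minimal R sketch of the overview → zoom → detail pattern, using simulated ages (the data is invented purely for illustration):

```r
# Overview first: distribution of customer ages, with one impossible age planted
set.seed(42)
ages <- c(round(rnorm(950, mean = 27, sd = 5)), 500)
hist(ages, main = "Customer Ages: Overview")

# Zoom & filter: summary() immediately flags the max of 500 as suspicious
summary(ages)
young <- ages[ages >= 18 & ages <= 35]
hist(young, main = "Customers Aged 18-35")

# Detail on demand: inspect the suspicious records individually
ages[ages > 120]   # no one is 500 years old
```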
🧰 2.3.6 – Common Tools for Data Prep
Here are some tools that help you clean and organize your data:
| Tool | What It Does | Use It For |
|---|---|---|
| Hadoop | Handles huge amounts of data | Load millions of GPS or website clicks |
| Alpine Miner | Drag-and-drop tool for building workflows | Find top 100 customers and group them |
| OpenRefine | Clean messy data with a simple interface | Fix date formats or merge duplicates |
| Data Wrangler | Try out data changes and export the code | Test changes, then use the code in Python later |
📌 Pro Tip:
Wrangler is great because you can play with data visually, then export the logic to use in your code later!
🧾 Summary Table: Phase 2 – Data Preparation
| Step | What Happens | Example |
|---|---|---|
| Prepare Sandbox | Create isolated workspace | Copy production sales DB to cloud sandbox |
| Perform ETLT | Load raw → Transform → Clean | Import 1M Twitter posts, keep raw and clean versions |
| Learn the Data | Catalog access, find gaps | You have customer age but no income info |
| Condition the Data | Clean, format, validate | Income must be numeric and > 0 |
| Visualize & Explore | Spot trends/outliers/skew | Bar chart of survey results shows bias |
| Choose Tools | Use right tech stack | Use OpenRefine + Hadoop for scaling |
📘 Chapter 3.1 – Introduction to R
Sections Covered:
🔹 3.1: Introduction to R (pp. 64–67)
🔹 3.1.1: R Graphical User Interfaces (pp. 67–69)
🔹 3.1.2: Data Import and Export (p. 69)
🧠 What Is R?
R is a free, open-source programming language designed for:
Statistical computing
Data visualization
Modeling
Exploratory data analysis
It’s maintained through CRAN (Comprehensive R Archive Network).
R is widely used by data scientists, statisticians, and researchers.
🔹 Key Actions in R (with Code Examples)
| Task | Code | What It Does |
|---|---|---|
| Import data | `read.csv()` | Loads a CSV file |
| View dataset | `head()` | Displays the first 6 rows |
| Descriptive stats | `summary()` | Min, max, mean, median, quartiles |
| Create scatterplot | `plot()` | Plots number of orders vs. total sales |
| Linear regression | `lm()` | Fits a line to data |
| View regression output | `summary(results)` | Shows coefficients, R², p-values |
| Check residuals | `hist(results$residuals)` | Visualizes model errors |
📊 Example: Retail Sales Dataset
Scenario:
10,000 customer records, each with:
- `sales_total`: Annual sales in USD
- `num_of_orders`: Number of orders
- `gender`: F or M
📋 Step-by-step Actions:
1. Import data:

```r
sales <- read.csv("C:/data/yearly_sales.csv")
```

2. Inspect the first few rows:

```r
head(sales)
```

3. Summary statistics:

```r
summary(sales)
```

Output includes:
- Min/Max/Median of sales and orders
- Gender distribution: 5035 F, 4965 M

4. Scatterplot of sales vs. orders:

```r
plot(sales$num_of_orders, sales$sales_total,
     main = "Number of Orders vs. Sales")
```

5. Linear regression:

```r
results <- lm(sales_total ~ num_of_orders, data = sales)
summary(results)
```

Key output:
- Intercept: -154.1
- Slope: 166.2
- R-squared: 0.5637 → the model explains ~56% of the variation
- p-values: very small (< 2e-16) → statistically significant

6. Histogram of residuals:

```r
hist(results$residuals, breaks = 800)
```
🔍 Understanding Functions
Generic functions adapt based on input:
`summary()`, `plot()`, and `hist()` behave differently for different data types.
Regression output includes:
Intercept and slope
Error stats (Std. Error, t-value, p-value)
Residuals (errors between predicted and actual)
📌 You don’t need to fully interpret regression output here — it will be explained in more depth later in the book.
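A quick sketch of this generic behavior, safe to run in any R session (the sample values are made up):

```r
nums   <- c(12, 15, 9, 22)
gender <- factor(c("F", "M", "F", "F"))

summary(nums)    # numeric input: min, quartiles, mean, max
summary(gender)  # factor input: counts per category

plot(nums)       # numeric vector: index vs. value scatterplot
plot(gender)     # factor: bar chart of category counts
```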
💻 3.1.1 – R Graphical User Interfaces (GUIs)
R is primarily command-line-based, but GUIs make R more accessible.
🔧 Common GUIs:
| GUI | Purpose |
|---|---|
| RGui | The default R interface (basic) |
| RStudio | Most popular IDE for R (highly recommended) |
| Rattle | GUI for data mining workflows |
| R Commander | Menu-based GUI for stats and plots |
| JGR / Deducer | Java-based spreadsheet-style R GUIs |
🖼 RStudio Layout (Figure 3-3)
RStudio has 4 key panels:
| Pane | What It Does |
|---|---|
| Scripts | Write and save R code |
| Console | Run commands and see output |
| Workspace | View variables and datasets |
| Plots | View and export data visualizations |
💡 Use `?lm` or `help(lm)` in the Console to get function documentation.

You can also use:
- `edit()` or `fix()` to manually edit data
- `save.image()` to save your session
- `.RData` files to store your environment
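For example, in the Console (the object and file names here are hypothetical):

```r
?lm                             # open the documentation for lm()
help(lm)                        # the same lookup, written as a function call

sales <- data.frame(x = 1:3)    # a hypothetical object in the workspace
sales <- edit(sales)            # edit() returns an edited copy
fix(sales)                      # fix() edits the object in place
save.image("mysession.RData")   # write all workspace objects to an .RData file
load("mysession.RData")         # restore them in a later session
```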
🔄 3.1.2 – Data Import and Export
You can load data from several formats, not just CSV.
🧾 File Handling Functions
| Function | Description | Example |
|---|---|---|
| `setwd()` | Set working directory | `setwd("c:/data/")` |
| `read.csv()` | Read comma-separated files | `read.csv("yearly_sales.csv")` |
| `read.table()` | Read tables with custom delimiters | `read.table("sales.txt", sep = ";")` |
| `read.delim()` | Read tab-separated files | `read.delim("sales.txt")` |
📌 R uses forward slashes (/) for file paths — even in Windows.
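Export works the same way in reverse. A small round-trip sketch (the paths and column name are assumptions):

```r
setwd("c:/data/")                       # forward slashes, even on Windows

sales <- read.csv("yearly_sales.csv")   # import
sales$sales_total <- round(sales$sales_total, 2)

# export the cleaned copy; row.names = FALSE avoids an extra index column
write.csv(sales, "yearly_sales_clean.csv", row.names = FALSE)
```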
✅ Final Takeaways
R is a powerful analytics tool, especially for statistical work.
You can:
Import and clean data
Run basic models (like linear regression)
Create quick plots
Work with GUIs like RStudio for ease of use
Understanding output and workflow early on helps with later lifecycle phases (like model planning and building).
📚 Key Terms – Data Analytics (with Examples & Explanations)
| Term | What It Means (Easy) | Example | Why It Matters |
|---|---|---|---|
| Corporate Data Warehouse | A central hub for storing a company’s data, usually historical and structured. | All sales, customer, and inventory data from across departments stored in one place. | Ideal for data mining, reporting, and cross-functional analysis. |
| Data Availability | How easy it is to get data from different sources (internal or external). | Data from HR, CRM systems, or 3rd-party APIs. | Influences what data can be used and how reliable your analysis will be. |
| Data Cleaning (Cleansing) | The process of fixing or removing errors in data, like blanks, typos, or duplicates. | Fixing missing income fields or removing duplicate records. | Ensures data is accurate, which is critical before modeling. |
| Data Complexity | How complicated or structured the data is, including data types, relationships, and volume. | Nested tables, multi-level categories, time-series logs. | High complexity may require more prep time and resources. |
| Qualitative Data Types | Data that describes categories or labels, not numbers. Nominal = no order (e.g., gender, product type); Ordinal = has order (e.g., satisfaction rating: low, medium, high). | Nominal: “Red”, “Blue”; Ordinal: “Beginner”, “Intermediate”, “Expert” | Useful for categorical analysis, visualizations, or filtering groups. |
| Quantitative Data Types | Data that is measurable and numeric. Interval = numbers without a true zero (e.g., temperature in °C); Ratio = has a true zero (e.g., sales, height, weight). | Interval: 30°C; Ratio: $1000 revenue | Used in statistical models and calculations (like mean, regression). |
| R | A programming language and environment for data analysis, stats, and visualization. | Using `ggplot2` for charts or `lm()` for linear regression. | Widely used in analytics and open source (free). Essential tool in data science. |
| Type I Error | A false positive: saying there is an effect when there isn’t. | A test says a medicine works, but it actually doesn’t. | Controlled by alpha (α). Must be minimized in sensitive tests (e.g., fraud detection). |
| Type II Error | A false negative: saying there is no effect when there is one. | A test says the medicine doesn’t work, but it does. | Controlled by beta (β). Risky in critical systems like disease detection. |
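Several of these terms map directly onto R’s data types. A brief illustrative sketch (all values are made up):

```r
# Nominal: labels with no inherent order
product <- factor(c("Red", "Blue", "Red"))

# Ordinal: labels with an order, modeled as an ordered factor
skill <- factor(c("Beginner", "Expert", "Intermediate"),
                levels  = c("Beginner", "Intermediate", "Expert"),
                ordered = TRUE)
skill[1] < skill[2]   # TRUE: ordered factors support comparisons

# Ratio: numeric with a true zero (e.g., revenue in dollars)
revenue <- c(1000, 250, 0)
mean(revenue)         # arithmetic is meaningful for quantitative data
```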