Preparing Business Data for Prediction – Comprehensive Study Notes

Objectives

Build foundational understanding of how to get raw business data “prediction-ready”.
Recognize three primary data classes that appear in business contexts.
Diagnose and remediate the four most common data quality problems.
Master a toolbox of seven Excel features/functions for systematic cleaning.
Practice on a mini-case (messy sales dataset) that contains every problem in one sheet.
Appreciate the downstream impact of clean data on forecasting, dashboards, and machine-learning pipelines.

Types of Business Data

Numerical (Quantitative)
- Definition : Values that can be measured and ordered.
- Typical fields : revenue, profit, units sold, employee age.
- Significance : Input for aggregation (sum, average), statistical modeling, regression.
Categorical (Qualitative / Labels)
- Definition : Finite set of discrete classes.
- Examples : region (North, South), product category, customer segment, gender.
- Significance : Provide grouping keys for pivot tables and dummy-variable encoding.
Time-Series
- Definition : Observations indexed by time stamps with consistent frequency.
- Examples : daily website visits, monthly sales, quarterly GDP.
- Significance : Needed for forecasting algorithms that exploit temporal order (ARIMA, ETS).
- Reminder : Excel’s built-in Forecast Sheet requires true serial dates.

Common Data Problems in Business Datasets

Missing Values
- Empty cells, “n/a”, "-", or placeholders that break calculations.
- Impact : causes list-wise deletion or biased imputations if ignored.
Outliers
- Extremely small/large numbers, e.g., $10{,}000$ units sold when average is $100$ .
- Impact : distorts mean, standard deviation, and regression coefficients.
Wrong Formats
- Dates stored as text ("June 1 2025"), currency as strings ("₱50,000" inside text cell).
- Impact : Excel cannot parse for chronological sort or numeric aggregation.
Inconsistent Entries
- Same label typed differently: "North ", "north", "NORTH" (extra space / mixed case).
- Impact : duplicates categories, creates fragmented pivots, misaligned joins.

Excel Cleaning Functions & Tools

TRIM()
- Removes leading, trailing, and repeated inner spaces.
- Example: $=TRIM(A2)$ converts " Best Seller " → "Best Seller".
CLEAN()
- Strips non-printable ASCII characters (line breaks, hidden control codes).
- Example: $=CLEAN(A2)$ fixes "Special@@Offer" when @@ are char codes.
ISERROR()
- Returns TRUE if cell contains any Excel error (#DIV/0!, #VALUE!, #REF!, …).
- Pair with IF() for custom handling: $=IF(ISERROR(B2),0,B2)$ .
IFNA()
- Specialized wrapper for #N/A only (leaves other errors untouched).
- Syntax: $=IFNA(VLOOKUP(…),"No Promo")$ .
TEXT()
- Re-formats numeric/datetime value into controlled text string.
- Example: $=TEXT(C2,"yyyy-mm-dd")$ standardizes disparate date styles.
FIND / REPLACE (Ctrl+H)
- Mass-edit spelling (“complete”→“Complete”), remove substrings (“@”).
Data Validation
- Dropdown list (Region = {North, South, East, West}) prevents future typos.
Text-to-Columns
- Delimiter or fixed-width splitting (e.g., "John Doe" → First Name / Last Name).
Number Formatting
- Currency with thousands separator: "₱##,##0.00"; percentages: "0.0%".
Conditional Formatting
- Highlight rules: Top 10%, greater than ₱ $500{,}000$ , blanks, errors.
Sorting & Filtering
- Sort ascending/descending by date, revenue.
- AutoFilter arrows to isolate blanks (“Blanks”) or particular categories.

Mini Case – Messy Sales Dataset (4 Stores, Multiple Copies)

Dataset snippet (columns: Branch Name, Monthly Sales, Transaction Date, Status, Remarks, Promo Applied) illustrates every problem:

Missing Region field (to be added via Data Validation later).
Product/branch names:
- Variants: "MANILA", "Manila", "manila" – case inconsistency.
- Extra internal spaces: "₱ 48 000".
Monthly Sales anomalies:
- Text currency, embedded spaces, “₱error”, #DIV/0!.
Date heterogeneity:
- ISO (2025-06-01), US (6/1/2025), dotted (2025.06.01), text (June 1, 2025), two-digit year (6/1/25).
Status values typed five ways: complete, COMPLETE, Completed, InProgress, INprogress.
Remarks with hidden / special characters: "Special@@Offer", "Delayed@@@Delivery".
Promo Applied column: #N/A, "-", "n/a", "N/A", real codes (SUMMER2025).

Cleaning Workflow (Recommended Steps)

Copy raw sheet → create staging tab; keep immutable raw backup.
Apply TRIM() & CLEAN() in helper columns on all text fields; paste → values.
Convert Monthly Sales
- Remove "₱" via FIND/REPLACE.
- Eliminate spaces using SUBSTITUTE(" ",""), then VALUE() to get numeric.
Detect errors
- $=ISERROR(B2)$ ; filter TRUE → inspect, fix (#DIV/0! becomes blank or imputed mean).
Standardize dates
- Use DATEVALUE() if text.
- For ambiguous "6/1/25", assume locale (Month-Day-Year) or parse with TEXT().
Normalize Status
- UPPER() helper → single case, then FIND/REPLACE "INPROGRESS" → "In Progress", etc.
Validate Region
- Insert new column; set Data Validation list {North, South, East, West}.
Handle Promo Applied
- $=IFNA(F2,"No Promo")$ ; also replace "-", "n/a" with "No Promo".
Outlier check
- Conditional Format → values > $3$ ×IQR or $\mu \pm 3\sigma$ .
“Ready” dataset saved as .xlsx and exported to .csv for BI tools.

Ethical & Practical Implications

Errors in currency fields can inflate projected revenue → mislead investors.
Inconsistent gender or region labels can bias demographic segmentation → fairness concerns.
Proper documentation of cleaning steps ensures auditability and regulatory compliance (e.g., SOX, GDPR).

Quick Reference – Excel Cleaning Functions (Cheat Sheet)

$=TRIM(\text{cell})$ → space normalization.
$=CLEAN(\text{cell})$ → strip ASCII 0–31.
$=ISERROR(\text{cell})$ → TRUE/FALSE flag.
$=IFNA(\text{cell},"replacement")$ → neutralize #N/A.
$=TEXT(\text{cell},"yyyy-mm-dd")$ → unified ISO date.
FIND (Ctrl+F) / REPLACE (Ctrl+H) → manual batch standardization.

Connecting to Predictive Modeling

Garbage-in-garbage-out: model accuracy $\approx$ quality of input data.
Clean, standardized fields enable:
- One-hot encoding of categorical variables.
- Smooth training/validation splits without hidden errors.
- Time-series forecasting that respects actual chronological order.
Performance metrics such as $R^2$ or $RMSE$ improve markedly once outliers and typos are fixed.

Recap & Action Items

Always profile new data: count blanks, run descriptive stats, scan formats.
Maintain a reusable data-cleaning checklist for future projects.
Document every transformation step (Excel Power Query or a notebook) for reproducibility.
Cleaned data should flow into dashboards, forecasting sheets, or be uploaded to cloud databases for advanced ML.

Practice Prompt

Open the provided .xlsx; implement the ten-step workflow above.
Compare pre- vs post-cleaning: count of errors, average sales, forecast outputs.
Reflect: Which cleaning step had the biggest impact on analytical insight?

Key Takeaway

Properly prepared data is not a clerical exercise—it is the foundation for trustworthy business predictions, strategic decisions, and ethical analytics.