Preparing Business Data for Prediction – Comprehensive Study Notes
Objectives
Build foundational understanding of how to get raw business data “prediction-ready”.
Recognize three primary data classes that appear in business contexts.
Diagnose and remediate the four most common data quality problems.
Master a toolbox of seven Excel features/functions for systematic cleaning.
Practice on a mini-case (messy sales dataset) that contains every problem in one sheet.
Appreciate the downstream impact of clean data on forecasting, dashboards, and machine-learning pipelines.
Types of Business Data
Numerical (Quantitative)
Definition : Values that can be measured and ordered.
Typical fields : revenue, profit, units sold, employee age.
Significance : Input for aggregation (sum, average), statistical modeling, regression.
Categorical (Qualitative / Labels)
Definition : Finite set of discrete classes.
Examples : region (North, South), product category, customer segment, gender.
Significance : Provide grouping keys for pivot tables and dummy-variable encoding.
Time-Series
Definition : Observations indexed by time stamps with consistent frequency.
Examples : daily website visits, monthly sales, quarterly GDP.
Significance : Needed for forecasting algorithms that exploit temporal order (ARIMA, ETS).
Reminder : Excel’s built-in Forecast Sheet requires true serial dates.
Common Data Problems in Business Datasets
Missing Values
Empty cells, “n/a”, "-", or placeholders that break calculations.
Impact : causes list-wise deletion or biased imputations if ignored.
Outliers
Extremely small/large numbers, e.g., units sold when average is .
Impact : distorts mean, standard deviation, and regression coefficients.
Wrong Formats
Dates stored as text ("June 1 2025"), currency as strings ("₱50,000" inside text cell).
Impact : Excel cannot parse for chronological sort or numeric aggregation.
Inconsistent Entries
Same label typed differently: "North ", "north", "NORTH" (extra space / mixed case).
Impact : duplicates categories, creates fragmented pivots, misaligned joins.
Excel Cleaning Functions & Tools
TRIM()
Removes leading, trailing, and repeated inner spaces.
Example: converts " Best Seller " → "Best Seller".
CLEAN()
Strips non-printable ASCII characters (line breaks, hidden control codes).
Example: fixes "Special@@Offer" when @@ are char codes.
ISERROR()
Returns TRUE if cell contains any Excel error (#DIV/0!, #VALUE!, #REF!, …).
Pair with IF() for custom handling: .
IFNA()
Specialized wrapper for #N/A only (leaves other errors untouched).
Syntax: .
TEXT()
Re-formats numeric/datetime value into controlled text string.
Example: standardizes disparate date styles.
FIND / REPLACE (Ctrl+H)
Mass-edit spelling (“complete”→“Complete”), remove substrings (“@”).
Data Validation
Dropdown list (Region = {North, South, East, West}) prevents future typos.
Text-to-Columns
Delimiter or fixed-width splitting (e.g., "John Doe" → First Name / Last Name).
Number Formatting
Currency with thousands separator: "₱##,##0.00"; percentages: "0.0%".
Conditional Formatting
Highlight rules: Top 10%, greater than ₱, blanks, errors.
Sorting & Filtering
Sort ascending/descending by date, revenue.
AutoFilter arrows to isolate blanks (“Blanks”) or particular categories.
Mini Case – Messy Sales Dataset (4 Stores, Multiple Copies)
Dataset snippet (columns: Branch Name, Monthly Sales, Transaction Date, Status, Remarks, Promo Applied) illustrates every problem:
Missing Region field (to be added via Data Validation later).
Product/branch names:
Variants: "MANILA", "Manila", "manila" – case inconsistency.
Extra internal spaces: "₱ 48 000".
Monthly Sales anomalies:
Text currency, embedded spaces, “₱error”, #DIV/0!.
Date heterogeneity:
ISO (2025-06-01), US (6/1/2025), dotted (2025.06.01), text (June 1, 2025), two-digit year (6/1/25).
Status values typed five ways: complete, COMPLETE, Completed, InProgress, INprogress.
Remarks with hidden / special characters: "Special@@Offer", "Delayed@@@Delivery".
Promo Applied column: #N/A, "-", "n/a", "N/A", real codes (SUMMER2025).
Cleaning Workflow (Recommended Steps)
Copy raw sheet → create staging tab; keep immutable raw backup.
Apply TRIM() & CLEAN() in helper columns on all text fields; paste → values.
Convert Monthly Sales
Remove "₱" via FIND/REPLACE.
Eliminate spaces using SUBSTITUTE(" ",""), then VALUE() to get numeric.
Detect errors
; filter TRUE → inspect, fix (#DIV/0! becomes blank or imputed mean).
Standardize dates
Use DATEVALUE() if text.
For ambiguous "6/1/25", assume locale (Month-Day-Year) or parse with TEXT().
Normalize Status
UPPER() helper → single case, then FIND/REPLACE "INPROGRESS" → "In Progress", etc.
Validate Region
Insert new column; set Data Validation list {North, South, East, West}.
Handle Promo Applied
; also replace "-", "n/a" with "No Promo".
Outlier check
Conditional Format → values > ×IQR or .
“Ready” dataset saved as .xlsx and exported to .csv for BI tools.
Ethical & Practical Implications
Errors in currency fields can inflate projected revenue → mislead investors.
Inconsistent gender or region labels can bias demographic segmentation → fairness concerns.
Proper documentation of cleaning steps ensures auditability and regulatory compliance (e.g., SOX, GDPR).
Quick Reference – Excel Cleaning Functions (Cheat Sheet)
→ space normalization.
→ strip ASCII 0–31.
→ TRUE/FALSE flag.
→ neutralize #N/A.
→ unified ISO date.
FIND (Ctrl+F) / REPLACE (Ctrl+H) → manual batch standardization.
Connecting to Predictive Modeling
Garbage-in-garbage-out: model accuracy quality of input data.
Clean, standardized fields enable:
One-hot encoding of categorical variables.
Smooth training/validation splits without hidden errors.
Time-series forecasting that respects actual chronological order.
Performance metrics such as or improve markedly once outliers and typos are fixed.
Recap & Action Items
Always profile new data: count blanks, run descriptive stats, scan formats.
Maintain a reusable data-cleaning checklist for future projects.
Document every transformation step (Excel Power Query or a notebook) for reproducibility.
Cleaned data should flow into dashboards, forecasting sheets, or be uploaded to cloud databases for advanced ML.
Practice Prompt
Open the provided .xlsx; implement the ten-step workflow above.
Compare pre- vs post-cleaning: count of errors, average sales, forecast outputs.
Reflect: Which cleaning step had the biggest impact on analytical insight?
Key Takeaway
Properly prepared data is not a clerical exercise—it is the foundation for trustworthy business predictions, strategic decisions, and ethical analytics.