Preparing Business Data for Prediction – Comprehensive Study Notes

Objectives

  • Build foundational understanding of how to get raw business data “prediction-ready”.

  • Recognize three primary data classes that appear in business contexts.

  • Diagnose and remediate the four most common data quality problems.

  • Master a toolbox of seven Excel features/functions for systematic cleaning.

  • Practice on a mini-case (messy sales dataset) that contains every problem in one sheet.

  • Appreciate the downstream impact of clean data on forecasting, dashboards, and machine-learning pipelines.

Types of Business Data

  • Numerical (Quantitative)

    • Definition : Values that can be measured and ordered.

    • Typical fields : revenue, profit, units sold, employee age.

    • Significance : Input for aggregation (sum, average), statistical modeling, regression.

  • Categorical (Qualitative / Labels)

    • Definition : Finite set of discrete classes.

    • Examples : region (North, South), product category, customer segment, gender.

    • Significance : Provide grouping keys for pivot tables and dummy-variable encoding.

  • Time-Series

    • Definition : Observations indexed by time stamps with consistent frequency.

    • Examples : daily website visits, monthly sales, quarterly GDP.

    • Significance : Needed for forecasting algorithms that exploit temporal order (ARIMA, ETS).

    • Reminder : Excel’s built-in Forecast Sheet requires true serial dates.

Common Data Problems in Business Datasets

  • Missing Values

    • Empty cells, “n/a”, "-", or placeholders that break calculations.

    • Impact : causes list-wise deletion or biased imputations if ignored.

  • Outliers

    • Extremely small/large numbers, e.g., 10,00010{,}000 units sold when average is 100100.

    • Impact : distorts mean, standard deviation, and regression coefficients.

  • Wrong Formats

    • Dates stored as text ("June 1 2025"), currency as strings ("₱50,000" inside text cell).

    • Impact : Excel cannot parse for chronological sort or numeric aggregation.

  • Inconsistent Entries

    • Same label typed differently: "North ", "north", "NORTH" (extra space / mixed case).

    • Impact : duplicates categories, creates fragmented pivots, misaligned joins.

Excel Cleaning Functions & Tools

  • TRIM()

    • Removes leading, trailing, and repeated inner spaces.

    • Example: =TRIM(A2)=TRIM(A2) converts " Best Seller " → "Best Seller".

  • CLEAN()

    • Strips non-printable ASCII characters (line breaks, hidden control codes).

    • Example: =CLEAN(A2)=CLEAN(A2) fixes "Special@@Offer" when @@ are char codes.

  • ISERROR()

    • Returns TRUE if cell contains any Excel error (#DIV/0!, #VALUE!, #REF!, …).

    • Pair with IF() for custom handling: =IF(ISERROR(B2),0,B2)=IF(ISERROR(B2),0,B2).

  • IFNA()

    • Specialized wrapper for #N/A only (leaves other errors untouched).

    • Syntax: =IFNA(VLOOKUP(),"NoPromo")=IFNA(VLOOKUP(…),"No Promo").

  • TEXT()

    • Re-formats numeric/datetime value into controlled text string.

    • Example: =TEXT(C2,"yyyymmdd")=TEXT(C2,"yyyy-mm-dd") standardizes disparate date styles.

  • FIND / REPLACE (Ctrl+H)

    • Mass-edit spelling (“complete”→“Complete”), remove substrings (“@”).

  • Data Validation

    • Dropdown list (Region = {North, South, East, West}) prevents future typos.

  • Text-to-Columns

    • Delimiter or fixed-width splitting (e.g., "John Doe" → First Name / Last Name).

  • Number Formatting

    • Currency with thousands separator: "₱##,##0.00"; percentages: "0.0%".

  • Conditional Formatting

    • Highlight rules: Top 10%, greater than ₱500,000500{,}000, blanks, errors.

  • Sorting & Filtering

    • Sort ascending/descending by date, revenue.

    • AutoFilter arrows to isolate blanks (“Blanks”) or particular categories.

Mini Case – Messy Sales Dataset (4 Stores, Multiple Copies)

Dataset snippet (columns: Branch Name, Monthly Sales, Transaction Date, Status, Remarks, Promo Applied) illustrates every problem:

  • Missing Region field (to be added via Data Validation later).

  • Product/branch names:

    • Variants: "MANILA", "Manila", "manila" – case inconsistency.

    • Extra internal spaces: "₱ 48 000".

  • Monthly Sales anomalies:

    • Text currency, embedded spaces, “₱error”, #DIV/0!.

  • Date heterogeneity:

    • ISO (2025-06-01), US (6/1/2025), dotted (2025.06.01), text (June 1, 2025), two-digit year (6/1/25).

  • Status values typed five ways: complete, COMPLETE, Completed, InProgress, INprogress.

  • Remarks with hidden / special characters: "Special@@Offer", "Delayed@@@Delivery".

  • Promo Applied column: #N/A, "-", "n/a", "N/A", real codes (SUMMER2025).

Cleaning Workflow (Recommended Steps)
  1. Copy raw sheet → create staging tab; keep immutable raw backup.

  2. Apply TRIM() & CLEAN() in helper columns on all text fields; paste → values.

  3. Convert Monthly Sales

    • Remove "₱" via FIND/REPLACE.

    • Eliminate spaces using SUBSTITUTE(" ",""), then VALUE() to get numeric.

  4. Detect errors

    • =ISERROR(B2)=ISERROR(B2); filter TRUE → inspect, fix (#DIV/0! becomes blank or imputed mean).

  5. Standardize dates

    • Use DATEVALUE() if text.

    • For ambiguous "6/1/25", assume locale (Month-Day-Year) or parse with TEXT().

  6. Normalize Status

    • UPPER() helper → single case, then FIND/REPLACE "INPROGRESS" → "In Progress", etc.

  7. Validate Region

    • Insert new column; set Data Validation list {North, South, East, West}.

  8. Handle Promo Applied

    • =IFNA(F2,"NoPromo")=IFNA(F2,"No Promo"); also replace "-", "n/a" with "No Promo".

  9. Outlier check

    • Conditional Format → values > 33×IQR or μ±3σ\mu \pm 3\sigma.

  10. “Ready” dataset saved as .xlsx and exported to .csv for BI tools.

Ethical & Practical Implications
  • Errors in currency fields can inflate projected revenue → mislead investors.

  • Inconsistent gender or region labels can bias demographic segmentation → fairness concerns.

  • Proper documentation of cleaning steps ensures auditability and regulatory compliance (e.g., SOX, GDPR).

Quick Reference – Excel Cleaning Functions (Cheat Sheet)

  • =TRIM(cell)=TRIM(\text{cell}) → space normalization.

  • =CLEAN(cell)=CLEAN(\text{cell}) → strip ASCII 0–31.

  • =ISERROR(cell)=ISERROR(\text{cell}) → TRUE/FALSE flag.

  • =IFNA(cell,"replacement")=IFNA(\text{cell},"replacement") → neutralize #N/A.

  • =TEXT(cell,"yyyymmdd")=TEXT(\text{cell},"yyyy-mm-dd") → unified ISO date.

  • FIND (Ctrl+F) / REPLACE (Ctrl+H) → manual batch standardization.

Connecting to Predictive Modeling

  • Garbage-in-garbage-out: model accuracy \approx quality of input data.

  • Clean, standardized fields enable:

    • One-hot encoding of categorical variables.

    • Smooth training/validation splits without hidden errors.

    • Time-series forecasting that respects actual chronological order.

  • Performance metrics such as R2R^2 or RMSERMSE improve markedly once outliers and typos are fixed.

Recap & Action Items

  • Always profile new data: count blanks, run descriptive stats, scan formats.

  • Maintain a reusable data-cleaning checklist for future projects.

  • Document every transformation step (Excel Power Query or a notebook) for reproducibility.

  • Cleaned data should flow into dashboards, forecasting sheets, or be uploaded to cloud databases for advanced ML.

Practice Prompt

  • Open the provided .xlsx; implement the ten-step workflow above.

  • Compare pre- vs post-cleaning: count of errors, average sales, forecast outputs.

  • Reflect: Which cleaning step had the biggest impact on analytical insight?

Key Takeaway

Properly prepared data is not a clerical exercise—it is the foundation for trustworthy business predictions, strategic decisions, and ethical analytics.