G. Unit 2.5: Model Building Phase

🔧 Phase 4: Model Building (Fully Simplified & Complete)


🧭 What’s the Goal of This Phase?

Phase 4 is where you build, train, and test your model.

  • You’re not just planning anymore — now you’re doing the modeling.

  • You use your prepared data to:

    1. Train the model (learn patterns from existing data)

    2. Test the model (see if it performs well on new, unseen data)

Think of it like teaching a student with practice questions (training data) and then giving them a final exam (test data) to see how well they learned.


🧪 Key Terms You Need to Know

| Term | What It Means |
| --- | --- |
| Training Data | The data used to teach the model. It "learns" from this. |
| Test Data (Hold-Out Set) | A separate set of data used to test how accurate the model is. |
| Production Data | Real-world data the model will work with after deployment. |


🧱 What Happens in Phase 4

1. Build the Model Using Training Data

  • You apply the techniques chosen in Phase 3.

  • Fit the model on the training dataset.

  • You start to see how the model performs.

2. Test the Model on New Data

  • After training, you test the model using a different dataset (hold-out/test data).

  • This helps you check how the model behaves with data it’s never seen before.

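This helps you detect overfitting: when a model works great on training data but fails on new data. Testing alone does not prevent overfitting, but it reveals it so you can refine the model.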
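
The train-then-test idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the data is made up, the 75/25 split and the seed are arbitrary choices, and the helper names (`train_test_split`, `fit_line`, `mean_abs_error`) are hypothetical. The model here is a simple least-squares line.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Average absolute gap between predicted and actual y."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

# Made-up data: y is roughly 2x + 1 plus a little noise.
noise = random.Random(0)
data = [(x, 2 * x + 1 + noise.uniform(-1, 1)) for x in range(40)]

train, test = train_test_split(data)      # the model never sees `test` while fitting
model = fit_line(train)                   # step 1: train
train_error = mean_abs_error(model, train)
test_error = mean_abs_error(model, test)  # step 2: test on unseen data
print(f"train error = {train_error:.3f}, test error = {test_error:.3f}")
```

If the test error is much larger than the training error, that gap is the warning sign of overfitting.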

3. Evaluate and Refine the Model

You ask yourself:

| Question | Why It Matters |
| --- | --- |
| Does the model work well with the test data? | Confirms it's not just memorizing |
| Do the results make sense to business experts? | Ensures it's useful, not just accurate |
| Do the parameter values seem logical? | Helps verify it's not acting strangely |
| Is the model accurate enough? | Measures whether it meets the goal |
| Does it make costly mistakes? | Helps avoid serious business risks (e.g., false fraud alerts) |
| Do we need more/different data? | If yes, adjust the input features |
| Can this model be used in real systems? | Checks performance, speed, and feasibility |
| Should we go back to Phase 3? | If it's not working, re-plan the model |


📌 Extra Notes You Must Know

  • Phases 3 and 4 often overlap. You might go back and forth while refining models.

  • Model building can be complex, but it’s often quicker than data prep or presenting results.

  • This phase includes many small decisions: tweaking inputs, changing models, transforming variables.

    • You must document everything — including:

      • What you changed

      • Why you changed it

      • What assumptions you made

    • If not documented, these choices may be forgotten after the project ends.


🔍 Common Mistakes to Look For

  • False Positives (e.g., wrongly predicting fraud)

  • False Negatives (e.g., missing real fraud)

These depend on the context. Sometimes false positives are worse, sometimes false negatives are worse. Always think about the business cost of errors.
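
To make the "business cost of errors" concrete, here is a small sketch that counts false positives and false negatives and then prices them. The labels are made up, and the two cost figures are hypothetical placeholders; real costs would come from the business.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for boolean labels (True = fraud)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_fraud, flagged in zip(actual, predicted):
        if flagged and is_fraud:
            counts["TP"] += 1   # real fraud, caught
        elif flagged and not is_fraud:
            counts["FP"] += 1   # false alarm
        elif not flagged and is_fraud:
            counts["FN"] += 1   # missed fraud
        else:
            counts["TN"] += 1   # normal transaction, left alone
    return counts

# Made-up labels for eight transactions.
actual    = [True, False, False, True, False, False, True, False]
predicted = [True, True,  False, False, False, False, True, False]

counts = confusion_counts(actual, predicted)

# Hypothetical costs: a false alarm annoys a customer, a miss loses money.
COST_FALSE_POSITIVE = 5
COST_FALSE_NEGATIVE = 200
total_cost = (counts["FP"] * COST_FALSE_POSITIVE
              + counts["FN"] * COST_FALSE_NEGATIVE)

print(counts)      # prints {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
print(total_cost)  # prints 205
```

With these assumed costs, one missed fraud outweighs forty false alarms, so a model that trades a few extra false positives for fewer false negatives could be the better choice. Change the costs and the conclusion may flip.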


🔁 When Can You Move On to the Next Phase?

You can move to Phase 5 (Communicate Results) when:

  • The model is reliable

  • It meets the business goal

  • Or, you’ve decided this model won’t work and need to try something else

At this point, you’ve either succeeded or identified what needs fixing.


🛠 2.5.1 – Tools Used in Model Building

This phase relies on software that supports data mining, machine learning, or statistical modeling.


💼 Commercial Tools

| Tool | What It Does |
| --- | --- |
| SAS Enterprise Miner | Runs large-scale models; connects with enterprise databases |
| SPSS Modeler | GUI-based; easy for non-coders to run models |
| Matlab | Advanced platform for writing algorithms and running analytics |
| Alpine Miner | Lets users build workflows with Big Data tools |
| STATISTICA / Mathematica | Popular in academic and enterprise modeling environments |


🌐 Free & Open Source Tools

| Tool | What It Does |
| --- | --- |
| R | Used earlier in Phase 3; excellent for stats, graphs, models |
| PL/R | Runs R code inside PostgreSQL, which can improve performance by keeping the computation next to the data |
| Octave | Open-source alternative to Matlab; used in universities |
| WEKA | Visual modeling tool; integrates with Java programs |
| Python | Most flexible: supports scikit-learn, pandas, numpy, matplotlib, etc. |
| MADlib | Runs machine learning algorithms inside databases (PostgreSQL, Greenplum) |

💡 Note: Running models inside the database (as with PL/R or MADlib) can improve performance for Big Data use cases, because the data does not have to be moved out of the database before modeling.


🧠 Summary Cheat Sheet: Phase 4 — Model Building

| Step | What You Do | Tools You Use |
| --- | --- | --- |
| Train | Fit your model to training data | R, Python, SAS, etc. |
| Test | Evaluate using hold-out/test data | Same tools, different data |
| Refine | Tweak inputs, fix model errors | Visualizations, logic tuning |
| Document | Record every step and assumption | Notes, logs, version control |
| Evaluate | Check accuracy, business relevance | Metrics, confusion matrix |
| Decide | Move forward or return to Phase 3 | Decision meeting, testing output |

🔮 Predictive Analytics – Simplified & Complete Summary


🧭 What Is Predictive Analytics?

Predictive analytics answers the question:
“What is likely to happen in the future?”

It uses patterns in existing data to forecast future outcomes — not with certainty, but with probability.

🔑 Important: Predictive analytics does not claim to predict the future exactly. Instead, it gives a best guess based on patterns, trends, and probabilities.


📦 Real-World Example: Amazon’s Recommendation System

🛒 How It Works:

Amazon’s “Recommended for You” feature uses predictive analytics through a collaborative filtering engine.

👀 What It Looks At:

  • What’s in your cart

  • What’s on your wishlist

  • What you’ve purchased recently

Then it compares this with:

  • What other customers bought when they bought the same items

Example: If you add peanut butter to your cart, Amazon might recommend jelly — because other customers often bought both together.
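
The "customers who bought this also bought that" idea can be sketched with simple co-occurrence counts. This is a toy illustration of the general item-to-item approach, not Amazon's actual system: the baskets are made up and the `recommend` function is a hypothetical helper.

```python
from collections import Counter
from itertools import permutations

# Made-up purchase history: one set of items per past order.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"peanut butter", "bread"},
    {"jelly", "butter"},
    {"peanut butter", "jelly", "milk"},
]

# For every item, count how often each other item shared an order with it.
co_counts = {}
for basket in baskets:
    for item, other in permutations(basket, 2):
        co_counts.setdefault(item, Counter())[other] += 1

def recommend(cart, top_n=2):
    """Suggest the items most often bought together with the cart's items."""
    scores = Counter()
    for item in cart:
        for other, count in co_counts.get(item, {}).items():
            if other not in cart:
                scores[other] += count
    # Highest count first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top_n]]

print(recommend({"peanut butter"}))  # prints ['jelly', 'bread']
```

Returning the top few items rather than a single one reflects the uncertainty: the counts say jelly is the most likely companion purchase, not a guaranteed one.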

💡 Why Amazon Recommends Multiple Products:

  • Because predictive analytics can’t be 100% sure

  • Amazon shows multiple suggestions to increase the chance that you’ll buy at least one

  • More purchases = higher revenue


🎯 Core Principle

🔍 Predictive analytics identifies the most likely outcomes — not guaranteed results.

There are too many variables in real life to know with certainty what will happen, but historical data gives clues about what’s probable.


📊 Descriptive Statistic: Standard Deviation

As part of understanding data for predictive modeling, we use descriptive statistics — and one very important one is:

📏 Standard Deviation – What Is It?

  • It measures how spread out the data is around the average (mean).

  • Think of it as a “closeness check” — Are your data points huddled near the mean, or scattered far from it?

| Type | What It Means |
| --- | --- |
| Small standard deviation | Most values are close to the average |
| Large standard deviation | Many values are far from the average (more spread out) |
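
A quick way to see the difference is to compare two made-up datasets that share the same mean but have different spread. The sketch below uses `pstdev` from Python's standard `statistics` module, which treats the list as the whole population; `stdev` would be the choice for a sample.

```python
from statistics import mean, pstdev

# Two made-up sets of test scores with the same average.
tight = [78, 79, 80, 81, 82]     # huddled near the mean
spread = [60, 70, 80, 90, 100]   # scattered far from the mean

tight_sd = pstdev(tight)
spread_sd = pstdev(spread)

print(f"tight:  mean = {mean(tight)}, SD = {tight_sd:.2f}")    # SD = 1.41
print(f"spread: mean = {mean(spread)}, SD = {spread_sd:.2f}")  # SD = 14.14
```

Both lists average 80, so the mean alone cannot tell them apart. The standard deviation can.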


📈 Understanding Spread with the Bell Curve

Many analyses assume that the data roughly follows a normal distribution, also known as the bell curve. This is an assumption to check, not a property that every large dataset has.

🔢 The 68–95–99.7 Rule

This rule helps explain how much data falls within a certain range of the mean:

| Range | % of Data Within |
| --- | --- |
| ±1 standard deviation | ~68% |
| ±2 standard deviations | ~95% |
| ±3 standard deviations | ~99.7% |

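So, if test scores are roughly normally distributed with a mean of 80 and a standard deviation (SD) of 10:

  • About 68% of scores are between 70 and 90 (±1 SD)

  • About 95% are between 60 and 100 (±2 SD)

  • About 99.7% are between 50 and 110 (±3 SD); if scores are capped at 100, the bell curve is only an approximation at the top end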
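
You can check these percentages yourself with `NormalDist` from Python's standard `statistics` module. The sketch assumes the scores follow an exact normal distribution with mean 80 and SD 10, which real scores only approximate.

```python
from statistics import NormalDist

# Idealized score distribution: mean 80, standard deviation 10.
scores = NormalDist(mu=80, sigma=10)

shares = {}
for k in (1, 2, 3):
    low, high = 80 - 10 * k, 80 + 10 * k
    # Share of scores between low and high = area under the curve there.
    shares[k] = scores.cdf(high) - scores.cdf(low)
    print(f"within {k} SD ({low} to {high}): {shares[k]:.1%}")
```

The three shares come out at roughly 68%, 95%, and 99.7%, matching the rule.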


🧠 Why Standard Deviation Matters in Predictive Analytics

Understanding the spread of your data helps you:

  • Know how stable or varied your predictions might be

  • Decide whether a data point is normal or an outlier

  • Build better confidence around your forecast results
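
One common way to decide whether a data point is "normal or an outlier" is a z-score: how many standard deviations the value sits from the mean. The sketch below is only an illustration. The data is made up, `flag_outliers` is a hypothetical helper, and the cutoff of 2 is an assumed threshold (3 is also widely used, but in a small sample one extreme value inflates the SD enough to hide itself from a cutoff of 3).

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - center) / spread > threshold]

# Made-up daily order counts; the last day looks unusual.
daily_orders = [52, 49, 51, 50, 48, 50, 53, 47, 51, 95]

print(flag_outliers(daily_orders))  # prints [95]
```

A flagged value is a prompt to investigate, not proof of an error: it may be a data entry mistake or a genuine event such as a sale day.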


Final Takeaways

| Concept | What It Means (Simple) | Why It Matters |
| --- | --- | --- |
| Predictive Analytics | Looks ahead to what might happen | Supports business decisions like pricing, marketing, risk |
| Collaborative Filtering | Suggests items based on other users’ behavior | Drives sales with personalization (e.g., Amazon) |
| Standard Deviation | Measures spread of data around the average | Helps interpret how consistent or scattered data is |
| Bell Curve (Normal Distribution) | Common data shape where most values fall near the middle | Lets us apply rules like 68-95-99.7 to understand probability |

Key Terms

  • training data: the dataset used for model development, where the model learns patterns and relationships in the data

  • test data: a separate dataset, also called hold-out data, used to evaluate the model's performance and accuracy on unseen data

  • model building phase: includes developing and fitting an analytical model on the training data

  • model assessment: the process of evaluating the technical merits of a model, such as accuracy, comprehensibility, and confidence in predictions

  • error rate: the percentage of records the model classifies incorrectly (its complement, accuracy, is the percentage classified correctly)

  • lift: a measure that indicates the change in concentration of a particular class when the model is used to select a group from the general population

  • ROC charts: a performance measurement for binary response models that plots the true positive rate against the false positive rate across classification thresholds
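
The last three terms can be computed by hand on a tiny example. This is a sketch on made-up model scores: the 0.5 threshold, the "top 30%" group used for lift, and the helper name `rates_at` are all assumptions chosen for illustration.

```python
# Made-up scored records: (model score, actual class), 1 = positive.
records = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]

def rates_at(threshold):
    """Return (error rate, true positive rate, false positive rate)."""
    tp = sum(1 for s, y in records if s >= threshold and y == 1)
    fp = sum(1 for s, y in records if s >= threshold and y == 0)
    fn = sum(1 for s, y in records if s < threshold and y == 1)
    tn = sum(1 for s, y in records if s < threshold and y == 0)
    error_rate = (fp + fn) / len(records)
    return error_rate, tp / (tp + fn), fp / (fp + tn)

error_rate, tpr, fpr = rates_at(0.5)
print(f"error rate = {error_rate:.2f}, TPR = {tpr:.2f}, FPR = {fpr:.2f}")
# prints error rate = 0.30, TPR = 0.75, FPR = 0.33

# An ROC chart plots (FPR, TPR) for many thresholds; here are three points.
roc_points = [rates_at(t)[2:0:-1] for t in (0.75, 0.5, 0.25)]

# Lift: share of positives in the model's top 30% vs. in everyone.
base_rate = sum(y for _, y in records) / len(records)
top_group = sorted(records, reverse=True)[:3]
top_rate = sum(y for _, y in top_group) / len(top_group)
lift = top_rate / base_rate
print(f"lift in top 30% = {lift:.2f}")  # prints lift in top 30% = 1.67
```

A lift of about 1.67 means the group the model picked contains positives at 1.67 times the rate of the general population.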