G. Unit 2.5: Model Building Phase

🔧 Phase 4: Model Building (Fully Simplified & Complete)

🧭 What’s the Goal of This Phase?

Phase 4 is where you build, train, and test your model.

You’re not just planning anymore — now you’re doing the modeling.
You use your prepared data to:
1. Train the model (learn patterns from existing data)
2. Test the model (see if it performs well on new, unseen data)

Think of it like teaching a student with practice questions (training data) and then giving them a final exam (test data) to see how well they learned.
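
The practice-questions-versus-final-exam idea can be sketched as a random split of the prepared data. This is a minimal illustration using only Python's standard library; the helper name `split_train_test`, the 25% test fraction, and the fixed seed are assumptions chosen for the example, not values this phase prescribes.

```python
import random

def split_train_test(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and set aside a hold-out test set.

    Hypothetical helper for illustration; test_fraction and seed are assumed values.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(20))                  # stand-in for 20 prepared data rows
train, test = split_train_test(records)
print(len(train), len(test))               # 15 rows to learn from, 5 held out
```

Every row lands in exactly one of the two sets, so the model is never tested on a row it trained on.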

🧪 Key Terms You Need to Know

Term	What It Means
Training Data	The data used to teach the model. It "learns" from this.
Test Data (Hold-Out Set)	A separate set of data, never used for training, that shows how well the model performs on unseen data.
Production Data	Real-world data the model will work with after deployment.

🧱 What Happens in Phase 4

✅ 1. Build the Model Using Training Data

You apply the techniques chosen in Phase 3.
Fit the model on the training dataset.
You start to see how the model performs.

✅ 2. Test the Model on New Data

After training, you test the model using a different dataset (hold-out/test data).
This helps you check how the model behaves with data it’s never seen before.

This helps you detect overfitting: when a model works great on training data but fails on new data.
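
One way to see overfitting is to compare accuracy on the training data with accuracy on the hold-out data. The sketch below uses a deliberately bad toy model that only memorizes its training rows; the class `MemorizingModel`, the transaction amounts, and the labels are all made up for illustration.

```python
from collections import Counter

class MemorizingModel:
    """Toy model that stores every training example verbatim (hypothetical)."""

    def fit(self, features, labels):
        self.lookup = dict(zip(features, labels))
        # Fall back to the most common training label for anything unseen.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, feature):
        return self.lookup.get(feature, self.default)

def accuracy(model, features, labels):
    correct = sum(model.predict(x) == y for x, y in zip(features, labels))
    return correct / len(labels)

# Made-up transaction amounts and labels, for illustration only.
train_x = [12, 25, 40, 55, 70, 95]
train_y = ["ok", "ok", "ok", "fraud", "ok", "fraud"]
test_x = [15, 60, 90, 30]
test_y = ["ok", "fraud", "fraud", "ok"]

model = MemorizingModel().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)  # 1.0: every training row is memorized
test_acc = accuracy(model, test_x, test_y)     # 0.5: unseen rows all get the default "ok"
print(train_acc, test_acc)
```

A large gap between the two numbers is the warning sign: the model has learned the training rows rather than a pattern that carries over to new data.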

✅ 3. Evaluate and Refine the Model

You ask yourself:

Question	Why It Matters
Does the model work well with the test data?	Confirms it's not just memorizing
Do the results make sense to business experts?	Ensures it's useful, not just accurate
Do the parameter values seem logical?	Helps verify it's not acting strangely
Is the model accurate enough?	Measures whether it meets the goal
Does it make costly mistakes?	Helps avoid serious business risks (e.g., false fraud alerts)
Do we need more/different data?	If yes, adjust the input features
Can this model be used in real systems?	Checks performance, speed, and feasibility
Should we go back to Phase 3?	If it's not working, re-plan the model

📌 Extra Notes You Must Know

Phases 3 and 4 often overlap. You might go back and forth while refining models.
Model building can be complex, but it’s often quicker than data prep or presenting results.
This phase includes many small decisions: tweaking inputs, changing models, transforming variables.
- ⚠ You must document everything — including:
  - What you changed
  - Why you changed it
  - What assumptions you made
- If not documented, these choices may be forgotten after the project ends.

🔍 Common Mistakes to Look For

False Positives (e.g., wrongly predicting fraud)
False Negatives (e.g., missing real fraud)

Which error is worse depends on the context: sometimes false positives cost more, sometimes false negatives do. Always weigh the business cost of each kind of error.
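
Counting each kind of mistake and attaching a cost to it can be sketched as follows. The labels and the two cost figures (5 per false alert, 200 per missed fraud) are invented for the example; real costs would come from the business.

```python
def confusion_counts(actual, predicted, positive="fraud"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def total_error_cost(counts, cost_fp, cost_fn):
    """Weight each kind of mistake by its (assumed) business cost."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

# Made-up labels for eight transactions.
actual    = ["fraud", "ok",    "ok", "fraud", "ok", "ok", "fraud", "ok"]
predicted = ["fraud", "fraud", "ok", "ok",    "ok", "ok", "fraud", "fraud"]

counts = confusion_counts(actual, predicted)
print(counts)                                # {'tp': 2, 'fp': 2, 'fn': 1, 'tn': 3}
cost = total_error_cost(counts, cost_fp=5, cost_fn=200)
print(cost)                                  # 2 * 5 + 1 * 200 = 210
```

With these assumed costs, the single missed fraud outweighs both false alerts, which is why two models with the same accuracy can have very different business value.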

🔁 When Can You Move On to the Next Phase?

You can move to Phase 5 (Communicate Results) when:

The model is reliable
It meets the business goal
Or, you’ve decided this model won’t work and can explain why (if you still want to try a different approach, return to Phase 3 first)

At this point, you’ve either succeeded or identified what needs fixing.

🛠 2.5.1 – Tools Used in Model Building

This phase relies on software that supports data mining, machine learning, or statistical modeling.

💼 Commercial Tools

Tool	What It Does
SAS Enterprise Miner	Runs large-scale models; connects with enterprise databases
SPSS Modeler	GUI-based; easy for non-coders to run models
Matlab	Advanced platform for writing algorithms and running analytics
Alpine Miner	Lets users build workflows with Big Data tools
STATISTICA / Mathematica	Popular in academic and enterprise modeling environments

🌐 Free & Open Source Tools

Tool	What It Does
R	Used earlier in Phase 3; excellent for stats, graphs, models
PL/R	Runs R code inside PostgreSQL, so data does not have to leave the database
Octave	Open-source alternative to Matlab; used in universities
WEKA	Visual modeling tool; integrates with Java programs
Python	Most flexible: supports `scikit-learn`, `pandas`, `numpy`, `matplotlib`, etc.
MADlib	Runs machine learning algorithms inside databases (PostgreSQL, Greenplum)

💡 Note: Running models inside the database (as with PL/R or MADlib) can improve performance for Big Data use cases, because large datasets do not have to be moved out of the database first.

🧠 Summary Cheat Sheet: Phase 4 — Model Building

Step	What You Do	Tools You Use
Train	Fit your model to training data	R, Python, SAS, etc.
Test	Evaluate using hold-out/test data	Same tools, different data
Refine	Tweak inputs, fix model errors	Visualizations, logic tuning
Document	Record every step and assumption	Notes, logs, version control
Evaluate	Check accuracy, business relevance	Metrics, confusion matrix
Decide	Move forward or return to Phase 3	Decision meeting, testing output

🔮 Predictive Analytics – Simplified & Complete Summary

🧭 What Is Predictive Analytics?

Predictive analytics answers the question:
➡ “What is likely to happen in the future?”

It uses patterns in existing data to forecast future outcomes — not with certainty, but with probability.

🔑 Important: Predictive analytics does not claim to predict the future exactly. Instead, it gives a best guess based on patterns, trends, and probabilities.

📦 Real-World Example: Amazon’s Recommendation System

🛒 How It Works:

Amazon’s “Recommended for You” feature uses predictive analytics through a collaborative filtering engine.

👀 What It Looks At:

What’s in your cart
What’s on your wishlist
What you’ve purchased recently

Then it compares this with:

What other customers bought when they bought the same items

Example: If you add peanut butter to your cart, Amazon might recommend jelly — because other customers often bought both together.
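
The "customers who bought this also bought" idea can be sketched by counting which items appear together in past orders. This is a heavily simplified illustration, not Amazon's actual system; the function names and the sample baskets are made up.

```python
from collections import Counter
from itertools import permutations

def build_co_purchase_counts(baskets):
    """For each item, count how often every other item shared a basket with it."""
    counts = {}
    for basket in baskets:
        for item, other in permutations(set(basket), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts

def recommend(counts, item, top_n=2):
    """Return the items most often bought together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]

# Made-up order history.
baskets = [
    ["peanut butter", "jelly", "bread"],
    ["peanut butter", "jelly"],
    ["peanut butter", "jelly", "milk"],
    ["peanut butter", "bread", "jelly"],
    ["bread", "milk"],
]

counts = build_co_purchase_counts(baskets)
print(recommend(counts, "peanut butter"))  # ['jelly', 'bread']
```

Jelly shared a basket with peanut butter four times and bread twice, so jelly is suggested first. Returning more than one item mirrors the point made next: the prediction is a ranking of likely purchases, not a certainty.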

💡 Why Amazon Recommends Multiple Products:

Because predictive analytics can’t be 100% sure
Amazon shows multiple suggestions to increase the chance that you’ll buy at least one
More purchases = higher revenue

🎯 Core Principle

🔍 Predictive analytics identifies the most likely outcomes — not guaranteed results.

There are too many variables in real life to know for certain what will happen, but historical data gives clues about what’s probable.

📊 Descriptive Statistic: Standard Deviation

As part of understanding data for predictive modeling, we use descriptive statistics — and one very important one is:

📏 Standard Deviation – What Is It?

It measures how spread out the data is around the average (mean).
Think of it as a “closeness check” — Are your data points huddled near the mean, or scattered far from it?

Type	What It Means
Small standard deviation	Most values are close to the average
Large standard deviation	Many values are far from the average (more spread out)
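
The difference between a small and a large standard deviation can be checked on two made-up sets of scores that share the same mean. This sketch computes the population standard deviation by hand and compares it with `statistics.pstdev` from Python's standard library; the score values are invented for illustration.

```python
import math
import statistics

def population_std_dev(values):
    """Square root of the average squared distance from the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

tight = [78, 79, 80, 81, 82]     # huddled near the mean of 80
spread = [60, 70, 80, 90, 100]   # same mean, scattered far from it

sd_tight = population_std_dev(tight)    # about 1.41
sd_spread = population_std_dev(spread)  # about 14.14
print(round(sd_tight, 2), round(sd_spread, 2))

# The standard library gives the same answer.
print(math.isclose(sd_tight, statistics.pstdev(tight)))
```

Both lists average 80, yet the second has ten times the standard deviation, which is exactly the "closeness check" described above.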

📈 Understanding Spread with the Bell Curve

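Many analyses assume that the data roughly follows a normal distribution, also known as the bell curve. This is an assumption to check, not a guarantee: a large dataset can still be skewed or have more than one peak.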

🔢 The 68–95–99.7 Rule

For normally distributed data, this rule tells you how much of the data falls within a certain range of the mean:

Range	% of Data Within
±1 standard deviation	~68%
±2 standard deviations	~95%
±3 standard deviations	~99.7%

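So, if the average test score is 80 and the standard deviation is 10:

About 68% of scores are between 70 and 90 (mean ± 10)
About 95% are between 60 and 100 (mean ± 20)
About 99.7% are between 50 and 110 (mean ± 30; if scores are capped at 100, the bell curve is only an approximation at this end)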
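
The percentages in the rule can be reproduced with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The sketch assumes a perfectly normal distribution with mean 80 and standard deviation 10; the helper name `share_within` is made up for the example.

```python
from statistics import NormalDist

scores = NormalDist(mu=80, sigma=10)  # assumed: mean 80, standard deviation 10

def share_within(dist, k):
    """Fraction of a normal distribution lying within k standard deviations of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high = 80 - 10 * k, 80 + 10 * k
    print(f"within {k} SD ({low} to {high}): {share_within(scores, k):.4f}")
```

The three printed fractions come out at roughly 0.68, 0.95, and 0.997, which is where the name of the rule comes from.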

🧠 Why Standard Deviation Matters in Predictive Analytics

Understanding the spread of your data helps you:

Know how stable or varied your predictions might be
Decide whether a data point is typical or an outlier
Build better confidence around your forecast results
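
One common way to flag outliers is to measure how many standard deviations each value sits from the mean (its z-score). The sketch below uses a cutoff of 3, a rule of thumb suggested by the 68–95–99.7 rule rather than a fixed standard; the function names and the scores are made up for illustration.

```python
import statistics

def z_scores(values):
    """Distance of each value from the mean, measured in standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(v - mean) / sd for v in values]

def find_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean (assumed cutoff)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up scores: nineteen values near 80 and one extreme entry.
scores = [78, 82, 79, 81, 80, 77, 83, 80, 79, 81,
          80, 82, 78, 80, 81, 79, 80, 82, 78, 150]

outliers = find_outliers(scores)
print(outliers)  # [150]
```

Note that a single extreme value also inflates the mean and the standard deviation themselves, so with very small samples this check can miss outliers; treat a flagged point as something to investigate, not something to delete automatically.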

✅ Final Takeaways

Concept	What It Means (Simple)	Why It Matters
Predictive Analytics	Looks ahead to what might happen	Supports business decisions like pricing, marketing, risk
Collaborative Filtering	Suggests items based on other users’ behavior	Drives sales with personalization (e.g., Amazon)
Standard Deviation	Measures spread of data around the average	Helps interpret how consistent or scattered data is
Bell Curve (Normal Distribution)	Common data shape where most values fall near the middle	Lets us apply rules like 68-95-99.7 to understand probability

Key Terms

training data: the dataset used for model development, where the model learns patterns and relationships in the data
test data: a separate dataset, also called hold-out data, used to evaluate the model's performance and accuracy on unseen data
model building phase: includes developing and fitting an analytical model on the training data
model assessment: the process of evaluating the technical merits of a model, such as accuracy, comprehensibility, and confidence in predictions
error rate: the percentage of records classified incorrectly; its complement (1 minus the error rate) is the accuracy, the percentage classified correctly
lift: a measure that indicates the change in concentration of a particular class when the model is used to select a group from the general population
ROC charts: a performance measurement for binary response models that plots the true positive rate against the false positive rate across classification thresholds
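
The last three terms can be made concrete with a small worked example. This is a sketch under stated assumptions: the labels, the model scores, and the 0.5 cutoff are invented, and the function names are hypothetical. Each call to `roc_point` gives one point of an ROC chart; sweeping the threshold traces the whole curve.

```python
def error_rate(actual, predicted):
    """Share of records the model classified incorrectly."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def lift(actual, selected, positive=1):
    """Positive rate in the model-selected group divided by the overall positive rate."""
    base_rate = sum(a == positive for a in actual) / len(actual)
    chosen = [a for a, s in zip(actual, selected) if s]
    selected_rate = sum(a == positive for a in chosen) / len(chosen)
    return selected_rate / base_rate

def roc_point(actual, scores, threshold, positive=1):
    """(false positive rate, true positive rate) at one classification threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and a == positive for a, p in zip(actual, predicted))
    fp = sum(p and a != positive for a, p in zip(actual, predicted))
    positives = sum(a == positive for a in actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives

# Made-up data: 1 = positive class, scores are the model's predicted probabilities.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]   # assumed cutoff of 0.5

print(error_rate(actual, predicted))                 # 2 of 10 records are wrong
print(lift(actual, [p == 1 for p in predicted]))     # selected group is about twice as rich in positives
print(roc_point(actual, scores, threshold=0.5))      # catches every positive, with 2 of 7 false alarms
```

Here the model selects five records, three of which are truly positive (60%), against a base rate of 30% in the whole population, which is what a lift of about 2 means.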