F. Unit: 2.4: Model Planning Phase

📊 Phase 3: Model Planning

Goal: Decide which models and techniques to use to analyze the data, based on business goals, data structure, and prior hypotheses. This is the blueprint phase — you’re designing the analytics plan before actually building models.

🔁 How This Phase Connects to Earlier Work

This phase links back to Phase 1 (Discovery) where the team formed initial hypotheses based on the business problem and data.
Now, in Phase 3, the team chooses which models or methods to use to test those hypotheses.
The output of this phase is the analytics strategy for the next step: building the model (Phase 4).

🧩 Key Activities in Model Planning

✅ 1. Assess Data Structure

Check the type and structure of data (e.g., text vs. transaction logs).
This determines which tools and which analytics techniques are appropriate.
- Example: Textual data may require natural language processing (NLP) tools.
- Example: Transactional data may work well with regression or clustering.

✅ 2. Validate Against Business Objectives

Confirm that the analytical techniques you choose will help you test or validate the business hypotheses.
Techniques should help determine whether the assumptions from earlier phases hold true or not.

✅ 3. Decide on One Model or a Workflow

Choose if a single model is enough or if a chain of techniques is needed.
- Example: You might combine classification with clustering in a workflow.
Tools like Alpine Miner can build model workflows visually and connect to back-end data platforms (for example, a relational database such as PostgreSQL).
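
As a rough illustration of a chained workflow, the sketch below clusters customers first and then classifies a new customer against the resulting cluster centres. The spend values and the function names `kmeans_1d` and `classify` are made up for this example, and the tiny one-dimensional k-means stands in for whatever techniques the team actually selects.

```python
from statistics import mean

def kmeans_1d(values, k=2, iterations=20):
    """Step 1 (clustering): tiny 1-D k-means, k >= 2.
    Centroids start evenly spread between the min and max value."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def classify(value, centroids):
    """Step 2 (classification): assign a value to its nearest centroid."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

monthly_spend = [12, 15, 14, 10, 95, 102, 98, 110]  # illustrative values only
centroids = kmeans_1d(monthly_spend, k=2)
print(centroids)                # [12.75, 101.25]
print(classify(90, centroids))  # 1 (the high-spend segment)
```

The point is the shape of the plan: the output of one technique becomes the input of the next, which is what a workflow diagram for Phase 4 needs to capture.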

🔍 Learn from Industry Practices

Teams are encouraged to research how similar problems are solved in other industries. This can:

- Inspire model selection
- Suggest best practices for similar data and goals

📊 Table: Models Used by Industry

Industry	Techniques Used
Consumer Packaged Goods	Linear Regression, ARD (Automatic Relevance Determination), Decision Tree
Retail Banking	Multiple Regression
Retail Business	Logistic Regression, ARD, Decision Tree
Wireless Telecom	Neural Network, Decision Tree, Hierarchical Neurofuzzy Systems, Rule Evolver, Logistic Regression

This kind of benchmarking helps teams brainstorm which candidate models to test.

🔎 2.4.1 – Data Exploration & Variable Selection

🧼 How It's Different from Phase 2

In Phase 2 (Data Prep), data exploration focused on data cleaning and quality.
In Phase 3, it’s about understanding relationships between variables to decide:
- Which variables to include
- Which methods to use

🛠 Tools Used

Visualizations (e.g., charts, heatmaps) help show variable relationships.
This high-level view guides smart variable selection and prepares for model testing.
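
A heatmap of variable relationships is usually drawn from a correlation matrix. The sketch below computes one with plain Python and prints it as text; the variable names and values are invented for illustration, and a real project would more likely use a plotting library on top of the same numbers.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical columns; store_visits is an exact multiple of ad_spend.
data = {
    "ad_spend":     [1, 2, 3, 4, 5],
    "store_visits": [2, 4, 6, 8, 10],
    "returns":      [5, 3, 4, 1, 2],
}

names = list(data)
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the matrix row by row, one signed two-decimal value per column.
for a in names:
    print(a.ljust(14), "  ".join(f"{matrix[a][b]:+.2f}" for b in names))
```

Here `ad_spend` and `store_visits` correlate at +1.00, a hint that only one of them is needed as a predictor, while `ad_spend` and `returns` correlate at -0.80.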

💡 Stakeholder Input vs. Data Science

Stakeholders often have hunches or hypotheses.
- Some are right, some are wrong, and some are right for the wrong reasons (for example, a real correlation explained by an incorrect cause).
Data scientists must:
- Validate those assumptions objectively
- Use unbiased data exploration to confirm or reject hypotheses
- Choose predictors (input variables) that correlate well with outcomes

⚠ Key Modeling Challenges

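Multicollinearity: When predictors are strongly correlated with each other, which makes their individual effects hard to separate
Serial correlation: When observations (or model errors) in time-ordered data are correlated with earlier values, so they are not independent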

To deal with these:

- Use variable transformations
- Consider removing or combining inputs
- Select modeling techniques that handle correlation better
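
Both problems can be screened for with simple diagnostics. The sketch below is a minimal example under stated assumptions: it uses the variance inflation factor (VIF) in its two-predictor form, 1 / (1 - r^2), which only holds when there are exactly two predictors, and the Durbin-Watson statistic on a list of residuals. The data and function names are illustrative, not part of the original material.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares.
    Near 2: little serial correlation; toward 0: positive; toward 4: negative."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Hypothetical predictors with correlation -0.8.
vif = vif_two_predictors([1, 2, 3, 4, 5], [5, 3, 4, 1, 2])
print(round(vif, 2))  # 2.78

# Residuals that stay on one side for a while suggest positive serial correlation.
dw_trending = durbin_watson([1, 1, 1, -1, -1, -1])
# Residuals that flip sign every step suggest negative serial correlation.
dw_alternating = durbin_watson([1, -1, 1, -1])
print(round(dw_trending, 2), dw_alternating)  # 0.67 3.0
```

A commonly quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but the cutoff is a convention rather than a fixed rule.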

🔑 Main Goal: Keep only the essential and influential variables — not everything someone thinks might be relevant. This often requires multiple iterations and testing.

🧠 If Using Regression Models:

Identify:
- Predictor variables (inputs)
- Outcome variables (targets)
Ensure predictors are strongly related to the outcome, not just correlated with each other.
Be ready to handle:
- Correlation-only models, when prediction alone is enough (black-box prediction)
- Causal models, when explanatory power is required; these should be used to forecast and be stress-tested under a variety of situations and datasets

🧠 2.4.2 – Model Selection

🔧 What Is a “Model”?

A model is an abstraction of reality: a simplified version of what’s happening in your data.
In data science, models follow a set of rules and conditions to:
- Classify items
- Predict outcomes
- Group (cluster) similar items
- Find associations or relationships

🎯 How to Choose:

Match the technique to the project’s end goal.
Choose from common model categories:
- Classification – Predict categories (e.g., churn or not)
- Clustering – Group similar entities
- Association Rules – Discover co-occurrence patterns
Narrow it down to a shortlist of viable candidates.
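
One lightweight way to record this matching step is a lookup from project goal to model category and candidate techniques. The sketch below is illustrative only: the goal phrases, the `shortlist` helper, and the example techniques (k-means and Apriori in particular) are assumptions added here, not a list taken from the source.

```python
# Goal -> (model category, illustrative candidate techniques).
CANDIDATES_BY_GOAL = {
    "predict a category": ("classification", ["logistic regression", "decision tree"]),
    "group similar entities": ("clustering", ["k-means"]),
    "find co-occurrence patterns": ("association rules", ["Apriori"]),
}

def shortlist(goal):
    """Return the model category and candidate techniques for a stated goal."""
    try:
        category, techniques = CANDIDATES_BY_GOAL[goal]
    except KeyError:
        raise ValueError(f"No candidates recorded for goal: {goal!r}") from None
    return {"category": category, "candidates": list(techniques)}

plan = shortlist("predict a category")
print(plan["category"], plan["candidates"])
```

Writing the shortlist down in this form also helps with documentation, because the reason for each candidate is tied to an explicit goal.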

🧵 Structured vs. Unstructured Data

Use different tools depending on the data:
- Structured: Rows and columns (e.g., sales data)
- Unstructured: Text, images, etc.
MapReduce and other distributed methods help with unstructured Big Data (see Chapter 10).
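
To make the MapReduce idea concrete, here is a single-process sketch of its three stages (map, shuffle, reduce) applied to word counting over short text snippets. It only mimics the programming model; a real MapReduce system distributes these stages across many machines, and the sample documents and function names are invented for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["slow delivery", "fast delivery great price", "great service"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["delivery"], counts["great"], counts["slow"])  # 2 2 1
```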

📝 Always Document:

Any assumptions made in model design
Why certain models or variables were chosen

🧪 Tool Example: Prototyping Models

Often, models are first created in R, SAS, or Matlab.

These tools:

- Are powerful for machine learning and statistics
- May not scale to the very large datasets common in Big Data applications

In large-scale environments, teams may need to redesign or deploy models inside the database itself, especially during pilot testing (Phase 6).
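
A minimal sketch of what "running the model inside the database" can look like: coefficients from a prototype are applied with a SQL expression so the rows never leave the database. SQLite is used here only because it ships with Python; the table, column names, and coefficient values are hypothetical.

```python
import sqlite3

# Hypothetical coefficients from a linear model prototyped elsewhere (e.g. in R).
INTERCEPT, SLOPE = 0.05, 1.99

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, visits REAL)")
conn.executemany(
    "INSERT INTO customers (id, visits) VALUES (?, ?)",
    [(1, 1.0), (2, 3.0), (3, 10.0)],
)

# Score every row in SQL instead of pulling the data into the modeling tool.
rows = conn.execute(
    "SELECT id, ? + ? * visits AS predicted_spend FROM customers ORDER BY id",
    (INTERCEPT, SLOPE),
).fetchall()
conn.close()

for customer_id, predicted in rows:
    print(customer_id, round(predicted, 2))
```

The same pattern carries over to larger platforms: the prototype determines the formula, and the database applies it at scale.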

🧰 2.4.3 – Common Tools in Model Planning

Tool	What It Does	Why It Matters
R	Advanced modeling with thousands of packages	Well suited to interpretive models; can connect to databases via ODBC
SQL Analysis Services	In-database mining, aggregations, basic predictions	Helps build models inside SQL databases
SAS/ACCESS	Connects SAS to relational databases & apps like SAP	Enables SAS to access Big Data sources and enterprise platforms

🧠 R is especially powerful because it keeps growing via open-source contributions — similar to how Linux grew in the '90s.

🚀 Final Checklist Before Phase 4

To complete Phase 3 successfully, your team should:

- Select a modeling approach based on project goals
- Finalize which variables to include
- Create a diagram or written plan of the model workflow
- List tools to be used (e.g., R, SQL, SAS)
- Document any assumptions
- Have a clear direction for testing models in Phase 4
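
One possible way to keep the checklist honest is to store the plan as a small record and ask it which items are still empty. The `ModelPlan` class and its field names below are assumptions made for this sketch; they simply mirror the six checklist items.

```python
from dataclasses import dataclass, field

@dataclass
class ModelPlan:
    approach: str = ""                                # modeling approach
    variables: list = field(default_factory=list)     # variables to include
    workflow: list = field(default_factory=list)      # ordered workflow steps
    tools: list = field(default_factory=list)         # tools to be used
    assumptions: list = field(default_factory=list)   # documented assumptions
    test_plan: str = ""                               # direction for Phase 4 testing

    def missing_items(self):
        """Return the names of checklist fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]

plan = ModelPlan(
    approach="classification",
    variables=["tenure_months", "monthly_spend"],  # hypothetical variable names
    tools=["R", "SQL"],
)
print(plan.missing_items())  # ['workflow', 'assumptions', 'test_plan']
```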

candidate models: potential models for clustering, classifying, or finding relationships in data
dataset structure: the arrangement and organization of the data used in the analysis process
analytical techniques: methods and tools used to analyze and process data to achieve business objectives
variable selection: the process of identifying essential predictors and variables to include in the model
structured data: data organized in a specific format or schema, making it easier to analyze
unstructured data: data that lacks a specific format or structure, often requiring additional processing before analysis