F. Unit: 2.4: Model Planning Phase
📊 Phase 3: Model Planning
Goal: Decide which models and techniques to use to analyze the data, based on business goals, data structure, and prior hypotheses. This is the blueprint phase — you’re designing the analytics plan before actually building models.
🔁 How This Phase Connects to Earlier Work
This phase links back to Phase 1 (Discovery) where the team formed initial hypotheses based on the business problem and data.
Now, in Phase 3, the team chooses which models or methods to use to test those hypotheses.
The output of this phase is the analytics strategy for the next step: building the model (Phase 4).
🧩 Key Activities in Model Planning
✅ 1. Assess Data Structure
Check the type and structure of data (e.g., text vs. transaction logs).
This determines which tools and which analytics techniques are appropriate.
Example: Textual data may require NLP tools.
Transactional data may work well with regression or clustering.
✅ 2. Validate Against Business Objectives
Confirm that the analytical techniques you choose will help you test or validate the business hypotheses.
Techniques should help determine whether the assumptions from earlier phases hold true or not.
✅ 3. Decide on One Model or a Workflow
Decide whether a single model is enough or whether a chain of techniques is needed.
Example: You might combine classification with clustering in a workflow.
Tools like Alpine Miner can build model workflows visually and connect to back-end data platforms (such as PostgreSQL).
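A chained workflow can be sketched in a few lines. The example below is a minimal illustration, not a production method: it first clusters customers into two segments with a tiny 1-D k-means, then classifies new customers by nearest centroid. The spend figures, segment labels, and function names are all hypothetical, and the clustering assumes the data contains at least two distinct values.

```python
def two_means_1d(values, iterations=10):
    """Tiny 1-D k-means with k=2; returns (low_centroid, high_centroid).

    Assumes `values` holds at least two distinct numbers.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to its nearest centroid, then recompute centroids.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low, high


def classify_segment(value, centroids, labels=("low_spend", "high_spend")):
    """Step 2 of the workflow: label a new record by its nearest centroid."""
    distances = [abs(value - c) for c in centroids]
    return labels[distances.index(min(distances))]


# Hypothetical monthly spend per customer.
monthly_spend = [20, 25, 22, 30, 200, 220, 210, 190]

# Step 1: clustering discovers the segments.
centroids = two_means_1d(monthly_spend)

# Step 2: classification assigns new customers to those segments.
for new_customer in (40, 180):
    print(new_customer, "->", classify_segment(new_customer, centroids))
```

The point is the shape of the plan: the output of one technique (the centroids) becomes the input of the next.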
🔍 Learn from Industry Practices
Teams are encouraged to research how similar problems are solved in other industries. This can:
Inspire model selection
Suggest best practices for similar data and goals
📊 Table: Models Used by Industry
| Industry | Techniques Used |
|---|---|
| Consumer Packaged Goods | Linear Regression, ARD (Automatic Relevance Determination), Decision Tree |
| Retail Banking | Multiple Regression |
| Retail Business | Logistic Regression, ARD, Decision Tree |
| Wireless Telecom | Neural Network, Decision Tree, Hierarchical Neurofuzzy Systems, Rule Evolver, Logistic Regression |
This kind of benchmarking helps teams brainstorm which candidate models to test.
🔎 2.4.1 – Data Exploration & Variable Selection
🧼 How It's Different from Phase 2
In Phase 2 (Data Prep), data exploration focused on data cleaning and quality.
In Phase 3, it’s about understanding relationships between variables to decide:
Which variables to include
Which methods to use
🛠 Tools Used
Visualizations (e.g., charts, heatmaps) help show variable relationships.
This high-level view guides smart variable selection and prepares for model testing.
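A heatmap is essentially a colored correlation matrix. As a rough sketch of the numbers behind one, the code below computes pairwise Pearson correlations in plain Python and prints how each candidate variable relates to the outcome. The column names and values are made-up toy data, and a real project would more likely use a plotting or statistics library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(a, b): r} for every ordered pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical toy data: three candidate variables and one outcome.
data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "store_size": [5, 3, 6, 2, 4],
    "discount": [1, 2, 3, 4, 5],
    "sales": [12, 24, 33, 41, 55],
}

matrix = correlation_matrix(data)
for name in ("ad_spend", "store_size", "discount"):
    print(f"{name:>10} vs sales: r = {matrix[(name, 'sales')]:+.2f}")
```

Values near +1 or -1 suggest a strong linear relationship with the outcome; values near 0 suggest a weak one.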
💡 Stakeholder Input vs. Data Science
Stakeholders often have hunches or hypotheses.
Some are right, some are wrong, and some are right for the wrong reasons (for example, a real correlation attributed to the wrong cause).
Data scientists must:
Validate those assumptions objectively
Use unbiased data exploration to confirm or reject hypotheses
Choose predictors (input variables) that correlate well with outcomes
⚠ Key Modeling Challenges
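Multicollinearity: When predictors are so strongly correlated with each other that their individual effects are hard to separate
Serial correlation: When observations (or model errors) are correlated with their own earlier values, as often happens in time-ordered data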
To deal with these:
Use variable transformations
Consider removing or combining inputs
Select modeling techniques that handle correlation better
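Both problems can be screened for with simple checks before any model is built. The sketch below flags predictor pairs whose absolute correlation exceeds a threshold (a rough multicollinearity screen) and computes a lag-1 autocorrelation (a rough serial-correlation screen). The 0.9 threshold, the variable names, and the data are illustrative assumptions, not fixed rules.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(predictors, threshold=0.9):
    """List (a, b, r) for predictor pairs with |r| above the threshold."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson(predictors[a], predictors[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


def lag1_autocorrelation(series):
    """Correlation of a series with itself shifted by one step."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


# Hypothetical predictors: the two price columns carry nearly the same signal.
predictors = {
    "price_usd": [10, 12, 14, 16, 18, 20],
    "price_eur": [9.0, 10.9, 12.6, 14.5, 16.1, 18.2],
    "foot_traffic": [300, 120, 250, 90, 310, 180],
}

for a, b, r in collinear_pairs(predictors):
    print(f"Consider dropping or combining {a} and {b} (r = {r:.2f})")

# A steadily trending series shows positive lag-1 autocorrelation.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
print(f"lag-1 autocorrelation: {lag1_autocorrelation(trending):.3f}")
```

A flagged pair is a prompt to remove, combine, or transform inputs, as listed above; it is not proof that either variable is unimportant.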
🔑 Main Goal: Keep only the essential and influential variables — not everything someone thinks might be relevant. This often requires multiple iterations and testing.
🧠 If Using Regression Models:
Identify:
Predictor variables (inputs)
Outcome variables (targets)
Ensure predictors are tied to outcomes, not just correlated with each other.
Be ready to handle:
Correlation-only models (e.g., black-box predictions)
Causal models if explanatory power is required (e.g., forecasting or stress-testing)
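To make the predictor and outcome roles concrete, here is a minimal least-squares fit of one outcome on one predictor in plain Python. The variable names and data are hypothetical, and a fit like this only captures correlation; it says nothing about causation on its own.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: store visits (predictor) and spend (outcome).
visits = [1, 2, 3, 4, 5]
spend = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_regression(visits, spend)
print(f"spend = {intercept:.2f} + {slope:.2f} * visits")

# Use the fitted line to predict the outcome for a new predictor value.
predicted = intercept + slope * 6
print(f"predicted spend for 6 visits: {predicted:.2f}")
```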
🧠 2.4.2 – Model Selection
🔧 What Is a “Model”?
A model is an abstraction of reality: a simplified version of what’s happening in your data.
In data science, models follow a set of rules and conditions to:
Classify items
Predict outcomes
Group (cluster) similar items
Find associations or relationships
🎯 How to Choose:
Match the technique to the project’s end goal.
Choose from common model categories:
Classification – Predict categories (e.g., churn or not)
Clustering – Group similar entities
Association Rules – Discover co-occurrence patterns
Narrow it down to a shortlist of viable candidates.
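Of the three categories, association rules are the easiest to illustrate by hand. The sketch below counts co-occurrences in a handful of made-up shopping baskets and computes support and confidence for item pairs. The baskets and helper names are hypothetical, and real projects typically use a dedicated algorithm such as Apriori rather than brute-force counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count how often each item and each (alphabetically sorted) pair appears.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(item_a, item_b):
    """Fraction of baskets that contain both items."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share with the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(f"support(bread, milk) = {support('bread', 'milk'):.2f}")
print(f"confidence(butter -> bread) = {confidence('butter', 'bread'):.2f}")
```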
🧵 Structured vs. Unstructured Data
Use different tools depending on the data:
Structured: Rows and columns (e.g., sales data)
Unstructured: Text, images, etc.
MapReduce and other distributed methods help with unstructured Big Data (see Chapter 10).
📝 Always Document:
Any assumptions made in model design
Why certain models or variables were chosen
🧪 Tool Example: Prototyping Models
Often, models are first created in:
R
SAS
Matlab
These tools:
Are powerful for machine learning and statistics
May not scale to very large datasets, which limits their use in Big Data applications
In large-scale environments, teams may need to redesign or deploy models inside the database itself, especially during pilot testing (Phase 6).
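One way to picture in-database deployment: a model prototyped elsewhere is reduced to its coefficients, and the scoring runs as SQL where the data lives. The sketch below uses Python's built-in sqlite3 module as a stand-in for a production database. The table, column names, and coefficients are hypothetical.

```python
import sqlite3

# Hypothetical coefficients from a model prototyped elsewhere (e.g., in R).
INTERCEPT = 1.0
SLOPE = 2.0

# An in-memory SQLite database stands in for the production system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, visits REAL)")
conn.executemany(
    "INSERT INTO customers (id, visits) VALUES (?, ?)",
    [(1, 3.0), (2, 5.0), (3, 10.0)],
)

# Score every row inside the database instead of pulling the data out first.
rows = conn.execute(
    "SELECT id, ? + ? * visits AS predicted_spend FROM customers ORDER BY id",
    (INTERCEPT, SLOPE),
).fetchall()

for customer_id, predicted_spend in rows:
    print(customer_id, predicted_spend)

conn.close()
```

Because only the coefficients move, the data never has to leave the database, which is the main appeal at large scale.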
🧰 2.4.3 – Common Tools in Model Planning
| Tool | What It Does | Why It Matters |
|---|---|---|
| R | Advanced modeling with thousands of packages | Great for visual, interpretive models; connects to Big Data via ODBC |
| SQL Analysis Services | In-database mining, aggregations, basic predictions | Helps build models inside SQL databases |
| SAS/ACCESS | Connects SAS to relational databases & apps like SAP | Enables SAS to access Big Data sources and enterprise platforms |
🧠 R is especially powerful because it keeps growing via open-source contributions — similar to how Linux grew in the '90s.
🚀 Final Checklist Before Phase 4
To complete Phase 3 successfully, your team should:
Select a modeling approach based on project goals
Finalize which variables to include
Create a diagram or written plan of the model workflow
List tools to be used (e.g., R, SQL, SAS)
Document any assumptions
Have a clear direction for testing models in Phase 4
📖 Key Terms
candidate models: potential models for clustering, classifying, or finding relationships in data
dataset structure: refers to the arrangement and organization of data used in the analysis process
analytical techniques: methods and tools used to analyze and process data to achieve business objectives
variable selection: the process of identifying essential predictors and variables to include in the model
structured data: data organized in a specific format or schema, making it easier to analyze
unstructured data: data that lacks a specific format or structure, often requiring additional processing before analysis