F. Unit: 2.4: Model Planning Phase

📊 Phase 3: Model Planning

Goal: Decide which models and techniques to use to analyze the data, based on business goals, data structure, and prior hypotheses. This is the blueprint phase — you’re designing the analytics plan before actually building models.


🔁 How This Phase Connects to Earlier Work

  • This phase links back to Phase 1 (Discovery) where the team formed initial hypotheses based on the business problem and data.

  • Now, in Phase 3, the team chooses the models or methods that will test those hypotheses.

  • The output of this phase is the analytics strategy for the next step: building the model (Phase 4).


🧩 Key Activities in Model Planning

1. Assess Data Structure

  • Check the type and structure of data (e.g., text vs. transaction logs).

  • This determines which tools and which analytics techniques are appropriate.

    • Example: Textual data may require NLP tools.

    • Transactional data may work well with regression or clustering.

2. Validate Against Business Objectives

  • Confirm that the analytical techniques you choose will help you test or validate the business hypotheses.

  • Techniques should help determine whether the assumptions from earlier phases hold true or not.

3. Decide on One Model or a Workflow

  • Decide whether a single model is enough or a chain of techniques is needed.

    • Example: You might combine classification with clustering in a workflow.

  • Tools like Alpine Miner can build model workflows visually and connect to back-end data platforms (for example, a PostgreSQL database).
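A workflow that chains clustering into classification can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production approach: the data (`monthly_spend`, `churned`) and the helper names (`kmeans_1d`, `fit_segment_classifier`, `predict`) are hypothetical, and the "classifier" is a simple majority vote per segment.

```python
from statistics import mean

def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2): returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids across the sorted range.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return centroids, labels

def fit_segment_classifier(labels, outcomes):
    """Step 2: learn the most common outcome within each segment."""
    rule = {}
    for seg in set(labels):
        seg_outcomes = [o for lab, o in zip(labels, outcomes) if lab == seg]
        rule[seg] = max(set(seg_outcomes), key=seg_outcomes.count)
    return rule

def predict(value, centroids, rule):
    """Route a new record to its nearest segment, then apply that segment's rule."""
    seg = min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))
    return rule[seg]

# Hypothetical data: monthly spend per customer and whether they churned (1 = yes).
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
churned = [1, 1, 0, 0, 0, 0, 1, 0]

centroids, segments = kmeans_1d(monthly_spend, k=2)   # step 1: clustering
rule = fit_segment_classifier(segments, churned)      # step 2: classification
print(predict(20, centroids, rule), predict(85, centroids, rule))
```

The point of the sketch is the hand-off: the output of the first technique (segment labels) becomes the input of the second.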


🔍 Learn from Industry Practices

Teams are encouraged to research how similar problems are solved in other industries. This can:

  • Inspire model selection

  • Suggest best practices for similar data and goals

📊 Table: Models Used by Industry

| Industry | Techniques Used |
|---|---|
| Consumer Packaged Goods | Linear Regression, ARD (Automatic Relevance Determination), Decision Tree |
| Retail Banking | Multiple Regression |
| Retail Business | Logistic Regression, ARD, Decision Tree |
| Wireless Telecom | Neural Network, Decision Tree, Hierarchical Neurofuzzy Systems, Rule Evolver, Logistic Regression |

This kind of benchmarking helps teams brainstorm which candidate models to test.


🔎 2.4.1 – Data Exploration & Variable Selection

🧼 How It's Different from Phase 2

  • In Phase 2 (Data Prep), data exploration focused on data cleaning and quality.

  • In Phase 3, it’s about understanding relationships between variables to decide:

    • Which variables to include

    • Which methods to use

🛠 Tools Used

  • Visualizations (e.g., charts, heatmaps) help show variable relationships.

  • This high-level view guides smart variable selection and prepares for model testing.
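Before drawing a heatmap, the underlying numbers are usually a pairwise correlation matrix. The sketch below computes one with plain Python; the column names and values are made up for illustration, and Pearson correlation is only one of several possible measures of relationship.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset: three numeric variables observed over five periods.
data = {
    "ad_spend": [1, 2, 3, 4, 5],
    "discount": [5, 3, 4, 1, 2],
    "sales":    [2.1, 3.9, 6.2, 8.1, 9.8],
}

matrix = correlation_matrix(data)

# Print a text grid; a plotting library could render the same values as a heatmap.
print(" " * 10 + "".join(f"{name:>10}" for name in data))
for row in data:
    print(f"{row:>10}" + "".join(f"{matrix[row][col]:>10.2f}" for col in data))
```

Strong values off the diagonal flag variable pairs worth a closer look, either as promising predictors or as redundant inputs.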

💡 Stakeholder Input vs. Data Science

  • Stakeholders often have hunches or hypotheses.

    • Some hunches turn out to be wrong, and some are correct but for the wrong reasons.

  • Data scientists must:

    • Validate those assumptions objectively

    • Use unbiased data exploration to confirm or reject hypotheses

    • Choose predictors (input variables) that correlate well with outcomes

Key Modeling Challenges

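  • Multicollinearity: When two or more predictors are highly correlated with each other, which makes their individual effects hard to separate

  • Serial correlation: When observations (or model errors) in time-ordered data are correlated with their own earlier values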

To deal with these:

  • Use variable transformations

  • Consider removing or combining inputs

  • Select modeling techniques that handle correlation better
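Both challenges have simple numeric diagnostics. The sketch below shows two well-known ones under stated assumptions: the variance inflation factor (VIF), shown here only for the two-predictor case where it reduces to 1 / (1 - r²), and the Durbin-Watson statistic for first-order serial correlation in residuals. The data and the "VIF above 10" threshold are illustrative; the threshold is a common rule of thumb, not a fixed standard.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order serial correlation,
    well below 2 suggests positive, well above 2 suggests negative."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical predictors that move almost in lockstep.
price = [10, 12, 14, 16, 18]
unit_cost = [5.1, 5.9, 7.2, 7.9, 9.0]
vif = vif_two_predictors(price, unit_cost)
print(f"VIF = {vif:.1f}", "(consider dropping or combining)" if vif > 10 else "")

# Hypothetical residuals from a time-ordered model.
trending = [1.0, 0.8, 0.5, 0.1, -0.4, -0.8, -1.0]   # slow drift
alternating = [1.0, -1.0, 1.0, -1.0]                # sign flips every step
print(f"DW (trending)    = {durbin_watson(trending):.2f}")
print(f"DW (alternating) = {durbin_watson(alternating):.2f}")
```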

🔑 Main Goal: Keep only the essential and influential variables — not everything someone thinks might be relevant. This often requires multiple iterations and testing.


🧠 If Using Regression Models:

  • Identify:

    • Predictor variables (inputs)

    • Outcome variables (targets)

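  • Choose predictors that have a strong relationship with the outcome rather than with the other predictors.

  • Decide what the model must deliver:

    • Correlation only: sufficient when the goal is black-box prediction

    • Causal understanding: needed when the model must have explanatory power; plan to forecast with it or stress-test it under different conditions and datasets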
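To make the predictor/outcome split concrete, here is a minimal one-predictor least-squares fit in plain Python. The variable names (`ad_spend`, `sales`) and values are hypothetical, and a real project would normally use a statistics package and more than one predictor.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the outcome's variance explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

ad_spend = [1, 2, 3, 4, 5]            # predictor (input), hypothetical
sales = [3.1, 4.9, 7.2, 8.8, 11.0]    # outcome (target), hypothetical

intercept, slope = fit_simple_regression(ad_spend, sales)
fit_quality = r_squared(ad_spend, sales, intercept, slope)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend (R^2 = {fit_quality:.3f})")

# A crude stress test: score a scenario outside the observed range and
# treat the result with caution, since the fit was never checked there.
scenario_spend = 10
print(f"projected sales at ad_spend={scenario_spend}: {intercept + slope * scenario_spend:.2f}")
```

A high R² here only shows correlation; whether the relationship is causal still has to be argued from the business context.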


🧠 2.4.2 – Model Selection

🔧 What Is a “Model”?

  • A model is an abstraction of reality: a simplified version of what’s happening in your data.

  • In data science, models follow a set of rules and conditions to:

    • Classify items

    • Predict outcomes

    • Group (cluster) similar items

    • Find associations or relationships

🎯 How to Choose:

  • Match the technique to the project’s end goal.

  • Choose from common model categories:

    • Classification – Predict categories (e.g., churn or not)

    • Clustering – Group similar entities

    • Association Rules – Discover co-occurrence patterns

  • Narrow it down to a shortlist of viable candidates.
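Shortlisting can be as simple as a lookup from the project goal to a set of candidate techniques, minus anything the team has ruled out. The table below is a hypothetical starting point: the technique names are common examples of each category, not an exhaustive or recommended list.

```python
# Hypothetical lookup table: model category -> example techniques.
CANDIDATES = {
    "classification": ["logistic regression", "decision tree", "neural network"],
    "clustering": ["k-means", "hierarchical clustering"],
    "association rules": ["Apriori"],
}

def shortlist(goal, excluded=()):
    """Return candidate techniques for a goal, minus any the team has ruled out."""
    key = goal.strip().lower()
    if key not in CANDIDATES:
        raise ValueError(f"No candidate techniques recorded for goal: {goal!r}")
    return [technique for technique in CANDIDATES[key] if technique not in excluded]

# Example: a churn project where the team needs an interpretable model,
# so the neural network is ruled out up front.
churn_candidates = shortlist("Classification", excluded=("neural network",))
print(churn_candidates)
```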

🧵 Structured vs. Unstructured Data

  • Use different tools depending on the data:

    • Structured: Rows and columns (e.g., sales data)

    • Unstructured: Text, images, etc.

  • MapReduce and other distributed methods help with unstructured Big Data (see Chapter 10).
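The MapReduce idea can be illustrated on a single machine with a word count over text. This sketch only mimics the map, shuffle, and reduce steps in plain Python; it is not Hadoop or any real distributed framework, and the documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical unstructured input: short customer comments.
documents = ["late delivery again", "delivery was late", "great service"]

# In a real cluster each document (or block) would be mapped on a different node.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each map call touches only one document and each reduce call touches only one key, both steps can be spread across many machines.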

📝 Always Document:

  • Any assumptions made in model design

  • Why certain models or variables were chosen
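One lightweight way to keep this documentation honest is to store the plan as a small record and check it for gaps. The `ModelPlan` class and its fields are hypothetical; a shared document or wiki page would serve the same purpose.

```python
from dataclasses import dataclass, field

@dataclass
class ModelPlan:
    """Hypothetical record of Phase 3 decisions."""
    goal: str
    candidate_models: list
    variables: list
    assumptions: list = field(default_factory=list)
    rationale: dict = field(default_factory=dict)   # choice -> reason it was made

    def undocumented_choices(self):
        """Return models or variables that have no recorded rationale."""
        return [c for c in self.candidate_models + self.variables if c not in self.rationale]

plan = ModelPlan(
    goal="Predict customer churn",
    candidate_models=["logistic regression", "decision tree"],
    variables=["monthly_spend", "tenure_months"],
    assumptions=["Churn labels from the last 12 months are representative"],
    rationale={
        "logistic regression": "Stakeholders need interpretable coefficients",
        "monthly_spend": "Strong relationship with churn in exploration",
    },
)

missing = plan.undocumented_choices()
print("Still needs a written rationale:", missing)
```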


🧪 Tool Example: Prototyping Models

Often, models are first created in:

  • R

  • SAS

  • MATLAB

These tools:

  • Are powerful for machine learning and statistics

  • May not scale to very large datasets, which limits their use in Big Data applications

In large-scale environments, teams may need to redesign or deploy models inside the database itself, especially during pilot testing (Phase 6).
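Running a model "inside the database" often means expressing the scoring step as a query so the data never leaves the database. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a production database; the table, the column names, and the coefficients (`INTERCEPT`, `SLOPE`) are all hypothetical, standing in for a model fitted earlier in a tool such as R.

```python
import sqlite3

# In-memory SQLite database standing in for a production data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers (id, monthly_spend) VALUES (?, ?)",
    [(1, 10.0), (2, 50.0), (3, 90.0)],
)

# Hypothetical coefficients from a linear model prototyped elsewhere.
INTERCEPT, SLOPE = 5.0, 0.5

# The scoring formula runs as SQL, so rows are scored where they live
# instead of being exported to the modeling tool.
scores = conn.execute(
    "SELECT id, ? + ? * monthly_spend AS score FROM customers ORDER BY id",
    (INTERCEPT, SLOPE),
).fetchall()

for customer_id, score in scores:
    print(customer_id, score)
```

Only simple models translate this directly into SQL; more complex ones may need database-specific analytics extensions.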


🧰 2.4.3 – Common Tools in Model Planning

| Tool | What It Does | Why It Matters |
|---|---|---|
| R | Advanced modeling with thousands of packages | Great for visual, interpretive models; connects to Big Data via ODBC |
| SQL Analysis Services | In-database mining, aggregations, basic predictions | Helps build models inside SQL databases |
| SAS/ACCESS | Connects SAS to relational databases & apps like SAP | Enables SAS to access Big Data sources and enterprise platforms |

🧠 R is especially powerful because it keeps growing via open-source contributions — similar to how Linux grew in the '90s.


🚀 Final Checklist Before Phase 4

To complete Phase 3 successfully, your team should:

  • Select a modeling approach based on project goals

  • Finalize which variables to include

  • Create a diagram or written plan of the model workflow

  • List tools to be used (e.g., R, SQL, SAS)

  • Document any assumptions

  • Have a clear direction for testing models in Phase 4

Key Terms

  • candidate models: potential models for clustering, classifying, or finding relationships in data

  • dataset structure: the arrangement and organization of the data used in the analysis

  • analytical techniques: methods and tools used to analyze and process data to achieve business objectives

  • variable selection: the process of identifying essential predictors and variables to include in the model

  • structured data: data organized in a specific format or schema, making it easier to analyze

  • unstructured data: data that lacks a specific format or structure, often requiring additional processing before analysis