D. Discovery Phase

🔍 2.2 – Phase 1: Discovery (Full Summary with Examples)

Theme: “Understand before you act.”
This phase is about deeply understanding the problem, context, goals, resources, and data before building any models.


🧭 Purpose of Discovery Phase:

  • Investigate the business problem.

  • Understand the domain and data sources.

  • Set clear project goals, success/failure criteria, and form initial hypotheses (IHs).

  • Identify stakeholders, interview project sponsors, and gather available data.


🧠 2.2.1 – Learning the Business Domain

| What It Means | Example |
| --- | --- |
| Understand the industry or department the problem comes from. | A data scientist working on hospital readmissions needs to understand medical workflows, not just statistics. |

  • Two types of data scientists:

    • 🧮 Quantitative Experts: Math/stats-focused, adaptable across industries.

    • 🧬 Domain Specialists: Deep in one field (e.g., oceanography, biotech), with moderate data skills.

🔑 Balance is key – The earlier the team identifies knowledge gaps, the better it can staff accordingly.


🧰 2.2.2 – Assessing Resources (In Simple Terms)

What This Means

Before starting a project, figure out what you already have and what you still need. Resources fall into three main categories:

  • People – What skills and roles are on the team?

  • Technology – What tools, platforms, or systems are available?

  • Data – What data do you already have, and what data do you still need?
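
The three categories above can be tracked as a simple have-versus-need inventory. The following is a minimal sketch, assuming a hypothetical project; every item name in `required` and `available` is made up for illustration.

```python
# Hypothetical inventory: all item names are illustrative, not from a real project.
required = {
    "people": {"data engineer", "data scientist", "domain expert"},
    "technology": {"SQL warehouse", "Python environment"},
    "data": {"sales logs", "support tickets", "app usage data"},
}
available = {
    "people": {"data scientist"},
    "technology": {"SQL warehouse", "Python environment"},
    "data": {"sales logs"},
}


def resource_gaps(required, available):
    """Return, per category, the required items that are not yet available."""
    return {
        category: sorted(needs - available.get(category, set()))
        for category, needs in required.items()
    }


gaps = resource_gaps(required, available)
for category, missing in gaps.items():
    print(f"{category}: {', '.join(missing) if missing else 'no gaps'}")
```

A set difference per category keeps the question narrow: what is needed that we do not have yet? The answer feeds directly into staffing and procurement plans.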

Example

A startup wants to scale its product but realizes it does not have enough engineers right now, so it makes a plan to hire more.

Key Takeaway

📌 Don’t just think about the current project; think long-term. Use this project as a step toward your bigger goals.


📌 2.2.3 – Framing the Problem (In Simple Terms)

What This Means

Framing means clearly writing out the problem you're trying to solve. Ask yourself:

  • What exactly is the problem?

  • Who cares about this problem, and why?

  • What does success look like? What does failure look like?

Good vs. Poor Framing
  • Good: “Reduce customer churn by 10% by improving mobile UX.”

  • Poor: “Do something with mobile analytics.”

Extra Tip

📌 Also define what failure looks like.
Example: “If churn reduction is less than 3%, the project isn’t worth continuing.”
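
Success and failure criteria are easiest to hold people to when they are written down as explicit thresholds. The sketch below encodes the 10% target and the 3% cutoff from the examples above. It assumes both figures mean a relative reduction in churn rate, which the original statements do not specify, and the function names are hypothetical.

```python
def churn_reduction(baseline_rate, new_rate):
    """Relative reduction in churn, as a fraction of the baseline rate (assumed definition)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


def assess(reduction, success_at=0.10, failure_below=0.03):
    """Classify an outcome against the success and failure thresholds."""
    if reduction >= success_at:
        return "success"
    if reduction < failure_below:
        return "failure"
    return "inconclusive"


# Hypothetical figures: churn falls from 20% to 17% of customers.
print(assess(churn_reduction(0.20, 0.17)))
```

Anything between the two thresholds lands in an "inconclusive" band, which is worth agreeing on with stakeholders before any results arrive.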



👥 2.2.4 – Identifying Stakeholders (In Simple Terms)

What This Means

A stakeholder is anyone who is affected by the project or who benefits from it.

Steps to Identify and Work with Them
  1. Find their pain points – What problems have they faced in similar past projects?

  2. Understand their expectations – What do they want this project to achieve?

  3. Define their roles:

    • Approver – Makes final decisions.

    • Contributor – Works on the project.

    • Advisor – Gives input and guidance.

Example

A sales director expects a dashboard that shows real-time KPIs (key performance indicators).

Why This Matters

📌 If you know what people expect early on, you can avoid misunderstandings and delays later.
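
One way to keep pain points, expectations, and roles together is a small stakeholder register. This is a minimal sketch: the three roles come from the list above, while the people, their role assignments, and the pain point are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    APPROVER = "makes final decisions"
    CONTRIBUTOR = "works on the project"
    ADVISOR = "gives input and guidance"


@dataclass
class Stakeholder:
    title: str
    role: Role
    pain_points: list = field(default_factory=list)
    expectations: list = field(default_factory=list)


# Hypothetical register; role assignments and details are illustrative only.
register = [
    Stakeholder("Sales Director", Role.APPROVER,
                expectations=["dashboard showing real-time KPIs"]),
    Stakeholder("Marketing Analyst", Role.CONTRIBUTOR,
                pain_points=["past reports arrived too late to act on"]),
    Stakeholder("IT Ops Lead", Role.ADVISOR),
]


def approvers(register):
    """Titles of stakeholders who make final decisions."""
    return [s.title for s in register if s.role is Role.APPROVER]


def needs_follow_up(register):
    """Titles of stakeholders whose expectations have not been captured yet."""
    return [s.title for s in register if not s.expectations]


print(approvers(register))
print(needs_follow_up(register))
```

Listing who still has no recorded expectations gives the team a concrete interview to-do list before the project moves on.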


🎤 2.2.5 – Interviewing the Project Sponsor (In Simple Terms)

What This Means

The project sponsor is usually the person who funds the project and who first proposed what it should do. Sponsors may bring biases or assumptions of their own, and the interview is the place to surface and explore them.

Tips for a Good Interview
  • Ask open-ended questions – Let them explain in their own words.

  • Listen actively – Repeat back what they said to confirm you understood.

  • Don’t jump in with your own ideas too soon.

  • Write everything down – You’ll need it later to confirm details.

Good Questions to Ask
  • What problem are we solving?

  • What does success look like?

  • What are the risks or deadlines?

  • Who makes the final decisions?

  • What data sources do we have (inside or outside the company)?

Scenario Example
  • Sponsor says: “Just build a product recommender.”

  • You ask: “Why that solution? What’s the real business problem?”


💡 2.2.6 – Developing Initial Hypotheses (IHs)

Form testable ideas that guide future analysis.

| IH Example | Data to Test |
| --- | --- |
| “Customers leave due to long page load times.” | Web logs showing page load duration vs. bounce rate. |

  • Gather IHs from stakeholders and domain experts too.

  • These guide what experiments/models will be run in Phases 3–5.

📌 Good IHs make analysis focused and meaningful.
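
A first look at the page-load hypothesis could compare bounce rates for fast and slow sessions. The sketch below uses a made-up sample of sessions and an assumed 2-second threshold. Neither comes from real logs, and a gap between the groups would only be consistent with the hypothesis, not proof of cause.

```python
# Hypothetical web-log sample: (page_load_seconds, bounced) per session.
sessions = [
    (0.8, False), (1.1, False), (1.4, True), (1.9, False),
    (2.6, False), (3.2, True), (3.8, True), (4.5, True),
]


def bounce_rate_by_load_time(sessions, threshold_seconds=2.0):
    """Bounce rate for sessions loading faster vs. slower than the threshold."""
    fast = [bounced for load, bounced in sessions if load < threshold_seconds]
    slow = [bounced for load, bounced in sessions if load >= threshold_seconds]

    def rate(group):
        # True counts as 1, so the mean of the flags is the bounce rate.
        return sum(group) / len(group) if group else None

    return {"fast": rate(fast), "slow": rate(slow)}


rates = bounce_rate_by_load_time(sessions)
print(rates)  # {'fast': 0.25, 'slow': 0.75} for the sample above
```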


🗃 2.2.7 – Identifying Potential Data Sources (In Simple Terms)

Goal

Figure out:

  • What data you need

  • What data you already have

  • How to get the rest


🛠 Five Key Activities

| Task | What It Means | Example |
| --- | --- | --- |
| 1. Identify data sources | Make a list of all current and needed datasets | Sales logs, support tickets, app usage data |
| 2. Capture aggregate data | Look at big-picture trends | Daily active users by region |
| 3. Review raw data | Check the actual data fields and quality | Missing values in the ‘purchase_amount’ column |
| 4. Evaluate tools | Make sure your tools can handle the data format | SQL for structured data, special tools for social media posts |
| 5. Scope infrastructure | Estimate what tech resources you’ll need | Do we need cloud storage? Real-time data processing? |
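
Activities 2 and 3 can be tried on a small extract before any tooling decisions are made. The following sketch uses only the Python standard library on a made-up CSV sample. Apart from `purchase_amount`, the column names are assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extract; values and most column names are illustrative.
RAW = """date,region,user_id,purchase_amount
2024-01-01,EU,u1,19.99
2024-01-01,EU,u2,
2024-01-01,US,u3,5.00
2024-01-02,EU,u1,
2024-01-02,US,u3,12.50
2024-01-02,US,u4,7.25
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Activity 2: aggregate view, daily active users by region.
active_users = defaultdict(set)
for row in rows:
    active_users[(row["date"], row["region"])].add(row["user_id"])
dau = {key: len(users) for key, users in active_users.items()}

# Activity 3: raw review, share of rows with no purchase_amount.
missing = sum(1 for row in rows if not row["purchase_amount"].strip())
missing_share = missing / len(rows)

for (date, region), count in sorted(dau.items()):
    print(date, region, count)
print(f"purchase_amount missing in {missing} of {len(rows)} rows")
```

The aggregate view shows whether the data covers the regions and dates the project cares about. The missing-value count is an early warning about how much cleaning the next phase will need.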

📌 Tip: Think about the 3 V’s of Big Data:

  • Volume – How much data? (1 year or 10 years?)

  • Variety – What types of data? (text, numbers, images?)

  • Velocity – How fast is the data coming in?
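
Volume and velocity can be turned into a rough storage figure with back-of-envelope arithmetic. In this sketch the event rate and event size are hypothetical placeholders, and the estimate ignores compression, indexes, and replication.

```python
def storage_estimate_gb(events_per_second, bytes_per_event, years):
    """Rough raw storage need in gigabytes (1 GB = 1e9 bytes, 365-day years)."""
    seconds = years * 365 * 24 * 60 * 60
    return events_per_second * bytes_per_event * seconds / 1e9


# Hypothetical workload: 50 events per second at about 500 bytes each.
one_year = storage_estimate_gb(50, 500, years=1)
ten_years = storage_estimate_gb(50, 500, years=10)
print(f"1 year: {one_year:.1f} GB, 10 years: {ten_years:.1f} GB")
```

Even a crude number like this helps answer the infrastructure questions, such as whether a single database server is enough or cloud storage is needed.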


Discovery Phase Checkpoints

You’re ready to move to the next phase when:

  • The problem is clearly defined

  • Stakeholders and their roles are identified

  • You’ve mapped out what data you have and what you still need

  • You’ve written initial hypotheses

  • A draft analytics plan is ready for review
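
The checkpoints above can be kept as an explicit checklist so that the decision to move on is visible to everyone. This is a minimal sketch; the wording of each item and the function name are illustrative.

```python
CHECKPOINTS = [
    "problem clearly defined",
    "stakeholders and roles identified",
    "data inventory mapped (have vs. need)",
    "initial hypotheses written",
    "draft analytics plan ready for review",
]


def ready_for_next_phase(done):
    """Return (ready, outstanding) given the set of completed checkpoints."""
    outstanding = [item for item in CHECKPOINTS if item not in done]
    return (not outstanding, outstanding)


# Hypothetical status: two checkpoints still open.
done = {
    "problem clearly defined",
    "stakeholders and roles identified",
    "initial hypotheses written",
}
ready, outstanding = ready_for_next_phase(done)
print(ready, outstanding)
```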


📦 Summary Table: Phase 1 – Discovery (In Simple Terms)

| Activity | What It Means | Business Example |
| --- | --- | --- |
| Learn the Domain | Understand the industry or department | Healthcare, Finance, Retail |
| Assess Resources | Check people, tools, and data | Do we have Python devs? Is Tableau licensed? |
| Frame the Problem | Write a clear problem statement | “Reduce churn in Q3 via mobile experience.” |
| Identify Stakeholders | Know who’s involved and what they expect | Sales VP, Marketing Analyst, IT Ops |
| Interview Sponsor | Ask deep questions to uncover real needs | “Why do we need a recommender system?” |
| Develop IHs | Create testable ideas to guide analysis | “Churn is linked to late delivery.” |
| Identify Data | List what data you have and need | Internal logs, CRM data, weather APIs |