Data science methodology

1. Introduction to Data Science Methodology

What is Data Science Methodology?

Data Science Methodology is a structured framework used to design and complete AI or Data Science projects.

It provides a step-by-step process that helps organizations solve problems using data.

Purpose

Helps in selecting the correct analytical methods

Reduces time and cost

Improves accuracy of solutions

Ensures the problem is solved effectively

Proposed By

The Data Science Methodology was proposed by:

John Rollins (IBM Analytics)

Structure

The methodology consists of 10 iterative steps grouped into 5 modules.

It is cyclical, meaning the process can repeat to improve results.

5 Modules of Data Science Methodology

The 10 steps are grouped into 5 major modules:

From Problem to Approach

From Requirements to Collection

From Understanding to Preparation

From Modelling to Evaluation

From Deployment to Feedback

Each module focuses on a specific stage of the data science project.

Step 1: Business Understanding

Definition

Business Understanding focuses on defining and clearly understanding the problem that needs to be solved.

This step ensures that the AI solution aligns with business goals.

Key Activities

Identify the problem

Understand customer needs

Define business objectives

Determine expected outcomes

Another Name

This stage is also known as:

Problem Scoping

Tools Used

5W1H Method

This technique helps in clearly understanding the problem.

It involves asking:

What is the problem?

Why does it exist?

Who is affected?

Where does it occur?

When does it occur?

How can it be solved?

Design Thinking

Design Thinking focuses on human-centered problem solving.

Steps usually include:

Empathize

Define

Ideate

Prototype

Test

Goal

Ensure that the right problem is being solved.

Step 2: Analytic Approach

Definition

The Analytic Approach determines how data will be used to solve the problem.

In this stage, we decide:

Which data analysis techniques will be used

Which algorithms may solve the problem

Types of Data Analytics

1. Descriptive Analytics

Answers the question:

“What happened?”

It analyzes past data.

Examples:

Monthly sales reports

Website traffic statistics

2. Diagnostic Analytics

Answers:

“Why did it happen?”

It finds reasons behind events.

Examples:

Why sales dropped

Why customers left a service

3. Predictive Analytics

Answers:

“What will happen in the future?”

Uses statistical or machine learning models, trained on historical data, to predict likely outcomes.

Examples:

Predicting stock prices

Forecasting sales

4. Prescriptive Analytics

Answers:

“What should we do?”

Suggests actions to achieve desired results.

Examples:

Recommending best marketing strategy

Suggesting optimal routes in logistics

Step 3: Data Requirements

Definition

This stage identifies what type of data is needed to solve the problem.

It determines:

Data type

Data format

Data sources

Data structure

Types of Data

1. Structured Data

Highly organized data stored in tables or databases.

Examples:

Excel sheets

SQL databases

Bank transaction records

2. Unstructured Data

Data that does not follow a predefined structure.

Examples:

Images

Videos

Social media posts

Text documents

3. Semi-Structured Data

Data that has some structure (such as tags or key-value pairs) but is not organized into fixed tables.

Examples:

Emails

XML files

JSON files
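
To illustrate the difference between semi-structured and structured data, here is a minimal sketch that reads one JSON record and flattens it into a table-like row. The record, its field names, and the choice of columns are hypothetical and only for illustration.

```python
import json

# Hypothetical semi-structured record: nested fields and a variable-length list
raw = '{"id": 7, "name": "Asha", "tags": ["premium", "mobile"], "address": {"city": "Pune"}}'

record = json.loads(raw)  # parse the JSON text into Python dicts and lists

# Flatten the nested record into a fixed set of columns (structured form)
flat_row = {
    "id": record["id"],
    "name": record["name"],
    "city": record["address"]["city"],   # pull a nested value up to a column
    "tag_count": len(record["tags"]),    # summarize the list as a single number
}

print(flat_row)
```

Once every record has the same columns, the rows can be stored in a table or database like any other structured data.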

Step 4: Data Collection

Definition

Data Collection is the process of gathering the required data from various sources.

Types of Data Collection

1. Primary Data

Data collected directly by the researcher.

Examples:

Surveys

Interviews

Questionnaires

Sensors

Observations

2. Secondary Data

Data collected from existing sources.

Examples:

Books

Websites

Research papers

Databases

Kaggle datasets

Important Note

If the collected data is missing or incomplete, it may need to be collected again.

Step 5: Data Understanding

Definition

This stage involves analyzing the collected data to understand its quality and usefulness.

We check whether the data is:

Relevant

Accurate

Complete

Consistent

Techniques Used

1. Descriptive Statistics

Used to summarize data.

Examples:

Mean

Median

Mode

Standard deviation
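
The four measures above can be computed with Python's built-in statistics module. This is a minimal sketch; the sales figures are made-up sample values, not real data.

```python
import statistics

# Hypothetical monthly sales figures (illustrative values only)
sales = [120, 135, 135, 150, 160, 410]

mean_value = statistics.mean(sales)      # arithmetic average
median_value = statistics.median(sales)  # middle value of the sorted data
mode_value = statistics.mode(sales)      # most frequent value
std_dev = statistics.stdev(sales)        # sample standard deviation

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Standard deviation:", round(std_dev, 2))
```

In this sample the mean is much larger than the median because of the single large value (410), which is a simple hint that the data may contain an outlier.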

2. Data Visualization

Helps understand data patterns visually.

Examples:

Histograms

Bar charts

Line graphs

Scatter plots

Visualization helps detect:

Outliers

Trends

Patterns

Step 6: Data Preparation

Definition

Data Preparation is the process of cleaning and transforming data so it can be used for modelling.

This is usually the most time-consuming stage of a data science project.

Activities in Data Preparation

1. Data Cleaning

Fixing errors in the dataset.

Includes:

Removing incorrect values

Fixing inconsistent data

2. Handling Missing Values

Methods include:

Removing missing rows

Replacing values with mean or median

Predicting missing values
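
The first two methods can be sketched in plain Python as follows. The list of ages is hypothetical, and None is assumed to mark a missing value.

```python
import statistics

# Hypothetical column of ages; None marks a missing value
ages = [25, None, 31, 40, None, 28]

# Method 1: remove the entries that are missing
ages_dropped = [a for a in ages if a is not None]

# Method 2: replace missing entries with the median of the observed values
median_age = statistics.median(ages_dropped)
ages_filled = [a if a is not None else median_age for a in ages]

print(ages_dropped)
print(ages_filled)
```

Removing rows is simple but loses data; filling with the mean or median keeps every row but can hide the fact that values were missing.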

3. Removing Duplicates

Duplicate entries count the same record more than once, which can distort results, so they should be removed.
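
A minimal sketch of removing exact duplicates while keeping the original order. The customer records are hypothetical, and a record is treated as a duplicate only when every field matches.

```python
# Hypothetical customer records as (customer_id, name) tuples
records = [
    ("C001", "Asha"),
    ("C002", "Ravi"),
    ("C001", "Asha"),   # exact duplicate of the first record
    ("C003", "Meera"),
]

# dict.fromkeys keeps the first occurrence of each record and preserves order
unique_records = list(dict.fromkeys(records))

print(unique_records)
```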

4. Feature Engineering

Feature Engineering means creating new useful features from existing data.

Example:

Original Feature:

Date of birth

New Feature:

Age

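A feature such as age is often more directly useful to a model than the raw date, so good feature engineering can improve model performance.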
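
The date-of-birth example can be sketched as below. The function name age_in_years and the sample dates are assumptions made for illustration.

```python
from datetime import date

def age_in_years(date_of_birth, on_date):
    """Return the completed years between date_of_birth and on_date."""
    # Has the birthday already occurred in the year of on_date?
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)

# Hypothetical record: original feature (date of birth) -> new feature (age)
dob = date(2000, 6, 15)
age = age_in_years(dob, date(2024, 6, 14))  # one day before the 24th birthday
print(age)
```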

Step 7: AI Modelling

Definition

AI Modelling involves building machine learning models using prepared data.

The model learns patterns from training data.

Types of Modelling

1. Descriptive Modelling

Used to describe patterns in data.

Example:

Customer segmentation

2. Predictive Modelling

Used to predict future outcomes.

Examples:

Predicting house prices

Predicting disease risk

Model Validation

Definition

Model Validation is the process of evaluating a trained machine learning model.

It checks:

Accuracy of the model

Reliability

Performance on new unseen data

It is done using a testing dataset, which is data the model did not see during training.

Purpose of Model Validation

Improve model quality

Reduce prediction errors

Ensure generalization to new data

Step 8: Evaluation

Definition

Evaluation measures how well the model performs.

It determines whether the model solves the original problem effectively.

Common Evaluation Metrics

Accuracy

Measures overall correctness of predictions.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision measures how many predicted positives were actually correct.

Formula:

Precision = TP / (TP + FP)

Recall

Recall measures how many actual positives were correctly identified.

Formula:

Recall = TP / (TP + FN)

F1 Score

F1 Score balances Precision and Recall.

Formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Mean Absolute Error (MAE)

Measures the average absolute difference between predicted values and actual values.

Lower MAE means better predictions.
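
A minimal sketch of the MAE calculation. The function name and the sample values are assumptions for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all pairs of values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted values
actual = [3, 5, 8]
predicted = [2, 5, 10]

mae = mean_absolute_error(actual, predicted)  # (1 + 0 + 2) / 3
print(mae)
```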

Confusion Matrix

A Confusion Matrix is a table used to evaluate classification models.

It contains four outcomes:

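True Positive (TP)

The model predicted positive and the actual value is positive.

True Negative (TN)

The model predicted negative and the actual value is negative.

False Positive (FP)

The model predicted positive but the actual value is negative.

False Negative (FN)

The model predicted negative but the actual value is positive.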

It helps calculate:

Accuracy

Precision

Recall

F1 Score
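
The sketch below counts the four outcomes from lists of actual and predicted labels, then applies the formulas given earlier. The function names, the sample labels, and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Apply the Accuracy, Precision, Recall and F1 formulas."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels (1 = positive, 0 = negative)
actual = [1, 0, 1, 1, 0]
predicted = [1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```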

Step 9: Deployment

Definition

Deployment is the stage where the trained AI model is made available for real-world use.

Ways to Deploy Models

Mobile applications

Websites

APIs

Software systems

Often, a model is first released to a limited group of users for testing before a full rollout.

Step 10: Feedback

Definition

The Feedback stage collects user responses and performance data after deployment.

Purpose

Feedback helps to:

Improve the model

Fix errors

Retrain the model with new data

This makes the methodology iterative and cyclical.

Model Validation Techniques

1. Train-Test Split

In this technique, the dataset is divided into two parts:

Training Dataset

Used to train the model.

Testing Dataset

Used to evaluate model performance.

Purpose

To estimate how well the model will perform on new unseen data.

How It Works

Dataset is split into two parts.

Model is trained on training data.

Predictions are made on test data.

Predictions are compared with actual values.
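
A minimal sketch of the splitting step in plain Python. The function name, the 20% test fraction, and the fixed random seed are assumptions chosen for illustration; the right split ratio depends on the project.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into (train, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 10 samples
dataset = list(range(10))
train, test = train_test_split(dataset)

print("Training samples:", len(train))
print("Testing samples:", len(test))
```

Shuffling before splitting is suitable when samples are independent; it should not be used for time-dependent data.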

2. K-Fold Cross Validation

In this technique:

The dataset is divided into k parts (folds) of equal, or nearly equal, size.

Process

Model is trained on k−1 folds

Tested on the remaining 1 fold

Process repeats k times

Final performance = average of all results
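
The process above can be sketched as follows. To keep the example short, the "model" simply predicts the mean of the training values and is scored with mean absolute error; this stand-in model, the function names, and the sample data are assumptions for illustration only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # When n_samples is not divisible by k, the first folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate_mean_model(values, k):
    """Average test error over k folds for a model that predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        prediction = sum(values[i] for i in train_idx) / len(train_idx)  # "train"
        error = sum(abs(values[i] - prediction) for i in test_idx) / len(test_idx)  # "test"
        scores.append(error)
    return sum(scores) / len(scores)  # final performance = average of all folds

# Hypothetical data: 10 values, 5 folds
values = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0, 14.0, 13.0]
print(cross_validate_mean_model(values, k=5))
```

Every sample is used for testing exactly once and for training k−1 times, which makes the estimate less dependent on one particular split.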

3. Leave-One-Out Cross Validation

A special case of K-Fold Cross Validation.

Here:

Number of folds = number of data samples

Each time:

One sample is used for testing

Remaining samples are used for training
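
A minimal sketch of how the splits are formed. The function name and the sample size of 4 are assumptions for illustration.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, test_indices) with exactly one test sample per split."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]

# Hypothetical dataset with 4 samples gives 4 splits
splits = list(leave_one_out_indices(4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because the model is trained once per sample, this technique becomes slow on large datasets.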

4. Time Series Cross Validation

Used when data has time dependency.

Examples:

Stock prices

Weather forecasting

Sales trends

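Training always uses earlier (past) data and testing uses later (future) data, so the model is never tested on observations that come before its training data.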
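
One common way to do this is an expanding window, where the training set grows with each split. The sketch below is one possible scheme, not the only one; the function name and parameters are assumptions for illustration.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) where test data always follows training data."""
    first_train_end = n_samples - n_splits * test_size
    if first_train_end < 1:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        train_end = first_train_end + i * test_size
        train_idx = list(range(train_end))                         # everything so far
        test_idx = list(range(train_end, train_end + test_size))   # the next block in time
        yield train_idx, test_idx

# Hypothetical series of 10 time-ordered observations
ts_splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike K-Fold Cross Validation, the data is never shuffled, because shuffling would let the model learn from the future.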
