Data science methodology
1. Introduction to Data Science Methodology
What is Data Science Methodology?
Data Science Methodology is a structured framework used to design and complete AI or Data Science projects.
It provides a step-by-step process that helps organizations solve problems using data.
Purpose
Helps in selecting the correct analytical methods
Reduces time and cost
Improves accuracy of solutions
Ensures the problem is solved effectively
Proposed By
The Data Science Methodology was proposed by:
John Rollins (IBM Analytics)
Structure
The methodology consists of 10 iterative steps grouped into 5 modules.
It is cyclical, meaning the process can repeat to improve results.
5 Modules of Data Science Methodology
The 10 steps are grouped into 5 major modules:
From Problem to Approach (Steps 1 and 2)
From Requirements to Collection (Steps 3 and 4)
From Understanding to Preparation (Steps 5 and 6)
From Modelling to Evaluation (Steps 7 and 8)
From Deployment to Feedback (Steps 9 and 10)
Each module focuses on a specific stage of the data science project.
Step 1: Business Understanding
Definition
Business Understanding focuses on defining and clearly understanding the problem that needs to be solved.
This step ensures that the AI solution aligns with business goals.
Key Activities
Identify the problem
Understand customer needs
Define business objectives
Determine expected outcomes
Another Name
This stage is also known as:
Problem Scoping
Tools Used
5W1H Method
This technique helps in clearly understanding the problem.
It involves asking:
What is the problem?
Why does it exist?
Who is affected?
Where does it occur?
When does it occur?
How can it be solved?
Design Thinking
Design Thinking focuses on human-centered problem solving.
Steps usually include:
Empathize
Define
Ideate
Prototype
Test
Goal
Ensure that the right problem is being solved.
Step 2: Analytic Approach
Definition
The Analytic Approach determines how data will be used to solve the problem.
In this stage, we decide:
Which data analysis techniques will be used
Which algorithms may solve the problem
Types of Data Analytics
1. Descriptive Analytics
Answers the question:
“What happened?”
It analyzes past data.
Example:
Monthly sales reports
Website traffic statistics
2. Diagnostic Analytics
Answers:
“Why did it happen?”
It finds reasons behind events.
Example:
Why sales dropped
Why customers left a service
3. Predictive Analytics
Answers:
“What will happen in the future?”
Uses statistical or machine learning models to predict outcomes.
Example:
Predicting stock prices
Forecasting sales
4. Prescriptive Analytics
Answers:
“What should we do?”
Suggests actions to achieve desired results.
Example:
Recommending the best marketing strategy
Suggesting optimal routes in logistics
Step 3: Data Requirements
Definition
This stage identifies what type of data is needed to solve the problem.
It determines:
Data type
Data format
Data sources
Data structure
Types of Data
1. Structured Data
Highly organized data stored in tables or databases.
Example:
Excel sheets
SQL databases
Bank transaction records
2. Unstructured Data
Data that does not follow a predefined structure.
Example:
Images
Videos
Social media posts
Text documents
3. Semi-Structured Data
Data that has some structure (such as tags or keys) but is not fully organized into tables.
Example:
Emails
XML files
JSON files
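As a rough illustration of why JSON counts as semi-structured, the sketch below parses two records that do not share the same fields and flattens them into uniform rows. The record contents and the field names (name, email, tags) are made up for this example.

```python
import json

# Hypothetical JSON records: each has keys, but not every record has the same keys.
raw = '[{"name": "Asha", "email": "asha@example.com"}, {"name": "Ravi", "tags": ["new"]}]'

records = json.loads(raw)

# Collect every key that appears, then build uniform rows.
# A field that a record does not have becomes None.
columns = sorted({key for record in records for key in record})
rows = [[record.get(col) for col in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

The keys give the data some structure, but turning it into a table still requires deciding what to do with fields that are absent.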
Step 4: Data Collection
Definition
Data Collection is the process of gathering the required data from various sources.
Types of Data Collection
1. Primary Data
Data collected directly by the researcher.
Examples:
Surveys
Interviews
Questionnaires
Sensors
Observations
2. Secondary Data
Data obtained from existing sources, originally collected by someone else.
Examples:
Books
Websites
Research papers
Databases
Kaggle datasets
Important Note
If the collected data is missing or incomplete, it may need to be collected again.
Step 5: Data Understanding
Definition
This stage involves analyzing the collected data to understand its quality and usefulness.
We check whether the data is:
Relevant
Accurate
Complete
Consistent
Techniques Used
1. Descriptive Statistics
Used to summarize data.
Examples:
Mean
Median
Mode
Standard deviation
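The four statistics above can be computed with Python's built-in statistics module. This is a minimal sketch; the sample values are invented for illustration.

```python
import statistics

# Hypothetical sample: units sold on seven days.
values = [12, 15, 15, 18, 20, 22, 15]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value after sorting
mode_value = statistics.mode(values)      # most frequent value
std_dev = statistics.stdev(values)        # sample standard deviation (spread)

print(mean_value, median_value, mode_value, std_dev)
```

A large gap between the mean and the median is often a first hint that the data is skewed or contains outliers.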
2. Data Visualization
Helps understand data patterns visually.
Examples:
Histograms
Bar charts
Line graphs
Scatter plots
Visualization helps detect:
Outliers
Trends
Patterns
Step 6: Data Preparation
Definition
Data Preparation is the process of cleaning and transforming data so it can be used for modelling.
This stage usually takes the most time in a data science project.
Activities in Data Preparation
1. Data Cleaning
Fixing errors in the dataset.
Includes:
Removing incorrect values
Fixing inconsistent data
2. Handling Missing Values
Methods include:
Removing rows that contain missing values
Replacing missing values with the mean or median
Predicting missing values with a model
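The first two methods can be sketched in a few lines of plain Python. The ages list is a hypothetical column in which None marks a missing value; predicting missing values needs a trained model and is not shown here.

```python
import statistics

# Hypothetical column of ages; None marks a missing value.
ages = [25, None, 31, 40, None, 28]

# Method 1: drop the missing entries.
ages_dropped = [a for a in ages if a is not None]

# Method 2: replace each missing entry with the median of the known values.
median_age = statistics.median(ages_dropped)
ages_imputed = [median_age if a is None else a for a in ages]

print(ages_dropped)
print(ages_imputed)
```

Dropping rows is simple but loses data; filling with the median keeps every row at the cost of some accuracy.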
3. Removing Duplicates
Duplicate entries can distort results and must be removed.
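One simple way to remove exact duplicates while keeping the original order is shown below. The rows are hypothetical (customer_id, amount) pairs.

```python
# Hypothetical rows of (customer_id, amount); the last row repeats the second.
rows = [(1, 250), (2, 400), (3, 150), (2, 400)]

seen = set()
unique_rows = []
for row in rows:
    if row not in seen:  # keep only the first occurrence of each row
        seen.add(row)
        unique_rows.append(row)

print(unique_rows)
```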
4. Feature Engineering
Feature Engineering means creating new useful features from existing data.
Example:
Original Feature:
Date of birth
New Feature:
Age
A well-chosen feature like this can improve model performance, because age is easier for a model to use than a raw date.
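The date-of-birth example can be sketched as follows. The function name age_on and the sample dates are assumptions made for this illustration.

```python
from datetime import date

def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in completed years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

print(age_on(date(2000, 6, 15), date(2024, 6, 14)))  # 23 (day before birthday)
print(age_on(date(2000, 6, 15), date(2024, 6, 15)))  # 24 (on the birthday)
```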
Step 7: AI Modelling
Definition
AI Modelling involves building machine learning models using prepared data.
The model learns patterns from training data.
Types of Modelling
1. Descriptive Modelling
Used to describe patterns in data.
Example:
Customer segmentation
2. Predictive Modelling
Used to predict future outcomes.
Example:
Predicting house prices
Predicting disease risk
Model Validation
Definition
Model Validation is the process of evaluating a trained machine learning model.
It checks:
Accuracy of the model
Reliability
Performance on new unseen data
It is done using data the model did not see during training, such as a testing dataset.
Purpose of Model Validation
Improve model quality
Reduce prediction errors
Ensure generalization to new data
Step 8: Evaluation
Definition
Evaluation measures how well the model performs.
It determines whether the model solves the original problem effectively.
Common Evaluation Metrics
Accuracy
Measures overall correctness of predictions.
Formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision measures how many predicted positives were actually correct.
Formula:
Precision = TP / (TP + FP)
Recall
Recall measures how many actual positives were correctly identified.
Formula:
Recall = TP / (TP + FN)
F1 Score
F1 Score balances Precision and Recall.
Formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
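The four formulas above translate directly into code. This is a small sketch with made-up counts; the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the four metrics from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; otherwise a division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85, but recall is only 0.80, which shows why accuracy alone can hide missed positives.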
Mean Absolute Error (MAE)
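Measures the average absolute difference between predicted values and actual values. It is used for models that predict numbers (regression) rather than classes.
Formula:
MAE = (1/n) × Σ |Actual − Predicted|
Lower MAE means better predictions.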
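A minimal sketch of the MAE calculation is given below. The actual and predicted values are invented (for example, house prices in thousands).

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values: errors are 10, 10 and 30.
actual = [200, 150, 300]
predicted = [210, 140, 330]

mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))  # 16.67
```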
Confusion Matrix
A Confusion Matrix is a table used to evaluate classification models.
It contains four outcomes:
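True Positive (TP)
The model predicted positive and the actual class is positive.
True Negative (TN)
The model predicted negative and the actual class is negative.
False Positive (FP)
The model predicted positive but the actual class is negative.
False Negative (FN)
The model predicted negative but the actual class is positive.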
It helps calculate:
Accuracy
Precision
Recall
F1 Score
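To see where the four counts come from, the sketch below tallies them from a list of actual labels and a list of predicted labels. The labels (1 = positive, 0 = negative) and the function name confusion_counts are assumptions for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn

# Hypothetical labels for eight samples.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 3 1 1
```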
Step 9: Deployment
Definition
Deployment is the stage where the trained AI model is made available for real-world use.
Ways to Deploy Models
Mobile applications
Websites
APIs
Software systems
Often, a model is first released to a limited group of users for testing before a full rollout.
Step 10: Feedback
Definition
Feedback collects user responses and performance data after deployment.
Purpose
Feedback helps to:
Improve the model
Fix errors
Retrain the model with new data
This makes the methodology iterative and cyclical.
Model Validation Techniques
1. Train-Test Split
In this technique, the dataset is divided into two parts:
Training Dataset
Used to train the model.
Testing Dataset
Used to evaluate model performance.
Purpose
To estimate how well the model will perform on new unseen data.
How It Works
Dataset is split into two parts.
Model is trained on training data.
Predictions are made on test data.
Predictions are compared with actual values.
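The splitting step can be sketched with the standard library alone. The helper split_train_test is a hypothetical name written for this example, not a library function, and the 80/20 ratio is only a common choice. Training and prediction depend on the chosen model and are not shown.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten samples.
data = list(range(10))
train, test = split_train_test(data, test_fraction=0.2)

print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids a test set made up only of the last rows, but it should not be used for time-ordered data.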
2. K-Fold Cross Validation
In this technique:
The dataset is divided into k parts (folds) of equal or nearly equal size.
Process
Model is trained on k−1 folds
Tested on the remaining 1 fold
Process repeats k times
Final performance = average of all results
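The process above can be sketched as follows. To keep the example self-contained, the "model" is deliberately trivial (it predicts the mean of the training values) and each fold is scored with mean absolute error. The data and the helper name k_fold_indices are assumptions, and no shuffling is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical target values.
y = [3.0, 5.0, 4.0, 6.0, 7.0, 5.0]

scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    # "Train": the stand-in model just remembers the mean of the training values.
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    # "Test": mean absolute error on the held-out fold.
    fold_error = sum(abs(y[i] - prediction) for i in test_idx) / len(test_idx)
    scores.append(fold_error)

final_score = sum(scores) / len(scores)  # average over all folds
print(scores, final_score)
```

Every sample is used for testing exactly once, so the final score depends less on one lucky or unlucky split than a single train-test split does.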
3. Leave-One-Out Cross Validation
A special case of K-Fold Cross Validation.
Here:
Number of folds = number of data samples
Each time:
One sample is used for testing
Remaining samples are used for training
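A minimal sketch of how the splits are generated is shown below; the function name leave_one_out is hypothetical. With n samples the model is trained n times, so this technique is usually practical only for small datasets.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_indices) with exactly one test sample per split."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]

splits = list(leave_one_out(4))
for train, test in splits:
    print(train, test)
```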
4. Time Series Cross Validation
Used when the data has a time dependency, meaning the order of the observations matters.
Example:
Stock prices
Weather forecasting
Sales trends
Training uses past data and testing uses later data, so the model never learns from the future.
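One common way to do this is an expanding window, sketched below: the training set grows with each split and the test block always comes after it. The helper name time_series_splits and the sizes are assumptions for this example, and the samples are assumed to be sorted from oldest to newest.

```python
def time_series_splits(n_samples, n_splits):
    """Yield expanding-window (train_indices, test_indices) pairs.

    Samples are assumed to be in time order. If n_samples is not divisible
    by n_splits + 1, the last few samples are left unused.
    """
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * test_size
        train = list(range(0, train_end))                     # everything before the test block
        test = list(range(train_end, train_end + test_size))  # the next block in time
        yield train, test

splits = list(time_series_splits(n_samples=10, n_splits=4))
for train, test in splits:
    print(train, test)
```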