Data Science Life Cycle
The series of steps including problem definition, data collection, data preparation, exploratory data analysis, modeling, evaluation, and deployment.
Problem Definition
Define the problem or business question (e.g., reducing customer churn).
Data Collection
Gather data from various sources such as databases, APIs, or surveys.
Data Preparation
Clean and preprocess the data (handle missing values, normalize, remove duplicates).
Exploratory Data Analysis (EDA)
Use summary statistics and visualizations to understand data patterns and detect anomalies.
Modeling
Select and train an appropriate model (regression, classification, clustering).
Evaluation
Measure the model's performance using metrics like accuracy, precision, recall, or RMSE.
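The metrics named above can be computed by hand. The following is a minimal sketch, assuming binary labels where 1 is the positive class; the function names `classification_metrics` and `rmse` and the sample labels are illustrative and not part of the original cards.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for binary labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing was predicted/actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions (regression)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up labels for illustration only.
metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
error = rmse([3.0, 5.0], [1.0, 5.0])
```

Accuracy, precision, and recall apply to classification models, while RMSE applies to regression models.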
Deployment
Implement the model into a production environment to drive decision-making.
Volume
The enormous amount of data generated (e.g., social media, sensor data).
Velocity
The speed of data generation and processing (e.g., real-time stock data).
Variety
Different types of data: structured, unstructured, and semi-structured (e.g., text, images, videos).
Structured Data
Organized in tables with a fixed schema (e.g., SQL databases).
Unstructured Data
No fixed structure (e.g., emails, social media posts, images).
Semi-Structured Data
Contains tags or markers (e.g., JSON, XML).
Categorical Data
Data that represents groups (nominal like color, ordinal like rating scales).
Numerical Data
Quantitative data (discrete like count of items, continuous like height).
Text-Based Formats
CSV/TSV for tabular data; JSON for hierarchical data.
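As a rough illustration of the difference between tabular and hierarchical text formats, the sketch below reads both with Python's standard library. The sample data and field names are invented for the example.

```python
import csv
import io
import json

# Tabular data: every row has the same columns (hypothetical sample).
csv_text = "name,plan,monthly_fee\nAda,basic,10\nGrace,premium,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))  # one dict per row; values are strings

# Hierarchical data: values can nest objects and lists (hypothetical sample).
json_text = '{"customer": {"name": "Ada", "orders": [{"id": 1, "items": ["pen", "ink"]}]}}'
record = json.loads(json_text)
first_item = record["customer"]["orders"][0]["items"][0]
```

Note that `csv` returns every field as a string, so numeric columns such as `monthly_fee` need explicit conversion before analysis.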
Markup Formats
HTML and XML for web content and data exchange.
Image Formats
Common formats differ in features and compression: GIF supports animation, PNG uses lossless compression, and JPEG uses lossy compression.
Databases
Relational (SQL) databases for structured queries; NoSQL databases for flexible storage of unstructured or semi-structured data.
Hypothesis Formation
Develop a testable statement (e.g., "Introducing a loyalty program will reduce churn by 20%").
Independent Variable (IV)
The factor you manipulate.
Dependent Variable (DV)
The outcome you measure.
Confounding Variables
Uncontrolled factors that might affect the DV.
Control vs. Treatment Groups
A control group does not receive the treatment, allowing for comparison with the treatment group.
Randomization Techniques
Random sampling, randomized controlled trials (RCTs), stratified randomization, and block designs.
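One of these techniques, stratified randomization, might be sketched as follows: units are grouped by a stratum and then randomly split into treatment and control within each group. The function name `stratified_assign`, the customer IDs, and the "new"/"loyal" strata are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_assign(units, stratum_of, seed=0):
    """Randomly assign units to treatment/control within each stratum (illustrative)."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        half = len(members) // 2  # with an odd count, the extra unit goes to control
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < half else "control"
    return assignment

# Hypothetical customers labeled by tenure group.
customers = {
    "c1": "new", "c2": "new", "c3": "new", "c4": "new",
    "c5": "loyal", "c6": "loyal", "c7": "loyal", "c8": "loyal",
}
assignment = stratified_assign(list(customers), customers.get, seed=42)
```

Splitting within each stratum keeps the treatment and control groups balanced on that variable, which plain randomization only achieves on average.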
Data Collection Methods
Options include observational studies, experiments, surveys, or simulations.
Bias Reduction Techniques
Blinding (single or double), placebo control, and randomization to reduce selection bias.
Clarity in Data Visualization
Present data so that the message is easy to understand.
Accuracy in Data Visualization
Represent data truthfully without distortions.
Efficiency in Data Visualization
Allow viewers to quickly grasp the main insights.
Aesthetics in Data Visualization
Use design elements (color, layout) to enhance readability without distracting.
Comparison Visualization
Bar charts, line charts.
Correlation Visualization
Scatter plots.
Part-to-Whole Visualization
Pie charts or stacked bar charts.
Data Over Time Visualization
Time series graphs.
Distribution Visualization
Histograms, box-and-whisker plots.
Preattentive Features
Elements in data visualization that draw attention immediately, such as color or shape.
Exploratory Data Analysis (EDA)
A process to gain insights, identify patterns, and form hypotheses from data.
Summary Statistics
Calculations including mean, median, mode, variance, and standard deviation to summarize data.
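These statistics are all available in Python's standard `statistics` module. A small sketch with made-up data follows; the card does not say whether population or sample formulas are intended, so both are shown.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Population versions divide by n; sample versions divide by n - 1.
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)
```

The sample versions are the usual choice when the data is a subset drawn from a larger population.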
Visual Analysis
Utilizing histograms, box plots, scatter plots, and bar charts to identify trends and outliers.
Correlation Analysis
Using Pearson's correlation coefficient to measure linear relationships between continuous variables.
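Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal from-scratch sketch, with the function name `pearson_r` and the sample values chosen for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly increasing line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing line
```

The result ranges from -1 to 1. A value near 0 means no linear relationship, although a nonlinear one may still exist.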
Categorical Data
Data that can be divided into groups or categories.
Continuous Data
Data that can take any value within a range.
MCAR
Missing Completely at Random: No systematic pattern in the missing data.
MAR
Missing at Random: The probability that a value is missing depends only on observed data, not on the missing value itself.
MNAR
Missing Not at Random: The probability that a value is missing depends on unobserved data, such as the missing value itself.
Handling Strategies for Missing Data
Methods such as deletion or imputation to address missing values.
Deletion
Removing data entries with excessive missing values.
Imputation
Replacing missing values using techniques like mean/median imputation or regression.
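The two handling strategies can be sketched in a few lines, assuming missing values are represented as `None`. The helper names `drop_sparse_rows` and `impute` and the 50% threshold are assumptions for the example, not part of the original cards.

```python
import statistics

def drop_sparse_rows(rows, max_missing_fraction=0.5):
    """Deletion: keep only rows whose share of missing fields is within the limit."""
    kept = []
    for row in rows:
        missing = sum(1 for value in row if value is None)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept

def impute(values, strategy="mean"):
    """Imputation: replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

rows = [[1, 2, 3], [None, None, 3], [4, None, 6]]  # made-up rows
kept_rows = drop_sparse_rows(rows)

column = [10, None, 20, 90]  # made-up column with one missing value
mean_filled = impute(column, "mean")
median_filled = impute(column, "median")
```

Median imputation is less sensitive to extreme values such as the 90 in this column.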
Error and Outlier Detection
Using statistical methods (z-scores, IQR) and visual methods (box plots) to identify anomalies.
Z-scores
A statistical measure of how many standard deviations a value lies from the mean: z = (value - mean) / standard deviation.
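A rough sketch of z-score outlier detection follows. The population standard deviation and the cutoff are assumptions: a cutoff of 3 is a common convention, but the cards do not specify one, and the function name `zscore_outliers` is illustrative.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold (illustrative)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]

data = [10, 12, 11, 13, 12, 11, 50]  # made-up values with one extreme point
flagged_strict = zscore_outliers(data)                # default cutoff of 3
flagged_loose = zscore_outliers(data, threshold=2.0)  # looser cutoff
```

In a sample this small, the extreme point inflates the standard deviation enough that its z-score stays below 3, so only the looser cutoff flags it. This is one reason the IQR method is often preferred for small or skewed datasets.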
IQR Method
Interquartile Range method: measures the spread of the middle 50% of the data (IQR = Q3 - Q1) and commonly flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.
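The IQR method might be sketched as below. The 1.5 multiplier is the conventional choice rather than something stated in the cards, the function name `iqr_outliers` is illustrative, and quartile definitions vary between tools, so the "inclusive" method used here is an assumption.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using the IQR rule (illustrative)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                 # spread of the middle 50% of the data
    low = q1 - k * iqr            # lower fence
    high = q3 + k * iqr           # upper fence
    outliers = [v for v in values if v < low or v > high]
    return low, high, outliers

data = [10, 12, 11, 13, 12, 11, 50]  # made-up values with one extreme point
low, high, outliers = iqr_outliers(data)
```

Because quartiles are based on ranks rather than on the mean, a single extreme value barely moves the fences.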
Box-and-Whisker Plot
A visual tool in EDA that conveys information about median, quartiles, and potential outliers.
Duplicate Records
Entries in a dataset that are identical and can skew analysis results.
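Removing exact duplicates while keeping the first occurrence can be sketched as follows, assuming each record is a dictionary with hashable values. The function name `drop_duplicates` and the sample records are invented for the example.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        # Sort by field name so key order inside the dict does not matter.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"name": "Ada", "id": 1},  # same content as the first record
]
unique_records = drop_duplicates(records)
```

This only catches records that match on every field. Near-duplicates, such as the same customer with different spelling, need separate matching logic.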