Data Science Life Cycle and Key Concepts

55 Terms

1

Data Science Life Cycle

The series of steps including problem definition, data collection, data preparation, exploratory data analysis, modeling, evaluation, and deployment.

2

Problem Definition

Define the problem or business question (e.g., reducing customer churn).

3

Data Collection

Gather data from various sources such as databases, APIs, or surveys.

4

Data Preparation

Clean and preprocess the data (handle missing values, normalize, remove duplicates).
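
As a rough illustration of the "normalize" step, here is a minimal sketch that assumes min-max scaling to the range [0, 1]. The card does not say which normalization is meant, so this choice, the function name `min_max_normalize`, and the sample values are all assumptions.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up example values
scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)
```

Other common choices, such as z-score standardization, would work just as well here depending on the model being trained.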

5

Exploratory Data Analysis (EDA)

Use summary statistics and visualizations to understand data patterns and detect anomalies.

6

Modeling

Select and train an appropriate model (regression, classification, clustering).

7

Evaluation

Measure the model's performance using metrics like accuracy, precision, recall, or RMSE.
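
The metrics named on this card can be computed by hand for small cases. The sketch below assumes binary labels with 1 as the positive class; the helper names and the sample labels are made up for illustration.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up labels: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, prec, rec = classification_metrics(y_true, y_pred)
print(acc, prec, rec)
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

Accuracy, precision, and recall apply to classification models, while RMSE applies to regression models that predict numbers.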

8

Deployment

Deploy the model to a production environment to drive decision-making.

9

Volume

The enormous amount of data generated (e.g., social media, sensor data).

10

Velocity

The speed of data generation and processing (e.g., real-time stock data).

11

Variety

Different types of data: structured, unstructured, and semi-structured (e.g., text, images, videos).

12

Structured Data

Organized in tables with a fixed schema (e.g., SQL databases).

13

Unstructured Data

No fixed structure (e.g., emails, social media posts, images).

14

Semi-Structured Data

Has no fixed schema but contains tags or markers that label and separate elements (e.g., JSON, XML).

15

Categorical Data

Data that represents groups: nominal (unordered, e.g., color) or ordinal (ordered, e.g., rating scales).

16

Numerical Data

Quantitative data: discrete (e.g., count of items) or continuous (e.g., height).

17

Text-Based Formats

CSV/TSV for tabular data; JSON for hierarchical data.
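
To make the tabular versus hierarchical distinction concrete, this sketch reads a small CSV string and a small JSON string using Python's standard `csv` and `json` modules. The field names and records are invented for the example.

```python
import csv
import io
import json

# Tabular data: every row has the same columns (made-up sample).
csv_text = "name,plan\nAda,basic\nLin,premium\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hierarchical data: values can nest inside other values (made-up sample).
json_text = '{"customer": {"name": "Ada", "orders": [{"id": 1}, {"id": 2}]}}'
record = json.loads(json_text)
order_ids = [order["id"] for order in record["customer"]["orders"]]

print(rows)
print(order_ids)
```

For TSV, passing `delimiter="\t"` to `csv.DictReader` handles the tab separator.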

18

Markup Formats

HTML and XML for web content and data exchange.

19

Image Formats

Examples include GIF (supports animation), PNG (lossless compression), and JPEG (lossy compression).

20

Databases

Relational (SQL) databases for structured data with a fixed schema; NoSQL databases for flexible storage of semi-structured or unstructured data.

21

Hypothesis Formation

Develop a testable statement (e.g., "Introducing a loyalty program will reduce churn by 20%").

22

Independent Variable (IV)

The factor you manipulate.

23

Dependent Variable (DV)

The outcome you measure.

24

Confounding Variables

Uncontrolled factors related to both the IV and the DV that can distort the apparent relationship between them.

25

Control vs. Treatment Groups

A control group does not receive the treatment, allowing for comparison with the treatment group.

26

Randomization Techniques

Random sampling, randomized controlled trials (RCTs), stratified randomization, or block designs.
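
One of the listed techniques, stratified randomization, can be sketched as follows: units are grouped by a stratum, shuffled within each group, and split evenly between treatment and control. The function name `stratified_assign`, the customer records, and the even split are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_assign(units, stratum_of, seed=0):
    """Randomly split each stratum in half: first half treatment, rest control."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < half else "control"
    return assignment

# Made-up customers tagged with a segment that serves as the stratum
customers = [("c1", "new"), ("c2", "new"), ("c3", "new"), ("c4", "new"),
             ("c5", "loyal"), ("c6", "loyal"), ("c7", "loyal"), ("c8", "loyal")]
groups = stratified_assign(customers, stratum_of=lambda c: c[1], seed=42)
for customer, group in sorted(groups.items()):
    print(customer, group)
```

Splitting within each stratum keeps both groups balanced on the stratifying variable, which simple randomization does not guarantee in small samples.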

27

Data Collection Methods

Options include observational studies, experiments, surveys, or simulations.

28

Bias Reduction Techniques

Blinding (single or double), placebo control, and randomization to reduce selection bias.

29

Clarity in Data Visualization

The visualization should present data so that it is easy to understand.

30

Accuracy in Data Visualization

Represent data truthfully without distortions.

31

Efficiency in Data Visualization

Allow viewers to quickly grasp the main insights.

32

Aesthetics in Data Visualization

Use design elements (color, layout) to enhance readability without distracting.

33

Comparison Visualization

Bar charts, line charts.

34

Correlation Visualization

Scatter plots.

35

Part-to-Whole Visualization

Pie charts or stacked bar charts.

36

Data Over Time Visualization

Time series graphs.

37

Distribution Visualization

Histograms, box-and-whisker plots.

38

Preattentive Features

Elements in data visualization that draw attention immediately, such as color or shape.

39

Exploratory Data Analysis (EDA)

A process to gain insights, identify patterns, and form hypotheses from data.

40

Summary Statistics

Calculations including mean, median, mode, variance, and standard deviation to summarize data.
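
These statistics are all available in Python's standard `statistics` module. The sketch below uses the population versions of variance and standard deviation (`pvariance`, `pstdev`); that choice and the sample values are assumptions, and the sample versions (`variance`, `stdev`) divide by n - 1 instead of n.

```python
import statistics

# Made-up sample values
values = [2, 4, 4, 5, 7, 8]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "variance": statistics.pvariance(values),  # population variance
    "std_dev": statistics.pstdev(values),      # population standard deviation
}
print(summary)
```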

41

Visual Analysis

Utilizing histograms, box plots, scatter plots, and bar charts to identify trends and outliers.

42

Correlation Analysis

Using Pearson's correlation coefficient to measure linear relationships between continuous variables.
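
Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch, with a hypothetical helper `pearson_r` and made-up study data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of products and squares are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]
r = pearson_r(hours, scores)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. The sketch assumes neither variable is constant, since that would make the denominator zero.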

43

Categorical Data

Data that can be divided into groups or categories.

44

Continuous Data

Data that can take any value within a range.

45

MCAR

Missing Completely at Random: Missingness is unrelated to any observed or unobserved data, so there is no systematic pattern in the missing values.

46

MAR

Missing at Random: The chance that a value is missing depends on other observed data, not on the missing value itself.

47

MNAR

Missing Not at Random: The chance that a value is missing depends on unobserved data, including the missing value itself.

48

Handling Strategies for Missing Data

Methods such as deletion or imputation to address missing values.

49

Deletion

Removing data entries with excessive missing values.

50

Imputation

Replacing missing values using techniques like mean/median imputation or regression.
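
Mean and median imputation can be sketched in a few lines. Here missing values are represented as `None`; that representation, the helper name `impute`, and the sample ages are assumptions for illustration.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Made-up ages with two missing entries and one large value
ages = [30, None, 40, 50, None, 100]
mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
print(mean_filled)
print(median_filled)
```

The median is less sensitive to extreme values such as the 100 above, which is one reason it is often preferred for skewed data.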

51

Error and Outlier Detection

Using statistical methods (z-scores, IQR) and visual methods (box plots) to identify anomalies.

52

Z-scores

A measure of how many standard deviations a value lies from the mean.
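
A z-score is computed as (value - mean) / standard deviation. The sketch below flags values whose absolute z-score exceeds 2; that cutoff, the use of the population standard deviation, and the sample data are assumptions (3 is another common cutoff for larger datasets).

```python
import statistics

def z_scores(values):
    """Return the z-score of each value, using the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]

# Made-up measurements with one suspicious value
data = [10, 12, 11, 13, 12, 11, 40]
zs = z_scores(data)
flagged = [v for v, z in zip(data, zs) if abs(z) > 2]
print(flagged)
```

In very small samples an extreme value inflates the standard deviation itself, so its z-score can stay modest; this is why the cutoff here is 2 rather than 3.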

53

IQR Method

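Interquartile Range method used to detect outliers by measuring the spread of the middle 50% of data (IQR = Q3 - Q1); values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers.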
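
A minimal sketch of the IQR method, assuming the conventional 1.5 × IQR fences and the "inclusive" quartile method of `statistics.quantiles` (quartile definitions vary between tools, so results near the fences can differ). The helper name and data are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up measurements with one suspicious value
data = [10, 12, 11, 13, 12, 11, 40]
outliers = iqr_outliers(data)
print(outliers)
```

Because quartiles are based on ranks, the fences are barely affected by the extreme value, which makes this method more robust than z-scores on small samples.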

54

Box-and-Whisker Plot

A visual tool in EDA that conveys information about median, quartiles, and potential outliers.

55

Duplicate Records

Entries in a dataset that are identical and can skew analysis results.
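
To show how duplicates can skew a result, this sketch removes exact duplicate records and compares the mean before and after. It assumes records are dictionaries with hashable values; the helper name and the order data are invented for the example.

```python
import statistics

def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # order-independent fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Made-up orders; the third entry repeats the first exactly
orders = [
    {"id": 1, "amount": 20},
    {"id": 2, "amount": 35},
    {"id": 1, "amount": 20},
]
deduped = drop_duplicates(orders)

mean_with_dupes = statistics.mean(o["amount"] for o in orders)
mean_without = statistics.mean(o["amount"] for o in deduped)
print(mean_with_dupes, mean_without)
```

The repeated order pulls the average amount down, so the two means differ even though no new information was added.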