data 2 data mining

78 Terms

1
New cards

What is Exploratory Data Analysis (EDA)?

A set of statistical and visualization techniques used to understand data before preprocessing and modeling

2
New cards

What are 5 purposes of EDA?

Understand data and summarize key properties, Discover noisy data and outliers, Comprehend data distribution, Decide which cleaning techniques to apply, Guide preprocessing decisions

3
New cards

How is EDA cross-classified?

By method type (non-graphical or graphical) and by scope (univariate or multivariate, usually bivariate)

4
New cards

What is the difference between Population and Sample?

Population: the entire group we want to draw conclusions about. Sample: a subset of the population, used when the population is too large to examine in full

5
New cards

What should a sample be?

An unbiased subset that best represents the entire population

6
New cards

What is the Mean (average)?

An algebraic measure of central tendency calculated by summing all values and dividing by count

7
New cards

What is the difference between sample mean and population mean formulas?

Sample mean: x̄ = Σxi / n, where n is the sample size. Population mean: μ = Σxi / N, where N is the population size

8
New cards

What is Weighted arithmetic mean?

A mean where different values have different weights or importance in the calculation

9
New cards

What is Trimmed mean?

A mean calculated after discarding the most extreme values at both ends (e.g., Olympics gymnastics score computation)
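
The three kinds of mean on the cards above can be compared with a minimal sketch. The helper names `weighted_mean` and `trimmed_mean`, the choice to drop `k` values from each end, and the sample scores are illustrative assumptions, not part of the original cards.

```python
from statistics import mean

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest values (assumed trimming rule)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every value")
    return mean(ordered[k:len(ordered) - k])

# Hypothetical judge scores with one very low value.
scores = [9.0, 9.5, 9.5, 9.0, 2.0]
plain = mean(scores)                  # pulled down by the 2.0
trimmed = trimmed_mean(scores, k=1)   # ignores the 2.0 and one 9.5

# Hypothetical grades where the second grade counts three times as much.
weighted = weighted_mean([80, 90], [1, 3])
```

The trimmed mean stays close to the bulk of the scores, while the plain mean is dragged toward the single low score.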

10
New cards

What is the Median?

The middle value in a data set when values are ordered

11
New cards

How do you calculate median for odd number of data points?

The median is the middle value after sorting

12
New cards

How do you calculate median for even number of data points?

The median is the average of the two middle values after sorting

13
New cards

Why use Median instead of Mean?

Median is resistant to extreme outliers and useful for skewed distributions

14
New cards

What is the formula for median in grouped data?

Median = L + ((n/2 - B) / G) × W

15
New cards

What does L represent in grouped median formula?

Lower class boundary of the median bin

16
New cards

What does B represent in grouped median formula?

Cumulative frequency of the bins before the median bin

17
New cards

What does G represent in grouped median formula?

Frequency of the median bin

18
New cards

What does W represent in grouped median formula?

Median bin width
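
The grouped-median formula Median = L + ((n/2 - B) / G) × W can be traced with a short sketch. The function name `grouped_median`, the `(lower, upper, frequency)` bin layout, and the example bins are assumptions made for illustration.

```python
def grouped_median(bins):
    """Approximate the median from bins given as (lower, upper, frequency),
    sorted by lower boundary."""
    n = sum(freq for _, _, freq in bins)
    cumulative = 0  # B: cumulative frequency of the bins before the current one
    for lower, upper, freq in bins:
        # The median bin is the first bin whose cumulative frequency reaches n/2.
        if freq > 0 and cumulative + freq >= n / 2:
            width = upper - lower  # W: median bin width
            # L + ((n/2 - B) / G) * W, with L = lower and G = freq
            return lower + ((n / 2 - cumulative) / freq) * width
        cumulative += freq
    raise ValueError("bins must contain at least one observation")

# Hypothetical grouped data: 5 values in [0, 10), 10 in [10, 20), 5 in [20, 30).
example_bins = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
approx_median = grouped_median(example_bins)
```

Here n = 20, so n/2 = 10 falls in the second bin with L = 10, B = 5, G = 10, W = 10, giving 10 + (5/10) × 10 = 15.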

19
New cards

What is the Mode?

The value that occurs most frequently in the data

20
New cards

What is the empirical formula relating mean, median, and mode?

Mean - Mode ≈ 3(Mean - Median)

21
New cards

What is Unimodal distribution?

Distribution with one mode (one peak)

22
New cards

What is Bimodal distribution?

Distribution with two modes (two peaks)

23
New cards

What is Trimodal distribution?

Distribution with three modes (three peaks)

24
New cards

In symmetric (normal) distribution, how are mean, median, and mode related?

They are all equal and located at the center

25
New cards

In positively skewed distribution, what is the order of mean, median, and mode?

Mode < Median < Mean (mean is pulled toward the tail)

26
New cards

In negatively skewed distribution, what is the order of mean, median, and mode?

Mean < Median < Mode (mean is pulled toward the tail)
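
The ordering of mode, median, and mean under skew can be checked on a small made-up data set. The values below are an illustrative assumption; the left-skewed set is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: a long tail toward large values.
right_skewed = [1, 2, 2, 3, 4, 5, 12]
# Mirroring the values flips the tail to the left.
left_skewed = [-v for v in right_skewed]

def summary(values):
    """Return (mode, median, mean) for a list of numbers."""
    return mode(values), median(values), mean(values)

r_mode, r_median, r_mean = summary(right_skewed)
l_mode, l_median, l_mean = summary(left_skewed)
```

For the right-skewed set the mode is 2, the median is 3, and the mean is 29/7 (about 4.14), so Mode < Median < Mean; the mirrored set reverses the order.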

27
New cards

What does the center (μ) of a normal distribution represent?

Central tendency (mean, median, mode are all equal)

28
New cards

What does sigma (σ) in normal distribution represent?

Data dispersion or spread

29
New cards

After z-score normalization, what is the mean?

μ = 0

30
New cards

After z-score normalization, what is the standard deviation?

σ = 1
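
A minimal sketch of z-score normalization, z = (x - μ) / σ, shows why the result has mean 0 and standard deviation 1. The function name `z_scores` and the sample data are assumptions; this sketch uses the population standard deviation, and using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
z = z_scores(data)
z_mean = mean(z)      # 0 up to floating-point rounding
z_std = pstdev(z)     # 1 up to floating-point rounding
```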

31
New cards

What is Variance?

A measure of dispersion around the mean

32
New cards

What is the sample variance formula?

s² = Σ(xi - x̄)² / (n-1)

33
New cards

What is the population variance formula?

σ² = Σ(xi - μ)² / N

34
New cards

Why do we divide by n-1 for sample variance?

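To correct the bias introduced by using the sample mean in place of the population mean; dividing by n would systematically underestimate the population variance (Bessel's correction)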
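
The two variance formulas differ only in the divisor, which a short sketch makes concrete. The helper name `sum_of_squares` and the data are illustrative assumptions; `statistics.variance` and `statistics.pvariance` are the standard-library equivalents used as a cross-check.

```python
from statistics import mean, pvariance, variance

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

population_var = sum_of_squares(data) / n        # sigma^2, divides by N
sample_var = sum_of_squares(data) / (n - 1)      # s^2, divides by n - 1

# Cross-check against the standard library.
same_population = abs(population_var - pvariance(data)) < 1e-9
same_sample = abs(sample_var - variance(data)) < 1e-9
```

Because n - 1 is smaller than n, the sample variance is always slightly larger, and the gap shrinks as n grows.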

35
New cards

What is Standard Deviation?

The square root of the variance, measuring dispersion in the same units as the data

36
New cards

What does Covariance measure?

The relationship between two numerical variables, showing how they change together

37
New cards

What does positive covariance indicate?

Both variables move together (increase together or decrease together)

38
New cards

What does negative covariance indicate?

Variables move in opposite directions (one increases while other decreases)

39
New cards

What does zero covariance indicate?

No linear pattern in how the variables move together (a non-linear relationship may still exist)

40
New cards

Why is covariance sensitive to scale?

Because it's calculated using the actual values without standardization

41
New cards

What is a Covariance Matrix?

A matrix summarizing variance and covariance information for variables (variance on diagonal, covariance in off-diagonal entries)

42
New cards

What is the Correlation formula?

ρ12 = cov(X1, X2) / (σ1 × σ2), where σ represents standard deviation

43
New cards

What does correlation measure?

The strength and direction of the linear relationship between two variables; it is a standardized covariance, obtained by dividing the covariance by the product of the two standard deviations

44
New cards

What does ρ12 > 0 indicate?

X1 and X2 are positively correlated (X1's values increase as X2's increase)

45
New cards

What does ρ12 < 0 indicate?

X1 and X2 are negatively correlated (X1's values increase as X2's decrease)

46
New cards

What does ρ12 = 0 indicate?

Variables are uncorrelated (no linear relationship); this alone does not guarantee independence, since a non-linear relationship can still exist
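
Covariance, its sensitivity to scale, and the three cases of ρ can be illustrated with a small sketch. The function names and data are assumptions; sample (n - 1) formulas are used throughout, which gives the same correlation as the population formulas because the divisors cancel.

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """cov(X1, X2) / (s1 * s2); covariance(v, v) is the variance of v."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]   # exact increasing linear relationship
y_square = [v * v for v in x]       # fully determined by x, but not linearly

# Rescaling x (e.g. changing units) multiplies covariance but not correlation.
x_scaled = [100 * v for v in x]

cov_original = covariance(x, y_linear)
cov_scaled = covariance(x_scaled, y_linear)
rho_linear = correlation(x, y_linear)
rho_scaled = correlation(x_scaled, y_linear)
rho_square = correlation(x, y_square)
```

`rho_square` is 0 even though `y_square` depends entirely on `x`, which is why zero correlation only rules out a linear relationship.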

47
New cards

What is a Correlation Matrix (Correlation Heatmap)?

A matrix showing correlations between each pair of variables in a dataset

48
New cards

What is the range of correlation coefficient values?

[-1, 1]

49
New cards
What are the 7 main data visualization techniques?
Boxplot, Histogram and Bar chart, Quantile plot, Quantile-quantile (Q-Q) plot, Scatter plot, Line chart, Parallel Coordinates plot
50
New cards
What are Quartiles in a boxplot?
Q1 (25th percentile), Q2 (the median, 50th percentile), and Q3 (75th percentile); the box spans Q1 to Q3
51
New cards
What is IQR in a boxplot?
Interquartile Range = Q3 - Q1
52
New cards
What is the five number summary in a boxplot?
min, Q1, median, Q3, max
53
New cards
What are Whiskers in a boxplot?
Two lines extending from the box to the minimum and maximum (when outliers are plotted separately, to the most extreme values that are not outliers)
54
New cards
How are outliers defined in a boxplot?
Points beyond a specified threshold (e.g., values more than 1.5 × IQR above Q3 or below Q1)
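
The 1.5 × IQR rule can be sketched as follows. The function name `iqr_outliers` and the data are assumptions, and the quartiles come from `statistics.quantiles` with the "inclusive" method; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical data with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(data)
```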
55
New cards
What does a Boxplot represent?
Data dispersion with a box where ends are at Q1 and Q3, height is IQR, and median is marked by a line within the box
56
New cards
What is a Histogram?
Tabulated frequencies represented by bars
57
New cards
What is a Bar chart?
Categorical data with bars proportional to the values they represent
58
New cards
What are 4 differences between Histogram and Bar chart?
Histograms show the distribution of one variable while bar charts compare categories; histograms plot binned quantitative data while bar charts plot categorical data; bars can be reordered in a bar chart but not in a histogram; in a histogram the area of a bar denotes the value, not its height
59
New cards
What are 2 histogram partitioning rules?
Equal-width (equal bucket range) and Equal-frequency/Equal-depth (same number of items per bucket)
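
The two partitioning rules can be contrasted on the same data with a minimal sketch. The function names, the choice of three buckets, and the sample values are illustrative assumptions.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k buckets of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; width would be zero")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum would index bucket k, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(data, 3)
by_frequency = equal_frequency_bins(data, 3)
```

Equal-width buckets cover the same range but may hold very different counts; equal-frequency buckets hold the same count but cover ranges of different widths.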
60
New cards
Why do histograms often tell more than boxplots?
Two histograms may have the same box plot (same min, Q1, median, Q3, max) but have rather different data distributions
61
New cards
What is the purpose of a Quantile Plot?
Visualizes all quantile information for a specific attribute
62
New cards
What are 3 benefits of Quantile Plot?
Provides a comprehensive view of the attribute's distribution; helps identify general trends and outliers; each point (fi, xi) shows that approximately a fraction fi of the data points have values ≤ xi
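
A quantile plot pairs each sorted value xi with a fraction fi. The sketch below assumes the common convention fi = (i - 0.5) / n, which is not stated on the card; the function name and data are also illustrative.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs: x_i is the i-th smallest value and
    f_i = (i - 0.5) / n is the assumed fraction of data at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([30, 10, 20, 40])
```

Plotting fi on the horizontal axis against xi on the vertical axis gives the quantile plot; the points here are (0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40).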
63
New cards
What is a Quantile-Quantile (Q-Q) Plot?
A graphical tool to compare the quantiles of two distributions
64
New cards
What is on the X-axis of a Q-Q plot?
Quantiles of the theoretical distribution
65
New cards
What is on the Y-axis of a Q-Q plot?
Quantiles of the data distribution
66
New cards
What is the reference line in a Q-Q plot?
A diagonal line (usually y=x)
67
New cards
What is the purpose of a Q-Q plot?
Assess distributional similarity between an attribute and another attribute or theoretical distribution (e.g., normal distribution)
68
New cards
How do you interpret a Q-Q plot with close fit to straight line?
The two distributions are similar
69
New cards
How do you interpret a Q-Q plot with deviations from the line?
The two distributions are different
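
The coordinates of a Q-Q plot against a normal distribution can be computed with the standard library. This is a sketch under assumptions: the function name is hypothetical, the plotting positions use (i - 0.5) / n, and the theoretical normal is fitted with the sample mean and standard deviation so that the reference line is y = x.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, data quantile) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [
        (fitted.inv_cdf((i - 0.5) / n), x)   # x-axis: theoretical, y-axis: data
        for i, x in enumerate(ordered, start=1)
    ]

qq = normal_qq_points([1, 2, 3, 4, 5])
```

If the points fall close to the line y = x, the data are roughly normal; systematic bending away from the line at the ends suggests skew or heavy tails.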
70
New cards
What does a Scatter plot provide?
A first look at bivariate data to see clusters of points, outliers, etc.
71
New cards
How is data plotted in a scatter plot?
Each pair of values is treated as coordinates and plotted as points in the plane
72
New cards
What is a Line chart?
Displays information as a series of data points called 'markers' connected by straight line segments
73
New cards
What is a Parallel Coordinates Plot?
A plot with one parallel vertical axis per attribute, where each object is drawn as a line that crosses each axis at the object's value for that attribute
74
New cards
What are 3 criteria to choose an EDA technique?
Type of analysis (comparison, relationship, composition, distribution, dispersion), Number of attributes (one, two, multiple), Types of attributes (categorical, numerical)
75
New cards
To visually explore Continuous vs Continuous variables, use:
Scatter plot
76
New cards
To visually explore Categorical vs Continuous variables, use:
Box plot or violin plot
77
New cards
To visually explore Continuous vs Categorical variables, use:
Bar chart or histogram grouped by category
78
New cards
To visually explore Categorical vs Categorical variables, use:
Stacked bar chart or mosaic plot