Comprehensive Questions: Data Mining

58 Terms

1

Compare Min-max vs Z-score normalization: when would each be preferred?

Min-max: when a specific bounded range like [0,1] is needed and there are no extreme outliers. Z-score: when the data is roughly normally distributed or the algorithm is sensitive to feature scale. Min-max puts every attribute on exactly the same bounded scale but handles outliers poorly; Z-score is more robust to outliers but does not produce a fixed range
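
A minimal sketch of both rescalings, assuming NumPy and a small made-up array (not data from the course):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # 100 acts as an outlier

# Min-max: squeezes everything into [0, 1]; the outlier compresses the rest near 0
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: centers at 0 with unit standard deviation; unbounded but less distorted
z_score = (x - x.mean()) / x.std()

print(min_max)
print(z_score)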

2

Why is f00 excluded from Jaccard Coefficient but included in SMC?

Jaccard is for asymmetric attributes where both being absent (0-0) doesn't indicate similarity (e.g., not buying same products). SMC is for symmetric attributes where matching absences do indicate similarity (e.g., both being non-smokers)
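
A small sketch computing both coefficients from two invented binary purchase vectors:

import numpy as np

x = np.array([1, 0, 0, 1, 0, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # both present
f00 = np.sum((x == 0) & (y == 0))   # both absent
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # counts shared absences
jaccard = f11 / (f11 + f10 + f01)             # ignores shared absences
print(smc, jaccard)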

3

Explain the relationship: Mean - Mode ≈ 3(Mean - Median)

In skewed distributions, this empirical formula shows the mean is pulled furthest toward the tail, the median sits in between, and the mode stays at the peak. The mean shifts roughly three times as far from the mode as the median does

4

What's the difference between supervised and unsupervised discretization?

Supervised uses class labels to guide discretization (optimizing for classification tasks using splitting/merging). Unsupervised ignores class labels (using equal width, equal frequency, or k-means). Supervised typically produces better bins for prediction tasks

5

Why divide sample variance by (n-1) instead of n?

Dividing by (n-1) corrects for bias in estimation. Sample tends to underestimate population variance because sample mean is used instead of true population mean. This is called Bessel's correction
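
A quick illustration using NumPy's ddof argument (the sample values are made up):

import numpy as np

sample = np.array([4.0, 7.0, 9.0, 10.0, 15.0])

biased = np.var(sample)            # divides by n
unbiased = np.var(sample, ddof=1)  # divides by n-1 (Bessel's correction)
print(biased, unbiased)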

6

When would you choose Cosine Similarity over Euclidean Distance?

When direction/orientation matters more than magnitude (e.g., document similarity, time series patterns). Cosine focuses on angle between vectors regardless of length, while Euclidean measures actual distance and is magnitude-sensitive
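
A sketch contrasting the two measures on vectors that point the same way but differ in length:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction, much larger magnitude

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: identical orientation
euclidean = np.linalg.norm(a - b)                          # large: magnitude dominates
print(cosine, euclidean)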

7

Why might two histograms have identical boxplots?

Boxplot only shows 5-number summary (min, Q1, median, Q3, max). Two distributions with same summary statistics but different shapes (e.g., uniform vs bimodal within same range) would have identical boxplots but different histograms

8

Explain why stratified sampling might be better than simple random sampling

Stratified ensures representation from all important subgroups by dividing population into strata and sampling each. Simple random might miss small but important groups. Stratified reduces sampling error and ensures proportional representation
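
A hedged sketch with pandas; the segment column and group sizes are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "segment": ["A"] * 90 + ["B"] * 10,   # B is a small but important subgroup
    "value": range(100),
})

# Simple random sample: may contain very few (or zero) B rows
srs = df.sample(n=10, random_state=0)

# Stratified sample: 10% from every segment, so B is guaranteed representation
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=0)
print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())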

9

What's the triangle inequality property and why is it important?

d(x,z) ≤ d(x,y) + d(y,z). It ensures the direct path is never longer than indirect path through third point. Important for proving a measure is a valid distance metric and for optimization algorithms that exploit this property

10

How does Label Encoding create problems for nominal attributes?

Assigns integers (0,1,2…) which implies order (France=0 < Spain=1). Algorithms might interpret this as Spain being "greater than" France, creating false ordinal relationships when categories are actually unordered
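
A small illustration with pandas (the country values are arbitrary):

import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain"])

# Label encoding: assigns arbitrary integers, which implies an order that doesn't exist
codes, labels = pd.factorize(countries)
print(codes)   # e.g. [0 1 2 1]

# One-hot encoding avoids the false ordering by giving each category its own column
print(pd.get_dummies(countries))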

11

Why is median preferred over mean for skewed distributions?

Median is resistant to extreme values because it only depends on middle position, not actual values. Mean is pulled toward outliers/tail because it uses all values in calculation. For skewed data, median better represents "typical" value
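
A tiny numeric illustration with made-up incomes:

import numpy as np

incomes = np.array([30_000, 32_000, 35_000, 38_000, 1_000_000])  # one extreme value

print(np.mean(incomes))    # 227,000: pulled toward the outlier
print(np.median(incomes))  # 35,000: still a "typical" value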

12

Explain the difference between noise and outliers

Noise: random error/distortion in measurements (e.g., sensor error, data entry mistake) - should be removed/smoothed. Outliers: legitimate extreme values that may be noise OR the focus of analysis (e.g., fraud detection). Context determines handling

13

Why might we intentionally add noise to data?

To prevent overfitting by forcing models to be more robust, improve generalization to real-world variations, and enhance model adaptability. Makes model less sensitive to specific training data peculiarities

14

What's the purpose of the reference line (y=x) in Q-Q plot?

Shows where points would fall if both distributions were identical. Deviations from this line indicate differences: points above line mean data distribution has higher values than theoretical, below means lower values
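
One common way to draw such a plot is SciPy's probplot, which overlays a fitted reference line; the right-skewed exponential sample here is just an illustrative assumption:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.default_rng(0).exponential(size=200)  # right-skewed sample

# Compare sample quantiles against a normal distribution; points above the line
# in the right tail indicate heavier-than-normal high values
stats.probplot(data, dist="norm", plot=plt)
plt.show()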

15

Why is IQR useful for outlier detection?

IQR (Q3-Q1) represents the middle 50% of data, resistant to extreme values. Points beyond 1.5×IQR from quartiles are statistical outliers. This method is robust because it's based on quartiles, not mean/std which outliers affect
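
A sketch of the 1.5×IQR rule on a made-up array:

import numpy as np

x = np.array([5, 7, 8, 9, 10, 11, 12, 40])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)   # 40 is flagged; the fences come from quartiles, so 40 barely moves them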

16

Explain why correlation doesn't imply causation

Correlation measures statistical association between variables, not cause-and-effect. Both variables might be caused by third factor (confounding variable), relationship might be coincidental, or causation might be reverse of what's assumed

17

When would you use Manhattan distance instead of Euclidean?

When movement is restricted to grid-like paths (city blocks), when dealing with high-dimensional data where Euclidean can be misleading, or when want distance less sensitive to outliers in individual dimensions

18

Why does One-Hot Encoding increase dimensionality?

Creates separate binary column for each category. If attribute has k categories, creates k columns instead of 1. With many categories (e.g., 1000 cities), dramatically increases feature space, which can cause computational issues
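
A small pandas illustration (city names invented):

import pandas as pd

cities = pd.Series(["Paris", "Lyon", "Nice", "Paris", "Lille"])

encoded = pd.get_dummies(cities, prefix="city")
print(encoded.shape)   # (5, 4): one original column became 4 binary columns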

19

What's the relationship between covariance and correlation?

Correlation is standardized covariance: ρ = cov/(σ1×σ2). Covariance shows direction of relationship but sensitive to scale. Correlation normalizes to [-1,1] range, making it scale-independent and comparable across different variable pairs
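
A quick numeric check of the relationship (values are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

cov = np.cov(x, y)[0, 1]                               # scale-dependent
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))   # standardized to [-1, 1]
print(cov, corr, np.corrcoef(x, y)[0, 1])              # the last two should match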

20

Why might you keep duplicate data instead of removing it?

When duplicates represent legitimate repeated events (e.g., customer with multiple accounts, repeated purchases, multiple measurements). Removing these would lose important information about frequency or intensity of events

21

Explain the difference between Equal Width and Equal Frequency discretization

Equal Width: divides range into same-size intervals (e.g., 0-10, 10-20, 20-30). Simple but may create empty/overfull bins. Equal Frequency: ensures same number of points per bin. Better balanced but intervals have different widths
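
A sketch contrasting the two with pandas cut and qcut on invented values:

import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30, 60])

equal_width = pd.cut(values, bins=3)   # same-size intervals; most points land in one bin
equal_freq = pd.qcut(values, q=3)      # roughly the same count per bin; widths differ
print(equal_width.value_counts())
print(equal_freq.value_counts())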

22

Why is standardization necessary for Euclidean Distance?

Attributes with larger scales dominate distance calculation. E.g., salary (0-100,000) vs age (0-100) - salary differences would overwhelm age differences. Standardization ensures each attribute contributes proportionally

23

What's the difference between clustering for understanding vs summarization?

Understanding: discover natural groupings to gain insights (customer segments, gene functions). Summarization: reduce data size by representing groups with centroids/representatives for efficient processing while preserving key information

24

Why are asymmetric attributes important in market basket analysis?

In retail, we care about what customers bought (presence), not what they didn't buy (absence). Two customers having thousands of non-purchased items in common doesn't make them similar - shared purchases do

25

Explain why high dimensionality presents challenges

Curse of dimensionality: as dimensions increase, data becomes sparse, distances become less meaningful, computational cost explodes, visualization impossible, more data needed to maintain statistical significance, many algorithms break down

26

What's the difference between data integration and data transformation?

Integration: combining data from multiple sources into unified view (different databases, formats, schemas). Transformation: converting data format for analysis (normalization, encoding, discretization, sampling) within same/integrated dataset

27

Why might regression be preferred over classification?

When target variable is continuous rather than categorical, when need specific numeric predictions (exact price, temperature), when want to capture subtle gradations rather than discrete classes, when probability estimates aren't sufficient

28

How does trimmed mean differ from regular mean and why use it?

Trimmed mean removes extreme values (e.g., top/bottom 5%) before calculating. Used when want average resistant to outliers but more data-driven than median. Common in Olympics scoring to eliminate biased judges
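
SciPy provides a trimmed mean; the judge-style scores below are made up:

import numpy as np
from scipy import stats

scores = np.array([5.0, 9.2, 9.4, 9.5, 9.6, 9.7, 10.0])

print(np.mean(scores))                # dragged down by the 5.0
print(stats.trim_mean(scores, 0.15))  # drops ~15% from each end before averaging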

29

Why is KDD broader than Data Mining?

KDD (Knowledge Discovery in Databases) is the entire process: understanding domain, data selection, preprocessing, transformation, data mining, interpretation, evaluation. Data Mining is just the core step of applying algorithms to extract patterns

30

Explain the difference between stratified and cluster sampling

Stratified: divide into homogeneous groups, sample from ALL groups (ensures representation). Cluster: divide into clusters, randomly select SOME clusters completely (often geographical, more practical but less precise)

31

Why might you choose weighted arithmetic mean?

When values have different importance/frequency. E.g., calculating course grade where exam=50%, homework=30%, participation=20%. Gives more weight to more important values rather than treating all equally
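
A one-line illustration with NumPy, using the grade weights from the card (scores invented):

import numpy as np

scores = np.array([85, 92, 70])         # exam, homework, participation
weights = np.array([0.5, 0.3, 0.2])

print(np.average(scores, weights=weights))  # 84.1, vs. a plain mean of ~82.3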

32

What's the relationship between variance and standard deviation, and why have both?

Standard deviation = √variance. Variance is fundamental mathematical property but in squared units (dollars²). Standard deviation in same units as data (dollars), more interpretable. Both used: variance for theory, std dev for interpretation

33

Why is sample size important when choosing between deleting records vs imputation?

With large sample size, deleting few records with missing values has minimal impact on statistical power. With small sample, each record precious - imputation preserves information and maintains statistical validity

34

Explain why z-score normalization centers data at 0 with std dev 1

Formula: z = (x - μ)/σ. Subtracting mean (μ) shifts center to 0. Dividing by std dev (σ) scales spread to 1. This standardizes all variables to same scale, making them comparable regardless of original units

35

How does bottom-up discretization differ from top-down?

Bottom-up: starts with each value in own bin, merges similar bins (agglomerative). Top-down: starts with all values in one bin, splits to separate classes (divisive). Bottom-up maximizes purity; top-down maximizes separability

36

Why is parallel coordinates plot useful for multivariate data?

Maps each object as a line across multiple vertical axes (one per attribute). Can visualize many dimensions simultaneously, identify patterns/correlations between attributes, spot outliers. Better than multiple 2D plots for seeing relationships

37

What's the difference between descriptive and predictive data mining?

Descriptive: summarizes/describes patterns in existing data (clustering finds groups, association rules find relationships). Predictive: uses patterns to predict future/unknown values (classification predicts categories, regression predicts numbers)

38

Why might anomaly detection be both descriptive and predictive?

Descriptive: describes which current observations are anomalous based on patterns. Predictive: once anomaly patterns learned, can predict whether new observations will be anomalous. Depends on whether analyzing past data or forecasting future

39

Explain why non-traditional analysis methods are needed for modern data

Traditional statistics assumes small, clean, normally distributed data with independence. Modern data is massive scale, high-dimensional, heterogeneous, noisy, correlated, distributed. Traditional methods computationally infeasible or assumptions violated

40

Why is data ownership/distribution a challenge in data mining?

Data owned by different organizations (privacy/legal issues), geographically distributed (communication costs), different formats/standards, need to mine without centralizing (security), must consolidate results without sharing raw data (federated learning)

41

How does the choice of r parameter affect Minkowski distance?

r=1 (Manhattan): grid-like, sum of absolute differences, robust to outliers. r=2 (Euclidean): straight-line, most common, sensitive to outliers. r→∞ (Chebyshev): maximum difference in any dimension, extreme outlier-sensitive
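
A minimal sketch of a Minkowski distance with a tunable r (the helper function is illustrative, not a named library routine):

import numpy as np

def minkowski(a, b, r):
    """Minkowski distance: r=1 Manhattan, r=2 Euclidean, r=np.inf Chebyshev."""
    diff = np.abs(a - b)
    if np.isinf(r):
        return diff.max()
    return (diff ** r).sum() ** (1 / r)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), minkowski(a, b, np.inf))  # 7.0, 5.0, 4.0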

42

Why use scatter plot for continuous vs continuous variables?

Both axes can represent continuous scales, each point shows exact (x,y) pair, reveals relationships (linear, nonlinear, clusters), shows correlation direction and strength visually, identifies outliers

43

Why use box plot for categorical vs continuous variables?

Categorical variable groups data into categories (x-axis), continuous variable shown as distribution (y-axis). Can compare distributions across categories, see medians, spreads, outliers for each group side-by-side

44

What makes association rules different from classification?

Association: finds co-occurrence patterns (IF milk THEN bread), no designated target variable, finds all interesting rules. Classification: predicts specific target variable, supervised learning with labeled training data, one directional prediction

45

Why is binning useful for handling noise?

Groups values into bins, replaces with bin mean/median/boundary. Smooths random fluctuations because averaging reduces noise impact. Creates more robust features less sensitive to measurement errors or minor variations
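
A sketch of smoothing by bin means with pandas (the values and bin count are invented for illustration):

import pandas as pd

noisy = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(noisy, q=3)                                        # equal-frequency bins
smoothed = noisy.groupby(bins, observed=True).transform("mean")   # replace each value by its bin mean
print(smoothed.tolist())   # starts [9.0, 9.0, 9.0, 9.0, 22.75, ...]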

46

Explain the difference between similarity matrix and distance matrix

Similarity: higher values = more similar, diagonal often 1 (max similarity with self), range typically [0,1]. Distance: lower values = more similar, diagonal always 0 (no distance to self), range [0,∞). Both symmetric for most measures

47

Why might Chebyshev distance be preferred in some applications?

When only maximum difference matters (e.g., king in chess moves one square in any direction - limited by max coordinate difference). In control systems where worst-case dimension determines feasibility. For min-max optimization problems

48

What's the purpose of the cumulative frequency B in grouped median calculation?

Counts how many values fall BEFORE the median bin. Needed to determine where in the median bin the actual median falls. If B=4 and n=13, median position is 6.5, so it's (6.5-4)/G into the median bin
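
A worked version of the card's numbers, with an assumed median bin of [20, 30) holding G=5 values:

# Grouped-median sketch with made-up bin boundaries: n=13 values in total,
# B=4 values fall before the median bin, G=5 values fall inside it
n, B, G = 13, 4, 5
L, width = 20, 10   # lower boundary and width of the median bin

median = L + ((n / 2 - B) / G) * width
print(median)       # 20 + ((6.5 - 4) / 5) * 10 = 25.0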

49

Why does cosine similarity range from -1 to 1 instead of 0 to 1?

Measures angle between vectors. 0° (same direction)=1, 90° (perpendicular)=0, 180° (opposite direction)=-1. Negative values indicate opposite orientations, important for detecting negative correlation or opposing patterns

50

How does web data differ from traditional business data in ways that affect mining?

Web data: unstructured/semi-structured (text, images), massive scale (petabytes), rapidly changing, noisy/unreliable, heterogeneous sources, sparse (most connections don't exist), requires real-time processing. Traditional: structured, smaller, static, clean, homogeneous

51

Why is the identity property (s(x,y)=1 only if x=y) noted as "may not always hold"?

Some similarity measures reach maximum for non-identical objects. E.g., cosine similarity of [1,0] and [2,0] is 1 (same direction) though vectors differ. Property holds for most measures but not universally

52

What makes market basket analysis particularly suitable for association rules?

Natural asymmetric binary data (bought/didn't buy), focus on co-purchased items, thousands of products make simple analysis impossible, need automatic pattern discovery, actionable for business (product placement, promotions, recommendations)

53

Why is "gather whatever data you can" now a mantra?

Storage cheap, don't know future uses, data value often unexpected, might enable unforeseen applications, competitive advantage, harder to collect later than store now, advanced analytics can find hidden value

54

How does feature selection relate to the curse of dimensionality?

With many features, need exponentially more data for reliable patterns. Feature selection removes irrelevant/redundant attributes, reduces dimensionality, improves model performance, decreases computation, prevents overfitting, makes models interpretable

55

Why might classification and clustering be combined?

Clustering first: discover natural groups, then classify within groups (simpler sub-problems). Classification first: classify known data, cluster misclassified points (find new patterns). Semi-supervised: use small labeled set + large unlabeled set

56

What's the relationship between data preprocessing quality and model performance?

Garbage in, garbage out. Poor preprocessing (duplicates, missing values, noise, wrong scales) causes models to learn wrong patterns, overfit noise, miss important signals, make biased predictions. Quality preprocessing often more impactful than algorithm choice

57

Why is representative sampling important and what causes sampling bias?

Sample must represent population characteristics proportionally. Bias from: convenience sampling (easy-to-reach people), self-selection (volunteers differ from non-volunteers), systematic patterns (sampling every 10th fails if data has period-10 pattern)

58

Explain why equal-frequency discretization might be preferred for imbalanced data

Equal-width might create many empty bins or bins with few points in sparse regions, many points in dense regions. Equal-frequency ensures each bin has sufficient data for statistical reliability, especially important when data concentrated in small ra