Compare Min-max vs Z-score normalization: when would each be preferred?
Min-max: when a specific bounded range such as [0,1] is needed and the data has no significant outliers. Z-score: when data is roughly normally distributed or the algorithm expects standardized (mean 0, std 1) inputs. Min-max maps everything onto the exact same range but is distorted by outliers; Z-score is more robust to outliers but doesn't produce a fixed bounded range
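A minimal sketch (assuming NumPy; the values are made up) showing how an outlier affects each scaling:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 15.0, 200.0])  # last value is an outlier

# Min-max: maps to [0, 1]; the outlier squashes the other values near 0
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: centers at 0 and scales by the std; non-outliers keep more of their spread
z_score = (x - x.mean()) / x.std()

print(min_max)
print(z_score)
```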
Why is f00 excluded from Jaccard Coefficient but included in SMC?
Jaccard is for asymmetric attributes where both being absent (0-0) doesn't indicate similarity (e.g., not buying same products). SMC is for symmetric attributes where matching absences do indicate similarity (e.g., both being non-smokers)
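A short illustration (hypothetical binary purchase vectors, assuming NumPy) of why the many 0-0 matches inflate SMC but not Jaccard:

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])  # items bought by customer x
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])  # items bought by customer y

f11 = np.sum((x == 1) & (y == 1))  # both present
f00 = np.sum((x == 0) & (y == 0))  # both absent
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f11 + f01 + f10 + f00)  # counts shared absences
jaccard = f11 / (f11 + f01 + f10)            # ignores shared absences

print(smc, jaccard)  # 0.9 vs ~0.67
```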
Explain the relationship: Mean - Mode ≈ 3(Mean - Median)
In skewed distributions, this empirical formula shows the mean is pulled furthest toward the tail, the median sits in between, and the mode stays at the peak. The distance from the mode to the mean is roughly three times the distance from the median to the mean
What's the difference between supervised and unsupervised discretization?
Supervised uses class labels to guide discretization (optimizing for classification tasks using splitting/merging). Unsupervised ignores class labels (using equal width, equal frequency, or k-means). Supervised typically produces better bins for prediction tasks
Why divide sample variance by (n-1) instead of n?
Dividing by (n-1) corrects for bias in estimation. Sample tends to underestimate population variance because sample mean is used instead of true population mean. This is called Bessel's correction
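A quick check (assuming NumPy; the sample values are made up) of the two divisors via the ddof parameter:

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

biased = np.var(sample, ddof=0)     # divides by n
unbiased = np.var(sample, ddof=1)   # divides by n-1 (Bessel's correction)

print(biased, unbiased)  # the (n-1) version is slightly larger
```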
When would you choose Cosine Similarity over Euclidean Distance?
When direction/orientation matters more than magnitude (e.g., document similarity, time series patterns). Cosine focuses on angle between vectors regardless of length, while Euclidean measures actual distance and is magnitude-sensitive
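A toy document-vector example (assuming NumPy; the counts are hypothetical): cosine ignores the length difference, Euclidean does not:

```python
import numpy as np

doc_a = np.array([3.0, 0.0, 1.0])    # word counts in a short document
doc_b = np.array([30.0, 0.0, 10.0])  # same word proportions, 10x longer

cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
euclidean = np.linalg.norm(doc_a - doc_b)

print(cosine)     # 1.0 — identical direction
print(euclidean)  # large, driven purely by magnitude
```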
Why might two histograms have identical boxplots?
Boxplot only shows 5-number summary (min, Q1, median, Q3, max). Two distributions with same summary statistics but different shapes (e.g., uniform vs bimodal within same range) would have identical boxplots but different histograms
Explain why stratified sampling might be better than simple random sampling
Stratified ensures representation from all important subgroups by dividing population into strata and sampling each. Simple random might miss small but important groups. Stratified reduces sampling error and ensures proportional representation
What's the triangle inequality property and why is it important?
d(x,z) ≤ d(x,y) + d(y,z). It ensures the direct path is never longer than indirect path through third point. Important for proving a measure is a valid distance metric and for optimization algorithms that exploit this property
How does Label Encoding create problems for nominal attributes?
Assigns integers (0,1,2…) which implies order (France=0 < Spain=1). Algorithms might interpret this as Spain being "greater than" France, creating false ordinal relationships when categories are actually unordered
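A small sketch (assuming pandas; the country values are illustrative) of the implied order label encoding creates:

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain"])

# Label encoding: each category becomes an arbitrary integer (alphabetical here)
codes = countries.astype("category").cat.codes
print(codes.tolist())  # [0, 2, 1, 2] — "Spain > Germany > France" is meaningless
```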
Why is median preferred over mean for skewed distributions?
Median is resistant to extreme values because it only depends on middle position, not actual values. Mean is pulled toward outliers/tail because it uses all values in calculation. For skewed data, median better represents "typical" value
Explain the difference between noise and outliers
Noise: random error/distortion in measurements (e.g., sensor error, data entry mistake) - should be removed/smoothed. Outliers: extreme values that may turn out to be noise OR legitimate observations that are the focus of analysis (e.g., fraud detection). Context determines handling
Why might we intentionally add noise to data?
To prevent overfitting by forcing models to be more robust, improve generalization to real-world variations, and enhance model adaptability. Makes model less sensitive to specific training data peculiarities
What's the purpose of the reference line (y=x) in Q-Q plot?
Shows where points would fall if both distributions were identical. Deviations from this line indicate differences: points above line mean data distribution has higher values than theoretical, below means lower values
Why is IQR useful for outlier detection?
IQR (Q3-Q1) represents the middle 50% of data, resistant to extreme values. Points beyond 1.5×IQR from quartiles are statistical outliers. This method is robust because it's based on quartiles, not mean/std which outliers affect
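A minimal example (assuming NumPy; the data is made up) of the 1.5×IQR rule:

```python
import numpy as np

data = np.array([12, 13, 13, 14, 15, 15, 16, 17, 18, 95])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(data[(data < lower) | (data > upper)])  # [95]
```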
Explain why correlation doesn't imply causation
Correlation measures statistical association between variables, not cause-and-effect. Both variables might be caused by third factor (confounding variable), relationship might be coincidental, or causation might be reverse of what's assumed
When would you use Manhattan distance instead of Euclidean?
When movement is restricted to grid-like paths (city blocks), when dealing with high-dimensional data where Euclidean can be misleading, or when want distance less sensitive to outliers in individual dimensions
Why does One-Hot Encoding increase dimensionality?
Creates separate binary column for each category. If attribute has k categories, creates k columns instead of 1. With many categories (e.g., 1000 cities), dramatically increases feature space, which can cause computational issues
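A quick illustration (assuming pandas; the city names are hypothetical) of the column blow-up:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Madrid", "Berlin", "Paris"]})

one_hot = pd.get_dummies(df, columns=["city"])
print(one_hot.shape)             # (4, 3) — one binary column per distinct city
print(one_hot.columns.tolist())  # ['city_Berlin', 'city_Madrid', 'city_Paris']
```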
What's the relationship between covariance and correlation?
Correlation is standardized covariance: ρ = cov/(σ1×σ2). Covariance shows direction of relationship but sensitive to scale. Correlation normalizes to [-1,1] range, making it scale-independent and comparable across different variable pairs
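A sketch (assuming NumPy; the values are made up) showing that correlation is covariance rescaled by the two standard deviations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.cov(x, y, ddof=1)[0, 1]
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov, corr)
print(np.corrcoef(x, y)[0, 1])  # same correlation computed directly
```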
Why might you keep duplicate data instead of removing it?
When duplicates represent legitimate repeated events (e.g., customer with multiple accounts, repeated purchases, multiple measurements). Removing these would lose important information about frequency or intensity of events
Explain the difference between Equal Width and Equal Frequency discretization
Equal Width: divides range into same-size intervals (e.g., 0-10, 10-20, 20-30). Simple but may create empty/overfull bins. Equal Frequency: ensures same number of points per bin. Better balanced but intervals have different widths
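A minimal comparison (assuming pandas/NumPy; the data is synthetic) using pd.cut for equal width and pd.qcut for equal frequency:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([rng.normal(20, 2, 95), [80, 85, 90, 95, 99]]))

equal_width = pd.cut(values, bins=4)   # same-size intervals; tail bins nearly empty
equal_freq = pd.qcut(values, q=4)      # same number of points in each bin

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```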
Why is standardization necessary for Euclidean Distance?
Attributes with larger scales dominate distance calculation. E.g., salary (0-100,000) vs age (0-100) - salary differences would overwhelm age differences. Standardization ensures each attribute contributes proportionally
What's the difference between clustering for understanding vs summarization?
Understanding: discover natural groupings to gain insights (customer segments, gene functions). Summarization: reduce data size by representing groups with centroids/representatives for efficient processing while preserving key information
Why are asymmetric attributes important in market basket analysis?
In retail, we care about what customers bought (presence), not what they didn't buy (absence). Two customers having thousands of non-purchased items in common doesn't make them similar - shared purchases do
Explain why high dimensionality presents challenges
Curse of dimensionality: as dimensions increase, data becomes sparse, distances become less meaningful, computational cost explodes, visualization impossible, more data needed to maintain statistical significance, many algorithms break down
What's the difference between data integration and data transformation?
Integration: combining data from multiple sources into unified view (different databases, formats, schemas). Transformation: converting data format for analysis (normalization, encoding, discretization, sampling) within same/integrated dataset
Why might regression be preferred over classification?
When target variable is continuous rather than categorical, when need specific numeric predictions (exact price, temperature), when want to capture subtle gradations rather than discrete classes, when probability estimates aren't sufficient
How does trimmed mean differ from regular mean and why use it?
Trimmed mean removes extreme values (e.g., top/bottom 5%) before calculating. Used when want average resistant to outliers but more data-driven than median. Common in Olympics scoring to eliminate biased judges
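A small sketch (assuming NumPy/SciPy; the scores are invented) comparing the three averages:

```python
import numpy as np
from scipy import stats

scores = np.array([9.2, 9.4, 9.5, 9.5, 9.6, 9.7, 2.0])  # one implausibly low score

print(scores.mean())                  # pulled down by the 2.0
print(stats.trim_mean(scores, 0.15))  # drops ~15% from each end before averaging
print(np.median(scores))
```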
Why is KDD broader than Data Mining?
KDD (Knowledge Discovery in Databases) is the entire process: understanding domain, data selection, preprocessing, transformation, data mining, interpretation, evaluation. Data Mining is just the core step of applying algorithms to extract patterns
Explain the difference between stratified and cluster sampling
Stratified: divide into homogeneous groups, sample from ALL groups (ensures representation). Cluster: divide into clusters, randomly select SOME clusters completely (often geographical, more practical but less precise)
Why might you choose weighted arithmetic mean?
When values have different importance/frequency. E.g., calculating course grade where exam=50%, homework=30%, participation=20%. Gives more weight to more important values rather than treating all equally
What's the relationship between variance and standard deviation, and why have both?
Standard deviation = √variance. Variance is fundamental mathematical property but in squared units (dollars²). Standard deviation in same units as data (dollars), more interpretable. Both used: variance for theory, std dev for interpretation
Why is sample size important when choosing between deleting records vs imputation?
With large sample size, deleting few records with missing values has minimal impact on statistical power. With small sample, each record precious - imputation preserves information and maintains statistical validity
Explain why z-score normalization centers data at 0 with std dev 1
Formula: z = (x - μ)/σ. Subtracting mean (μ) shifts center to 0. Dividing by std dev (σ) scales spread to 1. This standardizes all variables to same scale, making them comparable regardless of original units
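A quick verification (assuming NumPy; the heights are hypothetical) that z-scored data ends up with mean 0 and std 1:

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # e.g. heights in cm

z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # ~0.0 and 1.0, regardless of the original units
```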
How does bottom-up discretization differ from top-down?
Bottom-up: starts with each value in own bin, merges similar bins (agglomerative). Top-down: starts with all values in one bin, splits to separate classes (divisive). Bottom-up maximizes purity; top-down maximizes separability
Why is parallel coordinates plot useful for multivariate data?
Maps each object as a line across multiple vertical axes (one per attribute). Can visualize many dimensions simultaneously, identify patterns/correlations between attributes, spot outliers. Better than multiple 2D plots for seeing relationships
What's the difference between descriptive and predictive data mining?
Descriptive: summarizes/describes patterns in existing data (clustering finds groups, association rules find relationships). Predictive: uses patterns to predict future/unknown values (classification predicts categories, regression predicts numbers)
Why might anomaly detection be both descriptive and predictive?
Descriptive: describes which current observations are anomalous based on patterns. Predictive: once anomaly patterns learned, can predict whether new observations will be anomalous. Depends on whether analyzing past data or forecasting future
Explain why non-traditional analysis methods are needed for modern data
Traditional statistics assumes small, clean, normally distributed data with independence. Modern data is massive scale, high-dimensional, heterogeneous, noisy, correlated, distributed. Traditional methods computationally infeasible or assumptions violated
Why is data ownership/distribution a challenge in data mining?
Data owned by different organizations (privacy/legal issues), geographically distributed (communication costs), different formats/standards, need to mine without centralizing (security), must consolidate results without sharing raw data (federated learning)
How does the choice of r parameter affect Minkowski distance?
r=1 (Manhattan): grid-like, sum of absolute differences, less dominated by a single large difference. r=2 (Euclidean): straight-line, most common, squares differences so large ones dominate. r→∞ (Chebyshev/supremum): only the maximum difference in any one dimension matters
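A minimal check of the three cases (assuming SciPy; the points are arbitrary):

```python
from scipy.spatial import distance

a, b = [0, 0], [3, 4]

print(distance.minkowski(a, b, p=1))  # r=1, Manhattan: 7
print(distance.minkowski(a, b, p=2))  # r=2, Euclidean: 5
print(distance.chebyshev(a, b))       # r→∞ (Chebyshev): max coordinate difference, 4
```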
Why use scatter plot for continuous vs continuous variables?
Both axes can represent continuous scales, each point shows exact (x,y) pair, reveals relationships (linear, nonlinear, clusters), shows correlation direction and strength visually, identifies outliers
Why use box plot for categorical vs continuous variables?
Categorical variable groups data into categories (x-axis), continuous variable shown as distribution (y-axis). Can compare distributions across categories, see medians, spreads, outliers for each group side-by-side
What makes association rules different from classification?
Association: finds co-occurrence patterns (IF milk THEN bread), no designated target variable, finds all interesting rules. Classification: predicts specific target variable, supervised learning with labeled training data, one directional prediction
Why is binning useful for handling noise?
Groups values into bins, replaces with bin mean/median/boundary. Smooths random fluctuations because averaging reduces noise impact. Creates more robust features less sensitive to measurement errors or minor variations
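A short sketch (assuming pandas; the values are made up in the usual textbook style) of smoothing by bin means:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Three equal-frequency bins, then replace each value by its bin's mean
bins = pd.qcut(values, q=3)
smoothed = values.groupby(bins, observed=True).transform("mean")

print(smoothed.tolist())  # fluctuations within each bin are averaged away
```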
Explain the difference between similarity matrix and distance matrix
Similarity: higher values = more similar, diagonal often 1 (max similarity with self), range typically [0,1]. Distance: lower values = more similar, diagonal always 0 (no distance to self), range [0,∞). Both symmetric for most measures
Why might Chebyshev distance be preferred in some applications?
When only maximum difference matters (e.g., king in chess moves one square in any direction - limited by max coordinate difference). In control systems where worst-case dimension determines feasibility. For min-max optimization problems
What's the purpose of the cumulative frequency B in grouped median calculation?
Counts how many values fall BEFORE the median bin. Needed to determine where in the median bin the actual median falls. If B=4 and n=13, the median position is n/2 = 6.5, so the median lies (6.5-4)/G of the way through the median bin (multiply that fraction by the bin width and add it to the bin's lower boundary)
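A worked example under assumed grouped data (the bin boundaries and frequencies below are hypothetical, chosen to match B=4 and n=13):

```python
# Hypothetical grouped data: bins [0-10), [10-20), [20-30) with frequencies 4, 6, 3
n = 13   # total count
L = 10   # lower boundary of the bin containing the median
B = 4    # cumulative frequency of all bins before it
G = 6    # frequency of the median bin
w = 10   # bin width

median = L + ((n / 2 - B) / G) * w
print(median)  # 10 + (6.5 - 4) / 6 * 10 ≈ 14.17
```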
Why does cosine similarity range from -1 to 1 instead of 0 to 1?
Measures angle between vectors. 0° (same direction)=1, 90° (perpendicular)=0, 180° (opposite direction)=-1. Negative values indicate opposite orientations, important for detecting negative correlation or opposing patterns
How does web data differ from traditional business data in ways that affect mining?
Web data: unstructured/semi-structured (text, images), massive scale (petabytes), rapidly changing, noisy/unreliable, heterogeneous sources, sparse (most connections don't exist), requires real-time processing. Traditional: structured, smaller, static, clean, homogeneous
Why is the identity property (s(x,y)=1 only if x=y) noted as "may not always hold"?
Some similarity measures reach maximum for non-identical objects. E.g., cosine similarity of [1,0] and [2,0] is 1 (same direction) though vectors differ. Property holds for most measures but not universally
What makes market basket analysis particularly suitable for association rules?
Natural asymmetric binary data (bought/didn't buy), focus on co-purchased items, thousands of products make simple analysis impossible, need automatic pattern discovery, actionable for business (product placement, promotions, recommendations)
Why is "gather whatever data you can" now a mantra?
Storage cheap, don't know future uses, data value often unexpected, might enable unforeseen applications, competitive advantage, harder to collect later than store now, advanced analytics can find hidden value
How does feature selection relate to the curse of dimensionality?
With many features, need exponentially more data for reliable patterns. Feature selection removes irrelevant/redundant attributes, reduces dimensionality, improves model performance, decreases computation, prevents overfitting, makes models interpretable
Why might classification and clustering be combined?
Clustering first: discover natural groups, then classify within groups (simpler sub-problems). Classification first: classify known data, cluster misclassified points (find new patterns). Semi-supervised: use small labeled set + large unlabeled set
What's the relationship between data preprocessing quality and model performance?
Garbage in, garbage out. Poor preprocessing (duplicates, missing values, noise, wrong scales) causes models to learn wrong patterns, overfit noise, miss important signals, make biased predictions. Quality preprocessing often more impactful than algorithm choice
Why is representative sampling important and what causes sampling bias?
Sample must represent population characteristics proportionally. Bias from: convenience sampling (easy-to-reach people), self-selection (volunteers differ from non-volunteers), systematic patterns (sampling every 10th fails if data has period-10 pattern)
Explain why equal-frequency discretization might be preferred for imbalanced data
Equal-width might create many empty bins or bins with few points in sparse regions, many points in dense regions. Equal-frequency ensures each bin has sufficient data for statistical reliability, especially important when data is concentrated in a small range