Supervised vs Unsupervised Learning
Supervised uses labeled data for prediction; unsupervised finds patterns in unlabeled data.
Examples of Supervised Learning
Fraud detection, email spam classification — clear input-output pairs.
Examples of Unsupervised Learning
Customer segmentation, anomaly detection — no predefined labels.
Merging vs Concatenating in pandas
Merging joins DataFrames on keys; concatenating stacks them by axis.
When to use merge in pandas
When combining data from two sources with a common column (e.g., ID).
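A minimal sketch of the difference, assuming two small made-up DataFrames that share an `id` column (the frame names and values are illustrative only):

```python
import pandas as pd

# Hypothetical example data: two sources sharing an "id" column.
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [20.0, 35.5, 12.0]})

# merge: join rows on a shared key (inner join by default),
# so customer 2, who has no orders, is dropped.
merged = pd.merge(customers, orders, on="id")

# concat: stack frames along an axis without matching on keys.
more_customers = pd.DataFrame({"id": [4], "name": ["Di"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(merged)
print(stacked)
```

Passing `how="left"` to `merge` would keep customers without orders, with `NaN` in `amount`.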
ANOVA Hypotheses
Null: all group means are equal; Alternative: at least one differs.
When to use ANOVA vs t-test
Use ANOVA for 3+ groups; t-test for 2 groups.
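A sketch of the one-way ANOVA F statistic for three groups, written in plain Python with made-up numbers. The function name `one_way_anova_f` is hypothetical. Turning F into a p-value needs the F distribution, which is not shown here; `scipy.stats.f_oneway` is one library routine that reports both.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical data: the third group has a clearly higher mean.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)  # about 13.0
```

A large F means the group means differ by more than the within-group spread would suggest, which is evidence against the null hypothesis.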
What is an Outlier
A value far from other data points that can distort model accuracy.
Effects of Outliers on Models
Can skew regression lines, increase error, or influence clustering.
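To illustrate the skew on a regression line, here is a small sketch with made-up points: the same least-squares fit is run with and without a single outlier. The helper `ols_slope` is a hypothetical name.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5]
clean_ys = [2, 4, 6, 8, 10]     # exactly y = 2x
outlier_ys = [2, 4, 6, 8, 40]   # last point replaced by an outlier

slope_clean = ols_slope(xs, clean_ys)      # 2.0
slope_outlier = ols_slope(xs, outlier_ys)  # 8.0
print(slope_clean, slope_outlier)
```

One extreme point out of five is enough to quadruple the fitted slope in this toy data.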
SQL vs Text File
SQL databases support indexed lookups, joins, and concurrent access on large structured data; a text file must be read and parsed in full for each query.
SQLite
Lightweight SQL database stored in a single file; great for local analysis.
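A short sketch using Python's built-in `sqlite3` module. The table name `sales` and its rows are made up for illustration.

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file path instead
# to store the whole database as a single file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 25.0), ("north", 5.0)],
)

# Aggregate per region with plain SQL.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)  # [('north', 15.0), ('south', 25.0)]
```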
Categorical Variable
Variable whose values are categories or labels rather than measured quantities (e.g., color, gender, job title).
Encoding Categorical Variables
Convert categories to numbers with one-hot encoding (unordered categories) or label encoding (ordered categories) so regression/classification models can use them.
Overfitting
Model memorizes training data but fails to generalize; common with small datasets.
Preventing Overfitting
Use cross-validation to detect it; reduce it with regularization, pruning (decision trees), or dropout (neural networks).
Principal Component Analysis (PCA)
Reduces dimensionality while preserving variance via uncorrelated components.
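One way to sketch PCA is through the singular value decomposition of the centered data, shown below with NumPy. The data is synthetic and the variable names are illustrative; library implementations may differ in details such as scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 3-D, where the second feature
# is almost a scaled copy of the first.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Center each feature, then take the SVD of the centered matrix.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Rows of vt are the principal directions; squared singular values
# are proportional to the variance along each direction.
explained_variance_ratio = s**2 / np.sum(s**2)

# Reduce from 3 dimensions to 2 by projecting onto the top two directions.
projected = centered @ vt[:2].T

print(explained_variance_ratio)
print(np.cov(projected.T))  # off-diagonal entries are ~0: uncorrelated
```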
Randomization in Ensembles
Introduces diversity in training data or features to reduce overfitting.
Random Forest
Ensemble of decision trees trained on bootstrapped data with random feature subsets.
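The two sources of randomness can be sketched without any tree code at all. The snippet below only draws the inputs each tree would see; the row indices and feature names are made up. As a simplification it picks one feature subset per tree, whereas common implementations typically draw a fresh subset at each split.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features without replacement."""
    return rng.sample(feature_names, k)

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training rows
features = ["age", "income", "tenure", "region"]  # hypothetical names

# One (bootstrap sample, feature subset) pair per tree in a 3-tree forest.
tree_inputs = [
    (bootstrap_sample(rows, rng), random_feature_subset(features, 2, rng))
    for _ in range(3)
]

for sample, subset in tree_inputs:
    print(sample, subset)
```

Because each tree sees different rows and features, the trees make different errors, and averaging their predictions tends to cancel some of them out.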
Classification vs Regression
Classification = categories; Regression = continuous values.
Time Series Components
Trend, seasonal, cyclical, and random (irregular) components.
Autocorrelation
Correlation of a time series with a lagged version of itself.
Importance of Autocorrelation
Helps identify patterns and predict future values in time series.
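A plain-Python sketch of the sample autocorrelation at a given lag, using a made-up series that repeats every 4 steps. The function name `autocorrelation` is hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var

# Hypothetical series with a period of 4.
series = [1, 2, 3, 2] * 6

r4 = autocorrelation(series, 4)  # lag equals the period: strongly positive
r2 = autocorrelation(series, 2)  # half a period: strongly negative
print(r4, r2)
```

A high value at lag 4 is the kind of pattern that points to a repeating cycle of that length.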
R² (coefficient of determination)
Proportion of the variance in the response variable that the regression model explains; 1 is a perfect fit, 0 is no better than always predicting the mean.
Use of R²
Compares fit across regression models of the same response variable; prefer adjusted R² when the models have different numbers of predictors.
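R² can be computed directly from its usual definition, 1 - SS_res / SS_tot. The sketch below uses made-up observations and predictions.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]
good_fit = [3.1, 4.9, 7.2, 8.8]  # hypothetical model predictions
mean_only = [6, 6, 6, 6]         # always predicts the mean of y_true

r2_good = r_squared(y_true, good_fit)   # about 0.995
r2_mean = r_squared(y_true, mean_only)  # 0.0
print(r2_good, r2_mean)
```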
Random Numbers in Data Analysis
Used in simulation, sampling, and randomized algorithms.
Monte Carlo Simulation
Method using repeated random sampling to model probabilistic systems.
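A classic illustration is estimating π by sampling random points in the unit square and counting how many land inside the quarter circle. The function name and seed below are arbitrary choices for the sketch.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / square area = pi / 4.
    return 4 * inside / n_samples

pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

The estimate gets closer to π as the number of samples grows, with the error shrinking roughly as 1/sqrt(n).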
Gradient Descent Risk Conditions
If the learning rate is too high, updates can overshoot or diverge; if the function is not convex, descent can settle in a local minimum instead of the global optimum.
Learning Rate in Gradient Descent
Controls the step size of each weight update; too high may overshoot, too low converges slowly.
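The effect of the learning rate shows up even on the simplest convex function. This sketch minimizes f(x) = x², chosen here purely for illustration, with three different rates.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimize f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

x_good = gradient_descent(0.1)    # converges toward the minimum at 0
x_slow = gradient_descent(0.001)  # barely moves in 50 steps
x_high = gradient_descent(1.1)    # each step overshoots; the iterate diverges

print(x_good, x_slow, x_high)
```

Each step multiplies x by (1 - 2 × learning rate), so any rate above 1.0 makes the iterate grow instead of shrink on this function.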
Overfitting with Small Datasets
With too few examples, the model can memorize the training set instead of learning patterns that generalize.
Null Hypothesis in ANOVA
Group means are equal.
Alternative Hypothesis in ANOVA
At least one group mean is different.
SQL Definition
Structured Query Language used to manage and query relational databases.
Purpose of PCA
Reduce data dimensions while retaining most of the variance.
Ensemble Learning
Combines several models into one predictor; randomizing the training data or features makes the models more diverse and reduces overfitting.
Encoding Example
Gender as a label: Male=1, Female=0; or one-hot: Male=[1,0], Female=[0,1].
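The same example as a plain-Python sketch. The helper names `label_encode` and `one_hot_encode` are hypothetical; libraries such as pandas and scikit-learn provide their own encoders.

```python
def label_encode(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[v] for v in values]

def one_hot_encode(values, categories):
    """Replace each category with a 0/1 vector, one slot per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

genders = ["Male", "Female", "Female", "Male"]

labels = label_encode(genders, {"Male": 1, "Female": 0})
one_hot = one_hot_encode(genders, ["Male", "Female"])

print(labels)   # [1, 0, 0, 1]
print(one_hot)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```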