SQL, Scikit, and Supervised Learning (ML)
🧠 Refined Notes – SQL, ML, and Scikit-learn
SQL Basics
ORDER BY lets you sort query results.
Defaults to ASC (ascending).
Use DESC to reverse it.
You can sort by multiple columns; ties are broken in order from left to right.
SQL doesn't run top-to-bottom like code. It follows this logical execution order:
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
WHERE filters individual rows before any grouping.
HAVING filters after grouping, usually paired with aggregate functions like AVG() or COUNT().
GROUP BY groups rows by unique values in one or more columns, often used with aggregates.
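A minimal sketch of these clauses working together, run through Python's built-in sqlite3 module so it is easy to try. The orders table and its values are made up for illustration; they are not from the notes.

```python
import sqlite3

# Hypothetical table and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 30.0), ("bo", 5.0),
     ("bo", 7.0), ("cy", 50.0), ("cy", 2.0)],
)

query = """
    SELECT customer, AVG(amount) AS avg_amount
    FROM orders
    WHERE amount > 4          -- row filter, applied before grouping
    GROUP BY customer         -- one group per distinct customer
    HAVING AVG(amount) > 10   -- group filter, applied after grouping
    ORDER BY avg_amount DESC, customer ASC  -- second column breaks ties
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('cy', 50.0), ('ana', 20.0)]
```

WHERE drops the 2.0 row before any averages are computed, and HAVING then drops bo, whose average is 6.0.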
What is Machine Learning (ML)?
Machine Learning is about teaching computers to recognize patterns and make predictions using data.
AI is broader — simulating human behavior and tasks.
Data Science is about analyzing data to extract insights. ML is one of its tools.
Supervised Learning
Uses labeled data → we know the input and the correct output.
Input data = Feature Matrix (X) → all your columns of features for every row.
One row from X = Feature Vector (e.g., one patient, one movie, one product).
Output = Target Vector (y) → what you're trying to predict (e.g., disease or not, price, label).
Tasks:
Classification → predicting discrete labels (like spam/ham or apple/orange/banana).
Can be binary or multi-class.
Regression → predicting continuous numeric values (like price, temperature).
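A small sketch of how X, a feature vector, and y usually look in code. It assumes NumPy arrays (a common but not required choice), and the house data is invented for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 houses, 2 features each (area in m^2, number of rooms).
X = np.array([
    [50.0, 2],
    [80.0, 3],
    [120.0, 4],
    [200.0, 6],
])

# Regression target: one continuous value per row (made-up prices).
y_price = np.array([150_000.0, 240_000.0, 360_000.0, 600_000.0])

# Classification target: one discrete label per row (1 = "large", 0 = "not large").
y_is_large = np.array([0, 0, 1, 1])

feature_vector = X[0]  # one row of X = one example

print(X.shape)               # (4, 2) -> (n_samples, n_features)
print(feature_vector.shape)  # (2,)   -> (n_features,)
print(y_price.shape)         # (4,)   -> (n_samples,)
```

The same X can serve a regression task or a classification task; only the target vector changes.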
Data Types
Quantitative data = numeric.
Discrete = countable (e.g., # of pets).
Continuous = measurable, can be any value in a range (e.g., weight, time).
Qualitative data = categorical.
Ordinal = has a natural order (small/medium/large).
Nominal = no order (like eye color or country).
One-Hot Encoding converts categories into vectors like [1, 0, 0].
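A hand-rolled sketch of one-hot encoding in plain Python, just to show the idea. The helper name one_hot_encode and the eye-color values are hypothetical; in practice a library encoder such as scikit-learn's OneHotEncoder or pandas.get_dummies would likely be used instead.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

eye_colors = ["blue", "green", "brown", "blue"]  # nominal data, no natural order
categories, encoded = one_hot_encode(eye_colors)
print(categories)  # ['blue', 'brown', 'green']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each category gets its own column, so the model does not read a fake ordering into nominal values the way it could with codes like 0, 1, 2.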
Model Evaluation
You split data into:
Training set → the model learns from this.
Validation set → used to tune the model (e.g., hyperparameters) without touching the test set.
Test set → used once to evaluate final performance.
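One possible way to get all three sets is to call scikit-learn's train_test_split twice. The 60/20/20 proportions and the random data below are assumptions for illustration, not a rule from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First carve off the test set (20%), then split the rest into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 25% of the remaining 80 rows
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Setting random_state makes the split reproducible, which helps when comparing models.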
Loss measures model error:
L1 Loss (MAE) = mean of absolute errors; less sensitive to outliers.
L2 Loss (MSE) = mean of squared errors; penalizes large mistakes more.
Binary Cross-Entropy = for binary classification; punishes wrong confident guesses.
Accuracy = % of correct predictions — but not ideal for imbalanced datasets.
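A plain-Python sketch of the three losses above, using the standard mean-based definitions. The numbers are invented; the last prediction is deliberately far off to show how differently MAE and MSE react to one large error.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error (L1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error (L2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [10.0, 12.0, 11.0, 13.0]
y_pred = [11.0, 11.0, 12.0, 23.0]  # errors: 1, 1, 1, 10

print(mae(y_true, y_pred))  # 3.25
print(mse(y_true, y_pred))  # 25.75 (dominated by the single error of 10)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 3), round(confident_wrong, 3))  # 0.105 2.303
```

The single large error contributes 100 of the 103 squared-error units, which is why MSE is the more outlier-sensitive of the two.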
Unsupervised & Reinforcement Learning
Unsupervised Learning = no labels; it finds patterns (e.g., clustering customers).
Reinforcement Learning = an agent learns by interacting with an environment using rewards and penalties.
Scikit-learn Tips
Use train_test_split (not tran_test_split) to divide data.
Most scikit-learn models can't handle NaNs or nulls; clean or impute your data first.
Import models like LinearRegression from sklearn.linear_model.
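A short sketch of the import and the NaN problem together. The data is made up (it follows y = 2x + 1), and mean imputation via SimpleImputer is only one possible cleaning strategy, chosen here as an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2 * x + 1, with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Fitting directly on X raises a ValueError because of the NaN.
try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError:
    fit_failed = True

# One option: fill missing values with the column mean before fitting.
X_clean = SimpleImputer(strategy="mean").fit_transform(X)
model = LinearRegression().fit(X_clean, y)

print(fit_failed)                     # True
print(X_clean.ravel())                # the NaN is replaced by the mean, 3.0
print(model.coef_, model.intercept_)  # close to [2.] and 1.0
```

Dropping rows with missing values is another common option; which one fits depends on how much data is missing and why.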
⚠ Critical Edge Cases
Using WHERE with aggregates like AVG() won't work; use HAVING instead.
Feeding missing values into most scikit-learn models causes errors.
High training accuracy but low real-world performance = overfitting (bad generalization).
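A quick way to see the WHERE/aggregate edge case, sketched with Python's sqlite3 module. The scores table is hypothetical, and the exact error type and message are SQLite-specific; other databases report the problem in their own way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, points REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("red", 10.0), ("red", 20.0), ("blue", 1.0), ("blue", 3.0)],
)

# Aggregate inside WHERE: SQLite rejects the statement.
try:
    conn.execute("SELECT team FROM scores WHERE AVG(points) > 5 GROUP BY team")
    where_failed = False
except sqlite3.OperationalError as exc:
    where_failed = True
    print("WHERE with AVG() failed:", exc)

# Same intent expressed with HAVING, which runs after grouping.
teams = conn.execute(
    "SELECT team FROM scores GROUP BY team HAVING AVG(points) > 5"
).fetchall()
print(teams)  # [('red',)]
```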
❌ Common Mistakes
Misspelling train_test_split as tran_test_split.
Assuming classification always means binary; it can involve more than two classes.
Relying only on accuracy for imbalanced data — use precision, recall, or F1 score.
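A plain-Python sketch of why accuracy misleads on imbalanced data. The 95/5 label split and the always-predict-negative "model" are invented for illustration, and treating an undefined precision as 0.0 is a convention assumed here.

```python
# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero; undefined values are treated as 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)  # 0.95 0.0 0.0 0.0
```

The model scores 95% accuracy while finding none of the positive cases, which recall and F1 expose immediately.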
🧪 Knowt Flashcard Questions
1. What’s the difference between WHERE and HAVING in SQL, and when should each be used?
2. In supervised learning, what’s the difference between a feature vector, feature matrix, and target vector?
3. What kind of problems would you use classification for? What about regression?
4. When should you use L1 loss vs. L2 loss?
5. What does one-hot encoding do, and why is it useful?
6. What’s a major limitation of scikit-learn when it comes to missing data?
These notes can support roughly 30–40 flashcards without forcing it. Here's a rough breakdown:
✅ Estimated Flashcard Breakdown
SQL (7–9 cards)
Execution order of SQL clauses
Difference between WHERE and HAVING
What GROUP BY does
How ORDER BY works with multiple fields
When to use ASC vs DESC
Purpose of aggregate functions
Example of using HAVING
SQL's logical vs written execution order
ML Concepts (8–10 cards)
What supervised learning is
Difference between AI, ML, and Data Science
Definitions of feature matrix, vector, and target vector
Types of tasks: classification vs regression
Real-world examples of classification
Real-world examples of regression
Data splitting: training/validation/test
What loss is and what it measures
Why generalization matters
Data Types & Encoding (5–7 cards)
Discrete vs continuous data
Qualitative vs quantitative
Nominal vs ordinal
One-hot encoding: what it is and why we use it
Example of encoding country data
Model Evaluation (5–7 cards)
L1 vs L2 loss: difference and use cases
What Binary Cross-Entropy is
Accuracy: what it tells you and its limitations
Why accuracy is bad for imbalanced datasets
Metrics to use instead of accuracy
Scikit-learn + Edge Cases (5–6 cards)
scikit-learn can't handle NaNs — what to do instead
What train_test_split does
What happens if you train on data with nulls
What overfitting is
Example of a common typo (tran_test_split)
Why we split data at all