
SQL, Scikit, and Supervised Learning (ML)

Absolutely — here’s your consolidated, refined, human-style bullet list of everything, followed by 6 test questions at the bottom for Knowt flashcards:


🧠 Refined Notes – SQL, ML, and Scikit-learn

SQL Basics
  • ORDER BY lets you sort query results.

    • Defaults to ASC (ascending).

    • Use DESC to reverse it.

    • You can sort by multiple columns — it’ll break ties in order from left to right.

  • SQL doesn't run top-to-bottom like code. It follows this logical execution order (see the example after this list):

    • FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

  • WHERE filters individual rows before any grouping.

  • HAVING filters after grouping — usually paired with aggregate functions like AVG() or COUNT().

  • GROUP BY is for grouping rows by unique values in one or more columns — often used with aggregates.
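
For example, here's a minimal sketch of the WHERE / GROUP BY / HAVING / ORDER BY flow, using Python's built-in sqlite3 module and a made-up orders table (the table, columns, and values are purely illustrative):

    import sqlite3

    # In-memory database with a small, made-up "orders" table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 40.0), ("Ana", 60.0), ("Ben", 10.0), ("Cy", 90.0), ("Cy", 20.0)],
    )

    query = """
        SELECT customer, AVG(amount) AS avg_amount
        FROM orders
        WHERE amount > 15          -- row-level filter, runs before GROUP BY
        GROUP BY customer
        HAVING AVG(amount) > 30    -- group-level filter, runs after GROUP BY
        ORDER BY avg_amount DESC   -- sorting happens last; DESC reverses the default ASC
    """
    for row in conn.execute(query):
        print(row)                 # ('Cy', 90.0) then ('Ana', 50.0)

Reading the query in the logical order above explains the output: Ben's only row is dropped by WHERE, so he never forms a group, and the surviving group averages are sorted in descending order.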


What is Machine Learning (ML)?
  • Machine Learning is about teaching computers to recognize patterns and make predictions using data.

  • AI is broader — simulating human behavior and tasks.

  • Data Science is about analyzing data to extract insights. ML is one of its tools.


Supervised Learning
  • Uses labeled data → we know the input and the correct output.

  • Input data = Feature Matrix (X) → all your columns of features for every row.

  • One row from X = Feature Vector (e.g., one patient, one movie, one product).

  • Output = Target Vector (y) → what you're trying to predict (e.g., disease or not, price, label).

  • Tasks (see the sketch after this list):

    • Classification → predicting discrete labels (like spam/ham or apple/orange/banana).

      • Can be binary or multi-class.

    • Regression → predicting continuous numeric values (like price, temperature).
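
Here's a minimal supervised-learning sketch with scikit-learn, using made-up numbers just to show the shapes: X is the feature matrix, each row of X is one feature vector, and y is the target vector (continuous for regression, discrete labels for classification):

    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Feature matrix X: one row per example (a feature vector), one column per feature.
    # Made-up features: [square_feet, num_rooms] for four houses.
    X = [[500, 1], [800, 2], [1200, 3], [2000, 4]]

    # Regression: the target vector y is continuous (e.g., a price).
    y_price = [100_000, 150_000, 230_000, 400_000]
    reg = LinearRegression().fit(X, y_price)
    print(reg.predict([[1000, 2]]))   # a numeric prediction

    # Classification: the target vector y holds discrete labels (binary here,
    # but it could just as well be multi-class).
    y_label = ["small", "small", "big", "big"]
    clf = LogisticRegression(max_iter=1000).fit(X, y_label)
    print(clf.predict([[1000, 2]]))   # one of the class labels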


Data Types
  • Quantitative data = numeric.

    • Discrete = countable (e.g., # of pets).

    • Continuous = measurable, can be any value in a range (e.g., weight, time).

  • Qualitative data = categorical.

    • Ordinal = has a natural order (small/medium/large).

    • Nominal = no order (like eye color or country).

  • One-Hot Encoding converts categories into vectors like [1, 0, 0].
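
A quick sketch of one-hot encoding with scikit-learn's OneHotEncoder (the color values are made up; sparse_output=False needs scikit-learn 1.2+, older versions spell it sparse=False):

    from sklearn.preprocessing import OneHotEncoder

    # Nominal categories (no natural order), so one-hot encode them.
    colors = [["red"], ["green"], ["blue"], ["green"]]

    encoder = OneHotEncoder(sparse_output=False)   # return a dense array, not a sparse matrix
    one_hot = encoder.fit_transform(colors)

    print(encoder.categories_)   # [array(['blue', 'green', 'red'], dtype=object)]
    print(one_hot)               # each row becomes a vector like [0. 0. 1.]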


Model Evaluation
  • You split data into:

    • Training set → the model learns from this.

    • Validation set → used to tune the model (e.g., hyperparameters) without biasing the final test evaluation.

    • Test set → used once to evaluate final performance.

  • Loss measures model error (the sketch after this list computes each one):

    • L1 Loss (MAE) = mean absolute error; more robust to outliers since errors aren't squared.

    • L2 Loss (MSE) = mean squared error; squaring penalizes large mistakes more heavily.

    • Binary Cross-Entropy = for binary classification; punishes confident wrong guesses most heavily.

  • Accuracy = % of correct predictions — but not ideal for imbalanced datasets.
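
Here's a small sketch of those measures using sklearn.metrics on made-up predictions (the numbers are arbitrary and only show what each function computes):

    from sklearn.metrics import (accuracy_score, log_loss,
                                 mean_absolute_error, mean_squared_error)

    # Regression-style errors on made-up true vs. predicted values.
    y_true = [3.0, 5.0, 7.0]
    y_pred = [2.5, 5.0, 10.0]
    print(mean_absolute_error(y_true, y_pred))   # L1 / MAE: average of |error|
    print(mean_squared_error(y_true, y_pred))    # L2 / MSE: average of error^2, big misses cost more

    # Binary cross-entropy (log loss) punishes confident wrong predictions.
    y_cls = [1, 0, 1, 1]
    p_hat = [0.9, 0.2, 0.6, 0.1]   # predicted probability of class 1
    print(log_loss(y_cls, p_hat))

    # Accuracy: fraction of correct labels (can look fine on imbalanced data
    # even when the model ignores the minority class).
    print(accuracy_score(y_cls, [1, 0, 1, 0]))   # 0.75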


Unsupervised & Reinforcement Learning
  • Unsupervised Learning = no labels; the model finds patterns on its own (e.g., clustering customers, as sketched below).

  • Reinforcement Learning = an agent learns by interacting with an environment using rewards and penalties.
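
A minimal unsupervised sketch, assuming a made-up table of customer behavior and using scikit-learn's KMeans:

    from sklearn.cluster import KMeans

    # Made-up, unlabeled customer data: [annual_spend, visits_per_month].
    X = [[100, 1], [120, 2], [110, 1], [900, 9], [950, 10], [880, 8]]

    # No labels: KMeans groups the rows on its own into two clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)   # e.g., [0 0 0 1 1 1]: low spenders vs. high spenders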


Scikit-learn Tips
  • Use train_test_split (not tran_test_split) to divide data.

  • Scikit-learn models can't handle NaNs or nulls; clean your data first (see the sketch after these tips).

  • Import models like LinearRegression from sklearn.linear_model.
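
Putting those tips together, a minimal sketch with made-up numbers: impute the missing value first, split the data with train_test_split, then fit LinearRegression:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Made-up feature matrix with one missing value; fitting on the NaN
    # directly would make scikit-learn raise an error.
    X = np.array([[1, 2], [2, np.nan], [3, 6], [4, 8],
                  [5, 10], [6, 12], [7, 14], [8, 16]], dtype=float)
    y = np.array([3, 5, 9, 12, 15, 18, 21, 24], dtype=float)

    # Clean the data first, e.g. by filling NaNs with the column mean.
    X_clean = SimpleImputer(strategy="mean").fit_transform(X)

    # Note the spelling: train_test_split, not tran_test_split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y, test_size=0.25, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))   # R^2 on the held-out test set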


Critical Edge Cases

  • Using WHERE with aggregates like AVG() won’t work — use HAVING instead.

  • Feeding missing values into scikit-learn models causes errors.

  • High accuracy on the training data but poor real-world performance = overfitting (bad generalization); see the sketch below.
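
A quick way to catch overfitting is to compare accuracy on the training set with accuracy on held-out data. The sketch below uses a synthetic, deliberately noisy dataset and an unpruned decision tree purely as an illustration (all settings are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic noisy data (flip_y adds label noise), which an unpruned
    # decision tree can simply memorize.
    X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))   # ~1.0 (memorized)
    print("test accuracy:", tree.score(X_test, y_test))      # noticeably lower: overfitting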


Common Mistakes

  • Misspelling train_test_split as tran_test_split.

  • Assuming classification always means binary — it can be more than two classes.

  • Relying only on accuracy for imbalanced data — use precision, recall, or F1 score.
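
For example, on a made-up imbalanced label set, a model that always predicts the majority class still scores high accuracy while precision, recall, and F1 expose the problem:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # Made-up imbalanced labels: 9 negatives, 1 positive.
    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    y_pred = [0] * 10   # a model that always predicts the majority class

    print(accuracy_score(y_true, y_pred))                     # 0.9, looks great
    print(precision_score(y_true, y_pred, zero_division=0))   # 0.0
    print(recall_score(y_true, y_pred))                       # 0.0, it never finds the positive
    print(f1_score(y_true, y_pred, zero_division=0))          # 0.0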


🧪 Knowt Flashcard Questions

1. What’s the difference between WHERE and HAVING in SQL, and when should each be used?
2. In supervised learning, what’s the difference between a feature vector, feature matrix, and target vector?
3. What kind of problems would you use classification for? What about regression?
4. When should you use L1 loss vs. L2 loss?
5. What does one-hot encoding do, and why is it useful?
6. What’s a major limitation of scikit-learn when it comes to missing data?

Definitely — you can easily create 30–40 high-quality flashcards from these notes without forcing it. Here's a rough breakdown:


Estimated Flashcard Breakdown

SQL (7–9 cards)
  • Execution order of SQL clauses

  • Difference between WHERE and HAVING

  • What GROUP BY does

  • How ORDER BY works with multiple fields

  • When to use ASC vs DESC

  • Purpose of aggregate functions

  • Example of using HAVING

  • SQL's logical vs written execution order

ML Concepts (8–10 cards)
  • What supervised learning is

  • Difference between AI, ML, and Data Science

  • Definitions of feature matrix, vector, and target vector

  • Types of tasks: classification vs regression

  • Real-world examples of classification

  • Real-world examples of regression

  • Data splitting: training/validation/test

  • What loss is and what it measures

  • Why generalization matters

Data Types & Encoding (5–7 cards)
  • Discrete vs continuous data

  • Qualitative vs quantitative

  • Nominal vs ordinal

  • One-hot encoding: what it is and why we use it

  • Example of encoding country data

Model Evaluation (5–7 cards)
  • L1 vs L2 loss: difference and use cases

  • What Binary Cross-Entropy is

  • Accuracy: what it tells you and its limitations

  • Why accuracy is bad for imbalanced datasets

  • Metrics to use instead of accuracy

Scikit-learn + Edge Cases (5–6 cards)
  • scikit-learn can't handle NaNs — what to do instead

  • What train_test_split does

  • What happens if you train on data with nulls

  • What overfitting is

  • Example of a common typo (tran_test_split)

  • Why we split data at all