
SQL, Scikit, and Supervised Learning (ML)

Absolutely — here’s your consolidated, refined, human-style bullet list of everything, followed by 6 test questions at the bottom for Knowt flashcards:


🧠 Refined Notes – SQL, ML, and Scikit-learn

SQL Basics
  • ORDER BY lets you sort query results.

    • Defaults to ASC (ascending).

    • Use DESC to reverse it.

    • You can sort by multiple columns — it’ll break ties in order from left to right.

  • SQL doesn't run top-to-bottom like code. It follows this logical execution order (see the example after this list):

    • FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

  • WHERE filters individual rows before any grouping.

  • HAVING filters after grouping — usually paired with aggregate functions like AVG() or COUNT().

  • GROUP BY is for grouping rows by unique values in one or more columns — often used with aggregates.
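
For example, here's a minimal sketch of the WHERE / GROUP BY / HAVING / ORDER BY flow, using Python's built-in sqlite3 module and a made-up orders table (the table, columns, and values are purely illustrative):

    import sqlite3

    # In-memory database with a small, made-up "orders" table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 40.0), ("Ana", 60.0), ("Ben", 10.0), ("Cy", 90.0), ("Cy", 20.0)],
    )

    query = """
        SELECT customer, AVG(amount) AS avg_amount
        FROM orders
        WHERE amount > 15          -- row-level filter, runs before GROUP BY
        GROUP BY customer
        HAVING AVG(amount) > 30    -- group-level filter, runs after GROUP BY
        ORDER BY avg_amount DESC   -- sorting happens last; DESC reverses the default ASC
    """
    for row in conn.execute(query):
        print(row)                 # ('Cy', 90.0) then ('Ana', 50.0)

Reading the query in the logical order above explains the output: Ben's only row is dropped by WHERE, so he never forms a group, and the surviving group averages are sorted in descending order.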


What is Machine Learning (ML)?
  • Machine Learning is about teaching computers to recognize patterns and make predictions using data.

  • AI is broader — simulating human behavior and tasks.

  • Data Science is about analyzing data to extract insights. ML is one of its tools.


Supervised Learning
  • Uses labeled data → we know the input and the correct output.

  • Input data = Feature Matrix (X) → all your columns of features for every row.

  • One row from X = Feature Vector (e.g., one patient, one movie, one product).

  • Output = Target Vector (y) → what you're trying to predict (e.g., disease or not, price, label).

  • Tasks (see the sketch after this list):

    • Classification → predicting discrete labels (like spam/ham or apple/orange/banana).

      • Can be binary or multi-class.

    • Regression → predicting continuous numeric values (like price, temperature).
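
Here's a minimal supervised-learning sketch with scikit-learn, using made-up numbers just to show the shapes: X is the feature matrix, each row of X is one feature vector, and y is the target vector (continuous for regression, discrete labels for classification):

    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Feature matrix X: one row per example (a feature vector), one column per feature.
    # Made-up features: [square_feet, num_rooms] for four houses.
    X = [[500, 1], [800, 2], [1200, 3], [2000, 4]]

    # Regression: the target vector y is continuous (e.g., a price).
    y_price = [100_000, 150_000, 230_000, 400_000]
    reg = LinearRegression().fit(X, y_price)
    print(reg.predict([[1000, 2]]))   # a numeric prediction

    # Classification: the target vector y holds discrete labels (binary here,
    # but it could just as well be multi-class).
    y_label = ["small", "small", "big", "big"]
    clf = LogisticRegression(max_iter=1000).fit(X, y_label)
    print(clf.predict([[1000, 2]]))   # one of the class labels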


Data Types
  • Quantitative data = numeric.

    • Discrete = countable (e.g., # of pets).

    • Continuous = measurable, can be any value in a range (e.g., weight, time).

  • Qualitative data = categorical.

    • Ordinal = has a natural order (small/medium/large).

    • Nominal = no order (like eye color or country).

  • One-Hot Encoding converts categories into vectors like [1, 0, 0].
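
A quick sketch of one-hot encoding with scikit-learn's OneHotEncoder (the color values are made up; sparse_output=False needs scikit-learn 1.2+, older versions spell it sparse=False):

    from sklearn.preprocessing import OneHotEncoder

    # Nominal categories (no natural order), so one-hot encode them.
    colors = [["red"], ["green"], ["blue"], ["green"]]

    encoder = OneHotEncoder(sparse_output=False)   # return a dense array, not a sparse matrix
    one_hot = encoder.fit_transform(colors)

    print(encoder.categories_)   # [array(['blue', 'green', 'red'], dtype=object)]
    print(one_hot)               # each row becomes a vector like [0. 0. 1.]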


Model Evaluation
  • You split data into:

    • Training set → the model learns from this.

    • Validation set → used to tune the model (e.g., hyperparameters) without biasing the final test evaluation.

    • Test set → used once to evaluate final performance.

  • Loss measures model error (the sketch after this list computes each one):

    • L1 Loss (MAE) = mean absolute error; more robust to outliers since errors aren't squared.

    • L2 Loss (MSE) = mean squared error; squaring penalizes large mistakes more heavily.

    • Binary Cross-Entropy = for binary classification; punishes confident wrong guesses most heavily.

  • Accuracy = % of correct predictions — but not ideal for imbalanced datasets.
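
Here's a small sketch of those measures using sklearn.metrics on made-up predictions (the numbers are arbitrary and only show what each function computes):

    from sklearn.metrics import (accuracy_score, log_loss,
                                 mean_absolute_error, mean_squared_error)

    # Regression-style errors on made-up true vs. predicted values.
    y_true = [3.0, 5.0, 7.0]
    y_pred = [2.5, 5.0, 10.0]
    print(mean_absolute_error(y_true, y_pred))   # L1 / MAE: average of |error|
    print(mean_squared_error(y_true, y_pred))    # L2 / MSE: average of error^2, big misses cost more

    # Binary cross-entropy (log loss) punishes confident wrong predictions.
    y_cls = [1, 0, 1, 1]
    p_hat = [0.9, 0.2, 0.6, 0.1]   # predicted probability of class 1
    print(log_loss(y_cls, p_hat))

    # Accuracy: fraction of correct labels (can look fine on imbalanced data
    # even when the model ignores the minority class).
    print(accuracy_score(y_cls, [1, 0, 1, 0]))   # 0.75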


Unsupervised & Reinforcement Learning
  • Unsupervised Learning = no labels; the model finds patterns on its own (e.g., clustering customers, as sketched below).

  • Reinforcement Learning = an agent learns by interacting with an environment using rewards and penalties.
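
A minimal unsupervised sketch, assuming a made-up table of customer behavior and using scikit-learn's KMeans:

    from sklearn.cluster import KMeans

    # Made-up, unlabeled customer data: [annual_spend, visits_per_month].
    X = [[100, 1], [120, 2], [110, 1], [900, 9], [950, 10], [880, 8]]

    # No labels: KMeans groups the rows on its own into two clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)   # e.g., [0 0 0 1 1 1]: low spenders vs. high spenders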


Scikit-learn Tips
  • Use train_test_split (not tran_test_split) to divide data.

  • Scikit-learn models can't handle NaNs or nulls; clean your data first (see the sketch after these tips).

  • Import models like LinearRegression from sklearn.linear_model.
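
Putting those tips together, a minimal sketch with made-up numbers: impute the missing value first, split the data with train_test_split, then fit LinearRegression:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Made-up feature matrix with one missing value; fitting on the NaN
    # directly would make scikit-learn raise an error.
    X = np.array([[1, 2], [2, np.nan], [3, 6], [4, 8],
                  [5, 10], [6, 12], [7, 14], [8, 16]], dtype=float)
    y = np.array([3, 5, 9, 12, 15, 18, 21, 24], dtype=float)

    # Clean the data first, e.g. by filling NaNs with the column mean.
    X_clean = SimpleImputer(strategy="mean").fit_transform(X)

    # Note the spelling: train_test_split, not tran_test_split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y, test_size=0.25, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))   # R^2 on the held-out test set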


Critical Edge Cases

  • Using WHERE with aggregates like AVG() won’t work — use HAVING instead.

  • Feeding missing values into scikit-learn models causes errors.

  • High accuracy on the training data but poor real-world performance = overfitting (bad generalization); see the sketch below.
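
A quick way to catch overfitting is to compare accuracy on the training set with accuracy on held-out data. The sketch below uses a synthetic, deliberately noisy dataset and an unpruned decision tree purely as an illustration (all settings are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic noisy data (flip_y adds label noise), which an unpruned
    # decision tree can simply memorize.
    X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))   # ~1.0 (memorized)
    print("test accuracy:", tree.score(X_test, y_test))      # noticeably lower: overfitting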


Common Mistakes

  • Misspelling train_test_split as tran_test_split.

  • Assuming classification always means binary — it can be more than two classes.

  • Relying only on accuracy for imbalanced data — use precision, recall, or F1 score.
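
For example, on a made-up imbalanced label set, a model that always predicts the majority class still scores high accuracy while precision, recall, and F1 expose the problem:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # Made-up imbalanced labels: 9 negatives, 1 positive.
    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    y_pred = [0] * 10   # a model that always predicts the majority class

    print(accuracy_score(y_true, y_pred))                     # 0.9, looks great
    print(precision_score(y_true, y_pred, zero_division=0))   # 0.0
    print(recall_score(y_true, y_pred))                       # 0.0, it never finds the positive
    print(f1_score(y_true, y_pred, zero_division=0))          # 0.0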


🧪 Knowt Flashcard Questions

1. What’s the difference between WHERE and HAVING in SQL, and when should each be used?
2. In supervised learning, what’s the difference between a feature vector, feature matrix, and target vector?
3. What kind of problems would you use classification for? What about regression?
4. When should you use L1 loss vs. L2 loss?
5. What does one-hot encoding do, and why is it useful?
6. What’s a major limitation of scikit-learn when it comes to missing data?

Definitely — you can easily create 30–40 high-quality flashcards from these notes without forcing it. Here's a rough breakdown:


Estimated Flashcard Breakdown

SQL (7–9 cards)
  • Execution order of SQL clauses

  • Difference between WHERE and HAVING

  • What GROUP BY does

  • How ORDER BY works with multiple fields

  • When to use ASC vs DESC

  • Purpose of aggregate functions

  • Example of using HAVING

  • SQL's logical vs written execution order

ML Concepts (8–10 cards)
  • What supervised learning is

  • Difference between AI, ML, and Data Science

  • Definitions of feature matrix, vector, and target vector

  • Types of tasks: classification vs regression

  • Real-world examples of classification

  • Real-world examples of regression

  • Data splitting: training/validation/test

  • What loss is and what it measures

  • Why generalization matters

Data Types & Encoding (5–7 cards)
  • Discrete vs continuous data

  • Qualitative vs quantitative

  • Nominal vs ordinal

  • One-hot encoding: what it is and why we use it

  • Example of encoding country data

Model Evaluation (5–7 cards)
  • L1 vs L2 loss: difference and use cases

  • What Binary Cross-Entropy is

  • Accuracy: what it tells you and its limitations

  • Why accuracy is bad for imbalanced datasets

  • Metrics to use instead of accuracy

Scikit-learn + Edge Cases (5–6 cards)
  • scikit-learn can't handle NaNs — what to do instead

  • What train_test_split does

  • What happens if you train on data with nulls

  • What overfitting is

  • Example of a common typo (tran_test_split)

  • Why we split data at all