Absolutely — here’s your consolidated, refined, human-style bullet list of everything, followed by 6 test questions at the bottom for Knowt flashcards:
`ORDER BY` lets you sort query results. Defaults to `ASC` (ascending); use `DESC` to reverse it.
You can sort by multiple columns; ties are broken in order from left to right.
SQL doesn't run top-to-bottom like code. It follows this logical execution order:
`FROM` → `WHERE` → `GROUP BY` → `HAVING` → `SELECT` → `ORDER BY` → `LIMIT`
`WHERE` filters individual rows before any grouping.
`HAVING` filters after grouping, usually paired with aggregate functions like `AVG()` or `COUNT()`.
`GROUP BY` groups rows by unique values in one or more columns, often used with aggregates.
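These clauses (and the execution order above) can be seen in action with a throwaway SQLite table; the table, columns, and values below are invented for illustration:

```python
import sqlite3

# In-memory toy database (all names and values invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10), ("ana", 30), ("bob", 5), ("bob", 7), ("cai", 100)],
)

# WHERE filters rows before grouping; HAVING filters groups after aggregation.
rows = conn.execute("""
    SELECT customer, AVG(amount) AS avg_amount
    FROM orders
    WHERE amount > 1          -- row-level filter (runs first)
    GROUP BY customer
    HAVING AVG(amount) > 10   -- group-level filter (runs after GROUP BY)
    ORDER BY avg_amount DESC  -- sorting happens near the end
""").fetchall()

print(rows)  # [('cai', 100.0), ('ana', 20.0)] -- bob's group (avg 6) is filtered out
```

Note that swapping `HAVING AVG(amount) > 10` for `WHERE AVG(amount) > 10` would be an error, because `WHERE` runs before any groups exist.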
Machine Learning is about teaching computers to recognize patterns and make predictions using data.
AI is broader — simulating human behavior and tasks.
Data Science is about analyzing data to extract insights. ML is one of its tools.
Supervised learning uses labeled data → we know the input and the correct output.
Input data = Feature Matrix (X) → all your columns of features for every row.
One row from X = Feature Vector (e.g., one patient, one movie, one product).
Output = Target Vector (y) → what you're trying to predict (e.g., disease or not, price, label).
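In NumPy terms, these three objects are just arrays of different shapes (the patient data below is made up):

```python
import numpy as np

# Toy feature matrix: 3 patients (rows) x 2 features (columns); values invented.
X = np.array([[37.2, 120.0],
              [39.1,  95.0],
              [36.8, 110.0]])
y = np.array([0, 1, 0])  # target vector: one label per row (e.g., disease or not)

print(X.shape)  # (3, 2) -> feature matrix: rows = samples, columns = features
print(X[1])     # one feature vector (one patient)
print(y.shape)  # (3,)   -> target vector: one value per sample
```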
Tasks:
Classification → predicting discrete labels (like spam/ham or apple/orange/banana).
Can be binary or multi-class.
Regression → predicting continuous numeric values (like price, temperature).
Quantitative data = numeric.
Discrete = countable (e.g., # of pets).
Continuous = measurable, can be any value in a range (e.g., weight, time).
Qualitative data = categorical.
Ordinal = has a natural order (small/medium/large).
Nominal = no order (like eye color or country).
One-Hot Encoding converts categories into vectors like `[1, 0, 0]`.
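A minimal hand-rolled version makes the idea concrete (the category list here is invented; in practice you'd usually use a library encoder such as scikit-learn's `OneHotEncoder`):

```python
# Minimal one-hot encoder for a nominal column (categories invented).
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a vector with a 1 in the slot matching the category."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("red", categories))   # [1, 0, 0]
print(one_hot("blue", categories))  # [0, 0, 1]
```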
You split data into:
Training set → the model learns from this.
Validation set → used to tune model without biasing it.
Test set → used once to evaluate final performance.
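One common way to get all three sets is to call `train_test_split` twice; the 60/20/20 ratio below is just one conventional choice, not a rule:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 features (values arbitrary)
y = np.arange(10)

# First carve off the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```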
Loss measures model error:
L1 Loss (MAE) = mean of absolute errors; more robust to outliers.
L2 Loss (MSE) = mean of squared errors; penalizes large mistakes more.
Binary Cross-Entropy = for binary classification; punishes wrong confident guesses.
Accuracy = % of correct predictions — but not ideal for imbalanced datasets.
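These losses are short enough to compute by hand on made-up predictions, which shows why squaring punishes the big miss and why cross-entropy punishes confident wrong guesses:

```python
import math

# Invented true values and predictions; the third prediction is off by 2.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]

mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)    # L1
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)  # L2

print(mae)  # ~0.833 -> the error of 2 counts linearly
print(mse)  # ~1.417 -> the error of 2 dominates after squaring

# Binary cross-entropy for a single prediction p of a label y in {0, 1}.
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce(1, 0.9))   # small loss: confident and right
print(bce(1, 0.01))  # huge loss: confident and wrong
```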
Unsupervised Learning = no labels; it finds patterns (e.g., clustering customers).
Reinforcement Learning = an agent learns by interacting with an environment using rewards and penalties.
Use `train_test_split` (not `tran_test_split`) to divide data.
Scikit-learn models can’t handle NaNs or nulls — clean your data first.
Import models like `LinearRegression` from `sklearn.linear_model`.
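A minimal end-to-end sketch on clean toy data (the numbers follow y = 2x + 1 exactly, chosen so the fit is easy to check):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy regression data with no NaNs (scikit-learn would raise on missing values).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 2x + 1

model = LinearRegression()
model.fit(X, y)

print(round(model.coef_[0], 2), round(model.intercept_, 2))  # 2.0 1.0
```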
Using `WHERE` with aggregates like `AVG()` won't work; use `HAVING` instead.
Feeding missing values into scikit-learn models causes errors.
High accuracy but low real-world performance = overfitting (bad generalization).
Misspelling `train_test_split` as `tran_test_split`.
Assuming classification always means binary — it can be more than two classes.
Relying only on accuracy for imbalanced data — use precision, recall, or F1 score.
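The accuracy trap is easy to demonstrate with an invented 95%-negative dataset and a model that never predicts positive:

```python
# Invented labels: 95 negatives, 5 positives; the "model" always says negative.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.95 -> looks great on paper
print(recall)    # 0.0  -> catches none of the actual positives
```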
1. What’s the difference between `WHERE` and `HAVING` in SQL, and when should each be used?
2. In supervised learning, what’s the difference between a feature vector, feature matrix, and target vector?
3. What kind of problems would you use classification for? What about regression?
4. When should you use L1 loss vs. L2 loss?
5. What does one-hot encoding do, and why is it useful?
6. What’s a major limitation of scikit-learn when it comes to missing data?
Definitely — you can easily create 30–40 high-quality flashcards from these notes without forcing it. Here's a rough breakdown:
Execution order of SQL clauses
Difference between `WHERE` and `HAVING`
What `GROUP BY` does
How `ORDER BY` works with multiple fields
When to use `ASC` vs `DESC`
Purpose of aggregate functions
Example of using HAVING
SQL's logical vs written execution order
What supervised learning is
Difference between AI, ML, and Data Science
Definitions of feature matrix, vector, and target vector
Types of tasks: classification vs regression
Real-world examples of classification
Real-world examples of regression
Data splitting: training/validation/test
What loss is and what it measures
Why generalization matters
Discrete vs continuous data
Qualitative vs quantitative
Nominal vs ordinal
One-hot encoding: what it is and why we use it
Example of encoding country data
L1 vs L2 loss: difference and use cases
What Binary Cross-Entropy is
Accuracy: what it tells you and its limitations
Why accuracy is bad for imbalanced datasets
Metrics to use instead of accuracy
scikit-learn can't handle NaNs — what to do instead
What `train_test_split` does
What happens if you train on data with nulls
What overfitting is
Example of a common typo (`tran_test_split`)
Why we split data at all