D

SQL, ML, OOP, Pandas

Refined Notes: SQL, ML, OOP, Pandas


🧩 SQL — Subqueries & Execution Flow

  • A subquery is a SELECT inside another query — helps dynamically filter, insert, update, or delete.

  • You can use subqueries in:

    • SELECT → filter with values like MAX()

    • INSERT INTO ... SELECT → copy filtered data into another table

    • UPDATE → set values based on results of a subquery

    • DELETE → remove rows matching results of a subquery

  • Subqueries often work with IN, =, or EXISTS.


🤖 Machine Learning — Foundations

  • Supervised Learning: labeled data → predict outputs (classification, regression)

  • Unsupervised Learning: unlabeled data → find patterns (clustering, dimensionality reduction)

  • Reinforcement Learning: agent learns by reward/penalty

  • Ranking: predict the best item/order (used in recommendations)

  • Recommendation Systems = personalized filtering for music, shopping, etc.

  • Features: properties/columns of data (quantitative or categorical)

  • Labels: target outcomes (what we’re trying to predict)

  • Examples: (X, y) pairs

  • Feature Vector: one sample's data (row in X)

  • Feature Matrix: all samples (full table of features = X)

  • Target Vector: column of labels (y)

  • (x-bar) = mean, not a vector


🧮 Math Concepts

  • Derivative: measures rate of change (used in optimization)

  • Probability: likelihood of an outcome

    • Independent vs Conditional

  • Probability Distribution: how likely each outcome is

    • Gaussian (normal) vs Uniform


🧱 Scikit-Learn Basics

  • train_test_split() (not tran_test_split) = splits data for training and testing

  • scikit-learn models don’t accept NaNs

  • Import from sklearn.linear_model, sklearn.model_selection


🧠 Object-Oriented Programming (OOP)

  • Abstract Class: can’t be instantiated, contains at least one abstract method

  • Abstract Method: no implementation; must be overridden by child class

    • Use @abstractmethod from abc module

  • Static Method (@staticmethod): doesn’t use instance or class — utility method

  • Instance Method: regular method that acts on self

  • Concrete Method: has a full implementation (opposite of abstract)


🐼 Pandas Basics

  • .set_index() and .reset_index() help control DataFrame structure

    • drop=True removes the index column after reset

  • .loc[] is used for label-based subsetting (by index or column name)

  • df.groupby() groups data by a column’s values — often used with aggregation

  • Subset: A slice of your dataset, filtered down (rows, columns, or both)

  • Pivot Table: Reshape data to highlight summaries (like Excel pivot tables)


🧵 Notable Gaps / Fixes

  • is the mean, not a vector.

  • You wrote pd.subset() — there’s no such method in Pandas.

  • Your INSERT subquery SQL was missing IN (...) parentheses.

  • Still unsure about .loc[]? Think of it as label-based filtering: df.loc[row_label, column_label].


Highlights (5–10)

  • Subqueries allow filtering or setting values dynamically.

  • ML uses data to predict labels (supervised) or uncover patterns (unsupervised).

  • Features = input data (X); labels = targets (y); examples = (X, y).

  • Abstract classes provide structure, not instantiation.

  • Pandas lets you transform and reshape data with indexing and grouping.

  • Probability distributions help model uncertainty.

  • Reinforcement learning involves trial and error in an environment.

  • Use .loc[] for label-based filtering and .groupby() to summarize data.

  • One hot encoding is essential for handling categorical data in ML models.

  • Pivot tables and box plots help find insights quickly in messy datasets.


🚨 Critical Edge Cases

  1. Training a model on data with NaNs → model will error out.

  2. Using WHERE on an aggregate query → won’t work, must use HAVING.

  3. Using a static method where instance context is needed → leads to bugs.


Common Errors

  1. Misspelling train_test_split as tran_test_split.

  2. Forgetting @abstractmethod decorator → abstract methods won’t behave correctly.

  3. Thinking .subset() exists in pandas — it doesn’t.


🧪 Knowt Flashcard Questions

  1. What’s the difference between a feature vector, feature matrix, and target vector?

  2. When should you use a subquery in SQL?

  3. What does @abstractmethod do?

  4. What is the difference between a static method and an instance method?

  5. What does df.groupby() do in pandas?

  6. What happens if you try to fit a model with NaNs in scikit-learn?

  7. Define supervised learning vs. unsupervised learning.

  8. What’s the use of a pivot table?

  9. Why is .loc[] important in pandas?

  10. What does one-hot encoding solve?

SQL

  • What is a subquery?

    • A SELECT query that is enclosed inside another query.

    • The inner SELECT query is usually used to determine the results of the outer SELECT query.

    • SELECT, INSERT, UPDATE, DELETE

    • MySQLWorkbench?

SELECT dept 
FROM employees
WHERE salary = (SELECT MAX(salary) from employees);
  • INSERT: uses data returned from the subquery to insert into another table

INSERT INTO orders
SELECT product_id, product_name, sell_price
FROM products
WHERE product_id IN SELECT (product_id FROM projects WHERE sell_price > 1000);
  • UPDATE: Either a single or multiple columns can be updated when using a subquery with the update statement.

UPDATE employees
SET SALARY = SALARY * 0.35 
WHERE age IN (SELECT age FROM employees_b WHERE age >= 27);
  • DELETE: DELETE from

DELETE FROM employees
WHERE age IN (SELECT age FROM employees_b WHERE age >= 27);

ML

  • Terminology

    • Ranking: Optimizing machine learning models to rank candidates, such as music, articles, or products. Typically, the goal is to order the candidates such that the candidates which are most likely to be interacted with (purchased, viewed, liked, etc…) are above other candidates that aren’t as likely to be interacted with.

    • Supervised Learning: Optimizing machine learning models based on previously observed features and labels. Typically, the goal is to attach the most likely label to some provided features.

    • Unsupervised Learning: An approach within machine learning that takes in unlabeled examples and produces patterns from the provided data. Typically, the goal is to discover something previously unknown about the unlabeled examples.

    • Deep Learning: Optimizing neural networks, often with many hidden layers, to perform unsupervised or supervised learning.

    • Recommendation Systems: Systems with the goal of presenting an item to a user such that the user will most likely purchase, view, or like the recommended item. Items can take many forms, such as music, movies, products, etc… Also called a recommender system.

    • Feature: A set of quantities or properties describing an observation. They can be binary like “day” and “night”; categorical like “morning, “afternoon”, and “evening”; continuous like 3.141; or ordinal like “big”, “medium”, “large”, where the categories can be ordered.

    • Labels: Usually paired with a set of features for use in supervised learning. Can be discrete or continuous.

    • Examples: Pairs of features and labels

    • Dimensions: Here, the number of features associated with a particular example.

    • Vector: A feature vector, which is a list of features representing a particular example.

    • Matrix: An array of values, usually consisting of multiple rows and columns.

    • Matrix Transpose: An operator that flips a matrix over it’s diagonal.

    • Polynomial: A function with more than one variable / coefficient pair.

    • Derivative: Indicates how much the output of a function will change with respect to a change in it’s input.

    • Probability: How likely something is to occur. This can be independent, such as the roll of the dice, or conditional, such as drawing two cards subsequently our of a deck without replacement.

    • Probability Distribution: A function that takes in an outcome and outputs the probability of that particular outcome occurring.

    • Gaussian Distribution: A very common type of probability distribution which fits many real world observations; also called a normal distribution.

    • Uniform Distribution: A probability distribution in which earn outcome is equally likely; for example, rolling a normal six-sided die.

  • xbar: It’s a vector? I thought this meant it’s a mean?

  • Matrix are assigned to a capital letter

OOP

  • Abstract Classes: A class that contains at least one abstract method and is not meant to be instantiated. Abstract classes are meant to act as a parent or base class in the inheritance hierarchy. Typically abstract classes implement some funcitonality that can be used commonly by all child or subclasses.

    • A class that never has any instances of it.

    • Should not be instantiated.

    • A base class, superclass or parent class.

    • Can implement methods

  • Abstract Method: A method that is defined in an interface or abstract class and does not provide an implementation. Abstract methods are designed to be overridden by base or subclasses that extend the class or implement the interface they’re defined in.

    • required to be implemented by any class that inherits from this abstract class

    • What is @abstractmethod?

  • Instance vs Static Methods

  • @staticmethod

Pandas

  • df.set_index()

  • df.reset_index()

    • drop=True → Allows you to discard an index.

  • Indexes make subsetting cleaner

  • .loc[] is used for subsetting → accepts index values

  • df.sort_index()