Final Exam Review

0.0(0)

Studied by 0 people

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Card Sorting

1/114

Earn XP

Description and Tags

Flashcards from lecture notes to help you prepare for the exam.

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

115 Terms

New cards

Row (Pandas I)

Represents one observation in tabular data.

New cards

Column (Pandas I)

Represents a characteristic or feature of an observation in tabular data.

New cards

Tabular data

Data in a table made up of rows and columns

New cards

DataFrame

A table in the language of pandas; a collection of columns called series.

New cards

Series

A sequence of values of the same type with a sequence of data labels called the index.

New cards

s.index

Accesses the indices of a series. Example: RangeIndex(start = 0, stop = 3, step = 1)

New cards

s.values

Accesses the values of a series. Example: array(['welcome', 'to', 'data100'], dtype=object)

New cards

Syntax for creating a DataFrame

pandas.DataFrame(data, index, columns)

New cards

Creating a DataFrame from a CSV

elections = pd.read_csv("data/elections.csv")

New cards

Setting the index of a DataFrame

elections = pd.readcsv("data/elections.csv", indexcol="Year") (The Year column is now the index column, usable for the loc function.)

New cards

Resetting the index column

elections.reset_index()

New cards

Extracting the first n rows

df.head(n)

New cards

Extracting the last n rows

df.tail(n)

New cards

loc

Allows us to specify the labels of rows and columns to extract. df.loc[rowlabels, columnslabels]

New cards

iloc

Used to extract data according to its position (integer-based). df.iloc[rowintegers, columnintegers]

New cards

~ (Bitwise operator)

NOT p

New cards

| (Bitwise operator)

p OR q

New cards

& (Bitwise operator)

p AND q

New cards

^ (Bitwise operator)

p XOR q

New cards

.isin

Alternative to direct boolean array selection. Returns a boolean Series that is True when the value is in the specified list.

New cards

Renaming a column

babynames = babynames.rename(columns={"name_lengths":"Length"}) The rename() function takes in a dictionary.

New cards

.shape

Returns the number of rows and the number of columns of a DataFrame or Series.

New cards

.size

Returns the number of rows times the number of columns of a DataFrame or Series.

New cards

.describe()

Returns a description of the DataFrame or Series that lists summary statistics of the data.

New cards

.sample()

Returns a random sample of items. By default, without replacement. Use replace=True for sampling with replacement.

New cards

.value_counts()

Counts the number of occurrences of each unique value in a Series. Return type is a Series.

New cards

.unique()

Returns an array of every unique value in a series.

New cards

.sort_values()

Sorts the values in a Series or DataFrame. Can sort a DataFrame by a specific column.

New cards

What is Grouping?

To group together rows that fall under the same category. Used to perform an operation that aggregates across all rows in the category. Used to: perform large operations at once, summarize trends in a dataset

New cards

Required groupby operations

Splitting the object, applying a function, combining the results

New cards

Groupby function result

Creates mini sub-DataFrames. Each subframe contains all rows that correspond to the same group.

New cards

groupby().size()

Returns a Series object counting the number of rows in each group.

New cards

groupby().count()

Returns a Dataframe with the counts of non-missing values in each column.

New cards

groupedbyparty.groups

A dictionary with all the groups mapping to an array of indices that belong to each group.

New cards

groupedbyparty.get_group("Socialist")

The rows in the form of a dataframe that belong to the specified group

New cards

pd.merge()

Used to join tables in pandas. Example: merged = pd.merge(left = elections, right = babynames2022, lefton = "First Name", right_on = "Name")

New cards

Goals of Data Visualization

To help your own understanding of your data/results. To communicate results/conclusions to others.

New cards

Quantitative Variable

Continuous (weather, height) or Discrete (cars sold in a day, num of children in a family)

New cards

Qualitative Variable

Ordinal (rankings) or Nominal (shirt sizes)

New cards

Matplotlib

A library used for plotting/generating plots: import matplotlib.pyplot as plt

New cards

plt.plot(xvalues, yvalues)

plotting the x and y values

New cards

plt.xlabel("X axis label")

label for the x axis

New cards

plt.ylabel("Y axis label")

label for the y axis

New cards

plt.title("Plot title")

title for the entire plot

New cards

Seaborn

Data visualization library based on matplotlib: import seaborn as sns

New cards

sns.countplot(data=wb, x = "Continent")

Bar plot using seaborn. Seaborn does the counting.

New cards

Skew of a histogram

Describes the direction in which the “tail” extends. A distribution with a long right tail is skewed right; a distribution with a long left tail is skewed left.

New cards

sns.displot()

Provided by seaborn and is a wrapper for histplot, kdeplot, and ecdfplot.

New cards

sns.scatterplot(data = df, x = "xcolumn", y = "ycolumn", hue = "hue_co

Used to create scatterplots.

New cards

Tukey-Mosteller Bulge Diagram

A guide to possible transformations to get linearity. Used to help make data appear more symmetric. Linearity allows us to fit lines to the transformed data.

New cards

Root Mean Square Error (RMSE)

The average loss. Lower loss = more accurate predictions.

New cards

Residual Plot

Error plot that visualizes the difference between actual and predicted values.

New cards

Anscombe’s Quartet

A famous example highlighting the importance of visualizing data before modeling. The Least Squares SLR model depends on the means of x and y, the standard deviations of x and y, and the r value.

New cards

MSE

Mean squared error (loss). Minimized by the sample mean.

New cards

MAE

Mean absolute error (loss). Minimized by the sample median.

New cards

Correlation

The average of the product of x and y, both measured in standard units. Ranges between -1 and 1.

New cards

L1 Loss

Absolute loss, equated to MAE: reasonable because good prediction → good fit → no loss; far from correct answer → bad prediction → bad fit → some loss

New cards

L2 Loss

Squared loss, equated to MSE: good prediction → good fit → no loss; far from correct answer → bad prediction → bad fit → lots of loss

New cards

Gradient Descent (Batch Descent)

Computes true gradient and always descends towards true minimum loss

New cards

Stochastic Gradient Descent (Mini Batch Descent)

Approximates the true gradient and may not descend towards the true minimum with each update. May get stuck at a local min, bounces around erratically

New cards

Feature Engineering

Transforming features to improve model performance.

New cards

One-Hot Encoding

A feature engineering technique to transform qualitative data into numeric features for modeling. Each category gets its own feature, with value = 1 if a row belongs to the category, value = 0 otherwise.

New cards

Model Variance

A model fits training data very well but performs poorly on unseen data.

New cards

Record/Tuple (SQL)

A row in a table.

New cards

Field (SQL)

A column in a table.

New cards

Relation (SQL)

The table itself.

New cards

Schema (SQL)

Describes the logical structure of a table.

New cards

Primary Key (SQL)

The set of column(s) used to uniquely identify each record in the table. Ensures data integrity and optimizes data access.

New cards

Simplest SQL Query

SELECT * FROM Dragon; (* is shorthand for “all columns”. Every query must include a SELECT and FROM clause.)

New cards

Rename column (SQL)

SELECT cute AS cuteness, year AS birth FROM Dragon;

New cards

SELECT DISTINCT year FROM Dragon

Return only the unique values from a column

New cards

SELECT name, year FROM Dragon WHERE cute > 0

Selects a row based on a condition

New cards

SELECT name, year FROM Dragon WHERE name='hiccup' or name='puff'

Select a row based on multiple conditions.

New cards

NULL SQL

Does not work with standard comparisons (=, >, <). Check if something IS or IS NOT NULL.

New cards

ORDER BY

Specifies the order to sort the data. Default order is ASC (small to large).

New cards

SQL Grouping

SELECT type from Dish GROUP BY type;

New cards

SQL LIMIT Keyword

LIMIT . Example: SELECT * FROM Dragon LIMIT 4

New cards

SQL WHERE Keyword

. Example: SELECT postid FROM posts WHERE hasmedia = 1

New cards

SQL GROUP BY Keyword

. Example: SELECT hasmedia FROM posts GROUP BY hasmedia;

New cards

SQL ORDER BY Keyword

. Example: SELECT * FROM Dragon ORDER BY year

New cards

SQL OFFSET Keyword

. Example: SELECT * FROM posts ORDER BY id OFFSET 3

New cards

SQL HAVING Keyword

. For Example: SELECT * FROM Dish WHERE cost > 4 GROUP BY type HAVING MAX(cost) < 10;

New cards

SQL AS Keyword

Used to rename columns. Column expressions may include aggregation functions

New cards

HAVING

Filters groups by applying some condition across all rows in each group. Same as filter in groupby(”type”).filter(lambda f: condition).

New cards

Used to perform simple text comparison. E.g., SELECT titleType, primaryTitle FROM Title WHERE primaTitle LIKE '%Star Wars%';

New cards

% (SQL)

Wildcard character that means “look for any character, any number of times”.

New cards

_ (SQL)

Wildcard character that means look for exactly 1 character

New cards

CAST (SQL)

Converts a column into a different data type. Example: SELECT primaryTitle, CAST(runtimeMinutes AS INT) FROM Title

New cards

CASE (SQL)

Creates a new column based on conditions. Example: SELECT titleType, startYear, CASE WHEN startYEAR < 1950 THEN 'old' WHEN startYear < 200 then 'mid-aged' ELSE 'new'

New cards

Inner Join

Combines every row from the first table with its matching entry in the second table. If there is no match, the row is omitted.

New cards

Cross Join

Finds every possible combination of rows across two tables; also called a Cartesian product. SELECT * FROM s CROSS JOIN t

New cards

Left Outer Join

Keeps all rows from the left table and only matching rows from the right table. Also known as Left Join.

New cards

Right Outer Join

Keeps all rows from the right table and only matching rows from the left table. Also known as Right Join.

New cards

Full Outer Join

Keeps all rows from both the left and right tables, filling missing values with NULL.

New cards

Alias (SQL)

Temporary names that are easier to refer to. Allows to reference columns using aliased table names.

New cards

Regression vs. Classification

Regression uses quantitative features to predict a quantitative unbounded output. Classification uses quantitative features to predict a categorical variable.

New cards

Decision Boundary

Describes the line that splits the data into classes based on its features; for logistic regression, it is a hyperplane

New cards

Linearly Separable

A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y

New cards

Confusion Matrix

A table that plots true positives, true negatives, false positives, and false negatives for a particular classifier (threshold) and dataset.

100

New cards

Precision

TP / (TP + FP). How many retrieved items are actually relevant? Focuses on the correctness of the model’s positive predictions.