1/114
Flashcards from lecture notes to help you prepare for the exam.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Row (Pandas I)
Represents one observation in tabular data.
Column (Pandas I)
Represents a characteristic or feature of an observation in tabular data.
Tabular data
Data in a table made up of rows and columns
DataFrame
A table in the language of pandas; a collection of columns called series.
Series
A sequence of values of the same type with a sequence of data labels called the index.
s.index
Accesses the indices of a series. Example: RangeIndex(start = 0, stop = 3, step = 1)
s.values
Accesses the values of a series. Example: array(['welcome', 'to', 'data100'], dtype=object)
Syntax for creating a DataFrame
pandas.DataFrame(data, index, columns)
Creating a DataFrame from a CSV
elections = pd.read_csv("data/elections.csv")
Setting the index of a DataFrame
elections = pd.readcsv("data/elections.csv", indexcol="Year") (The Year column is now the index column, usable for the loc function.)
Resetting the index column
elections.reset_index()
Extracting the first n rows
df.head(n)
Extracting the last n rows
df.tail(n)
loc
Allows us to specify the labels of rows and columns to extract. df.loc[rowlabels, columnslabels]
iloc
Used to extract data according to its position (integer-based). df.iloc[rowintegers, columnintegers]
~ (Bitwise operator)
NOT p
| (Bitwise operator)
p OR q
& (Bitwise operator)
p AND q
^ (Bitwise operator)
p XOR q
.isin
Alternative to direct boolean array selection. Returns a boolean Series that is True when the value is in the specified list.
Renaming a column
babynames = babynames.rename(columns={"name_lengths":"Length"}) The rename() function takes in a dictionary.
.shape
Returns the number of rows and the number of columns of a DataFrame or Series.
.size
Returns the number of rows times the number of columns of a DataFrame or Series.
.describe()
Returns a description of the DataFrame or Series that lists summary statistics of the data.
.sample()
Returns a random sample of items. By default, without replacement. Use replace=True for sampling with replacement.
.value_counts()
Counts the number of occurrences of each unique value in a Series. Return type is a Series.
.unique()
Returns an array of every unique value in a series.
.sort_values()
Sorts the values in a Series or DataFrame. Can sort a DataFrame by a specific column.
What is Grouping?
To group together rows that fall under the same category. Used to perform an operation that aggregates across all rows in the category. Used to: perform large operations at once, summarize trends in a dataset
Required groupby operations
Splitting the object, applying a function, combining the results
Groupby function result
Creates mini sub-DataFrames. Each subframe contains all rows that correspond to the same group.
groupby().size()
Returns a Series object counting the number of rows in each group.
groupby().count()
Returns a Dataframe with the counts of non-missing values in each column.
groupedbyparty.groups
A dictionary with all the groups mapping to an array of indices that belong to each group.
groupedbyparty.get_group("Socialist")
The rows in the form of a dataframe that belong to the specified group
pd.merge()
Used to join tables in pandas. Example: merged = pd.merge(left = elections, right = babynames2022, lefton = "First Name", right_on = "Name")
Goals of Data Visualization
To help your own understanding of your data/results. To communicate results/conclusions to others.
Quantitative Variable
Continuous (weather, height) or Discrete (cars sold in a day, num of children in a family)
Qualitative Variable
Ordinal (rankings) or Nominal (shirt sizes)
Matplotlib
A library used for plotting/generating plots: import matplotlib.pyplot as plt
plt.plot(xvalues, yvalues)
plotting the x and y values
plt.xlabel("X axis label")
label for the x axis
plt.ylabel("Y axis label")
label for the y axis
plt.title("Plot title")
title for the entire plot
Seaborn
Data visualization library based on matplotlib: import seaborn as sns
sns.countplot(data=wb, x = "Continent")
Bar plot using seaborn. Seaborn does the counting.
Skew of a histogram
Describes the direction in which the “tail” extends. A distribution with a long right tail is skewed right; a distribution with a long left tail is skewed left.
sns.displot()
Provided by seaborn and is a wrapper for histplot, kdeplot, and ecdfplot.
sns.scatterplot(data = df, x = "xcolumn", y = "ycolumn", hue = "hue_co
Used to create scatterplots.
Tukey-Mosteller Bulge Diagram
A guide to possible transformations to get linearity. Used to help make data appear more symmetric. Linearity allows us to fit lines to the transformed data.
Root Mean Square Error (RMSE)
The average loss. Lower loss = more accurate predictions.
Residual Plot
Error plot that visualizes the difference between actual and predicted values.
Anscombe’s Quartet
A famous example highlighting the importance of visualizing data before modeling. The Least Squares SLR model depends on the means of x and y, the standard deviations of x and y, and the r value.
MSE
Mean squared error (loss). Minimized by the sample mean.
MAE
Mean absolute error (loss). Minimized by the sample median.
Correlation
The average of the product of x and y, both measured in standard units. Ranges between -1 and 1.
L1 Loss
Absolute loss, equated to MAE: reasonable because good prediction → good fit → no loss; far from correct answer → bad prediction → bad fit → some loss
L2 Loss
Squared loss, equated to MSE: good prediction → good fit → no loss; far from correct answer → bad prediction → bad fit → lots of loss
Gradient Descent (Batch Descent)
Computes true gradient and always descends towards true minimum loss
Stochastic Gradient Descent (Mini Batch Descent)
Approximates the true gradient and may not descend towards the true minimum with each update. May get stuck at a local min, bounces around erratically
Feature Engineering
Transforming features to improve model performance.
One-Hot Encoding
A feature engineering technique to transform qualitative data into numeric features for modeling. Each category gets its own feature, with value = 1 if a row belongs to the category, value = 0 otherwise.
Model Variance
A model fits training data very well but performs poorly on unseen data.
Record/Tuple (SQL)
A row in a table.
Field (SQL)
A column in a table.
Relation (SQL)
The table itself.
Schema (SQL)
Describes the logical structure of a table.
Primary Key (SQL)
The set of column(s) used to uniquely identify each record in the table. Ensures data integrity and optimizes data access.
Simplest SQL Query
SELECT * FROM Dragon; (* is shorthand for “all columns”. Every query must include a SELECT and FROM clause.)
Rename column (SQL)
SELECT cute AS cuteness, year AS birth FROM Dragon;
SELECT DISTINCT year FROM Dragon
Return only the unique values from a column
SELECT name, year FROM Dragon WHERE cute > 0
Selects a row based on a condition
SELECT name, year FROM Dragon WHERE name='hiccup' or name='puff'
Select a row based on multiple conditions.
NULL SQL
Does not work with standard comparisons (=, >, <). Check if something IS or IS NOT NULL.
ORDER BY
Specifies the order to sort the data. Default order is ASC (small to large).
SQL Grouping
SELECT type from Dish GROUP BY type;
SQL LIMIT Keyword
LIMIT
SQL WHERE Keyword
SQL GROUP BY Keyword
SQL ORDER BY Keyword
SQL OFFSET Keyword
SQL HAVING Keyword
SQL AS Keyword
Used to rename columns. Column expressions may include aggregation functions
HAVING
Filters groups by applying some condition across all rows in each group. Same as filter in groupby(”type”).filter(lambda f: condition).
LIKE
Used to perform simple text comparison. E.g., SELECT titleType, primaryTitle FROM Title WHERE primaTitle LIKE '%Star Wars%';
% (SQL)
Wildcard character that means “look for any character, any number of times”.
_ (SQL)
Wildcard character that means look for exactly 1 character
CAST (SQL)
Converts a column into a different data type. Example: SELECT primaryTitle, CAST(runtimeMinutes AS INT) FROM Title
CASE (SQL)
Creates a new column based on conditions. Example: SELECT titleType, startYear, CASE WHEN startYEAR < 1950 THEN 'old' WHEN startYear < 200 then 'mid-aged' ELSE 'new'
Inner Join
Combines every row from the first table with its matching entry in the second table. If there is no match, the row is omitted.
Cross Join
Finds every possible combination of rows across two tables; also called a Cartesian product. SELECT * FROM s CROSS JOIN t
Left Outer Join
Keeps all rows from the left table and only matching rows from the right table. Also known as Left Join.
Right Outer Join
Keeps all rows from the right table and only matching rows from the left table. Also known as Right Join.
Full Outer Join
Keeps all rows from both the left and right tables, filling missing values with NULL.
Alias (SQL)
Temporary names that are easier to refer to. Allows to reference columns using aliased table names.
Regression vs. Classification
Regression uses quantitative features to predict a quantitative unbounded output. Classification uses quantitative features to predict a categorical variable.
Decision Boundary
Describes the line that splits the data into classes based on its features; for logistic regression, it is a hyperplane
Linearly Separable
A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y
Confusion Matrix
A table that plots true positives, true negatives, false positives, and false negatives for a particular classifier (threshold) and dataset.
Precision
TP / (TP + FP). How many retrieved items are actually relevant? Focuses on the correctness of the model’s positive predictions.