Pandas Series
1D homogeneous (all entries of the same type) labeled array.
Pandas DataFrame
2D heterogeneous table (columns can have different types); each column is a Pandas Series.
How to import a Python library
import {library} as {alias} (examples: import pandas as pd, import numpy as np)
Index
The row labels of a Pandas Series or DataFrame, used to look up rows explicitly (for example with .loc).
TSV and CSV files
Tab-Separated Values and Comma-Separated Values: plain-text table formats where the fields in each row are separated by tabs or commas.
Import a CSV file as a DataFrame
import pandas as pd
dataframe = pd.read_csv('csv_title.csv', index_col = 'index')
.head()
Displays the first five rows of a DataFrame by default; .head(n) shows the first n.
.tail()
Displays the last five rows of a DataFrame by default; .tail(n) shows the last n.
Sort a DataFrame with .sort_values()
sorted_df = dataframe.sort_values(by='example_col', ascending=False)
Access one column in a DataFrame
dataframe['access_col']
Access one row in a DataFrame
df.iloc[0]
Access one cell in a DataFrame
Row comes first
dataframe.loc['example_row', 'access_col']
Indexing with conditions
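A comparison such as dataframe['example_col'] > 5 creates a Series of booleans. Passing it back to the DataFrame, as in dataframe[dataframe['example_col'] > 5], returns a smaller table with only the rows where the condition is True. You can combine multiple criteria with & (and) and | (or), wrapping each condition in parentheses.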
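A minimal sketch of indexing with conditions. The column names, row labels, and numbers are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical data, invented for this example only.
dataframe = pd.DataFrame(
    {"homers": [33, 21, 8], "age": [27, 28, 24]},
    index=["player_a", "player_b", "player_c"],
)

# A comparison produces a boolean Series, one value per row.
mask = dataframe["homers"] > 20

# Indexing with the mask keeps only the rows where it is True.
power_hitters = dataframe[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
young_power = dataframe[(dataframe["homers"] > 20) & (dataframe["age"] < 28)]
```

Here power_hitters keeps player_a and player_b, and young_power keeps only player_a.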
Computing new columns
dataframe['new_col'] = dataframe['col_1'] + dataframe['col_2']
.max(), .min(), .mean()
Returns the maximum value, minimum value, and mean value in a column
.idxmax(), .idxmin()
Returns the index of the maximum or minimum value in a column
.describe()
Returns the count of entries, mean value, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum of each column in the DataFrame
.corr()
Returns the correlation (-1 to 1) between pairs of columns. -1 indicates a perfectly inverse linear relationship between the values of two variables, 1 indicates a perfectly positive linear relationship, and 0 indicates no linear relationship.
.columns
No parentheses! Returns the names of each column.
.dtypes
No parentheses! Returns the types of each column, since each Series can only hold one type.
.isnull(), .dropna()
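.isnull() marks missing values (NaN) with True. .dropna() returns a copy of the DataFrame without the rows (or, with axis=1, the columns) that contain missing values.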
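A short sketch of finding and dropping missing values. The column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing score (None is stored as NaN).
dataframe = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, None, 3.0]})

# Boolean Series: True where the value is missing.
missing = dataframe["score"].isnull()

# New DataFrame without the incomplete row; the original is left unchanged.
cleaned = dataframe.dropna()
```

Note that .dropna() does not modify dataframe itself, so the result has to be assigned to a name to be kept.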
.iterrows()
Iterates through every row in a DataFrame, yielding (index, row) pairs. To use it, write 'for index, row in df.iterrows():'.
Histogram
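A chart that shows the distribution of a numeric column by grouping its values into ranges (bins) and drawing one bar per bin. You can plot one from a DataFrame by using .hist(bins=x). This creates a histogram with x bins.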
Box Plot
Created using .boxplot(), this gives…
median value
middle 50% of data
range of non-outliers
F-strings
Formatted strings. Easier to read and write than concatenated normal strings, because values are inserted directly inside braces. An example of a formatted string, if cost = 15.0925, would be f'The total cost was {cost:.2f} dollars.', which gives 'The total cost was 15.09 dollars.' The :.2f rounds the number to two decimal places.
.split()
Turns a string with multiple parts separated by the argument into a list of separate strings.
If redsoxplayers = 'Abreu, Devers, Duran, Casas', and sox_split = redsoxplayers.split(', ') is called…
sox_split = ['Abreu', 'Devers', 'Duran', 'Casas']. (Splitting on ',' alone would leave a leading space on every name after the first.)
.join()
The opposite of .split(). You need to call this on the separator and input a list though. ','.join(['Abreu','Devers','Duran','Casas']) = 'Abreu,Devers,Duran,Casas'.
.strip()
Strips whitespace off the ends of strings.
.splitlines()
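Splits a string into a list of lines. Similar to .split('\n'), but it also handles other line endings such as '\r\n' and does not add an empty string for a trailing newline.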
.startswith(), .endswith()
Returns True if the string starts or ends with the argument, and False otherwise.
in operator
Detects if a substring is in a string.
Calling string functions on DataFrames
Use the .str accessor on a column (a Series). For example, dataframe['example_col'].str.strip().
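A small sketch of string methods applied to a whole column through .str. The column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical column with stray whitespace around the names.
dataframe = pd.DataFrame({"player": ["  Abreu", "Devers  ", " Duran "]})

# .str applies the string method to every entry in the column.
dataframe["player"] = dataframe["player"].str.strip()

# Methods that return booleans give a boolean Series, usable as a condition.
starts_with_d = dataframe["player"].str.startswith("D")
```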
Regular Expressions ("Regex")
Search for patterns in the data. You need to import re!
Regex escape sequence for any digit
\d
Regex escape sequence for any whitespace
\s
Regex escape sequence for any alphanumeric character (or underscore)
\w
Regex quantifier for zero or more of the preceding item
*
Regex quantifier for one or more of the preceding item
+
Regex quantifier for an item that may be there or not (zero or one)
?
Regex or operator
(option1|option2)
Regex capturing information in groups
Use parentheses ()
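A sketch that puts the regex pieces above together using Python's re module. The sample text and patterns are made up for illustration.

```python
import re

# Hypothetical text to search.
text = "Devers #11 hit 33 homers; Duran #16 hit 21 homers"

# \w+ : one or more word characters, \s : one whitespace, \d+ : one or more digits.
# Parentheses capture the name and the number as separate groups.
pattern = r"(\w+)\s#(\d+)"

first = re.search(pattern, text)
name, number = first.group(1), first.group(2)

# With groups in the pattern, findall returns one tuple of groups per match.
all_players = re.findall(pattern, text)

# | matches either alternative.
either = re.findall(r"Devers|Duran", text)

# ? makes the preceding "s" optional, so both spellings match.
optional = re.findall(r"homers?", "1 homer, 2 homers")
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.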
with, open()
Used for reading and writing files. open() takes a filename string and returns a file object if the file is found (when reading a file that does not exist, it raises FileNotFoundError). The with keyword cleans up resources associated with the file, for example by closing the file after it is used.
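A minimal sketch of with and open(). The file name is hypothetical and lives in a temporary directory so that nothing real is overwritten.

```python
import os
import tempfile

# Hypothetical file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "roster.txt")

# "w" opens the file for writing; it is closed automatically when the block ends.
with open(path, "w") as f:
    f.write("Abreu\nDevers\n")

# The default mode "r" opens the file for reading.
with open(path) as f:
    lines = f.read().splitlines()

# After the with block, the file object is already closed.
file_is_closed = f.closed
```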
Reader and writer
Objects that read and write files.
!ls
Lists the files in the current directory (a shell command run from a notebook cell).
JSON
An alternative to CSV for storing structured data. To write a dictionary to a file as JSON, call json.dump(dict, file), passing the dictionary and the open file to write it into. JSON files can be read back into dictionaries with json.load(file).
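A short round-trip sketch with the json module. The dictionary contents and file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "player.json")

player = {"name": "Devers", "number": 11, "positions": ["3B", "DH"]}

# Write the dictionary to the file as JSON text.
with open(path, "w") as f:
    json.dump(player, f)

# Read the JSON text back into a dictionary.
with open(path) as f:
    loaded = json.load(f)
```

The loaded dictionary compares equal to the original, since strings, numbers, and lists all survive the round trip.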
Serialization
Converting data into a format that can be stored in a file (or sent elsewhere) and read back later.
pandas.read_csv(filename, index_col)
Reads a CSV file directly into a DataFrame.
df.to_csv(filename)
Writes a DataFrame to a CSV file.
Exceptions
Objects that are raised when an error occurs while the code is running.
Examples are FileNotFoundError, ZeroDivisionError, and ValueError (occurs when attempting to parse a non-int string as int).
If an exception isn't caught by the program, the program terminates immediately.
Try not to generate exceptions, even if there are ways to work around them.
try and except keywords
If an exception occurs within the try block, the code jumps to the matching except block. This prevents the program from crashing when an error occurs.
Else and finally
An else block can follow the except blocks; it runs only if the try block raised no exception.
A finally block runs last, whether or not an exception occurred.
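A sketch showing which blocks run in each case. The function name parse_int and the log list are illustrative choices, not part of any library.

```python
def parse_int(text):
    """Return (value, log), where log records which blocks ran."""
    log = []
    try:
        value = int(text)  # raises ValueError for a non-int string
    except ValueError:
        log.append("except")  # runs only if the try block failed
        value = None
    else:
        log.append("else")  # runs only if the try block succeeded
    finally:
        log.append("finally")  # runs in both cases
    return value, log


good = parse_int("42")
bad = parse_int("abc")
```

For "42" the log is ['else', 'finally']; for "abc" it is ['except', 'finally'].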
Object-oriented programming
Programming that organizes code around classes and the objects created from them.
How to define a class
class ExampleClass:
pass keyword
Use to create an empty class ("nothing interesting here")
Create (instantiate) objects of a class
example1 = ExampleClass()
example2 = ExampleClass()
isinstance()
Checks if an object is an instance of a certain class and returns a boolean.
isinstance(example2, ExampleClass) returns True
isinstance(not_example, ExampleClass) returns False
Attribute
Variables associated with an object or class. Can be defined ahead of time, for example in the constructor.
Method in object
class ExampleClass:
    def example_function(self):
        pass
ALL methods in objects must take self as their first parameter; it represents the object itself.
Constructor method
Sets object attributes for the first time.
Always named "__init__" in Python.
class ExampleClass:
    def __init__(self, attribute1, attribute2):
        self.attribute1 = attribute1
        self.attribute2 = attribute2
Instance
One object created from a class.
Getter
Returns an attribute of an object.
def get_attribute(self):
    return self.attribute
Try to avoid direct attribute access if you can.
Setter
Sets an attribute of an object.
def set_attribute(self, attribute):
    self.attribute = attribute
Try to avoid direct attribute access if you can.
Validation in the Constructor
Constructors can do more than initialize attributes! They can also check that a value is valid for the object.
class GoodTeams:
    def __init__(self, championships_21st_century):
        if championships_21st_century < 3:
            raise ValueError("This team is a bunch of bums, like the New York Yankees!")
        else:
            print("Wow! What an elite team, such as the Boston Red Sox and the Patriots!")
        self.championships_21st_century = championships_21st_century
Default parameter values
Initialize a default value of an attribute.
class HomeRuns:
    def __init__(self, homers = 0):
        self.homers = homers
HomeRuns().homers returns 0.
What should be an object?
Something that has multiple attributes attached to it and needs several functions that relate to it specifically.
Inheritance
A way of sharing code between classes.
The "child" class inherits from the "parent" class, which means it has all the code of the parent class plus any extra code that is entered into the child class.
How to create a child class
class Child(Parent):
Child classes are instantiated just like parent classes. You can also use the pass keyword to indicate that the child class does nothing but inherit from the parent, which seems useless but can be functional, for example if you want to use isinstance() to check whether an instance belongs to a certain class.
super()
A function that gives a child class access to the parent class, so it can call the parent's methods (most often the parent's __init__).
class Player:
    def __init__(self, age, number, position):
        self.age = age
        self.number = number
        self.position = position
class Pitcher(Player):
    def __init__(self, age, number, position, velocity, is_starter):
        super().__init__(age, number, position)
        self.velocity = velocity
        self.is_starter = is_starter
When to use inheritance
If A inherits from B, A should satisfy an "is-a" relationship with B.
For example, every baseball pitcher is a baseball player.
Since not every baseball player is a pitcher, the reverse would not make sense.
Refactoring
Restructuring and improving existing code without changing what it does, for example so that a child class builds upon code existing in a parent class instead of repeating it.
Override a method
Rewrite a method in a child class to make it work differently from its equivalent in the parent class.
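A minimal sketch of overriding. The class and method names here are illustrative, chosen to echo the baseball examples in this set.

```python
class Player:
    def describe(self):
        return "I am a player."


class Pitcher(Player):
    # Same method name as in the parent: this version is used for Pitcher objects.
    def describe(self):
        # super() still reaches the parent's version when it is needed.
        return super().describe() + " I pitch."


player_text = Player().describe()
pitcher_text = Pitcher().describe()
```

Calling describe() on a Player gives "I am a player.", while a Pitcher gives "I am a player. I pitch."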
Recursion
A function calling itself.
def countdown(num):
    print(num)
    if num > 0:
        countdown(num - 1)
countdown(5) prints:
5
4
3
2
1
0
Infinite recursion
A recursive function that never stops calling itself. Always avoid this by making sure your recursive function eventually reaches a condition that causes it to stop (the base case). In Python, runaway recursion ends with a RecursionError.
Base case
The condition that prevents the recursive function from running infinitely.
One of two parts of recursion.
Recursive call
The call a function makes to itself. One of two parts of recursion.
Data structure
An object that holds and organizes data.
Simple examples are lists and dictionaries; more complicated ones are DataFrames and linked lists.
Linked list
A system of nodes where each node holds both a value and a link to another node.
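A minimal sketch of a linked list, assuming a hypothetical Node class and a helper named to_list; neither comes from a library.

```python
class Node:
    def __init__(self, val, next_node=None):
        self.val = val          # the value stored in this node
        self.next = next_node   # link to the next node, or None at the end


def to_list(head):
    """Follow the links from head and collect every value in order."""
    values = []
    node = head
    while node is not None:
        values.append(node.val)
        node = node.next
    return values


# Three nodes linked together: 1 -> 2 -> 3
head = Node(1, Node(2, Node(3)))
values = to_list(head)
```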
Trees
A structure of nodes, like a linked list, but with multiple links from "parent" nodes to "child" nodes.
The node at the top of the tree (with no parents itself) is the root.
A node with no children is called a leaf.
Binary tree
A tree in which each parent node has at most 2 children.
Typical implementation of a tree (detailed for people like me who struggle with trees!)
class Tree:
    def __init__(self, val):  # constructor
        self.left = None
        self.right = None  # a new node starts with no children
        self.val = val  # stores the value passed in when the node is created
    # adding specified nodes (this is a binary tree)
    def addLeft(self, node):
        self.left = node
    def addRight(self, node):
        self.right = node
    def find(self, v):  # finds a value in the tree
        if self.val == v:  # if the value is here, return True
            return True
        # check all child nodes recursively
        if self.left and self.left.find(v):
            return True
        if self.right and self.right.find(v):
            return True
        # if not found
        return False
Supervised machine learning
Learns from labeled training data in order to predict values for new, unseen data (such as the test data).
Two types are classification and regression.
Classification
Goal: To predict a categorical label or discrete outcome (e.g., True/False, Red/Blue/Green).
Output: Labels/Classes.
Regression
Goal: To predict a continuous, numerical output variable.
Output: A continuous numerical value.
Classification
One type of supervised machine learning that assigns items to categories.
Regression
One type of machine learning that fits a function to data.
scikit-learn
A free and open-source machine learning library for Python.
k-nearest neighbors
Predicts a value for a data point based on the values of the k training points nearest to it.
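To make the idea concrete, here is a hand-written sketch of k-nearest neighbors classification in plain Python. It is not scikit-learn's implementation; the function name knn_predict and the toy points are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest points."""
    # Sort the training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Take the labels of the k closest points and pick the most common one.
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["low", "low", "low", "high", "high", "high"]

near_origin = knn_predict(points, labels, (0.5, 0.5))
near_corner = knn_predict(points, labels, (5.5, 5.5))
```

An odd k is a common choice for two classes, because it avoids tied votes.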
KNeighborsClassifier
scikit-learn's class for k-nearest neighbors classification.
from sklearn.neighbors import KNeighborsClassifier
Important: models are created as objects.
nbrs = KNeighborsClassifier(n_neighbors=3).fit(digits.data, digits.target)
Transforming
Turns raw data into usable data for machine learners.
Preprocessing
Like transforming but broader; for example, it can also handle missing values.
Pipeline
The sequence of machine learning steps: transforming and preprocessing, training, then predicting/testing.
.fit(X, y)
Learns the parameters of a preprocessing transformation, or trains the machine learner on the training data.
.score(X, y)
Runs the trained machine learner on the test data and returns its score (accuracy for classifiers).
.transform()
Applies a learned transformation to new data.
.fit_transform()
Combines .fit() and .transform().
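A sketch of the fit/transform pattern, using scikit-learn's StandardScaler as one example of a transformer. The tiny arrays are made up; the point is that the parameters are learned from the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, three training rows and two test rows.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the learned mean and standard deviation on the test data (no refitting).
X_test_scaled = scaler.transform(X_test)

learned_mean = scaler.mean_[0]
```

Calling only .transform() on the test data keeps information about the test set from leaking into the preprocessing step.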
Overfitting
This occurs when the model fits itself too closely to the training data, so it performs worse on new data.
To detect this, you should split the data into training and testing data and compare the model's performance on each.
Train/test split
Splitting one set of data into training and testing pieces.
train_test_split
Splits the dataset into training and testing data for you.
from sklearn.model_selection import train_test_split
#The four objects created here are then passed to other functions to represent the training and testing data. They can be named whatever you want, but X_train, X_test, y_train, y_test is good practice. By convention X is capitalized because it is a 2D matrix of features, while y is lowercase because it is a 1D vector of labels.
#X and y here stand for your own feature data and labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75, random_state=69, shuffle=True)
Relevant parameters:
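test_size and train_size control the ratio of testing data to training data. Given as fractions, they should add up to at most one; if only one is given, the other is set to its complement.
random_state sets the seed for the shuffling done before the data is split, so the same value gives the same split every time.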
shuffle determines whether or not the data should be shuffled. This is useful if you have ordered data where splitting it normally would cause bad sampling.
Random
Setting the seed of the random number generator makes operations that rely on randomness repeatable; otherwise, shuffling produces a different result on each run. The random_state parameter handles this in train_test_split.
Validation data
A portion split off from the training data, used to further evaluate the performance of a model during training.
Cross-validation
Repeatedly using different sections of the data as validation data. For example, with five sections:
-Training on the first 80%, then testing on the remaining 20%
-Then, testing on the 20% before that, while training on everything else (including the 20% that was previously used as the test data)
-Repeating until every section has been used as the test data once
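The rotation of sections can be sketched in plain Python. This is a hand-rolled illustration, not scikit-learn's implementation; the function name k_fold_indices is hypothetical, and it assumes the number of samples divides evenly by the number of sections. It starts with the first section as the test data, which is equally valid since the order of the rounds does not matter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k equal sections (folds)."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    indices = list(range(n_samples))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]                 # this round's held-out section
        train = indices[:start] + indices[stop:]   # everything else
        yield train, test


# 10 samples split into 5 sections of 2.
folds = list(k_fold_indices(10, 5))
first_train, first_test = folds[0]
```

Across the five rounds, every sample lands in the test section exactly once.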
cross_val_score
scikit-learn's function for cross-validation. Returns one score per round (accuracy by default for classifiers).