pd.concat
Concatenates multiple DataFrames or Series along a particular axis (rows or columns).
axis=0
stacks vertically (adds rows).
axis=1
stacks horizontally (adds columns).
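A minimal sketch of both axes, using made-up one-row frames:

```python
import pandas as pd

# Two small illustrative DataFrames with the same columns.
a = pd.DataFrame({"city": ["Leeds"], "pop": [800_000]})
b = pd.DataFrame({"city": ["York"], "pop": [200_000]})

# axis=0 stacks vertically: rows of b are appended below rows of a.
rows = pd.concat([a, b], axis=0, ignore_index=True)
print(rows.shape)  # (2, 2)

# axis=1 stacks horizontally: columns are placed side by side.
cols = pd.concat([a, b], axis=1)
print(cols.shape)  # (1, 4)
```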
pd.merge
Combines DataFrames by aligning rows based on a specified key or set of keys. Requires a common key (or column) to perform the join.
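A short sketch with hypothetical tables sharing a key column `id`:

```python
import pandas as pd

# Hypothetical tables that share the common key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Inner join keeps only rows whose key appears in both frames.
inner = pd.merge(left, right, on="id", how="inner")
print(inner["id"].tolist())  # [2, 3]

# Left join keeps every row of `left`; missing scores become NaN.
left_join = pd.merge(left, right, on="id", how="left")
print(len(left_join))  # 3
```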
Variance
Measures the average squared deviation from the mean.
Standard Deviation
The square root of the variance and provides a more interpretable measure of the spread in the original units of measurement.
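Both definitions from first principles (population formulas, illustrative data):

```python
# Variance and standard deviation computed directly from the definitions.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# Variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance, in the original units.
std = variance ** 0.5

print(mean, variance, std)  # 5.0 4.0 2.0
```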
Coordinate systems
Check the coordinate system via the GeoDataFrame's .crs attribute, and understand why a re-projection (e.g. with .to_crs()) may be needed.
EPSG 4326
WGS84, geographic coordinates (lat/lon) - 3D degrees.
EPSG 27700
OSGB36, British National Grid (projected coordinates) - 2D metres.
Normalisation
Rescales features to a range, typically [0, 1], to ensure consistency in scale.
Standardisation
Transforms data to have a mean of 0 and a standard deviation of 1, often required for algorithms sensitive to feature scales.
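A sketch of both transforms in NumPy on a toy feature (scikit-learn's MinMaxScaler and StandardScaler do the same job on whole datasets):

```python
import numpy as np

# A toy feature used to illustrate both rescaling transforms.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalisation: rescale to the range [0, 1].
normalised = (x - x.min()) / (x.max() - x.min())

# Standardisation: zero mean, unit standard deviation (population std).
standardised = (x - x.mean()) / x.std()

print(normalised)                                    # [0.   0.25 0.5  0.75 1.  ]
print(standardised.mean(), standardised.std())       # ~0.0 ~1.0
```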
Encoding
Converting categorical or textual data into numerical format so that it can be used by machine learning algorithms.
Time series data split
Select time stamp. Data before for training, data after for testing.
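A minimal sketch of a timestamp-based split, assuming a hypothetical daily series:

```python
import pandas as pd

# Hypothetical daily series; in practice these would be real observations.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"value": range(10)}, index=idx)

# Choose a cut-off timestamp: everything before it trains, the rest tests.
cutoff = pd.Timestamp("2024-01-08")
train = df[df.index < cutoff]
test = df[df.index >= cutoff]

print(len(train), len(test))  # 7 3
```

Unlike a random split, this preserves temporal order, so the model is never evaluated on data that precedes its training period.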
MAE
Mean Absolute Error = 1/n * sum of |actual values - predicted values|. Treats all errors equally.
MSE
Mean Squared Error = 1/n * sum of (actual values - predicted values)^2.
RMSE
Root Mean Squared Error = square root of MSE. Penalises larger errors more.
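All three error metrics on hypothetical actual/predicted values:

```python
import numpy as np

# Hypothetical actual and predicted values for illustration.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = actual - predicted        # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))      # treats all errors equally
mse = np.mean(errors ** 2)         # squaring magnifies large errors
rmse = np.sqrt(mse)                # back in the original units

print(mae, mse, rmse)  # 1.0 1.5 ~1.225
```

Note how the single error of 2 pushes RMSE above MAE: squaring penalises larger errors more.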
Accuracy
Indicates the proportion of correct classifications made by the model ((TP + TN) / All).
Precision
Measures how accurate the model's positive predictions are (TP / (TP + FP)).
Recall
Among all the actual positive cases, how many did the model correctly predict as positive? (TP / (TP + FN)).
F1 score
The harmonic mean of Precision and Recall, providing a balance between the two. F1 is always between 0 and 1.
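All four metrics from a hypothetical confusion matrix:

```python
# Classification metrics from hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all
precision = tp / (tp + fp)                          # how trustworthy positives are
recall = tp / (tp + fn)                             # how many positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, round(f1, 3))  # 0.7 0.8 ~0.667 0.727
```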
Odds
Represent the ratio of the probability that an event will occur to the probability that it will not occur. Odds in favour = P(occur) / P(not occur) = P(occur) / (1 - P(occur)).
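A one-line worked example with a hypothetical probability:

```python
# Odds in favour, from a hypothetical probability of 0.75.
p = 0.75
odds = p / (1 - p)
print(odds)  # 3.0, i.e. odds of 3 to 1
```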
Entropy
A measure of the randomness or disorder within a set of data. It is used to determine how pure a split is.
Gini impurity
A measure used in decision tree algorithms to quantify a dataset's impurity level or disorder.
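Both impurity measures from their definitions, comparing a pure node with a 50/50 split:

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits: 0 for a pure node, maximal for uniform classes.
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: chance of mislabelling a randomly drawn sample.
    return 1 - sum(p ** 2 for p in probs)

pure = [1.0]        # all samples in one class
mixed = [0.5, 0.5]  # evenly split between two classes

print(entropy(pure), gini(pure))    # 0.0 0.0
print(entropy(mixed), gini(mixed))  # 1.0 0.5
```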
K-means clustering
Group the data into k clusters, where k is greater than 1.
Cluster centroid
The centre of a cluster. In k-means, it is the mean of all points assigned to that cluster.
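A tiny NumPy sketch of the k-means loop (Lloyd's algorithm) on made-up 2-D points; real projects would typically use sklearn.cluster.KMeans:

```python
import numpy as np

# Four illustrative points forming two obvious groups; k = 2.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # initial guesses

for _ in range(5):
    # Assignment step: attach each point to its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)     # [0 0 1 1]
print(centroids)  # [[ 0.   0.5] [10.  10.5]]
```

The update step is exactly the flashcard definition: the centroid is the mean of all points assigned to that cluster.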
df.loc
Label-based indexing
df.loc[select row labels, select column labels]
Syntax for selecting specific rows and columns
Multiple rows or columns
Passing a list of row or column labels requires its own inner set of square brackets
Colon in indexing (e.g. a:f)
Does not require additional square brackets
df.iloc
Positional indexing
0-based counts
Counts from left/top starting from 0
Negative indexing
Starts from right/bottom starting from -1
Start index inclusive, stop index exclusive
1:3 selects positions 1 and 2; position 3 is excluded
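The .loc and .iloc rules above, on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [40, 50, 60]},
    index=["x", "y", "z"],
)

# .loc is label-based; a list of labels needs its own square brackets.
print(df.loc["x", "a"])            # 10
print(df.loc[["x", "z"], ["b"]])   # rows x and z, column b only

# .loc colon slices include BOTH endpoints.
print(len(df.loc["x":"y"]))        # 2

# .iloc is position-based: start inclusive, stop exclusive,
# and negative positions count from the bottom/right.
print(df.iloc[1:3]["a"].tolist())  # [20, 30]
print(df.iloc[-1]["b"])            # 60
```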
Moran scatter-plot
Quadrants I and III indicate spatial clustering (positive spatial autocorrelation); II and IV indicate spatial dispersion (negative spatial autocorrelation)
Validation set
A held-out set of unseen data used to select the best model configuration; distinct from the test set, which is reserved for final evaluation
Gini of 0
Indicates a pure node where all samples belong to a single class
AUC
Measures the likelihood that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance
Reading a decision tree
Darker shades indicate higher purity; lighter shades indicate lower purity
Root node
The topmost node; it splits on the feature that most reduces Gini impurity (classification) or mean squared error (regression)
Types of data in statistical analysis
Nominal, Ordinal, Binary, Discrete, Continuous
Bootstrap sampling
Same size as the data set, sampled with replacement; the probability that a given observation appears at least once ≈ 63% (1 - 1/e)
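An empirical check of the ~63% figure, using random draws with replacement:

```python
import random

# Draw a bootstrap sample: same size as the data, with replacement,
# then measure what fraction of distinct originals it contains.
random.seed(0)  # fixed seed for reproducibility
n = 10_000
sample = [random.randrange(n) for _ in range(n)]
fraction_selected = len(set(sample)) / n

print(round(fraction_selected, 2))  # ~0.63, close to 1 - 1/e
```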
SVM
C = regularisation parameter (larger C means weaker regularisation); random_state = random seed for reproducibility
Linear Regression vs Decision Tree
Linear relationship vs Non-linear; Interpretable coefficients vs Rule-based
Limitations of K-means
Choice of K is subjective; assumes spherical clusters; sensitive to feature scaling
WGS84
EPSG 4326
Low training error, high test error
Indicates overfitting: low bias but high variance
Uses of K-means
Groups based on similar characteristics, identify high-risk zones where multiple characteristics coincide, discover patterns in unlabelled data without predefined categories