ML Key Points

Last updated 7:50 AM on 5/11/26

47 Terms

1

pd.concat

Concatenate multiple DataFrames or Series along a particular axis (rows or columns).

2

axis = 0

Stacks vertically: rows are appended one below another.

3

axis = 1

Stacks horizontally: columns are placed side by side.
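
The two axis settings can be compared in a minimal sketch. The frames and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical example frames
top = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
bottom = pd.DataFrame({"id": [3, 4], "score": [30, 40]})
extra = pd.DataFrame({"grade": ["A", "B"]})

# axis=0: the rows of `bottom` are appended under `top`
stacked_rows = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1: the columns of `extra` are placed beside `top`, aligned on the index
stacked_cols = pd.concat([top, extra], axis=1)

print(stacked_rows.shape)  # (4, 2)
print(stacked_cols.shape)  # (2, 3)
```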

4

pd.merge

Combines DataFrames by aligning rows based on a specified key or set of keys. Requires a common key (or column) to perform the join.
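
A small sketch of a key-based join, assuming two made-up frames that share a `person_id` column. An inner join is used here; `how` also accepts "left", "right" and "outer".

```python
import pandas as pd

# Hypothetical example frames sharing the key column "person_id"
people = pd.DataFrame({"person_id": [1, 2, 3], "name": ["Ann", "Ben", "Cat"]})
visits = pd.DataFrame({"person_id": [1, 1, 3], "visits": [5, 2, 7]})

# Inner join: keep only rows whose key appears in both frames
merged = pd.merge(people, visits, on="person_id", how="inner")
print(merged)
```

Person 2 has no matching row in `visits`, so it is dropped from the inner join, while person 1 appears twice because it matches two rows.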

5

Variance

Measures the average squared deviation from the mean.

6

Standard Deviation

The square root of the variance. It gives a more interpretable measure of spread because it is in the original units of measurement.
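
As a worked example, variance and standard deviation can be computed by hand on made-up data. This sketch uses the population form (divide by n), which matches "average squared deviation"; the sample form divides by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

mean = statistics.fmean(values)

# Population variance: the average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance, in the original units
std_dev = variance ** 0.5

print(variance, std_dev)  # 4.0 2.0
```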

7

Coordinate systems

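Check a GeoDataFrame's coordinate reference system with its .crs attribute (e.g. gdf.crs). Reproject with .to_crs() when the analysis needs different units, for example metres instead of degrees when measuring distances or areas.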

8

EPSG 4326

WGS84, geographic coordinates (latitude/longitude) in degrees, measured on a curved (ellipsoidal) model of the Earth.

9

EPSG 27700

OSGB36 / British National Grid, projected coordinates (easting/northing) on a flat 2D plane, in metres.

10

Normalisation

Rescales features to a range, typically [0, 1], to ensure consistency in scale.

11

Standardisation

Transforms data to have a mean of 0 and a standard deviation of 1, often required for algorithms sensitive to feature scales.
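
Normalisation and standardisation can be contrasted on one made-up feature. This is a plain-Python sketch of min-max scaling and z-scoring, not any particular library's scaler.

```python
import statistics

feature = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values

# Normalisation (min-max): rescale to the range [0, 1]
lo, hi = min(feature), max(feature)
normalised = [(x - lo) / (hi - lo) for x in feature]

# Standardisation (z-score): subtract the mean, divide by the standard deviation
mean = statistics.fmean(feature)
std = statistics.pstdev(feature)
standardised = [(x - mean) / std for x in feature]

print(normalised)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```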

12

Encoding

Converting categorical or textual data into numerical format so that it can be used by machine learning algorithms.
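
Two common encodings, sketched on made-up data: one-hot encoding for categories without an order, and an explicit mapping for ordered categories. The column names and the size ordering are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# Ordinal encoding: map ordered categories to integers (hypothetical order)
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = pd.Series(["small", "large", "medium"])
size_codes = sizes.map(size_order)

print(one_hot)
print(size_codes.tolist())  # [0, 2, 1]
```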

13

Time series data split

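Choose a cut-off timestamp: data before it is used for training, data after it for testing. The rows are not shuffled, so the model is never trained on observations from the future.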
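
A minimal sketch of a time-based split, assuming a made-up daily series and an arbitrary cut-off date.

```python
import pandas as pd

# Made-up daily series: 1 to 10 January 2024
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": list(range(10)),
})

cutoff = pd.Timestamp("2024-01-08")  # hypothetical split point

train = df[df["date"] < cutoff]   # everything before the cut-off
test = df[df["date"] >= cutoff]   # the cut-off date and everything after

print(len(train), len(test))  # 7 3
```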

14

MAE

Mean Absolute Error = 1/n * sum of |actual values - predicted values|. Treats all errors equally, regardless of their size or sign.

15

MSE

Mean Squared Error = 1/n * sum of (actual values - predicted values)^2.

16

RMSE

Root Mean Squared Error = square root of MSE. Penalises larger errors more than MAE does, and is in the same units as the target.
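
The three error metrics can be compared on the same made-up predictions. Note how the single large error (4) pulls RMSE well above MAE.

```python
import math

actual = [3.0, 5.0, 2.0, 7.0]      # made-up true values
predicted = [2.0, 5.0, 4.0, 3.0]   # made-up predictions

errors = [a - p for a, p in zip(actual, predicted)]  # 1, 0, -2, 4
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # (1 + 0 + 2 + 4) / 4 = 1.75
mse = sum(e ** 2 for e in errors) / n   # (1 + 0 + 4 + 16) / 4 = 5.25
rmse = math.sqrt(mse)                   # square root of 5.25

print(mae, mse, rmse)
```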

17

Accuracy

Indicates the proportion of correct classifications made by the model: (TP + TN) / (TP + TN + FP + FN).

18

Precision

Measures how accurate the model's positive predictions are: TP / (TP + FP).

19

Recall

Among all the actual positive cases, how many did the model correctly predict as positive? TP / (TP + FN).

20

F1 score

The harmonic mean of Precision and Recall, providing a balance between the two: F1 = 2 * Precision * Recall / (Precision + Recall). F1 is always between 0 and 1.
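
Accuracy, precision, recall and F1 can all be computed from the four confusion-matrix counts. The counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 85 / 100 = 0.85
precision = tp / (tp + fp)                  # 40 / 45
recall = tp / (tp + fn)                     # 40 / 50 = 0.8

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```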

21

Odds

The ratio of the probability that an event will occur to the probability that it will not occur. Odds in favour = P(occur) / P(not occur) = P(occur) / (1 - P(occur)).
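
A short sketch of the conversion from probability to odds. The function name `odds_in_favour` is made up for this example.

```python
def odds_in_favour(p):
    """Convert a probability p (0 <= p < 1) into odds in favour."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

print(odds_in_favour(0.75))  # 3.0, i.e. odds of 3 to 1
print(odds_in_favour(0.5))   # 1.0, i.e. even odds
```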

22

Entropy

A measure of the randomness or disorder within a set of data. It is used to determine how pure a split is.

23

Gini impurity

A measure used in decision tree algorithms to quantify a dataset's impurity level or disorder.
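
Entropy and Gini impurity can be compared on the class labels in a node. This sketch uses the standard formulas, entropy = -sum(p * log2(p)) and Gini = 1 - sum(p^2), where p is the proportion of each class. The helper names are made up.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each class among the labels in a node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def entropy(labels):
    return sum(-p * math.log2(p) for p in class_proportions(labels))

def gini(labels):
    return 1 - sum(p ** 2 for p in class_proportions(labels))

mixed = ["a", "a", "b", "b"]  # evenly split node: maximum impurity for 2 classes
pure = ["a", "a", "a", "a"]   # pure node: a single class

print(entropy(mixed), gini(mixed))  # 1.0 0.5
```

For the pure node both measures are 0.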

24

K-means clustering

Groups the data into k clusters, where k is chosen in advance and is greater than 1. Each point is assigned to the cluster with the nearest centroid.

25

Cluster centroid

The centre of a cluster. In k-means, it is the mean of all points assigned to that cluster.
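
The assign-then-update loop behind k-means can be sketched with NumPy. This is a simplified illustration, not a library implementation: the function name `k_means` is made up, the starting centroids are passed in by hand (real implementations usually pick them randomly or with k-means++), and the loop runs a fixed number of iterations.

```python
import numpy as np

def k_means(points, initial_centroids, n_iter=10):
    """Simplified k-means: repeat assignment and centroid update n_iter times."""
    centroids = np.asarray(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        # Assign each point to its nearest centroid
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
    return labels, centroids

# Made-up data: two visibly separate groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

labels, centroids = k_means(points, initial_centroids=[[1.0, 1.0], [8.0, 8.0]])
print(labels)     # [0 0 0 1 1 1]
print(centroids)  # the mean of each group
```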

26

df.loc

Label-based indexing

27

df.loc[select row labels, select column labels]

Syntax for selecting specific rows and columns

28

Multiple rows or columns

Pass the labels as a list, which means an extra pair of square brackets, e.g. df.loc[['a', 'c'], ['x', 'z']]

29

Colon in indexing (e.g. a:f)

Does not require additional square brackets
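
The df.loc patterns above can be tried on a small made-up frame. Note that a label slice with df.loc includes both the start and the stop label.

```python
import pandas as pd

# Made-up frame with string row labels
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "z": [100, 200, 300, 400]},
    index=["a", "b", "c", "d"],
)

single = df.loc["b", "y"]                 # one row label, one column label
several = df.loc[["a", "c"], ["x", "z"]]  # lists of labels need their own brackets
sliced = df.loc["a":"c", "x":"y"]         # slices need no extra brackets

print(single)                 # 20
print(sliced.index.tolist())  # ['a', 'b', 'c']
```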

30

df.iloc

Integer position-based indexing

31

0-based counts

Counts from left/top starting from 0

32

Negative indexing

Counts from the right/bottom, starting from -1

33

Start index inclusive, stop index exclusive

With df.iloc, 1:3 selects the rows at positions 1 and 2; position 3 is not included. (A df.loc label slice, by contrast, includes both ends.)
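
The df.iloc rules (0-based counting, negative indexing, exclusive stop) can be checked on a small made-up frame. The slice works on integer positions, whatever the row labels are.

```python
import pandas as pd

# Made-up frame; the row labels are strings, but iloc ignores them
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

first_cell = df.iloc[0, 0]  # top-left cell: counting starts at 0
last_row = df.iloc[-1]      # -1 is the bottom row
middle = df.iloc[1:3]       # positions 1 and 2 only; position 3 is excluded

print(first_cell)             # 1
print(middle.index.tolist())  # ['b', 'c']
```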

34

Moran scatter-plot

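Quadrants I (high-high) and III (low-low) indicate clustering (positive spatial autocorrelation); quadrants II and IV (low-high, high-low) indicate dispersion (negative spatial autocorrelation).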

35

Validation set

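Held-out data that the model is not trained on, used to compare candidates and select the best model configuration (e.g. hyperparameters). It is kept separate from the final test set.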

36

Gini of 0

Indicates a pure node where all samples belong to a single class

37

AUC

Area under the ROC curve. Equals the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
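
The ranking interpretation can be turned directly into code by comparing every positive-negative pair. This is a sketch for small data with a made-up function name; it counts ties as half a win, and libraries compute the same quantity far more efficiently.

```python
def pairwise_auc(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up model scores and true labels
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(auc)  # 5 of the 6 pairs are ranked correctly
```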

38

Reading a decision tree

Darker shades indicate higher purity; lighter shades indicate lower purity

39

Root node

The top node of the tree. It splits on the feature (and threshold) that gives the lowest Gini impurity (classification) or mean squared error (regression) after the initial split.

40

Types of data in statistical analysis

Nominal, Ordinal, Binary, Discrete, Continuous

41

Bootstrap sampling

A sample of the same size as the data set, drawn with replacement. Each observation has roughly a 63% chance (1 - 1/e ≈ 0.632) of appearing in the sample at least once.
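
The 63% figure can be checked with a quick simulation. For a data set of size n, the chance that a given observation is picked at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n. The data, seed and helper name below are arbitrary.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]

n = 10_000
data = list(range(n))      # made-up data set: the integers 0..n-1
rng = random.Random(0)     # fixed seed so the run is repeatable

sample = bootstrap_sample(data, rng)

# Share of distinct original observations that made it into the sample
share_included = len(set(sample)) / n

# Theoretical probability of being included at least once
theoretical = 1 - (1 - 1 / n) ** n

print(share_included, theoretical)  # both close to 0.632
```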

42

SVM

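C = regularisation parameter (in scikit-learn the strength of regularisation is inversely proportional to C, so a larger C means less regularisation); random_state = random seed, set for reproducible results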

43

Linear Regression vs Decision Tree

Linear regression assumes a linear relationship and gives interpretable coefficients; a decision tree can capture non-linear relationships and gives rule-based splits

44

Limitations of K-means

Choice of K is subjective; assumes spherical clusters; sensitive to feature scaling

45

WGS84

EPSG:4326

46

Low training error, high test error

Indicates low bias and high variance (overfitting)

47

Uses of K-means

Group observations with similar characteristics; identify high-risk zones where multiple characteristics coincide; discover patterns in unlabelled data without predefined categories