Machine Learning
A technique for creating models: a model is a software application that encapsulates a function to calculate an output value based on one or more input values. The process of training and using a model:
1. Training data is used: past data that includes the observed features of the thing being observed (x) and the known value of the label you want to predict (y).
2. An algorithm is applied to the data to try to determine a relationship between the features and the label, fitting the data to a function in which the values of the features can be used to calculate the label.
3. A model is created that encapsulates the function, y = f(x).
4. The trained model is used for inferencing: using it to predict labels for new feature values (see the sketch below).
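A minimal sketch of this process, assuming scikit-learn and an entirely hypothetical regression task (the feature and label values are made up):

```python
from sklearn.linear_model import LinearRegression

# 1. Past observations: features (x) and known labels (y)
X = [[28, 0.0], [31, 2.5], [19, 8.0], [25, 4.0]]  # e.g. temperature, rainfall
y = [410, 380, 120, 260]                          # e.g. ice creams sold

# 2./3. Fit the data to a function encapsulated in a model, y = f(x)
model = LinearRegression().fit(X, y)

# 4. Inferencing: predict the label for new feature values
print(model.predict([[30, 1.0]]))
```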
Supervised Machine Learning
Algorithms in which the training data includes both feature values and known label values, used to train models by determining a relationship between features and labels in past observations so that unknown labels can be predicted for features in future cases.
Regression
A form of supervised machine learning
The label predicted by the model is a number (eg number of ice creams sold in a day based on temperature and rainfall)
Training a regression model requires multiple iterations, repeating the process with multiple algorithms and parameters until you achieve an acceptable level of predictive accuracy.
1. Split the training data randomly, holding back a subset of data to be used for validation
2. Use an algorithm to fit the training data to a model.
3. Use the validation data to test the model by predicting labels for features.
4. Compare the known actual labels to the predicted labels, then aggregate the differences to calculate a metric that indicates the accuracy of the predictions.
5. Repeat with different algorithms and parameters until an acceptable evaluation metric is achieved (see the sketch below).
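A sketch of this train/validate loop, assuming scikit-learn; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical data: two features, one numeric label
rng = np.random.default_rng(0)
X = rng.uniform(0, 35, size=(100, 2))
y = 12 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 10, size=100)

# 1. Split randomly, holding back a subset for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)

# 2. Use an algorithm to fit the training data to a model
model = LinearRegression().fit(X_train, y_train)

# 3./4. Predict labels for the validation features and aggregate the errors
y_pred = model.predict(X_val)
print(mean_absolute_error(y_val, y_pred))

# 5. Repeat with different algorithms/parameters until acceptable
```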
Mean Absolute Error
The amount by which each prediction was wrong is known as the absolute error for that prediction, and can be summarised for the whole validation set as the mean absolute error (MAE). The mean absolute error metric takes all discrepancies between predicted and actual labels into account equally.
(Because all errors count equally, it may favour a model that makes fewer but larger errors over one that is consistently wrong by a small amount)
Mean Squared Error
A metric that "amplifies" larger errors by squaring the individual errors and calculating the mean of the squared values. The mean squared error helps take the magnitude of errors into account, but because it squares the error values, the resulting metric no longer represents the quantity measured by the label, it doesn't measure its accuracy in terms of the number of numerical difference between values that were mis-predicted.
(Favours small consistent errors over few large errors)
Root Mean Squared Error
The square root of the Mean Squared Error; it measures the error in the same units as the label itself.
Coefficient of determination (R²)
A metric that measures the proportion of variance in the validation results that can be explained by the model as opposed to an anomalous aspect of the validation data.
It compares the sum of squared differences between predicted and actual labels with the sum of squared differences between the actual label values and the mean of actual label values, like this:
R² = 1 − ∑(y−ŷ)² ÷ ∑(y−ȳ)²
The result is a value between 0 and 1 that describes the proportion of variance explained by the model. The closer to 1 this value is, the better the model is fitting the validation data.
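All four regression metrics can be computed with scikit-learn (a sketch; the label arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_val = np.array([3.0, 5.0, 2.0, 7.0])   # actual labels (y)
y_pred = np.array([2.0, 5.0, 3.0, 9.0])  # predicted labels (ŷ)

mae = mean_absolute_error(y_val, y_pred)  # mean of |y − ŷ|
mse = mean_squared_error(y_val, y_pred)   # mean of (y − ŷ)²
rmse = np.sqrt(mse)                       # back in the label's units
r2 = r2_score(y_val, y_pred)              # 1 − ∑(y−ŷ)² ÷ ∑(y−ȳ)²
print(mae, mse, rmse, r2)
```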
Classification
A form of supervised machine learning
The label represents a categorisation or class
Two common classification scenarios: binary and multiclass
Binary Classification
The label determines whether the observed item is or isn’t an instance of a specific class, predicting one of two mutually exclusive outcomes (eg predicting whether someone will get diabetes based on certain factors) represented as 0 or 1.
Training a binary classification model requires:
1. Holding back a random subset of data to be used for validation
2. Using an algorithm to fit the training data to a function that describes the probability of y being true for a given value of x: f(x) = P(y=1|x)
3. Using the validation data to compare the predicted class labels to actual class labels (see the sketch below).
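A minimal sketch, assuming scikit-learn's LogisticRegression as the algorithm (a common choice for binary classification) and hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, binary label (e.g. diabetic or not)
rng = np.random.default_rng(1)
X = rng.uniform(60, 140, size=(200, 1))                        # e.g. blood glucose
y = (X[:, 0] + rng.normal(0, 10, size=200) > 100).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)

# Fit a function f(x) = P(y=1|x)
model = LogisticRegression().fit(X_train, y_train)

print(model.predict_proba(X_val)[:5, 1])  # P(y=1|x) for five validation cases
print(model.predict(X_val)[:5])           # the resulting predicted labels
```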
Confusion Matrix for Binary Classification Evaluation
A matrix of the number of correct and incorrect predictions for each possible class label:
ŷ=0 and y=0: True negatives (TN)
ŷ=1 and y=0: False positives (FP)
ŷ=0 and y=1: False negatives (FN)
ŷ=1 and y=1: True positives (TP)
The confusion matrix for a multiclass classifier is similar to that of a binary classifier, except that it shows the number of predictions for each combination of predicted (ŷ) and actual class labels (y)
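A sketch of computing the matrix with scikit-learn's confusion_matrix (the label arrays are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 0, 1]  # actual labels (y)
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]  # predicted labels (ŷ)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```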
Accuracy (calculated from confusion matrix)
The proportion of predictions the model got right
The formula for accuracy is:
(TN+TP) ÷ (TN+FN+FP+TP)
Caveat: accuracy is not always the best indicator of a successful prediction, for example: suppose 11% of the population has diabetes. You could create a model that always predicts 0, and it would achieve an accuracy of 89%, even though it makes no real attempt to differentiate between patients by evaluating their features. We therefore need a deeper understanding of how the model performs at predicting 1 for positive cases and 0 for negative cases.
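The caveat can be demonstrated numerically (a sketch using the 11% prevalence from the example above):

```python
import numpy as np

# 1000 hypothetical patients, 11% of whom actually have diabetes
y_true = np.array([1] * 110 + [0] * 890)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts 0

print((y_true == y_pred).mean())  # 0.89 accuracy, yet no positives found
```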
Recall / True Positive Rate
A metric that measures the proportion of positive cases that the model identified correctly.
The formula for recall/TPR is:
TP ÷ (TP+FN)
Precision
A metric that measures the proportion of predicted positive cases where the true label is actually positive
The formula for precision is:
TP ÷ (TP+FP)
F1 Score
A metric that combines recall and precision.
The formula for F1-score is:
(2 x Precision x Recall) ÷ (Precision + Recall)
False Positive Rate
A metric that measures the proportion of actual negative cases that the model incorrectly predicted as positive.
The formula for FPR is:
FP÷(FP+TN)
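Recall, precision, F1, and FPR can all be derived from the confusion matrix counts (a sketch; scikit-learn also offers recall_score, precision_score, and f1_score):

```python
# Hypothetical confusion matrix counts
TP, FP, TN, FN = 30, 10, 50, 10

recall = TP / (TP + FN)     # true positive rate
precision = TP / (TP + FP)
f1 = (2 * precision * recall) / (precision + recall)
fpr = FP / (FP + TN)        # false positive rate
print(recall, precision, f1, fpr)
```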
Receiver Operating Characteristic (ROC) Curve
A graphical plot that compares the TPR and FPR for every possible threshold value between 0.0 and 1.0.
The ROC curve for a perfect model would go straight up the TPR axis on the left and then across the FPR axis at the top. Since the plot area for the curve measures 1x1, the area under this perfect curve would be 1.0 (meaning that the model is correct 100% of the time).
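A sketch of deriving the curve and the area under it (AUC) with scikit-learn, given hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted P(y=1|x)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
print(roc_auc_score(y_true, y_score))              # area under the curve
```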
Multiclass Classification
The label represents one of multiple possible classes (eg the species of a penguin based on its physical measurements)
Mostly predicts mutually exclusive classes but can be trained to do multilabel classification where there may be more than one valid label for a single observation
Training a multiclass classification model requires:
1. Holding back a random subset of data to be used for validation
2. Using an algorithm to fit the training data. There are two kinds of algorithms that can be used here: one-vs-rest and multinomial (a sketch contrasting both appears below)
3. Using the validation data to compare the predicted class labels to actual class labels.
One-vs-Rest Algorithms
Train a binary classification function for each class, each calculating the probability that the observation is an example of the target class.
Each function calculates the probability of the observation being a specific class compared to any other class ie:
f0(x) = P(y=0 | x)
f1(x) = P(y=1 | x)
f2(x) = P(y=2 | x)
Each algorithm produces a sigmoid function that calculates a probability value between 0.0 and 1.0. A model trained using this kind of algorithm predicts the class for the function that produces the highest probability output.
Multinomial Algorithms
Create a single function that returns a multi-valued output. The output is a vector (an array of values) that contains the probability distribution for all possible classes, with a probability score for each class; the scores add up to 1.0:
f(x) =[P(y=0|x), P(y=1|x), P(y=2|x)]
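A sketch contrasting the two approaches, assuming scikit-learn (OneVsRestClassifier wraps one binary learner per class; LogisticRegression's default solver fits a single multinomial/softmax function for multiclass labels); the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data
X, y = make_classification(n_samples=150, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=0)

# One-vs-rest: one binary function per class; highest probability wins
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Multinomial: a single function returning a probability distribution
softmax = LogisticRegression().fit(X, y)

probs = softmax.predict_proba(X[:1])
print(probs, probs.sum())  # one score per class, summing to 1.0
```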
Unsupervised Machine Learning
Involves training models using data that consists only of feature values without any known labels. Unsupervised machine learning algorithms determine relationships between the features of the observations in the training data.
Clustering
A form of unsupervised machine learning
Identifies similarities between observations based on their features, and groups them into discrete clusters (eg grouping similar flowers based on number of leaves and number of petals)
K-Means clustering is one of the most commonly used algorithms to do this (a sketch follows the steps below):
1. The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features). In the flower example, we have two features: number of leaves (x1) and number of petals (x2). So, the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space ([x1,x2])
2. You decide how many clusters you want to use to group the flowers - call this value k. For example, to create three clusters, you would use a k value of 3. Then k points are plotted at random coordinates. These points become the center points for each cluster, so they're called centroids.
3. Each data point is assigned to its nearest centroid.
4. Each centroid is moved to the center of the data points assigned to it based on the mean distance between the points.
5. After the centroid is moved, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
6. The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
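A sketch using scikit-learn's KMeans on hypothetical flower measurements:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical observations: [number of leaves, number of petals]
X = np.array([[2, 5], [3, 6], [2, 6], [8, 12], [9, 11], [8, 13]])

# k = 2 clusters; centroids are initialized and iteratively refined
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each observation
print(kmeans.cluster_centers_)  # final centroid coordinates
```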
Average distance to cluster center
Clustering evaluation method: how close, on average, each point in the cluster is to the centroid of the cluster
Average distance to other center
Clustering evaluation method: how close, on average, each point in the cluster is to the centroid of all other clusters.
Maximum distance to cluster center
Clustering evaluation method: the furthest distance between a point in the cluster and its centroid.
Silhouette
A value between -1 and 1 that summarizes the ratio of distance between points in the same cluster and points in different clusters (The closer to 1, the better the cluster separation).
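scikit-learn provides silhouette_score for this (a sketch, reusing the hypothetical flower data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[2, 5], [3, 6], [2, 6], [8, 12], [9, 11], [8, 13]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Closer to 1 means better-separated clusters
print(silhouette_score(X, labels))
```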
Deep Learning
An advanced form of machine learning that tries to emulate the way the human brain learns. The key to deep learning is the creation of an artificial neural network that simulates electrochemical activity in biological neurons by using mathematical functions
Just like other machine learning techniques, deep learning involves fitting training data to a function that can predict a label (y) based on the value of one or more features (x). The function (f(x)) is the outer layer of a nested function in which each layer of the neural network encapsulates functions that operate on x and the weight (w) values associated with them. The algorithm used to train the model involves:
1. Iteratively feeding the feature values (x) in the training data forward through the layers to calculate output values for ŷ.
2. Validating the model to evaluate how far off the calculated ŷ values are from the known y values (which quantifies the level of error, or loss, in the model).
3. Modifying the weights (w) to reduce the loss.
The trained model includes the final weight values that result in the most accurate predictions (a simplified sketch follows).
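A deliberately tiny sketch of this loop in plain numpy (one hidden layer, squared-error loss, hand-derived gradients; illustrative only, not a production training algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # features (x)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # hypothetical labels (y)

# Weights for a tiny network: 2 inputs -> 4 hidden units -> 1 output
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for epoch in range(500):
    # 1. Feed the features forward through the layers to calculate ŷ
    h = sigmoid(X @ W1)
    y_hat = sigmoid(h @ W2).ravel()

    # 2. Loss: how far off ŷ is from the known y values
    loss = ((y_hat - y) ** 2).mean()

    # 3. Backpropagate and modify the weights (w) to reduce the loss
    grad_out = (2 * (y_hat - y) * y_hat * (1 - y_hat))[:, None]
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.1 * h.T @ grad_out / len(X)
    W1 -= 0.1 * X.T @ grad_h / len(X)

print(loss)  # should have decreased over the iterations
```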
Transformers
A transformer model is designed to take in a text input (called a prompt) and generate a syntactically correct output
Transformer models are trained with large volumes of text, enabling them to represent the semantic relationships between words and use those relationships to determine probable sequences of text that make sense. Transformer models with a large enough vocabulary are capable of generating language responses that are tough to distinguish from human responses.
Transformer model architecture consists of two components, or blocks: an encoder block that creates semantic representations of the training vocabulary, and a decoder block that generates new language sequences.
The encoder and decoder blocks in a transformer model include multiple layers that form the neural network for the model. A key technique used in these layers is attention; in particular, self-attention involves considering how other tokens around one particular token influence that token's meaning. This contextualized approach means that the same word might have multiple embeddings depending on the context in which it's used.
1. The model is trained with a large volume of natural language text
2. The sequences of text are broken down into tokens
3. The output from the encoder is a collection of vectors, referred to as embeddings
4. The decoder block works on a new sequence of text tokens and uses the embeddings generated by the encoder to generate an appropriate natural language output.
Tokenization
The first step in training a transformer model is to decompose the training text into tokens - in other words, identify each unique text value. As you continue to train the model, each new token in the training text is added to the vocabulary with an appropriate token ID (a toy sketch follows).
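A toy sketch of building a vocabulary (real transformer tokenizers typically use subword schemes such as byte-pair encoding; this just assigns an ID to each unique word):

```python
text = "I heard a dog bark loudly at a cat"

vocabulary = {}
for token in text.split():
    # Each new unique token gets the next available ID
    if token not in vocabulary:
        vocabulary[token] = len(vocabulary) + 1

print(vocabulary)
# {'I': 1, 'heard': 2, 'a': 3, 'dog': 4, 'bark': 5, 'loudly': 6, 'at': 7, 'cat': 8}
```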
Embeddings
To create a vocabulary that encapsulates semantic relationships between the tokens, we define contextual vectors, known as embeddings, for them.
Vectors are multi-valued numeric representations of information, for example [10, 3, 1] in which each numeric element represents a particular attribute of the information.
For language tokens, each element of a token's vector represents some semantic attribute of the token. The specific categories for the elements of the vectors in a language model are determined during training based on how commonly words are used together or in similar contexts. Vectors represent lines in multidimensional space, describing direction and distance along multiple axes.
It can be useful to think of the elements in an embedding vector for a token as representing steps along a path in multidimensional space. A technique called cosine similarity is used to determine if two vectors have similar directions (regardless of distance), and therefore represent semantically linked words.
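A sketch of cosine similarity between embedding vectors (the vectors and words here are hypothetical):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos θ = (a · b) ÷ (‖a‖ ‖b‖): compares direction, ignoring magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog = np.array([10.0, 3.0, 2.0])         # hypothetical embedding for "dog"
puppy = np.array([5.0, 2.0, 1.0])        # similar direction -> related meaning
skateboard = np.array([-3.0, 3.0, 2.0])  # different direction -> unrelated

print(cosine_similarity(dog, puppy))       # close to 1
print(cosine_similarity(dog, skateboard))  # much lower
```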