what are artificial neural networks
a collection of connected computational units (neurons), arranged in interconnected layers
very popular for supervised and unsupervised learning
state of the art across many complex domains, mainly based on unstructured data
what do we mean when we say that Artificial neural networks are very powerful and flexible ?
more layers / neurons → more capacity / complexity
specialized architectures

XOR problem
a function that a perceptron cannot learn
exclusive or is a logical operation that is true if and only if its arguments differ
the perceptron
an algorithm for supervised learning of binary classifiers

what do we mean when we say that a perceptron with a (sigmoid) activation function can only model linear decision boundaries
the output of the perceptron passes through a sigmoid, but the classification is still linear because the decision boundary is still linear
→ the function that goes into the activation function is still linear in the inputs
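A short math sketch (not part of the original card) of why the sigmoid does not bend the decision boundary:

```latex
% a sigmoid perceptron predicts class 1 when
\sigma(w^\top x + b) > 0.5 \iff w^\top x + b > 0
% so the decision boundary is the hyperplane w^\top x + b = 0, which is linear in x
```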

what is the solution to the XOR problem
a non-linear activation function in the hidden layer (see the sketch after this card)
input layer: raw features, no computation, thus no activation function
hidden layer: non-linear activation function required to model complex non-linear patterns (often ReLU)
output layer: classification → usually sigmoid; regression → usually none (linear)
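A minimal numpy sketch (weights hand-picked for illustration, not learned) showing that one hidden layer with a non-linear activation (ReLU) is enough to represent XOR:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all XOR inputs
y_xor = np.array([0, 1, 1, 0])                  # XOR labels

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])                     # hidden-layer weights (2 neurons)
b1 = np.array([0.0, -1.0])                      # hidden-layer biases
w2 = np.array([1.0, -2.0])                      # output weights
b2 = -0.5                                       # output bias

h = relu(X @ W1 + b1)                           # non-linear hidden representation
p = sigmoid(h @ w2 + b2)                        # output probability
print((p > 0.5).astype(int))                    # -> [0 1 1 0], matching y_xor
```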

notation X
a matrix containing all the feature values (excluding labels) of all m instances in the dataset. There is one row per instance, and the i-th row is the transpose of x(i)
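As a sketch of this notation (assuming m instances and n features):

```latex
X =
\begin{pmatrix}
(x^{(1)})^\top \\
(x^{(2)})^\top \\
\vdots \\
(x^{(m)})^\top
\end{pmatrix}
\in \mathbb{R}^{m \times n}
```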

multi-layered perceptron, also called
feedforward (fully connected) artificial neural network (ANN)
ANN recap : input layer
no computation, input data, #nodes = #features
ANN recap hidden layers
intermediate layers, everything in between input and output layers
ANN : recap : output layer
layers producing the final result
universal approximation theorem
a neural network with 1 hidden layer can represent any continuous function (under mild constraints and given enough width)
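An informal statement as a sketch (the exact constraints, e.g. a non-polynomial activation and a compact domain, are assumed here, not taken from the card):

```latex
% for any continuous f and any \varepsilon > 0 there exists a 1-hidden-layer
% network g of sufficient width such that
\sup_{x} \, \lvert f(x) - g(x) \rvert < \varepsilon
```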

if we know the universal approximation theorem, why do we use more than 1 hidden layer ?
why deep and not wide learning
deep learning is
faster and easier to train
generalizes better (intuition : representation perspective)

Q : which decision boundary would this ANN learn
a logistic-regression (linear) boundary, because there is only one node with a logistic activation
what is training
updating the (trainable) parameters of the network to minimize the loss
what are the common loss functions used
regression: MSE
classification: cross-entropy
gradient
a vector that describes the rate of change of a function
components of the vector are the partial derivatives of the loss with respect to the network parameters
points in the direction of greatest increase of a function
is zero at a local maximum or local minimum
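In symbols (a sketch, with theta = (theta_1, ..., theta_n) the network parameters and L the loss):

```latex
\nabla_{\theta} L = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_n} \right)
```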

how does loss optimization work
gradient descent
randomly pick initial parameters (theta_1 and theta_2)
compute the gradient (partial derivatives of the loss function)
take a step in the opposite direction (scaled by the learning rate η)
repeat until convergence

gradient descent, an algorithm
1) initialize weights randomly N(0, sigma²)
2) loop until convergence
3) compute partial derivative
4) update according to theta = theta - learning rate * partial derivative
5) return weights
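A minimal gradient-descent sketch following these steps; the loss (MSE on a toy linear model) and the data are illustrative assumptions, not from the card:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                     # toy features
y = X @ np.array([2.0, -3.0]) + 1.0               # toy targets
Xb = np.hstack([np.ones((100, 1)), X])            # add a bias column

theta = rng.normal(0.0, 0.1, size=3)              # 1) initialize weights randomly ~ N(0, sigma^2)
lr = 0.1                                          # learning rate

for step in range(500):                           # 2) loop until convergence (fixed number of steps here)
    grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)   # 3) partial derivatives of the MSE loss
    theta = theta - lr * grad                     # 4) theta = theta - learning rate * gradient
print(theta)                                      # 5) return weights; approx [1, 2, -3]
```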
what happens due to non-linearities :
non-convex optimization problem
optimization gets stuck in a suboptimal point (parameter value)

gradient descent : solution to local minima
increase the learning rate
stochastic gradient descent (use mini batches)
gradient descent : solution to local minima : stochastic gradient descent
use mini-batches
calculate each update on a mini-batch → a subset of the full training set
since the loss is calculated on a different subset each time, the gradient will not be exactly the same, which adds noise that helps escape local minima
shuffle the observations in the training set, run through the whole training set in mini-batches (1 epoch), then shuffle again and repeat
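A mini-batch SGD sketch following this card (shuffle once per epoch, then update on consecutive mini-batches); the toy data and the MSE gradient are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])  # toy design matrix with bias column
y = X @ np.array([1.0, 2.0, -3.0])                               # toy targets

def mse_grad(theta, Xb, yb):
    return 2 / len(yb) * Xb.T @ (Xb @ theta - yb)                # gradient on one mini-batch

theta = rng.normal(0.0, 0.1, size=3)
lr, batch_size, n = 0.05, 32, len(y)

for epoch in range(20):                           # 1 epoch = one pass over the training set
    idx = rng.permutation(n)                      # shuffle the observations
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch (a subset of the training set)
        theta -= lr * mse_grad(theta, X[batch], y[batch])
print(theta)                                      # approx [1, 2, -3]
```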
local minima in gradient descent are less of a problem than they used to be,
with modern activations
and high dimensionality → with more dimensions it is easier to escape local minima
how to use gradient descent for huge networks
backpropagation is a highly efficient method to compute the gradients (in a neural network).
computed gradients can then be used for gradient descent to update parameters
backpropagation : GD
1) for each input in the batch, compute the network’s output
2) propagate the error term backwards to the preceding layers
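A backpropagation sketch for a 1-hidden-layer network (ReLU hidden layer, sigmoid output, squared-error loss); shapes, names, and the example call are illustrative assumptions:

```python
import numpy as np

def forward_backward(x, y, W1, b1, w2, b2):
    # 1) forward pass: compute the network's output for this input
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)                  # ReLU hidden activation
    z2 = w2 @ h + b2
    p = 1 / (1 + np.exp(-z2))              # sigmoid output

    # 2) backward pass: propagate the error term to the preceding layers
    dz2 = (p - y) * p * (1 - p)            # dLoss/dz2 for loss = 0.5 * (p - y)^2
    dw2, db2 = dz2 * h, dz2                # output-layer gradients
    dz1 = (dz2 * w2) * (z1 > 0)            # error term pushed back through the ReLU
    dW1, db1 = np.outer(dz1, x), dz1       # hidden-layer gradients
    return dW1, db1, dw2, db2              # gradients usable by gradient descent

grads = forward_backward(np.array([1.0, 0.0]), 1.0,
                         np.eye(2), np.zeros(2), np.array([0.5, -0.5]), 0.0)
```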

batch gradient descent
update the weights (with GD) using the entire training set for each update
mini-batch gradient descent
update the weights using a subsample from the training set
1 epoch = 1 run of all training observations through the network
what happens when the learning rate is too high
oscillations, algorithm will diverge
what happens when the learning rate is too small
slow progress
if the training set = 10,000 observations, the validation set = 1,000, and the mini-batch size = 100,
how many epochs have passed after 1,000 mini-batches?
10 epochs (1,000 mini-batches × 100 observations = 100,000 observations processed; 100,000 / 10,000 = 10 passes over the training set)
if you use the full dataset for each update, how are updates and epochs related
then the number of epochs = the number of updates
how can we avoid overfitting when training with gradient descent
→ neural networks are prone to overfitting
early stopping
explicit regularization objectives
adjusting hyperparameters related to capacity (#neurons, #layers)
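A sketch with scikit-learn's MLPClassifier combining these ideas (the parameter values are illustrative assumptions):

```python
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    hidden_layer_sizes=(32,),     # capacity-related hyperparameter (#neurons / #layers)
    alpha=1e-4,                   # explicit L2 regularization term
    early_stopping=True,          # stop when the validation score stops improving
    validation_fraction=0.1,      # fraction of training data held out for validation
    n_iter_no_change=10,          # patience: epochs without improvement before stopping
)
```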

hyperparameters ANN
weight initialization, activation function, loss, #layers, #neurons, early stopping rule, optimizer, explicit regularization terms
→ always use the validation set
focus on those for which the loss is sensitive
what type of preprocessing is needed for an ANN
categorical data: one-hot encode
continuous data: standardize / normalize
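A preprocessing sketch with scikit-learn (the column names are illustrative assumptions):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["colour", "city"]),   # one-hot encode
    ("continuous", StandardScaler(), ["age", "income"]),                           # standardize
])
```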
what do you need to learn non-linear patterns
you need non-linear activations / feature transformations
what do the hidden layers learn so that the data is linearly separable
a representation

pros neural networks
reduces the need for feature engineering
can fit any function (non-linear or otherwise)
extreme flexibility
state of the art performance across many domains
feature engineering
the crucial process of transforming raw data into meaningful input variables (features) that machine learning models can use to learn patterns and make accurate predictions
cons to neural networks
'black box', poor interpretability
optimization is stochastic, solution possibly unstable
what can neural networks learn without requiring feature engineering
interaction effects and non-linear relationships