Ch9 Artificial Neural Networks


40 Terms

1

what are artificial neural networks

a collection of connected computational units (neurons), arranged in interconnected layers

very popular for supervised and unsupervised learning

state of the art across many complex domains, mainly those based on unstructured data

2

what do we mean when we say that artificial neural networks are very powerful and flexible?

  • more layers / neurons → more capacity / complexity

  • specialized architectures

3

XOR problem

a function that a perceptron cannot learn

exclusive or is a logical operation that is true if and only if its arguments differ

4

the perceptron

an algorithm for supervised learning of binary classifiers

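A minimal sketch of the idea (illustrative NumPy code, not the course's exact formulation): a weighted sum of the inputs passed through a step function gives a binary prediction, and the classic perceptron rule nudges the weights after every mistake. It learns linearly separable functions such as AND, but no weight setting exists for XOR.

```python
import numpy as np

def predict(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0   # step activation -> linear decision boundary

def train_perceptron(X, y, eta=0.1, n_epochs=20):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(xi, w, b)     # 0 when correct, +/-1 when wrong
            w, b = w + eta * error * xi, b + eta * error
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w, b = train_perceptron(X, np.array([0, 0, 0, 1]))   # AND labels: linearly separable
print([predict(xi, w, b) for xi in X])                # [0, 0, 0, 1]
```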
5

what do we mean when we say that a perceptron with a (sigmoid) activation function can only model linear decision boundaries

the output of the perceptron is a sigmoid function of the inputs, but the classification is still a linear classification since the decision boundary is still linear

→ the function that goes into the activation function is still linear (a weighted sum of the features)

6

what is the solution to the XOR problem

a non-linear activation function in the hidden layer (see the sketch below)

  • input layer : raw features, no computation thus no activation function

  • hidden layer : non-linear activation function required to model complex non-linear patterns (often ReLU)

  • output layer : classification → usually sigmoid (or softmax), regression → usually none (linear)

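A minimal numeric sketch of why this works (hand-picked weights, purely illustrative): one hidden layer of two ReLU units is enough to reproduce XOR exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # hidden pre-activations: x1 + x2 and x1 + x2 - 1
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])         # output = h1 - 2*h2 (linear output, no activation)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(np.array(x) @ W1 + b1)
    print(x, float(h @ W2))        # prints 0, 1, 1, 0 -> XOR
```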
7

notation X

is a matrix containing all the feature values (excluding labels) of all instances (m) in the dataset. There is one row per instance and the i-th row is equal to the transpose of x(i)

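A small illustrative example of the notation (made-up values): each row of X is one instance's feature vector.

```python
import numpy as np

# m = 3 instances, 2 features each; labels are stored separately.
X = np.array([[5.1, 3.5],    # row 1 = x(1) transposed
              [4.9, 3.0],    # row 2 = x(2) transposed
              [6.2, 2.9]])   # row 3 = x(3) transposed
print(X.shape)               # (m, n_features) = (3, 2)
print(X[0])                  # the feature vector of the first instance
```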
8

multi-layered perceptron also called

feedforward (fully connected) artificial neural network (ANN)

9

ANN recap : input layer

no computation, input data, #nodes = #features

10

ANN recap : hidden layers

intermediate layers, everything in between input and output layers

11

ANN recap : output layer

layer producing the final result

12

universal approximation theorem

a neural network with 1 hidden layer can represent any continuous function (under mild constraints and given enough width)

13

if we know the universal approximation theorem, why do we use more than 1 hidden layer?

why deep and not wide learning

deep learning is

  • faster and easier to train

  • generalizes better (intuition : representation perspective)

14

Q : which decision boundary would this ANN learn

a logistic-regression (linear) decision boundary, because there is only one output node applying a logistic (sigmoid) function

15

what is training

updating the (trainable) parameters of the network to minimize the loss

16

which common loss functions are used

regression : MSE

classification : cross-entropy
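For reference, the standard forms of these losses for m instances with targets y(i) and predictions ŷ(i) (the cross-entropy shown is the binary case, assuming sigmoid outputs):

```latex
\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^{2},
\qquad
\mathrm{CE} = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log\hat{y}^{(i)} + \bigl(1-y^{(i)}\bigr)\log\bigl(1-\hat{y}^{(i)}\bigr)\Bigr]
```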

17

gradient

a vector that describes the rate of change of a function

components of the vector are the partial derivatives of the loss with respect to the network parameters

points in the direction of greatest increase of a function

is zero at a local maximum or local minimum

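In symbols, for a loss L and parameter vector θ = (θ₁, …, θₙ):

```latex
\nabla_{\theta} L = \left(\frac{\partial L}{\partial \theta_{1}}, \frac{\partial L}{\partial \theta_{2}}, \dots, \frac{\partial L}{\partial \theta_{n}}\right)
```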
18

how does loss optimization work

gradient descent

  • randomly pick initial parameter values (θ₁, θ₂)

  • compute the gradient (partial derivatives of the loss function)

  • take a step in the opposite direction (with learning rate η)

  • repeat until convergence

19

gradient descent, an algorithm

1) initialize weights randomly ~ N(0, σ²)

2) loop until convergence

  • 3) compute the partial derivatives (the gradient of the loss)

  • 4) update according to θ ← θ − η · ∂L/∂θ (η = learning rate)

5) return weights (see the code sketch below)
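A minimal sketch of these steps (illustrative NumPy code; the linear model, data and learning rate are made up for the example):

```python
import numpy as np

# Gradient descent on the MSE loss of a linear model y_hat = X @ theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # m = 100 instances, 3 features
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = rng.normal(scale=0.1, size=3)    # 1) initialize weights randomly ~ N(0, sigma^2)
eta = 0.1                                # learning rate

for _ in range(500):                     # 2) loop (fixed budget instead of a convergence test)
    grad = 2 / len(X) * X.T @ (X @ theta - y)   # 3) gradient of MSE w.r.t. theta
    theta = theta - eta * grad                  # 4) step in the opposite direction
print(theta)                             # 5) learned weights, close to true_theta
```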

20

what happens due to non-linearities :

non-convex optimization problem

optimization gets stuck in a suboptimal point (parameter value)

21

gradient descent : solution to local minima

increase the learning rate

stochastic gradient descent (use mini-batches)

22

gradient descent : solution to local minima : stochastic gradient descent

use mini-batches

  • calculate each update on a mini-batch → a subset of the full training set

  • since the loss is calculated on a different subset for each update, the gradient will not be exactly the same each time

shuffle the observations in the training set, run through the whole training set in mini-batches (= 1 epoch), then shuffle again and repeat (see the sketch below)
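A minimal sketch of the mini-batch loop (illustrative; `loss_grad` is a hypothetical helper returning the gradient on one mini-batch):

```python
import numpy as np

def sgd(X, y, theta, loss_grad, eta=0.01, batch_size=100, n_epochs=10):
    m = len(X)
    for _ in range(n_epochs):                      # one epoch = one pass over all observations
        idx = np.random.permutation(m)             # shuffle before every epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]  # subset of the full training set
            theta = theta - eta * loss_grad(X[batch], y[batch], theta)
    return theta
```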

23

local minima in gradient descent are less of a problem than they used to be,

with modern activations

and high dimensionality → with more dimensions it is easier to escape local minima

24

how to use gradient descent for huge networks

backpropagation : a highly efficient method to compute the gradients (in a neural network).

computed gradients can then be used for gradient descent to update parameters

25

backpropagation : GD

1) for each input in the batch, compute the network’s output

2) propagate the error term backwards to the preceding layers

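A minimal sketch of the two steps for a one-hidden-layer network (illustrative NumPy code; ReLU hidden layer, sigmoid output and binary cross-entropy loss are assumptions for the example):

```python
import numpy as np

def forward_backward(X, y, W1, b1, W2, b2):
    # 1) forward pass: compute the network's output for each input in the batch
    z1 = X @ W1 + b1
    h1 = np.maximum(z1, 0.0)              # ReLU
    z2 = h1 @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))     # sigmoid

    # 2) backward pass: propagate the error term to the preceding layers
    m = len(X)
    dz2 = (y_hat - y) / m                 # error at the output (sigmoid + BCE simplifies to this)
    dW2 = h1.T @ dz2
    db2 = dz2.sum(axis=0)
    dh1 = dz2 @ W2.T
    dz1 = dh1 * (z1 > 0)                  # ReLU derivative
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return y_hat, (dW1, db1, dW2, db2)    # gradients feed into the gradient-descent update
```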
26

batch gradient descent

update the weights (with GD) using the entire training set for each update

27

mini-batch gradient descent

update the weights using a subsample from the training set

1 epoch = 1 run of all training observations through the network

28

what happens when the learning rate is too high

oscillations, algorithm will diverge

29

what happens when the learning rate is too small

slow progress

30

if the training set = 10,000, validation set = 1,000, and mini-batch size = 100

how many epochs after 1,000 mini-batches?

10 epochs (1,000 mini-batches × 100 observations = 100,000 observations processed; 100,000 / 10,000 observations per epoch = 10 epochs — the validation set is not used for training)

31

if you use the full dataset for each update, how are updates and epochs related

then the number of epochs = the number of updates

32

how can we avoid overfitting when training with gradient descent

  • → neural networks are prone to overfit

  • early stopping (see the sketch below)

  • explicit regularization objectives

  • adjusting hyperparameters related to capacity : #neurons, #layers

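A minimal sketch of one of these remedies, early stopping (illustrative; `train_one_epoch` and `validation_loss` are hypothetical callables supplied by the user):

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` consecutive epochs."""
    best_val, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        val = validation_loss()
        if val < best_val:
            best_val, bad_epochs = val, 0   # improvement on the validation set: reset counter
        else:
            bad_epochs += 1                 # no improvement
            if bad_epochs >= patience:
                break                       # stop before the network overfits further
    return best_val
```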
33

hyperparameters ANN

  • weight initialization, activation function, loss, #layers, #neurons, early stopping rule, optimizer, explicit regularization terms

  • → always use the validation set

  • focus on those for which the loss is sensitive

34

what type of preprocessing needs to happen for an ANN

categorical data :

  • one-hot encode

continuous data :

  • standardize / normalize
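A minimal preprocessing sketch with scikit-learn (illustrative, made-up column names; one-hot encoding for the categorical feature, standardization for the continuous one):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "color": ["red", "blue", "red"],   # categorical feature
    "age":   [23.0, 41.0, 35.0],       # continuous feature
})
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["age"]),
])
X = preprocess.fit_transform(df)       # ready to be fed into the network
```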

35

what do you need to learn non-linear patterns

you need non-linear activations / feature transformations

36

what do the hidden layers learn so that the data is linearly separable

a representation : a transformation of the input features after which the classes become linearly separable

37

pros neural networks

  • reduces the need for feature engineering

  • can fit any function (non-linear or otherwise)

  • extreme flexibility

  • state of the art performance across many domains

38

feature engineering

the crucial process of transforming raw data into meaningful input variables (features) that machine learning models can use to learn patterns and make accurate predictions

39

cons to neural networks

  • ‘black box’, poor interpretability

  • optimization is stochastic, solution possibly unstable

40

what can neural networks learn without requiring feature engineering

interaction effects and non-linear relationships