Methods of masterclass

Last updated 11:05 AM on 3/26/26

23 Terms

New cards

What is machine learning?

Machine Learning is about prediction: using data to teach algorithms to predict outcomes they have never seen.

We give the computer data and outcomes
The algorithm finds patterns by itself
It uses these patterns to make predictions on new data.


2 phases of machine learning

First: Learn from data
Then: Predict outcomes for new inputs


What is the difference between Statistics and Machine Learning?

Goal

Statistics: understand how things are related
Machine Learning: predict as accurately as possible

Question

Statistics: what is the relationship between X and Y?
Machine Learning: if I know X, what is Y?

How you evaluate it

Statistics: look at things like coefficients and p-values
Machine Learning: look at how well the model predicts (error/accuracy)

Approach

Statistics: works with models and assumptions (e.g. a linear relationship)
Machine Learning: learns patterns from the data by itself

Style

Statistics: clear and easy to explain, but less flexible
Machine Learning: flexible and powerful, but often harder to interpret

👉 In short:

Statistics = explaining why something happens
Machine Learning = predicting what will happen


What are the main types of Machine Learning?

Type	Description	Example
Supervised Learning	The model learns from labeled data (data + outcomes).	Classification, Regression
Unsupervised Learning	The model learns from unlabeled data and finds hidden patterns.	Clustering

Classification: predict a category
Regression: predict a number
Clustering: group similar data together


What are features, targets, and training in Machine Learning?

Term	Meaning	Examples
Feature (X)	Independent variable used to make a prediction	Firm size, word counts, pixel values
Target / Label (Y)	Dependent variable the model tries to predict	Management quality, AI vs human, fraud or not
Training	The process of estimating a model by minimizing a loss function	e.g. minimize sum of squared errors
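Sketch: training as "minimize the sum of squared errors", shown for a single feature with the closed-form least-squares line. The data values and the names fit_line and sse are made up for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sse(x, y, intercept, slope):
    """Loss function: sum of squared differences between y and the prediction."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: feature X (firm size) and target Y (management quality).
firm_size = [1.0, 2.0, 3.0, 4.0, 5.0]
quality = [2.1, 2.9, 4.2, 4.8, 6.0]

intercept, slope = fit_line(firm_size, quality)
print(intercept, slope, sse(firm_size, quality, intercept, slope))
```

Any other intercept or slope gives a larger sum of squared errors on this data, which is what "estimating by minimizing a loss function" means here.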


What is the difference between Inference and Prediction?

Goal

Inference: understand whether X affects Y
Prediction: predict Y as accurately as possible

Focus

Inference: examine what exactly each variable does
Prediction: make the prediction error as small as possible

Important

Inference: watch out for problems such as hidden (confounding) factors and the direction of cause and effect
Prediction: causality does not matter, as long as the prediction is good

Methods

Inference: more restrictive models (such as fixed effects)
Prediction: flexible models (may be complex and non-linear)

Example question

Inference: does a larger firm lead to better management?
Prediction: how well can we predict management quality?


What is a Regression Tree?

A regression tree is a type of decision tree that tries to predict a number.

It asks simple yes/no questions step by step
With each question, the data is split into smaller groups
The groups are formed so that the values within each group are as similar as possible
At the end (at the "leaves" of the tree) you get a prediction:
→ this is simply the average of that group


Why do we start learning Machine Learning with Regression Trees?

Easy to understand and visualize
→ You can view it as a tree with simple steps
Works like an algorithm
→ The computer searches for the best splits itself (instead of estimating formulas as in regression)
Shows important ML ideas:
- Flexibility → can learn different patterns
- Overfitting → can fit the training data too well
- Cross-validation → helps check whether the model also works well on new data


What is the anatomy of a Regression Tree?

Node: a yes/no question that splits the data
Leaf: the final point where the tree makes a prediction (average outcome)
The tree finds the best splits automatically, so there is no need to specify a functional form


How does a Regression Tree work?

At each step, the tree tries many different splits (e.g. age < 30?)
It picks the split that makes the values within each group as similar as possible
It keeps splitting until it has to stop (e.g. the groups become too small)

Advantages

Handles non-linear relationships automatically (no straight line needed)
You do not have to come up with interactions yourself
Easy to understand and explain (you can draw it as a tree)
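
Sketch: one split search on a single feature. It assumes "as similar as possible" is measured by the sum of squared deviations from each group's mean, a common choice that the card does not spell out. The data and the name best_split are made up for illustration.

```python
def sse_around_mean(values):
    """How spread out a group is: squared deviations from the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Try every midpoint between sorted feature values; keep the best one."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse_around_mean(left) + sse_around_mean(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: outcomes jump once age passes roughly 30.
age = [22, 25, 28, 35, 40, 50]
outcome = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]

threshold, total = best_split(age, outcome)
print(threshold, total)  # the chosen question is "age < threshold?"
```

A full tree repeats this search inside each resulting group until a stopping rule applies.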


What is the difference between a Regression Line and a Regression Tree?

Method	How it works
Linear Regression	Fits one straight line through all the data
Regression Tree	Splits the data into groups and predicts the average outcome in each group


How good are our predictions? RMSE (Root Mean Squared Error)

RMSE measures the typical size of a model's prediction error
It compares the actual value (y) with the predicted value (ŷ)
The error is squared, averaged, and then square-rooted
The result is in the same units as the outcome (y)

Key idea:

Lower RMSE = better predictions
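
Sketch: RMSE computed exactly as described above (square each error, average, take the square root). The values of y and y_hat are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual values y and predictions y_hat."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y = [3.0, 5.0, 7.0, 9.0]       # actual outcomes (hypothetical)
y_hat = [2.0, 5.0, 8.0, 11.0]  # model predictions (hypothetical)

print(rmse(y, y_hat))  # in the same units as y
```

Because the errors are squared before averaging, the one large miss (9 vs 11) weighs more than the two small ones.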


What is a baseline model in prediction?

The baseline always predicts the average outcome (ȳ)
It ignores all variables
If your model cannot beat the baseline, it is useless
Always compare models to the baseline as a sanity check
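
Sketch: the baseline check. The baseline predicts the training average for every test observation, and a model is only worth keeping if its test RMSE is lower. All numbers, including model_predictions, are made up for illustration.

```python
import math


def rmse(actual, predicted):
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


train_y = [4.0, 6.0, 8.0, 10.0]  # outcomes seen during training (hypothetical)
test_y = [5.0, 9.0]              # outcomes the model has never seen (hypothetical)

# Baseline: ignore all variables and always predict the training mean.
y_bar = sum(train_y) / len(train_y)
baseline_rmse = rmse(test_y, [y_bar] * len(test_y))

# Predictions from some fitted model (hypothetical values).
model_predictions = [5.5, 8.5]
model_rmse = rmse(test_y, model_predictions)

beats_baseline = model_rmse < baseline_rmse
print(baseline_rmse, model_rmse, beats_baseline)
```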


What happens when a Regression Tree becomes more complex (more splits/leaves)?

More splits/leaves → the tree captures more detailed patterns in the data
This increases model complexity
On the training data, RMSE usually keeps decreasing

Problem:

A very complex tree may fit noise instead of real patterns
This is called overfitting

Key idea:
Better performance on training data does not always mean better predictions on new data.


What is the difference between a training set and a test set?

Dataset	Purpose
Training set (~70%)	Used to train/estimate the model
Test set (~30%)	Not used during training; used to evaluate prediction performance
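
Sketch: a random 70/30 split. The function name train_test_split, the seed, and the toy data are assumptions for illustration.

```python
import random


def train_test_split(rows, test_share=0.3, seed=0):
    """Shuffle the rows, then hold out roughly test_share of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_share)
    return rows[n_test:], rows[:n_test]


data = list(range(10))  # stand-in for 10 observations
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling first matters when the data is sorted (for example by date or by outcome); otherwise the test set would not resemble the training set.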


What does a validation curve show?

As model complexity increases, training RMSE keeps decreasing
Test RMSE first decreases, then increases
When test error increases, the model is overfitting (memorizing noise)

Key idea:

The best model is at the “sweet spot” where test RMSE is lowest, meaning it generalizes best to new data.
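
Sketch: computing a validation curve for a small regression tree on one feature, with tree depth as the measure of complexity. The simulated data (a sine curve plus noise), the seed, and all function names are assumptions for illustration. Training RMSE can only fall as depth grows; where test RMSE bottoms out depends on the data.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sse(points):
    """Squared deviations of y from the group mean."""
    center = mean([y for _, y in points])
    return sum((y - center) ** 2 for _, y in points)


def fit_tree(points, depth):
    """Greedy tree: returns a float (leaf mean) or (threshold, left, right)."""
    xs = sorted({x for x, _ in points})
    best = None
    if depth > 0:
        for lo, hi in zip(xs, xs[1:]):
            threshold = (lo + hi) / 2
            left = [p for p in points if p[0] < threshold]
            right = [p for p in points if p[0] >= threshold]
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, threshold, left, right)
    if best is None:
        return mean([y for _, y in points])  # leaf: predict the group average
    _, threshold, left, right = best
    return (threshold, fit_tree(left, depth - 1), fit_tree(right, depth - 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def rmse(tree, points):
    return math.sqrt(mean([(y - predict(tree, x)) ** 2 for x, y in points]))


# Simulated data: a smooth pattern plus noise (hypothetical).
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, math.sin(x) + rng.gauss(0, 0.5)))
train, test = data[:70], data[70:]

depths = list(range(7))
train_rmse, test_rmse = [], []
for depth in depths:
    tree = fit_tree(train, depth)
    train_rmse.append(rmse(tree, train))
    test_rmse.append(rmse(tree, test))
    print(depth, round(train_rmse[-1], 3), round(test_rmse[-1], 3))

# The "sweet spot": the depth with the lowest test RMSE.
best_depth = depths[test_rmse.index(min(test_rmse))]
print("best depth:", best_depth)
```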


What is overfitting?

Overfitting happens when a model learns the training data too well, including noise
As a result, it performs worse on new (test) data

Key lessons:

Training error ≠ true performance
Always evaluate models on test data the model has never seen
More complex models are not always better 📊


How do we turn text into numbers for a machine learning model?

Split text into words and remove common words (e.g., “the”, “is”, “and”)
Select informative words that appear differently in AI vs. human texts
Create binary indicators:
- 1 = word appears in the text
- 0 = word does not appear

Example:
Select 100 words → 50 AI-signaling + 50 human-signaling.
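
Sketch: turning a text into binary word indicators. The stop-word list and the four-word vocabulary are tiny made-up stand-ins for the 100 selected words, and the function names are assumptions.

```python
STOP_WORDS = {"the", "is", "and", "a", "of"}            # illustrative, not complete
VOCABULARY = ["delve", "moreover", "lol", "honestly"]   # hypothetical selected words


def tokenize(text):
    """Split into lowercase words, strip punctuation, drop common words."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return {w for w in words if w and w not in STOP_WORDS}


def to_indicators(text, vocabulary=VOCABULARY):
    """1 if the vocabulary word appears in the text, 0 otherwise."""
    words = tokenize(text)
    return [1 if term in words else 0 for term in vocabulary]


print(to_indicators("Moreover, the report will delve into the data."))
print(to_indicators("Honestly, lol!"))
```

Each text becomes one row of 0/1 features (X), which a classification tree or logit model can then use to predict the label (Y).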


What is the difference between a Regression Tree and a Classification Tree?

Aspect	Regression Tree	Classification Tree
Prediction	Predicts a number	Predicts a class/category (majority class)
Split criterion	Minimizes prediction error (RMSE)	Makes groups as pure as possible


When do we use a logit (logistic regression) model?

When the outcome is binary (0 or 1)
The model predicts the probability that Y = 1
Uses an S-shaped curve so predictions stay between 0 and 1
Often classify as 1 if P > 0.5 📊
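
Sketch: the S-shaped (logistic) curve and the 0.5 cut-off. The intercept and slope are hypothetical values chosen for illustration, not estimated from data.

```python
import math


def predict_probability(x, intercept, slope):
    """Logistic curve: maps any number to a probability between 0 and 1."""
    z = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Predict class 1 only when P(Y = 1) exceeds the threshold."""
    return 1 if probability > threshold else 0


intercept, slope = -2.0, 0.5  # hypothetical coefficients

for x in [0, 4, 10]:
    p = predict_probability(x, intercept, slope)
    print(x, round(p, 3), classify(p))
```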


What changes when moving from a Regression Tree to a Classification Tree?

Prediction: each leaf predicts a class (majority class) instead of a number
Split criterion: splits aim to make groups as pure as possible (observations mostly in the same class)


What is entropy in a classification tree?

Entropy measures how mixed a group is
Low entropy: mostly one class (pure group)
High entropy: classes are evenly mixed
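
Sketch: entropy of a group of labels. The card gives no formula, so this assumes the standard Shannon entropy in bits (the sum of p * log2(1/p) over the classes). The label lists are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 two-class mix."""
    total = len(labels)
    shares = [count / total for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in shares)


pure = ["AI", "AI", "AI", "AI"]
mostly_one = ["AI", "AI", "AI", "human"]
mixed = ["AI", "AI", "human", "human"]

print(entropy(pure), entropy(mostly_one), entropy(mixed))
```

A classification tree prefers splits whose resulting groups have low entropy.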


Answer: (A) The FPR (50/500 = 10%) tells us what fraction of truly human texts get flagged. Overall accuracy (85%) is misleading here because it was measured on balanced data.
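
Sketch: the arithmetic behind the answer. Only the 50 of 500 human texts and the 85% accuracy come from the card. The AI-side counts (400 detected, 100 missed) are an assumption chosen so that a balanced sample of 500 + 500 texts gives 85% accuracy.

```python
def false_positive_rate(false_positives, true_negatives):
    """Share of truly human texts that are wrongly flagged as AI."""
    return false_positives / (false_positives + true_negatives)


def accuracy(tp, tn, fp, fn):
    """Share of all texts that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


fp, tn = 50, 450   # human texts: 50 of 500 flagged (from the card)
tp, fn = 400, 100  # AI texts: assumed counts, consistent with 85% accuracy

fpr = false_positive_rate(fp, tn)
acc = accuracy(tp, tn, fp, fn)
print(fpr, acc)
```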