"Statistics is the science of collecting, organizing and interpreting numerical facts, which we call data."
Source: Statistiek in de Praktijk, David S. Moore / George P. McCabe, 1994
Important matters for the application of statistics (“Applied Statistics”):
Selecting a sample from a population
Deciding whether a sample is representative
Descriptive or inferential statistics
Measurement levels (NOIR) and types of variables (categorical/quantitative)
Selecting the correct statistical analysis
Experimental versus non-experimental research design
6 lectures
Monday: theory
6 interactive lectures
Wednesday: exam preparation
5 tutorial groups (mandatory)
Monday or Wednesday: working on assignments
Week 7: Q&A session (on Wednesday)
Literature:
Warner (2020) - Applied Statistics II – International Student Edition 3rd Edition (exam)
Warner (2013) - Applied Statistics – From bivariate through multivariate techniques (exam), and
Agresti & Finlay (2012/2018) – Statistical Methods for the Social Sciences (exam)
Monday (theory)
On campus
Introducing new chapter(s)
Embedding new theory in practice and in the weekly schedule
Wednesday (exam preparation)
On campus
Interactive, with short quizzes
Clarification, examples and repetition
Weeks 2 through 6
On campus
Attendance mandatory
Practicing with SPSS, asking questions
Content-related questions → tutor → discussion forum → Q&A
Exam: 30 multiple-choice questions:
Wednesday 28 May 2025
10 questions on Statistics 1 & 2 (A&F 2009/2012/2018)
20 questions on the statistical analyses from Statistics 3 (Warner, 2013/2020)
Final grade = exam grade
Lectures:
Theory, coherence and repetition: weeks 1 through 6
Q&A in week 7
Tutorial groups:
Monday and Wednesday: practice and the opportunity to ask the tutor questions
Attendance mandatory; at most one absence allowed
Book:
Theory: Agresti (2018) Ch. 9 + 12, Warner (2020-II): Ch. 5, 7, 8, 9, 11, 14 or Warner (2013): Ch. 6, 9, 12, 13, 15, 16, 17, 19, 22
Practice: comprehension questions at the end of every chapter
Review of Statistics 1 and 2:
Lecture week 1.
StatTalk: “Knowledge-clips” (4-5 min) divided per topic.
Grasple
Canvas tests/quizzes:
Weekly formative test; one (adapted) question from each weekly test is used in the exam.
Practice exams Statistics 1 & 2
Objectives of Statistics 3:
Review of Statistics 1 + 2 (in particular the methods and assumptions), plus new additions and applying these methods in practice (SPSS).
Discover the coherence between the different methods within the framework of the Generalized Linear Model (GLM), and thereby…
Statistics 3 provides a good basis for the bachelor's thesis (B-these).
Importance of good empirical research (for which statistics is necessary):
“Regression to the mean. It is a statistical fact of life that extreme scores tend to become less extreme upon re-testing, a phenomenon known as regression toward the mean (Kruger, Savitsky, & Gilovich, 1999). Regression to the mean can fool therapists and patients alike into believing that a useless treatment is effective (Gilovich, 1991)."
Recap(itulation) lecture
ANOVA / Regression
Factorial ANOVA
ANCOVA
Mediation / Moderation
MANOVA / Repeated Measures
Recap lecture
ANOVA
Fact. ANOVA
ANCOVA
MANOVA
Rep. Measures
Q&A lecture
Key concepts: F-test, Moderation, TSS, MSE, Tukey Contrasts, Bonferroni, Sphericity, DF, Mediation, Type II error, Type I error, Dummy
Overview of the most important concepts in statistics:
Descriptive vs inferential statistics
Data, population and sample
Reliability and validity
Variables, measurement levels and range
Central tendency-, dispersion-, and position measures
Population distribution, sample distribution and sampling distribution
Central Limit Theorem and hypothesis testing
Focus on empirical analyses:
Comparison of 2 groups on one quantitative outcome variable (t-test)
Comparison of 2 or more groups on one quantitative outcome variable (ANOVA)
Determine relation between 2 quantitative variables (regression analysis)
A&F: Statistics consists of a body of methods for obtaining and analyzing data, to:
Design [research studies that]
Describe [the data to]
Make inferences based on these data.
Descriptive Statistics:
Descriptive statistics summarize sample or population data with numbers, tables, and graphs
Inferential Statistics:
Inferential statistics make predictions about population parameters, based on a (random) sample of data.
Doing research by means of data: observation of characteristics
Population: the total set of participants, relevant for the research question
E.g. Population parameter: average hours of self-study per week of all students.
Sample: a subset of the population about whom data are collected
E.g. Sample statistic: average hours of self-study per week in a randomly selected sample of 800 students
Good data is necessary to answer the research question:
Reliability (Precision)
Validity (Bias)
Variable: a measured characteristic that can differ between subjects
Types: behavior-, stimulus-, subject-, physiological variables
Measurement scales (NOIR):
Categorical/qualitative
Nominal: unordered categories (eye color, biological sex)
Ordinal: ordered categories (disagree/neutral/agree)
Quantitative/numerical
Interval: equal distance between consecutive values (°C)
Ratio: equal distance and true zero point (K)
Range:
Discrete: measurement unit that is indivisible (# brothers/sisters)
Continuous: infinitely divisible measurement unit (body height)
Measurement level | Ordered | Interpretable differences | Absolute zero point |
---|---|---|---|
Nominal | | | |
Ordinal | ✓ | | |
Interval | ✓ | ✓ | |
Ratio | ✓ | ✓ | ✓ |
An absolute zero point means that the lowest theoretically possible value (0) indicates the complete absence of the measured characteristic.
In descriptive statistics, 3 dimensions are of importance:
Central tendency - “typical observation”
Central tendency measures: mean, mode, median …
Dispersion - “variability in observations”
Dispersion measures: standard deviation, variance, interquartile range
Position - “relative position of the observation(s)”
Gives information about relative positions of observations: percentile, quartile, …
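The three dimensions can be illustrated with Python's standard `statistics` module. This is a minimal sketch; the `hours` values are made up for illustration and are not course data.

```python
import statistics

# Hypothetical weekly self-study hours for 10 students (illustrative only).
hours = [4, 6, 6, 7, 8, 9, 10, 12, 14, 24]

# Central tendency: the "typical" observation.
mean = statistics.mean(hours)
median = statistics.median(hours)
mode = statistics.mode(hours)

# Dispersion: variability in the observations (sample versions, dividing by n - 1).
variance = statistics.variance(hours)
sd = statistics.stdev(hours)

# Position: quartiles split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, sd={sd:.2f}, IQR={iqr:.2f}")
```

Note that the single extreme value (24) pulls the mean above the median, which is why both measures are usually reported together.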
Goal: reliable and valid statements about the population based on a sample:
Sample statistic should not differ from population parameter
Problems:
Sampling error - “natural (random) sampling variation”
Sampling bias - “selective sampling”
Response bias - “incorrect answer”
Non-Response bias - “selective participation”
Important difference between problems concerning reliability (error) and validity (bias).
Solution:
“A random (or other probability) sampling approach of sufficient size that generates data for everyone approached, with correct responses on all items for all subjects.”
Population distribution
Proportion of students indicating the need for extra support in mathematics.
Sample data distribution
Proportion of students in the sample (here n = 1000) indicating the need for extra support in mathematics.
Sampling distribution
The probability distribution for the sample statistic (proportion/mean/regression coefficient). It can be interpreted as the result of repeatedly drawing samples of size n (here n = 1000).
Standard deviation of the sampling distribution of the sample proportion: \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.38 (1-0.38)}{1000}} \approx 0.015
Standard error (σM) estimated by SEM
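The idea of "repeatedly drawing samples of size n" can be made concrete with a small simulation. The sketch below assumes a population proportion of 0.38 and n = 1000, as in the example; the number of repeated samples (`n_samples`) and the random seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

p = 0.38          # population proportion from the example
n = 1000          # sample size from the example
n_samples = 2000  # number of repeated samples (arbitrary choice)

# Theoretical standard deviation of the sampling distribution (standard error).
se_theory = math.sqrt(p * (1 - p) / n)

# Draw many samples and record the sample proportion of each one.
sample_props = []
for _ in range(n_samples):
    successes = sum(1 for _ in range(n) if random.random() < p)
    sample_props.append(successes / n)

# The spread of the simulated proportions should be close to se_theory.
se_simulated = statistics.stdev(sample_props)
mean_simulated = statistics.mean(sample_props)

print(f"theoretical SE: {se_theory:.4f}")
print(f"simulated SE:   {se_simulated:.4f}")
print(f"mean of sample proportions: {mean_simulated:.3f}")
```

In practice only one sample is available, so the standard error is estimated from that single sample rather than observed directly.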
Empirical rule for normal distribution
68% within ± 1 standard deviation of the mean
95% within ± 2 standard deviations of the mean
almost 100% within ± 3 standard deviations of the mean
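These percentages are rounded values. A short check with the standard normal distribution, using `statistics.NormalDist` from the Python standard library, gives the more precise figures:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within ±{k} SD: {prob:.4f}")
# within ±1 SD: 0.6827
# within ±2 SD: 0.9545
# within ±3 SD: 0.9973
```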
Jaccard and Becker (2002):
Given a population [of individual X scores] with a mean of μ and a standard deviation of σ, the sampling distribution of the mean [M] has a mean of μ and a standard deviation [generally called the “[population] standard error,” σM] of \frac{σ}{\sqrt{N}} and approaches a normal distribution as the sample size on which it is based, N, approaches infinity. (p. 189)
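The statement can be explored by simulation. The sketch below assumes a deliberately non-normal population (an exponential distribution with mean 1 and standard deviation 1); the sample size `N`, the number of repetitions and the seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

N = 50            # size of each sample (arbitrary choice)
n_samples = 5000  # number of repeated samples (arbitrary choice)

# Exponential population with rate 1: strongly right-skewed, mean 1, SD 1.
mu, sigma = 1.0, 1.0

# Mean M of each repeated sample.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(n_samples)
]

se_theory = sigma / math.sqrt(N)              # population standard error
se_simulated = statistics.stdev(sample_means)  # spread of the simulated means
mean_of_means = statistics.fmean(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (mu = {mu})")
print(f"SD of sample means:   {se_simulated:.4f} (sigma/sqrt(N) = {se_theory:.4f})")
```

Although the individual scores are skewed, a histogram of `sample_means` would look roughly bell-shaped, and more so as `N` increases.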
(Standard) normal distribution → z-statistic
Sampling distribution for proportion(s) when H0 holds.
(Sampling distribution for mean when H0 holds and when the population standard deviation is known)
Student’s t distribution(s) → t-statistic
Sampling distribution for mean when H0 holds and when the population standard deviation is unknown.
Sampling distribution for regression coefficient(s) when H0 holds.
Chi-square distribution(s) → χ²-statistic
Sampling distribution for squared deviations (in frequencies) of categorical variables when H0 holds.
Fisher’s distribution(s) → F-statistic
Sampling distribution for ANOVA omnibus test of means when H0 holds.
Significance-test or hypothesis-test:
Method through which you determine, based on the sample, how strong the evidence against a certain hypothesis is and subsequently decide to (not) reject this hypothesis.
5 steps of a hypothesis test:
Define assumptions
Set up hypotheses
Calculate the test statistic (e.g. t-value)
Determine the p-value
Draw a conclusion
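The five steps can be sketched for a two-sided z-test of a single proportion. All numbers below (the count of 420, the hypothesised proportion of 0.38 and α = 0.05) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist

# Step 1: assumptions - random sample, categorical outcome, and n large
# enough for the normal approximation to the sampling distribution.
n = 1000
successes = 420  # hypothetical count in the sample
p0 = 0.38        # hypothesised population proportion (assumed for illustration)
alpha = 0.05     # chosen significance level

# Step 2: hypotheses - H0: p = p0 versus Ha: p != p0 (two-sided).

# Step 3: test statistic, using the standard error that holds under H0.
p_hat = successes / n
se0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

# Step 4: two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: conclusion - reject H0 when the p-value is below alpha.
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up numbers the sample proportion lies about 2.6 standard errors above the hypothesised value, so H0 is rejected at the 5% level.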
The probability of a Type I error (false positive) is determined by:
The chosen significance level (α).
The probability of a Type II error (false negative) is determined by:
Effect size
Sample size
Variance (dispersion) in the sample
The smaller the chosen probability of a Type I error (α), the larger the resulting probability of a Type II error, for a given sample.
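The trade-off can be illustrated numerically. The sketch below assumes the simplest possible setting, a one-sided one-sample z-test with known standard deviation; the helper `type2_error` and the values for effect size and sample size are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

def type2_error(alpha, effect_size, n):
    """Probability of a Type II error for a one-sided one-sample z-test.

    effect_size is the true mean difference in standard deviation units.
    """
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect_size * math.sqrt(n)      # centre of the z-statistic under Ha
    return std_normal.cdf(z_crit - shift)   # chance of staying below the threshold

effect_size, n = 0.3, 50  # hypothetical values
betas = {alpha: type2_error(alpha, effect_size, n) for alpha in (0.10, 0.05, 0.01)}

for alpha, beta in betas.items():
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Calling `type2_error(0.05, 0.3, 100)` or `type2_error(0.05, 0.5, 50)` shows that a larger sample or a larger effect lowers the Type II error probability at the same α.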
Comparison of 2 groups with one quantitative outcome variable (t-test)
Comparison of 2 or more groups with one quantitative outcome variable (ANOVA)
Determine the relation between 2 quantitative variables (regression analysis)
Comparisons between 2 samples:
Dependent samples
Husbands and wives (e.g. time spent on household activities)
Repeated measures: the same person on two different points in time (e.g. extent of depression symptoms before and after therapy)
Independent samples:
Men and women in randomly selected samples
Democrats and Republicans
Null hypothesis: H0: μ1 = μ2
Assumptions of an independent samples t-test:
Dependent variable is quantitative and normally distributed (interval/ratio-level)
Equal variances for both groups: σ1² = σ2²
Independent observations (within and between groups)
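As an illustration, the t-statistic for two independent samples can be computed by hand. This is a minimal sketch under the equal-variances assumption (pooled variance); the scores in `group1` and `group2` are hypothetical.

```python
import math
import statistics

# Hypothetical scores for two independent groups (illustrative only).
group1 = [12, 15, 11, 14, 13, 16]
group2 = [10, 9, 12, 8, 11, 10]

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)

# Pooled variance: only appropriate under the equal-variances assumption.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (m1 - m2) / se_diff
df = n1 + n2 - 2

print(f"M1 = {m1}, M2 = {m2}, t({df}) = {t:.3f}")
```

The p-value follows from the t distribution with `df` degrees of freedom, which is not part of the Python standard library; in the course this step is done in SPSS.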
ANOVA: ANalysis Of VAriance
One-way between subjects ANOVA
Each participant falls into only one group (e.g. 4 types of stress situations)
For each participant there is one observation (e.g. self-reported anxiety)
Groups are determined by the levels (categories) of the factor:
In this case the number of different stress situations
Null hypothesis: H0: μ1 = μ2 = … = μk
Assumptions for an ANOVA omnibus test:
Dependent variable is quantitative and normally distributed (interval/ratio level)
Equal variances for all k groups: σ1² = σ2² = … = σk²
‘Independent observations’ (within and between groups)
ANOVA:
F = MSbg / MSwg, where MS = mean square, bg = between groups, wg = within groups
Numerator (MSbg): information about the variance of the group means (M1, M2, … Mk)
Denominator (MSwg): information about the variance of scores within groups
The F-test is an omnibus test (‘global test’): do at least two of the means differ?
An F-test does not show which groups differ!
F-test significant? Two ways to test for differences between specific groups:
Post hoc (after the fact, after data collection, explorative) → Tukey’s test
A priori (planned beforehand, confirmative) → contrasts, regression analysis
Group-indicator = i (i = 1, …, k)
Participant-indicator = j (j = 1, …, l )
First partition each total deviation (Y_{ij} - M_Y) into two components: the within-group deviation (Y_{ij} - M_i), which is unexplained, and the between-group deviation (M_i - M_Y), which is explained by group membership:
(Y_{ij} - M_Y) = (Y_{ij} - M_i) + (M_i - M_Y)
Square each deviation and sum across all scores in the entire dataset. The cross-product term sums to zero within each group, so:
\sum(Y_{ij} - M_Y)^2 = \sum(Y_{ij} - M_i)^2 + \sum(M_i - M_Y)^2
SS_{total} = SS_{wg} + SS_{bg}
k = Number of groups
N = Total number of observations
df = Degrees of Freedom
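The partition of the sums of squares and the resulting F-ratio can be checked by hand. The sketch below uses hypothetical anxiety scores for k = 3 groups and the standard degrees of freedom for a one-way between-subjects ANOVA (k - 1 between groups, N - k within groups).

```python
import statistics

# Hypothetical anxiety scores for k = 3 groups (illustrative only).
groups = {
    "low stress": [3, 4, 5, 4],
    "medium stress": [6, 5, 7, 6],
    "high stress": [8, 9, 7, 8],
}

all_scores = [y for scores in groups.values() for y in scores]
grand_mean = statistics.mean(all_scores)
k = len(groups)      # number of groups
N = len(all_scores)  # total number of observations

# Sums of squares: total, within groups, between groups.
ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
ss_wg = sum((y - statistics.mean(s)) ** 2 for s in groups.values() for y in s)
ss_bg = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in groups.values())

# Degrees of freedom and mean squares.
df_bg, df_wg = k - 1, N - k
ms_bg, ms_wg = ss_bg / df_bg, ss_wg / df_wg
F = ms_bg / ms_wg

print(f"SS_total = {ss_total}, SS_wg = {ss_wg}, SS_bg = {ss_bg}")
print(f"F({df_bg}, {df_wg}) = {F:.2f}")
```

For these made-up scores SS_total (38) equals SS_wg (6) plus SS_bg (32), and F(2, 9) = 24.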
The univariate (“one variable”) statistics:
Measures of central tendency
Measures of dispersion
Confidence interval mean/proportion
Significance test mean/proportion
Significance test difference between groups
Bivariate (“two variables”) statistics is about investigating a possible association between two different variables:
Predictor variable or independent variable
Outcome variable or dependent variable
Other methods used in Statistics 3 ([M]AN[C]OVA) can be related to OLS regression (together forming the GLM)
Association between 2 variables
E.g.: exam grade (Y) and hours of self study (X)
Null hypothesis: H0: ρ = 0; H0: b = 0; H0: R = 0
Assumptions bivariate regression (simple linear regression)
Dependent variable (Y) is quantitative and independent variable (X) is quantitative or dichotomous.
There is a linear relationship between Y and X.
Independent observations.
Equal variance of errors.
Errors are normally distributed with a mean of 0 for all values of X.
Assumed functional form for the population:
Y_i = β_0 + βX_i + ε_i
Regression function: Y_i' = b_0 + bX_i
Y_i': predicted value for Y_i
X_i: observed X for person i
b_0: intercept
b: slope
Y_i - M_Y: total deviation from the mean
Y_i' - M_Y: part predicted by X_i
Y_i - Y_i': error/residual of the prediction
SS_{total} = SS_{residual} + SS_{regression}
R^2 = \frac{SS_{regression}}{SS_{total}}
SE_{est} = \sqrt{\frac{\sum (Y_i - Y_i')^2}{N-2}} = \sqrt{\frac{SS_{residual}}{N-2}} = \sqrt{(1-R^2) \cdot \frac{SS_{total}}{N-2}}
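These quantities can be computed step by step for a small dataset. The sketch below fits a bivariate OLS regression by hand; the values for hours of self-study (`X`) and exam grade (`Y`) are hypothetical.

```python
import math

# Hypothetical data: hours of self-study (X) and exam grade (Y).
X = [2, 4, 5, 7, 8, 10]
Y = [4.0, 5.5, 5.0, 6.5, 7.5, 8.0]

N = len(X)
mean_x = sum(X) / N
mean_y = sum(Y) / N

# OLS estimates of the slope (b) and intercept (b0).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
b = sxy / sxx
b0 = mean_y - b * mean_x

# Predicted values Y' for each observed X.
predicted = [b0 + b * x for x in X]

# Partition of the total sum of squares.
ss_total = sum((y - mean_y) ** 2 for y in Y)
ss_regression = sum((yp - mean_y) ** 2 for yp in predicted)
ss_residual = sum((y - yp) ** 2 for y, yp in zip(Y, predicted))

r_squared = ss_regression / ss_total
se_est = math.sqrt(ss_residual / (N - 2))

print(f"Y' = {b0:.3f} + {b:.3f} X")
print(f"R^2 = {r_squared:.3f}, SE_est = {se_est:.3f}")
```

Both expressions for the standard error of the estimate give the same value here, since SS_residual = (1 - R^2) · SS_total.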
Difference between 2 groups: Independent samples t-test
Difference between 2 groups: OLS bivariate regression
Difference between 2 groups: one-factor ANOVA (between-subjects)
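For two groups these three analyses give the same answer, which can be verified numerically. The sketch below uses hypothetical scores and a 0/1 dummy variable for group membership, and assumes equal variances (pooled t-test).

```python
import math
import statistics

group0 = [10, 9, 12, 8, 11, 10]     # hypothetical scores, dummy X = 0
group1 = [12, 15, 11, 14, 13, 16]   # hypothetical scores, dummy X = 1
n0, n1 = len(group0), len(group1)
m0, m1 = statistics.mean(group0), statistics.mean(group1)

# 1) Independent samples t-test with pooled variance.
pooled_var = ((n0 - 1) * statistics.variance(group0)
              + (n1 - 1) * statistics.variance(group1)) / (n0 + n1 - 2)
t_test = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# 2) OLS bivariate regression of Y on the 0/1 dummy variable.
X = [0] * n0 + [1] * n1
Y = group0 + group1
N = len(Y)
mean_x, mean_y = sum(X) / N, sum(Y) / N
sxx = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / sxx
b0 = mean_y - b * mean_x
ss_residual = sum((y - (b0 + b * x)) ** 2 for x, y in zip(X, Y))
se_b = math.sqrt(ss_residual / (N - 2)) / math.sqrt(sxx)  # standard error of the slope
t_regression = b / se_b

# 3) One-factor between-subjects ANOVA with k = 2 groups.
ss_bg = n0 * (m0 - mean_y) ** 2 + n1 * (m1 - mean_y) ** 2
ss_wg = sum((y - m0) ** 2 for y in group0) + sum((y - m1) ** 2 for y in group1)
F = (ss_bg / (2 - 1)) / (ss_wg / (N - 2))

print(f"mean difference = {m1 - m0}, slope b = {b:.3f}")
print(f"t (t-test) = {t_test:.3f}, t (regression) = {t_regression:.3f}")
print(f"F = {F:.3f}, t^2 = {t_test ** 2:.3f}")
```

The slope equals the difference between the group means, the two t-values coincide, and F equals t squared.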
Self-study:
Agresti (2018) Ch. 9 or Warner (2013) Ch. 9
Canvas quiz week 1