General questions from 2–3 lectures: p-values, central limit theorem, Type I error, Type II error, mediation, moderation, principal component-based index.
Research question - A specific question, based on observations, previous research, or theoretical knowledge, that you aim to answer through your study
Observation - Something you observe yourself or read about in previous research; a measurement of a specific variable
Theory
A theory is a set of principles that explains a broad phenomenon based on repeated testing and evidence. Example: "Well-designed websites are more attractive."
Hypothesis
A hypothesis is a proposed explanation for a specific observation or phenomenon, driven by theory but untested. Examples: H1: Websites with easier checkouts lead to higher consumption. H2: Security awareness differs among social media users across age groups.
Predictions
Changing the hypothesis into measurable terms, enabling data collection and analysis. Example: asking people to rate their checkout experience and measuring their spending.
Descriptive statistics
Summarize and analyze data using measures of (see the sketch after this list):
**Central Tendency:** Mean, median, and mode.
Mean: Average value.
Median: Middle value.
Mode: Most frequent value.
**Variability (Spread):**
Range: Difference between the smallest and largest value.
Variance: Measure of dispersion of data around the mean
Standard Deviation: Square root of the variance
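A minimal sketch of these measures using Python's standard `statistics` module; the sample values are made up for illustration:

```python
import statistics as st

x = [2, 4, 4, 5, 7, 9]        # hypothetical sample

print(st.mean(x))             # mean: average value
print(st.median(x))           # median: middle value
print(st.mode(x))             # mode: most frequent value -> 4
print(max(x) - min(x))        # range: largest minus smallest value
print(st.variance(x))         # sample variance: dispersion around the mean
print(st.stdev(x))            # standard deviation: square root of the variance
```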
Independent variable
A variable that influences or predicts the dependent variable; it does not depend on other variables.
Dependent variable
The outcome that depends on the independent variable.
Example: Age (independent) influences security awareness (dependent)
Nominal scale - Categorical data without any order (Gender)
Ordinal scale - Categorical data with a meaningful order but no consistent spacing (Ranking Website credibility)
Ratio scale - Numeric data where both differences and ratios are meaningful (Consumption)
Interval scale - Numeric data where differences are meaningful, but ratios are not (Body Mass Index)
Validity - Whether the measurement actually measures what it is supposed to measure (e.g., investigating total income vs. income from labor)
Reliability - Whether the measurement is free from errors or bias (e.g., people may hide the true amount of screen time for their children or exaggerate their security awareness)
Population - The entire group you aim to study (All consumers on Swedish websites)
Sample - A subset of that population (e.g., 100 consumers from that population)
Survey - Efficient and low-cost method to collect data from many respondents.
Experiment - Allows control over variables to establish cause-and-effect relationships.
Between effect - Comparing different groups receiving different treatments
Within effect - Measuring the same individuals before and after an intervention.
Frequency distribution
A table or chart that shows how often each value (or range of values) occurs in a dataset, indicating how likely it is that a given score occurs (i.e., its probability).
Probability distribution
Like a histogram except that the lumps and bumps have been smoothed out so that we see a nice smooth curve.
The area under the curve represents the probability of a value occurring.
Normal distribution
A continuous probability distribution where values lie symmetrically around the mean, often referred to as bell-shaped, and commonly used in statistics. Its properties include:
Mean = Median = Mode.
Symmetry about the center
Standard Normal distribution
Special type of normal distribution with:
Mean = 0
Standard deviation = 1
Any normal distribution can be converted to a standard normal distribution using the Z-score
Z-score - How many standard deviations a data point is from the mean of the dataset: $z = \frac{x - \mu}{\sigma}$. Z = 2 means the data point is 2 standard deviations above the mean (see the sketch below).
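A small sketch of the Z-score and the "area under the curve" idea from the probability-distribution entry above; the mean, standard deviation, and data point are hypothetical:

```python
from scipy.stats import norm

mu, sigma = 50, 10        # assumed population mean and standard deviation
x = 70                    # an observed data point

z = (x - mu) / sigma      # Z-score: 2.0 -> two standard deviations above the mean
p_below = norm.cdf(z)     # area under the standard normal curve left of z (about 0.977)
p_above = 1 - p_below     # probability of observing a value above x (about 0.023)

print(z, p_below, p_above)
```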
SPINE of statistics is an acronym for: Standard error, Parameters, Interval estimates (confidence intervals), Null hypothesis significance testing, Estimation.
Statistical Model - Mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population).
Parameters
Parameters are values that describe characteristics of a population, such as the mean, proportion, or variance. Since we cannot observe the whole population, we estimate parameters using samples. Sample estimates: $\bar{x}$, $s^2$, etc. These are called point estimators.
2. Statistical Model
A statistical model is a mathematical representation of how data is generated, incorporating assumptions and error.
3. Parameters
Parameters are values that describe characteristics of a population, such as the mean, proportion, or variance.
4. Degrees of Freedom
Degrees of freedom are the number of values in a calculation that are free to vary after certain constraints are applied.
5. Standard Error
The standard error measures the uncertainty or variability of a sample statistic when estimating a population parameter.
6. Sampling Distribution
The sampling distribution is the distribution of a sample statistic obtained from repeatedly sampling the population.
7. Central Limit Theorem (CLT)
The Central Limit Theorem states that the sampling distribution of the sample mean will approximate a normal distribution if the sample size is sufficiently large, regardless of the population's shape (see the simulation after this list).
8. Confidence Intervals
A confidence interval is a range of values that is likely to contain the true population parameter with a specified level of confidence.
9. Coverage Rate
The coverage rate is the probability that a confidence interval contains the true population parameter.
10. T-Distribution
The t-distribution is a probability distribution used for small sample sizes when the population standard deviation is unknown.
11. Estimating Parameters
Estimating parameters involves finding values that best represent population characteristics using sample data.
12. Variation and Error
Variation in a model refers to the total variability in the data, which can be divided into explained variation (captured by the model) and unexplained variation (random error or noise).
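A short simulation with invented data illustrating items 5–10 above: the spread of repeated sample means (standard error, sampling distribution, CLT) and a t-based 95% confidence interval. The population shape and sample sizes are arbitrary choices for the illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a clearly non-normal population

# CLT: means of repeated samples are roughly normal even though the population is skewed
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# One sample: point estimate, standard error, and a 95% t-based confidence interval
sample = rng.choice(population, size=50)
xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)  # critical t value, df = n - 1
ci = (xbar - t_crit * se, xbar + t_crit * se)

print(np.std(sample_means), se)  # spread of the sample means is close to the standard error
print(ci)                        # should cover the true mean (2.0) in about 95% of samples
```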
1. Hypothesis Testing:
A statistical method used to determine if there is enough evidence in a sample to infer that an effect or relationship exists in the population.
2. Null Hypothesis ($H_0$):
The assumption that there is no effect or no relationship between variables.
3. Alternative Hypothesis ($H_1$):
The assumption that there is an effect or a relationship between variables.
4. Test Statistic:
A value calculated from sample data that is used to test the null hypothesis (e.g., t-statistic, z-statistic).
5. P-Value:
The probability of obtaining the observed results, or more extreme results, if the null hypothesis is true.
6. Significance Level ($\alpha$):
The threshold probability (e.g., 0.05) for deciding whether to reject the null hypothesis.
7. Type I Error (False Positive):
Rejecting the null hypothesis when it is actually true.
Denoted by $\alpha$.
8. Type II Error (False Negative):
Failing to reject the null hypothesis when it is actually false.
Denoted by $\beta$.
9. Power of a Test:
The probability of correctly rejecting a false null hypothesis.
Calculated as $1 - \beta$.
10. Two-Tailed Test:
A hypothesis test where the rejection region is in both tails of the distribution. It checks for effects in both directions (e.g., greater or smaller).
11. One-Tailed Test:
A hypothesis test where the rejection region is in one tail of the distribution. It checks for an effect in a specific direction (e.g., only greater or only smaller).
12. Rejection Region:
The range of values for a test statistic that leads to rejecting the null hypothesis ($H_0$).
13. Sampling Distribution:
The distribution of a sample statistic (e.g., mean or proportion) obtained from repeated samples of the population.
14. Matched Samples:
Samples where measurements are taken on the same individuals or subjects before and after treatment.
15. Independent Samples:
Samples where observations are taken from different and unrelated groups.
16. Hypothesis Test for Matched Samples:
A test where the differences between pairs of observations (before and after) are analyzed to test the mean difference.
17. Critical Value Method:
A method of hypothesis testing where the test statistic is compared to a critical value based on the significance level.
18. P-Value Method:
A method of hypothesis testing where the p-value is compared to the significance level ($\alpha$) to decide whether to reject the null hypothesis (see the t-test sketch after this list).
19. Misconception About Significant Results:
A significant result does not necessarily mean the effect is important.
A non-significant result does not mean the null hypothesis is true.
20. Significance Levels:
The probability of making a Type I error, often set at 0.05, 0.01, or 0.10.
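A hedged sketch of the two test setups from items 14–16, with invented data: a matched-samples (paired) t-test and an independent-samples t-test, each compared against a 0.05 significance level:

```python
from scipy import stats

# Matched samples: the same subjects measured before and after an intervention
before = [72, 68, 75, 80, 66, 71]
after  = [75, 70, 74, 85, 69, 76]
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent samples: two unrelated groups
group_a = [3.1, 2.8, 3.5, 3.0, 2.9]
group_b = [3.8, 3.6, 4.0, 3.7, 3.9]
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(p_paired < alpha)   # reject H0 for the paired test if p-value < significance level
print(p_ind < alpha)      # reject H0 for the independent test if p-value < significance level
```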
1. Non-Parametric Methods
Non-parametric methods are statistical techniques that do not require assumptions of normality or specific population distributions. They can handle qualitative data (nominal and ordinal scales).
2. Wilcoxon Matched-Pairs Signed Rank Test
The Wilcoxon Matched-Pairs Signed Rank Test is a non-parametric test used for related data as an alternative to the t-test for paired samples.
It compares before-and-after studies or measures taken under different conditions on the same subjects.
3. Mann-Whitney U Test
The Mann-Whitney U Test is a non-parametric alternative to the t-test for independent samples.
It compares two independent groups without requiring normality.
It can be used for ordinal data.
4. Chi-Square Distribution ($\chi^2$)
The chi-square distribution is a statistical distribution used for tests involving categorical data.
If you square a standard normal variable, the resulting values follow a chi-square distribution (with one degree of freedom).
Summing several independent squared standard normal variables also results in a chi-square distribution, with degrees of freedom equal to the number of variables summed.
5. Contingency Tables
A contingency table is used to measure the relationship between two variables that are either nominal or ordinal. It summarizes observed frequencies for combinations of categories.
6. Contingency Coefficient
The contingency coefficient measures the degree of dependency between two variables in a contingency table.
A larger value indicates stronger dependency.
7. Expected Counts in Contingency Tables
The expected count is the value calculated for each cell in a contingency table under the assumption that the two variables are independent.
It is determined based on row totals, column totals, and the overall sample size.
8. Degrees of Freedom for Chi-Square Test
In a chi-square test, the degrees of freedom are calculated as:
$\text{df} = (\text{number of rows} - 1)(\text{number of columns} - 1)$
9. Null Hypothesis for Chi-Square Test
Null Hypothesis ($H_0$): The two variables are independent (no relationship).
Alternative Hypothesis ($H_1$): The two variables are dependent (there is a relationship).
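A sketch of the tests in this section with made-up data: the Wilcoxon matched-pairs signed rank test, the Mann-Whitney U test, and a chi-square test of independence on a contingency table:

```python
import numpy as np
from scipy import stats

# Wilcoxon matched-pairs signed rank test (paired, non-parametric)
before = [4, 3, 5, 2, 4, 3, 6]
after  = [5, 5, 6, 3, 6, 4, 5]
print(stats.wilcoxon(before, after))

# Mann-Whitney U test (two independent groups, non-parametric, works for ordinal data)
group_a = [1, 2, 2, 3, 4]
group_b = [3, 4, 4, 5, 5]
print(stats.mannwhitneyu(group_a, group_b))

# Chi-square test of independence on a 2x3 contingency table
# H0: the row and column variables are independent
table = np.array([[20, 15, 10],
                  [10, 25, 20]])
chi2, p, df, expected = stats.chi2_contingency(table)
print(chi2, p, df)        # df = (rows - 1) * (columns - 1) = 2
print(expected)           # expected counts under independence
```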
1. Principal Component Analysis (PCA):
A data reduction technique used to reduce many variables into a smaller set of components that explain most of the variation in the data (a short sketch appears at the end of this section).
2. Factor Analysis:
A statistical method that explains variability among observed, correlated variables in terms of a smaller number of unobserved variables called factors.
3. Pearson Correlation Coefficient (PCC):
A measure of the linear relationship between two quantitative sets of data. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
4. Eigenvalue:
A measure of how much variation in the data a principal component accounts for.
5. Scree Plot:
A visual tool used to determine the number of components to keep in PCA or factor analysis by looking for the "elbow" or break in the curve.
6. Common Variance:
The part of a variable's variation that is shared with other variables; also known as communalities.
7. Specific Variance:
The portion of a variable's variance that is unique to that variable and not shared with others.
8. Error Variance:
The portion of the variance that cannot be explained by the correlations with other variables.
9. Communalities:
A measure of how much of a variable's variance is explained by the extracted factors. Communalities above 0.5 indicate sufficient shared variance for factor analysis.
10. Factor Loadings:
The correlation coefficients between observed variables and factors. Factor loadings indicate how strongly a variable is related to a factor.
Values greater than 0.3: Minimum for interpretation.
Values greater than 0.5: Strong enough to interpret.
Values greater than 0.7: High practical significance.
11. Factor Rotations:
A method used to simplify the interpretation of factor loadings by minimizing complexity.
Varimax: Used when factors are assumed to be independent.
Oblimin/Promax: Used when factors are expected to be correlated.
12. Surrogate Variables:
Variables selected to represent a broader concept or construct, acting as a proxy for related variables in further analysis (e.g., regression models).
13. Summated Scale:
A composite measure created by combining multiple related variables into a single index, reducing complexity and improving reliability.
Reliability: Tested using Cronbach's alpha, where values above 0.6–0.7 are acceptable.
14. Data Requirements for Factor Analysis:
Data should ideally be on an ordinal scale.
Categorical data can be transformed into dummy variables.
At least 10–20 observations per variable are recommended.
Normality is preferred but not strictly necessary unless for hypothesis testing.
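A small PCA sketch using scikit-learn on random example data (the data and dimensions are hypothetical), showing the eigenvalues/explained variance that feed a scree plot and the component loadings; with real questionnaire data the scree plot would show a clearer elbow:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))               # 200 observations, 6 observed variables

X_std = StandardScaler().fit_transform(X)   # standardize so PCA works on the correlation structure
pca = PCA().fit(X_std)

print(pca.explained_variance_)              # eigenvalues: variation each component accounts for
print(pca.explained_variance_ratio_)        # share of total variation (plot these for a scree plot)

# Loadings: correlations between the observed variables and the components
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.round(2))
```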
Regression analysis: statistical method to investigate the relationship between a dependent variable (what you are trying to predict) and one or more independent variables (what you use to predict). Regression finds relationships between predictors (X) and outcomes (Y).
Dependent Variable (Y): The outcome you are studying.
Independent Variable (X): The predictors or causes of the outcome.
One-sample t-test statistic: $t=\frac{\bar{x}-\mu_0}{s_{\bar{x}}}$, where $s_{\bar{x}}$ is the standard error of the mean.
1.Simple Linear Regression - When there is one independent variable.
Example: Predicting a student’s exam score (Y) using their study hours (X).
Model: $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\beta_0$ is the intercept (where the line meets the Y-axis, when X = 0), $\beta_1$ is the slope (how much Y changes for a one-unit increase in X), and $\varepsilon$ is the error term (captures other influences not included in X).
Example: If a student studies 1 hour more, their score increases by 5 points ($\beta_1 = 5$).
2.Multiple Linear Regression - When there are two or more independent variables.
Example: Predicting website credibility (Y) using age (X₁), gender (X₂), and internet usage (X₃).
Each βi tells you the effect of one independent variable on Y, while keeping the others constant.
Example: "If internet usage increases by 1 time per day, website credibility increases by 0.129 points, assuming other factors (like age and gender) stay constant.”
Error term: The error term accounts for all other factors that influence the dependent variable but are not included as explanatory variables in the model.
R-squared ($R^2$): Measures how much of the variation in the dependent variable is explained by the independent variables in the regression model. It ranges between 0 and 1, where 1 represents a perfect fit.
$R^2 = 0.48$: The model explains 48% of the variation in Y.
$R^2 = 0$: The model explains none of the variation. $R^2 = 1$: The model explains all of the variation.
Residual: The difference between the actual observed value and the predicted value from the regression model.
Dummy variables: Used to include qualitative (categorical) data, such as gender or font type, in regression models by coding it as 0 or 1.
How to include:
If there are $k$ categories, create $k - 1$ dummy variables.
The missing category is called the reference or base category.
Example: You are analyzing the effect of font size (3 types) on reading satisfaction:
$X_1 = 1$ if font size = 2; 0 otherwise.
$X_2 = 1$ if font size = 3; 0 otherwise.
The coefficient of each dummy variable shows how the group differs from the reference group (font size = 1).
Regression models are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared errors.
Hypothesis testing in regression: Used to determine if the independent variable(s) have a significant effect on the dependent variable (see the OLS sketch at the end of this section).
Hypotheses for Individual Coefficients:
Null Hypothesis ($H_0$): $\beta_i = 0$ → The independent variable has no effect on Y.
Alternative Hypothesis ($H_1$): $\beta_i \neq 0$ → The independent variable has an effect on Y.
Overall Model Hypothesis (Multiple Regression):
Null Hypothesis: $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ → None of the independent variables affect Y.
Alternative Hypothesis: At least one $\beta_i \neq 0$.
F-test: Tests if the overall regression model is significant.
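A hedged sketch of OLS estimation using statsmodels on simulated data with hypothetical variable names (credibility, usage, age, gender), showing a dummy-coded category, R-squared, the coefficient t-tests, and the overall F-test:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    "usage": rng.integers(0, 10, n),               # e.g. internet usage per day
    "age": rng.integers(18, 70, n),
    "gender": rng.choice(["male", "female"], n),   # categorical -> dummy coded
})
# Simulated outcome: credibility depends on usage and age plus random error
df["credibility"] = 2 + 0.13 * df["usage"] + 0.02 * df["age"] + rng.normal(0, 1, n)

# OLS with a dummy for gender (the omitted category is the reference group)
model = smf.ols("credibility ~ usage + age + C(gender)", data=df).fit()

print(model.rsquared)                 # share of variation in Y explained by the model
print(model.params)                   # beta estimates, incl. the dummy's difference from the reference
print(model.pvalues)                  # t-tests: H0 is beta_i = 0
print(model.fvalue, model.f_pvalue)   # F-test: H0 is that all slopes = 0
```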
1. Functional Form
The functional form specifies how the dependent variable (Y) relates to the independent variables (X) in a regression model. Choosing the correct form ensures a better fit and more accurate results.
2. Log Transform
A transformation used to make a non-linear relationship more linear. In the log-log form, the slope parameters are interpreted as elasticity coefficients.
3. Polynomial Form
A functional form that expresses the dependent variable as a function of the independent variables, where one or more variables are raised to powers greater than 1.
4. Moderation Effect
A moderation effect occurs when the relationship between two variables depends on a third variable, called the moderator.
5. Mediation Analysis
Mediation analysis explains how an independent variable affects a dependent variable through one or more mediator variables.
6. Types of Mediation
Full Mediation: The mediator fully explains the relationship between the independent and dependent variables.
Partial Mediation: The mediator explains part of the relationship while the independent variable still has a direct effect on the dependent variable.
7. Expected Value of the Error Term
The expected value of the error term in a regression model is 0, meaning there is no systematic overestimation or underestimation in the regression line.
8. Homoscedasticity (Constant Variance)
Homoscedasticity means that the variance of the error term is constant across all observations. If this condition is violated, the error term is heteroscedastic.
9. Normality Assumption
The normality assumption states that the error term in the regression model is normally distributed.
10. Multicollinearity
Multicollinearity occurs when two or more independent variables are highly correlated. This can lead to unstable estimates of regression coefficients and increases the risk of a Type II error (failing to reject a false null hypothesis).
11. Residual Analysis
Residual analysis involves examining the differences between observed and predicted values to validate regression assumptions such as homoscedasticity and normality.
12. Consequences of Violating Assumptions
Homoscedasticity Violation: Standard errors become unreliable, and t-tests and F-tests cannot be trusted.
Normality Violation: The estimated standard errors and tests are invalid, particularly for small samples.
Multicollinearity: Causes unstable parameters and increases the risk of not detecting significant effects.
13. Solutions to Violating Assumptions
Apply Weighted Least Squares (WLS) to correct heteroscedasticity.
Use White’s heteroscedasticity-consistent covariance matrix for robust standard errors.
Address multicollinearity by combining correlated variables into indices or using a single representative variable.
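A sketch with simulated data (the variable names x, m, y are hypothetical) of a moderation effect modeled as an interaction term, White's heteroscedasticity-consistent standard errors, and a variance inflation factor (VIF) check for multicollinearity, using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n), "m": rng.normal(size=n)})
df["y"] = 1 + 0.5 * df["x"] + 0.3 * df["m"] + 0.4 * df["x"] * df["m"] + rng.normal(0, 1, n)

# Moderation: the effect of x on y depends on m, modeled with an interaction term
mod = smf.ols("y ~ x * m", data=df).fit()
print(mod.params)                          # the x:m coefficient is the moderation effect

# White's heteroscedasticity-consistent covariance matrix (robust standard errors)
robust = mod.get_robustcov_results(cov_type="HC0")
print(robust.bse)

# Multicollinearity: VIF for each regressor (values well above ~10 are a warning sign)
X = mod.model.exog
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```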
1. Experimental Study
An experimental study involves controlling one or more factors to observe their effect on a dependent variable.
2. Observational Study
An observational study does not control the factors but simply observes and collects data.
3. Outcome (Dependent) Variable
The variable selected by the experimenter to measure the effect of treatments or factors.
4. Factor (Treatment)
A non-metric independent variable manipulated in an experiment to observe its impact on the dependent variable.
5. Analysis of Variance (ANOVA)
A statistical method used to determine whether there are significant differences between the means of three or more groups.
6. Null Hypothesis for ANOVA ($H_0$)
The null hypothesis states that the population means are equal:
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
7. Alternative Hypothesis for ANOVA ($H_1$)
The alternative hypothesis states that not all population means are equal.
8. Sum of Squares Total (SST)
A measure of the total variation in the outcome variable. It includes both within-group and between-group variation.
9. Sum of Squares Model (SSM)
The portion of total variation explained by the differences between group means (between-group variation).
10. Sum of Squares Residual (SSR)
The portion of total variation that is unexplained and occurs within groups.
11. Mean Squares
The mean squares are obtained by dividing the sum of squares by their corresponding degrees of freedom.
Mean Squares Model (MSM): Average between-group variation.
Mean Squares Residual (MSR): Average within-group variation.
12. F-Statistic
The ratio of Mean Squares Model to Mean Squares Residual. A larger F-statistic indicates stronger evidence against the null hypothesis.
13. Type I Error
The probability of rejecting a true null hypothesis (false positive).
14. Type II Error
The probability of failing to reject a false null hypothesis (false negative).
15. Statistical Power
The probability of correctly rejecting a false null hypothesis.
16. Assumptions of ANOVA
No Heteroscedasticity: Equal variance between groups.
Normality: The error terms follow a normal distribution.
Independence: Observations must be independent of each other.
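A one-way ANOVA sketch with invented data for three groups, matching items 5–16 above: first the quick SciPy test of equal means, then the same model via statsmodels to show the between/within sum-of-squares decomposition:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

g1 = [23, 25, 21, 22, 24]
g2 = [30, 28, 27, 31, 29]
g3 = [26, 24, 25, 27, 23]

# H0: all group means are equal; a large F-statistic speaks against H0
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)

# The same test as an OLS model gives the full sum-of-squares table
df = pd.DataFrame({"score": g1 + g2 + g3,
                   "group": ["g1"] * 5 + ["g2"] * 5 + ["g3"] * 5})
model = smf.ols("score ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(model))   # SS for the model (between) and residual (within), F, p
```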
17. Logit Model
A regression model used when the dependent variable is binary (e.g., 0 or 1).
18. Binary Response Variable
A variable that takes on two possible values:
Example:
1 = Purchase
0 = No Purchase
19. Index Variable
An unobserved quantitative variable that determines a binary outcome based on cost-benefit analysis.
20. Maximum Likelihood Estimation (MLE)
A method used to estimate parameters in the logit model by finding values that maximize the likelihood of observing the data.
21. Odds Ratio
The odds ratio measures the change in the odds of the dependent variable being 1 when the independent variable increases by one unit.
Odds Ratio > 1: Positive relationship.
Odds Ratio < 1: Negative relationship.
22. Pseudo R-Squared
A measure similar to R-squared in linear regression but adjusted for models like the logit model. It indicates the model's fit to the data.
23. Hypothesis Testing in Logit Models
The significance of an individual parameter is tested to determine if it differs significantly from zero. This is done using p-values.
24. Significance in Logit Models
Significant Effect: p-value < 0.05.
Not Significant Effect: p-value > 0.05.
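A logit sketch with simulated purchase data (the variable names income and ads_seen are hypothetical), showing maximum likelihood estimation, odds ratios, McFadden's pseudo R-squared, and the parameter p-values, using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"income": rng.normal(50, 10, n),
                   "ads_seen": rng.integers(0, 5, n)})

# Simulated binary outcome: 1 = purchase, 0 = no purchase
index = -6 + 0.1 * df["income"] + 0.5 * df["ads_seen"]       # unobserved index variable
df["purchase"] = rng.binomial(1, 1 / (1 + np.exp(-index)))

model = smf.logit("purchase ~ income + ads_seen", data=df).fit()  # maximum likelihood estimation

print(np.exp(model.params))   # odds ratios: > 1 positive, < 1 negative relationship
print(model.prsquared)        # McFadden's pseudo R-squared
print(model.pvalues)          # significant effect if p-value < 0.05
```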