Me, Myself and I Notes

Lecture 1

Learning Objectives

💙Describe why biology is considered a quantitative subject

🩷Explain the importance of statistical analyses in biological research

💛Define key descriptive statistics – mean, median, range

💙Quantifying Biological Data

Biological research relies on accurate and precise measurements of various biological parameters
- E.g. length, mass, concentration, time and genetic sequence

Researchers often manipulate variables and control experimental conditions to understand cause-and-effect relationships
- Rigorous quantification is required to ensure reliable, reproducible results

Mathematical models and statistical analyses play a vital role in understanding genetic data and deciphering complex mechanisms
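The descriptive statistics named in the learning objectives (mean, median, range) can be sketched in a few lines of Python. The leaf lengths below are made-up values used only for illustration.

```python
import statistics

# Hypothetical leaf lengths in mm (made-up values, not real data).
leaf_lengths_mm = [42, 45, 39, 51, 45, 46, 40]

# Mean: sum of all values divided by the number of values.
mean_length = statistics.mean(leaf_lengths_mm)

# Median: the middle value once the data are sorted.
median_length = statistics.median(leaf_lengths_mm)

# Range: largest value minus smallest value.
value_range = max(leaf_lengths_mm) - min(leaf_lengths_mm)

print(mean_length, median_length, value_range)
```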

Lecture 2

Learning Objectives

💙Describe the difference between samples and populations

🩷Define sampling error and explain how it may arise

🧡Classify different data as continuous or categorical (and present each appropriately)

💚Explain what is meant by the statistical null hypothesis and alternative hypotheses (practice writing some hypotheses for different experiments).

💛Describe the Chi-squared test and how it is calculated (observed, expected, degrees of freedom and p-values).

Class height data

💙Samples and Populations

A sample is usually only a very small subset of the population

The population is all members of a defined group

🩷Sampling Error

Sampling error is the random variation introduced into a data set as a result of sampling only a subset of the total population
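A rough simulation of sampling error, assuming a simulated population of heights (the mean of 170 cm, spread of 10 cm and sample sizes are all arbitrary choices for illustration). It suggests that the means of small samples tend to stray further from the population mean than the means of large samples.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical population: 10,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

def sample_mean(sample_size):
    """Mean height of one random sample drawn without replacement."""
    return statistics.mean(random.sample(population, sample_size))

# Repeat the sampling 200 times at each sample size and record how far
# each sample mean lands from the population mean.
small_errors = [abs(sample_mean(10) - population_mean) for _ in range(200)]
large_errors = [abs(sample_mean(1000) - population_mean) for _ in range(200)]

print(f"population mean:       {population_mean:.2f}")
print(f"typical error, n=10:   {statistics.mean(small_errors):.2f}")
print(f"typical error, n=1000: {statistics.mean(large_errors):.2f}")
```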

💚Statistical Null Hypothesis and Alternative Hypothesis

Statistical significance is evaluated using p-values, where a smaller p-value indicates stronger evidence against the null hypothesis.

A commonly accepted level for statistical significance is a p-value less than or equal to 0.05.

Type I errors (false positives) occur when the null hypothesis (H₀) is incorrectly rejected, while Type II errors (false negatives) occur when a false null hypothesis is incorrectly retained (not rejected).

Both errors highlight the importance of adequate sample sizes and proper experimental designs to minimize misinterpretations of data.

Effect size is a crucial aspect to consider alongside p-values, as it helps contextualize the biological or practical significance of findings beyond just statistical significance.

Effect sizes matter in different contexts, such as Genome Wide Association Studies and clinical studies, where conventional p-value thresholds can vary based on the nature of the research.

Researchers are encouraged to understand sampling errors which arise from selecting a small subset of a population, potentially impacting the interpretation of results and leading to biased conclusions.

Null Hypothesis (H₀)
- The default expectation that categorical outcomes are all equally likely, that there is no relationship between two measured phenomena, or that there is no association between groups
- It is the default expectation if there is no experimental/biological effect

Alternative hypothesis (H₁)
- Is the expectation that categorical outcomes are not all equally likely, that there is a relationship between two measured phenomena or an association among groups

Once the null and alternative hypotheses are defined, statistical tests are conducted to assess the evidence in the data and determine if there is enough support to reject the null in favour of the alternative

💛Chi-squared Test

Can be used where the observations are assigned into mutually exclusive classes

The number of observations in each class is compared to the number expected under the null hypothesis

Assumes
- Variables are categorical
- Observations are independent
- Observations are mutually exclusive
- Most of the expected values are greater than 5

The chi-squared formula: χ² = ∑(d²/e)
- where: d (difference) = o (observed) – e (expected)
- ∑ = sum of (over all classes)

Degrees of freedom (d.o.f.) = (no. rows - 1) × (no. columns - 1)

Unless no. columns = 1, in which case d.o.f. = no. rows - 1

The minimum d.o.f. is 1
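A minimal sketch of the chi-squared calculation and the degrees-of-freedom rule above, using a made-up coin-toss example (the counts and the function names are assumptions for illustration). The p-value line only works for 1 degree of freedom, where a closed form exists. For other d.o.f. a chi-squared table or a statistics package would be needed.

```python
import math

def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) x (columns - 1), or rows - 1 when there is one column."""
    if n_cols == 1:
        return n_rows - 1
    return (n_rows - 1) * (n_cols - 1)

# Hypothetical data: 100 coin tosses.
# Null hypothesis: heads and tails are equally likely.
observed = [60, 40]  # heads, tails
expected = [50, 50]  # counts expected under the null

statistic = chi_squared(observed, expected)
dof = degrees_of_freedom(n_rows=len(observed), n_cols=1)

# Closed-form p-value, valid ONLY when d.o.f. = 1.
p_value = math.erfc(math.sqrt(statistic / 2))

print(statistic, dof, round(p_value, 4))
```

Here the p-value falls just under 0.05, so with these invented counts the null hypothesis of a fair coin would be rejected at the usual threshold.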

Lecture 3

Learning Outcomes

💙Explain the need for biological context when analysing and interpreting data.

🩷Describe what a t-test is used to measure and give examples of when its use would be appropriate.

🧡Describe a regression model and give examples of when its use would be appropriate.

💛Describe the line of best fit (including y=mx+c) and how it relates to regression models.

🩵Interpret RStudio outputs for the above statistical tests.

💙Why We Need Context

P-values (and other stats) are meaningless without context

Biological context helps us to interpret and analyse data correctly

🩷t-test

Determines whether the mean of one group is statistically different from the mean of another group

Only use this test with two groups
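A rough sketch of how a t statistic compares two group means. The notes do not say which form of the t-test is used, so this assumes Welch's version (which does not require equal variances), and the two groups are made-up values. It stops at the t statistic, because turning it into a p-value needs the t distribution from a statistics package.

```python
import math
import statistics

def welch_t_statistic(group_a, group_b):
    """Difference in means divided by the standard error of that difference."""
    mean_difference = statistics.mean(group_a) - statistics.mean(group_b)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_difference / standard_error

# Hypothetical measurements for two groups (made-up values).
group_a = [70, 72, 68, 75, 71]
group_b = [80, 78, 82, 79, 81]

t_statistic = welch_t_statistic(group_a, group_b)
print(round(t_statistic, 3))
```

The further the t statistic is from 0, the larger the difference between the group means relative to the variation within the groups.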

🧡Regression Modelling

Describes relationships between a response variable (dependent) and an explanatory variable (independent)

We can use info from this output to create a formula we can use to make predictions

Line of best fit = straight line -> y=mx + c

💛Line of Best Fit

y = mx + c
- y = predicted value of dependent variable
- x = value of independent variable
- m = slope of line (representing the change in y for a unit change in x)
- c = y-intercept (value of y when x = 0)

The line is determined by calculating the values of m and c that minimise the residuals

Can make predictions for the dependent variable (y) based on new values of the independent variable (x)

Helps us understand the overall trend and direction of the relationship between the variables
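A minimal sketch of finding m and c by least squares (the values that minimise the sum of squared residuals) and then using y = mx + c to predict. The x and y values and the function name are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    # Intercept: the line always passes through (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data (made-up values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)

# Predict y for a new value of x using y = mx + c.
new_x = 6
predicted_y = m * new_x + c

print(round(m, 2), round(c, 2), round(predicted_y, 2))
```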

🩵RStudio Outputs

R-squared value tells us how much of the variation in pulse rate (y) is explained by variation in the explanatory variable (x)

If it is 0 then none of the variation in pulse rate is explained by x

Adjusted R-squared shows how much of the variation in y can be explained by x, adjusted for the number of explanatory variables in the model
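A sketch of how R-squared and adjusted R-squared could be calculated by hand for a single explanatory variable. The data are made-up values, not the pulse rate data from the lecture, and the variable names are assumptions.

```python
# Hypothetical data (made-up values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Fit the least-squares line y = mx + c.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
c = mean_y - m * mean_x

# Variation left over after fitting the line (residual sum of squares).
ss_residual = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
# Total variation in y around its mean.
ss_total = sum((y - mean_y) ** 2 for y in ys)

# R-squared: proportion of the variation in y explained by x.
r_squared = 1 - ss_residual / ss_total

# Adjusted R-squared: penalises R-squared for the number of predictors.
n_predictors = 1
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

print(round(r_squared, 3), round(adjusted_r_squared, 3))
```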

Lecture 4

Learning Outcomes

Describe and use the following workflow for data analysis

Plot data
Initial visual analysis
Statistical test
Interpret test output
Interpret output in biological context

Interpret regression model results

Regression Modelling

Describes relationships between a response variable (dependent) and an explanatory variable (independent)

Coefficient
- Line of best fit
- Makes predictions of y values

P-value
- If high (> 0.05), fail to reject the null hypothesis
- If low (< 0.05), reject the null hypothesis

R-squared is the proportion of the variance of our response variable that is explained by the explanatory variable(s)
- Between 0-1

Std error
- Standard error of the estimate

T-value
- Estimated coefficient divided by its std error

Residual standard error
- Estimate of the std deviation of the residuals
- Measures the average amount that the observed responses deviate from the regression line

R-squared and p-value
- The relationship between the variables and the statistical significance of this
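A rough sketch of where the std error, t-value and residual standard error in a simple (one explanatory variable) regression output come from. The data are made up and the variable names are assumptions. The p-value is left out because it needs the t distribution from a statistics package.

```python
import math

# Hypothetical data (made-up values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Coefficients of the line of best fit, y = mx + c.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xx = sum((x - mean_x) ** 2 for x in xs)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / s_xx
c = mean_y - m * mean_x

# Residuals: observed y minus the y predicted by the line.
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]

# Residual standard error: estimated std deviation of the residuals,
# using n - 2 degrees of freedom (two coefficients were estimated).
residual_std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Std error of the slope estimate.
slope_std_error = residual_std_error / math.sqrt(s_xx)

# T-value: estimated coefficient divided by its std error.
t_value = m / slope_std_error

print(round(residual_std_error, 3), round(slope_std_error, 3), round(t_value, 3))
```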

Lecture 5



95% confidence interval

Lecture 6

Learning Outcomes

💙Explain what is meant by a multivariate linear model and give examples of when you might use one.

🩷Describe cause and effect relationships in biology.

🧡Describe longitudinal and cross-sectional in the context of data collection. Discuss the impact these methods may have on results.

💚Explain what is meant by multiple testing and the impact this may have on our results.

🩵Explain what is meant by the term “cherry picking” in the context of statistics and data analysis.

💙Multivariate Linear Model

When both football and sex are included in a model of height, the effect of football decreases

Football isn't really telling us anything about height, it is simply associated with sex
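A simplified illustration of the same idea using made-up records. This is not a fitted multivariate linear model, just a comparison of the apparent football effect on height overall versus within each sex. The data are invented so that the effect disappears completely once sex is accounted for; real data would rarely be this clean.

```python
from statistics import mean

# Made-up records for illustration: (sex, plays_football, height_cm).
records = [
    ("M", True, 178), ("M", True, 180), ("M", True, 182), ("M", False, 180),
    ("F", True, 165), ("F", False, 163), ("F", False, 165), ("F", False, 167),
]

def football_effect(rows):
    """Mean height of footballers minus mean height of non-footballers."""
    players = [height for _, plays, height in rows if plays]
    others = [height for _, plays, height in rows if not plays]
    return mean(players) - mean(others)

# Ignoring sex: footballers look taller on average.
overall_effect = football_effect(records)

# Accounting for sex: compare footballers and non-footballers of the same sex.
effect_within_sex = {
    sex: football_effect([row for row in records if row[0] == sex])
    for sex in ("M", "F")
}

print(overall_effect, effect_within_sex)
```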

🩷Cause and Effect Relationships

The association is not necessarily anything to do with football itself
- It is likely due to gender stereotyping, social pressures and opportunity

Football and height are two independent outcomes of the same variable -> sex

🧡Cross-sectional Data

A snapshot of different individuals at a certain time

We would need longitudinal data to investigate how an individual gets smaller with age
- i.e. over time

💚Multiple Testing

If we measure height and sex, breakfast, lunch, dinner, basketball, football, hockey, judo, high jump, parent’s height, sibling height, eye colour, hair colour, skin colour, underpants colour, favourite football team, degree subject, higher grades, shoe size …

We will identify
- Some real biological modifiers
- Some co-variables
- Some chance associations
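A small sketch of why chance associations appear when many variables are tested. It assumes every test is independent and that no real effect exists for any of them, which is a simplification. Under those assumptions, the chance of at least one false positive at the 0.05 threshold grows quickly with the number of tests.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests gives p < alpha
    when the null hypothesis is true for all of them."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    probability = prob_at_least_one_false_positive(n_tests)
    print(f"{n_tests:>3} tests: {probability:.2f}")
```

With around 20 variables, as in the list above, a "significant" result somewhere is more likely than not even if nothing is really going on.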

🩵Cherry Picking

Avoid cherry picking
- i.e. only presenting our positive results and ignoring other findings

This is a misrepresentation of the data

We should be open and transparent about all the data