Me, Myself and I Notes
Lecture 1
Learning Objectives
💙Describe why biology is considered a quantitative subject
🩷Explain the importance of statistical analyses in biological research
💛Define key descriptive statistics – mean, median, range
💙Quantifying Biological Data
Biological research relies on accurate and precise measurements of various biological parameters
E.g. length, mass, concentration, time and genetic sequence
Researchers often manipulate variables and control experimental conditions to understand cause-and-effect relationships
Rigorous quantification is required to ensure reliable, reproducible results
Mathematical models and statistical analyses play a vital role in understanding genetic data and deciphering complex mechanisms
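The descriptive statistics named in the learning objectives (mean, median, range) can be worked through on a small data set. This is a minimal sketch; the leaf lengths are invented for illustration and are not course data.

```python
import statistics

# Hypothetical leaf lengths in millimetres (illustrative values only).
leaf_lengths_mm = [42, 38, 45, 40, 55, 39, 41]

# Mean: sum of the values divided by the number of values.
mean_length = statistics.mean(leaf_lengths_mm)

# Median: the middle value once the data are sorted.
median_length = statistics.median(leaf_lengths_mm)

# Range: largest value minus smallest value.
range_length = max(leaf_lengths_mm) - min(leaf_lengths_mm)

print(f"mean = {mean_length:.2f} mm")
print(f"median = {median_length} mm")
print(f"range = {range_length} mm")
```

The single large value (55) pulls the mean above the median, which is why the two are often reported together.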
Lecture 2
Learning Objectives
💙Describe the difference between samples and populations
🩷Define sampling error and explain how it may arise
🧡Classify different data as continuous or categorical (and present each appropriately)
💚Explain what is meant by the statistical null hypothesis and alternative hypotheses (practice writing some hypotheses for different experiments).
💛Describe the Chi-squared test and how it is calculated (observed, expected, degrees of freedom and p-values).
Class height data
💙Samples and Populations
A sample is usually only a very small subset of the population
The population is all members of a defined group
🩷Sampling Error
Sampling error is the random variation introduced into a data set as a function of only sampling a subset of the total population
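Sampling error can be seen by drawing several small samples from the same population and comparing their means. The sketch below assumes a hypothetical population of 10,000 heights generated from a normal distribution (mean 170 cm, standard deviation 10 cm); these numbers are invented and are not the class height data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical population: 10,000 heights in cm (assumed values).
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw five small samples of 10 individuals and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 10)) for _ in range(5)
]

for i, sample_mean in enumerate(sample_means, start=1):
    difference = sample_mean - population_mean
    print(f"sample {i}: mean = {sample_mean:.1f} cm "
          f"(differs from population mean by {difference:+.1f} cm)")
```

Each sample mean differs a little from the population mean and from the other samples, purely because only a subset was measured.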
💚Statistical Null Hypothesis and Alternative Hypothesis
Statistical significance is evaluated using p-values, where a smaller p-value indicates stronger evidence against the null hypothesis.
A commonly accepted level for statistical significance is a p-value less than or equal to 0.05.
Type I errors (false positives) occur when the null hypothesis (H0) is incorrectly rejected, while Type II errors (false negatives) occur when a false null hypothesis is not rejected.
Both errors highlight the importance of adequate sample sizes and proper experimental designs to minimize misinterpretations of data.
Effect size should be considered alongside p-values, as it helps contextualise the biological or practical significance of findings beyond statistical significance alone.
Effect sizes need to be judged in context, e.g. in Genome Wide Association Studies and clinical studies, where conventional p-value thresholds can vary based on the nature of the research.
Sampling error arises from measuring only a small subset of a population; it can affect the interpretation of results and lead to misleading conclusions.
Null Hypothesis (H0)
The default expectation that categorical outcomes are all equally likely, that there is no relationship between two measured phenomena, or that there is no association between groups
It is the default expectation if there is no experimental/biological effect
Alternative hypothesis (H1)
Is the expectation that categorical outcomes are not all equally likely, that there is a relationship between two measured phenomena or an association among groups
Once the null and alternative hypotheses are defined, statistical tests are conducted to assess the evidence in the data and determine if there is enough support to reject the null in favour of the alternative
💛Chi-squared Test
Can be used where the observations are assigned into mutually exclusive classes
The number of observations in each class is compared with the number expected under the null hypothesis
Assumes
Variables are categorical
Observations are independent
Observations are mutually exclusive
Most of the expected values are greater than 5
The chi-squared formula: $\chi^2 = \sum \frac{d^2}{e}$
where: d (difference) = o (observed) - e (expected)
∑ = sum of
Degrees of freedom (d.o.f.) = (no. rows - 1) x (no. columns - 1)
Unless no. columns = 1, in which case d.o.f. = no. rows - 1
The minimum number of degrees of freedom is 1
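The chi-squared calculation and the degrees-of-freedom rule above can be sketched as follows. The observed counts are hypothetical, the null hypothesis is assumed to be "all four classes are equally likely", and the function names are illustrative. The comparison uses the standard critical value of 7.815 for 3 degrees of freedom at p = 0.05 rather than computing an exact p-value.

```python
def chi_squared(observed, expected):
    """Sum of d^2 / e over all classes, where d = o - e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) x (columns - 1), or rows - 1 when there is one column."""
    if n_cols == 1:
        return n_rows - 1
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical counts in four mutually exclusive classes (100 observations).
observed = [18, 22, 20, 40]

# Under the assumed null, every class is expected to hold the same count.
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_squared(observed, expected)
dof = degrees_of_freedom(n_rows=len(observed), n_cols=1)

# Critical chi-squared value for 3 degrees of freedom at p = 0.05.
CRITICAL_VALUE_DOF3 = 7.815
reject_null = chi2 > CRITICAL_VALUE_DOF3

print(f"chi-squared = {chi2:.2f}, d.o.f. = {dof}, reject null: {reject_null}")
```

Every expected value here is 25, so the assumption that most expected values are greater than 5 is met.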
Lecture 3
Learning Outcomes
💙Explain the need for biological context when analysing and interpreting data.
🩷Describe what a t-test is used to measure and give examples of when its use would be appropriate.
🧡Describe a regression model and give examples of when its use would be appropriate.
💛Describe the line of best fit (including y=mx+c) and how it relates to regression models.
🩵Interpret RStudio outputs for the above statistical tests.
💙Why We Need Context
P-values (and other statistics) are meaningless without context
Biological context helps us to interpret and analyse data correctly
🩷t-test
Determines whether the mean of one group is statistically different from the mean of another group
Only use this test with two groups
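As a rough illustration of what a t-test measures, the sketch below computes a t-statistic for two groups by hand. It assumes the Welch (unequal variance) form, which the notes do not specify, and uses invented height data. It stops at the t-statistic; a statistics package would go on to convert this into a p-value.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Difference in group means divided by the standard error of that difference."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical heights in cm for two groups (illustrative values only).
heights_a = [160, 162, 164, 166, 168]
heights_b = [170, 172, 174, 176, 178]

t_stat = welch_t(heights_a, heights_b)
print(f"t = {t_stat:.2f}")
```

The further the t-statistic is from zero, the larger the difference between the group means relative to the variation within the groups.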
🧡Regression Modelling
Describes relationships between a response variable (dependent) and an explanatory variable (independent)
We can use info from this output to create a formula we can use to make predictions
Line of best fit = straight line -> y=mx + c
💛Line of Best Fit
y = mx + c
y = predicted value of the dependent variable
x = value of the independent variable
m = slope of the line (the change in y for a unit change in x)
c = y-intercept (value of y when x = 0)
The line is determined by calculating the values of m and c that minimise the residuals
Can be used to make predictions for the dependent variable (y) based on new values of the independent variable (x)
Helps us understand the overall trend and direction of the relationship between the variables
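The values of m and c described above can be found with the ordinary least-squares formulas, which minimise the sum of the squared residuals. This is a minimal sketch with invented data; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by the variation in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    # The fitted line always passes through (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c


# Hypothetical paired measurements (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)

# Use y = mx + c to predict y for a new value of x.
new_x = 6
predicted_y = m * new_x + c

print(f"m = {m:.2f}, c = {c:.2f}, predicted y at x = {new_x}: {predicted_y:.2f}")
```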
🩵RStudio Outputs
R-squared value tells us how much of the variation in pulse rate (y) is explained by variation in the explanatory variable (x)
If it is 0 then none of the variation in pulse rate is explained by x
Adjusted R-squared shows how much of the variation in y can be explained by x, adjusted for the number of explanatory variables in the model
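R-squared and adjusted R-squared can be computed by hand for a simple regression to see where the reported numbers come from. The sketch assumes invented data and one explanatory variable, and uses the standard definitions: R-squared = 1 - (residual sum of squares / total sum of squares), with the adjustment based on sample size and number of explanatory variables.

```python
# Hypothetical paired measurements (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
n_explanatory = 1  # one explanatory variable (x)

# Least-squares line of best fit, y = m*x + c.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
c = mean_y - m * mean_x

# Residual sum of squares: variation in y left unexplained by the line.
ss_residual = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
# Total sum of squares: all variation in y around its mean.
ss_total = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - ss_residual / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - n_explanatory - 1)

print(f"R-squared = {r_squared:.3f}")
print(f"adjusted R-squared = {adjusted_r_squared:.3f}")
```

Adjusted R-squared is lower than R-squared here because the data set is small; the gap narrows as the number of observations grows.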
Lecture 4
Learning Outcomes
Describe and use the following workflow for data analysis
Plot data
Initial visual analysis
Statistical test
Interpret test output
Interpret output in biological context
Interpret regression model results
Regression Modelling
Describes relationships between a response variable (dependent) and an explanatory variable (independent)
Coefficient
Line of best fit
Makes predictions of y values
P-value
If high (> 0.05), fail to reject the null hypothesis
If low (< 0.05), reject the null hypothesis
R-squared is the proportion of the variance of our response variable that is explained by the explanatory variable(s)
Between 0-1
Std error
Standard error of the estimate
T-value
Estimated coefficient divided by its std error
Residual standard error
Estimate of the std deviation of the residuals
Measures the average amount that the observed responses deviate from the regression line
R-squared and p-value
Together these describe the strength of the relationship between the variables and its statistical significance
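The standard error, t-value and residual standard error listed above can be reproduced by hand for a simple regression. The sketch assumes invented data and the usual formulas for a model with one explanatory variable (n - 2 residual degrees of freedom); it is meant to show how the quantities relate to each other, not to replace the statistical software output.

```python
import math

# Hypothetical paired measurements (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Least-squares coefficients for y = m*x + c.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xx = sum((x - mean_x) ** 2 for x in xs)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / s_xx
c = mean_y - m * mean_x

# Residuals: observed y minus the y predicted by the line.
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]

# Residual standard error: estimated standard deviation of the residuals.
residual_std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Standard error of the slope estimate.
slope_std_error = residual_std_error / math.sqrt(s_xx)

# t-value: estimated coefficient divided by its standard error.
slope_t_value = m / slope_std_error

print(f"slope = {m:.3f}, std error = {slope_std_error:.3f}, "
      f"t-value = {slope_t_value:.3f}")
print(f"residual standard error = {residual_std_error:.3f}")
```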
Lecture 5
95% confidence interval
Lecture 6
Learning Outcomes
💙Explain what is meant by a multivariate linear model and give examples of when you might use one.
🩷Describe cause and effect relationships in biology.
🧡Describe longitudinal and cross-sectional studies in the context of data collection. Discuss the impact these methods may have on results.
💚Explain what is meant by multiple testing and the impact this may have on our results.
🩵Explain what is meant by the term “cherry picking” in the context of statistics and data analysis.
💙Multivariate Linear Model
When looking at both football and sex, the effect of football decreases
Football isn't really telling us anything about height, it is simply associated with sex
🩷Cause and Effect Relationships
The association with height does not necessarily have anything to do with football itself
It is likely due to gender stereotyping, social pressures and opportunity
Football and height are two independent outcomes of the same variable -> sex
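The football, sex and height example can be illustrated with a small invented data set. This is a simple stratified comparison, not a fitted multivariate linear model, and the counts and heights are assumptions chosen so that height depends only on sex.

```python
import statistics

# Hypothetical records: (sex, plays_football, height_cm). All values are
# invented; height depends only on sex, and more males than females play.
people = (
    [("M", True, 180)] * 8 + [("M", False, 180)] * 2
    + [("F", True, 165)] * 2 + [("F", False, 165)] * 8
)


def football_gap(records):
    """Mean height of football players minus mean height of non-players."""
    players = [height for _, plays, height in records if plays]
    non_players = [height for _, plays, height in records if not plays]
    return statistics.mean(players) - statistics.mean(non_players)


# Looking at football alone: players appear taller.
overall_gap = football_gap(people)

# Looking at football within each sex: the apparent effect disappears.
gap_within_males = football_gap([p for p in people if p[0] == "M"])
gap_within_females = football_gap([p for p in people if p[0] == "F"])

print(f"overall gap: {overall_gap} cm")
print(f"gap within males: {gap_within_males} cm")
print(f"gap within females: {gap_within_females} cm")
```

In this constructed example the football effect falls to zero once sex is accounted for; in real data it would typically shrink rather than vanish.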
🧡Cross-sectional Data
A snapshot of different individuals at a certain time
We would need longitudinal data to investigate how an individual gets smaller with age
i.e. over time
💚Multiple Testing
If we measure height and sex, breakfast, lunch, dinner, basketball, football, hockey, judo, high jump, parent’s height, sibling height, eye colour, hair colour, skin colour, underpants colour, favourite football team, degree subject, higher grades, shoe size …
We will identify
Some real biological modifiers
Some co-variables
Some chance associations
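The number of chance associations to expect grows with the number of tests. The sketch below assumes 20 independent tests at a significance level of 0.05, with the null hypothesis true in every one of them; both the number of tests and the independence assumption are illustrative.

```python
ALPHA = 0.05  # significance level used for each individual test


def expected_false_positives(n_tests, alpha=ALPHA):
    """Average number of tests that come out 'significant' by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=ALPHA):
    """Chance that at least one of n independent tests is a false positive."""
    return 1 - (1 - alpha) ** n_tests


n_tests = 20  # hypothetical number of variables tested against height

expected = expected_false_positives(n_tests)
prob_any = prob_at_least_one_false_positive(n_tests)

print(f"expected chance associations: {expected:.1f}")
print(f"probability of at least one: {prob_any:.2f}")
```

With 20 tests, about one "significant" result is expected even when no real effect exists, which is why long lists of measured variables need cautious interpretation.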
🩵Cherry Picking
Avoid cherry picking
i.e. only presenting our positive results and ignoring other findings
This is a misrepresentation of the data
We should be open and transparent about all the data