Summarising and Presenting Data
Summarising Data Lecture Notes
Learning Outcomes
- Understand the importance of summarising data in neuroscience.
- Distinguish between different types of data.
- Summarise data using measures of central tendency and dispersion.
- Understand data variability and how to identify outliers.
- Demonstrate ways of managing research data.
- Apply statistical software (e.g., STATA) to carry out exploratory analysis and display a dataset using histograms, box plots, and cumulative frequency presentations.
Time Distribution
- Importance of data summarisation: why summarising data is crucial in neuroscience research (2 mins)
- Types of data: overview of quantitative vs. qualitative data (8 mins)
- Summarising data, central tendency: mean, median, mode (5 mins)
- Dispersion: range, variance, standard deviation, and standard error (8 mins)
- Outliers in data (5 mins)
- How to manage research data from the start (5 mins)
- Visualisation: histograms and box plots using statistical software (4 mins)
- Test knowledge: Mentimeter quiz (last 10 mins)
Importance of Summarising Data in Neuroscience Research
Clarity and Understanding
- Neuroscience research generates vast amounts of complex data.
- Summarizing data helps distill it into a more understandable form, making it easier to identify key patterns and insights.
- Example: Summarizing fMRI data to show average brain activation levels in different regions during a memory task can help identify which areas are most involved in memory processing.
Hypothesis Testing
- Summarized data allows for more effective hypothesis testing.
- Researchers can determine whether their hypotheses are supported or refuted by focusing on essential data points, leading to more accurate conclusions.
- Example: Summarizing the results of an experiment testing the effects of a new drug on synaptic plasticity by reporting the average change in synaptic strength across different treatment groups.
Efficient Communication
- Effective communication of research findings is vital.
- Summarized data can be presented clearly and concisely, making it easier to share results with peers, publish in journals, and present at conferences.
- Example: Using a bar graph to summarize the average reaction times of participants in a cognitive task, making it easier to communicate the impact of a specific intervention on cognitive performance.
Resource Management
- Summarizing data helps manage resources efficiently.
- By focusing on the most relevant data, researchers can allocate their time, funding, and efforts more effectively, ensuring impactful research.
- Example: Summarizing the key findings from a pilot study on the effects of sleep deprivation on neural activity to decide whether to pursue a larger, more expensive study.
Meta-Analyses and Generalization
- Summarized data is essential for meta-analyses, which combine results from multiple studies to increase statistical power and improve the generalizability of findings.
- Example: Conducting a meta-analysis of studies on the effects of exercise on brain-derived neurotrophic factor (BDNF) levels to provide a comprehensive summary of the evidence.
Data Integrity and Reproducibility
- Summarizing data helps maintain data integrity and reproducibility.
- Clear documentation of key findings and methodologies allows other researchers to replicate studies, a cornerstone of scientific research.
- Example: Summarizing the methodology and key results of a study on the neural correlates of decision-making to ensure that other researchers can replicate the experiment.
Development of Theories
- Summarized data supports the development of new theories about brain function and behavior.
- By identifying consistent patterns, researchers can build and refine theoretical models.
- Example: Summarizing data from multiple studies on the neural basis of attention to develop a comprehensive theory of how different brain regions interact to support attentional processes.
Role of Statistics in Data Analysis
- Data are the raw material of knowledge.
- Scientists rely on data to provide empirical evidence to support and refine their theories.
- Governments, businesses, communities, hospitals, GPs, and individuals need data to help inform decision-making and risk assessment.
- Learning statistics will provide you with basic skills to read and understand data.
- Statistics provides techniques for:
- Summarising and presenting the information contained in a data set.
- Handling and quantifying variation and uncertainty in the data, to help infer what they tell about the underlying theory of interest.
Types of Data
Categorical
- Nominal Data: Categories without a specific order.
- Example: Blood types (A, B, AB, O).
- Ordinal Data: Categories with a meaningful order but no consistent difference between categories.
- Example: Stages of cancer (Stage I, II, III, IV).
Quantitative (Numerical) Data
- Discrete Data: Countable values, often integers.
- Example: Number of hospital visits.
- Continuous Data: Data that can take any value within a range.
- Example: Blood pressure measurements.
Interval Data
- Definition: Numerical data where the intervals between values are meaningful but there is no true zero point.
- Characteristics:
- Equal Intervals: The difference between values is consistent and meaningful.
- No True Zero: There is no absolute zero point that indicates the absence of the quantity being measured.
- Examples:
- Temperature in Celsius or Fahrenheit
- Dates on a calendar (e.g., years, months)
Ratio Data
- Definition: Numerical data with equal intervals between values and a true zero point, allowing for the calculation of ratios.
- Characteristics:
- Equal Intervals: The difference between values is consistent and meaningful.
- True Zero: There is an absolute zero point that indicates the absence of the quantity being measured. For example, 0 kg means 'no weight'.
- Ratios: Meaningful comparison of values using ratios is possible.
- Examples:
- Height and weight
- Duration (e.g., time taken to complete a task)
Key Differences
- Zero Point: Interval data lacks a true zero point, while ratio data has a true zero point.
- Ratios: Meaningful ratios can be calculated with ratio data but not with interval data.
- Examples: Temperature in Celsius (interval) vs. weight in kilograms (ratio).
Why Summarise Data?
- Handling Complexity: Use of Principal Component Analysis (PCA) to reduce the dimensionality of complex brain imaging data.
- Example: PCA can help identify key patterns in fMRI data, simplifying the understanding of brain activity during different tasks.
- Better Understanding: Summarizing data from electrophysiological recordings can help understand neural responses.
- Example: Summarizing spike train data from neurons can reveal how different brain regions respond to stimuli, aiding in studying sensory processing.
- Efficient Analysis: In genetic studies, summarizing data from large-scale genome-wide association studies (GWAS) helps identify genetic variants associated with neurological disorders.
- Example: Highlighting significant genetic markers that warrant further investigation.
- Sharing and Reproducibility: Summarizing data from connectomics studies (mapping the connections between neurons in the brain) allows researchers to share simplified versions of these complex networks.
- Example: Promoting reproducibility and collaborative research by providing a clear overview of neural connectivity.
- Resource Management: Summarizing neuroimaging data helps manage storage and processing resources.
- Example: Creating summary statistics from large datasets of MRI scans can reduce the data size while retaining essential information for further analysis.
Ways of Summarising Data
Measures of Central Tendency
- Mean: The average value of the data set.
- Median: The middle value when the data is ordered.
- Mode: The most frequently occurring value in the data set.
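The lecture's exploratory work uses STATA; as a language-neutral sketch, all three measures are available in Python's standard library (the reaction-time values below are made up for illustration):

```python
from statistics import mean, median, mode

# Hypothetical reaction times (ms) from a cognitive task
times = [420, 380, 510, 380, 450, 390, 380]

print(mean(times))    # arithmetic average: ~415.7
print(median(times))  # middle value of the ordered data: 390
print(mode(times))    # most frequent value: 380
```

Note that the mean is pulled above the median by the single large value (510), a first hint of skew.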
Measures of Spread/Dispersion
- Range: The difference between the highest and lowest values.
- Variance: The average of the squared differences from the mean.
- Standard Deviation: The square root of the variance, indicating how much the values deviate from the mean.
- Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile).
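These dispersion measures can be sketched the same way. The blood-pressure values are hypothetical, and quartile definitions differ slightly between packages (STATA may interpolate differently from Python's default method):

```python
from statistics import variance, stdev, quantiles

bp = [118, 120, 122, 125, 130, 135, 140, 160]  # hypothetical systolic BP (mmHg)

data_range = max(bp) - min(bp)   # range: 42
s2 = variance(bp)                # sample variance (n - 1 denominator)
s = stdev(bp)                    # sample standard deviation
se = s / len(bp) ** 0.5          # standard error of the mean
q1, q2, q3 = quantiles(bp, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1                    # inter-quartile range
print(data_range, round(s, 2), round(iqr, 2))
```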
Measures of Shape
- Skewness: Indicates the asymmetry of the data distribution.
- Kurtosis: Measures the "tailedness" of the data distribution.
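A minimal sketch of the moment-based definitions of both shape measures (statistical packages often report slightly different bias-corrected versions, so treat these as illustrative):

```python
from statistics import mean, pstdev

def skewness(xs):
    # moment-based (population) skewness: mean cubed z-score;
    # ~0 for symmetric data, > 0 for a long right tail
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def kurtosis(xs):
    # moment-based (population) kurtosis: mean fourth-power z-score;
    # 3.0 for a normal distribution ("excess kurtosis" subtracts 3)
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs)

print(skewness([1, 2, 2, 3, 3, 4, 12]))  # positive: long right tail
print(skewness([1, 2, 3, 4, 5]))         # ~0: symmetric
```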
Graphical Summaries
- Histograms: Show the frequency distribution of a data set.
- Box Plots: Display the median, quartiles, and potential outliers.
- Scatter Plots: Show the relationship between two quantitative variables.
- Bar Charts: Represent categorical data with rectangular bars.
Summary Tables
- Frequency Tables: Show the number of occurrences of each category.
- Contingency Tables: Display the frequency distribution of variables to show relationships between them.
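Both kinds of table amount to counting category combinations. A sketch with `collections.Counter` (all data below are hypothetical):

```python
from collections import Counter

# Frequency table: hypothetical blood types from 10 patients
blood = ["A", "O", "B", "O", "A", "AB", "O", "A", "O", "B"]
freq = Counter(blood)
print(freq["O"], freq["A"], freq["B"], freq["AB"])  # 4 3 2 1

# Contingency table: treatment group vs. outcome (hypothetical)
group = ["drug", "drug", "placebo", "placebo", "drug", "placebo"]
outcome = ["improved", "same", "same", "improved", "improved", "same"]
table = Counter(zip(group, outcome))
print(table[("drug", "improved")])  # 2
```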
Correlation and Association
- Correlation Coefficient: Measures the strength and direction of the relationship between two variables.
- Covariance: Indicates the direction of the linear relationship between variables.
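The two quantities are closely related: correlation is covariance divided by the product of the two standard deviations. A minimal sketch with toy data:

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    # sample covariance (n - 1 denominator)
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    # covariance rescaled by both standard deviations: lies in [-1, 1]
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y is exactly 2 * x
print(sample_covariance(x, y))    # 5.0
print(pearson_correlation(x, y))  # ≈ 1.0: perfect positive linear relationship
```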
Regression Models
- Linear Regression: Models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
- Logistic Regression: Used for binary outcome variables to model the probability of a certain class or event.
- Poisson Regression: Models count data and rates, such as the number of occurrences of an event within a fixed period.
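For the single-predictor linear case, the least-squares fit has a closed form, which a short sketch makes concrete (the dose-response numbers are invented for illustration; real analyses would use STATA's `regress` or an equivalent):

```python
from statistics import mean

def fit_line(xs, ys):
    # ordinary least squares for y = a + b*x (one predictor),
    # using the closed-form slope and intercept
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical dose (mg) vs. measured response
dose = [0, 1, 2, 3, 4]
resp = [1.0, 3.1, 4.9, 7.2, 8.8]
a, b = fit_line(dose, resp)
print(round(a, 2), round(b, 2))  # intercept ~1.06, slope ~1.97
```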
Longitudinal Data Analysis
- Mixed-effects Models: Account for both fixed and random effects, useful for analysing data where measurements are taken on the same subjects over time.
- Generalized Estimating Equations (GEE): Used for estimating the parameters of a generalized linear model with a possible unknown correlation between outcomes.
Survival Analysis
- Kaplan-Meier Estimator: Estimates the survival function from lifetime data, often used to measure the fraction of patients living for a certain amount of time after treatment.
- Cox Proportional Hazards Model: Assesses the effect of several variables on survival time, allowing for the estimation of hazard ratios.
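The Kaplan-Meier idea is simple enough to sketch directly: at each distinct event time, multiply the running survival estimate by the fraction of at-risk patients who did not have the event. The follow-up times below are hypothetical:

```python
def kaplan_meier(times, events):
    # times: follow-up time per patient; events: 1 = event, 0 = censored.
    # At each distinct event time t, multiply the running survival
    # estimate by (1 - deaths_at_t / number_at_risk_at_t).
    s = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

# Hypothetical follow-up (months); events list marks censored patients with 0
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
# [(2, 0.8), (3, 0.6...), (5, 0.3...)]
```

Censored patients still count towards the at-risk denominator up to their censoring time, which is what distinguishes this estimator from a naive proportion.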
Multivariate Analysis
- Principal Component Analysis (PCA): Reduces the dimensionality of data while retaining most of the variance. Useful for identifying patterns and simplifying complex datasets.
- Factor Analysis: Identifies underlying relationships between variables by grouping them into factors.
Bayesian Methods
- Bayesian Inference: Uses Bayes’ theorem to update the probability of a hypothesis as more evidence becomes available. It provides a probabilistic approach to inference using prior information.
- Markov Chain Monte Carlo (MCMC): A class of algorithms used to sample from a probability distribution and perform Bayesian inference.
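A one-function sketch of Bayes' theorem in the familiar diagnostic-test setting (the prevalence, sensitivity, and false-positive rate are invented for illustration):

```python
def posterior(prior, sensitivity, false_positive_rate):
    # Bayes' theorem: P(disease | positive test),
    # with P(positive) expanded by the law of total probability
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical screening test: 1% prevalence, 90% sensitivity,
# 5% false-positive rate
print(round(posterior(0.01, 0.90, 0.05), 3))  # ~0.154
```

Even a sensitive test yields a modest posterior when the prior (prevalence) is low, which is exactly the kind of updating Bayesian inference formalises.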
Advanced Visualization Techniques
- Heatmaps: Visualize data matrices, often used in genomics and other fields to show the intensity of data points.
- Network Analysis: Visualizes relationships between entities, useful in studying biological pathways and social networks.
Skewness and Measures
- Positively skewed data (no symmetry, long right tail; mean > median): the median and inter-quartile range are the appropriate measures.
- Negatively skewed data (no symmetry, long left tail; mean < median): the median and inter-quartile range are the appropriate measures.
- Normal distribution (tails extend equally on both sides; mean ≈ median): the mean and standard deviation are the appropriate measures.
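A quick check of the mean-versus-median rule of thumb, using small made-up datasets:

```python
from statistics import mean, median

# Positively (right-) skewed: a few large values pull the mean above the median
right = [2, 3, 3, 4, 4, 5, 20]
print(mean(right) > median(right))  # True -> report median and IQR

# Negatively (left-) skewed: a few small values pull the mean below the median
left = [1, 16, 17, 17, 18, 18, 19]
print(mean(left) < median(left))    # True -> report median and IQR
```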
Identifying Outliers
- Outliers are identified by assessing whether or not they fall within a set of numerical boundaries called "inner fences" and "outer fences".
- A point that falls outside the data set's inner fences is classified as a minor outlier, while one that falls outside the outer fences is classified as a major outlier.
- Inner Fences:
- Multiply the inter-quartile range (Q3 − Q1) by 1.5, then add this number to Q3 and subtract it from Q1 to find the boundaries of the inner fences.
- Outer Fences:
- Multiply the inter-quartile range (Q3 − Q1) by 3 (instead of 1.5), then add this number to Q3 and subtract it from Q1 to find the upper and lower boundaries of the outer fences.
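The fence rules translate directly into a small helper function; here is a sketch in Python (the quartile values are arbitrary):

```python
def fences(q1, q3):
    # inner fences: (Q1 - 1.5*IQR, Q3 + 1.5*IQR) -> beyond = minor outlier
    # outer fences: (Q1 - 3*IQR,   Q3 + 3*IQR)   -> beyond = major outlier
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

inner, outer = fences(q1=10, q3=20)  # IQR = 10
print(inner)  # (-5.0, 35.0)
print(outer)  # (-20, 50)
```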
Outlier Identification Example
- Use hospital admissions data HospAdmNeu.dta
- Use `summarize Ordinary1213, detail` to find the 1st quartile (Q1) and 3rd quartile (Q3)
- IQR = Q3-Q1 = 4013-2269 = 1744
- 1744 × 1.5 = 2616, 1744 × 3 = 5232
- Boundaries for inner fence: (Q1 − 2616, Q3 + 2616) = (−347, 6629)
- Boundaries for outer fence: (Q1 − 5232, Q3 + 5232) = (−2963, 9245)
- Since hospital admissions can never be negative, we only need to check how many data points lie above the upper fences, using STATA:
- `count if Ordinary1213 > 6629` returns 15 (points beyond the inner fence)
- `count if Ordinary1213 > 9245` returns 4 (major outliers beyond the outer fence)
- Because the data are positively skewed, report the median and inter-quartile range.
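The fence arithmetic can be verified directly; Q1 = 2269 and Q3 = 4013 come from the `summarize` output above:

```python
q1, q3 = 2269, 4013  # quartiles from summarize Ordinary1213, detail
iqr = q3 - q1        # 1744
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # (-347.0, 6629.0)
outer = (q1 - 3 * iqr, q3 + 3 * iqr)      # (-2963, 9245)
print(iqr, inner, outer)
```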
Correlation vs Covariance Demonstration with STATA
- Setup: `webuse census13`
- Visualise your data: `br`
- Estimate the correlation matrix: `correlate mrgrate dvcrate medage`
- Comments? Try different ways of explaining the output.
- Estimate the covariance matrix, using population as an analytic weight: `correlate mrgrate dvcrate medage [aweight=pop], covariance`
- Comments? Try different ways of explaining the output.
- Learn the similarities and differences between the correlation and covariance matrices. (Homework)
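For the homework, the key relationship is that correlation is covariance rescaled by the two standard deviations, so it is unit-free and bounded by ±1, while covariance depends on measurement units. A sketch with hypothetical data:

```python
from statistics import mean, stdev

def cov(xs, ys):
    # sample covariance (n - 1 denominator)
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

x = [1.0, 2.0, 4.0, 5.0]
y = [1.5, 2.5, 3.0, 5.0]

# Correlation = covariance rescaled by both standard deviations
corr = cov(x, y) / (stdev(x) * stdev(y))

# Rescaling x (e.g. metres -> centimetres) scales the covariance by 100
# but leaves the correlation unchanged
x_cm = [v * 100 for v in x]
corr_cm = cov(x_cm, y) / (stdev(x_cm) * stdev(y))
print(abs(cov(x_cm, y) - 100 * cov(x, y)) < 1e-9)  # True
print(abs(corr - corr_cm) < 1e-9)                  # True
```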
Data Transformations
- It is sometimes helpful to transform data to a different scale, to aid interpretation and/or statistical analysis
- Reasons for transforming data include:
- Improved approximation to normality
- Reducing skewness
- Linearising the relationship between 2 variables
- Making multiplicative relationships additive
- Common transformations include:
- Natural logarithm: y → ln(y) (the inverse of exp(y), where e = 2.718…)
- Power transformations: y → y², √y, 1/y, etc.
Log Transformation Example
- Log transform stretches scale at lower end and compresses it at upper end
- Can only take logs of positive values
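Both points can be seen with a toy example: equal ratios on the raw scale become equal differences on the log scale, so the lower end is stretched and the upper end compressed.

```python
import math

# Right-skewed hypothetical values spanning several orders of magnitude
raw = [1, 10, 100, 1000]
logged = [math.log(x) for x in raw]

# The raw gaps are 9, 90, and 900, but the log-scale gaps are all equal:
print(round(logged[1] - logged[0], 3))  # 2.303 (= ln 10)
print(round(logged[3] - logged[2], 3))  # 2.303

# Logs exist only for positive values: math.log(0) raises ValueError
```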
Data Display in a Spreadsheet / Data Management
- Suppose you are running a study at UCLH aiming to lower the low-density lipoprotein (LDL) cholesterol levels for patients with cardiovascular disease. Your study is an RCT, double-blind and placebo-controlled. Patients were randomly assigned to receive evolocumab (either 140 mg every 2 weeks or 420 mg monthly) or matching placebo as subcutaneous injections.
- Out of the first 20 patients:
- Group: 11 patients received evolocumab and 9 patients received placebo.
- Gender: 12 female and 8 male.
- Statin use:
- High intensity – 12 patients
- Medium intensity – 6 patients
- Low intensity – 2 patients
- Using patient IDs 1 to 20 and appropriate codes, display the above information in a spreadsheet. Ignore information on relationships between variables for now.
Data Display in a Spreadsheet - Coding
- Group:
- 1 if patients received evolocumab
- 0 if patients received a placebo.
- Gender:
- 1 if a patient is female
- 0 if the patient is male
- Statin use:
- 2 for High intensity
- 1 for Medium-intensity
- 0 for Low-intensity
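The coding scheme amounts to a lookup from label to numeric code. As a sketch, the dictionaries below encode the scheme from the notes and write a few of the example rows (patients 1, 2, and 12) to CSV; the column names are illustrative choices:

```python
import csv
import io

# Coding scheme from the notes: Group (1 = evolocumab, 0 = placebo),
# Gender (1 = female, 0 = male), Statin use (2 = high, 1 = medium, 0 = low)
GROUP = {"evolocumab": 1, "placebo": 0}
GENDER = {"female": 1, "male": 0}
STATIN = {"high": 2, "medium": 1, "low": 0}

rows = [
    (1, GROUP["evolocumab"], GENDER["male"], STATIN["low"]),
    (2, GROUP["evolocumab"], GENDER["female"], STATIN["high"]),
    (12, GROUP["placebo"], GENDER["male"], STATIN["low"]),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["patient_id", "group", "gender", "statin_use"])
writer.writerows(rows)
print(buf.getvalue())
```

Keeping the label-to-code mapping in one place makes the coding easy to document and audit later, which matters for the data-validation step below.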
Example Data
| Patient ID | Group | Gender | Statin-use |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 2 |
| 3 | 1 | 1 | 2 |
| 4 | 1 | 1 | 1 |
| 5 | 1 | 0 | 2 |
| 6 | 1 | 1 | 2 |
| 7 | 1 | 1 | 2 |
| 8 | 1 | 0 | 2 |
| 9 | 1 | 1 | 1 |
| 10 | 1 | 0 | 2 |
| 11 | 1 | 1 | 2 |
| 12 | 0 | 0 | 0 |
| 13 | 0 | 1 | 1 |
| 14 | 0 | 0 | 2 |
| 15 | 0 | 1 | 2 |
| 16 | 0 | 1 | 2 |
| 17 | 0 | 0 | 1 |
| 18 | 0 | 1 | 1 |
| 19 | 0 | 0 | 1 |
| 20 | 0 | 1 | 2 |
Adding an Extra Column
- Suppose the patients' ages lie between 50 and 70 years, with a mean age of 60 years. Can you now add an extra column for patient age?
- Your study may collect different variables, but each must be presented in its own column within the same spreadsheet.
Example Data with Age
| Patient ID | Group | Gender | Statin-use | Age |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 56 |
| 2 | 1 | 1 | 2 | 52 |
| 3 | 1 | 1 | 2 | 59 |
| 4 | 1 | 1 | 1 | 60 |
| 5 | 1 | 0 | 2 | 63 |
| 6 | 1 | 1 | 2 | 70 |
| 7 | 1 | 1 | 2 | 63 |
| 8 | 1 | 0 | 2 | 58 |
| 9 | 1 | 1 | 1 | 55 |
| 10 | 1 | 0 | 2 | 59 |
| 11 | 1 | 1 | 2 | 68 |
| 12 | 0 | 0 | 0 | 59 |
| 13 | 0 | 1 | 1 | 67 |
| 14 | 0 | 0 | 2 | 69 |
| 15 | 0 | 1 | 2 | 52 |
| 16 | 0 | 1 | 2 | 53 |
| 17 | 0 | 0 | 1 | 61 |
| 18 | 0 | 1 | 1 | 63 |
| 19 | 0 | 0 | 1 | 62 |
| 20 | 0 | 1 | 2 | 51 |
Data Validation
- Check twice that your coding is correct and that no value was entered or typed wrongly.
- Check that relevant published research data match your findings.
- Check the proportion of statin users with lowered LDL reported in other research. Is it consistent with yours?
- Identify and develop methods for handling missing values.
- Once you are convinced, the data are ready to cook (i.e., ready for analysis).
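Coding checks like these are easy to automate. A minimal sketch, assuming the coding scheme and the 50-70 age range from the example cohort above (the function and field names are illustrative):

```python
def validate(row):
    # Return a list of problems found in one coded row.
    # Codes follow the scheme used in these notes; the 50-70 age
    # range comes from the example cohort.
    problems = []
    if row["group"] not in (0, 1):
        problems.append("group must be 0 or 1")
    if row["gender"] not in (0, 1):
        problems.append("gender must be 0 or 1")
    if row["statin_use"] not in (0, 1, 2):
        problems.append("statin_use must be 0, 1 or 2")
    if row["age"] is None:
        problems.append("age is missing")  # route to missing-data handling
    elif not 50 <= row["age"] <= 70:
        problems.append("age outside expected 50-70 range")
    return problems

print(validate({"group": 1, "gender": 0, "statin_use": 0, "age": 56}))  # []
print(validate({"group": 2, "gender": 0, "statin_use": 3, "age": None}))  # 3 problems
```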
Recap
- Need to distinguish between different types of data (continuous, discrete, categorical)
- The most appropriate way of presenting data depends on the data type
- Frequency tables are appropriate for all types of data
- For quantitative data, need to think carefully about the appropriate choice of classes/intervals to group data before display
- Keep information in tables to the minimum necessary to convey the message (story) you want to present (significant figures, number of variables/categories)
- Bar charts are appropriate for displaying categorical data
- Histograms and box plots are appropriate for quantitative data
- To summarise and predict clinical outcomes of interest, use advanced methods (ANOVA, PCA, regression, survival analysis, Bayesian statistics, etc.) as appropriate.
Recommended Reading
- An Introduction to Medical Statistics by Martin Bland: Chapter 4
- Medical Statistics by B. Kirkwood & J. Sterne: Chapter 4
- Practical Statistics for Medical Research by Douglas Altman: Chapter 6