Summarising and Presenting Data
Summarising Data Lecture Notes
Learning Outcomes
- Understand the importance of summarising data in neuroscience.
- Distinguish between different types of data.
- Summarise data using measures of central tendency and dispersion.
- Understand data variability and how to identify outliers.
- Demonstrate ways of managing research data.
- Apply statistical software (e.g., STATA) to carry out exploratory analysis and display a dataset using histograms, box plots, and cumulative frequency presentations.
Time Distribution
- Importance of data summarisation: why summarising data is crucial in neuroscience research (2 mins)
- Types of data: overview of quantitative vs. qualitative data (8 mins)
- Summarising data, central tendency: mean, median, mode (5 mins)
- Dispersion: range, variance, standard deviation, and standard error (8 mins)
- Outliers in data (5 mins)
- How to manage research data from the start (5 mins)
- Visualisation: histograms and box plots using statistical software (4 mins)
- Test knowledge: Mentimeter quiz (last 10 mins)
Importance of Summarising Data in Neuroscience Research
Clarity and Understanding
- Neuroscience research generates vast amounts of complex data.
- Summarizing data helps distill it into a more understandable form, making it easier to identify key patterns and insights.
- Example: Summarizing fMRI data to show average brain activation levels in different regions during a memory task can help identify which areas are most involved in memory processing.
Hypothesis Testing
- Summarized data allows for more effective hypothesis testing.
- Researchers can determine whether their hypotheses are supported or refuted by focusing on essential data points, leading to more accurate conclusions.
- Example: Summarizing the results of an experiment testing the effects of a new drug on synaptic plasticity by reporting the average change in synaptic strength across different treatment groups.
Efficient Communication
- Effective communication of research findings is vital.
- Summarized data can be presented clearly and concisely, making it easier to share results with peers, publish in journals, and present at conferences.
- Example: Using a bar graph to summarize the average reaction times of participants in a cognitive task, making it easier to communicate the impact of a specific intervention on cognitive performance.
Resource Management
- Summarizing data helps manage resources efficiently.
- By focusing on the most relevant data, researchers can allocate their time, funding, and efforts more effectively, ensuring impactful research.
- Example: Summarizing the key findings from a pilot study on the effects of sleep deprivation on neural activity to decide whether to pursue a larger, more expensive study.
Meta-Analyses and Generalization
- Summarized data is essential for meta-analyses, which combine results from multiple studies to increase statistical power and improve the generalizability of findings.
- Example: Conducting a meta-analysis of studies on the effects of exercise on brain-derived neurotrophic factor (BDNF) levels to provide a comprehensive summary of the evidence.
Data Integrity and Reproducibility
- Summarizing data helps maintain data integrity and reproducibility.
- Clear documentation of key findings and methodologies allows other researchers to replicate studies, a cornerstone of scientific research.
- Example: Summarizing the methodology and key results of a study on the neural correlates of decision-making to ensure that other researchers can replicate the experiment.
Development of Theories
- Summarized data supports the development of new theories about brain function and behavior.
- By identifying consistent patterns, researchers can build and refine theoretical models.
- Example: Summarizing data from multiple studies on the neural basis of attention to develop a comprehensive theory of how different brain regions interact to support attentional processes.
Role of Statistics in Data Analysis
- Data are the raw material of knowledge.
- Scientists rely on data to provide empirical evidence to support and refine their theories.
- Governments, businesses, communities, hospitals, GPs, and individuals need data to help inform decision-making and risk assessment.
- Learning statistics will provide you with basic skills to read and understand data.
- Statistics provides techniques for:
- Summarising and presenting the information contained in a data set.
- Handling and quantifying variation and uncertainty in the data, to help infer what they tell about the underlying theory of interest.
Types of Data
Categorical
- Nominal Data: Categories without a specific order.
- Example: Blood types (A, B, AB, O).
- Ordinal Data: Categories with a meaningful order but no consistent difference between categories.
- Example: Stages of cancer (Stage I, II, III, IV).
Quantitative (Numerical) Data
- Discrete Data: Countable values, often integers.
- Example: Number of hospital visits.
- Continuous Data: Data that can take any value within a range.
- Example: Blood pressure measurements.
Interval Data
- Definition: Numerical data where the intervals between values are meaningful but there is no true zero point.
- Characteristics:
- Equal Intervals: The difference between values is consistent and meaningful.
- No True Zero: There is no absolute zero point that indicates the absence of the quantity being measured.
- Examples:
- Temperature in Celsius or Fahrenheit
- Dates on a calendar (e.g., years, months)
Ratio Data
- Definition: Numerical data with equal intervals between values and a true zero point, allowing for the calculation of ratios.
- Characteristics:
- Equal Intervals: The difference between values is consistent and meaningful.
- True Zero: There is an absolute zero point that indicates the absence of the quantity being measured. For example, 0 kg means 'no weight'.
- Ratios: Meaningful comparison of values using ratios is possible.
- Examples:
- Height and weight
- Duration (e.g., time taken to complete a task)
Key Differences
- Zero Point: Interval data lacks a true zero point, while ratio data has a true zero point.
- Ratios: Meaningful ratios can be calculated with ratio data but not with interval data.
- Examples: Temperature in Celsius (interval) vs. weight in kilograms (ratio).
Why Summarise Data?
- Handling Complexity: Use of Principal Component Analysis (PCA) to reduce the dimensionality of complex brain imaging data.
- Example: PCA can help identify key patterns in fMRI data, simplifying the understanding of brain activity during different tasks.
- Better Understanding: Summarizing data from electrophysiological recordings can help understand neural responses.
- Example: Summarizing spike train data from neurons can reveal how different brain regions respond to stimuli, aiding in studying sensory processing.
- Efficient Analysis: In genetic studies, summarizing data from large-scale genome-wide association studies (GWAS) helps identify genetic variants associated with neurological disorders.
- Example: Highlighting significant genetic markers that warrant further investigation.
- Sharing and Reproducibility: Summarizing data from connectomics studies (mapping the connections between neurons in the brain) allows researchers to share simplified versions of these complex networks.
- Example: Promoting reproducibility and collaborative research by providing a clear overview of neural connectivity.
- Resource Management: Summarizing neuroimaging data helps manage storage and processing resources.
- Example: Creating summary statistics from large datasets of MRI scans can reduce the data size while retaining essential information for further analysis.
Ways of Summarising Data
Measures of Central Tendency
- Mean: The average value of the data set.
- Median: The middle value when the data is ordered.
- Mode: The most frequently occurring value in the data set.
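The lecture's exploratory work uses STATA; as a language-neutral sketch, all three measures are available in Python's standard library (the reaction-time values below are made up for illustration):

```python
from statistics import mean, median, mode

# Hypothetical reaction times (ms) from a cognitive task
times = [420, 380, 510, 380, 450, 390, 380]

print(mean(times))    # arithmetic average: ~415.7
print(median(times))  # middle value of the ordered data: 390
print(mode(times))    # most frequent value: 380
```

Note that the mean is pulled above the median by the single large value (510), a first hint of skew.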
Measures of Spread/Dispersion
- Range: The difference between the highest and lowest values.
- Variance: The average of the squared differences from the mean.
- Standard Deviation: The square root of the variance, indicating how much the values deviate from the mean.
- Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile).
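These dispersion measures can be sketched the same way. The blood-pressure values are hypothetical, and quartile definitions differ slightly between packages (STATA may interpolate differently from Python's default method):

```python
from statistics import variance, stdev, quantiles

bp = [118, 120, 122, 125, 130, 135, 140, 160]  # hypothetical systolic BP (mmHg)

data_range = max(bp) - min(bp)   # range: 42
s2 = variance(bp)                # sample variance (n - 1 denominator)
s = stdev(bp)                    # sample standard deviation
se = s / len(bp) ** 0.5          # standard error of the mean
q1, q2, q3 = quantiles(bp, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1                    # inter-quartile range
print(data_range, round(s, 2), round(iqr, 2))
```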
Measures of Shape
- Skewness: Indicates the asymmetry of the data distribution.
- Kurtosis: Measures the "tailedness" of the data distribution.
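A minimal sketch of the moment-based definitions of both shape measures (statistical packages often report slightly different bias-corrected versions, so treat these as illustrative):

```python
from statistics import mean, pstdev

def skewness(xs):
    # moment-based (population) skewness: mean cubed z-score;
    # ~0 for symmetric data, > 0 for a long right tail
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def kurtosis(xs):
    # moment-based (population) kurtosis: mean fourth-power z-score;
    # 3.0 for a normal distribution ("excess kurtosis" subtracts 3)
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs)

print(skewness([1, 2, 2, 3, 3, 4, 12]))  # positive: long right tail
print(skewness([1, 2, 3, 4, 5]))         # ~0: symmetric
```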
Graphical Summaries
- Histograms: Show the frequency distribution of a data set.
- Box Plots: Display the median, quartiles, and potential outliers.
- Scatter Plots: Show the relationship between two quantitative variables.
- Bar Charts: Represent categorical data with rectangular bars.
Summary Tables
- Frequency Tables: Show the number of occurrences of each category.
- Contingency Tables: Display the frequency distribution of variables to show relationships between them.
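Both kinds of table amount to counting category combinations. A sketch with `collections.Counter` (all data below are hypothetical):

```python
from collections import Counter

# Frequency table: hypothetical blood types from 10 patients
blood = ["A", "O", "B", "O", "A", "AB", "O", "A", "O", "B"]
freq = Counter(blood)
print(freq["O"], freq["A"], freq["B"], freq["AB"])  # 4 3 2 1

# Contingency table: treatment group vs. outcome (hypothetical)
group = ["drug", "drug", "placebo", "placebo", "drug", "placebo"]
outcome = ["improved", "same", "same", "improved", "improved", "same"]
table = Counter(zip(group, outcome))
print(table[("drug", "improved")])  # 2
```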
Correlation and Association
- Correlation Coefficient: Measures the strength and direction of the relationship between two variables.
- Covariance: Indicates the direction of the linear relationship between variables.
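The two quantities are closely related: correlation is covariance divided by the product of the two standard deviations. A minimal sketch with toy data:

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    # sample covariance (n - 1 denominator)
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    # covariance rescaled by both standard deviations: lies in [-1, 1]
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y is exactly 2 * x
print(sample_covariance(x, y))    # 5.0
print(pearson_correlation(x, y))  # ≈ 1.0: perfect positive linear relationship
```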
Regression Models
- Linear Regression: Models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
- Logistic Regression: Used for binary outcome variables to model the probability of a certain class or event.
- Poisson Regression: Models count data and rates, such as the number of occurrences of an event within a fixed period.
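For the single-predictor linear case, the least-squares fit has a closed form, which a short sketch makes concrete (the dose-response numbers are invented for illustration; real analyses would use STATA's `regress` or an equivalent):

```python
from statistics import mean

def fit_line(xs, ys):
    # ordinary least squares for y = a + b*x (one predictor),
    # using the closed-form slope and intercept
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical dose (mg) vs. measured response
dose = [0, 1, 2, 3, 4]
resp = [1.0, 3.1, 4.9, 7.2, 8.8]
a, b = fit_line(dose, resp)
print(round(a, 2), round(b, 2))  # intercept ~1.06, slope ~1.97
```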
Longitudinal Data Analysis
- Mixed-effects Models: Account for both fixed and random effects, useful for analysing data where measurements are taken on the same subjects over time.
- Generalized Estimating Equations (GEE): Used for estimating the parameters of a generalized linear model with a possible unknown correlation between outcomes.
Survival Analysis
- Kaplan-Meier Estimator: Estimates the survival function from lifetime data, often used to measure the fraction of patients living for a certain amount of time after treatment.
- Cox Proportional Hazards Model: Assesses the effect of several variables on survival time, allowing for the estimation of hazard ratios.
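The Kaplan-Meier idea is simple enough to sketch directly: at each distinct event time, multiply the running survival estimate by the fraction of at-risk patients who did not have the event. The follow-up times below are hypothetical:

```python
def kaplan_meier(times, events):
    # times: follow-up time per patient; events: 1 = event, 0 = censored.
    # At each distinct event time t, multiply the running survival
    # estimate by (1 - deaths_at_t / number_at_risk_at_t).
    s = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

# Hypothetical follow-up (months); events list marks censored patients with 0
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
# [(2, 0.8), (3, 0.6...), (5, 0.3...)]
```

Censored patients still count towards the at-risk denominator up to their censoring time, which is what distinguishes this estimator from a naive proportion.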
Multivariate Analysis
- Principal Component Analysis (PCA): Reduces the dimensionality of data while retaining most of the variance. Useful for identifying patterns and simplifying complex datasets.
- Factor Analysis: Identifies underlying relationships between variables by grouping them into factors.
Bayesian Methods
- Bayesian Inference: Uses Bayes’ theorem to update the probability of a hypothesis as more evidence becomes available. It provides a probabilistic approach to inference using prior information.
- Markov Chain Monte Carlo (MCMC): A class of algorithms used to sample from a probability distribution and perform Bayesian inference.
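A one-function sketch of Bayes' theorem in the familiar diagnostic-test setting (the prevalence, sensitivity, and false-positive rate are invented for illustration):

```python
def posterior(prior, sensitivity, false_positive_rate):
    # Bayes' theorem: P(disease | positive test),
    # with P(positive) expanded by the law of total probability
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical screening test: 1% prevalence, 90% sensitivity,
# 5% false-positive rate
print(round(posterior(0.01, 0.90, 0.05), 3))  # ~0.154
```

Even a sensitive test yields a modest posterior when the prior (prevalence) is low, which is exactly the kind of updating Bayesian inference formalises.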
Advanced Visualization Techniques
- Heatmaps: Visualize data matrices, often used in genomics and other fields to show the intensity of data points.
- Network Analysis: Visualizes relationships between entities, useful in studying biological pathways and social networks.
Skewness and Measures
- Positively skewed data (no symmetry, long right tail; mean > median): the median and inter-quartile range are the appropriate measures.
- Negatively skewed data (no symmetry, long left tail; mean < median): the median and inter-quartile range are the appropriate measures.
- Normal distribution (tails extend equally on both sides; mean ≈ median): the mean and standard deviation are the appropriate measures.
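A quick check of the mean-versus-median rule of thumb, using small made-up datasets:

```python
from statistics import mean, median

# Positively (right-) skewed: a few large values pull the mean above the median
right = [2, 3, 3, 4, 4, 5, 20]
print(mean(right) > median(right))  # True -> report median and IQR

# Negatively (left-) skewed: a few small values pull the mean below the median
left = [1, 16, 17, 17, 18, 18, 19]
print(mean(left) < median(left))    # True -> report median and IQR
```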
Identifying Outliers
- Outliers are identified by assessing whether or not they fall within a set of numerical boundaries called "inner fences" and "outer fences".
- A point that falls outside the data set's inner fences is classified as a minor outlier, while one that falls outside the outer fences is classified as a major outlier.
- Inner Fences:
- Multiply the inter-quartile range (Q3 − Q1) by 1.5, then add this number to Q3 and subtract it from Q1 to find the boundaries of the inner fences.
- Outer Fences:
- Multiply the inter-quartile range (Q3 − Q1) by 3 (instead of 1.5), then add this number to Q3 and subtract it from Q1 to find the upper and lower boundaries of the outer fences.
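The fence rules translate directly into a small helper function; here is a sketch in Python (the quartile values are arbitrary):

```python
def fences(q1, q3):
    # inner fences: (Q1 - 1.5*IQR, Q3 + 1.5*IQR) -> beyond = minor outlier
    # outer fences: (Q1 - 3*IQR,   Q3 + 3*IQR)   -> beyond = major outlier
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

inner, outer = fences(q1=10, q3=20)  # IQR = 10
print(inner)  # (-5.0, 35.0)
print(outer)  # (-20, 50)
```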
Outlier Identification Example
- Use hospital admissions data HospAdmNeu.dta
- Use `summarize Ordinary1213, detail` to find the 1st quartile (Q1) and 3rd quartile (Q3)
- IQR = Q3-Q1 = 4013-2269 = 1744
- 1744 × 1.5 = 2616, 1744 × 3 = 5232
- Boundaries for inner fence: (Q1 − 2616, Q3 + 2616) = (−347, 6629)
- Boundaries for outer fence: (Q1 − 5232, Q3 + 5232) = (−2963, 9245)
- Since hospital admissions can never be negative, we only need to check how many data points lie above the upper fences, using STATA:
- `count if Ordinary1213 > 6629` returns 15 (points beyond the inner fence)
- `count if Ordinary1213 > 9245` returns 4 (major outliers beyond the outer fence)
- Because the data are positively skewed, report the median and inter-quartile range.
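The fence arithmetic can be verified directly; Q1 = 2269 and Q3 = 4013 come from the `summarize` output above:

```python
q1, q3 = 2269, 4013  # quartiles from summarize Ordinary1213, detail
iqr = q3 - q1        # 1744
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # (-347.0, 6629.0)
outer = (q1 - 3 * iqr, q3 + 3 * iqr)      # (-2963, 9245)
print(iqr, inner, outer)
```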
Correlation vs Covariance Demonstration with STATA
- Setup: `webuse census13`
- Visualise your data: `br`
- Estimate the correlation matrix: `correlate mrgrate dvcrate medage`
- Comments? Try different ways of explaining the output.
- Estimate the covariance matrix, using population as an analytic weight: `correlate mrgrate dvcrate medage [aweight=pop], covariance`
- Comments? Try different ways of explaining the output.
- Learn the similarities and differences between the correlation and covariance matrices. (Homework)
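For the homework, the key relationship is that correlation is covariance rescaled by the two standard deviations, so it is unit-free and bounded by ±1, while covariance depends on measurement units. A sketch with hypothetical data:

```python
from statistics import mean, stdev

def cov(xs, ys):
    # sample covariance (n - 1 denominator)
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

x = [1.0, 2.0, 4.0, 5.0]
y = [1.5, 2.5, 3.0, 5.0]

# Correlation = covariance rescaled by both standard deviations
corr = cov(x, y) / (stdev(x) * stdev(y))

# Rescaling x (e.g. metres -> centimetres) scales the covariance by 100
# but leaves the correlation unchanged
x_cm = [v * 100 for v in x]
corr_cm = cov(x_cm, y) / (stdev(x_cm) * stdev(y))
print(abs(cov(x_cm, y) - 100 * cov(x, y)) < 1e-9)  # True
print(abs(corr - corr_cm) < 1e-9)                  # True
```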
Data Transformations
- It is sometimes helpful to transform data to a different scale, to aid interpretation and/or statistical analysis
- Reasons for transforming data include:
- Improved approximation to normality
- Reducing skewness
- Linearising the relationship between 2 variables
- Making multiplicative relationships additive
- Common transformations include:
- Natural logarithm: y → ln(y) (the inverse of exp(y), where e = 2.718…)
- Power transformations: y → y², √y, 1/y, etc.
Log Transformation Example
- Log transform stretches scale at lower end and compresses it at upper end
- Can only take logs of positive values
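Both points can be seen with a toy example: equal ratios on the raw scale become equal differences on the log scale, so the lower end is stretched and the upper end compressed.

```python
import math

# Right-skewed hypothetical values spanning several orders of magnitude
raw = [1, 10, 100, 1000]
logged = [math.log(x) for x in raw]

# The raw gaps are 9, 90, and 900, but the log-scale gaps are all equal:
print(round(logged[1] - logged[0], 3))  # 2.303 (= ln 10)
print(round(logged[3] - logged[2], 3))  # 2.303

# Logs exist only for positive values: math.log(0) raises ValueError
```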
Data Display in a Spreadsheet / Data Management
- Suppose you are running a study at UCLH aiming to lower the low-density lipoprotein (LDL) cholesterol levels for patients with cardiovascular disease. Your study is an RCT, double-blind and placebo-controlled. Patients were randomly assigned to receive evolocumab (either 140 mg every 2 weeks or 420 mg monthly) or matching placebo as subcutaneous injections.
- Out of the first 20 patients:
- Group: 11 patients received evolocumab and 9 patients received placebo.
- Gender: 12 female and 8 male.
- Statin use:
- High intensity – 12 patients
- Medium intensity – 6 patients
- Low intensity – 2 patients
- Using patient IDs 1 to 20 and appropriate codes, display the above information in a spreadsheet. Ignore information on relationships between variables for now.
Data Display in a Spreadsheet - Coding
- Group:
- 1 if patients received evolocumab
- 0 if patients received a placebo.
- Gender:
- 1 if a patient is female
- 0 if the patient is male
- Statin use:
- 2 for High intensity
- 1 for Medium-intensity
- 0 for Low-intensity
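The coding scheme amounts to a lookup from label to numeric code. As a sketch, the dictionaries below encode the scheme from the notes and write a few of the example rows (patients 1, 2, and 12) to CSV; the column names are illustrative choices:

```python
import csv
import io

# Coding scheme from the notes: Group (1 = evolocumab, 0 = placebo),
# Gender (1 = female, 0 = male), Statin use (2 = high, 1 = medium, 0 = low)
GROUP = {"evolocumab": 1, "placebo": 0}
GENDER = {"female": 1, "male": 0}
STATIN = {"high": 2, "medium": 1, "low": 0}

rows = [
    (1, GROUP["evolocumab"], GENDER["male"], STATIN["low"]),
    (2, GROUP["evolocumab"], GENDER["female"], STATIN["high"]),
    (12, GROUP["placebo"], GENDER["male"], STATIN["low"]),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["patient_id", "group", "gender", "statin_use"])
writer.writerows(rows)
print(buf.getvalue())
```

Keeping the label-to-code mapping in one place makes the coding easy to document and audit later, which matters for the data-validation step below.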
Example Data
| Patient ID | Group | Gender | Statin-use |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 2 |
| 3 | 1 | 1 | 2 |
| 4 | 1 | 1 | 1 |
| 5 | 1 | 0 | 2 |
| 6 | 1 | 1 | 2 |
| 7 | 1 | 1 | 2 |
| 8 | 1 | 0 | 2 |
| 9 | 1 | 1 | 1 |
| 10 | 1 | 0 | 2 |
| 11 | 1 | 1 | 2 |
| 12 | 0 | 0 | 0 |
| 13 | 0 | 1 | 1 |
| 14 | 0 | 0 | 2 |
| 15 | 0 | 1 | 2 |
| 16 | 0 | 1 | 2 |
| 17 | 0 | 0 | 1 |
| 18 | 0 | 1 | 1 |
| 19 | 0 | 0 | 1 |
| 20 | 0 | 1 | 2 |
Adding an Extra Column
- Suppose the patients' ages lie between 50 and 70 years, with a mean age of 60 years. Can you now add an extra column for patient age?
- Your study may collect different variables, but each must be presented in its own column within the same spreadsheet.
Example Data with Age
| Patient ID | Group | Gender | Statin-use | Age |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 56 |
| 2 | 1 | 1 | 2 | 52 |
| 3 | 1 | 1 | 2 | 59 |
| 4 | 1 | 1 | 1 | 60 |
| 5 | 1 | 0 | 2 | 63 |
| 6 | 1 | 1 | 2 | 70 |
| 7 | 1 | 1 | 2 | 63 |
| 8 | 1 | 0 | 2 | 58 |
| 9 | 1 | 1 | 1 | 55 |
| 10 | 1 | 0 | 2 | 59 |
| 11 | 1 | 1 | 2 | 68 |
| 12 | 0 | 0 | 0 | 59 |
| 13 | 0 | 1 | 1 | 67 |
| 14 | 0 | 0 | 2 | 69 |
| 15 | 0 | 1 | 2 | 52 |
| 16 | 0 | 1 | 2 | 53 |
| 17 | 0 | 0 | 1 | 61 |
| 18 | 0 | 1 | 1 | 63 |
| 19 | 0 | 0 | 1 | 62 |
| 20 | 0 | 1 | 2 | 51 |
Data Validation
- Check twice that your coding is correct and that no value was entered or typed wrongly.
- Check that relevant published research data match your findings.
- Check the proportion of statin users with lowered LDL reported in other research. Is it consistent with yours?
- Identify and develop methods for handling missing values.
- Once you are convinced, the data are ready to cook (i.e., ready for analysis).
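Coding checks like these are easy to automate. A minimal sketch, assuming the coding scheme and the 50-70 age range from the example cohort above (the function and field names are illustrative):

```python
def validate(row):
    # Return a list of problems found in one coded row.
    # Codes follow the scheme used in these notes; the 50-70 age
    # range comes from the example cohort.
    problems = []
    if row["group"] not in (0, 1):
        problems.append("group must be 0 or 1")
    if row["gender"] not in (0, 1):
        problems.append("gender must be 0 or 1")
    if row["statin_use"] not in (0, 1, 2):
        problems.append("statin_use must be 0, 1 or 2")
    if row["age"] is None:
        problems.append("age is missing")  # route to missing-data handling
    elif not 50 <= row["age"] <= 70:
        problems.append("age outside expected 50-70 range")
    return problems

print(validate({"group": 1, "gender": 0, "statin_use": 0, "age": 56}))  # []
print(validate({"group": 2, "gender": 0, "statin_use": 3, "age": None}))  # 3 problems
```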
Recap
- Need to distinguish between different types of data (continuous, discrete, categorical)
- The most appropriate way of presenting data depends on the data type
- Frequency tables are appropriate for all types of data
- For quantitative data, need to think carefully about the appropriate choice of classes/intervals to group data before display
- Keep information in tables to the minimum necessary to convey the message (story) you want to present (significant figures, number of variables/categories)
- Bar charts are appropriate for displaying categorical data
- Histograms and box plots are appropriate for quantitative data
- To summarise and predict clinical outcomes of interest, use advanced methods (ANOVA, PCA, regression, survival analysis, Bayesian statistics, etc.) as appropriate.
Recommended Reading
- An Introduction to Medical Statistics by Martin Bland: Chapter 4
- Medical Statistics by B. Kirkwood & J. Sterne: Chapter 4
- Practical Statistics for Medical Research by Douglas Altman: Chapter 6