Summarising and Presenting Data

Summarising Data Lecture Notes

Learning Outcomes

  • Understand the importance of summarising data in neuroscience.
  • Distinguish between different types of data.
  • Summarise data using measures of central tendency and dispersion.
  • Understand data variability and how to identify outliers.
  • Demonstrate ways of managing research data.
  • Apply statistical software (e.g., STATA) to carry out exploratory analysis and display a dataset using histograms, box plots, and cumulative frequency presentations.

Time Distribution

  • Importance of Data Summarisation: Why summarising data is crucial in neuroscience research (2 mins)
  • Types of Data: Overview of quantitative vs. qualitative data (8 mins)
  • Summarising Data: Central tendency: mean, median, mode (5 mins)
  • Dispersion: Range, variance, standard deviation, and standard error (8 mins)
  • Outliers in data (5 mins)
  • Managing research data from the start (5 mins)
  • Visualisation: Histograms, box plots using statistical software (4 mins)
  • Test knowledge: Mentimeter quiz (last 10 mins)

Importance of Summarising Data in Neuroscience Research

  1. Clarity and Understanding

    • Neuroscience research generates vast amounts of complex data.
    • Summarizing data helps distill it into a more understandable form, making it easier to identify key patterns and insights.
    • Example: Summarizing fMRI data to show average brain activation levels in different regions during a memory task can help identify which areas are most involved in memory processing.
  2. Hypothesis Testing

    • Summarized data allows for more effective hypothesis testing.
    • Researchers can determine whether their hypotheses are supported or refuted by focusing on essential data points, leading to more accurate conclusions.
    • Example: Summarizing the results of an experiment testing the effects of a new drug on synaptic plasticity by reporting the average change in synaptic strength across different treatment groups.
  3. Efficient Communication

    • Effective communication of research findings is vital.
    • Summarized data can be presented clearly and concisely, making it easier to share results with peers, publish in journals, and present at conferences.
    • Example: Using a bar graph to summarize the average reaction times of participants in a cognitive task, making it easier to communicate the impact of a specific intervention on cognitive performance.
  4. Resource Management

    • Summarizing data helps manage resources efficiently.
    • By focusing on the most relevant data, researchers can allocate their time, funding, and efforts more effectively, ensuring impactful research.
    • Example: Summarizing the key findings from a pilot study on the effects of sleep deprivation on neural activity to decide whether to pursue a larger, more expensive study.
  5. Meta-Analyses and Generalization

    • Summarized data is essential for meta-analyses, which combine results from multiple studies to increase statistical power and improve the generalizability of findings.
    • Example: Conducting a meta-analysis of studies on the effects of exercise on brain-derived neurotrophic factor (BDNF) levels to provide a comprehensive summary of the evidence.
  6. Data Integrity and Reproducibility

    • Summarizing data helps maintain data integrity and reproducibility.
    • Clear documentation of key findings and methodologies allows other researchers to replicate studies, a cornerstone of scientific research.
    • Example: Summarizing the methodology and key results of a study on the neural correlates of decision-making to ensure that other researchers can replicate the experiment.
  7. Development of Theories

    • Summarized data supports the development of new theories about brain function and behavior.
    • By identifying consistent patterns, researchers can build and refine theoretical models.
    • Example: Summarizing data from multiple studies on the neural basis of attention to develop a comprehensive theory of how different brain regions interact to support attentional processes.

Role of Statistics in Data Analysis

  • Data are the raw material of knowledge.
  • Scientists rely on data to provide empirical evidence to support and refine their theories.
  • Governments, businesses, communities, hospitals, GPs, and individuals need data to help inform decision-making and risk assessment.
  • Learning statistics will provide you with basic skills to read and understand data.
  • Statistics provides techniques for:
    • Summarising and presenting the information contained in a data set.
    • Handling and quantifying variation and uncertainty in the data, to help infer what they tell about the underlying theory of interest.

Types of Data

  1. Categorical

    • Nominal Data: Categories without a specific order.
      • Example: Blood types (A, B, AB, O).
    • Ordinal Data: Categories with a meaningful order but no consistent difference between categories.
      • Example: Stages of cancer (Stage I, II, III, IV).
  2. Quantitative (Numerical) Data

    • Discrete Data: Countable values, often integers.
      • Example: Number of hospital visits.
    • Continuous Data: Data that can take any value within a range.
      • Example: Blood pressure measurements.
  3. Interval Data

    • Definition: Numerical data where the intervals between values are meaningful but there is no true zero point.
    • Characteristics:
      • Equal Intervals: The difference between values is consistent and meaningful.
      • No True Zero: There is no absolute zero point that indicates the absence of the quantity being measured.
    • Examples:
      • Temperature in Celsius or Fahrenheit
      • Dates on a calendar (e.g., years, months)
  4. Ratio Data

    • Definition: Numerical data with equal intervals between values and a true zero point, allowing for the calculation of ratios.
    • Characteristics:
      • Equal Intervals: The difference between values is consistent and meaningful.
      • True Zero: There is an absolute zero point that indicates the absence of the quantity being measured. For example, 0 kg means 'no weight'.
      • Ratios: Meaningful comparison of values using ratios is possible.
    • Examples:
      • Height and weight
      • Duration (e.g., time taken to complete a task)
  5. Key Differences

    • Zero Point: Interval data lacks a true zero point, while ratio data has a true zero point.
    • Ratios: Meaningful ratios can be calculated with ratio data but not with interval data.
    • Examples: Temperature in Celsius (interval) vs. weight in kilograms (ratio).
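
The interval/ratio distinction above can be made concrete with a short numerical check — a minimal Python sketch (illustrative; not part of the original notes). The point: a 2:1 ratio of Celsius temperatures disappears when you change units, whereas a 2:1 ratio of weights survives any change of units.

```python
import math

def c_to_k(celsius):
    """Celsius (interval scale) -> Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15

# "20 degrees C is twice as hot as 10 degrees C" does not survive a change of units:
ratio_kelvin = c_to_k(20) / c_to_k(10)
print(round(ratio_kelvin, 3))   # ~1.035, not 2.0 -- the 2:1 ratio was an artefact

# Weight is ratio-scale: the 2:1 ratio holds in kilograms and in pounds alike.
KG_TO_LB = 2.20462
ratio_kg = 20 / 10
ratio_lb = (20 * KG_TO_LB) / (10 * KG_TO_LB)
print(math.isclose(ratio_kg, ratio_lb))  # True
```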

Why Summarise Data?

  • Handling Complexity: Use of Principal Component Analysis (PCA) to reduce the dimensionality of complex brain imaging data.
    • Example: PCA can help identify key patterns in fMRI data, simplifying the understanding of brain activity during different tasks.
  • Better Understanding: Summarizing data from electrophysiological recordings can help understand neural responses.
    • Example: Summarizing spike train data from neurons can reveal how different brain regions respond to stimuli, aiding in studying sensory processing.
  • Efficient Analysis: In genetic studies, summarizing data from large-scale genome-wide association studies (GWAS) helps identify genetic variants associated with neurological disorders.
    • Example: Highlighting significant genetic markers that warrant further investigation.
  • Sharing and Reproducibility: Summarizing data from connectomics studies (mapping the connections between neurons in the brain) allows researchers to share simplified versions of these complex networks.
    • Example: Promoting reproducibility and collaborative research by providing a clear overview of neural connectivity.
  • Resource Management: Summarizing neuroimaging data helps manage storage and processing resources.
    • Example: Creating summary statistics from large datasets of MRI scans can reduce the data size while retaining essential information for further analysis.

Ways of Summarising Data

  1. Measures of Central Tendency

    • Mean: The average value of the data set.
    • Median: The middle value when the data is ordered.
    • Mode: The most frequently occurring value in the data set.
  2. Measures of Spread/Dispersion

    • Range: The difference between the highest and lowest values.
    • Variance: The average of the squared differences from the mean.
    • Standard Deviation: The square root of the variance, indicating how much the values deviate from the mean.
      • \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}
    • Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile).
      • IQR = Q3 - Q1
  3. Measures of Shape

    • Skewness: Indicates the asymmetry of the data distribution.
    • Kurtosis: Measures the "tailedness" of the data distribution.
  4. Graphical Summaries

    • Histograms: Show the frequency distribution of a data set.
    • Box Plots: Display the median, quartiles, and potential outliers.
    • Scatter Plots: Show the relationship between two quantitative variables.
    • Bar Charts: Represent categorical data with rectangular bars.
  5. Summary Tables

    • Frequency Tables: Show the number of occurrences of each category.
    • Contingency Tables: Display the frequency distribution of variables to show relationships between them.
  6. Correlation and Association

    • Correlation Coefficient: Measures the strength and direction of the relationship between two variables.
    • Covariance: Indicates the direction of the linear relationship between variables.
      • cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
  7. Regression Models

    • Linear Regression: Models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
    • Logistic Regression: Used for binary outcome variables to model the probability of a certain class or event.
    • Poisson Regression: Models count data and rates, such as the number of occurrences of an event within a fixed period.
  8. Longitudinal Data Analysis

    • Mixed-effects Models: Account for both fixed and random effects, useful for analysing data where measurements are taken on the same subjects over time.
    • Generalized Estimating Equations (GEE): Used for estimating the parameters of a generalized linear model with a possible unknown correlation between outcomes.
  9. Survival Analysis

    • Kaplan-Meier Estimator: Estimates the survival function from lifetime data, often used to measure the fraction of patients living for a certain amount of time after treatment.
    • Cox Proportional Hazards Model: Assesses the effect of several variables on survival time, allowing for the estimation of hazard ratios.
  10. Multivariate Analysis

    • Principal Component Analysis (PCA): Reduces the dimensionality of data while retaining most of the variance. Useful for identifying patterns and simplifying complex datasets.
    • Factor Analysis: Identifies underlying relationships between variables by grouping them into factors.
  11. Bayesian Methods

    • Bayesian Inference: Uses Bayes’ theorem to update the probability of a hypothesis as more evidence becomes available. It provides a probabilistic approach to inference using prior information.
    • Markov Chain Monte Carlo (MCMC): A class of algorithms used to sample from a probability distribution and perform Bayesian inference.
  12. Advanced Visualization Techniques

    • Heatmaps: Visualize data matrices, often used in genomics and other fields to show the intensity of data points.
    • Network Analysis: Visualizes relationships between entities, useful in studying biological pathways and social networks.
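
The basic numerical summaries listed above (items 1–2) can be computed directly with Python's standard statistics module — a minimal sketch for illustration only (the lecture itself uses STATA):

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
print(st.mean(data))     # 5
print(st.median(data))   # 4.5
print(st.mode(data))     # 4

# Dispersion (population formulas, matching sigma = sqrt(sum (x - mu)^2 / N))
print(st.pvariance(data))  # 4
print(st.pstdev(data))     # 2.0

# IQR from the quartiles ('inclusive' interpolates within the ordered sample)
q1, q2, q3 = st.quantiles(data, n=4, method='inclusive')
print(q3 - q1)             # IQR = 1.5
```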

Skewness and Measures

  • Positively skewed data (asymmetric, long right tail): the median and inter-quartile range are the appropriate summary measures; here mean > median.
  • Negatively skewed data (asymmetric, long left tail): the median and inter-quartile range are the appropriate summary measures; here mean < median.
  • Normal distribution with tails extending equally over both sides: the mean and standard deviation are the appropriate summary measures.
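
The mean/median relationship under skew is easy to verify numerically — a small Python sketch with a made-up positively skewed sample (illustrative only):

```python
import statistics as st

# A positively skewed sample: most values are small, with one long right tail.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 20]

mean, median = st.mean(skewed), st.median(skewed)
print(mean > median)   # True: the tail pulls the mean above the median
print(median)          # 2 -- the robust summary to report for skewed data
```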

Identifying Outliers

  • Outliers are identified by assessing whether or not they fall within a set of numerical boundaries called "inner fences" and "outer fences".
  • A point that falls outside the data set's inner fences is classified as a minor outlier, while one that falls outside the outer fences is classified as a major outlier.
  • Inner Fences:
    • Multiply the inter-quartile range (Q3 - Q1) by 1.5, then add this number to Q3 and subtract it from Q1 to find the boundaries of the inner fences.
    • Upper Inner Fence = Q3 + 1.5 × IQR
    • Lower Inner Fence = Q1 - 1.5 × IQR
  • Outer Fences:
    • Multiply the inter-quartile range (Q3 - Q1) by 3 (instead of 1.5), then add this number to Q3 and subtract it from Q1 to find the upper and lower boundaries of the outer fences.
    • Upper Outer Fence = Q3 + 3 × IQR
    • Lower Outer Fence = Q1 - 3 × IQR
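
The fence rules above translate directly into code. A minimal Python sketch (the function and class labels are mine, for illustration; the fences follow Tukey's rule exactly as stated):

```python
def tukey_fences(q1, q3):
    """Inner and outer fences from the quartiles (Tukey's rule)."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(x, inner, outer):
    """'major' outside the outer fences, 'minor' outside the inner ones."""
    if x < outer[0] or x > outer[1]:
        return "major outlier"
    if x < inner[0] or x > inner[1]:
        return "minor outlier"
    return "not an outlier"

inner, outer = tukey_fences(q1=10, q3=20)   # IQR = 10
print(inner)                        # (-5.0, 35.0)
print(outer)                        # (-20, 50)
print(classify(40, inner, outer))   # minor outlier
print(classify(60, inner, outer))   # major outlier
```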

Outlier Identification Example

  • Use hospital admissions data HospAdmNeu.dta
  • Use summ Ordinary1213, det to find the 1st quartile (Q1) and 3rd quartile (Q3)
  • IQR = Q3 - Q1 = 4013 - 2269 = 1744
  • 1744 × 1.5 = 2616; 1744 × 3 = 5232
  • Boundaries for inner fence: (Q1 - 2616, Q3 + 2616) = (-347, 6629)
  • Boundaries for outer fence: (Q1 - 5232, Q3 + 5232) = (-2963, 9245)
  • Since hospital admissions can never be negative, we only need to check how many data points fall above the upper fences, using STATA:
  • count if Ordinary1213 > 6629 returns 15 (outside the inner fence)
  • count if Ordinary1213 > 9245 returns 4 (outside the outer fence)
  • As the data are positively skewed, report the median and inter-quartile range.
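
The fence arithmetic in this example can be reproduced in a few lines of Python (Q1 and Q3 are the quartiles reported by summ Ordinary1213, det in the notes):

```python
# Quartiles of Ordinary1213 as reported in the worked example.
q1, q3 = 2269, 4013
iqr = q3 - q1
print(iqr)                                  # 1744

# Inner fences: quartiles +/- 1.5 * IQR
print((q1 - 1.5 * iqr, q3 + 1.5 * iqr))     # (-347.0, 6629.0)

# Outer fences: quartiles +/- 3 * IQR
print((q1 - 3 * iqr, q3 + 3 * iqr))         # (-2963, 9245)
```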

Correlation vs Covariance Demonstration with STATA

  • Setup
    • webuse census13
  • Visualise your data
    • br
  • Estimate correlation matrix
    • correlate mrgrate dvcrate medage
  • Comments?
  • Try different ways of explaining the output
  • Estimate covariance matrix; use population as analytic weight
    • correlate mrgrate dvcrate medage [aweight=pop], covariance
  • Comments?
  • Try different ways of explaining the outputs.
  • Try learning the similarity & the difference between correlation and covariance matrix. (Home work)
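
For the homework, the key similarity and difference can be seen directly from the formulas: correlation is just covariance rescaled by the two standard deviations, which makes it unit-free and bounded in [-1, 1]. A minimal Python sketch (illustrative; the demonstration itself uses STATA's correlate command):

```python
import math

def covariance(x, y):
    """Sample covariance: sum((xi - mean(x)) * (yi - mean(y))) / (n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled to lie in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]        # y = 2x, a perfect linear relationship

print(covariance(x, y))     # 5.0 -- unit-dependent (doubles if y doubles)
print(correlation(x, y))    # 1.0 -- unit-free strength of the relationship
```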

Data Transformations

  • It is sometimes helpful to transform data to a different scale, to aid interpretation and/or statistical analysis
  • Reasons for transforming data include:
    • Improved approximation to normality
    • Reducing skewness
    • Linearising the relationship between 2 variables
    • Making multiplicative relationships additive
  • Common transformations include:
    • Natural logarithm: y = \log_e(x) \implies x = e^y (or exp(y)), where e = 2.718…
    • Power transformations: y = \sqrt{x}, y = x^2, y = x^3, etc.

Log Transformation Example

  • Log transform stretches scale at lower end and compresses it at upper end
  • Can only take logs of positive values
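
A short Python sketch of the log transform in action (the sample values are made up for illustration): the long right tail is compressed, and the exponential of the mean of the logs back-transforms to the geometric mean on the original scale.

```python
import math
import statistics as st

# A positively skewed sample (long right tail); all values are > 0,
# which is required before taking logs.
raw = [1, 2, 2, 3, 4, 8, 16, 100]
logged = [math.log(x) for x in raw]

# On the raw scale the tail drags the mean far above the median:
print(st.mean(raw), st.median(raw))        # 17.0 vs 3.5
# After the log transform the mean and median are much closer:
print(st.mean(logged), st.median(logged))
# Back-transform a summary with exp(): this is the geometric mean.
print(math.exp(st.mean(logged)))
```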

Data Display in a Spreadsheet / Data Management

  • Suppose you are running a study at UCLH aiming to lower the low-density lipoprotein (LDL) cholesterol levels for patients with cardiovascular disease. Your study is an RCT, double-blind and placebo-controlled. Patients were randomly assigned to receive evolocumab (either 140 mg every 2 weeks or 420 mg monthly) or matching placebo as subcutaneous injections.
  • Out of the first 20 patients:
    • Group: 11 patients received evolocumab and 9 patients received placebo.
    • Gender: 12 female and 8 male.
    • Statin use:
      • High intensity – 12 patients
      • Medium intensity – 6 patients
      • Low intensity – 2 patients
  • Using patient ID 1 to 20 and the appropriate code display the above information in a spreadsheet. Ignore between variables information for now.

Data Display in a Spreadsheet - Coding

  • Group:
    • 1 if patients received evolocumab
    • 0 if patients received a placebo.
  • Gender:
    • 1 if a patient is female
    • 0 if the patient is male
  • Statin use:
    • 2 for High intensity
    • 1 for Medium-intensity
    • 0 for Low-intensity

Example Data

Patient ID | Group | Gender | Statin-use
1          | 1     | 0      | 0
2          | 1     | 1      | 2
3          | 1     | 1      | 2
4          | 1     | 1      | 1
5          | 1     | 0      | 2
6          | 1     | 1      | 2
7          | 1     | 1      | 2
8          | 1     | 0      | 2
9          | 1     | 1      | 1
10         | 1     | 0      | 2
11         | 1     | 1      | 2
12         | 0     | 0      | 0
13         | 0     | 1      | 1
14         | 0     | 0      | 2
15         | 0     | 1      | 2
16         | 0     | 1      | 2
17         | 0     | 0      | 1
18         | 0     | 1      | 1
19         | 0     | 0      | 1
20         | 0     | 1      | 2
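
A quick way to validate the coded spreadsheet is to tally each column and compare against the study description (11 on evolocumab, 12 female, 12 on high-intensity statins). A minimal Python sketch:

```python
# Rows of the coded spreadsheet: (patient_id, group, gender, statin_use).
rows = [
    (1, 1, 0, 0), (2, 1, 1, 2), (3, 1, 1, 2), (4, 1, 1, 1), (5, 1, 0, 2),
    (6, 1, 1, 2), (7, 1, 1, 2), (8, 1, 0, 2), (9, 1, 1, 1), (10, 1, 0, 2),
    (11, 1, 1, 2), (12, 0, 0, 0), (13, 0, 1, 1), (14, 0, 0, 2), (15, 0, 1, 2),
    (16, 0, 1, 2), (17, 0, 0, 1), (18, 0, 1, 1), (19, 0, 0, 1), (20, 0, 1, 2),
]

# Frequency-table style checks against the study description:
print(sum(r[1] for r in rows))             # 11 on evolocumab (so 9 on placebo)
print(sum(r[2] for r in rows))             # 12 female (so 8 male)
print(sum(1 for r in rows if r[3] == 2))   # 12 on high-intensity statins
```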

Adding an Extra Column

  • Suppose the patients are aged between 50 and 70 years, with a mean age of 60 years. Can you now add an extra column for the patients' age?
  • Your study may include different variables, but each one should be presented in its own column within the same spreadsheet.

Example Data with Age

Patient ID | Group | Gender | Statin-use | Age
1          | 1     | 0      | 0          | 56
2          | 1     | 1      | 2          | 52
3          | 1     | 1      | 2          | 59
4          | 1     | 1      | 1          | 60
5          | 1     | 0      | 2          | 63
6          | 1     | 1      | 2          | 70
7          | 1     | 1      | 2          | 63
8          | 1     | 0      | 2          | 58
9          | 1     | 1      | 1          | 55
10         | 1     | 0      | 2          | 59
11         | 1     | 1      | 2          | 68
12         | 0     | 0      | 0          | 59
13         | 0     | 1      | 1          | 67
14         | 0     | 0      | 2          | 69
15         | 0     | 1      | 2          | 52
16         | 0     | 1      | 2          | 53
17         | 0     | 0      | 1          | 61
18         | 0     | 1      | 1          | 63
19         | 0     | 0      | 1          | 62
20         | 0     | 1      | 2          | 51

Data Validation

  • Double-check that your coding is correct and that no values have been entered or typed incorrectly.
  • Check that relevant published research data are consistent with your findings.
  • Check the proportion of statin users with lowered LDL reported in other studies. Is it consistent with yours?
  • Identify and develop methods for handling missing values.
  • Once you are convinced, the data are ready to cook (i.e., ready for analysis).
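
These checks can be partly automated. A minimal Python sketch — the field names, the example records, and the admissible ranges below are my assumptions for illustration, based on the coding scheme and the stated 50–70 age range:

```python
# Hypothetical coded records; the third one has a deliberately missing age.
records = [
    {"id": 1, "group": 1, "gender": 0, "statin": 0, "age": 56},
    {"id": 2, "group": 1, "gender": 1, "statin": 2, "age": 52},
    {"id": 3, "group": 1, "gender": 1, "statin": 2, "age": None},
]

# Admissible codes per field, matching the coding scheme above.
VALID = {"group": {0, 1}, "gender": {0, 1}, "statin": {0, 1, 2}}

def validate(rec):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for field, allowed in VALID.items():
        if rec[field] not in allowed:
            problems.append(f"{field} miscoded: {rec[field]}")
    age = rec["age"]
    if age is None:
        problems.append("age missing")
    elif not 50 <= age <= 70:          # stated age range for this study
        problems.append(f"age out of range: {age}")
    return problems

for rec in records:
    print(rec["id"], validate(rec))
```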

Recap

  • Need to distinguish between different types of data (continuous, discrete, categorical)
  • The most appropriate way of presenting data depends on the data type
  • Frequency tables are appropriate for all types of data
    • For quantitative data, need to think carefully about the appropriate choice of classes/intervals to group data before display
    • Keep information in tables to the minimum necessary to convey the message (story) you want to present (significant figures, number of variables/categories)
  • Bar charts are appropriate for displaying categorical data
  • Histograms and box plots are appropriate for quantitative data
  • To summarise and predict the clinical outcome of interest, use advanced methods (ANOVA, PCA, regression, survival analysis, Bayesian statistics, etc.) as appropriate.

Recommended Reading

  • Introduction to Medical Statistics by Martin Bland: Chapter 4
  • Medical Statistics by B. Kirkwood & J. Sterne: Chapter 4
  • Practical Statistics for Medical Research by Douglas Altman: Chapter 6