Visualizing Data: Intro to Statistics
Visualizing Data: Intro to Stats - Week 2
Course Information
Institution: Western University Canada
Department: Health Sciences
Objectives
Review Topics:
Distributions
Percentages
Percentiles
Visualization
Review of Key Statistical Concepts
Populations and Samples
Previous discussion revolved around populations and sampling methods.
Mode
Definition: The mode is defined as the most frequently occurring value in a dataset.
To find the mode:
Count occurrences of each unique value in the dataset.
The mode is the value that appears most often.
Example: The dataset 2, 2, 3, 5, 5, 7, 8 has two modes, 2 and 5, each occurring twice, so it is bimodal.
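The two steps above can be sketched in Python. This is a minimal illustration, not part of the course materials; the function name `modes` is an assumption chosen for this example.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(values)          # step 1: count each unique value
    highest = max(counts.values())    # the largest count observed
    # step 2: keep the value(s) that appear most often
    return sorted(v for v, c in counts.items() if c == highest)

data = [2, 2, 3, 5, 5, 7, 8]
print(modes(data))  # [2, 5]
```

Returning a list rather than a single value keeps bimodal datasets like this one from silently losing a mode.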
Median
Definition: The median is the middle value when all values are organized in ascending order.
Procedure for determining the median:
Organize the dataset from smallest to largest.
If the number of values is odd, the median is the middle value.
If it’s even, compute the average of the two middle values.
Example: The dataset 2, 2, 3, 5, 5, 7, 8 has seven values, so the median is the fourth value, 5.
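As a rough sketch, the procedure translates to Python as follows. The function name `median` is an assumption for illustration; the second call drops the last value to show the even case.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)          # organize from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: one middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 2, 3, 5, 5, 7, 8]))  # 5
print(median([2, 2, 3, 5, 5, 7]))     # 4.0
```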
Mean
Definition: The mean represents the average of a dataset.
Steps to calculate the mean:
Sum all individual values in the dataset to derive a total.
Divide that total by the number of values present (n).
Example: For dataset 2, 2, 3, 5, 5, 7, 8:
Total = $2 + 2 + 3 + 5 + 5 + 7 + 8 = 32$
Mean = $\frac{32}{7} \approx 4.57$.
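The same arithmetic can be checked with a short Python sketch. The function name `mean` is an assumption for illustration only.

```python
def mean(values):
    """Sum of the values divided by the number of values (n)."""
    total = sum(values)               # step 1: sum all individual values
    return total / len(values)        # step 2: divide by n

data = [2, 2, 3, 5, 5, 7, 8]
print(sum(data))             # 32
print(round(mean(data), 2))  # 4.57
```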
Calculating the Mean with Data Table
A sample table shows how to calculate the mean number of medications per person. Each row lists a quantity, the count of people who received that quantity, and their cross-product (quantity × count).
Example summary: Sum of cross-products (total medications given) = 218. To find the mean, divide by the number of people: 218/53 ≈ 4.11.
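The cross-product method can be sketched as below. The table values here are hypothetical and are not the lecture's medication table, which is not reproduced in these notes; the function name `mean_from_table` is also an assumption.

```python
# Hypothetical frequency table (NOT the lecture's data):
# quantity of medications -> number of people with that quantity
table = {0: 2, 1: 3, 2: 4, 3: 1}

def mean_from_table(table):
    """Mean from a frequency table: sum of (quantity x count) divided by total count."""
    total = sum(quantity * count for quantity, count in table.items())  # sum of cross-products
    n = sum(table.values())                                             # number of people
    return total / n

print(mean_from_table(table))  # 1.4
```

Here the cross-products are 0, 3, 8, and 3, which sum to 14 across 10 people.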
Data Collection Techniques
Study context: Patient satisfaction at a primary care clinic. The guiding question is which data collection methods should be used to assess patient satisfaction.
Types of Studies in Research
Primary study: Involves gathering new data from participants.
Secondary study: Involves the analysis of existing data collected by someone else.
Tertiary study: Involves reviewing and synthesizing already existing literature.
Key Considerations When Choosing Study Types
Primary Studies:
Identify potential populations for sampling.
Determine participant recruitment feasibility.
Secondary Studies:
Determine available datasets for analysis.
Tertiary Studies:
Confirm access to the literature and resources needed.
Quantitative Research Designs
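Experimental Designs: Randomized Controlled Trials (RCTs), in which the investigator assigns the intervention at random.
Quasi-experimental Designs: The investigator assigns an intervention but without random allocation.
Non-experimental (observational) Designs: Cohort, case-control, and cross-sectional studies. Because exposure is observed rather than assigned, these designs are generally more prone to bias and confounding than RCTs; cross-sectional studies also cannot establish which came first, exposure or outcome.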
Data Collection Sample
Sample information shows data structure with multiple attributes, such as patient type, gender, age, service location, service type, date, and satisfaction rating.
Preparation of Data Types
Raw data: Data exactly as first collected, such as completed survey responses.
Cleaned data: Data that have been formatted and screened for errors so they are ready for analysis.
Considerations for Data Accuracy
Rules set for data entry (e.g., handling circled responses).
Track modifications post-collection.
Analyze frequency and descriptive statistics.
Outliers and data integrity errors must be addressed before performing any analysis.
Frequency Distributions
A frequency distribution compiles responses into a table that shows the count in each category.
Example: satisfaction ratings and their respective counts in a sample dataset.
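A frequency table of this kind can be built with a few lines of Python. The ratings below are hypothetical (a 1 to 5 satisfaction scale is assumed) and are not the course's sample dataset; `frequency_table` is an illustrative name.

```python
from collections import Counter

# Hypothetical satisfaction ratings on an assumed 1-5 scale
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]

def frequency_table(values):
    """Return (category, count) pairs sorted by category."""
    counts = Counter(values)
    return [(value, counts[value]) for value in sorted(counts)]

for rating, count in frequency_table(ratings):
    print(rating, count)
```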
Percentages
Definition: A percentage expresses a part of a whole as a number out of 100.
Calculation: Divide the part by the total and multiply by 100.
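A minimal sketch of the calculation, using hypothetical numbers (4 of 10 respondents choosing a given rating); the function name `percentage` is an assumption.

```python
def percentage(part, total):
    """Part of the whole expressed out of 100."""
    return 100 * part / total

# Hypothetical: 4 of 10 respondents gave a rating of "4"
print(percentage(4, 10))  # 40.0
```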
Cumulative Measures
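Cumulative frequency is the running total of counts up to and including a given category; cumulative percentage is that running total divided by the total number of observations, multiplied by 100. Both require ordered categories and show how much of the distribution falls at or below each value.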
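Cumulative frequency and cumulative percentage are running totals over an ordered frequency table, which can be sketched as follows. The table is hypothetical and the function name `cumulative` is an assumption.

```python
from itertools import accumulate

# Hypothetical frequency table: (rating, count), ordered by rating
table = [(1, 1), (2, 1), (3, 1), (4, 4), (5, 3)]

def cumulative(table):
    """Add cumulative frequency and cumulative percentage to each row."""
    n = sum(count for _, count in table)                # total observations
    running = accumulate(count for _, count in table)   # running totals of counts
    return [(value, count, cum, 100 * cum / n)
            for (value, count), cum in zip(table, running)]

for value, count, cum, cum_pct in cumulative(table):
    print(value, count, cum, cum_pct)
```

In this hypothetical table, the row for rating 4 has a cumulative percentage of 70.0, meaning 70% of respondents gave a rating of 4 or lower.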
Normality in Data
Many statistical tests assume that the data are normally distributed. A normal distribution is characterized by a symmetric, bell-shaped curve.
Quartiles
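Quartiles divide the sorted dataset into four equal parts, giving insight into the spread of the data and helping to detect outliers. Steps: sort the data, find the median, split the data into a lower half and an upper half, and take the median of each half. The median of the lower half is the first quartile (Q1) and the median of the upper half is the third quartile (Q3).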
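The median-of-halves steps can be sketched in Python. One assumption is made that the notes do not specify: when the number of values is odd, the overall median is left out of both halves. Other conventions exist and software packages may give slightly different quartiles. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: for an odd count, the overall median is excluded from both halves.
    """
    ordered = sorted(values)                    # sort the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                      # values below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 2, 3, 5, 5, 7, 8]))  # (2, 5, 7)
```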
Data Presentation Techniques
Bar Charts
Characteristics: The horizontal x-axis shows the categories of a categorical variable; the vertical y-axis shows frequency or percentage.
Pros: Easy to understand; categories are easy to compare.
Cons: Limited detail on the distribution of the data; can become cluttered when there are many categories.
Histograms
Vertical bars show how many observations of a continuous variable fall within each interval (bin). Pros: Gives a clear summary of the distribution and helps in spotting outliers. Cons: Limited to continuous data.
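The counting behind a histogram can be illustrated with a small text-based sketch. The ages, the bin width of 10, and the function name `histogram_counts` are all assumptions made for this example.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each equal-width bin beginning at `start`."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)   # which bin this value belongs to
        if 0 <= index < bins:               # ignore values outside the bin range
            counts[index] += 1
    return counts

# Hypothetical patient ages in whole years
ages = [23, 27, 31, 35, 38, 41, 44, 47, 52, 68]
counts = histogram_counts(ages, start=20, width=10, bins=5)

for i, count in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")
```

The isolated value in the highest bin is the kind of observation a histogram makes easy to spot.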
Line Graphs
Used to indicate changes over time; trends can be identified visually through data points connected by lines.
Scatter Plots
Each point represents one observation plotted against two variables. Scatter plots help evaluate relationships, trends, and potential linearity between the variables.
Box and Whisker Plots
Box and whisker plots provide a visual summary that highlights the median, the interquartile range, and potential outliers. They make comparisons between groups easy and do not assume that the data are normally distributed.
Tips for Effective Data Presentation
Comparing two years with side-by-side pie charts is inefficient; a single combined bar graph usually shows the differences more clearly. Avoid overly complex stacked charts in favor of clustered ones.
Recap
How data are presented directly affects how effectively research findings are communicated to different audiences.
Learning Activities
Introduction to concepts of statistics and visual data representation.
Read Chapter 2 on Presenting Data and complete related review questions.