Visualizing Data: Intro to Statistics

Visualizing Data: Intro to Stats - Week 2

Course Information

Institution: Western University Canada
Department: Health Sciences

Objectives

Review Topics:
- Distributions
- Percentages
- Percentiles
- Visualization

Review of Key Statistical Concepts

Populations and Samples

Previous discussion revolved around populations and sampling methods.

Mode

Definition: The mode is defined as the most frequently occurring value in a dataset.
To find the mode:
- Count occurrences of each unique value in the dataset.
- The mode is the value that appears most often.
- Example: The dataset 2, 2, 3, 5, 5, 7, 8 has two modes, 2 and 5, because each appears twice (bimodal).
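
The counting procedure above can be sketched in Python as follows. This is a minimal illustration, not part of the course material; the function name `find_modes` is an assumption chosen for this example.

```python
from collections import Counter

def find_modes(values):
    """Return every value that ties for the highest count, in sorted order."""
    counts = Counter(values)           # count occurrences of each unique value
    highest = max(counts.values())     # the largest count observed
    return sorted(v for v, c in counts.items() if c == highest)

data = [2, 2, 3, 5, 5, 7, 8]
print(find_modes(data))  # [2, 5]
```

Returning a list rather than a single value keeps bimodal and multimodal datasets from being silently reduced to one mode.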

Median

Definition: The median is the middle value when all values are organized in ascending order.
Procedure for determining the median:
- Organize the dataset from smallest to largest.
- If the number of values is odd, the median is the middle value.
- If it’s even, compute the average of the two middle values.
- Example: The dataset 2, 2, 3, 5, 5, 7, 8 has seven values, so the median is the fourth value, 5.
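
A short Python sketch of the procedure above, covering both the odd and even cases. The function name `find_median` and the second, four-value dataset are assumptions made for illustration.

```python
def find_median(values):
    """Return the middle value of the sorted data (average of two middles if even)."""
    ordered = sorted(values)           # organize from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd count: single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(find_median([2, 2, 3, 5, 5, 7, 8]))  # 5
print(find_median([2, 3, 5, 7]))           # 4.0
```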

Mean

Definition: The mean represents the average of a dataset.
Steps to calculate the mean:
- Sum all individual values in the dataset to derive a total.
- Divide that total by the number of values present (n).
- Example: For dataset 2, 2, 3, 5, 5, 7, 8:
1. Total = $2 + 2 + 3 + 5 + 5 + 7 + 8 = 32$
2. Mean = $\frac{32}{7} \approx 4.57$.
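
The same calculation, sketched in Python under the assumption that the data are held in a plain list. The name `find_mean` is illustrative only.

```python
def find_mean(values):
    """Sum the values, then divide by how many there are (n)."""
    total = sum(values)
    return total / len(values)

data = [2, 2, 3, 5, 5, 7, 8]
print(sum(data))                  # 32
print(round(find_mean(data), 2))  # 4.57
```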

Calculating the Mean with Data Table

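When data are summarized in a frequency table, the mean can be calculated from the table directly. The lecture example tabulates medications distributed among a number of people: each quantity is listed with its count, and each quantity is multiplied by its count to give a cross-product.
Example summary: The cross-products sum to a total of 218 medications. Dividing by the 53 observations gives the mean: 218/53 ≈ 4.11.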
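
A hedged sketch of the cross-product method in Python. The lecture's full table is not reproduced in these notes, so the small table below is hypothetical and only demonstrates the arithmetic; `mean_from_table` is an illustrative name.

```python
def mean_from_table(table):
    """table maps each value (e.g. number of medications) to its count of people."""
    total = sum(value * count for value, count in table.items())  # sum of cross-products
    n = sum(table.values())                                       # total number of people
    return total / n

# Hypothetical table, NOT the lecture's data: value -> count
example = {1: 2, 2: 3, 3: 1}
print(round(mean_from_table(example), 2))  # 1.83

# The summary figures quoted in the notes:
print(round(218 / 53, 2))  # 4.11
```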

Data Collection Techniques

Study context: Patient satisfaction at a primary care clinic. The guiding question is which data collection methods can be used to assess patient satisfaction.

Types of Studies in Research

Primary study: Involves gathering new data from participants.
Secondary study: Involves the analysis of existing data collected by someone else.
Tertiary study: Involves reviewing and synthesizing already existing literature.

Key Considerations When Choosing Study Types

Primary Studies:
- Identify potential populations for sampling.
- Determine participant recruitment feasibility.
Secondary Studies:
- Determine available datasets for analysis.
Tertiary Studies:
- Access to literature and resources needed.

Quantitative Research Designs

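Experimental Designs: Randomized Controlled Trials (RCTs), in which the researcher randomly assigns participants to an intervention or control group.
Quasi-experimental Designs: Studies that test an intervention without random assignment.
Non-experimental (observational) Designs: Cohort, case-control, and cross-sectional studies. Cross-sectional studies capture a single point in time, so they cannot show whether an exposure preceded an outcome.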

Data Collection Sample

Sample information shows data structure with multiple attributes, such as patient type, gender, age, service location, service type, date, and satisfaction rating.

Preparation of Data Types

Raw data: Data exactly as originally collected, such as completed surveys.
Cleaned data: Data that has been formatted and screened for analysis.

Considerations for Data Accuracy

Set rules for data entry (e.g., how to handle circled responses).
Track any modifications made after collection.
Run frequency and descriptive statistics to screen the data.
Outliers and data integrity errors must be addressed before performing any analysis.

Frequency Distributions

A frequency distribution compiles responses into a table that shows the count in each category.
The lecture example tabulates satisfaction ratings and their respective counts from a sample dataset.
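
One way to build such a table in Python is sketched below. The ratings are hypothetical (a 1 to 5 scale is assumed), since the lecture's sample dataset is not reproduced here, and `frequency_table` is an illustrative name.

```python
from collections import Counter

def frequency_table(responses):
    """Return (category, count) pairs ordered by category."""
    counts = Counter(responses)
    return [(category, counts[category]) for category in sorted(counts)]

# Hypothetical satisfaction ratings on an assumed 1-5 scale
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]

for rating, count in frequency_table(ratings):
    print(rating, count)
```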

Percentages

Definition: A percentage expresses a part of the whole as a number out of 100.
To calculate it, divide the part by the total and multiply by 100.
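
As a quick illustration in Python (the counts used are hypothetical, and `percentage` is an illustrative name):

```python
def percentage(part, total):
    """Part divided by total, multiplied by 100."""
    return 100 * part / total

# Hypothetical: 4 of 10 respondents chose a given rating
print(percentage(4, 10))  # 40.0
```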

Cumulative Measures

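Cumulative frequency is the running total of counts up to and including a given category; cumulative percentage is that running total divided by the overall total and multiplied by 100. Both show where a value falls within an ordered distribution.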
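
A minimal Python sketch of both cumulative measures, assuming the frequency table is already ordered by category. The rows are hypothetical and `cumulative_table` is an illustrative name.

```python
from itertools import accumulate

def cumulative_table(freq_rows):
    """freq_rows: ordered (category, count) pairs.
    Returns (category, count, cumulative count, cumulative percentage) rows."""
    total = sum(count for _, count in freq_rows)
    running = accumulate(count for _, count in freq_rows)  # running totals
    return [
        (category, count, cum, 100 * cum / total)
        for (category, count), cum in zip(freq_rows, running)
    ]

# Hypothetical frequency table: rating -> count
rows = [(1, 1), (2, 1), (3, 1), (4, 4), (5, 3)]
for row in cumulative_table(rows):
    print(row)
```

The last row's cumulative percentage should always be 100, which is a useful check on the table.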

Normality in Data

Many statistical tests assume that the data are normally distributed. A normal distribution is characterized by a symmetric, bell-shaped curve.

Quartiles

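Quartiles divide an ordered dataset into four equal parts, which gives insight into the spread of the data and helps with outlier detection.
Steps to find the quartiles:
- Sort the data from smallest to largest.
- Identify the median (the second quartile, Q2).
- Divide the data into a lower half and an upper half.
- Calculate the median of each half: the lower-half median is Q1 and the upper-half median is Q3.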
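
The median-of-halves approach can be sketched in Python as below. One assumption is worth flagging: when the number of values is odd, this sketch leaves the median out of both halves. Other conventions exist, and software packages may return slightly different quartiles. The function names are illustrative.

```python
def median_of_sorted(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the median itself is excluded from both halves
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

print(quartiles([2, 2, 3, 5, 5, 7, 8]))  # (2, 5, 7)
```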

Data Presentation Techniques

Bar Charts

Characteristics include horizontal x-axis (categorical variables) and vertical y-axis (frequency/percentage).
Pros: Easy understanding, comparing categories.
Cons: Limited detail on data distribution, potential clutter.

Histograms

Vertical bars represent the frequency of values of a continuous variable within intervals (bins).
Pros: Clear summary of the distribution, helpful for spotting outliers.
Cons: Limited to continuous data.
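
To show how a continuous variable is grouped before it is drawn as a histogram, here is a small Python sketch that counts values per interval and prints a text version of the bars. The ages, the bin width of 10, and the name `bin_counts` are all assumptions made for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count values falling in each half-open interval [low, low + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:        # values outside the bins are ignored
            counts[index] += 1
    return counts

# Hypothetical patient ages
ages = [23, 35, 37, 41, 48, 52, 58, 59, 64, 71]
counts = bin_counts(ages, start=20, width=10, n_bins=6)

for i, count in enumerate(counts):
    low = 20 + i * 10
    print(f"{low}-{low + 9}: " + "#" * count)
```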

Line Graphs

Used to indicate changes over time; trends can be identified visually through data points connected by lines.

Scatter Plots

Each point symbolizes observations in relation to two variables. Scatter plots help evaluate relationships, trends, and potential linearity between variables.

Box and Whisker Plots

Box and whisker plots provide a visual summary of the median, interquartile range, and potential outliers. They make comparisons between groups easy and do not assume that the data are normally distributed.
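
The numbers behind a box and whisker plot can be sketched in Python as follows. Two assumptions apply: quartiles use the median-of-halves method, and potential outliers are flagged with the common 1.5 × IQR rule, which these notes do not specify. The dataset is the earlier example with a hypothetical extra value of 30 added; `box_plot_summary` is an illustrative name.

```python
def _median(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def box_plot_summary(values):
    """Return the quantities a box plot displays."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = _median(ordered[:half])
    q3 = _median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1                                   # interquartile range
    low_fence = q1 - 1.5 * iqr                      # assumed 1.5 x IQR rule
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": _median(ordered),
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),        # most extreme non-outliers
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

# Earlier example dataset plus one hypothetical extreme value (30)
summary = box_plot_summary([2, 2, 3, 5, 5, 7, 8, 30])
print(summary["outliers"])  # [30]
```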

Tips for Effective Data Presentation

Two pie charts for yearly comparisons can be inefficient. Using a combined bar graph may present clearer insights. Avoid overly complex stacked charts in favor of clustered representations.

Recap

Data presentation directly affects how effectively research findings are communicated to different audiences.

Learning Activities

Introduction to concepts of statistics and visual data representation.
Read Chapter 2 on Presenting Data and complete related review questions.