Describing Data in Probability and Statistics
Chapter 1: Describing Data
Overview
In this chapter, we will explore key concepts related to displaying data through various methods, including graphs and tables. These tools will allow us to analyze frequency distributions and visualize data effectively, especially in the context of statistics and engineering. The primary software discussed will be R, which facilitates data visualization and analysis.
1. Display of Data by Graphs and Tables
A. Frequency Distributions
Basic Steps for Creating Frequency Distributions:
Find the Minimum and Maximum Values:
Identify the lowest and highest observations in the dataset.
Class Intervals:
These are intervals of equal length that cover the range between the minimum and maximum values without overlapping.
Example: For data ranging from 0 to 100, class intervals might be [0, 10), [10, 20), …, [90, 100].
Frequency:
This is the count of observations in the dataset that belong to each interval, denoted as $f_1, f_2, \ldots$
Relative Frequency:
This is the class frequency divided by the number of observations: $f_i / n$, where $n$ is the total number of observations.
A total relative frequency of 1.00 indicates that all observations are accounted for.
Note: Half-open intervals are preferred in histograms to avoid overlap and ensure every data value belongs to exactly one bin.
Example 1.1: Midterm Scores
Observations: 69, 84, 52, 93, 81, 74, 89, 85, 88, 63, 87, 64, 67, 72, 74, 55, 82, 91, 68, 77
Total Observations, $n = 20$
Minimum Value, min = 52
Maximum Value, max = 93
| Class Interval | Tally | Frequency | Relative Frequency |
|---|---|---|---|
| 50-59 | // | 2 | 0.10 |
| 60-69 | ///// | 5 | 0.25 |
| 70-79 | //// | 4 | 0.20 |
| 80-89 | /////// | 7 | 0.35 |
| 90-99 | // | 2 | 0.10 |
| Total | | 20 | 1.00 |
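The steps for building a frequency distribution can be sketched in a few lines of code. The chapter's own software is R; the block below uses Python purely for illustration, and the helper name `frequency_table` is hypothetical. It assumes half-open classes of width 10 starting at 50, which for integer scores match the classes 50-59, 60-69, and so on.

```python
# Midterm scores from Example 1.1.
scores = [69, 84, 52, 93, 81, 74, 89, 85, 88, 63,
          87, 64, 67, 72, 74, 55, 82, 91, 68, 77]

def frequency_table(data, start, width, num_classes):
    """Count observations in half-open classes [lo, lo + width)."""
    n = len(data)
    rows = []
    for k in range(num_classes):
        lo = start + k * width
        hi = lo + width
        freq = sum(1 for x in data if lo <= x < hi)
        rows.append((lo, hi, freq, freq / n))  # relative frequency = f_i / n
    return rows

table = frequency_table(scores, start=50, width=10, num_classes=5)
for lo, hi, freq, rel in table:
    print(f"[{lo}, {hi})  frequency={freq}  relative frequency={rel:.2f}")
```

Because every score falls in exactly one class, the frequencies add up to $n = 20$ and the relative frequencies add up to 1.00.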
B. Histogram
When to Use a Histogram:
To See Data Shape: Reveal whether the data are symmetric (for example, bell-shaped), skewed, uniform, or multi-peaked (for example, bimodal).
For Large Datasets: Effective for datasets with 100+ observations where individual points are hard to interpret.
To Identify Outliers: Quickly spot extreme values that deviate from main clusters.
For Continuous Data: Ideal for continuous variables (height, weight, etc.), not suited for categorical data.
For Process Monitoring: Used in manufacturing to verify if products meet specified dimensions.
To Analyze Customer Behavior: Helps in understanding distributions related to customer ages, spending, etc.
Examples of Usage:
Quality Control: Verify that manufactured items meet tolerance limits.
Finance: Analyze distributions of investment returns to measure risk.
Healthcare: Visualize distributions of patient age or disease cases.
Human Resources: Examine the spread of employee salaries.
When Not to Use a Histogram:
Categorical Data: Use bar charts for distinct categories (e.g., product types).
Comparing Multiple Variables: A histogram focuses on single-variable distributions; alternative charts are better for multidimensional analysis.
Histogram Representation of Midterm Scores
[Figure: histogram of the relative frequencies of the midterm scores from Example 1.1]
C. Shapes of Distributions
Symmetric Distribution: The left and right sides of the distribution are mirror images.
Unimodal Distribution: A distribution with a single peak.
Bimodal Distribution: Contains two distinct peaks.
Uniform Distribution: All values occur with equal frequency.
Skewed Distribution: One tail is longer or fatter than the other.
Right Skewed: Mean > Median.
Left Skewed: Mean < Median.
E. Dot Diagram
Ideal Usage:
Small Datasets: Best for visualizing distributions with few data points.
Univariate Data: Suitable for single-variable data, whether categorical or quantitative.
Visualizing Distribution: Helpful for spotting clusters, gaps, skewness, and outliers.
Quick Analysis: Easily identify the range, median, and frequency of values.
Example 1.3: Exam Score Data
Scores: 55, 61, 94, 94, 69, 77, 68, 54, 85, 77, 92, …
Example 1.4: Heights of Students (in inches)
Heights: 67.2, 65.0, 72.5, 71.1, 69.1, …
F. Qualitative (Categorical) Data
Definition:
Categorical data groups observations into categories using names or labels.
Types of Categorical Data:
Nominal Categorical Data: No inherent order (e.g., gender, color).
Ordinal Categorical Data: Defined order (e.g., rankings, satisfaction levels).
Example 1.6: Frequency Distribution of Enrollment in a High School
| Class | Frequency | Relative Frequency |
|---|---|---|
| Algebra | 26 | 0.263 |
| English | 30 | 0.303 |
| Physics | 19 | 0.192 |
| Biology | 24 | 0.242 |
| Total | 99 | 1.000 |
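For categorical data the relative frequency is computed exactly as for class intervals: each category count divided by the total count. A minimal Python sketch, taking the enrollment counts listed in Example 1.6 as given (the variable names are illustrative):

```python
# Enrollment counts from Example 1.6.
enrollment = {"Algebra": 26, "English": 30, "Physics": 19, "Biology": 24}

n = sum(enrollment.values())
relative = {name: count / n for name, count in enrollment.items()}

for name, share in relative.items():
    print(f"{name:8s} {enrollment[name]:3d} {share:.3f}")
print(f"{'Total':8s} {n:3d} {sum(relative.values()):.3f}")
```

With these counts the total is 99, so each relative frequency is a count divided by 99, and the shares add up to 1.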
G. Bar Graphs
Characteristics of Bar Graphs:
Represents categorical data
Equal space between bars
Height corresponds to frequency, while bars maintain uniform width.
Differences from Histograms:
Histograms represent quantitative data without spaces between bars.
The area of the bars reflects frequency, and bar width can vary.
H. Pie Charts
A pie chart visually represents part-to-whole relationships, where each slice corresponds to a category and all slices combine to equal the whole.
I. Time Plots
A time plot or time series plot shows observations against time.
Components:
Trends: Indicate whether the data are increasing or decreasing over time.
Seasonal Variation or Cycles: A regular pattern that repeats over time.
Example 5: Number of Workers Late
| Day | Week 1 | Week 2 | Week 3 | Week 4 |
|---|---|---|---|---|
| Monday | 6 | 8 | 7 | 5 |
| Tuesday | 3 | 0 | 2 | 0 |
| Wednesday | 2 | 5 | 1 | 1 |
| Thursday | 4 | 3 | 0 | 0 |
| Friday | 7 | 2 | 2 | 1 |
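Before plotting, it can help to put the table in time order and summarize it both ways: by week (to look for a trend) and by weekday (to look for a repeating pattern). The sketch below is one possible way to do that in Python; the variable names are illustrative and no plotting library is assumed.

```python
# Number of workers late, by weekday, for weeks 1 to 4 (table above).
late = {
    "Monday":    [6, 8, 7, 5],
    "Tuesday":   [3, 0, 2, 0],
    "Wednesday": [2, 5, 1, 1],
    "Thursday":  [4, 3, 0, 0],
    "Friday":    [7, 2, 2, 1],
}
days = list(late)
num_weeks = 4

# Chronological series: week 1 Monday..Friday, then week 2, and so on.
series = [late[day][week] for week in range(num_weeks) for day in days]

# Trend: total late arrivals per week.
weekly_totals = [sum(late[day][week] for day in days) for week in range(num_weeks)]

# Weekly cycle: average late arrivals for each weekday.
day_means = {day: sum(counts) / len(counts) for day, counts in late.items()}

print(weekly_totals)
print(day_means)
```

The weekly totals fall from 22 to 7 (a downward trend), while Monday has the highest weekday average (a recurring weekly pattern).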
2. Measures of Central Tendency
2.1 Definitions:
Mode: The value that occurs most frequently in a dataset.
Mean: The average value, computed as:
$$\bar{x} = \frac{\text{Sum of all observations}}{\text{Number of observations}}$$
Median: The middle value when the dataset is ordered.
Example 1.7: Weights of Five 7th Grade Girls (in pounds)
Weights: 122, 94, 135, 111, 108
Calculating Mean:
$$\bar{x} = \frac{122 + 94 + 135 + 111 + 108}{5} = \frac{570}{5} = 114$$
2.2 Population Mean:
The population mean (denoted by $\mu$) is calculated as:
$$\mu = \frac{1}{N} \times \text{Sum of all observations in the population}$$
where $N$ is the number of observations in the population.
2.3 Median Calculation
Algorithm for Obtaining Median:
Order the data from smallest to largest.
Determine whether $n$ is odd or even:
For odd $n$: Take the middle value, at position $(n+1)/2$.
For even $n$: Take the average of the two middle values, at positions $n/2$ and $n/2 + 1$.
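The algorithm above translates directly into code. This is a minimal Python sketch (the function name `median` is chosen here for illustration), applied to the weights of Example 1.7 and the survival days of Example 1.8.

```python
def median(values):
    """Median via the two-step procedure: order, then pick the middle."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is index mid when counting from 0.
        return ordered[mid]
    # Even n: average positions n / 2 and n / 2 + 1 (indices mid - 1 and mid).
    return (ordered[mid - 1] + ordered[mid]) / 2

weights = [122, 94, 135, 111, 108]          # Example 1.7 (n = 5, odd)
survival_days = [15, 3, 46, 623, 126, 64]   # Example 1.8 (n = 6, even)

print(median(weights))        # 111
print(median(survival_days))  # 55.0
```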
Example of Calculating Median:
From Example 1.7, the ordered weights are 94, 108, 111, 122, 135.
Because there are 5 weights (odd $n$), the median is the middle value:
$$\text{median} = x_{\frac{n+1}{2}} = x_{\frac{5+1}{2}} = x_3 = 111$$
Example 1.8: Survival Days of Heart Transplant Patients
Days: 15, 3, 46, 623, 126, 64
Ordered: 3, 15, 46, 64, 126, 623
Median Calculation:
$$\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} = \frac{x_3 + x_4}{2} = \frac{46 + 64}{2} = 55 \text{ days}$$
Average:
$$\text{Mean} = \frac{15 + 3 + 46 + 623 + 126 + 64}{6} = \frac{877}{6} \approx 146.2 \text{ days}$$
Conclusion on Mean vs. Median
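In skewed distributions, or in the presence of outliers, the median often serves as a better indicator of the center. In Example 1.8, the single value 623 pulls the mean up to about 146.2 days, while the median stays at 55 days, much closer to where most of the observations lie.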
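One way to see how strongly a single extreme value affects each measure is to recompute both with and without it. The sketch below uses Python's standard `statistics` module on the Example 1.8 data; dropping the value 623 is done here only to illustrate sensitivity, not as a recommended way to handle the data.

```python
from statistics import mean, median

days = [15, 3, 46, 623, 126, 64]                  # Example 1.8
without_extreme = [d for d in days if d != 623]   # illustration only

print(mean(days), median(days))
print(mean(without_extreme), median(without_extreme))
```

The mean drops from about 146.2 to 50.8 when the extreme value is left out, whereas the median only moves from 55 to 46.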
2.4 Skewness
Definition:
Skewness describes the asymmetry of a frequency distribution. It can be gauged by how the mean and median fall in relation to each other.
Key Points:
If mean and median are (approximately) equal, the data are roughly symmetric.
If mean > median, data is positively skewed (right skewed).
If mean < median, data is negatively skewed (left skewed).
Comparison of Mean, Median, and Mode:
Example 1.9: Annual Income of Professional Baseball Players
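Income in $1,000s, with the number of players at each income level in parentheses: 30 (180), 50 (100), 100 (50), 500 (20), 1,000 (10), 2,000 (15), 3,000 (15), 5,000 (10). The frequencies add up to 400 players.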
Frequency Table Calculation:
Total Income calculation:
$$\text{Total Income} = (30 \times 180) + (50 \times 100) + (100 \times 50) + \ldots + (5{,}000 \times 10) = 160{,}400$$
The mean income:
$$\text{Mean} = \frac{\text{Total Income}}{400} = \frac{160{,}400}{400} = 401 \text{ (in \$1,000s)}$$
that is, \$401,000.
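The same frequency table also yields the median and the mode, which makes the comparison of the three measures concrete. The Python sketch below derives them from the listed frequencies; the helper `ordered_value` is a hypothetical name, and the median follows the odd/even rule from Section 2.3.

```python
# (income in $1,000s, number of players) pairs from Example 1.9.
income_counts = [(30, 180), (50, 100), (100, 50), (500, 20),
                 (1000, 10), (2000, 15), (3000, 15), (5000, 10)]

n = sum(count for _, count in income_counts)
total_income = sum(value * count for value, count in income_counts)
mean_income = total_income / n

# Mode: the income level with the largest frequency.
mode_income = max(income_counts, key=lambda pair: pair[1])[0]

def ordered_value(pairs, position):
    """Return the observation at a 1-based position in the ordered data."""
    cumulative = 0
    for value, count in sorted(pairs):
        cumulative += count
        if position <= cumulative:
            return value
    raise ValueError("position exceeds the number of observations")

if n % 2 == 1:
    median_income = ordered_value(income_counts, (n + 1) // 2)
else:
    median_income = (ordered_value(income_counts, n // 2)
                     + ordered_value(income_counts, n // 2 + 1)) / 2

print(mean_income, median_income, mode_income)
```

For these data the mean (401) is far above the median (50) and the mode (30), all in $1,000s, which is the pattern expected for a strongly right-skewed distribution.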
3. Measures of Variation
Definition:
Measures of variation inform us how spread out or scattered the data values are, addressing whether values are close together or vary widely.
Important Measure: Sample Variance & Standard Deviation
Sample Variance Formula:
$$s^2 = \frac{\text{Sum of squares of deviations from mean}}{n - 1}$$
Sample Standard Deviation Formula:
$$s = \sqrt{s^2}$$
Example 1.10: Calculating Variance and Standard Deviation
Data Observations: 6, 12, 6, 6, 4, 8
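Calculate the mean:
$$\bar{x} = \frac{42}{6} = 7$$
Variance (here 332 is the sum of the squared observations and 42 is the sum of the observations):
$$s^2 = \frac{332 - \frac{42^2}{6}}{6 - 1} = \frac{38}{5} = 7.6$$
Standard Deviation:
$$s = \sqrt{7.6} \approx 2.76$$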
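The calculation in Example 1.10 can be checked numerically. The Python sketch below computes the sample variance two ways: from the definition (squared deviations from the mean) and from the shortcut form used in the example (sum of squares minus the squared sum over $n$). Variable names are illustrative.

```python
import math

data = [6, 12, 6, 6, 4, 8]   # Example 1.10
n = len(data)
x_bar = sum(data) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
var_definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
var_shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

s = math.sqrt(var_definition)
print(x_bar, var_definition, var_shortcut, round(s, 2))
```

Both forms give the same variance; the shortcut is convenient by hand, while the definition makes the meaning of the quantity clearer.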
Additional Measures: Quartiles & Percentiles
Percentiles:
The 100$p$-th percentile is a value such that at least 100$p$% of the observations lie at or below it and at least 100(1-$p$)% lie at or above it.
Quartiles:
First Quartile ($Q_1$): 25th percentile
Second Quartile ($Q_2$): 50th percentile
Third Quartile ($Q_3$): 75th percentile
Calculation Procedure:
Order the observations.
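For the 100$p$-th percentile: Calculate $np$. Under one common convention, if $np$ is not an integer, round it up to the next integer and take the ordered observation at that position; if $np$ is an integer $k$, average the $k$-th and $(k+1)$-th ordered observations.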
Boxplots
Key Concepts:
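Sample Range: $x_n - x_1 = \max - \min$, where $x_1$ and $x_n$ are the smallest and largest ordered observations.
Interquartile range (IQR): $\text{IQR} = Q_3 - Q_1$
Outliers: Observations less than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$.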
Example 1.11: Exam Scores and Result for Outliers
Given scores: 81, 57, 85, 84, 99 …
Ordered: 57, 69, 76, 76, 81, 83, 84, 85, 90, 99
Calculate:
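Median: $Q_2 = \frac{x_5 + x_6}{2} = \frac{81 + 83}{2} = 82$
$Q_1$ and $Q_3$: read off from the cumulative frequencies of the ordered scores.
Outlier Processing:
Compute the IQR and flag any score below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ as an outlier.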
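The outlier check can be sketched as follows, using the ten ordered scores listed in Example 1.11. The quartile rule is an assumption: the position $np$ is rounded up when it is not an integer, and two neighbouring ordered values are averaged when it is, which agrees with the median formula above. Other conventions, including the defaults of some statistical software, can give different quartiles and therefore different fences. The function name `percentile` is hypothetical.

```python
def percentile(data, percent):
    """Percentile under the assumed round-up / average convention.

    `percent` is an integer strictly between 0 and 100.
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(data)
    scaled = len(ordered) * percent      # n * p, kept as an integer times 100
    if scaled % 100 == 0:
        k = scaled // 100                # np is an integer k
        return (ordered[k - 1] + ordered[k]) / 2
    k = -(-scaled // 100)                # np is not an integer: round up
    return ordered[k - 1]

scores = [57, 69, 76, 76, 81, 83, 84, 85, 90, 99]   # Example 1.11, ordered

q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

Under this convention the quartiles are 76, 82, and 85, the fences are 62.5 and 98.5, and the scores 57 and 99 fall outside them.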
4. Relationship Between Two Variables
Covariance
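To measure the relationship between two variables observed in pairs $(x_i, y_i)$, the sample covariance divides the sum of the products of the deviations from $\bar{x}$ and $\bar{y}$ by $n - 1$:
$$\text{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$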
Positive covariance: indicates a direct relationship.
Negative covariance: indicates an inverse relationship.
Example 1.13: Initial Speed and Stopping Distance
Data: Initial speed: 11, 22, 32, 41, 51; Stopping distance: 8.2, 32.8, 82.0, 144.4, 236.2.
Calculation of Covariance:
The means are $\bar{x} = 31.4$ and $\bar{y} = 100.72$.
Multiply each pair of deviations $(x_i - \bar{x})(y_i - \bar{y})$, sum the products, and divide by $n - 1 = 4$ to obtain the covariance.
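The covariance for Example 1.13 can be worked through in code. This Python sketch implements the sample covariance as the sum of products of deviations divided by $n - 1$; the function name `sample_covariance` is illustrative.

```python
speed = [11, 22, 32, 41, 51]                    # initial speed
distance = [8.2, 32.8, 82.0, 144.4, 236.2]      # stopping distance

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

cov = sample_covariance(speed, distance)
print(round(cov, 2))
```

The result is about 1397.34, a positive covariance: higher initial speeds go with longer stopping distances.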
Correlation Coefficient
Measure of relationship normalized using the standard deviations:
$$r = \frac{\text{Cov}(x, y)}{s_x s_y}$$
The value of $r$ ranges from -1 to +1, giving insight into the strength and direction of the linear relationship.
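As an illustration of the formula, the Python sketch below computes $r$ for the speed and stopping-distance pairs of Example 1.13. It relies on the fact that the $n - 1$ factors in the covariance and in the two standard deviations cancel, so only sums of deviations are needed; the function name `correlation` is illustrative.

```python
import math

speed = [11, 22, 32, 41, 51]                    # Example 1.13
distance = [8.2, 32.8, 82.0, 144.4, 236.2]

def correlation(xs, ys):
    """Sample correlation r = Cov(x, y) / (s_x * s_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # The common factor 1 / (n - 1) cancels between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)

r = correlation(speed, distance)
print(round(r, 3))
```

Here $r$ is roughly 0.97, a strong positive linear association, even though stopping distance grows faster than linearly with speed; $r$ measures only the linear part of the relationship.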
Example 1.15
Calculate the correlation coefficient for paired midterm and final scores, with one pair of observations per student. R commands can be used to carry out the calculation.
Conclusion
Understanding how to present data, analyze distributions, and compute important statistics is essential in probability and statistics and in their applications to science and engineering.