Describing Data in Probability and Statistics

Chapter 1: Describing Data

Overview

In this chapter, we will explore key concepts related to displaying data through various methods, including graphs and tables. These tools will allow us to analyze frequency distributions and visualize data effectively, especially in the context of statistics and engineering. The primary software discussed will be R, which facilitates data visualization and analysis.

1. Display of Data by Graphs and Tables

A. Frequency Distributions

Basic Steps for Creating Frequency Distributions:

Find the Minimum and Maximum Values:
- Identify the lowest and highest observations in the dataset.
Class Intervals:
- These are intervals of equal length that cover the range between the minimum and maximum values without overlapping.
- Example: For data ranging from 0 to 100, class intervals might be [0, 10), [10, 20), …, [90, 100].
Frequency:
- This is the count of observations in the dataset that belong to each interval, denoted as $f_1, f_2, \ldots$
Relative Frequency:
- This is calculated as class frequency divided by the total number of observations: $f_i/n$ where $n$ is the total number of observations.
- A total relative frequency of 1.00 indicates that all observations are accounted for.

Note: Half-open intervals are preferred in histograms to avoid overlap and ensure every data value belongs to exactly one bin.

Example 1.1: Midterm Scores

Observations: 69, 84, 52, 93, 81, 74, 89, 85, 88, 63, 87, 64, 67, 72, 74, 55, 82, 91, 68, 77

Total Observations, $n = 20$
Minimum Value, min = 52
Maximum Value, max = 93

Class Interval	Frequency	Relative Frequency
50-59	2	0.10
60-69	5	0.25
70-79	4	0.20
80-89	7	0.35
90-99	2	0.10
Total	20	1.00
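
The counting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chapter's own software is R, the function name `frequency_table` and its parameters are hypothetical, and the classes are taken as half-open intervals [50, 60), [60, 70), and so on, which for whole-number scores matches the 50-59, 60-69, ... labels.

```python
from collections import Counter

# Midterm scores from Example 1.1
scores = [69, 84, 52, 93, 81, 74, 89, 85, 88, 63,
          87, 64, 67, 72, 74, 55, 82, 91, 68, 77]

def frequency_table(data, start, width, n_classes):
    """Count observations in half-open classes [lo, lo + width)."""
    # Class index of each observation: 0 for [start, start + width), 1 for the next, ...
    counts = Counter((x - start) // width for x in data)
    n = len(data)
    table = []
    for k in range(n_classes):
        lo = start + k * width
        f = counts.get(k, 0)
        table.append((lo, lo + width, f, f / n))  # (lower, upper, frequency, relative frequency)
    return table

table = frequency_table(scores, start=50, width=10, n_classes=5)
for lo, hi, f, rel in table:
    print(f"[{lo}, {hi})  f = {f}  relative = {rel:.2f}")
```

The relative frequencies add up to 1, which is the check described under "Relative Frequency" above.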

B. Histogram

When to Use a Histogram:

To See Data Shape: Reveals whether the data are symmetric (for example, bell-shaped), skewed, uniform, or multi-peaked (for example, bimodal).
For Large Datasets: Effective for datasets with 100+ observations where individual points are hard to interpret.
To Identify Outliers: Quickly spot extreme values that deviate from main clusters.
For Continuous Data: Ideal for continuous variables (height, weight, etc.), not suited for categorical data.
For Process Monitoring: Used in manufacturing to verify if products meet specified dimensions.
To Analyze Customer Behavior: Helps in understanding distributions related to customer ages, spending, etc.

Examples of Usage:

Quality Control: Verify that manufactured items meet tolerance limits.
Finance: Analyze distributions of investment returns to measure risk.
Healthcare: Visualize distributions of patient age or disease cases.
Human Resources: Examine the spread of employee salaries.

When Not to Use a Histogram:

Categorical Data: Use bar charts for distinct categories (e.g., product types).
Comparing Multiple Variables: A histogram focuses on single-variable distributions; alternative charts are better for multidimensional analysis.

Histogram Representation of Midterm Scores

Figure: Histogram of the relative frequencies of the midterm scores from Example 1.1.

C. Shapes of Distributions

Symmetric Distribution: The left and right sides of the distribution are mirror images.
Unimodal Distribution: A distribution with a single peak.
Bimodal Distribution: Contains two distinct peaks.
Uniform Distribution: All values occur with equal frequency.
Skewed Distribution: One tail is longer or fatter than the other.
- Right Skewed: The longer tail is on the right; typically Mean > Median.
- Left Skewed: The longer tail is on the left; typically Mean < Median.

E. Dot Diagram

Ideal Usage:

Small Datasets: Best for visualizing distributions with few data points.
Univariate Data: Suitable for single-variable data, whether categorical or quantitative.
Visualizing Distribution: Helpful for spotting clusters, gaps, skewness, and outliers.
Quick Analysis: Easily identify the range, median, and frequency of values.

Example 1.3: Exam Score Data

Scores: 55, 61, 94, 94, 69, 77, 68, 54, 85, 77, 92, …

Example 1.4: Heights of Students (in inches)

Heights: 67.2, 65.0, 72.5, 71.1, 69.1, …
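
As a rough illustration of how a dot diagram stacks one dot per observation, the Python sketch below prints a text version for Example 1.3. It uses only the eleven scores listed before the ellipsis, so it does not represent the full dataset, and the helper name `dot_diagram` is hypothetical. A real dot diagram places the dots along a number line; this version simply lists the distinct values in order.

```python
from collections import Counter

# Only the scores listed before the ellipsis in Example 1.3 (the full data are not shown).
listed_scores = [55, 61, 94, 94, 69, 77, 68, 54, 85, 77, 92]

def dot_diagram(values):
    """Return one text line per distinct value, with one dot per observation."""
    counts = Counter(values)
    return [f"{v:>3} | " + "." * counts[v] for v in sorted(counts)]

diagram = dot_diagram(listed_scores)
print("\n".join(diagram))
```

Repeated values such as 77 and 94 show up as stacked dots, which is how clusters become visible in small datasets.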

F. Qualitative (Categorical) Data

Definition:

Categorical data groups observations into categories using names or labels.

Types of Categorical Data:

Nominal Categorical Data: No inherent order (e.g., gender, color).
Ordinal Categorical Data: Defined order (e.g., rankings, satisfaction levels).

Example 1.6: Frequency Distribution of Enrollment in a High School

Class	Frequency	Relative Frequency
Algebra	26	0.26
English	30	0.30
Physics	19	0.19
Biology	24	0.24
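Total	99	1.00

Note: Each relative frequency is rounded to two decimals (for example, 26/99 ≈ 0.26), so the rounded column adds to 0.99 even though the exact total is 1.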

G. Bar Graphs

Characteristics of Bar Graphs:

Represents categorical data
Equal space between bars
Height corresponds to frequency, while bars maintain uniform width.

Differences from Histograms:

Histograms represent quantitative data without spaces between bars.
The area of the bars reflects frequency, and bar width can vary.

H. Pie Charts

A pie chart visually represents part-to-whole relationships, where each slice corresponds to a category and all slices combine to equal the whole.

I. Time Plots

A time plot or time series plot shows observations against time.

Components:

Trend: Indicates whether the values are increasing or decreasing over time.
Seasonal Variation or Cycles: Regular pattern of movement across time.

Example 5: Number of Workers Late

Day	Week 1	Week 2	Week 3	Week 4
Monday	6	8	7	5
Tuesday	3	0	2	0
Wednesday	2	5	1	1
Thursday	4	3	0	0
Friday	7	2	2	1
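
To make the two components concrete, the following Python sketch arranges the table in time order and summarizes it by week and by weekday. It is a hedged illustration: the variable names are hypothetical, and weekly totals and weekday means are just one simple way to look for a trend and a repeating pattern.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
# late[d][w] = number of workers late on weekday d in week w + 1
late = [
    [6, 8, 7, 5],
    [3, 0, 2, 0],
    [2, 5, 1, 1],
    [4, 3, 0, 0],
    [7, 2, 2, 1],
]

# Observations in time order: week 1 Monday..Friday, then week 2, and so on.
series = [late[d][w] for w in range(4) for d in range(5)]

# Trend: total number of late workers per week.
weekly_totals = [sum(late[d][w] for d in range(5)) for w in range(4)]

# Weekly pattern: average number late for each weekday.
day_means = {days[d]: sum(late[d]) / 4 for d in range(5)}

print("time series:", series)
print("weekly totals:", weekly_totals)
print("mean by weekday:", day_means)
```

The weekly totals fall from 22 to 7, which suggests a downward trend, while Monday has the highest weekday average (6.5), which suggests a pattern that repeats each week.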

2. Measures of Central Tendency

2.1 Definitions:

Mode: The value that occurs most frequently in a dataset.
Mean: The average value, computed as:
$\bar{x} = \frac{\text{Sum of all observations}}{\text{Number of observations}}$
Median: The middle value when the dataset is ordered.

Example 1.7: Weights of Five 7th Grade Girls (in pounds)

Weights: 122, 94, 135, 111, 108
Calculating Mean:
$\bar{x} = \frac{122 + 94 + 135 + 111 + 108}{5} = \frac{570}{5} = 114$

2.2 Population Mean:

The population mean (denoted by $\mu$) is calculated as:
$\mu = \frac{1}{N} \times \text{Sum of all observations in population}$

2.3 Median Calculation

Algorithm for Obtaining Median:

Order the data from smallest to largest.
Identify whether $n$ is odd or even:
- For odd $n$: Take the middle value, $x_{(n+1)/2}$.
- For even $n$: Take the average of the two middle values, $x_{n/2}$ and $x_{n/2+1}$.
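
The algorithm can be sketched as a short Python function. This is a minimal illustration (the function name is an assumption, and Python's built-in `statistics.median` does the same job); it is applied to the data from Examples 1.7 and 1.8.

```python
def median(data):
    """Median: middle ordered value for odd n, mean of the two middle values for even n."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # 1-based positions n/2 and n/2 + 1

weights = [122, 94, 135, 111, 108]          # Example 1.7
survival_days = [15, 3, 46, 623, 126, 64]   # Example 1.8
print(median(weights), median(survival_days))
```

The zero-based list indices are shifted by one relative to the 1-based positions used in the text, which is why the odd case reads `ordered[mid]`.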

Example of Calculating Median:

From Example 1.7, the ordered weights are 94, 108, 111, 122, 135.

Because there are 5 weights (odd), $\text{median} = x_{(n+1)/2} = x_{(5+1)/2} = x_3 = 111$.

Example 1.8: Survival Days of Heart Transplant Patients

Days: 15, 3, 46, 623, 126, 64
Ordered: 3, 15, 46, 64, 126, 623
Median Calculation:
- Because $n = 6$ is even, $\text{Median} = \frac{x_{n/2} + x_{n/2+1}}{2} = \frac{x_3 + x_4}{2} = \frac{46 + 64}{2} = 55$ days.
Average: $\text{Mean} = \frac{15 + 3 + 46 + 623 + 126 + 64}{6} = \frac{877}{6} \approx 146.2$ days.

Conclusion on Mean vs. Median

In skewed distributions or the presence of outliers, the median often serves as a better indicator of the center.
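
One way to see this with the survival data of Example 1.8 is to recompute both measures after dropping the single extreme value. The Python sketch below is only an illustration (dropping an observation is done here to show sensitivity, not as a recommended analysis step), and the variable names are hypothetical.

```python
from statistics import mean, median

days = [15, 3, 46, 623, 126, 64]                        # Example 1.8
without_extreme = [d for d in days if d != max(days)]   # drop the 623-day observation

print("all data:        mean =", round(mean(days), 1), " median =", median(days))
print("without extreme: mean =", round(mean(without_extreme), 1),
      " median =", median(without_extreme))
```

The mean drops from about 146.2 to 50.8 days when one value is removed, while the median only moves from 55 to 46, which is why the median is the more stable summary here.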

2.4 Skewness

Definition:

Skewness describes the asymmetry of a frequency distribution. It shows up in how the mean and median fall in relation to each other.

Key Points:

If the mean and median are equal, the data are roughly symmetric.
If mean > median, the data are positively skewed (right skewed).
If mean < median, the data are negatively skewed (left skewed).

Comparison of Mean, Median, and Mode:

Example 1.9: Annual Income of Professional Baseball Players

Income (in thousands of dollars), with the number of players in parentheses: 30 (180), 50 (100), 100 (50), 500 (20), 1,000 (10), 2,000 (15), 3,000 (15), 5,000 (10)

Frequency Table Calculation:

Total Income calculation:
$\text{Total Income} = (30 \times 180) + (50 \times 100) + (100 \times 50) + \ldots + (5{,}000 \times 10) = 160{,}400$
The mean income:
$\text{Mean} = \frac{\text{Total Income}}{400} = \frac{160{,}400}{400} = 401$ (in thousands of dollars), that is, 401,000 dollars.
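
To complete the comparison of mean, median, and mode for this example, the Python sketch below computes all three directly from the frequency table. It is an illustration under stated assumptions: the names are hypothetical, and the median step relies on $n = 400$ being even.

```python
# (income in $1,000s, number of players) from Example 1.9
income_freq = [(30, 180), (50, 100), (100, 50), (500, 20),
               (1000, 10), (2000, 15), (3000, 15), (5000, 10)]

n = sum(f for _, f in income_freq)              # total number of players
total = sum(x * f for x, f in income_freq)      # total income
mean_income = total / n

# Mode: the income value with the largest frequency.
mode_income = max(income_freq, key=lambda pair: pair[1])[0]

def ordered_value(k):
    """Return the k-th smallest income (1-based) using cumulative frequencies."""
    cumulative = 0
    for x, f in sorted(income_freq):
        cumulative += f
        if k <= cumulative:
            return x
    raise IndexError("k exceeds the number of observations")

# n is even here, so the median averages the two middle ordered values.
median_income = (ordered_value(n // 2) + ordered_value(n // 2 + 1)) / 2

print(mode_income, median_income, mean_income)
```

The result is mode 30, median 50, and mean 401 (all in thousands of dollars). The mean sits far above the median because a few very large incomes pull it upward, the pattern described for right-skewed data.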

3. Measures of Variation

Definition:

Measures of variation inform us how spread out or scattered the data values are, addressing whether values are close together or vary widely.

Important Measure: Sample Variance & Standard Deviation

Sample Variance Formula:
$s^2 = \frac{\text{Sum of squares of deviations from mean}}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation Formula:
$s = \sqrt{s^2}$
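
Example 1.10 evaluates the variance with a computational shortcut rather than by summing squared deviations directly. The short derivation below shows why the two forms agree; it is a standard algebraic identity, written here in the same notation as the formulas above.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},
\qquad\text{so}\qquad
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1}.
```

The middle step uses $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, so both $2\bar{x}\sum x_i$ and $n\bar{x}^2$ reduce to multiples of $\left(\sum x_i\right)^2 / n$.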

Example 1.10: Calculating Variance and Standard Deviation

Data Observations: 6, 12, 6, 6, 4, 8

Calculate the mean:
$\bar{x} = \frac{42}{6} = 7$
Variance:
$s^2 = \frac{332 - \frac{42^2}{6}}{6 - 1} = \frac{38}{5} = 7.6$
Standard Deviation:
$s = \sqrt{7.6} \approx 2.76$
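
As a check on the arithmetic, the Python sketch below computes the variance two ways: from the squared deviations and from the shortcut using the sum of squares (332) and the sum (42). The variable names are illustrative only.

```python
from math import sqrt

data = [6, 12, 6, 6, 4, 8]   # Example 1.10
n = len(data)
x_bar = sum(data) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Shortcut: sum of squares minus (sum)^2 / n, divided by n - 1.
s2_shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

s = sqrt(s2_definition)
print(x_bar, s2_definition, s2_shortcut, round(s, 2))
```

Both routes give 7.6 for the variance (up to floating-point rounding) and about 2.76 for the standard deviation.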

Additional Measures: Quartiles & Percentiles

Percentiles:

The $100p$-th percentile is a value such that at least $100p\%$ of the observations lie at or below it and at least $100(1-p)\%$ lie at or above it.

Quartiles:

First Quartile ($Q_1$): 25th percentile
Second Quartile ($Q_2$): 50th percentile
Third Quartile ($Q_3$): 75th percentile

Calculation Procedure:

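Order the observations from smallest to largest.
For the $100p$-th percentile: Compute $np$. If $np$ is not an integer, round it up to the next integer and take the corresponding ordered value. If $np$ is an integer $k$, average the $k$-th and $(k+1)$-th ordered values.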
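
A hedged Python sketch of this procedure follows. It assumes one common textbook convention: compute $np$, round up when it is not an integer, and average the $k$-th and $(k+1)$-th ordered values when $np$ equals an integer $k$. Other software, including R's default settings, may use different interpolation rules and return slightly different values. The function name `percentile` is hypothetical, and the midterm scores of Example 1.1 are reused as test data.

```python
from math import ceil

def percentile(data, p):
    """100p-th percentile by the np rule (an assumed convention, see lead-in)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(data)
    n = len(ordered)
    np_ = n * p
    k = round(np_)
    if 1 <= k < n and abs(np_ - k) < 1e-9:
        # np is an integer k: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round np up and take that ordered value (1-based position).
    return ordered[ceil(np_) - 1]

scores = [69, 84, 52, 93, 81, 74, 89, 85, 88, 63,
          87, 64, 67, 72, 74, 55, 82, 91, 68, 77]   # Example 1.1
q1, q2, q3 = (percentile(scores, p) for p in (0.25, 0.50, 0.75))
print(q1, q2, q3)
```

With $n = 20$, all three products $np$ (5, 10, 15) are integers, so each quartile is an average of two neighboring ordered scores.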

Boxplots

Key Concepts:

Sample Range: $x_n - x_1 = \text{max} - \text{min}$
Interquartile range (IQR): $IQR = Q_3 - Q_1$
Outliers: Observations less than $Q_1 - 1.5 \times IQR$ or greater than $Q_3 + 1.5 \times IQR$.

Example 1.11: Exam Scores and Result for Outliers

Given scores: 81, 57, 85, 84, 99 …

Ordered: 57, 69, 76, 76, 81, 83, 84, 85, 90, 99
Calculate:
- Median: $Q_2 = \frac{x_5 + x_6}{2} = \frac{81 + 83}{2} = 82$
- $Q_1$ and $Q_3$ from cumulative frequencies
Outlier Processing:
- Compute $IQR = Q_3 - Q_1$ and flag any score below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$.
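
A worked version of these steps for the ten ordered scores is sketched below. It assumes the quartiles are found by computing $np$, rounding up when $np$ is not an integer and averaging the $k$-th and $(k+1)$-th ordered values when $np = k$ is an integer. The notes do not spell out the rule used, so a different quartile convention could give slightly different numbers.

```latex
\begin{aligned}
Q_1 &: \; np = 10(0.25) = 2.5 \;\Rightarrow\; Q_1 = x_3 = 76 \\
Q_2 &: \; np = 10(0.50) = 5 \;\Rightarrow\; Q_2 = \frac{x_5 + x_6}{2} = \frac{81 + 83}{2} = 82 \\
Q_3 &: \; np = 10(0.75) = 7.5 \;\Rightarrow\; Q_3 = x_8 = 85 \\
IQR &= Q_3 - Q_1 = 85 - 76 = 9 \\
Q_1 - 1.5 \times IQR &= 76 - 13.5 = 62.5 \\
Q_3 + 1.5 \times IQR &= 85 + 13.5 = 98.5
\end{aligned}
```

Under this assumption, the scores 57 (below 62.5) and 99 (above 98.5) would be flagged as outliers.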

4. Relationship Between Two Variables

Covariance

To measure the relationship between two variables observed in pairs $(x_i, y_i)$, use the sample covariance:
$Cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Positive covariance: indicates a direct relationship.
Negative covariance: indicates an inverse relationship.

Example 1.13: Initial Speed and Stopping Distance

Data: Initial speed: 11, 22, 32, 41, 51; Stopping distance: 8.2, 32.8, 82.0, 144.4, 236.2.

Calculation of Covariance:

The sample means are $\bar{x} = 31.4$ and $\bar{y} = 100.72$.
Substitute the five $(x_i, y_i)$ pairs and these means into the covariance formula.
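
Carried out by hand, the substitution might look like the following, where each term is the product of a speed deviation $x_i - 31.4$ and a distance deviation $y_i - 100.72$. This is a worked sketch of the arithmetic, not a result quoted from the source.

```latex
\begin{aligned}
\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})
  &= (-20.4)(-92.52) + (-9.4)(-67.92) + (0.6)(-18.72) + (9.6)(43.68) + (19.6)(135.48) \\
  &= 1887.408 + 638.448 - 11.232 + 419.328 + 2655.408 = 5589.36 \\
Cov(x, y) &= \frac{5589.36}{5 - 1} = 1397.34
\end{aligned}
```

The positive value indicates a direct relationship: higher initial speeds go with longer stopping distances.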

Correlation Coefficient

The correlation coefficient normalizes the covariance by the sample standard deviations $s_x$ and $s_y$:
$r = \frac{Cov(x,y)}{s_x s_y}$
The value of $r$ ranges from -1 to +1, giving insight into the strength and direction of the linear relationship.
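
As an illustration, the Python sketch below computes $r$ for the speed and stopping-distance data of Example 1.13. The function names are hypothetical, and the calculation is written out from the formula above rather than taken from a statistics library.

```python
from math import sqrt

speed = [11, 22, 32, 41, 51]                   # Example 1.13
distance = [8.2, 32.8, 82.0, 144.4, 236.2]

def sample_cov(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y), using s_x^2 = Cov(x, x) and s_y^2 = Cov(y, y)."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

r = correlation(speed, distance)
print(round(r, 3))
```

The result is roughly 0.97, a strong positive linear association. Since $r$ only measures linear association, a high value does not by itself show that the relationship is a straight line.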

Example 1.15

Calculate the correlation coefficient from paired midterm and final scores, one pair per student. R commands, or other statistical software, can be used to carry out the calculation.

Conclusion

Knowing how to present data, analyze distributions, and compute summary statistics is essential in probability and statistics and carries over directly to applications in science and engineering.