Statistical Reasoning - GNED1101: Fall 2025

Statistics

Statistics Defined:
- The data (numbers or other pieces of information) that describe or summarize something.
- The science of collecting, organizing, and interpreting data.
- Types of Statistics:
  - Descriptive Statistics: Summarizes data from a sample (e.g., mean, median).
  - Inferential Statistics: Makes inferences and predictions about a population based on a sample of data.
- Reference: Book Section 5.1, Etext Section 12.1.

Components of a Statistical Study

Population: The complete set of people or things being studied.
Sample: The subset of the population from which the raw data are collected.
Population Parameters: Specific numbers that describe the characteristics of the population (e.g., average age, total number).
Sample Statistics: Numbers describing characteristics of the sample, derived from consolidating or summarizing the raw data.

Rating Statistics Example

Example of ratings systems:
- Example: "The Big Bang Theory had 22 million viewers last week."
- Data Source: Ratings are based on a sample of 5000 homes rather than counting all viewers.
- Population of Interest: The entire country (e.g., the U.S.).
- Population Parameters: Number of people watching the show.
- Sample Characteristics: 5000 homes monitored through devices installed with the homeowners' consent; the data from these homes are used to estimate overall viewership.

Basic Steps in a Statistical Study

State the Goal: Define the study's aim and determine the population.
Choose a Sample: Select a representative sample from the population.
Collect Raw Data: Gather data from the sample and summarize it.
Use Sample Statistics: Infer the population parameters from the sample statistics.
Draw Conclusions: Make conclusions based on the analysis.

Identifying Statistical Steps Example

U.S. Labor Department:
- Surveys 60,000 households monthly to assess U.S. workforce characteristics.
- Population Parameter of Interest: The U.S. unemployment rate, defined as the percentage of unemployed individuals among those employed or actively seeking employment.

Choosing a Sample

Representative Sample: Should reflect the characteristics of the whole population.
Common Sampling Methods:
- Simple Random Sampling: Every possible sample of a given size has an equal chance of being selected; the selection is often performed by computer.
- Convenience Sampling: Uses readily available results or samples.
- Systematic Sampling: Uses a system, such as selecting every nth individual.
- Stratified Sampling: Divides the population into strata (subgroups) and samples from each.
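
The sampling methods above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the population of ID numbers, the two strata, the sample sizes, and the function names are all hypothetical.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Pick k items so that every possible sample of size k is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, step, start=0):
    """Pick every step-th item, beginning at index start."""
    return population[start::step]

def stratified_sample(strata, k_per_stratum, seed=None):
    """Draw a simple random sample of size k_per_stratum from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))

srs = simple_random_sample(population, 10, seed=1)
every_tenth = systematic_sample(population, step=10)

# Hypothetical strata: the IDs split into two subgroups.
strata = {"first_half": population[:50], "second_half": population[50:]}
by_stratum = stratified_sample(strata, 5, seed=1)
```

Convenience sampling is left out because it has no selection rule to code: it simply takes whatever data are easiest to reach.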

Statistical Tables and Graphs

Essential for organizing statistical data into visual representations.
Types of Tables and Graphs:
- Frequency Tables: A basic frequency table includes:
  - First column: Categories of data.
  - Second column: Frequency of each category.

Types of Frequency

Relative Frequency: The fraction (or percentage) of data values falling into a category.
Cumulative Frequency: The total number of data values in a category plus all preceding categories.

Example of Frequency Table

Grade	Frequency	Relative Frequency	Cumulative Frequency
A	4	4/25 = 16%	4
B	7	7/25 = 28%	11
C	9	9/25 = 36%	20
D	3	3/25 = 12%	23
F	2	2/25 = 8%	25
Total	25	1 = 100%	25
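
As a check on the table, the relative and cumulative frequencies can be recomputed from the Frequency column alone. The sketch below uses the grade counts from the table; the variable names are only illustrative.

```python
from itertools import accumulate

grades = ["A", "B", "C", "D", "F"]
frequencies = [4, 7, 9, 3, 2]

total = sum(frequencies)

# Relative frequency: each count divided by the total number of data values.
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the counts, category by category.
cumulative = list(accumulate(frequencies))

for g, f, r, c in zip(grades, frequencies, relative, cumulative):
    print(f"{g}\t{f}\t{r:.0%}\t{c}")
```

The last cumulative frequency always equals the total, which is a quick way to catch an arithmetic slip.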

Bar Graphs

Bar Graph: Uses bars to represent the frequency (or relative frequency) of each category.
Represents data visually to facilitate understanding.

Grouped Frequency Distribution

When there are numerous data items, a grouped frequency distribution can help manage the data.
Example:
- Test scores for 40 students: 82, 47, 75, …
- By arranging into classes (e.g., 40-49, 50-59, etc.), data organization improves.

Example of Grouped Frequency Distribution

Class	Frequency
40-49	3
50-59	6
60-69	6
70-79	11
80-89	9
90-99	5
Total	40
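
Grouping scores into classes of width 10 can be sketched as follows. The full list of 40 scores is not given in these notes, so the scores below are hypothetical and the function name is only illustrative.

```python
from collections import Counter

def group_scores(scores, class_width=10):
    """Count how many scores fall in each class, such as 40-49 or 50-59."""
    # Integer division finds the lower limit of the class a score belongs to.
    counts = Counter((s // class_width) * class_width for s in scores)
    return {f"{low}-{low + class_width - 1}": counts[low]
            for low in sorted(counts)}

# Hypothetical scores for illustration only (not the 40 scores from the example).
sample_scores = [41, 58, 63, 67, 72, 75, 79, 88, 94]
grouped = group_scores(sample_scores)
```

Note that this sketch lists only classes that contain at least one score; an empty class would need to be added with a frequency of 0.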

Histograms and Frequency Polygons

Histogram: A bar graph in which adjacent bars touch, with no gaps between them.
Frequency Polygon: A line graph formed by connecting the midpoints of the tops of the histogram bars.

Important Labels for Graphs

Title/Caption: Explains the graph and lists the data source, if applicable.
Vertical Scale and Title: Clearly indicated with labels.
Horizontal Scale and Title: Should describe variables qualitatively or quantitatively.
Legend: Required if multiple datasets are presented with different colors or shapes in a single graph.

Pie Charts

Used primarily for displaying relative frequencies.
Wedge Angle Calculation: Wedge angle = Relative frequency × 360 degrees.
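
As an illustration of the wedge angle rule, the sketch below applies it to the grade frequencies from the frequency table earlier in these notes. The function name is only illustrative.

```python
def wedge_angles(frequencies):
    """Return the pie chart wedge angle, in degrees, for each category."""
    total = sum(frequencies.values())
    # Wedge angle = relative frequency x 360 degrees.
    return {category: count / total * 360
            for category, count in frequencies.items()}

grade_frequencies = {"A": 4, "B": 7, "C": 9, "D": 3, "F": 2}
angles = wedge_angles(grade_frequencies)
```

For example, grade A has a relative frequency of 16%, so its wedge spans 0.16 × 360 = 57.6 degrees. The angles for all categories add up to 360 degrees.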

Line Charts

Line Chart: Represents data points associated with categories connected by lines.
Can also represent time-series data.

Misleading Visual Displays

Graphs can distort the underlying data.
Distortion Techniques:
- Stretching the scale on the vertical axis to make an increase look larger or faster than it is.
- Compressing the scale on the vertical axis to make an increase look smaller or slower than it is.

Things to Watch for in Visual Displays

Ensure there's a descriptive title clearly identifying the displayed data.
Check that the numbers on the vertical axis line up with the tick marks and clearly indicate the scale.
Beware of unnecessary design effects that may misrepresent the underlying data.
Confirm that time intervals along the x-axis are uniform to avoid misleading trends.
Ensure that bar sizes in bar graphs correspond accurately to the data represented.
Verify the data source and the methods used in data collection.

Measures of Central Tendency

Represent average or typical values within data distributions:
- Mean: The sum of the data items divided by the number of items, represented as:
  $\text{Mean} = \frac{\Sigma x}{n}$
  where Σx is the sum of all data items and n is the number of items.
- Median: The middle data item when arranged in order.
- Mode: The most frequently occurring data value.
- Midrange: The average of the lowest and highest data values, calculated as $\text{Midrange} = \frac{\text{Lowest} + \text{Highest}}{2}$.
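
The four measures can be computed with Python's standard `statistics` module. As a sketch, the data set below is the one used in the standard deviation example later in these notes (69, 68, 65, 64, 64); the midrange has no built-in function, so it is computed directly from its definition.

```python
import statistics

data = [69, 68, 65, 64, 64]

mean = statistics.mean(data)            # sum of items / number of items
median = statistics.median(data)        # middle item once the data are sorted
mode = statistics.mode(data)            # most frequently occurring value
midrange = (min(data) + max(data)) / 2  # average of lowest and highest values

print(mean, median, mode, midrange)     # 66 65 64 66.5
```

The four measures need not agree: here the mean is 66, the median is 65, the mode is 64, and the midrange is 66.5.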

Example Calculation of Mean

To find the mean of a data set, add the data items and divide the sum by the number of items.
- Example: The ten highest-earning actors earned a combined 147 million dollars, so Mean = $\frac{147}{10} = 14.7$ million dollars.

Variance and Standard Deviation

The standard deviation quantifies the amount of variation or dispersion of a set of data values.
Steps for Computing the Standard Deviation:
1. Calculate mean.
2. Find deviation of each data item from the mean.
3. Square each deviation.
4. Sum the squared deviations.
5. Divide the sum by n - 1 (for sample) to get variance.
6. The standard deviation is the square root of the variance.

Example Calculation for Standard Deviation

Example data set: 69, 68, 65, 64, 64.
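Following the six steps for the standard deviation: the mean is 66, the squared deviations sum to 22, the variance is 22/4 = 5.5, and the standard deviation is $\sqrt{5.5} \approx 2.35$.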

Normal Distribution

Normal Distribution: Characterized by a bell curve, symmetric about the mean. Mean, median, and mode coincide at the center.
The shape depends on the mean and standard deviation. As the standard deviation increases, the distribution spreads out.

68-95-99.7 Rule

Approximately 68% of data falls within 1 standard deviation from the mean.
About 95% falls within 2 standard deviations.
Approximately 99.7% falls within 3 standard deviations.
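
These percentages can be checked numerically with `statistics.NormalDist` from Python's standard library. The sketch uses the standard normal distribution (mean 0, standard deviation 1); the same shares hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The computed shares round to 68.3%, 95.4%, and 99.7%, which is why the rule is stated with "approximately".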

Z-Scores

A z-score represents the number of standard deviations an individual data item is from the mean.
Z-Score Formula:
$z = \frac{\text{data item} - \text{mean}}{\text{standard deviation}}$
Positive z-scores indicate values above the mean while negative z-scores indicate values below the mean.

Example of Z-Score Calculation

For a weight of 9 pounds, where the mean is 7 pounds and the standard deviation is 0.8 pounds:
$z = \frac{9 - 7}{0.8} = 2.5$
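
The same calculation as a small function. The 9-pound weight, the mean of 7, and the standard deviation of 0.8 come from the example; the 5-pound weight is a hypothetical value added to show a negative z-score.

```python
def z_score(data_item, mean, std_dev):
    """Number of standard deviations a data item lies from the mean."""
    return (data_item - mean) / std_dev

z_above = z_score(9, 7, 0.8)  # from the example: 2.5 standard deviations above the mean
z_below = z_score(5, 7, 0.8)  # hypothetical 5-pound weight: below the mean, so negative
```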

Percentiles and Quartiles

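Percentiles: Indicate the relative standing of a value in a dataset. If n% of the data items are less than a value, that value is the nth percentile.
Quartiles: Divide a dataset into four equal parts; the 25th percentile is the first quartile, the 50th percentile is the second quartile (the median), and the 75th percentile is the third quartile.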

Interpreting Z-Scores Using Percentiles

In a normal distribution, a z-score can be converted to a percentile: the percentage of data items that fall below the data item with that z-score.
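
A textbook does this conversion with a z-score table. As a sketch, the same lookup can be done with the cumulative distribution function of the standard normal distribution in Python's `statistics` module; the function name is only illustrative.

```python
from statistics import NormalDist

def percent_below(z):
    """Percentage of a normal distribution that falls below a given z-score."""
    return NormalDist().cdf(z) * 100

at_mean = percent_below(0)      # a z-score of 0 is the mean, so half the data lie below
high_item = percent_below(2.5)  # the z-score from the 9-pound weight example

print(f"{at_mean:.1f}% below z = 0, {high_item:.2f}% below z = 2.5")
```

A z-score of 2.5 corresponds to roughly the 99th percentile, so a 9-pound weight in that example is heavier than about 99% of the data items.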

Polls and Margins of Error

The margin of error gives the range around a sample statistic within which the true population parameter is likely to lie.
Example: For a margin of error of ±2.9%, if 17% of those surveyed expressed a view, then the actual population percentage is likely between 14.1% and 19.9%.
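
The interval in the example is simply the sample percentage minus and plus the margin of error. A minimal sketch, with an illustrative function name:

```python
def likely_range(sample_percent, margin_of_error):
    """Return (low, high) bounds: sample statistic minus and plus the margin of error."""
    low = sample_percent - margin_of_error
    high = sample_percent + margin_of_error
    return low, high

low, high = likely_range(17, 2.9)
print(f"between {low:.1f}% and {high:.1f}%")  # between 14.1% and 19.9%
```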

Correlation and Regression

Correlation: Measures relationship strength between two variables, identified using scatter plots and correlation coefficients.
- Positive correlation indicates that as one variable increases, so does the other.
- Negative correlation indicates that as one variable increases, the other decreases.
- Correlation coefficient (r) indicates strength and direction of the relationship, ranging from -1 (perfect negative) to 1 (perfect positive).

Computing the Correlation Coefficient

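The correlation coefficient is computed from the number of data pairs n, the sums of the x and y values, the sums of their squares, and the sum of their products:
$r = \frac{n\Sigma xy - \Sigma x \Sigma y}{\sqrt{n\Sigma x^2 - (\Sigma x)^2}\,\sqrt{n\Sigma y^2 - (\Sigma y)^2}}$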
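
These notes do not write the computation out step by step, so the sketch below assumes the standard computational form, r = (nΣxy - ΣxΣy) / (√(nΣx² - (Σx)²) · √(nΣy² - (Σy)²)). The data (hours studied and test scores) are hypothetical.

```python
import math

def correlation_coefficient(xs, ys):
    """Correlation coefficient r from sums of x, y, their squares, and their products."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = correlation_coefficient(hours, scores)
```

Here r comes out close to 1, indicating a strong positive correlation. Points that lie exactly on a rising line give r = 1, and points exactly on a falling line give r = -1.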

Regression Line

A regression line is calculated to predict values of one variable based on another.
The equation for the regression line is denoted as $y = mx + b$ , where m is the slope and b is the y-intercept.
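
The notes give the form of the line but not how m and b are found. The sketch below assumes the usual least-squares formulas, m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - mΣx) / n, and uses hypothetical data (hours studied and test scores).

```python
def regression_line(xs, ys):
    """Slope m and y-intercept b of the least-squares line y = mx + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

m, b = regression_line(hours, scores)
predicted = m * 6 + b  # predicted score for a hypothetical 6 hours of study
```

For this data the line is y = 6x + 46, so the predicted score for 6 hours is 82. Predictions are most trustworthy for x values near the range of the observed data.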
