Section 1.1: Data, Variables, and Frequency Tables

Data & Statistics: Key Concepts, Types, and Methods

Data & Units

Data = a set of numbers or values (may include words) describing something. Context (e.g., units) makes a dataset informative.
Units are essential; without them values are ambiguous.
Definition: A unit specifies the measurement scale attached to a numerical value (e.g., pounds, kilograms, milliliters).
Example: A patient weight recorded as 120 is ambiguous (pounds or kilograms?); it should be reported as 120 lb (U.S. customary units).

Population vs. Sample

Population: the complete set of entities of interest.
- Example: All UI Health patients with diabetes.
Sample: a smaller, manageable subset drawn from the population for analysis.
- Example: Patients on the Third Floor of UI Health surveyed about diet, weight, etc.

Variables

A variable is a characteristic or attribute of an individual that can vary among individuals.
Related concepts:
- Population (concept): The entire group you are interested in studying.
- Sample (concept): A subset of the population used for data collection.
- Variable type (concept): The kind of data a variable records (quantitative or categorical).

Organizing Data (4-step process)

State the problem with detailed context.
Plan how to collect and analyze the data (often in a team).
Solve the problem using appropriate statistical methods.
Make a conclusion that interprets the results for the intended audience (e.g., patients, healthcare team).

Research Applications

Clinical contexts: diabetes care, prenatal care, patient weight monitoring.
Public health example: Early COVID-19 studies used the first 100–200 patients to estimate average age, ethnicity distribution, and risk factors (e.g., higher mortality for patients ≥60 years).

Variable Types

Quantitative (numerical) vs Qualitative (categorical)

Quantitative variables: numeric values where arithmetic operations are meaningful.
Qualitative (Categorical) variables: place individuals into groups or categories.

Qualitative (Categorical) Examples

Hair color: {Brown, Black, Blonde, etc.}
Gender: {Male, Female}
Ethnicity: {Hispanic, Non-Hispanic, etc.}
Likes tacos?: {Yes, No}

Quantitative Variables (Defined by Type)

Weight, age, tumor thickness, etc.
Examples in transcript: 120 lb, 45 years, 2.3 cm

Quantitative Variables: Definitions and Subtypes

Quantitative Definition: A numeric variable where arithmetic operations are meaningful.
Discrete vs Continuous:
- Discrete: Finite or countably many possible values, typically whole-number counts.
- Continuous: Takes any value within an interval; may include decimals.

Discrete (finite or countable)

Examples: Number of siblings, test scores reported as whole numbers.

Continuous (any value in an interval)

Examples: Exact age (e.g., 25.4 years), weight in grams, time to blink.

Continuous Quantitative Variables and Analysis

Definition: Continuous quantitative variable – a numeric variable that can assume any value within a continuous range.
Arithmetic with quantitative variables: compute mean, median, five-number summary, box plot, etc.
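
As a rough illustration of these summaries, here is a minimal Python sketch. The ages are made-up values, not data from the notes. Quartile conventions vary between textbooks and software, so the `method="inclusive"` choice is an assumption and a hand calculation may give slightly different quartiles.

```python
import statistics

# Hypothetical ages in years (illustrative values, not from the notes).
ages = [21, 25, 30, 34, 45, 52, 60]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Quartiles depend on the convention used; "inclusive" is one common choice.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, q2, q3, max(ages))

print(f"mean = {mean_age:.1f} years, median = {median_age} years")
print("five-number summary (years):", five_number_summary)
```

A box plot draws exactly these five numbers, so computing them first is a useful check on any plot.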
Units must be recorded (e.g., $\text{mg}$, $\mu\text{g}$, $\text{ounces}$).

Qualitative (Categorical) Variables

Assign individuals to groups without implying order (unless specified).
Useful for blocking in medical studies (e.g., smoker vs. non-smoker) to control for confounding factors.

Common Categorical Examples

Hair color: {brown, black, blond, etc.}
Political affiliation: {Democrat, Republican, Independent}
Favorite food: {tacos, pizza, sushi}

Subtypes of Categorical Variables

Definition: Binary variable – a categorical variable with exactly two categories.
Definition: Nominal variable – categories with no inherent order.
Definition: Ordinal variable – categories with a meaningful order.

Subtype Options and Typical Use

Binary: Two possible outcomes (e.g., True/False, Pass/Fail). Examples: Yes/No survey items, test results.
Nominal: Multiple categories without inherent order. Examples: Species names, eye color.
Ordinal: Categories with a natural ranking. Examples: Year in school (Freshman → Senior), pain scale, military rank.
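
One practical consequence of the nominal/ordinal distinction is that ordinal categories must be sorted by rank, not alphabetically. The sketch below illustrates this with made-up responses. The names `YEAR_ORDER` and `rank` are hypothetical helpers introduced only for this example.

```python
# Ordinal categories need an explicit rank; alphabetical order would be wrong.
YEAR_ORDER = ["Freshman", "Sophomore", "Junior", "Senior"]
rank = {label: position for position, label in enumerate(YEAR_ORDER)}

# Hypothetical survey responses (illustrative only).
responses = ["Senior", "Freshman", "Junior", "Freshman", "Sophomore"]

sorted_by_rank = sorted(responses, key=lambda label: rank[label])
sorted_alphabetically = sorted(responses)

print("by rank:       ", sorted_by_rank)
print("alphabetically:", sorted_alphabetically)
```

The alphabetical sort places Senior before Sophomore, which is why an ordinal variable needs its order stated explicitly.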

Data Table Overview

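ID: categorical identifier; replaces names to keep individuals anonymous and reduce bias. Even when the ID is a number, arithmetic on it is not meaningful.
Year in school: categorical (word-based) variable; ordinal, since the categories have a natural order.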
Age: quantitative variable; suitable for averages, medians, box plots, etc.
Likes tacos?: categorical variable (word-based).

Anonymity

Definition: Anonymity – removing personally identifying information to protect individuals' privacy and reduce bias in data analysis.

Frequency Tables & Relative Frequency

Frequency table lists each possible value of a variable and its count (how many times it occurs).
Relative frequency = (count) / (total sample size) → expressed as a fraction, decimal, or percentage.

Calculations and Tips

Determine total sample size: $n = \sum \text{counts}$
Relative frequencies: $f = \frac{\text{count}}{n}$
Express as percentages: $\text{percentage} = 100 \times f$
When an exam asks for only one of “count” or “relative frequency,” provide the requested column and omit the other.
In the example, total = 40 and total relative frequency = 1.00.
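
The calculations above can be sketched in Python as follows. The responses are made-up values, not the notes' example with total = 40, and the variable names are chosen only for illustration.

```python
from collections import Counter

# Hypothetical "Likes tacos?" responses (illustrative, not the notes' data).
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"]

counts = Counter(responses)   # frequency of each value
n = sum(counts.values())      # total sample size

relative = {value: count / n for value, count in counts.items()}
percent = {value: 100 * f for value, f in relative.items()}

print(f"{'Value':<6}{'Count':>6}{'Rel. freq.':>12}{'Percent':>9}")
for value, count in counts.most_common():
    print(f"{value:<6}{count:>6}{relative[value]:>12.3f}{percent[value]:>8.1f}%")
print(f"{'Total':<6}{n:>6}{sum(relative.values()):>12.3f}")
```

The last row is a quick self-check: the counts should add up to n and the relative frequencies to 1.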

Converting Decimals ↔ Percentages

Rule: To turn a decimal into a percent, multiply by 100 and attach the % symbol.
Key point: Keep every digit when converting; exactness is crucial for professional work.
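
A small sketch of the conversion rule, assuming the decimal arrives as text. It uses Python's `decimal` module so that no digits are lost to floating-point noise. The function name `decimal_to_percent` is hypothetical, introduced only for this example.

```python
from decimal import Decimal

def decimal_to_percent(text: str) -> str:
    """Multiply by 100 exactly and attach the % symbol, keeping every digit."""
    value = Decimal(text) * 100
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return f"{value.normalize():f}%"

print(decimal_to_percent("0.375"))
print(decimal_to_percent("0.0725"))
```

Plain floats would also work for most classroom problems, but values such as 0.07 can pick up stray digits when multiplied by 100, which is why this sketch uses `Decimal`.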

Rounding Places

Tenths place: the first digit right of the decimal point (e.g., the 4 in 3.456).
Hundredths place: the second digit right of the decimal point (e.g., the 5 in 3.456).
Examples:
- Rounding to the Tenth: 3.456 → 3.5.
- Rounding to the Hundredth: 3.456 → 3.46.
Checking sums after rounding: Rounded relative frequencies should ideally total 1 (rounded percentages should total 100%). Small deviations signal round-off error due to the chosen precision.
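
The following sketch reproduces the two rounding examples and then shows a sum check that comes up short. The three-category table is made up for illustration. Python's built-in `round()` resolves exact ties with a round-half-to-even rule, which may differ from the rule taught in class; none of the values here is a tie.

```python
# The notes' rounding examples (neither value is an exact tie).
tenth = round(3.456, 1)
hundredth = round(3.456, 2)
print(tenth, hundredth)

# Hypothetical frequency table: three categories with one observation each.
counts = {"A": 1, "B": 1, "C": 1}
n = sum(counts.values())

# Each relative frequency is 1/3 = 0.333..., rounded to the hundredths place.
rounded = {value: round(count / n, 2) for value, count in counts.items()}

total = round(sum(rounded.values()), 2)   # second round() hides float noise
gap = round(1 - total, 2)                 # shortfall caused by round-off

print(rounded)
print("total after rounding:", total, "| gap from 1:", gap)
```

Here each rounded relative frequency is 0.33, so the column totals 0.99. The missing 0.01 is round-off error, not a counting mistake.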

Round-off vs. Round-up Errors

Round-off error: slight discrepancy when numbers are rounded to a fixed decimal place.
Round-up error: when the rounding rule pushes a value upward, potentially increasing the total slightly.
Both are normal and acceptable within reasonable tolerances.
Awareness of these errors helps avoid misinterpreting results as mistakes.

Formulas and Key Equations

Relative frequency: $f = \frac{\text{count}}{n}$
Percentage from relative frequency: $\text{percentage} = 100 \times f = \frac{100 \times \text{count}}{n}$