Section 1.1: Data, Variables, and Frequency Tables

Data & Statistics: Key Concepts, Types, and Methods

Data & Units

  • Data = a set of numbers or values (may include words) describing something. Context (e.g., units) makes a dataset informative.

  • Units are essential; without them values are ambiguous.

  • Definition: A unit specifies the measurement scale attached to a numerical value (e.g., pounds, kilograms, milliliters).

  • Example: Patient weight recorded as 120 → should be reported as 120 lb (U.S. customary units).

Population vs. Sample

  • Population: the complete set of entities of interest.

    • Example: All UI Health patients with diabetes.

  • Sample: a smaller, manageable subset drawn from the population for analysis.

    • Example: Patients on the Third Floor of UI Health surveyed about diet, weight, etc.

Variables

  • A variable is a characteristic or attribute of an individual that can vary among individuals.

  • Related concepts discussed alongside variables:

    • Population (concept): The entire group you are interested in studying.

    • Sample (concept): A subset of the population for data collection.

    • Variable Type (concept): Different kinds of data a variable can take.

Organizing Data (4-step process)

  • State the problem with detailed context.

  • Plan how to collect and analyze the data (often in a team).

  • Solve the problem using appropriate statistical methods.

  • Make a conclusion that interprets the results for the intended audience (e.g., patients, healthcare team).

Research Applications

  • Clinical contexts: diabetes care, prenatal care, patient weight monitoring.

  • Public health example: Early COVID-19 studies used the first 100–200 patients to estimate average age, ethnicity distribution, and risk factors (e.g., higher mortality for patients ≥60 years).

Variable Types

Quantitative (numerical) vs Qualitative (categorical)

  • Quantitative variables: numeric values where arithmetic operations are meaningful.

  • Qualitative (Categorical) variables: place individuals into groups or categories.

Qualitative (Categorical) Examples
  • Hair color: {Brown, Black, Blonde, etc.}

  • Gender: {Male, Female}

  • Ethnicity: {Hispanic, Non-Hispanic, etc.}

  • Likes tacos?: {Yes, No}

Quantitative Variables (Defined by Type)
  • Examples: weight, age, tumor thickness, etc.

  • Examples in transcript: 120 lb, 45 years, 2.3 cm

Quantitative Variables: Definitions and Subtypes

  • Quantitative Definition: A numeric variable where arithmetic operations are meaningful.

  • Discrete vs Continuous:

    • Discrete: Takes a finite or countable set of values, typically whole-number counts.

    • Continuous: Takes any value within an interval; may include decimals.

Discrete (finite or countable)
  • Examples: Number of siblings, test scores reported as whole numbers.

Continuous (any value in an interval)
  • Examples: Exact age (e.g., 25.4 years), weight in grams, time to blink.

Continuous Quantitative Variables and Analysis

  • Definition: Continuous quantitative variable – a numeric variable that can assume any value within a continuous range.

  • Arithmetic with quantitative variables: compute mean, median, five-number summary, box plot, etc.

  • Units must be recorded (e.g., mg, μg, ounces).
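The summaries named above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the weights are hypothetical values invented for the example, and the quartiles use the median-of-halves convention, which is only one of several conventions in use (software packages may report slightly different Q1 and Q3).

```python
import statistics

# Hypothetical patient weights in pounds (illustrative values, not from the notes).
weights_lb = [120, 135, 150, 142, 128, 160, 155]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values below the median position
    upper = data[half + (n % 2):]    # values above the median position
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

mean_weight = statistics.mean(weights_lb)
summary = five_number_summary(weights_lb)

# Report the unit with every value so the numbers are not ambiguous.
print(f"mean = {mean_weight:.1f} lb")
print("five-number summary (lb):", summary)
```

The five values in the summary are exactly what a box plot draws: the box spans Q1 to Q3, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.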

Qualitative (Categorical) Variables

  • Assign individuals to groups without implying order (unless specified).

  • Useful for blocking in medical studies (e.g., smoker vs. non-smoker) to control for confounding factors.

Common Categorical Examples

  • Hair color: {brown, black, blond, etc.}

  • Political affiliation: {Democrat, Republican, Independent}

  • Favorite food: {tacos, pizza, sushi}

Subtypes of Categorical Variables

  • Definition: Binary variable – a categorical variable with exactly two categories.

  • Definition: Nominal variable – categories with no inherent order.

  • Definition: Ordinal variable – categories with a meaningful order.

Subtype Options and Typical Use
  • Binary: Two possible outcomes (e.g., True/False, Pass/Fail). Examples: Yes/No survey items, test results.

  • Nominal: Multiple categories without inherent order. Examples: Species names, eye color.

  • Ordinal: Categories with a natural ranking. Examples: Year in school (Freshman → Senior), pain scale, military rank.
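The practical difference between nominal and ordinal variables shows up when sorting. The sketch below assumes a hypothetical list of "Year in school" responses and an explicit ranking, `YEAR_ORDER`, both made up for illustration.

```python
# Hypothetical survey responses for "Year in school" (illustrative only).
responses = ["Junior", "Freshman", "Senior", "Sophomore", "Freshman"]

# An ordinal variable carries an explicit ranking of its categories.
YEAR_ORDER = ["Freshman", "Sophomore", "Junior", "Senior"]
rank = {label: position for position, label in enumerate(YEAR_ORDER)}

# Sorting by rank respects the natural order; alphabetical sorting does not.
sorted_by_rank = sorted(responses, key=rank.__getitem__)
sorted_alphabetically = sorted(responses)

print(sorted_by_rank)
print(sorted_alphabetically)
```

For a nominal variable such as eye color there is no ranking to supply, so any ordering of its categories is arbitrary.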

Data Table Overview

  • ID: categorical identifier; used to keep individuals anonymous and to reduce bias.

  • Year in school: categorical (word-based) variable.

  • Age: quantitative variable; suitable for averages, medians, box plots, etc.

  • Likes tacos?: categorical variable (word-based).

Anonymity
  • Definition: Anonymity – removing personally identifying information to protect privacy and reduce bias in data analysis.

Frequency Tables & Relative Frequency

  • Frequency table lists each possible value of a variable and its count (how many times it occurs).

  • Relative frequency = (count) / (total sample size) → expressed as a fraction, decimal, or percentage.

Calculations and Tips

  • Determine total sample size: n = \sum \text{counts}

  • Relative frequencies: f = \frac{\text{count}}{n}

  • Express as percentages: \text{percentage} = 100 \times f

  • When an exam asks for only one of “count” or “relative frequency,” provide the requested column and omit the other.

  • In the example, total = 40 and total relative frequency = 1.00.
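The calculations above can be sketched in Python as follows. The responses are hypothetical and are not the 40-observation example from the notes; the dictionary layout is just one convenient way to hold the table.

```python
from collections import Counter

# Hypothetical responses to "Likes tacos?" (illustrative, not the notes' dataset).
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"]

counts = Counter(responses)    # frequency of each category
n = sum(counts.values())       # total sample size: n = sum of counts

# Relative frequency f = count / n, and percentage = 100 * f.
table = {
    value: {"count": count, "rel_freq": count / n, "percent": 100 * count / n}
    for value, count in counts.items()
}

for value, row in table.items():
    print(f"{value:<5} {row['count']:>3} {row['rel_freq']:.3f} {row['percent']:.1f}%")

# The relative frequencies should add up to 1 (up to round-off).
total_rel_freq = sum(row["rel_freq"] for row in table.values())
print("n =", n, "| total relative frequency =", round(total_rel_freq, 2))
```

Checking that the counts sum to n and the relative frequencies sum to 1 is a quick way to catch a miscounted category.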

Converting Decimals ↔ Percentages

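  • Rule: To turn a decimal into a percent, multiply by 100 and attach the % symbol. To turn a percent into a decimal, drop the % symbol and divide by 100.

  • Key point: Converting only moves the decimal point two places, so keep every digit; round only when a specific precision is requested.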
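A small sketch of the conversion in both directions, assuming the values arrive as text. The function names are hypothetical, and `Decimal` is used here (rather than floating point) so that no digits are gained or lost in the conversion.

```python
from decimal import Decimal

def decimal_to_percent(text):
    """Convert a decimal string such as "0.3456" to a percent string, keeping every digit."""
    percent = (Decimal(text) * 100).normalize()   # normalize() drops trailing zeros
    return f"{percent:f}%"

def percent_to_decimal(text):
    """Convert a percent string such as "34.56%" back to a decimal string."""
    percent = Decimal(text.rstrip("%"))
    return f"{(percent / 100).normalize():f}"

print(decimal_to_percent("0.3456"))
print(percent_to_decimal("34.56%"))
```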

Rounding Places

  • Tenth place: the first digit right of the decimal (e.g., the 4 in 3.456).

  • Hundredth place: the second digit right of the decimal.

  • Examples:

    • Rounding to the Tenth: 3.456 → 3.5.

    • Rounding to the Hundredth: 3.456 → 3.46.

  • Checking sums after rounding: Rounded relative frequencies should ideally total 1, and rounded percentages should total 100%. Small deviations signal round-off error caused by the chosen precision.
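The rounding examples and the sum check can be reproduced with the sketch below. It assumes the usual classroom rule that a 5 rounds up; Python's built-in `round` uses a different tie rule, so `Decimal` with `ROUND_HALF_UP` is used instead. The three equal counts are a hypothetical case chosen because it makes the round-off visible.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round to a fixed number of decimal places, with halves rounded up."""
    quantum = Decimal(10) ** -places   # places=1 -> 0.1, places=2 -> 0.01
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)

print(round_half_up("3.456", 1))   # tenths place
print(round_half_up("3.456", 2))   # hundredths place

# Hypothetical frequency table with three equally common categories.
counts = [1, 1, 1]
n = sum(counts)
percents = [round_half_up(Decimal(c) / Decimal(n) * 100, 1) for c in counts]
total = sum(percents)

print(percents, "total =", total)
```

Each category rounds to 33.3%, so the rounded percentages total 99.9% rather than 100%. The missing 0.1% is round-off error, not a counting mistake.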

Round-off vs. Round-up Errors

  • Round-off error: slight discrepancy when numbers are rounded to a fixed decimal place.

  • Round-up error: when the rounding rule pushes a value upward, potentially increasing the total slightly.

  • Both are normal and acceptable within reasonable tolerances.

  • Awareness of these errors helps avoid misinterpreting results as mistakes.

Formulas and Key Equations (LaTeX)

  • Relative frequency: f = \frac{\text{count}}{n}

  • Percentage from relative frequency: \text{percentage} = 100 \times f = \frac{100 \times \text{count}}{n}