Notes on Variable Types, Probability, and Medical Testing (Comprehensive)

Variable Types and Measurement Levels

  • There are different levels for variables: nominal, ordinal, and numerical (interval/ratio).
  • Key idea: nominal variables have categories with no intrinsic order; ordinal variables have a meaningful order; numerical variables are quantitative and can be averaged.
  • Examples discussed:
    • Exterior materials (wood siding, cement, brick, etc.) are nominal because there is no natural ordering among materials.
    • By contrast, quality ratings (e.g., “high” vs. “low” quality) carry an ordinal-like ordering; the exterior materials themselves remain nominal.
  • Coding of categorical variables:
    • A categorical variable can be encoded with numbers (e.g., gender coded as 1 for male and 0 for female). The numeric code is just a label, not a measurement.
  • Case study example (survey of 253 students) with several variables to classify as nominal, ordinal, or numerical:
    • Gender: coded as 0/1 (female/male). Since there is no natural order between male and female, this is nominal.
    • Class year: coded as integers (e.g., 1 for freshman, 4 for senior; with 2 and 3 for sophomores/juniors). There is a clear order (freshman → sophomore → junior → senior), so this is ordinal.
    • Early riser vs. owl (lark, neither, owl): nominal because there is no inherent order among lark, neither, and owl.
    • Number of classes in the semester: numerical (a count).
    • Anxiety score on a scale (e.g., 1 to 8 or 1 to 10): numerical, since averages can be taken (e.g., an average score like 6.5).
    • Categorical vs numerical distinction when categorizing scores (normal/moderate/severe vs a numerical anxiety score): if the data are categories with a meaningful order (normal < moderate < severe), it’s ordinal; if the data are raw numeric scores, they’re numerical.
  • Transforming numerical to categorical (and vice versa):
    • You can convert a numerical variable to a categorical variable by binning into intervals. Example: converting year built into ranges (before 1900, 1900–1925, etc.) for a bar plot instead of a histogram.
    • It is generally not appropriate to convert a purely categorical variable into a numerical value that implies a mathematical quantity. Numbers used to code categories are just labels, not measurements.
  • Summary of conversion rules:
    • Numerical → Categorical: possible via binning into intervals.
    • Categorical → Numerical: not truly appropriate because the categories are labels, not ordered numeric quantities (unless you assign meaningful scores to categories and treat them as ordinal, but that involves assumptions).
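The binning rule above can be sketched in a few lines of Python. The year values and bin edges are illustrative (taken from the “year built” example), not from an actual dataset:

```python
# Sketch: converting a numerical variable (year built) into ordered categories
# by binning. The years below are invented for illustration.
year_built = [1885, 1910, 1923, 1952, 1978, 2001]

def bin_year(year):
    """Map a numeric year to a labeled interval (an ordinal category)."""
    if year < 1900:
        return "before 1900"
    elif year <= 1925:
        return "1900-1925"
    else:
        return "after 1925"

categories = [bin_year(y) for y in year_built]
print(categories)
# Note the information loss: within "after 1925" we can no longer
# distinguish 1952 from 2001.
```

The reverse direction has no such rule: mapping "before 1900" back to a single number would fabricate a measurement that was never made.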

Practice: Classifying Variables (Nominal vs Numerical)

  • Example 1: Born region (or country) of birth – nominal (categories with no natural order).
  • Example 2: Employment status (e.g., employed, unemployed) – nominal.
  • Example 3: Day of the week someone was born (Mon–Sun) – nominal: although days follow a sequence within the week, the day labels themselves carry no mathematical order for analysis purposes.
  • Example 4: Distance from birthplace in kilometers – numerical.
  • Note: Some questions in class involve deciding whether a variable is nominal, ordinal, or numerical; ordering or lack of natural order drives the classification.

Key Concepts: Variables vs Statistics vs Parameters

  • A variable (for each subject) is a value measured or observed for that subject.

  • A statistic is a summary value computed from a sample (e.g., the sample mean
    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i).

  • A parameter is the corresponding quantity for the entire population.

  • Common sources of confusion:

    • The average amount of sleep in a class sample is a statistic, not a per-subject variable.
    • The proportion of left-handed students in the class is a statistic, not a per-subject measurement.
    • The sample size $n$ describes the study, not a variable observed on each subject (it is a fixed count for the dataset in this context).
  • Important takeaway: treat sample-level summaries as statistics; treat population-level quantities as parameters; the sample size is a descriptor of the study, not a per-subject measurement.
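The variable/statistic distinction can be made concrete with a minimal sketch; the sleep values below are hypothetical:

```python
# Per-subject variable: hours of sleep observed for each student in the sample.
# Each individual value is an observation of the variable; the mean computed
# from them is a statistic (one number summarizing the whole sample).
sleep_hours = [6.5, 7.0, 5.5, 8.0, 6.0]  # hypothetical sample values

n = len(sleep_hours)          # sample size: describes the study, not a subject
x_bar = sum(sleep_hours) / n  # sample mean: a statistic
print(n, x_bar)
```

The corresponding parameter would be the mean sleep for the entire population of students, which we do not observe directly.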

Scenarios: Deterministic vs. Nondeterministic Processes

  • Scenario 1 (savings accounts): Marcus and Caroline each put $50 into savings with a fixed interest rate $r=2.5\%$ compounded monthly for $t=6$ months.
    • Will the final amounts be the same? If the inputs are identical and the process is deterministic, the ending balances should be the same.
    • The formula for compound interest (monthly compounding) is:
      A = P\left(1+\frac{r}{m}\right)^{mt}
      where $P=50$, $r=0.025$, $m=12$, $t=0.5$ (six months).
    • In this deterministic setup, the two accounts should yield the same amount after six months.
  • Scenario 2 (product popularity focus groups): A marketing executive surveys 20 young adults for initial ratings; another group of 20 may yield different ratings.
    • This is nondeterministic: same input (focus group) can yield different results due to sampling, opinions, etc.
    • Deterministic vs nondeterministic: if the same input always produced the same output, the process would be deterministic; if outcomes vary, it is nondeterministic and we use probability to quantify likelihood of events.
  • Probability basics:
    • Probability is defined as the long-run frequency of an event when the process is repeated many times: P(A) = \lim_{n\to\infty} \frac{\text{# of times A occurs in n trials}}{n}.
    • For equally likely outcomes (e.g., a fair coin):
      P(\text{Heads}) = \frac{1}{2}, \quad P(\text{4 on a fair die}) = \frac{1}{6}.
    • In real life, not all outcomes are equally likely; probabilities are often estimated from data (census, samples, etc.).
  • Base rate (prevalence):
    • Base rate is the chance that a condition appears in the whole population (population prevalence).
    • The base rate depends on place and time (e.g., polio prevalence changes over eras and locations).
    • Base rate is essential when interpreting test results, especially in medical testing, because it influences post-test probabilities.
  • Probability scale:
    • Probabilities lie in the interval $[0,1]$ (0% to 100%).
    • Negative probabilities or probabilities >1 are not meaningful and indicate an error in calculation or interpretation.
    • A probability of 0 (0%) means the event will not occur; 1 (100%) means it is certain to occur; 0.5 (50%) means the event is as likely to occur as not.
  • Practical takeaway: probability theory helps model uncertainty in real-world processes and distinguishes between fixed inputs and variable outputs.
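Both scenarios above can be sketched together: the compound-interest formula is deterministic (identical inputs always give identical outputs), while repeated coin flips illustrate the long-run frequency definition of probability. The seed value is arbitrary, chosen only for reproducibility:

```python
import random

# Deterministic: compound interest with identical inputs gives identical outputs.
def compound(P, r, m, t):
    """A = P * (1 + r/m)^(m*t) with annual rate r, m periods/year, t years."""
    return P * (1 + r / m) ** (m * t)

marcus = compound(50, 0.025, 12, 0.5)    # $50, 2.5% annual, monthly, 6 months
caroline = compound(50, 0.025, 12, 0.5)
print(marcus == caroline)  # identical inputs -> identical balances

# Nondeterministic: coin flips. The relative frequency of heads approaches
# P(Heads) = 1/2 only in the long run; any finite run varies.
random.seed(1)  # arbitrary seed for reproducibility
flips = [random.random() < 0.5 for _ in range(100_000)]
print(sum(flips) / len(flips))  # close to 0.5, but not exactly 0.5
```

Each account ends with about $50.63, and both end with exactly the same amount; no probability is needed for the deterministic scenario.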

Teacher Satisfaction: Contingency Tables and Exact vs Estimated Probabilities

  • Contingency table setup (urban school district): Teaching level (Elementary, Middle, High) × Job satisfaction (Satisfied, Not Satisfied).
  • Exact probabilities (from the sample of 674 respondents):
    • Probability a randomly selected respondent teaches in high school:
      P(\text{High}) = \frac{\text{Number in High}}{674} = \frac{205}{674} = 0.3042.
    • Probability a randomly selected respondent is satisfied with their job:
      P(\text{Satisfied}) = \frac{341}{674}.
    • Probability a respondent teaches in middle school and is not satisfied: (cell count / 674) [specific cell count not fully disclosed in the transcript].
  • Conditional probability (within a subgroup):
    • Focus on elementary teachers only (row): 307 elementary teachers; among them, 125 are satisfied.
    • Within elementary group:
      P(\text{Satisfied} | \text{Elementary}) = \frac{125}{307} = 0.4072.
  • Conditional probability example (not satisfied → probability of high school):
    • Among not satisfied teachers, total = 333; among them, 74 teach in high school.
    • Then:
      P(\text{High} \mid \text{Not Satisfied}) = \frac{74}{333} = 0.2222.
  • Distinguishing exact vs estimated probabilities:
    • Probabilities computed directly from the given sample counts (as above) are exact for that sample.
    • When the same proportions are used to infer probabilities for the wider population of teachers in the district, they become estimates, whose reliability depends on how representative the sample is.
    • The two conditional probabilities above also show how restricting attention to a subgroup changes the denominator and the context of the probability.
  • Key takeaway: probabilities derived from a contingency table can be exact (from the sample) or conditional (within a subpopulation). Inference to the population relies on representativeness of the sample.
  • Recap on probability phrasing:
    • Given a random respondent, what is the probability they have a specific characteristic? Use the proportion across all respondents (denominator is the total sample size).
    • If you know the respondent belongs to a subgroup, what is the probability of another characteristic within that subgroup? Use the subgroup totals (denominator is the subgroup size).
    • If you ask for a joint event (e.g., High School and Not Satisfied), focus on the cell count for that combination divided by the total sample size.
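The table probabilities above can be reproduced in a short sketch. The counts (674 total, 205 High, 341 Satisfied, 307 Elementary with 125 satisfied, 333 Not Satisfied with 74 in High) are those quoted in the notes; the remaining cells are not needed here:

```python
# Probabilities from the teacher-satisfaction contingency table.
total = 674

p_high = 205 / total              # P(High): denominator is the full sample
p_satisfied = 341 / total         # P(Satisfied)
p_sat_given_elem = 125 / 307      # P(Satisfied | Elementary): subgroup denominator
p_high_given_not_sat = 74 / 333   # P(High | Not Satisfied)

print(round(p_high, 4))               # 0.3042
print(round(p_sat_given_elem, 4))     # 0.4072
print(round(p_high_given_not_sat, 4)) # 0.2222
```

Note how the denominator shifts from 674 to 307 or 333 as soon as the question conditions on a subgroup.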

Medical Testing: Sensitivity, Specificity, and Base Rate

  • Test outcomes and terminology:
    • True positive: test result is positive, and the person has the condition.
    • True negative: test result is negative, and the person does not have the condition.
    • False positive: test result is positive, but the person does not have the condition.
    • False negative: test result is negative, but the person does have the condition.
  • Key rates:
    • Sensitivity (true positive rate):
      \text{Sensitivity} = P(\text{Positive} \mid \text{Disease})
    • Specificity (true negative rate):
      \text{Specificity} = P(\text{Negative} \mid \text{No Disease})
    • False positive rate: \text{FPR} = 1 - \text{Specificity}
    • False negative rate: \text{FNR} = 1 - \text{Sensitivity}
  • Base rate (prevalence):
    • The prevalence in the population is: P(\text{Disease}).
    • The base rate is essential for interpreting post-test probabilities; the same test with given sensitivity/specificity can yield different post-test probabilities depending on prevalence.
  • HIV test example (illustrative numbers):
    • Base rate (prevalence) in the U.S. population: 0.34\% = 0.0034.
    • Sensitivity: 0.75 (true positive rate).
    • False positive rate: 0.04 (so specificity is 1 - 0.04 = 0.96).
    • Therefore, the test characteristics are:
      \text{Sensitivity} = 0.75, \text{Specificity} = 0.96, \text{FPR} = 0.04.
  • Practical implications and interpretation:
    • A false positive can cause unnecessary anxiety, isolation, and treatment; a false negative can miss a diagnosis and allow continued spread or progression of disease.
    • Base rate affects the probability that a positive test truly indicates disease (post-test probability). A full interpretation requires Bayes’ rule:
      P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.
  • Additional clinical interpretation notes:
    • Always consider base rate when communicating test results to patients; a positive result does not guarantee disease without considering prevalence and test characteristics.
    • The balance of sensitivity and specificity is crucial in screening vs. confirmatory testing strategies; higher sensitivity reduces false negatives, higher specificity reduces false positives.
  • Ethical and practical implications:
    • Misinterpretation of test results can lead to stigma, unnecessary treatment, or missed diagnoses.
    • Public health decisions should incorporate base rates and test characteristics to avoid misinforming patients or policy.
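Bayes' rule with the HIV numbers quoted above makes the base-rate effect concrete:

```python
# Post-test probability for the HIV screening example, using the
# prevalence, sensitivity, and false positive rate from the notes.
prevalence = 0.0034
sensitivity = 0.75
fpr = 0.04  # = 1 - specificity (specificity = 0.96)

# P(Positive) by the law of total probability
p_positive = sensitivity * prevalence + fpr * (1 - prevalence)

# P(Disease | Positive) by Bayes' rule
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))  # about 0.060
```

Despite the test's reasonable sensitivity and specificity, a positive screen implies only about a 6% chance of disease at this low base rate, because false positives from the large disease-free group dominate the true positives.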

Connections to Foundational Principles

  • Measurement scales (nominal, ordinal, numerical) underpin data analysis choices, including which summary statistics are appropriate and which statistical tests to apply.
  • The idea of converting between measurement levels (numerical → categorical) can simplify visualization (e.g., bar plots) but may incur information loss; converting categorical to numerical is generally inappropriate unless a meaningful ordinal interpretation is assigned.
  • Probability theory connects data descriptions (contingency tables) to real-world uncertainty, guiding decisions under uncertainty and informing interpretation of tests and surveys.
  • The interplay between sensitivity, specificity, and prevalence highlights the importance of context when interpreting diagnostic tests and the impact of base rates on predictive values.

Quick Reference: Key Formulas and Definitions

  • Probability basics:
    • If outcomes are equally likely: P(A) = \frac{\text{# of favorable outcomes}}{\text{# of possible outcomes}}.
    • For a fair coin: P(\text{Heads}) = \frac{1}{2}.
  • Compound interest (monthly): A = P\left(1+\frac{r}{m}\right)^{mt} where:
    • $P$ = principal, $r$ = annual rate, $m$ = number of compounding periods per year, $t$ = time in years.
  • Contingency table probabilities:
    • Exact probability from table: cell count / total count.
    • Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)}.
  • Medical test terminology:
    • Sensitivity: \text{Sensitivity} = P(\text{Positive} \mid \text{Disease}).
    • Specificity: \text{Specificity} = P(\text{Negative} \mid \text{No Disease}).
    • False Positive Rate: \text{FPR} = 1 - \text{Specificity}.
    • False Negative Rate: \text{FNR} = 1 - \text{Sensitivity}.
    • Base rate (Prevalence): P(\text{Disease}).
  • Bayes-style post-test probability (conceptual):
    P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.