Notes on Variable Types, Probability, and Medical Testing (Comprehensive)
Variable Types and Measurement Levels
- There are different levels for variables: nominal, ordinal, and numerical (interval/ratio).
- Key idea: nominal variables have categories with no intrinsic order; ordinal variables have a meaningful order; numerical variables are quantitative and can be averaged.
- Examples discussed:
- Exterior materials (wood siding, cement, brick, etc.) are nominal because there is no natural ordering among materials.
- Quality ratings such as “high” vs. “low” do suggest an ordinal ordering, but the exterior materials themselves remain nominal.
- Coding of categorical variables:
- A categorical variable can be encoded with numbers (e.g., gender coded as 1 for male and 0 for female). The numeric code is just a label, not a measurement.
- Case study example (survey of 253 students) with several variables to classify as nominal, ordinal, or numerical:
- Gender: coded as 0/1 (female/male). Since there is no natural order between male and female, this is nominal.
- Class year: coded as integers (e.g., 1 for freshman, 4 for senior; with 2 and 3 for sophomores/juniors). There is a clear order (freshman → sophomore → junior → senior), so this is ordinal.
- Early riser vs. owl (lark, neither, owl): nominal because there is no inherent order among lark, neither, and owl.
- Number of classes in the semester: numerical (a count).
- Anxiety score on a scale (e.g., 1 to 8 or 1 to 10): numerical, since averages can be taken (e.g., an average score like 6.5).
- Categorical vs numerical distinction when categorizing scores (normal/moderate/severe vs a numerical anxiety score): if the data are categories with a meaningful order (normal < moderate < severe), it’s ordinal; if the data are raw numeric scores, they’re numerical.
- Transforming numerical to categorical (and vice versa):
- You can convert a numerical variable to a categorical variable by binning into intervals. Example: converting year built into ranges (before 1900, 1900–1925, etc.) for a bar plot instead of a histogram.
- It is generally not appropriate to convert a purely categorical variable into a numerical value that implies a mathematical quantity. Numbers used to code categories are just labels, not measurements.
- Summary of conversion rules:
- Numerical → Categorical: possible via binning into intervals.
- Categorical → Numerical: not truly appropriate because the categories are labels, not ordered numeric quantities (unless you assign meaningful scores to categories and treat them as ordinal, but that involves assumptions).
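The numerical-to-categorical conversion above can be sketched in plain Python. This is an illustrative example (the cutoffs beyond 1925 are hypothetical, not from the notes):

```python
# Sketch (illustrative cutoffs): binning a numerical "year built" variable
# into ordered categories, as one would do before drawing a bar plot.
def bin_year_built(year):
    """Convert a numeric year into a coarse categorical label."""
    if year < 1900:
        return "before 1900"
    elif year <= 1925:
        return "1900-1925"
    elif year <= 1950:
        return "1926-1950"
    else:
        return "after 1950"

years = [1885, 1910, 1947, 1999]
categories = [bin_year_built(y) for y in years]
print(categories)  # ['before 1900', '1900-1925', '1926-1950', 'after 1950']
```

Note the information loss: once binned, 1910 and 1925 become indistinguishable, which is why the conversion only goes cleanly in one direction.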
Practice: Classifying Variables (Nominal vs Numerical)
- Example 1: Region (or country) of birth – nominal (categories with no natural order).
- Example 2: Employment status (e.g., employed, unemployed) – nominal.
- Example 3: Day of the week someone was born (Mon–Sun) – nominal: although days follow a weekly cycle, the labels carry no quantitative order for this purpose.
- Example 4: Distance from birthplace in kilometers – numerical.
- Note: Some questions in class involve deciding whether a variable is nominal, ordinal, or numerical; ordering or lack of natural order drives the classification.
Key Concepts: Variables vs Statistics vs Parameters
A variable (for each subject) is a value measured or observed for that subject.
A statistic is a summary value computed from a sample (e.g., the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$).
A parameter is the corresponding quantity for the entire population.
Common sources of confusion:
- The average amount of sleep in a class sample is a statistic, not a per-subject variable.
- The proportion of left-handed students in the class is a statistic, not a per-subject measurement.
- The sample size $n$ describes the study, not a variable observed on each subject (it is a fixed count for the dataset in this context).
Important takeaway: treat sample-level summaries as statistics; treat population-level quantities as parameters; the sample size is a descriptor of the study, not a per-subject measurement.
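The distinction can be made concrete with a tiny example (the data below are hypothetical, invented purely for illustration):

```python
# Sketch (hypothetical data): per-subject variables vs. sample-level statistics.
sleep_hours = [7.0, 6.5, 8.0, 5.5, 7.5]   # variable: one value per subject
left_handed = [0, 1, 0, 0, 0]             # variable: 1 = left-handed, per subject

n = len(sleep_hours)                       # sample size: describes the study, not a subject
mean_sleep = sum(sleep_hours) / n          # statistic: sample mean x-bar
prop_left = sum(left_handed) / n           # statistic: sample proportion

print(n, mean_sleep, prop_left)  # 5 6.9 0.2
```

The lists are variables (one entry per subject); `mean_sleep` and `prop_left` are statistics; the corresponding parameters would be the mean sleep and left-handed proportion in the whole population.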
Scenarios: Deterministic vs. Nondeterministic Processes
- Scenario 1 (savings accounts): Marcus and Caroline each put $50 into savings with a fixed interest rate $r=2.5\%$ compounded monthly for $t=6$ months.
- Will the final amounts be the same? If the inputs are identical and the process is deterministic, the ending balances should be the same.
- The formula for compound interest (monthly compounding) is:
A = P\left(1+\frac{r}{m}\right)^{mt}
where $P=50$, $r=0.025$, $m=12$, $t=0.5$ (six months).
- In this deterministic setup, the two accounts yield the same amount after six months.
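The deterministic claim can be checked numerically with the values from the scenario:

```python
# Sketch: evaluating the compound-interest formula A = P(1 + r/m)^(mt)
# for both savers, using the values given in the scenario.
P, r, m, t = 50.0, 0.025, 12, 0.5

A_marcus = P * (1 + r / m) ** (m * t)
A_caroline = P * (1 + r / m) ** (m * t)

print(round(A_marcus, 2))          # about 50.63 after six months
assert A_marcus == A_caroline      # identical inputs -> identical outputs
```

Because the process is deterministic, rerunning the computation always reproduces the same balance; contrast this with the focus-group scenario below.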
- Scenario 2 (product popularity focus groups): A marketing executive surveys 20 young adults for initial ratings; another group of 20 may yield different ratings.
- This is nondeterministic: same input (focus group) can yield different results due to sampling, opinions, etc.
- Deterministic vs nondeterministic: if the same input always produced the same output, the process would be deterministic; if outcomes vary, it is nondeterministic and we use probability to quantify likelihood of events.
- Probability basics:
- Probability is defined as the long-run frequency of an event when the process is repeated many times: P(A) = \lim_{n\to\infty} \frac{\text{# of times A occurs in n trials}}{n}.
- For equally likely outcomes (e.g., a fair coin):
P(\text{Heads}) = \frac{1}{2}, \quad P(\text{4 on a fair die}) = \frac{1}{6}.
- In real life, not all outcomes are equally likely; probabilities are often estimated from data (census, samples, etc.).
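The long-run-frequency definition can be illustrated with a quick simulation (a sketch, not part of the original notes):

```python
# Sketch: the long-run frequency interpretation of probability,
# approximated by simulating many fair-coin flips.
import random

random.seed(0)  # fixed seed for reproducibility
n_trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_trials))

print(heads / n_trials)  # close to the theoretical P(Heads) = 0.5
```

As `n_trials` grows, the observed frequency settles near 1/2, which is exactly what the limit definition above describes.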
- Base rate (prevalence):
- Base rate is the chance that a condition appears in the whole population (population prevalence).
- The base rate depends on place and time (e.g., polio prevalence changes over eras and locations).
- Base rate is essential when interpreting test results, especially in medical testing, because it influences post-test probabilities.
- Probability scale:
- Probabilities lie in the interval $[0,1]$ (0% to 100%).
- Negative probabilities or probabilities >1 are not meaningful and indicate an error in calculation or interpretation.
- 0% means the event is (almost surely) not going to happen; 100% means the event is certain; 50% means the event is as likely to occur as not.
- Practical takeaway: probability theory helps model uncertainty in real-world processes and distinguishes between fixed inputs and variable outputs.
Teacher Satisfaction: Contingency Tables and Exact vs Estimated Probabilities
- Contingency table setup (urban school district): Teaching level (Elementary, Middle, High) × Job satisfaction (Satisfied, Not Satisfied).
- Exact probabilities (from the sample of 674 respondents):
- Probability a randomly selected respondent teaches in high school:
P(\text{High}) = \frac{\text{Number in High}}{674} = \frac{205}{674} \approx 0.3042.
- Probability a randomly selected respondent is satisfied with their job:
P(\text{Satisfied}) = \frac{341}{674}.
- Probability a respondent teaches in middle school and is not satisfied: (cell count / 674) [specific cell count not fully disclosed in the transcript].
- Conditional probability (within a subgroup):
- Focus on elementary teachers only (row): 307 elementary teachers; among them, 125 are satisfied.
- Within elementary group:
P(\text{Satisfied} | \text{Elementary}) = \frac{125}{307} = 0.4072.
- Conditional probability example (not satisfied → probability of high school):
- Among not satisfied teachers, total = 333; among them, 74 teach in high school.
- Then:
P(\text{High} \mid \text{Not Satisfied}) = \frac{74}{333} = 0.2222.
- Distinguishing exact vs estimated probabilities:
- Exact probabilities are computed directly from the given sample counts (as above) and describe the sample itself.
- Estimated probabilities arise when those sample proportions are used to infer about the larger population of teachers in the district; their accuracy depends on how representative the sample is.
- The two conditional examples also illustrate how focusing on a subset changes the denominator and the context of the probability.
- Key takeaway: probabilities derived from a contingency table can be exact (from the sample) or conditional (within a subpopulation). Inference to the population relies on representativeness of the sample.
- Recap on probability phrasing:
- Given a random respondent, what is the probability they have a specific characteristic? Use the proportion across all respondents (denominator is the total sample size).
- If you know the respondent belongs to a subgroup, what is the probability of another characteristic within that subgroup? Use the subgroup totals (denominator is the subgroup size).
- If you ask for a joint event (e.g., High School and Not Satisfied), focus on the cell count for that combination divided by the total sample size.
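The three phrasings above can be computed from one table. In the sketch below, the elementary and high-school cells come from the notes; the middle-school cells are filled in from the stated totals (674 respondents, 341 satisfied, 307 elementary, 205 high school), so they follow by subtraction rather than being reported directly:

```python
# Sketch: marginal vs. conditional probabilities from the contingency table.
# Middle-school cells are derived from the stated row/column totals.
table = {
    "Elementary": {"Satisfied": 125, "Not Satisfied": 182},
    "Middle":     {"Satisfied": 85,  "Not Satisfied": 77},
    "High":       {"Satisfied": 131, "Not Satisfied": 74},
}
total = sum(sum(row.values()) for row in table.values())  # 674

# Marginal probability: denominator is the whole sample.
p_high = sum(table["High"].values()) / total

# Conditional probabilities: denominator is the subgroup size.
p_sat_given_elem = table["Elementary"]["Satisfied"] / sum(table["Elementary"].values())
not_sat_total = sum(row["Not Satisfied"] for row in table.values())
p_high_given_not_sat = table["High"]["Not Satisfied"] / not_sat_total

print(round(p_high, 4), round(p_sat_given_elem, 4), round(p_high_given_not_sat, 4))
# 0.3042 0.4072 0.2222
```

Note how each question changes only the denominator: the full sample for a marginal probability, the subgroup count for a conditional one.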
Medical Testing: Sensitivity, Specificity, and Base Rate
- Test outcomes and terminology:
- True positive: test result is positive, and the person has the condition.
- True negative: test result is negative, and the person does not have the condition.
- False positive: test result is positive, but the person does not have the condition.
- False negative: test result is negative, but the person does have the condition.
- Key rates:
- Sensitivity (true positive rate): \text{Sensitivity} = P(\text{Positive} \mid \text{Disease})
- Specificity (true negative rate): \text{Specificity} = P(\text{Negative} \mid \text{No Disease})
- False positive rate: \text{FPR} = 1 - \text{Specificity}
- False negative rate: \text{FNR} = 1 - \text{Sensitivity}
- Base rate (prevalence):
- The prevalence in the population is: P(\text{Disease}).
- The base rate is essential for interpreting post-test probabilities; the same test with given sensitivity/specificity can yield different post-test probabilities depending on prevalence.
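The four rates above can be computed directly from a 2×2 outcome table. The counts below are hypothetical, chosen only to make the arithmetic transparent:

```python
# Sketch (hypothetical counts): key test rates from a 2x2 outcome table.
tp, fn = 90, 10    # people with the disease: true positives, false negatives
tn, fp = 960, 40   # people without it: true negatives, false positives

sensitivity = tp / (tp + fn)   # P(Positive | Disease)    = 0.90
specificity = tn / (tn + fp)   # P(Negative | No Disease) = 0.96
fpr = 1 - specificity          # false positive rate
fnr = 1 - sensitivity          # false negative rate

print(sensitivity, specificity, round(fpr, 2), round(fnr, 2))
```

The denominators matter: sensitivity conditions on the diseased group, specificity on the disease-free group; neither involves the prevalence, which is exactly why the base rate must enter separately when interpreting a result.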
- HIV test example (illustrative numbers):
- Base rate (prevalence) in the United States: 0.34\% = 0.0034.
- Sensitivity: 0.75 (true positive rate).
- False positive rate: 0.04 (so specificity is 1 - 0.04 = 0.96).
- Therefore, the test characteristics are:
\text{Sensitivity} = 0.75, \text{Specificity} = 0.96, \text{FPR} = 0.04.
- Practical implications and interpretation:
- A false positive can cause unnecessary anxiety, isolation, and treatment; a false negative can miss a diagnosis and allow continued spread or progression of disease.
- Base rate affects the probability that a positive test truly indicates disease (post-test probability). A full interpretation requires Bayes’ rule:
P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.
- Example values from the HIV scenario (for context):
- Prevalence (base rate): 0.0034
- Sensitivity: 0.75
- Specificity: 0.96
- False positive rate: 0.04
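Plugging these values into the Bayes formula above makes the base-rate effect concrete:

```python
# Sketch: post-test probability for the HIV scenario, via Bayes' rule:
# P(Disease | Positive) = Sens * Prev / (Sens * Prev + FPR * (1 - Prev))
prevalence = 0.0034
sensitivity = 0.75
fpr = 0.04  # = 1 - specificity (0.96)

numerator = sensitivity * prevalence
denominator = numerator + fpr * (1 - prevalence)
p_disease_given_positive = numerator / denominator

print(round(p_disease_given_positive, 3))  # about 0.06
```

Despite a positive result, the post-test probability is only about 6%: at such a low prevalence, the false positives from the large disease-free group outnumber the true positives.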
- Additional clinical interpretation notes:
- Always consider base rate when communicating test results to patients; a positive result does not guarantee disease without considering prevalence and test characteristics.
- The balance of sensitivity and specificity is crucial in screening vs. confirmatory testing strategies; higher sensitivity reduces false negatives, higher specificity reduces false positives.
- Ethical and practical implications:
- Misinterpretation of test results can lead to stigma, unnecessary treatment, or missed diagnoses.
- Public health decisions should incorporate base rates and test characteristics to avoid misinforming patients or policy.
Connections to Foundational Principles
- Measurement scales (nominal, ordinal, numerical) underpin data analysis choices, including which summary statistics are appropriate and which statistical tests to apply.
- The idea of converting between measurement levels (numerical → categorical) can simplify visualization (e.g., bar plots) but may incur information loss; converting categorical to numerical is generally inappropriate unless a meaningful ordinal interpretation is assigned.
- Probability theory connects data descriptions (contingency tables) to real-world uncertainty, guiding decisions under uncertainty and informing interpretation of tests and surveys.
- The interplay between sensitivity, specificity, and prevalence highlights the importance of context when interpreting diagnostic tests and the impact of base rates on predictive values.
Quick Reference: Key Formulas and Definitions
- Probability basics:
- If outcomes are equally likely: P(A) = \frac{\text{# of favorable outcomes}}{\text{# of possible outcomes}}.
- For a fair coin: P(\text{Heads}) = \frac{1}{2}.
- Compound interest (monthly):
A = P\left(1+\frac{r}{m}\right)^{mt}
where:
- $P$ = principal, $r$ = annual rate, $m$ = number of compounding periods per year, $t$ = time in years.
- Contingency table probabilities:
- Exact probability from table: cell count / total count.
- Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)}.
- Medical test terminology:
- Sensitivity: \text{Sensitivity} = P(\text{Positive} \mid \text{Disease}).
- Specificity: \text{Specificity} = P(\text{Negative} \mid \text{No Disease}).
- False Positive Rate: \text{FPR} = 1 - \text{Specificity}.
- False Negative Rate: \text{FNR} = 1 - \text{Sensitivity}.
- Base rate (Prevalence): P(\text{Disease}).
- Bayes-style post-test probability (conceptual):
P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.