Biostatistics: Introduction to Biostatistics (B300) - Practice Flashcards
Course logistics and access
Course: Introduction to Biostatistics (B300), with in-person and online components
Canvas layout overview:
Lectures posted just before the lecture (usually the night before)
Lectures accessible via a PowerPoint file for today’s topic
Assignments: Homework and quizzes posted under Assignments
Quizzes: graded online; Homework: graded by TA (to be assigned)
Syllabus and course information available in Canvas; a separate official syllabus document is posted
Zoom link available for remote attendance or one-on-one help
Kaltura Media Gallery: video recordings of lectures (recordings appear there once they are posted)
Class participation and attendance:
Because lectures are recorded, some students may be tempted to stay home; the instructor encourages attendance for better learning outcomes
Attendance tracking may be implemented in the future, but attendance is not strictly required at every session
Office hours and contact:
Instructor: Nayus (pronounced like “my goose” as a mnemonic); open to scheduling by appointment
Office location (RG 6056) and phone; email address: a9@iu.edu
No fixed schedule of formal office hours; contact for meetings as needed
Class format notes:
The instructor also teaches two online classes (B300 Fort Wayne and B301 - Biostatistics for Health Information Management) and records lectures for those classes as well
The class uses a blended format: in-person meetings, online access, and recorded lectures
Textbook and software overview:
Textbook: OpenIntro Statistics, 4th edition (open-access, low cost)
OpenIntro website: openintro.org; supports a free PDF download and low-cost paperback options
R software for statistics: introduced in Week 2; setup guidance provided via posit.cloud (free plan available)
OpenIntro labs: Introduction to R and RStudio available on the textbook site; recommended to complete before heavy usage in class
Emphasis on self-directed learning:
The instructor will refer to OpenIntro labs and slides; slides are edited for clarity and supplemented with instructor notes
A cheat sheet (one 8.5 × 11 inch sheet, two-sided, with essential notes) is allowed during exams; exams are otherwise not open-book or open-notes
Exam and assessment structure:
Three in-class exams (must be taken in person unless a very good reason is approved in advance)
Final exam scheduled for the last day of the semester (December 19) at 3:00 PM
Homework due two weeks after each lecture; quizzes due at 11:59 PM on the due date
A placeholder policy: if a student is unable to attend an exam for a legitimate reason, prior approval is required; otherwise, attendance is expected
Grading and adjustments:
The grading scheme uses a percentage scale; if overall grades are too low, z-score adjustments may be applied to raise some grades
The instructor discourages copying or “homework leeches”; collaboration is allowed but copying is forbidden
Additional course notes:
The syllabus includes sections on academic integrity, sexual misconduct, accommodations (AES), course withdrawal, and incompletes
If you request disability accommodations, contact AES (Accessible Educational Services)
Opening video: statistics in the real world
A short video introduces descriptive statistics (data organization and presentation) and inferential statistics (drawing conclusions about populations from samples), with examples across occupations and contexts
Probability and interpretation of results (confidence and significance) are introduced as part of the foundation for later topics
Personal context from the instructor:
Background: long career in statistics; experience at a major hospital, Lilly (pharmaceuticals), and as a university lecturer since 1990
Personal anecdotes illustrate real-world applications of statistics (clinical trials, hospital quality control, travel data, zoo attendance, etc.)
A note on teaching style: the instructor sometimes teaches seated due to prior surgeries but aims to remain engaging and accessible
Opening case study overview (Chronic Fatigue Syndrome):
Objective: evaluate cognitive behavioral therapy (CBT) vs relaxation for chronic fatigue syndrome
Participant pool: 142 patients recruited; 60 entered the study; exclusions for criteria/health status; some declined
Study design: randomized assignment to two groups; treatment group (CBT) vs control (relaxation); n_t = 27, n_c = 26
Outcome measure: proportion of patients with “good results”
Results:
Treatment group: 19/27 good results (CBT)
Control group: 5/26 good results (relaxation)
Key computations:
Proportion in treatment: p_t = \frac{19}{27} \approx 0.70
Proportion in control: p_c = \frac{5}{26} \approx 0.1923
Percentage form: about 70% (treatment) vs about 19% (control)
Difference in proportions: \text{Difference} = p_t - p_c \approx 0.70 - 0.1923 = 0.5077 \text{ (about 51 percentage points)}
Interpretive notes:
The large observed difference suggests the CBT treatment may have a real effect beyond random fluctuation, but external generalizability is limited due to the specific volunteer sample
The concept of random variation is introduced to explain why not all samples yield identical results; a larger or more diverse sample could be explored in future studies
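The case study arithmetic and the idea of random variation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 10,000 shuffles, and the seed are hypothetical choices, and the shuffling approach is one common way to explore chance variation, not necessarily the method used in the lecture. Without rounding, p_t is about 0.7037 and the difference is about 0.5114; the 0.5077 in the notes comes from rounding p_t to 0.70 first. Both round to 51 percentage points.

```python
import random

# Counts taken from the case study above.
N_TREATMENT, GOOD_TREATMENT = 27, 19   # CBT group
N_CONTROL, GOOD_CONTROL = 26, 5        # relaxation group


def proportion(successes, total):
    """Proportion of successes among `total` observations."""
    return successes / total


def simulate_differences(n_sims=10_000, seed=1):
    """Shuffle the 24 good results across all 53 patients and record the
    difference in proportions that chance alone would produce.

    n_sims and seed are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    n_good = GOOD_TREATMENT + GOOD_CONTROL
    n_total = N_TREATMENT + N_CONTROL
    outcomes = [1] * n_good + [0] * (n_total - n_good)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(outcomes)
        sim_t = sum(outcomes[:N_TREATMENT]) / N_TREATMENT
        sim_c = sum(outcomes[N_TREATMENT:]) / N_CONTROL
        diffs.append(sim_t - sim_c)
    return diffs


p_t = proportion(GOOD_TREATMENT, N_TREATMENT)
p_c = proportion(GOOD_CONTROL, N_CONTROL)
observed = p_t - p_c
print(f"p_t = {p_t:.4f}, p_c = {p_c:.4f}, difference = {observed:.4f}")

diffs = simulate_differences()
share = sum(d >= observed for d in diffs) / len(diffs)
print(f"Share of shuffles with a difference at least as large: {share:.4f}")
```

If only a very small share of the shuffles matches or exceeds the observed difference, chance alone is an unconvincing explanation for the gap between the groups.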
Data basics: types of variables and data structure
Data collection context: survey of students in an introductory statistics course; variables collected include:
Gender (categorical)
Introvert vs extrovert (categorical; often treated as ordinal when using a scale)
Sleep hours (numerical, continuous)
Bedtime category (categorical, ordinal)
Number of countries visited (numerical, discrete)
Dread level for the class (ordinal scale 1–5; can be treated numerically for averages but inherently categorical)
Data matrix concept:
Rows = observations (e.g., students)
Columns = variables (features)
Variable types:
Numerical vs categorical:
Numerical: measurements or counts (continuous or discrete)
Categorical: groups or categories (nominal) or ordered categories (ordinal)
Continuous vs discrete (numeric types):
Discrete numerical: counts (e.g., number of children, number of classes taken)
Continuous numerical: measurements (e.g., height, weight, age)
Examples from the lecture:
Age is often treated as discrete but is technically continuous (you can have fractional ages)
ZIP code is categorical (a code, not a numeric quantity with intrinsic magnitude)
Area code is categorical
Ordinal categorical variables have a natural order (e.g., cancer stages: 1 to 4; dread level 1–5)
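As a rough illustration of the data matrix and the variable types above, the sketch below stores a few survey rows in Python. The three students, their values, and the helper name `column` are hypothetical and are not data from the lecture; only the variable classifications follow the notes.

```python
# Hypothetical survey responses: each dict is one row (one student).
students = [
    {"gender": "F", "sleep_hours": 7.5, "countries_visited": 3, "dread": 2},
    {"gender": "M", "sleep_hours": 6.0, "countries_visited": 0, "dread": 4},
    {"gender": "F", "sleep_hours": 8.25, "countries_visited": 12, "dread": 1},
]

# Each key is one column (one variable), classified as in the notes.
variable_types = {
    "gender": "categorical (nominal)",
    "sleep_hours": "numerical (continuous)",
    "countries_visited": "numerical (discrete)",
    "dread": "categorical (ordinal)",
}


def column(rows, name):
    """Return every observation of one variable, in row order."""
    return [row[name] for row in rows]


for name, kind in variable_types.items():
    print(f"{name:18} {kind:24} {column(students, name)}")

# An average is meaningful for a numerical variable such as sleep hours.
mean_sleep = sum(column(students, "sleep_hours")) / len(students)
print(f"Mean sleep hours: {mean_sleep:.2f}")
```

Averaging `gender` would be meaningless, which is the practical reason for classifying each variable before choosing a summary.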
Explanatory vs response variables (for potential causal interpretation):
Explanatory (independent) variable: e.g., hours of study
Response (dependent) variable: e.g., GPA
Note: labeling does not prove causation; relationships observed in data may be correlational or due to confounding factors
Observational studies vs experiments:
Observational: data collected without random assignment; can reveal associations but not causal conclusions
Experiment: random assignment to treatments; can establish causal relationships
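Random assignment, the feature that separates an experiment from an observational study, can be sketched as a shuffle followed by a split. The participant IDs, the function name `randomly_assign`, and the seed are hypothetical; the group sizes 27 and 26 mirror the chronic fatigue case study.

```python
import random


def randomly_assign(participant_ids, n_treatment, seed=None):
    """Shuffle the participants and split them into two groups.

    Returns (treatment_ids, control_ids). Chance alone decides who lands
    in which group, so the groups should be similar on average.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_treatment], shuffled[n_treatment:]


participants = range(1, 54)  # hypothetical IDs 1..53
treatment, control = randomly_assign(participants, n_treatment=27, seed=42)
print(f"Treatment group size: {len(treatment)}")
print(f"Control group size:   {len(control)}")
```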
Association vs independence:
Associated/dependent: there is a relationship or connection between variables
Independent: no association observed between variables in the data
Scatter plots: example with head length vs skull width shows a positive association (the variables tend to move together)
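One way to put a number on a positive association like the one in the scatter plot is the Pearson correlation coefficient, a technique these notes do not introduce at this point. The sketch below is illustrative only: the measurements are hypothetical values, not the lecture's data, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements in millimeters.
head_length_mm = [85.0, 88.5, 90.0, 92.5, 94.0, 96.5]
skull_width_mm = [54.0, 55.5, 57.0, 56.5, 59.0, 60.5]

r = pearson_r(head_length_mm, skull_width_mm)
print(f"r = {r:.3f}")  # a value above 0 indicates a positive association
```

A value near 0 would suggest no linear association, and a negative value would mean one variable tends to fall as the other rises.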
Sampling concepts: populations, samples, and census
Population: all members of the group of interest (e.g., all adults, all women with lupus, all Americans in an election population)
Sample: a subset of the population actually measured or surveyed
Census: complete enumeration of the population; often impractical due to cost, time, and locating difficult-to-reach individuals (e.g., immigrants)
Example scenarios:
Population for a lupus drug study: all people with lupus; sample might be 2,000 people from around the world
Election polling: population = all registered voters; sample might be ~1,000 respondents per survey
Why sampling is used: practical constraints make census infeasible; sampling aims to infer population characteristics from a representative subset
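The logic of polling about 1,000 people instead of running a census can be illustrated with a small simulation. Everything here is an assumption made for illustration: the population of 100,000 voters, the true support of 52%, and the function name `simulate_poll` do not come from the lecture.

```python
import random


def simulate_poll(true_proportion, population_size, sample_size, seed=None):
    """Draw a simple random sample (without replacement) from a
    hypothetical population and return the sample proportion."""
    rng = random.Random(seed)
    n_yes = round(true_proportion * population_size)
    population = [1] * n_yes + [0] * (population_size - n_yes)
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


# Five independent polls of 1,000 respondents each.
estimates = [simulate_poll(0.52, 100_000, 1_000, seed=s) for s in range(5)]
for i, estimate in enumerate(estimates, start=1):
    print(f"Poll {i}: estimated support = {estimate:.3f}")
```

Each poll lands close to the true proportion but not exactly on it, which shows both why sampling works and why estimates vary from sample to sample.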
Anecdotal evidence vs evidence-based conclusions
Anecdotal evidence:
Based on limited, non-representative examples (e.g., uncle who smoked for decades without health issues)
Historically used in early smoking research but later found to be insufficient for general conclusions
Evidence-based conclusions:
Based on large, representative samples and systematic analysis showing consistent patterns (e.g., smoking linked to lung cancer and heart disease)
Practical implications for biostatistics and research design
Distinguish between descriptive statistics (summarizing data) and inferential statistics (drawing conclusions about populations from samples)
Understand the limitations of observational studies and the value of randomized experiments for causal inference
Recognize the importance of data quality, sample representativeness, and potential biases (e.g., nonresponse, selection bias)
Use of standardization and z-scores for grade adjustments when necessary; understanding how adjustments affect interpretation of results
Quick reference concepts to memorize for quizzes/exams
Proportion and percentage definitions:
Proportion: p = \frac{\text{number of successes}}{\text{total}}
Percentage: \text{Percentage} = p \times 100\%
Example from the case study:
Treatment success proportion: p_t = \frac{19}{27} \approx 0.70 \; (70\%)
Control success proportion: p_c = \frac{5}{26} \approx 0.1923 \; (\approx 19\%)
Observed difference: \Delta p = p_t - p_c \approx 0.70 - 0.1923 = 0.5077 \approx 0.51 \; (51\text{ percentage points})
Probability and random variation intuition (e.g., coin flip analogy) to understand why observed differences may reflect true effects or random fluctuation
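The coin flip analogy can be tried directly. The sketch below assumes a fair coin, and the flip counts and seed are arbitrary illustrative choices; it compares how much the proportion of heads wanders in short and long runs.

```python
import random


def heads_proportion(n_flips, rng):
    """Proportion of heads in n_flips simulated flips of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips


rng = random.Random(0)  # fixed seed so the run is reproducible
small = [heads_proportion(10, rng) for _ in range(5)]
large = [heads_proportion(10_000, rng) for _ in range(5)]

print("10 flips each:    ", [f"{p:.2f}" for p in small])
print("10,000 flips each:", [f"{p:.3f}" for p in large])
```

Short runs can drift far from 0.5 purely by chance, while long runs stay close to it, which is why a difference seen in a small sample needs to be weighed against random fluctuation.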
Notable formulas to remember for later use
Z-score (conceptual form; used for grading adjustments): z = \frac{X - \mu}{\sigma} (where X is an observed score, μ is the mean, σ is the standard deviation)
Basic proportion and percentage as shown above
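A minimal sketch of the z-score formula above, applied to a set of exam scores. The scores are hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption chosen to match the μ and σ in the formula; the notes do not say how the instructor actually computes grade adjustments.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Standardize each score: z = (X - mu) / sigma."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (assumption)
    return [(x - mu) / sigma for x in scores]


exam_scores = [62, 70, 74, 81, 93]  # hypothetical exam scores
standardized = z_scores(exam_scores)
for score, z in zip(exam_scores, standardized):
    print(f"score {score}: z = {z:+.2f}")
```

A positive z means the score is above the class mean and a negative z means it is below, measured in standard deviations.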
Summary takeaways
The course blends lectures, online materials, and recorded videos; attendance is encouraged for engagement and better performance
OpenIntro Statistics is the core textbook; a free PDF and affordable paperback options are available, along with supporting online labs for R and RStudio
The course has three in-class exams plus a final exam; homework is due on a two-week cycle; a two-sided cheat sheet is allowed during exams
Core concepts introduced include descriptive vs inferential statistics, variability, probability, and the logic of sampling, observation, and experimentation
Data types and variable classification are foundational for choosing appropriate analyses
Real-world context and ethical considerations are integrated (e.g., avoiding data fabrication, respecting participants, and recognizing limitations of evidence)
Connections to prior and future topics
This first lecture lays the groundwork for exploratory data analysis, later moving toward inference, probability, and sampling distributions
Subsequent lectures will cover exploratory analysis through to inference, building on the data basics established here
Real-world relevance and applications
Statistics informs medical decision-making, public health policy, market research, sports analytics, and many other fields
Statistical literacy helps you evaluate claims in news, research articles, and policy debates
Ethical and practical implications
Emphasizes the importance of proper study design, avoiding inappropriate generalizations, and recognizing biases in data collection
Highlights the role of transparency (sharing data, methods) and the limits of extrapolating from small or non-representative samples
Quick tips for study preparation
Review the difference between population and sample and be able to identify the population for a given study
Practice identifying variable types from sample survey questions and data examples
Work through the case study proportions and practice converting to percentages and percentage point differences
Familiarize yourself with the OpenIntro open-access textbook and its R lab resources before the hands-on weeks
Next steps in course progression
Expect to begin using R and RStudio after week 1; complete Intro to R labs on OpenIntro before deep statistical analyses
Prepare for the first in-class exam and the associated cheat sheet; plan assignments and quizzes on the two-week timeline
Follow the schedule: Lecture 1 today (Introduction to Data); Lecture 2 on September 8 (Summarizing and Reading assignments); and continue with planned exam dates and breaks