Chapter 1 & Chapter 2
Statistical Life Cycle and Data Types
- Review of week 1 Canvas notes: addressing Echo sound issue; aim to record the class if needed.
- Chapter 1 focus:
- Statistical life cycle stages (brief recap): data collection, data organization/description, analysis, and drawing conclusions (inferential). Descriptive statistics cover the first three phases (collecting, representing, and summarizing data). Inferential statistics use that data to answer questions and draw conclusions beyond what was observed.
- Key terms introduced earlier:
- Population vs. sample
- Descriptive vs. inferential statistics
- Purpose of defining the question and the research objective
Descriptive vs Inferential Statistics
- Descriptive statistics
- Purpose: summarize and describe data; no conclusions beyond the observed data.
- Examples from the transcript:
- Large sample of men studied for 18 years: 70% of unmarried men alive at 65; 90% of married men alive at 65. This is descriptive: it reports proportions without implying causation or broader conclusions.
- If you appended an interpretation (e.g., "married men are happier, so more live to 65"), that would be an inference the data alone do not support, and it does not belong in a descriptive summary.
- Another example: in a sample of 700 parents, 45% provide some financial support to freshmen and 25% to seniors; the statement that support decreases with age is an interpretation, not just a description.
- Inferential statistics
- Involves drawing conclusions, making inferences, or answering questions based on the data collected.
- Requires careful consideration of study design, sampling, and potential biases.
Qualitative vs Quantitative Data
- Qualitative (categorical) data
- Data described by quality or category rather than numeric value.
- Examples:
- Type of TV show (comedy, drama, reality, sports, news) — qualitative description of genre.
- Blood type categories (A, B, AB, O) — categorical.
- Characteristics:
- Arithmetic is generally not meaningful; you can organize, count, or summarize the categories, but not compute numeric summaries such as averages.
- Quantitative (numeric) data
- Data expressed as numbers; suitable for arithmetic and statistical calculations.
- Examples:
- Height of hot air balloons — numeric (quantitative).
- Wait time in a grocery checkout — numeric (time, quantitative).
- Student ID number: numeric in form, but treated as an identifier (qualitative) rather than a quantity. This was debated in class; the conclusion is that IDs serve as labels, not numeric measurements to compute with.
- Debit card number: likewise a label, not a quantity for computation.
- Quick rule of thumb from the talk:
- If you can compute an average from the data and it makes sense, it’s quantitative.
- If you’re simply labeling, sorting, or categorizing, it’s qualitative.
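The rule of thumb can be tried out in a short Python sketch. All values below are made up for illustration and are not data from the lecture.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, for illustration only.
wait_times_min = [2.5, 4.0, 3.5, 6.0]          # quantitative: an average makes sense
blood_types = ["A", "O", "O", "AB", "B", "O"]  # qualitative: count categories instead
student_ids = [1048213, 1051177, 1039954]      # numeric-looking labels

avg_wait = mean(wait_times_min)     # meaningful: average wait in minutes
type_counts = Counter(blood_types)  # meaningful: how many of each category
avg_id = mean(student_ids)          # computable, but it describes no real quantity

print(avg_wait)
print(type_counts.most_common(1))
```

The average of the ID numbers can be calculated, but it answers no sensible question, which is why IDs are treated as qualitative labels.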
Discrete vs Continuous Variables
- Discrete variables
- Countable quantities, often integers: 1, 2, 3, 4, …
- Examples:
- Number of home runs by a player in a season (you cannot have 3.5 home runs in a season).
- Caution: driver’s license numbers or jersey numbers look like whole numbers, but they are identifiers (qualitative labels), not discrete quantities to be computed.
- Continuous variables
- Measured quantities that can take on any value within a range (theoretically infinite precision depending on measurement instrument).
- Between any two values there are, in principle, infinitely many possible values.
- Examples:
- Time (hours, minutes, seconds with fractions).
- Weight or height.
- Money: strictly speaking it is countable in its smallest unit (pennies), but it is often treated as continuous for measurement purposes.
- Practical distinction:
- If you can count the values (e.g., number of items), it’s discrete.
- If you measure a quantity that can have non-integer values (e.g., weight, time), it’s continuous.
Four Levels of Measurement
- Nominal (qualitative)
- Classifies data into categories with no intrinsic order.
- No mathematical computation meaningful for the nominal level.
- Examples: gender; blood type (A, B, AB, O); driver’s license number (a label).
- Ordinal (qualitative with order)
- Categories have a meaningful order, but differences between categories are not necessarily equal.
- Example: ranking of movies; satisfaction levels on a scale (e.g., 1–5 stars) where higher is better but the distance between levels is not necessarily equal.
- Interval (quantitative)
- Numeric scale with equal intervals, but no true zero (zero does not indicate absence of the quantity).
- Example: temperature in Celsius or Fahrenheit; IQ scores (often treated as interval in some contexts); differences are meaningful, ratios are not.
- Key point: no absolute zero; you can add and subtract but not form meaningful ratios.
- Ratio (quantitative)
- Numeric scale with equal intervals and a true, meaningful zero (absence of the quantity).
- Examples: height, weight, time duration when zero means none; test scores can be ratio if zero means no points and ratios are meaningful, but context can allow interval interpretation as well.
- How to determine level (general guidance):
- Are data numbers or words? If numbers, consider interval or ratio; determine if absolute zero exists.
- If data are words or ordered categories with no numeric computation, consider ordinal or nominal.
- If sorting adds information, it’s closer to ordinal; if there is an absolute zero and meaningful ratios, consider ratio.
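The interval vs ratio distinction can be checked with a little arithmetic. The sketch below uses two illustrative temperatures (not from the lecture) and relies on the fact that Kelvin has a true zero while Celsius does not.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # illustrative temperatures

difference_c = t2_c - t1_c  # differences are meaningful on an interval scale
naive_ratio = t2_c / t1_c   # 2.0, but NOT meaningful: 0 °C is not "no temperature"
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 3))
```

On the ratio scale, 20 °C is only about 3.5% warmer than 10 °C, not "twice as hot", which is why ratios are avoided at the interval level.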
Data Classification Practice (examples discussed)
- Driver’s license number → Nominal (label, no meaningful arithmetic)
- Length of a song in seconds → Ratio (quantitative time measurement; zero seconds means no duration, so ratios are meaningful)
- Ranking of movies or past-year chart position → Ordinal (order matters, not necessarily equal intervals)
- Highest score level completed in a course or game → Could be ordinal or ratio depending on whether the metric supports meaningful arithmetic and a true zero; context-dependent
- Satisfaction with service on a 1–5 star scale → Strictly ordinal (ordered levels, distances not necessarily equal), but often treated as interval in practice; zero does not imply absence of satisfaction
- Gender of patient → Nominal
- Test scores → Could be interval or ratio depending on whether zero indicates no correct answers; interpretation can vary by context
- Other: survey responses like “from 0 to 5” satisfaction levels → Often treated as interval (zero is a point on the scale, not a total lack of satisfaction), though strictly such ratings are ordinal
Sampling Techniques
- Random sampling
- Every member of the population has an equal chance of being selected.
- Example: an email survey sent to randomly chosen people. Each person had an equal chance of being selected, but responding is voluntary, so who actually answers is not guaranteed (a possible source of bias).
- Systematic sampling
- Select every k-th member of the population (k is fixed).
- Examples:
- Approaching every 10th or every 100th person in a mall survey.
- From a roster, selecting every 50th student.
- Convenience sampling
- Use subjects that are easily accessible.
- Example: using students already in a class to participate in a survey.
- Stratified sampling
- Divide population into subgroups (strata) and take a random sample from each subgroup.
- Example: study HCC students by campus; draw random samples from Central, Northline, Eastside, etc.
- Cluster sampling
- Divide population into subgroups (clusters), select some entire clusters (ideally at random), and sample everyone within the selected clusters.
- Example: survey all students at two selected campuses rather than at all campuses.
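The four probability-based techniques above can be sketched in Python. The roster, the student labels, the sample sizes, and all function names are hypothetical. The random starting point in the systematic sample is a common convention that the notes do not mention, and the clusters here are chosen at random.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical roster: 300 students, 100 per campus.
campuses = ["Central", "Northline", "Eastside"]
roster = [(f"S{i:03d}", campuses[i % 3]) for i in range(300)]

def simple_random_sample(population, n):
    """Every member has the same chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, k):
    """Random start within the first k members, then every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(population, n_per_stratum):
    """Random sample of n_per_stratum students from each campus (stratum)."""
    sample = []
    for campus in campuses:
        stratum = [s for s in population if s[1] == campus]
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sample

def cluster_sample(population, n_clusters):
    """Pick whole campuses (clusters) at random and keep everyone in them."""
    chosen = set(rng.sample(campuses, n_clusters))
    return [s for s in population if s[1] in chosen]

srs = simple_random_sample(roster, 30)
systematic = systematic_sample(roster, 50)
stratified = stratified_sample(roster, 10)
cluster = cluster_sample(roster, 2)
```

Convenience sampling is left out on purpose: it involves no random selection, so there is nothing to simulate beyond taking whoever is at hand.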
- Non-sampling error vs sampling error
- Sampling error: the natural variability that occurs by chance when taking a sample (can be reduced by better sampling design and larger samples).
- Non-sampling error: errors not due to sampling (e.g., poorly worded questions, sensitive questions leading to dishonesty, measurement bias).
- Example of non-sampling error: asking about drinking and driving when respondents may not answer truthfully; cannot be fully corrected by sampling once data are collected.
- Improving study design
- The goal is to minimize both kinds of error: sampling error through careful sampling design and adequate sample size, and non-sampling error through careful question design. Some error is unavoidable.
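A small simulation can illustrate how sampling error shrinks with larger samples. This is only a sketch under assumed conditions: the population of wait times is invented, and the sample sizes and number of repeats are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical population: 10,000 wait times in minutes.
population = [rng.uniform(0, 10) for _ in range(10_000)]
true_mean = mean(population)

def average_sampling_error(n, repeats=200):
    """Average |sample mean - population mean| over many random samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)
        errors.append(abs(mean(sample) - true_mean))
    return mean(errors)

small = average_sampling_error(10)
large = average_sampling_error(1000)
print(small, large)
```

The typical gap between sample mean and population mean is much smaller for samples of 1,000 than for samples of 10. Non-sampling error, such as untruthful answers, would not shrink this way no matter how large the sample.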
Observational vs Experimental Studies
- Observational study
- Observe and collect data without manipulating the subjects or intervening.
- Experimental study
- Involves manipulation or intervention and comparison against a control or alternative condition (common in healthcare studies, such as testing a new procedure vs existing procedure).
- Relevance
- Important distinction for study design and interpretation of results; many healthcare-related investigations rely on experimental designs to assess new procedures, therapies, or tests.
Chapter 2: Organizing and Summarizing Data
Frequency distributions
- Frequency: how often something happens; a count of occurrences.
- Example: a sample of 10 students and the number of classes they took during the summer.
- Purpose: convert a long list of numbers into a summarized form.
Frequency table components
- Data value (x): the specific value or category.
- Frequency (f): how many times that value occurs.
- Total n: the sum of all frequencies; should equal the number of data points.
- Percentage (p): the share of all data points that take that value, $p = \frac{f}{n} \times 100\%$.
- Tally: quick counting marks to ensure accuracy; sums should match n to catch data-entry errors.
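As a minimal sketch, the table components can be built in Python. The ten values below are hypothetical stand-ins for the "classes taken during the summer" example; the actual class data are not given in these notes.

```python
from collections import Counter

# Hypothetical data: classes taken over the summer by 10 students.
classes_taken = [1, 2, 2, 1, 3, 2, 1, 2, 4, 2]

n = len(classes_taken)
freq = Counter(classes_taken)

# Each row: (data value x, frequency f, percentage p = f / n * 100).
table = [(x, f, 100 * f / n) for x, f in sorted(freq.items())]

print(f"{'x':>3} {'f':>3} {'p':>6}")
for x, f, p in table:
    print(f"{x:>3} {f:>3} {p:>5.1f}%")

# Check: the frequencies must add up to n.
assert sum(f for _, f, _ in table) == n
```

The final check plays the same role as comparing the tally total against n by hand.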
Grouped frequency distributions
- Used when data values are too numerous or continuous (e.g., weights, ages).
- Classes (bins) group similar values into ranges rather than each unique value.
- Four requirements for good grouping:
1) Classes must cover the entire data range (no data outside the classes).
2) No overlapping classes.
3) No gaps between classes.
4) Class width must be the same for all classes.
- Examples:
- Classifying weights into ranges (e.g., 50–59, 60–69, etc.).
- Age groups (e.g., 10–19, 20–29, etc.), keeping the same width for every class.
Designing a good frequency distribution (worked example)
- Given a data range from a minimum to a maximum value, decide the number of classes (k) and set a uniform class width w.
- If you know the min and max, you can compute w and boundaries.
- For continuous data, consider boundaries to reflect half-unit adjustments (e.g., 99.5–104.5 for a class that covers 100–104). The boundary concept helps avoid overlap and ensures each value falls into exactly one class.
Boundaries and continuous data
- For continuous data, boundaries are often defined as half-unit margins around class limits to avoid overlap between adjacent classes.
- Example: if class is 100–104, use boundaries like 99.5 and 104.5 to reflect the continuous nature.
Cumulative frequency
- Cumulative frequency for a class is the sum of frequencies up to and including that class: $CF_i = f_1 + f_2 + \dots + f_i$.
- Useful for constructing ogives and checking totals against n.
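Cumulative frequencies are running totals, so a short sketch is enough to show the idea. The class frequencies below are hypothetical.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed in class order.
frequencies = [3, 5, 1, 1]
n = sum(frequencies)

cumulative = list(accumulate(frequencies))  # running totals
print(cumulative)  # [3, 8, 9, 10]

# The last cumulative frequency should equal n.
assert cumulative[-1] == n
```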
Histogram vs Bar Chart
- Histogram
- A vertical bar graph with no gaps between bars.
- Height of each bar represents the frequency (or relative frequency) for each class.
- Left axis (y-axis): frequency; bottom axis (x-axis): class intervals (boundaries or midpoints).
- Useful for continuous or grouped data where adjacent classes are meaningfully connected.
- Bar Chart
- Bars are separated with gaps; used for discrete data or qualitative categories.
- Height represents frequency or proportion for each category.
- Practical nuances
- The x-axis in a histogram is a number line showing class intervals (or midpoints with appropriate binning).
- For discrete data (e.g., number of classes taken), a bar chart is often more appropriate because there are distinct, non-continuous values (0, 1, 2, …).
- In tests or exams, be mindful of which visualization is appropriate given the data type and the groupings you created.
Example Walkthrough: Record High Temperature (Seven Classes)
- Given: record high temperature for 50 states; data range: 100 to 134 degrees (F).
- Task: Construct a grouped frequency distribution with seven classes.
- Steps:
- Determine the overall range: 100 to 134 → range = 134 - 100 = 34 degrees.
- Choose number of classes: k = 7.
- Class width (before rounding): $w = \frac{\max - \min}{k} = \frac{34}{7} \approx 4.857$; always round up to ensure coverage of the entire range, so $w = 5$.
- Class limits (example):
- 100–104, 105–109, 110–114, 115–119, 120–124, 125–129, 130–134 (note: some conventions shift these limits by 0.5 to form boundaries; the key is equal width and full coverage).
- For continuous data, boundaries may be shown as 99.5–104.5, 104.5–109.5, etc., to reflect half-unit boundaries.
- Tally frequencies for each class from the data and compute cumulative frequencies if needed.
- Important rule demonstrated: always round up when calculating class width to ensure the entire data range is included.
- Continuous data nuance:
- For continuous measurements, you often decide whether to use midpoints or boundaries; boundaries help prevent gaps or overlaps in the histogram.
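The walkthrough's arithmetic can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes whole-number data. One assumption goes beyond the notes: if the division comes out exact, the width is bumped by one so the maximum still fits. The 50 state temperatures are not reproduced in these notes, so the tally is shown on a few made-up values only.

```python
def build_classes(minimum, maximum, k):
    """Return (width, limits, boundaries) for k equal-width classes of whole-number data."""
    # Round up; if the division is exact, go one higher so the maximum still fits.
    width = (maximum - minimum) // k + 1
    limits = [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]
    boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
    return width, limits, boundaries

def tally(values, limits):
    """Count how many values fall inside each class."""
    counts = [0] * len(limits)
    for v in values:
        for i, (lo, hi) in enumerate(limits):
            if lo <= v <= hi:
                counts[i] += 1
                break
    return counts

width, limits, boundaries = build_classes(100, 134, 7)
print(width)                          # 5
print(limits[0], limits[-1])          # (100, 104) (130, 134)
print(boundaries[0], boundaries[-1])  # (99.5, 104.5) (129.5, 134.5)

# Made-up values, only to show the tally step.
print(tally([100, 104, 105, 134], limits))  # [2, 1, 0, 0, 0, 0, 1]
```

Because adjacent boundaries share an endpoint (104.5 closes one class and opens the next), histogram bars drawn on boundaries touch with no gaps.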
Summary of Key Concepts and Formulas
- Data types
- Qualitative: nominal, ordinal
- Quantitative: interval, ratio
- Distinctions
- Nominal vs ordinal: ordering in ordinal; nominal has no intrinsic order
- Interval vs ratio: interval has equal intervals but no true zero; ratio has a true zero
- Descriptive vs inferential statistics
- Descriptive: summarizing data (e.g., frequencies, percentages, means, medians)
- Inferential: making conclusions beyond the data (e.g., hypothesis tests, estimations)
- Sampling techniques (recap)
- Random, systematic, convenience, stratified, cluster
- Frequency distribution and grouping
- Frequency table: data value, frequency, total n, percent
- Grouped frequency distribution: equal class width, non-overlapping, complete coverage
- Boundaries for continuous data: use half-unit adjustments to avoid gaps/overlaps
- Visualization tools
- Histogram: no gaps, shows frequency distribution of continuous or grouped data
- Bar chart: gaps between bars, used for discrete or qualitative categories
- Non-sampling vs sampling error
- Non-sampling error: biases from measurement or questions; often not fully correctable
- Sampling error: variability due to sampling; can be reduced by better sampling design
Quick Takeaways for Exam Preparation
- Be able to classify data into nominal, ordinal, interval, or ratio, with justification regarding zero and arithmetic feasibility.
- Distinguish when data are qualitative vs quantitative and when a Likert-type scale is treated as interval vs ordinal.
- Design a frequency distribution: decide if you’ll use a simple (un-grouped) table or a grouped (binned) table with equal class widths, ensuring complete coverage and no overlaps.
- Understand boundaries and cumulative frequency in the context of histograms and data organization.
- Recognize when to use histogram vs bar chart based on data type and class construction.
- Know the difference between observational and