Statistics Unit 1 Flashcards

1.1 What Is/Are Statistics

Statistics is the science of collecting, organizing, and interpreting data.
Statistics are the data (numbers or other pieces of information) that describe or summarize something.
Population (in a statistical study) = the complete set of people or things being studied.
Population Parameters = specific numbers describing characteristics of the population.
Sample = a subset of the population from which data are actually obtained.
Raw data = the actual measurements or observations from the sample.
Sample statistics = numbers describing characteristics of the sample found by consolidating or summarizing the raw data.
Ex 1: Identify population and population parameters
- a) You are tasked by your student organization to determine what percentage of East A&M students have perfect attendance in their classes.
- Population: all East A&M students.
- Population parameter of interest: the true percentage of East A&M students with perfect class attendance.
- b) You are a health policy worker studying whether a new vaccine is effective in preventing infection and severe symptoms from Covid-19.
- Population: all individuals exposed to Covid-19 in the context of the vaccine study (or the target population defined by the policy).
- Population parameters: the true infection rate and the true rate of severe symptoms among vaccinated individuals.
- c) You are hired by a manufacturing company to determine the weights in shipments of raw materials being used for processing chips.
- Population: all shipments of raw materials used for processing chips.
- Population parameters: true mean weight, true weight distribution characteristics (e.g., variance).
Ex 2: Identify and describe the sample, the population, the sample statistic, and the population parameter.
- Based on the National Health Interview Survey of about 30,000 adult Americans, it was concluded that about 12.5% of all adult Americans smoked one or more cigarettes in the past week.
- Population: all adult Americans.
- Sample: the 30,000 adults surveyed.
- Sample statistic: 12.5% (the proportion observed in the sample).
- Population parameter: the true proportion of all adult Americans who smoked one or more cigarettes in the past week.
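A minimal Python sketch of how raw data are summarized into a sample statistic. The eight responses below are hypothetical (not the actual survey data), chosen only so the arithmetic is easy to follow.

```python
# Hypothetical raw data: True means the respondent smoked in the past week.
raw_data = [True, False, False, False, False, False, False, False]

sample_size = len(raw_data)
# The sample statistic: the proportion of True responses in the sample.
sample_proportion = sum(raw_data) / sample_size

print(f"{sample_proportion:.1%} of {sample_size} respondents")  # 12.5% of 8 respondents
```

The population parameter stays unknown; the sample proportion is only an estimate of it.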
Ex 3: Margin of Error (MOE) and Confidence Interval (CI)
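- The Margin of Error (MOE) in a statistical study is used to describe a range of values, the Confidence Interval (CI), that is likely to contain the population parameter.
- The CI is obtained by subtracting the MOE from, and adding it to, the sample statistic: CI = [p̂ − MOE, p̂ + MOE], where p̂ stands for the sample statistic.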
- Example calculations:
- a) Pew Research Center survey of 5005 adults; 15% do not use the Internet; MOE = 3 percentage points.
  - CI = [12%, 18%].
- b) Study of 400 birth weights in New York State; mean birth weight = 3152 g; MOE = 68 g.
  - CI = [3084 g, 3220 g].
  - Lower bound: 3152 − 68 = 3084 g; Upper bound: 3152 + 68 = 3220 g.
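The interval arithmetic in Ex 3 can be sketched in Python. The function name `confidence_interval` is made up for illustration; it only subtracts and adds the margin of error.

```python
def confidence_interval(statistic, margin_of_error):
    """Return (lower, upper): the statistic minus and plus the margin of error."""
    return (statistic - margin_of_error, statistic + margin_of_error)

# a) 15% do not use the Internet, MOE = 3 percentage points
internet_ci = confidence_interval(15, 3)
# b) mean birth weight = 3152 g, MOE = 68 g
birth_weight_ci = confidence_interval(3152, 68)

print(internet_ci)      # (12, 18)
print(birth_weight_ci)  # (3084, 3220)
```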

1.2 Basic Steps in a Statistical Study

Five basic steps:
1) State the goals of the study precisely.
2) Choose a representative sample from the population.
3) Collect raw data from the sample and summarize these data by finding sample statistics of interest.
4) Use the sample statistics to make inferences about the population.
5) Draw conclusions.
Mapping the steps to terms (Ex 3a from Section 1.1):
- Goals: % of adults who do not use the Internet.
- Sample: the 5005 surveyed adults.
- Sample statistic: 15% of surveyed adults who do not use the Internet.
- Inference: Use a confidence interval to estimate the total percentage of U.S. adults who do not use the Internet.
- Conclusion: Use the confidence interval to make a claim about the population.
One of the most important purposes of statistics: to help us make good decisions about issues involving uncertainty.

1.3 Sampling

Defn: Census = data from every member of the population.
Defn: Representative sample = a sample whose members’ characteristics are generally the same as the population.
Defn: A statistical study suffers from bias if its design or conduct tends to favor certain results.
Ex 1: A social media personality reports on annual incomes from 873 responses out of 40,000 emailed questionnaires.
- Low response rate can lead to non-representative results (bias).
Simple Random Sampling (SRS): A sample chosen in such a way that every possible sample of the same size has an equal chance of being selected.
Ex 2: Poll of all East A&M students using random names from those who attended new student orientation is NOT a simple random sample (orientation attendees are not a complete cross-section of all students).
Systematic Sampling: Use a system to choose the sample (e.g., select every 10th or 50th member).
Ex 3: Interviewing a library visitor exactly every 15 minutes is Systematic Sampling (not Simple Random). Rationale: it is practical to carry out, but it may introduce bias if visitor traffic follows a pattern tied to the schedule.
Convenience Sampling: A sample that is convenient to select (often biased).
Ex 4: Free taste tests at a supermarket's in-store morning stand = Convenience Sampling. Likely not representative of all shoppers.
Cluster Sampling: Divide population into clusters, randomly select some clusters, then sample all members within selected clusters.
Ex 5: Measure student satisfaction across all A&M campuses using cluster sampling (sample some campuses and survey all students in those campuses).
Stratified Sampling: Divide population into strata (subgroups) with distinct characteristics; draw a random sample within each stratum; total sample is the union of the strata samples.
Ex 6: Measure average gas prices across the United States using stratified sampling (stratify by region or other subgroups to capture variation).
Ex 7 (Identifying sampling type and representativeness):
- a) Systematic Sampling: Every 7500th person in the phone listings is selected for a paid focus group.
- b) Cluster Sampling: 80 trains randomly selected; survey all passengers on those trains.
- c) Simple Random Sampling: 1056 adults called from randomly generated numbers; 90% identified president.
- d) Stratified Sampling: Day split into morning/afternoon/evening; 3 randomly selected times in each part.
- e) Convenience Sampling: Measuring strength of fingers of the researcher’s family members.
- f) Stratified Sampling: Random sample of 250 males and 250 females in a heart disease study (age 65+); note: the strata are defined by sex.
- g) Simple Random Sampling: Johns Hopkins list of transplant patients; select by 50 random numbers.
- h) Cluster Sampling: Randomly select 20 police precincts; interview all officers in those precincts.
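The four random sampling methods above can be sketched with Python's standard `random` module. The function names, the 100-student population, and the strata and cluster groupings are all hypothetical, made up for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Take every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a random sample inside each stratum, then combine them."""
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of each."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return [member for name in picked for member in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
students = [f"student_{i}" for i in range(100)]  # hypothetical population

srs = simple_random_sample(students, 10, rng)
every_tenth = systematic_sample(students, 10)
by_year = {"first_year": students[:50], "senior": students[50:]}  # made-up strata
stratified = stratified_sample(by_year, 5, rng)
campuses = {"A": students[:30], "B": students[30:60], "C": students[60:]}  # made-up clusters
clustered = cluster_sample(campuses, 2, rng)
```

Convenience sampling has no function here because it involves no random selection: the sample is simply whoever is easiest to reach.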

1.4 Types of Statistical Studies

Subjects are the people, animals, or objects chosen for the sample.
Two basic types:
- Observational Study: researchers observe or measure characteristics of subjects without attempting to influence them.
  - Retrospective (case-control): uses data from the past.
  - Prospective (longitudinal): follows groups that share common factors and collects data going forward.
- Experiment: researchers apply a treatment to some or all subjects and observe the effects.
  - Treatment group: receives the treatment.
  - Control group: does not receive the treatment.
Ex 1: Identify as observational or experiment
- a) Tempered glass used for car windows; researcher tests strengths by heating to 620°C.
- This is an Experiment (manipulation of conditions to observe effect).
- b) Poll asking East A&M students if they commute or live on campus.
- Observational (no manipulation introduced).
Defn: Variable = any item or quantity that can take on different values. The variable of interest = what the study seeks to measure.
Ex 2: Identify the variables of interest for both parts of example 1.
- a) Variable: strength of tempered glass under high temperature.
- b) Variable: commuting status (commute vs live on campus) and possibly attitudes or transportation choices as outcomes.
Ex 3: You want to measure the quality of life of children born with fetal alcohol syndrome. What type of study should you do? (Note: unethical to conduct certain trials.)
- Observational study (to observe outcomes without manipulating variables).
Defn: Confounding variables = variables that lead to confusion in statistical studies by mixing with the variable of interest, making it hard to determine separate effects.
Strategies for Selecting Treatment and Control Groups
- Randomly assign participants to treatment or control groups (each participant has equal chance).
- Use sufficiently large groups to reduce the chance that groups differ in a significant way.
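A minimal sketch of random assignment in Python, assuming a hypothetical list of 20 participants. The helper name `randomly_assign` is made up for illustration; it shuffles the participants and splits them in half, so each person has the same chance of landing in either group.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle a copy of the participants and split it into two halves."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
participants = [f"participant_{i}" for i in range(20)]  # hypothetical subjects
treatment, control = randomly_assign(participants, rng)

print(len(treatment), len(control))  # 10 10
```

With groups this small, chance alone could still leave the two groups noticeably different, which is why the second strategy calls for sufficiently large groups.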
Ex 4: The Salk Polio Vaccine (1954)
- Sample: 400,000 children from the population of all U.S. children.
- Treatment group: children who received the vaccine.
- Control group: children who received a salt-water injection (placebo).
- Outcomes: 33 polio cases in the vaccine group vs 115 cases in the placebo group.
- How the two strategies for selecting treatment and control groups were implemented: children were randomly assigned to the vaccine or placebo group, and the groups were very large (400,000 children in total). A placebo was also used so that participants did not know which treatment they received.
Defn: Placebo = lacks active ingredients but looks/feels like the treatment; used so participants do not know which they receive.
Defn: Placebo effect = improvement due to belief in treatment, not the treatment itself.
Ex 5: What was the placebo used in example 4? Why used? (Salt-water injection served as placebo; to control for placebo effect and to blind participants to treatment assignment.)
Defn: An experimenter effect = any unintended influence by the experimenter on subjects.
Blinding
- Single-blind: participants do not know whether they are in the treatment or control group; researchers do know.
- Double-blind: neither participants nor researchers know who is in which group.
Ex 6: Determine the most appropriate type of statistical study for each scenario (elaborate)
- a) What is the average income of stockbrokers? Observational (do not manipulate).
- b) Do seatbelts save lives? Observational (cannot ethically assign seatbelts as a treatment in many settings).
- c) Can lifting weights improve runners' times in a 10K? Experimental (introduce an intervention).
- d) Does skin contact with a particular glue cause a rash? Observational (observe outcomes without assigning exposure).
- e) Can a herbal remedy reduce the severity of colds? Experimental (introduce herbal remedy exposure).
- f) Do supplements of resveratrol increase life span? Observational (measure natural variation; ethical/feasibility concerns for a long-term experiment).

2.1 Qualitative vs Quantitative Data

Definition: Qualitative data consist of values that can be placed into nonnumerical categories.
Definition: Quantitative data consist of values representing counts or measurements.
Ex 1 (Classification in a survey):
- a) Brand names of sodas in customer survey — Qualitative.
- b) Scores achieved by students during a multiple-choice exam — Quantitative.
- c) Letter grades on a project presentation — Qualitative.
- d) Numbers on uniforms that identify players on a volleyball team — Qualitative.
Key idea: Qualitative vs Quantitative is about whether values are categories/labels or numerical measurements.

Continuous vs Discrete Data

Definition: Continuous data can take on any value in a given interval.
Definition: Discrete data can take on only particular, distinct values and not other values in between.
Ex 2 (Indicate discrete or continuous):
- a) Measurements of student mile times at a track event — Continuous.
- b) The numerical years of the calendar — Discrete.
- c) The numbers of students in different classes — Discrete.
- d) The amount of feed consumed by animals at an animal sanctuary — Continuous.
Practical note: Discrete data arise from counting; continuous data arise from measurements that can be infinitely precise within a range.

Levels of Measurement

Nominal level
- Characterized by data that consist of names, labels, or categories only.
- Data are qualitative and cannot be ranked or ordered.
Ordinal level
- Applies to qualitative data that can be arranged in some order.
- Generally does not make sense to perform computations with data at the ordinal level.
Interval level
- Applies to quantitative data for which intervals are meaningful, but ratios are not.
- Data at this level have an arbitrary zero point.
Ratio level
- Applies to quantitative data for which both intervals and ratios are meaningful.
- Data at this level have a true zero point.
- Arithmetic operations make sense: $a + b,\ a - b,\ a \times b,\ \frac{a}{b}$
Summary of operational implications:
- Nominal: mode, frequencies; no meaningful order; no arithmetic.
- Ordinal: median and nonparametric comparisons; order matters but differences are not necessarily equal.
- Interval: differences meaningful; can compute means and standard deviations; ratios not meaningful.
- Ratio: all arithmetic operations meaningful; supports all standard statistical methods.
Quick example connections:
- Interval vs Ratio: Temperature in Celsius is interval (its zero is arbitrary), while time taken to walk is ratio (it has a true zero: a time of zero means no time elapsed).
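A short Python sketch of why ratios are not meaningful at the interval level. The two temperatures are hypothetical, and the Kelvin scale is brought in as an assumption for comparison only, since it measures the same quantity with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Kelvin has a true zero, so it is a ratio-level scale."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0  # hypothetical readings in Celsius

# Differences are meaningful at the interval level: same size on either scale.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: "twice as hot" depends on where the zero point sits.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(diff_c, round(diff_k, 2))    # 10.0 10.0
print(ratio_c, round(ratio_k, 3))  # 2.0 1.035
```

The difference survives the change of scale, but the ratio of 2 in Celsius shrinks to about 1.035 in Kelvin, so saying 20°C is "twice as hot" as 10°C is not a meaningful claim.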

Ex 3: Level of Measurement (Identify the level for each data set)

a) Student ID numbers at East Texas A&M
- Level: Nominal
- Reason: They are identifiers (names/labels) used to distinguish individuals, not to measure quantity or order.
b) Student rankings of food or drink places on campus
- Level: Ordinal
- Reason: The data can be ordered (rank 1st, 2nd, etc.), but the differences between ranks are not necessarily equal.
c) Calendar years of important historic events
- Level: Interval
- Reason: Differences between years are meaningful, but the zero point of the calendar is arbitrary (it does not mark the absence of time), so ratios of years are not meaningful.
d) Temperatures of ill patients measured in Celsius
- Level: Interval
- Reason: Intervals are meaningful (difference between temperatures), but there is no true zero for the Celsius scale.
e) Times it takes students to walk between classes
- Level: Ratio
- Reason: Times are quantitative with a true zero (0 minutes means no time); both intervals and ratios are meaningful, so arithmetic operations are valid.

Quick Takeaways

Always classify data first as Qualitative vs Quantitative; then, for Quantitative data, decide Continuous vs Discrete; finally, for either type, determine the appropriate Level of Measurement (Nominal, Ordinal, Interval, Ratio).
The level of measurement dictates what kinds of summaries and statistical analyses are appropriate.
Real-world relevance: choosing the right analysis depends on the measurement level (e.g., mean vs median, parametric vs nonparametric tests, arithmetic operations).