Statistics Midterm 1 Review Flashcards

Foundational Statistical Definitions and Data Types

Qualitative Data: This type of data consists of attributes, labels, or non-numerical entries. It describes characteristics that cannot be naturally measured on a numerical scale, such as names, colors, or categories.
Population and Sample Relationship:
- Population: The entire collection of all individuals, objects, or measurements whose properties are being studied.
- Sample: A subset, or part, of a population. It is used to make inferences about the larger population when the whole cannot be measured.

Levels of Measurement

In statistics, data is classified into four levels of measurement, ordered from lowest (nominal) to highest (ratio); each level keeps the properties of the levels below it. The following examples from the examination illustrate these levels:

Nominal Level: This level consists of names, labels, or qualities. No mathematical computations can be made.
- Example: Social security numbers. Even though they are made of digits, they function as unique identifiers (labels) rather than quantities.
Ordinal Level: Data can be arranged in order or ranked, but differences between data entries are either not meaningful or cannot be determined mathematically.
- Example: Video game ratings (e.g., E for Everyone, T for Teen, M for Mature). These represent a rank of content maturity, but the "distance" between ratings is not quantifiable.
Interval Level: Data can be ordered, and meaningful differences between data entries can be calculated. However, a zero entry represents a position on a scale rather than an inherent absence of something (no "true zero").
- Example: Average temperature of Bakersfield, CA. Whether expressed in degrees Fahrenheit or Celsius, zero is just a point on the scale and does not mean "no temperature," so $80^\circ$ is not "twice as hot" as $40^\circ$.
Ratio Level: This is the highest level of measurement. It possesses all the properties of interval data, with the added property that a zero entry is an inherent zero (representing a total absence of the quantity).
- Example: Amount of money in a retirement account. Having $0$ dollars means a complete lack of funds, and you can meaningfully say one account has twice as much as another.

Sampling Techniques and Methodology

Identifying the correct sampling method is vital for the validity of statistical results. Common techniques include:

Systematic Sampling: Members of the population are selected at regular intervals from a starting point, such as every $k^{\text{th}}$ member of an ordered list.
- Example: A vitamin supplement producer studies the exercise habits of every $5^{\text{th}}$ person who comes to the gym to work out.
Simple Random Sampling: Every possible sample of the same size has the same chance of being selected.
- Example: All possible $30$-person subgroups are listed from a group of $300$ potential jurors. One of those subgroups is randomly chosen.
Stratified Sampling: The population is divided into groups (strata) and a sample is taken from each group, often proportionally.
- Example: A proportional number of students from each major are asked if they work while attending college.
Convenience Sampling: Using results that are very easy to get.
- Example: A person surveys her coworkers about their political opinions. This is non-random and often biased.
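The three random techniques above can be sketched in Python with the standard `random` module. This is only an illustration: the numbered juror list, the `majors` groups, the 25% sampling fraction, and the function names are assumptions made for the sketch, not part of the exam. Convenience sampling is left out because it involves no selection rule at all.

```python
import random

def systematic_sample(population, k, start=0):
    """Select every k-th member, beginning at index `start`."""
    return population[start::k]

def simple_random_sample(population, n, rng):
    """Draw n members so that every n-member subset is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample

rng = random.Random(0)        # fixed seed so the sketch is repeatable
jurors = list(range(1, 301))  # hypothetical juror ID numbers 1..300

every_fifth = systematic_sample(jurors, k=5, start=4)  # 5th, 10th, 15th, ...
jury_pool = simple_random_sample(jurors, 30, rng)      # one 30-person subgroup

# Hypothetical strata: students grouped by major.
majors = {
    "Biology": [f"bio{i}" for i in range(40)],
    "History": [f"his{i}" for i in range(20)],
}
surveyed = stratified_sample(majors, fraction=0.25, rng=rng)

print(len(every_fifth), len(jury_pool), len(surveyed))
```

Note that `stratified_sample` keeps each major's share of the sample equal to its share of the population, which is what "proportional" means in the example above.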

Statistical Ethics and Critical Analysis of Studies

Case Study: Sleep Issues Infomercial: A company selling sleeping pills via a 2:30 a.m. infomercial invited callers to discuss sleep issues, concluding that $87\%$ of people suffer from insomnia.
Critique of Results: These results should not be trusted due to several factors:
- Voluntary Response Bias: Only people interested in the topic or those awake at 2:30 a.m. (likely because they cannot sleep) are responding. This does not represent the general population.
- Time Bias: The broadcast time (2:30 a.m.) specifically targets the demographic suffering from the very condition being studied.
Confounding Variables: Possible confounding variables include existing medical conditions, stress levels, caffeine intake, or the environment of the callers, which were not controlled for in the call-in format.

Data Organization: Frequency Distributions and Histograms

Raw Data Set (Old Faithful Geyser Eruption Time in Minutes):
- Row 1: $56$, $62$, $70$, $79$, $81$, $82$, $84$, $86$, $89$, $97$
- Row 2: $57$, $62$, $73$, $79$, $81$, $83$, $85$, $87$, $90$, $98$
- Row 3: $58$, $62$, $74$, $79$, $82$, $83$, $86$, $88$, $91$, $100$
- Row 4: $61$, $67$, $78$, $79$, $82$, $83$, $86$, $89$, $94$, $102$
- Row 5: $62$, $69$, $78$, $79$, $82$, $84$, $86$, $89$, $95$, $104$
Frequency Distribution Construction (7 Classes):
- Class Width Calculation: $\text{Class Width} = \frac{\text{Max} - \text{Min}}{\text{Number of Classes}} = \frac{104 - 56}{7} = \frac{48}{7} \approx 6.86$. Round up (never down) to $7$ so that the seven classes cover every value from $56$ to $104$.
- The first lower class limit is the minimum: $56$ .
- Frequency distribution columns must include: Class Limits, Boundaries, and Frequency.
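As a rough check on the construction above, the following Python sketch tallies the 50 values into 7 classes of width 7 starting at 56. The variable names are illustrative, and placing the class boundaries 0.5 below and above the class limits is an assumed convention for whole-number data, not something stated in the notes.

```python
import math

data = [
    56, 62, 70, 79, 81, 82, 84, 86, 89, 97,
    57, 62, 73, 79, 81, 83, 85, 87, 90, 98,
    58, 62, 74, 79, 82, 83, 86, 88, 91, 100,
    61, 67, 78, 79, 82, 83, 86, 89, 94, 102,
    62, 69, 78, 79, 82, 84, 86, 89, 95, 104,
]

num_classes = 7
# Range divided by the number of classes, rounded up: 48 / 7 -> 7
width = math.ceil((max(data) - min(data)) / num_classes)

table = []
lower = min(data)  # first lower class limit is the minimum
for _ in range(num_classes):
    upper = lower + width - 1  # upper class limit for whole-number data
    frequency = sum(lower <= x <= upper for x in data)
    # Assumed convention: boundaries sit 0.5 outside the class limits.
    table.append((lower, upper, lower - 0.5, upper + 0.5, frequency))
    lower += width

print("Limits    Boundaries    Frequency")
for lo, hi, lo_b, hi_b, freq in table:
    print(f"{lo}-{hi}    {lo_b}-{hi_b}    {freq}")
```

Under these assumptions the classes run from 56-62 through 98-104, and the class frequencies are 8, 2, 3, 16, 13, 4, and 4, which sum to 50.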
Histogram and Shape: A histogram is a bar graph of a frequency distribution. Common shapes include symmetric (bell-shaped), skewed right (tail to the right), or skewed left (tail to the left).

Descriptive Summary and Visualization (The Five-Number Summary)

To construct a Box-and-Whisker Plot, five specific values must be identified:

Minimum: The lowest value in the data set ($56$).
First Quartile ($Q_1$): The median of the lower half of the ordered data.
Median ($Q_2$): The middle value of the ordered data set (the mean of the two middle values when the number of entries is even).
Third Quartile ($Q_3$): The median of the upper half of the ordered data.
Maximum: The highest value in the data set ($104$).
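The five values can be found for the eruption data with a short Python sketch. It follows the median-of-halves definition given above; the function name `five_number_summary` is hypothetical, and other quartile conventions (such as those used by some software packages) can give slightly different results.

```python
from statistics import median

data = [
    56, 62, 70, 79, 81, 82, 84, 86, 89, 97,
    57, 62, 73, 79, 81, 83, 85, 87, 90, 98,
    58, 62, 74, 79, 82, 83, 86, 88, 91, 100,
    61, 67, 78, 79, 82, 83, 86, 89, 94, 102,
    62, 69, 78, 79, 82, 84, 86, 89, 95, 104,
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]          # entries below the median position
    upper_half = ordered[half + n % 2:]  # skip the middle entry when n is odd
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )

summary = five_number_summary(data)
print(summary)
```

With 50 entries, each half holds 25 values, so $Q_1$ is the 13th ordered value ($74$), the median is the mean of the 25th and 26th values ($82$), and $Q_3$ is the 38th ordered value ($88$).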

Measures of Central Tendency: Weighted Mean (GPA Calculation)

Grade Point Average (GPA) is a weighted mean where the grade is the value ($x$) and the units/credits are the weights ($w$).

Standard Grade Values: $\text{A} = 4$, $\text{B} = 3$, $\text{C} = 2$, $\text{D} = 1$, $\text{F} = 0$.
Student Data:
- Class 1: $4$ units, Grade: A ($4$ points)
- Class 2: $5$ units, Grade: B ($3$ points)
- Class 3: $2$ units, Grade: C ($2$ points)
GPA Formula: $\text{GPA} = \frac{\sum (x \times w)}{\sum w} = \frac{(4 \times 4) + (3 \times 5) + (2 \times 2)}{4 + 5 + 2} = \frac{16 + 15 + 4}{11} = \frac{35}{11} \approx 3.18$
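The same calculation can be written as a small Python sketch. The function name `weighted_mean` and the list names are illustrative choices, not part of the exam.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

grade_points = [4, 3, 2]  # A, B, C
units = [4, 5, 2]         # weights for Class 1, Class 2, Class 3

gpa = weighted_mean(grade_points, units)
print(round(gpa, 2))  # 3.18
```

Because the 5-unit B carries the most weight, the GPA lands closer to 3 than the unweighted average of the three grades would suggest.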

Measures of Relative Position: The Z-Score

To compare values from different populations, we use the standard score, or z-score, which measures how many standard deviations ($\sigma$) a value ($x$) lies above or below the mean ($\mu$).

Z-score Formula: $z = \frac{x - \mu}{\sigma}$
Comparison Example: Income in Middle City vs. Richville
- Middle City: $\mu = \$65,472$, $\sigma = \$3,963$
  - Income being evaluated ($x$): $\$75,400$
  - $z_{\text{Middle City}} = \frac{75400 - 65472}{3963} \approx 2.505$
- Richville: $\mu = \$121,473$, $\sigma = \$15,279$
  - Income being evaluated ($x$): $\$135,000$
  - $z_{\text{Richville}} = \frac{135000 - 121473}{15279} \approx 0.885$
Conclusion: The income of $\$75,400$ in Middle City is relatively higher because its z-score ($2.505$) is much larger than the z-score for $\$135,000$ in Richville ($0.885$). The Middle City income is over $2.5$ standard deviations above its mean, while the Richville income is less than $1$ standard deviation above its mean.
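The comparison can be reproduced with a minimal Python sketch of the z-score formula. The function and variable names are illustrative only.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

z_middle_city = z_score(75_400, 65_472, 3_963)
z_richville = z_score(135_000, 121_473, 15_279)

print(round(z_middle_city, 3), round(z_richville, 3))  # 2.505 0.885

# The larger z-score marks the income that is higher relative to its own city.
relatively_higher = "Middle City" if z_middle_city > z_richville else "Richville"
print(relatively_higher)  # Middle City
```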