Probability and Statistics Exam Notes

Probability & Venn Diagrams

Introduction & Exam Information

  • Today's session covers makeup time from last week, concluding slides 4.2 and beginning 4.3.

  • An exam is scheduled for Wednesday at 04:15 PM, covering Chapters 1 to 3.

  • Students with accommodations have been contacted regarding arrangements.

  • A specific calculator, available from 'red random shots' for approximately 15, is required.

Venn Diagrams: Visualization of Probability

  • Venn diagrams are used to visualize probability scenarios, starting from slide 32 of 4.2.

  • Sample Space (\mathcal{E}): Represents all possible outcomes. Other notations include S, U (capital U with top line), or \Omega. It's crucial to be aware of these different symbols when encountering external resources.

  • Events (A, B): Any outcome or set of outcomes within the sample space, i.e. a subset of the sample space.

  • Intersection (A \cap B): Denotes the event where both A and B happen simultaneously (read as "A and B").

  • Union (A \cup B): Denotes the event where A happens, or B happens, or both happen (read as "A or B").

  • Independent Events: Two events A and B are independent if the occurrence of one does not affect the probability of the other. The probability of their intersection is the product of their individual probabilities: P(A \cap B) = P(A) \times P(B).

  • Mutually Exclusive Events: Two events A and B are mutually exclusive if they cannot happen at the same time.

    • The probability of their union is the sum of their individual probabilities: P(A \cup B) = P(A) + P(B).

    • Their intersection is the empty set (\emptyset) or has a probability of zero: P(A \cap B) = 0.

  • A without B (A \setminus B): Represents the portion of A that does not include B. This is expressed as "A less B" or "A without B".

  • Complement of A (A', A^c, \bar{A}): Represents all outcomes in the sample space where event A does not occur. The probability of A' is P(A') = 1 - P(A).

Deck of Cards Example

  • A standard deck has 52 cards, comprising four suits (diamonds, spades, hearts, clubs), with 13 cards per suit (Ace, 2-10, Jack, Queen, King).

  • The probability of picking any single particular card is P(\text{card}) = \frac{1}{52}.

  • Example: Picking an Ace (A) or a Diamond (D)

    • First, determine the number of elements in each section of the Venn diagram (n(\cdot)).

      • n(A \cap D) (Ace of Diamonds): 1

      • n(A \setminus D) (Aces that are not Diamonds): 3

      • n(D \setminus A) (Diamonds that are not Aces): 12

      • n(\text{Neither A nor D}): 52 - (3 + 1 + 12) = 36

    • Probabilities using the diagram and the equally likely outcomes principle (favorable outcomes / total outcomes):

      • P(A \cap D) = \frac{1}{52}

      • P(A \cup D) = \frac{n(A \cup D)}{n(\mathcal{E})} = \frac{3 + 1 + 12}{52} = \frac{16}{52}

      • P(A') = \frac{n(A')}{n(\mathcal{E})} = \frac{12 + 36}{52} = \frac{48}{52} (Alternatively, 1 - P(A) = 1 - \frac{4}{52} = \frac{48}{52})

      • P(D \setminus A) = \frac{12}{52}
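
The counts and probabilities above can be checked by building the deck as a Python set. This is only a sketch: the names RANKS, SUITS, deck, aces, diamonds and prob are illustrative and do not come from the slides. Note that Fraction reduces automatically, so 16/52 is shown as 4/13.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["diamonds", "spades", "hearts", "clubs"]

# Sample space: all 52 (rank, suit) pairs.
deck = {(rank, suit) for rank in RANKS for suit in SUITS}

# Events A (ace) and D (diamond) as subsets of the sample space.
aces = {card for card in deck if card[0] == "A"}
diamonds = {card for card in deck if card[1] == "diamonds"}


def prob(event):
    """Equally likely outcomes: favourable outcomes / total outcomes."""
    return Fraction(len(event), len(deck))


# Number of elements in each region of the Venn diagram.
counts = {
    "A and D": len(aces & diamonds),
    "A without D": len(aces - diamonds),
    "D without A": len(diamonds - aces),
    "neither": len(deck - (aces | diamonds)),
}

print(counts)
print("P(A and D) =", prob(aces & diamonds))
print("P(A or D)  =", prob(aces | diamonds))
print("P(A')      =", prob(deck - aces))
print("P(D \\ A)   =", prob(diamonds - aces))
```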

Conditional Probability

  • Conditional Probability (P(A|B)): The probability of event A occurring, given that event B has already occurred.

    • Formula: P(A|B) = \frac{P(A \cap B)}{P(B)}.

    • From a Venn diagram, this means restricting the sample space to only the outcomes in B and then finding the proportion of A within that reduced space: P(A|B) = \frac{n(A \cap B)}{n(B)}.

  • Multiplication Rule (derived from conditional probability): P(A \cap B) = P(A|B) \times P(B) or P(A \cap B) = P(B|A) \times P(A).
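
As a small illustration of the counting form of conditional probability, the sketch below reuses the card example (A = ace, D = diamond). The helper names prob and cond_prob are made up for this sketch. It restricts the sample space to the given event and then confirms that both forms of the multiplication rule return the same P(A \cap D).

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["diamonds", "spades", "hearts", "clubs"]
deck = {(rank, suit) for rank in RANKS for suit in SUITS}
aces = {card for card in deck if card[0] == "A"}
diamonds = {card for card in deck if card[1] == "diamonds"}


def prob(event):
    """P(event) with equally likely outcomes."""
    return Fraction(len(event), len(deck))


def cond_prob(wanted, given):
    """P(wanted | given) = n(wanted and given) / n(given)."""
    return Fraction(len(wanted & given), len(given))


p_ace_given_diamond = cond_prob(aces, diamonds)
p_diamond_given_ace = cond_prob(diamonds, aces)

# The counting form agrees with the formula P(A and D) / P(D).
formula_value = prob(aces & diamonds) / prob(diamonds)

# Multiplication rule, both ways round.
via_given_d = p_ace_given_diamond * prob(diamonds)
via_given_a = p_diamond_given_ace * prob(aces)

print(p_ace_given_diamond, p_diamond_given_ace, via_given_d, via_given_a)
```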

Practice Problems & Concepts

General Probabilities with Venn Diagrams
  • Always start by filling in the intersection when constructing a Venn diagram with probabilities to simplify calculations.

  • Example 1: Probabilities (P(A)=0.5, P(B)=0.2, P(A \cap B)=0.1)

    • Venn Diagram sections: P(A \cap B) = 0.1, P(A \setminus B) = P(A) - P(A \cap B) = 0.5 - 0.1 = 0.4, P(B \setminus A) = P(B) - P(A \cap B) = 0.2 - 0.1 = 0.1, P(\text{Neither}) = 1 - (0.4 + 0.1 + 0.1) = 0.4.

    • P(A \cup B) = P(A \setminus B) + P(B \setminus A) + P(A \cap B) = 0.4 + 0.1 + 0.1 = 0.6 (or using the addition rule: P(A) + P(B) - P(A \cap B) = 0.5 + 0.2 - 0.1 = 0.6).

    • P(B') = P(A \setminus B) + P(\text{Neither}) = 0.4 + 0.4 = 0.8.

    • P(A \cap B') = P(A \setminus B) = 0.4.

    • P(A \cup B') = P(A \setminus B) + P(A \cap B) + P(\text{Neither}) = 0.4 + 0.1 + 0.4 = 0.9.
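
The "fill in the intersection first" approach can be sketched as a small function. The name venn_regions and its dictionary keys are hypothetical. Fractions are used instead of floats so that the results match the hand calculation exactly.

```python
from fractions import Fraction


def venn_regions(p_a, p_b, p_a_and_b):
    """Split the sample space into the four regions of a two-event Venn diagram."""
    only_a = p_a - p_a_and_b            # P(A \ B)
    only_b = p_b - p_a_and_b            # P(B \ A)
    neither = 1 - (only_a + only_b + p_a_and_b)
    return {"both": p_a_and_b, "only_a": only_a, "only_b": only_b, "neither": neither}


# Example 1: P(A) = 0.5, P(B) = 0.2, P(A and B) = 0.1
r = venn_regions(Fraction("0.5"), Fraction("0.2"), Fraction("0.1"))

p_union = r["only_a"] + r["only_b"] + r["both"]        # P(A or B)
p_not_b = r["only_a"] + r["neither"]                   # P(B')
p_a_and_not_b = r["only_a"]                            # P(A and B')
p_a_or_not_b = r["only_a"] + r["both"] + r["neither"]  # P(A or B')

for name, value in [("P(A or B)", p_union), ("P(B')", p_not_b),
                    ("P(A and B')", p_a_and_not_b), ("P(A or B')", p_a_or_not_b)]:
    print(name, "=", float(value))
```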

Conditional Probabilities with Venn Diagrams
  • When solving conditional probability problems using Venn diagrams, first reduce your sample space to only the 'given' event. Then, find the probability of the 'wanted' event within that reduced space.

  • Example 1: (P(A)=0.55, P(B)=0.4, P(A \cap B)=0.15)

    • Venn diagram sections: P(A \cap B) = 0.15, P(A \setminus B) = 0.4, P(B \setminus A) = 0.25, P(\text{Neither}) = 0.2.

    • P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.15}{0.4} = \frac{3}{8}. (Diagram: 0.15 / (0.15 + 0.25), restricted to B circle).

    • P(B|A \cup B) = \frac{P(B \cap (A \cup B))}{P(A \cup B)} = \frac{P(B)}{P(A \cup B)} = \frac{0.4}{0.8} = \frac{1}{2} (Diagram: P(B) / (P(A \setminus B) + P(A \cap B) + P(B \setminus A)) = (0.15+0.25)/(0.4+0.15+0.25) ).

    • P(A'|B') = \frac{P(A' \cap B')}{P(B')} = \frac{P(\text{Neither})}{P(A \setminus B) + P(\text{Neither})} = \frac{0.2}{0.4 + 0.2} = \frac{0.2}{0.6} = \frac{1}{3}. (Diagram: Restricted to B' (0.4 + 0.2), then A' within that (0.2)).
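
The three conditional probabilities above can be reproduced by dividing the "wanted" region by the reduced sample space. This is a minimal sketch and the variable names are illustrative only.

```python
from fractions import Fraction

# Given values from the example.
p_a, p_b, p_both = Fraction("0.55"), Fraction("0.4"), Fraction("0.15")

# Regions of the Venn diagram.
only_a = p_a - p_both                        # P(A \ B)
only_b = p_b - p_both                        # P(B \ A)
neither = 1 - (only_a + only_b + p_both)     # outside both circles

# P(A | B): reduce the sample space to the B circle.
p_a_given_b = p_both / (p_both + only_b)

# P(B | A or B): reduce the sample space to everything inside either circle.
p_b_given_union = (p_both + only_b) / (only_a + p_both + only_b)

# P(A' | B'): reduce the sample space to everything outside B.
p_not_a_given_not_b = neither / (only_a + neither)

print(p_a_given_b, p_b_given_union, p_not_a_given_not_b)
```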

Independent vs. Mutually Exclusive Check Example
  • Independence: Check if P(X|Y) = P(X) (or P(X \cap Y) = P(X)P(Y)). If probabilities change when one event is given, they are dependent.

  • Mutually Exclusive: Check if P(X \cap Y) = 0. If there's an intersection (P(X \cap Y) > 0), they are not mutually exclusive.
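
Both checks can be written as one-line tests. The sketch below uses hypothetical function names and a small tolerance, because decimal probabilities are stored as floats, and applies the checks to the two sets of probabilities used in the earlier Venn diagram examples.

```python
import math


def are_independent(p_x, p_y, p_x_and_y, tol=1e-9):
    """Independent if P(X and Y) equals P(X) * P(Y), up to rounding."""
    return math.isclose(p_x_and_y, p_x * p_y, abs_tol=tol)


def are_mutually_exclusive(p_x_and_y, tol=1e-9):
    """Mutually exclusive if P(X and Y) is zero, up to rounding."""
    return math.isclose(p_x_and_y, 0.0, abs_tol=tol)


# P(A) = 0.5, P(B) = 0.2, P(A and B) = 0.1: since 0.5 * 0.2 = 0.1, independent.
first = (are_independent(0.5, 0.2, 0.1), are_mutually_exclusive(0.1))

# P(A) = 0.55, P(B) = 0.4, P(A and B) = 0.15: since 0.55 * 0.4 = 0.22, dependent.
second = (are_independent(0.55, 0.4, 0.15), are_mutually_exclusive(0.15))

print(first, second)
```

Neither pair is mutually exclusive, since both intersections are greater than zero.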

Rearranging Probability Formulas
  • If the events are not known to be mutually exclusive, use the general addition rule, which holds for any two events: P(A \cup B) = P(A) + P(B) - P(A \cap B).

  • This formula can be rearranged to find an unknown component, e.g., P(A \cap B) = P(A) + P(B) - P(A \cup B).

    • Example: If P(A) = 0.6, P(B) = 0.7, P(A \cup B) = 0.9.

      • P(A \cap B) = 0.6 + 0.7 - 0.9 = 0.4.

Exam Revision: Chapters 1-3

Types of Data
  • Nominal: Categorical data with no inherent order (e.g., gender, colors, names). No numerical operations are meaningful.

  • Ordinal: Categorical data with a meaningful order, but the intervals between categories are not uniform or meaningful (e.g., student grades, restaurant star ratings). You can rank, but differences aren't equal.

  • Interval: Numerical data where differences between values are meaningful, but there is no true zero point, so ratios are not meaningful (e.g., temperature in Celsius or Fahrenheit, where 0 degrees doesn't mean absence of temperature, and negative values exist).

  • Ratio: Numerical data with a meaningful zero point, allowing for meaningful ratios between values (e.g., height, weight, age, counts). A value of 0 represents the complete absence of the measured quantity.

    • Example: Average temperatures in Celsius are interval data, not ratio data, because 0 degrees Celsius does not represent an absence of temperature and negative values are possible, so ratios are meaningless.

Descriptive vs. Inferential Statistics
  • Descriptive Statistics: Summarizes and describes the features of a dataset (e.g., mean, median, mode, histograms). It focuses on presenting what is observed.

    • Example: Creating a histogram to describe a sample is descriptive.

  • Inferential Statistics: Uses data from a sample to make predictions or inferences about a larger population. This typically involves hypothesis testing and confidence intervals.

Sampling Methods
  • Stratified Sampling: Dividing the population into homogeneous subgroups (strata) based on shared characteristics (e.g., age, major) and then sampling from each stratum.

  • Systematic Sampling: Selecting subjects from a list at a regular interval (e.g., every n^{th} person after a random start).

  • Convenience Sampling: Selecting individuals who are easily accessible or readily available. This method is often biased and not representative.

  • Multi-stage Sampling: A complex sampling method that involves multiple stages of sampling, often used in large-scale surveys. It might involve sampling regions, then communities within regions, then households within communities.

  • Cluster Sampling: Dividing the population into naturally occurring groups (clusters) and then randomly selecting some clusters. All members within the chosen clusters are typically included in the sample.

    • Example (systematic, not cluster, sampling): Accessing a university register and selecting every n^{th} student.

  • Simple Random Sample: Every member of the population has an equal chance of being selected. This requires a complete list of the population (sampling frame).

    • Example: Randomly choosing 30 names from each state's electoral register is not a simple random sample of US adults because adults in smaller population states have a higher probability of being selected compared to larger population states.

  • Sampling Frame: The actual list or population from which the sample is drawn (e.g., electoral registers).
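
To make the differences between the methods concrete, here is a sketch on an invented sampling frame of 100 students. The majors, the sample sizes and the use of blocks of 10 consecutive IDs as "clusters" are all assumptions made for illustration; they are not from the lecture.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 students, each tagged with a major.
majors = ["maths", "physics", "biology", "history"]
frame = [(student_id, majors[student_id % 4]) for student_id in range(100)]

# Simple random sample: every student has the same chance of selection.
simple_random = rng.sample(frame, 10)

# Systematic sample: random start, then every k-th student on the list.
k = 10
start = rng.randrange(k)
systematic = frame[start::k]

# Stratified sample: split into strata by major, then sample from each stratum.
strata = {}
for student in frame:
    strata.setdefault(student[1], []).append(student)
stratified = [s for group in strata.values() for s in rng.sample(group, 3)]

# Cluster sample: pick whole clusters at random and keep everyone in them.
clusters = [frame[i:i + 10] for i in range(0, len(frame), 10)]
cluster_sample = [s for cluster in rng.sample(clusters, 2) for s in cluster]

print(len(simple_random), len(systematic), len(stratified), len(cluster_sample))
```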

Lottery Probability
  • In a lottery where players choose 6 numbers from 1 to 42, every combination of 6 numbers has an equal chance of being chosen. Therefore, choosing specific numbers or choosing numbers randomly yields the same probability of winning.
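
To put a number on this, the count of possible tickets is "42 choose 6". The sketch below assumes that order does not matter and that a player wins only by matching all six numbers; the notes do not spell this out, and the variable names are illustrative.

```python
import math
from fractions import Fraction

# Number of ways to choose 6 numbers from 42 when order does not matter.
total_combinations = math.comb(42, 6)

# Any single ticket, however it was chosen, matches exactly one combination.
p_win = Fraction(1, total_combinations)

print(total_combinations)  # 5245786
print(float(p_win))
```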

Data Visualization & Measures
  • Frequency Table Components:

    • Class Boundaries: The values that separate adjacent classes without leaving gaps. If adjacent classes share an endpoint (e.g., 10-20, 20-30), the boundaries are the same as the class limits. If there is a gap between classes (e.g., 10-19, 20-29), the boundaries are the halfway points between adjacent limits (e.g., 19.5, 29.5).

    • Midpoint: The average of the lower and upper limits of a class (\frac{\text{Lower + Upper}}{2}).

    • Cumulative Frequency: A running total of frequencies, showing the number of observations up to and including a particular class.

    • Relative Frequency: The proportion of observations in each class (\frac{\text{Class Frequency}}{\text{Total Frequency}}).

  • Ogive: A line graph that displays the cumulative frequency or cumulative relative frequency against the upper class boundaries. It helps visualize the shape of the cumulative distribution.

    • Skew from Ogive: For a right-skewed (positive skew) distribution, most values are low, so the ogive rises steeply at the lower values and then flattens out over the long upper tail. For a left-skewed (negative skew) distribution, most values are high, so the ogive rises slowly at first and then steeply at the upper values. A symmetric distribution forms an S-shape.

  • Limitations of Data Tables: While providing detail, large tables can lead to information overload and make it difficult to visualize patterns. Having too few classes in a frequency table can obscure the true shape of the data.

  • Advantages/Disadvantages of Tables vs. Charts:

    • Tables: Offer more detailed information but can be hard to visualize patterns.

    • Charts: Make patterns easy to see but may lose some specific detail.

  • Pie Chart vs. Bar Chart: A pie chart is preferable for visualizing parts of a whole, specifically percentages or proportions of a total. Bar charts are better for comparing different categories or displaying exact frequencies/values.
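
The frequency table components can be computed mechanically. The sketch below uses made-up classes and frequencies, which are not data from the lecture, and assumes integer class limits with a gap of 1 between classes, so each boundary lies 0.5 beyond the limit.

```python
# Hypothetical classes (lower limit, upper limit) and their frequencies.
classes = [(10, 19), (20, 29), (30, 39), (40, 49)]
frequencies = [4, 7, 6, 3]

total = sum(frequencies)
rows = []
running_total = 0
for (lower, upper), freq in zip(classes, frequencies):
    running_total += freq
    rows.append({
        "class": f"{lower}-{upper}",
        "boundaries": (lower - 0.5, upper + 0.5),  # halfway to the neighbouring class
        "midpoint": (lower + upper) / 2,
        "frequency": freq,
        "cumulative": running_total,               # observations up to and including this class
        "relative": freq / total,
    })

# Points for an ogive: upper class boundary against cumulative frequency.
ogive_points = [(row["boundaries"][1], row["cumulative"]) for row in rows]

for row in rows:
    print(row)
print(ogive_points)
```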

Measures of Central Tendency & Spread for Sample Data (1, 4, 4, 9, 10, 13, 15, 20, 21, 30)
  • Sample Mean (\bar{x}) (n=10): Sum of all values divided by the number of values.

    • \bar{x} = \frac{1 + 4 + 4 + 9 + 10 + 13 + 15 + 20 + 21 + 30}{10} = \frac{127}{10} = 12.7.

    • Note: The lecturer used a hypothetical symmetrically spaced dataset (e.g., 1, 3, 5, 7, 9) to show that the mean (here 5) can be read off without calculation, because the values are symmetric about the center.

  • Mode: The value that appears most frequently in the dataset.

    • For the given data, the mode is 4.

    • Note: If all values appear with the same frequency (e.g., all unique), there is no mode.

  • Median (Q_2): The middle value when the data is ordered. For an even number of data points, it's the average of the two middle values.

    • Ordered: 1, 4, 4, 9, \mathbf{10}, \mathbf{13}, 15, 20, 21, 30.

    • Median = \frac{10 + 13}{2} = 11.5.

  • Range: The difference between the highest and lowest values.

    • Range = 30 - 1 = 29.

  • Interquartile Range (IQR): The range of the middle 50\% of the data, calculated as the difference between the third quartile (Q3) and the first quartile (Q1).

    • First half (for Q1): 1, 4, 4, 9, 10. Q1 = 4.

    • Second half (for Q3): 13, 15, 20, 21, 30. Q3 = 20.

    • IQR = Q3 - Q1 = 20 - 4 = 16.

  • IQR vs. Range: The IQR is generally a better measure of spread than the range because it is less affected by outliers (extremely large or small values) and focuses on the spread of the typical, central values.

  • Sample Variance (s^2): Measures spread as roughly the average squared difference from the mean; the sum of squared differences is divided by n-1 rather than n.

    • Formula: s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{\sum x^2 - n\bar{x}^2}{n-1}.

    • Units of Variance: If the original data units are in meters, the variance will be in meters squared (m^2), which is not intuitive.

  • Standard Deviation (s): The square root of the variance (s = \sqrt{s^2}). It is generally preferred over variance because it has the same units as the original data, making it easier to interpret.
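
The measures in this section can be checked with Python's standard library. This is a sketch of the method used in these notes, with illustrative variable names: quartiles are taken as the medians of the lower and upper halves, and the variance uses the shortcut formula with n-1.

```python
import math
import statistics

data = [1, 4, 4, 9, 10, 13, 15, 20, 21, 30]
ordered = sorted(data)
n = len(ordered)

mean = sum(ordered) / n
median = statistics.median(ordered)
mode = statistics.mode(ordered)
data_range = ordered[-1] - ordered[0]

# Quartiles as the medians of the lower and upper halves (the method in these notes).
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2:]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

# Sample variance via the shortcut formula (sum of x^2 minus n * mean^2, over n - 1).
sum_of_squares = sum(x * x for x in ordered)
variance = (sum_of_squares - n * mean ** 2) / (n - 1)
std_dev = math.sqrt(variance)

print(mean, median, mode, data_range, q1, q3, iqr)
print(round(variance, 2), round(std_dev, 2))
```

Other quartile conventions exist (for example, the default method of statistics.quantiles interpolates between values), so software may report slightly different values for Q1 and Q3 than the median-of-halves method used here.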