AP Statistics Study Guide Flashcards
General Tips & Chapter 1: One Variable Data
Introduction to Statistics
- Statistics is the science and art of collecting, analyzing, and drawing conclusions from data.
- Data can be:
- Categorical: Data that falls into categories.
- Quantitative: Numerical data representing measurements or counts.
- Key Terms:
- Individual: An object described in a set of data (also known as cases/observational units).
- Variable: An aspect that can take different values for different individuals.
- Distribution: The pattern of variation of a variable, showing what values the variable takes and how often it takes those values.
- Types of Statistics:
- Descriptive Statistics: Analyzing data (Units 1-7).
- Inferential Statistics: Making inferences/drawing conclusions from data (Units 8-12).
FRQs (Free Response Questions)
- Always include context: variable and units.
MCQs (Multiple Choice Questions) Strategies
- Process of elimination.
- Read actively and carefully.
- Underline important information.
- Anticipate answers.
Types of Variables
- Categorical:
- Nominal: Categories with no inherent order.
- Ordinal: Categories with a specific order.
- Numbers can still be categorical if they don’t measure anything (e.g., cell phone digits are labels, not measurements).
- Quantitative:
- Discrete: Fixed set of possible values with gaps between them.
- Whole numbers or defined intervals.
- Countable or countably infinite.
- Continuous: Infinite possibilities.
- Decimals/fractions.
- Any value in an interval on the number line.
- The main difference between categorical and quantitative variables: whether the values actually measure something.
1.1 Analyzing Categorical Data
- Tables and Graphs
- Frequency Table: Shows counts for each category.
- Relative Frequency Table: Shows proportions/percentages for each category.
- Bar Graph: Graphical representation of categorical data with bars representing frequencies or relative frequencies.
- Tips:
- Equal width bars.
- Leave gaps between bars.
- Label and scale axes.
- Indicate whether frequencies or relative frequencies are used.
- Pie Chart: Circular chart divided into slices proportional to frequencies or relative frequencies.
- Good for: comparing categories to the whole.
- Areas of slices proportional to frequencies/relative frequencies.
- Must include all possible categories in the whole (add an “other” option if necessary).
- Include a legend key.
- Two-Way Tables: Summarizes data on the relationship between two categorical variables for a group of individuals.
- (Let A = a single cell count, B = a row or column total, C = the grand total.)
- Marginal Relative Frequency: B/C (a row or column total divided by the grand total).
- Joint Relative Frequency: A/C (a cell count divided by the grand total).
- Conditional Relative Frequency: A/B (a cell count divided by its row or column total).
- Side-by-Side Bar Graph: Bar graphs showing the distribution of a categorical variable for each value of another categorical variable.
- Tip: It is acceptable for the bars to touch within each group here, but leave spaces between the groups for each value of the categorical variable.
- Segmented Bar Graph: Distribution of a categorical variable as segments of a whole (bars stacked on top of each other and proportional to relative frequencies).
- Uses relative frequencies.
- Tip: The bars do not touch here!
- Mosaic Plot: Similar to segmented, except the width of the bars proportional to number of individuals in that category.
- Tip: The bars do touch here!
*Note: Tables are not data, they are summaries of data!
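The three relative-frequency ratios above can be sketched with a small hypothetical two-way table (favorite season by grade level; all names and counts are made up for illustration):

```python
# Hypothetical two-way table: favorite season (rows) by grade level (columns).
table = {
    ("Summer", "9th"): 20, ("Summer", "10th"): 15,
    ("Winter", "9th"): 10, ("Winter", "10th"): 25,
}
total = sum(table.values())                          # grand total C = 70

# Joint relative frequency: one cell over the grand total (A/C).
joint_summer_9th = table[("Summer", "9th")] / total          # 20/70

# Marginal relative frequency: a row total over the grand total (B/C).
summer_total = sum(v for (season, _), v in table.items() if season == "Summer")
marginal_summer = summer_total / total                       # 35/70 = 0.5

# Conditional relative frequency: a cell over its row total (A/B).
cond_9th_given_summer = table[("Summer", "9th")] / summer_total  # 20/35
```

Note how the same cell count (20) plays the role of A in both the joint and the conditional ratios; only the denominator changes.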
Avoid Bad Statistics Practices
- Truncating axes.
- Using pictograms.
Association
- Knowing the value of one variable allows you to predict the value of the other.
- Association does NOT equal causation!
1.2 Displaying Quantitative Data with Graphs
- Types of Graphs
- Dotplot: Each data value is shown as a dot above its location on a number line.
- Pros: Can see every individual value, easy to see shape.
- Cons: Difficult to make with large data sets.
- Stemplot: Separates each data value into a stem (all but the final digit) and a leaf (the final digit).
- Tips: Always add a key; Split stems to better see distribution if needed (try to get at least 5 stems); Make sure each stem has an equal number of possible leaf digits.
- Pros: Can see every individual value, easy to see shape.
- Cons: Difficult to make with large data sets.
- Histogram: Groups data into bins (intervals) and shows the frequency (or relative frequency) of values within each bin.
- Tips: Equal size bins; Try to go with a minimum of 5 bins; The bin size will affect the appearance of the distribution (more bins -> more detail but less clear overall pattern); Edges of bins either inclusive or noninclusive (typically the right edge of a bin is noninclusive); Bars touch; Either frequencies or relative frequencies (but need to indicate which one!).
- Pros: Easier to make for large data sets, easy to see shape (especially for large data sets–simplifies the overall pattern).
- Cons: Doesn’t show every individual value.
*Use relative frequencies if you’re using the histogram to compare distributions with different numbers of observations.
- Boxplot: Represents the five-number summary (minimum, Q1, median, Q3, maximum) and any outliers.
- Pros: Easy to make for large data sets, shows five-number summaries, splits data into quartiles.
- Cons: Doesn’t show individual values; can hide features like gaps, clusters, and slight skewness.
1.3 Describing Quantitative Data with Numbers
Measures of Center
- Mean: Average (\bar{x} or µ).
- Median: Middle value.
- Mode: Most common value.
Measures of Variability
- Range: Maximum – Minimum.
- Pros: Easy to calculate.
- Cons: Nonresistant and doesn’t express variability from the center.
- IQR: Interquartile Range (middle 50% of values).
- Pros: Resistant.
- Calculating quartiles: Split each half (leave out median!).
- Standard Deviation: Typical distance from the mean (s_x or σ).
- s_x = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}
- To calculate: calculate all the deviations (value – mean), square each, add up, divide by n-1, take square root.
- Properties: Only use in tandem with the mean; s_x is always greater than or equal to 0 (equal to 0 only when there is no variability).
- Cons: Nonresistant.
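The step-by-step recipe for standard deviation (deviations, square, sum, divide by n−1, square root) can be checked by hand with a tiny hypothetical data set:

```python
import math

data = [4, 7, 5, 9, 5]                      # hypothetical small data set
n = len(data)
mean = sum(data) / n                        # 30 / 5 = 6.0

# Deviations from the mean, squared, summed, divided by n-1, square-rooted:
deviations = [x - mean for x in data]       # [-2, 1, -1, 3, -1]
s_x = math.sqrt(sum(d**2 for d in deviations) / (n - 1))   # sqrt(16/4) = 2.0
```

Note the division by n−1 (not n), matching the formula above for a sample standard deviation.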
Describing Quantitative Distributions: SOCV
- Shape: Skewness & modality, and any clusters/gaps.
- Outliers: 1.5 x IQR (above Q3 or below Q1).
- Center: Measures of the typical value: mean/median/mode.
- Variability: Range/IQR/st dev.
- Always add context! (variables & units).
Comparing Mean & Median
- Mean: Nonresistant.
- Median: Resistant (not sensitive to skewness/outliers).
Outliers
- 1.5 x IQR (above Q3 or below Q1).
- Five Number Summary: minimum, Q1, median, Q3, maximum
- Also good to know: upper & lower bounds (1.5 x IQR) (might not be actual data values)
*Decide which measures to use based on whether resistance is a concern for the distribution
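The 1.5 × IQR rule and the "split each half, leave out the median" quartile method can be sketched on hypothetical data containing one obvious outlier:

```python
data = sorted([1, 3, 4, 5, 5, 6, 7, 30])    # hypothetical data with an outlier

def median(vals):
    """Median of an already-sorted list."""
    n = len(vals)
    mid = n // 2
    return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2

# Quartiles: median of each half (for odd n, the overall median is left out).
half = len(data) // 2
q1 = median(data[:half])                    # median of [1, 3, 4, 5] = 3.5
q3 = median(data[-half:])                   # median of [5, 6, 7, 30] = 6.5
iqr = q3 - q1                               # 3.0

# Fences at 1.5 x IQR below Q1 / above Q3 (may not be actual data values):
lower_fence = q1 - 1.5 * iqr                # -1.0
upper_fence = q3 + 1.5 * iqr                # 11.0
outliers = [x for x in data if x < lower_fence or x > upper_fence]  # [30]
```

The fences here (−1.0 and 11.0) are not data values themselves, matching the note above about upper and lower bounds.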
- Additional Vocab
- Statistic: A value that describes a characteristic of a sample.
- Parameter: A value that describes a characteristic of a population.
Types of MCQs (or FRQs)
- Compare mean & median (use knowledge of resistance & skewness to answer).
- Interpret graphical or tabular representation of data:
- Using it to answer questions about the variable/distribution of the variable.
- Interpret summary statistics/a value about the distribution:
- Put it in context (variable & units!).
- Especially interpreting standard deviation: “the typical distance from the mean”.
- Match data with its graphical representation OR match graphical representations of the same data.
- Determine whether there is an association between two categorical variables given data: graphical/tabular representations OR summary statistics:
- Calculate distribution of one categorical variable for each value of the other (basically a whole bunch of conditional relative frequencies).
- And then find whether knowing the value of one variable allows you to predict the value of another (eg. are those distributions you calculated the same or not).
- Describe a distribution (quantitative):
- SOCV (shape, outliers, center, variability) (see content overview for more information).
- Context (variables & units of the distribution).
- Compare distributions (quantitative):
- SOCV for both distributions.
- Use explicitly comparative language that relates the two distributions.
- Make a claim/argument based on a distribution (quantitative):
- Refer to specific characteristics (eg. SOCV) of the distribution in your answer.
- Give specific numbers as much as possible.
- Context (variables & units of the distribution).
- Then, explain why those characteristics support your claim/argument.
- Describe a distribution/compare distributions/make a claim based on a distribution (categorical).
- Construct a certain type of graph for given data:
- Follow appropriate guidelines for the type of graph (see content summaries for tips).
- Always label & scale axes appropriately! with units!
- Add a title with context & others.
- What is apparent from the histogram but not from the boxplot?
- Misrepresenting/manipulating data:
- Why would it be misleading to only report [insert statistic/parameter here]?
- What would you want to report in order to [achieve specified goal]?
- Why does one method for determining outliers give you more outliers than the other?
Language & Wording / General Common Mistakes
- Language & Wording
- Always include context: distribution & variables & units.
- Describing distributions: “appears to be” / “approximately” (bc you cannot be sure).
- Ex. “approximately normal” & “roughly symmetric” (this is a very important one!).
- Be VERY careful with relative frequencies vs. frequencies / raw counts! (this is a very important one!).
- Use relative frequencies with groups of different sizes!
- & say “a greater percentage” not “more”.
- Plurality vs majority.
- Always indicate which one you’re using.
- For histograms & boxplots: keep in mind that you can’t conclusively determine what the values are.
- Esp for histograms: need to say that a value is in a certain bin (“between [value] and [value]”).
- Common Mistakes
- Range, IQR, and st. dev. are single values! not a range of values.
- Avoid bad statistics: truncated axes & pictograms.
- Association is not causation!
Chapter 2: Modeling Distributions of Quantitative Data
2.1 Describing Location in a Distribution
Percentile: pth percentile is value with p% observations less than or equal to it.
- Works well with a frequency table of quantitative data.
Cumulative Relative Frequency Graphs / Ogives: Plots points corresponding to the percentile of a value in the distribution & points connected with line segments to create the graph.
- Another way to describe location in a distribution.
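The percentile definition ("value with p% of observations less than or equal to it") can be computed directly; the scores below are hypothetical:

```python
data = sorted([62, 70, 71, 74, 75, 78, 80, 82, 85, 91])  # hypothetical test scores

def percentile_of(value, vals):
    """Percent of observations less than or equal to `value`."""
    return 100 * sum(1 for x in vals if x <= value) / len(vals)

p = percentile_of(78, data)   # 6 of 10 scores are <= 78 -> 60th percentile
```

An ogive plots exactly these cumulative percentages against the values, connected with line segments.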
Standardized Scores (z-scores): How many standard deviations from the mean a value is (& what direction).
- (value – mean) / st dev
- Allows for a standard scale to compare values from different distributions.
- A way to describe location in a distribution.
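A quick illustration of the standard scale idea (all means and standard deviations below are hypothetical):

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` is from the mean (signed)."""
    return (value - mean) / sd

# Same raw score of 86 compared against two different class distributions:
z_class_a = z_score(86, mean=80, sd=4)   # 1.5 SDs above the mean
z_class_b = z_score(86, mean=75, sd=10)  # 1.1 SDs above the mean
```

The same score of 86 stands out more in class A (z = 1.5) than in class B (z = 1.1), which is exactly the kind of cross-distribution comparison z-scores enable.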
Transforming Data
- Adding/Subtracting Constant: Affects measure of center/location (not shape/variability).
- Multiplying/Dividing Constant: Affects measures of center, location, variability (not shape).
- Multiple Transformations: Follow order of operations.
- Transformations Related to Z-scores: In a distribution of z-scores, shape remains the same as original distribution, mean always 0, standard deviation always 1.
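The add-versus-multiply rules above can be verified numerically on hypothetical data:

```python
import statistics

data = [10, 12, 15, 19, 24]                       # hypothetical data

shifted = [x + 5 for x in data]                   # add a constant
scaled = [x * 2 for x in data]                    # multiply by a constant

# Adding a constant shifts the center but leaves variability unchanged:
assert statistics.mean(shifted) == statistics.mean(data) + 5
assert statistics.stdev(shifted) == statistics.stdev(data)

# Multiplying a constant scales both the center and the variability:
assert statistics.mean(scaled) == 2 * statistics.mean(data)
assert statistics.stdev(scaled) == 2 * statistics.stdev(data)
```

Shape is unchanged in both cases: the deviations keep the same relative pattern, they are only shifted or rescaled.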
2.2 Density Curves & Normal Distributions
- Density Curve: Simplified model of a distribution of a quantitative variable
- Always on or above horizontal axis, has an area of exactly 1 underneath it.
- Always an approximation of data (not an exact model).
- Models continuous data but often used to approximate discrete distributions as well.
- Describing Density Curves:
- Shape: same ways as usual
- Center: mean (balance point) (µ) & median (divides area of curve in half) (if symmetric, they’re the same)
- Variability: same measures as usual (σ)
- Normal Distributions: Bell shaped & symmetric & unimodal distribution
- Approximated with a normal curve (density curve).
- Can fully be described by mean (same as median) (µ) and standard deviation (σ).
- Useful for: real data, chance outcomes, inference methods
- Empirical Rule: 68% (within 1σ of µ) - 95% (within 2σ of µ) - 99.7% (within 3σ of µ) (for normal distributions)
- Standard Normal Distribution: distribution of z-scores (mean 0, st dev 1)
- Finding areas in Normal Distributions
- Empirical Rule (when applicable).
- Find z-score & use Table A to look up the area to the left (the proportion of values below that z-score): Table A connects z-scores to percentiles in a normal distribution.
- Technology (see calculator functions)
- Types of Problems: Area to left, area to right, area between, working backwards (z-score given area)
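A technology check for the Empirical Rule, using the standard normal CDF built from `math.erf` (the N(170, 8) height distribution is hypothetical):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area to the left of x under a Normal(mu, sigma) curve."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: heights ~ N(170, 8). "Area between" problem for 162 to 178,
# i.e. within 1 standard deviation of the mean:
area_between = normal_cdf(178, 170, 8) - normal_cdf(162, 170, 8)
# Close to the Empirical Rule's 68% (exact value is about 0.6827)
```

The same subtraction pattern (left area at the upper boundary minus left area at the lower boundary) handles any "area between" problem; "area to the right" is 1 minus the left area.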
- Assessing Normality
- Plot data (see if it looks normal)
- Check against empirical rule: Check amount of data within 1, 2, and 3 st dev from mean (w/in 3-5% is pretty good!)
- Normal Probability / Normal Quantile Plot:
- Plots actual z-score (x) vs the z-score each value would be predicted to have if the distribution were normal (y).
- Look for a straight-ish line on the normal probability plot
*ideally use all three methods to check!
Types of MCQs (or FRQs)
- Interpret percentile or z-score (see content summary for info):
- Provide specific values for the percentages / mean & st dev.
- Use cumulative relative frequency graphs to determine percentile.
- Describe how distribution of data will change with a given type of transformation.
- Find area in a normal distribution (see content summary for info on how).
- Use percentiles or z-scores to evaluate claims about data:
- Find & interpret percentile / z-score
- Percentile: value with p% of observations less than or equal to it
- Z-score: abt that many standard deviations above / below the mean
- Draw conclusion using percentile / z-score
- Normal Distribution Questions:
- Picture
- Draw normal curve
- Label specific distribution (context & mean / st dev)
- Label boundary values & shade area of interest
- Calculate z-score(s)
- Calculate areas using Table A or calculator
Language & Wording/General Common Mistakes
Overall: Units 1 & 2 & 3
- If you use calculator, ALWAYS LABEL VALUES: State answer in context
- Is it extrapolation / is the answer reasonable?
- Predict value & comment on whether it’s reliable:
- E: correct prediction, plugged into formula for LSRL to get predicted value, say whether it is reliable or not (extrapolation)
- ALWAYS ADD CONTEXT esp when describing location in a distribution
- Percentile: “at” a percentile NOT “in” a percentile (bc it’s a location!)
- Z-scores: always provide the direction! not just “away” from the mean (and provide context (units & distribution / variables)!
- Normal Distributions:
- Be careful with direction & tails (two-sided vs one-sided)
- Distributions of real world data always “approximately normal” (never perfect)
- Quantitative Data:
- Discrete & Continuous
- One-Var:
- Tabular: frequency table, relative / cumulative frequency table
- Graphical: dotplot, stemplot, boxplot, histogram, cumulative relative frequency graph
- Numerical: 5-number summary, center, variability, percentile / z-score
- Two-Var:
- Graphical: scatterplot
- Numerical: r, r2 , LSRL, s
- Simplified model of data: density curves
- Categorical Data:
- Nominal & Ordinal
- One-Var:
- Tabular: frequency table, relative frequency table
- Graphical: bar chart, pie chart
- Numerical: proportions, etc
- Two-Var:
- Tabular: two-way table (frequency OR relative frequency)
- Graphical: side-by-side bar chart, segmented bar chart, mosaic plot
- Numerical: proportions, association, etc
Chapter 3: Exploring Two-Variable Quantitative Data
3.1 Scatterplots & Correlation
- Scatterplots: explanatory x-axis, response y-axis; label & scale axes (you CAN truncate the axes here)
- Describing Scatterplots: CDOFS
- Context: state variables & units
- Direction: pos / neg / no correlation
- Outliers / unusual features: outliers & points outside the general pattern / clusters
- Form: linear / nonlinear
- Strength: r (correlation coefficient) / r2 value (measures whether LSRL is a good fit)
- r value: measures strength & direction (ONLY for linear models)
- Cautions
- r is nonresistant
- only for linear
- correlation is not causation!
- no units
- unaffected by changing units / changing explanatory & response variables
- | r | less than 0.5: weak, | r | between 0.5 and 0.75: moderate, | r | greater than 0.75: strong
- Extrapolation: using a regression line to make predictions way outside of the interval of x-values used to generate the line (beyond the scope of your data)
- won’t be accurate bc it might not remain linear at such extreme points
3.2 Linear Regression
- Regression Line: model of how response variable (y) changes as explanatory variable (x) changes
- Residuals: actual value – predicted value (based on line)
- Least-Squares Regression Line: line that minimizes sum of squared residuals
- Explanatory & Response Variables: not necessarily causation (though it could be), just which helps to explain the other
- \hat{y} = a + bx (y-hat is predicted y-value for a given x-value); predicting so it’s okay if y-hat is not an integer for a real world situation (think of it as an average)
- A good linear regression line: minimizes the residuals; sum of residuals on an LSRL is always 0
- Residual Plots: scatterplot that plots residuals against explanatory variable
- Determines whether a linear model is appropriate (check for random scatter & no leftover curved pattern)
- s: standard deviation of residuals
- How well does the line work? -> how good will predictions be?
- Measures typical residual (distance between predicted & actual)
- Calculate in the same way as st dev but divide by n-2
- r2: coefficient of determination (value between 0 and 1, usually expressed as a percentage)
- Square of correlation r
- When finding r from r2, make sure to consider direction of correlation!
- How well the LSRL fits the data: percent reduction in sum of squared residuals when using LSRL instead of mean to make predictions
- What percent of the variability in the response variable that can be explained by the linear association
- Regression to the Mean
- To calculate the LSRL: slope: b = r(s_y / s_x); y-intercept: a = \bar{y} - b\bar{x}
- since LSRL passes through (\bar{x}, \bar{y})
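The slope and intercept formulas can be applied directly to summary statistics; the numbers below are hypothetical:

```python
# Hypothetical summary statistics: means, standard deviations, and r.
x_bar, y_bar = 50.0, 120.0
s_x, s_y = 10.0, 30.0
r = 0.8

b = r * (s_y / s_x)          # slope: b = r * (s_y / s_x) = 0.8 * 3 = 2.4
a = y_bar - b * x_bar        # y-int: line passes through (x-bar, y-bar) -> 0.0

y_hat = a + b * 55           # predicted y for x = 55
```

Because a is computed as ȳ − b·x̄, plugging x̄ back in always returns ȳ, which is exactly the "passes through (x̄, ȳ)" property.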
- Correlation & Regression Wisdom
- Correlation and LSRLs only describe linear relationships
- r, s, r2, and LSRL are non-resistant (see influential points)
- Influential Points: points that, if removed, substantially change the slope, y-int, r, r2, or s
- Outliers and high-leverage points are very often influential (but not automatically guaranteed to be)
- Can do regression calculations with & without the points to see how much influence they have
- Outliers: doesn’t follow pattern of data and has a large residual
- High Leverage: much larger / smaller values than other values in data set
- Removing High-Leverage Points
- Lower than line & slope negative: slope closer to 0 & y-int lower
- Lower than line & slope positive: slope steeper & y-int lower
- Higher than line & slope negative: slope steeper & y-int higher
- Higher than line & slope positive: slope closer to 0 & y-int higher
- and sometimes effects on r, r2 , or s (use the point to evaluate this)
- Removing Outliers
- Impacts r, r2 , and s values heavily: Usually makes them go up because strength of association & fit of LSRL way higher
- Doesn’t generally impact LSRL though
3.3 Transforming to Achieve Linearity
- Applying a function to a quantitative variable (changes the scale of measurement) in order to make the scatterplot more approximately linear (in order to use linear regression methods)
- Transforming with powers & roots (for power models: y = ax^p)
- Option 1: Raise values of x to power p (graph (x^p, y) (it will be linear))
- Option 2: pth root of values of y (graph (x, \sqrt[p]{y}) (it will be linear))
- When p is known: use the above methods; When p is unknown: guess & check OR use log (more universal & works for unknown power models)
- Transforming with Logs (for power OR exponential models)
- Apply log transformation (log10 or ln)
- For power models (y = ax^p): use a log-log (both variables)
- For exponential models (y = ab^x): take log of y-var
- To choose a model: most random scatter (tiebreaker highest r2 value)
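The exponential case can be sketched end-to-end: log the y-values, fit a line on (x, log y), then undo the transformation. The data below are hypothetical and follow y = 3 · 2^x exactly:

```python
import math

# Hypothetical data following an exponential pattern y = a * b^x (a=3, b=2).
xs = [1, 2, 3, 4, 5]
ys = [6, 12, 24, 48, 96]          # doubles each step

# Take the log of y; (x, log y) is then linear: slope = log(b), intercept = log(a).
log_ys = [math.log10(y) for y in ys]

# Simple least-squares fit on the transformed data:
n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
slope = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = ly_bar - slope * x_bar

# Undo the transformation to recover the model parameters ("regular" units!):
b = 10 ** slope        # ~2
a = 10 ** intercept    # ~3
```

For a power model you would instead log both variables (log-log) and fit the line on (log x, log y), with the slope estimating the power p.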
Types of MCQs (or FRQs)
- Slope: for every [increase / decrease] of one [unit of x], there is a predicted [increase / decrease] of [slope] [units of y]
- Y-int: when the [context of x] is 0 [units], the predicted value of [context of y] would be [y-int]
- r-value / correlation coefficient: the correlation coefficient of [r] indicates that there is a [strong / moderate / weak], [positive / negative] correlation between [context of x] and [context of y]
- s (standard deviation of residuals): on average, the model mispredicts [context of y] by [s units] using the LSRL
- r2 value: [r2] percent of the variability in [context of y] can be explained by the linear association with [context of x]
- Residual plot: the residual plot [is randomly scattered / has a pattern], indicating that a linear model [is / is not] appropriate
- Effect of outliers / high leverage points on measures of strength or the LSRL
- What would happen when they are removed?
- Can you infer causation based on correlation? (no) (might not be worded like this directly though)
- Which is explanatory & which is response?
- Interpret a feature of the association / regression line
- LSRL in general: for every increase of [1 unit of x], there is a predicted increase of [b units of y], starting from a predicted value of [a, the y-int] when x = 0
Tips/Common Errors
- Be very careful with predicted vs actual values
- Remember to add a hat on top of any predicted value! And ALWAYS SAY THEY’RE “PREDICTED”
- When defining LSRL: always define variables & add units for x & y
- Particularly important when data has been transformed to get the LSRL
- With transformed data: be mindful of units & always convert back to “regular” units where appropriate / needed!
- With residuals: pay attention to whether the question is asking for predicted residual (from LSRL / LSRL equation) or actual residual (from residual plot / graph)!
- Can’t go backwards with LSRL & predict x value given a definite y value (can find x value from y-hat though)
- How to tell if linear model is a good fit: high r2 value, s is small relative to the data
Chapter 4: Collecting Data
4.1 Sampling & Surveys
- Sampling: selecting a group of individuals from a population (ideally one that’s representative of the population)
- Sampling Frame: the list of members of the population from which we select our sample
- Sample Survey: collects data from the individuals in the sample (to learn about the population)
- Types of Sampling
- Random Sampling: involves a chance process to determine which individuals are in the sample
- SRS (Simple Random Sample): every group of n individuals has an equal chance of being selected. To select: label individuals with numbers, use a random number generator, and pick the individuals that correspond. Sample without replacement, so skip repeated numbers! Calculator: MATH -> PROB -> randIntNoRep(1, N), or use Table D.
- Stratified: an SRS is selected from each stratum. Stratum: a group with similar characteristics assumed to be associated with the variables being measured. Ensures you get some individuals from every stratum (more precise & accurate estimates).
- Clustered: randomly select entire clusters. Cluster: a group that (hopefully) mirrors the population, so differences fall within clusters rather than between them. No statistical advantage, but resource-efficient.
- Systematic: randomly select a starting point & select every kth individual after it. Make sure no pattern in the population coincides with your systematic pattern. Doesn’t require identifiers (e.g., names) for every individual in the population (useful when the population size is unknown).
*Decide on sampling type based on population & variable & resources available to you
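The SRS and systematic selection procedures can be sketched with `random` from the standard library (population labels 1–100 are hypothetical):

```python
import random

population = list(range(1, 101))      # individuals labeled 1-100 (hypothetical)

# SRS: sample without replacement, so no label can repeat.
srs = random.sample(population, 10)

# Systematic: random starting point, then every k-th individual (k = 100/10).
k = 10
start = random.randrange(k)           # random start among the first k labels
systematic = population[start::k]     # every k-th individual after the start
```

`random.sample` handles the "skip repeated numbers" requirement automatically; when describing the procedure by hand on an FRQ, you still have to state it.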
- Bad Sampling
- Convenience Sampling: individuals who are easy to reach
- Voluntary Response Sampling: allows individuals to choose to be in sample: leads to voluntary response bias; individuals who feel strongly / have similar opinions more likely to respond
- Bias: likely to systematically overestimate or underestimate the value
- Undercoverage: certain individuals less likely / cannot be chosen in a sample
- Nonresponse: individual chosen for sample can’t be contacted / doesn’t participate: diff from voluntary response bias bc this is only after sample has been selected
- Response: systematic pattern of inaccurate answers to a survey question; ex. wording or order of questions
*Note: Sampling uses resources more efficiently than a census.
4.2 Experiments
- Observational Studies: observe individuals & measure variables of interest without influencing responses. Retrospective: uses existing data; Prospective: tracks individuals into the future. Pros: fewer ethical concerns. Cons: confounding variables (cannot determine causality with observational studies).
- Experiments: impose a treatment on individuals & measure their responses. Pros: can establish causality because random assignment avoids confounding. Cons: resources, ethics. Important vocab:
- Placebo: no active ingredient; optional: placebo effect
- Treatment: condition imposed on individuals
- Experimental Unit: individual to which treatment applied: Subject: human experimental unit
- Factor: explanatory var that’s manipulated (may cause change in response var); treatments formed using levels of each of the factors (in a multifactor experiment)
- Levels: diff possible values of a factor
- Confounding: when variables are associated in such a way that their effects on a response variable can’t be distinguished from one another.
- Control Group: provides a baseline for comparison; not always required but there does need to be some sort of comparison
- Random Assignment of Treatments: avoids confounding variables
- Control: all other variables constant; avoids confounding variables
- Replication: use enough experimental units in each group so that differences in effects can be distinguished from chance variation. Reduces the impact of chance variation.
- Double Blind: neither subjects nor the ppl measuring know the treatment; Triple Blind: statistician doesn’t know either
- Single-Blind: only one of the groups (above) knows
- Selection Biases: voluntary response, convenience, undercoverage; the other two bias types: nonresponse, response.
Designing Experiments (required in a good experiment)
- Completely Randomized Design: experimental units assigned to treatments completely at random
- Randomized Block Design: random assignment within each block. Form blocks based on a confounding variable (ideally the variable that best predicts the response variable). Blocking accounts for variability in [response var] created by [block var].
- Block: group of experimental units known to be similar in some way that could affect their response to the treatments
- Matched Pairs Design: a type of RBD where the blocks are pairs of similar experimental units. Either use true pairs and randomize which member gets each treatment, or have each individual receive both treatments in a random order. Can establish causality.
4.3 Using Studies Wisely
- Statistical Significance: observed diff is larger than can be attributed to chance alone: Ensured by random assignment of treatments
- Statistical Inference: generalizing results to the population: Assumes the sample is representative of the population (ensured by a random sample)
- Ethical Data Gathering: informed consent, benefit
- Sampling Variability: diff random samples (same size, same population) produce diff estimates
- Larger sample sizes produce more precise estimates (less sampling variability, so estimates tend to be closer to the true value)
Types of MCQs (or FRQs)
- Is it an observational study or an experiment?
- Describe the type of bias present
- Describe how members respond differently, then describe how this leads to overestimated / underestimated values
- Describe a confounding variable
- Describe that variable’s association with both explanatory AND response vars
- Describe an appropriate experiment design
- Describe how to randomly assign treatments
- Create groups AND define which group gets which treatment
- Describe why certain experiment designs would be preferable. Draw a diagram to explain the experiment design
Tips/Common Errors
- Describing how to use a random number generator / table: ALWAYS account for repeated numbers
- If you flip coins to assign groups: make sure everyone flips a coin (don’t flip only until one group is full and then put everyone else in the other group; that would not be random assignment)
- Be really careful not to mix up language for experiments & language for observational studies!
Chapter 5: Probability
5.1 Randomness, Probability, and Simulation
- Random Process: generates outcomes purely by chance; unpredictable in short-term but predictable in the long run
- Probability: likelihood of an event to happen; proportion of times an outcome would occur in the long run; P(A) = (number of outcomes in event A) / (total number of outcomes in sample space)
- Law of Large Numbers: more trials means proportion approaches true probability (more accurate)
- Simulation: imitates a random process such that simulated outcomes are consistent with real-world outcomes. Options: random number generator, random number table, flipping a coin, drawing cards, etc. When describing a simulation: remember to say that repeated numbers will be ignored (if sampling without replacement)! For a random number table: labels must all have the same number of digits.
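The Law of Large Numbers can be seen in a quick coin-flip simulation (a minimal sketch; the seed is arbitrary, chosen only for reproducibility):

```python
import random

random.seed(1)  # arbitrary seed so the simulated flips are reproducible

def prop_heads(n_flips):
    """Simulate n fair coin flips and return the proportion of heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

# Short runs are unpredictable; long runs settle toward the true probability 0.5:
short_run = prop_heads(10)
long_run = prop_heads(100_000)
```

The short-run proportion can land anywhere, but the long-run proportion sits very close to 0.5, which is the "unpredictable in the short term, predictable in the long run" idea.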
5.2 Probability Rules
- Probability Model: description of a random process that includes a list of all possible outcomes & the probability for each outcome. Sum of all probabilities is 1 (or 100%); each probability is between 0 & 1 (or 0% & 100%).
- Sample Space: list of all outcomes: options: chart, table, venn diagram, probability tree
- Event: any collection of outcomes from a random process. Notation & vocab: P(A) = (number of outcomes in event A) / (total number of outcomes in sample space)
- Complement: the probability that an event does not occur: P(A^C) = 1 – P(A)
- Intersection: P(A and B) = A ∩ B (both A and B must be true) joint probability
- Union: P(A or B) = A ⋃ B (at least one–either A or B, or both–must be true)
- Mutually Exclusive Events: cannot occur simultaneously (no outcomes in common; also known as disjoint), so P(A and B) = 0:
- Addition Rule: P(A or B) = P(A) + P(B)
- Non-Mutually Exclusive Events: can occur simultaneously: use a two-way table!
- General Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
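The general addition rule can be checked on a small sample space (one roll of a fair die; a worked check, not AP-specific code):

```python
# Sample space: one roll of a fair die. A = roll is even, B = roll > 3.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def p(event):
    """Probability under equally likely outcomes."""
    return len(event) / len(sample_space)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = p(A | B)                       # {2, 4, 5, 6} -> 4/6
assert abs(p_union - (p(A) + p(B) - p(A & B))) < 1e-12
```

Subtracting P(A and B) removes the double-counted overlap {4, 6}; for mutually exclusive events that overlap is empty, so the rule collapses to the simple addition rule.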
5.3 Conditional Probability & Independence
- Conditional Probability: probability that an event happens given that another event is known to have happened: P(A | B) = P(A ∩ B) / P(B). Tree diagrams are useful with conditional probability.
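The conditional probability formula can be applied to counts from a two-way table (the grade/pet-ownership counts below are hypothetical):

```python
# Hypothetical counts: students by grade and whether they own a pet.
counts = {("10th", "pet"): 30, ("10th", "no pet"): 20,
          ("11th", "pet"): 25, ("11th", "no pet"): 25}
total = sum(counts.values())                              # 100

p_pet_and_10th = counts[("10th", "pet")] / total          # P(pet and 10th) = 0.30
p_10th = (counts[("10th", "pet")] + counts[("10th", "no pet")]) / total  # 0.50

# Conditional probability: P(pet | 10th) = P(pet and 10th) / P(10th)
p_pet_given_10th = p_pet_and_10th / p_10th                # 0.30 / 0.50 = 0.60
```

Conditioning on "10th grader" shrinks the sample space from all 100 students to the 50 tenth graders, which is why the denominator changes from the grand total to P(10th).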