Statistics and Probability
Statistics: collection, analysis, and interpretation of data
two parts: descriptive and inferential
Probability: a mathematical tool to study randomness
Difference between statistics and probability:
suppose there is a jar with 10 balls, 3 red and 7 green
in probability: we know the jar content and therefore the true probabilities 3/10 and 7/10. we ask questions such as what is P(2 red in a row) with replacement
in statistics: we do not know the jar content. but we take a sample of say n=4 balls (with replacement). with this sample, we estimate the true probabilities.
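A minimal Python sketch of the two viewpoints, using the jar above (random.choices draws with replacement; the printed estimate will vary from run to run):

```python
import random

# The jar from the example: 3 red, 7 green (known in the probability
# view; the statistician pretends not to know this composition).
jar = ["red"] * 3 + ["green"] * 7

# Probability view: composition known, so P(2 red in a row, with
# replacement) is just (3/10) * (3/10).
print("theoretical P(2 red):", (3 / 10) ** 2)

# Statistics view: composition unknown; draw n=4 balls WITH replacement
# and use the sample proportion of red to estimate the true 3/10.
sample = random.choices(jar, k=4)
p_hat = sample.count("red") / len(sample)
print("sample:", sample, "-> estimated P(red):", p_hat)
```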
Population vs sample
Population: collection of persons or things under study
Sample: a subset of the population that provides information about the population
Sampling
Sampling: selection of a portion of the population
We want an adequate sampling method such that the sample is representative of the population
If the sample is representative, sample statistics are meaningful with respect to the population
Parameter vs Statistic
Parameter: number that represents a property of the population
example: true population mean (mu)
Statistic: number that represents a property of the sample
example: sample mean (x bar)
Variable X and Data
Variable X: a characteristic of interest for each person (or thing) of the population (examples: hours of sleep, GDP)
Data: actual values for the variables (persons or things)
Successful Sampling
Sampling: a sample should have the same characteristics as the population it is representing
Simple random sampling (SRS): names in a hat (or generate random numbers). Most important/common. Any group of n people is equally likely to be drawn (see the sketch after this list).
example: pick n professors from Fordham
cluster: select departments (groups) randomly, then sample within the chosen departments
systematic: every 20th name in the phonebook
convenience (not random): whoever is easiest to reach
sampling can be done with or without replacement (see below)
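A short sketch of two of these methods in Python, with a hypothetical roster standing in for the names in a hat:

```python
import random

# Hypothetical roster of 200 professors (stands in for "names in a hat").
population = [f"prof_{i}" for i in range(1, 201)]

# Simple random sample WITHOUT replacement: every group of 10 names is
# equally likely to be the sample drawn.
srs = random.sample(population, k=10)

# Systematic sample: every 20th name after a random starting point.
start = random.randrange(20)
systematic = population[start::20]

print("SRS:       ", srs)
print("systematic:", systematic)
```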
Sampling bias: when the sampling method is flawed or non-representative, x-bar systematically fails to estimate mu (distinct from the natural sampling error below)
Sampling Error: Variation in samples (key concept)
Sampling Error: the natural variation that results from selecting a sample to represent a larger population
this variation decreases as the sample size increases, so selecting larger samples reduces sampling error
Sampling with Replacement and without Replacement
Sampling with Replacement: once a member of the population is selected for inclusion in a sample, that member is returned to the population for the selection of the next individual
Sampling without Replacement: a member of the population may be chosen for inclusion in a sample only once. if chosen, the member is not returned to the population before the next selection
Frequency, Relative Frequency, & Cumulative relative frequency
Frequency: the number of times the value of the variable occurs in the sample
Relative frequency: the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. Relative frequencies can be written as fractions, percents, or decimals.
Cumulative relative frequency: the accumulation of the previous relative frequencies. Add all previous relative frequencies to the relative frequency for the current row.
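A small sketch computing all three columns from a hypothetical sample (hours of sleep for 10 people):

```python
from collections import Counter

# Hypothetical sample: hours of sleep reported by 10 people.
data = [7, 8, 6, 7, 9, 7, 8, 6, 7, 8]
n = len(data)

counts = Counter(sorted(data))   # frequency of each distinct value

cumulative = 0.0
print("value  freq  rel freq  cum rel freq")
for value, freq in counts.items():
    rel = freq / n               # relative frequency (a proportion)
    cumulative += rel            # running sum of relative frequencies
    print(f"{value:5}  {freq:4}  {rel:8.2f}  {cumulative:12.2f}")
```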
Histograms
To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of 5 to 15 bars or classes for clarity
Choose a starting point
Less than the smallest data value
A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places
ex. if the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 - 0.005).
when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary
Calculate the width of each bar or class interval.
Subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire)
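A sketch of that calculation in Python, reusing the note's numbers (lowest value 1.5, most decimal places 2, so the starting point gets a third decimal):

```python
# Hypothetical data whose most precise value has 2 decimal places.
data = [1.5, 2.23, 1.8, 2.0, 1.95, 2.1, 1.75, 2.2]
num_bars = 5                      # you choose the number of classes

start = min(data) - 0.005         # 1.495: one extra decimal place
width = (max(data) - start) / num_bars

boundaries = [round(start + i * width, 3) for i in range(num_bars + 1)]
print("class width:", width)
print("class boundaries:", boundaries)
```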
Frequency polygons
Analogous to line graphs and histograms; they make continuous data visually easy to interpret
Useful for comparing distributions
Levels of Measurement
the way a set of data is measured is called its level of measurement
levels of measurement (from lowest to highest level):
Nominal scale level
data that is measured using a nominal scale is qualitative (categorical). categories, colors, names, labels, and favorite foods along with yes or no responses are examples of nominal level data.
not ordered
cannot be used in calculations
Ordinal scale level
similar to nominal scale data but there is a big difference
the ordinal scale data can be ordered
ex. top five national parks in the US
can be ordered but the differences cannot be measured
cannot be used in calculations
Interval scale level
similar to ordinal level data because it has a definite ordering, but with one key difference:
the differences between interval scale data can be measured though the data does not have a starting point
temperature scales like celsius and fahrenheit are measured using the interval scale
can be used in calculations, but one type of comparison cannot be done: no meaning to ratios
Ratio scale level
takes care of the ratio problem and gives you the most information
like interval scale data, but it has a starting point and ratios can be calculated
the data can be put in order from lowest to highest
the differences have a meaning
ratios can be calculated
Quantitative data: discrete or continuous
Discrete: take on only certain numerical values
ex. counting number of phone calls you receive for each day of the week
Continuous: result from measuring rather than counting; values may include fractions, decimals, irrational numbers, etc.
ex. lengths, weights, times, etc.
Key Components of Every Experiment (to produce reliable data)
Subjects must be assigned randomly to different treatment groups to eliminate lurking variables
One of the groups must act as a control group, demonstrating what happens when the active treatment is not applied
Participants in the control group receive a placebo treatment that looks exactly like the active treatments but cannot influence the response variable
To preserve the integrity of the placebo, both researchers and subjects may be blinded
Measures of the Location of Data: Quartiles and Percentiles
Quartiles divide an ordered data set into four equal parts
about one-fourth of the data falls on or below the first quartile Q1
about one-half of the data falls on or below the second quartile Q2
about three-fourths of the data falls on or below the third quartile Q3
Percentiles divide ordered data into hundredths
To score in the 90th percentile of an exam does not necessarily mean that you received a 90% on the test. It means that 90% of test scores are the same as or less than your score, and 10% of the test scores are the same as or greater than your test score
Finding Quartiles
Find Q2 by finding the median (the value at position (n+1)/2 in the ordered data)
Find Q1: the middle value, or median, of the lower half of the data
one fourth of the entire set of values is the same as or less than Q1 and three fourths of the values are more than Q1
Find Q3: the middle value, or median, of the upper half of the data
three fourths of the ordered data set is less than Q3 and one fourth of the ordered data set is greater than Q3
Interquartile Range (IQR)
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of data
It is the difference between the third quartile (Q3) and the first quartile (Q1)
IQR = Q3 - Q1
Interquartile Range (IQR) and Outliers
The IQR can help to determine potential outliers
A value is suspected to be a potential outlier if it lies more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile
lower fence: Q1 - (1.5)(IQR)
upper fence: Q3 + (1.5)(IQR)
values outside the interval [Q1 - (1.5)(IQR), Q3 + (1.5)(IQR)] are flagged as potential outliers
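A sketch putting the quartile and fence rules together, with hypothetical data; statistics.quantiles with its default "exclusive" method follows the (n+1)-position convention used above:

```python
import statistics

# Hypothetical ordered data set with one suspiciously large value.
data = sorted([10, 12, 13, 15, 18, 20, 21, 24, 25, 60])

q1, q2, q3 = statistics.quantiles(data, n=4)   # [Q1, median, Q3]

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print("potential outliers:",
      [x for x in data if x < low_fence or x > high_fence])   # [60]
```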
Interpreting percentiles: On a 20-question math test, the 70th percentile for correct answers was 16
What does this mean?
70% had 16 correct answers or less. 30% had 16 correct answers or more
Percentiles & frequency tables
percentile in a frequency table —> read the cumulative relative frequency column
ex. the value at the 28th percentile is the first data value whose cumulative relative frequency reaches 0.28
Resistant measure
A resistant measure is a statistical measurement that is not significantly affected by outliers
The mean is not robust to outliers
The median is robust to outliers
Another center statistic: the mode
the mode is the most frequent value in the sample
it also works for qualitative data
Skew and Distribution
To understand skewness:
look at the sign of (mean - median)
outliers pull the mean away from the median
sample mean > median → right/positive skewed distribution
sample mean < median → left/negative skewed distribution
sample mean ≅ median → symmetrical distribution
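A quick check of the mean-vs-median rule on a hypothetical right-skewed sample:

```python
import statistics

# One large value pulls the mean up but barely moves the median.
data = [2, 3, 3, 4, 4, 5, 30]

mean = statistics.mean(data)       # about 7.29, dragged by the outlier
median = statistics.median(data)   # 4, resistant to the outlier

# mean > median suggests a right/positively skewed distribution
print(f"mean={mean:.2f}, median={median}")
```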
Box Plots
Box plots give a good graphical image of the concentration of the data
Show how far the extreme values are from most of the data
Constructed from five values:
Minimum value
First quartile
Median
Third Quartile
Maximum value
The middle 50 percent of the data falls inside the box
The first quartile marks one end of the box, and the third quartile marks the other end of the box
The median or second quartile lies between the first and third quartiles and is drawn as a line inside the box
Whiskers extend from the ends of the box to the smallest and largest data values, which mark the endpoints of the axis
Standard Deviation
The most common measure of variation or spread
The standard deviation is a number that measures how far data values are from their mean
It provides a numerical measure of the overall amount of variation in a data set
It can be used to determine whether a particular data value is close to or far from the mean
Higher standard deviation → more variation
lower case letter s represents the sample standard deviation
Greek letter σ (sigma) represents the population standard deviation
sample → divide by n-1
population → divide by N
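Python's statistics module implements both conventions, so a quick sketch can show the n-1 vs N difference on hypothetical data:

```python
import statistics

data = [4, 7, 9, 10, 15]

s = statistics.stdev(data)        # sample SD: divides by n - 1
sigma = statistics.pstdev(data)   # population SD: divides by N

print(f"sample s = {s:.3f}")      # slightly larger
print(f"population sigma = {sigma:.3f}")
```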
Experiment, sample space, event, etc. (Chapter 3: probability topics)
Experiment: planned operation with a random outcome carried out under controlled conditions
ex. flipping one coin
Sample space: the set of all possible outcomes
ex. S = {H,T}
Event: an event A is a subset of the sample space
Probability of an outcome: number between 0 and 1 that can be seen as the long-term relative frequency of that outcome
Probability of an event A when outcomes are equally likely: P(A) = (number of outcomes in A) / (total number of outcomes in S)
Different types of probabilities and how they relate
Marginal probabilities: P(A), P(B)
OR events: P(A U B) = P(A or B)
AND events: P(A and B) → aka joint probability
Conditional probability: P(A | B), read "P(A given B)"
Bayes theorem
Conditional probability = joint/marginal
P(A|B) = P(A and B) / P(B)
P(A and B) = P(A|B) x P(B)
P(A and B) = P(B|A) x P(A)
So: P(A|B) x P(B) = P(B|A) x P(A)
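A numeric sanity check of the relation above, with hypothetical marginal and joint probabilities:

```python
# Hypothetical probabilities for two events A and B.
p_a, p_b = 0.30, 0.40
p_a_and_b = 0.10                   # the joint probability

p_a_given_b = p_a_and_b / p_b      # conditional = joint / marginal = 0.25
p_b_given_a = p_a_and_b / p_a      # = 1/3

# Both routes recover the same joint: P(A|B)P(B) == P(B|A)P(A)
print(f"{p_a_given_b * p_b:.2f}")  # 0.10
print(f"{p_b_given_a * p_a:.2f}")  # 0.10
```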
Independence and mutual exclusion
Independence: A and B are independent if P(A|B) = P(A) (equivalently, P(B|A) = P(B), or P(A and B) = P(A) x P(B)), i.e. the conditioning set is useless
one event occurring does not affect the chance the other occurs
intuition: roulette vs. black jack
note: if A and B are independent, the multiplication rule P(A and B) = P(A|B) x P(B) reduces to P(A and B) = P(A) x P(B)
under independence, the joint is the product of marginal
Mutual exclusion: A and B are mutually exclusive when the joint is 0 → P(A and B) = 0
events that cannot occur at the same time
Two basic rules of probability
P(A and B) = P(A|B) x P(B)
Reduces to P(A and B) = P(A) x P(B) under independence
AND → product
P(A or B) = P(A) + P(B) - P(A and B)
Reduces to P(A or B) = P(A) + P(B) under mutual exclusion
OR → sum
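A worked check of both rules using a standard deck of cards (A = draw a heart, B = draw a face card):

```python
p_a = 13 / 52                      # hearts
p_b = 12 / 52                      # face cards
p_a_and_b = 3 / 52                 # jack, queen, king of hearts

# AND -> product: P(A and B) = P(A|B) * P(B)
p_a_given_b = p_a_and_b / p_b      # 3/12 = 0.25
print(f"P(A and B) = {p_a_given_b * p_b:.4f}")        # 0.0577

# OR -> sum: P(A or B) = P(A) + P(B) - P(A and B)
print(f"P(A or B)  = {p_a + p_b - p_a_and_b:.4f}")    # 22/52, about 0.4231
```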
Sampling with replacement or without replacement
With replacement: the events are considered to be independent, meaning the result of the first pick will not change the probabilities for the second pick
Without replacement: the events are considered to be dependent or not independent
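The jar from the start of the notes makes the difference concrete; a sketch with exact fractions:

```python
from fractions import Fraction

# Jar: 3 red, 7 green. Probability of red on both of two picks.
with_repl = Fraction(3, 10) * Fraction(3, 10)     # picks are independent
without_repl = Fraction(3, 10) * Fraction(2, 9)   # second pick depends on first

print("with replacement:   ", with_repl)     # 9/100
print("without replacement:", without_repl)  # 1/15
```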
Contingency tables
a contingency table displays sample values for two variables at once; its cell, row, and column totals can be used to compute marginal, joint, and conditional probabilities
Discrete Random Variable
Discrete data are data that you can count
A random variable describes the outcomes of a statistical experiment in words
The values of a random variable can vary with each repetition of an experiment
Random Variable Notation
Upper case letters such as X or Y denote a random variable
Lower case letters like x or y denote the value of a random variable
If X is a random variable, then X is written in words, and x is given as a number
For example, let X = the number of heads you get when you toss three fair coins. The sample space for the toss of three fair coins is {TTT, THH, HTH, HHT, HTT, THT, TTH, HHH}
Then, x = 0,1,2,3
Because you can count the possible values that X can take on and the outcomes are random (the x values 0,1,2,3), X is a discrete random variable
Random Variables
A random variable X (or Y) takes different values with different probabilities
Example: experiment: flip 2 coins
S= {HH,HT,TH,TT}
Define X= count of heads from flipping 2 coins
Possible values for X: 0,1,2
But X will realize those values with different probabilities (1/4,2/4,1/4)
We want to describe the probability with which X takes on different values → We use a PDF
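A sketch that builds this PDF by brute-force enumeration of the sample space:

```python
from collections import Counter
from itertools import product

# Enumerate the sample space for flipping 2 fair coins.
sample_space = list(product("HT", repeat=2))   # HH, HT, TH, TT

# X = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in sample_space)

pdf = {x: c / len(sample_space) for x, c in sorted(counts.items())}
print(pdf)   # {0: 0.25, 1: 0.5, 2: 0.25}
```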
Is the sample mean (x bar) a RV?
Yes: its value depends on which specific random sample is drawn from the population, so it varies from sample to sample and has a probability distribution associated with it
Probability Distribution Function (PDF) for a Discrete Random Variable
A discrete probability distribution function has two characteristics:
Each probability is between zero and one, inclusive.
The sum of the probabilities is one
Mean or Expected Value
The expected value is often referred to as the “long-term” average or mean. This means that over the long term of doing an experiment over and over, you would expect this average.
Law of large numbers → as the number of trials in a probability experiment increases, the difference between the theoretical probability of an event and the relative frequency approaches zero (the theoretical probability and the relative frequency get closer and closer together)
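A quick simulation sketch of the law of large numbers for a fair coin (results vary run to run, but the drift toward 0.5 is visible):

```python
import random

# Relative frequency of heads should approach the theoretical 0.5.
for n in (10, 100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"n={n:>7}: relative frequency = {heads / n:.4f}")
```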
The mean (mu) of a discrete probability function is the expected value E(X)
mu = E(X) = Σ x · P(x)
P(x): probability that X takes on the value x
Standard deviation of a RV/PDF
the standard deviation is the square root of the variance:
sigma = √( Σ (x - mu)² · P(x) )
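Both formulas in one sketch, applied to a hypothetical discrete PDF:

```python
import math

# Hypothetical discrete PDF given as {x: P(x)} pairs.
pdf = {0: 0.2, 1: 0.5, 2: 0.3}
assert abs(sum(pdf.values()) - 1) < 1e-12   # probabilities sum to one

mu = sum(x * p for x, p in pdf.items())     # E(X) = sum of x * P(x)
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in pdf.items()))

print(f"mu = {mu}, sigma = {sigma:.4f}")    # mu = 1.1
```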
Binomial Experiment and Binomial Probability Distribution
There are three characteristics of a binomial experiment
There are a fixed number of trials.
There are only two possible outcomes, called “success” and “failure,” for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial. p + q= 1
The n trials are independent and are repeated using identical conditions. Because the n trials are independent, the outcome of one trial does not help predict the outcome of another. Chance of success vs. failure remains the same for each individual trial.
the outcomes of a binomial experiment fit a binomial probability distribution
the random variable X= the number of successes obtained in the n independent trials
The binomial distribution (from slides)
1st theoretical distribution that underlies all others
A distribution can be theoretical or empirical
The binomial distribution describes the probability of x successes in n trials of a Bernoulli process
Bernoulli process:
2 or more successive trials
2 possible outcomes on each trial
Trials are independent
Probability of success remains constant
Binomial Probability Distribution: Mean and Variance
E(X) = np
V(X)= npq
standard deviation → square root of npq
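The formulas, plus an empirical check by simulating many hypothetical binomial experiments:

```python
import math
import random

n, p = 20, 0.3
q = 1 - p

print("E(X) =", n * p)                    # 6.0
print("V(X) =", n * p * q)                # 4.2
print("sd   =", math.sqrt(n * p * q))     # about 2.05

# Simulate 50,000 binomial experiments: each is n Bernoulli trials.
sims = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]
print("simulated mean:", sum(sims) / len(sims))   # close to 6.0
```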
PDF & CDF: the binomial distribution
Probability Density Function or PDF: the probability of a random variable taking a specific value
Cumulative Distribution Function or CDF: the probability that the random variable X is less than or equal to x
PDF → P(X = x)
CDF → P(X ≤ x)
P(X > x) → 1 - binomialcdf (complement rule: P(X > x) = 1 - P(X ≤ x))
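A self-contained sketch of all three quantities using only the binomial formula (math.comb counts the ways to place x successes among n trials):

```python
from math import comb

def binom_pdf(n, p, x):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): sum the PDF over 0..x."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

n, p = 20, 0.3
print("P(X = 5) :", binom_pdf(n, p, 5))
print("P(X <= 5):", binom_cdf(n, p, 5))
print("P(X > 5) :", 1 - binom_cdf(n, p, 5))   # complement rule
```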