Population
The whole set of items that are of interest
Census
Observes or measures every member of a population
Sample
A selection of observations taken from a subset of the population which is used to find out information about the population as a whole
Census - Adv & Disadv
Adv
Completely accurate
Disadv
Time consuming & expensive
Cannot be used when the testing process destroys the item
Hard to process large quantity of data
Sample - Adv & Disadv
Adv
Less time consuming & less expensive than a census
Fewer people have to respond
Less data to process than in a census
Disadv
Data may not be as accurate
The sample may not be large enough to give information about small subgroups of the population
Sampling units
Individual units of a population
Sampling frame
Sampling units of a population individually named or numbered to form a list
Simple random sampling
Number the list from 001 to ______
Select x random numbers using a random number generator
Ignore repeats
Continue until you have x numbers
Select the corresponding items from the data sheet
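The steps above can be sketched in Python. This is only an illustration: the function name, the `seed` parameter and the 50-item frame are made up for the example.

```python
import random

def simple_random_sample(frame, x, seed=None):
    """Pick x distinct items from a numbered sampling frame."""
    if x > len(frame):
        raise ValueError("sample size cannot exceed the size of the frame")
    rng = random.Random(seed)
    chosen = []                                # item numbers selected so far
    while len(chosen) < x:                     # continue until you have x numbers
        number = rng.randint(1, len(frame))    # random number from 1 to N
        if number in chosen:                   # ignore repeats
            continue
        chosen.append(number)
    # item number k corresponds to frame[k - 1]
    return [frame[k - 1] for k in chosen]

# Hypothetical sampling frame numbered 001 to 050
frame = [f"item {k:03d}" for k in range(1, 51)]
sample = simple_random_sample(frame, 5, seed=1)
```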
Systematic sampling
The required elements are chosen at regular intervals from an ordered list
Stratified sampling
The population is divided into mutually exclusive strata and a random sample is taken from each
The proportion of each stratum sampled should be the same
Stratified sampling formula
The number sampled in a stratum = (number in stratum / number in population) x overall sample size
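A short sketch of the formula in Python, assuming a made-up school with two year groups. The function name and the numbers are illustrative only. Results that are not whole numbers would need rounding.

```python
def stratum_sample_sizes(strata_sizes, overall_sample_size):
    """Number sampled in each stratum =
    (number in stratum / number in population) x overall sample size."""
    population = sum(strata_sizes.values())
    return {
        name: size * overall_sample_size / population
        for name, size in strata_sizes.items()
    }

# Hypothetical population: 120 in Year 12, 80 in Year 13, sample of 40
sizes = stratum_sample_sizes({"Year 12": 120, "Year 13": 80}, 40)
```

Here Year 12 makes up 120/200 of the population, so it gets 24 of the 40 places, and Year 13 gets the remaining 16.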
Simple random sampling - Adv & Disadv
Adv
Free of bias
Easy & cheap to implement for small populations and small samples
Each sampling unit has a known and equal chance of selection
Disadv
Not suitable when the population size or the sample size is large
A sampling frame is needed
Systematic sampling - Adv & Disadv
Adv
Simple and quick to use
Suitable for large samples and large populations
Disadv
A sampling frame is needed
It can introduce bias if the sampling frame is not random
Stratified sampling - Adv & Disadv
Adv
Sample accurately reflects the population structure
Guarantees proportional representation of groups within a population
Disadv
Population must be clearly classified into distinct strata
Not suitable when the population size or the sample size is large
A sampling frame is needed
Quota sampling
How many members of each group you wish to sample is decided in advance and opportunity sampling is used until you have a large enough sample for each group
Opportunity sampling
Consists of taking the sample from people who are available at the time the study is carried out and who fit the criteria you are looking for
Quantitative variable
Data associated with numerical observations
Qualitative variable
Data associated with non-numerical observations
Mode / Modal class
-Qualitative and quantitative data -The value or class that occurs most often -Not informative if each value occurs once
Median (Q2)
-((n+1)/2)th term -The middle value when the data values are put in order -Quantitative data -Not affected by extreme values
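A minimal Python sketch of the ((n+1)/2)th term rule. The function name is made up for this example. When (n+1)/2 is not a whole number, the median is taken as halfway between the two middle values.

```python
def median_value(values):
    """Median as the ((n + 1) / 2)th term of the ordered data."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)             # put the data values in order
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the median
    lower = ordered[int(position) - 1]   # term at the whole-number part
    if position == int(position):        # odd n: a single middle term
        return lower
    upper = ordered[int(position)]       # even n: average the two middle terms
    return (lower + upper) / 2
```

For example, `median_value([3, 1, 4, 1, 5])` picks the 3rd ordered term, and `median_value([1, 2, 3, 4])` averages the 2nd and 3rd terms.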
Mean (x̄)
-Average of values -Quantitative data -Uses all data -Affected by extreme values
x̄= Σx / n
Mean (frequency table)
x̄ = Σxf / Σf, where x = midpoint of each class interval
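As an illustration, the grouped mean can be computed like this in Python. The classes and frequencies are invented for the example.

```python
def grouped_mean(classes, frequencies):
    """Estimate the mean of grouped data using class midpoints."""
    midpoints = [(low + high) / 2 for low, high in classes]
    total_xf = sum(x * f for x, f in zip(midpoints, frequencies))
    return total_xf / sum(frequencies)

# Hypothetical frequency table
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]
mean = grouped_mean(classes, frequencies)   # (5*2 + 15*5 + 25*3) / 10
```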
Lower quartile
Is one-quarter of the way through the data set
Upper quartile
Is three-quarters of the way through the data set
Calculator
Menu 2 List 1 - Values List 2 - Frequencies F2 (CALC) 1VAR
Interpolation
Make predictions of the dependent variable within the range of the given data
Extrapolation
Make predictions of the dependent variable outside the range of the given values (not as accurate)
Range
The difference between the largest and smallest values in the data set
Interquartile range
The difference between the upper quartile and the lower quartile, Q₃ - Q₁
Interpercentile range
The difference between the values for two given percentiles
Variance
σ² = Σ(x - x̄)² / n = (Σx² / n) - (Σx / n)²
'the mean of the squares minus the square of the mean'
Standard deviation
Square root of the variance
σ = √(Σ(x - x̄)² / n) = √((Σx² / n) - (Σx / n)²)
Variance (frequency table)
σ² = Σf(x - x̄)² / Σf = (Σfx² / Σf) - (Σfx / Σf)²
Standard deviation (frequency table)
σ = √(Σf(x - x̄)² / Σf) = √((Σfx² / Σf) - (Σfx / Σf)²)
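The two forms of the variance formula give the same answer. A small Python check, using invented values and frequencies (setting every frequency to 1 gives the ungrouped formulas):

```python
import math

def variance_definition(xs, fs):
    """Sum of f(x - mean)^2 divided by the sum of f."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n

def variance_shortcut(xs, fs):
    """Mean of the squares minus the square of the mean."""
    n = sum(fs)
    mean_of_squares = sum(f * x * x for x, f in zip(xs, fs)) / n
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return mean_of_squares - mean ** 2

# Hypothetical data
xs = [1, 2, 3, 4]
fs = [1, 2, 2, 1]
var_a = variance_definition(xs, fs)
var_b = variance_shortcut(xs, fs)
sd = math.sqrt(var_a)   # standard deviation is the square root of the variance
```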
Outlier
An extreme value that lies outside the overall pattern of the data
Greater than Q₃ + 1.5(Q₃ - Q₁) or less than Q₁ - 1.5(Q₃ - Q₁)
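A sketch of the 1.5 × IQR rule in Python. The quartiles are passed in rather than calculated, because different methods give slightly different quartiles. The function names and data are made up for the example.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper limits beyond which a value is an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """List the values lying outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical data set with Q1 = 10 and Q3 = 20, so the fences are -5 and 35
outliers = find_outliers([2, 12, 15, 18, 40], q1=10, q3=20)
```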
Keep Outlier
An outlier may reflect natural variation, in which case it is still a valid piece of data and should be kept
It may instead be the result of an error in measuring or recording the data, in which case it can be removed
Cleaning the data
Removing anomalies from a data set
Histogram
Can be used to represent grouped continuous data
area of the bar is proportional to the frequency in each class
Can be scaled
Histogram formulas
area of bar = k x frequency
frequency density = frequency / class width
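A quick Python illustration with invented classes of unequal width. When the bar height is the frequency density, the area of each bar equals the frequency, which is the case k = 1.

```python
def frequency_densities(classes, frequencies):
    """Frequency density = frequency / class width."""
    return [f / (high - low) for (low, high), f in zip(classes, frequencies)]

# Hypothetical grouped data with unequal class widths
classes = [(0, 10), (10, 15), (15, 30)]
frequencies = [20, 15, 30]
densities = frequency_densities(classes, frequencies)

# Area of each bar = height (frequency density) x class width
areas = [d * (high - low) for d, (low, high) in zip(densities, classes)]
```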
Frequency Polygon
Plot each frequency at the midpoint of its class and join the points with straight lines
Cumulative Frequency
Plot each cumulative frequency at the upper limit of its class and join the points with a smooth curve
Histogram and Frequency Polygon
Join the middle of the top of each bar in the histogram to form a frequency polygon
Comparing data
Comment on:
Interquartile range (less/more precise?)
Median (On average has a higher/lower____) -Outliers -Positively/Negatively skewed
Strong negative correlation
Weak negative correlation
Weak positive correlation
Strong positive correlation
Correlation
Describes the nature of the linear relationship between two variables "With__outliers" "The higher the _the higher/lower the_ between ___ and ___"
Bivariate data
Data which has pairs of values for two variables
Regression line
The regression line of y on x is written in the form y = a + bx. y can be predicted from x
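A small Python sketch of using a regression line to predict y. The values of a and b and the range of the data are invented for the example. The function also labels the prediction as interpolation or extrapolation, depending on whether x lies inside the range of the given data.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y from the regression line y = a + bx.

    x_min and x_max are the smallest and largest x values in the data.
    """
    y = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind

# Hypothetical line y = 2 + 0.5x fitted to data with x between 5 and 20
inside = predict(2, 0.5, 10, x_min=5, x_max=20)
outside = predict(2, 0.5, 30, x_min=5, x_max=20)   # less reliable
```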
Regression line interpretation
y = a + bx
"If (x in words) increases by 1 (unit on x axis) then (y in words) increases/decreases by (value of b, ignoring the sign) (unit on y axis)"
"If (x in words) is 0 (unit on x axis) then (y in words) is (value of a) (unit on y axis)"
Dependent (response) Variable
Plotted on the y-axis. The variable the researcher measures. Its value is found from the x-axis variable
Independent (explanatory) Variable
Plotted on the x-axis. The variable the researcher controls
Venn diagrams
Can be used to represent events graphically
frequencies or probabilities can be placed in the regions of the Venn diagrams
Intersection
A & B (A ∩ B)
Union
A or B (A ∪ B)
Complement
P(not A) = 1 - P(A), A'
Mutually exclusive events
Both can't happen at the same time
P(A and B) = 0
P(A or B) = P(A) + P(B)
Independent events
When one event happens, it doesn't affect the probability of the other happening
P(A and B) = P(A) x P(B)
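Both rules can be checked numerically. A minimal Python sketch with made-up probabilities; the function names are illustrative.

```python
import math

def are_mutually_exclusive(p_a_and_b):
    """Mutually exclusive events satisfy P(A and B) = 0."""
    return math.isclose(p_a_and_b, 0.0, abs_tol=1e-12)

def are_independent(p_a, p_b, p_a_and_b):
    """Independent events satisfy P(A and B) = P(A) x P(B)."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=1e-12)

# Hypothetical probabilities: P(A) = 0.5, P(B) = 0.4, P(A and B) = 0.2
independent = are_independent(0.5, 0.4, 0.2)
exclusive = are_mutually_exclusive(0.2)
```

With these numbers the events are independent (0.5 x 0.4 = 0.2) but not mutually exclusive, because P(A and B) is not 0.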
Random variable
A variable whose value depends on the outcome of a random event
Probability distribution
Shows all the values of a random variable (x) and their probabilities
Probability mass function
P(X = x)
Interval Length Equation
Number of items in the population ÷ Sample size
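For systematic sampling, the interval length gives the step between chosen items. A Python sketch, assuming the first item is picked at random from the first interval (a common convention that is not stated in these notes). The function name and the list are made up.

```python
import random

def systematic_sample(ordered_list, sample_size, seed=None):
    """Take every k-th item, where k = population size // sample size."""
    if not 0 < sample_size <= len(ordered_list):
        raise ValueError("sample size must be between 1 and the population size")
    k = len(ordered_list) // sample_size      # interval length
    start = random.Random(seed).randrange(k)  # random start within the first interval
    return ordered_list[start::k][:sample_size]

# Hypothetical ordered list of 100 items, sample of 20, so k = 5
ordered = list(range(1, 101))
sample = systematic_sample(ordered, 20, seed=3)
```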
Cluster Sampling
Split the population into clusters. Select a set amount of these clusters at random then take a simple random sample from each of these clusters
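A rough Python sketch of the two stages. The cluster names, sizes and function name are invented for the example.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster, seed=None):
    """Pick clusters at random, then take a simple random sample from each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # stage 1: choose clusters
    return {
        name: rng.sample(clusters[name], per_cluster)   # stage 2: sample within each
        for name in chosen
    }

# Hypothetical population split into 4 clusters of 5 members
clusters = {
    "North": ["N1", "N2", "N3", "N4", "N5"],
    "South": ["S1", "S2", "S3", "S4", "S5"],
    "East": ["E1", "E2", "E3", "E4", "E5"],
    "West": ["W1", "W2", "W3", "W4", "W5"],
}
result = cluster_sample(clusters, n_clusters=2, per_cluster=3, seed=0)
```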
Cluster Sampling Adv & Disadv
Adv
Easy to carry out
Inexpensive
Disadv
Can introduce bias
Members of the population aren't equally likely to be selected, as the probability depends on cluster size (larger cluster, less likely)
The population must be divided into clusters, which can be costly
Increasing the scope of the study increases the number of clusters, which adds time and expense
Box Plot
Median LQ UQ Lowest value that isn't an outlier Highest value that isn't an outlier Outlier (x) Skew
Discrete Data
Data that takes values which change in steps (e.g. shoe size)
Random Variable
Variable whose value is determined by chance
Binomial Distribution (Conditions)
Binary? Trials can be classified as success/failure
Independent? Trials must be independent.
Number? The number of trials (n) must be fixed in advance
Success? The probability of success (p) must be the same for each trial.
Binomial Probability Formula
P(X = x) = (nCx) p^x (1-p)^(n-x)
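The formula translates directly into Python using `math.comb` for nCx. The function name and the example values are made up.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: probability of exactly 2 heads in 5 fair coin tosses
prob = binomial_pmf(2, 5, 0.5)   # 10 * 0.5^5
```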
Distribution of X
X ~ B(n, p), where n = number of trials and p = probability of success
Binomial mean
np, where n = number of trials and p = probability of success
binomial standard deviation
square root of np(1-p)
Binomial variance
np(1-p)
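A short Python check of the mean, variance and standard deviation formulas, with invented values of n and p.

```python
import math

def binomial_summary(n, p):
    """Return the mean, variance and standard deviation of X ~ B(n, p)."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance, math.sqrt(variance)

# Hypothetical example: n = 20 trials, p = 0.25
mean, variance, sd = binomial_summary(20, 0.25)
```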
Null Hypothesis (H0)
Hypothesis you assume to be correct (H0 : p = )
Alternative hypothesis (H1) One tailed test
Reject null hypothesis
To carry out a hypothesis test, you assume the null hypothesis is true and find how likely the observed result (or a more extreme one) is to occur. If that likelihood is less than the significance level, you reject the null hypothesis
significance level
Probability threshold, usually 10%, 5% or 1%
critical region
The area in the tails of the comparison distribution in which the null hypothesis can be rejected. In practice: how many successes are needed before the probability falls below the significance level
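A Python sketch of finding a critical region for a one-tailed test in the lower tail (H1: p < p0). It assumes the "probability below the significance level" rule and looks for the largest c with P(X ≤ c) below that level. The function names and numbers are made up for the example.

```python
from math import comb

def binomial_cdf(c, n, p):
    """P(X <= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

def lower_critical_value(n, p0, significance):
    """Largest c with P(X <= c) < significance, or None if there is none.

    The critical region is then X <= c.
    """
    critical = None
    for c in range(n + 1):
        if binomial_cdf(c, n, p0) < significance:
            critical = c
        else:
            break
    return critical

# Hypothetical test: X ~ B(10, 0.5) under H0, 5% significance level
c = lower_critical_value(10, 0.5, 0.05)
```

Here P(X ≤ 1) ≈ 0.0107 is below 5% but P(X ≤ 2) ≈ 0.0547 is not, so the critical region is X ≤ 1.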
Acceptance region
The region where we accept the null hypothesis
Test the claim
Test the claim (Two tailed test)
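1. Define X
2. X ~ B(n, p)
3. State H0 and H1
4. Find which tail the observed value is in: compare x with the expected value np (x > np or x < np)
5. Halve the significance level, then compare it with the tail probability
6. State whether you accept or reject H0
7. Put the conclusion into context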
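The steps for a two-tailed test can be sketched in Python as follows. This is one possible reading of the steps. The function names and the numbers in the example are made up.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_tailed_test(x, n, p0, significance):
    """Return (tail probability, reject H0?) for H0: p = p0, H1: p != p0."""
    expected = n * p0
    if x >= expected:
        # observed value is in the upper tail: find P(X >= x)
        tail = sum(binomial_pmf(k, n, p0) for k in range(x, n + 1))
    else:
        # observed value is in the lower tail: find P(X <= x)
        tail = sum(binomial_pmf(k, n, p0) for k in range(0, x + 1))
    # halve the significance level, then compare
    reject = tail < significance / 2
    return tail, reject

# Hypothetical example: X ~ B(20, 0.5), observed x = 15, 5% significance level
tail, reject = two_tailed_test(15, 20, 0.5, 0.05)
```

In this example P(X ≥ 15) ≈ 0.0207, which is less than 0.025, so H0 is rejected. The conclusion would still need to be put into context.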