CHEMY204 Statistical Analysis Notes

Handling Experimental Values

  • The question of whether to include all experimental values in calculations is posed, especially when encountering measurements that seem questionable (e.g., x = 1, 2, 2, 3, 4, 5, 6, 7, 8, 19).
  • Such values may arise from power surges, experimental errors, faulty equipment, or may be genuine, necessitating a method to determine their validity.

Gross Errors

  • Rejecting data without a valid reason is considered inappropriate and unethical.
  • Data rejection must be based on a defined criterion.
  • A lenient criterion may include gross errors, adversely affecting the mean $\bar{x}$.
  • A strict criterion may lead to the rejection of genuine results.

Dixon’s Q Test

  • A simple and commonly used test to determine if a value can be discarded.
  • The Q test is defined by the formula: $Q = \text{gap} / \text{range}$
    • The gap is the difference between the suspect value (the highest or the lowest) and its nearest neighbor when the values are sorted in numerical order.
    • The range is the difference between the highest and lowest values.
  • The calculated Q value is compared to a critical value, $Q_{crit}$, based on the number of values and the desired confidence level.
  • If $Q > Q_{crit}$, the value is rejected.

Example of Dixon's Q Test

  • Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
  • Range: $19 - 1 = 18$
  • Testing the lower end:
    • Gap: $2 - 1 = 1$
    • $Q = 1 / 18 = 0.0556$
    • For 10 values and a 90% confidence level, $Q_{crit} = 0.412$.
    • Since $Q < Q_{crit}$, the value 1 is not rejected.

Example of Dixon's Q Test (cont.)

  • Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
  • Range: $19 - 1 = 18$
  • Testing the higher end:
    • Gap: $19 - 8 = 11$
    • $Q = 11 / 18 = 0.611$
    • For 10 values and a 90% confidence level, $Q_{crit} = 0.412$.
    • Since $Q > Q_{crit}$, the value 19 is rejected.
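
The two worked examples can be reproduced with a short Python sketch. The function name `dixon_q` and the constant `Q_CRIT_N10_90` are illustrative names, not part of the notes; the critical value 0.412 is the one quoted above for 10 values at 90% confidence, and other sample sizes would need their own tabulated value.

```python
def dixon_q(values):
    """Return (Q_low, Q_high): the Q statistic for the lowest and highest value."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least three values")
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical, so Q is undefined")
    q_low = (data[1] - data[0]) / data_range      # gap at the lower end / range
    q_high = (data[-1] - data[-2]) / data_range   # gap at the higher end / range
    return q_low, q_high


# Critical value quoted in the notes for 10 values at 90% confidence.
Q_CRIT_N10_90 = 0.412

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 19]
q_low, q_high = dixon_q(data)
reject_low = q_low > Q_CRIT_N10_90
reject_high = q_high > Q_CRIT_N10_90

print(f"Q (lower end)  = {q_low:.4f}, reject 1:  {reject_low}")
print(f"Q (higher end) = {q_high:.3f}, reject 19: {reject_high}")
```

The sketch only computes the statistic; the decision still depends on choosing the critical value that matches the number of values and the confidence level.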

Cautions on Using the Q Test

  • The Q test assumes the data is from a Gaussian distribution.
  • However, proving or disproving this assumption is difficult with fewer than 50 values.
  • Therefore, use the Q test cautiously.
  • Discarding observations based solely on statistical rules can be misleading.
  • Applying good judgment and broad experience with the analytical method is often more reliable.

Other Statistical Tests

  • Grubbs' Test: Similar to Dixon’s Q Test, using $G = \frac{|x - \bar{x}|}{s}$.
  • Chauvenet’s Criterion: Determines a band within which all values should lie for a certain confidence level.
  • Peirce’s Criterion: A more rigorous version of Chauvenet’s criterion.

Statistical Analysis in Analytical Chemistry

  • Goal: To introduce basic statistical analysis relevant to analytical chemistry.
  • Learning Outcomes: Students should be able to describe, explain, and use basic statistical methods to analyze data.

Recommended Text

  • D. A. Skoog, D. M. West, F. J. Holler, S. R. Crouch, Fundamentals of Analytical Chemistry, 9th and 10th editions, Brooks/Cole, 2014.
  • Relevant chapters:
    • 9th edition: Chapters 5 to 7
    • 10th edition: Chapters 4 to 6

Problem Scenario: Cadmium Contamination

  • Cadmium compounds contaminate phosphate fertilizer imported into New Zealand.
  • A farmer needs to determine the amount of cadmium in the soil of a 100 m × 100 m paddock.
  • The farmer collects 100 soil samples, each 1 m³ from the top 1 cm of each 10 m × 10 m square.
  • An analytical chemistry company is employed to analyze the samples, but their equipment can only analyze one 1 m³ sample at a time.
  • Question: How can the cadmium concentration be determined?

Possible Approaches to the Problem

  • Measure the concentration in every 1 m³ sample.
  • Consider whether just one sample can be analyzed.
  • If the cadmium contamination is homogeneous, the concentration should be identical for each sample.
  • If heterogeneous, the concentration is likely to vary for each sample.
  • Reality: Soil is heterogeneous, and fertilizer application is often uneven.

100 Concentrations (xj)

  • 100 concentrations have been measured, one for each sample: $x_1, x_2, x_3, \ldots, x_{99}, x_{100}$.
  • The concentrations are not all identical, although some may be (e.g., $x_1 = x_2 \neq x_{31} = x_{37} \neq x_{58} \neq x_{96} = x_{97}$).

Ways to Represent Cadmium Concentration

  • Maximum (highest value)
  • Minimum (lowest value)
  • Midrange (midpoint)
  • Mode (most common value)
  • Median (separates the set into two equal halves)
  • Mean (average)
  • A value selected at random

Example Data Set

  • Given the data set: x = 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19
  • What value best represents these numbers?

Values for the Example Data Set

  • Minimum = 1
  • Midrange = 10
  • Maximum = 19
  • Mode = 2
  • Median = 4
  • Random value = 5 (using a random number generator)
  • Mean = 5.36 (to 2 decimal places)
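
As a quick check, the listed values can be recomputed with Python's standard `statistics` module. This is a minimal sketch; the dictionary name `summary` is only illustrative, and the random value is omitted because it depends on the generator used.

```python
import statistics

x = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19]

summary = {
    "minimum": min(x),
    "midrange": (min(x) + max(x)) / 2,     # midpoint between lowest and highest
    "maximum": max(x),
    "mode": statistics.mode(x),            # most common value
    "median": statistics.median(x),        # middle value of the sorted set
    "mean": round(statistics.mean(x), 2),  # 59 / 11, to 2 decimal places
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```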

Choosing the Best Representation

  • A table is shown comparing: number of values of x (n), median, midrange, minimum, mean, mode, and maximum with a checkmark beside "Mean".

The Mean

  • The Mean is often the best choice.
  • Sum all the values of $x_i$ from $i = 1$ to $N$, and divide the sum by $N$:
    • $\bar{x} = \frac{x_1 + x_2 + x_3 + \ldots + x_{N-1} + x_N}{N}$
    • There are 100 samples, so $N = 100$.
    • $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$

Usefulness of the Mean

  • The mean gives the farmer the statistic he needs.
  • However, it doesn’t reveal whether the cadmium is distributed evenly throughout the paddock.

Cases of Cadmium Distribution

  • Case 1: All samples have the same concentration.
    • $x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50\ \text{mg/g}$
    • $\bar{x} = \frac{50 \times 100}{100} = 50\ \text{mg/g}$
  • Case 2: Half the samples have a high concentration, and half have none.
    • $x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100\ \text{mg/g}$
    • $x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0\ \text{mg/g}$
    • $\bar{x} = \frac{100 \times 50}{100} = 50\ \text{mg/g}$
  • Case 3: Samples have varying concentrations.
    • $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • $\bar{x} = 50\ \text{mg/g}$

Accounting for Variation

  • The Variance: quantifies the spread of the data around the mean.
    • Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, and divide the sum by $N$:
    • $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}$
    • $s^2 = \frac{\sum x_i^2}{N} - \bar{x}^2$
    • $s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$
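
The shortcut form, the mean of the squares minus the square of the mean, follows from expanding the definition. The derivation below is a sketch that uses only the definition of $\bar{x}$ given earlier:

```latex
\begin{aligned}
s^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \\
    &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}\,\frac{1}{N}\sum_{i=1}^{N} x_i \;+\; \bar{x}^2 \\
    &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
    &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; \bar{x}^2
\end{aligned}
```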

Standard Deviation

  • Describes the spread of data around the mean.
  • Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, divide the sum by $N$, and then take the square root:
    • $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}}$
    • $s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$

Standard Deviation Interpretation

  • Indicates how far values are from the mean.
  • A low standard deviation indicates that the values are clustered around the mean.
  • A high standard deviation means that the values, or some of them, are not near the mean.

Standard Deviation Calculator & Excel

  • Diagrams of a calculator and excel are presented showing where functions to automatically calculate the mean, variance, and standard deviation can be found.

Standard Deviation: Case 1

  • Given $x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50\ \text{mg/g}$
    • $s = \sqrt{\frac{(50 - 50)^2 + (50 - 50)^2 + \ldots + (50 - 50)^2}{100}} = 0\ \text{mg/g}$
  • Standard deviation has the same units as the measured values and the mean.
  • Variance has the same units as the measured values squared.
    • $s^2 = 0\ \text{mg}^2/\text{g}^2$. A standard deviation of 0 means all values are identical; there is no variation.

Standard Deviation: Case 2

  • Given $x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100\ \text{mg/g}$ and $x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0\ \text{mg/g}$
    • $s^2 = \frac{(100 - 50)^2 + (100 - 50)^2 + \ldots + (100 - 50)^2 + (0 - 50)^2 + (0 - 50)^2 + \ldots + (0 - 50)^2}{100}$
    • $s^2 = \frac{(100 - 50)^2 \times 50 + (0 - 50)^2 \times 50}{100}$
    • $(100 - 50)^2 = 50^2$ and $(0 - 50)^2 = (-50)^2 = 50^2$
    • $s = \sqrt{\frac{50^2 \times 100}{100}} = \sqrt{50^2} = 50\ \text{mg/g}$

Standard Deviation: Case 3

  • Given $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • $s^2 = \frac{(30 - 50)^2 \times 10 + (40 - 50)^2 \times 30 + (50 - 50)^2 \times 20 + (60 - 50)^2 \times 30 + (70 - 50)^2 \times 10}{100}$
    • $s = \sqrt{\frac{20^2 \times 20 + 10^2 \times 60 + 0^2 \times 20}{100}} = \sqrt{140}$
    • $s = 11.832159566199232085134656583123\ \text{mg/g}$

Standard Deviation Accuracy

  • Calculate the standard deviation, $s = 11.832159566199232085134656583123\ \text{mg/g}$
  • But is quoting the standard deviation to 30 decimal places reasonable?
  • The measured values are quoted with no decimal place (30, 40, etc), which should indicate the accuracy of the measuring device.
  • The mean and standard deviation should be quoted with the same degree of accuracy; the same number of decimal places (or perhaps one more).
    • $s = 12\ \text{mg/g}$

Summary of Cases

  • Case 1: $\bar{x} = 50\ \text{mg/g}$, $s = 0\ \text{mg/g}$
  • Case 2: $\bar{x} = 50\ \text{mg/g}$, $s = 50\ \text{mg/g}$
  • Case 3: $\bar{x} = 50\ \text{mg/g}$, $s = 12\ \text{mg/g}$
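
The summary can be checked numerically. The sketch below rebuilds the three cases as Python lists (the variable names are illustrative) and uses `statistics.pstdev`, which divides by $N$ as in the formulas above.

```python
import statistics

# The three cases, 100 concentrations each (mg/g).
case1 = [50] * 100
case2 = [100] * 50 + [0] * 50
case3 = [30] * 10 + [40] * 30 + [50] * 20 + [60] * 30 + [70] * 10

results = {}
for name, values in (("Case 1", case1), ("Case 2", case2), ("Case 3", case3)):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population form: divide by N
    results[name] = (mean, sd)
    print(f"{name}: mean = {mean:.0f} mg/g, s = {sd:.0f} mg/g")
```

All three cases share the same mean, so only the standard deviation distinguishes an even distribution from an uneven one.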

Population vs. Sample Analysis

  • Analyzing every cube gives the farmer:
    • The mean, $\bar{x}$
    • The standard deviation, $s$
  • Because the concentration of cadmium in all the samples (the whole population) has been analyzed, strictly these are:
    • The Population Mean ($\mu$)
    • The Population Standard Deviation ($\sigma$)

Avoiding Full Analysis

  • Can we avoid analyzing every 1 m³ sample?
  • Analyzing every sample is:
    • Time-consuming
    • Expensive
    • Resource intensive
    • Unnecessary (if certain conditions are met)

Analyzing Just One Sample

  • Analyzing one sample will give the farmer the concentration of cadmium in the paddock only if it is homogeneous.
  • If it is heterogeneous (it is!), then how would the farmer know that the measured concentration is that of the whole paddock?

Analyzing More Samples

  • What about analyzing more samples?

Sample Definitions

  • (Physical) sample: a portion of the physical entity to be analyzed (e.g., soil, water, gas, oil, blood, honey).
  • (Statistical) sample: a subset of a population; a selection of values from the complete set of values.

Advantages of Sample Analysis

  • Analyzing a sample of samples is:
    • Less time-consuming
    • Cheaper
    • Uses fewer resources
    • Appropriate (if certain conditions are met)

Sample Mean to Population Mean

  • Does the mean of the concentration of a sample of samples represent the mean of the concentration of the population?
  • Is the sample mean approximately equal to the population mean? Is $\bar{x} \approx \mu$?
  • $\bar{x}$ is the Sample Mean

Comparing Sample Mean to Population Mean: Case 3

  • Case 3: $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • $x_{16} = 40\ \text{mg/g}$
    • $\bar{x} = 40\ \text{mg/g}$
    • $\bar{x} \neq \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont.)

  • $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • Adding a second value to the sample: $x_{33} = 40\ \text{mg/g}$
    • $\bar{x} = \frac{40 + 40}{2} = 40\ \text{mg/g}$
    • $\bar{x} \neq \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont. 2)

  • $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • Adding a third value to the sample: $x_{67} = 60\ \text{mg/g}$
    • $\bar{x} = \frac{40 + 40 + 60}{3} = 46.7\ \text{mg/g}$
    • $\bar{x} \approx \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont. 3)

  • $x_1 \text{ to } x_{10} = 30,\ x_{11} \text{ to } x_{40} = 40,\ x_{41} \text{ to } x_{60} = 50,\ x_{61} \text{ to } x_{90} = 60,\ x_{91} \text{ to } x_{100} = 70\ \text{mg/g}$
    • $\bar{x} = 50\ \text{mg/g}$
    • $\bar{x} = \mu$

Sample Size and Convergence

  • The larger the size of the sample, the closer the sample mean and population mean.
    • As $N$ increases, $\bar{x} \to \mu$
    • The sample mean tends towards the population mean.
  • Of course, it does depend on which soil samples we pick to analyze.
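
A small simulation illustrates both points: the running sample mean drifts toward $\mu$ as more values are included, and the path it takes depends on which samples happen to be picked. This is a sketch using the Case 3 population; the seed `204` is arbitrary, and a different seed gives a different path.

```python
import random
import statistics

# Case 3 population: 100 concentrations (mg/g).
population = [30] * 10 + [40] * 30 + [50] * 20 + [60] * 30 + [70] * 10
mu = statistics.mean(population)

rng = random.Random(204)                          # arbitrary seed, for repeatability
order = rng.sample(population, len(population))   # random order, no repeats

# Sample mean after analyzing the first n randomly chosen soil samples.
running_means = {n: statistics.mean(order[:n]) for n in (1, 5, 10, 25, 50, 100)}

for n, xbar in running_means.items():
    print(f"N = {n:3d}: sample mean = {xbar:5.1f} mg/g (population mean = {mu:.1f})")
```

Once all 100 values are included, the sample is the population and the sample mean equals $\mu$ exactly.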

Sample Value Approximations

  • $\bar{x} = 49 \approx \mu$
  • $\bar{x} = 51 \approx \mu$
  • $\bar{x} = 50 = \mu$
  • $\bar{x} = 50 = \mu$
  • $\bar{x} = 30 \neq \mu$
  • $\bar{x} = 70 \neq \mu$
  • $\bar{x} = 50 = \mu$
  • $\bar{x} = 50 = \mu$

Combinations and Means

  • There are 17,310,309,456,440 possible samples of 10 out of 100: $\frac{100!}{10! \times 90!} = 17{,}310{,}309{,}456{,}440$
  • The lowest $\bar{x}$ is 30 mg/g and the highest $\bar{x}$ is 70 mg/g: $30 \leq \bar{x} \leq 70\ \text{mg/g}$
  • There is only one combination that leads to $\bar{x} = 30\ \text{mg/g}$
  • There is only one combination that leads to $\bar{x} = 70\ \text{mg/g}$
  • The mean of the values of $\bar{x}$ for all the 17,310,309,456,440 samples is $\mu$ (50.0 mg/g)
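
The count of possible samples can be confirmed with `math.comb`. Enumerating all of them is impractical, so the sketch below checks the last claim (the mean of all sample means equals $\mu$) on a deliberately tiny, hypothetical population of six values instead.

```python
import math
from fractions import Fraction
from itertools import combinations

# Number of ways to choose 10 soil samples out of 100.
n_samples = math.comb(100, 10)
print(f"{n_samples:,}")

# Hypothetical small population (not from the notes), small enough to enumerate.
small_population = [30, 40, 40, 50, 60, 70]
k = 3  # sample size

sample_means = [Fraction(sum(c), k) for c in combinations(small_population, k)]
mean_of_means = sum(sample_means) / len(sample_means)
mu_small = Fraction(sum(small_population), len(small_population))

print(f"mean of all {len(sample_means)} sample means = {mean_of_means}, population mean = {mu_small}")
```

Exact fractions are used so that the equality is not blurred by floating-point rounding.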

Central Limit Theorem

  • Central Limit Theorem
    • The sum of independent random variables tends towards a normal distribution, even if the original variables themselves are not normally distributed.

Normal Distribution

  • Bell curve, Gaussian distribution.
  • Midrange = Mode = Median = Mean.

Die Roll

  • Imagine rolling a die.
  • Six possible outcomes (1, 2, 3, 4, 5, 6) with equal probability (one sixth).
  • NOT normally distributed.

Dice Rolls and Sums

  • The more times the die is rolled and the numbers summed, the closer the distribution is to the normal distribution.

Dice Roll Distributions

  • Distributions of dice rolls are displayed.
  • n=1: uniform distribution with p(k) = 1/6 for k = 1,2,3,4,5,6. mean 3.5
  • n=2: distribution peaks at 7. Mean 7
  • n=3: distribution peaks at 10, 11. Mean 10.5
  • n=4: distribution is bell shaped. Mean 14
  • n=5: distribution is more pronounced. Mean 17.5
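
The distributions can be computed exactly rather than read from a plot. The sketch below builds the distribution of the total for `n_dice` fair dice by adding one die at a time; the function names are illustrative.

```python
from fractions import Fraction


def dice_sum_distribution(n_dice):
    """Exact probability of each total when n_dice fair six-sided dice are summed."""
    dist = {0: Fraction(1)}
    for _ in range(n_dice):
        new_dist = {}
        for total, prob in dist.items():
            for face in range(1, 7):
                key = total + face
                new_dist[key] = new_dist.get(key, Fraction(0)) + prob / 6
        dist = new_dist
    return dist


def summarize(n_dice):
    """Return (mean total, list of most likely totals)."""
    dist = dice_sum_distribution(n_dice)
    mean = sum(total * prob for total, prob in dist.items())
    peak = max(dist.values())
    modes = sorted(total for total, prob in dist.items() if prob == peak)
    return mean, modes


for n in range(1, 6):
    mean, modes = summarize(n)
    print(f"n={n}: mean = {float(mean)}, most likely total(s) = {modes}")
```

For a single die every face is equally likely, so all six totals are reported as most likely; from two dice onward a central peak appears.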

Central Limit Theorem (cont.)

  • The sum of independent random variables tends towards a normal distribution, even if the original variables themselves are not normally distributed.
  • The mean $\bar{x}$ is the sum of independent random variables divided by the number of variables, and so, by the theorem, is approximately normally distributed when $N$ is large.
    • $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$

Gaussian Distribution Formula

  • $f(x) = a e^{-\frac{(x - b)^2}{2c^2}}$
  • a, b, and c are constants.

Gaussian Distribution Parameters

  • The position of the curve is determined by b.
  • The shape is determined by it being a Gaussian distribution.
  • The width is determined by $c^2$

Gaussian Distribution Plot

  • $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
  • Graphs are shown for: $\mu = 0, \sigma^2 = 0.2$; $\mu = 0, \sigma^2 = 1.0$; $\mu = 0, \sigma^2 = 5.0$; $\mu = -2, \sigma^2 = 0.5$

Gaussian Distribution Data Ranges

  • 68% within 1 standard deviation
  • 95% within 2 standard deviations
  • 99.7% of the data are within 3 standard deviations of the mean
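
These percentages are rounded values of the area under the Gaussian curve. For a normal distribution, the fraction within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$, which the following sketch evaluates with the standard library (the function name is illustrative).

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * fraction_within(k):.2f}%")
```

The exact figure for 2 standard deviations is slightly above 95%, which is why 1.96 rather than 2 is often quoted for a 95% interval.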

Dice Roll Statistics

  • Imagine rolling a die five times and summing the five numbers.
    • $\bar{x}$ for one die $= \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5$
    • $s^2$ for one die $= \frac{(1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2}{6}$
    • $= \frac{(-2.5)^2 + (-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 2.5^2}{6} = \frac{25 + 9 + 1 + 1 + 9 + 25}{24} = \frac{35}{12}$

Dice Roll Statistics (continued)

  • $\bar{x}$ for five dice $= 5 \times 3.5 = 17.5$
  • $s^2$ for five dice $= 5 \times \frac{35}{12} = \frac{175}{12} = 14.58$
  • $s = \sqrt{14.58} = 3.82$

Dice Roll Distribution Approximation

  • $\bar{x} = 17.5$
  • $2s = 7.64$
  • About 95% of the possible sums lie within $\bar{x} \pm 2s$, i.e. between 9.86 and 25.14.
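
Because five dice give only $6^5 = 7776$ equally likely outcomes, the figures above can be checked by brute force. The sketch below recomputes the mean and variance with exact fractions and counts how many outcomes fall within two standard deviations of the mean; the variable names are illustrative.

```python
import math
from fractions import Fraction
from itertools import product

faces = range(1, 7)

mean_one = Fraction(sum(faces), 6)                      # 7/2 = 3.5
var_one = sum((f - mean_one) ** 2 for f in faces) / 6   # 35/12

# Means and variances of independent rolls add.
mean_five = 5 * mean_one                                # 17.5
var_five = 5 * var_one                                  # 175/12
s_five = math.sqrt(var_five)

low = float(mean_five) - 2 * s_five
high = float(mean_five) + 2 * s_five

# Count the outcomes whose total lies within mean ± 2s.
inside = sum(1 for roll in product(faces, repeat=5) if low <= sum(roll) <= high)
fraction_inside = inside / 6 ** 5

print(f"s = {s_five:.2f}, range = {low:.2f} to {high:.2f}")
print(f"{inside} of {6 ** 5} outcomes ({100 * fraction_inside:.1f}%) lie in that range")
```

The exact share comes out a little above 95%, which is consistent with the normal approximation being only approximate for five dice.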

Data Range Probability

  • Assuming the original data is normally distributed:
  • 95% probability that a randomly selected value will be in this range
  • A normal distribution diagram is shown with the mean labeled $\mu$.

Finding μ

  • Assuming the original data is normally distributed:
  • Taking a random value of $x$ in that range
  • A normal distribution diagram is shown with the mean labeled $\mu$.

μ Calculation

  • Assuming the original data is normally distributed:
  • 95% probability that $\mu$ lies within the range $x - 2s$ to $x + 2s$
  • Normal Distribution Diagram.

Paddock Cadmium Concentration

  • So for 95% of the soil samples, the value of $\mu$ lies within $x - 2s$ to $x + 2s$.
  • So, measuring one soil sample can produce a range within which there is a 95% probability of $\mu$ lying.
  • The farmer needs only measure the cadmium concentration of one sample (randomly selected) to have a 95% probability of determining the cadmium concentration of the paddock within a certain range.
  • Quicker, Cheaper, Uses fewer resources
  • Conditions – 95% probability and within a range

Non-Normal Cadmium Distribution

  • But is the original data normally distributed?
  • No!
  • A graph of a distribution that is not normal is shown to illustrate this point.

Sampling More Than Once

  • But, by Central Limit Theorem, the mean of a sample of samples (more than one sample) is normally distributed.
  • The larger the sample, the closer the sample means tend to the population mean.
    • As $N$ increases, $\bar{x} \to \mu$

Standard Error of the Mean

  • $\sigma_m = \frac{\sigma}{\sqrt{N}}$
  • The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the population is the degree to which individuals within the population differ from the population mean.

Confidence Interval Illustration

  • A diagram is shown to illustrate the 95% confidence interval.
  • 95%
  • Area to the left and right of the mean is labeled as $2\sigma_m$

Confidence Interval

  • Confidence Interval at a confidence level of 95%
    • $\bar{x} \pm 2\sigma_m$ (or $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$)

Confidence Interval (cont.)

  • The confidence interval at a confidence level of 95% is:
    • $\bar{x} \pm 2\sigma_m = \bar{x} \pm \frac{2s}{\sqrt{N}}$
    • $\bar{x} - \frac{2s}{\sqrt{N}}$ to $\bar{x} + \frac{2s}{\sqrt{N}}$
  • As $N$ increases, $\sigma_m$ decreases and the confidence interval narrows, giving more precision.
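
The interval is simple to compute once $\bar{x}$, $s$, and $N$ are known. In the sketch below the function name and the input values ($\bar{x} = 50$, $s = 12$, with $N = 10$ and $N = 40$) are hypothetical, chosen only to show how the interval narrows as $N$ grows.

```python
import math


def confidence_interval_95(xbar, s, n):
    """Approximate 95% confidence interval: xbar ± 2 s / sqrt(n)."""
    half_width = 2 * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical inputs (mg/g), not measured values from the notes.
low_10, high_10 = confidence_interval_95(50, 12, 10)
low_40, high_40 = confidence_interval_95(50, 12, 40)

print(f"N = 10: {low_10:.1f} to {high_10:.1f} mg/g")
print(f"N = 40: {low_40:.1f} to {high_40:.1f} mg/g")
```

Quadrupling the sample size halves the width of the interval, because the half-width scales as $1/\sqrt{N}$.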

Decreasing the Variation

  • $\sigma_m = \frac{\sigma}{\sqrt{N}}$
  • $\sigma_m$ is inversely proportional to $\sqrt{N}$
  • To decrease $\sigma_m$ to $\frac{1}{2}\sigma_m$, $N$ must increase to $4N$
    • e.g. $\sigma = 10$
    • $\sigma_m = 2$ for $N = 25$ ($\frac{10}{\sqrt{25}} = \frac{10}{5}$)
    • for $\sigma_m = 1$, $\sqrt{N} = \frac{\sigma}{\sigma_m} = \frac{10}{1}$, so $N = 100 = 25 \times 4$
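
The same relationship can be turned around to ask how many samples are needed for a target standard error. A minimal sketch, with illustrative function names:

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for population standard deviation sigma and sample size n."""
    return sigma / math.sqrt(n)


def samples_needed(sigma, target_se):
    """Smallest sample size whose standard error does not exceed target_se."""
    return math.ceil((sigma / target_se) ** 2)


print(standard_error(10, 25))    # 2.0
print(samples_needed(10, 1))     # 100
```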

Confidence Range

  • 95% probability that $\mu$ lies within the range $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$
  • N = n. Area to the left and right of the mean is labeled as $2\sigma_m$
  • N = 4n. Area to the left and right of the mean is labeled as $2\sigma_m$
  • N = 16n. Area to the left and right of the mean is labeled as $2\sigma_m$

Estimating σ

  • Problem: How does the farmer know the value of $\sigma$?
  • Determining the range of values that $\mu$ lies within with 95% confidence relies on knowing the standard deviation of the population, $\sigma$.
  • To determine $\sigma$, the analysts need to measure the cadmium concentrations of all of the 100 soil samples.

Estimating Population SD

  • Is there a way of estimating the population standard deviation, $\sigma$, from the sample?
  • Each sample (of samples) has a mean, $\bar{x}$
  • Since a sample comprises a number of individual values, it also has a standard deviation.

Sample SD Equation

  • Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, divide the sum by $N - 1$, and then take the square root:
  • $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N - 1}}$
  • $s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
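
In Python's standard library the two divisors correspond to two different functions: `statistics.stdev` divides by $N - 1$ and `statistics.pstdev` divides by $N$. The sketch below writes the sample formula out by hand and compares it with both, using the three values 40, 40, and 60 mg/g picked in the Case 3 illustration as the sample.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: divide the sum of squared differences by N - 1."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))


sample = [40, 40, 60]  # mg/g

print(f"by hand (N - 1):    {sample_sd(sample):.2f}")
print(f"statistics.stdev:   {statistics.stdev(sample):.2f}")
print(f"statistics.pstdev:  {statistics.pstdev(sample):.2f}  (divides by N)")
```

The $N - 1$ form is always the larger of the two, and the difference matters most when the sample is small.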

Degrees of Freedom

  • Why $N - 1$ and not $N$?
  • The Degrees of Freedom is the number of values in the final calculation of a statistic that are free to vary.
  • If there are $N$ values and the mean, $\bar{x}$, is known, then only $N - 1$ values can vary; the $N$th value ($x_N$) is fixed by the value of the mean.
    • e.g. $\bar{x} = 1$ and $N = 3$. If $x_1 = 0$ and $x_2 = 2$, then $x_3$ must be 1; it is not free to vary. There are $N - 1 = 3 - 1 = 2$ degrees of freedom, not 3.
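
A complementary way to see the effect of the divisor is to check which one recovers the population variance on average. The sketch below is an illustration under stated assumptions, not part of the notes: it takes a fair die as the population, enumerates every possible sample of two independent rolls (sampling with replacement), and averages the variance estimate using each divisor.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
mu = Fraction(sum(faces), 6)
sigma2 = sum((f - mu) ** 2 for f in faces) / 6   # population variance, 35/12


def average_variance_estimate(offset, n=2):
    """Average of sum((x - xbar)^2) / (n - offset) over all samples of n rolls."""
    total = Fraction(0)
    count = 0
    for sample in product(faces, repeat=n):
        xbar = Fraction(sum(sample), n)
        squares = sum((x - xbar) ** 2 for x in sample)
        total += squares / (n - offset)
        count += 1
    return total / count


avg_div_n = average_variance_estimate(0)           # divide by N
avg_div_n_minus_1 = average_variance_estimate(1)   # divide by N - 1

print(f"population variance:      {sigma2}")
print(f"average, divisor N:       {avg_div_n}")
print(f"average, divisor N - 1:   {avg_div_n_minus_1}")
```

Dividing by $N$ underestimates the population variance on average, while dividing by $N - 1$ matches it, which is consistent with one degree of freedom having been used up by the sample mean.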

Statistical Calculation Help

  • Diagrams of a calculator and excel are presented showing where functions to automatically calculate the sample standard deviation can be found.

Relation Between SD

  • Note: For the same