The question of whether to include all experimental values in calculations is posed, especially when encountering measurements that seem questionable (e.g., x = 1, 2, 2, 3, 4, 5, 6, 7, 8, 19).
Such values may arise from power surges, experimental errors, faulty equipment, or may be genuine, necessitating a method to determine their validity.
Gross Errors
Rejecting data without a valid reason is considered inappropriate and unethical.
Data rejection must be based on a defined criterion.
A criterion that is too lenient may retain gross errors, which distort the mean, $\bar{x}$.
A strict criterion may lead to the rejection of genuine results.
Dixon’s Q Test
A simple and commonly used test to determine if a value can be discarded.
The Q test is defined by the formula: Q=gap/range
The gap is the difference between the suspect value (the highest or the lowest) and its nearest neighbour when the values are sorted in numerical order.
The range is the difference between the highest and lowest values.
The calculated Q value is compared to a critical value, Qcrit, based on the number of values and the desired confidence level.
If Q > Q_{crit}, the value is rejected.
Example of Dixon's Q Test
Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
Range: 19−1=18
Testing the lower end:
Gap: 2−1=1
Q=1/18=0.0556
For 10 values and a 90% confidence level, Qcrit=0.412.
Since Q < Q_{crit}, the value 1 is not rejected.
Example of Dixon's Q Test (cont.)
Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
Range: 19−1=18
Testing the higher end:
Gap: 19−8=11
Q=11/18=0.611
For 10 values and a 90% confidence level, Qcrit=0.412.
Since Q > Q_{crit}, the value 19 is rejected.
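The two worked examples above can be reproduced with a short Python sketch. The function name `dixon_q` and its `end` argument are illustrative choices, not part of any library, and the critical value is simply the one quoted in the text for 10 values at 90% confidence.

```python
def dixon_q(values, end="high"):
    """Return Q = gap / range for the lowest or highest value."""
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("need at least three values")
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    if end == "high":
        gap = ordered[-1] - ordered[-2]   # highest value minus its neighbour
    elif end == "low":
        gap = ordered[1] - ordered[0]     # neighbour minus lowest value
    else:
        raise ValueError("end must be 'high' or 'low'")
    return gap / data_range


data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 19]
Q_CRIT_90_N10 = 0.412  # critical value quoted in the text (10 values, 90%)

q_low = dixon_q(data, "low")    # 1/18
q_high = dixon_q(data, "high")  # 11/18
print(f"low end:  Q = {q_low:.4f}, reject = {q_low > Q_CRIT_90_N10}")
print(f"high end: Q = {q_high:.3f}, reject = {q_high > Q_CRIT_90_N10}")
```

For other sample sizes or confidence levels, the critical value would need to be looked up in a published Q table.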
Cautions on Using the Q Test
The Q test assumes the data is from a Gaussian distribution.
However, proving or disproving this assumption is difficult with fewer than 50 values.
Therefore, use the Q test cautiously.
Discarding observations based solely on statistical rules can be misleading.
Applying good judgment and broad experience with the analytical method is often more reliable.
Other Statistical Tests
Grubbs' Test: Similar to Dixon’s Q Test, using $G = \frac{|x - \bar{x}|}{s}$.
Chauvenet’s Criterion: Determines a band within which all values should lie for a certain confidence level.
Peirce’s Criterion: A more rigorous version of Chauvenet’s criterion.
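As a rough illustration of the Grubbs statistic listed above, the sketch below computes $G = |x - \bar{x}| / s$ for the highest and lowest values of the earlier data set. It assumes that s is the sample standard deviation (N − 1 denominator), and it stops short of a decision because no critical values for G are given in these notes. The name `grubbs_statistic` is hypothetical.

```python
import statistics


def grubbs_statistic(values, suspect):
    """G = |suspect - mean| / s, with s taken as the sample standard deviation."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # assumption: N - 1 denominator
    return abs(suspect - mean) / s


data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 19]
g_high = grubbs_statistic(data, 19)
g_low = grubbs_statistic(data, 1)
print(f"G(19) = {g_high:.3f}, G(1) = {g_low:.3f}")
```

The larger G is, the further the suspect value sits from the mean in units of s; the value would then be compared with a tabulated critical value.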
Statistical Analysis in Analytical Chemistry
Goal: To introduce basic statistical analysis relevant to analytical chemistry.
Learning Outcomes: Students should be able to describe, explain, and use basic statistical methods to analyze data.
Recommended Text
D. A. Skoog, D. M. West, F. J. Holler, S. R. Crouch, Fundamentals of Analytical Chemistry, 9th and 10th editions, Brooks/Cole, 2014.
Relevant chapters:
9th edition: Chapters 5 to 7
10th edition: Chapters 4 to 6
Problem Scenario: Cadmium Contamination
Cadmium compounds contaminate phosphate fertilizer imported into New Zealand.
A farmer needs to determine the amount of cadmium in the soil of a 100 m × 100 m paddock.
The farmer collects 100 soil samples, each 1 m³, taken from the top 1 cm of each 10 m × 10 m square.
An analytical chemistry company is employed to analyze the samples, but their equipment can only analyze one 1 m³ sample at a time.
Question: How can the cadmium concentration be determined?
Possible Approaches to the Problem
Measure the concentration in every 1 m³ sample.
Consider whether just one sample can be analyzed.
If the cadmium contamination is homogeneous, the concentration should be identical for each sample.
If heterogeneous, the concentration is likely to vary for each sample.
Reality: Soil is heterogeneous, and fertilizer application is often uneven.
100 Concentrations (xj)
100 concentrations have been measured, one for each sample: $x_1, x_2, x_3, \ldots, x_{99}, x_{100}$.
The concentrations are not all identical, although some may be (e.g., $x_1 = x_2 = x_{31} = x_{37} = x_{58} = x_{96} = x_{97}$).
Ways to Represent Cadmium Concentration
Maximum (highest value)
Minimum (lowest value)
Midrange (midpoint)
Mode (most common value)
Median (separates the set into two equal halves)
Mean (average)
A value selected at random
Example Data Set
Given the data set: x = 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19
What value best represents these numbers?
Values for the Example Data Set
Minimum = 1
Midrange = 10
Maximum = 19
Mode = 2
Median = 4
Random value = 5 (using a random number generator)
Mean = 5.36 (to 2 decimal places)
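The values listed above can be checked with Python's standard `statistics` module. This is only a sketch; the dictionary name `summary` is an arbitrary choice, and the random value is left out because it changes from run to run.

```python
import statistics

x = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19]  # example data set from the text

summary = {
    "minimum": min(x),
    "midrange": (min(x) + max(x)) / 2,  # midpoint of lowest and highest
    "maximum": max(x),
    "mode": statistics.mode(x),         # most common value
    "median": statistics.median(x),     # middle value of the sorted data
    "mean": statistics.mean(x),         # sum divided by the number of values
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```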
Choosing the Best Representation
A table is shown comparing: number of values of x (n), median, midrange, minimum, mean, mode, and maximum with a checkmark beside "Mean".
The Mean
The Mean is often the best choice.
Sum all the values of $x_i$ from $i = 1$ to $N$, and divide the sum by $N$:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \ldots + x_{N-1} + x_N}{N}$
There are 100 samples, so N=100.
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Usefulness of the Mean
The mean gives the farmer the statistic he needs.
However, it doesn’t reveal whether the cadmium is distributed evenly throughout the paddock.
Cases of Cadmium Distribution
Case 1: All samples have the same concentration.
$x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50\ \mathrm{mg/g}$
$\bar{x} = \frac{50 \times 100}{100} = 50\ \mathrm{mg/g}$
Case 2: Half the samples have a high concentration, and half have none.
$x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100\ \mathrm{mg/g}$
$x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0\ \mathrm{mg/g}$
$\bar{x} = \frac{100 \times 50}{100} = 50\ \mathrm{mg/g}$
Case 3: Samples have varying concentrations.
$x_1$ to $x_{10} = 30$, $x_{11}$ to $x_{40} = 40$, $x_{41}$ to $x_{60} = 50$, $x_{61}$ to $x_{90} = 60$, $x_{91}$ to $x_{100} = 70\ \mathrm{mg/g}$
$\bar{x} = 50\ \mathrm{mg/g}$
Accounting for Variation
The Variance: quantifies the spread of the data around the mean.
Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, and divide the sum by $N$:
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}$
$s^2 = \frac{\sum x_i^2}{N} - (\bar{x})^2$
$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$
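A quick numerical check can confirm that the variance computed from the squared deviations agrees with the shortcut form (mean of the squares minus the square of the mean). The sketch below uses the 11-value example data set from earlier in the notes; the variable names are illustrative.

```python
import statistics

x = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19]  # example data set from the text
n = len(x)
mean = sum(x) / n

# Definition: mean of the squared deviations from the mean (divide by N).
var_definition = sum((xi - mean) ** 2 for xi in x) / n

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(xi ** 2 for xi in x) / n - mean ** 2

# Library equivalent of the divide-by-N variance.
var_library = statistics.pvariance(x)

print(var_definition, var_shortcut, var_library)
```

The three results should agree to within floating-point rounding.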
Standard Deviation
Describes the spread of data around the mean.
Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, divide the sum by $N$, and then take the square root:
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}}$
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$
Standard Deviation Interpretation
Indicates how far values are from the mean.
A low standard deviation indicates that the values are clustered around the mean.
A high standard deviation means that the values, or some of them, are not near the mean.
Standard Deviation Calculator & Excel
Diagrams of a calculator and Excel are presented, showing where the functions that calculate the mean, variance, and standard deviation can be found.
Standard Deviation: Case 1
Given $x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50\ \mathrm{mg/g}$
$s = \sqrt{\frac{(50 - 50)^2 + (50 - 50)^2 + \ldots + (50 - 50)^2}{100}} = 0\ \mathrm{mg/g}$
Standard deviation has the same units as the measured values and the mean.
Variance has the same units as the measured values squared.
$s^2 = 0\ \mathrm{mg^2/g^2}$. A standard deviation of 0 means all values are identical; there is no variation.
Standard Deviation: Case 2
Given $x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100\ \mathrm{mg/g}$ and $x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0\ \mathrm{mg/g}$
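To see how the standard deviation separates the three cases even though they share a mean of 50 mg/g, the following sketch rebuilds each data set from the descriptions given earlier and applies the divide-by-N formula through `statistics.pstdev`. The list and dictionary names are illustrative.

```python
import statistics

# Concentrations in mg/g for the three cases described in the text.
case_1 = [50] * 100
case_2 = [100] * 50 + [0] * 50
case_3 = [30] * 10 + [40] * 30 + [50] * 20 + [60] * 30 + [70] * 10

results = {}
for name, values in [("Case 1", case_1), ("Case 2", case_2), ("Case 3", case_3)]:
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # divides by N, not N - 1
    results[name] = (mean, sd)
    print(f"{name}: mean = {mean:.1f} mg/g, s = {sd:.2f} mg/g")
```

Every deviation in Case 2 is exactly 50 mg/g, so its standard deviation is 50 mg/g; Case 3 falls between the two extremes.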
Assuming the original data is normally distributed:
There is a 95% probability that a randomly selected value lies within the range $\mu - 2\sigma$ to $\mu + 2\sigma$.
Normal distribution diagram with the mean labelled $\mu$.
Finding $\mu$
Assuming the original data is normally distributed:
Taking a random value of x in that range
Normal distribution diagram with the mean labelled $\mu$.
$\mu$ Calculation
Assuming the original data is normally distributed:
There is a 95% probability that $\mu$ lies within the range $x - 2\sigma$ to $x + 2\sigma$.
Normal Distribution Diagram.
Paddock Cadmium Concentration
So, for 95% of the soil samples, the value of $\mu$ lies within $x - 2\sigma$ to $x + 2\sigma$.
So, measuring one soil sample gives a range within which $\mu$ lies with 95% probability.
The farmer needs only measure the cadmium concentration of one sample (randomly selected) to have a 95% probability of determining the cadmium concentration of the paddock within a certain range.
Quicker, Cheaper, Uses fewer resources
Conditions – 95% probability and within a range
Non-Normal Cadmium Distribution
But is the original data normally distributed?
No!
A graph of a distribution that is not normal illustrates this point.
Sampling More Than Once
But, by the Central Limit Theorem, the mean of a sample made up of several soil samples is approximately normally distributed, even when the individual values are not.
The larger the sample, the closer the sample means tend to the population mean.
As $N$ increases, $\bar{x} \to \mu$.
Standard Error of the Mean
$\sigma_m = \frac{\sigma}{\sqrt{N}}$
The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the population is the degree to which individuals within the population differ from the population mean.
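A small simulation can make the standard error more tangible. The sketch below draws repeated random samples from a clearly non-normal population (the Case 2 paddock, half at 100 mg/g and half at 0 mg/g) and compares the spread of the sample means with $\sigma / \sqrt{N}$. The sampling is done with replacement, the seed and the number of repeats are arbitrary, and the function name is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# A clearly non-normal population: Case 2 from the text (mg/g).
population = [100] * 50 + [0] * 50
sigma = statistics.pstdev(population)  # population standard deviation


def spread_of_sample_means(n, repeats=2000):
    """Standard deviation of the means of `repeats` random samples of size n."""
    means = [
        statistics.fmean(random.choices(population, k=n))  # with replacement
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)


observed = {n: spread_of_sample_means(n) for n in (4, 25, 100)}
for n, spread in observed.items():
    predicted = sigma / n ** 0.5
    print(f"N = {n:3d}: observed {spread:.2f}, sigma/sqrt(N) = {predicted:.2f}")
```

The observed spread should track the predicted value closely and shrink as N grows, even though the individual concentrations are far from normally distributed.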
Confidence Interval Illustration
A diagram is shown to illustrate the 95% confidence interval.
95%
Area to the left and right of the mean is labeled as 2σm
Confidence Interval
Confidence Interval at a confidence level of 95%
$\bar{x} \pm 2\sigma_m$ (or $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$)
Confidence Interval (cont.)
The confidence interval at a confidence level of 95% is:
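$\bar{x} \pm 2\sigma_m = \bar{x} \pm \frac{2s}{\sqrt{N}}$
$\bar{x} - \frac{2s}{\sqrt{N}}$ to $\bar{x} + \frac{2s}{\sqrt{N}}$
As $N$ increases, $\sigma_m$ decreases and the confidence interval narrows, giving more precision.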
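The narrowing of the interval can be sketched numerically. The function below applies the "mean ± 2 standard errors" rule from the notes; the name `ci95`, the mean of 50 mg/g, and the standard deviation of 10 mg/g are assumed values chosen only for illustration.

```python
import math


def ci95(sample_mean, sd, n):
    """Approximate 95% interval: mean ± 2 * sd / sqrt(n)."""
    half_width = 2 * sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Illustrative numbers only (assumed, not measured): mean 50 mg/g, sd 10 mg/g.
for n in (25, 100, 400):
    low, high = ci95(50, 10, n)
    print(f"N = {n:3d}: {low:.1f} to {high:.1f} mg/g")
```

The factor of 2 is the rounded value used throughout these notes; each fourfold increase in N halves the width of the interval.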
Decreasing the Variation
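$\sigma_m = \frac{\sigma}{\sqrt{N}}$
$\sigma_m$ is inversely proportional to $\sqrt{N}$
To decrease $\sigma_m$ to $\frac{1}{2}\sigma_m$, $N$ must increase to $4N$
e.g. $\sigma = 10$
$\sigma_m = 2$ for $N = 25$ ($\frac{10}{\sqrt{25}} = \frac{10}{5}$)
for $\sigma_m = 1$, $\sqrt{N} = \frac{\sigma}{\sigma_m} = \frac{10}{1}$, so $N = 100 = 25 \times 4$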
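The arithmetic in this example can be turned into two small helper functions, sketched below. Both names are hypothetical; `samples_needed` simply rearranges $\sigma_m = \sigma / \sqrt{N}$ to $N = (\sigma / \sigma_m)^2$ and rounds up.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for n values with standard deviation sigma."""
    return sigma / math.sqrt(n)


def samples_needed(sigma, target_sem):
    """Smallest N for which sigma / sqrt(N) is no larger than target_sem."""
    return math.ceil((sigma / target_sem) ** 2)


print(standard_error(10, 25))   # sigma = 10, N = 25
print(samples_needed(10, 2))    # N required for a standard error of 2
print(samples_needed(10, 1))    # N required for a standard error of 1
```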
Confidence Range
There is a 95% probability that $\mu$ lies within the range $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$.
N = n. Area to the left and right of the mean is labeled as 2σm
N = 4n. Area to the left and right of the mean is labeled as 2σm
N = 16n. Area to the left and right of the mean is labeled as 2σm
Estimating σ
Problem: How does the farmer know the value of σ?
Determining the range of values within which $\mu$ lies with 95% confidence relies on knowing the standard deviation of the population, σ.
To determine σ, the analysts need to measure the cadmium concentrations of all of the 100 soil samples.
Estimating Population SD
Is there a way of estimating the population standard deviation, σ, from the sample?
Each sample (of samples) has a mean, $\bar{x}$.
Since a sample comprises a number of individual values, it also has a standard deviation.
Sample SD Equation
Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$, divide the sum by $N - 1$, and then take the square root:
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N - 1}}$
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
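The effect of dividing by N − 1 rather than N can be seen on a small sample. The five concentrations below are hypothetical, chosen only so the arithmetic is easy to follow; the hand calculation is compared with `statistics.stdev` (N − 1) and `statistics.pstdev` (N).

```python
import math
import statistics

sample = [30, 40, 50, 60, 70]  # hypothetical sample of five concentrations, mg/g
n = len(sample)
mean = sum(sample) / n
sum_sq = sum((xi - mean) ** 2 for xi in sample)  # sum of squared deviations

s_sample = math.sqrt(sum_sq / (n - 1))   # divides by N - 1
s_divide_by_n = math.sqrt(sum_sq / n)    # divides by N, for comparison

print(f"N - 1 denominator: {s_sample:.3f} mg/g")
print(f"N denominator:     {s_divide_by_n:.3f} mg/g")
```

The N − 1 version is always the larger of the two, and the difference shrinks as the sample grows.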
Degrees of Freedom
Why N–1 and not N?
The Degrees of Freedom is the number of values in the final calculation of a statistic that are free to vary.
If there are $N$ values and the mean, $\bar{x}$, is known, then only $N - 1$ values can vary; the Nth value ($x_N$) is fixed by the value of the mean.
e.g. $\bar{x} = 1$ and $N = 3$. If $x_1 = 0$ and $x_2 = 2$, then $x_3$ must be 1; it is not free to vary. There are $N - 1 = 3 - 1 = 2$ degrees of freedom, not 3.
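The same constraint can be written as a one-line calculation: once the mean and the other N − 1 values are fixed, the last value follows from $x_N = N\bar{x} - (x_1 + \ldots + x_{N-1})$. The function name below is hypothetical.

```python
def last_value(mean, known_values):
    """Value forced on the final observation once the mean and the
    other N - 1 values are fixed."""
    n = len(known_values) + 1
    return n * mean - sum(known_values)


# The example from the text: N = 3, mean = 1, x1 = 0, x2 = 2.
print(last_value(1, [0, 2]))
```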
Statistical Calculation Help.
Diagrams of a calculator and Excel are presented, showing where the functions that calculate the sample standard deviation can be found.