CHEMY204 Statistical Analysis Notes

Handling Experimental Values

Should all experimental values be included in calculations, even when a measurement seems questionable (e.g., the value 19 in x = 1, 2, 2, 3, 4, 5, 6, 7, 8, 19)?
Such values may arise from power surges, experimental errors, or faulty equipment, or they may be genuine, so a method is needed to decide whether to keep them.

Gross Errors

Rejecting data without a valid reason is considered inappropriate and unethical.
Data rejection must be based on a defined criterion.
A lenient criterion may retain gross errors, adversely affecting the mean $\bar{x}$.
A strict criterion may lead to the rejection of genuine results.

Dixon’s Q Test

A simple and commonly used test to determine if a value can be discarded.
The Q test is defined by the formula: $Q = \frac{\text{gap}}{\text{range}}$
- The gap is the difference between the suspect value (the highest or the lowest) and its nearest neighbour when the values are sorted in numerical order.
- The range is the difference between the highest and lowest values.
The calculated Q value is compared to a critical value, $Q_{crit}$, based on the number of values and the desired confidence level.
If $Q > Q_{crit}$, the value is rejected.

Example of Dixon's Q Test

Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
Range: $19 - 1 = 18$
Testing the lower end:
- Gap: $2 - 1 = 1$
- $Q = 1 / 18 = 0.0556$
- For 10 values and a 90% confidence level, $Q_{crit} = 0.412$ .
- Since $Q < Q_{crit}$, the value 1 is not rejected.

Example of Dixon's Q Test (cont.)

Given the data set: 1, 2, 2, 3, 4, 5, 6, 7, 8, 19
Range: $19 - 1 = 18$
Testing the higher end:
- Gap: $19 - 8 = 11$
- $Q = 11 / 18 = 0.611$
- For 10 values and a 90% confidence level, $Q_{crit} = 0.412$ .
- Since $Q > Q_{crit}$, the value 19 is rejected.
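The two worked examples can be reproduced with a short Python sketch. The function name `dixon_q` is a hypothetical helper, not part of any library, and the critical value is simply the one quoted above for 10 values at 90% confidence; other sample sizes need a value from a Q table.

```python
def dixon_q(values):
    """Return (Q_low, Q_high) for the lowest and highest values."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least three values")
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    q_low = (data[1] - data[0]) / data_range      # gap at the lower end
    q_high = (data[-1] - data[-2]) / data_range   # gap at the higher end
    return q_low, q_high


data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 19]
Q_CRIT_90_N10 = 0.412  # critical value quoted in these notes (10 values, 90%)

q_low, q_high = dixon_q(data)
print(f"Q (lowest value)  = {q_low:.4f}, reject: {q_low > Q_CRIT_90_N10}")
print(f"Q (highest value) = {q_high:.4f}, reject: {q_high > Q_CRIT_90_N10}")
```

Passing the critical value in as a constant keeps the decision rule visible, so the same function can be reused with a different confidence level.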

Cautions on Using the Q Test

The Q test assumes the data is from a Gaussian distribution.
However, proving or disproving this assumption is difficult with fewer than 50 values.
Therefore, use the Q test cautiously.
Discarding observations based solely on statistical rules can be misleading.
Applying good judgment and broad experience with the analytical method is often more reliable.

Other Statistical Tests

Grubbs’ Test: Similar to Dixon’s Q Test, using $G = \frac{|x - \bar{x}|}{s}$, where $x$ is the suspect value.
Chauvenet’s Criterion: Determines a band within which all values should lie for a certain confidence level.
Peirce’s Criterion: A more rigorous version of Chauvenet’s criterion.
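As a rough illustration of the $G$ statistic, the sketch below evaluates it for the suspect value 19 in the Q test data set. It assumes $s$ is the sample standard deviation (dividing by $N - 1$); these notes do not say which form is intended, and no critical value is given here, so the result would still need to be compared against a published table.

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 19]
suspect = 19

mean = statistics.fmean(data)
s = statistics.stdev(data)  # assumption: sample SD (N - 1 in the denominator)
g = abs(suspect - mean) / s

print(f"mean = {mean:.2f}, s = {s:.2f}, G = {g:.2f}")
```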

Statistical Analysis in Analytical Chemistry

Goal: To introduce basic statistical analysis relevant to analytical chemistry.
Learning Outcomes: Students should be able to describe, explain, and use basic statistical methods to analyze data.

Recommended Text

D. A. Skoog, D. M. West, F. J. Holler, S. R. Crouch, Fundamentals of Analytical Chemistry, 9th and 10th editions, Brooks/Cole, 2014.
Relevant chapters:
- 9th edition: Chapters 5 to 7
- 10th edition: Chapters 4 to 6

Problem Scenario: Cadmium Contamination

Cadmium compounds contaminate phosphate fertilizer imported into New Zealand.
A farmer needs to determine the amount of cadmium in the soil of a 100 m × 100 m paddock.
The farmer collects 100 soil samples, each 1 m³, taken from the top 1 cm of each 10 m × 10 m square.
An analytical chemistry company is employed to analyze the samples, but their equipment can only analyze one 1 m³ sample at a time.
Question: How can the cadmium concentration be determined?

Possible Approaches to the Problem

Measure the concentration in every 1 m³ sample.
Consider whether just one sample can be analyzed.
If the cadmium contamination is homogeneous, the concentration should be identical for each sample.
If heterogeneous, the concentration is likely to vary for each sample.
Reality: Soil is heterogeneous, and fertilizer application is often uneven.

100 Concentrations (xj)

100 concentrations have been measured, one for each sample: $x_1, x_2, x_3, \ldots, x_{99}, x_{100}$.
The concentrations are not all identical, although some may be (e.g., $x_1 = x_2 \neq x_{31} = x_{37} \neq x_{58} \neq x_{96} = x_{97}$).

Ways to Represent Cadmium Concentration

Maximum (highest value)
Minimum (lowest value)
Midrange (midpoint)
Mode (most common value)
Median (separates the set into two equal halves)
Mean (average)
A value selected at random

Example Data Set

Given the data set: x = 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19
What value best represents these numbers?

Values for the Example Data Set

Minimum = 1
Midrange = 10
Maximum = 19
Mode = 2
Median = 4
Random value = 5 (using a random number generator)
Mean = 5.36 (to 2 decimal places)
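These values can be checked with Python's standard `statistics` module. This is only a sketch; the random value is left out because it changes from run to run.

```python
import statistics

x = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 19]

summary = {
    "minimum": min(x),
    "maximum": max(x),
    "midrange": (min(x) + max(x)) / 2,      # midpoint of lowest and highest
    "mode": statistics.mode(x),             # most common value
    "median": statistics.median(x),         # middle value of the sorted set
    "mean": round(statistics.mean(x), 2),   # average, to 2 decimal places
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```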

Choosing the Best Representation

A table is shown comparing: number of values of x (n), median, midrange, minimum, mean, mode, and maximum with a checkmark beside "Mean".

The Mean

The Mean is often the best choice.
Sum all the values of $x_i$ from $i = 1$ to $N$, and divide the sum by $N$:
- $\bar{x} = \frac{x_1 + x_2 + x_3 + \ldots + x_{N-1} + x_N}{N}$
- There are 100 samples, so $N = 100$.
- $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$

Usefulness of the Mean

The mean gives the farmer the statistic he needs.
However, it doesn’t reveal whether the cadmium is distributed evenly throughout the paddock.

Cases of Cadmium Distribution

Case 1: All samples have the same concentration.
- $x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50 \frac{mg}{g}$
- $\bar{x} = \frac{(50 \times 100)}{100} = 50 \frac{mg}{g}$
Case 2: Half the samples have a high concentration, and half have none.
- $x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100 \frac{mg}{g}$
- $x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0 \frac{mg}{g}$
- $\bar{x} = \frac{(100 \times 50)}{100} = 50 \frac{mg}{g}$
Case 3: Samples have varying concentrations.
- $x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- $\bar{x} = 50 \frac{mg}{g}$

Accounting for Variation

The Variance: quantifies the spread of the data around the mean.
- Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$ , and divide the sum by $N$ :
- $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}$
- $s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$
- Equivalently, $s^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \bar{x}^2$

Standard Deviation

Describes the spread of data around the mean.
Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$ , divide the sum by $N$ , and then take the square root:
- $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N}}$
- $s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$

Standard Deviation Interpretation

Indicates how far values are from the mean.
A low standard deviation indicates that the values are clustered around the mean.
A high standard deviation means that the values, or some of them, are not near the mean.

Standard Deviation Calculator & Excel

Diagrams of a calculator and Excel are presented, showing where the functions that automatically calculate the mean, variance, and standard deviation can be found.

Standard Deviation: Case 1

Given $x_1 = x_2 = x_3 = \ldots = x_{99} = x_{100} = 50 \frac{mg}{g}$
- $s = \sqrt{\frac{(50 - 50)^2 + (50 - 50)^2 + \ldots + (50 - 50)^2}{100}} = 0 \frac{mg}{g}$
Standard deviation has the same units as the measured values and the mean.
Variance has the units of the measured values squared.
- $s^2 = 0 \frac{mg^2}{g^2}$
A standard deviation of 0 means all values are identical; there is no variation.

Standard Deviation: Case 2

Given $x_1 = x_2 = x_3 = \ldots = x_{49} = x_{50} = 100 \frac{mg}{g}$ and $x_{51} = x_{52} = x_{53} = \ldots = x_{99} = x_{100} = 0 \frac{mg}{g}$
- $s^2 = \frac{(100 - 50)^2 + \ldots + (100 - 50)^2 + (0 - 50)^2 + \ldots + (0 - 50)^2}{100}$
- $s^2 = \frac{(100 - 50)^2 \times 50 + (0 - 50)^2 \times 50}{100}$
- $(100 - 50)^2 = 50^2 \text{ and } (0 - 50)^2 = (-50)^2 = 50^2$
- $s = \sqrt{\frac{50^2 \times 100}{100}} = \sqrt{50^2} = 50 \frac{mg}{g}$

Standard Deviation: Case 3

Given $x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- $s^2 = \frac{(30 - 50)^2 \times 10 + (40 - 50)^2 \times 30 + (50 - 50)^2 \times 20 + (60 - 50)^2 \times 30 + (70 - 50)^2 \times 10}{100}$
- $s = \sqrt{\frac{20^2 \times 20 + 10^2 \times 60 + 0^2 \times 20}{100}} = \sqrt{140}$
- $s = 11.832159566199232085134656583123 \frac{mg}{g}$

Standard Deviation Accuracy

The calculated standard deviation is $s = 11.832159566199232085134656583123 \frac{mg}{g}$
But is quoting the standard deviation to 30 decimal places reasonable?
The measured values are quoted with no decimal places (30, 40, etc.), which should reflect the precision of the measuring device.
The mean and standard deviation should be quoted to the same precision as the measured values: the same number of decimal places (or perhaps one more).
- $s = 12 \frac{mg}{g}$

Summary of Cases

Case 1: $\bar{x} = 50 \frac{mg}{g}$, $s = 0 \frac{mg}{g}$
Case 2: $\bar{x} = 50 \frac{mg}{g}$, $s = 50 \frac{mg}{g}$
Case 3: $\bar{x} = 50 \frac{mg}{g}$, $s = 12 \frac{mg}{g}$
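The summary can be reproduced with a short Python sketch that builds the three sets of 100 concentrations and applies the divide-by-$N$ form of the standard deviation (`statistics.pstdev`). The dictionary names are illustrative only.

```python
import statistics

# The three cases described in these notes (concentrations in mg/g).
cases = {
    "Case 1": [50] * 100,
    "Case 2": [100] * 50 + [0] * 50,
    "Case 3": [30] * 10 + [40] * 30 + [50] * 20 + [60] * 30 + [70] * 10,
}

results = {}
for name, values in cases.items():
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population form: divides by N
    results[name] = (mean, sd)
    print(f"{name}: mean = {mean:.0f} mg/g, s = {sd:.0f} mg/g")
```

All three cases share the same mean, so only the standard deviation distinguishes an even distribution from an uneven one.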

Population vs. Sample Analysis

Analyzing every cube gives the farmer:
- The mean, $\bar{x}$
- The standard deviation, $s$
Because the concentration of cadmium in all the samples (the whole population) has been analyzed, strictly these are:
- The Population Mean ( $\mu$ )
- The Population Standard Deviation ( $\sigma$ )

Avoiding Full Analysis

Can we avoid analyzing every 1 m³ sample?
Analyzing every sample is:
- Time-consuming
- Expensive
- Resource intensive
- Unnecessary (if certain conditions are met)

Analyzing Just One Sample

Analyzing one sample will give the farmer the concentration of cadmium in the paddock only if it is homogeneous.
If it is heterogeneous (it is!), then how would the farmer know that the measured concentration is that of the whole paddock?

Analyzing More Samples

What about analyzing more samples?

Sample Definitions

(Physical) sample: a portion of the physical entity to be analyzed (e.g., soil, water, gas, oil, blood, honey).
(Statistical) sample: a subset of a population; a selection of values from the complete set of values.

Advantages of Sample Analysis

Analyzing a sample of samples is:
- Less time-consuming
- Cheaper
- Uses fewer resources
- Appropriate (if certain conditions are met)

Sample Mean to Population Mean

Does the mean of the concentration of a sample of samples represent the mean of the concentration of the population?
Is the sample mean approximately equal to the population mean? $\bar{x} ≈ \mu$ ?
$\bar{x}$ is the Sample Mean

Comparing Sample Mean to Population Mean: Case 3

Case 3: $x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- Sample of one value: $x_{16} = 40 \frac{mg}{g}$
- $\bar{x} = 40 \frac{mg}{g}$
- $\bar{x} ≠ \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont.)

$x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- Sample of two values ($x_{16}$ and $x_{33}$): $x_{33} = 40 \frac{mg}{g}$
- $\bar{x} = \frac{40 + 40}{2} = 40 \frac{mg}{g}$
- $\bar{x} ≠ \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont. 2)

$x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- Sample of three values ($x_{16}$, $x_{33}$ and $x_{67}$): $x_{67} = 60 \frac{mg}{g}$
- $\bar{x} = \frac{40 + 40 + 60}{3} = 46.7 \frac{mg}{g}$
- $\bar{x} ≈ \mu$

Comparing Sample Mean to Population Mean: Case 3 (cont. 3)

$x_1 \text{ to } x_{10} = 30, x_{11} \text{ to } x_{40} = 40, x_{41} \text{ to } x_{60} = 50, x_{61} \text{ to } x_{90} = 60, x_{91} \text{ to } x_{100} = 70 \frac{mg}{g}$
- $\bar{x} = 50 \frac{mg}{g}$
- $\bar{x} = \mu$

Sample Size and Convergence

The larger the sample, the closer the sample mean tends to be to the population mean.
- As $N$ increases, $\bar{x} → \mu$
- The sample mean tends towards the population mean.
Of course, it does depend on which soil samples we pick to analyze.
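A small simulation can illustrate this behaviour for the Case 3 paddock. This is only a sketch: the seed and the sample sizes are arbitrary choices, and a different seed picks different cubes and so gives different sample means.

```python
import random
import statistics

# Case 3 population: 100 concentrations in mg/g.
population = [30] * 10 + [40] * 30 + [50] * 20 + [60] * 30 + [70] * 10
mu = statistics.fmean(population)

rng = random.Random(204)  # arbitrary seed so the run is repeatable
sample_means = {}
for n in (1, 3, 10, 30, 100):
    sample = rng.sample(population, n)  # pick n cubes without replacement
    sample_means[n] = statistics.fmean(sample)
    print(f"N = {n:3d}: sample mean = {sample_means[n]:5.1f}, population mean = {mu:.1f}")
```

When the sample contains all 100 cubes, the sample mean is necessarily the population mean; for smaller samples the result depends on which cubes happen to be chosen.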

Sample Value Approximations

$\bar{x} = 49 \approx \mu$
$\bar{x} = 51 \approx \mu$
$\bar{x} = 50 = \mu$
$\bar{x} = 50 = \mu$
$\bar{x} = 30 \neq \mu$
$\bar{x} = 70 \neq \mu$
$\bar{x} = 50 = \mu$
$\bar{x} = 50 = \mu$

Combinations and Means

There are 17,310,309,456,440 possible samples of 10 out of 100: $\frac{100!}{(10! × 90!)} = 17,310,309,456,440$
The lowest $\bar{x}$ is 30 mg / g and the highest $\bar{x}$ is 70 mg / g: $30 \leq \bar{x} \leq 70 \frac{mg}{g}$
There is only one combination that leads to $\bar{x} = 30 \frac{mg}{g}$
There is only one combination that leads to $\bar{x} = 70 \frac{mg}{g}$
The mean of the values of $\bar{x}$ for all the 17,310,309,456,440 samples is $\mu$ (50.0 mg / g)
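The count of possible samples can be confirmed with `math.comb`. Enumerating all of them is impractical, so the sketch below checks the claim about the mean of the sample means on a much smaller, made-up population of six values (an assumption for illustration, not data from the paddock).

```python
import math
from itertools import combinations
from statistics import fmean

# Number of ways to choose 10 samples out of 100.
n_samples = math.comb(100, 10)
print(f"{n_samples:,}")

# Hypothetical small population: every possible sample of 3 out of 6.
small_population = [30, 40, 40, 50, 60, 70]
means = [fmean(c) for c in combinations(small_population, 3)]
print(len(means), fmean(means), fmean(small_population))
```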

Central Limit Theorem

Central Limit Theorem
- The sum of independent random variables tends towards a normal distribution, even if the original variables themselves are not normally distributed.

Normal Distribution

Bell curve, Gaussian distribution.
Midrange = Mode = Median = Mean.

Die Roll

Imagine rolling a die.
Six possible outcomes (1, 2, 3, 4, 5, 6) with equal probability (one sixth).
NOT normally distributed.

Dice Rolls and Sums

The more times the die is rolled and the numbers summed, the closer the distribution is to the normal distribution.

Dice Roll Distributions

Distributions of dice rolls are displayed.
n=1: uniform distribution with p(k) = 1/6 for k = 1,2,3,4,5,6. mean 3.5
n=2: distribution peaks at 7. Mean 7
n=3: distribution peaks at 10, 11. Mean 10.5
n=4: distribution is bell shaped. Mean 14
n=5: distribution is more pronounced. Mean 17.5
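The distributions listed above can be generated by enumerating every possible outcome. The sketch below does this for one to five dice; `sum_distribution` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product


def sum_distribution(n_dice):
    """Return {sum: probability} for the total of n fair six-sided dice."""
    counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=n_dice))
    total = 6 ** n_dice
    return {s: c / total for s, c in sorted(counts.items())}


dice_summary = {}
for n in range(1, 6):
    dist = sum_distribution(n)
    mean = sum(s * p for s, p in dist.items())
    top = max(dist.values())
    peaks = [s for s, p in dist.items() if p == top]  # most likely sum(s)
    dice_summary[n] = (mean, peaks)
    print(f"n={n}: mean = {mean:.1f}, most likely sum(s) = {peaks}")
```

Printing `sum_distribution(5)` in full shows the probabilities rising smoothly to the centre and falling away symmetrically, which is the bell shape described above.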

Central Limit Theorem (cont.)

The sum of independent random variables tends towards a normal distribution, even if the original variables themselves are not normally distributed.
The mean $\bar{x}$ is the sum of independent random variables divided by the number of variables, and so, by the theorem, is approximately normally distributed when $N$ is large enough.
- $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$

Gaussian Distribution Formula

$f(x) = a e^{\frac{-(x - b)^2}{2c^2}}$
a, b, and c are constants.

Gaussian Distribution Parameters

The position of the curve is determined by b.
The shape is determined by it being a Gaussian distribution.
The width is determined by $c^2$

Gaussian Distribution Plot

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x - \mu)^2}{2\sigma^2}}$
Graphs are shown for: $\mu=0, \sigma^2=0.2$ , $\mu=0, \sigma^2=1.0$ , $\mu=0, \sigma^2=5.0$ , $\mu=-2, \sigma^2=0.5$

Gaussian Distribution Data Ranges

About 68% of the data are within 1 standard deviation of the mean
About 95% of the data are within 2 standard deviations of the mean
About 99.7% of the data are within 3 standard deviations of the mean
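These percentages follow from the Gaussian curve itself: the fraction of a normal distribution lying within $k$ standard deviations of the mean is $\text{erf}(k / \sqrt{2})$. A minimal check using the standard library (`fraction_within` is an illustrative name):

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, fraction in coverage.items():
    print(f"within {k} standard deviation(s): {fraction:.2%}")
```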

Dice Roll Statistics

Imagine rolling a die five times and summing the five numbers.
- $\bar{x}$ for one die = $\frac{(1 + 2 + 3 + 4 + 5 + 6)}{6} = \frac{21}{6} = 3.5$
- $s^2$ for one die = $\frac{(1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2}{6}$
- $= \frac{[(-2.5)^2 + (-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 2.5^2]}{6} = \frac{(25 + 9 + 1 + 1 + 9 + 25)}{24} = \frac{35}{12}$

Dice Roll Statistics (continued)

$\bar{x}$ for five dice = $5 × 3.5 = 17.5$
$s^2$ for five dice = $5 × \frac{35}{12} = \frac{175}{12} = 14.58$
$s = \sqrt{14.58} = 3.82$

Dice Roll Distribution Approximation

$\bar{x} = 17.5$
$2s = 7.64$
About 95% of the possible sums lie within $\bar{x} \pm 2s$ (from 9.86 to 25.14).
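Because five dice have only $6^5 = 7776$ possible outcomes, the fraction of sums inside the $\bar{x} \pm 2s$ region can be counted exactly. The sketch below recomputes the variance with exact fractions and then counts; it comes out at roughly 97%, close to the 95% expected for a true normal distribution.

```python
import math
from fractions import Fraction
from itertools import product

faces = range(1, 7)
mean_one = Fraction(sum(faces), 6)                      # 7/2 for one die
var_one = sum((f - mean_one) ** 2 for f in faces) / 6   # 35/12 for one die

n_dice = 5
mean_five = n_dice * mean_one   # means add for independent rolls
var_five = n_dice * var_one     # variances add for independent rolls
s_five = math.sqrt(var_five)

low = float(mean_five) - 2 * s_five
high = float(mean_five) + 2 * s_five

sums = [sum(roll) for roll in product(faces, repeat=n_dice)]
inside = sum(low <= s <= high for s in sums)
fraction_inside = inside / len(sums)

print(f"s^2 = {var_five}, s = {s_five:.2f}")
print(f"{inside} of {len(sums)} sums lie within 2s of the mean ({fraction_inside:.1%})")
```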

Data Range Probability

Assuming the original data is normally distributed:
95% probability that a randomly selected value will be in this range
Normal distribution diagram with the mean labelled $\mu$.

Finding $\mu$

Assuming the original data is normally distributed:
Taking a random value of $x$ in that range
Normal distribution diagram with the mean labelled $\mu$.

$\mu$ Calculation

Assuming the original data is normally distributed:
95% probability that $\mu$ lies within the range $x - 2\sigma$ to $x + 2\sigma$
Normal distribution diagram.

Paddock Cadmium Concentration

So for 95% of the soil samples the value of $\mu$ lies within $x - 2\sigma$ to $x + 2\sigma$
So, measuring one soil sample can produce a range within which $\mu$ lies with 95% probability.
The farmer needs only measure the cadmium concentration of one sample (randomly selected) to have a 95% probability of determining the cadmium concentration of the paddock within a certain range.
Quicker, Cheaper, Uses fewer resources
Conditions – 95% probability and within a range

Non-Normal Cadmium Distribution

But is the original data normally distributed?
No!
Graph of distribution not normally distributed illustrated this point.

Sampling More Than Once

But, by the Central Limit Theorem, the mean of a sample of samples (more than one sample) is approximately normally distributed.
The larger the sample, the closer the sample means tend to the population mean.
- As $N$ increases $\bar{x} → \mu$

Standard Error of the Mean

$\sigma_m = \frac{\sigma}{\sqrt{N}}$
The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the population is the degree to which individuals within the population differ from the population mean.

Confidence Interval Illustration

A diagram is shown to illustrate the 95% confidence interval.
95%
Area to the left and right of the mean is labeled as 2 $\sigma_m$

Confidence Interval

Confidence Interval at a confidence level of 95%
- $\bar{x} \pm 2\sigma_m$ (or $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$)

Confidence Interval (cont.)

The confidence interval at a confidence level of 95% is:
- $\bar{x} \pm 2\sigma_m = \bar{x} \pm \frac{2\sigma}{\sqrt{N}}$
- $\bar{x} - \frac{2\sigma}{\sqrt{N}} \text{ to } \bar{x} + \frac{2\sigma}{\sqrt{N}}$
As $N$ increases, $\sigma_m$ decreases and the confidence interval narrows, giving more precision.
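A minimal sketch of this calculation is given below. The function name and the input numbers (a sample mean of 50 mg/g, a standard deviation of 12 mg/g and 9 samples) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def confidence_interval_95(sample_mean, sigma, n):
    """Approximate 95% confidence interval: sample mean +/- 2 standard errors."""
    sigma_m = sigma / math.sqrt(n)  # standard error of the mean
    return sample_mean - 2 * sigma_m, sample_mean + 2 * sigma_m


# Hypothetical inputs, not measurements from the paddock.
lower, upper = confidence_interval_95(sample_mean=50.0, sigma=12.0, n=9)
print(f"95% confidence interval: {lower} to {upper} mg/g")
```

With these inputs the standard error is 12 / 3 = 4 mg/g, so the interval runs from 42 to 58 mg/g.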

Decreasing the Variation

$\sigma_m = \frac{\sigma}{\sqrt{N}}$
$\sigma_m$ is inversely proportional to $\sqrt{N}$
To decrease $\sigma_m$ to $\frac{1}{2}\sigma_m$, $N$ must increase to $4N$
- e.g. $\sigma = 10$
- $\sigma_m = 2$ for $N = 25$ ($\frac{10}{\sqrt{25}} = \frac{10}{5} = 2$)
- for $\sigma_m = 1$, $\sqrt{N} = \frac{\sigma}{\sigma_m} = \frac{10}{1}$, so $N = 100 = 25 \times 4$
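The same arithmetic in Python, using the example value $\sigma = 10$. The $N = 400$ row is an extra illustration that is not in the notes; it shows that a second quadrupling halves the standard error again.

```python
import math

sigma = 10  # example value used in the notes

# Each fourfold increase in N halves the standard error of the mean.
standard_errors = {n: sigma / math.sqrt(n) for n in (25, 100, 400)}
for n, sigma_m in standard_errors.items():
    print(f"N = {n:3d}: sigma_m = {sigma_m}")
```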

Confidence Range

95% probability that $\mu$ lies within the range $\bar{x} - 2\sigma_m$ to $\bar{x} + 2\sigma_m$
Diagrams are shown for $N = n$, $N = 4n$ and $N = 16n$; in each, the region extending $2\sigma_m$ to either side of the mean is marked.

Estimating $\sigma$

Problem: How does the farmer know the value of $\sigma$?
Determining the range of values that $\mu$ lies within with 95% confidence relies on knowing the standard deviation of the population, $\sigma$.
To determine $\sigma$ , the analysts need to measure the cadmium concentrations of all of the 100 soil samples.

Estimating Population SD

Is there a way of estimating the population standard deviation, $\sigma$ , from the sample?
Each sample (of samples) has a mean, $\bar{x}$
Since a sample comprises a number of individual values, it also has a standard deviation.

Sample SD Equation

Sum all the squares of the differences between $x_i$ and $\bar{x}$ from $i = 1$ to $N$ , divide the sum by $N - 1$ and then take the square root:
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_N - \bar{x})^2}{N - 1}}$
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
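In Python's `statistics` module, `stdev` divides by $N - 1$ and `pstdev` divides by $N$. The sketch below applies both to the three-value sample met earlier in Case 3 ($x_{16}$, $x_{33}$, $x_{67}$), purely to show the difference between the two forms.

```python
import statistics

sample = [40, 40, 60]  # x16, x33 and x67 from Case 3, in mg/g

s_sample = statistics.stdev(sample)       # divides by N - 1 (sample form)
s_population = statistics.pstdev(sample)  # divides by N (population form)

print(f"N - 1 form: s = {s_sample:.1f} mg/g")
print(f"N form:     s = {s_population:.1f} mg/g")
```

The $N - 1$ form always gives the larger value, and the gap between the two shrinks as $N$ grows.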

Degrees of Freedom

Why $N – 1$ and not $N$ ?
The Degrees of Freedom is the number of values in the final calculation of a statistic that are free to vary.
If there are $N$ values and the mean, $\bar{x}$, is known, then only $N - 1$ values can vary; the $N$th value ($x_N$) is fixed by the value of the mean.
- e.g. $\bar{x} = 1$ and $N = 3$. If $x_1 = 0, x_2 = 2$ then $x_3$ must be 1; it is not free to vary. There are $N - 1 = 3 - 1 = 2$ degrees of freedom, not 3.
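The constraint can be made concrete in a few lines: once the mean and $N - 1$ of the values are known, the last value follows from $x_N = N\bar{x} - (x_1 + \ldots + x_{N-1})$. The helper name `last_value` is hypothetical.

```python
def last_value(mean, known_values, n):
    """Return the one remaining value that is fixed by the mean of n values."""
    if len(known_values) != n - 1:
        raise ValueError("exactly n - 1 values must be supplied")
    return n * mean - sum(known_values)


# The example from the notes: mean 1, N = 3, x1 = 0, x2 = 2.
x3 = last_value(mean=1, known_values=[0, 2], n=3)
print(x3)
```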

Statistical Calculation Help

Diagrams of a calculator and Excel are presented, showing where the functions that automatically calculate the sample standard deviation can be found.

Relation Between SD

Note: For the same