Statistics in Precision Agriculture

Variability and GPS

  • GPS helps identify variability in the field (Lecture 1-3).

Scientific Method and Data Analysis

  • Scientific method is used to understand causes of variability.
  • Step 5 of the scientific method involves analyzing data.
  • Statistics is crucial for meaningful conclusions from data analysis.
  • Decisions without statistics lack robustness.

Data Collection and Variability

  • Data is gathered through mapping, soil sampling, and experiments.
  • Variability exists in data, such as plant height.

Importance of Statistics

  • Statistics is the foundation of precision agriculture.
  • It helps to understand why variability exists.
  • It helps determine if the variability matters.
  • Decisions based on statistics are evidence-based, not guesses.
  • Statistics aids in applying technology for farm management.
  • Simple statistics (mean, median, mode) summarize data and reveal trends.

Why Statistics for Agronomists and Researchers?

  • Professionals need to analyze data to make informed decisions.
  • Precision agriculture involves complex datasets from satellites, drones, sensors, etc.
  • Basic statistical knowledge is essential for initial data understanding.
  • For complex analyses, collaboration with biometricians may be needed.
  • Understanding experimental design (replication, randomization) is crucial when seeking help.

Interpreting Research Papers

  • Statistical formulas in research papers need to be understood for effective information extraction.
  • Example: understanding terms like y-delta, y-dot, and I = 0.95 is essential.
  • Understanding statistical information like vertical bars overlapping on graphs is important.
  • Understanding statistical analysis boxes including split plot design and ANOVA is a must.

Repeating Experiments

  • Knowledge of statistical analysis (ANOVA, LSD, p-value) is necessary when repeating experiments or using methodologies from other research.

Sir Ronald Fisher: The Pioneer of Modern Statistics

  • Developed the ANOVA method, which is still widely used, about 100 years ago.
  • Served at Rothamsted Research Station for 14 years.
  • Helped researchers develop drought-resistant, high-yielding, and disease-resistant varieties.

Basic Statistical Terminology

  • Experiment: A systematic procedure to understand a problem and find new knowledge, involving replication and randomization.
  • Factor: An independent variable that can be changed to observe its effect (e.g., lime application).
  • Treatment: A specific condition of a factor (e.g., good soil, poor soil, poor soil + lime).
  • Levels: The number of treatments within a factor (e.g., three soil types, five herbicides).
  • Variable: A characteristic of interest that can be measured (e.g., plant height, biomass, yield).
  • Experimental Unit: The entity to which treatment is applied (e.g., pot in a greenhouse, plot in a field).
  • Observational Unit: The entity on which measurements are taken (e.g., plant in a pot).
  • Experimental and observational units are sometimes the same, for example in livestock research.
  • Experimental Error: Variation between experimental units; minimized by replication and randomization.
  • Populations and Samples: Making decisions about a population based on measurements from a sample.
  • Statistical inference: collect a sample, take measurements on it, observe the variability, and make a decision about the population.
  • Central Tendency: a central value (the mean, median, or mode) that summarizes a sample or population with a single number.

Central Tendency: Mean, Median, and Mode

  • Mean: Average value calculated by adding all numbers and dividing by the total observations.
  • Trend: for example, treatments that include good soil produce greater plant height, revealing a trend.
  • Mode: The most frequently observed number in a dataset.
  • Median: The center value; data must be organized in ascending order.
    • For even number of data points, take the average of the two middle numbers.
  • The mean can only be calculated for continuous or quantitative data, not nominal or ordinal data.
    • For nominal data, only the mode can be calculated.
    • For ordinal data, the mode and the median can be calculated.
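As a minimal sketch of these three measures, using Python's standard `statistics` module and hypothetical plant heights (the values are illustrative, not from the lecture):

```python
from statistics import mean, median, mode

# Hypothetical plant heights (cm) from a small trial
heights = [12, 15, 15, 18, 20, 22]

print(mean(heights))    # 102 / 6 = 17.0
print(median(heights))  # even count: average of the two middle values, (15 + 18) / 2 = 16.5
print(mode(heights))    # 15 appears most often
```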

Frequency Histogram

  • A simple approach to present variation, showing how many times each value is observed.
  • Unlike a bar graph, a frequency histogram displays the distribution of the data; the two should not be mixed up.
  • A bar graph shows the major differences between individual items, e.g., the difference in height between plant one and plant two.

Data Grouping

  • For large datasets, group data into categories (bins) to create a meaningful frequency histogram.
  • Too few or too many groups can make interpretation difficult; in the lecture example, the group (bin) interval is 25 millimetres.

Formula for Sensible Bin Groups

  • Calculate the range (largest value - smallest value).
  • Determine the number of bin groups (4-7 for small datasets).
  • Apply the formula: Bin Width = Range / Number of Bins
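The bin-width formula can be sketched in Python; the measurements below are hypothetical, and the width is rounded up so the bins cover the full range:

```python
import math

# Hypothetical plant-height measurements (mm)
data = [105, 112, 118, 125, 131, 140, 142, 155, 160, 178]

rng = max(data) - min(data)      # range = 178 - 105 = 73
n_bins = 4                       # 4-7 bins suit a small dataset
width = math.ceil(rng / n_bins)  # 73 / 4 = 18.25, rounded up to 19

# Count how many values fall in each bin to build the frequency histogram
counts = [0] * n_bins
for x in data:
    i = min((x - min(data)) // width, n_bins - 1)
    counts[i] += 1
print(width, counts)  # 19 [3, 4, 2, 1]
```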

Data Shape and Distribution

  • Unimodal Data: One peak in the distribution.
  • Symmetrical Data: Data is normally distributed (bell-shaped curve).
    • 50% of data on each side.
    • Mean, median, and mode are the same.
  • Left Skewed Data: Long tail on the left side.
    • Mean is smaller than the median.
    • Most of the data sits toward the higher values, right of the peak.
  • Right Skewed Data: Long tail on the right side.
    • Mean is larger than the median.
    • Most of the data sits toward the lower values, left of the peak.
  • Use the median when data is skewed, because it is more stable (less affected by extreme values).
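A short illustration of why the median is preferred for skewed data, using a hypothetical right-skewed yield sample with one large outlier:

```python
from statistics import mean, median

# Hypothetical right-skewed yields (t/ha): one large value drags the mean up
yields = [2.0, 2.1, 2.2, 2.3, 9.0]

print(mean(yields))    # 17.6 / 5 = 3.52 -- pulled toward the long right tail
print(median(yields))  # 2.2 -- unaffected by the outlier
```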

Limitations of Central Tendency

  • The mean alone is insufficient to describe data; the spread of the data must also be considered.
  • Quantitative measurement: mean, median, mode.
  • Qualitative measurement: the spread of the data (its quality).
  • Precision agriculture requires both quantity and quality for precise information.
  • The spread of data refers to how closely the values sit around the mean.

Measuring Data Spread

  • Statistical tools to measure data spread:
    • Range
    • Variance
    • Interquartile range
    • Standard deviation
    • Coefficient of variation
    • Standard error of mean

Range

  • Represents the difference between the highest and lowest values.
  • Advantage:
    • Simple.
    • Quick information.
  • Disadvantages:
    • Not suitable for big datasets.
    • Sensitive to outliers.

Quartiles

  • Divide the dataset into four groups after organizing data from lowest to highest value.
  • Less sensitive to outliers.
  • Good for big datasets.
  • Easy to find the first, second, and third quartiles.

Box and Whisker Plot

  • The box is created using the boundaries of the first and third quartiles.
  • The second quartile is the median.
  • Whiskers extend to the minimum and maximum values.
  • The plot indicates data skewness.
  • The interquartile range helps identify outliers (commonly, values beyond 1.5 × IQR from the box).
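The quartile and IQR-based outlier idea can be sketched with the standard library's `statistics.quantiles` (the dataset is hypothetical, and the common 1.5 × IQR fences are assumed):

```python
from statistics import quantiles

data = [10, 12, 13, 14, 15, 16, 18, 19, 21, 40]  # 40 is a suspected outlier

# method="inclusive" treats the data as the whole sample range
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Common convention: values beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q2, q3, outliers)  # 13.25 15.5 18.75 [40]
```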

Standard Deviation

  • Measures the distance of each data point from the center (the mean).
  • Tells us whether the dataset sits close to the center or far from it, visualised with the bell-shaped curve of a symmetrical distribution.
  • The bell shape is a visual concept used by biometricians for easy understanding.
  • One standard deviation ($\sigma$) on each side of the mean (lower and upper) captures about 68% of the data, sitting closely to the center.
  • Two standard deviations capture about 95% of the data.
  • Three standard deviations capture about 99.7% of the data.
  • A low standard deviation means values sit close to the central value.
  • A large standard deviation means values sit far from the central value.
  • Lowercase s is the sample standard deviation; sigma ($\sigma$) is the population standard deviation.
  • Relationship with variance:
    • Standard Deviation = √Variance
    • Conversely, squaring the standard deviation gives the variance.

Calculating Variance

  • Variance: the average squared difference of each data point from the mean.
  • When working with a sample to estimate a population, a correction factor is used: divide by n − 1 instead of n.
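The sample-versus-population distinction (the n − 1 correction) can be shown with the standard `statistics` module on a small hypothetical sample:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 8, 6, 5, 7]  # hypothetical sample, mean = 6, sum of squared deviations = 10

# Sample statistics divide by n - 1 (the correction factor)
print(variance(data))   # 10 / 4 = 2.5
print(stdev(data))      # sqrt(2.5)

# Population statistics divide by n
print(pvariance(data))  # 10 / 5 = 2.0
print(pstdev(data))     # sqrt(2.0)
```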

Standard Error of the Mean

  • Tells us how well the sample mean represents the population mean.
  • A small standard error gives confidence that the sample is an accurate representation of the population.
  • A large standard error indicates the sample may not be truly representative; the data are inconsistent.
  • If the standard error bars of two samples overlap, there is no significant difference between the two samples; their means could be very close.
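The standard error of the mean is the sample standard deviation divided by the square root of the sample size; a quick sketch on the same kind of hypothetical sample:

```python
from statistics import stdev

data = [4, 8, 6, 5, 7]          # hypothetical sample
n = len(data)
sem = stdev(data) / n ** 0.5    # SEM = s / sqrt(n)
print(sem)                      # sqrt(2.5) / sqrt(5) = sqrt(0.5) ≈ 0.707
```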

Coefficient of Variation (CV)

  • Used to compare data spread between two samples with different units or sources as it converts deviation around sample mean to an easy scale.
  • Formula: CV% = (Standard Deviation / Mean) × 100
  • Lower CV value means high consistency in the dataset. Can easily see differences between datasets.
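Because CV is unitless, it lets us compare spread across datasets with different units; a sketch with hypothetical height and yield samples:

```python
from statistics import mean, stdev

# Two hypothetical datasets measured in different units
height_cm = [150, 160, 170, 180]
yield_t = [2.0, 2.2, 2.4, 2.6]

cv_height = stdev(height_cm) / mean(height_cm) * 100  # ~7.8%
cv_yield = stdev(yield_t) / mean(yield_t) * 100       # ~11.2%
print(cv_height, cv_yield)  # the heights are the more consistent dataset
```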

Confidence Interval

  • The range of values within which the population mean is expected to lie.
  • It provides an accepted range of values for a good sample.
  • Formula: Confidence Interval = Sample Mean ± Margin of Error
  • Margin of Error = Z × (Standard Deviation / √n)
  • The minus sign gives the lower boundary; the plus sign gives the upper boundary.
  • The Z value depends on the degree of confidence; a narrow interval close to the mean gives more confidence that the sample range is good.
  • A good sample is one that can represent the population within this range.
  • These values are important in precision agriculture.
  • Degree of confidence: how confident you are, e.g., 95%; at 99% you are more confident the interval is representative of the population.
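The confidence-interval formula can be sketched as follows; the yields are hypothetical, and Z = 1.96 (the standard value for 95% confidence) is assumed:

```python
from statistics import mean, stdev

sample = [52, 48, 50, 53, 47, 51, 49, 50]  # hypothetical yields, mean = 50, s = 2
n = len(sample)

z = 1.96                                 # Z value for 95% confidence
margin = z * stdev(sample) / n ** 0.5    # margin of error = Z * s / sqrt(n)

lower = mean(sample) - margin            # lower boundary
upper = mean(sample) + margin            # upper boundary
print(lower, upper)                      # roughly 48.6 to 51.4
```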

Requirements for Statistical Tools

  • All the statistical tools discussed require normally distributed data, i.e., data that can be measured on a continuous (linear) scale.
  • Use a larger number of observations to help normalize the data; the larger the sample size, the closer the sample mean sits to the population mean.

Importance of Measuring Variability

  • Helps classify production zones.
  • Enables creation of management zones.
  • Enables a shift from traditional practices to modern, site-specific management practices.
  • Site-specific management is only justified where variability exists, for example where soil pH varies into the acidic range; where pH is uniform and cultivars respond the same, there is no variability to act on, so changing management would make no difference.

Summary

  • The measurements discussed describe variability around the mean.
  • They help us decide whether variability exists and whether it is big enough to matter.
  • Always ask yourself whether the data you are using are correct for solving the management issue.
  • It is critically important to ask whether the way you present the data makes the variability look higher or lower than it really is.
  • Think carefully about how you present your dataset in your final results.
  • A larger sample size gives much more reliable measurements and results.