AP Biology Descriptive Statistics, Standard Deviation and Standard Error Study Guide

The Scientific Method and Experimental Design

The scientific method is a systematic process used to explore observations and answer specific questions. Scientists use various tools and settings to test their ideas, including laboratory experiments, field investigations, models, simulations, and existing data sets. These methods are employed to determine the validity of a hypothesis. A hypothesis is a tentative, testable answer to a question, based on prior knowledge and direct observation.

The formal progression of the scientific method includes several key steps:

Make an observation to identify a phenomenon.
Formulate a statement of the problem and ask a specific question.
Propose a hypothesis, which is a tentative answer to the question asked.
Design and conduct an experiment, prioritizing the use of quantifiable data. Mathematics is considered extremely important at this stage.
Use statistical tests to evaluate the significance of the results obtained. A common example is the $\chi^2$ (chi-square) test, which is used to evaluate a null hypothesis.
Reach a conclusion about whether the hypothesis is supported or rejected.

The process follows a logical flow: an observation provokes a question, which leads to the generation of a hypothesis. The hypothesis is tested by an experiment. During the experiment, expected results are compared with actual results. If the hypothesis is supported, a conclusion is formed. If the hypothesis is not supported, the scientist must test an alternative hypothesis. This interpretation of results then provokes new observations, repeating the cycle.

A scientific hypothesis must possess two essential qualities: it must be testable, and it must have the potential to be rejected. For example, if a flashlight fails to work, two hypotheses might be proposed: Hypothesis #1 (dead batteries) and Hypothesis #2 (burnt-out bulb). Each leads to a prediction (e.g., "replacing batteries will fix the problem"). If a test falsifies a hypothesis (e.g., new batteries do not fix the light), that hypothesis is rejected. If a test does not falsify the hypothesis (e.g., a new bulb fixes the light), the hypothesis is supported.

Designing Controlled Experiments and Data Collection

In controlled experiments, researchers begin with two or more groups that are as similar as possible. A method is devised to manipulate only one variable to observe its effect. The Independent Variable is the variable that is manipulated by the researcher. The Dependent Variable is the response that is measured as the result of changes in the independent variable.

When working with data during an experiment, several protocols must be followed:

Make accurate and precise measurements.
Account for error in all measured values.
Develop consistent techniques for collecting data.
Understand the specific units and properties of the data.
Identify trends and patterns within the data results.
Produce visual representations of the data, such as graphs and charts.

Graphical Representation of Scientific Data

Researchers must select the graph type that best illustrates their specific findings. Common types of graphs used in AP Biology include Bar graphs, Histograms, Pie graphs, Line graphs, and Scatter plots.

Bar graphs are most commonly used when the independent variable is categorical rather than numerical, allowing for a side-by-side comparison of the categories.

Pie graphs are used as an alternative to bar graphs when the proportion of data is the primary focus of communication. In these graphs, the dependent variable is represented as a percentage. These graphs should ideally contain six categories or fewer.

Line graphs are the most commonly used graphs in scientific experiments. They effectively depict the relationship between a dependent and an independent variable. The direction of the line allows the researcher to determine the nature of the relationship between variables.

General rules for creating a line graph include:

Identify the independent variable and place it on the x-axis.
Identify the dependent variable and place it on the y-axis.
Choose a consistent scale with regular intervals.
Plot each data point as a dark dot.
Make the graph as large as possible by spreading data across the axes.
Use convenient intervals for grid squares; do not number every single grid line.
Label each axis with the variable name and the specific units of measure.
Title the graph using a short, clear statement of purpose, often formatted as "y-variable vs. x-variable."
Use a single, dedicated sheet of graph paper and avoid using the back of the sheet.

When drawing a Line of Best Fit, the line should be positioned so there are equal numbers of data points above and below it, staying as close to all individual data points as possible. Different relationships can be observed, such as positive linear relationships (e.g., Output per period vs. Employment per period), negative linear relationships (e.g., Life expectancy vs. Cigarettes smoked per day), or non-linear curves (e.g., Life expectancy vs. Fruits and vegetables consumed).
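A line of best fit can also be computed rather than drawn by eye. The sketch below uses ordinary least squares, which is a different technique from the visual rule described above and is not part of this guide; the function name and the data values are made up purely for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: independent variable on x, dependent variable on y.
x_values = [2, 4, 6, 8, 10]
y_values = [1.1, 1.9, 3.2, 3.8, 5.0]

slope, intercept = least_squares_line(x_values, y_values)
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

A positive slope corresponds to a positive linear relationship and a negative slope to a negative one.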

Descriptive Statistics and Central Tendency

Descriptive statistics are used to describe basic features and provide simple summaries of data in a study. These statistics, along with graphical analysis, form the basis for quantitative data analysis.

A key concept is the Normal Curve, which represents the frequency distribution of a large population. Properties of the normal curve include:

It is symmetrical and bell-shaped.
Most data points cluster near the center of the curve, where the mean, median, and mode coincide.
A small portion of the data occurs at the tails of the curve.

There are three primary measures of central tendency:

Mean ( $\bar{x}$ ): The average of all data points. Calculation: sum of all values divided by the total number of values. For the set: $1, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 7, 7, 9, 9$ , the sum is $77$ and $n = 15$ , so the mean is $\approx 5.13$ .
Mode: The most frequent observation. In the set above, the mode is $7$ .
Median: The middle number in an ordered series. In the set above, the median is $5$ .
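The three calculations above can be written out step by step. This is a minimal sketch using only the Python standard library; the variable names are illustrative choices, and the data set is the one from the mean example.

```python
from collections import Counter

data = [1, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 7, 7, 9, 9]

# Mean: sum of all values divided by the number of values.
mean = sum(data) / len(data)

# Median: middle value of the ordered series
# (average of the two middle values when the count is even).
ordered = sorted(data)
n = len(ordered)
middle = n // 2
if n % 2 == 1:
    median = ordered[middle]
else:
    median = (ordered[middle - 1] + ordered[middle]) / 2

# Mode: the value that occurs most often.
mode, frequency = Counter(data).most_common(1)[0]

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```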

Practice activities for central tendency and range:

Data set: $12, 15, 12, 97, 46, 88$ → Mean: $45$ , Median: $30.5$ , Mode: $12$ , Range: $85$ .
Data set: $33, 76, 37, 92, 92, 88$ → Mean: $69.67$ , Median: $82$ , Mode: $92$ , Range: $59$ .
Data set: $4, 12, 4, 77, 4, 4$ → Mean: $17.5$ , Median: $4$ , Mode: $4$ , Range: $73$ .
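The practice answers can be checked with Python's built-in `statistics` module. The helper name `summarize` is a hypothetical choice for this sketch, and the range is taken as the largest value minus the smallest value.

```python
import statistics


def summarize(values):
    """Return the mean (2 decimals), median, mode, and range of a data set."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
    }


practice_sets = [
    [12, 15, 12, 97, 46, 88],
    [33, 76, 37, 92, 92, 88],
    [4, 12, 4, 77, 4, 4],
]

for values in practice_sets:
    print(values, summarize(values))
```

The third set shows why the median and mode can describe a typical value better than the mean when one data point is far from the rest.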

Standard Deviation ( $s$ )

Standard deviation is a measure of how spread out data points are from the mean. In a normal distribution:

About $68.27\%$ of data falls within $\pm 1s$ of the mean ( $\bar{x} \pm 1s$ ).
About $95.45\%$ of data falls within $\pm 2s$ of the mean ( $\bar{x} \pm 2s$ ).
About $99.73\%$ of data falls within $\pm 3s$ of the mean ( $\bar{x} \pm 3s$ ).
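These percentages come from the area under the normal curve. As a rough check, the sketch below uses the standard identity that the fraction of a normal distribution within $k$ standard deviations of the mean equals $\mathrm{erf}(k / \sqrt{2})$; that identity and the function name are not part of this guide. The results agree with the percentages listed above to within about a hundredth of a percent.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within +/-{k}s of the mean: {100 * fraction_within(k):.2f}%")
```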

A lower standard deviation indicates that the data are clustered closer to the mean, which is consistent with the independent variable causing the changes in the dependent variable. A higher standard deviation indicates that the data are more spread out, suggesting other factors may be influencing the results. Two data sets can have the same mean but vastly different standard deviations. For example, in a normally distributed set with a mean of $100$ , a standard deviation of $10$ places about $95\%$ of the data between $80$ and $120$ ( $\bar{x} \pm 2s$ ), while a standard deviation of $50$ places about $95\%$ of the data between $0$ and $200$ .

The steps to calculate standard deviation ( $s$ ) are:

Calculate the mean ( $\bar{x}$ ).
Determine the difference between each data point ( $x_i$ ) and the mean.
Square those differences.
Sum the squares: $\sum (x_i - \bar{x})^2$ .
Divide by the sample size minus one ( $n - 1$ ).
Take the square root: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ .

Example Practice (Sunflower Plant Stomata per area): Data: $88, 93, 90, 92, 75, 78$ . $n = 6$ . Mean ( $\bar{x}$ ) = $86$ . Squares of differences: $(88-86)^2 = 4$ , $(93-86)^2 = 49$ , $(90-86)^2 = 16$ , $(92-86)^2 = 36$ , $(75-86)^2 = 121$ , $(78-86)^2 = 64$ . $\sum = 290$ . $s = \sqrt{\frac{290}{5}} = \sqrt{58} \approx 7.62$ .
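The six calculation steps can be followed line by line in code. This is a minimal sketch using the sunflower stomata data from the example; the function and variable names are illustrative.

```python
import math


def sample_standard_deviation(values):
    """Sample standard deviation, following the six steps listed above."""
    n = len(values)
    mean = sum(values) / n                        # Step 1: mean
    differences = [x - mean for x in values]      # Step 2: x_i minus the mean
    squares = [d ** 2 for d in differences]       # Step 3: square the differences
    sum_of_squares = sum(squares)                 # Step 4: sum the squares
    variance = sum_of_squares / (n - 1)           # Step 5: divide by n - 1
    return math.sqrt(variance)                    # Step 6: square root


stomata = [88, 93, 90, 92, 75, 78]
s = sample_standard_deviation(stomata)
print(f"s = {s:.2f}")
```

Python's built-in `statistics.stdev` uses the same $n - 1$ divisor, so it can serve as a cross-check.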

Standard Error ( $SE$ )

Standard Error ( $SE$ ) indicates how well the sample mean ( $\bar{x}$ ) estimates the true population mean ( $\mu$ ). It serves as a measure of accuracy (if the true mean is known) or precision (if the true mean is unknown). Accuracy refers to how close a measured value is to the actual value, while precision refers to how close measured values are to each other.

To calculate Standard Error ( $SE$ ):

Calculate the standard deviation ( $s$ ).
Divide the standard deviation by the square root of the sample size ( $n$ ): $SE = \frac{s}{\sqrt{n}}$ .
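As a short illustration, the two steps can be applied to the sunflower stomata data from the standard deviation example. The function name is a hypothetical choice; `statistics.stdev` computes the sample standard deviation with the $n - 1$ divisor.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: s divided by the square root of n."""
    s = statistics.stdev(values)          # Step 1: sample standard deviation
    return s / math.sqrt(len(values))     # Step 2: divide by sqrt(n)


stomata = [88, 93, 90, 92, 75, 78]
se = standard_error(stomata)
print(f"SE = {se:.2f}")
```

Because $n$ appears under the square root in the denominator, collecting more samples reduces the standard error even when the standard deviation stays about the same.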

Standard Error is used in bar graphs by plotting each group's mean as the bar height on the y-axis and adding error bars representing $\pm SE$ . The figure caption must state that the error bars represent Standard Error. To analyze these graphs, researchers look for overlap between the error bars of different groups. If the error bars overlap, the difference between the groups is generally considered not significant. If they do not overlap, the difference may be significant.
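The overlap check can be expressed as a comparison of two intervals, each running from the mean minus $SE$ to the mean plus $SE$. In the sketch below, the first group is the sunflower stomata data from this guide, while the second group is hypothetical data invented for the comparison; the function names are also illustrative.

```python
import math
import statistics


def mean_and_se(values):
    """Return the sample mean and its standard error."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


def error_bars_overlap(group_a, group_b):
    """True if the mean +/- SE intervals of the two groups overlap."""
    mean_a, se_a = mean_and_se(group_a)
    mean_b, se_b = mean_and_se(group_b)
    # Two intervals overlap when each lower end is at or below the other's upper end.
    return (mean_a - se_a) <= (mean_b + se_b) and (mean_b - se_b) <= (mean_a + se_a)


stomata = [88, 93, 90, 92, 75, 78]        # data from this guide
comparison = [70, 74, 69, 77, 72, 70]     # hypothetical second group

print("Error bars overlap:", error_bars_overlap(stomata, comparison))
```

Non-overlapping bars only suggest that a difference may be significant; a statistical test is still needed to support that conclusion.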

In summary, Standard Deviation measures how far individual data points fall from the sample mean, while Standard Error estimates how far the sample mean is likely to fall from the true population mean.

95% Confidence Interval and the t-table

When conducting a study, sample means should ideally be compared using a $95\%$ confidence interval, which spans approximately $\bar{x} \pm 2\,SE$ . For small samples (a sample size of less than $30$ ), the multiplier is not exactly $2$ ; it is a critical value $t$ taken from a t-table based on the degrees of freedom ( $df = n - 1$ for a single sample), so that the $95\%$ Confidence Interval is $\bar{x} \pm t \cdot SE$ .

The t-table column for a proportion of $0.05$ in two tails combined (a $0.05$ significance level, corresponding to $95\%$ confidence) provides the following critical values for each degree of freedom ( $df$ ):

$df = 1$ : $12.706$
$df = 2$ : $4.303$
$df = 3$ : $3.182$
$df = 4$ : $2.776$
$df = 5$ : $2.571$
$df = 6$ : $2.447$
$df = 7$ : $2.365$
$df = 8$ : $2.306$
$df = 9$ : $2.262$
$df = 10$ : $2.228$
$df = 11$ : $2.201$
$df = 12$ : $2.179$
$df = 13$ : $2.160$
$df = 14$ : $2.145$
$df = 15$ : $2.131$
$df = 16$ : $2.120$
$df = 17$ : $2.110$
$df = 18$ : $2.101$
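As a worked illustration, the table values can be combined with the standard error to build a confidence interval. The sketch below assumes the standard small-sample form, mean $\pm\ t \cdot SE$ with $df = n - 1$, which this guide does not spell out; the function and dictionary names are hypothetical, and the data are the sunflower stomata counts used earlier.

```python
import math
import statistics

# Two-tailed critical t values at the 0.05 level, keyed by degrees of freedom,
# copied from the table above.
T_CRITICAL_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
    7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
    13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110, 18: 2.101,
}


def confidence_interval_95(values):
    """Return (low, high) for the 95% confidence interval of the mean.

    Assumes df = n - 1 and only covers df values present in the table (1 to 18).
    """
    n = len(values)
    df = n - 1
    t = T_CRITICAL_95[df]   # raises KeyError if df is outside the table
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    margin = t * se
    return mean - margin, mean + margin


stomata = [88, 93, 90, 92, 75, 78]
low, high = confidence_interval_95(stomata)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

With only six samples ( $df = 5$ ), the multiplier is $2.571$ rather than $2$ , so the interval is noticeably wider than mean $\pm\ 2\,SE$ would suggest.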