Experimental Designs and Statistical Analysis

Single-Case, Quasi-Experimental, and Developmental Research

Learning Objectives

Single-case experimental designs and reasons for use.
One-group posttest-only design and its utility.
One-group pretest-posttest design and threats to internal validity (history, maturation, testing, instrument decay, regression toward the mean).
Nonequivalent control group design vs. nonequivalent control group pretest-posttest design; advantages of a control group.
Interrupted time series design vs. control series design.
Cross-sectional, longitudinal, and sequential research designs; advantages and disadvantages of each.
Cohort effect.

Single-Case Experimental Design

Single-case experimental designs: experimental designs that allow drawing conclusions about the effect of an experimental manipulation based on data from one or a small number of research participants.
- Also called single-subject design and small-N design.
- The subject’s behavior is measured during a baseline period, followed by experimental manipulation and continued measurement.
The basic issue in single-case experiments is how to determine that the experimental manipulation had an effect.

Reversal Designs

One method of determining that the manipulation had an effect is to demonstrate a reversal of the effect when the manipulation is removed.
Reversal design: a single-case design in which the treatment is introduced after a baseline period and then withdrawn during a second baseline period.
- In an ABA design, behavior is observed during the baseline control (A) period, again during the treatment (B) period, and also during a second baseline (A) period after the experimental treatment has been removed.
- This can be greatly improved by extending it to an ABAB design or even an ABABAB design.
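
As a rough illustration of how ABAB data might be summarized, the sketch below compares phase means. The session counts are hypothetical, and the check assumes a treatment that is meant to reduce the target behavior; real single-case analyses usually rely on visual inspection of the full session-by-session record rather than means alone.

```python
from statistics import mean

# Hypothetical counts of a target behavior per session, grouped by phase.
phases = [
    ("A1 baseline", [8, 9, 7, 8]),
    ("B1 treatment", [4, 3, 3, 2]),
    ("A2 baseline", [7, 8, 8, 9]),
    ("B2 treatment", [3, 2, 3, 2]),
]

# Mean level of the behavior in each phase, in phase order.
phase_means = {label: mean(scores) for label, scores in phases}


def shows_reversal(means):
    """True if behavior drops in each treatment phase and recovers at baseline.

    Assumes the four phases were inserted in A-B-A-B order.
    """
    a1, b1, a2, b2 = means.values()
    return b1 < a1 and a2 > b1 and b2 < a2


for label, value in phase_means.items():
    print(f"{label}: {value}")
print("Reversal pattern:", shows_reversal(phase_means))
```

A pattern in which the behavior changes each time the treatment is introduced and returns each time it is withdrawn is harder to attribute to a coincidental outside event than a single A-to-B change.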

Multiple Baseline Designs

Multiple baseline design: observing behavior before and after a manipulation under multiple circumstances (across individuals, behaviors, or settings).
- Effectiveness of a treatment is demonstrated when a behavior changes only after the manipulation is introduced.
- Such a change must be observed under multiple circumstances to rule out the possibility that other events were responsible.

Replications in Single-Case Designs

The procedure used with one subject can be replicated with others.
Single-case design research often reports on the results for multiple subjects.
Traditional single-case research presents results from each subject individually to avoid masking individual differences.

Quasi-Experimental Designs

Quasi-experimental designs: approximate the control features of true experiments to infer that a given treatment did have its intended effect.
One-group posttest-only design: a quasi-experimental design that has no control group and no pretest comparison.
- This is a poor design in terms of internal validity.

One-Group Pretest-Posttest Design

One-group pretest-posttest design: obtains a comparison by measuring participants before and after manipulation.
Potential threats to internal validity:
- History effects: an outside event occurring between the pretest and posttest that could be responsible for the results.
- Maturation: naturally occurring change within the individual is responsible for the results.
- Testing effect: simply taking the pretest changes the participant’s behavior.
- Instrument decay: change in the basic characteristics of the measuring instrument over time that could be responsible for the results.
- Regression toward the mean: the principle that extreme scores on a variable tend to be closer to the mean when a second measurement is made.

Nonequivalent Control Group Design

Nonequivalent control group design: compares an experimental group with a separate control group, but the groups are not equivalent.
- Differences become a confounding variable, a problem called selection differences or selection bias.

Nonequivalent Control Group Pretest-Posttest Design

Nonequivalent control group pretest-posttest design: compares an experimental group with a nonequivalent control group, with both groups taking a pretest and a posttest.
- A pretest shows how similar the groups were before the manipulation; a posttest shows if the groups differed after the manipulation, despite their dissimilarities.

Propensity Score Matching

Propensity score matching: a method of matching participants in experimental and control conditions based on a combination of scores on several variables.
- Mitigates the problem of studying nonequivalent groups.

Interrupted Time Series Design and Control Series Design

Interrupted time series design: examines the dependent variable over an extended period, both before and after the independent variable is implemented.
- Vulnerable to interpretation problems.
Control series design: an extension of the interrupted time series design in which there is a comparison or control group.
- Involves finding a similar population that did not receive the same manipulation.

Developmental Research Designs

Developmental psychologists use three methods to study changes in people as they age:
- Cross-sectional method: persons of different ages are measured at only one point in time.
- Longitudinal method: the same group of people is observed at different times as they age.
- Sequential method: the longitudinal and cross-sectional methods are combined.

Comparison of Longitudinal and Cross-Sectional Methods

The cross-sectional method is relatively quick and inexpensive.
- Researchers can only infer that any differences found are due to age.
- Cohort: a group of people born at about the same time and exposed to the same societal events.
- Cohort effects: differences among age groups attributed to social, cultural, economic, or political differences rather than to the effect of age.

Comparison of Longitudinal and Sequential Methods

The longitudinal method is the only way to conclusively study changes in people as they age.
- It is expensive and difficult to carry out and takes a long time to yield results.
The sequential method takes less time and effort than the longitudinal method and yields some results right away.
- It does provide some information about how individuals change as they age.
- It does not provide information as complete as a longitudinal study can offer.

Understanding Research Results: Description and Correlation

Lesson Objectives

Compare and contrast the three ways of describing results: comparing group percentages, correlating scores, and comparing group means.
Describe a frequency distribution, including the various ways to display a frequency distribution.
Compare and contrast the three measures of central tendency.
Describe how to determine how much variability exists in a set of scores.
Define what a correlation coefficient is.
Explain what an effect size is.
Describe how researchers use regression equations to predict behavior.
Distinguish between mediation and moderation as ways to explore more complex relationships among variables.
Explain the purpose of more advanced statistical techniques like structural equation modeling.

Scales of Measurement: A Review

Nominal scale: The levels of variables are different categories or groups.
- Levels are simply different categories.
Ordinal scale: measurement categories form a rank order along a continuum.
Interval scale: allows for more sophisticated statistical treatments; the intervals between the levels are equal in size.
Ratio scale: has an absolute zero point that indicates the absence of the variable being measured.

Describing Results

There are three basic ways of describing the results of a study of relationships between variables:
- Comparing group percentages
- Correlating scores
- Comparing group means

Frequency Distributions

Frequency distribution: an arrangement of a set of scores from lowest to highest that indicates the number of times each score was obtained.
Pie charts: Divide a circle into slices that represent relative percentages.
Bar graphs: Use separate and distinct bars for each piece of information.
Frequency polygons: Use a line to represent the distribution of frequencies of scores.
Histograms: Display a frequency distribution for a quantitative variable using bars that are drawn next to each other.
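
A minimal sketch of building a frequency distribution, assuming a small set of hypothetical quiz scores. It counts how often each score occurs, orders the scores from lowest to highest, and prints a simple text histogram.

```python
from collections import Counter

# Hypothetical quiz scores for ten participants.
scores = [3, 5, 4, 3, 2, 5, 3, 4, 3, 1]

counts = Counter(scores)

# (score, frequency) pairs arranged from the lowest score to the highest.
frequency_distribution = [(score, counts[score]) for score in sorted(counts)]

for score, freq in frequency_distribution:
    # One "#" per participant who obtained this score.
    print(f"{score}: {'#' * freq}")
```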

Descriptive Statistics: Central Tendency

Descriptive statistics: statistical measures that allow you to summarize and describe data.
Central tendency: a single number or value that describes the typical or central score among a set of scores.
- Mean: obtained by adding all the scores and dividing by the number of scores, $M = \sum X / n$; abbreviated as M in scientific reports.
- Median: the score that divides the group in half (with 50% scoring below and 50% scoring above); abbreviated as Mdn; appropriate with an ordinal scale and used with interval and ratio scales.
- Mode: the most frequent score; is the only measure of central tendency when a nominal scale is used.
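
The three measures can be compared on one data set. The sketch below uses Python's standard `statistics` module on hypothetical scores that include one extreme value, which pulls the mean upward while leaving the median and mode unchanged.

```python
from statistics import mean, median, mode

# Hypothetical scores; 18 is an extreme score.
scores = [2, 3, 3, 4, 5, 7, 18]

m = mean(scores)      # sum of the scores divided by the number of scores
mdn = median(scores)  # middle score once the scores are ordered
mo = mode(scores)     # most frequent score

print(f"M = {m}, Mdn = {mdn}, mode = {mo}")

# With a nominal variable, only the mode is meaningful.
favorite_pet = ["dog", "cat", "dog", "fish"]
print("Modal category:", mode(favorite_pet))
```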

Descriptive Statistics: Variability

Variability: the amount of spread in a distribution of scores.
Standard deviation: the average deviation of scores from the mean (square root of the variance); symbolized as s and abbreviated as SD in scientific reports: $s = \sqrt{\frac{\sum(X - M)^2}{n-1}}$
Variance: the square of the standard deviation; symbolized as s².
Range: the difference between the highest score and the lowest score.
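
The variability measures can be worked through by hand on a small example. The sketch below follows the formula with n − 1 in the denominator on hypothetical scores and cross-checks the result against `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev

# Hypothetical scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
m = sum(scores) / n                               # mean
sum_of_squares = sum((x - m) ** 2 for x in scores)  # squared deviations from the mean

variance = sum_of_squares / (n - 1)   # s squared
sd = sqrt(variance)                   # s, the standard deviation
score_range = max(scores) - min(scores)

print(f"variance = {variance:.3f}, SD = {sd:.3f}, range = {score_range}")

# The hand computation should agree with the library function.
assert abs(sd - stdev(scores)) < 1e-12
```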

Graphing Relationships

Bar graphs are used when the values on the x axis are nominal categories.
Line graphs are used when the values on the x axis are numeric.
Choosing the scale for a bar graph allows a manipulation of how the results appear, a practice that is sometimes used by scientists and often used by advertisers.

Correlation Coefficients: Describing the Strength of Relationships

Correlation coefficient: a statistic that describes how strongly variables are related to one another.
Pearson product–moment correlation coefficient: used when both variables have interval or ratio scale properties; called the Pearson r.
- It provides information about the strength and direction of a relationship.
- Values range from −1.00 to +1.00: the sign indicates the direction of the relationship, and the absolute value indicates its strength.
- Results can be described visually using a scatterplot, where each pair of scores is shown as a single point on a diagram.
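
A minimal sketch of the Pearson r computation, written out from the deviation-score formula rather than taken from a statistics package. The study-hours and exam-score data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation for paired scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
exam = [2, 4, 5, 4, 5]

r = pearson_r(hours, exam)
print(f"r = {r:.2f}")

# Reversing the direction of one variable flips the sign but not the strength.
r_reversed = pearson_r(hours, [-y for y in exam])
print(f"r (reversed) = {r_reversed:.2f}")
```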

Correlation Coefficients: Important Considerations

If the range of possible values is restricted, the correlation coefficient will not be accurate.
- Restriction of range: a problem when scores on a variable are limited to a small subset of their possible values, making it more difficult to identify relationships of the variable to other variables of interest.
The Pearson product–moment correlation coefficient (r) is designed to detect only linear relationships; the correlation coefficient will not indicate the existence of a curvilinear relationship.
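
Both considerations can be demonstrated with constructed data. In the sketch below, all values are hypothetical and chosen to make the pattern easy to see: a perfectly curvilinear relationship yields r = 0, and the same positive relationship yields a smaller r when only a narrow band of X values is sampled.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation for paired scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Curvilinear case: Y depends entirely on X (Y = X squared), yet r is zero.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Restriction of range: a positive relationship with some scatter.
full_x = list(range(1, 11))
scatter = [2, -2] * 5
full_y = [x + e for x, e in zip(full_x, scatter)]
r_full = pearson_r(full_x, full_y)

# Keep only participants whose X score falls in a narrow middle band.
restricted = [(x, y) for x, y in zip(full_x, full_y) if 4 <= x <= 6]
r_restricted = pearson_r([x for x, _ in restricted], [y for _, y in restricted])

print(f"curvilinear r = {r_curve:.2f}")
print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

A scatterplot would reveal the curvilinear pattern immediately, which is one reason to inspect the plot before interpreting r.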

Effect Size

Effect size: the strength of association between variables.
The Pearson r correlation coefficient is one indicator of effect size: It indicates the strength of the linear association between two variables.
- Small effects are near r = 0.15
- Medium/moderate effects are near r = 0.30
- Large effects are above r = 0.40
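The squared value of the coefficient ($r^2$) expresses the relationship as a proportion: the percentage of variance in one variable that is shared with the other (e.g., r = 0.30 gives $r^2$ = 0.09, or 9%).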
Reporting effect size provides a scale of values that is consistent across all types of studies.

Regression Equations

Regression equations: calculations used to predict a person’s score on one variable when that person’s score on another variable is already known.
The general form is $Y = a + bX$, where Y is the score we wish to predict, X is the score that is known, a is a constant (the intercept), and b is a weighting adjustment factor (the slope).
To predict a future behavior (the criterion variable) based on a person’s prior score on some other variable (the predictor variable), it is necessary to demonstrate that there is a reasonably high correlation between the two.
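
A minimal sketch of fitting and using a regression equation, assuming a and b are chosen by ordinary least squares (the usual approach, though the notes above do not name it). The predictor and criterion scores are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y = a + bX using ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: predictor (hours studied) and criterion (exam score).
hours = [1, 2, 3, 4, 5]
exam = [2, 4, 5, 4, 5]

a, b = fit_line(hours, exam)
predicted = a + b * 6  # predicted exam score for someone who studies 6 hours

print(f"Y = {a:.1f} + {b:.1f}X; predicted score for X = 6 is {predicted:.1f}")
```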

Multiple Correlation and Regression

Multiple regression: used to analyze the relationship between a single criterion variable and more than one predictor variable.
Multiple correlation: the correlation between a combined set of two or more predictor variables and a single criterion variable.
Multiple correlation is symbolized as R. The multiple regression equation is $Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n$, where Y is the criterion variable, $X_1$ to $X_n$ are the predictor variables, a is a constant, and $b_1$ to $b_n$ are weights that are multiplied by scores on the predictor variables.
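
To make the multiple regression equation concrete, the sketch below estimates the constant and weights by least squares using only the standard library. The helper names `solve` and `fit_multiple` are illustrative, and the data are constructed with no error (Y is exactly 1 + 2·X1 + 3·X2) so the recovered values can be checked; in practice a statistics package would do this and real data would not fit perfectly.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple(predictors, y):
    """Return [a, b1, ..., bn] via the least-squares normal equations."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1.0 estimates the constant a
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical (X1, X2) predictor scores and criterion scores Y.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
criterion = [9, 8, 19, 18, 29]

a, b1, b2 = fit_multiple(predictors, criterion)
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```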

Mediating and Moderating Variables

In research, a mediating variable is hypothesized to be intervening between variable X and variable Y.
- In a mediation model, the mediating variable accounts for the relationship between two other variables.
- The independent variable affects the mediating variable. The mediating variable then affects the dependent or criterion variable.
In research, a moderating variable influences the relationship between variable X and variable Y.
- The two terms, moderation and interaction, developed from different research traditions, but they mean essentially the same thing when interpreting research findings.
An uncontrolled third variable may be responsible for the relationship between the two variables of interest.
- When experimental research is properly designed, there is no third variable problem.
- Multiple regression can be used to statistically control for the effects of third variables.

Structural Equation Modeling

Structural equation modeling (SEM): statistical techniques to evaluate a model that specifies a set of relationships among a set of variables.
- After data have been collected, statistical methods can be applied to examine how closely the proposed model fits the obtained data.
Researchers typically present path diagrams to visually represent the models being tested.
- These show the theoretical causal paths among the variables.

Inferential Statistics

Learning Objectives

Explain how researchers use inferential statistics to evaluate sample data.
Distinguish between the null hypothesis and the research hypothesis.
Discuss probability in statistical inference, including the meaning of statistical significance.
Describe the t test and explain the difference between one-tailed and two-tailed tests.
Describe the F test, including systematic variance and error variance.
Describe what a confidence interval tells you about your data.
Distinguish between Type I and Type II errors and discuss the factors that influence the probability of a Type II error.
Discuss the reasons a researcher might obtain nonsignificant results.
Define power of a statistical test and describe how power influences research.
Demonstrate skills in selecting an appropriate statistical test.

Inferential Statistics

Researchers rarely, if ever, study entire populations.
Inferential statistics are used to evaluate how likely it is that the results of a study would hold up if the study were repeated with multiple samples from the population.
- They help answer whether it can be inferred that the difference in the sample means reflects a true difference in the population means.

Equivalence of Groups

Equivalence of groups is achieved by experimentally controlling all other variables or by randomization. The assumption is that if the groups are equivalent, the independent variable is the only thing that differs.
However, the difference between any two groups will be based on more than just the IV; there will also be random error.
Inferential statistics allow researchers to make objective probability statements about the nature of the population based on the sample data.
- They give the probability that the difference between means reflects random error rather than a real difference.

Null Hypothesis vs. Research Hypothesis

The null hypothesis is that the population means are equal.
- The independent variable had no effect.
The research hypothesis is that the population means are, in fact, not equal.
- The independent variable did have an effect.
Statistical significance indicates that there is a low probability that the difference between the obtained sample means was due to random error.

Probability and Statistical Significance

Probability is the likelihood of the occurrence of some event or outcome.
- In statistical inference, we want to specify the probability that an event (e.g., obtained results) is due to chance or random error.
- Alpha level: the probability required for statistical significance.
Sampling distributions assume that the null hypothesis is true.
Sample size—the total number of observations—has an impact on determinations of statistical significance.
- Greater size produces a more accurate estimate of the true population value.

Using a Statistical Test

To use a statistical test, you must first:
- Specify the null hypothesis
- Specify the research hypothesis
- Specify the significance level

The t Test

The t test is commonly used to examine whether two groups are significantly different from one another.
The t value is a ratio of two aspects of data: the difference between the group means and the variability within groups.

Degrees of Freedom

Degrees of freedom (df) represents the number of scores free to vary once the means are known.
Somewhat different critical values of t are used depending on whether the test is one tailed or two tailed:
- One-tailed tests: The research hypothesis specifies the direction of difference between the groups.
- Two-tailed tests: The research hypothesis does not specify the predicted direction of difference.
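
The t ratio and degrees of freedom can be sketched for two independent groups. The version below assumes equal population variances (a pooled-variance t test) and uses hypothetical scores; the critical values for df = 8 at the .05 level are copied from a standard t table.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group1, group2):
    """Return (t, df) for a pooled-variance independent-groups t test."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Variability within groups, pooled across the two groups.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Difference between group means relative to within-group variability.
    t = (mean(group1) - mean(group2)) / standard_error
    return t, df


# Hypothetical scores for a treatment group and a control group.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]

t, df = independent_t(treatment, control)

# Critical values of t for df = 8 at the .05 level, from a standard t table.
CRITICAL_ONE_TAILED = 1.860
CRITICAL_TWO_TAILED = 2.306

print(f"t({df}) = {t:.2f}")
print("Significant, one-tailed:", t > CRITICAL_ONE_TAILED)
print("Significant, two-tailed:", abs(t) > CRITICAL_TWO_TAILED)
```

With these particular numbers the result reaches significance only for the one-tailed test, which illustrates why the direction of the prediction has to be specified before the data are examined.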

The Analysis of Variance (F Test)

The analysis of variance, or F test, is a more general statistical procedure than the t test.
When a study has only one independent variable with two groups, F and t are virtually identical.
Analysis of variance is also used when there are more than two levels of an independent variable and when a factorial design with two or more independent variables has been used.

The F Statistic

The F statistic is a ratio of two types of variance:
- Systematic variance: the deviation of the group means from the grand mean or mean score of all individuals in all groups.
- Error variance: the deviation of the individual scores in each group from their respective group means.
The larger the F ratio is, the more likely it is that the results are statistically significant.
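
A minimal sketch of the F ratio for a one-way design, computed from the two sources of variance described above. The function name `one_way_f` and the scores are hypothetical. With two groups the result should equal the square of the pooled-variance t for the same data.

```python
from statistics import mean


def one_way_f(*groups):
    """F ratio for a one-way analysis of variance with independent groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n_total = len(groups), len(all_scores)

    # Systematic variance: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error variance: individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Two hypothetical groups: the pooled-variance t for these scores is 2.0, so F is 4.0.
f_two = one_way_f([6, 7, 8, 9, 10], [4, 5, 6, 7, 8])

# Three hypothetical groups.
f_three = one_way_f([1, 2, 3], [2, 3, 4], [6, 7, 8])

print(f"F (two groups) = {f_two:.2f}")
print(f"F (three groups) = {f_three:.2f}")
```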
Effect size guidelines:
- Small r = 0.10, Cohen’s d = 0.20
- Moderate/medium r = 0.30, Cohen’s d = 0.50
- Large r = 0.50, Cohen’s d = 0.80
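
Cohen's d for two groups can be sketched as the difference between means divided by the pooled standard deviation. The scores are hypothetical, and the labels simply apply the 0.20 / 0.50 / 0.80 guidelines listed above; the "below small" label is an illustrative addition for values under 0.20.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (
        n1 + n2 - 2
    )
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


def label_d(d):
    """Apply the small / medium / large guidelines to the size of d."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "below small"  # illustrative label, not part of the guidelines


# Hypothetical scores for a treatment group and a control group.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]

d = cohens_d(treatment, control)
print(f"d = {d:.2f} ({label_d(d)})")
```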

Confidence Intervals

Confidence interval: an interval of values within which there is a given level of confidence (e.g., 95%) that the population value lies.
Represented in bar graphs as an “I” bounded by upper and lower limits.
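
A rough sketch of a confidence interval for a mean, assuming the normal approximation (a z value of about 1.96 for 95%). With small samples a critical value from the t distribution would normally be used instead, which makes the interval somewhat wider. The scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, confidence=0.95):
    """Approximate confidence interval for the population mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    m = mean(scores)
    standard_error = stdev(scores) / sqrt(len(scores))
    return m - z * standard_error, m + z * standard_error


# Hypothetical exam scores.
scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 98]

low_95, high_95 = mean_ci(scores, 0.95)
low_99, high_99 = mean_ci(scores, 0.99)

print(f"M = {mean(scores)}")
print(f"95% CI: {low_95:.1f} to {high_95:.1f}")
print(f"99% CI: {low_99:.1f} to {high_99:.1f}")
```

Asking for more confidence widens the interval; a larger sample narrows it because the standard error shrinks.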
The goal is to help testers decide if the obtained results are significant.
The chosen significance level indicates how big a risk researchers are willing to take when making the decision.
Significant results are most likely when the effect size is large and the sample size is also large.

Type I and Type II Errors

The decision to reject the null hypothesis is based on probabilities rather than certainties.
The decision might not be correct; errors may result from the use of inferential statistics.
Using a decision matrix, there are two possible decisions and two possible truths about the population:
- Possible decisions: (1) reject the null hypothesis (2) accept the null hypothesis
- Possible truths: (1) the null hypothesis is true (2) the null hypothesis is false

Correct Decisions

Correct decisions occur in two instances:
- When we reject the null hypothesis, and the null hypothesis is false in the population.
- When we accept the null hypothesis, and the null hypothesis is true in the population.

Type I Error

A Type I error is made when we reject the null hypothesis, but the null hypothesis is actually true.
The probability of making a Type I error is determined by the significance or alpha level.
- The higher the significance or alpha level, the greater the chance of making a Type I error.

Type II Error

A Type II error occurs when the null hypothesis is accepted but in reality the research hypothesis is true.
- A Type II error occurs when a true effect of the independent variable exists in the population, but the results of the experiment do not lead to a decision to reject the null hypothesis.
The probability of this type of error is related to three factors:
- Significance (alpha) level
- Sample size
- Effect size
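
The decision matrix described in the preceding sections can be written as a small function. This is only an illustration of the four cells; in real research the truth about the population is unknown, so the second argument is hypothetical.

```python
def classify_decision(reject_null: bool, null_is_true: bool) -> str:
    """Name the cell of the decision matrix for a decision and a population truth."""
    if reject_null and null_is_true:
        return "Type I error"      # rejected a true null hypothesis
    if not reject_null and not null_is_true:
        return "Type II error"     # failed to reject a false null hypothesis
    return "correct decision"


for reject in (True, False):
    for null_true in (True, False):
        outcome = classify_decision(reject, null_true)
        print(f"reject={reject!s:5} null true={null_true!s:5} -> {outcome}")
```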

Significance Level and Consequences of Errors

Researchers have traditionally used a .05 or a .01 significance level in the decision to reject the null hypothesis.
The chosen level specifies the probability of a Type I error if the null hypothesis is true.
The significance level chosen and the consequences of a Type I or a Type II error depend on the goals of the researcher.

Nonsignificant Results

The results of a single study may not be conclusive when a true relationship between the variables in the population does in fact exist.
A meaningful result is more likely to be overlooked when:
- The significance level is very low or stringent (e.g., p < .001)
- The sample size is small
- The effect size is small
Sample sizes should be large enough to give some confidence that a real effect will be detected if it exists.
Research should have a reasonably large sample to rule out the possibility that the sample was too small; consider conducting a power analysis to determine optimal sample size.

Power Analysis

A power analysis determines the optimal sample size based on the significance level, the expected effect size, and the desired power.
Power = 1 − p(Type II error); that is, power is the probability of correctly rejecting a false null hypothesis.
Effect sizes and desired power:
- Smaller effect sizes require larger sample sizes.
- A higher desired power requires a larger sample size.
Researchers use statistical software to determine sample size.
Total sample size needed to detect a significant difference for a t test:
- Small effect (r = 0.10): 789 for power = 0.80; 1,052 for power = 0.90
- Medium effect (r = 0.30): 88 for power = 0.80; 116 for power = 0.90
- Large effect (r = 0.50): 26 for power = 0.80; 36 for power = 0.90
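
The relationship among effect size, power, and sample size can be sketched with a textbook approximation. The function below assumes a two-tailed test of a correlation and uses the Fisher z approximation, which is a different calculation from the one behind the figures listed above, so its output will be in the same general range but will not reproduce those values exactly. The function name `approx_n_for_r` is illustrative.

```python
from math import atanh, ceil
from statistics import NormalDist


def approx_n_for_r(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation of size r.

    Uses the Fisher z approximation for a two-tailed test; this is a rough
    planning figure, not a substitute for dedicated power-analysis software.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical z for the chosen significance level
    z_power = z(power)          # z corresponding to the desired power
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


for r in (0.10, 0.30, 0.50):
    n_80 = approx_n_for_r(r, power=0.80)
    n_90 = approx_n_for_r(r, power=0.90)
    print(f"r = {r:.2f}: n is about {n_80} for power .80, {n_90} for power .90")
```

The pattern matches the two points above: the required sample grows sharply as the expected effect shrinks, and it grows again when higher power is requested.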

The Importance of Replications

Scientists attach little importance to the results of a single study if the results are not replicated in future research.
A rich understanding of any phenomenon comes from the results of numerous studies investigating the same variables.
Meta-analysis provides a method for determining the reliability and generalizability of research findings across many different studies.
- Instead of inferring population values based on a single investigation, a researcher can look at the results of several studies that used similar procedures and assessed the same variables.

Statistical Significance of Correlation

The Pearson r correlation coefficient is used to assess the strength and direction of linear associations between two variables when both variables have interval or ratio scale properties.
There remains the issue of whether the correlation is statistically significant.
A statistical significance test helps to:
- Decide whether, based on the sample data, we can conclude that there is a significant correlation in the population.
- Conclude that the population correlation is, in fact, greater than 0.00.
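
One common way to test a correlation is to convert r to a t value with n − 2 degrees of freedom. The sketch below applies that standard formula to a hypothetical r of 0.50 at two sample sizes; the two-tailed .05 critical values are copied from a standard t table.

```python
from math import sqrt


def t_for_r(r, n):
    """Return (t, df) for testing whether a sample correlation differs from zero."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df


# Two-tailed critical values of t at the .05 level, from a standard t table.
CRITICAL_T = {8: 2.306, 25: 2.060}

# The same hypothetical correlation observed in a small and a larger sample.
t_small, df_small = t_for_r(0.50, n=10)
t_large, df_large = t_for_r(0.50, n=27)

print(f"n = 10: t({df_small}) = {t_small:.2f}, significant: {t_small > CRITICAL_T[df_small]}")
print(f"n = 27: t({df_large}) = {t_large:.2f}, significant: {t_large > CRITICAL_T[df_large]}")
```

The same r is nonsignificant with 10 participants and significant with 27, which shows how sample size enters the decision.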

Computer Analysis of Data

Most data analysis is carried out via specially designed statistical software.
Major statistical programs include:
- SPSS
- SAS
- SYSTAT
- R (which is freely available).
Many people do most of their simple analyses using a spreadsheet program such as Microsoft Excel.

Choosing a Statistical Test

The variables that we study may have nominal or interval/ratio scale properties.
Note that variables with nominal scale properties have two or more discrete levels, such as male and female.
Variables with interval/ratio scale properties have many values.

Statistical tests for different combinations of IV and DV

- Nominal IV, nominal DV: chi-square
- Nominal IV (two groups), interval/ratio DV: t test
- Nominal IV (three groups), interval/ratio DV: one-way analysis of variance
- Interval/ratio IV, interval/ratio DV: Pearson correlation

Statistical tests for research designs with multiple independent variables

- Nominal IVs (two or more variables), interval/ratio DV: analysis of variance (factorial design)
- Interval/ratio IVs (two or more variables), interval/ratio DV: multiple regression
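
The selection rules above can be expressed as a small lookup function. This is only a sketch of the listed combinations; the function name, argument names, and scale labels are illustrative, and any combination not listed raises an error rather than guessing.

```python
def choose_test(iv_scale, dv_scale, n_ivs=1, iv_levels=2):
    """Suggest a statistical test from the scale of the IV(s) and the DV.

    iv_scale and dv_scale are "nominal" or "interval/ratio"; n_ivs is the
    number of independent variables; iv_levels is the number of groups
    when there is a single nominal IV.
    """
    if n_ivs > 1:
        if dv_scale != "interval/ratio":
            raise ValueError("combination not covered by the lists above")
        if iv_scale == "nominal":
            return "analysis of variance (factorial design)"
        return "multiple regression"
    if iv_scale == "nominal" and dv_scale == "nominal":
        return "chi-square"
    if iv_scale == "nominal" and dv_scale == "interval/ratio":
        return "t test" if iv_levels == 2 else "one-way analysis of variance"
    if iv_scale == "interval/ratio" and dv_scale == "interval/ratio":
        return "Pearson correlation"
    raise ValueError("combination not covered by the lists above")


print(choose_test("nominal", "interval/ratio", iv_levels=3))
print(choose_test("interval/ratio", "interval/ratio", n_ivs=2))
```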