Research Design

6.1 Overview

  • Research design is the specific setup of a research study.

  • Key issues include:

    • Selection of variables (independent and dependent).

    • Types of research (experimental, quasi-experimental, and non-experimental).

    • Design strategies (group and single subject).

    • Time frames.

    • Sampling procedures.

I. Independent Variables (IVs) and Dependent Variables (DVs)
  • Independent Variables (IVs): Form the basis of the groups being compared.

    • Example: Comparing three treatments for depression; the type of treatment is an IV.

    • Example: Comparing males and females; gender is an IV.

  • Two main types of IVs:

    • IVs directly manipulated by the researcher.

    • IVs that are pre-existing and cannot be manipulated.

  • Manipulated IVs: Also termed environmental or situational variables.

    • Example: Contrasting cognitive-behavioral, psychodynamic, and medication treatments for depression; the IV is the type of treatment.

  • Non-Manipulated IVs: Subject or individual difference variables that cannot be manipulated because they are pre-existing.

    • Example: Contrasting men and women; the IV of gender cannot be manipulated.

  • Dependent Variables (DVs): Outcome measures selected.

    • Example: Level of depression measured by scores on the Beck Depression Inventory.

    • Dependent variables can be nominal, ordinal, interval, or ratio.

II. Types of Research
  • True Experimental: At least one IV is manipulated, and subjects are randomly assigned.

    • Example: Comparing two types of treatment for anxiety with random assignment to treatment groups.

  • Quasi-Experimental: At least one IV is manipulated, but there is non-random assignment, typically due to pre-existing groups.

    • Example: Administering one treatment to patients in hospital ward A and another to patients in ward B; wards are pre-existing groups.

  • Observational, Passive, or Non-Experimental: No intervention or manipulation.

    • Sometimes called correlational, but this is a misnomer as they often use statistics that detect group differences (e.g., ANOVA).

    • Example: Comparing cigarette smoking extent in adolescent males and females; gender is the non-manipulated IV, and smoking extent is the DV.

III. Design Strategies and Measurement
  • A. Group Designs:

    • Research with groups can be between-groups, within-subjects, or mixed design.

      • Between-Groups Design: Compares independent groups.

        • Example: Differences in reading levels of two different classes of second-grade students.

      • Within-Subjects Design: Groups contrasted are correlated or related.

        • Conditions leading to correlated data:

          • Subjects are repeatedly measured (repeated measures).

          • Subjects have been matched prior to group assignment.

          • Subjects have an inherent relationship (e.g., identical twins).

        • Specific Subtype: Subjects exposed to several interventions/conditions in sequence.

          • Example: Remembering nouns, verbs, and nonsense syllables.

          • Requires counterbalancing due to carryover effects.

          • Subjects divided into thirds, each receiving a different word list first.

          • Latin Square is the most sophisticated form of counterbalancing.

      • Mixed Design: Groups are both independent and correlated.

        • Example: Patients assigned to different treatment groups for depression (independent) and measured before and after treatment (correlated/repeated measures).
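The counterbalancing scheme described above (each third of the subjects receiving a different word list first) can be sketched as a simple cyclic Latin square. This is an illustrative Python sketch of the basic idea, not a fully carryover-balanced (Williams) design:

```python
def latin_square(conditions):
    """Build a simple cyclic Latin square: each condition appears
    exactly once in every ordinal position across the orderings."""
    n = len(conditions)
    return [[conditions[(start + i) % n] for i in range(n)]
            for start in range(n)]

orders = latin_square(["nouns", "verbs", "nonsense"])
# Each row is the presentation order for one third of the subjects:
#   ['nouns', 'verbs', 'nonsense']
#   ['verbs', 'nonsense', 'nouns']
#   ['nonsense', 'nouns', 'verbs']
```

Because every list appears equally often in the first, second, and third position, position effects are spread evenly across conditions.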

  • B. Single-Subject Designs:

    • One or very few subjects studied intensively with repeated measurements during baseline and treatment.

    • Single-subject approaches are idiographic; group approaches are nomothetic.

    • Includes AB, ABAB, multiple baseline, simultaneous treatment, and changing criterion designs.

    • Significant problem: Autocorrelation, the effect of repeated measurements on the same person, results in highly correlated data.

    • AB Design: Baseline (A) followed by treatment (B).

      • Example: Recording a subject's weight during a baseline phase, followed by recording weight during a prescribed dietary intervention.

      • Threat of history: Difficulty determining if the intervention or another event caused the change.

    • ABAB Design: Alternates baseline and treatment conditions (A, B, A, B).

      • Example: Recording head-banging episodes in an autistic child during baseline, treatment, return to baseline, and resumed treatment phases.

      • Protects against the threat of history.

      • Problems: Failure of the DV to return to baseline and ethical issues with removing effective treatment.

    • Multiple Baseline Design: Treatment applied sequentially across subjects, situations, or behaviors.

      • Resolves problems of AB and ABAB designs but is more time-consuming and expensive.

        • Multiple Baseline Across Subjects: Medication effect on hyperactivity for three different children with ADHD; medication is administered to each child sequentially after a baseline phase.

        • Multiple Baseline Across Situations: Intervention for one problem behavior in three different settings (e.g., tantrums at home, school, park); treatment is applied consecutively.

        • Multiple Baseline Across Behaviors: Intervention for three problem behaviors of one subject (e.g., head-banging, rocking, self-biting); intervention is applied to each behavior sequentially.

    • Simultaneous (Alternating) Treatment Design: Two or more interventions implemented concurrently during treatment, balanced and varied across time of day.

      • Allows comparison of the relative effectiveness of multiple interventions for an individual.

      • Example: Contrasting M&Ms vs. praise to decrease head-banging by alternating them during the day.

    • Changing Criterion Design: Attempts to change behavior in increments to match a changing criterion.

      • Example: Gradually reducing coffee consumption by setting a new criterion each time the person meets the previous one.

  • C. Behavioral Measurement:

    • Time Sampling: Useful when a behavior is not discrete (no distinct beginning/end).

      • Appropriate for measuring a person's attention to a speaker during a lecture.

      • Breaks the period of interest into smaller periods (e.g., two-hour lecture into one-minute intervals).

        • Momentary Time Sampling: Recording whether the target behavior is present or absent at the moment the time interval ends.

        • Whole-Interval Sampling: Scoring the target behavior positively only if it is exhibited for the full duration of the time interval.

    • Event Recording: Tallying the number of times a target behavior occurs.

      • Useful when the target behavior is discrete and infrequent (e.g., counting the number of times a student arrives late).
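The two time-sampling rules above can be sketched in Python. The second-by-second attention record below is hypothetical, using 4-second intervals for brevity:

```python
def momentary_time_sampling(behavior, interval):
    """Score each interval by whether the behavior is present
    at the moment the interval ends."""
    return [behavior[end - 1]
            for end in range(interval, len(behavior) + 1, interval)]

def whole_interval_sampling(behavior, interval):
    """Score each interval positively only if the behavior
    occurred for the full duration of the interval."""
    return [all(behavior[i:i + interval])
            for i in range(0, len(behavior) - interval + 1, interval)]

# Hypothetical second-by-second record of attending (True) over 12 seconds:
attending = [True, True, False, True,  True, True, True, True,
             False, False, True, True]
print(momentary_time_sampling(attending, 4))  # [True, True, True]
print(whole_interval_sampling(attending, 4))  # [False, True, False]
```

Note how momentary sampling credits all three intervals while whole-interval sampling, the stricter rule, credits only the second.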

IV. Conditions of Experimentation
  • Analogue Research: Evaluates treatment under conditions that only resemble clinical situations.

    • Problems studied are less severe, participants are often college students/volunteers, therapists are graduate students, and treatments are standardized.

    • Allows for tight control but limited generalizability.

  • Clinical Trials: Outcome investigations conducted in clinical settings.

    • Often require methodological compromises due to being conducted in clinical settings.

V. Time Frame
  • Cross-Sectional Research: Looks at differences across sections (e.g., different ages) by sampling subjects from age categories at one point in time.

    • Example: Studying internet usage among 20, 40, and 60-year-olds.

    • Subject to cohort effects: Younger people may use the internet more simply because they grew up with computers.

  • Longitudinal Research: Follows a group of subjects over many years to understand changes that take place as people age.

    • Example: Tracking individuals from infancy through old age, looking at changes in savings habits.

    • Problems: significant expense and high dropout rates.

  • Cross-Sequential Research: Attempts to correct problems with both cross-sectional and longitudinal research by taking several cross-sections of age groups and following them over briefer periods.

    • Example: Studying health habits across the lifespan by tracking groups of different ages for five to ten years.

VI. Sampling Procedures
  • Important to have a representative sample of the population.

  • Major selection strategies:

    • Simple Random Sampling: Every member has an equal chance of being selected.

      • Example: Randomly selecting registered voters from a roster of all registered voters in California.

    • Stratified Random Sampling: Population is divided into strata (e.g., age, income, ethnicity), and a random sample of equal size from each stratum is selected.

      • Example: Dividing the population into SES levels and selecting equal numbers from each SES level.

    • Proportional Sampling: Individuals are randomly selected in proportion to their representation in the general population.

      • Example: If a population is 85% White, 10% Hispanic, and 5% African American, the subject pool should reflect these percentages.

    • Systematic Sampling: Selecting every kth element after a random start.

      • Example: If 100 out of 1000 persons are needed, every tenth person is selected.

      • The list must be arranged in such a way that the selection is unbiased (i.e., the ordering should not contain a periodic pattern that coincides with the sampling interval).

    • Cluster Sampling: Identifying naturally occurring groups (clusters) and randomly selecting certain clusters (e.g., classes at a university or schools within a district).

      • Once clusters are selected, all subjects within each cluster are surveyed.

      • Example: Studying achievement in 8th graders in LAUSD by randomly selecting ten schools and assessing all 8th graders in those schools.
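Three of these selection strategies can be sketched in Python with the standard library's random module. The population lists and strata names here are hypothetical:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n):
    """Select every kth element after a random start (k = N / n)."""
    k = len(population) // n
    start = random.randrange(k)
    return population[start::k][:n]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Random sample of equal size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

voters = list(range(1000))
print(len(systematic_sample(voters, 100)))  # 100: every 10th voter
```

Proportional sampling would differ from `stratified_sample` only in making each stratum's sample size proportional to its share of the population rather than equal.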

6.2 Threats in Research Design

  • Goal: To draw inferences about the relationships among variables, minimizing ambiguity.

  • Threats can interfere with drawing clear conclusions.

  • Include threats to internal validity, construct validity, external validity, and statistical conclusion validity.

I. Threats to Internal Validity
  • Internal Validity: Being able to conclude that the IV caused the change in the DV.

  • Threats: Factors other than the IV that may have caused the DV to change.

  • History: Specific incidents intervening between measuring points, either in or outside the experimental situation.

    • Example: An intervention targeted at increasing earthquake preparedness; if a sizable earthquake were to occur between measurement periods, it would not be clear if the increase in preparedness was due to the intervention or earthquake.

    • Best control: A control group.

  • Maturation: Factors affecting subjects' performance due to the passing of time (fatigue, maturing).

    • Example: A program to help young children develop greater coordination for sports; it would be difficult to determine whether the program actually helped improve coordination or whether the children's coordination improved simply because of biological maturation.

    • Best Control: Control Group.

  • Testing or Test Practice: Familiarity with testing affects scores on repeated testing.

    • Example: Children given a task, then an intervention, and retested; improvement may be due to the intervention or test practice.

    • Best control: Solomon Four-Group design.

    • Solomon Four-Group Design: Subjects divided into four groups.

      • Group 1: Pre- and post-tested with intervention in between.

      • Group 2: Pre- and post-tested without intervention.

      • Group 3: Post-tested only with intervention.

      • Group 4: Post-tested only without intervention.

  • Instrumentation: Changes in observers or calibration of equipment.

    • Example: Biofeedback electrodes showing reductions in scores as the equipment wears out; observed GSR reductions may be due to treatment or equipment problems.

    • Control Group can correct for this problem.

  • Statistical Regression: Extreme scores (above or below the mean) tend to become less extreme on retesting, even without intervention.

    • Example: Subjects with extremely high scores on the BDI would likely score lower upon a second administration.

    • Best control: A control group.

  • Selection Bias: Caused by non-random assignment.

    • Example: First 20 volunteers assigned to one treatment, second 20 to another; differences may be due to treatment or differences in volunteers.

    • Avoided by using random assignment.

  • Attrition or Experimental Mortality: Differential loss of subjects from groups.

    • Example: Less depressed drop out of control, more depressed drop out of treatment; treatment group appears less depressed due to attrition, not treatment.

    • To assess for problems of attrition, subjects who drop out should be compared on relevant variables by running t-tests.

  • Diffusion: The no-treatment group gets some of the treatment, clouding treatment effects.

    • Example: Anxious subjects assigned to cognitive-behavioral interventions or non-specific elements of therapy, but during the research, some cognitive strategies are inadvertently discussed in the control group; both groups appear less anxious.

    • Tighter control helps to limit this.

II. Threats to Construct Validity
  • Construct validity addresses whether the effect was caused by the intended aspect of the intervention or by something else.

  • Construct validity (as described by Kazdin) refers to factors other than the desired specifics of our intervention that result in differences. These factors are often lumped under threats to external validity.

  • Attention and Contact with Clients: May be hard to tell whether the change in patients is due to the actual technique or just contact with the therapist.

  • Experimenter Expectancies: Cues or clues transmitted to the subjects by the experimenter.

    • Also known as the Rosenthal effect.

    • Example: Experimenter smiles and nods encouragingly only at subjects who received the actual medication, resulting in these subjects reporting more improvement.

    • Control: Having the experimenter blind to the subjects' treatment condition

  • Demand Characteristics: Factors in the procedures that suggest how the subject should behave.

    • Example: Subjects given medication are told that the medication might cause certain side effects, and in fact subjects then report more side effects; by contrast, subjects given the placebo are told there won't be any side effects, and they then report no side effects.

    • To reduce demand characteristics, subjects should be blind to their treatment condition.

  • John Henry Effect: Persons in control try harder in competition with the experimental group.

    • Also called compensatory rivalry.

    • Members of the control group may outperform persons in the experimental or treatment group.

    • To control or eliminate the John Henry effect, the experimental and control groups should not know about each other. If this is not possible, the groups should not be given any sense of competition.

III. Threats to External Validity
  • External Validity: Being able to generalize from the sample studied to the population.

  • Factors that interfere with generalizability are called threats to external validity.

  • Sample Characteristics: Differences between the sample and the population.

    • Example: Volunteers studying a sleep intervention may differ from the population of persons with sleep problems who do not volunteer.

    • Volunteers may be more enthusiastic, compliant, and willing to engage in treatment.

  • Stimulus Characteristics: Features of the study with which the intervention is associated, such as artificial research arrangements.

    • Example: Research assessing memory functioning in the laboratory may not be generalizable to memory functioning in naturalistic settings.

  • Contextual Characteristics: Conditions in which the intervention is embedded; reactivity is a significant threat.

    • Reactivity occurs when subjects behave in a certain way just because they are participating in research and being observed.

    • The Hawthorne effect is a frequently described example of reactivity.

IV. Threats to Statistical Conclusion Validity
  • If statistically significant results are not obtained, it is important to look at whether the treatment was ineffective, or whether the lack of significant findings was due to statistical problems.

  • Threats: Low power, unreliability of measures, variability in procedures, and subject heterogeneity.

  • Low Power: Diminished ability to find significant results.

    • Factors: Small sample size and inadequate interventions.

  • Unreliability of Measures: Even if the intervention is effective, significant differences may not be found if the outcome measure used is unreliable.

  • Variability in Procedures: Inconsistency in treatment procedures obscures treatment findings.

    • Especially of concern in psychotherapy outcome research.

  • Subject Heterogeneity: Makes it more difficult to find significant differences between groups.

    • The greater the subject heterogeneity, the less the likelihood is of finding significance.

V. Interrelations of Threats
  • Using a control group and random assignment can minimize most threats to internal validity.

  • The more controlled the experiment is, the less generalizable it is likely to be, or put differently, the greater the internal validity, the lower the external validity.

6.3 Basic Math

*There are two basic mathematical principles and a third theoretical principle that, once mastered, should allow you to quickly and accurately respond to most questions requiring calculations on the EPPP.

Squaring Decimals
  • Squaring decimals is just like squaring whole numbers; the trick is knowing where to put the decimal point.

  • When you square a decimal between 0 and 1, the result is always smaller than the original value.

  • When you square a decimal that is expressed in tenths (e.g., .4²), the result will always be expressed in hundredths (e.g., .16).

  • The tricky ones are .1², .2², and .3², whose squares are .01, .04, and .09 respectively.

Square-Rooting Decimals
  • The trick to square-rooting decimals is to first express your decimal in hundredths rather than tenths or thousandths (e.g., .50 rather than .5, and .10 rather than .100), and then to look for the value whose square is closest to it.

  • The result should always be expressed in tenths.

    • Example A: (\sqrt{.64} = .8)

    • Example B: (\sqrt{.5} = \sqrt{.50} \approx .7)

    • Example C: (\sqrt{.1} = \sqrt{.10} \approx .3)
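These estimation tricks can be verified directly: rounding the true values to the stated number of places reproduces Examples A through C.

```python
import math

# Squaring a decimal in tenths yields a value in hundredths:
print(round(0.4 ** 2, 2))   # 0.16
print(round(0.1 ** 2, 2))   # 0.01

# Square-rooting: express in hundredths; the answer comes out in tenths:
print(round(math.sqrt(0.64), 1))  # 0.8  (Example A, exact)
print(round(math.sqrt(0.50), 1))  # 0.7  (Example B, approximate)
print(round(math.sqrt(0.10), 1))  # 0.3  (Example C, approximate)
```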

Relationships in Equations
  • (a = \frac{b}{c})

    • a varies directly with b: As b increases, a increases; as b decreases, a decreases.

    • a varies indirectly with c: As c increases, a decreases; as c decreases, a increases.

  • (a = b(1 - c))

    • a varies directly with b: As b increases, a increases; as b decreases, a decreases.

    • a varies indirectly with c: As c increases, a decreases; as c decreases, a increases.

  • Example D:

    • (S_{\bar{X}} = \frac{SD_{pop}}{\sqrt{N}})

  • Example E:

    • (S_{meas} = SD_X \sqrt{1 - r_{xx}})
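Examples D and E can be checked numerically: Example D is the standard error of the mean (which varies directly with the population SD and indirectly with sample size) and Example E is the standard error of measurement (which varies indirectly with reliability). The input values below (SD = 15, the N and reliability figures) are illustrative:

```python
import math

def standard_error_of_mean(sd_pop, n):
    """Example D: SD of the population divided by the square root of N."""
    return sd_pop / math.sqrt(n)

def standard_error_of_measurement(sd_x, r_xx):
    """Example E: SD of scores times the square root of (1 - reliability)."""
    return sd_x * math.sqrt(1 - r_xx)

# Increasing N shrinks the standard error (indirect relationship):
print(standard_error_of_mean(15, 25))   # 3.0
print(standard_error_of_mean(15, 100))  # 1.5

# Higher reliability shrinks the error of measurement:
print(round(standard_error_of_measurement(10, 0.84), 1))  # 4.0
print(round(standard_error_of_measurement(10, 0.96), 1))  # 2.0
```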

6.4 Types of Data

  • The type of data is critical in determining the type of statistical test to be used (e.g., parametric versus nonparametric).

  • It is most important to ascertain what type of data the dependent variable (DV) or outcome variable is.

  • This can be accomplished by determining how any given person is measured.

    • Specifically, it should be asked whether the person is given a score or numerical value (interval or ratio data), or whether the person is assigned to a category or tallied up as one point in a given category (nominal or ordinal data).

  • The four types of data are nominal, ordinal, interval, and ratio. A helpful acronym is NOIR.

  • Nominal data, especially when there are only two categories (e.g., gender), are commonly called dichotomous, while interval or ratio data are frequently called continuous.

  • Nominal Data: Involves tallying people (head count) to see which non-ordered category each person falls into.

    • Examples: Sex (male or female), voting preference (Democrat or Republican), or ethnicity (White, Hispanic, Asian, African American).

    • Nominal categories have no inherent order.

    • Numbers or frequencies are obtained for each category. These frequencies can be converted to proportions or percentages (e.g., percent of males, proportion of Democrats).

    • Group means cannot be calculated from nominal data.

  • Ordinal Data: Involves tallying people to see which ordered category each person falls into.

    • Examples: Attitude toward abortion (Likert scales), SES, or percentile ranks for income.

    • The categories are ordered and numbers or frequencies are obtained for each category.

    • Group means cannot be calculated from ordinal data.

  • Interval Data: Obtaining numerical scores for each person, where the score values have equal intervals.

    • There is either no zero score (e.g., IQ scores or t-scores) or zero is not absolute (e.g., temperature in °C or °F).

    • Group means can be calculated from interval data.

  • Ratio Data: Obtaining numerical scores for each person, where the score values have equal intervals and an absolute zero.

    • Examples: Savings in the bank, score on the EPPP, weight, and number of children.

    • Means can be calculated from ratio data. Comparisons can also be made across score values (e.g., $10 is twice as much as $5).
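The rules above about which descriptive statistics each NOIR level supports can be summarized in a small lookup table. This is a study-aid sketch, not a standard library:

```python
# Hypothetical helper: descriptive statistics permitted at each NOIR level.
PERMISSIBLE_STATS = {
    "nominal":  {"mode", "frequency"},
    "ordinal":  {"mode", "frequency", "median"},
    "interval": {"mode", "frequency", "median", "mean", "standard deviation"},
    "ratio":    {"mode", "frequency", "median", "mean", "standard deviation",
                 "ratio comparison"},
}

def can_take_mean(level):
    """Group means require at least interval-level measurement."""
    return "mean" in PERMISSIBLE_STATS[level]

print(can_take_mean("ordinal"))   # False
print(can_take_mean("interval"))  # True
```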

6.5 Descriptive Statistics

  • There are two broad classes of statistics: descriptive and inferential.

    • With descriptive statistics, the data collected (the DV) are simply described.

    • With inferential statistics, the goal is to make inferences about the population from the sample.

  • Descriptive statistics can further be divided into two basic groups: statistics that describe the whole group's data, and statistics that describe an individual's score relative to the group.

I. Group Data
A. Measures of Central Tendency
  • The entire group's data can be described using measures of central tendency, which include the mean, median, and mode.

    • Mean: The arithmetic average of a group of data. Calculated by adding up all the scores in the group and dividing by the total number of scores.

    • Median: Corresponds to the score at the 50th percentile. Half the people in the group score below and half the people in the group score above.

    • Mode: The most frequently occurring score. The score that is obtained by more people in the group than any other score.

  • The best measure of central tendency is typically the mean.

    • However, when data are either skewed or there are some very extreme scores present, the median is most accurate.
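A quick Python illustration of why the median is preferred when extreme scores are present (the scores below are hypothetical):

```python
from statistics import mean, median, mode

scores = [2, 3, 3, 4, 5, 6, 30]   # one extreme score skews the data

print(round(mean(scores), 2))  # 7.57 — pulled upward by the outlier
print(median(scores))          # 4 — a better summary of the typical score
print(mode(scores))            # 3 — the most frequently occurring score
```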

B. Measures of Variability
  • The group's data can also be described using the measures of variability, which include the standard deviation, the variance, and the range.

    • Standard Deviation: A measure of average deviation (or spread) from the mean in a given set of scores.

    • Variance: Mathematically, the standard deviation is the square root of the variance. Said another way, the variance is the standard deviation squared.

    • Range: The crudest measure of variability. It is simply the difference between the highest and lowest value obtained.

  • The effect of mathematical operations on the mean and standard deviation is such that if points are added to or subtracted from everyone's score, the mean is affected, but not the standard deviation; if scores are either divided or multiplied, both the mean and standard deviation are affected.
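This rule can be verified with Python's statistics module (the scores are hypothetical; `pstdev` is the population standard deviation):

```python
from statistics import mean, pstdev

scores = [70, 80, 90, 100]
print(mean(scores), round(pstdev(scores), 2))    # 85 11.18

# Adding a constant shifts the mean but leaves the spread unchanged:
shifted = [x + 10 for x in scores]
print(mean(shifted), round(pstdev(shifted), 2))  # 95 11.18

# Multiplying affects both the mean and the standard deviation:
scaled = [x * 2 for x in scores]
print(mean(scaled), round(pstdev(scaled), 2))    # 170 22.36
```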

C. Graphs
  • The group's data can also be depicted using graphs (e.g., bar graphs, curves, etc.).

  • On a graph of group data, the X-axis either represents categories (for nominal or ordinal data) or scores (for interval or ratio data), while the Y-axis always represents frequency.

  • Data that are nominal are graphed using a bar graph.

  • Data that are ordinal are also graphed using a bar graph, with the exception of percentile ranks, which are depicted graphically as flat or rectangular.

  • Interval or ratio data are graphed with curves.

*Most interval or ratio variables are normally distributed. Data that are not normally distributed are considered skewed or kurtotic. In a skewed distribution, scores are not equally distributed above and below the mean. In a positive skew, there is a higher proportion of scores in the lower range of values; in a negative skew, there is a higher proportion of scores in the higher range of values. Note that in a positive skew, the mode has the lowest value of the three measures of central tendency, and the mean has the highest value. In a negative skew, the relationship is reversed: the mean has the lowest value and the mode has the highest value. Kurtosis refers to how peaked a distribution is. Instead of the familiar bell shape, a leptokurtotic distribution has a very sharp peak, and a platykurtotic distribution is flattened.

II. Individual Scores
A. Raw Scores
  • Knowing an individual's score (X) on a test usually provides very little information.

  • The percentage correct (%) can be determined but it remains unclear if it is a good, bad, or mediocre score.

  • Percentage correct is considered a criterion-referenced or domain-referenced score.

B. Percentile Ranks
  • In general, what is most relevant is how the person scored relative to the group. A score that provides such information is called a norm-referenced score.

  • The most informative norm-referenced score is the percentile rank.

  • A percentile rank indicates the percentage of cases in a sample that are equal to or below a particular score.

  • For example, a person who scores at the 98th percentile has scored the same as or higher than 98% of the sample.

  • As mentioned above, the distribution of percentile ranks is graphically depicted as flat or rectangular.

C. Standard Scores
  • Standard scores are based on the standard deviation of the sample. Examples include z-scores, t-scores, IQ scores, SAT scores, and EPPP scores.

  • Z-scores are the most basic standard scores in that they correspond directly to standard deviation (SD) units. Z-scores have a mean of zero and a standard deviation of one. For example, a z-score of +2 represents a score that is 2 SDs above the mean.

  • The shape of the z-score distribution is always identical to the shape of the raw score distribution. Transforming raw scores into z-scores does not normalize a distribution.

  • In any distribution, the area under the curve represents 100% of the people. In most distributions, almost all scores fall within +/- 3 SDs of the mean. In any normal distribution, z-scores (or standard deviations) always correspond to the same percentile ranks.

*If points are added to only two people in a distribution (as opposed to everyone), the percentile ranks of those two individuals will change. There will be a greater change in percentile rank for the person who initially has a mid-range rank (e.g., 50th PR) as compared to someone who is initially closer to the tail of the distribution.

  • Z-Scores can be easily calculated from raw score data (X).

    • Z-Score Formula:

      • (Z = \frac{X - \bar{X}}{SD})

      • or

      • (Z = \frac{\text{score} - \text{mean}}{SD})

    • Raw Score Formula:

      • (X = \bar{X} \pm Z(SD))

      • or

      • (\text{score} = \text{mean} \pm (Z \times SD))
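Both formulas are easy to sketch in Python, here using the conventional IQ parameters (mean 100, SD 15):

```python
def z_score(x, mean, sd):
    """Convert a raw score to SD units: Z = (score - mean) / SD."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Convert a z-score back to the raw-score scale: X = mean + Z * SD."""
    return mean + z * sd

print(z_score(130, 100, 15))     # 2.0: two SDs above the mean
print(raw_score(-1.0, 100, 15))  # 85.0: one SD below the mean
```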

6.6 Inferential Statistics

  • Instead of just describing data, inferential statistics are those procedures or tests that allow researchers to make inferences about the population, based on the sample studied. When research is conducted, it is hoped that the study sample is representative of the population.

  • In fact, a critical assumption of all statistical tests is that the sample subjects are randomly selected, and thus representative of the population. Population values are referred to as parameters, examples of which include μ (mu), the population mean, and σ (sigma), the population standard deviation. Sample values are called statistics, such as X̄ (the mean) and SD or S (the standard deviation).

I. Sampling Error
  • A key concern in research is sampling error: samples drawn from populations are usually not perfectly representative of the population.

  • One clear indication of sampling error is that untreated sample means are frequently not identical to population means. For example, the population mean for IQ is 100. If a random sample of 25 people were taken, the mean of this sample would probably not be exactly 100.

II. Standard Error of the Mean and Central Limit Theorem
  • Standard Error of the Mean: If, hypothetically, a researcher were to take many, many samples of equal size (e.g., N=25) and plot the group means of these samples, the researcher would get a normal distribution of means. Any spread or deviation in these means is error. The average amount of deviation is called the standard error of the mean.

  • Central Limit Theorem: The theorem describing this hypothetical distribution of means. It states that if an infinite number of equal-sized samples (of large enough size) are drawn from the population, and the means of these samples are plotted, a normal distribution of means will result.

    • The mean of the means (the grand mean) will equal the population mean, and the standard deviation of the means will equal the standard deviation of the population divided by the square root of sample size (the standard error of the mean).

    • The distribution of means will be normal regardless of the shape of the distribution of scores.

  • The practical relevance of central limit theorem and the standard error of the mean is that it informs the researcher how likely it is that a particular mean will be obtained just by chance. Using this information, the researcher can calculate whether the obtained mean is most likely due to treatment or experimental effects, or to chance (sampling error or random error).

  • Example: A researcher implements a treatment program designed to enhance IQ with one group of 25 subjects. Following treatment, the treated group's IQ is measured, and the mean IQ turns out to be 103. With the standard error of the mean at 3, the mean of 103 was quite likely obtained by chance: the difference between the mean of 103 and the non-treated mean of 100 is only 3 points, or one standard error, and a difference of one standard error might easily be due to sampling error alone. By contrast, if the treated group's mean IQ is 110, it is highly unlikely that this mean was obtained by chance alone. At 110, the group's mean IQ would be more than three standard error units away from the mean, a difference highly unlikely to be due to just sampling error; consequently, the results would be considered statistically significant.
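The logic of this example can be simulated: drawing many samples of N = 25 from a normal population with mean 100 and SD 15 yields a distribution of sample means whose standard deviation approximates the standard error of 3. This is a simulation sketch; the number of samples (5,000) and the seed are arbitrary:

```python
import math
import random
from statistics import mean, pstdev

# Draw many samples of N=25 from an IQ-like population (mean 100, SD 15)
# and examine the resulting distribution of sample means.
rng = random.Random(0)
sample_means = [mean(rng.gauss(100, 15) for _ in range(25))
                for _ in range(5000)]

grand_mean = mean(sample_means)
se_observed = pstdev(sample_means)
se_theoretical = 15 / math.sqrt(25)  # standard error of the mean

print(round(grand_mean))      # ≈ 100: the grand mean matches the population mean
print(round(se_observed, 1))  # ≈ 3.0: the SD of the means matches the standard error
print(se_theoretical)         # 3.0
```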

III. Hypothesis Testing
A. Key Concepts
  • The most critical aspect of inferential statistics is hypothesis testing. A hypothesis is a statement of belief; on the basis of this belief, research studies are run, and subsequent treatment programs are implemented. Although hypotheses are tested using samples, hypotheses are always expressed in terms of the population.

  • The nature of the hypothesis depends on the type of question asked in the research. For example, hypotheses can be generated about group differences or about relationships. For simplicity's sake, the prototype that will be used here is the hypothesis about group differences.

  • Null Hypothesis: The null hypothesis states that there are no differences between the groups, or (H_0: \mu_1 = \mu_2).

    • Put differently, the null hypothesis states that the IV has not had an effect on the DV. Research is conducted with the hope of finding differences between groups that have received different treatment; thus, the researcher always wishes to be able to reject the null hypothesis. This concept is a bit tricky because of the double negative. The researcher hopes to be able to reject the statement that there are no differences, and thereby conclude that there are differences between the groups.

  • Alternative Hypothesis: The alternative hypothesis states directly that there are differences, or H1: μ1 ≠ μ2. Keep in mind that results are almost always stated in terms of the null hypothesis.

  • When statistics are run to test hypotheses, there are really only two decisions or conclusions that can be drawn, based entirely on where the means fall relative to the rejection or acceptance regions.

  • Rejection Region: The rejection region, at the tail end of the curve, is also called the region of unlikely values, because it is unlikely that a researcher will obtain means in this region simply because of chance (sampling error). The size of the rejection region corresponds to the alpha level. For example, when alpha is .05, the rejection region is 5% of the curve. When alpha is .01, the rejection region is 1% of the curve.

  • When obtained values fall in the rejection region, the null hypothesis is rejected and the researcher concludes that treatment did have an effect (i.e., there are differences between groups).

  • When obtained values fall in the acceptance or retention region, the null hypothesis is accepted and the researcher concludes that there were no effects of treatment (i.e., there are no differences between groups).
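The two-decision rule described above can be sketched as follows. This is a hedged illustration, not from the source: it assumes a two-tailed z test, where 1.96 is the conventional critical value for alpha = .05, and the function name `decide` is ours.

```python
# Minimal sketch of the reject/retain decision rule. The critical value
# 1.96 is the two-tailed normal-curve cutoff for alpha = .05 (assumption:
# a z test; the text does not specify a particular test statistic).
def decide(z_statistic, critical_value=1.96):
    """Return the decision based on where the statistic falls."""
    if abs(z_statistic) > critical_value:
        return "reject H0: conclude the treatment had an effect"
    return "retain H0: conclude no treatment effect was detected"

print(decide(1.0))   # falls in the acceptance (retention) region
print(decide(3.33))  # falls in the rejection region
```

Note that only these two conclusions are possible; the statistic's location relative to the rejection region fully determines the decision.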

B. Correct and Incorrect Decisions
  • Based on the region in which the treated mean falls, researchers conclude that their findings either have statistical significance or lack statistical significance. Researchers assume that statistical significance indicates that there was a treatment effect and that a lack of statistical significance indicates that there was no treatment effect. It is important to realize that there are actually two factors that contribute to conclusions regarding statistical significance: treatment effects and chance (sampling error).

  • Statistical tests can merely provide information about the probability of results being due to chance or treatment, but the only way to know with certainty is for the experiment to be replicated numerous times. Thus, only after many replications of the experiment can it be known whether the original results were obtained in error.

  • There are only two possible decisions or conclusions a researcher can draw: either to reject or accept the null. Both of these decisions have the possibility to be correct or incorrect, and thus there are four possible outcomes.

    • Type I Error: If the null is rejected and this decision later turns out to be a mistake, the researcher has made a Type I error. In other words, significance is found in the original experiment, but subsequent researchers do not find significance. A Type I error occurs when the null is incorrectly rejected, or differences are found when they do not actually exist.

    • The size of the rejection region is typically set in advance by the experimenter and is called alpha. The size of alpha directly corresponds to the likelihood of making a Type I error. Conventional cutoffs for alpha are .05, .01, and .001. These cutoffs have been selected, somewhat arbitrarily, as indicating that obtained means are different enough to be attributed to treatment effect and not to chance.
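The link between alpha and the Type I error rate can be demonstrated by simulation. This is a sketch under our own assumptions (two groups of 30 drawn from the same normal population with mean 100 and SD 15, a two-sample z test, alpha = .05); none of these specifics come from the text.

```python
# When the null hypothesis is true by construction (both groups come from
# the same population), a test at alpha = .05 should wrongly reject H0 on
# roughly 5% of trials -- the Type I error rate equals alpha.
import random
import statistics

random.seed(0)  # for reproducibility of this illustration

def type_i_error_rate(n=30, trials=2000, critical=1.96):
    """Fraction of trials that incorrectly reject H0 when no difference exists."""
    rejections = 0
    for _ in range(trials):
        a = [random.gauss(100, 15) for _ in range(n)]
        b = [random.gauss(100, 15) for _ in range(n)]
        se_diff = (15**2 / n + 15**2 / n) ** 0.5  # SE of the difference in means
        z = (statistics.mean(a) - statistics.mean(b)) / se_diff
        if abs(z) > critical:
            rejections += 1
    return rejections / trials

print(type_i_error_rate())  # close to .05, matching the chosen alpha
```

Lowering alpha to .01 shrinks the rejection region and, with it, the chance of a Type I error.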

    • Type II Error: When the null hypothesis is accepted, and this decision turns out to be a mistake, a Type II error has been made. In other words, significance is not found in the original experiment, but subsequent researchers do obtain significant results. A Type II error occurs when the null is incorrectly accepted, or no differences are found when differences actually do exist.

    • The probability of making a Type II error corresponds to beta. The value of beta varies from experiment to experiment. Beta can be calculated by