Assessing the Quality of Behavioral Measurement and Interobserver Agreement

Indicators of Trustworthy Behavioral Measurement

Perspective of Philosophic Doubt: Consistent with the concepts and principles of behavior analysis, researchers and practitioners must maintain a stance of philosophic doubt. Data cannot be taken at face value; their quality must be assessed once they are collected.
Goal of Assessment: The goal is to determine how well data reflects what actually occurred in the environment and to ensure it is "believable" and "trustworthy."
Validity: This is the first indicator of trustworthy measurement. It refers to the extent to which a measurement method accomplishes three things: * Directly measures a socially significant behavior. * Measures a dimension of that behavior relevant to the question or concern (e.g., duration vs. frequency). * Ensures data are representative of the range of the behavior's occurrence.
Accuracy: This refers to the degree to which observed values match the "true values" of an event. Accurate measurement means the data recorded depicts the actual occurrence of the behavior.
Reliability: This refers to the extent to which a measurement system yields the same values across repeated measurements of the very same event. While reliability suggests consistency, it does not guarantee accuracy.

Threats to Measurement Validity

Indirect Measurement: This occurs when a researcher measures a behavior other than the actual behavior of interest (e.g., using a survey to measure performance rather than observing it directly). * It requires an inference to be made regarding the relationship between the measured response and the target response. * Practitioners must provide evidence that the proxy behavior is directly related to the behavior of interest.
Irrelevant or Ill-suited Dimensions: Validity is threatened when the dimension of behavior being measured does not address the actual concern. * Example: If a client's crying is disruptive to the school day, measuring "repeatability" (count/frequency) may be irrelevant. One instance of crying that lasts for $6\,\text{hours}$ is highly disruptive, but a frequency count of $1$ would underestimate the severity compared to a duration measure.
Measurement Artifacts: These are misleading data that result from the way behavior is measured rather than the behavior itself. Examples include: * Discontinuous Measurement: Methods that systematically underestimate or overestimate the occurrence of behavior. * Poorly Scheduled Observations: Conducting observations at times when the behavior is unlikely to occur (e.g., observing aggression during lunch or nap time) produces data that is an artifact of the schedule, not the client's actual repertoire. * Insensitive or Limiting Scales: If a measurement scale has a ceiling that prevents recording all occurrences of a behavior, the resulting data is an artifact of the scale's limitations.
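The discontinuous-measurement artifact can be illustrated with a small simulation. This is a minimal sketch, not a standard tool: the one-minute timeline, the 10-second interval length, and the function name `interval_estimates` are all hypothetical choices made for illustration. It compares the true percentage of time the behavior occurred with what partial-interval and whole-interval recording would report for the same session.

```python
def interval_estimates(timeline, interval_len):
    """Compare true occurrence with partial- and whole-interval estimates.

    timeline: list of 0/1 values, one per second (1 = behavior occurring).
    interval_len: observation interval length in seconds (hypothetical value).
    """
    chunks = [timeline[i:i + interval_len]
              for i in range(0, len(timeline), interval_len)]
    n = len(chunks)
    true_pct = 100 * sum(timeline) / len(timeline)
    # Partial-interval: score the interval if the behavior occurred at any point.
    partial_pct = 100 * sum(1 for c in chunks if any(c)) / n
    # Whole-interval: score the interval only if the behavior filled all of it.
    whole_pct = 100 * sum(1 for c in chunks if all(c)) / n
    return true_pct, partial_pct, whole_pct


# Hypothetical 60-second session: behavior during seconds 8-13 and 25-42.
timeline = [0] * 60
for s in list(range(8, 14)) + list(range(25, 43)):
    timeline[s] = 1

true_pct, partial_pct, whole_pct = interval_estimates(timeline, interval_len=10)
print(f"true: {true_pct:.1f}%  partial: {partial_pct:.1f}%  whole: {whole_pct:.1f}%")
```

With these made-up data the behavior fills 40% of the session, yet partial-interval recording scores 5 of 6 intervals and whole-interval recording scores only 1 of 6. The gap comes from the recording method, not from the behavior.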

Threats to Measurement Accuracy and Reliability

Human Error: Identified as the single biggest threat to accuracy and reliability in behavioral measurement.
Poorly Designed Measurement Systems: Systems that are cumbersome, difficult to use, or overly complex increase the likelihood of recording errors.
Inadequate Observer Training: Training must be explicit and systematic. To address this threat, practitioners should: * Use clear operational definitions. * Select observers carefully. * Train observers to a specific competency standard. * Provide ongoing training to prevent Observer Drift (the tendency for observers to shift away from the original measurement standard over time).
Unintended Influences on Observers: * Observer Expectations: When an observer expects to see high or low rates of a behavior (e.g., expecting aggression because a specific math program is running), their data collection may be biased to match those expectations. * Observer Reactivity: This occurs when an observer knows they are being evaluated. This can lead to the observer recording data differently than they would otherwise. * Example Anecdote: A therapist $20\,\text{years}$ ago did not want to be associated with problem behaviors, so she recorded a client’s behavioral tallies under a different therapist’s hour on the data sheet to protect her own reputation.
Measurement Bias: Feedback given to observers about how their data relates to intervention goals can inadvertently cause reactivity or bias in recording.

Assessing Accuracy and Reliability

Procedural Requirements for Quality: Practitioners should design sound measurement systems customized to their situation and ensure observers are familiar with operational definitions and the purpose of the data.
Calculating Accuracy: Accuracy is determined by comparing "observed values" (what was recorded during a session) to "true values" (the actual event). * Example: A Registered Behavior Technician (RBT) tallies foot stomping from memory at the end of a session (observed value). During the same session, a supervisor whose only task is observing counts each stomp on a mechanical golf counter (true value). * The process for determining the true value must differ from the standard measurement procedure. * Accuracy assessments should always be reported in research so that conclusions are not based on faulty data.
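One simple way to summarize the comparison is to report the signed error and the percentage error relative to the true value. The notes do not prescribe a formula, so the sketch below is an assumption, and the counts (12 recorded from memory, 15 on the counter) are hypothetical.

```python
def accuracy_summary(observed, true):
    """Return (signed error, absolute percent error relative to the true value)."""
    error = observed - true  # negative means the observer undercounted
    pct_error = None if true == 0 else 100 * abs(error) / true
    return error, pct_error


# Hypothetical values: RBT's end-of-session tally vs. supervisor's counter.
error, pct_error = accuracy_summary(observed=12, true=15)
print(f"error: {error}, percent error: {pct_error:.1f}%")
```

A negative error here would indicate that recording from memory undercounted the behavior.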
Assessing Reliability: Reliability requires permanent products for re-measurement. Low reliability signals suspect data. While two observers can be reliable (agreeing on the same value) without being accurate (both recording the wrong value), a lack of reliability usually indicates a failure in accuracy.

Interobserver Agreement (IOA)

Definition: IOA refers to the degree to which two or more independent observers report the same values for the same events.
Benefits of IOA: * Determining the competence of new observers. * Detecting observer drift over time. * Judging the clarity of behavioral definitions and the measurement system. * Increasing the overall believability and trustworthiness of the data.
Prerequisites for Conducting IOA: * Observers must use the same observation code and measurement system. * Observers must measure the same participants and events. * Observers must score independently. * Methodology: This can be done live or via video. If using video, a signal such as "3, 2, 1, now!" ensures both observers start at the exact same moment.

IOA Calculation Methods

General Rules: The different methods correspond to the data collection type. In general, practitioners should use the most stringent and conservative method available.
Event Recording Methods: * Total count IOA. * Mean count per interval IOA. * Exact count per interval IOA. * Trial-by-trial IOA.
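The four event-recording indices can be sketched as follows, using the way they are commonly defined (smaller count divided by larger count, averaged or tallied per interval). The function names, the sample counts, and the choice to treat two counts of zero as full agreement are assumptions made for this illustration.

```python
def ratio_pct(a, b):
    """Smaller value divided by larger value, as a percentage."""
    if a == b:  # assumption: 0 vs. 0 is treated as full agreement
        return 100.0
    return 100 * min(a, b) / max(a, b)


def total_count_ioa(obs1, obs2):
    # Compares only the session totals.
    return ratio_pct(sum(obs1), sum(obs2))


def mean_count_per_interval_ioa(obs1, obs2):
    # Averages the smaller/larger ratio computed within each interval.
    per_interval = [ratio_pct(a, b) for a, b in zip(obs1, obs2)]
    return sum(per_interval) / len(per_interval)


def exact_count_per_interval_ioa(obs1, obs2):
    # Percentage of intervals in which both observers recorded the same count.
    agreements = sum(1 for a, b in zip(obs1, obs2) if a == b)
    return 100 * agreements / len(obs1)


def trial_by_trial_ioa(trials1, trials2):
    # Percentage of trials scored identically (e.g., 1 = occurred, 0 = did not).
    agreements = sum(1 for a, b in zip(trials1, trials2) if a == b)
    return 100 * agreements / len(trials1)


# Hypothetical counts per interval from two independent observers.
obs1 = [2, 0, 3, 1]
obs2 = [1, 1, 3, 1]
print(total_count_ioa(obs1, obs2))
print(mean_count_per_interval_ioa(obs1, obs2))
print(exact_count_per_interval_ioa(obs1, obs2))

# Hypothetical trial-by-trial records.
print(trial_by_trial_ioa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))
```

In this made-up data set both observers recorded 6 responses in total, so total count IOA looks perfect even though they disagreed in half of the intervals. The per-interval indices expose that disagreement.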
Timing Recording Methods: * Total duration IOA. * Mean duration per occurrence IOA. * Inter-agreement per occurrence.
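Total duration IOA and mean duration per occurrence IOA can be sketched the same way, using the common shorter-over-longer definition. The durations below are hypothetical, and the sketch assumes both observers timed the same occurrences in the same order.

```python
def ratio_pct(a, b):
    """Shorter duration divided by longer duration, as a percentage."""
    if a == b:  # assumption: identical values (including 0 and 0) count as full agreement
        return 100.0
    return 100 * min(a, b) / max(a, b)


def total_duration_ioa(durs1, durs2):
    # Compares only the summed duration across the session.
    return ratio_pct(sum(durs1), sum(durs2))


def mean_duration_per_occurrence_ioa(durs1, durs2):
    # Averages the shorter/longer ratio computed for each occurrence.
    per_occurrence = [ratio_pct(a, b) for a, b in zip(durs1, durs2)]
    return sum(per_occurrence) / len(per_occurrence)


# Hypothetical durations in seconds for four occurrences.
durs1 = [30, 60, 90, 40]
durs2 = [60, 30, 90, 40]
print(total_duration_ioa(durs1, durs2))
print(mean_duration_per_occurrence_ioa(durs1, durs2))
```

Both observers recorded 220 seconds in total, so total duration IOA is 100%, while the per-occurrence index drops to 75% because two occurrences were timed very differently.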
Interval Recording and Time Sampling: * Interval-by-interval IOA (also called point-by-point agreement). * Scored interval IOA. * Unscored interval IOA.
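The three interval-based indices differ only in which intervals they count. The sketch below follows the common definitions; the function names, the sample records, and the choice to return `None` when no intervals qualify are assumptions for illustration.

```python
def interval_by_interval_ioa(o1, o2):
    # Agreements on every interval, scored or unscored.
    agreements = sum(1 for a, b in zip(o1, o2) if a == b)
    return 100 * agreements / len(o1)


def scored_interval_ioa(o1, o2):
    # Only intervals in which at least one observer scored an occurrence.
    pairs = [(a, b) for a, b in zip(o1, o2) if a or b]
    if not pairs:
        return None  # assumption: undefined when nobody scored anything
    return 100 * sum(1 for a, b in pairs if a and b) / len(pairs)


def unscored_interval_ioa(o1, o2):
    # Only intervals in which at least one observer scored a nonoccurrence.
    pairs = [(a, b) for a, b in zip(o1, o2) if not a or not b]
    if not pairs:
        return None  # assumption: undefined when every interval was scored by both
    return 100 * sum(1 for a, b in pairs if not a and not b) / len(pairs)


# Hypothetical low-rate behavior across 10 intervals (1 = scored, 0 = unscored).
o1 = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
o2 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(interval_by_interval_ioa(o1, o2))
print(scored_interval_ioa(o1, o2))
print(unscored_interval_ioa(o1, o2))
```

For this low-rate example, interval-by-interval IOA is 90% because the many empty intervals inflate agreement, while scored interval IOA is only 50%. This is why the scored-interval index is the more conservative choice for low-rate behavior and the unscored-interval index for high-rate behavior.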

Standards for Reporting IOA

Frequency and Distribution: * IOA should be collected during each condition and each phase of a study (baseline and intervention). * It should be distributed across different days, times, settings, and observers. * The minimum standard is to collect IOA for $20\%$ of sessions, but roughly $30\%$ (close to one-third) is preferable. * More frequent IOA is necessary for complex measurement systems or treatment procedures.
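Because IOA should be collected in every phase, the percentage standard is most usefully applied phase by phase. The sketch below is a hypothetical planning helper: the phase names and session counts are made up, and rounding up to a whole session is an assumption.

```python
def min_ioa_sessions(n_sessions, pct):
    """Smallest whole number of sessions covering at least pct percent of n_sessions."""
    return (n_sessions * pct + 99) // 100  # integer ceiling of n_sessions * pct / 100


# Hypothetical study: number of sessions in each phase.
phases = {"baseline": 7, "intervention": 15}
for name, n in phases.items():
    print(name, min_ioa_sessions(n, 20), min_ioa_sessions(n, 30))
```

For these made-up phases, the $20\%$ minimum works out to 2 baseline and 3 intervention sessions, and the $30\%$ target to 3 and 5.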
Acceptable Benchmarks: * The goal is $100\%$ agreement. * Historically, $80\%$ agreement has been utilized as the minimum acceptable benchmark.
Reporting Formats: IOA can be reported in narrative form, tables, or graphs. The report must describe exactly which method was used so the reader can judge how conservative the assessment was. If one calculation is insufficient, multiple indices/calculations can be reported to strengthen the assessment. These details are found in Chapter 5 of the Cooper, Heron, and Heward text and the Pearson Education slides.