Replication

Aspects of Replication in Research

Overview of the Session

The session focuses on the critical importance of replicating research findings and the essential tools used for robust data analysis in sports science and beyond.
Discussion will include a blend of theoretical concepts and practical tasks designed to reinforce a deep understanding of principles of replication, statistical analysis tools, and best practices in research methodology.

Key Theme: Rigor in Methodological Design

Recap of previous session focusing on established checklists and comprehensive assessment tools used to evaluate research quality, ensuring a foundational understanding for advanced topics.
Strong connection between the systematic application of these checklists and assessment tools and ensuring stringent methodological rigor in all stages of research studies, from the initial design and data collection to analysis and final reporting. This rigor is paramount for generating trustworthy and reproducible scientific evidence.

Initial Discussion Points

Recall from previous week:
- What are the key assessment tools discussed, such as the PEDro scale or the Cochrane risk of bias tool, and their specific applications? These tools systematically identify potential biases like selection bias, performance bias, detection bias, attrition bias, and reporting bias, thereby improving the internal validity of studies.
- How are these tools intricately linked to the overarching theme of rigor, by systematically identifying potential biases and limitations in study design and execution? By providing a structured framework for evaluation, they compel researchers to adhere to high standards of design and transparent reporting.
Importance of checklists:
- Serve as crucial markers and standardized frameworks to systematically evaluate the quality, completeness, and transparency of research papers, ensuring adherence to reporting guidelines. They act as a foundational guide for researchers to ensure all critical elements of a study are present and clearly articulated.
Variety of checklists for different types of studies:
- Experimental studies: Often utilize checklists like ARRIVE for animal research or specific criteria for human intervention studies to ensure ethical conduct and robust experimental design.
- Randomized controlled trials (RCTs): Primarily use CONSORT (Consolidated Standards of Reporting Trials) to ensure comprehensive and transparent reporting of trial design, conduct, analysis, and results. This includes detailing randomization procedures, blinding, and participant flow.
- Validity testing of instruments: Checklists such as COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) are used to evaluate the methodological quality of studies on measurement properties (e.g., reliability, validity, responsiveness) of health-related patient-reported outcome measures, ensuring the instruments themselves are fit for purpose.
- Observational studies: STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) provides guidelines for clear and complete reporting, covering aspects like study design, participant characteristics, and statistical methods tailored for cohort, case-control, and cross-sectional studies.
- Literature reviews and systematic reviews: PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is widely used to ensure comprehensive and transparent reporting processes, including search strategy, study selection, data extraction, and synthesis of evidence.

Assessment Tools vs. Checklists

Checklists:
- Provide a concise yet thorough confirmation of methodological adherence and completion of essential reporting items within a study, often in a 'yes/no' or 'present/absent' format. They are descriptive and prescriptive, guiding researchers on what to include.
- Examples:
  - COSMIN: Specifically for evaluating the methodological quality of studies on measurement properties of health-related patient-reported outcome measures, ensuring the instrument's scientific soundness.
  - CONSORT: For transparent and complete reporting of randomized controlled trials; the CONSORT 2010 statement comprises a 25-item checklist and a participant flow diagram.
  - STROBE: For comprehensive reporting of observational studies (cohort, case-control, and cross-sectional studies), providing a 22-item checklist.
Assessment Tools:
- More comprehensive and evaluative instruments, allowing for detailed commentary and scoring on the overall quality, risk of bias, and methodological rigor of research, often leading to a numerical quality score or categorization (e.g., high, moderate, low risk of bias). They are analytical and interpretative.
- Used primarily during the post-study appraisal and critical evaluation phase, rather than pre-implementation, to determine the trustworthiness and applicability of findings. They help assess the confidence one can place in a study's results.

Guidance on Study Design

Importance of meticulously highlighting and justifying the methodology in research presentations and publications, with a strong emphasis on achieving and demonstrating rigor:
- Sample size calculations: Essential for ensuring adequate statistical power, justifying the number of participants needed to detect a statistically significant effect if one truly exists, and avoiding underpowered studies. Underpowered studies increase the risk of Type II errors, leading to missed real effects and wasted resources.
- Inclusion/exclusion criteria: Clearly defined criteria are vital for ensuring homogeneity of the study population, enhancing the internal validity by minimizing confounding factors, and accurately determining the generalizability of results to specific populations.
- Data handling methods: Detailed pre-registered protocols for data collection, storage, cleaning, and analysis are crucial for transparency and reproducibility, protecting against data manipulation, errors, or post-hoc adjustments that could bias findings.
Strong emphasis on minimizing any form of manipulation of data, analyses, or interpretations to ensure authentic, unbiased findings that accurately reflect the observed phenomena. This commitment to honesty is a cornerstone of scientific integrity.

Statistical Analysis in Sports Science Research

Growing concerns over the credibility of research findings in many fields, including sports science, stemming from a tendency for only statistically significant results to be published, potentially without sufficient empirical support. This phenomenon distorts the body of evidence and can lead to a misallocation of research efforts.
Publication Bias:
- A pervasive issue where studies reporting positive or statistically significant findings are more likely to be published than those reporting negative, null, or non-significant results. This creates a 'file drawer problem' where valid but non-significant findings remain unpublished.
- Approximately $90\%$ of published studies across various disciplines, including sports science, report statistically significant findings, creating a skewed perception of efficacy and widespread effects. This overrepresentation of positive results makes it difficult to assess the true prevalence and magnitude of effects.
- This bias emphasizes the immense pressure on researchers to present novel, exciting, and positive results, often at the expense of reporting neutral or negative findings, which are equally important for a complete scientific understanding and for avoiding redundant research efforts.
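The distortion described above can be made concrete with a small simulation. The sketch below is illustrative only: the true effect (d = 0.2), the group size (20 per group), the number of studies, and the function name `simulate_publication_bias` are assumptions chosen for the example, and each study's estimate is drawn from a normal approximation rather than from simulated raw data.

```python
import random
from statistics import NormalDist, mean


def simulate_publication_bias(true_d=0.2, n_per_group=20, n_studies=10_000,
                              alpha=0.05, seed=1):
    """Simulate many two-group studies and 'publish' only the significant ones.

    All default values are hypothetical and exist only for illustration.
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5                 # approximate standard error of d
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    all_estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # A study reaches the literature only if its z-statistic is significant.
    published = [d for d in all_estimates if abs(d) / se > z_crit]
    return mean(all_estimates), mean(published), len(published) / n_studies


mean_all, mean_published, share_published = simulate_publication_bias()
print(f"mean effect, all studies:       {mean_all:.2f}")
print(f"mean effect, published studies: {mean_published:.2f}")
print(f"share of studies published:     {share_published:.1%}")
```

Under these assumptions the average of all simulated studies stays close to the true effect, whereas the average of the "published" subset is several times larger, which is the inflation a reader of the published literature alone would see.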

Understanding P-values

P-value calculations: The p-value (probability value) is a measure used in hypothesis testing to quantify the probability of observing results as extreme as, or more extreme than, the observed results, assuming the null hypothesis is true. A smaller p-value suggests stronger evidence against the null hypothesis.
- Example interpretation:
  - A p-value of $0.051$ means that, if the null hypothesis were true (no real effect), results at least as extreme as those observed would be expected in about $5.1\%$ of repeated studies. It is not the probability that the observed results are due to chance. This value sits just above the conventional $0.05$ threshold for statistical significance.
  - A marginal change, such as adding or removing a single participant, can move the p-value across the threshold, e.g., from $0.051$ (conventionally non-significant) to $0.049$ (conventionally significant). The evidence is virtually identical in both cases, which shows how arbitrary a rigid $0.05$ cut-off is and how a small change can flip a study's conclusion.
Potential issues with p-values:
- Their extreme sensitivity to sample size can misrepresent the true significance or practical importance of results; a large sample size can make a tiny, practically meaningless effect statistically significant, while a small sample size might miss a practically important effect (Type II error).
- Misinterpretation is common; a p-value does not indicate the probability that a hypothesis is true, nor does $p < 0.05$ mean there is a $95\%$ chance the alternative hypothesis is true. It quantifies evidence against the null hypothesis, not the probability of the alternative hypothesis.
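The dependence of the p-value on sample size can be sketched numerically. The example below holds a small standardized difference fixed (d = 0.1, a hypothetical value) and varies only the group size. It assumes two equal groups and uses a z-test (normal approximation) instead of a t-test to stay within the Python standard library; `two_sided_p` is an illustrative helper, not a function from the session materials.

```python
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d
    between two equal-sized groups (z-test, normal approximation)."""
    z = d * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same effect size every time; only the number of participants changes.
for n in (10, 50, 200, 1000, 5000):
    print(f"n per group = {n:>4}: p = {two_sided_p(0.1, n):.4f}")
```

With the effect held constant, the p-value falls from clearly non-significant at small n to well below 0.05 at large n, which is why a p-value should be read alongside an effect size rather than on its own.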

Research Misconduct and Questionable Practices

Questionable Research Practices (QRPs): Practices that, while not outright fraud, deviate from accepted scientific conduct and can undermine the integrity and reproducibility of research findings. These practices can subtly inflate confidence in spurious findings and hinder scientific progress.
- HARKing (Hypothesizing After the Results are Known): Involves constructing or revising a hypothesis after seeing the data and then presenting it as if it had been specified in advance. This makes exploratory analyses appear confirmatory and creates a misleading sense of strong theoretical support for effects that were initially unforeseen.
- P-hacking (also called data dredging or a fishing expedition): Refers to consciously or unconsciously altering statistical methods, selectively reporting outcomes, or changing data inclusion/exclusion criteria after data collection in order to obtain a statistically favorable p-value (e.g., below $0.05$). Typical forms include stopping data collection as soon as a significant result appears, removing outliers post hoc without a pre-specified rule, and trying multiple statistical tests until one yields the desired p-value. The practice severely compromises methodological rigor and inflates the rate of false positives.
Extensive discussion on the critical need for transparent and pre-registered reporting of all data, methodologies, and analysis plans (e.g., through platforms like OSF Registries) to proactively avoid the influence of biases, selective reporting, and various questionable research practices. Pre-registration commits researchers to their planned methods before seeing the data, greatly enhancing transparency and credibility.
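One way to see why trying multiple tests inflates false positives is to compute the chance of at least one significant result when no real effect exists. The sketch below assumes the tests are independent and every null hypothesis is true, which is a simplification; outcomes measured on the same participants are usually correlated, so the numbers are indicative only. The function name `familywise_error_rate` is illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 3, 5, 10, 20):
    print(f"{k:>2} tests -> {familywise_error_rate(k):.1%} chance of a false positive")
```

A single pre-registered test keeps the false-positive rate at the nominal $\alpha$, while ten unplanned tests push it to roughly 40% under these assumptions.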

Underpowered Study Designs

Understanding type I and type II errors in the context of underpowered studies, which are studies conducted with too small a sample size to detect a true effect if it exists. Such studies have insufficient statistical power.
- Type I error (False Positive): Occurs when one incorrectly rejects a true null hypothesis, meaning a significant effect is found when no real effect actually exists (e.g., concluding a new intervention improves performance when it does not). The probability of a Type I error is denoted by $\alpha$ (alpha), typically set at $0.05$, i.e., a $5\%$ chance of a false positive when the null hypothesis is true.
- Type II error (False Negative): Occurs when one incorrectly fails to reject a false null hypothesis, meaning a real effect exists but the study fails to detect it (e.g., concluding a new intervention has no effect when it actually improves performance). The probability of a Type II error is denoted by $\beta$ (beta), and it represents the likelihood of missing a true effect.
Underpowered studies increase the risk of Type II errors, meaning real and important effects may be missed, leading to wasted resources in conducting ineffective studies, potentially hindering scientific progress, and delaying the implementation of beneficial interventions.
Strong emphasis on performing rigorous a priori sample size calculations to ensure adequate statistical power ($1 - \beta$), which enhances the robustness and reliability of findings and prevents underpowered designs. Adequate power directly lowers the Type II error rate; the Type I error rate is fixed separately by the chosen $\alpha$. Desired power is typically set at $0.80$ ($80\%$), meaning an $80\%$ chance of detecting a true effect of the assumed size if it exists.
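An a priori sample size calculation for a simple two-group comparison can be sketched as follows. This is a minimal example under stated assumptions: two independent groups of equal size, a two-sided test, and the normal-approximation formula $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ per group. The approximation gives values slightly below an exact t-test calculation, so dedicated power-analysis software is preferable for a real study. The function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a standardized
    difference d with a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_beta = z.inv_cdf(power)           # quantile matching power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the required n grows with $1/d^2$, halving the expected effect size roughly quadruples the sample needed, which is why small expected effects so often lead to underpowered designs.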

The Importance of Replication

Introduction to initiatives like the Sports Science Replication Center, emphasizing the paramount necessity for all research studies to be independently replicable and inherently reliable to build a trustworthy and cumulative body of scientific knowledge. Replication is the cornerstone of empirical science.
Overview of the fundamental purpose of replication studies:
- To validate and confirm the original findings across diverse populations, different research settings, varied contexts, and sometimes using slightly modified methodologies. This helps determine if an effect is robust and not merely specific to the original study conditions.
- To assess the generalizability of an effect and to identify potential moderating factors that influence the reproducibility of results. For instance, an intervention might be effective in one demographic but not another, or only under specific environmental conditions.
Illustrated with compelling examples of recent studies in sports science and related fields that have attempted to replicate original research and found significantly differing, often non-replicable, results, prompting critical re-evaluation of previous conclusions and showcasing that many initially promising findings are not robust.

Analyzing Effect Sizes in Replication Studies

Understanding the critical significance of effect sizes (e.g., Cohen's d, partial eta-squared, correlation coefficients) in quantitatively evaluating the practical magnitude and clinical importance of findings, beyond mere statistical significance. Effect sizes provide a standardized measure of the strength of a phenomenon, allowing for comparison across studies.
- Effect size classifications used in this session (e.g., for Cohen's d, taken as an absolute value):
  - Below $0.2$: Considered trivial or negligible, suggesting very little practical significance.
  - $0.2$ to $0.6$: Generally classified as small to moderate, suggesting a noticeable but not necessarily large impact. For example, a Cohen's d of $0.4$ might indicate a meaningful improvement in performance.
  - Above $0.6$: Considered substantial or large, indicating a strong and practically significant effect. These effects are often readily observable and have clear real-world implications.
  - These cut-offs are conventions rather than fixed rules; Cohen's original benchmarks, for instance, are $0.2$ (small), $0.5$ (medium), and $0.8$ (large), so the scale used should always be stated.
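As a rough illustration, Cohen's d for two independent groups can be computed from the pooled standard deviation and then mapped onto the bands listed above. The data values and the helper names `cohens_d` and `classify` are hypothetical, and the labels follow the thresholds used in these notes rather than any universal standard.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def classify(d):
    """Label an effect size with the bands used in these notes."""
    size = abs(d)
    if size < 0.2:
        return "trivial"
    if size <= 0.6:
        return "small to moderate"
    return "large"


# Hypothetical scores for an intervention group and a control group.
intervention = [2, 4, 6]
control = [1, 3, 5]
d = cohens_d(intervention, control)
print(f"d = {d:.2f} ({classify(d)})")
```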
Detailed effect comparisons, using methods like equivalence testing, and the calculation of prediction intervals play a crucial role in rigorously evaluating the degree of agreement and the likelihood of successful replication between an original study and a replication attempt. Equivalence testing, for example, determines if a new result is not meaningfully different from a previous one, rather than just statistically non-significant.
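A prediction interval for a replication can be sketched as below. The approach shown is one common approximation, not necessarily the one used in the session: it combines the sampling error of the original estimate with the sampling error expected in the replication, using a large-sample formula for the standard error of Cohen's d. The example numbers and the function names `se_of_d` and `replication_prediction_interval` are assumptions for illustration.

```python
from statistics import NormalDist


def se_of_d(d, n1, n2):
    """Large-sample approximation to the standard error of Cohen's d."""
    return ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5


def replication_prediction_interval(d_orig, n1_orig, n2_orig,
                                    n1_rep, n2_rep, level=0.95):
    """Range in which the replication's d is expected to fall if both studies
    estimate the same underlying effect."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se_orig = se_of_d(d_orig, n1_orig, n2_orig)
    se_rep = se_of_d(d_orig, n1_rep, n2_rep)
    half_width = z * (se_orig ** 2 + se_rep ** 2) ** 0.5
    return d_orig - half_width, d_orig + half_width


# Hypothetical case: original d = 0.6 with 20 per group, replication with 50 per group.
low, high = replication_prediction_interval(0.6, 20, 20, 50, 50)
print(f"95% prediction interval for the replication d: [{low:.2f}, {high:.2f}]")
```

A replication estimate falling inside the interval is statistically consistent with the original under these assumptions. With a small original sample the interval is wide, so consistency alone is weak evidence that the original effect size was accurate.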

Gathering Data for Replication Studies

Outline of detailed steps to perform during data collection for replication efforts:
- Ensuring a meticulously clear, standardized, and precisely documented method for testing specific variables, along with a consistent approach for all participants to minimize measurement error and procedural variability. This often involves detailed operational definitions for all experimental procedures and outcomes.
- Using highly specific protocols, standardized equipment, and calibrated instruments (e.g., force plates, accelerometers, physiological monitors) to ensure accuracy and reduce systematic bias. All equipment should have documented calibration procedures and checks.
Importance of implementing rigorous randomization and counterbalancing techniques to mitigate potential biases in experimental designs (e.g., order effects and selection bias), thereby enhancing internal validity by ensuring groups are comparable and treatments are administered fairly. Expectation effects such as placebo or Hawthorne effects are not removed by randomization alone and additionally call for blinding of participants and assessors where feasible.
Discuss the systematic collection of specific kinematic or kinetic data, such as average peak force during various coaching cues (e.g., "push" vs. "explode"), with a strong focus on ensuring an absolutely consistent methodology and measurement environment across all trials and participants to ensure reliable data. This includes controlling environmental variables like temperature, noise, and lighting.
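Counterbalancing the order of the coaching cues can be sketched as follows. This is a minimal example assuming a within-participant design in which every participant performs trials under each cue; the participant labels, the fixed seed, and the function name `counterbalanced_orders` are hypothetical. Every possible cue order is used equally often when the sample size is a multiple of the number of orders, and a seeded shuffle decides which participant receives which order so the allocation is both random and reproducible.

```python
import random
from itertools import permutations


def counterbalanced_orders(participants, conditions, seed=42):
    """Assign each participant an order of conditions, cycling through all
    possible orders after randomly shuffling the participants."""
    rng = random.Random(seed)              # fixed seed keeps the allocation reproducible
    orders = list(permutations(conditions))
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}


participants = [f"P{i:02d}" for i in range(1, 9)]
assignment = counterbalanced_orders(participants, ["push", "explode"])
for participant in sorted(assignment):
    print(participant, "->", " then ".join(assignment[participant]))
```

Recording the seed and the resulting allocation in the study protocol lets a replication team verify or reuse exactly the same procedure.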

Conclusion

Continuous and unwavering emphasis on ensuring profound methodological rigor and transparency throughout the entire research process, as this is the bedrock for generating meaningful, trustworthy, and ultimately useful results in sports science research. Without rigor, findings are unreliable and potentially misleading.
Future studies should not only report novel findings but also strongly highlight the necessity of replication, embrace radical transparency in methodologies and data sharing, and consistently utilize assessment tools to evaluate and uphold the quality and integrity of research at every stage, from conceptualization to dissemination.

Final Notes

Active encouragement for all researchers and practitioners to critically appraise all research studies with a discerning eye, ensuring a thorough understanding of their methodologies, limitations, and strengths before consideration for implementation in practice. This involves questioning assumptions, scrutinizing methods, and evaluating statistical reporting.
Resources will be provided for further learning, engagement with meta-scientific efforts, and practical application of robust research strategies to continuously improve the quality of evidence in the field. These resources could include links to open science frameworks, specific statistical software tutorials, and guidelines for preregistration.