Early versus late discontinuation of caffeine administration in preterm infants

Abstract

Objectives

This is a protocol for a Cochrane Review (intervention). The objectives are as follows:

To evaluate the effects of early versus late discontinuation of caffeine administration in preterm infants.

Background

Description of the condition

Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting preterm infants who require respiratory support. BPD was first described in 1967, as oxygen dependence for 28 days (NIH 1979). The most commonly used definition includes a classification of severity, and is based on gestational age (GA). For infants born at GA < 32 weeks, BPD is classified as mild (no need for oxygen), moderate (21% to 30% oxygen required), or severe (> 30% oxygen required, or positive pressure assistance), based on the required fraction of inspired oxygen at 36 weeks of corrected GA (Jobe 2001). Other BPD definitions, e.g. the physiological definition (Walsh 2004), contextualize the need for oxygen and respiratory support in the clinical condition of the infant, such as lack of episodes of desaturation for a certain amount of time. Despite the advances in neonatal care, the morbidity and mortality due to BPD remain stable (Stoll 2015). Approximately 45% of preterm infants born at a gestational age of 29 weeks develop BPD. It is associated with a significantly increased risk for pulmonary and neurologic impairment, which persists into adulthood. A survey found that 18‐ to 36‐month‐old infants with severe BPD had a significantly lower quality of life than term infants and preterm infants without BPD (Brady 2019). Several systematic reviews showed that within two years after birth, the readmission rate of children with BPD was as high as 50% (Resch 2011; Townsi 2018). In addition, pulmonary vascular dysplasia in children with BPD can lead to pulmonary hypertension. A systematic review revealed that pulmonary hypertension occurred in 17% of children with BPD, and up to 24% in children with severe BPD (Bui 2017).

Apnea of prematurity (AOP), classified as central, obstructive, or mixed, is usually defined as a cessation of breathing in a premature infant for at least 20 seconds, or a shorter pause accompanied by bradycardia (< 100 beats per minute), cyanosis, or pallor (Eichenwald 2016). It is a common problem among preterm infants, particularly extremely preterm infants (Saroha 2020). The incidence of AOP is inversely correlated with gestational age and birth weight. Seven per cent of neonates born at 34 to 35 weeks' gestation, 15% born at 32 to 33 weeks', 54% born at 30 to 31 weeks' (Martin 2011), and nearly all infants born at < 29 weeks' gestation or < 1000 g, exhibit AOP (Robertson 2009).

Apneic event frequency and duration can be reduced with respiratory interventions, including continuous positive airway pressure and pharmacologic therapies, such as methylxanthines. Caffeine is the first choice because of its efficacy, better tolerability, wider therapeutic index, and longer half‐life (Dobson 2013).

Intermittent hypoxemia (IH) is defined as brief, repetitive episodes of decreased hemoglobin oxygen saturation from a normoxic baseline, followed by reoxygenation and a return to normoxia. IH occurs frequently in preterm infants, and may last until term‐equivalent age, and after the cessation of any clinically apparent apnea‐associated symptoms (Hunt 2011; Rhein 2014).

In a retrospective economic evaluation, completed alongside the Caffeine for Apnea of Prematurity (CAP) trial, and using individual‐patient data, Dukhovny and colleagues found that caffeine was probably not only effective but also cost‐saving compared with placebo, mainly because of the reduced number of days on mechanical ventilation (Dukhovny 2011). However, this study has some limitations that may affect the precision of these results, such as the retrospective nature of the cost‐effectiveness analysis. In addition, certain resource utilization data were not evaluated adequately in the CAP trial, such as costs of inter‐hospital transport, postdischarge use of drugs, and other outpatient healthcare services (Abdel‐Hady 2015).

Description of the intervention

Several therapeutic strategies for BPD have been proposed (Sauer‐Zavala 2019). Treatment strategies to alleviate premature lung injury have been described in several reviews and surveys (Arroyo 2021; Principi 2018; Thébaud 2019), including the administration of corticosteroids (Doyle 2021), surfactants (Arroyo 2021; Isayama 2016), antioxidants (Poggi 2014), vitamin A (Darlow 2016), stem cells (Pierro 2017), and non‐invasive respiratory support (Lemyre 2017). However, these therapies are only partially beneficial. To date, caffeine is one of the few interventions that has been shown to be effective in targeting the symptoms observed in infants with BPD (Jensen 2015).

Cardiorespiratory benefits of caffeine that may contribute to the lower risk of BPD include: reduced exposure to positive airway pressure and supplemental oxygen, less frequent treatment for a patent ductus arteriosus, improved pulmonary mechanics, and direct effects on pulmonary inflammation, alveolarization, and angiogenesis. The improvement of survival without neurodevelopmental disability at 18 to 21 months has not been shown to persist at five years (Schmidt 2007), although it has been shown that there is a continuing reduction in severity of motor impairment (Schmidt 2012).

Caffeine is often available as caffeine citrate, which comes in both oral and injectable formulations; the dose of caffeine base is half that of caffeine citrate (Shrestha 2017). The Food and Drug Administration (FDA)‐approved doses for caffeine citrate are 20 mg/kg/day for loading, and 5 mg/kg/day for maintenance (NDA 20‐793/S‐001). The European Medicines Agency (EMA) approved similar caffeine citrate doses for loading and maintenance (EMA 2009). In the European Public Assessment Report (EPAR) of Peyona® and Gencebok®, the two EMA‐approved specialties that contain the active substance, caffeine citrate, higher maintenance doses of 10 mg/kg/day are considered only in case of insufficient response. This takes into account the potential for accumulation of caffeine, due to the long half‐life in preterm newborn infants, and the progressively increasing capacity to metabolize caffeine in relation to postmenstrual age (EMA 2020). The goal is to achieve a therapeutic blood level of 5 mg/L to 25 mg/L of caffeine in preterm infants, who are younger than 32 weeks' postmenstrual age (PMA).

Although caffeine use was first described in 1977 and has been widespread since, the literature has recommended significantly different PMA and postnatal ages for the first administration of caffeine, different PMAs for the discontinuation of caffeine and discharge from hospital, and different total durations of caffeine use (Aranda 1977; Gentle 2018; Ji 2020; Kumar 2019).

Dobson and colleagues summarized the clinical benefits of beginning caffeine treatment before three days of age, showing that early treatment is associated with reduced incidence of BPD, death from BPD, intraventricular hemorrhage (IVH), necrotizing enterocolitis (NEC), need for treatment of PDA, retinopathy of prematurity (ROP), and reduced use of postnatal steroids, although certainty of evidence was low (Dobson 2013).

The Cochrane Review on methylxanthine for the prevention and treatment of apnea in preterm infants found that caffeine and aminophylline may importantly reduce the incidence of apneic events and need for intermittent positive pressure ventilation (Marques 2023).

A prospective, multicenter randomized controlled trial that enrolled 105 infants who were born at less than 32 weeks' gestation, and were formerly treated with caffeine, found significant reductions in IH at 35 and 36 weeks' PMA with prolonged caffeine treatment (Rhein 2014). However, the optimum dosing regimen of caffeine required to alleviate IH, the long‐term effects of extended use of caffeine, and the timing of caffeine initiation and discontinuation still need to be identified.

How the intervention might work

Caffeine acts as an antioxidative and anti‐inflammatory drug by reducing cell death and apoptosis‐associated factors in models of oxygen‐induced lung injury (Endesfelder 2020; Nagatomo 2016). Beneficial effects of caffeine might also be mediated through a diuretic effect, as reported in clinical (Gillot 1990), and preclinical studies (Crossley 2012). Due to differences in the maturity of hepatic and renal function among preterm infants, the response to varying doses may be considerably different, for both potential benefits and harms (Stevenson 2007).

Plasma concentrations of caffeine as low as 3 mg/L to 4 mg/L have been shown to reduce apneic spells, but optimal levels range from 8 mg/L to 20 mg/L (Aranda 1979). Typically, to maintain these caffeine plasma concentration levels, a standard regimen is used, comprising a loading dose of 20 mg/kg of caffeine citrate (10 mg/kg of caffeine base) and a maintenance dose of 5 mg/kg/day (2.5 mg/kg/day of caffeine base [Blanchard 1992]). Preterm infants can tolerate higher doses of caffeine very well, even at serum concentrations of 70 mg/L or above (Lee 1997). Higher doses have been reported to further reduce respiratory morbidity (Bruschettini 2023).

Caffeine has a long half‐life: it was shown to be present in the plasma of infants who discontinued caffeine within the seven days prior to discharge, indicating that the level of caffeine may still be therapeutic days after stopping the drug, and may carry over after discharge (Charles 2008). However, infants should be monitored for recurrence of apnea for more than five to seven days after caffeine is discontinued, and it is important to assess an infant’s functional status and stability off caffeine prior to discharge (Ji 2020).

The study by Doyle and colleagues showed that caffeine may persist in an infant's plasma for 11 to 12 days after cessation of therapy (Doyle 2016).

Practices for deciding when to discontinue caffeine and when to discharge infants from hospital also vary widely. Physician discomfort, which arises from a lack of scientific evidence, results in a wide variation of practice for apnea or bradycardia events, including a delay in discharging home. This increases hospital costs, and the use of home monitors. Discharge readiness of premature infants is usually determined by the demonstration of functional maturation and medical stability. Factors contributing to a clinician’s decision to discontinue caffeine therapy include the degree of respiratory stability, and symptoms suggestive of caffeine‐induced toxicity (Butler 2014).

Although an earlier discharge has been proposed for infants on caffeine, with or without cardiorespiratory monitoring, the majority of neonatal intensive care units (NICU) keep preterm infants until they have been apnea‐free for five to seven days (Darnall 1997; Jefferies 2014; Martin 2022). It is also known that an early discontinuation increases the risk of recurrent apnea problems, and the need for increased respiratory support (Ji 2020). A safety margin of apnea‐free time before discharge has been explored, but currently, there is no consensus among physicians about how long an infant should be apnea‐free after a significant event to determine a safe discharge (Eichenwald 2016; Ji 2020).

A recent retrospective study showed that discharging stable preterm infants home on caffeine may be safe, especially for those who are waiting for the complete resolution of AOP/IH events, and are otherwise ready to go home (Ma 2020).

This was also shown in the Chimes study (Ramanathan 2001), which confirmed that complete resolution of AOP/IH in more premature infants is variable and takes a long time. Infants discharged home on caffeine needed caffeine until an average corrected gestational age of 43 weeks, compared to 35 weeks in infants in whom it was stopped during their hospital stay.

A clear benefit of early discontinuation is avoiding unnecessarily prolonged hospital stays, which add to the burden of health care costs (Butler 2014).

Why it is important to do this review

In recent decades, numerous clinical cross‐sectional and longitudinal studies have shown the efficacy of caffeine in treating primary apnea in premature infants, by stimulating the respiratory center, and in protecting the nervous and respiratory systems (Marques 2023). However, while the benefits of caffeine therapy are well known, a consensus is missing among clinicians about the appropriate timing of discontinuation relative to the infant's discharge.

The National Institute of Child Health and Human Development Neonatal Research Network developed a trial to evaluate the effectiveness and safety of continuing caffeine administration throughout hospitalization and after discharge home, in moderately preterm infants with resolving AOP. This trial, which is currently enrolling participants, will help to guide decisions about the appropriate time to discontinue caffeine use in premature infants (NCT03340727).

To determine an optimal time to discontinue the use of caffeine, clinicians must weigh the risks of early discontinuation (i.e. need for restart of therapy, respiratory compromise, and need for escalation of support) against those of later discontinuation (possible prolonged hospital stay and increased costs [Butler 2014]).

Standardization of evidence‐based interventions by reducing the variability in practice is fundamental for enhancing health care.

Objectives

To evaluate the effects of early versus late discontinuation of caffeine administration in preterm infants.

Methods

Criteria for considering studies for this review

Types of studies

We will include randomized controlled trials (RCTs) or quasi‐RCTs (with parallel groups), and cluster‐RCTs. We will exclude cross‐over randomized trials. We will exclude non‐randomized cohort studies, because they are prone to bias due to confounding by indication, or by residual confounding: both of which may influence the results of the studies (Fewell 2007; Kyriacou 2016).

Types of participants

We will include preterm infants born at less than 37 weeks' gestation, of postmenstrual age (PMA) up to 44 weeks and 0 days, who received caffeine for any indication, for at least seven days.

Types of interventions

We will include studies comparing early versus late discontinuation of caffeine.

  • Early (PMA less than 35 weeks) versus late (PMA 35 weeks or more)

  • Early (less than five days without the presence of respiratory support or apneic spells) versus late (at least five days without the presence of respiratory support or apneic spells), regardless of the age of infant

  • Early (without the need for respiratory support, or the absence of apneic spells) versus late (with discontinuation of caffeine at a scheduled PMA)

In the control group, infants might be administered placebo or receive no intervention.

Types of outcome measures

Outcome measures used will not be part of the eligibility criteria.

Primary outcomes

  • Restarting caffeine therapy

  • Intubation within one week of treatment discontinuation

  • Need for non‐invasive respiratory support (continuous positive airway pressure [CPAP], nasal intermittent positive pressure ventilation [NIPPV], high‐flow nasal cannulae) within one week of treatment discontinuation

Secondary outcomes

  • Apnea: number of episodes (defined as interruption of breathing for more than 20 seconds) in the seven days after discontinuation of treatment

  • Apnea: number of infants with at least one episode (defined as interruption of breathing for more than 20 seconds) in the seven days after discontinuation of treatment

  • Intermittent hypoxemia (IH): number of episodes in the seven days after discontinuation of treatment

  • IH: number of infants with at least one episode in the seven days after discontinuation of treatment

  • All‐cause mortality prior to hospital discharge

  • Major neurodevelopmental disability: cerebral palsy, developmental delay (Bayley Mental Developmental Index [Bayley 1993; Bayley 2006], or Griffiths Mental Development Scale [Griffiths 1954]), assessed as more than two standard deviations [SDs] below the mean, intellectual impairment (intelligence quotient [IQ] more than two SDs below the mean), blindness (vision less than 6/60 in both eyes), or sensorineural deafness requiring amplification (Jacobs 2013). We will separately assess outcomes at age 18 months to 24 months corrected age (CA), and at three to five years CA

  • Each component of the composite outcome, major neurodevelopmental disability

  • Mortality or major neurodevelopmental disability. We will separately assess outcomes at age 18 to 24 months CA, and at three to five years CA.

  • Number of days of respiratory support (mechanical ventilation, CPAP, high‐flow nasal cannula, NIPPV) after treatment discontinuation

  • Duration of hospital stay

  • Cost of neonatal care

Search methods for identification of studies

The Cochrane Sweden Information Specialist developed a draft search strategy for PubMed (National Library of Medicine) in consultation with the authors (Appendix 1). This strategy will be peer‐reviewed by an Information Specialist using the Peer Review of Electronic Search Strategies (PRESS) Checklist (McGowan 2016a; McGowan 2016b). The PubMed strategy will be translated, using appropriate syntax, for other databases.

We will use a population filter developed by Cochrane Neonatal. As recommended by Cochrane Neonatal, we will adapt the RCT search filter for MEDLINE Ovid to the syntax of PubMed to identify randomized and quasi‐randomized studies. We will conduct searches for eligible trials without language, publication year, publication type, or publication status restrictions.

Electronic searches

We will search the following databases:

  • Cochrane Central Register of Controlled Trials (CENTRAL), via Wiley;

  • MEDLINE via PubMed (1946 to present);

  • Embase via Elsevier (1974 to present)

Searching other resources

We will identify trial registration records using CENTRAL, and by independent searches of:

We will screen the reference lists of included studies and related systematic reviews for studies not identified by the database searches.

We will search for errata or retractions for included studies published in PubMed (www.ncbi.nlm.nih.gov/pubmed).

Data collection and analysis

We will use the standard methods of Cochrane Neonatal, as described below.

Selection of studies

We will download all titles and abstracts retrieved by electronic searching to a reference management software and remove duplicates. We will use Cochrane's Screen4Me to reduce screening activities by the authors (Marshall 2018; Noel‐Storr 2020; Noel‐Storr 2021; Thomas 2021). Screen4Me comprises three components.

  1. Known assessments (a service that matches records in the search results to records that have been screened by Cochrane Crowd and labeled as 'RCT' or 'not an RCT');

  2. The RCT classifier (a machine‐learning model that distinguishes RCTs from non‐RCTs);

  3. Cochrane Crowd (Cochrane’s crowdsourcing platform, through which contributors from around the world help to identify randomized trials and other types of healthcare‐related research).

We will add any references categorized as non‐RCTs through the known assessments and the RCT classifier to the irrelevant segment of Covidence. This approach will mean references are available for the purposes of deduplication when the review is updated; and for verification purposes should questions arise about the accuracy of Screen4Me categorization. We will present the results of Screen4Me in a figure in the full review, and incorporate the disposition of references into a PRISMA flow diagram (Liberati 2009).

Two review authors (SU, MB) will independently screen the remaining title/abstracts. Two review authors (SU, MB) will independently assess the full‐text of references included after the title/abstract review. At any point in the screening process, we will resolve disagreements between review authors by discussion. We will document the reasons for excluding studies during our full‐text review in the characteristics of excluded studies table. We will exclude studies if one or more PICO‐S elements are absent; if a study omits more than one PICO‐S element, we will document only one.

We will collate multiple reports of the same study so that each study, rather than each report, is the unit of interest in the review. We will record the selection process in sufficient detail to complete a PRISMA flow diagram (Liberati 2009).

Data extraction and management

Two review authors (SU; MB) will independently extract data using a data extraction form integrated with a modified version of the Cochrane Effective Practice and Organisation of Care Group data collection checklist (EPOC 2017). We will pilot the form within the review team, using a sample of included studies.

We will extract the following characteristics from each included study.

  • administrative details: study author(s), published or unpublished, year of publication, year in which study was conducted, presence of vested interest, details of other relevant papers cited

  • study characteristics: study registration, study design type, study setting, number of study centers and location, informed consent, ethics approval, completeness of follow‐up (e.g. greater than 80%)

  • participants: number randomized, number lost to follow‐up/withdrawn, number analyzed, mean GA, GA age range, mean CA or CA age range, inclusion criteria, and exclusion criteria

  • interventions: initiation, dose, and duration of caffeine administration

  • outcomes: outlined above, under Types of outcome measures

We will resolve any disagreements by discussion.

We will describe ongoing studies identified by our search and document available information, such as the primary author, research question(s), methods, and outcome measures, together with an estimate of the anticipated reporting date in the characteristics of ongoing studies table.

Should any queries arise, or in cases for which additional data are required, we will contact study investigators/authors for clarification. Two review authors (SU, MB) will use Cochrane software for data entry (RevMan Web 2023). We will replace any standard error of the mean (SEM) by the corresponding SD.
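The SEM-to-SD replacement mentioned above follows from the definition SEM = SD / sqrt(n). A minimal sketch, assuming the group size n is reported alongside the SEM (the function name is illustrative, not part of any Cochrane tool):

```python
import math


def sd_from_sem(sem, n):
    """Recover a standard deviation from a standard error of the mean.

    Uses SD = SEM * sqrt(n), where n is the number of participants
    in the group the SEM refers to.
    """
    if n <= 0:
        raise ValueError("n must be a positive group size")
    return sem * math.sqrt(n)
```

For example, a reported SEM of 0.5 in a group of 16 infants corresponds to an SD of 2.0.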

Assessment of risk of bias in included studies

Two review authors (SU, MB) will use the Cochrane RoB 1 tool to independently assess the risk of bias (low, high, or unclear) of all included studies for the following domains (Higgins 2017).

  • Sequence generation (selection bias)

  • Allocation concealment (selection bias)

  • Blinding of participants and personnel (performance bias)

  • Blinding of outcome assessment (detection bias)

  • Incomplete outcome data (attrition bias)

  • Selective reporting (reporting bias)

  • Any other bias

We will resolve any disagreements by discussion or by consultation with a third review author (WO). A more detailed description of risk of bias for each domain is given in Appendix 2.

Measures of treatment effect

Dichotomous data

For dichotomous data, we will present results using risk ratios (RR) and risk differences (RD) with 95% confidence intervals (CIs). We will calculate the number needed to treat for an additional beneficial outcome (NNTB), or number needed to treat for an additional harmful outcome (NNTH) with 95% CIs if there is a statistically significant reduction (or increase) in RD.
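As an illustration of these measures, the sketch below computes RR, RD, and a number needed to treat from a single two-arm trial. It assumes a log-scale Wald interval for RR, a simple Wald interval for RD, and non-zero event counts in both arms; review software may use different interval methods. The function name and return keys are hypothetical, and whether the resulting number is an NNTB or NNTH depends on whether the outcome is harmful and on the sign of the RD.

```python
import math


def dichotomous_effects(events_exp, n_exp, events_ctl, n_ctl, z=1.96):
    """Return RR, RD, their 95% CIs, and an NNT when the RD CI excludes zero.

    Assumes non-zero event counts in both arms (no continuity correction).
    """
    risk_exp = events_exp / n_exp
    risk_ctl = events_ctl / n_ctl

    # Risk ratio, with the confidence interval built on the log scale.
    rr = risk_exp / risk_ctl
    se_log_rr = math.sqrt(
        1 / events_exp - 1 / n_exp + 1 / events_ctl - 1 / n_ctl
    )
    rr_ci = (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr))

    # Risk difference, with a Wald confidence interval.
    rd = risk_exp - risk_ctl
    se_rd = math.sqrt(
        risk_exp * (1 - risk_exp) / n_exp + risk_ctl * (1 - risk_ctl) / n_ctl
    )
    rd_ci = (rd - z * se_rd, rd + z * se_rd)

    # The NNT is reported only when the RD interval excludes zero.
    significant = rd_ci[0] > 0 or rd_ci[1] < 0
    nnt = 1 / abs(rd) if significant else None

    return {"rr": rr, "rr_ci": rr_ci, "rd": rd, "rd_ci": rd_ci, "nnt": nnt}
```

With 10/100 events in the experimental arm and 20/100 in the control arm, this gives an RR of 0.5, an RD of -0.1, and an NNT of 10.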

Continuous data

For continuous data, we will use the mean difference (MD) when outcomes were measured in the same way between trials. We will use the standardized mean difference (SMD) to combine data from trials that measured the same outcome but used different methods. Where trials reported continuous data as median and interquartile range (IQR), and data passed the test of skewness, we will convert median to mean, and estimate the standard deviation as IQR/1.35.
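The median-to-mean conversion described above can be sketched as follows. This assumes the data are approximately symmetric (so the median is taken as the mean) and that the IQR is reported as its two quartiles; the function name is illustrative.

```python
def mean_sd_from_median_iqr(median, q1, q3):
    """Approximate mean and SD from a median and interquartile range.

    The median is used as the mean, and the SD is estimated as
    IQR / 1.35, which holds for roughly normally distributed data.
    """
    if q3 < q1:
        raise ValueError("q3 must not be smaller than q1")
    return median, (q3 - q1) / 1.35
```

For example, a median of 10 days with quartiles 8 and 10.7 yields an approximate mean of 10 and an SD of 2.0.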

Unit of analysis issues

The unit of analysis will be the participating infant in individually randomized trials; an infant will be considered only once in the analysis. The participating neonatal unit or section of a neonatal unit or hospital will be the unit of analysis in cluster‐randomized trials. For cluster‐randomized trials, we will abstract information on the study design and unit of analysis for each study, indicating whether clustering of observations is present due to allocation to the intervention at the group level, or clustering of individually randomized observations (e.g. infants within clinics). We will abstract available statistical information needed to account for the implications of clustering on the estimation of outcome variances, such as design effects or intra‐cluster correlations (ICCs), and whether the study adjusted results for the correlations in the data. In cases where the study does not account for clustering, we will ensure that appropriate adjustments are made to the effective sample size following Cochrane guidance (Higgins 2022). Where possible, we will derive the ICC for these adjustments from the trial itself, or from a similar trial. If an appropriate ICC is unavailable, we will conduct sensitivity analyses to investigate the potential effect of clustering, by imputing a range of values of ICC.
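A common way to make the effective sample size adjustment is to divide both events and participants by the design effect, 1 + (m - 1) × ICC, where m is the average cluster size. The sketch below assumes that approach; the function names and the ICC values used in the usage note are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """Design effect for a cluster-randomized trial: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_counts(events, n, mean_cluster_size, icc):
    """Shrink one arm's events and sample size by the design effect."""
    deff = design_effect(mean_cluster_size, icc)
    return events / deff, n / deff


def icc_sensitivity(events, n, mean_cluster_size, icc_values):
    """Effective counts for one arm across a range of imputed ICC values."""
    return {
        icc: effective_counts(events, n, mean_cluster_size, icc)
        for icc in icc_values
    }
```

For example, with an average cluster size of 21 and an ICC of 0.05, the design effect is 2, so an arm with 40 events among 200 infants contributes 20 events among 100. Calling `icc_sensitivity(40, 200, 21, [0.01, 0.05, 0.1])` shows how the effective counts change across imputed ICC values.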

If any trials have multiple arms compared against the same control condition that will be included in the same meta‐analysis, we will either combine groups to create a single pair‐wise comparison, or select the pair of interventions that more closely match the definitions given in Types of interventions, and exclude the others. We will acknowledge this potential selective bias of data used for analysis in the Discussion section.

Dealing with missing data

We intend to carry out analysis on an intention‐to‐treat basis for all included outcomes. Whenever possible, we will analyze all participants in the treatment group to which they were randomized, regardless of the actual treatment received. If we identify important missing data (in the outcomes) or unclear data, we will contact the original investigators and request the missing data. We will make explicit the assumptions of any methods used to deal with missing data.

For missing dichotomous outcomes, we will include participants with incomplete or missing data in the sensitivity analyses by imputing them according to the following scenarios.

  • Extreme‐case analysis favoring the experimental intervention (best‐worst case scenario): none of the dropouts/participants lost from the experimental arm, but all the dropouts/participants lost from the control arm experienced the outcome, including all randomized participants in the denominator

  • Extreme‐case analysis favoring the control (worst‐best case scenario): all dropouts/participants lost from the experimental arm, but none from the control arm experienced the outcome, including all randomized participants in the denominator
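The two extreme-case scenarios above can be sketched as a simple re-counting of events, with all randomized participants in the denominator. This assumes the outcome is an unfavorable event; the function name and argument names are illustrative.

```python
def extreme_case_counts(events_exp, analysed_exp, missing_exp,
                        events_ctl, analysed_ctl, missing_ctl):
    """Return (events, denominator) per arm for both extreme-case scenarios.

    Denominators include every randomized participant: those analysed
    plus those lost to follow-up.
    """
    n_exp = analysed_exp + missing_exp
    n_ctl = analysed_ctl + missing_ctl

    # Best-worst: no missing experimental participant had the outcome,
    # every missing control participant did.
    best_worst = {
        "experimental": (events_exp, n_exp),
        "control": (events_ctl + missing_ctl, n_ctl),
    }

    # Worst-best: every missing experimental participant had the outcome,
    # no missing control participant did.
    worst_best = {
        "experimental": (events_exp + missing_exp, n_exp),
        "control": (events_ctl, n_ctl),
    }
    return {"best_worst": best_worst, "worst_best": worst_best}
```

The resulting counts can then be fed into the usual RR or RD calculation to see whether the conclusion is robust to the missing data.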

For continuous outcomes, we will calculate missing standard deviations using reported P values or CIs (Higgins 2022). If calculation is not possible, we will impute a standard deviation as the highest standard deviation reported in the other trials for the corresponding treatment group and outcome.
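As one instance of this calculation, the sketch below recovers a group SD from a 95% CI around that group's mean. It assumes a large sample, so that the interval spans 2 × 1.96 = 3.92 standard errors; for small samples a t-distribution multiplier would be needed instead. The function name is illustrative.

```python
import math


def sd_from_mean_ci(lower, upper, n, ci_width_in_se=3.92):
    """Estimate a group SD from the 95% CI of its mean (large-sample case).

    The standard error is (upper - lower) / 3.92, and SD = SE * sqrt(n).
    """
    if upper < lower:
        raise ValueError("upper limit must not be below lower limit")
    standard_error = (upper - lower) / ci_width_in_se
    return standard_error * math.sqrt(n)
```

For example, a mean with a 95% CI of 9.02 to 10.98 in a group of 100 infants implies a standard error of 0.5 and an SD of about 5.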

We will address the potential impact of missing data on the findings of the review in the Discussion section.

Assessment of heterogeneity

We will describe the clinical diversity and methodological variability of the evidence narratively and in tables. Tables will include data on study characteristics, such as design features, population characteristics, and intervention details.

To assess statistical heterogeneity, we will visually inspect forest plots and describe the direction and magnitude of effects and the degree of overlap between confidence intervals. We will also consider the statistics generated in forest plots that measure statistical heterogeneity. We will use the I² statistic to quantify inconsistency among the trials in each analysis. We will also consider the P value from the Chi² test to assess if this heterogeneity is significant (P < 0.1). If we identify substantial heterogeneity, we will report the finding and explore possible explanatory factors using prespecified subgroup analysis.

We will grade the degree of heterogeneity as:

  • 0% to 40% might not represent important heterogeneity;

  • 30% to 60% may represent moderate heterogeneity;

  • 50% to 90% may represent substantial heterogeneity;

  • more than 75% may represent considerable heterogeneity

We will use this rough guideline to interpret the I² value rather than a simple threshold, and our interpretation will take into account the understanding that measures of heterogeneity (I² and Tau) will be estimated with high uncertainty when the number of studies is small (Deeks 2022).
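For orientation, the I² statistic can be derived from Cochran's Q as I² = max(0, (Q - df) / Q) × 100, where df is the number of studies minus one. The sketch below assumes inverse-variance weights and study-level effect estimates with known variances; the function name is illustrative.

```python
def q_and_i_squared(effects, variances):
    """Return Cochran's Q and the I-squared percentage for a set of studies."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Q is the weighted sum of squared deviations from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    if q <= 0:
        return q, 0.0
    return q, max(0.0, (q - df) / q) * 100
```

With two studies reporting effects of 0.0 and 1.0, each with variance 0.25, Q is 2 on 1 degree of freedom and I² is 50%.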

Assessment of reporting biases

We will assess reporting bias by comparing the stated primary and secondary outcomes with the reported outcomes. When study protocols are available, we will compare these to the full publications to determine the likelihood of reporting bias. We will document studies using the interventions in a potentially eligible infant population, but not reporting on any of the primary and secondary outcomes, in the characteristics of included studies tables.

We will use funnel plots to screen for publication bias when there are a sufficient number of studies (> 10) reporting the same outcome. If publication bias is suggested by a significant asymmetry of the funnel plot on visual assessment, we will incorporate this in our assessment of certainty of evidence (Egger 1997). If our review includes fewer than 10 studies eligible for meta‐analysis, the ability to detect publication bias will be largely diminished, and we will simply note our inability to rule out possible publication bias or small study effects.

Data synthesis

If we identify multiple studies that we consider to be sufficiently similar, we will perform meta‐analysis using Review Manager Web (RevMan Web 2023). For categorical outcomes, we will calculate the typical estimates of RR and RD, each with its 95% CI; for continuous outcomes, we will calculate the MD or the SMD, each with its 95% CI. We will use a fixed‐effect model to combine data where it is reasonable to assume that studies were estimating the same underlying treatment effect. If we judge meta‐analysis to be inappropriate, we will analyze and interpret individual trials separately. If there is evidence of clinical heterogeneity, we will try to explain this based on the different study characteristics and subgroup analyses.
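To illustrate what a fixed-effect model does, the sketch below pools study estimates by generic inverse-variance weighting. This is an assumption made for illustration: review software may apply other weighting schemes (for example, Mantel-Haenszel for dichotomous data). Ratio measures such as RR would be entered on the log scale and back-transformed afterwards; the function name is hypothetical.

```python
import math


def fixed_effect_pool(estimates, standard_errors, z=1.96):
    """Inverse-variance fixed-effect pooled estimate with a 95% CI.

    Each study is weighted by 1 / SE^2, so more precise studies
    contribute more to the pooled estimate.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    ci = (pooled - z * pooled_se, pooled + z * pooled_se)
    return pooled, pooled_se, ci
```

For two studies with mean differences of 0.0 and 1.0 and equal standard errors of 0.5, the pooled estimate is 0.5 with a standard error of about 0.354.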

Subgroup analysis and investigation of heterogeneity

We will interpret tests for subgroup differences in effects with caution, given the potential for confounding with other study characteristics and the observational nature of the comparisons; see section 10.11.2 of the Cochrane Handbook for Systematic Reviews of Interventions (Higgins 2022). In particular, subgroup analyses with fewer than five studies per category are unlikely to be adequate to ascertain valid difference in effects, and we will not highlight them in our results. When subgroup comparisons are possible, we will conduct stratified meta‐analysis and a formal statistical test for interaction to examine subgroup differences that could account for effect heterogeneity (e.g. Cochran’s Q test, meta‐regression [Borenstein 2013; Deeks 2022]).

Given the potential differences in the intervention effectiveness related to gestational age and duration of caffeine treatment, as discussed in the Background, we will conduct subgroup comparisons to see if the intervention is more effective in particular subgroups.

We plan to carry out the following subgroup analyses of factors that may contribute to heterogeneity in the effects of the intervention.

  • gestational age: extremely preterm (less than 28 weeks); very preterm (28 to less than 32 weeks); 32 weeks or more

  • duration of caffeine treatment before randomization to discontinuation: less than one week; one to four weeks; more than four weeks

  • indication for initial treatment: prevention of apnea; treatment of apnea; peri‐extubation management

We will use the main outcomes (those specified for the summary of findings table) in subgroup analyses if there are enough studies reporting the outcomes to support valid subgroup comparisons (at least five studies per subgroup).

Sensitivity analysis

We will conduct sensitivity analyses to explore the effect of the methodological quality of studies, and check to ascertain whether studies with a high risk of bias (in at least two domains) overestimate the effect of treatment.

Differences in study design of included studies might also affect the results of the systematic review. We will perform a sensitivity analysis to compare the effects of caffeine in truly randomized trials as opposed to quasi‐randomized trials.

Summary of findings and assessment of the certainty of the evidence

We will use the GRADE approach, as outlined in the GRADE Handbook, to assess the certainty of evidence for the following (clinically relevant) outcomes (Schünemann 2013).

  • Restarting caffeine therapy

  • Intubation within one week of treatment discontinuation

  • Need for non‐invasive respiratory support (CPAP, NIPPV, high‐flow nasal cannulae) within one week of treatment discontinuation

  • Major neurodevelopmental disability

Two review authors (SU, MB) will independently assess the certainty of the evidence for each of the outcomes above. We will consider evidence from randomized controlled trials as high certainty, downgrading the evidence one level for serious (or two levels for very serious) limitations based upon the following: design (risk of bias), consistency across studies, directness of the evidence, precision of estimates, and presence of publication bias. We will use GRADEpro GDT software to create a summary of findings table to report the certainty of the evidence.

The GRADE approach results in an assessment of the certainty of a body of evidence in one of the following four grades.

  • High: we are very confident that the true effect lies close to that of the estimate of the effect

  • Moderate: we are moderately confident in the effect estimate; the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different

  • Low: our confidence in the effect estimate is limited; the true effect may be substantially different from the estimate of the effect

  • Very low: we have very little confidence in the effect estimate; the true effect is likely to be substantially different from the estimate of effect