Interim PET‐results for prognosis in adults with Hodgkin lymphoma: a systematic review and meta‐analysis of prognostic factor studies

Angela Aldin; Lisa Umlauff; Lise J Estcourt; Gary Collins; Karel GM Moons; Andreas Engert; Carsten Kobe; Bastian von Tresckow; Madhuri Haque; Farid Foroutan; Nina Kreuzberger; Marialena Trivella; Nicole Skoetz

doi:10.1002/14651858.CD012643.pub3



Version published: 13 January 2020



Abstract


Background

Hodgkin lymphoma (HL) is one of the most common haematological malignancies in young adults and, with cure rates of 90%, has become curable for the majority of individuals. Positron emission tomography (PET) is an imaging tool used to monitor a tumour’s metabolic activity, stage and progression. Interim PET during chemotherapy has been posited as a prognostic factor in individuals with HL to distinguish between those with a poor prognosis and those with a better prognosis. This distinction is important to inform decision‐making on the clinical pathway of individuals with HL.

Objectives

To determine whether in previously untreated adults with HL receiving first‐line therapy, interim PET scan results can distinguish between those with a poor prognosis and those with a better prognosis, and thereby predict survival outcomes in each group.

Search methods

We searched MEDLINE, Embase, CENTRAL and conference proceedings up until April 2019. We also searched one trial registry (ClinicalTrials.gov).

Selection criteria

We included retrospective and prospective studies evaluating interim PET scans in a minimum of 10 individuals with HL (all stages) undergoing first‐line therapy. Interim PET was defined as conducted during therapy (after one, two, three or four treatment cycles). The minimum follow‐up period was 12 months. We excluded studies if the trial design allowed treatment modification based on the interim PET scan results.

Data collection and analysis

We developed a data extraction form according to the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS). Two teams of two review authors independently screened the studies, extracted data on overall survival (OS), progression‐free survival (PFS) and PET‐associated adverse events (AEs), assessed risk of bias (per outcome) according to the Quality in Prognosis Studies (QUIPS) tool, and assessed the certainty of the evidence (GRADE). We contacted investigators to obtain missing information and data.

Main results

Our literature search yielded 11,277 results. In total, we included 23 studies (99 references) with 7335 newly‐diagnosed individuals with classic HL (all stages).

Participants in 16 studies underwent (interim) PET combined with computed tomography (PET‐CT), compared to PET only in the remaining seven studies. The standard chemotherapy regimen was ABVD in 16 studies, compared to BEACOPP or other regimens in seven studies. Most studies (N = 21) conducted interim PET scans after two cycles (PET2) of chemotherapy, although PET1, PET3 and PET4 were also reported in some studies. In the meta‐analyses, we used PET2 data if available as we wanted to ensure homogeneity between studies. In most studies (N = 12), interim PET scan results were evaluated according to the Deauville 5‐point scale.

Eight studies were not included in meta‐analyses due to missing information and/or data; results were reported narratively. For the remaining studies, we pooled the unadjusted hazard ratio (HR). The timing of the outcome measurement was after two or three years (the median follow‐up time ranged from 22 to 65 months) in the pooled studies.

Eight studies explored the independent prognostic ability of interim PET by adjusting for other established prognostic factors (e.g. disease stage, B symptoms). We did not pool the results because the multivariable analyses adjusted for a different set of factors in each study.

Overall survival

Twelve (out of 23) studies reported OS. Six of these were assessed as low risk of bias in all of the first four domains of QUIPS (study participation, study attrition, prognostic factor measurement and outcome measurement). The other six studies were assessed as unclear, moderate or high risk of bias in at least one of these four domains. Four studies were assessed as low risk, and eight studies as high risk of bias for the domain other prognostic factors (covariates). Nine studies were assessed as low risk, and three studies as high risk of bias for the domain 'statistical analysis and reporting'.

We pooled nine studies with 1802 participants. Participants with HL who have a negative interim PET scan result probably have a large advantage in OS compared to those with a positive interim PET scan result (unadjusted HR 5.09, 95% confidence interval (CI) 2.64 to 9.81, I² = 44%, moderate‐certainty evidence). In absolute values, this means that 900 out of 1000 participants with a negative interim PET scan result will probably survive longer than three years compared to 585 (95% CI 356 to 757) out of 1000 participants with a positive result.
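As an illustration only, the pooling of study‐level hazard ratios and the I² statistic can be sketched as follows. The sketch assumes an inverse‐variance random‐effects model (DerSimonian‐Laird) on the log hazard ratio scale, which this abstract does not specify. The function name `pool_hazard_ratios` and the three study results are hypothetical and are not data from the included studies.

```python
import math


def pool_hazard_ratios(studies, z=1.96):
    """Pool hazard ratios with a DerSimonian-Laird random-effects model.

    `studies` is a list of (hr, ci_low, ci_high) tuples with 95% CIs.
    This is an assumed model, not necessarily the one used in the review.
    """
    # Work on the log scale, where the HR is approximately normal.
    log_hr = [math.log(hr) for hr, _, _ in studies]
    # Recover each standard error from the width of the 95% CI.
    se = [(math.log(hi) - math.log(lo)) / (2 * z) for _, lo, hi in studies]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / s ** 2 for s in se]
    fixed = sum(wi * yi for wi, yi in zip(w, log_hr)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_hr))
    df = len(studies) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate.
    w_re = [1 / (s ** 2 + tau2) for s in se]
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_hr)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "hr": math.exp(pooled),
        "ci_low": math.exp(pooled - z * se_pooled),
        "ci_high": math.exp(pooled + z * se_pooled),
        "i2": i2,
    }


# Hypothetical study-level results (HR, lower and upper 95% CI limits).
example = [(3.1, 1.2, 8.0), (6.5, 2.0, 21.1), (4.8, 1.9, 12.1)]
result = pool_hazard_ratios(example)
print(f"Pooled HR {result['hr']:.2f} "
      f"({result['ci_low']:.2f} to {result['ci_high']:.2f}), "
      f"I2 = {result['i2']:.0f}%")
```

Working on the log scale keeps the confidence interval symmetric around the pooled estimate before it is transformed back to a hazard ratio.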

Adjusted results from two studies also indicate an independent prognostic value of interim PET scan results (moderate‐certainty evidence).

Progression‐free survival

Twenty‐one studies reported PFS. Eleven out of 21 were assessed as low risk of bias in the first four domains. The remaining were assessed as unclear, moderate or high risk of bias in at least one of the four domains. Eleven studies were assessed as low risk, and ten studies as high risk of bias for the domain other prognostic factors (covariates). Eight studies were assessed as high risk, thirteen as low risk of bias for statistical analysis and reporting.

We pooled 14 studies with 2079 participants. Participants who have a negative interim PET scan result may have an advantage in PFS compared to those with a positive interim PET scan result, but the evidence is very uncertain (unadjusted HR 4.90, 95% CI 3.47 to 6.90, I² = 45%, very low‐certainty evidence). This means that 850 out of 1000 participants with a negative interim PET scan result may be progression‐free longer than three years compared to 451 (95% CI 326 to 569) out of 1000 participants with a positive result.

Adjusted results (not pooled) from eight studies also indicate that there may be an independent prognostic value of interim PET scan results (low‐certainty evidence).

PET‐associated adverse events

No study measured PET‐associated AEs.

Authors' conclusions

This review provides moderate‐certainty evidence that interim PET scan results predict OS, and very low‐certainty evidence that interim PET scan results predict progression‐free survival in treated individuals with HL. This evidence is primarily based on unadjusted data. More studies are needed to test the adjusted prognostic ability of interim PET against established prognostic factors.

Plain language summary

Imaging with positron emission tomography (PET) during chemotherapy to predict outcome in adults with Hodgkin lymphoma

Review question

This Cochrane Review aimed to find out whether the results of positron emission tomography (PET) during therapy in people with Hodgkin lymphoma (HL) can help to distinguish between those with a poor prognosis and those with a better prognosis, and predict survival outcomes in each group.

Background

Hodgkin lymphoma is a cancer that affects the body's lymphoid system. It is considered a relatively rare disease (two to three cases per 100,000 people every year in Western countries) and is most common in young adults, but it can also occur in children and elderly people. As treatment options have improved, most people with HL can now be cured. It is important that individuals receive the treatment with the greatest efficacy and the least toxicity possible. PET is an imaging tool used to assess the stage of an individual's disease and to monitor tumour activity. It has been suggested that PET performed during therapy (so‐called interim PET, for example after two cycles of chemotherapy) can distinguish between people who respond well to therapy and those who do not. The aim of this review was to assess the prognostic ability of interim PET to distinguish between these groups, and to predict survival outcomes in each group, in order to help clinicians make well‐informed treatment decisions that improve long‐term outcomes and safety for people with HL.

Study characteristics

We included 23 studies to explore the association between interim PET scan results after one to four cycles of chemotherapy and survival outcomes in adults with HL (all stages). We contacted 10 authors, and six provided us with relevant information and/or data.

Key results

In 16 included studies, participants received either ABVD chemotherapy or BEACOPP chemotherapy only (four studies), with or without radiotherapy. In 16 studies, participants underwent interim PET scans in combination with computed tomography (CT) (i.e. PET‐CT), which is more accurate in detecting cancer than PET scans alone. In the remaining seven studies, PET alone was performed. Twenty‐one studies conducted interim PET scans after two cycles (PET2) of chemotherapy.

Eight studies did not report sufficient data on our outcomes or population of interest, so we reported the results of these studies narratively. We combined the results of individual studies in meta‐analyses to provide robust evidence for our outcomes, overall survival and progression‐free survival. No study measured PET‐associated adverse events (harms).

For overall survival, combined results from nine studies (1802 participants) show that there is probably a large advantage in overall survival for people with a negative interim PET scan compared to people with a positive interim PET scan. For progression‐free survival, combined results from 14 studies (2079 participants) show that people with a negative interim PET scan may have an advantage in progression‐free survival compared to people with a positive interim PET scan, but we are very uncertain about this result. These are unadjusted results, in which interim PET was tested as the only prognostic factor.

Eight studies reported adjusted results, in which the independent prognostic ability of interim PET was assessed against other established prognostic factors (for example disease stage, B symptoms). We could not combine the results of the individual studies because the studies did not include the same set of covariates. However, their results indicate a possible independent prognostic ability of interim PET to predict both outcomes.

Certainty of the evidence

Regarding the unadjusted results, we rated our certainty in the evidence as 'moderate' for overall survival. This means that the true effect is likely to be close to the estimated effect, but there is a possibility that it is substantially different. For progression‐free survival, we rated our certainty in the evidence as 'very low', meaning that we have very little confidence in the effect estimate, and that the true effect may be substantially different from the estimated effect.

Regarding the adjusted results, we rated our certainty in the evidence as 'moderate' for overall survival, and 'low' for progression‐free survival.

How up to date is this review?

We searched databases up to 2 April 2019, and one trial registry on 25 January 2019.

Authors' conclusions

Implications for practice

This review provides moderate‐certainty evidence that interim positron emission tomography (PET) scan results predict overall survival (OS), and very low‐certainty evidence that interim PET scan results predict progression‐free survival (PFS) in individuals with Hodgkin lymphoma (HL) (evidence of the pooled, unadjusted results). The evidence on the ability of interim PET scan results to distinguish between individuals with a poor prognosis and individuals with a good prognosis can aid decision‐making for clinicians and diagnosed individuals, and the evidence may be used in international treatment guidelines for individuals with HL.

Implications for research

Multivariable analyses and prognostic models

Thus far, the prognostic value of interim PET has mostly been assessed in univariable analyses, in which its ability to predict survival outcomes in individuals with HL has been shown. However, a single factor is usually not sufficient to give a satisfactory prediction of an outcome, and clinicians therefore usually use additional factors to give an accurate prediction of an individual's disease progression and health outcome (Moons 2009). Hence, it is important to also assess the independent prognostic value of the prognostic factor of interest (in this case interim PET) against established prognostic factors such as disease stage, age, sex, B symptoms or other relevant clinical and individual factors in multivariable analyses (Moons 2009; Riley 2019). In such analyses, the independent prognostic ability of a factor, as well as its incremental value on top of other prognostic factors, can be assessed (Moons 2009). In a next step, prognostic models can be built that include multiple prognostic factors that have been proven to be predictive of outcome. Such models are built for risk adaptation and treatment stratification for participants who present with the specific factors included in a prediction model for a specific disease, and thereby enable more individualised disease monitoring and treatment guidance. Using a combination of factors, rather than one factor only, allows for a more individual and accurate estimate of the risk of a patient experiencing a certain health event (or outcome) within a specific period of time (Moons 2009; Steyerberg 2013).

With regard to our index prognostic factor, we could pool adjusted results in meta‐analyses in an update of this review if new studies adjusted for the same set of prognostic factors (covariates). There are a number of different established clinical and individual prognostic factors that can be used to predict survival outcomes in individuals with HL (Cuccaro 2014; Josting 2010; Kılıçkap 2013). In order to enable pooling of adjusted results, future authors of systematic reviews of prognostic factor studies could define a core set of covariates a priori (Riley 2019).

Study design

There is some evidence from retrospective studies that interim PET scan results can predict outcome in individuals during chemotherapy. However, it is commonly agreed that the true prognostic value of this factor can best be assessed in randomised controlled trials (RCTs), in which participants are randomly assigned to a standard or an experimental arm. In the standard arm, participants continue with the planned therapy regimen independent of the interim PET scan result. In the experimental arm, however, different treatments are given according to the interim PET scan result, e.g. de‐escalation of treatment in interim PET‐negative participants. Hence, RCTs are the most suitable study design: results from experimental arms in which participants receive therapy adaptation based on the interim PET scan result provide the most robust evidence on whether this strategy can improve outcome while making treatment safer. Although assessing therapy modification was not an aim of our review, we judged it important to present and discuss some results of published trials that evaluated the impact of PET‐adapted treatment on survival outcomes.

Summary of findings


Summary of findings 1. Comparison of interim PET‐negative and interim PET‐positive individuals with Hodgkin Lymphoma

Comparison of interim PET‐positive and interim PET‐negative participants with Hodgkin lymphoma
Population: Individuals with Hodgkin lymphoma
Setting: Eleven studies recruited participants from a total of 28 haemato‐oncology treatment centres/hospitals in Brazil (N = 1), China (N = 1), Denmark (N = 4), France (N = 4), Italy (N = 3), Poland (N = 11), UK (N = 2) and the USA (N = 2). One study (Straus 2011) included participants from 29 institutions, but did not report the countries. One study (Simon 2016) reported the country (Hungary) but not the number of centres. One multi‐centre study (Hutchings 2014) recruited participants from four countries (USA, Italy, Poland and Denmark). One RCT (Kobe 2018) included participants from 301 hospitals and private practices in Germany, Switzerland, Austria, the Netherlands, and the Czech Republic.
Outcomes	Anticipated absolute effects* (95% CI): risk with interim PET‐negative	Anticipated absolute effects* (95% CI): risk with interim PET‐positive	Relative effect (95% CI)	№ of participants (studies)	Certainty of the evidence (GRADE)	Comments
Overall survival, follow‐up: 3 years (low assumed survival)	900 per 1000¹	585 per 1000¹ (356 to 757)	HR 5.09 (2.64 to 9.81)	1802 (9 studies)	⊕⊕⊕⊝ MODERATE² ³ ⁴	
Overall survival, follow‐up: 3 years (high assumed survival)	980 per 1000¹	902 per 1000¹ (820 to 948)	HR 5.09 (2.64 to 9.81)	1802 (9 studies)	⊕⊕⊕⊝ MODERATE² ³ ⁴	
Progression‐free survival, follow‐up: 3 years (low assumed survival)	850 per 1000⁵	451 per 1000⁵ (326 to 569)	HR 4.90 (3.47 to 6.90)	2079 (14 studies)	⊕⊝⊝⊝ VERY LOW⁶ ⁷ ⁸	
Progression‐free survival, follow‐up: 3 years (high assumed survival)	940 per 1000⁵	738 per 1000⁵ (653 to 807)	HR 4.90 (3.47 to 6.90)	2079 (14 studies)	⊕⊝⊝⊝ VERY LOW⁶ ⁷ ⁸	
Adverse events associated with PET ‐ not reported	No study measured PET‐associated adverse events.		‐	‐	‐	
Overall survival (adjusted effect estimate)	Two studies reported an adjusted effect estimate for overall survival after interim PET2: hazard ratios of 3.2 (95% CI 1.3 to 8.4, P = 0.02) (Kobe 2018) and 11.51 (95% CI 3.14 to 42.86, P < 0.001) (Simon 2016) indicate the independent prognostic value of interim PET over and above other clinically relevant prognostic factors.		‐	843 (2 studies)	⊕⊕⊕⊝ MODERATE⁹	
Progression‐free survival (adjusted effect estimate)	Eight studies conducted a multivariable analysis to test the independent prognostic value of interim PET over and above other clinically relevant prognostic factors. Four of these studies reported a hazard ratio as the adjusted effect estimate, with values ranging from 2.4 to 36.89, indicating the independent prognostic value of interim PET2.¹⁰		‐	996 (4 studies)¹⁰	⊕⊕⊝⊝ LOW¹¹ ¹²	
*The survival in the PET‐positive group (and its 95% confidence interval) is based on the assumed survival in the PET‐negative group.
CI: Confidence interval; HR: Hazard ratio; PET: positron emission tomography
GRADE Working Group grades of evidence
High certainty: We are very confident that the true effect lies close to that of the estimate of the effect.
Moderate certainty: We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different.
Low certainty: Our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate of the effect.
Very low certainty: We have very little confidence in the effect estimate: The true effect is likely to be substantially different from the estimate of effect.
¹ The assumed event‐free survival in the control group is based on the survival rate of the interim PET‐negative participants at 3 years in the studies included (the lowest survival rate from Cerci 2010 and the highest survival rate from Kobe 2018).
² High risk of bias in seven studies for the domain 'other prognostic factors (covariates)', and high risk of bias in three studies for the domain 'statistical analysis and reporting'. Downgraded by 1 point for risk of bias.
³ For one study we used the reported hazard ratio. For seven studies we had to estimate the hazard ratio and for one study we re‐calculated it (Trivella 2006). Downgraded by 1 point for imprecision.
⁴ Upgraded by one point due to the large effect showing the large difference between interim PET‐negative and interim PET‐positive participants (HR 5.09, CI 2.64 to 9.81).
⁵ The assumed event‐free survival in the control group is based on the survival rate of the interim PET‐negative participants at 3 years in the studies included (the lowest survival rate from Rossi 2014 and the highest survival rate from Kobe 2018).
⁶ High risk of bias in eight studies for the domain 'other prognostic factors (covariates)', and high risk of bias in six studies for the domain 'statistical analysis and reporting'. Downgraded by 1 point for risk of bias.
⁷ The definition of PFS varied across studies. Downgraded by 1 point for inconsistency.
⁸ For three studies we used the reported hazard ratio. For ten studies we had to estimate the value, and for one study we had to re‐calculate it (Trivella 2006). Downgraded by 1 point for imprecision.
⁹ High risk of bias for the domains 'other prognostic factors (covariates)' and 'statistical analysis and reporting' for one study (Simon 2016). Downgraded by 1 point for risk of bias.
¹⁰ Hutchings 2006; Kobe 2018; Mesguich 2016; Simon 2016.
¹¹ High risk of bias for the domains 'other prognostic factors (covariates)' and 'statistical analysis and reporting' for one study (Simon 2016). Also high risk of bias for the domain 'study participation' in one study (Hutchings 2006). Downgraded by 1 point for risk of bias.
¹² Studies included a heterogeneous set of covariates in the adjusted analyses. Downgraded by 1 point for inconsistency.
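
The anticipated absolute effects in the table are consistent with the usual proportional‐hazards conversion, in which survival in the PET‐positive group equals the assumed survival in the PET‐negative group raised to the power of the hazard ratio. The following is a minimal sketch under that assumption; the table footnotes do not state the formula, and the function names are illustrative.

```python
def survivors_per_1000(baseline_survival, hazard_ratio):
    """Survivors per 1000 in the comparison group, assuming proportional
    hazards: S_positive = S_negative ** HR."""
    return round(1000 * baseline_survival ** hazard_ratio)


def absolute_effect(baseline_survival, hr, hr_low, hr_high):
    """Return (point estimate, lower limit, upper limit) per 1000.

    A larger hazard ratio means lower survival, so the upper HR limit
    gives the lower survival limit and vice versa.
    """
    return (
        survivors_per_1000(baseline_survival, hr),
        survivors_per_1000(baseline_survival, hr_high),
        survivors_per_1000(baseline_survival, hr_low),
    )


# Assumed 3-year survival in the PET-negative group and pooled HRs,
# taken from the summary of findings table.
rows = {
    "Overall survival, low": (0.90, 5.09, 2.64, 9.81),
    "Overall survival, high": (0.98, 5.09, 2.64, 9.81),
    "Progression-free survival, low": (0.85, 4.90, 3.47, 6.90),
    "Progression-free survival, high": (0.94, 4.90, 3.47, 6.90),
}
for label, args in rows.items():
    point, low, high = absolute_effect(*args)
    print(f"{label}: {point} per 1000 ({low} to {high})")
```

For example, with an assumed survival of 900 per 1000 and a hazard ratio of 5.09, the sketch gives 0.90^5.09 ≈ 0.585, i.e. 585 per 1000, matching the table.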

Background

Description of the condition

Hodgkin lymphoma (HL) is a cancer of the lymph nodes and the lymphoid system with possible involvement of other organs such as the liver, lung, bone or bone marrow (Lister 1989). With an annual incidence of approximately two to three per 100,000 inhabitants in Western countries, HL is a comparatively rare disease, but it is one of the most common malignancies in young adults (Howlader 2015). In industrialised countries, the age distribution of HL shows a first peak in the third decade and a second peak after the age of 50 (Thomas 2002).

The World Health Organization (WHO) Classification of Tumours of Haematopoietic and Lymphoid Tissues distinguishes between two types of HL: classical HL, representing about 95% of all HL; and lymphocyte‐predominant HL, representing about 5% of all HL (Swerdlow 2008). Both types differ in morphology, phenotype and molecular features, and therefore in clinical behaviour and presentation (Re 2005).

The Ann Arbor Classification is used for staging and distinguishes between four different tumour stages. Stages one to three indicate the degree of lymph node and localised extranodal organ involvement, or both, and stage four includes disseminated organ involvement, which can be found in 20% of cases. Factors associated with a poor prognosis include a large mediastinal mass, three or more involved lymph node areas, a high erythrocyte sedimentation rate, extranodal lesions, B symptoms (weight loss > 10%, fever, drenching night sweats) and advanced age, but the factors considered as significant vary slightly between different study groups (German Hodgkin Study Group (GHSG); European Organization for Research and Treatment of Cancer (EORTC); National Cancer Institute of Canada (NCIC)). The Cotswold modification of the Ann Arbor Classification also takes into consideration the occurrence of bulky disease (largest tumour diameter greater than 10 cm) (Lister 1989). Hodgkin lymphoma is classified into early favourable, early unfavourable and advanced stage (Engert 2007; Klimm 2005). In Europe, the early favourable‐stage group usually comprises Ann Arbor stages I and II without risk factors. The early unfavourable‐stage group includes individuals with Ann Arbor stages I or II and one or more risk factors. Most individuals with stages IIB, III or IV disease are included in the advanced‐stage risk group (Engert 2003).

With cure rates of up to 90%, HL is one of the most curable cancers worldwide (Engert 2010; Engert 2012; Rancea 2013a; von Tresckow 2012). A combination of adriamycin, bleomycin, vinblastine and dacarbazine (ABVD) is widely accepted as the standard chemotherapy regimen in early‐stage HL (Bröckelmann 2018; Canellos 1992; Engert 2010). Individuals in this stage usually receive a combination of chemotherapy and involved‐field radiation therapy (IF‐RT) (Engert 2010; von Tresckow 2012), whereas those with advanced‐stage disease receive an intensified regimen, such as BEACOPP (bleomycin, etoposide, doxorubicin, cyclophosphamide, vincristine, procarbazine and prednisone) (Skoetz 2017a; Borchmann 2011; Engert 2012; Skoetz 2013), or ABVD. A large randomised study showed that two cycles of ABVD followed by 20 Gy of IF‐RT are sufficient for the treatment of early‐favourable HL (Engert 2010), and this approach has been implemented into current standard treatment, whereas four cycles of chemotherapy followed by 30 Gy IF‐RT are more suitable for individuals with early‐unfavourable HL. Approximately 10% of people with HL will be refractory to initial treatment or will relapse; this is more common in people with advanced‐stage or bulky disease. These individuals can be treated with high‐dose chemotherapy and autologous stem cell transplantation (Rancea 2013). Immunotherapy for relapsed HL as another possible approach is under active investigation (Moskowitz 2018).

The current treatment approach for HL aims to maximise progression‐free survival and OS and to minimise acute and long‐term toxicities such as cardiac and pulmonary damage, infertility and secondary cancers. Development of a secondary cancer is one of the major causes of morbidity and mortality once the risk of progression and relapse of HL is over, i.e. from about five years after first‐line treatment onwards. In a large systematic review based on individual patient data in people with HL, Franklin and colleagues demonstrated that treatment de‐intensification by avoiding additional radiotherapy reduces the risk of a secondary cancer (Franklin 2005).

Description of the index (prognostic) factor

A prognostic factor is a characteristic of a patient or the disease (e.g. age, sex, co‐morbidities, disease stage, blood or imaging results) that is likely to predict patient outcomes or health events, often related to OS and disease‐free survival (Moons 2009; Riley 2013). Prognostic information ultimately provides a basis for the determination of treatment and also helps to stratify individuals for treatment according to their risk of future outcomes (Riley 2013). Established prognostic factors in HL include age, gender, B symptoms, Ann Arbor disease stage, bulky disease, albumin level, anaemia and white blood cell count, amongst others (Cuccaro 2014; Josting 2010; Kılıçkap 2013). Particularly male gender, advanced disease stage or age, and a low level of albumin, for example, are associated with worse prognosis and survival outcomes (Cuccaro 2014; Josting 2010).

The prognostic factor to be examined in this review is the tumour's metabolic activity, its stage, and progression as captured by [18F]‐fluorodeoxy‐D‐glucose (FDG)‐positron emission tomography (PET, also called PET scanning), which is an imaging tool. The principle of FDG‐PET is based on a radio‐labelled glucose analogue being a good indicator of the glucose metabolism of a tissue. It comprises two parts: a vector (2‐deoxy‐D‐glucose) taken up by cells with a high metabolic rate, and 18F, a positron‐emitting nuclide, which is detected by scintigraphy. FDG‐PET scanning provides the opportunity to identify the state and degree of progression of FDG‐avid tumours and has therefore become a standard imaging tool for various cancers (Boellaard 2010). Hodgkin lymphoma is an FDG‐avid tumour; in a study of 233 people with HL, 100% were FDG‐avid (Weigler‐Sagie 2010). However, as the field of imaging continuously evolves, it is now widely accepted to use PET in combination with computed tomography (CT), known as PET‐CT (Barrington 2014). PET‐CT is argued to provide clearer imaging and a more accurate measurement of nodal size (Cheson 2014). Nevertheless, in the studies included in this review, the use of PET or PET‐CT varied.

Over the last few decades FDG‐PET has been used more and more for staging, prognosis, treatment planning and response evaluation in individuals with HL, and is a widely accepted procedure (Barrington 2017a; Cheson 2014; Fitzgerald 2019; Kobe 2010a; Markova 2009; Meignan 2009; Radford 2015; Specht 2007). FDG‐PET is primarily used for the pretreatment assessment in order to determine the stage of the disease of an individual and thereby to decide on the appropriate treatment regimen (Cheson 2014; Meignan 2009). However, it is now argued that PET should also be conducted during first‐line chemotherapy in individuals with HL, namely interim PET after a few cycles of chemotherapy (Barrington 2017a; Bröckelmann 2018; Meignan 2009). The result of the interim PET scan (positive or negative) is believed to be a good predictor of outcome, helping to distinguish individuals with a poor prognosis from those with a better prognosis while they are undergoing early treatment (Gallamini 2007; Kobe 2010; Markova 2012). Therapy adaptation based on interim PET results was introduced after detailed exploration of the FDG‐PET procedure (Engert 2012; Kobe 2008a), the idea being to achieve maximum efficacy in terms of OS and progression‐free survival (PFS). We will refer to the prognostic factor henceforth as 'interim PET'.

Why it is important to do this review

There is a need to systematically explore the prognostic ability of the factor (interim PET) in conditions where there is no treatment adaptation. The 'no treatment adaptation' clause is important in the prognostic exploration, because adapting treatment based on interim PET results in daily practice is not desirable while the prognostic ability of interim PET is not yet proven. There is one systematic review on the prognostic value of interim PET without treatment adaptation in individuals with HL (Adams 2015a). However, this review looked at 'treatment failure' as an outcome of the interim PET scan, which is different to the outcomes the current review explored. Moreover, although it is titled as a review of prognosis studies, the methodology used is akin to diagnostic test evaluation (with calculations of diagnostic odds ratio, specificity and sensitivity) rather than established prognostic methodology. Crucially, the confidence in the calculated estimates was not rated. Finally, the review included only studies published before December 2014 and, therefore, important research published since that time is not included.
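
To illustrate the distinction drawn above, the diagnostic‐accuracy measures mentioned (sensitivity, specificity and diagnostic odds ratio) are calculated from a single two‐by‐two table at one time point and do not use time‐to‐event information, unlike a hazard ratio. The following is a minimal sketch with hypothetical counts; the function name and the numbers are illustrative only and are not taken from Adams 2015a or from any included study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Diagnostic test measures from a 2x2 table.

    tp: PET-positive with treatment failure
    fp: PET-positive without treatment failure
    fn: PET-negative with treatment failure
    tn: PET-negative without treatment failure
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Odds of a positive scan in those with failure divided by the
    # odds of a positive scan in those without failure.
    diagnostic_odds_ratio = (tp * tn) / (fp * fn)
    return sensitivity, specificity, diagnostic_odds_ratio


# Hypothetical counts for 200 participants.
sens, spec, dor = diagnostic_accuracy(tp=30, fp=10, fn=20, tn=140)
print(f"Sensitivity {sens:.2f}, specificity {spec:.2f}, DOR {dor:.1f}")
```

These measures classify participants by whether an event occurred, whereas the prognostic approach used in this review compares the rate at which events occur over follow‐up in the two groups.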

One Cochrane Review on the role of PET‐adapted treatment modification for people with HL found some evidence that PFS was shorter in people with early‐stage HL and a negative PET scan receiving only chemotherapy (PET‐adapted therapy) compared to those receiving radiotherapy in addition to chemotherapy (which is the standard therapy regimen) (Sickinger 2015). A similar result was found in another Cochrane Review (Blank 2017). The authors compared the effects of chemotherapy alone versus chemotherapy plus radiotherapy on outcome and safety for adults with early‐stage HL. They found moderate evidence that when individuals receive the same number of chemotherapy cycles, the addition of radiotherapy can improve PFS. However, neither review was able to draw definite conclusions on the effect on OS. Another systematic review suggests changing therapy after interim PET only in individuals with advanced‐stage disease (Amitai 2018). In the current German guideline for the treatment of HL, for example, it is recommended that patients with advanced HL receive an interim PET scan after two cycles of chemotherapy. The result of the interim PET scan can then be used to guide further treatment for patients in advanced stages of HL (Bröckelmann 2018). Hence, the disease stage is an additional key prognostic factor for patients with HL. Several randomised controlled trials (RCTs) have recently been published that investigated the consequences of treatment adaptation based on interim PET scan results on outcome and safety for individuals with HL (Andre 2017; Casasnovas 2019; Kobe 2018; Johnson 2016; Radford 2015).

Hence, the prognostic role of interim PET in individuals with HL undergoing first‐line chemotherapy is very important and will strongly influence decision‐making particularly regarding the choice of subsequent treatments. Therefore, we have summarised all available data from identified studies and included these in a meta‐analysis when they were sufficiently homogeneous. Our aim was to produce robust evidence based on the improved power that a meta‐analysis provides over the limitations of individual primary studies, and grade the evidence. A reliable answer to the question of the prognostic value of interim PET scan to predict survival outcomes in individuals with HL will strongly influence decision‐making at a crucial point of an individual’s treatment pathway. Moreover, grading the evidence on the prognostic value of interim PET will provide readers with an estimate of how much they can rely on the calculated results.

The aim of this systematic review was to determine whether in previously untreated adults with HL receiving first‐line therapy, interim PET scan results can distinguish between those with a poor prognosis and those with a better prognosis, and whether it can predict survival outcomes in each group. Thereby, we assessed the prognostic value of interim PET scan results. Meta‐analyses and grading of the evidence allow a conclusion of whether interim PET is a prognostic factor. This comprehensive overview will have a great impact on international guidelines and clinical pathways, and will contribute to a high‐grade support in clinical decision‐making for effective, supportive strategies for the individual patient.

Objectives

To determine whether in previously untreated adults with Hodgkin lymphoma (HL) receiving first‐line therapy, interim positron emission tomography (PET) scan results can distinguish between those with a poor prognosis and those with a better prognosis, and thereby predict survival outcomes in each group.

Primary objective

To identify all studies evaluating interim PET scan results as a prognostic factor, describe the characteristics and risk of bias of included studies and meta‐analyse results on the association between PET scan results and overall survival (OS), progression‐free survival (PFS) and PET‐associated adverse events.

PICOTS

We used the PICOTS (population, index, comparator, outcome(s), timing, setting) system to describe the key items for framing this review and its objective and methodology (Table 1) (Debray 2017; Riley 2019).

Table 1. PICOTS system

Population
- People with classic HL, at any stage of the disease
- Newly diagnosed individuals undergoing first‐line therapy
- Adults, as defined in the studies
Index (prognostic) factor
- Interim PET scan results
Comparator
- Not applicable to this review
Outcome(s)
- Overall survival (OS)
- Progression‐free survival (PFS)
- PET‐associated adverse events (AEs)
Timing
- The outcome should be measured after a minimum follow‐up of 12 months.
- Interim PET scan should be conducted during chemotherapy (after one, two, three or four cycles of chemotherapy)
Setting
- Hospital/treatment centre

Methods

This is a systematic review of prognostic factor studies.

Criteria for considering studies for this review

Types of studies

We included retrospective and prospective studies evaluating interim PET scan results in a minimum of 10 individuals with Hodgkin lymphoma (HL) undergoing first‐line therapy.

We excluded studies that modified the treatment regimen based on the interim PET scan results in order to draw an unbiased conclusion of the ability of interim PET to predict the outcomes under study.

Participants

We included studies on adults with newly diagnosed classic HL receiving first‐line therapy. If a percentage of the participants included in a study were adolescents who received an adult treatment regimen and dosage, and the study considered them as adults, then we also accepted this 'adult' definition.

All participants received an interim PET scan during chemotherapy (e.g. after one, two, three and/or four cycles of chemotherapy), and continued with the planned chemotherapy regimen, without treatment adaptation due to the interim PET scan result.

Index (prognostic) factor

We included studies that assessed interim PET scan results as the index (prognostic) factor to predict survival outcomes. We expected the interim PET scan to be conducted during first‐line treatment of adults with HL, and without interim PET‐guided treatment adaptation, meaning participants should be treated in the same way regardless of the interim PET scan result. We accepted all studies that conducted a PET or PET‐CT (see Background 'Description of index (prognostic) factor').

In the literature, it is generally recommended to use a five‐point scale to assess the grade of uptake and report the PET scan result (Meignan 2009). Generally, scores 1‐3 indicate PET‐negativity, while scores 4‐5 indicate PET‐positivity (Barrington 2014). Most of the included studies used a validated scale, such as the 5‐PS Deauville criteria (Meignan 2009), the Lugano classification (Cheson 2014), the Imaging Subcommittee of International Harmonization Project in Lymphoma criteria (Juweid 2007) or the joint Italian‐Danish study criteria (Gallamini 2007).
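The dichotomisation described above can be illustrated with a short Python sketch. It assumes only the general threshold stated in the text (scores 1 to 3 negative, scores 4 to 5 positive); the function name `classify_interim_pet` is hypothetical, and individual studies may have applied other criteria or cut‐offs.

```python
def classify_interim_pet(score: int) -> str:
    """Map a five-point-scale score to a dichotomous interim PET result.

    Assumption for illustration: scores 1-3 are PET-negative and
    scores 4-5 are PET-positive, as stated in the text.
    """
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("five-point scale scores must be integers from 1 to 5")
    return "negative" if score <= 3 else "positive"
```

For example, a score of 3 would be classified as negative and a score of 4 as positive under this threshold.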

Type of outcome measures

Primary outcome

Overall survival (OS), defined as the time to death due to any cause.

We chose OS as our primary outcome because it has the greatest clinical relevance and is most important for individuals with HL. Furthermore, death due to any cause is an objective endpoint not susceptible to bias by the outcome assessor.

Secondary outcomes

Progression‐free survival (PFS), defined as the time to disease progression, relapse, death due to any cause or last follow‐up.
Adverse events (AEs), defined as any event associated with the index factor (e.g. radiation safety).

To report meaningful findings, the required minimum follow‐up period was 12 months for each outcome.

Search methods for identification of studies

Electronic searches

Reporting, and therefore retrieval, of prognostic factor studies is very poor, as evaluations of guidelines on reporting of prognostic markers in cancer have shown (Altman 2012; Mallett 2010; McShane 2005). Moreover, no specific search filter exists for this new methodological approach, so published filters have to be combined for a sensitive search strategy (Geersing 2012). However, as PET scans are often not reported as a prognostic factor, we did not combine our search strategy with a filter for prognosis research. As a result, the search strategy was not very specific, and two teams of two review authors screened the results independently and in detail. Furthermore, we did not apply a language restriction, in order to reduce language bias, in line with Chapter 6 of the Cochrane Handbook for Systematic Reviews of Interventions (Lefebvre 2011).

We searched the following databases.

Databases of medical literature
- Cochrane Central Register of Controlled Trials (CENTRAL; 2 April 2019, Issue 11) (Appendix 1)
- MEDLINE Ovid SP (1946 until 2 April 2019) (Appendix 2)
- Embase (1990 until 2 April 2019) (Appendix 3)
Conference proceedings of annual meetings of the following societies for abstracts (2000 to 2019)

- American Society of Hematology
- European Hematology Association
- International Symposium on Hodgkin Lymphoma

We searched ClinicalTrials.gov (on 25 January 2019 using the query PET and Hodgkin lymphoma) to identify clinical trials.

Searching other resources

Handsearching of references
- We searched the references of all identified studies, relevant review articles and current treatment guidelines for further literature to find other relevant studies and to identify associated articles.
Personal contacts
- We contacted 10 principal investigators of included studies for further information, of whom six replied and answered our questions for clarification. Two of these six also provided us with relevant data to conduct our analyses.

Data collection and analysis

Selection of studies

Two teams of two review authors (AA, LE, MHT, NS) independently screened the results of the search strategies to identify eligible studies by reading the titles and abstracts in Covidence (Covidence). In case of disagreements, consensus between the two review authors was reached by discussion of the full‐text publication. When consensus could not be reached, a third review author was consulted for final decision (Higgins 2011).

We documented the study selection process in a flow chart as recommended in the Preferred Reporting Items for Systematic Reviews and Meta‐Analyses (PRISMA) statement (Moher 2009), showing the total numbers of retrieved references and the numbers of included and excluded studies (Figure 1).

Figure 1. Study flow diagram according to PRISMA

Data extraction and management

We developed a data extraction form specific to studies of prognostic factors based on the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) (Moons 2014). The form was piloted using four of the included studies, and then further assessed during several teleconferences between the review authors to discuss required changes. After several amendments of the form, two teams of two review authors (AA, LE, MHT, NS) independently extracted all relevant data from the included studies. After data extraction, we contacted 10 principal investigators of included studies to request additional information.

Our form included the following items (in short).

General information
- Author, title, source, publication date, country, language, duplicate publications
Source of data
- Cohort, prospective planned study, randomised study participants, or registry data
Participants
- Participant eligibility and recruitment method (e.g. consecutive participants, location, number of centres, setting, inclusion and exclusion criteria)
- Participant description (e.g. age, gender, stage of disease)
- Details of treatments received
- Study dates
Prognostic factor
- Definition and method for measurement of prognostic factor
- Timing of prognostic factor measurement (number of chemotherapy cycles before and after measurement of the prognostic factor)
Outcomes to be predicted
- Definition and method for measurement of outcome
- Was the same outcome definition (and method for measurement) used in all individuals?
- Was the outcome assessed without knowledge of the prognostic factor (i.e. blinded)?
- Time of outcome occurrence or summary of duration of follow‐up
Sample size
- Number of participants and number of outcomes/events
Missing data
- Number of participants with any missing value (include predictors and outcomes)
- Handling of missing data (e.g. complete‐case analysis, imputation, or other methods)
Reported results
- Overall survival (OS) (including duration of follow‐up)
- Progression‐free survival (PFS) (including duration of follow‐up)
- Adverse events (AEs) (including duration of follow‐up)

Risk of bias

In the protocol for this review we prespecified that we would use the Quality in Prognostic Studies (QUIPS) tool (Hayden 2013) for the risk of bias assessment. However, recent methodological developments for the systematic review of prognostic factor studies (Riley 2019; Riley 2019b) led us to consider amending this tool. In light of this, we consulted the primary author of the QUIPS tool (Hayden 2013) and, following discussions, decided to add a fourth 'unclear' option to the three bias ratings ('low', 'moderate' and 'high' risk of bias). This was necessary because reporting in the included studies was inconsistent and information was sometimes clearly missing; without an 'unclear' category, risk of bias assessment would not have been feasible.

Following further discussions, we additionally decided to rename the fifth domain 'study confounding' to 'other prognostic factors (covariates)' in order to highlight the important distinction between confounding (the preferred term when seeking estimates of causal effect of a specific etiologic factor) and adjusting for other important prognostic factors, namely covariates (advocated when seeking the independent prognostic ability of index prognostic factors). As noted above, in the context of our review (adults with Hodgkin lymphoma), the disease stage is a key factor that is taken into account together with the interim PET scan result when decisions about treatment adaptation are made in daily clinical practice (Bröckelmann 2018). Hence, we assessed studies that only included participants within one disease stage (e.g. only early stages or only advanced stages of HL) as 'low' risk of bias, as such patient sampling can be considered as accounting for disease stage as another prognostic factor. Studies that included participants of all disease stages, but offered adjusted results including disease stage as another prognostic factor, were also assessed as 'low' risk of bias. Studies with participants of all disease stages that did not account for disease stage were assessed as 'high' risk of bias in this domain. This latter modification is also reflected in the GRADE assessment. Regardless of whether meta‐analysis of adjusted or unadjusted (crude) effects of the prognostic factor of interest (interim PET scan results) was possible, we included this domain's risk of bias assessment in our GRADE judgement.
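The decision rule for the 'other prognostic factors (covariates)' domain can be written down as a small Python sketch. It encodes only the three situations named in the text; the function and parameter names are hypothetical, and real assessments also drew on the full list of QUIPS criteria and could result in 'moderate' or 'unclear' ratings that this sketch does not model.

```python
def rate_covariate_domain(single_stage_only: bool, adjusted_for_stage: bool) -> str:
    """Illustrative rating for the 'other prognostic factors' domain.

    Assumption: only disease stage is considered, following the three
    cases described in the text. Names are hypothetical.
    """
    # Restricting the sample to one disease stage accounts for stage by design.
    if single_stage_only:
        return "low"
    # All stages included, but results adjusted for disease stage.
    if adjusted_for_stage:
        return "low"
    # All stages included and disease stage not accounted for.
    return "high"
```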

Two teams of two review authors (AA, LE, MHT, NS) independently assessed the risk of bias of the included studies according to the domains of the QUIPS tool. We judged each domain by taking into account the criteria listed for each domain in the QUIPS tool (Hayden 2013), and also provided a brief statement supporting our judgement.

We made the following judgements.

Low risk of bias: the relationship between the prognostic factor and outcome is unlikely to be different for participants and eligible non‐participants.
Moderate risk of bias: the relationship between the prognostic factor and outcome may be different for participants and eligible non‐participants.
High risk of bias: the relationship between the prognostic factor and outcome is very likely to be different for participants and eligible non‐participants.
Unclear risk of bias: the study does not provide sufficient information that allows a clear judgement for this domain.

Furthermore, we decided to assess the risk of bias per outcome in each study because not all studies reported all of our outcomes of interest, and even studies reporting at least two of our outcomes showed differences in their outcome reporting.

We judged the following domains and criteria.

Study participation
- Adequate participation in the study by eligible persons
- Description of the source population or population of interest
- Description of the baseline study sample
- Adequate description of the sampling frame and recruitment
- Adequate description of the period and place of recruitment
- Adequate description of inclusion and exclusion criteria
Study attrition
- Adequate response rate for study participants
- Description of attempts to collect information on participants who dropped out
- Reasons for loss to follow‐up are provided
- Adequate description of participants lost to follow‐up
- There are no important differences between participants who completed the study and those who did not
Prognostic factor measurement
- A clear definition or description of the prognostic factor is provided
- Method of prognostic factor measurement is adequately valid and reliable
- Continuous variables are reported or appropriate cut points are used
- The method and setting of measurement of prognostic factor is the same for all study participants
- Adequate proportion of the study sample has complete data for the prognostic factor
- Appropriate methods of imputation are used for missing prognostic factor data
Outcome measurement
- A clear definition of the outcome is provided
- Method of outcome measurement used is adequately valid and reliable
- The method and setting of outcome measurement is the same for all study participants
Other prognostic factors (covariates)
- Other prognostic factors (covariates) are measured
- Clear definitions of the important prognostic factors (covariates) measured are provided
- Measurement of all important prognostic factors (covariates) is adequately valid and reliable
- The method and setting of prognostic factor measurement are the same for all study participants
- Appropriate methods are used if imputation is used for missing data
- Important potential prognostic factors (covariates) are accounted for in the study design
- Important potential prognostic factors (covariates) are accounted for in the analysis
Statistical analysis and reporting
- Sufficient presentation of data to assess the adequacy of the analytic strategy
- Strategy for model building is appropriate and is based on a conceptual framework or model
- The selected statistical model is adequate for the design of the study
- There is no selective reporting of results

Reporting deficiencies

Methods and reporting in prognostic research often do not follow current methodological recommendations, limiting retrieval, reliability and applicability of these publications (Bouwmeester 2012; Peat 2014). There is evidence suggesting that prognosis research in cancer is cluttered with false‐positive studies, which would not have been published if the results were negative (Kyzas 2005; Kyzas 2007; Sauerbrei 2005). Moreover, studies evaluating prognostic factors are usually not prospectively registered and no protocol is published (Peat 2014; Riley 2013), which makes it difficult to identify all studies and to assess potential risks of publication bias. We used sensitive search filters for the disease (HL) and the prognostic factor (interim PET scan results) without any specific filter for research on prognosis in order to increase retrieval.

Because the effects (hazard ratios, HRs) were expected to be large, tests for funnel plot asymmetry could incorrectly indicate publication bias (Macaskill 2010). Therefore, we decided not to evaluate the risk of publication bias by funnel plot asymmetry and to describe reporting deficiencies instead.

Data synthesis

We performed analyses according to the recommendations of Cochrane, and the Cochrane Prognosis Methods Group in particular, and used the Cochrane statistical package Review Manager 5 (Deeks 2011; Review Manager 2014). We are aware that since the protocol development, the methodology on assessing studies of prognosis has evolved; hence, some differences between the published protocol and this full review may exist to account for the updated guidance. We have listed these in Differences between protocol and review.

We pooled unadjusted (crude) HRs for OS and PFS in meta‐analyses using the generic inverse variance method with a random‐effects model in RevMan. Due to reporting deficiencies and the expected heterogeneity between studies, we only combined studies that were sufficiently similar (e.g. most studies used ABVD as the main therapy regimen, or most studies conducted interim PET after two cycles of chemotherapy). Studies did not always provide an HR and associated standard error (SE), which are the parameters needed for meta‐analysis. Where these values were not available, we estimated them from other available data where possible using an in‐house calculator based on published methods for recovering survival data (Altman 1999; Parmar 1998; Tierney 2007). Recovered data included information and results reported in the text, tables, and Kaplan‐Meier (K‐M) curves. We also contacted 10 principal investigators of included studies to either ask for additional data, or to clarify issues regarding the studies.
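To make the pooling step concrete, the following Python sketch shows generic inverse variance pooling of HRs on the log scale under a random‐effects model. It assumes the DerSimonian‐Laird estimator of between‐study variance and a normal‐approximation 95% confidence interval; the function name and the input values in the usage note are illustrative and are not data from this review.

```python
import math

def pool_random_effects(hrs, ses):
    """Pool hazard ratios with inverse-variance random-effects weights.

    hrs: study hazard ratios; ses: standard errors of the log hazard ratios.
    Assumption: DerSimonian-Laird estimate of between-study variance (tau2).
    """
    y = [math.log(hr) for hr in hrs]          # work on the log scale
    w = [1.0 / se ** 2 for se in ses]         # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return {
        "hr": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * se_pooled),
        "ci_high": math.exp(pooled + 1.96 * se_pooled),
        "tau2": tau2,
    }
```

Calling `pool_random_effects([2.0, 8.0], [0.5, 0.5])`, for instance, pools two hypothetical studies of equal precision; because the weights are equal, the pooled HR is the geometric mean of the two inputs.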

As prespecified in the protocol, we would have also pooled adjusted HRs of the interim PET scan result (the index prognostic factor) from multivariable analyses of the included studies, as adjusted prognostic effects (e.g. HRs) indicate the independent prognostic value of the prognostic factor over and above other clinically relevant prognostic factors (Riley 2019). However, pooling of adjusted estimates is recommended only if (largely) the same prognostic factors (covariates) are adjusted for in the multivariable analyses (Riley 2019; Riley 2019b). As noted above, clinically relevant prognostic factors in individuals with HL particularly include the disease stage, as well as age, gender, and B symptoms (Cuccaro 2014). Regardless of whether pooling of adjusted or unadjusted effects of interim PET scan results was possible, we always assessed the risk of bias for all studies using the QUIPS tool, including the fifth domain 'other prognostic factors (covariates)', where we considered the disease stage as an important covariate to be taken into account.

Detailed description of the estimation of hazard ratios (HRs) and standard errors (SEs)

We used unadjusted HRs as the effect measure for OS and PFS. In cases where the HR and SE were not reported, we estimated them from available data using an in‐house calculator (Trivella 2006), based on methods reported by Tierney 2007, Altman 1999 and Parmar 1998, or contacted authors to request additional data (Higgins 2011b). Recovered data included sample size, number of events, results such as the logrank P‐value and confidence intervals (CIs), which were reported in the text, tables, and K‐M curves. We kept detailed records of how the HR and SEs were calculated for each outcome in each included study. We identified the following six categories of HR precision.

1. HR was provided in the study, and the SE was either provided or easily estimated from reported CIs, and/or using the RevMan inbuilt calculator.
2. HR was provided but, on checking while attempting to obtain the SE, there were errors and/or discrepancies with related provided data and we re‐estimated the HR.
3. HR and SE were not provided but all necessary data for their estimation were available in the study.
4. HR and SE were not provided. Other necessary data were available but not an exact logrank P value, hence the nearest value was used in the estimation. For example, if they reported P < 0.001, then the nearest exact value was used, in this case P = 0.0009.
5. HR and SE were not provided. Other necessary data were available but the number of events was estimated from the K‐M curves.
6. Individual participant data (IPD) were available and HR and SE were accurately calculated.

We are aware that categories four and five are likely to over‐ or under‐estimate the HR and associated SE. However, they were the best estimates we could obtain. We consider the remaining categories as precise. We explored the precision of the estimates in a post‐hoc sensitivity analysis where the imprecise studies were temporarily removed to examine the robustness of the pooled result.
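Two of the recovery routes described above can be sketched in Python. This is not the in‐house calculator, whose implementation is not described here; it is a minimal illustration of formulas from the published methods (Parmar 1998; Tierney 2007), namely recovering the SE of the log HR from a reported confidence interval, and approximating the HR and SE from a two‐sided logrank P value, the total number of events and the group sizes. The function names, parameter names and the `positive_worse` direction flag are hypothetical.

```python
import math
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """SE of the log HR from a reported confidence interval for the HR."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z)

def hr_from_logrank_p(p_two_sided, total_events, n_positive, n_negative,
                      positive_worse=True):
    """Approximate HR and SE(log HR) from a logrank P value and event count.

    Assumption: the logrank variance V is approximated from the total number
    of events and the sizes of the PET-positive and PET-negative groups.
    The direction of effect must be supplied, as a P value carries no sign.
    """
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    v = total_events * n_positive * n_negative / (n_positive + n_negative) ** 2
    o_minus_e = z * math.sqrt(v)
    if not positive_worse:
        o_minus_e = -o_minus_e
    log_hr = o_minus_e / v
    return math.exp(log_hr), 1 / math.sqrt(v)
```

In category four, a reported threshold such as P < 0.001 would enter the second function as the nearest exact value (P = 0.0009), which is one reason such estimates are treated as imprecise.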

Grading the evidence

According to the recommendations of the GRADE working group, we rated and described the confidence in estimates for each outcome by assessing potential risk of bias, inconsistency, imprecision, indirectness and publication bias. We applied an approach that has been proposed for prognosis studies by the GRADE working group, suggesting that the starting point is one of high certainty of the evidence for observational studies (Iorio 2015).

Dealing with missing data

We dealt with missing data as suggested in Chapter 16 of the Cochrane Handbook for Systematic Reviews of Interventions (Higgins 2011b). We contacted 10 principal investigators of included studies to answer our questions regarding the studies and/or to provide us with additional data. Six principal investigators replied and answered our questions, of whom two also provided us with additional data necessary to perform our analyses. One investigator kindly provided us with individual participant data for the whole data set. In some studies, the description of the methodology was rather unclear or relevant information was missing. In addition, some studies did not fully report their statistical analyses and data were missing, which complicated a full assessment of the study. We performed sensitivity analysis to assess how sensitive the results were to reasonable changes in the assumptions that were made, and addressed the potential impact of missing data on the findings of this review in the Discussion.

Furthermore, we noticed that most studies applied exclusion criteria on the baseline population (such as unavailability of interim PET or descriptive information) without providing a description of the size of this population and/or reasons for missing information. We treated this as a potential source of selection bias in the domain study participation of the QUIPS tool.

Investigation of heterogeneity

We investigated and discussed clinical and statistical heterogeneity and design aspects of included studies as mentioned in the section 'Data extraction and management'. We assessed between‐study heterogeneity using the I² statistic (an I² greater than 50% = moderate heterogeneity; an I² greater than 80% = considerable heterogeneity) (Deeks 2011). As most studies of prognosis are observational in nature, we are aware that they are prone to higher and/or inflated heterogeneity. Hence, we also assessed the Tau² values from the meta‐analyses to be able to make a more robust judgement on the degree of statistical heterogeneity.
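As an illustration of the I² statistic, the following Python sketch computes Cochran's Q from study HRs and their standard errors and derives I² as the percentage of variability beyond that expected by chance. The thresholds in `describe_heterogeneity` are those stated in the text; the function names and the label "not flagged" for values of 50% or below are assumptions made for this example.

```python
import math

def i_squared(hrs, ses):
    """I² (in percent) from hazard ratios and SEs of the log hazard ratios."""
    y = [math.log(hr) for hr in hrs]
    w = [1.0 / se ** 2 for se in ses]          # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def describe_heterogeneity(i2):
    """Apply the thresholds given in the text (>50% and >80%)."""
    if i2 > 80:
        return "considerable"
    if i2 > 50:
        return "moderate"
    return "not flagged"  # hypothetical label; the text names no category here
```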

As specified in the protocol, we explored potential causes of heterogeneity by subgroup analysis. We considered the following parameters.

Study design (e.g. prospective versus retrospective)
Disease stage (e.g. early versus advanced stages)
Type of chemotherapy (e.g. ABVD versus BEACOPP)
Type of radiotherapy (e.g. involved field versus involved site)
Type of PET measurement (e.g. PET versus PET‐CT) (post‐hoc)

In addition, we conducted a post hoc sensitivity analysis for the timing of the interim PET, as well as the availability/estimation of HR and SE to explore the robustness of the pooled results.

Results

Results of the search

Our literature search in CENTRAL, MEDLINE and Embase (until 2 April 2019, see Appendix 1, Appendix 2 and Appendix 3, respectively) and one trial registry (ClinicalTrials.gov on 25 January 2019), identified 11,277 potentially relevant publications. After removal of 358 duplicates, we screened titles and abstracts of 10,919 references using inclusion and exclusion criteria defined at the protocol stage. These criteria led to the exclusion of 10,651 references, and 268 references were then included for full‐text screening. Before starting full‐text screening, we discussed and determined exclusion reasons. Full‐text screening led to the exclusion of 133 references. Thirty‐four references that were identified are still awaiting assessment (see Studies awaiting classification), and one study is still ongoing (see Ongoing studies). Hence, we finally included 23 studies (from 99 references) in this review. The overall number of publications screened, identified, selected and included in this review is shown in Figure 1.

Description of studies

Included studies

We included 23 studies in this review (Andre 2017; Annunziata 2016; Barnes 2011; Casasnovas 2019; Cerci 2010; Gallamini 2014; Gandikota 2015; Hutchings 2005; Hutchings 2006; Hutchings 2014; Kobe 2018; Markova 2012; Mesguich 2016; Oki 2014; Okosun 2012; Orlacchio 2012; Rossi 2014; Simon 2016; Straus 2011; Touati 2014; Ying 2014; Zaucha 2017; Zinzani 2012), which added up to a total of 99 references when secondary citations were included. To avoid duplication and overlapping of participant data in our analyses, we grouped those publications that assessed the same population (or groups from the same population). In such cases, we chose the publication with the greatest number of participants and/or most information as the primary publication. Duplicate or overlapping study populations were found for eight studies (Andre 2017; Barnes 2011; Gallamini 2014; Kobe 2018; Markova 2012; Simon 2016; Straus 2011; Zinzani 2012). Four studies did not report the duration of follow‐up (Andre 2017; Annunziata 2016; Orlacchio 2012; Straus 2011). The earliest study recruited participants between 1993 and 2004 (Hutchings 2005), and the most recent between 2007 and 2014 (Annunziata 2016).

There was considerable heterogeneity between the included studies, particularly with regard to: stages of disease; treatment regimens; and the timing and criteria for evaluation of the interim PET scans, which are described in detail in the sections below. For meta‐analyses, we only grouped studies that were sufficiently homogeneous to ensure comparability, and conducted subgroup analyses to explore the potential impact of heterogeneity on our results (see Methods 'Investigation of heterogeneity').

Study design

Of the 23 included studies, seven studies were retrospective single‐centre studies (Annunziata 2016; Markova 2012; Oki 2014; Orlacchio 2012; Rossi 2014; Touati 2014; Ying 2014). Five studies were retrospective multi‐centre studies (ranging from two to 17 centres) (Barnes 2011; Gallamini 2014; Mesguich 2016; Okosun 2012; Zinzani 2012). Two retrospective studies did not report the number of centres from which participants were recruited (Gandikota 2015; Simon 2016). Out of eight studies with a prospective study design, one study was a single‐centre study (Cerci 2010), three were multi‐centre studies (including between four and 11 centres, with Hutchings 2014 not reporting the number of study centres) (Hutchings 2006; Hutchings 2014; Zaucha 2017), and four were clinical trials (Andre 2017; Casasnovas 2019; Kobe 2018; Straus 2011). One study did not report the study design (Hutchings 2005).

For more details see Characteristics of included studies.

Sample size

The smallest study included 23 participants (Okosun 2012) and the largest study included 1945 participants (Kobe 2018).

Location

The included studies were conducted in a variety of countries, including Austria, Belgium, Brazil, Croatia, Czech Republic, Denmark, France, Germany, Hungary, Italy, the Netherlands, Poland, Slovakia, Switzerland, the United Kingdom (UK), the United States of America (USA), and the People's Republic of China. Four studies reported the country but not the study centre (Annunziata 2016; Hutchings 2014; Markova 2012; Simon 2016), and two studies reported neither country nor study centre (Gandikota 2015; Straus 2011).

Participants

This review included a total of 7335 male and female consecutive participants who were newly diagnosed with classic HL and received first‐line therapy. Out of these, a total of 2205 participants were included in meta‐analyses.

Follow‐up

There were differences in the follow‐up time between studies. Three studies did not report follow‐up time (Annunziata 2016; Orlacchio 2012; Straus 2011). Two studies reported follow‐up time per subgroup, i.e. surviving participants only (Kobe 2018; Zaucha 2017). The median follow‐up time for the remaining 18 studies ranged from 23 to 66 months. The total raw range of follow‐up time was from two to 195 months.

Stages of disease

Fifteen studies included all stages of the disease. Four studies included only early stages (Andre 2017; Barnes 2011; Gandikota 2015; Straus 2011) and four studies only advanced stages (Casasnovas 2019; Kobe 2018; Markova 2012; Okosun 2012).

Treatment/therapy

The following chemotherapy regimens were administered.

ABVD (adriamycin/doxorubicin, bleomycin, vinblastine and dacarbazine) in 16 studies (Andre 2017; Annunziata 2016; Barnes 2011; Cerci 2010; Gallamini 2014; Hutchings 2005; Hutchings 2006; Hutchings 2014; Mesguich 2016; Oki 2014; Okosun 2012; Orlacchio 2012; Simon 2016; Touati 2014; Zaucha 2017; Zinzani 2012).
Either ABVD or BEACOPP in one study (Ying 2014).
BEACOPP_escalated (bleomycin, etoposide, doxorubicin, cyclophosphamide, vincristine, procarbazine and prednisone in escalated doses) in one trial (Casasnovas 2019).
BEACOPP_escalated or BEACOPP_escalated with rituximab in one trial (Kobe 2018).
BEACOPP_escalated or time‐condensed BEACOPP14_baseline (BEACOPP in standard, non‐escalated doses repeated on day 15) in one study (Markova 2012).
AVG (doxorubicin, vinblastine and gemcitabine) in one trial (Straus 2011).
ABV/MOPP (adriamycin, bleomycin, vinblastine, mechlorethamine, vincristine, procarbazine and prednisone), ABVD/COPP (ABVD plus cyclophosphamide, vincristine, procarbazine and prednisone), eBEACOPP, or PVAG (prednisone, vinblastine, doxorubicin and gemcitabine) in subgroups of participants in three studies (Hutchings 2005; Hutchings 2006; Touati 2014).
Anthracycline‐based chemotherapy not further specified in one study (Rossi 2014).

The following numbers of chemotherapy cycles were administered.

Two, three, four, six or eight cycles of chemotherapy alone or combined with radiotherapy in 15 studies (Andre 2017; Annunziata 2016; Barnes 2011; Casasnovas 2019; Cerci 2010; Gallamini 2014; Hutchings 2014; Markova 2012; Mesguich 2016; Orlacchio 2012; Rossi 2014; Simon 2016; Straus 2011; Zaucha 2017; Zinzani 2012). The number of cycles usually depended on the stage of the disease.
Four, six or eight cycles of chemotherapy, depending on the interim PET scan results, in one trial (Kobe 2018). A protocol amendment during the trial introduced a reduction of standard therapy from eight to six cycles.
Six cycles of chemotherapy combined with antiretroviral therapy due to HIV‐positive study population in one study (Okosun 2012).

Six studies did not report the number of cycles (Gandikota 2015; Hutchings 2005; Hutchings 2006; Oki 2014; Touati 2014; Ying 2014).

The following radiotherapy techniques were used either in all or a subgroup of participants.

Involved‐field radiotherapy in eight studies (Barnes 2011; Gallamini 2014; Hutchings 2005; Hutchings 2006; Hutchings 2014; Mesguich 2016; Rossi 2014; Simon 2016), and either involved‐field radiotherapy or extended‐field radiotherapy in one study (Gandikota 2015).
Involved‐node radiotherapy in three studies (Andre 2017; Annunziata 2016; Zaucha 2017).
Involved‐site radiotherapy in two studies (Touati 2014; Zinzani 2012).
Radiotherapy without further specification in five studies (Cerci 2010; Kobe 2018; Markova 2012; Orlacchio 2012; Ying 2014).
No radiotherapy in three studies (Oki 2014; Okosun 2012; Straus 2011).

Stem cell transplantation was conducted in participants who relapsed after first‐line therapy despite treatment escalation or salvage therapy.

Autologous stem cell transplantation in eight studies (Cerci 2010; Gallamini 2014; Hutchings 2014; Mesguich 2016; Touati 2014; Ying 2014; Zaucha 2017).
Autologous and/or allogeneic stem cell transplantation in one study (Zinzani 2012).
Type of stem cell transplantation not specified in four studies (Hutchings 2005; Hutchings 2006; Markova 2012; Orlacchio 2012).
No stem cell transplantation reported in 10 studies (Andre 2017; Annunziata 2016; Barnes 2011; Gandikota 2015; Kobe 2018; Oki 2014; Okosun 2012; Rossi 2014; Simon 2016; Straus 2011).

Index (prognostic) factor

Participants in 16 out of 23 studies underwent PET combined with computed tomography (CT), contrast‐enhanced CT, or multi‐detector CT (MDCT), compared to PET only for participants in the other studies. Participants in 13 studies underwent PET‐CT (Annunziata 2016; Cerci 2010; Gallamini 2014; Gandikota 2015; Hutchings 2014; Kobe 2018; Mesguich 2016; Okosun 2012; Rossi 2014; Simon 2016; Touati 2014; Ying 2014; Zaucha 2017). Participants in another study underwent either PET or PET‐CT (Barnes 2011); participants in one study underwent PET with contrast‐enhanced CT (Markova 2012); and participants in another study underwent PET/MDCT (Orlacchio 2012). In the remaining seven studies, participants underwent a PET scan only (Andre 2017; Casasnovas 2019; Hutchings 2005; Hutchings 2006; Oki 2014; Straus 2011; Zinzani 2012).

Timing of interim PET

The timing of interim PET imaging varied between studies. In most studies, participants underwent an interim PET scan after two cycles (PET2) of chemotherapy (Andre 2017; Casasnovas 2019; Cerci 2010; Gallamini 2014; Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Oki 2014; Okosun 2012; Orlacchio 2012; Rossi 2014; Simon 2016; Straus 2011; Touati 2014; Zinzani 2012). In another study, participants underwent an interim PET scan after the first cycle (PET1) of chemotherapy only (Annunziata 2016). In one study, participants underwent interim PET scans after the first and second cycle of chemotherapy, but the study protocol was amended after interim analysis to limit PET2 scans to participants with positive results after PET1 (Zaucha 2017). In one multi‐centre study, participants from two centres underwent both PET1 and PET2, whereas participants from the remaining two centres underwent PET2 only if PET1 was positive (Hutchings 2014). Three retrospective studies included participants who underwent interim PET after two to four cycles of chemotherapy (Barnes 2011; Gandikota 2015; Ying 2014), and in another study participants underwent interim PET after four cycles (PET4) of chemotherapy (Markova 2012). For meta‐analyses, we used information at PET2 whenever available in order to ensure homogeneity across studies.

Evaluation of PET scans

In most studies, two nuclear medicine physicians evaluated the PET scans individually, and disagreements in scoring were resolved in a consensus meeting (Annunziata 2016; Barnes 2011; Cerci 2010; Hutchings 2005; Hutchings 2006; Hutchings 2014; Mesguich 2016; Orlacchio 2012; Rossi 2014; Ying 2014; Zinzani 2012). Evaluation of PET scans was performed by only one expert in one study (Markova 2012); and by a panel consisting of three to six experts in eight studies (Andre 2017; Casasnovas 2019; Gallamini 2014; Kobe 2018; Oki 2014; Okosun 2012; Straus 2011; Zaucha 2017). Three studies did not report the number or qualification of persons who performed evaluation of PET scans (Gandikota 2015; Simon 2016; Touati 2014). Nine out of 13 multi‐centre studies reported that evaluation of PET scans took place centrally (Andre 2017; Gallamini 2014; Hutchings 2006; Kobe 2018; Mesguich 2016; Okosun 2012; Straus 2011; Zaucha 2017; Zinzani 2012), and two studies did not report how reviewing of PET scans was performed across centres (Barnes 2011; Hutchings 2014).

In 11 studies, the assessors who evaluated the interim PET scans were blinded to the outcome (Kobe 2018; Gallamini 2014; Gandikota 2015; Hutchings 2006; Hutchings 2014; Mesguich 2016; Oki 2014; Rossi 2014; Straus 2011; Zaucha 2017; Zinzani 2012). The remaining studies did not report blinding.

Criteria for evaluation

Most studies reported the use of a standardised scale for the evaluation of the PET scans, but the scoring systems and cut‐off points between studies varied.

In 12 studies, the Deauville 5‐point scoring system for evaluation of PET scans was used: in nine studies, Deauville scores 1 ‐ 3 were considered as PET‐negative, and Deauville scores 4 ‐ 5 as PET‐positive (cut‐off ≥4) (Annunziata 2016; Casasnovas 2019; Gallamini 2014; Hutchings 2014; Oki 2014; Okosun 2012; Rossi 2014; Simon 2016; Zaucha 2017); in two studies, both cut‐off points for evaluation of the PET scans were used by scoring each image twice, and comparing performance of interim PET between both scales (Kobe 2018; Mesguich 2016); and in one study, it was reported that the PET scans were re‐interpreted retrospectively using the Deauville criteria, but it was not indicated which cut‐off points were used (Touati 2014).
In one study, the International Harmonization Project criteria were used: a PET scan was considered positive when the residual mass was ≥ 2 cm or, if the mass was smaller than 2 cm, when its activity was above that of the surrounding background (Andre 2017). A negative PET scan corresponded to Deauville score 1 (no uptake) and score 2 (uptake ≤ mediastinum).
In two studies, the scoring systems were not specified, but similar scales and cut‐off points as the Deauville scoring system were used: in one study, PET scans were reviewed using a 4‐point scale (Barnes 2011), and in another study using a 5‐point scale (Gandikota 2015).
In three studies, other standardised scales for the evaluation of PET scans were used: one study used the Juweid criteria (Zinzani 2012), and two studies used the International Harmonization Project guidelines (Orlacchio 2012; Straus 2011).
Two studies did not report how PET scans were evaluated (Hutchings 2005; Hutchings 2006); and four studies reported performance of visual evaluation but did not indicate the use of a standardised scoring system (Cerci 2010; Markova 2012; Touati 2014; Ying 2014).
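
The most frequently reported dichotomisation above (Deauville scores 1 to 3 as PET‐negative, 4 to 5 as PET‐positive) can be written as a simple rule. The following is a minimal sketch for illustration only; the function name and the `positive_cutoff` parameter are hypothetical and are not taken from any included study.

```python
def classify_interim_pet(deauville_score: int, positive_cutoff: int = 4) -> str:
    """Map a Deauville 5-point score to 'PET-negative' or 'PET-positive'.

    positive_cutoff is the lowest score counted as positive; the default of 4
    mirrors the 'cut-off >= 4' rule described in the text.
    """
    if deauville_score not in (1, 2, 3, 4, 5):
        raise ValueError("Deauville score must be an integer from 1 to 5")
    if deauville_score >= positive_cutoff:
        return "PET-positive"
    return "PET-negative"


# Show the classification of every possible score under the default cut-off.
for score in range(1, 6):
    print(score, classify_interim_pet(score))
```

Exposing the cut‐off as a parameter makes it explicit that the same scan can be classified differently depending on the threshold, which is one reason why results from studies using different cut‐off points are not directly comparable.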

Outcomes

Primary outcome

Overall survival (OS)

Univariable analyses

Twelve out of 23 included studies reported unadjusted results for our primary outcome OS (Barnes 2011; Casasnovas 2019; Cerci 2010; Gallamini 2014; Hutchings 2005; Hutchings 2006; Hutchings 2014; Kobe 2018; Simon 2016; Touati 2014; Zaucha 2017; Zinzani 2012). Of these, nine provided sufficient information and data to be included in meta‐analysis. One study reported an HR that we used (Kobe 2018). Another study reported an HR, but we still re‐calculated it due to discrepancies in values between the graph and table (Simon 2016). For the other seven studies, we estimated the HR using other available data from the publications (Barnes 2011; Cerci 2010; Hutchings 2005; Hutchings 2014; Touati 2014; Zaucha 2017; Zinzani 2012).
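
To illustrate what estimating an HR from other available data can involve, the sketch below applies one commonly used indirect approximation based on a two‐sided log‐rank P value, the total number of events and the group sizes. This section does not state which estimation method was applied to each study, so the approach shown, the function name and all input values are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def hr_from_logrank_p(p_value, total_events, n_positive, n_negative,
                      positive_worse=True):
    """Approximate the HR (PET-positive vs PET-negative) from a log-rank P value.

    Returns (hr, log_hr, se_log_hr). All inputs are hypothetical illustration
    values unless taken from a study report.
    """
    if not 0 < p_value < 1:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # Approximate variance of the log-rank statistic from the total number
    # of events and the sizes of the two groups.
    variance = total_events * n_positive * n_negative / (n_positive + n_negative) ** 2
    # Standard normal quantile corresponding to the two-sided P value.
    z = NormalDist().inv_cdf(1 - p_value / 2)
    # Observed minus expected events; the sign encodes the direction of effect.
    o_minus_e = math.sqrt(variance) * z * (1 if positive_worse else -1)
    log_hr = o_minus_e / variance
    se_log_hr = 1 / math.sqrt(variance)
    return math.exp(log_hr), log_hr, se_log_hr


# Hypothetical study: P = 0.01, 20 deaths, 30 PET-positive, 70 PET-negative.
hr, log_hr, se = hr_from_logrank_p(0.01, 20, 30, 70)
print(f"HR = {hr:.2f}, ln(HR) = {log_hr:.3f}, SE = {se:.3f}")
```

Because the P value carries no information about direction, the direction of effect has to be supplied separately (here through the hypothetical `positive_worse` flag).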

Multivariable analyses

Two studies reported adjusted results for OS (Kobe 2018; Simon 2016). Two additional studies planned an adjusted analysis but, for different reasons, did not conduct it (Gallamini 2014; Hutchings 2005).

Secondary outcomes

Progression‐free survival (PFS)

Univariable analyses

Twenty‐one out of 23 studies reported unadjusted results for PFS (Andre 2017; Annunziata 2016; Barnes 2011; Casasnovas 2019; Cerci 2010; Gallamini 2014; Hutchings 2005; Hutchings 2006; Hutchings 2014; Kobe 2018; Markova 2012; Mesguich 2016; Oki 2014; Okosun 2012; Rossi 2014; Simon 2016; Straus 2011; Touati 2014; Ying 2014; Zaucha 2017; Zinzani 2012). Of these, 14 provided sufficient information and data to be included in meta‐analysis. Three studies provided an HR which we used (Annunziata 2016; Kobe 2018; Simon 2016). Another three studies reported an HR, but we still re‐calculated it due to unclear description of the statistical methods used (Hutchings 2006), reporting discrepancies between graphs and tables (Mesguich 2016) or general uncertainties in the reported values (Rossi 2014). For eight studies we estimated the HR using other available data (Barnes 2011; Cerci 2010; Hutchings 2005; Straus 2011; Touati 2014; Ying 2014; Zaucha 2017; Zinzani 2012).

Multivariable analyses

Eight studies reported adjusted results for PFS (Casasnovas 2019; Gallamini 2014; Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Rossi 2014; Simon 2016). Three studies took the importance of adjustment into account, but did not actually conduct a multivariable analysis (Annunziata 2016; Hutchings 2014; Oki 2014).

Definitions of progression‐free survival (PFS)

The definition of the progression outcome varied between studies. Four studies that reported PFS did not provide a definition (Hutchings 2014; Simon 2016; Straus 2011; Zaucha 2017). One study analysed event‐free survival (Cerci 2010); because its definition was identical to that of PFS, we included it in the analysis. Table 2 presents an overview of the definitions used for the progression outcome. Studies with identical definitions were grouped.

Table 2. Definitions of progression outcomes

Study	Definition of progression outcome
Andre 2017	Progression‐free survival, defined – from the date of random assignment to date of progression – as experiencing relapse after previous complete remission or progression after reaching partial remission (50% decrease and resolution of B symptoms and no new lesions); progressive disease (50% increase from nadir of any previous partial remission lesions or appearance of new lesions) on CT scan measurements during protocol treatment; or death from any cause, whichever occurred first.
Casasnovas 2019	Progression‐free survival defined as the time from randomisation to first progression, relapse, or death from any cause or last follow‐up.
Annunziata 2016	The primary endpoint was PFS, with progression during treatment, lack of complete remission at the end of the first‐line treatment, and relapse counted as adverse events.
Barnes 2011; Ying 2014; Zinzani 2012	Progression‐free survival is defined as the time from diagnosis to progression or death from any cause.
Kobe 2018	Progression‐free survival is defined as the time from completion of staging until progression, relapse, or death from any cause, or to the day when information was last received on the patient's disease status.
Cerci 2010	Three‐year event‐free survival was chosen as the endpoint and defined as the time from diagnosis to treatment failure or last follow‐up. Treatment failure was defined as an incomplete response after first‐line treatment, progression during therapy, relapse, or death.
Gallamini 2014; Markova 2012; Mesguich 2016; Oki 2014	Progression‐free survival is defined as the time from diagnosis to either disease progression or relapse, or to death as a result of any cause, whichever occurred first.
Hutchings 2005; Hutchings 2006	Progression‐free survival is defined as the time from diagnosis to first evidence of progression or relapse, or to disease‐related death.
Okosun 2012	Progression‐free survival is defined as the time from diagnosis to disease progression or relapse or last follow‐up.
Rossi 2014	Progression‐free survival is defined as the time from the beginning of treatment until progression, relapse, or death from any cause or the date of last follow‐up. Time‐to‐progression (TTP) is defined as the time from the date of the first course of chemotherapy to any treatment failure, including progression, relapse, or death related to lymphoma, or the date of the last follow‐up.
Touati 2014	Progression‐free survival is defined as the time from diagnosis to relapse or death.
Hutchings 2014; Simon 2016; Straus 2011; Zaucha 2017	Definition not reported.

Adverse events (AEs)

None of the included studies measured PET‐associated AEs.

Conflict of interest

Two studies reported potential conflicts of interest (Andre 2017; Casasnovas 2019). Fourteen studies declared that the investigators had no conflict of interest (Annunziata 2016; Barnes 2011; Hutchings 2006; Hutchings 2014; Kobe 2018; Mesguich 2016; Oki 2014; Okosun 2012; Orlacchio 2012; Rossi 2014; Simon 2016; Straus 2011; Zaucha 2017; Zinzani 2012). Seven studies did not report investigators' disclosures of potential conflicts of interest (Cerci 2010; Gallamini 2014; Gandikota 2015; Hutchings 2005; Markova 2012; Touati 2014; Ying 2014).

Excluded studies

After screening titles and abstracts, we excluded 10651 references that did not match our inclusion criteria. In addition, we excluded a total of 133 references after full‐text screening for the following reasons.

Fifty‐six references had a study design or publication type that did not match our inclusion criteria, i.e. letters and commentaries, case studies with a small sample size or validation studies (Adams 2016; Adams 2017; Adams 2018; Adams 2018a; Adams 2018b; Adams 2019; Afanasyev 2017; Ansell 2016; Barrington 2017; Bar‐Shalom 2003; Basu 2009; Becherer 2002; Bednaruk‐Mlynski 2015; Biggi 2012; Bishop 2015; Bodet‐Milin 2009; Boisson 2007; Borchmann 2016; Bucerius 2006; Cremerius 1999; D'Urso 2018; Dann 2018; deAndres‐Galiana 2015; Diehl 2007; El‐Galaly 2012; Evens 2014; Fanti 2008; Friedberg 2002; Friedberg 2004; Gallamini 2008; Gallamini 2018a; Gallowitsch 2008; Guidez 2016; Hagtvedt 2015; Hartmann 2012; Hartridge‐Lambert 2013; Kobe 2008; Kobe 2014; Lowe 2002; Milgrom 2017; Mocikova 2010; NCT02292979; Pichler 2000; Reinhardt 2005; Rigacci 2002; Rigacci 2017; Rubello 2015; Sakr 2017; Specht 2007; Spinner 2018; Strigari 2016; Tirelli 2015; Xie 2018; Yasgur 2015; Zabrocka 2016; Zaucha 2009).
Thirty‐nine references adapted the treatment based on PET‐results (Albano 2017; Albano 2018; Biggi 2017; Carras 2018; Ciammella 2016; Cuccaro 2016; Damlaj 2017; Damlaj 2019; Danilov 2017; Dann 2009; Dann 2010; Dann 2010a; Dann 2012; Dann 2013; Dann 2016; Dann 2017; Fornecker 2017; Gallamini 2017; Gallamini 2018; Greil 2018; Illidge 2015; Johnson 2015; Johnson 2016; Kamran 2016; Kamran 2018; Moskowitz 2015; NCT00784537; NCT00795613; NCT01358747; NCT01652261; Nguyen 2017; Paolini 2007; Pavlovsky 2019; Simontacchi 2015; Straus 2018; Torizuka 2004; Trotman 2017; Villa 2018; Zinzani 2016).
Eighteen references also included participants with other types of lymphoma and did not report data for HL separately (Awan 2013; Blum 2002; Bodet‐Milin 2008; Cremerius 2001; Filmont 2003; Freudenberg 2004; Fruchart 2006; Goldschmidt 2011; Haioun 2005; Honda 2014; Iagaru 2008; Kostakoglu 2006; Li 2013; Slaby 2002; Tomita 2015; Torizuka 2004; Zinzani 1999; Zinzani 2002).
Ten references included participants who received treatment other than first‐line therapy, i.e. second‐line therapy for relapsed or refractory disease (Bjurberg 2006; Front 1999; Huic 2006; Mocikova 2010; Mocikova 2011; Schot 2007; Sucak 2011; Tseng 2012; Weidmann 1999; Yoshimi 2008).
Eight references reported only end‐of‐chemotherapy PET‐results (Advani 2007; Hueltenschmidt 2001; Hutchings 2007; Jerusalem 2003; Molnar 2010; Naumann 2001; Panizo 2004; Spaepen 2001).
Two were duplicates (Freudenberg 2004; Kobe 2014).

These publications are described in Characteristics of excluded studies.

Risk of bias in included studies

We assessed the risk of bias at outcome level (OS and PFS) for each study using the QUIPS tool. No study reported PET‐associated AEs. The detailed assessment can be found in the 'Risk of bias (QUIPS)' section in the Characteristics of included studies.

Risk of bias in studies included in meta‐analyses

The 'Risk of bias' summary (Figure 2) presents the combined judgement made by the review authors in a cross‐tabulation. Studies included in meta‐analysis are highlighted in bold.

Figure 2

'Risk of bias' assessment according to QUIPS (Quality in Prognostic Studies) by outcome.

Overall survival (OS)

For our primary outcome OS, one out of nine studies included in meta‐analysis was assessed as 'low' in all risk of bias domains (Kobe 2018). Four studies were assessed as 'unclear' for the domain study participation (Barnes 2011; Hutchings 2005; Simon 2016; Touati 2014), mostly due to a lack of information about the baseline population from which the study sample originated. Most studies had defined exclusion criteria to sample participants from the baseline population (e.g. unavailability of interim PET2) without providing a description of the original population or reasons for missing information. Considering this a potential source of selection bias, we assessed this domain as 'unclear' when information about the baseline population was missing. For the domains study attrition, prognostic factor measurement and outcome measurement, risk of bias was assessed as 'low' in most studies. Two studies did not report the use of standardised criteria for prognostic factor measurement; we therefore assessed the risk of bias as 'moderate' (Barnes 2011; Touati 2014). One study was assessed as 'moderate' risk because PET2 availability was dependent on PET1 result (Zaucha 2017). Due to inconsistency in reporting of the timing of the interim PET measurement, the risk of bias for outcome measurement was assessed as 'high' in one study (Barnes 2011), while the remaining studies were all assessed as 'low'. Two studies were assessed as 'low' risk of bias in the domain other prognostic factors (covariates) because they only included participants within one disease stage (e.g. early or advanced stages) (Barnes 2011; Kobe 2018), while the remaining seven studies were assessed as 'high' risk of bias for this domain because they included all disease stages without adjusting for stage (Cerci 2010; Hutchings 2005; Hutchings 2014; Simon 2016; Touati 2014; Zaucha 2017; Zinzani 2012).
Six studies provided sufficient information about the methods used for univariable analysis (Hutchings 2005; Hutchings 2014; Kobe 2018; Touati 2014; Zaucha 2017; Zinzani 2012); we therefore assessed the risk of bias for statistical analysis and reporting as 'low'. The same domain was assessed as 'high' in three studies due to discrepancies between text and figures and/or tables (Barnes 2011; Cerci 2010; Simon 2016).

Progression‐free survival (PFS)

For our secondary outcome PFS, two out of 14 studies included in meta‐analysis were assessed as 'low' risk of bias in all domains (Kobe 2018; Mesguich 2016). Eight studies provided clear descriptions of study characteristics and participants (Cerci 2010; Kobe 2018; Mesguich 2016; Rossi 2014; Straus 2011; Ying 2014; Zaucha 2017; Zinzani 2012), so we assessed the risk of bias as 'low'. Five studies did not report inclusion and/or exclusion criteria (Annunziata 2016; Barnes 2011; Hutchings 2005; Simon 2016; Touati 2014), so we assessed the risk of bias for study participation as 'unclear'. One study reported a high number of participants with unavailable interim PET scans without further information (Hutchings 2006), so we assessed the risk of bias as 'high' in the same domain. Most studies had no loss to follow‐up to report or provided a clear description of how missing data were handled, so we assessed the risk of bias for study attrition as 'low' in the majority of studies. One study was assessed as 'unclear' due to a lack of information regarding loss to follow‐up (Annunziata 2016); another study was assessed as 'moderate' because no explanation was provided as to why some participants were lost to follow‐up (Hutchings 2005). The risk of bias for the domains prognostic factor measurement and outcome measurement was assessed as 'low' in most studies. Three studies did not report the use of standardised criteria for prognostic factor measurement; we therefore assessed the risk of bias as 'moderate' (Barnes 2011; Touati 2014; Ying 2014). A fourth study was assessed as 'moderate' risk because PET2 availability was dependent on PET1 result (Zaucha 2017). Due to lack of outcome definition or inconsistency in the reporting of the timing of the interim PET measurement, the risk of bias for outcome measurement was assessed as 'high' in one study (Barnes 2011). In another study, this domain was also assessed as 'high' because the outcome was not defined (Zaucha 2017).
The remaining studies were all assessed as 'low' for the domain outcome measurement. For the domain other prognostic factors (covariates), six studies were assessed as 'low' risk of bias, because they either included participants within one disease stage only, or if all disease stages were included, the authors adjusted for disease stage (Barnes 2011; Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Straus 2011). The remaining eight studies were assessed as 'high' risk of bias for this domain (Annunziata 2016; Cerci 2010; Rossi 2014; Simon 2016; Touati 2014; Ying 2014; Zaucha 2017; Zinzani 2012). Eight studies provided sufficient information about the methods used for univariable analysis (Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Rossi 2014; Straus 2011; Touati 2014; Zinzani 2012), so we assessed the risk of bias for statistical analysis and reporting as 'low'. Five studies were assessed as 'high' for this domain because of the poor reporting of results (Annunziata 2016; Barnes 2011; Cerci 2010; Simon 2016; Ying 2014), including discrepancies between text and figures and/or tables in some studies. Another study was also assessed as 'high' because the method of analysis was not sufficiently described (Zaucha 2017).

Risk of bias in studies reported narratively

The risk of bias for all studies reported narratively is included in Figure 2.

Overall survival (OS)

The results for OS from three studies are reported narratively in this review (Casasnovas 2019; Gallamini 2014; Hutchings 2006). For two studies (Casasnovas 2019; Gallamini 2014) we assessed the risk of bias as 'low' in all six domains of the QUIPS tool. For Hutchings 2006, four of the six domains were assessed as 'low' risk of bias. For the domain study participation, the study was assessed as 'high' risk because a great number of participants initially included in the study did not undergo an early interim PET. The study was also assessed as 'high' risk for the domain other prognostic factors (covariates) because participants within all disease stages were included.

Progression‐free survival (PFS)

For PFS, the results from seven studies are reported narratively (Andre 2017; Casasnovas 2019; Gallamini 2014; Hutchings 2014; Markova 2012; Oki 2014; Okosun 2012). Out of these, two studies (Casasnovas 2019; Gallamini 2014) were assessed as 'low' risk of bias in all six domains of the QUIPS tool. All of the remaining five studies were assessed as 'low' risk of bias for the domain study participation. For the domains study attrition, prognostic factor measurement and outcome measurement, three studies were assessed as 'low' risk of bias (Hutchings 2014; Oki 2014; Okosun 2012). For the other two studies (Andre 2017; Markova 2012), the domain prognostic factor measurement was assessed as 'moderate' risk because the prognostic factor was measured differently in some participants. For the domain other prognostic factors (covariates), five studies were assessed as 'low' risk of bias (Andre 2017; Casasnovas 2019; Gallamini 2014; Markova 2012; Okosun 2012). The other two studies were assessed as 'high' risk for this domain because they included all disease stages without adjusting for disease stage (Hutchings 2014; Oki 2014). Regarding the domain statistical analysis and reporting, five studies were assessed as 'low' risk because they used appropriate methods for the planned analysis (Andre 2017; Casasnovas 2019; Gallamini 2014; Hutchings 2014; Markova 2012). The remaining two studies were assessed as 'high' risk due to inconsistent conduct and reporting of the analyses (Oki 2014; Okosun 2012).

Other potential sources of bias

Reporting deficiencies and selective reporting

We detected reporting deficiencies in some of the studies, particularly when not all analyses that were planned in the methods were actually conducted. In some cases, this was due to the low number of events (i.e. in PET‐negative participants) that did not allow for further analyses. In other cases, it was unclear why certain analyses were performed and others not. This was particularly the case with regard to multivariable analyses, when studies planned to assess the independent prognostic ability of the interim PET in a prognostic model including other clinically relevant prognostic factors (covariates). Studies either did not perform such an analysis even though they initially planned to, or they did not consider adjustment. None of the studies stated clearly their rationale for the choice of covariates; in some cases, the choice was based on their significance in univariable analysis. For example, in studies that included only two or fewer covariates in the model in addition to interim PET, the interim PET always remained an independent prognostic factor. However, how interim PET would have performed in comparison to other covariates remains unclear. Hence, it is particularly important to state why certain covariates were taken into account. Without such a rationale, we cannot be sure that studies did not report only certain positive ('significant') results, which can be an issue of selective reporting.

In addition, we detected discrepancies in the reporting of results within the texts of some studies, or between text and the corresponding graph(s) (i.e. in the reporting of the HR or number of events). In these cases, we tried to contact the corresponding principal investigator(s) for clarification in order to have a better understanding of the results.

Blinding of prognostic factor assessor

Eleven studies reported that the clinicians evaluating the interim PET scans were blinded to the outcome (Kobe 2018; Gallamini 2014; Gandikota 2015; Hutchings 2006; Hutchings 2014; Mesguich 2016; Oki 2014; Rossi 2014; Straus 2011; Zaucha 2017; Zinzani 2012).

Results of the analyses

Twenty‐three studies evaluated interim PET as a prognostic factor in individuals with HL. Two studies did not report data for our outcomes of interest (Gandikota 2015; Orlacchio 2012) and we have not been able to either obtain or estimate any relevant data. None of the included studies reported PET‐associated AEs. Fifteen studies were included in meta‐analyses. Another six of the included studies in this review reported results for OS and/or PFS, but we were not able to pool results because, despite our approaches for possible estimation of missing data items, there was a lack of accurate information or data to do so (Andre 2017; Casasnovas 2019; Gallamini 2014; Markova 2012; Oki 2014; Okosun 2012). For all studies that were not included in meta‐analyses, we reported the main results narratively in this review.

Overall survival (OS)

Meta‐analysis of unadjusted results

We included nine studies with 1802 participants in meta‐analysis for OS (Barnes 2011; Cerci 2010; Hutchings 2005; Hutchings 2014; Kobe 2018; Simon 2016; Touati 2014; Zaucha 2017; Zinzani 2012). There were 475 interim PET‐positive and 1327 interim PET‐negative participants. Meta‐analysis shows a clear advantage in OS for participants with a negative interim PET scan compared to participants with a positive interim PET scan (HR 5.09, 95% CI 2.64 to 9.81, I² 44%, moderate certainty of evidence) (Analysis 1.1) (Figure 3).

Figure 3

Forest plot of comparison: 1 Univariable comparison of PET+ve vs. PET‐ve, outcome: 1.1 Overall survival
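
The quantities summarised in a forest plot such as Figure 3 (a pooled HR, its 95% CI and I²) can be illustrated with a short sketch. The code below converts a reported HR and 95% CI to the log scale and then pools several studies with a DerSimonian‐Laird random‐effects model. The choice of this estimator is an assumption made for illustration, the three study results are invented, and the function names are hypothetical; only the pooled OS result (HR 5.09, 95% CI 2.64 to 9.81) is taken from the text above.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def log_hr_and_se(hr, ci_low, ci_high):
    """Recover ln(HR) and its standard error from an HR and its 95% CI."""
    return math.log(hr), (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)


def pool_random_effects(estimates):
    """Pool (log_hr, se) pairs; return (HR, CI low, CI high, I2 in percent)."""
    y = [log_hr for log_hr, _ in estimates]
    w = [1 / se ** 2 for _, se in estimates]          # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance (method of moments), truncated at zero.
    tau2 = max(0.0, (q - df) / c) if (df > 0 and c > 0) else 0.0
    w_re = [1 / (se ** 2 + tau2) for _, se in estimates]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if (q > 0 and df > 0) else 0.0
    return (math.exp(pooled),
            math.exp(pooled - Z95 * se_pooled),
            math.exp(pooled + Z95 * se_pooled),
            i2)


# Pooled OS result from the text, expressed on the log scale.
print(log_hr_and_se(5.09, 2.64, 9.81))

# Invented study-level results (HR, CI low, CI high), for illustration only.
hypothetical_studies = [(2.0, 0.9, 4.4), (6.5, 2.1, 20.1), (4.0, 1.5, 10.7)]
print(pool_random_effects([log_hr_and_se(*s) for s in hypothetical_studies]))
```

Working on the log scale is what makes the CI symmetric around the point estimate, so the SE can be read directly from the width of the log‐transformed interval.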

Subgroup analysis

We conducted subgroup analyses to explore the underlying clinical heterogeneity between the studies.

For subgroup analysis by radiotherapy, we found evidence of a subgroup difference between the groups (P = 0.05; INRT/ISRT in three studies: N = 548, IFRT in four studies: N = 428, RT not further specified in two studies: N = 826). Results still show an advantage in OS for PET‐negative participants, irrespective of the type of radiotherapy they received (Analysis 2.1).

For the remaining subgroups, there was no evidence of subgroup differences.

Different study designs (P = 0.28; three prospective studies: N = 406, four retrospective studies: N = 589, one RCT: N = 722) (Analysis 2.2). One study (Hutchings 2005) was not included in this subgroup analysis because its authors did not explicitly state the study design.
Different chemotherapy regimens (P = 0.33; ABVD in five studies: N = 801, ABVD and other in three studies: N = 279, BEACOPP in one study: N = 722) (Analysis 2.3). The chemotherapy regimen in the included studies was mainly ABVD, with differing numbers of cycles, with or without radiotherapy (Barnes 2011; Cerci 2010; Hutchings 2014; Simon 2016; Zaucha 2017; Zinzani 2012). In Hutchings 2005, the majority of participants received ABVD, while the remaining participants received MOPP or MOPP/ABV, or another regimen which was not specified. Some participants also received additional radiotherapy. In Kobe 2018, all participants received eBEACOPP. In Touati 2014, the regimens included ABVD, MOPP/ABV hybrid or BEACOPP. If separate data had been available for each type of chemotherapy, we could have performed more specific subgroup analysis to test for differences between chemotherapies.
PET‐CT versus PET (P = 0.66; PET‐CT in five studies: N = 595, PET only in three studies: N = 1111) (Analysis 2.4). One study (Barnes 2011) was not included in this subgroup analysis because PET was conducted in some participants and PET‐CT in the others.
Different stages of disease (P = 0.33; early stages with A or B symptoms in one study: N = 96, all stages in seven studies: N = 984, advanced stages in one study: N = 722) (Analysis 2.5). One study included disease stages IA, IB, IIA and IIB (Barnes 2011) and another study included advanced‐stages only (Kobe 2018). The remaining seven studies included participants representing all disease stages of HL.
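
The P values quoted for the subgroup comparisons above come from tests for subgroup differences. A minimal sketch of one standard form of such a test is given below: the subgroup estimates are compared on the log scale using a between‐subgroup Q statistic referred to a chi‐squared distribution. The subgroup values are invented, the function names are hypothetical, and the sketch is not a reproduction of Analyses 2.1 to 2.5.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    # Recurrence in steps of two degrees of freedom.
    a = df / 2 - 1
    return chi2_sf(x, df - 2) + (x / 2) ** a * math.exp(-x / 2) / math.gamma(a + 1)


def subgroup_difference(subgroups):
    """subgroups: list of (pooled_log_hr, se), one pair per subgroup.

    Returns (Q_between, degrees of freedom, P value).
    """
    weights = [1 / se ** 2 for _, se in subgroups]
    overall = sum(w * y for w, (y, _) in zip(weights, subgroups)) / sum(weights)
    q_between = sum(w * (y - overall) ** 2 for w, (y, _) in zip(weights, subgroups))
    df = len(subgroups) - 1
    return q_between, df, chi2_sf(q_between, df)


# Invented pooled results for three subgroups (ln HR, SE), for illustration only.
hypothetical_subgroups = [(math.log(3.0), 0.40), (math.log(8.0), 0.45),
                          (math.log(4.5), 0.50)]
q, df, p = subgroup_difference(hypothetical_subgroups)
print(f"Q = {q:.2f}, df = {df}, P = {p:.3f}")
```

A large P value in such a test indicates only that no difference between subgroups was detected; with few studies per subgroup, the test has limited power.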

Sensitivity analysis

We conducted sensitivity analyses for the timing of interim PET (removing those that did not conduct a PET2), and the precision of the estimated HR and SE (removing the studies with imprecise HR and SE estimation).

Regarding the timing of the interim PET, interim PET2 was conducted in six studies (N = 1495 participants in total) (Cerci 2010; Kobe 2018; Simon 2016; Touati 2014; Zinzani 2012; Zaucha 2017). In three studies (N = 307 participants in total), interim PET was conducted at other timings: in Barnes 2011, 41 participants received PET2 while the rest of the participants received PET3; in Hutchings 2005, 55 participants received PET2 and 35 participants received PET3; and in Hutchings 2014, PET1 was conducted for all participants (N = 126). Although 89 out of 126 also received a PET2, we used the data for PET1 as the publication provided us with the most information on PET1. At sensitivity analysis, temporarily removing studies that did not perform a PET2 slightly affected the pooled OS (overall: HR 5.09, 95% CI 2.64 to 9.81; sensitivity: HR 3.53, 95% CI 1.97 to 6.32) (Analysis 2.6). It seems that there was an over‐estimation of the HR for the studies that did not perform a PET2. However, the direction of the effect is firm and unchanged. This difference may also be partly explained by the very wide follow‐up ranges within the studies. Hence, following the sensitivity analysis, we consider the overall OS to be robust.

Regarding the precision of the HR estimation, we were able to either obtain or estimate a precise HR and SE for seven studies (N = 1638 participants in total) (Cerci 2010; Hutchings 2005; Hutchings 2014; Kobe 2018; Simon 2016; Zaucha 2017; Zinzani 2012). For two studies (N = 164 participants in total) (Barnes 2011; Touati 2014), we were only able to provide imprecise estimations of the HR and SE. Temporarily removing the imprecise studies during sensitivity analysis barely affected the pooled results for OS, indicating that the measurements obtained from our imprecise method were quite accurate after all (overall: HR 5.09, 95% CI 2.64 to 9.81; sensitivity: HR 5.70, 95% CI 2.60 to 12.48) (Analysis 2.7). Hence, we concluded that the overall pooled OS is robust.

Narrative reporting of results

Univariable analyses

Three studies (Casasnovas 2019; Gallamini 2014; Hutchings 2006) that reported results for OS were not included in meta‐analysis due to lack of adequate data for estimating the HR and associated SE (Table 3).

Table 3. Narrative reporting of results from univariable analysis for OS

Study	No. of participants + stages	Timing of interim PET scan	Unadjusted results for interim PET scan
Casasnovas 2019	Standard arm N = 413	PET2	PET2 results (N = 398) Intention‐to‐treat analysis 5‐year OS for entire arm = 95.2% (95% CI 91.1 to 97.4), 13 events Per‐protocol analysis N = 372 participants 5‐year OS for entire arm = 95.6% (95% CI 91.2 to 97.8), 10 events Comment: Separate results for PET2‐negative and PET2‐positive participants in the standard arm were not reported for this outcome.
Gallamini 2014	260 (stages IIA ‐ IVB)	PET2	PET‐negative N = 215, 2 deaths, 3‐year OS = 99% PET‐positive N = 45, 6 deaths, 3‐year OS = 87% Comment: Logrank test for difference between groups was not reported and could not be obtained.
Hutchings 2006	77 (all stages)	PET2 and PET4	PET2 results (N = 77) PET‐negative N = 61, no deaths PET‐positive N = 16, 2 deaths Logrank test for difference between groups: P < .01 PET4 results (N = 64) PET‐negative N = 51, no deaths PET‐positive N = 13, 2 deaths Comment: Logrank test for difference between groups after PET4 was not reported and could not be obtained.

Multivariable analyses

Two studies (Kobe 2018; Simon 2016) reported adjusted effect estimates to test the prognostic ability of PET2 in addition to other prognostic factors. Table 4 displays a list of established prognostic factors (Cuccaro 2014; Josting 2010; Kılıçkap 2013), and shows which were considered as covariates in the final multivariable model. The selection of prognostic factors (covariates) for the final model was either based on the literature (Simon 2016), or on their significance in univariable analysis (Kobe 2018). However, pooling of adjusted data was not possible. In Simon 2016, only the results of those covariates that remained independent prognostic markers in multivariable analysis, namely LMR and PET2‐positivity, were reported. It is unclear whether, or which other covariates were included in the final model. A full list of study‐specific, candidate covariates can be found in the respective table for each study in the Characteristics of included studies.

The statistical methods used were Cox proportional hazards regression and logistic regression models, which are appropriate methods for multivariable analysis.

Table 4. Adjusted results from final multivariable model for OS

Study	Prognostic factors								Adjusted results for interim PET
Study	Interim PET	Age	Gender	Disease stage	B symptoms	Bulky disease	IPS	Other study‐specific factors	Adjusted results for interim PET
Kobe 2018	x	‐	‐	‐	‐	‐	x	x	Interim PET‐positivity (DS 4) HR 3.2 (95% CI 1.3 to 8.4), P = 0.02 Comment: Adjusted results indicate an independent prognostic impact of PET2.
Simon 2016	x	‐	‐	‐	‐	‐	‐	x	Interim PET‐positivity HR = 11.51 (95% CI 3.14 to 42.86), P < 0.001 Comment: Adjusted results indicate the independent prognostic impact of PET2.
x = prognostic factor considered for adjustment in the final model ‐ = prognostic factor was not considered in the final model

Progression‐free survival (PFS)

Meta‐analysis of unadjusted results

We included 14 studies with 2079 participants in meta‐analysis for PFS (Annunziata 2016; Barnes 2011; Cerci 2010; Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Rossi 2014; Simon 2016; Straus 2011; Touati 2014; Ying 2014; Zaucha 2017; Zinzani 2012). There were 529 interim PET‐positive and 1550 interim PET‐negative participants. Meta‐analysis shows a clear advantage in PFS for participants with a negative interim PET scan compared to participants with a positive interim PET scan (HR 4.90, 95% CI 3.47 to 6.90, I² = 45%, very low certainty of evidence) (Analysis 1.2) (Figure 4).

Figure 4

Forest plot of comparison: 1 Univariable comparison of PET+ve vs. PET‐ve, outcome: 1.2 Progression‐free survival
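The pooled estimates in this review are reported as an HR with a 95% CI. As a minimal sketch, assuming the interval is symmetric on the log scale and based on a normal approximation (an assumption of this example, not a statement of the exact software settings used in the review), the SE of the log HR can be back‐calculated from the reported limits. The function name `log_hr_and_se` is hypothetical.

```python
import math

def log_hr_and_se(hr, ci_low, ci_high, z=1.959964):
    """Return (ln HR, SE of ln HR) from an HR and its 95% CI.

    Assumes the CI was built as exp(ln HR +/- z * SE), i.e. symmetric
    on the log scale with a normal approximation.
    """
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return log_hr, se

# Pooled unadjusted estimates as reported in the text
pfs_log_hr, pfs_se = log_hr_and_se(4.90, 3.47, 6.90)  # PFS, Analysis 1.2
os_log_hr, os_se = log_hr_and_se(5.09, 2.64, 9.81)    # OS, overall estimate

print(f"PFS: ln HR = {pfs_log_hr:.3f}, SE = {pfs_se:.3f}")
print(f"OS:  ln HR = {os_log_hr:.3f}, SE = {os_se:.3f}")
```

Under these assumptions the SE for the pooled OS estimate comes out at roughly twice that for PFS, which reflects the wider interval around the OS result.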

Subgroup analysis

We conducted subgroup analyses to explore the underlying clinical heterogeneity between the studies.

Regarding the disease stage, we detected a significant difference between the groups (P = 0.02, early stages with A or B symptoms in two studies: N = 184, all stages in eleven studies: N = 1173, advanced stages in one study: N = 722). Results still showed an advantage for PFS in PET‐negative participants in any stage of the disease (Analysis 3.4). Twelve studies included all disease stages, while one study included stages IA ‐ IIB (Barnes 2011), and another study included advanced‐stages only (Kobe 2018).

For the remaining subgroups, there was no evidence of subgroup differences.

Different study designs (P = 0.29, three prospective studies: N = 357, eight retrospective studies: N = 827, two RCTs: N = 165) (Analysis 3.1). One study (Hutchings 2005) was not included in this subgroup analysis because they did not explicitly state their study design.
Different chemotherapy regimens (P = 0.43; ABVD in seven studies: N = 945, ABVD and other chemotherapy in four studies: N = 265, other chemotherapies in three studies: N = 869) (Analysis 3.2). The chemotherapy regimen was ABVD in seven studies, with or without radiotherapy (Annunziata 2016; Barnes 2011; Cerci 2010; Mesguich 2016; Simon 2016; Zaucha 2017; Zinzani 2012). In two studies, participants received either ABVD, ABV/MOPP, ABVD/COPP, BEACOPP esc., PVAG or radiotherapy only (Hutchings 2005; Hutchings 2006). In Touati 2014, the regimens included ABVD, MOPP/ABV hybrid or BEACOPP. In Ying 2014, participants received either ABVD or BEACOPP. In Kobe 2018, all participants received eBEACOPP. In Rossi 2014, all participants received anthracycline‐based chemotherapy, and in Straus 2011, all participants received AVG.
PET versus PET‐CT (P = 0.30; PET‐CT in eight studies: N = 707, PET only in five studies: N = 1276) (Analysis 3.3). One study (Barnes 2011) was not included in this analysis because they conducted PET in some participants and PET‐CT in the other participants.
Different radiotherapy (P = 0.29; INRT/ISRT in five studies: N = 651, IFRT in six studies: N = 514, RT not specified in two studies: N = 826, no RT given in one study: N = 88) (Analysis 3.5).

In addition, we detected variations between the studies with regard to the definition of PFS. However, all trials included in meta‐analysis reported some progression endpoint such as treatment failure, progression or relapse. We have provided the exact reported definitions in Table 2.

Sensitivity analysis

Regarding the timing of interim PET, interim PET was conducted after two cycles of chemotherapy (PET2) in nine studies (N = 1677 participants in total) (Cerci 2010; Hutchings 2006; Kobe 2018; Rossi 2014; Simon 2016; Straus 2011; Touati 2014; Zaucha 2017; Zinzani 2012). In five studies (N = 402 participants in total), interim PET was conducted at other timings: in Annunziata 2016 all participants received PET1; in Barnes 2011 and Hutchings 2005 participants received either PET2 or PET3; and in Hutchings 2006 and Mesguich 2016 participants received either a PET2, PET3 or PET4. At sensitivity analysis, temporarily removing studies that did not perform a PET2 barely affected the results for PFS (overall: HR 4.90, 95% CI 3.47 to 6.90; sensitivity: HR 4.68, 95% CI 3.14 to 6.98) (Analysis 3.6). Hence, the timing of the interim PET measurement (when conducted at a time point other than PET2) did not affect the overall pooled result for PFS.

Regarding the precision of the HR estimation, we were able to provide a precise estimation of the HR and SE for nine studies (N = 1450 participants in total) (Annunziata 2016; Barnes 2011; Hutchings 2005; Kobe 2018; Rossi 2014; Simon 2016; Straus 2011; Ying 2014; Zaucha 2017). For five studies (N = 629 participants in total) we were only able to provide a slightly imprecise estimation of the HR and SE (Cerci 2010; Hutchings 2006; Mesguich 2016; Touati 2014; Zinzani 2012). However, at sensitivity analysis we found that the imprecise HRs did not significantly affect the pooled results. Temporarily removing the imprecise studies during sensitivity analysis barely affected the pooled results (overall: HR 4.90, 95% CI 3.47 to 6.90; sensitivity: HR 4.69, 95% CI 2.84 to 7.73) (Analysis 3.7). Hence, we concluded that the overall pooled PFS is robust and was not affected by our slightly imprecise method of HR and SE estimation.

Narrative reporting of results

Univariable analyses

Seven studies that reported results for PFS were not included in meta‐analysis (Andre 2017; Casasnovas 2019; Gallamini 2014; Hutchings 2014; Markova 2012; Oki 2014; Okosun 2012). Table 5 presents the results from these studies narratively. We extracted all data that were available and relevant to us (i.e. number of interim PET‐negative and interim PET‐positive participants, number of events and percentages for PFS). Due to strong differences in the reporting between studies, the table presents more information for some studies compared to others.

Table 5. Narrative reporting of results from univariable analysis for PFS

Study	No. of participants analysed	Timing of interim PET scan	Unadjusted results for interim PET
Andre 2017	Favourable: N = 371 standard arm Unfavourable: N = 583 standard arm	PET2	PET‐negative Favourable group: N = 2 events (both relapses) in the ABVD + INRT arm, ITT 5‐year PFS rate was 99.0% (95% CI 3.8 to 66.1) Unfavourable group: N = 22 events (16 relapses and 6 deaths not related to HL), ITT 5‐year PFS rate was 92.1% (95% CI 88.0 to 94.8) Results presented here are only for participants without interim PET adaptation (ABVD + INRT arm). Unclear how many of these participants were PET‐positive or PET‐negative. In total (all participants included in the study), there were 465 PET‐negative participants and 361 PET‐positive participants. *PET‐positive N = 41 events (36 relapses and 5 deaths not related to HL) in the ABVD + INRT arm, ITT 5‐year PFS rate was 77.4% (95% CI 70.4 to 82.9)
Casasnovas 2019	Standard arm N = 413	PET2	PET2 results (N = 398) Intention‐to‐treat analysis PET2‐negative N = 349 participants (88%), 5‐year PFS = 88.4% (95% CI 83.3 to 92) PET2‐positive N = 49 participants (12%), 5‐year PFS = 73.5% (95% CI 58.7 to 83.6) Results for entire standard arm 5‐year PFS = 86.2% (95% CI 81.6 to 89.8) 41 participants relapsed or progressed, 14 deaths Comment: Logrank test for difference between groups in the standard arm after PET2 was not reported and could not be obtained. Per‐protocol analysis N = 372 participants 5‐year PFS = 86.7% (95% CI 81.9 to 90.3) for entire arm
Gallamini 2014	260 (stages IIA ‐ IV)	PET2	PET‐negative N = 215, 12 events (progression N = 7, relapse N = 5), 3‐year PFS = 95% PET‐positive N = 45, 33 events (progression N = 27, relapse N = 6), 3‐year PFS = 28% Logrank test for difference between groups: P < 0.0001
Hutchings 2014	121 (all stages)	PET1 (N = 121) PET 2 (N = 89)	PET1 results (N = 126) PET‐negative N = 89, 5 events (relapse), 2‐year PFS = 94.1% PET‐positive N = 37, 22 events (17 primary refractory disease, 5 relapses), 2‐year PFS = 40.8% Log‐rank test for difference between groups: P < 0.01 PET1 vs. PET2 results (N = 89) Participants scanned after PET1 and 2 PET1‐negative 2‐year PFS = 98.3% PET1‐positive 2‐year PFS = 38.5% PET2‐negative 2‐year PFS = 90.2% PET2‐positive 2‐year PFS = 23.1% 14 PET1‐positive converted to a PET2‐negative (6 progressed). All PET1‐negative were also PET2‐negative.
Markova 2012	69 (advanced stages)	PET4	PET‐negative N = 51, 2 events (1 relapse and 1 death), % of PFS not reported PET‐positive N = 18, 4 events (progression or relapse), % of PFS not reported Log‐rank test for difference between groups: P = 0.016
Oki 2014	229 (all stages)	PET2	3‐year PFS rates in PET2‐negative versus PET2‐positive by disease subgroups Early stage favourable: 100% vs. 100% Early stage unfavourable: 91.5% vs. 56.3% (P < 0.0001) Early stage non‐bulky: 95.9% vs. 76.9% (P = 0.0018) Stage II bulky: 83.3% vs. 20% (P = 0.017) Advanced stage with IPS≤2: 77.0% vs. 30.0% (P < 0.001) Advanced stage with IPS≥3: 71.0% vs. 44.4% (P = 0.155)
Okosun 2012	23 (stages II ‐ IV)	PET2 or PET3	PET‐negative: N = 21, no events, 2‐year PFS = 100% PET‐positive: N = 2, 1 event (treatment failure), 2‐year PFS = 50% Log‐rank test for difference between groups: P = 0.0012

Multivariable analyses

Eight studies reported adjusted effect estimates for PFS (Casasnovas 2019; Gallamini 2014; Hutchings 2005; Hutchings 2006; Kobe 2018; Mesguich 2016; Rossi 2014; Simon 2016). Table 6 shows which prognostic factors (covariates) were considered in the final multivariable model of the studies. In two studies, only the results of those covariates that remained independent prognostic factors in multivariable analysis were reported (Gallamini 2014; Simon 2016). It is unclear whether, or which other covariates were included in the final multivariable model. The selection of prognostic factors (covariates) for adjustment in the studies was either based on their significance in univariable analysis (Casasnovas 2019; Hutchings 2006; Kobe 2018), or on the literature (established prognostic factors) (Hutchings 2005; Rossi 2014; Simon 2016). In two studies, the rationale for the covariates was not clearly stated (Gallamini 2014; Mesguich 2016).

As there are no final models with an identical set of covariates, pooling of adjusted effect estimates was not feasible. A full list of study‐specific, candidate covariates can be found in the respective table for each study in the Characteristics of included studies.

The statistical methods used were Cox proportional hazards regression and logistic regression models.

Table 6. Adjusted results from final multivariable model for PFS

Study	Prognostic factors								Adjusted results for interim PET
Study	Interim PET	Age	Gender	Disease stage	B symptoms	Bulky disease	IPS	Other study‐specific factors	Adjusted results for interim PET
Casasnovas 2019	x	‐	x	x	x	x	x	x	Multivariable analysis not reported separately for standard treatment group.
Gallamini 2014	x	‐	‐	‐	‐	‐	‐	x	PET2 HR N/A, P < 0.01 (Sig. 0.000), 95% CI 3.136 to 7.917 Comment: Adjusted results indicate the independent prognostic impact of interim PET2.
Hutchings 2005	x	‐	‐	x	‐	‐	‐	x	Early interim PET Wald 19.05, HR N/A, P‐value = 0.00007 Comment: Adjusted results indicate the independent prognostic impact of early interim PET.
Hutchings 2006	x	‐	‐	x	‐	‐	‐	x	Model 1 (interim PET2 + clinical stage + extranodal disease) PET2 HR = 36.281 (95% CI 7.179 to 183.4), P < .001 Model 2 (interim PET2 + extranodal disease) PET2 HR = 36.887 (95% CI 7.338 to 185.4), P < .001 Comment: Adjusted results indicate the independent prognostic impact of interim PET2.
Kobe 2018	x	‐	‐	‐	‐	‐	x	x	Interim PET‐positivity (DS 4) HR 2.4 (95% CI 1.4 to 4.1), P = 0.002 Comment: Adjusted results indicate an independent prognostic impact of PET2.
Mesguich 2016	x	‐	‐	x	‐	x	‐	‐	Model 1 (interim PET + disease stage) Positive interim PET HR = 3.73 (95% CI 1.35 to 10.35), P = 0.0112 Model 2 (interim PET + bulky disease) Positive interim PET HR = 3.62 (95% CI 1.30 to 10.05), P = 0.0138 Comment: Adjusted results indicate the independent prognostic impact of interim PET.
Rossi 2014	x	‐	‐	‐	‐	‐	‐	x	SUVmax PET0‐PET2 Relative risk = 7.9 (95% CI 2.9 to 22.9), P = 0.0001 Comment: Adjusted results indicate the independent prognostic impact of SUVmax PET0‐PET2.
Simon 2016	x	‐	‐	‐	‐	‐	‐	x	Interim PET‐positivity HR = 17.74, P < 0.001, 95% CI 6.61 to 47.57 Comment: Adjusted results indicate the independent prognostic impact of PET2.
x = prognostic factor considered for adjustment in the final model ‐ = prognostic factor was not considered in the final model

Adverse events (AEs)

None of the included studies measured PET‐associated AEs.

Studies not reporting our outcomes

Two studies (Gandikota 2015; Orlacchio 2012) did not report data for our outcomes of interest, but were still included in this review as they met our inclusion criteria. Their investigated outcomes were very close to our review outcomes, and the authors could potentially have measured our outcomes without reporting them in their publications. However, we were unable to obtain the relevant information; therefore, Table 7 presents the results from these studies narratively.

Table 7. Narrative reporting of results from studies not reporting our outcomes of interest

Study	No. of participants	Outcomes/comparison	Results
Gandikota 2015	77 (stages IIA ‐ IIB)	Analysis of imaging at different time points: Baseline imaging, imaging during (after two to four cycles of ABVD) and at the end of treatment, follow‐up imaging Need for surveillance imaging	Analysis of imaging at different time points Baseline imaging: 77 participants had baseline PET‐CT scans, 1 had only chest X‐ray due to pregnancy at baseline Imaging during and at the end of treatment: 77 participants had interim PET‐CT during chemotherapy (N = 34) or after chemotherapy before initiation of radiotherapy (N = 43) Out of 77, 4 remained PET‐positive, scans after completion of radiotherapy showed a complete response in 2/4, inflammation in 1/4, resolution of all adenopathy in 1/4, 0/4 relapsed during follow‐up Follow‐up imaging: Median follow‐up: 46 months (range 24 to 126) Total of 466 scans in 78 participants (PET‐CT in N = 42) No relapses occurred in the entire cohort, N = 3 were diagnosed with a second primary malignancy by either imaging or clinical presentation, N = 6 had false‐positive imaging findings (3/6 PET‐CT) requiring further supplementary imaging or biopsy/surgery Need for surveillance imaging Quote: “No relapse of cHL was detected at a median follow‐up of 46 months. […] Routine imaging (either CT or PET‐CT) for the early detection of relapse does not appear necessary or justified in these participants.”
Orlacchio 2012	132 (all stages)	Interim PET2 vs. end PET (three months after the end of chemo‐ and radiotherapy).	Interim PET results Negative interim PET2: 104 Positive interim PET2: 28 End PET results Negative interim PET2 group: Negative final PET: 102/104, Positive final PET: 2/104 Positive interim PET2 group: Negative final PET: 16/28, Positive final PET: 12/28 Interim PET vs. end PET Negative interim PET2 group: Quote: “Final PET confirmed the negative results in 102 cases (98%) and revealed pathological uptake in the remaining two cases (2%).” Positive interim PET2 group: Of the 28 interim PET‐positive participants, 19 showed a partial response and nine had disease stability or progression. Twelve of the 28 interim PET‐positive participants had a positive final PET. Hence, the remaining 16 had a negative final PET. NPV and PPV Quote: “Interim PET had a NPV of 98%, with 85.7% sensitivity, 86.4% specificity and 86.4% diagnostic accuracy.” Quote: “[In univariable analysis] the only independent predictor is the result of interim PET. […] PET had a PPV of 42%.”
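The accuracy figures quoted for Orlacchio 2012 can be retraced from the counts in Table 7. The sketch below assumes that the final PET result served as the reference standard for these figures, which is an assumption of this example rather than something stated in the quoted passages. The variable names are illustrative.

```python
# 2x2 counts from Table 7 (Orlacchio 2012), taking final PET as the reference
tp = 12   # interim PET2-positive, final PET-positive
fp = 16   # interim PET2-positive, final PET-negative
fn = 2    # interim PET2-negative, final PET-positive
tn = 102  # interim PET2-negative, final PET-negative

total = tp + fp + fn + tn  # 132 participants

metrics = {
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "accuracy": (tp + tn) / total,
    "npv": tn / (tn + fn),
    "ppv": tp / (tp + fp),
}

for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Under this assumption the sensitivity, specificity, accuracy and NPV agree with the quoted values after rounding. The PPV comes out at 12/28, about 42.9%, against the 42% quoted in the publication.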

Discussion

Summary of main results

In this systematic review, we summarised unadjusted data for interim positron emission tomography (PET) scan results as a prognostic factor in individuals with classic Hodgkin lymphoma (HL). The results of an interim PET scan during therapy, e.g. after two cycles of chemotherapy, have been suggested as a good predictor of outcome. Interim PET scan results have also been suggested as an indicator to guide further treatment in order to achieve the best possible outcome in those who have a poor prognosis and those who have a good prognosis, while also minimising adverse events due to the toxicity of the chemotherapy. The results of our review are summarised in the summary of findings Table 1.

The findings emerging from meta‐analyses are as follows.

Unadjusted results for overall survival (OS) show a large advantage for participants with a negative interim PET scan result compared to participants with a positive interim PET scan result. We rated the certainty of the evidence as 'moderate'.
Unadjusted results for progression‐free survival (PFS) show an advantage for participants with a negative interim PET scan result compared to participants with a positive interim PET scan result, but the evidence is very uncertain. We rated the certainty of the evidence as 'very low'.

The findings of the adjusted results from multivariable analyses, reported narratively in this review, are as follows.

Adjusted results for OS indicate an independent prognostic ability of interim PET beyond other associated factors. We rated our certainty of the evidence as 'moderate'.
Adjusted results for PFS indicate that there may be an independent prognostic ability of interim PET beyond other associated factors. We rated our certainty of the evidence as 'low'.

No study measured adverse events (AEs) associated with PET.

Overall completeness and applicability of the evidence

The evidence in this review mostly applies to adults who were newly diagnosed with classic HL, and who receive a PET scan in combination with CT (PET‐CT) after two cycles of chemotherapy (PET2). The studies included in this review addressed our research question in a total of 7335 male and female participants representing all stages of classic HL (Ann Arbor stages I ‐ IV with A or B symptoms). Nine studies included individuals aged 18 years or older, while the remaining studies also included adolescents and young adults (the youngest being 13 years of age, although most studies started from the age of 16 and onwards). Overall, the findings from this review support the statement that in this group of individuals, interim PET scan results can predict OS and PFS. Most participants in the included studies received ABVD (adriamycin/doxorubicin, bleomycin, vinblastine and dacarbazine) chemotherapy, which is the standard treatment regimen for early‐stage disease (Bröckelmann 2018; Engert 2010). However, as participants can have different therapy regimens, which are decided based on their disease stage and other clinical or individual characteristics, results should always be interpreted with caution for different patient groups, and this naturally limits the applicability of the evidence for all people with classic HL. Twelve out of 23 studies reported our primary outcome of interest OS, while 21 studies reported PFS. No study reported PET‐associated AEs. As the main aim of the review was to identify the prognostic value of interim PET results to predict survival outcomes, it is unlikely that studies on prognosis will measure or report AEs.

Heterogeneity between the studies was also found with regard to the evaluation of the interim PET scan, as studies used different criteria for the interpretation of the results. Most studies used the Deauville five‐point scale (DS 1 ‐ 5) for the evaluation of the PET scans. However, different cut‐off values were used for PET‐positivity. Most studies considered scores one to three (DS 1‐3) for PET‐negativity, and scores four to five (DS 4‐5) for PET‐positivity. In some studies, however, DS3 was also considered (or tested) for PET‐positivity. Results from these studies should be interpreted with caution, as using a score of ≥3 can have an important impact on the results and possibly introduce bias. Firstly, using this cut‐off can lead to an increased number of false‐positive results for interim PET (Casasnovas 2019). This can have a relevant impact for the individual, if treatment is modified based on the interim PET scan results (such as in the studies by Andre 2017; Casasnovas 2019; Kobe 2018). Furthermore, using this cut‐off can lead to an overestimation of the positive outcomes in the interim PET‐positive group. In the study by Kobe 2018, in which cut‐offs DS3 and DS4 were tested for PET‐positivity, the results showed no significant difference in DS1‐2 compared to DS3, but a significant difference between DS1‐3 and DS4. On this basis, the authors argue for DS4 as the cut‐off value for PET‐positivity, which is interpreted as an [18F]‐fluorodeoxy‐D‐glucose (FDG) uptake higher than in the liver, instead of an uptake higher than in the mediastinum (corresponding to DS3) (Kobe 2018). Hence, the implementation of a commonly used cut‐off in clinical practice is important in order to improve interobserver reliability and agreement between central reviewers, and is also crucial for the individual (Kobe 2018; Meignan 2009a). In the remaining studies included in this review, either different criteria were used (e.g. International Harmonization Project in Lymphoma criteria (Juweid 2007)), or no specific scale was indicated. However, in most studies, at least two nuclear medicine physicians independently interpreted the interim PET scan results.

One of the greatest issues regarding the prognostic factor studies in this review relates to the deficient reporting of their statistical analyses. Even when the methods of the statistical analyses were appropriate for the study design, the data were insufficiently reported in many of the included studies. We used hazard ratios (HRs) as the effect measure for time‐to‐event data in this review. We were able to pool data from only 15 studies, either because the HR and associated standard error (SE) were not reported, or because we did not have separate data for our participants or outcomes of interest. Out of these 15 studies, six studies reported an HR, but we still re‐calculated the value for four of them for different reasons. For example, values were re‐calculated either when we detected discrepancies between the text and corresponding graph(s) and table(s), or when they were simply not reported, while other relevant data were, helping us to estimate the HR and SE. For the remaining studies, we estimated the HR using other available data where possible (Altman 2012; Parmar 1998; Tierney 2007; Trivella 2006). For this reason, we contacted 10 principal investigators to clarify our questions and provide us with additional information or data, or both. This step was particularly helpful for deciding which data to pool.
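To illustrate what estimating an HR from other available data can look like, the following is a minimal sketch of one indirect approach of the kind described in the cited methods literature (Parmar 1998; Tierney 2007): approximating the log HR and its SE from a two‐sided logrank P value, the total number of events and the group sizes. The text does not state which method was applied to which study, so this is an illustration only. The function name and the input numbers are hypothetical and do not come from any included study.

```python
import math
from statistics import NormalDist

def hr_from_logrank_p(p_two_sided, total_events, n_pos, n_neg):
    """Approximate (HR, SE of ln HR) for PET-positive vs PET-negative.

    Uses the approximation V = events * n_pos * n_neg / (n_pos + n_neg)^2
    for the logrank variance, and (O - E) = sqrt(V) * z, where z is the
    normal quantile for the two-sided P value. The direction of the effect
    (more events than expected in the PET-positive group) is assumed.
    """
    v = total_events * n_pos * n_neg / (n_pos + n_neg) ** 2
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    o_minus_e = math.sqrt(v) * z
    log_hr = o_minus_e / v
    se = 1 / math.sqrt(v)
    return math.exp(log_hr), se

# Hypothetical inputs chosen only to show the calculation
hr, se = hr_from_logrank_p(p_two_sided=0.01, total_events=20, n_pos=30, n_neg=70)
print(f"Approximate HR = {hr:.2f}, SE(ln HR) = {se:.3f}")
```

Because the variance term depends on the total number of events, estimates obtained this way become very imprecise when only a handful of events are reported.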

We prespecified in our protocol that we would only pool adjusted associations of the index prognostic factor if analyses were based on an identical set of covariates. Although this was not feasible for our review, we suggest that future authors of systematic reviews of prognostic factor studies consider pre‐specifying a core set of covariates (established prognostic factors) that are important to the disease under review, and should be investigated in the included studies (Riley 2019; Riley 2019b). In this way, authors may be able to pool adjusted effect estimates, if studies are homogeneous enough in the adjustment set of the other prognostic factors. In addition, we observed moderate between‐study heterogeneity, which is reflected in the I² and wide confidence intervals (CIs). We took these issues around the reporting in the studies into account when we assessed risk of bias and GRADE.
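For readers less familiar with the I² statistic, the sketch below shows how it is derived from Cochran's Q in an inverse‐variance analysis of log HRs. It is a generic illustration with made‐up inputs, not a reconstruction of the analyses in this review. The function name is hypothetical, and fixed‐effect weights are used for Q as an assumption of this example.

```python
import math

def pooled_log_hr_and_i2(log_hrs, ses):
    """Return (pooled ln HR, Cochran's Q, I-squared in %) using
    inverse-variance (fixed-effect) weights."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_hrs))
    df = len(log_hrs) - 1
    # I-squared is the share of total variation beyond chance, floored at 0
    i2 = 0.0 if q <= 0 else max(0.0, (q - df) / q) * 100
    return pooled, q, i2

# Hypothetical study-level inputs (ln HR and SE), for illustration only
example_log_hrs = [1.2, 1.9, 1.5]
example_ses = [0.30, 0.45, 0.25]

pooled, q, i2 = pooled_log_hr_and_i2(example_log_hrs, example_ses)
print(f"Pooled HR = {math.exp(pooled):.2f}, Q = {q:.2f}, I2 = {i2:.0f}%")
```

I² expresses how much of the observed variation in study estimates exceeds what sampling error alone would produce, so it rises when individual studies are precise but disagree with each other.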

Furthermore, the pooled estimates of the prognostic effect of the interim PET scan result in our analyses are based on crude HRs (no adjustment for covariates); therefore, the reported results are at risk of overestimating the prognostic ability of the interim PET scan result. Hence, in light of the absence of adjustment for other prognostic factors, and considering the risk of bias assessment for the fifth domain of the QUIPS tool, we downgraded the strength of the evidence in our GRADE assessment. This is because it is widely acknowledged that adjusting the predictive effect of a specific prognostic factor for the contribution of other prognostic factors strengthens the robustness of the evidence on the clinically relevant prognostic ability of that factor (Riley 2019; Riley 2019b).

Lastly, although we did not conduct a test for funnel plot asymmetry as this type of test is not necessarily recommended for survival data due to issues of censoring (Debray 2018), we cannot exclude potential publication bias and the presence of small‐study effects in our review (Riley 2019). Firstly, we assume that publication bias may be present in our review as most studies in our analyses have rather small sample sizes, all of which present positive results on the prognostic ability of interim PET scan results. Secondly, most studies included in this review are retrospective studies that have not been pre‐registered, for example, in trial registries. Studies are also not always labelled or indexed as prognosis studies, and search filters for studies on prognosis are still under development, which is the main reason why we conducted a broad search with the disease (HL) and prognostic factor (PET) of interest. This led to a high number of search results that had to be screened. Thirdly, we identified a great number of conference abstracts on studies for which we could not find full‐text publications (see Characteristics of studies awaiting classification). Hence, based on these experiences, we cannot rule out that more studies may exist that have either not been published, or not been indexed properly.

Certainty of the evidence

Our certainty of the evidence is presented in the summary of findings Table 1.

Unadjusted results

For our primary outcome OS, we judged the certainty of the evidence as 'moderate'. We included nine studies in the meta‐analysis, of which eight were observational studies and one was a clinical trial. We used the data of participants from the standard arm (no treatment adaptation) of this trial. We judged the certainty of the evidence as 'moderate' due to some methodological issues. We downgraded by one point for risk of bias due to a high risk of bias in seven studies for the domain other prognostic factors (covariates), as well as a high risk of bias in three studies for the domain statistical analysis and reporting. In addition, we downgraded by one point for imprecision because the HR had to be estimated in seven studies, and re‐calculated in one study. Hence, only one out of nine studies reported an HR that we used. Nevertheless, we upgraded by one point for a large effect showing the large difference in the OS between interim PET‐positive and interim PET‐negative participants (HR 5.09, 95% CI 2.64 to 9.81).

For the outcome PFS, we judged the certainty of the evidence as 'very low'. We included 14 studies in the meta‐analysis, of which 12 were observational studies and two were clinical trials (participants from the standard arms). For this outcome, we downgraded by one point for inconsistency because the definition of PFS varied across the studies. We also downgraded by one point for imprecision because the HR had to be estimated in 10 studies and re‐calculated in one study. Hence, we were able to use a reported HR for only three out of 14 studies. In addition, we downgraded by one point for risk of bias, because of a high risk of bias in eight studies for the domain other prognostic factors (covariates), and high risk of bias in six studies for the domain 'statistical analysis and reporting'.

Adjusted results

For the outcome OS, two studies reported adjusted results from multivariable analyses including established prognostic factors (e.g. International Prognostic Score) in individuals with HL, and the results of both studies indicate the independent prognostic ability of interim PET to predict OS. We judged our certainty in the evidence as 'moderate' for this outcome due to some methodological issues. We downgraded by one point for risk of bias due to a high risk of bias in the domains other prognostic factors (covariates) and statistical analysis and reporting for one study.

For the outcome PFS, eight studies reported adjusted results (adjusted for e.g. disease stage or B symptoms). All studies found that interim PET scan results have an independent prognostic ability to predict PFS. However, we rated our certainty in the evidence as 'low' for this outcome. We downgraded by one point for risk of bias due to a high risk of bias in the domain study participation in one study, as well as a high risk of bias in the domains other prognostic factors (covariates) and statistical analysis and reporting in a second study. Furthermore, we downgraded by one point for inconsistency because the studies included a heterogeneous set of covariates in the multivariable analyses, which made pooling of the adjusted results infeasible.
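The four ratings above follow simple tally arithmetic. The sketch below reproduces them under one assumption that the text does not state explicitly but that is consistent with all four ratings: each body of evidence starts at 'high' certainty. The function name `grade_certainty` and the ordering of levels are illustrative choices, not part of the review's methods.

```python
# Illustrative tally of the GRADE judgements described above.
# Assumption: every body of evidence starts at 'high'; the text does not
# state the starting level, but it is consistent with all four ratings.
LEVELS = ["very low", "low", "moderate", "high"]


def grade_certainty(downgrades, upgrades=0, start="high"):
    """Return the certainty level after applying point changes, clamped to the scale."""
    index = LEVELS.index(start) - downgrades + upgrades
    return LEVELS[max(0, min(index, len(LEVELS) - 1))]


ratings = {
    # OS, unadjusted: risk of bias (-1), imprecision (-1), large effect (+1)
    "OS, unadjusted": grade_certainty(downgrades=2, upgrades=1),
    # PFS, unadjusted: inconsistency (-1), imprecision (-1), risk of bias (-1)
    "PFS, unadjusted": grade_certainty(downgrades=3),
    # OS, adjusted: risk of bias (-1)
    "OS, adjusted": grade_certainty(downgrades=1),
    # PFS, adjusted: risk of bias (-1), inconsistency (-1)
    "PFS, adjusted": grade_certainty(downgrades=2),
}

for outcome, level in ratings.items():
    print(f"{outcome}: {level}")
```

Under this assumption, the tally returns 'moderate', 'very low', 'moderate' and 'low', matching the judgements reported in this section.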

Potential biases in the review process

To prevent bias in this review, two teams of two review authors independently performed all relevant processes (i.e. screening, data extraction, risk of bias and GRADE assessment). Due to the complexity of assessing bias in prognostic factor studies, as well as assessing the certainty of the evidence from these types of studies, we conducted several teleconferences with different experts in the field of prognosis to discuss our assessments. We consulted Jill Hayden (Hayden 2013) for the 'Risk of bias' assessment, and the GRADE for Prognosis working group for the GRADE assessment. In particular, the methods for grading the evidence from prognosis studies are still under development.

For the 'Risk of bias' assessment, we are aware that adding 'unclear' as a fourth possible rating, and thereby setting an example for future authors, can introduce bias into the assessment. However, we used 'unclear' only when relevant information was evidently missing, which made it difficult to reach a fair and transparent judgement for the respective study and domain. We felt that rating a domain as high risk of bias in such cases would be inappropriate. We clearly advise against using 'unclear' as a default option and recommend that future authors of reviews of prognosis studies use this fourth rating carefully (should it be included in an update of the QUIPS tool).

Our analyses included post‐hoc subgroup analyses on the type of PET measurement (PET versus PET‐CT), as well as post‐hoc sensitivity analyses on the timing of the interim PET and the type of estimation (see Methods) used to estimate missing values. These analyses were necessary due to the heterogeneity between the studies. Results should be interpreted in light of differences that can exist when participants receive a PET‐CT as compared to a PET scan only. Furthermore, the timing of the interim PET is crucial, as PET1 and PET2 may provide different results compared to PET3 and PET4.

Regarding the adjusted results, we refrained from pooling because, although the studies examined established prognostic factors, they did not include identical sets of covariates. Given that the studies are already very heterogeneous, pooled adjusted estimates would be difficult to compare and interpret. To avoid this problem in the future, we suggest pre‐defining a core set of covariates in order to enable pooling of adjusted results (Riley 2019).

Agreements and disagreements with other studies or reviews

In our review, we included studies that have assessed the prognostic value of interim PET in HL participants without treatment modification. Overall, the findings from this review are in agreement with similar reviews and studies that have investigated the prognostic value of interim PET. Our results are also in agreement with the literature that interim PET can be used for disease and therapy monitoring (Barrington 2017a). Some reviews and studies have investigated this in participants in whom the treatment was changed based on the interim PET scan result, and have come to similar conclusions that interim PET can predict outcome in the different groups (PET‐negative and PET‐positive participants).

We are aware of three systematic reviews (Adams 2015a; Amitai 2018; Sickinger 2015) that have investigated interim PET as a prognostic factor. Adams 2015a included 10 studies with limited‐, intermediate‐ and advanced‐stage HL participants in whom the treatment regimen was not modified based on the interim PET scan results. Nine of these 10 studies are also included in our review; the remaining study was not included because it enrolled only children. The authors of this review concluded that a negative interim PET cannot exclude treatment failure, but that a positive interim PET can identify and predict treatment failure. The authors assessed the quality of the studies with the QUIPS tool (as we did in our review) and judged the overall methodological quality of the included studies as moderate. We compared their QUIPS assessment with ours for each individual study. For the domains study participation and study attrition in particular, both assessments agreed that the risk of bias in the studies is low. We disagreed on the domain prognostic factor measurement, for which the authors judged the quality as moderate, mainly because of the heterogeneity between the studies regarding the use of PET‐CT versus PET only; we addressed this issue in our review by subgroup analysis.

Comparison of interim PET with end PET

Nine of the included studies compared the performance of interim PET and end‐of‐treatment PET (end PET) (Barnes 2011; Hutchings 2006; Hutchings 2014; Markova 2012; Mesguich 2016; Orlacchio 2012; Straus 2011; Ying 2014; Zinzani 2012), as omitting one of the two scans can have an impact on radiation safety for the patient. However, results between studies are rather contradictory. For example, in Barnes 2011 the authors could not detect a significant difference in OS and PFS between interim PET‐negative and interim PET‐positive participants. In their analyses, interim PET‐positive participants who were negative at end PET had the same good outcomes as participants who were negative at both interim and end PET. In addition, the difference between end PET‐positive and end PET‐negative participants was fairly large, with greater four‐year OS and PFS for end PET‐negative participants. In this study, 74 of the 79 interim PET‐negative participants remained negative at end PET, while nine of the 17 interim PET‐positive participants remained positive at end PET. The authors concluded that end PET (after six cycles of chemotherapy), rather than interim PET (after two or four cycles of chemotherapy), predicts outcome. In Hutchings 2006, interim PET was conducted after two and four cycles of chemotherapy (total number of cycles was six to eight). Results show that PET2 and PET4 were similarly successful in predicting outcome in participants, but the authors of the study still argue that treatment modifications should be indicated as early as possible (e.g. after PET2) in order to achieve the best possible outcome. In the study by Mesguich 2016, interim PET also had a lower predictive ability than end PET. Out of 60 interim PET‐negative participants, seven converted to a positive end PET. Out of 16 interim PET‐positive participants, seven converted to a negative end PET. In addition, treatment failure was more common in participants with a positive end PET than in participants with a positive interim PET. The sensitivity of interim PET was 47%, compared with 80% for end PET (Mesguich 2016).

Contrastingly, Orlacchio 2012 detected a very high negative predictive value (NPV) of 98% for interim PET2, with an overall diagnostic accuracy of 86.4%. Out of 104 interim PET‐negative participants, 102 were still negative at end PET. Out of 28 interim PET‐positive participants, however, 16 converted to a negative end PET. A high NPV for interim PET was also found in Hutchings 2005 (interim PET2/3), as interim PET‐negative participants rarely relapsed. In Hutchings 2014, 89 participants had both an interim PET1 and PET2, and both showed a strong prognostic ability for predicting outcome. In this study, none of the early‐stage participants with a negative interim PET1 progressed or relapsed. Advanced‐stage participants with a negative interim PET1 had a long‐term PFS of more than 90%. The three‐year PFS of interim PET1‐positive participants was 30%. Of the 89 participants, 62 were PET1‐negative, of whom 60 were in complete remission after treatment. Twenty‐seven participants were PET1‐positive, of whom 15 were in complete remission. To compare, 76 participants were PET2‐negative, of whom 70 were in complete remission. Thirteen participants were PET2‐positive, of whom five were in complete remission. The negative predictive value of PET1 was reported as 96.8%, while the positive predictive value was 44.4%. Zinzani 2012 also reported that interim PET after two cycles is highly predictive of OS and PFS. In their study, 92% of the interim PET‐negative participants (n = 251) were in continuous complete remission, compared with 24.5% of the interim PET‐positive participants (n = 53). These conclusions are supported by Ying 2014, although their sample size (n = 35) is too small to provide definite answers. Straus 2011 supported these statements particularly for participants in early stages (as included in their study): at two years, participants with a negative interim PET2 result had a PFS of about 90%, compared with 50% for interim PET‐positive participants. Markova 2012 reported similar findings for interim PET4, which had a high NPV of 98%. Out of 68 participants in total, 50 had a negative interim PET and 59 a negative end PET. In other words, nine interim PET‐positive participants were end PET‐negative after chemotherapy. The other nine participants who were interim PET‐positive were also end PET‐positive. At both time points (PET4 and PET6/8) the authors found a significant difference in survival between PET‐positive and PET‐negative participants. The high NPV of interim PET supports early de‐escalation of chemotherapy, or omission of radiotherapy, in order to reduce the risk of toxicity and adverse events related to the harsh treatment.
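The predictive values quoted for Hutchings 2014 can be checked against the PET1 counts given above. The following is a minimal sketch, assuming that participants not in complete remission after treatment are counted as treatment failures; the function name `predictive_values` is illustrative and not taken from any of the cited studies.

```python
def predictive_values(neg_total, neg_in_remission, pos_total, pos_in_remission):
    """Return (NPV, PPV) as fractions.

    Assumption: 'not in complete remission' is treated as treatment failure,
    so NPV is the share of PET-negative participants in remission and PPV is
    the share of PET-positive participants not in remission.
    """
    npv = neg_in_remission / neg_total
    ppv = (pos_total - pos_in_remission) / pos_total
    return npv, ppv


# PET1 counts quoted above for Hutchings 2014:
# 62 PET1-negative (60 in complete remission), 27 PET1-positive (15 in complete remission)
npv1, ppv1 = predictive_values(62, 60, 27, 15)
print(f"PET1: NPV {npv1:.1%}, PPV {ppv1:.1%}")  # PET1: NPV 96.8%, PPV 44.4%
```

The same function could be applied to the counts of any other study in this section that reports both PET status and remission status.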

Treatment adaptation based on interim PET

Although not an aim of our review, we considered it important to discuss some results from recently published randomised controlled trials (RCTs) in which the interim PET scan result was used to adapt the therapy for individuals with HL in order to improve outcomes (Andre 2017; Casasnovas 2019; Johnson 2016; Kobe 2018), based on the premise that interim PET scan results are indeed prognostic. For example, in the trial by Johnson 2016, the primary aim was to test the omission of bleomycin due to its toxic effects. All participants (N = 1214, advanced stages) started with ABVD chemotherapy. After interim PET2, PET‐positive (DS4‐5) participants (N = 182) were assigned to BEACOPP, and PET‐negative (DS1‐3) participants (N = 935) were randomised to receive either ABVD or AVD. Results show that three‐year PFS was slightly better in the ABVD group compared to the AVD group (85.7% versus 84.4%, respectively). Regarding three‐year OS, the ABVD group reached 97.2% compared to 97.6% in the AVD group. Hence, there were no significant subgroup differences. However, grade 3 and 4 AEs due to the chemotherapy were more common in the ABVD group. In the PET‐positive group, which was escalated to BEACOPP chemotherapy, 3‐year PFS was 67.5% and 3‐year OS was 87.8%.

In another example, Casasnovas 2019 randomly assigned 823 advanced‐stage HL participants to a standard treatment group or a PET‐driven treatment group. All participants received two cycles of escalated BEACOPP as the initial therapy, and interim PET was conducted thereafter. PET‐positive participants in both groups, as well as PET‐negative participants in the standard group, continued with the initial therapy after PET2. PET‐negative participants in the experimental arm, however, were switched to two cycles of ABVD. Five‐year PFS was similar for PET‐negative participants in the standard group and the experimental group: 88.4% and 89.4%, respectively.

Several systematic reviews have also investigated treatment adaptation based on interim PET scan results. Amitai 2018 included 13 studies (of which four were RCTs) that investigated interim PET‐adapted treatment in advanced‐stage HL. Their findings support the statement that PET‐adapted treatment is an appropriate strategy and that it should be considered as standard care for advanced HL (Amitai 2018). This finding is supported by a Phase II RCT (Carras 2018), which assessed an interim PET‐response adapted treatment strategy in advanced‐stage HL. The authors concluded that early salvage therapy and high‐dose chemotherapy or autologous stem cell transplant (ASCT) for PET2‐positive participants is safe and can lead to similar positive outcomes as in PET2‐negative participants (Carras 2018). To compare, Sickinger 2015 included studies in which the treatment was also modified, but concluded that PFS was shorter in individuals with early‐stage HL and a negative PET scan receiving chemotherapy only (PET‐adapted therapy) than in those receiving additional RT (standard therapy). This finding was confirmed in another review by Blank 2017, showing improved PFS in early‐stage participants receiving radiotherapy in addition to chemotherapy. However, the overall methodological quality of the included studies in both reviews was judged as moderate (for PFS) to very low (for OS). Contrasting evidence on the clinical and prognostic value of interim PET‐adapted treatment was also found in non‐systematic reviews, which particularly acknowledge that the heterogeneity between available studies makes it difficult to draw definite conclusions (Adams 2016a; Berriolo‐Riedinger 2018).

Figure 1

Study flow diagram according to PRISMA

Figure 2

'Risk of bias' assessment according to QUIPS (Quality in Prognostic Studies) by outcome.

Figure 3

Forest plot of comparison: 1 Univariable comparison of PET+ve vs. PET‐ve, outcome: 1.1 Overall survival

Figure 4

Forest plot of comparison: 1 Univariable comparison of PET+ve vs. PET‐ve, outcome: 1.2 Progression‐free survival

Analysis 1.1

Comparison 1: Univariable comparison of PET+ve vs. PET‐ve, Outcome 1: Overall survival

Analysis 1.2

Comparison 1: Univariable comparison of PET+ve vs. PET‐ve, Outcome 2: Progression‐free survival

Analysis 2.1

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 1: OS by radiotherapy

Analysis 2.2

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 2: OS by study design

Analysis 2.3

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 3: OS by chemotherapy

Analysis 2.4

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 4: OS for PET/CT vs PET

Analysis 2.5

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 5: OS by disease stage

Analysis 2.6

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 6: Timing of interim PET

Analysis 2.7

Comparison 2: Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve, Outcome 7: OS by HR type of estimation

Analysis 3.1

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 1: PFS by study design

Analysis 3.2

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 2: PFS by chemotherapy

Analysis 3.3

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 3: PFS for PET/CT vs PET

Analysis 3.4

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 4: PFS by disease stage

Analysis 3.5

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 5: PFS by radiotherapy

Analysis 3.6

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 6: Timing of interim PET

Analysis 3.7

Comparison 3: Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve, Outcome 7: PFS by HR type of estimation

Summary of findings 1. Comparison of interim PET‐negative and interim PET‐positive individuals with Hodgkin Lymphoma

Comparison of interim PET‐positive and interim PET‐negative participants with Hodgkin lymphoma

Population: Individuals with Hodgkin lymphoma

Setting: Eleven studies recruited participants from a total of 28 haemato‐oncology treatment centres/hospitals in Brazil (N = 1), China (N = 1), Denmark (N = 4), France (N = 4), Italy (N = 3), Poland (N = 11), UK (N = 2) and the USA (N = 2). One study (Straus 2011) included participants from 29 institutions, but did not report the countries. One study (Simon 2016) reported the country (Hungary) but not the number of centres. One multi‐centre study (Hutchings 2014) recruited participants from four countries (USA, Italy, Poland and Denmark). One RCT (Kobe 2018) included participants from 301 hospitals and private practices in Germany, Switzerland, Austria, the Netherlands, and the Czech Republic.

Outcomes	Anticipated absolute effects* (95% CI): risk with interim PET‐negative	Anticipated absolute effects* (95% CI): risk with interim PET‐positive	Relative effect (95% CI)	№ of participants (studies)	Certainty of the evidence (GRADE)	Comments
Overall survival Follow up: 3 years			HR 5.09 (2.64 to 9.81)	1802 (9 studies)	⊕⊕⊕⊝ MODERATE ² ³ ⁴	
Low	900 per 1000 ¹	585 per 1000 ¹ (356 to 757)				
High	980 per 1000 ¹	902 per 1000 ¹ (820 to 948)				
Progression‐free survival Follow up: 3 years			HR 4.90 (3.47 to 6.90)	2079 (14 studies)	⊕⊝⊝⊝ VERY LOW ⁶ ⁷ ⁸	
Low	850 per 1000 ⁵	451 per 1000 ⁵ (326 to 569)				
High	940 per 1000 ⁵	738 per 1000 ⁵ (653 to 807)				
Adverse events associated with PET ‐ not reported	No study measured PET‐associated adverse events.		‐	‐	‐	
Overall survival (adjusted effect estimate)	Two studies reported an adjusted effect estimate for overall survival after interim PET2: hazard ratios of 3.2 (95% CI 1.3 to 8.4, P = 0.02) (Kobe 2018) and 11.51 (95% CI 3.14 to 42.86, P < 0.001) (Simon 2015) indicate the independent prognostic value of interim PET over and above other clinically relevant prognostic factors.		‐	843 (2 studies)	⊕⊕⊕⊝ MODERATE ⁹	
Progression‐free survival (adjusted effect estimate)	Eight studies conducted a multivariable analysis to test the independent prognostic value of interim PET over and above other clinically relevant prognostic factors. Four of these studies reported a hazard ratio as the adjusted effect estimate, with values ranging from 2.4 to 36.89, indicating the independent prognostic value of interim PET2.¹⁰		‐	996 (4 studies)¹⁰	⊕⊕⊝⊝ LOW ¹¹ ¹²	

*The survival in the PET‐positive group (and its 95% confidence interval) is based on the assumed survival in the PET‐negative group.
CI: Confidence interval; HR: Hazard ratio; PET: positron emission tomography

GRADE Working Group grades of evidence
High certainty: We are very confident that the true effect lies close to that of the estimate of the effect
Moderate certainty: We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different
Low certainty: Our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate of the effect
Very low certainty: We have very little confidence in the effect estimate: The true effect is likely to be substantially different from the estimate of effect

¹ The assumed event‐free survival in the control group is based on the survival rate of the interim PET‐negative participants at 3 years in the studies included (the lowest survival rate from Cerci 2010 and the highest survival rate from Kobe 2018).
² High risk of bias in seven studies for the domain 'other prognostic factors (covariates)', and high risk of bias in three studies for the domain 'statistical analysis and reporting'. Downgraded by 1 point for risk of bias.
³ For one study we used the reported hazard ratio. For seven studies we had to estimate the hazard ratio and for one study we re‐calculated it (Trivella 2006). Downgraded by 1 point for imprecision.
⁴ Upgraded by one point due to the large effect showing the large difference between interim PET‐negative and interim PET‐positive participants (HR 5.09, CI 2.64 to 9.81).
⁵ The assumed event‐free survival in the control group is based on the survival rate of the interim PET‐negative participants at 3 years in the studies included (the lowest survival rate from Rossi 2014 and the highest survival rate from Kobe 2018).
⁶ High risk of bias in eight studies for the domain 'other prognostic factors (covariates)', and high risk of bias in six studies for the domain 'statistical analysis and reporting'. Downgraded by 1 point for risk of bias.
⁷ The definition of PFS varied across studies. Downgraded by 1 point for inconsistency.
⁸ For three studies we used the reported hazard ratio. For ten studies we had to estimate the value, and for one study we had to re‐calculate it (Trivella 2006). Downgraded by 1 point for imprecision.
⁹ High risk of bias for the domains 'other prognostic factors (covariates)' and 'statistical analysis and reporting' for one study (Simon 2016). Downgraded by 1 point for risk of bias.
¹⁰ Hutchings 2006; Kobe 2018; Mesguich 2016; Simon 2016.
¹¹ High risk of bias for the domains 'other prognostic factors (covariates)' and 'statistical analysis and reporting' for one study (Simon 2016). Also high risk of bias for the domain 'study participation' in one study (Hutchings 2006). Downgraded by 1 point for risk of bias.
¹² Studies included a heterogeneous set of covariates in the adjusted analyses. Downgraded by 1 point for inconsistency.
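The anticipated absolute effects in the table are stated to be based on the assumed survival in the PET‐negative group. A minimal sketch of one way to obtain such values is shown below. It assumes proportional hazards, so that survival in the PET‐positive group equals the assumed PET‐negative survival raised to the power of the hazard ratio. This relationship is an assumption on our part rather than a formula quoted from the review, although it reproduces the tabulated figures; the function names are illustrative.

```python
def survival_in_pet_positive(assumed_survival_negative, hazard_ratio):
    """Survival in the PET-positive group under proportional hazards: S_pos = S_neg ** HR."""
    return assumed_survival_negative ** hazard_ratio


def per_1000(assumed_per_1000, hr, hr_low, hr_high):
    """Return (point, lower, upper) survivors per 1000 in the PET-positive group."""
    s_neg = assumed_per_1000 / 1000
    point = round(1000 * survival_in_pet_positive(s_neg, hr))
    # A larger hazard ratio gives lower survival, so the CI bounds swap.
    lower = round(1000 * survival_in_pet_positive(s_neg, hr_high))
    upper = round(1000 * survival_in_pet_positive(s_neg, hr_low))
    return point, lower, upper


# Overall survival, HR 5.09 (2.64 to 9.81)
print(per_1000(900, 5.09, 2.64, 9.81))  # (585, 356, 757)
print(per_1000(980, 5.09, 2.64, 9.81))  # (902, 820, 948)
# Progression-free survival, HR 4.90 (3.47 to 6.90)
print(per_1000(850, 4.90, 3.47, 6.90))  # (451, 326, 569)
print(per_1000(940, 4.90, 3.47, 6.90))  # (738, 653, 807)
```

Because the hazard ratio is applied as an exponent, the same relative effect translates into a much larger absolute difference when the assumed PET‐negative survival is lower.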


Comparison 1. Univariable comparison of PET+ve vs. PET‐ve

Outcome or subgroup title	No. of studies	No. of participants	Statistical method	Effect size
1.1 Overall survival	9	1802	Hazard Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
1.2 Progression‐free survival	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
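The label "IV, Random" in the table denotes inverse‐variance weighting under a random‐effects model. The sketch below illustrates the general idea: it recovers the standard error of a log hazard ratio from its 95% CI and pools study estimates with the DerSimonian‐Laird estimator of between‐study variance. The choice of estimator is an assumption, since this section does not name it, and the study‐level inputs are hypothetical values, not data from the review.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_hr_and_se(hr, ci_low, ci_high):
    """Log hazard ratio and its standard error recovered from a 95% CI."""
    return math.log(hr), (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)


def pool_random_effects(estimates):
    """Pool (log_hr, se) pairs with DerSimonian-Laird; return (HR, CI low, CI high)."""
    y = [e[0] for e in estimates]
    w = [1 / e[1] ** 2 for e in estimates]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1 / (e[1] ** 2 + tau2) for e in estimates]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return tuple(math.exp(v) for v in (pooled, pooled - Z95 * se, pooled + Z95 * se))


# Hypothetical study-level inputs (HR, CI low, CI high), NOT taken from the review
studies = [(3.0, 1.5, 6.0), (6.0, 2.0, 18.0), (4.0, 1.0, 16.0)]
hr, low, high = pool_random_effects([log_hr_and_se(*s) for s in studies])
print(f"Pooled HR {hr:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Working on the log scale keeps the confidence interval symmetric around the log hazard ratio, which is why the standard error can be recovered from the reported interval limits.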


Comparison 2. Subgroups in univariable comparison of OS: PET+ve vs. PET‐ve

Outcome or subgroup title	No. of studies	No. of participants	Statistical method	Effect size
2.1 OS by radiotherapy	9	1802	Hazard Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
2.1.1 Involved node and/or site	3	548	Hazard Ratio (IV, Random, 95% CI)	3.45 [1.22, 9.72]
2.1.2 Involved field	4	428	Hazard Ratio (IV, Random, 95% CI)	12.75 [4.98, 32.68]
2.1.3 Not specified	2	826	Hazard Ratio (IV, Random, 95% CI)	2.80 [1.17, 6.67]
2.2 OS by study design	8	1717	Hazard Ratio (IV, Random, 95% CI)	4.63 [2.43, 8.80]
2.2.1 Prospective	3	406	Hazard Ratio (IV, Random, 95% CI)	5.35 [1.07, 26.68]
2.2.2 Retrospective	4	589	Hazard Ratio (IV, Random, 95% CI)	7.12 [3.14, 16.14]
2.2.3 RCT	1	722	Hazard Ratio (IV, Random, 95% CI)	2.60 [1.03, 6.56]
2.3 OS by chemotherapy	9	1802	Hazard Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
2.3.1 ABVD	5	801	Hazard Ratio (IV, Random, 95% CI)	5.19 [2.11, 12.72]
2.3.2 ABVD and/or other	3	279	Hazard Ratio (IV, Random, 95% CI)	10.30 [1.71, 62.13]
2.3.3 BEACOPP	1	722	Hazard Ratio (IV, Random, 95% CI)	2.60 [1.03, 6.56]
2.4 OS for PET/CT vs PET	8	1706	Hazard Ratio (IV, Random, 95% CI)	5.01 [2.50, 10.02]
2.4.1 PET/CT	5	595	Hazard Ratio (IV, Random, 95% CI)	4.70 [1.86, 11.86]
2.4.2 PET only	3	1111	Hazard Ratio (IV, Random, 95% CI)	6.99 [1.58, 30.90]
2.5 OS by disease stage	9	1802	Odds Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
2.5.1 Stages I and II with A and B symptoms	1	96	Odds Ratio (IV, Random, 95% CI)	9.21 [0.71, 120.03]
2.5.2 All stages	7	984	Odds Ratio (IV, Random, 95% CI)	6.28 [2.62, 15.05]
2.5.3 Advanced	1	722	Odds Ratio (IV, Random, 95% CI)	2.60 [1.03, 6.56]
2.6 Timing of interim PET	9	1802	Hazard Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
2.6.1 PET2	6	1495	Hazard Ratio (IV, Random, 95% CI)	3.53 [1.97, 6.32]
2.6.2 Other (including mixed)	3	307	Hazard Ratio (IV, Random, 95% CI)	20.13 [5.04, 80.38]
2.7 OS by HR type of estimation	9	1802	Hazard Ratio (IV, Random, 95% CI)	5.09 [2.64, 9.81]
2.7.1 Precise	7	1638	Hazard Ratio (IV, Random, 95% CI)	5.70 [2.60, 12.48]
2.7.2 Imprecise	2	164	Hazard Ratio (IV, Random, 95% CI)	3.60 [0.89, 14.64]


Comparison 3. Subgroups in univariable comparison of PFS: PET+ve vs. PET‐ve

Outcome or subgroup title	No. of studies	No. of participants	Statistical method	Effect size
3.1 PFS by study design	13	1349	Hazard Ratio (IV, Random, 95% CI)	5.66 [4.02, 7.97]
3.1.1 Prospective	3	357	Hazard Ratio (IV, Random, 95% CI)	3.95 [2.23, 7.00]
3.1.2 Retrospective	8	827	Hazard Ratio (IV, Random, 95% CI)	6.85 [4.66, 10.08]
3.1.3 RCT	2	165	Hazard Ratio (IV, Random, 95% CI)	6.21 [2.87, 13.42]
3.2 PFS by chemotherapy	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
3.2.1 ABVD	7	945	Hazard Ratio (IV, Random, 95% CI)	5.13 [3.18, 8.27]
3.2.2 ABVD and/or other	4	265	Hazard Ratio (IV, Random, 95% CI)	7.07 [3.40, 14.70]
3.2.3 Other non‐ABVD chemotherapy	3	869	Hazard Ratio (IV, Random, 95% CI)	3.64 [1.83, 7.24]
3.3 PFS for PET/CT vs PET	13	1983	Hazard Ratio (IV, Random, 95% CI)	5.08 [3.57, 7.21]
3.3.1 PET/CT	8	707	Hazard Ratio (IV, Random, 95% CI)	6.03 [3.68, 9.90]
3.3.2 PET only	5	1276	Hazard Ratio (IV, Random, 95% CI)	4.06 [2.33, 7.08]
3.4 PFS by disease stage	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
3.4.1 Stages I and II with A and B symptoms	2	184	Hazard Ratio (IV, Random, 95% CI)	3.88 [1.54, 9.83]
3.4.2 All stages	11	1173	Hazard Ratio (IV, Random, 95% CI)	5.81 [3.93, 8.57]
3.4.3 Advanced	1	722	Hazard Ratio (IV, Random, 95% CI)	2.27 [1.35, 3.82]
3.5 PFS by radiotherapy	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
3.5.1 Involved node and/or site	5	651	Hazard Ratio (IV, Random, 95% CI)	5.35 [2.94, 9.75]
3.5.2 Involved field	6	514	Hazard Ratio (IV, Random, 95% CI)	7.06 [4.15, 12.00]
3.5.3 Not specified	2	826	Hazard Ratio (IV, Random, 95% CI)	2.97 [1.48, 5.98]
3.5.4 None	1	88	Hazard Ratio (IV, Random, 95% CI)	5.09 [1.95, 13.29]
3.6 Timing of interim PET	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
3.6.1 PET2	9	1677	Hazard Ratio (IV, Random, 95% CI)	4.68 [3.14, 6.98]
3.6.2 Other (including mixed)	5	402	Hazard Ratio (IV, Random, 95% CI)	6.32 [3.40, 11.75]
3.7 PFS by HR type of estimation	14	2079	Hazard Ratio (IV, Random, 95% CI)	4.90 [3.47, 6.90]
3.7.1 Precise	9	1450	Hazard Ratio (IV, Random, 95% CI)	4.69 [2.84, 7.73]
3.7.2 Imprecise	5	629	Hazard Ratio (IV, Random, 95% CI)	5.66 [3.65, 8.77]

