Non‐invasive diagnostic tests for Helicobacter pylori infection

Abstract

Background

Helicobacter pylori (H pylori) infection has been implicated in a number of malignancies and non‐malignant conditions, including peptic ulcers, non‐ulcer dyspepsia, recurrent peptic ulcer bleeding, unexplained iron deficiency anaemia, idiopathic thrombocytopenic purpura, and colorectal adenomas. The confirmatory diagnosis of H pylori is by endoscopic biopsy, followed by histopathological examination using haematoxylin and eosin (H & E) stain or special stains such as Giemsa stain and Warthin‐Starry stain. Special stains are more accurate than H & E stain. There is significant uncertainty about the diagnostic accuracy of non‐invasive tests for diagnosis of H pylori.

Objectives

To compare the diagnostic accuracy of urea breath test, serology, and stool antigen test, used alone or in combination, for diagnosis of H pylori infection in symptomatic and asymptomatic people, so that eradication therapy for H pylori can be started.

Search methods

On 4 March 2016, we searched MEDLINE, Embase, the Science Citation Index, and the National Institute for Health Research Health Technology Assessment Database. We screened references in the included studies to identify additional studies. We also conducted citation searches of relevant studies, most recently on 4 December 2016. We did not restrict studies by language or publication status, or by whether data were collected prospectively or retrospectively.

Selection criteria

We included diagnostic accuracy studies that evaluated at least one of the index tests (urea breath test using isotopes such as 13C or 14C, serology, and stool antigen test) against the reference standard (histopathological examination using H & E stain, special stains, or immunohistochemical stain) in people suspected of having H pylori infection.

Data collection and analysis

Two review authors independently screened the references to identify relevant studies and independently extracted data. We assessed the methodological quality of studies using the QUADAS‐2 tool. We performed meta‐analysis using the hierarchical summary receiver operating characteristic (HSROC) model to estimate and compare SROC curves. Where appropriate, we used bivariate or univariate logistic regression models to estimate summary sensitivities and specificities.

Main results

We included 101 studies involving 11,003 participants, of whom 5839 participants (53.1%) had H pylori infection. The prevalence of H pylori infection in the studies ranged from 15.2% to 94.7%, with a median prevalence of 53.7% (interquartile range 42.0% to 66.5%). Most of the studies (57%) included participants with dyspepsia, and 53 studies excluded participants who had recently had antibiotics or proton pump inhibitors. There was at least an unclear risk of bias or unclear applicability concern for each study.

Of the 101 studies, 15 compared the accuracy of two index tests and two studies compared the accuracy of three index tests. Thirty‐four studies (4242 participants) evaluated serology; 29 studies (2988 participants) evaluated stool antigen test; 34 studies (3139 participants) evaluated urea breath test‐13C; 21 studies (1810 participants) evaluated urea breath test‐14C; and two studies (127 participants) evaluated urea breath test but did not report the carbon isotope used. The thresholds used to define test positivity and the staining techniques used for histopathological examination (reference standard) varied between studies. Because data were sparse for each reported threshold, it was not possible to identify the best threshold for each test.

Using data from 99 studies in an indirect test comparison, there was statistical evidence of a difference in diagnostic accuracy between urea breath test‐13C, urea breath test‐14C, serology, and stool antigen test (P = 0.024). The diagnostic odds ratios for urea breath test‐13C, urea breath test‐14C, serology, and stool antigen test were 153 (95% confidence interval (CI) 73.7 to 316), 105 (95% CI 74.0 to 150), 47.4 (95% CI 25.5 to 88.1), and 45.1 (95% CI 24.2 to 84.1). The sensitivity estimated at a fixed specificity of 0.90 (median from studies across the four tests) was 0.94 (95% CI 0.89 to 0.97) for urea breath test‐13C, 0.92 (95% CI 0.89 to 0.94) for urea breath test‐14C, 0.84 (95% CI 0.74 to 0.91) for serology, and 0.83 (95% CI 0.73 to 0.90) for stool antigen test. This implies that on average, given a specificity of 0.90 and a prevalence of 53.7% (median specificity and prevalence in the studies), out of 1000 people tested for H pylori infection there will be 46 false positives (people without H pylori infection who will be diagnosed as having H pylori infection). In this hypothetical cohort, urea breath test‐13C, urea breath test‐14C, serology, and stool antigen test will give 30 (95% CI 15 to 58), 42 (95% CI 30 to 58), 86 (95% CI 50 to 140), and 89 (95% CI 52 to 146) false negatives respectively (people with H pylori infection for whom the diagnosis of H pylori will be missed).

Direct comparisons were based on few head‐to‐head studies. The ratios of diagnostic odds ratios (DORs) were 0.68 (95% CI 0.12 to 3.70; P = 0.56) for urea breath test‐13C versus serology (seven studies), and 0.88 (95% CI 0.14 to 5.56; P = 0.84) for urea breath test‐13C versus stool antigen test (seven studies). The 95% CIs of these estimates overlap with those of the ratios of DORs from the indirect comparison. Data were limited or unavailable for meta‐analysis of other direct comparisons.

Authors' conclusions

In people without a history of gastrectomy and those who have not recently had antibiotics or proton pump inhibitors, urea breath tests had high diagnostic accuracy while serology and stool antigen tests were less accurate for diagnosis of Helicobacter pylori infection. This is based on an indirect test comparison (with potential for bias due to confounding), as evidence from direct comparisons was limited or unavailable. The thresholds used for these tests were highly variable and we were unable to identify specific thresholds that might be useful in clinical practice.

Further comparative studies of high methodological quality are needed to obtain more reliable evidence of relative accuracy between the tests. Such studies should be conducted prospectively in a representative spectrum of participants and clearly reported to ensure low risk of bias. Most importantly, studies should prespecify and clearly report the thresholds used, and should avoid inappropriate exclusions.

Plain language summary

Accuracy of different non‐invasive methods for identifying Helicobacter pylori

Why is it important to know whether someone has Helicobacter pylori?

Helicobacter pylori (H pylori) is a type of bacteria that may be present in the stomach of some people. H pylori is believed to cause a number of cancers, including stomach cancer, pancreatic cancer, and throat cancer. H pylori is also linked with other diseases, including stomach ulcers, heartburn, and a bloated feeling. If H pylori is found in an individual, appropriate treatment can be started.

What is the aim of this review?

To compare the accuracy of three different types of test for H pylori. These are: urea breath tests, blood tests (the specific blood test is called serology), and stool tests (in faeces).

What was studied in this review?

There are two types of urea breath test, which use two different forms of carbon known as 13C and 14C, as well as multiple versions of serology and stool tests.

What are the main results of the review?

We found 101 studies that included 11,003 people who were tested for H pylori. Of these 11,003 participants, 5839 (53.1%) had H pylori infection. All the studies used one of the three tests listed above and compared the results of these tests with the diagnosis from endoscopic biopsy. Endoscopic biopsy involves obtaining tissue from the stomach using a thin flexible tube introduced through the mouth, and examining it under the microscope for the presence of H pylori. It is currently the most accurate test available; however, it causes physical discomfort to the patient and carries associated risks. In contrast, the alternative non‐invasive tests in this review are significantly less uncomfortable and carry minimal or no risk, making them desirable options if they can be shown to diagnose H pylori as accurately as endoscopic biopsy. Most studies included participants with heartburn or similar stomach problems, and excluded participants who had previously had part of their stomach removed and those who were receiving treatment for H pylori.

Thirty‐four studies (4242 participants) used serology; 29 studies (2988 participants) used stool antigen test; 34 studies (3139 participants) used urea breath test‐13C; 21 studies (1810 participants) used urea breath test‐14C; and two studies (127 participants) used urea breath test but did not report the type of carbon used. Studies varied in the cut‐off level they used before deciding that a test was positive for H pylori infection, and in the type of stain used to examine the biopsy material. When all the data were considered, urea breath tests were found to be more accurate than blood and stool tests. The results mean that, on average, if 1000 people are tested, there will be 46 people without H pylori who will be misdiagnosed as having H pylori infection. In addition, there will be 30, 42, 86, and 89 people with H pylori infection for whom the diagnosis will be missed by urea breath test‐13C, urea breath test‐14C, serology, and stool antigen test, respectively. When we looked at the seven studies that compared urea breath test‐13C and serology, or urea breath test‐13C and stool antigen tests, in the same participants, the results were uncertain and we cannot say which test is more accurate.

How reliable are the results of the studies?

Except for one study, all the studies were of poor methodological quality, which makes their results unreliable.

Who do the results of this review apply to?

These results apply to children and adults with suspected H pylori infection, but only to those who have not previously undergone stomach surgery and those who have not recently had antibiotics or treatment for H pylori infection.

What are the implications of this review?

Urea breath tests, blood tests, and stool tests may be suitable for identifying whether someone has H pylori infection. However, the level of the urea breath test, blood test, or stool test result that should be used to make a diagnosis of H pylori infection remains unclear.

How up‐to‐date is this review?

We performed a thorough literature search for studies reporting the accuracy of these different tests up to 4 March 2016.

Authors' conclusions

Implications for practice

In people with no history of gastrectomy and those who have not recently had antibiotics or proton pump inhibitors, urea breath tests had high diagnostic accuracy while serology and stool antigen tests had lower accuracy to detect H pylori infection. Although susceptible to bias due to confounding, this conclusion is based on evidence from indirect test comparisons, as evidence from direct comparisons was based on few studies or was unavailable. There was high or unclear risk of bias for many studies with respect to the selection of participants, and the conduct and interpretation of the index tests and reference standard. The thresholds used for these tests were highly variable, thus there is insufficient evidence to identify specific thresholds that might be useful in clinical practice.

Implications for research

Further comparative studies of high methodological quality are necessary to obtain more reliable evidence of accuracy between the tests (urea breath tests, serology, and stool antigen tests) in people with upper gastrointestinal symptoms and people without any symptoms suggestive of H pylori. Such studies should be conducted prospectively in a representative spectrum of participants, and be clearly reported to ensure low risk of bias. Most importantly, studies should pre‐specify and clearly report the thresholds used, should apply appropriate reference standards such as endoscopic biopsy with special stains, and should avoid inappropriate exclusions.

Summary of findings

Summary of findings. Performance of non‐invasive tests for diagnosis of H pylori infection

What is the best non‐invasive test for diagnosis of H pylori infection?

Population

Children and adults with gastrointestinal symptoms

Setting

Primary care setting

Index tests

Urea breath test‐13C, Urea breath test‐14C, serology, and stool antigen test

Threshold

Various thresholds were used for each test

Role and purpose of test

Screening and diagnosis of H pylori

Reference standard

Endoscopic biopsy with haematoxylin and eosin (H & E) stain, special stains, or a combination of H & E and special stains

Quality of evidence

Risk of bias was generally high or unclear with respect to the selection of participants, and the conduct and interpretation of the index tests and reference standard. Applicability concerns were also generally high or unclear with respect to selection of participants

Limitations

There was heterogeneity in thresholds and reference standards. Studies did not often prespecify or clearly report thresholds used

Pre‐test probability (prevalence of Helicobacter pylori)

Median (interquartile range) = 53.7% (42.0% to 66.5%)

| Index test | Number of participants (studies) | Diagnostic odds ratio (95% CI) | Sensitivity (95% CI) at fixed specificity of 0.90 [1] | Missed H pylori cases per 1000 people tested (95% CI) [2] |
|---|---|---|---|---|
| Urea breath test‐13C | 3139 participants (34 studies) | 153 (73.7 to 316) | 0.94 (0.89 to 0.97) | 30 (15 to 58) |
| Urea breath test‐14C | 1810 participants (21 studies) | 105 (74.0 to 150) | 0.92 (0.89 to 0.94) | 42 (30 to 58) |
| Serology | 4242 participants (34 studies) | 47.4 (25.5 to 88.1) | 0.84 (0.74 to 0.91) | 86 (50 to 140) |
| Stool antigen test | 2988 participants (29 studies) | 45.1 (24.2 to 84.1) | 0.83 (0.73 to 0.90) | 89 (52 to 146) |

Comparison of non‐invasive tests for H pylori infection

Based on an indirect comparison of the four tests using all the studies, there was statistical evidence of a difference in diagnostic accuracy (P = 0.024). Direct comparisons were based on few head‐to‐head studies. The ratios of diagnostic odds ratios (95% CI; P value) were 0.68 (95% CI 0.12 to 3.70; P = 0.56) for urea breath test‐13C versus serology (seven studies), and 0.88 (95% CI 0.14 to 5.56; P = 0.84) for urea breath test‐13C versus stool antigen test (seven studies). The 95% confidence intervals of these estimates overlap with those of the ratios of diagnostic odds ratios from the indirect comparison. Data were limited or unavailable for meta‐analysis of other direct comparisons.

Conclusions

In people with no history of gastrectomy and those who have not recently had antibiotics or proton pump inhibitors, urea breath tests had high diagnostic accuracy while serology and stool antigen tests had lower accuracy to detect H pylori infection. Although susceptible to bias due to confounding, this conclusion is based on evidence from indirect test comparisons as evidence from direct comparisons was based on few studies or was unavailable. It should be noted that studies were generally of poor methodological quality. The thresholds used for the tests were highly variable and there is currently insufficient evidence to recommend specific thresholds for use in clinical practice.

[1] The sensitivities were estimated along the SROC curves at the median specificity across the studies included for the four tests.

[2] Based on the sensitivity estimated at the median specificity of 0.90, and the median prevalence of 53.7% from the included studies, the numbers of missed H pylori cases were calculated using a hypothetical cohort of 1000 people suspected of having H pylori infection. The 95% CI for the number of missed cases is from the 95% CI for sensitivity. For a specificity of 0.90 and prevalence of 53.7%, there will be 46 false positives. See Table 3 for results for other values of specificity and prevalence.
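The arithmetic behind the second footnote can be sketched in a few lines of Python. This is only an illustration: the function name `per_1000` is hypothetical, and it uses the rounded sensitivities shown in the table. With those inputs it reproduces the 46 false positives and the 86 missed cases for serology, but gives 32, 43, and 91 missed cases for the other three tests rather than the reported 30, 42, and 89. A plausible explanation, stated here as an assumption, is that the reported figures were derived from unrounded estimates.

```python
def per_1000(prevalence, sensitivity, specificity, cohort=1000):
    """Return (false positives, false negatives) in a hypothetical cohort.

    prevalence, sensitivity and specificity are proportions between 0 and 1.
    """
    infected = cohort * prevalence
    uninfected = cohort - infected
    # Uninfected people who nevertheless test positive.
    false_positives = uninfected * (1 - specificity)
    # Infected people whose infection the test misses.
    false_negatives = infected * (1 - sensitivity)
    return round(false_positives), round(false_negatives)


# Median prevalence and fixed specificity taken from the summary of findings.
PREVALENCE = 0.537
SPECIFICITY = 0.90

# Rounded sensitivities as printed in the table (not the unrounded estimates).
rounded_sensitivity = {
    "Urea breath test-13C": 0.94,
    "Urea breath test-14C": 0.92,
    "Serology": 0.84,
    "Stool antigen test": 0.83,
}

for test, sens in rounded_sensitivity.items():
    fp, fn = per_1000(PREVALENCE, sens, SPECIFICITY)
    print(f"{test}: {fp} false positives, {fn} missed cases per 1000")
```

The number of false positives depends only on prevalence and specificity, which is why it is the same 46 for all four tests at the fixed specificity of 0.90.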

Background

Helicobacter pylori (H pylori) is a Gram‐negative spiral bacterium (NCBI 2014). Approximately 13% to 81% of people have H pylori infection (Peleteiro 2014). Prevalence of the bacterium varies according to age (generally increasing with age, although infection rates tend to fall among older age groups in some Latin American and Northeast Asian countries); region (lower infection rates are seen in Australia and the UK, while higher rates are reported in Chile, China, Japan, Korea, and Latvia); race (more prevalent amongst Afro‐Caribbean people than white people); and socioeconomic class (more common in poorer settings) (Graham 1991; Laszewicz 2014; Muhsen 2012; Peleteiro 2014).

Based on observational studies, H pylori infection has been implicated in a number of malignancies, including gastric cancer, premalignant lesions of the stomach (atrophic gastritis and intestinal metaplasia), gastric lymphoma, pancreatic cancer, colorectal cancer, and laryngeal cancer (Huang 1998; Huang 2003; Wu 2013; Xiao 2013; Xue 2001; Zhuo 2008). However, H pylori is associated with a lower incidence of oesophageal adenocarcinomas (Islami 2008). H pylori is also associated with a number of non‐malignant conditions, including peptic ulcers, non‐ulcer dyspepsia, recurrent peptic ulcer bleeding, unexplained iron deficiency anaemia, idiopathic thrombocytopenic purpura, and colorectal adenomas (DuBois 2005; Franchini 2007; Gisbert 2004b; Huang 2002; Jaakkimainen 1999; Wu 2013).

Although a number of pathogenic factors such as cytotoxin‐associated gene A (CagA), vacuolating cytotoxin A (VacA), and blood group antigen binding adhesin (BabA) are associated with increased virulence of H pylori (Huang 2003; Malfertheiner 2012), detection of these pathogenic factors currently has no role in the management of H pylori infection (Malfertheiner 2012). The recommended initial treatment for H pylori infection is a combination of a proton pump inhibitor, clarithromycin, and amoxicillin or metronidazole (triple therapy) in regions with low resistance to clarithromycin (< 20% resistance rate in the area), and the triple therapy along with bismuth (quadruple therapy) in regions with high resistance to clarithromycin (> 20% resistance rate in the area) (Malfertheiner 2012). If eradication fails, the recommended second‐line treatment depends on the initial regimen: bismuth‐quadruple therapy or levofloxacin‐triple therapy (replacement of clarithromycin with levofloxacin in the classical triple therapy) when triple therapy was used initially, and levofloxacin‐triple therapy when bismuth‐quadruple therapy was used initially (Malfertheiner 2012). If this treatment also fails to eradicate H pylori, further treatment should be based on antibiotic susceptibility (Malfertheiner 2012). Eradication of H pylori might lead to a decrease in malignant and non‐malignant conditions associated with H pylori infection. Adverse events related to H pylori treatment include taste disturbance, diarrhoea, nausea, headache, skin rash, abdominal pain, dizziness, bloating, myalgias (muscle pain), and constipation (Ye 2014).

A glossary of terms is included in Appendix 1.

Target condition being diagnosed

Helicobacter pylori infection.

Index test(s)

Urea breath test

The urea breath test is based on the presence of urease enzyme in live H pylori, which breaks down urea into ammonia and carbon dioxide (McNulty 2005; Ricci 2007). After ingestion of urea labelled with either 13C or 14C, breath samples are collected for up to 30 minutes by exhaling into a carbon dioxide‐trapping agent (Ricci 2007). The urea breath test is performed by the clinician or the clinician's assistant. The thresholds used include the percentage of carbon recovered during the collection time or counts per minute (Ferwana 2015). Threshold levels above 4% or 5% are commonly used to diagnose H pylori infection (Ferwana 2015). A wide range of threshold counts per minute, from more than 25 to 1000 counts per minute, has been used for diagnosis of H pylori infection (Ferwana 2015).

Serology

These tests are based on circulating antibodies to H pylori. There are three main methods for these tests: the enzyme‐linked immunosorbent assay (ELISA) test, latex agglutination tests, and Western blotting (Ricci 2007). Of these, ELISA is the most commonly used method. Total immunoglobulin, immunoglobulin subtypes, and antibody response to specific antigens can all be tested. Since they do not require any special equipment, they can be easily performed (Ricci 2007). However, serology may be positive because of the presence of active infection at the time of the test, previous infection, or because of non‐specific cross‐reacting antibodies (McNulty 2005). Tests that use whole blood (rather than serum) and other bedside tests (using a bedside centrifuge) are also available, although these whole‐blood tests and bedside serum tests are generally considered unreliable (Ricci 2007). Routine serum tests are performed by the laboratory technician and interpreted by the clinician. The bedside serum tests and whole‐blood tests are performed by the clinician or the clinician's assistant. Different researchers evaluating the prevalence of H pylori have used different thresholds to define the positivity of serology, for example Lindsetmo 2008 used a titre ≥ 300 while Granberg 1993 used a titre ≥ 500.

Stool antigen tests

These tests use monoclonal and polyclonal antibodies to detect the presence of H pylori antigen in stools, which allows active H pylori infection to be diagnosed (McNulty 2005; Ricci 2007). These tests are performed by the laboratory technician and interpreted by the clinician. Several thresholds have been used; for example, optical densities of ≥ 0.15, ≥ 0.16, and ≥ 0.19 have all been used as thresholds for diagnosis of H pylori using monoclonal antibodies for stool antigen tests.

Clinical pathway

Evidence from randomised controlled trials (RCTs) showed that screening and eradication programmes for H pylori in populations at high risk of gastric cancer (e.g. East Asians) lowered the incidence of gastric cancer (Ford 2014). The Asia‐Pacific Gastric Cancer Consensus conference recommended that screening and eradication of H pylori was advisable in populations in countries at high risk of gastric cancer (i.e. Japan and Korea) (Talley 2008). The updated European Helicobacter Study Group (EHSG) Fourth Maastricht/Florence Consensus Conference guidelines suggest that people should be tested for H pylori, and eradication of H pylori (when present) has been recommended for the following conditions (Malfertheiner 2012):

  1. People at high risk of gastric cancer.

  2. Adults with dyspepsia with a locally‐determined age cut‐off point (depending on local incidence of gastric cancer in different age groups), and without ‘alarm’ symptoms or signs associated with an increased risk of gastric cancer such as weight loss, dysphagia, upper gastrointestinal bleeding, abdominal mass, or iron deficiency anaemia.

  3. Unexplained iron deficiency anaemia.

  4. Idiopathic thrombocytopenic purpura.

  5. Uninvestigated young patients with dyspepsia should also be considered for testing for H pylori when the prevalence of H pylori is high (≥ 20%).

The clinical pathway is shown in Figure 1.


Figure 1. Clinical pathway

Prior test(s)

The index tests can be performed without any prior test.

Role of index test(s)

The index tests are used for screening and diagnosis of H pylori.

Alternative test(s)

Other tests used in the screening and diagnosis of H pylori infection include non‐invasive saliva and urine antigen‐based tests (Ricci 2007), and invasive gastric biopsy followed by Campylobacter‐like organism (CLO) test, culture, histology, and polymerase chain reaction (PCR) (Van Doorn 2000). We do not include non‐invasive saliva and urine antigen‐based tests in this review because these tests are not commonly used (Ricci 2007).

Rationale

Testing for H pylori and eradication of H pylori have been recommended for a number of population groups (Clinical pathway). These tests have to be non‐invasive so that a large number of people can be tested. People with undetected H pylori continue to be at high risk of gastric cancer or continue to have dyspepsia, anaemia, or purpura. Overdiagnosis (false positive test results) of H pylori means that patients are subject to unnecessary adverse events related to eradication therapy (approximately 27% of patients receiving eradication therapy develop mild adverse events such as bitter taste, nausea, diarrhoea, etc.). Comparing the diagnostic accuracy of different index tests will highlight the best test for the diagnosis of H pylori infection.

Objectives

To compare the diagnostic accuracy of urea breath test, serology, and stool antigen test, alone or in combination, for diagnosis of H pylori infection in symptomatic and asymptomatic people, so that eradication therapy for H pylori can be started.

Secondary objectives

To investigate the following potential sources of heterogeneity: type of reference standard, risk of bias, publication status, prospective versus retrospective studies, symptomatic versus asymptomatic participants, recent or current use of proton pump inhibitors or antibiotics, different subtypes of tests, and the interval between the index test and reference standard.

Methods

Criteria for considering studies for this review

Types of studies

We include studies that evaluate the accuracy of the index tests in the appropriate patient population (see Participants), regardless of language or publication status, or whether data were collected prospectively or retrospectively. However, we exclude reports that describe how the diagnosis of H pylori was made in an individual patient or group of patients, and which do not provide sufficient diagnostic test accuracy data (i.e. the number of true positives, false positives, false negatives, and true negatives). We also exclude case‐control studies because these are prone to bias (Whiting 2011).

Participants

Symptomatic and asymptomatic people in whom H pylori infection status is sought so that eradication therapy for H pylori can be started. We exclude studies that included only people with acute upper gastrointestinal bleeding because such patients are likely to undergo endoscopy and invasive testing can be performed, if required.

Index tests

Urea breath test‐14C, urea breath test‐13C, serology, and stool antigen test, alone or in combination. We included only initial testing and excluded repeat testing (monitoring success of treatment), since diagnostic accuracy may vary depending on the purpose of testing (Ricci 2007).

Target conditions

H pylori infection.

Reference standards

There is no gold standard for diagnosis of H pylori infection, and the diagnosis is made by a combination of tests following endoscopic biopsy. Endoscopic biopsy followed by histology, endoscopic biopsy followed by polymerase chain reaction (PCR), and endoscopic biopsy followed by rapid urease testing all have excellent sensitivity and specificity (Chey 2007). However, PCR methodology is not standardised across laboratories (Chey 2007), so it is an unreliable reference standard. Endoscopic biopsy followed by rapid urease testing has poor sensitivity following treatment with proton pump inhibitors (Chey 2007). Endoscopic biopsy with culture has high specificity but poor sensitivity (Chey 2007). We therefore considered only endoscopic biopsy followed by histology (using haematoxylin and eosin (H & E) stain, special histological stains such as Giemsa stain and Warthin‐Starry stain, or immunohistochemical stain) as the reference standard in this review.

Immunohistochemical stains are more accurate than special stains, while special stains and immunohistochemical stains are thought to have better specificity than H & E stains for diagnosis of H pylori infection (Laine 1997; Lee 2015b). For this reason, we considered endoscopic biopsy with histology using immunohistochemical stain as the best reference standard, and endoscopic biopsy with histology using H & E stain as the worst reference standard.

Search methods for identification of studies

We included all studies, irrespective of the language of publication and publication status. If we found articles in languages other than English, we obtained translations.

Electronic searches

We searched the following databases.

  1. MEDLINE via OvidSP (January 1946 to 4 March 2016) (Appendix 2).

  2. Embase via OvidSP (January 1947 to 4 March 2016) (Appendix 3).

  3. Science Citation Index Expanded via Thomson Reuters Web of Science (January 1980 to 4 March 2016) (Appendix 4).

  4. National Institute for Health Research Health Technology Assessment (NIHR HTA) database via Centre for Reviews and Dissemination, University of York (www.crd.york.ac.uk/CRDWeb/) (4 March 2016) (Appendix 5).

Searching other resources

To identify additional studies, we examined references in the included studies to see if any might be relevant. We also searched for articles related to the included studies by using the 'related search' function in MEDLINE (OvidSP) and Embase (OvidSP). We conducted a 'citing reference' search (by searching articles which cited the included articles) (Sampson 2008) in MEDLINE (OvidSP) and Embase (OvidSP) on 4 December 2016.

Data collection and analysis

Selection of studies

Two review authors (KG and LB, SS, or AS) independently searched the references to identify relevant studies. We obtained the full text for references considered relevant by at least one of the two review authors. Two review authors independently screened the full‐text papers against the inclusion criteria, resolving any differences in study selection by discussion. We attempted to contact study authors if there were doubts about the eligibility of a study.

Data extraction and management

Two review authors (KG and LB, SS, or AS) independently extracted the following data from each included study, using a pre‐piloted data extraction form, and resolving differences by discussion.

  1. First author.

  2. Year of publication.

  3. Study design (prospective or retrospective cohort studies; cross‐sectional studies or randomised controlled trials).

  4. Inclusion and exclusion criteria for individual studies.

  5. Total number of participants.

  6. Number of female participants.

  7. Average age of the participants.

  8. Initial testing versus testing after eradication.

  9. Number of people with bleeding ulcers, gastric atrophy, lymphoma, and recent or current use of proton pump inhibitors or antibiotics.

  10. Number of symptomatic participants.

  11. Tests carried out prior to the index test.

  12. Description of the index test.

  13. Threshold used for the index test.

  14. Reference standard.

  15. Number of true positives, false positives, false negatives, and true negatives (i.e. 2 x 2 data) at each threshold reported.

If a study reported multiple index tests, we extracted the 2 x 2 data for each index test at each threshold. For studies that reported test accuracy for different reference standards, we extracted 2 x 2 data for only one of the reference standards. For this purpose, due to the accuracy of the stains, we preferred the immunohistochemical stain over special stains, which in turn we preferred over the H & E stain.
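The 2 x 2 data extracted for each index test and threshold determine that study's sensitivity and specificity directly. The following is a minimal sketch of that arithmetic, not the review authors' extraction form; the class name `TwoByTwo` and the counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts for one index test at one threshold against the reference standard."""
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative

    @property
    def sensitivity(self) -> float:
        # Proportion of infected participants detected by the index test
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of uninfected participants with a negative index test
        return self.tn / (self.tn + self.fp)

    @property
    def prevalence(self) -> float:
        # Proportion of participants who are infected according to the reference standard
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts, not taken from any included study
example = TwoByTwo(tp=45, fp=5, fn=5, tn=45)
print(example.sensitivity, example.specificity, example.prevalence)
```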

Although the number of uninterpretable index test results may provide information on the applicability of the tests in clinical practice and may affect the cost effectiveness of a test, we had planned to exclude patients with uninterpretable index test results from the meta‐analyses. We made this decision because in clinical practice uninterpretable index test results would result in additional testing. Nevertheless, we would have extracted and reported such data if available from the studies.

If we suspected an overlap of participants between multiple reports due to common study authors and centres, we planned to contact the study authors for clarification; however, this was not required, since we could identify multiple reports of the same study using the information provided in the reports. We sought further information from study authors, if necessary.

Assessment of methodological quality

Two review authors independently assessed study quality using the QUADAS‐2 tool (Whiting 2006; Whiting 2011), resolving differences by discussion. The criteria used for the assessment are shown in Appendix 6. We considered studies classified as 'low risk of bias' and 'low concern' in all the domains of the QUADAS‐2 tool as studies with high methodological quality. It must be noted here that 'risk of bias' refers to internal validity (i.e. whether there were systematic errors in performing the study with respect to the particular domain), while 'applicability concern' refers to external validity (i.e. whether there were concerns that the population, index test or reference standard used in the studies matched the review question).

Statistical analysis and data synthesis

We plotted study estimates of sensitivity and specificity on forest plots and in receiver operating characteristic (ROC) space to explore between‐study variation in the accuracy of each test. We examined the thresholds reported for each test and the reference standards used. Due to between‐study variation in thresholds, we performed meta‐analyses by using the hierarchical summary receiver operating characteristics (HSROC) model to estimate SROC curves (Rutter 2001). For these analyses, if a study reported test accuracy at multiple thresholds, we selected the threshold used by the study authors for their primary analysis.

Prior to comparative meta‐analyses of the tests, we performed meta‐analysis of each test separately for preliminary investigation of the shape of the SROC curve of each test and to assess heterogeneity in test performance. We used this approach to understand the data and to guide modelling assumptions we may need to make in the comparative meta‐analysis. These preliminary analyses were done noting the availability of comparative studies. To compare the accuracy of the index tests, we added test type as a covariate to the HSROC model (Macaskill 2013). For the indirect comparison where we used all available data (i.e. not restricted to comparative studies), we assessed the effect of test type on the accuracy, threshold, and shape parameters of the HSROC model. We also explored the effect of test type on the variance of the random effects for accuracy and threshold. To determine the final meta‐analytic model, we used likelihood ratio tests to assess model fit. Likelihood ratio tests were also used to determine the statistical significance of differences in test accuracy. When SROC curves are symmetric (i.e. HSROC model without the shape parameter), each curve can be described using the diagnostic odds ratio (DOR) to quantify the accuracy of the test. We used the ratio of DORs as a summary of the relative accuracy of two tests.
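As an illustration of the two summary measures named above, the sketch below computes a DOR from a sensitivity and specificity pair and then a ratio of DORs. The function names are hypothetical. The two DOR values are the point estimates the review reports for urea breath test‐13C and serology in the indirect comparison; dividing rounded point estimates reproduces the reported ratio only to within rounding, and it cannot reproduce the confidence interval, which comes from the fitted model.

```python
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """Odds of a positive test in the infected divided by the odds in the uninfected."""
    odds_positive_if_infected = sensitivity / (1 - sensitivity)
    odds_positive_if_uninfected = (1 - specificity) / specificity
    return odds_positive_if_infected / odds_positive_if_uninfected


def ratio_of_dors(dor_a: float, dor_b: float) -> float:
    """Relative accuracy of test A compared with test B (greater than 1 favours A)."""
    return dor_a / dor_b


# A test with sensitivity 0.90 and specificity 0.90 has a DOR of about 81
print(diagnostic_odds_ratio(0.90, 0.90))

# Point estimates reported in the review's indirect comparison
dor_ubt_13c = 153.0
dor_serology = 47.4
print(ratio_of_dors(dor_ubt_13c, dor_serology))  # close to the reported 3.22
```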

Summary sensitivities and specificities can be obtained from a HSROC model but they are not clinically interpretable here because we included studies with different thresholds. We therefore estimated sensitivities at points on the SROC curves that correspond to the lower quartile, median and upper quartile of the specificities from the studies included in the meta‐analysis. When comparative studies that had evaluated two tests head‐to‐head were available, we performed direct comparisons of the tests (Takwoingi 2013). For these analyses, we fitted HSROC models with symmetric SROC curves, as the available data were insufficient for reliable estimation of the shape of the SROC curves (Takwoingi 2017).

If there were at least two studies that reported the accuracy of a test at the same threshold, we considered meta‐analysis to obtain summary estimates of sensitivity and specificity. Due to the small number of studies in these analyses, we performed meta‐analyses using univariate fixed‐effect or random‐effects logistic regression models, depending on the extent of heterogeneity observed in forest plots and in ROC space (Takwoingi 2017). When there were only two or three studies at the same threshold, and little or no heterogeneity observed in ROC space, we used univariate fixed‐effect logistic regression models to pool sensitivities and specificities separately. When there were two or three studies and we observed heterogeneity, we did not perform meta‐analysis: random‐effects models would be more appropriate in such situations, but random effects cannot be reliably estimated with very few studies.
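For intuition about the fixed‐effect case, an intercept‐only logistic regression fitted to binomial counts has a maximum likelihood estimate equal to the pooled proportion, with a standard error on the logit scale of sqrt(1/x + 1/(n − x)). The sketch below applies that to sensitivity. It is an illustration under stated assumptions, not the review's SAS code: the function name, the Wald interval on the logit scale and the counts are all hypothetical choices.

```python
import math


def pool_fixed_effect(events: list[int], totals: list[int], z: float = 1.96):
    """Pool proportions with an intercept-only fixed-effect logistic model.

    Returns (estimate, lower, upper), with a Wald interval computed on the
    logit scale and back-transformed to the probability scale.
    """
    x = sum(events)
    n = sum(totals)
    if x == 0 or x == n:
        raise ValueError("logit is undefined when the pooled proportion is 0 or 1")
    logit = math.log(x / (n - x))
    se = math.sqrt(1 / x + 1 / (n - x))

    def expit(value: float) -> float:
        return 1 / (1 + math.exp(-value))

    return expit(logit), expit(logit - z * se), expit(logit + z * se)


# Hypothetical true positives and numbers of infected participants in two studies
true_positives = [48, 57]
infected = [50, 60]
sens, lower, upper = pool_fixed_effect(true_positives, infected)
print(f"summary sensitivity {sens:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Specificity would be pooled in the same way from true negatives and the number of uninfected participants.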

We performed meta‐analyses using the NLMIXED procedure in SAS.

Investigations of heterogeneity

We used forest plots and scatter plots of sensitivity against specificity for preliminary investigation of potential sources of heterogeneity such as:

  1. Type of reference standard (different histological stains).

  2. Studies at low risk of bias in all the QUADAS‐2 domains versus those at unclear or high risk of bias.

  3. Full‐text publications versus abstracts (may provide insight into publication bias if there is an association between the results of a study and full publication of the study) (Eloubeidi 2001).

  4. Prospective versus retrospective studies.

  5. Symptomatic versus asymptomatic participants.

  6. Recent or current use of proton pump inhibitors or antibiotics, as these patients are at higher risk of false negative results for the urea breath test and stool antigen test, with serology being the only non‐invasive test unaffected by the use of proton pump inhibitors or antibiotics (Malfertheiner 2012; Ricci 2007).

  7. Different subtypes of tests (ELISA, latex agglutination test, and Western blot methods of serological tests; formal serological tests versus bedside serological tests; and monoclonal versus polyclonal antibodies for stool antigen tests).

  8. Interval between index test and reference standard. If there was a long interval between the index test and reference standard, the infection may have resolved in people with H pylori infection (usually with treatment), and people without H pylori infection may have become infected.

We formally investigated heterogeneity for each test by adding a covariate to a HSROC model (meta‐regression). We used likelihood ratio tests to assess the statistical significance of differences in test accuracy by comparing models with and without the covariate.

Sensitivity analyses

We planned to examine the impact of data inconsistencies on the meta‐analytic findings. For example, if test accuracy data reported in the text of a paper differed from those in the figures, we planned to assess the impact of using different data in sensitivity analyses; however, we did not find such inconsistencies.

Assessment of reporting bias

Due to limited data, we were unable to formally investigate whether test accuracy differed between studies that were published as full texts and those available only as abstracts.

Results

Results of the search

We identified 23,896 references through electronic searches of MEDLINE, Embase, Science Citation Index, and NIHR HTA. We did not identify additional references through other searches. The flow of studies through the screening process is shown in Figure 2. After removing 10,313 duplicates, there were 13,583 references. Of these, we excluded 11,737 irrelevant references after reading the titles and abstracts. We could not obtain the full text of 11 references. The quality of copies of two references was too poor to allow translation and we were unable to obtain better copies. We assessed the full text of the remaining 1833 references. We excluded 1726 references (1725 studies) for reasons stated in Appendix 7 (also see Characteristics of excluded studies below). The remaining 107 references (101 studies) met our inclusion criteria. Two references reported diagnostic accuracy data separately for people who underwent gastrectomy and those who did not undergo gastrectomy, and so we considered these subgroups as separate studies (Adamopoulos 2009a; Adamopoulos 2009b; Sheu 1998a; Sheu 1998b).


Figure 2. Study flow diagram.

Characteristics of included studies

We summarise the characteristics of the 101 included studies in the Characteristics of included studies table. The studies included 11,003 participants, of which 5839 participants (53.1%) had H pylori infection. The prevalence of H pylori infection ranged from 15.2% to 94.7% with a median of 53.7% (interquartile range: 42.0% to 66.5%).

Of the 101 studies, 34 evaluated urea breath test‐13C; 21 evaluated urea breath test‐14C; two evaluated urea breath test but did not report the isotope used; 34 evaluated serology; and 29 evaluated stool antigen test. Seventeen studies evaluated more than one test. Of these, 15 evaluated two tests (Dede 2015; El‐Din 2013; Eltumi 1999; Hafeez 2007; Inelmen 2004; Korstanje 2006; Kuloglu 2008; Lahner 2004; Lottspeich 2007; Mansour‐Ghanaei 2011; Ogata 2001; Soomro 2012; Vandenplas 1992; Yoshimura 2001; Yu 2001), and two evaluated three tests (Monteiro 2001a; Salles‐Montaudon 2002). Studies used different thresholds, with 15 studies reporting test accuracy at more than one threshold (Chey 1998; Dede 2015; Delvin 1999; Formichella 2013; Ladas 2002a; Mana 2001a; Misawa 1998; Monteiro 2001a; Morales 1995; Noguera 1998; Novis 1991; Ozturk 2003; Trevisani 2005; Weiss 1994; Yu 2001).

Eleven studies were prospective (Adamopoulos 2009a; Adamopoulos 2009b; Al‐Fadda 2000; Arikan 2004; Dede 2015; Eltumi 1999; Fallone 1995; Kalach 1998a; Kuloglu 2008; Ogata 2001; Qadeer 2009); six studies were retrospective (Bosso 2000; Czerwionka‐Szaflarska 2007; Graham 1996a; Iqbal 2013; Mion 1994; Wardi 2012), while the remaining 84 studies did not state whether they were prospective or retrospective studies. Six studies were published as abstracts only (Han 2012; Mohammadian 2007; Rathbone 1986; Sheu 1998a; Sheu 1998b; Thobani 1995), and the remaining 95 were full‐text publications.

Fourteen studies included only children (Argentieri 2007; Behrens 1999; Czerwionka‐Szaflarska 2007; Delvin 1999; Dinler 1999; Eltumi 1999; Hafeez 2007; Kalach 1998a; Kuloglu 2008; Lottspeich 2007; Ogata 2001; Rafeey 2007; Vandenplas 1992; Yoshimura 2001). Five studies clearly included only adults (Atli 2012; Chen 1991; Kamel 2011; Safe 1993; Salles‐Montaudon 2002). Although not clearly specified in the remaining 82 studies, it appeared that most or all of the participants were adults. The mean or median age of the participants included in these studies ranged between 31 years and 85 years in the 45 studies that reported this information. One study included only participants without symptoms (Wang 2008). Fifty‐eight studies included only participants with symptoms, usually abdominal pain or dyspepsia (Adamopoulos 2009a; Adamopoulos 2009b; Aguilar 2007; Al‐Fadda 2000; Allardyce 1997; Behrens 1999; Bosso 2000; Ceken 2011; Chen 1991; Czerwionka‐Szaflarska 2007; D'Elios 2000; Delvin 1999; Dinler 1999; Ekesbo 2006; El‐Din 2013; El‐Mekki 2011; El‐Nasr 2003; Eltumi 1999; Fanti 1999; Faruqui 2007; Ferrara 1998; Germana 2001; Guo 2011; Gurbuz 2005; Hafeez 2007; Jordaan 2008; Kamel 2011; Kuloglu 2008; Ladas 2002a; Lahner 2004; Lee 1998; Lottspeich 2007; Mansour‐Ghanaei 2011; Mion 1994; Misawa 1998; Mohammadian 2007; Morales 1995; Novis 1991; Ogata 2001; Ozturk 2003; Peitz 2001; Qadeer 2009; Rafeey 2007; Rasool 2007; Rathbone 1986; Safe 1993; Scuderi 2000; Segamwenge 2014; Selcukcan 2011; Sharbatdaran 2013; Sheu 1998a; Soomro 2012; Surveyor 1989; Thobani 1995; Vandenplas 1992; Villalobos 1992; Weiss 1994; Yoshimura 2001). The remaining 42 studies did not report the type of participants included. Five studies included only participants who had previously undergone gastrectomy (Adamopoulos 2009b; Lombardo 2003; Schilling 2001; Sheu 1998b; Wardi 2012). Two studies included only participants with atrophic gastritis (Korstanje 2006; Ogata 2001). 
It was clear that participants who received recent proton pump inhibitors or antibiotics were excluded from 53 studies (Ceken 2011; Chey 1998; Debongnie 1991; D'Elios 2000; Delvin 1999; Duan 1999; El‐Mekki 2011; El‐Nasr 2003; Eltumi 1999; Fallone 1996; Fanti 1999; Ferrara 1998; Formichella 2013; Germana 2001; Guo 2011; Gurbuz 2005; Jekarl 2013; Jensen 1998; Jordaan 2008; Kalach 1998a; Kim 2016; Kuloglu 2008; Ladas 2002a; Lahner 2004; Lee 1998; Lombardo 2003; Lottspeich 2007; Mana 2001a; Mansour‐Ghanaei 2011; Monteiro 2001a; Ogata 2001; Ozturk 2003; Peitz 2001; Peura 1996; Puspok 1999; Qadeer 2009; Rafeey 2007; Rasool 2007; Schilling 2001; Segamwenge 2014; Selcukcan 2011; Sharbatdaran 2013; Shin 2009; Tiwari 2014; Trevisani 2005; Vandenplas 1992; Villalobos 1992; Wang 2008; Weiss 1994; Yan 2003; Yoshimura 2001; Yu 1999; Yu 2001). It was not clear whether such participants were included or excluded in the remaining 48 studies.

Thirty‐two studies used H & E stain as a reference standard (Aguilar 2007; Al‐Fadda 2000; Arikan 2004; Atli 2012; Behrens 1999; Ceken 2011; Chen 1991; Chey 1998; Czerwionka‐Szaflarska 2007; D'Elios 2000; Dinler 1999; Eggers 1990; El‐Nasr 2003; Fallone 1996; Faruqui 2007; Graham 1996a; Gramley 1999; Gurbuz 2005; Iqbal 2013; Jordaan 2008; Kalach 1998a; Kamel 2011; Lee 1998; Logan 1991a; Noguera 1998; Puspok 1999; Segamwenge 2014; Selcukcan 2011; Sheu 1998a; Sheu 1998b; Tiwari 2014; Yu 2001); 24 studies used special stains such as Warthin‐Starry stain, Giemsa stain, or silver stain (Argentieri 2007; Bosso 2000; El‐Din 2013; Fallone 1995; Guo 2011; Hafeez 2007; Han 2012; Ivanova 2010; Kim 2016; Ladas 2002a; Lahner 2004; Mion 1994; Mohammadian 2007; Morales 1995; Novis 1991; Ozturk 2003; Peura 1996; Qadeer 2009; Schilling 2001; Scuderi 2000; Shin 2009; Soomro 2012; Villalobos 1992; Yan 2003); two studies used immunohistochemical staining (Ekesbo 2006; Misawa 1998); and the remaining 43 studies used a combination of different stains.

The interval between the index test and reference standard was reported only in 21 studies. The interval was less than two weeks in 19 of the 21 studies (Adamopoulos 2009a; Adamopoulos 2009b; Bosso 2000; Debongnie 1991; Duan 1999; Fallone 1995; Fallone 1996; Formichella 2013; Gurbuz 2005; Hafeez 2007; Lahner 2004; Lee 1998; Logan 1991a; Lottspeich 2007; Mansour‐Ghanaei 2011; Mion 1994; Ozturk 2003; Peura 1996; Safe 1993), and was between 15 days and 23 days in one study (Dede 2015); it was within 30 days in the remaining study (Lombardo 2003).

Characteristics of excluded studies

We excluded 1726 references (1725 studies). The reason for exclusion is stated for each study in Appendix 7 and summarised below.

  • Case‐control study: 17

  • Not a primary research study: 147

  • Erratum: 3

  • Inappropriate population: 79

    • In monitoring: 33

    • Not in humans: 1

    • Only in H pylori negative people: 2

    • Only in H pylori positive people: 39

    • Only in people with gastrointestinal bleeding: 2

    • Selection of participants was based on the results of other H pylori tests: 1

    • Includes people who were being monitored for H pylori status: 1

  • Inappropriate index test: 38

  • Inappropriate target condition: 4

  • Inappropriate reference standards: 1182

  • Lack of data: 256

    • Insufficient diagnostic test accuracy data: 25

    • No diagnostic accuracy data: 42

    • Not a diagnostic test accuracy study of non‐invasive H pylori diagnosis: 188

    • Incorrect data (correct information could not be obtained): 1

Methodological quality of included studies

The methodological quality of the included studies is summarised across all studies in Figure 3. None of the included studies was of high methodological quality (i.e. low risk of bias in all the domains). Appendix 8 shows the results for individual studies for urea breath test‐13C, urea breath test‐14C, serology and the stool antigen test, respectively.


Figure 3. Risk of bias and applicability concerns graph: review authors' judgements about each domain presented as percentages across included studies. For each domain, the numbers shown on the bar represent the number of studies that were scored as high, unclear or low in terms of risk of bias or applicability concern.

Patient selection domain

In the patient selection domain, 23, 15 and 63 studies were at low, high and unclear risk of bias, respectively. All 15 studies were at high risk of bias because they did not include a consecutive or random series of participants.

Twenty‐five, seven and 69 studies were of low, high and unclear applicability concern. In the 69 studies of unclear applicability concern it was not clear whether participants similar to those seen in the clinical setting where the test is used were excluded, while the seven studies of high concern clearly excluded such participants. In these seven studies, only people who had undergone gastrectomy or those with atrophic gastritis were included.

Index test domain

In the index test domain, studies generally had an unclear risk of bias because it was unclear whether the index test results were interpreted without the knowledge of the results of the reference standard, and/or it was unclear whether a threshold was prespecified.

Urea breath test

None of the studies that evaluated the urea breath test (13C, 14C, or unknown isotope) were at low risk of bias. The risk of bias was unclear in the two studies that did not report the type of isotope (Han 2012; Lombardo 2003). Of the 34 studies that evaluated urea breath test‐13C, seven (21%) had a high risk of bias while 27 (79%) had unclear risk of bias. There were 21 studies of urea breath test‐14C, 16 (76%) of which had unclear risk of bias while five (24%) had high risk of bias.

For the two studies with unknown isotope, applicability concern was high in one study and low in the other. Of the 34 urea breath test‐13C studies, applicability concerns were unclear for two (6%) studies, high for six (18%) studies and low for 26 (76%) studies. For urea breath test‐14C, applicability concerns were generally low (18/21; 86%) with only three studies having high applicability concerns (Selcukcan 2011; Surveyor 1989; Yu 1999).

Serology

One study (Ladas 2002a) had a low risk of bias and another study (Rathbone 1986) had a high risk of bias. The risk of bias for the remaining 32 (94%) studies was unclear. Applicability concerns were low in 19 (56%) studies and high in 15 (44%) studies.

Stool antigen test

None of the 29 studies had a high risk of bias. Most of the studies (26/29; 90%) had an unclear risk of bias; three studies (Islam 2005; Kuloglu 2008; Sharbatdaran 2013) had a low risk of bias. All the studies were of low applicability concern.

Reference standard domain

Two studies were at low risk of bias in the reference standard domain (Fallone 1995; Ladas 2002a). For 27 studies, the risk of bias was unclear because it was not clear whether reference standard results were interpreted without knowledge of the results of the index tests. The remaining 72 studies were at high risk of bias because the reference standard was endoscopic biopsy with H & E stain in some or all participants.

All the studies were of low applicability concern.

Flow and timing domain

Seven studies were at low risk of bias in the flow and timing domain. The risk of bias was unclear for 74 studies because the interval between the index test and reference standard was unclear or it was unclear whether all participants were included in the analysis. The remaining 20 studies were at high risk of bias because some participants were clearly excluded from the analysis. These studies did not report the reference standard results for the excluded participants. None of the studies reported indeterminate results (i.e. there were no indeterminate index test results in studies which provided a clear participant flow and none of the exclusions were due to indeterminate index test results).

Findings

Urea breath test‐13C

The 34 studies of urea breath test‐13C included 3139 participants, of whom 1526 had H pylori infection (Figure 4). The threshold used in six studies was either unknown (Eggers 1990; Monteiro 2001a), or unclear (Sheu 1998a; Sheu 1998b; Vandenplas 1992; Wardi 2012). At the most commonly reported threshold of delta over baseline > 4% (30 minutes after administration of urea), the summary sensitivity (95% confidence interval (CI)) and specificity (95% CI) from 10 studies (958 participants) were 0.95 (95% CI 0.79 to 0.99) and 0.95 (95% CI 0.87 to 0.98). Other thresholds were used by a limited number of studies (Figure 5; Appendix 9). When possible we performed meta‐analysis to estimate summary sensitivities and specificities at these common thresholds. The results are presented in Table 1.


Figure 4. Forest plot of urea breath test‐13C. FN = false negative; FP = false positive; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. For threshold, the number of minutes in brackets is the time after administration of urea.


Figure 5. Forest plot of urea breath test‐13C at commonly reported thresholds. FN = false negative; FP = false positive; TN = true negative; TP = true positive. Thresholds are shown in brackets and the number of minutes in brackets is the time after administration of urea.

Table 1. Summary of results at thresholds commonly reported for urea breath test‐13C, urea breath test‐14C and serology

| Threshold | Studies | Number of participants (cases) | Sensitivity (95% CI) | Specificity (95% CI) |
| --- | --- | --- | --- | --- |
| Urea breath test‐13C | | | | |
| Delta over baseline > 3% (20 minutes) | 2 | 254 (128) | 0.98 (0.90 to 1.00) | 0.92 (0.82 to 0.97) |
| Delta over baseline > 3% (30 minutes) | 3 | 333 (140) | 0.99 (0.92 to 1.00) | 0.95 (0.90 to 0.98) |
| Delta over baseline > 3.5% (30 minutes) | 3 | 368 (120) | 0.75 to 1.00 | 0.77 to 1.00 |
| Delta over baseline > 4% (10 minutes) | 2 | 236 (118) | 0.91 to 1.00 | 0.60 to 0.95 |
| Delta over baseline > 4% (20 minutes) | 2 | 236 (118) | 0.91 to 1.00 | 0.60 to 0.96 |
| Delta over baseline > 4% (30 minutes) | 10 | 958 (423) | 0.95 (0.79 to 0.99) | 0.95 (0.87 to 0.98) |
| Delta over baseline > 4.5% (30 minutes) | 3 | 288 (106) | 0.50 to 0.96 | 0.82 to 0.96 |
| Delta over baseline > 5% (30 minutes) | 4 | 601 (315) | 0.95 (0.49 to 1.00) | 0.94 (0.84 to 0.98) |
| Urea breath test‐14C | | | | |
| Counts per minute > 50 (10 minutes) | 6 | 471 (231) | 0.89 (0.55 to 0.98) | 0.91 (0.79 to 0.96) |
| Disintegrations per minute > 200 (10 minutes) | 4 | 296 (132) | 0.95 (0.33 to 1.00) | 0.95 (0.80 to 0.99) |
| Serology | | | | |
| > 7 units/mL | 2 | 97 (48) | 0.98 (0.74 to 1.00) | 0.71 (0.51 to 0.86) |
| ≥ 300 units | 2 | 234 (143) | 0.91 (0.82 to 0.96) | 0.86 (0.72 to 0.93) |

Tests evaluated at the same threshold by more than one study are presented in the table. When there were two or three studies at the same threshold, and little or no heterogeneity was observed in ROC space, estimates of summary sensitivity and summary specificity were obtained by using univariate fixed‐effect logistic regression models to pool sensitivities and specificities separately. When there were two or three studies and we observed heterogeneity, we did not perform meta‐analysis but report the range of the sensitivities and specificities.

Urea breath test‐14C

Figure 6 shows the 21 studies of urea breath test‐14C. The studies included 1810 participants (involving 1018 H pylori cases). Three studies did not state the thresholds used (Selcukcan 2011; Surveyor 1989; Yu 1999). The two most commonly used thresholds were counts per minute > 50 (10 minutes after administration of urea) in six studies (471 participants) and disintegrations per minute > 200 (10 minutes) in four studies (296 participants) (Table 1). Test accuracy results for other thresholds are shown in Appendix 9. The summary sensitivity (95% CI) and specificity (95% CI) at the counts per minute > 50 threshold were 0.89 (95% CI 0.55 to 0.98) and 0.91 (95% CI 0.79 to 0.96). For the disintegrations per minute > 200 threshold, the summary sensitivity (95% CI) and specificity (95% CI) were 0.95 (95% CI 0.33 to 1.00) and 0.95 (95% CI 0.80 to 0.99).


Figure 6. Forest plot of urea breath test‐14C. FN = false negative; FP = false positive; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. For threshold, the number of minutes in brackets is the time after administration of urea.

Serology

Serology was evaluated in 34 studies with a total of 4242 participants, of whom 2477 had H pylori infection (Figure 7). There was considerable variation in the thresholds used, and 14 (41%) studies did not state the thresholds used. A threshold of > 7 units/mL was used in two studies (Iqbal 2013; Ogata 2001), involving 97 participants, and two studies involving 234 participants (Ladas 2002a; Monteiro 2001a) used a threshold of ≥ 300 units (Table 1). The summary sensitivity (95% CI) and specificity (95% CI) were 0.98 (95% CI 0.74 to 1.00) and 0.71 (95% CI 0.51 to 0.86) at the > 7 units/mL threshold, and 0.91 (95% CI 0.82 to 0.96) and 0.86 (95% CI 0.72 to 0.93) at the ≥ 300 units threshold.


Figure 7. Forest plot of serology. FN = false negative; FP = false positive; SD = standard deviation; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. Other threshold is staining of a 120 kDa protein (CagA) gel band and/or at least two of five proteins between 28–33 kDa.

Stool antigen test

Twenty‐nine studies assessed the stool antigen test in 2988 participants (including 1311 H pylori cases) (Figure 8). The threshold used was unknown in almost half of the studies (14/29, 48%). None of the thresholds reported were used by more than one study. Summary estimates of sensitivity and specificity were therefore not obtained at a common threshold.


Figure 8. Forest plot of stool antigen test. FN = false negative; FP = false positive; TN = true negative; TP = true positive; WL = wavelength. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity.

Comparative accuracy of non‐invasive tests for H pylori infection

Comparison based on all studies (Indirect test comparison)

Across the four tests (urea breath test‐13C, urea breath test‐14C, serology and stool antigen test) 99 studies (5694 cases; 10,799 participants) were included in this comparative meta‐analysis (Figure 9). Preliminary assessment of each test separately indicated there was no significant association between test accuracy and threshold, and so a symmetric SROC curve is plausible for each test. Based on these preliminary assessments, and likelihood ratio tests comparing different HSROC meta‐regression models with covariate terms for test type and examination of the variance parameters in these models, the final model we fitted allowed for differences in accuracy and threshold as random effects (i.e. unequal variances for the random effects) with symmetric SROC curves for the tests. Overall, there was statistical evidence of a difference in accuracy (P = 0.024). The DORs (95% CI) for urea breath test‐13C, urea breath test‐14C, serology and stool antigen test were 153 (95% CI 73.7 to 316), 105 (95% CI 74.0 to 150), 47.4 (95% CI 25.5 to 88.1) and 45.1 (95% CI 24.2 to 84.1) respectively (Table 2). The accuracy of urea breath tests (13C and 14C) was significantly higher than that of serology and stool antigen test. For example, the ratio of DORs (95% CI) for urea breath test‐13C compared to serology was 3.22 (95% CI 1.24 to 8.37), P = 0.017.


Summary ROC plot of non‐invasive tests for H pylori infection. The SROC curves for the four tests are parallel. The curve for each test is drawn within the range of estimates of specificity from the studies included for the test.


Table 2. Indirect comparison of the accuracy of non‐invasive tests for H pylori infection

| Index test | Studies; participants (H pylori present) | DOR (95% CI) | Ratio of DORs (95% CI), P value: urea breath test‐13C | Ratio of DORs (95% CI), P value: urea breath test‐14C | Ratio of DORs (95% CI), P value: serology |
|---|---|---|---|---|---|
| Urea breath test‐13C | 34; 3139 (1526) | 153 (73.7 to 316) | | | |
| Urea breath test‐14C | 21; 1810 (1018) | 105 (74.0 to 150) | 1.45 (0.65 to 3.26), P = 0.36 | | |
| Serology | 34; 4242 (2477) | 47.4 (25.5 to 88.1) | 3.22 (1.24 to 8.37), P = 0.017 | 2.22 (1.09 to 4.51), P = 0.028 | |
| Stool antigen test | 29; 2988 (1311) | 45.1 (24.2 to 84.1) | 3.39 (1.30 to 8.83), P = 0.013 | 2.33 (1.14 to 4.76), P = 0.020 | 1.05 (0.44 to 2.53), P = 0.91 |

The indirect comparison included all studies that evaluated at least one of the four tests, i.e. all available data. The ratio of diagnostic odds ratios is the diagnostic odds ratio (DOR) of the test in the column divided by the DOR of the test in the row. If the ratio is greater than one, then the test in the column is more accurate than the test in the row; if the ratio is less than one, the test in the row is more accurate than the test in the column.
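The column-over-row convention described above can be illustrated with a minimal Python sketch that divides the point estimates of the DORs reported in Table 2. This is only an illustration: the function name `ratio_of_dors` is hypothetical, the ratios are computed from rounded DORs and so may differ from the model-based ratios in the last digit, and the confidence intervals and P values come from the HSROC meta-regression model and cannot be reproduced this way.

```python
# Point estimates of the diagnostic odds ratios (DORs) as reported in Table 2.
DOR = {
    "Urea breath test-13C": 153.0,
    "Urea breath test-14C": 105.0,
    "Serology": 47.4,
    "Stool antigen test": 45.1,
}


def ratio_of_dors(column_test: str, row_test: str) -> float:
    """Return the DOR of the column test divided by the DOR of the row test.

    A ratio above 1 means the column test has the higher point estimate of accuracy.
    """
    return DOR[column_test] / DOR[row_test]


if __name__ == "__main__":
    tests = list(DOR)
    # Walk the lower triangle of the table: each row test against earlier column tests.
    for i, row in enumerate(tests):
        for col in tests[:i]:
            print(f"{col} versus {row}: {ratio_of_dors(col, row):.2f}")
```

Because the ratios are built from rounded DORs, a value such as 153 / 47.4 lands close to, but not exactly on, the 3.22 reported in the table.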

Table 3 shows the clinical implications of using each of the four tests in a hypothetical cohort of 1000 people with different levels of prevalence of H pylori infection. For example, given a prevalence of 53.7% and a specificity of 0.90, 46 people who do not have H pylori infection will be treated and urea breath test‐13C, urea breath test‐14C, serology and stool antigen test will miss 30, 42, 86 and 89 people respectively who have H pylori infection.

Table 3. Accuracy of non‐invasive tests for H pylori infection at different levels of prevalence

| Prevalence (%) | Specificity | False positives¹ | Test | Sensitivity (95% CI) | Missed cases (95% CI) |
|---|---|---|---|---|---|
| 42.0 | 0.79 | 122 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 10 (5 to 20) |
| | | | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 15 (10 to 20) |
| | | | Serology | 0.93 (0.87 to 0.96) | 31 (17 to 54) |
| | | | Stool antigen test | 0.92 (0.87 to 0.96) | 32 (18 to 57) |
| 53.7 | 0.79 | 97 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 13 (6 to 26) |
| | | | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 19 (13 to 26) |
| | | | Serology | 0.93 (0.87 to 0.96) | 39 (22 to 69) |
| | | | Stool antigen test | 0.92 (0.87 to 0.96) | 41 (23 to 72) |
| 66.5 | 0.79 | 70 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 16 (8 to 32) |
| | | | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 23 (16 to 32) |
| | | | Serology | 0.93 (0.87 to 0.96) | 49 (27 to 85) |
| | | | Stool antigen test | 0.92 (0.87 to 0.96) | 51 (28 to 89) |
| 42.0 | 0.90 | 58 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 23 (12 to 46) |
| | | | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 33 (24 to 46) |
| | | | Serology | 0.84 (0.74 to 0.91) | 67 (39 to 110) |
| | | | Stool antigen test | 0.83 (0.73 to 0.90) | 70 (41 to 114) |
| 53.7 | 0.90 | 46 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 30 (15 to 58) |
| | | | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 42 (30 to 58) |
| | | | Serology | 0.84 (0.74 to 0.91) | 86 (50 to 140) |
| | | | Stool antigen test | 0.83 (0.73 to 0.90) | 89 (52 to 146) |
| 66.5 | 0.90 | 34 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 37 (18 to 72) |
| | | | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 53 (38 to 72) |
| | | | Serology | 0.84 (0.74 to 0.91) | 106 (62 to 173) |
| | | | Stool antigen test | 0.83 (0.73 to 0.90) | 111 (64 to 180) |
| 42.0 | 0.96 | 23 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 57 (30 to 103) |
| | | | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 78 (58 to 103) |
| | | | Serology | 0.66 (0.52 to 0.79) | 141 (90 to 204) |
| | | | Stool antigen test | 0.65 (0.50 to 0.78) | 146 (93 to 209) |
| 53.7 | 0.96 | 19 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 73 (38 to 132) |
| | | | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 100 (74 to 132) |
| | | | Serology | 0.66 (0.52 to 0.79) | 181 (115 to 260) |
| | | | Stool antigen test | 0.65 (0.50 to 0.78) | 187 (119 to 267) |
| 66.5 | 0.96 | 13 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 90 (47 to 163) |
| | | | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 124 (92 to 163) |
| | | | Serology | 0.66 (0.52 to 0.79) | 224 (142 to 322) |
| | | | Stool antigen test | 0.65 (0.50 to 0.78) | 231 (148 to 331) |

¹ Average number of participants who are diagnosed with H pylori infection but do not have the infection per 1000 tested.

The sensitivities were estimated from the SROC curves at fixed values (lower quartile, median and upper quartile) of specificity from the included studies across all tests. Based on these sensitivities and specificities, and quartiles of prevalence from the included studies (across all tests), the numbers of missed H pylori cases and false positives (i.e. overdiagnosed people) were calculated using a hypothetical cohort of 1000 people suspected of having H pylori infection.
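The arithmetic behind Table 3 can be sketched in Python. The sketch assumes that, on a symmetric SROC curve, the DOR is constant along the curve, so the sensitivity at a fixed specificity follows from the identity DOR = [sensitivity / (1 − sensitivity)] / [(1 − specificity) / specificity]. The function names `sensitivity_at` and `per_1000` are hypothetical, and the sketch uses only the point estimates of the DORs from Table 2; it is an illustration of the calculation, not the review's fitted model, and it does not reproduce the confidence intervals.

```python
def sensitivity_at(specificity: float, dor: float) -> float:
    """Sensitivity at a fixed specificity on a symmetric SROC curve.

    Assumes the diagnostic odds ratio (DOR) is constant along the curve.
    """
    odds = dor * (1.0 - specificity) / specificity  # odds of a positive result in infected people
    return odds / (1.0 + odds)


def per_1000(prevalence: float, specificity: float, sensitivity: float,
             cohort: int = 1000) -> tuple[float, float]:
    """Return (missed cases, false positives) in a hypothetical cohort."""
    infected = cohort * prevalence
    not_infected = cohort - infected
    missed = infected * (1.0 - sensitivity)          # false negatives
    false_pos = not_infected * (1.0 - specificity)   # overdiagnosed people
    return missed, false_pos


if __name__ == "__main__":
    # Point estimates of the DORs from Table 2.
    dors = {
        "Urea breath test-13C": 153.0,
        "Urea breath test-14C": 105.0,
        "Serology": 47.4,
        "Stool antigen test": 45.1,
    }
    prevalence, specificity = 0.537, 0.90  # median prevalence and median specificity
    for test, dor in dors.items():
        sens = sensitivity_at(specificity, dor)
        missed, false_pos = per_1000(prevalence, specificity, sens)
        print(f"{test}: sensitivity {sens:.2f}, "
              f"missed {missed:.0f}, false positives {false_pos:.0f}")
```

The number of false positives depends only on prevalence and specificity, which is why Table 3 reports a single false-positive count for each prevalence and specificity pair, shared by all four tests.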

Direct comparisons (restricted to comparative studies)

Direct comparisons were based on few studies. Table 4 shows the number of studies (N) for each pairwise comparison and, where meta‐analysis was possible, the ratio of DORs with 95% CIs and P value. There were no comparative studies of urea breath test‐13C and urea breath test‐14C. All other comparisons were based on seven or fewer studies. Each pair of tests was evaluated as follows:

Table 4. Direct comparison of the accuracy of non‐invasive tests for H pylori infection

| Test | Urea breath test‐13C | Urea breath test‐14C | Serology |
|---|---|---|---|
| Urea breath test‐13C | | | |
| Urea breath test‐14C | N = 0 | | |
| Serology | N = 7; DOR of urea breath test‐13C = 74.8 (95% CI 17.8 to 314); DOR of serology = 111 (95% CI 41.2 to 297); RDOR of urea breath test‐13C versus serology = 0.68 (95% CI 0.12 to 3.70), P = 0.56 | N = 1 | |
| Stool antigen test | N = 7; DOR of urea breath test‐13C = 46.6 (95% CI 3.30 to 658); DOR of stool antigen test = 53.0 (95% CI 5.34 to 527); RDOR of urea breath test‐13C versus stool antigen test = 0.88 (95% CI 0.14 to 5.56), P = 0.84 | N = 2 | N = 4 |

DOR = diagnostic odds ratio; N = number of studies; RDORs = ratio of diagnostic odds ratios.

Due to paucity of data and substantial heterogeneity observed in ROC space which precluded the use of simpler meta‐analytic models, meta‐analyses were not possible for two test comparisons that had more than one study. For the single study of urea breath test‐14C versus serology (Mansour‐Ghanaei 2011), both tests had similar sensitivity, but specificity was higher for urea breath test‐14C than for serology. The ratio of diagnostic odds ratios is the DOR of the test in the column divided by the DOR of the test in the row. If the ratio is greater than one, then the test in the column is more accurate than the test in the row; if the ratio is less than one, the test in the row is more accurate than the test in the column.


Summary ROC plot of direct comparisons of urea breath test‐13C and serology. Each summary curve was drawn restricted to the range of specificities for each test. The size of each symbol was scaled according to the precision of sensitivity and specificity in the study. A dotted line joins the pair of points for the two tests from each study.



Summary ROC plot of direct comparisons of urea breath test‐13C and stool antigen test. Each summary curve was drawn restricted to the range of specificities for each test. The size of each symbol was scaled according to the precision of sensitivity and specificity in the study. A dotted line joins the pair of points for the two tests from each study.


The ratios of DORs (95% CI; P value) were 0.68 (95% CI 0.12 to 3.70; P = 0.56) for urea breath test‐13C versus serology, and 0.88 (95% CI 0.14 to 5.56; P = 0.84) for urea breath test‐13C versus stool antigen test. Due to paucity of data and substantial heterogeneity observed in ROC space which precluded the use of simpler meta‐analytic models, meta‐analyses were not possible for the other two test comparisons that had more than one study. For the single study of urea breath test‐14C versus serology (Mansour‐Ghanaei 2011), both tests had similar sensitivity, but specificity was higher for urea breath test‐14C than for serology.

Investigation of heterogeneity

We were unable to investigate subtype of tests because most of the serological tests were ELISA (17/20 (85%) studies that provided the type of serology test) and most studies (24/29 (83%) studies) did not report whether monoclonal or polyclonal antibodies were used for stool antigen tests. Few studies reported the precise interval between index test and reference standard (unless the tests were performed on the same day): many studies did not report the interval at all, while some reported that the tests were performed within a few days of each other without stating the exact time interval. Of those that reported the interval, only two studies had an interval of more than two weeks (Dede 2015; Lombardo 2003). For each of the four tests, Appendix 10 shows the number of studies in each subgroup of other factors we had planned to investigate. Given the availability of data, we were only able to perform meta‐regression to investigate the effect of reference standard on the accuracy of each test. Of the 99 studies, 42 (42%) used a combination of stains and there were few data for immunohistochemical stains (2/99; 2%). The analyses were therefore limited to comparisons of H & E stain versus special stain for each test (Appendix 11). Although the effect of reference standard was not consistent across tests, there was no statistical evidence of a difference in test accuracy for any of the tests. For urea breath test‐14C, the DOR for special stain was higher than for H & E stain, while for the other tests the DORs for the two types of stain were similar or the DOR was higher for H & E (Appendix 11).

Discussion

Summary of main results

We included 101 studies (11,003 participants) that evaluated the diagnostic accuracy of different non‐invasive methods for the diagnosis of H pylori. Of these 11,003 participants, 5839 participants (53.1%) had H pylori infection. The prevalence of H pylori infection ranged from 15.2% to 94.7%. The median prevalence was 53.7% (lower quartile: 42.0% and upper quartile: 66.5%).

The summary of results for urea breath test‐13C, urea breath test‐14C, serology and stool antigen test is given in the summary of findings table. The studies used different thresholds and reference standards. As a result, there were few data for pooling sensitivities and specificities at specific thresholds, and we mainly estimated and compared SROC curves. The test comparison based on all available data (99 studies) for the four tests showed a statistically significant difference in diagnostic accuracy between the tests (P = 0.024). There was no statistical evidence of a difference in diagnostic accuracy between urea breath test‐13C and urea breath test‐14C, while serology and stool antigen test were inferior to both urea breath tests. Direct comparisons are more reliable than indirect comparisons, due to the potential for confounding in indirect comparisons (Takwoingi 2013). However, we found few head‐to‐head studies and meta‐analysis was possible for only two pairwise comparisons (urea breath test‐13C versus serology, seven studies; and urea breath test‐13C versus stool antigen test, seven studies).

Most of the tests that used visual assessment (for example, appearance of a pink‐red line) were stool antigen tests, although some serology tests also used visual assessment. Some serology and stool antigen tests are therefore easy to use (stool antigen test is easier to use, as described below), but their lower diagnostic accuracy is a disadvantage when compared to urea breath tests. Urea breath test is a cumbersome test and involves the use of radioisotopes; however, urea breath test‐13C may be the most accurate of the non‐invasive tests. This has implications for the screening of individuals for H pylori, as a decision has to be made between a cumbersome and relatively costly test with good diagnostic accuracy and cheap tests that can be performed easily but have lower diagnostic accuracy. A further decision to make if one opts for easy‐to‐use tests is the threshold at which the test should be used. For example, one can use a threshold that provides higher sensitivity (at the cost of lower specificity, necessitating endoscopic biopsy confirmation or treatment) or a threshold that provides higher specificity (at the cost of lower sensitivity, resulting in people with H pylori not being treated). Although at first sight it appears that the treatment for H pylori is relatively harmless and one would prefer a threshold at which the test has higher sensitivity rather than higher specificity, the decision to give antibiotics is not a straightforward one, because of the association between unnecessary antibiotic use and development of antimicrobial resistance (Llor 2014).

Serology and stool antigen test have similar diagnostic test accuracy and the choice between the two may be made based on ease of carrying out the tests. Only one study included in this review used whole blood for performing serology (Chey 1998), and even this test required a laboratory technician to interpret the test result (Chey 1998). There are therefore no bedside tests available for serology testing. On the other hand, bedside kits with easy interpretation by colour changes are available for stool antigen tests, making them easy to administer (Inelmen 2004; Jekarl 2013; Kuloglu 2008; Qadeer 2009; Trevisani 2005). A cost‐effectiveness study may clarify the most cost‐effective non‐invasive test in people with suspected H pylori, but it is difficult to factor in the price of antimicrobial resistance to an individual, as that price is paid by future generations (through increased mortality and decreased productivity) rather than by the individual for whom the treatment decision has to be made (Taylor 2014).

Strengths and weaknesses of the review

We conducted a thorough literature search and included full‐text publications and abstracts without any language restrictions. There are currently no reliable search strategies to identify diagnostic test accuracy studies (Beynon 2013). We did not use any diagnostic filter in our search strategy, thereby ensuring that studies on the topic were identified. Two review authors independently identified and extracted data from the studies, potentially decreasing errors related to single data extraction. PCR methodology is not standardised across laboratories and it is an unreliable reference standard (Chey 2007). Endoscopic biopsy followed by rapid urease testing has poor sensitivity following treatment with proton pump inhibitors, and endoscopic biopsy with culture has high specificity but poor sensitivity (Chey 2007). We used a strict reference standard (histology) which is likely to diagnose the target condition with a high degree of accuracy. These are the major strengths of the review.

A major limitation was the diversity of thresholds used in the studies. As a result, data were sparse for each threshold, which limited estimation of summary sensitivities and specificities. Therefore there is insufficient evidence to recommend specific thresholds for each of the tests. Nonetheless, we were able to estimate and compare SROC curves by including studies with different thresholds. There was a high proportion of studies at high risk of bias and with high concern regarding applicability in all four domains of the QUADAS‐2 tool. This makes the validity and applicability of the results questionable. The major concerns were that the threshold used was not reported or, when thresholds were reported, there was no information to judge whether they were prespecified. Despite the lack of statistical evidence of an effect of type of reference standard on test accuracy, we cannot conclude that diagnostic accuracy does not depend on type of reference standard, because there were few studies in each subgroup and there were other differences between studies.

Comparison with other systematic reviews

We identified several relevant systematic reviews (Ferwana 2015; Gisbert 2001; Gisbert 2004a; Loy 1996; Zhou 2014; Zhou 2017). The findings from this review support those of Zhou 2017, and Ferwana 2015, that urea breath test has high diagnostic accuracy and that there was significant heterogeneity in the diagnostic accuracy of the urea breath test (Zhou 2017). Our findings agree with those of Zhou 2014 that stool antigen test has only modest diagnostic test accuracy. The review findings are contrary to those of Gisbert 2001, and Gisbert 2004a, which suggested that stool antigen tests are highly accurate. This difference may be due to the strict reference standards that we used in this review and how we handled the issue of heterogeneity in thresholds. In agreement with the findings of Loy 1996, the role of serology in clinical practice is uncertain, as stool antigen tests provide equivalent diagnostic accuracy to serology and are easier to interpret.

Applicability of findings to the review question

This review included adults and children who underwent non‐invasive tests for the diagnosis of H pylori. Most of the studies included only symptomatic people and so the findings of this review are applicable only to people with symptoms. Most studies excluded people who had previous gastrectomy and those who had recent antibiotics or proton pump inhibitors. Hence, the findings of this review are not applicable in these populations.

Figure 1. Clinical pathway.

Figure 2. Study flow diagram.

Figure 3. Risk of bias and applicability concerns graph: review authors' judgements about each domain presented as percentages across included studies. For each domain, the numbers shown on the bar represent the number of studies that were scored as high, unclear or low in terms of risk of bias or applicability concern.

Figure 4. Forest plot of urea breath test‐13C. FN = false negative; FP = false positive; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. For threshold, the number of minutes in brackets is the time after administration of urea.

Figure 5. Forest plot of urea breath test‐13C at commonly reported thresholds. FN = false negative; FP = false positive; TN = true negative; TP = true positive. Thresholds are shown in brackets and the number of minutes in brackets is the time after administration of urea.

Figure 6. Forest plot of urea breath test‐14C. FN = false negative; FP = false positive; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. For threshold, the number of minutes in brackets is the time after administration of urea.

Figure 7. Forest plot of serology. FN = false negative; FP = false positive; SD = standard deviation; TN = true negative; TP = true positive. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity. Other threshold is staining of a 120 kDa protein (CagA) gel band and/or at least two of five proteins between 28–33 kDa.

Figure 8. Forest plot of stool antigen test. FN = false negative; FP = false positive; TN = true negative; TP = true positive; WL = wavelength. The forest plot shows an estimate of sensitivity and specificity from each study and the threshold used. Studies are sorted by threshold, sensitivity and specificity.

Figure 9. Summary ROC plot of non‐invasive tests for H pylori infection. The SROC curves for the four tests are parallel. The curve for each test is drawn within the range of estimates of specificity from the studies included for the test.

Figure 10. Summary ROC plot of direct comparisons of urea breath test‐13C and serology. Each summary curve was drawn restricted to the range of specificities for each test. The size of each symbol was scaled according to the precision of sensitivity and specificity in the study. A dotted line joins the pair of points for the two tests from each study.

Figure 11. Summary ROC plot of direct comparisons of urea breath test‐13C and stool antigen test. Each summary curve was drawn restricted to the range of specificities for each test. The size of each symbol was scaled according to the precision of sensitivity and specificity in the study. A dotted line joins the pair of points for the two tests from each study.

Figure 12. Risk of bias and applicability concerns summary: review authors' judgements about each domain for each study included for urea breath test‐13C.

Figure 13. Risk of bias and applicability concerns summary: review authors' judgements about each domain for each study included for urea breath test‐14C.

Figure 14. Risk of bias and applicability concerns summary: review authors' judgements about each domain for each study included for serology.

Figure 15. Risk of bias and applicability concerns summary: review authors' judgements about each domain for each study included for the stool antigen test.

Test 1. Urea breath test‐13C.

Test 2. Urea breath test‐14C.

Test 3. Urea breath test ‐ Unknown isotope.

Test 4. Serology.

Test 5. Stool antigen test.

Test 6. Urea breath test‐13C (delta over baseline > 3% (20 minutes)).

Test 7. Urea breath test‐13C (delta over baseline > 3% (30 minutes)).

Test 8. Urea breath test‐13C (delta over baseline > 3.5% (30 minutes)).

Test 9. Urea breath test‐13C (delta over baseline > 4% (10 minutes)).

Test 10. Urea breath test‐13C (delta over baseline > 4% (20 minutes)).

Test 11. Urea breath test‐13C (delta over baseline > 4% (30 minutes)).

Test 12. Urea breath test‐13C (delta over baseline > 4.5% (30 minutes)).

Test 13. Urea breath test‐13C (delta over baseline > 5% (30 minutes)).

Test 14. Urea breath test‐14C (counts per minute > 50).

Test 15. Urea breath test‐14C (disintegrations per minute > 200).

Test 16. Serology > 7 units/ml.

Test 17. Serology ≥ 300 units.

Summary of findings. Performance of non‐invasive tests for diagnosis of H pylori infection

What is the best non‐invasive test for diagnosis of H pylori infection?

| Item | Description |
|---|---|
| Population | Children and adults with gastrointestinal symptoms |
| Setting | Primary care setting |
| Index tests | Urea breath test‐13C, urea breath test‐14C, serology, and stool antigen test |
| Threshold | Various thresholds were used for each test |
| Role and purpose of test | Screening and diagnosis of H pylori |
| Reference standard | Endoscopic biopsy with haematoxylin & eosin stain, special stains, or combination of haematoxylin & eosin and special stains |
| Quality of evidence | Risk of bias was generally high or unclear with respect to the selection of participants, and the conduct and interpretation of the index tests and reference standard. Applicability concerns were also generally high or unclear with respect to selection of participants |
| Limitations | There was heterogeneity in thresholds and reference standards. Studies did not often prespecify or clearly report thresholds used |
| Pre‐test probability (prevalence of Helicobacter pylori) | Median (interquartile range) = 53.7% (42.0% to 66.5%) |

| Index test | Number of participants (studies) | Diagnostic odds ratio (95% CI) | Sensitivity (95% CI) at fixed specificity of 0.90¹ | Missed H pylori cases per 1000 people tested (95% CI)² |
|---|---|---|---|---|
| Urea breath test‐13C | 3139 participants (34 studies) | 153 (73.7 to 316) | 0.94 (0.89 to 0.97) | 30 (15 to 58) |
| Urea breath test‐14C | 1810 participants (21 studies) | 105 (74.0 to 150) | 0.92 (0.89 to 0.94) | 42 (30 to 58) |
| Serology | 4242 participants (34 studies) | 47.4 (25.5 to 88.1) | 0.84 (0.74 to 0.91) | 86 (50 to 140) |
| Stool antigen test | 2988 participants (29 studies) | 45.1 (24.2 to 84.1) | 0.83 (0.73 to 0.90) | 89 (52 to 146) |

Comparison of non‐invasive tests for H pylori infection

Based on an indirect comparison of the four tests using all the studies, there was statistical evidence of a difference in diagnostic accuracy (P = 0.024). Direct comparisons were based on few head‐to‐head studies. The ratios of diagnostic odds ratios were 0.68 (95% CI 0.12 to 3.70; P = 0.56) for urea breath test‐13C versus serology (seven studies), and 0.88 (95% CI 0.14 to 5.56; P = 0.84) for urea breath test‐13C versus stool antigen test (seven studies). The 95% confidence intervals of these estimates overlap with those of the ratios of diagnostic odds ratios from the indirect comparison. Data were limited or unavailable for meta‐analysis of other direct comparisons.

Conclusions

In people with no history of gastrectomy and those who have not recently had antibiotics or proton pump inhibitors, urea breath tests had high diagnostic accuracy while serology and stool antigen tests had lower accuracy to detect H pylori infection. Although susceptible to bias due to confounding, this conclusion is based on evidence from indirect test comparisons as evidence from direct comparisons was based on few studies or was unavailable. It should be noted that studies were generally of poor methodological quality. The thresholds used for the tests were highly variable and there is currently insufficient evidence to recommend specific thresholds for use in clinical practice.

¹ The sensitivities were estimated along the SROC curves at the median specificity across the studies included for the four tests.

² Based on the sensitivity estimated at the median specificity of 0.90, and the median prevalence of 53.7% from the included studies, the numbers of missed H pylori cases were calculated using a hypothetical cohort of 1000 people suspected of having H pylori infection. The 95% CI for the number of missed cases is from the 95% CI for sensitivity. For a specificity of 0.90 and prevalence of 53.7%, there will be 46 false positives. See Table 3 for results for other values of specificity and prevalence.

Summary of findings Performance of non‐invasive tests for diagnosis of H pylori infection
Table 1. Summary of results at thresholds commonly reported for urea breath test‐13C, urea breath test‐14C and serology

| Test | Threshold | Studies | Number of participants (cases) | Sensitivity (95% CI) | Specificity (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Urea breath test‐13C | Delta over baseline > 3% (20 minutes) | 2 | 254 (128) | 0.98 (0.90 to 1.00) | 0.92 (0.82 to 0.97) |
| Urea breath test‐13C | Delta over baseline > 3% (30 minutes) | 3 | 333 (140) | 0.99 (0.92 to 1.00) | 0.95 (0.90 to 0.98) |
| Urea breath test‐13C | Delta over baseline > 3.5% (30 minutes) | 3 | 368 (120) | 0.75 to 1.00 | 0.77 to 1.00 |
| Urea breath test‐13C | Delta over baseline > 4% (10 minutes) | 2 | 236 (118) | 0.91 to 1.00 | 0.60 to 0.95 |
| Urea breath test‐13C | Delta over baseline > 4% (20 minutes) | 2 | 236 (118) | 0.91 to 1.00 | 0.60 to 0.96 |
| Urea breath test‐13C | Delta over baseline > 4% (30 minutes) | 10 | 958 (423) | 0.95 (0.79 to 0.99) | 0.95 (0.87 to 0.98) |
| Urea breath test‐13C | Delta over baseline > 4.5% (30 minutes) | 3 | 288 (106) | 0.50 to 0.96 | 0.82 to 0.96 |
| Urea breath test‐13C | Delta over baseline > 5% (30 minutes) | 4 | 601 (315) | 0.95 (0.49 to 1.00) | 0.94 (0.84 to 0.98) |
| Urea breath test‐14C | Counts per minute > 50 (10 minutes) | 6 | 471 (231) | 0.89 (0.55 to 0.98) | 0.91 (0.79 to 0.96) |
| Urea breath test‐14C | Disintegrations per minute > 200 (10 minutes) | 4 | 296 (132) | 0.95 (0.33 to 1.00) | 0.95 (0.80 to 0.99) |
| Serology | > 7 units/ml | 2 | 97 (48) | 0.98 (0.74 to 1.00) | 0.71 (0.51 to 0.86) |
| Serology | ≥ 300 units | 2 | 234 (143) | 0.91 (0.82 to 0.96) | 0.86 (0.72 to 0.93) |

Tests evaluated at the same threshold by more than one study are presented in the table. When there were two or three studies at the same threshold, and little or no heterogeneity was observed in ROC space, estimates of summary sensitivity and summary specificity were obtained by using univariate fixed‐effect logistic regression models to pool sensitivities and specificities separately. When there were two or three studies and we observed heterogeneity, we did not perform meta‐analysis but report the range of the sensitivities and specificities.
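The pooling step described in this note can be sketched as follows. This is a minimal sketch that assumes the univariate fixed-effect model is an intercept-only logistic regression, for which the maximum-likelihood estimate is the logit of total events over total participants. The function name `pool_fixed_effect` and the study counts are hypothetical and are not taken from the review.

```python
import math


def pool_fixed_effect(events_and_totals, z=1.96):
    """Pool proportions with an intercept-only fixed-effect logistic model.

    Each item is (events, total), e.g. (true positives, infected participants)
    when pooling sensitivity. Returns (estimate, lower, upper), where the Wald
    interval is built on the logit scale and then back-transformed.
    """
    events = sum(e for e, _ in events_and_totals)
    total = sum(n for _, n in events_and_totals)
    non_events = total - events
    if events == 0 or non_events == 0:
        raise ValueError("logit is undefined when a pooled cell is zero")

    logit = math.log(events / non_events)
    se = math.sqrt(1 / events + 1 / non_events)

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return expit(logit), expit(logit - z * se), expit(logit + z * se)


# Hypothetical counts for two studies: (true positives, infected participants).
hypothetical_studies = [(45, 50), (90, 100)]
sens, lower, upper = pool_fixed_effect(hypothetical_studies)
print(f"pooled sensitivity {sens:.2f} ({lower:.2f} to {upper:.2f})")
```

Specificity would be pooled the same way from (true negatives, uninfected participants). The sketch ignores between-study heterogeneity, which is consistent with the note's restriction of pooling to thresholds where little or none was observed.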

Table 2. Indirect comparison of the accuracy of non‐invasive tests for H pylori infection

| Index tests | Studies; participants (H pylori present) | DOR (95% CI) | Ratio of diagnostic odds ratios (95% CI), P value: Urea breath test‐13C | Ratio of diagnostic odds ratios (95% CI), P value: Urea breath test‐14C | Ratio of diagnostic odds ratios (95% CI), P value: Serology |
| --- | --- | --- | --- | --- | --- |
| Urea breath test‐13C | 34; 3139 (1526) | 153 (73.7 to 316) | | | |
| Urea breath test‐14C | 21; 1810 (1018) | 105 (74.0 to 150) | 1.45 (0.65 to 3.26), P = 0.36 | | |
| Serology | 34; 4242 (2477) | 47.4 (25.5 to 88.1) | 3.22 (1.24 to 8.37), P = 0.017 | 2.22 (1.09 to 4.51), P = 0.028 | |
| Stool antigen test | 29; 2988 (1311) | 45.1 (24.2 to 84.1) | 3.39 (1.30 to 8.83), P = 0.013 | 2.33 (1.14 to 4.76), P = 0.020 | 1.05 (0.44 to 2.53), P = 0.91 |

The indirect comparison included all studies that evaluated at least one of the four tests, i.e. all available data. The ratio of diagnostic odds ratios is the diagnostic odds ratio (DOR) of the test in the column divided by the DOR of the test in the row. If the ratio is greater than one, then the test in the column is more accurate than the test in the row; if the ratio is less than one, the test in the row is more accurate than the test in the column.
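The column-over-row rule in this note can be illustrated with a short sketch that divides the summary DORs reported in Table 2. The function names are hypothetical. Because the inputs are the rounded point estimates, the results agree with the tabulated ratios only to within about 0.01, and the confidence intervals and P values cannot be recovered from point estimates alone.

```python
from itertools import combinations

# Summary DOR point estimates copied from Table 2 (rounded as published).
DOR = {
    "Urea breath test-13C": 153.0,
    "Urea breath test-14C": 105.0,
    "Serology": 47.4,
    "Stool antigen test": 45.1,
}


def ratio_of_dors(column_test, row_test, dors=DOR):
    """DOR of the column test divided by the DOR of the row test."""
    return dors[column_test] / dors[row_test]


def pairwise_ratios(dors=DOR):
    """Ratios for every (column, row) pair in the table's lower triangle."""
    names = list(dors)
    return {
        (col, row): ratio_of_dors(col, row, dors)
        for col, row in combinations(names, 2)
    }


if __name__ == "__main__":
    for (col, row), ratio in pairwise_ratios().items():
        verdict = "column more accurate" if ratio > 1 else "row more accurate"
        print(f"{col} / {row}: {ratio:.2f} ({verdict})")
```

A ratio above one favours the column test on the point estimate only. Whether the difference is statistically supported depends on the interval and P value reported in the table.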

Table 3. Accuracy of non‐invasive tests for H pylori infection at different levels of prevalence

| Prevalence (%) | Specificity | False positives¹ | Test | Sensitivity (95% CI) | Missed cases (95% CI) |
| --- | --- | --- | --- | --- | --- |
| 42.0 | 0.79 | 122 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 10 (5 to 20) |
| 42.0 | 0.79 | 122 | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 15 (10 to 20) |
| 42.0 | 0.79 | 122 | Serology | 0.93 (0.87 to 0.96) | 31 (17 to 54) |
| 42.0 | 0.79 | 122 | Stool antigen test | 0.92 (0.87 to 0.96) | 32 (18 to 57) |
| 53.7 | 0.79 | 97 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 13 (6 to 26) |
| 53.7 | 0.79 | 97 | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 19 (13 to 26) |
| 53.7 | 0.79 | 97 | Serology | 0.93 (0.87 to 0.96) | 39 (22 to 69) |
| 53.7 | 0.79 | 97 | Stool antigen test | 0.92 (0.87 to 0.96) | 41 (23 to 72) |
| 66.5 | 0.79 | 70 | Urea breath test‐13C | 0.98 (0.95 to 0.99) | 16 (8 to 32) |
| 66.5 | 0.79 | 70 | Urea breath test‐14C | 0.97 (0.95 to 0.98) | 23 (16 to 32) |
| 66.5 | 0.79 | 70 | Serology | 0.93 (0.87 to 0.96) | 49 (27 to 85) |
| 66.5 | 0.79 | 70 | Stool antigen test | 0.92 (0.87 to 0.96) | 51 (28 to 89) |
| 42.0 | 0.90 | 58 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 23 (12 to 46) |
| 42.0 | 0.90 | 58 | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 33 (24 to 46) |
| 42.0 | 0.90 | 58 | Serology | 0.84 (0.74 to 0.91) | 67 (39 to 110) |
| 42.0 | 0.90 | 58 | Stool antigen test | 0.83 (0.73 to 0.90) | 70 (41 to 114) |
| 53.7 | 0.90 | 46 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 30 (15 to 58) |
| 53.7 | 0.90 | 46 | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 42 (30 to 58) |
| 53.7 | 0.90 | 46 | Serology | 0.84 (0.74 to 0.91) | 86 (50 to 140) |
| 53.7 | 0.90 | 46 | Stool antigen test | 0.83 (0.73 to 0.90) | 89 (52 to 146) |
| 66.5 | 0.90 | 34 | Urea breath test‐13C | 0.94 (0.89 to 0.97) | 37 (18 to 72) |
| 66.5 | 0.90 | 34 | Urea breath test‐14C | 0.92 (0.89 to 0.94) | 53 (38 to 72) |
| 66.5 | 0.90 | 34 | Serology | 0.84 (0.74 to 0.91) | 106 (62 to 173) |
| 66.5 | 0.90 | 34 | Stool antigen test | 0.83 (0.73 to 0.90) | 111 (64 to 180) |
| 42.0 | 0.96 | 23 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 57 (30 to 103) |
| 42.0 | 0.96 | 23 | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 78 (58 to 103) |
| 42.0 | 0.96 | 23 | Serology | 0.66 (0.52 to 0.79) | 141 (90 to 204) |
| 42.0 | 0.96 | 23 | Stool antigen test | 0.65 (0.50 to 0.78) | 146 (93 to 209) |
| 53.7 | 0.96 | 19 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 73 (38 to 132) |
| 53.7 | 0.96 | 19 | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 100 (74 to 132) |
| 53.7 | 0.96 | 19 | Serology | 0.66 (0.52 to 0.79) | 181 (115 to 260) |
| 53.7 | 0.96 | 19 | Stool antigen test | 0.65 (0.50 to 0.78) | 187 (119 to 267) |
| 66.5 | 0.96 | 13 | Urea breath test‐13C | 0.86 (0.75 to 0.93) | 90 (47 to 163) |
| 66.5 | 0.96 | 13 | Urea breath test‐14C | 0.81 (0.76 to 0.86) | 124 (92 to 163) |
| 66.5 | 0.96 | 13 | Serology | 0.66 (0.52 to 0.79) | 224 (142 to 322) |
| 66.5 | 0.96 | 13 | Stool antigen test | 0.65 (0.50 to 0.78) | 231 (148 to 331) |

¹ Average number of participants who are diagnosed with H pylori infection but do not have the infection per 1000 tested.

The sensitivities were estimated from the SROC curves at fixed values (lower quartile, median and upper quartile) of specificity from the included studies across all tests. Based on these sensitivities and specificities, and quartiles of prevalence from the included studies (across all tests), the numbers of missed H pylori cases and false positives (i.e. overdiagnosed people) were calculated using a hypothetical cohort of 1000 people suspected of having H pylori infection.
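The cohort calculation described here can be sketched for the full grid of prevalence and specificity quartiles. The function names are hypothetical. Table 3 reports whole numbers, so the sketch prints unrounded values for comparison. Missed-case counts computed from the rounded sensitivities in the table differ slightly from the published ones (for example about 32 rather than 30 for urea breath test‐13C at 53.7% prevalence), presumably because the review worked from unrounded sensitivity estimates.

```python
def expected_errors(prevalence_pct, sensitivity, specificity, cohort=1000):
    """Return (missed cases, false positives) expected in a tested cohort."""
    infected = cohort * prevalence_pct / 100
    uninfected = cohort - infected
    missed_cases = infected * (1 - sensitivity)
    false_positives = uninfected * (1 - specificity)
    return missed_cases, false_positives


# Quartiles of prevalence (%) and specificity as listed in Table 3.
PREVALENCES = (42.0, 53.7, 66.5)
SPECIFICITIES = (0.79, 0.90, 0.96)


def false_positive_grid():
    """False positives per 1000 for each (prevalence, specificity) pair."""
    # Sensitivity does not affect false positives, so a placeholder of 1.0 is used.
    return {
        (prev, spec): expected_errors(prev, 1.0, spec)[1]
        for spec in SPECIFICITIES
        for prev in PREVALENCES
    }


if __name__ == "__main__":
    for (prev, spec), fp in false_positive_grid().items():
        print(f"prevalence {prev}%, specificity {spec}: {fp:.1f} false positives")
    missed, _ = expected_errors(53.7, 0.94, 0.90)
    print(f"urea breath test-13C at rounded sensitivity 0.94: {missed:.1f} missed")
```

At a fixed specificity, false positives fall as prevalence rises because fewer uninfected people are tested, while missed cases rise because more infected people are.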

Table 4. Direct comparison of the accuracy of non‐invasive tests for H pylori infection

| Test | Urea breath test‐13C | Urea breath test‐14C | Serology |
| --- | --- | --- | --- |
| Urea breath test‐13C | | | |
| Urea breath test‐14C | N = 0 | | |
| Serology | N = 7; DOR (95% CI) of urea breath test‐13C = 74.8 (95% CI 17.8 to 314); DOR (95% CI) of serology = 111 (95% CI 41.2 to 297); RDORs (95% CI) of urea breath test‐13C versus serology, P value = 0.68 (95% CI 0.12 to 3.70), P = 0.56 | N = 1 | |
| Stool antigen test | N = 7; DOR (95% CI) of urea breath test‐13C = 46.6 (95% CI 3.30 to 658); DOR (95% CI) of stool antigen test = 53.0 (95% CI 5.34 to 527); RDORs (95% CI) of urea breath test‐13C versus stool antigen test, P value = 0.88 (95% CI 0.14 to 5.56), P = 0.84 | N = 2 | N = 4 |

DOR = diagnostic odds ratio; N = number of studies; RDORs = ratio of diagnostic odds ratios.

Due to paucity of data and substantial heterogeneity observed in ROC space which precluded the use of simpler meta‐analytic models, meta‐analyses were not possible for two test comparisons that had more than one study. For the single study of urea breath test‐14C versus serology (Mansour‐Ghanaei 2011), both tests had similar sensitivity, but specificity was higher for urea breath test‐14C than for serology. The ratio of diagnostic odds ratios is the DOR of the test in the column divided by the DOR of the test in the row. If the ratio is greater than one, then the test in the column is more accurate than the test in the row; if the ratio is less than one, the test in the row is more accurate than the test in the column.

Table Tests. Data tables by test

| Test | No. of studies | No. of participants |
| --- | --- | --- |
| 1 Urea breath test‐13C | 34 | 3139 |
| 2 Urea breath test‐14C | 21 | 1810 |
| 3 Urea breath test ‐ Unknown isotope | 2 | 127 |
| 4 Serology | 34 | 4242 |
| 5 Stool antigen test | 29 | 2988 |
| 6 Urea breath test‐13C (delta over baseline > 3% (20 minutes)) | 2 | 254 |
| 7 Urea breath test‐13C (delta over baseline > 3% (30 minutes)) | 3 | 333 |
| 8 Urea breath test‐13C (delta over baseline > 3.5% (30 minutes)) | 3 | 368 |
| 9 Urea breath test‐13C (delta over baseline > 4% (10 minutes)) | 2 | 236 |
| 10 Urea breath test‐13C (delta over baseline > 4% (20 minutes)) | 2 | 236 |
| 11 Urea breath test‐13C (delta over baseline > 4% (30 minutes)) | 10 | 958 |
| 12 Urea breath test‐13C (delta over baseline > 4.5% (30 minutes)) | 3 | 288 |
| 13 Urea breath test‐13C (delta over baseline > 5% (30 minutes)) | 4 | 601 |
| 14 Urea breath test‐14C (counts per minute > 50) | 6 | 471 |
| 15 Urea breath test‐14C (disintegrations per minute > 200) | 4 | 296 |
| 16 Serology > 7 units/ml | 2 | 97 |
| 17 Serology ≥ 300 units | 2 | 234 |
