Urinary biomarkers for the non‐invasive diagnosis of endometriosis

Summary of findings 1. Biomarkers evaluated as a diagnostic test for endometriosis

Review question	What is the diagnostic accuracy of the urinary biomarkers in detecting pelvic endometriosis [peritoneal endometriosis, endometrioma, DIE]?
Importance	A simple and reliable non‐invasive test for endometriosis, with the potential to either replace surgery or to triage patients in order to reduce surgery, would minimise surgical risk and reduce diagnostic delay
Patients	Reproductive‐aged women 1) with suspected endometriosis or 2) with persistent ovarian mass or 3) undergoing infertility workup or gynaecological laparoscopy
Settings	Hospitals (public or private of any level): outpatient clinics (general gynaecology, reproductive medicine, pelvic pain); research laboratories
Reference standard	Visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Study design	Cross‐sectional studies with a 'single‐gate' design (n = 4) or a 'two‐gate' design (n = 1); prospective enrolment; a single study could assess more than one test
Risk of bias	Overall judgement: Poor quality of most of the studies (no study had a 'low risk' assessment in all 4 domains)
	Patient selection bias: High risk ‐ 1 study; Unclear risk ‐ 4 studies; Low risk ‐ 0 studies
	Index test interpretation bias: High risk ‐ 5 studies; Unclear risk ‐ 0 studies; Low risk ‐ 0 studies
	Reference standard interpretation bias: High risk ‐ 0 studies; Unclear risk ‐ 1 study; Low risk ‐ 4 studies
	Flow and timing selection bias: High risk ‐ 1 study; Unclear risk ‐ 0 studies; Low risk ‐ 4 studies
Applicability concerns	Concerns regarding patient selection: High concern ‐ 3 studies; Unclear concern ‐ 0 studies; Low concern ‐ 2 studies
	Concerns regarding index test: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 5 studies
	Concerns regarding reference standard: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 5 studies
Biomarker		N of studies; N of women	Outcomes				Diagnostic estimates [95% CI]	Implications
Biomarker		N of studies; N of women	True positives (endometriosis)	False negatives (incorrectly classified as disease‐free)	True negatives (disease‐free)	False positives (incorrectly classified as endometriosis)	Diagnostic estimates [95% CI]	Implications
NNE (enolase I) cut‐off > 0.96 ng/mgCr		1; 59	22	17	14	6	Sensitivity 0.56 [0.40, 0.72] Specificity 0.70 [0.46, 0.88]	Insufficient evidence to draw meaningful conclusions
VDBP cut‐off > 87.83 ng/mgCr		1; 95	33	24	21	17	Sensitivity 0.58 [0.44, 0.71] Specificity 0.55 [0.38, 0.71]	Insufficient evidence to draw meaningful conclusions
CK 19 [CYFRA21‐1] cut‐off > 5.3 ng/ml		1; 98	7	56	33	2	Sensitivity 0.11 [0.05, 0.22] Specificity 0.94 [0.81, 0.99]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1824.3 Da cut‐off ≥ 29.34 au		1; 28	10	3	11	4	Sensitivity 0.77 [0.46, 0.95] Specificity 0.73 [0.45, 0.92]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1767.1 Da cut‐off ≥ 35.22 au		1; 27	9	3	13	2	Sensitivity 0.75 [0.43, 0.95] Specificity 0.87 [0.60, 0.98]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 2052.3 Da cut‐off not reported		1; 122	50	10	43	19	Sensitivity 0.83 [0.71, 0.92] Specificity 0.69 [0.56, 0.80]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 3393.9 Da cut‐off not reported		1; 122	51	9	44	18	Sensitivity 0.85 [0.73, 0.93] Specificity 0.71 [0.58, 0.82]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1579.2 Da [collagen alpha 6(IV) chain precursor] cut‐off not reported		1; 122	50	10	43	19	Sensitivity 0.83 [0.71, 0.92] Specificity 0.69 [0.56, 0.80]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 891.6 Da [collagen alpha1 chain precursor] cut‐off not reported		1; 122	49	11	40	22	Sensitivity 0.82 [0.70, 0.90] Specificity 0.65 [0.51, 0.76]	Insufficient evidence to draw meaningful conclusions
Proteome: 5 peptides m/z 1433.9 + 1599.4 + 2085.6 + 6798.0 + 3217.2 Da cut‐off not reported		1; 25	10	1	13	1	Sensitivity 0.91 [0.59, 1.00] Specificity 0.93 [0.66, 1.00]	Insufficient evidence to draw meaningful conclusions. Approaches criteria for a replacement test or SnOUT/SpIN triage tests; further diagnostic test accuracy studies recommended
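
The diagnostic estimates in the table above can be recomputed from the four outcome counts. The sketch below is illustrative only: it assumes that the 95% confidence intervals are exact binomial (Clopper‐Pearson) intervals, which the table does not state, and the function names are hypothetical. Under that assumption, the five‐peptide panel (10 true positives among 11 women with endometriosis) gives a sensitivity of 0.91 with an interval of roughly 0.59 to 1.00, in line with the reported values.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes in n trials (assumed method)."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


# Counts for the five-peptide panel, taken from the table row above.
tp, fn, tn, fp = 10, 1, 13, 1
sens_lo, sens_hi = exact_ci(tp, tp + fn)
spec_lo, spec_hi = exact_ci(tn, tn + fp)
print(f"Sensitivity {tp / (tp + fn):.2f} [{sens_lo:.2f}, {sens_hi:.2f}]")
print(f"Specificity {tn / (tn + fp):.2f} [{spec_lo:.2f}, {spec_hi:.2f}]")
```

With only 11 women with endometriosis and 14 without, the intervals are wide, which is consistent with the table's judgement that the evidence is insufficient despite the high point estimates.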

Summary of findings 2. Biomarkers that do not distinguish between women with and without endometriosis

Review question	Which urinary biomarkers are unlikely to serve as a basis of the diagnostic test for endometriosis?
Importance	Biomarkers that do not show differential expression in women with and without endometriosis are unlikely to be diagnostically useful. Information regarding negative trials can focus research on better diagnostic targets. The biomarkers that display conflicting results (distinguish women with and without endometriosis in some, but not all, studies) can be identified and reported on. Papers that did not show differential expression of a biomarker in endometriosis but were adequately designed and that met inclusion criteria for this review were included.
Patients	Reproductive‐aged women 1) with suspected endometriosis or 2) with persistent ovarian mass or 3) undergoing infertility workup/gynaecological laparoscopy
Settings	Hospitals (public or private of any level): outpatient clinics (general gynaecology, reproductive medicine, pelvic pain); research laboratory
Reference standard	Visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Study design	Cross‐sectional studies with a 'single‐gate' design (n = 1) or a 'two‐gate' design (n = 2); prospective enrolment; a single study could assess more than one test
Risk of bias	Overall judgement: Poor quality (no studies had 'low risk' assessment in all 4 domains)
	Patient selection bias: High risk ‐ 2 studies; Unclear risk ‐ 1 study; Low risk ‐ 0 studies
	Index test interpretation bias: High risk ‐ 3 studies; Unclear risk ‐ 0 studies; Low risk ‐ 0 studies
	Reference standard interpretation bias: High risk ‐ 0 studies; Unclear risk ‐ 0 studies; Low risk ‐ 3 studies
	Flow and timing selection bias: High risk ‐ 0 studies; Unclear risk ‐ 0 studies; Low risk ‐ 3 studies
Applicability concerns	Concerns regarding patient selection: High concern ‐ 2 studies; Unclear concern ‐ 0 studies; Low concern ‐ 1 study
	Concerns regarding index test: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 3 studies
	Concerns regarding reference standard: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 3 studies
Biomarker	Expression levels	rASRM stage	Menstrual cycle phase	Reference
VEGF	endometriosis (n = 46)¹: 1.11 ± 0.17 pg/mg Cr controls (n = 24): 0.76 ± 0.14 pg/mg Cr p ‐ NS	I‐IV	follicular or luteal	Cho 2007
VEGF	endometriosis (n = 40)¹: 83.6 ± 11.3 pg/mg Cr controls (n = 22): 88.5 ± 10.4 pg/mg Cr P = 0.77	I‐IV	follicular or luteal	Potlog‐Naharia 2004
TNF‐a	endometriosis (n = 46)¹: 0.02 ± 0.01 pg/mg Cr controls (n = 24): 0.01 ± 0.002 pg/mg Cr p ‐ NS	I‐IV	follicular or luteal	Cho 2007
CK 19	endometriosis (n = 44)²: 5.4 ± 5.3 controls (n = 32): 6.7 ± 9.9 p ‐ NS	not reported	follicular or luteal	Kuessel 2014
¹ mean ± SEM ² mean ± SD

Background

Target condition being diagnosed

Endometriosis

Endometriosis is defined as an inflammatory condition characterised by endometrial‐like tissue at sites outside the uterus (Johnson and Hummelshoj 2013). Endometriotic lesions can occur at different locations, including the pelvic peritoneum and the ovary, or they can penetrate pelvic structures below the surface of the peritoneum (defined as deeply infiltrating endometriosis (DIE)). Each of these types of endometriosis is thought to represent a separate clinical entity, but they can coexist in the same woman. Rarely, endometriotic implants can be found at more distant sites, including lung, liver, pancreas and operative scars, with consequent variations in presenting symptoms.

Endometriosis afflicts 10% of reproductive‐aged women, causing dysmenorrhoea (painful periods), dyspareunia (painful intercourse), chronic pelvic pain and infertility (Vigano 2004). The clinical presentation can vary from asymptomatic and unexplained infertility to severe dysmenorrhoea and chronic pain. These symptoms can occur with bowel or urinary symptoms, an abnormal pelvic examination or the presence of a pelvic mass; however, no symptom is specific to endometriosis. The prevalence of endometriosis in the symptomatic population is reported as 35% to 50% (Giudice 2004).

Women with endometriosis are also at increased risk of developing several cancers (Somigliana 2006) and autoimmune disorders (Sinaii 2002). The presence of disease is associated with changes in the immune response, vascularisation, neural function, the peritoneal environment and the eutopic endometrium, suggesting that endometriosis is a systemic, rather than localised, condition (Giudice 2004). Endometriosis has a profound effect on psychological and social well‐being and imposes a substantial economic burden on society. Women with endometriosis incur significant direct medical costs from diagnostic and therapeutic surgeries, hospital admissions and fertility treatments; however, these costs are exceeded by the indirect costs of endometriosis, including absenteeism and loss of productivity (Gao 2006; Simoens 2012). In the USA, the financial burden of endometriosis is estimated at USD 12,419 per woman (Simoens 2012).

Although the pathogenesis of endometriosis has not been fully elucidated, it is commonly thought that endometriosis occurs when endometrial tissue contained within the menstrual fluid flows back through the fallopian tubes and implants at an ectopic site within the pelvic cavity (Sampson 1927). However, this theory does not explain why, although retrograde menstruation is seen in up to 90% of women, only 10% of women develop endometriosis (Halme 1984). There is evidence that a variety of environmental, immunological and hormonal factors are associated with endometriosis (Vigano 2004), and genetic loci that confer a risk of endometriosis have been identified (Nyholt 2012). The relative contribution of these and other causal factors needs further elucidation.

Although it is impossible to time the onset of disease, on average women have a 6‐ to 12‐year history of symptoms before obtaining a surgical diagnosis of endometriosis, indicative of considerable diagnostic delay (Matsuzaki 2006). Untreated endometriosis is associated with reduced quality of life and contributes to outcomes such as depression, inability to work, sexual dysfunction and missed opportunity for motherhood (Gao 2006).

Treatment of endometriosis

There is no cure for endometriosis. Treatment options include expectant management, pharmacological (hormonal) therapy and surgery (Johnson and Hummelshoj 2013). Treatment is individualised, taking into consideration the therapeutic goal (pain relief or subfertility) and the location of the disease. Current pharmacological therapies such as the combined oral contraceptive pill, progestogens, weak androgens and GnRH agonists and antagonists act to reduce the effect of oestrogen on endometrial tissues and suppress menstruation. These drugs can ameliorate the symptoms of dysmenorrhoea and chronic pelvic pain, but are associated with side effects such as breast discomfort, irritability, androgenic symptoms and bone loss. Surgical excision of endometriotic lesions can reduce pain symptoms; however it is associated with high recurrence rates of 40% to 50% at 5 years post‐surgery (Guo 2009). Early treatment of endometriosis improves pain levels and physical and psychological functioning. Furthermore, improvements in menstrual management (the use of the Mirena coil and the continuous use of the combined contraceptive pill) and fertility preservation (oocyte vitrification) raise the possibility of suppressing the progression of endometriosis and prospectively managing subfertility in endometriosis sufferers. The potential success of these preventative strategies is dependent on an accurate and early diagnosis. A major impediment to earlier and more efficacious treatment of this disease is diagnostic delay due to the invasive nature of standard diagnostic tests (Dmowski 1997).

Diagnosis of endometriosis

Clinical history and pelvic examination can raise the possibility of a diagnosis of endometriosis, but the heterogeneity in clinical presentation, the high prevalence of asymptomatic endometriosis (2% to 50%) and the poor association between presenting symptoms and severity of the disease mean that a reliable diagnosis of endometriosis based solely on presenting symptoms is difficult to obtain (Spaczynski 2003; Fauconnier 2005; Ballard 2008). Although an abnormal pelvic examination correlates with the presence of endometriosis on laparoscopy in 70% to 90% of cases (Ling 1999), there is a wide differential diagnosis for most positive physical findings. Furthermore, a normal clinical examination does not exclude endometriosis, as laparoscopically proven disease has been diagnosed in more than 50% of women with a clinically normal pelvic examination (Eskenazi 2001). A variety of tests utilising pelvic imaging, blood markers, eutopic endometrium characteristics, urinary markers or peritoneal fluid components have been suggested as diagnostic measures for endometriosis. Although large numbers of the reported markers distinguish women with and without endometriosis in small pilot studies, many do not show convincing potential as a diagnostic test when they are evaluated in larger studies by different research groups. The diagnostic value of these tests has not previously been systematically evaluated and summarised in full using Cochrane methodologies. Currently, there is no simple non‐invasive test for the diagnosis of endometriosis that is routinely implemented in clinical practice.

Surgical diagnostic procedures for endometriosis include laparoscopy (minimal access surgery) or laparotomy (open surgery via an abdominal incision). In the last several decades, laparoscopy has become an increasingly common procedure and has largely replaced traditional open surgery in women suspected of having endometriosis (Yeung 2009). Laparoscopy has significant advantages over laparotomy, creating fewer complications and shorter recovery times. Furthermore a magnified view at laparoscopy allows better visualisation of the peritoneal cavity. Despite continuing controversy in the literature with regard to the superiority of one surgical modality over another in treating pelvic pathology, laparoscopy is the preferred technique to evaluate the pelvis and abdomen and to treat benign conditions such as ovarian endometriomas (Medeiros 2009). Surgery is currently also the only accepted way to determine the extent and severity of endometriosis. Several classification systems have been suggested for endometriosis (Batt 2003; Chapron 2003a; Martin 2006; Adamson 2008), but most researchers and clinicians use the revised American Society for Reproductive Medicine (rASRM) classification, which is internationally accepted as a respected, currently available tool for the objective assessment of the disease (American Society for Reproductive Medicine 1997). The rASRM classification system considers appearance, size and depth of peritoneal or ovarian implants and adhesions visualised during laparoscopy (Table 1) and allows uniform documentation of the extent of disease. Unfortunately this classification system has little value in clinical practice due to the lack of correlation between laparoscopic staging, the severity of symptoms and response to treatment (Vercellini 1996; Guzick 1997; Chapron 2003b).

Table 1. Staging of endometriosis, rASRM classification

Peritoneum	Endometriosis	< 1 cm	1 to 3 cm	> 3 cm
	Superficial	1	2	4
	Deep	2	4	6
Ovary	R Superficial	1	2	4
	Deep	4	16	20
	L Superficial	1	2	4
	Deep	4	16	20
	Posterior Cul‐de‐sac Obliteration	Partial	Complete
	Posterior Cul‐de‐sac Obliteration	4	40
Ovary	Adhesions	< 1/3 Enclosure	1/3‐2/3 Enclosure	> 2/3 Enclosure
	R Filmy	1	2	4
	Dense	4	8	16
	L Filmy	1	2	4
	Dense	4	8	16
Tube	R Filmy	1	2	4
	Dense	4*	8*	16
	L Filmy	1	2	4
	Dense	4*	8*	16
* If the fimbriated end of the fallopian tube is completely enclosed, change the point assignment to 16. Source: American Society for Reproductive Medicine 1997
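
As an illustration of how the point values in Table 1 combine into a total score, the following sketch sums the points for each recorded finding. It is a simplified, hypothetical helper rather than a clinical tool: the function and dictionary names are invented for this example, the fimbrial‐enclosure footnote is not modelled, and the stage bands (I: 1 to 5, II: 6 to 15, III: 16 to 40, IV: above 40) do not appear in Table 1 and are taken here, as an assumption, from the published rASRM classification.

```python
# Point values transcribed from Table 1. Tuple positions follow the three
# table columns from left to right (smallest to largest size or enclosure).
LESION_POINTS = {
    ("peritoneum", "superficial"): (1, 2, 4),
    ("peritoneum", "deep"): (2, 4, 6),
    ("ovary", "superficial"): (1, 2, 4),
    ("ovary", "deep"): (4, 16, 20),
}
ADHESION_POINTS = {"filmy": (1, 2, 4), "dense": (4, 8, 16)}  # ovary or tube
CUL_DE_SAC_POINTS = {"none": 0, "partial": 4, "complete": 40}


def rasrm_score(lesions, adhesions, cul_de_sac="none"):
    """Sum Table 1 points.

    lesions:   iterable of (site, depth, column) tuples
    adhesions: iterable of (kind, column) tuples
    column:    0, 1 or 2, matching the table columns left to right
    """
    total = CUL_DE_SAC_POINTS[cul_de_sac]
    total += sum(LESION_POINTS[(site, depth)][col] for site, depth, col in lesions)
    total += sum(ADHESION_POINTS[kind][col] for kind, col in adhesions)
    return total


def rasrm_stage(score):
    """Map a total score to a stage; the bands are an assumption (see lead-in)."""
    if score <= 0:
        return None
    if score <= 5:
        return "I"
    if score <= 15:
        return "II"
    if score <= 40:
        return "III"
    return "IV"


# Hypothetical findings: a superficial peritoneal lesion < 1 cm, a deep
# ovarian lesion of 1 to 3 cm, one filmy adhesion enclosing < 1/3 of an
# ovary, and partial posterior cul-de-sac obliteration.
score = rasrm_score(
    lesions=[("peritoneum", "superficial", 0), ("ovary", "deep", 1)],
    adhesions=[("filmy", 0)],
    cul_de_sac="partial",
)
print(score, rasrm_stage(score))
```

In this example the points are 1 + 16 + 1 + 4 = 22, which falls in the assumed stage III band.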

The European Society of Human Reproduction and Embryology (ESHRE) Special Interest Group for Endometriosis stated in their guidelines for the diagnosis and treatment of endometriosis that for women presenting with symptoms suggestive of endometriosis, a definitive diagnosis of most forms of endometriosis requires visual inspection of the pelvis at laparoscopy as the 'gold standard' investigation (Kennedy 2005). Currently the visual identification of endometriotic tissue in the pelvic cavity during surgery with or without histological confirmation is not just the best available but the only diagnostic test for endometriosis that is used routinely in clinical practice.

The disadvantages of laparoscopic surgery include, but are not limited to, the high cost, the need for general anaesthesia and the potential for adhesion formation post procedure. Laparoscopy has been associated with a 2% risk of injury to pelvic organs, a 0.001% risk of damaging a major blood vessel and a mortality rate of 0.0001% (Chapron 2003c). Only one third of women who undergo a laparoscopic procedure will receive a diagnosis of endometriosis; therefore many disease‐free women are unnecessarily exposed to surgical risk (Frishman 2006).

The validity of laparoscopy as a reference test for endometriosis has been assessed as being highly dependent on the skills of the surgeon. The diagnostic accuracy of laparoscopic visualisation has been compared with histological confirmation in a sole systematic review and is estimated as having a 94% sensitivity and 79% specificity (Wykes 2004). Subsequent studies suggested that incorporation of histological verification in the diagnosis of endometriosis may improve diagnostic accuracy (Marchino 2005; Almeida Filho 2008; Stegmann 2008) but these papers have not been systematically reviewed. The clinical significance of histological verification remains debatable, and a diagnosis based on visual findings can be considered reliable with an accurate inspection of the abdominal cavity by properly trained, experienced surgeons (Redwine 2003). Furthermore, excised potential endometriotic tissues are rarely serially sectioned in clinical practice and small lesions can be missed by pathologists in mild disease. Thus sampling inconsistencies are also likely to influence the accuracy of histological reporting.

Summary

A diagnostic test without the need for surgery would reduce surgical risks, increase accessibility to a diagnostic test and improve treatment outcomes. A need for an accurate and non‐invasive diagnostic test for endometriosis continues to encourage extensive research in the field and was endorsed at the international consensus workshop at the 10th World Congress of Endometriosis in 2008 (Rogers 2009). Although multiple markers and imaging techniques have been explored as diagnostic tests for endometriosis, none of them have been implemented routinely in clinical practice and most of them have not been subject to systematic review.

Index test(s)

This review assesses urinary biomarkers that have been proposed as non‐invasive tests for the diagnosis of endometriosis (Table 2), as part of the review series on non‐invasive diagnostic tests for endometriosis.

Table 2. Urinary biomarkers for endometriosis

Angiogenesis/Growth factors and their receptors
VEGF‐A (vascular endothelial growth factor ‐ A)¹
sFlt‐1 [sVEGFR‐1] (soluble fms‐like tyrosine kinase or variant of VEGF receptor 1)²
Cell adhesion molecules and other matrix‐related proteins
MMP‐2 (matrix metalloproteinase‐2)²
MMP‐9 (matrix metalloproteinase‐9)²
MMP‐9/ NGAL (matrix metalloproteinase‐9/neutrophil gelatinase‐associated lipocalin)²
Cytokines
TNF‐alpha (tumour necrosis factor alfa)¹
Cytoskeleton molecules
CK‐19 or CYFRA 21‐1 (Cytokeratin‐19)¹
High throughput markers
Proteome
Oxidative stress markers
8‐iso‐PGF2a (8‐iso‐prostaglandin F2a)²
Other Peptides/proteins
VDBP (vitamin D binding protein)
NNE (enolase I)
Collagen precursors
Prealbumin²
Alpha 1 antitrypsin²
Chain A solution structure of Bb' domains of human protein disulfide isomerase²
¹ Urinary biomarkers that did not exhibit differential expression in endometriosis ² Urinary biomarkers that exhibited differential expression in endometriosis, but for which the diagnostic estimates were not available

The definition of ‘non‐invasive’ varies between medical dictionaries but refers to a procedure that does not involve penetration of skin or physical entrance to the body (McGraw‐Hill Dictionary of Medicine 2006; The Gale Encyclopedia of Medicine 2008). Although bladder catheterisation for urine collection is invasive by this definition, when compared to diagnostic surgery for endometriosis, urine tests are generally considered to be 'non‐invasive' or 'minimally invasive'. For the purpose of these reviews, we will define all tests that do not involve anaesthesia and surgery as ‘non‐invasive’.

The advantages of using a urine test for the diagnosis of endometriosis are that it is non‐invasive and readily available, and that the sample can be self‐collected without the need for expensive equipment or skilled personnel. It is more acceptable to women, provides a rapid result and is more cost effective when compared to surgery. However, urinary testing is dependent on the reliability of laboratory techniques and quality control protocols. Urinary biomarker levels may also be susceptible to variation during the menstrual cycle.

Cellular and molecular processes have been identified that characterise ectopic endometrium and peritoneal fluid in human and animal models (D'Hooghe 2001; Kao 2003; Hull 2008). Markers of these pathophysiological processes have been evaluated in various tissues, including urine, which is increasingly favoured as a fluid for biological testing. Urinary biomarker discovery is a new and rapidly expanding field with most studies published in the last five years. A limited number of endometriosis urinary biomarkers have been evaluated to date and most were assessed in small individual studies. Categories of markers include 1. angiogenesis and growth factors; 2. cell adhesion molecules and other matrix‐related proteins; 3. cytokines; 4. cytoskeleton molecules; 5. high‐throughput molecular markers; 6. oxidative stress markers; 7. other peptides/proteins shown to influence key events implicated in endometriosis.

A large systematic review of all proposed biomarkers for endometriosis in serum, plasma and urine identified over 100 putative biomarkers, but the authors were unable to identify any biomarker (single or in a panel) that could be recommended for use in clinical practice (May 2010). A more recent narrative review concurred with this conclusion (Fassbender 2015). There is a current need to re‐evaluate the diagnostic test accuracy of urine tests for endometriosis using Cochrane methodologies.

Clinical pathway

Women presenting with symptoms of endometriosis (dysmenorrhoea, dyspareunia, chronic pelvic pain or difficulty conceiving) generally are investigated with a pelvic ultrasound scan to exclude other pathologies, which is in line with international guidelines (Dunselman 2014; SOGC 2010; ACOG 2010). There are no other standard investigative tests and MRI is used conservatively because of its cost. If women seek pain management rather than conception, empirical treatment with progestogens or the combined oral contraceptive pill is commonly started. Diagnostic laparoscopy is considered if empirical treatment fails or if women decline or do not tolerate empirical treatment. In women who have difficulty conceiving, laparoscopy may be undertaken before fertility treatment (particularly if severe pelvic pain or endometrioma are present) or after failed ART (assisted reproductive technology) treatments. Endometriosis can be diagnosed during fertility investigations in women who have minimal or no pain symptomatology.

On average there is a delay of 6 to 12 years from onset of symptoms to definitive diagnosis at surgery. Early referral to a gynaecologist with the capability to perform diagnostic surgery is associated with a shorter time to diagnosis. Collectively, young women, women in remote and rural locations and women of lower socioeconomic status have reduced access to surgery, and are less likely to obtain a prompt diagnosis of endometriosis.

Prior test(s)

Most women presenting with symptoms suggestive of endometriosis have a full history and examination and a routine gynaecological ultrasound before a decision is made to have diagnostic surgery. However there is no consensus on whether or not ultrasound or any other test should be routinely used as part of a standardised approach.

Role of index test(s)

A new diagnostic test can fulfil one of three roles:

1. Replacement: replacing an existing test by having more accuracy, or a similar accuracy with other advantages.

2. Triage: used as an initial step in a diagnostic pathway to identify the group of individuals who need further testing with an existing test. Although ideally a triage test has a high sensitivity and specificity, it may have a lower sensitivity but higher specificity than the current test or vice versa. The triage test does not aim to improve the diagnostic accuracy of the existing test but rather to reduce the number of individuals having an unnecessary diagnostic test.

3. Add‐on: used in addition to existing testing to improve diagnostic performance (Bossuyt 2008).

Ideally a diagnostic test is expected to correctly identify all individuals with a disease and to exclude all those without that disease; in other words, it should have a sensitivity and specificity of 100%. High sensitivity corresponds to a low number of people who have a negative test but do have the disease (i.e. a low number of false‐negative results). High specificity corresponds to a low number of people who have a positive test but do not have the disease (i.e. a low number of false‐positive results). In practice, however, it is extremely rare to find a test with equally high sensitivity and specificity. An acceptable replacement test would need to have a similar or higher sensitivity and specificity than the current gold standard of laparoscopy. The only systematic review that determines the accuracy of laparoscopy in diagnosing endometriosis reported a sensitivity of 94% and a specificity of 79%, and we have taken these values as the cut‐off for a replacement test (Wykes 2004).
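
Both measures follow directly from the four cells of a 2×2 table of index test result against reference standard. The sketch below uses hypothetical function names and, as a worked example, the counts reported for urinary NNE in Summary of findings 1 (22 true positives, 17 false negatives, 14 true negatives, 6 false positives).

```python
def sensitivity(tp, fn):
    """Proportion of women with the disease who test positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of women without the disease who test negative."""
    return tn / (tn + fp)


# NNE (enolase I), counts from Summary of findings 1.
print(round(sensitivity(tp=22, fn=17), 2))  # 0.56
print(round(specificity(tn=14, fp=6), 2))   # 0.7
```

In this example, 17 of the 39 women with endometriosis had a false‐negative result, which is why the sensitivity is only 0.56.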

The purpose of triage tests can vary depending on the clinical context and individuals’ priorities. One reasonable approach is to exclude the diagnosis to avoid further unnecessary and expensive diagnostic investigation. High sensitivity tests have few false‐negative results and act to rule conditions out (SnOUT). A negative result from a test with high sensitivity will exclude the disease with high certainty independent of the specificity. As women with a negative test could be reassured that the disease is very unlikely to be present, unnecessary invasive interventions can be avoided. However, a positive result has less diagnostic value, particularly when the specificity is low. We predetermined that a clinically useful 'SnOUT' triage test should have a sensitivity of 95% or more and a specificity of 50% or more. The sensitivity cut‐off was set on the assumption that a 5% false‐negative rate is statistically and clinically acceptable. The specificity cut‐off was set to avoid diagnostic uncertainty in more than 50% of the population with a positive result.

An alternative approach would be to avoid a missed diagnosis. High specificity tests have few false positive results and act to rule conditions 'in' (SpIN). A positive result for a highly specific triage test indicates a high likelihood of having endometriosis. This information could be used to prioritise these women for surgical treatment. A positive 'SpIN' test could also provide a clinical rationale to start targeted disease‐specific medical management in a person without a surgical diagnosis, under the assumption that disease is present. Surgical management could then be reserved for cases when conservative treatment fails. This is particularly relevant in some populations where the therapeutic benefits of surgery for endometriosis have to be carefully balanced with the disadvantages (e.g. young women, women with medical conditions or pain‐free women with a history of infertility). In this scenario we considered a sensitivity of 50% and above and a specificity of 95% and higher as suitable cutoffs for a 'SpIN' triage test.
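
The predetermined criteria can be summarised as a small decision rule. This is a sketch only: the function name and the returned labels are hypothetical, and treating each cut‐off as an inclusive "at least" threshold on the point estimate is an assumption. The thresholds themselves are those stated in the text: replacement (sensitivity 94%, specificity 79%), SnOUT triage (sensitivity 95%, specificity 50%) and SpIN triage (sensitivity 50%, specificity 95%).

```python
def candidate_roles(sens, spec):
    """Return the roles whose predetermined cut-offs are met by (sens, spec)."""
    roles = []
    if sens >= 0.94 and spec >= 0.79:
        roles.append("replacement")
    if sens >= 0.95 and spec >= 0.50:
        roles.append("SnOUT triage")
    if sens >= 0.50 and spec >= 0.95:
        roles.append("SpIN triage")
    return roles


# Point estimates from Summary of findings 1.
print(candidate_roles(0.91, 0.93))  # five-peptide panel: no criterion met
print(candidate_roles(0.11, 0.94))  # CK 19: no criterion met

# Hypothetical values, for illustration only.
print(candidate_roles(0.96, 0.60))  # ['SnOUT triage']
print(candidate_roles(0.55, 0.97))  # ['SpIN triage']
```

The five‐peptide panel falls just short of every cut‐off, which matches its description in Summary of findings 1 as approaching, rather than meeting, the criteria.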

We evaluated urine tests for their potential to replace surgery (replacement test) or to improve the selection of women for surgery (triage test) that can either rule out (SnOUT) or rule in (SpIN) the disease. Both types of triage test are clinically useful, minimising the number of unnecessary interventions. Sequential implementation of SnOUT and SpIN tests can also optimise a diagnostic algorithm (Figure 1). We did not assess any test as an add‐on test, as we sought tests that reduce the need for surgery and not tests that improve the accuracy of the currently available surgical diagnosis.

Figure 1

Sequential approach to non‐invasive testing of endometriosis

Alternative test(s)

There are no alternative tests for the diagnosis of endometriosis that are in routine clinical practice.

Rationale

Many women with endometriosis suffer longstanding pelvic pain and infertility prior to a diagnosis. Surgery is the only current method of diagnosing endometriosis, but it is associated with high costs and surgical risks. A simple and reliable non‐invasive test for endometriosis, with the potential to either replace laparoscopy or to triage women in order to reduce surgery, would minimise surgical risk and reduce diagnostic delay. Endometriosis could then be detected at less advanced stages and earlier intervention instituted. This would provide the opportunity for a preventive approach for this debilitating disease. Health care and social security costs of endometriosis would be expected to be reduced by early diagnosis and more cost effective and efficient treatments. Furthermore, identifying urine biomarkers that do not pertain to endometriotic disease would help clinicians and researchers focus on clinically relevant biomarker detection.

Objectives

Primary Objectives

1. To provide summary estimates of the diagnostic accuracy of urinary biomarkers for the diagnosis of pelvic endometriosis (peritoneal or ovarian or deep infiltrating, or a combination thereof) compared to surgical diagnosis as a reference standard.

2. To assess the diagnostic utility of biomarkers that could differentiate ovarian endometrioma from other ovarian masses.

Urinary biomarkers were evaluated as replacement tests for diagnostic surgery as well as triage tests which would assist decision‐making to undertake diagnostic surgery for endometriosis.

Secondary objectives

1. To investigate the influence of heterogeneity on the diagnostic accuracy of urinary biomarkers for endometriosis. Potential sources of heterogeneity include:

Characteristics of the study population: age (adolescents vs. later reproductive years); clinical presentation (subfertility, pelvic pain, ovarian mass, asymptomatic women); stage of disease (rASRM classification system); geographic location of study;
Histological confirmation in conjunction with laparoscopic visualisation compared to laparoscopic visualisation alone;
Changes in technology over time: year of publication; modifications applied to conventional laboratory techniques;
Methodological quality: differences in the QUADAS‐2 (Quality Assessment of Diagnostic Accuracy Studies‐2) evaluation (Table 3), including a) low versus unclear or high risk; b) consecutive versus non‐consecutive enrolment; c) blinding of surgeons to the results of index tests;
Study design ('single‐gate design' vs. 'two‐gate design' studies).

Table 3. Application of the QUADAS‐2 tool for assessment of methodological quality of the included studies

Domain 1 ‐ Patient selection
Description	Describe methods of patient selection and included patients
Type of bias assessed	Selection bias, spectrum bias
Review Question	Women of reproductive age with clinically suspected endometriosis (symptoms, clinical examination ± presence of pelvic mass), scheduled for surgical exploration of pelvic/abdominal cavity for confirmation of the diagnosis ± treatment
Information collected	Study objectives, study population, selection (inclusion and exclusion criteria), study design, clinical presentation, age, number of participants enrolled and number of participants available for analysis, setting, place and period of the study
Signalling question 1	Was a consecutive or random sample of patients enrolled?
Yes	If a consecutive sample or a random sample of the eligible patients was included in the study
No	If non‐consecutive sample or non‐random sample of the eligible patients was included in the study
Unclear	If this information was unclear
Signalling question 2	Did the study avoid inappropriate exclusions?
Yes	If inclusion/exclusion criteria were presented and all patients with suspected endometriosis were included, with an exception for those who a) had a history of medical conditions or were on medical therapy that would have potentially interfered with interpretation of index test (e.g. malignancy, pregnancy, autoimmune disorders, infectious diseases, treatment with hormonal or immunomodulator substances); b) refused to participate in the study; or c) were unfit for surgery
No	If the study excluded the patients based on education level, psychosocial factors, genetic testing or phenotype or excluded patients with any co‐morbidities commonly present in general population, including a population that could have undergone a testing for endometriosis in clinical setting (hypertension, asthma, obesity, benign gastro‐intestinal or renal disease, etc)
Unclear	If the study did not provide clear definition of the selection (inclusion or exclusion) criteria and 'no' judgement was not applicable
Signalling question 3	Was a 'two‐gate' design avoided?
Yes	If the study had a single set of inclusion criteria, defined by the clinical presentation (i.e. only participants in whom the target condition is suspected) ‐ a ‘single‐gate’ study design
No	If the study had more than one set of inclusion criteria in respect to clinical presentation (i.e. participants suspected of target condition and participants with alternative diagnosis in whom the target condition would not be suspected in clinical practice) ‐ a 'two‐gate' study design
Unclear	If it was unclear whether a 'two‐gate design' was avoided or not
Risk of bias	Could the selection of patients have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the included patients do not match the review question?
Low	If the study includes only clinically relevant population that would have undergone index test in real practice and includes representative form of target condition
High	If the study population differed from the population defined in the review question in terms of demographic features and co‐morbidity (e.g. studies with multiple sets of inclusion criteria with respect to clinical presentation including either healthy controls or alternative diagnosis controls that would not have undergone index test in real practice). Further, if target condition diagnosed in the study population was not representative of the entire spectrum of disease, such as limited spectrum of severity (e.g. only mild forms) or limited type of endometriosis (e.g. only DIE)
Unclear	If this information was unclear (e.g. severity of endometriosis was not reported)
Domain 2 ‐ Index test
Description	Describe the index test, how it was conducted and interpreted
Type of bias assessed	Test review bias, clinical review bias, interobserver variation bias
Review Question	Any type of urinary biomarkers
Information collected	Index test name, description of positive case definition by index test as reported, threshold for positive result, examiners (number, level of expertise, blinding), interobserver variability, conflict of interests
Signalling question 1	Were the index test results interpreted without knowledge of the results of the reference standard?
Yes	If the operators performing or interpreting index test were unaware of the results of reference standard
No	If the operators performing or interpreting index test were not blinded to the results of reference standard
Unclear	If this information was unclear
Signalling question 2	If a threshold was used, was it pre‐specified?
Yes	If study clearly provided a threshold for positive result and was defined before execution or interpretation of index test
No	If a threshold for positive result was not provided or not defined prior to test execution
Unclear	If it was unclear whether a threshold was pre‐specified or not
Signalling question 3	Was a menstrual cycle phase considered in interpreting the index test?
Yes	If all the included participants were in the same phase of menstrual cycle or if the study reported subgroup analyses per cycle phase or if study reported the pooled estimates after impact of the cycle phase on biomarker expression was not detected
No	If study included participants in different phases of menstrual cycle, but effect of cycle phase on index test was not assessed
Unclear	If the cycle phase was not reported
Risk of bias	Could the conduct or interpretation of the index test have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the index test, its conduct, or interpretation differ from the review question?
Low	We considered all types of urinary biomarkers as eligible; therefore all the included studies were classified as 'low concern', unless 'unclear' judgement was applicable
High	We did not consider the studies where index tests other than urinary biomarkers were included (or excluded information on other index tests reported in addition to urine tests) or where index test looked at other target conditions not specified in the review (e.g. studies aimed at classifying pelvic masses as benign and malignant); therefore none of the included studies was classified as 'high concern'
Unclear	If study did not present sufficient information on at least one of the following: laboratory method, sample handling, reagents used, experience of the test operators
Domain 3 ‐ Reference standard
Description	Describe the reference standard, how it was conducted and interpreted
Type of bias assessed	Verification bias, bias in estimation of diagnostic accuracy due to inadequate reference standard
Review Question	Target condition ‐ pelvic endometriosis, ovarian endometriosis, DIE. Reference standard ‐ visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Information collected	Target condition, prevalence of target condition in the sample, reference standard, description of positive case definition by reference test as reported, examiners (number, level of expertise, blinding)
Signalling question 1	Were the reference standards likely to correctly classify the target condition?
Yes	If the study reported at least one of the following: surgical procedure was described in sufficient details; or criteria for positive reference standard were stated; or diagnosis was confirmed by histopathology; or the procedure was performed by the team with high level of expertise in diagnosis/surgical treatment of target condition, including tertiary referral centres for endometriosis
No	If reference standard did not classify target condition correctly; considering the inclusion criteria and a nature of the reference standard, none of the studies were classified as 'no' for this item
Unclear	If information on execution of the reference standard, its interpretation or operators was unclear
Signalling question 2	Were the reference standard results interpreted without knowledge of the results of the index tests?
Yes	If operators performing the reference test were unaware of the results of index test
No	If operators performing the reference test were aware of the results of index test
Unclear	If this information was unclear
Risk of bias	Could the reference standard, its conduct, or its interpretation have introduced bias?
Low	If 'yes' classification for all the above 2 questions
High	If 'no' classification for any of the above 2 questions
Unclear	If 'unclear' classification for any of the above 2 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the target condition as defined by the reference standard does not match the question?
Low	Considering the inclusion criteria, all the included studies were classified as 'low concern'
High	We excluded the studies where participants did not undergo surgery for diagnosis of endometriosis, therefore none of the included studies were classified as 'high concern'
Unclear	Only studies where laparoscopy/laparotomy served as a reference test were included; therefore none of the included studies was classified as 'unclear concern'
Domain 4 ‐ Flow and timing
Description	Describe any patients who did not receive the index tests or reference standard or who were excluded from the 2 x 2 table, describe the interval and any interventions between index tests (sample collection) and the reference standard
Type of bias assessed	Disease progression bias, bias of diagnostic performance due to missing data
Review Question	Less than 12 months interval between index test (sample collection) and reference standard ‐ endometriosis may progress over time, so we chose an arbitrary interval of 12 months as the maximum acceptable time between sample collection and surgical confirmation of diagnosis
Information collected	Time interval between index test (sample collection) and reference standard, withdrawals (overall number reported and whether they were explained)
Signalling question 1	Was there an appropriate interval between index test (sample collection) and reference standard?
Yes	If time interval was reported and was less than 12 months
No	We excluded all the studies where time interval was longer than 12 months, therefore none of the included studies were classified as 'no' for this item
Unclear	If time interval was not stated clearly, but the authors' description allowed us to assume that the interval was reasonably short
Signalling question 2	Did all women receive the same reference standard?
Yes	If all participants underwent laparoscopy or laparotomy as a reference standard; considering the inclusion criteria, all the studies were classified as 'yes' for this item, as anticipated
No	If all participants did not undergo surgery or had alternative reference standard or if only a subset of participants had surgery as reference standard, but the information on this population was not available in isolation; considering the inclusion criteria, none of the included studies were classified as 'no' for this item
Unclear	If this information was unclear; considering the inclusion criteria, none of the included studies were classified as 'unclear' for this item
Signalling question 3	Were all women included in the analysis?
Yes	If all the women were included in the analysis or if women were excluded because they did not meet inclusion criteria prior to execution of index test or if the withdrawals were less than 5% of the enrolled population (arbitrarily selected cut‐off)
No	If any patients were excluded from the analysis because of uninterpretable results, inability to undergo either index test or reference standard or for unclear reasons
Unclear	If this information was unclear
Risk of bias	Could the patient flow have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable

2. To assess biomarkers which were not affected by endometriosis and hence were unlikely to discriminate between women with and without the disease.
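
The domain‐level risk of bias rule described in Table 3 can be illustrated with a minimal sketch. It assumes the rule as worded for Domains 2 to 4: any 'no' answer gives high risk, otherwise any 'unclear' answer gives unclear risk, and all 'yes' answers give low risk. The function name is hypothetical and is not part of QUADAS‐2 or RevMan.

```python
from typing import Iterable

VALID_ANSWERS = {"yes", "no", "unclear"}


def domain_risk_of_bias(answers: Iterable[str]) -> str:
    """Combine signalling-question answers into one domain judgement.

    Illustrative only: 'no' to any question -> 'high';
    otherwise 'unclear' to any question -> 'unclear';
    'yes' to every question -> 'low'.
    """
    normalised = [a.strip().lower() for a in answers]
    if not normalised:
        raise ValueError("at least one signalling question is required")
    unknown = set(normalised) - VALID_ANSWERS
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")

    if "no" in normalised:
        return "high"
    if "unclear" in normalised:
        return "unclear"
    return "low"


if __name__ == "__main__":
    # Hypothetical answers for one study in one domain.
    print(domain_risk_of_bias(["yes", "unclear", "yes"]))
```

Checking 'no' before 'unclear' mirrors the table's condition that an 'unclear' judgement applies only when a 'high risk' judgement is not applicable.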

Methods

Criteria for considering studies for this review

Types of studies

Published peer‐reviewed studies that compared the results of one or several types of urinary biomarkers with the results obtained by surgical diagnosis of endometriosis. Studies were included if they were:

Randomised controlled trials;
Observational studies of the following designs:
- ‘Single‐gate design’ (studies with a single set of inclusion criteria defined by clinical presentation). All participants had clinically suspected endometriosis.
- ‘Two‐gate design’ (studies where participants are sampled from distinct populations with respect to clinical presentation). The same study includes participants with a clinical suspicion of having the target condition (e.g. women with pelvic pain) and also participants in whom the target condition is not suspected (e.g. women admitted for tubal ligation). Two‐gate studies were eligible only where all cases and controls belonged to the same population in respect to the reference standard (i.e. all the participants were scheduled for laparoscopy) (Rutjes 2005).
Performed on prospectively collected samples, irrespective of the actual time of the test assay. The timing of sample collection relative to surgery is important because the surgical excision of endometriotic lesions could influence urine biomarker expression and hence bias the results. Therefore, we only included studies where urine was collected before the surgical procedure, i.e. 'prospectively collected'. The studies performed on tissue bank samples collected from prospectively recruited, well‐defined populations were considered eligible, which prevented the omission of valuable data from adequately designed studies. The time interval between sample collection and laboratory testing may influence test outcomes which could be dependent on sample storage conditions and the stability of each individual biomarker during storage and freeze/thawing. This information was not readily available for most molecules and was not addressed in this review, but will be considered in future updates if more evidence emerges.
Performed in any healthcare setting;
Published in any language;
We did not impose a minimal limit on the number of participants in the included studies nor the number of studies that have evaluated each index test.

The following studies were excluded:

Study design:
- Narrative or systematic reviews;
- Studies of retrospective design where the sample collection was performed after execution of reference test;
- Studies of retrospective design where the participants were selected from retrospective review of the case notes/archived samples and information on recruitment methods or study population was not available;
- Case reports or case series;
Studies reported only in abstract form or in conference proceedings where the full text was not available. This limitation was applied when we faced substantial difficulty in obtaining the information from the abstracts, which precluded a reliable assessment of eligibility and methodological quality.

Participants

Study participants included reproductive‐aged women (puberty to menopause) with suspected endometriosis based on clinical symptoms or pelvic examination, or both, who undertook both the index test and reference standard.

The participants were selected from populations of women undergoing abdominal surgery for the following indications: 1) clinically suspected endometriosis (pelvic pain, infertility, abnormal pelvic examination, or a combination of the above), 2) ovarian mass regardless of symptoms, 3) a mixed group, which consists of women with suspected endometriosis/ovarian mass or women with other benign gynaecological conditions (e.g. surgical sterilisation, fibroid uterus, etc). Asymptomatic women who had an incidental finding of endometriosis at surgery performed for another indication were also included.

Articles that included participants of postmenopausal age were eligible when the data for the reproductive age group was available in isolation. Studies were excluded when the study population involved participants who clearly would not undergo the index test in a clinical scenario or would not benefit from the test (e.g. women with ectopic pregnancies, gynaecological malignancy or acute pelvic inflammatory disease). We also excluded publications where only a subset of participants with a positive index test or reference standard were included in the analysis and the data for the whole cohort were not available.

Index tests

Any type of urinary biomarker for endometriosis was assessed either separately or in combination with other urine tests. The assessed index tests are presented in Table 2. We included tests performed in one or several phases of the menstrual cycle.

The combined evaluations of urinary biomarkers with other methods for diagnosing endometriosis (e.g. pelvic examination, imaging, blood or endometrial tests) are beyond the scope of this review and are presented separately in another review 'Combined tests for the non‐invasive diagnosis of endometriosis'. The studies that solely assessed specific technical aspects, qualitative descriptions of lesion appearance or inter‐observer variability of the index tests without reporting the data on diagnostic performance were excluded from the review. When the evaluated biomarker(s) showed differential expression between the groups of women with and without endometriosis, the publication was considered only if the data were reported with sufficient detail for the construction of 2 x 2 contingency tables. However, when the contingency tables were not available because the expression level of index test did not significantly differ between the groups and the inclusion criteria were otherwise met, a critical appraisal was undertaken and the study was presented in the descriptive part of the review. Thus the adequately designed studies that identified biomarkers without diagnostic value were evaluated as they provide information that is likely to focus future research on other more clinically useful biomarkers. This methodology also identified biomarkers which were associated with endometriosis in some but not other publications. Evaluations of screening or predictive accuracy tests were not included in this review.

The diagnostic performance of an index test was considered to be high when the test met the criteria for a replacement test (sensitivity equal to or greater than 94% with specificity equal to or greater than 79%) or a triage test (sensitivity equal to or greater than 95% with specificity equal to or greater than 50%, or vice versa), or approached these criteria (diagnostic estimates within 5% of the set thresholds). All other diagnostic estimates were considered to be low.
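
The thresholds above can be expressed as a short sketch. It makes two assumptions that are not stated in the text: 'within 5%' is read as 5 absolute percentage points applied to both the sensitivity and the specificity threshold, and the criterion labels and function names are hypothetical.

```python
from typing import Dict, Tuple

# (minimum sensitivity, minimum specificity) in percent, as given in the text.
CRITERIA: Dict[str, Tuple[float, float]] = {
    "replacement": (94.0, 79.0),
    "triage_high_sensitivity": (95.0, 50.0),
    "triage_high_specificity": (50.0, 95.0),  # the "vice versa" case
}

# Assumption: "within 5%" means 5 absolute percentage points.
APPROACH_MARGIN = 5.0


def _meets(sens: float, spec: float, criterion: Tuple[float, float],
           margin: float = 0.0) -> bool:
    """True if both estimates reach the thresholds, relaxed by `margin`."""
    min_sens, min_spec = criterion
    return sens >= min_sens - margin and spec >= min_spec - margin


def classify_performance(sensitivity: float, specificity: float) -> Tuple[str, str]:
    """Return (rating, basis) for a pair of estimates given in percent.

    rating is 'high' or 'low'; basis is 'meets', 'approaches' or 'neither'.
    """
    for margin, basis in ((0.0, "meets"), (APPROACH_MARGIN, "approaches")):
        if any(_meets(sensitivity, specificity, c, margin)
               for c in CRITERIA.values()):
            return ("high", basis)
    return ("low", "neither")


if __name__ == "__main__":
    # Hypothetical estimates, not taken from any included study.
    print(classify_performance(96.0, 80.0))
    print(classify_performance(70.0, 70.0))
```

Under these assumptions, a test with 90% sensitivity and 75% specificity would be rated high because it approaches the replacement criteria, whereas 70% for both estimates would be rated low.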

Target conditions

Pelvic endometriosis, defined as endometrial tissue located in the pelvic cavity: any of the pelvic organs, peritoneum and pouch of Douglas. Three types of pelvic endometriosis were assessed:

1. Peritoneal endometriosis, defined as endometrial deposits detected on peritoneum covering pelvic organs, pelvic side walls or pouch of Douglas;

2. Ovarian endometriosis (endometrioma), defined as an ovarian cyst lined by endometrial tissue and appearing as an ovarian mass of varying size;

3. Deep infiltrating endometriosis (DIE), defined as subperitoneal infiltration of endometrial implants, i.e. when the endometriotic implants penetrate the retroperitoneal space for a distance of 5 mm or more (Koninckx 1991). DIE may be present in multiple locations, involving either anterior or posterior pelvic compartments, or both.

Certain rare types of endometriosis such as extrapelvic, bladder and ureteric endometriosis were not included in this review because the majority were reported in case reports or case series and laparoscopy or laparotomy are not reliable reference standards for these conditions.

We excluded the studies where diagnosis of endometriosis was not the primary outcome of the trial (e.g. malignant vs benign masses or normal vs. abnormal pelvis) and the separate data for endometriosis were not available.

We also excluded the studies where the findings of the index test formed the basis of selection for the reference standard, because this was likely to distort an assessment of the diagnostic value of index test.

We included studies that recruited selected populations of women with endometriosis (i.e. those with specific rASRM stages), because there is a poor correlation between the rASRM classification and infertility and pain symptoms. Exclusion of these studies could result in a loss of potentially important diagnostic information from otherwise eligible publications. Where possible the impact of these studies was addressed in the assessments of heterogeneity. When a study analysed a large population with a wide spectrum of endometriosis and additionally reported a sub‐group analysis of the different stages of disease severity, only estimates for the entire population were considered, because a subgroup analysis does not directly address the review question regarding the clinical utility of the biomarker in detecting the disease.

Reference standards

The reference standard was visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation, as this is currently the best available test for endometriosis. Information regarding the inter‐ and intra‐observer correlation of the reference standard was reviewed if reported.

Only studies in which the reference test was performed within 12 months of the urine sample collection were included, on the assumption that disease status could change within a period of one year or longer, either naturally or as a result of treatment. Studies in which the participants did not undergo the reference standard or where the findings of the index test formed the basis of selection for undertaking the reference standard were not included in this review.

Summary of inclusion/exclusion criteria

Inclusion criteria:

Types of studies:
- Published peer‐reviewed;
- RCTs;
- Observational of the following design:
  - ‘single‐gate design’ (single set of inclusion criteria defined by clinical presentation): all the participants had clinically suspected endometriosis;
  - ‘two‐gate design’ (two sets of inclusion criteria with respect to clinical presentation and one set of inclusion criteria with respect to reference standard): the participants with or without a clinical suspicion of endometriosis scheduled for abdominal surgery;
- Performed on prospectively collected samples, including the tissue bank samples collected from prospectively recruited well‐defined population;
- Published in any language;
- Performed in any healthcare setting;
- Any sample size.

Participants:
- Reproductive‐aged women;
- Clinically suspected endometriosis, but also included:
  - women who underwent abdominal surgery for other benign gynaecological conditions and had surgical assessment for presence/absence of endometriosis;
  - asymptomatic women who have an incidental finding of endometriosis at surgery performed for another indication;
- Undertook both the index test and reference standard.
Index tests:
- One or several types of urinary biomarkers;
- Data reported in sufficient detail for the construction of 2 x 2 tables for the tests that showed differential expression between the groups;
- Biomarkers where 2 x 2 tables could not be constructed as the results did not differ between women with and without endometriosis, but all other inclusion criteria were met.
Target condition:
- Pelvic endometriosis
  - peritoneal endometriosis;
  - ovarian endometrioma;
  - DIE;
  - combination of the above.
Reference standard:
- Surgical visualisation of lesions for the diagnosis of endometriosis (laparoscopy or laparotomy) with or without histological verification;
- Performed within 12 months of the urine sample collection.

Exclusion criteria:

Types of studies:
- Narrative or systematic reviews;
- Retrospective design where the index test was performed after execution of reference test;
- Prospectively collected samples that were selected from the archived material, but information on the study population or the selection process was unclear;
- Case reports or case series;
- Conference proceedings.
Participants:
- Included cohort was not representative of the target population that would benefit from the test (e.g. women with known genital tract malignancy, ectopic pregnancies or acute pelvic inflammatory disease);
- Study included participants of postmenopausal age and the data for the reproductive age group were not available in isolation;
- Only participants with positive index test or positive reference standard were included in analysis.
Index tests:
- Urinary biomarkers presented in combination with other diagnostic tests for endometriosis and separate information for urinary biomarkers was not available;
- Study presented only specific technical aspects of an index test or focused on the biological events, rather than diagnostic performance of the test;
- Study assessed screening or predictive test accuracy.
Target condition:
- Endometriosis was not the primary outcome of the trial (e.g. malignant vs benign masses or normal vs. abnormal pelvis)
- Atypical, rare sites of endometriosis.

Reference standard:
- Reference standard performed only in a subset of study/control group;
- Findings of the index test formed the basis of selection for the reference standard;
- Other than specified in inclusion criteria.

Search methods for identification of studies

The search strategy was developed in collaboration with the Trials Search Coordinator of the Gynaecology and Fertility Review Group, following recommendations of the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy (de Vet 2008). The searches were not limited to particular types of study design and did not have language or publication date restrictions. The search strategy incorporated words in the title, abstract, text words across the record and the medical subject headings (MeSH). All searches were performed from inception until 31 July 2015. The search strategies for each database and the number of hits per search are presented in Appendix 1; Appendix 2; Appendix 3; Appendix 4. The summary of the results is presented in Results of the search.

Electronic searches

We searched the following databases to identify the published articles that assessed the diagnostic value of urinary biomarkers for endometriosis:

CENTRAL;
MEDLINE;
EMBASE;
CINAHL;
PsycINFO;
Web of Science;
LILACS;
OAIster;
TRIP;
Databases of the trial registers:
- ClinicalTrials.gov;
- World Health Organization (WHO) International Clinical Trials Registry Platform (ICTRP);
Databases to identify reviews and guidelines as sources of references to potentially relevant studies:
- MEDION;
- DARE;
- PubMed, a ‘Systematic Review’ search under the ‘Clinical Queries' link;
Searches for papers recently published and not yet indexed in the major databases:
- PubMed (simple search for the last 6 months; the ‘related articles’ feature was used to locate additional relevant studies).

Searching other resources

The reference lists of all relevant publications (retrieved full texts of the key articles and identified reviews) were handsearched.

An intended attempt to locate the grey literature (unpublished studies and conference proceedings) was abandoned as we faced substantial difficulty in obtaining full‐text publications or further details of studies reported in an abstract form.

Data collection and analysis

Selection of studies

Two authors of this review (EL, VN) and four authors for the other reviews from this series (Devashana Gupta, Rabia Shaikh, Deepika Arora and Lucy Prentice) scanned the titles of studies identified by our search to remove any clearly irrelevant articles. The titles and abstracts of the remaining studies were reviewed to select potentially relevant publications. The relevant articles were then divided into four categories of endometriosis biomarkers: serum, endometrial, urinary, and combined tests. Two of the urinary biomarker review authors (EL, LH, or VN) independently reviewed each of the full‐text versions of the articles selected by title and abstract and assessed them for eligibility for inclusion, based on the criteria listed above under Criteria for considering studies for this review. A single failed eligibility criterion was sufficient for a study to be excluded from the review.

The review authors who assessed the relevance of the studies and eligibility for inclusion were not blinded to the information about each article, including the publishing journal, the names of authors, the institution and the results. Any disagreements were resolved by discussion and, if necessary, in consultation with a third review author (CF), who is an expert in methodological aspects of Cochrane systematic reviews.

When papers updated previous publications and were performed on the same study population at different recruitment points, the most complete data set that superseded previous publications was used to avoid double counting participants or studies. Missing data were retrieved by directly contacting authors to clarify study eligibility. When potentially relevant studies were found in languages other than English, a translation was undertaken. For excluded studies, the reasons for exclusion and details of which criteria were not met were documented. The characteristics of included, excluded and awaiting classification studies are presented under Characteristics of included studies, Characteristics of excluded studies and Characteristics of studies awaiting classification, respectively.

Data extraction and management

Data were extracted from eligible studies by two independent review authors (EL, LH) and any disagreement was resolved by the third review author (VN). If required, study investigators were contacted to resolve any questions regarding the data.

To collect details from included studies, a data extraction form was specifically designed for this review and pilot tested on three studies of diagnostic accuracy tests for endometriosis. The following information was recorded for each study:

General information and study design: first author, year of publication, country, language, setting, objectives, inclusion/exclusion criteria, type of enrolment.

Characteristics of the study participants: age, symptoms/history/previous tests, type of target condition and its prevalence in the study population, number of participants enrolled and available for analysis, reasons for withdrawal.

Features of the index test and reference standard: type, diagnostic criteria, number and experience of the operators, blinding of the operators to other tests or clinical data or both, interobserver variability, time interval between index test and reference standard.

The reported number of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP) was used to construct a two‐by‐two (2 x 2) table for each index test. If these values were not reported, we attempted to reconstruct the 2 x 2 tables from the summary estimates presented in the article.
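The reconstruction step described above can be sketched as follows. This is a minimal illustration, assuming the article reports sensitivity, specificity and the numbers of women with and without the target condition; the class name `TwoByTwo`, the function `reconstruct_two_by_two` and the example figures are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cells of a 2 x 2 diagnostic table."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def reconstruct_two_by_two(sensitivity: float, specificity: float,
                           n_diseased: int, n_non_diseased: int) -> TwoByTwo:
    """Back-calculate cell counts from summary estimates (hypothetical helper).

    Counts are rounded to the nearest integer, so the result should be
    checked against the estimates reported in the article.
    """
    tp = round(sensitivity * n_diseased)
    tn = round(specificity * n_non_diseased)
    return TwoByTwo(tp=tp, fn=n_diseased - tp, tn=tn, fp=n_non_diseased - tn)


# Hypothetical example: 50 women with and 40 without the target condition.
table = reconstruct_two_by_two(0.80, 0.70, n_diseased=50, n_non_diseased=40)
print(table)
```

Recomputing sensitivity and specificity from the reconstructed counts and comparing them with the published values is a simple check that rounding has not distorted the table.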

Data were extracted into Review Manager® (RevMan) software, which was used to graphically display the quality assessment, the diagnostic estimates data and the descriptive analyses.

Assessment of methodological quality

We used QUADAS‐2, a modified version of the QUADAS tool, to assess the quality of each included study (Whiting 2011).

The review‐specific QUADAS‐2 tool and explanatory document are presented in Table 3. Each paper was judged as having a 'low', 'high' or 'unclear' risk for each of four domains and concerns about applicability were assessed in three domains. We considered studies as having low methodological quality when classified at high or unclear risk of bias or at high concern regarding applicability in at least one domain. The assessment of each included study was performed independently by two reviewers (EL, LH, or VN) and disagreements were settled by a third author (CF) or by consensus. Two review authors (EL, LH) independently piloted the topic‐specific tool to rate four of the included studies with a high level of agreement. Modifications specific to the urinary biomarkers review were made to the signalling questions of the original QUADAS‐2 tool and were as follows:

1) Domain 1: an original signalling question 'Was a case control design avoided?' was rephrased as 'Was a two‐gate design avoided?'. The diagnostic studies are cross‐sectional in nature, aiming to compare the result of an index test with the result of the reference standard in the same group of participants. In these studies the parameters are measured at a single point of time and the groups are classified by the outcome of the reference standard, although the analysis is performed retrospectively. Therefore, unlike in epidemiological studies, the terminology 'cohort' and 'case‐control' is less informative for diagnostic test trials, and was substituted by 'single‐gate' and 'two‐gate' designs. This question was included because a two‐gate design has more potential to introduce selection bias.

2) Domain 2: an additional signalling question 'Was the phase of the menstrual cycle considered in interpreting the index test?' was introduced to assess bias in the interpretation of the test results. Some biochemical markers are sensitive to fluctuations in sex steroid hormone levels across a menstrual cycle, which could result in the differential expression of endometriosis biomarkers at different cycle phases.

The assessment of methodological quality was undertaken for each domain but a summary score to estimate the overall quality of studies was not calculated (Whiting 2005).
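The rule used to label a study as having low methodological quality (high or unclear risk of bias in at least one domain, or high concern regarding applicability in at least one domain) can be written down as a short check. The sketch below is illustrative only; the domain keys, the function name `is_low_quality` and the example judgements are assumptions made for this example and do not describe any included study.

```python
# Four risk-of-bias domains and three applicability domains, as in QUADAS-2.
RISK_DOMAINS = ("patient_selection", "index_test",
                "reference_standard", "flow_and_timing")
APPLICABILITY_DOMAINS = ("patient_selection", "index_test",
                         "reference_standard")


def is_low_quality(risk_of_bias: dict, applicability: dict) -> bool:
    """Return True when a study counts as low methodological quality.

    risk_of_bias and applicability map domain names to one of
    'low', 'high' or 'unclear' (hypothetical data layout).
    """
    biased = any(risk_of_bias[d] in ("high", "unclear") for d in RISK_DOMAINS)
    concern = any(applicability[d] == "high" for d in APPLICABILITY_DOMAINS)
    return biased or concern


# Hypothetical study: one unclear risk-of-bias judgement is enough.
example_risk = {"patient_selection": "unclear", "index_test": "low",
                "reference_standard": "low", "flow_and_timing": "low"}
example_applicability = {"patient_selection": "low", "index_test": "low",
                         "reference_standard": "low"}
print(is_low_quality(example_risk, example_applicability))
```

Note that the function returns a per-study label only; consistent with the approach above, it does not combine the domains into a numerical summary score.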

Statistical analysis and data synthesis

The estimates of sensitivity and specificity were generated in forest plots and plotted in the receiver operating characteristic (ROC) space for each index test using RevMan. The diagnostic performance of each test was investigated and inter‐study variation in the performance of each index test was visually explored in relation to participant characteristics, study design, and study quality factors. Two or more tests evaluated in the same cohort were included as separate data sets, since the unit of analysis was the test result, not the participant.
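For readers who wish to reproduce the 95% CIs shown around study‐level estimates, the sketch below computes an exact (Clopper‐Pearson) binomial interval using only the Python standard library. That the plotted intervals are of this exact type is an assumption made here, not a statement from the methods; the function names and the counts in the example are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _root(g, iterations: int = 100) -> float:
    """Bisection on [0, 1] for a function g that increases from <0 to >0."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x: int, n: int, alpha: float = 0.05) -> tuple:
    """Clopper-Pearson interval for x successes out of n (assumed method)."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


# Hypothetical counts: 14 true negatives among 20 women without disease.
low, high = exact_ci(14, 20)
print(f"specificity 0.70 (95% CI, {low:.2f} to {high:.2f})")
```

With small groups such as this, the interval is wide (roughly 0.46 to 0.88 for 14 of 20), which illustrates why estimates from single small studies are interpreted with caution.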

For studies that reported subgroup analyses per phase of the menstrual cycle, we presented the data in a clinically relevant way. For instance, pooled estimates were presented when there was no statistically significant difference in biomarker expression between cycle phases. Alternatively, where putative biomarkers demonstrated cycle‐dependent expression or were noted to be modulated by ovarian hormones, we reported the test performance either at several time points across the menstrual cycle or in the phase that demonstrated the most distinct difference between groups.

We planned to fit the bivariate logit‐normal random‐effects model for all meta‐analyses with four studies or more and to perform a fixed‐effect meta‐analysis of sensitivity and specificity for smaller groups of studies (two to three) in the absence of substantial heterogeneity. When the number of studies was less than four, we did not attempt to estimate the covariance, and reported this as zero. The meta‐analyses were to be performed using SAS NLMIXED software. Results from SAS were to be input into RevMan to provide plots of the estimated summary points and confidence regions, superimposed on the study‐specific estimates of sensitivity and specificity. In this review, this aspect of the statistical analysis could not be performed due to the paucity of data for each biomarker.
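For orientation, the planned bivariate logit‐normal model can be written in its usual form as below. The notation is an assumption introduced here for illustration (it is not taken from the review): for study i, n_{1i} and n_{0i} are the numbers of women with and without endometriosis, and Se_i and Sp_i are the study‐specific sensitivity and specificity.

```latex
\begin{aligned}
TP_i &\sim \mathrm{Binomial}(n_{1i},\, Se_i), \qquad
TN_i \sim \mathrm{Binomial}(n_{0i},\, Sp_i), \\[4pt]
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\right)
\end{aligned}
```

The summary sensitivity and specificity are obtained by back‐transforming \mu_A and \mu_B from the logit scale. Reporting the covariance as zero, as planned for fewer than four studies, corresponds to fixing \sigma_{AB} = 0.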

The comparative accuracy of index tests was assessed in two ways. In direct, fully paired comparisons where all the study participants received more than one index test as well as the reference standard, the estimates were plotted in RevMan. If a meta‐analysis was possible, test‐level covariates in the bivariate logit‐normal model were used to identify statistically significant differences. Otherwise the available comparative data were reported in a narrative way and illustrated using forest and ROC plots.

When test performance was judged against the predetermined diagnostic criteria, the point estimates of sensitivity and specificity were considered as the most informative presentation of test performance. We acknowledge that tests with point estimates that did not reach the predetermined criteria, but with confidence intervals (CIs) that contained values above the threshold, could have diagnostic value. Furthermore, tests with point estimates that reached the criteria, but with CIs that contained values below the threshold, could have an overestimated diagnostic value. If the range of the CIs rather than the point estimates of the data is used, the predetermined cut‐off becomes meaningless. Therefore, we did not consider CIs in qualifying the test performance, but utilised this information in interpreting the reliability of the obtained data.
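The decision rule described above, in which only point estimates are compared with the predetermined criteria, can be sketched as follows. The threshold values in `CRITERIA` are placeholders assumed for illustration; the review's predetermined criteria are defined elsewhere and are not restated in this section. The names `Criterion` and `criteria_met` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    min_sensitivity: float
    min_specificity: float


# ASSUMED, illustrative thresholds; substitute the review's own criteria.
CRITERIA = (
    Criterion("replacement", 0.94, 0.79),
    Criterion("SnOUT triage", 0.95, 0.50),
    Criterion("SpIN triage", 0.50, 0.95),
)


def criteria_met(sensitivity: float, specificity: float,
                 criteria=CRITERIA) -> list:
    """Names of the criteria satisfied by the point estimates.

    Confidence intervals are deliberately ignored here, mirroring the
    approach described in the text.
    """
    return [c.name for c in criteria
            if sensitivity >= c.min_sensitivity
            and specificity >= c.min_specificity]


# Point estimates close to, but below, every assumed threshold.
print(criteria_met(0.91, 0.93))
# Point estimates that satisfy only the assumed rule-out (SnOUT) criterion.
print(criteria_met(0.96, 0.60))
```

Keeping the thresholds in a single table makes it straightforward to rerun the classification if different criteria are adopted.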

Dealing with missing data

Missing data were defined as any information regarding the study population, index tests or reference standard that was not available in the publication and that was required to determine the eligibility of the study for inclusion, to assess the methodological quality or to construct the results table. If missing data were identified, the authors were contacted in an attempt to obtain this information. If missing data prevented a clear judgement regarding applicability for inclusion or the construction of accurate 2 x 2 tables and the data were not available from the primary investigators (for example we were unable to locate the contact details of the authors or there was no reply from the authors or the authors replied that the requested information was unavailable), the study was excluded from the review.

Investigations of heterogeneity

Heterogeneity was initially assessed by visually examining the forest plots of sensitivities and specificities and the ROC plots for each index test. The potential sources of heterogeneity are stated in the Secondary objectives. For diagnostic tests where there were more than five eligible studies, we initially planned to formally explore heterogeneity by using study level covariates but we were unable to do so, because of the small numbers of studies in each group. We also planned to assess the sensitivity of results to the inclusion and exclusion of outlying studies in all analyses, but refrained from doing so, again because of the small number of studies for most analyses. It is important to use caution when interpreting small meta‐analyses (few studies) with a limited total sample size.

Sensitivity analyses

We planned to conduct sensitivity analyses to assess the impact of the methodological quality of included studies on the results of any meta‐analyses if sufficient data were available. Low quality studies were defined by the identification of a high risk of bias for one or more QUADAS‐2 domains. We also planned to use the 'leave‐one‐out' procedure to assess the impact of each study on the meta‐analysis results (leading study effect). In this review, these analyses could not be undertaken due to the paucity of studies evaluating each biomarker.
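The 'leave‐one‐out' procedure can be illustrated with a deliberately simplified sketch. It pools sensitivity by summing counts across studies, which is a stand‐in for the planned meta‐analytic models and not the method of this review; the study identifiers and counts are invented for the example.

```python
def pooled_sensitivity(studies: list) -> float:
    """Naive pooled sensitivity: total TP over total diseased (simplification)."""
    tp = sum(s["tp"] for s in studies)
    diseased = sum(s["tp"] + s["fn"] for s in studies)
    return tp / diseased


def leave_one_out(studies: list) -> dict:
    """Pooled sensitivity recomputed with each study omitted in turn."""
    results = {}
    for i, study in enumerate(studies):
        remaining = studies[:i] + studies[i + 1:]
        results[study["id"]] = pooled_sensitivity(remaining)
    return results


# Hypothetical data set of three studies.
hypothetical_studies = [
    {"id": "Study A", "tp": 40, "fn": 10},
    {"id": "Study B", "tp": 30, "fn": 20},
    {"id": "Study C", "tp": 45, "fn": 5},
]
print(pooled_sensitivity(hypothetical_studies))
for omitted, estimate in leave_one_out(hypothetical_studies).items():
    print(f"omitting {omitted}: {estimate:.2f}")
```

A study whose omission shifts the pooled estimate markedly would be flagged as a leading study; with only one or two studies per biomarker, as here, the procedure has nothing meaningful to compare.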

Assessment of reporting bias

A comprehensive search of multiple sources for eligible studies, a search of trial registers and no language restrictions minimised the risk of reporting bias. However, publication bias generally arises when studies have a higher chance of being published if their results are positive. Therefore unpublished and published study databases and conference proceedings were initially searched and evaluated. During the process of qualifying the studies for inclusion in this review, we faced substantial difficulty in obtaining full‐text publications or further details of studies published in an abstract form. This precluded a reliable assessment of eligibility and methodological quality and it was decided not to include these publication sources in this review.

Results

Results of the search

The literature search identified 33,438 references in the following databases: MEDLINE (n = 10,328), CENTRAL (n = 226), EMBASE (n = 10,313), CINAHL (n = 1131), PsycINFO (n = 174), Web of Science (n = 7425), LILACS (n = 420), OAIster (n = 446), Trip (n = 1648), trial registers for ongoing and registered trials (n = 523), MEDION (n = 2), DARE (n = 99), PubMed 'Systematic Review' search (n = 418) and PubMed simple search (n = 267). These databases were searched from inception to the search dates, which fell between 20 April and 31 July 2015. The flow of the selection process is presented in Figure 2. Titles were screened to exclude duplicates (n = 9312) and clearly irrelevant studies (n = 21,534). Another 2575 references were eliminated after reading the abstracts because they did not address the research question or clearly did not meet the inclusion criteria. The full texts of the remaining 16 references were retrieved and assessed for eligibility. Data from two studies required additional clarification from the authors. There were no non‐English publications requiring translation. Ultimately, eight studies were eligible according to the inclusion criteria and provided data for the review, six studies were excluded and two studies were defined as awaiting classification. In addition, one ongoing trial was identified through the clinical trials registries (Characteristics of ongoing studies), but the outcomes of this study were not yet available (ongoing, but not recruiting participants). The progress of this study will be monitored and addressed in future updates.

Figure 2

Flow of the studies identified in literature search for systematic review on urinary biomarkers for a non‐invasive diagnosis of endometriosis.

Basic features of the included studies

The list and details of the included studies are presented in Characteristics of included studies. The eight eligible studies included 646 participants, with a median of 73 women per study (range 39 to 147). Of these studies, five assessed urinary biomarker expression in women with and without endometriosis and included enough data to estimate the diagnostic performance of the investigated test (n = 438 participants, median 95, range 39 to 147 women). Each study evaluated one or several biomarkers. Most studies reported diagnostic estimates for biomarkers that demonstrated differential expression between women with and without endometriosis, although in one publication this assessment was undertaken for a test that demonstrated no differential expression (Lessey 2014). In three studies there was no difference in expression between the women with and without endometriosis and the diagnostic test accuracy of the urinary biomarker was not evaluated (n = 208 participants, median 70, range 62 to 76 women). This set of studies was methodologically eligible; the biomarkers they evaluated are unlikely to be of diagnostic utility and hence may not be worth further study.

Four of the included studies were conducted in Asia, two in Europe and two in North America. All the studies were conducted at university hospitals, of which at least three were referral centres for endometriosis. The earliest article was published in 2004, six articles were published after 2010 and four studies were published after 2013. There were no randomised controlled trials and all the studies were observational with a cross‐sectional design. Five studies were 'single‐gate', where both cases and controls were sampled from the same participant population, all of which included women with suspected endometriosis based on clinical presentation (women presenting with pelvic pain, infertility, ovarian mass, or a combination thereof). Three studies were of a 'two‐gate' design and included a wider group of participants who were undergoing surgery for various indications. All the included studies assessed women of reproductive age. Laparoscopy was the predominant surgical modality in the included studies, whereas laparotomy was co‐utilised in one study. Seven of the included studies used histopathology to confirm the surgical diagnosis. All the included studies evaluated pelvic endometriosis and the reported prevalence of endometriosis varied from 43% to 66%. Five studies included a wide spectrum of endometriosis (rASRM I‐IV), two studies included only participants with moderate–severe endometriosis (rASRM stage III‐IV) and in one study the information on the severity of the disease was not available. Six studies received financial support, two of which were funded by pharmaceutical companies, and all the authors declared no conflict of interest. No information was available from the remaining two studies.

Basic features of the excluded studies

The list and descriptions of the excluded studies are presented in Characteristics of excluded studies. Based on a full text assessment, six publications were excluded, of which one was of retrospective design, with the urine samples collected after the surgical procedure. One study reported statistically significant differences in biomarker levels between the study and control groups, but contained insufficient diagnostic accuracy information for the construction of 2 x 2 contingency tables. One excluded paper presented the qualitative evaluation of urinary biomarkers and did not define a specific test for diagnostic assessment. A further three studies were excluded as they evaluated urinary excretion of environmental toxins and their association with risk of endometriosis. For two studies there were insufficient data to confirm eligibility and these were classified as awaiting classification at the time of publication. These studies are outlined in Characteristics of studies awaiting classification and will be addressed in future updates of this review.

Methodological quality of included studies

The quality of the included studies is illustrated in the QUADAS‐2 results summary (Figure 3 and Figure 4). Overall, the studies were of poor methodological quality and all studies had an unclear or high risk of bias in at least one domain.

Figure 3

Risk of bias and applicability concerns graph: review authors' judgements about each domain presented as percentages across included studies

Figure 4

Risk of bias and applicability concerns summary: review authors' judgements about each domain for each included study

No studies presented a low risk of participant selection bias; five studies demonstrated an unclear risk (Potlog‐Naharia 2004; El‐Kasti 2011; Cho 2012; Wang 2014; Yun 2014); and three studies were assessed at high risk for this domain (Cho 2007; Kuessel 2014; Lessey 2014). Non‐consecutive or non‐random participant selection, utilisation of a two‐gate design for participant selection and the absence of a clear definition of inclusion/exclusion criteria were the main reasons for a 'high risk' assessment of bias.

All the studies demonstrated a high risk of index test interpretation bias (Potlog‐Naharia 2004; Cho 2007; El‐Kasti 2011; Cho 2012; Kuessel 2014; Lessey 2014; Wang 2014; Yun 2014). A lack of clear pre‐specified criteria for a positive diagnosis and index test operators not being blinded to the results of the reference standard were the main reasons for a 'high risk' assessment. A high risk of bias for this domain was also attributed to the articles where the phase of the menstrual cycle was not considered in interpreting the index test. This was considered an important criterion, since varying ovarian hormones across the cycle could influence biomarker expression and undermine the reliability of the results. Furthermore, the skill level of a test operator and interobserver variability, both of which directly affect performance of the tests, were rarely reported. As the criteria for a positive index test were variable between the studies and the index test protocols were not standardised, quality judgements for the index test were complex.

Seven studies were at low risk of bias in the 'reference standard' domain (Potlog‐Naharia 2004; Cho 2007; El‐Kasti 2011; Cho 2012; Kuessel 2014; Lessey 2014; Yun 2014); one study was classified at unclear risk (Wang 2014); and no studies demonstrated a high risk. An unclear risk of bias was assigned if there was not enough information to determine how likely the reference standard was to have correctly classified the target condition. Specifically, surgical procedures were not well described, the criteria for a positive reference standard were not stated, it was unclear if histology was utilised to confirm surgical diagnosis, or there was no information regarding the experience of the surgeons or the pathologists (or both) involved.

Seven studies presented a low risk of bias in the 'flow and timing' domain (Potlog‐Naharia 2004; Cho 2007; Cho 2012; Kuessel 2014; Lessey 2014; Wang 2014; Yun 2014); no studies demonstrated an unclear risk; and one study carried a high risk (El‐Kasti 2011). In every study all participants received the same reference standard. The acceptable time interval between the index test and the reference standard was set at 12 months or less and the most commonly reported time interval was immediately before surgery. A high risk of bias was assigned if there were unexplained withdrawals that exceeded 5% of the enrolled population or if the reason for withdrawal could introduce selection bias regarding the samples analysed.

Three studies presented a low concern for participant selection applicability (Potlog‐Naharia 2004; Cho 2012; Wang 2014); and five were of high concern (Cho 2007; El‐Kasti 2011; Kuessel 2014; Lessey 2014; Yun 2014). A high concern in participant selection applicability was assigned if the study utilised two‐gate selection for cases and controls or if only a limited spectrum of disease was evaluated. Additional uncertainty regarding the accuracy of the index test in the entire clinically relevant population is introduced if the urine biomarker varied across participant subgroups. In our view, any sampling deviation from a representative group of the entire clinically relevant population could skew the estimates of diagnostic accuracy in any direction.

All the studies presented a low concern of index test applicability, presenting sufficient information to conclude that the index test, its conduct or interpretation matched the review question (Potlog‐Naharia 2004; Cho 2007; El‐Kasti 2011; Cho 2012; Kuessel 2014; Lessey 2014; Wang 2014; Yun 2014).

All eight studies were of low concern for applicability with regard to the reference standard and none had a high or unclear concern (Potlog‐Naharia 2004; Cho 2007; El‐Kasti 2011; Cho 2012; Kuessel 2014; Lessey 2014; Wang 2014; Yun 2014). All the included studies implemented pelvic surgery (laparoscopy or laparotomy) as a reference standard, which could be relied upon to match the review question.

Findings

A total of six urinary biomarkers were evaluated in the eight included studies, of which four biomarkers had a diagnostic evaluation in five studies (summary of findings Table 1). Three biomarkers were not altered by the presence of endometriosis and were evaluated in three other studies (summary of findings Table 2).

1) Enolase 1 (NNE)

The diagnostic performance of urinary NNE was evaluated in one study (59 women, follicular or luteal cycle phase, only moderate to severe endometriosis, rASRM III‐IV) (Yun 2014). Urinary NNE expression was not influenced by cycle phase and was significantly greater (P = 0.026) in women with endometriosis when corrected for creatinine excretion (NNE‐Cr). Using a cut‐off threshold of more than 0.96 ng/mgCr, the sensitivity was 0.56 (95% CI, 0.40 to 0.72) and the specificity was 0.70 (95% CI, 0.46 to 0.88); these estimates did not meet the criteria for either replacement or triage tests (Figure 5). Further testing in larger studies including participants with a wider spectrum of endometriosis is needed to confirm the role of NNE in detecting endometriosis.

Figure 5

Summary ROC Plot of NNE‐Cr for detection of endometriosis utilising a cut‐off > 0.96 ng/mgCr. Each point represents the pair of sensitivity and specificity for evaluation. The bars correspond to 95% CIs.

2) Vitamin D‐binding protein (VDBP)

The diagnostic performance of urinary VDBP was evaluated in one study only, which included 95 women in the follicular or luteal cycle phase (Cho 2012). Even though the study included endometriosis of varying severity (rASRM I‐IV), more than 90% of women with endometriosis had moderate to severe disease (52/57). Urinary VDBP levels corrected for creatinine (VDBP‐Cr) were significantly greater in participants with endometriosis (P = 0.001). However, VDBP‐Cr only distinguished women with and without endometriosis in the luteal phase of the cycle (P = 0.042). The cut‐off value of more than 87.83 ng/mgCr demonstrated a sensitivity of 0.58 (95% CI, 0.44 to 0.71) and a specificity of 0.55 (95% CI, 0.38 to 0.71) (Figure 6). The results are discouraging, but further evaluation of VDBP across the spectrum of endometriosis, particularly in the luteal phase, may help to clarify its diagnostic role in endometriosis.

Figure 6

Summary ROC plot of VDBP‐Cr for detection of endometriosis utilising a cut‐off >87.83 ng/mgCr. Each point represents the pair of sensitivity and specificity for the evaluation. The bars correspond to 95% CIs.

3) MALDI‐TOF‐MS proteomics

Two studies, including seven data sets comprising 186 women, assessed the accuracy of proteomic techniques in detecting endometriosis (Figure 7; Figure 8). One study included 39 women in the follicular, peri‐ovulatory or luteal phases (El‐Kasti 2011). Six significant putative peptide markers distinguished controls from women with moderate to severe endometriosis (rASRM III‐IV), four of these in the peri‐ovulatory phase and two in the luteal phase. The diagnostic accuracy of two peptides, identified only by their mass profile, was evaluated. The peri‐ovulatory peptide mass of 1767.1 Da using a cut‐off of 35.22 or more arbitrary units (a.u.) showed a sensitivity of 0.75 (95% CI, 0.43 to 0.95) and a specificity of 0.87 (95% CI, 0.60 to 0.98). The luteal peptide mass of 1824.3 Da using a cut‐off of 29.34 or more a.u. showed a sensitivity of 0.77 (95% CI, 0.46 to 0.95) and a specificity of 0.73 (95% CI, 0.45 to 0.92). The diagnostic performance of the other four peptides was not assessed. It was noted that 14 participants with minimal to mild endometriosis were excluded from this analysis. None of the biomarkers met the criteria for a replacement or triage test, but this observation provides too little data to draw conclusions regarding the diagnostic role of these urinary peptides in endometriosis.

Figure 7

Summary ROC plot of Proteome for detection of endometriosis. Each point represents the pair of sensitivity and specificity for each evaluation. The size of each point is proportional to the sample size and the shape designates the tests including different proteins. The bars correspond to 95% CIs of each individual evaluation. The data were not assessed by meta‐analysis.

Figure 8

Forest plot of proteome for detection of endometriosis. Plot shows study‐specific estimates of sensitivity and specificity (squares) with 95% CI (black line), country in which the study was conducted, menstrual cycle phase at which the test was performed and severity of the disease assessed by each study, reported as rASRM stage. FN: false negative; FP: false positive; TN: true negative; TP: true positive.

Another study (122 women, follicular or luteal cycle phase, rASRM I‐IV, unclear histological confirmation) identified 36 peptides that were significantly different between the endometriosis and the control groups (Wang 2014). The peptide pattern did not vary between follicular and luteal phases, and four peptides that were down‐regulated in endometriosis were further evaluated for diagnostic performance. The 2052.3 Da mass peptide demonstrated a sensitivity of 0.83 (95% CI, 0.71 to 0.92) and a specificity of 0.69 (95% CI, 0.56 to 0.80), and the 3393.9 Da mass peptide had a sensitivity of 0.85 (95% CI, 0.73 to 0.93) and a specificity of 0.71 (95% CI, 0.58 to 0.82). The other two peptides could be identified by their mass spectra: a collagen alpha‐6(IV) chain precursor fragment (1579.2 Da), with a sensitivity of 0.83 (95% CI, 0.71 to 0.92) and a specificity of 0.69 (95% CI, 0.56 to 0.80), and a type VIII, IX, XV collagen alpha1 chain precursor fragment (891.6 Da), with a sensitivity of 0.82 (95% CI, 0.70 to 0.90) and a specificity of 0.65 (95% CI, 0.51 to 0.76). The cut‐off thresholds were not reported for any of these analyses. Three algorithms were developed using peptide peak clusters in diagnostic models: the genetic algorithm (GA), the decision tree algorithm (DTA) and the quick classifier algorithm (QC). The GA algorithm, comprising five peptides of 1433.9 Da, 1599.4 Da, 2085.6 Da, 6798.0 Da, and 3217.2 Da, showed the highest diagnostic estimates and was further validated in a blinded test group analysis of 25 randomly selected participants, 11 of whom had endometriosis. The estimates of the validation test showed a high sensitivity of 0.91 (95% CI, 0.59 to 1.00) and a high specificity of 0.93 (95% CI, 0.66 to 1.00), which approach the criteria for a replacement test and for SnOUT and SpIN triage tests. These results require further validation in large, independent, well‐defined populations, displaying a wide spectrum of disease, using standardised and reproducible methodologies.

4) Cytokeratin‐19 fragments (uCYFRA 21‐1)

Two studies that included 174 women assessed the performance of cytokeratin 19 (CK 19) as a biomarker in detecting endometriosis by measuring the urinary fragment uCYFRA 21‐1. Both studies concluded that CK 19 was not altered by the presence of endometriosis and that its levels were not affected by menstrual cycle phase (Kuessel 2014), by severity of the disease or by normalisation to urine creatinine or urine protein (Lessey 2014). Only one of these studies (98 participants, cycle phase not reported, rASRM I‐IV) evaluated the diagnostic accuracy of uCYFRA 21‐1 (Lessey 2014), demonstrating a very low sensitivity of 0.11 (95% CI, 0.05 to 0.22) with a high specificity of 0.94 (95% CI, 0.81 to 0.99), using a chosen cut‐off of more than 5.3 ng/ml (Figure 9). This evidence suggests that the cytokeratin 19 molecule is not reliable as a diagnostic test for endometriosis, but further testing is required to confirm or refute these findings.

Figure 9

Summary ROC plot of CK 19 for detection of endometriosis utilising a cut‐off > 5.3 ng/ml. Each point represents the pair of sensitivity and specificity for the evaluation. The bars correspond to 95% CIs.

5) Vascular endothelial growth factor (VEGF) or vascular endothelial growth factor‐A (VEGF‐A)

Two studies in 132 women assessed the performance of urinary VEGF in diagnosing endometriosis (Potlog‐Naharia 2004; Cho 2007). The levels were corrected to urinary creatinine in both studies, and one study showed no differences in excretion across the menstrual cycle (Potlog‐Naharia 2004). There was no significant difference between the control and endometriosis groups seen in either study, and the diagnostic accuracy was not evaluated.

6) Tumour necrosis factor‐alpha (TNF‐alpha)

Urinary TNF‐alpha levels were not significantly different in one study (70 participants, follicular or luteal cycle phase, rASRM I‐IV) and the diagnostic accuracy was not evaluated (Cho 2007).

Investigations of heterogeneity and sensitivity analyses

The potential sources of heterogeneity are outlined in Secondary objectives. Although we attempted to assess these sources of heterogeneity, there were not enough studies evaluating each test to make this a meaningful analysis. Furthermore, the sensitivity analyses were not possible due to the small number of studies.

Discussion

Summary of main results

Only a few urinary biomarkers have been assessed, each in a small number of individual studies, providing insufficient data to perform a meta‐analysis. No urinary test met the criteria for either a replacement or a triage test for detecting endometriosis. The GA algorithm, a combined test of five urinary peptides of 1433.9 Da, 1599.4 Da, 2085.6 Da, 6798.0 Da, and 3217.2 Da, demonstrated the highest diagnostic estimates for detecting endometriosis, which approached but did not meet the criteria for a replacement test or for the SnOUT and SpIN triage tests (Wang 2014). The algorithm was validated in an independent test group but, as this test was only reported in one study, meaningful conclusions regarding its value in clinical practice cannot be drawn. Certain urinary peptides identified through the high‐throughput MALDI‐TOF‐MS method showed potential in detecting endometriosis. However, urinary proteome studies showed considerable heterogeneity with respect to the population studied, the way the samples were processed and the data analysis. The molecular masses of the identified differentially regulated peptides were entirely inconsistent across studies, with most remaining unidentified biologically. Establishing standardised analytical processes, consistent sets of markers and defined cut‐off thresholds would improve the assessment of urinary peptides as a diagnostic tool for endometriosis, and further large‐scale studies are required before meaningful conclusions can be made.

CK 19 and VEGF were each found not to be associated with endometriosis in more than one study, indicating that these biomarkers are unlikely to have diagnostic value. In view of the paucity of data, further large studies are still needed to support this statement.

There were no studies that assessed the role of urinary biomarkers in the diagnosis of ovarian endometrioma.

Strengths and weaknesses of the review

This review is part of a comprehensive review series of minimally invasive biomarkers for the diagnosis of endometriosis. A very thorough search of the current literature was undertaken and included studies written in languages other than English. Two independent reviewers extracted the data and used a modified QUADAS‐2 tool to perform quality assessments. Stringent selection criteria ensured that eligible studies utilised prospectively collected samples and only included women of reproductive age, which minimised the risk of bias in interpreting the reference standard and index test. An additional strength of this review was that the authors of the studies were approached in an attempt to obtain any missing information required to assess eligibility and critically appraise the studies. The inclusion of studies demonstrating that biomarker levels did not significantly differ in endometriosis introduced an additional dimension to the interpretation of the results, particularly for the biomarkers with contradictory results. Furthermore, biomarkers which were consistently reported as unchanged by the disease could be excluded from the list of putative biomarkers for endometriosis. Although this has little influence on the conclusions of this review due to the paucity of the available data, the relevance of this method will increase in future updates that describe this growing body of evidence.

The main limitation of this review is that there are only individual small studies for all the evaluated index tests. A meaningful meta‐analysis of index test performance was not possible for any urinary biomarker. There was variation between studies with respect to the included populations, the severity of endometriosis, the phase of the menstrual cycle at which sampling was performed and whether the urinary biomarker levels were corrected against creatinine excretion. Also, most of the included studies determined the diagnostic cut‐off thresholds using a ROC analysis without any subsequent validation in an independent cohort. Lack of validation of the diagnostic data, in conjunction with the low number of studies for the majority of the presented tests, contributed to the low quality of evidence presented in this review. A standardised methodology for fluid biospecimen collection, processing and storage is now available and we recommend adhering to these standards in future diagnostic studies (Rahimoglu 2014).

Another weakness is the variation in the selection of the case and control groups, with inclusion of participants who may not reflect a clinically representative population. The reported prevalence of endometriosis in most studies (43% to 66%) was generally higher than previously reported prevalences of endometriosis (6% to 10% in the general female population and 35% to 50% in symptomatic women) (Giudice 2004). This may reflect a high level of surgical diagnostic expertise but could also be due to pre‐selection of more challenging cases in tertiary referral centres, and there is a high risk of participant selection bias in most of the studies. Selection bias appeared to be reduced but not eliminated by consecutively enrolling participants; however, information on the method of enrolment was missing in most of the included studies. More than a third of the included studies (3/8, 38%) were of a 'two‐gate' design and included a wide group of participants who underwent surgery for various indications. Inclusion of healthy asymptomatic individuals or participants with other pathological conditions represents potential selection bias with regard to the control group, which could have biased the test outcomes. Thirty‐eight percent of the studies either included women with a limited spectrum of endometriosis (n = 2) or did not provide information on the severity of the target condition (n = 1). These studies were included to avoid omission of potentially valuable diagnostic information, but each of the above factors could skew the diagnostic estimates in either direction and subsequently interfere with the interpretation of the index test results. It was not possible to evaluate population and disease spectrum effects on the data because there were only single reports for each of the urinary biomarkers.

Inappropriate assignation to the endometriosis and control groups could not be excluded in some studies and is another weakness of the review. Surgical misdiagnosis is a potential cause of bias as the number and experience of the surgical team, the surgical diagnostic criteria and the surgical methods were poorly described in most of the included studies. A standardised technique for performing laparoscopy is now available and we recommend that any future studies use this standardised method (Becker 2014). Additionally, we did not confine the studies included in this review to those that reported histological confirmation of endometriotic lesions. Although a recent ESHRE guideline stated that evidence is lacking to support laparoscopy without histology to confirm endometriosis (Dunselman 2014), the clinical significance of histological verification remains debatable. Diagnosis by surgical visualisation only remains a common clinical practice and can be considered reliable when an accurate inspection of the abdominal cavity is performed by experienced surgeons. We chose to include the studies that only reported surgical visualisation as the reference standard because we did not wish to lose potentially valuable information by excluding studies that did not confirm the diagnosis histologically; however, this could impact the accuracy of assignation to the case and control groups. Only one study did not report using histology as part of the reference standard and, although this could bias the reported results, the impact of including this study on the review findings is likely to be low (Wang 2014).

There are no well‐established criteria for replacement or triage diagnostic tests; therefore we chose criteria that were both realistic and clinically applicable to assist in the interpretation of the complex results. For a replacement test, we considered the threshold reported by the only systematic review on accuracy of the reference standard (laparoscopy) in detecting endometriosis to be the most objective (Wykes 2004). The meta‐analysis was published in 2004 and included four eligible studies comprising 433 women. We acknowledge the limitations associated with emphasising a single review, particularly if it does not present the latest and possibly more accurate data that reflect advances in surgical expertise and technology. Several studies on accuracy of laparoscopy in detecting endometriosis have been published in the last decade; however, their results were not addressed in a systematic way. A further systematic analysis to determine the accuracy of laparoscopy was beyond the scope of this review. The criteria for triage tests utilised the common concepts of SnOUT and SpIN in medical statistics and the cut‐offs were set at levels we considered to be clinically relevant (see Role of index test(s)). We encourage readers to apply independent interpretations of the presented diagnostic estimates, using thresholds that may be more applicable to specific populations and clinical circumstances.

Applicability of findings to the review question

QUADAS‐2 assigned a low rank to clinical applicability with respect to participant selection in 63% of the studies (5/8), summarised as a high concern in all these reports. This occurred when the set of participants in the study was broader than that seen in clinical practice or when the spectrum of the target condition was limited, so that the findings may not be applicable to the review question and to clinical practice. Applicability of the index test and reference standard was judged to be satisfactory using the QUADAS‐2 tool for all studies. However, the majority of included studies were conducted in academic institutions with a high level of expertise in laboratory techniques, and the index test outcome measures may not be reproducible in all institutions or extrapolated to general practice.

Some potentially relevant well‐designed studies were excluded as they did not directly address the review question. For example, we excluded studies that reported on biomarkers with differential expression in endometriosis but that did not provide enough information to assess the diagnostic performance of the biomarker. Some forms of endometriosis, such as bladder or ureteric endometriosis, or endometriosis involving extra‐pelvic sites (e.g. umbilicus, hernia sacs, abdominal wall, lung, kidney, etc.), were also excluded from the review as they are informed predominantly by case reports or small case series and diagnostic laparoscopy is not an applicable reference test for these conditions. Although these target conditions are rare, from a clinical perspective the diagnostic options for these forms of endometriosis remain unclear.

Figure 1

Sequential approach to non‐invasive testing of endometriosis

Figure 2

Flow of the studies identified in literature search for systematic review on urinary biomarkers for a non‐invasive diagnosis of endometriosis.

Figure 3

Risk of bias and applicability concerns graph: review authors' judgements about each domain presented as percentages across included studies

Figure 4

Risk of bias and applicability concerns summary: review authors' judgements about each domain for each included study

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Test 1

NNE‐Cr (> 0.96 ng/mgCr).

Test 2

VDBP‐Cr (> 87.83 ng/mgCr).

Test 3

Proteome by MALDI‐TOF‐MS (peptide m/z 1824.3 Da; ≥ 29.34 au).

Test 4

Proteome by MALDI‐TOF‐MS (peptide m/z 1767.1 Da; ≥ 35.22 au).

Test 5

Proteome by MALDI‐TOF‐MS (peptide m/z 2052.3 Da; cut‐off not reported).

Test 6

Proteome by MALDI‐TOF‐MS (peptide m/z 3393.9 Da; cut‐off not reported).

Test 7

Proteome by MALDI‐TOF‐MS (peptide m/z 1579.2 Da [collagen alpha 6(IV) chain precursor]; cut‐off not reported).

Test 8

Proteome by MALDI‐TOF‐MS (peptide m/z 891.6 Da [collagen alpha1 chain precursor]; cut‐off not reported).

Test 9

Proteome by MALDI‐TOF‐MS (5 peptides m/z 1433.9 + 1599.4 + 2085.6 + 6798.0 + 3217.2 Da; cut‐off not reported).

Test 10

CK 19 [CYFRA 21‐1] (> 5.3 ng/ml).

Summary of findings 1. Biomarkers evaluated as a diagnostic test for endometriosis

Review question	What is the diagnostic accuracy of the urinary biomarkers in detecting pelvic endometriosis [peritoneal endometriosis, endometrioma, DIE]?
Importance	A simple and reliable non‐invasive test for endometriosis, with the potential to either replace surgery or to triage patients in order to reduce surgery, would minimise surgical risk and reduce diagnostic delay
Patients	Reproductive‐aged women 1) with suspected endometriosis or 2) with persistent ovarian mass or 3) undergoing infertility workup or gynaecological laparoscopy
Settings	Hospitals (public or private of any level): outpatient clinics (general gynaecology, reproductive medicine, pelvic pain); research laboratories
Reference standard	Visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Study design	Cross sectional studies with a 'single‐gate' design (n = 4) or a 'two‐gate' design (n = 1); prospective enrolment; a single study could assess more than one test
Risk of bias	Overall judgement: Poor quality of most of the studies (no study had a 'low risk' assessment in all 4 domains)
	Patient selection bias: High risk ‐ 1 study; Unclear risk ‐ 4 studies; Low risk ‐ 0 studies
	Index test interpretation bias: High risk ‐ 5 studies; Unclear risk ‐ 0 studies; Low risk ‐ 0 studies
	Reference standard interpretation bias: High risk ‐ 0 studies; Unclear risk ‐ 1 study; Low risk ‐ 4 studies
	Flow and timing selection bias: High risk ‐ 1 study; Unclear risk ‐ 0 studies; Low risk ‐ 4 studies
Applicability concerns	Concerns regarding patient selection: High concern ‐ 3 studies; Unclear concern ‐ 0 studies; Low concern ‐ 2 studies
	Concerns regarding index test: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 5 studies
	Concerns regarding reference standard: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 5 studies
Biomarker		N of studies; N of women	True positives (endometriosis)	False negatives (incorrectly classified as disease‐free)	True negatives (disease‐free)	False positives (incorrectly classified as endometriosis)	Diagnostic estimates [95% CI]	Implications
NNE (enolase I) cut‐off > 0.96 ng/mgCr		1; 59	22	17	14	6	Sensitivity 0.56 [0.40, 0.72] Specificity 0.70 [0.46, 0.88]	Insufficient evidence to draw meaningful conclusions
VDBP cut‐off > 87.83 ng/mgCr		1; 95	33	24	21	17	Sensitivity 0.58 [0.44, 0.71] Specificity 0.55 [0.38, 0.71]	Insufficient evidence to draw meaningful conclusions
CK 19 [CYFRA21‐1] cut‐off > 5.3 ng/ml		1; 98	7	56	33	2	Sensitivity 0.11 [0.05, 0.22] Specificity 0.94 [0.81, 0.99]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1824.3 Da cut‐off ≥ 29.34 au		1; 28	10	3	11	4	Sensitivity 0.77 [0.46, 0.95] Specificity 0.73 [0.45, 0.92]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1767.1 Da cut‐off ≥ 35.22 au		1; 27	9	3	13	2	Sensitivity 0.75 [0.43, 0.95] Specificity 0.87 [0.60, 0.98]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 2052.3 Da cut‐off not reported		1; 122	50	10	43	19	Sensitivity 0.83 [0.71, 0.92] Specificity 0.69 [0.56, 0.80]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 3393.9 Da cut‐off not reported		1; 122	51	9	44	18	Sensitivity 0.85 [0.73, 0.93] Specificity 0.71 [0.58, 0.82]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 1579.2 Da [collagen alpha 6(IV) chain precursor] cut‐off not reported		1; 122	50	10	43	19	Sensitivity 0.83 [0.71, 0.92] Specificity 0.69 [0.56, 0.80]	Insufficient evidence to draw meaningful conclusions
Proteome: peptide m/z 891.6 Da [collagen alpha1 chain precursor] cut‐off not reported		1; 122	49	11	40	22	Sensitivity 0.82 [0.70, 0.90] Specificity 0.65 [0.51, 0.76]	Insufficient evidence to draw meaningful conclusions
Proteome: 5 peptides m/z 1433.9 + 1599.4 + 2085.6 + 6798.0 + 3217.2 Da cut‐off not reported		1; 25	10	1	13	1	Sensitivity 0.91 [0.59, 1.00] Specificity 0.93 [0.66, 1.00]	Insufficient evidence to draw meaningful conclusions. Approaches criteria for a replacement test or SnOUT/SpIN triage tests; further diagnostic test accuracy studies recommended
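
As an illustration only, the sensitivity and specificity point estimates in the table above can be reproduced from the reported 2 × 2 counts. The sketch below is not part of the review's analysis: the class name `TwoByTwo` and its property names are assumptions introduced here, and it computes point estimates only (the 95% confidence intervals in the table are not recalculated).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one table row (hypothetical helper, not from the review)."""
    tp: int  # true positives: endometriosis, test positive
    fn: int  # false negatives: endometriosis, test negative
    tn: int  # true negatives: disease-free, test negative
    fp: int  # false positives: disease-free, test positive

    @property
    def sensitivity(self) -> float:
        # Proportion of women with endometriosis who test positive.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of disease-free women who test negative.
        return self.tn / (self.tn + self.fp)


# Counts copied from two rows of the table above.
rows = {
    "NNE (enolase I)": TwoByTwo(tp=22, fn=17, tn=14, fp=6),
    "Proteome: 5 peptides": TwoByTwo(tp=10, fn=1, tn=13, fp=1),
}

for name, counts in rows.items():
    print(f"{name}: sensitivity {counts.sensitivity:.2f}, "
          f"specificity {counts.specificity:.2f}")
```

For these two rows the sketch gives 0.56 and 0.70 for NNE, and 0.91 and 0.93 for the five‐peptide test, matching the tabulated estimates.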

Summary of findings 2. Biomarkers that do not distinguish between women with and without endometriosis

Review question	Which urinary biomarkers are unlikely to serve as a basis of the diagnostic test for endometriosis?
Importance	Biomarkers that do not show differential expression in women with and without endometriosis are unlikely to be diagnostically useful. Information regarding negative trials can focus research on better diagnostic targets. The biomarkers that display conflicting results (distinguish women with and without endometriosis in some, but not all, studies) can be identified and reported on. Papers that did not show differential expression of a biomarker in endometriosis but were adequately designed and that met inclusion criteria for this review were included.
Patients	Reproductive aged women 1) with suspected endometriosis or 2) with persistent ovarian mass or 3) undergoing infertility workup/gynaecological laparoscopy
Settings	Hospitals (public or private of any level): outpatient clinics (general gynaecology, reproductive medicine, pelvic pain); research laboratory
Reference standard	Visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Study design	Cross‐sectional studies with a 'single‐gate' design (n = 1) or a 'two‐gate' design (n = 2); prospective enrolment; a single study could assess more than one test
Risk of bias	Overall judgement: Poor quality (no studies had 'low risk' assessment in all 4 domains)
	Patient selection bias: High risk ‐ 2 studies; Unclear risk ‐ 1 study; Low risk ‐ 0 studies
	Index test interpretation bias: High risk ‐ 3 studies; Unclear risk ‐ 0 studies; Low risk ‐ 0 studies
	Reference standard interpretation bias: High risk ‐ 0 studies; Unclear risk ‐ 0 studies; Low risk ‐ 3 studies
	Flow and timing selection bias: High risk ‐ 0 studies; Unclear risk ‐ 0 studies; Low risk ‐ 3 studies
Applicability concerns	Concerns regarding patient selection: High concern ‐ 2 studies; Unclear concern ‐ 0 studies; Low concern ‐ 1 study
	Concerns regarding index test: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 3 studies
	Concerns regarding reference standard: High concern ‐ 0 studies; Unclear concern ‐ 0 studies; Low concern ‐ 3 studies
Biomarker	Expression levels	rASRM stage	Menstrual cycle phase	Reference
VEGF	endometriosis (n = 46)¹: 1.11 ± 0.17 pg/mg Cr controls (n = 24): 0.76 ± 0.14 pg/mg Cr p ‐ NS	I‐IV	follicular or luteal	Cho 2007
VEGF	endometriosis (n = 40)¹: 83.6 ± 11.3 pg/mg Cr controls (n = 22): 88.5 ± 10.4 pg/mg Cr P = 0.77	I‐IV	follicular or luteal	Potlog‐Naharia 2004
TNF‐a	endometriosis (n = 46)¹: 0.02 ± 0.01 pg/mg Cr controls (n = 24): 0.01 ± 0.002 pg/mg Cr p ‐ NS	I‐IV	follicular or luteal	Cho 2007
CK 19	endometriosis (n = 44)²: 5.4 ± 5.3 controls (n = 32): 6.7 ± 9.9 p ‐ NS	not reported	follicular or luteal	Kuessel 2014
¹ mean ± SEM; ² mean ± SD

Table 1. Staging of endometriosis, rASRM classification

Peritoneum	Endometriosis	< 1 cm	1 to 3 cm	> 3 cm
	Superficial	1	2	4
	Deep	2	4	6
Ovary	R Superficial	1	2	4
	Deep	4	16	20
	L Superficial	1	2	4
	Deep	4	16	20
	Posterior cul‐de‐sac obliteration	Partial: 4	Complete: 40
Ovary	Adhesions	< 1/3 Enclosure	1/3‐2/3 Enclosure	> 2/3 Enclosure
	R Filmy	1	2	4
	Dense	4	8	16
	L Filmy	1	2	4
	Dense	4	8	16
Tube	R Filmy	1	2	4
	Dense	4*	8*	16
	L Filmy	1	2	4
	Dense	4*	8*	16
* If the fimbriated end of the fallopian tube is completely enclosed, change the point assignment to 16 (American Society for Reproductive Medicine 1997)

Table 2. Urinary biomarkers for endometriosis

Angiogenesis/Growth factors and their receptors
VEGF‐A (vascular endothelial growth factor ‐ A)¹
sFlt‐1 [sVEGFR‐1] (soluble fms‐like tyrosine kinase or variant of VEGF receptor 1)²
Cell adhesion molecules and other matrix‐related proteins
MMP‐2 (matrix metalloproteinase‐2)²
MMP‐9 (matrix metalloproteinase‐9)²
MMP‐9/ NGAL (matrix metalloproteinase‐9/neutrophil gelatinase‐associated lipocalin)²
Cytokines
TNF‐alpha (tumour necrosis factor alpha)¹
Cytoskeleton molecules
CK‐19 or CYFRA 21‐1 (Cytokeratin‐19)¹
High throughput markers
Proteome
Oxidative stress markers
8‐iso‐PGF2a (8‐iso‐prostaglandin F2a)²
Other Peptides/proteins
VDBP (vitamin D binding protein)
NNE (enolase I)
Collagen precursors
Prealbumin²
Alpha 1 antitrypsin²
Chain A solution structure of Bb' domains of human protein disulfide isomerase²
¹ Urinary biomarkers that did not exhibit differential expression in endometriosis
² Urinary biomarkers that exhibited differential expression in endometriosis, but for which the diagnostic estimates were not available

Table 3. Application of the QUADAS‐2 tool for assessment of methodological quality of the included studies

Domain 1 ‐ Patient selection
Description	Describe methods of patient selection and included patients
Type of bias assessed	Selection bias, spectrum bias
Review Question	Women of reproductive age with clinically suspected endometriosis (symptoms, clinical examination ± presence of pelvic mass), scheduled for surgical exploration of pelvic/abdominal cavity for confirmation of the diagnosis ± treatment
Information collected	Study objectives, study population, selection (inclusion and exclusion criteria), study design, clinical presentation, age, number of participants enrolled and number of participants available for analysis, setting, place and period of the study
Signalling question 1	Was a consecutive or random sample of patients enrolled?
Yes	If a consecutive sample or a random sample of the eligible patients was included in the study
No	If non‐consecutive sample or non‐random sample of the eligible patients was included in the study
Unclear	If this information was unclear
Signalling question 2	Did the study avoid inappropriate exclusions?
Yes	If inclusion/exclusion criteria were presented and all patients with suspected endometriosis were included, with an exception for those who a) had a history of medical conditions or were on medical therapy that would have potentially interfered with interpretation of index test (e.g. malignancy, pregnancy, autoimmune disorders, infectious diseases, treatment with hormonal or immunomodulator substances); b) refused to participate in the study; or c) were unfit for surgery
No	If the study excluded the patients based on education level, psychosocial factors, genetic testing or phenotype or excluded patients with any co‐morbidities commonly present in general population, including a population that could have undergone a testing for endometriosis in clinical setting (hypertension, asthma, obesity, benign gastro‐intestinal or renal disease, etc)
Unclear	If the study did not provide clear definition of the selection (inclusion or exclusion) criteria and 'no' judgement was not applicable
Signalling question 3	Was a 'two‐gate' design avoided?
Yes	If the study had a single set of inclusion criteria, defined by the clinical presentation (i.e. only participants in whom the target condition is suspected) ‐ a ‘single‐gate’ study design
No	If the study had more than one set of inclusion criteria in respect to clinical presentation (i.e. participants suspected of target condition and participants with alternative diagnosis in whom the target condition would not be suspected in clinical practice) ‐ a 'two‐gate' study design
Unclear	If it was unclear whether a 'two‐gate' design was avoided or not
Risk of bias	Could the selection of patients have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the included patients do not match the review question?
Low	If the study includes only clinically relevant population that would have undergone index test in real practice and includes representative form of target condition
High	If the study population differed from the population defined in the review question in terms of demographic features and co‐morbidity (e.g. studies with multiple sets of inclusion criteria with respect to clinical presentation including either healthy controls or alternative diagnosis controls that would not have undergone index test in real practice). Further, if target condition diagnosed in the study population was not representative of the entire spectrum of disease, such as limited spectrum of severity (e.g. only mild forms) or limited type of endometriosis (e.g. only DIE)
Unclear	If this information was unclear (e.g. severity of endometriosis was not reported)
Domain 2 ‐ Index test
Description	Describe the index test, how it was conducted and interpreted
Type of bias assessed	Test review bias, clinical review bias, interobserver variation bias
Review Question	Any type of urinary biomarkers
Information collected	Index test name, description of positive case definition by index test as reported, threshold for positive result, examiners (number, level of expertise, blinding), interobserver variability, conflict of interests
Signalling question 1	Were the index test results interpreted without knowledge of the results of the reference standard?
Yes	If the operators performing or interpreting index test were unaware of the results of reference standard
No	If the operators performing or interpreting index test were not blinded to the results of reference standard
Unclear	If this information was unclear
Signalling question 2	If a threshold was used, was it pre‐specified?
Yes	If the study clearly provided a threshold for a positive result and the threshold was defined before execution or interpretation of the index test
No	If a threshold for positive result was not provided or not defined prior to test execution
Unclear	If it was unclear whether a threshold was pre‐specified or not
Signalling question 3	Was a menstrual cycle phase considered in interpreting the index test?
Yes	If all the included participants were in the same phase of menstrual cycle or if the study reported subgroup analyses per cycle phase or if study reported the pooled estimates after impact of the cycle phase on biomarker expression was not detected
No	If study included participants in different phases of menstrual cycle, but effect of cycle phase on index test was not assessed
Unclear	If the cycle phase was not reported
Risk of bias	Could the conduct or interpretation of the index test have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the index test, its conduct, or interpretation differ from the review question?
Low	We considered all types of urinary biomarkers as eligible; therefore all the included studies were classified as 'low concern', unless 'unclear' judgement was applicable
High	We did not consider the studies where index tests other than urinary biomarkers were included (or excluded information on other index tests reported in addition to urine tests) or where index test looked at other target conditions not specified in the review (e.g. studies aimed at classifying pelvic masses as benign and malignant); therefore none of the included studies was classified as 'high concern'
Unclear	If study did not present sufficient information on at least one of the following: laboratory method, sample handling, reagents used, experience of the test operators
Domain 3 ‐ Reference standard
Description	Describe the reference standard, how it was conducted and interpreted
Type of bias assessed	Verification bias, bias in estimation of diagnostic accuracy due to inadequate reference standard
Review Question	Target condition ‐ pelvic endometriosis, ovarian endometriosis, DIE. Reference standard ‐ visualisation of endometriosis at surgery (laparoscopy or laparotomy) with or without histological confirmation
Information collected	Target condition, prevalence of target condition in the sample, reference standard, description of positive case definition by reference test as reported, examiners (number, level of expertise, blinding)
Signalling question 1	Were the reference standards likely to correctly classify the target condition?
Yes	If the study reported at least one of the following: surgical procedure was described in sufficient details; or criteria for positive reference standard were stated; or diagnosis was confirmed by histopathology; or the procedure was performed by the team with high level of expertise in diagnosis/surgical treatment of target condition, including tertiary referral centres for endometriosis
No	If reference standard did not classify target condition correctly; considering the inclusion criteria and a nature of the reference standard, none of the studies were classified as 'no' for this item
Unclear	If information on execution of the reference standard, its interpretation or operators was unclear
Signalling question 2	Were the reference standard results interpreted without knowledge of the results of the index tests?
Yes	If operators performing the reference test were unaware of the results of index test
No	If operators performing the reference test were aware of the results of index test
Unclear	If this information was unclear
Risk of bias	Could the reference standard, its conduct, or its interpretation have introduced bias?
Low	If 'yes' classification for all the above 2 questions
High	If 'no' classification for any of the above 2 questions
Unclear	If 'unclear' classification for any of the above 2 questions and 'high risk' judgement was not applicable
Concerns about applicability	Are there concerns that the target condition as defined by the reference standard does not match the question?
Low	Considering the inclusion criteria, all the included studies were classified as 'low concern'
High	We excluded the studies where participants did not undergo surgery for diagnosis of endometriosis, therefore none of the included studies were classified as 'high concern'
Unclear	Only studies where laparoscopy/laparotomy served as a reference test were included; therefore none of the included studies was classified as 'unclear concern'
Domain 4 ‐ Flow and timing
Description	Describe any patients who did not receive the index tests or reference standard or who were excluded from the 2 x 2 table, describe the interval and any interventions between index tests (sample collection) and the reference standard
Type of bias assessed	Disease progression bias, bias of diagnostic performance due to missing data
Review Question	An interval of less than 12 months between index test (sample collection) and reference standard ‐ endometriosis may progress over time, so we chose an arbitrary interval of 12 months as acceptable between the sample collection and surgical confirmation of diagnosis
Information collected	Time interval between index test (sample collection) and reference standard, withdrawals (overall number reported and whether they were explained)
Signalling question 1	Was there an appropriate interval between index test (sample collection) and reference standard?
Yes	If time interval was reported and was less than 12 months
No	We excluded all the studies where time interval was longer than 12 months, therefore none of the included studies were classified as 'no' for this item
Unclear	If the time interval was not stated clearly, but the authors' description allowed the assumption that the interval was reasonably short
Signalling question 2	Did all women receive the same reference standard?
Yes	If all participants underwent laparoscopy or laparotomy as a reference standard; considering the inclusion criteria, all the studies were classified as 'yes' for this item, as anticipated
No	If all participants did not undergo surgery or had alternative reference standard or if only a subset of participants had surgery as reference standard, but the information on this population was not available in isolation; considering the inclusion criteria, none of the included studies were classified as 'no' for this item
Unclear	If this information was unclear; considering the inclusion criteria, none of the included studies were classified as 'unclear' for this item
Signalling question 3	Were all women included in the analysis?
Yes	If all the women were included in the analysis or if women were excluded because they did not meet inclusion criteria prior to execution of index test or if the withdrawals were less than 5% of the enrolled population (arbitrarily selected cut‐off)
No	If any patients were excluded from the analysis because of uninterpretable results, inability to undergo either index test or reference standard or for unclear reasons
Unclear	If this information was unclear
Risk of bias	Could the patient flow have introduced bias?
Low	If 'yes' classification for all the above 3 questions
High	If 'no' classification for any of the above 3 questions
Unclear	If 'unclear' classification for any of the above 3 questions and 'high risk' judgement was not applicable
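
The domain‐level rule used in the table above (low risk if every signalling question is answered 'yes', high risk if any is answered 'no', otherwise unclear) can be sketched as follows. This is an illustrative sketch only: the function name `risk_of_bias` and the lower‐case string labels are assumptions introduced here and are not part of the QUADAS‐2 tool or of the software used for the review.

```python
from typing import Iterable

VALID_ANSWERS = {"yes", "no", "unclear"}


def risk_of_bias(answers: Iterable[str]) -> str:
    """Combine signalling-question answers for one domain into a judgement.

    Hypothetical helper: returns 'low', 'high' or 'unclear'.
    """
    normalised = [answer.strip().lower() for answer in answers]
    if not normalised:
        raise ValueError("at least one signalling question is required")
    unknown = set(normalised) - VALID_ANSWERS
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")

    if "no" in normalised:
        # Any 'no' gives a high risk of bias for the domain.
        return "high"
    if "unclear" in normalised:
        # No 'no', but at least one 'unclear'.
        return "unclear"
    # Every signalling question was answered 'yes'.
    return "low"


# Example: a domain with three signalling questions.
print(risk_of_bias(["yes", "unclear", "yes"]))  # unclear
print(risk_of_bias(["yes", "no", "unclear"]))   # high
```

A 'no' takes precedence over an 'unclear' in this sketch, which mirrors the wording that 'unclear' applies only when a 'high risk' judgement is not applicable.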

Table Tests. Data tables by test

Test	No. of studies	No. of participants
1 NNE‐Cr (> 0.96 ng/mgCr)	1	59
2 VDBP‐Cr (> 87.83 ng/mgCr)	1	95
3 Proteome by MALDI‐TOF‐MS (peptide m/z 1824.3 Da; ≥ 29.34 au)	1	28
4 Proteome by MALDI‐TOF‐MS (peptide m/z 1767.1 Da; ≥ 35.22 au)	1	27
5 Proteome by MALDI‐TOF‐MS (peptide m/z 2052.3 Da; cut‐off not reported)	1	122
6 Proteome by MALDI‐TOF‐MS (peptide m/z 3393.9 Da; cut‐off not reported)	1	122
7 Proteome by MALDI‐TOF‐MS (peptide m/z 1579.2 Da [collagen alpha 6(IV) chain precursor]; cut‐off not reported)	1	122
8 Proteome by MALDI‐TOF‐MS (peptide m/z 891.6 Da [collagen alpha1 chain precursor]; cut‐off not reported)	1	122
9 Proteome by MALDI‐TOF‐MS (5 peptides m/z 1433.9 + 1599.4 + 2085.6 + 6798.0 + 3217.2 Da; cut‐off not reported)	1	25
10 CK 19 [CYFRA 21‐1] (> 5.3 ng/ml)	1	98
