LNDb: A Lung Nodule Database on Computed Tomography

Abstract

Lung cancer is the deadliest type of cancer worldwide and late detection is the major factor for the low survival rate of patients. Low dose computed tomography has been suggested as a potential screening tool but manual screening is costly, time-consuming and prone to variability. This has fuelled the development of automatic methods for the detection, segmentation and characterisation of pulmonary nodules but their application to clinical routine is challenging. In this study, a new database for the development and testing of pulmonary nodule computer-aided strategies is presented which intends to complement current databases by giving additional focus to radiologist variability and local clinical reality. State-of-the-art nodule detection, segmentation and characterization methods are tested and compared to manual annotations as well as to collaborative strategies that combine multiple radiologists or a radiologist and a computer-aided system. It is shown that state-of-the-art methodologies can determine a patient’s follow-up recommendation as accurately as a radiologist, though the nodule detection method used shows decreased performance in this database.

lung cancer, low dose computed tomography, pulmonary nodules, computer-aided diagnosis, deep learning.

I Introduction

Lung cancer is the deadliest type of cancer worldwide for both men and women [20]. Though changes in the smoking patterns in the general population have been largely responsible for decreasing trends in incidence and mortality rates in recent decades, lung cancer is still responsible for more than double the cancer deaths of colorectal cancer, the second deadliest cancer type, and is projected to remain the deadliest type of cancer in the near future. Progress in increasing the lung cancer survival rate has also been notoriously slow in contrast to other cancer types, mainly due to late diagnosis of the disease. Low-dose computed tomography (CT) has long been suggested as a potential early screening tool and a 20% reduction in lung cancer mortality has been demonstrated for lung cancer risk groups [22]. Nevertheless, translation of these screening programs to the general population has been challenging due to equipment and personnel costs and the complexity of the task. Namely, lung nodules present a large range of shapes and characteristics and thus the identification and characterization of these abnormalities is not trivial and is prone to high interobserver variability. Computer-aided diagnosis (CAD) systems can thus facilitate the adoption and generalization of screening programs by reducing the burden on clinicians and providing a second opinion.

Extensive research has been conducted on the development of CAD systems for lung cancer screening focusing on the different tasks essential for efficient screening - pulmonary nodule detection and segmentation followed by nodule characterization and classification of malignancy. Recently, deep learning based methods have shown especially promising results for nodule detection [6, 1, 23, 24], segmentation [23, 24, 1, 25] and characterization [24, 8, 19, 7]. In fact, most of the best performing methods on the LUNA16 nodule detection challenge use deep learning [18] and the same trend was observed for detection and malignancy classification on Kaggle’s Data Science Bowl 2017 challenge [5].

Given the dependence of deep learning methods on large datasets with robust ground truth, the publication of annotated datasets has been a hugely important contribution for the community. Perhaps the most widely known public database is the LIDC-IDRI [3], which contains 1018 CT scans, each annotated by four radiologists. Annotations comprise nodule segmentation and subjective characterization [14], making this an extremely useful database for the development of CAD approaches in lung cancer screening. The NLST database is also widely recognised and contains CTs from 26,722 patients, though nodule segmentation and characterization are not available and nodule position is limited to the slice where a nodule was found [22].

In spite of the promising results in literature, adoption of CAD systems as part of a broader screening in the clinic is not straightforward. First, the fact that CAD systems are not designed as an integrated part of the clinical routine makes them difficult to adopt. In fact, if not well integrated into the normal routine, they come to represent an extra step in the pipeline, increasing the burden on clinicians. Secondly, in spite of the large quantity and variety of data in a dataset like LIDC-IDRI, translation to a local reality can present challenges such as different acquisition settings, population demographics, pathologies or others, which can be detrimental to the performance of deep learning methods, making a local validation of any CAD system an essential step.

To tackle these issues, the Lung Nodule Database (LNDb) was developed as an external dataset complementary to LIDC-IDRI. The publication of this database will give continuity to LIDC-IDRI and allow the community to perform an external and comparable validation of proposed CAD systems. Furthermore, the fact that eyetracking was used during manual annotation of the images (cf. Section II-B) allows for the development of collaborative strategies for CAD, ensuring that CAD systems are designed as allies of radiologists rather than as competition.

II Methodology

II-A Patient Selection and Data Acquisition

The LNDb contains 294 CT scans collected retrospectively at the Centro Hospitalar e Universitário de São João (CHUSJ) in Porto, Portugal between 2016 and 2018. All data was acquired under approval from the CHUSJ Ethical Committee and was anonymised prior to any analysis to remove personal information except for patient birth year and gender. No scan was acquired specifically for LNDb.

To ensure that the database is relevant, inclusion criteria based on the LIDC-IDRI criteria were used [3]. All patients above the age of 18 were included, except if a prior history of cancer was known. CT scans were collected patientwise to avoid repeated patients. CT scans where intravenous contrast had been used and those with a slice thickness greater than 1mm were excluded. One radiologist then performed a reading of the CT to look for other lung pathologies, noise, motion or other artifacts, in which case the CT would be excluded. If during the first reading more than six nodules or one nodule larger than 30mm in-slice diameter were found the CT would also be excluded. Finding more than six nodules during image annotation was not a reason for exclusion.

Table I shows the acquisition parameters for the CT scans in LNDb. Among the 294 patients scanned, 164 (55.8%) were male. The median age was 66 and the minimum and maximum ages were 19 and 98, respectively.

Parameter                              Value
Scanner model (a)
    Siemens Sensation Cardiac 64       107 (36.4)
    Siemens Somatom Definition Flash   59 (19.5)
    Siemens Somatom go.Up              137 (45.2)
Tube peak potential (kV) (b)           120 [100; 140]
Average tube current (mA) (c)          161.9 ± 128.4
Convolution kernel (a)
    Standard                           4 (1.4)
    Sharp                              160 (54.6)
    Very sharp                         126 (43.0)
    Extremely sharp                    3 (1.0)
In-plane pixel size (mm) (c)           0.63 ± 0.09
Slice thickness (mm) (b)               1.0 [0.5; 1.0]
Number of image slices (b)             318.5 [251; 631]

(a) Data are count (%); (b) data are median [minimum; maximum]; (c) data are mean ± standard deviation.

TABLE I: CT scan acquisition settings.

II-B Manual Annotation Process

Each CT scan was read by at least one radiologist at CHUSJ to identify pulmonary nodules and other suspicious lesions. A total of 5 radiologists with at least 4 years of experience reading up to 30 CTs per week, hereinafter referred to as R1 to R5, participated in the annotation process. Annotations were performed in a single blinded fashion, i.e. a radiologist would read the scan once and no consensus or review between the radiologists was performed. The instructions for manual annotation were adapted from LIDC-IDRI [3]. Each radiologist would read a scan and identify the following lesions: i) nodule ≥3mm: any lesion considered to be a nodule by the radiologist with greatest in-plane dimension larger than or equal to 3mm; ii) nodule <3mm: any lesion considered to be a nodule by the radiologist with greatest in-plane dimension smaller than 3mm; iii) non-nodule: any pulmonary lesion considered not to be a nodule by the radiologist, but that contains features which could make it identifiable as a nodule. Figure 1 shows examples of annotated lesions in LNDb.

Fig. 1: Examples of annotated lesions. (a) Nodule ≥3mm annotated by 3 radiologists; (b) Nodule <3mm annotated by 2 radiologists; (c) Non-nodule annotated by 2 radiologists.

Nodules ≥3mm were segmented and subjectively characterized according to LIDC-IDRI (ratings on subtlety, internal structure, calcification, sphericity, margin, lobulation, spiculation, texture and likelihood of malignancy). For a complete description of these characteristics the reader is referred to McNitt-Gray et al. [14]. For nodules <3mm the nodule centroid was marked and subjective assessment of the nodule’s characteristics was performed. For non-nodules, only the lesion centroid was marked.

Manual annotation was performed in an in-house developed graphical interface [16]. It allows for image orientation and magnification as well as selection of different display windows. Maximum intensity projection was used only for a portion of the scans as it was not available from the start of the project. Lesion segmentation and subjective assessment were performed on the axial slice but radiologists had access to the coronal and sagittal slices as well. Lesion segmentation was performed manually by using a brush to color the nodule.

Eyetracking was performed during manual annotation to record the radiologists’ gaze [12]. The screen coordinates of the radiologists’ gaze were recorded at a frequency of 90Hz together with the display settings of the graphical interface. This allows the screen coordinates to be converted to CT image coordinates, taking the zoom and pan settings into account, so that the region of the CT the radiologist was looking at can be computed throughout the annotation process. Considering a 5° visual angle [15], a 3D attention map of the radiologists’ annotation process can be reconstructed. This 3D attention map can then be used to compute the amount of time spent at each image location.
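As an illustration of how gaze samples could be turned into viewing time per image location, a minimal Python sketch is given below. It is not the interface's actual implementation: the display model (screen = image × zoom + pan), the function names and the sample layout are assumptions made for the example, and the spread of attention over the visual angle is deliberately left out.

```python
from collections import defaultdict

SAMPLE_RATE_HZ = 90  # gaze sampling frequency reported in the text


def screen_to_image(gaze_xy, zoom, pan_xy):
    """Map a screen-space gaze point to in-slice image coordinates.

    Assumes (hypothetically) that the viewer draws the slice as
    screen = image * zoom + pan, so the inverse is (screen - pan) / zoom.
    """
    gx, gy = gaze_xy
    px, py = pan_xy
    return ((gx - px) / zoom, (gy - py) / zoom)


def dwell_time_map(samples):
    """Accumulate viewing time (seconds) per (slice, row, col) voxel.

    Each sample is (gaze_xy, zoom, pan_xy, slice_index); every sample
    contributes 1 / SAMPLE_RATE_HZ seconds to the voxel under the gaze.
    """
    seconds = defaultdict(float)
    for gaze_xy, zoom, pan_xy, slice_index in samples:
        x, y = screen_to_image(gaze_xy, zoom, pan_xy)
        seconds[(slice_index, int(y), int(x))] += 1.0 / SAMPLE_RATE_HZ
    return dict(seconds)
```

Under these assumptions, 90 consecutive samples falling on the same voxel add up to one second of attention at that location.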

II-C Computer-Aided Annotation

For comparison to manual annotation, previously developed methods for computer-aided detection, segmentation and characterisation of nodules were used. All methods were developed and trained exclusively on nodules ≥3mm from LIDC-IDRI.

The nodule detection approach is based on the YOLOv3 architecture [17]. In brief, a model pre-trained on natural images is fine-tuned to detect lung nodules by minimizing a loss function that takes into account the width, height, and centroid of the prediction in relation to the ground truth. To account for 3D information, the algorithm is trained with 3-channel images composed of the axial slice containing the nodule’s center of mass and two equidistant adjacent slices [1]. Predictions are performed for every axial slice and the candidates are merged if their bounding boxes overlap.
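The merging of per-slice predictions can be sketched as follows. This is only one possible reading of the overlap rule: the restriction to consecutive slices, the box format (x_min, y_min, x_max, y_max) and the function names are assumptions of the example and are not taken from the detection method itself.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_candidates(detections):
    """Group per-slice detections into 3D candidates.

    detections: list of (slice_index, box). A detection joins an existing
    group when its box overlaps that group's most recent box and that box
    lies on the previous slice (adjacency is an assumption of this sketch).
    """
    groups = []
    for slice_index, box in sorted(detections):
        for group in groups:
            last_slice, last_box = group[-1]
            if slice_index - last_slice == 1 and boxes_overlap(box, last_box):
                group.append((slice_index, box))
                break
        else:
            # No existing group accepted the detection: start a new candidate.
            groups.append([(slice_index, box)])
    return groups
```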

After candidate detection, a dedicated network for false positive (FP) reduction is used. The network is composed of blocks of 3×3×3 convolutions with batch normalization and rectified linear unit activations. The input is a cube of 64×64×64 voxels centered on the candidate centroid. The binary non-nodule/nodule classification is considered as a multiple-instance learning problem so that the probability is inferred by max-pooling on an 8×8×8×1 feature map. The training dataset is composed of all nodules used for training the detection network as well as the highest scored FPs from each scan in a 1:5 ratio.

The segmentation network is iW-Net [2], a model that allows for both automatic and semi-automatic segmentation. The model is composed of two sequential auto-encoders based on the 3D U-Net [4] and receives as input a 64×64×64 voxel candidate. The first block predicts an initial segmentation, which can then be refined by the second block that assesses the nodule image, the initial segmentation and a weight map resulting from two manual clicks near the nodule boundaries. In this study, only the first block of iW-Net is used.

For nodule characterisation, only texture was used. Three orthogonal planes of 64×64 pixels centered on the candidate center of mass are given as input to a convolutional neural network. The features extracted from each of the three planes are then concatenated so that there is a common output with 3 classes: ground glass opacity (GGO), part solid and solid [8].

III Experiments

III-A Observer Variability

To assess interobserver variability, the annotations of multiple radiologists on matching CTs were compared. Two annotations by different radiologists were considered to correspond to the same lesion, i.e. a unique finding, if the Euclidean distance between their centroids was smaller than or equal to the maximum equivalent diameter of the two nodules. For nodules of equivalent diameter smaller than 3mm, an equivalent diameter of 3mm was considered.
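The matching rule can be written compactly as in the Python sketch below. The function names are hypothetical, and the equivalent diameter is assumed here to be the diameter of the sphere with the same volume as the nodule, which the text does not state explicitly.

```python
import math

MIN_DIAMETER_MM = 3.0  # floor applied to nodules smaller than 3 mm


def equivalent_diameter(volume_mm3):
    """Diameter of the sphere with the given volume (assumed definition)."""
    return (6.0 * volume_mm3 / math.pi) ** (1.0 / 3.0)


def same_finding(centroid_a, diameter_a, centroid_b, diameter_b):
    """Apply the centroid-distance rule to two annotations (units: mm)."""
    # Flooring each diameter at 3 mm and taking the larger one is the
    # same as taking the maximum of the three values.
    limit = max(diameter_a, diameter_b, MIN_DIAMETER_MM)
    return math.dist(centroid_a, centroid_b) <= limit
```

For example, two annotations of 1mm and 2mm nodules whose centroids are 2.5mm apart are treated as the same finding, since the 3mm floor applies.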

Nodule detection agreement was computed as the percentage of cases in agreement over all findings reported as a nodule by at least one of the radiologists being considered:

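$$A_{det} = \frac{N_{n,n}}{N_{n,n} + N_{n,\bar{n}} + N_{\bar{n},n}} \tag{1}$$

where $N_{i,j}$ is the number of findings reported as class $i$ by radiologist 1 and class $j$ by radiologist 2, and $n$ and $\bar{n}$ are the “nodule” and “non-nodule/not reported” classes, respectively.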
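The detection agreement described above amounts to the number of findings marked as a nodule by both radiologists divided by the number marked as a nodule by at least one of them. A minimal Python sketch, with a hypothetical function name and input layout, could be:

```python
def detection_agreement(labels_r1, labels_r2):
    """Fraction of findings called a nodule by both readers, among the
    findings called a nodule by at least one of them.

    labels_r1 / labels_r2: dicts mapping a finding id to True when that
    reader marked it as a nodule; missing ids count as "not reported".
    """
    nodules_r1 = {f for f, is_nodule in labels_r1.items() if is_nodule}
    nodules_r2 = {f for f, is_nodule in labels_r2.items() if is_nodule}
    either = nodules_r1 | nodules_r2
    if not either:
        raise ValueError("no finding was marked as a nodule by either reader")
    return len(nodules_r1 & nodules_r2) / len(either)
```

Findings that neither radiologist marked as a nodule do not enter the computation, so the measure is not inflated by the many locations where both agree that nothing is present.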

Nodule segmentation agreement was evaluated through Jaccard score [11], Hausdorff distance (HD) [10] and mean average distance (MAD) computed as

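$$MAD = \frac{1}{2}\left(\bar{d}(S_1, S_2) + \bar{d}(S_2, S_1)\right) \tag{2}$$

where $\bar{d}(S_1, S_2)$ is the mean of the distance between each surface voxel in segmentation $S_1$ and the closest surface voxel in segmentation $S_2$; $\bar{d}(S_2, S_1)$ is computed in the same way.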
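For illustration, the three segmentation metrics can be computed on small masks with the pure-Python sketch below. Segmentations are represented as sets of voxel index tuples, surface voxels are taken to be those with a 6-connected neighbour outside the mask, and MAD is taken as the mean of the two directed average surface distances; these representation choices and all names are assumptions of the example, and the brute-force search is only practical for small nodules.

```python
import math

NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def jaccard(seg_a, seg_b):
    """Intersection over union of two voxel sets."""
    return len(seg_a & seg_b) / len(seg_a | seg_b)


def surface(seg):
    """Voxels with at least one 6-connected neighbour outside the mask."""
    return {
        v for v in seg
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in seg
               for dx, dy, dz in NEIGHBOURS)
    }


def directed_distances(surf_a, surf_b, spacing):
    """Distance (mm) from each surface voxel of A to the closest one of B."""
    def to_mm(v):
        return tuple(c * s for c, s in zip(v, spacing))
    points_b = [to_mm(v) for v in surf_b]
    return [min(math.dist(to_mm(v), p) for p in points_b) for v in surf_a]


def mad_and_hd(seg_a, seg_b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric mean average distance and Hausdorff distance (mm)."""
    surf_a, surf_b = surface(seg_a), surface(seg_b)
    d_ab = directed_distances(surf_a, surf_b, spacing)
    d_ba = directed_distances(surf_b, surf_a, spacing)
    mad = 0.5 * (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba))
    return mad, max(max(d_ab), max(d_ba))
```

The `spacing` argument converts voxel indices to millimetres, which matters for CT scans where slice thickness and in-plane pixel size differ.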

Nodule characterization agreement was evaluated for each characteristic using Fleiss-Cohen weighted Cohen’s kappa [21]

$$\kappa_w = \frac{\sum_{i,j} w_{ij}\, p_{ij} - \sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}}{1 - \sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}} \tag{3}$$

where $p_{ij}$ is the proportion of cases rated by observer 1 as class $i$ and by observer 2 as class $j$. The dot is a wildcard so that $p_{\cdot j}$ is the proportion of cases rated by observer 2 as class $j$. $w_{ij}$ is the weight for class combination $(i, j)$ according to

$$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2} \tag{4}$$

for a rating consisting of $k$ classes ($1, 2, \dots, k$). Note that for internal structure and calcification, the non-weighted Cohen’s kappa is reported given the non-ordinal nature of these features. Given that for LIDC-IDRI the radiologists’ identity is unknown and Cohen’s kappa cannot be computed, in-class agreement ($p_o$) is reported for comparison, where $p_o = \sum_i p_{ii}$ is the proportion of cases rated the same class by both observers.
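A compact Python sketch of these agreement measures is given below for reference. It uses the standard quadratic (Fleiss-Cohen) agreement weights, which equal 1 for identical ratings and 0 for ratings at opposite ends of the scale; the function names and the convention that ratings are integers starting at 1 are assumptions of the example.

```python
def weighted_kappa(ratings_1, ratings_2, k, weighted=True):
    """Cohen's kappa for integer ratings in 1..k, optionally with quadratic
    (Fleiss-Cohen) agreement weights 1 - (i - j)^2 / (k - 1)^2."""
    n = len(ratings_1)
    p = [[0.0] * k for _ in range(k)]  # joint proportions of rating pairs
    for a, b in zip(ratings_1, ratings_2):
        p[a - 1][b - 1] += 1.0 / n
    row = [sum(p[i]) for i in range(k)]  # observer 1 marginals
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # observer 2

    def w(i, j):
        if weighted:
            return 1.0 - (i - j) ** 2 / (k - 1) ** 2
        return 1.0 if i == j else 0.0  # non-weighted kappa

    observed = sum(w(i, j) * p[i][j] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return (observed - expected) / (1.0 - expected)


def in_class_agreement(ratings_1, ratings_2):
    """Proportion of cases given the same class by both observers."""
    same = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return same / len(ratings_1)
```

Setting `weighted=False` gives the non-weighted kappa used for the non-ordinal characteristics. Note that kappa is undefined when the expected agreement equals 1, for instance when both observers give every case the same single class.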

As a measure of scanwise agreement, the Fleischner Society pulmonary nodule guidelines [13] were used to obtain follow-up recommendations for each CT scan according to the annotations of each radiologist. The Fleischner guidelines are widely used for patient management in the case of nodule findings and take into account the number of nodules (single or multiple), their volume (<100mm³, 100-250mm³ and >250mm³) and texture (solid, part solid and GGO nodules). Nodule volume was computed from the segmentation and nodules <3mm were considered to belong to the first class (<100mm³). Nodule texture was recast from the five classes in the LNDb annotation (1-GGO, 2-intermediate, 3-part solid, 4-intermediate, 5-solid) into the three classes of the Fleischner guidelines by considering GGO as 1-2, part solid as 3 and solid as 4-5. The Fleischner follow-up guidelines were then divided into 4 classes of escalating risk: 0) No routine follow-up required or optional CT at 12 months according to patient risk; 1) CT at 6-12 months required; 2) CT at 3-6 months required; 3) CT, PET/CT or tissue sampling at 3 months required.

This 0-3 score, hereinafter referred to as the Fleischner score, was then used to compare the follow-up recommendations derived from the annotations of each radiologist using the agreement measures defined above. The agreement per nodule in the volume and texture classes was also assessed using in-class agreement and Cohen’s kappa.
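The recasting of annotations into Fleischner volume and texture classes can be sketched in Python as follows. The volume cut-offs of 100mm³ and 250mm³ are those of the Fleischner guidelines; the function names, the handling of values exactly on a boundary and the use of `None` for nodules without a segmentation are assumptions of the example. The mapping from these classes to the final 0-3 follow-up score is not reproduced here.

```python
def fleischner_volume_class(volume_mm3=None):
    """Return 0, 1 or 2 for <100, 100-250 and >250 mm^3.

    volume_mm3 is None for nodules <3mm, which carry no segmentation and
    are assigned to the first class.
    """
    if volume_mm3 is None or volume_mm3 < 100.0:
        return 0
    if volume_mm3 <= 250.0:
        return 1
    return 2


def fleischner_texture_class(texture_rating):
    """Recast the 1-5 texture rating into GGO / part solid / solid."""
    if texture_rating in (1, 2):
        return "GGO"
    if texture_rating == 3:
        return "part solid"
    if texture_rating in (4, 5):
        return "solid"
    raise ValueError(f"texture rating must be 1-5, got {texture_rating}")
```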

III-B Computer-Aided Annotation

In order to assess the performance of state-of-the-art CAD methods in relation to the radiologists’ manual annotations, automatic nodule detection, segmentation and characterization were performed in all CTs.

Nodule detection performance was evaluated in terms of sensitivity and number of FPs per scan as a function of the FP reduction threshold. Given that not all findings were annotated by all radiologists, performance was assessed in relation to the radiologists’ agreement level, considering findings marked as a nodule by at least one or at least two radiologists. Segmentation performance was evaluated in terms of MAD, HD and Jaccard and characterization performance in terms of in-class agreement and Cohen’s kappa. Finally, scanwise performance for the full CAD pipeline in terms of Fleischner score was evaluated with the agreement measure used for the Fleischner score in Section III-A. Similarly to the nodule detection performance evaluation, the agreement level was taken into consideration by computing the radiologists’ Fleischner score from the findings marked by at least one or by at least two radiologists.
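As a minimal illustration of the detection evaluation, the sketch below computes one operating point (sensitivity and FPs per scan) for a given FP reduction threshold. The candidate layout, in which each candidate carries its probability and the identifier of the ground truth nodule it matches, is a hypothetical choice for the example.

```python
def detection_operating_point(candidates, n_nodules, n_scans, threshold):
    """Sensitivity and FPs per scan for candidates scoring >= threshold.

    candidates: list of (probability, matched_nodule_id or None); None
    marks a false positive. A nodule counts once even if hit repeatedly.
    """
    kept = [(p, nid) for p, nid in candidates if p >= threshold]
    detected = {nid for _, nid in kept if nid is not None}
    false_positives = sum(1 for _, nid in kept if nid is None)
    return len(detected) / n_nodules, false_positives / n_scans
```

Sweeping `threshold` over the range of candidate probabilities traces the sensitivity versus FPs per scan curve. Changing the agreement level only changes which findings are counted as ground truth nodules, and therefore which candidates are labelled as matched.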

III-C Collaborative Annotation Strategies

In order to assess the feasibility of collaborative CAD systems, a 2nd opinion experiment was conducted in 23 randomly selected cases. After manual annotation by two radiologists (R4 and R5), each radiologist received suggestions for revision from the other radiologist and the CAD system in terms of nodule detection, segmentation and texture characterization. Suggestions from the other radiologist and CAD were blinded so that each radiologist would not know the source of each suggestion and thus avoid bias in the decision process.

For nodule detection comparison, each radiologist received as suggestions for revision all findings marked as nodules by the other radiologist or CAD if the radiologist had marked them as non-nodules or had not reported them. For nodule segmentation comparison, the LIDC-IDRI interobserver nodule segmentation variability was used to determine nodules in disagreement. As such, two nodule segmentations would be presented for revision to the radiologist if they belonged to the same Fleischner volume class (<100mm³, 100-250mm³ and >250mm³) and had a HD outside the LIDC-IDRI HD variability by 2 standard deviations, or if they belonged to different Fleischner volume classes and had a HD outside the LIDC-IDRI HD variability by 1 standard deviation. For nodule characterization, a nodule would be presented for revision if the annotated nodule textures did not belong to the same Fleischner texture class (1-2, 3 and 4-5).
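The segmentation and texture revision criteria can be expressed as in the sketch below. "Outside the variability by n standard deviations" is interpreted here as a HD above the reference mean plus n standard deviations, which is an assumption, as are the function names and the choice of passing the reference statistics as parameters.

```python
TEXTURE_CLASS = {1: "GGO", 2: "GGO", 3: "part solid", 4: "solid", 5: "solid"}


def needs_segmentation_review(hd_mm, volume_class_a, volume_class_b,
                              ref_mean_mm, ref_std_mm):
    """Flag a pair of segmentations whose HD exceeds the reference
    variability: 2 std when both fall in the same Fleischner volume
    class, 1 std when the volume classes differ."""
    n_std = 2.0 if volume_class_a == volume_class_b else 1.0
    return hd_mm > ref_mean_mm + n_std * ref_std_mm


def needs_texture_review(rating_a, rating_b):
    """Flag a nodule whose two texture ratings (1-5) fall in different
    Fleischner texture classes."""
    return TEXTURE_CLASS[rating_a] != TEXTURE_CLASS[rating_b]
```

The stricter limit for segmentations in different volume classes reflects that such disagreements can change the follow-up recommendation, whereas disagreements within a class cannot.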

After revision of nodule detection, segmentation and texture, the revised annotations by each radiologist were compared and cases in disagreement were revised by both radiologists together to obtain a consensus ground truth. In the particular case of segmentation, the ground truth in nodules which were not revised in the consensus phase was considered to be the average volume of the revised segmentation of the two radiologists.

For each revised annotation by a radiologist, the contributions from the other radiologist and the CAD system were then disentangled to allow for a separate assessment of the different annotation strategies possible: i) single radiologist or CAD annotation; ii) first radiologist annotation followed by revision of second radiologist findings; iii) single radiologist annotation followed by revision of CAD findings. Note that for strategy iii) the number of CAD findings received by the radiologist can be regulated by adjusting the FP threshold used during detection. Furthermore, for strategies ii) and iii), the number of findings received by the radiologist can be regulated by removing findings according to the time spent on that finding’s region during the initial image annotation. The amount of time spent in the region around each finding during manual annotation was computed from the eyetracking map and findings for which the time spent in the region was greater than a predetermined attention threshold were excluded.
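The attention-based filtering of suggestions can be illustrated with the following sketch. The attention map is represented as a dictionary from voxel coordinates to viewing time in seconds, and the region around a finding is taken to be a sphere of fixed radius in voxel units; both are hypothetical choices made for the example.

```python
import math


def time_in_region(dwell_map, centre, radius):
    """Sum viewing time (s) over voxels within `radius` voxels of `centre`."""
    return sum(t for voxel, t in dwell_map.items()
               if math.dist(voxel, centre) <= radius)


def filter_suggestions(suggestions, dwell_map, radius, attention_threshold_s):
    """Drop suggested findings whose region was already viewed for longer
    than the attention threshold during the initial annotation."""
    return [c for c in suggestions
            if time_in_region(dwell_map, c, radius) <= attention_threshold_s]
```

Lowering the attention threshold removes more suggestions, which reduces the revision workload at the risk of discarding findings that were looked at but not recognised as nodules.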

Detection, segmentation and characterization performance were then evaluated in regard to the consensus annotations by R4 and R5. As in Section III-B, detection performance was evaluated in terms of sensitivity and FPs per scan. To assess the burden for the clinicians associated with each strategy, average time expenditure per scan was assessed for each strategy through the eyetracking map. Segmentation and texture characterization performance were evaluated in terms of in-class agreement and Cohen’s kappa in the Fleischner volume and texture classes, and scanwise performance was assessed in terms of the Fleischner score as in Section III-B.

Finally, CAD candidates (at an FP threshold of 0.5) identified as FPs by the radiologists were revised by R4 to identify the anatomical features which were most responsible for FPs.

III-D Statistical Analysis

For comparison between different observers and databases, unpaired t-tests were used taking into account significance at p<0.05 and p<0.01.

IV Results

IV-A LNDb Description

All 294 CTs of LNDb were annotated by at least one radiologist (90 were annotated by 3 radiologists, 145 by 2 radiologists and 59 by a single radiologist). R1 to R5 annotated respectively 125, 90, 81, 162 and 161 CTs. Eyetracking data was collected for a total of 312 CT readings. The database comprises 1897 annotations by the 5 radiologists, corresponding to 1429 unique findings.

Fig. 2: Nodule size and characterization distribution in LNDb and LIDC-IDRI. (a) Nodule in-slice diameter for LIDC-IDRI and R1 to R5. (b) Nodule characterization distribution for LIDC-IDRI (leftmost bar) and R1 to R5 (five rightmost bars). Colors correspond to each of the 1-6 ratings. Note that internal structure is rated in a 1-4 range, calcification in 1-6 and the other characteristics in 1-5.

Figure 2 shows the nodule in-slice diameter and characteristics distribution in LNDb compared to LIDC-IDRI. In-slice diameter was determined as the largest distance between two points in any axial slice for nodules ≥3mm. It can be seen that the distribution in both size and characteristics follows that of LIDC-IDRI. However, more nodules <5mm have been annotated in LNDb, particularly by R2 and R5.

IV-B Observer Variability

Table II shows the nodule detection agreement. It can be seen that the agreement is lower than for LIDC-IDRI.

Radiologist pair    Detection agreement
R1 vs R2            0.31
R1 vs R3            0.34
R2 vs R3            0.32
R2 vs R4            0.23
R2 vs R5            0.40
R4 vs R5            0.31
All vs All          0.32
LIDC-IDRI           0.38

Note that only pairs of radiologists with CTs in common are shown.

TABLE II: Nodule detection agreement for LNDb and LIDC-IDRI.

Table III shows the nodule segmentation agreement in terms of MAD, HD and Jaccard. It can be seen that the segmentation agreement is slightly higher for most radiologist pairs in comparison to LIDC-IDRI in terms of MAD and HD but lower for Jaccard. All metrics show a statistically significant difference when comparing all radiologists to LIDC-IDRI.

Radiologist pair    MAD (mm)       HD (mm)        Jaccard
R1 vs R2            0.46 ± 0.26    2.19 ± 1.52    0.57 ± 0.15
R1 vs R3            0.49 ± 0.19    2.45 ± 1.65    0.58 ± 0.12
R2 vs R3            0.53 ± 0.18    2.26 ± 1.44    0.54 ± 0.14
R2 vs R4            0.32 ± 0.18    1.51 ± 0.39    0.58 ± 0.17
R2 vs R5            0.28 ± 0.04    1.58 ± 0.52    0.62 ± 0.07
R4 vs R5            0.41 ± 0.19    1.88 ± 0.89    0.57 ± 0.14
All vs All          0.45 ± 0.21    2.13 ± 1.35    0.57 ± 0.14
LIDC-IDRI           0.48 ± 0.38    2.92 ± 3.15    0.66 ± 0.13

Note that only pairs of radiologists with CTs in common are shown.

TABLE III: Nodule segmentation agreement (mean ± standard deviation) for LNDb and LIDC-IDRI. Statistical significance of the difference in comparison to LIDC-IDRI observer variability was tested at p<0.05 and p<0.01.

Table IV shows the nodule characterization agreement. It can be seen that, overall, the agreement is higher than for LIDC-IDRI except for calcification which has significantly lower agreement.

Pair        Subtlety        Int. structure  Calcification   Sphericity      Margin          Lobulation      Spiculation     Texture         Malignancy
R1 vs R2    0.48/0.53       1.00/1.00       0.71/0.27       0.48/0.41       0.44/0.37       0.56/0.28       0.57/0.33       0.84/0.78       0.33/0.57
R1 vs R3    0.53/0.58       0.95/0.00       0.86/0.66       0.35/0.33       0.33/0.31       0.67/0.54       0.76/0.56       0.83/0.60       0.45/0.57
R2 vs R3    0.53/0.54       0.94/0.00       0.65/0.37       0.40/0.46       0.42/0.46       0.52/0.25       0.65/0.60       0.90/0.77       0.40/0.52
R2 vs R4    0.57/0.48       0.86/0.00       1.00/1.00       0.43/0.31       0.00/-0.25      0.43/0.00       0.43/-0.10      0.14/0.00       0.14/-0.14
R2 vs R5    0.33/0.41       0.92/0.00       1.00/1.00       0.42/0.41       0.25/-0.13      0.58/0.62       0.58/0.58       0.67/0.00       0.33/0.30
R4 vs R5    0.49/0.37       0.98/0.00       0.81/0.46       0.38/0.42       0.47/0.19       0.63/0.46       0.68/0.32       0.55/0.45       0.44/0.48
All vs All  0.50/NA         0.97/NA         0.78/NA         0.40/NA         0.41/NA         0.60/NA         0.66/NA         0.72/NA         0.41/NA
LIDC-IDRI   0.40/NA         0.99/NA         0.92/NA         0.35/NA         0.41/NA         0.49/NA         0.56/NA         0.69/NA         0.39/NA

Each cell gives in-class agreement / Cohen’s kappa. Note that only pairs of radiologists with CTs in common are shown.

TABLE IV: Nodule characterization agreement for LNDb and LIDC-IDRI. Note that in LIDC-IDRI only nodules ≥3mm were characterized and that for spiculation and lobulation the first 399 CTs contain incorrect labels and were thus excluded. NA: not applicable.

Table V shows the scanwise agreement in terms of Fleischner score. Scanwise agreement is lower than for LIDC-IDRI, in spite of the fact that the agreement for Fleischner volume and texture classes is higher.

Pair        No. CTs  Follow-up   Volume      Texture
R1 vs R2    81       0.58/0.42   0.80/0.83   0.94/0.74
R1 vs R3    83       0.66/0.68   0.80/0.86   0.91/0.47
R2 vs R3    81       0.65/0.57   0.74/0.81   0.95/0.74
R2 vs R4    11       0.64/0.01   1.00/1.00   0.86/0.00
R2 vs R5    11       0.82/0.40   1.00/1.00   0.92/0.00
R4 vs R5    157      0.68/0.57   0.85/0.87   0.90/0.62
All vs All  235      0.65/NA     0.82/NA     0.92/NA
LIDC-IDRI   1010     0.73/NA     0.81/NA     0.88/NA

Each cell gives in-class agreement / Cohen’s kappa. Note that only pairs of radiologists with CTs in common are shown.

TABLE V: Scanwise agreement according to Fleischner guidelines for LNDb and LIDC-IDRI. No. CTs is the number of CTs analysed by each pair of radiologists. NA: not applicable.

IV-C Computer-Aided Annotation

Figure 3 shows the detection performance of CAD and each radiologist when considering the remaining radiologists as ground truth for each agreement level. Within the same agreement level, the average radiologist has a higher sensitivity than the CAD with 0.85 and 0.88 FPs per scan for agreement levels 1 and 2, respectively. The CAD system is able to obtain the same sensitivity as the average radiologist only at 5.99 and 5.80 FPs per scan for agreement levels 1 and 2, respectively. Figure 4 shows examples of nodule candidates proposed by the automatic detection.

Fig. 3: CAD and individual radiologist nodule detection performance for findings marked as a nodule with agreement level 1 and 2.
Fig. 4: Central axial view (51×51mm) of CAD detection examples on CTs annotated by 3 radiologists. Rows correspond to detection candidates with the same agreement level (findings annotated by 3, 2, 1 and 0 radiologists from top to bottom). Columns correspond to candidates with similar probability as given by the FP reduction algorithm (lower right corner of each frame). The seven probability levels correspond to an FP/scan level of 1/8, 1/4, 1/2, 1, 2, 4 and 8 for nodules with agreement level 2 (right to left).

Table VI shows the CAD segmentation performance when compared to the segmentations by each radiologist. Overall, there is a statistically significant difference in the agreement between the CAD and the radiologist annotations and the agreement observed among the radiologists in both the LIDC-IDRI and the LNDb databases. Figure 5 shows examples of nodule segmentations by each radiologist and the automatic segmentation.

Comparison    MAD (mm)       HD (mm)        Jaccard
CAD vs R1     0.72 ± 0.66    3.57 ± 3.48    0.51 ± 0.18
CAD vs R2     0.63 ± 0.56    2.94 ± 2.78    0.50 ± 0.17
CAD vs R3     0.83 ± 0.78    3.80 ± 4.19    0.48 ± 0.18
CAD vs R4     0.70 ± 0.44    3.30 ± 2.81    0.45 ± 0.19
CAD vs R5     0.49 ± 0.36    2.46 ± 2.27    0.55 ± 0.17
CAD vs All    0.67 ± 0.57    3.21 ± 3.13    0.50 ± 0.18

TABLE VI: CAD nodule segmentation performance (mean ± standard deviation). Statistical significance of the difference in comparison to LIDC-IDRI and LNDb observer variability was tested at p<0.05 and p<0.01.
Fig. 5: Central axial view (51×51mm) of CAD segmentation examples (red) and ground truth annotations (green). Rows correspond to examples at the 5th percentile, mean and 95th percentile of MAD results obtained. Columns correspond to each radiologist. MAD obtained for each nodule shown at the lower right corner of each frame. Note that MAD is computed in 3D space whereas here only the central axial slice of each nodule is shown.

Table VII shows the volume and texture characterization CAD performance according to Fleischner guidelines. The agreement in volume Fleischner classes is similar to the agreement between radiologists in LIDC-IDRI and LNDb. However, the agreement in Fleischner texture classes is significantly lower than between radiologists. Figure 6 shows examples of nodule texture characterization by the CAD compared to radiologist annotations.

Comparison    Volume       Texture
CAD vs R1     0.77/0.72    0.79/0.61
CAD vs R2     0.80/0.74    0.82/0.67
CAD vs R3     0.69/0.65    0.76/0.52
CAD vs R4     0.77/0.74    0.76/0.54
CAD vs R5     0.90/0.84    0.74/0.15
CAD vs All    0.80/0.75    0.77/0.51

Each cell gives in-class agreement / Cohen’s kappa.

TABLE VII: CAD Fleischner volume and texture classification performance.
Fig. 6: Central axial view (51×51mm) of CAD texture characterization examples. Rows correspond to examples with equal texture class as given by the automatic algorithm. Different agreement levels in texture class by the annotating radiologists are illustrated. The texture class given by the automatic system is shown at the lower right corner of each frame whereas the texture classes given by the radiologists are shown at the lower left corner (lines correspond to multiple radiologists). S - Solid nodule, PS - Part-solid nodule, GGO - Ground glass opacity nodule.

Figure 7 shows the scanwise CAD performance according to Fleischner guidelines as well as the performance of each radiologist when considering the remaining radiologists as ground truth for each agreement level. Results are shown as a function of FPs/scan according to the FP reduction threshold used. It can be seen that for both agreement levels the average radiologist agreement is similar to that of CAD within the same FP/scan level. Note that for R4 and R5 only a limited number of CTs with agreement level 2 exist (11), which leads to the low scores obtained.

Fig. 7: CAD and individual radiologist scanwise performance according to Fleischner guidelines considering findings marked as a nodule with agreement level 1 and 2.

IV-D Collaborative Annotation Strategies

Figure 8 shows the detection performance of each annotation strategy. It can be seen that both radiologists have a superior performance to CAD but either of the collaborative approaches, using CAD or a second radiologist, gives a significant boost to performance. In terms of time expenditure, the radiologist+CAD combination is the most efficient as it achieves high sensitivity with a small time investment in comparison to single radiologist annotation. Table VIII shows the anatomical structures identified as nodules by the CAD but as FPs by R4 and R5.

Fig. 8: Nodule detection performance for R4, R5, CAD and collaborative strategies. Full lines indicate performance at different FP reduction threshold levels and dotted lines indicate performance at different attention threshold levels.

Anatomical structure                            Count (%)
Vascular structures (arteries, veins)           87 (52.4)
Lymph nodes                                     3 (1.8)
Airway structures (bronchi, bronchioli,         4 (2.4)
  bronchial wall, bronchiectasis, etc.)
Parenchymal features
    Atelectasis                                 27 (16.3)
    Reticulation (inter/intralobular septa)     8 (4.8)
    Fibrosis                                    2 (1.2)
    Ground glass opacities (nonspecific)        5 (3.0)
    No visible/identifiable structure           15 (9.0)
Extrapulmonary structures
    Bone                                        10 (6.0)
    Other                                       5 (3.0)

TABLE VIII: Anatomical structures identified as nodules by the CAD detection. Data are count (%).

Table IX shows the volume and texture characterization performance according to Fleischner guidelines in comparison to ground truth for each of the annotation strategies considered. For both the texture and volume Fleischner classes it can be seen that either of the collaborative strategies increases the agreement when compared to single radiologist or CAD annotations.

Strategy    Volume          Texture
R4          0.94    0.93    0.68    0.66
R5          0.90    0.75    0.94    0.64
CAD         0.98    0.93    0.71    0.55
R4+CAD      1.00    1.00    0.79    0.76
R4+R5       0.97    0.94    0.84    0.83
R5+CAD      0.92    0.75    0.95    0.91
R5+R4       0.92    0.75    0.95    0.91

TABLE IX: Volume and texture classification performance according to Fleischner guidelines for R4, R5, CAD and collaborative strategies.

Figure 9 shows the average scanwise performance of each annotation strategy according to Fleischner guidelines. It can be seen that, similarly to Figure 7, manual annotation by a radiologist has a performance similar to CAD. However, collaborative strategies show only incremental improvement in performance for R4 and no improvement for R5.

Fig. 9: Scanwise performance for R4, R5, CAD and collaborative strategies. Full and dotted lines indicate performance at different FP reduction thresholds and different attention thresholds respectively.

V Discussion

V-A LNDb Description

In this study, the collection, annotation and analysis of a clinical CT database for nodule detection, segmentation and characterization is presented. While extensive public databases for this purpose exist, in particular the LIDC-IDRI database, the collection of local datasets and annotations can reveal surprising aspects in variability in the image collection and annotation processes and patient population with significant impact on the performance of CAD systems. This is especially true for deep learning methods, which depend to a large extent on the nature of the data used for training.

In the LNDb database, the variability in the annotation process was particularly emphasised, as annotations were conducted solely in a single-blinded manner, which replicates more closely the clinical reality where images are analysed by a single radiologist. Furthermore, in contrast to LIDC-IDRI, where radiologists were instructed to focus on nodules ≥3mm, the main task for radiologists in LNDb was to find all nodules, independent of size. This has led to a more diverse dataset, composed of a higher proportion of nodules <3mm than in LIDC-IDRI, as shown in Figure 2. Accordingly, the characterization task was also performed for nodules <3mm, though this did not have an impact on the distribution of characteristics, as shown in Figure 2. Some characteristics present extremely imbalanced distributions on LIDC-IDRI and this trend was repeated on LNDb in spite of the fact that radiologists were not trained in nodule characterization through examples from LIDC-IDRI.

V-B Observer Variability

Overall, the single blind annotation protocol used to build LNDb has meant that observer variability was more accurately captured than in previous databases.

As shown in Table II, a smaller agreement in terms of nodule detection was obtained in comparison to LIDC-IDRI. This can be explained by the fact that LNDb annotations were obtained in a single-blinded fashion whereas for LIDC-IDRI each radiologist would review the initial annotation after comparison to the annotations of other radiologists. As such, the LIDC-IDRI detection agreement can be solely attributed to decision error, i.e. deciding if each finding is a nodule or not. On LNDb, however, the detection agreement compounds the decision and fixation errors, the latter relating to the process of actually finding a nodule in the 3D CT image. Furthermore, the higher proportion of nodules <3mm can increase decision error, as the small size of a finding can make the decision more difficult [9].

In terms of nodule segmentation, Table III shows that the agreement in LNDb is higher than on LIDC-IDRI in MAD and HD. The fact that Jaccard is lower in LNDb is likely related to the higher proportion of smaller nodules, which can often have very small Jaccard. Nevertheless, while statistically significant, the difference is small in magnitude and has no impact on the volume Fleischner class agreement.
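The sensitivity of the Jaccard index to nodule size can be illustrated with a small worked example. The sketch below is not the evaluation code of this study: it assumes idealised cubic nodules on a voxel grid (the helper names `cube` and `jaccard` are purely illustrative) and compares a segmentation with a copy of itself shifted by one voxel.

```python
from itertools import product


def cube(side, offset=0):
    """Return the voxel coordinates of a cube of given side, shifted along x by `offset`."""
    return {(x + offset, y, z) for x, y, z in product(range(side), repeat=3)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two voxel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# The same one-voxel disagreement applied to a small and a large nodule.
small = jaccard(cube(2), cube(2, offset=1))
large = jaccard(cube(10), cube(10, offset=1))
print(f"side 2: {small:.2f}, side 10: {large:.2f}")  # side 2: 0.33, side 10: 0.82
```

Under these assumptions, an identical boundary disagreement of one voxel yields a much lower Jaccard index for the small nodule, whereas a distance-based measure would be equally affected in both cases.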

Nodule characterisation agreement (Table IV) and texture Fleischner class agreement are also larger on LNDb. The exception of calcification is probably related to the classes in this feature and its non-ordinal nature (1-Popcorn, 2-Laminated, 3-Solid, 4-Non-central, 5-Central, 6-Absent). In discussion with the radiologists involved in this study, it was clear that, to their understanding, calcification classes 3 and 5 were almost equivalent, which may have led to the low agreement observed. This is corroborated in Figure 2, where it can be seen that R2 and R3 mostly choose class 3 whereas R4 mostly chooses class 5 and R1 and R5 choose a mix of the two classes. As such, future studies should focus more on the presence/absence of calcification rather than on the type.

Overall, the fact that radiologists tend to agree more in the segmentation and characterization of nodules can be due to the fact that they belong to the same institution, have similar training and are using the same annotation tools, which was not the case in LIDC-IDRI.

For patient follow-up agreement, a lower agreement was obtained in LNDb, likely due to the lower nodule detection agreement observed.

V-C Computer-Aided Annotation

Figure 3 shows the nodule detection performance of each radiologist and the CAD system when considering the remaining radiologists as ground truth for agreement levels 1 and 2. The degree of variability observed in LNDb is clearly expressed given the low sensitivity observed for radiologists when compared to previous studies [18]. Furthermore, it can be seen that the CAD has a comparatively poor performance, only achieving the average radiologist sensitivity at a relatively high FP/scan rate. While the high observer variability may have played a role, a performance closer to the average radiologist was expected, especially considering that an identical network was able to obtain a sensitivity of 0.926 at 0.25 FP/scan on a subset of LIDC-IDRI [1]. The main reasons for these results are probably related to the fact that the nodule detection network was trained exclusively on LIDC-IDRI. Firstly, only nodules ≥3mm were considered for training. The higher prevalence of smaller nodules on LNDb could thus have played a significant role in the decreased performance, given that the network was not trained particularly for these nodules. Secondly, and perhaps most importantly, while LIDC-IDRI is extensive, significant differences in the image acquisition or population characteristics may have led to a decreased performance. The slice thickness distribution, for example, is significantly different between the two databases, which may be detrimental to performance. This highlights the need for fine tuning of deep learning methods before their application to specific cases and populations, showing that simply increasing the size of the dataset is not always the path towards increasing performance and that fine tuning and the setting of the problem must be taken into account.
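The trade-off between sensitivity and FP/scan discussed above can be sketched as follows. This is a minimal illustration and not the evaluation code of this study: the function `operating_points`, the candidate format and the toy scores are all assumptions made for the example.

```python
def operating_points(candidates, n_nodules, n_scans, thresholds):
    """Return (threshold, sensitivity, FPs/scan) for each FP reduction threshold.

    `candidates` is a list of (score, nodule_id) pairs, where nodule_id is the
    ground-truth nodule hit by the candidate, or None for a false positive.
    """
    points = []
    for t in thresholds:
        kept = [(score, nid) for score, nid in candidates if score >= t]
        detected = {nid for _, nid in kept if nid is not None}
        false_pos = sum(1 for _, nid in kept if nid is None)
        points.append((t, len(detected) / n_nodules, false_pos / n_scans))
    return points


# Toy data (made up for illustration): 4 ground-truth nodules across 2 scans.
candidates = [(0.9, "n1"), (0.8, None), (0.7, "n2"),
              (0.4, "n3"), (0.3, None), (0.2, None)]
points = operating_points(candidates, n_nodules=4, n_scans=2,
                          thresholds=[0.25, 0.5, 0.75])
for t, sens, fps in points:
    print(f"threshold {t:.2f}: sensitivity {sens:.2f} at {fps:.1f} FP/scan")
```

In this toy case, lowering the threshold from 0.75 to 0.25 raises sensitivity from 0.25 to 0.75 but doubles the FP/scan rate from 0.5 to 1.0, which is the kind of behaviour summarised by the full lines in the detection figures.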

Table VIII gives an insight into what anatomical features are wrongly identified as nodules by the CAD. While the amount of vascular FPs is not surprising given the 2D similarity of solid nodules and vessels, it was expected that the 2.5D and 3D nature of the detection and FP reduction networks would contribute to a smaller proportion of vascular FPs. Future approaches should thus take this shortcoming into account and try to incorporate further 3D information and/or specific architecture to target vascular FPs.

In terms of nodule segmentation, a statistically significant difference between the CAD performance and the observer variability on the LNDb and LIDC-IDRI was found. However, the difference is not significantly detrimental for Fleischner volume classes as the average CAD performance of 0.80 is similar to the agreement between radiologists on both LNDb and LIDC-IDRI (0.82 and 0.81 respectively). Furthermore, in contrast to the behaviour observed for nodule detection, the segmentation network’s application to a different database did not have a detrimental effect as a Jaccard index of 0.48±0.19 was reported for the LIDC-IDRI database in [2].

A performance of 0.77 was obtained for texture characterization, which is inferior to the average observer agreement in LNDb and LIDC-IDRI. Nevertheless, as for nodule segmentation, this performance is similar to the reported performance for LIDC-IDRI (0.751±0.035) [8].

In terms of scanwise performance, Figure 7 shows that there seems to be no significant difference between the CAD and the average radiologist. While this is in stark contrast with the low CAD detection performance, one must take into account the fact that only the most suspicious nodules have an implication in terms of Fleischner score. As such, considering for example a patient with several nodules among which one is particularly large, it is sufficient for the CAD system to correctly detect and classify that nodule to obtain a correct Fleischner score. The fact that the FP/scan level with highest scanwise performance is relatively low illustrates exactly this point. A similar detection sensitivity to the radiologist is not crucial, as long as the most suspicious nodules are correctly detected with a low number of FPs.
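The argument that the scanwise score is driven by the most suspicious nodule can be made concrete with a deliberately simplified sketch. It is not the Fleischner classification used in this study: it assumes solid nodules only, uses the 100 mm³ and 250 mm³ volume thresholds of the Fleischner 2017 guidelines, and ignores nodule count, texture and patient risk. The function names and class numbering are hypothetical.

```python
def nodule_class(volume_mm3):
    """Simplified volume class of a solid nodule (assumed thresholds: 100 and 250 mm3)."""
    if volume_mm3 < 100:
        return 0
    if volume_mm3 <= 250:
        return 1
    return 2


def scan_class(volumes_mm3):
    """Scan-level class, taken here as the class of the most suspicious nodule."""
    return max((nodule_class(v) for v in volumes_mm3), default=0)


ground_truth = [30, 45, 80, 400]  # several small nodules and one large nodule
cad_findings = [400]              # the CAD misses every small nodule

sensitivity = len(set(cad_findings) & set(ground_truth)) / len(ground_truth)
print(sensitivity, scan_class(ground_truth), scan_class(cad_findings))  # 0.25 2 2
```

In this toy case the detection sensitivity is only 0.25, yet the scan-level class is identical to the one obtained from the complete set of nodules, since the single large nodule determines the outcome.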

Taking into account the performance obtained at each stage and framing it in the greater picture of Fleischner follow-up guidelines thus gives a clearer picture of the role CAD systems may have in CT lung cancer screening. It is shown that state-of-the-art CAD systems are able to have a performance comparable to the average radiologist. This is in spite of the fact that the nodule detection performance in particular was rather poor in this study. While it might be tempting to conclude that techniques such as fine tuning could lead to an improved CAD scanwise performance due to improved nodule detection, this might not be the case. In fact, the superior detection performance observed for the radiologists did not lead to a superior performance in follow-up. As such, and even though improvement of the overall nodule detection performance is crucial to improve radiologists’ trust in CAD systems, instead of focusing on marginal gains in detection performance, the community should focus on the accurate detection and characterization of those nodules known to be more associated with malignancy, namely solid and part solid nodules ≥100mm³. Nevertheless, detection of smaller nodules could come to play a more important role in follow-up scenarios, as changes in the size and characteristics of a nodule are a strong indicator of malignancy.

V-D Collaborative Annotation Strategies

Figure 8 shows the results obtained for nodule detection in individual and collaborative strategies for a subset of LNDb. As in Figure 3, it is shown that the CAD has a lower performance than the average radiologist. Furthermore, there is a significant difference between R4 and R5 in terms of sensitivity, with R5 identifying a larger proportion of nodules. Any of the collaborative strategies significantly improves performance, especially for R4, with double radiologist strategies being the most successful. However, given the poor performance of the CAD system in this particular setting, this was to be expected.

In terms of the time spent analysing the image, it can be seen that R5 takes approximately double the amount of time as R4, which justifies the increased sensitivity, as R5 analyses the image more carefully. When comparing the two collaborative approaches, it can be seen that having a second opinion from CAD can significantly boost performance without having a large impact on the overall time spent. A double radiologist strategy has obviously a greater penalty on time spent, with almost 6 minutes per CT, in comparison to over 2 minutes and over 4 minutes spent by R4 and R5 respectively when receiving suggestions from CAD. Interestingly, neither the FP reduction nor the eyetracking data seem to be overwhelmingly beneficial in excluding nodules that do not require revision by a radiologist. Nevertheless, an approach where radiologists only revise nodule candidates from CAD which had not been observed before (eyetracking time of 0s) seems of particular interest with a sensitivity improvement of 0.13 and 0.04 for R4 and R5, respectively, through the revision of 3.0 and 1.3 CAD findings per scan respectively.
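The revision strategy described above, in which only CAD candidates that were never looked at are shown to the radiologist, can be sketched as a simple filter. This is a minimal illustration under assumed data structures: the field names `score` and `gaze_s`, the function `candidates_to_review` and the default thresholds are hypothetical and do not correspond to the software used in this study.

```python
def candidates_to_review(cad_findings, min_score=0.5, max_gaze_s=0.0):
    """Select the CAD findings to be shown to the radiologist for revision.

    A finding is kept if its FP reduction score is at least `min_score` and the
    radiologist looked at its region for at most `max_gaze_s` seconds.
    """
    return [
        f for f in cad_findings
        if f["score"] >= min_score and f["gaze_s"] <= max_gaze_s
    ]


# Toy CAD output for one scan (values are made up for illustration).
findings = [
    {"id": "a", "score": 0.95, "gaze_s": 2.4},  # already inspected by the radiologist
    {"id": "b", "score": 0.80, "gaze_s": 0.0},  # never looked at: sent for revision
    {"id": "c", "score": 0.20, "gaze_s": 0.0},  # never looked at, but low CAD score
]
selected = [f["id"] for f in candidates_to_review(findings)]
print(selected)  # ['b']
```

Raising `max_gaze_s` or lowering `min_score` increases the number of findings to revise per scan, which corresponds to moving along the attention threshold and FP reduction threshold curves respectively.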

For segmentation and characterization, analysed in terms of the Fleischner classes, Table IX shows that for both volume and texture there is an improvement in accuracy in either of the collaborative strategies. However, for nodule volume, improvements are quite small in magnitude given the already high accuracy of both radiologists and CAD.

Looking at the scanwise classification in terms of Fleischner guidelines for follow-up, it can be seen that the CAD has a performance similar to individual radiologists. Interestingly, even though R4 has a much smaller sensitivity for nodule detection, it has a slightly higher performance on follow-up, once more highlighting the importance of finding the ‘right’ nodules, rather than finding all nodules. In regard to the collaborative strategies, while there are marginal performance gains for R4, these might be due to the low sample size of this experiment.

V-E Limitations

While the results of this study are promising, there are important limitations to be considered. First, regarding LNDb, though one of its strengths is the fact that it represents the clinical reality of a particular time and place, this is also a limitation as the results obtained might not be reproducible in other radiology departments. Furthermore, because the data was annotated in a single-blind manner, there is no absolute ground truth. While this could be improved through a revision of all annotations by additional radiologists, this did not fall within reasonable effort for the scope of this study. Of course, this has a strong impact on the results and their interpretation.

Secondly, regarding the CAD systems used, the aim of this study was not to conduct an extensive review of state-of-the-art systems in literature. The algorithms used were chosen as they were deemed representative of the overall trends in literature. While other methodologies could have given origin to different conclusions, this was outside the scope of this study.

Third, regarding the collaborative annotation strategies studied, the low number of CTs used in this experiment limited the conclusions that could be drawn. Furthermore, while in this study two obvious collaborative strategies considering the tools available were tested, other more complex strategies could be designed which could be more successful. While only the eyetracking time per region was considered in this study, additional data such as gaze patterns and fixation lengths could be extracted to provide further information. Furthermore, given the objective of the Fleischner guidelines, one could take this into account by suggesting only those findings that would change the Fleischner class if considered to be a nodule by the revising radiologist.

VI Conclusion

In conclusion, this study presents a novel database for research on several aspects of CT lung cancer screening: nodule detection, segmentation and characterization. Furthermore, the recording of the gaze patterns of radiologists when reading the images could hold important information useful for CAD and collaborative strategies.

By applying state-of-the-art detection, segmentation and characterization methods, it was shown that current CAD systems can classify a patient according to Fleischner follow-up guidelines as accurately as radiologists. Nevertheless, the training of deep learning methodologies for nodule detection can play a crucial role in performance and adaptation to the local characteristics in population, image acquisition, etc. is extremely important. Furthermore, within the three tasks, nodule detection was identified as the current biggest challenge, and thus the task where the biggest improvements can be made to obtain a better follow-up performance.

Finally, different collaborative strategies were tested in a subset of the data, showing that current CAD methodologies, even without fine tuning to a local reality, can be a valuable tool to increase sensitivity in nodule detection, without significantly increasing the burden for clinicians, especially if the collaboration between the two can be adequately designed. Nevertheless, this was not verified in Fleischner follow-up classification, where collaborative strategies did not lead to a significant improvement in comparison to individual radiologist annotation or state-of-the-art CAD systems.

Acknowledgment

This work was financed by the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness - COMPETE 2020 Programme and by National Funds through the Portuguese Funding agency, FCT - Fundação para a Ciência e Tecnologia within project PTDC/EEI-SII/6599/2014 (POCI-01-0145-FEDER-016673) and by the FCT grant contract SFRH/BD/120435/2016.

References

  1. G. Aresta, T. Araújo, C. Jacobs, B. van Ginneken, A. Cunha, I. Ramos and A. Campilho (2018) Towards an automatic lung cancer screening system in low dose computed tomography. In Image Analysis for Moving Organ, Breast, and Thoracic Images, pp. 310–318. Cited by: §I, §II-C, §V-C.
  2. G. Aresta, C. Jacobs, T. Araújo, A. Cunha, I. Ramos, B. van Ginneken and A. Campilho (2019) iW-Net: an automatic and minimalistic interactive lung nodule segmentation deep network. Scientific reports 9 (1), pp. 1–9. Cited by: §II-C, §V-C.
  3. S. G. Armato, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke and E. A. Hoffman (2011) The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical physics 38 (2), pp. 915–931. Cited by: §I, §II-A, §II-B.
  4. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox and O. Ronneberger (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Cited by: §II-C.
  5. Data Science Bowl 2017. Cited by: §I.
  6. J. Ding, A. Li, Z. Hu and L. Wang (2017) Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–567. Cited by: §I.
  7. C. A. Ferreira, G. Aresta, A. Cunha, A. M. Mendonça and A. Campilho (2019) Wide residual network for Lung-Rads screening referral. In 2019 IEEE 6th Portuguese Meeting on Bioengineering (ENBENG), pp. 1–4. Cited by: §I.
  8. C. A. Ferreira, A. Cunha, A. M. Mendonça and A. Campilho (2018) Convolutional neural network architectures for texture classification of pulmonary nodules. In Iberoamerican Congress on Pattern Recognition, pp. 783–791. Cited by: §I, §II-C, §V-C.
  9. D. S. Gierada, T. K. Pilgram, M. Ford, R. M. Fagerstrom, T. R. Church, H. Nath, K. Garg and D. C. Strollo (2008) Lung cancer: interobserver agreement on interpretation of pulmonary findings at low-dose CT screening. Radiology 246 (1), pp. 265–272. Cited by: §V-B.
  10. D. P. Huttenlocher, G. A. Klanderman and W. A. Rucklidge (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis & Machine Intelligence (9), pp. 850–863. Cited by: §III-A.
  11. M. Levandowsky and D. Winter (1971) Distance between sets. Nature 234 (5323), pp. 34. Cited by: §III-A.
  12. M. Machado, G. Aresta, P. Leitão, A. S. Carvalho, M. Rodrigues, I. Ramos, A. Cunha and A. Campilho (2018) Radiologists’ gaze characterization during lung nodule search in thoracic CT. In 2018 International Conference on Graphics and Interaction (ICGI), pp. 1–7. Cited by: §II-B.
  13. H. MacMahon, D. P. Naidich, J. M. Goo, K. S. Lee, A. N. Leung, J. R. Mayo, A. C. Mehta, Y. Ohno, C. A. Powell and M. Prokop (2017) Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 284 (1), pp. 228–243. Cited by: §III-A.
  14. M. F. McNitt-Gray, S. G. Armato III, C. R. Meyer, A. P. Reeves, G. McLennan, R. C. Pais, J. Freymann, M. S. Brown, R. M. Engelmann and P. H. Bland (2007) The lung image database consortium (LIDC) data collection process for nodule detection and annotation. Academic radiology 14 (12), pp. 1464–1474. Cited by: §I, §II-B.
  15. M. Millodot (2014) Dictionary of optometry and visual science e-book. Elsevier Health Sciences. Cited by: §II-B.
  16. J. Pedrosa, G. Aresta, J. Rebelo, E. Negrão, I. Ramos, A. Cunha and A. Campilho (2019) LNDetector: a flexible gaze characterisation collaborative platform for pulmonary nodule screening. In Mediterranean Conference on Medical and Biological Engineering and Computing, pp. 333–343. Cited by: §II-B.
  17. J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §II-C.
  18. A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. van den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci and B. Geurts (2017) Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical image analysis 42, pp. 1–13. Cited by: §I, §V-C.
  19. S. Shen, S. X. Han, D. R. Aberle, A. A. Bui and W. Hsu (2019) An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Systems with Applications 128, pp. 84–95. Cited by: §I.
  20. R. L. Siegel, K. D. Miller and A. Jemal (2019) Cancer statistics, 2019. CA: a cancer journal for clinicians. Cited by: §I.
  21. R. L. Spitzer, J. Cohen, J. L. Fleiss and J. Endicott (1967) Quantification of agreement in psychiatric diagnosis: a new approach. Archives of General Psychiatry 17 (1), pp. 83–87. Cited by: §III-A.
  22. The National Lung Screening Trial Research Team (2011) Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine 365 (5), pp. 395–409. Cited by: §I, §I.
  23. S. Wang, M. Zhou, Z. Liu, Z. Liu, D. Gu, Y. Zang, D. Dong, O. Gevaert and J. Tian (2017) Central focused convolutional neural networks: developing a data-driven model for lung nodule segmentation. Medical image analysis 40, pp. 172–183. Cited by: §I.
  24. B. Wu, Z. Zhou, J. Wang and Y. Wang (2018) Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1109–1113. Cited by: §I.
  25. A. Yaguchi, K. Aoyagi, A. Tanizawa and Y. Ohno (2019) 3D fully convolutional network-based segmentation of lung nodules in CT images with a clinically inspired data synthesis method. In Medical Imaging 2019: Computer-Aided Diagnosis, Vol. 10950, pp. 109503G. Cited by: §I.