Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database for Automated Image Interpretation
Despite tremendous progress in computer vision, machine learning has not yet been attempted on very large-scale medical image databases. We present an interleaved text/image deep learning system to extract and mine the semantic interactions of radiology images and reports from a national research hospital’s Picture Archiving and Communication System. With natural language processing, we mine a collection of 216K representative two-dimensional key images selected by clinicians for diagnostic reference, and match the images with their descriptions in an automated manner. Our system interleaves unsupervised and supervised learning on document- and sentence-level text collections, to generate semantic labels and to predict them given an image. Given an image of a patient scan, semantic topics at three levels are predicted, and associated key-words are generated. Also, a number of frequent disease types are detected as present or absent, to provide a more specific interpretation of a patient scan. This shows the potential of large-scale learning and prediction in the electronic patient records available in most modern clinical institutions.
Keywords: Deep learning, Convolutional Neural Networks, Topic Models, Natural Language Processing, Medical Imaging
The ImageNet Large Scale Visual Recognition Challenge by Deng et al. (2009) provides more than one million labeled images of 1,000 object categories. The accessibility of a huge amount of well-annotated image data in computer vision has rekindled deep convolutional neural networks (CNNs) as a premier learning tool for visual object class recognition tasks, as shown by Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Szegedy et al. (2014). Deep CNNs can perform significantly better than traditional shallow learning methods, but usually require much more training data, as was shown by Krizhevsky et al. (2012); Russakovsky et al. (2014). In the medical domain, however, there are no similar large-scale labeled image datasets available. On the other hand, large collections of radiology images and reports are stored in many modern hospitals’ Picture Archiving and Communication Systems (PACS). The invaluable semantic diagnostic knowledge inhabiting the mapping between hundreds of thousands of clinician-created high quality text reports and linked image volumes remains largely unexplored. One of our primary goals is to extract and associate radiology images with clinically semantic labels via interleaved text/image data mining and deep learning on a large-scale PACS database (780K imaging examinations). To the best of our knowledge, this is the first reported work performing automated mining and prediction on a hospital PACS database at a very large scale.
Radiology reports are text documents describing patient history, symptoms, image observations and impressions written by board-certified radiologists. However, the reports do not contain specific image labels on which a machine learning algorithm can be trained. Building the ImageNet database (Deng et al. (2009)) was mainly a manual process: harvesting images returned from the Google image search engine according to the WordNet (Miller (1995)) ontology hierarchy and pruning falsely tagged images using crowd-sourcing such as Amazon Mechanical Turk (AMT). This does not meet our data collection and labeling needs, due to the difficulty of medical annotation tasks and data privacy concerns. Thus we first propose to mine categorical semantic labels using a non-parametric topic modeling method, latent Dirichlet allocation (LDA) by Blei et al. (2003), to provide a semantic interpretation of a patient image in three levels. While this provides a first-level interpretation of a patient image, labeling based on categorization can be nonspecific. To alleviate the issue of non-specificity, we further mined specific disease words in the reports mentioning the images. Feed-forward CNNs were then trained to predict the presence/absence of the specific disease categories.
Our work has been inspired by the works of Deng et al. (2009); Russakovsky et al. (2014) building very large-scale image databases and the works establishing semantic connections of texts and images by Kulkarni et al. (2013). Note that there has not yet been much comparable development on large-scale medical imaging interpretation. Kulkarni et al. (2013) have spearheaded the efforts of learning the semantic connections between image contents and the sentences describing them, such as image captions. Detecting objects of interest, attributes and prepositions and applying contextual regularization with a conditional random field (CRF) is a feasible approach as shown by Kulkarni et al. (2013), and many useful tools for image annotation using it are available in computer vision.
Both deep feed-forward CNNs of Krizhevsky et al. (2012); Simonyan and Zisserman (2014) and recurrent neural networks of Mikolov et al. (2013a, b) were used to model image and text. Also, the CNN parameters pre-trained on ImageNet were used to initialize the CNNs, which were then adapted for medical image analysis. We show the benefit of this transfer learning and domain adaptation in Section 4.2. The fact that deep learning requires no hand-crafted image features is very desirable, since significant adaptation would be needed to apply conventional image features (e.g., HOG, SIFT) to the wide variety of medical images. The large-scale datasets of extracted key images and their categorization, vector labels, and describing sentences can be harnessed to alleviate deep learning’s “data-hungry” challenge in the medical domain. We will make our code and trained deep text/image models publicly available upon acceptance.
1.1 Related Work
The ImageCLEF medical image annotation tasks of 2005-2007 by Deselaers and Ney (2008) have 9,000 training and 1,000 testing two-dimensional images, converted to pixel thumbnails with labels. Local image descriptors and intensity histograms are used as a bag-of-features approach in that work for this scene-recognition-like problem. Unsupervised LDA based matching from lung disease words (e.g., fibrosis, emphysema) to two-dimensional image blocks from axial CT chest scans of 24 patients is studied by Carrivick et al. (2005). The works of Barnard et al. (2003); Blei and Jordan (2003), which use generative models combining words and images under a very limited word/image vocabulary, have also motivated this study.
The most related works are by Socher et al. (2013); Frome et al. (2013) where they first map words into vector space using recurrent neural networks and then project images into the label-associated word-vector embeddings by minimizing the (Socher et al. (2013)) or hinge rank losses (Frome et al. (2013)) between the visual and label manifolds. The language model is trained on the texts of Wikipedia and tested on label-associated images from the CIFAR (Krizhevsky and Hinton (2009); Socher et al. (2013)) and ImageNet datasets (Deng et al. (2009); Frome et al. (2013)). Image-to-language correspondence was learned from ImageNet dataset and reasonably high quality image description datasets (Pascal1K (Rashtchian et al. (2010)), Flickr8K (Hodosh et al. (2013)), Flickr30K (Young et al. (2014))) by Karpathy et al. (2014), where such caption datasets are not available in the medical domain.
Graphical models have been employed to predict image attributes by Lampert et al. (2014); Scheirer et al. (2012), or to describe images by Kulkarni et al. (2013) using manually annotated datasets. Automatic label mining on large, unlabeled datasets is presented by Ordonez et al. (2011); Jaderberg et al. (2014); however, the variety of the label-space is limited to image text annotations. We analyze and mine the medical image semantics on both document and sentence levels, and deep CNNs of Jaderberg et al. (2014); Simonyan and Zisserman (2014) are adapted to learn them from image contents.
To gain the most comprehensive interpretation of diagnostic semantics, we use all available radiology reports of around 780K imaging examinations, stored in the PACS of the National Institutes of Health Clinical Center since the year 2000. Around 216K key two-dimensional image slices are studied here, instead of all three-dimensional image volumes. Within three-dimensional patient scans, most of the imaging information represents normal anatomy, which is often not the focus of the radiology reports. The two-dimensional “key images” that radiologists reference manually (Figure 1) while writing radiology reports provide a visual reference to pathologies or other notable findings. Therefore, the two-dimensional key images are more correlated with the diagnostic semantics in the reports than the whole three-dimensional scans, but not all reports have referenced key images ( images from about unique patients). Table 1 provides some statistics of the extracted database, and Table 2 shows examples of the most frequently occurring words in the radiology reports collected. Leveraging the deep learning models developed in this paper will make it possible to automatically select key images from three-dimensional patient scans to avoid mis-referencing.
Finding and extracting key images from radiology reports is done by natural language processing (NLP), i.e., by finding a sentence mentioning a referenced image. For example, “There may be mild fat stranding of the right parapharyngeal soft tissues (series 1001, image 32)” is listed in Figure 1. The NLP steps are sentence tokenization, word/number matching and stemming, and rule-based information extraction (e.g., translating “image 1013-78” to “images 1013-1078”). The software package of Bird et al. (2009) was used for the basic NLP pipelines. A total of 187K images could be retrieved and matched this way, whereas the remaining 28K key images were extracted according to their reference accession numbers in PACS.
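The rule-based extraction step can be illustrated with a small sketch. The regular expression and the helper names (`REF_PATTERN`, `expand_range`, `extract_image_refs`) are our own assumptions for illustration and not the rules actually used in the pipeline; the sketch only covers the two reference styles quoted above.

```python
import re

# Hypothetical pattern (an assumption, not the authors' rule set): matches
# "series 1001, image 32" as well as abbreviated ranges such as "image 1013-78".
REF_PATTERN = re.compile(
    r"(?:series\s+(?P<series>\d+)\s*,\s*)?"
    r"images?\s+(?P<first>\d+)(?:\s*-\s*(?P<last>\d+))?",
    re.IGNORECASE,
)


def expand_range(first: str, last: str) -> tuple:
    """Expand an abbreviated range end, e.g. ("1013", "78") -> (1013, 1078)."""
    if len(last) < len(first):
        # Borrow the missing leading digits from the first image number.
        last = first[: len(first) - len(last)] + last
    return int(first), int(last)


def extract_image_refs(sentence: str) -> list:
    """Return one dict per image reference found in the sentence."""
    refs = []
    for match in REF_PATTERN.finditer(sentence):
        first = match.group("first")
        last = match.group("last") or first
        lo, hi = expand_range(first, last)
        series = match.group("series")
        refs.append({
            "series": int(series) if series else None,
            "images": list(range(lo, hi + 1)),
        })
    return refs
```

Applied to the example sentence from Figure 1, `extract_image_refs` returns a single reference to series 1001, image 32; applied to “image 1013-78” it returns the expanded range 1013 to 1078.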
|total number of||# words in documents||# image modalities|
|# words||1 billion||max||1502||PET||67|
3 Document Topic Learning with Latent Dirichlet Allocation
We propose to mine image categorization labels using the non-parametric topic-modeling algorithm of Blei et al. (2003) on the ~780K radiology text reports in PACS. Unlike the images of ImageNet (Deng et al. (2009); Russakovsky et al. (2014)), which often have a dominant object appearing in the center, our key images are mostly CT and MRI slices showing several organs, usually with pathologies. There is a high degree of intrinsic ambiguity in defining and assigning a semantic label set to images, even for experienced clinicians. Our hypothesis is that the large collection of sub-million radiology reports statistically defines the categories meaningful for topic-mining and visual correspondence learning for these topics.
Latent Dirichlet Allocation (LDA) was originally proposed by Blei et al. (2003) to find latent topic models for a collection of text documents such as newspaper articles. There are some other popular methods for document topic modeling, such as Probabilistic Latent Semantic Analysis (pLSA) by Hofmann (1999) and Non-negative Matrix Factorization (NMF) by Lee and Seung (1999). We choose LDA for extracting latent topic labels among radiology report documents, because LDA is shown to be more flexible yet learns more coherent topics over large sets of documents, as was studied by Stevens et al. (2012). Furthermore, pLSA can be regarded as a special case of LDA (Girolami and Kabán (2003)) and NMF as a semi-equivalent model of pLSA (Gaussier and Goutte (2005); Ding et al. (2006)).
LDA offers a hierarchy of extracted topics, and the number of topics can be chosen by evaluating each model’s perplexity score (Equation 1), which is a common way to measure how well a probabilistic model generalizes by evaluating the log-likelihood of the model on a held-out test set. For an unseen document set $D_{test}$, the perplexity score is defined as in Equation 1, where $M$ is the number of documents in the test set, $\mathbf{w}_d$ the words in the unseen document $d$, $N_d$ the number of words in document $d$, with $\Phi$ the topic matrix, and $\alpha$ the hyper-parameter for topic distribution of the documents.

$$\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \Phi, \alpha)}{\sum_{d=1}^{M} N_d} \right\} \qquad (1)$$
A lower perplexity score generally implies a better fit of the model for a given document set (Blei et al. (2003)).
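As a minimal sketch of how the perplexity score is computed from held-out documents, the function below exponentiates the negative total log-likelihood divided by the total number of words. The function name and the assumption that per-document log-likelihoods are already available from a fitted topic model are ours.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity: exp(-sum_d log p(w_d) / sum_d N_d).

    doc_log_likelihoods: natural-log likelihood of each held-out document
        under the fitted model (assumed to be computed elsewhere).
    doc_lengths: number of words N_d in each held-out document.
    """
    if len(doc_log_likelihoods) != len(doc_lengths):
        raise ValueError("one log-likelihood is needed per document")
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("the held-out set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)
```

A useful sanity check is that a model assigning uniform probability over a vocabulary of V words has perplexity exactly V, so lower values indicate that the model concentrates probability on the words that actually occur.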
Perplexity scores are evaluated with 80% of the total documents used for training and 20% used for testing; based on these scores, 80 topics are chosen for the document-level model (Figure 2). Although the document distribution in the topic space is approximately balanced, the distribution of image counts over the topics is much less balanced (Figure 3). Specifically, one topic (non-primary metastasis spreading across a variety of body parts) contains nearly half of the 216K key images. To address this data bias, sub-topics are obtained for each of the first-level (document-level) topics, resulting in 800 topics, where the number of sub-topics is also chosen based on the average perplexity scores evaluated on each document-level topic. Lastly, to compare using the whole report against using only the sentences directly describing the key images for latent topic mining, sentence-level LDA topics are obtained based on three sentences only: the sentence mentioning the key image (Figure 1) and its adjacent sentences as proximal context. For the sentence-level model, the perplexity scores keep decreasing with an increasing number of topics; we choose the topic count to be 1000, as the rate of the perplexity score decrease is very small beyond that point (Figure 2).
We observe that LDA-generated image categorization labels are valid, demonstrating good semantic coherence among clinician observers. The lists of key words and sampled images per topic label were submitted to a board-certified radiologist for review and validation. Some examples of document-level topics with their corresponding images and topic key words are shown in Figure 4. Based on the radiologist’s review, our LDA topics discover semantics at different levels. There are 73 low-level concepts, for example pathology examinations of certain body regions and organs: sinus diseases; lesions of solid abdominal organs, primarily kidney; pulmonary diseases; brain MRI; renal diseases on mixed imaging modalities; brain tumors. There are 7 mid- to high-level concepts, such as: non-primary metastasis spreading across a variety of body parts; cases with high diagnosis uncertainty or equivocation; indeterminate lesions; instrumentation artifacts limiting interpretation. Low-level topic images tend to be visually more coherent than the higher-level topic images.
High-level topics may be analogous to the high-level visual concepts in natural images, as was studied by Kiapour et al. (2014); Ordonez and Berg (2014). About half of the key images are associated with the metastasis topic, implying that the clinicians’ image referencing behavior focuses heavily on metastatic patients. Sub-topics of this document-level topic are sub-categories of metastatic disease, for example: abdominal mass; bulky tumor; multifocal metastatic disease; liver tumor. Meanwhile, some of the document-level sub-topics do not seem very focused. Many of the sentence-level topics have valid semantics too, e.g. ‘renal imaging’, ‘musculoskeletal imaging’, ‘chest port catheter’, ‘chest imaging with disease or pathology’, and ‘degenerative disease in bone’.
We also obtained LDA topics on only the reports having associated images, resulting in 20 topics according to the perplexity score. However, these did not add any meaningful semantics beyond the topics already obtained at three levels, so we did not include them. For more details and the image-topic associations, refer to Figures 4, 5, and the supplementary material. Even though LDA labels are computed with text information only, we next investigate the plausibility of mapping images to the topic labels of different levels via deep CNN models.
4 Image to Document Topic Mapping with Deep Convolutional Neural Networks
For each level of topics discussed in Section 3, we train deep CNNs to map the images into document categories using the Caffe framework of Jia et al. (2014). We split our whole key image dataset as follows: 85% is used as the training dataset, 5% as the cross-validation (CV) dataset, and 10% as the test dataset. If a topic has too few images to be divided into training/CV/test sets for deep CNN learning, that topic is excluded from the CNN training. These cases are normally topics of rare imaging protocols, for example abdominal ultrasound and DEXA scans of different usages. In total, 60 topics were used for the document-level topic mapping, 385 for the document-level sub-topic mapping, and 717 for the sentence-level mapping. Schematic diagrams showing how each level of semantic topics is learned, assigned to images, and trained to map from images to topics are shown in Figure 6.
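The split and the exclusion of under-populated topics can be sketched as follows. The threshold `min_images`, the per-topic (stratified) splitting, and the function name `split_by_topic` are assumptions made for illustration; the text only states the 85%/5%/10% proportions and that topics with too few images are dropped.

```python
import random
from collections import defaultdict


def split_by_topic(image_topics, min_images=20, seed=0):
    """Split images 85/5/10 into train/CV/test within each topic.

    image_topics: dict mapping image id -> topic id.
    min_images: hypothetical cut-off below which a topic is excluded.
    Returns (splits, dropped_topics).
    """
    by_topic = defaultdict(list)
    for image_id, topic in image_topics.items():
        by_topic[topic].append(image_id)

    rng = random.Random(seed)
    splits = {"train": [], "cv": [], "test": []}
    dropped = []
    for topic, ids in sorted(by_topic.items()):
        if len(ids) < min_images:
            dropped.append(topic)  # too few images to divide three ways
            continue
        ids = sorted(ids)
        rng.shuffle(ids)
        n_cv = max(1, round(0.05 * len(ids)))
        n_test = max(1, round(0.10 * len(ids)))
        splits["cv"] += ids[:n_cv]
        splits["test"] += ids[n_cv:n_cv + n_test]
        splits["train"] += ids[n_cv + n_test:]
    return splits, dropped
```

Splitting within each topic keeps every retained class represented in all three subsets, which is one reasonable way to read the requirement that a topic must be divisible into training/CV/test sets.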
All our CNN network settings are similar to or the same as the ImageNet Challenge “AlexNet” (Krizhevsky et al. (2012)), and “VGG-16 & 19” (Simonyan and Zisserman (2014)) models. For “AlexNet” we used the Caffe reference network of Jia et al. (2014), which is a slight modification to the “AlexNet” by Krizhevsky et al. (2012). The AlexNet model by Krizhevsky et al. (2012) has about 60 million parameters (650,000 neurons) and consists of five convolutional layers (1st, 2nd and 5th followed by max-pooling layers), and three fully-connected (FC) layers with a final classification layer. The VGG variations of CNN models by Simonyan and Zisserman (2014) are significantly deeper, having 16 to 19 weight layers and 133 to 144 million parameters. The top-5 error rates on the ImageNet dataset of these models are AlexNet: 15.3% (Krizhevsky et al. (2012)); VGG-16: 7.4%; and VGG-19: 7.3% (Simonyan and Zisserman (2014)), respectively.
For image-to-topic mapping, we change the number of output nodes in the last softmax classification layer to 60, 385 and 717 for the document-level topics, document-level sub-topics, and sentence-level topics, respectively. The networks for first-level semantic labels are fine-tuned from the pre-trained ImageNet models, whereas the networks for the lower-level semantic labels are fine-tuned from the models of the higher-level semantic labels.
4.2 Transfer Learning and Domain Adaptation
We found that transfer learning from the ImageNet pre-trained CNN parameters on natural images to our medical image modalities (mostly CT, MRI) significantly helps the image classification performance. Additionally, transfer learning from a CNN trained for a more related task (e.g. from CNN trained on image-to-document-level-topic models to train CNN for image-to-document-level-sub-topic model) was found to be more effective than from a CNN trained for a less related task (e.g. from CNN trained on ImageNet to train CNN for image-to-document-level-sub-topic model). Examples of classification accuracy traces during training using CNNs from random initialization, transfer learning from CNN trained on ImageNet, and transfer learning from higher level image-to-topic model to lower level image-to-topic models are shown in Figure 7. Similar findings that deep CNN features can be generalized across different image modalities have been reported by Gupta et al. (2014, 2013), but are empirically verified with only much smaller datasets than ours. Our key image dataset is about one fifth the size of ImageNet (Russakovsky et al. (2014)) and is the largest annotated medical image dataset to date.
From Figure 7 we can see that: (1) CNN testing accuracy increases from 0% to 50+% quickly in about 1600 iterations, due to the unbalanced data distribution among classes at the document level; (2) a more complex, deeper CNN model (VGG-Net) performs better than the model which already is a good benchmark (AlexNet), but only when starting from a good initialization (i.e., pre-training via ImageNet models); (3) fine-tuning from a CNN model trained on a more closely related task is even better than fine-tuning from a model trained on a less related task (alexnet_tp80_h2_start_tp80h1 versus alexnet_tp80_h2_start_imagenet).
With these findings, we train our CNN models with transfer learning by default for the remaining parts of our study. All the CNN layers except the newly modified ones are initialized with the weights of a previously trained related model, and trained on the new task with a low learning rate of 0.001. The modified layers with the new number of classes are initialized randomly and trained with a higher learning rate of 0.01. All the key images are resampled to a spatial resolution of pixels. Then we follow the approach of Simonyan and Zisserman (2014) to crop the input images from to for training.
4.3 Classification Results and Discussion
We would expect the level of difficulty in learning and classifying the images into the LDA-induced topics to be different for each semantic level. Low-level semantic classes can have key images of axial/sagittal/coronal slices with position variations and across MRI/CT modalities. Mid- to high-level concepts all demonstrate much larger within-class variations in their visual appearance, since they are diseases occurring within different organs and are only coherent at the level of high-level semantics. Table 3 provides the top-1 and top-5 test classification accuracies for each level of topic models using AlexNet (Krizhevsky et al. (2012)), and VGG-16&19 (Simonyan and Zisserman (2014)) based deep CNN models.
All top-5 accuracy scores are significantly higher than top-1 values, e.g. increasing from to using VGG-19, or to via AlexNet in document-level. This indicates that the classification errors or fusions are not uniformly distributed among other false classes. Latent “blocky subspace of classes” may exist in our discovered label space, where several topic classes form a tightly correlated subgroup. The confusion matrices in Figure 8 verify this finding.
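For reference, top-1 and top-5 accuracy can be computed from per-class scores as in the sketch below. The function name and the list-of-lists input format are our assumptions; any framework that exposes the softmax outputs could supply the scores.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per image.
    labels: list of true class indices, one per image.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

A large gap between the two values means that, when the top prediction is wrong, the true topic is usually still among a handful of closely competing classes, which is what the block structure in the confusion matrices suggests.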
It is shown that the deeper models (VGG-16&19) perform consistently better than the shallower 8-layer model (AlexNet) in classification accuracy, especially for document-level sub-topics. While the images of some topic categories and some body parts are easily distinguishable as shown in Figure 4, the visual differences in abdominal parts are rather subtle as in Figure 5. Distinguishing the subtleties and high-level concept categories in the images could benefit from a more complex model so that the model can handle these subtleties.
It is also noticeable that deeper models require significantly more computational resources and time to train than the shallower model. Table 4 shows the memory consumption and time required to train the CNN models for the image-to-sentence-level-topic model with up to 70,000 iterations using the NVIDIA Tesla K40 GPU. Comparing VGG-16 and VGG-19, however, the three additional convolutional layers seem to raise the top-5 accuracies by only a small amount (2%), which is consistent with the results reported by Simonyan and Zisserman (2014) for the object recognition task on the ImageNet dataset.
Compared with the ImageNet 2014 results, top-1 error rates are moderately higher ( versus ) and top-5 test errors () are comparable. In summary, our quantitative results are very encouraging, but there also exist some uncertainties in annotations because labels stem from an unsupervised learning algorithm. Multi-level semantic concepts show good image learnability by deep CNN models which sheds light on the feasibility of automatically parsing very large-scale radiology image databases.
|AlexNet 8-layers||VGG 16-layers||VGG 19-layers|
| ||AlexNet 8-layers||VGG 16-layers||VGG 19-layers|
|time||4 hours 35 mins||3 days 2 hours||4 days 40 mins|
|memory||1.4 GBytes||10 GBytes||11 GBytes|
5 Generating Image-to-Text Description
The image-to-topic mapping in Section 4 is a promising first step towards automated interpretation of medical images at large scale. However, it is too expensive and time consuming for radiologists to examine all of the 1880 (80 + 800 + 1000) generated topics with their keywords and images. In addition, the key words in the topics can convey the semantic contents of a given image more specifically than a topic label alone. We therefore propose to generate relevant key-word text descriptions similar to Kulkarni et al. (2013), using deep language/image CNN models.
5.1 Word-to-Vector Modeling and Removing Word-Level Ambiguity
In radiology reports, there exist many recurring word morphisms in text identification, e.g., [mr, mri, t1-/t2-weighted] (natural language expressions for imaging modalities of magnetic resonance imaging (MRI)), [cyst, cystic, cysts], [tumor, tumour, tumors, metastasis, metastatic], etc. We train a deep word-to-vector model of Mikolov et al. (2013c, b, a) to address this word-level labeling space ambiguity, while also transforming the words to vectors. A total of 1.2 billion words from our radiology reports as well as from biomedical research articles obtained from OpenI (http://openi.nlm.nih.gov) are used. Words with similar meaning are mapped or projected to closer locations in the vector space than dissimilar ones. An example visualization of the word vectors on a two-dimensional space using principal component analysis is shown in Figure 9.
A skip-gram model of Mikolov et al. (2013a, b) is employed with the mapping vector dimension of per word, trained using the hierarchical softmax cost function, a sliding-window size of 10, and frequent words sub-sampled at a frequency of 0.01. It is found that adding a more diverse set of related documents, such as OpenI biomedical research articles, helps the model learn a better vector representation while keeping all the hyper-parameters the same. Similar findings on unsupervised feature learning models, that robust features can be learned from a slightly noisy and diverse set of input, were reported by Vincent et al. (2010, 2008); Shin et al. (2013). Some examples of query words and their corresponding closest words in terms of cosine similarity for the word-to-vector models (Mikolov et al. (2013c)), trained on radiology reports only (total of ~1 billion words) and with additional OpenI articles (total of 1.2 billion words), are shown in Figure 10.
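The closest-word queries can be sketched with plain cosine similarity over a table of word vectors. The two-dimensional toy vectors in the usage note are invented purely for illustration and are not taken from the trained model; the function names are likewise our own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, top_n=3):
    """Return the top_n words closest to `query` by cosine similarity."""
    q = vectors[query]
    scored = [(cosine(q, v), w) for w, v in vectors.items() if w != query]
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_n]]


# Toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "cyst": [1.0, 0.1],
    "cystic": [0.9, 0.2],
    "tumor": [0.1, 1.0],
    "tumour": [0.2, 0.9],
}
```

With these toy values, `nearest_words("cyst", toy_vectors, top_n=1)` returns `["cystic"]`, mirroring how morphological variants are expected to fall close together in the learned space.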
5.2 Image-to-Description Relation Mining and Matching
The sentence referring to a key image and its adjacent sentences may contain a variety of words, but we are mostly interested in the disease-related terms which are highly correlated to diagnostic semantics. To obtain only the disease-related terms, we exploit the human disease terms and their synonyms from the Disease-Ontology (DO; Schriml et al. (2012)), a collection of 8,707 unique disease-related terms. While the sentences referring to an image and their adjacent sentences have 50.08 words on average, the number of disease-related terms in the three consecutive sentences is on average with a standard deviation of . Therefore we chose to use bi-grams for the image descriptions, to achieve a good trade-off between the medium level complexity without neglecting too many text-image pairs. Some statistics about the number of words in the documents are shown in Table 5.
|image references, no stopwords no digits||13.46||11||9.94||143||2|
|image references, disease terms only||5.17||4||2.52||25||1|
Bi-gram disease terms are extracted so that we can train a deep CNN model in Section 5.3 to predict the vector/word-level image representation of . If multiple bi-grams can be extracted per image from the sentence referring to the image and the two adjacent ones, the image is trained as many times as the number of bi-grams with different target vectors (). If a disease term cannot form a bi-gram, the term is ignored; the process is illustrated in Figure 11. This is a challenging weakly annotated learning problem that uses the referring sentences as labels. The bi-grams of DO disease-related terms in the vector representation of are somewhat analogous to the work of Kulkarni et al. (2013) detecting multiple objects of interest and describing their spatial configurations in the image caption. A deep regression CNN model is employed here to map an image to a continuous output word-vector space. The resulting bi-gram vector can be matched against a reference disease-related vocabulary in the word-vector space using cosine similarity.
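A minimal sketch of the bi-gram target construction is given below. How disease terms are paired is not spelled out in the text; here we assume non-overlapping pairs of consecutive disease terms in reading order, so that a leftover term is dropped. The concatenation of the two word vectors into one target follows the description of the output vector as having a first and a second half. All function names are hypothetical.

```python
def disease_bigrams(tokens, disease_vocab):
    """Pair consecutive disease terms into non-overlapping bi-grams.

    Assumption: terms are paired in reading order; a trailing term that
    cannot form a bi-gram is ignored.
    """
    terms = [t for t in tokens if t in disease_vocab]
    return [(terms[i], terms[i + 1]) for i in range(0, len(terms) - 1, 2)]


def bigram_targets(tokens, disease_vocab, word_vectors):
    """Build one regression target per bi-gram by concatenating word vectors.

    An image whose sentences yield several bi-grams gets several targets,
    i.e. it is presented to the network once per target.
    """
    targets = []
    for first, second in disease_bigrams(tokens, disease_vocab):
        if first in word_vectors and second in word_vectors:
            targets.append(list(word_vectors[first]) + list(word_vectors[second]))
    return targets
```

For a token list containing the disease terms cyst, metastasis and tumor in that order, the sketch yields the single bi-gram (cyst, metastasis) and ignores tumor.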
5.3 Image-to-Words Deep CNN Regression
It has been shown by Sutskever et al. (2014) that deep recurrent neural networks (RNN) can learn the language representation for machine translation. To learn the image-to-text representation, we map the images to the vectors of word sequences describing the image. This can be formulated as a regression CNN, replacing the softmax cost in Section 4 with the cross-entropy cost function for the last output layer of VGG-19 CNN model (Simonyan and Zisserman (2014)):
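$$E = -\frac{1}{N} \sum_{n=1}^{N} \left[ g(z_n) \log g(z'_n) + \left(1 - g(z_n)\right) \log\left(1 - g(z'_n)\right) \right]$$

where $z_n$ or $z'_n$ is any uni-element of the target word vectors $Z_n$ or optimized output vectors $Z'_n$, $g(x)$ is the sigmoid function ($g(x) = 1/(1+e^{-x})$), and $N$ is the number of samples in the database.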
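The regression cost can be sketched as a sigmoid cross-entropy between target and output vectors. This is our reading of the cost described above: both the target elements and the raw network outputs are passed through the sigmoid, the element-wise terms are summed per sample, and the sum is averaged over samples. The function names and that normalization are assumptions.

```python
import math


def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy(targets, outputs):
    """Average sigmoid cross-entropy over samples.

    targets, outputs: lists of equal-length vectors (raw, pre-sigmoid values).
    Note: this plain form is not numerically stable for very large
    magnitudes; it is meant only to show the structure of the cost.
    """
    total = 0.0
    for target_vec, output_vec in zip(targets, outputs):
        for z, z_out in zip(target_vec, output_vec):
            p = sigmoid(z)      # squashed target element
            q = sigmoid(z_out)  # squashed network output element
            total -= p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total / len(targets)
```

For a fixed target, the cost is smallest when the output equals the target, which is the property that lets the network be trained to regress onto the word-vector space.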
We adopt the CNN model of Simonyan and Zisserman (2014) for the image-to-text representation since it works consistently better than the other relatively simpler model of Krizhevsky et al. (2012) in our image-to-topic mapping tasks. We fine-tune the parameters of the CNNs for predicting the topic-level labels in Section 4 with the modified cost function, to model the image-to-text representation instead of classifying images into categories. The newly modified output layer has nodes for bi-grams as nodes for each word in a bi-gram.
5.4 Key-Word Generation from Images and Discussion
For any key image in testing, first we predict its topics of three levels (document-level, document-level sub-topics, sentence-level) using the three deep CNN models of Simonyan and Zisserman (2014) in Section 4. Top 50 key-words in each LDA document-topics are mapped into the word-to-vector space of multivariate variables in (Section 5.1). Then, the image is mapped to a output vector using the bi-gram CNN model in Section 5.3. Lastly, we match each of the topic key-word vectors of against the first and second half of the output vector using cosine similarity. The closest key-words at three levels of topics (with the highest cosine similarity against either of the bi-gram words) are kept per image.
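The final matching step can be sketched as follows: the predicted bi-gram vector is split into its two halves, and each candidate key-word vector is scored by its best cosine similarity against either half. The function names and the toy values in the test are ours; in practice the candidates would be the top key-words of the predicted topic at each of the three levels, and the function would be called once per level.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_keyword(output_vector, keyword_vectors):
    """Return (word, similarity) of the key-word closest to either bi-gram half.

    output_vector: predicted bi-gram vector (two word vectors concatenated).
    keyword_vectors: dict mapping candidate key-word -> word vector.
    """
    half = len(output_vector) // 2
    halves = (output_vector[:half], output_vector[half:])
    best_word, best_sim = None, float("-inf")
    for word, vec in keyword_vectors.items():
        sim = max(cosine(vec, h) for h in halves)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim
```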
The rate of predicted disease-related words matching the actual words in the report sentences (recall-at-K, K=1 (R1 score)) was 0.56. Two examples of key-word generation are shown in Figure 12, with three key-words from three categorization levels per image. We only report R1 score on disease-related words compared to the previous works of Karpathy et al. (2014); Frome et al. (2013), where they report from R1 up to R20 on the entire image caption words (e.g. R1=0.16 on Flickr30K dataset by Karpathy et al. (2014)). As we used NLP to parse and extract image-describing sentences from the whole radiology reports, our ground-truth image-to-text associations are much noisier than the caption dataset used by Frome et al. (2013); Karpathy et al. (2014). Also for that reason, our generated image-to-text associations are not as exact as the generated descriptions by Frome et al. (2013); Karpathy et al. (2014).
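Recall-at-K can be computed as in the sketch below, under the assumption that a test image counts as a hit when any of its K highest-ranked predicted words appears among the disease-related words of its report sentences. The function name and the input format are hypothetical.

```python
def recall_at_k(ranked_predictions, true_words, k=1):
    """Fraction of images with at least one true word among the top-k predictions.

    ranked_predictions: list of word lists, best prediction first, one per image.
    true_words: list of collections of ground-truth words, one per image.
    """
    if not ranked_predictions:
        raise ValueError("no predictions given")
    hits = 0
    for predicted, truth in zip(ranked_predictions, true_words):
        if set(predicted[:k]) & set(truth):
            hits += 1
    return hits / len(ranked_predictions)
```

With K=1 this reduces to the rate at which the single best predicted word matches a word actually used in the referring sentences.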
Generating key-words for images by CNN regression shows good feasibility for automated interpretation of patient images. The generated key-words describe what to expect from the given image, although sometimes unrelated words can be generated too. Finding and understanding the relations between the generated words will be the next step to explore, for example via more thorough text mining using sophisticated NLP parsing as by Li et al. (2011) and combining them with the specific frequent disease prediction in the next section.
6 Predicting Presence or Absence of Frequent Disease Types
While the key-word generation in Section 5 can aid the interpretation of a patient scan, the generated key-words, e.g. “spine”, “lung”, are not very specific to a disease in an image. Nonetheless, one of the ultimate goals of large-scale radiology image/text analysis would be to automatically diagnose disease from a patient scan. To achieve the goal of automated disease detection, we added a pipeline that mines disease words, rather than disease-related words, using radiology semantics, and predicts these in an image using CNNs with a softmax cost function.
6.1 Mining Presence/Absence of Frequent Disease Terms
The disease names in the Disease Ontology (DO) contain not only disease terms but also non-disease terms that describe a disease. Some examples of disease names in DO containing non-disease terms are “occlusion of gallbladder” (DOID: 9714), “acute diarrhea” (DOID: 0050140), “strawberry gallbladder” (DOID: 10254), and “exocrine pancreatic insufficiency” (DOID: 13316). However, “occlusion of gallbladder” or “exocrine pancreatic insufficiency” is rarely described in radiology reports in exactly that way, which makes it difficult to mine specific disease terms together with their presence or absence.
The Unified Medical Language System (UMLS) of Lindberg et al. (1993); Humphreys et al. (1998) integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and inter-operable biomedical information systems and services, including electronic health records. It is a compendium of many controlled vocabularies in the biomedical sciences, created in 1986 and maintained by the National Library of Medicine.
The Metathesaurus (Schuyler et al. (1993)) forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, all of which are collected from the more than 100 incorporated controlled vocabularies and classification systems. The Metathesaurus is organized by concept, where each concept has specific attributes defining its meaning and is linked to the corresponding concept names. The Metathesaurus has 133 semantic types that provide a consistent categorization of all concepts represented in it. Among the 133 semantic types we chose to focus on “T033: finding” and “T047: disease or syndrome”, as they seemed the most disease-specific. Examples of other semantic types that we did not focus on in this study are “T017: anatomical structure”, “T074: medical device”, and “T184: sign or symptom”.
RadLex (Langlotz (2006)) is a unified language to organize and retrieve radiology imaging reports and medical records. While the Metathesaurus is a vast resource of biomedical concepts, we also used RadLex to confine our disease-term mining more specifically to radiology-related terms. The mined words are one-word terms that appear under “T033: finding” or “T047: disease or syndrome” in the UMLS Metathesaurus and also appear in RadLex (RadLex is not a subset of the Metathesaurus).
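The term selection described above amounts to a filtered intersection, sketched below. The input format (pairs of term and semantic type identifier), the function name, and the toy vocabularies are assumptions made for illustration; they do not reflect the actual UMLS or RadLex distribution formats or contents.

```python
def mine_candidate_terms(umls_terms, radlex_terms, semantic_types=("T033", "T047")):
    """Keep one-word UMLS terms of the chosen semantic types that also occur in RadLex."""
    radlex = {term.lower() for term in radlex_terms}
    selected = {}
    for term, semantic_type in umls_terms:
        term = term.lower()
        is_one_word = " " not in term
        if semantic_type in semantic_types and is_one_word and term in radlex:
            selected[term] = semantic_type
    return selected

# Toy vocabularies for illustration only (not real UMLS or RadLex content).
toy_umls = [
    ("Cyst", "T047"),
    ("unchanged", "T033"),
    ("liver", "T017"),                              # wrong semantic type
    ("exocrine pancreatic insufficiency", "T047"),  # not a one-word term
    ("diarrhea", "T047"),                           # absent from the toy RadLex set
]
toy_radlex = ["cyst", "unchanged", "liver"]
candidates = mine_candidate_terms(toy_umls, toy_radlex)
```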
We are interested not only in the disease terms associated with an image, but also in whether the disease mentioned is present or absent. After detecting semantic terms of “T033: finding” and “T047: disease or syndrome”, we used the assertion/negation detection algorithm of Chapman et al. (2001, 2013) to detect the presence and absence of disease terms. The algorithm of Chapman et al. (2001, 2013) locates trigger terms that can indicate a clinical condition as negated or possible, and determines which text falls within the scope of the trigger terms. The number of occurrences of “T033: finding” and “T047: disease or syndrome” detected as assertions or negations in radiology reports is shown in Figure 13.
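To illustrate the idea of trigger terms and scope, the sketch below classifies a single term in a sentence. It is a heavily simplified stand-in, not the algorithm of Chapman et al.: the trigger lists are short illustrative samples rather than the published lexicon, only triggers that precede the term are considered, and the fixed five-token scope is an assumption.

```python
import re

# Illustrative trigger phrases only; the real lexicon is far larger.
NEGATION_TRIGGERS = ("no evidence of", "without", "no")
POSSIBLE_TRIGGERS = ("possibly due to", "possibly", "may represent")
SCOPE = 5  # number of tokens before the term that a trigger may govern (assumed)

def classify_term(sentence, term):
    """Return 'negated', 'possible', 'asserted', or None if the term is absent."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    if term not in tokens:
        return None
    idx = tokens.index(term)
    # Pad with spaces so that triggers only match on whole-word boundaries.
    window = " " + " ".join(tokens[max(0, idx - SCOPE):idx]) + " "
    if any(f" {trigger} " in window for trigger in NEGATION_TRIGGERS):
        return "negated"
    if any(f" {trigger} " in window for trigger in POSSIBLE_TRIGGERS):
        return "possible"
    return "asserted"
```

For example, `classify_term("No evidence of cyst in the liver.", "cyst")` yields `"negated"`, while a sentence such as "Findings possibly due to cyst." is marked `"possible"`.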
While the assertion/negation detection of “T047: disease or syndrome” seemed specific enough, the detection of “T033: finding” was not. For example, it seemed difficult to derive any specific disease information from 43,219 occurrences of possible “unchanged” and 422 occurrences of negated “unchanged”. Other similar examples are 10,236 occurrences of possible “finding” and 1,129 occurrences of negated “finding”, and 3,781 occurrences of possible “t2” (an MRI image modality) and 661 occurrences of negated “t2”. We therefore decided to focus on “T047: disease or syndrome” terms only, and further ignored the terms that occurred fewer than 10 times in the whole set of radiology reports. The total number of “T047: disease or syndrome” terms for detecting presence is 59, and the total number of terms for detecting absence is 18.
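The frequency cut-off can be sketched as a simple count-and-filter step. The text does not say whether the threshold of 10 is applied per term or separately to asserted and negated mentions; the sketch below assumes the latter, and the function name and toy detections are hypothetical.

```python
from collections import Counter

def frequent_labels(detections, min_count=10):
    """Count (term, status) detections and keep those seen at least min_count times."""
    counts = Counter(detections)
    return {label: n for label, n in counts.items() if n >= min_count}

# Toy detections for illustration only.
toy_detections = (
    [("cyst", "asserted")] * 12
    + [("cyst", "negated")] * 10
    + [("abscess", "asserted")] * 3
)
kept = frequent_labels(toy_detections)
```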
6.2 Predicting Disease in Images using CNN
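Similarly to the object detection task in the ImageNet challenge, we match and detect disease terms found in the sentences of radiology reports referring to the image, using a CNN with a softmax cost function. Softmax models the probability of an instance $x$ belonging to class $j$ out of $K$ total classes with a normalized exponential. It is a generalization of the logistic function that maps a $K$-dimensional vector $\mathbf{z}$ of real values to $K$ values in the range of 0 to 1 that sum to 1:

$$P(y = j \mid x) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K,$$

where $z_j$ is the score computed by the final network layer for class $j$.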
Softmax is often implemented at the final layer of neural networks used for classification, and is standard among the CNN-based approaches in the ImageNet object recognition challenge.
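As a small numerical illustration, softmax divides the exponential of each class score by the sum of the exponentials over all classes. The sketch below is a generic, framework-independent version; subtracting the maximum score before exponentiating is a standard numerical-stability measure that does not change the result, and is not something described in the text.

```python
import math

def softmax(scores):
    """Map a list of real-valued class scores to probabilities that sum to 1."""
    peak = max(scores)
    # Shifting by the maximum avoids overflow and leaves the ratios unchanged.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```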
In addition to assigning disease terms to images, we also assign negated disease terms as absence of the diseases in the images. The total number of labels is 77 (59 present, 18 absent). If more than one disease term is mentioned for an image, we simply assigned each of the terms to the image. Some statistics on the number of assertion/negation occurrences per image are shown in Table 6.
| # images | | per image mean/std | | # assertions per image | | # negations per image | |
|---|---|---|---|---|---|---|---|
| total matching | 18291 | # assertions mean | 1.05 | 1/image | 16133 | 1/image | 1581 |
| total not matching | 197495 | # negations mean | 1.05 | 2/image | 613 | 2/image | 84 |
| with assertions | 16827 | # assertions std | 0.23 | 3/image | 81 | 3/image | 0 |
| with negations | 1665 | # negations std | 0.22 | 4/image | 0 | 4/image | 0 |
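The label assignment can be sketched as follows. This assumes that each asserted or negated mention yields one image-label training pair, that negated labels are named with a "no " prefix (as in the “no cyst” label discussed for Figure 14 (f)), and that mentions outside the selected term lists are skipped; the function name and toy inputs are hypothetical.

```python
def build_training_pairs(image_mentions, present_terms, absent_terms):
    """Build the label list and (image_id, label_index) pairs from term mentions."""
    labels = list(present_terms) + [f"no {term}" for term in absent_terms]
    index = {name: i for i, name in enumerate(labels)}
    pairs = []
    for image_id, mentions in image_mentions.items():
        for term, status in mentions:
            if status == "asserted":
                name = term
            elif status == "negated":
                name = f"no {term}"
            else:
                continue  # e.g. "possible" mentions are not used as labels here
            if name in index:
                # One pair per mention, so an image may appear several times.
                pairs.append((image_id, index[name]))
    return labels, pairs

# Toy mentions for illustration only.
toy_mentions = {
    "img1": [("cyst", "asserted")],
    "img2": [("cyst", "negated"), ("emphysema", "asserted")],
    "img3": [("mass", "asserted")],  # not among the selected terms
}
labels, pairs = build_training_pairs(toy_mentions, ["cyst", "emphysema"], ["cyst"])
```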
As we found in Section 4.2 that transfer learning from the most related model is helpful, we fine-tune the image-to-topic CNN model for the disease prediction model. For this task we fine-tuned from the image to sentence-level-topic (h3) model in Section 4, as the image-to-sentence-level-topic seems to be most closely related to the image-to-disease-specific-terms model. Similarly to Section 4, 85% of image-label pairs were used for training, 5% for cross-validation, and 10% for testing.
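A minimal sketch of such an 85%/5%/10% partition is given below. The random shuffling, the fixed seed, and the function name are assumptions for illustration; the text does not describe how the pairs were actually partitioned.

```python
import random

def split_pairs(pairs, train=0.85, val=0.05, seed=0):
    """Shuffle image-label pairs and split them into train/validation/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # the remainder, about 10%
    return train_set, val_set, test_set

train_set, val_set, test_set = split_pairs(range(1000))
```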
6.3 Prediction Result and Discussion
With the CNN trained to model image to disease presence/absence prediction, the top-1 test accuracy achieved is 0.71, and the top-5 accuracy is 0.88. We combine this with the previous image-to-topic mapping and key-word generation (Section 5.4) to generate the final output for comprehensive image interpretation. Some examples of test cases where the top-1 probability output matches the originally assigned disease labels are shown in Figure 14. Specific disease words are noticeably detected with high probability when there is one disease word per image, and with a somewhat lower top-1 probability when one disease word is ranked first and the other words fall within the top-5 probabilities (see Figure 14 (b), “ … infection abscess”).
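For clarity, top-k accuracy counts a test image as correct when its assigned label is among the k classes with the highest predicted probability. A minimal sketch, with a hypothetical function name and toy probabilities:

```python
def top_k_accuracy(probabilities, true_labels, k=1):
    """Fraction of samples whose true label is among the k most probable classes."""
    if not true_labels:
        return 0.0
    correct = 0
    for probs, label in zip(probabilities, true_labels):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(true_labels)

# Toy predictions over three classes, for illustration only.
toy_probs = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6], [0.5, 0.1, 0.4]]
toy_labels = [0, 0, 2, 2]
top1 = top_k_accuracy(toy_probs, toy_labels, k=1)
top2 = top_k_accuracy(toy_probs, toy_labels, k=2)
```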
We can also notice that automatic label assignment to images can sometimes be challenging. In Figure 14 (d), “cyst” is assigned as the correct label based on the original statement “… possibly due to cyst …”, but it is unclear whether a cyst is present on the image (and a cyst is not visibly apparent). The same applies to Figure 14 (e), where the presence of “osteophyte” is not clear from the referring sentence but was assigned as the correct label (and an osteophyte is not visibly apparent on the image). In Figure 14 (f), “no cyst” was labeled and predicted correctly, but it is less clear what to derive from a prediction result indicating the absence of a disease than from one indicating a presence.
Some examples of test cases where the top-1 probability does not match the originally assigned labels are shown in Figure 15. Four ((a),(c),(e),(f)) of the six examples contain the originally assigned label in the top-5 probability predictions, which is consistent with the relatively high (88%) top-5 prediction accuracy.
Here again, Figure 15 (a) was automatically labeled as “cyst”, but the cyst is not clearly visible on the image, which is supported by the original statement “… too small to definitely characterize cyst …”. Figure 15 (b) shows a failure case of the assertion/negation algorithm, where “cyst” is detected as negated based on the statement “… small cyst”. Nonetheless, the true label (“cyst”) is predicted with the top-1 probability. For Figure 15 (c), “cyst” is predicted where the assigned true label was “abscess”; however, cyst and abscess are sometimes visually similar. Similarly to Figure 14 (d), in Figure 15 (d) it is unclear whether emphysema is present from a statement like “ … possibly due to emphysema” (and emphysema is not visibly present), and it would be challenging to correctly interpret such a statement for label assignment. Figure 15 (e) shows a disease that could be bronchiectasis, although this is also unclear from the image. Nonetheless, bronchiectasis is predicted with the second highest probability. Bronchiectasis is visible in Figure 15 (f), and it was also predicted with the second highest probability.
Automated mining of disease-specific terms enables us to predict disease more specifically, with promising results. However, compared to the image-to-topic modeling in Section 4, where image labeling was based on topic modeling and loose coupling of image-to-keyword pairs, matching the images to more specific disease words loses about 90% of the images for the analysis due to nonspecific original statements. The proportion of cases where radiologists indicate a disease as strongly positive or negative is often much smaller than that of cases where they describe a finding rather vaguely. Mining and assigning the semantic label “T033: finding” as well would yield more image-to-label pairs. However, modeling and detecting an image with a generic term such as “mass” (a vaguer indication of a specific disease such as “cyst” or “tumor”), or similarly “finding” or “unchanged”, is probably less specific than modeling and detecting it with a more specific term such as “cyst”.
It is a compromise between going for big data with loose labels and going for smaller data with more accurate labels. The key-word generation from the rather loose labeling scheme enables us to use most of the available 216K images. While the generated key-words can help in understanding the contents of the image, they are sometimes not specific and can also be irrelevant. More specific mining and assignment of disease labels to images could provide more accurate and precise disease prediction; however, only about 10% of the total images are matched by this scheme. Another alternative is to obtain annotations from radiologists to be even more specific, but the amount of data available would be even smaller due to time and cost limitations.
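As a quick check of the "about 10%" figure, the matched fraction can be recomputed from the image counts reported in Table 6 (18291 matching and 197495 not matching). The variable names are ours; the counts are taken from the table.

```python
# Image counts from Table 6.
matched = 18291
unmatched = 197495

total = matched + unmatched           # all key images considered
matched_fraction = matched / total    # share of images with a disease label
unmatched_fraction = unmatched / total

print(f"total images: {total}")
print(f"matched: {matched_fraction:.1%}, unmatched: {unmatched_fraction:.1%}")
```

The matched share comes out slightly below one tenth of the roughly 216K key images, in line with the rounded figures quoted in the text.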
Utilizing bigger data will enable us to build a more generalizable model, but labeling will become more challenging as the data grows and becomes more heterogeneous. The compromise between the amount of data and the quality of labels seems to be a recurring dilemma, probably in the majority of automated mining applications on big data. More advanced NLP processing and comprehensive analysis of hospital discharge summaries, progress notes, and patient histories might address the need for more specific information relating to an image, even when the original image descriptions are not very specific.
It has been unclear how to extend the significant success of deep convolutional neural networks in image classification from computer vision to medical imaging. The open questions are which clinically relevant image labels should be defined, how to annotate the huge amount of medical images required by deep learning models, and to what extent and scale the deep CNN architecture is generalizable in medical image analysis.
In this paper, we present an interleaved text/image deep mining system to extract the semantic interactions of radiology reports and diagnostic key images at a very large, unprecedented scale in the medical domain. Images are classified into different levels of topics according to their associated documents, and a neural language model is learned to assign disease terms to predict what is in the image. In addition, we mined and matched specific disease terms for more specific automated image labeling, and demonstrated promising results.
To the best of our knowledge, this is the first study performing a large-scale image/text analysis on a hospital picture archiving and communication system database. Our report-extracted key image database is the largest one ever reported and is highly representative of the huge collection of radiology diagnostic semantics over the last decade. Exploring effective deep learning models on this database opens new ways to parse and understand large-scale radiology image informatics.
We hope that this study will inspire and encourage other institutions in mining other large unannotated clinical databases, to achieve the goal of establishing a central training resource and performance benchmark for large-scale medical image research, similar to the ImageNet of Deng et al. (2009) for computer vision.
This work was supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center, and in part by a grant from the KRIBB Research Initiative Program (Korean Visiting Scientist Training Award), Korea Research Institute of Bioscience and Biotechnology, Republic of Korea. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). We thank NVIDIA for the K40 GPU donation.
- (1) Openi - an open access biomedical image search engine. http://openi.nlm.nih.gov. Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine.
- Barnard et al. (2003) K. Barnard, P. Duygulu, D. Forsyth, N. Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107–1135, 2003.
- Bengio et al. (2006) Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O’Reilly Media, Inc., 2009.
- Blei and Jordan (2003) D. Blei and M. Jordan. Modeling annotated data. In ACM SIGIR, 2003.
- Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Carrivick et al. (2005) Luke Carrivick, Sanjay Prabhu, Paul Goddard, and Jonathan Rossiter. Unsupervised learning in radiology using novel latent variable models. In CVPR, 2005.
- Chapman et al. (2001) Wendy W Chapman, Will Bridewell, Paul Hanbury, Gregory F Cooper, and Bruce G Buchanan. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics, 34(5):301–310, 2001.
- Chapman et al. (2013) Wendy W Chapman, Dieter Hilert, Sumithra Velupillai, Maria Kvist, Maria Skeppstedt, Brian E Chapman, Michael Conway, Melissa Tharp, Danielle L Mowery, and Louise Deleger. Extending the negex lexicon for multiple languages. Studies in health technology and informatics, 192:677, 2013.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Deselaers and Ney (2008) Thomas Deselaers and Hermann Ney. Deformations, patches, and discriminative models for automatic annotation of medical radiographs. PRL, 2008.
- Ding et al. (2006) Chris Ding, Tao Li, and Wei Peng. Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method. In Proceedings of the national conference on artificial intelligence, volume 21, page 342. AAAI Press, 2006.
- Frome et al. (2013) Andrea Frome, Gregory Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
- Gaussier and Goutte (2005) Eric Gaussier and Cyril Goutte. Relation between plsa and nmf and implications. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 601–602. ACM, 2005.
- Girolami and Kabán (2003) Mark Girolami and Ata Kabán. On an equivalence between plsi and lda. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 433–434. ACM, 2003.
- Gupta et al. (2013) Ashish Gupta, Murat Ayhan, and Anthony Maida. Natural image bases to represent neuroimaging data. In ICML, 2013.
- Gupta et al. (2014) S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In ECCV, 2014.
- Hodosh et al. (2013) Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pages 853–899, 2013.
- Hofmann (1999) Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.
- Humphreys et al. (1998) Betsy L Humphreys, Donald AB Lindberg, Harold M Schoolman, and G Octo Barnett. The unified medical language system an informatics research collaboration. Journal of the American Medical Informatics Association, 5(1):1–11, 1998.
- Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In ECCV, pages 512–528. 2014.
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- Karpathy et al. (2014) Andrej Karpathy, Armand Joulin, and Fei Fei F Li. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897, 2014.
- Kiapour et al. (2014) H. Kiapour, K. Yamaguchi, A. Berg, and T. Berg. Hipster wars: Discovering elements of fashion styles. In ECCV, 2014.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Kulkarni et al. (2013) G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2891–2903, 2013.
- Lampert et al. (2014) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
- Langlotz (2006) Curtis P Langlotz. Radlex: A new method for indexing online educational materials. Radiographics, 26(6):1595–1597, 2006.
- LeCun et al. (2004) Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE, 2004.
- Lee and Seung (1999) Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- Li et al. (2011) S. Li, G. Kulkarni, T. Berg, A. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In ACM CoNLL, pages 220–228, 2011.
- Lindberg et al. (1993) Donald A Lindberg, Betsy L Humphreys, and Alexa T McCray. The unified medical language system. Methods of information in Medicine, 32(4):281–291, 1993.
- Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013b.
- Mikolov et al. (2013c) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751. Citeseer, 2013c.
- Miller (1995) George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
- Ordonez and Berg (2014) V. Ordonez and T. Berg. Learning high-level judgments of urban perception. In ECCV, 2014.
- Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151, 2011.
- Rashtchian et al. (2010) Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139–147. Association for Computational Linguistics, 2010.
- Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 1988.
- Russakovsky et al. (2014) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014.
- Scheirer et al. (2012) W. Scheirer, N. Kumar, P. Belhumeur, and T. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR, 2012.
- Schriml et al. (2012) Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden Kibbe. Disease ontology: a backbone for disease semantic integration. Nucleic acids research, 40(D1):D940–D946, 2012.
- Schuyler et al. (1993) Peri L Schuyler, William T Hole, Mark S Tuttle, and David D Sherertz. The umls metathesaurus: representing different views of biomedical concepts. Bulletin of the Medical Library Association, 81(2):217, 1993.
- Shin et al. (2013) Hoo-Chang Shin, Matthew R Orton, David J Collins, Simon J Doran, and Martin O Leach. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4d patient data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1930–1943, 2013.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
- Stevens et al. (2012) Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961. Association for Computational Linguistics, 2012.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 2014.
- Szegedy et al. (2014) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
- Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
- Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
- Werbos (1990) Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.