Data Interpretation over Plots
Reasoning over plots by question answering (QA) is a challenging machine learning task at the intersection of vision, language processing, and reasoning. Existing synthetic datasets (FigureQA, DVQA) do not model variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. We propose PlotQA with 8.1 million question-answer pairs over 220,000 plots with data from real-world sources and questions based on crowd-sourced question templates. 26% of the questions in PlotQA have answers that are not in a fixed vocabulary, requiring reasoning capabilities. Analysis of existing models on PlotQA reveals that a hybrid model is required: some questions are answered better by choosing the answer from a fixed vocabulary or by extracting it from a predicted bounding box in the plot, while other questions are answered with a table question-answering engine which is fed with a structured table extracted by visual element detection. For the latter, we propose the VOES pipeline and combine it with SAN-VQA to form a hybrid model SAN-VOES. On the DVQA dataset, the SAN-VOES model has an accuracy of 58%, significantly improving on the highest reported accuracy of 46%. On the PlotQA dataset, SAN-VOES has an accuracy of 54%, which is the highest amongst all the models we trained. Analysis of each module in the VOES pipeline reveals that further improvement in accuracy requires more accurate visual element detection.
Data plots such as bar charts, line graphs, scatter plots, etc. provide an efficient way of summarizing numerical information. Interpreting and reasoning over such plots are considered a test of human aptitude. It is thus of interest to formulate and evaluate machine comprehension of plots. This task lies in the intersection of vision, language processing, and reasoning, and thus poses interesting research challenges. It also has widespread real-world applicability. A highly accurate model for plot reasoning can help domain experts such as policymakers and doctors access information in a large collection of plots. This can also help visually impaired persons interact with plots in natural language.
[Table 1 header: Answer type \ Question type | Structure | Data Retrieval | Reasoning]
Recently, two datasets of plots, along with deep neural models for question answering over them, have been proposed [11, 10]. In both datasets, the plots are synthetically generated with data values and labels drawn from a custom set. In the FigureQA dataset, all questions are binary, with answers that are either Yes or No (see Figure 0(a) for an example). The DVQA dataset generalizes this to include questions which can be answered either by (a) a fixed vocabulary of 1000 words, or (b) extracting text (such as tick labels) from the plot. An example question could seek the numeric value represented by a bar of a specific label in a bar plot (see Figure 0(b)). Given that all data values in the DVQA dataset are chosen to be integers and from a fixed range, the answer to this question can be extracted from the appropriate tick label. While these datasets have defined the research questions on plot reasoning, realistic questions over plots are much more complex. For instance, consider the question in Figure 0(c), where, in a grouped bar plot, we are to compute the average of the floating point numbers represented by three bars of a color specified by the label. The answer to this question is neither in a fixed vocabulary nor can it be extracted from the plot itself. Answering such questions requires a combination of perception, language understanding, and reasoning, and thus poses a significant challenge to existing systems. Furthermore, this task is harder if the training set is not synthetic, but is instead sourced from real-world data with large variability in floating-point values, large diversity in axis and tick labels, and natural complexity in question templates.
To address this gap between existing datasets and real-world plots, we introduce the PlotQA dataset with 8.1 million question-answer pairs grounded over 220,000 plots. PlotQA improves on existing datasets on three fronts. First, roughly 26% of the questions in our dataset have answers which are not present in the plot or in a fixed vocabulary. Second, the plots are generated from real-world data sourced from the World Bank, government sites, etc., thereby having a large vocabulary of axis and tick labels, and a wide range in data values. Third, the questions are more complex as they are generated based on 74 templates extracted from 7,000 crowd-sourced questions asked by workers on a sampled set of 1,400 plots. Questions are categorized into 9 (=3x3) cells based on whether the question involves ‘Structural Understanding’, ‘Data Retrieval’, or ‘Reasoning’ and whether the answer is ‘Yes/No’, ‘From Fixed Vocabulary’, or ‘Open Vocabulary’. Sample questions and the fraction of questions for these cells are shown in Table 1.
We train and evaluate existing baseline models on PlotQA and make two observations.
First, SAN-VQA performs well on Structural Understanding questions and on Data Retrieval questions if the answer is binary.
Second, SAN-VQA performs poorly for Reasoning questions. In particular, it is unable to answer any reasoning question where the answer is from an open vocabulary.
Given these observations, we propose VOES to specifically perform well on Open Vocabulary questions.
VOES is a pipeline of four modules: Visual element detection, Optical character recognition, Extraction into a structured table, and Structured table question answering.
For questions with an answer from an open vocabulary, VOES correctly answers 32% of Data Retrieval questions and 15.4% of the Reasoning questions.
Given the complementary strengths of SAN-VQA and VOES, we train a hybrid model with a binary classifier which, given a question, decides whether to use SAN-VQA or VOES.
This hybrid model, SAN-VOES, improves on both SAN and VOES and has an aggregate accuracy of 54% on the PlotQA dataset, the highest among all the models we trained.
We also evaluate SAN-VOES on the DVQA dataset: SAN-VOES has an aggregate accuracy of 58%, improving on the best reported result of 46%, obtained by SANDY.
On the PlotQA dataset, we analyze the performance of VOES-Oracle wherein we feed the ground truth structured table to the question-answering model. This ground truth structured table is constructed from the ground truth annotations of the plots instead of using the annotations predicted by our perception model.
VOES-Oracle performs significantly better than VOES, highlighting the need to improve the perception module to increase end-to-end accuracy.
In summary, we make four major contributions:
(1) We propose the PlotQA dataset with plots on data drawn from real-world sources and questions based on templates sourced from manually curated questions. The dataset exposes the need to train models for questions that have answers from an Open Vocabulary.
(2) We propose the VOES model specifically for questions that have answers from an Open Vocabulary, as a pipeline of perception and QA modules. VOES performs significantly better on these questions than existing models.
(3) We propose a hybrid model SAN-VOES that combines the strength of classification and extraction methods (SAN) with VOES. SAN-VOES significantly improves on the best results on both DVQA and PlotQA.
(4) We empirically show that detecting visual elements from plot images is still an open challenge and generating structured tables from plots is a difficult CV task.
1 Related Work
Datasets: Over the past few years several large scale datasets for Visual Question Answering have been released. These include datasets such as COCO-QA, DAQUAR, and VQA [1, 6], which contain questions asked over natural images. On the other hand, datasets such as CLEVR and NLVR contain complex reasoning-based questions on synthetic images having 2D and 3D geometric objects. There are some datasets [12, 13] which contain questions asked over diagrams found in textbooks, but these datasets are smaller and contain multiple-choice questions. FigureSeer is another dataset which contains images extracted from research papers, but this is also a relatively small (60,000 images) dataset. Further, FigureSeer focuses on answering questions based on line plots as opposed to other types of plots such as bar charts, scatter plots, etc. as seen in FigureQA and DVQA.
Models: The availability of the above mentioned datasets has facilitated the development of complex end-to-end neural network based models. These end-to-end networks contain (a) encoders to compute a representation for the image and the question, (b) attention mechanisms to focus on important parts of the question and image, (c) interaction components to capture the interactions between the question and the image, and (d) a classification layer for selecting the answer from a fixed vocabulary. By design, these algorithms cannot be used in situations where the answer does not come from a fixed vocabulary but needs to be computed.
2 The PlotQA dataset
In this section, we describe the PlotQA dataset and the process to build it. Specifically, we discuss the four main stages, viz., (i) curating data such as year-wise rainfall statistics, country-wise mortality rates, etc., (ii) creating different types of plots with a variation in the number of elements, legend positions, fonts, etc., (iii) crowd-sourcing to generate questions, and (iv) extracting templates from the crowd-sourced questions and instantiating these templates using appropriate phrasing suggested by human annotators.
2.1 Data Collection and Curation
We considered online data sources such as World Bank Open Data (https://data.worldbank.org/), Open Government Data (https://www.india.gov.in/), Global Terrorism Database (https://www.start.umd.edu/gtd/), etc. which contain statistics about various indicator variables such as fertility rate, rainfall, coal production, etc. across years, countries, districts, etc. We crawled data from these sources to extract different variables whose relations could then be plotted (for example, rainfall v/s years across countries, or movie v/s budget, or carbohydrates v/s food_item and so on). Some statistics about the crawled data are of interest. There are a total of 841 unique indicator variables (CO2 emission, Air Quality Index, Fertility Rate, Revenue generated by taxes, etc.) with 160 unique entities (cities, states, districts, countries, movies, food items, etc.). The data ranges from 1960 to 2016, though not all indicator variables have data items for all years. The data contains positive integers, floating point values, percentages, and values on a linear scale. These values range from to .
2.2 Plot Generation
We included 3 different types of plots in this dataset, viz., bar plots, line plots and scatter plots. Within bar plots, we have grouped them by orientation as either horizontal or vertical. Within the data sources we explored, we did not find enough data to create certain other types of plots such as Venn diagrams and pie charts which are used in specific settings. We also do not consider composite plots such as Pareto charts which have line graphs on top of bar graphs. Lastly, all the plots in our dataset contain only 2 axes. Figure 2 shows one sample of each plot type in PlotQA. Each of these plots can compactly represent 3-dimensional data. For instance, in Figure 1(b), the plot compares the indicator variable diesel prices across years for different countries. To enable the development of supervised modules for various sub-tasks we provide bounding box annotations for legend boxes, legend names, legend markers, axes titles, axes ticks, bars, lines, and plot title. By using different combinations of indicator variables and entities (years, countries, etc.) we created a total of roughly 220,000 plots.
To ensure that there is enough variety in the plots, we randomly chose the following parameters: grid lines (present/absent), font size, notation used for tick labels (scientific-E notation or standard notation), line style (solid, dashed, dotted, dash-dot), marker styles for marking data points (asterisk, circle, diamond, square, triangle, inverted triangle), position of legends (bottom-left, bottom-centre, bottom-right, center-right, top-right), and colors for the lines and bars from a set of 73 different colors. The number of discrete elements on the x-axis varies from 2 to 12. Similarly, the number of entries in the legend box varies from 1 to 4. In other words, in the case of line plots, the number of lines varies from 1 to 4 and in the case of grouped bars the number of bars grouped on a single x-tick varies from 1 to 4. For example, for the plots in Figure 1(b), the number of discrete elements on the x-axis is and the number of legend names (i.e., number of lines) is .
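As a rough illustration of the randomization described above, the following Python sketch samples one style configuration per plot. The option lists are taken from the text; the dictionary keys, the function name `sample_plot_style`, and the use of uniform sampling are assumptions made for illustration. Font sizes and the 73 colors are omitted because their values are not listed here.

```python
import random

# Options named in the text. Sampling them uniformly is an assumption.
TICK_NOTATIONS = ["scientific-E", "standard"]
LINE_STYLES = ["solid", "dashed", "dotted", "dash-dot"]
MARKERS = ["asterisk", "circle", "diamond", "square", "triangle",
           "inverted triangle"]
LEGEND_POSITIONS = ["bottom-left", "bottom-centre", "bottom-right",
                    "center-right", "top-right"]


def sample_plot_style(rng):
    """Draw one hypothetical style configuration for a single plot."""
    return {
        "grid_lines": rng.choice([True, False]),
        "tick_notation": rng.choice(TICK_NOTATIONS),
        "line_style": rng.choice(LINE_STYLES),
        "marker": rng.choice(MARKERS),
        "legend_position": rng.choice(LEGEND_POSITIONS),
        # 2 to 12 discrete elements on the x-axis, 1 to 4 legend entries.
        "num_x_elements": rng.randint(2, 12),
        "num_legend_entries": rng.randint(1, 4),
    }


style = sample_plot_style(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when the same plot has to be regenerated later.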
2.3 Sample Question Collection by Crowd-sourcing
Since the underlying data of the PlotQA dataset is much richer in comparison to FigureQA and DVQA, we found it necessary to ask a wider set of annotators to create questions over these plots. However, creating questions for all the plots in our dataset would have been prohibitively expensive. We sampled 1,400 plots across different types and asked workers on Amazon Mechanical Turk to create questions for these plots. We showed each plot to 5 different workers, resulting in a total of 7,000 questions. We specifically instructed the workers to ask complex reasoning questions which involved reference to multiple plot elements in the plots. We also gave the workers a list of simple questions such as “Is this a vertical bar graph?”, “What is the title of the graph?”, “What is the value of coal production in 1993?” and asked them not to create such questions, as we had already created such questions using hand-designed templates. We paid the workers for each question.
2.4 Question Template Extraction & Instantiation
We manually analyzed the questions collected by crowdsourcing and divided them into a total of 74 templates (including the simple templates that we had manually designed as mentioned earlier). These templates were divided into 3 question categories. These question categories along with a few sample templates are shown below. We refer the reader to the Supplementary material for further details.
Structural Understanding: These are questions about the overall structure of the plot and do not require any quantitative reasoning. Examples: “How many different coloured bars are there?”, “How are the legend labels stacked?”.
Data Retrieval: These questions seek a data item for a single element in the plot. Examples: “What is the number of tax payers in Myanmar in 2015?”, “How many bars are there on the tick from the top?”.
Reasoning: These questions either require numeric reasoning over multiple plot elements or a comparative analysis of different elements of the plot, or a combination of both to answer the question. Examples: “In which country is the number of threatened bird species minimum?”, “What is the median banana production?”, “What is the difference between the number of deaths in Bulgaria and Cuba in the year 2005?”, “In how many years, is the rice production greater than the average rice production over all years?”.
We abstracted the questions into templates such as “In how many plural form of X_label, is the Y_label of/in legend_label greater than the average Y_label of/in legend_label taken over all plural form of X_label?”. We could then generate multiple questions for each template by replacing X_label, Y_label, legend_label, etc. by indicator variables, years, cities etc. from our curated data. However, this was a tedious task requiring a lot of manual intervention. For example, consider the indicator variable “Race of students” in Figure 0(c). If we substitute this indicator variable as it is in the above template, it would result in the question “In how many cities, is the race of the students (%) of Asian greater than the average race of the students (%) of Asian taken over all cities?”, which sounds unnatural. To avoid this, we asked in-house annotators to carefully paraphrase these indicator variables and question templates. The paraphrased version of the above example was “In how many cities, is the percentage of Asian students greater than the average percentage of Asian students taken over all cities?”. Such paraphrasing for every question template and indicator variable required significant manual effort. Using this semi-automated process we were able to generate a total of roughly 8.1 million questions. As shown in Table 2, the answer could either be Yes/No or from Fixed Vocabulary, or Open Vocabulary. We believe that this approach of creating questions based on templates extracted from complex human generated questions is a good middle ground between (i) the expensive and time consuming process of creating questions with the help of humans and (ii) the inexpensive and fast process of creating questions from very simple templates (as in FigureQA and DVQA).
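The instantiation step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used to build the dataset: the template string is a simplified form of the one quoted above, and `PARAPHRASES`, `naive_quantity`, and `instantiate` are hypothetical names. The lookup table stands in for the paraphrases written by the annotators.

```python
# Simplified version of the template quoted in the text.
TEMPLATE = ("In how many {x_plural}, is the {quantity} greater than "
            "the average {quantity} taken over all {x_plural}?")

# Hypothetical store of annotator paraphrases, keyed by
# (indicator variable, legend label).
PARAPHRASES = {
    ("race of the students (%)", "Asian"): "percentage of Asian students",
}


def naive_quantity(y_label, legend_label):
    """Direct substitution, which can sound unnatural."""
    return f"{y_label} of {legend_label}"


def instantiate(x_plural, y_label, legend_label):
    """Fill the template, preferring a paraphrase when one exists."""
    quantity = PARAPHRASES.get((y_label, legend_label),
                               naive_quantity(y_label, legend_label))
    return TEMPLATE.format(x_plural=x_plural, quantity=quantity)


question = instantiate("cities", "race of the students (%)", "Asian")
```

With the paraphrase present, `question` reads "In how many cities, is the percentage of Asian students greater than the average percentage of Asian students taken over all cities?". Without it, the function falls back to the direct substitution.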
3 Proposed Model
Existing models for VQA treat it as a multi-class classification problem, i.e., they assume that the answer needs to be picked from a fixed vocabulary. Such models work well for datasets such as DVQA where indeed all answers come from a fixed vocabulary (global or plot specific). However, in our dataset, for roughly 23% of Data Retrieval questions and 46% of Reasoning questions, the answers do not come from a fixed vocabulary but need to be computed by reasoning over one or more visual elements in the plots. To address such complex questions, we seek to leverage existing results on QA over tables. However, this requires the intermediate step of translating the plot image into a structured table (potentially similar to the one from which the plot was generated). To this end, we propose a pipelined method which separates the tasks of visual element detection and reasoning for QA. More specifically, our pipeline contains modules for (i) detecting visual elements in the plot such as bars, bounding boxes around axes labels, etc., (ii) performing optical character recognition within these bounding boxes, (iii) converting this data into a structured table, and (iv) answering questions using this structured table.
3.1 Visual Elements Detection (VED)
The data bearing elements of a plot are of 10 distinct classes: the title, the labels of the x and y axes, the tick labels or categories (e.g., countries) on the x and y axes, the data markers in the legend box, the legend names, and finally the bars and lines in the graph. Following existing literature, we refer to these elements as the visual elements of the graph. The first task is to extract all these visual elements by drawing bounding boxes around them and classifying them into the appropriate class. We can treat this as (i) an object detection + classification task or (ii) an instance segmentation task. If we take the former view then we can use existing object detection models such as RCNN, Fast-RCNN, YOLO, SSD, etc. and if we take the latter view we can use instance segmentation models such as Mask-RCNN. We tried both these approaches and found that instance segmentation based methods perform better for this task and hence we use Mask-RCNN as our VED module. Figure 3 shows the expected output of this stage with all the visual elements detected.
3.2 Optical Character Recognition (OCR)
Some of the visual elements such as title, legends, tick labels, etc. contain numeric and textual data. For extracting this data from within these bounding boxes, we use a state-of-the-art OCR model. More specifically, we crop the detected visual element to its bounding box, convert the cropped image into grayscale, resize and deskew it, and then pass it to an OCR module. Existing OCR modules perform well for machine-written English text, and indeed we found that a pre-trained OCR module (https://github.com/tesseract-ocr/tesseract) works well on our dataset.
3.3 Semi-Structured Information Extraction (SIE)
The next stage of extracting the data into a semi-structured table is best explained with an example shown in Figure 3. The desired output of SIE is shown in the table where the rows correspond to the ticks on the x-axis (1996, 1997, 1998, 1999), the columns correspond to the different elements listed in the legend (Brazil, Iceland, Kazakhstan, Thailand) and the (i, j)-th cell contains the value corresponding to the i-th tick and the j-th legend. The values of the x-tick labels and the legend names are available from the OCR module. The mapping of legend name to legend marker or color is done by associating a legend name to the marker or color whose bounding box is closest to the bounding box of the legend name. Similarly, we associate each tick label to the tick marker whose bounding box is closest to the bounding box of the tick label. For example, we associate the legend name Brazil to the color “Dark Cyan” and the tick label 1996 to the corresponding tick mark on the x-axis. With this we have the 4 row headers and 4 column headers, respectively. To fill in the 16 values in the table, there are again two smaller steps. First we associate each of the 16 bounding boxes of the 16 bars to their corresponding x-ticks and legend names. A bar is associated with an x-tick label whose bounding box is closest to the bounding box of the bar. To associate a bar to a legend name, we find the dominant color in the bounding box of the bar and match it with a legend name corresponding to that color. Second, we need to find the value represented by each bar. We extract the height of the bar using bounding box information from the VED module and then search for the y-tick labels immediately above and below that height. We then interpolate the value of the bar based on the values of these bounding ticks. With this we have the 16 values in the cells and thus have extracted all the information from the plot into a semi-structured table.
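The two steps above (nearest-box association and tick interpolation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are `(x_min, y_min, x_max, y_max)` in image pixels with y growing downward, "closest" is measured between box centers, only vertical bars are handled, and the dominant color of each bar is assumed to have been computed already. All function and variable names are hypothetical.

```python
import math


def center(box):
    """Center of an (x_min, y_min, x_max, y_max) box in image pixels."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def closest(box, candidates):
    """Name of the candidate whose box center is nearest to the center of box."""
    cx, cy = center(box)

    def distance(name):
        ox, oy = center(candidates[name])
        return math.hypot(cx - ox, cy - oy)

    return min(candidates, key=distance)


def interpolate_value(bar_top_y, y_ticks):
    """Linearly interpolate between the y-ticks just above and below the bar top.

    y_ticks is a list of (pixel_y, value) pairs with distinct pixel_y.
    """
    ticks = sorted(y_ticks)  # ascending pixel_y, i.e. top of the image first
    for (y_upper, v_upper), (y_lower, v_lower) in zip(ticks, ticks[1:]):
        if y_upper <= bar_top_y <= y_lower:
            frac = (y_lower - bar_top_y) / (y_lower - y_upper)
            return v_lower + frac * (v_upper - v_lower)
    raise ValueError("bar top lies outside the range of detected y-ticks")


def build_table(bars, x_tick_boxes, legend_by_color, y_ticks):
    """Map (x-tick label, legend name) to the interpolated bar value."""
    table = {}
    for bar in bars:
        row = closest(bar["box"], x_tick_boxes)
        col = legend_by_color[bar["color"]]
        table[(row, col)] = interpolate_value(bar["box"][1], y_ticks)
    return table


# Toy inputs: two x-ticks, one legend entry, y-ticks at 0, 100 and 200.
x_tick_boxes = {"1996": (90, 410, 130, 425), "1997": (190, 410, 230, 425)}
legend_by_color = {"dark cyan": "Brazil"}
y_ticks = [(400, 0.0), (300, 100.0), (200, 200.0)]
bars = [
    {"box": (95, 250, 125, 400), "color": "dark cyan"},
    {"box": (195, 300, 225, 400), "color": "dark cyan"},
]
table = build_table(bars, x_tick_boxes, legend_by_color, y_ticks)
```

In this toy case the first bar's top edge sits halfway between the 100 and 200 ticks, so its cell is interpolated to 150. A few pixels of error in the detected top edge shift the interpolated value proportionally, which is why this stage is sensitive to the quality of the bounding boxes.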
3.4 Table Question Answering (QA)
The final stage of the pipeline is to answer questions on the semi-structured table. As this is similar to answering questions from the WikiTableQuestions dataset, we adopt the same methodology as proposed in . In this method, the table is converted to a knowledge graph and the question is converted to a set of candidate logical forms by applying compositional semantic parsing. These logical forms are then ranked using a log-linear model and the highest ranking logical form is applied to the knowledge graph to get the answer. Note that with this approach the output is computed by a logical form that operates on the numerical data. This avoids the limitation of using a small answer vocabulary for multi-class classification as is done in existing work on VQA. There are some recent neural approaches for answering questions over semi-structured tables such as [19, 7] which take an ensemble of many models and outperform the relatively simpler model of  only by a small margin (1-2%). In the absence of an ensemble, these neural methods do not perform better than the method proposed in . To the best of our knowledge, there is one neural method  which performs better than  but the code for this model is not available which makes it hard to reproduce their results. Hence we chose the model of  for this stage which is relatively simpler and readily available.
In this section we detail the data splits, baseline models, hyperparameter tuning and evaluation metrics.
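To make the rank-then-execute idea concrete, here is a toy Python sketch. It is not the semantic parser used in this work: the candidate logical forms are written by hand rather than generated, the table is a plain dictionary rather than a knowledge graph, and the feature names and weights are invented for illustration. It only shows how a log-linear score selects one logical form, whose execution on the table yields a computed (open vocabulary) answer.

```python
import math

# Toy table: rows are x-ticks, columns are legend names (made-up values).
TABLE = {
    "1996": {"Brazil": 10.0, "Iceland": 4.0},
    "1997": {"Brazil": 14.0, "Iceland": 6.0},
    "1998": {"Brazil": 18.0, "Iceland": 5.0},
}


def column(table, legend):
    return [row[legend] for row in table.values()]


def average(values):
    return sum(values) / len(values)


# Hand-written candidates for "What is the average value for Brazil?".
# Each entry: name -> (executable form, hypothetical feature vector).
CANDIDATES = {
    "avg(column(Brazil))": (lambda t: average(column(t, "Brazil")),
                            {"op_matches_question": 1.0, "column_matches_question": 1.0}),
    "max(column(Brazil))": (lambda t: max(column(t, "Brazil")),
                            {"op_matches_question": 0.0, "column_matches_question": 1.0}),
    "avg(column(Iceland))": (lambda t: average(column(t, "Iceland")),
                             {"op_matches_question": 1.0, "column_matches_question": 0.0}),
}

# Invented weights; in a real system these would be learned.
WEIGHTS = {"op_matches_question": 2.0, "column_matches_question": 1.5}


def log_linear_probs(candidates, weights):
    """p(z | x) proportional to exp(weights . features(x, z))."""
    scores = {name: sum(weights[f] * v for f, v in feats.items())
              for name, (_, feats) in candidates.items()}
    normalizer = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / normalizer for name, s in scores.items()}


def answer(table, candidates, weights):
    """Execute the highest-probability logical form on the table."""
    probs = log_linear_probs(candidates, weights)
    best = max(probs, key=probs.get)
    return best, candidates[best][0](table)


best_form, result = answer(TABLE, CANDIDATES, WEIGHTS)
```

Here the average over the Brazil column wins the ranking and evaluates to 14.0, a value that appears nowhere in the table or in any fixed answer vocabulary.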
4.1 Train-Valid-Test Splits
As mentioned earlier, by using different combinations of indicator variables and entities (years, countries, etc.), we created a total of roughly 220,000 plots. Depending on the context and type of the plot, we instantiated the templates to create meaningful (question, answer) pairs for each of the plots. The number of questions per plot varies from 17 to 44. We created train (70%), valid (15%) and test (15%) splits from this data. These statistics are summarized in Table 2. The dataset, crowd-sourced questions and the model will be made available on the acceptance of this paper.
[Table 2 header: Dataset Split | #Images | #QA pairs]
4.2 Models Compared
We compare the performance of the following models:
- IMG-only: This is a simple baseline where we just pass the image through a VGG19 and use the embedding of the image to predict the answer from a fixed vocabulary.
- QUES-only: This is a simple baseline where we just pass the question through an LSTM and use the embedding of the question to predict the answer from a fixed vocabulary.
- SAN: This is a state-of-the-art VQA model which is an encoder-decoder model with a multi-layer stacked attention mechanism. It obtains a representation for the image using a deep CNN and a representation for the query using an LSTM. It then uses the query representation to locate relevant regions in the image and uses this to pick an answer from a fixed vocabulary.
- SANDY: This is the best performing model on the DVQA dataset and is a variant of SAN. Unfortunately, the code for this model is not available and the description in the paper was not detailed enough for us to reimplement it. (We contacted the authors; while they were helpful in sharing various details, they no longer have access to the original code.) Hence, we report the numbers for this model only on DVQA (from the original paper).
- VOES: This is our model as described in Section 3, which is specifically designed for questions which do not have answers from a fixed vocabulary.
- VOES-Oracle: This is our model where the first three stages of VOES are replaced by an Oracle, i.e., the QA model answers questions on a table that has been generated using the ground truth annotations of the plot. With this we can evaluate the performance of the WikiTableQA model when it is not affected by the VED model’s errors.
- SAN-VOES: Given the complementary strengths of SAN-VQA and VOES, we train a hybrid model with a binary classifier which given a question decides whether to use the SAN or the VOES model. The data for training this binary classifier is generated by comparing the predictions of a trained SAN model and a trained VOES model on the training dataset. For a given question, the label is set to 1 (pick SAN) if the performance of SAN was better than that of VOES. We ignore questions where there is a tie. The classifier is a simple LSTM based model which computes a representation for the question using an LSTM and uses this representation to predict 1/0. At test time, we first pass the question through this model and depending on the output of this model use SAN or VOES.
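The construction of training labels for the SAN-VOES question classifier, and its use at test time, can be sketched as below. This is an illustrative outline only: per-question performance is assumed to be a boolean correctness flag, and `make_routing_examples`, `answer_with_hybrid`, and the callables passed to them are hypothetical stand-ins for the trained models.

```python
def make_routing_examples(questions, san_correct, voes_correct):
    """Build (question, label) pairs; label 1 means pick SAN, 0 means pick VOES."""
    examples = []
    for question, san_ok, voes_ok in zip(questions, san_correct, voes_correct):
        if san_ok == voes_ok:
            continue  # tie (both right or both wrong): the question is ignored
        examples.append((question, 1 if san_ok else 0))
    return examples


def answer_with_hybrid(plot, question, classifier, san, voes):
    """Route the question to SAN or VOES based on the classifier's 1/0 output."""
    model = san if classifier(question) == 1 else voes
    return model(plot, question)


# Toy usage with made-up correctness flags for three training questions.
questions = ["Is this a bar graph?", "What is the average value?", "How many bars?"]
examples = make_routing_examples(questions,
                                 san_correct=[True, False, True],
                                 voes_correct=[False, True, True])
```

Since the classifier sees only the question text, the routing decision does not depend on how well visual element detection worked for a particular plot.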
4.3 Training Details
SAN: We used an existing implementation of SAN (https://github.com/TingAnChien/san-vqa-tensorflow) for establishing the initial baseline results. Image features are extracted from the last pooling layer of the VGG19 network. Question features are the last hidden state of the LSTM. Both the LSTM hidden state and the 512-d image feature vector at each location are projected to a 1024-d vector by a fully connected layer, then added and passed through a non-linearity (tanh). The model was trained using the Adam optimizer with an initial learning rate of and a batch size of 128 for 25000 iterations.
Of the four stages of the pipeline described in Section 3, only two require training, viz., Visual Elements Detection (VED) and Table Question Answering (QA).
As mentioned earlier, for VED we train an instance segmentation model (Mask-RCNN) using the bounding box annotations available in our dataset.
We trained each model with a batch size of 32 for steps, beyond which no further training benefit was seen.
We used RMSProp as the optimizer with an initial learning rate of .
For Table QA, we trained the model proposed in  using questions from our dataset and the corresponding ground truth tables.
Since this model is computationally expensive with a high training time, we could train it using only questions from our training set.
SAN-VOES: The binary question classifier in this hybrid model contains a 50-dimensional word embedding layer followed by an LSTM with 128 hidden units. The output of the LSTM is projected to 256 dimensions and this is then fed to the output layer. The model is trained for 10 epochs using RMSProp with an initial learning rate of 0.001. Accuracy on the validation set is .
4.4 Evaluation Metric
We used accuracy as the evaluation metric. Specifically, for textual answers (such as India, CO2, etc.) the model’s output was considered to be correct only if the predicted answer exactly matches the true answer. However, for numeric answers which contain floating point values such an exact match is a very strict evaluation metric (for example, if the predicted answer is 10.5 and the true answer is 10 then in most cases it would be acceptable). Hence, we relax the accuracy measure to consider the predicted answer to be correct as long as it is within 5% of the correct answer.
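A minimal sketch of this relaxed accuracy check is given below. The function name `is_correct` is hypothetical, the 5% tolerance is taken relative to the true answer, and the handling of a true answer of exactly zero is an assumption, since the text does not specify it.

```python
def is_correct(predicted, truth, tolerance=0.05):
    """Exact match for text; within 5% of the true value for numeric answers."""
    try:
        p, t = float(predicted), float(truth)
    except (TypeError, ValueError):
        # At least one side is not numeric: require an exact string match.
        return str(predicted) == str(truth)
    if t == 0:
        # Assumption: with a true answer of zero, only zero counts as correct.
        return p == 0
    return abs(p - t) <= tolerance * abs(t)


def accuracy(predictions, truths):
    """Fraction of answers judged correct under the relaxed criterion."""
    correct = sum(is_correct(p, t) for p, t in zip(predictions, truths))
    return correct / len(truths)
```

For instance, a prediction of 10.4 for a true answer of 10 is accepted, while 11 is rejected, and "india" does not match "India".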
5 Observations and Results
1. Evaluating models on PlotQA dataset (Table 4): The baselines IMG-only and QUES-only performed poorly with an accuracy of and respectively. We then evaluate SAN, VOES, VOES-Oracle, and SAN-VOES on each of the 9 question-answer types of the PlotQA dataset. SAN performs very well on Yes/No questions and moderately well on Fixed vocab. questions with a good baseline aggregate accuracy of 46.54%. SAN performs poorly on Open vocab. questions, failing to answer almost all the 319,000 questions in this category. On the other hand, VOES fails to answer correctly any of the Yes/No questions, performs moderately well on Fixed vocab. questions, and answers correctly some of the hard Open vocab. questions. SAN-VOES combines the complementary strengths of SAN and VOES with the highest accuracy of 53.96%. In particular, the performance improves significantly for all Fixed vocab. questions, while retaining the high accuracy of SAN on Yes/No questions and VOES’ performance on Open vocab. questions. There is a significant difference in the performance of VOES and VOES-Oracle across multiple question-answer types. This implies that the visual element detection in the VOES pipeline can be further improved.
2. Analysis of the VOES pipeline
We analyze the performance of visual element detection (VED) and OCR.
- Table 5 shows that the VED module performs reasonably well at an Intersection Over Union (IOU) of 0.5. For higher IOUs of 0.8 and 0.9, the accuracy falls drastically. For instance, at IOU of 0.9, dotlines are detected with an accuracy of under 5%. Clearly, such inaccuracies would lead to incorrect table generation and subsequent QA. This brings out an interesting difference between this task and other instance segmentation tasks where the margin of error is higher (where IOU of 0.5 is accepted). A small error in visual element detection as indicated by mAP scores of 80% is considered negligible for VQA tasks, however for PlotQA small errors can cause significantly misaligned table generation and subsequent QA. We illustrate this with an example given in Figure 4. The predicted red box having an IOU of 0.58 estimates the bar size as 760 as opposed to ground truth of 680, significantly impacting downstream QA accuracy.
Retraining the VED model with a higher IOU of 0.75 only resulted in a small increase in accuracy (last row).
Thus, inverting the plot generation function in going from the plot image to the structured table is a difficult CV task.
- In Table 6 we evaluate the performance of the OCR module in standalone/oracle mode and pipeline mode. In the oracle mode, we feed ground truth boxes to the OCR model whereas in the pipeline mode we perform OCR on the output of the VED module. We observe only a small drop in performance, which indicates that the OCR module is robust to the reduction in VED module’s accuracy at higher IOU as it does not depend on the class label or the exact position of bounding boxes.
- In summary, a highly accurate VED for structured images is an open challenge to improve reasoning over plots.
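To make the sensitivity to box placement concrete, the following is a minimal sketch of how IOU between two boxes can be computed and how the top edge of a bar's box translates into a data value. The pixel coordinates, the axis baseline, and the `pixels_per_unit` scale are invented for illustration; only the values 680 and 760 echo the example in the text, and the resulting IOU is not meant to reproduce the box in Figure 4. The helper `bar_value` is a hypothetical stand-in for the table-generation step, not the pipeline's actual code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in image coordinates (y grows downward)."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def bar_value(box: Box, baseline_y: float, pixels_per_unit: float) -> float:
    """Hypothetical read-out: bar height in pixels divided by the axis scale."""
    return (baseline_y - box.y1) / pixels_per_unit


# Assumed geometry: x-axis baseline at y = 500 px, 0.5 px per data unit.
BASELINE_Y = 500.0
PIXELS_PER_UNIT = 0.5

ground_truth = Box(100, 160, 140, 500)  # true bar
predicted = Box(100, 120, 140, 500)     # top edge is 40 px too high

overlap = iou(ground_truth, predicted)
true_value = bar_value(ground_truth, BASELINE_Y, PIXELS_PER_UNIT)
read_value = bar_value(predicted, BASELINE_Y, PIXELS_PER_UNIT)

print(f"IOU = {overlap:.2f}, true value = {true_value}, read value = {read_value}")
```

In this made-up geometry the predicted box would pass a detection threshold of 0.8, yet the value read from it is off by 80 units, which is the kind of error that propagates into the generated table.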
3. Evaluating new models on the existing DVQA dataset (Table 3): The proposed model VOES performs better than the existing models (SAN and SANDY-OCR) on DVQA. The higher performance of VOES in comparison to SAN (in contrast to the PlotQA results) suggests that the extraction of the structured table is more accurate on the DVQA dataset. This is because of the limited variability in the axis and tick labels and shorter length (one word only) of labels. The hybrid model, SAN-VOES, improves on the individual models and establishes a new SOTA result on DVQA.
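The idea of a hybrid that uses different models for different question types can be sketched as a simple dispatcher. This is an illustrative sketch only: the routing rule, the `classify` function, the type names `"yes_no"` and `"open_vocab"`, and the toy answerers are all assumptions introduced here, and the text does not specify how SAN-VOES actually decides which component answers a given question.

```python
from typing import Callable, FrozenSet

# (image identifier, question text) -> answer string
Answerer = Callable[[str, str], str]


def make_hybrid(
    classify: Callable[[str], str],
    fixed_vocab_model: Answerer,
    table_qa_pipeline: Answerer,
    pipeline_types: FrozenSet[str] = frozenset({"open_vocab"}),
) -> Answerer:
    """Route each question to one of two answerers by its predicted type.

    Which types go to the pipeline is an assumption of this sketch.
    """

    def answer(image_id: str, question: str) -> str:
        qtype = classify(question)
        if qtype in pipeline_types:
            return table_qa_pipeline(image_id, question)
        return fixed_vocab_model(image_id, question)

    return answer


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_classify(question: str) -> str:
    first_word = question.lower().split()[0]
    return "yes_no" if first_word in {"is", "are", "does", "do"} else "open_vocab"


def toy_san(image_id: str, question: str) -> str:
    return "Yes"  # stands in for a fixed-vocabulary classifier


def toy_voes(image_id: str, question: str) -> str:
    return "680"  # stands in for table extraction followed by table QA


hybrid = make_hybrid(toy_classify, toy_san, toy_voes)
print(hybrid("plot_1", "Is the bar for 2010 taller than the bar for 2011?"))
print(hybrid("plot_1", "What is the value of the bar for 2010?"))
```

Keeping the routing rule separate from the two answerers means either component can be swapped or re-evaluated on its own, which mirrors the per-question-type comparison reported in the tables.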
Table 5 (last row only): mAP - IOU 0.75 | 92.69% | 79.58% | 42.88%
Table 6 (header only): Textual Class | Oracle | After VED
We introduce the PlotQA dataset to reduce the gap between existing synthetic plot datasets and real-world plots and question templates. Analysis of SAN, an existing VQA model for plots, on PlotQA reveals that it performs poorly on Open Vocabulary questions. We propose the VOES model, a pipelined approach that combines visual element detection and OCR with QA over tables, specifically for the Open Vocabulary questions. A hybrid model, SAN-VOES, that combines SAN and VOES for different question types achieves state-of-the-art results on both the DVQA and PlotQA datasets. Detailed analysis of the VOES pipeline reveals the need for more accurate visual element detection to improve reasoning over plots.