Using syntactical and logical forms to evaluate textual inference competence

Felipe Salvatore    Marcelo Finger    Roberto Hirata Jr
Institute of Mathematics and Statistics, University of Sao Paulo, Brazil
{felsal, mfinger, hirata}@ime.usp.br
Abstract

In the light of recent breakthroughs in transfer learning for Natural Language Processing, much progress has been achieved on Natural Language Inference. Different models now present high accuracy on popular inference datasets such as SNLI, MNLI and SciTail. At the same time, there are different indicators that those datasets can be exploited by using some simple linguistic patterns. This fact poses difficulties to our understanding of the actual capacity of machine learning models to solve the complex task of textual inference. We propose a new set of tasks that require specific capacities over linguistic logical forms such as: i) Boolean coordination, ii) quantifiers, iii) definite description, and iv) counting operators. By evaluating a model on our stratified dataset, we can better pinpoint the specific inferential difficulties of a model with each kind of textual structure. We evaluate two kinds of neural models that implicitly exploit language structure: recurrent models and the Transformer network BERT. We show that, although BERT is clearly more efficient at generalizing over most logical forms, there is still room for improvement when dealing with counting operators.

1 Introduction

Natural Language Inference (NLI) is a complex problem of Natural Language Understanding which is usually defined as follows: given a pair of textual inputs P and H, we need to determine if P entails H, if P contradicts H, or if P and H have no logical relationship (they are neutral) [Consortium et al.1996]. P and H, known as the “premise” and the “hypothesis” respectively, can be either simple sentences or full texts.

The task can focus either on the entailment or on the contradiction part. The former, known as Recognizing Textual Entailment (RTE), classifies the pair (P, H) as “entailment” or “non-entailment”. The latter, known as Contradiction Detection (CD), classifies that pair as “contradiction” or “non-contradiction”. Regardless of how we frame the problem, the concept of inference is the critical issue here.

With this formulation, NLI has been treated as a text classification problem suitable to be solved by a variety of machine learning techniques [Bar-Haim et al.2014, Bowman et al.2015a, Williams et al.2017]. Inference itself is also a complex phenomenon, as shown in the following sentence pairs:

  1. “A woman plays with my dog”, “A person plays with my dog”

  2. “Jenny and Sally play with my dog”, “Jenny plays with my dog”

Both examples are cases of entailment, but with different properties. In (1) the entailment is caused by the hypernym relationship between “person” and “woman”. Example (2) deals with the interpretation of the coordinating conjunction “and” as a Boolean connective. Since (1) relies on the meaning of the noun phrases, we call it “lexical inference”. Since (2) is invariant under substitution of the noun phrases, we call it “structural inference”. The latter is the focus of this work.

In this paper, we propose a new synthetic dataset that enables us to:

  1. compare the NLI accuracy of different neural models.

  2. diagnose the structural (logical and syntactic) competence of each model.

  3. verify cross-linguistic structural competence of each method.

The contributions of this paper are: i) a structure-oriented dataset; ii) a comparison of traditional recurrent neural models against the Transformer network BERT, showing a clear advantage for the latter while still identifying specific gaps in its performance; and iii) a success case of cross-language transfer learning for structural NLI between English and Portuguese.

2 Background

The size of NLI datasets has been increasing since the initial proposition of the FraCaS test suite [Consortium et al.1996]. Older datasets like RTE-6 [Bentivogli et al.2009] and SICK [Marelli et al.2014] are relatively small compared with current ones like SNLI [Bowman et al.2015a] and MNLI [Williams et al.2017]. This increase was made possible by the use of crowdsourcing platforms like Amazon Mechanical Turk [Bowman et al.2015a, Williams et al.2017]. Hence the annotation by highly specialized researchers, as in RTE 1-3, which was done by formal semanticists [Giampiccolo et al.2007, Bar-Haim et al.2014], was replaced by labelling done by average English speakers. This approach has been criticised with the argument that it is hard for an average speaker to produce varied and creative examples of entailment and contradiction pairs [Gururangan et al.2018]. By looking at the hypothesis alone, a simple text classifier can achieve an accuracy significantly higher than a random classifier on datasets such as SNLI and MNLI. This was explained by a high correlation of negative words (“no”, “nobody”, “never”, “nothing”) with contradiction instances, and a high correlation of generic words (such as “animal”, “instrument”, “outdoors”) with entailment instances. So despite the large size of these corpora, the task was easier to perform than expected.

The new wave of pre-trained models [Howard and Ruder2018, Devlin et al.2018, Liu et al.2019] poses both a challenge and an opportunity for the NLI field. The challenge is that the large-scale datasets are close to being solved (see the benchmark results for SNLI, MNLI, and SciTail reported in [Liu et al.2019]), giving the impression that NLI will become a trivial problem. The opportunity lies in the fact that, by using pre-trained models, training no longer needs such large datasets. We can then focus our efforts on creating small, well-thought-out datasets that reflect the variety of inferential tasks, and so determine the real competence of a model.

Here we present a collection of small datasets designed to measure the competence of detecting contradictions in structural inferences. We have chosen the CD task because it is harder for an average annotator to create examples of contradiction without excessively relying on the same patterns. At the same time, CD has practical importance, since it can be used to improve consistency in real applications such as chat-bots [Welleck et al.2018].

We choose to focus on structural inference because we have found that the current datasets do not appropriately address this particular feature. In an experiment, we verified the deficiencies reported in [Gururangan et al.2018, Glockner et al.2018]. First, we transformed the SNLI and MNLI datasets into a CD task. The transformation is done by converting all instances of entailment and neutral into non-contradiction, and by balancing the classes in both the training and test data. Second, we applied a simple Bag-of-Words classifier, destroying any structural information. The accuracy was significantly higher than that of a random classifier for both SNLI and MNLI. Even the recent dataset focusing on contradiction, Dialogue NLI [Welleck et al.2018], presents a similar pattern: the same Bag-of-Words model also achieved above-random accuracy on this corpus.
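As an illustration, a minimal sketch of this conversion, assuming the data is available as (premise, hypothesis, label) triples; the helper names are ours, not from the original experiment:

```python
# Collapse the three-way SNLI/MNLI labels into a binary contradiction-detection
# (CD) task and balance the classes by downsampling the majority class.
import random

def to_cd(examples):
    """Map (premise, hypothesis, 3-way label) to contradiction / non-contradiction."""
    return [(p, h, "contradiction" if y == "contradiction" else "non-contradiction")
            for p, h, y in examples]

def balance(examples, seed=0):
    """Downsample so both CD labels are equally frequent."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[2] == "contradiction"]
    neg = [e for e in examples if e[2] != "contradiction"]
    n = min(len(pos), len(neg))
    sample = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(sample)
    return sample
```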

3 Data Collection

The different datasets that we propose are divided into tasks, such that each task introduces a new linguistic construct. Each task is designed by applying structurally dependent rules to automatically generate the sentence pairs. We first define the pairs in a formal language and then use it to generate instances in natural language. In this paper, we have decided to work with English and Portuguese.

There are two main reasons to use a formal language as the basis for the dataset. First, this approach allows us to minimize the influence of common knowledge and lexical knowledge, highlighting structural features. Second, we can obtain structural symmetry between the corpora in the two languages.

Hence, our dataset is a tool to measure inference along two dimensions: one defined by the structural forms, which correspond to different levels in our hierarchical corpus; and the other defined by the instantiation of these forms in multiple natural languages.

3.1 Template Language

The template language is a formal language used to generate instances of contradiction and non-contradiction in a natural language. This language is built over two basic types of entities, people and places, and three binary relations. It is a simplistic universe, where the intended meanings of the binary relations are “x has visited y”, “x is taller than y” and “x is as tall as y”, respectively.

A realisation of the template language is a function that maps people and places to nouns, using disjoint sets of nouns for the two types of entities, and maps the relation symbols and logic operators to corresponding forms in some natural language.

Each task is defined by the introduction of a new structural or logical operator. We define the tasks in a hierarchical fashion: if a logical operator appears in task i, it can appear in any later task j (with j > i). The main advantage of our approach compared to other datasets is that we can isolate the occurrences of each operator and thus have a clear notion of what forces the models to fail (or succeed).

For each task, we provide training and test data with 10K and 1K examples, respectively. All data are balanced and, as usual, the model’s accuracy is evaluated on the test data. To test the model’s generalization capability, we have defined two distinct realization functions, one for the training data and one for the test data, whose noun vocabularies do not overlap. For example, in the English version the training realization maps people and places to common English masculine names and names of countries, respectively, while the test realization maps them to feminine names and names of cities of the United States. In the Portuguese version we have done a similar construction, using common masculine and feminine names together with names of countries and names of Brazilian cities.
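A minimal sketch of this setup, with illustrative noun pools standing in for the actual word lists used to build the corpus:

```python
# Two disjoint realizations: training draws people/places from one noun pool,
# test from a second pool that shares no nouns with the first.
import random

TRAIN_PEOPLE = ["Charles", "Joe", "Felix", "Ronnie", "Tyler"]      # masculine names
TRAIN_PLACES = ["Chile", "Japan", "Bolivia", "France", "Germany"]  # countries
TEST_PEOPLE = ["Jenny", "Sally", "Carla", "Lana"]                  # feminine names
TEST_PLACES = ["Boston", "Denver", "Austin", "Tucson"]             # US cities

assert not set(TRAIN_PEOPLE) & set(TEST_PEOPLE)   # vocabularies do not overlap
assert not set(TRAIN_PLACES) & set(TEST_PLACES)

def realize(split, rng):
    """Pick a (person, place) instantiation from the pool of the given split."""
    people, places = ((TRAIN_PEOPLE, TRAIN_PLACES) if split == "train"
                      else (TEST_PEOPLE, TEST_PLACES))
    return rng.choice(people), rng.choice(places)

person, place = realize("train", random.Random(0))
```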

3.2 Data Generation

A logical rule can be seen as a mapping that transforms a premise P into a conclusion C. To obtain examples of contradiction, we start with a premise P and define the hypothesis H as the negation of C. The examples of non-contradiction are negations that do not necessarily violate P. We repeat this process for each task. What differentiates one task from another is the introduction of new logical and linguistic operators and, subsequently, new rules. We have used more than one template pair to define each task; however, for the sake of brevity, the description below gives only a brief overview of each task.

The full dataset in both languages, together with the code to generate it, can be found online [Salvatore2019].

Task 1: Simple negation We introduce the negation operator ¬ (“not”). The premise P is a collection of facts about some agents visiting different places, for example “Charles has visited Chile, Joe has visited Japan”. The hypothesis H can be either the negation of one fact that appears in P (“Joe didn’t visit Japan”) or a new fact not related to P (“Lana didn’t visit France”). The number of facts that appear in P varies from two to twelve.
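A minimal sketch of a Task 1 generator under the representation above; the helper name and sentence templates are illustrative, not the paper’s actual generation code:

```python
# Premise: a conjunction of visiting facts. A contradiction negates one of those
# facts; a non-contradiction negates an (agent, place) pair absent from the premise.
import random

def make_task1_pair(people, places, rng, n_facts=4, contradiction=True):
    facts = []
    while len(facts) < n_facts:
        fact = (rng.choice(people), rng.choice(places))
        if fact not in facts:
            facts.append(fact)
    premise = ", ".join(f"{a} has visited {b}" for a, b in facts)
    if contradiction:
        a, b = rng.choice(facts)                      # negate a stated fact
    else:
        while True:                                   # negate an unrelated fact
            a, b = rng.choice(people), rng.choice(places)
            if (a, b) not in facts:
                break
    hypothesis = f"{a} didn't visit {b}"
    label = "contradiction" if contradiction else "non-contradiction"
    return premise, hypothesis, label
```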

Task 2: Boolean coordination In this task, we add the Boolean conjunction ∧, the coordinating conjunction “and”. For example, P = “Felix, Ronnie, and Tyler have visited Bolivia”. The hypothesis can state that one of the mentioned agents did not travel to a mentioned place (“Tyler didn’t visit Bolivia”), or it can present a new fact (“Bruce didn’t visit Bolivia”).

Task 3: Quantification By adding the quantifiers ∀ and ∃ (“for every” and “some”, respectively), we can construct examples of inferences that explicitly exploit the difference between the two basic types of entities, people and places. For example, P states a general fact about all people: “Everyone has visited every place”. H can be the negation of one particular instance of P (“Timothy didn’t visit El Salvador”) or a fact that does not violate P (“Timothy didn’t visit Anthony”).

Task 4: Definite description One way to test whether a model can capture reference is by using definite description, i.e., by adding the description operator ι and the equality relation =. Hence, a = ιx Q(x) is to be read as “a is the one that has property Q”. Here we describe one property of one agent and ask the model to combine the description with a new fact. For example, P = “Carlos is the person that has visited every place, Carlos has visited John”. Two hypotheses can be introduced: “Carlos did not visit Germany” or “John did not visit Germany”. Only the first hypothesis is a contradiction: although both names appear in the premise, the model is expected to relate the property “being the one that has visited every place” to “Carlos” and not to “John”.

Task 5: Comparatives In this task we are interested in whether the model can recognise a basic property of a binary relation: transitivity. The premise P is composed of a collection of simple facts such as “Francis is taller than Joe, Joe is taller than Ryan”. Assuming the transitivity of “taller than”, the hypothesis can be a consequence of P (“Francis is taller than Ryan”) or a fact that violates the transitivity property (“Ryan is taller than Francis”). The number of facts in P varies from four to ten. Negation is not employed here.
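The consequence test behind this task can be sketched as a transitive-closure check; the following is an illustration, not the paper’s generation code:

```python
# Given a chain of "taller than" facts, the transitive closure tells us which
# comparisons follow from the premise; asserting the reverse of an entailed
# comparison yields a contradiction.
def transitive_closure(pairs):
    """All (x, y) such that "x is taller than y" follows from the given pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for x, y in list(closure):
            for y2, z in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

facts = [("Francis", "Joe"), ("Joe", "Ryan")]          # Francis > Joe > Ryan
entailed = transitive_closure(facts)
assert ("Francis", "Ryan") in entailed                 # consequence of P
assert ("Ryan", "Francis") not in entailed             # its reverse contradicts P
```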

Task 6: Counting In Task 3 we added only the basic quantifiers ∀ and ∃, but there is a broader family of operators called generalised quantifiers. In this task we introduce the counting quantifier (“exactly n”). For example, P = “Philip has visited only three places and only two people”. H can be a piece of information consistent with P (“Philip has visited John”) or something that contradicts P (“Philip has visited John, Carla, and Bruce”). We have added counting quantifiers corresponding to the numbers from one to thirty.
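A minimal sketch of the counting check that underlies the contradictions in this task (the representation is illustrative):

```python
# If the premise bounds how many entities an agent has visited, a hypothesis
# that lists more distinct entities than the stated count is a contradiction;
# listing at most that many is consistent.
def contradicts_count(stated_count, mentioned):
    """True if the distinct mentioned entities exceed the stated count."""
    return len(set(mentioned)) > stated_count

# Premise: "Philip has visited only two people."
assert not contradicts_count(2, ["John"])                    # consistent
assert contradicts_count(2, ["John", "Carla", "Bruce"])      # contradiction
```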

Task 7: Mixed To avoid overexposing the model to the same linguistic structures, we created a dataset composed of samples from all the previous tasks.

Basic statistics for the English and Portuguese realisations of all tasks can be found in Table 1.

Tasks          Vocabulary size   Vocabulary intersection   Mean input length   Max input length
Task 1 (Eng)   3561              77                        230.6               459
Task 2 (Eng)   4117              128                       151.4               343
Task 3 (Eng)   3117              70                        101.5               329
Task 4 (Eng)   1878              62                        100.81              134
Task 5 (Eng)   1311              25                        208.8               377
Task 6 (Eng)   3900              150                       168.4               468
Task 7 (Eng)   3775              162                       160.6               466
Task 1 (Pt)    7762              254                       209.4               445
Task 2 (Pt)    9990              393                       148.5               388
Task 3 (Pt)    5930              212                       102.7               395
Task 4 (Pt)    5540              135                       91.8                140
Task 5 (Pt)    5970              114                       235.2               462
Task 6 (Pt)    9535              386                       87.8                531
Task 7 (Pt)    8880              391                       159.9               487
Table 1: Task description. Column 1 presents the labels for the two realizations of the described tasks, one in English (Eng) and the other in Portuguese (Pt). Column 2 presents the vocabulary size of the task. Column 3 presents the number of words that occur in both the training and test data. Column 4 presents the average length of the input text (the concatenation of P and H). Column 5 presents the maximum length of the input text.

4 Models and Evaluation

To evaluate accuracy on each CD task, we employed three kinds of models:

Baseline The baseline model (Base) is a Random Forest classifier that models the input text, the concatenation of P and H, as a Bag-of-Words. Since we constructed the dataset around the notion of structure-based contradictions, we expect it to perform only slightly better than random. At the same time, such a baseline lets us check whether the proposed tasks indeed require structural knowledge.
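A minimal sketch of such a baseline using scikit-learn; hyperparameter values are defaults rather than tuned ones:

```python
# Bag-of-Words Random Forest baseline: concatenate P and H, vectorize as word
# counts, classify, and report accuracy on the test data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

def bow_baseline(train_pairs, train_labels, test_pairs, test_labels):
    def concat(pairs):
        return [p + " " + h for p, h in pairs]
    model = make_pipeline(CountVectorizer(),
                          RandomForestClassifier(n_estimators=100))
    model.fit(concat(train_pairs), train_labels)
    return model.score(concat(test_pairs), test_labels)   # test accuracy
```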

Recurrent Models The dominant family of neural models in Natural Language Processing specialised in modelling sequential data is that of the Recurrent Neural Networks (RNNs) and their variants, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) [Goldberg2015]. We consider both the standard and the bidirectional variants of this family of models. Traditional multilayer recurrent models are not the best choice for improving NLI benchmarks [Glockner et al.2018]. However, recent work has reported that recurrent models perform better than Transformer-based models at capturing structural patterns for logical inference [Evans et al.2018, Tran et al.2018]. We want to investigate whether the same result holds when our tasks are used as the basis of comparison.

Transformer Based Models A recent non-recurrent family of neural models known as Transformer networks was introduced in [Vaswani et al.2017]. Differently from recurrent models, which recursively summarise all previous input into a single representation, the Transformer network employs a self-attention mechanism to attend directly to all previous inputs (more details of this architecture can be found in [Vaswani et al.2017]). Although regular training using this architecture alone does not yield surprising results in inference prediction [Evans et al.2018, Tran et al.2018], pre-training a Transformer network on the language modeling task and fine-tuning it afterwards on an inference task brings a significant improvement [Radford et al.2018, Devlin et al.2018].

Among the different Transformer based models, we focus our analysis on the multilayer bidirectional architecture known as Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al.2018]. This bidirectional model, pre-trained as a masked language model and as a next sentence predictor, has two versions, BERT-Base and BERT-Large, which differ in the size of the architecture (the number of layers and self-attention heads). Since BERT-Large is unstable on small datasets [Devlin et al.2018], we have used only BERT-Base.

The strategy to perform NLI classification using BERT is the same as the one presented in [Devlin et al.2018]: together with the pair (P, H) we add the special tokens [CLS] (classification token) and [SEP] (sentence separator). Hence, the textual input is the result of the concatenation [CLS] P [SEP] H [SEP]. After obtaining the vector representation of the [CLS] token, we pass it through a classification layer to obtain the predicted class (contradiction/non-contradiction). We fine-tune the model for the CD task in the standard way: the original weights are co-trained with the weights of the new layer.
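A minimal sketch of this setup, written against the current Hugging Face transformers API rather than the older pytorch-pretrained-BERT repository used in the paper; the bert-base-uncased checkpoint stands in for BERT-Base:

```python
# The tokenizer builds the input [CLS] P [SEP] H [SEP]; the [CLS] representation
# feeds a binary classification head that is co-trained with BERT's weights.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

premise = "Charles has visited Chile, Joe has visited Japan"
hypothesis = "Joe didn't visit Japan"
inputs = tokenizer(premise, hypothesis, return_tensors="pt")  # adds [CLS]/[SEP]

labels = torch.tensor([1])            # 1 = contradiction, 0 = non-contradiction
outputs = model(**inputs, labels=labels)
outputs.loss.backward()               # one fine-tuning step (optimizer omitted)
```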

By comparing BERT with the other models we are not only comparing different architectures but also different training techniques. The baseline model uses no additional information. The recurrent models use only a soft version of transfer learning, the fine-tuning of pre-trained embeddings (i.e., of one layer only). On the other hand, BERT is pre-trained on a large corpus as a language model. This pre-training is expected to help the model capture some general properties of language [Howard and Ruder2018]. Since the tasks that we propose are basic and cover very specific aspects of reasoning, we can use them to evaluate which properties are being learned in the pre-training phase.

The simplicity of the tasks motivated us to use transfer learning differently: instead of simply taking the multilingual version of BERT (a model trained on the concatenation of the entire Wikipedia of 100 languages, Portuguese included; see https://github.com/google-research/bert/blob/master/multilingual.md) and fine-tuning it on the Portuguese version of the tasks, we have decided to compare how differently pre-trained versions of the BERT model can be fine-tuned on the Portuguese corpus. This can be done because each pre-trained model comes with a tokenizer that transforms the Portuguese input into a collection of tokens that the model can process. Thus, we use the regular version of BERT trained on an English corpus (English BERT), the already mentioned Multilingual BERT, and the version of BERT trained on a Chinese corpus (Chinese BERT).

We hypothesize that the most basic logical patterns learned by the model in English can be transferred to Portuguese. By the same reasoning, we believe that Chinese BERT should perform poorly: not only will its tokenizer add noise to the input text, but Portuguese and Chinese are also grammatically different; for example, the latter is overwhelmingly right-branching while the former is more mixed [Levy and Manning2003].

4.1 Experimental settings

The experiments were done in two stages. In the first stage we evaluated the performance of the models in different ways:

  (a) We trained each model on different proportions of the dataset, by increasingly sampling the size of the training data. In this case, the training and test realizations use disjoint vocabularies, as described in Section 3.1.

  (b) To understand how much the different models rely on the occurrence of noun phrases, we also trained the models on a version of the dataset with full intersection of the training and test vocabulary, i.e., the training and test realizations use the same sets of nouns.

  (c) For the Portuguese corpus, we fine-tuned the three pre-trained models mentioned above: English BERT, Multilingual BERT, and Chinese BERT.

In the second stage, we trained the best model from the first stage on corrupted examples to check whether it is learning the intended structures rather than some unexpected text pattern. We proceeded by training the model on the following modified versions of the dataset: (Noise label) each pair (P, H) is unchanged but we randomly label the pair as contradiction or non-contradiction; (Premise only) we keep the labels and omit the hypothesis H; (Hypothesis only) we remove the premise P, but the labels remain intact.
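A minimal sketch of the three controls, assuming the data is stored as (premise, hypothesis, label) triples:

```python
# Three corrupted versions of the data used to probe what the model is learning.
import random

def noise_label(data, seed=0):
    """Keep every (P, H) pair, replace the label by a coin flip."""
    rng = random.Random(seed)
    return [(p, h, rng.choice(["contradiction", "non-contradiction"]))
            for p, h, _ in data]

def premise_only(data):
    """Keep the labels, drop the hypothesis."""
    return [(p, "", y) for p, _, y in data]

def hypothesis_only(data):
    """Keep the labels, drop the premise."""
    return [("", h, y) for _, h, y in data]
```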

4.2 Implementation and Hyperparameters

All deep learning architectures were implemented using the PyTorch library [Paszke et al.2017]. To make use of the pre-trained versions of BERT, we based our implementation on the public repository https://github.com/huggingface/pytorch-pretrained-BERT.

The different recurrent architectures were optimized with Adam [Kingma and Ba2014]. We used pre-trained word embeddings from GloVe [Pennington et al.2014] and fastText [Joulin et al.2016], as well as randomly initialized embeddings. We randomly searched over the embedding dimension, the hidden layer size of the recurrent model, the number of recurrent layers, the learning rate, the dropout rate, and the batch size. The hyperparameter search for BERT follows the one presented in [Devlin et al.2018], which uses Adam with learning rate warmup and linear decay; we randomly searched over the learning rate, the batch size, and the number of epochs.
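A minimal sketch of the random-search loop; the value grids below are placeholders (the actual ranges are part of the experimental configuration), and train_and_evaluate stands in for the real training routine:

```python
# Sample random configurations from a search space and keep the best one
# according to validation accuracy.
import random

SEARCH_SPACE = {                       # placeholder grids, for illustration only
    "embedding_dim": [100, 200, 300],
    "hidden_size": [128, 256, 512],
    "num_layers": [1, 2, 3],
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "dropout": [0.0, 0.2, 0.5],
    "batch_size": [32, 64, 128],
}

def random_search(train_and_evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_acc, best_cfg = 0.0, None
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        acc = train_and_evaluate(cfg)  # returns validation accuracy
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```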

All the code for the experiments is publicly available [Salvatore2019].

4.3 Results

For stage 1(a), in most of the tasks BERT presents a clear advantage over all other models. Tasks 3 and 6 are the only ones where the difference in accuracy between BERT and the recurrent models is small, as can be seen in Table 2. Even when we look at BERT’s results on the Portuguese corpus, which are slightly worse than the English ones, we still see a similar pattern. Pre-training plays a role here: when we fine-tuned BERT on the Portuguese version of Tasks 6 and 7, we achieved new, higher accuracies on both tasks.

Surprisingly, Chinese BERT is able to solve some of the simpler tasks but, overall, its performance across Tasks 1 to 7 is mediocre.

Base RNN GRU LSTM BERT
Task 1 (Eng) 52.1 50.1 50.6 50.4 99.8
Task 2 (Eng) 50.7 50.2 50.2 50.8 100
Task 3 (Eng) 63.5 50.3 66.1 63.5 90.5
Task 4 (Eng) 51.0 51.7 52.7 51.6 100
Task 5 (Eng) 50.6 50.1 50.2 50.2 100
Task 6 (Eng) 55.5 84.4 82.7 75.1 87.5
Task 7 (Eng) 54.1 50.9 53.7 50.0 94.6
Avg. (Eng) 53.9 55.4 58.0 56.2 96.1
Task 1 (Pt) 53.9 50.1 50.2 50.0 99.9
Task 2 (Pt) 49.8 50.0 50.0 50.0 99.9
Task 3 (Pt) 61.7 50.0 70.6 50.1 78.7
Task 4 (Pt) 50.9 50.0 50.4 50.0 100
Task 5 (Pt) 49.9 50.1 50.8 50.0 99.8
Task 6 (Pt) 58.9 66.4 79.7 67.2 79.1
Task 7 (Pt) 55.4 51.1 51.6 51.1 82.7
Avg. (Pt) 54.4 52.6 57.6 52.6 91.4
Table 2: Stage 1(a) results: accuracy percentage on the test data for the English and Portuguese corpora.

In stage 1(b), with full intersection of the vocabulary, we observed that the average accuracy improvement differs from model to model. This may indicate that the recurrent models rely more heavily on noun phrases than BERT does; however, since we have results pointing in opposite directions, more investigation is required.

Figure 1 shows the average accuracy over all tasks for different training-set sizes: BERT is the only model that improves when trained on more data. All other models remain close to random independently of the amount of training data.

The accuracy improvement over training size also indicates the difference in difficulty between tasks. On the one hand, Tasks 1, 2 and 4 are practically solved by BERT using only 4K training examples. On the other hand, the results for Tasks 3 and 6 remain below average, as seen in Figure 2.

In stage 2, taking BERT as the best classifier, we repeated the training using all the listed data modifications. The results, shown in Figure 3, indicate that BERT is not memorizing random textual patterns, nor excessively relying on information that appears only in the premise P or only in the hypothesis H: on these versions of the data, BERT behaves as a random classifier.

Figure 1: Stage 1 results: accuracy for each model on different data proportions (English corpus)
Figure 2: Stage 1 results: BERT’s accuracy on the different tasks (English corpus)
Figure 3: Stage 2 results: BERT’s accuracy on the different versions of the data (English corpus)

5 Discussion

The results that we have found here point in the opposite direction to the ones reported in [Evans et al.2018, Tran et al.2018]: BERT, a Transformer based model, is significantly more effective at capturing sentence structure than the recurrent models. We offer the following explanation. In both previous papers the Transformer models are trained from scratch, while here we use a pre-trained model. This difference seems to indicate that language model pre-training is of vital importance for obtaining structural knowledge.

Table 2 seems to confirm our initial hypothesis on the effectiveness of transfer learning in a cross-language fashion. We expected this result for the English-to-Portuguese transfer, but we did not expect the Chinese-to-Portuguese results on Tasks 1, 2 and 4. This can be explained by the following remarks. Take a contradiction pair defined in the template language:

  1. “x is the person that has visited everybody, x has visited y”

  2. “x didn’t visit z”

If we take one possible Portuguese realization of the pair above and apply the different tokenizers, we obtain the following strings:

  1. Original sentence: “[CLS] gabrielle é a pessoa que visitou todo mundo gabrielle visitou luís [SEP] gabrielle não visitou ianesis [SEP]”.

  2. Multilingual tokenizer: “[CLS] gabrielle a pessoa que visito u todo mundo gabrielle visito u lu s [SEP] gabrielle no visito u ian esis [SEP]”.

  3. English tokenizer: “[CLS] gabrielle a pe sso a que visit ou tod o mundo gabrielle visit ou lu s [SEP] gabrielle no visit ou ian esis [SEP]”.

  4. Chinese tokenizer: “[CLS] ga b rie lle a pe ss oa q ue vi sit ou to do mu nd o ga b rie lle vi sit ou lu s [SEP] ga b rie lle no vi sit ou ian es is [SEP]”.

Although the Portuguese words are broken apart by these tokenizers, the model is still able to learn, during the fine-tuning phase, the simple structural pattern relating the recurring name tokens in the strings above. This may also explain why Task 6 (Counting) presents the greatest difficulty for BERT: there is some structural grounding for finding contradictions in counting expressions, but to detect contradictions in all cases one must fully grasp the meaning of the different counting operators.
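The tokenizations above can be reproduced roughly as follows with the current transformers API; the checkpoint names are the standard public ones and may not match the exact models used in the paper:

```python
# Compare how the multilingual, English, and Chinese WordPiece tokenizers split
# the same Portuguese sentence.
from transformers import BertTokenizer

sentence = "gabrielle é a pessoa que visitou todo mundo gabrielle visitou luís"
for name in ["bert-base-multilingual-cased", "bert-base-uncased", "bert-base-chinese"]:
    tokenizer = BertTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sentence))
```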

6 Related Work

The use of synthetic benchmarks to measure inference is present in early works on NLI such as the FraCaS Test Suite [Consortium et al.1996] (a public version of this dataset has been made available by the Stanford NLP group: https://nlp.stanford.edu/~wcmac/downloads/fracas.xml). The move towards the creation of large and realistic datasets was very beneficial to the field, because real-life applications that deal with contradiction and inference must rely on linguistic and common background knowledge [Bentivogli et al.2009, Bar-Haim et al.2014, Marelli et al.2014, Bowman et al.2015a, Williams et al.2017, Khot et al.2018, Welleck et al.2018]. Our approach of isolating structural forms by using synthetic data to analyze the logical and syntactical competence of different neural models is similar to [Bowman et al.2015b, Evans et al.2018, Tran et al.2018]. One main difference between their approach and ours is that we are interested in using a formal language as a tool for performing a cross-language analysis.

7 Conclusion and Further Work

With the possibility of using pre-trained models, we can successfully craft small datasets (around 10K examples) to perform fine-grained analysis of machine learning models. In this paper, we have presented a new dataset that is able to isolate a few competence issues regarding structural inference. It also allows us to bring to the surface some interesting comparisons between recurrent and Transformer-based neural models. As our results show, compared to the recurrent models, BERT presents a considerable advantage in learning structural inference. The same result appears even when we fine-tune a version of the model that was not pre-trained on the target language.

Due to the stratified nature of our dataset, we can pinpoint BERT’s inference difficulties: there is room for improving the model’s understanding of counting. Hence, we can either craft a more realistic NLI dataset centered on the notion of counting or modify BERT’s training to achieve better results on Task 6. We plan to explore these paths in the future.

References
