CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension
We present a Chinese judicial reading comprehension (CJRC) dataset which contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts. The CJRC dataset can help researchers extract elements by reading comprehension technology. Element extraction is an important task in the legal field. However, it is difficult to predefine the element types completely due to the diversity of document types and causes of action. By contrast, machine reading comprehension technology can quickly extract elements by answering various questions from the long document. We build two strong baseline models based on BERT and BiDAF. The experimental results show that there is enough space for improvement compared to human annotators.
Law is closely related to people’s daily life. Almost every country in the world has laws, and everyone must abide by the law, thereby enjoying rights and fulfilling obligations. Tens of thousands of cases such as traffic accidents, private lending and divorce disputes occur every day, and many judgment documents are produced in the process of handling them. A judgment document is usually a summary of the entire case, involving the fact description, the court’s opinion, the verdict, etc. The relatively small number of legal staff and the uneven level of judges may lead to wrong judgments, and even the judgments in similar cases can sometimes be very different. Moreover, the large number of documents makes it challenging to extract information from them. Thus, it will be helpful to introduce artificial intelligence to the legal field to help judges make better decisions and work more effectively.
Researchers have already done a considerable amount of work on Chinese legal instruments, covering a wide variety of research topics. Law prediction [1, 20] and charge prediction [8, 13, 25] have been widely studied; in particular, CAIL2018 (Chinese AI and Law challenge, 2018) [22, 26] was held to predict the judgment results of legal cases, including relevant law articles, charges and prison terms. Other research includes text summarization for legal documents, legal consultation [15, 24] and legal entity identification. There also exist systems for similar-case search, legal document correction and so on.
Information retrieval usually only returns a batch of documents in a coarse-grained manner, so judges still need considerable effort to read them and extract information. Element extraction often requires pre-defining element types, and different element types need to be defined for different cases or crimes. Manual definition and labeling are time-consuming and labor-intensive. Neither of these two technologies can meet the requirements of fine-grained, unconstrained information extraction. By contrast, reading comprehension technology can naturally extract fine-grained and unconstrained information.
In this paper, we present the first Chinese judicial reading comprehension dataset (CJRC). CJRC consists of about 10K documents collected from http://wenshu.court.gov.cn/, published by the Supreme People's Court of China. We mainly extract the fact description from each judgment document and ask law experts to annotate four to five question-answer pairs based on the facts. Eventually, our dataset contains around 50K questions with answers. Since some questions cannot be answered directly from the fact description, we have asked law experts to annotate some unanswerable and yes/no questions, similar to the SQuAD2.0 and CoQA datasets (Figure 1 shows an example). Because civil and criminal judgment documents differ greatly in their fact descriptions, the corresponding types of questions are not the same. This dataset covers both types of documents and thereby covers most judgment documents, involving various types of charges and causes of action (in the following parts, we use casename to refer to both civil causes of action and criminal charges).
The main contributions of our work can be summarized as follows:
CJRC is the first Chinese judicial reading comprehension dataset to fill gaps in the field of legal research.
Our proposed dataset covers a wide range of areas, specifically 188 causes of action and 138 criminal charges. Moreover, the research results obtained with this dataset can be applied widely, for example to information retrieval and element extraction.
The performance of some powerful baselines indicates there is enough space for improvement compared to human annotators.
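Table 1: Comparison of CJRC with existing reading comprehension datasets.
|Dataset||Language||Number of Questions||Domain||Answer Type|
|CNN/Daily Mail||ENG||1.4M||News||Fill in entity|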
|RACE||ENG||870K||English Exam||Multi. choices|
|NewsQA||ENG||100K||CNN||Span of words|
|SQuAD||ENG||100K||Wiki||Span of words, Unanswerable|
|CoQA||ENG||127K||Children’s Sto. etc.||Span of words, yes/no, unanswerable|
|TriviaQA||ENG||40K||Wiki/Web doc||Span/substring of words|
|HFL-RC||CHN||100K||Fairy/News||Fill in word|
|DuReader||CHN||200K||Baidu Search/Baidu Zhidao||Manual summary|
|CJRC||CHN||50K||Law||Span of words, yes/no, unanswerable|
2 Related Work
2.1 Reading Comprehension Datasets
A number of datasets have emerged for research on machine reading comprehension (MRC). Among them, English reading comprehension datasets make up a large proportion. Almost every mainstream dataset is designed either to provide a corpus for a specific scene or domain, or to address one or more specific problems. CNN/Daily Mail and NewsQA cover the news domain, SQuAD 2.0 focuses on Wikipedia, and RACE concentrates on English reading comprehension examination questions for Chinese middle school students. SQuAD 2.0 mainly introduces unanswerable questions, reflecting the real situation that sometimes no suitable answer can be found in a given context. CoQA is a large-scale reading comprehension dataset which contains questions that depend on a conversation history. TriviaQA and SQuAD 2.0 pay attention to complex reasoning questions, for which the answer has to be inferred jointly from multiple sentences.
Compared with English datasets, Chinese reading comprehension datasets are quite rare. HFL-RC is the first Chinese cloze-style reading comprehension dataset, collected from People Daily and Children’s Fairy Tale. DuReader is an open-domain Chinese reading comprehension dataset based on Baidu Search and Baidu Zhidao. Our dataset is the first Chinese judicial reading comprehension dataset, and it contains multiple types of questions. Table 1 compares the above datasets with ours along four dimensions: language, scale of questions, domain, and answer type.
2.2 Reading Comprehension Models
Cloze-style and span-extraction are two of the most widely studied MRC tasks. Cloze-style models are usually designed as classification models that predict which word has the maximum probability. Generally, a model first encodes the query and the document separately into sequences of vectors, where each vector is the representation of one token. The subsequent operations differ between methods. Stanford Attentive Reader first obtains the query vector, and then uses it to calculate attention weights over all the contextual embeddings. The final document representation is computed as the attention-weighted sum of the contextual embeddings and is used for the final classification. Some other models [5, 19, 10] are similar to Stanford Attentive Reader.
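The attention step described above can be illustrated with a minimal sketch. It is not the original reader: a plain dot product stands in for the learned scoring function, the vectors are toy values, and the names `softmax` and `attentive_read` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_read(query_vec, context_vecs):
    """Return attention weights and the attention-weighted document vector.

    Assumption: dot-product scoring replaces the learned scoring function
    of the actual model, purely to keep the sketch short.
    """
    scores = [sum(q * c for q, c in zip(query_vec, ctx)) for ctx in context_vecs]
    weights = softmax(scores)
    dim = len(query_vec)
    doc_vec = [
        sum(w * ctx[d] for w, ctx in zip(weights, context_vecs))
        for d in range(dim)
    ]
    return weights, doc_vec


# Toy example: three context tokens with 2-dimensional embeddings.
query = [1.0, 0.0]
contexts = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, doc_vec = attentive_read(query, contexts)
```

In a full cloze-style model, `doc_vec` would then be passed to a classifier over the candidate words.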
Span-extraction reading comprehension models share the same goal of predicting the start position and the end position of the answer. Some classic models are R-Net, BiDAF and BERT. BERT is a powerful pre-trained model and performs well on many NLP tasks. It is worth noting that almost all the top models on the SQuAD 2.0 leaderboard are integrated with BERT. In this paper, we use BERT and BiDAF as two strong baselines. The gap between human performance and BERT is 15.2%, indicating that models still have considerable room for improvement.
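As an illustration of how start and end predictions are typically turned into an answer span, the following sketch searches for the highest-scoring valid pair. The function name `best_span`, the `max_len` limit and the score values are assumptions made for this example and are not taken from any of the cited models.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start_scores[start] + end_scores[end].

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy scores: taking each argmax independently would give start=2, end=0,
# which is not a valid span; the joint search avoids this.
start = [0.1, 0.2, 2.0, 0.3]
end = [1.9, 0.2, 0.4, 1.8]
span = best_span(start, end)
```

The joint search matters because the individually most likely start and end positions do not always form a valid span.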
3 CJRC: A New Benchmark Dataset
Our legal documents are all collected from China Judgments Online
In-domain and out-of-domain. Following the CoQA dataset, we divide the dataset into in-domain and out-of-domain parts. In-domain means that the data type of the test data also exists in the training set; conversely, out-of-domain means that it does not. Since the casename can be regarded as a natural segmentation attribute, we first determine which casenames should be included in the training set. The development set and the test set then contain both casenames that appear in the training set and casenames that do not. Finally, we obtain 8,000 cases in total for the training set and 1,000 cases each for the development set and the test set. For the development and test sets, the number of cases is the same whether the data is divided by civil and criminal or by in-domain and out-of-domain. The distribution of casenames in the training set is shown in Figure 3.
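A casename-based split of this kind could be sketched as follows. This is only an illustration under assumptions: the record fields (`id`, `casename`), the function `split_by_casename` and the toy cases are hypothetical and do not reflect the actual data format or tooling.

```python
import random


def split_by_casename(cases, train_casenames, n_eval_in_domain, seed=0):
    """Split cases into train, in-domain eval and out-of-domain eval parts.

    Cases whose casename is outside train_casenames can only be used for
    out-of-domain evaluation; a random subset of the remaining cases is
    held out for in-domain evaluation.
    """
    rng = random.Random(seed)
    in_domain = [c for c in cases if c["casename"] in train_casenames]
    out_of_domain = [c for c in cases if c["casename"] not in train_casenames]
    rng.shuffle(in_domain)
    eval_in_domain = in_domain[:n_eval_in_domain]
    train = in_domain[n_eval_in_domain:]
    return train, eval_in_domain, out_of_domain


# Toy cases; the casenames are illustrative only.
cases = [
    {"id": 1, "casename": "private lending dispute"},
    {"id": 2, "casename": "private lending dispute"},
    {"id": 3, "casename": "divorce dispute"},
    {"id": 4, "casename": "divorce dispute"},
    {"id": 5, "casename": "house lease contract dispute"},
]
train_names = {"private lending dispute", "divorce dispute"}
train, eval_in, eval_out = split_by_casename(cases, train_names, n_eval_in_domain=1)
```

The in-domain and out-of-domain evaluation parts would then each be divided further into development and test data.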
Annotate development and test sets. After splitting the dataset, we ask annotators to annotate two extra answers for each question of each example in development and test sets. We obtain three standard answers for each question.
Redefine the task. Through preliminary experiments, we discovered that the distinction between in-domain and out-of-domain is not obvious: the performance of a model trained on the training set is almost the same on in-domain and out-of-domain data, and the latter is sometimes even better. The possible reasons are as follows:
Casenames inside and outside the domain are similar. In other words, the corresponding cases involve similar issues. For example, two contract-related casenames, housing sales contract disputes and house lease contract disputes, may involve the same issues such as housing agency or housing quality.
Questions about time, place, etc. are more common. Moreover, due to the existence of the “similar casenames” phenomenon, the corresponding questions would also be similar.
|Split: Statistic||Civil||Criminal||Total|
|Train: Total Unanswerable Questions||617||617||1901|
|Train: Total Yes/No Questions||3015||2093||5108|
|Dev: Total Unanswerable Questions||685||561||1246|
|Dev: Total Yes/No Questions||404||251||655|
|Test: Total Unanswerable Questions||685||577||1262|
|Test: Total Yes/No Questions||392||245||637|
However, as we all know, there are remarkable differences between civil and criminal cases. As mentioned in the step “In-domain and out-of-domain”, the corpus can be divided either by domain or by type of case (civil and criminal). Although we no longer consider the division into in-domain and out-of-domain, it still makes sense to train a model that performs well on both civil and criminal data.
Adjust data distribution. Through preliminary experiments, we also discovered that the unanswerable questions are more challenging than the other two types of questions. To increase the difficulty of the dataset, we have increased the number of unanswerable questions in development set and test set. Related experiments will be presented in the experimental section.
After the above steps, we obtain the final data. Statistics of the data are shown in Table 2. The subsequent experiments are performed on the final data.
4.1 Evaluation Metric
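We use macro-average F1 as our evaluation metric, which is consistent with the CoQA competition. For each question, an F1 score is calculated against each standard human answer, and the maximum value is taken as the F1 score of the question. When assessing human performance, however, each standard answer has to be compared with the other standard answers to calculate the F1 score. In order to compare models with humans more fairly, the $N$ standard answers of a question are divided into $N$ groups, where each group contains $N-1$ answers (all answers except one). The F1 score of each question is then the average of the groups’ F1 scores, and the F1 score of the entire dataset is the average of all questions’ F1 scores. The formulas are as follows:

$$L_g = \mathrm{len}(A_{ref}), \qquad L_p = \mathrm{len}(A_{pred}), \qquad L_c = \mathrm{InterSec}(A_{ref}, A_{pred})$$

$$P = \frac{L_c}{L_p}, \qquad R = \frac{L_c}{L_g}, \qquad F1(A_{ref}, A_{pred}) = \frac{2 \cdot P \cdot R}{P + R}$$

$$\mathrm{Avg}_{F1} = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} F1\left(A_{ref}^{(j)}, A_{pred}\right)$$

where $A_{ref}$ denotes a standard answer, $A_{pred}$ denotes the answer predicted by the model, $\mathrm{len}(\cdot)$ calculates the length in characters, and $\mathrm{InterSec}(\cdot, \cdot)$ calculates the number of overlapping characters. $N$ represents the total number of references, and the maximum over $j \neq i$ represents that the predicted answer is compared with all standard answers except the current one in a single group, as described above.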
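The metric described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the official evaluation script: it assumes that overlap is counted over the multiset of characters, that an empty string stands for an unanswerable question, and the function names (`char_f1`, `question_f1`, `dataset_f1`) are hypothetical.

```python
from collections import Counter


def char_f1(reference, prediction):
    """Character-level F1 between one reference and one prediction."""
    if not reference or not prediction:
        # Assumption: two empty answers agree, otherwise the score is zero.
        return float(reference == prediction)
    overlap = sum((Counter(reference) & Counter(prediction)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def question_f1(references, prediction):
    """Average, over leave-one-out groups, of the best F1 within each group."""
    if len(references) == 1:
        return char_f1(references[0], prediction)
    group_scores = []
    for i in range(len(references)):
        group = references[:i] + references[i + 1:]
        group_scores.append(max(char_f1(ref, prediction) for ref in group))
    return sum(group_scores) / len(group_scores)


def dataset_f1(examples):
    """Macro-average: mean of per-question scores over (references, prediction) pairs."""
    scores = [question_f1(refs, pred) for refs, pred in examples]
    return sum(scores) / len(scores)


# Toy example with three references for a single question.
score = question_f1(["abcd", "xy", "xy"], "abcd")
```

In the toy example, the group that leaves out the matching reference scores 0 and the other two groups score 1, so the question-level score is 2/3 rather than 1. Applying the same leave-one-out grouping to model predictions is what keeps model scores comparable with the human scores.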
We implement and evaluate two powerful and typical model architectures: BiDAF (Seo et al.) and BERT (Devlin et al.). Both models are designed to deal with all three types of questions. Each model learns to predict a probability that is used to judge whether the question is unanswerable. In addition to this handling of unanswerable questions, we concatenate [YES] and [NO] as two tokens with the context for BERT, and concatenate “KYN” as three characters with the context for BiDAF, where ‘K’ denotes “Unknown”, meaning that the question cannot be answered from the context. Taking BiDAF as an example, during the prediction stage, if the start index is equal to 1, the model outputs “YES”, and if it is equal to 2, the model outputs “NO”.
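The decoding rule for the BiDAF variant can be sketched as follows. The mapping of start index 1 to “YES” and 2 to “NO” follows the description above; treating start index 0 (the ‘K’ character) as unanswerable and returning an empty string in that case are assumptions of this sketch, as are the function name `decode_answer` and the example context.

```python
PREFIX = "KYN"  # 'K' = unknown, 'Y' = yes, 'N' = no, prepended to the context


def decode_answer(context, start, end):
    """Map predicted indices over PREFIX + context to a final answer string.

    Indices are character positions in the augmented text, with end inclusive.
    """
    augmented = PREFIX + context
    if start == 0:
        # Assumption: pointing at 'K' marks the question as unanswerable.
        return ""
    if start == 1:
        return "YES"
    if start == 2:
        return "NO"
    return augmented[start:end + 1]


# Toy context; span indices are shifted by the length of the prefix.
context = "the loan was 5000 yuan"
span_start = len(PREFIX) + context.index("5000")
span_end = span_start + len("5000") - 1
answer = decode_answer(context, span_start, span_end)
```

Placing the special characters in front of the context lets the same start-position classifier cover span, yes/no and unknown answers without an extra output head.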
Some other implementation details: for BERT, we choose the Bert-Base Chinese pre-trained model
4.3 Result and Analysis
Experimental results on the test set are shown in Table 3. The table shows that BERT is 14.5 to 19 percentage points higher than BiDAF, and that human performance is 14.8 to 15.5 percentage points higher than BERT. This implies that models could be improved markedly in future research.
Experimental Effect of In-domain and Out-of-Domain
In this section, we explain why we no longer consider the division into in-domain and out-of-domain described in Section 3. We use the dataset before adjusting the data distribution and select the BERT model for verification. Note that we train only on civil data for “Civil”, only on criminal data for “Criminal”, and on all data for “Overall”, and that the case type of the development and test sets corresponds to that of the training corpus. It can be seen from Table 4 that the F1 score on out-of-domain data is even higher than that on in-domain data, which clearly does not match the expected outcome of separating in-domain and out-of-domain.
Comparisons of Different Types of Questions
Table 5 presents fine-grained results of models and humans on the development set and test set, where neither set has been adjusted. We observe that humans maintain high consistency on all types of questions, especially on the “YES” questions. The human agreement on criminal data is lower than that on civil data. This is partly because we annotated the criminal data first and therefore had more experience when annotating the civil data, which could result in a more consistent granularity of the selected segments on the “Span” questions.
Among the different question types, unanswerable questions are the hardest, and “NO” questions are the second hardest. We analyze why the performance on unanswerable questions is the lowest, and identify two possible causes: 1) the total number of unanswerable questions in the training set is small; 2) unanswerable questions are inherently more difficult than the others.
The first cause is easy to verify by inspecting the corpus. To verify the second, we compare the unanswerable questions with the “NO” questions. Table 6 shows some comparison data for the two types of questions. The first two rows show that unanswerable questions yield lower performance than “NO” questions on the criminal data, even though there are more unanswerable questions. This already suggests that unanswerable questions are harder. We have further experimented with increasing the number of unanswerable questions in the civil data of the training set. The last two rows in Table 6 demonstrate that increasing the quantity of unanswerable questions has a significant impact on performance. However, despite the larger number of unanswerable questions, they still score lower than “NO” questions.
The above experiments indicate that unanswerable questions are more challenging than other types of questions. To increase the difficulty of the corpus, we adjust the data distribution by controlling the number of unanswerable questions. The following section shows details about the influence of unanswerable questions.
|Number of Questions (Training set)||Number of Questions (Test set)||Performance (Test set)|
Influence of Unanswerable Questions
In this section, we discuss the impact of the number of unanswerable questions on the difficulty of the entire dataset. CJRC denotes the setting in which we increase the number of unanswerable questions only in the development and test sets, leaving the training set unchanged. CJRC+Train denotes adjusting all three sets. CJRC-Dev-Test denotes adjusting none of the sets. CJRC+Train-Dev-Test denotes increasing the number of unanswerable questions only in the training set. From Table 7, we can observe the following:
Increasing the number of unanswerable questions in the development and test sets can effectively increase the difficulty of the dataset. For BERT, the gap with human performance is 9.8% before adjustment, and it increases to 15.2% after adjustment.
By comparing CJRC+Train and CJRC (or comparing CJRC+Train-Dev-Test and CJRC-Dev-Test), we can conclude that BiDAF cannot handle unanswerable questions effectively.
Increasing the proportion of unanswerable questions in the development and test sets is more effective in increasing the difficulty of the dataset than reducing the number of unanswerable questions in the training set (this conclusion follows from comparing CJRC, CJRC+Train and CJRC-Dev-Test).
In this paper, we construct a benchmark dataset named CJRC (Chinese Judicial Reading Comprehension). CJRC is the first Chinese judicial reading comprehension dataset, and it can fill gaps in the field of legal research. In terms of question types, it involves three types of questions, namely span-extraction, YES/NO and unanswerable questions. In terms of case types, it contains civil data and criminal data, covering a variety of criminal charges and civil causes of action. We hope that research on this dataset can improve the efficiency of judges’ work. Integrating machine reading comprehension with information extraction or information retrieval would produce great practical value. We describe the construction process of the dataset in detail, with the aim of showing that the dataset is reliable and valuable. Experimental results illustrate that there is still considerable room for improvement on this dataset.
This work is supported by the National Key R&D Program of China under Grant No.2018YFC0832103.
- B., F., J.Z., P., M., K., A.Z., W.: A methodology for a criminal law and procedure ontology for legal question answering. In: Proceedings of EMNLP. pp. 198–214. Springer, Cham (2018)
- Chen, D., Bolton, J., Manning, C.D.: A thorough examination of the cnn/daily mail reading comprehension task. CoRR abs/1606.02858 (2016), http://arxiv.org/abs/1606.02858
- Cui, Y., Liu, T., Chen, Z., Wang, S., Hu, G.: Consensus attention-based neural networks for chinese reading comprehension (07 2016)
- Devlin, J., Chang, M., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
- Dhingra, B., Liu, H., Cohen, W.W., Salakhutdinov, R.: Gated-attention readers for text comprehension. CoRR abs/1606.01549 (2016), http://arxiv.org/abs/1606.01549
- He, W., Liu, K., Lyu, Y., Zhao, S., Xiao, X., Liu, Y., Wang, Y., Wu, H., She, Q., Liu, X., Wu, T., Wang, H.: Dureader: a chinese machine reading comprehension dataset from real-world applications. CoRR abs/1711.05073 (2017), http://arxiv.org/abs/1711.05073
- Hill, F., Bordes, A., Chopra, S., Weston, J.: The goldilocks principle: Reading children’s books with explicit memory representations (11 2015)
- Hu, Z., Li, X., Tu, C., Liu, Z., Sun, M.: Few-shot charge prediction with discriminative legal attributes. In: Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018. pp. 487–498 (2018), https://aclanthology.info/papers/C18-1041/c18-1041
- Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. CoRR abs/1705.03551 (2017), http://arxiv.org/abs/1705.03551
- Kadlec, R., Schmid, M., Bajgar, O., Kleindienst, J.: Text understanding with the attention sum reader network. CoRR abs/1603.01547 (2016), http://arxiv.org/abs/1603.01547
- Kanapala, A., Pal, S., Pamula, R.: Text summarization from legal documents: a survey. Artificial Intelligence Review pp. 1–32 (2017)
- Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.H.: Race: Large-scale reading comprehension dataset from examinations. CoRR abs/1704.04683 (2017), http://arxiv.org/abs/1704.04683
- Luo, B., Feng, Y., Xu, J., Zhang, X., Zhao, D.: Learning to predict charges for criminal cases with legal basis. In: Proceedings of EMNLP (2017)
- Natural Language Computing Group, M.R.A.: R-net: Machine reading comprehension with self-matching networks. In: Proceedings of ACL (2017)
- Quaresma, P., Rodrigues, I.P.: A question answer system for legal information retrieval (2005)
- Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: Unanswerable questions for squad. CoRR abs/1806.03822 (2018), http://arxiv.org/abs/1806.03822
- Reddy, S., Chen, D., Manning, C.D.: Coqa: A conversational question answering challenge. CoRR abs/1808.07042 (2018), http://arxiv.org/abs/1808.07042
- Seo, M.J., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. CoRR abs/1611.01603 (2016), http://arxiv.org/abs/1611.01603
- Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: Weakly supervised memory networks. CoRR abs/1503.08895 (2015), http://arxiv.org/abs/1503.08895
- Tran, A.H.N.: Applying deep neural network to retrieve relevant civil law articles. In: Proceedings of the Student Research Workshop Associated with RANLP 2017. pp. 46–48. INCOMA Ltd., Varna (Sep 2017), https://doi.org/10.26615/issn.1314-9156.2017_007
- Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., Suleman, K.: Newsqa: A machine comprehension dataset. CoRR abs/1611.09830 (2016), http://arxiv.org/abs/1611.09830
- Xiao, C., Zhong, H., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H., Xu, J.: CAIL2018: A large-scale legal dataset for judgment prediction. CoRR abs/1807.02478 (2018), http://arxiv.org/abs/1807.02478
- Yin, X., Zheng, D., Lu, Z., Liu, R.: Neural entity reasoner for global consistency in ner (2018)
- Zhang, N., Pu, Y.F., Yang, S.Q., Zhou, J.L., Gao, J.K.: An ontological chinese legal consultation system. IEEE Access p. 5:18250–18261 (2017)
- Zhong, H., Guo, Z., Tu, C., Xiao, C., Liu, Z., Sun, M.: Legal judgment prediction via topological learning. In: Proceedings of EMNLP (2018)
- Zhong, H., Xiao, C., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H., Xu, J.: Overview of CAIL2018: legal judgment prediction competition. CoRR abs/1810.05851 (2018), http://arxiv.org/abs/1810.05851