Charge-Based Prison Term Prediction with Deep Gating Network

Huajie Chen    Deng Cai   Wei Dai   Zehui Dai   Yadong Ding
NLP Group, Gridsum, Beijing, China
The Chinese University of Hong Kong
  Both authors contributed equally.

Judgment prediction for legal cases has attracted considerable research effort for its practical use, of which the ultimate goal is prison term prediction. While existing work merely predicts the total prison term, in reality a defendant is often charged with multiple crimes. In this paper, we argue that charge-based prison term prediction (CPTP) not only better fits realistic needs, but also makes the total prison term prediction more accurate and interpretable. We collect the first large-scale structured dataset for CPTP and evaluate several competitive baselines. Based on the observation that fine-grained feature selection is the key to achieving good performance, we propose the Deep Gating Network (DGN) for charge-specific feature selection and aggregation. Experiments show that DGN achieves state-of-the-art performance.

1 Introduction

Judgment prediction Kort (1957); Ulmer (1963); Segal (1984); Liu et al. (2004); Liu and Hsieh (2006) aims at automatically predicting the judgment result given a textual description of a legal case (An example is given in Figure 1). Recently, there has been a resurgent interest in this task due to the availability of more data and new machine learning techniques Luo et al. (2017); Zhong et al. (2018b); Hu et al. (2018).

Figure 1: An example of judgment prediction.

Judgment prediction can be decomposed into several sub-tasks: (a) relevant law article extraction Liu and Hsieh (2006); Liu and Liao (2005); Liu et al. (2015), (b) charge prediction Liu and Hsieh (2006); Luo et al. (2017); Hu et al. (2018), and (c) prison term prediction Zhong et al. (2018a). The dependencies among them have also been studied Zhong et al. (2018b). While effective methods exist for sub-tasks (a) and (b) (e.g., in the CAIL2018 competition Zhong et al. (2018a), both charge prediction and article prediction attained high accuracy), prison term prediction remains the performance bottleneck.

In this paper, we improve the accuracy of prison term prediction by decomposing it into a set of charge-based prison term predictions (CPTPs). In this way, more subtle and sophisticated interactions between textual description and a specific charge can be captured, resulting in more precise term predictions for individual charges. Meanwhile, CPTPs also shed light on the prediction of the total prison term.

On the other hand, CPTP poses new challenges. First, the case description can be very lengthy, and not all parts are relevant to a specific charge. Second, the charge-related descriptions are often interleaved, making it difficult to associate a specific charge with its corresponding information.

To address the above problems, we propose the Deep Gating Network (DGN) for gradually filtering and aggregating charge-specific information at different levels of granularity. Specifically, we stack multiple blocks, each consisting of an LSTM layer and a charge-specific gating layer, to generate a focused charge-based representation of the case description. Finally, the whole-document representation is obtained by a convolutional neural network.

To conduct the experiments, we construct a new dataset containing more than 200,000 criminal cases, which we make publicly available. To show the effectiveness of the proposed approach, we compare it with several strong baselines adapted from aspect-based sentiment classification Wang et al. (2016); Tang et al. (2016); Chen et al. (2017); Li et al. (2018). Experiments show that our method achieves significantly better results than all of them. In addition, when we leverage the charge-based term predictions for the total prison term prediction, our method also surpasses several strong baselines that are directly aimed at the total prison term prediction.

In summary, our contributions are as follows:

  • We formally define the task of charge-based prison term prediction and collect the first dataset for it.

  • We propose the Deep Gating Network (DGN). Experiments show our method achieves the state-of-the-art performance.

  • We show that the accuracy of the total term prediction is also improved by a simple heuristic integration of individual charge-based term predictions.

2 Problem Definition & Dataset Construction

We formally define the task of charge-based prison term prediction as follows. The input is a case description $x = (x_1, \ldots, x_n)$ and a set of corresponding charges $C = \{c_1, \ldots, c_k\}$, where $n$ and $k$ are the length of the case description and the number of charges, respectively. The goal is to predict the prison terms $\{y_1, \ldots, y_k\}$, where $y_i$ is the prison term corresponding to charge $c_i$.

To the best of our knowledge, there is no existing structured dataset for the above task. We thus collect and construct a dataset based on the published records from the Supreme People's Court of China, where each criminal case document includes the accusation by the procuratorate, the court view, and the result of judgment. Following Xiao et al. (2018), we take the accusation by the procuratorate as the input textual description. The charges and the corresponding prison terms are extracted from the result of judgment using regular expressions like "sentenced to ___ months imprisonment for ___". We obtain 238,749 well-structured cases in total (an example is given in Figure 1), which are further split into training, validation, and test sets. The statistics of the dataset are detailed in Table 1. The range of possible prison terms is [1, 240] (in months), and the dataset has a broad coverage of common charges, with 157 different charge types involved.
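As a rough illustration of this extraction step, the sketch below uses a hypothetical English-language template; the actual records are Chinese court documents, so the real patterns differ.

```python
import re

# Hypothetical English-language template for illustration only; the real
# judgments are written in Chinese, so the actual regular expressions differ.
PATTERN = re.compile(r"sentenced to (\d+) months imprisonment for ([\w\s]+?)(?=[,.;]|$)")

def extract_charge_terms(judgment_text):
    """Extract (charge, months) pairs from the result-of-judgment text."""
    return [(charge.strip(), int(months))
            for months, charge in PATTERN.findall(judgment_text)]
```

Cases where the pattern does not match cleanly would be discarded, which is one way such a structured dataset can be filtered for quality.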

        #single   #multiple   total
Train   147,580   42,420      190,000
Valid   19,350    5,602       24,952
Test    18,539    5,258       23,797
Table 1: The statistics of the proposed dataset. #single/#multiple denotes cases with a single charge / multiple charges.

3 Our Approach

Figure 2: The architecture of DGN

Figure 2 gives an overview of our model, which consists of two components: (1) Deep Gating Network (DGN) for charge-based feature filtering and aggregation. (2) Convolutional Neural Network (CNN) for the whole document representation learning.

3.1 Deep Gating Network

At the bottom layer of DGN, each word is mapped into a low dimensional vector according to a word embedding table.

DGN then constructs charge-specific representations gradually: $L$ identical blocks are hierarchically stacked, where the $l$-th block takes the output of the $(l-1)$-th block as input. Each block transforms its input semantic vectors into more sophisticated and focused representations based on gated feature selection and combination.

Specifically, each gating block consists of a bi-LSTM layer for context aggregation and a gating layer for charge-specific feature filtering:

$$\mathbf{h}_1^l, \ldots, \mathbf{h}_n^l = \text{bi-LSTM}(\mathbf{v}_1^{l-1}, \ldots, \mathbf{v}_n^{l-1})$$
$$\mathbf{v}_i^l = \mathbf{g}_i^l \odot \mathbf{h}_i^l$$

where $\mathbf{g}_i^l$ is the gate for the $i$-th vector of the $l$-th gating block and $\odot$ denotes element-wise multiplication. $\mathbf{g}_i^l$ is computed as:

$$\mathbf{g}_i^l = \sigma(\mathbf{W}^l [\mathbf{h}_i^l; \mathbf{e}_c] + \mathbf{b}^l)$$

where $\mathbf{e}_c$ is the target charge embedding, $[\cdot;\cdot]$ denotes concatenation, and $\sigma$ is the sigmoid function. The gating layers can thus select charge-specific features according to the target charge embedding.
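A minimal NumPy sketch of one such gating layer follows (the bi-LSTM context layer is omitted, and all parameter shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gating_layer(H, e_c, W, b):
    """One charge-specific gating layer (sketch; bi-LSTM context layer omitted).

    H   : (n, d) hidden states produced by this block's bi-LSTM
    e_c : (d_c,) target charge embedding
    W, b: trainable parameters with hypothetical shapes (d + d_c, d) and (d,)
    """
    n = H.shape[0]
    # build [h_i; e_c] for every position by tiling the charge embedding
    He = np.concatenate([H, np.tile(e_c, (n, 1))], axis=1)
    G = sigmoid(He @ W + b)   # gates in (0, 1), conditioned on the charge
    return G * H              # element-wise charge-specific feature filtering
```

Because each gate lies strictly in (0, 1), the output features are always attenuated versions of the bi-LSTM states, never amplified.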

3.2 Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been effective in modeling sequential data Kim (2014); Hu et al. (2014); Pang et al. (2016). A CNN uses convolution operations (with multiple groups of filters) for $n$-gram feature extraction. The sequence-level representation is then obtained through max-pooling, where the most salient $n$-gram features are detected and selected.

In this work, we use a CNN with multiple filter widths; the number of filters for each width is 256. We concatenate the outputs of the different filters to form the final document representation $\mathbf{d}$.
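The convolution-and-pooling step can be sketched as follows (filter widths, counts, and parameter shapes here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def conv_max_pool(X, filters):
    """Multi-width convolution + max-over-time pooling (sketch).

    X       : (n, d) token representations from the DGN
    filters : list of (width, W) pairs; each W has hypothetical shape
              (width * d, m) for m filters of that width
    """
    pooled = []
    for width, W in filters:
        # slide a window of `width` tokens, flatten it, apply filters + ReLU
        windows = np.stack([X[i:i + width].reshape(-1)
                            for i in range(X.shape[0] - width + 1)])
        conv = np.maximum(windows @ W, 0.0)
        pooled.append(conv.max(axis=0))   # keep the most salient n-gram feature
    return np.concatenate(pooled)          # final document representation
```

Max-over-time pooling makes the representation length-invariant: however long the case description, each filter contributes a single scalar.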

3.3 Output and Training

The charge-specific document representation $\mathbf{d}$ is passed to a fully connected layer with ReLU activation for the final prediction:

$$\hat{y} = \text{ReLU}(\mathbf{w}^\top \mathbf{d} + b) \quad (1)$$

where $\mathbf{w}$ and $b$ are trainable parameters.

Since the Mean Squared Error (MSE) loss cannot reflect the relative deviation ratio between the prediction and the ground truth, we take the logarithm before estimating their difference:

$$d = \log(\hat{y} + 1) - \log(y + 1) \quad (2)$$

To alleviate the impact of outliers and stabilize the training, we use the Huber loss Huber (1964) on this log-space difference, with the threshold $\delta$ set to 1 in our experiments:

$$\mathcal{L} = \begin{cases} \frac{1}{2} d^2 & \text{if } |d| \le \delta \\ \delta \left( |d| - \frac{1}{2} \delta \right) & \text{otherwise} \end{cases}$$
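A minimal implementation of this log-space Huber loss for a single prediction (the parameter name `delta` is our choice):

```python
import math

def log_huber(pred, gold, delta=1.0):
    """Huber loss on the log-space difference, so the penalty reflects the
    relative deviation ratio rather than the absolute error in months."""
    d = math.log(pred + 1.0) - math.log(gold + 1.0)
    if abs(d) <= delta:
        return 0.5 * d * d                   # quadratic near the target
    return delta * (abs(d) - 0.5 * delta)    # linear tail dampens outliers
```

Under this loss, over-predicting 2 months on a 3-month sentence costs far more than over-predicting 2 months on a 100-month sentence, matching the relative-error spirit of the evaluation metric.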

Total Term Prediction

Although our model is trained to predict the prison term for a specific charge, it can be readily adapted to predict the total term by a simple heuristic integration of the individual charge-based predictions. There are certain regulations for the combined punishment of multiple crimes in Chinese legislation. For simplicity, we take the average of the maximum and the summation of the individual charge-specific term predictions. The total term prediction is also capped at 240 months.
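The heuristic above can be sketched as follows (a minimal illustration of the averaging rule, not the actual statutory combination rules):

```python
def total_term(charge_terms, cap=240):
    """Combine per-charge term predictions (in months) into a total term:
    the average of the maximum and the sum, capped at `cap` months."""
    combined = (max(charge_terms) + sum(charge_terms)) / 2.0
    return min(combined, cap)
```

For a single-charge case the maximum equals the sum, so the heuristic reduces to the charge-specific prediction itself.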

4 Experiments

4.1 Evaluation Metrics

For evaluation, we adopt the official score function (S metric) of the CAIL2018 competition Zhong et al. (2018a). The score function measures the log difference between the predicted value and the gold value as in Eq 2; the final score is a piecewise function that increases monotonically as this difference shrinks. For more details about the S metric, we refer interested readers to Zhong et al. (2018a). We also report the exact match (EM) rate and the error-tolerant accuracy Acc@$\epsilon$, where $\epsilon$ is the maximum acceptable error rate. Formally, a prediction $\hat{y}$ is considered correct if and only if its value lies in the range $[y(1-\epsilon), y(1+\epsilon)]$.
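The EM and Acc@ε metrics can be computed as follows (a simple sketch; the S metric itself is defined by the competition organizers and omitted here):

```python
def exact_match(preds, golds):
    """Fraction of predictions that equal the gold term exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def acc_at(preds, golds, eps):
    """Error-tolerant accuracy Acc@eps: a prediction counts as correct
    iff it lies in [y * (1 - eps), y * (1 + eps)]."""
    hits = sum(g * (1 - eps) <= p <= g * (1 + eps)
               for p, g in zip(preds, golds))
    return hits / len(golds)
```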

4.2 Compared Methods

The task of charge-based prison term prediction is similar in spirit to aspect-based sentiment classification Pang et al. (2008), where multiple classification decisions are made given one text description and different target entities. This suggests that neural architectures proposed for aspect-based sentiment classification may also be suitable for our task. The adaptation from classification to regression can be accomplished by simply replacing the original final layer with that of Eq 1. Specifically, we adapted the following models:

  • ATAE-LSTM Wang et al. (2016): it concatenates aspect embedding and the output of LSTM, and uses self-attention to obtain aspect-based representation.

  • MemNet Tang et al. (2016): it uses multi-hop attention over the word embeddings for a sentence, where aspect embedding is regarded as the initial key.

  • RAM Chen et al. (2017): it also uses multi-hop attention for aspect-specific representation learning, while the attention results at different hops are aggregated by a recurrent neural network.

  • TNet Li et al. (2018): it has a similar architecture to DGN. The major difference is that it employs a Transformation Network for mixing the information in aspect embedding and token representations rather than the explicit gates in our model.

The aspect embedding in the above models is replaced by the charge embedding in our experiments. In addition, we also compare with popular models for total term prediction Zhong et al. (2018a, b):

  • CNN Kim (2014): the case description is encoded by a CNN with multiple filter widths, followed by max-pooling.

  • RNN Hochreiter and Schmidhuber (1997): a bi-LSTM is used for case description encoding, and the final states are regarded as the document representation.

  • RCNN Lai et al. (2015): we stack a CNN on top of the LSTM states for the final representation.

4.3 Main Results

Model       S      EM    Acc@0.1  Acc@0.2
ATAE-LSTM   66.49  7.72  16.12    33.89
MemNet      70.23  7.52  18.54    36.75
RAM         70.32  7.97  18.87    37.38
TNet        73.94  8.06  19.55    39.89
DGN         76.48  8.92  20.66    42.61
Table 2: Results on charge-based prison term prediction (%).
Model   S      EM    Acc@0.1  Acc@0.2
CNN     67.24  8.41  16.96    35.58
RNN     67.27  8.04  16.79    35.11
RCNN    69.56  8.54  17.57    35.75
DGN     75.74  8.64  19.32    40.43
Table 3: Results on total term prediction (%).

The results of charge-based prison term prediction are shown in Table 2. The proposed DGN achieves the best results on all four metrics, and the margins between our model and the others are remarkably wide. The aspect-based sentiment models only give moderate performance, which we attribute to the fact that the case description is so long that more rigorous feature selection, such as the gating in DGN, is needed. Our model selects and aggregates features in an explicit way, which is more efficient and effective in dealing with the charge-specific descriptions that are often spread out across lengthy case documents in CPTP.

Table 3 presents the results of the total term prediction. Although our method is not directly trained to make the final prediction, the performance of our model surpasses all baselines, which confirms that the breakdown charge-based analysis can indeed help the total prison term prediction.

4.4 Depth of DGN

Figure 3: Performance with different depths of DGN.

To study the impact of the number of DGN blocks, we test our model with various depths and show the results in Fig 3 (for simplicity, we only show the S score and Acc@0.2). As shown, the performance improves as the depth of DGN increases until it reaches 3, after which it begins to drop, likely due to overfitting.

4.5 Effects of Log Huber Loss

Figure 4: Comparison of different loss functions. The result of LHL is taken as unit 1.

We compare the Log Huber Loss (LHL) with the Mean Squared Error (MSE), Mean Absolute Error (MAE), and Huber Loss (HL). We also tried the Log Cosh Loss (LCL), but it did not converge. As shown in Figure 4, the Log Huber Loss performs best on all metrics for all models, with the improvement most significant on the S metric. This suggests that making the loss function consistent with the evaluation metric is beneficial.

4.6 Error Analysis

So far, our model achieves the best results on prison term prediction. In this section, we conduct an in-depth analysis to answer the following questions: (1) In which cases does our model fail to deliver accurate predictions? (2) What are the prospects for further improvement? After carefully analyzing 100 examples, we roughly classify the errors into the following categories.

Lengthy Description

Some cases are extremely complicated, especially for cases with gangs. These descriptions are often lengthy and involve multiple criminal suspects.

Incomplete Information

In some cases, the input case description does not contain sufficient information for precise prediction. Note that we only take the accusation by the procuratorate as input, which is incomplete compared with the full materials relevant to a case. For example, if a defendant is a recidivist who reoffended within a short period, he/she shall be given a heavier punishment.

Rare Cases

Some special circumstances influence the prison term yet rarely appear in the training set. For example, if a defendant causes injuries to others due to excessive self-defense, he/she shall be given a lighter punishment. Such knowledge is easily understood by humans, but hard for machine learning models to learn.

5 Ethical Discussions

Although research on prison term prediction has considerable potential to improve efficiency and fairness in criminal justice, certain ethical concerns are worth discussing.

First, does the training data provide unbiased and sufficient examples? For example, some may worry that the model may treat people differently based on race, social class, age, and so on Tonry (2014). Past discrimination may be learned by models. Also, as society develops, new forms of crime will appear, and a model trained on historical data may fail on these new cases.

Second, is the learned system robust enough? Some subtle details may significantly affect the result of judgment. For example, numerical values such as the amount of theft or the quantity of drugs are often expressed non-uniformly across case descriptions, which makes them hard for neural models to learn. Some infrequent words, such as named entities, may also cause undesirable interference.

Mistakes in legal judgment are serious: they can mean people losing years of their lives in prison, or dangerous criminals being released to reoffend. We should guard against judges' over-dependence on the system and carefully consider its application scenarios. In practice, we recommend deploying our system in the "Review Phase", where other judges check the judgment made by a presiding judge; our system can serve as one anonymous checker.

In summary, judgment prediction is an emerging technology at its exploratory stage. We should be aware of the risks and prevent any inappropriate use of the technology.

6 Conclusion

In this paper, we formally presented the task of charge-based prison term prediction and introduced the first large-scale dataset for it. To tackle the noisy and entangled descriptions of legal cases, we proposed the Deep Gating Network for charge-specific information filtering. Experiments show that our model significantly improves the accuracy of charge-based prison term prediction, as well as of the total term prediction. Finally, we discussed some ethical concerns about the proposed techniques that warrant careful consideration.


References

  • P. Chen, Z. Sun, L. Bing, and W. Yang (2017) Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 452–461.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pp. 2042–2050.
  • Z. Hu, X. Li, C. Tu, Z. Liu, and M. Sun (2018) Few-shot charge prediction with discriminative legal attributes. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 487–498.
  • P. J. Huber (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics 35 (1), pp. 73–101.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751.
  • F. Kort (1957) Predicting Supreme Court decisions mathematically: a quantitative analysis of the "right to counsel" cases. American Political Science Review 51 (1), pp. 1–12.
  • S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • X. Li, L. Bing, W. Lam, and B. Shi (2018) Transformation networks for target-oriented sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 946–956.
  • C. Liu, C. Chang, and J. Ho (2004) Case instance generation and refinement for case-based criminal summary judgments in Chinese.
  • C. Liu and C. Hsieh (2006) Exploring phrase-based classification of judicial documents for criminal charges in Chinese. In International Symposium on Methodologies for Intelligent Systems, pp. 681–690.
  • C. Liu and T. Liao (2005) Classifying criminal charges in Chinese for web-based legal services. In Asia-Pacific Web Conference, pp. 64–75.
  • Y. Liu, Y. Chen, and W. Ho (2015) Predicting associated statutes for legal problems. Information Processing & Management 51 (1), pp. 194–211.
  • B. Luo, Y. Feng, J. Xu, X. Zhang, and D. Zhao (2017) Learning to predict charges for criminal cases with legal basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2727–2736.
  • B. Pang, L. Lee, et al. (2008) Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1–2), pp. 1–135.
  • L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng (2016) Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
  • J. A. Segal (1984) Predicting Supreme Court cases probabilistically: the search and seizure cases, 1962–1981. American Political Science Review 78 (4), pp. 891–900.
  • D. Tang, B. Qin, and T. Liu (2016) Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 214–224.
  • M. Tonry (2014) Legal and ethical issues in the prediction of recidivism. Federal Sentencing Reporter 26 (3), pp. 167–176.
  • S. S. Ulmer (1963) Quantitative analysis of judicial processes: some practical and theoretical applications. Law & Contemporary Problems 28, pp. 164.
  • Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615.
  • H. Zhong, C. Xiao, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, and J. Xu (2018a) Overview of CAIL2018: legal judgment prediction competition. arXiv:1810.05851.
  • H. Zhong, G. Zhipeng, C. Tu, C. Xiao, Z. Liu, and M. Sun (2018b) Legal judgment prediction via topological learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3540–3549.