Creating Auxiliary Representations from Charge Definitions for Criminal Charge Prediction


Liangyi Kang1, Jie Liu1, Lingqiao Liu2, Qinfeng Shi2, and Dan Ye1
1 Institute of Software, Chinese Academy of Sciences, Beijing, China
2 School of Computer Science, The University of Adelaide, Australia
{kangliangyi15, ljie, yedan}, {lingqiao.liu, javen.shi}

Charge prediction, determining charges for criminal cases by analyzing textual fact descriptions, is a promising technology for legal assistant systems. In practice, fact descriptions can exhibit significant intra-class variation due to factors such as non-normative use of language, which makes the prediction task very challenging, especially for charge classes with too few samples to cover the variation in expression. In this work, we explore using the charge definitions from criminal law to alleviate this issue. The key idea is that the expressions in a fact description should have corresponding formal terms in the charge definitions, and that those terms are shared across classes and can account for the diversity in the fact descriptions. Thus, we propose to create auxiliary fact representations from charge definitions to augment the fact description representation. The auxiliary representations are created through the interaction of the fact description with the relevant charge definitions, and with the terms in those definitions, via an integrated sentence- and word-level attention scheme. Experimental results on two datasets show that our model achieves significant improvement over baselines, especially for classes with few samples.


Introduction

The task of charge prediction is to determine appropriate charges, such as theft, seizing or robbery, for criminal cases by analyzing the textual fact descriptions. Automating charge prediction with NLP technology could significantly reduce the human labor involved in organizing legal documents, and could be practically useful for online legal assistant systems.

Existing methods formulate the charge prediction task as a text classification problem, aiming to learn a representation of the fact description for prediction. Conventional methods [liu2005classifying, liu2006exploring, lin2012exploiting, sulea2017exploring] design shallow text features to represent fact descriptions. More recently, deep learning has provided end-to-end models that learn fact representations from fact descriptions [luo2017learning, hu2018few, zhong2018legal] and achieve state-of-the-art results.

In practice, the fact description in a criminal case is written by prosecutors, lawyers, or defendants to state the details of the case, and it often contains a substantial amount of diverse, non-normative language. For example, the robbery cases in Figure 1 all involve “theft”, but the legal term “theft” may be expressed implicitly, as in “stole an electric vehicle” or “came forward to ride away Ke’s white Merida bicycle”. Consequently, the representations of fact descriptions may exhibit considerable intra-class variation, which may lead to prediction failure at the test stage. This can be more pronounced for charge classes with only a few examples, since the samples are not sufficient for learning a predictive model that is robust to expression variation.

To address this issue, we introduce the charge definitions from criminal law to create more robust fact representations for charge prediction. We propose to create auxiliary fact representations from the charge definitions to augment the fact representation. These auxiliary representations are essentially projections of the fact description into the semantic space of the charge definitions. Our motivation is that the expressions in a fact description should have corresponding formal terms in the charge definitions, and those formal terms can provide an alternative view of the expressions in the fact description. Note that many of those formal terms are shared across charge classes and are less diverse. Thus, using elements of the charge definitions to re-interpret the fact description and generate auxiliary representations can potentially account for the diversity in fact descriptions.

Figure 1: Illustration of our method. The related charges are identified (indicated by the red arrow) via sentence-level attention and aggregated to create the first auxiliary representation, the charge-related fact representation. Key words in the cases are then aligned to terms in the identified charge definitions via word-level attention (aligned words are labeled with the same color), and the aligned terms form the second auxiliary representation, the charge-token-related fact representation.

Specifically, we design an integrated sentence- and word-level interaction model to generate two auxiliary fact representations. We identify the relevant charge definitions through sentence-level interaction between the fact description and the charge definitions, and then aggregate the holistic features of the relevant charge definitions to create the first auxiliary representation, named the charge-related fact representation. The relevant charge definitions identified while producing the first auxiliary representation also serve to create the second one. To create the second representation, we further consider finer-grained word-level interaction between the fact description and the identified charge definitions. Relevant words from the relevant charge definitions are attended to and aggregated through a recurrent neural network to generate the second auxiliary representation, named the charge-token-related fact representation. We illustrate our model with the example in Figure 1. Case 1 and case 2 in Figure 1 belong to the same charge class, robbery, but are described with different expressions. With the proposed method, they are first related to the charge definition of robbery. Then the statements “stole an electric vehicle” and “took out a knife to poke the victim” in case 1, and “came forward to ride away Ke’s white Merida bicycle” and “used fist wounding Ke’s head” in case 2, are softly aligned to the terms “theft” and “use violence” in the robbery definition through attention. By reinterpreting the fact descriptions through the aligned terms, the two cases become more similar. The final charge prediction is based on the concatenation of the original and auxiliary fact representations, and one can expect a prediction made on this fact representation to be more robust.

To investigate the advantage of our method on charge prediction, we conduct experiments on two datasets, which consist of criminal cases collected from China Judgment Online. Experimental results show that our model achieves significant improvement over baselines, especially on classes with few samples. We also conduct ablation studies to analyze the effectiveness of each component in our model, and visualize the impact of introducing charge definitions.

Related Works

Charge Prediction

Charge prediction has been studied for years, with a focus on learning representations of the fact descriptions in criminal cases and feeding them into classifiers to make the judgment. At the early stage, [liu2005classifying, liu2006exploring, lin2012exploiting, sulea2017exploring] attempt to extract shallow text features from fact descriptions or create hand-crafted features to represent fact descriptions, which are hard to generalize to large datasets due to the diverse expression of fact descriptions. Inspired by the success of deep learning, [luo2017learning, ye2018interpretable, hu2018few, zhong2018legal] employ neural models with external information to capture high-level semantic information. [zhong2018legal] propose the LJP method, which models multiple legal subtasks as a Directed Acyclic Graph (DAG) and uses multi-task learning to assist prediction. Further, [luo2017learning] use a separate two-stage scheme to extract the related articles and then attend to them when building the fact representation for charge prediction. [ye2018interpretable] design 10 legal attributes to help few-shot charge prediction. However, existing charge prediction models all need a large amount of feature engineering, either to design features or to build relations between subtasks. Instead, we augment the fact representation by creating auxiliary representations from charge definitions in an end-to-end fashion.

Attention and Memory

Our model is also related to attention and memory in deep learning [bahdanau2014neural, vaswani2017attention, sinha2018hierarchical, weston2014memory, wang2018target, ebesu2018collaborative]. Although researchers have proposed various neural architectures with memory and attention for NLP problems [kumar2016ask, wang2017gated, gao2019hybrid], they consider either only sentence-level or only word-level alignment between sentences. In contrast, we combine the two to form auxiliary representations: sentence-level interaction identifies the relevant charges, and a finer-grained word-level interaction on top of the identified charge definitions makes the generated fact representation more robust.

Figure 2: The architecture of our model. The fact description encoder embeds the fact description into the original fact representation Fc. The right part shows the creation of the first auxiliary representation Fs: an attentive charge aggregator is applied iteratively to identify related charges, which are then aggregated to generate Fs. The left part shows the creation of the second auxiliary representation Fw: on top of the identified charge definitions, each word in a fact description is represented by a combination of the terms in the related charge definitions. The combined intermediate representations are aggregated through a GRU to generate Fw. Finally, Fc, Fs and Fw are concatenated to form the final fact representation.

The proposed model

Problem Formulation

Charge prediction aims to predict the corresponding charges for a given fact description, which consists of a sequence of words. Its label is a multi-hot vector with one dimension per charge, since a fact description may correspond to multiple charges. The charge definition of each charge is likewise represented as a sequence of words.


To generate a robust fact representation for prediction, we propose an integrated sentence- and word-level interaction model. The architecture of our model is shown in Figure 2. As seen, the final fact representation is the concatenation of three representations.

  • The original fact representation (Fc), derived from the fact description only and obtained by the fact description encoder.

  • The auxiliary representation I, the charge-related fact representation (Fs), aggregated from the holistic representations of the related charge definitions, which are identified via sentence-level interaction between the fact and the charge definitions.

  • The auxiliary representation II, the charge-token-related fact representation (Fw), built on top of the identified charge definitions and created by finer-grained word-level interaction between the fact and the identified charge definitions.

Fact Description Encoder

Given a fact description represented by a sequence of word embeddings, we use a Gated Recurrent Unit (GRU) [cho2014learning] to create a sequence of hidden states, one per time step, each encoding the contextual information of the corresponding word.

In a fact description, the words, and consequently their hidden states, do not contribute equally to conveying the semantic meaning of the fact, and a long fact description contains many less informative words. To suppress the negative impact of the non-informative words, we use an attention mechanism with trainable parameters to assign each hidden state an importance weight. The holistic representation of the original fact description is then computed as the attention-weighted sum of the hidden states.
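As an illustration only, the attention pooling described above can be sketched as follows. The scoring form (a tanh projection followed by a dot product with a context vector) and the names attention_pool, W and v are assumptions of this sketch; they are not taken from the paper's equations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, v):
    """Return (weights, pooled) for a list of hidden-state vectors.

    Assumed scoring form: score_t = v . tanh(W h_t).
    W is a list of rows, v is a context vector (both hypothetical names).
    """
    scores = []
    for h in hidden_states:
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(a * b for a, b in zip(v, projected)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states gives the holistic representation.
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states)) for k in range(dim)]
    return weights, pooled
```

The weights sum to one, so the pooled vector stays in the convex hull of the hidden states; uninformative words simply receive small weights.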


Charge Definitions Encoder

Each charge class is associated with a charge definition. We use a CNN [kim2014convolutional] with a fixed window size to encode the sequence of words in a definition into a sequence of vectors. Since we deal with a large number of charge definitions, using a CNN gives us better training efficiency. We then sum these vectors to create the holistic representation of each charge definition.

We also tried using GRUs to encode the charge definitions, but they require more computational resources and lead to worse performance. Thus we choose a CNN as the charge definition encoder.
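A minimal sketch of such an encoder is given below, assuming a single convolution layer with a ReLU activation; the activation, the filter layout and the function names conv1d_encode and sum_pool are hypothetical choices for illustration, not details given in the paper.

```python
def conv1d_encode(embeddings, filters, window):
    """Slide a window over the word embeddings.

    Each filter is a flat weight list of length window * embedding_dim and
    yields one feature per window position.
    """
    outputs = []
    for start in range(len(embeddings) - window + 1):
        # Flatten the window of word vectors into one input vector.
        patch = [x for vec in embeddings[start:start + window] for x in vec]
        # ReLU activation is an assumption of this sketch.
        outputs.append([max(0.0, sum(w * x for w, x in zip(f, patch))) for f in filters])
    return outputs


def sum_pool(vectors):
    """Sum the position-wise vectors into one holistic representation."""
    return [sum(column) for column in zip(*vectors)]
```

For example, sum_pool(conv1d_encode(definition_embeddings, filters, window)) would give one vector per charge definition, where definition_embeddings stands for the embedded words of that definition.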

Two Auxiliary Fact Representations from Charge Definitions

The first auxiliary fact representation is created through the sentence-level interaction between the fact description and charge definitions. Its creation process iterates between two steps: identifying related charges and attentively aggregating the holistic representation of charge definitions. After those iterations, relatedness weights of each charge will be obtained and they will also be used as the basis for creating the second auxiliary fact representation. The second auxiliary fact representation is generated from word-level interaction between fact description and identified charge definitions. It uses word-level attention to identify terms that align with the expression in the fact description, and aggregates those terms through a recurrent neural network.

We elaborate on the creation of these two auxiliary representations as follows.

Auxiliary Representation I: charge-related fact representation created via sentence-level interaction

Charges Identification

Identifying related charges is realized by calculating an attention weight for each charge to indicate its relatedness. Specifically, we exploit the episodic memory attention mechanism [xiong2016dynamic] to iteratively calculate the attention weight from the correlation among the charge definitions, the fact description and a memory, where the memory can be seen as a summary of the charges already identified up to the current iteration and is updated at each iteration. With more iterations, the unrelated charges can be filtered out. The memory is initialized with the original holistic representation of the fact description.

Formally, the attention weight of each charge definition at the t-th iteration is calculated from the element-wise product, the element-wise absolute value and the concatenation of these vectors, using trainable weight matrices.

Attentive Charges Aggregator

Once the attention weight of each charge is calculated, we update the memory by performing weighted summation over charge definition representations.


Finally, we concatenate the original fact representation with the last memory and the previous memory, and feed them into a fully-connected layer to create the auxiliary charge-related fact representation.
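The iterative identification and aggregation steps can be sketched as follows. The interaction features follow the general form of the episodic memory attention in [xiong2016dynamic] (products and absolute differences with the fact and the memory); the exact feature set, the two-layer scoring and all names (charge_attention, update_memory, identify_charges, W1, w2) are assumptions of this sketch rather than the paper's definitions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def charge_attention(fact, memory, charge_vecs, W1, w2):
    """Attention weight of every charge definition for one iteration."""
    scores = []
    for d in charge_vecs:
        # Assumed interaction features: [d*f ; d*m ; |d-f| ; |d-m|].
        z = ([a * b for a, b in zip(d, fact)]
             + [a * b for a, b in zip(d, memory)]
             + [abs(a - b) for a, b in zip(d, fact)]
             + [abs(a - b) for a, b in zip(d, memory)])
        hidden = [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W1]
        scores.append(sum(w * h for w, h in zip(w2, hidden)))
    return softmax(scores)


def update_memory(weights, charge_vecs):
    """New memory: attention-weighted sum of charge definition vectors."""
    dim = len(charge_vecs[0])
    return [sum(g * d[k] for g, d in zip(weights, charge_vecs)) for k in range(dim)]


def identify_charges(fact, charge_vecs, W1, w2, iterations=3):
    """Alternate between attention and memory update for a few iterations."""
    memory = list(fact)  # the memory is initialised with the fact representation
    weights = None
    for _ in range(iterations):
        weights = charge_attention(fact, memory, charge_vecs, W1, w2)
        memory = update_memory(weights, charge_vecs)
    return weights, memory
```

The weights returned after the last iteration are the ones that would be reused for the word-level interaction.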

Auxiliary Representation II: charge-token-related fact representation created via word-level interaction

In the course of creating the above representation, both the fact description and the charge definitions are represented by holistic feature vectors. In other words, the interaction between the fact and the charge definitions is only at the sentence level. The second auxiliary representation goes a step further and introduces interaction at the word level. Specifically, for each hidden state in the fact description, we first compute its matching score with each word vector in each charge definition by inner product. The word vectors of that definition are then attentively aggregated into an intermediate representation.

The above intermediate representation is defined with respect to each charge definition. In our method, we further perform a weighted summation of the intermediate representations over the different charge definitions. The weight is the attention weight calculated at the last iteration in Eq. (9). Using this weight fits our intuition that the terms in the related charges are more relevant to the expressions in the fact description.

Note that the result can be viewed as a projection of the fact hidden state onto the space spanned by the word vectors of the charge definitions.
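A rough sketch of this word-level projection for a single fact hidden state is shown below. Normalising the inner-product scores with a softmax is an assumption, and project_word, definitions and charge_weights are hypothetical names introduced for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_word(h, definitions, charge_weights):
    """Re-express one fact hidden state h through charge-definition terms.

    definitions: one list of word vectors per charge definition.
    charge_weights: sentence-level attention weight of each charge.
    """
    dim = len(h)
    projected = [0.0] * dim
    for weight, words in zip(charge_weights, definitions):
        # Inner-product matching score between h and every term.
        scores = [sum(a * b for a, b in zip(h, w)) for w in words]
        attn = softmax(scores)
        # Intermediate representation w.r.t. this charge definition.
        per_charge = [sum(a * w[k] for a, w in zip(attn, words)) for k in range(dim)]
        # Weight it by how related the charge is to the fact.
        projected = [p + weight * r for p, r in zip(projected, per_charge)]
    return projected
```

Because the charge weights gate each definition, terms from unrelated charges contribute little even when an individual word happens to match well.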

After obtaining the projected representation of each word in the fact description, we process the resulting sequence with a new GRU and take its last hidden state as the projected fact representation.

We concatenate the original and the projected fact representations, and feed them into a fully-connected layer to generate the auxiliary charge-token-related fact representation.


The Output

Finally, we concatenate all the generated representations and feed them into a fully-connected layer to generate the final fact representation, which is then passed to the classifier layer to make the charge prediction.

The loss function for training is as follows:


Here the loss is taken over all training samples and all charges, and the model output for each charge is its estimated likelihood of being true.
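For concreteness, a loss of this kind could look like the sketch below, which assumes the standard multi-label binary cross-entropy, summed over charges and averaged over samples. The averaging, the clipping constant eps and the name multilabel_cross_entropy are assumptions; the paper's exact formula is not reproduced here.

```python
import math


def multilabel_cross_entropy(targets, probs, eps=1e-12):
    """Binary cross-entropy summed over charges, averaged over samples.

    targets: multi-hot label rows (0 or 1); probs: predicted likelihoods.
    """
    total = 0.0
    for y_row, p_row in zip(targets, probs):
        for y, p in zip(y_row, p_row):
            # Clip to avoid log(0) for saturated predictions.
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)
```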


Experiments

To verify the effectiveness of our model on criminal charge prediction, we conduct experiments on two real-world datasets of different scales and compare our model against several baselines. Further analyses validate the significance of introducing charge definitions and of the various components of our model.


Datasets            CAIL150K    CAIL30K
Training samples      154592      32506
Test samples           32500      32500
Charge classes           202        168
Table 1: Statistics of datasets.


We use the publicly available datasets from [xiao2018cail2018] to conduct our experiments. There are two datasets of different scales: CAIL150K and CAIL30K. The criminal cases in these datasets, each with a single defendant, are collected from China Judgment Online. Table 1 shows the descriptive statistics of the datasets (the training sets of CAIL150K and CAIL30K are exercise_contest/data_train.json and final_contest.json, respectively, in the CAIL2018 file). It is worth noting that in these two datasets the distribution of charges is quite imbalanced. In CAIL150K, the 30 most frequent charges cover 60% of the cases, and 31% of the charges in the training set have fewer than 100 cases, taking up only 1.88% of the total number of cases. CAIL30K is a smaller dataset. In its training set, 42% of the charges have fewer than 10 cases, taking up only 0.89% of the total number of cases. The small number of samples makes it challenging to train a model that performs well on low-frequency classes.

The charge definitions are extracted from articles in the Criminal Law of the People’s Republic of China. Specifically, in criminal law, except for articles irrelevant to specific charges, each article may include more than one charge, together with the corresponding charge definitions and punishments. We use regular expressions to extract the charge definitions, and merge charge definitions that are scattered across multiple articles. A snippet of cases and charge definitions is illustrated in Figure 1.

Training setup

As all the sentences in the charge definitions and fact descriptions are written in Chinese without word segmentation, we apply jieba for word segmentation. We set the maximum length of a fact description to 500 and of a charge definition to 110. We use pre-trained GloVe [dong2014adaptive] vectors as our initial word embeddings; in practice, we choose the 64-dimensional embedding vectors trained on baidubaike. The number of iterations in Eq. (9) is set to 3. Adam [kingma2014adam] is used as the optimizer; the learning rate is initialized to 0.005 and halved every other epoch.
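The learning-rate schedule can be written down in a few lines. This sketch assumes that epochs are counted from zero and that "every other epoch" means the rate is halved once every two epochs; the function name learning_rate is hypothetical.

```python
def learning_rate(epoch, base_rate=0.005, period=2):
    """Step schedule: start at base_rate and halve once every `period` epochs.

    Assumes zero-based epoch indices.
    """
    return base_rate * 0.5 ** (epoch // period)
```

Under these assumptions, epochs 0 and 1 use 0.005, epochs 2 and 3 use 0.0025, and so on.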

                                                     CAIL150K                      CAIL30K
Category                         Model             Acc.   MP     MR     MF1     Acc.   MP     MR     MF1
Not using charge definitions     TFIDF+SVM         71.87  79.71  56.84  63.32   49.13  31.48  19.98  22.06
                                 CNN_classify      79.23  70.80  62.27  64.97   52.75  23.64  21.95  20.59
                                 GRU_classify      77.33  72.45  57.42  61.54   56.14  23.99  22.81  21.51
                                 HLSTM_classify    73.15  51.45  43.82  46.06   25.34   7.69   6.34   6.15
Using multiple tasks             LJP               25.26  25.78  24.32  25.55   15.29  15.45  15.68  15.56
Match with charge definitions    TFIDF match       13.03  31.21  40.29  26.52   12.19  37.37  35.41  27.60
                                 Siamese CNN       72.98  74.52  64.64  66.55   50.66  32.74  33.74  29.28
Augment with charge definitions  Fact-Law AN       75.61  58.89  52.30  53.62   60.73  28.15  25.16  24.79
                                 GA_Reader         73.78  74.68  66.59  68.21   54.95  39.29  34.05  33.03
                                 MemNet            80.18  80.09  67.13  70.78   62.40  32.62  27.54  27.64
                                 Ours              81.05  82.06  68.33  72.43   67.99  46.13  36.00  37.62
Table 2: The experimental results [%] of baselines and our model on two datasets. The models are grouped into four types; the best score in every column is achieved by Ours.


We compare our model against several text classification models and charge prediction methods, which fall into four categories:

(1) Not using charge definitions for classification. We employ TFIDF [salton1988term] to extract text features from fact descriptions and use linear SVMs [suykens1999least] for charge prediction (TFIDF+SVM). We also implement deep learning models, namely a multi-layer Convolutional Neural Network (CNN) [kim2014convolutional] (CNN_classify), a Gated Recurrent Unit (GRU) [cho2014learning] (GRU_classify) and a hierarchical LSTM [sinha2018hierarchical] (HLSTM_classify), for fact description encoding and classification.

(2) Using multi-task learning for classification. Our method is somewhat related to LJP [zhong2018legal], which introduces related legal tasks and uses multi-task learning to train a better fact representation. We re-implement it to compare with our method.

(3) Matching the fact description with charge definitions for classification. We use TFIDF to extract text features from the fact descriptions and charge definitions, and then compare the fact description with each charge definition (TFIDF match) to find the best-matched charges. We also train a Siamese CNN [koch2015siamese] (Siamese CNN) to match the representations of the fact description and the charge definitions.

(4) Augmenting the fact description with charge definitions for classification. We implement the Fact-Law AN model of [luo2017learning], which uses relevant law articles, selected by SVMs, as a legal basis for encoding the fact description. To demonstrate the advantage of considering sentence- and word-level interaction jointly, we also implement the improved memory network [kumar2016ask] (MemNet) and GA_Reader [wang2017gated]. These two methods are designed for question answering and employ multi-iteration interaction between query and document, at the sentence level and the word level respectively, for answer prediction. In our implementation, we replace the query and the document in GA_Reader and MemNet with the fact description and the charge definitions.

Models                          Acc.   MP     MR     MF1
Ours                            81.05  82.06  68.33  72.43
  w/o Fc                        80.31  79.12  66.88  70.55
  w/o Fs,Fw                     77.33  72.45  57.42  61.54
  w/o Fw                        79.50  78.86  66.18  69.86
  w/o Fs                        80.62  80.54  66.97  71.28
  w/o Fs, attention weight      80.54  76.90  64.34  67.98
Table 3: The experimental results of the ablation test of our model on the CAIL150K dataset.


Evaluation Metrics

We employ accuracy (Acc.), macro-precision (MP), macro-recall (MR) and macro-F1 (MF1) as our evaluation metrics. The macro-precision, macro-recall and macro-F1 are calculated by averaging the precision, recall and F1 of each charge, and are commonly used for multi-label classification tasks.
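The macro-averaged metrics can be computed as in the sketch below, where each sample carries a set of charge ids. Treating an undefined precision, recall or F1 (a zero denominator) as 0 is an assumption of this sketch, and macro_scores is a hypothetical name.

```python
def macro_scores(true_labels, pred_labels, num_classes):
    """Return (macro-precision, macro-recall, macro-F1).

    true_labels / pred_labels: one set of charge ids per sample.
    """
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(1 for t, p in pairs if c in t and c in p)
        fp = sum(1 for t, p in pairs if c not in t and c in p)
        fn = sum(1 for t, p in pairs if c in t and c not in p)
        # Undefined ratios are counted as 0 (assumed convention).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because every charge contributes equally to the average, rare charges with zero F1 pull MF1 down much more than they affect accuracy, which is why MF1 is informative on imbalanced data.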

Overall Evaluation Results

Experimental results on the two datasets of different scales are shown in Table 2. Our observations are as follows:

  • Generally speaking, models that do not incorporate charge definitions (TFIDF+SVM, CNN_classify, GRU_classify and HLSTM_classify) perform worse than their counterparts that do. This is evident from their lower MF1 scores (MF1 is a more comprehensive score for evaluating multi-label classification than Acc., MP, and MR). This observation demonstrates the benefit of introducing charge definitions to assist charge prediction.

  • Incorporating charge definitions through matching-based approaches (TFIDF match and Siamese CNN) works to some extent, although their performance is still worse than that of methods using more sophisticated interaction between the fact description and the charge definitions, i.e. GA_Reader, MemNet and Ours.

  • Methods that augment the fact representation with charge definitions in an end-to-end scheme (GA_Reader, MemNet and Ours) attain better results than Fact-Law AN, which uses a separate two-stage framework to first identify the related charge definitions. This observation shows the importance of the end-to-end design. In addition, compared with GA_Reader and MemNet, which perform either sentence- or word-level interaction, our approach achieves better performance by considering sentence- and word-level interaction jointly.

  • Our proposed model outperforms the baselines on both datasets. The improvement is especially significant on the CAIL30K dataset, where our method surpasses the second best by about 4.6 points in MF1 (37.62 vs. 33.03). Since CAIL30K contains more classes with few training samples, this suggests that our auxiliary representations may help to improve generalization for classes with few samples.

  • Finally, we compare our method against LJP. Like our method, LJP uses external information for building the fact representation. Unlike our method, it introduces multiple related tasks and adopts multi-task learning for representation training. As shown in Table 2, our method achieves superior performance over LJP.

Further Analysis

Figure 3: Impact of exploiting charge definitions on charge prediction under the MF1 metric. The charge ids are those of classes with fewer than 100 training samples in the CAIL150K dataset.

Ablation Test

We conduct ablation studies to verify the effectiveness of the components of our method by removing them one at a time. The results are shown in Table 3. As seen, using only the fact description without any auxiliary fact representation (w/o Fs,Fw) yields the worst performance, which supports the importance of using charge definitions. Adding either the sentence-level (w/o Fw) or the word-level auxiliary fact representation (w/o Fs) improves the performance substantially. Adding only the charge-token-related fact representation (w/o Fs) performs better than adding only the charge-related fact representation (w/o Fw). We also create a variant that does not use the attention weight of each charge from Eq. (9) when generating the charge-token-related fact representation (last row of Table 3), implemented by setting the attention weight to a constant instead of generating it in the charge identification part. The performance of this variant declines (MF1 of 67.98, compared with 71.28 for w/o Fs). This suggests that both levels of attention are necessary and that using them jointly gives the best performance.

Impact of Exploiting charge definitions

We analyze the effect of incorporating the generated auxiliary fact representations for classes with few training data. As shown in Figure 3, we study the results for classes with fewer than 100 samples in the CAIL150K dataset. The MF1 score of many charges is zero if auxiliary representations are not used, and the results improve significantly once we add the auxiliary representations from charge definitions. This observation highlights the benefit of introducing auxiliary representations for handling small-sample cases.

Figure 4: Intra-class variance of different fact representations for the five most frequent classes in the CAIL150K dataset. The three representations compared are the fact representation learned only from the fact description (Fc), the one augmented with the charge-related fact representation, and the one augmented with all auxiliary fact representations.

Intra-class variance of different fact representations

To investigate whether the fact representation of our method is more stable, we conduct the following experiment: for each of the five classes with the most samples, we calculate the variance along each dimension of the fact representations, and then use the average variance over all dimensions as an indicator of the intra-class variance of each kind of fact representation. As shown in Figure 4, the fact representation learned only from the fact description yields the largest intra-class variance. After augmenting the fact representation with charge definitions through sentence-level interaction, the intra-class variance declines greatly. In particular, the final fact representation, which incorporates both auxiliary representations, attains an even lower intra-class variance.
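The variance indicator used in this experiment is simple enough to state in code. The sketch below assumes the population variance (dividing by the number of samples); the name intra_class_variance is hypothetical.

```python
def intra_class_variance(vectors):
    """Average, over dimensions, of the per-dimension variance.

    vectors: fact representations of the samples of one class.
    Uses the population variance (an assumption of this sketch).
    """
    n = len(vectors)
    dim = len(vectors[0])
    variances = []
    for k in range(dim):
        column = [v[k] for v in vectors]
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return sum(variances) / dim
```

A lower value means that samples of the same charge lie closer together in the representation space.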

Case study

Finally, we select a representative robbery case to give an intuitive illustration of the attention results of the sentence- and word-level interaction. As shown in Table 4, the case describes a defendant who is convicted of robbery for stealing property and poking the victim to resist arrest. In the sentence-level interaction, as the number of iterations in Eq. (9) increases, our model narrows down the candidate charges and finally identifies the correct related charges. We set the number of iterations to 3 since the performance does not improve with more iterations.

In the word-level interaction, the attention mechanism aligns the words in the fact description with the formal terms in the charge definitions. To demonstrate this mechanism, Figure 5 shows, for the words in the fact description, which terms in the charge definition of robbery are focused on. The identified keywords in the fact description are “electric vehicle”, “resisting arrest” and “a knife”, which correspond to the key terms “stolen goods”, “resist arrest” and “use violence” in the robbery definition.

Fact description: 被告人偷盗电动车,被受害人阻拦时,
The defendant stole an electric vehicle; when blocked,
he took out a knife to poke the victim to resist arrest…
Charge: Robbery
Top-5 related charges (columns t1, t2, t3):
Intentional injury
Negligent act causing severe injury
Endangering public security
Table 4: Attention map of the sentence-level attention for the robbery case. t1, t2, and t3 denote the iterations in Eq. (9). A darker color means the charge is more related to the fact.
Figure 5: Attention map of the word-level attention between the robbery case and the charge definition of robbery. A darker color means a larger value.


Conclusion

In this work, we focus on the task of multi-label charge prediction for given fact descriptions of criminal cases. To address the large expression variance in fact descriptions caused by informal language use, we introduce charge definitions from criminal law to create auxiliary representations of the fact descriptions. The experimental results on two datasets show the effectiveness of our model on charge prediction. The significant improvement on classes with few training data validates that our method benefits the small-sample training scenario and that the two-level auxiliary fact representations help the model generalize to unseen descriptions.

