Creating Auxiliary Representations from Charge Definitions
for Criminal Charge Prediction
Charge prediction, i.e., determining the charges for criminal cases by analyzing their textual fact descriptions, is a promising technology for legal assistant systems. In practice, fact descriptions can exhibit significant intra-class variation due to factors like the non-normative use of language, which makes the prediction task very challenging, especially for charge classes with too few samples to cover the expression variation. In this work, we explore using the charge definitions from criminal law to alleviate this issue. The key idea is that the expressions in a fact description should have corresponding formal terms in charge definitions; those terms are shared across classes and can account for the diversity in fact descriptions. We therefore propose to create auxiliary fact representations from charge definitions to augment the fact description representation. The auxiliary representations are created through the interaction of the fact description with the relevant charge definitions and the terms in those definitions, via an integrated sentence- and word-level attention scheme. Experimental results on two datasets show that our model achieves significant improvement over baselines, especially for classes with few samples.
The task of charge prediction is to determine the appropriate charges, such as theft, seizure, or robbery, for criminal cases by analyzing their textual fact descriptions. Automating charge prediction with NLP technology could significantly reduce the human labor involved in organizing legal documents, and could be practically useful for an online legal assistant system.
Existing methods formulate charge prediction as a text classification problem, aiming to learn a representation of the fact description for prediction. Conventional methods [liu2005classifying, liu2006exploring, lin2012exploiting, sulea2017exploring] design shallow text features to represent fact descriptions. Recently, deep learning has provided end-to-end models to learn fact representations from fact descriptions [luo2017learning, hu2018few, zhong2018legal], achieving state-of-the-art results.
In practice, the fact description of a criminal case is written by prosecutors, lawyers, or defendants to state the details of the case. It comprises a substantial amount of diverse, non-normative language. For example, the robbery cases in Figure 1 all involve "theft", but the legal term "theft" may be expressed implicitly, as in "stole an electric vehicle" or "came forward to ride away Ke's white Merida bicycle". Consequently, the representations of fact descriptions may exhibit considerable intra-class variation, which can lead to prediction failures at test time. This is more pronounced for charge classes with only a few examples, since the samples are insufficient for learning a predictive model robust to expression variation.
To address this issue, we introduce the charge definitions from criminal law to create more robust fact representations for charge prediction. We propose to create auxiliary fact representations from the charge definitions to augment the fact representation. These auxiliary representations are essentially projections of the fact description into the semantic space of the charge definitions. Our motivation is that the expressions in a fact description should have corresponding formal terms in charge definitions, and those formal terms can provide an alternative view of the expressions in the fact description. Note that many of these formal terms are shared across charge classes and are less diverse. Thus, using elements of the charge definitions to re-interpret the fact description and generate auxiliary representations has the potential to account for the diversity in fact descriptions.
Specifically, we design an integrated sentence- and word-level interaction model to generate two auxiliary fact representations. We identify the relevant charge definitions through sentence-level interaction between the fact description and the charge definitions, and then aggregate the holistic features of the relevant charge definitions to create the first auxiliary representation, named the charge-related fact representation. The relevant charge definitions identified in the course of producing the first auxiliary representation also serve in creating the second one. To create the second representation, we further consider finer-grained word-level interaction between the fact description and the identified charge definitions. Relevant words from the relevant charge definitions are attended to and aggregated through a recurrent neural network to generate the second auxiliary representation, named the charge-token-related fact representation. We illustrate our model with an example in Figure 1. Case 1 and case 2 in Figure 1 belong to the same charge class, robbery, but with different expressions. With the proposed method, they are first related to the charge definition of robbery. Then the statements "stole an electric vehicle" and "took out a knife to poke the victim" in case 1, and "came forward to ride away Ke's white Merida bicycle" and "used fist wounding Ke's head" in case 2, are softly aligned to the terms "theft" and "use violence" in the robbery definition through attention. By reinterpreting the fact descriptions through the aligned terms, these two cases become more similar. The final charge prediction is based on the concatenation of the original and auxiliary fact representations, and one can expect the prediction made on this representation to be more robust.
To investigate the advantage of our method for charge prediction, we conduct experiments on two datasets consisting of criminal cases extracted from China Judgments Online. Experimental results show that our model achieves significant improvement over baselines, especially on classes with few samples. We also conduct ablation studies to analyze the effectiveness of each component of our model, and visualize the impact of introducing charge definitions.
Charge prediction has been studied for years, with the focus on learning representations of the fact descriptions of criminal cases that are fed into classifiers to make the judgment. At the early stage, [liu2005classifying, liu2006exploring, lin2012exploiting, sulea2017exploring] attempted to extract shallow text features from fact descriptions or create hand-crafted features, which are hard to generalize to large datasets due to the diverse expression of fact descriptions. Inspired by the success of deep learning, [luo2017learning, ye2018interpretable, hu2018few, zhong2018legal] employ neural models with external information to capture high-level semantic information. Zhong et al. [zhong2018legal] propose the LJP method, which models multiple legal subtasks as a Directed Acyclic Graph (DAG) and uses multi-task learning to assist prediction. Luo et al. [luo2017learning] use a separate two-stage scheme to extract the related articles and then attentively incorporate them into the fact representation for charge prediction. Ye et al. [ye2018interpretable] design 10 legal attributes to help few-shot charge prediction. However, existing charge prediction models all need a large amount of feature engineering, either to design features or to build relations between subtasks. Instead, we augment the fact representation for charge prediction by creating auxiliary representations from charge definitions in an end-to-end fashion.
Attention and Memory
Our model is also related to attention and memory in deep learning [bahdanau2014neural, vaswani2017attention, sinha2018hierarchical, weston2014memory, wang2018target, ebesu2018collaborative]. Although researchers have proposed various neural architectures with memory and attention for NLP problems [kumar2016ask, wang2017gated, gao2019hybrid], they consider only sentence-level or only word-level alignment between sentences. In contrast, we combine them jointly to form auxiliary representations, where sentence-level interaction identifies relevant charges, and a finer-grained word-level interaction on top of the identified charge definitions makes the generated fact representation more robust.
The Proposed Model
Charge prediction is to predict the corresponding charges for a given fact description $f$, where the fact description consists of a sequence of words $f = \{w_1, w_2, \dots, w_n\}$, and its label $y$ is a $C$-dimensional multi-hot vector, since a fact description may correspond to multiple charges among the $C$ charges. The charge definition for the $i$-th charge can be represented as a sequence of words $d_i = \{v_1, v_2, \dots, v_m\}$.
To generate a robust fact representation for prediction, we propose an integrated sentence- and word-level interaction model. The architecture of our model is shown in Figure 2. As seen, the final fact representation is the concatenation of three representations.
The original fact representation ($F_o$), derived from the fact description only and obtained by the fact description encoder.
The auxiliary representation I, the charge-related fact representation ($F_s$), aggregated from the holistic representations of related charge definitions that are identified via sentence-level interaction between the fact description and the charge definitions.
The auxiliary representation II, the charge-token-related fact representation ($F_w$), built on top of the identified charge definitions and created by finer-grained word-level interaction between the fact description and the identified charge definitions.
Fact Description Encoder
Given a fact description represented by a sequence of word embeddings $\{e_1, e_2, \dots, e_n\}$, we use a Gated Recurrent Unit (GRU) [cho2014learning] to create a sequence of hidden states encoding the contextual information of each word:
$h_t = \mathrm{GRU}(e_t, h_{t-1})$,
where $h_t$ is the hidden state of the GRU at time step $t$. The hidden state sequence is denoted as $H = \{h_1, h_2, \dots, h_n\}$.
For a fact description, the words, and consequently the hidden states, do not contribute equally to the semantic meaning of a fact, and a long fact description involves many less informative words. To suppress the negative impact of the non-informative words, we use an attention mechanism to assign each hidden state $h_i$ an importance weight $\alpha_i$:
$\alpha_i = \mathrm{softmax}_i\big(u^\top \tanh(W h_i + b)\big)$,
where $W$, $b$, and the context vector $u$ are trainable parameters. The holistic representation $F_o$ of the original fact description is computed as a weighted sum of the hidden states: $F_o = \sum_i \alpha_i h_i$.
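The attentive pooling above can be written in a few lines of numpy. This is a minimal illustrative sketch, not the paper's implementation: the parameter shapes, the context vector, and the random data below are all assumptions for demonstration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(H, W, b, u):
    """Attention pooling over hidden states H (n x d):
    score_i = u . tanh(W h_i + b), alpha = softmax(scores),
    F_o = sum_i alpha_i h_i."""
    U = np.tanh(H @ W.T + b)   # (n, d_a) projected hidden states
    alpha = softmax(U @ u)     # (n,) importance weights
    return alpha @ H, alpha    # pooled representation and weights

rng = np.random.default_rng(0)
n, d, d_a = 6, 8, 4
H = rng.normal(size=(n, d))                      # hypothetical GRU states
F_o, alpha = attentive_pool(H, rng.normal(size=(d_a, d)),
                            np.zeros(d_a), rng.normal(size=d_a))
```

Because the weights are a softmax, $F_o$ is a convex combination of the hidden states, so uninformative words can be down-weighted without changing the representation's dimensionality.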
Charge Definitions Encoder
Each charge class is associated with a charge definition, that is, $d_i = \{v_1, v_2, \dots, v_m\}$. We use a CNN to encode this sequence of words into a sequence of vectors $\{g_1, g_2, \dots, g_m\}$. Since we deal with a large number of charge definitions, using CNNs [kim2014convolutional] gives us better training efficiency.
where the window size of the CNN is $k$. We then sum these vectors to create the holistic representation $c_i$ of each charge definition: $c_i = \sum_{j} g_j$.
We also tried using GRUs to encode the charge definitions, but they require more computational resources and lead to worse performance. We therefore choose the CNN as the charge definition encoder.
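The definition encoder can be sketched as a 1-D convolution over word embeddings followed by sum pooling. The padding scheme, ReLU nonlinearity, and filter shapes below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def cnn_encode(E, filters, k):
    """1-D convolution with window size k over word embeddings E (m x d),
    followed by summation into a holistic definition vector c_i.
    `filters` has shape (d_out, k*d)."""
    m, d = E.shape
    # zero-pad at the end so every position has a full window
    Ep = np.vstack([E, np.zeros((k - 1, d))])
    windows = np.stack([Ep[j:j + k].ravel() for j in range(m)])  # (m, k*d)
    G = np.maximum(windows @ filters.T, 0.0)  # ReLU feature maps, (m, d_out)
    return G.sum(axis=0)                      # holistic representation

rng = np.random.default_rng(1)
E = rng.normal(size=(10, 16))                 # 10 words, 16-dim embeddings
c = cnn_encode(E, rng.normal(size=(32, 3 * 16)), k=3)
```

Each row of `G` plays the role of one word-position vector $g_j$; summing the rows gives the definition's holistic vector $c_i$.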
Two Auxiliary Fact Representations from Charge Definitions
The first auxiliary fact representation is created through sentence-level interaction between the fact description and the charge definitions. Its creation iterates between two steps: identifying related charges, and attentively aggregating the holistic representations of the charge definitions. After these iterations, a relatedness weight for each charge is obtained; these weights also serve as the basis for creating the second auxiliary fact representation. The second auxiliary fact representation is generated from word-level interaction between the fact description and the identified charge definitions. It uses word-level attention to identify terms that align with the expressions in the fact description, and aggregates those terms through a recurrent neural network.
We elaborate the creation of those two auxiliary representations as follows.
Auxiliary Representation I: charge-related fact representation created via sentence-level interaction
Identifying related charges is realized by calculating an attention weight for each charge to indicate its relatedness. Specifically, we exploit the episodic memory attention mechanism [xiong2016dynamic] to iteratively calculate the attention weight $\alpha_i^t$ from the correlation among the charge definition representations $c_i$, the fact representation $F_o$, and a memory $m^{t-1}$, where $m^{t-1}$ can be seen as the summary of the charges identified up to the $t$-th iteration and is updated at each iteration. With more iterations, the unrelated charges can be filtered out. The memory is initialized with the original holistic representation of the fact description, that is, $m^0 = F_o$.
Formally, we use the following formulas to calculate the attention weight of each charge definition at the $t$-th iteration:
$z_i^t = [\,c_i \odot F_o;\; c_i \odot m^{t-1};\; |c_i - F_o|;\; |c_i - m^{t-1}|\,]$,
$\alpha_i^t = \mathrm{softmax}_i\big(W_2 \tanh(W_1 z_i^t)\big)$,
where $\odot$ is the element-wise product, $|\cdot|$ is the element-wise absolute value, and $[\,\cdot\,;\,\cdot\,]$ represents concatenation of the vectors. $W_1$ and $W_2$ are trainable weight matrices.
Attentive Charges Aggregator
Once the attention weight of each charge is calculated, we update the memory by performing a weighted summation over the charge definition representations: $m^t = \sum_i \alpha_i^t c_i$.
Finally, we concatenate the original fact representation with the last memory and the previous memory, and feed them into a fully-connected layer to create the auxiliary charge-related fact representation:
$F_s = \mathrm{FC}([F_o;\, m^{T};\, m^{T-1}])$,
where $\mathrm{FC}(\cdot)$ denotes the fully-connected layer.
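The iterative sentence-level attention can be sketched as follows. The interaction features (element-wise products and absolute differences against the fact vector and the memory) and the weighted-sum memory update follow the description above; the two-layer scoring MLP with fixed random weights is an illustrative stand-in for the trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episodic_attention(C, F_o, T=3):
    """Iterative sentence-level attention over charge-definition vectors
    C (num_charges x d). Returns the last attention weights and the last
    two memory states."""
    rng = np.random.default_rng(2)
    d = C.shape[1]
    W1 = rng.normal(size=(d, 4 * d)) * 0.1  # stand-in trainable matrices
    w2 = rng.normal(size=d) * 0.1
    m = F_o.copy()                          # m^0 = F_o
    for _ in range(T):
        # interaction features between each c_i, F_o, and the memory m
        z = np.hstack([C * F_o, C * m,
                       np.abs(C - F_o), np.abs(C - m)])   # (num, 4d)
        alpha = softmax(np.tanh(z @ W1.T) @ w2)           # (num,)
        m_prev, m = m, alpha @ C            # weighted-sum memory update
    return alpha, m, m_prev

rng = np.random.default_rng(3)
C = rng.normal(size=(5, 8))      # 5 hypothetical charge definitions
F_o = rng.normal(size=8)
alpha, m_T, m_prev = episodic_attention(C, F_o)
```

Concatenating `F_o`, `m_T`, and `m_prev` and passing them through a fully-connected layer would then yield $F_s$.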
Auxiliary Representation II: charge-token-related fact representation created via word-level interaction
In the course of creating the above representation, both the fact description and the charge definitions are represented by holistic feature vectors; in other words, the interaction between the fact and the charge definitions happens only at the sentence level. The second auxiliary representation goes a step further and introduces interaction at the word level. Specifically, for each hidden state $h_j$ in the fact description, we first compute its matching score toward each word vector $g_k$ in each charge definition $d_i$ by inner product. The word vectors are then attentively aggregated into an intermediate representation $\tilde{h}_j^i$:
The above intermediate representation $\tilde{h}_j^i$ is defined w.r.t. each charge definition $d_i$. In our method, we further perform a weighted summation over $\tilde{h}_j^i$ for the different charge definitions $d_i$. The weight is the attention weight $\alpha_i^T$ calculated at the last iteration in Eq. (9). Using this weight fits our intuition that the terms in the related charges are more relevant to the expressions in the fact description. Formally, we obtain $\tilde{h}_j = \sum_i \alpha_i^T \tilde{h}_j^i$.
Note that $\tilde{h}_j$ can be viewed as a projection of $h_j$ into the space spanned by the word vectors of the charge definitions.
After obtaining $\tilde{h}_j$ for each word in the fact description, we process the sequence $\{\tilde{h}_1, \dots, \tilde{h}_n\}$ by a new GRU and take its last hidden state $s_n$:
We concatenate the original and the projected fact representations, and feed them into a fully-connected layer to generate the auxiliary charge-token-related fact representation $F_w$.
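A minimal numpy sketch of the word-level projection: each fact hidden state attends over the word vectors of every charge definition via inner products, and the per-definition projections are mixed using the sentence-level weights. Shapes and data are illustrative assumptions.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax for a 2-D score matrix."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def word_level_project(H, G_list, alpha):
    """Project fact hidden states H (n x d) onto charge-definition word
    vectors: inner-product scores -> per-definition attention -> weighted
    sum across definitions with sentence-level weights alpha."""
    n, d = H.shape
    H_tilde = np.zeros((n, d))
    for a_i, G in zip(alpha, G_list):  # G: (m_i, d) word vectors of d_i
        A = softmax_rows(H @ G.T)      # (n, m_i) word-level attention
        H_tilde += a_i * (A @ G)       # accumulate per-definition projection
    return H_tilde

rng = np.random.default_rng(4)
H = rng.normal(size=(7, 8))                            # fact hidden states
G_list = [rng.normal(size=(m, 8)) for m in (5, 6)]     # 2 definitions
alpha = np.array([0.7, 0.3])                           # sentence-level weights
H_tilde = word_level_project(H, G_list, alpha)
```

Each row of `H_tilde` is the re-interpretation $\tilde{h}_j$ of a fact word in terms of definition vocabulary; a GRU over these rows would then yield the input to $F_w$.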
Finally, we concatenate all the generated representations and feed them into a fully-connected layer to generate the final fact representation $F = \mathrm{FC}([F_o;\, F_s;\, F_w])$.
$F$ is then passed to the classifier layer to make the charge prediction.
The loss function for training is as follows:
$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{C} \big[\, y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \,\big]$,
where $N$ is the number of training samples and $C$ is the number of charges; $y_{ij}$ is the ground-truth label and $p_{ij}$ is the estimated likelihood of the $j$-th charge being true for the $i$-th case.
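The multi-label cross-entropy described above can be sketched directly in numpy. The function name and the clipping constant (added for numerical stability) are our own illustrative choices.

```python
import numpy as np

def multilabel_bce(Y, P, eps=1e-12):
    """Multi-label binary cross-entropy over N cases and C charges:
    L = -sum_ij [ y_ij log p_ij + (1 - y_ij) log(1 - p_ij) ]."""
    P = np.clip(P, eps, 1 - eps)   # avoid log(0)
    return float(-(Y * np.log(P) + (1 - Y) * np.log(1 - P)).sum())

# two cases, three charges; case 2 carries two charges (multi-hot label)
Y = np.array([[1, 0, 0], [1, 1, 0]], dtype=float)
P_good = np.array([[0.9, 0.1, 0.1], [0.8, 0.9, 0.2]])  # close to labels
P_bad = np.array([[0.2, 0.8, 0.6], [0.3, 0.4, 0.7]])   # far from labels
loss_good, loss_bad = multilabel_bce(Y, P_good), multilabel_bce(Y, P_bad)
```

As expected, predictions closer to the multi-hot labels incur a lower loss.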
To verify the effectiveness of our model on criminal charge prediction, we conduct experiments on two real-world datasets of different scales and compare our model against several baselines. We also carry out further analyses to validate the significance of introducing charge definitions and of the various components of our model.
We use the publicly available datasets from [xiao2018cail2018] to conduct our experiments. There are two datasets of different scales: the CAIL150K dataset and the CAIL30K dataset. The criminal cases in these datasets are collected from China Judgments Online (http://wenshu.court.gov.cn/) and each involves a single defendant. Table 1 shows the descriptive statistics of the datasets (the training sets of CAIL150K and CAIL30K are exercise_contest/data_train.json and final_contest.json, respectively, in the CAIL2018 release). It is worth noting that in both datasets the distribution of charges is quite imbalanced. In CAIL150K, the top 30 most frequent charges cover 60% of the cases, while 31% of the charges in the training set have fewer than 100 cases each, accounting for only 1.88% of the total number of cases. CAIL30K is a smaller dataset: in its training set, 42% of the charges have fewer than 10 cases each, accounting for only 0.89% of the total number of cases. The small number of samples makes it challenging to train a model that performs well on low-frequency classes.
As for the charge definitions, they are extracted from articles of the Criminal Law of the People's Republic of China. Specifically, aside from articles irrelevant to specific charges, each article may cover more than one charge, along with the corresponding charge definitions and punishments. We use regular expressions to extract the charge definitions and merge definitions scattered across multiple articles. A snippet of cases and charge definitions is illustrated in Figure 1.
As the sentences in charge definitions and fact descriptions are written in Chinese without word segmentation, we apply jieba (https://github.com/fxsjy/jieba) for word segmentation. We set the maximum length of fact descriptions to 500 and of charge definitions to 110. We use pre-trained GloVe [dong2014adaptive] vectors as our initial word embeddings; in practice, we choose the 64-dimensional embedding vectors trained on Baidu Baike. The number of iterations in Eq. (9) is set to 3. Adam [kingma2014adam] is used as the optimizer, with the learning rate initialized to 0.005 and halved every other epoch.
| Category | Model | CAIL150K Acc. | MP | MR | MF1 | CAIL30K Acc. | MP | MR | MF1 |
|---|---|---|---|---|---|---|---|---|---|
| Not using charge definitions | TFIDF+SVM | 71.87 | 79.71 | 56.84 | 63.32 | 49.13 | 31.48 | 19.98 | 22.06 |
| Using multiple tasks | LJP | 25.26 | 25.78 | 24.32 | 25.55 | 15.29 | 15.45 | 15.68 | 15.56 |
| Match with charge definitions | TFIDF match | 13.03 | 31.21 | 40.29 | 26.52 | 12.19 | 37.37 | 35.41 | 27.60 |
| Augment with charge definitions | Fact-Law AN | 75.61 | 58.89 | 52.30 | 53.62 | 60.73 | 28.15 | 25.16 | 24.79 |
We compare our model against several text classification models and charge prediction methods, which can be categorized into four categories:
(1) Not using charge definitions for classification. We employ TFIDF [salton1988term] to extract text features from fact descriptions and use linear SVMs [suykens1999least] for charge prediction (TFIDF+SVM). We also implement deep learning models, including a multi-layer Convolutional Neural Network (CNN) [kim2014convolutional] (CNN_classify), a Gated Recurrent Unit (GRU) [cho2014learning] (GRU_classify), and a hierarchical LSTM [sinha2018hierarchical] (HLSTM_classify), for fact description encoding and classification.
(2) Using multi-task learning for classification. Our method is related to LJP [zhong2018legal], which introduces related legal tasks and uses multi-task learning to train a better fact representation. We re-implement it to compare with our method.
(3) Matching the fact description with charge definitions for classification. We exploit TFIDF to extract text features from fact descriptions and charge definitions, then compare the fact description with each charge definition (TFIDF match) to find the best-matched charges. We also train a Siamese CNN [koch2015siamese] (Siamese CNN) to match the representations of the fact description and the charge definitions.
(4) Augmenting the fact description with charge definitions for classification. We implement the Fact-Law AN model of Luo et al. [luo2017learning], which uses relevant law articles, selected by SVMs, as a legal basis for encoding the fact description. To demonstrate the advantage of our model in considering sentence- and word-level interaction jointly, we also implement an improved memory network [kumar2016ask] (MemNet) and GA_Reader [wang2017gated]. These two methods are designed for question answering, employing multi-iteration interaction between query and document at the sentence and word level, respectively, for answer prediction. In our implementation, we replace the query and document in GA_Reader and MemNet with the fact description and the charge definitions.
We employ accuracy (Acc.), macro-precision (MP), macro-recall (MR), and macro-F1 (MF1) as our evaluation metrics. The macro-precision/recall/F1 are calculated by averaging the precision, recall, and F1 of each charge, and are commonly used for multi-label classification tasks.
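For concreteness, macro-averaged precision/recall/F1 over multi-label predictions can be computed as follows. This is an illustrative sketch with toy data; the actual evaluation code may differ in details such as tie handling.

```python
def macro_prf1(y_true, y_pred, num_classes):
    """Macro precision/recall/F1 for multi-label predictions: compute the
    metric per charge, then average over charges. y_true and y_pred are
    lists of label sets, one per case."""
    ps, rs, fs = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = num_classes
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# toy example: 3 cases, 3 charge classes
y_true = [{0}, {0, 1}, {2}]
y_pred = [{0}, {1}, {0}]
mp, mr, mf1 = macro_prf1(y_true, y_pred, 3)
```

Because every charge contributes equally to the average, rare charges weigh as much as frequent ones, which is why macro metrics are informative for imbalanced charge distributions.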
Overall Evaluation Results
Experimental results on the two datasets are shown in Table 2. The observations are as follows:
Generally speaking, models that do not incorporate charge definitions (TFIDF+SVM, CNN_classify, GRU_classify, and HLSTM_classify) perform worse than their charge-definition-incorporated counterparts. This is evident from their lower MF1 scores (MF1 is a more comprehensive score for evaluating multi-label classification than Acc., MP, and MR). This observation clearly demonstrates the benefit of introducing charge definitions to assist charge prediction.
Incorporating charge definitions through matching based approaches (TFIDF match and Siamese CNN) works to some extent, although their performance is still worse than methods using more sophisticated interaction between fact description and charge definitions, i.e. GA_Reader, MemNet and Ours.
Methods that augment the fact representation with charge definitions in an end-to-end scheme (GA_Reader, MemNet, and ours) attain better results than Fact-Law AN, which uses a separate two-stage framework that first identifies the related charge definitions. This observation shows the importance of the end-to-end design. In addition, compared with GA_Reader and MemNet, which perform either sentence- or word-level interaction, our approach achieves better performance by considering sentence- and word-level interaction jointly.
Our proposed model outperforms all baselines on both datasets. The improvement is especially significant on the CAIL30K dataset: our method surpasses the second-best method by about 5% in MF1. Since CAIL30K contains more classes with few training samples, the strong performance of our approach suggests that our auxiliary representations help improve generalization for classes with few samples.
Finally, we compare our method against LJP. Like our method, LJP also uses external information to build the fact representation; unlike ours, it introduces multiple related tasks and adopts multi-task learning for representation training. As shown in Table 2, our method achieves superior performance to LJP.
We conduct ablation studies to verify the effectiveness of the various components of our method, considering several variants obtained by removing some components from our model. The results are shown in Table 3. As seen, using only the fact description without either auxiliary fact representation (w/o Fs, Fw) yields the worst performance, which confirms the importance of the charge definitions. Adding either the sentence-level (w/o Fw) or the word-level auxiliary fact representation (w/o Fs) significantly improves performance, and adding only the charge-token-related fact representation (w/o Fs) is better than adding only the charge-related fact representation (w/o Fw). We also create a variant that does not use the attention weights of the charges from Eq. (9) when generating the charge-token-related fact representation (w/o Fs, α), implemented by setting the attention weights to a uniform constant instead of the values generated by the charge identification part. The performance of w/o Fs, α declines, which suggests that the two-level attention is necessary and that using both levels jointly yields the best performance.
Impact of Exploiting charge definitions
We analyze the effects of incorporating the generated auxiliary fact representations for classes with few training samples. As shown in Figure 3, we study the results of classes with fewer than 100 samples on the CAIL150K dataset. The MF1 of many charges is zero when the auxiliary representations are not used, and the results improve significantly once we add the auxiliary representations from charge definitions. This observation highlights the benefit of introducing auxiliary representations for handling small-sample cases.
Intra-class variance of different fact representations
To investigate whether the fact representation of our method is more stable, we conduct the following experiment: we calculate the variance along each dimension of the fact representations for the five classes with the most samples, and then use the average variance over all dimensions as an indicator of the intra-class variance of different fact representations. As shown in Figure 4, the fact representation ($F_o$) learned only from the fact description yields the largest intra-class variance. After augmenting the fact representation with the sentence-level auxiliary representation from charge definitions ($F_s$), the intra-class variance declines greatly. Notably, the final fact representation ($F$), with both auxiliary representations incorporated, attains an even lower intra-class variance.
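The intra-class variance indicator described above can be sketched as follows; the synthetic data simply contrasts tight clusters of representations with spread-out ones, and the class counts and dimensions are illustrative.

```python
import numpy as np

def intra_class_variance(reps_by_class):
    """Average per-dimension variance within each class, then average over
    classes, giving a scalar indicator of intra-class spread."""
    per_class = [np.var(R, axis=0).mean() for R in reps_by_class]
    return float(np.mean(per_class))

rng = np.random.default_rng(5)
# five classes of 50 hypothetical 64-d "fact representations" each:
# a tightly clustered variant vs. a spread-out variant
tight = [rng.normal(scale=0.1, size=(50, 64)) for _ in range(5)]
spread = [rng.normal(scale=1.0, size=(50, 64)) for _ in range(5)]
v_tight = intra_class_variance(tight)
v_spread = intra_class_variance(spread)
```

A lower value of this indicator corresponds to representations of same-class cases lying closer together, which is the behavior Figure 4 reports for the augmented representations.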
Finally, we select a representative robbery case to give an intuitive illustration of the attention results from the sentence- and word-level interactions. As shown in Table 4, the case describes a defendant convicted of robbery for stealing property and poking the victim to resist arrest. At the sentence level, as the iterations in Eq. (9) proceed, our model narrows down the candidate charges and finally identifies the correct related charges. We set the number of iterations to 3 since performance does not improve with more iterations.
At the word level, the attention mechanism aligns the words in the fact description with the formal terms in the charge definitions. To demonstrate this, Figure 5 shows, for the words in the fact description, which terms in the charge definition of robbery receive attention. The identified keywords in the fact description are "electric vehicle", "resisting arrest", and "a knife", which correspond to the key terms "stolen goods", "resist arrest", and "use violence" in the robbery definition.
| Top-5 related charges | t=1 | t=2 | t=3 |
|---|---|---|---|
| Negligent act causing severe injury | | | |
| Endangering public security | | | |
In this work, we focus on the task of multi-label charge prediction for given fact descriptions of criminal cases. To address the large expression variance in fact descriptions caused by informal language use, we introduce charge definitions from criminal law to create auxiliary representations of the fact descriptions. Experimental results on two datasets show the effectiveness of our model for charge prediction. The significant improvement on classes with few training samples validates that our method benefits small-sample training and that the two-level auxiliary fact representations help the model generalize to unseen descriptions.