On mining complex sequential data by means of FCA and pattern structures

Aleksey Buzmakov (corresponding author; email: aleksey.buzmakov@inria.fr), Elias Egho (was at LORIA, Vandoeuvre-lès-Nancy, France, when this work was done), Nicolas Jay, Sergei O. Kuznetsov, Amedeo Napoli, Chedy Raïssi
Orpailleur, LORIA (CNRS – Inria NGE – U. de Lorraine), Vandoeuvre-lès-Nancy, France;
Orange Labs, Lannion, France
National Research University Higher School of Economics, Moscow, Russia
Abstract

Nowadays, data sets are available in very complex and heterogeneous forms. Mining such data collections is essential to support many real-world applications ranging from healthcare to marketing. In this work, we focus on the analysis of “complex” sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of Formal Concept Analysis (FCA) and its extension based on “pattern structures”. Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures, along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computing efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analyzing interesting patient patterns from a French healthcare data set on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported in this use case, which is the main motivation for this work.

Keywords: data mining; formal concept analysis; pattern structures; projections; sequences; sequential data.

1 Introduction

Sequence data is present and used in many applications. Mining sequential patterns from sequence data has become an important data mining task. In the last two decades, the main emphasis has been on developing efficient mining algorithms and effective pattern representations (Han et al., 2000; Pei et al., 2001a; Yan et al., 2003; Ding et al., 2009; Raïssi et al., 2008). However, one problem with traditional sequential pattern mining algorithms (and generally with all pattern enumeration algorithms) is that they generate a large number of frequent sequences while only a few of them are truly relevant. To tackle this challenge, recent studies try to enumerate patterns using alternative interestingness measures or by sampling representative patterns. A general idea in finding statistically significant patterns is to extract patterns whose characteristics for a given measure, such as frequency, strongly deviate from their expected value under a null model, i.e. the value expected from the distribution of all data. In this work, we focus on complementing the statistical approaches with a sound algebraic approach, trying to answer the following question: can we develop a framework for enumerating only relevant patterns based on data lattices and their associated measures?

The above question can be answered by addressing the problem of analyzing sequential data using the framework of Formal Concept Analysis (FCA), a mathematical approach to data analysis (Ganter and Wille, 1999), and pattern structures, an extension of FCA that handles complex data (Ganter and Kuznetsov, 2001). To analyze a dataset of “complex” sequences while avoiding the classical efficiency bottlenecks, we introduce and explain the usage of projections, which are mathematical mappings for defining approximations. Projections for sequences allow one to reduce the computational costs and the volume of enumerated patterns, avoiding the infamous “pattern flooding”. In addition, we provide and discuss several measures, such as stability, to rank patterns with respect to their “interestingness”, giving an expert order in which the patterns may be efficiently analyzed.

In this paper, we develop a novel, rigorous and efficient approach for working with sequential pattern structures in formal concept analysis. The main contributions of this work can be summarized as follows:

  • Pattern structure specification and analysis. We propose a novel way of dealing with sequences based on complex alphabets by mapping them to pattern structures. The genericity power provided by the pattern structures allows our approach to be directly instantiated with state-of-the-art FCA algorithms, making the final implementation flexible, accurate and scalable.

  • “Projections” for sequential pattern structures. Projections significantly decrease the number of patterns, while preserving the most interesting ones for an expert. Projections are built to answer questions that an expert may have. Moreover, combining projections with the concept stability index provides an efficient tool for the analysis of complex sequential datasets. A second advantage of projections is their ability to significantly decrease the complexity of a problem, thus saving computational time.

  • Experimental evaluations. We evaluate our approach on a real sequence dataset from a regional healthcare system. The data set contains ordered sets of hospitalizations for cancer patients with information about the hospitals they visited, causes for the hospitalizations and medical procedures. These ordered sets are considered as sequences. The experiments reveal interesting (from a medical point of view) and useful patterns, and show the feasibility and the efficiency of our approach.

This paper is an extension of the work presented at the CLA’14 conference (Buzmakov et al., 2013). The main differences w.r.t. the CLA’14 paper are a more complete explanation of the mathematical framework and a new experimental part evaluating different aspects of the introduced framework.

The paper is organized as follows. Section 2 introduces formal concept analysis and pattern structures. The specification of pattern structures for the case of sequences is presented in Section 3. Section 4 describes projections of sequential pattern structures, followed in Section 5 by the evaluation and experiments. Finally, related work is discussed before concluding the paper.

2 FCA and pattern structures

2.1 Formal concept analysis

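FCA is a formalism that can be used for guiding data analysis and knowledge discovery (Ganter and Wille, 1999). FCA starts with a formal context and builds a set of formal concepts organized within a concept lattice. A formal context is a triple $(G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes and $I$ is a relation between $G$ and $M$, $I \subseteq G \times M$. In Table 1, a cross table for a formal context is shown. A Galois connection between $\wp(G)$ and $\wp(M)$ is defined as follows:

$$A' = \{m \in M \mid \forall g \in A,\ (g, m) \in I\}, \quad A \subseteq G$$
$$B' = \{g \in G \mid \forall m \in B,\ (g, m) \in I\}, \quad B \subseteq M$$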

The Galois connection maps a set of objects to the maximal set of attributes shared by all objects and reciprocally. For example, , while , i.e. the set is not maximal. Given a set of objects $A$, we say that $A'$ is the description of $A$.

x x
x x
x
x x
Table 1: A toy FCA context.

Figure 1: Concept Lattice for the toy context
Definition 1.

A formal concept is a pair $(A, B)$, where $A \subseteq G$ is a subset of objects, $B \subseteq M$ is a subset of attributes, such that $A' = B$ and $A = B'$, where $A$ is called the extent of the concept, and $B$ is called the intent of the concept.

A formal concept corresponds to a pair of maximal sets of objects and attributes, i.e. it is not possible to add an object or an attribute to the concept without violating the maximality property. For example a pair is a formal concept. Formal concepts can be partially ordered w.r.t. the extent inclusion (dually, intent inclusion). For example, This partial order of concepts is shown in Figure 1. The number of formal concepts for a given context can be exponential w.r.t. the cardinality of set of objects or set of attributes. It is easy to see that for context , where , the number of concepts is equal to .
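The derivation operators and the notion of a formal concept can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the context below is a hypothetical three-object example (it is not the context of Table 1), and the enumeration naively closes every subset of objects, so it is suitable for toy data only.

```python
from itertools import combinations

# Hypothetical toy context (NOT Table 1): object -> set of its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m2", "m3"},
    "g3": {"m2"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def intent_of(objects):
    """Attributes shared by all the given objects (all attributes if empty)."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return frozenset(shared)


def extent_of(attributes):
    """Objects that have all the given attributes."""
    wanted = set(attributes)
    return frozenset(g for g, row in CONTEXT.items() if wanted <= row)


def all_concepts():
    """Naive enumeration: close every subset of objects (exponential)."""
    concepts = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts
```

Each pair returned by `all_concepts()` is closed in both directions, which is the maximality property described above. Algorithms such as AddIntent avoid the exponential loop over subsets of objects.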

2.2 Stability index of a concept

The number of concepts in a lattice for real-world tasks can be large. To find the most interesting subset of concepts, different measures can be used such as the stability of the concept (Kuznetsov, 2007) or the concept probability and separation (Klimushkin et al., 2010). These measures help to extract the most interesting concepts. However, the latter two are less reliable on noisy data.

Definition 2.

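Given a concept $c$, the concept stability $Stab(c)$ is the relative number of subsets of the concept extent (denoted $Ext(c)$), whose description, i.e. the result of $(\cdot)'$, is equal to the concept intent (denoted $Int(c)$).

$$Stab(c) := \frac{|\{s \in \wp(Ext(c)) \mid s' = Int(c)\}|}{|\wp(Ext(c))|} \qquad (1)$$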

Here $\wp(P)$ is the powerset of $P$. Stability measures how much a concept depends on the objects in its extent. The larger the stability, the more combinations of objects can be deleted from the context without affecting the intent of the concept, i.e. the intent of the most stable concepts is likely to be a characteristic pattern of a given phenomenon and not an artifact of a dataset. Of course, stable concepts still depend on the dataset, and, consequently, some important information can be contained in the unstable concepts. However, stability can be considered a good heuristic for selecting concepts, because the more stable a concept is, the less it depends on the given dataset w.r.t. object removal.
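As an illustration of Definition 2, stability can be computed by brute force on a small example. The sketch below assumes a hypothetical context with three objects (not Table 2) and follows the usual convention that the description of the empty set of objects is the full attribute set; it is exponential in the extent size and is not the algorithm of Roth et al. (2008).

```python
from itertools import combinations

# Hypothetical context, chosen only for illustration (NOT Table 2).
CONTEXT = {
    "g1": frozenset({"a", "b"}),
    "g2": frozenset({"a", "c"}),
    "g3": frozenset({"a", "d"}),
}


def description(objects, context):
    """Common attributes of the objects; all attributes for the empty set."""
    result = frozenset().union(*context.values())
    for g in objects:
        result &= context[g]
    return result


def stability(extent, context):
    """Share of the subsets of `extent` whose description equals the intent."""
    intent = description(extent, context)
    members = sorted(extent)
    hits = sum(
        1
        for size in range(len(members) + 1)
        for subset in combinations(members, size)
        if description(subset, context) == intent
    )
    return hits / 2 ** len(members)
```

For the concept with extent `{g1, g2, g3}` and intent `{a}`, the three pairs and the full triple keep the intent `{a}`, while the empty set and the singletons do not, so the stability is 4/8.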

Example 1.

Figure 2 shows a lattice for the context in Table 2, for simplicity some intents are not given. Extent of the outlined concept is , thus, its powerset contains elements. Descriptions of 5 subsets of ( and ) are different from , while all other subsets of have a common description equal to . So, .

x x
x x
x x
x x
x
Table 2: A toy formal context

Figure 2: Concept lattice for the context in Table 2 with the corresponding stability indexes (values shown in the figure: 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.69 and 0.47).

One of the fastest algorithms for processing a concept lattice to compute stability is proposed in (Roth et al., 2008), with a worst-case complexity of $O(L^2)$, where $L$ is the size of the concept lattice. The experimental section shows that for a big lattice, the stability computation can take much more time than the construction of the concept lattice. Thus, the estimation of concept stability is an important question. Here we present an efficient way for such an estimation. It should be noticed that in a lattice the extent of any ancestor of a concept $c$ is a superset of the extent of $c$, while the extent of any descendant is a subset. Given a concept $c$ and an immediate descendant $d$, we have $\forall s \subseteq Ext(d)$, $s'' \subseteq Ext(d) \subset Ext(c)$, which means that $s'' \neq Ext(c)$, i.e. $s' \neq Int(c)$. Thus, we can exclude from the computation of the numerator of stability in (1) all subsets of the extent of a direct descendant $d$. Hence, the following bound holds:

$$Stab(c) \leq 1 - \max_{d \in DD(c)} \frac{1}{2^{\Delta(c, d)}} \qquad (2)$$

where $DD(c)$ is the set of all direct descendants of $c$ and $\Delta(c, d)$ is the size of the set-difference between the extent of $c$ and the extent of $d$, $\Delta(c, d) = |Ext(c) \setminus Ext(d)|$.

Example 2.

With help of (2) we can find all stable concepts (and some unstable), i.e. the concepts with a high stability w.r.t. a threshold . If , we should compute for each concept in the lattice the following value and then select concepts verifying .
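The filtering step of Example 2 can be sketched as follows. The sketch assumes that bound (2) has the form "one minus the largest value of $2^{-\Delta}$ over the direct descendants", where $\Delta$ is the number of objects lost when moving to a descendant; the function names and the extents in the usage note are hypothetical.

```python
def stability_upper_bound(extent, child_extents):
    """Upper bound on stability computed from the direct descendants only.

    extent: set of objects of the concept.
    child_extents: extents of its direct descendants (each a proper subset).
    """
    if not child_extents:
        # No descendant to exclude subsets with: only the trivial bound holds.
        return 1.0
    delta_min = min(len(extent - child) for child in child_extents)
    return 1.0 - 2.0 ** (-delta_min)


def may_be_stable(extent, child_extents, theta):
    """Keep a concept only if its upper bound reaches the threshold theta."""
    return stability_upper_bound(extent, child_extents) >= theta
```

A concept rejected by `may_be_stable` is certainly below the threshold, while a retained concept may still be unstable, which matches the remark that the bound finds all stable concepts and some unstable ones. For instance, an extent of four objects with a direct descendant that loses a single object yields a bound of 0.5.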

2.3 Pattern structures

Although FCA applies to binary contexts, more complex data such as sequences or graphs can be directly processed as well. For that, pattern structures were introduced in Ganter and Kuznetsov (2001).

Definition 3.

A pattern structure is a triple $(G, (D, \sqcap), \delta)$, where $G$ is a set of objects, $(D, \sqcap)$ is a complete meet-semilattice of descriptions and $\delta: G \to D$ maps an object to a description.

The lattice operation in the semilattice ($\sqcap$) corresponds to the similarity between two descriptions. Standard FCA can be presented in terms of a pattern structure. In this case, $G$ is the set of objects, the semilattice of descriptions is $(\wp(M), \cap)$ and a description is a set of attributes, with the $\sqcap$ operation corresponding to the set intersection ($\wp(M)$ denotes the powerset of $M$). If and then . The mapping $\delta: G \to \wp(M)$ is given by $\delta(g) = \{m \in M \mid (g, m) \in I\}$, and returns the description for a given object as a set of attributes.

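The Galois connection for a pattern structure $(G, (D, \sqcap), \delta)$ is defined as follows:

$$A^{\diamond} := \bigsqcap_{g \in A} \delta(g), \quad \text{for } A \subseteq G$$
$$d^{\diamond} := \{g \in G \mid d \sqsubseteq \delta(g)\}, \quad \text{for } d \in D$$

The Galois connection makes a correspondence between sets of objects and descriptions. Given a subset of objects $A$, $A^{\diamond}$ returns the description which is common to all objects in $A$. Given a description $d$, $d^{\diamond}$ is the set of all objects whose description subsumes $d$. More precisely, the partial order (or the subsumption order) on $D$ ($\sqsubseteq$) is defined w.r.t. the similarity operation $\sqcap$: $c \sqsubseteq d \Leftrightarrow c \sqcap d = c$, and $c$ is subsumed by $d$.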

Definition 4.

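A pattern concept of a pattern structure $(G, (D, \sqcap), \delta)$ is a pair $(A, d)$ where $A \subseteq G$ and $d \in D$ such that $A^{\diamond} = d$ and $d^{\diamond} = A$; $A$ is called the concept extent and $d$ is called the concept intent.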

As in standard FCA, a pattern concept corresponds to the maximal set of objects $A$ whose description subsumes the description $d$, where $d$ is the maximal common description for objects in $A$. The set of all concepts can be partially ordered w.r.t. the partial order on extents (dually, on intent patterns, i.e. $\sqsubseteq$), within a concept lattice.

An example of pattern structures is given in Table 3, while the corresponding lattice is depicted in Figure 3.

As stability of concepts only depends on extents, it can be defined by the same procedure for both formal contexts and pattern structures.
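The two derivation operators of a pattern structure only need the similarity operation, so they can be written generically. The following is a minimal sketch in which descriptions are plain attribute sets and the meet is set intersection (standard FCA seen as a pattern structure); the objects and descriptions in `DELTA` are hypothetical.

```python
from functools import reduce

# Hypothetical descriptions: plain attribute sets, so the meet is intersection.
DELTA = {
    "g1": frozenset("abc"),
    "g2": frozenset("acd"),
    "g3": frozenset("bd"),
}


def set_meet(x, y):
    """Similarity of two descriptions in the set-based case."""
    return x & y


def common_description(objects, delta, meet):
    """Meet of the descriptions of a non-empty set of objects."""
    return reduce(meet, (delta[g] for g in objects))


def is_subsumed(c, d, meet):
    """c is subsumed by d iff the meet of c and d equals c."""
    return meet(c, d) == c


def objects_matching(d, delta, meet):
    """Objects whose description subsumes d."""
    return frozenset(g for g in delta if is_subsumed(d, delta[g], meet))
```

Replacing `set_meet` by a similarity operation on sets of sequences gives the sequential case discussed in the next section, without changing the two operators.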

3 Sequential pattern structures

Certain phenomena, such as a patient trajectory (clinical history), can be considered as a sequence of events. This section describes how FCA and pattern structures can process sequential data.

3.1 An example of sequential data

Patient Trajectory
Table 3: Toy sequential data on patient medical trajectories.

Imagine that we have medical trajectories of patients, i.e. sequences of hospitalizations, where every hospitalization is described by a hospital name and a set of procedures. An example of sequential data on medical trajectories with three patients is given in Table 3. We have a set of procedures , a set of hospital names , where hospital names are hierarchically organized (by level of generality). and are central hospitals (), and are clinics (), and denotes the root of this hierarchy. The least common ancestor in this hierarchy is denoted by , for any , i.e. . Every hospitalization is described by one hospital name and may contain several procedures. The procedure order in each hospitalization is not important in our case. For example, the first hospitalization for the second patient () was a stay in hospital and during this hospitalization the patient underwent procedures and . An important task is to find the “characteristic” sequences of procedures and associated hospitals in order to improve hospitalization planning, optimize clinical processes or detect anomalies.

We approach the search for characteristic sequences by finding the most stable concepts in the lattice corresponding to a sequential pattern structure. To simplify calculations, subsequences are considered without “gaps”, i.e. the order of non-consecutive elements is not taken into account. This is reasonable in this task because experts are interested in regular consecutive events in healthcare trajectories. A sequential pattern structure is a set of sequences and is based on the set of maximal common subsequences (without gaps) between two sequences. The next subsections define a partial order on sequences and the corresponding pattern structures.

3.2 Partial order on complex sequences

A sequence is constituted of elements from an alphabet. The classical subsequence matching task requires no special properties of the alphabet. Several generalizations of the classical case were made by introducing a subsequence relation based on an itemset alphabet (Agrawal and Srikant, 1995) or on a multidimensional and multilevel alphabet (Plantevit et al., 2010). Here, we generalize the previous cases by requiring the alphabet to form a semilattice $(E, \sqcap_E)$. (Note that in this paper we consider two semilattices: the first one is related to the characters of the alphabet, $(E, \sqcap_E)$, and the second one is related to pattern structures, $(D, \sqcap)$.) Thanks to the formalism of pattern structures we are able to process in a unified way all types of sequential datasets with a poset-shaped alphabet (it is mentioned above that any partial order can be transformed into a semilattice). However, some sequential data can have connections between elements, e.g. (Adda et al., 2010), and, thus, cannot be straightforwardly processed by our approach.

Definition 5.

Given a semilattice $(E, \sqcap_E)$, also called an alphabet, a sequence is an ordered list of elements from $E$. We denote it by $\langle e_1; e_2; \cdots; e_n \rangle$ where $e_i \in E$.

In this alphabet semilattice there is a bottom element $\bot_E$ that can be matched with any other element. Formally, $\forall e \in E$, $\bot_E = \bot_E \sqcap_E e$. This element is required by the lattice structure, but provides no useful information. Thus, it should be excluded from sequences. The bottom element of $E$ corresponds to the empty set in sequential mining (Agrawal and Srikant, 1995), and the empty set is always ignored in this domain.

Definition 6.

A valid sequence $\langle e_1; \cdots; e_n \rangle$ is a sequence where $e_i \neq \bot_E$ for all $i \in \{1, \ldots, n\}$.

Definition 7.

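Given an alphabet $(E, \sqcap_E)$ and two sequences $t = \langle t_1; \ldots; t_k \rangle$ and $s = \langle s_1; \ldots; s_n \rangle$ based on $E$ ($t_i, s_j \in E$), the sequence $t$ is a subsequence of $s$, denoted $t \leq s$, iff $k \leq n$ and there exist $j_1, \ldots, j_k$ such that $1 \leq j_1 < j_2 < \ldots < j_k \leq n$ and for all $i \in \{1, 2, \ldots, k\}$, $t_i \sqsubseteq_E s_{j_i}$, i.e. $t_i \sqcap_E s_{j_i} = t_i$.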

Example 3.

In the running example (Section 3.1), the alphabet is with the similarity operation , where are hospitals and are sets of procedures. Thus, the sequence is a subsequence of because if we set (Definition 7) then (‘CH’ is more general than and ), (the same hospital and ) and (‘*’ is more general than and ).

With complex sequences and this kind of subsequence relation the computation can be hard. Thus, for the sake of simplification, only “contiguous” subsequences are considered, where only the order of consecutive elements is taken into account, i.e. given $j_i$ in Definition 7, $j_{i+1} = j_i + 1$ for all $i \in \{1, 2, \ldots, k-1\}$. Since experts are interested in regular consecutive events in healthcare trajectories, such a restriction does make sense for our data. It helps to connect only related hospitalizations.
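The general and the contiguous subsequence relations can be sketched as below. The sketch makes several assumptions that are not in the text: an element is a pair (hospital, set of procedures), the hospital taxonomy is the small hypothetical tree stored in `PARENT` (the names `H1` to `H4` and `CL` are placeholders, with `CH` and `*` as used in Example 3), the similarity of two elements is the least common ancestor of the hospitals paired with the intersection of the procedure sets, and the general relation is tested by greedy left-to-right matching.

```python
# Hypothetical hospital taxonomy: child -> parent, '*' is the root.
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*"}


def ancestors(hospital):
    """The hospital followed by all of its ancestors up to the root."""
    chain = [hospital]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def hospital_meet(h1, h2):
    """Least common ancestor of two nodes of the taxonomy."""
    others = set(ancestors(h2))
    return next(a for a in ancestors(h1) if a in others)


def meet(e1, e2):
    """Similarity of two elements (hospital, frozenset of procedures)."""
    return (hospital_meet(e1[0], e2[0]), e1[1] & e2[1])


def leq(e1, e2):
    """e1 is more general than (or equal to) e2."""
    return meet(e1, e2) == e1


def is_subsequence(t, s):
    """Subsequence test with gaps: match each element as early as possible."""
    j = 0
    for element in t:
        while j < len(s) and not leq(element, s[j]):
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def is_contiguous_subsequence(t, s):
    """Subsequence test without gaps: try every starting position in s."""
    return any(
        all(leq(t[i], s[start + i]) for i in range(len(t)))
        for start in range(len(s) - len(t) + 1)
    )
```

Matching each element at the earliest possible position is sufficient here because whether an element of `t` matches an element of `s` does not depend on the other matches.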

The next section introduces pattern structures that are based on complex sequences with a general subsequence relation, while the experiments are provided for a “contiguous” subsequence relation.

3.3 Sequential meet-semilattice

Based on the previous definitions, we can define the sequential pattern structure used for representing and managing sequences. For that, we make an analogy with the pattern structures for graphs (Kuznetsov, 1999) where the meet-semilattice operation respects subgraph isomorphism. Thus, we introduce a sequential meet-semilattice respecting the subsequence relation. Given an alphabet lattice $(E, \sqcap_E)$, $S$ is the set of all valid sequences based on $(E, \sqcap_E)$. $S$ is partially ordered w.r.t. Definition 7. $(D, \sqcap)$ is a semilattice on $S$, where $D \subseteq \wp(S)$ such that, if $d \in D$ contains a sequence $s$, then all subsequences of $s$ should be included into $d$, $\forall s \in d, \nexists \tilde{s} \leq s : \tilde{s} \notin d$, and the similarity operation is the set intersection for two sets of sequences. Given two patterns $d_1, d_2 \in D$, the set intersection operation ensures that if a sequence $s$ belongs to $d_1 \sqcap d_2$ then any subsequence of $s$ belongs to $d_1 \sqcap d_2$ and thus $(d_1 \sqcap d_2) \in D$. As the set intersection operation is idempotent, commutative and associative, $(D, \sqcap)$ is a semilattice.

Example 4.

If pattern includes sequence (see Table 4), then it should include also , , and others. If pattern includes , then it should include , and . Thus the intersection of two sets and is equal to the set .

The next proposition stems from the aforementioned and will be used in the proofs in the next section.

Proposition 1.

Given $X \in D$ and $Y \in D$, $X \sqsubseteq Y$ if and only if for every sequence $x \in X$ there is a sequence $y \in Y$, such that $x \leq y$.

The set of all possible subsequences for a given sequence can be large. Thus, it is more efficient to consider a pattern as a set of only maximal sequences , . Furthermore, every pattern will be given only by the set of all maximal sequences. For example, (see Tables 3 and 4), i.e. is the set of all maximal sequences specifying the intersection of and . Similarly we have . Note that representing a pattern by the set of all maximal sequences allows for an efficient implementation of the intersection “” of two patterns (in Section 5.1 we give more details on similarity operation w.r.t. a contiguous subsequence relation).

Example 5.

The sequential pattern structure for our example (Subsection 3.1) is , where , is the semilattice of sequential descriptions, and is the mapping associating an object in to a description in shown in Table 3. Figure 3 shows the resulting lattice of sequential pattern concepts for this particular pattern structure .

Figure 3: The concept lattice for the pattern structure given by Table 3. Concept intents refer to the sequences in Tables 3 and 4.
Subsequences





Table 4: Subsequences of patient sequences in Table 3.

4 Projections of sequential pattern structures

Pattern structures are hard to process due to the large number of concepts in the concept lattice and the complexity of the involved descriptions and of the similarity operation. Moreover, a given pattern structure can produce a lattice with a lot of patterns which are not interesting for an expert. Can we save computational time by not computing “useless” patterns? Projections of pattern structures “simplify” to some degree the computation and allow one to work with a reduced description. In fact, projections can be considered as filters on patterns respecting mathematical properties. These properties ensure that the projection of a semilattice is a semilattice and that projected concepts are related to the original ones (Ganter and Kuznetsov, 2001). Moreover, the stability of a projected concept is never lower than the stability of the original concept with the same extent. We introduce projections on sequential patterns by revising Ganter and Kuznetsov (2001). It is necessary to provide an extended definition of projection in order to deal with interesting projections for real-world sequential datasets.

Definition 8 (Ganter and Kuznetsov (2001)).

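A projection $\psi: D \to D$ is an interior operator, i.e. it is (1) monotone ($x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$), (2) contractive ($\psi(x) \sqsubseteq x$) and (3) idempotent ($\psi(\psi(x)) = \psi(x)$).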

Definition 9.

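A projected pattern structure $\psi((G, (D, \sqcap), \delta))$ is a pattern structure $(G, (D_\psi, \sqcap_\psi), \psi \circ \delta)$, where $D_\psi = \psi(D) = \{d \in D \mid \exists \tilde{d} \in D : \psi(\tilde{d}) = d\}$ and $\forall x, y \in D$, $x \sqcap_\psi y := \psi(x \sqcap y)$.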

Note that in (Ganter and Kuznetsov, 2001) $\psi((G, (D, \sqcap), \delta)) = (G, (D, \sqcap), \psi \circ \delta)$. Our definition allows one to use a wider set of projections. In fact, all projections that we describe for sequential pattern structures below require Definition 9. Now we should show that $(D_\psi, \sqcap_\psi)$ is a semilattice.

Proposition 2.

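Given a semilattice $(D, \sqcap)$ and a projection $\psi$, for all $x, y \in D$, $\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.

Proof.
  1. $\psi(x) \sqsubseteq x$, thus, $x \sqcap y \sqsupseteq \psi(x) \sqcap y$

  2. $x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$, thus, $\psi(x \sqcap y) \sqsupseteq \psi(\psi(x) \sqcap y)$

  3. $\psi(x \sqcap y) \sqsubseteq \psi(x)$ and $\psi(x \sqcap y) \sqsubseteq x \sqcap y \sqsubseteq y$,
    then $\psi(x \sqcap y) \sqsubseteq \psi(x) \sqcap y$ and $\psi(\psi(x \sqcap y)) = \psi(x \sqcap y) \sqsubseteq \psi(\psi(x) \sqcap y)$

  4. From (2) and (3) it follows that $\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.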

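Corollary 1.

$X_1 \sqcap_\psi X_2 \sqcap_\psi \cdots \sqcap_\psi X_N = \psi(X_1 \sqcap X_2 \sqcap \cdots \sqcap X_N)$

Proof.

It can be proven by induction.

  1. $X_1 \sqcap_\psi X_2 = \psi(X_1 \sqcap X_2)$ by Definition 9.

  2. If $X_1 \sqcap_\psi \cdots \sqcap_\psi X_K = \psi(X_1 \sqcap \cdots \sqcap X_K)$, then $X_1 \sqcap_\psi \cdots \sqcap_\psi X_K \sqcap_\psi X_{K+1} = \psi(\psi(X_1 \sqcap \cdots \sqcap X_K) \sqcap X_{K+1}) = \psi(X_1 \sqcap \cdots \sqcap X_K \sqcap X_{K+1})$ by Proposition 2.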

Corollary 2.

Given a semilattice $(D, \sqcap)$ and a projection $\psi$, $(D_\psi, \sqcap_\psi)$ is a semilattice, i.e. $\sqcap_\psi$ is commutative, associative and idempotent.

The concepts of a pattern structure and a projected pattern structure are connected through Proposition 3. This proposition can be found in Ganter and Kuznetsov (2001), but thanks to Corollary 1, it is valid in our case.

Proposition 3.

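Given a concept in $\psi((G, (D, \sqcap), \delta))$, the extent is an extent in $(G, (D, \sqcap), \delta)$. Given a concept $(A, d_\psi)$ in $\psi((G, (D, \sqcap), \delta))$, the intent $d_\psi$ is of the form $d_\psi = \psi(d)$, where $(A, d)$ is a concept in $(G, (D, \sqcap), \delta)$.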

Moreover, while preserving the extents of some concepts, projections cannot decrease the stability of the projected concepts, i.e. if the projection preserves a stable concept, then its stability (Definition 2) can only increase.

Proposition 4.

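Given a pattern structure $(G, (D, \sqcap), \delta)$, its concept $c$ and a projected pattern structure $(G, (D_\psi, \sqcap_\psi), \psi \circ \delta)$, and the projected concept $\tilde{c}$, if the concept extents are equal ($Ext(c) = Ext(\tilde{c})$) then $Stab(c) \leq Stab(\tilde{c})$.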

Proof.

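Concepts $c$ and $\tilde{c}$ have the same extent. Thus, according to Definition 2, in order to prove the proposition, it is enough to prove that for any subset $A \subseteq Ext(c)$, if $A^{\diamond} = Int(c)$ in the original pattern structure, then $A^{\diamond} = Int(\tilde{c})$ in the projected one.

Suppose that $\exists A \subset Ext(c)$ such that $A^{\diamond} = Int(c)$ in the original pattern structure and $A^{\diamond} \neq Int(\tilde{c})$ in the projected one. Then there is a descendant concept $\tilde{d}$ of $\tilde{c}$ in the projected pattern structure such that $A^{\diamond\diamond} = Ext(\tilde{d})$ in the projected lattice. Then there is an original concept $d$ for the projected concept $\tilde{d}$ with the same extent $Ext(d)$. Then $A^{\diamond} \sqsupseteq Int(d) \sqsupset Int(c)$ and, so, $A^{\diamond}$ cannot be equal to $Int(c)$ in the original lattice. Contradiction. ∎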

Now we are going to present two projections of sequential pattern structures. The first projection comes from the following observation. In many cases it may be more interesting to analyze quite long subsequences rather than short ones. This kind of projection is called Minimal Length Projection (MLP) and it depends on the minimal length parameter $l$ for the sequences in a pattern. The corresponding function $\psi$ maps a pattern without short sequences to itself, and a pattern with short sequences to the pattern containing only long sequences w.r.t. a given length threshold. Later, Proposition 5, whose proof relies on Proposition 1, states that MLP is coherent with Definition 8.

Definition 10.

The function $\psi_{MLP}: D \to D$ of minimal length $l$ is defined as

$$\psi_{MLP}(d) = \{s \in d \mid length(s) \geq l\}$$

Example 6.

If we prefer common subsequences of length , then between and in Table 3 there is only one maximal common subsequence, in Table 4, while and are too short to be considered. Figure 4(a) shows the lattice of the projected pattern structure (Table 3) with patterns of length greater or equal to .

Proposition 5.

The function $\psi_{MLP}$ is a monotone, contractive and idempotent function on the semilattice $(D, \sqcap)$.

Proof.

The contractivity and idempotency are quite clear from the definition. It remains to prove the monotonicity.

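If $X \sqsubseteq Y$, where $X$ and $Y$ are sets of sequences, then for every sequence $x \in X$ there is a sequence $y \in Y$ such that $x \leq y$ (Proposition 1). We should show that $\psi(X) \sqsubseteq \psi(Y)$, or in other words for every sequence $x \in \psi(X)$ there is a sequence $y \in \psi(Y)$, such that $x \leq y$. Given $x \in \psi(X)$, since $\psi(X)$ is a subset of $X$ and $X \sqsubseteq Y$, there is a sequence $y \in Y$ such that $x \leq y$, with $|y| \geq |x| \geq l$ ($l$ is a parameter of MLP), and thus, $y \in \psi(Y)$. ∎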
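The minimal length projection is simple enough to be written directly. In the sketch below a pattern is assumed to be represented by its set of maximal sequences, each sequence being a tuple of alphabet elements; the element names `e1` to `e5` are hypothetical placeholders.

```python
def mlp(pattern, min_length):
    """Keep only the sequences of the pattern with at least min_length elements."""
    return frozenset(s for s in pattern if len(s) >= min_length)


# Hypothetical pattern given by three maximal sequences.
PATTERN = frozenset({("e1", "e2", "e3"), ("e4",), ("e2", "e5")})
PROJECTED = mlp(PATTERN, 2)
```

Here `PROJECTED` contains the two sequences of length at least 2. Contractivity and idempotency are visible on the example: the result is a subset of the input, and applying `mlp` a second time with the same threshold changes nothing.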

Another important type of projection is related to a variation of the lattice alphabet $(E, \sqcap_E)$. One possible variation of the alphabet is to ignore certain fields in the elements. For example, if a hospitalization is described by a hospital name and a set of procedures, then either the hospital or the procedures can be ignored in the similarity computation. For that, in any element the set of procedures should be substituted by $\emptyset$, or the hospital by $*$ (“arbitrary hospital”), which is the most general element of the taxonomy of hospitals.

Another variation of the alphabet is to require that some field(s) should not be empty. For example, we may want to find patterns with a non-empty set of procedures, or patterns in which the element $*$ of the hospital taxonomy is not allowed in the elements of a sequence. Such variations are easy to realize within our approach. For this, when computing the similarity operation between elements of the alphabet, one should check if the result contains empty fields and, if so, substitute the result by $\bot_E$. This variation is useful, as shown in the experimental section, but is rather difficult to define within more classical frequent sequence mining approaches, which will be discussed later.

Example 7.

An expert is interested in finding sequential patterns describing how a patient changes hospitals, but with little interest in procedures. Thus, any element of the alphabet lattice, containing a hospital and a non-empty set of procedures can be projected to an element with the same hospital, but with an empty set of procedures.

Example 8.

An expert is interested in finding sequential patterns containing some information about the hospital in every hospitalization, and the corresponding procedures, i.e. the hospital field in the patterns cannot be equal to $*$, e.g., is an invalid pattern, while is a valid pattern in Table 4. Thus, any element of the alphabet semilattice with $*$ in the hospital field can be projected to $\bot_E$. Figure 4(b) shows the lattice corresponding to the projected pattern structure (Table 3) defined by a projection of the alphabet semilattice.

Below we formally define how the alphabet projection of a sequential pattern structure should be processed. Intuitively, every sequence in a pattern should be substituted with another sequence, by applying the alphabet projection to all its elements. However, the result can be an incorrect sequence, because $\bot_E$ cannot belong to a valid sequence. Thus, sequences in a pattern should be “developed” w.r.t. $\bot_E$, as explained below.

Definition 11.

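Given an alphabet $(E, \sqcap_E)$, a projection of the alphabet $\psi_E$ and a sequence $s = \langle s_1; \ldots; s_n \rangle$ based on $E$, the projection $\psi_E(s)$ is the sequence $\tilde{s} = \langle \tilde{s}_1; \ldots; \tilde{s}_n \rangle$, such that $\tilde{s}_i = \psi_E(s_i)$.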

Here, it should be noticed that $\tilde{s}$ is not necessarily a valid sequence (see Definition 6), since it can include $\bot_E$ as an element. However, in sequential pattern structures, elements should include only valid sequences (see Section 3.3).

Definition 12.

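Given an alphabet $(E, \sqcap_E)$ and a projection of the alphabet $\psi_E$, an alphabet projection $\psi$ for the sequential pattern structure is the set of valid sequences smaller than the projected sequences from $d$:

$$\psi(d) = \{s \in S \mid \exists \tilde{s} \in d : s \leq \psi_E(\tilde{s})\}$$

where $S$ is the set of all valid sequences based on $(E, \sqcap_E)$.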

Example 9.

is an alphabet-projected pattern for the pattern , where the alphabet lattice projection is given in Example 8.

In the case of contiguous subsequences, is an alphabet-projected pattern for the pattern , where the alphabet lattice projection is given by projecting every element with medical procedure to the element with the same hospital and with the same set of procedures excluding . The projection of sequence is , but , and, thus, in order to project the pattern the projected sequence is substituted by its maximal subsequences, i.e.
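For contiguous subsequences, "developing" a projected sequence w.r.t. the bottom element amounts to cutting it at every position where the projected element is the bottom element. The sketch below illustrates this under stated assumptions: elements are pairs (hospital, set of procedures), the bottom element is taken to be the pair of the root hospital `*` and the empty set, and the hospital names and procedure letters are hypothetical. `forbid_root_hospital` corresponds to the projection of Example 8 and `drop_procedures` to the one of Example 7.

```python
# Assumed bottom element of the alphabet: root hospital and no procedure.
BOTTOM = ("*", frozenset())


def forbid_root_hospital(element):
    """Alphabet projection in the spirit of Example 8: '*' is not allowed."""
    hospital, _procedures = element
    return BOTTOM if hospital == "*" else element


def drop_procedures(element):
    """Alphabet projection in the spirit of Example 7: ignore procedures."""
    hospital, _procedures = element
    return (hospital, frozenset())


def project_sequence(sequence, project_element):
    """Project every element, then cut the sequence at bottom elements.

    Returns the maximal valid contiguous pieces of the projected sequence.
    """
    pieces, current = [], []
    for element in sequence:
        image = project_element(element)
        if image == BOTTOM:
            if current:
                pieces.append(tuple(current))
            current = []
        else:
            current.append(image)
    if current:
        pieces.append(tuple(current))
    return pieces
```

To project a whole pattern, one would apply `project_sequence` to each of its sequences and keep only the maximal pieces; that last filtering step is omitted here.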

Proposition 6.

Considering an alphabet $(E, \sqcap_E)$, a projection of the alphabet $\psi_E$ and a sequential pattern structure $(G, (D, \sqcap), \delta)$, the alphabet projection $\psi$ (see Definition 12) is monotone, contractive and idempotent.

Proof.

This projection is idempotent, since the projection of the alphabet is idempotent and only the projection of the alphabet can change the elements appearing in sequences.

It is contractive because for any pattern and any sequences , a projection of the sequence is a subsequence of . In Definition 12 the projected sequences should be substituted by their subsequences in order to avoid , building the sets . Thus, is a supersequence for any , and, so, the projected pattern is subsumed by the pattern .

Finally, we should show monotonicity. Given two patterns , such that , i.e. , consider the projected sequence of , . As for some then for some (see Definition 7) (), then (by the monotonicity of the alphabet projection), i.e. the projected sequence preserves the subsequence relation. Thus, the set of allowed subsequences of is a subset of the set of allowed subsequences of . Hence, the alphabet projection of the pattern preserves pattern subsumption relation, (Proposition 1), i.e. the alphabet projection is monotone. ∎

(a) MLP projection,

(b) Projection removing ‘*’ in the hospital field
Figure 4: The projected concept lattices for the pattern structure given by Table 3. Concept intents refer to the sequences in Tables 3 and 4.

5 Sequential pattern structure evaluation

5.1 Implementation

Nearly any state-of-the-art FCA algorithm can be adapted to process pattern structures. We adapted the AddIntent algorithm (Merwe et al., 2004), as the lattice structure is important for us to calculate stability (see an algorithm for calculating stability in (Roth et al., 2008)). To adapt the algorithm to our needs, every set intersection operation on attributes is substituted with the semilattice operation $\sqcap$ on the corresponding patterns, while every subset checking operation is substituted with the semilattice order checking $\sqsubseteq$; in particular, all $(\cdot)'$ are substituted with $(\cdot)^{\diamond}$.

The next question is how the semilattice operation and subsumption relation can be implemented for contiguous sequences. Given two sets of sequences and , the similarity of these sets , is calculated according to Section 3.3, i.e. maximal sequences among all common subsequences for any pair of sequences and .

To find all common subsequences of two sequences, the following observations can be useful. If is a subsequence of with , i.e. (Definition 7: is the index difference from which is a contiguous subsequence of ) and a subsequence of with , i.e. , then for any index , . Thus, to find all maximal common subsequences of and , we first align and in all possible ways. For each alignment of and we compute the resulting intersection. Finally, we keep only the maximal intersected subsequences.

For example, let us consider two possible alignments of and :

The left intersection is not retained, as it is not maximal (), while the right intersection is kept.

The complexity of the alignment for two sequences and is , where is the complexity of computing a common ancestor in the alphabet lattice .
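The alignment procedure of this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: elements of the alphabet are modelled as Python `frozenset` objects, the semilattice operation on elements is set intersection, and an empty intersection is treated as a forbidden element that breaks a common subsequence. All function names are hypothetical.

```python
from typing import FrozenSet, Optional, Set, Tuple

Element = FrozenSet[str]
Seq = Tuple[Element, ...]


def meet(a: Element, b: Element) -> Optional[Element]:
    """Similarity of two elements; None stands for the forbidden (empty) element."""
    common = a & b
    return common if common else None


def is_subsequence(s: Seq, t: Seq) -> bool:
    """True if s is a contiguous subsequence of t for some index difference k."""
    n, m = len(s), len(t)
    return any(
        all(s[i] <= t[i + k] for i in range(n)) for k in range(m - n + 1)
    )


def aligned_intersections(s: Seq, t: Seq) -> Set[Seq]:
    """Intersect s and t under every alignment and collect the unbroken runs."""
    results: Set[Seq] = set()
    for shift in range(-(len(t) - 1), len(s)):
        run = []
        for i in range(len(s)):
            j = i - shift  # position in t aligned with position i in s
            e = meet(s[i], t[j]) if 0 <= j < len(t) else None
            if e is None:
                if run:
                    results.add(tuple(run))
                run = []
            else:
                run.append(e)
        if run:
            results.add(tuple(run))
    return results


def maximal_common_subsequences(s: Seq, t: Seq) -> Set[Seq]:
    """Keep only the intersections that are not subsequences of another one."""
    cands = aligned_intersections(s, t)
    return {
        c for c in cands
        if not any(c != d and is_subsequence(c, d) for d in cands)
    }
```

The final filtering step is quadratic in the number of candidate intersections; it is kept naive here for readability.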

5.2 Experiments and discussion

The experiments are carried out on a MacBook Pro with a 2.5 GHz Intel Core i5 and 8 GB of RAM, running OS X 10.6.8. The algorithms are not parallelized and are coded in C++.

Our use-case dataset comes from a French healthcare system called PMSI (Programme de Médicalisation des Systèmes d’Information; Fetter et al., 1980). Each element of a sequence has a “complex” nature. The dataset contains patients suffering from lung cancer who live in the Lorraine region (Eastern France). Every patient is described as a sequence of hospitalizations without any time-stamp. A hospitalization is a tuple with three elements: (i) the healthcare institution (e.g. the university hospital of Nancy), (ii) the reason for the hospitalization (e.g. a cancer disease), and (iii) the set of medical procedures that the patient undergoes. An example of a medical trajectory is given below:

This sequence represents a patient trajectory with three hospitalizations. It expresses that the patient was first admitted to the university hospital of Nancy with cancer as the reason, and underwent two procedures. The patient then had two consecutive hospitalizations in the general hospital of Paris for chemotherapy, with no additional procedure. Replacing identical consecutive hospitalizations with a single hospitalization and its number of repetitions yields a shorter and more understandable trajectory. For example, the above trajectory is transformed into two hospitalizations, where the first occurs once and the second twice.
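The compression of repeated consecutive hospitalizations is a run-length encoding. A minimal Python sketch is given below; the function name `compress_trajectory` and the field values (`hospital_A`, `hospital_B`, `p1`, `p2`) are hypothetical placeholders, not identifiers from the dataset.

```python
from itertools import groupby
from typing import FrozenSet, List, Tuple

# A hospitalization: (institution, reason, set of procedures).
Hospitalization = Tuple[str, str, FrozenSet[str]]


def compress_trajectory(
    trajectory: List[Hospitalization],
) -> List[Tuple[Hospitalization, int]]:
    """Replace each run of identical consecutive hospitalizations by (item, count)."""
    return [(h, sum(1 for _ in run)) for h, run in groupby(trajectory)]


# Hypothetical trajectory: one cancer stay, then two identical chemotherapy stays.
first = ("hospital_A", "Cancer", frozenset({"p1", "p2"}))
second = ("hospital_B", "Chemotherapy", frozenset())
compressed = compress_trajectory([first, second, second])
```

Only adjacent repetitions are merged, so a hospitalization that reappears later in the trajectory starts a new run.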

Diagnoses are coded according to the 10th revision of the International Classification of Diseases (ICD-10). Based on this coding, diagnoses can be described at 5 levels of granularity: root, chapter, block, 3-character and 4-character (terminal) nodes. This taxonomy has nodes. The healthcare institution is associated with a geographical taxonomy of 4 levels, where the first level is the root (France) and the second, third and fourth levels correspond to administrative region, administrative department and hospital, respectively. Figure 5 presents the University Hospital of Nancy (code: 540002078) as a hospital in Meurthe-et-Moselle, which is a department of Lorraine, a region of France. This taxonomy has nodes. The medical procedures are coded according to the French nomenclature “Classification Commune des Actes Médicaux (CCAM)”. The distribution of sequence lengths is shown in Figure 6.
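Since the geographical taxonomy is a tree, the similarity of two institutions can be taken as their closest common ancestor. The sketch below illustrates this under the assumption that the taxonomy is stored as a child-to-parent map. The node `540002078` and its ancestors come from the text above; every node whose name starts with `hypothetical_` is invented for illustration, and the function names are hypothetical as well.

```python
from typing import Dict, List, Optional

PARENT: Dict[str, Optional[str]] = {
    "France": None,
    "Lorraine": "France",
    "Meurthe-et-Moselle": "Lorraine",
    "540002078": "Meurthe-et-Moselle",  # University Hospital of Nancy
    # The nodes below are hypothetical and only serve the example.
    "hypothetical_hospital_B": "Meurthe-et-Moselle",
    "hypothetical_department_X": "Lorraine",
    "hypothetical_hospital_C": "hypothetical_department_X",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """List the node and all its ancestors, from the node up to the root."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def common_ancestor(a: str, b: str, parent: Dict[str, Optional[str]]) -> str:
    """Return the most specific node that is an ancestor of both a and b."""
    ancestors_of_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    raise ValueError("the two nodes do not share a root")
```

With a taxonomy of 4 levels, each call visits at most 4 nodes per argument, so the cost of one element comparison is bounded by the taxonomy depth.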

Figure 5: A geographical taxonomy of the healthcare institution
Figure 6: The length distribution of sequences in the dataset

With 500 patient trajectories, computing the whole lattice is infeasible. We are not interested in all possible frequent trajectories, but rather in trajectories that answer medical analysis questions. An expert may know the minimal size of the trajectories of interest, which amounts to setting the MLP projection. We use MLP projections of two different lengths, taking into account that most patients have at least 2 hospitalizations in their trajectory (see Figure 6).
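As a reminder of what the MLP projection does to a single pattern, here is a minimal Python sketch. It assumes that a pattern is a set of sequences and that the projection keeps only the sequences whose length is at least the chosen threshold; whether the bound is strict follows the definition given earlier in the paper and is an assumption here. The name `mlp_projection` is hypothetical.

```python
from typing import Set, Tuple

Sequence = Tuple[str, ...]


def mlp_projection(pattern: Set[Sequence], min_length: int) -> Set[Sequence]:
    """Drop every sequence of the pattern that is shorter than min_length."""
    return {s for s in pattern if len(s) >= min_length}


# Toy pattern with sequences of lengths 1, 2 and 3.
toy_pattern = {("a",), ("a", "b"), ("a", "b", "c")}
projected = mlp_projection(toy_pattern, 2)
```

Applying the function twice gives the same result as applying it once, and the result is always a subset of the input, which matches the idempotence and contractivity expected of a projection.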

Figure 7: Computational time (in seconds) for different projections; panels (a) and (b) each correspond to one MLP projection.
Figure 8: Lattice size for different projections; panels (a) and (b) each correspond to one MLP projection.

Figure 7 shows computational times for different projections as a function of dataset size. Figures 7(a) and 7(b) show different alphabet projections for the two MLP projections. Every alphabet projection is named after the fields considered within it: G corresponds to the hospital geo-location, R is the reason for a hospitalization, P is the set of medical procedures and I is the repetition interval, i.e. the number of consecutive hospitalizations with the same reason. These figures show that MLP saves computational resources as its length parameter increases. The difference in computational time between the two MLP projections is significant, especially for time-consuming cases. An even bigger variation can be observed across the alphabet projections. For example, computing the RPI projection takes 100 times more resources than any of GRP, RP, GR, GRP.

The same dependency can be seen in Figure 8, where the number of concepts for every projection is shown. Consequently, it is important for an expert to provide a strict projection that allows him to answer his questions in order to save computational time and memory.

Table 5 shows some interesting concept intents with the corresponding support and ranking w.r.t. concept stability. For example, concept #1 is obtained under the projection (i.e., we consider only hospital and reason), with the intent , where C341 Lung Cancer is a special kind of lung cancer (malignant neoplasm in Upper lobe, bronchus or lung). This concept is the most stable concept in the lattice for the given projection, and the size of the concept extent is patients.

# Projection Intent Stab. Rank Support