# Subjectively Interesting Subgroup Discovery on Real-valued Targets

###### Abstract

Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to consider, and infinitely many if we allow weighted combinations, even when these are restricted to linear ones. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes, for example, to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal Information Theoretic sense) with respect to a single real-valued target attribute or a set of them. The subgroup descriptions are in terms of a succinct set of arbitrarily-typed other attributes. The approach is based on the Subjective Interestingness framework FORSIED, which enables the use of prior knowledge when finding the most informative non-redundant patterns, and hence the method also supports iterative data mining.

## I Introduction

We introduce the central ideas by means of an example. Consider the situation where a user wants to learn about crime demographics, based on the UCI Communities and Crime data [1] (http://archive.ics.uci.edu/ml/datasets/communities+and+crime). This data contains violent crime rates for all districts in the US and over 120 other attributes describing demographic statistics of those districts. One method to learn about the relation between the ‘number of violent crimes’ attribute and the demographic attributes is to extract subgroup patterns, which are sets of data points where violent crime is surprisingly high (or low) and that share similar statistics for one or several demographic attributes. A subgroup pattern should be interpreted as ‘for data points that fall within the specified statistics that describe the subgroup, violent crime is surprisingly low/high’.

For example, the top subgroup pattern (identified through the method introduced in this paper) states that there are high violent crime rates in districts where many mothers are unmarried at the moment they give birth to their child (condition ; mean violent crime rate in subgroup vs. overall). An illustration of the data coverage for this pattern is given in Fig. 1. The subgroup covers of the data and may be interesting because the distribution of crime rates within this subgroup deviates substantially from that of the full data. If a user had no prior expectations about the data, this pattern would be highly informative.

Indeed, we may quantify how informative/interesting it is, in the Information Theoretic sense: the number of bits of information we gain about the data by learning about this pattern, which depends on the amount of data covered (more is better) and how much the distribution in the subgroup differs from our expectation (more is better; in this paper we consider mean and variance statistics). Typically, we would like to weigh this against how complex the description of the pattern is (number of attributes used to describe the subgroup plus the number of statistics presented to the user; fewer is better), such that our aim is to provide a maximal information rate.

This is precisely the contribution of this paper. We quantify the Information Content (IC; the amount of information gained) and Description Length (DL; the complexity of the description) for subgroup patterns. However, while the example above has only one target attribute (the violent crime rate), we also do this for multivariate real-valued targets, in order to enable users to learn about multivariate distributions. Besides, while the example above is about a surprisingly high mean (violent crime rate), we quantify the IC and DL for both mean and (co-)variance statistics.

As hinted at in the example, the IC of a pattern is inherently subjective, i.e., particular to a user, because how much you learn depends on your prior knowledge. We implement this subjectivity by modeling a background distribution over the data space that is a Maximum Entropy distribution subject to constraints corresponding to the current knowledge of a user. This approach is known as FORSIED [2, 3] and also immediately enables iterative mining of non-redundant patterns without much additional effort.

We have implemented an algorithm to iteratively mine interesting patterns, which is freely available as open source code. We have not studied the algorithmic problem in detail, but the implementation is based on beam search, a frequently employed approach in subgroup discovery. That is, it maintains a list of the most interesting patterns of arity $k$, expands these to arity $k+1$, and selects the most interesting patterns again. Ultimately, it outputs the most interesting pattern found. It handles categorical, ordinal, and numerical description attributes (the demographic attributes in the example) and supports time constraints (e.g., stop after 1 minute of mining). The implementation is based on Cortana [4].

In summary, this paper contributes the following:

– We define a new pattern syntax for subgroups with a multivariate real-valued target distribution, called location and spread patterns. (Sec. II-A)

– We introduce a method to quantify their interestingness in a subjective manner. (Sec. II-C)

– Before that, we study how to incorporate prior knowledge into the background model, including previously identified patterns to enable iterative mining. (Sec. II-B)

– We present how to mine high-quality patterns using beam search and gradient descent. (Sec. II-D)

– We provide empirical evidence on four datasets that we can effectively find interesting patterns. (Sec. III)

Related work is discussed in Sec. IV; directions for future work and conclusions are given in Sec. V. All code, including code for repeating the experiments, and links to the datasets are available at: https://www.dropbox.com/sh/3m1cgt1mh15k8bu/AAAViZtu5aeSOA3ybCS5mi-ta?dl=0.

## II Methods

Overview. The high-level problem addressed in this paper is:

###### Problem 1.

Main Problem. Iteratively inform the user about the mean and variance of subsets of data points that can be described concisely in terms of the description attributes, such that the rate of information gain of the user about the target attributes is maximized at each iteration.

We first formalize the type of pattern shown to the user (Sec. II-A). To explain how to find the most interesting patterns of this type (Sec. II-D), we first need to formalize the background distribution (Sec. II-B) and the interestingness of patterns (Sec. II-C).

The formalization follows the FORSIED approach: we formalize the user’s belief state about the target attributes by means of a background distribution, and quantify the IC of a pattern as the information (in its formal sense) the user gains about the target attributes by seeing the pattern. The Subjective Interestingness (SI) of a pattern is then formalized as the (subjective) IC divided by the DL of the pattern.

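Notation. The data consists of a set of $n$ pairs $(\hat{\mathbf{x}}_i, \hat{\mathbf{y}}_i)$, $i \in [n]$ (where $[n]$ is shorthand for $\{1, \ldots, n\}$), called the data points. Here, the so-called description attributes $\hat{\mathbf{x}}_i$ of the $i$th data point are assumed to be a tuple of $d_x$ attributes, each with its own domain, and $\hat{\mathbf{y}}_i \in \mathbb{R}^{d_y}$ is a vector containing the values of the $d_y$ real-valued target attributes. We denote by $\hat{\mathbf{Y}}$ the collection of all target vectors $\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_n$. In our setup the user is interested in gaining an understanding of the behavior of the target attributes in terms of the descriptions.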

For example, the target attributes could contain healthcare-related attributes, whereas the description attributes could describe lifestyle choices (e.g., smoking or not, sedentary or active lifestyle, etc). Then, our method would yield insights into the healthcare target attributes, in terms of the lifestyle descriptions. In the example in Section I, there is one target attribute (the violent crime rate) and over 120 description attributes.

We use hatted symbols (e.g., $\hat{\mathbf{y}}_i$) to indicate empirical values. Non-hatted equivalents (e.g., $\mathbf{y}_i$) denote the respective random variables. They allow us to reason about the amount of uncertainty the user has about the data points. In general, standard face lower case symbols denote scalars, bold face lower case symbols denote tuples or vectors, upper case bold face symbols denote matrices, and upper case calligraphic letters denote sets.

### II-A Location and spread patterns

Subgroups, intentions, and extensions. We define patterns in terms of subgroups. A subgroup is defined by a set of conditions on the description attributes (the value combination is the subgroup intention) and by the set of data points for which the description attributes satisfy these conditions (the index set is the subgroup extension).

The intention is described in a pre-defined formal description language, such as a conjunction of conditions on individual metadata attributes. For real-valued attributes, such conditions are typically inequality conditions, and for categorical attributes they can be set inclusion/exclusion conditions. The extension is then specified by the index set $I$, with $i \in I$ iff the description of data point $i$ satisfies the conditions.

Location and spread patterns. Subgroups tend to be informative if the target attribute values of data points in the extension are unusual in some sense. The way in which this set is unusual will be quantified by means of statistics—functions of this set of data points. For example, its empirical mean could be unusually far from what the user would expect, or its empirical variance around this mean could be unusually small or large along a certain direction.

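To be precise, let us define two statistics, $f_I$ and $g_I^{\mathbf{w}}$, for a subgroup with extension $I$ as follows:

$$f_I(\mathbf{Y}) = \frac{1}{|I|} \sum_{i \in I} \mathbf{y}_i, \tag{1}$$

$$g_I^{\mathbf{w}}(\mathbf{Y}) = \frac{1}{|I|} \sum_{i \in I} \left( \mathbf{w}^\top \left( \mathbf{y}_i - \hat{\mathbf{y}}_I \right) \right)^2, \tag{2}$$

where $\mathbf{y}_i$ is the target vector of data point $i$, $\mathbf{Y}$ collects all target vectors, $\hat{\mathbf{y}}_I = f_I(\hat{\mathbf{Y}})$ is the empirical mean of the subgroup, and $\mathbf{w}$ is a unit vector, i.e., $\|\mathbf{w}\|_2 = 1$. The first statistic (actually a set of statistics, one per target attribute), when evaluated on the empirical data $\hat{\mathbf{Y}}$, quantifies the average vector of the data points in the extension (i.e., its average location), whereas the second quantifies the spread around that location along the direction $\mathbf{w}$. Patterns considered here are specified by an intention, which uniquely determines the extension $I$, a unit vector $\mathbf{w}$, and the specification of the empirical values of one or both of the statistics $f_I$ and $g_I^{\mathbf{w}}$: we call it a location pattern when the former is specified, and a spread pattern when the latter is specified. We find that the spread of a subgroup cannot be interpreted straightforwardly without knowing its location, hence we only ever provide the user with spread patterns for subgroups for which the location pattern has been provided first. That is, we only explain the (co-)variance structure of subgroups for which the user already knows the precise mean value within the subgroup for all attributes.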

Example. For the synthetic data shown in Fig. 2a, a location pattern is an intention, e.g., ‘Attribute3 = true’ (which selects the dark red set of points), along with the mean of the subgroup. A spread pattern is an intention, a direction (a weight vector of unit length, as in Fig. 2b), and the magnitude of the variance in that direction.
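The two statistics can be illustrated with a small sketch. It assumes that the location statistic is the average target vector of the subgroup and that the spread statistic is the average squared projection, onto a unit vector, of the deviations from the subgroup's empirical mean. The function names `location` and `spread` and the toy data are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence

Vector = List[float]


def location(Y: Sequence[Vector], I: Sequence[int]) -> Vector:
    """Average target vector of the data points indexed by I."""
    d = len(Y[0])
    return [sum(Y[i][k] for i in I) / len(I) for k in range(d)]


def spread(Y: Sequence[Vector], I: Sequence[int], w: Vector, center: Vector) -> float:
    """Average squared projection onto the unit vector w of deviations from center."""
    total = 0.0
    for i in I:
        proj = sum(w[k] * (Y[i][k] - center[k]) for k in range(len(w)))
        total += proj * proj
    return total / len(I)


# Toy data (assumption): four points with two target attributes each.
Y = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 10.0]]
I = [0, 1, 2]  # extension of a hypothetical subgroup

mean_I = location(Y, I)                    # subgroup location
var_x = spread(Y, I, [1.0, 0.0], mean_I)   # spread along the first target axis
var_y = spread(Y, I, [0.0, 1.0], mean_I)   # spread along the second target axis
```

In this toy subgroup all variation lies along the first target attribute, so a spread pattern with direction `[1.0, 0.0]` reports a positive variance while the direction `[0.0, 1.0]` reports zero.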

### II-B Modelling the user’s belief state

As we are interested in quantifying how informative a pattern is to a particular user, we quantify its informativeness (the IC) with respect to a model for the user’s belief state. Patterns that contrast more strongly with this belief state are more surprising and thus carry more information for the user. We model the user’s belief state by means of a so-called background distribution, represented by a density function $p$. This is a distribution over the possible data values (here, a distribution for the target values $\mathbf{Y}$), which assigns a higher probability density to data values that are deemed more probable by the user. The general form of this approach is known as FORSIED [2, 3].

The initial background distribution, with density function $p_0$, can be estimated as the distribution of Maximum Entropy (MaxEnt) subject to constraints that express the user’s knowledge, also known as the prior beliefs. The reason to use the MaxEnt distribution is that this is the only neutral choice, i.e., the only distribution that contains no other information [5]. Importantly, during the mining process the background distribution evolves, as each pattern shown to the user changes their belief state about the data. We first derive the initial background distribution, and then show how it can be updated to account for location and spread patterns.

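Initial background distribution. To derive the initial background distribution, we need to assume what prior beliefs the user may have. We consider the case where the user expects the overall mean of the target vectors to be equal to a specified vector $\boldsymbol{\mu}$, and their covariance to be equal to a specified matrix $\boldsymbol{\Sigma}$. Notice that these need not be equal to the empirical statistics; they may be anything. The MaxEnt distribution subject to such expectations is well-known to equal a multivariate Normal distribution with $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as parameters, independently for each of the $n$ data points:

$$p_0(\mathbf{Y}) = \prod_{i=1}^{n} \mathcal{N}\left(\mathbf{y}_i;\, \boldsymbol{\mu}, \boldsymbol{\Sigma}\right). \tag{3}$$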

The evolving background distribution. Given a pattern, the background distribution has to be updated to reflect the user’s acquired knowledge. This can be done by minimally altering the background distribution while ensuring the statistic is (in expectation) as specified by the pattern. Here, minimally is naturally measured in terms of the Kullback-Leibler (KL) divergence. This approach is known as the principle of minimum discrimination information, a generalization of the MaxEnt principle.

We postulate, for now, that through subsequent updates in this way, the background distribution will continue to be a product of multivariate Normal distributions, although the means and covariances of the different data points may differ. That is, after $t$ iterations, the density function of the background distribution will be:

$$p_t(\mathbf{Y}) = \prod_{i=1}^{n} \mathcal{N}\left(\mathbf{y}_i;\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\right), \tag{4}$$

where data points may have differing means $\boldsymbol{\mu}_i$ and covariance matrices $\boldsymbol{\Sigma}_i$. This holds for $t = 0$ (when $\boldsymbol{\mu}_i = \boldsymbol{\mu}$ and $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ for all $i$), and the following shows that updating a distribution to account for location and spread patterns merely changes the parameter values, leaving the distribution’s parametric form intact.

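Background distribution updating for location patterns. To update $p_t$ given a location pattern for a subgroup with extension $I$ and empirical mean $\hat{\mathbf{y}}_I$, we must solve the following optimization problem:

$$p_{t+1} = \arg\min_{p}\ \mathrm{KL}\left(p \,\|\, p_t\right) \tag{5}$$

$$\text{s.t.} \quad \mathbb{E}_{\mathbf{Y} \sim p}\left[\frac{1}{|I|}\sum_{i \in I} \mathbf{y}_i\right] = \hat{\mathbf{y}}_I, \tag{6}$$

with the additional technical constraint $\int p(\mathbf{Y})\, d\mathbf{Y} = 1$, which guarantees that the distribution is properly normalized.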

###### Theorem 1.

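Let $p_t$ be a density function of the form of Eq. (4). Then, $p_{t+1}$, updated for a location pattern with extension $I$, has the same parametric form, with:

$$\boldsymbol{\mu}_i^{(t+1)} = \boldsymbol{\mu}_i + \boldsymbol{\Sigma}_i \Big(\sum_{j \in I} \boldsymbol{\Sigma}_j\Big)^{-1} \sum_{j \in I} \left(\hat{\mathbf{y}}_j - \boldsymbol{\mu}_j\right) \tag{7}$$

for $i \in I$, and all other parameters unaltered.

###### Proof:

Given the convexity of the KL-divergence and the linearity of the constraints, the optimization problem to be solved is convex and any stationary point is a global minimum. The Karush-Kuhn-Tucker (KKT) stationarity condition gives us the functional form of $p_{t+1}$:

$$p_{t+1}(\mathbf{Y}) = \frac{1}{Z}\, p_t(\mathbf{Y}) \exp\Big(\boldsymbol{\lambda}^\top \sum_{i \in I} \mathbf{y}_i\Big) \tag{8}$$

for a vector of KKT multipliers $\boldsymbol{\lambda}$ (into which the factor $1/|I|$ of the location statistic has been absorbed) and a normalization constant $Z$. Manipulating this expression shows that $p_{t+1}$ is still of the form of Eq. (4), with $\boldsymbol{\mu}_i$ replaced by $\boldsymbol{\mu}_i + \boldsymbol{\Sigma}_i \boldsymbol{\lambda}$ for $i \in I$ and all other parameters unaltered. The optimal value of $\boldsymbol{\lambda}$ can be found by ensuring primal feasibility, yielding that $\boldsymbol{\lambda} = \big(\sum_{j \in I} \boldsymbol{\Sigma}_j\big)^{-1} \sum_{j \in I} (\hat{\mathbf{y}}_j - \boldsymbol{\mu}_j)$. Substituting this for $\boldsymbol{\lambda}$ in the expression for the updated means proves the theorem. ∎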
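A minimal sketch of the location update for a single real-valued target follows. It assumes the one-dimensional specialization of the KL-minimization solution under a product-of-Normals background model: each mean in the extension is shifted by its own variance times a shared multiplier, and the multiplier is chosen so that the expected subgroup mean matches the observed one. The function name `update_location` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def update_location(mu: Sequence[float], var: Sequence[float],
                    I: Sequence[int], y_hat: Sequence[float]) -> List[float]:
    """Return the per-point background means after assimilating a location pattern.

    One-dimensional targets are assumed: mu[i] and var[i] are the background
    mean and variance of data point i, and y_hat[i] is its observed target value.
    """
    # Shared multiplier: total mean deficit divided by total variance in the extension.
    lam = sum(y_hat[i] - mu[i] for i in I) / sum(var[i] for i in I)
    new_mu = list(mu)
    for i in I:
        new_mu[i] = mu[i] + var[i] * lam  # points with larger variance move further
    return new_mu


# Toy example (assumption): five data points, the first three form the subgroup.
y_hat = [1.0, 3.0, 5.0, 0.0, 0.2]
mu = [0.0, 0.0, 0.0, 0.0, 0.0]
var = [1.0, 1.0, 2.0, 1.0, 1.0]
I = [0, 1, 2]

new_mu = update_location(mu, var, I, y_hat)
expected_subgroup_mean = sum(new_mu[i] for i in I) / len(I)
observed_subgroup_mean = sum(y_hat[i] for i in I) / len(I)
```

After the update, the expected subgroup mean under the background model coincides with the observed subgroup mean, while the parameters of points outside the extension are untouched.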

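Background distribution updating for spread patterns. To update the background distribution given a spread pattern for a subgroup with extension $I$ and direction $\mathbf{w}$, we need to use the constraint

$$\mathbb{E}_{\mathbf{Y} \sim p}\left[\frac{1}{|I|} \sum_{i \in I} \left(\mathbf{w}^\top(\mathbf{y}_i - \hat{\mathbf{y}}_I)\right)^2\right] = \hat{\sigma}^2 \tag{9}$$

in the KL-minimization problem, where $\hat{\mathbf{y}}_I$ is the empirical mean of the subgroup and, for conciseness, we denote the empirical variance along $\mathbf{w}$ as $\hat{\sigma}^2$.

###### Theorem 2.

Let $p_t$ be a density function of the form of Eq. (4). Then, $p_{t+1}$, updated for a spread pattern with spread $\hat{\sigma}^2$, has the same parametric form, with:

$$\boldsymbol{\mu}_i^{(t+1)} = \boldsymbol{\mu}_i + \frac{\lambda\, \boldsymbol{\Sigma}_i \mathbf{w} \mathbf{w}^\top (\hat{\mathbf{y}}_I - \boldsymbol{\mu}_i)}{1 + \lambda\, \mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}}, \tag{10}$$

$$\boldsymbol{\Sigma}_i^{(t+1)} = \boldsymbol{\Sigma}_i - \frac{\lambda\, \boldsymbol{\Sigma}_i \mathbf{w} \mathbf{w}^\top \boldsymbol{\Sigma}_i}{1 + \lambda\, \mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}} \tag{11}$$

for $i \in I$, and all other parameters unaltered. The optimal value for $\lambda$ is found as the (unique) zero of the following equation:

$$\sum_{i \in I} \left[ \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}}{1 + \lambda\, \mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}} + \frac{\left(\mathbf{w}^\top(\boldsymbol{\mu}_i - \hat{\mathbf{y}}_I)\right)^2}{\left(1 + \lambda\, \mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}\right)^2} \right] - |I|\, \hat{\sigma}^2 = 0. \tag{12}$$

The proof is omitted for brevity. It is more tedious but analogous to the previous one.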
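The spread update can be sketched for a single real-valued target, where the direction is trivially fixed. The sketch assumes that the update tilts the background density by an exponential of the squared deviations from the subgroup mean, so that each variance `s` in the extension becomes `s / (1 + lam * s)` and each mean is pulled towards the subgroup mean by the same factor; the scalar multiplier `lam` is then the root of a monotone equation, found here by bisection. The function names and numbers are hypothetical, and bisection is a stand-in for whatever root finder the released code uses.

```python
from typing import List, Sequence, Tuple


def solve_lambda(mu: Sequence[float], var: Sequence[float], I: Sequence[int],
                 center: float, target_var: float) -> float:
    """Find lam such that the expected spread around center equals target_var."""
    def excess(lam: float) -> float:
        total = 0.0
        for i in I:
            denom = 1.0 + lam * var[i]
            total += var[i] / denom + (mu[i] - center) ** 2 / denom ** 2
        return total / len(I) - target_var

    # excess is decreasing on lam > -1 / max(var); bracket the root and bisect.
    lo = -(1.0 - 1e-9) / max(var[i] for i in I)
    hi = 1.0
    while excess(hi) > 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def update_spread(mu: Sequence[float], var: Sequence[float], I: Sequence[int],
                  center: float, target_var: float) -> Tuple[List[float], List[float], float]:
    """Return updated means, variances, and the multiplier for a spread pattern."""
    lam = solve_lambda(mu, var, I, center, target_var)
    new_mu, new_var = list(mu), list(var)
    for i in I:
        denom = 1.0 + lam * var[i]
        new_mu[i] = mu[i] + lam * var[i] * (center - mu[i]) / denom
        new_var[i] = var[i] / denom
    return new_mu, new_var, lam


# Toy example (assumption): two points whose means differ from the subgroup mean 2.0.
mu, var, I = [1.0, 3.0], [1.0, 2.0], [0, 1]
new_mu, new_var, lam = update_spread(mu, var, I, center=2.0, target_var=0.5)
achieved = sum(new_var[i] + (new_mu[i] - 2.0) ** 2 for i in I) / len(I)
```

The quantity `achieved` is the expected spread under the updated background model, which should match the requested `target_var` up to numerical tolerance.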

Accounting for a set of location and spread patterns. If we want to take into account a set of location and spread patterns, the KL-divergence minimization problem needs to be solved with a constraint for each of these patterns. The problem remains convex, however, such that a coordinate-descent approach converges to the global optimum. This means iteratively updating the background distribution for each of the patterns, until convergence. As long as the extensions of the different patterns have limited overlaps, as is the case in our experiments, convergence occurs very rapidly.

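Implementation details. Rather than updating the parameters $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$, we actually update the natural parameters $\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i^{-1}$ of these multivariate Normal distributions. This is numerically and computationally advantageous, but we feel it provides more insight to discuss the updates to $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ above.

Also note that maintaining and updating the background distribution may be costly if implemented naively. Each $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ needs to be remembered, and updating them involves summations over a number of terms that grows with the subgroup size. Yet, the number of distinct $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ remains limited. (Footnote 2: Indeed, $\boldsymbol{\mu}_i = \boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_j$ for all $i$ and $j$ such that, for every assimilated extension $I$, either $i, j \in I$ or $i, j \notin I$, since they will have been subjected to the same updates.)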

### II-C Subjective Interestingness

Given a background distribution, [2] proposed that the Subjective Interestingness (SI) of a pattern can be computed as a ratio of two quantities: (a) the Information Content (IC) of a pattern, which is the negative log probability of the pattern under the background distribution; and (b) the Description Length (DL), which measures the effort a user has to make to understand and internalize the pattern.

To describe location patterns, we have to inform the user about the number of conditions in the pattern’s intention, the conditions themselves, and the mean values for all attributes (to sufficient accuracy). For spread patterns, instead of the means, the vector needs to be described, with its magnitude. All these parts of the code have constant length, except for the set of conditions, which has a length proportional to the number of conditions . Thus:

where the applies to spread patterns only because they have one more term than location patterns.

We discuss determining and in Remark 1 below. Note that it does not matter whether the DL is reflective of reality in absolute terms, because the actual SI scores are irrelevant. What matters is the ranking, hence it is desirable that is chosen well relative to .

As the IC (thus the SI) depends on the pattern type, we derive it first for location patterns and subsequently for spread patterns.

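SI for location patterns. As the background distribution (4) for the target values of a data record is a normal distribution, the marginal distribution of the mean of a subgroup with extension $I$ is again a normal distribution, with mean $\boldsymbol{\mu}_I = \frac{1}{|I|}\sum_{i \in I} \boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_I = \frac{1}{|I|^2}\sum_{i \in I} \boldsymbol{\Sigma}_i$. The IC of a location pattern with extension $I$ and empirical mean $\hat{\mathbf{y}}_I$ is thus the negative log probability (density) of the pattern. Written in full, with $d_y$ the number of target attributes:

$$\mathrm{IC}(I, \hat{\mathbf{y}}_I) = \frac{1}{2} \log\left((2\pi)^{d_y} |\boldsymbol{\Sigma}_I|\right) + \frac{1}{2} \left(\hat{\mathbf{y}}_I - \boldsymbol{\mu}_I\right)^\top \boldsymbol{\Sigma}_I^{-1} \left(\hat{\mathbf{y}}_I - \boldsymbol{\mu}_I\right). \tag{13}$$

The SI of a location pattern with extension $I$ and statistic $\hat{\mathbf{y}}_I$ reads:

$$\mathrm{SI}(I, \hat{\mathbf{y}}_I) = \frac{\mathrm{IC}(I, \hat{\mathbf{y}}_I)}{\mathrm{DL}(I)}. \tag{14}$$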
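As a sketch of how the score of a location pattern could be computed for a single real-valued target: the IC is taken as the negative log density of the observed subgroup mean under a Normal distribution whose mean and variance follow from the per-point background parameters, and the SI divides this by a description length that is linear in the number of conditions. The weights `alpha` and `beta` and their default values are placeholders chosen for illustration only; the function names are hypothetical.

```python
import math
from typing import Sequence


def location_ic(mu: Sequence[float], var: Sequence[float],
                I: Sequence[int], y_hat: Sequence[float]) -> float:
    """Negative log density of the observed subgroup mean (one-dimensional targets)."""
    n = len(I)
    mean_bg = sum(mu[i] for i in I) / n        # expected subgroup mean
    var_bg = sum(var[i] for i in I) / n ** 2   # variance of the subgroup mean
    observed = sum(y_hat[i] for i in I) / n
    return 0.5 * math.log(2.0 * math.pi * var_bg) + (observed - mean_bg) ** 2 / (2.0 * var_bg)


def subjective_interestingness(ic: float, num_conditions: int,
                               alpha: float = 0.1, beta: float = 1.0) -> float:
    """IC divided by a description length linear in the number of conditions.

    alpha and beta are illustrative placeholder weights, not values from the paper.
    """
    return ic / (alpha * num_conditions + beta)


# Toy example (assumption): 100 points with a standard Normal background,
# and a subgroup of 25 points whose observed values all equal 1.0.
mu = [0.0] * 100
var = [1.0] * 100
I = list(range(25))
y_surprising = [1.0 if i < 25 else 0.0 for i in range(100)]
y_expected = [0.0] * 100

ic_surprising = location_ic(mu, var, I, y_surprising)
ic_expected = location_ic(mu, var, I, y_expected)
si_surprising = subjective_interestingness(ic_surprising, num_conditions=1)
```

Because the IC is based on a density, it can be negative: here `ic_expected` is below zero since the density of the subgroup mean at its expected value exceeds one.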

SI for spread patterns. While the SI of a location pattern can be computed analytically, evaluating the SI for a spread pattern is more complex. However, it can be approximated well.

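If the patterns assimilated into the background so far do not overlap (i.e., they have non-intersecting extensions), then after updating the background distribution with the location information of the pattern, the parameter $\boldsymbol{\mu}_i$ of the background model equals the observed mean $\hat{\mathbf{y}}_I$ of the subgroup for every $i \in I$. (Footnote 3: If the patterns used to update the background distribution do overlap, then $\boldsymbol{\mu}_i$ may differ from $\hat{\mathbf{y}}_I$ even after the update. The random variable in Eq. (16) then follows a non-central chi-squared distribution, hence the linear combination in Eq. (17) also changes. In this case, we approximate the SI with the same computation as for the non-overlapping situation.) So we can derive:

$$\mathbf{w}^\top \left(\mathbf{y}_i - \hat{\mathbf{y}}_I\right) \sim \mathcal{N}\left(0,\ \mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}\right), \tag{15}$$

$$\frac{\left(\mathbf{w}^\top \left(\mathbf{y}_i - \hat{\mathbf{y}}_I\right)\right)^2}{\mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}} \sim \chi^2(1). \tag{16}$$

Denote the chi-squared random variable derived above by $\chi_i^2$. Then, the variance statistic (2) is a linear combination of chi-squared random variables:

$$\frac{1}{|I|} \sum_{i \in I} \left(\mathbf{w}^\top \left(\mathbf{y}_i - \hat{\mathbf{y}}_I\right)\right)^2 = \sum_{i \in I} \frac{\mathbf{w}^\top \boldsymbol{\Sigma}_i \mathbf{w}}{|I|}\, \chi_i^2. \tag{17}$$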

The probability density function of a linear combination of chi-squared distributed random variables has been studied extensively, but a closed-form analytic solution is unknown. Here we choose the state-of-the-art approximation proposed by [6]: Writing for the coefficient , they prove that the distribution of can be accurately approximated by an affine function of a chi-squared random variable with degrees of freedom:

(18) |

Therefore the approximated probability density function reads:

Thus the IC for a spread pattern with extension is given as:

(19) |

The SI is then given by

(20) |

###### Remark 1.

In practice, the SI values from Eqs. (14) and (20) are only used for ranking the patterns, or even just for finding the single most interesting pattern. The absolute value of the SI is largely irrelevant in practice. Thus, we can set without losing generality, such that only remains as a parameter, the value of which essentially depends on the ‘coding scheme’ used to present the pattern to the user.

We do not know of any principled approach to choose well. Notice that the problem here is not to do model selection in the statistical sense; rather, the DL should be determined based on aspects of human cognition. In this paper, we set throughout all the experiments. However, tuning biases the results toward more or fewer conditions to describe the subgroup and hence tuning could be useful.

### II-D Search strategies

Overall approach. We have not studied the complexity formally, but the optimization problem for either pattern type appears to be very difficult. Tiling [7], a similar and easier-appearing problem, is already NP-hard. The score function here (the SI) is also not monotonic and, if the cardinality of metadata attributes is large, pattern enumeration, which then equals exhaustive search, is not a feasible strategy. For spread patterns, the search problem is essentially a dimensionality reduction problem. From empirical results, we learn that the search problem can have many local optima. Besides, there is no structure in the problem that struck us as easy to use.

Hence, we resort to optimization procedures that are commonly used in either scenario. In brief, to find location patterns that maximize Eq. (14), we employ beam search. For spread patterns, we first search for the best location pattern and after updating the background distribution with the location, we use gradient descent to find the weight vector that maximizes Eq. (20) for that subgroup. The procedures are outlined in more detail below.

Location pattern. Beam search systematically explores the conjunctions of conditions by expanding a limited set of conjunctions that have the largest SI so far. It evaluates conjunctions of conditions on metadata attributes in a level-wise manner. On each level, a limited list (beam width) of most promising combinations is maintained. On the next level, the algorithm exhaustively grows the combinations from the limited pattern list and maintains again the best. The mining process stops when all possible conjunctions of conditions are explored or a chosen stopping criterion is met, either a maximum search depth or time spent. Then, the best pattern found throughout the search is given as output.
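The level-wise procedure described above can be sketched as follows. This is a simplified illustration restricted to binary description attributes, not the Cortana implementation; the quality function `toy_quality` is a stand-in for the SI (it rewards large, strongly displaced subgroups and penalizes longer descriptions), and all names and data are hypothetical.

```python
import math
from typing import Callable, FrozenSet, List, Sequence, Tuple

Condition = Tuple[int, int]            # (attribute index, required binary value)
Description = FrozenSet[Condition]     # conjunction of conditions


def extension(X: Sequence[Sequence[int]], desc: Description) -> List[int]:
    """Indices of the rows that satisfy every condition in desc."""
    return [i for i, row in enumerate(X) if all(row[a] == v for a, v in desc)]


def beam_search(X: Sequence[Sequence[int]],
                quality: Callable[[Description, List[int]], float],
                beam_width: int = 2, max_depth: int = 2) -> Tuple[float, Description]:
    """Return the best (score, description) found by level-wise beam search."""
    conditions = [(a, v) for a in range(len(X[0])) for v in (0, 1)]
    beam: List[Description] = [frozenset()]
    best: Tuple[float, Description] = (float("-inf"), frozenset())
    seen = set()
    for _ in range(max_depth):
        candidates = []
        for desc in beam:
            used = {a for a, _ in desc}
            for cond in conditions:
                if cond[0] in used:
                    continue  # at most one condition per attribute
                new_desc = desc | {cond}
                if new_desc in seen:
                    continue  # same conjunction reached via another order
                seen.add(new_desc)
                ext = extension(X, new_desc)
                if ext:
                    candidates.append((quality(new_desc, ext), new_desc))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [d for _, d in candidates[:beam_width]]   # keep the most promising
        best = max(best, candidates[0], key=lambda c: c[0])
    return best


# Toy data (assumption): attribute 0 marks a subgroup with elevated target values.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [5.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
global_mean = sum(y) / len(y)


def toy_quality(desc: Description, ext: List[int]) -> float:
    """Stand-in score: displacement times sqrt(coverage), divided by a length penalty."""
    mean = sum(y[i] for i in ext) / len(ext)
    return abs(mean - global_mean) * math.sqrt(len(ext)) / (1.0 + 0.1 * len(desc))


best_score, best_desc = beam_search(X, toy_quality)
```

On this toy data the search returns the single condition on attribute 0; adding a second condition does not change the subgroup mean but shrinks the coverage and lengthens the description, so the score drops.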

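Spread pattern. Finding the best spread pattern consists of two steps: (1) find the best location pattern and update the background distribution with that information, (2) for that location pattern find the most interesting direction in the target space. We have already described the first step; the second step can be formulated in terms of the following optimization problem, where $\mathrm{SI}(I, \mathbf{w})$ denotes the SI of the spread pattern (Eq. (20)) with extension $I$ and direction $\mathbf{w}$:

$$\max_{\mathbf{w}}\ \mathrm{SI}(I, \mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_2 = 1. \tag{21}$$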

Since the description length in the SI is fixed for a specific extension $I$, the problem (21) maximizes the entropy of a distribution (Sec. II-C) over the unit sphere. To optimize $\mathbf{w}$, we apply the off-the-shelf manifold optimization tool Manopt [8] with the unit sphere as the manifold, and solve it with the gradient-based solver. (Footnote 4: We computed the gradient analytically, but details are omitted due to lack of space.)
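The idea of gradient-based optimization on the unit sphere can be sketched in a few lines. This is not Manopt and does not use the SI objective: it is a plain-Python stand-in that projects the gradient onto the tangent space of the sphere, takes a step, and renormalizes. The toy objective $\mathbf{w}^\top \mathbf{A} \mathbf{w}$ with a fixed symmetric matrix `A` is an assumption chosen because its maximizer on the sphere is known (the leading eigenvector).

```python
import math
from typing import Callable, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sphere_gradient_ascent(grad: Callable[[List[float]], List[float]],
                           w0: Sequence[float], step: float = 0.1,
                           iters: int = 500) -> List[float]:
    """Maximize an objective over the unit sphere given its Euclidean gradient."""
    w = normalize(w0)
    for _ in range(iters):
        g = grad(w)
        radial = sum(gi * wi for gi, wi in zip(g, w))
        tangent = [gi - radial * wi for gi, wi in zip(g, w)]          # project onto tangent space
        w = normalize([wi + step * ti for wi, ti in zip(w, tangent)])  # step, then retract
    return w


# Toy objective (assumption): maximize w^T A w for a fixed symmetric matrix A.
A = [[3.0, 1.0], [1.0, 2.0]]


def objective(w: Sequence[float]) -> float:
    return sum(w[r] * A[r][c] * w[c] for r in range(2) for c in range(2))


def gradient(w: List[float]) -> List[float]:
    return [2.0 * sum(A[r][c] * w[c] for c in range(2)) for r in range(2)]


w_star = sphere_gradient_ascent(gradient, [1.0, 0.0])
best_value = objective(w_star)
```

For this toy problem the iterate converges to the leading eigenvector of `A`, so `best_value` approaches the largest eigenvalue, $(5 + \sqrt{5})/2$. The real objective has many local optima according to the text, so in practice several starting points may be needed.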

## III Experiments

In this section we evaluate whether our method is able to find good location and spread patterns in terms of SI and whether the model updates work as expected. We also studied the pattern descriptions, to see whether the patterns found appear to be interesting. We conducted experiments on four datasets: one synthetic and three publicly available ones, of widely varying nature. The results for each dataset are described in the following subsections. The final subsection considers the scalability of the methods.

We used the beam search available within the data mining tool Cortana [4], using the following settings: descriptions on numerical metadata are based on and relations with four split points (1/5–4/5 percentiles). The beam width is set to 40 and the search depth is four conditions. The search logs the best 150 subgroups, with a maximum run time of 5 minutes.

### III-A Synthetic data

Data. We generated a dataset of 620 data points with two real-valued target attributes (attributes 1 and 2) and five binary descriptive attributes. We first sample 500 target values from the 2-D multivariate normal distribution and then embed three subgroups each consisting of 40 points into the data, see Fig. 2a. Each subgroup has distance 2 from the mean but a different covariance structure: the variance along the main eigenvector is much larger than the other. The first three descriptive attributes (attributes 3–5) contain the true labels for subgroups to ; the other two (attributes 6 and 7) take values randomly sampled from a Bernoulli distribution with .

Setup. We set the mean and covariance of the background model equal to the empirical values of the full data. First, we tested whether our method could retrieve the embedded patterns. We performed the two-step spread pattern mining process for three iterations, and at each iteration we selected the top pattern to update the background distribution. Second, we corrupted the descriptive attributes by randomly flipping every 0 and 1 with a certain probability. Then, we checked up to what noise level the subgroups can still be retrieved.
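The corruption step of the noise experiment can be sketched as below, assuming that each binary description value is flipped independently with the same probability. The function name `flip_bits`, the seed, and the toy matrix are hypothetical.

```python
import random
from typing import List, Sequence


def flip_bits(X: Sequence[Sequence[int]], p: float, rng: random.Random) -> List[List[int]]:
    """Flip each binary entry of X independently with probability p."""
    return [[1 - v if rng.random() < p else v for v in row] for row in X]


# Toy description matrix (assumption): 1000 data points, five binary attributes.
rng = random.Random(0)
X = [[0, 1, 0, 1, 1] for _ in range(1000)]
noisy = flip_bits(X, 0.2, rng)

# Fraction of entries that changed; should be close to the flipping probability.
changed = sum(a != b for row, noisy_row in zip(X, noisy) for a, b in zip(row, noisy_row))
flip_rate = changed / (len(X) * len(X[0]))
```

Repeating the pattern search on `noisy` for increasing values of `p` gives the kind of retrieval curve discussed in the results.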

Table I: SI of the top 10 patterns from the first iteration, tracked over subsequent iterations.

| Intention | SI (iter. 1) | SI (iter. 2) | SI (iter. 3) | SI (iter. 4) |
|---|---|---|---|---|
| a3 = ‘1’ | 48.35 | -1.13 | -1.13 | -1.13 |
| a5 = ‘1’ | 47.49 | 47.49 | -1.13 | -1.13 |
| a4 = ‘1’ | 39.49 | 39.49 | 39.49 | -1.13 |
| a4 = ‘0’ ∧ a3 = ‘1’ | 36.26 | -0.85 | -0.85 | -0.85 |
| a5 = ‘0’ ∧ a3 = ‘1’ | 36.26 | -0.85 | -0.85 | -0.85 |
| a3 = ‘0’ ∧ a5 = ‘1’ | 35.62 | 35.62 | -0.85 | -0.85 |
| a4 = ‘0’ ∧ a5 = ‘1’ | 35.62 | 35.62 | -0.85 | -0.85 |
| a3 = ‘0’ ∧ a4 = ‘1’ | 29.62 | 29.62 | 29.62 | -0.85 |
| a5 = ‘0’ ∧ a4 = ‘1’ | 29.62 | 29.62 | 29.62 | -0.85 |
| a5 = ‘0’ ∧ a4 = ‘0’ ∧ a3 = ‘1’ | 29.01 | 29.01 | -0.68 | -0.68 |

Results. Figures 2b–2d show the top patterns in the first three iterations. Our method correctly found the embedded subgroups in the first three iterations by their displaced location from the expected center. It also retrieved the direction along which each subgroup’s spread differs most from the full data covariance. Of course this is not so surprising, because for each embedded subgroup there is a description attribute that sets the subgroup apart from the rest of the data.

To study the mining process in more detail, Table I shows the change in SI for the top 10 patterns from the first iteration in subsequent iterations. We observe that the three embedded subgroups were the highest-ranking patterns in the first three iterations (indeed they were the top 3 immediately because the subgroups induced by the true descriptions stand out so clearly from the rest of the data).

Once they were selected and used to update the background distribution, in the subsequent iterations the SI of the embedded subgroup patterns, and the SI of the derived patterns, dropped and remained low afterwards. Hence, updating the background distribution and the influence that should have on the IC scores of patterns worked as expected.

It can also be observed that the subgroups with more complex descriptions (e.g., a4 = ‘0’ ∧ a3 = ‘1’) have lower SI, even though their extensions are equivalent to that of the corresponding single-condition pattern (here a3 = ‘1’). This is because their DL is higher, while their extension is the same. Note that non-redundancy in the descriptions is thus achieved naturally and in a principled manner. Also worth noting is that the SI can be negative. This is because the IC is based on a probability density rather than a probability mass.

The result of the retrieval experiment with noise added to the description attributes is given in Fig. 3. We find that all embedded patterns can still be recovered when the flipping probability is up to 0.22, and partially retrieved up to 0.25. These values correspond to adding a random set of points that is roughly three and four times the size of the embedded pattern (e.g., vs. ). We conclude that the method is quite robust against noise.

### III-B Mammal data

Data. The mammal data encompasses data from The Atlas of European Mammals and from WorldClim.org, as preprocessed by Heikinheimo et al. [9]. It contains records about the presence of species in 2220 cells located on a grid that covers Europe. Each record contains the geolocation, binary labels for the presence/absence of 124 mammals, as well as 67 climate condition indicators.

Setup. We used the presence/absence indicators as target attributes and climate indicators for descriptions. The location information was used only for visualization and interpretation. We again set the initial mean and covariance parameters of the background model equal to the empirical values.

We found that for binary target attributes, spread patterns are not truly interesting. This makes sense, because the variance of a Bernoulli random variable is uniquely determined by the mean. Hence, a spread pattern becomes a one dimensional location pattern. That the attributes are binary is another form of background knowledge that could in principle be incorporated into the method, but it would lead to different derivations and we did not study this. Instead, we studied only location patterns on this data.

Results. The geographic locations of the data points that belong to the subgroup are visualized in Fig. 6 for the top patterns found in the first three iterations. The subgroup intentions (the combination of values that specifies the subgroup) are given in the caption. The top pattern corresponds to locations that are relatively cold in late winter. In contrast, the second pattern covers locations that have an extremely dry summer, while the third pattern covers locations with a dry autumn and warm conditions in the months when most rain falls (which is the summer in that area).

We further investigated the distribution of the mammals within the subgroups. Fig. 5 shows the mean values for the first pattern, and the mean and confidence interval for the background model for the top five mammal species ranked by SI. Figures 4a–c show the actual occurrences of the top three species across Europe. The species ranked first is the wood mouse, which is widespread in central and southern Europe but not in the northern areas. The second species is the mountain hare, whose habitat mostly coincides with the area associated with the location pattern found. This indicates that it thrives under harsh temperature conditions. The third species, the moose, is also widespread in mostly the same area.

By contrasting these ground-truth location maps for the species (Figs. 4a–c) against the subgroup location map (Fig. 6), we find that indeed this pattern could be highly informative. However, while the description is concise, the displacement in the target space does not appear to be sparse (it covers many species). To comprehend the pattern in full, one should look at all the attributes where the mean deviates from the expectation, not just at the top five. This means fully understanding the pattern is somewhat difficult.

Finally, notice that these three species correlate and that the background model already accounts for this. Hence, the IC of the subgroup is much lower than the sum of the ICs of the three attributes if they were considered individually. Nonetheless, the IC is very high.

Although not shown, we repeated this exercise for the second and third patterns. The subgroup patterns appear to be informative. For example, ranked by SI, the most surprising observations for the second pattern are the absence of the stoat and the bank vole, which prefer a moist environment, and the presence of the Iberian hare, which indeed lives exclusively in the area of the pattern. Thus, our method appears to find geographically meaningful location patterns that reveal the relationship between climate conditions and sets of animals that are absent from or present in the corresponding area.

### III-C Socio-economics data case study

Data. The German socio-economic dataset [10] consists of socio-economic records of 412 administrative districts in Germany. The features are divided into three groups: election voting counts, age distribution, and workforce distribution. The voting percentages of the five largest political parties (CDU/CSU, SPD, FDP, Greens, and Left) in the 2009 German elections are also included. We added the geographic coordinates of each district center ourselves.

Setup. We used the vote count attributes as targets and the age and the work force attributes for the descriptions. Geolocations were used only for interpretation. Again, we set the initial mean and (co-)variance for the background distribution equal to the empirical values. In this case, that means we assume a user initially knows the overall voting behavior of the 2009 German elections.

We again performed three iterations of the subgroup discovery algorithm, but this time we studied both the location pattern and the associated spread pattern in each iteration. To increase interpretability, we enforced a 2-sparsity constraint on the weight vector of the spread pattern, by optimizing it for each pair of target attributes separately and then selecting the result with the highest SI.
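The pairwise strategy described above can be sketched as follows. This is only an illustration under stated assumptions: `score` is a hypothetical stand-in for the SI of a spread pattern (the toy score below is a log-variance ratio, not the paper's formula), and the per-pair optimization is replaced by a simple grid over angles rather than the optimizer used in the actual implementation.

```python
import itertools
import numpy as np

def best_two_sparse_direction(score, n_targets, n_angles=180):
    """Return (best_score, w) over unit vectors supported on at most two targets.

    `score` is a hypothetical callable mapping a weight vector to a number;
    it stands in for the SI of a spread pattern.
    """
    best_value, best_w = -np.inf, None
    # Angles in [0, pi) cover every direction in a 2-D plane up to sign.
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    for i, j in itertools.combinations(range(n_targets), 2):
        for theta in angles:
            w = np.zeros(n_targets)
            w[i], w[j] = np.cos(theta), np.sin(theta)
            value = score(w)
            if value > best_value:
                best_value, best_w = value, w
    return best_value, best_w

# Toy usage (all numbers are made up): targets 1 and 3 are strongly
# anti-correlated within the subgroup, while the model expects no correlation.
expected_cov = np.eye(4)
subgroup_cov = np.eye(4)
subgroup_cov[1, 3] = subgroup_cov[3, 1] = -0.9

def toy_score(w):
    # Absolute log-ratio of subgroup variance to expected variance along w.
    return abs(np.log((w @ subgroup_cov @ w) / (w @ expected_cov @ w)))

value, w = best_two_sparse_direction(toy_score, n_targets=4)
print(np.nonzero(w)[0], round(value, 3))
```

Enumerating pairs costs a number of optimizations that is quadratic in the number of targets, which is affordable for the five voting attributes used here.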

Results. Fig. 7 shows the top location patterns found, and Fig. 8 some explanation and the spread pattern for the top location pattern. Comparing the distribution of the pattern against the expected distribution under the model (Fig. 8a, red and blue lines), we observe that the voting behavior in the corresponding districts deviates substantially from the full population: more votes for Left, fewer for all others. The intention of the pattern corresponds to districts with relatively few children; from the map we see the extension covers mainly East Germany.

Once we update the background distribution with the location pattern, the model mean of the pattern becomes the observed mean; see Fig. 8b. Given the updated background distribution, we find that the spread pattern with the highest SI is related to the covariance between the social democrats (SPD) and the Christian democrats (CDU); the weight vector is shown in Fig. 8c.^{5} As visualized in Fig. 8d, the variance in this direction is much smaller than expected. Of course, since the votes add up to a constant, under the model we also expect negative correlations between the parties, but for this subgroup the anti-correlation is much stronger than expected. This indicates that these parties really appear to battle for the same voters. However, we are not sufficiently knowledgeable about German politics to judge whether this is a solid observation.

^{5} The 2-D contour plot of the subgroup is aggregated as the average pdf of the background model for each data point in the subgroup. The mean and covariance are the sub-vector and the sub-matrix that correspond to the attributes indicated by the weight vector. This visualization is not fully accurate: not all points have the same parameters, so a single multivariate normal cannot represent the background model accurately.
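The aggregation described in footnote 5 (averaging the background-model pdf over the data points in the subgroup) can be sketched as below. This is a minimal illustration, assuming each data point comes with its own 2-D mean and covariance; the function names and the toy parameters are ours and are not taken from the paper's implementation.

```python
import numpy as np

def gaussian_pdf(points, mean, cov):
    """Density of N(mean, cov) evaluated at each row of `points`."""
    points = np.atleast_2d(points).astype(float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    dim = mean.shape[0]
    diff = points - mean
    # Solve cov * x = diff^T instead of forming the explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm_const = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(exponent) / norm_const

def average_pdf(points, means, covs):
    """Average the per-data-point Gaussian densities at each evaluation point."""
    densities = [gaussian_pdf(points, m, c) for m, c in zip(means, covs)]
    return np.mean(densities, axis=0)

# Toy usage: two subgroup points with different (made-up) model means.
grid = [[0.0, 0.0], [1.0, 0.0]]
means = [[0.0, 0.0], [2.0, 0.0]]
covs = [np.eye(2), np.eye(2)]
print(average_pdf(grid, means, covs))
```

The result is an equally weighted mixture of Gaussians, which is why a single multivariate normal drawn as one contour plot can only approximate it.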

Fig. 7 also shows the extensions for the top patterns in the second and third iterations. The second pattern has intention “Middle-aged Pop. 26.9” and contains large cities. Within those districts, the Green party has relatively high vote counts, which comes at the expense of the Left party. The third pattern, “Children Pop. 16.4”, is mostly a complementary pattern to the first one (see Fig. 7a,c), except that many of the big cities (Munich, Berlin, Cologne, etc.) fall exactly between the two thresholds (). The third pattern indeed covers locations where Left is unpopular and all other parties receive relatively many votes compared to the background model. In both the second and third location pattern, the corresponding spread pattern is a similar low-variance pattern as in Fig. 8. In our subjective opinion, these patterns appear to convey potentially highly interesting insights into this data.

### III-D Water quality data case study

Data. The River Water Quality dataset [11] consists of 1060 water quality records sampled from rivers in Slovenia. Each record contains measured values for 16 physical/chemical parameters and 14 bioindicators (7 plants, 7 animals), including a list of all taxa present and their density. The density of each taxon is recorded by an expert biologist at three different qualitative levels, where 1 means the taxon occurs incidentally, 3 frequently, and 5 abundantly.

Setup. We use the 16 physical/chemical parameters as targets and the 14 bioindicators as descriptors. Mean and (co-)variance of the initial background distribution were set to the empirical values.

Results. The top location pattern has intention “Amphipoda Gammarus fossarum 0 AND Oligochaeta Tubifex 3” and covers 91 records. Fig. 10a shows that the water samples fulfilling the description have an above-average biological oxygen demand (BOD), chlorine concentration (Cl), electrical conductivity, as well as K₂Cr₂O₇ and KMnO₄ (indicating chemical oxygen demand, COD).

In the second step our method finds, without this being enforced, a sparse weight vector that places high weights on BOD and KMnO₄ (Fig. 10d). The contour plot (Fig. 10b) indicates that, along the most interesting spread direction, the variance of the subgroup is much larger than expected. The CDF in Fig. 10c confirms this. The main conclusion here is that, although the identified patterns are typically subgroups displaced from the center of the data, which usually goes together with a smaller variance than that of the full data, it is also possible to find spread patterns that correspond to surprisingly high-variance directions.
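How the variance along a direction compares with its expectation can be illustrated with a small sketch. The function names, the use of the empirical covariance from `np.cov`, and the toy data are our assumptions for illustration; the SI computation for spread patterns is not reproduced here.

```python
import numpy as np

def variance_along(direction, covariance):
    """Variance of the projection w^T y for the unit-normalized direction w."""
    w = np.asarray(direction, dtype=float)
    w = w / np.linalg.norm(w)
    return float(w @ np.asarray(covariance, dtype=float) @ w)

def spread_ratio(direction, subgroup_points, expected_covariance):
    """Observed subgroup variance divided by expected variance along w.

    Values above 1 mean the subgroup varies more than expected in that
    direction; values below 1 mean it varies less.
    """
    observed_cov = np.cov(np.asarray(subgroup_points, dtype=float), rowvar=False)
    observed = variance_along(direction, observed_cov)
    expected = variance_along(direction, expected_covariance)
    return observed / expected

# Toy usage with made-up 2-D target values and an identity expected covariance.
points = [[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
print(spread_ratio([1.0, 0.0], points, np.eye(2)))  # larger than expected
print(spread_ratio([0.0, 1.0], points, np.eye(2)))  # smaller than expected
```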

### III-E Scalability

We have not analyzed the algorithmic complexity of mining optimal location and spread patterns in detail, nor have we studied extensively how to find good solutions in practice. The computation time of the beam search algorithm can be controlled through the search parameters (number of solutions kept at each iteration, discretization strategy for numerical attributes, maximum number of conditions for the description) and it employs a timer. Of course it may not find the optimal pattern, but this strategy allows it to work on data of any size and dimensionality. Likewise, the heuristic solution to mine spread patterns typically outputs a pattern in very little time.
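A time-bounded beam search of the kind described above can be sketched as follows. This is a generic illustration and not the Cortana implementation: the encoding of conditions as hashable identifiers, the `quality` callable, and all parameter names are hypothetical.

```python
import time

def beam_search(conditions, quality, beam_width=5, max_depth=2, time_budget_s=1.0):
    """Level-wise search over conjunctions of conditions with a time budget.

    conditions: list of hashable condition identifiers (hypothetical encoding).
    quality: callable mapping a frozenset of conditions to a score.
    Returns the best (score, description) evaluated before the budget ran out.
    """
    deadline = time.monotonic() + time_budget_s
    beam = [frozenset()]
    best = (float("-inf"), frozenset())
    for _ in range(max_depth):
        candidates = {}
        for description in beam:
            for condition in conditions:
                if condition in description:
                    continue
                if time.monotonic() > deadline:
                    return best  # out of time: report the best pattern so far
                refined = description | {condition}
                if refined in candidates:
                    continue  # same conjunction reached via another order
                score = quality(refined)
                candidates[refined] = score
                if score > best[0]:
                    best = (score, refined)
        if not candidates:
            break
        # Keep only the `beam_width` best descriptions for the next level.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = [desc for desc, _ in ranked[:beam_width]]
    return best

# Toy usage: conditions "a" and "c" are rewarded, anything else is penalized.
def toy_quality(description):
    good = {"a", "c"}
    return len(description & good) - 0.1 * len(description - good)

print(beam_search(["a", "b", "c"], toy_quality, beam_width=2, max_depth=2))
```

The beam width, the maximum depth, and the time budget together bound the number of quality evaluations, which is what makes the runtime controllable at the price of possibly missing the optimal pattern.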

Notice that for both algorithms, the runtime is linear in the number of data points (i.e., the cost of performing the exact same computations grows linearly with the size of the data). One may add attributes without affecting the computation time at the mining stage (the background model is discussed below), but of course including them in candidate descriptions leads to exponential growth in the number of possible subgroup definitions. We feel it would be pointless to include a runtime experiment for these steps, as it is not feasible to compute the optimal solutions as a comparison, except on very small data.

What we can analyze is the runtime of fitting the background distribution. For all four real-world datasets, we mined location and spread patterns and measured the time it took to find the new MaxEnt distribution incorporating both the previous and the newly identified pattern, for 20 iterations. The results are presented in Table II. We find that after insertion of 10–20 location patterns, the time it takes to find the MaxEnt distribution becomes noticeable. This may not be so surprising, as there are at least new constraints every time we insert a new location pattern. For the Mammals data, which has target dimension 124, the time quickly grows to durations that cannot be considered acceptable for interactive use. We also observe that for spread patterns this problem does not occur, because they are by definition of low rank (the weight vector is not necessarily sparse, but it is only a one-dimensional projection).

Table II: Time to fit the MaxEnt background distribution at each iteration, for location and spread patterns (GSE: German socio-economics, WQ: water quality, Cr: crime, Ma: mammals).

| Iteration | Location: GSE | Location: WQ | Location: Cr | Location: Ma | Spread: GSE | Spread: WQ | Spread: Cr |
|---|---|---|---|---|---|---|---|
| Init | 9.167 | 8.640 | 9.714 | 8.453 | | | |
| 1 | 0.13 | 0.16 | 0.12 | 13.72 | 0.10 | 0.10 | 0.11 |
| 2 | 0.09 | 0.16 | 0.08 | 33.09 | 0.08 | 0.05 | 0.08 |
| 3 | 0.12 | 0.31 | 0.09 | 62.61 | 0.06 | 0.12 | 0.09 |
| 4 | 0.25 | 0.52 | 0.11 | 120.44 | 0.11 | 0.13 | 0.13 |
| 5 | 0.33 | 0.92 | 0.16 | 184.33 | 0.14 | 0.18 | 0.20 |
| 6 | 0.49 | 1.41 | 0.19 | 250.23 | 0.19 | 0.19 | 0.27 |
| 7 | 0.68 | 1.94 | 0.30 | 399.90 | 0.26 | 0.32 | 0.44 |
| 8 | 0.91 | 2.57 | 0.41 | 602.54 | 0.37 | 0.36 | 0.50 |
| 9 | 1.16 | 3.07 | 0.56 | 796.38 | 0.38 | 0.37 | 0.65 |
| 10 | 1.49 | 4.00 | 0.80 | 1130.81 | 0.42 | 0.46 | 0.83 |
| 11 | 1.69 | 5.05 | 1.02 | - | 0.42 | 0.49 | 1.07 |
| 12 | 1.95 | 6.17 | 1.23 | - | 0.52 | 0.57 | 1.32 |
| 13 | 2.56 | 7.48 | 1.52 | - | 0.63 | 0.65 | 1.62 |
| 14 | 2.76 | 9.04 | 1.95 | - | 0.68 | 1.16 | 2.09 |
| 15 | 3.17 | 10.60 | 2.60 | - | 0.72 | 1.00 | 2.86 |
| 16 | 3.51 | 11.92 | 3.41 | - | 0.81 | 1.06 | 3.42 |
| 17 | 4.40 | 14.06 | 4.15 | - | 1.12 | 1.38 | 5.01 |
| 18 | 4.94 | 15.95 | 5.34 | - | 1.17 | 1.47 | 5.69 |
| 19 | 4.99 | 17.92 | 6.66 | - | 1.07 | 1.57 | 6.30 |
| 20 | 5.58 | 19.97 | 6.71 | - | 1.24 | 1.92 | 6.65 |

## IV Related Work

The pattern syntax introduced in this paper can be considered a type of Exceptional Model Mining (EMM) [12, 13]. EMM can be seen as a multi-target generalization of Subgroup Discovery (SD) [14], which is a single-target supervised form of Pattern Mining [15]: the broad subfield of data mining where only a part of the data is described at a time, ignoring the coherence of the remainder.

Tasks similar to SD are Contrast Set Mining [16] and Emerging Pattern Mining [17]. Neither of these tasks has been considered for multiple target attributes simultaneously, and hence they differ from the current paper in that they do not directly help in understanding interactions between variables. The relationships between Contrast Set Mining, Emerging Pattern Mining, and SD are extensively described in [18].

Distribution Rules [19] can be seen as an early instance of EMM with only one target. Umek et al. [20] do consider SD with multiple targets. They approach the attribute partition in the reverse way of EMM: candidate subgroups are generated by agglomerative clustering on the targets, and predictive modeling on the descriptors strives to find matching descriptions.

Redescription Mining by Galbrun et al. [21] is the closest related work to this paper. It considers the case where a dataset contains two distinct parts, describing the same entities from two different viewpoints. Redescription Mining treats these two parts symmetrically: it seeks descriptions inducing the same subgroup, resulting in a rule of the form . In contrast, we consider the setting where the two parts play distinct roles: one part contains description attributes on which subgroups are defined, the other part forms the numeric data which we aim to learn about and hence on which the informativeness of subgroups is evaluated. This then results in rules of the form .

Interestingly, Galbrun et al. [21, Fig. 8, Tab. 6, 7] also considered the problem of ‘biological niche finding’ on the Mammals data. However, none of the subgroups they report are the same as ours. Their version of the data also encompasses a slightly larger region, but in any case it is unsurprising that the results are quite different: the score function in Redescription Mining is not based on how much the subgroups stand out from the overall data, but only on the accuracy of the redescription and its cover. Hence, we did not further compare the results of our method with theirs.

‘Subjective Interestingness’ was first used in the context of Association Rule Mining [22, 23]. These papers formalized the prior belief of a user in a belief system, and sought association rules that contrasted with these beliefs. We base our approach on the more recent and systematic approach named FORSIED [2, 3]. This framework has been applied successfully to a variety of data mining problems, such as mining relational patterns [24], community detection [25], clustering [26], and dimensionality reduction [27]. Maximum Entropy modeling for real-valued data has also been studied before [28], in order to compute the significance of the Weighted Relative Accuracy in SD. That method targets a different pattern syntax than what is introduced here and does not apply to EMM.

Finally, Boley et al. [29] recently introduced a score function for single-target SD where a reduction in variance adds to the interestingness score of a subgroup. While their approach is less general and the interestingness score arguably less principled, they do study the algorithmic complexity of the problem in detail and derive a branch-and-bound algorithm, based on a tight optimistic estimator, that finds the globally best subgroup pattern very efficiently.

## V Discussion and Conclusion

Numerous unsupervised methods exist to make sense of real-valued datasets, most notably methods for dimensionality reduction and clustering. Labels (or more generally description attributes as in this paper) associated with the data points are then often used to interpret these results, e.g., by measuring enrichment of certain labels within a cluster, or by coloring data points in a scatter plot of a 2-D projection of the data with a color depending on the labels of the points, for subsequent visual inspection. However, whether such analyses provide explanations or insights is a matter of coincidence: there is no a priori reason that clusters should be enriched, and there is no guarantee that equally colored points are grouped in a scatter plot.

Here, we propose an alternative approach, in directly using the description attributes to guide the search for surprising multivariate relations in the data. Resulting subgroups are then automatically explained well by the descriptions. Our approach contrasts with traditional supervised methods in focusing on local patterns: properties of the target attributes that apply only to subsets of the data defined in terms of conditions on their metadata. Arguably, with increasing amounts and resulting inhomogeneity of datasets, the importance of local patterns is bound to increase.

Our approach generalizes the literature on Subgroup Discovery and Exceptional Model Mining in being applicable for real-valued target attributes of arbitrary dimensionality, and in searching for multivariate local patterns across all these dimensions, including unusual covariance structures of subgroups in the data. Moreover, the interestingness of the patterns of this type is formalized in a rigorous manner, quantifying the amount of information the user gains by observing them. We have demonstrated that the resulting algorithms are effective and efficient, in theory and in practice.

In further work, we plan to remove the dependency on third-party tools (Matlab and Cortana) and produce a standalone version of the method for public dissemination. Furthermore, it would be interesting to study similar pattern syntaxes for binary, categorical, and mixed sets of target attributes. In addition, although we have little hope of improving the search for optimal spread patterns, it may be feasible to devise a branch-and-bound approach to mine optimal location patterns efficiently. Indeed, this appears to be the most relevant question to be addressed in the future. Finally, we aim to integrate this method with SIDE [30, 31], our online tool for exploration of numerical data, which currently does not use any labels or description attributes.

Acknowledgements. This work has been supported by the ERC under the EU’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement no. 615517, FWO (project no. G091017N, G0F9816N), the EU’s Horizon 2020 research and innovation programme and the FWO under the Marie Skłodowska-Curie Grant Agreement no. 665501, the Academy of Finland (decision 288814), and Tekes (Revolution of Knowledge Work project).

## References

- [1] M. A. Redmond and A. Baveja, “A data-driven software tool for enabling cooperative information sharing among police departments,” EJOR, vol. 141, pp. 660–678, 2002.
- [2] T. De Bie, “An information theoretic framework for data mining,” in Proc. of KDD, 2011, pp. 564–572.
- [3] ——, “Subjective interestingness in exploratory data mining,” in Proc. of IDA, 2013, pp. 19–31.
- [4] M. Meeng and A. Knobbe, “Flexible enrichment with Cortana – software demo,” in Proc. of BeneLearn, 2011, pp. 117–119.
- [5] T. Cover and J. Thomas, Elements of information theory, 2nd ed. Wiley, 2005.
- [6] J.-T. Zhang, “Approximate and asymptotic distributions of chi-squared–type mixtures with applications,” JASA, vol. 100, no. 469, pp. 273–285, 2005.
- [7] F. Geerts, B. Goethals, and T. Mielikäinen, “Tiling databases,” in Proc. of DS, 2004, pp. 278–289.
- [8] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, “Manopt, a Matlab toolbox for optimization on manifolds,” JMLR, vol. 15, no. 1, pp. 1455–1459, 2014.
- [9] H. Heikinheimo, M. Fortelius, J. Eronen, and H. Mannila, “Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters,” J. Biogeogr., vol. 34, no. 6, pp. 1053–1064, 2007.
- [10] M. Boley, M. Mampaey, B. Kang, P. Tokmakov, and S. Wrobel, “One click mining: Interactive local pattern discovery through implicit preference and performance learning,” in Proc. of KDD-IDEA Workshop, 2013, pp. 27–35.
- [11] S. Džeroski, D. Demšar, and J. Grbović, “Predicting chemical parameters of river water quality from bioindicator data,” Appl. Intell., vol. 13, no. 1, pp. 7–17, 2000.
- [12] W. Duivesteijn, A. Feelders, and A. Knobbe, “Exceptional model mining - supervised descriptive local pattern mining with complex target concepts,” DMKD, vol. 30, no. 1, pp. 47–98, 2016.
- [13] D. Leman, A. Feelders, and A. Knobbe, “Exceptional model mining,” in Proc. ECML-PKDD, 2008, pp. 1–16.
- [14] W. Klösgen, “Explora: A multipattern and multistrategy discovery assistant,” in AKDDM, 1996, pp. 249–271.
- [15] K. Morik, J. Boulicaut, and A. Siebes, Eds., Local Pattern Detection, International Seminar, Dagstuhl Castle, Germany, April 12-16, 2004, Revised Selected Papers, ser. LNCS, vol. 3539, 2005.
- [16] S. D. Bay and M. Pazzani, “Detecting group differences: Mining contrast sets,” DMKD, vol. 5, no. 3, pp. 213–246, 2001.
- [17] G. Dong and J. Li, “Efficient mining of emerging patterns: Discovering trends and differences,” in Proc. of KDD, 1999, pp. 43–52.
- [18] P. Kralj Novak, N. Lavrač, and G. Webb, “Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining,” JMLR, vol. 10, pp. 377–403, 2009.
- [19] A. Jorge, P. Azevedo, and F. Pereira, “Distribution rules with numeric attributes of interest,” in Proc. of PKDD, 2006, pp. 247–258.
- [20] L. Umek and B. Zupan, “Subgroup discovery in data sets with multi-dimensional responses,” IDA, vol. 15, no. 4, pp. 533–549, 2011.
- [21] E. Galbrun and P. Miettinen, “From black and white to full color: extending redescription mining outside the boolean world,” SADM, vol. 5, no. 4, pp. 284–303, 2012.
- [22] B. Padmanabhan and A. Tuzhilin, “A belief-driven method for discovering unexpected patterns,” in Proc. of KDD, 1998, pp. 94–100.
- [23] A. Silberschatz and A. Tuzhilin, “On subjective measures of interestingness in knowledge discovery,” in Proc. of KDD, 1996, pp. 275–281.
- [24] J. Lijffijt, E. Spyropoulou, B. Kang, and T. De Bie, “P-N-RMiner: A generic framework for mining interesting structured relational patterns,” IJDSA, vol. 1, no. 1, pp. 61–76, 2016.
- [25] M. van Leeuwen, T. De Bie, E. Spyropoulou, and C. Mesnage, “Subjective interestingness of subgraph patterns,” MLJ, vol. 105, no. 1, pp. 41–75, 2016.
- [26] K.-N. Kontonasios and T. De Bie, “Subjectively interesting alternative clusterings,” MLJ, vol. 98, no. 1, pp. 31–56, 2015.
- [27] B. Kang, J. Lijffijt, R. Santos-Rodríguez, and T. De Bie, “Subjectively interesting component analysis: Data projections that contrast with prior expectations,” in Proc. of KDD, 2016, pp. 1615–1624.
- [28] K.-N. Kontonasios, J. Vreeken, and T. De Bie, “Maximum entropy modelling for assessing results on real-valued data,” in Proc. of ICDM, 2011, pp. 350–359.
- [29] M. Boley, B. R. Goldsmith, L. M. Ghiringhelli, and J. Vreeken, “Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery,” in Proc. of ECML-PKDD, 2017.
- [30] B. Kang, K. Puolamäki, J. Lijffijt, and T. De Bie, “A tool for subjective and interactive visual data exploration,” in Proc. of ECML-PKDD – Part III, 2016, pp. 3–7.
- [31] K. Puolamäki, B. Kang, J. Lijffijt, and T. De Bie, “Interactive visual data exploration with subjective feedback,” in Proc. of ECML-PKDD - Part II, 2016, pp. 214–229.