ENFrame: A Platform for Processing Probabilistic Data
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame.
The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov Clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees.
Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.
Recent years have witnessed a solid body of work in probabilistic databases with sustained systems building effort and extensive analysis of computational problems for rich classes of queries and probabilistic data models of varying expressivity. In contrast, most state-of-the-art probabilistic data mining approaches so far consider the restricted model of probabilistically independent input and produce hard, deterministic output. This technology gap hinders the development of data processing systems that integrate techniques for both probabilistic databases and data mining.
The ENFrame data processing platform aims at closing this gap by allowing users to specify iterative programs to query and mine probabilistic data. The semantics of ENFrame programs is based on a unified probabilistic interpretation of the entire processing pipeline from the input data to the program result. It features an expressive set of programming constructs, such as assignments, bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines, coupled with aspects of probabilistic databases, such as possible worlds semantics, arbitrary data correlations, and exact and approximate probability computation with error guarantees. The existing probabilistic data mining algorithms do not share these latter aspects.
Under the possible worlds semantics, the input is a probability distribution over a finite set of possible worlds, whereby each world defines a standard database or a set of input data points. The result of a user program is equivalent to executing it within each world and is thus a probability distribution over possible outcomes (e.g., partitionings). ENFrame exploits the fact that many of the possible worlds are alike, and avoids iterating over the exponentially many worlds.
Correlations occur naturally in query results, after conditioning probabilistic databases using constraints, and are supported by virtually all mainstream probabilistic models. If correlations are ignored, the output can be arbitrarily off from the expected result [37, 2]. For instance, consider two similar, but contradicting sensor readings (mutually exclusive data points) in a clustering setting. There is no possible world and thus no cluster containing both points, yet by ignoring their negative correlation, we would assign them to the same cluster.
The user is oblivious to the probabilistic nature of the input data, and can write programs as if the input data were deterministic. It is the task of ENFrame to interpret the program probabilistically. The approach taken here is to trace the user computation using fine-grained provenance, which we call events. The event language is a non-trivial extension of provenance semirings and semimodules that are used to trace the evaluation of positive relational algebra queries with aggregates and to compute probabilities of query results. It features events with negation, aggregates, loops, and definitions. It is expressive enough to succinctly encode arbitrary correlations occurring in the input data (e.g., modelled on Bayesian networks and pc-tables) and in the result of the user program (e.g., co-occurrence of data points in the same cluster), and to trace the program state at any time. By annotating each computation in the program with events, we effectively translate it into an event program: variables become random variables whose possible outcomes are conditioned on events. Selected events represent the probabilistic program output, e.g., in the case of clustering: the probability that a data point is a medoid, or the probability that two data points are assigned to the same cluster. Besides probability computation, events can be used for sensitivity analysis and explanation of the program result.
The most expensive task supported by ENFrame is probability computation for event programs, which is #P-hard in general. We developed sequential and distributed algorithms for both exact and approximate probability computation with error guarantees. The algorithms operate on graph representations of the event programs called event networks. Expressions common to several events are only represented once in such graphs. Event networks for data mining tasks are very repetitive and highly interconnected due to the combinatorial nature of the algorithms: the events at each iteration are expressions over the events at the previous iteration and have the same structure at each iteration. Moreover, the event networks can be cyclic, so as to account for program loops. While it is possible to unfold bounded-range loops, this can lead to prohibitively large event networks.
The key challenge faced by ENFrame is to compute the probabilities of a large number of interconnected events that are defined over several iterations. Rather than computing the probability of each event separately, ENFrame’s algorithms employ a novel bulk-compilation technique, using Shannon expansion to explore, depth-first, the decision tree induced by the input random variables and the events in the program. The approximation algorithms use an error budget to prune large fragments of this decision tree that only account for a small probability mass. We introduce three approximation approaches (eager, lazy, and hybrid), each with a different strategy for spending the error budget. The distributed versions of these algorithms divide the exploration space into fragments to be explored concurrently by distinct workers.
While the computation time can grow exponentially in the number of input random variables in the worst case, the structure of correlations can reduce it dramatically. As shown experimentally, ENFrame’s algorithm for exact probability computation is orders of magnitude faster than executing the user program in each possible world.
To sum up, the contributions of this paper are as follows:
We propose the ENFrame platform for processing probabilistic data. ENFrame can evaluate user programs on probabilistic data with arbitrary correlations following the possible worlds semantics.
User programs are written in a fragment of Python that supports bounded-range loops, list comprehension, aggregates, and calls to external database engines. We illustrate ENFrame’s features by giving programs for three clustering algorithms (k-means, k-medoids, and Markov clustering) and provide a formal specification of ENFrame’s user language, which can be used to write arbitrary programs for the platform.
User programs are annotated by ENFrame with events that are expressive enough to capture the correlations of the input, trace the program computation, and allow for probability computation.
ENFrame uses novel sequential and distributed algorithms for exact and approximate probability computation of event programs.
We implemented ENFrame’s probability computation algorithms in C++.
We report on experiments with k-medoids clustering of readings from partial discharge sensors in energy networks. We show orders-of-magnitude performance improvements of ENFrame’s exact algorithm over the naïve approach of clustering in each possible world, of approximate over exact clustering, and of distributed over sequential algorithms.
The paper is organised as follows. Section 2 introduces the Python fragment supported by ENFrame along with encodings of clustering algorithms. Section 3 introduces our event language and shows how user programs are annotated with events. Our probability computation algorithms are introduced in Section 4, and experimentally evaluated in Section 5. Section 6 overviews recent related work.
2 ENFrame’s User Language
This section introduces the user language supported by ENFrame. Its design is grounded in three main desiderata:
It should naturally express common mining algorithms, and allow users to issue queries and manipulate their results.
User programs must be oblivious to the deterministic or probabilistic nature of the input data and to the probabilistic formalism considered.
It should be simple enough to allow for an intuitive and straightforward probabilistic interpretation.
We settled on a subset of Python that can express, among others, k-means, k-medoids, and Markov Clustering. In line with query languages for probabilistic databases, where a Boolean query is a map for deterministic databases and a Boolean random variable for probabilistic databases, every user program has a sound semantics for both deterministic and probabilistic input data: in the former case, the result of a clustering algorithm is a deterministic clustering; in the latter case, it is a probability distribution over possible clusterings.
The user language comprises the following constructs:
Variables and arrays. Variables can be of scalar types (real, integer, or Boolean) or arrays. Examples of variable assignments: V = 2, W = V, M = True, or M[i] = W. Arrays must be initialised, e.g., for array M of cardinality k: M = [None] * k. Additionally, the expression range(0, n) specifies the array [0,…,n-1].
Functions. Scalar variables can be multiplied, exponentiated (pow(B, r) for B^r), and inverted (invert(B) for 1/B). The function dist(A,B) is a distance measure on the feature space between the vectors specified by arrays A,B of reals; scalar_mult is component-wise multiplication of an array with a scalar.
Reduce. A one-dimensional array M of some scalar type can be reduced to a scalar value by applying one of the functions reduce_and, reduce_or, reduce_sum, reduce_mult, reduce_count. For instance, for an array B of Booleans, the expression reduce_and(B) computes the Boolean conjunction of the truth values in B, and the expression reduce_count(B) computes the number of elements in B. For a two-dimensional array of reals or integers, i.e., an array of vectors, reduce_sum computes the component-wise sum of the vectors.
List comprehension. Inside a reduce-function, anonymous arrays may be defined using list comprehension. For example, given an array B of Booleans of size n, the expression reduce_sum([1 for i in range(0,n) if B[i]]) counts the number of True values in B.
Loops. We only allow bounded-range loops; for any fixed integer n and counter variable i, for-loops can be defined by: for i in range(0,n). This allows us to know the size of each constructed array at compile time.
Input data. The special abstract primitive loadData() is used to specify input data for algorithms. This function can be implemented to statically specify the objects to be clustered, to load them from disk, or to issue queries to a database. ENFrame supports positive relational algebra queries with aggregates via the SPROUT query engine for probabilistic data. The abstract methods loadParams() and init() are used to set algorithm parameters such as the number of iterations and clusters of a clustering algorithm.
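To make the constructs above concrete, the following minimal sketch gives deterministic stand-ins for some of the primitives. It is not ENFrame's implementation: the Euclidean choice for dist, the argument order of scalar_mult, and the behaviour of reduce_sum on an empty array are assumptions made for illustration only.

```python
import math

def dist(a, b):
    # Assumption: Euclidean distance; the text only requires "a distance measure".
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scalar_mult(s, a):
    # Assumption: the scalar comes first, the array second.
    return [s * x for x in a]

def invert(b):
    return 1.0 / b

def reduce_and(bs):
    return all(bs)

def reduce_or(bs):
    return any(bs)

def reduce_count(xs):
    return len(xs)

def reduce_sum(xs):
    xs = list(xs)
    # An array of vectors is summed component-wise, an array of scalars directly.
    if xs and isinstance(xs[0], (list, tuple)):
        return [sum(col) for col in zip(*xs)]
    return sum(xs)

# List comprehension inside a reduce function, as described in the text.
B = [True, False, True]
n = 3
true_count = reduce_sum([1 for i in range(0, n) if B[i]])
```

Under this deterministic reading, true_count counts the True entries of B, and reduce_sum([[1.0, 2.0], [3.0, 4.0]]) yields the component-wise sum of the two vectors.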
2.1 Clustering Algorithms in ENFrame
We illustrate ENFrame’s user language with three example data mining algorithms: k-means, k-medoids, and Markov Clustering. Figures 1, 2, and 3 list user programs for these algorithms; we next discuss each of them.
k-means clustering. The k-means algorithm partitions a set of data points into groups of similar data points. We initially choose a centroid for each cluster, i.e., a data point representing the cluster centre (initialisation phase). In successive iterations, each data point is assigned to the cluster with the closest centroid (assignment phase), after which the centroid is recomputed for each cluster (update phase). The algorithm terminates after a given number of iterations or after reaching convergence. Note that our user language does not support fixpoint computation, and hence cannot check for convergence.
Figure 2 implements k-means. The set O of input objects is retrieved using a loadData call. Each object is represented by a feature vector (i.e., array) of reals. We then load the parameters k, the number iter of iterations, and initialise cluster centroids M (line 3). The initialisation phase has a significant influence on the clustering outcome and convergence. We assume that initial centroids have been chosen, for example by using a heuristic. Subsequently, an array InCl of Booleans is computed such that InCl[i][l] is True if and only if M[i] is the closest centroid to object O[l] (lines 5–10); every object is then assigned to its closest cluster. Since two clusters may be equidistant to an object, ties are broken using the breakTies2 call (line 11); it fixes an order of the clusters and enforces that each object is only assigned to the first of its potentially multiple closest clusters. Next, the new cluster centroids M[i] are computed as the centroids of each cluster (lines 12–16). The assignment and update phases are repeated iter times (line 4).
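The description above can be sketched as a runnable program in the user-language fragment. This is a reconstruction from the prose, not the paper's Figure 2, so the line numbers cited in the text do not apply to it. The helper bodies, the toy data returned by loadData, the parameters returned by loadParams, the fixed choice made by init, and the name iters (used instead of iter to avoid shadowing a Python built-in) are all assumptions for illustration.

```python
import math

# Deterministic stand-ins for the language primitives (assumed signatures).
def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scalar_mult(s, a):
    return [s * x for x in a]

def invert(x):
    return 1.0 / x

def reduce_and(bs):
    return all(bs)

def reduce_count(xs):
    return len(xs)

def reduce_sum(vectors):
    # Component-wise sum of a non-empty array of vectors.
    return [sum(col) for col in zip(*vectors)]

def breakTies2(M):
    # For each object l, keep only the first True entry over the clusters i.
    R = [row[:] for row in M]
    for l in range(len(R[0])):
        seen = False
        for i in range(len(R)):
            if R[i][l] and seen:
                R[i][l] = False
            seen = seen or R[i][l]
    return R

# Hypothetical toy input and parameters.
def loadData():
    return [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]

def loadParams():
    return 2, 3  # k clusters, number of iterations

def init(O, k):
    return [O[0], O[2]]  # fixed initial centroids, for illustration only

def kmeans():
    O = loadData()
    n = len(O)
    k, iters = loadParams()
    M = init(O, k)
    InCl = None
    for it in range(0, iters):
        # Assignment phase: InCl[i][l] holds iff M[i] is a closest centroid to O[l].
        InCl = [None] * k
        for i in range(0, k):
            InCl[i] = [None] * n
            for l in range(0, n):
                InCl[i][l] = reduce_and(
                    [dist(O[l], M[i]) <= dist(O[l], M[j]) for j in range(0, k)])
        InCl = breakTies2(InCl)
        # Update phase: each centroid becomes the mean of its cluster members.
        for i in range(0, k):
            M[i] = scalar_mult(
                invert(reduce_count([1 for l in range(0, n) if InCl[i][l]])),
                reduce_sum([O[l] for l in range(0, n) if InCl[i][l]]))
    return M, InCl
```

The sketch assumes that no cluster becomes empty; with an empty cluster, invert would divide by zero, a case the prose does not discuss.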
k-medoids clustering. The k-medoids algorithm is almost identical to k-means, but elects cluster medoids rather than centroids: these are cluster members that minimise the sum of distances to all other objects in the cluster. The assignment phase is the same as for k-means (lines 5–11), while the update phase is more involved: we first compute an array DistSum of sums of distances between each cluster medoid and all other objects in its cluster (lines 12–17), then find one object in each cluster that minimises this sum (lines 18–24), and finally elect these objects as the new cluster medoids M (lines 25–27). The last step uses reduce_sum to select exactly one of the objects in a cluster as the new medoid, since for each fixed i only one value in Centre[i][l] is True due to the tie-breaker in line 24.
Markov clustering (MCL). MCL is a fast and scalable unsupervised clustering algorithm for graphs based on simulation of stochastic flow in graphs. Natural clusters in a graph are characterised by the presence of many edges within a cluster and few edges across clusters. MCL simulates random walks within a graph by alternating two operations: expansion and inflation. Expansion corresponds to computing random walks of higher length. It associates new probabilities with all pairs of nodes, where one node is the point of departure and the other is the destination. Since higher length paths are more common within clusters than between different clusters, the probabilities associated with node pairs lying in the same cluster will, in general, be relatively large as there are many ways of going from one to the other. Inflation has the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without a priori knowledge of cluster structure; it is the result of cluster structure being present.
Figure 3 gives the MCL user program. Expansion coincides with taking the power of a stochastic matrix M using the normal matrix product (i.e. matrix squaring). Inflation corresponds to taking the Hadamard power of a matrix (taking powers entry-wise). It is followed by a scaling step to maintain the stochastic property, i.e. the matrix elements correspond to probabilities that sum up to 1 in each column.
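The two MCL operations can be sketched on plain nested lists as follows. This is an illustrative deterministic version, not the program of Figure 3; the function names, the default inflation exponent r = 2, and the requirement that every column has a non-zero sum (for example, by adding self-loops) are assumptions.

```python
def normalise_columns(M):
    # Scale each column so that its entries sum to 1 (column-stochastic matrix).
    n = len(M)
    col_sums = [sum(M[r][c] for r in range(n)) for c in range(n)]
    return [[M[r][c] / col_sums[c] for c in range(n)] for r in range(n)]

def expand(M):
    # Expansion: ordinary matrix product of M with itself (matrix squaring).
    n = len(M)
    return [[sum(M[r][j] * M[j][c] for j in range(n)) for c in range(n)]
            for r in range(n)]

def inflate(M, r):
    # Inflation: entry-wise (Hadamard) power followed by column rescaling.
    return normalise_columns([[pow(x, r) for x in row] for row in M])

def mcl(adjacency, r=2, iters=10):
    M = normalise_columns(adjacency)
    for _ in range(iters):
        M = inflate(expand(M), r)
    return M

# Hypothetical graph: two disconnected pairs of nodes, each node with a self-loop.
adjacency = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 1, 1],
             [0, 0, 1, 1]]
flow = mcl(adjacency)
```

In this toy graph no flow ever crosses between the two pairs, so the entries connecting them stay at zero while every column keeps summing to 1.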
Section 3 discusses the probabilistic interpretation of the computation of the above three clustering algorithms.
2.2 Syntax of the User Language
Figure 4 specifies the formal grammar for the language of user programs. A program is a sequence of declarations (DECL) and loop blocks (LOOP), each of which may again contain declarations and nested loops. The language allows expressions (EXPR) to be assigned to variable identifiers (ID). An expression may be a Boolean, integer, or float constant (LIT), an identifier, an array declaration, the result of a Boolean comparison between expressions, or the result of such operations as sum, product, inversion, or exponentiation. The result of a reduce operation on an anonymous array created through list comprehension (LCOMPR), and the result of breaking ties in a Boolean array give rise to expressions; we elaborate on these two constructions below.
In addition to the syntactic structure as defined by the grammar, programs have to satisfy the following constraints:
Bounded-range loops. The parameters to the range construct must be integer constants (or immutable integer-valued variables). This restriction ensures that for-loops (LOOP) and list comprehensions (LCOMPR) are of bounded size that is known at compile time.
Anonymous arrays via list comprehension. List comprehension may only be used to construct one-dimensional arrays of base types, i.e., arrays of integers, floats, or Booleans.
Breaking ties. Clustering algorithms require explicit handling of ties: for instance, if an object is equidistant to two distinct cluster centroids in k-means, the algorithm has to decide which cluster the object will be assigned to. In ENFrame programs, the membership of objects to clusters can be encoded by a Boolean array like InCl such that InCl[i][l] is true if and only if object l is in cluster i. In this context, a tie is a configuration of InCl in which for a fixed object l, InCl[i][l] is True for more than one cluster i. We explicitly break such ties using the function breakTies2(M). For each fixed value l of the second dimension (hence the 2 in the function name) of the 2-dimensional array M, it iterates over the first dimension i of M and sets all but the first True value of M[i][l] to False. Symmetrically, the function breakTies1(M) fixes the first dimension and breaks ties in the second dimension of M, and breakTies(M) breaks ties in a one-dimensional array.
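A minimal deterministic sketch of the three tie-breaking functions, written directly from the description above. Whether ENFrame's versions modify the array in place or return a new one is not stated; returning a new array is an assumption made here.

```python
def breakTies(B):
    # Keep only the first True value of a one-dimensional Boolean array.
    out, seen = [], False
    for b in B:
        out.append(bool(b) and not seen)
        seen = seen or bool(b)
    return out

def breakTies1(M):
    # Fix the first dimension i and break ties along the second dimension l.
    return [breakTies(row) for row in M]

def breakTies2(M):
    # Fix the second dimension l and break ties along the first dimension i.
    rows, cols = len(M), len(M[0])
    fixed = [breakTies([M[i][l] for i in range(rows)]) for l in range(cols)]
    return [[fixed[l][i] for l in range(cols)] for i in range(rows)]

# Object 0 is tied between clusters 0 and 1; breakTies2 keeps it in cluster 0.
InCl = [[True, True, False],
        [True, False, True]]
resolved = breakTies2(InCl)
```

After breakTies2, each column of resolved contains at most one True value, so every object belongs to at most one cluster.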
3 Tracing Computation by Events
The central concept for representing user programs in ENFrame is that of events. Each event is a concise syntactic encoding of a random variable and its probability distribution. This section describes the syntax and semantics of events and event programs, and finally explains how ENFrame programs written in the user language from Section 2 can be translated to event programs.
The key features of events and event programs are:
Events can encode arbitrarily correlated, discrete probability distributions over input objects. In particular, they can succinctly encode instances of such formalisms as Bayesian networks and pc-tables. The input objects and their correlations can be explicitly provided, or imported via a positive relational algebra query with aggregates over pc-tables.
By allowing non-Boolean events, our encoding is exponentially more succinct than an equivalent purely Boolean description.
Each event has a well-defined probabilistic semantics that allows it to be interpreted as a random variable.
The iterative nature of many clustering algorithms carries over to event programs, in which events can be defined by means of nested loops. This construction together with the ability to reuse existing, named events in the definition of new, more complex events leads to a concise encoding of a large number of distinct events.
Clustering in possible worlds. We start by presenting an instructive example of k-medoids clustering under possible worlds semantics. Let be objects in the feature space as shown below. They can be clustered into two clusters using k-medoids with medoids and .
In order to go from deterministic to uncertain objects, we associate each object with a Boolean propositional formula – the event – over a set of independent Boolean random variables . The possible valuations define the possible worlds of the input objects: for each valuation there exists one world that contains exactly those objects for which is true under . The probability of a world is the product of the probabilities of the variables taking a truth value .
Let us assume that the objects have the following events:
Distinct worlds can have different clustering results, as exemplified next. The world defined by consists of objects , , and , for which k-medoids clustering yields:
Similarly, the worlds defined by and any assignment for , yield:
The probability of a query “Are and in the same cluster?” is the sum of the probabilities of the worlds in which and are in the same cluster.
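The naïve possible-worlds evaluation described above can be sketched as follows. The objects, events, and probabilities are hypothetical and do not reproduce the paper's example; the brute-force two_medoids routine on one-dimensional points is a stand-in for k-medoids, and worlds in which either queried object is absent are counted as "not in the same cluster" by assumption.

```python
from itertools import combinations, product

def worlds(var_probs):
    # Enumerate all valuations of the independent Boolean variables together
    # with their probabilities (product of the per-variable probabilities).
    names = sorted(var_probs)
    for bits in product([True, False], repeat=len(names)):
        nu = dict(zip(names, bits))
        p = 1.0
        for x in names:
            p *= var_probs[x] if nu[x] else 1.0 - var_probs[x]
        yield nu, p

def two_medoids(points):
    # Brute-force 2-medoids on 1-D points: map each object id to its medoid id.
    ids = sorted(points)
    if len(ids) < 2:
        return {i: i for i in ids}
    best = None
    for m in combinations(ids, 2):
        cost = sum(min(abs(points[i] - points[c]) for c in m) for i in ids)
        if best is None or cost < best[0]:
            best = (cost, m)
    medoids = best[1]
    return {i: min(medoids, key=lambda c: abs(points[i] - points[c])) for i in ids}

def same_cluster_probability(objects, events, var_probs, a, b):
    total = 0.0
    for nu, p in worlds(var_probs):
        present = {i: pos for i, pos in objects.items() if events[i](nu)}
        if a in present and b in present:
            assignment = two_medoids(present)
            if assignment[a] == assignment[b]:
                total += p  # add the probability of this world
    return total

# Hypothetical input: objects 2 and 3 are mutually exclusive via variable x2.
objects = {0: 0.0, 1: 1.0, 2: 5.0, 3: 6.0}
events = {0: lambda nu: nu["x1"], 1: lambda nu: True,
          2: lambda nu: nu["x2"], 3: lambda nu: not nu["x2"]}
var_probs = {"x1": 0.6, "x2": 0.5}
p_same = same_cluster_probability(objects, events, var_probs, 0, 1)
```

The loop visits all 2^n valuations, which is exactly the exponential cost that ENFrame aims to avoid.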
Events do not only encode the correlations and probabilities of input objects, but can symbolically encode the entire clustering process. We illustrate this in the next example.
Symbolic encoding of k-means by events. We again assume four input objects , …, with their respective events . This example introduces conditional values (c-values), which are expressions of the form , where is a Boolean formula and is a vector from the feature space. Intuitively, this c-value takes the value whenever evaluates to true, and a special undefined value when is false. C-values can be added and multiplied; for example, the expression evaluates to if and are true, or to if is true and is false, etc.
Equipped with c-values, an initialisation of k-means with can for instance be written in terms of two expressions and : Centroid is set to object if is true and to if is false; centroid is always set to the geometric centre of and .
In the assignment phase, each object is assigned to its nearest cluster centroid. The condition InCl for object being closest to can be written as the Boolean event ,
which encodes that the distance from to centroid is smaller than the distance to centroid .
Given the Boolean events InCl, we can represent the centroid of cluster for the next iteration by the expression , which specifies a random variable over possible cluster centroids conditioned on the assignments of objects to clusters as encoded by InCl. This expression is exponentially more succinct than an equivalent purely Boolean encoding of cluster centroids, since the latter would require one Boolean expression for each subset of the four input objects.
The event programs corresponding to the three user programs for k-means, k-medoids, and MCL are given on the right side of Figures 1–3. In addition to the constructs introduced in Example 2, they use event declarations that assign identifiers to event expressions, and -loops that specify sets of events parametrised by . The remainder of this section specifies the formal syntax and semantics of event programs, and gives a translation from user to event programs.
3.1 Syntax of Event Expressions
The grammar for event expressions is as follows:
|CVAL CVAL | dist(CVAL, CVAL) | EVENT CVAL|
The main constructs are:
Conditional values. Reals and feature vectors are denoted by VAL. Together with a propositional formula, they give rise to a conditional value (CVAL), c-value for short.
Functions of conditional values. Very much like scalars and feature vectors, c-values can be added, multiplied, and exponentiated. Additionally, the distance between two c-values yields another c-value. In addition to the binary operations specified in the grammar (e.g., CVAL+CVAL), we allow Σ- and Π-expressions (see Figure 2).
Event expressions. Event expressions (EVENT) are propositional formulas over the constants true and false, a set of Boolean random variables, event identifiers, and propositions defined by ATOM: [CVAL COMP CVAL] represents the truth value obtained by comparing two c-values.
3.2 Semantics of Event Expressions
The semantics of event expressions is defined by extending a Boolean valuation to a valuation of c-values and event expressions. We define in the sequel how acts on each of the expression types in the grammar. The base cases of this mapping are the standard algebraic operations on scalars and the feature space, extended by special undefined elements as follows.
We extend the reals (and their operations , , ) by a special element (for undefined) such that . Operators propagate as and for any real . For any other reals , and are as usual. For example, .
Similarly, we extend the feature space by an element . For any real and feature vector, and are propagated as , , , and .
The grammar for event programs does not distinguish between scalars and feature vectors for the sake of notational clarity. The following description implicitly assumes that the expressions are well-typed; e.g., the expression is only defined for vector-valued variable symbols .
CVAL. Conditional values of the form EVENTVAL have an if-then-else semantics: If EVENT evaluates to true, then EVENTVAL evaluates to VAL, else it evaluates to (or for vector-valued c-values); the recursively constructed CVAL expressions have the natural recursive semantics that ultimately defaults to and for scalars and feature vectors.
ATOM, EVENT. Comparisons for between two c-values evaluate to false if they are both defined ( and ) and the comparison does not hold; otherwise (i.e., if at least one of the c-values is undefined, or if the comparison holds), they evaluate to true. The semantics of the Boolean propositional EVENT expressions is standard, i.e., by propagating through the propositional operators . For instance evaluates to true if , and to false otherwise.
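The if-then-else semantics of c-values and the comparison rule can be sketched for scalar c-values as below. The representation of events and c-values as Python functions over a valuation, the use of None for the undefined element, and the names cval, add, and compare are assumptions for illustration. Addition treats the undefined element as neutral, in line with the earlier example of adding two c-values; multiplication is omitted because its treatment of the undefined element cannot be recovered from the text here.

```python
import operator

UNDEF = None  # stands for the special undefined element

def cval(event, val):
    # A c-value evaluates to val if its event is true, and to UNDEF otherwise.
    return lambda nu: val if event(nu) else UNDEF

def add(c1, c2):
    # Sum of two c-values; an undefined operand does not contribute.
    def evaluate(nu):
        a, b = c1(nu), c2(nu)
        if a is UNDEF:
            return b
        if b is UNDEF:
            return a
        return a + b
    return evaluate

def compare(c1, op, c2):
    # A comparison is false only if both sides are defined and op fails.
    def evaluate(nu):
        a, b = c1(nu), c2(nu)
        if a is UNDEF or b is UNDEF:
            return True
        return op(a, b)
    return evaluate

# Hypothetical events over two Boolean variables x1 and x2.
x1 = lambda nu: nu["x1"]
x2 = lambda nu: nu["x2"]
total = add(cval(x1, 2.0), cval(x2, 3.0))
atom = compare(cval(x1, 2.0), operator.le, cval(x2, 1.0))
```

For the valuation in which x1 is true and x2 is false, total evaluates to 2.0 and atom evaluates to true, since its right-hand side is undefined.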
3.3 Probabilistic Semantics of Events
We next give a probabilistic interpretation of event expressions that explains how they can be understood as random variables: Boolean event expressions (EVENT) give rise to Boolean random variables, and conditional values (CVAL) give rise to random variables over their respective domain.
For every random variable , we denote by and the probability that is true or false, respectively; we also simply write for . Let be the set of mappings from the random variables to true and false.
Definition 1 (Induced Probability Space).
Together, the probability mass function for every sample , and the probability measure for define a probability space that we call the probability space induced by .
An event expression is a random variable over the probability space induced by with probability distribution
By virtue of this definition, every Boolean event expression becomes a Boolean random variable, and real-valued (vector-valued) c-values become random variables over the reals (the feature space).
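As a hedged illustration of this interpretation, the sketch below computes the distribution of an expression by enumerating all valuations of independent Boolean variables and summing the probability mass per outcome. The function name distribution, the representation of expressions as Python functions over a valuation, and the example probabilities are assumptions.

```python
from collections import defaultdict
from itertools import product

def distribution(expr, var_probs):
    # Map each outcome of expr to the total probability of the valuations
    # (samples of the induced probability space) that produce it.
    names = sorted(var_probs)
    result = defaultdict(float)
    for bits in product([True, False], repeat=len(names)):
        nu = dict(zip(names, bits))
        mass = 1.0
        for x in names:
            mass *= var_probs[x] if nu[x] else 1.0 - var_probs[x]
        result[expr(nu)] += mass
    return dict(result)

var_probs = {"x1": 0.5, "x2": 0.25}

# A Boolean event: x1 and not x2.
event_dist = distribution(lambda nu: nu["x1"] and not nu["x2"], var_probs)

# A real-valued c-value: 2.0 if x1 holds, undefined (None) otherwise.
cval_dist = distribution(lambda nu: 2.0 if nu["x1"] else None, var_probs)
```

The Boolean event yields a distribution over True and False, while the c-value yields one over the value 2.0 and the undefined outcome.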
3.4 Event Programs
Event programs are imperative specifications that define a finite set of named c-values and event expressions. The grammar for event programs is as follows:
Event programs consist of a sequence of event declarations (DECL) and nested loops (LOOP) of event declarations.
A central concept is that of event identifiers (EID); it is required that event declarations are immutable, i.e. each distinct EID may only be assigned once to an event expression. Inside a -loop, identifiers can be parametrised by to create a distinct identifier in each iteration of the loop.
The meaning of an event program is simply the set of all named and grounded c-value and event expressions defined by the program; grounded here means that all identifiers in expressions are recursively resolved and replaced by the referenced expressions. For declarations outside of loops, this is clear; each declaration inside of (nested) loops is instantiated for each value of the loop counter variables.
3.5 From User Programs to Event Programs
The translation of user to event programs has two main challenges: (i) Translating mutable variables and arrays to immutable events, and (ii) translating function calls such as reduce_*. We cover these two issues separately.
From mutable variables to immutable events. It is natural to reassign variables in user language programs, for example when updating k-means centroids in each iteration based on the cluster configuration of the previous iteration. In contrast, events in event programs are immutable, i.e., can be assigned only once. The translation from the user language to the event language utilises a function getLabel that generates for each user language variable a sequence of unique event identifiers whose lexicographic order reflects the sequence of assignments of .
The basic idea of getLabel is to first identify the nested loop blocks of the given user language program, and then to establish a counter for each distinct variable symbol and each block. An assignment of a variable within nested blocks corresponds to an event identifier of the form where are the counters for the blocks. Within each block, its corresponding counter is incremented for every assignment of its variable symbol. When going from one block into a nested inner block, the counters for the outer blocks are kept constant while the counter for the inner block is incremented as is reassigned in the inner block.
Special attention must be paid to the encoding of entering and leaving a block: In order to carry over the reference to a variable to the next block at level , we establish a copy , such that the first access to in the block may access its last assignment of via . Similarly, the last assignment of a variable in the inner block is passed back to the outer block by copying the last identifier of an inner block to the next identifier of the outer block.
Consider the following user language program (left) and its translation to an event program (right).
The user language program has three nested blocks. Within each block, the respective counter is incremented for each assignment of : for the outer block, in the second block, and for the innermost block. The encodings for entering and leaving a block are in lines C and F, and lines I and J, respectively.
Translation of arrays. Since arrays in a user language program have a known fixed size, their translation is straightforward: A -dimensional array translates to distinct identifiers .
Translation of reduce_* calls. According to the grammar in Figure 4, reduce-operations can only be applied to anonymous arrays created by list comprehension. The expression reduce_and([EXPR for ID in range(FROM, TO) if COND]) is translated to the Boolean event . Symmetrically, reduce_or translates to , reduce_sum to , and reduce_mult to . A call to reduce_count([EXPR for ID in range(FROM, TO) if COND]) translates to the event .
4 Probability Computation
The probability computation problem is known to be #P-hard already for simple events representing propositional formulas such as positive bipartite formulas in disjunctive normal form. In ENFrame, we need to compute probabilities of a large number of interconnected complex events. Although the worst-case complexity remains hard, we attack the problem with three complementary techniques: (1) bulk-compile all events into one decision tree while exploiting the structure of the events to obtain smaller trees, (2) employ approximation techniques to prune significant parts of the decision tree, and ultimately (3) distribute the compilation by assigning distinct distributed workers to explore disjoint parts of the tree.
We next introduce the bulk-compilation technique, look at three approximation approaches, and discuss how to distribute the probability computation.
4.1 Compilation of event programs
The event programs consist of interconnected events, which are represented in an event network: a graph representation of the event programs, in which nodes are, e.g., Boolean connectives, comparisons, aggregates, and c-values. An example of such a network is depicted in the accompanying figure (label: example-dag).
The goal is to compute probabilities for the top nodes in the network, which are referred to as compilation targets. These nodes represent events such as “a given object is assigned to a given cluster in a given iteration”. We keep lower and upper bounds for the probability of each target. Initially, these bounds are $[0, 1]$, and they eventually converge during computation.
The bulk-compilation procedure is based on Shannon expansion: select an input random variable $x$ and partially evaluate each compilation target $\Phi$ to $\Phi|_{x}$ for $x$ being set to true and to $\Phi|_{\neg x}$ for $x$ being set to false. Then, the probability of $\Phi$ is defined by $P(\Phi) = P(x) \cdot P(\Phi|_{x}) + P(\neg x) \cdot P(\Phi|_{\neg x})$. We are now left with two simpler events $\Phi|_{x}$ and $\Phi|_{\neg x}$. By repeating this procedure, we eventually resolve all variables in the events to the constants true or false. The trace of this repeated expansion is the decision tree. We need not materialise the tree. Instead, we explore it depth-first and collect the probabilities of all visited branches, recording for each event the sum of the probabilities of the branches that satisfied it and the sum for the branches that did not. At any time, the former is a lower bound on the probability of the event, and one minus the latter is an upper bound. This compilation procedure needs time polynomial in the network size (and in the size of the input data set), yet in the worst case (unavoidably) exponential in the number of variables used by the events.
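Shannon expansion on a single event can be sketched as follows. This is a minimal illustration under stated assumptions, not ENFrame's compiler: events are encoded as nested tuples (`("var", name)`, `("not", e)`, `("and", ...)`, `("or", ...)`), the variables are independent and Boolean, and the names `restrict`, `variables` and `probability` are hypothetical.

```python
from fractions import Fraction


def restrict(event, var, value):
    """Partially evaluate `event` with `var` fixed to `value`, folding constants."""
    if isinstance(event, bool):
        return event
    op = event[0]
    if op == "var":
        return value if event[1] == var else event
    if op == "not":
        sub = restrict(event[1], var, value)
        return (not sub) if isinstance(sub, bool) else ("not", sub)
    absorbing = (op == "or")  # True absorbs "or"; False absorbs "and"
    parts = []
    for child in event[1:]:
        sub = restrict(child, var, value)
        if isinstance(sub, bool):
            if sub == absorbing:
                return absorbing
            continue  # neutral constant, drop it
        parts.append(sub)
    return (op, *parts) if parts else (not absorbing)


def variables(event):
    """Set of variable names occurring in `event`."""
    if isinstance(event, bool):
        return set()
    if event[0] == "var":
        return {event[1]}
    return set().union(*(variables(c) for c in event[1:]))


def probability(event, prob):
    """P(event) by repeated Shannon expansion; assumes the input event
    contains no Boolean constants and `prob` maps each variable to P(true)."""
    if isinstance(event, bool):
        return Fraction(int(event))
    x = min(variables(event))  # any variable works; this choice is arbitrary
    p = prob[x]
    return (p * probability(restrict(event, x, True), prob)
            + (1 - p) * probability(restrict(event, x, False), prob))


# (x1 or x2) and not x3, with independent variables
event = ("and", ("or", ("var", "x1"), ("var", "x2")), ("not", ("var", "x3")))
prob = {"x1": Fraction(1, 2), "x2": Fraction(1, 4), "x3": Fraction(1, 5)}
result = probability(event, prob)
```

Here `result` equals 1/2, which matches the direct calculation (1 - 1/2 · 3/4) · 4/5. The sketch materialises the restricted events, which the bulk-compilation procedure avoids.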
For practical reasons, we do not construct the partially evaluated events explicitly, but keep minimal information that, in addition to the network, uniquely defines them. The process of computing this minimal information is called masking. We achieve this by traversing the network bottom-up and remembering the nodes that become true or false given the values of their children. When a compilation target is eventually masked by a variable assignment, the probability of that assignment is added to its lower bound if the target is masked true, or subtracted from its upper bound if it is masked false. If one or more targets are left unmasked, a next variable is chosen and the process is repeated with the assignment extended by that variable, set to either true or false. The algorithm chooses a next variable such that it influences as many events as possible.
Once all compilation targets are masked by an assignment, the compilation backtracks and selects a different assignment for the most recently chosen variable whose assignments are not exhausted. When all branches of the decision tree have been investigated, the probability bounds of the targets have necessarily converged and the algorithm terminates.
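The interplay of masking, bounds and backtracking can be sketched for several targets at once. The code below is an illustrative simplification under assumptions: masking is done by three-valued re-evaluation of each target rather than by incremental propagation through a shared network, variables are picked in a fixed order rather than by influence, and all names (`mask`, `compile_targets`) are hypothetical.

```python
from fractions import Fraction


def mask(event, assignment):
    """Three-valued evaluation: True, False, or None if not yet determined."""
    op = event[0]
    if op == "var":
        return assignment.get(event[1])
    if op == "not":
        m = mask(event[1], assignment)
        return None if m is None else (not m)
    results = [mask(child, assignment) for child in event[1:]]
    if op == "and":
        if False in results:
            return False
        return None if None in results else True
    if op == "or":
        if True in results:
            return True
        return None if None in results else False
    raise ValueError(f"unknown node type: {op}")


def compile_targets(targets, prob, order):
    """Depth-first exploration of the decision tree over `order`,
    maintaining [lower, upper] probability bounds per target."""
    lower = {name: Fraction(0) for name in targets}
    upper = {name: Fraction(1) for name in targets}

    def dfs(assignment, mass, depth, open_targets):
        still_open = []
        for name in open_targets:
            m = mask(targets[name], assignment)
            if m is True:
                lower[name] += mass      # branch satisfies the target
            elif m is False:
                upper[name] -= mass      # branch falsifies the target
            else:
                still_open.append(name)
        if not still_open:
            return                       # all targets masked: backtrack
        x = order[depth]
        for value, p in ((True, prob[x]), (False, 1 - prob[x])):
            dfs({**assignment, x: value}, mass * p, depth + 1, still_open)

    dfs({}, Fraction(1), 0, list(targets))
    return lower, upper


x1, x2, x3 = ("var", "x1"), ("var", "x2"), ("var", "x3")
targets = {"t1": ("or", x1, x2), "t2": ("and", x1, ("not", x3))}
prob = {"x1": Fraction(1, 2), "x2": Fraction(1, 4), "x3": Fraction(1, 5)}
lower, upper = compile_targets(targets, prob, ["x1", "x2", "x3"])
```

After the full traversal the bounds coincide: both are 5/8 for `t1` and 2/5 for `t2`.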
The figure labelled example-dag shows a simplified event network under a partial variable assignment. The masks of the assigned variables are propagated to the event nodes that depend on them, which are now also masked. Nodes are coloured red or green according to the truth value with which they are masked.
Algorithm 1 (network-dfs) gives the pseudocode for the DFS-traversal of the decision tree. The blue lines are necessary for approximate probability computation and will be explained later. Compilation starts with an empty branch (variable assignment); the mask values and probability bounds for all nodes in the event network are initialised. The error budgets for the targets are set to 0 for exact computation. After the initialisation, the dfs procedure is called on the empty branch. The procedure selects the first variable and recursively calls itself using two newly created branches of the decision tree: one for the variable set to true and one for the variable set to false. These branches are propagated into the event network using the Mask procedure. If every target is reached, dfs returns. Otherwise, it selects a next variable and recursively calls dfs on the two new tree branches.
Algorithm 2 (network-mask) performs mask propagation: a mask (assignment) for a variable is inserted into the network, and the variable node propagates the mask to its parent nodes. Depending on the event node, its node mask is either updated and propagated further, or propagation is stopped in case a mask cannot be established.
Convergence of the algorithm (e.g., clustering) can be detected by comparing the mask values at the network nodes of one iteration with the masks of the corresponding nodes of the next iteration. If none of the mask assignments has changed between the two iterations, then the algorithm has converged.
4.2 Bounded-range loops in event networks
Event programs can contain bounded-range loops for iterative algorithms. ENFrame offers two ways of encoding such loops in an event network: unfolded, in which case the events at any loop iteration are explicitly stored as distinct nodes in the network, or a more efficient folded approach in which all iterations are captured by a single set of nodes. The compilation of the network then involves looping. The pseudocode in Algorithms 1 and 2 assumes unfolded event networks. They need minor modifications to work on folded networks: the mask data structure becomes two-dimensional so that it can store the mask for a node at any iteration, the dfs procedure needs an additional parameter for the current compilation iteration, and the network requires an additional node to perform the transition from one iteration to the next. The mask function also requires extra logic to handle this transition.
Additionally, probability bounds of compilation targets should only be updated if the current iteration is the last one, and propagation should only take place if the current iteration is not the last one.
4.3 Approximation with error guarantees
The compilation procedure can be extended to achieve an anytime absolute $\varepsilon$-approximation with error guarantees. The idea is to stop the probability computation as soon as the bounds of all compilation targets are sufficiently tight.
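Consider a fixed error $\varepsilon$ and events $\Phi_1, \ldots, \Phi_n$ with probabilities $p_1, \ldots, p_n$. An absolute $\varepsilon$-approximation for these events is defined as a tuple $(\hat{p}_1, \ldots, \hat{p}_n)$ such that $p_i - \varepsilon \leq \hat{p}_i \leq p_i + \varepsilon$ for all $1 \leq i \leq n$.
The compilation of the network yields probability bounds $[L_i, U_i]$ with $L_i \leq p_i \leq U_i$ for the targets $\Phi_i$. It can be easily seen that an absolute $\varepsilon$-approximation can be defined by any tuple $(\hat{p}_1, \ldots, \hat{p}_n)$ such that $U_i - \varepsilon \leq \hat{p}_i \leq L_i + \varepsilon$ for all $1 \leq i \leq n$. We thus need to run the algorithm until $U_i - L_i \leq 2\varepsilon$ for each target $\Phi_i$.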
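The stopping condition can be illustrated with a small check. Assuming each target has bounds [L, U] that enclose its true probability, an estimate that is at least U - ε and at most L + ε is within ε of the true value, and such an estimate exists exactly when U - L ≤ 2ε; the midpoint is one valid choice. The function names below are hypothetical.

```python
from fractions import Fraction


def bounds_tight_enough(bounds, eps):
    """True if every (lower, upper) pair admits an absolute eps-approximation."""
    return all(upper - lower <= 2 * eps for lower, upper in bounds)


def midpoint_estimates(bounds):
    """One valid choice of estimates once the bounds are tight enough."""
    return [(lower + upper) / 2 for lower, upper in bounds]


# Hypothetical bounds for two compilation targets.
bounds = [(Fraction(3, 10), Fraction(1, 2)), (Fraction(3, 5), Fraction(13, 20))]
ok_loose = bounds_tight_enough(bounds, Fraction(1, 10))   # widths 1/5 and 1/20
ok_strict = bounds_tight_enough(bounds, Fraction(1, 20))  # first width too large
estimates = midpoint_estimates(bounds)
```

With ε = 1/10 both intervals are narrow enough, so compilation could stop; with ε = 1/20 the first target still needs further exploration.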
There exist multiple strategies for investing this error budget for every target. We next discuss three such strategies. The lazy scheme follows the exact probability computation approach and stops as soon as the bounds become tight enough. Effectively, this results in investing the entire error budget into the rightmost branches of the decision tree. The eager scheme spends the entire error budget as soon as possible, i.e., on the leftmost branches of the decision tree, and then continues as in the case of exact computation. At each node in the decision tree, the hybrid scheme divides the current error budget equally over the two branches. Any residual, unused error budget is transferred to the next branch.
The blue lines in Algorithm 1 (network-dfs) show how the dfs procedure can be extended to support anytime absolute $\varepsilon$-approximation with error guarantees using the hybrid scheme. The dfs procedure is called using a non-zero error budget, and it assigns half of the budget to the newly created left branch of the decision tree. The recursive dfs call returns the residual error budget of each target, which is then added to the budget for the right branch.
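The hybrid scheme can be sketched for a single target as follows. This is an illustration under assumptions rather than the pseudocode of the paper: the budget is interpreted as the probability mass that may be left unexplored, it is initialised to 2ε so that the final gap between the bounds is at most 2ε, a branch is cut as soon as its mass fits into the budget it received, and the names `mask` and `hybrid_bounds` are hypothetical.

```python
from fractions import Fraction


def mask(event, assignment):
    """Three-valued evaluation: True, False, or None if not yet determined."""
    op = event[0]
    if op == "var":
        return assignment.get(event[1])
    if op == "not":
        m = mask(event[1], assignment)
        return None if m is None else (not m)
    results = [mask(child, assignment) for child in event[1:]]
    if op == "and":
        return False if False in results else (None if None in results else True)
    return True if True in results else (None if None in results else False)


def hybrid_bounds(target, prob, order, eps):
    """Bounds [lower, upper] on P(target) with upper - lower <= 2 * eps."""
    lower, upper = Fraction(0), Fraction(1)

    def dfs(assignment, mass, depth, budget):
        """Explore one branch; return the error budget left unused."""
        nonlocal lower, upper
        m = mask(target, assignment)
        if m is True:
            lower += mass
            return budget
        if m is False:
            upper -= mass
            return budget
        if mass <= budget:
            return budget - mass          # cut this branch, pay with budget
        x = order[depth]
        p = prob[x]
        # Half of the budget goes left; the residual is added to the right.
        residual = dfs({**assignment, x: True}, mass * p, depth + 1, budget / 2)
        return dfs({**assignment, x: False}, mass * (1 - p), depth + 1,
                   budget / 2 + residual)

    dfs({}, Fraction(1), 0, 2 * Fraction(eps))
    return lower, upper


v = {n: ("var", n) for n in ("x1", "x2", "x3", "x4")}
target = ("or", ("and", v["x1"], v["x2"]), ("and", v["x3"], v["x4"]))
prob = {n: Fraction(1, 2) for n in v}
order = ["x1", "x2", "x3", "x4"]
exact = hybrid_bounds(target, prob, order, Fraction(0))
approx = hybrid_bounds(target, prob, order, Fraction(1, 4))
```

With ε = 0 no branch is ever cut and the bounds coincide at 7/16. With ε = 1/4 two branches of mass 1/4 each are cut and the result is [5/16, 13/16], which still encloses the exact value.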
4.4 Distributed probability computation
By splitting the task of exploring the decision tree into a number of jobs, the compilation can be performed concurrently by multiple threads or machines. A worker explores a tree fragment of a given maximum size. For simplicity, we define the size $d$ of a job to be the depth of the sub-tree to explore. The computation then proceeds as follows. One worker explores the tree from the root and, every time it reaches depth $d$, it forks a new job that continues from that node as its root. Since the maximum depth of the tree is the number of variables, both the number of jobs created and the number of variable valuations that each job propagates into the event network are bounded in terms of the number of variables and the job size. Each job incurs the cost of communicating the mask at job creation and the probability bounds for each target at the end of the job. In case of approximation, the error budgets need to be synchronised both at the start and at the end of a job.
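The job-based exploration can be sketched with a thread pool standing in for the distributed workers. The sketch makes several simplifying assumptions: it handles a single target and exact computation only, jobs are forked once at depth `job_size` and then run to completion instead of forking further jobs, and the names `explore` and `distributed_bounds` are hypothetical. Each job receives its assignment (the mask) and returns the mass it resolved, mirroring the communication described above.

```python
from concurrent.futures import ThreadPoolExecutor
from fractions import Fraction


def mask(event, assignment):
    """Three-valued evaluation: True, False, or None if not yet determined."""
    op = event[0]
    if op == "var":
        return assignment.get(event[1])
    if op == "not":
        m = mask(event[1], assignment)
        return None if m is None else (not m)
    results = [mask(child, assignment) for child in event[1:]]
    if op == "and":
        return False if False in results else (None if None in results else True)
    return True if True in results else (None if None in results else False)


def explore(target, prob, order, assignment, mass, depth, max_depth, frontier):
    """DFS below `assignment`; returns (satisfying mass, falsifying mass).
    Unresolved nodes at `max_depth` are appended to `frontier` as new jobs."""
    m = mask(target, assignment)
    if m is True:
        return mass, Fraction(0)
    if m is False:
        return Fraction(0), mass
    if depth == max_depth:
        frontier.append((assignment, mass, depth))
        return Fraction(0), Fraction(0)
    x = order[depth]
    p = prob[x]
    sat_t, unsat_t = explore(target, prob, order, {**assignment, x: True},
                             mass * p, depth + 1, max_depth, frontier)
    sat_f, unsat_f = explore(target, prob, order, {**assignment, x: False},
                             mass * (1 - p), depth + 1, max_depth, frontier)
    return sat_t + sat_f, unsat_t + unsat_f


def distributed_bounds(target, prob, order, job_size, workers):
    """Root worker explores to depth `job_size`, then hands out the rest."""
    frontier = []
    sat, unsat = explore(target, prob, order, {}, Fraction(1), 0,
                         job_size, frontier)

    def run_job(job):
        assignment, mass, depth = job
        return explore(target, prob, order, assignment, mass, depth,
                       len(order), [])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for job_sat, job_unsat in pool.map(run_job, frontier):
            sat += job_sat
            unsat += job_unsat
    return sat, 1 - unsat


v = {n: ("var", n) for n in ("x1", "x2", "x3", "x4")}
target = ("or", ("and", v["x1"], v["x2"]), ("and", v["x3"], v["x4"]))
prob = {n: Fraction(1, 2) for n in v}
bounds = distributed_bounds(target, prob, ["x1", "x2", "x3", "x4"],
                            job_size=2, workers=4)
```

Threads in CPython do not speed up this CPU-bound work; they are used here only to show how the tree is partitioned into disjoint jobs whose results are summed.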
5 Experimental evaluation
This section describes an experimental evaluation of clustering probabilistic data using ENFrame. The focus of this evaluation is a preliminary benchmark of the performance of the probability computation algorithms introduced in Section 4. At the end of this section, we comment on further experimental considerations that could not be included in full due to space limitations.
Data. We use a data set describing network load and occurrences of partial discharge in energy distribution networks. This data is gathered from two different types of sensors: partial discharge sensors installed on switchgear and cables in substations of the distribution network, and network load sensors in substations. We aggregate the number of partial discharge occurrences over the duration of an hour and subsequently pair this value with the average network load during that hour. Clustering can assist in detecting anomalies and predicting failures in the energy networks.
Uncertainty. Our goal is to show that ENFrame can deal with common correlation patterns that occur in probabilistic data [3, 33, 34]. Each data point is associated with an event described by Boolean random variables, whose probabilities for true are chosen at random from a fixed range. Values outside this range would make the probabilities of clustering events too close to 0 or 1, which are then easily approximable. The experiments were carried out using three types of correlations to illustrate ENFrame’s capability to process arbitrarily correlated data.
The positive correlations scheme yields events such that two data points are either positively correlated or independent. Each event is a disjunction of distinct positive literals. In the mutex correlations scheme, the data points are partitioned into mutex sets of bounded cardinality: any two points are mutually exclusive within a mutex set and independent across the sets. The conditional correlations scheme expresses uncertainty as a Markov chain, using one node per data point. The event of a data point is a disjunction of two events, one for the case that the preceding data point in the chain exists and one for the case that it does not; we thus introduce two new Boolean random variables per data point. For every correlation scheme, a group size of 4 has been used, i.e., data points were divided into groups with identical lineage. This is realistic for uncertain time-series sensor data: readings from a small time window have identical correlations and uncertainty. Additionally, we show experiments with a varying fraction of certain data points.
Algorithms. We report on performance benchmarks for $k$-medoids clustering on the energy network data set, comparing ENFrame to naïve clustering. The naïve approach computes an equivalent clustering by explicitly iterating over all possible worlds. We show the performance of multiple probability computation algorithms of ENFrame: the sequential exact approach, three sequential approximation schemes (eager, lazy, hybrid), and distributed hybrid approximation (hybrid-d). All approximation algorithms are set to compute probabilities within a fixed bound on the (absolute) error; the compilation targets are the events that represent medoid selection.
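The naïve baseline can be sketched as follows: enumerate every possible world, run a deterministic $k$-medoids in it, and add the probability of the world to each point selected as a medoid. The sketch is an illustration under assumptions that are not taken from the paper: initial medoids are the first $k$ points present in the world, ties are broken by input order, the variables are independent, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import dist


def k_medoids(points, k, iterations):
    """Plain k-medoids on a list of (id, coords); returns the medoid ids."""
    medoids = points[:k]  # assumed initialisation: first k points
    for _ in range(iterations):
        clusters = {m[0]: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(m[1], p[1]))
            clusters[nearest[0]].append(p)
        # New medoid: the member minimising the distance sum to its cluster.
        medoids = [min(members,
                       key=lambda c: sum(dist(c[1], q[1]) for q in members))
                   for members in clusters.values() if members]
    return {m[0] for m in medoids}


def naive_medoid_probabilities(points, events, prob, k, iterations):
    """P(point is a medoid), by clustering in each possible world."""
    names = sorted(prob)
    result = {pid: Fraction(0) for pid, _ in points}
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        mass = Fraction(1)
        for name in names:
            mass *= prob[name] if world[name] else 1 - prob[name]
        present = [p for p in points if events[p[0]](world)]
        if not present:
            continue  # empty world: nothing to cluster
        for pid in k_medoids(present, k, iterations):
            result[pid] += mass
    return result


# Hypothetical data: two groups of two points with identical lineage.
points = [("a", (0, 0)), ("b", (1, 0)), ("c", (10, 0)), ("d", (11, 0))]
events = {"a": lambda w: w["x1"], "b": lambda w: w["x1"],
          "c": lambda w: w["x2"], "d": lambda w: w["x2"]}
prob = {"x1": Fraction(1, 2), "x2": Fraction(1, 2)}
medoid_prob = naive_medoid_probabilities(points, events, prob, k=2, iterations=3)
```

The loop over `product` visits a number of worlds that is exponential in the number of variables, which is the cost that the event-network compilation is designed to avoid.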
Algorithms described in the literature (see Section 6) simplify the clustering problem by ignoring correlations, using expected distances, and producing a deterministic output. They might outperform our sequential algorithms, at the cost of producing an output that can be arbitrarily off from the golden standard of clustering in each world. Unfortunately, none of the reported prototypes was available for testing at the time of writing.
Setup. The experiments were carried out on an Intel Xeon X5660 (2.80GHz) machine with 4GB of RAM, running Ubuntu with Linux kernel 3.5. The timings reported for hybrid-d were obtained by simulating distributed computation on a single machine. The algorithms are implemented in C++ (GCC 4.7.2). Each plot in Figures 6 and 7 depicts average performance with min/max ranges of five runs with randomly generated event expressions, different probabilities, and three clustering iterations (using Euclidean distance).
Sequential algorithms. Figures 6 and 7 show that all of ENFrame’s probability computation algorithms outperform the naïve algorithm by up to six orders of magnitude for each data set with more than 10 variables. Furthermore, the hybrid approximation can be up to four orders of magnitude faster than exact computation.
Indeed, for a very small number of possible worlds (i.e., a small number of variables), it pays off to cluster individually in each world and avoid the overhead of the event networks. For a larger number of worlds, our exact and approximate approaches quickly become up to six orders of magnitude faster. The naïve method times out for over 25 variables in every correlation scheme.
The reason why our approximation schemes outperform exact is as follows. For a given depth $d$, there are up to $2^d$ nodes in the decision tree that contribute to the probability mass of a node in the event network. The contributed mass decreases exponentially with an increase in depth, which suggests that most nodes in the decision tree only contribute a small fraction of the overall probability mass. Depending on the desired error bound, a shallow exploration of the decision tree could be enough to achieve a sufficiently large probability mass.
Among the approximation algorithms, hybrid performs best; it outperforms exact by up to four orders of magnitude since it only needs to traverse a shallow prefix of the decision tree. The algorithm invests the error budget over the entire width of the decision tree, cutting branches of the tree after a certain depth. The other two methods (eager and lazy) use the budget to cut the first and the last branches, respectively, while exploring the other branches in full depth.
For positive correlations, lazy performs very well, because the decision tree is very unbalanced under this scheme. The left branches of the tree correspond to variables being set to true, which quickly satisfy the (disjunctive) input events and allow for compilation targets to be reached. Further to the right, branches correspond to variables being set to false. More variables need to be set to satisfy or falsify the disjunctive input event, thus leading to longer branches. The lazy algorithm saves the error budget until the very last moment and can therefore prune the deep branches whilst maintaining the $\varepsilon$-approximation. The decision trees for the mutex and conditional correlation schemes are more balanced, so that both lazy and eager perform almost identically to exact. Hence, they are not shown in the figure labelled exp3-ds23.
Distributed algorithms. By distributing the probability computation task, we can significantly increase ENFrame’s performance. Figures 6 and 7 show the timings for hybrid-d using a fixed number of workers and a fixed job size (as detailed below). For all correlation schemes, hybrid-d gets increasingly faster than hybrid as we increase the number of variables. For small numbers of variables, the overhead of distribution does not pay off. The benefits are best seen for mutex correlations and over 100 objects (over 60 variables), where hybrid-d becomes more than one order of magnitude faster than hybrid. For readability reasons, the performance of hybrid-d is not depicted in the right-hand plot of the figure labelled exp3-ds1; its performance is up to one order of magnitude better than hybrid, as can be seen in the figure labelled exp3-ds1. For ten variables, there is only a small performance gain when compared to the single-threaded hybrid approximation: the decision tree remains small, as does the number of jobs that can be generated. However, for larger numbers of variables, hybrid-d yields a performance improvement of more than one order of magnitude when compared to hybrid.
Figure 9 shows the influence of the number of workers on hybrid-d’s performance for varying job sizes. A job is the work unit allocated to a worker at any one time; a size of $d$ means that the worker has to explore a fragment of the decision tree of depth at most $d$ and would need to traverse the event network a number of times that is at most exponential in $d$. For large job sizes, the overall number of jobs decreases; in the case of positive correlations, the number of jobs of size 9 is small since the decision tree is very unbalanced and only a few branches on the right-hand side of the tree grow deeper than nine variables. Therefore, increasing the number of workers would not help; indeed, there is no improvement for more than four workers for job sizes larger than 5. However, for a job size of 3, up to 16 workers can still be beneficial. In our experiment, smaller job sizes led to a performance gain of up to one order of magnitude, since they allow for a more equal distribution of the work over the available workers. Synchronisation did not play a significant role in our setup.
Certain data points. The figure labelled large-certain shows that the performance improves as the number of certain data points (i.e., objects that occur in all possible worlds) increases. The speedup in such cases is explained by the fact that the distance sums of possible medoids to data points in a cluster become less complex and can be initialised using the distances to objects that certainly exist. Consequently, fewer variable assignments are needed to decide on a cluster medoid, resulting in a shallower decision tree and a speedup in the compilation time.
Further findings. We have investigated the influence of the number of dimensions, data point coordinates, the error budget, the number of iterations, and alternative clustering compilation targets on the performance of ENFrame, as well as its total memory usage. As is the case with traditional $k$-medoids on certain data, the number of dimensions has no influence on the computation time. The reported performance gap between exact and hybrid shows that performance is highly sensitive to the error budget. The number of iterations has a linear effect on the running time of the algorithm. The number of targets (including targets representing co-occurrence queries) has a minor influence on performance; due to the combinatorial nature of $k$-medoids, clustering events are mostly satisfied in bulk and it is thus very rare that one event alone is satisfied at any one time. This also explains why experiments with other types of compilation targets (e.g., object-cluster assignment, pairwise object-cluster assignment) show very similar performance. In our experiments, the size of the event networks grows linearly in the number of objects and clusters and the memory usage of ENFrame is under 1GB.
Comments on clustering quality. The research effort described in this paper is mainly concerned with expressing a rich class of algorithms for data analysis in ENFrame, and with the scalability of the probability computation task; $k$-medoids clustering is merely a use case to show ENFrame’s flexibility and scalability. The adaptation of $k$-medoids to ENFrame has the exact same quality as the “golden standard”: $k$-medoids applied in each possible world, yet without explicitly iterating over all possible worlds. This is not the case for prior work that does not support correlated uncertain input and uncertain output. An extensive quality comparison is out of the scope of this paper, but is part of future work. A common approach to assessing the quality of a probabilistic method like ours is to assume a notion of ground truth. It is unclear how to deal with probabilistic data that represents inherently contradictory information for which no ground truth exists or is known.
6 Related Work
Our work is at the confluence of several active research areas: probabilistic data management, data analytics platforms, and provenance data management.
Probabilistic data mining and querying. Our work adds to a wealth of literature on this topic [1, 34] along two directions: distributed probability computation techniques and a unified formalisation of several clustering algorithms in line with work on probabilistic databases.
Distributed probability computation has been approached so far only in the context of the SimSQL/MCDB system, where approximate query results are computed by Monte Carlo simulations [24, 9]. This contrasts with our approach in that MCDB was not designed for exact and approximate computation with error guarantees and does not exploit at runtime symbolically-represented correlations allowed by pc-tables and by ENFrame.
Early approaches to mining uncertain data are based on imprecise (fuzzy) data, for example using intervals, and produce fuzzy (soft) and hard output. Follow-up work shifted to representation of uncertainty by (independent) probability density functions per data point. In contrast, we allow for arbitrarily correlated discrete probability distributions. The importance of correlations has been previously acknowledged for clustering and frequent pattern mining. A further key aspect of our approach that is not shared by existing uncertain data mining approaches is that we follow the possible worlds semantics throughout the whole mining process. This allows for exact and approximate computation with error guarantees and sound semantics of the mining process that is compatible with probabilistic databases. This cannot be achieved by existing work; for instance, most existing $k$-means clustering approaches for uncertain data define cluster centroids using expected distances between data points [11, 30, 17, 27, 19, 25] or the expected variance of all data points in the same cluster; they also compute hard clustering where the centroids are deterministic. The recently introduced UCPC approach to $k$-means clustering is the first work to acknowledge the importance of probabilistic cluster centroids. However, it assumes independence in the input and does not support correlations.
Data analytics platforms. Support for iterative programs is essential in many applications including data mining, web ranking, graph analysis, and model fitting. This has recently led to a surge in data-intensive computing platforms with built-in iteration capability. Mahout is a library that supports iterative programming on large clusters. HaLoop allows iterative applications to be assembled from existing MapReduce Hadoop programs. REX supports iterative distributed computations along database operations in which changes are propagated between iterations. MADlib is an open-source library for scalable in-database analytics. Similar in spirit, Bismarck is a unified architecture for in-database analytics. In the Iterative Map-Reduce-Update programming abstraction for machine learning, user programs are compiled into declarative, optimisable Datalog code. Platforms that facilitate uniform treatment of data-intensive tasks were also proposed outside the data management community, e.g., to support expressive languages for recursive problems that can be used to automatically synthesise programs targeting a massively parallel processor.
A key aspect that differentiates ENFrame from the above platforms is the probabilistic nature of input data and of the whole computation process. This calls for specifically tailored algorithms. So far, ENFrame lacks the scalability achievable by the above platforms, since it only distributes the probability computation task, while the actual mining task is performed on one machine. The next step is to consider a fully distributed computational approach.
Provenance in database and workflow systems. To enable probability computation, we trace fine-grained provenance of the user computation. This is in line with a wealth of work in probabilistic databases . Our event language is influenced by work on provenance semirings  and semimodules [4, 14] that capture provenance for positive queries with aggregates in relational databases. The construct resembles the algebraic structure of a semimodule that is a tensor product of the Boolean semiring freely generated by the variable set and of the SUM monoid over the real numbers . There are two differences between our construct and these structures. Firstly, we allow negation in events, which is not captured by the Boolean semiring. Secondly, even for positive events, is not a semimodule since it violates the following law: . Indeed, under an assignment that maps both and to , the left side of the equality evaluates to , whereas the right side becomes . Furthermore, our event language allows to define events via iterations, as needed to succinctly trace data mining computation.
Workflows employ a much wider variety of programming constructs than databases. Workflow provenance aims to capture a complete description of the evaluation of a workflow, though it sees tasks as black boxes and therefore considers all outputs of a task to depend on all of its inputs. This provenance model is too coarse to support exact derivations of output as needed in our case for probability computation.
A distinct line of work is on relational provenance systems such as Perm, DBNotes, and Orchestra, which trace provenance using query rewriting or modified query operators. Panda enables provenance-aware querying.
Acknowledgements. This research was supported by EPSRC grant agreement ADEPT (EP/I000194/1).
[1] C. Aggarwal. Managing and Mining Uncertain Data. Kluwer, 2009.
[2] C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, chapter A Survey of Uncertain Data Clustering Algorithms. Chapman and Hall, 2013.
[3] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
[4] Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, 2011.
[5] Apache Software Foundation. The Mahout machine learning library. http://mahout.apache.org. v0.7.
[6] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. VLDB Journal, 2005.
[7] V. R. Borkar, Y. Bu, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Declarative systems for large-scale machine learning. Data Eng. Bull., 35(2), 2012.
[8] Y. Bu, B. Howe, M. Balazinska, and M. Ernst. The HaLoop approach to large-scale iterative data analysis. VLDB Journal, 2012.
[9] Z. Cai, Z. Vagena, L. Perez, S. Arumugam, P. J. Haas, and C. Jermaine. Simulation of database-valued Markov chains using SimSQL. In SIGMOD, 2013.
[10] L. Cartey, R. Lyngsø, and O. de Moor. Synthesising graphics card programs from DSLs. In PLDI, 2012.
[11] M. Chau, R. Cheng, B. Kao, and J. Ng. Uncertain data mining: An example in clustering location data. In PAKDD, 2006.
[12] S. Davidson, S. Cohen-Boulakia, A. Eyal, B. Ludäscher, T. McPhillips, S. Bowers, and J. Freire. Provenance in scientific workflow systems. Data Eng. Bull., 32(4), 2007.
[13] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
[14] R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5), 2012.
[15] B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, 2009.
[16] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
[17] F. Gullo, G. Ponti, and A. Tagarelli. Clustering uncertain data via k-medoids. In SUM, 2008.
[18] F. Gullo, G. Ponti, and A. Tagarelli. Minimizing the variance of cluster mixture models for clustering uncertain objects. In ICDM, 2010.
[19] F. Gullo, G. Ponti, A. Tagarelli, and S. Greco. A hierarchical algorithm for clustering uncertain data via an information-theoretic approach. In ICDM, 2008.
[20] F. Gullo and A. Tagarelli. Uncertain centroid based partitional clustering of uncertain data. PVLDB, 2012.
[21] J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12), 2012.
[22] R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Eng. Bull., 33(3), 2010.
[23] Z. Ives, T. Green, G. Karvounarakis, N. Taylor, V. Tannen, P. P. Talukdar, M. Jacob, and F. Pereira. The Orchestra collaborative data sharing system. SIGMOD Rec., 2008.
[24] R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, and P. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. ACM TODS, 36(3), 2011.
[25] B. Kao, S. Lee, F. Lee, D. Cheung, and W. Ho. Clustering uncertain data using Voronoi diagrams and R-Tree index. TKDE, 2010.
[26] C. Koch and D. Olteanu. Conditioning probabilistic databases. In VLDB, 2008.
[27] H. Kriegel and M. Pfeifle. Density-based clustering of uncertain data. In SIGKDD, 2005.
[28] M. Michel and C. Eastham. Improving the management of MV underground cable circuits using automated on-line cable partial discharge mapping. In CIRED, 2011.
[29] S. Mihaylov, Z. Ives, and S. Guha. REX: Recursive, delta-based data-centric computation. PVLDB, 5(11), 2012.
[30] W. Ngai, B. Kao, C. Chui, R. Cheng, M. Chau, and K. Yip. Efficient clustering of uncertain data. In ICDM, 2006.
[31] S. Omurca and N. Duru. Decreasing iteration number of k-medoids algorithm with IFART. In ELECO, 2011.
[32] J. Provan and M. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4), 1983.
[33] P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
[34] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.
[35] L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In KDD, 2010.
[36] S. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, 2000.
[37] P. B. Volk, F. Rosenthal, M. Hahmann, D. Habich, and W. Lehner. Clustering uncertain data with possible worlds. In ICDE, 2009.