# ENFrame: A Platform for Processing Probabilistic Data

## Abstract

This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame.

The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov Clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees.

Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.

## 1 Introduction

Recent years have witnessed a solid body of work in probabilistic databases with sustained systems building effort and extensive analysis of computational problems for rich classes of queries and probabilistic data models of varying expressivity [34]. In contrast, most state-of-the-art probabilistic data mining approaches so far consider the restricted model of probabilistically independent input and produce hard, deterministic output [1]. This technology gap hinders the development of data processing systems that integrate techniques for both probabilistic databases and data mining.

The ENFrame data processing platform aims at closing this gap by allowing users to specify iterative programs to query and mine probabilistic data. The semantics of ENFrame programs is based on a unified probabilistic interpretation of the entire processing pipeline from the input data to the program result. It features an expressive set of programming constructs, such as assignments, bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines, coupled with aspects of probabilistic databases, such as possible worlds semantics, arbitrary data correlations, and exact and approximate probability computation with error guarantees. The existing probabilistic data mining algorithms do not share these latter aspects.

Under the possible worlds semantics, the input is a probability distribution over a finite set of possible worlds, whereby each world defines a standard database or a set of input data points. The result of a user program is equivalent to executing it within each world and is thus a probability distribution over possible outcomes (e.g., partitionings). ENFrame exploits the fact that many of the possible worlds are alike, and avoids iterating over the exponentially many worlds.

Correlations occur naturally in query results [34], after conditioning probabilistic databases using constraints [26], and are supported by virtually all mainstream probabilistic models. If correlations are ignored, the output can be arbitrarily off from the expected result [37, 2]. For instance, consider two similar, but contradicting sensor readings (mutually exclusive data points) in a clustering setting. There is no possible world and thus no cluster containing both points, yet by ignoring their negative correlation, we would assign them to the same cluster.

The user is oblivious to the probabilistic nature of the input data, and can write programs as if the input data were deterministic. It is the task of ENFrame to interpret the program probabilistically. The approach taken here is to trace the user computation using fine-grained provenance, which we call events. The event language is a non-trivial extension of provenance semirings [16] and semimodules [4] that are used to trace the evaluation of positive relational algebra queries with aggregates and to compute probabilities of query results [14]. It features events with negation, aggregates, loops, and definitions. It is expressive enough to succinctly encode arbitrary correlations occurring in the input data (e.g., modelled on Bayesian networks and pc-tables) and in the result of the user program (e.g., co-occurrence of data points in the same cluster), and to trace the program state at any time. By annotating each computation in the program with events, we effectively translate it into an event program: variables become random variables whose possible outcomes are conditioned on events. Selected events represent the probabilistic program output, e.g., in the case of clustering: the probability that a data point is a medoid, or the probability that two data points are assigned to the same cluster. Besides probability computation, events can be used for sensitivity analysis and explanation of the program result.

The most expensive task supported by ENFrame is probability computation for event programs, which is #P-hard in general. We developed sequential and distributed algorithms for both exact and approximate probability computation with error guarantees. The algorithms operate on graph representations of the event programs called event networks. Expressions common to several events are only represented once in such graphs. Event networks for data mining tasks are very repetitive and highly interconnected due to the combinatorial nature of the algorithms: the events at each iteration are expressions over the events at the previous iteration and have the same structure at each iteration. Moreover, the event networks can be cyclic, so as to account for program loops. While it is possible to unfold bounded-range loops, this can lead to prohibitively large event networks.

The key challenge faced by ENFrame is to compute the probabilities of a large number of interconnected events that are defined over several iterations. Rather than computing the probability of each event separately, ENFrame’s algorithms employ a novel bulk-compilation technique, using Shannon expansion to depth-first explore the decision tree induced by the input random variables and the events in the program. The approximation algorithms use an error budget to prune large tree fragments of this decision tree that only account for a small probability mass. We introduce three approximation approaches (eager, lazy, and hybrid), each with a different strategy for spending the error budget. The distributed versions of these algorithms divide the exploration space into fragments to be explored concurrently by distinct workers.

While the computation time can grow exponentially in the number of input random variables in the worst case, the structure of correlations can reduce it dramatically. As shown experimentally, ENFrame’s algorithm for exact probability computation is orders of magnitude faster than executing the user program in each possible world.

To sum up, the contributions of this paper are as follows:

• We propose the ENFrame platform for processing probabilistic data. ENFrame can evaluate user programs on probabilistic data with arbitrary correlations following the possible worlds semantics.

• User programs are written in a fragment of Python that supports bounded-range loops, list comprehension, aggregates, and calls to external database engines. We illustrate ENFrame’s features by giving programs for three clustering algorithms (k-means, k-medoids, and Markov clustering) and provide a formal specification of ENFrame’s user language, which can be used to write arbitrary programs for the platform.

• User programs are annotated by ENFrame with events that are expressive enough to capture the correlations of the input, trace the program computation, and allow for probability computation.

• ENFrame uses novel sequential and distributed algorithms for exact and approximate probability computation of event programs.

• We implemented ENFrame’s probability computation algorithms in C++.

• We report on experiments with k-medoids clustering of readings from partial discharge sensors in energy networks [28]. We show orders-of-magnitude performance improvements of ENFrame’s exact algorithm over the naïve approach of clustering in each possible world, of approximate over exact clustering, and of distributed over sequential algorithms.

The paper is organised as follows. Section 2 introduces the Python fragment supported by ENFrame along with encodings of clustering algorithms. Section 3 introduces our event language and shows how user programs are annotated with events. Our probability computation algorithms are introduced in Section 4 and experimentally evaluated in the experiments section. Section 5 overviews recent related work.

## 2 ENFrame’s User Language

This section introduces the user language supported by ENFrame. Its design is grounded in three main desiderata:

1. It should naturally express common mining algorithms, and allow users to issue queries and manipulate their results.

2. User programs must be oblivious to the deterministic or probabilistic nature of the input data and to the probabilistic formalism considered.

3. It should be simple enough to allow for an intuitive and straightforward probabilistic interpretation.

We settled on a subset of Python that can express, among others, k-means, k-medoids, and Markov Clustering. In line with query languages for probabilistic databases, where a Boolean query is a map from databases to truth values for deterministic databases and a Boolean random variable for probabilistic databases, every user program has a sound semantics for both deterministic and probabilistic input data: in the former case, the result of a clustering algorithm is a deterministic clustering; in the latter case, it is a probability distribution over possible clusterings.

The user language comprises the following constructs:

Variables and arrays. Variables can be of scalar types (real, integer, or Boolean) or arrays. Examples of variable assignments: V = 2, W = V, M[2] = True, or M[i] = W. Arrays must be initialised, e.g., for array M of cardinality k: M = [None] * k. Additionally, the expression range(0, n) specifies the array [0,…,n-1].

Functions. Scalar variables can be multiplied, exponentiated (pow(B, r) for $B^r$), and inverted (invert(B) for $1/B$). The function dist(A,B) is a distance measure on the feature space between the vectors specified by arrays A,B of reals; scalar_mult is component-wise multiplication of an array with a scalar.

Reduce. A one-dimensional array M of some scalar type can be reduced to a scalar value by applying one of the reduce functions, such as reduce_and, reduce_or, reduce_sum, or reduce_count. For instance, for an array B of Booleans, the expression reduce_and(B) computes the Boolean conjunction of the truth values in B, and the expression reduce_count(B) computes the number of elements in B. For a two-dimensional array of reals or integers, i.e., an array of vectors, reduce_sum computes the component-wise sum of the vectors.

List comprehension. Inside a reduce-function, anonymous arrays may be defined using list comprehension. For example, given an array B of Booleans of size n, the expression reduce_sum([1 for i in range(0,n) if B[i]]) counts the number of True values in B.

Loops. We only allow bounded-range loops; for any fixed integer n and counter variable i, for-loops can be defined by: for i in range(0,n). This allows us to know the size of each constructed array at compile time.

Input data. The special abstract primitive loadData() is used to specify input data for algorithms. This function can be implemented to statically specify the objects to be clustered, to load them from disk, or to issue queries to a database. ENFrame supports positive relational algebra queries with aggregates via the SPROUT query engine for probabilistic data [14]. The abstract methods loadParams() and init() are used to set algorithm parameters such as the number of iterations and clusters of a clustering algorithm.
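The constructs above can be tried out on deterministic input with a few lines of plain Python. The following is a minimal sketch, not ENFrame's implementation: the bodies of the reduce functions are assumptions made for illustration, and ENFrame interprets such calls probabilistically rather than executing them directly.

```python
# Hypothetical deterministic stand-ins for the reduce functions of the user language.

def reduce_and(xs):
    """Boolean conjunction of all elements."""
    return all(xs)

def reduce_or(xs):
    """Boolean disjunction of all elements."""
    return any(xs)

def reduce_sum(xs):
    """Sum of scalars, or component-wise sum of an array of vectors."""
    xs = list(xs)
    if xs and isinstance(xs[0], (list, tuple)):
        return [sum(column) for column in zip(*xs)]
    return sum(xs)

def reduce_count(xs):
    """Number of elements in the array."""
    return len(list(xs))

n = 4                                   # bounded range, known up front
B = [None] * n                          # arrays must be initialised
for i in range(0, n):                   # bounded-range loop
    B[i] = (i != 1)                     # B == [True, False, True, True]

# List comprehension inside a reduce call: count the True values in B.
num_true = reduce_sum([1 for i in range(0, n) if B[i]])
vector_sum = reduce_sum([[1.0, 2.0], [3.0, 4.0]])
```

Because every range bound is a constant, the size of each anonymous array is fixed before the program runs, which is the property the bounded-range restriction is meant to guarantee.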

### 2.1 Clustering Algorithms in ENFrame

We illustrate ENFrame’s user language with three example data mining algorithms: k-means, k-medoids, and Markov Clustering. Figures 1, 2, and 3 list user programs for these algorithms; we next discuss each of them.

k-means clustering. The k-means algorithm partitions a set of data points into groups of similar data points. We initially choose a centroid for each cluster, i.e., a data point representing the cluster centre (initialisation phase). In successive iterations, each data point is assigned to the cluster with the closest centroid (assignment phase), after which the centroid is recomputed for each cluster (update phase). The algorithm terminates after a given number of iterations or after reaching convergence. Note that our user language does not support fixpoint computation, and hence cannot check for convergence.

Figure 2 implements k-means. The set O of input objects is retrieved using a loadData call. Each object is represented by a feature vector (i.e., array) of reals. We then load the parameters k, the number iter of iterations, and initialise cluster centroids M (line 3). The initialisation phase has a significant influence on the clustering outcome and convergence. We assume that initial centroids have been chosen, for example by using a heuristic [31]. Subsequently, an array InCl of Booleans is computed such that InCl[i][l] is True if and only if M[i] is the closest centroid to object O[l] (lines 5–10); every object is then assigned to its closest cluster. Since two clusters may be equidistant to an object, ties are broken using the breakTies2 call (line 11); it fixes an order of the clusters and enforces that each object is only assigned to the first of its potentially multiple closest clusters. Next, the new cluster centroids M[i] are computed as the centroids of each cluster (lines 12–16). The assignment and update phases are repeated iter times (line 4).
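The phases described above can be sketched in plain Python as follows. This is a deterministic illustration under stated assumptions, not the program of Figure 2: the line numbers quoted in the text refer to the figure and not to this sketch, the helper bodies (`dist` as Euclidean distance, `scalar_mult`, `break_ties2`) are hypothetical, the input is given statically instead of through loadData, and keeping the old centroid for an empty cluster is a choice made here.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def scalar_mult(s, v):
    """Component-wise multiplication of vector v with scalar s."""
    return [s * x for x in v]

def break_ties2(M):
    """For each object l, keep only the first cluster i with M[i][l] == True."""
    k, n = len(M), len(M[0])
    out = [[False] * n for _ in range(k)]
    for l in range(n):
        for i in range(k):
            if M[i][l]:
                out[i][l] = True
                break
    return out

def kmeans(O, M, iters):
    n, k = len(O), len(M)
    M = [list(c) for c in M]
    InCl = [[False] * n for _ in range(k)]
    for _ in range(0, iters):
        # Assignment phase: InCl[i][l] holds iff M[i] is a closest centroid to O[l].
        InCl = [[all(dist(O[l], M[i]) <= dist(O[l], M[j]) for j in range(0, k))
                 for l in range(0, n)] for i in range(0, k)]
        InCl = break_ties2(InCl)
        # Update phase: each centroid becomes the mean of its cluster members.
        for i in range(0, k):
            members = [O[l] for l in range(0, n) if InCl[i][l]]
            if members:  # assumption: an empty cluster keeps its old centroid
                total = [sum(column) for column in zip(*members)]
                M[i] = scalar_mult(1.0 / len(members), total)
    return M, InCl

O = [[0.0], [2.0], [5.0], [9.0]]             # four one-dimensional objects
M, InCl = kmeans(O, [[0.0], [9.0]], 3)       # k = 2, three iterations
```

On this input the first two objects end up in the first cluster and the last two in the second, with centroids 1.0 and 7.0.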

k-medoids clustering. The k-medoids algorithm is almost identical to k-means, but elects cluster medoids rather than centroids: these are cluster members that minimise the sum of distances to all other objects in the cluster. The assignment phase is the same as for k-means (lines 5–11), while the update phase is more involved: we first compute an array DistSum of sums of distances between each cluster medoid and all other objects in its cluster (lines 12–17), then find one object in each cluster that minimises this sum (lines 18–24), and finally elect these objects as the new cluster medoids M (lines 25–27). The last step uses reduce_sum to select exactly one of the objects in a cluster as the new medoid, since for each fixed i only one value in Centre[i][l] is True due to the tie-breaker in line 24.

Markov clustering (MCL). MCL is a fast and scalable unsupervised clustering algorithm for graphs based on simulation of stochastic flow in graphs [36]. Natural clusters in a graph are characterised by the presence of many edges within a cluster and few edges across clusters. MCL simulates random walks within a graph by alternating two operations: expansion and inflation. Expansion corresponds to computing random walks of higher length. It associates new probabilities with all pairs of nodes, where one node is the point of departure and the other is the destination. Since higher length paths are more common within clusters than between different clusters, the probabilities associated with node pairs lying in the same cluster will, in general, be relatively large as there are many ways of going from one to the other. Inflation has the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without a priori knowledge of cluster structure; it is the result of cluster structure being present.

Figure 3 gives the MCL user program. Expansion coincides with taking the power of a stochastic matrix M using the normal matrix product (i.e. matrix squaring). Inflation corresponds to taking the Hadamard power of a matrix (taking powers entry-wise). It is followed by a scaling step to maintain the stochastic property, i.e. the matrix elements correspond to probabilities that sum up to 1 in each column.
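Expansion and inflation can be sketched on a deterministic, column-stochastic matrix as follows. This is an illustration under assumptions rather than the program of Figure 3: the function names, the inflation exponent `r`, and the fixed iteration count are hypothetical, and matrices are plain nested lists.

```python
def expand(M):
    """Expansion: square the matrix using the normal matrix product."""
    n = len(M)
    return [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(M, r):
    """Inflation: entry-wise (Hadamard) power, then rescale each column to sum to 1."""
    n = len(M)
    powered = [[M[i][j] ** r for j in range(n)] for i in range(n)]
    col_sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

def mcl(M, r, iters):
    """Alternate expansion and inflation for a fixed number of iterations."""
    for _ in range(iters):
        M = inflate(expand(M), r)
    return M

# Column-stochastic input: entry M[i][j] is the probability of moving from j to i.
M0 = [[0.5, 0.5, 0.0],
      [0.5, 0.25, 0.5],
      [0.0, 0.25, 0.5]]
result = mcl(M0, 2, 5)
inflated = inflate([[0.5, 0.2], [0.5, 0.8]], 2)
```

In the two-by-two example, inflation leaves the uniform first column unchanged and pushes the larger entry of the second column from 0.8 towards 1, which is the boosting effect described above.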

Section 3 discusses the probabilistic interpretation of the computation of the above three clustering algorithms.

### 2.2 Syntax of the User Language

Figure 4 specifies the formal grammar for the language of user programs. A program is a sequence of declarations (DECL) and loop blocks (LOOP), each of which may again contain declarations and nested loops. The language allows expressions (EXPR) to be assigned to variable identifiers (ID). An expression may be a Boolean, integer, or float constant (LIT), an identifier, an array declaration, the result of a Boolean comparison between expressions, or the result of such operations as sum, product, inversion, or exponentiation. The result of a reduce operation on an anonymous array created through list comprehension (LCOMPR), and the result of breaking ties in a Boolean array also give rise to expressions; we elaborate on these two constructions below.

In addition to the syntactic structure as defined by the grammar, programs have to satisfy the following constraints:

Bounded-range loops. The parameters to the range construct must be integer constants (or immutable integer-valued variables). This restriction ensures that for-loops (LOOP) and list comprehensions (LCOMPR) are of bounded size that is known at compile time.

Anonymous arrays via list comprehension. List comprehension may only be used to construct one-dimensional arrays of base types, i.e., arrays of integers, floats, or Booleans.

Breaking ties. Clustering algorithms require explicit handling of ties: for instance, if an object is equidistant to two distinct cluster centroids in k-means, the algorithm has to decide which cluster the object will be assigned to. In ENFrame programs, the membership of objects to clusters can be encoded by a Boolean array like InCl such that InCl[i][l] is True if and only if object l is in cluster i. In this context, a tie is a configuration of InCl in which for a fixed object l, InCl[i][l] is True for more than one cluster i. We explicitly break such ties using the function breakTies2(M). For each fixed value l of the second dimension (hence the 2 in the function name) of the two-dimensional array M, it iterates over the first dimension of M and sets all but the first True value of M[i][l] to False. Symmetrically, the function breakTies1(M) fixes the first dimension and breaks ties in the second dimension of M, and breakTies(M) breaks ties in a one-dimensional array.
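The behaviour of the three tie-breaking functions can be sketched as follows. The function names come from the text; the bodies are an assumed deterministic reading, and returning a new array rather than modifying the argument in place is a choice made for this sketch.

```python
def breakTies(B):
    """Keep only the first True value of a one-dimensional Boolean array."""
    seen = False
    out = []
    for b in B:
        out.append(b and not seen)
        seen = seen or b
    return out

def breakTies1(M):
    """Fix the first index i and break ties along the second dimension."""
    return [breakTies(row) for row in M]

def breakTies2(M):
    """Fix the second index l and break ties along the first dimension."""
    k, n = len(M), len(M[0])
    out = [[False] * n for _ in range(k)]
    for l in range(n):
        column = breakTies([M[i][l] for i in range(k)])
        for i in range(k):
            out[i][l] = column[i]
    return out

# Object 0 is tied between clusters 0 and 1; objects 1 and 2 are not tied.
InCl = [[True, False, True],
        [True, True, False]]
untied = breakTies2(InCl)
```

After the call, every object belongs to exactly one cluster, here the first cluster in index order among its closest ones.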

## 3 Tracing Computation by Events

The central concept for representing user programs in ENFrame is that of events. Each event is a concise syntactic encoding of a random variable and its probability distribution. This section describes the syntax and semantics of events and event programs, and finally explains how ENFrame programs written in the user language from Section 2 can be translated to event programs.

The key features of events and event programs are:

• Events can encode arbitrarily correlated, discrete probability distributions over input objects. In particular, they can succinctly encode instances of such formalisms as Bayesian networks and pc-tables. The input objects and their correlations can be explicitly provided, or imported via a positive relational algebra query with aggregates over pc-tables [14].

• By allowing non-Boolean events, our encoding is exponentially more succinct than an equivalent purely Boolean description.

• Each event has a well-defined probabilistic semantics that allows it to be interpreted as a random variable.

• The iterative nature of many clustering algorithms carries over to event programs, in which events can be defined by means of nested loops. This construction together with the ability to reuse existing, named events in the definition of new, more complex events leads to a concise encoding of a large number of distinct events.

###### Example 1.

Clustering in possible worlds. We start by presenting an instructive example of k-medoids clustering under possible worlds semantics. Consider four objects in the feature space, as shown below. They can be clustered into two clusters using k-medoids, with the second and fourth objects as medoids.

[Figure: four objects on a line at positions 0, 2, 5, and 9. The first three objects form one cluster and the fourth object forms the other; the second and fourth objects are highlighted as medoids.]

In order to go from deterministic to uncertain objects, we associate each object with a Boolean propositional formula, its event, over a set $X$ of independent Boolean random variables. The possible valuations $\nu$ of $X$ define the possible worlds of the input objects: for each valuation $\nu$ there exists one world that contains exactly those objects whose event is true under $\nu$. The probability of a world is the product, over the variables in $X$, of the probability that each variable takes the truth value assigned to it by $\nu$.

Let us assume that the objects have the following events:

.

Distinct worlds can have different clustering results, as exemplified next. In a world that consists of the first, third, and fourth objects, k-medoids clustering yields:

[Figure: objects on a line at positions 0, 5, and 9; the object at position 2 is absent. The first object forms one cluster and the objects at 5 and 9 form the other; the objects at 0 and 5 are highlighted as medoids.]

Similarly, every world that consists of the first three objects yields:

[Figure: objects on a line at positions 0, 2, and 5; the object at position 9 is absent. The objects at 0 and 2 form one cluster and the object at 5 forms the other; the objects at 0 and 5 are highlighted as medoids.]

The probability of a query such as “Are two given objects in the same cluster?” is the sum of the probabilities of the worlds in which the two objects are in the same cluster.
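The naïve evaluation strategy that this example suggests, namely clustering in every possible world and summing up world probabilities, can be sketched as follows. The object positions 0, 2, 5, and 9 follow the layout of this example; the object names, the events, the variable probabilities, and the k-medoids initialisation (the first k objects present in the world) are hypothetical choices made for illustration and are not the ones used in the paper.

```python
from itertools import product

POS = {"o1": 0.0, "o2": 2.0, "o3": 5.0, "o4": 9.0}     # positions on a line
PROB = {"x1": 0.5, "x2": 0.8, "x3": 0.6}               # hypothetical P[x = true]
EVENT = {                                               # hypothetical events
    "o1": lambda v: v["x1"] or v["x2"],
    "o2": lambda v: v["x2"],
    "o3": lambda v: v["x3"],
    "o4": lambda v: not v["x2"],                        # mutually exclusive with o2
}

def k_medoids(names, k=2, iters=3):
    """Deterministic k-medoids on the objects of one world; ties go to the first."""
    medoids = names[:k]
    clusters = {}
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for o in names:
            nearest = min(medoids, key=lambda m: abs(POS[o] - POS[m]))
            clusters[nearest].append(o)
        medoids = [min(c, key=lambda a: sum(abs(POS[a] - POS[b]) for b in c))
                   for c in clusters.values()]
    return list(clusters.values())

def same_cluster_probability(a, b):
    """Sum the probabilities of all worlds in which a and b share a cluster."""
    variables = sorted(PROB)
    total = 0.0
    for bits in product([True, False], repeat=len(variables)):
        v = dict(zip(variables, bits))
        p = 1.0
        for x in variables:
            p *= PROB[x] if v[x] else 1.0 - PROB[x]
        world = [o for o in sorted(POS) if EVENT[o](v)]
        if any(a in c and b in c for c in k_medoids(world)):
            total += p
    return total
```

With these hypothetical events, `same_cluster_probability("o2", "o4")` is 0 because no world contains both objects. The loop visits all valuations of the variables, which is the exponential cost that ENFrame is designed to avoid.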

Events do not only encode the correlations and probabilities of input objects, but can symbolically encode the entire clustering process. We illustrate this in the next example.

###### Example 2.

Symbolic encoding of k-means by events. We again assume four input objects with their respective events. This example introduces conditional values (c-values), which are expressions of the form $\Phi \otimes v$, where $\Phi$ is a Boolean formula and $v$ is a vector from the feature space. Intuitively, this c-value takes the value $v$ whenever $\Phi$ evaluates to true, and a special undefined value when $\Phi$ is false. C-values can be added and multiplied; for example, the expression $\Phi \otimes v + \Psi \otimes w$ evaluates to $v + w$ if $\Phi$ and $\Psi$ are true, or to $v$ if $\Phi$ is true and $\Psi$ is false, etc.

Equipped with c-values, an initialisation of k-means with $k = 2$ can for instance be written in terms of two expressions, one per centroid: the first centroid is set to one object if a given event is true and to another object if that event is false; the second centroid is always set to the geometric centre of two fixed objects.

In the assignment phase, each object is assigned to its nearest cluster centroid. The condition InCl for an object being closest to a given centroid can be written as a Boolean event that compares two distance c-values; it encodes that the distance from the object to this centroid is smaller than its distance to the other centroid.

Given the Boolean events InCl, we can represent the centroid of each cluster for the next iteration by a single c-value expression over the objects and their InCl events, which specifies a random variable over possible cluster centroids conditioned on the assignments of objects to clusters as encoded by InCl. This expression is exponentially more succinct than an equivalent purely Boolean encoding of cluster centroids, since the latter would require one Boolean expression for each subset of the four input objects.

The event programs corresponding to the three user programs for k-means, k-medoids, and MCL are given on the right side of Figures 1–3. In addition to the constructs introduced in Example 2, they use event declarations that assign identifiers to event expressions, and $\forall$-loops that specify sets of events parametrised by the loop variable. The remainder of this section specifies the formal syntax and semantics of event programs, and gives a translation from user to event programs.

### 3.1 Syntax of Event Expressions

The grammar for event expressions is as follows:

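$$
\begin{array}{lcl}
\text{VAL} & ::= & \text{A scalar or feature vector} \\
\text{INT} & ::= & \text{Any integer} \\
\text{CVAL} & ::= & \text{EVENT} \otimes \text{VAL} \;\mid\; \text{CVAL}^{-1} \;\mid\; \text{CVAL} + \text{CVAL} \;\mid\; \text{CVAL}^{\text{INT}} \\
 & \mid & \text{CVAL} \cdot \text{CVAL} \;\mid\; \text{dist}(\text{CVAL}, \text{CVAL}) \;\mid\; \text{EVENT} \wedge \text{CVAL} \\
\text{COMP} & ::= & \le \;\mid\; \ge \;\mid\; = \;\mid\; < \;\mid\; > \\
\text{ATOM} & ::= & [\text{CVAL}\ \text{COMP}\ \text{CVAL}] \\
\text{EID} & ::= & \text{Elements of a set of event identifiers} \\
\text{EVENT} & ::= & \text{Propositional formula over } X,\ \text{EID},\ \text{ATOM}
\end{array}
$$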

The main constructs are:

Conditional values. Reals and feature vectors are denoted by VAL. Together with a propositional formula, they give rise to a conditional value (CVAL), c-value for short.

Functions of conditional values. Very much like scalars and feature vectors, c-values can be added, multiplied, and exponentiated. Additionally, the distance between two c-values yields another c-value. In addition to the binary operations specified in the grammar (e.g., CVAL+CVAL), we allow $\sum$- and $\prod$-expressions (see Figure 2).

Event expressions. Event expressions (EVENT) are propositional formulas over the constants true and false, a set $X$ of Boolean random variables, event identifiers, and propositions defined by ATOM: [CVAL COMP CVAL] represents the truth value obtained by comparing two c-values.

### 3.2 Semantics of Event Expressions

The semantics of event expressions is defined by extending a Boolean valuation $\nu$ of the random variables to a valuation of c-values and event expressions. We define in the sequel how $\nu$ acts on each of the expression types in the grammar. The base cases of this mapping are the standard algebraic operations on scalars and the feature space, extended by special undefined elements as follows.

We extend the reals (and their operations , , ) by a special element (for undefined) such that . Operators propagate as and for any real . For any other reals , and are as usual. For example, .

Similarly, we extend the feature space by an element . For any real and feature vector, and are propagated as , , , and .

The grammar for event programs does not distinguish between scalars and feature vectors for the sake of notational clarity. The following description implicitly assumes that the expressions are well-typed; e.g., the expression is only defined for vector-valued variable symbols .

CVAL. Conditional values of the form EVENT $\otimes$ VAL have an if-then-else semantics: if EVENT evaluates to true, then EVENT $\otimes$ VAL evaluates to VAL, else it evaluates to $u$ (or $\mathbf{u}$ for vector-valued c-values); the recursively constructed CVAL expressions have the natural recursive semantics that ultimately defaults to the corresponding operations on scalars and feature vectors.

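$$
\begin{aligned}
\nu(\text{EVENT} \otimes \text{VAL}) &= \begin{cases} \text{VAL} & \text{if } \nu(\text{EVENT}) = \text{true} \\ u \ (\mathbf{u}\text{, resp.}) & \text{otherwise} \end{cases} \\
\nu(\text{CVAL}_1 + \text{CVAL}_2) &= \nu(\text{CVAL}_1) + \nu(\text{CVAL}_2) \\
\nu(\text{CVAL}_1 \cdot \text{CVAL}_2) &= \nu(\text{CVAL}_1) \cdot \nu(\text{CVAL}_2) \\
\nu(\text{CVAL}^{-1}) &= \nu(\text{CVAL})^{-1} \\
\nu(\text{dist}(\text{CVAL}_1, \text{CVAL}_2)) &= \text{dist}(\nu(\text{CVAL}_1), \nu(\text{CVAL}_2)) \\
\nu(\text{CVAL}^{\text{INT}}) &= \nu(\text{CVAL})^{\text{INT}} \\
\nu(\text{EVENT} \wedge \text{CVAL}) &= \begin{cases} \nu(\text{CVAL}) & \text{if } \nu(\text{EVENT}) = \text{true} \\ u \ (\mathbf{u}\text{, resp.}) & \text{otherwise} \end{cases}
\end{aligned}
$$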

ATOM, EVENT. Comparisons $[\text{CVAL}_1\ \text{COMP}\ \text{CVAL}_2]$ between two c-values evaluate to false if they are both defined (neither $\nu(\text{CVAL}_1)$ nor $\nu(\text{CVAL}_2)$ is the undefined element) and the comparison does not hold; otherwise (i.e., if at least one of the c-values is undefined, or if the comparison holds), they evaluate to true. The semantics of the Boolean propositional EVENT expressions is standard, i.e., obtained by propagating $\nu$ through the propositional operators. For instance, a conjunction of two events evaluates to true if both conjuncts evaluate to true, and to false otherwise.
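The evaluation of c-values under one fixed valuation can be sketched as follows for scalar values. This is an illustration under explicit assumptions: the undefined element is modelled by `None`, it is assumed to be neutral for addition (so that a sum collects only the defined summands) and absorbing for multiplication, and the inverse of zero is assumed to be undefined. The helper names, including `centroid`, are hypothetical.

```python
import operator

U = None  # stand-in for the undefined element

def cval(event, value):
    """EVENT (x) VAL: the value if the event holds, undefined otherwise."""
    return value if event else U

def add(a, b):
    """Assumption: the undefined element is neutral for addition."""
    if a is U:
        return b
    if b is U:
        return a
    return a + b

def mul(a, b):
    """Assumption: the undefined element is absorbing for multiplication."""
    return U if a is U or b is U else a * b

def inv(a):
    """Assumption: the inverse of undefined or of zero is undefined."""
    return U if a is U or a == 0 else 1.0 / a

def atom(a, comp, b):
    """[CVAL COMP CVAL]: false only if both sides are defined and comp fails."""
    if a is U or b is U:
        return True
    return comp(a, b)

def centroid(events, points):
    """Mean of the points whose membership event holds, as a c-value."""
    total, count = U, U
    for e, p in zip(events, points):
        total = add(total, cval(e, p))
        count = add(count, cval(e, 1))
    return mul(total, inv(count))

mean_two = centroid([True, False, True], [2.0, 6.0, 4.0])
mean_none = centroid([False, False, False], [2.0, 6.0, 4.0])
```

Here `mean_two` averages the first and third points, while `mean_none` is undefined because no membership event holds. A comparison involving `mean_none` evaluates to true, in line with the semantics of ATOM above.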

### 3.3 Probabilistic Semantics of Events

We next give a probabilistic interpretation of event expressions that explains how they can be understood as random variables: Boolean event expressions (EVENT) give rise to Boolean random variables, and conditional values (CVAL) give rise to random variables over their respective domain.

For every random variable $x \in X$, we denote by $P_x[\text{true}]$ and $P_x[\text{false}]$ the probability that $x$ is true or false, respectively; we also simply write $P_x$ for $P_x[\text{true}]$. Let $\Omega$ be the set of mappings from the random variables $X$ to true and false.

###### Definition 1 (Induced Probability Space).

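Together, the probability mass function $\Pr(\nu) = \prod_{x \in X} P_x[\nu(x)]$ for every sample $\nu \in \Omega$, and the probability measure $\Pr(A) = \sum_{\nu \in A} \Pr(\nu)$ for $A \subseteq \Omega$ define a probability space that we call the probability space induced by $X$.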

An event expression $E$ is a random variable over the probability space induced by $X$ with probability distribution

$$
P_E[s] = \Pr(\{\nu \in \Omega \mid \nu(E) = s\}) = \sum_{\nu \in \Omega:\ \nu(E) = s} \Pr(\nu).
$$

By virtue of this definition, every Boolean event expression becomes a Boolean random variable, and real-valued (vector-valued) c-values become random variables over the reals (the feature space).
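The definition can be turned directly into a brute-force procedure that enumerates all valuations. The sketch below is only meant to make the definition concrete; representing an event expression as a Python function from a valuation to an outcome, and the names `distribution` and `probs`, are assumptions made for illustration.

```python
from itertools import product
from collections import defaultdict

def distribution(expr, probs):
    """Distribution of expr; probs maps each variable x to its probability of being true."""
    variables = sorted(probs)
    dist = defaultdict(float)
    for bits in product([True, False], repeat=len(variables)):
        nu = dict(zip(variables, bits))
        # Probability of the valuation: product over the independent variables.
        p = 1.0
        for x in variables:
            p *= probs[x] if nu[x] else 1.0 - probs[x]
        dist[expr(nu)] += p          # add the mass of nu to the outcome of expr
    return dict(dist)

probs = {"x": 0.5, "y": 0.25}

# A Boolean event becomes a Boolean random variable.
boolean_event = distribution(lambda nu: nu["x"] or nu["y"], probs)

# A real-valued c-value becomes a random variable over the reals plus
# the undefined element, modelled here by None.
conditional_value = distribution(lambda nu: 3.0 if nu["x"] else None, probs)
```

For the disjunction, the outcome true receives probability 1 − 0.5 · 0.75 = 0.625. The enumeration is exponential in the number of variables, which motivates the compilation techniques of Section 4.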

### 3.4 Event Programs

Event programs are imperative specifications that define a finite set of named c-values and event expressions. The grammar for event programs is as follows:

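$$
\begin{array}{lcl}
\text{INT} & ::= & \text{A positive integer} \\
\text{VAR} & ::= & \text{A variable symbol} \\
\text{LOOP} & ::= & \{\ \{\text{DECL}\}\ \ \{\ \forall\ \text{VAR in INT..INT}:\ \{\text{LOOP}\}\ \}\ \} \\
\text{DECL} & ::= & \text{EID} \equiv \text{EVENT}
\end{array}
$$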

Event programs consist of a sequence of event declarations (DECL) and nested loops (LOOP) of event declarations.

A central concept is that of event identifiers (EID); it is required that event declarations are immutable, i.e., each distinct EID may only be assigned once to an event expression. Inside a $\forall$-loop, identifiers can be parametrised by the loop variable to create a distinct identifier in each iteration of the loop.

The meaning of an event program is simply the set of all named and grounded c-value and event expressions defined by the program; grounded here means that all identifiers in expressions are recursively resolved and replaced by the referenced expressions. For declarations outside of loops, this is clear; each declaration inside (nested) loops is instantiated for each value of the loop counter variables.

### 3.5 From User Programs to Event Programs

The translation of user to event programs has two main challenges: (i) Translating mutable variables and arrays to immutable events, and (ii) translating function calls such as reduce_*. We cover these two issues separately.

From mutable variables to immutable events. It is natural to reassign variables in user language programs, for example when updating k-means centroids in each iteration based on the cluster configuration of the previous iteration. In contrast, events in event programs are immutable, i.e., can be assigned only once. The translation from the user language to the event language utilises a function getLabel that generates, for each user language variable, a sequence of unique event identifiers whose lexicographic order reflects the sequence of assignments of that variable.

The basic idea of getLabel is to first identify the nested loop blocks of the given user language program, and then to establish a counter for each distinct variable symbol and each block. An assignment of a variable within nested blocks corresponds to an event identifier formed from the variable symbol and the dot-separated counters of the enclosing blocks, outermost first (for instance, M1.(2i).j in Example 3). Within each block, its corresponding counter is incremented for every assignment of its variable symbol. When going from one block into a nested inner block, the counters for the outer blocks are kept constant while the counter for the inner block is incremented as the variable is reassigned in the inner block.

Special attention must be paid to the encoding of entering and leaving a block. In order to carry over the reference to a variable into the next inner block, we establish a copy of the variable whose counter at the new level is −1 (lines C and F in Example 3), such that the first access to the variable in the inner block reaches its last assignment in the outer block via this copy. Similarly, the last assignment of a variable in the inner block is passed back to the outer block by copying the last identifier of the inner block to the next identifier of the outer block.

###### Example 3.

Consider the following user language program (left) and its translation to an event program (right).

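| User program | Event program |
|---|---|
| `1: M = 7` | `A: M0 ≡ 7` |
| `2: M = M+2` | `B: M1 ≡ M0+2` |
| `3: for i in range(0,2):` | `C: M1.−1 ≡ M1` |
| | `D: ∀i in 0..1:` |
| `4:     M = M+i` | `E:     M1.(2i) ≡ M1.(2i−1)+i` |
| `5:     for j in range(0,3):` | `F:     M1.(2i).−1 ≡ M1.(2i)` |
| | `G:     ∀j in 0..2:` |
| `6:         M = M+1` | `H:         M1.(2i).j ≡ M1.(2i).(j−1)+1` |
| | `I:     M1.(2i+1) ≡ M1.(2i).2` |
| | `J: M2 ≡ M1.(2⋅1+1)` |
| `7: M = M+1` | `K: M3 ≡ M2+1` |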

The user language program has three nested blocks. Within each block, the respective counter is incremented for each assignment of M: the first counter (M0 to M3) for the outer block, the second counter (M1.(2i) and M1.(2i+1)) in the second block, and the third counter (M1.(2i).j) for the innermost block. The encodings for entering a block are in lines C and F, and those for leaving a block are in lines I and J.
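The grounding of the event program in Example 3 can be checked mechanically. The following Python sketch is an illustration only: the dictionary `events`, the helper `declare`, and the tuple encoding of identifiers (for instance `("M", 1, 2, 0)` for M1.2.0) are assumptions made for this example and are not part of ENFrame. The sketch instantiates lines A to K for all loop counter values, enforces single assignment, and compares the last event with the result of running the user program directly.

```python
# Illustrative grounding of the event program from Example 3.
# Identifiers are encoded as tuples, e.g. ("M", 1, 2, 0) stands for M1.2.0.
# `events` and `declare` are hypothetical helpers, not ENFrame API.

events = {}


def declare(eid, value):
    """Bind an event identifier once; a second binding is an error."""
    if eid in events:
        raise ValueError(f"event {eid} is already declared")
    events[eid] = value


declare(("M", 0), 7)                                   # A
declare(("M", 1), events[("M", 0)] + 2)                # B
declare(("M", 1, -1), events[("M", 1)])                # C: enter the i-block
for i in range(0, 2):                                  # D
    declare(("M", 1, 2 * i),
            events[("M", 1, 2 * i - 1)] + i)           # E
    declare(("M", 1, 2 * i, -1),
            events[("M", 1, 2 * i)])                   # F: enter the j-block
    for j in range(0, 3):                              # G
        declare(("M", 1, 2 * i, j),
                events[("M", 1, 2 * i, j - 1)] + 1)    # H
    declare(("M", 1, 2 * i + 1),
            events[("M", 1, 2 * i, 2)])                # I: leave the j-block
declare(("M", 2), events[("M", 1, 2 * 1 + 1)])         # J: leave the i-block
declare(("M", 3), events[("M", 2)] + 1)                # K


def user_program():
    """The user language program from Example 3, run directly."""
    M = 7
    M = M + 2
    for i in range(0, 2):
        M = M + i
        for j in range(0, 3):
            M = M + 1
    M = M + 1
    return M


final_value = events[("M", 3)]
```

Every identifier is bound exactly once, and `final_value` agrees with `user_program()`, which is what the copy events at block entry and exit are meant to guarantee.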

Translation of arrays. Since arrays in a user language program have a known fixed size, their translation is straightforward: a multi-dimensional array translates to one distinct identifier per array cell.

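Translation of reduce_* calls. According to the grammar in Figure 4, reduce-operations can only be applied to anonymous arrays created by list comprehension. The expression reduce_and([EXPR for ID in range(FROM, TO) if COND]) is translated to a Boolean event: the conjunction, over all values of ID in the range, of the instances of EXPR selected by COND. Symmetrically, reduce_or translates to the corresponding disjunction, reduce_sum to the corresponding sum, and reduce_mult to the corresponding product. A call to reduce_count([EXPR for ID in range(FROM, TO) if COND]) translates to the event that counts the values of ID in the range for which COND holds.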
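For intuition, the reduce-operations can be read deterministically, i.e., in a single possible world. The Python definitions below are a sketch under that assumption and are not ENFrame's implementation; in particular, the behaviour on empty lists (the neutral element of each operation) is simply what plain Python yields.

```python
# Deterministic reading of the reduce_* operations on a list comprehension.
# These definitions are illustrative assumptions; ENFrame interprets the
# same calls probabilistically via events.
from functools import reduce
import operator


def reduce_and(values):
    return all(values)            # conjunction; True on an empty list


def reduce_or(values):
    return any(values)            # disjunction; False on an empty list


def reduce_sum(values):
    return sum(values)            # sum; 0 on an empty list


def reduce_mult(values):
    return reduce(operator.mul, values, 1)   # product; 1 on an empty list


def reduce_count(values):
    return len(values)            # number of selected elements


FROM, TO = 0, 6
all_small = reduce_and([x < 5 for x in range(FROM, TO) if x % 2 == 0])
some_large = reduce_or([x > 3 for x in range(FROM, TO) if x % 2 == 0])
square_sum = reduce_sum([x * x for x in range(FROM, TO) if x % 2 == 0])
product = reduce_mult([x + 1 for x in range(FROM, TO) if x % 2 == 0])
count = reduce_count([x for x in range(FROM, TO) if x % 2 == 0])
```

Here the condition selects the values 0, 2, and 4, so the five results are `True`, `True`, `20`, `15`, and `3`.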

## 4 Probability Computation

The probability computation problem is known to be #P-hard already for simple events representing propositional formulas such as positive bipartite formulas in disjunctive normal form [32]. In ENFrame, we need to compute probabilities of a large number of interconnected complex events. Although the worst-case complexity remains hard, we attack the problem with three complementary techniques: (1) bulk-compile all events into one decision tree while exploiting the structure of the events to obtain smaller trees, (2) employ approximation techniques to prune significant parts of the decision tree, and ultimately (3) distribute the compilation by assigning distinct distributed workers to explore disjoint parts of the tree.

We next introduce the bulk-compilation technique, look at three approximation approaches, and discuss how to distribute the probability computation.

### 4.1 Compilation of Event Programs

[Algorithm: Exact and approximate compilation of a network. Outline: Compile(network, absolute error) initialises empty masks for the nodes in the network, sets the initial probability lower bound of each target to 0 and its upper bound to 1, fixes the error budget (zero for exact computation), and calls dfs with an empty DFS branch. dfs(network, masks, branch, error budgets) returns as soon as all targets are reached or approximated within the available budget. Otherwise, it computes the error budget for the left DFS branch, explores the left branch and stores the residual error budget, and then computes the error budget for the right DFS branch. It explores the right branch and returns the residual error budget, unless all probability bounds have already reached the approximation, in which case the right branch is skipped.]

Event programs consist of interconnected events, which are represented in an event network: a graph representation of the event program in which nodes are, e.g., Boolean connectives, comparisons, aggregates, and c-values. An example of such a network is depicted in the figure referenced as example-dag.

The goal is to compute probabilities for the top nodes in the network, which are referred to as compilation targets. These nodes represent events such as “a given object is assigned to a given cluster in a given iteration”. We keep lower and upper bounds for the probability of each target. Initially, these bounds are 0 and 1, respectively, and they eventually converge during computation.

The bulk-compilation procedure is based on Shannon expansion: select an input random variable $x$ and partially evaluate each compilation target $\Phi$ to $\Phi|_{x}$ for $x$ being set to true and to $\Phi|_{\neg x}$ for $x$ being set to false. Then, the probability of $\Phi$ is defined by $P(\Phi) = P(x) \cdot P(\Phi|_{x}) + P(\neg x) \cdot P(\Phi|_{\neg x})$. We are now left with two simpler events $\Phi|_{x}$ and $\Phi|_{\neg x}$. By repeating this procedure, we eventually resolve all variables in the events to the constants true or false. The trace of this repeated expansion is the decision tree. We need not materialise the tree. Instead, we explore it depth-first, collect the probabilities of all visited branches, and record for each event $\Phi$ the sums $L(\Phi)$ and $U(\Phi)$ of the probabilities of those branches that satisfied and, respectively, did not satisfy the event. At any time, $L(\Phi)$ and $1 - U(\Phi)$ represent a lower bound and, respectively, an upper bound on the probability of $\Phi$. This compilation procedure needs time polynomial in the network size (and in the size of the input data set), yet in the worst case (unavoidably) exponential in the number of variables used by the events.
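The depth-first exploration described above can be sketched in a few lines of Python. This is a simplified illustration, not ENFrame's compiler: it assumes independent Boolean input variables, expands them in a fixed order instead of choosing them heuristically, and re-evaluates events on each branch instead of masking a shared network. The names `var`, `and_`, `or_`, and `compile_bounds` are hypothetical.

```python
# Shannon expansion explored depth-first, with probability bounds per target.
# Assumptions: independent Boolean variables, fixed variable order, and
# targets that only mention variables listed in `prob`.
from typing import Callable, Dict, Optional, Tuple

Assignment = Dict[str, bool]
# An event maps a partial assignment to True, False, or None (undetermined).
Event = Callable[[Assignment], Optional[bool]]


def var(name: str) -> Event:
    return lambda nu: nu.get(name)


def and_(a: Event, b: Event) -> Event:
    def ev(nu: Assignment) -> Optional[bool]:
        va, vb = a(nu), b(nu)
        if va is False or vb is False:
            return False
        if va is True and vb is True:
            return True
        return None
    return ev


def or_(a: Event, b: Event) -> Event:
    def ev(nu: Assignment) -> Optional[bool]:
        va, vb = a(nu), b(nu)
        if va is True or vb is True:
            return True
        if va is False and vb is False:
            return False
        return None
    return ev


def compile_bounds(targets: Dict[str, Event],
                   prob: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    satisfied = {t: 0.0 for t in targets}   # mass of branches satisfying t
    falsified = {t: 0.0 for t in targets}   # mass of branches falsifying t
    order = list(prob)

    def dfs(nu: Assignment, p: float, depth: int, open_targets) -> None:
        still_open = []
        for t in open_targets:
            value = targets[t](nu)
            if value is True:
                satisfied[t] += p
            elif value is False:
                falsified[t] += p
            else:
                still_open.append(t)
        if not still_open:
            return                           # branch resolved; backtrack
        x = order[depth]
        dfs({**nu, x: True}, p * prob[x], depth + 1, still_open)
        dfs({**nu, x: False}, p * (1.0 - prob[x]), depth + 1, still_open)

    dfs({}, 1.0, 0, list(targets))
    # Lower bound: satisfied mass. Upper bound: 1 minus falsified mass.
    return {t: (satisfied[t], 1.0 - falsified[t]) for t in targets}


targets = {"both": and_(var("x"), var("y")),
           "either": or_(var("x"), var("y"))}
bounds = compile_bounds(targets, {"x": 0.5, "y": 0.4})
```

Because every branch is explored here, the lower and upper bounds of each target coincide on return. Stopping the traversal early would leave a gap between the two bounds, which is the starting point for approximation.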

For practical reasons, we do not construct the partially evaluated events explicitly, but keep minimal information that, in addition to the network, uniquely defines them. The process of computing this minimal information is called masking. We achieve this by traversing the network bottom-up and remembering the nodes that become true or false given the values of their children. When a compilation target is eventually masked by a variable assignment, the probability of that assignment is added to its lower bound if the target is masked true, or subtracted from its upper bound if it is masked false. If one or more targets are left unmasked, a next variable is chosen and the process is repeated with the assignment extended by that variable, set to either true or false. The algorithm chooses the next variable such that it influences as many events as possible.
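The bottom-up propagation of masks can be illustrated on a small network of Boolean connectives. The sketch below is an assumption-laden simplification: the network encoding (a dictionary from node to operator and children), the function `propagate`, and the restriction to `and`, `or`, and `not` nodes are chosen for this example only and do not reflect ENFrame's data structures.

```python
# Bottom-up mask propagation in a small event network (illustrative only).
# A mask maps a node to True or False once its value is fixed by the
# current partial assignment; unmasked nodes are absent from the result.

def propagate(network, order, assignment):
    """network: node -> (op, children); order: nodes listed children-first."""
    masks = dict(assignment)      # input variables are masked by the assignment
    for node in order:
        op, children = network[node]
        vals = [masks.get(c) for c in children]
        if op == "and":
            if any(v is False for v in vals):
                masks[node] = False
            elif all(v is True for v in vals):
                masks[node] = True
        elif op == "or":
            if any(v is True for v in vals):
                masks[node] = True
            elif all(v is False for v in vals):
                masks[node] = False
        elif op == "not":
            if vals[0] is not None:
                masks[node] = not vals[0]
    return masks


network = {
    "a": ("and", ["x", "y"]),
    "b": ("or", ["x", "z"]),
    "t": ("and", ["a", "b"]),     # "t" plays the role of a compilation target
}
order = ["a", "b", "t"]

masks_x_false = propagate(network, order, {"x": False})
masks_x_true = propagate(network, order, {"x": True})
```

With `x` set to false, node `a` and the target `t` are masked false without looking at `y` or `z`, so this branch needs no further expansion. With `x` set to true, only `b` is masked and the target stays open, so another variable would have to be chosen.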

Once all compilation targets are masked by an assignment, the compilation backtracks and selects a different assignment for the most recently chosen variable whose assignments are not exhausted. When all branches of the decision tree have been investigated, the probability bounds of the targets have necessarily converged and the algorithm terminates.

###### Example 4.
The figure referenced as example-dag shows a simplified event network under a partial variable assignment. The masks of the assigned variables are propagated to the event nodes that depend on them, which are now also masked. The red nodes and the green nodes are masked with opposite truth values.

### References

1. C. Aggarwal. Managing and Mining Uncertain Data. Kluwer, 2009.
2. C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, chapter A Survey of Uncertain Data Clustering Algorithms. Chapman and Hall, 2013.
3. P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
4. Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, 2011.
5. Apache Software Foundation. The Mahout machine learning library. http://mahout.apache.org. v0.7.
6. D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. VLDB Journal, 2005.
7. V. R. Borkar, Y. Bu, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Declarative systems for large-scale machine learning. Data Eng. Bull., 35(2), 2012.
8. Y. Bu, B. Howe, M. Balazinska, and M. Ernst. The haloop approach to large-scale iterative data analysis. VLDB J., 2012.
9. Z. Cai, Z. Vagena, L. Perez, S. Arumugam, P. J. Haas, and C. Jermaine. Simulation of database-valued markov chains using SimSQL. In SIGMOD, 2013.
10. L. Cartey, R. Lyngsø, and O. de Moor. Synthesising graphics card programs from DSLs. In PLDI, 2012.
11. M. Chau, R. Cheng, B. Kao, and J. Ng. Uncertain data mining: An example in clustering location data. In PAKDD, 2006.
12. S. Davidson, S. Cohen-Boulakia, A. Eyal, B. Ludäscher, T. McPhillips, S. Bowers, and J. Freire. Provenance in scientific workflow systems. Data Eng. Bull., 32(4), 2007.
13. X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
14. R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5), 2012.
15. B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, 2009.
16. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
17. F. Gullo, G. Ponti, and A. Tagarelli. Clustering uncertain data via k-medoids. In SUM, 2008.
18. F. Gullo, G. Ponti, and A. Tagarelli. Minimizing the variance of cluster mixture models for clustering uncertain objects. In ICDM, 2010.
19. F. Gullo, G. Ponti, A. Tagarelli, and S. Greco. A hierarchical algorithm for clustering uncertain data via an information-theoretic approach. In ICDM, 2008.
20. F. Gullo and A. Tagarelli. Uncertain centroid based partitional clustering of uncertain data. PVLDB, 2012.
21. J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12), 2012.
22. R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Eng. Bull., 33(3), 2010.
23. Z. Ives, T. Green, G. Karvounarakis, N. Taylor, V. Tannen, P. P. Talukdar, M. Jacob, and F. Pereira. The Orchestra collaborative data sharing system. SIGMOD Rec., 2008.
24. R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, and P. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. ACM TODS, 36(3), 2011.
25. B. Kao, S. Lee, F. Lee, D. Cheung, and W. Ho. Clustering uncertain data using Voronoi diagrams and R-Tree index. TKDE, 2010.
26. C. Koch and D. Olteanu. Conditioning probabilistic databases. In VLDB, 2008.
27. H. Kriegel and M. Pfeifle. Density-based clustering of uncertain data. In SIGKDD, 2005.
28. M. Michel and C. Eastham. Improving the management of MV underground cable circuits using automated on-line cable partial discharge mapping. In CIRED, 2011.
29. S. Mihaylov, Z. Ives, and S. Guha. REX: Recursive, delta-based data-centric computation. PVLDB, 5(11), 2012.
30. W. Ngai, B. Kao, C. Chui, R. Cheng, M. Chau, and K. Yip. Efficient clustering of uncertain data. In ICDM, 2006.
31. S. Omurca and N. Duru. Decreasing iteration number of k-medoids algorithm with IFART. In ELECO, 2011.
32. J. Provan and M. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4), 1983.
33. P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
34. D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.
35. L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In KDD, 2010.
36. S. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, 2000.
37. P. B. Volk, F. Rosenthal, M. Hahmann, D. Habich, and W. Lehner. Clustering uncertain data with possible worlds. In ICDE, 2009.