ENFrame: A Platform for Processing Probabilistic Data

Abstract

This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame.

The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for the computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov Clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees.

Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.


1 Introduction

Recent years have witnessed a solid body of work in probabilistic databases with sustained systems building effort and extensive analysis of computational problems for rich classes of queries and probabilistic data models of varying expressivity [34]. In contrast, most state-of-the-art probabilistic data mining approaches so far consider the restricted model of probabilistically independent input and produce hard, deterministic output [1]. This technology gap hinders the development of data processing systems that integrate techniques for both probabilistic databases and data mining.

The ENFrame data processing platform aims at closing this gap by allowing users to specify iterative programs that query and mine probabilistic data. The semantics of ENFrame programs is based on a unified probabilistic interpretation of the entire processing pipeline, from the input data to the program result. ENFrame features an expressive set of programming constructs, such as assignments, bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines, coupled with aspects of probabilistic databases, such as possible worlds semantics, arbitrary data correlations, and exact and approximate probability computation with error guarantees. Existing probabilistic data mining algorithms do not share these latter aspects.

Under the possible worlds semantics, the input is a probability distribution over a finite set of possible worlds, whereby each world defines a standard database or a set of input data points. The result of a user program is equivalent to executing it within each world and is thus a probability distribution over possible outcomes (e.g., partitionings). ENFrame exploits the fact that many of the possible worlds are alike, and avoids iterating over the exponentially many worlds.

Correlations occur naturally in query results [34], after conditioning probabilistic databases using constraints [26], and are supported by virtually all mainstream probabilistic models. If correlations are ignored, the output can be arbitrarily off from the expected result [37, 2]. For instance, consider two similar, but contradicting sensor readings (mutually exclusive data points) in a clustering setting. There is no possible world and thus no cluster containing both points, yet by ignoring their negative correlation, we would assign them to the same cluster.

1:  (O, n) = loadData()        # list and number of objects
2:  (k, iter) = loadParams()   # number of clusters and iterations
3:  M = init()                 # initialise medoids

4:  for it in range(0,iter):   # clustering iterations
5:   InCl = [None] * k         # assignment phase
6:   for i in range(0,k):
7:    InCl[i] = [None] * n
8:    for l in range(0,n):
9:     InCl[i][l] = reduce_and(
10:       [(dist(O[l],M[i]) <= dist(O[l],M[j])) for j in range(0,k)])
11:  InCl = breakTies2(InCl)   # each object is in exactly one cluster

12:  DistSum = [None] * k      # update phase
13:  for i in range(0,k):
14:   DistSum[i] = [None] * n
15:   for l in range(0,n):
16:    DistSum[i][l] = reduce_sum(
17:      [dist(O[l],O[p]) for p in range(0,n) if InCl[i][p]])

18:  Centre = [None] * k
19:  for i in range(0,k):
20:   Centre[i] = [None] * n
21:   for l in range(0,n):
22:    Centre[i][l] = reduce_and(
23:      [DistSum[i][l] <= DistSum[i][p] for p in range(0,n)])
24:  Centre = breakTies1(Centre)  # enforce one Centre per cluster

25:  M = [None] * k
26:  for i in range(0,k):
27:   M[i] = reduce_sum([O[l] for l in range(0,n) if Centre[i][l]])
      

# Encodings of breakTies1 and breakTies2 omitted

Figure 1: K-medoids clustering specified as user program (left) and simplified event program (right).

The user is oblivious to the probabilistic nature of the input data, and can write programs as if the input data were deterministic. It is the task of ENFrame to interpret the program probabilistically. The approach taken here is to trace the user computation using fine-grained provenance, which we call events. The event language is a non-trivial extension of provenance semirings [16] and semimodules [4] that are used to trace the evaluation of positive relational algebra queries with aggregates and to compute probabilities of query results [14]. It features events with negation, aggregates, loops, and definitions. It is expressive enough to succinctly encode arbitrary correlations occurring in the input data (e.g., modelled on Bayesian networks and pc-tables) and in the result of the user program (e.g., co-occurrence of data points in the same cluster), and to trace the program state at any time. By annotating each computation in the program with events, we effectively translate it into an event program: variables become random variables whose possible outcomes are conditioned on events. Selected events represent the probabilistic program output, e.g., in case of clustering: the probability that a data point is a medoid, or the probability that two data points are assigned to the same cluster. Besides probability computation, events can be used for sensitivity analysis and explanation of the program result.

The most expensive task supported by ENFrame is probability computation for event programs, which is #P-hard in general. We developed sequential and distributed algorithms for both exact and approximate probability computation with error guarantees. The algorithms operate on a graph representation of the event programs called event networks. Expressions common to several events are only represented once in such graphs. Event networks for data mining tasks are very repetitive and highly interconnected due to the combinatorial nature of the algorithms: the events at each iteration are expressions over the events at the previous iteration and have the same structure in each iteration. Moreover, event networks can be cyclic, so as to account for program loops. While it is possible to unfold bounded-range loops, this can lead to prohibitively large event networks.

The key challenge faced by ENFrame is to compute the probabilities of a large number of interconnected events that are defined over several iterations. Rather than computing the probability of each event separately, ENFrame’s algorithms employ a novel bulk-compilation technique, using Shannon expansion to explore depth-first the decision tree induced by the input random variables and the events in the program. The approximation algorithms use an error budget to prune large fragments of this decision tree that only account for a small probability mass. We introduce three approximation approaches (eager, lazy, and hybrid), each with a different strategy for spending the error budget. The distributed versions of these algorithms divide the exploration space into fragments to be explored concurrently by distinct workers.

While the computation time can grow exponentially in the number of input random variables in the worst case, the structure of correlations can reduce it dramatically. As shown experimentally, ENFrame’s algorithm for exact probability computation is orders of magnitude faster than executing the user program in each possible world.

To sum up, the contributions of this paper are as follows:

  • We propose the ENFrame platform for processing probabilistic data. ENFrame can evaluate user programs on probabilistic data with arbitrary correlations following the possible worlds semantics.

  • User programs are written in a fragment of Python that supports bounded-range loops, list comprehension, aggregates, and calls to external database engines. We illustrate ENFrame’s features by giving programs for three clustering algorithms (k-means, k-medoids, and Markov Clustering) and provide a formal specification of ENFrame’s user language, which can be used to write arbitrary programs for the platform.

  • User programs are annotated by ENFrame with events that are expressive enough to capture the correlations of the input, trace the program computation, and allow for probability computation.

  • ENFrame uses novel sequential and distributed algorithms for exact and approximate probability computation of event programs.

  • We implemented ENFrame’s probability computation algorithms in C++.

  • We report on experiments with k-medoids clustering of readings from partial discharge sensors in energy networks [28]. We show orders-of-magnitude performance improvements of ENFrame’s exact algorithm over the naïve approach of clustering in each possible world, of approximate over exact clustering, and of distributed over sequential algorithms.

The paper is organised as follows. Section 2 introduces the Python fragment supported by ENFrame along with encodings of clustering algorithms. Section 3 introduces our event language and shows how user programs are annotated with events. Our probability computation algorithms are introduced in Section 4 and evaluated experimentally. Section 5 overviews recent related work.

2 ENFrame’s User Language

1:  (O, n) = loadData()        # list and number of objects
2:  (k, iter) = loadParams()   # number of clusters and iterations
3:  M = init()                 # initialise centroids

4:  for it in range(0,iter):   # clustering iterations
5:   InCl = [None] * k         # assignment phase
6:   for i in range(0,k):
7:    InCl[i] = [None] * n
8:    for l in range(0,n):
9:     InCl[i][l] = reduce_and(
10:       [dist(O[l],M[i]) <= dist(O[l],M[j]) for j in range(0,k)])
11:  InCl = breakTies2(InCl)   # each object is in exactly one cluster

12:  M = [None] * k            # update phase
13:  for i in range(0,k):
14:   M[i] = scalar_mult(invert(
15:     reduce_count([1 for l in range(0,n) if InCl[i][l]])),
16:     reduce_sum([O[l] for l in range(0,n) if InCl[i][l]]))
      

# Encoding of breakTies2 omitted

Figure 2: K-means clustering specified as user program (left) and simplified event program (right).
1:  (O, n, M) = loadData()     # M is a stochastic n*n matrix of
2:        # edge weights between the n nodes, O is list of nodes
3:  (r, iter) = loadParams()   # Hadamard power, number of iterations

4:  for it in range(0,iter):
5:   N = [None] * n            # expansion phase
6:   for i in range(0,n):
7:    N[i] = [None] * n
8:    for j in range(0,n):
9:     N[i][j] = reduce_sum([M[i][k]*M[k][j] for k in range(0,n)])

10:  M = [None] * n             # inflation phase
11:  for i in range(0,n):
12:   M[i] = [None] * n
13:   for j in range(0,n):
14:    M[i][j] = pow(N[i][j],r)*invert(
15:             reduce_sum([pow(N[i][k],r) for k in range(0,n)]))
      

Figure 3: Markov clustering specified as user program (left) and simplified event program (right).

This section introduces the user language supported by ENFrame. Its design is grounded in three main desiderata:

  1. It should naturally express common mining algorithms, and allow users to issue queries and manipulate their results.

  2. User programs must be oblivious to the deterministic or probabilistic nature of the input data and to the probabilistic formalism considered.

  3. It should be simple enough to allow for an intuitive and straightforward probabilistic interpretation.

We settled on a subset of Python that can express, among others, k-means, k-medoids, and Markov Clustering. In line with query languages for probabilistic databases, where a Boolean query maps a deterministic database to a truth value and a probabilistic database to a Boolean random variable, every user program has a sound semantics for both deterministic and probabilistic input data: in the former case, the result of a clustering algorithm is a deterministic clustering; in the latter case, it is a probability distribution over possible clusterings.

The user language comprises the following constructs:


Variables and arrays. Variables can be of scalar types (real, integer, or Boolean) or arrays. Examples of variable assignments: V = 2, W = V, M[2] = True, or M[i] = W. Arrays must be initialised, e.g., for array M of cardinality k: M = [None] * k. Additionally, the expression range(0, n) specifies the array [0,…,n-1].


Functions. Scalar variables can be multiplied, exponentiated (pow(B, r) computes B to the power r), and inverted (invert(B) computes 1/B). The function dist(A,B) is a distance measure on the feature space between the vectors specified by arrays A,B of reals; scalar_mult is component-wise multiplication of an array with a scalar.


Reduce. Given a one-dimensional array M of some scalar type, it can be reduced to a scalar value by applying one of the functions reduce_and, reduce_or, reduce_sum, reduce_mult, or reduce_count. For instance, for an array B of Booleans, the expression reduce_and(B) computes the Boolean conjunction of the truth values in B, and the expression reduce_count(B) computes the number of elements in B. For a two-dimensional array of reals or integers, i.e., an array of vectors, reduce_sum computes the component-wise sum of the vectors.
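To make the intended semantics concrete, the following plain-Python sketch gives one possible deterministic reading of these reduce functions (illustrative only; the names mirror the user language, but the bodies are not ENFrame’s implementation):

# Illustrative deterministic reading of the reduce_* functions.
def reduce_and(bools):
    return all(bools)

def reduce_or(bools):
    return any(bools)

def reduce_count(xs):
    return len(xs)

def reduce_sum(xs):
    # Component-wise sum for an array of vectors, plain sum for scalars.
    if xs and isinstance(xs[0], list):
        return [sum(component) for component in zip(*xs)]
    return sum(xs)

def reduce_mult(xs):
    product = 1
    for x in xs:
        product *= x
    return product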


List comprehension. Inside a reduce-function, anonymous arrays may be defined using list comprehension. For example, given an array B of Booleans of size n, the expression reduce_sum([1 for i in range(0,n) if B[i]]) counts the number of True values in B.


Loops. We only allow bounded-range loops; for any fixed integer n and counter variable i, for-loops can be defined by: for i in range(0,n). This allows us to know the size of each constructed array at compile time.


Input data. The special abstract primitive loadData() is used to specify input data for algorithms. This function can be implemented to statically specify the objects to be clustered, to load them from disk, or to issue queries to a database. ENFrame supports positive relational algebra queries with aggregates via the SPROUT query engine for probabilistic data [14]. The abstract methods loadParams() and init() are used to set algorithm parameters such as the number of iterations and clusters of a clustering algorithm.

2.1 Clustering Algorithms in ENFrame

We illustrate ENFrame’s user language with three example data mining algorithms: k-means, k-medoids, and Markov Clustering. Figures 1, 2, and 3 list user programs for these algorithms; we next discuss each of them.


k-means clustering. The k-means algorithm partitions a set of data points into k groups of similar data points. We initially choose a centroid for each cluster, i.e., a data point representing the cluster centre (initialisation phase). In successive iterations, each data point is assigned to the cluster with the closest centroid (assignment phase), after which the centroid is recomputed for each cluster (update phase). The algorithm terminates after a given number of iterations or after reaching convergence. Note that our user language does not support fixpoint computation, and hence cannot check for convergence.

Figure 2 implements k-means. The set O of input objects is retrieved using a loadData call. Each object is represented by a feature vector (i.e., an array) of reals. We then load the number k of clusters and the number iter of iterations, and initialise the cluster centroids M (line 3). The initialisation phase has a significant influence on the clustering outcome and convergence; we assume that initial centroids have been chosen, for example using a heuristic [31]. Subsequently, an array InCl of Booleans is computed such that InCl[i][l] is True if and only if M[i] is the closest centroid to object O[l] (lines 5–10); every object is then assigned to its closest cluster. Since two clusters may be equidistant to an object, ties are broken using the breakTies2 call (line 11); it fixes an order of the clusters and ensures that each object is only assigned to the first of its potentially multiple closest clusters. Next, the new cluster centroids M[i] are computed as the centroids of each cluster (lines 12–16). The assignment and update phases are repeated iter times (line 4).


k-medoids clustering. The k-medoids algorithm is almost identical to k-means, but elects cluster medoids rather than centroids: these are cluster members that minimise the sum of distances to all other objects in the cluster. The assignment phase is the same as for k-means (lines 5–11), while the update phase is more involved: we first compute an array DistSum such that DistSum[i][l] is the sum of distances from object O[l] to all objects in cluster i (lines 12–17), then find the object in each cluster that minimises this sum (lines 18–24), and finally elect these objects as the new cluster medoids M (lines 25–27). The last step uses reduce_sum to select exactly one of the objects in a cluster as the new medoid, since for each fixed i only one value Centre[i][l] is True due to the tie-breaker in line 24.


Markov clustering (MCL). MCL is a fast and scalable unsupervised clustering algorithm for graphs based on the simulation of stochastic flow in graphs [36]. Natural clusters in a graph are characterised by the presence of many edges within a cluster and few edges across clusters. MCL simulates random walks within a graph by alternating two operations: expansion and inflation. Expansion corresponds to computing random walks of higher length. It associates new probabilities with all pairs of nodes, where one node is the point of departure and the other is the destination. Since higher-length paths are more common within clusters than between different clusters, the probabilities associated with node pairs lying in the same cluster will, in general, be relatively large, as there are many ways of going from one to the other. Inflation has the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without a priori knowledge of cluster structure; it is the result of cluster structure being present.

Figure 3 gives the MCL user program. Expansion coincides with taking the power of a stochastic matrix M using the normal matrix product (i.e., matrix squaring). Inflation corresponds to taking the Hadamard power of a matrix (taking powers entry-wise). It is followed by a scaling step to maintain the stochastic property, i.e., the matrix entries are probabilities that sum up to 1 in each column.
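As a concrete illustration, one expansion–inflation iteration can be transcribed into plain Python directly from the listing in Figure 3 (an illustrative sketch using lists of lists; the rescaling follows the listing, which normalises along the second index):

# One MCL iteration on an n*n matrix M given as a list of lists,
# mirroring Figure 3: expansion (matrix squaring), then inflation
# (entry-wise r-th power followed by rescaling).
def mcl_iteration(M, r):
    n = len(M)
    # Expansion: N = M * M (ordinary matrix product).
    N = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # Inflation: Hadamard power and rescaling, as in lines 14-15 of Figure 3.
    return [[N[i][j] ** r / sum(N[i][k] ** r for k in range(n))
             for j in range(n)]
            for i in range(n)]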

Section 3 discusses the probabilistic interpretation of the computation of the above three clustering algorithms.

[Grammar of the user language (partially recoverable): an EXPR is one of (REDUCE ‘(’ LCOMPR ‘)’), (pow‘(’EXPR, EXPR‘)’), (invert‘(’EXPR‘)’), (EXPR ‘*’ EXPR), (EXPR ‘+’ EXPR), (scalar_mult‘(’EXPR, EXPR‘)’), or (breakTies‘(’EXPR‘)’); REDUCE includes reduce_and, reduce_or, reduce_sum, reduce_mult, and reduce_count; further nonterminals are LOOP, DECL, LCOMPR, RANGE, COMP, EXT, ID, and LIT.]
Figure 4: The grammar of the user language.

2.2 Syntax of the User Language

Figure 4 specifies the formal grammar for the language of user programs. A program is a sequence of declarations (DECL) and loop blocks (LOOP), each of which may again contain declarations and nested loops. The language allows assigning expressions (EXPR) to variable identifiers (ID). An expression may be a Boolean, integer, or float constant (LIT), an identifier, an array declaration, the result of a Boolean comparison between expressions, or the result of such operations as sum, product, inversion, or exponentiation. The result of a reduce operation on an anonymous array created through list comprehension (LCOMPR), and the result of breaking ties in a Boolean array also give rise to expressions; we elaborate on these two constructions below.

In addition to the syntactic structure as defined by the grammar, programs have to satisfy the following constraints:


Bounded-range loops. The parameters to the range construct must be integer constants (or immutable integer-valued variables). This restriction ensures that for-loops (LOOP) and list comprehensions (LCOMPR) are of bounded size that is known at compile time.


Anonymous arrays via list comprehension. List comprehension may only be used to construct one-dimensional arrays of base types, i.e., arrays of integers, floats, or Booleans.


Breaking ties. Clustering algorithms require explicit handling of ties: for instance, if an object is equidistant to two distinct cluster centroids in k-means, the algorithm has to decide which cluster the object will be assigned to. In ENFrame programs, the membership of objects to clusters can be encoded by a Boolean array like InCl such that InCl[i][l] is true if and only if object l is in cluster i. In this context, a tie is a configuration of InCl in which, for a fixed object l, InCl[i][l] is True for more than one cluster i. We explicitly break such ties using the function breakTies2(M): for each fixed value l of the second dimension (hence the 2 in the function name) of the 2-dimensional array M, it iterates over the first dimension of M and sets all but the first True value among M[0][l], …, M[k-1][l] to False. Symmetrically, the function breakTies1(M) fixes the first dimension and breaks ties along the second dimension of M, and breakTies(M) breaks ties in a one-dimensional array.
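The tie-breaking functions admit the following straightforward deterministic reading (an illustrative sketch of the semantics just described, not ENFrame’s implementation):

def breakTies(B):
    # Keep only the first True value of a one-dimensional Boolean array.
    seen = False
    for l in range(len(B)):
        if B[l]:
            if seen:
                B[l] = False
            seen = True
    return B

def breakTies2(M):
    # For each fixed second index l, keep only the first True value
    # among M[0][l], ..., M[k-1][l].
    k, n = len(M), len(M[0])
    for l in range(n):
        seen = False
        for i in range(k):
            if M[i][l]:
                if seen:
                    M[i][l] = False
                seen = True
    return M

def breakTies1(M):
    # Symmetric: fix the first index and break ties along the second.
    for row in M:
        breakTies(row)
    return M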

3 Tracing Computation by Events

The central concept for representing user programs in ENFrame is that of events. Each event is a concise syntactic encoding of a random variable and its probability distribution. This section describes the syntax and semantics of events and event programs, and finally explains how ENFrame programs written in the user language from Section 2 can be translated to event programs.

The key features of events and event programs are:

  • Events can encode arbitrarily correlated, discrete probability distributions over input objects. In particular, they can succinctly encode instances of such formalisms as Bayesian networks and pc-tables. The input objects and their correlations can be explicitly provided, or imported via a positive relational algebra query with aggregates over pc-tables [14].

  • By allowing non-Boolean events, our encoding is exponentially more succinct than an equivalent purely Boolean description.

  • Each event has a well-defined probabilistic semantics that allows to interpret it as a random variable.

  • The iterative nature of many clustering algorithms carries over to event programs, in which events can be defined by means of nested loops. This construction together with the ability to reuse existing, named events in the definition of new, more complex events leads to a concise encoding of a large number of distinct events.

Example 1.

Clustering in possible worlds. We start by presenting an instructive example of k-medoids clustering under possible worlds semantics. Consider the four objects in the feature space shown below. They can be clustered into two clusters using k-medoids; the cluster medoids are drawn with thick borders.

[Figure: four objects positioned on a line. The first three objects form one cluster with the second object as its medoid; the fourth object forms a singleton cluster and is its own medoid.]

In order to go from deterministic to uncertain objects, we associate each object with a Boolean propositional formula – its event – over a set of independent Boolean random variables. The possible valuations of these variables define the possible worlds of the input objects: for each valuation there exists one world that contains exactly those objects whose events are true under that valuation. The probability of a world is the product of the probabilities of the variables taking their respective truth values in that valuation.

Let us assume that each of the four objects is annotated with such an event over the random variables.

Distinct worlds can have different clustering results, as exemplified next. The world in which the second object is absent (its event is false) consists of the remaining three objects, for which k-medoids clustering yields:

[Figure: the same four objects with the second object absent. The first object forms a singleton cluster and is its own medoid; the third and fourth objects form the second cluster with the third object as medoid.]

Similarly, the worlds in which the fourth object is absent and the other three objects are present yield:

[Figure: the same four objects with the fourth object absent. The first and second objects form one cluster with the first object as medoid; the third object forms a singleton cluster and is its own medoid.]

The probability of a query such as “Are two given objects in the same cluster?” is the sum of the probabilities of the worlds in which the two objects are in the same cluster.  
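The naive way to obtain such a probability is to enumerate all valuations of the input variables, run the clustering in each resulting world, and sum the probabilities of the qualifying worlds. The sketch below shows the enumeration step for a query given as a Boolean function of a valuation (the variables, probabilities, and query used here are hypothetical placeholders; ENFrame is designed precisely to avoid this exponential enumeration):

from itertools import product

# Hypothetical independent Boolean variables and their probabilities.
probs = {'x1': 0.7, 'x2': 0.4, 'x3': 0.9}

def world_probability(valuation):
    # A world's probability is the product over all variables of P(x)
    # if x is true in the world and 1 - P(x) otherwise.
    p = 1.0
    for var, value in valuation.items():
        p *= probs[var] if value else 1.0 - probs[var]
    return p

def query_probability(query):
    # Sum the probabilities of all worlds in which the query holds.
    total = 0.0
    for values in product([True, False], repeat=len(probs)):
        valuation = dict(zip(probs, values))
        if query(valuation):
            total += world_probability(valuation)
    return total

# Hypothetical query: the objects with events x1 and (x2 and not x3)
# are both present (and could thus end up in the same cluster).
print(query_probability(lambda w: w['x1'] and (w['x2'] and not w['x3'])))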

Events do not only encode the correlations and probabilities of input objects, but can symbolically encode the entire clustering process. We illustrate this in the next example.

Example 2.

Symbolic encoding of k-means by events. We again assume four input objects, each annotated with an event as before. This example introduces conditional values (c-values), which are expressions of the form Φ ⊗ v, where Φ is a Boolean event and v is a vector from the feature space. Intuitively, this c-value takes the value v whenever Φ evaluates to true, and a special undefined value when Φ is false. C-values can be added and multiplied; for example, the expression (Φ1 ⊗ v1) + (Φ2 ⊗ v2) evaluates to v1 + v2 if Φ1 and Φ2 are true, to v1 if Φ1 is true and Φ2 is false, and so on.

Equipped with c-values, an initialisation of k-means with k = 2 can for instance be written in terms of two c-value expressions: the first centroid is set to one object if that object's event is true and to a second object if it is false; the second centroid is always set to the geometric centre of two further objects.

In the assignment phase, each object is assigned to its nearest cluster centroid. The condition InCl[i][l] for object O[l] being closest to centroid M[i] can be written as a Boolean event that conjoins the comparisons [dist(O[l], M[i]) <= dist(O[l], M[j])] over all clusters j, which encodes that the distance from O[l] to centroid M[i] is at most its distance to any other centroid M[j].

Given the Boolean events InCl, we can represent the centroid of cluster i for the next iteration by an expression that sums the c-values InCl[i][l] ⊗ O[l] over all objects and scales the result by the inverse of the cluster cardinality. It specifies a random variable over possible cluster centroids, conditioned on the assignments of objects to clusters as encoded by InCl. This expression is exponentially more succinct than an equivalent purely Boolean encoding of cluster centroids, since the latter would require one Boolean expression for each subset of the four input objects.  
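The if-then-else reading of c-values and the addition rule used above can be mimicked in a few lines of Python (an illustrative sketch under the assumption, made precise in Section 3.2, that the undefined value, modelled here as None, is ignored by addition):

# Illustrative c-values: cond(event, value) is value when the event holds
# and undefined (None) otherwise; addition ignores undefined operands,
# so absent objects do not contribute to aggregates such as centroids.
def cond(event, value):
    return value if event else None

def cv_add(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a + b

# Example: with phi1 true and phi2 false, the sum below evaluates to 3.0.
phi1, phi2 = True, False
print(cv_add(cond(phi1, 3.0), cond(phi2, 5.0)))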

The event programs corresponding to the three user programs for k-means, k-medoids, and MCL are given on the right side of Figures 1–3. In addition to the constructs introduced in Example 2, they use event declarations that assign identifiers to event expressions, and loops that specify sets of events parametrised by a loop counter. The remainder of this section specifies the formal syntax and semantics of event programs, and gives a translation from user programs to event programs.

3.1 Syntax of Event Expressions

The grammar for event expressions is as follows:

[Grammar of event expressions (partially recoverable): VAL denotes reals and feature vectors; a conditional value (CVAL) is built as CVAL ∘ CVAL for an arithmetic operation ∘, as dist(CVAL, CVAL), or as EVENT ⊗ VAL; ATOM denotes comparisons [CVAL COMP CVAL]; EVENT denotes propositional formulas over constants, Boolean random variables, event identifiers (EID), and atoms; further nonterminals are INT and COMP.]

The main constructs are:


Conditional values. Reals and feature vectors are denoted by VAL. Together with a propositional formula, they give rise to a conditional value (CVAL), c-value for short.


Functions of conditional values. Very much like scalars and feature vectors, c-values can be added, multiplied, and exponentiated. Additionally, the distance between two c-values yields another c-value. In addition to the binary operations specified in the grammar (e.g., CVAL + CVAL), we allow sum- and product-expressions over index ranges (see Figure 2).


Event expressions. Event expressions (EVENT) are propositional formulas over the constants true and false, a set of Boolean random variables, event identifiers, and propositions defined by ATOM: [CVAL COMP CVAL] represents the truth value obtained by comparing two c-values.

3.2 Semantics of Event Expressions

The semantics of event expressions is defined by extending a Boolean valuation of the random variables to a valuation of c-values and event expressions. We define in the sequel how such a valuation acts on each of the expression types in the grammar. The base cases of this mapping are the standard algebraic operations on scalars and the feature space, extended by special undefined elements as follows.

We extend the reals (and their operations +, ·, and the comparisons) by a special element ⊥ (for undefined). Addition absorbs the undefined element (⊥ + r = r for any real r), so that undefined summands do not contribute to aggregates, whereas multiplication propagates it (⊥ · r = ⊥). For any two ordinary reals, + and · are as usual.

Similarly, we extend the feature space by an undefined element ⊥_F with analogous rules: vector addition absorbs ⊥_F, scalar multiplication involving an undefined operand yields ⊥_F, and dist applied to an undefined vector yields the undefined real ⊥.

The grammar for event programs does not distinguish between scalars and feature vectors for the sake of notational clarity. The following description implicitly assumes that the expressions are well-typed; e.g., dist is only defined for vector-valued arguments.


CVAL. Conditional values of the form EVENT ⊗ VAL have an if-then-else semantics: if EVENT evaluates to true, then EVENT ⊗ VAL evaluates to VAL, else it evaluates to ⊥ (or ⊥_F for vector-valued c-values); the recursively constructed CVAL expressions have the natural recursive semantics that ultimately defaults to ⊥ and ⊥_F for scalars and feature vectors.


ATOM, EVENT. A comparison between two c-values evaluates to false if both are defined and the comparison does not hold; otherwise (i.e., if at least one of the c-values is undefined, or if the comparison holds), it evaluates to true. The semantics of the Boolean propositional EVENT expressions is standard, i.e., obtained by propagating the valuation through the propositional operators. For instance, a negated variable ¬x evaluates to true if the valuation maps x to false, and to false otherwise.

3.3 Probabilistic Semantics of Events

We next give a probabilistic interpretation of event expressions that explains how they can be understood as random variables: Boolean event expressions (EVENT) give rise to Boolean random variables, and conditional values (CVAL) give rise to random variables over their respective domain.

For every random variable x, we denote by P(x) and P(¬x) the probability that x is true or false, respectively. Let Θ denote the set of mappings (valuations) from the random variables to true and false.

Definition 1 (Induced Probability Space).

Together, the probability mass function that assigns to every sample (valuation) in Θ the product of the probabilities of its variable assignments, and the probability measure obtained by summing this mass over sets of samples, define a probability space that we call the probability space induced by the random variables.

An event expression is a random variable over this induced probability space: its distribution assigns to each possible outcome the total probability of the valuations under which the expression evaluates to that outcome.

By virtue of this definition, every Boolean event expression becomes a Boolean random variable, and real-valued (vector-valued) c-values become random variables over the reals (the feature space).

3.4 Event Programs

Event programs are imperative specifications that define a finite set of named c-values and event expressions. The grammar for event programs is as follows:

[Grammar of event programs (partially recoverable): the nonterminals are INT, VAR, LOOP, and DECL; a program is a sequence of declarations and nested loops of declarations, as described next.]

Event programs consist of a sequence of event declarations (DECL) and nested loops (LOOP) of event declarations.

A central concept is that of event identifiers (EID); event declarations are required to be immutable, i.e., each distinct EID may be assigned an event expression only once. Inside a loop, identifiers can be parametrised by the loop counter to create a distinct identifier in each iteration of the loop.

The meaning of an event program is simply the set of all named and grounded c-value and event expressions defined by the program; grounded here means that all identifiers in expressions are recursively resolved and replaced by the referenced expressions. For declarations outside of loops, this is clear; each declaration inside (nested) loops is instantiated for each combination of values of the loop counter variables.

3.5 From User Programs to Event Programs

The translation of user to event programs has two main challenges: (i) Translating mutable variables and arrays to immutable events, and (ii) translating function calls such as reduce_*. We cover these two issues separately.


From mutable variables to immutable events. It is natural to reassign variables in user language programs, for example when updating k-means centroids in each iteration based on the cluster configuration of the previous iteration. In contrast, events in event programs are immutable, i.e., can be assigned only once. The translation from the user language to the event language utilises a function getLabel that generates for each user language variable a sequence of unique event identifiers whose lexicographic order reflects the sequence of assignments of that variable.

The basic idea of getLabel is to first identify the nested loop blocks of the given user language program, and then to establish a counter for each distinct variable symbol and each block. An assignment of a variable within nested blocks corresponds to an event identifier indexed by the counters of the enclosing blocks. Within each block, the corresponding counter is incremented for every assignment of its variable symbol. When going from one block into a nested inner block, the counters for the outer blocks are kept constant, while the counter for the inner block is incremented as the variable is reassigned in the inner block.

Special attention must be paid to the encoding of entering and leaving a block: in order to carry the reference to a variable over to a nested block, we establish a copy of its current identifier, such that the first access to the variable in the inner block refers to its last assignment in the outer block. Similarly, the last assignment of a variable in an inner block is passed back to the outer block by copying the last identifier of the inner block to the next identifier of the outer block.

Example 3.

Consider the following user language program (left) and its translation to an event program (right).

1: M = 7                     A:
2: M = M+2                   B:
3: for i in range(0,2):      C:
                             D:
4:  M = M+i                  E:
5:  for j in range(0,3):     F:
                             G:
6:   M = M+1                 H:
                             I:
                             J:
7: M = M+1                   K:

The user language program has three nested blocks. Within each block, the respective counter is incremented for each assignment of M: one counter for the outer block, a second one in the middle block, and a third for the innermost block. The encodings for entering and leaving a block are in lines C and F, and lines I and J, respectively.  
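The renaming performed by getLabel is reminiscent of static single assignment: every reassignment of a variable yields a fresh, immutable identifier, and uses of the variable refer to its latest identifier. The sketch below illustrates the counter idea for a straight-line (loop-free) program only; the handling of nested blocks, block-entry and block-exit copies, and loop-parametrised identifiers described above is not modelled here:

# Illustrative renaming of a mutable variable into immutable identifiers
# for a straight-line program: each assignment gets a fresh identifier,
# and right-hand sides refer to the latest identifiers.
def rename(statements):
    counter = {}       # per-variable assignment counter
    current = {}       # variable -> identifier of its latest assignment
    declarations = []
    for var, expr in statements:
        resolved = expr.format(**current)      # substitute latest identifiers
        counter[var] = counter.get(var, 0) + 1
        new_id = "{}_{}".format(var, counter[var])
        declarations.append("{} := {}".format(new_id, resolved))
        current[var] = new_id
    return declarations

for line in rename([("M", "7"), ("M", "{M} + 2"), ("M", "{M} + 1")]):
    print(line)
# M_1 := 7
# M_2 := M_1 + 2
# M_3 := M_2 + 1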


Translation of arrays. Since arrays in a user language program have a known, fixed size, their translation is straightforward: a d-dimensional array translates to one distinct event identifier per array cell.


Translation of reduce_* calls. According to the grammar in Figure 4, reduce operations can only be applied to anonymous arrays created by list comprehension. Since the range bounds are constants, each call can be unfolded over its range. The expression reduce_and([EXPR for ID in range(FROM, TO) if COND]) is translated to a conjunction, over all values of ID in the range, of events asserting EXPR whenever COND holds. Symmetrically, reduce_or translates to a disjunction of events asserting that COND and EXPR both hold, reduce_sum to a sum of c-values conditioned on COND, and reduce_mult to the corresponding product. A call to reduce_count([EXPR for ID in range(FROM, TO) if COND]) translates to a sum of c-values that contribute 1 whenever COND holds.
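Since FROM and TO are known constants, the unfolding can be made explicit. The sketch below illustrates it for reduce_and, producing one conjunct per index such that indices failing the condition cannot falsify the conjunction (the helper constructors and the textual event representation are hypothetical, not ENFrame's API):

# Illustrative unfolding of reduce_and([EXPR for ID in range(FROM, TO) if COND])
# into a finite conjunction: an index failing COND must not falsify the result,
# so each conjunct asserts EXPR only when COND holds at that index.
def unfold_reduce_and(expr_event, cond_event, lo, hi):
    # expr_event(i) and cond_event(i) are hypothetical constructors returning
    # textual events for EXPR and COND instantiated at index i.
    conjuncts = ["(not {} or {})".format(cond_event(i), expr_event(i))
                 for i in range(lo, hi)]
    return " and ".join(conjuncts)

print(unfold_reduce_and(lambda i: "E[{}]".format(i),
                        lambda i: "C[{}]".format(i), 0, 3))
# (not C[0] or E[0]) and (not C[1] or E[1]) and (not C[2] or E[2])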

4 Probability Computation

The probability computation problem is known to be #P-hard already for simple events representing propositional formulas such as positive bipartite formulas in disjunctive normal form [32]. In ENFrame, we need to compute probabilities of a large number of interconnected complex events. Although the worst-case complexity remains hard, we attack the problem with three complementary techniques: (1) bulk-compile all events into one decision tree while exploiting the structure of the events to obtain smaller trees, (2) employ approximation techniques to prune significant parts of the decision tree, and ultimately (3) distribute the compilation by assigning distinct distributed workers to explore disjoint parts of the tree.

We next introduce the bulk-compilation technique, look at three approximation approaches, and discuss how to distribute the probability computation.

4.1 Compilation of Event Programs

Algorithm: Compile(network, absolute error ε) — exact and approximate compilation of an event network. [Pseudocode partially recoverable.] Compile initialises an empty mask for every node of the network, sets each compilation target’s probability lower bound to 0 and upper bound to 1, fixes the error budget (0 for exact compilation), and calls the recursive procedure dfs on the empty branch. dfs(network, masks, branch, error budgets) returns immediately if the remaining error budget suffices to approximate the remaining targets; otherwise it propagates the current variable assignment as masks into the DAG, returns if all targets are masked or approximated, allots part of the error budget to the left DFS branch, explores the left branch while recording the residual budget, computes the budget for the right branch, and explores the right branch only if some target’s probability bounds have not yet reached an ε-approximation.

Event programs consist of interconnected events, which are represented in an event network: a graph representation of the event program in which nodes are, e.g., Boolean connectives, comparisons, aggregates, and c-values. An example of such a network is discussed in Example 4.

The goal is to compute probabilities for the top nodes in the network, which are referred to as compilation targets. These nodes represent events such as “a given object is assigned to a given cluster in a given iteration”. We keep lower and upper bounds for the probability of each target. Initially, these bounds are 0 and 1; they converge towards each other during the computation.

The bulk-compilation procedure is based on Shannon expansion: select an input random variable x and partially evaluate each compilation target Φ to Φ_x for x set to true and to Φ_¬x for x set to false. Then, the probability of Φ is given by P(Φ) = P(x)·P(Φ_x) + P(¬x)·P(Φ_¬x). We are now left with two simpler events Φ_x and Φ_¬x. By repeating this procedure, we eventually resolve all variables in the events to the constants true or false. The trace of this repeated expansion is the decision tree. We need not materialise the tree. Instead, we explore it depth-first and collect the probabilities of all visited branches, recording for each event a lower bound, to which the probabilities of branches that satisfy the event are added, and an upper bound, from which the probabilities of branches that falsify the event are subtracted. At any time, these two quantities bound the probability of the event from below and from above. This compilation procedure needs time polynomial in the network size (and in the size of the input data set), yet in the worst case (unavoidably) exponential in the number of variables used by the events.

For practical reasons, we do not construct Φ_x and Φ_¬x explicitly, but keep minimal information that, together with the network, uniquely defines them. The process of computing this minimal information is called masking. We achieve this by traversing the network bottom-up and remembering the nodes that become true or false given the values of their children. When a compilation target is eventually masked under a partial variable assignment, the probability of that assignment is added to the target's lower bound if the target is masked true, or subtracted from its upper bound if it is masked false. If one or more targets are left unmasked, a next variable is chosen and the process is repeated with the assignment extended by that variable set to true or to false. The algorithm chooses the next variable such that it influences as many events as possible.

Once all compilation targets are masked by an assignment, the compilation backtracks and selects a different assignment for the most recently chosen variable whose assignments are not exhausted. When all branches of the decision tree have been investigated, the probability bounds of the targets have necessarily converged and the algorithm terminates.
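A stripped-down version of this bulk compilation, operating on events given as three-valued functions of a partial valuation instead of on masked event networks, might look as follows (an illustrative sketch; the variable order is fixed and no approximation or pruning is performed):

# Three-valued connectives over True, False, and None (undecided).
def t_and(a, b):
    if a is False or b is False: return False
    if a is True and b is True: return True
    return None

def t_or(a, b):
    if a is True or b is True: return True
    if a is False and b is False: return False
    return None

def t_not(a):
    return None if a is None else not a

def compile_events(events, variables, probs):
    # Lower bounds start at 0 and collect satisfying branches;
    # upper bounds start at 1 and shed falsifying branches.
    lower = [0.0] * len(events)
    upper = [1.0] * len(events)

    def dfs(valuation, branch_prob):
        decided = [e(valuation) for e in events]
        if all(d is not None for d in decided):
            # All targets are decided on this branch: update bounds, backtrack.
            for i, d in enumerate(decided):
                if d:
                    lower[i] += branch_prob
                else:
                    upper[i] -= branch_prob
            return
        # Shannon expansion on the next unassigned variable.
        x = next(v for v in variables if v not in valuation)
        dfs({**valuation, x: True}, branch_prob * probs[x])
        dfs({**valuation, x: False}, branch_prob * (1.0 - probs[x]))

    dfs({}, 1.0)
    return lower, upper   # equal once the whole tree has been explored

# Hypothetical interconnected events over three independent variables.
probs = {'x1': 0.7, 'x2': 0.4, 'x3': 0.9}
e1 = lambda v: t_and(v.get('x1'), v.get('x2'))
e2 = lambda v: t_or(e1(v), t_not(v.get('x3')))
print(compile_events([e1, e2], list(probs), probs))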

Example 4.
Consider a simplified event network under an assignment of two of its input variables. The masks of the two assigned variables are propagated to event nodes, which are now also masked; the highlighted nodes indicate which parts of the network are masked by the assignment.  

5 Related Work

Our work is at the confluence of several active research areas: probabilistic data management, data analytics platforms, and provenance data management.

Figure 5: Performance of distributed probability computation as a function of the number of workers.

Probabilistic data mining and querying. Our work adds to a wealth of literature on this topic [1, 34] along two directions: distributed probability computation techniques and a unified formalisation of several clustering algorithms in line with work on probabilistic databases.

Distributed probability computation has been approached so far only in the context of the SimSQL/MCDB system, where approximate query results are computed by Monte Carlo simulations [24, 9]. This contrasts with our approach in that MCDB was not designed for exact and approximate computation with error guarantees and does not exploit at runtime symbolically-represented correlations allowed by pc-tables and by ENFrame.

Early approaches to mining uncertain data are based on imprecise (fuzzy) data, for example using intervals, and produce fuzzy (soft) and hard output. Follow-up work shifted to representing uncertainty by (independent) probability density functions per data point. In contrast, we allow for arbitrarily correlated discrete probability distributions. The importance of correlations has been previously acknowledged for clustering [37] and frequent pattern mining [35]. A further key aspect of our approach that is not shared by existing uncertain data mining approaches is that we follow the possible worlds semantics throughout the whole mining process. This allows for exact and approximate computation with error guarantees and a sound semantics of the mining process that is compatible with probabilistic databases. This cannot be achieved by existing work; for instance, most existing k-means clustering approaches for uncertain data define cluster centroids using expected distances between data points [11, 30, 17, 27, 19, 25] or the expected variance of all data points in the same cluster [18]; they also compute hard clusterings where the centroids are deterministic. The recently introduced UCPC approach to k-means clustering [20] is the first work to acknowledge the importance of probabilistic cluster centroids. However, it assumes independence in the input and does not support correlations.



Data analytics platforms. Support for iterative programs is essential in many applications including data mining, web ranking, graph analysis, and model fitting. This has recently led to a surge in data-intensive computing platforms with built-in iteration capability. Mahout is a library that supports iterative programming on large clusters [5]. HaLoop allows iterative applications to be assembled from existing MapReduce Hadoop programs [8]. REX supports iterative distributed computations along database operations in which changes are propagated between iterations [29]. MADlib is an open-source library for scalable in-database analytics [21]. Similar in spirit, Bismarck is a unified architecture for in-database analytics [13]. In the Iterative Map-Reduce-Update programming abstraction for machine learning, user programs are compiled into declarative, optimisable Datalog code [7]. Platforms that facilitate uniform treatment of data-intensive tasks were also proposed outside the data management community, e.g., to support expressive languages for recursive problems that can be used to automatically synthesise programs targeting a massively parallel processor [10].

A key aspect that differentiates ENFrame from the above platforms is the probabilistic nature of input data and of the whole computation process. This calls for specifically tailored algorithms. So far, ENFrame lacks the scalability achievable by the above platforms, since it only distributes the probability computation task, while the actual mining task is performed on one machine. The next step is to consider a fully distributed computational approach.



Provenance in database and workflow systems. To enable probability computation, we trace fine-grained provenance of the user computation. This is in line with a wealth of work in probabilistic databases [34]. Our event language is influenced by work on provenance semirings [16] and semimodules [4, 14] that capture provenance for positive queries with aggregates in relational databases. Our ⊗ construct resembles the algebraic structure of a semimodule that is a tensor product of the Boolean semiring freely generated by the input random variables and of the SUM monoid over the reals. There are two differences between our construct and these structures. Firstly, we allow negation in events, which is not captured by the Boolean semiring. Secondly, even for positive events, our construct is not a semimodule since it violates the law (Φ1 ∨ Φ2) ⊗ v = (Φ1 ⊗ v) + (Φ2 ⊗ v): under an assignment that maps both Φ1 and Φ2 to true, the left side of the equality evaluates to v, whereas the right side becomes v + v. Furthermore, our event language allows defining events via iterations, as needed to succinctly trace data mining computation.

Workflows employ a much wider variety of programming constructs than databases. Workflow provenance aims to capture a complete description of the evaluation of a workflow [12], though it treats tasks as black boxes and therefore considers all outputs of a task to depend on all of its inputs. This provenance model is too coarse to support the exact derivations of output needed in our case for probability computation.

A distinct line of work is on relational provenance systems such as Perm [15], DBNotes [6], and Orchestra [23] that trace provenance using query rewriting or modified query operators. Panda [22] enables provenance-aware querying.



Acknowledgements. This research was supported by EPSRC grant agreement ADEPT (EP/I000194/1).


References

  1. C. Aggarwal. Managing and Mining Uncertain Data. Kluwer, 2009.
  2. C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, chapter A Survey of Uncertain Data Clustering Algorithms. Chapman and Hall, 2013.
  3. P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
  4. Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, 2011.
  5. Apache Software Foundation. The Mahout machine learning library. http://mahout.apache.org. v0.7.
  6. D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. VLDB Journal, 2005.
  7. V. R. Borkar, Y. Bu, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Declarative systems for large-scale machine learning. Data Eng. Bull., 35(2), 2012.
  8. Y. Bu, B. Howe, M. Balazinska, and M. Ernst. The HaLoop approach to large-scale iterative data analysis. VLDB J., 2012.
  9. Z. Cai, Z. Vagena, L. Perez, S. Arumugam, P. J. Haas, and C. Jermaine. Simulation of database-valued Markov chains using SimSQL. In SIGMOD, 2013.
  10. L. Cartey, R. Lyngsø, and O. de Moor. Synthesising graphics card programs from DSLs. In PLDI, 2012.
  11. M. Chau, R. Cheng, B. Kao, and J. Ng. Uncertain data mining: An example in clustering location data. In PAKDD, 2006.
  12. S. Davidson, S. Cohen-Boulakia, A. Eyal, B. Ludäscher, T. McPhillips, S. Bowers, and J. Freire. Provenance in scientific workflow systems. Data Eng. Bull., 32(4), 2007.
  13. X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
  14. R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5), 2012.
  15. B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, 2009.
  16. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
  17. F. Gullo, G. Ponti, and A. Tagarelli. Clustering uncertain data via k-medoids. In SUM, 2008.
  18. F. Gullo, G. Ponti, and A. Tagarelli. Minimizing the variance of cluster mixture models for clustering uncertain objects. In ICDM, 2010.
  19. F. Gullo, G. Ponti, A. Tagarelli, and S. Greco. A hierarchical algorithm for clustering uncertain data via an information-theoretic approach. In ICDM, 2008.
  20. F. Gullo and A. Tagarelli. Uncertain centroid based partitional clustering of uncertain data. PVLDB, 2012.
  21. J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12), 2012.
  22. R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Eng. Bull., 33(3), 2010.
  23. Z. Ives, T. Green, G. Karvounarakis, N. Taylor, V. Tannen, P. P. Talukdar, M. Jacob, and F. Pereira. The Orchestra collaborative data sharing system. SIGMOD Rec., 2008.
  24. R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, and P. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. ACM TODS, 36(3), 2011.
  25. B. Kao, S. Lee, F. Lee, D. Cheung, and W. Ho. Clustering uncertain data using Voronoi diagrams and R-Tree index. TKDE, 2010.
  26. C. Koch and D. Olteanu. Conditioning probabilistic databases. In VLDB, 2008.
  27. H. Kriegel and M. Pfeifle. Density-based clustering of uncertain data. In SIGKDD, 2005.
  28. M. Michel and C. Eastham. Improving the management of MV underground cable circuits using automated on-line cable partial discharge mapping. In CIRED, 2011.
  29. S. Mihaylov, Z. Ives, and S. Guha. REX: Recursive, delta-based data-centric computation. PVLDB, 5(11), 2012.
  30. W. Ngai, B. Kao, C. Chui, R. Cheng, M. Chau, and K. Yip. Efficient clustering of uncertain data. In ICDM, 2006.
  31. S. Omurca and N. Duru. Decreasing iteration number of k-medoids algorithm with IFART. In ELECO, 2011.
  32. J. Provan and M. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4), 1983.
  33. P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
  34. D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.
  35. L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In KDD, 2010.
  36. S. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, 2000.
  37. P. B. Volk, F. Rosenthal, M. Hahmann, D. Habich, and W. Lehner. Clustering uncertain data with possible worlds. In ICDE, 2009.