Spec-QP: Speculative Query Planning for Joins over Knowledge Graphs

Madhulika Mohanty1,*, Maya Ramanath1, Mohamed Yahya2, Gerhard Weikum2

1 IIT Delhi, New Delhi, India
2 Max Planck Institute for Informatics, Saarbrücken, Germany
* madhulikam@cse.iitd.ac.in

Abstract

Organisations store huge amounts of data from multiple heterogeneous sources in the form of Knowledge Graphs (KGs). One of the ways to query these KGs is to use SPARQL queries over a database engine. Since SPARQL follows exact match semantics, the queries may return too few or no results. Recent works have proposed query relaxation where the query engine judiciously replaces a query predicate with similar predicates using weighted relaxation rules mined from the KG. The space of possible relaxations is potentially too large to fully explore and users are typically interested in only the top-k results, so such query engines use top-k algorithms for query processing. However, they may still process all the relaxations, many of whose answers do not contribute towards the top-k answers. This leads to computation overheads and delayed response times.

We propose Spec-QP, a query planning framework that speculatively determines which relaxations will have their results in the top-k answers. Only these relaxations are processed using the top-k operators. We, therefore, reduce the computation overheads and achieve faster response times without adversely affecting the quality of results. We tested Spec-QP over two datasets, XKG and Twitter, to demonstrate the efficiency of our planning framework at reducing runtimes with reasonable accuracy for query engines supporting relaxations.


1 Introduction

The availability of immense amounts of digitized data and recent advances in automatic information extraction have made the construction of large Knowledge Bases (KBs) possible. These KBs are typically stored as RDF triples ⟨s, p, o⟩, where s is the subject, o is the object and p is the predicate. Prominent examples of freely available KBs include YAGO [28], DBpedia [2], Freebase [5], etc.

These RDF KBs are queried using the SPARQL query language which, at its core, consists of triple patterns. For example, the following SPARQL query asks: "Which singers also write lyrics and play guitar and piano?".

SELECT ?s WHERE{
     ?s ‘rdf:type’ <singer>.
     ?s ‘rdf:type’ <lyricist>.
     ?s ‘rdf:type’ <guitarist>.
     ?s ‘rdf:type’ <pianist>
}

where ?s is a variable to be bound in each of the triple patterns and to be returned as a result.

Original | Relaxations
<singer> | <vocalist>, <jazz_singer>, <artist>
<lyricist> | <writer>
<guitarist> | <musician>, <instrumentalist>
<pianist> | <percussionist>
Table 1: Example relaxations

An exhaustive list of such singers in the KB can be computed, but users who issue such queries typically want only the top-k ranked results. Ranking of SPARQL query results has been studied before in [19, 9, 11], and these works typically make use of scores for each triple in the KB (the scores could be based on confidence values, popularity, etc.). However, a problem that users sometimes face when they issue such queries is low recall. That is, the KB may not have enough results to return (in some cases, the KB may have zero results if one or more of the triple patterns do not have a match). In these cases, it is desirable to relax the query by changing one or more of the triple patterns, while ensuring that the query still reflects the original information need. For example, a possible relaxation of the query above is to change the triple pattern ?s ‘rdf:type’ <singer> to ?s ‘rdf:type’ <vocalist>. Previous works have dealt with performing these relaxations automatically and ranking the corresponding results [10, 14, 25, 37]. In this paper, we address the problem of efficiently evaluating these relaxed queries.

Query Processing

Processing queries and their relaxations to return top-k results is computationally expensive. For example, assuming that every triple pattern in the above query has the relaxations shown in Table 1, this would lead to a total of (3+1) × (1+1) × (2+1) × (1+1) = 48 unique queries (that is, the original query, queries with one relaxation, queries with two relaxations, etc.). A naive method would compute the results to each query, sort the results by score and return the top-k.
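To make the counting explicit, the following minimal Python sketch reproduces the arithmetic: each triple pattern can either be kept as-is or replaced by one of its relaxations from Table 1, so the number of query variants is the product of (1 + number of relaxations) over the patterns.

# Number of query variants obtainable by independently keeping each triple
# pattern as-is or replacing it with one of its relaxations (Table 1).
relaxations_per_pattern = {
    "<singer>": 3,      # <vocalist>, <jazz_singer>, <artist>
    "<lyricist>": 1,    # <writer>
    "<guitarist>": 2,   # <musician>, <instrumentalist>
    "<pianist>": 1,     # <percussionist>
}

total_queries = 1
for num_relaxations in relaxations_per_pattern.values():
    total_queries *= (1 + num_relaxations)  # original choice + each relaxation

print(total_queries)  # 4 * 2 * 3 * 2 = 48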

Figure 1: Query processing by TriniT. One incremental merge operator is required for each triple pattern and its relaxations. Rank joins are performed over these incremental merges to get the top-k.

The TriniT [37] system proposed a mechanism to improve on this naive method. The idea was to compute results from all relaxations simultaneously, but in a way that drastically reduced wasteful computations. To this end, TriniT uses two operators: Incremental Merge [29] (to process the relaxations for a given triple pattern) and Rank Join [15] (to compute (partial) join results in sorted order). This is illustrated in Figure 1. However, this method still results in wasted resources, since not all relaxations will contribute a result to the top-k. For the example query, if we were able to predict that none of the relaxations for the triple patterns with <singer> and <pianist> will contribute a result to the top-k, then we could replace these incremental merge operators with plain rank joins over the non-relaxed triple patterns.

In this paper, we propose Spec-QP, a speculative approach to prune the space of relaxations, resulting in efficient top-k processing of SPARQL queries.

Approach and contributions

Given a KG and a SPARQL query over it, we want to devise an efficient strategy for query processing in the following scenario:

  • The dataset comprises triples.

  • Triples are associated with scores [19, 9, 11] and the score of a result is an aggregate of the individual triple scores.

  • A query can be rewritten using weighted relaxations mined from the KG [10, 37].

  • The user is interested in only the top-k answers.

We propose a speculative approach for pruning the space of possible relaxations for a given query. We make use of precomputed statistics about the distribution of scores of the matches to triple patterns in order to speculate on the requirement of relaxations for each triple pattern. This precomputed metadata is an approximation of the score distribution of the answers from the corresponding triple pattern and not the actual scores. When a user enters a query, we estimate the top answer scores that can be achieved using the possible relaxations. This estimation is done using the score distributions and the join cardinality estimates. We then prune those relaxations which are unlikely to contribute triples to the top-k answers based on the top score estimates. Note that our work is orthogonal to any query engine as it can be used on top of any existing graph database engine.

Our main contributions are summarized as follows.

  1. A model for the score distribution of individual triple patterns.

  2. A technique to estimate the scores of answers to a query using the above model, and its use in predicting the presence of answers from each triple pattern’s relaxations in the top-k.

  3. Pruning the space of relaxations to achieve improved response times over the baseline with high prediction accuracy, thereby aiding effective exploration of KGs.

Organisation

The rest of the paper is organised as follows: Section 2 introduces some useful definitions and explains the TriniT query processing approach. Section 3 outlines Spec-QP, the proposed speculative approach to query planning, and explains how the plan is executed once the planner generates a plan. Section 4 summarizes and discusses the experimental results. Section 5 lists the related work and finally Section 6 concludes the paper with future work directions.

2 Preliminaries

This section introduces some preliminary notions and definitions that will be used henceforth.

Definition 1.

Knowledge Graphs
Given a set of entities E and a set of predicates P, a triple t is a tuple
⟨s, p, o⟩ such that s ∈ E, p ∈ P, and o ∈ E. Here, s is called the “subject”, p is the “predicate” and o is the “object” of the triple t. Each triple t is associated with a score, denoted by score(t). These scores represent confidence values or popularity of the triples as previously studied in [19, 9, 11]. A set of such triples can be represented as a graph, which we call a Knowledge Graph, KG.

Definition 2.

Triple pattern
A triple pattern q is of the form
⟨S, P, O⟩, where S, P and O could either be entities or predicates from the KG, or variables. Variables are always prefixed with a question mark. A triple pattern matches any triple in the KG having the same values in the designated fields. The variables are then bound to the corresponding values in the triple.

Definition 3.

Triple pattern query
A triple pattern query Q is a set of triple patterns, Q = {q1, q2, …, qn}.

Definition 4.

Answer for a Triple pattern query
Given a triple pattern query Q and a KG, an answer for the query, denoted by α, is a mapping of the variables in Q to values in the KG such that the application of this mapping to each triple pattern q ∈ Q, denoted α(q), results in a triple in the KG. The set of all the answers to a query Q is denoted by A(Q).

Definition 5.

Score of a triple matching a triple pattern
The score of a triple t which matches the triple pattern q is denoted by score(t, q) and is computed by normalising the triple’s score by the highest score among all triples matching q:

score(t, q) = score(t) / max{score(t′) : t′ matches q}

The value of score(t, q) ranges between 0 and 1.

Definition 6.

Score of an answer
The score of an answer α to a query Q is the aggregation (sum) of the scores of the triples resulting from applying the answer mapping to each triple pattern in the query. That is,

score(α, Q) = Σ_{q ∈ Q} score(α(q), q)

This has been studied previously in [10, 14, 25, 37].

Definition 7.

Weighted relaxation rule
A weighted relaxation rule r is a triple ⟨q_d, q_r, w⟩ where q_d and q_r are triple patterns respectively called the domain and range of the relaxation, and the weight w denotes the reduction in scores of the triples matching the relaxed triple pattern. Automatic computation of relaxations and the corresponding weights has been studied in [10, 37].

For example, ?x ‘rdf:type’ <singer> could be relaxed to ?x ‘rdf:type’ <vocalist> with a weight of w, i.e.,
r = ⟨?x ‘rdf:type’ <singer>, ?x ‘rdf:type’ <vocalist>, w⟩.

Definition 8.

Relaxed Query
Given a query Q and a relaxation r = ⟨q_d, q_r, w⟩, we say that r applies to Q if q_d ∈ Q. The result of applying r to Q is a new query Q′ = (Q \ {q_d}) ∪ {q_r}, called the relaxed query.

The score of an answer α obtained through relaxation r applied to a query Q is defined over Q′ with the contribution of the relaxed triple pattern discounted by the weight w:

score(α, Q′) = w · score(α(q_r), q_r) + Σ_{q ∈ Q, q ≠ q_d} score(α(q), q)

The score is reduced further for each subsequent relaxation in a similar manner. Since the same answer could be obtained from multiple relaxed queries, the score of an answer with respect to the original query and a space of possible relaxations is defined as the maximum score obtained through any relaxation.

2.1 Non-Speculative Query Processing (TriniT)

As mentioned in the Introduction, TriniT computes results from all relaxations simultaneously using two operators: Incremental Merge [29] and Rank Join [15]. Given a query Q = {q1, q2, q3} and the relaxations of each of its triple patterns, Figure 2 shows the query plan generated by TriniT. Incremental Merge is used to efficiently scan the lists of matches to a triple pattern and all its relaxations and to output a single merged, sorted list for that triple pattern. Each of the three incremental merge operators in the example takes as inputs the sorted lists of matches (recall that each triple is associated with a score) for one of the triple patterns q1, q2 and q3 and its relaxations, and outputs a combined sorted list of triples for that triple pattern along with its relaxations. The rank join computes a join of two sorted inputs in an incremental manner until enough results have been produced, while minimising the number of answers read from each list to get the top-k answers. This avoids computing the entire join and then sorting it. The inputs for Rank Joins are either the outputs of Incremental Merges or of other Rank Joins. Both operators use priority queues for already seen answers and maintain upper bounds to estimate the scores of the answers that can still be obtained by reading further into the lists at any given point. This avoids accessing the entire lists of (partial) answers and aids early termination.
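To illustrate the merging aspect of the Incremental Merge operator, the following is a minimal Python sketch, not TriniT's actual implementation: it lazily merges the score-sorted match lists of a triple pattern and its relaxations, discounting each relaxation's scores by its weight. The bound maintenance and priority queues of the real operator [29] are omitted, and the example names and scores are purely illustrative.

import heapq

def weighted_stream(matches, weight):
    """Discount a score-sorted list of (score, triple) matches by a relaxation
    weight; the original triple pattern is passed with weight 1.0."""
    for score, triple in matches:
        yield (weight * score, triple)

def incremental_merge(weighted_lists):
    """Lazily merge the match lists of a triple pattern and its relaxations
    into one stream sorted by weighted score. `weighted_lists` holds
    (weight, matches) pairs, each matches list sorted by score descending."""
    streams = [weighted_stream(matches, weight) for weight, matches in weighted_lists]
    # heapq.merge reads the inputs lazily, so only as many matches are pulled
    # as the downstream rank join actually asks for.
    return heapq.merge(*streams, key=lambda pair: pair[0], reverse=True)

# Example: <singer> (weight 1.0) merged with its relaxation <vocalist> (0.8).
singer = [(1.0, "john"), (0.6, "paul")]
vocalist = [(0.9, "freddie"), (0.5, "ringo")]
for score, entity in incremental_merge([(1.0, singer), (0.8, vocalist)]):
    print(round(score, 2), entity)
# -> 1.0 john, 0.72 freddie, 0.6 paul, 0.4 ringo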

However, TriniT still processes relaxations from all the triple patterns, many of which do not contribute triples towards the top-k answers. Our technique aims to eliminate this inefficiency.

Figure 2: Query plan generated by TriniT for the query Q = {q1, q2, q3}. One incremental merge operator is required for each triple pattern and its relaxations. A rank join operator takes in two sorted lists and produces a ranked list of (partial) answers from the join.

3 Spec-QP, the Speculative Framework for optimizing Query Plans

The Incremental Merge and Rank Join operators were introduced by TriniT to flexibly perform relaxations without the need to fully explore the space of all possible answers. Note that in the absence of relaxations, we could have simply resorted to rank joins over sorted answer lists of only the original triple patterns. These joins are straightforward and much faster than processing each triple pattern and its relaxations using incremental merges and joining over them.

We propose Spec-QP, a query planning approach which uses a predictor to predict whether the relaxations of a triple pattern are likely to be required for producing the top-k answers. We need not process relaxations for those triple patterns whose relaxations are predicted to be not required. The predictor uses an expected score estimator based on the precomputed statistics about the distribution of the scores for triple pattern matches. We first describe the estimator and then give details of the planning approach.

3.1 Expected score estimator

Figure 3: Score distribution for answers of a triple pattern modelled as a two-bucket histogram.

The expected score estimator is based on order statistics and estimates the expected scores at given ranks for the original as well as the relaxed queries. These are used by the query planner to predict the presence of answers from a relaxation in the top-k.

The matching triples for a triple pattern have scores represented by independent and identically distributed random variables X_1, X_2, …, X_n, each with a common distribution f. Here, f is the probability distribution of the scores of the answers for a triple pattern (or relaxation) q from the KG. The cumulative distribution function (cdf) is represented by F. The set {X_1, …, X_n} is a sample of size n taken from the distribution f. The set of the observed answer scores {x_1, …, x_n} of these random variables is called a realization of the sample. X_(1) ≤ X_(2) ≤ … ≤ X_(n) are the random variables resulting from arranging the values of X_1, …, X_n in increasing order, and X_(i) is called the i-th order statistic. Given these random variables and their distributions, we need to estimate the score distribution for the answers of the query Q. Y_1, …, Y_N are the random variables representing the scores of the N answers to the query Q (possibly composed of a single triple pattern). Y_(1) is the first order statistic, corresponding to the lowest scoring answer among all the answers of Q, and Y_(N) is the N-th (or largest) order statistic, corresponding to the highest scoring answer (ranked 1). A relaxed query would contribute an answer to the top-k only when the expected highest score among its answers exceeds the expected score at rank k of the original query. In order to compute the expected value at a given rank, we use the result given in [7]: for n i.i.d. random variables, each with a common distribution with cdf F, the expected value of the i-th order statistic can be approximated as E[X_(i)] ≈ F^{-1}(i / (n + 1)). Using this, the expectation of Y_(N−k+1), the score of the answer at rank k, can be approximated as G^{-1}((N − k + 1) / (N + 1)), where G denotes the cdf of the scores of the answers to the query and N is the number of answers of Q.
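As a quick numerical illustration of the approximation E[X_(i)] ≈ F^{-1}(i/(n+1)), the following Python sketch compares a Monte-Carlo estimate of the expected i-th order statistic with the formula for the Uniform(0, 1) distribution, for which the formula is exact (E[X_(i)] = i/(n+1)); the sample sizes used are arbitrary.

import random

def simulated_order_stat_mean(i, n, trials=20000):
    """Monte-Carlo estimate of E[X_(i)] for a sample of n Uniform(0,1) draws."""
    total = 0.0
    for _ in range(trials):
        sample = sorted(random.random() for _ in range(n))
        total += sample[i - 1]          # i-th smallest value in the sample
    return total / trials

n, i = 50, 45
approx = i / (n + 1)                    # F^{-1}(i/(n+1)) with F(x) = x on [0,1]
print(round(simulated_order_stat_mean(i, n), 3), round(approx, 3))
# both values are close to 0.882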

We now give the details of the construction of the probability density function (pdf) of these random variables.

3.1.1 Score Distributions for the Triple Patterns:

For every triple pattern q in the KG, we store the following precomputed statistics about the scores of the matching triples:

  • n: the total number of triples matching the triple pattern.

  • C_j: the cumulative score of the answers over the ranks 1 through r_j, for j = 1, …, b. As described later, these ranks r_j form the bucket boundaries for the histograms of the score distributions.

  • s_j: the score at rank r_j, for j = 1, …, b.

We now estimate the score distribution for the answers to a triple pattern q. Note that the ranks will not be explicitly reflected here; it is just the distribution of the answer score values from which each score in {x_1, …, x_n} is assumed to be independently sampled. f and F are used to denote the pdf and cdf respectively.

The pdf can be modelled as a b-bucket histogram: the score axis is partitioned into b intervals whose boundaries are the scores s_1, …, s_b at the ranks r_1, …, r_b. The pdf is essentially a uniform distribution within each bucket, with the height of the bucket proportional to the score mass contained in it.

In order to find the best fitting number of buckets, we observed the scores of the answers to a few random triple patterns, sorted in decreasing order. We found that they followed a power-law distribution, which roughly obeys the 80–20 rule: about 80% of the score mass lies in the top 20% of the answers. We, therefore, chose two-bucket histograms to represent the score distributions (as shown in Figure 3). The short and tall bucket represents the interval which has 80% of the score mass. The longer bucket represents the long tail having only 20% of the score mass. We store only the following values for each triple pattern:

  • n: the total number of triples matching the triple pattern.

  • s_r: the score of the answer at rank r, where r is the rank within which 80% of the score mass is contained for the triple pattern matches.

  • C_1: the cumulative score of the answers over the ranks 1 through r.

  • C_2: the cumulative score of the answers over the ranks r+1 through n.

The pdf f of each distribution is thus a piecewise-uniform function with two pieces: one over the score interval covering the top r matches (scores above s_r) and one over the interval covering the remaining n − r matches (scores up to s_r), with the height of each piece determined by the stored counts and the cumulative scores C_1 and C_2.

Integrating this pdf gives a piecewise-linear cdf F, with F equal to 0 at the lowest score and 1 at the highest score.
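The following Python sketch shows one plausible reading of the two-bucket model described above, not the paper's exact construction: the score axis is split at s_r, scores are uniform within each bucket, and the probability mass of the top bucket is taken to be the fraction of matches falling in it. The class and parameter names (TwoBucketHistogram, top_mass) are ours.

class TwoBucketHistogram:
    """Piecewise-uniform score distribution with one split point.

    Scores lie in [lo, hi], the split is at s_r (score at the 80%-mass rank r),
    and `top_mass` is the probability that a random match falls in the top
    bucket [s_r, hi]; here we take top_mass = r / n, the fraction of matches
    in that bucket. The paper's exact height normalisation may differ.
    """

    def __init__(self, lo, s_r, hi, top_mass):
        assert lo < s_r < hi and 0.0 <= top_mass <= 1.0
        self.bounds = [lo, s_r, hi]
        self.masses = [1.0 - top_mass, top_mass]      # tail bucket, top bucket

    def cdf(self, x):
        if x <= self.bounds[0]:
            return 0.0
        if x >= self.bounds[-1]:
            return 1.0
        acc = 0.0
        for a, b, m in zip(self.bounds, self.bounds[1:], self.masses):
            if x >= b:
                acc += m                              # whole bucket is below x
            else:
                acc += m * (x - a) / (b - a)          # uniform within bucket
                break
        return acc

    def inverse_cdf(self, p):
        """Score x with cdf(x) = p; used for E[X_(i)] ~ F^{-1}(i/(n+1))."""
        p = min(max(p, 0.0), 1.0)
        acc = 0.0
        for a, b, m in zip(self.bounds, self.bounds[1:], self.masses):
            if m > 0.0 and p <= acc + m:
                return a + (b - a) * (p - acc) / m
            acc += m
        return self.bounds[-1]

# A pattern with n = 1000 matches whose top r = 200 matches (scores >= 0.4)
# carry 80% of the score mass: the top bucket gets mass 200/1000 = 0.2.
h = TwoBucketHistogram(lo=0.0, s_r=0.4, hi=1.0, top_mass=0.2)
n, k = 1000, 10
print(round(h.inverse_cdf((n - k + 1) / (n + 1)), 3))  # expected k-th best score, ~0.97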

3.1.2 Score Distribution for the Triple Pattern Query:

Figure 4: Score distribution for a triple pattern query is computed as the convolution of the pdf’s of the constituent triple patterns.

The score of an answer for the triple pattern query is the sum of the scores of the individual triples in the answer. Since each triple is contributed by one triple pattern in the query and we have estimates for their scores, we can estimate the scores for answers to the query using the following approach.

Let us assume our triple pattern query is Q = {q1, q2}. {t_11, t_12, …} represents the triples matching q1 and {t_21, t_22, …} represents the triples matching q2. The scores of the triples matching these triple patterns have the distributions f_{q1} and f_{q2} respectively, as defined before. The scores of Q’s answers are represented by the random variables Y_1, …, Y_N. Each of these is a sum of two random variables, one drawn from f_{q1} and another from f_{q2}. The pdf of the sum of two random variables is given by the convolution of their individual pdfs (as depicted in Figure 4). Hence, the pdf for the scores of the answers to the query is given by the convolution of the pdfs of the scores for matches to the constituent triple patterns. The resulting pdf is a piecewise-linear function over multiple pieces. Given the number of results N in the combined distribution, we can re-estimate the two-bucket parameters of this distribution using the expected score computation from order statistics. This again results in a two-bucket histogram for the distribution of the scores of the answers to the query. For the computation of N, we use estimates of the join selectivity (traditional database systems use multiple heuristics to estimate join selectivity; for the purpose of this work, we have taken exact join selectivity values). For three or more triple patterns, we repeat the above process the required number of times to get the final histogram representing the score distribution for the query.
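The following Python sketch illustrates this step numerically, reusing the TwoBucketHistogram class from the sketch above: the two per-pattern pdfs are discretised on a grid, convolved to obtain the distribution of the joined answer scores, and the expected score at a given rank is then read off the resulting cdf using the order-statistics formula. The bucket parameters and the join-cardinality estimate below are made-up values for illustration only.

import numpy as np

def discretize(hist, grid):
    """Probability mass of a TwoBucketHistogram in each cell of `grid`
    (cell i spans [grid[i], grid[i+1]])."""
    return np.diff([hist.cdf(x) for x in grid])

# Two triple patterns with hypothetical two-bucket score models.
h1 = TwoBucketHistogram(lo=0.0, s_r=0.4, hi=1.0, top_mass=0.2)
h2 = TwoBucketHistogram(lo=0.0, s_r=0.6, hi=1.0, top_mass=0.3)

grid = np.linspace(0.0, 1.0, 101)           # step 0.01 over the score range
step = grid[1] - grid[0]
p1, p2 = discretize(h1, grid), discretize(h2, grid)

# The pmf of the sum of the two scores, i.e. of a joined two-pattern answer,
# is the convolution of the two per-pattern pmfs; its support is [0, 2].
p_sum = np.convolve(p1, p2)
cdf_sum = np.cumsum(p_sum)
scores = np.arange(1, len(p_sum) + 1) * step   # approximate score per cell

def expected_score_at_rank(cdf, score_axis, rank, n_answers):
    """E[score at `rank`] ~ G^{-1}((N - rank + 1) / (N + 1)) via order statistics."""
    p = (n_answers - rank + 1) / (n_answers + 1)
    idx = min(np.searchsorted(cdf, p), len(score_axis) - 1)
    return float(score_axis[idx])

n_answers = 500        # join-cardinality estimate for q1 joined with q2
print(expected_score_at_rank(cdf_sum, scores, rank=10, n_answers=n_answers))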

3.1.3 Score prediction

Once we have constructed the pdf and cdf representing the scores for the answers of a given query Q, we can estimate the expected score at a given rank k as E[Y_(N−k+1)] ≈ G^{-1}((N − k + 1)/(N + 1)), where G denotes the cdf of the query answer scores and N is the number of answers for Q. (Note that it is Y_(N−k+1) and not Y_(k), since the N-th order statistic represents the highest value, i.e., the answer at rank 1.) Given these estimates for scores at various ranks, we now generate the query plan.

3.2 Query Planning

Query Plan: Given a query Q, a query plan consists of subsets S_1, S_2, …, S_m of triple patterns where

  1. each S_i consists of one or more triple patterns from Q,

  2. the S_i’s are pairwise disjoint, and

  3. the union of the S_i’s equals Q.

For example, a query plan for the query Q = {q1, q2, q3} could be {{q1, q2}, {q3}}. The singletons correspond to the triple patterns which require relaxations.

Input: The query Q = {q1, q2, …, qn} and the result size k.
Output: The query plan P.

  S_join ← ∅, P ← ∅
  E_orig ← expected score at rank k for Q, obtained from the “expected score estimator”
  for each triple pattern q_i ∈ Q do
      r ← top-weighted relaxation for q_i
      Q′ ← Q with q_i relaxed using r
      E_rel ← expected top (rank-1) score for Q′, obtained from the “expected score estimator”
      if E_rel > E_orig then
          P ← P ∪ {{q_i}}              (q_i’s relaxations must be processed)
      else
          S_join ← S_join ∪ {q_i}      (q_i stays in the non-relaxed join group)
      end if
  end for
  return {S_join} ∪ P

Algorithm 1: PLANGEN generates the query plan.

3.2.1 Query plan generation

The key idea behind the planning approach is that the answers from all of the triple patterns’ relaxations do not necessarily appear in the top-k answers. We save on computation over such triple patterns by never processing their relaxations. For each triple pattern, only the top-weighted relaxation needs to be considered: due to the normalization of scores as per Definition 5, the top score obtainable from each relaxation is equal to its weight, so the top-weighted relaxation has the highest top score. Hence, we need to check only the top-weighted relaxation of each triple pattern for its potential to contribute answers towards the top-k.

Given a query and the score distribution for each triple pattern, the query plan is generated as outlined in Algorithm 1. PLANGEN first predicts the requirement of relaxations for each triple pattern. For prediction, the query planner uses the “expected score estimator” described in Section 3.1, which gives estimates of the expected score at rank k for the original query and the expected top-rank score for the highest-weighted relaxed query (considering one triple pattern at a time). If the topmost score from the relaxed query obtained by relaxing a given triple pattern exceeds the rank-k score of the original query, the planner predicts that the triple pattern’s relaxations are required. Note that our estimator takes into account join score distributions and join cardinalities when estimating the expected score for a given query. A sketch of this procedure is given below.
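The sketch gives our reading of PLANGEN in Python; estimate_score and top_relaxation are assumed callables standing in for the expected score estimator of Section 3.1 and the relaxation rules, and the stub values in the usage example are invented purely to show the control flow.

def plangen(query, k, estimate_score, top_relaxation):
    """Sketch of the PLANGEN idea (our reading of Algorithm 1, not the paper's
    exact pseudocode). `estimate_score(query, rank)` is assumed to wrap the
    expected score estimator of Section 3.1 and `top_relaxation(q)` to return
    the top-weighted relaxation of triple pattern q."""
    threshold = estimate_score(query, rank=k)   # expected k-th best score of Q
    join_group, singletons = [], []
    for q in query:
        relaxed_query = [top_relaxation(p) if p == q else p for p in query]
        # Expected topmost (rank-1) score when q alone is relaxed.
        if estimate_score(relaxed_query, rank=1) > threshold:
            singletons.append([q])   # q's relaxations may reach the top-k
        else:
            join_group.append(q)     # q joins the non-relaxed join group
    return [join_group] + singletons

# Toy usage with stub estimates: only q3's best relaxation beats the threshold.
stub_scores = {("q1", "q2", "q3"): 1.8, ("q1", "q2", "r3"): 2.1,
               ("r1", "q2", "q3"): 1.5, ("q1", "r2", "q3"): 1.2}
estimate = lambda q, rank: stub_scores[tuple(q)]
relax = lambda q: "r" + q[1:]
print(plangen(["q1", "q2", "q3"], k=10, estimate_score=estimate, top_relaxation=relax))
# -> [['q1', 'q2'], ['q3']]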

The query plan returned by PLANGEN will have only one subquery of size greater than 1, called the “join group” (the non-relaxed triple patterns); the rest will be singletons (triple patterns to be relaxed).

3.2.2 Query Execution

Given a speculative query plan with subsets S_1, …, S_m generated by the speculative query planner, we execute it in the following manner.

  1. The join group is executed as (left-deep) rank joins over the answer lists (sorted by score) of its triple patterns. Note that none of the triple patterns in this group are relaxed.

  2. Each singleton is processed by an Incremental Merge operator.

  3. Rank joins are performed over the join group and singletons.

Given Q = {q1, q2, q3}, when we predict that q1 and q2 are not going to be relaxed, the effective query plan is {{q1, q2}, {q3}}. We use rank joins to compute the join between the sorted lists of matches for q1 and q2 and require an incremental merge only for q3 and its relaxations. The results from these are joined using a Rank Join to get the final top-k answers. Note that we reduce the number of incremental merges required to 1, as compared to 3 by TriniT. This leads to less computation at run time and thus faster response times. Figure 5 illustrates this approach.
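A small Python sketch of the resulting operator counts, under the left-deep-plan reading described above (the function name and the plan encoding are ours):

def describe_plan(plan):
    """Count the operators needed for a speculative plan as produced by
    PLANGEN (join group first, then the singleton patterns to be relaxed),
    assuming a left-deep chain of rank joins."""
    join_group, singletons = plan[0], plan[1:]
    n_patterns = len(join_group) + len(singletons)
    n_inc_merges = len(singletons)                  # one per relaxed pattern
    if join_group:
        n_rank_joins = (len(join_group) - 1) + len(singletons)
    else:
        n_rank_joins = max(len(singletons) - 1, 0)
    print(f"{n_patterns} triple patterns: Spec-QP uses {n_inc_merges} incremental "
          f"merge(s) and {n_rank_joins} rank join(s); TriniT would use "
          f"{n_patterns} incremental merges and {n_patterns - 1} rank joins.")

describe_plan([["q1", "q2"], ["q3"]])   # the example plan discussed above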

The equivalent TriniT plan for this query will be {{q1}, {q2}, {q3}}, i.e., all triple patterns occur as different subsets and each of them is processed by an Incremental Merge, followed by Rank Joins over all these incremental merges (refer to Figure 2).

Figure 5: Query plan when only q3’s relaxations are predicted to be in the top-k. Only q3 requires an incremental merge. q1 and q2 are joined using a rank join over the sorted answer lists for each of them. One more rank join is required to join these results.

4 Experimental Evaluation

This section discusses the experimental evaluation performed for demonstrating the performance of the speculative planner.

4.1 Baseline

We test our system for faster response times with good accuracy in a querying platform which has to give top-k results in the scenario where the user query is allowed to be relaxed to get results that satisfy the desired information need. We compare Spec-QP with the non-speculative query processing engine (TriniT) (refer Section 2.1), which involves Incremental Merges for relaxations and Rank Joins for joins. Note that TriniT processes all the relaxations and outputs the true top-k.

4.2 Datasets used

We have used two datasets for the purpose of demonstrating the performance of the speculative planner. They are as follows:

  1. Extended Knowledge Graphs (XKG):
    We have used the eXtended Knowledge Graphs (XKG) introduced by TriniT [37]. This is an RDF-format dataset, but unlike standard RDF, the triples are composed of a mixture of textual tokens and IRIs. These “textual” content triples are constructed from a document corpus using OpenIE techniques and NED. Each such triple’s score is equal to the number of times the triple was encountered. This knowledge base, along with an RDF knowledge base (YAGO2s), is known as XKG (eXtended Knowledge Graph). The triple scores for YAGO2s triples are equal to the number of inlinks into the subject, i.e., the number of times the entity in the subject occurs in the object of any triple. XKG has millions of triples. This dataset was selected for having a rich variety of relaxations. We evaluated on queries which were manually constructed so as to have non-empty result sets. Each query has 2–4 triple patterns and each triple pattern has at least one relaxation. The relaxations were obtained using the scheme outlined in [37].

  2. Twitter tweets:
    The dataset was built using the Twitter Streaming API over trending hashtags. The stream was tracked over several days of trending tags during the month of April. The dataset has millions of unique triples of the form ⟨tweetID, hasTag, term⟩, where tweetID is the unique ID of a tweet and term is a term contained in the tweet with ID tweetID. A query over this dataset asks for the IDs of those tweets which contain all the queried terms. For example, the following query asks for the IDs of all those tweets which contain the terms ‘#intoyouvideo’, ‘#ariana’ and ‘dangerous’:

    SELECT ?s WHERE{
         ?s <hasTag> <#intoyouvideo>.
         ?s <hasTag> <#ariana>.
         ?s <hasTag> <dangerous>
    }

    The score for each triple is equal to the number of retweets for the tweet in that triple. The relaxations were generated using co-occurrence frequencies, i.e., the relaxation weight for relaxing a term t1 to a term t2 is derived from how frequently t1 and t2 co-occur in the same tweets.

    For example, a possible relaxation for <#intoyouvideo> is <video>.

    The testset of queries was constructed manually using combinations of the most frequent tags and terms. Each query had either 2 or 3 triple patterns, with each triple pattern having at least one relaxation.

4.3 Metrics

We measure the following metrics for each query to demonstrate the quality and efficiency of our technique:

  1. Quality:

    • Precision: The fraction of the true top-k results (of TriniT) in the top-k results of Spec-QP.

    • Recall: The fraction of the top-k results by Spec-QP in the true top-k by TriniT.

    • Prediction accuracy: The number of queries for which we could identify the correct relaxations.

    • Score error: The average absolute error between the Spec-QP and TriniT top-k scores at each rank, i.e.,

      (1/k) · Σ_{i=1..k} |score_Spec-QP(i) − score_TriniT(i)|

      We also note the standard deviation.

  2. Efficiency:

    • Runtimes: We measure the time taken to plan and execute each query.

    • Memory used: Since it is not easy to measure exact memory consumption in Java, we use the no. of answer objects created as a proxy for it. The total no. of answer objects created directly corresponds to the amount of search space traversed to arrive at the top-k answers. This number includes all the intermediate answer objects encountered by Incremental Merges and Rank Joins.

Note that precision and recall have identical values in our setup, because both have the same denominator, k.
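For concreteness, the following is a minimal Python sketch of the per-query quality metrics, under the natural per-rank reading of the score error (the paper's exact formula may be stated differently); the answer lists in the example are hypothetical.

def evaluate_query(spec_qp_topk, trinit_topk):
    """Quality metrics for one query: precision (== recall here, since both
    lists have k elements) and the average absolute score error per rank."""
    k = len(trinit_topk)
    spec_ids = {answer for answer, _ in spec_qp_topk}
    true_ids = {answer for answer, _ in trinit_topk}
    precision = len(spec_ids & true_ids) / k        # same value as recall
    score_error = sum(
        abs(s_spec - s_true)
        for (_, s_spec), (_, s_true) in zip(spec_qp_topk, trinit_topk)
    ) / k
    return precision, score_error

# Hypothetical ranked (answer, score) lists for one query with k = 3.
print(evaluate_query(
    [("a", 2.0), ("b", 1.8), ("d", 1.5)],
    [("a", 2.0), ("b", 1.9), ("c", 1.7)],
))  # -> (0.67, 0.1) approximately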

4.4 Setup

The experiments were conducted on a Dell Blade server with Intel(R) Xeon(R) CPU E5-2420 @ 1.90GHz processors and GB RAM. The database engine used to retrieve the matches for triple patterns in sorted order is postgresql-9.5. Each query was evaluated using both techniques, TriniT and Spec-QP. We considered three values for k, namely 10, 15 and 20. To have a warm cache, we conducted multiple consecutive runs for each query and considered the average of the last runs for each technique.

4.5 Quality evaluations

We first discuss the quality of results obtained by Spec-QP and then provide the statistics for runtimes and memory consumptions.

k | XKG | Twitter
10 | 0.7 | 0.72
15 | 0.88 | 0.78
20 | 0.91 | 0.8
Table 2: Precision (and Recall) over each dataset.

4.5.1 Precision (and Recall)

The precision values for the datasets are given in Table 2. The precision is good for both datasets, being about 90% in the best case. This indicates that, on average, 70–91% of the answers (depending on the dataset and k) belonged to the true top-k. Note that since the answers are sorted according to their scores, the answers outside the true top-k appeared at the lower ranks; the higher ranked answers were found correctly. An observation is that the accuracy increased with increasing values of k. Also, the approximate results obtained are very close to the true top-k, as described later in Section 4.5.3.

4.5.2 Prediction Accuracy

We performed a detailed analysis of the number of queries for which we could predict the correct relaxation(s) over each dataset; the results are given in Table 3. We observed that each query required some triple patterns to be relaxed to generate top-k answers. It can be seen that the prediction accuracy is high for all types of queries over XKG and for the queries requiring relaxations over Twitter. As the value of k was increased, queries increasingly required relaxations to generate sufficient answers. For Twitter, most of the queries required all triple patterns to be relaxed. This is due to the absence of sufficient triples corresponding to each term and fewer relaxations (the <hasTag> predicate does not have relaxations) for each triple pattern. Nevertheless, we were able to identify the requirement of all the relaxations in such a scenario.

Dataset | XKG | Twitter
k | 10 | 15 | 20 | 10 | 15 | 20
queries requiring 1 relaxation | 5(6) | 5(5) | -(-) | - | - | -
queries requiring 2 relaxations | 21(30) | 22(26) | 18(19) | 1(2) | 1(2) | 1(2)
queries requiring 3 relaxations | 12(18) | 16(19) | 27(31) | 35(48) | 38(48) | 39(48)
queries requiring 4 relaxations | 7(11) | 14(15) | 14(15) | - | - | -
Table 3: Summary of prediction accuracy for various values of k over XKG and Twitter, grouped by the number of triple patterns requiring relaxations in the queries to generate the true top-k results. Each number indicates the number of queries for which Spec-QP could identify exactly these relaxations; the numbers in brackets show the total number of such queries.

Note that the two-bucket histogram model for representing the distribution of scores is only an approximation of the score distribution and not the exact distribution. This leads to wrong estimates for the expected score values in a few cases. This can be improved upon by using multi-bucket histograms that model the distributions more closely, but doing so will lead to higher planning-time overheads.

4.5.3 Average score error

To judge the quality of the approximate results returned by Spec-QP, we computed the score deviations of the approximate answers at each rank given by Spec-QP from the true top-k. The average values of the score difference (along with the standard deviations) for various values of k are given in Table 4. The percentages in brackets show the average percentage deviation from the original scores. Note that the maximum possible score for an answer to a 2-triple-pattern query is 2, for a 3-triple-pattern query it is 3, and so on (this is because the maximum score for a matching triple for each triple pattern is 1).

Dataset | XKG | Twitter
k \ #TP | 2 | 3 | 4 | 2 | 3
10 | 0.1 (5%) ± 0.1 | 0.2 (8%) ± 0.3 | 0.1 (3%) ± 0.2 | 0.16 (8%) ± 0.0 | 0.5 (16%) ± 0.5
15 | 0.08 (4%) ± 0.08 | 0.1 (3%) ± 0.2 | 0.01 (1%) ± 0.04 | 0.16 (8%) ± 0.0 | 0.32 (10%) ± 0.3
20 | 0.07 (4%) ± 0.06 | 0.07 (2%) ± 0.1 | 0.01 (1%) ± 0.03 | 0.16 (8%) ± 0.0 | 0.18 (6%) ± 0.1
Table 4: Average score deviations (± standard deviation) of the approximate top-k from the true top-k for each dataset, grouped by the number of triple patterns (#TP) in the queries. The percentages in brackets show the average percentage deviation from the original scores.
XKG

Even though k=10 has the lowest precision, the score deviations from the true top-k answers are low (0.1–0.2 depending on the number of triple patterns). That is, for a query with 2 triple patterns, if the actual answer at a given rank has a score of 2, the score of the approximate answer would be about 1.9. The deviations are even lower for higher values of k and are tolerable for achieving faster runtimes. In agreement with the trend for precision values, the deviation reduces as we increase k.

Twitter

There is only one 2-triple-pattern query that required both of its triple patterns to be relaxed but had a wrong speculation of relaxations for all values of k. However, its score deviation is constant over all values of k because it has only a small number of results (including relaxations). The deviations for 2-triple-pattern queries are only 0.16 for all k, which is an 8% deviation from the original scores. The deviations for higher values of k are very low, being only 6% in the best case. For k=20, for a query with 3 triple patterns, if the actual answer at a given rank has a score of 3, the score of the approximate answer would be about 2.8.

Summary of precision results

We showed that our predictor identified the correct relaxations about 70–91% of the time, as can be seen from the precision analysis. Also, the answers outside the true top-k had minimal score deviations from the original top-k answers at each rank (Table 4), indicating that Spec-QP misses the true top-k only narrowly. Hence, Spec-QP gives approximate top-k results of good quality.

4.6 Efficiency evaluations

We now discuss the efficiency of Spec-QP over TriniT in generating the results for individual datasets.

4.6.1 Efficiency over XKG


Figure 6: Runtimes and memory comparisons over XKG queries for k=10, 15 and 20, grouped by the no. of triple patterns in the query. Panels (a)–(c) show runtimes and panels (d)–(f) show memory consumption for k=10, 15 and 20 respectively. All the legends in the graphs for efficiency have ‘T’ for TriniT and ‘S’ for Spec-QP.

The results for XKG grouped by the number of triple patterns in the queries have been given in Figure 6.

  • k=10: Spec-QP outperforms TriniT by a great margin. This is because Spec-QP avoids unnecessary computation of all relaxations when only a few relaxations are capable of giving top-k answers. Most of the queries require only a few relaxations (refer Table 3) to produce the top-k answers, and Spec-QP either identifies the correct relaxation(s) or gives good quality approximate results.

  • k=15 and k=20: Here, the 2 and 3 triple pattern queries have faster runtimes when using Spec-QP. The gain margin, however, is lower than for k=10. This is because when the user seeks more answers, the original query becomes increasingly insufficient for generating answers and more relaxations are required. For 4 triple pattern queries, higher values of k lead to more relaxations because answers become sparse with each join. Hence, their runtimes and memory consumptions are closer to those of TriniT.

The results grouped by the number of triple patterns relaxed by Spec-QP in the queries for XKG are given in Figure 7. We can see that we have major gains when none of the triple patterns undergo relaxations. The difference in the runtimes of TriniT and Spec-QP reduces as more triple patterns are relaxed. This is because, with an increasing number of triple patterns requiring relaxations, the Spec-QP plan tends towards the plan produced by TriniT, i.e., processing relaxations from all the triple patterns. The memory consumption also follows a similar trend. For the cases in which all the triple patterns in the query are relaxed, the runtimes are slightly higher than TriniT owing to the additional time spent on speculative planning; the memory consumption is the same as for TriniT.


Figure 7: Runtimes and memory comparisons over XKG queries for k=10, 15 and 20, grouped by the no. of triple patterns relaxed in the query by Spec-QP. Panels (a)–(c) show runtimes and panels (d)–(f) show memory consumption for k=10, 15 and 20 respectively. All the legends in the graphs for efficiency have ‘T’ for TriniT and ‘S’ for Spec-QP.

4.6.2 Efficiency over Twitter

The results grouped by the number of triple patterns in the queries over Twitter data for various values of k are given in Figure 8.

  • k=10: We can see here that Spec-QP performs really well on all queries: we achieve faster response times, and the memory used is also less than for the TriniT plan. As discussed before, Spec-QP identifies the required relaxations, ensuring low score deviations from the true top-k for all the queries.

  • k=15 and k=20: The results are similar to what was observed for k=10. Additionally, we observe that with increasing values of k, the difference between the runtimes of Spec-QP and TriniT reduces from that for k=10. This is due to the fact that when the user demands more answers, the original query no longer has sufficient answers. Hence, more triple patterns require relaxations and the query requires more time to execute.

The results grouped by the number of triple patterns relaxed by Spec-QP in the queries over Twitter data for various values of k are given in Figure 9. The results are similar to what was observed for XKG. For queries requiring all triple patterns to be relaxed, Spec-QP behaves similarly to TriniT since relaxations from all the triple patterns are processed. The runtimes in these cases are slightly higher than TriniT owing to the additional time spent on speculative planning; the memory consumption is the same as for TriniT.


Figure 8: Runtimes and memory comparisons over Twitter for k=10, 15 and 20, grouped by the no. of triple patterns in the query. Panels (a)–(c) show runtimes and panels (d)–(f) show memory consumption for k=10, 15 and 20 respectively. All the legends in the graphs for efficiency have ‘T’ for TriniT and ‘S’ for Spec-QP.

Figure 9: Runtimes and memory comparisons over Twitter for k=10, 15 and 20, grouped by the no. of triple patterns relaxed in the query by Spec-QP. Panels (a)–(c) show runtimes and panels (d)–(f) show memory consumption for k=10, 15 and 20 respectively. All the legends in the graphs for efficiency have ‘T’ for TriniT and ‘S’ for Spec-QP.

4.7 Discussion and remarks

We have shown that Spec-QP is able to identify the correct relaxation(s) for most of the queries. For queries with precision below 1, Spec-QP gives good quality approximations of the top-k, as demonstrated by the average score deviation values. We have also shown that Spec-QP incurs lower computation overheads and achieves faster response times with low memory overheads. Hence, Spec-QP is more efficient than TriniT while also having good accuracy.

5 Related Work

Top- query processing

FRPA [12] and Hash Rank-Join (HRJN*) [16] represent the state-of-the-art relational rank-join algorithms. HRJN* has been shown to perform well in practice; however, FRPA showed that it is not instance-optimal for a variant of the rank join problem that they considered. HRJN [17] is based on the ripple join algorithm. It maintains two in-memory hash tables for storing the input tuples seen so far; the stored input tuples are used for finding join results. These results are then given as inputs to a priority queue, which outputs them in the order specified by the ranking function. Nested Loops Rank Join (NRJN) [15] is similar to HRJN except that, unlike HRJN, it does not store input tuples but rather follows a nested-loop strategy. Pull/Bound Rank Join (PBRJ) [27] is an algorithm template that generalised previous rank join algorithms and provided tight upper bounds. DRJN [8] is an efficient algorithm for computing rank joins in distributed systems. Theobald et al. [31] dealt with top-k query evaluation for joins over multiple index lists with pruning using probabilistic guarantees; it uses histograms and dynamic convolutions to predict the top-k. Our case, however, differs in that we consider graph-structured data and also support multiple relaxations. IO-Top-k [4] deals with top-k query evaluation with pruning using sorted access (SA) scheduling. Other works include top-k processing over XML data [30] and over data that is distributed across multiple nodes [39].

Top- queries on graphs

There are very few works which address the problem of top-k processing over RDF graphs. The SPARQL-RANK framework proposed by [23] makes use of the different index permutations used in native triple-stores for fast random access during top-k processing, and applies an early-termination criterion. They propose an algorithm which requires the left-most index used in the join plan to be sorted based on the ranking function, and then randomly probes the right-side index. Thus, when the right-side index is large, the performance of the rank join suffers. In another framework introduced by Wang et al. [36], quantitative entities in the RDF dataset are separated out into an MS-tree index. In the first step of query processing, candidate entities are located using the MS-tree index; these are then used as seeds for performing breadth-first (BFS) traversals over the graph to find matching sub-graphs. If the query requires only a few highly correlated predicates, the algorithm may end up storing many unnecessary nodes in the queue, making the retrieval of the first entity possible only after several iterations. The work in [38] uses an approach similar to HRJN [17] for computing top-k star joins. However, for RDF data, SPARQL-RANK showed experimentally that it outperformed HRJN; the performance gain was attributed to the unsorted nature of numerical attributes present in the indexes built by RDF engines. QUARK-X [21] proposes an efficient technique to process top-k queries on RDF graphs using extra indexes and metadata. Another work specific to Linked Data is by Wagner et al. [35], where partial results are located at different sources and can only be accessed via URI lookups. None of these works, however, considers efficient processing of relaxations over the original query.

Query Reformulation in IR

Various strategies have been proposed to reformulate queries in IR over documents. These include measures of query similarity [3], or using summary information included in the query-flow graph [1]. Another approach, by Hristidis et al. [13], relies on suggesting keyword relaxations by relaxing those keywords which are least specific based on their idf scores. These reformulations can be used as relaxations in our setting.

Faceted Search: Many answers problem

A related optimization problem is the many-answers problem, i.e., the setting where an initial query returns a large number of answers and the objective is to design an effective drill-down strategy to help the user find acceptable results with minimum effort [26, 18, 22]. We address a related problem, handling both the empty-answer and the many-answers problem in an efficient manner by generating additional scored answers using relaxations.

Query Relaxation Frameworks

Query relaxation in relational databases is quite common. The work in [20] relaxes joins and selections in relational databases by suggesting alternative queries based on the “minimal” shift from the original query. Another work [34] lets the user rank the query edges so as to generate relevant differential queries with minimum deviation. “Why Not” queries are studied in [6, 32], where, given a query Q that did not return a set of tuples S that the user was expecting, they design an alternate query Q’ that (a) is very similar to Q, and (b) returns the missing tuples S, while the rest of the returned tuples should not be too different from those returned by Q. The work in [24] relaxes one constraint at a time and is interactive; it also tries to minimise the cost by suggesting low-cost relaxations which lead to non-empty answers. DebEAQ [33] first tries to debug why a query returns an empty answer and then tries to relax it with minimum change to the original query; it is limited to property graphs.

The works closest to ours are those which deal with relaxations over graphs. The paper [25] considers query relaxation for conjunctive regular path queries: users are able to specify approximations and relaxations to be applied to their original query, along with the relative costs of these, and query results are returned incrementally, ranked in order of increasing distance from the user’s original query. Another work which computes approximate answers uses two evaluation algorithms [14]: the first is based on a best-first strategy where relaxed queries are executed in order and relaxations which do not give new results are pruned; the other executes the relaxed queries as a batch and avoids the unnecessary execution cost. TriniT [37] enhances the graphs using a text corpus and computes relaxations over them. The relaxations are processed efficiently using incremental merges and rank joins. We use this system as our baseline.

6 Conclusion and Future Work

We have proposed Spec-QP, a strategy for top-k query processing in a scenario where a query can have multiple relaxations. To achieve this, we used a speculative approach for pruning the relaxations which are not likely to contribute answers to the top-k results. The triple patterns which are predicted to not require relaxations can be processed by rank joins over the sorted lists of matches for them, thereby reducing top-k processing effort and leading to large savings in runtime and memory. Extensive experiments over two real-world datasets, XKG and Twitter, show that Spec-QP achieves greater efficiency than the baseline with good accuracy for most of the queries. As future work, we would like to generate and use more complex relaxations for the queries, such as replacing a triple pattern with a chain of triple patterns. We would also like to extend these techniques to work for ranked retrieval from XML databases.

References

  • [1] Aris Anagnostopoulos, Luca Becchetti, Carlos Castillo, and Aristides Gionis. An optimization framework for query recommendation. In Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, New York, NY, USA, February 4-6, 2010, 2010.
  • [2] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. Dbpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007., 2007.
  • [3] Ricardo A. Baeza-Yates, Carlos A. Hurtado, and Marcelo Mendoza. Query recommendation using query logs in search engines. In Current Trends in Database Technology - EDBT 2004 Workshops, EDBT 2004 Workshops PhD, DataX, PIM, P2P&DB, and ClustWeb, Heraklion, Crete, Greece, March 14-18, 2004, Revised Selected Papers, 2004.
  • [4] H. Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, and Gerhard Weikum. Io-top-k: Index-access optimized top-k query processing. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, 2006.
  • [5] Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. Freebase: A shared database of structured general human knowledge. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, 2007.
  • [6] Adriane Chapman and H. V. Jagadish. Why not? In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, 2009.
  • [7] H.A. David and H.N. Nagaraja. Order Statistics. Wiley Series in Probability and Statistics. Wiley, 2004.
  • [8] Christos Doulkeridis, Akrivi Vlachou, Kjetil Nørvåg, Yannis Kotidis, and Neoklis Polyzotis. Processing of rank joins in highly distributed systems. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, 2012.
  • [9] Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, and Gerhard Weikum. Language-model-based ranking for queries on rdf-graphs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, 2009.
  • [10] Shady Elbassuoni, Maya Ramanath, and Gerhard Weikum. Query relaxation for entity-relationship search. In The Semanic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29 - June 2, 2011, Proceedings, Part II, 2011.
  • [11] Azam Feyznia, Mohsen Kahani, and Fattane Zarrinkalam. COLINA: A method for ranking SPARQL query results through content and link analysis. In Proceedings of the ISWC 2014 Posters & Demonstrations Track a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014., 2014.
  • [12] Jonathan Finger and Neoklis Polyzotis. Robust and efficient algorithms for rank join evaluation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009.
  • [13] Vagelis Hristidis, Yuheng Hu, and Panagiotis G. Ipeirotis. Ranked queries over sources with boolean query interfaces without ranking support. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, 2010.
  • [14] Hai Huang, Chengfei Liu, and Xiaofang Zhou. Approximating query answering on RDF databases. World Wide Web, 15(1), 2012.
  • [15] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, 2003.
  • [16] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting top-k join queries in relational databases. VLDB J., 13(3), 2004.
  • [17] Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeffrey Scott Vitter, and Ahmed K. Elmagarmid. Rank-aware query optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, 2004.
  • [18] Abhijith Kashyap, Vagelis Hristidis, and Michalis Petropoulos. Facetor: cost-driven exploration of faceted query results. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, 2010.
  • [19] Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum. NAGA: searching and ranking knowledge. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México, 2008.
  • [20] Nick Koudas, Chen Li, Anthony K. H. Tung, and Rares Vernica. Relaxing join and selection queries. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, 2006.
  • [21] Jyoti Leeka, Srikanta Bedathur, Debajyoti Bera, and Medha Atre. Quark-X: An efficient top-k processing framework for RDF quad stores. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, 2016.
  • [22] Chengkai Li, Ning Yan, Senjuti Basu Roy, Lekhendro Lisham, and Gautam Das. Facetedpedia: dynamic generation of query-dependent faceted interfaces for wikipedia. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, 2010.
  • [23] Sara Magliacane, Alessandro Bozzon, and Emanuele Della Valle. Efficient execution of top-k SPARQL queries. In The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, 2012.
  • [24] Davide Mottin, Alice Marascu, Senjuti Basu Roy, Gautam Das, Themis Palpanas, and Yannis Velegrakis. A probabilistic optimization framework for the empty-answer problem. PVLDB, 6(14), 2013.
  • [25] Alexandra Poulovassilis and Peter T. Wood. Combining approximation and relaxation in semantic web path queries. In The Semantic Web - ISWC 2010 - 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, November 7-11, 2010, Revised Selected Papers, Part I, 2010.
  • [26] Senjuti Basu Roy, Haidong Wang, Gautam Das, Ullas Nambiar, and Mukesh K. Mohania. Minimum-effort driven dynamic faceted search in structured databases. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, 2008.
  • [27] Karl Schnaitter and Neoklis Polyzotis. Optimal algorithms for evaluating rank joins in database systems. ACM Trans. Database Syst., 35(1), 2010.
  • [28] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, 2007.
  • [29] Martin Theobald, Ralf Schenkel, and Gerhard Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005, 2005.
  • [30] Martin Theobald, Ralf Schenkel, and Gerhard Weikum. An efficient and versatile query engine for topx search. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, 2005.
  • [31] Martin Theobald, Gerhard Weikum, and Ralf Schenkel. Top-k query evaluation with probabilistic guarantees. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, 2004.
  • [32] Quoc Trung Tran and Chee-Yong Chan. How to conquer why-not questions. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, 2010.
  • [33] Elena Vasilyeva, Thomas Heinze, Maik Thiele, and Wolfgang Lehner. Debeaq - debugging empty-answer queries on large data graphs. In 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, 2016.
  • [34] Elena Vasilyeva, Maik Thiele, Christof Bornhövd, and Wolfgang Lehner. Top-k differential queries in graph databases. In Advances in Databases and Information Systems - 18th East European Conference, ADBIS 2014, Ohrid, Macedonia, September 7-10, 2014. Proceedings, 2014.
  • [35] Andreas Wagner, Duc Thanh Tran, Günter Ladwig, Andreas Harth, and Rudi Studer. Top-k linked data query processing. In The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012. Proceedings, 2012.
  • [36] Dong Wang, Lei Zou, and Dongyan Zhao. Top-k queries on RDF graphs. Inf. Sci., 316, 2015.
  • [37] Mohamed Yahya, Denilson Barbosa, Klaus Berberich, Qiuyue Wang, and Gerhard Weikum. Relationship queries on extended knowledge graphs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016, 2016.
  • [38] Shengqi Yang, Fangqiu Han, Yinghui Wu, and Xifeng Yan. Fast top-k search in knowledge graphs. In 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, 2016.
  • [39] Hailing Yu, Hua-Gang Li, Ping Wu, Divyakant Agrawal, and Amr El Abbadi. Efficient processing of distributed top-k queries. In Database and Expert Systems Applications, 16th International Conference, DEXA 2005, Copenhagen, Denmark, August 22-26, 2005, Proceedings, 2005.