Top-k Spatial-keyword Publish/Subscribe Over Sliding Window
Abstract
With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data has been generated in a stream fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under controllable memory cost, the sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. A novel centralized system, called Skype (Top-k Spatial-keyword Publish/Subscribe), is proposed in this paper. In Skype, to continuously maintain top-k results for massive subscriptions, we devise a novel indexing structure upon subscriptions such that each incoming message can be immediately delivered on its arrival. To reduce the expensive top-k re-evaluation cost triggered by message expiration, we develop a novel cost-based k-skyband technique to reduce the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of our proposed techniques. Furthermore, to support better scalability and higher throughput, we propose a distributed version of Skype, namely, DSkype, on top of Storm, which is a popular distributed stream processing system. With the help of fine-tuned subscription/message distribution mechanisms, DSkype can achieve orders-of-magnitude speedup over its centralized version.
1 Introduction
Recently, with the ubiquity of social media and GPS-enabled mobile devices, large volumes of geo-textual data have been generated in a stream fashion, leading to the popularity of spatial-keyword publish/subscribe systems (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()) in a variety of applications such as location-based recommendation and social networks. In such a system, each individual user can register her interest (e.g., favorite food or sports) and location as a spatial-keyword subscription. A stream of geo-textual messages (e.g., e-coupon promotions and tweets with location information) continuously generated by publishers (e.g., local businesses) is rapidly fed to the relevant users.
Spatial-keyword publish/subscribe systems have been studied in several existing works (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 ()). Most of them are geared towards boolean matching, which makes the number of messages received by users unpredictable. This motivates us to study the problem of top-k spatial-keyword publish/subscribe such that only the top-k most relevant messages are presented to users. Moreover, we adopt the popular sliding window model DBLP:conf/pods/BabcockBDMW02 () on the geo-textual stream to provide fresh information under controllable memory usage. In particular, for each subscription, we score a message based on their geo-textual similarity, and the top-k messages are continuously maintained against the update of the sliding window (i.e., message arrival and expiration). Below is a motivating example.
Example 1
Figure 1 shows an example of a location-aware e-coupon recommendation system. Three users interested in nearby restaurants are registered with their locations and favorite food, intending to keep an eye on the most relevant e-coupon issued recently. We assume the system only stores the most recent four e-coupons. An e-coupon is delivered to a user if it has the highest score w.r.t. that user according to their spatial and textual similarity. Initially, we have four e-coupons, and the top-1 answer of each user is shown in bold in the upper-right table, where the relevance score between user and e-coupon is depicted. When a new e-coupon arrives and the old e-coupon expires, the updated results are shown in the bottom-right table. Particularly, the top-1 answer of is replaced by since is discarded from the system, while the answer of is replaced by , as is the most relevant to . The top-1 answer of remains unchanged.
Challenges. Besides the existing challenges in spatialkeyword query processing DBLP:conf/icde/FelipeHR08 (); DBLP:journals/pvldb/CongJW09 (); DBLP:conf/ssd/RochaGJN11 (); DBLP:conf/sigir/ZhangCT14 (), our problem presents two new challenges.
The first challenge is to devise an efficient indexing structure for a huge number of subscriptions, such that each message from the high-speed stream can be disseminated immediately on its arrival. The only work that supports top-k spatial-keyword publish/subscribe is proposed by Chen et al. DBLP:conf/icde/ChenCCT15 (). In a nutshell, they first deduce a textual bound for each subscription and then employ the DAAT (Document-at-a-time DBLP:conf/cikm/BroderCHSZ03 ()) paradigm to traverse the inverted file built in each spatial node. However, we observe that the continuous top-k monitoring problem is essentially a threshold-based similarity search problem from the perspective of the message; that is, a new message will be delivered to a subscription if and only if its score is not less than the current threshold score (e.g., the k-th highest score) of the subscription. Consequently, although the DAAT paradigm has been widely used for top-k search (e.g., DBLP:conf/sigir/DingS11 ()), it is not suitable for our problem because the advanced threshold-based pruning techniques cannot be naturally integrated under the DAAT paradigm.
The second challenge is the top-k re-evaluation problem triggered by frequent message expiration from the sliding window. For example, in Figure 1, the expiration of invalidates the current top answer (i.e., ) of , and thus the system has to recompute the new result for over the sliding window. It is cost-prohibitive to re-evaluate all the affected subscriptions from scratch when a message expires. Some techniques have been proposed to solve this problem (e.g., DBLP:conf/icde/YiYYXC03 (); DBLP:conf/sigmod/MouratidisBP06 (); DBLP:conf/icde/BohmOPY07 (); DBLP:journals/tods/PripuzicZA15 ()). Yi et al. DBLP:conf/icde/YiYYXC03 () introduce a kmax strategy, which maintains top-k' results, with k' being a value between k and kmax, rather than buffering the exact top-k results. Later, Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 () notice that kmax ignores the dominance relationship between messages, and propose a novel idea to convert top-k maintenance into partial k-skyband maintenance to reduce the number of re-evaluations. Nevertheless, they simply use the k-th score of a continuous query (i.e., subscription in our paper) as the threshold of its k-skyband without theoretical underpinnings, which may result in poor performance in practice.
On the other hand, the limited computational resources (e.g., CPU, memory) in a single machine often become the bottleneck when we increase the scale of real-life applications, where millions of active users need to be maintained simultaneously. To alleviate this issue, we extend Skype on top of Storm (Apache Storm project, http://storm.apache.org/), an open-source distributed real-time in-memory processing system, to leverage parallel processing such that high throughput can be achieved. Storm itself is intrinsically designed to solve real-time stream processing tasks, which therefore best suits our top-k publish/subscribe problem. The main challenge here lies in how to partition and distribute subscriptions and messages such that workload balance and high throughput can be achieved at a small communication cost.
In this paper, we propose a novel centralized system, i.e., Skype, to efficiently support top-k Spatial-keyword Publish/Subscribe over a sliding window. Two key modules, the message dissemination module and the top-k re-evaluation module, are designed to address the above challenges. Specifically, the message dissemination module aims to rapidly deliver each arriving message to its affected subscriptions on its arrival. We devise efficient subscription indexing techniques which carefully integrate both spatial and textual information. Following the TAAT (Term-at-a-time DBLP:conf/sigir/BuckleyL85 ()) paradigm, we significantly reduce the number of non-promising subscriptions for the incoming message by utilizing a variety of spatial and textual pruning techniques. On the other hand, the top-k re-evaluation module is designed to refill the top-k results of subscriptions when their results expire. To alleviate frequent re-evaluations, we develop a novel cost-based k-skyband technique which carefully selects the messages to be buffered based on a threshold value determined by a cost model, considering both top-k re-evaluation cost and k-skyband maintenance cost. In addition, to speed up real-time processing, we follow most of the existing publish/subscribe systems (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()) and implement all our indexes in main memory.
To support better scalability beyond Skype, we pioneer a novel distributed real-time processing system, namely, DSkype, which is a distributed version of Skype deployed on top of Storm. We propose four different distribution mechanisms, i.e., hashing-based, location-based, keyword-based and prefix-based, to distribute subscriptions and messages to relevant components. Among them, the prefix-based technique yields the best overall performance in terms of both throughput and communication cost. For example, it can process nearly messages per second over subscriptions on a small-size cluster.
Contributions. Our principal contributions are summarized as follows:

We propose a novel framework, called Skype, which continuously maintains top-k geo-textual messages for a large number of subscriptions over the sliding window model. To the best of our knowledge, this is the first work to integrate the sliding window model into a spatial-keyword publish/subscribe system. (Section 4)

For the message dissemination module, we propose both an individual pruning technique and a group pruning technique to significantly improve the dissemination efficiency following the TAAT paradigm. (Section 5)

For the top-k re-evaluation module, a novel cost-based k-skyband method is developed to determine the best threshold value with in-depth theoretical analysis. It is worth mentioning that our technique is a general approach which can be applied to other continuous top-k problems over a sliding window. (Section 6)

We extend Skype on top of Storm, a distributed real-time processing environment. By introducing to Storm a distribution layer which employs several efficient distribution mechanisms, the distributed version can achieve high throughput with better scalability. As far as we know, this is the first work which extends a top-k publish/subscribe system on top of Storm. (Section 7)

We conduct extensive experiments to verify the efficiency and effectiveness of both Skype and its distributed version DSkype. It turns out that Skype usually achieves up to orders of magnitude improvement compared to its competitors, while DSkype improves further on Skype by a large margin and scales better. (Section 8)
2 Related Work
2.1 Spatial-keyword Search
Spatial-keyword search has been widely studied in the literature. It aims to retrieve a set of geo-textual objects based on boolean matching (e.g., DBLP:conf/cikm/ZhouXWGM05 (); DBLP:conf/ssdbm/HariharanHLM07 (); DBLP:conf/icde/FelipeHR08 ()) or a score function (e.g., DBLP:journals/pvldb/CongJW09 (); DBLP:conf/ssd/RochaGJN11 (); DBLP:conf/cikm/ChristoforakiHDMS11 (); DBLP:conf/sigir/ZhangCT14 ()) by combining both a spatial index (e.g., R-Tree, Quadtree) and a textual index (e.g., inverted file). A nice summary of spatial-keyword query processing is available in DBLP:journals/pvldb/ChenCJW13 (). Several extensions based on spatial-keyword processing have also been investigated, such as the moving spatial-keyword query DBLP:conf/sigmod/GuoZLTB15 (), the collective spatial-keyword query DBLP:conf/sigmod/GuoCC15 () and the reverse spatial-keyword query DBLP:conf/sigmod/LuLC11 (). Note that a spatial-keyword search is an ad-hoc/snapshot query (i.e., user-initiated model) while our problem focuses on continuous queries (i.e., server-initiated model).
2.2 Publish/Subscribe System
Users register their interest as long-running queries in a publish/subscribe system, and streaming publications are delivered to relevant users whose interests are satisfied. Nevertheless, most of the existing publish/subscribe systems (e.g., DBLP:journals/pvldb/WhangBSVVYG09 (); DBLP:conf/sigmod/SadoghiJ11 (); DBLP:journals/pvldb/ZhangCT14 (); DBLP:journals/pvldb/ShraerGFJ13 ()) do not consider spatial information. Recently, spatial-keyword publish/subscribe systems have been studied in a line of work (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()). Among them, DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 () study the boolean matching problem while DBLP:conf/icde/HuLLFT15 () studies the similarity search problem, where each subscription has a pre-given threshold. These works are inherently different from ours, and it is non-trivial to extend their techniques to support top-k monitoring.
The CIQ index proposed by Chen et al. DBLP:conf/icde/ChenCCT15 () is the only closely related work that supports top-k spatial-keyword publish/subscribe (shown in Figure 2). In CIQ, a Quadtree is used to partition the whole space. Each subscription is assigned to a number of covering cells, forming a disjoint partition of the entire space. In Figure 2, we assume all the subscriptions have the same cell covering, i.e., from to . A textual bound (e.g., MinT) is precomputed for each subscription w.r.t. each assigned cell, as shown in the tables where the textual bounds w.r.t. and are displayed. An inverted file ordered by subscription id is built to organize the subscriptions assigned to each cell. For a new message (e.g., ), CIQ traverses all the inverted files with corresponding cells penetrated by the message location (e.g., ) in the DAAT paradigm, and finds all the subscriptions with textual similarity higher than the precomputed bound as candidates, which are then verified to get final results. However, we notice that the DAAT paradigm employed in CIQ cannot integrate some advanced techniques for threshold-based similarity search, given that the nature of our problem is a threshold-based search problem. Contrary to CIQ, our indexing structure is designed for the TAAT paradigm, combined with advanced techniques for threshold-based pruning, thus enabling us to exclude a significant number of subscriptions. Moreover, CIQ indexes each subscription into multiple cells, taking advantage of a precomputed spatial bound. However, the gain is limited since the number of covering cells for each subscription cannot be too large; otherwise, it would lead to extremely high memory cost. Thus, we turn to an on-the-fly spatial bound computation strategy, where each subscription is assigned to a single cell with finer spatial granularity. Finally, we remark that CIQ integrates a time decay function rather than a sliding window, which, in the worst case, may overwhelm the limited memory.
2.3 Top-k Maintenance Over Sliding Window
One critical problem for top-k maintenance over a sliding window is that, when an old element (i.e., message in this paper) expires, we have to recompute the top-k results for the affected continuous queries (i.e., subscriptions in this paper), which is expensive if we simply re-evaluate from scratch. On the flip side, it is also infeasible to buffer all elements and their scores for each individual query to avoid top-k re-evaluation. Several techniques have been proposed to find a trade-off between the number of re-evaluations and the buffer size. In DBLP:conf/icde/YiYYXC03 (), Yi et al. introduce a kmax approach. Rather than maintaining exact top-k results, they continuously maintain top-k' results where k' is between k and a parameter kmax. However, as observed by Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 (), kmax may contain redundant elements because it overlooks the dominance relationship. Thus, Mouratidis et al. propose a k-skyband-based algorithm to remove redundancy. Since it is very expensive to maintain the full k-skyband for each individual query, they only keep elements with scores not lower than the k-th highest score determined by the most recent top-k re-evaluation. We observe that this setting is rather ad hoc and thus may result in unsatisfactory performance in practice. Böhm et al. DBLP:conf/icde/BohmOPY07 () utilize a delay buffer to avoid inserting newly arriving objects with low scores into the k-skyband. However, since each object has to probe the query index twice during its lifetime, their method is not suitable for our problem given the large number of registered queries (i.e., subscriptions). Pripuzic et al. DBLP:journals/tods/PripuzicZA15 () propose a probabilistic k-skyband method that drops data unlikely to become top-k results in order to save space and improve efficiency. However, their technique may discard some top-k elements due to its probabilistic nature.
In this paper, we propose a novel cost-based k-skyband technique to carefully determine the size of the k-skyband buffer based on a cost model.
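The skyband idea discussed above can be illustrated with a minimal Python sketch. It assumes the usual sliding-window notion of dominance (a message is dominated by any message that arrives later and scores at least as high) and is not the cost-based method of this paper. The names `k_skyband` and `top_k` and the `(score, arrival)` tuple layout are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Scored = Tuple[float, int]  # (score w.r.t. one subscription, arrival order)


def k_skyband(window: List[Scored], k: int) -> List[Scored]:
    """Return the messages dominated by fewer than k others.

    Assumed dominance: a dominates b if a arrives later than b and
    score(a) >= score(b), so a outlives b and never ranks below it.
    """
    band = []
    for score, arrival in window:
        dominators = sum(
            1
            for other_score, other_arrival in window
            if other_arrival > arrival and other_score >= score
        )
        if dominators < k:
            band.append((score, arrival))
    return band


def top_k(window: List[Scored], k: int) -> List[Scored]:
    """Exact top-k by score, breaking ties in favour of newer messages."""
    return sorted(window, key=lambda x: (-x[0], -x[1]))[:k]
```

Under these assumptions, the top-k answer after any number of expirations is drawn from the skyband, so buffering the skyband alone avoids re-evaluation from scratch; the quadratic scan here is only for readability.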
2.4 Distributed Spatial Query Processing
A number of studies address spatial query processing on distributed systems. Nishimura et al. DBLP:conf/mdm/NishimuraDAA11 () extend HBase (Apache HBase project, https://hbase.apache.org/) to support multi-dimensional indexes. Aji et al. DBLP:journals/pvldb/AjiWVLL0S13 () propose Hadoop-GIS, a distributed data warehouse infrastructure built on top of Hadoop, which provides functionality for spatial data analytics. Later, Eldawy et al. DBLP:conf/icde/EldawyM15 () develop SpatialHadoop, a full-fledged system which supports various spatial queries by integrating spatial awareness in each Hadoop layer. Aly et al. present an adaptive mechanism on top of Hadoop to partition large-scale spatial data for efficient query processing DBLP:journals/pvldb/AlyMHAOEQ15 (). Xie et al. xiesimba () introduce a system called Simba to provide efficient in-memory spatial analytics by extending the Spark SQL engine. All the works above focus on fundamental spatial queries, such as range queries and kNN queries, which are inherently different from our top-k spatial-keyword publish/subscribe problem. A closely related work, called Tornado, which also supports spatial-keyword stream processing on Storm, appears in DBLP:journals/pvldb/MahmoodAQRDMAHA15 (). Tornado is a general spatial-keyword stream processing system that supports both snapshot and continuous queries. However, its main focus is not on index construction over subscription queries, which is the main contribution of our paper. Besides, it cannot support top-k spatial-keyword subscription queries such as ours.
On the other hand, many stream processing systems, such as Spark Streaming (Apache Spark project, http://spark.apache.org/streaming/), Samza (Apache Samza project, http://samza.apache.org/) and Storm, have been developed to support efficient processing of real-time data. Most of them are open-source, low-latency, distributed, scalable and fault-tolerant. A nice comparison between different stream processing systems can be found in DBLP:journals/cloudcomp/Ranjan14 (). We choose Storm here mainly because of its simplicity, efficiency, well-documented APIs and very active community (https://github.com/apache/storm). To the best of our knowledge, our work is the first one to support top-k spatial-keyword publish/subscribe in a distributed environment.
3 Preliminary
In this section, we formally present some concepts which are used throughout this paper.
Definition 1 (Geo-textual Message)
A geo-textual message is defined as , where is a collection of keywords from a vocabulary , is a point location, and is the arrival time.
Definition 2 (Spatial-keyword Subscription)
A spatial-keyword subscription is denoted as , where is a set of keywords, is a point location, is the number of messages that is willing to receive and is the preference parameter used in the score function.
To buffer the most recent data from the geo-textual stream, we adopt a count-based sliding window defined as follows.
Definition 3 (Sliding Window)
Given a stream of geo-textual messages arriving in time order, the sliding window over the stream with size consists of most recent geo-textual messages.
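A count-based window of this kind can be sketched in a few lines of Python. This is only an illustration of Definition 3; the class name `CountBasedWindow` and the `(arrival order, payload)` message layout are hypothetical and not part of Skype.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Message = Tuple[int, str]  # (arrival order, payload); illustrative layout


class CountBasedWindow:
    """Keeps only the `size` most recent messages."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.messages: Deque[Message] = deque()

    def push(self, message: Message) -> Optional[Message]:
        """Add an arriving message; return the expired one, if any."""
        self.messages.append(message)
        if len(self.messages) > self.size:
            return self.messages.popleft()
        return None
```

Each arrival therefore triggers at most one expiration, which is the event that may force a top-k re-evaluation for the subscriptions holding the expired message.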
In the rest of the paper, we abbreviate geo-textual message and spatial-keyword subscription as message (denoted as ) and subscription (denoted as ) respectively if there is no ambiguity. We assume that the keywords in vocabulary , as well as the keywords in subscription and message, are sorted in increasing order of their term frequencies. Note that sorting keywords in increasing order of frequency is a widely adopted heuristic to speed up similarity search DBLP:conf/icde/ChaudhuriGK06 (); DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 (). The th keyword in is denoted as , and we use to denote a subset of , i.e., . Particularly, denotes . Messages follow similar notation.
Score function. To measure the relevance between a subscription and a message , we employ a score function defined as follows:
(1) 
where is the spatial proximity and is the textual relevance between and . Thus, a subscriber can receive messages which are not only close to her location but also fulfil her interest. Meanwhile, the parameter can be adjusted by subscribers to best satisfy their diverse preferences.
To compute spatial proximity, we utilize Euclidean distance as , where is the Euclidean distance between and , and MaxDist is the maximum distance in the space.
For textual similarity, we employ the well-known cosine similarity manning2008introduction () as , where and are tf-idf weights of keyword in and respectively. Note that the weighting vectors of both and are normalized to unit length. Also, as in DBLP:conf/icde/ChenCCT15 (), to guarantee that the top-k results are textually relevant, a message must share at least one keyword with a subscription to appear in its top-k results.
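The scoring scheme described above can be sketched in Python as follows. The sketch makes two assumptions that are not spelled out in the surviving text: that the score is a linear combination `alpha * spatial + (1 - alpha) * textual`, and that spatial proximity is `1 - distance / MaxDist`. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Weights = Dict[str, float]  # keyword -> tf-idf weight, vector of unit length


def spatial_sim(p: Point, q: Point, max_dist: float) -> float:
    """Assumed spatial proximity: 1 - (Euclidean distance / MaxDist)."""
    return 1.0 - math.dist(p, q) / max_dist


def textual_sim(sub: Weights, msg: Weights) -> float:
    """Cosine similarity of unit vectors: dot product over shared keywords."""
    return sum(w * msg[kw] for kw, w in sub.items() if kw in msg)


def score(sub_loc: Point, sub_kw: Weights, alpha: float,
          msg_loc: Point, msg_kw: Weights, max_dist: float) -> Optional[float]:
    """Assumed score: alpha * spatial + (1 - alpha) * textual.

    Returns None when no keyword is shared, since such a message
    cannot enter the top-k results of the subscription.
    """
    if not any(kw in msg_kw for kw in sub_kw):
        return None
    spatial = spatial_sim(sub_loc, msg_loc, max_dist)
    return alpha * spatial + (1.0 - alpha) * textual_sim(sub_kw, msg_kw)
```

A larger `alpha` favours nearby messages, while a smaller one favours textual relevance, matching the role of the preference parameter in Definition 2.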
Problem statement. Given a massive number of spatial-keyword subscriptions and a geo-textual stream, we aim to continuously monitor top-k results for all the subscriptions against the stream over a sliding window in real time.
4 Framework
Figure 3 shows the framework of Skype (Top-k Spatial-keyword Publish/Subscribe). We assume our system already has some registered subscriptions. An arriving message is processed by the message dissemination module, where a subscription index is built to find all the affected subscriptions and update their top-k results. An expired message is processed by the top-k re-evaluation module. Specifically, it is checked against a result buffer, which maintains the top-k results (possibly including some non-top-k results) of all the subscriptions. For the subscriptions that cannot be refilled through the result buffer, their top-k results are re-evaluated from scratch against a message index containing all the messages over the sliding window. Note that the message index can be implemented with any existing spatial-keyword index, such as IR-Tree DBLP:journals/pvldb/CongJW09 () and S2I DBLP:conf/ssd/RochaGJN11 (). Skype can also support subscription updates efficiently. A new subscription is inserted into the subscription index, with its top-k results being initialized against the message index, while an unregistered subscription is deleted from both the subscription index and the result buffer. Note that the subscription index and message index serve different purposes and cannot be trivially combined.
5 Message Dissemination
In this section, we introduce a novel subscription index, which groups similar subscriptions, to support real-time dissemination against the message stream. Specifically, two key techniques, i.e., individual pruning and group pruning, are proposed in Section 5.1 and Section 5.2 respectively, followed by the detailed indexing structure in Section 5.3. Finally, we introduce the dissemination algorithm in Section 5.4 and index maintenance in Section 5.5.
5.1 Individual Pruning Technique
For each incoming message , the key challenge is to determine all the subscriptions whose top-k results are affected. Specifically, we denote the k-th highest score of a subscription as . Then the top-k results of need to be updated if . In this section, we propose a novel location-aware prefix filtering technique to skip an individual subscription efficiently.
5.1.1 Location-aware Prefix Filtering
For ease of exposition, we denote a spatial similarity upper bound between a subscription and a message as , which will be discussed in detail in Section 5.1.2. Based on Equation 1, we can derive a textual similarity threshold for pruning purpose accordingly:
(2) 
Then the following lemma claims that if the textual similarity between and is less than , we can safely skip .
Lemma 1
A message cannot affect the top-k results of a subscription if .
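One way to see why Lemma 1 holds is the following short derivation. It assumes that Equation 1 is a weighted sum of spatial and textual similarity with a preference parameter strictly below 1; the symbols $\alpha$, $\mathrm{SSim}$, $\mathrm{TSim}$, $\mathrm{SSim}_{UB}$, $\mathrm{kScore}$ and $\lambda_T$ are chosen here for illustration and may differ from the original notation.

```latex
% Assumed form of Equation 1, bounded with the spatial upper bound:
\begin{align*}
\mathrm{Score}(s,m)
  &= \alpha\,\mathrm{SSim}(s,m) + (1-\alpha)\,\mathrm{TSim}(s,m) \\
  &\le \alpha\,\mathrm{SSim}_{UB}(s,m) + (1-\alpha)\,\mathrm{TSim}(s,m).
\end{align*}
% Assumed textual threshold, obtained by solving Score >= kScore for TSim:
\[
\lambda_T(s,m) = \frac{\mathrm{kScore}(s) - \alpha\,\mathrm{SSim}_{UB}(s,m)}{1-\alpha},
\qquad 0 \le \alpha < 1 .
\]
% If TSim(s,m) < lambda_T(s,m), then
\[
\mathrm{Score}(s,m)
  < \alpha\,\mathrm{SSim}_{UB}(s,m) + \mathrm{kScore}(s) - \alpha\,\mathrm{SSim}_{UB}(s,m)
  = \mathrm{kScore}(s),
\]
% so m cannot enter the top-k results of s.
```

Under this reading, a tighter spatial upper bound directly raises the textual threshold, which is why the spatial bound estimation discussed later matters for pruning.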
To utilize Lemma 1, we employ prefix filtering technique, which is widely adopted in textual similarity join problems (e.g., DBLP:conf/icde/ChaudhuriGK06 (); DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 ()). Prefix filtering is based on the fact that TSim is essentially a vector product; therefore, we can determine the similarity upper bound between two objects by only comparing their prefixes. Before we introduce prefix filtering technique, we first introduce a threshold value for each keyword in :
(3) 
Then we define a location-aware prefix as follows.
Definition 4 (Location-aware Prefix)
Given a subscription , a message and a textual similarity threshold , we use to denote the location-aware prefix of w.r.t. , where .
The following lemma claims that the location-aware prefix is sufficient to decide whether a message can be a top-k result of a subscription.
Lemma 2
Given a subscription and a message , is sufficient to skip regarding .
Proof
Example 2
Figure 4 shows an example of the location-aware prefix, with 3 registered subscriptions and 3 incoming messages. The underlined value to the right of each keyword corresponds to its weight, and we do not normalize the keyword weight for simplicity. Assuming , then . Thus, . Since , we can skip w.r.t. .
Note that, unlike conventional prefix techniques (e.g., DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 ()), where only the prefix of a data entry needs to be indexed, our location-aware prefix depends on the spatial location of messages, and different locations may lead to different prefixes. Thus, it is impossible to precompute and index the prefix of subscriptions. To address this issue, we utilize the threshold value wtsum for each keyword in to indicate whether this keyword should occur in the prefix regarding a message , which is stated formally in the following lemma.
Lemma 3
Given a subscription , a message and , if does not contain any keyword that satisfies , we can safely skip regarding m.
In this way, we can dynamically determine the location-aware prefix of a subscription w.r.t. an arriving message. Also, since is irrelevant to incoming messages, it can be materialized for each subscription.
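The dynamic prefix test can be sketched as follows. Because the definition of wtsum is not legible above, the sketch assumes one common choice: wtsum of a keyword is the sum of the weights of that keyword and all later keywords of the subscription, which bounds the similarity contributed by that suffix whenever message weights are at most 1. The function names and the `(term, weight)` layout are hypothetical.

```python
from typing import List, Set, Tuple

Keyword = Tuple[str, float]  # (term, weight), listed in the global keyword order


def suffix_weight_sums(keywords: List[Keyword]) -> List[float]:
    """Assumed wtsum: wtsum[i] is the total weight of keywords[i:].

    It does not depend on any message, so it can be computed once
    per subscription and stored with it.
    """
    sums = [0.0] * len(keywords)
    running = 0.0
    for i in range(len(keywords) - 1, -1, -1):
        running += keywords[i][1]
        sums[i] = running
    return sums


def can_skip(sub: List[Keyword], wtsum: List[float],
             msg_terms: Set[str], threshold: float) -> bool:
    """True if the message shares no keyword with the dynamic prefix.

    The prefix is the run of keywords whose wtsum is at least the
    message-specific textual threshold.
    """
    for (term, _), bound in zip(sub, wtsum):
        if bound < threshold:
            break  # wtsum is non-increasing, so the prefix ends here
        if term in msg_terms:
            return False
    return True
```

For example, with weights 0.25, 0.25 and 0.5 the stored sums are 1.0, 0.75 and 0.5; a threshold of 0.6 yields a two-keyword prefix, and a message containing only the third keyword can be skipped because it can contribute at most 0.5.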
Example 3
Max-weight refinement. We notice that for a specific message , we can compute a better location-aware prefix for by considering the maximum weight for the keywords in . We first define as:
(4) 
Then we define a refined location-aware prefix:
Definition 5 (Refined Location-aware Prefix)
Given a subscription , a message and , we use to denote the refined location-aware prefix of w.r.t. , where with .
The following theorem claims that is sufficient to decide whether a message can be a top-k result of a subscription.
Theorem 5.1
Given a subscription and a message , is sufficient to skip regarding .
Proof
5.1.2 Spatial Bound Estimation
In this section, we discuss the computation of between a subscription and a message in order to get a better threshold for efficient location-aware prefix filtering. To this end, we employ a spatial index to group subscriptions with similar locations, such that the spatial upper bound for a group of subscriptions can be computed simultaneously. Because it is easy to implement and adapts well to skewed spatial distributions, we choose a Quadtree to index subscriptions. Specifically, each subscription is assigned to a leaf cell with range based on its location . Then the following two types of spatial bounds can be defined and utilized.
Definition 6 (Inner Spatial Bound)
Given a subscription and its residing cell , inner spatial bound, denoted as is computed as , where is the from to the nearest boundary of cell range .
It is obvious that for any and , we have . An example is shown in Figure 4. Since the from to is , if we assume the MaxDist in the space is .
Definition 7 (Outer Spatial Bound)
Given a message and an outer cell , outer spatial bound, denoted as is computed as , where is the from to .
For any and , we have . An example is also shown in Figure 4. The from to is , and thus .
Definition 8 (Spatial Upper Bound)
Following the example in Figure 4, by combining both inner and outer distance, we can get a tighter spatial upper bound between and as .
Note that the inner spatial bound can be precomputed and materialized, while the outer spatial bound has to be computed on the fly as it depends on the location of an arriving message. However, the computation cost of is not expensive since we only need to compute this value against each leaf cell. Finally, we remark that when and are within the same cell, both and are always .
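The two bounds can be sketched in Python under the assumption that spatial similarity is `1 - distance / MaxDist` and that the combined bound adds the inner and outer distances, which is one reading of Definition 8. Cells are axis-aligned rectangles, and all names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Cell = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def inner_dist(p: Point, cell: Cell) -> float:
    """Distance from a point inside the cell to the nearest cell boundary."""
    x, y = p
    x_min, y_min, x_max, y_max = cell
    return min(x - x_min, x_max - x, y - y_min, y_max - y)


def outer_dist(q: Point, cell: Cell) -> float:
    """Minimum distance from a point to the cell (0 if the point is inside)."""
    x, y = q
    x_min, y_min, x_max, y_max = cell
    dx = max(x_min - x, 0.0, x - x_max)
    dy = max(y_min - y, 0.0, y - y_max)
    return math.hypot(dx, dy)


def spatial_upper_bound(sub_loc: Point, cell: Cell,
                        msg_loc: Point, max_dist: float) -> float:
    """Upper bound on spatial similarity between a subscription and a message."""
    out = outer_dist(msg_loc, cell)
    if out == 0.0:
        return 1.0  # message lies in the same cell: no pruning power
    return 1.0 - (inner_dist(sub_loc, cell) + out) / max_dist
```

The sum is a valid lower bound on the true distance because the segment from an outside message to the subscription must cross the cell boundary: the part outside the cell is at least the outer distance, and the part inside is at least the inner distance. Only `outer_dist` depends on the message, and it is shared by every subscription in the cell.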
Example 5
An example is shown in Figure 4. If we assume the , we have , and . Thus, cannot be skipped w.r.t. . However, if we utilize the inner spatial bound and outer spatial bound together, we have , , and . In this case, we can safely skip w.r.t. .
5.1.3 Bound Estimation for Unseen Keywords
Since we employ the TAAT paradigm to visit the inverted file, we can estimate a textual upper bound for unseen keywords. If this upper bound plus the textual similarity that has already been computed is still less than the required threshold, we can safely skip . The textual upper bound between the unseen keywords of and can be computed as follows:
(5) 
where and are starting positions of unseen keywords. Then the following theorem claims we can skip a subscription by utilizing the textual upper bound.
Theorem 5.2
Given a subscription , a message and their textual similarity threshold , assuming we have already computed the partial similarity between and , denoted as , then is sufficient to skip .
Proof
As , we have . The theorem holds immediately from Lemma 1.
Example 6
In Figure 4, consider that we are currently disseminating . Based on the dissemination algorithm to be discussed later in Section 5.4, we need to traverse the inverted lists in cell (where resides) for all the keywords in one by one. We first check the inverted list of since is the 1st keyword of . Assuming , we cannot skip since . However, since is the 2nd keyword in , we can compute , and . Because , we can immediately skip .
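The early-termination idea behind Theorem 5.2 can be sketched for a single subscription as follows. Since Equation 5 is not legible above, the sketch substitutes a Cauchy-Schwarz bound on the unseen keywords (the product of the two suffix norms); this is a stand-in chosen for illustration, not necessarily the bound used in the paper. The function name and the explicit global keyword `order` are hypothetical.

```python
import math
from typing import Dict, List, Optional


def textual_sim_or_skip(sub: Dict[str, float], msg: Dict[str, float],
                        order: List[str], threshold: float) -> Optional[float]:
    """Accumulate similarity one term at a time, in the global keyword order.

    Returns None as soon as partial similarity plus an upper bound on the
    unseen keywords falls below the threshold; otherwise the exact similarity.
    """
    n = len(order)
    # rest_s[i], rest_m[i]: squared weight mass of the terms order[i:].
    rest_s = [0.0] * (n + 1)
    rest_m = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        rest_s[i] = rest_s[i + 1] + sub.get(order[i], 0.0) ** 2
        rest_m[i] = rest_m[i + 1] + msg.get(order[i], 0.0) ** 2

    partial = 0.0
    for i, term in enumerate(order):
        partial += sub.get(term, 0.0) * msg.get(term, 0.0)
        # Cauchy-Schwarz: the unseen dot product is at most the norm product.
        unseen_bound = math.sqrt(rest_s[i + 1] * rest_m[i + 1])
        if partial + unseen_bound < threshold:
            return None
    return partial
```

In a real index the subscription-side suffix values would be materialized in the posting lists rather than recomputed per message; they are rebuilt here only to keep the sketch self-contained.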
5.2 Group Pruning Technique
After applying the individual pruning technique, many subscriptions can be skipped without the need to compute their exact similarity w.r.t. a message. To further enhance the performance, we propose a novel Group Pruning Technique such that we can skip a group of subscriptions without the need to visit them individually. To begin with, we first define the subscription-dependent prefix for a message.
Definition 9 (Refined Sub-dependent Prefix)
Given a message , a subscription and , we use to denote the refined subscription-dependent prefix of w.r.t. , where with .
The following lemma claims the refined subscription-dependent prefix is sufficient to determine whether a message could be a top-k result of a subscription.
Lemma 4
Given a subscription and a message , is sufficient to skip regarding .
Proof
Let us denote the posting list of keyword in cell as , which contains all the subscriptions having and residing in . Then based on Lemma 4, for a subscription in , if , we can safely skip . Further, if this holds for a group of subscriptions on , we can safely skip the whole group as follows.
Lemma 5
Given a message , a keyword , a posting list and a group of subscriptions inside , is sufficient to skip the whole group .
Proof
The left side of the inequality in Lemma 5 can be computed in time since we can materialize for each group. However, for the right side, it would be quite inefficient if we compute it on the fly for each new message. To avoid this, we propose a lower bound for which can be computed in constant time. In the following, we first present the subscription grouping strategy and then introduce the details of the lower bound deduction.
5.2.1 Partition Scheme
Intuitively, we should group subscriptions with similar such that we can get a tighter textual threshold for the group. We first let . Note that we compute by utilizing only in order to make independent of a specific subscription. Therefore, it is observed from Equation 2 that, for the computation of , only and are dependent on while is irrelevant to . For simplicity, we denote as and as respectively. Then, we partition subscriptions into groups based on their values, such that the subscriptions inside a group have similar values. We employ a quantile-based method to partition the domain of to ensure that each group has a similar number of subscriptions. Then, we can skip the whole group as stated in the following theorem.
Theorem 5.3
Given a group generated by partition in a posting list , we denote as and as . Then is sufficient to skip the whole group .
Proof
It is obvious that for any subscription in , we have . Thus, we have . Combined with Lemma 5, the theorem holds immediately.
Time complexity. The condition checking in Theorem 5.3 takes time, since we can precompute the values of and .
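The quantile-based partition of Section 5.2.1 can be sketched as follows. This is only an illustration: it assumes each subscription carries a single numeric value on which groups are formed (called `value` here, a hypothetical name), and it stores the smallest and largest value per group as an example of statistics that can be materialized for constant-time group checks.

```python
def quantile_groups(subs, num_groups):
    """Split (sub_id, value) pairs into groups of near-equal size.

    Subscriptions are sorted by value and cut at equally spaced ranks,
    so each group covers a narrow value range and holds a similar
    number of subscriptions. Empty groups are dropped.
    """
    ordered = sorted(subs, key=lambda pair: pair[1])
    n = len(ordered)
    groups = []
    for g in range(num_groups):
        lo = g * n // num_groups
        hi = (g + 1) * n // num_groups
        if lo == hi:
            continue
        chunk = ordered[lo:hi]
        groups.append({
            "members": [sub_id for sub_id, _ in chunk],
            "value_min": chunk[0][1],   # materialized per-group statistic
            "value_max": chunk[-1][1],  # materialized per-group statistic
        })
    return groups
```

Cutting by rank rather than by fixed-width intervals keeps group sizes balanced even when the values are skewed.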
5.2.2 Early Termination Within Group
When a group cannot be skipped for a given message, we would otherwise have to check every subscription in it. To avoid this, we propose an early termination technique that stops the scan inside a group that cannot be skipped as a whole. To enable early termination, for each group in , we sort the subscriptions in by their values in increasing order. For each subscription in , we denote the subscriptions with not less than as , and maintain two statistics and w.r.t. keyword as follows:
(6) 
(7) 
Then we can employ early termination as follows.
Theorem 5.4
Given a group inside a posting list , and assuming is the subscription with smallest position in such that the following inequality holds: , then there is no need to check the subscriptions after (including itself).
Proof
Time complexity. To speed up real-time processing, we precompute and and store them with each subscription in the group . The condition in Theorem 5.4 can be checked efficiently in time with binary search.
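The binary search behind early termination can be sketched in a simplified setting. The sketch collapses the two per-subscription statistics into one hypothetical number, `required[p]`, the minimum bound a message must reach for the subscription at position `p`, and compares it with a single message-side bound. The actual inequality of Theorem 5.4 is richer; only the monotone-suffix idea is shown.

```python
from bisect import bisect_right


def suffix_minima(required):
    """suffix_min[p] = min(required[p:]); non-decreasing in p by construction."""
    out = [0.0] * len(required)
    running = float("inf")
    for p in range(len(required) - 1, -1, -1):
        running = min(running, required[p])
        out[p] = running
    return out


def first_skippable(suffix_min, msg_bound):
    """Smallest position p with suffix_min[p] > msg_bound.

    Every subscription from p onward needs more than the message can
    offer, so the scan of the group can stop at p. Returns
    len(suffix_min) when no suffix can be skipped.
    """
    return bisect_right(suffix_min, msg_bound)
```

Because a suffix minimum can only grow as the suffix shrinks, the skip condition is monotone in the position, which is what makes a logarithmic-time search valid.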
5.2.3 Cell-based Pruning
Besides the above group pruning technique, we notice that for some cells which are far away from the location of an arriving message, we can safely skip the whole cell. Specifically, for each subscription within a cell , we can derive a spatial similarity threshold as follows:
(8) 
where we assume the textual similarity achieves the largest value, i.e., . Then we can reach the following lemma.
Lemma 6
Given a cell , if , we can safely skip all the subscriptions in cell .
Proof
For , we have . Thus, . Thus, cannot be top results of any in .
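A minimal sketch of the cell-level test in Lemma 6 follows. It assumes a linear scoring function of the form alpha * spatial similarity + (1 - alpha) * textual similarity with textual similarity at most 1, and a spatial similarity of 1 - distance / max_dist; both forms, as well as all names, are assumptions made for illustration and may not match Equation (8) exactly.

```python
import math


def min_dist_to_cell(x, y, cell):
    """Smallest Euclidean distance from point (x, y) to rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = cell
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def spatial_threshold(k_score, alpha):
    """Spatial similarity needed to reach k_score when textual similarity is 1 (assumed score form)."""
    return (k_score - (1.0 - alpha)) / alpha


def can_skip_cell(msg_xy, cell, subs, max_dist):
    """subs: (k_score, alpha) pairs of the subscriptions in the cell, with alpha > 0.

    The cell is skipped when the best spatial similarity the message can
    have to any point in the cell is below the smallest threshold.
    """
    best_ssim = 1.0 - min_dist_to_cell(msg_xy[0], msg_xy[1], cell) / max_dist
    cell_threshold = min(spatial_threshold(ks, a) for ks, a in subs)
    return best_ssim < cell_threshold
```

In practice the minimum threshold of a cell would be materialized with the cell and refreshed when the top-k results of its subscriptions change, so the test costs one distance computation per cell.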
5.3 Subscription Index
Relying on all the techniques discussed above, our subscription index is essentially a Quadtree structure integrated with an inverted file in each leaf cell, as shown in Figure 5. For each registered subscription, we store its detailed information and relevant statistics in a subscription table, and insert it into a leaf cell of the Quadtree based on its spatial location. Note that in the Quadtree, we only store the subscription id, which refers to the detailed information in the subscription table. Within each leaf cell, an inverted file is built upon all the subscriptions inside the cell. Each posting list in the inverted file is further partitioned into groups based on the subscription preference to enable group pruning. Each group is also associated with the statistics mentioned above. Finally, to facilitate early termination, the subscriptions within each group are ordered based on their .
5.4 Dissemination Algorithm
Algorithm 1 shows our message dissemination algorithm. We follow a filtering-and-verification paradigm, where we first generate a set of candidate subscriptions (Lines 11), and then compute the exact scores to determine the truly affected ones, with the updated results being disseminated accordingly (Line 1). Specifically, we first initialize an empty map to store candidates with their scores (Line 1). Then the maxwt and wtsum values for all the keywords in the arriving message are computed for later use (Line 1). For each leaf cell that survives cell pruning (Line 1), we first compute and then traverse the inverted file in cell in a TAAT manner. For each group encountered in (Line 1), we skip if group pruning can be applied (Line 1); otherwise, we identify for early termination based on Theorem 5.4 (Line 1 and Line 1). For each surviving subscription , we employ location-aware prefix filtering (Line 1) and bound estimation for unseen keywords (Line 1) to skip it as early as possible. For the surviving subscriptions, we store the textual similarity accumulated so far w.r.t. in , while for the skipped subscriptions, we set to negative infinity (Line 1). Finally, for each subscription in with , we verify it and update its top-k results if needed (Line 1). Note that when verifying a candidate , we only need to compute the exact spatial similarity to get the final score because the textual similarity, i.e., , has already been computed. The statistics relevant to the pruning techniques are also updated in Line 1.
Algorithm 1: Message Dissemination
5.5 Index Maintenance
Our indexing structure also supports subscription updates efficiently. For a new subscription , we first find the leaf cell containing its location, and then insert it into the inverted file with cost. Note that the statistics mentioned above need to be updated accordingly. For an expired subscription, we simply delete it from the index and update the statistics if necessary.
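The layout of Section 5.3 and the update operations above can be sketched with a deliberately simplified structure. A fixed uniform grid stands in for the Quadtree leaves, grouping within posting lists and statistics maintenance are omitted, and the class and field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subscription:
    sub_id: int
    x: float
    y: float
    keywords: tuple


class GridIndex:
    """Subscription table plus one inverted file per cell (grid used instead of a Quadtree)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.table = {}  # sub_id -> Subscription (the subscription table)
        # cell -> keyword -> posting list of subscription ids
        self.cells = defaultdict(lambda: defaultdict(list))

    def _cell_of(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, sub):
        """Register a subscription: store details once, ids in the posting lists."""
        self.table[sub.sub_id] = sub
        inverted = self.cells[self._cell_of(sub.x, sub.y)]
        for w in sub.keywords:
            inverted[w].append(sub.sub_id)

    def delete(self, sub_id):
        """Remove an expired subscription and drop posting lists that become empty."""
        sub = self.table.pop(sub_id)
        cell = self._cell_of(sub.x, sub.y)
        inverted = self.cells[cell]
        for w in sub.keywords:
            inverted[w].remove(sub_id)
            if not inverted[w]:
                del inverted[w]
        if not inverted:
            del self.cells[cell]
```

Keeping only ids in the posting lists means a subscription that appears under several keywords is stored once, which mirrors the role of the subscription table in the paper.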
6 Top-k Reevaluation
In this section, we present the details of the top-k reevaluation module. We first introduce some background knowledge on the k-skyband in Section 6.1. Then we present our cost-based k-skyband technique in detail in Section 6.2. In the remainder of this paper, we denote the skyband buffer (either full or partial) of a subscription as for simplicity, and the exact top-k results are denoted as . Meanwhile, is denoted as if it is clear from context.
6.1 KSkyband
The idea of utilizing the k-skyband to reduce the number of reevaluations for top-k queries over a sliding window was first proposed in DBLP:conf/sigmod/MouratidisBP06 (). In particular, for a given subscription , only the messages in its corresponding k-skyband can appear in its top-k results over the sliding window, so only these messages need to be maintained. The formal definitions of dominance and k-skyband follow.
Definition 10 (Dominance)
A message dominates another message w.r.t. a subscription if both and hold.
Definition 11 (k-skyband)
The k-skyband of a subscription , denoted as , contains the set of messages that are dominated by fewer than k other messages.
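The two definitions can be made concrete with a small, quadratic-time sketch. It assumes the usual reading of dominance in the sliding-window setting, namely that a message dominates another if it arrives later and its score is not lower; the precise conditions of Definition 10 should be taken from the paper. Messages are represented as hypothetical `(score, arrival_time)` pairs.

```python
def dominates(a, b):
    """a dominates b if a arrives later and scores at least as high (assumed definition)."""
    return a[1] > b[1] and a[0] >= b[0]


def k_skyband(messages, k):
    """Messages dominated by fewer than k other messages."""
    return [
        m for m in messages
        if sum(dominates(other, m) for other in messages) < k
    ]
```

A message outside the k-skyband always has at least k newer messages that score no lower, so it can never enter the top-k results before it expires.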
Instead of keeping the k-skyband over all the messages in the sliding window, which is cost-prohibitive, Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 () maintain a partial k-skyband. Specifically, they only maintain the messages with score not lower than a threshold , where is the after the most recent top-k reevaluation for and remains unchanged until the next reevaluation is triggered. However, as our experiments suggest, the method in DBLP:conf/sigmod/MouratidisBP06 () may result in expensive computational cost due to the improper selection of .
To alleviate the above problem, we propose a novel cost-based k-skyband technique, which judiciously selects the best threshold for the skyband maintenance of each subscription. To start with, we present an overview of our top-k reevaluation algorithm in Algorithm 2. For each subscription containing the expired message , if the size of after deleting is less than , we need to reevaluate its top-k results from scratch. Specifically, we first compute a proper threshold based on our cost model (Line 2), and then recompute the skyband buffer based on , which contains all the messages with score at least (Line 2 and Line 2); the same technique as in DBLP:conf/sigmod/MouratidisBP06 () is used to compute the skyband. Note that can be computed by utilizing the message index. Finally, we extract the top-k results from (Line 2). The key challenge here is to estimate the best threshold , which is discussed in detail below. We remark that we use the term reevaluation to refer specifically to the top-k recomputation against the message index.
Algorithm 2: Top-k Reevaluation
6.2 Cost-based K-Skyband
The general idea of our cost-based k-skyband model is to select the best threshold for each subscription such that the overall cost defined in the cost model is minimized. The following theorem guarantees that, as long as we maintain a partial k-skyband over all the messages with score not lower than , we can safely extract the top-k results from the partial k-skyband when some message expires.
Theorem 6.1
Given a subscription , let be the after the most recent top reevaluation for . We always have if the following conditions hold: (1) ; (2) is a partial skyband which is built over all the messages with score at least in the sliding window, where .
Proof
We prove it by contradiction. Assume there exists a message while ; we consider two possible cases: (1) ; (2) . For the first case, since , must be dominated by more than messages in , which indicates it cannot be a top-k result, i.e., . For the second case, at least messages in must have a higher score than because and all the messages in have score at least . Thus, still cannot be a top-k result. Hence, the original assumption does not hold, which immediately indicates .
Thus, based on Theorem 6.1, we can safely extract top results from skyband buffer when ; when , we have to reevaluate from message index.
Our cost-based k-skyband model, based on Theorem 6.1, aims to find the best such that the overall cost is minimized for each subscription. We mainly consider two costs. The first one is the skyband maintenance cost, denoted as , which is triggered upon message arrival and expiration. The second one is the top-k reevaluation cost, denoted as , which is triggered when some message expires and the top-k results can no longer be retrieved from the skyband buffer. We aim to estimate the expected overall cost w.r.t. each message update, i.e., message arrival and message expiration, each of which we assume occurs with probability as the window slides. To simplify the presentation, we denote as the probability that the score between a random message and a subscription is at least . We may immediately derive for a given from historical data, assuming that scores follow the same distribution as in the past. The details of these two costs are presented below.
6.2.1 Skyband Maintenance Cost
The maintenance of the k-skyband is triggered when the following two types of updates happen, both with probability , where is the probability of message arrival or message expiration due to the count-based sliding window, and is the probability that the score between a random message and is at least . Please note that if the independence assumption does not hold for messages, the above probabilities cannot be estimated accurately, and we may resort to utilizing historical data for the estimation.
The first type of update is triggered when a message with score at least arrives. Apart from the insertion of into , the dominance counters of all the messages in with score not higher than will increase by 1, and the messages with dominance counter equal to will be evicted. Since we implement our skyband buffer as a linked list sorted by , the above operations can be processed in time with a linear scan. The next challenge is to estimate . Based on the independence assumption between the score dimension and the time dimension, the expected number, i.e., , of messages in the partial k-skyband is DBLP:journals/tkde/ZhangLYKZY10 (), where is the size of the sliding window. Please note that if the independence assumption does not hold, the worst-case space complexity will be .
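The arrival-time update can be sketched as a single linear scan over a score-ordered buffer. The sketch uses a Python list in place of the linked list, assumes the buffer is sorted by score in descending order, and takes the threshold as given; the class and method names are hypothetical. Removal on expiration and top-k extraction are included only to show how the buffer is consumed.

```python
class SkybandBuffer:
    """Partial k-skyband of one subscription, kept sorted by score (descending)."""

    def __init__(self, k, threshold):
        self.k = k
        self.threshold = threshold
        self.entries = []  # each entry: [score, msg_id, dominance_counter]

    def on_arrival(self, msg_id, score):
        """Insert a new message, bump counters of lower-or-equal scores, evict at k."""
        if score < self.threshold:
            return  # below the maintained threshold: not buffered
        kept = []
        inserted = False
        for entry in self.entries:
            if entry[0] <= score:
                if not inserted:
                    kept.append([score, msg_id, 0])
                    inserted = True
                entry[2] += 1  # the newer message dominates this one
            if entry[2] < self.k:
                kept.append(entry)  # counter reached k: evicted
        if not inserted:
            kept.append([score, msg_id, 0])
        self.entries = kept

    def on_expiration(self, msg_id):
        """An expired message dominates nothing, so no counter changes."""
        self.entries = [e for e in self.entries if e[1] != msg_id]

    def top_k(self):
        """Top-k ids if the buffer still holds k messages, else None (reevaluation needed)."""
        if len(self.entries) < self.k:
            return None
        return [e[1] for e in self.entries[: self.k]]
```

Returning `None` from `top_k` models the case in which the buffer has shrunk below k and the results must be recomputed against the message index.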
The second type of update occurs when an old message among the skyband buffer of expires. In this case, we only need to delete from in time. Note that does not dominate any remaining messages and therefore the dominance counters of the remaining messages are not affected. Finally, we get the total cost of skyband maintenance as follows: