Top-k Spatial-keyword Publish/Subscribe Over Sliding Window
With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data has been generated in a stream fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under controllable memory cost, the sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. A novel centralized system, called Skype (Top-k Spatial-keyword Publish/Subscribe), is proposed in this paper. In Skype, to continuously maintain top-k results for massive subscriptions, we devise a novel indexing structure upon subscriptions such that each incoming message can be immediately delivered on its arrival. To reduce the expensive top-k re-evaluation cost triggered by message expiration, we develop a novel cost-based k-skyband technique to reduce the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of our proposed techniques. Furthermore, to support better scalability and higher throughput, we propose a distributed version of Skype, namely DSkype, on top of Storm, which is a popular distributed stream processing system. With the help of fine-tuned subscription/message distribution mechanisms, DSkype can achieve orders-of-magnitude speed-up over its centralized version.
Recently, with the ubiquity of social media and GPS-enabled mobile devices, large volumes of geo-textual data have been generated in a stream fashion, leading to the popularity of spatial-keyword publish/subscribe systems (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()) in a variety of applications such as location-based recommendation and social networks. In such a system, each individual user can register her interest (e.g., favorite food or sports) and location as a spatial-keyword subscription. A stream of geo-textual messages (e.g., e-coupon promotions and tweets with location information) continuously generated by publishers (e.g., local businesses) is rapidly fed to the relevant users.
The spatial-keyword publish/subscribe system has been studied in several existing works (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 ()). Most of them are geared towards boolean matching, which makes the number of messages received by users unpredictable. This motivates us to study the problem of top-k spatial-keyword publish/subscribe such that only the top-k most relevant messages are presented to users. Moreover, we adopt the popular sliding window model DBLP:conf/pods/BabcockBDMW02 () on the geo-textual stream to provide fresh information under controllable memory usage. In particular, for each subscription, we score a message based on their geo-textual similarity, and the top-k messages are continuously maintained against the update of the sliding window (i.e., message arrival and expiration). Below is a motivating example.
Figure 1 shows an example of location-aware e-coupon recommendation system. Three users interested in nearby restaurants are registered with their locations and favorite food, intending to keep an eye on the most relevant e-coupon issued recently. We assume the system only stores the most recent four e-coupons. An e-coupon will be delivered to a user if has the highest score w.r.t. according to their spatial and textual similarity. Initially, we have four e-coupons, and the top-1 answer of each user is shown in bold in the upper-right table, where the relevance score between user and e-coupon is depicted. When a new e-coupon arrives and the old e-coupon expires, the updated results are shown in bottom-right table. Particularly, the top-1 answer of is replaced by since is discarded from the system, while the answer of is replaced by , as is the most relevant to . The top-1 answer of remains unchanged.
Challenges. Besides the existing challenges in spatial-keyword query processing DBLP:conf/icde/FelipeHR08 (); DBLP:journals/pvldb/CongJW09 (); DBLP:conf/ssd/RochaGJN11 (); DBLP:conf/sigir/ZhangCT14 (), our problem presents two new challenges.
The first challenge is to devise an efficient indexing structure for a huge number of subscriptions, such that each message from the high-speed stream can be disseminated immediately on its arrival. The only work that supports top-k spatial-keyword publish/subscribe is proposed by Chen et al. DBLP:conf/icde/ChenCCT15 (). In a nutshell, they first deduce a textual bound for each subscription and then employ the DAAT (Document-at-a-time DBLP:conf/cikm/BroderCHSZ03 ()) paradigm to traverse the inverted file built in each spatial node. However, we observe that the continuous top-k monitoring problem is essentially a threshold-based similarity search problem from the perspective of a message; that is, a new message will be delivered to a subscription if and only if its score is not less than the current threshold score (e.g., the k-th highest score) of the subscription. Consequently, although the DAAT paradigm has been widely used for top-k search (e.g., DBLP:conf/sigir/DingS11 ()), it is not suitable for our problem because the advanced threshold-based pruning techniques cannot be naturally integrated under the DAAT paradigm.
The second challenge is the top-k re-evaluation problem triggered by frequent message expiration from the sliding window. For example, in Figure 1, the expiration of invalidates the current top- answer (i.e., ) of , and thus the system has to re-compute the new result for over the sliding window. It is cost-prohibitive to re-evaluate all the affected subscriptions from scratch when a message expires. Some techniques have been proposed to solve this problem (e.g., DBLP:conf/icde/YiYYXC03 (); DBLP:conf/sigmod/MouratidisBP06 (); DBLP:conf/icde/BohmOPY07 (); DBLP:journals/tods/PripuzicZA15 ()). Yi et al. DBLP:conf/icde/YiYYXC03 () introduce a kmax strategy, which maintains top-k' results, with k' being a value between k and kmax, rather than buffering the exact top-k results. Later, Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 () notice that kmax ignores the dominance relationship between messages, and propose to convert top-k maintenance into partial k-skyband maintenance to reduce the number of re-evaluations. Nevertheless, they simply use the k-th score of a continuous query (i.e., subscription in our paper) as the threshold of its k-skyband without theoretical underpinnings, which may result in poor performance in practice.
On the other hand, the limited computational resources (e.g., CPU, memory) in a single machine often become the bottleneck when we increase the scale of real-life applications, where millions of active users need to be maintained simultaneously. To alleviate this issue, we extend Skype on top of Storm (Apache Storm project, http://storm.apache.org/), an open-source distributed real-time in-memory processing system, to leverage parallel processing such that high throughput can be achieved. Storm itself is intrinsically designed to solve real-time stream processing tasks, and therefore suits our top-k publish/subscribe problem well. The main challenge here lies in how to partition and distribute subscriptions and messages such that workload balance and high throughput can be achieved at a small communication cost.
In this paper, we propose a novel centralized system, i.e., Skype, to efficiently support top-k Spatial-keyword Publish/Subscribe over a sliding window. Two key modules, the message dissemination module and the top-k re-evaluation module, are designed to address the above challenges. Specifically, the message dissemination module aims to rapidly deliver each arriving message to its affected subscriptions on its arrival. We devise efficient subscription indexing techniques which carefully integrate both spatial and textual information. Following the TAAT (Term-at-a-time DBLP:conf/sigir/BuckleyL85 ()) paradigm, we significantly reduce the number of non-promising subscriptions for the incoming message by utilizing a variety of spatial and textual pruning techniques. On the other hand, the top-k re-evaluation module is designed to refill the top-k results of subscriptions when their results expire. To alleviate frequent re-evaluations, we develop a novel cost-based k-skyband technique which carefully selects the messages to be buffered based on a threshold value determined by a cost model, considering both top-k re-evaluation cost and k-skyband maintenance cost. In addition, to speed up real-time processing, we follow most of the existing publish/subscribe systems (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()) and implement all our indexes in main memory.
To support better scalability beyond Skype, we pioneer a novel distributed real-time processing system, namely, DSkype, which is a distributed version of Skype deployed on top of Storm. We propose four different distribution mechanisms, i.e., hashing-based, location-based, keyword-based and prefix-based, to distribute subscriptions and messages to relevant components. Among them, prefix-based technique yields the best overall performance in terms of both throughput and communication cost. For example, it can process nearly messages per second over subscriptions on a small-size cluster.
Contributions. Our principal contributions are summarized as follows:
We propose a novel framework, called Skype, which continuously maintains top-k geo-textual messages for a large number of subscriptions over the sliding window model. To the best of our knowledge, this is the first work to integrate the sliding window model into a spatial-keyword publish/subscribe system. (Section 4)
For the message dissemination module, we propose both an individual pruning technique and a group pruning technique to significantly improve the dissemination efficiency following the TAAT paradigm. (Section 5)
For the top-k re-evaluation module, a novel cost-based k-skyband method is developed to determine the best threshold value with in-depth theoretical analysis. It is worth mentioning that our technique is a general approach which can be applied to other continuous top-k problems over a sliding window. (Section 6)
We extend Skype on top of Storm, a distributed real-time processing environment. By introducing to Storm a distribution layer which employs several efficient distribution mechanisms, the distributed version can achieve high throughput with better scalability. As far as we know, this is the first work which extends a top-k publish/subscribe system on top of Storm. (Section 7)
We conduct extensive experiments to verify the efficiency and effectiveness of both Skype and its distributed version DSkype. It turns out that Skype usually achieves up to orders of magnitude improvement compared to its competitors, while DSkype improves further over Skype by a large margin, with better scalability. (Section 8)
2 Related Work
2.1 Spatial-keyword Search
Spatial-keyword search has been widely studied in the literature. It aims to retrieve a set of geo-textual objects based on boolean matching (e.g., DBLP:conf/cikm/ZhouXWGM05 (); DBLP:conf/ssdbm/HariharanHLM07 (); DBLP:conf/icde/FelipeHR08 ()) or a score function (e.g., DBLP:journals/pvldb/CongJW09 (); DBLP:conf/ssd/RochaGJN11 (); DBLP:conf/cikm/ChristoforakiHDMS11 (); DBLP:conf/sigir/ZhangCT14 ()) by combining both a spatial index (e.g., R-Tree, Quadtree) and a textual index (e.g., inverted file). A nice summary of spatial-keyword query processing is available in DBLP:journals/pvldb/ChenCJW13 (). Several extensions based on spatial-keyword processing have also been investigated, such as the moving spatial-keyword query DBLP:conf/sigmod/GuoZLTB15 (), the collective spatial-keyword query DBLP:conf/sigmod/GuoCC15 () and the reverse spatial-keyword query DBLP:conf/sigmod/LuLC11 (). Note that a spatial-keyword search is an ad-hoc/snapshot query (i.e., user-initiated model) while our problem focuses on continuous queries (i.e., server-initiated model).
2.2 Publish/Subscribe System
Users register their interest as long-running queries in a publish/subscribe system, and streaming publications are delivered to relevant users whose interests are satisfied. Nevertheless, most of the existing publish/subscribe systems (e.g., DBLP:journals/pvldb/WhangBSVVYG09 (); DBLP:conf/sigmod/SadoghiJ11 (); DBLP:journals/pvldb/ZhangCT14 (); DBLP:journals/pvldb/ShraerGFJ13 ()) do not consider spatial information. Recently, spatial-keyword publish/subscribe systems have been studied in a line of work (e.g., DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 (); DBLP:conf/icde/HuLLFT15 (); DBLP:conf/icde/ChenCCT15 ()). Among them, DBLP:conf/kdd/LiWWF13 (); DBLP:conf/sigmod/ChenCC13 (); DBLP:conf/icde/WangZZLW15 () study the boolean matching problem while DBLP:conf/icde/HuLLFT15 () studies the similarity search problem, where each subscription has a pre-given threshold. These works are inherently different from ours, and it is non-trivial to extend their techniques to support top-k monitoring.
The CIQ index proposed by Chen et al. DBLP:conf/icde/ChenCCT15 () is the only closely related work that supports top-k spatial-keyword publish/subscribe (shown in Figure 2). In CIQ, a Quadtree is used to partition the whole space. Each subscription is assigned to a number of covering cells, forming a disjoint partition of the entire space. In Figure 2, we assume all the subscriptions have the same cell covering, i.e., from to . A textual bound (e.g., MinT) is precomputed for each subscription w.r.t. each assigned cell, as shown in the tables where the textual bounds w.r.t. and are displayed. An inverted file ordered by subscription id is built to organize the subscriptions assigned to each cell. For a new message (e.g., ), CIQ traverses, in the DAAT paradigm, all the inverted files whose cells are penetrated by the message location (e.g., ), and finds all the subscriptions with textual similarity higher than the precomputed bound as candidates, which are then verified to get the final results. However, since our problem is by nature a threshold-based search problem, the DAAT paradigm employed in CIQ cannot integrate some advanced techniques for threshold-based similarity search. Contrary to CIQ, our indexing structure is designed for the TAAT paradigm, combined with advanced techniques for threshold-based pruning, thus enabling us to exclude a significant number of subscriptions. Moreover, CIQ indexes each subscription into multiple cells, taking advantage of a precomputed spatial bound. However, the gain is limited since the number of covering cells for each subscription cannot be too large; otherwise, it would lead to extremely high memory cost. Thus, we turn to an on-the-fly spatial bound computation strategy, where each subscription is assigned to a single cell with finer spatial granularity. Finally, we remark that CIQ integrates a time decay function rather than a sliding window, which, in the worst case, may overwhelm the limited memory.
2.3 Top-k Maintenance Over Sliding Window
One critical problem for top-k maintenance over a sliding window is that, when an old element (i.e., message in this paper) expires, we have to recompute the top-k results for the affected continuous queries (i.e., subscriptions in this paper), which is expensive if we simply re-evaluate from scratch. On the flip side, it is also infeasible to buffer all elements and their scores for each individual query to avoid top-k re-evaluation. Several techniques have been proposed to identify a trade-off between the number of re-evaluations and the buffer size. In DBLP:conf/icde/YiYYXC03 (), Yi et al. introduce a kmax approach. Rather than maintaining exact top-k results, they continuously maintain top-k' results, where k' is between k and a parameter kmax. However, as observed by Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 (), kmax may contain redundant elements because it overlooks the dominance relationship. Thus, Mouratidis et al. propose a k-skyband based algorithm to remove redundancy. Since it is very expensive to maintain the full k-skyband for each individual query, they only keep elements with scores not lower than the k-th highest score determined by the most recent top-k re-evaluation. We observe that this setting is rather ad hoc and thus may result in unsatisfactory performance in practice. Böhm et al. DBLP:conf/icde/BohmOPY07 () utilize a delay buffer to avoid inserting the newly-arriving objects with low scores into the k-skyband. However, since each object has to probe the query index twice during its lifetime, their method is not suitable for our problem given the large number of registered queries (i.e., subscriptions). Pripuzic et al. DBLP:journals/tods/PripuzicZA15 () propose a probabilistic k-skyband method to drop the data which is unlikely to become top-k results in order to save space and improve efficiency. However, their technique may discard some top-k elements due to its probabilistic nature.
In this paper, we propose a novel cost-based k-skyband technique to carefully determine the size of the k-skyband buffer based on a cost model.
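To make the k-skyband idea concrete, the following is a minimal sketch of the full (unthresholded) k-skyband of one query over the current window. It assumes the usual dominance notion for sliding windows, namely that an element is dominated by any element that arrives later and has a strictly higher score; the function name `k_skyband`, the `(arrival, score)` representation and the quadratic scan are illustrative choices, not the method of any of the cited works, and the cost-based threshold selection is not modelled here.

```python
from typing import List, Tuple

Entry = Tuple[int, float]  # (arrival time, score w.r.t. one query)


def k_skyband(entries: List[Entry], k: int) -> List[Entry]:
    """Keep the entries dominated by fewer than k other entries.

    Assumption: e1 dominates e2 when e1 arrives later (so it expires later)
    and has a strictly higher score. An entry with k or more dominators can
    never enter the top-k before it expires, so it need not be buffered.
    """
    band = []
    for t2, sc2 in entries:
        dominators = sum(1 for t1, sc1 in entries if t1 > t2 and sc1 > sc2)
        if dominators < k:
            band.append((t2, sc2))
    return band
```

For instance, with entries `[(1, 0.9), (2, 0.5), (3, 0.7), (4, 0.6)]` and k = 1, the entry arriving at time 2 is dominated by two later, higher-scoring entries and is dropped, while the other three must be kept because each can become the top-1 answer once the entries above it expire.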
2.4 Distributed Spatial Query Processing
A number of works study spatial query processing on distributed systems. Nishimura et al. DBLP:conf/mdm/NishimuraDAA11 () extend HBase (Apache HBase project, https://hbase.apache.org/) to support a multi-dimensional index. Aji et al. DBLP:journals/pvldb/AjiWVLL0S13 () propose Hadoop-GIS, a distributed data warehouse infrastructure built on top of Hadoop, which provides spatial data analytics functionality. Later, Eldawy et al. DBLP:conf/icde/EldawyM15 () develop SpatialHadoop, a full-fledged system which supports various spatial queries by integrating spatial awareness in each Hadoop layer. Aly et al. present an adaptive mechanism on top of Hadoop to partition large-scale spatial data for efficient query processing DBLP:journals/pvldb/AlyMHAOEQ15 (). Xie et al. xiesimba () introduce a system called Simba to provide efficient in-memory spatial analytics by extending the Spark SQL engine. All the works above focus on some fundamental spatial queries, such as range queries and kNN queries, which are inherently different from our top-k spatial-keyword publish/subscribe problem. A very relevant work, called Tornado, which also supports spatial-keyword stream processing on Storm, appears in DBLP:journals/pvldb/MahmoodAQRDMAHA15 (). Tornado is a general spatial-keyword stream processing system that supports both snapshot and continuous queries. However, its main focus is not on index construction over subscription queries, which is the main contribution of our paper. Besides, it cannot support top-k spatial-keyword subscription queries as ours does.
On the other hand, many stream processing systems, such as Spark Streaming (Apache Spark project, http://spark.apache.org/streaming/), Samza (Apache Samza project, http://samza.apache.org/) and Storm, have been developed to support efficient processing of real-time data. Most of them are open-source, low-latency, distributed, scalable and fault-tolerant. A nice comparison between different stream processing systems can be found in DBLP:journals/cloudcomp/Ranjan14 (). We choose Storm here mainly because of its simplicity, efficiency, well-documented APIs and very active community (https://github.com/apache/storm). To the best of our knowledge, our work is the first one to support top-k spatial-keyword publish/subscribe in a distributed environment.
In this section, we formally present some concepts which are used throughout this paper.
Definition 1 (Geo-textual Message)
A geo-textual message is defined as , where is a collection of keywords from a vocabulary , is a point location, and is the arrival time.
Definition 2 (Spatial-keyword Subscription)
A spatial-keyword subscription is denoted as , where is a set of keywords, is a point location, is the number of messages that is willing to receive and is the preference parameter used in the score function.
To buffer the most recent data from geo-textual stream, we adopt a count-based sliding window defined as follows.
Definition 3 (Sliding Window)
Given a stream of geo-textual messages arriving in time order, the sliding window over the stream with size consists of most recent geo-textual messages.
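The count-based window of Definition 3 can be sketched as a bounded FIFO buffer. The class and field names below (`Message`, `CountWindow`, `push`) are hypothetical and chosen only for illustration; the sketch simply shows that each arrival beyond the window size triggers exactly one expiration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Message:
    keywords: FrozenSet[str]        # keywords drawn from the vocabulary
    location: Tuple[float, float]   # point location
    arrival: int                    # arrival time


class CountWindow:
    """Count-based sliding window holding the most recent `size` messages."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.buf: Deque[Message] = deque()

    def push(self, msg: Message) -> Optional[Message]:
        """Append `msg`; return the message that expires, or None."""
        self.buf.append(msg)
        if len(self.buf) > self.size:
            return self.buf.popleft()  # the oldest message leaves the window
        return None
```

The value returned by `push` is what a re-evaluation step would consume: it is the only message whose removal can invalidate a maintained result at that moment.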
In the rest of the paper, we abbreviate geo-textual message and spatial-keyword subscription as message (denoted as ) and subscription (denoted as ) respectively if there is no ambiguity. We assume that the keywords in vocabulary , as well as the keywords in each subscription and message, are sorted in increasing order of their term frequencies. Note that sorting keywords in increasing order of frequency is a widely-adopted heuristic to speed up similarity search DBLP:conf/icde/ChaudhuriGK06 (); DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 (). The -th keyword in is denoted as , and we use to denote a subset of , i.e., . Particularly, denotes . Messages follow similar notation.
Score function. To measure the relevance between a subscription and a message , we employ a score function defined as follows:
where is the spatial proximity and is the textual relevance between and . Thus, a subscriber can receive messages which are not only close to her location but also fulfil her interest. Meanwhile, the parameter can be adjusted by subscribers to best satisfy their diverse preferences.
To compute spatial proximity, we utilize Euclidean distance as , where is the Euclidean distance between and , and MaxDist is maximum distance in the space.
For textual similarity, we employ the well-known cosine similarity manning2008introduction () as , where and are tf-idf weights of keyword in and respectively. Note that the weighting vectors of both and are normalized to unit length. Also, same as DBLP:conf/icde/ChenCCT15 (), to guarantee that the top-k results are textually relevant, a message must contain at least one common keyword with a subscription to become one of its top-k results.
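The pieces described above can be put together in a short sketch. It assumes that the score is the linear combination alpha * SSim + (1 - alpha) * TSim, that the spatial proximity is one minus the Euclidean distance divided by MaxDist, and that the textual relevance is the dot product of unit-normalized tf-idf weights over shared keywords. The linear form and all function names are assumptions made for illustration; only the ingredients (Euclidean distance, MaxDist, cosine similarity, a preference parameter) come from the text.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]
Weights = Dict[str, float]  # keyword -> tf-idf weight


def normalize(w: Weights) -> Weights:
    """Scale a weight vector to unit length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {k: v / norm for k, v in w.items()} if norm > 0 else dict(w)


def ssim(p: Point, q: Point, max_dist: float) -> float:
    """Spatial proximity: 1 minus the normalized Euclidean distance."""
    return 1.0 - math.dist(p, q) / max_dist


def tsim(s: Weights, m: Weights) -> float:
    """Cosine similarity of two unit-normalized weight vectors."""
    return sum(wt * m[kw] for kw, wt in s.items() if kw in m)


def score(s_loc: Point, s_w: Weights, alpha: float,
          m_loc: Point, m_w: Weights, max_dist: float) -> float:
    """Assumed linear combination of spatial and textual similarity."""
    return alpha * ssim(s_loc, m_loc, max_dist) + (1.0 - alpha) * tsim(s_w, m_w)
```

Under these assumptions, a subscription at (0, 0) with weights proportional to {pizza: 3, pasta: 4}, alpha = 0.5 and MaxDist = 10 scores a message at (3, 4) containing only "pizza" as 0.5 * 0.5 + 0.5 * 0.6 = 0.55.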
Problem statement. Given a massive number of spatial-keyword subscriptions and a geo-textual stream, we aim to continuously monitor top-k results for all the subscriptions against the stream over a sliding window in real time.
Figure 3 shows the framework of Skype (Top-k Spatial-keyword Publish/Subscribe). We assume our system already has some registered subscriptions. An arriving message will be processed by the message dissemination module, where a subscription index is built to find all the affected subscriptions and update their top-k results. An expired message will be processed by the top-k re-evaluation module. Specifically, it will be checked against a result buffer, which maintains the top-k results (possibly including some non-top-k results) of all the subscriptions. For the subscriptions that cannot be refilled through the result buffer, their top-k results will be re-evaluated from scratch against a message index containing all the messages over the sliding window. Note that the message index can be implemented with any existing spatial-keyword index, such as IR-Tree DBLP:journals/pvldb/CongJW09 () and S2I DBLP:conf/ssd/RochaGJN11 (). Skype can also support subscription updates efficiently. A new subscription will be inserted into the subscription index, with its top-k results being initialized against the message index, while an unregistered subscription will be deleted from both the subscription index and the result buffer. Note that the subscription index and the message index serve different purposes and cannot be trivially combined.
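The arrival and expiration paths of the framework can be illustrated with a deliberately naive baseline for a single subscription: it keeps exactly k results and re-evaluates from scratch whenever one of them expires. This is the behaviour the result buffer is meant to avoid, not Skype itself; the class `NaiveTopK`, its methods, and the use of a plain sequence in place of a message index are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[float, int]  # (score, message id)


class NaiveTopK:
    """Exact top-k of one subscription, recomputed from scratch on expiration."""

    def __init__(self, k: int, score_fn: Callable[[int], float]) -> None:
        self.k = k
        self.score_fn = score_fn      # maps a message id to its score
        self.topk: List[Scored] = []  # min-heap; the root is the k-th best

    def k_score(self) -> float:
        """Current threshold: the k-th highest score, or -inf if not full."""
        return self.topk[0][0] if len(self.topk) == self.k else float("-inf")

    def on_arrival(self, msg_id: int) -> bool:
        """Return True if the arriving message enters the top-k."""
        sc = self.score_fn(msg_id)
        if sc < self.k_score():
            return False
        heapq.heappush(self.topk, (sc, msg_id))
        if len(self.topk) > self.k:
            heapq.heappop(self.topk)
        return True

    def on_expiration(self, msg_id: int, window: Sequence[int]) -> bool:
        """Return True if a re-evaluation over `window` was needed."""
        if all(mid != msg_id for _, mid in self.topk):
            return False
        best = heapq.nlargest(self.k, ((self.score_fn(m), m) for m in window))
        heapq.heapify(best)
        self.topk = best
        return True
```

Here `window` stands in for the message index and must already exclude the expired message. Every expiration that hits the top-k costs a full scan of the window, which is the cost that buffering additional non-top-k results trades against.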
5 Message Dissemination
In this section, we introduce a novel subscription index, which groups similar subscriptions, to support real-time dissemination against the message stream. Specifically, two key techniques, i.e., individual pruning and group pruning, are proposed in Section 5.1 and Section 5.2 respectively, followed by the detailed indexing structure in Section 5.3. Finally, we introduce the dissemination algorithm in Section 5.4 and index maintenance in Section 5.5.
5.1 Individual Pruning Technique
For each incoming message , the key challenge is to determine all the subscriptions whose top-k results are affected. Specifically, we denote the k-th highest score of a subscription as . Then the top-k results of need to be updated if . In this section, we propose a novel location-aware prefix filtering technique to skip an individual subscription efficiently.
5.1.1 Location-aware Prefix Filtering
For ease of exposition, we denote a spatial similarity upper bound between a subscription and a message as , which will be discussed in detail in Section 5.1.2. Based on Equation 1, we can derive a textual similarity threshold for pruning purpose accordingly:
Then the following lemma claims that if the textual similarity between and is less than , we can safely skip .
Lemma 1. A message cannot affect the top-k results of a subscription if .
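The reasoning behind the lemma can be written out under one assumption: that the score is the linear combination of spatial proximity and textual relevance weighted by the preference parameter. The symbols below ($\alpha$ for the preference parameter, $\mathrm{kScore}(s)$ for the k-th highest score of $s$, $\mathrm{SSimUB}$ for the spatial upper bound and $\lambda_T$ for the textual threshold) are names introduced here for illustration and may differ from the paper's notation.

```latex
\begin{align*}
\mathrm{Score}(s,m)
  &= \alpha\,\mathrm{SSim}(s,m) + (1-\alpha)\,\mathrm{TSim}(s,m) \\
  &\le \alpha\,\mathrm{SSimUB}(s,m) + (1-\alpha)\,\mathrm{TSim}(s,m), \\[4pt]
\mathrm{Score}(s,m) \ge \mathrm{kScore}(s)
  &\;\Longrightarrow\;
  \mathrm{TSim}(s,m) \;\ge\; \lambda_T(s,m)
  := \frac{\mathrm{kScore}(s) - \alpha\,\mathrm{SSimUB}(s,m)}{1-\alpha},
  \qquad 0 \le \alpha < 1 .
\end{align*}
```

Read contrapositively, a message whose textual similarity falls below $\lambda_T(s,m)$ cannot reach the k-th highest score of $s$, whatever its exact spatial proximity. The tighter the spatial upper bound, the larger the threshold and the stronger the pruning.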
To utilize Lemma 1, we employ prefix filtering technique, which is widely adopted in textual similarity join problems (e.g., DBLP:conf/icde/ChaudhuriGK06 (); DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 ()). Prefix filtering is based on the fact that TSim is essentially a vector product; therefore, we can determine the similarity upper bound between two objects by only comparing their prefixes. Before we introduce prefix filtering technique, we first introduce a threshold value for each keyword in :
Then we define a location-aware prefix as follows.
Definition 4 (Location-aware Prefix)
Given a subscription , a message and a textual similarity threshold , we use to denote the location-aware prefix of w.r.t. , where .
The following lemma claims that the location-aware prefix is sufficient to decide whether a message can be a top-k result of a subscription.
Lemma 2. Given a subscription and a message , is sufficient to skip regarding .
Figure 4 shows an example of location-aware prefix, with 3 registered subscriptions and 3 incoming messages. The underlined value to the right of each keyword corresponds to its weight, and we do not normalize the keyword weight for simplicity. Assuming , then . Thus, . Since , we can skip w.r.t. .
Note that, unlike the conventional prefix technique (e.g., DBLP:conf/www/BayardoMS07 (); DBLP:journals/tods/XiaoWLYW11 ()), where only the prefix of a data entry needs to be indexed, our location-aware prefix depends on the spatial location of messages, and different locations may lead to different prefixes. Thus, it is impossible to pre-compute and index the prefix of subscriptions. To address this issue, we utilize the threshold value wtsum for each keyword in to indicate whether this keyword should occur in the prefix regarding a message , which is stated formally in the following lemma.
Lemma 3. Given a subscription , a message and , if does not contain any keyword that satisfies , we can safely skip regarding m.
In this way, we can dynamically determine the location-aware prefix of a subscription w.r.t. an arriving message. Also, since is irrelevant to incoming messages, it can be materialized for each subscription.
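The dynamic prefix test can be sketched as follows. The exact definition of wtsum is not reproduced in this text, so the sketch assumes one standard choice for cosine similarity: the L2 norm of the weights from a keyword's position to the end of the subscription. With unit-normalized messages, the Cauchy-Schwarz inequality then bounds the similarity contributed by that suffix. The function names and this definition are assumptions, not the paper's formula.

```python
import math
from typing import List, Sequence, Set


def suffix_norms(weights: Sequence[float]) -> List[float]:
    """Assumed wtsum: out[i] is the L2 norm of weights[i:], non-increasing in i.

    It does not depend on any message, so it can be materialized once.
    """
    out = [0.0] * len(weights)
    acc = 0.0
    for i in range(len(weights) - 1, -1, -1):
        acc += weights[i] ** 2
        out[i] = math.sqrt(acc)
    return out


def can_skip(sub_keywords: Sequence[str], wtsum: Sequence[float],
             msg_keywords: Set[str], threshold: float) -> bool:
    """True if the message shares no keyword with the dynamic prefix.

    The prefix consists of the leading keywords whose wtsum is at least
    `threshold`; the threshold itself depends on the message location.
    """
    for kw, bound in zip(sub_keywords, wtsum):
        if bound < threshold:
            break              # remaining keywords lie outside the prefix
        if kw in msg_keywords:
            return False       # a prefix keyword matches: cannot skip
    return True
```

As a usage example, a subscription with keywords `["a", "b"]` and weights `[0.6, 0.8]` has wtsum values of about `[1.0, 0.8]`. With a threshold of 0.9 the prefix is just `["a"]`, so a message containing only `"b"` is skipped: its similarity is at most 0.8. With a threshold of 0.5 the prefix grows to both keywords and the same message must be examined.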
Max-weight refinement. We notice that for a specific message , we can compute a better location-aware prefix for by considering the maximum weight for the keywords in . We first define as:
Then we define a refined location-aware prefix:
Definition 5 (Refined Location-aware Prefix)
Given a subscription , a message and , we use to denote the refined location-aware prefix of w.r.t. , where with .
The following theorem claims that is sufficient to decide whether a message can be a top-k result of a subscription.
Theorem 1. Given a subscription and a message , is sufficient to skip regarding .
5.1.2 Spatial Bound Estimation
In this section, we discuss the computation of between a subscription and a message in order to get a better threshold for efficient location-aware prefix filtering. To this end, we employ a spatial index to group subscriptions with similar locations, such that the spatial upper bound for a group of subscriptions can be computed simultaneously. Due to the easy implementation and well-adaptiveness to skewed spatial distributions, we choose Quadtree to index subscriptions. Specifically, each subscription is assigned into a leaf cell with range based on its location . Then the following two types of spatial bounds can be defined and utilized.
Definition 6 (Inner Spatial Bound)
Given a subscription and its residing cell , inner spatial bound, denoted as is computed as , where is the from to the nearest boundary of cell range .
It is obvious that for any and , we have . An example is shown in Figure 4. Since the from to is , if we assume the MaxDist in the space is .
Definition 7 (Outer Spatial Bound)
Given a message and an outer cell , outer spatial bound, denoted as is computed as , where is the from to .
For any and , we have . An example is also shown in Figure 4. The from to is , and thus .
Definition 8 (Spatial Upper Bound)
Following the example in Figure 4, by combining both inner and outer distance, we can get a tighter spatial upper bound between and as .
Note that the inner spatial bound can be precomputed and materialized, while the outer spatial bound has to be computed on-the-fly as it is relevant to the location of an arriving message. However, the computation cost of is not expensive since we only need to compute this value against each leaf cell. Finally, we remark that when and are within the same cell, both and are always .
An example is shown in Figure 4. If we assume the , we have , and . Thus, cannot be skipped w.r.t. . However, if we utilize the inner spatial bound and outer spatial bound together, we have , , and . In this case, we can safely skip w.r.t. .
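The combination of the two bounds can be sketched for axis-aligned cells. The sketch assumes that the inner bound comes from the distance between the subscription and the nearest boundary of its own cell, that the outer bound comes from the minimum distance between the message and that cell, and that both distances are added before normalizing by MaxDist. The rectangle representation and function names are illustrative assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def inner_dist(p: Point, cell: Rect) -> float:
    """Distance from a point inside `cell` to the nearest cell boundary."""
    x, y = p
    xmin, ymin, xmax, ymax = cell
    return min(x - xmin, xmax - x, y - ymin, ymax - y)


def outer_dist(p: Point, cell: Rect) -> float:
    """Minimum distance from a point to `cell`; 0 if the point is inside."""
    x, y = p
    xmin, ymin, xmax, ymax = cell
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def spatial_upper_bound(s_loc: Point, m_loc: Point, cell: Rect,
                        max_dist: float) -> float:
    """Upper bound on the spatial proximity of s (inside `cell`) and m."""
    out = outer_dist(m_loc, cell)
    if out == 0.0:
        return 1.0  # m lies in (or on) the cell of s: no useful bound
    return 1.0 - (inner_dist(s_loc, cell) + out) / max_dist
```

The sum is a valid lower bound on the true distance because the segment from the message to the subscription must cross the cell boundary: the part outside the cell is at least the outer distance and the part inside is at least the inner distance. The inner distance depends only on the subscription and can be stored, while the outer distance is computed once per arriving message and cell.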
5.1.3 Bound Estimation for Unseen Keywords
Since we employ TAAT paradigm to visit inverted file, we can estimate a textual upper bound for unseen keywords. If this upper bound plus the textual similarity that has already been computed is still less than the required threshold, we can safely skip . The textual upper bound between the unseen keywords of and can be computed as follows:
where and are starting positions of unseen keywords. Then the following theorem claims we can skip a subscription by utilizing the textual upper bound.
Theorem 2. Given a subscription , a message and their textual similarity threshold , assuming we have already computed the partial similarity between and , denoted as , then is sufficient to skip .
Proof. As , we have . The theorem holds immediately from Lemma 1.
In Figure 4, consider that we are currently disseminating . Based on the dissemination algorithm to be discussed later in Section 5.4, we need to traverse the inverted lists in cell (where resides) for all the keywords in one by one. We first check the inverted list of since is the 1st keyword of . Assuming , we cannot skip since . However, since is the 2nd keyword in , we can compute , and . Because , we can immediately skip .
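The early-termination idea can be sketched for a single subscription and message pair processed term at a time. The bound formula is not reproduced in this text, so the sketch assumes one valid choice: the product of the L2 norms of the two unseen suffixes, which upper-bounds the remaining dot product by the Cauchy-Schwarz inequality. Keywords are represented by integer ids that follow the global keyword order; this representation and all names are assumptions.

```python
import math
from bisect import bisect_right
from typing import List, Optional, Sequence


def suffix_norms(weights: Sequence[float]) -> List[float]:
    """out[i] is the L2 norm of weights[i:]; out[len(weights)] is 0."""
    out = [0.0] * (len(weights) + 1)
    for i in range(len(weights) - 1, -1, -1):
        out[i] = math.sqrt(out[i + 1] ** 2 + weights[i] ** 2)
    return out


def tsim_with_early_stop(s_ids: Sequence[int], s_wts: Sequence[float],
                         m_ids: Sequence[int], m_wts: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Accumulate the similarity keyword by keyword over the message.

    Returns None as soon as the partial similarity plus the upper bound for
    the unseen keywords drops below `threshold`; otherwise returns the
    exact similarity. Both id lists must be sorted in the global order.
    """
    s_suf, m_suf = suffix_norms(s_wts), suffix_norms(m_wts)
    pos = {kw: i for i, kw in enumerate(s_ids)}
    partial = 0.0
    for j, kw in enumerate(m_ids):
        i = pos.get(kw)
        if i is not None:
            partial += s_wts[i] * m_wts[j]
        nxt = bisect_right(s_ids, kw)  # first keyword of s not yet covered
        if partial + s_suf[nxt] * m_suf[j + 1] < threshold:
            return None                # the subscription can be skipped
    return partial
```

In a real index the loop would run over posting lists shared by many subscriptions rather than over one pair, but the test applied after each keyword is the same: what has been accumulated plus the most the unseen keywords could still contribute.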
5.2 Group Pruning Technique
After applying individual pruning technique, many subscriptions can be skipped without the need to compute their exact similarity w.r.t. a message. To further enhance the performance, we propose a novel Group Pruning Technique such that we can skip a group of subscriptions without the need to visit them individually. To begin with, we first define subscription-dependent prefix for a message.
Definition 9 (Refined Sub-dependent Prefix)
Given a message , a subscription and , we use to denote the refined subscription-dependent prefix of w.r.t. , where with .
The following lemma claims the refined subscription-dependent prefix is sufficient to determine whether a message could be top- result of a subscription.
Given a subscription and a message , is sufficient to skip regarding .
Let us denote the posting list of keyword in cell as , which contains all the subscriptions having and residing in . Then based on Lemma 4, for a subscription in , if , we can safely skip . Further, if this holds for a group of subscriptions on , we can safely skip the whole group as follows.
Given a message , a keyword , a posting list and a group of subscriptions inside , is sufficient to skip the whole group .
The left side of the inequality in Lemma 5 can be computed in time since we can materialize for each group. The right side, however, would be quite inefficient to compute on the fly for each new message. To avoid this, we propose a lower bound for which can be computed in constant time. In the following, we first present the subscription grouping strategy and then introduce the details of the lower bound deduction.
5.2.1 -Partition Scheme
Intuitively, we should group subscriptions with similar such that we can get a tighter textual threshold for the group. We first let . Note that we compute by utilizing only in order to make independent of a specific subscription. Therefore, it is observed from Equation 2 that, for the computation of , only and are dependent on while is irrelevant to . For simplicity, we denote as and as respectively. Then, we partition subscriptions into groups based on their values, such that the subscriptions inside a group have similar values. We employ a quantile-based method to partition the domain of to ensure that each group has a similar number of subscriptions. Then, we can skip the whole group as stated in the following theorem.
Given a group generated by -partition in a posting list , we denote as and as . Then is sufficient to skip the whole group .
It is obvious that for any subscription in , we have . Thus, we have . Combined with Lemma 5, the theorem holds immediately.
Time complexity. The condition checking in Theorem 5.3 takes time, since we can precompute the values of and .
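The quantile-based grouping described above amounts to equal-frequency partitioning. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `quantile_partition`, the input list `pref` (one partitioning value per subscription) and the dictionary layout are hypothetical. It only illustrates how groups of near-equal size could be formed and how the smallest and largest value of each group could be materialized so that the group-level check needs no per-subscription work.

```python
from typing import Dict, List


def quantile_partition(pref: List[float], num_groups: int) -> List[Dict]:
    """Split subscription ids into groups of near-equal size by their value.

    pref[i] is the (hypothetical) partitioning value of subscription i.
    Each group stores its member ids plus precomputed min/max statistics.
    """
    # Sort subscription ids by value so that each group covers one quantile range.
    order = sorted(range(len(pref)), key=pref.__getitem__)
    n = len(order)
    groups = []
    for g in range(num_groups):
        lo = g * n // num_groups
        hi = (g + 1) * n // num_groups
        if lo == hi:  # more groups than subscriptions: skip empty slices
            continue
        ids = order[lo:hi]
        groups.append({
            "ids": ids,
            "min": pref[ids[0]],   # smallest value in the group
            "max": pref[ids[-1]],  # largest value in the group
        })
    return groups


groups = quantile_partition([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], num_groups=3)
for g in groups:
    print(g)
```

Because the group statistics are computed once at partition time, a group-level condition that depends only on them can be tested without touching the individual subscriptions.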
5.2.2 Early Termination Within Group
When a group cannot be skipped for a given message, we would otherwise have to check every subscription in it. To avoid this, we propose an early termination technique that stops the scan inside a group that cannot be skipped as a whole. To enable early termination, for each group in , we sort the subscriptions in by their values in increasing order. For each subscription in , we denote the subscriptions with not less than as , and maintain two statistics and w.r.t. keyword as follows:
Then we can employ early termination as follows.
Given a group inside a posting list , and assuming is the subscription with smallest position in such that the following inequality holds: , then there is no need to check the subscriptions after (including itself).
Time complexity. To speed up the real-time processing, we precompute and and store them with each subscription in the group . The condition checking in Theorem 5.4 can be efficiently computed in time with a binary search method.
5.2.3 Cell-based Pruning
Besides the above group pruning technique, we notice that for some cells which are far away from the location of an arriving message, we can safely skip the whole cell. Specifically, for each subscription within a cell , we can derive a spatial similarity threshold as follows:
where we assume the textual similarity achieves the largest value, i.e., . Then we can reach the following lemma.
Given a cell , if , we can safely skip all the subscriptions in cell .
For , we have . Thus, . Thus, cannot be top- results of any in .
5.3 Subscription Index
Relying on all the techniques discussed above, our subscription index is essentially a Quadtree integrated with an inverted file in each leaf cell, as shown in Figure 5. For each registered subscription, we store its detailed information and relevant statistics in a subscription table, and insert it into a leaf cell of the Quadtree based on its spatial location. Note that the Quadtree stores only the subscription id, which refers to the detailed information in the subscription table. Within each leaf cell, an inverted file is built upon all the subscriptions inside the cell. Each posting list in the inverted file is then further partitioned into groups based on the subscription preference to enable group pruning. Each group is also associated with the statistics mentioned above. Finally, to facilitate early termination, the subscriptions within each group are ordered based on their .
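To make the layout concrete, the sketch below shows one way the three layers (subscription table, spatial leaf cells, per-cell inverted file) could fit together. It is an illustration under simplifying assumptions rather than the structure used in the paper: locations are assumed to lie in the unit square, the Quadtree is replaced by its leaf level at a fixed depth, and group partitioning and statistics are omitted. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class SubscriptionIndex:
    """Leaf cells of a fixed-depth quadtree, each holding an inverted file."""

    def __init__(self, depth: int = 2):
        self.depth = depth
        # Subscription table: id -> (x, y, keywords); cells store ids only.
        self.table: Dict[str, Tuple[float, float, Set[str]]] = {}
        # cell -> keyword -> posting list of subscription ids
        self.cells = defaultdict(lambda: defaultdict(list))

    def leaf_cell(self, x: float, y: float) -> Tuple[int, int]:
        # Assumption: 0 <= x, y <= 1; a depth-d quadtree has 2^d cells per axis.
        n = 1 << self.depth
        return (min(int(x * n), n - 1), min(int(y * n), n - 1))

    def insert(self, sub_id: str, x: float, y: float, keywords: Set[str]) -> None:
        self.table[sub_id] = (x, y, set(keywords))
        cell = self.leaf_cell(x, y)
        for kw in keywords:
            self.cells[cell][kw].append(sub_id)

    def delete(self, sub_id: str) -> None:
        x, y, keywords = self.table.pop(sub_id)
        cell = self.leaf_cell(x, y)
        for kw in keywords:
            self.cells[cell][kw].remove(sub_id)

    def posting_list(self, cell: Tuple[int, int], keyword: str) -> List[str]:
        return list(self.cells[cell][keyword])


index = SubscriptionIndex(depth=2)
index.insert("s1", 0.1, 0.1, {"pizza", "coffee"})
index.insert("s2", 0.9, 0.9, {"coffee"})
print(index.posting_list(index.leaf_cell(0.1, 0.1), "coffee"))
```

Keeping only ids in the cells means that a subscription's details and statistics live in one place, so an update touches the table entry and the affected posting lists only.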
5.4 Dissemination Algorithm
Algorithm 1 shows our message dissemination algorithm. We follow a filtering-and-verification paradigm, where we first generate a set of candidate subscriptions (Lines 1-1), and then compute the exact scores to determine the truly affected ones, with the updated results being disseminated accordingly (Line 1). Specifically, we first initialize an empty map to store candidates with their scores (Line 1). Then the maxwt and wtsum values for all the keywords in the arriving message are computed for later use (Line 1). For each leaf cell surviving from cell pruning (Line 1), we first compute and then traverse the inverted file in cell following a TAAT manner. For each group encountered in (Line 1), we skip if group pruning can be applied (Line 1); otherwise, we identify for early termination based on Theorem 5.4 (Line 1 and Line 1). For each surviving subscription , we employ location-aware prefix filtering (Line 1) and bound estimation for unseen keywords (Line 1) to skip it as early as possible. For the surviving subscriptions, we store the accumulated textual similarity so far w.r.t. in , while for the skipped subscriptions, we set to negative infinity (Line 1). Finally, for each subscription in with , we verify it and update its top- results if needed (Line 1). Note that when verifying a candidate , we only need to compute the exact spatial similarity to get the final score because the textual similarity, i.e., , has already been computed. The statistics relevant to pruning techniques are also updated in Line 1.
5.5 Index Maintenance
Our indexing structure also supports subscription updates efficiently. For a new subscription , we first find the leaf cell containing its location, and then insert it into the inverted file with cost. Note that the statistics mentioned above need to be updated accordingly. For an expired subscription, we simply delete it from the index and update the statistics if necessary.
6 Top-k Re-evaluation
In this section, we present the details of the top- re-evaluation module. We first introduce some background knowledge on -skyband in Section 6.1. Then we present our cost-based skyband technique in detail in Section 6.2. In the rest of this paper, we denote the -skyband buffer (either full or partial) of a subscription as for simplicity, and the exact top- results are denoted as . Meanwhile, is denoted as if it is clear from context.
The idea of utilizing -skyband to reduce the number of re-evaluations for top- queries over a sliding window was first proposed in DBLP:conf/sigmod/MouratidisBP06 (). In particular, for a given subscription , only the messages in its corresponding -skyband can appear in its top- results over the sliding window, so only these messages need to be maintained. The formal definitions of dominance and -skyband are as follows.
Definition 10 (Dominance)
A message dominates another message w.r.t. a subscription if both and hold.
Definition 11 (-skyband)
The -skyband of a subscription , denoted as , contains the set of messages which are dominated by fewer than other messages.
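As a small illustration of these two definitions, the snippet below computes a skyband by brute force. It rests on an assumption that should be checked against Definition 10: a message is taken to dominate another when it arrives later and its score is at least as high, which is the usual formulation for sliding windows. Messages are represented as hypothetical `(arrival_time, score)` pairs for a single subscription, and the quadratic scan is chosen for clarity, not efficiency.

```python
from typing import List, Tuple

Message = Tuple[int, float]  # (arrival_time, score w.r.t. one subscription)


def dominates(a: Message, b: Message) -> bool:
    # Assumed dominance test: a arrives later than b and scores at least as high.
    return a[0] > b[0] and a[1] >= b[1]


def k_skyband(messages: List[Message], k: int) -> List[Message]:
    # Keep every message dominated by fewer than k other messages.
    return [
        m for m in messages
        if sum(dominates(other, m) for other in messages) < k
    ]


window = [(1, 0.9), (2, 0.5), (3, 0.7), (4, 0.6)]
print(k_skyband(window, k=1))
print(k_skyband(window, k=3))
```

In this window the message arriving at time 2 is dominated by both later messages, so it cannot enter the top-1 or top-2 result before it expires and is excluded from the corresponding skybands.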
Instead of keeping the -skyband over all the messages in the sliding window, which is cost-prohibitive, Mouratidis et al. DBLP:conf/sigmod/MouratidisBP06 () maintain a partial -skyband. Specifically, they only maintain the messages with score not lower than a threshold , where is the after the most recent top- re-evaluation for and remains unchanged until the next re-evaluation is triggered. However, as our experiments suggest, the method in DBLP:conf/sigmod/MouratidisBP06 () may incur expensive computational cost due to an improper selection of .
To alleviate the above problem, we propose a novel cost-based -skyband technique, which judiciously selects the best threshold for the -skyband maintenance of each subscription. To start with, we present an overview of our top- re-evaluation algorithm in Algorithm 2. For each subscription containing the expired message , if the size of after deleting is less than , we need to re-evaluate its top- results from scratch. Specifically, we first compute a proper threshold based on our cost model (Line 2), and then re-compute the -skyband buffer based on , which contains all the messages with score at least (Line 2 and Line 2); the same technique as in DBLP:conf/sigmod/MouratidisBP06 () is used to compute the -skyband. Note that can be computed by utilizing the message index. Finally, we extract top- results from (Line 2). The key challenge here is to estimate the best threshold , which is discussed in detail below. We remark that we use the term re-evaluation to refer specifically to the top- re-computation against the message index.
6.2 Cost-based K-Skyband
The general idea of our cost-based -skyband model is to select a best threshold for each subscription such that the overall cost defined in the cost model can be minimized. The following theorem guarantees that, as long as we maintain a partial -skyband over all the messages with score not lower than , we can extract top- results from partial -skyband safely when some message expires.
Given a subscription , let be the after the most recent top- re-evaluation for . We always have if the following conditions hold: (1) ; (2) is a partial -skyband which is built over all the messages with score at least in the sliding window, where .
We prove it by contradiction. Assuming there exists a message while , then we discuss two possible cases: (1) ; (2) . For the first case, since , must be dominated by more than messages in , which indicates it cannot be top- results, i.e., . For the second case, at least messages in must have a higher score than because and all the messages in have score at least . Thus, still cannot be top- results. Thus, the original assumption does not hold, which immediately indicates .
Thus, based on Theorem 6.1, we can safely extract top- results from -skyband buffer when ; when , we have to re-evaluate from message index.
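The resulting decision on message expiration can be sketched as follows. This is an illustrative fragment, not Algorithm 2 itself: the buffer is modeled as a list of hypothetical `(message_id, score)` pairs, and `reevaluate` stands in for the expensive re-computation against the message index, whose interface is not specified here.

```python
from typing import Callable, List, Tuple

Entry = Tuple[str, float]  # (message_id, score w.r.t. the subscription)


def top_k_after_expiration(
    buffer: List[Entry],
    k: int,
    reevaluate: Callable[[], List[Entry]],
) -> List[Entry]:
    """Return the top-k results once an expired message has been removed."""
    if len(buffer) >= k:
        # Enough messages remain: the k highest scores in the buffer are the answer.
        return sorted(buffer, key=lambda e: e[1], reverse=True)[:k]
    # Too few messages remain: fall back to the (hypothetical) index lookup.
    return reevaluate()


buffer = [("m1", 0.7), ("m2", 0.9), ("m3", 0.8)]
fallback = lambda: [("from_index", 1.0)]
print(top_k_after_expiration(buffer, 2, fallback))
print(top_k_after_expiration(buffer[:1], 2, fallback))
```

The cheaper branch touches only the buffer of one subscription, which is why keeping the buffer from running short is the main lever for reducing re-evaluations.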
Our cost-based -skyband model, based on Theorem 6.1, aims to find the best such that the overall cost is minimized for each subscription. We mainly consider two costs. The first one is the -skyband maintenance cost, denoted as , which is triggered upon message arrival and expiration. The second one is the top- re-evaluation cost, denoted as , which is triggered when some message expires and the top- results can no longer be retrieved from the -skyband buffer. We aim to estimate the expected overall cost w.r.t. each message update, i.e., message arrival and message expiration, each of which we assume occurs with probability as the window slides. To simplify the presentation, we denote as the probability that the score between a random message and a subscription is at least . We can derive for a given directly from historical data, assuming that scores continue to follow the historical distribution. The details of these two costs are presented below.
6.2.1 -Skyband Maintenance Cost
The maintenance of -skyband is triggered when the following two types of updates happen, both with probability , where is the probability of message arrival or message expiration due to the count-based sliding window, and is the probability that the score between a random message and is at least . Please note that if the independence assumption does not hold for messages, the above probabilities cannot be estimated accurately, and we may resort to utilizing historical data for the estimation.
The first type of update is triggered when a message with score at least arrives. Apart from the insertion of into , the dominance counters of all the messages in with score not higher than will increase by 1, and the messages with dominance counter equal to will be evicted. Since we implement our -skyband buffer as a linked list sorted by , the above operations can be processed in time with a linear scan. The next challenge is to estimate . Based on the independence assumption between score dimension and time dimension, the expected number, i.e., , of messages in the partial -skyband is DBLP:journals/tkde/ZhangLYKZY10 (), where is the size of sliding window. Please note that if the independence assumption does not hold, the worst case space complexity will be .
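The arrival handling just described can be sketched with a single pass over the buffer. The code below is a minimal illustration under stated assumptions: a Python list kept in descending score order stands in for the linked list, the sort key is assumed to be the score, and the names `SkybandBuffer`, `on_arrival` and `on_expire` are hypothetical. The expiration case, which only removes one entry, is included for completeness.

```python
from typing import List


class SkybandBuffer:
    """Partial k-skyband of one subscription, kept in descending score order."""

    def __init__(self, k: int, theta: float):
        self.k = k
        self.theta = theta  # messages scoring below theta are never buffered
        self.entries: List[list] = []  # each entry: [score, msg_id, dom_count]

    def on_arrival(self, msg_id: str, score: float) -> bool:
        if score < self.theta:
            return False
        kept, inserted = [], False
        for s, mid, cnt in self.entries:  # one linear scan
            if s <= score:
                if not inserted:
                    # The newest message is dominated by nobody yet.
                    kept.append([score, msg_id, 0])
                    inserted = True
                cnt += 1  # the older, lower-scored message gains a dominator
                if cnt >= self.k:
                    continue  # dominated by k messages: evict
            kept.append([s, mid, cnt])
        if not inserted:
            kept.append([score, msg_id, 0])
        self.entries = kept
        return True

    def on_expire(self, msg_id: str) -> None:
        # The expired message is the oldest, so it dominates no remaining
        # message and no counter needs to change.
        self.entries = [e for e in self.entries if e[1] != msg_id]


buf = SkybandBuffer(k=2, theta=0.5)
for mid, score in [("a", 0.6), ("b", 0.9), ("c", 0.7), ("d", 0.8), ("e", 0.4)]:
    buf.on_arrival(mid, score)
print(buf.entries)
```

In this run, message `a` is evicted once two later messages with higher scores have arrived, and `e` is never buffered because its score falls below the threshold.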
The second type of update occurs when an old message among the -skyband buffer of expires. In this case, we only need to delete from in time. Note that does not dominate any remaining messages and therefore the dominance counters of the remaining messages are not affected. Finally, we get the total cost of -skyband maintenance as follows: