Profiling User Activities
With Minimal Traffic Traces
Understanding user behavior is essential to personalize and enrich a user’s online experience. While there are significant benefits to be accrued from the pursuit of personalized services based on a fine-grained behavioral analysis, care must be taken to address user privacy concerns. In this paper, we consider the use of web traces with truncated URLs – each URL is trimmed to only contain the web domain – for this purpose. While such truncation removes the fine-grained sensitive information, it also strips the data of many features that are crucial to the profiling of user activity. We show how to overcome the severe handicap of lack of crucial features for the purpose of filtering out the URLs representing a user activity from the noisy network traffic trace (including advertisement, spam, analytics, webscripts) with high accuracy. This activity profiling with truncated URLs enables the network operators to provide personalized services while mitigating privacy concerns by storing and sharing only truncated traffic traces.
In order to offset the accuracy loss due to truncation, our statistical methodology leverages specialized features extracted from a group of consecutive URLs that represent a micro user action like web click, chat reply, etc., which we call bursts. These bursts, in turn, are detected by a novel algorithm which is based on our observed characteristics of the inter-arrival time of HTTP records. We present an extensive experimental evaluation on a real dataset of mobile web traces, consisting of more than 130 million records, representing the browsing activities of 10,000 users over a period of 30 days. Our results show that the proposed methodology achieves around 90% accuracy in segregating URLs representing user activities from non-representative URLs.
Behavioral analysis of mobile users based on their web activities has the potential to transform their online experience. It enables service providers to personalize their deliverables, specialize their content, customize recommendations and target advertisements based on user context. For network operators, it opens up the possibility of provisioning their resources and dynamically managing their network infrastructure (particularly with the realization of network function virtualization) to effectively serve the varying user and content demand and deliver a better quality-of-service experience.
However, behavioral analysis also raises serious concerns about user privacy. Users are uncomfortable if personalization is taken too far. In the wider philosophical debate between personalized services based on user behavior analysis and preserving the user privacy, there is a need to find a middle ground that will allow for potential benefits of personalized services and still safeguard the fine-grained sensitive user information. (Specific search queries, personal entertainment preferences, purchased products, location etc. are generally considered highly sensitive user information.)
Ideally, the data set for such analyses should be stripped of all sensitive user information, while still allowing for inference of medium-grained user activity. This is becoming even more important with the tightening privacy legislation in various countries [21, 1, 10, 6], increasing regulation and heavy penalties for data breaches, which have made network operators as well as service providers more careful about the data sets they collect, store and share. The operators would like to store the minimal amount of data that still allows them to perform complex analytics, raising the important question of determining the thin boundary between the data required for necessary analytics and the data that can enable mining of highly sensitive fine-grained user traits. In this context, we consider the usage of truncated URLs, wherein each URL is trimmed to only contain the web domain. For instance, the HTTP URL finance.yahoo.com/q?s=BAC is truncated to finance.yahoo.com (to hide the fact that the user had queried for the Bank of America Corp. stock price), the URL https://www.google.com/#q=postnatal+depression is truncated to www.google.com (to avoid leaking the sensitive health query of the user) and the URL www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA is truncated to www.amazon.com (to avoid leaking the searched or purchased product). Already, many network operators only share truncated-URL data sets with third-party analysts, owing to privacy considerations. For the non-HTTP traces (e.g., HTTPS encapsulated in IP packets), even the network operators themselves have limited information available. While a reverse DNS service can be used to extract the web domain from the IP address, it does not recover the content type or the query parameters. Thus, it is important to explore whether high accuracy can still be obtained in profiling user activities if an analyst is restricted to only using a truncated-URL web trace.
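The truncation step itself is simple to illustrate. The following is a minimal sketch, assuming Python's standard `urllib.parse` module; the function name `truncate_url` is hypothetical and is not taken from the paper.

```python
from urllib.parse import urlsplit


def truncate_url(url: str) -> str:
    """Keep only the web domain (host) of a URL; hypothetical helper."""
    # urlsplit only recognizes the host when a scheme or a leading '//' is
    # present, so scheme-less inputs such as 'finance.yahoo.com/q?s=BAC'
    # get a '//' prefix first.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


# The three example URLs from the text above.
for full_url in (
    "finance.yahoo.com/q?s=BAC",
    "https://www.google.com/#q=postnatal+depression",
    "www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA",
):
    print(truncate_url(full_url))
```

The path, query string and fragment are all discarded, which is exactly the information the text describes as sensitive.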
In this paper, we investigate this issue.
Specifically, we focus on the task of identifying URLs that are representative of user activities, which is often an important step in profiling user activities. We note that the remaining task of mapping the representative URLs to activity categories (and creating user profiles) can be done using either manual labeling of interesting categories or in an automated way by using external databases or web analytics services (e.g., Alexa).
The key challenge in filtering out the representative URLs from noisy truncated traffic trace is that a truncated trace lacks many crucial features for such a filtering. These include the file name suffix (e.g., .jpg, .mp3, .mpg etc.) that is usually a good indicator of the content type as well as number, type and values of parameters in the URL strings. Nonetheless, we show that even with the truncated URLs, we can achieve a highly accurate automated classification of web-domains into those that represent the user activity and those that don’t. The key insight that we bring in this paper is that a user’s traffic trace is composed of many data bursts. A burst usually corresponds to a micro user action like a web click, chat reply, etc. and is typically associated with a unique activity. We show that novel features related to burst measurements, such as positioning of a URL in a burst, the number of URLs in the burst containing the web-domain etc., can improve the accuracy of filtering the noise (unintentional traffic such as spam, analytics, advertisements as well as other non-representative traffic such as images, multimedia, scripts) out of the traffic trace, by around 20%, offsetting the loss due to URL truncation.
To achieve this result, we need to decompose a traffic trace of a user into its constituent data bursts. The problem here is that there is a significant variation in the traffic pattern across different users, at different timestamps and different activities. Even the distribution shape of the inter-arrival time of HTTP records differs significantly from one user to another. We resolve this problem by proposing a novel burst decomposition algorithm that adapts itself to any distribution shape, rather than relying on specific distributions.
We provide an extensive experimental evaluation over more than 130 million HTTP records generated from 10,000 users over a period of 30 days. The experimental analysis demonstrates that our methodology provides high accuracy (around 90%), in segregating representative URLs from non-representative URLs.
Our approach thus enables network operators to personalize services without risking the leakage of more sensitive user data (as the sensitive information need not be stored or shared). Specifically, it enables many medium-grained personalization applications, including, but not restricted to, product recommendation and targeted advertisement. For instance, knowing when their users read, shop, browse and play games enables telecom operators to create better pricing schemes that are personalized and targeted for different users and demographics. Such profiling of user activity also opens up many avenues for network optimization to service providers. For instance, system resources can be better allocated to match the data access rate and desired delay time for gaming activities at specific times of the day, and better caching strategies can be designed.
Outline. In Section 2, we show that there is considerable variation in user activity, which necessitates data-dependent feature extraction and complex statistical models. Section 3 presents an overview of our methodology. In Section 4, we argue that there is considerable variation in the distribution shapes of the inter-arrival time of HTTP records and that, as a result, burst decomposition techniques relying on specific distribution shapes do not work well across the entire user spectrum. In Section 5, we show how to remedy the situation by using a threshold on the inter-arrival time of HTTP records that adapts to the distribution profile of each user. Section 6 presents the results of domain classification using the features extracted from burst measurements. Section 7 presents an overview of some related work.
2 Variation in User Activity
The main goal of our investigation is to develop an automated procedure to filter out representative URLs from the noisy trace of truncated HTTP records. In this section, we describe our dataset. We show that in this dataset, there is a significant variation between different users in terms of non-representative traffic, user activities, number of HTTP records etc. In the next section, we propose a novel methodology that employs robust algorithms for extracting user-dependent features to overcome this high variation in user activity.
Dataset. Our dataset consists of more than 130 million web-logs generated by 10,000 randomly selected users over a period of 30 days, obtained from an anonymous network operator. In our traces, each record contains information fields such as hashed user ID, truncated URL, download size, upload size and timestamp. Note that our dataset is not restricted to any particular domain or limited to a small set of volunteer users. Being a network-side dataset, it is fairly large and diverse in terms of the domains and the users covered. The flip side of this is that it is also very noisy: it contains not just the URLs that a user types into the browser, but also all the redirects, secondary URLs (pictures, embedded videos etc.) and unintentional data (scripts, analytics, advertisement, spam etc.).
Variation in Total Traffic. We first observe that there is a significant variation in the HTTP traffic generated by different users. For instance, over the 30-day period of study, the number of HTTP records ranges from the low tens for some users to tens of thousands for others. In fact, a majority of the HTTP download traffic is generated by only a small fraction of users (Figure 1). The distribution is even more skewed across domains: a small fraction of domains generates the bulk of the traffic, both in terms of download size and in terms of HTTP record counts (Figure 1). Note that even though a large majority of URLs together constitute only a small portion of the traffic, these less popular URLs are more likely to characterize the unique features of different users and therefore play a critical role in differentiating user-specific behavior. Thus, it is vitally important to correctly classify these URLs into those that represent the user activities and those that do not.
Variation in Type of Traffic. Even among users with similar total traffic, the kind of web activities and the fraction of non-representative URLs in the traffic trace vary considerably. For instance, Figure 2 shows web trace snapshots of two users, illustrating two different activity patterns. Different colored segments in this figure represent traffic from different domains, which can be either representative or non-representative. The trace of the first user (Figure 2) has only one domain, i.e., gaming, and in fact consists of repeated records from a single URL for more than 1300 seconds. For this user, there is no non-representative traffic to filter out. In contrast, the web browsing activity of the other user shown in Figure 2 alternates between a large number of domains (scripts, multimedia, HTML, CSS, advertisements, analytics etc.) in less than 100 seconds, even though he/she is browsing a single web page during this time. This variation in activity patterns is reflected in the download size, the inter-arrival time and the number of HTTP records. In addition, the timestamp patterns of HTTP records also vary significantly from one user to another (see Figure 2).
Variation in User Behaviors. We also observe that there is a significant variation between different users in terms of the activities themselves. To summarize the aggregated variation of the top- domains of both representative and non-representative traffic, we use the following global entropy-based metric to measure this variation:
where is the number of times that URL appears in the top domains, satisfying . By Equation 1, the variation metric is maximized at when all users have different non-overlapping top domain set and is minimized at when all users have the same non-ordered top domain set. For the web trace data, is with distinct domains from among the top domains for the 2000 users. The discovered value suggests that there is a significant variation in the top activities among the different users. We show this intuition graphically in Figures 3 and 3, where we depict the activity variation of two users over time. For this figure, we filtered out the non-representative domains manually, selected top representative domains for each user according to the number of HTTP records. Figure 3 presents the daily record counts for each representative domain and demonstrates both the temporal and activity variations in terms of activity types and the magnitude across two randomly selected users.
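To make the behavior of such an entropy-based metric concrete, the sketch below computes the Shannon entropy of the pooled top-domain counts. The exact form of Equation 1 is not reproduced here, so this form is an assumption; it is chosen only because it matches the stated extremes (largest when all users have disjoint top-domain sets, smallest when all users share the same set). The function name and the toy domains are hypothetical.

```python
import math
from collections import Counter


def top_domain_entropy(top_sets):
    """Shannon entropy of pooled top-domain counts (assumed form of the metric).

    top_sets: one collection of top domains per user.
    """
    # Count how many users have each domain among their top domains.
    counts = Counter(domain for user_top in top_sets for domain in user_top)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Three users with two top domains each (toy data).
same = [{"a.example", "b.example"}] * 3
disjoint = [
    {"a.example", "b.example"},
    {"c.example", "d.example"},
    {"e.example", "f.example"},
]
print(top_domain_entropy(same), top_domain_entropy(disjoint))
```

Under this assumed form, identical top sets of size k give log k, and fully disjoint sets across N users give log(kN).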
Summary. The statistics above imply that the methods used to extract features for separating noise from the representative URLs have to adapt to changing user patterns. In particular, the variation in the total traffic and in the timestamp patterns necessitates the user-adaptive solutions that we explore in the next sections.
3 Our Methodology
In this section, we present an overview of our methodology to automatically classify the web-domains into those that represent the user activities and those that do not. The key element of this methodology is the use of novel features derived from the burst decomposition of a user’s web-trace, which improve the accuracy of the classification and offset the loss due to URL truncation.
The main intuition behind our methodology is that a user’s browsing activity consists of several data bursts. These data bursts correspond to micro user actions, such as a web click or a chat reply. In each burst, there are some URLs representing the user activity, intermixed with other unintentional web-traffic such as advertisements, web-analytics etc., and with secondary URLs corresponding to multimedia associated with the representative URL. Our statistical methodology decomposes the web-trace back into its constituent data bursts. It then leverages specialized features from data bursts (e.g., the position of a URL in a data burst, the number of unique URLs in a data burst, burst duration, burst download size etc.) to segregate the representative web-domains from the remaining web-domains. In Section 6, we show that the features derived from data bursts significantly improve the accuracy of the segregation task.
A key challenge in our methodology is the decomposition of the web-trace into data bursts. As highlighted already in Section 2, there is a considerable variation in the traffic patterns of different users. We found that even the distribution of inter-arrival time of HTTP records is very different for different users. This makes it particularly difficult to model these data bursts and to find good thresholds to decompose the web-trace into data bursts. We solve this problem by having different thresholds for different users and ensuring that the threshold computing function is robust with respect to the distribution shape. This is achieved using a novel technique to generate thresholds for each user that adapts to any distribution of inter-arrival time.
4 Inter-arrival Time Distribution Models
In this section, we study the inter-arrival time of HTTP records with a view to finding good thresholds that will decompose a user’s traffic-trace into bursts of records that represent micro user actions.
As described in Section 3, the key concept behind a burst is that when a user performs a micro action like a web click, chat reply etc., it not only generates many HTTP records related to the representative activity, but also a large number of secondary records such as advertisements, web analytics, webscripts etc. These records are all intermixed. When the user completes the current micro-action, e.g., reading the current web page, and starts a new one, e.g., opening the next page, a new burst is generated with its associated records. So, the observed inter-arrival times are a combination of within-burst and out-of-burst gaps. However, we expect that the within-burst HTTP records are closer together and the out-of-burst records are far apart in time. By computing an appropriate separation threshold on the inter-arrival time, we aim to decompose the traffic into its constituent bursts.
Since traffic patterns and the inter-arrival time distributions for different users are very different, we can’t expect a global threshold to work well for all users. Instead, we compute a different threshold for each user specific to his/her traffic patterns. If the difference between the time-stamp of a record and its predecessor is greater than the computed separation threshold for that user, the record marks the beginning of a new burst. Otherwise, the record belongs to the burst of its predecessor.
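The splitting rule just described can be sketched in a few lines. This is a minimal illustration, assuming the timestamps of one user are already sorted and that the per-user threshold is given; the function name and the toy values are hypothetical.

```python
def split_into_bursts(timestamps, threshold):
    """Group sorted timestamps of one user into bursts.

    A record starts a new burst when the gap to its predecessor is greater
    than the user's separation threshold; otherwise it joins the burst of
    its predecessor.
    """
    bursts = []
    for i, t in enumerate(timestamps):
        if i == 0 or t - timestamps[i - 1] > threshold:
            bursts.append([t])    # first record, or gap too large: new burst
        else:
            bursts[-1].append(t)  # same burst as the predecessor
    return bursts


# Toy trace (seconds) with a hypothetical 2-second threshold.
print(split_into_bursts([0.0, 0.2, 0.5, 10.0, 10.1, 30.0], threshold=2.0))
```

The difficulty addressed in the rest of this section and the next is not this grouping step but the choice of the threshold for each user.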
To learn the separation threshold for each user, our first approach is to learn the probability density function of inter-arrival time for the users. By computing the best-fitting parameters for this density function for each user and defining the separation threshold as a function of those parameters, we can decompose the traffic trace for each user into its constituent bursts.
We modeled personalized inter-arrival time distributions by exploring different density functions, such as the exponential distribution, the Pareto distribution, and mixtures and concatenations of these distributions (details provided in Appendix A). From this analysis, we found that even these general density functions are not flexible enough to accommodate the highly varied and personalized inter-arrival times of different users. Thus, we concluded that even though this formalism is principled, there is a need for a more robust technique to separate within-burst and out-of-burst records that is independent of the personalized distribution shape of a user.
5 Burst Decomposition Using Adaptive Thresholds
In this section, we propose a robust burst decomposition algorithm that is independent of the distribution shape. Our technique only relies on the general characteristics of the inter-arrival time distribution observed in Appendix A, but not on any specific model. The only characteristic of the inter-arrival time distribution that we use is that there is a within-burst component with high arrival-rate of records (and small inter-arrival time), an out-of-burst component with low arrival-rate (high inter-arrival time) forming a long tail and that these two components are separable with a threshold. Our aim in this section is to have a threshold that adapts itself to any inter-arrival time distribution, subject to this general property.
We first observe that an optimal threshold $\tau$ is expected to lie in a low-probability range of the inter-arrival time density $p(t)$ and should satisfy the following conditions:
$p(t)$ for $t < \tau$ should generally be high, showing the presence of bursts;
$p(t)$ for $t > \tau$ should typically be low, implying periods of user inactivity.
In order to satisfy the above conditions, $\tau$ has to intercept the smallest point at which the probability density function of the inter-arrival time decays fairly close to zero and beyond which the density is minimal.
However, to quantitatively measure the significance of each density value, we would need a scalar indicator that determines when a value is minimal. Such an approach requires selecting a global scalar indicator, which would fail to detect the intrinsic variations in the density proportion between the within-burst and out-of-burst components for different users.
Therefore, instead of quantifying the density values directly, we leverage the conditional density to determine the threshold $\tau$. For a histogram of inter-arrival times with bins $b_1, b_2, \ldots$ and bin probabilities $p(b_i)$, the conditional density is the probability that a time sample belongs to bin $b_i$, conditioned on the fact that it belongs to a bin less than or equal to $b_i$: $P(b_i \mid b \le b_i) = p(b_i) / \sum_{j \le i} p(b_j)$. In other words, it measures the contribution of the current bin to the accumulated probability.
Our Algorithm 1 searches for $\tau$ by starting from the smallest value of the inter-arrival time and finding the point at which the probability gained by increasing $\tau$ further is insignificant compared to the probability accumulated up to that point (as captured by the conditional density). Specifically, the threshold is found when the contributions of $k$ consecutive bins are each less than a predefined probability $\epsilon$, for a pre-specified parameter $k$.
The burst decomposition algorithm then groups all consecutive records whose inter-arrival time is less than the obtained $\tau$ into bursts.
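The search described above can be sketched as follows. This is not the paper's Algorithm 1 itself, only one possible reading of it: the bin width, the probability cut-off `eps`, the run length `k`, the choice to return the left edge of the first qualifying bin, and the fallback when no run is found are all assumptions made for illustration.

```python
def adaptive_threshold(gaps, bin_width=1.0, eps=0.01, k=3):
    """Sketch of a per-user threshold search on inter-arrival times (seconds).

    Returns the left edge of the first run of k consecutive histogram bins
    whose contribution to the accumulated probability is below eps.
    """
    if not gaps:
        raise ValueError("need at least one inter-arrival time")
    n_bins = int(max(gaps) // bin_width) + 1
    counts = [0] * n_bins
    for g in gaps:
        counts[int(g // bin_width)] += 1

    # Conditional density of bin i given "bin <= i": counts[i] / sum(counts[:i+1]).
    cond, cum = [], 0
    for c in counts:
        cum += c
        # Before any mass is accumulated there is nothing to compare against,
        # so such bins are treated as significant (value 1.0).
        cond.append(c / cum if cum else 1.0)

    for i in range(n_bins - k + 1):
        if all(x < eps for x in cond[i:i + k]):
            return i * bin_width
    return n_bins * bin_width  # fallback: no quiet run found


# Toy data: many short gaps, a few medium ones, one long idle period.
gaps = [0.1] * 50 + [0.5] * 30 + [1.2] * 15 + [2.5] * 4 + [60.0]
print(adaptive_threshold(gaps), adaptive_threshold(gaps, eps=0.05))
```

Because the criterion is relative to the accumulated probability of each user, the same `eps` and `k` can yield different thresholds for users with different distribution shapes, which is the property the text relies on.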
In the next section, we provide evidence that this algorithm detects meaningful bursts that significantly improve the classification accuracy in identifying the domains that represent user activities.
We estimate the values of the scalar indicator, , used in the Algorithm 1 based on an analysis of the corresponding values across all users. In Figure 4 we only report the behaviors of users as representative of entire values computed across all users. It is easy to notice that for the would range from to seconds, which is a reasonable range to separate inter-arrival time values between within burst and out-of bursts for activities such as web browsing, reading, shopping, etc. Hence, this value of was used in our experimental analysis.
Next, we examine the results of our algorithm with respect to users with substantially different behaviors. In particular, we leverage the three users examined in Figure 9. Even though the distribution shapes and the number of records characterizing these three users are very different, the algorithm successfully finds a user-specific threshold, as shown in Figure 5.
6 Domain Classification
In this section, we describe our classification model for identifying the representative URLs and show that it is possible to achieve very high accuracy for this task even with truncated URLs. Features extracted from the burst decomposition presented in Section 5 play a crucial role in significantly improving the accuracy of our classification model.
Classification Model Formalization. We use a logistic regression model for the domain classification problem. Our model for logistic regression is as follows:
$$y_i \sim \mathrm{Bernoulli}(p_i), \qquad p_i = \frac{1}{1 + e^{-\eta_i}}, \qquad \eta_i = \beta_0 + \sum_j \beta_j x_{ij},$$
where $y_i$ is the binary label ($y_i = 1$ if URL $i$ is representative and $y_i = 0$ otherwise) and $x_{ij}$ is the specific classification feature that we derive from the record-level and burst-level analysis in Sections 6.1 and 6.2. The representative probability $p_i$ is computed by the logistic function on a linear predictor $\eta_i$, and all the parameters $\beta_j$ are estimated by the Iteratively Re-Weighted Least Squares (IRWLS) method.
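As an illustration of how a logistic model of this kind can be fitted by iteratively re-weighted least squares, the sketch below handles the smallest possible case: an intercept plus a single feature, written in pure Python. The update is the Newton form of the IRWLS step. The function names, the toy data and the fixed iteration count are assumptions for illustration, not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_irls(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) with Newton/IRWLS updates."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        ws = [p * (1.0 - p) for p in ps]          # IRWLS weights
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Weighted normal matrix X^T W X for the design [1, x].
        a00 = sum(ws)
        a01 = sum(w * x for w, x in zip(ws, xs))
        a11 = sum(w * x * x for w, x in zip(ws, xs))
        det = a00 * a11 - a01 * a01
        # Solve the 2x2 system (X^T W X) delta = gradient.
        b0 += (a11 * g0 - a01 * g1) / det
        b1 += (a00 * g1 - a01 * g0) / det
    return b0, b1


# Toy, non-separable data: one feature value and one binary label per URL.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_irls(xs, ys)
print(b0, b1)
```

With more features the 2x2 solve becomes a general linear solve, which is what statistical packages do internally when fitting a binomial generalized linear model.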
The domain classification follows three steps. First, we manually label URLs into two classes: representative and non-representative domains. Second, we extract five sets of web traces generated out of random users each, perform the burst decomposition and obtain aggregated measurements independently for each set. Finally, half of the labelled URLs of the first set are used in training the classifier, which is validated by the other half of the first set and the remaining four. We use five different sets to validate the robustness of our approach.
We demonstrate the accuracy of our classification approach in two steps. We first study the accuracy obtained by only using the record-level features and ignoring the burst-level features. Then, we show the improvements we gain by adding the burst-level features which are derived upon the detected bursts from our burst decomposition algorithm.
Record-level Features. The key part of our modeling is feature engineering, or identifying the right set of features to achieve a high accuracy. For the record-level features, shown in Table 1, we use the aggregated measurements across all users and compute the quantile values by ranging from to with an increment step equal to . These features were carefully selected to achieve a high accuracy with record-level features. Specifically, for each record we collect the leading and following inter-arrival time and the upload and download size. These features are examined as covariates in our domain classification model.
6.1 Accuracy with Record-level Features
Accuracy. As shown in Table 3, the resultant accuracy with the record-level features is quite poor. For the five sets of web traces, the accuracy varies between 69.7% and 72.9%, implying that around 30% of the URLs are misclassified. Using a stepwise model selection procedure, we found two of the analyzed features to be particularly important: the first is the difference between two quantile statistics of the download size per domain, and the second is a quantile statistic of the upload size. The estimated coefficients for this model are shown in Table 3; they imply that domains with a small variation in download size and a high upload size have a higher chance of being representative domains. Although these are the most relevant features at the record level, their discriminatory capacity remains limited.
|Record-level features (wrt )|
|Quantile of the leading inter-arrival time|
|Quantile of the next inter-arrival time|
|Quantile of the upload size|
|Quantile of the download size|
6.2 Accuracy with Burst-level Features
In this section, we show how the accuracy improves with features measured at burst-level.
Burst-level Features. By leveraging the burst decomposition algorithm, we segment our web traces in a series of consecutive bursts and we measure burst-specific characteristics. Specifically, for each URL , we choose a list of aggregated measurements, shown in Table 5, where denotes the set of bursts containing URL . We observe that two burst features, i.e. and (), in Table 6 are particularly important in improving the domain classification results. The first measure, i.e. , describes the probability that a URL is ranked in its burst and the second, i.e. , quantifies the probability that there are unique domains in the burst containing the URL (). Similar to record-level features, these aggregated measurements are examined as covariates in our domain classification model.
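Two aggregated measurements of this kind can be sketched as below, assuming each burst is given as the list of truncated URLs (domains) in arrival order. The sketch computes, per domain, the fraction of its bursts in which it comes first and the fraction in which it is the only unique domain. These two are just illustrative instances of the rank and unique-domain probabilities; the names `p_first` and `p_alone` and the toy domains are hypothetical.

```python
from collections import defaultdict


def burst_features(bursts):
    """Per-domain burst-level measurements from a list of bursts."""
    n_bursts = defaultdict(int)  # bursts containing the domain
    first = defaultdict(int)     # bursts where the domain is ranked first
    alone = defaultdict(int)     # bursts where it is the only unique domain
    for burst in bursts:
        unique = list(dict.fromkeys(burst))  # unique domains, arrival order
        for rank, domain in enumerate(unique, start=1):
            n_bursts[domain] += 1
            if rank == 1:
                first[domain] += 1
            if len(unique) == 1:
                alone[domain] += 1
    return {
        d: {"p_first": first[d] / n, "p_alone": alone[d] / n}
        for d, n in n_bursts.items()
    }


bursts = [
    ["portal.example", "ads.example", "cdn.example"],
    ["portal.example", "ads.example"],
    ["static.example"],
    ["static.example", "static.example"],
]
features = burst_features(bursts)
print(features)
```

In this toy example both `portal.example` and `static.example` always come first, but only `static.example` is always alone in its bursts, so comparing the two quantities separates them.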
Discriminating Features. We perform a model selection procedure, based on AIC, to select the most discriminating features for our classification model and starting from those listed in Table 5. We observe that the feature is selected with high significance. The intuition behind this is that the URLs which usually come first in bursts are more likely belonging to the representative class. Thus, is a good distinguishing feature between representative domains (SEARCH ENGINE, WEB PORTAL) and non-representative domains (ADS, CDN) (as shown in Figure 6). Solely using this feature will misclassify domains from STATIC CONTENT class as representative (as these are also likely to come first in burst). This class includes many CSS HTML pages and static images on web-pages. However, the exceptions such as those from STATIC CONTENT class have a high probability of being alone in their bursts, as shown by in Figure 6. Thus, the feature is able to distinguish between most representative and non-representative domains.
Note that the domains in the SEARCH ENGINE class have a unique characteristic, i.e. they show high values in both and features. However, the differenced can still act as a discriminator in selecting representative domains.
The corresponding estimated coefficients are shown in Table 5 along with standard deviation and -values, indicating all significant coefficients. As explained above, domains that have high rankings among others, i.e. do not appear alone in their bursts, are more likely to be representative domains. From the estimated value , we can also see that domains appearing in small bursts of few unique records have smaller chance of becoming representative domains.
6.3 Trade-off Between Classification Metrics
The relation between the linear predictor and representative probability is plotted in Figure 8, together with the binary labelled observations and histogram of each domain class. The red vertical line represents the decision boundary such that all URL with are put into representative class and the other are in the non-representative class. Hence, the ratio between the points on the left and right of the red line at row corresponds to the ratio between true negative-ness and false positive-ness. Similarly, the ratio between true positive-ness and false negative-ness is at row .
In Figure 8, the boundary value is chosen to minimize the number of misclassifications when separating representative from non-representative records. However, from a decision-theoretic point of view, the false-positive and false-negative penalties are usually different, so we may want to optimize the boundary value by customizing these penalties based on application-specific requirements. For instance, consider an application that generates users’ profiles. Such applications may want to put a higher penalty on false negatives (representative URLs incorrectly classified as non-representative) than on false positives (non-representative URLs incorrectly classified as representative). This is because when determining the activity of a burst, there is an opportunity to prune out the noise (non-representative URLs) further, while the representative URLs lost in the process are unlikely to be re-inserted later on. Thus, these applications should be calibrated to improve sensitivity, i.e., the ratio of correctly classified representative URLs to the total number of representative URLs.
Because the penalties are problem-specific and not obvious in many contexts, we show the trade-off between the true positive rate (the complement of the false negative rate) and the false positive rate with the receiver operating characteristic (ROC) curve in Figure 8. The figure also illustrates, in purple, another optimal boundary point, for the case of minimizing the sum of the false positive and false negative rates. The high area under the curve (AUC) again confirms the good performance of our classifier.
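For readers less familiar with the AUC, the sketch below computes it through its rank interpretation: the probability that a randomly chosen representative URL receives a higher score than a randomly chosen non-representative one, with ties counted as one half. The scores and labels are made-up toy values, and the function name is hypothetical.

```python
def auc_by_pairs(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]  # toy representative probabilities
labels = [0, 0, 1, 0, 1, 1]              # 1 = representative
print(auc_by_pairs(scores, labels))
```

The quadratic loop is fine for small labelled sets; a sort-based computation would be preferable for large ones.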
Table 7 provides further statistics on the trade-off between false positives and false negatives for the two boundary values shown in Figure 8. These trade-offs are characterized in terms of precision, negative predictive value, sensitivity, specificity and accuracy. We note that the boundary that minimizes the number of misclassifications results in higher precision and accuracy, while the boundary that minimizes the sum of the false positive and false negative rates results in better sensitivity. Thus, for applications with more emphasis on accuracy, we may choose the former, while for applications where sensitivity is crucial, we may select the latter.
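The five statistics named above follow directly from the confusion counts at a given boundary. The sketch below shows the standard definitions; the scores, labels and boundary values are toy numbers, not the values behind Table 7, and the function name is hypothetical.

```python
def metrics_at_boundary(scores, labels, boundary):
    """Confusion-matrix statistics when score > boundary means 'representative'."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s > boundary else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
high = metrics_at_boundary(scores, labels, boundary=0.5)
low = metrics_at_boundary(scores, labels, boundary=0.35)
print(high)
print(low)
```

Lowering the boundary can only keep or raise sensitivity, since more URLs are labelled representative, and can only keep or lower specificity; how precision and accuracy move depends on the data.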
Accuracy. The usage of burst-level features, and in particular , results in significant improvement in the accuracy of the classification model. As shown in Table 7, the accuracy improves from around 70% to around 90%; AIC value drops from 242.82 without burst-level features to just 112.8 with these features; and finally, the BIC drops from 252.72 to 122.69.
Altogether, these results show that it is possible to achieve a 90% accuracy in segregating representative URLs from non-representative URLs by using burst-level features on truncated URL web-traces. In other words, the burst decomposition and the extraction of specific features from the bursts overcome the information lost due to URL truncation. Note that this accuracy is in terms of the number of URLs correctly classified as being representative or non-representative. Popular URLs are more likely to be correctly classified by our methodology and thus, the accuracy in terms of number of records, download size (e.g., to answer questions like how much download is generated corresponding to each activity type) or number of bursts (identifying the activity for each burst) is likely to be significantly higher.
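The distinction between accuracy per distinct URL and accuracy per record can be made concrete with a small sketch. The correctness flags and record counts below are invented for illustration; the function name is hypothetical.

```python
def accuracy(correct_flags, weights=None):
    """Fraction of correctly classified items, optionally weighted per item."""
    if weights is None:
        weights = [1.0] * len(correct_flags)
    total = sum(weights)
    return sum(w for ok, w in zip(correct_flags, weights) if ok) / total


# Four distinct URLs: whether each was classified correctly, and how many
# records in the trace carry that URL (assumed values).
correct = [True, True, False, True]
record_counts = [1000, 500, 20, 300]

per_url = accuracy(correct)
per_record = accuracy(correct, record_counts)
print(per_url, per_record)
```

When the misclassified URLs are the rare ones, as in this toy case, the record-weighted figure is higher than the per-URL figure.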
7 Related Work
The past research related to identifying URLs representing user activities falls into the following categories:
Providing an activity description at a very coarse level (e.g., peer-to-peer networking, HTTP browsing, chatting, etc.) by filtering out URLs based on connection port number, packet payload, statistical traffic patterns, etc. [17, 29], primarily for the purpose of network traffic analysis and traffic classification (e.g., for CDNs).
Manually blacklisting URLs to filter out spam, adult content or advertisements.
Filtering out the unintentional traffic by relying on URL suffixes (e.g., .mp3, .js, etc.), URL header patterns, HTTP referrers and HTTP blacklists.
The first category of coarse-grained application-type classification is different from our representative URL identification problem, as our goal is to segregate URLs at the HTTP domain level (HTTP browsing itself is just one class in the coarse-grained classification). From a behavioral-analysis perspective, traffic classification [17, 29] cannot provide a medium-grained insight into user web activities such as reading, shopping, researching, etc.
The second category has obvious scalability limitations. Given that many new web-sites appear every day in different languages and different countries that may have very local characteristics, it is very expensive to manually maintain the blacklists. Furthermore, the existing blacklists (see  for a list of many manual blacklists) are for specific purposes (such as spam, adult content, advertisements) and do not contain all non-representative URLs (such as images and videos associated with the main content). In fact, it is difficult to even filter out all unintentional traffic using only these blacklists.
The third category critically relies on full HTTP web-traces. For some papers (e.g., ) in this category, the setting even allows a peek into a user’s full network traffic (including deep packet inspection of the content). Full HTTP URLs can reveal highly sensitive user information, and their usage raises serious privacy concerns. For instance, Song et al.  highlighted a practical privacy attack that exploits seemingly anonymous information recorded by shortened-URL services, such as HTTP referrer URLs, countries, browsers, platforms, etc., to infer the clicking pattern of a specific user. In contrast, our focus is on inferring medium-grained user behavior from minimal traffic traces (truncated URLs) and on techniques that allow us to offset the accuracy loss due to URL truncation.
Furthermore, our work deals with a large, diverse, but noisy traffic trace collected on the network side. This allows us to perform a detailed study that is not limited to a few domains or restricted to a few volunteer users. This is in contrast with many publications on behavior analysis that deal with data from users or service providers.
Privacy preserving user profiling. There has been considerable work in recent years on privacy preserving personalization. Herein, we list a few approaches:
Bilenko and Richardson  also consider the problem of user profiling while mitigating the privacy concerns. However, their approach is based on storing the sensitive information on the client side, in the form of cookies or browser local storage. The storage of this sensitive information still leaves a user vulnerable to privacy violations. On the other hand, our user profiling does not require the sensitive information to be stored at all. Similar client-side approaches in the context of personalized search (e.g., ) and online advertisements (e.g., ) have also been studied.
Nandi et al.  take an alternative approach to privacy preserving personalization. They replace the user traces by traces of a group of similar users. However, this requires user segmentation, which in turn requires significant historical data. Also, this results in an aggregate-level personalization and not an individual-level personalization.
Also, there are some theoretical solutions based on k-anonymity  and l-diversity  for preserving privacy. However, it is not clear if they can be useful for profiling personalized time-series data. There are also some approaches (e.g. ) that add random or correlated noise to the data to preserve the privacy. However, such approaches also introduce more noise in the user profiles.
Burst Detection. Kleinberg  proposed a discrete state space model as a burst detection algorithm, with applications in email streams. However, the focus of this solution is the varying exponential distribution’s rate, modelled by the hidden state. The rate characterizes the email arrival of a temporal local period but does not provide a clear distinction between within-burst and out-of-burst records. For example, even when the rate goes down to the smallest value, the positive skewness of exponential distribution still favours small inter-arrival time samples. Such an approach is unlikely to work for our problem of segregating two inter-arrival time classes.
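The skewness argument can be made concrete with a short calculation. This is a worked illustration rather than part of Kleinberg's model, and the symbol $\lambda$ for the exponential rate is assumed here. For an exponentially distributed inter-arrival time $X$ with rate $\lambda$:

```latex
\mathbb{E}[X] = \frac{1}{\lambda}, \qquad
\operatorname{median}(X) = \frac{\ln 2}{\lambda} < \mathbb{E}[X], \qquad
\Pr\!\left(X \le \tfrac{1}{\lambda}\right) = 1 - e^{-1} \approx 0.632 .
```

Roughly 63% of the samples therefore fall below the mean whatever the value of $\lambda$. Lowering the rate stretches the time scale but leaves the shape unchanged, so the density of a low-rate state is still highest at zero and the model on its own does not draw a sharp line between the two record classes.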
Crovella and Bestavros  and Wang et al.  looked at the related problem of counting process modeling of download size, open connection, disk operation request, etc. However, these papers primarily focus on the self-similarity property of the time series and do not provide a clear distinction and separation of different HTTP record types.
Karagiannis et al.  showed that the accuracy of the exponential distribution varies across different backbone packet traces. In general, the exponential distribution has nice mathematical features such as memorylessness and closed-form solutions of sum-concat-minimum operators. However, its light-tailed property may not be a good match for some datasets. In a more general context of human dynamics, Barabasi  discussed the bursty nature of human actions and argued, based on an analysis of email data, that a heavy-tailed Pareto distribution is a better match than the exponential distribution. However, both of these papers analyze aggregated datasets rather than per-user datasets, which are much more dynamic.
We have proposed a novel methodology to identify URLs representing user activities from a truncated URL web-trace. Our statistical methodology offsets the loss in accuracy due to URL truncation by considering additional features derived from the burst measurements. To enable the computation of burst-level features, we propose a novel technique for burst decomposition.
Once the set of representative URLs is identified, one can compare the (live) streaming web-traces of users to infer medium grained activities in real-time and offer personalized services. Burst decomposition can play a critical role in this part as well. Once the user-adaptive thresholds are identified, our burst decomposition algorithm can be used to decompose the streaming trace into bursts and a unique URL representing the activity in that burst can be identified using the identified set of representative URLs.
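The streaming use case described above can be sketched as follows. This is a minimal illustration that assumes a single fixed gap threshold per user and picks the first known representative domain in each burst; both choices are assumptions made for the example and may differ from the burst decomposition algorithm of this paper. The domain names are placeholders.

```python
def decompose_into_bursts(records, gap_threshold):
    """Group time-ordered (timestamp, domain) records into bursts.

    A new burst starts whenever the gap to the previous record exceeds
    gap_threshold (an assumed per-user value, in seconds).
    """
    bursts = []
    previous_time = None
    for timestamp, domain in records:
        if previous_time is None or timestamp - previous_time > gap_threshold:
            bursts.append([])
        bursts[-1].append((timestamp, domain))
        previous_time = timestamp
    return bursts


def label_burst(burst, representative_domains):
    """Return the first domain in the burst that is a known representative URL."""
    for _, domain in burst:
        if domain in representative_domains:
            return domain
    return None   # the burst contains only non-representative traffic


# Hypothetical trace of truncated URLs with timestamps in seconds.
trace = [
    (0.0, "news.example"), (0.2, "ads.example"), (0.5, "cdn.example"),
    (30.0, "shop.example"), (30.3, "analytics.example"),
]
representative = {"news.example", "shop.example"}

bursts = decompose_into_bursts(trace, gap_threshold=2.0)
activities = [label_burst(b, representative) for b in bursts]
print(activities)
```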
We believe that our methodology can be very useful for providing personalized services while remaining considerate of sensitive user data. For instance, state-of-the-art techniques to predict click-through rate (CTR) rely on behavioral targeting over fine-grained user data , such as advertisement clicks, web-page clicks, page views and search query data. A medium-grained user profile, such as the one created by our technique, can be used to provide good CTR predictions while respecting privacy considerations. Similarly, user segmentation based on behavioral targeted advertisement (e.g., ) can also benefit from our medium-grained profiling. Our profiling can also be employed to re-rank the search results for a more personalized experience (similar to the approach in ). More generally, we hope that our work will lead to deeper studies on the usage of truncated URL traces as a means of striking a fine balance between personalized services and user privacy.
-  http://dataprotection.ie/documents/guidance/Electronic_Communications_Guidance.pdf (2011), European Communities (Electronic Communications Networks and Services)(Privacy and Electronic Communications) Regulations
-  http://www.alexa.com (2015), Alexa: Actionable Analytics for the Web
-  Akaike, H.: A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19(6), 716–723 (1974)
-  Animesh Nandi, Armen Aghasaryan, M.B.: P3: A privacy preserving personalization middleware for recommendation-based services. In: Proceedings of 4th Hot Topics in Privacy Enhancing Technologies Symposium (HotPETS 2011) (2011)
-  Barabasi, A.L.: The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005)
-  BBC: http://www.bbc.com/news/technology-25825690 (2014), online, accessed Nov-2014
-  Bilenko, M., Richardson, M.: Predictive client-side profiles for personalized advertising. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 413–421. ACM (2011)
-  Chen, Y., Pavlov, D., Canny, J.F.: Large-scale behavioral targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 209–218. ACM (2009)
-  Crovella, M., Bestavros, A.: Self-similarity in world wide web traffic: evidence and possible causes. IEEE/ACM Transaction on Networking 5(6), 835–846 (1997)
-  European Data Protection Supervisor: https://secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2011/11-05-30_Evaluation_Report_DRD_EN.pdf (2011), online; accessed Nov-2014
-  Facebook: Facebook and the Irish Data Protection Commission. https://www.facebook.com/notes/facebook-public-policy-europe/facebook-and-the-irish-data-protection-commission/288934714486394 (2011), online, accessed Nov-2014
-  Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
-  Karagiannis, T., Molle, M., Faloutsos, M., Broido, A.: A nonstationary Poisson view of internet traffic. In: INFOCOM (2004)
-  Kleinberg, J.M.: Bursty and hierarchical structure in streams. In: KDD. pp. 91–101 (2002)
-  Li, F., Sun, J., Papadimitriou, S., Mihaila, G.A., Stanoi, I.: Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking. In: Proceedings of the 23rd International Conference on Data Engineering, ICDE. pp. 686–695. IEEE (2007)
-  Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: L-diversity: Privacy beyond k-anonymity. TKDD 1(1) (2007)
-  Nguyen, T.T.T., Armitage, G.J.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials 10(1-4), 56–76 (2008)
-  Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464 (1978)
-  Song, J., Lee, S., Kim, J.: I know the shortened urls you clicked on twitter: Inference attack using public click analytics and twitter metadata. In: Proceedings of the 22nd International Conference on World Wide Web. pp. 1191–1200. WWW ’13, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2013), http://dl.acm.org/citation.cfm?id=2488388.2488492
-  Sweeney, L.: k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 557–570 (2002)
-  TechCrunch: http://techcrunch.com/2014/10/01/hamburg-google/ (2015), online, accessed Mar-2015
-  Toubiana, V., Narayanan, A., Boneh, D., Nissenbaum, H., Barocas, S.: Adnostic: Privacy preserving targeted advertising. In: Proceedings of the Network and Distributed System Security Symposium, NDSS 2010. The Internet Society (2010)
-  Wang, M., Chan, N.H., Papadimitriou, S., Faloutsos, C., Madhyastha, T.M.: Data mining meets performance evaluation: fast algorithms for modeling bursty traffic. In: ICDE. pp. 507–516 (2002)
-  Wood, S.N.: Generalized additive models: an introduction with R. Chapman and Hall/CRC Texts in Statistical Science Series, Chapman and Hall/CRC Press (2006)
-  Xu, Y., Wang, K., Zhang, B., Chen, Z.: Privacy-enhancing personalized web search. In: Proceedings of the 16th International Conference on World Wide Web, WWW. pp. 591–600. ACM (2007)
-  Yan, J., Liu, N., Wang, G., Zhang, W., Jiang, Y., Chen, Z.: How much can behavioral targeting help online advertising? In: Proceedings of the 18th International Conference on World Wide Web, WWW. pp. 261–270. ACM (2009)
-  Zeltser, L.: http://zeltser.com/combating-malicious-software/malicious-ip-blocklists.html (2014), online, accessed Nov-2014
-  Zhang, F., He, W., Liu, X., Bridges, P.G.: Inferring users’ online activities through traffic analysis. In: Proceedings of the Fourth ACM Conference on Wireless Network Security. pp. 59–70. WiSec ’11, ACM (2011)
-  Zhang, J., Xiang, Y., Wang, Y., Zhou, W., Xiang, Y., Guan, Y.: Network traffic classification using correlation information. IEEE Trans. Parallel Distrib. Syst. 24(1), 104–117 (2013)
-  Zuckerberg, M.: Our commitment to the facebook community. https://www.facebook.com/notes/facebook/our-commitment-to-the-facebook-community/10150378701937131 (2011), online, accessed Nov-2014
Appendix A Modeling Inter-arrival Time Distribution
We consider the models of exponential and Pareto distribution functions to fit the inter-arrival time samples of all users. Specifically, we use the following distribution functions:
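Exponential Distribution (EXP):
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0,$$
where $\lambda$ is the inter-arrival rate of all records.
Pareto Distribution (PRT):
$$f(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}, \quad x \ge x_m,$$
where $x_m$ is the minimum value of the Pareto distribution and $\alpha$ is the shape parameter.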
In addition, we use two mixture models. The intuition behind the mixture approximation is that the two models will respectively capture the high arrival rate for the within-burst records, and the slow arrival rate for the out-of-burst records.
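Mixture of two Exponential Distributions (EXP2):
$$f(x) = p\,\lambda_1 e^{-\lambda_1 x} + (1-p)\,\lambda_2 e^{-\lambda_2 x}, \quad x \ge 0,$$
where $\lambda_1$ and $\lambda_2$ are the corresponding inter-arrival rate parameters for within-burst and out-of-burst records with $\lambda_1 > \lambda_2$; $p$ is the proportion of within-burst records.
Mixture of two Pareto Distributions (PRT2):
$$f(x) = p\,\frac{\alpha_1 x_m^{\alpha_1}}{x^{\alpha_1+1}} + (1-p)\,\frac{\alpha_2 x_m^{\alpha_2}}{x^{\alpha_2+1}}, \quad x \ge x_m,$$
where $\alpha_1$ and $\alpha_2$ are the corresponding Pareto shapes of within-burst and out-of-burst records; $p$ is the proportion of within-burst records.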
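To illustrate how a two-component mixture can separate a fast and a slow arrival process, the sketch below fits a mixture of two exponentials with expectation-maximization (EM), one common way of maximizing a mixture likelihood. The starting values, the synthetic data and the function name are assumptions made for the example; the text does not specify which optimizer was used.

```python
import math
import random


def fit_exp2(samples, iterations=100):
    """EM for the mixture p * Exp(rate_fast) + (1 - p) * Exp(rate_slow)."""
    mean = sum(samples) / len(samples)
    p, rate_fast, rate_slow = 0.5, 2.0 / mean, 0.5 / mean   # crude starting point
    for _ in range(iterations):
        # E-step: probability that each sample came from the fast component.
        resp = []
        for x in samples:
            fast = p * rate_fast * math.exp(-rate_fast * x)
            slow = (1.0 - p) * rate_slow * math.exp(-rate_slow * x)
            resp.append(fast / (fast + slow))
        # M-step: weighted versions of the single-exponential estimate.
        total_fast = sum(resp)
        total_slow = len(samples) - total_fast
        p = total_fast / len(samples)
        rate_fast = total_fast / sum(r * x for r, x in zip(resp, samples))
        rate_slow = total_slow / sum((1.0 - r) * x for r, x in zip(resp, samples))
    return p, rate_fast, rate_slow


# Synthetic inter-arrival times: 70% fast gaps (rate 10), 30% slow gaps (rate 0.1).
random.seed(0)
samples = [random.expovariate(10.0) if random.random() < 0.7 else random.expovariate(0.1)
           for _ in range(2000)]

p, rate_fast, rate_slow = fit_exp2(samples)
print(p, rate_fast, rate_slow)
```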
We observed that the exponential component matches with the within-burst records while the heavy tailed Pareto component is better for out-of-burst records. Hence, we use another function form:
Concatenation of Exponential and Pareto Distributions (EXP_PRT):
where is the indicator function, is the density normalization constant and the constant is to make the function continuous.
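One piecewise density that matches this description (an exponential branch below a boundary, a Pareto branch above it, a constant that makes the two branches meet, and an overall normalization constant) can be sketched as follows. The exact functional form and the symbols `rate`, `x_m` and `alpha` are assumptions made for the example and are not copied from the equation of the paper.

```python
import math


def exp_prt_density(x, rate, x_m, alpha):
    """Exponential density below x_m joined continuously to a Pareto tail above it."""
    # Constant that scales the Pareto branch so both branches agree at x = x_m.
    continuity = rate * math.exp(-rate * x_m) * x_m / alpha
    # Mass of the exponential branch on [0, x_m) and of the scaled Pareto branch
    # on [x_m, inf); a Pareto(x_m, alpha) density integrates to one on its support.
    exp_mass = 1.0 - math.exp(-rate * x_m)
    pareto_mass = continuity
    norm = 1.0 / (exp_mass + pareto_mass)   # density normalization constant
    if x < 0:
        return 0.0
    if x < x_m:
        return norm * rate * math.exp(-rate * x)
    return norm * continuity * alpha * x_m ** alpha / x ** (alpha + 1)


# Hypothetical parameter values for illustration.
rate, x_m, alpha = 5.0, 1.0, 1.5
left = exp_prt_density(x_m - 1e-9, rate, x_m, alpha)
right = exp_prt_density(x_m, rate, x_m, alpha)
print(left, right)   # the two branches meet at the boundary
```

In this form the boundary `x_m` is the only place where the record type changes, which is what makes it a natural candidate for a separation threshold.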
These five distributions are fitted to the per-user inter-arrival records, and all the parameters for each user are estimated by the maximum likelihood estimation (MLE) method.
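For the two single-component models the maximum likelihood estimates have well-known closed forms, sketched below. The function names and sample values are illustrative; the mixture and concatenated models have no closed-form estimates and would need a numerical optimizer, which is not shown.

```python
import math


def fit_exponential(samples):
    """MLE of the exponential rate: the reciprocal of the sample mean."""
    return len(samples) / sum(samples)


def fit_pareto(samples):
    """MLE of the Pareto minimum x_m and shape alpha (samples must be positive)."""
    x_m = min(samples)
    log_sum = sum(math.log(x / x_m) for x in samples)
    alpha = len(samples) / log_sum
    return x_m, alpha


# Hypothetical inter-arrival times in seconds.
samples = [1.0, 2.0, 4.0, 8.0]
rate = fit_exponential(samples)
x_m, alpha = fit_pareto(samples)
print(rate, x_m, alpha)
```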
Note that our goal is not just to model the inter-arrival density function, but to model it in a parameterized way that provides an intuition for identifying the separation threshold for burst decomposition. Kernel density approximation may be well-matched for the target density but cannot be used as it does not imply any meaning of record types.
The mixture models of the standard exponential and Pareto functions portray such an intuition. For example, in the method EXP2, the exponential component with the high arrival rate is supposed to contain all within-burst records, so its quantile can be used as a threshold to separate the record types. Similarly, in the method EXP_PRT, the location parameter can be used as the threshold, as it is the boundary between the exponential component for within-burst records and the heavy-tailed Pareto distribution for out-of-burst records.
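As a small illustration of the quantile-based threshold, the sketch below takes the rate of the fast exponential component and a quantile level, and splits a list of inter-arrival times at the resulting value. The rate, the 0.99 quantile level and the gap values are assumptions chosen for the example.

```python
import math


def exponential_quantile(rate, q):
    """Inverse CDF of an exponential distribution with the given rate."""
    return -math.log(1.0 - q) / rate


def split_by_threshold(inter_arrival_times, threshold):
    """Separate gaps into within-burst (<= threshold) and out-of-burst (> threshold)."""
    within = [t for t in inter_arrival_times if t <= threshold]
    out_of = [t for t in inter_arrival_times if t > threshold]
    return within, out_of


# Hypothetical fitted rate of the within-burst component (records per second).
rate_fast = 10.0
threshold = exponential_quantile(rate_fast, 0.99)

gaps = [0.05, 0.1, 0.3, 2.0, 45.0]
within, out_of = split_by_threshold(gaps, threshold)
print(threshold, within, out_of)
```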
A.1 Model Selection and Evaluation
We consider the inter-arrival time distribution of various users to determine (i) whether or not our basic intuition of a user’s data traffic as essentially consisting of a series of bursts is true, (ii) which of the considered functions best fits the inter-arrival time distribution of users and (iii) how well the best-fitting function matches the inter-arrival time distribution of users.
We observed that for most users, there are two well-separated components in the inter-arrival time distributions. Furthermore, for most users, the exponential function tightly matches the within-burst component and the tail of the Pareto function closely matches the portion of human activity with long delay, corresponding with the out-of-burst records. In addition, the positive skewness of both exponential and Pareto distributions implies the existence of compacted within-burst records, which have small inter-arrival time. All of these observations confirm our basic intuition that a user’s data-traffic primarily consists of a number of contiguous bursts of URL records. Furthermore, the two components of within-burst and out-of-burst records in the inter-arrival time distribution are usually separable.
Determining the Best-fit Model. Next, we consider the modeling of inter-arrival time distributions. We compare the fitness of all the studied distributions by using the Akaike information criterion (AIC) , based on the Kullback-Leibler discrepancy between the true unknown model and the best estimated model of the assumed family, and the Bayesian information criterion (BIC) , based on the Laplace approximation of the marginal likelihood. The EXP_PRT model, presented in Equation 8, shows the best AIC and BIC values, i.e., the smallest discrepancy between the data and the model, which we have highlighted in bold in Table 8. Thus, we conclude that the EXP_PRT model best fits the target inter-arrival time distribution, with its exponential function capturing the within-burst component and its Pareto function matching the out-of-burst component.
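Both criteria are simple functions of the maximized log-likelihood, the number of free parameters and, for BIC, the sample size; lower values indicate a better trade-off between fit and complexity. The sketch below uses the standard definitions with a single exponential fit on toy data; the sample values are invented for illustration.

```python
import math


def exponential_log_likelihood(samples, rate):
    """Log-likelihood of the samples under an exponential density with this rate."""
    return sum(math.log(rate) - rate * x for x in samples)


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return num_params * math.log(num_samples) - 2 * log_likelihood


# Toy inter-arrival times and their maximum likelihood exponential rate.
samples = [0.1, 0.2, 0.4, 0.3]
rate = len(samples) / sum(samples)
ll = exponential_log_likelihood(samples, rate)
print(aic(ll, 1), bic(ll, 1, len(samples)))
```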
Fitting Error of the EXP_PRT Model Across All Users. We investigate whether the EXP_PRT model is able to capture the diversity across users with a high level of accuracy. As we show next, this is not always the case, and so a robust technique to separate within-burst and out-of-burst records that is independent of the user-specific distribution shape is needed.
Figure 9 shows the distribution fitting of EXP2, PRT2 and EXP_PRT functions for three different users. We find that out of the three users considered in this figure, the EXP_PRT function fits the target inter-arrival time distribution of two users (Figures 9 and 9) very well, but it is a poor match for the third user (Figure 9). We quantify the density approximation error by the following measure:
Here, is the target kernel density as interpolated (from the inter-arrival time distribution of the user) by a kernel density approximation function. The function is the one that we use to fit our target distribution. The error measure is calculated by numerical integration.
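As an illustration of this kind of measure, the sketch below assumes the error is the integrated absolute difference between a kernel density estimate and a fitted density, evaluated with the trapezoidal rule. This particular form is an assumption made for the example; for two proper densities it lies between 0 (identical) and 2 (no overlap). The Gaussian kernel, the bandwidth and the function names are likewise illustrative.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def integrated_abs_error(target, fitted, lower, upper, steps=2000):
    """Trapezoidal estimate of the integral of |target(x) - fitted(x)| on [lower, upper]."""
    width = (upper - lower) / steps
    values = [abs(target(lower + i * width) - fitted(lower + i * width))
              for i in range(steps + 1)]
    return width * (sum(values) - 0.5 * (values[0] + values[-1]))


# Two single-kernel densities: one compared with itself, one with a distant copy.
near = gaussian_kde([0.0], 1.0)
far = gaussian_kde([20.0], 1.0)

perfect = integrated_abs_error(near, near, -10.0, 30.0)
disjoint = integrated_abs_error(near, far, -10.0, 30.0)
print(perfect, disjoint)
```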
The value of our error metric ranges from to , where corresponds to a perfect fit. The closer the error statistic is to , the poorer is the fit, and a value of greater than reflects an extremely poor fit. The error measurements of the method EXP_PRT for Figures 9, 9 and 9 are , and respectively.
Figure 10 presents the variation in the value of our error metric for various fitting functions, over all users. Again, we observe that the EXP_PRT function results in the lowest error (even with this error measure). However, this is still a poor fit. The median user has an error of , which means that about of the users have approximation errors larger than . We consider this error too high for the fitted function to serve as the basis of an analytics system. More specifically, the thresholds based on such a poorly fitting function are unlikely to result in a good burst decomposition.
The above analysis suggests that these density functions are not flexible enough to accommodate highly varied and personalized inter-arrival time of different users. Thus, there is a need for a robust technique to separate within-burst and out-of-burst records that is independent of the personalized distribution shape of a user.