# A highly optimized flow-correlation attack

###### Abstract

Deciding that two network flows are essentially the same is an important problem in intrusion detection and in tracing anonymous connections. A stepping stone or an anonymity network may try to prevent flow correlation by adding chaff traffic, splitting the flow into several subflows or adding random delays. A well-known attack against these types of systems is active watermarking. However, active watermarking systems can be detected and an attacker can modify the flow in such a way that the watermark is removed and can no longer be decoded. This leads to the two basic features of our scheme: a highly-optimized algorithm that achieves very good performance and a passive analysis that is undetectable.

We propose a new passive analysis technique where detection is based on the Neyman-Pearson lemma. We correlate the inter-packet delays (IPDs) from both flows. Then, we derive a modification to deal with stronger adversary models that add chaff traffic, split the flows or add random delays. We empirically validate the detectors with a simulator. Afterwards, we create a watermark-based version of our scheme to study the trade-off between performance and detectability. Then, we compare the results with other state-of-the-art traffic watermarking schemes in several scenarios, concluding that our scheme outperforms the rest. Finally, we present results using an implementation of our method on live networks, showing that the conclusions can be extended to real-world scenarios.

Our scheme needs only tens of packets under normal network interference and a few hundred packets when a number of countermeasures are taken.

## I Introduction

Network attackers intentionally hide their identity to avoid prosecution. A widely-used way of achieving this anonymity is forwarding the traffic through a chain of compromised hosts called stepping stones [1]. Tracing back the chain to the source is a challenging problem due to the encrypted or even anonymized connections between stepping stones. Deciding that two flows are essentially the same can be applied to the mentioned problem as well as in many other contexts, such as tracing anonymous connections [2] or preventing congestion attacks on anonymous networks [3].

There are two general approaches for finding correlated flows: passive analysis and active watermarks. Passive analysis schemes are based on correlating some characteristics of the flows, such as packet timings or packet counts, without altering such flows [4, 5, 6]. On the other hand, active watermarks actively modify the flow by delaying individual packets. Active watermarks can be packet-based, embedding the watermark on individual delays between packets [7, 8] or interval-based, embedding the watermark in some properties of the intervals [3, 9, 10].

Lately, most of the work has focused on the design of new active watermarking techniques, as they are considered to be more efficient, obtaining lower error rates for flows of the same length. All of them are designed with the idea of being undetectable, as detection can lead the stepping stone or the anonymous network to modify the flow in such a way that the watermark can no longer be detected. In spite of this, detecting these watermarks has been shown to be feasible and not a hard task [11, 12]. This allows an attacker, e.g. the stepping stone or the anonymous network, to easily modify the timing of the detected flow to prevent the correlation using techniques such as chaff packets, flow splitting, merging flows or adding random delays.

Achieving a good performance for flow-correlation approaches is critical for two main reasons: first, to be able to deal with flow modifications and other countermeasures, and second, to ensure a minimum reliability, implying a small probability of false positives. These two reasons lead to the necessity of extremely accurate techniques to correlate flows as provided by our scheme. Furthermore, we cannot rely on the length of the sequence, as in many kinds of stealthy attacks, the amount of traffic sent by the attackers or compromised bots is very small.

We propose a passive traffic analysis technique that outperforms any of the state-of-the-art traffic watermarking schemes. For instance, 21 packets separated at least 10 ms are enough to correlate two flows, one in Virginia, the other in California, correctly with probability when the false positive probability is fixed to and no countermeasures are exerted. The proposed method saves the inter-packet delays (IPDs) of the flow and uses a detector based on the likelihood ratio test (Neyman-Pearson lemma).

As IPDs are not robust against the insertion and drop of packets, we develop a modification which is robust against chaff packets, repacketization, flow splitting, and attacks that add or remove packets from the flow. We also make it robust against random delays under a maximum delay constraint.

The rest of the paper is organized as follows: Section II reviews previous schemes for correlating flows and techniques for detecting active watermarks. In Section III we introduce the notation that we follow and formally describe the model. In Section IV we construct our detector. Section V validates its performance using a simulator. Section VI proposes a modification to ensure robustness against chaff traffic, flow splitting and constrained random delays. In Section VII we create an active watermark to study the trade-off between performance and detectability. In Section VIII we compare our passive scheme with existing algorithms in terms of error probability. Section IX shows the result of a real implementation. Finally, Section X summarizes our contribution.

## II Previous Work

### II-A Passive analysis

Zhang and Paxson [4] proposed to correlate the traffic by measuring the time that both flows are in OFF (i.e., no transmission) state. They achieve high confidence when connections are several minutes long (i.e., thousands of packets), but their method is less reliable on short connections. They do not consider any alteration in the traffic. Donoho et al. [5] studied what happens if the stepping stone modifies its flow to evade detection with a maximum tolerable delay constraint. Then, with large enough sequences, they can correlate the traffic regardless of the modification. They use wavelets to separate the short-term behavior from the long-term behavior, and use the correlation on the latter. Blum et al. [6] studied stepping-stone detection under a maximum tolerable delay constraint. They count the difference between the number of packets in both flows. When this difference goes over a certain value they conclude that the flows are not correlated.

The common drawback of those three methods is that they require a large number of packets to achieve an acceptable performance. This number can be significantly reduced by using active watermarks, which are discussed next.

### II-B Active watermarks

Wang and Reeves [7] proposed the first active flow watermark. The watermark is embedded in the IPDs. They first quantize the IPD and embed one bit of information by adding half of the quantization step or not. They argue that with sufficient redundancy (infinitely large watermark) the watermark can always be detected even if a timing perturbation is added to each packet. Hence, the drawback of this method is the number of packets needed to obtain a good performance. Wang et al. [10] proposed an interval centroid based watermark (ICB). They divide the time into intervals. In each interval they embed one bit of the watermark; if the bit is 0 they send a request in the first half of the interval, and if the bit is 1 they do it in the second half. Each bit is decoded according to which half of the interval the centroid falls in. Yu et al. [13] proposed an interval watermark based on Direct Sequence Spread Spectrum (DSSS) communication techniques in order to hide it. The DSSS signal is embedded by modifying the traffic rate. This method again requires a long sequence.

Houmansadr et al. [8] proposed RAINBOW, a non-blind watermark which is robust to packet drops and repacketization. They record the IPD, then they embed the watermark by modifying the IPDs by a different quantity ( for 1, for 0, or vice versa). The normalized correlation is used in detection, and a selective correlation when dealing with added and dropped packets. Houmansadr and Borisov [3] proposed SWIRL (Scalable Watermark that is Invisible and Resilient to packet Losses). The flow is divided into intervals: half of them are used to determine the slots pattern and the other half are used to actually embed the watermark by delaying packets so as they fall into certain slots. Pyun et al. [9] proposed an interval-based watermark (IB) designed to resist attacks that modify the number of packets, such as flow splitting, chaff packets and repacketization. The information is embedded in the difference between the number of packets in two contiguous intervals. This method has the drawback of being more detectable compared to the others.

All the methods discussed above have been shown to be detectable, as discussed in the next section. This would allow an attacker to modify flows known to be watermarked in such a way that the watermark is removed.

### II-C Detecting watermarks

Peng et al. [14] showed how a watermark can be detected and replicated. They detect which packets do not come from the assumed one-way packet delay distribution. Using that information, they can recover the parameters of the watermark algorithm thus being able to replicate it. Specifically, they applied their attack against the watermark in [7].

Kiyavash et al. [15] discovered how one can not only detect the watermark but also extract its parameters and key when several network flows are watermarked using the same key. This attack is effective against most of the interval-based watermarks: ICB, DSSS and IB. However, RAINBOW and SWIRL are designed to be immune to this attack.

Luo et al. [11] showed that any practical timing-based traffic watermark causes noticeable alterations in the intrinsic timing features typical of TCP flows, and so it can be easily detected. Concretely, they propose metrics based on the round-trip time (RTT), IPDs, and one-packet bursts, that can expose IB, ICB, RAINBOW and SWIRL watermarks for any kind of traffic: bulk or interactive. Lin and Hopper [12] proposed more efficient ways to deal with passive detection than [11]. They also argued that security against passive detection is not sufficient, as a stronger adversary that knows the previous flow is feasible in many scenarios.

## III Proposed Scheme

This section introduces the notation we use and explains how we correlate the flows to decide whether they are linked or not.

Notation. We use the following notation. Random variables (r.v.) are denoted by capital letters (e.g., $X$), and their actual values by lower case letters (e.g., $x$). Sequences of random variables are denoted with a superscript (e.g., $X^n$). The probability density function (pdf) of a continuous random variable $X$ is denoted by $f_X(x)$. When no confusion is possible, we drop the subscript in order to simplify the notation. The sample mean is denoted by $\bar{x}$. We summarize the name for each r.v. and parameters introduced in the sequel in Table I.

### III-A System model

Random Variables | |
---|---|
PDV modification by an Attack | |
IPD at the Creator | |
IPD at the Detector | |
PDV or Jitter introduced by the network | |
Network delays | |
Timing information at the creator | |
Timing information at the detector | |
IPD modification by a Watermark | |
Parameters | |
Original sequence length | |
Matched sequence length | |
Threshold for the likelihood-ratio test to reject | |
Synchronization constant | |
Threshold to consider a packet as lost | |
Packet loss probability | |
Probability of packet loss due to the network | |
Probability of not detecting a packet | |
Number of subflows after a split | |
Maximum delay constraint for an attacker | |
Maximum delay constraint for the watermark | |

Figure 1 illustrates our system model. A flow of length packets, that we are interested in tracking, goes through a certain link, termed “creator”, where we can measure its packet timing information, . The th inter-packet delay (IPD) at the creator is defined as , and these values are saved for later use in detection. This flow continues through the network without any modification.

The “detector” is another link in which we can measure the timing information, , where is the network delay suffered by the th packet. Then, the IPDs at the detector are

(1) |

where represents the packet delay variation (PDV), also known as jitter.

By using the information of the actual values and , the detector has to decide correctly if the two flows are linked. Two flows are linked if they follow a common timing pattern due to sharing the same source (i.e. the unencrypted payload is the same). Formally, we can express this problem via classical hypothesis testing with the following hypotheses:
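As a sketch under assumed notation, write $x_i$ and $y_i$ for the IPDs at the creator and at the detector, and $j_i$ for the PDV. These symbol names and the labels $H_0$ and $H_1$ are chosen here for illustration. The decision problem described above can then be stated as:

```latex
\begin{aligned}
H_0 &: \; y^n \text{ is independent of } x^n
      && \text{(the flows are not linked)},\\
H_1 &: \; y_i = x_i + j_i \ \text{ for every index } i
      && \text{(the flows are linked)}.
\end{aligned}
```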

### III-B Performance Metrics

To measure performance, we use two metrics: the probability of detection (), which represents the probability of deciding that the flows are linked when they actually are; and the probability of false positive (), which represents the probability of deciding incorrectly that the flows are linked. Formally, is the probability of deciding when holds, whereas is the probability of deciding when holds.

Typically, performance is graphically represented using the so-called ROC (Receiver Operating Characteristic) curves, which represent vs. . In a practical setting, one fixes a certain value of (that has to be very small if we want to achieve a high reliability) and then measure (which we would like to be as large as possible).

In order to compare different ROCs in a simple way, we use the AUC (area under the ROC curve), a measure that takes a value of 1 in the case of perfect detection and 0.5 in case of random choice. The AUC is shown in the legend of each graph.
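The ROC and AUC computation can be illustrated with a minimal sketch. It assumes that a detector has already produced one score per trial under each hypothesis; the function names `roc_points` and `auc` are hypothetical and the trapezoidal rule is one common choice, not necessarily the one used for the figures.

```python
def roc_points(scores_h0, scores_h1):
    """Sweep a threshold over all observed scores and return (P_F, P_D) pairs.

    scores_h0: detector scores for unlinked flow pairs.
    scores_h1: detector scores for linked flow pairs.
    """
    thresholds = sorted(set(scores_h0) | set(scores_h1), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for eta in thresholds:
        p_f = sum(s >= eta for s in scores_h0) / len(scores_h0)
        p_d = sum(s >= eta for s in scores_h1) / len(scores_h1)
        points.append((p_f, p_d))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (f0, d0), (f1, d1) in zip(points, points[1:]):
        area += (f1 - f0) * (d0 + d1) / 2.0
    return area
```

With perfectly separated scores the area is 1, and with identically distributed scores it is 0.5, matching the two reference values mentioned above.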

## IV Basic Detector

In this section we derive our detector and model the distributions of PDVs and IPDs as needed.

### IV-A Detector construction

In order to obtain the best possible performance, we construct the optimal detector, which is the likelihood ratio test. The Neyman-Pearson lemma proves that this test is the most powerful one between two simple hypotheses [16]. Hence, our detector chooses when

(2) |

and in the opposite case. represents the likelihood function and is a threshold that we fix to achieve a certain probability of false positive.

Recall from (III-A) that if holds, then . Conversely, if holds, is a sequence with joint pdf .

For feasibility reasons, we constrain the detector to use first-order statistics, discarding the information carried by higher-order statistics. This is equivalent to assuming sample-wise independence in the sequences and . In Section V-B we quantify the impact of this assumption on performance, comparing the real results with those that would be obtained for independent and identically distributed (i.i.d.) sequences. Under these assumptions, the likelihood ratio becomes

(3) |

Therefore, we need to model the PDVs and the IPDs, i.e. determine and .
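Under sample-wise independence, a likelihood ratio of this kind factorizes into per-sample terms. The following is a sketch of that general form rather than a reproduction of the paper's exact expressions. The symbols $f_J$ (PDV pdf), $f_X$ (IPD pdf), $\Lambda$ and $\eta$ are assumed names, and the sketch further assumes that under $H_0$ the IPDs seen at the detector follow the same distribution as IPDs at the creator:

```latex
\Lambda(x^n, y^n)
  = \prod_{i=1}^{n} \frac{f_J(y_i - x_i)}{f_X(y_i)},
\qquad
\text{decide } H_1 \text{ if } \Lambda(x^n, y^n) > \eta,
\text{ and } H_0 \text{ otherwise.}
```

In practice it is convenient to compare $\log \Lambda$ with $\log \eta$, which turns the product into a sum and avoids numerical underflow for long sequences.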

### IV-B Modeling the packet delay variation

To model the distribution of the PDVs, we first measure them in several real connections, then fit these data to some candidate distributions and select the distribution that matches best.

The measured delays are reported in [17]. This dataset contains the delays between two hosts during 72 hours, and for 11 different scenarios. As is customary, we separate these data into three subsets: training, validation and test, using 24 hours of data for each.

Scenarios 1 to 9 measure common Internet connections between two hosts. Scenario 10 models the delays of a stepping-stone scenario, where a host in Oregon is retransmitting to a host in California the flow coming from a host in Virginia. Scenario 11 measures the delays associated with one instance of the Tor network [18]. In order to get a general idea about the connection scenarios, we show some basic information of the hosts and the connections in Table II, where is the probability of packet loss and the source and destination are represented with ISO 3166 codes [19].

Scenario | Source | Dest. | [ms] | |
---|---|---|---|---|
Sc1 | CA-US | NM-US | | |
Sc2 | OR-US | NM-US | | |
Sc3 | VA-US | NM-US | | |
Sc4 | ES | NM-US | | |
Sc5 | IE | NM-US | | |
Sc6 | JP | NM-US | | |
Sc7 | AU | NM-US | | |
Sc8 | BR | NM-US | | |
Sc9 | SG | NM-US | | |
Sc10 | VA-US | CA-US | | |
Sc11 | NM-US | NM-US | | |

From these measured delays we calculate the measured PDV as . The basic statistics from Table III imply a nearly symmetric (i.e., small skewness) and leptokurtic distribution (i.e., sharp peak and heavy tail).

Scenario | [s] | Var. | Skew. | Kurtosis |
---|---|---|---|---|
Sc1 | | | | |
Sc2 | | | | |
Sc3 | | | | |
Sc4 | | | | |
Sc5 | | | | |
Sc6 | | | | |
Sc7 | | | | |
Sc8 | | | | |
Sc9 | | | | |
Sc10 | | | | |
Sc11 | | | | |

To construct the model, we make the same assumptions as to build the test, i.e. an i.i.d. sequence. The candidate distributions were selected among the ones that have support on and possess the mentioned characteristics. The chosen distributions are Cauchy, Gumbel, Laplace, Logistic and Normal. Their pdfs are summarized in Table IV, where the indicator function takes the value 1 when , and is 0 otherwise.

Distrib. | |
---|---|
Cauchy | |
Gumbel | |
Laplace | |
Logistic | |
Normal | |
Exp. | |
Pareto | |
LogNor. | |
LogLog. | |
Weibull | |

We estimate the respective parameters using robust statistics, so that outliers do not distort the estimates. These estimators are based on the median and the median absolute deviation and are calculated as explained in [20, Chapter 3]. Afterwards, we measure the goodness of fit between the validation sequence and the model using the square root of the Jensen-Shannon divergence (JSD), [21]. This is a metric for two probability densities , which is based on the Kullback-Leibler divergence (KLD) as follows:

(4) |

where is the mid-point measure, and is the KLD, defined as

(5) |
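The fitting and scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of [20] or [21]: the densities are replaced by discrete (histogram) probabilities, logarithms are taken in base 2 so that the result lies in [0, 1], and all function names are hypothetical. The scale conversions rely on two standard facts: the median absolute deviation (MAD) of a Cauchy distribution equals its scale, and the MAD of a Laplace distribution with scale $b$ equals $b \ln 2$.

```python
import math
import statistics


def robust_location_scale(samples):
    """Return the median and the median absolute deviation (MAD)."""
    med = statistics.median(samples)
    mad = statistics.median([abs(s - med) for s in samples])
    return med, mad


def cauchy_params(samples):
    """Location and scale of a Cauchy fit: the MAD equals the scale."""
    return robust_location_scale(samples)


def laplace_params(samples):
    """Location and scale b of a Laplace fit: MAD = b * ln 2."""
    med, mad = robust_location_scale(samples)
    return med, mad / math.log(2)


def kld(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete pmfs."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sqrt_jsd(p, q):
    """Square root of the Jensen-Shannon divergence between two pmfs."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]  # mid-point measure
    jsd = 0.5 * kld(p, m) + 0.5 * kld(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding
```

Identical distributions give 0 and distributions with disjoint support give 1, so smaller values indicate a better fit.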

Scenario | Cau. | Gum. | Lap. | Log. | Nor. |
---|---|---|---|---|---|
Sc. 1 | | | | | |
Sc. 2 | | | | | |
Sc. 3 | | | | | |
Sc. 4 | | | | | |
Sc. 5 | | | | | |
Sc. 6 | | | | | |
Sc. 7 | | | | | |
Sc. 8 | | | | | |
Sc. 9 | | | | | |
Sc. 10 | | | | | |
Sc. 11 | | | | | |

Results from Table V show that no single distribution clearly stands out above the rest, although the Laplace and the Cauchy distributions give the best fits.

The Laplacian is the most commonly used model for the jitter, but Rio-Dominguez et al. [22] claimed that an alpha-stable distribution models it better. Note that a Cauchy distribution is a particular case of an alpha-stable distribution, but we do not generalize it further, as we are interested in a closed-form pdf model.

The performance of the two possible detectors, based on Laplace and Cauchy distributions, respectively, is evaluated in Section V-B.

### IV-C Modeling the Inter-Packet Delays

Many works assume a Poisson model for the traffic because of its desirable theoretical properties [23]. This model implies that the IPDs form an i.i.d. exponentially distributed sequence. However, Paxson et al. [24] have shown that this model is not accurate for interactive applications.

We model the IPDs of both the SSH and HTTP protocols. As done in [24], we only take into account packets that are separated by at least 10 ms, considering that if two packets are separated by less than 10 ms they are subpackets of the same packet. Therefore, the considered IPDs are lower bounded by 10 ms. We use the captures from Dartmouth College [25], using the traces from Fall 03 as training set, Spring 02 as validation set and Fall 01 as test set for the simulator. The basic characteristics of these sets are shown in Table VI.

Set | Flows | Packets |
---|---|---|
SSH Train. | | |
SSH Val. | | |
SSH Sim. | | |
HTTP Train. | | |
HTTP Val. | | |
HTTP Sim. | | |

We estimate the parameters through maximum likelihood estimation (MLE) and measure the goodness of fit using the square root of the JSD. The candidate distributions are: Exponential, Pareto, Log-Normal, Log-Logistic, and Weibull. Their pdfs can be seen in Table IV.
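A minimal sketch of the preprocessing and of the maximum likelihood fit for the Pareto candidate is given below. It assumes that the Pareto scale is fixed to the 10 ms lower bound, so that only the shape is estimated, and that a packet is merged into the previous one when it arrives less than 10 ms after the last kept packet. Both choices, as well as the function names, are illustrative assumptions.

```python
import math


def merge_subpackets(timestamps, min_gap=0.010):
    """Keep a packet only if it arrives >= min_gap seconds after the last kept one."""
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept


def ipds(timestamps):
    """Inter-packet delays of an ordered list of timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def pareto_mle_shape(samples, x_min=0.010):
    """MLE of the Pareto shape for a known scale x_min: n / sum(log(x / x_min))."""
    logs = [math.log(x / x_min) for x in samples if x >= x_min]
    total = sum(logs)
    if total <= 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    return len(logs) / total
```

For example, `pareto_mle_shape(ipds(merge_subpackets(ts)))` estimates the shape from a list `ts` of arrival times in seconds.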

Distribution | Error SSH | Par SSH | Error HTTP | Par HTTP |
---|---|---|---|---|
Exponential | | | | |
Pareto | | | | |
Log-Normal | | | | |
Log-Logistic | | | | |
Weibull | | | | |

Results shown in Table VII confirm the findings of Paxson et al., i.e., that the Pareto distribution is a better model for interactive traffic. In non-interactive traffic such as HTTP, this model also gives acceptable results. Therefore, we will assume that

(6) |

### IV-D Detector

Once we have a model for the IPD and PDV sequences, we derive the likelihood test.

If Cauchy distributed PDVs are assumed, the test chooses when

(7) |

and otherwise.

In the case that a Laplace model for PDV is adopted, then

(8) |
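The resulting detector can be sketched in the log domain as follows. This is an illustrative implementation under assumptions, not the paper's exact test: the PDV is taken to be zero-centered with a known scale, the IPD model is a Pareto pdf with known shape and a 10 ms scale, and detector IPDs below the Pareto lower bound are clamped to it so that the logarithm stays finite. All function and parameter names are hypothetical.

```python
import math


def log_cauchy(z, scale):
    """Log-pdf of a zero-centered Cauchy distribution."""
    return math.log(scale / math.pi) - math.log(scale ** 2 + z ** 2)


def log_laplace(z, scale):
    """Log-pdf of a zero-centered Laplace distribution."""
    return -math.log(2.0 * scale) - abs(z) / scale


def log_pareto(y, shape, x_min):
    """Log-pdf of a Pareto distribution; y is clamped to the support (assumption)."""
    y = max(y, x_min)
    return math.log(shape) + shape * math.log(x_min) - (shape + 1.0) * math.log(y)


def log_likelihood_ratio(x, y, pdv_scale, shape, x_min=0.010, log_pdv=log_cauchy):
    """Sum over samples of log f_J(y_i - x_i) - log f_X(y_i)."""
    return sum(log_pdv(yi - xi, pdv_scale) - log_pareto(yi, shape, x_min)
               for xi, yi in zip(x, y))


def flows_linked(x, y, log_eta, **model):
    """Decide that the flows are linked when the log-ratio exceeds log_eta."""
    return log_likelihood_ratio(x, y, **model) > log_eta
```

Passing `log_pdv=log_laplace` switches the PDV model. The threshold `log_eta` would be set from the desired false positive probability, for instance empirically from scores of unlinked flow pairs.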

## V Performance

In this section we construct a simulator and present the scenarios we use in the remainder of the paper. Afterwards, we test the model assumptions and measure the performance with different sequence lengths.

### V-A Simulator and Scenarios

Simulations are carried out in the following way. First, we generate timing information at the creator using the IPD test set, . The purpose of this sequence is to evaluate the performance when holds. A delay is added to each packet using the measured delays from the test set (as explained in the following paragraphs), obtaining . We generate a second sequence , using the IPD test set; this sequence has the purpose of evaluating the performance under . Finally, we use the Test from (7) or (8), to obtain both and . This experiment is repeated times, and for different values of we obtain as the rate of , and as the rate of . Note that due to the number of runs, cannot be measured and results of this order are not accurate.

The sequences are generated in the following way: we place all the IPDs from the test set in an order-preserving list. The starting point is randomly selected from the list and the generated IPDs are the following values.

For generating the delays, we used the test set as a list with the delay every 50 ms. We select one value randomly from the list that will be considered time 0 ms; the following values will represent the delay at times 50 ms, 100 ms, and so on. To obtain the delays at times where we do not have a measure, we use linear interpolation.
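The delay generation step can be sketched as below. The sketch assumes that the delay samples are given in seconds, that the first sample corresponds to time 0, and that a packet's arrival time at the detector is its creator time plus the interpolated delay; the function names are hypothetical, and the random choice of the starting sample is left to the caller.

```python
def delay_at(t, samples, step=0.050):
    """Linearly interpolate the network delay at time t (seconds).

    samples[k] is the measured delay at time k * step; at least two samples
    are required.
    """
    end = step * (len(samples) - 1)
    if t < 0.0 or t > end + 1e-9:
        raise ValueError("time outside the measured window")
    t = min(t, end)
    k = min(int(t // step), len(samples) - 2)  # index of the left neighbour
    frac = t / step - k                        # position between neighbours
    return samples[k] + frac * (samples[k + 1] - samples[k])


def apply_network_delay(timestamps, samples, step=0.050):
    """Add the interpolated delay to each packet, measuring time from the first one."""
    t0 = timestamps[0]
    return [t + delay_at(t - t0, samples, step) for t in timestamps]
```

For instance, with delays of 10 ms and 20 ms measured at 0 ms and 50 ms, a packet sent at 25 ms is assigned a delay of 15 ms.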

The performance is evaluated in the two scenarios depicted in Figures 2 and 3. Scenario A represents a stepping stone that forwards SSH traffic inside the Amazon Web Services [26] network. The creator, stepping stone and detector are EC2 instances located in Virginia, Oregon and California, respectively. This example corresponds to tracing the source of an attack that was launched from a compromised Amazon instance. The simulated delays correspond to those of Scenario 10 in Section IV-B, where the standard deviation of the network delay is 4 ms.

Scenario B simulates a web page accessed through the Tor network whose real origin is to be found, and where the creator is the web page and the detector is the client. For instance, this case can correspond to a company in whose forum an anonymous insulting post has been placed using Tor, and which wants to know whether the post comes from an employee within the company. The simulated delays correspond to the measurements of Scenario 11 in Table V, where the standard deviation of the network delay is 340 ms.

### V-B Impact of our assumptions

In this section, we wish to quantify the impact of the assumptions we have made, that is, the PDVs form an i.i.d. Cauchy or Laplace sequence. To this end, we extend our simulator to create 3 types of delays: first, according to the model (Cauchy or Laplace), second as a random sample from the data, and last, from the data maintaining the time correlation. is used for Scenario A and for Scenario B. Results are shown in Figures 4 and 5. We notice two details: first, that the Cauchy-based detector gives slightly better performance than the Laplace under real data, and second, that the independence of the PDVs previously assumed slightly reduces the performance. In the sequel, we just derive the expressions for a Cauchy-based detector. The modification for a Laplace detector is rather straightforward.

### V-C Performance dependence on

We want to evaluate how much performance is improved when longer sequences are used. The result is depicted in Figures 6 and 7. We can see that Scenario B, whose IPDs have a larger variance because of the Tor network, needs much longer sequences to achieve the same performance. For instance, with fixed , in Scenario A for we obtain . However, in Scenario B the needed for a comparable result is around , with which we obtain . If we compare AUCs, in Scenario A with we obtain while a similar result in Scenario B requires a value of between and .

## VI Robust detector

The previous test does not take the existence of any countermeasure into account. Attacks on timing correlation can be mounted by introducing uncorrelated random delays, adding chaff traffic or splitting the flow, making the Test in (7) ineffective. In this section, we build a test that is robust to these attacks. First, we deal with adding or removing packets from the flow, and then with random delays.

### VI-A Matching packets

Hitherto, we have assumed that there is a one-to-one relation between the flows at the creator and the detector; i.e., no packets are added or removed. This assumption is not necessarily valid for every situation, not only due to the presence of an active attacker, but also as a result of many applications that repacketize flows, changing the number of packets, for instance, SSH tunneling [27].

To deal with packet addition and removal, we first choose the most likely packet at the detector for each packet at the creator. In the case that there is no packet likely enough, we consider the creator packet as lost.

Given the th packet at the creator, we match it with the most likely th packet at the detector, denoting this as . Consequently, if is a synchronization constant to be discussed in Section VI-C, and is the threshold for which a packet is considered lost, the condition for a match in the th packet is

(9) |

and to avoid considering it lost,

(10) |

Threshold should be large enough so that the probability that a packet is wrongly considered lost is very small, for instance, . Although this can lead to incorrectly matching with another packet when the packet is indeed lost, the impact on Test (VI-B) of this mismatch is very small. Empirically, the best performance we obtained for Scenario A is when ms and when s for Scenario B.

In practice, the standard deviation of the network delay can be larger than some of the IPDs, especially in Scenario B, in which case the matching is likely to fail. The impact of these matching errors is evaluated in Section VI-E. In the case that most of the IPDs are smaller than the standard deviation of the network delay, a better matching function is the one used in [29]. This corresponds to the injective function that minimizes the mean square error between and , which has the drawback of a higher computational cost.

The matching process modifies the timing sequences to and , where , as the lost packets are removed. Formally, we can define the new sequences as , and .
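The matching step can be sketched as follows. The sketch assumes that the most likely detector packet is the one whose timestamp is closest to the creator timestamp plus the synchronization constant, which holds when the delay pdf is symmetric and unimodal around that constant. The names `match_packets`, `rho` and `max_err` are hypothetical, and `detector` must be sorted in ascending order.

```python
import bisect


def match_packets(creator, detector, rho, max_err):
    """Match each creator packet to the nearest detector packet around t + rho.

    Returns a list of (creator_index, detector_index) pairs. Creator packets
    whose best candidate is further than max_err are considered lost.
    """
    pairs = []
    for i, t in enumerate(creator):
        target = t + rho
        pos = bisect.bisect_left(detector, target)
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(detector)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(detector[k] - target))
        if abs(detector[j] - target) <= max_err:
            pairs.append((i, j))
    return pairs


def matched_ipds(creator, detector, pairs):
    """IPDs of the matched subsequences at the creator and at the detector."""
    xs = [creator[i2] - creator[i1] for (i1, _), (i2, _) in zip(pairs, pairs[1:])]
    ys = [detector[j2] - detector[j1] for (_, j1), (_, j2) in zip(pairs, pairs[1:])]
    return xs, ys
```

This simple rule does not enforce a one-to-one assignment, so two creator packets may select the same detector packet when the delay spread is comparable to the IPDs.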

### VI-B Test robust to chaff and flow splitting

From (7), we can obtain a test robust to packet removal and insertion as

(11) |

where is the probability that a packet at the creator cannot be matched at the detector. This can be due to three reasons: network loss with a probability , lack of matching when the packet appears, and flow splitting into subflows by the stepping stone, i.e., of the original packets are not seen by the detector, as only one of the subflows traverses this link. Therefore,

(12) |

### VI-C Self-Synchronization

We have mentioned that is a synchronization constant. The detector can perform detection maximizing the value of with respect to through an exhaustive search. For instance, Figure 8 shows a detector trying values of using steps of 1 ms in the interval s. We can see that the maximum occurs when , as expected. Recall that is the sample mean of the network delays.
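The exhaustive search can be sketched as a grid search. For brevity, the score maximized here is a stand-in for the test statistic: a Cauchy log-likelihood of the residual delays of a one-to-one flow, which is an assumption made only to keep the example self-contained. The names `sync_score` and `best_offset` are hypothetical.

```python
import math


def sync_score(creator, detector, rho, scale):
    """Toy score: Cauchy log-likelihood (up to a constant) of u_i - t_i - rho."""
    return sum(-math.log(scale ** 2 + (u - t - rho) ** 2)
               for t, u in zip(creator, detector))


def best_offset(creator, detector, lo, hi, step, scale):
    """Try every offset in [lo, hi] with the given step and keep the best one."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n_steps + 1)]
    return max(grid, key=lambda rho: sync_score(creator, detector, rho, scale))
```

With a step of 1 ms the search evaluates the score once per millisecond of the interval, so its cost grows linearly with the width of the interval.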

### VI-D Robust test against random delays

So far, the situation where an attacker can inject random delays has not been considered. Random delay injection is a well-known technique for covert channel prevention and can be easily implemented via buffering by attackers across their stepping stones.

We assume that the attacker has the constraint of not being able to delay any packet more than seconds. Hence, she can modify the PDV by a quantity that falls in the interval . As we do not know the distribution of the attacker’s random delay, the detector assumes a uniform distribution. Thus, the PDV at the decoder is , and

(13) |

Consequently, the likelihood ratio becomes

(14) |
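As a worked sketch of the convolution involved, assume a zero-median Cauchy PDV with scale $\gamma$ and an attacker perturbation that is uniform on a symmetric interval $[-a, a]$. The symbols $\gamma$ and $a$ are placeholders chosen for illustration and are not necessarily the paper's notation. The pdf of the sum of the two terms then has a closed form:

```latex
f(z) = \frac{1}{2a} \int_{z-a}^{z+a} \frac{\gamma}{\pi\,(\gamma^{2} + t^{2})}\, dt
     = \frac{1}{2\pi a}
       \left[ \arctan\!\left(\frac{z+a}{\gamma}\right)
            - \arctan\!\left(\frac{z-a}{\gamma}\right) \right]
```

As $a \to 0$ this expression tends to the Cauchy pdf itself, so a detector built on it reduces to the Cauchy-based one when no random delay is allowed.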

A game-theoretic approach to this problem is taken in [28], where for simplicity the detector is constrained to estimating and compensating the attack. The optimal detector for the same game is derived in [29], where it is shown that a nearly deterministic attack impairs the detector more than a uniform distribution even if the detector knows the attack distribution.

### VI-E Performance

To evaluate the proposed robust algorithms, the functionalities of adding chaff traffic, splitting the flow, and delaying the packets randomly are implemented in our simulator. This is done as follows: each packet is delayed by a certain quantity. We implement two different delay strategies: a) the value is picked from a uniform distribution in the range , and b) the values are taken to minimize (VI-B), i.e. the values are chosen by an intelligent adversary who knows both the test and its parameters. Then, the simulator adds traffic according to a Poisson process with a fixed rate proportional to the rate of the original traffic. Afterwards, it simulates the flow split, which is implemented by discarding packets as a Bernoulli process with a probability equal to . Recall that is the number of subflows we divide the flow into.
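The three traffic modifications with the uniform delay strategy can be sketched as follows. The sketch assumes that chaff arrivals are spread over the duration of the original flow, that `chaff_rate` is given in packets per second (so 500% chaff corresponds to five times the original packet rate), and that splitting into `subflows` keeps each packet with probability `1 / subflows`. These choices and the function name `apply_attack` are illustrative.

```python
import random


def apply_attack(timestamps, max_delay=0.0, chaff_rate=0.0, subflows=1, rng=None):
    """Delay packets, add Poisson chaff, then drop packets to emulate a flow split."""
    rng = rng or random.Random()
    # Strategy a): each packet is delayed by a uniform amount in [0, max_delay].
    out = [t + rng.uniform(0.0, max_delay) for t in timestamps]
    # Chaff: Poisson arrivals (exponential gaps) over the span of the flow.
    if chaff_rate > 0.0 and timestamps:
        t, end = min(timestamps), max(timestamps)
        while True:
            t += rng.expovariate(chaff_rate)
            if t > end:
                break
            out.append(t)
    # Split: each packet survives independently with probability 1 / subflows.
    out = [t for t in out if rng.random() < 1.0 / subflows]
    return sorted(out)
```

Passing a seeded generator, for example `rng=random.Random(1)`, makes a simulation run repeatable.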

We created five different attacks. In the first three, we evaluate each traffic modification strategy separately, namely, Attack 1 adds 500% of chaff traffic; Attack 2 splits the flow into 4 subflows; Attack 3 adds delays with ms; Attack 4 combines 500% of chaff traffic with delays constrained to ms, and Attack 5 is a complex attack where a combination of Attack 4 with splitting the flow into 2 subflows takes place. For Attacks 3 to 5, we consider the two delay strategies specified above: with indicating the attack number, we denote by the case where the delays are chosen randomly, and by where they are chosen by an intelligent attacker. We simulate these situations using sequences of length in Scenario A and in Scenario B. Results are depicted in Figures 9 and 10.

Comparing these figures under no attacks with the corresponding plots for the case of no mismatches in Figs. 6 and 7, we can evaluate the impact of mismatched packets, as the AUC drops from to in Scenario A and from to in Scenario B.

In low jitter situations, namely Scenario A, chaff traffic by itself has little impact, but the effect when combined with random delays is significantly increased. The reason behind this is that in the first case the matching process chooses the real packets with a very low probability of error but when a random delay is added the probability of a mismatch increases. We also see that the flow splitting attack has a considerable impact as the received sequence length is reduced.

In high jitter situations, i.e. Scenario B, random delays have considerably smaller influence, because the standard deviation of the network delay is larger than the attack delay. In fact, due to the high network-delay variability, chaff traffic alone has a significant impact on performance without the need of an attacker injecting random delays.

## VII Comparison with an active watermark

We want to analyze how much performance can be improved by sacrificing undetectability. For this purpose, we create an active watermark designed with invisibility as a goal, and we study the trade-off between performance and detectability.

We measure the latter as the KLD between the covertext, i.e., the sequence without a watermark, and the stegotext, i.e., the watermarked sequence. Cachin [30] defines a stegosystem to be -secure against passive adversaries if , where is the distribution of the covertext and is the distribution of the stegotext. Hence, we measure the detectability as the minimum for which our system is -secure.
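As an illustration of how such a divergence could be estimated from samples, the sketch below computes a histogram-based estimate of the KLD between two empirical sequences. It is not the estimator used in the paper: the function name `kld_histogram`, the number of bins, the add-one smoothing and the base-2 logarithm are all assumptions made for the example.

```python
import math
import random


def kld_histogram(cover, stego, bins=50):
    """Estimate D(cover || stego) in bits from two lists of samples."""
    lo = min(min(cover), min(stego))
    hi = max(max(cover), max(stego))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(samples):
        # Add-one smoothing keeps every bin non-empty and avoids log(0).
        counts = [1.0] * bins
        for x in samples:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1.0
        total = sum(counts)
        return [c / total for c in counts]

    p, q = hist(cover), hist(stego)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


# Usage example with synthetic data (not measurements from the paper).
rng = random.Random(0)
cover = [rng.gauss(0.0, 1.0) for _ in range(5000)]
stego = [rng.gauss(1.0, 1.0) for _ in range(5000)]
print(kld_histogram(cover, cover))  # identical samples give 0.0
print(kld_histogram(cover, stego))  # clearly separated samples give a positive value
```

A smaller estimate suggests that the watermarked sequence is harder to distinguish from the original one. The estimate depends on the binning, so it should be read as indicative only.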

The watermark is embedded by adding a uniformly distributed random delay between . Thus, the watermarked flow is , where is the embedded watermark, which follows a triangular distribution between , as it is the difference of two uniformly distributed delays. At the detector, we receive . The detector remains (VI-D) using instead of .
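The triangular shape of the embedded watermark can be checked numerically. The following sketch assumes that each packet is delayed by an independent value drawn uniformly from `[0, amplitude]` (the exact range is an assumption of this example), so that the change in each IPD is the difference of two consecutive delays. The function name `embed_watermark` is hypothetical.

```python
import random


def embed_watermark(timestamps, amplitude, rng=None):
    """Delay each packet by U(0, amplitude); return marked times and IPD watermark.

    Assumes amplitude is smaller than the smallest IPD, so that the
    packet order is preserved.
    """
    rng = rng or random.Random()
    delays = [rng.uniform(0.0, amplitude) for _ in timestamps]
    marked = [t + d for t, d in zip(timestamps, delays)]
    # The change in the i-th IPD is the difference of two consecutive delays.
    watermark = [delays[i] - delays[i - 1] for i in range(1, len(delays))]
    return marked, watermark


# Usage example: packets every 50 ms, amplitude of 5 ms (illustrative values).
ts = [i * 0.05 for i in range(10001)]
marked, w = embed_watermark(ts, 0.005, random.Random(3))
variance = sum(x * x for x in w) / len(w)
print(max(abs(x) for x in w) <= 0.005)  # support is [-amplitude, amplitude]
print(variance, 0.005 ** 2 / 6)         # triangular variance is amplitude^2 / 6
```

The empirical variance should be close to `amplitude**2 / 6`, the variance of a symmetric triangular distribution on `[-amplitude, amplitude]`.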

We assume that the attacker knows the original traffic as done in [12, 14] and wants to test for the existence of a watermark. Therefore, the attacker’s goal is to differentiate between and .

We simulate Scenario A with and Scenario B with under no traffic modification, where we evaluate the trade-off between the detectability and when is fixed. Results are depicted in Figures 11 and 12. Watermarking schemes give a significant improvement under low-jitter conditions even with ms (cf. ), but this improvement is significantly lower under large-jitter conditions, e.g. the Tor network, even for very large , for instance, for ms (cf. ).

## VIII Comparison with other schemes

We want to compare our passive analysis with four other state-of-the-art traffic watermarking schemes: IB [9], ICB [10], RAINBOW [8] and SWIRL [3]. To this end, we extend our simulator to be able to embed the mentioned watermarks and to detect them.

The presented results have been obtained with the following parameters: IB, ICB and SWIRL use a time interval of 500 ms; this is the value used in the original ICB experiments reported in [10]. The experiments for SWIRL in [3] use 2 s, but with short sequences this implies that many flows cannot be watermarked, as the whole flow falls into one interval. We compensate for this shorter interval by dividing it into fewer subintervals (5 instead of 20). In our experiments RAINBOW can modify the IPD by up to 20 ms, which is the largest watermark amplitude used in the simulations in [8].

We first compare the performance in both scenarios when the flows do not suffer any addition or removal of packets; for this we use (7). We take in Scenario A and in Scenario B. Figure 13 shows the results for Scenario A, where our scheme and RAINBOW outperform the rest by a significant amount. This is due to the fact that both are non-blind and perform well with short sequences if the PDV has small variance. The other watermarking schemes do not perform well with short sequences. Figure 14 shows the results in Scenario B. We see that, with longer sequences, IB and ICB improve their performance despite the larger PDV.

We also compare the performance under traffic modification using Attack 5a, i.e., 500% chaff traffic added, and random delays with ms. As before, we fix in Scenario A and in Scenario B. Results are shown in Figures 15 and 16. Note that neither RAINBOW nor SWIRL is designed to be robust against an active attacker.

Our algorithm is more robust to the considered traffic modifications than the other schemes. For example, in Scenario B, we achieve , while IB achieves , ICB , and for both RAINBOW and SWIRL . Recall also that we do not modify the flow, while the rest do.

Our scheme performs better than RAINBOW, which is also a non-blind scheme, even though ours does not modify the IPDs. The improvement in performance is due to using a likelihood-ratio test, which is optimal, instead of normalized correlation. Recall also that the IPDs have been restricted to be larger than 10 ms. Lifting this restriction would have a bigger impact on passive analysis than on a watermarking scheme.

## IX Real Implementation

Simulations are not fully realistic. To check whether the simulator results are applicable to real networks, we carry out a real implementation of the proposed passive analysis scheme, the watermark modification proposed in Section VII and the watermarking schemes with which we compared in Section VIII, for Scenarios A and B.

For the first experiment, we launched three EC2 [26] instances. We used replayed SSH connections from real traces taken at the University of Vigo, and the stepping stone was created by forwarding the traffic with the socat command. For the second experiment, we replay connections from real HTTP traces, also from the University of Vigo. We use 6 packets () and 51 () for Scenarios A and B, respectively. The experiment is repeated 1000 times in each case. In order to obtain values of the test under , we use the saved timing information from the previous sequence in the non-blind cases, i.e., our proposed method and RAINBOW, and for the blind cases, i.e., IB, ICB and SWIRL, we use a different random key.

The chosen parameters are a maximum IPD variation for RAINBOW and for the watermark modification of 5 ms in Scenario A and 20 ms in Scenario B, which are the middle and maximum amplitudes in the experiments presented in [8]. For the blind watermarks, IB and ICB use an interval size of 500 ms, and SWIRL uses an interval length of 250 ms and 1000 ms for Scenarios A and B, respectively, divided into 5 subintervals of 3 slots each. These values have been chosen to maximize the AUC in each scenario.

Experiments are carried out in a non-active-attack scenario; that is, insertions and losses are only due to repacketization. As the detector from (VI-B) needs a value for , we use from Table II.
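Given the test values collected over the repeated experiments for linked and unlinked flows, the AUC can be computed directly from the two sets of scores. The sketch below uses the rank-based (Mann-Whitney) formulation. It assumes that larger test values favour the hypothesis that the flows are linked, and the function name `auc` is hypothetical rather than part of our implementation.

```python
def auc(scores_linked, scores_unlinked):
    """Probability that a linked-flow score exceeds an unlinked-flow score.

    Ties count as one half. The cost is quadratic in the number of
    scores, which is acceptable for about 1000 repetitions per case.
    """
    wins = 0.0
    for s1 in scores_linked:
        for s0 in scores_unlinked:
            if s1 > s0:
                wins += 1.0
            elif s1 == s0:
                wins += 0.5
    return wins / (len(scores_linked) * len(scores_unlinked))


# Usage example with toy scores (not results from the experiments).
print(auc([3.0, 4.0], [1.0, 2.0]))  # perfectly separated scores: 1.0
print(auc([1.0, 2.0], [1.0, 2.0]))  # indistinguishable scores: 0.5
```

An AUC of 1 corresponds to a detector that separates the two hypotheses perfectly, while 0.5 corresponds to random guessing.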

Results in Scenario A are similar to the simulator results: for Real Scenario vs for the Simulator. However, Scenario B shows a decrease in performance for the Real Scenario compared to the simulator results. This loss of performance affects all schemes, although it is less severe for ours.

## X Conclusions

Network flow watermarks are becoming increasingly popular in traffic analysis owing to their improved performance as compared to passive analysis. Unfortunately, the ease with which these watermarks can be exposed has revealed itself as the Achilles’ heel of these techniques and can lead to a traffic modification attack in which the watermark is finally removed. In this paper we have presented a highly-optimized traffic analysis method for deciding if two flows are linked that can be used as passive analysis, as well as a watermarking scheme.

With performance in mind, we develop an optimal decoder, i.e. a likelihood-ratio test, that achieves very good performance under a passive analysis scheme. For example, with 21 packets separated by at least 10 ms we can correlate two flows obtaining given a false alarm probability equal to without flow modifications.

A more robust detector is created that can deal with chaff traffic, flow splitting and random delays added by an attacker. To this end, packet matching is carried out by removing the packets that do not have a counterpart in the other flow. Then, a new likelihood-ratio test that considers losses and the maximum delay that an attacker can add is derived.

Afterwards, we study the trade-off between performance improvement and detectability in a watermarking scheme based on our algorithm. We also show a comparison with four state-of-the-art traffic watermarking schemes. Finally, a real implementation is carried out to show that the simulator results can be extended to real networks.

The obtained results show that passive analysis schemes with an optimal detector can compete with and outperform state-of-the-art traffic watermarking schemes, giving the advantages of being undetectable, which decreases the risk of a traffic modification attack, and that they can be carried out ex-post, in addition to in real-time, allowing them to be used in forensic analysis applications as well as in intrusion detection.

## Acknowledgment

The authors would like to thank Dr. Negar Kiyavash for her insightful and helpful comments. This work was supported in part by Iberdrola Foundation through the Prince of Asturias Endowed Chair in Information Science and Related Technologies.

## References

- [1] S. Staniford-Chen and L. Heberlein, “Holding intruders accountable on the internet,” in Security and Privacy, 1995. Proceedings., 1995 IEEE Symposium on, may 1995, pp. 39 –49.
- [2] P. Syverson, G. Tsudik, M. Reed, and C. Landwehr, “Towards an analysis of onion routing security,” in Designing Privacy Enhancing Technologies, ser. Lecture Notes in Computer Science, H. Federrath, Ed. Springer Berlin / Heidelberg, 2001, vol. 2009, pp. 96–114.
- [3] A. Houmansadr and N. Borisov, “SWIRL: A scalable watermark to detect correlated network flows,” in NDSS, 2011.
- [4] Y. Zhang and V. Paxson, “Detecting stepping stones,” in Proceedings of the 9th USENIX Security Symposium, 2000, pp. 171–184.
- [5] D. L. Donoho, A. G. Flesia, U. Shankar, V. Paxson, J. Coit, and S. Staniford, “Multiscale stepping-stone detection: detecting pairs of jittered interactive streams by exploiting maximum tolerable delay,” in Proceedings of the 5th international conference on Recent advances in intrusion detection, ser. RAID’02. Berlin, Heidelberg: Springer-Verlag, 2002, pp. 17–35.
- [6] A. Blum, D. Song, and S. Venkataraman, “Detection of interactive stepping stones: Algorithms and confidence bounds,” in Recent Advances in Intrusion Detection, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2004, vol. 3224, pp. 258–277.
- [7] X. Wang and D. S. Reeves, “Robust correlation of encrypted attack traffic through stepping stones by manipulation of interpacket delays,” in Proceedings of the 10th ACM conference on Computer and communications security, ser. CCS ’03. New York, NY, USA: ACM, 2003, pp. 20–29.
- [8] A. Houmansadr, N. Kiyavash, and N. Borisov, “RAINBOW: A robust and invisible Non-Blind watermark for network flows,” in Network and Distributed Systems Security Symposium. Internet Society, Feb. 2009.
- [9] Y. J. Pyun, Y. Park, D. S. Reeves, X. Wang, and P. Ning, “Interval-based flow watermarking for tracing interactive traffic,” Computer Networks, vol. 56, no. 5, pp. 1646 – 1665, 2012.
- [10] X. Wang, S. Chen, and S. Jajodia, “Network flow watermarking attack on low-latency anonymous communication systems,” in Security and Privacy, 2007. SP ’07. IEEE Symposium on, may 2007, pp. 116 –130.
- [11] X. Luo, P. Zhou, J. Zhang, R. Perdisci, W. Lee, and R. K. C. Chang, “Exposing invisible timing-based traffic watermarks with BACKLIT,” in Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011, pp. 197–206.
- [12] Z. Lin and N. Hopper, “New attacks on timing-based network flow watermarks,” in USENIX Security Symposium. Bellevue,WA: USENIX Association, Aug. 2012.
- [13] W. Yu, X. Fu, S. Graham, D. Xuan, and W. Zhao, “DSSS-based flow marking technique for invisible traceback,” in Security and Privacy, 2007. SP ’07. IEEE Symposium on, may 2007, pp. 18 –32.
- [14] P. Peng, P. Ning, and D. Reeves, “On the secrecy of timing-based active watermarking trace-back techniques,” in Security and Privacy, 2006 IEEE Symposium on, may 2006, pp. 15 pp. –349.
- [15] A. Houmansadr, N. Kiyavash, and N. Borisov, “Multi-flow attack resistant watermarks for network flows,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, april 2009, pp. 1497 –1500.
- [16] J. Neyman and E. S. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London Series A Containing Papers of a Mathematical or Physical Character, vol. 231, no. 694-706, pp. 289–337, 1933.
- [17] J. A. Elices and F. Pérez-González, “Measures to model delays on internet,” http://www.unm.edu/~elices/captures.html, Jan. 2013.
- [18] R. Dingledine, N. Mathewson, and P. Syverson, “Tor: the second-generation onion router,” in Proceedings of the 13th conference on USENIX Security Symposium - Volume 13, ser. SSYM’04. Berkeley, CA, USA: USENIX Association, 2004, pp. 21–21.
- [19] ISO, ISO 3166-2:1998 Codes for the representation of names of countries and their subdivisions — Part 2: Country subdivision code, 1998.
- [20] D. J. Olive, Applied Robust Statistics. Southern Illinois University, 2008. [Online]. Available: http://www.math.siu.edu/olive/ol-bookp.htm
- [21] D. Endres and J. Schindelin, “A new metric for probability distributions,” Information Theory, IEEE Transactions on, vol. 49, no. 7, pp. 1858 – 1860, july 2003.
- [22] L. Rizo-Dominguez, D. Munoz-Rodriguez, D. Torres-Roman, and C. Vargas-Rosales, “Packet variation delay distribution discrimination based on kullback-leibler divergence,” in Communications (LATINCOM), 2010 IEEE Latin-American Conference on, sept. 2010, pp. 1 –4.
- [23] L. Kleinrock, Queueing Systems. Wiley Interscience, 1975, vol. I: Theory.
- [24] V. Paxson and S. Floyd, “Wide area traffic: the failure of poisson modeling,” IEEE/ACM Trans. Netw., vol. 3, no. 3, pp. 226–244, Jun. 1995.
- [25] D. Kotz, T. Henderson, I. Abyzov, and J. Yeo, “CRAWDAD trace set dartmouth/campus/tcpdump (v. 2004-11-09),” http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump, Nov. 2004.
- [26] Amazon Inc., “Amazon elastic compute cloud (amazon ec2),” http://aws.amazon.com/ec2/.
- [27] T. Ylonen and C. Lonvick, “The Secure Shell (SSH) Protocol Architecture,” RFC 4251 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4251.txt
- [28] J. A. Elices and F. Pérez-González, “Linking correlated network flows through packet timing: a game-theoretic approach,” 2013, submitted available: http://arxiv.org/abs/1307.3136.
- [29] ——, “The flow fingerprinting game,” 2013, submitted available: http://arxiv.org/abs/1307.3341.
- [30] C. Cachin, “An information-theoretic model for steganography,” in Information Hiding. Springer, 1998, pp. 306–318.