Measuring Privacy Leakage for IDS Rules
This paper proposes a measurement approach for estimating the privacy leakage from Intrusion Detection System (IDS) alarms. Quantitative information flow analysis is used to build a theoretical model of privacy leakage from IDS rules, based on information entropy. This theoretical model is subsequently verified empirically, both through simulations and in an experimental study. The analysis shows that the metric is able to distinguish between IDS rules that have no or low expected privacy leakage and IDS rules with a significant risk of leaking sensitive information, for example on user behaviour. The analysis is based on measurements of the number of IDS alarms, the data length and the data entropy for relevant parts of IDS rules (for example payload). This is a promising approach that opens the way for privacy benchmarking of Managed Security Service providers.
Intrusion detection, privacy leakage, entropy, metrics
The objective of this paper is to develop an entropy-based metric that can be used for privacy leakage detection in intrusion detection system (IDS) alarms. The approach should be able to identify IDS rules that according to stakeholders’ perception have a significant potential for leaking private or confidential information. It should also identify the worst IDS rules from a privacy or confidentiality perspective based on indicators that can be calculated automatically. For example IDS rules that:
have a significant risk of leaking information that is sensitive (privacy sensitive, security sensitive, business sensitive etc.);
have an unclear or too simple definition of the attack detecting pattern, and therefore may trigger unnecessarily, in the worst case on person sensitive or confidential information.
Privacy policies can be used to define what information is sensitive. Examples of sensitive information may be certain IP ranges of classified systems or sampled payload that may reveal private or confidential information. Information can also be defined as person sensitive by law, for example the sampled payload from a health institution, which may contain person sensitive information. Another example is critical infrastructures, where the data traffic may contain security sensitive or confidential information about the processes being controlled. Last, but not least, payment databases handling financial transactions may reveal sensitive information like credit card numbers.
In these cases, the information is by definition sensitive, which means that any leakage of information that can be identified may be problematic. For such use cases, an objective information leakage metric will be sufficient to identify problematic leakage of private or confidential information.
In other cases, the privacy sensitivity will be subjective, and can only be evaluated in a representative way by the owners of the data being sampled - the users themselves. It may even in this case be possible for the data controller to get realistic estimates of the perceived privacy sensitivity by asking a representative random set of users, for example using a random poll on the service being used, about how they would value privacy leakages. However this approach will be expensive and does not scale well. It is therefore only viable for smaller evaluations of privacy impact.
It is therefore assumed possible for an authority like the data controller, who oversees the privacy interests, to estimate the privacy impact that an identified information leakage causes. The privacy impact could for example be the subjective value or expected liability from privacy or confidentiality breaches, as proposed by [gritzalis_probabilistic_2007]. The privacy leakage for a given IDS rule can then be defined as the product of the information leakage metric and the privacy impact, i.e. privacy leakage = information leakage × privacy impact. However, if investigation shows that the information leakage is caused by activities from attack vectors that do not cause any risk of revealing private, business sensitive or confidential information, then the privacy impact for a given IDS rule may be set low or even to zero. The combined metric can be regarded as a privacy leakage risk metric that can be used to measure and perform incremental improvements of the Managed Security Service (MSS) operation from a privacy perspective.
Current IDSs typically provide an all or nothing solution for handling private or confidential information in the alarms. The payload of the alarms is either sent in cleartext or pseudonymised, for example by only sending references to where more information can be found in a data forensics system. Neither fine-grained management nor measurement of sensitive information flows exists in such systems. It is in particular common that Open Source based IDSs like Snort, OSSEC or Prelude send payload in cleartext in the IDS alarms. Having a metric for how privacy invasive an MSS operation is will therefore be useful to benchmark the performance of different MSS providers from a privacy perspective. It will also be useful for tuning the IDS rulesets and for implementing anonymisation policies to reduce the privacy impact of the monitoring. Intuitively, such a privacy leakage model relates to the perceived preciseness of the IDS rule, i.e. how good it is at detecting only attack traffic without revealing non-attack traffic.
A promising candidate for a privacy leakage metric for IDS rules, is data entropy. This is a privacy leakage metric that is based on the variability of the underlying data. Examples of such metrics are Shannon-, Rényi or Min-entropy, which previously have been proposed as anonymity metrics [shannon_1948, clau_structuring_2006]. Entropy can also be used to measure coding efficiency, for example whether sampled payload excerpts most likely are encrypted or compressed [shannon_1948]. This paper investigates a model of privacy leakage from IDS rules that is based on the variation in entropy between IDS alarms. This is to the best of our knowledge the first comprehensive privacy leakage model for IDS rules based on quantitative measurements of information flow founded in information theory.
The proposed privacy leakage metric has several practical applications. First, it can be used to identify imprecise IDS rules, since such rules typically will have more variation in the underlying data, and therefore also a larger variance in entropy than more precise IDS rules. Furthermore, an advantage of the proposed metric is that it can detect two common ways of preserving privacy or data confidentiality: encryption and anonymisation. Both encrypted and anonymised information can be expected to have zero entropy variance, given sufficiently long input. On the other hand, the entropy variance of plaintext data will be significantly larger than for encrypted data, as will be discussed in Section V-C.
This means that the entropy variance can be used as a metric to detect leakage of private or confidential information in message oriented data streams in general and IDS alarms in particular. It can also be used to verify whether an anonymisation/pseudonymisation or encryption scheme works as intended.
This paper is organised as follows: Section II discusses the motivation behind introducing an entropy variance based information leakage metric, based on existing knowledge of how common attack vectors work. Section III describes the threat model and scenario that is assumed when using the privacy leakage metric. Section IV develops the entropy-based privacy leakage model based on quantitative information flow analysis after introducing the necessary theoretical background information. The last part of Section IV discusses how clustering based on the Expectation Maximisation algorithm can be used to identify the underlying attack vectors for IDS rules that detect more than one attack vector. Section V does a detailed analysis of the convergence speed as a function of amount of input data for the entropy algorithms and symbol definitions considered. This includes analysing the metrics’ abilities to distinguish between plaintext and encrypted data. The subsequent sections analyse experimental results based on realistic measurements of IDS alarms, discuss related works, conclude the paper and suggest future work and research opportunities.
A precise IDS rule will in many cases report only one or a few different attack patterns corresponding to real attack vectors, as will be discussed below. One common type of attack vector that follows this behaviour, is stack or heap buffer overflow attacks [vallentin_evolution_2007]. These attack vectors frequently use large sequences of characters corresponding to the NOP operation or similar to increase the probability of successfully exploiting buffer overflow vulnerabilities. The attacker does then not need to know the exact memory location of injected shellcode, since returning to any address within the NOP sled will cause the shellcode to be executed. This makes it simpler for the adversary to exploit such vulnerabilities. The entropy of this NOP sled will be zero, and variance zero, as long as only NOP operations are being used in the sled and the attack vector does not mutate (e.g. by changing the length of the NOP sled). This is clearly distinguishable from ordinary traffic, and also easy to distinguish for rule-based IDSs.
Such naive attacks are however not so common nowadays, because the IDS and anti-virus technologies easily can detect such anomalies in the input. It is therefore increasingly common that the adversaries obfuscate the attack vector. Obfuscation of the NOP sled can for example be done using metamorphic coding, which means that instructions in the sled are substituted with other instructions that effectively perform the same function [jordan_writing_2005]. Furthermore, it is now common practice that also the shellcode of the attack is being obfuscated by using encryption techniques. This means that the attack after the NOP sled contains a small decryption program, with a decryption key that decrypts the obfuscated shellcode before it is being run [song_infeasibility_2007]. Even the decryption program can be hidden by using metamorphic coding techniques [song_infeasibility_2007], although this is still not very common [polychronakis_empirical_2009].
Obfuscated attack vectors can therefore be expected to have quite high entropy, in some cases indistinguishable from encrypted traffic [song_infeasibility_2007, goubault-larrecq_detecting_2006]. The variation in entropy can consequently be expected to go towards zero for a sufficiently large data sample from a polymorphic attack vector, given that it is indistinguishable from a perfect encryption scheme; such an attack vector will behave like uniformly random data. Sufficiently large attack vector samples from both traditional NOP sled based attacks and modern obfuscated attacks can thus be expected to have low entropy variance.
It can furthermore be observed that samples of encrypted user traffic, assuming that strong encryption is used, do not in themselves leak any private or confidential information, and can be expected to have low entropy variance. Ordinary non-encrypted user traffic can however be expected to show a significant variance in entropy between different samples, as illustrated in the figure on differences in bit entropy. This indicates that entropy variance may be an interesting metric for measuring whether IDS alarms leak information, in particular for buffer overflow type attacks. However, this metric does not understand the semantics of the data traffic, and can therefore not be used to evaluate whether the leaked information is private or confidential.
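The contrast described above can be sketched numerically. The following is a minimal illustration, not the measurement code used in the paper: it assumes byte-level Shannon entropy normalised to the range 0 to 1, and the payload samples (`nop_sled`, `random_like`, `plaintext`) are made-up stand-ins for a NOP sled, encrypted or obfuscated data, and ordinary cleartext traffic.

```python
import math
import random
import statistics
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, normalised to the range 0..1."""
    if not data:
        return 0.0
    n = len(data)
    bits = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
    return abs(bits) / 8.0  # abs() only turns a possible -0.0 into 0.0


def entropy_std(samples) -> float:
    """Sample standard deviation of the per-sample entropies."""
    return statistics.stdev([byte_entropy(s) for s in samples])


rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Hypothetical payload excerpts, made up for illustration only.
nop_sled = [b"\x90" * 256 for _ in range(20)]
random_like = [bytes(rng.getrandbits(8) for _ in range(4096)) for _ in range(20)]
plaintext = [
    b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n",
    b"user=alice&password=secret&remember=1",
    b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    b"The quick brown fox jumps over the lazy dog.",
    b"0000000000000000-1111111111111111",
]

for name, samples in (("NOP sled", nop_sled),
                      ("random-like", random_like),
                      ("plaintext", plaintext)):
    print(f"{name:12s} entropy std = {entropy_std(samples):.4f}")
```

In this sketch the constant NOP sled and the random-like samples both show a standard deviation close to zero, whereas the cleartext samples vary noticeably, which is the property the metric relies on.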
Attacks are to a great extent automated and performed by large botnets of compromised hosts.
Attack vectors do typically not yet mutate or change dynamically (although proof-of-concept polymorphic self-mutating worms have been demonstrated [kolesnikov_advanced_2005]). This means that multiple attacks by a given host being controlled by an adversary typically have the same payload. Different hosts running the same version of a given malware can also be expected to typically have the same payload [polychronakis_empirical_2009].
Attack vectors are modular programs that are improved incrementally, which means that not all parts of a malware will change at the same time, and some parts of malware code are even shared between different malware families [polychronakis_empirical_2009].
Botherders, that manage large botnets of compromised hosts, will also have a self interest in a “well managed” botnet. This means that the malware of a botnet at regular intervals will be upgraded to include patches and new functionalities, amongst others to avoid being detected by Anti-Virus and IDS [freiling_computer_2005]. It is therefore reasonable to believe that a large amount of the machines in a given botnet will run the same version of the malware and therefore also will use the same arsenal of attack vectors for attacking other hosts.
III Threat Model
The paper assumes that intrusion detection services have been outsourced to a third party Managed Security Service (MSS) provider. Security monitoring is furthermore subdivided into two different security levels: an outsourced first-line service that performs 24x7 monitoring of the computer networks, and a trusted second-line service that has full knowledge of the IDS service, including capabilities to perform data forensic analysis. It is assumed that the MSS provider operates using a privacy-enhanced IDS, so that changes to the IDS ruleset must be agreed upon by both the data controller and the second-line security analyst responsible for updating the IDS ruleset, to avoid the deployment of excessively privacy violating IDS rules.
This paper mainly focuses on two adversaries: external adversaries and insiders. An external adversary may want to manipulate the privacy metrics, for example to reduce the chance of attacks being detected. The IDS ruleset is assumed public, so that an external adversary can investigate how the IDS rules work in order to perform targeted attacks on either privacy or security. However, the external adversary will not know which IDS rules are enabled.
Insiders are divided into two main groups. First-line security analysts are considered untrusted insiders, that only have limited authorisation to see information and no authorisation to modify information related to the IDS configuration. They do not have access to the data forensic tool to investigate attacks in detail. Second-line analysts are considered a trusted CERT team, that has authorisation to perform security investigations and reconfigure the IDS. A third actor is the data controller, who shares the responsibility for managing the IDS ruleset with the security officer, to ensure that both the privacy and security objectives are being considered. The paper furthermore assumes that suitable enforcement mechanisms exist, for example anonymisation or pseudonymisation schemes for sensitive information in IDS alarms, so that the privacy leakage metrics can be used for verification of the security or privacy policies.
IV A Privacy Leakage Model of IDS Rules
This section will first provide an information theoretic analysis of privacy leakage from IDS alarms, assuming a simple model of a perfect IDS rule that does not have any false alarms. This model is subsequently generalised to handle IDS rules that may leak potentially sensitive information, and we then show how this model corresponds to measuring the standard deviation of entropy from the IDS rule. It is finally shown how to measure the privacy leakage from IDS rules that detect more than one attack vector.
IV-A Basic Definitions
The definitions and notation in this section give a short introduction to quantitative information flow analysis, and are based on [smith_foundations_2009]. It is throughout this paper assumed that the logarithm is taken to the base 2, i.e. $\log x$ means $\log_2 x$. Shannon and Min-entropy can be considered instances of the more general Rényi entropy [renyi_1961], and we therefore use the Rényi notation to describe the entropies. Any Rényi entropy metric is denoted as $H_\alpha$, where $\alpha$ is the entropy degree; $\alpha = 1$ represents Shannon entropy and $\alpha = \infty$ represents Min-entropy. Given an IDS rule $r$, which may leak sensitive information from a set of input data $\mathcal{X}$ to a set of IDS alarms $\mathcal{Y}$, the objective is then to measure how much information $r$ leaks.
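Let $X$ and $Y$ be random variables whose sets of possible values are $\mathcal{X}$ and $\mathcal{Y}$ respectively. The Shannon entropy $H_1(X)$ is then defined by [shannon_1948]:
$$H_1(X) = -\sum_{x \in \mathcal{X}} P[X = x] \log P[X = x]$$
Shannon entropy indicates the number of bits that are required to transfer $X$ in an optimal way. The conditional entropy denoted as $H_1(X \mid Y)$ indicates the expected resulting entropy from input data $X$ given a set of IDS alarms $Y$ that pass through the IDS rule [smith_foundations_2009]:
$$H_1(X \mid Y) = \sum_{y \in \mathcal{Y}} P[Y = y] \, H_1(X \mid Y = y)$$
Min-entropy is another entropy metric that is calculated based on the worst case (maximum) symbol occurrence probability, defined as the vulnerability $V(X)$ that an adversary can guess the value of $X$ correctly in one try [smith_foundations_2009]:
$$V(X) = \max_{x \in \mathcal{X}} P[X = x]$$
Min-entropy indicates the number of bits required to store $X$, and is defined as [smith_foundations_2009]:
$$H_\infty(X) = \log \frac{1}{V(X)}$$
The conditional min-entropy can be defined as [smith_foundations_2009]:
$$H_\infty(X \mid Y) = \log \frac{1}{V(X \mid Y)}, \qquad V(X \mid Y) = \sum_{y \in \mathcal{Y}} P[Y = y] \max_{x \in \mathcal{X}} P[X = x \mid Y = y]$$
It is then possible to define the information leakage from $X$ to $Y$ using either Shannon or Min-entropy as proposed by [smith_foundations_2009]:
$$\mathcal{L}_\alpha = H_\alpha(X) - H_\alpha(X \mid Y), \qquad \alpha \in \{1, \infty\}$$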
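The definitions above can be illustrated with a small numerical sketch. It assumes the standard formulations from quantitative information flow analysis (leakage as initial uncertainty minus remaining uncertainty); the function names and the toy channel, in which a two-bit secret is observed only through its parity, are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict


def shannon(dist) -> float:
    """Shannon entropy (bits) of a {value: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def min_entropy(dist) -> float:
    """Min-entropy (bits): minus log2 of the one-guess vulnerability."""
    return -math.log2(max(dist.values()))


def leakage(joint):
    """Return (Shannon leakage, min-entropy leakage) for a joint
    distribution given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    # Conditional Shannon entropy: sum over y of P[y] * H(X | Y = y).
    h_cond = 0.0
    for y, p_y in py.items():
        cond = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
        h_cond += p_y * shannon(cond)
    # Conditional vulnerability: sum over y of max over x of P[x, y].
    v_cond = sum(max(p for (_, yy), p in joint.items() if yy == y) for y in py)
    h_inf_cond = -math.log2(v_cond)
    return shannon(px) - h_cond, min_entropy(px) - h_inf_cond


# Toy channel: X is uniform on {0, 1, 2, 3} and the observer sees Y = X mod 2.
joint = {(x, x % 2): 0.25 for x in range(4)}
shannon_leak, min_leak = leakage(joint)
print(shannon_leak, min_leak)
```

For this toy channel both metrics report that one of the two secret bits is revealed by the observation.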
IV-B Perfect Model IDS Rule
Assume a perfect model IDS rule $r_p$, that always detects the attack vector and does not have any false alarms or other entropy sources. Furthermore assume that the given attack vector does not change between different attack instances. The payload sample in the IDS alarm from $r_p$ is also assumed to not contain any other entropy sources. The IDS will in this case always sample the same payload excerpt in every alarm according to the attack pattern definition.
This IDS rule is termed a perfect model IDS rule, since it is considered perfect according to the theoretical model of privacy leakage. $r_p$ is in other words a perfect model of IDS rule behaviour from a privacy perspective. This is not a purely theoretical IDS rule behaviour. We observed three IDS rules that behaved like $r_p$ in our experiments, for example the Snort IDS rule with SID 1:2003 SQL Worm Propagation attempt, as shown in Figure IV.1. This is obviously a simplistic model of an IDS rule, since it does not handle the fact that many IDS rules, and also non-rule based technologies like anomaly-based IDS, will be able to detect more than one attack vector, and also variants of attack vectors. The model is furthermore oblivious to whether the source of entropy is adversarial or ordinary user activity. An entropy-based metric can only measure whether information is leaking or not. Therefore the privacy impact will need to be evaluated, as discussed earlier.
The perfect model IDS rule will under these assumptions provide a constant leakage of information, denoted as $c$, in each alarm, corresponding to the pattern matched by $r_p$.
The privacy impact of this constant information leakage as a privacy leakage is however not known. The privacy impact of the information leakage from each IDS rule must therefore be evaluated by a data controller, to determine whether the expected information leakage from the IDS rule can be considered necessary and acceptable from a security perspective, and also that the effective privacy impact from the rule can be considered negligible if the rule is effective over time.
This manual quality assurance procedure makes it possible to detect and avoid IDS rules where $r_p$ in itself is judged to cause a significant privacy leakage, for example if the rule itself triggers on person sensitive information. The privacy leakage from each installed IDS rule is therefore in the rest of this paper considered as either necessary or negligible. If this constant privacy leakage is not considered tolerable, then it is assumed that this can be mitigated using anonymisation or pseudonymisation policies.
$r_p$ will under these assumptions always trigger on the same attack pattern, as illustrated in Figure IV.2. The inter-alarm entropy, assuming a set of input data $\mathcal{X}$, is defined as the entropy between different IDS alarms, calculated over the entire payload excerpt (i.e. each IDS alarm is considered as one “symbol”). The inter-alarm entropy will in this case be zero, since the only possible alarm occurs with probability 1 and $\log 1 = 0$. This means that a perfect model IDS rule according to this definition, from an information theoretical perspective, does not reveal any additional information apart from what can be inferred from the limited and constant information leakage in each alarm.
This does not mean that additional leakage of sensitive information cannot occur, since the resulting privacy leakage also will depend on the timing and context of the alarms. Additional information may for example be revealed by correlating the interdependencies between the IDS rules.
However, under the given assumptions, this means that when $r_p$ triggers, then a known data pattern will have been sent in the input data stream. This information leakage is considered a tolerable privacy leakage under the assumptions in the previous subsection.
IV-C A Non-perfect IDS Rule
Next, consider a non-perfect IDS rule $r$, which in addition to the assumed necessary and limited information leakage by the attack pattern, also may have false alarms or other entropy sources, as illustrated in Figure IV.3. However, it still only detects one attack vector, that does not change between attacks. This means that the entropy distribution function will be unimodal, perhaps with some outliers as illustrated in Figure IV.3. This is a simplistic model of how an IDS rule behaves. It does not assume any particular IDS rule implementation (e.g. whether string matching or regular expressions are being used) and does not take any position on the type of IDS technology being used. Experimental results have however shown that a significant share of all IDS rules actually behave in this way: 53% of the IDS rules in the experiments performed here were unimodal, indicating one attack vector, and a former pre-experiment at a commercial MSS provider indicated that 35% of the IDS rules were unimodal. However, this also means that many IDS rules do not behave this way. We will therefore later discuss how this restriction can be removed.
The model of a unimodal non-perfect IDS rule is illustrated in Figure IV.4. Assume that this IDS rule generates the ordered set of IDS alarms denoted as $A = (a_1, \dots, a_n)$, where $a_i \neq a_j$ for at least one pair $i \neq j$. The inter-alarm entropy will in this case be greater than zero for both Shannon and Min-entropy, because more than one distinct alarm occurs and the most probable alarm therefore has a probability below one.
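The difference between the two rule models can be sketched as follows. This is an illustrative sketch only: it assumes that each distinct payload excerpt is treated as one symbol and that symbol probabilities are estimated from relative frequencies; the payloads are made up.

```python
import math
from collections import Counter


def inter_alarm_entropy(alarms):
    """Return (Shannon, min-entropy) in bits, treating each whole
    alarm payload as a single symbol."""
    n = len(alarms)
    probs = [c / n for c in Counter(alarms).values()]
    shannon = -sum(p * math.log2(p) for p in probs)
    min_ent = -math.log2(max(probs))
    return abs(shannon), abs(min_ent)  # abs() only normalises -0.0 to 0.0


# Hypothetical payload excerpts.
attack = b"\x90" * 16 + b"\xeb\x1f\x5e"
perfect_rule_alarms = [attack] * 5                        # always the same alarm
non_perfect_alarms = [attack] * 3 + [b"GET /private/inbox"]  # one false alarm

print(inter_alarm_entropy(perfect_rule_alarms))
print(inter_alarm_entropy(non_perfect_alarms))
```

The perfect model rule yields zero for both metrics, while a single deviating alarm is enough to make both strictly positive.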
IV-D Privacy Leakage Model
The next question is how to model the privacy leakage from the non-perfect IDS rule $r$. One way to do this is to measure the information leakage of the non-perfect IDS rule $r$ relative to a perfect model IDS rule $r_p$, as illustrated in Figure IV.5. The communication channel then consists of a cascade of two IDS rules (or two IDS rules connected in series), where the output of the first IDS rule serves as input to the second IDS rule. Both IDS rules have the objective to trigger on the same attack vector; however, the first IDS rule is non-perfect, and may have false alarms or other entropy sources, whereas the second IDS rule is considered a perfect model IDS rule. The advantage of using a cascading model is that this allows for comparing known values, and it is not dependent on the unknown Internet traffic $\mathcal{X}$. The set of alarms from $r$ is known by the MSS provider, and the set of expected alarms from $r_p$ is also known given the alarms from $r$.
Focusing on the inter-alarm entropies is not a fruitful approach here, since the inter-alarm entropy of $r_p$ is zero, so that the difference in inter-alarm entropies merely equals the inter-alarm entropy of $r$. What is needed is therefore a measure of the limited information leakage that the perfect model IDS rule causes.
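This initial information loss, denoted as the intra-alarm information loss $c$, can be expressed by measuring the entropy of the IDS alarm in bits, instead of measuring the inter-alarm entropy (the entropy between IDS alarms, considering the entire IDS alarm as one symbol). The intra-alarm entropy for a perfect model IDS rule $r_p$ can be calculated by assuming that the IDS alarm consists of a large sequence of bits. This can be expressed formally by considering a given IDS alarm as $a = (b_1, b_2, \dots, b_m)$ where $b_k \in \{0, 1\}$.
Considering the perfect model IDS rule $r_p$ first, then this IDS rule will always return the same IDS alarm $a_p = (b_1, \dots, b_m)$ where $b_k \in \{0, 1\}$ with bit-probability $p = P[b = 1]$. The information leakage is defined according to (IV.8) as:
$$\mathcal{L}_\alpha = H_\alpha(X) - H_\alpha(X \mid Y)$$
Since $r_p$ is deterministic, then $Y$ will be determined by $X$, which means that $H_1(Y \mid X) = 0$. Furthermore, for Shannon entropy:
$$\mathcal{L}_1 = H_1(Y) - H_1(Y \mid X) = H_1(Y) = -\sum_{b} P[b] \log P[b]$$
for $b \in \{0, 1\}$, which means that this can be expressed as:
$$H_1(Y) = -p \log p - (1 - p) \log (1 - p)$$
This shows that $r_p$ has a constant privacy leakage for Shannon entropy. This can also be shown for Min-entropy by substituting $Y$ into Equation (IV.6):
$$H_\infty(Y) = \log \frac{1}{V(Y)}$$
where the vulnerability can be expressed as:
$$V(Y) = \max_{b} P[b]$$
for $b \in \{0, 1\}$, which means that this can be expressed as:
$$V(Y) = \max(p, 1 - p)$$
which can be expressed as
$$V(Y) = \tfrac{1}{2} + \left| p - \tfrac{1}{2} \right|$$
This shows that the vulnerability is $V(Y) = 1$ for $p \in \{0, 1\}$. The lowest vulnerability is $V(Y) = \tfrac{1}{2}$ for $p = \tfrac{1}{2}$, as expected. This means that the Min-entropy for $r_p$ can be expressed as:
$$H_\infty(Y) = -\log \left( \tfrac{1}{2} + \left| p - \tfrac{1}{2} \right| \right)$$
This means that $r_p$ has a constant information leakage for both Shannon-entropy $H_1$ and Min-entropy $H_\infty$. However these constants are different, except in the special cases where $p \in \{0, \tfrac{1}{2}, 1\}$, as can be expected (see Figure IV.6).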
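A minimal numerical sketch of the per-bit entropies discussed above, assuming that each alarm bit is modelled as a Bernoulli variable with probability `p` of being 1 (the name `p` and the helper functions are illustrative). For such a bit, Shannon and Min-entropy agree only at p = 0, p = 1/2 and p = 1; elsewhere Shannon entropy is strictly larger.

```python
import math


def shannon_bit(p: float) -> float:
    """Binary Shannon entropy in bits for bit-probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def vulnerability(p: float) -> float:
    """Probability of guessing the bit correctly in one try."""
    return max(p, 1 - p)


def min_entropy_bit(p: float) -> float:
    """Binary min-entropy in bits for bit-probability p."""
    return abs(math.log2(vulnerability(p)))  # abs() avoids returning -0.0


for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p={p:4.2f}  H1={shannon_bit(p):.3f}  Hinf={min_entropy_bit(p):.3f}")
```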
Let the constant information leakage for either Shannon or Min-entropy be denoted as $c$. The relative information leakage from the IDS rule $r$ can then be formally defined as follows:
Let $r$ be a non-perfect IDS rule, that in addition to the assumed necessary and limited information leakage by the attack pattern, also may have false alarms or other entropy sources. Let $r_p$ be a perfect model IDS rule with a limited privacy leakage $c$, measured using $H_\alpha$ with $\alpha \in \{1, \infty\}$. (It is possible to show that this definition generalises to any Rényi entropy; however, that is beyond the scope of this paper, since Min-entropy and Shannon-entropy are considered the best candidates for the privacy leakage metric [smith_foundations_2009].) The relative information leakage for an IDS rule $r$ with input $\mathcal{X}$, that generates a set of IDS alarms $A = \{a_1, \dots, a_n\}$, each with probability $p_i$, is then defined as the difference in intra-alarm entropy between $r$ and a perfect model IDS rule $r_p$ that both trigger on the same attack vector:
$$\mathcal{L}_\alpha(a_i) = H_\alpha(a_i) - c$$
If the probability distribution function (PDF) of the IDS alarm entropies for a given attack vector is symmetric, then the average entropy denoted as $\bar{H}_\alpha$ for input $\mathcal{X}$ and a sufficiently large set of IDS alarms $A$ can be considered as a good estimator of $c$. For skewed distributions, the median may give a better estimate, given that the sample is sufficiently large. It can furthermore be observed that the precision of this estimator will improve with the precision of the IDS rule $r$. This means that the information leakage of $r$ for a given IDS alarm $a_i$ can be expressed as:
$$\mathcal{L}_\alpha(a_i) \approx H_\alpha(a_i) - \bar{H}_\alpha$$
where the average entropy can be expressed as
$$\bar{H}_\alpha = \frac{1}{n} \sum_{i=1}^{n} H_\alpha(a_i)$$
for a set of input data $\mathcal{X}$.
IV-E Information Leakage for a Sample of IDS Alarms
The average entropy per byte for a sample of IDS alarms $A = \{a_1, \dots, a_n\}$ generated by an IDS rule $r$ that detects a single attack vector, can be expressed as
$$\bar{H}_\alpha = \frac{1}{n} \sum_{i=1}^{n} H_\alpha(a_i)$$
The information leakage for any IDS alarm $a_i$, denoted as $\mathcal{L}_\alpha(a_i)$, can then be expressed as:
$$\mathcal{L}_\alpha(a_i) = H_\alpha(a_i) - \bar{H}_\alpha$$
Further processing of the information leakage for the IDS alarms can now be done using traditional statistical analysis. The privacy leakage of the IDS rule can be expressed as the standard deviation error margin or the 95% confidence interval of the IDS rule. This gives an indication of the expected precision of the IDS rule. Another useful metric is the worst-case information leakage, denoted as $\mathcal{L}_{\max}$ where $\mathcal{L}_{\max} = \max_i \mathcal{L}_\alpha(a_i)$, or the minimum information leakage, denoted as $\mathcal{L}_{\min}$ where $\mathcal{L}_{\min} = \min_i \mathcal{L}_\alpha(a_i)$. Both of these can be useful in statistical analyses, in addition to the standard deviation. Furthermore, the privacy leakage can be calculated as the product of the information leakage and the privacy impact, where the privacy impact is estimated by the data controller.
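The per-alarm statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions: entropy is computed as normalised Shannon entropy per byte, the sample mean serves as the estimate of the constant baseline leakage, and the payloads and the `privacy_impact` weight are hypothetical values chosen for the example.

```python
import math
import statistics
from collections import Counter


def entropy_per_byte(payload: bytes) -> float:
    """Normalised Shannon entropy (0..1) of the payload's byte histogram."""
    n = len(payload)
    bits = -sum((c / n) * math.log2(c / n) for c in Counter(payload).values())
    return abs(bits) / 8.0


def alarm_leakages(payloads):
    """Entropy of each alarm relative to the sample mean (the baseline)."""
    entropies = [entropy_per_byte(p) for p in payloads]
    baseline = statistics.mean(entropies)
    return [h - baseline for h in entropies]


# Hypothetical alarms: five identical attack payloads and one alarm that
# also sampled unrelated user text.
attack = b"\x90" * 24 + b"\xcc\xeb\x1f\x5e"
payloads = [attack] * 5 + [attack[:12] + b"Dear Bob, my PIN"]

leaks = alarm_leakages(payloads)
rule_leakage = statistics.stdev(leaks)   # spread of the rule's alarms
worst_case = max(leaks)                  # worst-case alarm
best_case = min(leaks)                   # minimum leakage
privacy_impact = 0.5                     # hypothetical data controller estimate
privacy_leakage = privacy_impact * rule_leakage

print(rule_leakage, worst_case, best_case, privacy_leakage)
```

In this sketch the alarm containing stray user text stands out as the worst-case alarm, which is the kind of alarm a data controller would want to inspect first.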
IV-F Sample Standard Deviation of Entropy
IV-F1 Normal Distribution
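Assuming that the probability distribution of alarms can be approximated using a Normal distribution, then the standard deviation can be calculated using the second norm ($L_2$ norm).
Assume that the IDS generates a sample of IDS alarms $A = \{a_1, \dots, a_n\}$. Each alarm $a_i$ contains payload or other potentially privacy leaking elements or attributes from the IDS alarms generated by an IDS rule $r$. The sample standard deviation of the entropy of the elements $a_i$ can then be expressed as:
$$\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( H_\alpha(a_i) - \bar{H}_\alpha \right)^2}$$
where $\bar{H}_\alpha = \frac{1}{n} \sum_{i=1}^{n} H_\alpha(a_i)$ is the average entropy of the sample.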
The general properties of the variance of entropy measurements will fulfill the same requirements as the standard deviation of entropy measurements. However, the standard deviation is considered more appropriate, since it operates with the same unit of measure as the entropy.
IV-F2 Laplacian Distribution
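An alternative distribution that during the experiment was shown to fit the data well is the Laplacian (or double exponential) distribution. The Laplacian standard deviation, denoted as $\sigma_L$, is based on the $L_1$ norm (or Manhattan distance), and can be expressed as the sum of absolute deviations:
$$\sigma_L = \frac{\sqrt{2}}{n} \sum_{i=1}^{n} \left| H_\alpha(a_i) - \tilde{H}_\alpha \right|$$
where $\tilde{H}_\alpha$ is the median entropy of the sample.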
A well known advantage with , is that it will be less influenced by outliers in the tail of the PDFs than the standard deviation of the Normal distribution.
The standard deviation of normalised entropy is a measure of the relative information leakage from an IDS rule, under the assumption that it detects only one nonmutating attack vector. If an IDS rule detects the attack vector perfectly without any false alarms, then the entropy of the IDS alarms will always be the same, and $\sigma = 0$. If the IDS rule is precise at detecting the attack, then only a few bits of information will vary between IDS alarms. This means that all alarms will have similar entropy with low standard deviation and therefore also low information leakage. However, if the IDS rule also has a significant amount of false alarms, or gets entropy from other sources, then the entropy variance, and therefore also the information leakage from the IDS rule, will increase.
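The two dispersion measures can be compared with a short sketch. It assumes the usual sample standard deviation for the Normal case and, for the Laplacian case, the maximum likelihood scale estimate (mean absolute deviation from the median) multiplied by the square root of two; the exact estimator used in the experiments may differ. The entropy values are made up.

```python
import math
import statistics


def normal_std(entropies) -> float:
    """Sample standard deviation (L2 norm based)."""
    return statistics.stdev(entropies)


def laplace_std(entropies) -> float:
    """Standard deviation of a fitted Laplacian (L1 norm based):
    sqrt(2) times the mean absolute deviation from the median."""
    med = statistics.median(entropies)
    scale = sum(abs(h - med) for h in entropies) / len(entropies)
    return math.sqrt(2) * scale


# Hypothetical normalised alarm entropies for one attack vector.
clean = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48, 0.50]
with_outlier = clean + [0.95]  # one alarm that sampled unrelated data

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(f"{name:13s} normal={normal_std(data):.4f}  laplace={laplace_std(data):.4f}")
```

With the single outlier added, the L2-based estimate grows by a larger factor than the L1-based one, which illustrates the robustness argument made above.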
This subsection shows how the standard deviation of entropy metric can be aggregated for a set of IDS rules. Assume that an IDS uses a rule set denoted as $R$ with $m$ IDS rules $r_1, \dots, r_m$. Each IDS rule $r_j$ matches independently a set of IDS alarms $A_j$ containing $n_j = |A_j|$ alarms, where the number of IDS alarms typically will vary between IDS rules. Furthermore, assume that the IDS alarms are independent and non-overlapping, i.e. $A_j \cap A_k = \emptyset$ for $j \neq k$. This means that all IDS alarms, denoted $A$, can be expressed as $A = \bigcup_{j=1}^{m} A_j$.
Assume that an IDS rule $r_j$ has entropy standard deviation denoted as $\sigma_j$ and that the resulting aggregated standard deviation is denoted as $\sigma_R$. The aggregated metric should furthermore fulfill the following criteria in order to provide meaningful aggregation:
C1: If all IDS rules have the same standard deviation, say $\sigma$, then $\sigma_R$ should also be the same, i.e. $\sigma_R = \sigma$.
C2: The resulting entropy standard deviation should be weighted according to how many alarms trigger on a given IDS rule $r_j$.
Each IDS rule should be assessed individually, in the same way as each underlying vulnerability should be assessed individually. This means that a weighted average, weighted by the number of alarms from each IDS rule, can be used as aggregation function for $\sigma_R$, i.e.:
$$\sigma_R = \frac{\sum_{j=1}^{m} n_j \sigma_j}{\sum_{j=1}^{m} n_j}$$
This function fulfills criterion C1, since the resulting weighted average is the same if $\sigma_j$ is the same for all IDS rules, and it fulfills C2 by weighting the standard deviation by the number of IDS alarms.
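The aggregation function can be sketched in a few lines. The function name and the input format, a list of (alarm count, entropy standard deviation) pairs with one pair per IDS rule, are assumptions made for this illustration.

```python
def aggregate_std(rule_stats) -> float:
    """Alarm-count weighted average of per-rule entropy standard deviations.

    rule_stats: iterable of (alarm_count, entropy_std) pairs, one per rule.
    """
    rule_stats = list(rule_stats)
    total_alarms = sum(n for n, _ in rule_stats)
    if total_alarms == 0:
        raise ValueError("at least one alarm is required")
    return sum(n * s for n, s in rule_stats) / total_alarms


# C1: identical per-rule values are preserved, whatever the alarm counts.
same = aggregate_std([(10, 0.2), (1000, 0.2), (5, 0.2)])

# C2: a rule with many alarms dominates a rule with few alarms.
mixed = aggregate_std([(900, 0.1), (100, 0.5)])

print(same, mixed)
```

A weighted mean is a simple choice that satisfies both criteria; it also means that a noisy rule that rarely triggers contributes little to the aggregate.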
IV-H IDS Rules Detecting Several Attack Vectors
A significant part of the IDS rules will detect more than one attack vector, as illustrated in Figure LABEL:fig:Number-of-attack-vectors. The data set used in this paper has 47% of the IDS rules with more than one attack vector. An earlier preliminary experiment at a commercial MSS provider shows even higher percentage (65%). An indication of an IDS rule that detects several attack vectors, is that the entropy probability distribution is multi-modal. Figure IV.7 shows an example IDS rule that matches three privacy leaking attack vectors. The Figure shows the payload entropy distribution of the Snort IDS rule with SID 1:11969 VOIP-SIP inbound 401 Unauthorized. A payload length correction causes the metric to be larger than one, and is required to make the metric incentive compatible444Incentive compatibility – a characteristic of mechanisms whereby each agent knows that his best strategy is to follow the rules, no matter what the other agents will do [durlauf_incentive_2008].. The details of this can be ignored for now, since this will be discussed in Section LABEL:sub:Payload-Length-Correction. Each attack vector cluster corresponds to a different SIP service provider.
A clustering algorithm is needed to identify each underlying attack vector for multi-modal distributions. Each individual cluster will in this case represent an attack vector, which behaves in a similar way to the non-perfect IDS rule described in Section IV-C. This means that the privacy leakage of each attack vector cluster can be calculated as the entropy standard deviation over all samples belonging to the cluster, and the resulting privacy leakage for the IDS rule can be calculated by aggregating over all clusters of the IDS rule using Equation IV.26.
IV-I How to Perform the Clustering
There are two main types of clustering algorithms: hard clustering and soft clustering. Hard clustering algorithms assign each sample to exactly one cluster. Examples of hard clustering algorithms are the popular k-means and k-medians algorithms [macqueen_methods_1967, bradley_clustering_1997]. Hard clustering is however not appropriate for clustering the IDS rules, since it cuts off the samples at the tail of the distribution where two distributions overlap. This gives a bias towards a lower entropy standard deviation than can be expected.
Soft clustering is then a better approach, since it assigns the probability that each sample belongs to a given cluster, instead of assigning each sample to a given cluster. A commonly used soft clustering technique is the Expectation Maximisation (EM) algorithm [dempster_maximum_1977]. This soft-clustering method provides a Maximum Likelihood estimate of the underlying data distribution as a mixture of assumed probability distributions. The EM-algorithm is basically a two-step hill-climbing technique where the first step (E-step) calculates the expectation of the log-likelihood using the current estimate of the parameters of the underlying probability distributions. The second step (M-step) computes the parameters that maximise the expected log-likelihood identified during the E-step.
There are however some drawbacks with the EM-algorithm. It is prone to getting stuck in local optima of the likelihood, which means that it is sensitive to the initial cluster parameters. We therefore initialise EM with the cluster centers identified by k-means, since this is a generally recommended method of initialising the cluster centers (footnote 5: we used k-means from the Python module scikit-learn to initialise the EM algorithm [scikit-learn]). Another issue is the selection of the number of clusters. Too many clusters may cause EM to overfit the data, whereas too few clusters may give a poor representation of the distribution of the samples.
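The initialisation step can be illustrated with a small one-dimensional k-means. The paper used the k-means implementation from scikit-learn; the sketch below is a standard-library stand-in (plain Lloyd iterations with a quantile-based start) so that it runs without extra dependencies. The function name `kmeans_1d` and the deterministic starting rule are assumptions made for this example.

```python
def kmeans_1d(xs, k, iters=50):
    """Return k cluster centers for one-dimensional samples (Lloyd iterations).

    Stand-in for a library k-means; only meant to produce starting
    centers for a subsequent EM run.
    """
    xs = sorted(xs)
    # Deterministic start: take centers at evenly spaced sample quantiles.
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Hard assignment: each sample goes to its nearest center.
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Move each center to the mean of its group; keep empty groups in place.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers
```

The returned centers would then be passed to the EM algorithm as initial cluster locations.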
It is commonly assumed that the underlying probability distribution is a mixture of either Gaussian or Laplacian probability density functions. Both outliers and skewness were found to be significant during the experimental analysis in Section LABEL:sub:Analysis-of-Alarm-PDF. We have therefore decided to model the probability distribution as a mixture of Laplacian probability density functions using the method proposed in [cord_feature_2006]. This method is based on order statistics (it uses a weighted median instead of the mean), and is therefore more robust against outliers and skewness than a Gaussian mixture [cord_feature_2006]. The remainder of this section highlights the theory and notation necessary to understand how we have implemented the Laplacian mixture based clustering.
IV-J Laplacian Mixture Model
This section defines the general notation, which is based on the well-known theory of learning finite mixture models [bailey_fitting_1994, figueiredo_unsupervised_2002]. Furthermore, the Laplacian Mixture Model used here is based on [cord_feature_2006]. Our implementation is simplified compared to the original solution, since only univariate clustering is needed. Let $X$ be a random variable representing the IDS alarm entropies of an IDS rule $r$, with $x$ representing one particular outcome of $X$. This random variable is expressed as:
$$p(x \mid \theta) = \sum_{k=1}^{K} \alpha_k \, p(x \mid \theta_k)$$
where $\alpha_1, \ldots, \alpha_K$ are the mixing probabilities, each $\theta_k$ is the set of parameters defining the $k$-th component of the mixture and $\theta = \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$ is the complete set of parameters that define the mixture. Being probabilities, the $\alpha_k$ must satisfy $\alpha_k \ge 0$ and $\sum_{k=1}^{K} \alpha_k = 1$. It is assumed that all the components of the mixture are Laplacian distributions $L(x \mid \mu_k, b_k)$. The Laplacian distribution is defined as:
$$L(x_i \mid \mu_k, b_k) = \frac{1}{2 b_k} \exp\left(-\frac{|x_i - \mu_k|}{b_k}\right)$$
where $x_i$ is the entropy of the IDS alarm $i$, $b_k$ is the scale parameter and $\mu_k$ is the median for mixture component $k$. In the remainder, assume the shorthand notation that $\theta_k = (\mu_k, b_k)$.
IV-K EM-Algorithm for Laplacian Mixture Model
The implementation of the EM-algorithm is based on [cord_feature_2006, figueiredo_unsupervised_2002]. Assume that the EM-algorithm is performing cluster analysis on a sample of ordered entropy values $x_1, \ldots, x_N$, where $x_i \le x_j$ for $i < j$. These entropy values are calculated over the IDS alarms generated by an IDS rule $r$. The Expectation Maximisation algorithm for the Laplacian Mixture Model then consists of two steps that are iterated until convergence is detected:
E-step: calculate the conditional expectation $w_{i,k}$ of the complete log-likelihood that $x_i$ comes from the $k$-th component of the mixture:
$$w_{i,k} = \frac{\alpha_k \, p(x_i \mid \theta_k)}{\sum_{j=1}^{K} \alpha_j \, p(x_i \mid \theta_j)}$$
M-step: estimate new model parameters and weights that maximise the log-likelihood of the model:
$$\alpha_k = \frac{1}{N} \sum_{i=1}^{N} w_{i,k}, \qquad \mu_k = \mathrm{wmedian}_k(x_1, \ldots, x_N), \qquad b_k = \frac{\sum_{i=1}^{N} w_{i,k} \, |x_i - \mu_k|}{\sum_{i=1}^{N} w_{i,k}}$$
where the algorithm to calculate the weighted median $\mathrm{wmedian}_k$ for a given cluster $k$, according to [cord_feature_2006], is described in Algorithm 1.
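The two steps can be sketched as follows for the univariate case. This is an illustrative sketch under stated assumptions, not the authors' code: the names `w` (conditional expectations), `alphas`, `mus` and `bs` are chosen for the example, the weighted median is a generic textbook version rather than Algorithm 1 of [cord_feature_2006], the initial scale is an arbitrary guess, and a fixed iteration count replaces the MML stop criterion.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of a Laplacian distribution with median mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def weighted_median(xs, ws):
    """Smallest x whose cumulative weight reaches half the total (xs sorted)."""
    half = 0.5 * sum(ws)
    acc = 0.0
    for x, weight in zip(xs, ws):
        acc += weight
        if acc >= half:
            return x
    return xs[-1]

def em_laplace_mixture(xs, mus, iters=100, min_b=1e-3):
    """Fit a univariate Laplacian mixture, starting from the medians in mus."""
    xs = sorted(xs)
    k = len(mus)
    mus = list(mus)
    alphas = [1.0 / k] * k
    spread = (xs[-1] - xs[0]) or 1.0
    bs = [spread / (2.0 * k)] * k  # assumed initial scale, not from the paper
    w = []
    for _ in range(iters):
        # E-step: probability that each sample belongs to each component.
        w = []
        for x in xs:
            p = [a * laplace_pdf(x, m, b) for a, m, b in zip(alphas, mus, bs)]
            s = sum(p)
            w.append([v / s for v in p] if s > 0.0 else [1.0 / k] * k)
        # M-step: update mixing probability, median and scale per component.
        for j in range(k):
            wj = [row[j] for row in w]
            wsum = sum(wj)
            if wsum <= 0.0:
                continue  # empty component: leave its parameters unchanged
            alphas[j] = wsum / len(xs)
            mus[j] = weighted_median(xs, wj)
            dev = sum(wi * abs(x - mus[j]) for x, wi in zip(xs, wj)) / wsum
            bs[j] = max(min_b, dev)  # floor keeps the density well defined
    return alphas, mus, bs, w
```

The rows of `w` follow the sorted order of the samples. The floor `min_b` is a practical guard against a component collapsing onto a single value; it is not part of the described method.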
The algorithm uses the Minimum Message Length (MML) as stop criterion [wallace_information_1968], assuming one-dimensional data. We do not go into details on the MML criterion and just present the implemented solution here. The detailed derivation of the MML criterion used can be found in [figueiredo_unsupervised_2002].
The last term of Equation IV.33 is derived from the fact that the minimum of the criterion over can be obtained by using the negative maximum of the log-likelihood (the last term), since
The algorithm stops when the difference in MML length between two iterations is less than . In addition to the MML criterion, the implementation of the EM algorithm requires at least 40 iterations to converge initially, and at least 20 iterations to converge after modifications of the cluster definitions. This is to avoid accidentally hitting a local MML minimum before convergence has occurred.
IV-L Determining the Optimal Number of Clusters
We initially tested the method for estimating the number of components in [figueiredo_unsupervised_2002]. This method worked for nice continuous distributions. However, it did not work equally well for noisy distributions or for mixtures containing binomial distributions, since the EM-algorithm then easily got stuck in local modes. Overfitting was also a significant problem for binomial distributions.
Furthermore, to judge whether a cluster should be interpreted as an attack vector or not typically requires that the data controller does some investigation of the IDS alarms. This means that some degree of manual intervention typically will be required during the clustering to assert obvious clusters that the clustering algorithm has missed or delete clusters where overfitting occurs. A typical example of overfitting is where several components with the same median are used to represent a given cluster. Another example is for skewed distributions, where the EM attempts to fit the skewed curve by overfitting the data.
We implemented a simple user interface for managing the clusters. It supports configuration of the initial number of clusters as well as managing the model definition after the initial configuration. The program also supports selecting type of entropy data and IDS rule to analyse from the datasets. The user interface for managing the clustering consists of the following functions:
Assert that the cluster number has a mode at .
Delete clusters at index clusterlist. Deleted clusters are marked with .
Pick the cluster to be asserted by clicking the mouse at the position to be asserted in the histogram showing the frequency distribution of the IDS alarm entropies. If there are no clusters that are marked as deleted, then the least significant cluster (with lowest ) will be chosen.
After the clusters have been modified, the EM-algorithm is resumed by typing the cont command in the debugger. When the data controller is satisfied with the cluster definition, typing cont without modifying the clusters causes the algorithm to finish and print out the calculated privacy leakage for each cluster and also the aggregated privacy leakage for the IDS rule.
IV-M Calculating the Privacy Leakage for Clusters
The privacy leakage for the identified clusters is calculated after the data controller has asserted that the relevant clusters have been identified and that the EM-algorithm subsequently has converged. All probability mass is then assigned to the clusters, which means that the privacy leakage can be calculated for the given IDS rule .
First, the model will in itself give an indication of the privacy leakage in the form of the entropy standard deviation $\sigma^{M}_k$ of the Laplacian function for a given cluster $k$. It is a well known fact that this can be calculated from the scale parameter $b_k$ of a Laplacian distribution as $\sigma^{M}_k = \sqrt{2}\, b_k$. However, to be able to aggregate the entropy standard deviation over all clusters, the relative proportion of the samples for a given cluster must be estimated, which is exactly what the mixing probability $\alpha_k$ indicates. This means that the resulting entropy standard deviation $\sigma^{M}$ for the IDS rule can be calculated as the weighted average using Equation IV.26, substituting the alarm-count weights with $\alpha_k$:
$$\sigma^{M} = \sum_{k=1}^{K} \alpha_k \, \sigma^{M}_k \qquad \text{(IV.35)}$$
A disadvantage of using the model standard deviation $\sigma^{M}_k$ is that it will only be correct if the model fits the data reasonably well. This may be true in some cases, however the sample distributions in the experiments do in several cases deviate significantly from the model due to outliers, heavy tails or noise. In these cases, it will be more correct to have a measure of the standard deviation that is based on the underlying samples weighted according to the conditional expectation of the model distributions defined by $w_{i,k}$, so that the weighted entropy is described by $w_{i,k} \, x_i$. This means that the model distributions are used to specify how the samples are divided between the clusters, instead of defining the clusters directly. The mean value of the cluster entropies for cluster $k$ can then be expressed as:
$$\bar{x}_k = \frac{\sum_{i=1}^{N} w_{i,k} \, x_i}{\sum_{i=1}^{N} w_{i,k}}$$
and the Normal standard deviation can be expressed in a similar way as:
$$\sigma^{N}_k = \sqrt{\frac{\sum_{i=1}^{N} w_{i,k} \, (x_i - \bar{x}_k)^2}{\sum_{i=1}^{N} w_{i,k}}}$$
Furthermore, the Laplacian standard deviation, based on the $L_1$ norm, can be expressed in terms of the conditional expectation $w_{i,k}$ and the median $\mu_k$ of the mixture component as:
$$\sigma^{L}_k = \sqrt{2}\, \frac{\sum_{i=1}^{N} w_{i,k} \, |x_i - \mu_k|}{\sum_{i=1}^{N} w_{i,k}}$$
The resulting aggregated entropy standard deviation for the IDS rule can in both these cases be calculated from Equation IV.35 by substituting the relevant standard deviation into the equation. The clustering analysis tool prints out the individual standard deviations per cluster as well as the resulting standard deviation for the IDS rule, based on the standard deviation of the model $\sigma^{M}$, the Normal standard deviation $\sigma^{N}$ and the Laplacian standard deviation $\sigma^{L}$. It is useful to compare these, since a large deviation between $\sigma^{M}$ and the other standard deviations indicates a poor model fit, which may or may not be relevant depending on examination of the underlying data.
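The three per-cluster standard deviations and their aggregation can be sketched as below. The exact formulas were not recoverable from the text, so the forms used here are assumptions: the model value is the Laplacian identity $\sqrt{2}$ times the scale, the Normal value is a weighted root mean square deviation around the weighted mean, and the Laplacian value is $\sqrt{2}$ times the weighted mean absolute deviation around the cluster median. Function and parameter names are hypothetical.

```python
import math

def cluster_stds(xs, weights, median, scale):
    """Return (model, normal, laplacian) standard deviations for one cluster.

    xs      : entropy samples of the IDS rule
    weights : conditional expectation that each sample belongs to this cluster
    median  : fitted median of the cluster's Laplacian component
    scale   : fitted scale parameter of the cluster's Laplacian component
    """
    wsum = sum(weights)
    if wsum <= 0.0:
        raise ValueError("cluster has no probability mass")
    # Model-based value: standard deviation of a Laplacian with this scale.
    model = math.sqrt(2.0) * scale
    # Sample-based Normal value: weighted deviation around the weighted mean.
    mean = sum(w * x for x, w in zip(xs, weights)) / wsum
    normal = math.sqrt(sum(w * (x - mean) ** 2 for x, w in zip(xs, weights)) / wsum)
    # Sample-based Laplacian value: weighted absolute deviation around the median.
    mad = sum(w * abs(x - median) for x, w in zip(xs, weights)) / wsum
    laplacian = math.sqrt(2.0) * mad
    return model, normal, laplacian

def aggregate_clusters(alphas, sigmas):
    """Weighted average over clusters, with mixing probabilities as weights."""
    return sum(a * s for a, s in zip(alphas, sigmas))
```

Comparing the three returned values for each cluster gives the kind of model-fit check described above: a model value far from the two sample-based values suggests that the fitted component does not describe the samples well.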
One can for example expect a good model fit for IDS rules with some Gaussian or Laplacian noise, since this is close to the expected model of privacy leakage. However, very noisy rules that match random traffic will get a poor model fit. An example of this is the IDS rule 1:1394000 in our experiments, which detects random traffic. It has a standard deviation over all data of 6.7 for both the Normal and the Laplacian standard deviation, but a model standard deviation of only $\sigma^{M} = 1.44$. In such cases the standard deviation of the model will not be usable. Another example is when the Normal standard deviation $\sigma^{N}$ is significantly larger than the Laplacian standard deviation $\sigma^{L}$. Then $\sigma^{N}$ may be unduly influenced by outliers, which means that $\sigma^{L}$ would be the more robust estimate. In general, the Laplacian standard deviation can be expected to give the most conservative estimate, which is least influenced by skewness and outliers.
IV-N Summary of EM-based Clustering
The Laplacian Mixture Model is implemented using the EM-algorithm. A semiautomatic process is used to identify the underlying clusters in the IDS alarms. The standard deviation of entropy metric is then calculated for each cluster, and the aggregated metric is calculated for the entire IDS rule. A possible attack on the clustering method is an overfitting attack, where an MSS provider decides to shirk by deliberately overfitting the attack vectors, asserting too many clusters during the clustering process. It is therefore important that the role as data controller is separate from the role as security manager, and also that external quality assurance entities like certification organisations oversee the operation, to ensure that it is not overly privacy invasive. It must be emphasised that the objective is not necessarily to match the underlying probability distribution as closely as possible. The objective is rather to identify any likely attack vectors, and distribute the samples between these. The EM algorithm does this reasonably well.
The EM-based clustering generalises the privacy leakage metric to work for IDS rules that detect more than one attack vector. This generalisation is necessary, since our experiments have shown that a significant share of all IDS rules trigger on more than one underlying attack vector. An advantage of this generalisation is that it avoids the incentive incompatibility of the single cluster metric, which would encourage a shirking MSS provider to cheat by splitting up IDS rules into smaller IDS rules that each detect a single attack vector.
V Detailed Analysis of
This section investigates the standard deviation of entropy metric more thoroughly. The objective is to analyse the convergence speed required to reliably detect random uniform input data as a function of the data length. It is expected that the entropy standard deviation of random uniform input data converges towards zero for a sufficiently long data series. This convergence speed is an important decision factor for the selection of entropy algorithm and symbol definition, since the IDS alarm entropies are calculated over a limited number of IDS alarms. Furthermore, it is discussed which metric and symbol definition works best for distinguishing between plaintext and encrypted data. This analysis shows which entropy type (Min- or Shannon entropy) and symbol size (bit or octet) is best for calculating privacy leakage in IDS rules.
V-A Entropy Calculation
There are at least three obvious ways of selecting the symbol space that is used to calculate the entropies:
Define the payload of the IDS alarm as the symbol, i.e. calculate the inter-alarm entropy;
Use binary entropy, i.e. the intra-alarm entropy as described in Section V-A;
Use octets, i.e. 8-bit words, which commonly are used to define the character set in computer systems.
Other word sizes are possible, however these are considered the most common and interesting ones for our purpose. Each of these symbol definitions has its advantages and disadvantages, and it is important to note that the entropy values calculated from each of these definitions typically will be different. It has already been shown that the intra-alarm entropy calculated from bit-entropy differs from the inter-alarm entropy by a constant value. Furthermore, the inter-alarm entropy cannot be used, since it does not provide the per-alarm entropy values needed to calculate the standard deviation of entropy.
Bit-entropy was used to develop Equation IV.20, since it is the easiest way to develop the theory for the privacy leakage metric. The entropy standard deviation formula is however not dependent on any particular symbol definition, as long as the symbol definition ensures that the entropy standard deviation in the worst case, i.e. for random, uniform data, can be measured to be sufficiently close to zero for encrypted traffic. It is assumed that the entropy standard deviation converges towards zero for random, uniform data as a function of input data length, however the convergence speed is unknown and must be investigated. It can furthermore be observed that for a perfect encryption scheme that is approximated by random uniform data, the symbol definition does not matter, since random uniform data does not leak any information. This means that if the objective is purely to detect whether the information conveyed is encrypted or not, then the entropy scheme with the fastest convergence speed may make sense to use.
This means that the minimum length of data required to reliably detect that random uniform data has zero variance (i.e. speed of convergence) is an important design factor that this metric relies on. It can be expected that different entropy metrics will have different convergence speeds. In particular, Min-entropy can be expected to converge more slowly, since it only considers the maximum symbol occurrence probability, and not a weighted sum of all symbol occurrence probabilities, as Shannon entropy does.
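The difference between the symbol definitions and the two entropy types can be made concrete with a short sketch. It uses the standard definitions of Shannon entropy and Min-entropy over an empirical symbol distribution; the function names are chosen for this example, and any normalisation or payload length correction used elsewhere in the paper is deliberately left out.

```python
import math
from collections import Counter

def symbol_probs(data, symbol="octet"):
    """Empirical symbol probabilities of a payload.

    symbol="bit"   : two symbols, the fractions of one-bits and zero-bits
    symbol="octet" : up to 256 symbols, one per distinct byte value
    """
    if not data:
        raise ValueError("empty payload")
    if symbol == "bit":
        ones = sum(bin(octet).count("1") for octet in data)
        p_one = ones / (8 * len(data))
        return [p_one, 1.0 - p_one]
    counts = Counter(data)
    return [count / len(data) for count in counts.values()]

def shannon_entropy(probs):
    """Weighted sum over all symbol probabilities, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def min_entropy(probs):
    """Depends only on the most probable symbol, in bits per symbol."""
    return -math.log2(max(probs))
```

Because `min_entropy` looks at a single probability while `shannon_entropy` averages over all of them, the two respond differently to sampling noise in short payloads, which is the effect investigated in the following subsection.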
V-B Entropy Bias of Finite Length Encrypted Data
A question that needs to be investigated, is therefore how different entropy standard deviation metrics (Shannon- or Min-entropy) respond to random uniform data strings of varying length, and also how it is influenced by the symbol width, i.e. whether bit-entropy or octet-based entropy is used. The reason for this, as discussed in Subsection IV-F2, is that the metric shall be able to measure privacy leakage sufficiently close to zero in the following three cases:
For a perfect model IDS rule which detects and displays one or more non-changing attack vectors perfectly;
for anonymised IDS alarms from the IDS rule;
as a limit case for encrypted (e.g. pseudonymised) IDS alarms from the IDS rule, as the number of bits in the IDS alarm goes towards infinity.
The entropy standard deviation bias for finite length encrypted data can be analysed by simulating the entropy standard deviation as a function of the number of bits of data. The simulation is based on a set of Monte-Carlo experiments, one for each octet of data. Each standard deviation is the average of an ensemble of 10000 experiments. Bit-length is calculated for each octet as eight times the octet length, in order to have comparable x-axis values for bit- and octet-based data. The experiments are based on simulations using random uniform data selection, which means that a Normal distribution can be assumed.
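A scaled-down version of such a Monte-Carlo experiment might look as follows. This is a sketch under assumptions: it covers Shannon bit-entropy only, it takes the standard deviation over the ensemble of simulated payloads at each length (one reading of the setup described above), and it uses Python's `random` module as the source of random uniform data. The names `bit_shannon` and `simulate` are hypothetical.

```python
import math
import random
import statistics

def bit_shannon(data):
    """Shannon bit-entropy of a payload: entropy of its one-bit fraction."""
    ones = sum(bin(octet).count("1") for octet in data)
    p = ones / (8 * len(data))
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def simulate(octet_lengths, trials=1000, seed=0):
    """Map each payload length (in octets) to the entropy standard deviation
    measured over `trials` random uniform payloads of that length."""
    rng = random.Random(seed)
    result = {}
    for n in octet_lengths:
        entropies = [
            bit_shannon(bytes(rng.getrandbits(8) for _ in range(n)))
            for _ in range(trials)
        ]
        result[n] = statistics.pstdev(entropies)
    return result
```

Calling `simulate([1, 8, 64, 512])` returns one standard deviation per length; multiplying each key by eight gives the bit-length used on the x-axis. The values shrink as the payload grows, which is the finite-length bias discussed here.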
Figure V.1 shows a log-log plot of the entropy standard deviations. The bit-entropies both appear to be log-linear, which means that the bias for detecting a perfectly encrypted IDS alarm with length $n$ bits can be expressed as $\log \sigma(n) = a + b \log n$, where $a$ is the offset and $b$ is the slope of the log-log scale. This gives $\sigma(n) = C n^{b}$, where $C$ is constant. The slope can be calculated from the experimental data for Shannon bit-entropy and for Min-entropy respectively. The fitted slopes show that Shannon bit-entropy converges faster towards zero than Min-entropy (footnote 6: each factor in the bit-entropy calculations, one for Min-entropy and two for Shannon entropy, contributes to the convergence speed). Shannon bit-entropy has initially 2.7 times less bias than Min-entropy for perfectly encrypted (i.e. random uniform) data.
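Estimating the offset and slope from measured points amounts to an ordinary least-squares line fit in log-log space. The sketch below assumes base-10 logarithms and a hypothetical function name `loglog_fit`; it does not reproduce the slopes reported in the paper.

```python
import math

def loglog_fit(lengths, stds):
    """Least-squares fit of log10(std) = offset + slope * log10(length).

    Returns (slope, offset). All inputs must be strictly positive.
    """
    xs = [math.log10(n) for n in lengths]
    ys = [math.log10(s) for s in stds]
    if len(xs) < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all lengths are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset
```

With the fitted values, the bias at length n is approximately `10 ** offset * n ** slope`, so a more negative slope means faster convergence towards zero.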
The octet-based entropies perform very poorly during the initial transient phase, but then stabilise on a slope similar to the respective bit-entropy slopes, as shown in Figure V.1. This means that there is a significant, but approximately constant, difference between the bit- and octet-based metrics after the initial transient phase. Shannon bit-entropy ends up with a precision 143 times better than Shannon octet-entropy after 80 kbit. The difference in precision between bit- and octet-based Min-entropy is smaller, only 25 times.
A nice property is that the bias is systematic, which means that the entropy standard deviation calculations may be able to compensate for it by subtracting the expected bias from the entropy standard deviation, given that the number of samples (IDS alarms) is sufficiently large. However, this only makes sense if it is known that the data are encrypted. Since this in general is not known for the payload from IDS rules, and it will be wrong to correct for this bias for nonencrypted data, this means that the metric with fastest convergence speed is preferable.
It must also be noted that bit-based entropies (both Shannon and Min-entropy) are computationally less complex than octet-based Shannon entropy, which needs to calculate the weighted logarithm expression for each symbol in an octet. Counting the number of bits set to one in an octet or word (list of octets) can be done by calculating the Hamming weight, which is implemented in hardware on most modern Intel or AMD processors as the popcnt (population count) instruction. This opens up for efficient implementations of bit-entropy calculations for word chunks of up to 64 bits [intelSSE4reference], which is more efficient than iterating to calculate the octet frequencies, as required by octet-based entropies.
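The word-wise Hamming weight computation can be sketched as follows. This is only an illustration of the chunking idea: whether the Python interpreter maps the bit count onto a hardware popcnt instruction is an implementation detail that is not assumed here, and the function name `ones_in_payload` is hypothetical.

```python
def ones_in_payload(data):
    """Count the one-bits in a payload, processing 64 bits at a time."""
    ones = 0
    for start in range(0, len(data), 8):
        # Interpret up to eight octets as one unsigned word.
        word = int.from_bytes(data[start:start + 8], "big")
        # Hamming weight of the word; on Python 3.10+ word.bit_count()
        # is an equivalent alternative.
        ones += bin(word).count("1")
    return ones
```

The fraction `ones_in_payload(data) / (8 * len(data))` is the only statistic that the bit-based entropies need, whereas the octet-based variants require a frequency table with up to 256 entries.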
V-C Entropy Standard Deviation Difference between Encrypted and Plaintext data
Another foundational scenario that must be investigated is how well the proposed entropy algorithms and symbol definitions distinguish between encrypted and plaintext information. The entire theory behind the metric hinges on the assumption that there is a difference in entropy standard deviation between plaintext and, as a limit case, encrypted information. To determine whether this assumption is true or not, and which entropy configuration works best, we set up another Monte-Carlo simulation, this time comparing the entropy standard deviation of plaintext data with the entropy standard deviation of random uniform data for both Min- and Shannon-entropy, using both bit- and octet-based symbol definitions.
The experiment configuration calculates the average and the 95% confidence band