Big Data in Critical Infrastructures Security Monitoring: Challenges and Opportunities
Critical Infrastructures (CIs), such as smart power grids, transport systems, and financial infrastructures, are more and more vulnerable to cyber threats, due to the adoption of commodity computing facilities. Despite the use of several monitoring tools, recent attacks have proven that current defensive mechanisms for CIs are not effective enough against most advanced threats. In this paper we explore the idea of a framework leveraging multiple data sources to improve protection capabilities of CIs. Challenges and opportunities are discussed along three main research directions: i) use of distinct and heterogeneous data sources, ii) monitoring with adaptive granularity, and iii) attack modeling and runtime combination of multiple data analysis techniques.
Over the past years, the attacker community has developed smarter worms and rootkits to achieve a variety of objectives, ranging from credential compromise to sabotage of physical devices. Cyber threats target extremely diverse critical application domains, including e-commerce systems, corporate networks, datacenter facilities, and industrial systems. For example, in July 2010, the well-known Stuxnet cyber attack was launched to damage gas centrifuges located at the Natanz fuel enrichment plant in Iran by modifying their speed very quickly and sharply.
In August 2012, the Saudi oil giant Aramco was subjected to a large cyber attack
Analysis of data collected under real workload conditions plays a key role in monitoring system activities and detecting ongoing anomalies. CIs are currently equipped with a variety of monitoring tools, such as system and application logs, intrusion detection systems (IDS), and network monitors. However, recent cyber attacks have proven that today’s defensive mechanisms are not effective against the most advanced threats. For example, Stuxnet was able to fool the supervisory control and data acquisition (SCADA) system by altering the readings of sensors deployed on the centrifuge engines, and it went undetected for months.
Among the possible countermeasures, leveraging distinct and heterogeneous data sources can help to draw a clearer picture of the system to protect. Indeed, by correlating diverse information flows coming from multiple origins, not all of which are collected by current CI monitors, it may be possible to extract additional insights on potentially threatening activities that are being carried out. For instance, the presence of Stuxnet could possibly have been detected by monitoring several other operational and environmental parameters, such as the centrifuge energy consumption, and by correlating their readings to infer possible anomalies in the status (e.g., a fluctuating power consumption in a centrifuge, correlated with a stable rotational speed, can be considered an anomalous state). In addition, according to a CyberArk report, several successful attacks, including the ones reported above, exploited privileged accounts to achieve their objectives, and the same report states that “86% of large enterprises (across North America and EMEA) either do not know, or have grossly underestimated the magnitude of their privileged account security problem”. A possible solution could consist in monitoring the activities of such privileged accounts to pinpoint ongoing suspicious activity.
The use of multiple and diverse sources producing huge amounts of data calls for research on new solutions for monitoring and analysis, able to recognize ongoing malicious activities in CIs in a timely and efficient manner. This paper introduces the basic notions of a framework for data-driven security monitoring and protection of CIs. Our proposal stems from the needs and challenges of effective security monitoring and describes an architectural solution to them, moving along the following research directions: i) the use of large amounts of data collected from distinct and heterogeneous data sources; ii) the adoption of monitoring strategies with an adaptable level of granularity, to face the issue of big data volumes; iii) the formalization of attack models and the combination of diverse state-of-the-art data analysis techniques to improve the capability of detecting threats and triggering protection actions.
2 Needs and Challenges
2.1 Multiple Data Sources
Using the distinct and heterogeneous data sources available in today’s CIs can help to draw a clearer picture of the system to protect and of the threatening activities being carried out. The aim is to improve the protection of future CIs by exploiting the (hidden) value of data that are already available but not fully exploited in today’s CIs.
However, as the size and complexity of systems increase, the amount of information that can be collected by data sources skyrockets. For example, in the 1300-node data center we target as a case study (see Section 3.4), the monitoring system produces about 16.6 GB of data per day, with observed traffic peaks of about 240000 pkt/s. This is a consequence of multiple factors: (i) the increasing availability of cheap hardware probes, (ii) the ubiquity of communication infrastructures (either wired or wireless) and of the Internet, and (iii) novel algorithmic approaches that today make handling huge amounts of data practical. A further important aspect is that the heterogeneity of collected data is going to increase as well: new data sources are connected to monitoring systems to collect and analyze different kinds of data, as this could potentially provide useful insights on the current system status.
This mix of factors marks the shift from a mostly human-controlled distributed monitoring model (think, for example, of how railway companies in the past controlled the status of their infrastructures through hundreds of people deployed along their tracks to monitor locally and then report to their bosses) to a fully automated IT infrastructure for monitoring, which tries to relieve humans as much as possible of the burden of analyzing data to infer high-level information. Making this new model practical in scenarios where huge amounts of heterogeneous data are available calls for research on new algorithmic and architectural solutions able to withstand these challenges.
2.2 Monitoring with different granularity
An accurate tuning of the number of variables to be monitored and of the frequency at which data are collected from system probes appears fundamental to studying and planning, at design time, the computational load on the monitoring infrastructure.
First, it is necessary to select what sources are worth monitoring amongst the many available, considering the target system and also the expected workload. For example in  sources at the OS level, such as amount of free memory, disk throughput, or network throughput, are selected out of hundreds of possible indicators; their relevance for anomaly detection is further explored and confirmed in .
Appropriate selection of data sources is relevant but unfortunately may not be sufficient. In large-scale critical infrastructures, given the number of components, we can reasonably assume that monitoring each parameter at the best possible resolution is infeasible. Thus, it may be necessary to define monitoring strategies that minimize the amount of data to analyze, and consequently the monitoring resources to be used, without decreasing the efficacy of the monitor, e.g., by adopting different monitoring granularities depending on the current alert level of the system and of its components. This calls for the definition of new solutions able to find the right compromise in terms of monitoring grain, without having a negative impact on the monitoring accuracy and without depleting the resources devoted to monitoring.
2.3 On-Line Big Data Processing
The large amount of collected data also implies difficulties in the data processing phase. Several techniques and tools have been proposed to analyze raw data with the objective of detecting ongoing attacks. However, the performance of the detection, in terms of coverage and false alarm rate, strictly depends on the adopted technique. Solutions that encompass the (on-line) combination of multiple analysis techniques need to be investigated, in order to improve the capability of detecting potential threats and triggering protection actions on the CI. Recent studies have also proven the usefulness of (temporal and/or typed) graph-based attack models [6, 7, 8, 9, 10, 11, 12]. If we assume that the input log is a sequence of events having a type and a timestamp, a graph-based attack model has event types as vertices and is defined in such a way that the paths from start to terminal vertices in the model represent a critical event/attack when they correspond to subsequences of the log. Such subsequences are also called instances of the attack model. However, finding correlations among data by comparing analyzed data with attack models and producing alerts in an on-line fashion may become extremely difficult when the number of attack models at hand and the size of the input log increase. It is therefore important to ensure the scalability of the algorithms and data structures used when performing the conformance checking task.
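The notion of an attack-model instance can be illustrated with a minimal sketch. It assumes a hypothetical model given as an adjacency map over event types and a log given as (timestamp, type) pairs; the function name `find_instance`, the parameter names, and the event types used in the usage note are illustrative and are not taken from the cited works. The sketch scans the log once and keeps, for every reached vertex, one witness subsequence; it does not address the scalability issues discussed above.

```python
from typing import Dict, List, Optional, Set, Tuple

Event = Tuple[int, str]  # (timestamp, event type)


def find_instance(
    log: List[Event],
    edges: Dict[str, Set[str]],
    start: Set[str],
    terminal: Set[str],
) -> Optional[List[Event]]:
    """Return one log subsequence matching a start-to-terminal path, if any."""
    witness: Dict[str, List[Event]] = {}  # vertex -> subsequence reaching it
    for event in log:
        _, etype = event
        if etype in witness:
            continue  # an earlier witness for this vertex already exists
        candidate = None
        if etype in start:
            candidate = [event]
        else:
            # Extend any already-reached predecessor that has an edge to etype.
            for pred, path in witness.items():
                if etype in edges.get(pred, set()):
                    candidate = path + [event]
                    break
        if candidate is not None:
            witness[etype] = candidate
            if etype in terminal:
                return candidate
    return None
```

For example, with the hypothetical model `scan -> login_failure -> privilege_escalation`, a log that contains these three types in this order (possibly interleaved with unrelated events) yields the three matching events, while a log without any `scan` event yields `None`.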
3 A framework for Data-Driven Security of CIs
Figure 1 proposes an architectural solution to the discussed challenges. The key idea is to combine several data sources and different data analysis techniques to improve the capability of detecting potential threats and triggering protection actions on the CI. The results of the analysis are also useful to assess the current alert level of the CI’s components and to adapt the grain of monitoring through the Monitoring Adapter, e.g., to intensify the monitoring of components suspected of malicious activity and reduce the monitoring of the others. The main blocks of the framework are described in the following.
3.1 Raw Data Collection
As the name suggests, the Raw Data Collection block is responsible for gathering raw data from the monitored CIs, exploiting available data monitoring technologies and/or logs produced by diverse software layers or hardware controllers.
Many technologies for data monitoring have been developed over the past thirty years, ranging from relatively simple data collection tools (such as Unix syslog
Data collectable through these monitoring systems can be classified into three broad categories: performance, environment, and operational data. Performance data are among the most monitored and relate to the use of system resources. Main examples are CPU and memory usage. Other sources concern the use of the network, such as inbound and outbound traffic. Environment data are rarely exploited, even though they could be very useful to detect ongoing cyber-physical attacks. They describe the status of the environment in which the system is placed and the parameters of nodes that are not related to the performed activities. This category includes temperature, humidity, and the status of cooling systems, if any. The monitoring of energy consumption also falls in this category. Finally, operational data encompass all the information obtained by collecting, and presumably parsing and filtering, logs from the various software layers of the system, including event logs produced by applications, the OS, and IDSes, if present.
3.2 Adaptive Monitoring
As shown in Figure 1, the main idea is to adapt the monitoring by dynamically changing what raw data to collect and analyze, thus shaping at run time the resource utilization of the monitoring framework.
All the monitoring systems described in Section 3.1 have been designed to allow users to plug in custom modules, extending their functionality to fit application-specific needs. The plug-in modules can be implemented so as to receive external commands that dynamically adapt their monitoring capabilities. An initial proposal to reduce the complexity (i.e., the quantity of collected data, hence the required storage space and processing resources) is to define two different monitoring layers. By default, the monitoring system operates in a coarse-grained layer that collects a limited number of variables, causing a high False Alarm Rate but also a low Missed Detection Rate. In this configuration, the system acts as a very suspicious monitor that observes a reduced set of indicators and easily raises alarms. When the coarse-grained layer raises an alarm in a specific area of the system, it triggers a fine-grained layer that monitors that specific area through an enlarged set of indicators and a finer granularity of data, possibly reducing the False Alarm Rate.
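The two-layer idea can be sketched as follows. This is only an illustration under stated assumptions: the indicator names, the threshold, the voting rule, and the `read` callback (which stands for whatever probe interface the monitoring system exposes) are all hypothetical and not part of the framework description. Readings are assumed to be normalized to the range [0, 1].

```python
from typing import Callable, Dict, Iterable, Sequence

# Hypothetical indicator sets: the coarse layer watches few variables,
# the fine layer watches an enlarged set.
COARSE: Sequence[str] = ("cpu",)
FINE: Sequence[str] = ("cpu", "memory", "net_out")

Reader = Callable[[str, str], float]  # (area, indicator) -> normalized reading


def count_exceeding(read: Reader, area: str, names: Sequence[str], threshold: float) -> int:
    """Number of indicators in `names` whose reading exceeds the threshold."""
    return sum(1 for name in names if read(area, name) > threshold)


def two_layer_check(
    areas: Iterable[str], read: Reader, threshold: float = 0.6, min_votes: int = 2
) -> Dict[str, bool]:
    """Map each area that raised a coarse alarm to the fine-grained verdict."""
    verdicts: Dict[str, bool] = {}
    for area in areas:
        # Coarse layer: suspicious by design, any exceeding indicator raises an alarm.
        if count_exceeding(read, area, COARSE, threshold) == 0:
            continue  # no alarm, so the fine indicators are never read
        # Fine layer: confirm only if enough indicators agree.
        verdicts[area] = count_exceeding(read, area, FINE, threshold) >= min_votes
    return verdicts
```

In this sketch, areas that do not raise a coarse alarm never have their fine-grained indicators read, which is where the saving in collected data comes from; the fine layer then filters out coarse alarms that are not supported by additional indicators.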
The two-layer approach may lead to two different design solutions to be explored:
- The monitoring infrastructure is created with sufficient spare resources that are used to activate the fine-grained layer (we call this the overprovisioning approach).
- The monitored system is created such that it does not have spare resources, or has only a very limited number of spare resources. The activation of the fine-grained layer in a certain area of the critical infrastructure requires reducing the monitoring activity in some other areas (we call this the downgrade approach).
With the first approach, resources remain unused most of the time, and the number of possible concurrent activations of the fine-grained layer is limited by the amount of spare resources. The second approach has no spare resources, but downgrading the monitoring activity risks exposing the system, leading to an insufficient level of protection and/or to an unacceptable False Alarm Rate. The two approaches could also be merged, trying to take advantage of both.
Clearly, the selection of the right approach and its tuning require understanding the distribution and temporal persistence of anomalies in the system. This is relevant to understand the expected frequency of fine-grained layer activations, and the extent to which it is possible to reduce the monitoring resolution without significantly affecting the detection of threats.
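The trade-off between the two design solutions can be made concrete with a small sketch. It assumes a hypothetical cost model in which each area consumes `coarse_cost` resource units when monitored coarsely and `fine_cost` units when monitored finely, and in which a downgraded area is simply switched off; the function name, the cost values, and the level labels are illustrative assumptions. With a capacity that includes spare units, the function behaves as in the overprovisioning approach; with a capacity that only covers coarse monitoring, it falls back to downgrading quiet areas.

```python
from typing import Dict, List, Set


def plan_levels(
    areas: List[str],
    alarmed: Set[str],
    capacity: int,
    coarse_cost: int = 1,
    fine_cost: int = 3,
) -> Dict[str, str]:
    """Assign a monitoring level ("fine", "coarse", "off") to every area."""
    levels = {area: "coarse" for area in areas}
    used = coarse_cost * len(areas)
    extra = fine_cost - coarse_cost  # additional units needed by the fine layer
    for area in areas:
        if area not in alarmed:
            continue
        # Quiet areas that can still give up their coarse monitoring.
        donors = [a for a in areas if a not in alarmed and levels[a] == "coarse"]
        while used + extra > capacity and donors:
            levels[donors.pop()] = "off"  # downgrade approach
            used -= coarse_cost
        if used + extra <= capacity:
            levels[area] = "fine"
            used += extra
    return levels
```

With four areas, one of which is alarmed, a capacity of 6 units activates the fine-grained layer without touching the other areas, whereas a capacity of 4 units activates it only by switching off two quiet areas, which is exactly the exposure risk discussed above.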
On the other side of the spectrum, general purpose data analysis systems, which include a large family of tools like rule engines (e.g. Drools
A possible solution we foresee for the Monitoring Adapter block is a hybrid approach, where existing monitoring systems and general-purpose data analysis tools are mixed and deployed in such a way as to maximize their effectiveness in reaching the desired adaptability goals. Monitoring systems could locally observe and analyze specific subsystems to provide higher-level information to data analysis tools, for correlation with information provided by other sources. The complexity involved in mixing these approaches, however, remains to be studied.
3.3 Data Analysis
This component analyzes the data and outputs information about (i) how to adapt the grain of the monitoring, and (ii) what protection actions should be performed on the CI. Starting from our past experience in attack modeling and data analysis, we consider the following functional blocks.
Data Processing. Collected raw data typically contain useless or redundant information that can undermine the goodness of performed analysis . The first analysis step to be performed is thus to polish raw data, adopting filtering or event coalescence techniques, such as the ones analyzed in .
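As an illustration of event coalescence, the following sketch groups log events whose timestamps are close to each other, on the assumption that they are manifestations of the same underlying problem. The sliding-gap rule, the function name `coalesce`, and the `window` parameter are assumptions made for illustration; the techniques referenced above may use different criteria.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, log message)


def coalesce(events: List[Event], window: float) -> List[List[Event]]:
    """Group events so that consecutive events in a group are at most `window` apart."""
    groups: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)  # close enough to the previous event
        else:
            groups.append([event])  # gap too large: start a new group
    return groups
```

The choice of `window` is critical: a value that is too small splits related events into separate groups, while a value that is too large merges unrelated ones.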
Attack Modeling. This functional block provides tools to define and statically analyze attack models. The attack model used in this block must be capable of: (i) providing a high degree of flexibility in representing many different security scenarios in a compact way; (ii) allowing the specification of various kinds of constraints (e.g., temporal) on possible attacks; (iii) representing attack scenarios at different abstraction levels, allowing to “focus” the conformance checking task in various ways. Typed temporal graph-based attack models  appear to be good options for the above requirements. They are rich in terms of temporal constraints that can be expressed. In addition, it is relatively easy to handle the definition of generalization/specialization hierarchies among event types.
As an example of a temporal graph-based model, consider the hypergraph shown in Fig. 2. Here it is assumed that the log is a sequence of tuples that represent high-level actions corresponding to types of possible security exploits; such logs can be built on-line from operational data. Actions are depicted with plain circles (), while (hyper-)edges are depicted with dotted circles (). For instance, according to the semantics given in , is a start hyperedge (indicated with a white arrow) so an attack can begin with it. The vertex labeled Local Access requires the presence of a group of log tuples with one or more tuples of type Local Access (cardinality constraint “+”); the same applies to the Directory Traversal vertex. The hyperedge itself represents an association between the two vertices, with a temporal constraint of time points for the log segment. Hyperedge requires, in any order: (i) one or more Directory Traversal tuples; (ii) between 2 and 5 SQL Injection tuples; (iii) one or more Buffer Overflow tuples. The same applies to other hyperedges, such as and . In particular, since is a terminal hyperedge (indicated with a black arrow), an attack can end with it.
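The cardinality and temporal constraints attached to a single hyperedge can be checked on a log segment as in the following sketch. The cardinalities mirror the example above, while the value of the temporal bound (`max_span`), the data layout, and the decision to ignore tuples of other types are assumptions made for illustration and do not reproduce the formal semantics of the cited model.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

LogTuple = Tuple[int, str]  # (time point, action type)

# (minimum, maximum) occurrences; None means "no upper bound" (constraint "+").
REQUIREMENTS: Dict[str, Tuple[int, Optional[int]]] = {
    "Directory Traversal": (1, None),
    "SQL Injection": (2, 5),
    "Buffer Overflow": (1, None),
}


def satisfies(
    segment: List[LogTuple],
    requirements: Dict[str, Tuple[int, Optional[int]]],
    max_span: int,
) -> bool:
    """Check cardinality constraints (in any order) and the temporal bound."""
    if not segment:
        return False
    times = [t for t, _ in segment]
    if max(times) - min(times) > max_span:
        return False  # the segment is longer than the temporal constraint allows
    counts = Counter(kind for _, kind in segment)
    for kind, (low, high) in requirements.items():
        n = counts[kind]
        if n < low or (high is not None and n > high):
            return False
    return True
```

A segment with two SQL Injection tuples, one Directory Traversal tuple, and one Buffer Overflow tuple spread over 5 time points satisfies the requirements when `max_span` is 10, but not when `max_span` is 3.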
Conformance Checking. The main purpose of this functional block is to detect attack instances in sequences of logged events by checking the conformance of the logged behavior with the given set of attack models. The main requirement of this block is obviously scalability. In real-world critical infrastructure protection scenarios, in fact, logged events are streamed into the system on-line and, ideally, we would like to raise an alert as soon as an event with a “criticality” above the threshold is logged. It is therefore important to define appropriate data structures that ensure fast access to the relevant information, as well as suitable algorithms that are tightly coupled with such structures in order to ensure the fast detection of an attack [6, 10]. Moreover, it is important to identify conditions that make the problem tractable from a theoretical point of view. One possibility is to impose specific limitations on the structure of the allowed models. In fact, recent work on the detection of instances of temporal automaton-like models in sequences of logged events [7, 8, 10] has shown that acceptable detection times in real-world cases can be obtained by limiting the number of partial solutions through a form of early filtering based on temporal constraints. Finally, the parallelization of both the data structures and the conformance checking algorithms (see, e.g., ) appears mandatory when we target big data for security protection.
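The effect of early filtering on partial solutions can be sketched for the simplest possible model, a fixed sequence of event types that must occur within a maximum time span. The class below is a hypothetical illustration, not the indexing structures of the cited works: it keeps every partial match in memory and, on each new event, first discards those that can no longer complete within `max_span`.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, str]  # (timestamp, event type)


class StreamingMatcher:
    """On-line detection of a sequential pattern with a bound on its time span."""

    def __init__(self, pattern: Sequence[str], max_span: int) -> None:
        self.pattern = list(pattern)
        self.max_span = max_span
        self.partials: List[List[Event]] = []  # incomplete matches seen so far

    def feed(self, ts: int, etype: str) -> List[List[Event]]:
        """Consume one event and return the instances it completes."""
        # Early filtering: drop partial matches that started too long ago.
        self.partials = [p for p in self.partials if ts - p[0][0] <= self.max_span]
        completed: List[List[Event]] = []
        extended: List[List[Event]] = []
        for partial in self.partials:
            if self.pattern[len(partial)] == etype:
                match = partial + [(ts, etype)]
                if len(match) == len(self.pattern):
                    completed.append(match)
                else:
                    extended.append(match)
        if self.pattern[0] == etype:
            if len(self.pattern) == 1:
                completed.append([(ts, etype)])
            else:
                extended.append([(ts, etype)])
        self.partials.extend(extended)
        return completed
```

Without the filtering step, the list of partial matches would grow with the length of the log; with it, its size is bounded by the number of relevant events that fall within one `max_span` window.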
Invariant-based Mining. Invariants are properties of a system that are guaranteed to hold for all of its executions. If those properties are found to be violated (or broken) while monitoring the execution of the system, it is possible to trigger alarms useful to undertake immediate protection actions. As an example, Figure 3 shows a relationship between the memory and CPU usage discovered from workload traces of the data center discussed in Section 3.4. Several studies have confirmed that it is possible to discover invariants from real-world complex systems [17, 18]. However, in our case the challenge is to discover invariant relationships in the big data collected from the CI. The Invariant-based Mining block intends to face this issue by performing two tasks: i) automatically mining invariants from collected data streams using autoregressive models, and ii) detecting at runtime when invariant relationships are broken, to trigger immediate action. A preliminary application of the approach to real data collected from a production cloud software system has proven its feasibility and usefulness to discover execution deviations and SLA violations .
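The two tasks of the block, mining an invariant and checking it at runtime, can be sketched as follows. Note that, for brevity, a plain linear least-squares fit between two metrics stands in here for the autoregressive models mentioned above; the function names, the tolerance rule based on the largest training residual, and the `slack` factor are assumptions made for illustration.

```python
from typing import List, Tuple


def fit_invariant(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Fit y = slope * x + intercept and return (slope, intercept, max residual)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    max_residual = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return slope, intercept, max_residual


def is_broken(
    x: float,
    y: float,
    slope: float,
    intercept: float,
    max_residual: float,
    slack: float = 3.0,
) -> bool:
    """Flag a new observation whose residual exceeds slack times the training residual."""
    return abs(y - (slope * x + intercept)) > slack * max_residual
```

In a setting like the one of Figure 3, `xs` and `ys` would be CPU and memory usage samples collected during normal operation; an observation that deviates from the learned relationship by more than the tolerance is reported as a broken invariant.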
Bayesian Inference. Security monitors usually produce a large number of false alerts. A Bayesian network approach can be used on top of the CI to correlate alerts coming from different sources and to filter out false notifications. This approach has been successfully used to detect credential stealing attacks . Raw alerts generated during the progression of an attack, such as user-profile violations and IDS notifications, are correlated through a Bayesian network to pinpoint misuse performed by compromised users. The approach was able to remove around 80% of false positives (i.e., non-compromised users being declared compromised) without missing any compromised user.
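How correlating several alerts can suppress false positives is illustrated by the following sketch. It uses the simplest possible network, a naive Bayes structure in which alerts are conditionally independent given the state of the user, which is a simplification of the Bayesian network approach referred to above; the alert names, the prior, and the conditional probabilities in the usage note are invented for illustration.

```python
from typing import Dict, Set, Tuple


def posterior_compromised(
    observed: Set[str],
    prior: float,
    likelihoods: Dict[str, Tuple[float, float]],
) -> float:
    """Return P(compromised | alerts).

    likelihoods[alert] = (P(alert | compromised), P(alert | not compromised)).
    Alerts listed in `likelihoods` but absent from `observed` count as evidence too.
    """
    p_comp = prior
    p_benign = 1.0 - prior
    for alert, (given_comp, given_benign) in likelihoods.items():
        if alert in observed:
            p_comp *= given_comp
            p_benign *= given_benign
        else:
            p_comp *= 1.0 - given_comp
            p_benign *= 1.0 - given_benign
    return p_comp / (p_comp + p_benign)
```

For instance, with a prior of 0.01 and hypothetical likelihoods of (0.7, 0.05) for an IDS alert and (0.6, 0.1) for a profile violation, an IDS alert alone yields a posterior of roughly 0.06, while both alerts together yield roughly 0.46. A single raw alert is thus treated as a probable false notification, whereas concurring alerts are not.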
Fuzzy Logic. Statistical methods can cause many false alarms. This is due to the difficulty of defining exact and crisp rules describing whether an event is an anomaly or not. Boundaries between the normal and the anomalous behavior of a system are not clear, and the degree of intrusion at which an alarm is to be raised may vary in different situations . Fuzzy logic is derived from fuzzy set theory to deal with approximate reasoning rather than precise data, and efficiently helps to smooth the abrupt separation of normality and abnormality . Anomalous events can be described by means of linguistic variables characterized by linguistic terms . The degree of truth of an expression is not crisp (i.e. of the type “ is anomaly ” or “ is not anomaly ”), and the use of fuzzy linguistic variables allows one to express vagueness in measurements. In some applications, an attack detection accuracy of 99.95% has been reached .
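A minimal sketch of a fuzzy anomaly rule is given below. The two linguistic terms ("high load" and "many failed logins"), their ramp-shaped membership functions, the breakpoints, and the use of the minimum as fuzzy AND are assumptions chosen for illustration; they are not taken from the cited works.

```python
def ramp(x: float, low: float, high: float) -> float:
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def anomaly_degree(cpu_load: float, failed_logins: float) -> float:
    """Degree of truth of: load is high AND failed logins are many."""
    high_load = ramp(cpu_load, 0.5, 0.9)        # hypothetical breakpoints
    many_failures = ramp(failed_logins, 2, 10)  # hypothetical breakpoints
    return min(high_load, many_failures)        # fuzzy AND as minimum
```

Instead of a crisp verdict, the rule returns a degree between 0 and 1: a load of 0.7 with 6 failed logins is anomalous to a degree of about 0.5, so the alarm threshold can be tuned to the situation at hand.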
3.4 Case Studies
In the framework of the Research Project of National Interest (PRIN) “TENACE - Protecting National Critical Infrastructures from Cyber Threats”, we plan to experiment with the framework on data extracted from two real-world systems.
The first is the data center of the Italian Ministry of Economy and Finance (MEF), which represents an important CI because it manages a wide range of software, spanning from very large applications with millions of end users, such as those for consumer credit support, to small and highly mission-critical applications, such as those for managing the auctions of Italian Government Bonds and Treasury bills, and those for monitoring the government securities market (MTS). In its architecture, each rack is organized in up to five sub-racks. Each sub-rack can include up to sixteen blade servers and is connected to the datacenter network through four switches. A probe is connected to each switch in order to monitor the network traffic flowing through it. Two smart PDUs are connected to each sub-rack to gather information about energy consumption. This configuration allows non-intrusive monitoring and makes it possible to consider the system as a black box.
The second is the S.Co.P.E. supercomputer, a scientific data center at the University of Naples Federico II and the Italian Institute of Nuclear Physics (INFN). It is equipped with a monitoring system that collects data similar, in type and amount, to those collected by data centers of real CIs.
S.Co.P.E. mainly runs scientific batch jobs and also acts as a Tier-2 resource of the Worldwide LHC Computing Grid (WLCG)
4 Closing Remarks and Open Issues
Field data represent a rich source of information for improving the security monitoring and protection of future critical infrastructures. Existing monitoring technologies already offer the possibility of collecting different types of data, such as performance, environmental, and operational data. The idea of collecting these types of heterogeneous data, and analyzing them through a combination of state-of-the-art attack modeling and data analysis techniques, is promising for improving the accuracy of detection and driving the adaptation of the monitoring itself. In addition, the availability of existing analysis approaches and of open, configurable monitoring tools represents a good starting point for the viability of the proposed framework.
However, achieving the envisioned research objectives requires facing many open issues, such as:
- lack of publicly available data sets and ground truth. Such data sets are vital for the validation of approaches like the one proposed in this paper. Some datasets are outdated (such as DARPA) or unlabeled (such as iCTF), while others target standard IT systems (such as UNB iSCX). To date, there are no datasets available for CIs.
- difficulty of performing long-running tests on real-world systems (or representative reproductions). These tests are very useful to improve the understanding of phenomena and to produce realistic datasets. However, honeypot-like approaches cannot be adopted in the case of CIs, due to the possibility of physical damage as a consequence of an attack. Innovative controlled environments are to be created, involving the expertise and equipment of CI stakeholders.
- need for strategies for changing the monitoring configuration at runtime on the basis of some predefined logic, without being forced to stop and restart the services in charge of gathering and analyzing data. This is important because it provides considerable freedom in adapting the monitoring without interrupting related services.
- urgency of scalable solutions to combine the outputs generated by the different data analysis techniques, such as the ones envisaged in the framework. Further research is in order, involving the different views and know-how of the researchers active in these fields.
Hence, the path towards industry-ready solutions calls for further joint industry-academia efforts, involving major players and stakeholders, to take real advantage of big data for the security monitoring of future critical infrastructures.
This work has been supported by the TENACE PRIN Project (n. 20103P34XC) funded by the Italian Ministry of Education, University and Research.
- The Large Hadron Collider computing grid, http://wlcg.web.cern.ch
- N. Falliere, L. O. Murchu, and E. Chien, “W32.Stuxnet Dossier,” http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf, 2011.
- H. Hashimoto and T. Hayakawa, “Distributed cyber attack detection for power network systems,” in Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, Dec 2011, pp. 5820–5824.
- CyberArk, “Privileged Account Security & Compliance Survey Report,” http://www.cyberark.com/, 2013.
- A. Bovenzi, F. Brancati, S. Russo, and A. Bondavalli, “A statistical anomaly-based algorithm for on-line fault detection in complex software critical systems,” in SAFECOMP, 2011, pp. 128–142.
- A. Bondavalli, F. Brancati, A. Ceccarelli, D. Santoro, and M. Vadursi, “Experimental analysis of the first order time difference of indicators used in the monitoring of complex systems,” in Measurements and Networking Proceedings (M&N), 2013 IEEE International Workshop on. IEEE, 2013, pp. 138–142.
- M. Albanese, A. Pugliese, V. S. Subrahmanian, and O. Udrea, “Magic: A multi-activity graph index for activity detection,” in IRI. IEEE Systems, Man, and Cybernetics Society, 2007, pp. 267–272.
- M. Albanese, S. Jajodia, A. Pugliese, and V. S. Subrahmanian, “Scalable analysis of attack scenarios,” in ESORICS, ser. Lecture Notes in Computer Science, V. Atluri and C. Díaz, Eds., vol. 6879. Springer, 2011, pp. 416–433.
- ——, “Scalable detection of cyber attacks,” in CISIM, ser. Communications in Computer and Information Science, N. Chaki and A. Cortesi, Eds., vol. 245. Springer, 2011, pp. 9–18.
- A. Guzzo, A. Pugliese, A. Rullo, and D. Saccà, “Intrusion detection with hypergraph-based attack models,” in GKR, ser. Lecture Notes in Computer Science, M. Croitoru, S. Rudolph, S. Woltran, and C. Gonzales, Eds., vol. 8323. Springer, 2013, pp. 58–73.
- M. Albanese, A. Pugliese, and V. S. Subrahmanian, “Fast activity detection: Indexing for temporal stochastic automaton-based activity models,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 2, pp. 360–373, 2013.
- A. Pugliese, V. S. Subrahmanian, C. Thomas, and C. Molinaro, “Pass: A parallel activity search system,” IEEE Trans. Knowl. Data Eng., to appear.
- D. Cotroneo, A. Pecchia, and S. Russo, “Towards secure monitoring and control systems: Diversify!” in Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, June 2013, pp. 1–2.
- G. F. Creţu-Ciocârlie, M. Budiu, and M. Goldszmidt, “Hunting for problems with artemis,” in Proceedings of the First USENIX Conference on Analysis of System Logs, ser. WASL’08. Berkeley, CA, USA: USENIX Association, 2008, pp. 2–2. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855886.1855888
- M. Cinque, D. Cotroneo, and A. Pecchia, “Event logs for the analysis of software failures: A rule-based approach,” Software Engineering, IEEE Transactions on, vol. 39, no. 6, pp. 806–821, June 2013.
- A. Pecchia and S. Russo, “Detection of software failures through event logs: An experimental study,” in Software Reliability Engineering (ISSRE), 2012 IEEE 23rd Intnl. Symp. on, Nov 2012, pp. 31–40.
- C. Di Martino, M. Cinque, and D. Cotroneo, “Assessing time coalescence techniques for the analysis of supercomputer logs,” in Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on, June 2012, pp. 1–12.
- A.B. Sharma et al., “Fault detection and localization in distributed systems using invariant relationships,” in Proc. of 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013.
- J.-G. Lou et al., “Mining invariants from console logs for system problem detection,” in Proc. of the USENIX annual technical conference, ser. USENIXATC’10, 2010.
- S. Sarkar, R. Ganesan, M. Cinque, F. Frattini, S. Russo, and A. Savignano, “Mining invariants from saas application logs,” in Tenth European Dependable Computing Conference (EDCC 2014), May 2014.
- A. Pecchia, A. Sharma, Z. Kalbarczyk, D. Cotroneo, and R. K. Iyer, “Identifying compromised users in shared computing infrastructures: A data-driven bayesian network approach,” in Proceedings of the International Symposium on Reliable Distributed Systems (SRDS). IEEE Computer Society, 2011, pp. 127–136.
- A. Feizollah, S. Shamshirband, N. Anuar, R. Salleh, and M. Mat Kiah, “Anomaly detection using cooperative fuzzy logic controller,” in Intelligent Robotics Systems: Inspiring the NEXT, ser. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2013, vol. 376, pp. 220–231.
- F. Frattini, M. Esposito, and G. Pietro, “Mobifuzzy: A fuzzy library to build mobile dsss for remote patient monitoring,” in Autonomous and Intelligent Systems, ser. Lecture Notes in Computer Science, M. Kamel, F. Karray, and H. Hagras, Eds. Springer Berlin Heidelberg, 2012, pp. 79–86.