RT-DAP: A Real-Time Data Analytics Platform for Large-scale Industrial Process Monitoring and Control
In most process control systems nowadays, process measurements are periodically collected and archived in historians. Analytics applications process the data, and provide results offline or in a time period that is considerably slow in comparison to the performance of the manufacturing process. Along with the proliferation of Internet-of-Things (IoT) and the introduction of “pervasive sensors” technology in process industries, increasing number of sensors and actuators are installed in process plants for pervasive sensing and control, and the volume of produced process data is growing exponentially. To digest these data and meet the ever-growing requirements to increase production efficiency and improve product quality, there needs to be a way to both improve the performance of the analytics system and scale the system to closely monitor a much larger set of plant resources. In this paper, we present a real-time data analytics platform, called RT-DAP, to support large-scale continuous data analytics in process industries. RT-DAP is designed to be able to stream, store, process and visualize a large volume of real-time data flows collected from heterogeneous plant resources, and feedback to the control system and operators in a real-time manner. A prototype of the platform is implemented on Microsoft Azure. Our extensive experiments validate the design methodologies of RT-DAP and demonstrate its efficiency in both component and system levels.
Key objectives in process industries include maintaining the product quality within product specifications, improving the overall efficiency of the process operations, and constantly improving health, safety and environment [1, 2]. To achieve these objectives practitioners have been using both first principle [3, 4] and data driven methods [5, 6], and in some cases combinations of the two methods. In most process control systems nowadays, process measurements are periodically collected and communicated to gateways, controllers, and workstations. Feedback is provided through alarm/control messages and visualization to both the system itself and to human operators. All collected data are put into a time-series form that can be used by the modeler and archived in historians. Analytics applications read data from these historians, find the right set of features and the correct subset of data to use to capture the desired variation, and provide results off-line or in a time period that is considerably slow in comparison to the performance of the manufacturing process.
Due to the lack of an efficient and scalable real-time data analytics infrastructure specifically designed for process monitoring and control applications, only a minimal set of process conditions and plant equipment is monitored, and a limited set of control and analytics algorithms is provided. With the proliferation of Internet-of-Things (IoT)  and the introduction of “pervasive sensors” technology in process industry, the volume of produced process measurements and the complexity of the analytics tasks are growing exponentially. To digest these data and meet the ever-growing requirements to increase production efficiency and improve product quality, there needs to be a way to provide easy access to the data, improve performance of the analytics tasks and scale the system to closely monitor a much larger set of plant resources and their operational environments.
In this paper, we present the design and implementation of a scalable real-time data analytics platform, called RT-DAP, to support continuous data analytics for a wide range of industrial process monitoring and control applications. As shown in Fig. 1, RT-DAP will serve as the core component in the real-time sensing, communication, analytics and control loop for the next-generation large-scale process monitoring and control systems. A large volume of real-time data flows will be collected from heterogenous plant resources via the smart plant communication infrastructures, typically a combination of wired network backbones and wireless edge networks. These real-time process measurements will be streamed into RT-DAP for real-time data storage, modelling and analytics. Data visualization and control decisions will be made and fed back to the system and operators in a real-time manner for both emergency and daily process operations.
RT-DAP differs from existing general-purpose computing platforms with the following key components: 1) a unified messaging protocol to support massive real-time process data, 2) a distributed time-series database specifically designed for storing and querying process data, and 3) a model development studio for designing data and control flows in process control related analytics tasks. The developed models and analytics modules will be deployed on an elastic real-time processing framework to distribute the computation on the parallel computing infrastructure. We further design an industrial IoT field gateway, called IIoT-FG, through which RT-DAP can be connected to various heterogeneous plant resources for distributed data acquisition. Combined with the real-time communication infrastructure deployed in process plants nowadays, the integrated communication and computing framework will significantly improve the scalability, reliability and real-time performance of industrial process monitoring and control applications, and close the loop of process monitoring, communication, decision making, and control.
To validate the design methodologies, we implement a prototype of RT-DAP on Microsoft Azure  and a prototype of IIoT-FG on Minnowboard . Our extensive experiments on both component- and system-level testing demonstrate the efficiency of the platform in providing real-time data streaming, storage, decision making and visualization for real-time analytics applications in process industries.
The remainder of the paper is organized as follows. Section 2 reviews the existing real-time data analytics platforms, in particular for industrial automation applications. We present our platform design in Section 3 and describe its implementation on Microsoft Azure in Section 4. We present the performance evaluation and summarize our experimental results in Section 5. Section 6 concludes the paper and discusses the future work.
2 Related Works
Internet-of-Things (IoT)  has drawn a tremendous amount of attention in recent years. A growing number of physical objects and devices are being connected to the Internet at an unprecedented rate to collect and exchange data . It is projected that 50 billion devices will be connected to the Internet by the year 2020 and the total amount of exchanged data volume will reach 35 ZB [11, 12, 13]. To fully utilize these data, many data acquisition and analytics systems have been developed in different application domains. In the field of industrial automation, the focus of our study in this paper, many analytics platforms have also been developed. These platforms include but are not limited to the HAVEn  from HP, the Industrial Solutions System Consolidation Series from Intel, and the Omneo  from Siemens. Although these analytics systems are comprehensive, they are designed for general purpose industrial automation applications, and few details on their architecture design, implementation and performance evaluation are provided. In this paper, we will present the design details of RT-DAP, which is specifically developed for industrial process monitoring and control applications. In the following, we first summarize the state of the art in the key components of the proposed platform.
With the exponentially growing number of data sources and data volume, efficient messaging protocols and messaging systems particularly designed for IoT applications have gained their popularity in recent years. Data collected from physical objects and devices are streamed into messaging systems through messaging protocols and eventually consumed by IoT applications. Among many existing messaging protocols, MQTT , AMQP , CoAP  and STOMP  are the four most popular ones based on TCP/IP. All these protocols are designed for resource constrained devices and networks, thus can be efficiently implemented on a variety of embedded systems. A comparison among these popular IoT messaging protocols are provided in .
In order to partition and separate the data streams, and process messages asynchronously, most messaging systems are carefully designed with features like scalability, reliability, clustering, multi-protocol and multi-language support. Messaging systems usually comprise several servers, known as messaging brokers, working in the middle of the data sources and data analytics systems. These systems such as ActiveMQ , RabbitMQ , ZeroMQ , and Kafka  share many similarities. Some throughput and latency benchmarks are provided in  for their performance comparison.
Data store is another critical component of a data analytics system. Relational databases have been widely used in past decades, but it is not suitable for storing a large amount of time series data without explicit structures and relations . A variety of NoSQL databases thus have been designed to address this problem, including key-value stores and document stores. Key-value stores – including Berkeley DB (BDB) , Oracle NoSQL Database , HBase , and Cassandra  – use a map or dictionary as a collection of key-value pairs as the fundamental data model. On the other hand, Document stores differ on the definition of “document”, but generally all assume that documents encapsulate and encode data in some standard formats or encodings including XML, YAML , and JSON. CouchDB  and MongoDB  are among the most popular Document stores.
Regarding the parallel computing framework, MapReduce , Storm  and Spark  are the three most widely employed computation models nowadays for big data processing, while they have different strengths and use cases. MapReduce is a software framework capable for processing vast amounts of data (multi-terabyte datasets) in parallel on large clusters of commodity hardware in a reliable manner with high fault-tolerance [37, 38]. Apache Storm is a distributed computation framework which is suitable for reliably processing unbounded data streams. It uses custom created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data. In Storm, all data are stored in memory if the message exchanging happens in the same machine. This design makes Storm fast enough to process huge amount of data in real-time. Apache Spark is an open source cluster computing framework. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms . A detailed design overview and performance comparison between the Storm and Spark streaming platforms is provided in [39, 40]. It shows that Storm is more suitable for stateless stream processing while the Spark is more useful in complex event processing.
3 RT-DAP Design Details
In this section, we present the design details of RT-DAP. RT-DAP is a synergy of multiple advanced communication and computing technologies and is expected to provide a scalable solution for large-scale real-time industrial process monitoring, control and analytics. Fig. 2 presents the overall architecture of RT-DAP which consists of five key components: one or multiple industrial IoT field gateways for connecting to heterogenous plant resources, a run-time execution engine for real-time data processing, a distributed time-series database for fast data loading and queries, an analytics studio to define analytics models, and a rich set of web services for real-time data visualization and interactions. The industrial IoT field gateway, called IIoT-FG, provides hardware interfaces and protocol adaptation between RT-DAP and various heterogeneous plant resources running different communication protocols (OPC-UA , HART-IP , etc.) for distributed data acquisition. The collected process measurements from the field are in essence time-series data, and will be streamed into RT-DAP through a unified messaging protocol. These real-time data will be digested in the run-time execution engine which is a combination of data flow system (like Kafka ) and parallel real-time computing framework (like Storm ) where data analytics tasks will be performed in a parallel and resource-aware manner. The computed results will either be fed back to the physical systems directly in the forms of notification and alarm messages, or loaded into the time-series database system for further queries and processing. Another important component, the analytics studio is designed to develop various anayltics models using first principle and/or data-driven methods. These developed models will be deployed in the run-time execution engine for performing online continuous data analytics. Lastly, a web server is developed to interact with the time-series database to provide real-time visualization to end users through a variety of dashboards. In the following, we will elaborate the design of each key components.
3.1 Naming Conventions
We first describe our naming convention to distinguish real-time data points from different plant resources. The format is consistent with the one used in DeltaV  system, and other major Distributed Control System (DCS) vendors apply a similar approach. Each data point is assigned a unique tag name. Within one zone, the top name of the tag can be one of three types: module, workstation/controller, and device. The data points are all defined as paths from the top name. In a plant of multiple zones, zone name will be prefixed to the top name. To further distinguish different plants, domain name will be prefixed to the zone name. For ease of presentation, in this paper we only consider the data points from a single plant and thus domain name is not included. Considering an example where we have a zone named “UCONN-ITEB-311”. Inside this zone, there is one wireless Gateway with a 5-byte UniqueID of “A5EF69D256”, which has one sensor device connected with a 5-byte UniqueID of “A286BD21FA”. In the following, Tag-1 represents the health status of the Gateway, and tag-2 and tag-3 represent the health status and primary variable (PV) output value of the sensor device respectively.
3.2 Industrial IoT Field Gateway
Fig. 3 presents our vision of the industrial IoT paradigm in process industry. A large number of real-time data points from heterogeneous plant resources will be collected from a variety of data connectors (both hardware devices and software interfaces) which are geographically distributed in the field. For example, OPC UA servers are used to connect to DeltaV systems to retrieve periodic measurements from installed modules, controllers and hardware devices. HART and WirelessHART gateways are used to collect sensor and actuator measurements as well as network health information in a real-time and continuous manner. Wireless packet sniffers, spectrum analyzers and surveillance cameras (not shown in Fig. 3) are installed in the plant to monitor its operation and RF spectrum environments. All these real-time data will be streamed into cloud-based data analytics platform(s) (such as RT-DAP) for advanced modeling, scalable analytics, real-time visualization and mobile alerting.
Given the connectors are usually running different communication protocols (e.g., OPC UA and HART-IP), instead of implementing protocol adapters on each of the connectors to stream the data to the cloud, we design an industrial IoT field gateway, referred to as IIoT-FG, to connect to multiple data connectors to provide protocol adaptation and remote configuration. IIoT-FG is designed to be of small form factor, cheap, and thus can support massive field deployment.
Fig. 4(a) presents the software architecture of IIoT-FG. The current prototype is implemented on Minnowboard and has the following major software modules: (1) a web portal running on Apache to enable remote configuration, (2) a dataconcentrator module and an OPC publisher module to interpret HART-IP and OPC UA messages respectively, (3) a device mapping module to map device UID to the device key used on the analytics platform, and (4) an IoT hub module to stream the data to the data analytics platform by supporting different messaging protocols, such as HTTP, MQTT and AMQP. All these modules communicate with each other through a message bus. As shown in Fig. 4(b), operators can remotely access to the IIoT-FG via the web portal and configure the target data points and the associated streaming parameters. Process measurements will be streamed into IIoT-FG for protocol adaptation, and then further forwarded to the data analytics platform according to the messaging protocol to be described in Section 3.3. In the data analytics platform, these data points will be further stored, fused, analyzed and visualized to represent the current status of plant operations.
3.3 Messaging Protocol for Data Collection
Along with the ever-growing number of sensors and actuators being deployed in field, a large amount of real-time measurements are being collected from heterogeneous plant resources for various monitoring and control applications. A simple and unified streaming protocol is thus needed to define these data streams for cross-platform data emitting and retrieving. Given its capability to represent rich data structures in an extendable way, we use JSON objects to define these data streams. In RT-DAP, we set up a TCP portal server based on Netty and design a streaming API for clients to interact with the TCP server. This streaming API is designed with two fields at the top level: the request type, and its associated parameter, which is another JSON object with multiple fields. The followings are two major request types:
Stream Definition: This request type is created by the client to define the data stream before sending any data records to the server. It has a type of “D” and a parameter with id, tag, type and optional fields. This indicates that the stream has a unique id and is mapped to the tag name. Upon receiving such a request, the server will create the mapping.
Data Record: After a stream is defined, the client can emit data records through the Data Record request. This request has a type of “d” and an associated parameter with fields of id, time, value, and status. This is to represent a data record from stream id, with its timestamp (in UTC format), data value and status.
Given the fact that JSON essentially sends its schema along with every message, it requires relatively large bandwidth. Since many compatible compression techniques have been reported to achieve good JSON format compression rates, they can be performed on Data Record requests to make the streaming protocol more bandwidth-efficient. Our performance evaluation in Section 5.2 summarizes our experimental results on how the compression techniques will affect the throughput of the TCP portal server. All data records collected through the streaming APIs will be streamed into RT-DAP.
3.4 Scalable Time Series Database Design
Data records collected from plant resources include continuous, batch, event, and other data sources such as lab systems and material handling systems. These data records are in essence time series data and need to be periodically transferred to the system’s Real-Time Database (RTDB) to provide a complete picture of the plant operation and support operator trends, process analysis, model building and data mining activities. To serve these purposes, we design a distributed and scalable time series database on top of HBase . The time series database addresses a common need: store, index and serve process and related data collected from the distributed control systems (control strategies, DCS equipment, devices, lab systems, applications, etc.) at a large scale, and make this data easily accessible. We design the time series database as a general-purpose data store with a flexible data model. This allows it to craft an efficient and relatively customized schema for storing its data.
The non-relational database mechanisms in HBase enable design simplicity, horizontal scaling, and finer control over data availability. In the logical data model of HBase as shown in Fig. 5, it stores a piece of data within a table based on a 4D coordinate system: rowkey, column family, column qualifier, and version. In our design, the rowkey of the raw data table (DATA_TABLE) consists of the tagID of the data stream and higher-order of the timestamp when a data record is received; the column qualifier contains the data type, status and the lower-order of the timestamp; the column family is reserved for future use, and the size and contents of the value field depends on specific data streams. We have this design to sort the time series according to their unique names and time resolutions, so that the set of values for a single TagID is stored as a contiguous row. Within the run of rows for a TagID, stored values are ordered by timestamp. The timestamp in the rowkey is rounded down to the nearest 60 minutes so a single row stores a ‘bucket’ of measurements for the hour. Dividing the rowkey this way allows users to shrink their scan range by specifying the names of the target data streams and the required time intervals.
Following the design principles of HBase, our database design installs all raw data records into one big table (DATA_TABLE). The table can be automatically split and distributed to multiple region servers in HBase for fault tolerance and scalability. In addition to the raw data table, as shown in Fig. 5, we create an index table (TAG_TABLE) which contains two column families: ‘tag name’ and ‘id’. This table provides a 2-way dictionary of tag name to TagID. Based on the requirements of typical process control applications, we further create a set of aggregation tables (MM_AGG_TABLE, HH_AGG_TABLE, DD_AGG_TABLE) to store aggregated data about the rows according to different time resolutions, for example minimum, maximum and close values for a given tag in each hour. This allows discarding time ranges that are known not to include data in the search range without scanning each sample. The creation and update of the aggregation tables can either be done in the runtime or through executing offline MapReduce jobs.
3.5 Analytics Model Development Studio
Developing an analytics-based solution not only requires access to data, but also an environment to develop models using that data, an easy method to deploy models, and a way to monitor the operation of the deployed models. For example in process industry, the overall scope of a project may be to ensure that the separations coming out of a column are maintained across a wide range of material compositions. The way to do this is to provide continuous and timely feedback on how the separations column is performing. In many cases it may be possible to install and utilize one or more on-line analyzers. However, in some cases it may be too expensive or not possible to install an analyzer in the process. In other cases, there may not be an analyzer available to measure the properties that are needed for the control strategy. As an alternative inferred measurement may be used. Each inferred measurement utilizes multiple parameters. Each inferred measurement may be periodically validated and the models recalibrated using lab data. Each of these inferred measurements is developed using our self-developed tool called Model Development Studio (MDS).
MDS provides a visual workspace to build, test, deploy, and monitor analytics strategies. The studio environment makes it easy for the model developer to develop and test using different datasets, different data cleaning techniques, and different algorithms. The environment is both interactive and visual. The user creates an analytics module and work inside it. The analytics strategy inside an analytics module is constructed from a pallet of analytics blocks. Blocks are arranged into categories. Each category is used to hold blocks for accessing data, cleaning data, manipulating data, modeling, and testing.
An MDS example is shown in Fig. 6(a). In this example the analytics module has one instantiated analytics block called LoadFile1. As a first step the data could be loaded into MDS and visualized. This first step is often used to get a quick feel for shape of the data and to summarize missing data, outliers, and bad data. To complete the inferred measurement described above the data must be separated into features used to predict the measurement we are after. The completed model is shown in Fig. 6(b). Once the analytics strategy is ready it may be deployed for online operation using the Online option. The online operation automatically strips off blocks that are needed for runtime execution of the model. The online view also provides options for the user to control execution of the model and direct the output of the block, for example the inferred measurement may be written back to the control system using OPC or other interfaces. The online view is shown in Fig. 6(c).
The overall architecture of MDS is shown in Fig. 7. Its front-end is a web-based application. The studio environment itself contains a menu structure for creating and deploying modules, a side-bar for switching between studio and a run-time dashboard, a pallet for organizing analytics blocks, and a work surface for organizing the data flow through blocks. The model editor also supports dragging/dropping blocks from the pallet of blocks on to the work surface and connecting blocks on the work surface.
The Web Server contains a controller for interacting with the client and web socket interface for serving up data. The actual analytics blocks themselves are designed separately and loaded into the analytics environment. Each block contains both an algorithm and a set of meta data describing the interfaces for the block. In this way the model clients don’t need to know anything about the internals of blocks.
Analytics blocks are defined externally using a wrapper, which describes both the offline aspects for off-line editing and on-line operation. Block definitions are defined in the wrapper and then loaded into a MongoDB. Once in the database the blocks are immediately available to the editor. When users drag blocks on to the work surface they are instantiating an analytics block. The structure and configuration of each analytics model is stored in the MongoDB as a separate entity.
3.6 Real-Time Runtime Execution Engine
The developed models in the MDS will be deployed on the real-time runtime execution engine to perform designated analytics tasks. Among many existing computing frameworks, the MapReduce provides good performance in processing large datasets with a parallel, distributed algorithm on a cluster, and brings in scalability and fault-tolerance by optimizing the execution engine once. Our proposed RT-DAP fully supports running complex MapReduce jobs on HBase datasets to perform computation intensive analytics tasks for process monitoring and control. MapReduce however is not a good choice for processing unbounded real-time data streams. It is hard to achieve processing rates with short latencies, which is critical for real-time continuous analytics. Although we can run the MapReduce jobs periodically, the startup and shutdown cost on a MapReduce job is proved to be heavy.
In RT-DAP, we use a combination of Apache Kafka  and Storm  frameworks to serve as the runtime execution engine. The real-time data measurements streamed to the portal server will be pushed to the Kafka framework for queuing and to achieve “at least once” delivery guarantee. These measurements will then be pulled by the Storm framework for real-time and parallel processing. Storm provides an enhanced computation model by extending MapReduce jobs to a computation topology. Incoming data streams are split among a number of processing pipelines. Each node on the computation topology run a specific job continuously on the unbounded data streams flowing through it by using long-lived processes, and thus amortize the startup costs to zero and significantly improve the latency.
Fig. 8 gives an example of Kafka partition and Storm topology to support parallel data aggregation tasks. Real-time data records are streamed into different Kafka partitions according to their tag IDs. The data records in each Kafka partition are handled by a separate Storm topology which contains one Spout and one Bolt. The Spout subscribes to the corresponding Kafka partition and keeps pulling in the data records and stores them in a local buffer. The Bolt receives the buffered data records from the Spout, accesses the HBase to retrieve historical data, calculate the min, max and close values according to different time resolutions (minute, hour and day), and update the new aggregation results back to HBase.
Kafka and Storm have been shown in our experiments to be very effective in queuing and processing unbounded real-time streams. It is however difficult for the system designer to decide how to create and optimize the computation topology so that the timing constraints on the analytics jobs can be met. To overcome these deficits, we are working on an enhanced computation model for the existing real-time processing frameworks by taking timing requirements of the analytics jobs into consideration. As the ongoing work, we are exploring how to automatize the parallelizing process on the existing analytics model into a computation topology. Another challenge is how to dynamically allocate physical computing resources to each computation unit in the computation topology, so that the overall required computing resources will be minimized while the timing constraints on the analytics tasks can still be maintained.
4 Implementation on Azure
RT-DAP can either run on a private computing infrastructure or be deployed on an enterprise cloud platform, such as Microsoft Azure. In this section we describe the implementation details of our prototype development on Microsoft Azure.
Fig. 9 presents the system architecture. It follows a Client/Server architecture design. The server provides high-volume data ingest, exactly-once delivery, scalable time-series data storage and real-time parallel data processing. The clients can either push real-time streams into the server or retrieve data (e.g., in the form of queries, visualization, etc.) from the server, through either web or Thrift interface. We created a portal VM to bridge RT-DAP and external data sources. The portal VM includes 1) a standalone TCP server running on Netty to accept the meta/raw data streams from plant resources using the unified JSON format, 2) a web server to query HBase and provide user-friendly web UI for visualization, and 3) a Storm development tool to define, build and submit Storm topologies for the analytics models developed in the MDS.
A combination of Apache Kafka and Storm frameworks is running in a HDInsight cluster for queuing and real-time processing on data streams received from the portal VM. Raw data were also sent directly to a HDInsight HBase cluster and stored in Azure storage. HDInsight cluster deployment tool provided by Azure allows us to quickly deploy and scale HBase and Storm clusters, while the Kafka cluster needs to be manually installed because it is not yet supported by Azure. Currently we deploy a single node Kafka service in the headnode of Storm HDInsight cluster, and reuse its zookeeper nodes. By default, the HDInsight cluster only exposes the Web manager interface to the Internet which only provides basic and high-level management. We created and deployed the HDInsight Storm cluster, HBase cluster, and the Portal VM in a same Azure virtual network. By doing so, the services provided in the Portal VM can get exposed to the Internet by creating endpoints in the Azure platform and adding firewall exceptions. A connection to a virtual public IP address will be redirected to the Portal VM.
Sample models were created in MDS for fault detection based on both historical data in HBase and real-time data streams flowing through the Storm topology. Power BI and Dashboards were used for reporting, and real-time alerts & notification. This prototype provides a conceptual validation on the system design. Due to its clean design and potential impact to many industrial sectors, the platform was picked by the Azure team as a case study in UK Azure user group meeting and was discussed in Azure cloud cover show episode 174 on Channel 9 .
On the client side, various data sources, either physical plant resources (DeltaV DCS system, wireless Gateways, etc.) or virtual resources (Crude simulator, packet generator, etc.) are connected to RT-DAP and stream real-time data measurements using a uniformed JSON streaming API. The web interface provided by the web server allows authenticated clients to send queries to RT-DAP and retrieve both analytics results and raw data from anywhere on any device.
5 Performance Evaluation
In this section, we evaluate the performance of RT-DAP. We report our experimental results on the TCP portal server, the time series database, and the runtime execution engine running the aggregation tasks. A crude oil refining plant simulator (Crude simulator for short) is used to be one of the data sources for streaming real-time process measurements into the platform. The Crude simulator can simulate a complete oil refinery process with high fidelity, and all process measurements are accessible via the OPC server.
The key performance metrics used in our experiments are the throughput and latency of the platform in digesting high-volume real-time data streams. For ease of presentation, we implemented a general aggregation function in our performance evaluation, which is a common building block for many batch processing and continuous analytics tasks.
5.1 Experiment Setup
Our experiments are performed on the prototype development on Microsoft Azure as described in Section 4. Multiple Crude simulators have been running on workstations to simulate the process measurement flows sent to the real-time data analytics platform from geographically distributed oil refineries over the Internet. A wide range of process measurements are extracted from the Crude simulator through an OPC server and then sent to the portal server via the JSON-based streaming APIs. The tags to be sampled and their corresponding sampling rates are configurable and decided by individual applications. The portal server further forwards the data to different Kafka partitions which are identified by the tag ID, so that the data records for the same tag will always go to the same partition. By doing partitions, we are able to divide work load and parallelize their processing.
5.2 Throughput of the TCP Portal Server
In the first set of experiments, we varied the computing resources (in terms of number of cores) on the virtual machine where the TCP portal server is deployed. We evaluated the maximum throughput of the TCP server in terms of number of processed tags per second. It reflects the maximum input throughput of RT-DAP.
To reach the maximum throughput, instead of using the Crude simulator, we used 7 TCP packet generators installed on different workstations to send JSON objects (in the format of Data Record with a payload size of 70 bytes) to the TCP server at their maximum speeds. The maximum throughput of the TCP server will then be derived when its CPU usage reaches 100%. As shown in Fig. 14, we performed four experiments with the number of CPU cores (AMD Opteron 4171 HE) set to be 1, 2, 4, and 8, respectively. Each core is assigned with 1.75 GB memory while all other settings are identical. Each experiment run for 1000 seconds. From Fig. 14, we observe that when the TCP server run on an one-core machine, it can process approximately K data tags per second. When the number of cores is increased to 2, 4 and 8, the maximum throughput of the TCP sever increases to K, K and K data tags per second, respectively. With this near linear growth of the throughput along with the increased allocation of computing resources, the throughput of the portal server can be easily scaled up.
To evaluate how data compression mechanism affects the throughput of the TCP server, we performed another set of experiments to compare the server throughput by sending compressed and uncompressed data records, respectively. We used the same TCP server as in the last set of experiments and used four cores and 7 GB memory in total. 7 TCP packet generators were used to send JSON objects to the TCP server at their maximum speeds with the payload size varied from 2 bytes to 512K bytes. In the experiments, we used the “zlib” library  for data compression, and the payload was randomly generated using a combination of numbers (0 to 9) and characters. We repeated the experiments to evaluate the throughput with compressed and uncompressed data records.
The experimental results are summarized in Fig. 13. We have the observation that the throughput of the TCP server is around 200,000 tags per second when the payload size is smaller than 64 bytes. For each tag, once a new data record is received at the TCP server, it took a constant time for processing. A smaller payload size led to a larger number of tags consumed by the TCP server which resulted a higher CPU usage. When the payload size is small, the bottleneck of the TCP server is the CPU resource rather than the network bandwidth. Since it took CPU time to decode the compressed data, applying data compression would downgrade the TCP throughput. On the other hand, when the payload size is larger than 8K bytes, the network bandwidth became the bottleneck. Since a smaller number of tags are transmitted to the server, the CPU resource is sufficient to process all compressed data records. Under this situation, transmitting data records with compression will save network bandwidth which in turn improve the server throughput.
5.3 Performance of the Time Series Database
In order to evaluate the performance of the time-series database schema design, we conducted two sets of experiments to compare the throughput of database write and read under different experimental settings. We tested the write performance by loading an one-month dataset collected from a real-world refinery with a total number of 3,677,625 data records. We tested the read throughput by performing queries on a four-month dataset with a total number of 15,574,062 data records. These data records were loaded from the portal server to the HBase cluster on the Microsoft AZure platform. This cluster comprises of 2 head nodes (one primary and one secondary), one Zookeeper quorum of 3 nodes and a varied number (1-4) of region servers. These machines were configured with the same type of CPU (AMD Opteron 4171 HE) as we chose in Section 5.2. HBase head nodes and region servers used 4-core CPUs and the zookeeper masters used 2-core CPUs. We evaluated the read and write performance of the time series database by changing the batch write size and the number of region servers in the HBase cluster. These experiments were also repeated with the HBase caching mechanism enabled and disabled. The experimental results for the write and read performance are summarized in Fig. 13 and Fig. 13, respectively.
In Fig. 13, we have the observation that the write throughput can be improved with a larger batch size and more region servers. With a batch size of 2000 data records, we can achieve a writing speed of 40,000 data records per second on a cluster with four region servers. In the experiments, we pre-split the raw data table with 1, 2 and 4 regions respectively to ensure that each region server will hold one region. The write requests were distributed almost evenly among the region servers to achieve a better throughput. However, given that the write operations have to hit HBase root and meta tables before they are written in to the raw data table and these operations cannot be paralleled, the write performance does not exhibit a linear improvement.
The read tests are performed on a four-month dataset in the raw data table. This table is pre-split with four regions. In each read test we scanned all data records within a time window of varied length from 1 minute to 2400 minutes and we randomly chose these time windows within the four-month time period. We performed these read tests 1000 times and calculated the average execution time per scan. From Fig. 13, we observe that the average execution time per scan remains stable when we increased the time window size from 1 minute to 960 minutes. This is consistent with our database schema design, in which each row contains one hour’s data records for a given tag. Since the HBase scan is row-based, when the time window is smaller than one hour, we still scan the entire row. From Fig. 13, the average scan time is only significantly increased when the size of the time window is larger than 960 minutes. The scan time increased mainly because of the increased sizes of the scan results.
Comparing Fig. 13 and Fig. 13, we observe that each scan operation took a much longer time than the write operation. This is because HBase will scan the whole region for the specified row key after the scan hits root and meta table. An important technique to accelerate the reading speed is caching. HBase by default enables an LRU cache to accelerate operations on row level. From the results in Fig. 13, we observe that this caching mechanism leads to a doubled scanning speed in average.
5.4 Performance of the Aggregation Task
To evaluate the performance of the runtime execution engine, we implemented the time series data aggregation task as described in Section 3.6. Process data flows were streamed into the Kafka via the TCP server. The Spouts in the Storm topology used the Kafka consume function to receive the data records from the associated Kafka partitions, and send them to the corresponding Bolts. The Bolt, once received a data record, first pushed it into the HBase raw data table, and then performed the data aggregation tasks. It first retrieved the aggregated data records from the aggregation tables, calculated the new high, low and close values based on the new received data measurements, and then stored the results back to the HBase at three different time resolutions (minute, hour and day). To complete these aggregation tasks, 3 HBase GET operations and 4 PUT operations are needed.
To reduce the number of HBase accesses during the data aggregation, we implemented a queue in each Spout in order to process multiple data records in batch. To achieve a balance between HBase throughput and latency, we made the queue size self-adaptive. We had one thread on the Spout to keep filling the queue by receiving the data from Kafka. The other thread sent the complete queue to the Bolt and waited until the Bolt finished the aggregation tasks. The size of the queue was affected by the Bolt processing speed. When there were more data records coming in, it took the Bolt a longer time to process, and the Bolt would receive a larger queue in the next round for aggregation. This made the HBase access more efficient but introduced in a larger latency. In the experiments, we bounded the queue size with a maximum number to prevent the scenario when the data input speed is consistently faster than the processing speed of the Storm. Under this situation, the data records will be accumulated in the Kafka, but the Storm will keep a reasonable processing latency, which is the speed of the Bolt processing a maximum size queue of data.
Fig. 18 summarizes the results where we tested the maximum number of tags the aggregation task can process. We let the Crude simulator keep sending data records to the analytics platform for 500 seconds at different speeds, and measured the time it spent on the Bolt to perform the aggregation tasks. In addition to the total processing time, we also measured the HBase scan and write time. In Fig. 18, the red bar represents the HBase write time while the yellow bar represents the scan time. It shows that the aggregation tasks on the Bolts took negligible amount of computation time when compared to the time spent on HBase access. The results also indicate that when the sending speed was faster than 1200 tags per second, the data could not be processed in time since it would take more than 500 seconds for processing all data records.
Fig. 18 and Fig. 18 summarize the results of the experiments where we tested the performance of the aggregation tasks with different maximum queue sizes. In the experiments, we kept the data sources sending data records for a time duration of 500 seconds and fixed the sending speed at 1000 tags per second. We varied the maximum queue size from 20 to 200 and measured the execution times on the Bolts as well as the actual queue sizes in the experiments. In Fig. 18, we observe that when the maximum queue size increased, the execution time – especially the HBase write time – dropped significantly. When the maximum queue size approached to 200, the execution time started to drop below 500 seconds, where the processing speed in the runtime execution engine catched up the sending speed. The similar conclusion can be derived from Fig. 18, where the average queue size in runtime became smaller than the maximum queue size when it went beyond 160.
6 Conclusion and Future Work
In this paper, we present the design and implementation of a real-time data analytics platform called RT-DAP for large-scale industrial process monitoring and control applications. The proposed platform consists of a distributed time-series database for scalable data storage, an analytics model development studio for data and control flow design, and a real-time runtime execution engine to perform parallel and continuous analytics. RT-DAP can be connected to various plant resources through lightweight industrial IoT field gateway via a unified messaging protocol. Our prototype development on Microsoft Azure and extensive experiments validate the platform design methodologies and demonstrate the efficiency of the data analytics platform in both component and system levels.
For future work, we will extend the time-series database design to support heterogeneous data formats, enhance the real-time parallel processing framework with resource-aware features, and add more analytics models in MDS to support a wider range of continuous analytics tasks.
The authors would like to thank Joshua Kidd and Noel Bell from Emerson Automation Solutions, and Lara Rubbelke and Wee Hyong Tok from Microsoft for their great discussions on various technical issues in the design, development and performance evaluation of the RT-DAP platform.
- W. L. Luyben, Process modeling, simulation and control for chemical engineers. McGraw-Hill Higher Education, 1989.
- F. G. Shinskey, Process control systems: application, design and tuning. McGraw-Hill, Inc., 1990.
- T. Blevins and M. Nixon, Control loop foundation: batch and continuous processes. ISA, 2010.
- T. Blevins, W. K. Wojsznis, and M. Nixon, Advanced control foundation: tools, techniques and applications. ISA, 2013.
- P. Kadlec, B. Gabrys, and S. Strandt, “Data-driven soft sensors in the process industry,” Computers & Chemical Engineering, vol. 33, no. 4, pp. 795–814, 2009.
- S. Yin, S. X. Ding, X. Xie, and H. Luo, “A review on basic data-driven approaches for industrial process monitoring,” Industrial Electronics, IEEE Transactions on, vol. 61, no. 11, pp. 6418–6428, 2014.
- L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer networks, vol. 54, no. 15, pp. 2787–2805, 2010.
- “Microsoft Azure,” https://azure.microsoft.com/.
- “MinnowBoard,” https://minnowboard.org/.
- A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of things: A survey on enabling technologies, protocols, and applications,” Communications Surveys & Tutorials, IEEE, vol. 17, no. 4, pp. 2347–2376, 2015.
- A. Zaslavsky, C. Perera, and D. Georgakopoulos, “Sensing as a service and big data,” arXiv preprint arXiv:1301.0159, 2013.
- B. Gerhardt, K. Griffin, and R. Klemann, “Unlocking value in the fragmented world of big data analytics,” Cisco Internet Business Solutions Group, June, 2012.
- S. Sagiroglu and D. Sinanc, “Big data: A review,” in Collaboration Technologies and Systems (CTS), 2013 International Conference on. IEEE, 2013, pp. 42–47.
- “HP HAVEn,” http://www8.hp.com/us/en/software-solutions/big-data-platform-haven/.
- Siemens, “Realize innovation with big data analytics,” 2016. [Online]. Available: https://www.plm.automation.siemens.com/en_us/products/omneo/
- “MQTT,” http://mqtt.org/.
- S. Vinoski, “Advanced message queuing protocol,” IEEE Internet Computing, no. 6, pp. 87–89, 2006.
- “Mqtt and coap, iot protocols.” [Online]. Available: https://eclipse.org/community/eclipse_newsletter/2014/february/article2.php
- “STOMP,” https://stomp.github.io/.
- “IoT Messaging Protocols Comparison,” https://iotprotocols.wordpress.com/.
- “ActiveMQ,” http://activemq.apache.org/.
- “RabbitMQ,” https://www.rabbitmq.com/.
- “ZeroMQ,” http://zeromq.org/.
- J. Kreps, N. Narkhede, J. Rao et al., “Kafka: A distributed messaging system for log processing,” in Proceedings of the NetDB, 2011, pp. 1–7.
- “Dissecting message queues.” [Online]. Available: http://bravenewgeek.com/dissecting-message-queues/
- S. Madden, “From databases to big data,” IEEE Internet Computing, no. 3, pp. 4–6, 2012.
- M. A. Olson, K. Bostic, and M. I. Seltzer, “Berkeley db.” in USENIX Annual Technical Conference, FREENIX Track, 1999, pp. 183–191.
- A. Joshi, S. Haradhvala, and C. Lamb, “Oracle nosql database-scalable, transactional key-value store,” in Proc. the 2nd International Conference on Advances in Information Mining and Management, 2012, pp. 75–78.
- “Apache HBase,” http://hadoop.apache.org/.
- “Apache Cassandra.” [Online]. Available: http://cassandra.apache.org/
- “YAML,” http://yaml.org/.
- “Couchdb,” http://couchdb.apache.org/.
- “MongoDB.” [Online]. Available: https://www.mongodb.org/
- J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
- “Apache Storm,” https://storm.apache.org/.
- C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, “Shark: fast data analysis using coarse-grained distributed memory,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 689–692.
- “MapReduce Tutorial.” [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
- J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
- S. Mittal, “Real-time stream processing at inmobi- part 2,” 2015, [Online; accessed 31-January-2016]. [Online]. Available: http://technology.inmobi.com/blog/real-time-stream-processing-at-inmobi-part-2
- “Apache storm vs. apache spark,” zdatainc.com/2014/09/apache-storm-apache-spark/.
- “OPC United Architecture,” https://opcfoundation.org/.
- “HART-IP Protocol,” http://en.hartcomm.org/main_article/hartip.html.
- “Apache Kafka,” http://kafka.apache.org/.
- “DeltaV System,” http://www2.emersonprocess.com/en-us/brands/deltav/Pages/index.aspx.
- “Azure cloud cover show 174,” channel9.msdn.com/Shows/Cloud+Cover/Episode-174-Big-Data-with-Lara-Rubbelke-and-James-Baker.
- “zlib,” http://www.zlib.net/.