Distributed mining of large scale remote sensing image archives on public computing infrastructures
Earth Observation (EO) mining aims at supporting efficient access to and exploration of petabyte-scale space- and airborne remote sensing archives that are currently expanding at rates of terabytes per day. A significant challenge is performing the analysis required by envisaged applications, such as process mapping for environmental risk management, in reasonable time. In this work, we address the problem of content-based image retrieval via example-based queries from EO data archives. In particular, we focus on the analysis of polarimetric SAR data, for which target decomposition theorems have proved fundamental in discovering patterns in data and characterizing the ground scattering properties. To this end, we propose an interactive region-oriented content-based image mining system in which 1) unsupervised ingestion processes are distributed onto virtual machines in elastic, on-demand computing infrastructures; 2) archive-scale hierarchical content indexing is implemented in terms of a “big data” analytics cluster-computing framework; 3) query processing amounts to traversing the generated binary tree index, computing distances that correspond to descriptor-based similarity measures between image groups and a query image tile. We describe in depth both the strategies and the actual implementations for the ingestion and indexing components, and verify the approach by experiments carried out on the NASA/JPL UAVSAR full polarimetric data archive.
We report the results of the tests performed on computer clusters by using a public Infrastructure-as-a-Service and evaluating the impact of cluster configuration on system performance. Results are promising for data mapping and information retrieval applications.
Remotely sensed data volumes are growing at ever faster rates due to the increasing number of spaceborne and airborne Earth Observation (EO) missions and to the improvement of image resolution. As an example, according to a yearly report published at the end of 2013, the archives of NASA’s Earth Observing System Data and Information System (EOSDIS) have a volume of almost 10 petabytes, with 6,900 accessible datasets and an average archive growth of approximately 8.5 terabytes per day. The possibility of accessing and processing large databases of remote sensing images enables planetary-scale applications such as deforestation monitoring, glacial retreat investigation, urban development mapping and land cover classification. Allowing efficient discovery, annotation and retrieval of data products is the goal of EO mining systems.
Common methods to search for images are based on metadata: a user formulates a query in terms of geographic location, sensor parameters, time of acquisition or manual annotations, and the system returns the images that are relevant to the query. The problem with such approaches is that they hardly satisfy user needs in terms of human perception, while manually annotating large volumes of images to describe their content is so expensive and time-consuming that it becomes impossible in real scenarios, where petabyte-scale image archives are the rule. An alternative approach is Content-Based Image Retrieval (CBIR), where a description of an image in terms of primitive features (textures, shapes, relevant colors, etc.) is automatically extracted from its content, providing the user with query and retrieval methods that are closer to visual perception. Although several systems for content-based retrieval have been proposed for mining EO image databases, most of them have not been conceived for scaling to massive volumes of data, and it is estimated that up to 95% of the data present in existing archives has never been accessed or analyzed by a human analyst.
The fundamental limitation of remote sensing content-based retrieval systems has been their applicability to extended archives: operational scalability has rarely been a characteristic of the prototype systems presented in the literature. A system for automatic object and tile-based feature extraction, content mining and indexing has been proposed in [16, 8]. Experiments were conducted on a 40 Gigabyte archive of high resolution images from the IKONOS and QuickBird satellites. Multiple machines are used to answer user queries, with each machine managing a different feature-related index and processing queries in parallel. When queried, the system answers by combining the results from the distributed index. However, in substance, both the system architecture and the algorithms are single-node based, which is unlikely to provide the computational capabilities required to process millions or billions of records. Ingestion and, in particular, indexing procedures are considered offline operations. A similarity-based retrieval system for Terabyte-scale image archives has also been presented. The reported experiments were conducted on a large image dataset whose descriptors are stored on a cluster of commodity machines. However, the parallel processing procedures needed for the generation of the descriptors are not considered, and index creation is performed by building a tree-structured index on a single machine, relying on the descriptors of a limited subset of archived products obtained by randomly sampling the entire collection. Distributing index creation procedures across multiple machines is in fact a challenging task, consisting of long-running data- and computation-intensive operations for which scalable data analysis algorithms are needed.
In this paper, we propose an EO data search framework that allows the exploration of massive datasets via example-based queries. We show that the system can be exploited to interactively query large-scale EO imagery archives with logarithmic search complexity and report results of tests performed by using a public Infrastructure-as-a-Service. Without imposing any constraint on the extension of the system to any kind of sensor, we focus in this specific work on polarimetric Synthetic Aperture Radar (SAR) data, where the improvements in sensor resolution and the use of multiple polarizations are causing a significant growth of data volume. The publicly available database of polarimetric SAR images collected by the UAVSAR sensor has been considered as a case study.
The paper is structured as follows. In Section II, the analysis and parallelization methodologies involved in different parts of the system architecture are introduced. In Section III, the basic concepts of target decomposition theory for polarimetric SAR data are introduced and the specific technique employed in the system, Cloude-Pottier decomposition, is briefly described. In Section IV the system architecture and the procedures adopted for the data ingestion and the index creation are outlined. The results of the experiments carried out by using cloud-based computing infrastructures are reported in Section V. Finally, in Section VI, a discussion on the results and the conclusions close the article.
II Analysis and parallelization methodology
CBIR systems are typically based on three separate data processing phases and on a separate client UI.
The first phase entails an unsupervised ingestion step, in which single archive products are analyzed independently of each other in order to characterize them based on their contents. The second phase consists of an indexing step in which the obtained characterizations are compared with each other in order to organize the contents of the archive in groups of similar characteristics, in order to simplify navigation. This phase is unsupervised as well, and results in the production of a content index. A last and interactive phase is query processing, in which the unsupervised characterization of an example query is used to navigate the index tree to retrieve the most relevant archive items.
In general terms, the ingestion phase can be seen as “perfectly parallel” or “embarrassingly parallel” and fits a single instruction, multiple data paradigm, while the second phase, index formation, requires iterative processes in which archive elements are repeatedly compared with each other. The third phase, query processing, already operates on efficient data structures and often does not require specific parallelization methodologies.
Perfectly parallel workloads typically involve multiple tasks that have no dependency on one another. It is in this sense that the first phase can be expressed by a map pattern (see Fig. 1a) that invokes an elemental function over every element of a data collection in parallel. A fundamental requirement for the implemented elemental functions is that they do not modify global data on which other instances of that function depend. Examples of operations amenable to direct implementation as map-like functions that are typically part of the ingestion phase of an EO image mining system include data product access, tiling and feature extraction. All such functions operate on isolated elements of the input archive, but their usage is generally not uniform in time. Virtual resources in elastic cloud computing infrastructures are well suited to such a scenario, allowing the ingestion processes to be distributed dynamically and on demand, achieving economies of scale and maximizing the effectiveness of the shared resources.
The scenario of the second phase is different: the dataset has to be seen as a whole rather than a collection of separate elements, as the iterative machine-learning processes involved require communication between tasks, especially when combining intermediate results. The combine function defines a reduce pattern, with the elements of a collection being combined in any order to create a summary (see Fig. 1b).
MapReduce is a programming model that is well suited to this scenario. It requires the definition of two user-defined methods that process data organized as key/value pairs. The map function takes key/value pairs as input and, for each of them, applies the mapping procedures, resulting in zero, one or multiple key/value pairs as output. The reduce function takes the intermediate key/value groups, i.e. the key/value pairs grouped by key, and processes together the intermediate values associated with the same key. A key advantage of such a programming model is the possibility of keeping the computation close to the data, avoiding the need to move the data from the place where they are stored. Expressing an algorithm by sequences of map and reduce patterns allows implementing parallel applications in a way that is suitable for execution on large clusters of commodity machines.
Based on these considerations and to demonstrate the feasibility of large scale remote sensing CBIR on inexpensive on-demand processing infrastructures, we have developed a complete system that distributes both ingestion and indexing processes over computer clusters and is able to scale to large databases.
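As a minimal single-process sketch of the map and reduce patterns just described, the following Python fragment groups the key/value pairs emitted by a map function and hands each group to a reduce function. The helper `run_mapreduce`, the tile identifiers and the scattering labels are hypothetical names chosen for illustration; no cluster framework is involved.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (key, value) record, group outputs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        # A map call may emit zero, one or several key/value pairs.
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each reduce call sees all intermediate values that share one key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_map(tile_id, label):
    # Emit one pair per tile, keyed by its (hypothetical) scattering label.
    yield label, 1

def count_reduce(label, ones):
    return sum(ones)

records = [("t1", "surface"), ("t2", "volume"), ("t3", "surface")]
counts = run_mapreduce(records, count_map, count_reduce)
```

In a real deployment the grouping step would be performed by the framework's shuffle phase across machines; here it is a plain dictionary.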
III Mining polarimetric SAR data archives
The potential of SAR polarimetry to characterize the physical properties of the Earth surface has led to a variety of applications that aim at exploiting the ground scattering mechanism to extract geophysical parameters and perform landcover classification. Such parameter signatures are known to allow ground target classification, and therefore represent a direct possibility for the development of a system to mine large scale remote sensing archives.
Electromagnetic radiation incident on a target with a given polarization (horizontal or vertical) produces a backscattered wave that has, in general, both horizontal and vertical polarizations. The way a target changes the backscattered wave polarization is described by the scattering matrix, which describes the transformation of the electric field between the incident and backscattered wave:
$$\mathbf{E}^s = \mathbf{S}\,\mathbf{E}^i, \qquad \begin{bmatrix} E_h^s \\ E_v^s \end{bmatrix} = \begin{bmatrix} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{bmatrix} \begin{bmatrix} E_h^i \\ E_v^i \end{bmatrix}$$
where $\mathbf{E}^i$ is the electric field of the incident wave, $\mathbf{E}^s$ is the electric field of the backscattered wave and $S_{hh}$, $S_{hv}$, $S_{vh}$ and $S_{vv}$ are the four complex coefficients of the scattering matrix, which are variable and depend on the nature of the target.
The elements on the main diagonal of the scattering matrix, $S_{hh}$ and $S_{vv}$, produce the power return in the copolarized channels and the elements $S_{hv}$ and $S_{vh}$ produce the power return in the cross-polarized channels.
There are several ways to represent the scattering properties of a target. The coherence matrix is a possible power representation on which Cloude-Pottier decomposition theory relies. Let us consider the vectorized version of the scattering matrix, based on Pauli spin elements:
$$\mathbf{k} = \frac{1}{\sqrt{2}} \begin{bmatrix} S_{hh} + S_{vv} & S_{hh} - S_{vv} & 2 S_{hv} \end{bmatrix}^{T}$$
where the reciprocity assumption has been supposed. The coherence matrix is a positive semi-definite Hermitian matrix, given by the product of the Pauli vector by itself:
$$\mathbf{T} = \mathbf{k}\,\mathbf{k}^{H}$$
Cloude-Pottier decomposition relies on the eigen-decomposition of the (averaged) coherence matrix:
$$\langle \mathbf{T} \rangle = \mathbf{U}\,\boldsymbol{\Lambda}\,\mathbf{U}^{H}$$
where $\boldsymbol{\Lambda}$ is a diagonal matrix with non-negative real components, corresponding to the eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of $\langle \mathbf{T} \rangle$, and $\mathbf{U}$ is a square matrix whose columns are the eigenvectors $\mathbf{u}_i$ of $\langle \mathbf{T} \rangle$. Let us further account for the following parameterization of the eigenvectors, holding for the case of scattering medium that does not have azimuth symmetry:
$$\mathbf{u}_i = e^{j\phi_i} \begin{bmatrix} \cos\alpha_i & \sin\alpha_i \cos\beta_i\, e^{j\delta_i} & \sin\alpha_i \sin\beta_i\, e^{j\gamma_i} \end{bmatrix}^{T}$$
Based on statistical considerations, the mean parameters of the dominant scattering mechanism can be extracted from the 3-by-3 coherency matrix as a mean unit target vector given by:
$$\mathbf{u}_0 = e^{j\bar{\phi}} \begin{bmatrix} \cos\bar{\alpha} & \sin\bar{\alpha} \cos\bar{\beta}\, e^{j\bar{\delta}} & \sin\bar{\alpha} \sin\bar{\beta}\, e^{j\bar{\gamma}} \end{bmatrix}^{T}$$
It has been shown that the mean parameter $\bar{\alpha}$, ranging from $0$ to $\pi/2$ rad, is roll-invariant and that it is the main parameter for identifying the dominant ground scattering mechanism, while $\bar{\beta}$, $\bar{\delta}$ and $\bar{\gamma}$ can be used to define the target polarization orientation angle and $\bar{\phi}$ is physically equivalent to an absolute target phase. The best estimate of $\bar{\alpha}$ is:
$$\bar{\alpha} = \sum_{i=1}^{3} p_i\, \alpha_i$$
with $p_i$ representing pseudo probabilities:
$$p_i = \frac{\lambda_i}{\sum_{k=1}^{3} \lambda_k}$$
Based on the value of $\bar{\alpha}$, a characterization of the scatterer physical properties can be inferred, in terms of single (surface), volume or double bounce scattering, respectively corresponding to low, average and high values of $\bar{\alpha}$.
The coherence matrix eigen-decomposition can also be used to estimate polarimetric entropy $H$, a measure of the degree of statistical disorder (randomness) of the scatterer:
$$H = -\sum_{i=1}^{N} p_i \log_N p_i$$
with $N = 3$ for a monostatic radar and $N = 4$ for a bistatic radar. Polarimetric entropy ranges from $0$ to $1$. Low values of this parameter indicate weakly depolarizing targets (point scatterers), while high values characterize depolarizing targets (mixture of point scatterers). In the limit case of $H = 1$, the target scattering corresponds to a random noise process.
As a complementary information to the polarimetric entropy, the polarimetric anisotropy $A$ can be estimated from the eigenvalues of the coherence matrix:
$$A = \frac{\lambda_2 - \lambda_3}{\lambda_2 + \lambda_3}$$
Polarimetric anisotropy ranges from $0$ to $1$ and provides a measure of the relative importance of the second and the third eigenvalues of the eigen-decomposition. Both $H$ and $A$ are roll-invariant parameters, as the eigenvalues are rotationally invariant.
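The three descriptors can be illustrated with a short Python sketch. It assumes the usual Cloude-Pottier definitions: pseudo probabilities are the eigenvalues normalized by their sum, entropy is their Shannon entropy in base N, anisotropy is the normalized difference of the second and third eigenvalues, and mean alpha is the probability-weighted average of the per-eigenvector alpha angles. The function names are illustrative, and the eigenvalues and alpha angles are taken as given inputs rather than computed from a coherence matrix.

```python
import math

def pseudo_probabilities(eigenvalues):
    """Normalize non-negative eigenvalues so that they sum to one."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def entropy(eigenvalues):
    """Polarimetric entropy, with the logarithm base equal to the number of eigenvalues."""
    n = len(eigenvalues)
    p = pseudo_probabilities(eigenvalues)
    # Terms with p_i = 0 contribute nothing (limit of p*log(p) as p -> 0).
    return -sum(pi * math.log(pi, n) for pi in p if pi > 0)

def anisotropy(eigenvalues):
    """Relative difference between the second and third largest eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    denom = lam[1] + lam[2]
    return (lam[1] - lam[2]) / denom if denom > 0 else 0.0

def mean_alpha(eigenvalues, alphas):
    """Probability-weighted mean of the alpha angles (radians), paired by index."""
    p = pseudo_probabilities(eigenvalues)
    return sum(pi * ai for pi, ai in zip(p, alphas))
```

With three equal eigenvalues the entropy reaches its maximum and the anisotropy vanishes, while a single non-zero eigenvalue gives zero entropy, matching the limit cases discussed in the text.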
IV System Architecture
Four subsystems compose the architecture of the proposed CBIR system: data ingestion, indexing, query processing and interactive visualization. The system allows users to query large-scale archives of polarimetric data and retrieve image tiles with scattering characteristics similar to those of a user-provided query tile. A schematic representation of the system is reported in Fig. 2.
The visualization subsystem provides the interactive environment by which the user, typically an image analyst, can communicate with the retrieval system and submit a query image tile. The query is analyzed in an unsupervised manner to extract one or more vectors of content-based features. The resulting query signature feature vectors are sent to the query execution subsystem, where the images with signatures most similar to the query image are efficiently retrieved. Finally, the retrieval results are shown to the user by the user interface module.
The current section provides a detailed description of the architecture of the proposed CBIR system for polarimetric SAR data products, and analyzes the inner workings of each of the modules.
IV-A Data ingestion
Data ingestion is the first back-end module of the retrieval system. It consists of a system that retrieves images from the web or from a local database, extracts information about their contents in an unsupervised manner and stores it for later indexing.
Workloads in this stage are typically characterized by strong variability. For example, workload peaks are expected in the early ingestion phases of a new database containing a large number of products. On the other hand, low-demand or idle states are also expected to happen in a system operating full-time. In order to avoid fast resource saturation or under-utilization, the system has to be scalable, i.e. it has to be able to sustain both increasing and decreasing workloads with adequate performance by adapting hardware resources.
To this end, cloud-based computing environments can be exploited for scraping and performing processing and feature extraction operations on data products. In particular, elastic versions of such services can be implemented by allowing automated load balancers to provision and deprovision resources on demand in the cloud environment. A load balancer allows the computing capacities to adapt to current requirements, i.e. to the amount of data that has to be ingested and processed at a certain time. Its operation is based on a distributed queue system and on a machine instance manager.
A further concrete issue is the heavy file transfers involved in this step in cases in which the storage and data analysis systems are far apart (as is of course the case for our prototype). This can easily lead to a fast saturation of networking resources, adding to the cost of ingestion.
IV-A1 Crawling and elastic load balancing
Three components are involved in the first phase of the ingestion system: the crawling agents, the elastic load balancer, which is in turn formed by a distributed queue system and a machine instance manager, and the elastic computing cluster. Crawling agents are charged with continuously polling the contents of the input data archive. In the specific case of the prototype under discussion, they analyze the NASA/JPL UAVSAR web pages with the aim of discovering new data products to ingest. When a new product is found, a pointer to it is created and inserted in a distributed queue system. The distributed queue system is a scalable service that allows reliable queuing of the independent jobs that cluster components have to perform. It acts as a buffer and communication channel between the crawling agents and the elastic cluster platform components, preventing the system from losing job instantiation requests when there is an excessive processing load or when workers are scheduled to operate intermittently. The distributed nature of the queue also allows multiple interactions between crawlers and workers. In addition, a specific function of the queue manager is to allow better failure recovery strategies to be implemented in the system: when a worker node picks up a task to be executed, it also declares a timeout after which it can be assumed that the processing should have been completed. If, after this timeout, the job has not been explicitly declared as completed by the worker tasked with executing it, the queue manager assumes that the processing failed and toggles a visibility flag on the scheduled job, which reappears in the queue for execution. The other component of the load balancing system is the worker machine instance manager, which is in charge of managing the cluster capacity in terms of both the number and the kind of instantiated machines, based on explicitly defined scaling policies.
Scaling policies can be based on simple metrics like the queue size, on feedback about resource usage from subsequent modules in the ingestion chain, or on more complex strategies for cost or energy saving that could entail intermittent service availability.
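The visibility-timeout recovery mechanism and a queue-size scaling policy can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: the class `VisibilityQueue`, the function `desired_workers` and all parameter names are hypothetical and do not correspond to the API of any specific cloud queue service or to the prototype's actual code.

```python
import math
import time

class VisibilityQueue:
    """Toy queue: a received job stays hidden until its timeout expires or it is completed."""

    def __init__(self):
        self._jobs = {}  # job_id -> (payload, time until which the job is invisible)

    def put(self, job_id, payload):
        self._jobs[job_id] = (payload, 0.0)

    def receive(self, timeout_s, now=None):
        """Return the first visible job and hide it for timeout_s seconds, or None."""
        now = time.time() if now is None else now
        for job_id, (payload, invisible_until) in self._jobs.items():
            if invisible_until <= now:
                self._jobs[job_id] = (payload, now + timeout_s)
                return job_id, payload
        return None

    def complete(self, job_id):
        """Called by the worker on success; the job never reappears."""
        self._jobs.pop(job_id, None)

def desired_workers(queue_size, jobs_per_worker, min_workers=0, max_workers=10):
    """Queue-size scaling policy: one worker per jobs_per_worker queued jobs, clamped."""
    needed = math.ceil(queue_size / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

A job that a failed worker never marks as completed becomes visible again once its timeout has elapsed, so another worker can pick it up without any explicit failure notification.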
IV-A2 Tiling and feature extraction
Remote sensing image products cover vast areas of the Earth surface. Their content is therefore typically quite heterogeneous. Extracting a global description of each image product is not useful for most applications of content-based image retrieval systems, as, in general, the aim is to allow a user to identify specific targets or regions with similar characteristics.
The processing operations required for region-oriented content-based image mining consist of two main tasks: tiling the images in geo-referenced patches, whose size depends on the kind of sensor, on the resolution and on the aims of the content-based searches, and performing feature extraction operations.
The dimension of the tiles has to be chosen in such a way as to satisfy three main requirements:
Preserving as much as possible the local characteristics of each patch of the original image product: a tile should be small enough to make it probable that it will include only one target class;
Retaining enough information for the extraction of the ground target category. This general requirement means, in the context of polarimetric SAR images, retaining a number of pixels that is reasonable for an accurate estimation of the density distribution;
Allowing efficient use of the addressing space. Because memory devices are arrays of bytes, for the sake of memory allocation efficiency the tile dimension should be chosen following a power-of-2 scheme.
Furthermore, an overlap between the tiles can also be accounted for. During tile creation, image patches are uniquely identified, georeferenced and processed for feature extraction. The block scheme of the operations implemented for polarimetric image ingestion is shown in Fig. 4.
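The tiling requirements above can be sketched as a generator of tile origins. The function name `tile_origins` and the choice to skip partial border tiles are assumptions made for illustration; the source does not state how image borders are handled.

```python
def tile_origins(height, width, tile=512, overlap=0):
    """Yield (row, col) upper-left corners of full tiles of side `tile` pixels.

    Assumptions of this sketch: the tile side must be a power of two, the
    overlap is given in pixels, and partial tiles at the borders are skipped.
    """
    if tile <= 0 or (tile & (tile - 1)) != 0:
        raise ValueError("tile side must be a positive power of two")
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap
    for row in range(0, height - tile + 1, step):
        for col in range(0, width - tile + 1, step):
            yield row, col

# Example: a 1024-by-1536 pixel image split into non-overlapping 512-pixel tiles.
origins = list(tile_origins(1024, 1536))
```

Each origin can then serve as a unique tile identifier once combined with the product name and the georeferencing of the parent image.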
IV-A3 Distributed storage
The operations in the first phase are typically disk I/O bound and need to operate on data products that have volumes of Gigabytes (around 10 Gigabytes of average size for UAVSAR ground-projected products). The obtained results need to be accessed globally for index formation. This naturally raises the problem of allowing distributed access to files in clustering environments. For this reason, the extracted image tile descriptors and metadata are pushed to a distributed file system.
A distributed file system allows storing information on networks of worker machines, offering at the same time scalability, fault tolerance and efficient computational performance. The distributed nature of the system allows large datasets to be accommodated on the machines in the cluster. Data descriptors are stored in files that are split into one or more equal, fixed-size chunks that are replicated and stored across the network nodes. Such operations allow the system to handle node failures, thereby preserving data from being lost and ensuring that the completion of long-running computations on the cluster does not depend on any single machine. Furthermore, distributed storage systems are designed to minimize the risk of network congestion and to increase overall system throughput. As network resources have bandwidth limits, data movements have to be discouraged, preferring a paradigm that brings computation close to data and not vice versa. Distributed file systems naturally fit such computational paradigms, as nodes working as storage units can also be employed as computational units in the cluster. We adopted for our system the Hadoop Distributed File System (HDFS).
IV-B Retrieval and indexing
A crucial component for a large-scale CBIR system is an efficient methodology for the generation of a global index. In the proposed tile-based retrieval system, the index is the result of an unsupervised hierarchical clustering subsystem whose purpose is to provide an efficient way to group image tiles that are characterized by similar scattering behavior and that are therefore all likely to be relevant to a specific query. Without an index, the search engine would need to perform a scan of the entire archive, with unsustainable computational costs for mining even moderate-size archives.
In a “big data” context, it is fundamental for the system to be able to perform indexing on large-scale datasets containing billions of entries, each representing the image content and possibly consisting of a large number of descriptors, and to allow fast and accurate retrieval results. In order to generate the content-based index, the system has to analyze the set of tile descriptors generated by the feature extraction module. Such analysis typically requires the application of machine learning algorithms. However, indexing large scale archives of data requires suitable implementations, able to distribute the computation among multiple processors in the cluster. To this purpose, we propose a mechanism for scalable indexing based on Tree-Structured Vector Quantization.
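The chunking and replication idea can be illustrated with a toy placement function. This sketch uses a simple round-robin rule and hypothetical names (`place_chunks`, node labels `n1` to `n4`); it is not the placement policy of HDFS, which additionally takes rack topology into account.

```python
def place_chunks(file_size, chunk_size, nodes, replication=3):
    """Map each chunk index to the list of nodes holding one of its replicas.

    Toy round-robin placement: replica r of chunk i goes to node (i + r) mod len(nodes).
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    n_chunks = -(-file_size // chunk_size)  # ceiling division
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(n_chunks)
    }

def survives_failure(placement, failed_node):
    """True if every chunk still has at least one replica on a healthy node."""
    return all(any(n != failed_node for n in replicas) for replicas in placement.values())

# Example: a 300-unit file, 128-unit chunks, four nodes, three replicas per chunk.
placement = place_chunks(300, 128, ["n1", "n2", "n3", "n4"])
```

Because every chunk lives on several distinct nodes, losing any single node leaves all chunks readable, which is the property the text relies on for long-running computations.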
IV-B1 Tree Structured Vector Quantization
Vector Quantization is a signal processing technique that is popular in data compression. A generic schema of a vector quantizer consists of an encoding/decoding system. Let us represent the generic entry (a set of tile descriptors, in our case) in the database by a $d$-dimensional feature vector $\mathbf{x}$. The encoder maps the $d$-dimensional vectors in the vector space $\mathbb{R}^d$ to a finite set of vectors $\{\mathbf{y}_1, \dots, \mathbf{y}_M\}$, called code vectors or codewords, operating according to a nearest neighbor or minimum distortion rule. We here consider the squared error distortion, i.e. the square of the Euclidean distance between the input vector and the codeword:
$$D(\mathbf{x}, \mathbf{y}_i) = \|\mathbf{x} - \mathbf{y}_i\|^2 = \sum_{j=1}^{d} (x_j - y_{ij})^2$$
In this way, each codeword has an associated nearest neighbor region, also called a Voronoi region, defined by:
$$V_i = \left\{ \mathbf{x} \in \mathbb{R}^d : D(\mathbf{x}, \mathbf{y}_i) \le D(\mathbf{x}, \mathbf{y}_j) \ \ \forall j \ne i \right\}$$
Once a codeword is associated to an input $d$-dimensional vector, the corresponding codeword index is sent to the decoding system through a channel (a file system or a communication link, depending on the application). The decoder consists of a lookup table, i.e. a codebook containing all the possible codewords. When the index of a codeword is received, it will return the codeword corresponding to that index. The schema of the encoding/decoding system is shown in Fig. 6.
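A minimal Python sketch of the encoder and decoder follows, assuming the squared Euclidean distance as the distortion measure and a codebook held as a plain list. The function names and the example codebook are illustrative only.

```python
def squared_error(x, y):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def encode(x, codebook):
    """Return the index of the codeword with minimum distortion from x."""
    return min(range(len(codebook)), key=lambda i: squared_error(x, codebook[i]))

def decode(index, codebook):
    """Lookup-table decoder: return the codeword stored at the given index."""
    return codebook[index]

# Illustrative two-dimensional codebook with three codewords.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
index = encode((0.9, 1.2), codebook)
reconstruction = decode(index, codebook)
```

Only the integer index needs to travel through the channel; the decoder recovers the codeword from its own copy of the codebook.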
Several different approaches can be considered in the design of a Vector Quantizer. Tree-Structured Vector Quantization (TSVQ) is a class of constrained structure quantizers. In TSVQ, the codebook is constrained to have a tree structure. The encoder builds the codeword associated to an input vector by performing a sequence of binary comparisons, following a minimum distortion rule at each branch, until a leaf (terminal) node is reached. The path followed to reach the leaf node starting from the root indicates the binary sequence associated with the codeword.
A pseudo-code of the implementation of the tree-growing procedure is reported in Algorithm 1. The tree structure is built starting from the root node. The root node is represented by the centroid of the entire set of feature vectors. From the root node, two new child nodes are estimated by applying the $k$-means clustering algorithm, each corresponding to the centroid of the space partitions minimizing the within-cluster sum of squares cost function:
$$\underset{\{S_1, \dots, S_k\}}{\arg\min} \ \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \|\mathbf{x} - \mathbf{c}_i\|^2$$
where $\mathbf{c}_i$ represents the ith partition centroid and $S_i$ the corresponding partition. Then, each data point is assigned to its respective centroid by a binary string labeling, so that each group of data points defines a sub-tree. The process continues iterating on the discovered nodes, until a predefined stopping criterion is reached. The centroids of the items belonging to the leaves of the final tree structure, each with an associated binary string, represent the entries of the index. During the growing process, each split produces a decrease of the average distortion and an increase in the average length of the binary string. Based on these observations, several conditions are possible as stopping criteria. For example, the algorithm can stop growing a sub-tree (thereby defining a leaf node) when one of the following events occurs:
the node contains a predefined minimum number of data points under which further partitions become untrustworthy;
the distortion measure given by the within-cluster sum of squares of the data vectors associated with the node is below a minimum threshold;
the maximum height of the tree or, equivalently, the maximum binary code length has been reached.
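The tree-growing procedure with the three stopping criteria can be sketched in plain Python as follows. This is an illustrative single-machine sketch, not the authors' pseudo-code: the deterministic seeding (first point and the point farthest from it), the dictionary-based node layout and all names and default thresholds are assumptions.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def two_means(points, iterations=20):
    """Lloyd iterations with two clusters and a simple deterministic seeding."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = centroid(left), centroid(right)
    return left, right

def grow(points, code="", min_points=3, min_distortion=1e-9, max_depth=16):
    """Recursively split a node until one of the three stopping criteria holds."""
    center = centroid(points)
    distortion = sum(sq_dist(p, center) for p in points)
    node = {"code": code, "centroid": center, "points": points, "children": []}
    if len(points) <= min_points or distortion <= min_distortion or len(code) >= max_depth:
        return node  # leaf: too few points, low distortion, or maximum code length
    left, right = two_means(points)
    if left and right:
        node["children"] = [
            grow(left, code + "0", min_points, min_distortion, max_depth),
            grow(right, code + "1", min_points, min_distortion, max_depth),
        ]
    return node

def leaves(node):
    """Collect leaf nodes in left-to-right order; their codes are the index entries."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

Each leaf carries the binary string accumulated along its path from the root, so the length of a code equals the depth of the corresponding leaf.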
IV-B2 Scalable in-memory clustering
The proposed indexing mechanism is based on a scalable version of the TSVQ algorithm. The method builds on a distributed implementation of the $k$-means algorithm to estimate the optimal space partitioning, allowing us to parallelize the clustering operations among the worker nodes of the computing cluster.
A schema of a possible MapReduce implementation of the $k$-means algorithm is shown in Fig. 7. Input data points are stored on a distributed file system as key/value pairs. The mapping phase consists of the so-called assignment step. At the ith iteration, the set of centroid estimates is globally broadcast to map workers. Each map function takes as input a group of data points and iteratively evaluates the squared Euclidean distance between the data points and each of the centroid vectors. The outputs consist of a key/value pair for each data point, with the value being the input vector and the key being its nearest centroid. Each map worker can perform a combine operation, coupling to each centroid key the sum and the number of its associated vectors. This last local combination operation is in principle optional, yet it avoids using expensive network resources by compressing the information generated by each node. The update step of the $k$-means algorithm is finally implemented by the reduce workers. Each reduce function takes as input the intermediate key/value pairs grouped by key and computes for each centroid the average of the associated values:
$$\mathbf{c}_j^{(i+1)} = \frac{1}{|S_j^{(i)}|} \sum_{\mathbf{x} \in S_j^{(i)}} \mathbf{x}$$
where $i$ is the current iteration and $S_j^{(i)}$ is the set of vectors belonging to centroid $\mathbf{c}_j$. Each average value represents an updated centroid to be used in successive iterations. The algorithm stops when the update variations are below a given threshold.
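The assignment, combine and update steps can be mimicked on a single machine, with each list in `partitions` standing in for the data held by one map worker. This is a sketch under stated assumptions: all function names are hypothetical, no cluster framework is used, and a centroid that receives no points simply keeps its previous value.

```python
def nearest(x, centroids):
    """Index of the centroid with minimum squared Euclidean distance from x."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))

def map_and_combine(partition, centroids):
    """Assignment step on one worker, followed by the local combine:
    for each centroid index, keep only the vector sum and the point count."""
    partial = {}
    for x in partition:
        j = nearest(x, centroids)
        s, n = partial.get(j, ((0.0,) * len(x), 0))
        partial[j] = (tuple(a + b for a, b in zip(s, x)), n + 1)
    return partial

def reduce_update(partials, centroids):
    """Update step: merge the partial sums per centroid and take the average."""
    merged = {}
    for partial in partials:
        for j, (s, n) in partial.items():
            ms, mn = merged.get(j, ((0.0,) * len(s), 0))
            merged[j] = (tuple(a + b for a, b in zip(ms, s)), mn + n)
    return [tuple(v / merged[j][1] for v in merged[j][0]) if j in merged else c
            for j, c in enumerate(centroids)]

def kmeans(partitions, centroids, tol=1e-9, max_iter=100):
    """Iterate until the largest squared centroid shift falls below tol."""
    for _ in range(max_iter):
        partials = [map_and_combine(p, centroids) for p in partitions]
        updated = reduce_update(partials, centroids)
        shift = max(sum((a - b) ** 2 for a, b in zip(u, c))
                    for u, c in zip(updated, centroids))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

The combine step means that each worker ships at most one (sum, count) pair per centroid to the reducers, regardless of how many points it holds.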
The assignment step of the algorithm requires repeatedly accessing data points and evaluating the distances between each value vector and the cluster centroids:
$$\begin{aligned} \|\mathbf{x} - \mathbf{c}_j\|^2 &= (\mathbf{x} - \mathbf{c}_j)^{T} (\mathbf{x} - \mathbf{c}_j) \\ &= \|\mathbf{x}\|^2 - 2\,\mathbf{x}^{T}\mathbf{c}_j + \|\mathbf{c}_j\|^2 \end{aligned} \tag{17}$$
As the data points are fixed during iterations, the first term in the second line of the expression in Eq. 17 does not need to be recomputed at each iteration of the algorithm. Furthermore, the third term, i.e. the centroid squared Euclidean norm, is also used repeatedly inside each single iteration. This indicates the advantage of performing operations in-memory and avoiding, at each iteration, costly operations such as recomputing invariable quantities and storing and reading intermediate results on low-performing devices. While standard cluster computing frameworks like Hadoop routinely materialize to the distributed file system the intermediate results produced by the directed acyclic graph of operations, with massive usage of cluster resources especially in the case of iterative processing schemes, in-memory ones like Apache Spark [21, 20] avoid this step and have therefore been selected as the base for our implementation.
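The benefit of caching can be made concrete with a small sketch. It assumes the usual expansion of the squared Euclidean distance into the squared norm of the point, minus twice the inner product, plus the squared norm of the centroid; the point norms are computed once and reused, while the centroid norms are computed once per iteration. Function names are illustrative.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def assign(points, point_norms, centroids):
    """Assignment step using cached squared norms.

    point_norms holds ||x||^2 for every point and never changes across
    iterations; centroid norms are computed once per call, not once per point.
    """
    centroid_norms = [dot(c, c) for c in centroids]
    labels = []
    for x, xn in zip(points, point_norms):
        dists = [xn - 2.0 * dot(x, c) + cn for c, cn in zip(centroids, centroid_norms)]
        labels.append(dists.index(min(dists)))
    return labels

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
point_norms = [dot(x, x) for x in points]  # computed once, kept in memory
labels = assign(points, point_norms, [(0.0, 0.0), (10.0, 0.0)])
```

Only the inner product term has to be evaluated for every point and centroid pair at each iteration, which is also the term that benefits most from sparse feature vectors.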
Further improvements can be obtained by accounting for the sparsity of the feature vectors. If the vectors are sparse, the computational complexity of performing $k$-means decreases from $O(n\,d\,k\,i)$ to $O(n\,z\,k\,i)$, where $n$ is the number of data points, $d$ is the data space dimensionality, $k$ is the number of clusters, $i$ is the number of iterations needed for convergence and $z$ is the (average) number of non-zero elements in each data point.
Iv-B3 TSVQ Complexity
The implemented TSVQ for scalable indexing processing works with a fixed number of clusters, . This choice allows the association of binary codes to the leaves of the tree, each with a length equal to the level of the corresponding leaf. As each leaf corresponds to a partition centroid, all the points will be labeled with the binary code of the leaf they belong to. Given the tree structure generated by the TSVQ algorithm and a query vector, retrieving the nearest feature vectors consists of traversing the tree from the root node and choosing, at each branch, the nearest child node, until a leaf is reached. The retrieved data points are the ones with the same codeword as the final leaf.
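Query processing on such a binary tree can be sketched as a simple descent. The nested-dictionary layout, the field names (`code`, `centroid`, `children`, `tiles`) and the hand-built example tree are assumptions made for illustration only.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def search(node, query):
    """Descend from the root, always following the child whose centroid is
    nearest to the query, and return the leaf code with its tile identifiers."""
    while node["children"]:
        node = min(node["children"], key=lambda child: sq_dist(query, child["centroid"]))
    return node["code"], node["tiles"]

# Hand-built toy index with three leaves (codes "0", "10" and "11").
tree = {
    "code": "", "centroid": (2.5, 2.5), "children": [
        {"code": "0", "centroid": (0.0, 0.0), "children": [], "tiles": ["a", "b"]},
        {"code": "1", "centroid": (5.0, 5.0), "children": [
            {"code": "10", "centroid": (4.0, 5.0), "children": [], "tiles": ["c"]},
            {"code": "11", "centroid": (6.0, 5.0), "children": [], "tiles": ["d", "e"]},
        ]},
    ],
}
code, tiles = search(tree, (5.8, 5.1))
```

The number of distance evaluations is two per level, so for a reasonably balanced tree the cost of a query grows with the tree depth rather than with the number of indexed tiles.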
There are two main motivations behind the choice of the TSVQ algorithm for index formation. First, this approach leads to the definition of a binary indexed tree, a structure that allows efficient lookup and update operations. In particular, search and modification operations can be executed in constant or logarithmic average time, instead of the linear access time required by linked lists. Second, as described in Sec. IV-B2, the $k$-means algorithm can be reimplemented to scale to massive sets of data. In addition, scalable implementations of seeding strategies have recently been proposed, allowing for faster runs and for searches that are more robust with respect to suboptimal solutions [1].
IV-C Search and interactive visualization subsystems
The interactive visualization subsystem provides users with a query interface to the search engine where they can select or upload a query/example image tile and set metadata constraints for the search (geographic coverage, time of acquisition, sensor parameters, mission annotations). The search subsystem is then invoked to transform the user provided examples to a form suitable for query processing, i.e. performing any pre-processing and feature extraction operations on them, and then to determine the best matching tiles based on the content index. Finally, the interactive visualization subsystem receives the identifiers of the best matching tiles and displays the results.
We considered full polarimetric SAR data products from NASA JPL’s UAVSAR image archives. In the public archive, six images are made available for each product, namely the cross-products of the 4 Single Look Complex files representing the measurements of the scattering matrix elements $S_{HH}$, $S_{HV}$, $S_{VH}$ and $S_{VV}$. Each product is available in the ground-projected polarimetric format (equiangular geographic projection, 6-by-6 m pixel resolution). In the experiments, data ingestion has been carried out by instantiating five nodes on the elastic cloud computing infrastructure by using the Amazon EC2 service (http://aws.amazon.com/ec2/). In the tiling phase, each image is divided into square patches of 512-by-512 pixels, corresponding to areas of approximately 3-by-3 km. Then, for each tile, the feature extraction module performs pixel-by-pixel decomposition procedures, forms multi-dimensional histograms and pushes the vectorized tile descriptors to a durable cloud-based storage infrastructure, the Amazon S3 service (http://aws.amazon.com/s3/), in which data are redundantly stored across multiple facilities. We use such a durable storage service because the results of the ingestion need to be reused in multiple experiments. In this work, we performed the ingestion of 43 UAVSAR products, amounting to about 0.67 terabytes of data, from which we obtained 12950 tiles, with a total descriptor size of 2.4 GB. The replication factor in the HDFS filesystem has been set to three in all experiments.
V-A Scalability analysis
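The tiling arithmetic can be made concrete with a short sketch. It assumes that only complete 512-by-512 tiles are kept and that partial tiles at the image borders are discarded; the text does not state how borders are handled, so this is an assumption, and the function name and image size are hypothetical.

```python
def tile_origins(height, width, tile=512):
    """Top-left (row, column) pixel coordinates of every complete tile."""
    return [(row, col)
            for row in range(0, height - tile + 1, tile)
            for col in range(0, width - tile + 1, tile)]


# Made-up image size, for illustration only.
origins = tile_origins(1100, 2048)

# Ground footprint of one tile side at 6 m per pixel.
footprint_m = 512 * 6
```

With 6 m pixels, a tile side covers 3072 m, which matches the approximate 3-by-3 km footprint quoted above.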
| ID | Processor model | Processor frequency (GHz) | Processor cache size (MB) | vCPU | Memory (GB) | Network bandwidth (Mbps) |
| m1.xlarge | E5-2651 v2 | 1.80 | 30 | 4 | 15.00 | (High) |
An experimental analysis of both vertical and horizontal scalability has been conducted by varying the cluster configuration. Horizontal scalability is the ability to face increases in workload by adding worker entities to an application environment. Fig. 8 shows the results of the tests carried out by varying the number of nodes in the cluster. The cluster instances are of type m1.medium (see Tab. I). The relation between the number of nodes and the time needed for data processing can be expected to be linear. For this reason, a least squares linear fit has been carried out on the results, and the presence of possible outliers has been assessed by the Bonferroni method with familywise significance threshold . The first plot refers to the time needed to transfer data from the durable storage (S3) to the cluster distributed storage (HDFS). Since both storage systems are distributed, data transfer can be carried out in parallel and the linear relation is satisfied quite well. The second plot refers to the indexing running times. In this case, an outlier is detected for the simplest cluster, with two nodes, while the linear dependence is well satisfied for all larger configurations. The worse performance of the two-node configuration, which falls outside the linear relation, can be attributed to an overall shortage of memory. This causes the loss of already computed in-memory intermediate results, thereby forcing their recomputation from HDFS and degrading the overall system performance.
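The outlier check can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for Fig. 8: it fits a straight line by least squares and computes externally studentized residuals, which one common form of the Bonferroni outlier test compares against the Student t quantile at level 1 - alpha/(2n) with n - 3 degrees of freedom, where alpha is the familywise threshold and n the number of cluster configurations. The node counts and running times are made-up numbers, and the function name is hypothetical.

```python
import math


def studentized_residuals(x, y):
    """Least squares line fit plus externally studentized residuals.

    Requires at least four points and a fit that is not exact once any
    single point is removed (otherwise the deleted variance is zero).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - 2)  # residual variance
    t_values = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_mean) ** 2 / sxx
        r2 = e * e / (s2 * (1.0 - leverage))  # squared internal residual
        # Convert to the externally studentized (deleted) residual.
        t = math.sqrt(r2 * (n - 3) / (n - 2 - r2))
        t_values.append(math.copysign(t, e))
    return slope, intercept, t_values


# Hypothetical running times in seconds for clusters of 2 to 7 nodes.
nodes = [2, 3, 4, 5, 6, 7]
seconds = [100.0, 61.0, 51.0, 45.0, 35.0, 29.0]
slope, intercept, t_values = studentized_residuals(nodes, seconds)
most_extreme = max(range(len(nodes)), key=lambda j: abs(t_values[j]))
```

A configuration would be flagged when the absolute value of its studentized residual exceeds the Bonferroni-corrected critical value, which can be read from t tables or obtained from a statistics library.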
The advantage of performing in-memory computations during the indexing procedure is better exemplified in Fig. 9. Standard frameworks for data analytics like Apache Hadoop would require a write operation at each iteration in order to save intermediate results on the distributed file system. This operation involves overhead, as data must be distributed and replicated across the nodes, thereby increasing the total network traffic and slowing the computation further. The results in Fig. 9 refer to an experiment performed with a cluster of 3 m1.medium nodes. In this case, exploiting in-memory computations improves indexing performance by almost 90%.
Vertical scalability is the ability to increase the computation capabilities of the cluster by improving the capacity of existing hardware or software. In theory, the more resources we allocate in the cluster (for instance memory, network bandwidth or number of virtual CPUs), the more work can be carried out in a given amount of time. However, increasing hardware capacity helps only up to a certain point, as the best configuration depends on the algorithm, on the user's needs and on their means. In Fig. 10 we show an analysis of the indexing running times obtained by varying the kind of instances allocated in the cluster, as reported in Tab. I. The number of nodes in the cluster is kept fixed at three. A first observation in these experiments concerns the cheapest kind of instance, the m1.small, which performs considerably worse than all the others. The reason here is a combination of scarce memory and scarce network resources. On the other hand, the improvements in the other experiments are mainly attributable to the increase in the number of virtual CPUs allocated.
As for the system response time to queries, it is in the order of seconds and essentially independent of variations of the cluster settings.
V-B Cost analysis
We also performed a cost-benefit analysis, in order to evaluate whether investing in more resources brings effective advantages, and to tune the system accordingly. Results are reported in Fig. 11 and refer to the indexing phase. The two plots relate the percentage improvement of the indexing running time to the percentage variation of the cost obtained by varying the resources allocated in the cloud, both measured with respect to the worst case, as described below. The analysis for a variable number of nodes in the cluster is shown in Fig. 11(a), where the instances are of type m1.medium and the worst case used for comparison is the two-node configuration (see Fig. 8). This test shows that adding more nodes to the cluster reduces the indexing running times, but the increasing investment is not balanced by an equivalent increase in performance. For the data volumes considered in the experiment, the best trade-off has been obtained with the three-node configuration.
Similar considerations hold for the analysis shown in Fig. 11(b), which refers to the use of different types of instances, with the number of nodes fixed at three and with the m1.small instances as the worst case used for comparison (see Fig. 10). In this case, positive balances are obtained for both m1.medium and c1.medium instances, while the least favourable balance is obtained for c1.xlarge instances.
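The comparison can be expressed with a few lines of arithmetic. The sketch below assumes that the balance of a configuration is the percentage improvement in running time minus the percentage increase in cost, both relative to the worst-case baseline; this definition, the function names and the numbers are assumptions made for illustration and are not taken from Fig. 11.

```python
def percent_change(baseline, value):
    """Signed percentage variation of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


def trade_off(baseline_time, baseline_cost, time, cost):
    """Return (time improvement %, cost increase %, balance)."""
    improvement = -percent_change(baseline_time, time)
    extra_cost = percent_change(baseline_cost, cost)
    return improvement, extra_cost, improvement - extra_cost


# Made-up example: the baseline takes 1000 s at a cost of 2.0 units,
# a larger configuration takes 600 s at a cost of 3.0 units.
improvement, extra_cost, balance = trade_off(1000.0, 2.0, 600.0, 3.0)
```

In this made-up case the running time improves by 40% while the cost grows by 50%, so the balance is negative: the additional investment is not repaid by an equivalent gain in performance.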
V-C Quality assessment
Evaluating the quality of content-based retrieval systems is a challenging task. In the literature, quality assessment is not always present or is based on non-public databases of categorized images. Evaluating the retrieval performance on massive archives of images is even more difficult. A first idea for automatically assigning labels to our dataset was to exploit publicly available worldwide land cover data, such as those provided by ESA’s Globcover project, which are derived from MERIS sensor data. UAVSAR products were coregistered with land cover maps and image patches were obtained from both sources. Land cover tile descriptors consisted of the vectorized class labels, and a separate index was created for this dataset. We then tried to evaluate retrieval performance by querying the two search engines with coregistered example images, i.e. with polarimetric SAR and land cover images respectively. However, very poor results were obtained. One reason is the difference in the spectral bands of the sensors (wavelengths on the order of centimetres for L-band UAVSAR products and of nanometres for the MERIS sensor), since the information collected about a target is different in the two cases. The other is the difference in their spatial resolution (3 m and 300 m respectively).
Therefore we resort here to visual inspection. To this aim, the system has been repeatedly queried with different example images. The results of ten experiments, each characterized by different target characteristics, are shown in Tab. II, where the query tile is shown together with quick looks of the top ten tiles displayed by the interactive visualization subsystem. We consider the retrieved tiles to be highly consistent with the query images in terms of content similarity. Better performance might be achieved by enriching the tile content description with textural and shape descriptors, as well as with more advanced descriptions for metric resolution data based on signal decomposition in fractional frequency domains as in [18, 17].
VI Discussion and Conclusions
Content-based EO image retrieval is an open research area that is currently receiving particular attention because of the fast and continuous increase in the volume of acquired product repositories. While several systems for image mining have been proposed, they often lack the capability of managing and accessing massive amounts of data because of their architectural and algorithmic approaches oriented to limited archive volumes.
We propose using inexpensive computing clusters instantiated on public Infrastructure-as-a-Service cloud platforms in order to ingest and index petabyte-scale remote sensing image archives, with the aim of implementing query-by-example retrieval functionalities on them. To this end, we developed and tested a full prototype operating on a subset of polarimetric SAR products from the NASA JPL UAVSAR archive.
The ingestion component of the system is centered on an elastic load balancer that instantiates virtual resources on elastic platforms according to the occupancy of the input queues, in order to distribute ingestion processes over multiple machines. The indexing component is a parallel, scalable indexing algorithm built upon a cluster-computing framework, which enables the prototype to face the difficulties of performing large scale processing on massive image archives. The use of public Infrastructure-as-a-Service cloud platforms has proved both flexible, by providing dynamic scaling of resources, and advantageous, as it allows economies of scale. The measured retrieval performance has shown both the efficiency of the system in terms of response times and its effectiveness in terms of retrieval quality. Although designed for the analysis of polarimetric SAR image archives, the proposed approach is general and can be extended to ingest archives from different sensors and missions.
- [1] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
- [2] Shane R Cloude and Eric Pottier. A review of target decomposition theorems in radar polarimetry. IEEE Transactions on Geoscience and Remote Sensing, 34(2):498–518, 1996.
- [3] PC Cosman, C Tseng, R Gray, RA Olshen, LE Moses, HC Davidson, CJ Bergin, and EA Riskin. Tree-structured vector quantization of CT chest scans: image quality and diagnostic accuracy. IEEE Transactions on Medical Imaging, 12(4):727–739, 1993.
- [4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
- [5] Ian Foster. Designing and building parallel programs. Addison Wesley Publishing Company, 1995.
- [6] Allen Gersho and Robert M Gray. Vector quantization and signal compression. Springer, 1992.
- [7] Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in cloud computing: What it is, and what it is not. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), pages 23–27. USENIX, 2013.
- [8] M Klaric, G Scott, Chi-Ren Shyu, C Davis, and K Palaniappan. A framework for geospatial satellite imagery retrieval systems. In IEEE International Conference on Geoscience and Remote Sensing Symposium (IGARSS 2006), pages 2457–2460. IEEE, 2006.
- [9] Manolis Koubarakis, Michael Sioutis, George Garbis, Manos Karpathiotakis, Kostis Kyzirakos, Charalampos Nikolaou, Konstantina Bereta, Stavros Vassos, Corneliu Octavian Dumitru, Daniela Espinoza-Molina, et al. Building virtual earth observatories using ontologies, linked geospatial data and knowledge discovery algorithms. In On the Move to Meaningful Internet Systems: OTM 2012, pages 932–949. Springer, 2012.
- [10] Jong-Sen Lee and Eric Pottier. Polarimetric radar imaging: from basics to applications. CRC Press, 2009.
- [11] Michael McCool, James Reinders, and Arch Robison. Structured parallel programming: patterns for efficient computation. Elsevier, 2012.
- [12] Diana Moise, Denis Shestakov, Gylfi Gudmundsson, and Laurent Amsaleg. Terabyte-scale image similarity search: experience and best practice. In 2013 IEEE International Conference on Big Data, pages 674–682. IEEE, 2013.
- [13] Marco Quartulli and Igor G Olaizola. A review of EO image information mining. ISPRS Journal of Photogrammetry and Remote Sensing, 75:11–28, 2013.
- [14] Paul A Rosen, Scott Hensley, Kevin Wheeler, Greg Sadowy, Tim Miller, Scott Shaffer, Ron Muellerschoen, Cathleen Jones, Howard Zebker, and Soren Madsen. UAVSAR: a new NASA airborne SAR system for science and technology research. In 2006 IEEE Conference on Radar, 8 pp. IEEE, 2006.
- [15] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
- [16] Chi-Ren Shyu, Matt Klaric, Grant J Scott, Adrian S Barb, Curt H Davis, and Kannappan Palaniappan. GeoIRIS: Geospatial information retrieval and indexing system-content mining, semantics modeling and complex queries. IEEE Transactions on Geoscience and Remote Sensing, 45(4):839–852, 2007.
- [17] Jagmal Singh and Mihai Datcu. SAR image categorization with log cumulants of the fractional Fourier transform coefficients. IEEE Transactions on Geoscience and Remote Sensing, 51(12):5273–5282, 2013.
- [18] Marc Walessa and Mihai Datcu. Model-based despeckling and information extraction from SAR images. IEEE Transactions on Geoscience and Remote Sensing, 38(5):2258–2269, 2000.
- [19] Tom White. Hadoop: the definitive guide. O’Reilly Media, Inc., 2009.
- [20] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.
- [21] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10–10, 2010.