3-D Visual Coverage Based on Gradient Descent Techniques on Matrix Manifold and Its Application to Moving Objects Monitoring
This paper investigates coverage control for visual sensor networks based on gradient descent techniques on matrix manifolds. We consider the scenario in which networked vision sensors with controllable orientations are distributed over 3-D space to monitor a 2-D environment. Then, the decision variable must be constrained to the Lie group SO(3). The contribution of this paper is twofold. The first is technical, namely we formulate the coverage problem as an optimization problem on SO(3) without introducing a local parameterization such as Euler angles, and directly apply the gradient descent algorithm on the manifold. The second, technological contribution is to present not only the coverage control scheme but also the density estimation process, including image processing and curve fitting, while exemplifying its effectiveness through a simulation of moving-object monitoring.
Aspirations for the safety and security of human lives against crimes and natural disasters motivate us to establish smart systems to monitor the surrounding environment. In this regard, vision sensors are expected to be powerful sensing components since they provide rich information about the outer world. Indeed, visual monitoring systems have already been commoditized and are working in practice. Typically, in current systems, the various decision-making and situation-awareness processes are conducted at a monitoring center by human operator(s), and partial distributed computing at each sensor is, if performed at all, done independently of the other sensors. However, as the volume of image streams increases, it becomes desirable to distribute the entire process to each sensor while achieving total optimization through cooperation among sensors.
Distributed processing over visual sensor networks has been actively studied in recent years, motivated by a variety of application scenarios –. Among them, several papers address optimal monitoring of the environment assuming mobility of the vision sensors –, where the network is required to ensure the best view of a changing environment . The problem is related to coverage control –, whose objective is to deploy mobile sensors efficiently in a distributed fashion. A typical approach to coverage control is to employ the gradient descent algorithm for an appropriately designed aggregate objective function. The objective function is usually formulated by integrating the product of a sensing performance function of a point and a density function indicating the relative importance of the point. The approach is also applied to visual coverage in –. The state of the art of coverage control is compactly summarized in , and a survey of related works in the computer vision community is found in .
In this paper, we consider a visual coverage problem in which vision sensors with controllable orientations are distributed over 3-D space to monitor a 2-D environment. In this case, the control variables, i.e. the rotation matrices, must be constrained to the Lie group SO(3), which distinguishes the present paper from the works on 2-D coverage –. On the other hand, [4, 9, 10] consider situations similar to this paper. [9, 10] take game-theoretic approaches, which allow the network to achieve globally optimal coverage with high probability, but the convergence tends to be slower than with the standard gradient descent approach. In contrast,  employs the gradient approach by introducing a local parameterization of the rotation matrix and regarding the problem as optimization on a vector space.
This paper approaches the problem differently from . We directly formulate the problem as optimization on SO(3) and apply the gradient descent algorithm on matrix manifolds . This approach will be shown to allow one to parametrize the control law for a variety of underactuations imposed by hardware constraints. This paper also addresses density estimation from acquired data, which is investigated in  for 2-D coverage. However, we need to take account of the following characteristics of vision sensors: (i) the sensing process inherently includes the projection of the 3-D world onto a 2-D image, and (ii) explicit physical data is not provided. To reflect (i), we incorporate the projection into the optimization problem on the embedding manifold of SO(3). Issue (ii) is addressed technologically, where we present the entire process including image processing and curve fitting techniques. Finally, we demonstrate the utility of the present coverage control strategy through a simulation of moving-object monitoring.
Preliminary: Gradient on Riemannian Manifold
Let us consider a Riemannian manifold whose tangent space at is denoted by , and the corresponding Riemannian metric, a smooth inner product defined over , is denoted by . Now, we introduce a smooth scalar field defined over the manifold , and the derivative of at an element in the direction , denoted by . We see from Definition 3.5.1 and (3.15) of  that the derivative is defined by
where is a smooth curve such that . In particular, when is a linear manifold with , the derivative is equal to the classical directional derivative
Now, the gradient of is defined as follows.
 Given a smooth scalar field defined over a Riemannian manifold , the gradient of at , denoted by , is defined as the unique element of satisfying
Suppose now that is a Riemannian submanifold of a Riemannian manifold , namely is a subspace of and they share a common Riemannian metric. In addition, the orthogonal projection of an element of onto is denoted by . Then, the following remarkable lemma holds true.
 Let be a scalar field defined over such that the function defined on is a restriction of . Then, the gradient of satisfies the equation
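As a concrete illustration of this lemma (not taken from the paper), consider the unit sphere embedded in Euclidean 3-space: the Riemannian gradient of a restricted linear function is the orthogonal projection of its Euclidean gradient onto the tangent plane. A minimal sketch in Python, with all names and the choice of manifold hypothetical:

```python
import numpy as np

# Illustration of the lemma: on a Riemannian submanifold, the gradient of a
# restricted scalar field equals the orthogonal projection of the ambient
# (Euclidean) gradient onto the tangent space.  Shown for the unit sphere
# S^2 in R^3 and f(x) = c^T x (a hypothetical example, not the paper's
# objective function).

def tangent_projection_sphere(x, v):
    """Orthogonally project an ambient vector v onto T_x S^2 = {v : x^T v = 0}."""
    return v - (x @ v) * x

c = np.array([1.0, 2.0, 3.0])          # Euclidean gradient of f(x) = c^T x
x = np.array([0.0, 0.0, 1.0])          # a point on the sphere
riem_grad = tangent_projection_sphere(x, c)

# The Riemannian gradient is tangent to the sphere at x.
assert abs(riem_grad @ x) < 1e-12
```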
II Targeted Scenario
II-A Vision Sensors and Environment
We consider the situation illustrated in Fig. 1 where vision sensors are located in 3-D Euclidean space. Let the fixed world frame be denoted by and the body fixed frame of sensor by . We also denote the position vector of the origin of relative to by , and the rotation matrix of relative to by . Then, the pair , called pose, represents the configuration of sensor . In this paper, each sensor’s position is assumed to be fixed, and sensors can control only their orientations . In addition, we suppose that sensors are localized and calibrated a priori and is available for control.
We use the notation to describe not only the pose but also a coordinate transformation operator similarly to . Take two frames and . Let the pose of the frame relative to be denoted by , and the coordinates of a point relative to by . Then, the coordinates of the point relative to are given as
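The coordinate transformation just described can be sketched in a few lines; the function and variable names below are hypothetical:

```python
import numpy as np

def transform(p_ab, R_ab, q_b):
    """Coordinates relative to frame a of a point whose coordinates relative
    to frame b are q_b, given the pose (p_ab, R_ab) of b relative to a."""
    return p_ab + R_ab @ q_b

# Example: frame b is translated by (1, 0, 0) and rotated 90 degrees about
# the z-axis relative to frame a.
p_ab = np.array([1.0, 0.0, 0.0])
R_ab = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
q_a = transform(p_ab, R_ab, np.array([1.0, 0.0, 0.0]))
```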
Let us next define the region to be monitored by a group of sensors . In this paper, we assume that the region is a subset of a 2-D plane (Fig. 1), where the 2-D plane is called the environment and the subset to be monitored is called the mission space. Let the sets of coordinates of all points in the environment and the mission space relative to be denoted by and , respectively. Just for simplicity, we suppose that the world frame is attached so that its -plane is parallel to the environment (Fig. 1). Then, the set is formulated as
with some constant , where is the -th standard basis vector. Suppose that a function , called a density function, indicating the relative importance of every point is defined over . In this paper, the function is assumed to be small if a point is important and to satisfy with a constant such that .
A vision sensor has an image plane containing the sensing array, whose elements, called pixels, provide numbers reflecting the amount of incident light. We assume that the image plane is a rectangle as illustrated in Fig. 3. The set of position vectors of all points on the image plane relative to the sensor frame is denoted by . Now, the axes of the sensor frame are assumed to be selected so that its -plane is parallel to the image plane and its -axis, perpendicular to the image plane, passes through the focal center of the lens. Then, the third element of any point in the set must be equal to the focal length .
We next denote the set of pixels of sensor by and the position vector of the center of the -th pixel on the image plane of sensor relative to by . Since in and differ, we may need to use notation like , but we omit the subscript to reduce notational complexity. In addition, the positions of its vertices relative to are denoted by (Fig. 3).
When a point on the environment with coordinates relative to is captured by a sensor with , the point is projected onto the image plane as illustrated in Fig. 4. If the coordinates of the projected point are denoted by , it is well known that the projection is formulated as
It is not difficult to show that the inverse map of the map (Fig. 4) from to is given by
Note that, while is independent of , the map depends on and hence we describe as .
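Under the pinhole model described above, the projection and its inverse can be sketched as follows. All names are illustrative assumptions: `lam` denotes the focal length, the environment is taken to be the plane z = `z_env` in the world frame, and the dependence of the inverse map on the rotation `R` appears explicitly:

```python
import numpy as np

e3 = np.array([0.0, 0.0, 1.0])

def project(q_s, lam):
    """Perspective projection of a point q_s (sensor frame) onto the image
    plane; the result's third element equals the focal length lam."""
    return (lam / (e3 @ q_s)) * q_s

def back_project(m, p, R, z_env):
    """Inverse map: a point m on the image plane (sensor frame) back to the
    environment plane z = z_env, for a sensor with pose (p, R).  Note the
    dependence on R, as remarked in the text."""
    d = R @ m                          # ray direction in the world frame
    t = (z_env - e3 @ p) / (e3 @ d)    # scale at which the ray hits the plane
    return p + t * d

# Round trip: a sensor at height 5 looking straight down (R flips the z-axis).
p = np.array([0.0, 0.0, 5.0])
R = np.diag([1.0, -1.0, -1.0])         # a valid rotation (det = +1)
q_w = np.array([1.0, 2.0, 0.0])        # a point on the environment z = 0
m = project(R.T @ (q_w - p), lam=0.01)
q_back = back_project(m, p, R, z_env=0.0)
assert np.allclose(q_back, q_w)
```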
Using the map , we denote by the set of coordinates of the field of view (FOV) of each sensor relative to , which is also a polytope. Its -polytope representation is trivial, namely it is given by the convex hull of the four points with coordinates
relative to (Fig. 5). The -polytope representation is also computed efficiently as follows.
Suppose now that the -th side line segment () specifying the boundary of the image plane connects the vertices and , without loss of generality. Then, the segment projected onto the environment is also a line segment whose vertices have coordinates and relative to , and hence the line is formulated as
where the matrix is derived as
from the fact that and are on the line. Since the coordinates of any interior point of must be inside the FOV, a half space specifying the FOV is described by the inequality with
In the same way, we can find the pair for all . Stacking them allows one to formulate the FOV as
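The construction of the half-space (H-) representation from the four projected vertices can be sketched as follows, assuming (as an illustration, not the paper's exact formulas) that the projected corners are given in counter-clockwise order on the environment plane:

```python
import numpy as np

def fov_halfspaces(vertices):
    """Build (A, b) with A x <= b describing the convex FOV polygon on the
    environment, from its projected corner coordinates listed in
    counter-clockwise order (shape (4, 2))."""
    A, b = [], []
    n = len(vertices)
    for k in range(n):
        v1, v2 = vertices[k], vertices[(k + 1) % n]
        edge = v2 - v1
        normal = np.array([edge[1], -edge[0]])   # outward normal for CCW order
        A.append(normal)
        b.append(normal @ v1)
    return np.array(A), np.array(b)

# A unit-square FOV: interior points satisfy all four inequalities.
A, b = fov_halfspaces(np.array([[0.0, 0.0], [1.0, 0.0],
                                [1.0, 1.0], [0.0, 1.0]]))
inside = np.all(A @ np.array([0.5, 0.5]) <= b)
outside = np.all(A @ np.array([2.0, 0.5]) <= b)
```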
III Coverage for a Single Sensor
In this section, we consider a simple case with .
III-A Objective Function
Let us first define the objective function to be minimized by sensor . In this paper, we basically follow the concept of coverage control [11, 12], where the objective function is defined by a sensing performance function and a density function at a point . Note, however, that we accumulate the function only at the centers of the pixels projected onto the environment, in order to reflect the discretized nature of vision sensors. In the sequel, the sensing performance function and the density function at are denoted by and , respectively.
Let us next define a function providing the coordinates in of the point on which is captured by the -th pixel as
Then, the objective function takes the form of
where is a weighting coefficient. If we impose a large on the pixels around the center of the image, the sensor tends to capture the important area around the image center. If we need to accelerate computation, replacing in (7) by a subset is an option. To be precise, we would need to introduce an extended function allowing , but we do not dwell on this since it can easily be avoided by choosing appropriately.
Similarly to , we let the performance function depend only on the distance . Remark, however, that differently from , the third element of is not controllable since the sensor position is fixed. This may cause the problem that the penalty for viewing distant areas does not work when this element is large enough. However, the element cannot be ignored since it reflects the heterogeneous characteristics of vision sensors in the multi-sensor case. We thus use the weighted distance
with , where is introduced since the distance is scaled by the focal length. Suppose that is set as . Then, a large imposes a heavy penalty on viewing distant areas, and a small one a light penalty. In particular, when for some , (8) is rewritten as
Once the density function is given, the goal reduces to minimization of (7) with (9) under the restriction . In order to solve the problem, this paper takes the gradient descent approach, which is standard in coverage control. For this purpose, it is convenient to define an extension such that if . We first extend the domain of in (6) from to as
Then, the vector is not always on the environment when but the function in (8) is well-defined even if the domain is altered from to . We thus denote the function with the domain by , and define the composite function
We next focus on the term in (7) and expand the domain of the composite function from to . Here, since is not always on , we need to design such that if . In this paper, we assign to a point the density of a point
where the operations are illustrated in Fig. 6. Accordingly, the density function is defined by
Remark that, differently from , the function is not naturally extended and the selection of is not unique. The motivation to choose (12) will be clear in the next subsection.
III-B Density Estimation for Moving Objects Monitoring
In the gradient descent approach, we update the rotation in the direction of at each time . This subsection assumes that the density is not given a priori and that needs to be estimated from acquired vision data as investigated in [4, 14].
Let us first consider an ideal situation such that the density function is exactly projected onto the image plane, namely
holds with respect to the density over the image plane. Then, the density function value is available at any point in the FOV. We next consider a point which does not always lie on . Then, the value of is also given by the same function as (14) since
Ensuring the equality is the reason for choosing (12).
We next consider estimation of the density on the image, since assuming (14) is unrealistic. A rich literature has been devoted to information extraction from raw vision data, and a variety of algorithms are currently available even without expert knowledge . For example, it is possible to detect and localize specific objects, such as cars or human faces, in the image plane, and even abstract targets such as anything moving or some environmental change.
The present coverage scheme is indeed applicable to any scenario such that a nonnegative number reflecting its own importance is assigned to each pixel after conducting some image processing. However, we mainly focus on a specific scenario of monitoring moving objects on the mission space. Suppose that a sensor captures a human walking from left to right in the image as in Fig. 7. Then, a way to localize such moving objects is to compute optical flows from consecutive images as in Fig. 7, where the flows are depicted by yellow lines. We also let the data be the norm of the flow vector at each pixel. Then, the plots of over the image plane are illustrated by green dots in Fig. 9.
We next fit the data by a continuous function defined over and use this function as . Such algorithms are also available even in real time . Similarly to , we employ the mixed Gaussian function, which is known to approximate a variety of functions with excellent precision as the number of Gaussian components increases, and which is widely used in data mining, pattern recognition, machine learning and statistical analysis. Fig. 9 shows the Gaussian function with computed so as to fit the data in Fig. 9. Of course, using a larger achieves a better approximation, as shown in Fig. 10.
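The curve-fitting step can be sketched as below. To keep the example runnable, the Gaussian centers and widths are fixed on a grid and only the mixture weights are fit by linear least squares; the paper's fitting may also adjust centers and covariances, and all names here are hypothetical:

```python
import numpy as np

def gaussian_basis(X, centers, sigma):
    """Design matrix of isotropic Gaussians: X is (m, 2) pixel coordinates,
    centers is (n, 2); returns an (m, n) matrix."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_mixture_weights(X, h, centers, sigma):
    """Least-squares fit of the mixture weights to per-pixel data h."""
    Phi = gaussian_basis(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, h, rcond=None)
    return w

# Synthetic flow-magnitude data concentrated around (0.3, 0.7).
grid = np.stack(np.meshgrid(np.linspace(0, 1, 20),
                            np.linspace(0, 1, 20)), axis=-1).reshape(-1, 2)
h = np.exp(-((grid - [0.3, 0.7]) ** 2).sum(axis=1) / (2 * 0.1 ** 2))
centers = np.array([[0.3, 0.7], [0.8, 0.2]])
w = fit_mixture_weights(grid, h, centers, sigma=0.1)
```

Since the synthetic data is exactly one basis Gaussian, the fit recovers weights close to (1, 0).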
As a result, we obtain a function in the form of
over the 2-D image plane coordinates . Note that (15) is large when captures an important point, which is opposite to the density function. Thus, we define the function
where is a positive scalar guaranteeing for all . It is also convenient to define for all 3-D vectors on the image plane as
III-C Gradient Computation
Full 3-D Rotational Motion
Here, we derive the gradient , given a rotation and in (17). It is widely known that is formulated as , where is the set of all skew-symmetric matrices in . We also define the operator (wedge) from to such that for the cross product . The rotation group SO(3) is known to be a submanifold of a Riemannian manifold with and the Riemannian metric
See Subsection 3.6.1 of  for more details.
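A numerical sketch of these objects (the wedge operator, the orthogonal projection onto the tangent space of SO(3), and a gradient-type update that stays on SO(3) via the matrix exponential) is given below; the step size and the use of Rodrigues' formula as the retraction are illustrative assumptions, not the paper's exact control law:

```python
import numpy as np

def wedge(w):
    """Map w in R^3 to the skew-symmetric matrix W satisfying W x = w cross x."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def tangent_projection(R, M):
    """Orthogonally project an ambient 3x3 matrix M onto the tangent space of
    SO(3) at R: take the skew-symmetric part of R^T M and push it back to R."""
    A = R.T @ M
    return R @ ((A - A.T) / 2.0)

def expm_so3(W):
    """Matrix exponential of a skew-symmetric matrix (Rodrigues' formula)."""
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3) + W
    return (np.eye(3) + (np.sin(th) / th) * W
            + ((1.0 - np.cos(th)) / th ** 2) * (W @ W))

def descent_step(R, grad, alpha):
    """One gradient descent step staying on SO(3): grad lies in the tangent
    space at R, so R^T grad is skew-symmetric; move along its negative."""
    return R @ expm_so3(-alpha * (R.T @ grad))

R = np.eye(3)
grad = tangent_projection(R, np.random.default_rng(0).standard_normal((3, 3)))
R_next = descent_step(R, grad, alpha=0.1)
```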
Now, we have the following theorem, where we use the notation and .
See Appendix A.
Rotational Motion with Underactuations
In the above discussion, we assumed that the sensor can undergo full 3-D rotational motion. However, the motion of many commoditized cameras is restricted by their actuator configurations. Hereafter, we suppose that the sensor can be rotated around two axes () and (), where these vectors are defined in and assumed to be linearly independent of each other. These axes may depend on the rotation matrix . For example, in the case of the Pan-Tilt (PT) cameras in Figs. 12 and 12, which are typical commoditized cameras, the axis of the pan motion (Fig. 12) is fixed relative to , while that of the tilt motion (Fig. 12) is fixed relative to the sensor frame . Then, only one of the two axes depends on . Note that even when there is only one axis around which the sensor can be rotated, the subsequent discussions remain valid by just letting .
Let us denote a normalized vector () orthogonal to the -plane. Then, the three vectors , and span . Thus, any element of can be represented in the form of . Now, we define a distribution  assigning to the subspace
whose dimension is . The distribution is clearly regular and hence induces a submanifold of , called an integral manifold, such that its tangent space at is equal to (24). The manifold specifies the orientations which the camera can take.
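For the underactuated case, the feasible tangent directions at a given orientation are spanned by the two rotation axes; projecting an unconstrained angular-velocity direction onto this span can be sketched as follows (function and variable names hypothetical):

```python
import numpy as np

def project_to_axis_span(omega, n1, n2):
    """Orthogonally project an angular-velocity vector omega onto
    span{n1, n2}, the axes the actuators can realize (assumed linearly
    independent, e.g. pan and tilt axes)."""
    N = np.stack([n1, n2], axis=1)                      # 3 x 2
    coeffs, *_ = np.linalg.lstsq(N, omega, rcond=None)  # least-squares projection
    return N @ coeffs

# Pan about the world z-axis, tilt about the sensor x-axis (here e1):
omega = np.array([1.0, 2.0, 3.0])
feasible = project_to_axis_span(omega, np.array([0.0, 0.0, 1.0]),
                                np.array([1.0, 0.0, 0.0]))
```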
Since is a submanifold of , a strategy similar to Theorem 1 is available and we have the following corollary.
IV Coverage for Multiple Sensors
In this section, we extend the result of the previous section to the multi-sensor case. The difference from the single-sensor case stems from the overlaps of the FOVs with those of the other sensors, as illustrated in Fig. 13. [4, 11, 13] present sensing performance functions taking account of the overlaps, together with their gradient descent laws. However, in this paper, we present another, simpler scheme to manage the overlap.
Let us first define the set of sensors capturing a point within their FOV as , where . We also suppose that, when has multiple elements for some , only the data of the sensor with the minimal sensing performance (8) among the sensors in is employed in higher-level decisions and recognitions. This motivates us to partition into the two regions
Then, a pixel capturing a point in is identified with one capturing a point outside of , whose cost is set as in the previous section, in the sense that neither datum is useful at all. This is reflected by assigning to the pixels with .
Accordingly, we formulate the function to be minimized by as with
Strictly speaking, to compute the gradient of (27), we need to expand from to . For this purpose, it is sufficient to define from to a subset of . For example, an option is to define an extension
of (5) similarly to (10), and to let be the convex hull of these points. However, at the time instants when the gradient is computed with , the extension for a sufficiently small perturbation is equivalent to the original set irrespective of the selection of , except for the pathological case when a pixel is located on the boundary of . Namely, ignoring such pathological cases, which almost surely do not occur for (23), the gradient can be computed by using the set instead of its extension. Hence, the gradient is simply given as in Theorem 1 by just replacing by . Note that the curve fitting process is run without taking account of whether or not, and is assigned to at the formulation of as in (17). This is because letting at the curve fitting stage would degrade the density estimation accuracy around the boundary of .
The remaining issue is efficient computation of the set . Hereafter, we assume that each sensor acquires , i.e. and for all , and its index through (all-to-all) communication or with the help of a centralized computer. The computation under limited communication will be mentioned at the end of this section. In addition, we suppose that every sensor stores the set
for all which can be computed off-line since the sensor positions are fixed.
Then, the set is computed as
in polynomial time with respect to . Namely, checking for all provides .
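The check just described can be sketched as follows: for a sampled point q, collect the indices of the sensors whose FOV (in half-space form) contains q, then keep the one with minimal sensing performance. All names and the example performance functions are hypothetical:

```python
import numpy as np

def capturing_sensors(q, fovs):
    """Indices of sensors whose FOV, given as half-space pairs (A_i, b_i)
    with A_i x <= b_i, contains the point q."""
    return [i for i, (A, b) in enumerate(fovs) if np.all(A @ q <= b + 1e-12)]

def responsible_sensor(q, fovs, perf):
    """Among the capturing sensors, return the one whose sensing performance
    value perf[i](q) is minimal (smaller is better); None if nobody sees q."""
    N_q = capturing_sensors(q, fovs)
    return min(N_q, key=lambda i: perf[i](q)) if N_q else None

# Two overlapping square FOVs on the environment plane; performance is the
# squared distance to each sensor's (hypothetical) ground position.
def box(lo, hi):
    return np.vstack([np.eye(2), -np.eye(2)]), np.hstack([hi, -lo])

fovs = [box(np.array([0.0, 0.0]), np.array([2.0, 2.0])),
        box(np.array([1.0, 1.0]), np.array([3.0, 3.0]))]
perf = [lambda q: np.sum((q - np.array([0.0, 0.0])) ** 2),
        lambda q: np.sum((q - np.array([3.0, 3.0])) ** 2)]
q = np.array([1.2, 1.2])
seen_by = capturing_sensors(q, fovs)        # both sensors see q
winner = responsible_sensor(q, fovs, perf)  # sensor 0 is closer
```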
The computation process including image processing, curve fitting and gradient computation is successfully distributed to each sensor but the resulting FOVs need to be shared among all sensors to compute