# In-Network View Synthesis for Interactive Multiview Video Systems

Laura Toni, Gene Cheung, and Pascal Frossard

L. Toni and P. Frossard are with École Polytechnique Fédérale de Lausanne (EPFL), Signal Processing Laboratory - LTS4, CH-1015 Lausanne, Switzerland. Email: {laura.toni, pascal.frossard}@epfl.ch. Gene Cheung is with the National Institute of Informatics, Tokyo, Japan. Email: cheung@nii.ac.jp

###### Abstract

Interactive multiview video applications endow users with the freedom to navigate through neighboring viewpoints in a 3D scene. To enable such interactive navigation with a minimum view-switching delay, multiple camera views are sent to the users, which are used as reference images to synthesize additional virtual views via depth-image-based rendering. In practice, however, bandwidth constraints may restrict the number of reference views sent to clients per time unit, which may in turn limit the quality of the synthesized viewpoints. We argue that the reference view selection should ideally be performed close to the users, and we study the problem of in-network reference view synthesis such that the navigation quality is maximized at the clients. We consider a distributed cloud network architecture where data stored in a main cloud is delivered to end users with the help of cloudlets, i.e., resource-rich proxies close to the users. In order to satisfy last-hop bandwidth constraints from the cloudlet to the users, a cloudlet re-samples viewpoints of the 3D scene into a discrete set of views (a combination of received camera views and synthesized virtual views) to be used as references for the synthesis of additional virtual views at the client. For a given bandwidth constraint, this in-network synthesis leads to better viewpoint sampling than simple selection of camera views, but it may carry a distortion penalty in the cloudlet-synthesized reference views. We therefore cast a new reference view selection problem where the best subset of views is defined as the one minimizing the distortion over a view navigation window defined by the user under some transmission bandwidth constraints. We show that the view selection problem is NP-hard, and propose an effective polynomial time algorithm using dynamic programming to solve the optimization problem under general assumptions that cover most of the multiview scenarios in practice.
Simulation results confirm the performance gain offered by virtual view synthesis in the network. They show that cloud computing resources provide important benefits in resource-greedy applications such as interactive multiview video.

###### Index Terms

Depth-image-based rendering, network processing, cloud-assisted applications, interactive systems.

## I Introduction

Interactive free viewpoint video systems [1] endow users with the ability to choose and display any virtual view of a 3D scene, given original viewpoint images captured by multiple cameras. In particular, a virtual view image can be synthesized by the decoder via depth-image-based rendering (DIBR) [2] using texture and depth images of two neighboring views that act as reference viewpoints. One of the key challenges in interactive multiview video streaming (IMVS) [3] systems is to transmit an appropriate subset of reference views from a potentially large number of camera-captured views such that the client enjoys high quality and low delay view navigation even in resource-constrained environments [4, 5, 6].

In this paper, we propose a new paradigm to solve the reference view selection problem and capitalize on cloud computing resources to perform fine adaptation close to the clients. We consider a hierarchical cloud framework, where the selection of reference views is performed by a network of cloudlets, i.e., resource-rich proxies that can perform personalized processing at the edges of the core network [7, 8]. Adaptation at the cloudlets results in a smaller round-trip time (RTT), and hence faster reaction to user requests, than in more centralized architectures. Specifically, we consider the scenario depicted in Fig. 1, where a main cloud stores pre-encoded video from different cameras, which is then transmitted to the edge cloudlets that act as proxies for final delivery to users. We assume that there is sufficient network capacity between the main cloud and the edge cloudlets for the transmission of all camera views, but that a bottleneck of limited capacity exists between a cloudlet and a nearby user (in practice, the last-mile access network is often the bottleneck in real-time media distribution). In this scenario, each cloudlet sends to a client the set of reference views that respects the bandwidth capacity and enables synthesis of all viewpoints in the client's navigation window. This window is defined as the range of viewpoints in which the user can navigate during one RTT, and it enables zero-delay view-switching at the client.

We argue that, in resource-constrained networks, re-sampling the viewpoints of the 3D scene in the network, i.e., synthesizing novel virtual views in the cloudlets that are transmitted as new references to the decoder, is beneficial compared to the mere subsampling of the original set of camera views. We illustrate this in Fig. 1, where the main cloud stores three coded camera views, while the bottleneck links between cloudlet-user pairs can support the transmission of only two views. (Camera views have integer indices, while a virtual view can have a non-integer index, which corresponds to a position between two camera views.) For the navigation window requested by user 1, the cloudlet can simply forward the two closest camera views. However, for the navigation window requested by user 2, transmitting two camera views results in large synthesized view distortions due to the large distance between reference and virtual views (called reference view distance in the sequel). Instead, the cloudlet can synthesize two virtual views from the camera views and send these virtual views to user 2 as new reference views for the navigation window. This strategy may result in smaller synthesized view distortion due to the smaller distance to the reference views. However, the in-network virtual view synthesis may also introduce distortion into the new reference views, which results in a tradeoff that should be carefully considered when choosing the views to be synthesized in the cloudlet.

Equipped with the above intuitions, we study the main tradeoff between reference distortion and bandwidth gain. Using a Gauss-Markov model, we first analyze the benefit of synthesizing new reference images in the network. We then formulate a new synthesized reference view selection optimization problem. It consists in selecting or constructing the optimal reference views that lead to the minimum distortion for all synthesized virtual views in the user's navigation window, subject to a bandwidth constraint between the cloudlet and the user. We show that this combinatorial problem can be solved optimally but that it is NP-hard. We then introduce a generic assumption on the view synthesis distortion which leads to a polynomial time solution with a dynamic programming (DP) algorithm. Finally, we provide extensive simulation results for synthetic and natural sequences. They confirm the quality gain experienced by the IMVS clients when synthesis is allowed in the network, with respect to scenarios in which edge cloudlets can only transmit camera views. They also show that synthesis in the network makes it possible to maintain good navigation quality when the number of cameras is reduced, as well as when cameras are not ideally positioned in the 3D scene. This is an important advantage in practical settings, which confirms that cloud processing resources can be judiciously used to improve the performance of applications that are a priori quite greedy in terms of network resources.

The remainder of this paper is organized as follows. Related works are described in Section II. In Section III, we provide a system overview and analyze the benefit of in-network view synthesis via a Gauss-Markov model to impart intuitions. The reference view selection optimization problem is then formulated in Section IV. We propose general assumptions on view synthesis distortion in Section V and derive an additional polynomial time view selection algorithm. In Section VI, we discuss the simulation results, and we conclude in Section VII.

## II Related Work

Prior studies addressed the problem of providing interactivity in selecting views in IMVS, while saving on transmitted bandwidth and view-switching delay [9, 10, 11, 12, 13, 3, 14, 15]. These works mainly focus on optimizing the frame coding structure to improve interactive media services. In the case of pre-stored camera views, however, interactivity in network-constrained scenarios can be addressed by studying optimal camera selection strategies rather than optimal frame coding structures: a subset of selected camera views is transmitted to clients such that the navigation quality is maximized and resource constraints are satisfied [4, 16, 5, 17, 6, 18]. In [19], an optimal camera view selection algorithm for resource-constrained networks is proposed based on the users' navigation paths. In [20], a bit allocation algorithm over an optimal subset of camera views is proposed for optimizing the visual distortion of reconstructed views in interactive systems. Finally, in [21, 22], the authors optimally organize camera views into layered subsets that are coded and delivered to clients in a prioritized fashion to accommodate network and client heterogeneity and to effectively exploit the resources of the overlay network. While in these works the selection is limited to camera views, in our work we instead assume in-network processing that is able to synthesize virtual viewpoints in the cloud network.

In-network adaptation strategies make it possible to cope with network resource constraints and fall into two main categories: packet-level processing and modification of the source information. In the first category, packet filtering, routing strategies [23, 24] or caching of media content information [25, 26] save network resources while improving the quality experienced by clients. To better address media delivery services in highly heterogeneous scenarios, network coding strategies for multimedia streaming have also been proposed [27, 28, 29]. In the second category (in-network processing at the source level), the main objective is usually to avoid transmitting large amounts of raw streams to the clients by processing the source data in the network, which reduces both the communication volume and the processing required at the client side. Transcoding strategies might be collaboratively performed in peer-to-peer networks [30] or in the cloud [31]. Furthermore, source data can be compressed in the cloud [30, 32, 33] to efficiently address users' requests. Rather than processing media in the main cloud, offloading resources to a cloudlet, i.e., a resource-rich machine in the proximity of the users, might reduce the transmission latency [7, 8]. This is beneficial for delay-sensitive and interactive applications [34, 35, 36]. Because of the proximity of cloudlets to users, cloudlet computing has been under intense investigation for cloud-gaming applications, as shown in [37] and references therein. The above works mainly focus on generic multimedia processing rather than on specific multiview scenarios. However, the use of cloudlets in delay-sensitive applications motivates the idea of cloudlet-based view synthesis for IMVS.

Cloud processing for multiview systems is considered in [38, 39, 40]. In [39], the authors mainly address cloud-based processing from a security perspective. In [40], view synthesis in the network is introduced for cloud networks to offload clients' terminals (in terms of complexity). The desired view is synthesized in the cloud and then sent directly to clients. However, only the view requested by the client is synthesized. This means that either the desired view is known a priori at the source or a switching delay is experienced by the clients. To the best of our knowledge, none of the works investigating cloud processing has considered the problem of multiview interactive streaming under network resource constraints. In our work, we propose view synthesis in the network both to overcome the uncertainty of users' requests in interactive systems and to cope with limited network resources.

## III Background

### III-A System Model

Consider the set of camera viewpoints captured by the multiview system. For all camera-captured views, compressed texture and depth maps are stored at the main cloud, with each texture/depth map pair encoded at the same rate using standard video coding tools like H.264 [41] or HEVC [42]. The possible viewpoints offered to the users are denoted by $\mathcal{U}$. The set $\mathcal{U}$ contains both synthesized views and camera views for navigation between the leftmost and rightmost camera views. It is equivalent to offering views on a regular grid whose step is a pre-determined fraction that describes the minimum view spacing between neighboring virtual views. We consider that any virtual viewpoint $u \in \mathcal{U}$ can be synthesized using a pair of left and right reference view images $v_L$ and $v_R$, located on either side of $u$, via a known DIBR technique such as 3D warping [43]. Note that view synthesis can be performed in-network (to generate new reference views) or at the user side (to render desired views for observation). In both cases, the same rendering method and distortion model apply.

Each user is served by an assigned cloudlet through a bottleneck link of capacity $C$, expressed in number of views. Given the RTT between the cloudlet and the user, and a maximum speed at which a user can navigate to neighboring virtual views, one can compute a navigation window $[U_L^0, U_R^0]$ around the virtual view that the user has selected at a given time. The goal of the cloudlet is to serve the user with the best subset of viewpoints in $\mathcal{U}$ that synthesize the best quality virtual views in $[U_L^0, U_R^0]$. In this way, the user can experience zero-delay view navigation (see [14] for details) with optimized visual quality.

### III-B Analysis of Cloudlet-based Synthesized Reference View

To impart intuition on why synthesizing new references at in-network cloudlets may improve rendered view quality at an end user, we consider a simple model among neighboring views. Similarly to [44, 45], we assume a Gauss-Markov model, where variable $x_v$ at view $v$ is correlated with $x_{v-1}$:

$$x_v = x_{v-1} + e_v, \qquad \forall v \geq 2 \tag{1}$$

where $e_v$ is a zero-mean independent Gaussian variable with variance $\sigma_v^2$, and $x_1 = e_1$. A large $\sigma_v^2$ would mean views $v-1$ and $v$ are not similar. We can write the $N$ variables in matrix form:

$$F x = e, \qquad x = F^{-1} e \tag{2}$$

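where

$$F = \begin{bmatrix} 1 & 0 & \cdots & & \\ -1 & 1 & 0 & \cdots & \\ 0 & -1 & 1 & 0 & \cdots \\ \vdots & & \ddots & \ddots & \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ \vdots \\ e_N \end{bmatrix} \tag{3}$$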

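Given that $x$ is zero-mean, the covariance matrix $C$ can be computed as:

$$C = E[x x^T] = F^{-1} E[e e^T] (F^{-1})^T \tag{4}$$

where $E[e e^T] = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$ is a diagonal matrix. The precision matrix $Q$ is the inverse of $C$ and can be derived as follows:

$$\begin{aligned} Q = C^{-1} &= \left( F^{-1} \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2) (F^{-1})^T \right)^{-1} \\ &= F^T \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)^{-1} F \\ &= \begin{bmatrix} \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} & -\frac{1}{\sigma_2^2} & 0 & \cdots & \\ -\frac{1}{\sigma_2^2} & \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2} & -\frac{1}{\sigma_3^2} & 0 & \cdots \\ 0 & -\frac{1}{\sigma_3^2} & \frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2} & -\frac{1}{\sigma_4^2} & \\ \vdots & & \ddots & \ddots & \ddots \\ 0 & \cdots & & -\frac{1}{\sigma_N^2} & \frac{1}{\sigma_N^2} \end{bmatrix} \end{aligned} \tag{5}$$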

which is a tridiagonal matrix.
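As a quick numerical sanity check of (4) and (5), the following NumPy sketch builds $F$ for a short chain and confirms that $F^T \mathrm{diag}(\sigma^2)^{-1} F$ is tridiagonal and inverts the covariance. The function name `gauss_markov_precision` and the variance values are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def gauss_markov_precision(sigma2):
    """Return (F, C, Q) for the chain x_1 = e_1, x_v = x_{v-1} + e_v."""
    sigma2 = np.asarray(sigma2, dtype=float)
    n = sigma2.size
    # F has ones on the diagonal and -1 on the first sub-diagonal, so F x = e.
    F = np.eye(n) - np.eye(n, k=-1)
    F_inv = np.linalg.inv(F)
    C = F_inv @ np.diag(sigma2) @ F_inv.T   # covariance, Eq. (4)
    Q = F.T @ np.diag(1.0 / sigma2) @ F     # precision, Eq. (5)
    return F, C, Q

# Arbitrary illustrative noise variances sigma_1^2 .. sigma_4^2 (assumed values).
sigma2 = [1.0, 0.5, 2.0, 4.0]
F, C, Q = gauss_markov_precision(sigma2)

print("Q is the inverse of C:", np.allclose(Q @ C, np.eye(len(sigma2))))
print("Q is tridiagonal:", np.allclose(np.triu(Q, k=2), 0.0))
print(Q)
```

Printing `Q` makes the structure of (5) visible: each diagonal entry sums the inverse variances of the two noise terms touching that view, and only adjacent views are coupled.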

When synthesizing a view $x_v$ using its neighbors $x_{v-1}$ and $x_{v+1}$, we would like to know the resulting precision. Without loss of generality, we write $x$ as a concatenation of two sets of variables, i.e. $x = [y \; z]$. It can be shown [46] that the conditional mean and precision matrix of $y$ given $z$ are:

$$\begin{aligned} \mu_{y|z} &= \mu_y - Q_{yy}^{-1} Q_{yz} (z - \mu_z) \\ Q_{y|z} &= Q_{yy} \end{aligned} \tag{6}$$

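Consider now a set of four views $x_1, \ldots, x_4$, some of which are camera views transmitted from the main cloud. Suppose further that the user window contains $x_2$, and the cloudlet has to choose between using received $x_4$ as right reference, or synthesizing a new reference $\bar{x}_3$ from received views. Using the discussed Gauss-Markov model (1) and the conditionals (6), we see that synthesizing $x_3$ using references $x_2$ and $x_4$ results in precision:

$$Q_{3|(2,4)} = Q_{33} = \frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2} \tag{7}$$

$Q_{3|(2,4)}^{-1}$ is thus the additional noise variance when using new reference $\bar{x}_3$ to synthesize $x_2$. We can then compute the conditional precision of $x_2$ given new reference $\bar{x}_3$:

$$Q_{2|(1,\bar{3})} = \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2 + \left( \frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2} \right)^{-1}} \tag{8}$$

In comparison, if a user uses received $x_4$ as right reference, $x_4$ will accumulate two noise terms from $x_2$ to $x_4$:

$$x_4 = x_2 + e_3 + e_4 \tag{9}$$

The resulting conditional precision of $x_2$ given $x_1$ and $x_4$ is:

$$Q_{2|(1,4)} = \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2 + \sigma_4^2} \tag{10}$$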

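We now compare $Q_{2|(1,\bar{3})}$ in (8) with $Q_{2|(1,4)}$ in (10). We see that if $\sigma_3^2$ is very large relative to $\sigma_4^2$, then $(1/\sigma_3^2 + 1/\sigma_4^2)^{-1} \approx \sigma_4^2$, and $Q_{2|(1,\bar{3})} \approx Q_{2|(1,4)}$. That means that if view $x_3$ is very different from $x_2$, then synthesizing new reference $\bar{x}_3$ does not help improve the precision of $x_2$. However, when $\sigma_3^2$ is not large relative to $\sigma_4^2$, then $(1/\sigma_3^2 + 1/\sigma_4^2)^{-1} < \sigma_4^2$, and $Q_{2|(1,\bar{3})} > Q_{2|(1,4)}$, which means that in general it is worth synthesizing new reference $\bar{x}_3$. The reason can be interpreted from the derivation above: by synthesizing $\bar{x}_3$ using both $x_2$ and $x_4$, the uncertainty (variance) for the right reference has been reduced from $\sigma_4^2$ to $(1/\sigma_3^2 + 1/\sigma_4^2)^{-1}$, improving the precision of the subsequent view synthesis.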
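The comparison between (8) and (10) can be reproduced numerically. The short Python sketch below evaluates both precisions for two regimes; the function names and the chosen variances are assumptions made purely for illustration.

```python
def precision_synth_reference(s2, s3, s4):
    """Precision of x_2 given x_1 and the synthesized reference, Eq. (8)."""
    # Variance of the synthesized x_3, i.e., the inverse of Eq. (7).
    ref_var = 1.0 / (1.0 / s3 + 1.0 / s4)
    return 1.0 / s2 + 1.0 / (s3 + ref_var)

def precision_camera_reference(s2, s3, s4):
    """Precision of x_2 given x_1 and the received x_4, Eq. (10)."""
    return 1.0 / s2 + 1.0 / (s3 + s4)

# Arguments are the variances sigma_2^2, sigma_3^2, sigma_4^2 (assumed values).
for s3 in (1.0, 1000.0):
    q_synth = precision_synth_reference(1.0, s3, 1.0)
    q_cam = precision_camera_reference(1.0, s3, 1.0)
    print(f"sigma_3^2 = {s3:g}: synthesized ref {q_synth:.6f}, camera ref {q_cam:.6f}")
```

With comparable variances the synthesized reference yields a visibly higher precision, whereas for a very large $\sigma_3^2$ the two values become almost indistinguishable, in line with the discussion above.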

## IV Reference View Selection Problem

In this section, we first formalize the synthesized reference view selection problem. We then describe an assumption on the distortion of synthesized viewpoints. We conclude by showing that under the considered assumption the optimization problem is NP-hard.

### IV-A Problem Formulation

Interactive view navigation means that a user can construct any virtual view within a specified navigation window with zero view-switching delay, using viewpoint images transmitted from the main cloud as reference [14]. We denote this navigation window by $[U_L^0, U_R^0]$, which depends on the user's current observed viewpoint. If bandwidth is not a concern, for best synthesized view quality the edge cloudlet would send to the user all camera-captured views as reference to synthesize each virtual view $u$, $u \in [U_L^0, U_R^0]$. When this is not feasible due to limited bandwidth between the serving cloudlet and the user, among all subsets $\mathcal{T}$ of synthesized and camera-captured views that satisfy the bandwidth constraint, the cloudlet must select the best subset $\mathcal{T}^\star$ that minimizes the aggregate distortion $D(\mathcal{T})$ of all virtual views $u \in [U_L^0, U_R^0]$, i.e.,

$$\begin{aligned} \mathcal{T}^\star:\ & \arg\min_{\mathcal{T}} D(\mathcal{T}) \\ \text{s.t.}\ & |\mathcal{T}| \leq C \\ & \mathcal{T} \subseteq \mathcal{U} \end{aligned} \tag{11}$$

We note that (11) differs from existing reference view selection formulations [18, 22, 17] in that the cloudlet has the extra degree of freedom to synthesize novel virtual view(s) as new reference(s) for transmission to the user.

Denote by $D(v)$ the distortion of viewpoint image $v$, due to lossy compression for a camera-captured view, or to DIBR synthesis for a virtual view. The distortion experienced over the navigation window at the user is then given by

$$D(\mathcal{T}) = \sum_{u \in [U_L^0, U_R^0]} \min_{v_L, v_R \in \mathcal{T}} \left\{ d_u(v_L, v_R, D(v_L), D(v_R)) \right\} \tag{12}$$

where $D(v_L)$ and $D(v_R)$ are the respective distortions of the left and right reference views and $d_u(v_L, v_R, D(v_L), D(v_R))$ is the distortion of the virtual view $u$ synthesized using left and right reference views $v_L$ and $v_R$ with distortions $D(v_L)$ and $D(v_R)$, respectively. In (12), for each virtual view $u$ the best reference pair in $\mathcal{T}$ is selected for synthesis. Note that, unlike [17], the best reference pair may not be the closest references, since the quality of synthesized $u$ depends not only on the view distance between the synthesized and reference views, but also on the distortions of the references.
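To make (11) and (12) concrete, the following Python sketch evaluates $D(\mathcal{T})$ for a candidate set and solves (11) by exhaustive search on a tiny instance. Everything numeric here is an assumption for illustration: the distortion model `toy_d`, the reference distortions in `ref_dist`, and the view grid are hypothetical and are not the distortion model of the paper.

```python
from itertools import combinations

def toy_d(u, vL, vR, DL, DR):
    """Hypothetical distortion model (assumption): mean reference distortion
    plus a penalty that grows with the distance to both references."""
    return 0.5 * (DL + DR) + (u - vL) * (vR - u)

def window_distortion(T, window, ref_dist, d_u):
    """Eq. (12): each window view uses the best enclosing pair in T."""
    total = 0.0
    for u in window:
        total += min(
            (d_u(u, vL, vR, ref_dist[vL], ref_dist[vR])
             for vL in T for vR in T if vL <= u <= vR and vL < vR),
            default=float("inf"),  # no enclosing pair: view cannot be synthesized
        )
    return total

def brute_force_selection(U, window, C, ref_dist, d_u):
    """Eq. (11) by enumeration of all subsets with 2 <= |T| <= C."""
    best_T, best_D = None, float("inf")
    for size in range(2, C + 1):
        for T in combinations(U, size):
            D = window_distortion(T, window, ref_dist, d_u)
            if D < best_D:
                best_T, best_D = T, D
    return best_T, best_D

# Views 0 and 4 are cameras (distortion 1.0); views 1-3 would be synthesized
# in the cloudlet and are assumed to carry a higher distortion of 1.5.
U = [0, 1, 2, 3, 4]
ref_dist = {0: 1.0, 1: 1.5, 2: 1.5, 3: 1.5, 4: 1.0}
window = [1, 2, 3]

best_T, best_D = brute_force_selection(U, window, 2, ref_dist, toy_d)
camera_only = window_distortion((0, 4), window, ref_dist, toy_d)
print("best references:", best_T, "distortion:", best_D)
print("camera-only distortion:", camera_only)
```

In this toy instance the synthesized references win despite their higher own distortion, because they sit much closer to the window. The enumeration grows combinatorially with the number of candidate views, which is why an exhaustive search is only usable on very small instances.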

### IV-B Distortion of Virtual Viewpoints

We consider first an assumption on the synthesized view distortion called the shared optimality of reference views:

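$$\begin{aligned} \text{if}\quad & d_u(v_L, v_R, D(v_L), D(v_R)) \leq d_u(v'_L, v'_R, D(v'_L), D(v'_R)) \\ \text{then}\quad & d_{u'}(v_L, v_R, D(v_L), D(v_R)) \leq d_{u'}(v'_L, v'_R, D(v'_L), D(v'_R)) \end{aligned} \tag{13}$$

for any two virtual views $u$ and $u'$ that can be synthesized from both reference pairs. In words, this assumption (13) states that if the virtual view $u$ is better synthesized using the reference pair $(v_L, v_R)$ than $(v'_L, v'_R)$, then another virtual view $u'$ is also better synthesized using $(v_L, v_R)$ than $(v'_L, v'_R)$.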

We see intuitively that this assumption is reasonable for smooth 3D scenes; a virtual view $u$ tends to be similar to its neighbor $u'$, so a good reference pair for $u$ should also be good for $u'$. We can also argue for the plausibility of this assumption as a consequence of two functional trends in the synthesized view distortion that are observed empirically to be generally true. For simplicity, consider for now the case where the reference views have zero distortion, i.e. $D(v_L) = D(v_R) = D(v'_L) = D(v'_R) = 0$. The first trend is the monotonicity in predictor's distance [20]; i.e., the further away the reference views are from the target synthesized view, the worse the resulting synthesized view distortion. This trend has been successfully exploited for efficient bit allocation algorithms [47, 20]. In our scenario, this trend implies that reference pair $(v_L, v_R)$ is better than $(v'_L, v'_R)$ at synthesizing view $u$ because the pair is closer to $u$, i.e.

$$|u - v_L| + |v_R - u| \leq |u - v'_L| + |v'_R - u| \tag{14}$$

where both reference pairs enclose $u$.

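It is easy to see that if reference pair $(v_L, v_R)$ is closer to $u$ than $(v'_L, v'_R)$, it is also closer to $u'$, thus better at synthesizing $u'$. Without loss of generality, we write new virtual view $u'$ as $u' = u + \delta$. We can then write:

$$\begin{aligned} |(u+\delta) - v_L| + |v_R - (u+\delta)| &= u - v_L + v_R - u \\ &\leq u - v'_L + v'_R - u \\ &\leq |(u+\delta) - v'_L| + |v'_R - (u+\delta)| \end{aligned} \tag{15}$$

where $v_L \leq u + \delta \leq v_R$.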

Consider now the case where the reference views have non-zero distortions. In [48], another functional trend is empirically demonstrated, where a reference view with non-zero distortion is well approximated by a further-away equivalent reference view with no distortion. Thus a better reference pair $(v_L, v_R)$ than $(v'_L, v'_R)$ at synthesizing $u$ just means that the equivalent reference pair for $(v_L, v_R)$ is closer to $u$ than the equivalent reference pair for $(v'_L, v'_R)$. Using the same previous argument, we see that the equivalent reference pair for $(v_L, v_R)$ is also closer to $u'$ than that for $(v'_L, v'_R)$, resulting in a smaller synthesized distortion. Hence, we can conclude that the assumption of shared optimality of reference views is a consequence of these two functional trends.

We can graphically illustrate possible solutions to the optimization problem (11) under the assumption of shared optimality of reference views. Fig. 2 depicts the selected reference views for virtual views in the navigation window. In the figure, the x-axis represents the virtual views $u$ in the window that require synthesis. Correspondingly, on the y-axis are two piecewise constant (PWC) functions representing the left and right reference views selected for synthesis of each virtual view in the window, assuming that for each $u$ there must be one selected reference pair $(v_L, v_R)$ that encloses it. A constant line segment in Fig. 2 means that the same reference is used for a range of virtual views. This graphical representation results in two PWC functions, one for the right and one for the left reference views, lying respectively above and below the diagonal line $v = u$. The set of selected reference views is the union of the constant step locations in the two PWC functions.

Under the assumption of shared reference optimality we see that the selected reference views in Fig. 2 cannot be an optimal solution. Specifically, two virtual views that could be synthesized from either of two reference pairs employ different pairs. However, if one reference pair is better than the other for the first virtual view, it should also be better for the second virtual view according to shared reference optimality in (13). An example of an optimal solution candidate under the assumption of shared reference optimality is shown in Fig. 2.

### IV-C NP-hard Proof

We now outline a proof-by-construction that shows the reference view selection problem (11) is NP-hard under the shared optimality assumption. We show it by reducing the known NP-hard set cover (SC) problem [49] to a special case of the reference view selection problem. In SC, a set of items (called the universe) is given, together with a defined collection of subsets of items in the universe. The SC problem is to identify a bounded number of subsets from the collection that cover the universe, i.e., a smaller collection of at most the given size such that every item in the universe belongs to at least one subset in the smaller collection.

We construct a corresponding special case of our reference view selection problem as follows. For each item in the universe of the SC problem, we first construct an undistorted left reference view. In addition, we construct a default undistorted right reference view, and the navigation window is set accordingly. Further, for each item, we construct a virtual view that requires the selection of the item's left reference, in combination with the default right reference, for the resulting synthesized view to achieve a given baseline distortion. Thus the selection of all left references and one default right reference already consumes a fixed number of views worth of bandwidth. See Fig. 3 for an illustration. Note that given this selection of left reference views, any selection of right reference views will satisfy the shared optimality of reference views assumption.

For each subset in the collection of the SC problem, we construct a right reference view, such that if an item belongs to that subset, the synthesized distortion at the item's virtual view is reduced to a lower value when this right reference view is used. The corresponding binary decision we ask is: given the channel bandwidth, is there a reference view selection such that the resulting synthesized view distortion does not exceed a target value?

From construction, it is clear that to minimize overall distortion, all left reference views and the default right reference view must first be selected in any solution that meets the distortion target. Given the remaining budget of additional views, if the distortion target is achieved, then the additionally selected right reference views, which fit within this budget, reduce the synthesized distortion at each of the virtual views. Thus these additionally selected right reference views correspond exactly to the subsets in the SC problem that cover all items in the universe. Thus solving this special case of the reference view selection problem is no easier than solving the SC problem, and therefore the reference view selection problem is also NP-hard.

## V Optimal View Selection Algorithm

Given that the reference view selection problem (11) is NP-hard under the assumption of shared optimality of reference views, in this section we introduce another assumption on the synthesized view distortion that holds in most common 3D scenes. Given these two assumptions, we show that (11) can now be solved optimally in polynomial time by a DP algorithm. We also analyze the DP algorithm’s computation complexity.

### V-A Independence of Reference Optimality Assumption

The second assumption on the synthesized view distortion is the independence of reference optimality, stated formally as follows:

$$\begin{aligned} \text{if}\quad & d_u(v_L, v_R, D(v_L), D(v_R)) \leq d_u(v'_L, v_R, D(v'_L), D(v_R)) \\ \text{then}\quad & d_u(v_L, v'_R, D(v_L), D(v'_R)) \leq d_u(v'_L, v'_R, D(v'_L), D(v'_R)) \end{aligned} \tag{16}$$

for any virtual view $u$ enclosed by the reference pairs involved. In words, the assumption (16) states that if $v_L$ is a better left reference than $v'_L$ when synthesizing virtual view $u$ using $v_R$ as right reference, then $v_L$ remains the better left reference to synthesize $u$ even if a different right reference $v'_R$ is used. This assumption essentially states that contributions towards the synthesized image from the two references are independent from each other, which is reasonable since each rendered pixel in the synthesized view is typically copied from one of the two references, but not both. We can also argue for the plausibility of this assumption as a consequence of the two aforementioned functional trends in the synthesized view distortion in Section IV. Consider first the case where the reference views have zero distortion. The monotonicity in predictor's distance in (14) for a common right reference view becomes

$$|u - v_L| + |v_R - u| \leq |u - v'_L| + |v_R - u| \;\longrightarrow\; |u - v_L| \leq |u - v'_L| \tag{17}$$

Thus if $v_L$ is preferred to $v'_L$ for right reference $v_R$, the preference also holds for another right reference $v'_R$, as long as $v'_R$ remains a valid right reference for $u$, since the resulting condition in (17) does not involve the right reference. Consider now the case where the reference views have non-zero distortions. Introducing the equivalent reference views with no distortion, the same argument of (17) holds for the equivalent reference views, leading to the same conclusion.

We now illustrate different optimal solution candidates to (11) under both virtual view distortion assumptions to impart intuition. We see that the assumption of independence of reference optimality would prevent the reference view selection in Fig. 2 from being an optimal solution. Specifically, we see that two different right reference views are both feasible for the same two virtual views. Regardless of which left references are selected for these two virtual views, if one of them is a strictly better right reference than the other, then having both virtual views select it as right reference will result in a lower overall distortion. If the two are equally good right reference views resulting in the same synthesized view distortion, then selecting just one of them can achieve the same distortion with one fewer right reference view. Thus the selected reference views in Fig. 2 cannot be optimal.

We can thus make the following observation: as virtual view $u$ increases, an optimal solution cannot switch right reference view from the current $v_R$ before $u$ reaches $v_R$. Conversely, as virtual view $u$ decreases, an optimal solution cannot switch left reference view from the current $v_L$ before $u$ reaches $v_L$. As examples, Fig. 4 provides solutions of left and right reference views for virtual views in the navigation window. In the figure, on the x-axis are the virtual views $u$ in the window that require synthesis. Correspondingly, on the y-axis are the left and right reference views (blue and red PWC functions respectively) selected to synthesize each virtual view in the window. We see that the reference view selections in Fig. 4 are optimal solution candidates to (11). Thus, the optimal reference view selections must be graphically composed of “staircase” virtual view ranges as shown in Fig. 4. In other words, either a shared left reference view is used for multiple virtual view ranges, where each range has the same $v_L$ as left reference (“shared-left” case), or a shared right reference view is used for multiple ranges, where each range has $v_R$ as its right reference (“shared-right” case). This motivates us to design an efficient DP algorithm to solve (11) optimally in polynomial time.

### V-B DP Algorithm

We first define a recursive function $\Phi(u_L, v_L, k)$ as the minimum aggregate synthesized view distortion of views between $u_L$ and $U_R^0$, given that $v_L$ is the selected left reference view for synthesizing view $u_L$, and there is a budget of $k$ additional reference views. To analyse $\Phi(u_L, v_L, k)$, we consider the two “staircase” cases identified in Fig. 4 separately, and show how $\Phi(u_L, v_L, k)$ can be evaluated in each of the cases.

Consider first the “shared-left” case (Fig. 4) where a shared left reference view is employed in a sequence of virtual view ranges. A view range represents a contiguous range of virtual viewpoints that employ the same left and right reference views. The algorithm selects a new right reference view $v$, creating a new range of virtual views $[u_L, v)$. Virtual views in range $[u_L, v)$ are synthesized using a shared left reference $v_L$ and the newly selected reference view $v$, resulting in distortion $d_u(v_L, v, D(v_L), D(v))$ for each virtual view $u$, $u_L \leq u < v$. The aggregate distortion function for this case is the distortion of views in $[u_L, v)$ plus a recursive term $\Phi(v, \Lambda(v_L, v), k-1)$ to account for aggregate synthesized view distortions to the right of $v$:

$$\sum_{u=u_L}^{v-1} d_u(v_L, v, D(v_L), D(v)) + \Phi(v, \Lambda(v_L, v), k-1) \tag{18}$$

where $k-1$ is the remaining budget of additional reference views, and $\Lambda(v_L, v)$ chooses the better of the two left reference views, $v_L$ and $v$, for the recursive function $\Phi$. In particular, using any right reference view $v_R$ and any virtual view $u$ for which both candidates are valid left references, we set $\Lambda(v_L, v) = v_L$ if virtual view $u$ is better synthesized using $v_L$ as left reference than $v$ (and set $\Lambda(v_L, v) = v$ otherwise). Formally, the left reference selection function $\Lambda(v_1, v_2)$ is defined as:

$$\Lambda(v_1, v_2) = \begin{cases} v_1 & \text{if } d_u(v_1, v_R, D(v_1), D(v_R)) \leq d_u(v_2, v_R, D(v_2), D(v_R)) \\ v_2 & \text{otherwise} \end{cases} \tag{19}$$

Given our two assumptions, we know that the selected left reference $\Lambda(v_L, v)$ remains better for all remaining virtual views.

We now consider the “shared-right” case (Fig. 4) where a newly selected view is actually a common right reference view for a sequence of virtual view ranges from to . We first define a companion recursive function that returns the minimum aggregate synthesized view distortion from view to , given that is the selected left reference view, is the common right reference view, and there is a budget of other left reference views in addition to . We can write recursively as follows:

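$$\Psi(u_L, v_L, v_R, n) = \begin{cases} \min\limits_{v > u_L} \sum\limits_{u=u_L}^{v-1} d_u\big(v_L, v_R, D(v_L), D(v_R)\big) + \Psi(v, v, v_R, n-1) & \text{if } n \ge 1 \\ \sum\limits_{u=u_L}^{v_R-1} d_u\big(v_L, v_R, D(v_L), D(v_R)\big) & \text{o.w.} \end{cases} \tag{20}$$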

In more detail, equation (20) states that $\Psi(u_L, v_L, v_R, n)$ is the synthesized view distortion of views in the range $[u_L, v)$, plus the recursive distortion $\Psi(v, v, v_R, n-1)$ from view $v$ to $v_R$ with a reduced reference view budget $n-1$.
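The recursion in (20) can be sketched with memoization as follows. This is a minimal illustration, assuming integer view indices, assuming that every integer position strictly between `u_left` and `v_right` is an admissible new left reference, and using a hypothetical toy distortion `nearest_reference_distance`. When the budget cannot be spent because no candidate exists, the sketch returns infinity, which is a convention of this example.

```python
from functools import lru_cache
from typing import Callable


def make_psi(d: Callable[[int, int, int], float]):
    """Build a memoized Psi(u_left, v_left, v_right, n) following (20)."""

    @lru_cache(maxsize=None)
    def psi(u_left: int, v_left: int, v_right: int, n: int) -> float:
        if n < 1:
            # No further left reference: every view in [u_left, v_right)
            # is synthesized from (v_left, v_right).
            return sum(d(u, v_left, v_right) for u in range(u_left, v_right))
        best = float("inf")  # stays infinite if no candidate v exists
        for v in range(u_left + 1, v_right):
            # Views in [u_left, v) keep v_left; v becomes the next left reference.
            head = sum(d(u, v_left, v_right) for u in range(u_left, v))
            best = min(best, head + psi(v, v, v_right, n - 1))
        return best

    return psi


def nearest_reference_distance(u: int, v_left: int, v_right: int) -> float:
    """Assumed toy distortion: distance to the closest reference."""
    return float(min(u - v_left, v_right - u))


psi = make_psi(nearest_reference_distance)
no_extra = psi(0, 0, 8, 0)   # single range [0, 8)
one_extra = psi(0, 0, 8, 1)  # one additional left reference
```

With this toy distortion, one additional left reference lowers the aggregate distortion of the range, since views close to the new reference no longer depend on the distant one.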

We can now put the two cases together into a complete definition of $\Phi(u_L, v_L, k)$ as follows:

$$\Phi(u_L, v_L, k) = \min_{v > v_L} \left\{ \min \left[ \begin{array}{ll} \sum\limits_{u=u_L}^{v-1} d_u\big(v_L, v, D(v_L), D(v)\big) + \Phi\big(v, \Lambda(v_L, v), k-1\big), & \text{“shared-left” case} \\ \min\limits_{1 \le n \le k-1} \Psi(u_L, v_L, v, n) + \Phi(v, v, k-n-1), & \text{“shared-right” case} \end{array} \right] \right\} \tag{21}$$

The relation (21) states that $\Phi(u_L, v_L, k)$ examines each candidate reference view $v$, $v > v_L$, which can be used either as right reference for synthesizing virtual views in $[u_L, v)$ with left reference $v_L$ (“shared-left” case), or as a common right reference for a sequence of virtual view ranges within the interval $[u_L, v)$ (“shared-right” case).

When the remaining view budget is $k = 1$, the relation in (21) simply selects a right reference view $v$, $v \ge U_R^0$, which minimizes the aggregate synthesized view distortion for the range $[u_L, U_R^0]$:

$$\Phi(u_L, v_L, 1) = \min_{v \ge U_R^0} \sum_{u=u_L}^{U_R^0} d_u\big(v_L, v, D(v_L), D(v)\big) \tag{22}$$

Having defined $\Phi(u_L, v_L, k)$, we can identify the best $K$ reference views by calling $\Phi(U_L^0, v, K-1)$ repeatedly to identify the best leftmost reference view $v$, $v \le U_L^0$, and start the selection of the remaining $K-1$ reference views as follows:

$$\min_{v \le U_L^0} \Phi(U_L^0, v, K-1) \tag{23}$$
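To make the interplay of (20)-(23) concrete, the following Python sketch evaluates the minimum aggregate distortion by memoized recursion. It is an illustration under several assumptions that are not taken from the text: views are integer indices; `candidates` is a hypothetical list of positions eligible as references; `d(u, v_left, v_right)` is a hypothetical distortion callable; new references are restricted to $v > u_L$ so that each step advances; $\Lambda$ is probed at one fixed (view, right reference) pair; sums are clipped to the navigation window; and $\Phi$ with zero budget is taken as zero once the window is covered. It should not be read as the authors' implementation.

```python
from functools import lru_cache
from typing import Callable, Sequence

INF = float("inf")


def min_total_distortion(first: int, last: int, candidates: Sequence[int],
                         d: Callable[[int, int, int], float], K: int) -> float:
    """Evaluate (23) with memoized Phi (21)-(22) and Psi (20)."""
    cands = sorted(candidates)

    def span(lo: int, hi: int, v_left: int, v_right: int) -> float:
        # Distortion of views lo..hi-1, clipped to the navigation window.
        return sum(d(u, v_left, v_right) for u in range(lo, min(hi, last + 1)))

    def lam(v1: int, v2: int) -> int:
        # Rule (19), probed at an assumed (view, right reference) pair.
        v_right = cands[-1]
        u = min(v2 + 1, v_right)
        return v1 if d(u, v1, v_right) <= d(u, v2, v_right) else v2

    @lru_cache(maxsize=None)
    def psi(u_left, v_left, v_right, n):
        if n < 1:
            return span(u_left, v_right, v_left, v_right)
        return min((span(u_left, v, v_left, v_right) + psi(v, v, v_right, n - 1)
                    for v in cands if u_left < v < v_right), default=INF)

    @lru_cache(maxsize=None)
    def phi(u_left, v_left, k):
        if k == 0:  # assumed convention: nothing left to synthesize
            return 0.0 if u_left >= last else INF
        if k == 1:  # base case (22)
            return min((span(u_left, last + 1, v_left, v)
                        for v in cands if v >= last and v > v_left), default=INF)
        best = INF
        for v in cands:
            if v <= u_left:
                continue
            shared_left = span(u_left, v, v_left, v) + phi(v, lam(v_left, v), k - 1)
            shared_right = min(psi(u_left, v_left, v, n) + phi(v, v, k - n - 1)
                               for n in range(1, k))
            best = min(best, shared_left, shared_right)
        return best

    # Top-level call (23): try every leftmost reference v <= first.
    return min((phi(first, v, K - 1) for v in cands if v <= first), default=INF)


def toy(u: int, v_left: int, v_right: int) -> float:
    """Assumed toy distortion: distance to the closest reference."""
    return float(min(u - v_left, v_right - u))


two_refs = min_total_distortion(0, 4, [0, 2, 4], toy, K=2)
three_refs = min_total_distortion(0, 4, [0, 2, 4], toy, K=3)
```

In this toy run, the third reference is exploited through the “shared-right” branch, where position 4 serves as common right reference for the ranges starting at 0 and at 2.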

### V-C Computation Complexity

Our proposed DP algorithm requires two different tables to be stored. The first time $\Phi(u_L, v_L, k)$ is computed, the result can be stored in the entry indexed by $(u_L, v_L, k)$ of a DP table, so that subsequent calls with the same arguments can simply be looked up. Analogously, the first time $\Psi(u_L, v_L, v_R, n)$ is called, the computed value is stored in the entry indexed by $(u_L, v_L, v_R, n)$ of another DP table to avoid repeated computation in future recursive calls.

We bound the computation complexity of our proposed algorithm (21) by computing a bound on the sizes of the required DP tables and the cost of computing each table entry. For notational convenience, let the number of reference views and synthesized views be and , respectively. The size of DP table is no larger than . The cost of computing an entry in using (21) over all possible reference views involves the computation of the “shared-left” case with complexity and that of the “shared-right” case with complexity . Thus, each table entry has complexity . Hence the complexity of completing the DP table is . Given that in a typical setting , the complexity of computing DP table is thus .

We can follow a similar procedure to estimate the complexity of computing DP table . The size of the table in this case is upper-bounded by . The complexity of computing each entry is . Thus the complexity of computing DP table is , which is the same as that of DP table . Thus the overall computation complexity of our solution in (21) is also .

## VI Simulation Results

### VI-A Settings

We study the performance of our algorithm and show the distortion gains offered by cloudlet-based virtual view synthesis. For a given navigation window , we provide the average quality at which viewpoints in the navigation window are synthesized. This means that we evaluate the average distortion of the navigation window as , with being the number of synthesized viewpoints in the navigation window, and we then compute the corresponding PSNR. In our algorithm, we have considered the following model for the distortion of the synthesized viewpoint $u$ from reference views $V_L$ and $V_R$,

$$d_u(V_L, V_R, D_L, D_R) = \alpha D_{\min} + (1-\alpha)\beta D_{\max} + \big[1 - \alpha - (1-\alpha)\beta\big] D_I \tag{24}$$

where $D_{\min} = \min\{D_L, D_R\}$, $D_{\max} = \max\{D_L, D_R\}$, $D_I$ is the inpainted distortion, and with is the distance between two consecutive camera views and , if , otherwise, and if , . The model can be explained as follows. A virtual view $u$, when reconstructed from $(V_L, V_R)$, has a relative portion $\alpha$ that is reconstructed at a distortion $D_{\min}$ from the dominant reference view, defined as the one with minimum distortion. The remaining portion of the image, i.e., $1-\alpha$, is either reconstructed by the non-dominant reference view for a portion $\beta$, at a distortion $D_{\max}$, or it is inpainted, at a distortion $D_I$.
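The weighting in (24) can be checked numerically with a short sketch. Here the portions `alpha` and `beta` are simply passed in as numbers in $[0, 1]$; their dependence on the distance between views is not reproduced, and the function name and example values are assumptions made for illustration.

```python
def synthesized_distortion(d_left: float, d_right: float, d_inpaint: float,
                           alpha: float, beta: float) -> float:
    """Evaluate model (24) for given portions alpha and beta in [0, 1]."""
    # The dominant reference is the one with the smaller distortion.
    d_min, d_max = min(d_left, d_right), max(d_left, d_right)
    w_dominant = alpha                      # portion from the dominant view
    w_other = (1.0 - alpha) * beta          # portion from the other view
    w_inpaint = 1.0 - w_dominant - w_other  # remaining portion is inpainted
    return w_dominant * d_min + w_other * d_max + w_inpaint * d_inpaint


# Hypothetical values: half of the image comes from the dominant view,
# a quarter from the other view, and a quarter is inpainted.
example = synthesized_distortion(d_left=10.0, d_right=20.0, d_inpaint=100.0,
                                 alpha=0.5, beta=0.5)
```

The three weights sum to one by construction, so the result is a convex combination of the two reference distortions and the inpainting distortion.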

The experiments have been carried out using the 3D sequences “Statue” and “Mansion” [50], where cameras acquire the scene with uniform spacing between the camera positions. The spacing between camera positions is mm and mm for “Statue” and “Mansion”, respectively. Among all camera views provided for both sequences, only a subset represents the set of camera views available at the cloudlet, while the remaining ones are virtual views to be synthesized. Table I depicts how the camera notation used in [50] is adapted to our notation. Finally, for the “Mansion” sequence, in the theoretical model in (24) we used , , and , while for the “Statue” sequence we used , , and .

In the following, we compare the performance achieved by virtual view synthesis in the cloudlets with the scenario in which cloudlets only send a subset of camera views to users. We denote by the subset of selected reference views when synthesis is allowed in the network, and by the subset of selected reference views when only camera views can be sent as reference views, i.e., when synthesis is not allowed in the network. For both network synthesis and no network synthesis, the best subset of reference views is evaluated both with the proposed view selection algorithm and with an exact solution, i.e., an exhaustive search over all possible combinations of reference views. For the proposed algorithm, the distortion is evaluated both with experimental computation of the distortion, where the results are labeled “Proposed Alg. (Experimental Dist)”, and with the model in (24), where the results are labeled “Proposed Alg. (Theoretical Dist)”. For all three algorithms, once the optimal subset of reference views is selected, the full navigation window is reconstructed experimentally and the mean PSNR of the actual reconstructed sequence is computed.
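The exhaustive-search baseline and the conversion of an average distortion into PSNR can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: views are integer indices, each view uses the best bracketing pair of selected references, the distortion callable `d` is hypothetical, and the PSNR conversion treats distortion as a mean squared error on 8-bit images.

```python
from itertools import combinations
from math import log10
from typing import Callable, Sequence


def exhaustive_search(first: int, last: int, candidates: Sequence[int],
                      d: Callable[[int, int, int], float], K: int):
    """Try every subset of K references that brackets the window."""
    views = range(first, last + 1)
    best_set, best_avg = None, float("inf")
    for subset in combinations(sorted(candidates), K):
        if subset[0] > first or subset[-1] < last:
            continue  # the window must lie between two references
        total = 0.0
        for u in views:
            # Best (left, right) pair of selected references around view u.
            total += min(d(u, vl, vr)
                         for vl in subset if vl <= u
                         for vr in subset if vr >= u)
        avg = total / len(views)
        if avg < best_avg:
            best_set, best_avg = subset, avg
    return best_set, best_avg


def psnr(mse: float, peak: float = 255.0) -> float:
    """PSNR in dB, assuming the distortion is an MSE with the given peak."""
    return 10.0 * log10(peak ** 2 / mse)


def toy(u: int, v_left: int, v_right: int) -> float:
    """Assumed toy distortion: one plus distance to the closest reference."""
    return 1.0 + min(u - v_left, v_right - u)


best_set, best_avg = exhaustive_search(0, 6, range(7), toy, K=3)
quality_db = psnr(best_avg)
```

The number of subsets grows combinatorially with the number of candidate positions, which is why such a search is only practical as a reference for small instances.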

In the following, we first validate the distortion model in (24) as well as the proposed optimization algorithm. Then, we provide simulation results using the model in (24) and study the gain offered by network synthesis. For the sake of clarity in the notation, in the following we identify the viewpoints by their indexes only. This means that the set of camera views , for example, is denoted in the following by . Analogously, the navigation window is denoted in the following by .

### VI-B Performance of the view selection algorithm

In Fig. 5, we provide the mean PSNR as a function of the available bandwidth in the setting of a regularly spaced camera set , and a navigation window requested by the user. Results are provided for the “Mansion” and the “Statue” sequences in Fig. 5(a) and Fig. 5(b), respectively. For the “Mansion” sequence, the proposed algorithm with experimental distortion perfectly matches the exhaustive search. The proposed algorithm based on theoretical distortion also closely matches the exhaustive search method, with the exception of the experimental point at in the network synthesis case. In that experiment, the algorithm selects as best subset rather than selected by the exhaustive search. Beyond the good match between the exhaustive search and the proposed algorithm, Fig. 5(a) also shows the gain achieved by synthesizing reference views at the cloudlets. For , the optimal sets of reference views are and . The possibility of selecting the view at position as reference view reduces the reference view distance for viewpoints in compared to the case in which camera view is selected. Thus, as long as the viewpoint is synthesized at a good quality in the network, synthesizing in the network improves the quality of the reconstructed region of interest when the bandwidth is limited. Increasing the channel capacity reduces the quality gain between synthesis and no synthesis at the cloudlets. For , for example, the virtual viewpoint is used to reconstruct the view range of the navigation window. Thus, the benefit of selecting rather than is limited to a portion of the navigation window, and this portion usually decreases for large . Similar considerations can be derived from Fig. 5(b) for the “Statue” sequence, where we also observe a very good match between the proposed algorithm and the exhaustive search.

We then compare in Fig. 6 the performance of the exhaustive search algorithm with our optimization method in the case of non-equally spaced cameras. The “Statue” sequence is considered with an unequally spaced camera set , and a navigation window at the client. Similarly to the equally spaced scenario, the performance of the proposed optimization algorithm matches that of the exhaustive search. This confirms the validity of our assumptions and the optimality of the DP optimization solution. Also in this case, a quality gain is offered by virtual view synthesis in the network, with a maximum gain achieved for , with optimal reference views and .

### VI-C Network synthesis gain

We now study the performance gain due to synthesis in the network for different scenarios. However, multiview video sequences (with both texture and depth maps) currently available as test sequences have a very limited number of views (e.g., views in the Ballet video sequences). Because of this lack of test sequences, we consider synthetic scenarios and adopt the distortion model in (24) both for solving the optimization problem and for evaluating the system performance. The following results are meaningful since we already validated our distortion model in the previous subsection.