Towards Multifocal Displays with Dense Focal Stacks
We present a virtual reality display that is capable of generating a dense collection of depth/focal planes. This is achieved by driving a focus-tunable lens to sweep a range of focal lengths at a high frequency and, subsequently, tracking the focal length precisely at microsecond time resolutions using an optical module. Precise tracking of the focal length, coupled with a high-speed display, enables our lab prototype to generate 1600 focal planes per second. This enables a novel first-of-its-kind virtual reality multifocal display that is capable of resolving the vergence-accommodation conflict endemic to today’s displays.
The human eye automatically changes the focus of its lens to provide sharp, in-focus images of objects at different depths. While convenient in the real world, for virtual or augmented reality (VR/AR) applications, this focusing capability of the eye often causes a problem that is called the vergence-accommodation conflict [Kramida, 2016; Hua, 2017]. Vergence refers to the simultaneous movement of the two eyes so that a scene point comes into the center of the field of view, and accommodation refers to changing the focus of the ocular lenses to bring the object into focus. In the real world, these two cues act in synchrony. However, most commercial VR/AR displays render scenes by only satisfying the vergence cues, i.e., they manipulate the disparity of the images shown to each eye. But given that the display is at a fixed distance from the eyes, the corresponding accommodation cues are invariably incorrect, leading to a conflict between vergence and accommodation that can cause discomfort and fatigue, especially after long durations of usage [Hoffman et al., 2008]. While many approaches have been proposed to mitigate the vergence-accommodation conflict (VAC), it remains one of the important challenges for VR and AR displays.
In this paper, we provide the design for a VR display that is capable of addressing the VAC by displaying content at a dense collection of depth planes at a high frame rate. The proposed display falls under the category of multifocal displays, i.e., displays that generate content at different depths or focal planes by either changing the focal length of the eyepiece [Liu et al., 2008; Liu and Hua, 2009; Love et al., 2009; Llull et al., 2015; Johnson et al., 2016; Konrad et al., 2016], or using multiple physical displays located at different depths [Rolland et al., 2000; Akeley et al., 2004; Narain et al., 2015; Mercier et al., 2017]. We focus on displays that change the focal length of the eyepiece via the use of a focus-tunable lens. The distinguishing factor is that the proposed device is capable of displaying a stack of focal planes that is an order of magnitude greater in number as compared to prior work; further, this capability is achieved with no loss in frame rate of the display. Specifically, our prototype is capable of displaying 1600 focal planes per second, which can be used to display scenes with 40 focal planes per frame at 40 frames per second. As a consequence, we are able to render virtual worlds at a level of realism that is hard to achieve with current multifocal display designs.
To understand how we can display thousands of focal planes per second, it is worth pointing out that the key factor that limits the depth resolution of a multifocal display is the operational speed of its focus-tunable lens. Focus-tunable lenses change their focal length based on an input driving voltage. Most commercially-available focus-tunable lenses [Optotune, 2017; Varioptic, 2017] require around s to settle onto a particular focal length. Hence, in order to wait for the lens to settle so that the displayed image is rendered at the desired depth, we can output at most focal planes per second. For a display operating with 30-60 frames per second (fps), this would imply anywhere between three and six focal planes per frame.
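To make this budget concrete, the following sketch counts how many focal planes fit into one frame when the display has to wait for the lens to settle before showing each plane. The 5 ms settling time is an assumed, illustrative value chosen to reproduce the three-to-six range quoted above; it is not a datasheet figure, and the function name is ours.

```python
def planes_per_frame(settle_time_s: float, fps: float) -> int:
    """Focal planes per frame if each plane must wait one full settling time."""
    planes_per_second = 1.0 / settle_time_s
    return int(planes_per_second // fps)


ASSUMED_SETTLE_TIME_S = 5e-3  # assumption for illustration only

for fps in (30, 60):
    print(fps, "fps ->", planes_per_frame(ASSUMED_SETTLE_TIME_S, fps), "planes per frame")
```

By contrast, a display that is limited by its frame rate rather than by the settling time can show 1600 / 40 = 40 focal planes per frame at 40 fps, which are the figures quoted for the prototype.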
An important observation that we make is that, while focus-tunable lenses have long settling times, their frequency response is rather broad and has a cut-off upwards of Hz [Optotune, 2017]. This suggests that we can drive the lens with excitations that are radically different from a simple step edge (i.e., a change in voltage). For example, we could make the lens sweep through its entire gamut of focal lengths at very high frequencies simply by exciting it with a sinusoid or a triangular voltage of a desired frequency. If we can subsequently track the focal length of the lens in real-time, we can accurately display focal planes at any depth without waiting for the lens to settle. In other words, by driving the focus-tunable lens to periodically sweep the desired range of focal lengths and tracking the focal length at high-speed and in real-time, we can display numerous focal planes.
This paper proposes a novel multifocal display that produces three-dimensional scenes by displaying dense focal stacks. In this context, we make the following contributions:
High-speed focal-length tracking. The core contribution of this paper is a system for real-time tracking of the focal length of a focus-tunable lens at microsecond-scale resolutions. We achieve this by measuring the deflection of a laser incident on the lens.
Design space analysis. We provide analytical expressions for achievable spatial resolution and the depth-of-field of multifocal displays when viewed under the finite aperture of the human eye, and show that a high focal plane density is required to preserve spatial detail in the displayed imagery. This analysis also characterizes the field-of-view and eye box of the proposed display.
Prototype. Finally, we build a proof-of-concept prototype that is able to produce 40 8-bit focal planes per frame at 40 fps. This corresponds to 1600 focal planes per second, a capability that is an order of magnitude greater than competing approaches.
The proposed display has the following limitations:
Need for additional optics. The proposed focal-length tracking device requires additional optics that increase its bulk.
Peak brightness. Displaying a large number of focal planes per frame leads to a commensurate decrease in peak brightness of the display since each depth plane is illuminated for a smaller fraction of time. This is largely not a concern for VR displays, and can potentially be alleviated with techniques that redistribute light [Damberg et al., 2016].
Occlusion cues. Another limitation of displaying dense focal stacks is the inability of the display to block light. When we focus on a focal plane, the defocus blur from the planes behind it leaks onto the content of that plane, reducing contrast. While this problem can be mitigated with preprocessing of the focal stack [Narain et al., 2015; Mercier et al., 2017], we observe satisfactory results even without it.
Limitations of our prototype. Our current proof-of-concept prototype uses a digital micromirror display (DMD) and, as a consequence, has low energy efficiency. The problem can be easily solved by switching to energy-efficient displays, like OLED, or projectors with laser scanning or displays that redistribute light to achieve higher peak brightness and contrast.
2. Related Work
A typical VR display is composed of a convex eyepiece and a display unit. As shown in Figure 2a, the display is placed within the focal length of the convex lens in order to create a magnified virtual image. The distance of the virtual image can be calculated by the thin lens formula:
where is the distance between the display and the lens, and is the focal length. We can see that is an affine function of the optical power () of the lens and (). By choosing and , the designer can put the virtual image of the display at a desired depth. For example, in an aviation simulator, since most virtual objects are far away from the user, is usually chosen to be to send the display to infinity to minimize VAC. However, for most consumer applications, scenes usually span a wide range of depths. Due to the fixed focal plane, these displays are unable to provide natural accommodation cues and often cause dizziness and blurred vision, especially with long use.
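As a small numerical illustration of the thin lens relation, the sketch below maps a lens power to the depth of the virtual image. It assumes the sign convention in which the virtual image distance is positive on the display side of the lens, so that 1/v = 1/d_o - 1/f; the symbol names d_o, v and f and the numeric values are placeholders chosen for this example, not values from the prototype.

```python
import math


def virtual_image_distance(d_o_m: float, lens_power_diopter: float) -> float:
    """Distance (m) of the virtual image for a display d_o_m metres from a thin lens.

    Assumes 1/v = 1/d_o - 1/f with v > 0 for a virtual image; returns math.inf
    when the display sits exactly at the focal plane (image at infinity).
    """
    inverse_v = 1.0 / d_o_m - lens_power_diopter  # affine in the lens power
    if inverse_v < 0:
        raise ValueError("display is outside the focal length; image is real")
    return math.inf if inverse_v == 0 else 1.0 / inverse_v


# Hypothetical example: display 10 cm from the lens.
print(virtual_image_distance(0.10, 10.0))  # display at the focal plane
print(virtual_image_distance(0.10, 6.0))   # weaker lens pulls the image closer
```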
2.1. Accommodation-Supporting Displays
There have been many designs proposed to provide accommodation support. We concentrate on techniques most relevant to the proposed method, deferring a detailed description to prior art, e.g., [Kramida, 2016] and [Hua, 2017]; in particular, see Table 1 of [Matsuda et al., 2017].
2.1.1. Multifocal and Varifocal Displays
Multifocal and varifocal displays control the depths of the focal planes by dynamically adjusting or in (1). Multifocal displays aim to produce multiple focal planes at different depths for each frame (Figure 2b), whereas varifocal displays support only one focal plane per frame whose depth is dynamically adjusted based on the gaze of the user’s eyes (Figure 2c). Multifocal and varifocal displays can be designed in many ways, including the use of multiple (transparent) displays placed at different depths [Rolland et al., 1999; Akeley et al., 2004], a translation stage to physically move a display or optics [Shiwa et al., 1996; Sugihara and Miyasato, 1998; Akşit et al., 2017] as well as a focus-tunable lens to optically reposition a fixed display [Liu et al., 2008; Love et al., 2009; Padmanaban et al., 2017; Johnson et al., 2016; Konrad et al., 2016]. Varifocal displays need to show only a single focal plane at any point in time, but they require precise eye/gaze-tracking at low latency. Multifocal displays, on the other hand, have largely been limited to displaying a few focal planes per frame due to the bulk of physical displays as well as the limited switching speed of translation stages and focus-tunable lenses.
2.1.2. Light Field Displays
While multifocal and varifocal displays produce a collection of focal planes, light field displays aim to synthesize the light field of a 3D scene. Lanman and Luebke  introduce angular information by replacing the eyepiece with a microlens array; Huang et al.  utilize multiple spatial light modulators to modulate the intensity of light rays. While these displays fully support accommodation cues and produce natural defocus blur and parallax, they usually suffer from poor spatial resolution, due to the spatial-angular trade-off, and diffraction induced by multiple layers of spatial light modulators.
Multifocal displays (and thereby the proposed display) can also be considered as light field displays. As opposed to trading spatial resolution for angular/depth resolution, multifocal displays exploit the persistence of vision of the human eye by displaying content at multiple depth planes in a short interval of time, i.e., we trade time resolution of the display to obtain depth resolution. This approach reduces the complexity of hardware and algorithm, and enables us to generate light fields of high spatial/angular resolution.
2.1.3. Other Types of Virtual Reality Displays
Other types of VR/AR displays have been proposed to solve the vergence-accommodation conflict. Matsuda et al.  use a phase-only spatial light modulator to create spatially-varying lensing based on the virtual content and the gaze of the user. Maimone et al.  utilize a phase-only spatial light modulator to create a 3D scene using holography. Similar to our work, Konrad et al.  operate a focus-tunable lens in an oscillatory mode. Instead of tracking the focal length, they simply use the focus-tunable lens to create a depth-invariant blur, a concept proposed earlier for extended depth of field imaging [Miau et al., 2013]. Intuitively, since content is displayed at all focal planes, the vergence-accommodation conflict is significantly reduced. However, this comes at the cost of reduced spatial resolution due to the defocus blur that is intentionally introduced.
2.2. Depth-Filtering Methods
Multifocal displays with limited focal planes produce virtual scenes that have aliasing artifacts as well as reduced spatial resolution on content that is to be rendered in between focal planes. Akeley et al.  propose a linear depth filtering method to reduce artifacts, and MacKenzie et al.  validate its effectiveness in approximating retinal defocus blur. Liu and Hua  propose a nonlinear filtering that fuses virtual displays to adjust the depth of a perceived image. Despite their computational efficiency, these filtering methods produce artifacts near object boundaries due to the inability of multifocal displays to occlude light.
To simultaneously reduce artifacts due to lack of focal planes and to produce proper occlusion cues with multifocal displays, Narain et al.  propose an optimization-based method that jointly solves for the contents shown on all focal planes. By modeling the defocus blur of focal planes when an eye is focused at certain depths, they formulate a non-negative least-squares problem that minimizes the mean-squared error between perceived images and target images at multiple depths. While the algorithm demonstrates promising results, the computational costs of the optimization are often too high for real-time applications. Mercier et al.  simplify the forward model of Narain et al.  and significantly improve the speed of solving the optimization problem. While their experimental results show that optimization-based filtering is able to drive accommodation, they also show that linear depth filtering provides stronger accommodation cues. It is worth noting that these filtering approaches are largely complementary to the proposed work, in that they can be incorporated into the dense focal stacks produced by our proposed display.
3. How Many Focal Planes Do We Need?
A key factor underlying the design of multifocal displays is the number of focal planes required to support a target accommodation range. In order to be indistinguishable from the real world, a virtual world should enable human eyes to accommodate freely on arbitrary depths. In addition, the virtual world should have high spatial resolution anywhere within the target accommodation range. Simultaneously satisfying these two criteria for a large accommodation range is very challenging, since it requires generating a light field of high spatial and angular resolution. In the following, we will show that displaying a dense focal stack is a promising step toward the ultimate goal of generating virtual worlds that can handle the accommodation cues of the human eye.
To understand the capability of a multifocal display, we can analyze its generated light field in the frequency domain. Our analysis, following the methods derived in Wetzstein et al.  and Narain et al. , provides an upper-bound on the performance of a multifocal display, regardless of the depth filtering algorithm applied. It is also similar to that of Sun et al.  with the key difference that we focus on the minimum number of focal planes required to retain spatial resolution within an accommodation range, as opposed to efficient rendering of foveated light fields.
Light-Field Parameterization and Assumptions
For simplicity, our analysis considers a flatland with two-dimensional light fields. The generalization to four-dimensional light fields can be conducted in a similar manner. In the flatland, the direction of a light ray is parameterized by its intercepts with two parallel axes, and , which are separated by unit, and the origin of the -axis is relative to each individual value of such that measures the tangent angle of a ray passing through , as shown in Figure 3a. We model the human eye with a camera composed of a finite-aperture lens and a sensor plane away from the lens, following the assumptions made in Mercier et al.  and Sun et al. . We assume that the pupil of the eye is located at the center of the focus-tunable lens and is smaller than the aperture of the tunable lens. We assume that the display and the sensor emit and receive light isotropically. In other words, each pixel on the display uniformly emits light rays toward every direction, and vice versa for the sensor. We also assume small-angle (paraxial) scenarios, since the distance and the focal length of the focus-tunable lens (or essentially, the depths of focal planes) are large compared to the diameter of the pupil. This assumption simplifies our analysis by allowing us to consider each pixel in isolation.
Light Field Generated by the Display
Since the display is assumed to emit light isotropically in angle, the light field created by a display pixel can be modeled as , where is the radiance emitted by the pixel, represents two-dimensional convolution, and is the pitch of the display pixel. The Fourier transform of is , which lies on the axis, as shown in Figure 3b. We only plot the central lobe of corresponding to , since this is sufficient for calculation of the half-maximum bandwidth of retinal images. In the following, we omit the constant for brevity.
Propagation from Display to Retina
Let us decompose the optical path from the display to the retina (sensor) and examine its effects in the frequency domain. After leaving the display, the light field propagates a distance , gets refracted by the tunable lens and by the lens of the eye, where it is partially blocked by the pupil, and propagates a distance to the retina, where it finally gets integrated across angle. Propagation and refraction shear the spectrum of the light field along and , respectively, as shown in Figure 3(c,d,e). Before entering the pupil, the focal plane at depth forms a segment of slope within , where is due to the magnification of the lens. For brevity, we show only the final (and most important) step and defer the full derivation to the appendix.
Suppose the eye focuses at depth , and the focus-tunable lens configuration creates a focal plane at . The Fourier transform of the light field reaching the retina is
where represents two-dimensional cross correlation, is the Fourier transform of the light field from the focal plane at reaching the retina without aperture (Figure 3f), and is the Fourier transform of the aperture function propagated to the retina (Figure 3g). Depending on the virtual depth , the cross correlation creates a different extent of blur on the spectrum (Figure 3h). Finally, the Fourier transform of the image that is seen by the eye is simply the slice along on .
When the eye focuses at the focal plane (), the spectrum lies entirely on and the cross correlation with has no effect on the spectrum along . The resulting retinal image has maximum spatial resolution , which is independent of the depth of the focal plane .
When the eye is not focused on the virtual depth plane, i.e., , the cross correlation results in a segment of width
on the -axis (Figure 3h). Note that , and thereby the half-maximum bandwidth of the spatial frequency of the perceived image is upper-bounded by .
Spatial Resolution of Retinal Images
With the above derivations, we are ready to characterize the spatial resolution of a multifocal display. Suppose the eye can accommodate freely on any depth within a target accommodation range, . Let be the set of depths of the focal planes created by the multifocal display. When the eye focuses at , the image formed on its retina has spatial resolution of
where the first term characterizes the inherent spatial resolution of the display unit, and the second term characterizes spatial resolution limited by accommodation, i.e. potential mismatch between the focus plane of the eye and the display. This bound on spatial resolution is a physical constraint caused by the finite display pixel pitch and the limiting aperture (i.e., the pupil) — even if the retina had infinitely-high spatial sampling rate. Any post-processing methods including linear depth filtering, optimization-based filtering, and nonlinear deconvolution cannot surpass this limitation.
Number of Focal Planes Needed
As can be seen in (3), the maximum spacing between any two focal planes in diopter determines , the lowest perceived spatial resolution within the accommodation range. If we desire a multifocal display with spatial resolution across the accommodation range to be at least , , the best we can do with focal planes is to have a constant inter-focal separation in diopter. This results in an inequality that
Thereby, increasing the number of focal planes (and distributing them uniformly in diopter) is required for multifocal displays to support higher spatial resolution and wider accommodation range.
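The uniform-in-diopter placement mentioned above can be sketched as follows. The code places planes at both ends of the accommodation range, which is one reasonable convention and may differ from how the bound in the text counts planes; the function name and the 25 cm near limit used in the example are assumptions made for illustration.

```python
import math


def focal_plane_depths(n_planes: int, near_m: float, far_m: float = math.inf):
    """Depths (m) of n_planes spaced uniformly in diopters between near_m and far_m."""
    if n_planes < 2:
        raise ValueError("need at least two planes to span a range")
    d_near = 1.0 / near_m
    d_far = 1.0 / far_m  # 1/inf evaluates to 0.0 diopters
    depths = []
    for i in range(n_planes):
        t = i / (n_planes - 1)                  # 0.0 at the near end, 1.0 at the far end
        diopter = d_near + (d_far - d_near) * t
        depths.append(math.inf if diopter == 0 else 1.0 / diopter)
    return depths


# Hypothetical example: five planes from 25 cm to infinity, i.e. 1 diopter apart.
print(focal_plane_depths(5, 0.25))
```

Note that planes equally spaced in diopters are far from equally spaced in metres: most of them cluster close to the viewer, where the eye's depth of field is shallowest.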
Relationship to Prior Work.
Many prior works study the minimum focal-plane spacing of multifocal displays. Rolland et al.  compute the depth-of-focus based on typical acuity of the human eye ( cycles per degree) and pupil diameter ( m) and conclude that focal planes equally spaced by diopter are required to accommodate from m to . Their analysis and ours share the same underlying principle, namely maintaining the minimum resolution seen by the eye within the accommodation range, and thereby yield the same number of required focal planes. By taking m, , m, and , we have , which is the same as their result. MacKenzie et al. [2010; 2012] measured accommodation responses of the human eye during usage of multifocal displays of different plane-separation configurations with linear depth filtering [Akeley et al., 2004]. Their results suggest that focal-plane separations as wide as diopter can drive accommodation with insignificant deviation from natural accommodation. However, it is also reported that smaller plane separation provides more natural accommodation and higher retinal contrast, features that are desirable in any VR/AR display. With a dense focal stack of focal-plane separation as small as diopter, our prototype can simultaneously provide proper accommodation cues and display high-resolution retinal images.
Depth-of-Field of a Focal Plane
At the other extreme, if we have a sufficient number of focal planes, then the limiting factor becomes the pixel pitch of the display unit, and our derivation of (3) can be used to compute the depth-of-field of a focal plane. For a focal plane at virtual depth , the retinal image of an eye focused on will have maximal spatial resolution if
Since the maximum accommodation range of the multifocal display with a convex tunable lens is diopter, we need at least focal planes to achieve the maximum spatial resolution of the multifocal display across the supported depth range. For example, our prototype has m and, say, m; it would then require focal planes to reach the resolution upper bound.
4. Generating Dense Focal Stacks
We now have a clear goal: designing a multifocal display that supports a very dense focal stack, which enables the display of high-resolution images across a wide accommodation range. The key bottleneck for building multifocal displays with dense focal stacks is the settling time of the focus-tunable lens. The concept described in this section outlines an approach to mitigate this bottleneck and provides a design template for displaying dense focal stacks.
4.1. Focal-Length Tracking
The centerpiece of our proposed work is the idea that we do not have to wait for the focus-tunable lens to settle at a particular focal length. Instead, if we constantly drive the lens so that it sweeps across a range of focal lengths, and subsequently track the focal length in real time, we can display the corresponding focal plane without waiting for the focus-tunable lens to settle. This enables us to display as many focal planes as we want, as long as the display supports the required frame rate.
While the optical power of focus-tunable lenses is controlled by an input voltage or current, simply measuring these values only provides inaccurate and biased estimates of the focal length. This is due to the time-varying transfer functions of tunable lenses, which are known to be sensitive to operating temperature and irregular motor delays. Instead, we propose to estimate the focal length by probing the tunable lens optically. This enables robust estimates that are insensitive to these factors.
In order to measure the focal length, we send a collimated infrared laser beam through the edge of the focus-tunable lens. Since the direction of the outgoing beam depends on the focal length, the laser beam changes direction as the focal length changes. There are many approaches to measure this change in direction, including using a one-dimensional pixel array or an encoder system. In our prototype, we use a one-dimensional position sensing detector (PSD), which enables us to measure the location efficiently, accurately and rapidly. The schematic is shown in Figure 4a.
The focal length of the lens is estimated as follows. We first align the laser so that it is parallel to the optical axis of the focus-tunable lens. After deflection caused by the lens, the beam is incident on a spot on the PSD. As shown in Figure 4b, the location of this spot is given as
where is the focal length of the lens, is the distance measured along the optical axis between the lens and the PSD, and is the distance between the optical center of the lens and the spot the laser is incident on. Note that the displacement is an affine function of the optical power of the focus-tunable lens.
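The affine dependence noted above can be checked numerically with a paraxial thin-lens model. In the sketch below, a ray that enters parallel to the optical axis at height a is bent toward the focal point and reaches height a(1 - d/f) at a distance d behind the lens. The sign convention, symbol names and numeric values are assumptions made for this illustration.

```python
def spot_height(a_m: float, d_m: float, lens_power_diopter: float) -> float:
    """Height (m) of the laser spot, relative to the optical axis, at the PSD plane.

    Paraxial thin-lens model: a ray entering parallel to the axis at height a_m
    crosses the axis at the focal point, so at distance d_m behind the lens it
    is at a_m * (1 - d_m * power).
    """
    return a_m * (1.0 - d_m * lens_power_diopter)


# Hypothetical geometry: beam 4 mm off-axis, PSD 10 cm behind the lens.
for power in (0.0, 5.0, 10.0):
    print(power, spot_height(0.004, 0.10, power))
```

Equal steps in optical power move the spot by equal distances, which is what makes a simple two-point calibration of the tracker possible.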
We next discuss how the location of the spot is estimated from the PSD outputs. A PSD is composed of a photodiode and a resistor distributed throughout the active area. The photodiode has two connectors at its anode and a common cathode. Suppose the total length of the active area of the PSD is . When a light ray reaches a point at on the PSD, the generated photocurrent flows from each anode connector to the cathode in an amount inversely proportional to the resistance in between. Since resistance is proportional to length, we have the ratio of the currents at the two anode connectors as
After arranging the expression, we get
As can be seen, the optical power of the tunable lens is an affine function of . With simple calibration (to get the two coefficients), we can easily estimate the value.
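The two steps above, recovering the spot position from the PSD currents and calibrating the affine map to optical power, can be sketched as follows. The code assumes the common convention that the position is measured from the center of the active area, so that x = (L/2)(I2 - I1)/(I1 + I2); the function names and all numeric values are hypothetical and are not taken from the prototype.

```python
def normalized_position(i1: float, i2: float) -> float:
    """Dimensionless spot position in [-1, 1] from the two anode currents."""
    return (i2 - i1) / (i1 + i2)


def psd_position(i1: float, i2: float, length_m: float) -> float:
    """Spot position (m) measured from the center of a PSD of the given length."""
    return 0.5 * length_m * normalized_position(i1, i2)


def two_point_calibration(r_a, power_a, r_b, power_b):
    """Return (slope, offset) so that power = slope * r + offset."""
    slope = (power_b - power_a) / (r_b - r_a)
    return slope, power_a - slope * r_a


# Hypothetical calibration: two known lens powers and their measured ratios.
slope, offset = two_point_calibration(-0.5, 6.0, 0.5, 10.0)
r = normalized_position(1.0, 3.0)      # more current on the second anode
print(psd_position(1.0, 3.0, 0.010))   # spot position on a 10 mm PSD
print(slope * r + offset)              # estimated optical power in diopters
```

Normalizing by the total current makes the estimate insensitive to fluctuations in laser power, which is one reason this ratio is commonly used with PSDs.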
When the incoming laser beam is not infinitely narrow, the PSD measures the centroid of the intensity of the beam. The speed and accuracy of PSDs are determined by their photodiodes, and are on the order of nanoseconds and micrometers, respectively. Thereby, by incorporating a PSD, the tracking system enables real-time, robust estimation of the focal length of the tunable lens.
4.2. The Need for Fast Displays
In order to display multiple focal planes within one frame, we also require a display that has a high frame rate, greater than or equal to the focal-plane display rate. To achieve this, in our prototype, we use a digital micromirror device (DMD)-based projector as our display. Commercially available DMDs can easily achieve upwards of bitplanes per second. Following the design in [Chang et al., 2016], we modulate the intensity of the projector’s light source to display 8-bit images; this enables us to display each focal plane with 8 bits of intensity and generate as many as focal planes per second.
4.3. Design Criteria and Analysis
We now analyze the system in terms of various desiderata and the system configurations required to achieve them.
Achieving a Full Accommodation Range.
A first requirement is that the system be capable of supporting the full accommodation range of typical human eyes, i.e., generate focal planes from m to infinity. Suppose the optical power of the focus-tunable lens ranges from to diopter. From (1), we have
where is the distance between the display unit and the tunable lens, is the distance of the virtual image of the display unit from the lens, is the focal length of the lens at time , and is the optical power of the lens in diopter. Since we want to range from cm to infinity, ranges from to . Thereby, we need
An immediate implication of this is that , i.e., to support the full accommodation range of a human eye, we need a focus-tunable lens whose optical power spans at least diopters. We have more choice over the actual range of focal lengths taken by the lens. A simple choice is to set ; this ensures that we can render focal planes at infinity; subsequently, we choose sufficiently large to cover diopters. By choosing a small value of , we can have a small and thereby achieve a compact display.
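The constraint can be illustrated with a short calculation. The sketch assumes the relation 1/v = 1/d_o - D between the virtual image distance v, the display distance d_o and the lens power D, and it takes a 25 cm near limit (4 diopters) as an example value; both the symbols and the numbers are assumptions made for illustration.

```python
def required_power_range(d_o_m: float, near_m: float):
    """Lens powers (diopters) that place the virtual image at near_m and at infinity.

    Assumes 1/v = 1/d_o - D, so D = 1/d_o - 1/v.
    """
    power_for_infinity = 1.0 / d_o_m              # 1/v = 0
    power_for_near = 1.0 / d_o_m - 1.0 / near_m   # 1/v = 1/near_m
    return power_for_near, power_for_infinity


low, high = required_power_range(d_o_m=0.10, near_m=0.25)
print(low, high, high - low)
```

Under this model the span high - low equals 1/near_m regardless of d_o, so the required tuning range depends only on the accommodation range, while d_o sets where that range sits.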
The proposed display shares the same field-of-view and eye box characteristics as other multifocal displays. The field-of-view is maximized when the eye is located right next to the lens. This results in a field-of-view of , where is the height (or width) of the physical display (or its magnified image via lensing). When the eye is further away from the lens, the numerical aperture limits the extent of the field-of-view. Since the aperture of most tunable lenses is small (around 1 cm in diameter), we would prefer to put the eye as close to the lens as possible. This can be achieved by embedding the dichroic mirror (the right one in Figure 4a) onto the rim of the lens.
The eye box of multifocal displays is often small, and the proposed display is no exception. Due to the depth difference of focal planes, as the eye shifts, contents on each focal plane shift by different amounts, with closer planes shifting more than farther ones. This leaves uncovered regions on the final retinal images. This problem can be solved by incorporating an eye tracker, as demonstrated by Mercier et al. .
The proposed display has the following limitations.
Reduced Maximum Brightness.
Suppose we are displaying focal planes per frame, and frames per second. Each focal plane is displayed for second, which is -times smaller compared to typical virtual reality displays with one focal plane or varifocal displays that dynamically change the depth of the focal plane. This can be addressed by increasing the radiance output of the light source. For our prototype, we use a high-power LED that is designed to light a full wall with a typical projector. Since the area of the display is much smaller and there are no competing light sources in virtual reality displays, we do not experience any brightness issue.
Energy efficiency of the proposed method also depends on the type of display used. For our prototype, since we use a DMD to spatially modulate the intensity at each pixel, we waste of the energy. This can be completely avoided by using fast OLED displays, where a pixel can be completely turned off. An alternate solution is to use a phase spatial light modulator (SLM) [Damberg et al., 2016] to spatially redistribute a light source so that each focal plane only gets illuminated at pixels that need to be displayed; a challenge here is the slow refresh rate of the current crop of phase SLMs. Another option is to use a laser along with a 2D galvo to illuminate only the pixels at each depth plane; this would lead to a reduction in spatial resolution of the displayed image since 2D galvos are often slow when operated in non-resonant modes.
Inability to Block Light.
Another limitation of the proposed display is its inability to block light. This is an inherent problem of all multifocal displays: the focal planes closer to the eye cannot occlude light from the farther focal planes. As a result, when focusing on a focal plane, we see the defocus-blurred focal planes behind it overlaid on that plane. This can be mitigated by applying optimization-based depth filtering [Narain et al., 2015; Mercier et al., 2017], which deconvolves the blur kernels on the contents shown on the focal planes. In practice, due to the small pupil diameter of the eye, the defocus blur does not create disturbing artifacts, and we see satisfactory results even without optimization-based filtering.
5. Proof-of-Concept Prototype
In this section, we present a lab prototype that validates the core ideas of this paper, namely, generating a dense focal stack using high-speed tracking of the focal length of a tunable lens and a high-speed display.
5.1. Implementation Details
The prototype is composed of three functional blocks: the focus-tunable lens and its driver, the focal-length tracking device and its processing circuit, and a DMD-based projector. All three components are controlled by an FPGA (Altera DE0-nano-SOC). The FPGA drives the tunable lens with a digital-to-analog converter (DAC), following Algorithm 1. Simultaneously, the FPGA reads the focal-length tracking output with an analog-to-digital converter (ADC) and uses the value to trigger the projector to display the next focal plane. Every time a focal plane has been displayed, the projector is immediately turned off to avoid blur caused by the continuously changing focal-length configuration. A photo of the prototype is shown in Figure 5. In the following, we introduce each component in detail.
In order to display focal planes at the correct depths, we need to know the corresponding PSD tracking outputs. While in principle we can build a mapping table by measuring the PSD tracking outputs and the corresponding depths by focusing a camera on a physical resolution chart, in practice this is very difficult, simply because the depths can go to infinity. In addition, measuring the depth of a virtual image is not easy. With no actual screen to reflect the light, we cannot use a range meter to measure the distance; the small aperture of the tunable lens also makes stereo methods unreliable. All these reasons limit our options.
Thereby, we can estimate the current depth if we know and , which only requires two measurements to estimate.
With a camera focusing at m and , we get the two corresponding ADC readings and . The two points can be accurately measured, since the depth-of-field of the camera at m is very small, and infinity can be approximated as long as the image is far away. Since (11) has an affine relationship, we only need to divide evenly into the desired number of focal planes.
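The even division described above can be sketched as follows. This is a minimal illustration rather than the prototype's calibration code: the function name and the two example ADC readings are hypothetical, and placing the first and last focal planes exactly at the two calibration points is an assumption.

```python
from typing import List


def plane_thresholds(adc_near: float, adc_far: float, num_planes: int) -> List[float]:
    """Evenly spaced ADC targets between two calibration readings (endpoints included).

    Relies on the affine relationship between the tracking output and the
    depth in diopters, so equal ADC steps give equal dioptric steps.
    """
    if num_planes < 2:
        raise ValueError("need at least two focal planes")
    step = (adc_far - adc_near) / (num_planes - 1)
    return [adc_near + i * step for i in range(num_planes)]


# Hypothetical 12-bit readings taken at the two calibration depths.
targets = plane_thresholds(adc_near=500, adc_far=3500, num_planes=4)
print(targets)
```

Because only the two endpoint readings are measured, recalibration after a change in the optics needs just two camera captures.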
5.1.2. Control Algorithm.
The FPGA follows Algorithm 1 to coordinate the tunable lens and the projector. At a high level, we drive the tunable lens with a triangular wave by continuously increasing or decreasing the DAC level. We simultaneously monitor the PSD output through the ADC to trigger the projection of focal planes. When the last or first focal plane is displayed, we switch the direction of the waveform. Note that while Algorithm 1 is written in serial form, every module in the FPGA runs in parallel.
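The high-level behavior described above can be simulated in a few lines. The sketch below is a serial Python model, not the FPGA implementation and not a transcription of Algorithm 1; `run_sweeps`, `read_psd`, and all numeric values are hypothetical, and the lens is modeled as ideal (no motor delay).

```python
from typing import Callable, List


def run_sweeps(read_psd: Callable[[int], float], thresholds: List[float],
               dac_step: int, num_sweeps: int,
               dac_max: int = 4095, max_steps: int = 1_000_000) -> List[int]:
    """Return the focal-plane indices triggered over `num_sweeps` half-periods.

    read_psd(dac) stands in for the tracking reading observed after writing
    `dac`; `thresholds` must be increasing, one target per focal plane.
    """
    shown: List[int] = []
    dac, direction = 0, +1            # start at the bottom, ramping up
    nxt, last = 0, len(thresholds) - 1
    sweeps_done, steps = 0, 0
    while sweeps_done < num_sweeps:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("thresholds never reached; check the calibration range")
        # One step of the triangular drive, clamped to the DAC range.
        dac = min(max(dac + direction * dac_step, 0), dac_max)
        reading = read_psd(dac)
        if direction > 0:
            crossed = reading >= thresholds[nxt]
        else:
            crossed = reading <= thresholds[nxt]
        if crossed:
            shown.append(nxt)         # trigger the projector for this plane
            at_end = (nxt == last) if direction > 0 else (nxt == 0)
            if at_end:
                direction = -direction  # last/first plane shown: reverse the ramp
                sweeps_done += 1
            else:
                nxt += direction
    return shown


# Ideal lens: the tracking reading follows the DAC value exactly.
order = run_sweeps(lambda dac: float(dac), [1000.0, 2000.0, 3000.0],
                   dac_step=10, num_sweeps=2)
print(order)
```

The key design point carried over from the text is that the projector is triggered by the measured reading rather than by the DAC value, so the same loop works when `read_psd` lags or drifts relative to the drive signal.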
The control algorithm is simple yet robust. It is known that the transfer function of the tunable lens is sensitive to many factors, including device temperature and unexpected motor delay and errors [Optotune, 2017]. In our experience, even with the same input waveform, we observe different offsets and peak-to-peak values on the PSD output waveform for each period. Since the algorithm does not drive the tunable lens with fixed DAC values and instead directly detects the PSD output (i.e., the focal length of the tunable lens), it is robust to these unexpected factors. However, the robustness comes at a price. Due to the motor delay, the peak-to-peak value is often a lot larger than . This causes the frame rate of the prototype ( focal planes per second, or focal planes per frame at fps) to be lower than the highest display frame rate ( focal planes per second). This problem can be solved by adopting a more advanced control algorithm that properly deals with motor delay.
5.1.3. Focus-Tunable Lens and its Driver
We use the focus-tunable lens EL-10-30 from Optotune [Optotune, 2017]. The optical power of the lens ranges from approximately to diopters and is an affine function of the driving current input from to mA. We use a 12-bit DAC (MCP4725) with a current buffer (BUF634) to drive the lens. The DAC provides thousand samples per second, and the current buffer has a bandwidth of MHz. This allows us to faithfully create a triangular input voltage at up to several hundred hertz. The circuit is drawn in Figure 6b.
5.1.4. Focal-Length Tracking and Processing
The focal-length tracking device is composed of a one-dimensional PSD (SL15 from OSI Optoelectronics), two 800nm dichroic shortpass mirrors (Edmund Optics #69-220), and a 980nm collimated infrared laser (Thorlabs CPS980S). The laser beam passes through the boundary of the focus-tunable lens, parallel to the optical axis. We drive the PSD with a reverse bias voltage of V. This enables us to have m precision on the PSD surface and a rise time of s. Across the designed accommodation range, the laser spot traverses within m on the PSD surface, which has a total length of m. This allows us to accurately differentiate up to focal-length configurations.
The analog processing circuit has three stages: an amplifier, an analog calculation stage, and an ADC, as shown in Figure 6a. We use two operational amplifiers (TI OPA-37) to amplify the two output currents of the PSD. The gain-bandwidth product of the amplifiers is MHz, which fully supports our desired operating speeds. We also add a low-pass filter with a cut-off frequency of Hz at the amplifier, as a denoising filter. The computation of is performed in the analog domain with two operational amplifiers (TI OPA-37) and an analog divider (TI MPY634). We use a 12-bit ADC (LTC2308) to port the analog voltage to the FPGA. The ADC has a sampling rate of thousand samples per second. While the analog processing circuit introduces a latency of s, the delay is regular and can be easily handled by calibration.
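For readers unfamiliar with position-sensing detectors, the sketch below shows the standard normalized-difference estimate for a one-dimensional lateral-effect PSD, computed here digitally for illustration. That this is the quantity formed by the analog stage (a sum, a difference, and a division) is an assumption, as are the function name, the sign convention, and the normalized length.

```python
def psd_position(i1: float, i2: float, length: float = 1.0) -> float:
    """Spot position measured from the PSD center, from the two electrode currents.

    Dividing the current difference by the current sum makes the estimate
    independent of the total optical power reaching the detector.
    """
    total = i1 + i2
    if total <= 0:
        raise ValueError("no photocurrent: spot position is undefined")
    return 0.5 * length * (i2 - i1) / total


centered = psd_position(1.0, 1.0)    # equal currents: spot at the center
dim = psd_position(1.0, 3.0)         # off-center spot
bright = psd_position(2.0, 6.0)      # same spot, twice the laser power
print(centered, dim, bright)
```

The last two calls return the same position, which illustrates why a division stage is needed: without it, fluctuations in laser power would be mistaken for changes in focal length.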
5.1.5. DMD-based Projector
The projector is composed of a DLP-7000 from Texas Instruments, projection optics from Vialux, and a high-power LED XHP35A from Cree. We update the configuration of the micro-mirrors every s. Following Chang et al., to project -bit gray-scale images at a very high frame rate, we use pulse-width modulation to change the intensity of the LED concurrently with the update of the micro-mirrors. The pulse-width modulation is performed through an LED driver (TI LM3409HV). This enables us to display at most 8-bit images per second. We observed that the LED driver has a high overshoot output, which causes bright regions in the projected images to flicker irregularly. This can be solved with better LED control drivers.
Note that we divide the 8 bitplanes of each focal plane into two groups of 4 bitplanes; we display the first group while the triangular waveform is increasing, and the other while it is decreasing. From the results presented in Section 6, we can see that the images of the two groups align nicely. This demonstrates the high accuracy of the focal-length tracking.
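The bitplane grouping can be illustrated with a short sketch. Which four bitplanes go into which group is not stated in the text, so the low/high split below is an assumption, and the function names are hypothetical. The weight of each bitplane (a power of two) stands in for the LED intensity applied while that bitplane is shown.

```python
from typing import List, Tuple


def bitplanes(pixels: List[int], bits: int = 8) -> List[List[int]]:
    """Return `bits` binary images; plane k holds bit k of every pixel (k = 0 is the LSB)."""
    return [[(p >> k) & 1 for p in pixels] for k in range(bits)]


def split_for_sweep(pixels: List[int]) -> Tuple[List[List[int]], List[List[int]]]:
    """Split the 8 bitplanes into one group per sweep direction (assumed: low bits first)."""
    planes = bitplanes(pixels)
    return planes[:4], planes[4:]


def recombine(rising: List[List[int]], falling: List[List[int]]) -> List[int]:
    """Sum the intensity-weighted bitplanes of both groups, as the eye integrates them."""
    planes = rising + falling
    return [sum(plane[i] << k for k, plane in enumerate(planes))
            for i in range(len(planes[0]))]


pixels = [0, 1, 37, 128, 255]          # a tiny flattened 8-bit image
rising, falling = split_for_sweep(pixels)
print(recombine(rising, falling))
```

The two groups only add up to the intended gray level if both land at the same depth, which is why the alignment of the two groups is a useful check of tracking accuracy.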
As a quick verification of the prototype, we used the burst mode on the Nikon camera to capture multiple photographs at an aperture of , ISO 12,800 and an exposure time of s. Figure 7 shows six examples of displayed focal planes. Since a single focal plane requires an exposure time of s, the captured images are composed of at most focal planes.
6. Experimental Evaluations
We showcase the performance of our prototype on a range of scenes designed carefully to highlight important features of our system. The supplemental material has video illustrations that contain full camera focus stacks of all results in this section.
To evaluate the focal-length tracking module, we measure the input signal to the focus-tunable lens and the PSD output with an Analog Discovery oscilloscope (100 mega-samples per second). The measurements are shown in Figure 8. As can be seen, the output waveform matches that of the input. The high bandwidth of the PSD and the analog circuit enables us to track the focal length robustly in real time. From the figure, we can also observe the delay of the focus-tunable lens ( s).
Depths of Focal Planes
As stated previously, measuring the depth of displayed focal planes is very difficult. We therefore use a method similar to depth-from-defocus to measure their depths. When a camera is focused at infinity, the size of the defocus blur kernel is proportional to the depth of the (virtual) object in diopters. This provides a method to measure the depths of the focal planes. One downside of the method is that, since we cannot display an infinitesimal spot due to the finite pixel pitch of the display, the estimate of the blur-kernel diameter becomes inaccurate and is dominated by the displayed spot size as a spot comes into focus.
For each focal plane separately, we display a white spot at the center, capture multiple images with various exposure times, and average the images to reduce noise. We also capture images without the center spot, in order to remove the background in post-processing. We label the diameter of the defocus blur kernels and show the results in Figure 9. As can be seen, when the blur-kernel diameters can be accurately estimated, i.e., for largely defocused spots on closer focal planes, the values fit nicely to a straight line, indicating that the depths of the focal planes are uniformly separated in diopters. For focal planes close to infinity, the spot size is dominated by the displayed spot diameter, causing the method to lose its accuracy. As there is no special treatment of individual planes in terms of system design or algorithm, we expect these focal planes to be placed accurately as well.
Characterizing the System Point-Spread Function
To characterize the proof-of-concept prototype, we measure its point spread function with a Nikon D3400 using a m prime lens. We display a static scene that is composed of spots with each spot at a different focal plane. Using the camera, we capture a focal stack of images ranging from to diopters away from the focus-tunable lens. For visibility, we remove the background and noise due to dust and scratches on the lens by capturing the same focal stack with no spot shown on the display. Figure 10 shows the point spread function of the display at four different focus settings, and a video of this focal stack is attached in the supplemental material. The result shows that the prototype is able to display the spots at depths concurrently within a frame, which verifies the functionality of the proposed method. In addition, the defocus blur shown in the result also verifies that the prototype can automatically generate defocus blur without rendering. We note that the spherical aberration of the focus-tunable lens creates ring-shaped defocus blur kernels. The cut-off on the defocus blur is due to the off-axis design used by the projection optics, which only projects the top-half angles.
Dense Focal Stacks
To evaluate the benefit provided by dense focal stacks, we simulate two multifocal displays, one with 4 focal planes and the other with 40 focal planes. The 40 focal planes are distributed uniformly in diopters from 0 to 4 diopters, and the 4-plane display has its focal planes at the depths of the 5th, 15th, 25th, and 35th focal planes of the 40-plane display. The scene is composed of 28 resolution charts, each at a different depth from 0 to 4 diopters, as shown in the supplemental material. The dimension of the scene is . We compare three depth filtering methods: direct quantization, which directly assigns the contents of the all-in-focus image of the scene to the closest focal plane; linear filtering [Akeley et al., 2004]; and optimization-based filtering [Mercier et al., 2017]. We initialize the solutions of optimization-based filtering with the results of direct quantization and perform typical gradient descent with iterations to ensure convergence. The perceived images of the resolution chart at diopters are shown in Figure 11.
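The two simpler depth filtering methods can be sketched per scene point as follows. This is an illustration under stated assumptions, not the simulation code: the function names are hypothetical, the 40 planes are assumed to include both endpoints (0 and 4 diopters), and linear filtering is taken to mean splitting a point's intensity between the two bracketing planes with weights linear in dioptric distance.

```python
from typing import Dict, List


def plane_depths(num_planes: int = 40, max_diopter: float = 4.0) -> List[float]:
    """Focal-plane depths spaced uniformly in diopters (endpoints assumed included)."""
    return [max_diopter * i / (num_planes - 1) for i in range(num_planes)]


def direct_quantization(depth: float, planes: List[float]) -> Dict[int, float]:
    """Assign all intensity to the focal plane closest in diopters."""
    nearest = min(range(len(planes)), key=lambda i: abs(planes[i] - depth))
    return {nearest: 1.0}


def linear_filtering(depth: float, planes: List[float]) -> Dict[int, float]:
    """Split intensity between the two bracketing planes, linearly in diopters."""
    if depth <= planes[0]:
        return {0: 1.0}
    if depth >= planes[-1]:
        return {len(planes) - 1: 1.0}
    hi = next(i for i, d in enumerate(planes) if d >= depth)
    lo = hi - 1
    w_hi = (depth - planes[lo]) / (planes[hi] - planes[lo])
    return {lo: 1.0 - w_hi, hi: w_hi}


dense = plane_depths()                              # 40-plane display
sparse = [dense[i] for i in (4, 14, 24, 34)]        # 5th, 15th, 25th, 35th planes
print(direct_quantization(1.0, dense), linear_filtering(1.0, sparse))
```

With the sparse display, a point between two planes is either snapped to a plane far from its true depth or spread over two planes; with the dense display both effects shrink with the plane separation.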
As can be seen from the results, the perceived images of the 40-plane display closely follow those of the ground truth, with high spatial resolution when focused (Figure 11a) and natural retinal blur when defocused (Figure 11b). In comparison, at its inter-plane location (Figure 11a), the 4-plane display has much lower spatial resolution than the 40-plane display, regardless of the depth filtering method applied. All these results verify our analysis in Section 3.
To characterize the spatial resolution at an inter-plane location of multifocal displays of different plane separations, we implement a 4-plane and a 20-plane multifocal display with our prototype. The 4-plane display has its focal planes on the th focal planes of the 40-plane display, and the 20-plane display has its focal planes on all the odd-numbered focal planes. Note that the brightness of the 4-plane and 20-plane displays is and of a typical 4-plane and a 20-plane multifocal display, respectively. We display a resolution chart on the 5th focal plane of the 40-plane display, which is an on-plane location for the 4-plane, 20-plane and 40-plane displays. By displaying a mesh at the regions surrounding the resolution chart on the th and th focal plane of the 40-plane display, we are able to accurately focus a camera on the inter-plane locations of the 4-plane and 20-plane displays. For the 40-plane display, however, we can only estimate the inter-plane location by interpolating the focus locations of the th and th focal planes. The results captured by a camera with a m lens are shown in Figure 12. As can be seen, a higher number of focal planes (smaller focal-plane separation) results in higher spatial resolution at inter-plane locations.
We compare our prototype with a 4-plane multifocal display on a real scene. Note that we implement the 4-plane multifocal display with our 40-plane prototype by showing contents on the th focal planes, and thereby the 4-plane display has lower () brightness than a typical 4-plane multifocal display. Since we are evaluating image characteristics related to focusing and reproduction of virtual depth, the comparison is fair. The images perceived by the camera are shown in Figure 13. For the 4-plane multifocal display, when used without linear depth filtering, virtual objects at multiple depths come into and out of focus as groups; when used with linear depth filtering, the same objects appear in two focal planes, which reduces their visibility and thereby lowers the resolution of the display. In comparison, the proposed method produces smooth focus and defocus cues across the range of depths, and the perceived images at inter-plane locations (e.g. m) have much higher spatial resolution than those of the 4-plane display.
This paper provides a simple but effective technique for rendering virtual reality scenes that are made of a dense collection of focal planes. We believe that the system proposed in the paper for high-speed tracking could spur innovation in not just virtual and augmented reality systems but also in traditional light field displays (like glasses-free 3D televisions). Our tracking technique is fairly straightforward and extremely amenable to miniaturization.
Color display can be implemented by using a three-color LED and cycling through the colors using time-division multiplexing. This would lead to a loss in time resolution or focal-stack resolution by a factor of . This loss in resolution can be completely avoided with OLED-based high-speed displays, since each group of pixels automatically generates the desired image at each focal plane.
Stereo virtual display
The proposed method can be extended to support stereo virtual reality displays. The most straightforward method is to use two sets of the prototype, one for each eye. Since all focal planes are shown in each frame, there is no need to synchronize the two focus-tunable lenses. It is also possible to create a stereo display with a single focus-tunable lens and a single tracking module; the design for this is shown in Figure 15. This design trades half of the focal planes to support stereo, and thereby requires only one set of the prototype and additional optics. Polarization is used to ensure that each eye sees only the scene that it is meant to see.
- Akşit et al.  Kaan Akşit, Ward Lopes, Jonghyun Kim, Peter Shirley, and David Luebke. 2017. Near-eye Varifocal Augmented Reality Display Using See-through Screens. ACM Transactions on Graphics (TOG) 36, 6 (2017), 189:1–189:13.
- Akeley et al.  Kurt Akeley, Simon J Watt, Ahna Reza Girshick, and Martin S Banks. 2004. A stereo display prototype with multiple focal distances. In ACM Transactions on Graphics (TOG), Vol. 23. 804–813.
- Chang et al.  Jen-Hao Rick Chang, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. 2016. 2^16 shades of gray: high bit-depth projection using light intensity control. Optics express 24, 24 (2016), 27937–27950.
- Damberg et al.  Gerwin Damberg, James Gregson, and Wolfgang Heidrich. 2016. High brightness HDR projection using dynamic freeform lensing. ACM Transactions on Graphics (TOG) 35, 3 (2016), 24.
- eMirage [n. d.] eMirage. [n. d.]. Barcelona Pavillion. https://download.blender.org/demo/test/pabellon_barcelona_v1.scene_.zip.
- Hecht  Eugene Hecht. 2002. Optics. Addison-Wesley.
- Hoffman et al.  David M Hoffman, Ahna R Girshick, Kurt Akeley, and Martin S Banks. 2008. Vergence–accommodation conflicts hinder visual performance and cause visual fatigue. Journal of vision 8, 3 (2008), 33–33.
- Hua  Hong Hua. 2017. Enabling focus cues in head-mounted displays. Proc. IEEE 105, 5 (2017), 805–824.
- Huang et al.  Fu-Chung Huang, Kevin Chen, and Gordon Wetzstein. 2015. The light field stereoscope: immersive computer graphics via factored near-eye light field displays with focus cues. ACM Transactions on Graphics (TOG) 34, 4 (2015), 60.
- Johnson et al.  Paul V Johnson, Jared AQ Parnell, Joohwan Kim, Christopher D Saunter, Gordon D Love, and Martin S Banks. 2016. Dynamic lens and monovision 3D displays to improve viewer comfort. Optics express 24, 11 (2016), 11808–11827.
- Konrad et al.  Robert Konrad, Emily A Cooper, and Gordon Wetzstein. 2016. Novel optical configurations for virtual reality: evaluating user preference and performance with focus-tunable and monovision near-eye displays. In Conference on Human Factors in Computing Systems (CHI). 1211–1220.
- Konrad et al.  Robert Konrad, Nitish Padmanaban, Keenan Molner, Emily A Cooper, and Gordon Wetzstein. 2017. Accommodation-invariant computational near-eye displays. ACM Transactions on Graphics (TOG) 36, 4 (2017), 88.
- Kramida  Gregory Kramida. 2016. Resolving the vergence-accommodation conflict in head-mounted displays. IEEE Transactions on visualization and computer graphics 22, 7 (2016), 1912–1931.
- Lanman and Luebke  Douglas Lanman and David Luebke. 2013. Near-eye light field displays. ACM Transactions on Graphics (TOG) 32, 6 (2013), 220.
- Liu et al.  Sheng Liu, Dewen Cheng, and Hong Hua. 2008. An optical see-through head mounted display with addressable focal planes. In IEEE/ACM International Symposium on Mixed and Augmented Reality. 33–42.
- Liu and Hua  Sheng Liu and Hong Hua. 2009. Time-multiplexed dual-focal plane head-mounted display with a liquid lens. Optics letters 34, 11 (2009), 1642–1644.
- Liu and Hua  Sheng Liu and Hong Hua. 2010. A systematic method for designing depth-fused multi-focal plane three-dimensional displays. Optics express 18, 11 (2010), 11562–11573.
- Llull et al.  Patrick Llull, Noah Bedard, Wanmin Wu, Ivana Tosic, Kathrin Berkner, and Nikhil Balram. 2015. Design and optimization of a near-eye multifocal display system for augmented reality. In Imaging and Applied Optics. JTH3A.5.
- Love et al.  Gordon D Love, David M Hoffman, Philip JW Hands, James Gao, Andrew K Kirby, and Martin S Banks. 2009. High-speed switchable lens enables the development of a volumetric stereoscopic display. Optics express 17, 18 (2009), 15716–15725.
- MacKenzie et al.  Kevin J MacKenzie, Ruth A Dickson, and Simon J Watt. 2012. Vergence and accommodation to multiple-image-plane stereoscopic displays. Journal of Electronic Imaging 21, 1 (2012), 011002.
- MacKenzie et al.  Kevin J MacKenzie, David M Hoffman, and Simon J Watt. 2010. Accommodation to multiple-focal-plane displays: Implications for improving stereoscopic displays and for accommodation control. Journal of vision 10, 8 (2010), 22–22.
- Maimone et al.  Andrew Maimone, Andreas Georgiou, and Joel S Kollin. 2017. Holographic near-eye displays for virtual and augmented reality. ACM Transactions on Graphics (TOG) 36, 4 (2017), 85.
- Matsuda et al.  Nathan Matsuda, Alexander Fix, and Douglas Lanman. 2017. Focal surface displays. ACM Transactions on Graphics (TOG) 36, 4 (2017), 86.
- Mercier et al.  Olivier Mercier, Yusufu Sulai, Kevin Mackenzie, Marina Zannoli, James Hillis, Derek Nowrouzezahrai, and Douglas Lanman. 2017. Fast Gaze-contingent Optimal Decompositions for Multifocal Displays. ACM Transactions on Graphics (TOG) 36, 6 (2017).
- Miau et al.  Daniel Miau, Oliver Cossairt, and Shree K Nayar. 2013. Focal sweep videography with deformable optics. In IEEE International Conference on Computational Photography (ICCP).
- Narain et al.  Rahul Narain, Rachel A Albert, Abdullah Bulbul, Gregory J Ward, Martin S Banks, and James F O’Brien. 2015. Optimal presentation of imagery with focus cues on multi-plane displays. ACM Transactions on Graphics (TOG) 34, 4 (2015), 59.
- Optotune  Optotune. 2017. Optotune electrically tunable lens EL-10-30. http://www.optotune.com/images/products/Optotune.
- Padmanaban et al.  Nitish Padmanaban, Robert Konrad, Tal Stramer, Emily A Cooper, and Gordon Wetzstein. 2017. Optimizing virtual reality for all users through gaze-contingent and adaptive focus displays. Proceedings of the National Academy of Sciences 114 (2017), 9.
- Rolland et al.  Jannick P Rolland, Myron W Krueger, and Alexei Goon. 2000. Multifocal planes head-mounted displays. Applied Optics 39, 19 (2000), 3209–3215.
- Rolland et al.  Jannick P Rolland, Myron W Krueger, and Alexei A Goon. 1999. Dynamic focusing in head-mounted displays. In Electronic Imaging. 463–470.
- Shiwa et al.  Shinichi Shiwa, Katsuyuki Omura, and Fumio Kishino. 1996. Proposal for a 3-D display with accommodative compensation: 3DDAC. Journal of the Society for Information Display 4, 4 (1996), 255–261.
- Sugihara and Miyasato  Toshiaki Sugihara and Tsutomu Miyasato. 1998. System development of fatigue-less HMD system 3DDAC (3D Display with Accommodative Compensation): System implementation of Mk. 4 in light-weight HMD. In ITE Technical Report 22.1. The Institute of Image Information and Television Engineers, 33–36.
- Sun et al.  Qi Sun, Fu-Chung Huang, Joohwan Kim, Li-Yi Wei, David Luebke, and Arie Kaufman. 2017. Perceptually-guided foveation for light field displays. ACM Transactions on Graphics (TOG) 36, 6 (2017), 192.
- Varioptic  Varioptic. 2017. Varioptic variable focus liquid lens ARCTIC 25H. http://varioptic.com/media/cms_page_media/45/MADS_-_160429_-_Arctic_25H_family.pdf.
- Wetzstein et al.  Gordon Wetzstein, Douglas Lanman, Wolfgang Heidrich, and Ramesh Raskar. 2011. Layered 3D: tomographic image synthesis for attenuation-based light field and high dynamic range displays. In ACM Transactions on Graphics (TOG), Vol. 30. 95.
Appendix A Light Field Analysis
This section provides the analysis discussed in Section 3 in detail. The analysis closely follows the one in [Narain et al., 2015]. A notable difference, however, is that we provide analytical expressions for the perceived spatial resolution (Equation (3)) and the minimum number of focal planes required (Equation (5)), whereas they only provide numerical results. For simplicity, we consider a flatland where a light field is two-dimensional and is parameterized by intercepts with two parallel axes, and . The two axes are separated by unit, and for each , we align the origin of -axis to . We model the human eye with a camera model that is composed of a finite-aperture lens and a sensor plane away from the lens, as used by Mercier et al. and Sun et al. We assume that the display and the sensor emit and receive light isotropically, so that each pixel on the display uniformly emits light rays toward every direction, and vice versa for the sensor.
Light Field Generated by a Display
Let us decompose the optical path from the display to the retina (sensor) and examine the effect of each component in the frequency domain. Due to the finite resolution, the light field created by the display can be modeled as
where represents two-dimensional convolution, is the pitch of the display pixel, and is the target light field. The Fourier transform of is
The finite pixel pitch causes pre-filtering, and thus we consider only the central spectrum replica (). Also, we assume for all to avoid aliasing. Since the light field is nonnegative, or , we have . Therefore, we have
Therefore, in the ensuing derivation, we will focus on the upper-bound
The light field spectrum forms a line segment parallel to , as plotted in Figure 3a.
Propagation to the eye
After leaving the display, the light field propagates and gets refracted by the focus-tunable lens before reaching the eye. Under first-order optics, these operations can be modeled by coordinate transformations of the light field [Hecht, 2002]. Let . After propagating a distance , the output light field is a reparameterization of the input light field and can be represented as
After being refracted by a thin lens with focal length , the output light field right after the lens is
Since and are invertible, we can use the stretch theorem of -dimensional Fourier transform to analyze their effect in the frequency domain. The general stretch theorem states that: Let , be the Fourier transform operator, and be any invertible matrix. We have
where is the Fourier transform of , is the variable in frequency domain, represents determinant of , and . By applying the stretch theorem to and , we can see that propagation and refraction shear the Fourier transform of the light field along and , respectively, as shown in Figure 3c-d.
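The shearing behavior can be checked numerically. The sketch below uses the standard first-order ray-transfer matrices for a ray described by position and slope; taking these as the form of the propagation and refraction operators in this appendix is an assumption, as are the function names and the sign convention. By the stretch theorem, the frequency-domain coordinates are transformed by the inverse transpose of the spatial matrix, and both matrices have unit determinant, so no amplitude scaling appears.

```python
from typing import List

Matrix = List[List[float]]


def propagation(dist: float) -> Matrix:
    """Free-space propagation over `dist`: position is sheared by the slope."""
    return [[1.0, dist], [0.0, 1.0]]


def thin_lens(focal_length: float) -> Matrix:
    """Thin-lens refraction: slope is sheared by the position."""
    return [[1.0, 0.0], [-1.0 / focal_length, 1.0]]


def inv_transpose(m: Matrix) -> Matrix:
    """Inverse transpose of a 2x2 matrix, the map applied to frequency coordinates."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    # The inverse is [[d, -b], [-c, a]] / det; transposing swaps the off-diagonals.
    return [[d / det, -c / det], [-b / det, a / det]]


freq_prop = inv_transpose(propagation(2.0))
freq_lens = inv_transpose(thin_lens(0.5))
print(freq_prop, freq_lens)
```

A shear along one axis in the spatial domain becomes a shear along the other axis in the frequency domain, which is the qualitative behavior the paragraph above describes for propagation and refraction.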
Light Field Incident on the Retina
After reaching the eye, the light field is partially blocked by the pupil, refracted by the lens of the eye, propagated to the retina, and finally integrated over all directions to form an image. The light field reaching the retina can be represented as
and is the diameter of the pupil. To understand the effect of the aperture, we analyze a more general situation where the light field is multiplied with a general function and transformed by an invertible with unit determinant. By the multiplication theorem, we have
where we use a change of variable by setting , and the last equation holds because . Equation (14) relates the effect of the aperture directly to the output light field at the retina: The spectrum of the output light field is the cross correlation between the transformed (refracted and propagated) input spectrum with full aperture and the transformed spectrum of the aperture function. The result is important since it significantly simplifies our analysis, and as a result, we are able to derive an analytical expression of spatial resolution and number of focal planes needed.
In our scenario, we have . For a virtual display at , is a line segment of slope within , where is the magnified pixel pitch. According to Equation (14), is simply the cross correlation of and . After transformation, is a line segment of slope , where . Similarly, is a line segment with slope within . Note that we only consider because the cross-correlation result at the boundary has value . Since function is monotonically decreasing for , the half-maximum spectral bandwidth () must be within the region. Let the depth the eye is focusing at be . We have . When , we can see from the above expression that is a flat segment within , where is the overall magnification caused by the focus-tunable lens and the lens of the eye. From the Fourier slice theorem, we know that the spectrum of the image is simply the slice along . In this case, the aperture has no effect on the final image, since the cross correlation does not extend or reduce the spectrum along , and the final image has the highest spatial resolution .
Suppose the eye does not focus on the virtual display, or . In the case of a full aperture (), the resulting image will be a constant DC term (completely blurred) because the slice along is a delta function at . In the case of a finite aperture diameter , a simple geometric derivation (see Figure 3h) shows that the bandwidth of the -slice of , or equivalently, the region , is bounded by . And we have
Thereby, based on the Fourier slice theorem, the bandwidth of the retinal images is bounded by .