IEEE TRANSACTIONS ON BROADCASTING, VOL. 57, NO. 2, JUNE 2011


Perceptual Issues in Stereoscopic Signal Processing

Scott J. Daly, Member, IEEE, Robert T. Held, and David M. Hoffman

Abstract—Perceiving three-dimensional video imagery appropriately in a display requires matching parameters throughout the imaging pathway, such as inter-aperture distance at the stereoscopic camera side with parallax shifting at the display side. In addition, many tradeoffs and compromises are often made at different points in the imaging pathway, leading to common perceptual distortions. Some of these may be simple two-dimensional image distortions, such as display surface noise, while others are three-dimensional distortions, such as global geometric scene distortions and localized depth errors around edges. There is increasing use of various forms of signal processing to modify the images, whether to compensate for distortions due to system limitations or display constraints, to format and compress for efficient transmission, or to make depth-range adjustments dependent on the display viewing conditions. Perceptual issues are critical to the design of the entire imaging pathway, and this paper will highlight some of those due to stereoscopic signal processing.

Index Terms—Distortion, perceptual, stereoscopic.

I. INTRODUCTION

STEREOSCOPIC 3D image quality is an important area of active research due to its complex, unintuitive interactions with traditional image artifacts and the multiple cues involved in human depth perception. In a recent, provocatively-titled paper ("New, lively and exciting or just artificial, straining, and distracting? A sensory profiling approach to understand mobile 3D audiovisual quality") exploring observer responses on various quality dimensions of stereoscopic 3D video, Strohmeier et al. [1] found that it was difficult to increase image quality¹ by using stereoscopic 3D display (S3D) technology, even when subjects found that dimensionality was increased. Further analysis suggested that visible 2D artifacts detract from stereo quality even more than they do in the conventional 2D (non-stereo) image. The paper performed scene-content analysis by considering features such as the frequency of scene cuts, dynamic activity, and depth range. Of the several videos tested, they found only one video scene where overall quality was increased by stereoscopic display. This scene had a low scene-cut frequency and low depth dynamism (i.e., reduced depth changes over time)².

The study's key trends were anticipated by previous work [2], [3], which also suggested the strong need to decrease all perceived artifacts in order to achieve subjectively high-quality stereoscopic images. For example, [2] similarly found that adding depth does not increase image quality: while there was a large increase in perceived naturalness for 3D compared to 2D images, and a slight increase in overall viewing experience, there was no improvement in image quality. Therefore, stereoscopic depth does not seem to lift overall quality when other artifacts are present. However, this may be a linguistic/understanding issue on the part of the observers, for whom the term 'image quality' seemed to imply only the strictly depth-flattened 2D aspects. Regardless of the interpretation of quality, the work found that the various stereo and non-stereo distortions had a stronger negative pull in stereo images than in non-stereo images (for similar levels of perceived distortion). Thus stereoscopic presentation of images magnifies the quality detriment of certain artifacts. Furthermore, a study by Tam [3] on the overall quality of stereoscopic presentation found that stereoscopic imagery was preferred only when no artifacts were visible³. Assessment of image quality is strongly governed by monocular image quality, such as image sharpness.

These results, showing that stereoscopic presentation itself does not compensate image quality for the presence of other distortions, motivate this paper's focus on the perceptual issues arising within the signal-processing path of stereoscopic systems. It will also discuss attempts to ameliorate some of the distortions using advanced signal-processing techniques. With a digital signal representation, the term mathematical distortion describes errors in the intended code values, not all of which pass through the system to be physically measurable at the display. Of these distortions, those that are measurable are referred to as physical distortions. Finally, due to limitations of the visual system, such as optical blur and neural noise, not all of the physical distortions are visible. Those whose contrast magnitudes are high enough to be detected are referred to as perceptual distortions.

Manuscript received December 31, 2010; revised February 04, 2011; accepted March 03, 2011. Date of publication April 19, 2011; date of current version May 25, 2011. S. J. Daly is with Dolby Laboratories, Vancouver, BC V5M 4X7, Canada (e-mail: [email protected]). R. T. Held is with the Computer Graphics Dept., Soda Hall, Berkeley, CA 94720 USA (e-mail: [email protected]). D. M. Hoffman is with MediaTek USA, San Jose, CA 95134 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBC.2011.2127630

¹ There is a definitional issue with the term 'image' across various imaging-technology specialties. For some, 'image' implies a still image representation, having no motion, such as a photographic print. For others it implies a flat representation with no depth, like a print or traditional TV. However, for many in the field, the term 'image' encompasses all of the dimensions that are visible (e.g., spatial, contrast, color, motion, and depth). This is backed by the Webster dictionary, which for its 1st definition of 'image' includes 'statues', and for its 2nd definition includes 'video'. While the Oxford dictionary is less clear, like the Webster it points to the root of 'image' coming from the Latin imago, meaning to copy appearance, which stems from imitari, meaning to imitate. As early as the 14th century, 'image' was used to refer to reflections in a mirror, which of course contain both motion and depth. In this paper, we will likewise use 'image' to refer to the overall visible appearance, unless otherwise stated (i.e., when referring to usage in specific papers).

² The mobile-device viewing situation has substantial surrounding stereoscopic features due to the ambient scene. Having strong ambient depth leads to increased vergence eye movements as the eye scans from display to surround. Thus the results may not apply to large displays.

³ They primarily tested coding artifacts, but did include blur.


Starting at the last stage of the imaging pipeline, the display, this paper will first describe the key perceptual distortions of crosstalk (Section II), which are due to failures of complete binocular image separation. This technical problem constrains display design and significantly affects cost, form factor, contrast, and brightness. Next, we will address the perceptual geometric distortions arising from capture geometry and viewing conditions (Section III), with an emphasis on viewing distance and viewing angle. To achieve comfortable display conditions, various depth adjustments, ranging from simple parallax shifting to more advanced methods of new-view synthesis, are often used. The algorithms required for view synthesis lead to specific perceptual distortions, which will also be discussed (Section IV). Following a brief section on distortions due to data compression (Section V), we will then address how well-known 2D motion artifacts, such as judder, manifest in stereoscopic displays (Section VI). A short Section VII will then mention some of the key perceptual issues in 2D-to-3D conversion. To wrap up, we will mention some high-level cognitive effects of viewing traditional (i.e., single-surface) stereoscopic displays and depth's curious role in assessments of display quality (Discussion). Visual-system issues relating to discomfort will be covered in another review paper in this issue [4], as will the myriad ways to design three-dimensional displays [5].

II. CROSSTALK SYSTEM ASSESSMENT AND COMPENSATION ALGORITHMS

Crosstalk is a primary concern in developing stereoscopic display systems, and refers to incomplete separation of the left- and right-eye images: some of the image signal intended for the left eye is visible to the right eye, and vice versa. The causes include the extinction ratios of the polarized stereoscopic glasses (passive or active), the temporal responses of various system components in time-multiplexed systems (in the glasses or the display), polarization losses due to scatter, and the intentional crosstalk in autostereoscopic displays used to create a smooth transition between views as the observer's head moves relative to the display [6]. While it is possible to design crosstalk-free displays using independent optical paths to separate displays, i.e., the Wheatstone and Helmholtz approaches [7], such technology does not meet current preferences for a flat-panel display form factor that is visible to a number of viewers. Often, development of displays with low crosstalk requires trade-offs among brightness, cost, complicated eyewear, and contrast. While Wheatstone stereo displays for professional use can have crosstalk below 0.5% [8], the first generation of shutter-glass consumer stereoscopic TVs in early 2010 had crosstalk values as high as 20%. Thus, the crosstalk problem is of high interest today.

A. Basic Perceptual Issues of Crosstalk

While the lack of complete image separation is the cause of crosstalk, a number of factors related to the image content and the underlying display technology determine how visible and objectionable the crosstalk is. At best, crosstalk is perceived as a barely visible halo accompanying edges.


With small disparities and low amplitudes, such as in textures, crosstalk can be perceived only as a blur⁴ [9]. With moderate amplitudes and disparities it appears as tolerable double edges; with higher amplitudes it can be an annoying ghost image; and in the worst case, the double image disturbs stereoscopic fusion and prevents the depth effect. The term 'ghosting' is often used to distinguish the perceptual effect from the physical crosstalk. One of the most important factors is the magnitude of disparity: when the disparity is zero, the left and right images are identical and the crosstalk is imperceptible. Second to disparity magnitude, the largest determinant of the visibility of crosstalk is contrast. Small amounts of crosstalk create a dim ghost image superimposed on the intended image. If this dim ghost image lies over a bright region of the intended image, the ghost image will be low contrast and is unlikely to be visible. Likewise, if the images are heavily textured, the ghost image is masked by the intended image. Furthermore, the inherent contrast of the display itself can help attenuate the visibility of ghost images by reducing their contrast. A key summary experiment was performed [11] on diplopia (double edges) and fusion as a function of signal contrast and disparity. The study used very briefly flashed geometric stimuli to test neural fusion limits (no vergence eye movements were possible), as well as longer-duration stimuli to test overall visual-system stereopsis (vergence eye movements plus neural fusion). This review will generally focus on the applied studies using natural imagery, as opposed to those using test targets.

In addition to the visibility of ghost images, crosstalk can lead to several other problems, including general annoyance, discomfort, and stereoscopic depth breakdown. Crosstalk is measured and specified as the percent of one eye's signal that leaks to the other eye. However, the visibility of a crosstalk signal is generally determined by its contrast, and the crosstalk signal's contrast is affected by the intended eye's signal contrast and by the display contrast. The signal contrast is generally normalized by the display's maximum contrast range. For many years there was not much variability in display contrast (usually 200:1), but in the recent decade display contrasts have risen dramatically, exceeding 5000:1 and varying widely from display to display. Further, ambient light levels have a strong effect on contrast due to screen-surface reflectivity. Unfortunately, nearly all of the papers on this topic lack the display and ambient-level details needed to quantitatively analyze visual-system performance. Thus the range of image attributes, display technologies, and perceptual measures has led to an inconclusive literature on the topic, with crosstalk guidelines that span multiple log units. To illustrate the variance in psychophysical crosstalk measures, we have plotted the crosstalk values (Fig. 1) and image-contrast levels at which various perceptual degradations occur, derived from the key studies. Nevertheless, it is still useful to describe the studies. Huang et al. [12] studied the effect of crosstalk on stereopsis, but not on perceived quality, for natural images. They found that stereopsis was possible even with large crosstalk levels, with values as high as 10% being the upper bound of acceptability for stereopsis. This value is below the fusion limit derived from simple signals (approximately 50%, from their prior work) but much higher than their crosstalk visibility thresholds (0.01%).

⁴ The MTF of a double impulse (i.e., offset lines) is a cosine centered at DC [10], a low-pass filter. Therefore, when the line or edge separation is very small, the blur due to the MTF dominates over the double image.
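Because these contrast effects recur throughout the studies below, it helps to see how they combine. The following is a minimal sketch, not taken from any of the cited papers: it expresses the leaked signal as a Weber contrast against the intended image region, assuming linear-light signals and a simple additive model of display black level and ambient reflections (the function name and default values are ours, for illustration only).

```python
def ghost_weber_contrast(crosstalk, other_eye, intended,
                         l_white=300.0, l_black=0.15, l_ambient=1.0):
    """Weber contrast of a leaked (ghost) signal against the intended
    image region. Signals are normalized to [0, 1]; luminances in cd/m^2.
    l_black models the display's native black level (display contrast),
    and l_ambient the luminance added by screen-surface reflections."""
    span = l_white - l_black
    ghost = crosstalk * other_eye * span          # leaked luminance increment
    background = l_black + l_ambient + intended * span
    return ghost / background

# 3% crosstalk of a white feature is obvious over a dark region but
# weak over a bright one, and ambient light attenuates it further:
print(ghost_weber_contrast(0.03, 1.0, 0.0))   # ~7.8   (dark surround)
print(ghost_weber_contrast(0.03, 1.0, 0.9))   # ~0.033 (bright surround)
```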


Fig. 1. Compilation of key crosstalk studies.

Pastoor [13] studied crosstalk ratios at given signal contrasts and measured crosstalk visibilities as a function of binocular disparity, but was not able to study crosstalk below 1%. Hanazato [14] compared the visibility and quality of ghost edges caused by stereoscopic crosstalk with those simulated in non-stereo images. The ghosting was more annoying in stereo than in non-stereo images at the same level of visibility, and easier to see in geometric images than in natural images. Kooi [15] delved into the discomfort aspects of crosstalk and found that 5% crosstalk led to 'a bit reduced' comfort; however, the viewing durations for these conditions were generally short and did not involve video. Several perceptual aspects of crosstalk, including edge ghosting, double lines, and visual strain, were studied in [16], [17], which also asked whether there was any added value in having crosstalk, collecting ratings for distortion, strain, perceived depth, and naturalness. Naturalness of stereo images was rated higher than that of non-stereo images up to a crosstalk of 10%, suggesting that once the L-R images can be fused in spite of the crosstalk signals⁵, depth perception is robust and the presence of crosstalk does not affect naturalness. Autostereoscopic displays (a 9-view display) were studied for crosstalk [6]; the results indicate a threshold for crosstalk visibility ranging from 2% to 7%, which is much higher than for two-view stereoscopic displays. Crosstalk thresholds and levels of annoyance were studied [18] for a two-view autostereoscopic display having a high level of inherent crosstalk. Crosstalk for moving images, as opposed to static ones, was also studied [19], with moving edges tolerating higher levels of disparity (at a given crosstalk level) before breaking fusion; the study did not test crosstalk levels lower than 5%. Blur effects due to crosstalk were studied [9] using smaller disparities than those causing diplopia. Monocular and binocular viewing of double edges were similar in terms of the just-noticeable blur as a function of disparity and viewing distance. They also studied color-dependent crosstalk, and the results were consistent with a luminance-based depth channel. By using textured computer-generated objects, crosstalk-distortion effects on depth perception were isolated from the usual visibility of multiple edges [20]. This was done by vastly reducing diplopia visibility, testing with textures that have a high degree of masking (also note that more depth cues were available with the textured imagery than is common with simple object edges). The observer's task was to indicate whether a shape was convex or concave, and the results showed shape detectability at very high levels of crosstalk.

⁵ e.g., 10% was the image stereopsis level from [12].

B. Crosstalk Compensation Algorithms

Due to the difficulty of completely eliminating crosstalk with standard hardware, various signal-processing techniques are being developed [21]–[25]. These techniques have proven particularly valuable in hardware approaches using synchronized, alternating viewpoints. Methods using liquid crystal displays, with their slow temporal response, can be improved for field-sequential stereo display through the use of overdrive algorithms⁶ that effectively speed up the temporal response. Another approach, known as L-R matrix compensation, anticipates and removes the ghost images from the images sent to the screen. This approach was first proposed by Lipscomb [21]: if the crosstalk, with leakage fraction c, can be represented simply as

    L_d = L + c·R
    R_d = R + c·L,

then a pre-corrected data signal⁷,

    L' = (L − c·R) / (1 − c²)
    R' = (R − c·L) / (1 − c²),

can be sent to the display to eliminate the crosstalk. Then,

    L_d = L' + c·R' = L
    R_d = R' + c·L' = R.

⁶ Originally developed to compensate motion blur in non-stereo displays.

⁷ The 1/(1 − c²) factor can often be ignored, as the crosstalk value c is usually < 0.1 (i.e., 10%), but we leave it in for completeness.
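As a minimal sketch of this matrix compensation (our own illustration, assuming a single scalar crosstalk fraction and linear-light pixel values; a real system would use per-channel, measured terms):

```python
import numpy as np

def precorrect(left, right, c):
    """L-R matrix crosstalk compensation per the model above.
    left, right: linear-light images as float arrays in [0, 1];
    c: scalar crosstalk fraction (e.g., 0.05 for 5% crosstalk)."""
    lp = (left - c * right) / (1.0 - c * c)
    rp = (right - c * left) / (1.0 - c * c)
    # Negative values arise where a bright region of one eye's image
    # falls on a dark region of the other's; a display cannot emit
    # negative light, so clipping leaves residual, uncorrected
    # ghosting (the limitation discussed next).
    return np.clip(lp, 0.0, 1.0), np.clip(rp, 0.0, 1.0)
```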


Thus, the unwanted crosstalk signal (caused by the display process) is pre-subtracted from the digital image before it reaches the final display process, yielding a crosstalk-free displayed image. However, if the image contrast is high enough, the subtraction drives the pre-corrected signal negative, which a physical display cannot reproduce, so complete crosstalk correction cannot be achieved. This manifests itself when bright regions of one image fall in the dark regions of the other image⁸. The footroom approach [22], [23] overcomes this issue by adding a DC component to each image (effectively reducing contrast) so that there is a non-zero black level from which the other image's contribution can be subtracted. Sometimes the necessary correction signal does not align with the available pixel locations, an effect lessened by image-capture blur.

⁸ There is not a corresponding problem at the bright end, since the off-diagonal correction elements are only negative.

Barkowski [24] used pre-measured crosstalk levels to re-evaluate the above-mentioned crosstalk-compensation technique. He studied parameter choices such as minimum-luminance adjustment (raising the black level to allow negative values to be subtracted) versus reducing the overall contrast (lowering the maximum and raising the minimum luminance) to simply lower the source image's contribution to the crosstalk signal. Raising the black level was preferred over overall contrast reduction (i.e., it is preferable to leave the bright parts of the image unchanged). Raising the black level also reduces the contrast, but the perceived contrast reduction depends strongly on the ambient light levels. For some autostereoscopic display technologies, they found that crosstalk is needed to keep transitions smooth across views, and the compensation algorithm reduced that desired smoothness. Also, applying this crosstalk-compensation technique too aggressively can lead to false edges, which can also be disturbing. Kerofsky [25] broke the crosstalk down into its sources, including the liquid-crystal (LC) temporal response and the stereo-glasses extinction ratio. They also considered the second-order effects of crosstalk correction, in that the pre-correction signal is itself subject to crosstalk; accounting for this

gives a more accurate crosstalk-correction signal, which is important for systems with higher levels of crosstalk. They also designed an adaptive crosstalk-cancellation algorithm, using preprocessing to reduce the bright regions or lift the black regions, along with compensation signals derived from the standard crosstalk matrix. In one version, a two-dimensional co-sited histogram is used to assess the total distortion introduced by the pre-processing, including footroom elevation, maximum-signal reduction, the crosstalk-correction signal, and the remaining crosstalk.
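For the scalar model above, the required footroom can be derived directly (our own derivation, for illustration; published systems choose the offset from measured crosstalk and viewing conditions): an offset equal to the crosstalk fraction c is just enough to keep the pre-corrected signal non-negative.

```python
def add_footroom(img, c):
    """Remap [0, 1] into [c, 1] before pre-correction. The worst case
    is a black pixel in this eye against a white pixel in the other:
    (c - c*1) / (1 - c**2) = 0, so no clipping is needed. The cost is
    a raised black level, i.e., reduced contrast."""
    return c + (1.0 - c) * img
```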


Thus, to maintain image contrast, non-linear "ghostbusting" techniques are being developed to diminish perceptual ghosting.

III. GEOMETRIC DISTORTION DUE TO CAPTURE, DISPLAY, AND VIEWING CONDITIONS

The ideal stereoscopic presentation method is the orthostereoscopic approach, in which the eyes are located at the centers of projection of the corresponding images, and the eye positions necessary to fixate on objects in the scene match those that would occur in direct viewing⁹. If these conditions are met, the retinal images from viewing the display and from viewing the original 3D scene are identical. Typically, this scenario is not possible, so it is important to consider what happens when there is a mismatch between the camera geometry and the viewing configuration. Stereo images are unique in that the viewer's position relative to the display affects the 3D percept. If one understands all the image-acquisition, display, and viewing-position variables affecting this percept, misperceptions can be predicted and minimized. Here we first explain why misperceptions are less problematic for non-stereo images, and then discuss the sources of misperceptions for stereo images.

⁹ This is achieved when the camera inter-aperture distance matches the viewer's inter-pupillary distance, the displayed image objects match the angular extent of the real objects, and the distance of the camera to the object matches the distance from the viewer to the display.

A. Perception of Conventional (Non-Stereo) Images

A strength of conventional (non-stereo) images is that they convey the same information about the scene's 3D geometry to multiple viewers at different locations relative to the display. The lack of misperceptions is due to compensation processes performed by the visual system. To explain, first consider "completely correct" viewing of a non-stereo image, where the image delivers the same rays of light to the retina as if the viewer were at the original scene. This can be done by placing the eye at the center of projection (CoP) of the image. Under this condition, one can expect the viewer to correctly interpret the image's contents, including shapes and relative sizes. If one moves away from the CoP, the rays of light delivered to the retina specify a different geometry, so one might expect the percept to change. For instance, a circle in the image will project to an ellipse on the retina. But the observer will still perceive the shape as a circle, as humans show a robust ability to see the scene as intended regardless of viewing position. This consistent interpretation of the image results from the visual system's ability to compensate for an oblique viewing position by estimating the slant of the display surface from its disparities and the orientation of its frame. The image content is interpreted with a correction for the perspective foreshortening, and the percept is the same as for a viewer whose eye is at the CoP [26]. The ability to compensate for off-axis viewing is critical to the success of images meant to be enjoyed by multiple viewers, whether in the cinema or a home theater.

B. Perception of Stereo Images


Fig. 2. Ray intersections. Plan view of ray-intersection approach to the perception of stereoscopic content. A pair of corresponding points on the left and right images represents a single point in space. To predict where the point is perceived, rays are projected from the eyes through the corresponding points. The intersection of rays is assumed to be the perceived location.

For stereo images, the prevailing thinking and standard model in the stereo-cinema literature assumes that the visual system directly interprets the pattern of disparities on the retinas with no compensation [27]. In other words, no other depth cues are considered when predicting 3D percepts from stereo displays, even though such cues are certainly present both in the image content and on the display surface. Although further research is needed to validate the assumption, the major implication is that disparities depend strongly on a viewer's position relative to the display, and these distorted disparities will yield distorted percepts. Here we briefly summarize the standard model for misperceptions and discuss the influence of image-acquisition, display, and viewing parameters on the expected percept.

When a viewer observes a stereo pair consisting of left and right images, corresponding points in the images are associated with one 3D point in image space. The standard model for predicting the perceived location of that 3D point uses a ray-intersection algorithm, as shown in Fig. 2. Rays are projected from the centers of the eyes through the corresponding points in the images. The intersection of the rays is taken to be the perceived location of the specified point. Assembling the intersection points for all the corresponding points in the stereo pair constitutes the entire 3D percept.

As mentioned earlier, one may define "correct" viewing of stereo images as the condition where the disparities delivered to the eyes, and the vergence eye positions necessary to fuse those disparate points, are identical to those produced by a direct view of the original scene. To meet that condition, several image-acquisition, display, and viewing parameters must be carefully set. Image-acquisition parameters include focal length (affecting field of view), baseline (stereo-camera separation), and camera orientation (whether the cameras are parallel or toed-in). The display parameters are display size and stereo-image lateral offset (parallax). For our discussion, we will assume there is one display surface, though vision researchers sometimes use a separate display for each eye. The primary viewing parameters are the locations of the two eyes relative to the display. The variables listed above interact in nonlinear and often unintuitive ways. We provide a qualitative analysis of their effects on the predicted 3D percept; more detailed treatments exist in [27]–[29].
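A minimal numerical sketch of this prediction (our own illustration; the function and variable names are ours): given the eye positions and the corresponding points on the display plane, it returns the midpoint of the shortest segment between the two rays, which equals the intersection when the rays are coplanar.

```python
import numpy as np

def perceived_point(eye_l, eye_r, pt_l, pt_r):
    """Ray-intersection prediction for one corresponding point pair.
    eye_l, eye_r: 3D eye positions; pt_l, pt_r: 3D positions of the
    corresponding points on the display surface (all in meters).
    Vertical disparity makes the rays skew; the midpoint returned here
    then has no exact-intersection interpretation (see text)."""
    d1, d2 = pt_l - eye_l, pt_r - eye_r        # ray directions
    n = np.cross(d1, d2)                        # zero if rays are parallel
    b = eye_r - eye_l
    t1 = np.dot(np.cross(b, d2), n) / np.dot(n, n)
    t2 = np.dot(np.cross(b, d1), n) / np.dot(n, n)
    p1, p2 = eye_l + t1 * d1, eye_r + t2 * d2   # closest points on each ray
    return (p1 + p2) / 2.0
```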

Fig. 3. Example Misperceptions. Plan-view illustrations of predicted 3D percepts for several image-acquisition, display, and viewing parameters. The viewer, stereo cameras, display surface, original 3D object, and perceived 3D object are represented for each condition. Thin red lines are the camera’s optical axes, the gray square is the real object that is captured, the cyan polygon is the perceived object geometry, and the green line is the display surface. The cube is composed of gridlines, so all sides are visible to the camera and observer, even if obliquely. All parameters are set correctly in panel (e), so the perceived object matches the original object. (a) Stereo half-images are presented too far apart. (g) Stereo images are presented too close together. (b) Viewer is too far away from the display. (h) Viewer is too close to the display. (c) Camera baseline is too small. (i) Camera baseline is too large. (d) Viewer is to the left of display. (f) Viewer is to the right of display.

C. Perceptual Effects of Image-Acquisition Settings

For depictions of most of the following distortions, refer to Fig. 3. Panel (e) shows the case where all image-acquisition, display, and viewing parameters are set correctly: the depicted object (a cube) appears at the correct size and location. The modifications below change that appearance.

Camera Baseline: Larger camera baselines produce larger on-screen disparities. As seen in Fig. 3(i), for parallel camera bodies this causes the 3D scene to shrink and move closer to the observer¹⁰. Conversely, decreasing the baseline decreases the disparities, which expands the scene and moves it farther away (Fig. 3(c)). As stated above, these effects apply to parallel cameras, where the object of interest is much closer than the cameras' point of convergence, which is at infinity. When converging camera bodies are used and the point of convergence is closer than the object, increasing or decreasing the baseline has the opposite effect; the scene moves farther or closer, respectively.

¹⁰ Sometimes there is an assumption that the objects are far in front of the point of convergence. Thus, increasing the camera baseline moves the near object much closer. This is a stretching of depth, which shrinks near objects.
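For points on the horizontal plane through the eyes, the ray-intersection model reduces to a similar-triangles expression, which makes the baseline, display-size, and viewing-distance effects discussed here easy to check numerically (a standard geometric result; the function and default values are our illustration):

```python
def perceived_distance(view_dist, parallax, interocular=0.063):
    """Distance (m) to the ray intersection for on-screen parallax
    `parallax` (m; positive = uncrossed), viewing distance `view_dist`
    (m), and eye separation `interocular` (m; 63 mm is typical).
    Zero parallax places the point on the screen; parallax approaching
    the interocular pushes it to infinity; negative (crossed) parallax
    brings it in front of the screen."""
    e = interocular
    return view_dist * e / (e - parallax)

# Doubling the display size doubles the parallax, but not the depth:
print(perceived_distance(2.0, 0.01))   # ~2.38 m
print(perceived_distance(2.0, 0.02))   # ~2.93 m
```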


Camera Orientation: When stereo images are presented on a single display surface, the only way to produce geometrically correct percepts is to use parallel camera bodies; in particular, the camera sensors and lenses should be parallel. It can be tempting to toe-in the camera bodies so that their optical axes converge and objects at the point of convergence have zero disparity and appear at the surface of the display. However, this practice produces the keystone effect, where objects that should project to rectangles on-screen instead project to trapezoids. It also introduces unnatural vertical and horizontal disparities into the images. These disparities cause a breakdown of the ray-intersection model. Specifically, the on-screen vertical disparities cause some of the rays projecting from the eyes through disparate points to become skew. No intersection point can be calculated for those ray pairs, so the model cannot produce a solution for the predicted percept (though a cohesive percept is observed). The perceptual effects of these skew rays remain unclear and research is ongoing [29]. To control the convergence of the cameras' optical axes, the superior approach is to keep the cameras parallel to each other and laterally offset the sensors relative to the lenses, or to use a cropped portion of the image. This eliminates the keystone distortion and the erroneous disparities. Some observer data are available on keystoning tolerances in stereoscopic displays [30], [31], but not across enough key parameters, such as display field of view and viewing distance. A related geometric model [28] focuses more on image-capture issues, such as object distances, lens focal length, and sensor convergence.

Display Size: Increasing the pixel pitch by displaying the same image data on a larger display magnifies the disparities. Thus, features behind the display surface move farther away, and features in front of the display move closer to the observer (not shown in the figure).

Stereo Image Offset: Moving the stereo images farther apart increases the disparities. The scene appears to move farther away and expand. The opposite occurs when the images are moved closer together. Also, as seen in Fig. 3(a) and (g), the depth is not scaled uniformly with angular size, which changes the perceived shape, i.e., a trapezoidal distortion of the cube in the depth direction.

Viewer Distance: When a viewer is too close to a stereo display, the intersections of the rays emanating from the eyes and passing through the corresponding points in the images move closer to the display surface. As a result, the scene is compressed in depth (Fig. 3(h)). Viewing a stereo display from too far away produces the opposite effect: the scene is expanded in depth (Fig. 3(b)), and its mean distance is pushed backward.

Viewer Translated Left/Right: As a viewer translates horizontally relative to the display surface, the 3D scene shears. Objects in front of the display surface follow the viewer, while objects behind the surface appear to move away from the viewer (Fig. 3(d), (f)).

Other Viewer Orientations: We have only considered translations of the viewer relative to the display surface. Natural viewing may also include head rotation in the form of pitch, yaw, and roll. These movements directly translate into pitch, yaw, and roll of the interocular axis.


Pitch is not expected to produce misperceptions, but yaw and roll produce vertical disparities that sometimes make solutions to the ray-intersection algorithm impossible [29]. Similar to the case of converging camera bodies, many of the rays that emanate from the eyes and pass through the disparate points no longer intersect. Interestingly, these situations, up to a limit, still produce cohesive 3D percepts, though it remains unclear what stereoscopic depth the visual system estimates [29]. More research will be needed to fully characterize misperceptions for viewing situations that produce skew rays.

As evidenced by the various misperceptions outlined above, stereoscopic viewing is very sensitive to capture, display, and viewing variables. It should also be stressed that the distortions from one parameter cannot easily be counteracted by adjusting other parameters; their relationships are complicated and nonlinear. Software is available to simulate these misperceptions [32]. Among the other situations in which these issues arise is reconverging the images to avoid vergence-accommodation conflicts [33]. The simplest algorithm for such corrections is to horizontally shift the L- and R-images to the left and right, respectively, to move the simulated distance away from the observer, and to shift the images in the opposite directions to bring it closer. This is often referred to as parallax adjustment (display-side or post-capture). Fig. 3(a) shows that such a technique results in an asymmetric stretching, where a cube becomes trapezoidal in depth, assuming one can see all of its sides. Questions arise as to the actual visibility thresholds for the depth compression described, as well as tolerances and even preferences for artistic distortion.

D. Puppet-Theater Effect

A well-known geometric distortion is the puppet-theater effect, where the human characters in the rendered 3D scene typically look overly small, and hence like animated puppets. Significant debate remains over the source of the effect. Early work in the area was performed by MacAdams [34], who analyzed geometric distortions, found that the shooting distance from camera to object had an effect on reproduction size, and showed how the relative magnification between foreground and background regions within an image can differ, causing object distortions. He mentioned a link to the puppet-theater effect but did not study it. Yamanoue [35] continued MacAdams's effort by analyzing deviations from orthostereoscopic presentation. A derived geometric model of a converging-camera setup was tested through subjective studies, with the parallel-camera case as a special case of the converging-camera model. With parallel capture, a magnification mismatch (typically foreground magnification < background magnification) from scene to display does not occur. In such a case, while there may be a deviation from orthostereoscopic presentation, all depths undergo the same magnification factor (whether greater or less than one, relative to the original scene), so the overall foreground scale equals the background scale. However, for converging-lens (toed-in) image capture, the magnifications of foreground and background scale differently (here the foreground and background are defined by the convergence point). The puppet-theater effect was proposed to occur when the foreground magnification is less than the background magnification.


A subjective study was run in which cameras were converged on a test object (a manikin in a hallway). Observers were asked to assess the size of the manikin using a Likert scale with options ranging from "small" to "normal size" to "large." The displayed size of the object matched the physical size of the object, so it was possible to rule out the effect being a simple matter of reduced angular size (as occurs frequently with 2D TV) combined with 3D perception. The reported sizes matched their predictions. Based on the results, an algorithm to predict puppet-theater distortions was developed [36] using Fourier techniques.

Unfortunately, there are a number of confounding issues with the study. First, the puppet-theater effect is typically a global effect, not limited to the subject of a scene. Also, the conditions predicted to produce small ratios of foreground to background magnification also employed unnaturally large camera baselines. Spacing the cameras too far apart results in large disparities that are only consistent with real-life close viewing of small objects. So even if the original scene was large, if its disparities are consistent with close viewing, it could be perceived as small, regardless of whether parallel or converging cameras were employed. This notion is similar to the "tilt-shift effect" in photography, where the addition of blur to an image of a large scene makes its content appear miniaturized [37]. The unnaturally large disparities in the puppet-theater effect are analogous to the unnaturally large blur in a tilt-shift image. In fact, even the geometries of blur and disparity are analogous [37]. Therefore, it is possible that the camera separations, and not the relative magnifications of the foreground and background, were the cause of the effect in [35]. Also not studied were interactions with lighting and other factors, such as focus (which has miniaturization implications due to the tilt-shift effect). Familiarity with the scene contents could make the effect more or less noticeable, so there are signal-known-exactly (SKE) aspects to the problem. As it stands, more research must be done to understand the root causes of the effect. Regardless of its causes, familiarity with stereoscopic viewing may eventually lessen the puppet-theater effect, or at least make it less annoying. For example, with the widespread introduction of non-stereo TV in the 1950s, it was initially not uncommon to hear and read comments about the humorously incorrect sizes of people on TV, yet those growing up with TV do not seem to notice that effect.

E. Cardboard-Cutout Effect

Another familiar geometric distortion is the "cardboard-cutout effect." It is characterized by a loss of the roundedness of objects; they look flattened, as if they were standing cardboard cutouts set at different depths within the displayed 3D scene. The problem is particularly prevalent in low-quality 2D-to-3D conversions, in which only a coarse depth map is available. To date, there is no comprehensive explanation of the sources of the effect. Yamanoue studied the effect [35], [38], combining a geometric model with subjective studies. The model predicted that the cardboard effect would occur when lateral magnification exceeded depth magnification (going from the scene to the display). This rule was considered applicable for both parallel and toed-in capture methods.


Factors likely to cause these ratios to deviate included the use of telephoto lenses (this is known to occur when such lenses are used even for 2D capture of scenes, though it generally requires more observant viewers than for 3D). The effect was studied strictly as a function of binocular-parallax conditions (that is, they tried to remove all other factors). Also studied were the angular extent of the lens (i.e., of the captured scene) and lighting conditions, but here they found only minor effects. As with the puppet-theater effect, the accommodation-vergence mismatch (which leads to comfort issues) did not affect the perception of the effect.

However, there are important issues with the two studies. The standard notion of the cardboard-cutout effect is that there is depth compression within objects, but not between objects. Yet the experiments used only one object as the test stimulus, so it is impossible to know whether they tested overall scene compression or the cardboard-cutout effect¹¹. In the 2000 study, a 3D background was included in some experiments as a viewing condition, but the relative compression of the stimulus compared to the overall scene was not recorded. A geometric model predicting the cardboard-cutout effect [36] also suffered from a similar problem: that system operated on the assumption that scene-wide depth compression was directly associated with the effect, which misses the unequal compression within and between objects. For this reason, focusing exclusively on geometric distortions as a cause of the effect seems limiting. Though it is only conjecture, it is possible that other factors, including focal cues, are involved. Size anomalies were also studied [39], as was the role of lighting [38], such as the known effect that flash photography nearly always causes the cardboard effect. An algorithm designed to reduce the cardboard-cutout effect as caused by signal processing due to formatting (2D+depth) was presented [40]. They studied reducing the perceived cardboard-cutout effect by blurring the depth map and then producing a new L-R pair from the new depth map. This technique reduced the perception of the cardboard-cutout effect and had no negative effect on visual comfort, but significantly degraded the 3D shape content.

¹¹ A stimulus of a tree was used for determining thickness. There is a well-known visual phenomenon that often causes trees in real life to appear wider than they are deep, even if they are radially symmetric. Therefore, the choice of stimulus already had some built-in perceived depth compression.

IV. PARALLAX AND DEPTH ADJUSTMENT ALGORITHMS AND PERCEPTUAL CONSEQUENCES

Horizontal shifting of the left and right image pair on a stereoscopic display is a common depth-adjustment tool, used to remaster depth for different display sizes and viewing distances so that the depth imagery is near the screen and comfortable [33]. The main drawbacks of simple parallax shifting are that it loses image area, due to the cropping at the screen edges necessary to eliminate unpaired points, and that it can be uncomfortable for photography shot with converging cameras¹². A minimal sketch of this operation follows below.

¹² Depth adjustments are done to keep the vergence within the depth-of-focus range of the display surface (i.e., no accommodative conflict) or to adjust the depth range for preference (some prefer the depth to jump out of the screen, while some prefer the depth to stay entirely behind the screen).
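The sketch just mentioned (our own illustration, with images as NumPy arrays indexed [row, column]) shows the basic shift-and-crop operation; the column loss at the edges is the image-area cost described above.

```python
import numpy as np

def parallax_shift(left, right, s):
    """Display-side parallax adjustment. s > 0 shifts the left image
    left and the right image right by s px each, adding 2*s px of
    uncrossed disparity (scene pushed away); s < 0 brings it closer.
    2*|s| columns are cropped to remove unpaired points at the edges."""
    s = int(s)
    if s == 0:
        return left, right
    if s > 0:
        return left[:, 2 * s:], right[:, : -2 * s]
    return left[:, : 2 * s], right[:, -2 * s:]
```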


An integrated approach [34] uses a depth map and a simple comfort model (allowing a maximum of 1-2 deg of parallax in the crossed direction) to guide horizontal shifting of the L and R images, in order to keep the stereo image's disparities, and the consequent vergences, within the comfort region. However, pushing the imagery backward (moving the images' CoPs closer together) to avoid uncomfortable screen-side vergences may cause viewer exotropia (too high an uncrossed disparity), especially in theater environments. Further problems include the geometric distortion of asymmetric stretching described in Section III-C, as well as problems with images captured using a convergent lens setup.

More advanced depth adjustments include rendering a CG scene directly from Z-buffer information [42] (as can occur in PC-gaming applications) and, when that flexibility is not available, re-rendering by synthesizing a new left and/or right eye view with the desired depth properties (the inputs can be a left-right view pair, a left-right view pair plus a depth map, a single view plus a depth map, etc.). The steps in synthesizing a new left-right image pair can include depth-map generation, occluded-region determination, disoccluded-region filling techniques such as in-painting, and geometric warping [43]. All of these techniques lead to various distortions, which may be noticeable and annoying.

A. New-View Synthesis Algorithms

Advanced depth-range-adjusting algorithms scale the range of the depth map, and may also reposition its mean distance, to fit the displayed depth range within a comfort zone determined by basic accommodation-vergence visual data [44]. Once a new depth map is generated¹³, either one eye's view can be generated from the combination of the other view and the depth map, or sometimes a new left and right image are generated from the input images, a process generally referred to as depth-image-based rendering (DIBR). This was originally motivated by autostereoscopic applications and by approaches that format the stereo image as a single image with a depth map. Meijers [45] has an early version of this approach that is often evaluated for benchmarking. Some of the sub-stages include depth-map smoothing and horizontal interpolation for disoccluded regions. Depth-map smoothing is generally needed since depth-map generation techniques are easily perturbed by image noise, the vagaries of false correspondences, and pixel-scale computational tradeoffs [46]. Areas that are occluded by foreground objects in both original L-R view images may be revealed (disoccluded) in the desired new view. There are no corresponding pixels to simply map from either of the input images (or from the single image and depth map, etc.), and these must be filled in using a variety of techniques, with horizontal interpolation being the simplest (a small sketch of this warp-and-fill process follows below). The major perceptual errors from such techniques involve false depths (primarily holes), non-matching textures across the eyes in the filled-in disoccluded regions (i.e., after local fusion), depth-edge blur, and edge shearing.

¹³ In many approaches, the depth map needs to be estimated from the existing L-R image pairs, as no depth map is made available. Obtaining a depth map from a stereo pair is still a difficult problem, and solutions often involve numerous heuristics. One particularly difficult image feature is the repeating pattern, which can lead to false correspondences.
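The following deliberately small sketch (our own illustration: grayscale, scanline-based, and assuming that larger disparity means nearer) shows the warp-and-fill core of DIBR with the simplest disocclusion filler, horizontal interpolation:

```python
import numpy as np

def render_view(image, disparity):
    """Toy DIBR: forward-warp pixels by a per-pixel horizontal disparity
    (in px; positive moves pixels right), resolving collisions in favor
    of nearer pixels, then fill disoccluded holes by horizontal
    interpolation from the nearest valid neighbors on each scanline."""
    h, w = image.shape
    out = np.full((h, w), np.nan)
    nearest = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            xn = int(round(x + disparity[y, x]))
            if 0 <= xn < w and disparity[y, x] > nearest[y, xn]:
                out[y, xn] = image[y, x]
                nearest[y, xn] = disparity[y, x]
    for y in range(h):                        # simplest hole filling
        holes = np.isnan(out[y])
        if holes.any() and not holes.all():
            valid = np.flatnonzero(~holes)
            out[y, holes] = np.interp(np.flatnonzero(holes),
                                      valid, out[y, valid])
    return out
```

The interpolated fills are exactly where the false-depth, texture-mismatch, and edge-shearing artifacts listed above originate.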


Non-matching textures are particularly problematic since they lead to inter-eye rivalry effects, giving a false transparency¹⁴.

¹⁴ The importance of removing rivalry artifacts is hinted at by the common usage of the adjectives "solid" or "stable" to describe high-quality stereoscopic systems. Transparency artifacts result from pixel noise and distortions that allow multiple false correspondences (rivalries), giving multiple depth planes at a single position. This is most likely to occur in textures with pseudo-repeating patterns. Such transparency percepts can also result from crosstalk.

B. Advanced View-Synthesis Algorithms

One advanced new-view-synthesis algorithm [47] simultaneously solves for depth and color using a graph-cut-based optimizer along with fast optimization of texture priors. Another advanced approach [46] takes a stereo image and estimates disparity maps, completes the disparities in occlusion regions using a segmentation approach, mutually changes the viewpoint of both images through 3D image warping, and completes the remaining unknown regions with depth-assisted texture synthesis. Segmentation is generally avoided because its layering tendency (i.e., grouping errors into a single plane) is expected to lead to cardboard-cutout distortions.

C. Temporal View-Synthesis Algorithms

One of the most vexing problems, disocclusion, occurs when an occluded object becomes visible in a new viewpoint. For video imagery, this problem can be ameliorated by using preceding or following frames to fill in the occluded regions. Of course, this requires either the camera or the object to move enough to reveal the needed data. One approach [48] accomplished "temporal inpainting" by using motion estimation. In addition, multiple possible reconstructions are blended using a mixing function. The method can use up to 50 frames on each side of the frame needing localized inpainting, so it is limited to off-line applications. While temporally stable, its main artifact is that the disoccluded region is generally low-pass filtered relative to its ground-truth image. The algorithm almost never obtains a ground-truth restoration, yet it never makes large visible errors. This shows the need for more perceptual work in understanding the thresholds and tolerances of such shape distortions around object edges. There is an expectation that spatial 2D masking at depth edges contributes to the low visibility of texture loss in the inpainted regions, as they are usually very close to depth edges, and depth edges usually coincide with luminance edges [49]. There may be pure depth masking as well.

Another new-view-synthesis approach [50] requires a depth map and starts with a depth-map refinement step that does discontinuity-preserving smoothing via boundary-edge detection and boundary-edge extension (to reduce noise and broken edge regions within depth-defined objects). The disocclusion area is calculated based on the depth map, the positions of the views, and the preceding frames, under the assumption that the objects are moving. The algorithm leverages the depth maps of preceding frames and does not use motion estimation. It relies on the typically correct assumption that the revealed side of a disocclusion area is background; thus the next step of the algorithm is to look back in depth for matching image blocks.


If the non-pixel areas are not in occlusion areas (e.g., areas caused by noise), interpolation is used. The paper provides illustrations of the perceptual problems of two benchmark methods: [45], giving horizontally striated distortions, and [51], giving depth-shape distortions like shearing around edges.

D. Perceptual Studies of New View-Synthesis Artifacts

Perceptual studies of the related artifacts generally exist only for geometric stimuli (such as depth-hole visibility resolution); there is not as much work using natural or full-scene imagery. Some studies of disocclusion exist in PSNR terms for compression applications [52]. Tam & Alain [53] performed a subjective study of smoothing depth ramps for hiding occlusion artifacts such as holes, blockiness, or noise. While the more advanced techniques are capable of in-painting textures into the disoccluded areas, occluded shapes that are more salient, such as hidden graphic elements (e.g., text), are currently unsolved and can be visually disturbing. More studies on the types of signal content in the disoccluded regions are needed. There are also studies from basic vision science showing that merely introducing occlusions from one eye to the other causes depth sensations [49].

V. DATA COMPRESSION AND FORMATTING FOR TRANSMISSION

The multiple images required for stereoscopic video put extra demands on formatting and compression technologies, and several aspects of human-visual-system processing are taken into account. Most efforts build on the longstanding knowledge that the depth signal in human vision has much lower bandwidth than the luminance spatial channel. Tyler has done key work in finding these bandwidth limits for both spatial [54] and temporal [55] frequencies. For example, the highest depth frequency¹⁵ that can be seen is approximately 3-4 cy/deg, well below the 30-50 cy/deg for spatial luminance. The optics of the eye are not the source of this limit [56], as is generally the case for spatial luminance vision [57]; rather, it is expected that disparity detection is limited by the neural operators for correspondence matching, which have been found to occur in region V2 [58] and are organized at a much lower resolution than the luminance spatial operators in region V1 of the visual cortex [59]. Allemark [60] has recently run experiments testing certain correspondence theories by comparing sine-wave and square-wave stereo resolution, and found that sines and squares behave similarly.

One of the main approaches to stereo compression is mixed-resolution coding [61], where one image of the L-R pair is reduced in resolution, spatially, temporally, or both, and this modified image pair is input to existing compression techniques (see the sketch below). One issue is which eye's viewpoint should receive the lower resolution; concerns about the need to match each individual's dominant eye have been alleviated by findings [62], [63] that binocular perception is dominated by the higher-quality component, independent of eye dominance, though there are still some inconsistencies across experimental results, such as findings of problems with asymmetric coding [64].

¹⁵ Depth frequency is the inverse of the spatial distance separating peaks of modulation in the depth direction (periodic variations in a corrugated surface).
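A minimal sketch of the mixed-resolution idea (our own illustration; the downsampling factor and block-average filter are arbitrary choices): the pair is asymmetrically pre-processed and then handed, per view, to an unmodified 2D encoder.

```python
import numpy as np

def downsample2x(view):
    """Halve resolution by 2x2 block averaging (H, W assumed even)."""
    h, w = view.shape
    return view.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def mixed_resolution_pair(left, right, low_res_eye="right"):
    """Reduce one view's spatial resolution before compression; the
    decoder upsamples it back for display. Findings [62], [63] suggest
    the choice of eye matters little, since binocular perception
    follows the higher-quality component."""
    if low_res_eye == "right":
        return left, downsample2x(right)
    return downsample2x(left), right
```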


The topic of compression will be covered more thoroughly in another paper in this special issue [65], so we will focus on a few perceptual studies related to resolution and compression. Berthold [66] found that stereo images with a given blur were perceived as sharper than 2D images with the same blur, suggesting there is a depth-signal input to edge perception and, hence, to sharpness perception. A similar result, but involving JPEG asymmetries, which give a DCT-based distortion instead of reduced spatial resolution¹⁶, was found in [67]: quality that is unacceptable at a certain compression rate for 2D can be acceptable for a 3D version of the image.

¹⁶ While most techniques of quantizing the DCT coefficients for compression lead to edge and blocking artifacts as opposed to blur, at higher compression rates and with very adaptive algorithms, low-amplitude texture blur can be common.

Compression is generally a noise-adding process, and we have several expectations for when this occurs in a stereo-imaging system. If the added noise were identical for each image of the pair, in amplitude and phase, we would expect such correlated noise to be perceived at a single depth plane, generally at the display surface¹⁷, independent of the image content, appearing like a curtain of noise. However, for compression of images having capture noise, we expect the noise to be mostly independent across the image pair. Such uncorrelated noise would be perceived as floating throughout the whole volume of the scene. The actual outcome depends on whether the visual system can solve the correspondence problem on the noise features and fuse the noise. Because the coding error is independent across the eyes and is generally of high spatial and temporal frequency, the potential false correspondences due to distortions are seldom fused, and the distortion does not contribute to the perceived depth signal. Further, its binocular sum, as a 2D signal, is of low perceptible amplitude. Perceptual findings like these motivate a related approach, known as asymmetric coding, which uses similar resolutions for the two viewpoints but compresses one much more strongly than the other. Of course, combinations of mixed resolution and asymmetric coding may be fruitful.

¹⁷ Parallax shifting for depth adjustment could cause the noise to have a different depth than the display surface.

Stelmach and Tam [68] studied quality for mixed-resolution and asymmetric coding of video using MPEG-2 compression. They found that with mixed resolution, the quality is approximately the average of the two eyes'. In a later study of overall quality [69], further probing spatio-temporal resolution asymmetries, they found a negligible effect on depth, and that sharpness was biased toward the eye with greater spatial resolution. They also investigated temporal interleaving of asymmetric coding. Differences between the component effects of compression have been found [70] for asymmetric distortions: for blur, the binocular percept was closer to the higher-quality eye's image, but for blocking, it was closer to the lower-quality eye's. A study of the overall image quality of asymmetric and symmetric coding, in which stereo 3D images were also compared to 2D images, was done for JPEG compression [71]. Using two still images captured with a converged camera setup, they found that even at low distortion levels depth had no discernible effect on quality, so 3D added no boost over 2D in quality. Conversely, depth was not affected by coding quality,


The quality was slightly below the average of the two eyes when one eye's quality was very low, and toward the higher of the two qualities when the quality in both eyes was already high. Surprisingly, there was no significant difference between asymmetric presentations depending on eye dominance. The 3D quality was less than or barely equal to the 2D quality across all compression settings. This appears inconsistent with Chen's [67] blur result, given the overall correlation found between sharpness and quality. Stelmach and Tam [68] found different behavior for blockiness and sharpness: for blockiness, the quality was the average of the two eyes, but for blur, the binocular percept was close to the higher-quality image of the stereo pair. With the success of motion estimation and residual-image compression for video, more perceptual work is needed to understand the consequences of changing the parameters of these key components.

Advanced coding methods motivated by autostereoscopic displays requiring numerous views (nine views is common) include transmitting the scene as a 2D image plus depth map (2D+depth). New-view synthesis, similar to that motivated by depth-range adjustment, is needed for this compression technology, and occlusions again cause problems. A method for restoring occlusion regions using a residual signal and wavelet-based interpolation was developed in [72]. It includes a reconstruction technique that exploits the property that disocclusions are generally an extension of the farther side of a depth edge; including this occlusion side channel of information dramatically increased the image quality. In the context of the 2D+depth application, which relies heavily on the depth maps, the cardboard-cutout effect (described earlier) has been known to occur. A minimal view-synthesis sketch follows.
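As a minimal stand-in for the view synthesis just described (and not the actual method of [72]), the sketch below forward-warps a 2D+depth frame and fills disocclusions by extending the farther side of the depth edge. The disparity scaling and the hole-filling rule are illustrative assumptions.

    import numpy as np

    def render_right_view(image, depth, max_disp=16):
        """2D+depth view synthesis sketch for a grayscale image.  `depth`
        is in [0, 1] with 1 = near; near pixels shift further left in the
        synthesized right view.  Disocclusions are filled by propagating
        the right-hand neighbor, which lies on the farther (background)
        side of the depth edge for this shift direction."""
        h, w = image.shape
        disp = np.round(max_disp * depth).astype(int)
        out = np.full((h, w), -1.0)          # -1 marks holes (image assumed >= 0)
        z = np.full((h, w), -np.inf)         # z-buffer of depth values
        for y in range(h):
            for x in range(w):
                xs = x - disp[y, x]
                if 0 <= xs < w and depth[y, x] > z[y, xs]:   # nearer pixel wins
                    out[y, xs] = image[y, x]
                    z[y, xs] = depth[y, x]
            for x in range(w - 2, -1, -1):   # background-extension hole fill
                if out[y, x] < 0:
                    out[y, x] = out[y, x + 1]
        return out

Real systems add filtering, inpainting of textured disocclusions, and the residual side channel discussed above.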


VI. MOTION ARTIFACTS IN 3D

Stereoscopic percepts are also influenced by motion, which can serve as a strong depth cue; but in a stereoscopic display, motion can cause a number of perceptual artifacts, including erroneous depth estimates. One technique for displaying stereoscopic images that is gaining dominance is field-sequential presentation, wherein a single display alternates between emitting the images for the left and right eyes while an active element switches which eye is permitted to see each image; the eyes receive images in counter-phase. This type of presentation is unlike the stimulation the visual system receives from real-world imagery or from a conventional 2D display. Given this unusual type of presentation, it is unclear how the frame rate of the content, the frame rate of the display, and the speed of the depicted content will influence the likelihood of the motion being perceived as smooth and non-flickering. Additionally, field-sequential presentation delays one eye's image slightly, which can cause disparity artifacts leading to erroneous depth signals [73].

A. Window of Visibility

The human visual system is sensitive to a range of spatial and temporal frequencies. Within this sensitivity range, if a higher-frequency signal is attenuated relative to what we would expect under typical viewing conditions, we categorize the stimulation as blurred. If we instead view the imagery in a display, there may be a series of aliases, and when these aliases have spatial and temporal frequencies to which we are sensitive, we can perceive flicker or judder artifacts. This band of spatial and temporal frequencies to which people are sensitive has been described qualitatively as the window of visibility [74], and the shape of this window has been explored in great detail [75], including eye-movement effects [76]-[78].

Consider an object that moves with constant speed s and is imaged every Δt seconds (1/Δt Hz). The motion is shown as the gray line with slope s in Fig. 4(a), and the sampled version of the motion is shown as the blue dots. This type of visualization is known as spatio-temporal plotting. To understand the visibility of digitized approximations of continuous signals, it is often useful to decompose the signals into their constituent spatial and temporal frequencies; this analysis is the Fourier transform. The Fourier transform of the smooth motion is shown in Fig. 4(b) as the black line with slope 1/s. The aliases from the sampled motion are shown as the parallel blue lines, horizontally spaced by 1/Δt. In the frequency domain, the window of visibility can be approximated as an ellipse. The width of the ellipse defines the frequency threshold at which flicker becomes noticeable (the critical flicker fusion frequency, cff, 40-60 Hz) [79], [80], and the height is the visual acuity limit (va, 40-60 cy/deg). Aliases outside this ellipse will be imperceptible, and those within it will lead to perceptible display artifacts. Situations in which an alias crosses the temporal-frequency axis at a frequency less than the cff lead to perceptible flicker, a temporal change in the luminance level. When an alias intersects the window elsewhere, the observer will see judder, that is, jaggies in time. This jaggy edge in time can manifest as a visible discreteness of the motion or a series of bands on the edge. As speed increases, the aliases become increasingly likely to intersect the window.

Conventionally, cinema has used a 24 frames-per-second (fps) image rate during filming. This rate is typically judged adequate to represent smooth motion, and various projector and display techniques such as double- and triple-shuttering have been used to eliminate flicker. With field-sequential displays, some of these traditional techniques are no longer adequate. The left and right images alternate, and thus the fraction of the frame in which each eye can see the image is less than half (a duty cycle, i.e., the ratio of displayed image duration to total frame duration, of less than 50%). For such a display system, in which each eye is presented with its image for 1/48th of a second, the spectrum of the sampled motion is shown in Fig. 4(c). Along the horizontal axis, at zero spatial frequency, there is a strong signal at 24 Hz, which falls inside the window, and thus flicker is likely to be problematic. To address the flicker problem with a 24-Hz image rate and field-sequential presentation, projection systems triple-flash the image. In this technique, the right images (R1, R2, R3, ...) and the left images (L1, L2, L3, ...) are each presented three times in an interleaved fashion: L1-R1-L1-R1-L1-R1-L2-R2-L2-R2-L2-R2-L3-R3-... This protocol raises the presentation rate to each eye to three times the image rate. The spectrum for this protocol is shown in Fig. 4(d). A numerical sketch of this alias analysis follows.
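To make the alias analysis concrete, the sketch below tests whether the spectral replicas of a sampled motion fall inside an elliptical window of visibility. It models the single-flash case only (multi-flashing attenuates the low-order aliases), and the cff and va values are illustrative placeholders within the 40-60 ranges quoted above.

    import numpy as np

    def visible_artifacts(speed, rate, cff=50.0, va=50.0, n_aliases=5):
        """Window-of-visibility check (after [74], [75]).  Smooth motion of
        speed `speed` (deg/sec) sampled at `rate` (Hz) has spectral replicas
        along the lines f_x = (f_t - k*rate)/speed.  An alias is flagged
        when any of its points satisfies (f_t/cff)^2 + (f_x/va)^2 < 1."""
        artifacts = []
        ft = np.linspace(-2.0 * cff, 2.0 * cff, 2001)      # temporal freqs (Hz)
        for k in range(1, n_aliases + 1):
            fx = (ft - k * rate) / speed                   # k-th alias line (cy/deg)
            if np.any((ft / cff) ** 2 + (fx / va) ** 2 < 1.0):
                # An alias crossing f_x = 0 below the cff reads as flicker;
                # intersections elsewhere read as judder.
                artifacts.append((k, "flicker" if k * rate < cff else "judder"))
        return artifacts

    # 24-Hz single flash at 10 deg/sec: flicker from the low-order aliases,
    # plus judder terms at this speed.
    print(visible_artifacts(speed=10.0, rate=24.0))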



Fig. 6. Erroneous disparity introduced by presentation delay. (a) An object's motion is sampled at intervals of Δt. (b) With single-flash field-sequential display, the right eye's image (blue square) is delayed by Δt/2. (c) Based on nearest-neighbor matching, the disparity from field-sequential displays alternates between the correct disparity and an erroneous disparity; the average is represented as the dashed line. (d) The disparity needed to null the depth for objects moving at different speeds with different display protocols. Data are from the 75-Hz presentation-rate conditions, replotted with permission from [81].

Fig. 4. Motion sampling and alias formation. (a) In the space-time domain, an object with smooth motion (gray line) is sampled with period Δt. (b) In the Fourier domain, the smooth-motion signal becomes the black line with slope 1/s passing through the origin, and the blue lines represent the sampling aliases from stroboscopic presentation. The white ellipse illustrates the window of visibility. (c) The spectrum of the sampled motion with single-flash presentation (50% duty cycle). (d) The spectrum of the sampled motion with triple-flash presentation.


Fig. 5. Motion-artifact and flicker data, replotted with permission from [81]. The rate at which the image is updated is shown on the abscissa, and the object speed on the ordinate. The regions shaded in red, green, and yellow mark the image rates and object speeds at which observers reported motion artifacts, flicker, and both artifacts, respectively. (Left) Single-flash presentation. (Right) Triple-flash presentation.

The multi-flashing attenuates the first and second aliases at the point where they cross the temporal-frequency axis, and thus eliminates flicker. However, this protocol does not change the judder signal. Hoffman, Karasev, and Banks [81] studied the interaction of the image sampling rate, the speed of objects, and the presentation protocol on the perceptibility of these artifacts. Their results demonstrate that flicker, as expected, is influenced solely by the presentation rate of the images, and thus triple flashing is effective. Their results also show that a 24-Hz image rate will cause judder for a variety of object speeds, especially with triple flash. Using a higher image rate could allow the display to present faster speeds without judder. Fig. 5 summarizes these judder and flicker results, depicting the space of image rate and speed in which flicker and motion artifacts become perceptible. As predicted by the window-of-visibility analysis, the image rate required to avoid flicker was three times higher with single-flash than with triple-flash presentation. The observers perceived motion artifacts at moderately lower speeds with the triple-flash presentation than with the single-flash presentation. Thus triple flash is effective at eliminating flicker but can exacerbate motion artifacts for objects moving at moderate speed.

B. Depth Distortions Due to Motion

Field-sequential displays do not stimulate both eyes simultaneously, and thus instantaneous disparity is undefined. The visual system must make disparity estimates by computing spatial differences between non-simultaneous images. In this situation, the presentation delay of the right eye introduces a spatial disparity for moving objects. Consider an object moving with constant speed s (Fig. 6(a)) that is imaged by two synchronized cameras every Δt seconds; the cameras record no change in disparity. When the images are presented on a field-sequential display, as depicted in Fig. 6(b), the right eye's image (blue square) is delayed by Δt/2 with respect to the left image (red X). Although the images are not displayed simultaneously, the visual system can estimate disparity. If the visual system used a simple nearest-neighbor disparity calculation (subtracting adjacent left-eye images from right-eye images), the disparity signal would alternate between the intended disparity, 0 (solid line in Fig. 6(c)), and an erroneous disparity occurring between frame transitions, sΔt. The average disparity signal, indicated by the dashed line in Fig. 6(c), is sΔt/2. Thus objects could change depth as they change velocity.

An example of this distorted depth can be experienced by filming and displaying a simple pendulum swinging left and right in a plane parallel to the face, with the recording cameras synchronized. When these images are played back on a synchronous (simultaneous-presentation) display, the disparities of the original pendulum are reconstructed accurately. If the same set of images is presented on a single-flash field-sequential display, one of the images is systematically delayed by Δt/2, causing the object to appear to traverse an ellipsoidal path in depth [73]. Hoffman et al. [81] characterized the magnitude of this depth distortion for various speeds, image rates, and display protocols, asking observers to adjust the disparity of an object to null out the erroneous depth. With simultaneous image presentation, no nulling disparity was needed for the object to have the correct depth (blue boxes in Fig. 6(d)). When the observers viewed single-flash field-sequential imagery (cyan circles), they set the nulling disparities in a manner consistent with the predicted average disparity (dashed line, sΔt/2). A numerical sketch of this prediction follows.
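This prediction is simple enough to state in code; a minimal sketch, with speed in deg/sec and dt (the frame period Δt) in seconds:

    def field_sequential_disparity(speed, dt):
        """Depth error from presentation delay (single-flash case).  With the
        right eye delayed by dt/2 and nearest-neighbor matching, disparity
        alternates between 0 (same source frame) and speed*dt (adjacent
        frames), so its average -- the dashed line in Fig. 6(c) -- is
        speed*dt/2.  Disparities are returned in deg."""
        erroneous = speed * dt
        average = 0.5 * erroneous
        return erroneous, average

    # e.g., 10 deg/sec at a 24-Hz image rate:
    print(field_sequential_disparity(10.0, 1.0 / 24.0))   # approx (0.417, 0.208) deg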



The authors also considered the triple-flash protocol, in which the presentation rate was maintained by reducing the image rate by a factor of three and triple-flashing each image. If average disparity alone determined the depth percept, the right eye would again be delayed by half the presentation period, and the dashed black line would again predict the depth-nulling values. However, in this condition (red asterisks) the observers generally reported little depth distortion, except at low speeds. This result suggests that the visual system has a nonlinearity when it estimates a volatile disparity signal. The studied speeds of up to 10 deg/sec encompass the majority of speeds found in video imagery [82], with the exception of data scrolling.

C. Motion Interpolation and Frame Rate Coding Issues

The use of motion-vector interpolation techniques in lieu of multi-flashing can be useful in reducing some artifacts. Interpolating frames with appropriate updates in object position effectively increases the image rate and can reduce both motion artifacts, such as judder, and flicker. It could also be used to introduce the appropriate temporal delay for the right-eye images such that the depth distortion is reduced. One issue complicating the use of these algorithms is that they should not be executed independently for each eye's image: positional errors may be unnoticeable in monoscopic video, but uncorrelated errors could introduce quite noticeable time-varying disparity errors in stereoscopic video. If the images are rendered on the fly, care to execute the rendering at the correct sub-frame times would help alleviate the aforementioned artifacts without the need for interpolation.

VII. 2D-TO-3D CONVERSION ALGORITHMS

Most of the more successful approaches to 2D-to-3D conversion require human interaction, such as indicating objects that are then automatically tracked through the video, or indicating the approximate depth of regions. There are also a number of fully automated algorithms in working systems, but the reaction to them has been mixed. Not many perceptual studies directly address this application, but many of its particular distortions have been studied in other contexts and are described in other sections. One of the noted problems is depth reversal due to object motion and camera panning, which violates the assumptions of motion-based parallax (that foreground motion is higher than background motion for moving viewpoints, including panning). Another effect is depth errors, such as holes in objects or texture errors across objects. Flat water is particularly problematic, since false low-frequency depth signals in those regions lead to curved water surfaces, which immediately look incorrect. Smoke and translucent objects are challenging because it is difficult to identify the depth of such nebulous parts of the image. Facial distortions are also easily noticeable, most likely due to our strong familiarity with, and likely dedicated neural coding for, faces. It is perhaps surprising that fully automated approaches can look reasonable at first glance for short-duration viewing, yet contain many errors that lead to discomfort under careful observation. See the article by Zhang et al. [83] in this issue for more information on 2D-to-3D conversion.

VIII. DISCUSSION

This review highlighted key perceptual distortions due to various aspects of stereoscopic imaging systems. For engineering purposes, there is a need to consolidate the different distortions into an overall quality metric.19 Recent work has been done in this regard on limited distortions [3], [84]. One of the difficulties in consolidating the various stereoscopic perceptual studies is the different questions asked. For example, some study the JND of detection of a distortion, some study the JND quality effects of a single distortion, while others look at JNDs of overall quality; of course, none of these are the same. Other studies query observer ratings, where the visibility of distortion can only be coarsely captured. Still others work with an overall quality axis without making category distinctions between visibility and annoyance.20

Then there is the question of how different components combine into overall image quality. In work limited to non-stereo imaging, Keelan [85] has studied how the various components of non-stereo image quality (sharpness, noise, contrast, color, etc.) are integrated into an overall image-quality percept. His multi-attribute21 image-quality model is based on a Minkowski summation, which has been useful for assessing overall image quality [86], but he adds the flexibility of a variable exponent, n:

    ΔQ = -( Σi |ΔQi|^n )^(1/n)

The summation is over negative values ΔQi, which represent degradations. The units are JNDs of quality, as opposed to detection, since some distortions contribute less to quality degradation than others at the same visibility (a common example for non-stereo imaging is blur versus blocking artifacts). The exponent n varies in a sigmoidal manner with the strength of the most severe distortion among the attributes considered; this was found to be necessary for the model to work over a wide range of distortions. When distortions are small, they approximately sum (as occurs with n near 1), but when one or more quality-attribute degradations are large, even modest changes in the other attributes have little impact on overall quality (as occurs with large n, where the sum approaches the worst attribute).

If we try to incorporate depth into this overall quality model, the studies described herein imply that depth is integrated into the overall percept with a very high exponent, and one that is apparently different from those of the other non-stereo quality attributes. This is because while the depth can be many JNDs above detection (i.e., over a flat non-stereo displayed image), it still contributes little to lifting overall quality. Of course, JNDs of depth quality need to be used in this model, as opposed to JNDs of depth-change visibility. At this stage we do not know whether the existing Minkowski quality model can incorporate depth, but it does not have the capability of handling a different exponent for depth, which may be needed. A small sketch of the variable-exponent combination follows.

19The term Quality of Experience (QoE) is used to describe the overall quality of the entire viewing experience, as opposed to the image-related portion of the experience; it is a superset that also includes discomfort issues and feelings of immersion. This paper addresses the image-related quality.

20Visibility tends to be more robust than tolerance, as expectation plays a role in tolerance, which changes due to steadily improving quality/cost ratios of imaging and display products.

21To avoid confusion, we will use "attribute" instead of "dimension".
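A minimal sketch of this combination rule follows, treating the exponent n as a fixed parameter rather than Keelan's sigmoidal function of the most severe attribute; the example numbers are arbitrary.

    def overall_quality(degradations, n):
        """Variable-exponent Minkowski combination after Keelan [85].
        `degradations` are negative attribute scores (JNDs of quality);
        returns the combined quality change, also negative."""
        return -sum(abs(q) ** n for q in degradations) ** (1.0 / n)

    # n near 1: small degradations roughly add;
    # large n: the worst attribute dominates.
    print(overall_quality([-1.0, -1.0], n=1.0))   # -2.0
    print(overall_quality([-4.0, -1.0], n=8.0))   # about -4.0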


The distortions associated with the tradeoffs in achieving depth, such as crosstalk, blur, or reduced-bandwidth constraints, can cause substantial degradations in overall image quality. If the depth contributes little to overall quality, this may pose a problem for the widespread usage or longevity of stereoscopic displays.22

Other aspects of stereoscopic quality do fit easily into the variable-exponent Minkowski model. For example, some work on asymmetric coding [63] supports Keelan's overall quality model with a variable Minkowski exponent: when quality is high, the distortions of the components are low, and the summation of quality produces an average of the two eyes' quality; but when image quality is low across both eye viewpoints, the overall quality is less than the mean.

Some of the needed research includes tolerances for distortion, particularly for prolonged viewing. In addition, understanding the distribution of visual sensitivity across the viewing population is important [87]. For non-stereo spatial vision, population statistics are available for the contrast sensitivity function, showing age effects, mean, and max and min percentiles, etc. For stereoscopic viewing, understanding these individual variations and tolerances will be extremely useful, as anecdotes show a strong role of expertise in noticing stereoscopic distortions (although by the end of an observer study the subject may inherently have become an expert, since the experiment itself acts as a training exercise for the related aspect of perception).

Cognitive effects [88], [89] may be very important in the overall quality of stereoscopically displayed imagery. Problems like the cardboard-cutout effect may be more complicated than suggested by modeling with simple low-level geometry [35]. Cases where motion conflicts with the expected time traversed by well-known moving objects may lead to noticeable distortions, such as a person running back into the depth plane when the viewing conditions cause asymmetric depth stretching. What would be the cognitive consequence of the runner seeming to run much faster or slower than is physically realistic? Alternatively, 2D-to-3D conversions that work well for casual viewing suggest that the visual system is robust to these types of conflicts. Distortions such as reversed depth due to surrogate depth [90], and even outright technical mistakes, are not as easy for many viewers to see as one would expect. So it might be enough to reduce the visibility and cues of the 2D display surface to observe good dimensional imagery. SKE (signal-known-exactly) issues involving the observer's knowledge of most objects may play a role in these approaches. One such approach is the synopter, a parallax-removing viewing apparatus: by providing identical perspectives to each eye, the disparity cues to the 2D display surface are removed, and without such cues the observer is claimed to perceive more compelling 3D shapes in the depicted scene. Another technology is the high-dynamic-range (HDR) display, which, by having a perceptually black screen surface, has greatly reduced 2D surface cues, as well as being more accurate. Good oil paintings by the masters also fall into this category, and viewing them is what the synopter was first developed for, in 1907, by Moritz von Rohr. However, for experienced stereoscopic viewers, such impressions of dimensionality are not confused with disparity-based depth, and for inexperienced23 viewers, a side-by-side comparison quickly reveals the difference.

22Another possibility is that, in some observers' understanding, 'image quality' strictly describes non-stereo aspects, for reasons unknown. This highlights the importance of giving clear instructions to observers during psychophysical experiments, and of not assuming a shared definition of 'image.'

23Except for those with stereoanomalies, of course.

REFERENCES

[1] D. Strohmeier, S. Jumisko-Pyykko, and K. Kunze, "New, lively and exciting or just artificial, straining, and distracting? A sensory profiling approach to understand mobile 3D audiovisual quality," VPQM, 2010.
[2] P. Seuntiens, "Visual experiences of 3DTV," Ph.D. thesis, Eindhoven University of Technology and Philips Research, Eindhoven, 2006, ch. 4.
[3] W. J. Tam, "Psychovisual aspects of viewing stereoscopic video sequences," Proc. SPIE, vol. 3295, pp. 226-235, 1998.
[4] W. J. Tam, F. Speranza, S. Yano, K. Shimono, and H. Ono, "3D-TV visual comfort," IEEE Trans. Broadcast., 2011.
[5] N. S. Holliman, N. A. Dodgson, G. E. Favalora, and L. Pockett, "Three-dimensional displays: A review and applications analysis," IEEE Trans. Broadcast., Jun. 2011.
[6] R. Kaptein and I. Heynderickx, "Effect of crosstalk in multi-view autostereoscopic 3D displays on perceived image quality," in SID Digest, 2007, pp. 1220-1223.
[7] L. Lipton, "High resolution immersion view," Proc. SPIE, vol. 2177, p. 132, 1994.
[8] "Planar—model SD2620W," [Online]. Available: http://www.planar3d.com/entry-level-3d/
[9] C.-Y. Chen et al., "3D technology development and human factor," in SID 10 Digest, 2010, pp. 518-521.
[10] X. Feng and S. Daly, "Vision-based strategy to reduce the perceived color misregistration of image capturing devices," Proc. IEEE: Special Issue on Applied Visual Models, vol. 90, no. 1, pp. 18-27, Jan. 2002.
[11] Y. Yeh and L. D. Silverstein, "Limits of fusion and depth judgment in stereoscopic color displays," Human Factors, vol. 32, no. 1, pp. 45-60, 1990.
[12] K. C. Huang et al., "A study of how crosstalk affects stereopsis in stereoscopic displays," Proc. SPIE, vol. 5006, pp. 247-253, 2003.
[13] S. Pastoor, "Human factors of 3D images: Results of recent research," in Proc. IDW '95, 1995, vol. 3D-7, pp. 69-72.
[14] A. Hanazato, M. Okui, and I. Yuyama, "Subjective evaluation of cross talk disturbance in stereoscopic displays," in SID Digest, 2000, pp. 288-291.
[15] F. Kooi and M. Lucassen, "Visual comfort of binocular and 3D displays," in HVEI VI, Proc. SPIE, vol. 4299, 2001.
[16] P. Seuntiens, L. Meesters, and W. Ijsselsteijn, "Perceptual attributes of crosstalk in 3D images," Displays, vol. 26, pp. 177-183, 2005.
[17] P. J. B. Seuntiens et al., "Viewing experience and naturalness of 3D images," Proc. SPIE, vol. 5006, 2003.
[18] L. Chen et al., "Investigation of crosstalk in a 2-view 3D display," in SID Digest, 2008, pp. 1138-1141.
[19] C.-Y. Chiang, K.-T. Chen, Y.-C. Chang, and Y.-P. Huang, "Effect of crosstalk for stereoscopic 3D dynamic moving images," in SID Symp. Digest, 2009, vol. 40, no. 1, pp. 808-811.
[20] S. Pala, R. Stevens, and P. Surman, "Optical crosstalk and visual comfort of stereoscopic display used in a real-time application," Proc. SPIE, vol. 6940, 2007.
[21] J. S. Lipscomb and W. L. Wooten, "Reducing crosstalk between stereoscopic views," Proc. SPIE, vol. 2177, pp. 92-96, 1994.
[22] J. Konrad, B. Lacotte, and E. Dubois, "Cancellation of image crosstalk in time-sequential displays of stereoscopic video," IEEE Trans. Image Process., vol. 9, no. 5, p. 897, May 2000.
[23] T. Tsai, C.-W. Chen, C.-H. Shih, and W.-M. Huang, "An image processing method for the elimination of ghost image and improvement of the image quality in stereoscopic display," in SID IDW, 2009.
[24] M. Barkowski, P. Campisi, P. LeCallet, and V. Rizzo, "Crosstalk measurement and mitigation for autostereoscopic displays," in SPIE Electron. Imaging Conf., 3D Image Process. Appl., 2010.
[25] L. Kerofsky, Y. Yoshida, S. Deshpande, and J. Speigle, "Crosstalk in 3D-TV: Adaptive cancellation and perceptual validation," in SID IDW, 2010.
[26] D. Vishwanath, A. R. Girshick, and M. S. Banks, "Why pictures look right when viewed from the wrong place," Nature Neurosci., vol. 8, no. 10, pp. 1401-1410, 2005.



[27] A. J. Woods, T. Docherty, and R. Koch, "Image distortions in stereoscopic video systems," in Proc. SPIE Stereoscopic Displays Appl. IV, 1993, vol. 1915, pp. 36-47.
[28] G. Jones, D. Lee, D. Ezra, and N. Holliman, "Controlling perceived depth in stereoscopic images," in Proc. SPIE Stereoscopic Displays Virtual Reality Syst. VIII, 2001, vol. 4297.
[29] R. T. Held and M. S. Banks, "Misperceptions in stereoscopic displays: A vision science perspective," in Proc. Appl. Perception Graphics Vis. (APGV), ACM, 2008, pp. 23-32.
[30] L. B. Stelmach, W. J. Tam, F. Speranza, R. Renaud, and T. Martin, "Improving the visual comfort of stereoscopic images," Proc. SPIE, vol. 5006, 2003.
[31] F. Kooi and A. Toet, "Visual comfort of binocular 3D displays," Displays, vol. 25, pp. 99-108, 2004.
[32] R. T. Held, "Software," 2009. [Online]. Available: http://www.eecs.berkeley.edu/~rheld/Robert_T_Held/Software.html
[33] D. Hoffman, A. Girshick, K. Akeley, and M. Banks, "Vergence-accommodation conflicts hinder visual performance and cause visual fatigue," J. Vis., vol. 8, no. 3, art. 33, pp. 1-30, 2008.
[34] D. L. MacAdams, "Stereoscopic perceptions of size, shape, distance and direction," SMPTE J., vol. 62, pp. 271-289, 1954.
[35] H. Yamanoue, M. Okui, and F. Okano, "Geometrical analysis of puppet-theatre and cardboard effects in stereoscopic HDTV images," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 744-752, 2006.
[36] K. Masaoka et al., "Spatial distortion prediction system for stereoscopic images," J. Electron. Imaging, vol. 15, no. 1, p. 013002, 2006.
[37] R. T. Held, E. A. Cooper, J. F. O'Brien, and M. S. Banks, "Using blur to affect perceived distance and size," ACM Trans. Graph., vol. 29, no. 2, Apr. 2010.
[38] H. Yamanoue, M. Okui, and I. Yuyama, "A study on the relationship between shooting conditions and cardboard effect of stereoscopic images," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 3, pp. 411-416, 2000.
[39] W. J. Tam, K. Shimono, and S. Yano, "Perceived size of targets displayed stereoscopically," in Stereoscopic Displays Appl. XII, 2001.
[40] K. Shimono, W. J. Tam, C. Vasquez, F. Speranza, and R. Renaud, "Removing the cardboard effect in stereoscopic images using smoothed depth maps," Proc. SPIE, vol. 7524, 2009.
[41] S. Kishi, S. H. Kim, T. Shibata, T. Kawai, J. Hakkinen, J. Takatalo, and G. Nyman, "Scalable 3D image conversion and ergonomic evaluation," Proc. SPIE, vol. 6803, 2008.
[42] G. Sun and N. Holliman, "Evaluating methods for controlling depth perception in stereoscopic cinematography," in Stereoscopic Displays Appl., Proc. SPIE, vol. 7237, 2009.
[43] S. Poulakos, A. Smolic, and M. Gross, "Non-linear disparity mapping for stereoscopic 3D," ACM Trans. Graphics (SIGGRAPH), 2010.
[44] H. Pan, C. Yuan, and S. Daly, "Stereoscopic 3D content depth tuning guided by human visual models," in Proc. SPIE Stereoscopic Displays Appl. XXII, Electron. Imaging, 2011, vol. 7863.
[45] P. F. P. Meiers, "Parallactic depth-dependent pixel shifts," U.S. Patent 5 929 869, Jul. 27, 1999.
[46] L. Wang, H. Jin, R. Yang, and M. Gong, "Stereoscopic inpainting: Joint color and depth completion from stereo images," in IEEE CVPR, 2008.
[47] O. J. Woodsford, J. D. Reid, P. H. S. Torr, and W. Fitzgibbon, "On new view synthesis using multiview stereo," in BMVC, 2007.
[48] R. K. Gunnewiek, R. Beretty, B. Barenbrug, and J. Magalhaes, "Coherent spatial and temporal occlusion generation," Proc. SPIE, vol. 7237, 2009.
[49] J. M. Harris, "Monocular zones in stereoscopic scenes: A useful source of information for human binocular vision," Proc. SPIE, vol. 7524, 2010.
[50] Y. J. Jeong, Y. Kwak, Y. J. Yung, and D. Park, "Depth-image-based rendering (DIBR) using disocclusion area restoration," in SID Digest, 2009, pp. 119-122.
[51] L. Zhang and W. J. Tam, "Stereoscopic image generation based on depth images for 3DTV," IEEE Trans. Broadcast., vol. 52, no. 2, 2005.
[52] C. Vasquez and W. J. Tam, "3D-TV: Coding of disocclusions for 2D+depth representation of multi-view images," in IASTED Int. Conf. Comput. Graphics Imaging, 2008.
[53] W. J. Tam, G. Alain, L. Zhang, T. Martin, and R. Renaud, "Smoothing depth maps for improved stereoscopic image quality," in Proc. SPIE 3-D TV, Video, Display III, 2004, vol. 5599, pp. 162-172.
[54] C. W. Tyler, "Stereoscopic depth movement: Two eyes less sensitive than one," Science, vol. 174, pp. 958-961, 1971.


[55] C. W. Tyler, C. M. Schor, and N. J. Coletta, "Spatiotemporal limitations of vernier and stereoscopic alignment acuity," in Proc. Int. Soc. Opt. Eng., 1992, vol. 1669, pp. 112-121.
[56] B. N. Vlaskamp, G. Yoon, and M. S. Banks, "Neural and optical constraints on stereoacuity," Perception, vol. 37 (ECVP Abstract Suppl.), p. 2, 2008.
[57] F. W. Campbell and D. G. Green, "Optical and retinal factors affecting visual resolution," J. Physiol., vol. 181, pp. 576-593, 1965.
[58] G. C. DeAngelis, I. Ohzawa, and R. D. Freeman, "Depth is encoded in the visual cortex by a specialized receptive field structure," Nature, vol. 352, no. 11, pp. 156-159, 1992.
[59] R. von der Heydt, H. Zhou, and H. S. Friedman, "Representation of stereoscopic edges in monkey visual cortex," Vis. Res., vol. 40, pp. 1955-1967, 2000.
[60] F. Allenmark and J. Read, "Spatial stereoresolution," J. Vis., vol. 9, no. 8, p. 262, 2009.
[61] M. G. Perkins, "Data compression of stereopairs," IEEE Trans. Commun., vol. 40, no. 4, pp. 684-696, 1992.
[62] W. D. Reynolds and R. V. Kenyon, "The wavelet transform and the suppression theory of binocular vision for stereo image compression," in IEEE ICIP, 1996, vol. 1, pp. 557-560.
[63] P. J. B. Seuntiens et al., "Viewing experience and naturalness of 3D images," Proc. SPIE, vol. 5006, 2003.
[64] P. Gorley and N. Holliman, "Stereoscopic image quality metrics and compression," in Proc. SPIE Stereoscopic Displays Virtual Reality Syst. XIX, 2008, vol. 6803.
[65] A. Vetro, A. M. Tourapis, K. Müller, and T. Chen, "3D-TV content storage and transmission," IEEE Trans. Broadcast., Jun. 2011.
[66] A. Berthold, "The influence of blur on the perceived quality and sensation of depth of 2D and stereo images," ATR Research Laboratories, Kyoto, Japan, Tech. Rep. TR-H-232, 1997.
[67] J. Y. Chen, Z. Liwei, and S. Q. Ding, "The effect of JPEG coding scheme on the perceived quality of 3D images," in SID Symp., 1998, vol. 29, pp. 1211-1214.
[68] L. B. Stelmach and W. J. Tam, "Stereoscopic image coding: Effects of disparate image quality in left- and right-eye views," Signal Process.: Image Commun., vol. 14, pp. 111-117, 1998.
[69] L. B. Stelmach, W. J. Tam, and D. V. Meegan, "Stereo image quality: Effects of spatiotemporal resolution," IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 188-193, 1999.
[70] D. V. Meegan, L. Stelmach, and W. J. Tam, "Unequal weighting of monocular inputs in binocular combination: Implications for compression for stereoscopic imagery," J. Exp. Psychol.: Appl., vol. 7, no. 2, pp. 143-153, 2001.
[71] P. J. B. Seuntiens et al., "Viewing experience and naturalness of 3D images," Proc. SPIE, vol. 5006, 2003.
[72] C. Vazquez, W. J. Tam, and F. Speranza, "Stereoscopic imaging: Filling disoccluded areas in depth-image-based rendering," in Proc. SPIE 3-D TV, Video, Display V, 2006, vol. 6392, p. 63920D.
[73] D. C. Burr and J. Ross, "How does binocular delay give information about depth?," Vis. Res., vol. 19, pp. 523-532, 1979.
[74] A. B. Watson, A. J. Ahumada, and J. E. Farrell, "Window of visibility: A psychophysical theory of fidelity in time-sampled visual motion displays," JOSA A: Optics Image Sci., vol. 3, pp. 300-307, 1986.
[75] D. H. Kelly, "Motion and vision. II. Stabilized spatio-temporal threshold surface," J. Opt. Soc. Am., vol. 69, no. 10, pp. 1340-1349, 1979.
[76] B. Girod, "Motion compensation: Visual aspects, accuracy, and limitations," in Motion Analysis and Image Sequence Processing, M. Sezan, Ed. Kluwer, 1993, pp. 125-152.
[77] S. Daly, "Engineering observations from spatiovelocity and spatiotemporal visual models," in Vision Models and Applications to Image and Video Processing, C. J. van den Branden Lambrecht, Ed. Kluwer, 2001.
[78] J. Laird, M. Rosen, J. Pelz, E. Montag, and S. Daly, "Spatio-velocity CSF as a function of retinal velocity using unstabilized stimuli," in HVEI, Proc. SPIE, 2006, vol. 6057, pp. 605705:1-605705:11.
[79] C. R. Cavonius, "Binocular interaction in flicker," Quart. J. Exp. Psychol., vol. 31, pp. 273-280, 1979.
[80] C. Landis, "Determinants of the critical flicker-fusion threshold," Physiol. Rev., vol. 34, pp. 259-286, 1954.
[81] D. M. Hoffman, I. V. Karasev, and M. S. Banks, "Stereo presentation protocols: How they affect flicker visibility, perceived motion, and perceived depth," J. Soc. Inf. Display, in press, 2011.
[82] T. Fujine, Y. Kikuchi, M. Sugino, and Y. Yoshida, "Real-life in-home viewing conditions for FPDs and statistical characteristics of broadcast video signal," in Proc. AM-FPD, 2006, p. 85.


[83] L. Zhang, C. Vazquez, and S. Knorr, "3D-TV content creation: Automatic 2D to 3D video conversion," IEEE Trans. Broadcast., Jun. 2011.
[84] A. Benoit, P. LeCallet, P. Campisi, and R. Cousseau, "Quality assessment of stereo images," EURASIP J. Image Video Process., vol. 2008, article ID 659024, 2008.
[85] B. W. Keelan, Handbook of Image Quality. Boca Raton, FL: CRC Press, 2002.
[86] H. de Ridder, "Minkowski-metrics as a combination rule for digital-image coding impairments," in SPIE HVEI III, 1992, vol. 1666, pp. 16-26.
[87] S. Endrikhovski, E. Jin, M. E. Miller, and R. W. Ford, "Predicting individual fusional range from optometric data," in Proc. SPIE Stereoscopic Displays Virtual Reality Syst. XII, 2005, vol. 5664.
[88] R. Patterson, "Human factors of 3-D displays," J. SID, vol. 15, no. 11, pp. 861-872, 2007.
[89] R. Patterson, "Human factors of stereo displays: An update," J. SID, vol. 17, no. 12, pp. 987-996, 2009.
[90] W. J. Tam, A. S. Yee, J. Ferriera, S. Tariq, and F. Speranza, "Stereoscopic image rendering based on depth maps created from blur and edge information," in Proc. SPIE Stereoscopic Displays Appl., 2005, vol. 5664, pp. 104-115.

Scott J. Daly (M'84) received the B.S. EE degree in 1980 from North Carolina State University, Raleigh, and the M.S. degree in bioengineering from the University of Utah, Salt Lake City, in 1984. He then worked until 1996 in the Imaging Science Division at Eastman Kodak in the fields of image compression, image fidelity models, and image watermarking. The years 1996-2010 were spent at Sharp Laboratories of America in Camas, Washington, where he became a research fellow. Currently, he is a senior member at Dolby Laboratories, focusing on overall perceptual issues.


He is a member of IEEE, SPIE, and SID.

Robert T. Held received the B.S. in bioengineering cum laude from the University of Washington, Seattle, in 2004, and the Ph.D. in bioengineering from the Joint Graduate Group in Bioengineering, University of California, San Francisco, and University of California, Berkeley, in 2010. He is currently a Postdoctoral Scholar in the Computer Science division at UC Berkeley. He has published articles on topics ranging from therapeutic ultrasound to stereoscopic misperceptions to the use of pictorial blur as a cue to distance and size. His current research interests include computer graphics, human-computer interaction, display technology, and the human visual system. He is a member of the ACM.

David M. Hoffman received the B.S. in bioengineering from the University of California, San Diego, in 2005, and the Ph.D. from the Vision Science department at the University of California, Berkeley, in May 2010. He worked on a number of projects in volumetric display development and on understanding the causes of discomfort from stereoscopic viewing. He has since joined MediaTek USA as an Image Quality Engineer, developing techniques to assess and improve the visual quality of electronic images.
