Applicant’s name: Zhen, Yi

Proposed Topic/Title of Research: Binocular Stereo Vision Based Gesture Segmentation

Background: Gesture recognition and gesture segmentation

Hand gesture recognition has become an important part of human-computer interaction (HCI) and computer virtual reality. Applications of gesture recognition can be found in many fields, such as motion-sensing entertainment equipment [1], robots manipulated by human gestures [2], and gesture detection systems that help deaf-mute users communicate. At present, gesture recognition methods fall into two main categories: those based on “data gloves” and those based on vision. Compared with expensive and inconvenient data gloves, vision-based recognition merely uses one or more cameras [3] to realize HCI. Moreover, an artificial vision system can both recognize gestures and track hand motion. The most critical part of vision-based gesture recognition is gesture segmentation [4]. Gesture segmentation refers to locating and separating the region of interest (ROI), namely the user’s gestures or the region containing them, from an image or video. Only on the basis of correctly locating and tracking gestures can feature extraction and recognition be completed accurately.

Stereoscopic vision based gesture segmentation

Stereoscopic vision represents objects in three spatial dimensions: width, height and depth. Compared with two-dimensional vision, it adds the perception of depth and thus offers more complete image information. To avoid the loss of image information and the detrimental effect of viewing angle, stereoscopic-vision-based gesture segmentation has been put forward to replace appearance-based segmentation. Stereoscopic vision can restore gesture information (e.g. finger joint angles, palm position) that appearance-based segmentation methods neglect. In addition, stereoscopic vision chiefly studies how to recover the distance (depth) of objects in a scene from multiple images of it. Stereoscopic vision is therefore a natural fit for gesture segmentation, and it performs much better in gesture segmentation and recognition.

Brief Literature Review:

Because of its high computational complexity, stereo vision has been introduced into gesture segmentation only in the past ten years. With the rapid development of integrated circuits, hardware processing technology and computer vision, stereo vision can now be implemented in low-cost, real-time gesture recognition. Michael Van den Bergh and Luc Van Gool [5] applied a ToF camera and an ordinary RGB camera to gesture segmentation: the two cameras were used to obtain, respectively, depth information and color information for each pixel in the scene image, and the hand was then located by combining the two. Mohamad El-Jaber, Khaled Assaleh and Tamer Shanableh [6] developed a binocular stereo vision system with two Bumblebee XB3 cameras, used K-means clustering on the per-pixel depth feature to segment the user’s body, and then separated the gesture with a frame-difference method. Jamie Shotton, Andrew Fitzgibbon, Mat Cook and Toby Sharp [7] extracted depth features of neighborhood pixels via depth cameras based on PrimeSense light-coded depth imaging, adopted random forests to classify and recognize each body joint, and finally located the hand clearly and accurately. This method was also adopted by Kinect, Microsoft’s motion-sensing entertainment device for the Xbox 360.

Motivated by the research above, improvements to the accuracy and robustness of gesture segmentation via binocular stereo vision should be proposed and investigated to address the following problems:
i. the limited accuracy of depth computed by triangulation from pixel disparity;
ii. the heavy computation of stereo matching, which is easily affected by optical distortion and noise, specular reflection from smooth surfaces, perspective distortion, low or repetitive texture, overlap and discontinuity, etc.;
iii. the weak robustness of locating and detecting the ROI in a depth-information image.

Methodology: The gesture segmentation method via binocular stereo vision (BSV) consists of two processes: generating a depth image by BSV, and segmenting the gesture using the depth information of each pixel in that image.
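Problem (i) above follows directly from the triangulation geometry: depth is recovered as Z = f·B/d (focal length times baseline over disparity), so the depth step between adjacent integer disparities grows roughly as Z²/(f·B). A minimal sketch with hypothetical camera parameters (the 500 px focal length and 6 cm baseline are illustrative, not values from this proposal):

```python
# Depth from disparity by similar triangles: Z = f * B / d.
# f: focal length in pixels; B: baseline in metres; d: disparity in pixels.
def depth_from_disparity(f_px, baseline_m, d_px):
    return f_px * baseline_m / d_px

f, B = 500.0, 0.06  # hypothetical focal length and 6 cm baseline

# Depth quantisation: the gap between adjacent integer disparities
# grows roughly as Z^2 / (f * B), so accuracy degrades with distance.
step_near = depth_from_disparity(f, B, 60) - depth_from_disparity(f, B, 61)
step_far = depth_from_disparity(f, B, 10) - depth_from_disparity(f, B, 11)
assert step_far > step_near  # coarser depth resolution for distant objects
```

This is why sub-pixel disparity refinement (step 4 of the methodology below) matters for accuracy.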

i. The process of generating a depth image by BSV includes five steps:

Step 1. Binocular stereo calibration: computing the geometric relationship (rotation and translation parameters, RTP) between the spatial positions of the two cameras. Using a checkerboard, the RTP between the two cameras can be computed from the RTP obtained in each single-camera calibration. To allow for image noise, the mid-value is chosen as the initial RTP estimate; the projection error of the checkerboard corners is then minimized with the Levenberg-Marquardt iterative algorithm, and the refined RTP are returned [8-9].

Step 2. Binocular stereo rectification: re-projecting the two camera images so that they lie accurately in the same plane and the two projected images are aligned.

Step 3. Binocular stereo matching: the most crucial step in BSV, which determines the correspondence between points in the two views so that the disparity of each pixel can be computed in real time. Improving the stereo matching process will be specified in the continued research.

Step 4. Disparity refinement: refining the disparity image produced by stereo matching.

Step 5. Triangulation: computing the depth of each pixel from its disparity and the baseline length by the similar-triangles method.

In the proposed method, the stereo matching step will adopt shiftable windows [10] as the support window: for each pixel, nine different windows are evaluated, and the one with the lowest matching cost is selected as the support window. A box-filtering optimization strategy is then utilized to reduce direct match-cost computation.

ii. The process of gesture segmentation

In this research, gesture segmentation is achieved by using the depth information, extracting suitable features, and selecting a suitable classifier (a Boosting classifier) [11-14]. The extracted features should satisfy the following constraints: a) scale invariance; b) rotation and shift invariance; c) invariance to a certain degree of illumination change; d) invariance to a certain degree of perspective change. These four properties help reduce interference from noise, overlap and complex backgrounds when extracting image features, and thus improve gesture segmentation performance. Together with the depth information, the research chooses a proposed image depth feature fθ(I, x) [15]:

fθ(I, x) = dI(x + u/dI(x)) − dI(x + v/dI(x))

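A minimal sketch of the depth feature fθ(I, x) introduced above (assumptions: the depth image is a plain 2D list in metres, and probes that fall outside the image return a large constant so that they compare as distant background; the function and parameter names are hypothetical):

```python
# Sketch of the depth feature f_theta(I, x): the difference between the
# depths at two probe pixels offset from x by u and v.
def f_theta(depth, x, y, u, v, far=10.0):
    d0 = depth[y][x]  # depth at the reference pixel x = (x, y)
    def probe(ox, oy):
        # Offsets are divided by d0, which makes the feature depth invariant.
        qx, qy = x + int(round(ox / d0)), y + int(round(oy / d0))
        if 0 <= qy < len(depth) and 0 <= qx < len(depth[0]):
            return depth[qy][qx]
        return far  # out-of-image probes count as far background
    return probe(*u) - probe(*v)

# Hypothetical 3x3 depth patch (metres); d0 = 4.0 scales the offsets down.
patch = [[1.0, 1.0, 2.0],
         [1.0, 4.0, 2.0],
         [1.0, 1.0, 1.0]]
val = f_theta(patch, 1, 1, (4, 0), (0, 4))  # probes land at (2,1) and (1,2)
```

The same reference offsets probe nearer pixels for close hands and farther pixels for distant ones, which is exactly the depth invariance the normalization below provides.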
Notation: I denotes the image; x denotes a pixel in I; the parameter θ = (u, v) describes the offsets u and v. The factor 1/dI(x) normalizes u and v, ensuring that the feature is depth invariant.

Outcomes and Value:

The proposed method is intended to achieve a good trade-off between the speed and the effectiveness of stereo matching algorithms by improving existing ones. Furthermore, the improved algorithm should run fast enough, with good performance, for real-time gesture application scenes. Specifically, on the image dataset provided by Middlebury College, the error rate should stay below 10%, and a Pentium Dual-Core E5400 machine with 2 GB of memory should process more than 20 frames per second at a frame size of 320×240 with a search-area radius of 64 pixels. Based on the improved stereo matching algorithm, the research will implement a new gesture segmentation algorithm that combines both luminance and depth information; the expected accuracies of hand detection and hand location are both above 90%.

References:

[1] Heng-Tze Cheng, An Mei Chen, Ashu Razdan, Elliot Bulle, "Contactless Gesture Recognition System Using Proximity Sensors", in Proc. IEEE International Conference on Consumer Electronics (ICCE), pp. 149-150, 2011.
[2] Jagdish Lal Raheja, Radhey Shyam, Umesh Kumar, P. Bhanu Prasad, "Real-Time Robotic Hand Control Using Hand Gestures", in Proc. Second International Conference on Machine Learning and Computing (ICMLC), pp. 12-16, 2010.
[3] F. Quek, "Towards a Vision Based Hand Gesture Interface", in Proc. Virtual Reality Software and Technology (VRST), Singapore, pp. 17-31, 1994.
[4] Cao Xin-yan, Liu Hong-fei, Zou Ying-yong, "Gesture Segmentation Based on Monocular Vision Using Skin Color and Motion Cues", in Proc. International Conference on Image Analysis and Signal Processing (IASP), pp. 358-362, 2010.
[5] Michael Van den Bergh, Luc Van Gool, "Combining RGB and ToF Cameras for Real-time 3D Hand Gesture Interaction", in Proc. IEEE Workshop on Applications of Computer Vision (WACV), pp. 66-72, 2011.
[6] Mohamad El-Jaber, Khaled Assaleh, Tamer Shanableh, "Enhanced User-Dependent Recognition of Arabic Sign Language via Disparity Images", in Proc. 7th International Symposium on Mechatronics and its Applications (ISMA), pp. 1-4, 2010.
[7] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake, "Real-Time Human Pose Recognition in Parts from Single Depth Images", in Proc. CVPR, 2011 (Best Paper).
[8] R. Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses", IEEE Journal of Robotics and Automation, 3:323-344, 1987.
[9] W. Ouyang, F. Tombari, S. Mattoccia, L. Di Stefano, W. Cham, "Performance Evaluation of Full Search Equivalent Pattern Matching Algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2011.
[10] D. Scharstein, R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms", International Journal of Computer Vision, 47(1/2/3):7-42, 2002.
[11] P. Viola, M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", in Proc. Computer Vision and Pattern Recognition (CVPR), 2001.
[12] Y. Freund, R. Schapire, "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[13] J. H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", February 1999.
[14] L. Breiman, "Random Forests", Machine Learning, 45(1):5-32, 2001. doi:10.1023/A:1010933404324
[15] Y. Chu, L. Li, D. B. Goldgof, Y. Qui, R. A. Clark, "Classification of Masses on Mammograms Using Support Vector Machine", Proc. SPIE, vol. 5032, pp. 940-948, 2003.
