Video-based Hand Movement Tracking by

Jason Hau Xuan Tram [email protected]

Final Year Project Report
Submitted by Jason Hau Xuan Tram in partial fulfilment of the requirements for the degree of Master of Engineering in Computing in the Department of Computing at Imperial College London.
Supervisor: Dr Fernando Bello
2nd Marker: Prof. Duncan Gillies

JUNE 2006


Abstract

Our aim is to implement a marker-based hand movement tracking system using computer vision techniques. The implementation allows users to capture images and video sequences from a stereoscopic system, calibrate each camera independently, and perform stereo calibration to obtain the extrinsic parameters. Given a pair of stereo video sequences of hand movement, our implementation provides methods to locate the hand regions (which minimises the search area for markers), detect the markers on each hand, establish stereo correspondence between the two frames of each stereo pair, and finally reconstruct the 3D coordinates of the corresponding points. Having the 3D coordinates of the markers in the world coordinate system allows us to derive dexterity measures such as the number and speed of hand movements, the distance travelled by the hands and the time taken for the task. One of the main applications of this work is surgical skills assessment, where these dexterity measures provide information on the skill of practising surgeons.


Acknowledgement

I would like to thank Dr Fernando Bello for his advice and supervision of this project. I would also like to thank Professor Duncan Gillies for agreeing to be my second marker. I would like to thank all my friends for their help and support. Finally, I would like to thank my family for making it possible for me to study at Imperial College.


Table of contents

Chapter 1 Introduction ... 6
1.1 Motivation ... 6
1.2 Goals ... 6
1.3 Achievements ... 7
1.4 Thesis outline ... 8
Chapter 2 Background ... 9
2.1 Assessment of surgical skills ... 9
2.1.1 Introduction ... 9
2.1.2 Principles of assessment ... 10
2.1.3 Current methods of assessment in surgery and their limitations ... 10
2.1.4 Objective methods of assessing technical skills ... 12
2.2 Hand Tracking System ... 18
2.2.1 Tracking methods ... 18
2.3 Hand Localisation ... 19
2.3.1 Markers detection ... 19
2.3.2 Matching of marker ... 20
2.3.4 Background Subtraction ... 20
2.3.5 Skin Colour Detection ... 21
2.4 Feature extraction ... 22
2.5 Tracking ... 23
2.5.1 Tracking with the Kalman filter ... 23
2.6 Tracking of movements ... 24
2.7 Stereopsis ... 24
2.7.1 Image rectification ... 24
2.7.2 Stereo correspondence ... 31
2.7.3 3D reconstruction ... 38
Chapter 3 Calibration ... 42
3.1 Camera model ... 42
3.2 Lens distortion ... 43
3.3 Single camera calibration ... 45
3.4 Stereo camera calibration ... 45
3.5 Stereo calibration optimisation ... 46
3.5.1 Nonlinear optimisation ... 46
3.5.2 Optimisation using Newton method ... 46
3.5.3 Optimisation using non-linear least squares method ... 47
3.6 Implementation results ... 48
Chapter 4 Stereo correspondence ... 50
4.1 The Image Formation Model ... 50
4.2 Image Rectification Algorithm 1 ... 51
4.3 Image Rectification Algorithm 2 ... 55
4.4 Calculating the Stereo Correspondence ... 59
Chapter 5 3D reconstruction ... 60
5.1 Epipolar geometry ... 60
5.2 The fundamental matrix F ... 62
5.3 Geometric derivation of F (for reference) ... 62
5.4 Algebraic derivation of F ... 63
5.5 Fundamental matrix properties ... 63
5.6 The essential matrix ... 64
5.7 Kronecker Product ... 64
5.8 The 8 point linear algorithm ... 65
5.9 The universal scale ambiguity ... 65
5.10 Finding translation vector T from E ... 65
5.11 Finding rotation matrix R from E ... 66
5.12 Estimating depths given rotation matrix R and translation vector T ... 67
5.13 Algorithm analysis ... 68
Chapter 6 Marker-based tracking ... 70
6.1 Choice of markers ... 70
6.2 Background subtraction ... 70
6.2.1 Basic methods ... 70
6.2.2 Implementation ... 71
6.3 Edge detection ... 72
6.3.1 Introduction ... 72
6.3.2 Canny edge detection algorithm ... 73
6.4 Colour detection ... 74
6.4.1 HSV colour space ... 75
6.4.2 Transformation from RGB to HSV ... 75
6.4.3 Gaussian blur ... 76
6.4.4 Dilation and erosion ... 76
6.4.5 Implementation ... 77
6.5 Spatial moments and centre of gravity ... 78
Chapter 7 Evaluation ... 79
7.1 Calibration evaluation ... 79
7.1.1 Experiments and results ... 79
7.1.2 Results analysis ... 81
7.1.3 Results comparison ... 83
7.2 Image rectification evaluation ... 84
7.2.1 Rectification algorithm 1 ... 84
7.2.2 Rectification algorithm 2 ... 85
7.2.3 Conclusion ... 88
7.3 Markers detection evaluation ... 88
7.3.1 Stereo pair 1 ... 89
7.3.2 Stereo pair 2 ... 90
7.4 3D reconstruction evaluation ... 92
7.4.1 No motion ... 92
7.4.2 Forwards-backwards motion ... 93
7.5 Electromagnetic tracking system vs. Video-based tracking system ... 95
7.5.1 Up-down motion ... 95
7.5.2 Sideways motion ... 99
7.5.3 Summary and conclusion ... 102
Chapter 8 Conclusion and Future Work ... 104
8.1 Contributions ... 104
8.2 Future work ... 104


Chapter 1

Introduction

1.1 Motivation
There has been growing concern over the clinical competence of practising surgeons, and there is increasing consensus about the necessity to measure competence, set standards, and assess surgeons before any damage is done [1]. Although many factors influence surgical outcome, the skill of the surgeon in the operating theatre is very important. A skilfully performed operation is 75% decision making and 25% dexterity [2]; in some specialties, such as minimally invasive surgery, dexterity becomes even more important. One of the current state-of-the-art dexterity analysis systems is the Imperial College Surgical Assessment Device (ICSAD). ICSAD uses an electromagnetic tracking system and bespoke software to convert the positional data generated by its sensors into dexterity measures such as the number and speed of hand movements, the distance travelled by the hands and the time taken for the task. Despite its various advantages and advanced features, one of the main disadvantages of ICSAD is that the user has to wear special equipment, and the presence of cables may interfere with the surgeon's work. This motivates the design of a video-based hand tracking system that uses computer vision techniques to analyse the video sequences and reconstruct the 3D pose and motion of the hand. The vision-based approach has the greatest potential for widespread use because of its low cost and ease of use.

1.2 Goals
Our aim is to implement a system that uses stereoscopic techniques to track hand movements using marker gloves. This entails capturing video sequences/images of hand movements from the two cameras of the stereo system, then applying image processing techniques to reduce noise and detect the locations of the markers in each image of the stereo pair. We then carry out 3D reconstruction of the markers based on their 2D locations on the image planes obtained in the previous step. Having the 3D coordinates of the markers in the world coordinate system provides us with measures of dexterity from which surgical skills can be analysed and assessed. One of the main requirements of the

[1] Darzi A, Smith S, Taffinder N. Assessing operative skill: needs to become more objective [editorial]. BMJ 1999;318:887-888.
[2] Spencer F. Teaching and measuring surgical techniques: the technical evaluation of competence. Bull Am Coll Surg 1978;63:9-12.


system is to measure dexterity, such as the number and speed of hand movements, the distance travelled by the hands and the time taken for the task.
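As a concrete illustration, the dexterity measures listed above could be derived from a tracked 3D marker trajectory roughly as follows. This is a minimal Python sketch, not the project's actual code; the function name, the input format and the 5 mm/s movement threshold are assumptions for illustration only.

```python
import math

def dexterity_measures(positions, fps, move_threshold=5.0):
    """Derive simple dexterity measures from a 3D marker trajectory.

    positions: list of (x, y, z) marker coordinates (e.g. in mm), one per frame.
    fps: capture frame rate in frames per second.
    move_threshold: speed above which the hand counts as moving
                    (an assumed value; it would need tuning in practice).
    """
    total_distance = 0.0
    movements = 0
    moving = False
    for p, q in zip(positions, positions[1:]):
        step = math.dist(p, q)            # Euclidean distance between frames
        total_distance += step
        speed = step * fps
        if speed > move_threshold and not moving:
            movements += 1                # a new movement has just started
        moving = speed > move_threshold
    time_s = (len(positions) - 1) / fps   # time taken for the task
    return {
        "distance": total_distance,       # distance travelled by the hand
        "time_s": time_s,
        "movements": movements,           # number of distinct movements
        "mean_speed": total_distance / time_s if time_s > 0 else 0.0,
    }
```

A run of consecutive frames with above-threshold speed counts as one movement, which is one simple way to operationalise "number of hand movements".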

We use two Unibrain FireWire colour webcams positioned on the desk as illustrated below.

Figure 1: Stereo system set up

The accuracy of the hand movement tracking depends on the accuracy of the calibration system. Calibration of a stereo camera system has to be performed whenever the focal length, the relative positions, or the view directions of the cameras change (assuming, without loss of generality, that the world coordinate system coincides with one of the camera coordinate systems). Hence one of our implementation requirements is a calibration system that makes the job of calibration as painless as possible.
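To make the role of the calibration parameters concrete, the following sketch shows how a calibrated pinhole camera maps a 3D world point to pixel coordinates using the intrinsic matrix K (recovered by single-camera calibration) and the extrinsic rotation R and translation t (recovered by stereo calibration). This is an illustrative sketch only; the numeric values in the usage note below are invented.

```python
def project_point(X, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    K is the 3x3 intrinsic matrix (focal lengths and principal point);
    R (3x3) and t (length 3) are the extrinsic parameters mapping world
    coordinates into the camera frame.
    """
    # world -> camera coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # perspective division onto the normalised image plane
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]
    # the intrinsics map normalised coordinates to pixel coordinates
    u = K[0][0] * x + K[0][2]
    v = K[1][1] * y + K[1][2]
    return u, v
```

For example, with the world frame taken as the camera frame (R the identity, t zero), a point on the optical axis projects exactly to the principal point (cx, cy), which is a quick sanity check for any calibration result.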

1.3 Achievements
Based upon the requirements specified above, our implementation of the marker-based tracking system provides the following major features.

Capturing and calibration system
Our implementation allows the user to capture stereo video sequences in an easy-to-use way. It also allows the user to capture chessboard patterns for calibration, and provides options to calibrate a single camera (left/right) or to perform stereo calibration to obtain the stereo extrinsic parameters (details can be found in Chapter 3).

Stereo correspondence system
The user can rectify stereo images with two different algorithms (details can be found in Chapter 4) and perform stereo correspondence on a pair of stereo images.

3D reconstruction from stereo images
Our implementation supports 3D reconstruction in two cases: when both intrinsic and extrinsic parameters are available, and when only the intrinsic parameters are available.

Marker-based tracking
Markers of rectangular shape and a particular colour can be tracked effectively.


Image processing techniques such as Gaussian blur and background subtraction are used to improve the output of the marker tracking process. The displacement of the markers over a sequence of frames is produced as the final output.
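The two image processing steps just mentioned can be sketched as follows, on a greyscale image represented as a list of rows of floats. This is pure Python for clarity (the real system would use an image processing library), and the 30-intensity difference threshold is an assumed value.

```python
import math

def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur for a greyscale image (list of rows of floats)."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-i * i / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]   # normalise so brightness is preserved

    def blur_rows(img):
        height, width = len(img), len(img[0])
        out = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for i, k in enumerate(kernel):
                    xx = min(max(x + i - radius, 0), width - 1)  # clamp at borders
                    acc += k * img[y][xx]
                out[y][x] = acc
        return out

    # blur horizontally, then vertically by transposing twice
    transposed = [list(col) for col in zip(*blur_rows(image))]
    return [list(col) for col in zip(*blur_rows(transposed))]

def subtract_background(frame, background, threshold=30.0):
    """Foreground mask: 1 where the frame differs enough from a reference frame."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]
```

Blurring before subtraction suppresses per-pixel sensor noise, so fewer spurious foreground pixels survive the threshold.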

1.4 Thesis outline
Background: This chapter provides readers with detailed information on: different methods of assessing surgical skills, including a detailed analysis of the advantages and disadvantages of current solutions; current hand movement tracking systems, with detailed analysis; and background knowledge on stereo vision, including image processing, image rectification, epipolar geometry, and 3D reconstruction.
Calibration: This chapter provides essential knowledge on camera calibration, including single camera calibration, stereo system calibration, and available optimisation approaches for calibration. At the end of the chapter, it also presents results from our implementation.
Stereo correspondence: This chapter provides essential knowledge on stereo correspondence. It also presents in detail the two image rectification algorithms that have been implemented.
3D reconstruction: This chapter presents two approaches to the 3D reconstruction process. The first is simple stereo triangulation, used when we know both the intrinsic and extrinsic parameters of the system. The second applies when the extrinsic parameters are unknown.
Evaluation: In this chapter, we carry out experiments and test runs on our implementation, providing experimental results, analysis and conclusions for each part of our implementation.
Conclusion and Future work: This chapter summarises our achievements from the implemented system and outlines future improvements and research work.


Chapter 2

Background

2.1 Assessment of surgical skills [3]

2.1.1 Introduction
Although surgeons take formal examinations in surgical knowledge, there is no such requirement to demonstrate operative dexterity. Common sense suggests that technical skill does affect outcome. However, despite variation in operative results between surgeons [4], it has been impossible to relate outcome to surgical dexterity. A major reason for this is that we have no way of reliably assessing operative skill. This deficiency in assessment needs to be addressed [5]. Evaluation of surgical trainees is often limited to assessment of their theoretical knowledge by multiple choice questions and oral examinations. Assessment of technical skills in most teaching institutions is based on the subjective opinion of senior colleagues, often accumulated during unsystematic observations [6]. In recent years there has been increased interest in the assessment of surgical performance, especially with the use of new technology in general endoscopic surgery [7][8]. Surgical skills are accessible for analysis in three different environments:
1. Open or minimally invasive surgery (MIS) utilising traditional surgical tools and equipment;
2. A robotic system using a master/slave setup;
3. A simulator utilising a haptic device that generates force feedback in addition to a virtual reality graphic representation of the surgical scene.
All of these systems have a human-machine interface, through which visual, kinematic, and dynamic information flows back and forth between the surgeon and the environment. Minimally invasive surgical techniques present many advantages compared with open surgery. However, this new technology has introduced new problems, such as training individuals in the required skills and achieving an objective assessment of those skills based on quantitative data. Assessment done to date to evaluate surgical dexterity has been mostly highly subjective and lacking in quantitative data. Recent literature has commented on

[3] Moorthy K, Munz Y, Sarker SK, Darzi A. Objective assessment of technical skills in surgery. BMJ 2003;327:1032-1037.
[4] Fielding LP, Stewart-Brown S, Dudley HA. Surgeon-related variables and the clinical trial. Lancet 1978;ii:778-9.
[5] Smith R. All changed, changed utterly. BMJ 1998;316:1917-8.
[6] Bardram L, Rosenberg J. Assessment of surgeons' operative skills. Ugeskr Laeger 1999;161:5518.
[7] Davis CJ, Filipi CJ. A history of endoscopic surgery. In: Arregui ME, Fitzgibbons RJ Jr, Kathouda N, Mckernan JB, Reigh H, eds. Principles of Laparoscopic Surgery: Basic and Advanced Techniques. New York, NY: Springer-Verlag, 1995, pp. 3-20.
[8] Smith R. All changed, changed utterly: British medicine will be transformed by the Bristol case. BMJ 1998;316:1917-8.


the fact that skills evaluation needs to be made more objective. Objective assessment is essential because deficiencies in training and performance are difficult to correct without objective feedback [9].

2.1.2 Principles of assessment

Validity
− Construct validity is the extent to which a test measures the trait that it purports to measure. One inference of construct validity is the extent to which a test discriminates between various levels of expertise.
− Content validity is the extent to which the domain that is being measured is actually measured by the assessment tool; for example, while trying to assess technical skills we may actually be testing knowledge.
− Concurrent validity is the extent to which the results of the assessment tool correlate with the gold standard for that domain.
− Face validity is the extent to which the examination resembles real life situations.
− Predictive validity is the ability of the examination to predict future performance.

Reliability
− Reliability is a measure of a test's ability to generate similar results when applied at two different points in time [10]. When assessments are performed by more than one observer, another type of reliability test applies, referred to as inter-rater reliability, which measures the extent of agreement between two or more observers [11].

A system that can provide unbiased and objective measurement of surgical precision (rather than just speed) could help training, complement knowledge based examinations, and provide a benchmark for certification. A specific and sensitive test of operative competence could also detect important problems and might improve surgical outcome. Revealing underperformance early would allow for further training, or for career guidance towards other, less practical specialties. The surgical profession needs a reliable and valid method of assessing the operative skill of its members. A driving test may not be a guarantee against accidents, but it makes it less likely that you career off the road. Surgeons, the public, and politicians need reassurance.

2.1.3 Current methods of assessment in surgery and their limitations

Method of assessment              Validity                  Reliability
Procedure list with logs          Not applicable            Poor
Direct observation                Poor                      Modest
Direct observation with criteria  High                      High
Animal models with criteria       High                      High
Videotapes                        Proportional to realism   Proportional to realism
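The inter-rater reliability discussed in section 2.1.2 is commonly quantified with a chance-corrected agreement statistic such as Cohen's kappa. The following sketch (with invented example ratings) illustrates the calculation; it is an aside for illustration, not part of the thesis implementation.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of agreement and p_e the agreement expected by chance from each rater's
    marginal rating frequencies. 1 = perfect agreement, 0 = chance level.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)
              for c in set(count_a) | set(count_b))
    return (p_o - p_e) / (1 - p_e)   # undefined when p_e == 1
```

Two raters who always agree score kappa = 1, while raters whose agreement is no better than their marginal frequencies would predict score kappa = 0.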

[9] Scott DJ, Valentine RJ, Bergen PC, Rege RV, Laycock R, Tesfay ST, et al. Evaluating surgical competency with the American Board of Surgery in-training examination, skill testing, and intraoperative assessment. Surgery 2000;128:613-22.
[10] Reznick RK. Teaching and testing technical skills. Am J Surg 1993;165:358-61.
[11] Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997;84:273-8.


It is evident that some of the methods of assessment currently in use have poor validity and reliability.

Conventional written examinations
Written examinations have long been a mainstay of medical examinations. Formats include essays, short answers, multiple choice questions and, more recently, extended matching questions. The advantages of written examinations are that both the questions and the marking scheme can be standardised, and the process is easily demonstrated to be objective and fair. However, the content being assessed is limited either in scope (essay format) or depth (multiple choice format), and may not sufficiently assess the complex attributes essential for good practice. The written examination is, in general, an effective approach to assessing factual knowledge but has limited application for assessing decision making ability. It is perhaps least useful for assessing some of the complexities of the diagnostic process or the capacity for intra-operative judgement.

Conventional viva voce examinations
The "viva" has been an established part of the process of medical assessment for years. This examination allows some assessment of facts but may also allow exploration of the integration of information; for example, in a traditional "long case" examination a candidate is assessed around a diagnostic consultation with a patient. The strength of this format is that the examiners may explore topics with the candidate in greater or lesser depth, as appropriate, and may also examine the process by which a candidate comes to a particular decision or view. The weakness is that the process is not standardised and is difficult to make objective, so it may be unfair to some. Furthermore, the viva voce is a potentially threatening process, and some candidates may be disadvantaged by being more intimidated than others. Such examinations are most useful for exploring how a candidate arrives at a diagnosis and for assessing the judgement necessary in intra-operative decision making. The assessors can use the situation to re-state questions with subtle variations to ensure that the candidate truly understands the rationale for the decisions that he or she proposes.

Examinations such as the fellowship and membership of the Royal College of Surgeons (FRCS and MRCS), conducted jointly by the royal colleges, focus mainly on the knowledge and clinical abilities of the trainee and do not assess a trainee's technical ability. One study showed that no relation existed between the American Board of Surgery In-Training Exam (ABSITE) score and technical skill [12]. All trainees in the United Kingdom are required to maintain a log of the procedures performed by them, which is submitted at the time of examinations, at job interviews, and during annual assessments. However, it has been found that log books are indicative merely of procedural performance and not a reflection of operative ability, and they therefore lack content validity [13]. Time taken for a procedure does not assess the quality of performance and is an unreliable measure when used during real procedures, owing to the influence of various other factors. The assessment of technical skills by observation, as currently occurs in the operating room, is subjective. As the assessment is global and not based on specific criteria, it is

[12] Scott DJ, Valentine RJ, Bergen PC, Rege RV, Laycock R, Tesfay ST, et al. Evaluating surgical competency with the American Board of Surgery in-training examination, skill testing, and intraoperative assessment. Surgery 2000;128:613-22.
[13] Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative "bench station" examination. Am J Surg 1997;173:226-30.


unreliable. As it is influenced by the subjectivity of the observer, it would possess poor test-retest reliability, and it is also affected by poor inter-observer reliability, as even experienced senior surgeons show a high degree of disagreement when rating the skills of a trainee [14]. Morbidity and mortality data, often used as surrogate markers of operative performance, are influenced by patients' characteristics, and it is believed that they do not truly reflect surgical competence [15].

2.1.4 Objective methods of assessing technical skills

Checklists and global scores

The availability of set criteria against which technical skills can be assessed makes the assessment process more objective, valid, and reliable. It has been said that checklists turn examiners into observers, rather than interpreters, of behaviour, thereby removing the subjectivity of the evaluation process [16]. The wide acceptance of the objective structured clinical examination (OSCE) led a group in Toronto to develop a similar concept for the

[14] Reznick RK. Teaching and testing technical skills. Am J Surg 1993;165:358-61.
[15] Bridgewater B, Grayson AD, Jackson M, Brooks N, Grotte GJ, Keenan DJ, et al. Surgeon specific mortality in adult cardiac surgery: comparison between crude and risk stratified data. BMJ 2003;327:13-7.
[16] Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73:993-7.


assessment of technical skills [17]. Objective structured clinical examinations (OSCEs) are based on a series of stations, each of which has a self-contained question/item [18][19][20]. The format is typically of candidates rotating through several stations that can be taken in any order in a "round robin" manner. The potential content is wide ranging and, importantly, can be standardised. Questions/items may include, for example, pathology results with specific questions, radiographs for interpretation, discussing a diagnosis, taking a history from an actor/patient, or evaluation of clinical scenarios. Marking is by trained observers placed at each station. Candidates at each station are marked by the same observer in a standardised marking process which uses a scoring system established according to agreed objective criteria. The great advantage of this process is that a wide variety of material may be examined in a highly standardised way. The disadvantage is that the depth of assessment is often limited by time constraints. OSCEs are of great value in assessing the interpretation of information: whereas a written examination can ask about a test, the OSCE allows the presentation of a set of results and seeks an appraisal by the candidate. However, these assessments are limited in their capacity to explore a candidate's understanding of complex issues; the cases and situations presented may be suitable for in-depth analysis, but the time constraints and the requirements of a standardised marking schedule make this difficult.

The objective structured assessment of technical skills (OSATS) consists of six stations where residents and trainees perform procedures on live animal or bench models in fixed time periods. Performance during a task is assessed by using checklists specific to the operation or task and a global rating scale.
The checklist comprises a series of yes/no items which have been developed by analysis of the task and of the specific tuition provided in skills training sessions. The global scoring sheet comprises eight items, each of which is marked from 1 to 5. The items assessed include tissue handling skills, flow of operation, and familiarity with the technique. Examples of poor (score 1), average (score 3), and excellent (score 5) performance are given as guidelines for the observers. Both global and checklist scoring systems have been validated.

In general, global scoring assesses generic aspects of technical performance and has a broad applicability, whereas checklists are task specific. A new checklist must be developed and validated for each new task included in the assessment. Experience suggests that global scoring is a more effective discriminator between subjects than the checklist, perhaps because the checklist items, of necessity, have to be relatively straightforward components of a procedure and should not call upon the observer to exercise significant judgement in deciding whether to mark "yes" or "no".

OSATS have been widely used by Reznick et al and are being increasingly used elsewhere. They are useful in assessing technical skills in terms of knowledge and dexterity, but they do not offer the scope to assess judgement as the tasks are highly standardised. Currently, the methodology is well established as a research tool and is moving towards implementation within training schemes in some countries.

17. Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997;84:273-8.
18. Cuschieri A, Gleeson FA, Harden RM, et al. A new approach to a final examination in surgery. Use of the objective structured clinical examination. Ann R Coll Surg Engl 1979;61:400-5.
19. Reznick RK, Smee S, Baumber JS, et al. Guidelines for estimating the real cost of an objective structured clinical examination. Acad Med 1993;68:513-7.
20. Sloan DA, Donnelly MB, Schwartz RW, et al. The Objective Structured Clinical Examination. The new gold standard for evaluating postgraduate clinical performance. Ann Surg 1995;222:735-42.


The use of both checklists and global rating scales entails the presence of multiple faculty raters or extensive video watching. Systems therefore need to be developed that can produce an assessment of technical skills in real time, with little need for several observers. Dexterity analysis systems have the potential to address this issue.

Dexterity analysis systems

The Imperial College Surgical Assessment Device (ICSAD) system consists of three components: a commercially available electromagnetic tracking system (Isotrak II; Polhemus Inc, Colchester, VT) connected to a portable computer through a standard RS-232 (serial) port; independent motion acquisition software21; and bespoke software for converting the positional data generated by the sensors into dexterity measures such as the number and speed of hand movements, the distance travelled by the hands and the time taken for the task. The ICSAD device was integrated into a large-scale project involving the assessment of model-based tasks and real procedures carried out using robotic, conventional laparoscopic, as well as open surgery. New software, the Robotic Video and Motion Analysis Software

21. Dosis A, Bello F, Moorthy K, Munz Y, Gillies D, Darzi A. Synchronized Video and Motion Analysis for the Assessment of Procedures in the Operating Theater. Arch Surg 2005;140:293-9.


(ROVIMAS), was developed to provide concurrent qualitative and quantitative functionality22. This is based on the same framework as the original ICSAD analysis software, which ensures compatibility when comparing old and new data sets. All ICSAD components (acquisition and data analysis software) were integrated into ROVIMAS, with data retrieval and analysis performed automatically by the new software. Additional visual functionality includes Cartesian coordinate, distance and velocity graphs, as well as hand direction, polar graphs, and trajectories. Moreover, a new efficient graphical method called "density analysis" provides information regarding which part of the surgical scene was the focus of activity, the active scene surface, and the total task surface. The zooming functionality is of significant importance as it allows parts of procedures to be isolated according to the particular assessment process, saving valuable time. Each time zooming occurs, the number of movements and the hands' path length are recalculated and shown, with all graphs being updated accordingly.

Figure: Imperial College Surgical Assessment Device (ICSAD) data and video acquisition component. On the left side, information regarding the commercially available electromagnetic tracking system (Isotrak II; Polhemus Inc, Colchester, VT) is displayed along with the tracking status. On the right, the video capturing window displays the endoscopic surgical scene. Both motion and video streams are recorded using the same time stamp, achieving perfect synchronization.

The ROVIMAS software also includes a video frame capturing facility. Motion and video data are captured in a time-synchronous manner, achieving synchronization accuracy of less

22. Dosis A, Bello F, Rockall T, et al. ROVIMAS: a software package for assessing surgical skills using the da Vinci telemanipulator system. Paper presented at: The Fourth International Conference of Information Technology (ITAB 2003); April 24-27, 2003; Birmingham, England.


than 40 milliseconds23. Video frames are compressed in real time using the Microsoft VKI codec (Microsoft Corporation, Redmond, Wash) at a frame size of 320x240 pixels. It is also possible to capture single full-sized bitmap pictures while recording a task.

Figure: Snapshot of the Robotic Video and Motion Analysis Software (ROVIMAS) showing the crown pattern during a laparoscopic cholecystectomy case, in which each pair of two high-velocity peaks represents extraction-insertion of the scissors-stapler instrument. Users can browse these data using the "browsing bar" and watch the corresponding video frames from the video player. P indicates path.

The main advantage of this system is that, through video synchronization, it is possible to assess the operation using the ROVIMAS-integrated video player and to concurrently study the corresponding motion signal trace. An application that identifies high-velocity movements and associates them with erroneous movements was developed previously. A playback line bar has been implemented on the signal trace to enable searching for a particular part of the procedure, with the dexterity graph changing in synchrony. Alternatively, one can zoom into the desired signal regions and the video will play only the required part rather than the whole operation. ROVIMAS also uses the same concept of motion and video synchronization when data are acquired from the da Vinci tele-manipulator system.

Studies have shown the construct validity of the Imperial College surgical assessment device with respect to a range of laparoscopic and open surgical tasks24. The correlation between

23. Dosis A, Bello F, Moorthy K, Munz Y, Gillies D, Darzi A. Real-time synchronization of kinematic and video data for the comprehensive assessment of surgical skills. Paper presented at: the 12th Conference on Medicine Meets Virtual Reality; January 15, 2004; Los Angeles, Calif.
24. Datta V, Mackay S, Mandalia M, Darzi A. The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. J Am Coll Surg 2001;193:479-85.


dexterity and previous laparoscopic experience on a simple task in a box trainer25 and on more complex tasks such as laparoscopic cholecystectomy on a porcine model is strong26. Experienced and skilled laparoscopic surgeons are more economical in terms of the number of movements and more accurate in terms of target localisation, and therefore use much shorter paths.

Other motion analysis systems

Motion tracking can be based on electromagnetic, mechanical, or optical systems. The advanced Dundee endoscopic psychomotor trainer (ADEPT) was originally designed as a tool for the selection of trainees for endoscopic surgery, based on the ability of psychomotor tests to predict innate ability to perform relevant tasks. Studies have shown the validity and reliability of the trainer.

Optical motion tracking systems consist of infrared cameras surrounded by infrared light emitting diodes. The infrared light is reflected off sensors that are placed on the limb of a surgeon. Software is used to extrapolate the positional data of the markers to data on movement analysis. The disadvantage of such optical systems is that they suffer from disturbances to the line of vision: if the camera and signal generators become obscured from the markers, the resulting loss of link leads to lost data. Signal overlap also prevents the use of markers on both limbs, making these systems restrictive.

Virtual reality

Virtual reality is defined as a collection of technologies that allow people to interact efficiently with three-dimensional computerised databases in real time by using their natural senses and skills27. Virtual reality (VR) is a technology that holds out the exciting prospect of including simulation as part of the training and assessment of surgical performance28 29. It offers the opportunity to learn a new skill without the pressure of a clinical situation.
Moreover, it is theoretically possible for the learner to have repeated practice and tuition on any weak aspects of a given procedure. Beyond this, VR offers very detailed feedback on the progress being made and may allow more subtle measurement of performance than is possible in a "real world" setting. From the patient's point of view, it is obviously preferable that trainees' attempts at a new procedure are performed on a simulator rather than on a patient.

Surgical virtual reality systems allow interaction to occur through an interface, such as a laparoscopic frame with modified laparoscopic instruments. The minimally invasive surgical trainer-virtual reality (MIST-VR) system was one of the first virtual reality laparoscopic simulators developed as a task trainer. The system was developed as the result of collaboration between surgeons and psychologists who performed a task analysis of laparoscopic cholecystectomy. This resulted in a toolkit of skills needed to perform the

25. Taffinder N, Smith S, Mair J, Russell R, Darzi A. Can a computer measure surgical precision? Reliability, validity and feasibility of the ICSAD. Surg Endosc 1999;13(suppl 1):81.
26. Smith SG, Torkington J, Brown TJ, Taffinder NJ, Darzi A. Motion analysis. Surg Endosc 2002;16:640-5.
27. McCloy R, Stone R. Science, medicine, and the future. Virtual reality in surgery. BMJ 2001;323:912-5.
28. Michael Isard and Andrew Blake. Condensation - conditional density propagation for visual tracking. Int J Comput Vision 1998;29(1):5-28.
29. Ryugo Kijima and Michitaka Hirose. Fine object manipulation in virtual environment. In VE '95: Selected papers of the Eurographics workshops on Virtual Environments '95, pages 42-58, London, UK, 1995. Springer-Verlag.


procedure successfully30. These were then replicated in the virtual domain by producing three-dimensional images of shapes, which users can manipulate. As virtual reality simulators are computer-based systems, they generate output data, commonly referred to as metrics. In an international workshop, a group of experts reviewed all methods of assessment and suggested parameters that should constitute output metrics for the assessment of technical skills31. Included were variables such as economy of movement, length of path, and instrument errors.

The MIST-VR system has been validated extensively for the assessment of basic laparoscopic skills. These systems are low-fidelity task trainers that are effective only in the assessment of basic skills. More advanced and higher-fidelity virtual reality systems such as the LapMentor (www.simbionix.com) can be used as procedural trainers and for the assessment of procedural skill. One of the main advantages of virtual reality systems, in comparison with dexterity analysis systems, is that they provide real-time feedback about skill-based errors.

Figure: Minimally invasive surgical trainer-virtual reality (MIST-VR)

2.2 Hand Tracking System

2.2.1 Tracking methods

• Magnetic tracking: Magnetic tracking relies on a base station that emits a magnetic field. Sensors consisting of three orthogonal wire coils are used to calculate the position and orientation of the sensor within the magnetic field. To track a hand accurately, a sensor is placed on every finger as well as on the palm to calculate the hand posture. Magnetic tracking is very accurate but also very sensitive to metal objects in the environment, which distort the measurements. A second disadvantage is that the system is very obtrusive, as sensors and wires are attached to the hand.

30. Dosis A, Bello F, Moorthy K, Munz Y, Gillies D, Darzi A. Real-time synchronization of kinematic and video data for the comprehensive assessment of surgical skills. Paper presented at: the 12th Conference on Medicine Meets Virtual Reality; January 15, 2004; Los Angeles, Calif.
31. Rosen J, MacFarlane M, Richards C, Hannaford B, Sinanan M. Surgeon-tool force/torque signatures - evaluation of surgical skills in minimally invasive surgery. Medicine Meets Virtual Reality: The Convergence of Physical & Informational Technologies: Options for a New Era in Healthcare 1999;62:290-6.


• Acoustic tracking: Acoustic tracking is similar to magnetic tracking, but a sound emitting device is used instead of the wire coils. At least three microphones are used to triangulate the position of the sound emitting device; the orientation of a single sensor cannot be measured. Problems with acoustic tracking are the obtrusive sensors, similar to magnetic tracking, and the sensitivity to acoustic reflections.

• Data gloves: Data gloves are gloves that contain different sensors which measure finger and wrist bending. These measurements can be used to calculate a 3D model of the hand. Additional position and orientation tracking (for example using magnetic tracking) gives the user a fully tracked 3D hand. However, data gloves are obtrusive and expensive.

• Optical tracking: Optical tracking is based on tracking the user's hands using one or more cameras. Computer vision techniques are used to analyse the video sequence and reconstruct the 3D pose and motion of the hand. Advantages are that the user does not have to wear any special equipment and that the equipment is relatively cheap. Problems are that fingers are difficult to track because of occlusion and fast motion.

This work uses the vision-based approach as it has the greatest potential for widespread use because of its price and ease of use.

2.3 Hand Localisation

Tracking systems have to localise the hand in the image before feature extraction can be done. Localising the hand remains problematic with respect to illumination changes and background clutter. Hand localisation approaches can be grouped into marker detection, background subtraction and skin colour detection.

2.3.1 Markers detection

The use of markers avoids the problem of separating foreground from background: the markers are found directly in the image32. Often reflective markers in combination with infrared light are used for tracking; the markers then appear as light points in an otherwise dark image33. A drawback is that this requires an infrared light source and a filter on the camera.

Gloves fitted with retro-reflective markers: ring-shaped markers composed of small stripes of

32. James Davis and Mubarak Shah. Recognizing hand gestures. In ECCV '94: Proceedings of the Third European Conference on Computer Vision (vol. 1), pages 331-340. Springer-Verlag New York, Inc., 1994.
33. K. Dorfmueller-Ulhaas and D. Schmalstieg. Finger tracking for interaction in augmented environments. In 2nd International Symposium on Augmented Reality (ISAR'01), New York, NY, October 2001. ACM/IEEE.


reflector material wrapped around the finger joints, or balls displaced on the back of the fingers, have good retro-reflective properties and are often not significantly occluded by the fingers themselves.
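As an illustration of the marker detection just described, a minimal sketch might threshold the image for bright pixels and take the centroid of each connected bright region. The threshold and minimum blob area below are assumed values, not those used in this project:

```python
import numpy as np
from collections import deque

def detect_markers(gray, thresh=200, min_area=2):
    """Find bright retro-reflective markers as centroids of connected
    bright regions in a 2-D uint8 image. Threshold and minimum area
    are illustrative assumptions."""
    mask = gray >= thresh
    seen = np.zeros_like(mask, dtype=bool)
    centroids = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                # breadth-first search over the 4-connected bright component
                q, pix = deque([(y, x)]), []
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    pix.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(pix) >= min_area:
                    ys, xs = zip(*pix)
                    centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centroids
```

With an infrared filter on the camera, the thresholding step becomes trivial because the markers are the only bright structures in the image.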

2.3.2 Matching of markers

Applying the epipolar constraint does not solve the complete matching problem: it fails in cases where the corresponding feature for a marker in the first image is not the feature closest to the epipolar line in the second image. This can lead to erratic matching that combines image features which are not correlated in reality. Since the epipolar constraint module cannot detect such ambiguous cases based on the distance of a feature from the epipolar line, all matching features which lie within a small neighbourhood of the epipolar line must be considered. By using knowledge about the distances between the markers on the user's hands and some further constraints, the system is able to estimate the hand's pose. The constraints (assumptions) are as follows:
− The marker positions are located approximately in one 3D plane. While it is indeed possible to move a hand sideways to some degree, this constraint is sufficiently satisfied by the rigid marker skeleton.
− The distances between markers are specifically pre-determined.

A random marker is chosen. In the next step, the algorithm searches for a second marker located close to the surface of a sphere whose centre is the previous marker and whose radius is determined by the known marker distance. If no such marker can be found, the algorithm starts again with another arbitrarily chosen marker. If a marker can be found close to the sphere's surface, a second sphere is used to find the third marker position, and so forth. The algorithm completes successfully if a full path including all the markers has been found, if the identified 3D coordinates lie within a given threshold of a 3D plane, and if the polygon constructed from the joint positions is convex.

2.3.4 Background Subtraction34

Background subtraction techniques impose constraints on the background: they assume that the background is either uniform or static. The hand can then be localised by uniformity, background subtraction35 or segmentation by motion cues. A uniform background can only be achieved in very constrained lab settings. A temporally static background can be assumed depending on the camera setup. The simplest method for background subtraction is to assume a background of uniform colour; all pixels that have a different colour value are assumed to be part of the hand. Similarly, for static backgrounds, the image at system start-up is assumed to be background. A pixel is considered foreground if its colour value Ip differs by more than a certain threshold from the background pixel Ib:

| Ip − Ib | ≥ Θ

The quality of the background subtraction is sensitive to the choice of the threshold Θ.
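A direct implementation of this thresholding rule might look as follows; the threshold value is an arbitrary illustrative choice:

```python
import numpy as np

def foreground_mask(frame, background, theta=25):
    """Label as foreground every pixel whose value differs from the stored
    background by at least theta, i.e. |Ip - Ib| >= theta. The default
    theta is an assumed value."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff >= theta
```

The cast to a signed type avoids wrap-around when subtracting unsigned 8-bit images.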

34. M. Piccardi. Background subtraction techniques: a review. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3099-3104, October 2004.
35. J. Segen and S. Kumar. Fast and accurate 3D gesture recognition interface. In 14th International Conference on Pattern Recognition, volume 1, August 1998.


More robust techniques allow the background to change over time, for example averages, medians, running averages, mixtures of Gaussians, kernel density estimators, mean shift, and eigenbackgrounds. The running average method keeps the background image as a running average of the history of images; only pixels that are not considered foreground are used to update the average. This method allows other objects to enter the field of view of the cameras. Single and mixture Gaussian distributions have also been used to model the background pixels; in contrast to a simple threshold, these take the variability and multimodality of the background into account.
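A sketch of the running average update, in which only pixels currently classified as background are blended into the model; the learning rate alpha is an assumed value:

```python
import numpy as np

def update_background(background, frame, fg_mask, alpha=0.05):
    """Running-average background maintenance: background-classified pixels
    are blended towards the new frame, foreground pixels leave the model
    unchanged. alpha is an illustrative learning rate."""
    bg = background.astype(np.float64)
    blend = (1 - alpha) * bg + alpha * frame.astype(np.float64)
    bg[~fg_mask] = blend[~fg_mask]      # update only non-foreground pixels
    return bg
```

Excluding foreground pixels from the update prevents the hand itself from being absorbed into the background model when it rests in one place.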

2.3.5 Skin Colour Detection

Many hand tracking systems attempt to segment the hand region by applying skin colour detection36 37 38 39. Infrared camera systems that improve detection of the human hand can also be used40. Skin colour detection can be divided into three categories: explicit, parametric and non-parametric approaches. A good review of skin colour detection can be found in41. All of these approaches are based on pixel-based classification; in contrast,42 uses regional information to improve the segmentation. Most detectors use a brightness-invariant colour space for detection to handle illumination changes and noise.

Explicit Approach

The explicit approach to skin colour detection uses a rule-based classifier to differentiate between skin and non-skin pixels in some colour space. In43, explicit rules are given to find skin colour in daylight and flashlight conditions in the RGB colour space; they have been successfully used in an art installation for face detection.

36. S. Ahmad. A usable real-time 3D hand tracker. In 28th Asilomar Conference on Signals, Systems and Computers, pages 1257-1261. IEEE Computer Society Press, 1995.
37. Yoichi Sato, Makiko Saito, and Hideki Koike. Real-time input of 3D pose and gestures of a user's hand and its applications for HCI. In VR '01: Proceedings of the Virtual Reality 2001 Conference (VR'01), page 79. IEEE Computer Society, 2001.
38. F. Tomaz, T. Candeias, and H. Shahbazkia. Fast and accurate skin segmentation in color images. In First Canadian Conference on Computer and Robot Vision, pages 180-187, 2004.
39. Xiaojin Zhu, Jie Yang, and Alex Waibel. Segmenting hands of arbitrary color. In FG '00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, page 446, Washington, DC, USA, 2000. IEEE Computer Society.
40. Kenji Oka, Yoichi Sato, and Hideki Koike. Real-time tracking of multiple fingertips and gesture recognition for augmented desk interface systems. In FGR '02: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, page 429. IEEE Computer Society, 2002.
41. Vezhnevets V., Sazonov V., and Andreeva A. A survey on pixel-based skin color detection techniques. In GraphiCon 2003, pages 85-92, Moscow, Russia, September 2003.
42. Hannes Kruppa, Martin A. Bauer, and Bernt Schiele. Skin patch detection in real-world images. In DAGM-Symposium, pages 109-116, 2002.
43. Borut Batagelj, Franc Solina, and Peter Peer. 15 seconds of fame: an interactive, computer-vision based art installation. In MULTIMEDIA '04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 764-765, New York, NY, USA, 2004. ACM Press.


The advantage of the explicit approach is its simplicity and speed for classification; however, it is less accurate than the other methods.
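For illustration, one widely quoted explicit daylight rule in the RGB colour space (in the spirit of the rules used in43) can be written directly as a predicate; treat the exact thresholds as indicative rather than definitive:

```python
def is_skin_rgb(r, g, b):
    """Explicit daylight skin rule in RGB, after the rule popularised by
    Peer and colleagues. The thresholds are the commonly cited ones and
    should be treated as indicative."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15   # enough colour spread
            and abs(r - g) > 15 and r > g and r > b)
```

Such a rule classifies each pixel in a handful of comparisons, which is why the explicit approach is the fastest of the three.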

Non-Parametric Approach

The non-parametric approach uses a classifier such as a neural network or a Bayesian classifier to distinguish skin from non-skin pixels. The classifier needs to be trained with sample images in which the skin colour is labelled by the user. The probability that a coloured pixel c is skin is given by:

P(skin | c) = P(c | skin)·P(skin) / [ P(c | skin)·P(skin) + P(c | ¬skin)·P(¬skin) ]

The rule P(skin | c) ≥ Θ can then be used to distinguish between skin and non-skin pixels. For efficiency, the colour space is divided into regular bins. Because the classifier is trained using sample images, it is less dependent on the choice of colour space.
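A minimal sketch of such a histogram-based Bayesian classifier, with the colour space divided into regular bins; the bin count, class names and Laplace smoothing are illustrative assumptions:

```python
import numpy as np

class HistogramSkinModel:
    """Non-parametric Bayes skin classifier: the RGB cube is divided into
    regular bins and P(c|skin), P(c|non-skin) are estimated from labelled
    training pixels. Bin count and smoothing are assumed details."""
    def __init__(self, bins=32):
        self.bins = bins
        self.skin = np.ones((bins,) * 3)      # Laplace-smoothed counts
        self.nonskin = np.ones((bins,) * 3)
        self.p_skin = 0.5

    def _idx(self, rgb):
        r, g, b = (int(v) * self.bins // 256 for v in rgb)
        return r, g, b

    def train(self, skin_pixels, nonskin_pixels):
        for p in skin_pixels:
            self.skin[self._idx(p)] += 1
        for p in nonskin_pixels:
            self.nonskin[self._idx(p)] += 1
        n_s, n_n = len(skin_pixels), len(nonskin_pixels)
        self.p_skin = n_s / (n_s + n_n)       # prior P(skin)

    def prob_skin(self, rgb):
        """Bayes' rule, term for term as in the formula above."""
        i = self._idx(rgb)
        pc_s = self.skin[i] / self.skin.sum()
        pc_n = self.nonskin[i] / self.nonskin.sum()
        num = pc_s * self.p_skin
        return num / (num + pc_n * (1 - self.p_skin))
```

Thresholding `prob_skin` at Θ gives the pixel-wise decision rule described in the text.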

Parametric Approach

In contrast to the non-parametric approach, the parametric approach attempts to model the skin colour distribution using a mathematical model. The simplest model is a single Gaussian:

P(c | skin) = 1 / (2π |ΣS|^(1/2)) · exp( −(1/2) (c − μS)^T ΣS^(−1) (c − μS) )

Other, more elaborate models have also been used. The more specific and higher-order a model, the better it can fit the true skin distribution; the problem is to find a good training set without overfitting. The validity of the mathematical models depends on the distribution of the skin colour and thus on the colour space.
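A single-Gaussian skin model can be fitted by maximum likelihood from labelled skin pixels. The following sketch (function names are assumptions) fits the mean and covariance and evaluates the density of the formula above for a 2-D chrominance vector:

```python
import numpy as np

def fit_gaussian(skin_pixels):
    """Maximum-likelihood fit of the mean and covariance of a single
    Gaussian skin colour model from labelled skin pixels."""
    X = np.asarray(skin_pixels, dtype=float)
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, sigma

def gaussian_density(c, mu, sigma):
    """Evaluate P(c|skin) for a 2-D colour vector c, term for term as in
    the single-Gaussian formula above."""
    d = np.asarray(c, dtype=float) - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * d @ inv @ d)
```

Working in a brightness-invariant chrominance plane, as most detectors do, keeps the model two-dimensional and the covariance well conditioned.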

2.4 Feature extraction

Feature extraction uses the output from the hand localisation step. A description of different feature detection techniques for use with model-based approaches is presented here. More information can be found in44, which also discusses feature extraction techniques for appearance-based methods.
− Silhouettes are one of the simpler features to extract and can be easily matched against the projection of the 3D model45. This technique depends highly on the successful localisation of the hand.
− Contours and edges are an extension of the silhouette scheme as they use more information in the image. Most notably, the edges of fingers can be used in the

44. Vladimir Pavlovic, Rajeev Sharma, and Thomas S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677-695, 1997.
45. J. J. Kuch and T. S. Huang. Vision based hand modeling and tracking for virtual teleconferencing and telecollaboration. In ICCV '95: Proceedings of the Fifth International Conference on Computer Vision, page 666. IEEE Computer Society, 1995.


matching as they will have parallel segments. Much information can be obtained by just extracting the contours of objects in the image. The contour represents the shape of the hand and is therefore not directly dependent on skin colour and lighting conditions. Extracting contours by edge detection will result in a large number of edges, both from the tracked hand and from the background. Therefore some form of intelligent post-processing is needed to make a reliable system.
− Fingertip locations are frequently used as feature points as they enhance the 3D model matching. Fingertips have been extracted based on markers, template matching or curvature pattern matching. The major problem of fingertip detection is occlusion.
− Markers are used to extract fingertips, but they are also used for the hand palm region. Markers seem to be the easiest feature to extract because their characteristics are known. Again, occlusion is one major problem.

2.5 Tracking

The tracking step is responsible for using the extracted features to estimate the hand parameters of the model. Marker-based (and to some extent fingertip-based) approaches can directly use the position of the markers to estimate the hand pose if multiple cameras are used. Contour, edge and silhouette approaches need to minimise the difference between the projected model and the image features. Error minimisation between model and image, as well as occlusion handling, can be improved using parameter prediction algorithms.

2.5.1 Tracking with the Kalman filter

One way of solving the problem of tracking the movement of an object from frame to frame is by use of a Kalman filter. The Kalman filter models the dynamic properties of the tracked object as well as the uncertainties of both the dynamic model and the low-level measurements. Consequently, the output of the filter is a probability distribution representing both the knowledge and the uncertainty about the state of the object. The estimate of the uncertainty can be used to select the size of the search area in which to look for the object in the next frame. The Kalman filter is an elegant solution and easily computable in real time. However, the probability distribution of the state of the object is assumed to be Gaussian. As this is generally not the case, especially in the presence of background clutter, the Kalman filter in its basic form cannot robustly handle real-world tracking tasks on an unknown background46. However, on a controlled background good results can be obtained47. An attempt to avoid the limiting assumption of a normal distribution inherent in the Kalman filter was introduced in48 and denoted the CONDENSATION algorithm. The approach is to

46. John MacCormick. Stochastic Algorithms for Visual Tracking: Probabilistic Modelling and Stochastic Algorithms for Visual Localisation and Tracking. Springer-Verlag New York, Inc., 2002.
47. James M. Rehg and Takeo Kanade. DigitEyes: Vision-based hand tracking for human-computer interaction. In Workshop on Motion of Non-Rigid and Articulated Bodies, pages 16-24, 1994.
48. Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In ECCV (1), pages 343-356, 1996.


model the probability distribution with a set of random particles and perform all the involved calculations on this particle set. The group of methods to which the CONDENSATION algorithm belongs is generally referred to as random sampling methods, sequential Monte Carlo methods or particle filters.
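For the simpler of the two approaches, a minimal constant-velocity Kalman filter for tracking one 2-D marker position might look as follows; the state layout [x, y, vx, vy] and the noise levels are illustrative assumptions, not this project's implementation:

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 2-D constant-velocity Kalman filter for frame-to-frame
    tracking of a single marker. Noise levels q and r are assumed values."""
    def __init__(self, x0, y0, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])     # state [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                 # initial uncertainty
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # dynamics
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # position observed only
        self.Q = np.eye(4) * q
        self.R = np.eye(2) * r

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]            # predicted marker position

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

The covariance P after `predict()` could be used to size the search window for the marker in the next frame, as described above.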

2.6 Tracking of movements

In surgery, a single movement is defined as a change in velocity which reaches its maximum value as the movement occurs and then falls to near zero as the movement is completed. A single movement represents a change in position between two distinct points A and B in a time interval dt. Assuming that (xA, yA, zA) are the Cartesian coordinates of A and (xB, yB, zB) those of B, the distance dAB travelled between points A and B in the time interval dt is:

dAB = √[ (xB − xA)² + (yB − yA)² + (zB − zA)² ]

The movement pattern is then obtained by plotting all the distance values. This pattern can also be described in terms of velocity. Since the velocity (V) is the derivative of distance travelled in a time interval dt,

V = d(dAB) / dt

The slope of the line connecting the two distinct coordinate points determines the change of velocity. Surgical movements should be smooth and controlled; therefore it is important to be able to distinguish between sudden and more controlled movements. This can be achieved by using a low-pass Gaussian filter that smooths the signal, keeping only the low-frequency, smooth movements. The total number of movements can then be obtained by counting the local high peaks in the smoothed signal. The total path length is calculated by summing all the partial distances:

PL = Σ (i = 1 to N) di

, where N is the total number of partial distances and di is the distance between the Cartesian points Pi+1 and Pi. By applying the above to calculate the number of movements, the path length for the left and right hands, and the total time taken to perform a task, it is possible to quantifiably assess surgical dexterity by comparing the performance of experienced and novice surgeons on a simple surgical task. An experienced surgeon will complete the task with fewer movements and a shorter path length than an inexperienced trainee.
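The dexterity measures above (path length, and number of movements counted as peaks in the Gaussian-smoothed velocity signal) can be sketched as follows; the smoothing width and the exact peak definition are assumed details:

```python
import numpy as np

def dexterity_metrics(positions, sigma=2.0):
    """Compute total path length and number of movements from a sequence
    of 3-D marker positions. Smoothing width sigma is an assumed value."""
    P = np.asarray(positions, dtype=float)
    steps = np.linalg.norm(np.diff(P, axis=0), axis=1)   # di between Pi, Pi+1
    path_length = steps.sum()                            # PL = sum of di

    # low-pass Gaussian filter over the per-frame distance (velocity) signal
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    smooth = np.convolve(steps, kernel, mode='same')

    # a movement = a local peak in the smoothed velocity signal
    peaks = np.sum((smooth[1:-1] > smooth[:-2]) & (smooth[1:-1] > smooth[2:]))
    return path_length, int(peaks)
```

Run separately on the left-hand and right-hand marker trajectories, this yields the per-hand movement counts and path lengths used for comparing surgeons.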

2.7 Stereopsis

2.7.1 Image rectification


The pinhole camera model implies that the correspondent of a given point lies on its epipolar line in the other image. Corresponding points can then be found by scanning along epipolar lines. Unfortunately, the epipolar lines form a pencil of lines, i.e. they all go through a point called the epipole, which makes the scanning task fairly complex and thus inefficient. Rectification determines a transformation of each image plane such that pairs of conjugate epipolar lines become collinear and parallel to one of the image axes. In a binocular stereo setup the images are reprojected onto a plane parallel to the line between the optical centres, and in the case of a trinocular system onto the plane of the optical centres. The important advantage of rectification is that computing correspondences, a 2-D search problem in general, is reduced to a 1-D search problem, typically along the horizontal raster lines of the rectified images.

Perspective projection matrices

The relationship between 3D world coordinates and image coordinates can be written as

$$\begin{bmatrix} U \\ V \\ S \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad \text{(Equation 1)}$$

where x = U/S and y = V/S if S ≠ 0. Given a vector x = [x, y, …]^T, we use x̃ to denote the augmented vector obtained by appending 1 as the last element. Let P be the 3 × 4 matrix

$$P = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

which is called the camera perspective projection matrix (PPM). Given a 3D point M = [X, Y, Z]^T and its image m = [x, y]^T, (Equation 1) can be written

$$s\tilde{m} = P\tilde{M} \qquad \text{(Equation 2)}$$

where s = S is an arbitrary nonzero scalar. For a real image point, S should not be 0. If S = 0 then Z = 0, i.e. the point lies in the focal plane, and x and y are not defined. For all points in the focal plane other than the optical centre, the corresponding image points are at infinity. For the optical centre C we have U = V = S = 0, since X = Y = Z = 0. So far, 3D points have been expressed in the camera co-ordinate system. We go from the co-ordinate system centred at the optical centre C (camera co-ordinate system) to a new co-ordinate system centred at O (world co-ordinate system) by a rotation R followed by a translation t = CO:

$$M_c = RM_w + t$$

or, in augmented co-ordinates,

$$\tilde{M}_c = D\tilde{M}_w, \qquad \text{where} \qquad D = \begin{bmatrix} r_1 & t_x \\ r_2 & t_y \\ r_3 & t_z \\ 0^T & 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix}$$

and r_1, r_2, r_3 are the rows of R.

The matrix R and the vector t describe the orientation and position of the camera with respect to the new world coordinate system. These are the extrinsic parameters of the camera.
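As a concrete illustration of (Equation 2) and the change of co-ordinate system, the following minimal numpy sketch (arbitrary focal length and pose, not part of the original implementation) projects a world point through the composed matrix PD:

```python
import numpy as np

f = 2.0                                      # focal length (arbitrary units)
P = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]], dtype=float)    # camera PPM (Equation 1)

# Extrinsics: a trivial rotation R = I and a translation t, stacked into
# the 4x4 Euclidean transformation D
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
D = np.vstack([np.column_stack([R, t]), [0, 0, 0, 1]])

P_new = P @ D                                # PPM expressed in world coordinates

M_w = np.array([1.0, 2.0, 3.0, 1.0])         # augmented world point
U, V, S = P_new @ M_w                        # s*m_tilde = P_new * M_tilde  (Equation 2)
x, y = U / S, V / S                          # image coordinates, valid since S != 0
```

Here M_c = R·M_w + t = (1, 2, 4), so the projection gives x = f·X/Z = 0.5 and y = f·Y/Z = 1.0.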

$$\tilde{m} = P\tilde{M}_c = PD\tilde{M}_w$$

Hence the new perspective projection matrix is

$$P_{new} = PD$$

This tells us how the PPM P changes when we change co-ordinate systems: we simply multiply it on the right by the corresponding Euclidean transformation.

The projection matrix can be decomposed into the product

$$\tilde{P} = A[R \mid t] = T_i T_e$$

T_e maps from world to camera co-ordinates and depends only on the extrinsic parameters of the stereo rig; T_i maps from camera to pixel co-ordinates, depends only on the intrinsic parameters, and has the following form:

$$T_i = \begin{bmatrix} -fk_u & 0 & u_0 \\ 0 & -fk_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where f is the focal length in millimetres, k_u and k_v are the scale factors along the u and v axes respectively (the number of pixels per millimetre), and α_u = −fk_u and α_v = −fk_v are the focal lengths in horizontal and vertical pixels, respectively. Let p_{ij} be the (i, j) entry of matrix P. Eliminating the scalar s in (Equation 2) yields two nonlinear equations:

$$s\begin{bmatrix} U \\ V \\ S \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

$$u = \frac{p_{11}X + p_{12}Y + p_{13}Z + p_{14}}{p_{31}X + p_{32}Y + p_{33}Z + p_{34}} \qquad \text{(Equation 3)}$$

$$v = \frac{p_{21}X + p_{22}Y + p_{23}Z + p_{24}}{p_{31}X + p_{32}Y + p_{33}Z + p_{34}} \qquad \text{(Equation 4)}$$

In camera calibration we first estimate the matrix P and then deduce the camera parameters from P. (Equation 3) and (Equation 4) are two nonlinear equations relating 2D to 3D co-ordinates. This implies that each pair of an identified image point and its corresponding point on the calibration object provides two constraints on the intrinsic and extrinsic parameters of the camera. The number of unknowns is 11. If we write the projection matrix as

$$\tilde{P} = \begin{bmatrix} q_1^T & p_{14} \\ q_2^T & p_{24} \\ q_3^T & p_{34} \end{bmatrix} = (Q \mid \tilde{p}) \qquad \text{(Equation 5)}$$

we see that the plane $q_3^T w + p_{34} = 0$ (S = 0) is the focal plane, and that the planes $q_1^T w + p_{14} = 0$ and $q_2^T w + p_{24} = 0$ intersect the retinal plane in the vertical (U = 0) and horizontal (V = 0) axes of the retinal co-ordinates, respectively. The optical centre c is the intersection of the three planes; therefore $\tilde{P}(c^T, 1)^T = 0$, i.e. $(Q \mid \tilde{p})(c^T, 1)^T = 0$, hence

$$c = -Q^{-1}\tilde{p} \qquad \text{(Equation 6)}$$

Given the matrix $\tilde{P}$ and an image point m, we can obtain the equation of the 3D line defined by the optical centre c and the pixel point m. This line, called the optical ray defined by m, is

the set of points $\{w : \tilde{m} = \tilde{P}\tilde{w}\}$. Any point on it projects to the single point m. We already know that c is on the optical ray; to define the ray we need another point. We can choose the point g that satisfies (Equation 2) with scale factor s = 1:

$$\tilde{m} = (Q \mid \tilde{p})\begin{pmatrix} g \\ 1 \end{pmatrix} = Qg + \tilde{p} \quad\Rightarrow\quad g = Q^{-1}(\tilde{m} - \tilde{p})$$

A point on the optical ray is thus

$$w = c + \lambda(g - c) = c + \lambda Q^{-1}\tilde{m} \qquad \text{(Equation 7)}$$

Epipolar geometry

All the epipolar lines in one image plane pass through a common point called the epipole, which is the projection of the conjugate optical centre:

$$\tilde{e}_2 = \tilde{P}_2 \begin{pmatrix} c_1 \\ 1 \end{pmatrix} \qquad \text{(Equation 8)}$$

The rectification transformation

Once the rectifying PPM is known, we want to compute the linear transformation in projective co-ordinates (meaning we are using retinal co-ordinates from the original image)


that maps the retinal plane of the old PPM, $\tilde{P}_o = (Q_o \mid \tilde{p}_o)$, onto the retinal plane of the new PPM, $\tilde{P}_n = (Q_n \mid \tilde{p}_n)$. For any 3D point w,

$$\tilde{m}_o = \tilde{P}_o\tilde{w}, \qquad \tilde{m}_n = \tilde{P}_n\tilde{w} \qquad \text{(Equation 9)}$$

Using (Equation 7) and (Equation 9),

$$\tilde{m}_n = \tilde{P}_n\begin{pmatrix} c + \lambda Q_o^{-1}\tilde{m}_o \\ 1 \end{pmatrix} = \tilde{P}_n\begin{pmatrix} c \\ 1 \end{pmatrix} + \lambda Q_n Q_o^{-1}\tilde{m}_o$$

Assuming that rectification does not alter the optical centre, $\tilde{P}_n(c^T, 1)^T = 0$ and hence

$$\tilde{m}_n = \lambda Q_n Q_o^{-1}\tilde{m}_o \qquad \text{(Equation 10)}$$

Let $T = Q_n Q_o^{-1}$; the transformation T is then applied to the original image to produce a rectified image.

Constraints of the rectification projection matrices

In the following we describe the constraints necessary to guarantee rectification. We define the rectifying PPMs as

$$\tilde{P}_{n1} = \begin{bmatrix} a_1^T & a_{14} \\ a_2^T & a_{24} \\ a_3^T & a_{34} \end{bmatrix}, \qquad \tilde{P}_{n2} = \begin{bmatrix} b_1^T & b_{14} \\ b_2^T & b_{24} \\ b_3^T & b_{34} \end{bmatrix}$$

Common focal planes

If the cameras share the same focal plane, the common retinal plane is constrained to be parallel to the baseline and the epipolar lines are parallel. The two rectifying PPMs have the same focal plane if and only if

$$a_3 = b_3, \qquad a_{34} = b_{34} \qquad \text{(Equation 11)}$$

Position of the optical centres

The optical centres of the rectifying perspective projections must be the same as those of the original projections, in order to allow image rectification without knowing the depth of the scene:


$$\tilde{P}_{n1}\begin{pmatrix} c_1 \\ 1 \end{pmatrix} = 0, \qquad \tilde{P}_{n2}\begin{pmatrix} c_2 \\ 1 \end{pmatrix} = 0 \qquad \text{(Equation 12)}$$

Expanding (Equation 12) row by row, we obtain six linear constraints:

$$a_1^T c_1 + a_{14} = 0, \quad a_2^T c_1 + a_{24} = 0, \quad a_3^T c_1 + a_{34} = 0$$
$$b_1^T c_2 + b_{14} = 0, \quad b_2^T c_2 + b_{24} = 0, \quad b_3^T c_2 + b_{34} = 0 \qquad \text{(Equation 13)}$$

Alignment of conjugate epipolar lines

The vertical co-ordinate of the projection of a 3D point onto the rectified retinal plane must be the same in both images, i.e. the y value for a point q in image 1 should be the same as the y value for the corresponding point in image 2 (the x value will of course be different):

$$\frac{a_2^T w + a_{24}}{a_3^T w + a_{34}} = \frac{b_2^T w + b_{24}}{b_3^T w + b_{34}} \qquad \text{(Equation 14)}$$

Using (Equation 11) we obtain

$$a_2 = b_2, \qquad a_{24} = b_{24} \qquad \text{(Equation 15)}$$
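The optical-centre condition (Equations 6 and 12) can be checked numerically: given a full-rank PPM $\tilde{P} = (Q \mid \tilde{p})$, the optical centre $c = -Q^{-1}\tilde{p}$ must satisfy $\tilde{P}(c^T, 1)^T = 0$. A minimal numpy sketch, with entirely hypothetical matrix values (not taken from the project):

```python
import numpy as np

# A hypothetical full-rank projection matrix P_tilde = (Q | p_vec)
P_tilde = np.array([[800.0,   0.0, 320.0,  10.0],
                    [  0.0, 800.0, 240.0, -20.0],
                    [  0.0,   0.0,   1.0,   2.0]])
Q, p_vec = P_tilde[:, :3], P_tilde[:, 3]

c = -np.linalg.inv(Q) @ p_vec              # optical centre (Equation 6)

# The optical centre satisfies P_tilde * (c, 1)^T = 0 (Equation 12)
residual = P_tilde @ np.append(c, 1.0)
```

The residual vector is zero up to floating-point error, confirming that c lies on all three planes defined by the rows of the PPM.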

Scale factor

Projection matrices are defined up to a scale factor. The common choice to fix it is a_{34} = 1 and b_{34} = 1; however, this brings about two problems:
– the intrinsic parameters become dependent on the choice of the world co-ordinate system;
– the resulting projection matrices do not satisfy the conditions guaranteeing that meaningful calibration parameters can be extracted from their entries, namely

$$\|a_3\| = 1, \qquad (a_1 \wedge a_3)^T(a_2 \wedge a_3) = 0 \qquad \text{(Equation 16)}$$

To avoid these problems we enforce


$$\|a_3\| = 1, \qquad \|b_3\| = 1 \qquad \text{(Equation 17)}$$

Orientation of the rectified retinal plane

We set the rectified focal planes to be parallel to the intersection of the two original focal planes, i.e.

$$a_3^T(f_1 \wedge f_2) = 0 \qquad \text{(Equation 18)}$$

where f_1 and f_2 are the third rows of P_{o1} and P_{o2} respectively. Owing to (Equation 11), the dual equation $b_3^T(f_1 \wedge f_2) = 0$ is redundant.

Orthogonality of the rectified reference frames

The intersections of the retinal plane with the planes $a_1^T w + a_{14} = 0$ and

$a_2^T w + a_{24} = 0$ correspond to the u and v axes of the retinal reference frame. For this reference frame to be orthogonal, the planes must be perpendicular, hence

$$a_1^T a_2 = 0, \qquad b_1^T b_2 = 0 \qquad \text{(Equation 19)}$$

Principal points

Given a full-rank 3×4 matrix satisfying the constraints (Equation 16), the principal point (u_0, v_0) is given by

$$u_0 = a_1^T a_3, \qquad v_0 = a_2^T a_3 \qquad \text{(Equation 20)}$$

We set the two principal points to (0, 0) and use (Equation 11) and (Equation 15) to obtain the constraints.

Focal lengths in pixels

The horizontal and vertical focal lengths in pixels are

$$\alpha_u = \|a_1 \wedge a_3\|, \qquad \alpha_v = \|a_2 \wedge a_3\| \qquad \text{(Equation 21)}$$

By setting the values of α_u and α_v, for example, to the values extracted from P_{o1}, we obtain the constraints

$$\|a_1 \wedge a_3\|^2 = \alpha_u^2, \qquad \|a_2 \wedge a_3\|^2 = \alpha_v^2, \qquad \|b_1 \wedge b_3\|^2 = \alpha_u^2 \qquad \text{(Equation 22)}$$
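Equations 20 and 21 can be verified on a small example. The sketch below (hypothetical values; the matrix is built with R = I so its 3×3 part equals the intrinsic matrix, and the rows satisfy Equations 16 and 17) recovers the principal point and focal lengths from the rows of the PPM:

```python
import numpy as np

# Rows of the 3x3 part of a PPM built from known intrinsics (R = I, t = 0),
# chosen so that ||a3|| = 1 and (a1 x a3).(a2 x a3) = 0 hold (Equations 16/17).
Q = np.array([[-800.0,    0.0, 320.0],
              [   0.0, -600.0, 240.0],
              [   0.0,    0.0,   1.0]])
a1, a2, a3 = Q

u0 = a1 @ a3                                 # principal point, Equation 20
v0 = a2 @ a3
alpha_u = np.linalg.norm(np.cross(a1, a3))   # focal lengths in pixels, Equation 21
alpha_v = np.linalg.norm(np.cross(a2, a3))
```

This recovers (u_0, v_0) = (320, 240) and (α_u, α_v) = (800, 600), the values the matrix was built from.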

2.7.2 Stereo correspondence

The dense correspondence problem consists of finding a unique mapping between the points belonging to two images of the same scene. If the camera geometry is known, the images can be rectified and the problem reduces to the stereo correspondence problem, where points in one image can correspond only to points along the same scan line in the other image. If the geometry is unknown, we have the optical flow estimation problem. In both cases, regions in one image which have no counterparts in the other image are referred to as occlusions (or, more correctly, as half-occlusions).

Complications in solving the correspondence problem

The task of matching points between the two images is known as the correspondence problem. It is made difficult by several known complications:
− Noise: due to quantisation error, imperfect optics, noise in the imaging system, lighting variation between images, specular reflection, etc., the feature values for corresponding points in the left and right images often differ.
− Indistinct image features: many images contain large regions of constant luminance and are effectively featureless in those regions. Even with near-perfect measurements and minimal lighting variation between images, the matching is still ambiguous for a significant number of pixels.
− Salient 3D features: most stereo scenes contain salient features in the 3D scene geometry (discontinuities in depth at object boundaries, discontinuities in surface orientation, and steeply sloping surfaces) which must be preserved to produce accurate reconstructions. Many of the methods used to minimise the first two complications smooth over these salient features.
− Half-occlusion: due to occlusion there are almost certainly whole regions of half-occluded points which appear in only one image and consequently have no match at all. In fact this problem is twofold:
  o First, there is the problem of incorrectly matching half-occluded points to mutually visible points and obtaining wildly inaccurate depth estimates.
  o Second, even if a point can be identified as half-occluded, what depth should be assigned to it?

For years, solutions to the correspondence problem were offered without adequately addressing all of these complications. Many convincingly argued that the complications caused by noise and indistinct image features could be minimised by enforcing constraints on the estimated disparities. In the early area-based algorithms, the disparity was assumed to be constant within a fixed-size window. For camera set-ups with parallel optical axes, this assumption is equivalent to assuming that the observed surfaces are fronto-parallel. The disparity was determined by


matching a window of points in the left image with a window in the right image and then choosing, for each point in the left image, the disparity which gave rise to the best match49. Others50 integrated a type of smoothness (flatness) constraint into the matching process, again biasing toward reconstructions with constant disparity. Poggio et al. (1985) and Matthies (1992)51 elaborated on this idea to impose smoothness as a soft constraint in an energy/cost functional that biased toward depth maps where the disparity gradient was small. Pollard et al. (1985) proposed a clever, and somewhat less restrictive, assumption about the nature of the observed surfaces: the disparity gradients within a window should not exceed some pre-chosen threshold. While algorithms using smoothness constraints proved effective in handling the first two complications, their performance deteriorated at salient features in the scene geometry. Discontinuities in depth at object boundaries or discontinuities in surface orientation were either smoothed over or caused the algorithm to produce erratic results. Marroquin et al. (1987) and Yuille (1989)52 maintained that if a smoothness prior is used to influence the matching, there must be some mechanism for suspending the smoothing at the boundaries of objects. The suggestion was that "line processes" (i.e. binary random processes) used to solve the image segmentation problem53 should be used to explicitly represent discontinuities in depth. While this observation was a significant theoretical step toward preserving the boundaries of objects, it overlooked three important complications:
− First, the introduction of line processes to model object boundaries gave rise to highly non-linear optimisation problems, for which no adequate optimisation strategies were proposed.
− Second, no prescription was given for preserving other salient features in the scene geometry, namely steeply sloping surfaces and discontinuities in surface orientation.
− Third, what makes stereopsis different from the segmentation problem is that, in addition to identifying boundaries across which smoothing should be suspended due to a discontinuity, algorithms must also identify whole regions of half-occlusion caused by the discontinuity.

49 Gennery, D. 1980. Modeling the Environment of an Exploring Vehicle by Means of Stereo Vision. PhD thesis, Stanford University; Lucas, B. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proc. 7th International Joint Conference on Artificial Intelligence, pp. 674–679.
50 Julesz, B. 1971. Foundations of Cyclopean Perception. University of Chicago Press; Marr, D. and Poggio, T. 1979. A computational theory of human stereo vision. Proc. R. Soc. London B, 204:301–328.
51 Poggio, T., Torre, V. and Koch, C. 1985. Computational vision and regularisation theory. Nature, 317:314–319; Matthies, L. 1992. Stereo vision for planetary rovers: stochastic modeling to near real-time implementation. Int'l Journal of Computer Vision, 8(1):71–91.
52 Marroquin, J., Mitter, S. and Poggio, T. 1987. Probabilistic solutions of ill-posed problems in computational vision. Journal of the Am. Stat. Assoc., 82(397):76–89; Yuille, A. 1989. Energy functions for early vision and analogue networks. Biological Cybernetics, 61:115–123.
53 Geman, S. and Geman, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence, 6:721–741; Mumford, D. and Shah, J. 1985. Boundary detection by minimising functionals. In Proc. Conf. Computer Vision and Pattern Recognition; Blake, A. and Zisserman, A. 1987. Visual Reconstruction. MIT Press: Cambridge, USA.


Few of the papers on stereo vision properly handled the implicit relation between discontinuities in depth and the resulting unmatched regions54. These algorithms were forced either to constrain their environments so that occlusion was uncommon, or to accept solutions which smoothed over the depth discontinuities or produced spurious matches for the pixels which did not match anything. More recent papers have addressed occlusions explicitly.

Current solutions

There currently exists a considerable amount of work on the dense stereo correspondence problem. Scharstein and Szeliski55 have provided an exhaustive comparison of dense stereo correspondence algorithms. Most algorithms utilise local measurements such as image intensity (or colour) and phase, and aggregate information from multiple pixels using smoothness constraints. The simplest method of aggregation is to minimise the matching error within rectangular windows of fixed size56. Better approaches utilise multiple windows57 58, adaptive windows59 which change their size in order to minimise the error, shiftable windows60 61 62, or predicted windows63, all of which give performance improvements at discontinuities. Global approaches to solving the stereo correspondence problem rely on the extremisation of a global cost function or energy. The energy functions used include terms for local property matching (the 'data term'), additional smoothness terms and, in some cases, penalties for occlusions. Depending on the form of the energy function, the most efficient energy minimisation scheme can be chosen.

54 Marr, D. and Poggio, T. 1979. A computational theory of human stereo vision. Proc. R. Soc. London B, 204:301–328; Baker, H. and Binford, T. 1981. Depth from edge and intensity based stereo. In IJCAI, pp. 631–636; Grimson, W. 1981. From Images to Surfaces. MIT Press: Cambridge, USA; Ohta, Y. and Kanade, T. 1985. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Trans. Pattern Analysis and Machine Intelligence, 7(2):139–154.
55 D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int'l Journal of Computer Vision, vol. 47, no. 1, pp. 7–42, April 2002.
56 M. Okutomi and T. Kanade, "A multiple baseline stereo," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 4, pp. 353–363, April 1993.
57 D. Geiger, B. Ladendorf, and A. Yuille, "Occlusions and binocular stereo," Proc. European Conf. Computer Vision, pp. 425–433, 1992.
58 A. Fusiello, V. Roberto, and E. Trucco, "Efficient stereo with multiple windowing," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 858–863, June 1997.
59 T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: theory and experiment," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 9, pp. 920–932, 1994.
60 A. F. Bobick and S. S. Intille, "Large occlusion stereo," Int'l Journal of Computer Vision, vol. 33, no. 3, pp. 181–200, Sept 1999.
61 H. Tao, H. Sawhney, and R. Kumar, "A global matching framework for stereo computation," Proc. Int'l Conf. Computer Vision, vol. 1, pp. 532–539, July 2001.
62 M. Okutomi, Y. Katayama, and S. Oka, "A simple stereo algorithm to recover precise object boundaries and smooth surfaces," Int'l Journal of Computer Vision, vol. 47, no. 1–3, pp. 261–273, 2002.
63 J. Mulligan and K. Daniilidis, "Predicting disparity windows for real-time stereo," Lecture Notes in Computer Science, vol. 1842, pp. 220–235, 2000.


These include dynamic programming64, simulated annealing65 66, relaxation labelling67, nonlinear diffusion68, maximum flow69 and graph cuts70 71. Maximum-flow and graph-cut methods provide better computational efficiency than simulated annealing for energy functions which possess a certain set of properties. Some of these algorithms treat the images symmetrically and explicitly deal with occlusions. The uniqueness constraint72 is often used to find regions of occlusion. Egnal and Wildes73 provide comparisons of various approaches for finding occlusions. The issue of recovering a piecewise planar description of a scene has previously been explored in the context of stereo and motion74. Recently, some algorithms75 have explicitly incorporated the estimation of slant while performing the estimation of dense horizontal disparity. Lin and Tomasi76 explicitly model the scene using smooth surface patches and also find occlusions; they initialise their disparity map with integer disparities obtained using graph cuts, after which surface fitting and segmentation are performed repeatedly. Previously, Devernay and Faugeras77 used local image deformations to obtain differential properties of 3D shapes directly.

64 Y. Ohta and T. Kanade, "Stereo by intra- and inter-scanline search using dynamic programming," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 139–154, March 1985.
65 S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, Nov 1984.
66 S. T. Barnard, "Stochastic stereo matching over scale," Int'l Journal of Computer Vision, vol. 3, no. 1, pp. 17–32, 1989.
67 R. Szeliski, "Bayesian modelling of uncertainty in low level vision," Int'l Journal of Computer Vision, vol. 5, no. 3, pp. 271–302, Dec 1990.
68 D. Scharstein and R. Szeliski, "Stereo matching with nonlinear diffusion," Int'l Journal of Computer Vision, vol. 28, no. 2, pp. 155–174, 1998.
69 S. Roy and I. Cox, "A maximum-flow formulation of the n-camera stereo correspondence problem," Proc. Int'l Conf. Computer Vision, pp. 492–499, 1998.
70 Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, Nov 2001.
71 V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," Proc. Int'l Conf. Computer Vision, pp. 508–515, July 2001.
72 D. Marr and T. Poggio, "A computational theory of human stereo vision," Proc. Royal Soc. London B, vol. 204, pp. 301–328, 1979.
73 G. Egnal and R. Wildes, "Detecting binocular half-occlusions: empirical comparisons of five approaches," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1127–1133, Aug 2002.
74 O. Faugeras and F. Lustman, "Motion and structure from motion in a piecewise planar environment," Int'l Journal of Pattern Recognition and Artificial Intelligence, vol. 2, no. 3, pp. 485–508, 1988.
75 S. Birchfield and C. Tomasi, "Multiway cut for stereo and motion with slanted surfaces," Proc. Int'l Conf. Computer Vision, vol. 1, pp. 489–495, 1999.
76 M. Lin and C. Tomasi, "Surfaces with occlusions from layered stereo," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. I-710–I-717, June 2003.
77 F. Devernay and O. Faugeras, "Computing differential properties of 3-D shapes from stereoscopic images without 3-D models," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 208–213, 1994.


Stereo correspondence is a particularly difficult problem, since some points in each image will have no corresponding point in the other image: the cameras might have different fields of view, or, more commonly, points are occluded. A stereo system must therefore be able to determine which image areas should not be matched. We will discuss the two main approaches to establishing stereo correspondence: correlation-based methods and feature-based methods. The main distinctions between the two categories are:
− first, the element used for matching; and
− second, the criterion used for measuring the degree of correspondence.

Correlation methods use image windows of fixed size as the matching basis, and correlation coefficients as measures of similarity. Feature-based methods, on the other hand, use a sparse set of features, for instance corners and lines, for matching. A measure of distance between feature descriptors is used to represent the degree of similarity: the smaller the distance, the higher the degree of correspondence. In terms of implementation, correlation-based methods are certainly much simpler to implement and maintain than feature-based methods. Moreover, the output produced by correlation-based methods is a dense disparity map, which provides information on the depth of objects and is extremely useful when different textures are mapped onto the surface. Since correlation essentially matches pixel by pixel, much greater precision and detail are obtained. However, when speed is a key issue, feature-based methods outperform correlation because the search is limited to a relatively small number of features. When the difference in illumination between the images is noticeable, feature-based methods are also much less sensitive than correlation-based ones.
Correlation-based algorithm78

Consider an image pair I_L and I_R, and let p_L and p_R denote pixels in the left and right images. We have:
− the width (in pixels) of the image window, 2W + 1;
− the search region R(p_L) in the right image associated with p_L;
− a function f(u, v) of two pixel values that measures the degree of correlation.

For each left-image pixel p_L = (i, j), its corresponding point p_R in the right image is found by the following procedure. For each displacement d = (d_x, d_y) ∈ R(p_L), compute the cross-correlation c(d):

$$c(d) = \sum_{k=-W}^{W}\sum_{l=-W}^{W} I_L(i+k,\; j+l)\, I_R(i+k-d_x,\; j+l-d_y)$$

p_R is then given by the displacement $\bar{d}$ with the largest correlation coefficient c(d) within the search region:

$$\bar{d} = \arg\max_{d \in R(p_L)} c(d)$$

At the end of the process, a disparity map containing one disparity per pixel of I_L is generated. Usually c(d) is normalised by dividing it by the standard deviations of both windows; the formula is as follows:

78 Bebis, G. "The stereo correspondence problem", http://www.cse.unr.edu/~bebis/CS791E/


$$c(d) = \frac{\displaystyle\sum_{k=-W}^{W}\sum_{l=-W}^{W} I_L(i+k,\; j+l)\, I_R(i+k-d_x,\; j+l-d_y)}{\sqrt{\displaystyle\sum_{k=-W}^{W}\sum_{l=-W}^{W} \bigl(I_L(i+k,\; j+l) - \bar{I}_L\bigr)^2 \;\displaystyle\sum_{k=-W}^{W}\sum_{l=-W}^{W} \bigl(I_R(i+k-d_x,\; j+l-d_y) - \bar{I}_R\bigr)^2}}$$

where $\bar{I}_L$ and $\bar{I}_R$ are the average pixel values in the left and right windows. An alternative similarity measure is the sum of squared differences (SSD):

$$c(d) = \sum_{k=-W}^{W}\sum_{l=-W}^{W} \bigl(I_L(i+k,\; j+l) - I_R(i+k-d_x,\; j+l-d_y)\bigr)^2$$
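The window-matching procedure above can be sketched as follows. This is a hypothetical minimal numpy implementation using the plain cross-correlation c(d); the images are assumed rectified, so the search region reduces to a horizontal range of displacements, and window borders are only handled by a simple bounds check.

```python
import numpy as np

def match_point(IL, IR, i, j, W, max_disp):
    """Correlation-based search: for left pixel (i, j), scan horizontal
    displacements d in the search region and return the d maximising the
    cross-correlation c(d) between (2W+1)x(2W+1) windows."""
    win_l = IL[i - W:i + W + 1, j - W:j + W + 1].astype(float)
    best_d, best_c = 0, -np.inf
    for d in range(0, max_disp + 1):
        if j - W - d < 0:                    # window would leave the image
            break
        win_r = IR[i - W:i + W + 1, j - W - d:j + W + 1 - d].astype(float)
        c = float(np.sum(win_l * win_r))     # c(d) = sum of pixel products
        if c > best_c:
            best_c, best_d = c, d
    return best_d
```

For instance, a bright 5×5 patch shifted four pixels between the two images is recovered as a disparity of 4 at the patch centre.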

An alternative to using raw image intensity values is to use the thresholded signed gradient magnitude at each pixel. The gradient magnitude is computed in both images without smoothing, and the values are then mapped onto three levels, −1, 0 and 1, by thresholding. The success of correlation-based methods depends on whether the image window in one image exhibits a distinctive structure that occurs infrequently in the search region of the other image. If the correlation window size W is too small, it may not capture enough image structure and may be greatly affected by noise. If the window is too large, matching becomes less sensitive to noise but also less sensitive to actual variations of image intensity. If the distance of the fixation point from the cameras is much larger than the baseline, the location of R(p_L) can be chosen to be the same as the location of p_L. The size of R(p_L) can be estimated from the maximum range of distances we expect to find in the scene.

Feature-based algorithm79

In feature-based methods we look for a feature in one image that matches a feature in the other. Typical features used for this approach are edge points, line segments and corners. A set of descriptors is used for matching; a line feature descriptor, for instance, could contain the following values: the length l, the orientation θ, the co-ordinates of the midpoint m, and the average intensity i along the line. Similarity measures are based on matching feature descriptors:

$$S = \frac{1}{w_0(l_L - l_R)^2 + w_1(\theta_L - \theta_R)^2 + w_2\|m_L - m_R\|^2 + w_3(i_L - i_R)^2}$$

where w_0, …, w_3 are weights. Consider an image pair I_L and I_R with features and descriptors computed in both images, and let R(f_L) be the search region in the right image associated with a feature f_L in the left image. For each feature f_L in the left image, we compute the similarity S between f_L and each image feature in R(f_L), and select the right-image feature f_R that maximises the similarity measure. Finally, we save the correspondence and the disparity d(f_L, f_R).

79 Bebis, G., Computer Vision lecture notes. http://www.cse.unr.edu/~bebis/CS791E/
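The descriptor-matching step can be sketched as follows (hypothetical helper names; descriptors are stored as small dictionaries, and a tiny epsilon is added to the denominator purely to guard against division by zero for identical descriptors):

```python
import numpy as np

def similarity(fL, fR, w=(1.0, 1.0, 1.0, 1.0)):
    """Similarity between two line-feature descriptors
    (length l, orientation theta, midpoint m, average intensity i)."""
    dl = (fL["l"] - fR["l"]) ** 2
    dth = (fL["theta"] - fR["theta"]) ** 2
    dm = float(np.sum((np.asarray(fL["m"]) - np.asarray(fR["m"])) ** 2))
    di = (fL["i"] - fR["i"]) ** 2
    # epsilon avoids division by zero when the descriptors are identical
    return 1.0 / (w[0] * dl + w[1] * dth + w[2] * dm + w[3] * di + 1e-12)

def best_match(fL, candidates):
    """Pick the right-image feature in the search region R(fL) maximising S."""
    return max(candidates, key=lambda fR: similarity(fL, fR))
```

Given a left feature and a list of candidate right features, `best_match` returns the candidate whose descriptor is closest under the weighted distance.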


Disparity map

Figure 2: Triangulation map

From the triangulation geometry we obtain the following equation:

$$Z = \frac{fb}{u_1 - u_2} \qquad \text{(Equation 23)}$$

where u_1 − u_2 is the disparity, b is the baseline distance, and f is the focal length. After locating the corresponding points on the same row in the left and right images, we can compute the disparity for each matched pair one by one. The depth Z is inversely proportional to the disparity d = u_1 − u_2. In order to illustrate the disparity for each matched pair, we first normalise the disparity values into the pixel intensity range (0–255), so that disparity can be displayed as pixel intensity. As a result, the brighter an image point, the higher its disparity and the smaller its depth. Note that since we only consider well-correlated points on the maps, the disparity of lowly correlated points is set to 255.

Ordering constraint and dynamic programming80

Normally the order of points in the two images is the same: if we match pixel i in the left image to pixel j in the right image, no matches that follow will affect which are the best preceding matches. Points on corresponding epipolar lines usually appear in the same order, but when occlusion occurs this no longer holds, and a dynamic programming technique can be used to solve the problem. We aim to find the minimum-cost path going monotonically down and right from the top-left corner of the matching graph to its bottom-right corner. In this graph, the nodes represent matched feature points (for example, edge points), arcs represent matched intervals along the epipolar lines, and the arc cost represents the discrepancy between intervals.
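Equation 23 and the display normalisation described above can be sketched in a few lines (hypothetical helper names; the focal length, baseline and validity mask are assumed given):

```python
import numpy as np

def depth_from_disparity(disp, f, b):
    """Z = f*b / (u1 - u2): depth is inversely proportional to disparity."""
    return f * b / disp

def disparity_image(disp, valid):
    """Normalise disparities into the 0-255 intensity range for display;
    lowly correlated (invalid) points are set to 255, as in the text."""
    d = disp.astype(float)
    lo, hi = d[valid].min(), d[valid].max()
    img = np.zeros_like(d)
    img[valid] = 255.0 * (d[valid] - lo) / max(hi - lo, 1e-12)
    img[~valid] = 255.0
    return img.astype(np.uint8)
```

For example, with f = 2 and b = 10, a disparity of 4 gives a depth of 5; in the display image, the smallest valid disparity maps to 0 and the largest to 255.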

80 Baker, H. and Binford, T. 1981. Depth from edge and intensity based stereo. In IJCAI, pp. 631–636.


2.7.3 3D reconstruction81

The approaches to 3D reconstruction using stereo image pairs of a given scene essentially fall into one of three categories82. The first, and most familiar, is where both the intrinsic and the extrinsic parameters of the stereo system are known. In the second case, only the intrinsic parameters are available. The third case is the one where both intrinsic and extrinsic parameters are unknown, but a sufficiently large number of 3D object points are known83.

If all of the parameters are known, reconstruction is straightforward: estimation of the co-ordinates of an object point is a matter of triangulation, involving its left and right camera image points. The second case, for which only the intrinsic parameters are known, is somewhat more involved84. The first step requires estimation of what is referred to as the matrix of essential parameters, which can only be found up to an unknown scale factor. Next, the normalised translation vector is found. Finally, the rotation matrix is derived from both the normalised essential matrix and the normalised translation vector. Since the translation is only known up to a scale factor, so is the 3D reconstruction. If the co-ordinates of some object points are known, the reconstruction can be brought to true scale. The third case, where none of the parameters are known, is the most involved85 86. By mapping five scene points into the standard projective basis of P3 and using the epipoles, the projective matrix of each camera is found up to an unknown projective transformation. Once the projective matrices are determined, the 3D location of an arbitrary point in space is obtained by triangulation in projective space. Consequently, the reconstruction is unique up to an unknown projective transformation of the world frame87. If we have the co-ordinates of at least five world points, we can, in principle, obtain the reconstruction to true scale.

Reconstruction by triangulation

Of the three approaches to 3D scene reconstruction from a stereo pair, this is the simplest, since we have knowledge of both the intrinsic and the extrinsic parameters. By convention, we choose the left camera frame to be the world system; that is, the co-ordinates of object points are expressed in the left camera system. R denotes the rotation matrix of the right frame with respect to the left frame, and T the translation of the right frame relative to the left frame. Our task is to compute the location of a 3D point from its projections pl and pr on the left and right image frames. Pl and Pr are the co-ordinates of the object point in the left and

81 T. J. McKinley, M. M. McWaters and V. K. Jain, "3D reconstruction from a stereo pair without the knowledge of intrinsic or extrinsic parameters," University of South Florida, Tampa, Florida 33620, U.S.A.
82 E. Trucco and A. Verri, "Introductory Techniques for 3-D Computer Vision," Prentice Hall, 1998.
83 O. Faugeras, "Three-Dimensional Computer Vision," MIT Press, 1995.
84 B. K. P. Horn, "Robot Vision," MIT Press, 1986.
85 R. Hartley, R. Gupta, and T. Chang, "Stereo from uncalibrated cameras," IEEE Computer Vision and Pattern Recognition, June 1995.
86 R. Mohr, F. Veillon, and L. Quan, "Relative 3D reconstruction using multiple uncalibrated images," IEEE Computer Vision and Pattern Recognition, pp. 543–548, 1993.
87 V. K. Jain, "Mapping a high-speed wireless communication function to the reconfigurable platform," Proc. IEEE Int. Workshop on Rapid System Prototyping, pp. 103–108, June 2000.


right image frames respectively. The point P projected into the pair of corresponding image plane points pl and pr, lies at the intersection of the two rays from Ol through pl and from Or through pr.

Figure 3: Stereo imaging system

Here the rays are known and their intersection can be obtained. However, since the parameters and image point locations are subject to quantisation and other errors, the two rays will not actually intersect in space. Their intersection can be estimated as the point of minimum distance from both rays: this is the midpoint P′ of the line segment perpendicular to both rays. The following paragraphs formulate the solution for P′. We know that the parametric representation of the line passing through two points P1 and P2 is given by the formula

P(t) = P1 + t(P2 − P1), or componentwise x(t) = x1 + t(x2 − x1) and y(t) = y1 + t(y2 − y1), where (x1, y1) are the coordinates of P1 and (x2, y2) those of P2. The direction of the line is given by the vector P2 − P1.

Figure 4: Reconstruction of a 3D point

Applying the above formula, the parametric representation of the line passing through Ol and pl is

Ol + a(pl − Ol) = a·pl, since Ol is the origin of the left camera frame.

The parametric representation of the line passing through Or and pr expressed in the left camera frame:

OrL + b(prL − OrL) = T + b·Rᵀpr, since OrL = Rᵀ·OrR + T = T and prL = Rᵀ·prR + T, where prR = pr and plL = pl. The endpoints of the segment, say P1 and P2, are given by


P1 = a0·pl and P2 = T + b0·Rᵀpr

The parametric equation of the line passing through P1 and P2 is given by:

P1 + c(P2 − P1 )

The desired midpoint P′ is computed for c = ½. Let w be a vector orthogonal to both rays l and r. We need to determine the midpoint of the segment parallel to w that joins l and r. The vector w is formulated as

w = (pl − Ol) × (prL − OrL) = pl × Rᵀpr, since prL − OrL = Rᵀpr

The line s through P1 with direction w is given by a0·pl + c·(pl × Rᵀpr); assuming that s passes through P2 for c = c0, we have

a0·pl + c0·(pl × Rᵀpr) = T + b0·Rᵀpr

We can obtain a0, b0 and c0 by solving the following system of equations:

a0·pl − b0·Rᵀpr + c0·(pl × Rᵀpr) = T
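As a concrete sketch of the midpoint method above (a hypothetical helper, assuming pl and pr are given in normalised camera coordinates and R, T are the calibrated extrinsics):

```python
import numpy as np

def triangulate_midpoint(pl, pr, R, T):
    """Midpoint triangulation: solve a0*pl - b0*R^T pr + c0*(pl x R^T pr) = T,
    then return the midpoint of the segment joining the two rays."""
    pl = np.asarray(pl, dtype=float)
    T = np.asarray(T, dtype=float)
    rpr = R.T @ np.asarray(pr, dtype=float)    # right-ray direction in the left frame
    w = np.cross(pl, rpr)                      # direction orthogonal to both rays
    A = np.column_stack((pl, -rpr, w))
    a0, b0, c0 = np.linalg.solve(A, T)
    P1 = a0 * pl                               # closest point on the left ray
    P2 = T + b0 * rpr                          # closest point on the right ray
    return 0.5 * (P1 + P2)                     # estimate of the 3D point P'
```

With exact, noise-free inputs the two rays intersect and both endpoints coincide; with noisy data the midpoint is the minimum-distance estimate described above.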

With a0, b0, and c0 known, we now have an estimate for P. Since image points are located by pixel indices, we need to determine pl and pr from these indices, and we are done. Let X, Y, Z denote the axes of a camera frame, with Z the optical axis and the image plane located at Z = f, where f is the focal length. Let (n, m) denote pixel indices, where n increases along the direction of the X axis and m increases along the direction of the Y axis. Finally, let (N0, M0) be the pixel indices where the optical axis intersects the image plane. It is relatively straightforward to see that

p = [(n − N0)·sx, (m − M0)·sy, f]ᵀ

where sx × sy is the size of a pixel, p denotes the image point coordinates and (n, m) the pixel indices of that image point. Using the above formula we obtain pl and pr, and we can then continue to obtain the reconstruction.

Reconstruction up to a scale factor

This is the case where only the intrinsic camera parameters are known. Suppose that we have established n ≥ 8 correspondences with which to compute the essential matrix E. We apply an approach similar to that of the previous section in calculating the location of the 3D points from their projections pl and pr. In this case, since we do not know the baseline (the line joining the two camera centres), we cannot recover the true scale of the viewed scene. The reconstructed point is unique only up to an unknown scale factor, which could be determined if we knew the distance between two points in the scene. We estimate the essential matrix E by using the 8-point algorithm; the solution for E is unique up to an unknown scale factor. We carry on by recovering the translation vector T from the left image frame to the right image frame. From the definition of the essential matrix E = RS, where S is the skew-symmetric matrix of T, we deduce that

EᵀE = SᵀRᵀRS = SᵀS =
[  Ty² + Tz²   −Tx·Ty        −Tx·Tz      ]
[ −Ty·Tx        Tx² + Tz²   −Ty·Tz      ]
[ −Tz·Tx       −Tz·Ty         Tx² + Ty²  ]

Note that Tr(EᵀE) = 2‖T‖², or ‖T‖ = √(Tr(EᵀE)/2).

To simplify the recovery of T, we divide E by ‖T‖, writing Ê = E/‖T‖ and T̂ = T/‖T‖:

ÊᵀÊ =
[ 1 − T̂x²     −T̂x·T̂y    −T̂x·T̂z ]
[ −T̂x·T̂y    1 − T̂y²     −T̂y·T̂z ]
[ −T̂x·T̂z    −T̂y·T̂z    1 − T̂z²  ]

Next, we recover the rotation matrix R between the two image frames. It can be shown that each row of R is given by

Ri = wi + wj × wk, where wi = Êi × T̂

with (i, j, k) a cyclic permutation of (1, 2, 3) and Êi the i-th row of Ê. After recovering the translation vector T and the rotation matrix R, we can finally perform the 3D reconstruction. Let us compute the Z coordinate of each point in the left camera frame.

Since Pr = R(Pl − T̂), we have

pl = fl·Pl / Zl and pr = fr·Pr / Zr = fr·R(Pl − T̂) / R3ᵀ(Pl − T̂)

The first component xr of pr is

xr = fr·R1ᵀ(Pl − T̂) / R3ᵀ(Pl − T̂)

Substituting Pl = pl·Zl / fl into the above equation, we get

Zl = fl·(fr·R1 − xr·R3)ᵀT̂ / (fr·R1 − xr·R3)ᵀpl

We then compute Xl and Yl from Pl = pl·Zl / fl, and (Xr, Yr, Zr) from Pr = R(Pl − T̂).
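A minimal sketch of the per-point depth computation above (hypothetical helper names; it assumes R and T̂ have already been recovered from the essential matrix as described, so the result is up to the unknown scale of T̂):

```python
import numpy as np

def depth_up_to_scale(pl, xr, R, That, fl=1.0, fr=1.0):
    """Z_l = f_l*(f_r*R1 - x_r*R3)^T That / (f_r*R1 - x_r*R3)^T p_l.
    pl: left image point (x_l, y_l, f_l); xr: x coordinate of the right image point."""
    pl = np.asarray(pl, dtype=float)
    d = fr * R[0] - xr * R[2]            # f_r*R1 - x_r*R3 (R1, R3 are rows of R)
    Zl = fl * (d @ That) / (d @ pl)
    Pl = pl * Zl / fl                    # left-frame point, up to the scale of That
    Pr = R @ (Pl - That)                 # the same point in the right camera frame
    return Pl, Pr
```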


Chapter 3

Calibration

The process of calculating the parameters that describe the relationship between the world and the image is called camera calibration. The basic idea behind camera calibration is that, by taking images of objects with features that are easily identified in the image, a relationship between 3D points and their corresponding 2D image points can be formed. Using these correspondences, the parameters of the camera can be calculated.

3.1 Camera model [88]

A pinhole camera is modelled by its optical centre C and its retinal plane (or image plane) R. A 3D point W is projected into an image point M given by the intersection of R with the line containing C and W. The line containing C and orthogonal to R is called the optical axis, and its intersection with R is the principal point. The distance between C and R is the focal length.

Figure 5: Pinhole camera model

Let w = [x y z]ᵀ be the coordinates of W in the world reference frame (fixed arbitrarily) and m = [u v]ᵀ the coordinates of M in the image plane (pixels). The mapping from 3D coordinates to 2D coordinates is the perspective projection, which is represented by a linear transformation in homogeneous coordinates.

[88] R. I. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision", Cambridge University Press, 2002.


Let m̃ = [u v 1]ᵀ and w̃ = [x y z 1]ᵀ be the homogeneous coordinates of M and W respectively; then the perspective transformation is given by the matrix P̃:

m̃ ≈ P̃·w̃

where ≈ means equal up to an arbitrary scale factor. The camera is therefore modelled by its perspective projection matrix (henceforth PPM) P̃, which can be decomposed, using the QR factorization, into the product

P̃ = A[R | t]

The matrix A depends on the intrinsic parameters only, and has the following form:

A = [ αu  γ   u0 ]
    [ 0   αv  v0 ]
    [ 0   0   1  ]

where αu = −f·ku and αv = −f·kv are the focal lengths in horizontal and vertical pixels respectively (f is the focal length in millimetres; ku and kv are the effective number of pixels per millimetre along the u and v axes), (u0, v0) are the coordinates of the principal point, given by the intersection of the optical axis with the retinal plane, and γ is the skew factor that models non-orthogonal u-v axes. The camera position and orientation (extrinsic parameters) are encoded by the 3×3 rotation matrix R and the translation vector t, representing the rigid transformation that brings the camera reference frame onto the world reference frame.
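A minimal sketch of this camera model (hypothetical helper names; the signs follow the text's αu = −f·ku, αv = −f·kv convention):

```python
import numpy as np

def intrinsic_matrix(f_mm, ku, kv, u0, v0, gamma=0.0):
    """Intrinsic matrix A of the pinhole model above:
    alpha_u = -f*ku, alpha_v = -f*kv (focal lengths in pixels)."""
    return np.array([[-f_mm * ku, gamma, u0],
                     [0.0, -f_mm * kv, v0],
                     [0.0, 0.0, 1.0]])

def project(A, R, t, w):
    """Project world point w with PPM P = A[R | t]; returns pixel (u, v)."""
    m = A @ (R @ np.asarray(w, dtype=float) + t)  # homogeneous image point
    return m[:2] / m[2]                           # divide out the scale factor
```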

3.2 Lens distortion

The lens is a pivotal part of the camera. Its purpose is to focus the light from one point in the world onto one point in the image. Achieving focus without a lens would require a pinhole-type camera with an infinitely small hole, which would also have to remain open infinitely long to gather enough light. The disadvantage of using lenses is that, as they refract light, the position of the projected point on the image plane is altered. The distortion of a point can be expressed as a symmetric (radial) distortion and an asymmetric (tangential) distortion [89]. All lenses produce a radial distortion, resulting from the effects discussed above, which can be modelled as

[xr, yr]ᵀ = L(∇x, ∇y)·[∇x, ∇y]ᵀ, where (∇x, ∇y)ᵀ = (x̃ − xP, ỹ − yP)ᵀ is the undistorted point (x̃, ỹ)ᵀ relative to the principal point (xP, yP)ᵀ, L(∇x, ∇y) = K1·r² + K2·r⁴ + … , and r = √(∇x² + ∇y²).

[89] Janne Heikkilä and Olli Silvén, "A four-step camera calibration procedure with implicit image correction", Proc. CVPR '97, San Juan, Puerto Rico, pp. 1106-1112, IEEE Computer Society, 1997.


Figure 6: Effects of radial lens distortion. The first is called pincushion distortion and the second barrel distortion.

Tangential distortion is the result of imperfect centring of the lens components and other manufacturing defects. It is often very small and can in most cases be disregarded. The effect is that the symmetric centre of the distortion is no longer the principal point. This leads to an asymmetric distortion, which can be modelled as

[xt, yt]ᵀ = [2·P1·∇x·∇y + P2·(r² + 2∇x²),  2·P2·∇x·∇y + P1·(r² + 2∇y²)]ᵀ

where P1, P2 are the tangential distortion parameters.

Figure 7: Effect of tangential lens distortion The combined distortion could be modelled as

x c x r xt y = y + y c r t , where the radial lens and tangential lens distortion are calculated as above. The distortion point is then calculated as undistorted points plus distortion

x xc xd ~ y = ~ + d y yc

Figure 8: Effects of combined radial and tangential lens distortion
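The combined model above can be sketched as follows (a hypothetical helper; K1, K2, P1, P2 are the radial and tangential coefficients defined above, and (xp, yp) the principal point):

```python
def distort(x_u, y_u, xp, yp, K1, K2, P1, P2):
    """Apply the combined radial + tangential distortion model to an
    undistorted point (x_u, y_u)."""
    dx, dy = x_u - xp, y_u - yp
    r2 = dx * dx + dy * dy
    L = K1 * r2 + K2 * r2 * r2                        # radial factor K1*r^2 + K2*r^4
    x_r, y_r = L * dx, L * dy                         # radial component
    x_t = 2 * P1 * dx * dy + P2 * (r2 + 2 * dx * dx)  # tangential component
    y_t = 2 * P2 * dx * dy + P1 * (r2 + 2 * dy * dy)
    return x_u + x_r + x_t, y_u + y_r + y_t           # distorted point
```

With all coefficients zero the point is returned unchanged, as expected from the model.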


3.3 Single camera calibration

The OpenCV calibration function is rather similar to the calibration procedure described in Zhengyou Zhang's paper "A Flexible New Technique for Camera Calibration". The algorithm first finds initial estimates of the parameters using an analytical method. All parameters are then optimised globally using a non-linear least-squares optimisation method, described in the section below. The optimisation minimises the least-squares distance between the projected and detected points of the calibration pattern, namely the reprojection error, yielding the maximum likelihood estimates of the parameters. The implementation of single camera calibration provides estimates of the intrinsic parameters of the camera model described above, together with rotation R and translation T estimates for each view used in the calibration process. The calibration object is defined to be at the origin of the world coordinate system for each view; R and T describe the transformation of the world coordinate system into the camera coordinate system.

Pc = R·Pw + T

Let us consider the constraints on the intrinsic parameters. Suppose we have an image of the model plane; a homography can be estimated as H = [h1 h2 h3]. Given A as the camera intrinsic matrix found by the calibration process, it is formulated as

A = [ α  γ  u0 ]
    [ 0  β  v0 ]
    [ 0  0  1  ]

where (u0, v0) are the coordinates of the principal point, α and β are the scale factors along the image u and v axes, and γ is the parameter describing the skewness of the two image axes. The homography H is formulated as H = A[r1 r2 t]. We can then deduce that

[h1 h2 h3] = λ·A[r1 r2 t]

where λ is an arbitrary scalar. Since r1 and r2 are orthonormal, we have

h1ᵀA⁻ᵀA⁻¹h2 = 0
h1ᵀA⁻ᵀA⁻¹h1 = h2ᵀA⁻ᵀA⁻¹h2

The two equations above are the two basic constraints on the intrinsic parameters provided by one homography. Since a homography has 8 degrees of freedom (DOF) and there are 6 extrinsic parameters (3 for rotation and 3 for translation), we can only obtain 2 constraints on the intrinsic parameters.

3.4 Stereo camera calibration

Our task is to find the relative position and viewing direction of the two cameras in the stereo system, i.e. the transformation from one camera coordinate system to the other. Both cameras are first calibrated using the single camera method described above. Note that the data used for single camera calibration requires the checkerboard pattern to be visible, with all corners detected, in both images. Completing single camera calibration for the two cameras provides the world-to-camera coordinate


system transformation for each camera and view: Rl, Tl, Rr and Tr. For a point P in the world reference frame we have

Pr = Rr P + Tr and

Pl = Rl·P + Tl

where Pr and Pl are the coordinates of the 3D point in the right and left camera frames respectively, Rr and Rl are the rotation matrices, and Tr and Tl are the translation vectors of the right and left camera frames respectively. We then deduce

Pl = Rl P + Tl = Rl [ Rr−1 ( Pr − Tr )] + Tl = Rl Rr−1 Pr − Rl Rr−1Tr + Tl Since the relationship between Pl and Pr is given by Pl = RPr + T , we equate the terms to get

R = Rl·Rr⁻¹ = Rl·Rrᵀ

and

T = Tl − Rl·RrᵀTr = Tl − R·Tr

as the extrinsic parameters of the stereo system.
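The derivation above reduces to two lines of matrix algebra; a minimal sketch (hypothetical helper, assuming Rl, Tl, Rr, Tr come from the two single camera calibrations):

```python
import numpy as np

def stereo_extrinsics(Rl, Tl, Rr, Tr):
    """Relative pose of the stereo rig from the per-camera extrinsics:
    R = Rl @ Rr.T and T = Tl - R @ Tr, so that Pl = R @ Pr + T."""
    R = Rl @ Rr.T
    T = Tl - R @ Tr
    return R, T
```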

3.5 Stereo calibration optimisation [90]

3.5.1 Nonlinear optimisation

A minimisation problem can be formulated as min_x f(x), where x ∈ ℝⁿ is a parameter vector and f(x) ∈ ℝ. Let us consider the Taylor expansion of the function f around a start value x:

f(x + p) = f(x) + ∇f(x)ᵀp + (1/2)·pᵀ∇²f(x)p + …

where ∇f(x) is the gradient and ∇²f(x) is the Hessian, defined as

∇f(x) = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn]ᵀ

∇²f(x) = [ ∂²f/∂x1²      ∂²f/∂x1∂x2   …  ∂²f/∂x1∂xn ]
         [ ∂²f/∂x2∂x1    ∂²f/∂x2²     …  ∂²f/∂x2∂xn ]
         [ ⋮             ⋮            ⋱  ⋮          ]
         [ ∂²f/∂xn∂x1    ∂²f/∂xn∂x2   …  ∂²f/∂xn²   ]

3.5.2 Optimisation using Newton's method

[90] Michael T. Heath, "Scientific Computing: An Introductory Survey", McGraw-Hill, 1997.


According to Newton's method, the function f is approximated at the solution of the minimisation problem, x*, using the first two terms of the Taylor expansion. The approximation is named Q(pk) since it is a quadratic function:

f(x*) ≈ Q(pk) = f(xk) + ∇f(xk)ᵀpk + (1/2)·pkᵀ∇²f(xk)pk

For x* to be a minimiser of f, two necessary conditions must be satisfied. The first-order necessary condition is that the gradient is zero, ∇f(x*) = 0. Hence

∇f(x*) ≈ ∇Q(pk) = ∇f(xk) + ∇²f(xk)pk = 0

Solving for pk:

pk = −∇²f(xk)⁻¹∇f(xk)

which is called the search direction. As Q(pk) is only an approximation, the search direction pk is also only a first step, meaning xk + pk ≈ x*.

The second-order necessary condition is that the Hessian ∇²f(x*) is positive semi-definite. Hence we need to guarantee that

(1/2)·pᵀ∇²f(x*)p > 0, i.e. f(x* + p) > f(x*), for all p ≠ 0

which means that x* is a local minimum.

3.5.3 Optimisation using the non-linear least squares method

A model function G(x) is fitted to m data points in a vector d in a way that minimises the squared distance between them. The problem can be formulated as

min_x f(x) = min_x (1/2)·Σ_{i=1..m} (gi(x) − di)² = min_x (1/2)·‖F(x)‖² = min_x (1/2)·F(x)ᵀF(x)

where the residual function F(x) is F(x) = G(x) − d, and the gradient of f(x) is

∇f(x) = J(x)ᵀF(x)

where J(x) is the Jacobian of F, defined as

J(x) = [ ∂F1/∂x1  ∂F1/∂x2  …  ∂F1/∂xn ]
       [ ∂F2/∂x1  ∂F2/∂x2  …  ∂F2/∂xn ]
       [ ⋮        ⋮        ⋱  ⋮       ]
       [ ∂Fm/∂x1  ∂Fm/∂x2  …  ∂Fm/∂xn ]

The Hessian of the function f(x) is

∇²f(x) = J(x)ᵀJ(x) + Σ_{i=1..m} Fi(x)·∇²Fi(x)


Calculating the Hessian at each iteration would be costly. The Gauss-Newton method therefore approximates the Hessian using just J(x)ᵀJ(x). Since the Jacobian is already needed for the gradient calculation, this adds no extra cost. Given this approximation, the search direction pk is calculated as

pk = −(J(xk)ᵀJ(xk))⁻¹·J(xk)ᵀF(xk)
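A minimal sketch of the Gauss-Newton iteration described above (hypothetical helper; a practical implementation would add a convergence test and damping, as in Levenberg-Marquardt):

```python
import numpy as np

def gauss_newton(F, J, x0, iters=20):
    """Gauss-Newton: p_k = -(J^T J)^{-1} J^T F, then x_{k+1} = x_k + p_k.
    F(x) returns the residual vector, J(x) its Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        Jx, Fx = J(x), F(x)
        p = -np.linalg.solve(Jx.T @ Jx, Jx.T @ Fx)  # search direction
        x = x + p
    return x
```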

3.6 Implementation results

Calibration of a stereo camera system has to be performed whenever the focal length or the relative positions or viewing directions of the cameras change (assuming, without loss of generality, that the world coordinate system is one of the camera coordinate systems). Bouguet's implementation [91], as available in OpenCV [92], is used in this work for calibration of the stereo setup. The toolbox is chosen for the following reasons:
− The method is probably the most widely used calibration tool amongst research groups.
− The method requires a chessboard-patterned planar object which can easily be printed using a laser printer. The pattern can be attached to a rigid planar surface such as a CD case or wooden board. Creation of an accurate 3D calibration object is too cumbersome, so the planar approach is best.
− It uses a camera model which is suitable for high accuracy.
− The implementation is freely available and can be extended. It incorporates a method to automatically extract a chessboard pattern from an image and refine the corner positions.
The results of the process are illustrated below.

Figure 9: Original image and the resulting image after finding the chessboard corners

[91] Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/
[92] Open Source Computer Vision Library. http://www.intel.com/research/mrl/research/opencv/

Figure 10: Back-projection of the corners into the image, and the undistorted version of the back-projection.

As we can observe from the resulting images, the chessboard corners are detected accurately using Bouguet's implementation. The locations of the back-projected corners may not match the real corner locations when lens distortion is not taken into account; however, this is rectified by applying the lens distortion model obtained from the calibration process.


Chapter 4

Stereo correspondence

4.1 The image formation model

We assume that the cameras are calibrated, that the epipolar geometry is pre-determined, and that we have two pinhole cameras whose optical axes are parallel. The cameras each have focal length f, with fl the focal point of the left image and fr the focal point of the right image. We assume that the placement of the cameras is such that the baseline is parallel to the image planes and perpendicular to the optical axes. This placement can be established by the rectification process.

Figure 11: Image formation model

A point P on the surface of an object in 3D space, visible to both cameras, is projected through the focal points and onto the image planes. Each image plane has a 2D coordinate system with its origin determined by the intersection of its optical axis with the image plane. The brightness of each point projected onto the image planes creates the image luminance functions Il, Ir in the left and right planes respectively. A horizontal plane through the baseline and the object point P intersects both image planes at epipolar lines which are horizontal. The object point P, which has coordinates (X, Y, Z), has two image points pl, pr in the left and right image planes with coordinates (xl, yl) and (xr, yr) respectively. The perpendicular distance between the two camera optical axes is b. From the diagram below, we can work out the relationship between the 3D coordinates and the image point coordinates.


For the left projected image point pl, we have the equation:

xl = f·(X + b/2) / Z

Similarly, for the right projected image point pr, we have the equation:

xr = f·(X − b/2) / Z

The disparity d between the two image point coordinates is calculated as

d = xl − xr = f·[(X + b/2) − (X − b/2)] / Z = f·b / Z

The relationship between the disparity and the depth Z is therefore

Z = f·b / (xl − xr) = f·b / d
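The depth-from-disparity relation can be sketched as follows (hypothetical helper, assuming the parallel-axis geometry above with baseline b and focal length f):

```python
def depth_from_disparity(xl, xr, f, b):
    """Z = f*b/d with disparity d = xl - xr."""
    d = xl - xr
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return f * b / d
```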

4.2 Image Rectification: Algorithm 1

Rectification is a process used to facilitate the analysis of a stereo pair of images by making it simple to enforce the two-view geometric constraint. For a pair of related views, the epipolar geometry provides a complete description of the relative camera geometry. Once the epipolar geometry has been determined, it is possible to constrain the match for a point in one image to lie on a line (the epipolar line) in the other image, and vice versa. Rectification makes it very simple to impose this constraint by making all matching epipolar lines coincident and parallel with an image axis. An important advantage of rectification is that computing stereo correspondences is made much simpler, since the search need only be done along the horizontal lines of the rectified images. Many stereo algorithms assume this simplified form because subsequent processing becomes much easier when differences between matched points are in one direction only. We have decided to implement rectification based on the algorithm presented in the paper "Rectification with constrained stereo geometry" by A. Fusiello, E. Trucco, and A. Verri. The algorithm takes the two perspective projection matrices (PPMs) of the original cameras and


computes a pair of rectifying projection matrices, which could then be used to compute the rectified images.

4.2.1 Overview of the algorithm

The idea behind rectification is to define two new PPMs, obtained by rotating the old PPMs around their optical centres until the focal planes become coplanar, thereby containing the baseline. This ensures that the epipoles are at infinity, hence the epipolar lines are parallel. To have horizontal epipolar lines, the baseline must be parallel to the new X axis of both cameras. In addition, for a proper rectification, conjugate points must have the same vertical coordinate. This is obtained by requiring that the new cameras have the same intrinsic parameters. Note that, the focal length being the same, the retinal planes are coplanar too. The optical centres of the new PPMs are the same as those of the old cameras, whereas the new orientation (the same for both cameras) differs from the old ones by suitable rotations; the intrinsic parameters are the same for both cameras. Therefore, the two resulting PPMs differ only in their optical centres, and they can be thought of as a single camera translated along the X axis of its reference system. The algorithm consists of the following steps:
1. Calculate the intrinsic and extrinsic parameters by camera calibration.
2. Determine the positions of the optical centres. The position of each optical centre is constrained by the fact that its projection to the image plane is at 0.
3. Calculate the new coordinate system. Let c1 and c2 be the two optical centres. The new X axis is along the direction of the line c1−c2, and the new Y direction is orthogonal to the new X and the old Z. The new Z axis is then orthogonal to the baseline and Y. The rotation part R of the new projection matrix is derived from this.
4. Construct the new PPMs and the rectification matrices.
5. Apply the rectification matrices to the images.
The algorithm is described in detail as follows.

4.2.2 Camera model

A pinhole camera is introduced with optical centre C and image plane R. The line containing C and orthogonal to R is called the optical axis, and its intersection with R is the principal point. The distance between C and R is the focal length. We assume that the stereo rig is calibrated, i.e., the cameras' internal parameters and their mutual position and orientation are known. The camera is therefore modelled by its perspective projection matrix P, which can be decomposed, using the QR factorization, into the product

P = A[R | t]

The matrix A depends on the intrinsic parameters only, and has the following form:

A = [ αu  γ   u0 ]
    [ 0   αv  v0 ]
    [ 0   0   1  ]

where αu = −f·ku and αv = −f·kv are the focal lengths in horizontal and vertical pixels respectively (f is the focal length in millimetres; ku and kv are the effective number of pixels per millimetre along the u and v axes), (u0, v0) are the coordinates of the principal point, given by the intersection of the optical axis with the retinal plane, and γ is the skew factor that models non-orthogonal u-v axes. The camera position and orientation (extrinsic parameters) are encoded by the 3×3 rotation matrix R and the translation vector t, representing the rigid transformation that brings the camera reference frame onto the world reference frame. Let us write the PPM as

P = [ q1ᵀ  q14 ]
    [ q2ᵀ  q24 ]  =  [Q | q̃]
    [ q3ᵀ  q34 ]

The focal plane is the plane parallel to the retinal plane that contains the optical centre C. The coordinates c of C are given by

c = −Q⁻¹q̃

Therefore, P can be rewritten as

P = [Q | −Qc]

4.2.3 Rectification of camera matrices

Let us consider a stereo rig composed of two pinhole cameras. Let C1 and C2 be the optical centres of the left and right cameras, respectively. All the epipolar lines in one image plane pass through a common point (E1 and E2, respectively) called the epipole, which is the projection of the optical centre of the other camera. Although the left and right images are taken simultaneously by a stereo camera system that is supposedly level and horizontal, this is most often not the case. In any stereo image pair, two types of difference in location and orientation are usually noticeable. First, both images may be tilted, indicating that the camera is not horizontal to the ground; this is easily rectifiable by adjusting the camera until the images are straight. Second, there may be vertical shifts between the two images, i.e. one image appears higher than the other. The rectification procedure aims to transform the pair of stereo images so that the epipolar lines are parallel and horizontal in each image and the epipoles are at infinity. The stereo rig is calibrated, i.e., the PPMs Po1 and Po2 are known. The idea behind rectification is to define two new PPMs Pn1 and Pn2, obtained by rotating the old ones around their optical centres until the focal planes become coplanar, thereby containing the baseline. This ensures that the epipoles are at infinity; hence, the epipolar lines are parallel. To have horizontal epipolar lines, the baseline must be parallel to the new X axis of both cameras. In addition, to have a proper


rectification, conjugate points must have the same vertical coordinate. This is obtained by requiring that the cameras have the same intrinsic parameters. Therefore, the two resulting PPMs will differ only in their optical centres, and they can be thought of as a single camera translated along the X axis of its reference system.

Figure 12: Image rectification

In terms of their factorisation, the new PPMs are

Pn1 = A[R | −R·c1]
Pn2 = A[R | −R·c2]

The intrinsic parameter matrix A is the same for both PPMs and can be chosen arbitrarily. The optical centres c1 and c2 are given by the old optical centres, computed with the formula derived previously. The matrix R, which gives the camera's pose, is the same for both PPMs. It will be specified by means of its row vectors

R = [ r1ᵀ ]
    [ r2ᵀ ]
    [ r3ᵀ ]

which are the X, Y, and Z axes, respectively, of the camera reference frame, expressed in world coordinates. According to the previous comments, we take:
1. The new X axis parallel to the baseline:

r1 = (c1 − c2) / ‖c1 − c2‖

2. The new Y axis orthogonal to X (mandatory) and to k:

r2 = k ∧ r1

where k is an arbitrary unit vector that fixes the position of the new Y axis in the plane orthogonal to X. We take it equal to the Z unit vector of the old left matrix, thereby constraining the new Y axis to be orthogonal to both the new X and the old left Z.
3. The new Z axis orthogonal to X and Y (mandatory):

r3 = r1 ∧ r2
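Steps 1-3 above can be sketched as follows (hypothetical helper; r2 is additionally normalised here, since k × r1 is not of unit length when k and r1 are not orthogonal):

```python
import numpy as np

def rectifying_rotation(c1, c2, k):
    """Rows r1, r2, r3 of the new rotation matrix R as in steps 1-3 above.
    c1, c2: optical centres; k: the old left Z unit vector."""
    r1 = (c1 - c2) / np.linalg.norm(c1 - c2)  # new X: along the baseline
    r2 = np.cross(k, r1)                      # new Y: orthogonal to X and old Z
    r2 = r2 / np.linalg.norm(r2)
    r3 = np.cross(r1, r2)                     # new Z: orthogonal to X and Y
    return np.vstack((r1, r2, r3))
```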

4.2.4 The rectifying transformation

In order to rectify the left image, we need to compute the transformation mapping the image plane of Po1 = [Qo1 | q̃o1] onto the image plane of Pn1 = [Qn1 | q̃n1]. The sought transformation is the collineation given by the 3×3 matrix

T1 = Qn1·Qo1⁻¹

The same result applies to the right image. The transformation T1 is then applied to the original left image to produce the rectified image. The pixels (integer-coordinate positions) of the rectified image correspond, in general, to non-integer positions on the original image plane. Therefore, the grey levels of the rectified image are computed by bilinear interpolation.

4.2.5 Bilinear interpolation

Once the shift vector (u, v) has been updated after the previous iteration, the image I must be resampled into a new image I_new according to (−u, −v), in order to align the matching pixels in I and I_new and provide the correct alignment-error computation. The shift (u, v) is estimated with sub-pixel precision, so applying it may take us to locations in between the actual pixel values (see diagram below).

Denote the greyscale value at location (x, y) of each new warped pixel in I_new as Bx,y. This can be computed in two stages. First we define

Bx,0 = (1 − x)·B0,0 + x·B1,0
Bx,1 = (1 − x)·B0,1 + x·B1,1

From these two equations, we get

Bx,y = (1 − y)·Bx,0 + y·Bx,1 = (1 − y)(1 − x)·B0,0 + x(1 − y)·B1,0 + y(1 − x)·B0,1 + xy·B1,1
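The two-stage interpolation above can be written directly as follows (hypothetical helper; B00..B11 are the four neighbouring pixel values and (x, y) the fractional offsets in [0, 1]):

```python
def bilinear(B00, B10, B01, B11, x, y):
    """Bilinear interpolation of the four neighbouring pixel values."""
    bx0 = (1 - x) * B00 + x * B10   # interpolate along x on the first row
    bx1 = (1 - x) * B01 + x * B11   # interpolate along x on the second row
    return (1 - y) * bx0 + y * bx1  # interpolate along y
```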

4.3 Image Rectification: Algorithm 2


We need the extrinsic parameters of the stereo system to construct the rectification matrix. For a point P in the world reference frame we have

Pr = Rr P + Tr , and

Pl = Rl·P + Tl

where Pr and Pl are the coordinates of the 3D point in the right and left camera frames respectively, Rr and Rl are the rotation matrices, and Tr and Tl are the translation vectors of the right and left camera frames respectively. From the two equations above, we have

Pl = Rl P + Tl = Rl [ Rr−1 ( Pr − Tr )] + Tl = Rl Rr−1 Pr − Rl Rr−1Tr + Tl Since the relationship between Pl and Pr is given by Pl = RPr + T , we equate the terms to get

R = Rl Rr−1 = Rl RrT , and

T = Tl − Rl·RrᵀTr = Tl − R·Tr

as the extrinsic parameters of the stereo system. Once we have obtained these extrinsic parameters, we can construct the rectification matrix Rrect, which consists of a triple of mutually orthogonal unit vectors:

e1 = T / ‖T‖
e2 = [−Ty, Tx, 0]ᵀ / √(Tx² + Ty²)
e3 = e1 × e2

Rrect = [ e1ᵀ ]
        [ e2ᵀ ]
        [ e3ᵀ ]
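Constructing Rrect from T can be sketched as follows (hypothetical helper; it assumes Tx² + Ty² ≠ 0):

```python
import numpy as np

def rectification_matrix(T):
    """Rrect from the translation vector T, with rows e1, e2, e3 as above."""
    T = np.asarray(T, dtype=float)
    e1 = T / np.linalg.norm(T)
    e2 = np.array([-T[1], T[0], 0.0]) / np.hypot(T[0], T[1])
    e3 = np.cross(e1, e2)
    return np.vstack((e1, e2, e3))
```

By construction Rrect is orthonormal and maps T onto the X axis, i.e. Rrect·T = [‖T‖, 0, 0]ᵀ, which is exactly the property used in the reasoning that follows.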

The following reasoning shows how the image is rectified with the rectification matrix, and the adjustments made to the intrinsic camera parameters to ensure the correctness of the rectification process. Applying the rectification matrix to both sides of Pl = R·Pr + T:

Rrect·Pl = Rrect·R·Pr + Rrect·T

Let

P′l = Rrect·Pl and P′r = Rrect·R·Pr

be the coordinates of the point in the rectified left and right camera frames. Since


Rrect·T = [‖T‖, 0, 0]ᵀ

we have

P′l = P′r + [‖T‖, 0, 0]ᵀ

At this stage, we know from the above that the point has the same y and z coordinates in both the rectified left and right camera frames. We then investigate the rectification effect on the left and right images. Let

P′r = [x′r, y′r, z′r]ᵀ and P′l = [x′l, y′l, z′l]ᵀ

We set the camera intrinsic parameters in the rectification procedure as

Wl_rect = [ f·sxl  0      c0l ]
          [ 0      f·syl  r0l ]
          [ 0      0      1   ]

and

Wr_rect = [ f·sxr  0      c0r ]
          [ 0      f·syr  r0r ]
          [ 0      0      1   ]

These intrinsic camera parameters may differ from those obtained through the calibration procedure, since the intrinsic parameters may need to be adjusted in order to ensure the rectification; the following argument reveals the exact requirements on them. Projecting the points Pr and Pl onto the image frames, we have

[u′l, v′l, w′l]ᵀ = Wl_rect·[x′l, y′l, z′l]ᵀ = [f·sxl·x′l + c0l·z′l,  f·syl·y′l + r0l·z′l,  z′l]ᵀ

and

[u′r, v′r, w′r]ᵀ = Wr_rect·[x′r, y′r, z′r]ᵀ = [f·sxr·x′r + c0r·z′r,  f·syr·y′r + r0r·z′r,  z′r]ᵀ

The objective of the rectification procedure is to make the points Pr and Pl have the same coordinate along the vertical direction in the images, which means

fs yl y l' + r0l z l' z l'

=

fs yr y r' + r0 r z r' z r'

We already know from the rectified relation above that zl' = zr' and yl' = yr'. Thus

fs_yl·yl' + r_0l·zl' = fs_yr·yr' + r_0r·zr'

which is the same as

( fs_yl − fs_yr )·yl' + ( r_0l − r_0r )·zl' = 0

To make the above equation hold for any y and z coordinates, we must have

fs_yl = fs_yr and r_0l = r_0r

Meeting the above constraints ensures the correctness of the rectification. We must therefore use a new set of camera intrinsic parameters to rectify the left and right images, since the parameters obtained from the camera calibration procedure are likely to contain errors and will not meet these constraints, which would result in different y coordinates for the same 3D point in the two images. In practice, we can set the intrinsic parameter matrices of both cameras to their average, or set both to either one of them, as long as they meet the requirements above. In the following steps, we will name this new camera intrinsic matrix Wnew; it is applied to both the left and the right rectification. For the rectification of the left image, relate the original image to the rectified image:

λl [ cl  rl  1 ]^T = Wol · Rl · P

λl' [ cl'  rl'  1 ]^T = Wnew · Rrect · Rl · P

where Wol is the intrinsic matrix obtained from camera calibration and the λ's are scale factors. From the two equations above, we deduce

λ [ cl'  rl'  1 ]^T = Wnew · Rrect · Wol^-1 · [ cl  rl  1 ]^T

The rectification procedure is realised by the above equation, while interpolation methods such as bilinear interpolation may be used to improve the quality of the rectified image. We have also noticed in previous experiments that not all of the rectified pixels fall inside the rectified image. This is a serious problem if the object we want to reconstruct in 3D is not contained in the rectified image. While changing the focal length can keep all the points within images of the same size as the original, we propose an alternative approach of shifting the image centre along the horizontal direction. It is obvious from the above equations that changing c_0l or c_0r does not destroy the rectification effect. By changing the image centre, we can move points from outside the images to inside, and find matching points within the newly shifted images.
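The pixel mapping used by this rectification step can be sketched as follows; `rectifying_homography` and `rectify_pixel` are illustrative names, and real code would also resample intensities (for example with bilinear interpolation):

```python
import numpy as np

def rectifying_homography(W_new, R_rect, W_o):
    """Pixel mapping from the original to the rectified image:
    lambda * [c', r', 1]^T = W_new @ R_rect @ inv(W_o) @ [c, r, 1]^T."""
    return W_new @ R_rect @ np.linalg.inv(W_o)

def rectify_pixel(H, col, row):
    """Map one pixel and divide by the scale factor (third component)."""
    p = H @ np.array([col, row, 1.0])
    return p[:2] / p[2]
```

With the identity rectification and unchanged intrinsics, a pixel maps to itself, which is a useful sanity check.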


4.4 Calculating the Stereo Correspondence
The stereo matching procedure is widely used in view synthesis and 3D reconstruction algorithms. One of the main concerns in these algorithms is the correspondence problem: determining an item in the left image and an item in the right image that are projections of the same scene element. We use a method for detecting depth discontinuities from a stereo pair of images that first computes a dense disparity map and then labels those points that exhibit a significant change in disparity. (A threshold is unfortunately inherent in any problem of estimating discontinuities from a discrete function.) This method is implemented in the OpenCV function cvFindStereoCorrespondence(). The following details about the algorithm are taken from "Depth Discontinuities by Pixel-to-Pixel Stereo", Stanford University Technical Report STAN-CS-TR-96-1573, July 1996. Many traditional stereo algorithms, such as those based on correlation windows, tend to blur depth discontinuities. This algorithm, on the other hand, uses neither windows nor preprocessing of the intensities, and matches the individual pixels in one image with those in the other image. As a result, it preserves sharp changes in disparity while introducing few false discontinuities, with far less computation than would be required if disparities were computed to sub-pixel resolution (which would of course be necessary for the more common goal of accurate scene reconstruction). Thus, by sacrificing accurate disparities the algorithm is able to quickly compute crisp and accurate discontinuities. The algorithm first matches epipolar scan-line pairs independently, detecting occlusions simultaneously with a disparity map using a form of dynamic programming. A postprocessor then propagates information between the scan lines to refine the disparity map, from which the depth discontinuities are detected.
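The Birchfield-Tomasi algorithm itself uses dynamic programming, but the underlying scan-line correspondence search it improves upon can be illustrated with a much simpler sum-of-absolute-differences matcher (a sketch for intuition only, not the OpenCV implementation):

```python
import numpy as np

def scanline_disparity(left_row, right_row, max_disp, win=2):
    """Illustrative scan-line matcher: for each left pixel, compare a small
    window against the same scan line in the right image over candidate
    disparities 0..max_disp and keep the lowest-cost (SAD) match."""
    n = len(left_row)
    disp = np.zeros(n, dtype=int)
    for x in range(win, n - win):
        patch_l = left_row[x - win:x + win + 1]
        best, best_d = np.inf, 0
        for d in range(0, min(max_disp, x - win) + 1):
            patch_r = right_row[x - d - win:x - d + win + 1]
            cost = np.abs(patch_l - patch_r).sum()
            if cost < best:
                best, best_d = cost, d
        disp[x] = best_d
    return disp
```

Windowed matchers of this kind are exactly the ones that blur depth discontinuities, which is the motivation for the pixel-to-pixel dynamic-programming approach described above.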


Chapter 5

3D reconstruction
We are given two images of a 3-dimensional scene taken from a stereo camera system. Our task is to understand the relationship of points between the pair of images and use this information to reconstruct the 3D coordinates of the points from their respective projected 2D coordinates on the image planes. We have chosen to implement the 3D reconstruction algorithm first proposed by the British psychologist H.C. Longuet-Higgins in 1981, to reconstruct the relative pose (i.e. position and orientation) of the cameras as well as the locations of the points in space from their projections onto the two image planes. In his paper "A computer algorithm for reconstructing a scene from two projections", Longuet-Higgins wrote: "the fusing of two images to produce a three-dimensional percept involves two distinct processes: the establishment of a 1:1 correspondence between image points in the two views - the 'correspondence problem' - and the use of the associated disparities for determining the distances of visible elements in the scene. I shall assume that the correspondence problem has been solved; the problem of reconstructing the scene then reduces to that of finding the relative orientation of the two view points." Therefore, in the absence of any noise or uncertainty, given two images taken from calibrated cameras, one can in principle recover the camera pose and the positions of the points (3D coordinates) in a few steps by solving a set of simultaneous linear equations. Before explaining the algorithm in detail, the first few sections of this chapter describe the epipolar constraint geometrically, along with methods to derive the fundamental matrix F and the essential matrix E. With the knowledge of F and E, we can follow the algorithm to consequently derive the relative translation vector T and the rigid rotation matrix R.

5.1 Epipolar geometry
We now consider the geometry of two calibrated cameras viewing a scene. This geometry is mainly concerned with the search for corresponding points in stereo matching. Suppose we have a point X in space that has been projected onto the two image planes of the left and right cameras. Assume that the projected point of X on the left image plane is x1 and that on the right image plane is x2. The figure below shows that the image points x1 and x2, the object point X and the camera centres are co-planar. The rays back-projected from x1 and x2 intersect at X and lie in the same plane as x1, x2 and X. Let us call this plane Π (pi).

60

Figure 13: Point correspondence geometry, from section 9.1 (p.240) of Hartley and Zisserman93. The original caption reads: "(a) The two cameras are indicated by their centres C and C'. The camera centres, 3-space point X and its images x and x' lie in a common plane Π. (b) An image point x back-projects to a ray in 3-space defined by the first camera centre C and x. This ray is imaged as a line l' in the second view. The 3-space point X which projects to x must lie on this ray, so the image of X in the second view must lie on l'."

Suppose we now only have the 2D coordinates of x1 in the left image plane and we would like to search for the corresponding point x2 in the right image plane. The plane Π can be constructed using the coordinates of the two camera centres and the known image point x1; alternatively, the equation of the plane can be derived using the baseline (the line joining the two camera centres) and the ray defined by x1. Assume that the plane Π intersects the right image plane at a line l2. From the above we know that x2 must lie on l2, and that l2 is the image in the right image plane of the ray back-projected from x1. The search for the corresponding point x2 in the right image plane can therefore be restricted to the line l2 instead of covering the entire image plane. Symmetrically, we can also establish a line l1 which is the image in the left image plane of the ray back-projected from x2. The lines l1 and l2 are called epipolar lines. The plane Π containing the baseline is called the epipolar plane. The baseline intersects each image plane at the epipoles e1 and e2.

Figure 14: Epipolar model, from Ch. 5 of Invitation to 3-D Vision (p.111). The original caption reads: "Two projected points x1, x2 ∈ R3 of a 3-D point p from two vantage points. The Euclidean transformation between the two cameras is given by (R,T) ∈ SE(3). The intersections of the line (o1,o2) with each image plane are called epipoles and are denoted e1 and e2. The lines l1 and l2 are called epipolar lines, which are the intersection of the plane (o1,o2,p) with the two image planes."

93 "Multiple View Geometry in Computer Vision", by R. Hartley and A. Zisserman, published in 2000 by Cambridge University Press


5.2 The fundamental matrix F
The epipolar geometry between two views is independent of scene structure and depends only upon the cameras' intrinsic parameters and relative pose. The fundamental matrix F encapsulates this property. In the following we describe how our implementation derives the fundamental matrix from the mapping between a point and its epipolar line. Given a pair of images of a scene taken from two calibrated cameras, for each point x1 in one image plane there exists a corresponding epipolar line l2 in the other image plane. Any point x2 in the second image corresponding with the point x1 in the first image plane must lie on the epipolar line l2. The epipolar line is constructed as the projection in the second image plane of the ray from the point x1 through the camera centre C of the first camera. We have therefore established the existence of a mapping from an arbitrary point x1 on the first image plane to its corresponding epipolar line in the other image.

5.3 Geometric derivation of F
Step 1: In this step, the point x1 on the first image plane is mapped to a point x2 in the second image plane which lies on the epipolar line l2. Point x2 is considered as a potential match for point x1. Assume we have a plane π which does not contain the baseline of the stereo system. Draw a ray from the point x1 on the first image plane through the first camera centre C1. This ray meets the plane π at a single point, called X. The point X on plane π is projected to a point x2 in the second image plane. Point x2 lies on the epipolar line l2 corresponding to the image of the initial ray, since X lies on the ray corresponding to x1. Both points x1 and x2 are image points of the 3D point X in space. Therefore there is a 2D homography Hπ mapping each point on one image plane to a point on the other image plane, x2 = Hπx1.

Figure 15: Point transfer via a plane. A point x in one image plane is transferred via a plane π to a matching point x' in the second image plane. The epipolar line l' through x' is obtained by joining x' to the epipole e'. Thus there is a 2D homography Hπ that maps each xi to x'i: x' = Hπx and l' = [e']×x' = [e']×Hπx = Fx

Step 2: Given the point x2, the epipolar line l2 that passes through x2 and the epipole e2 can be written as l2 = e2 × x2 = [e2]×x2. From the step 1 result, we derive an equation for l2: l2 = [e2]×x2 = [e2]×Hπx1 = Fx1


Hence F = [e2]×Hπ. The fundamental matrix F represents a mapping from a 2-dimensional onto a 1-dimensional projective space, and hence must have rank 2.

5.4 Algebraic derivation of F
An expression for the fundamental matrix F in terms of the two camera projection matrices P, P' can be derived algebraically. The following formulation has been presented by Xu and Zhang [Xu-96], and our implementation is based on it. We are given x and x' as two corresponding image points, in the two image planes, of a 3D object point X, and P and P' as the projection matrices of the two cameras. The ray back-projected from x by P is obtained by solving the equation PX = x. Let λ be a scalar parameter; the parametric solution of the above equation is given as X(λ) = P+x + λC, where P+ is the pseudo-inverse of matrix P, i.e. P+P = I, and C is the null vector of P, namely the camera centre, defined by PC = 0. In particular, two points on the ray are P+x (at λ = 0) and the first camera centre C (at λ = ∞). These two points are imaged by the second camera P' at P'P+x and P'C respectively in the second image plane. The epipolar line in the second image plane is the line joining these two projected points, the equation of which is given as l' = (P'C)×(P'P+x). Note that the point P'C is also the epipole in the second image, namely the projection of the first camera centre, and is denoted e'. The equation of l' is rewritten as l' = [e']×(P'P+)x = Fx, where F is the matrix F = [e']×(P'P+)
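The formula F = [e']×(P'P+) translates directly into NumPy; the function names here are ours, for illustration:

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(P1, P2):
    """F = [e']x (P' P+), with e' = P' C and C the null vector of P1."""
    # Camera centre C: right null space of P1 (last row of Vt in the SVD).
    _, _, Vt = np.linalg.svd(P1)
    C = Vt[-1]
    e2 = P2 @ C                       # epipole in the second image
    return skew(e2) @ P2 @ np.linalg.pinv(P1)
```

By construction, any pair of projections x = P1 X, x' = P2 X of the same 3D point satisfies x'ᵀ F x = 0 (up to numerical error), which is the defining property of F used in the next section.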

5.5 Fundamental matrix properties
The fundamental matrix F is a rank-2 homogeneous matrix with seven degrees of freedom (DOF). It satisfies the condition that for any pair of corresponding points x ↔ x' in the two image planes, x'ᵀFx = 0. From the derived equation for F, the pair of camera matrices (P, P') uniquely determines F; the converse, however, is not true. The fundamental matrix determines the pair of camera matrices at best up to right-multiplication by a 3D projective transformation. Given this ambiguity, it is common to define a specific canonical form for the pair of camera matrices corresponding to a given fundamental matrix, in which the first camera matrix is [I | 0], where I is the 3×3 identity matrix and 0 a null 3-vector. One property of a non-zero fundamental matrix F is that P'ᵀFP is skew-symmetric. The condition that P'ᵀFP is skew-symmetric is equivalent to XᵀP'ᵀFPX = 0 for all X. We know from previous parts that x' = P'X and x = PX, hence we have x'ᵀFx = 0, which is true as this is the definition of F. Let S be any skew-symmetric matrix. Define the pair of camera matrices P = [I | 0] and P' = [SF | e'], where e' is the epipole such that e'ᵀF = 0, and assume that P' so defined is a valid camera matrix; then F is the fundamental matrix corresponding to the pair (P, P').


As suggested by Luong and Viéville94, a good choice for S is S = [e']×, where e'ᵀe' ≠ 0. Therefore the camera matrices corresponding to a fundamental matrix F can be taken as P = [I | 0] and P' = [[e']×F | e'].

5.6 The essential matrix
The essential matrix was historically introduced by Longuet-Higgins before the fundamental matrix, and the fundamental matrix can be thought of as the generalisation of the essential matrix in which the assumption of calibrated cameras is dropped. The essential matrix is defined by x'ᵀEx = 0, where E has the form E = [t]×R = R[Rᵀt]×. Our implementation to solve for the essential matrix E is based upon the Longuet-Higgins 8-point algorithm. In order to give an understanding of the algorithm, the following sections describe concepts and properties of the essential matrix E that are used extensively in the algorithm.

5.7 Kronecker Product
The Kronecker product is denoted by the symbol ⊗. Given two vectors x = (x1, y1, z1)ᵀ and x' = (x2, y2, z2)ᵀ, their Kronecker product is defined as

a = x ⊗ x' = [ x1·x'  y1·x'  z1·x' ]ᵀ ∈ R⁹

The definition of the essential matrix can be rewritten using the Kronecker product as aᵀEˢ = 0, where Eˢ = [e11, e21, e31, e12, e22, e32, e13, e23, e33]ᵀ ∈ R⁹ stacks the columns of E. If we denote the correspondences by (x1ʲ, x2ʲ), j = 1, 2, …, n, and construct an a for each correspondence such that aʲ = x1ʲ ⊗ x2ʲ, then we can stack the a's into the design matrix χ such that

χ = [ (a¹)ᵀ ]
    [ (a²)ᵀ ] ∈ R^(n×9)
    [   …   ]
    [ (aⁿ)ᵀ ]

The definition of E can then be rewritten as χEˢ = 0. This equation is solved using SVD, and the solution is reshaped back into matrix form to obtain E. The essential matrix E has 9 entries; however, since we are in homogeneous coordinates we lose one degree of freedom, hence E has in total 8 degrees of freedom in this linear formulation. The essential matrix E is a member of the essential space ξ.

94

Q.-T. Luong and T. Viéville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2):193-229, 1996.


ξ = { [t]×R | R ∈ SO(3), t ∈ R³ } ⊂ R^(3×3)

5.8 The 8-point linear algorithm

For a given set of image correspondences (x1ʲ, x2ʲ), j = 1, 2, …, n (n ≥ 8), this algorithm recovers (R, T) ∈ SE(3) satisfying

(x2ʲ)ᵀ T̂ R x1ʲ = 0, for j = 1, 2, …, n

1. Compute a first approximation of the essential matrix. Construct χ = [a¹, a², …, aⁿ]ᵀ ∈ R^(n×9) from the correspondences (x1ʲ, x2ʲ), namely aʲ = x1ʲ ⊗ x2ʲ. Find the vector Eˢ ∈ R⁹ of unit length such that ||χEˢ|| is minimised, by performing the following steps: compute the SVD χ = Uχ Σχ Vχᵀ and define Eˢ to be the ninth column of Vχ. Un-stack the nine elements of Eˢ into a square 3×3 matrix E.

2. Project onto the essential space. Call F the initial estimate of E, and let F = U diag{λ1, λ2, λ3} Vᵀ, with λ1 ≥ λ2 ≥ λ3. Then choose

E = U diag{σ, σ, 0} Vᵀ, with σ = (λ1 + λ2)/2

This choice of E is the one that minimises the Frobenius norm ||E − F||²_F. This two-step process of estimating E from the stack Eˢ and then projecting it onto the essential space is a good first cut.
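The two steps of the linear algorithm can be sketched as follows (a minimal NumPy illustration, assuming exact, noise-free correspondences given as homogeneous 3-vectors; not the thesis implementation itself):

```python
import numpy as np

def eight_point_essential(x1, x2):
    """Linear 8-point estimate of E (up to scale and sign) from n >= 8
    correspondences given as rows of homogeneous 3-vectors x1, x2."""
    # Step 1: design matrix chi, rows a_j = x1_j (Kronecker product) x2_j.
    chi = np.array([np.kron(p1, p2) for p1, p2 in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(chi)
    Es = Vt[-1]                          # unit vector minimising ||chi Es||
    E = Es.reshape(3, 3, order='F')      # un-stack column-major, as E^s
    # Step 2: project onto the essential space.
    U, s, Vt = np.linalg.svd(E)
    sigma = (s[0] + s[1]) / 2.0
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```

The column-major reshape matches the column-stacked definition of Eˢ given in section 5.7, so that aᵀEˢ = x2ᵀEx1.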

5.9 The universal scale ambiguity
It is a known fact that 3D reconstruction can only be done up to a universal scale factor. In the following steps of finding the translation vector T and rotation matrix R, we will use a "normalised essential matrix", which means that we take σ = 1. This is in effect equivalent to choosing ||T|| = 1, and does not affect the rotation matrix R.

5.10 Finding translation vector T from E
From the definition of the essential matrix E, we note that

E = [T]×R = T̂R

EEᵀ = T̂RRᵀT̂ᵀ = T̂T̂ᵀ = −T̂², since T̂ᵀ = −T̂

trace(EEᵀ) = sum of the squares of the singular values = 2σ² = 2·1 = 2 (using the convention σ = 1)

However, we have that T = (Tx, Ty, Tz)ᵀ


T̂ = [  0   −Tz   Ty ]
     [  Tz   0   −Tx ]
     [ −Ty   Tx   0  ]

and

T̂² = [ −Ty²−Tz²    TxTy       TxTz   ]
     [   TxTy    −Tx²−Tz²     TyTz   ]
     [   TxTz      TyTz     −Tx²−Ty² ]

The above matrix is symmetric, and

trace(−T̂²) = 2(Tx² + Ty² + Tz²) = 2||T||²

consistent with trace(EEᵀ) = 2 when ||T|| = 1. Since ||T||² = 1, this means that

EEᵀ = −T̂² = [ 1−Tx²   −TxTy   −TxTz ]
             [ −TxTy   1−Ty²   −TyTz ]
             [ −TxTz   −TyTz   1−Tz² ]

From this matrix we can see that there are three independent relationships between the diagonal and off-diagonal elements, which allows us to check the results independently. However, we do not know which way the baseline goes, since the absolute signs of the Ti and Eij are still undetermined.
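Using EEᵀ = I − TTᵀ, the unit translation can be recovered up to a global sign from the columns of I − EEᵀ; a small sketch with an illustrative function name:

```python
import numpy as np

def translation_from_essential(E):
    """Recover the unit translation T (up to a global sign) from a
    normalised essential matrix, using E E^T = I - T T^T."""
    M = np.eye(3) - E @ E.T           # equals T T^T
    i = int(np.argmax(np.diag(M)))    # pivot on the largest diagonal entry
    T = M[:, i] / np.sqrt(M[i, i])    # column i of T T^T is T_i * T
    return T
```

Pivoting on the largest diagonal entry avoids dividing by a near-zero component; the remaining sign ambiguity is exactly the baseline-direction ambiguity noted above.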

5.11 Finding rotation matrix R from E
We have implemented the method presented in Longuet-Higgins' paper to calculate R from the value of the essential matrix E. From the definition of the essential matrix, we have

E = T̂R

If we regard each row Eα of E and each row Rα of R as a 3-component vector, we then have

Eα = T × Rα, for α = 1, 2, 3

and the condition for R to represent a proper rotation can be expressed in a similar form:

Rα = Rβ × Rγ, for α, β, γ such that εαβγ = 1

Our next step is to express Rα in terms of T and the Eα. From the previous equation we can see that Rα is orthogonal to Eα and may therefore be expressed as a linear combination of T and Eα × T. Consequently we introduce new vectors

Wα = Eα × T, for α = 1, 2, 3

and write

Rα = aα T + bα Wα

Substituting this into the equation for Eα, we have

Eα = T × (aα T + bα Wα) = bα (T × Wα)

However, since T is a unit vector,

T × Wα = T × (Eα × T) = Eα

and therefore bα = 1. We know from the proper-rotation condition that Rα = Rβ × Rγ for εαβγ = 1, so

Rα = aα T + Wα = (aβ T + Wβ) × (aγ T + Wγ) = aβ Eγ − aγ Eβ + Wβ × Wγ

However, the vectors Wα, Eβ and Eγ are orthogonal to T, whereas Wβ × Wγ is a multiple of T. It follows that

aα T = Wβ × Wγ

and the equation for Rα becomes

Rα = Wα + Wβ × Wγ

For α = 1, 2, 3 we apply this formula to obtain all the rows of the rotation matrix R.
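The row-by-row recovery of R can be sketched directly from these formulas, using the row convention Eα = T × Rα adopted above (illustrative code, not the thesis implementation):

```python
import numpy as np

def rotation_from_E_and_T(E, T):
    """Longuet-Higgins recovery of the rows of R from E and the unit
    translation T: W_a = E_a x T, then R_a = W_a + W_b x W_c for
    cyclic (a, b, c)."""
    W = np.cross(E, T)                     # rows W_a = E_a x T
    R = np.empty((3, 3))
    for a in range(3):
        b, g = (a + 1) % 3, (a + 2) % 3    # cyclic (alpha, beta, gamma)
        R[a] = W[a] + np.cross(W[b], W[g])
    return R
```

In practice the sign ambiguities of E and T mean the formula is evaluated for each candidate sign combination, with the physically valid solution selected as described in the next section.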

5.12 Estimating depths given rotation matrix R and translation vector T
Before explaining how our implementation estimates the depths given the rotation matrix R and translation vector T, we note that there are 4 possible combinations of solutions for R and T, accounting for the ± signs. The 4 possible solutions are illustrated geometrically as follows.

Figure 16: Possible configurations. The original caption from Ch. 8 of Hartley and Zisserman reads: "The four possible solutions for calibrated reconstruction from E. Between the left and right sides there is a baseline reversal. Between the top and bottom rows camera B rotates 180° about the baseline. Note, only in (a) is the reconstructed point in front of both cameras."

When the reconstructed point is in front of both cameras, we know that we have obtained correct solutions. We have implemented this problem in a way that the code will automatically select the solution that makes sense, i.e. the one where all the points are in front of the camera.


Having the values of the rotation matrix R and translation vector T, we carry on with performing Euclidean reconstruction. Our implementation adapts the algorithm described in Longuet-Higgins' paper. The following describes the algorithm in detail.

Let P be a visible point in the scene and let (X1, X2, X3) and (X1', X2', X3') be its 3-dimensional Cartesian coordinates with respect to the two viewpoints. The 'forward' coordinates (the last component of the 3D coordinates) are necessarily positive. The image coordinates of P in the two views may be defined as

x = (x1, x2) = (X1/X3, X2/X3)
x' = (x1', x2') = (X1'/X3', X2'/X3')

It is convenient to add an extra 'dummy' coordinate

x3 = 1, x3' = 1

so that one can then write

xμ = Xμ/X3, xν' = Xν'/X3', for μ, ν = 1, 2, 3

As the two sets of 3-dimensional coordinates are connected by an arbitrary displacement, we then have

Xμ' = Rμν(Xν − Tν)

where T is our newly found translation vector and R is the rotation matrix found in the previous step. It follows that

x1' = X1'/X3' = R1ν(Xν − Tν) / R3ν(Xν − Tν)

If we rewrite the equation for x1' in terms of the rows Rα of the matrix R,

x1' = R1·(X − T) / R3·(X − T) = R1·(x − T/X3) / R3·(x − T/X3)

from which it follows that

X3 = (R1 − x1'R3)·T / (R1 − x1'R3)·x

The other coordinates are then calculated as

X1 = x1X3
X2 = x2X3
X1' = R1ν(Xν − Tν)
X2' = R2ν(Xν − Tν)
X3' = R3ν(Xν − Tν)
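The depth formula and back-substitution above can be sketched as follows (normalised image coordinates are assumed, and the function name is ours):

```python
import numpy as np

def reconstruct_point(x, x1p, R, T):
    """3D coordinates in the first camera frame from the homogeneous
    image point x = (x1, x2, 1), the first coordinate x1' of the matching
    point in the second view, and the recovered (R, T):
    X3 = (R1 - x1'*R3) . T / (R1 - x1'*R3) . x, then X = x * X3."""
    row = R[0] - x1p * R[2]
    X3 = (row @ T) / (row @ x)
    return x * X3          # (x1*X3, x2*X3, X3)
```

The coordinates in the second frame then follow from X' = R(X − T), and the sign of X3 (and of R3·(X − T)) is what distinguishes the one physically valid solution among the four configurations of Figure 16.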

5.13 Algorithm analysis
The algorithm may fail for some configurations of the 8 points, when the equations x'ᵀEx = 0 become linearly dependent. This happens in the following cases:

− Four of the points lie in a straight line
− Seven of them lie in a plane
− Six points happen to be at the vertices of a regular hexagon
− Eight points happen to be at the vertices of a cube

Such a degenerate configuration of the 8 points can be resolved by moving one of the offending points slightly away from its original position. The algorithm yields the most accurate results when the distance d between the centres of projection is not too small compared with their distances from the points.


Chapter 6

Marker-based tracking
6.1 Choice of markers
Our system is designed to detect markers with rectangular shapes in specified colours. Marker detection is done by combining edge detection and colour detection.

Figure 17: Markers used in the implementation

6.2 Background subtraction
Given a frame sequence from a fixed camera, our task is to detect all the foreground objects. Detecting the foreground objects can be considered as finding the difference between the current frame and an image of the scene's static background:

| frame_i − background_i | > θ, where θ is the threshold.

The background of the image is not fixed but must adapt to:
− First, illumination changes: gradual changes and sudden changes.
− Second, motion changes: camera oscillations, high-frequency background objects.
− Third, changes in the background geometry.

6.2.1 Basic methods
Frame difference:

| frame_i − frame_{i−1} | > θ, where θ is the threshold.

The estimated background is simply the previous frame. This method works only under particular conditions of object speed and frame rate, and is extremely sensitive to the threshold θ.


Background as the average or the median of the previous n frames. Background as the running average:

B_{i+1} = αF_i + (1 − α)B_i, where α is the learning rate, typically 0.05.

The background model at each pixel location is based on the pixel's recent history. The history could be the previous n frames, or a weighted average in which recent frames have higher weight. In essence, the background model is computed as a chronological average of the pixel's history. No spatial correlation is used between different (neighbouring) pixel locations. At each new frame, each pixel is classified as either foreground or background. If the pixel is classified as foreground it is ignored in the background model update. In this way, we prevent the background model from being polluted by pixels that do not logically belong to the background scene. We can incorporate this selectivity into the running average. The method is formulated as follows:

B_{t+1}(x, y) = αF_t(x, y) + (1 − α)B_t(x, y), if F_t(x, y) is background
B_{t+1}(x, y) = B_t(x, y), if F_t(x, y) is foreground.
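A minimal sketch of the selective running-average update described above; the threshold θ used to classify foreground here is an assumed parameter, not a value from the thesis:

```python
import numpy as np

def update_background(B, F, alpha=0.05, theta=25.0):
    """Selective running-average update: pixels classified as foreground
    (|F - B| > theta) leave the background model untouched."""
    foreground = np.abs(F - B) > theta
    B_next = alpha * F + (1 - alpha) * B
    B_next[foreground] = B[foreground]    # keep the old model at fg pixels
    return B_next, foreground
```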

6.2.2 Implementation
The first phase of the tracking system involves separating potential hand pixels from non-hand pixels. Before segmentation occurs, we first convolve all captured images with a 5×5 Gaussian filter and then scale the filtered image by one half in each dimension in order to reduce noisy pixel data. All subsequent image processing operations are then performed on this scaled and filtered image. Since the stereo cameras are mounted above a non-moving workspace, a simple background subtraction scheme is used to segment any potential foreground hand information from the non-changing background scene. At system start-up, a pair of background images IB,L and IB,R is captured to represent the static workspace from each camera view. Subsequent frames then use the appropriate background image to segment out moving foreground data.

Figure 18: Background subtraction from the left camera: background image, captured image and foreground image


Figure 19: Background subtraction from the right camera: background image, captured image and foreground image.

6.3 Edge detection
6.3.1 Introduction95
Edge detection is a fundamental tool used in most image processing applications to obtain information from the frames, as a precursor step to feature extraction and object segmentation. This process detects outlines of an object and boundaries between objects and the background in the image. An edge-detection filter can also be used to improve the appearance of blurred or anti-aliased video streams. The basic edge-detection operator is a matrix area gradient operation that determines the level of variance between different pixels. The edge-detection operator is calculated by forming a matrix centred on a pixel chosen as the centre of the matrix area. If the value of this matrix area is above a given threshold, then the middle pixel is classified as an edge. Examples of gradient-based edge detectors are the Roberts, Prewitt and Sobel operators. All the gradient-based algorithms have kernel operators that calculate the strength of the slope in directions which are orthogonal to each other, commonly vertical and horizontal. The contributions of the different components of the slopes are then combined to give the total value of the edge strength. The Prewitt operator measures two components: the vertical edge component is calculated with kernel Kx and the horizontal edge component with kernel Ky. |Kx| + |Ky| gives an indication of the intensity of the gradient at the current pixel.

Kx = [ −1  0  1 ]        Ky = [ −1  −1  −1 ]
     [ −1  0  1 ]   and       [  0   0   0 ]
     [ −1  0  1 ]             [  1   1   1 ]

Depending on the noise characteristics of the image or streaming video, edge detection results can vary. Gradient-based algorithms such as the Prewitt filter have the major drawback of being very sensitive to noise: the size of the kernel filter and its coefficients are fixed and cannot be adapted to a given image. An adaptive edge-detection algorithm is necessary to provide a robust solution that is adaptable to the varying noise levels of these images, helping to distinguish valid image content from visual artefacts introduced by noise.
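For illustration, the Prewitt response |Gx| + |Gy| can be computed with an explicit 3×3 correlation over the valid image region (a sketch only, not an optimised implementation):

```python
import numpy as np

def prewitt_magnitude(img):
    """Prewitt gradient magnitude |Gx| + |Gy| via explicit 3x3 correlation
    over the valid region, using the kernels given above."""
    Kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
    Ky = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            out[i, j] = abs((patch * Kx).sum()) + abs((patch * Ky).sum())
    return out
```

On a vertical step edge the horizontal kernel Kx responds strongly while Ky gives zero, and vice versa for a horizontal edge.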

95 H.S. Neoh, A. Hazanchuk, "Adaptive edge detection for real-time video processing using FPGAs"


Our implementation uses the Canny algorithm to perform edge detection. The following section describes in detail how the algorithm works.

6.3.2 Canny edge detection algorithm

− The Canny edge detector defines edges as zero-crossings of second derivatives in the direction of the greatest first derivative.
− The direction of greatest first derivative: this is simply the gradient direction (w in first-order gauge coordinates).
− The second derivative: we can compute this using the matrix of second derivatives (the Hessian H), as the second directional derivative

L_uu = uᵀHu

− Zero crossings: as with the Marr-Hildreth edge detector, we use positive-negative transitions to "trap" zeroes. Thus, in gauge coordinates, the Canny detector is

L_ww = 0

In terms of x and y derivatives, the above equation can be expanded to

L_ww = 1/(Lx² + Ly²) · [ Lx  Ly ] [ Lxx  Lxy ] [ Lx ] = 0
                                  [ Lyx  Lyy ] [ Ly ]

Zero-crossings in this measure give connected edges much like the Laplacian operator but more accurately localize the edge. The Canny algorithm uses an optimal edge detector based on a set of criteria which include finding the most edges by minimizing the error rate, marking edges as closely as possible to the actual edges to maximize localization, and marking edges only once when a single edge exists for minimal response96. According to Canny, the optimal filter that meets all three criteria above can be efficiently approximated using the first derivative of a Gaussian function.

G(x, y) = 1/(2πσ²) · e^(−(x² + y²)/(2σ²))

∂G(x, y)/∂x = −αx · e^(−(x² + y²)/(2σ²))

∂G(x, y)/∂y = −αy · e^(−(x² + y²)/(2σ²))

where α = 1/(2πσ⁴).

The first stage involves smoothing the image by convolving it with a Gaussian filter. This is followed by finding the gradient of the image by feeding the smoothed image through a convolution with the derivative of the Gaussian in both the vertical and horizontal directions. The 2-D convolution operation is described by the following equation:

I'(x, y) = g(k, l) ⊗ I(x, y) = Σ_{k=−N..N} Σ_{l=−N..N} g(k, l) · I(x − k, y − l)

where:
g(k, l) = convolution kernel
96

Canny, J., "A Computational Approach to Edge Detection", IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6):679-698, November 1986


I(x, y) = original image
I'(x, y) = filtered image
2N + 1 = size of the convolution kernel

Both the Gaussian mask and its derivative are separable, allowing the 2-D convolution operation to be simplified. This optimisation is not limited to software implementations, but applies to hardware implementations as well. The non-maximal suppression stage finds the local maxima in the direction of the gradient and suppresses all others, minimising false edges. The local maximum is found by comparing the pixel with its neighbours along the direction of the gradient. This helps to maintain single-pixel-thin edges before the final thresholding stage. Instead of using a single static threshold value for the entire image, the Canny algorithm introduced hysteresis thresholding, which has some adaptivity to the local content of the image. There are two threshold levels, t_h (high) and t_l (low), where t_h > t_l. Pixel values above t_h are immediately classified as edges. By tracing the edge contour, neighbouring pixels with gradient magnitude values less than t_h can still be marked as edges as long as they are above t_l. This process alleviates problems associated with edge discontinuities by identifying strong edges and preserving the relevant weak edges, in addition to maintaining some level of noise suppression. While the results are desirable, the hysteresis stage slows the overall algorithm down considerably.

The performance of the Canny algorithm depends heavily on the adjustable parameters: σ, the standard deviation of the Gaussian filter, and the threshold values t_h and t_l. σ also controls the size of the Gaussian filter: the bigger the value of σ, the larger the Gaussian filter becomes. This implies more blurring, necessary for noisy images, as well as detecting larger edges. As expected, however, the larger the scale of the Gaussian, the less accurate the localisation of the edge.
Smaller values of σ imply a smaller Gaussian filter which limits the amount of blurring, maintaining finer edges in the image. The user can tailor the algorithm by adjusting these parameters to adapt to different environments with different noise levels. Results can be further improved by performing edge detection at multiple resolutions using multi-scale representations, similar to the Marr-Hildreth algorithm97. This is achieved using different standard deviations, which correspond to different resolution versions of the image. Edges have zero crossing at multiple scale values. Combining maxima information from different scales allows better classification of true edges. Convolution at multiple resolutions with large Gaussian filters requires even more computation power. This may prove to be challenging to implement as software solution for real-time applications.
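The hysteresis stage can be sketched as follows. This is an illustrative Python fragment, not the implementation used in this project; `mag` is assumed to be a precomputed gradient-magnitude array, and the threshold values in the usage example are hypothetical:

```python
from collections import deque

def hysteresis(mag, t_low, t_high):
    """Classify pixels as edges: seed with pixels above t_high, then grow
    along 8-connected neighbours whose magnitude is still above t_low."""
    h, w = len(mag), len(mag[0])
    edge = [[False] * w for _ in range(h)]
    stack = deque((r, c) for r in range(h) for c in range(w)
                  if mag[r][c] >= t_high)
    for r, c in stack:           # mark the strong seeds
        edge[r][c] = True
    while stack:                 # grow into connected weak pixels
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and not edge[rr][cc] \
                        and mag[rr][cc] >= t_low:
                    edge[rr][cc] = True
                    stack.append((rr, cc))
    return edge
```

A weak pixel (between tl and th) is kept only if it is connected to a strong one; an isolated weak pixel is suppressed.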

6.4 Colour detection
We have decided to incorporate colour detection into our marker tracking algorithm so that markers of a particular colour can be detected more accurately. Colour detection is performed by applying thresholds to the hue channel of the image. To improve the result, the image can be smoothed before thresholding to remove noisy pixels, and the thresholded image is then eroded and dilated to remove any insignificant detected areas.

[97] Marr, D., Hildreth, E., "Theory of edge detection", Proceedings of the Royal Society of London, Series B, vol. 207, pp. 187-217, 1980

6.4.1 HSV colour space
In colour image processing, a pixel is defined by three values corresponding to the tristimuli Red, Green and Blue. The RGB system ensures that no distortion of the initial information occurs. However, there are disadvantages to using the RGB colour space: the colour features are highly correlated, and it is impossible to evaluate the similarity of two colours from their distance in this space. We have therefore chosen the Hue Saturation Value (HSV) colour space for our colour detection. Hue specifies the colour and is inherently an angle, i.e. it has a cyclic or periodic character. The colour angle in HSV is defined to start at red, go via green to blue and back to red. This full circle can be mapped onto an axis from zero to one, with red corresponding to both zero and one because of the cyclic character. The distance from the centre of the plane is the saturation of a colour and states how much the pure colour is diluted by white. The value channel is related to the equivalent grey level.

6.4.2 Transformation from RGB to HSV [98]
There are many different HSV transformations; one of them is described below. Given a colour defined by the tristimuli (R, G, B), where R, G and B are between 0.0 and 1.0, with 0.0 being the least amount and 1.0 the greatest amount of that colour, an equivalent (H, S, V) colour can be determined by a series of formulas. Let MAX equal the maximum of the (R, G, B) values and MIN the minimum of those values. The formulas can then be written as

    H = 60 × (G - B) / (MAX - MIN)           if MAX = R
    H = 60 × (B - R) / (MAX - MIN) + 120     if MAX = G
    H = 60 × (R - G) / (MAX - MIN) + 240     if MAX = B
    S = (MAX - MIN) / MAX
    V = MAX

The resulting values are in (H, S, V) form, where H varies from 0.0 to 360.0, indicating the angle in degrees around the colour circle where the hue is located. S and V vary from 0.0 to 1.0, with 0.0 being the least amount and 1.0 the greatest amount of saturation or value, respectively. As an angular coordinate, H can wrap around from 360 back to 0, so any value of H outside the 0.0 to 360.0 range can be mapped onto that range by dividing H by 360.0

[98] Wikipedia, http://en.wikipedia.org/wiki/HSV_color_space


and finding the remainder (i.e. modular arithmetic). Thus, -30 is equivalent to 330, and 480 is equivalent to 120, for example. If MAX = MIN (i.e. S = 0), then H is undefined: the colour lies along the central line of greys, so naturally it has no hue and the angular coordinate is meaningless. If MAX = 0 (i.e. V = 0), then S is undefined: the colour is pure black, so naturally it has neither hue nor saturation.
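The conversion described above can be written directly. The sketch below is one of many equivalent formulations (illustrative only, not our implementation), returning H = 0 for greys where the hue is formally undefined:

```python
def rgb_to_hsv(r, g, b):
    """Convert R, G, B in [0, 1] to (H, S, V) with H in degrees [0, 360)."""
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx
    s = 0.0 if mx == 0 else (mx - mn) / mx
    if mx == mn:
        h = 0.0                                  # hue undefined for greys
    elif mx == r:
        h = (60.0 * (g - b) / (mx - mn)) % 360.0  # wrap negative hues
    elif mx == g:
        h = 60.0 * (b - r) / (mx - mn) + 120.0
    else:
        h = 60.0 * (r - g) / (mx - mn) + 240.0
    return h, s, v
```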

6.4.3 Gaussian blur
Gaussian blur is one of the most common techniques used to reduce image noise and detail. The visual effect of this blurring technique is a smooth blur resembling that of viewing the image through a translucent screen. Applying a Gaussian blur amounts to convolving the image with a Gaussian distribution, formulated as

G(r) = (1 / √(2πσ²)) · e^(-r² / (2σ²))

where r is the blur radius (distance from the kernel centre) and σ is the standard deviation of the Gaussian distribution. In two dimensions it produces a surface whose contours are concentric circles with a Gaussian distribution about the centre point. Pixels where this distribution is non-zero are used to build a convolution matrix, which is applied to the original image. Each pixel's value is set to a weighted average of its neighbourhood: the original pixel receives the heaviest weight (having the highest Gaussian value), and neighbouring pixels receive smaller weights as their distance from the original pixel increases. This results in a blur that preserves boundaries and edges better than more uniform blurring filters.
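The kernel construction and the separability property can be sketched as follows (illustrative only): a 2-D Gaussian kernel is the outer product of two 1-D kernels, which is what allows the 2-D convolution to be replaced by two cheaper 1-D passes.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Sampled 1-D Gaussian of size 2*radius + 1, normalised to sum to 1."""
    k = [math.exp(-x * x / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def gaussian_kernel_2d(sigma, radius):
    """2-D kernel built as the outer product of two 1-D kernels;
    this outer-product structure is exactly the separability property."""
    g = gaussian_kernel_1d(sigma, radius)
    return [[gx * gy for gx in g] for gy in g]
```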

6.4.4 Dilation and erosion
Dilation allows objects to expand, potentially filling in small holes and connecting disjoint objects. Erosion shrinks objects by etching away (eroding) their boundaries. The dilation of a set A by a set B, denoted A ⊕ B, is obtained by first reflecting B about its origin and then translating the reflection by x; all x for which the translated reflection of B and A have at least one point in common form the dilated set:

    A ⊕ B = { x | (B̂)x ∩ A ≠ ∅ }

where B̂ denotes the reflection of B, i.e. B̂ = { x | x = -b, b ∈ B }, and (B)x denotes the translation of B by x = (x1, x2), i.e. (B)x = { c | c = b + x, b ∈ B }. Thus, dilation of A by B expands the boundary of A. For gray-scale images dilation is easier to implement than the set description suggests:

    (f ⊕ b)(s, t) = max{ f(s - x, t - y) + b(x, y) | (s - x), (t - y) ∈ Df ; (x, y) ∈ Db }

Here f and b denote the images f(x, y) and b(x, y): f is being dilated, b is called the structuring element, and Df and Db are the domains of f and b respectively. Thus, in dilation we choose the maximum value of f + b in a neighbourhood defined by b. If all elements of b are positive, the dilated image is brighter than the original and dark details are either reduced or eliminated.

The erosion of A by B, denoted A - B, is the set of all x such that B translated by x is completely contained in A, i.e.

    A - B = { x | (B)x ⊆ A }

For gray-scale images we have

    (f - b)(s, t) = min{ f(s + x, t + y) - b(x, y) | (s + x), (t + y) ∈ Df ; (x, y) ∈ Db }

Erosion is thus based on choosing the minimum value of f - b in a neighbourhood defined by the shape of b. If all elements of b are positive, the output image is darker than the original, and the effect of bright details in the input image is reduced if they cover a region smaller than b.
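The gray-scale formulas above translate directly into code. The sketch below is illustrative only (not the project's implementation) and, for simplicity, indexes the structuring element from its top-left corner rather than its centre:

```python
def grey_dilate(f, b):
    """(f (+) b)(s,t) = max of f(s-x, t-y) + b(x,y) over the domain of b."""
    H, W = len(f), len(f[0])
    h, w = len(b), len(b[0])
    out = [[0] * W for _ in range(H)]
    for s in range(H):
        for t in range(W):
            out[s][t] = max(f[s - x][t - y] + b[x][y]
                            for x in range(h) for y in range(w)
                            if 0 <= s - x < H and 0 <= t - y < W)
    return out

def grey_erode(f, b):
    """(f (-) b)(s,t) = min of f(s+x, t+y) - b(x,y) over the domain of b."""
    H, W = len(f), len(f[0])
    h, w = len(b), len(b[0])
    out = [[0] * W for _ in range(H)]
    for s in range(H):
        for t in range(W):
            out[s][t] = min(f[s + x][t + y] - b[x][y]
                            for x in range(h) for y in range(w)
                            if 0 <= s + x < H and 0 <= t + y < W)
    return out
```

With an all-positive structuring element, dilation brightens the image, as the text notes.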

6.4.5 Implementation
We first smooth the image by applying the Gaussian blur convolution to remove noise. The resulting image is then transformed from RGB to HSV colour space, and thresholding is applied on the hue plane to segment out the required colour. To improve the result we then apply erosion and dilation: erosion removes small insignificant regions, and dilation adds back the layers removed by erosion, returning the detected regions to their correct size. Some of the resulting images are shown below.

Figure 20: Original image and intermediate image

Figure 21: Image after thresholding and the resulting image
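The thresholding and morphological clean-up steps of 6.4.5 can be sketched end to end. This is a minimal plain-Python illustration, not our implementation: the hue range and the 3 × 3 structuring element are hypothetical, and the smoothing step is omitted for brevity.

```python
def detect_colour(hue, h_min, h_max):
    """Binary mask of pixels whose hue (degrees) lies in [h_min, h_max],
    followed by one erosion and one dilation (a morphological opening)
    with a 3x3 square, which drops insignificant specks."""
    H, W = len(hue), len(hue[0])
    mask = [[h_min <= hue[r][c] <= h_max for c in range(W)] for r in range(H)]

    def neighbourhood(m, r, c):
        return [m[rr][cc] for rr in range(r - 1, r + 2)
                          for cc in range(c - 1, c + 2)
                          if 0 <= rr < H and 0 <= cc < W]

    eroded = [[all(neighbourhood(mask, r, c)) for c in range(W)]
              for r in range(H)]
    opened = [[any(neighbourhood(eroded, r, c)) for c in range(W)]
              for r in range(H)]
    return opened
```

A solid 3 × 3 region of the target hue survives the opening, while an isolated one-pixel speck is removed.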


6.5 Spatial moments and centre of gravity
The spatial moment of order (p, q) is defined as

    m_pq = ∫∫ x^p y^q f(x, y) dx dy

with both integrals taken over (-∞, +∞). However, since the image function is defined only at the pixels, we transform the above formula into the discrete sum

    m_pq = Σ (i = 1 to xres) Σ (j = 1 to yres) i^p j^q f(i, j)

The centre of gravity of a region can then be obtained from the spatial moments according to the following formulas:

    xc = m10 / m00
    yc = m01 / m00
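These formulas translate directly into code. A small sketch (illustrative only), assuming `img` is a 2-D array of pixel values with indices starting at 1 as in the sums above:

```python
def centre_of_gravity(img):
    """Centre of gravity from the spatial moments m10/m00 and m01/m00.
    img[i][j] is the (grey-level or binary) image value; i indexes x and
    j indexes y, matching the discrete summation."""
    m00 = m10 = m01 = 0.0
    for i, row in enumerate(img, start=1):
        for j, v in enumerate(row, start=1):
            m00 += v
            m10 += i * v
            m01 += j * v
    return m10 / m00, m01 / m00
```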


Chapter 7

Evaluation

7.1 Calibration evaluation
The cameras to be calibrated are off-the-shelf Unibrain FireWire webcams. They are positioned on two fixed, separated tripods on the desk, as illustrated below.

Figure 22 Stereo system set-up

7.1.1 Experiments and results:
We carry out experiments calibrating the two cameras with different numbers of views of the chessboard and then analyse the accuracy of the calibration. The image resolution is 640 x 480. The model plane contains a pattern of 7 x 7 corners in each dimension, hence a total of 49 corners is expected to be found in the chessboard model. The chessboard model was printed with a high-quality printer and mounted on a rigid surface. We took 20 images in total under different orientations; some of them are shown below.


Figure 23: Images captured from left camera

Figure 24: Images captured from right camera

For each view of the chessboard model we have a pair of captured images, one from each camera. We apply our calibration method to each camera independently using the corresponding images for each view, and afterwards produce the stereo extrinsic parameters (rotation matrix and translation vector) for each set of images. The results are shown in the tables below.

Table 1: Cameras' principal point coordinates

Number of images   Left camera x   Left camera y   Right camera x   Right camera y
5                  187.378         125.871         171.272          120.008
10                 289.113         106.747         242.657          194.588
15                 286.801         115.553         245.174          189.999
20                 256.949         109.609         249.661          182.843

Table 2: Cameras' focal length components

Number of images   Left fc(1)   Left fc(2)   Right fc(1)   Right fc(2)
5                  545.11       410.847      469.747       355.311
10                 776.616      595.516      477.708       376.176
15                 617.023      481.497      467.331       368.707
20                 600.728      463.505      477.312       373.938


Table 3: Left camera's distortion coefficients

Number of images   d(1)       d(2)       d(3)       d(4)
5                  -0.3112    3.8378     3.8378     0.0106
10                 -0.35023   -0.71506   0.058944   -0.07419
15                 -0.2055    -0.42825   0.060752   -0.05346
20                 -0.23088   -0.03736   0.048946   -0.04573

Table 4: Right camera's distortion coefficients

Number of images   d(1)       d(2)       d(3)       d(4)
5                  -0.2056    2.1428     -0.0057    0.0021
10                 0.176888   -0.0126    0.072222   0.065991
15                 0.177187   -0.03123   0.068897   0.066836
20                 0.163441   -0.01136   0.06645    0.066837

7.1.2 Results analysis:
In Table 1 we observe a considerable discrepancy in the principal point coordinates between the first set (5 images) and the rest, whereas over the last three experiments (10, 15 and 20 images) the values are reasonably consistent. In Table 2, the right camera's focal length values are quite consistent over the four sets of images, while the left camera's focal length fluctuates noticeably. In Tables 3 and 4 we observe that the distortion coefficients of both cameras are consistent over the last three sets of images, again with a discrepancy between the first set and the rest. These observations are illustrated graphically below.

Figure 25: Left camera's principal point


Figure 26: Right camera's principal point

Figure 27: Left camera's focal length

Figure 28: Right camera's focal length


Figure 29: Left camera's distortion coefficients

Figure 30: Right camera's distortion coefficients

7.1.3 Results comparison:
We ran the Matlab calibration toolbox [99] on all of the captured images to calibrate the left and right cameras. The calibration results for the set of 20 images per view are presented below.

For the left camera:
Focal length: fc = [640.86691 496.26059] ± [21.43555 15.62463]
Principal point: cc = [316.86354 63.72844] ± [25.11130 18.86976]
Distortion: kc = [0.01583 -0.53755 0.01555 -0.02182 0.00000] ± [0.13044 0.35233 0.01150 0.02091 0.00000]

For the right camera:
Focal length: fc = [529.91208 436.35726] ± [14.62129 10.13643]

[99] http://www.vision.caltech.edu/bouguetj/calib_doc/index.html


Principal point: cc = [320.79025 61.19902] ± [0.00000 0.00000]
Distortion: kc = [-0.34178 -0.03548 0.00203 -0.08218 0.00000] ± [0.10076 0.14786 0.00605 0.01722 0.00000]

Compared with the Matlab toolbox, the principal point and focal length values produced by our implementation lie within the range of the toolbox's results.

7.2 Image rectification evaluation 7.2.1 Rectification algorithm 1 We conduct test runs of the algorithm on different values of the intrinsic matrices for the left and right images. Set 1: For the left image, we set the following intrinsic parameter matrix for the left camera:

Wnew =
    [ 626.73      0        941.38  ]
    [   0      -118.62     92.9492 ]
    [   0         0          1     ]

The left transform matrix calculated from the algorithm:

TL =
    [ -1.9113     -0.9344    575.274 ]
    [ -0.141245    0.1275    140.737 ]
    [ -0.00075    -0.0012      1     ]

For the right image, we set the following intrinsic parameter matrix for the right camera:

Wnew =
    [ 926.73      0        641.38  ]
    [   0      -118.9      92.9492 ]
    [   0         0          1     ]

The right transform matrix calculated from the algorithm:

TR =
    [ -2.31448   -1.57918   629.914 ]
    [ -0.25764    0.163     123.175 ]
    [ -0.00146   -0.0015      0.9929 ]

The original images and their corresponding rectified images for the left and right cameras are shown below.


Figure 31: Rectified left and right images

After obtaining the left and right rectified images, we found that corresponding points in the left image can be located accurately in the same row of the rectified right image, as shown.

Figure 32: Stereo corresponding points

7.2.2 Rectification algorithm 2 We conduct test runs of the algorithm on different values of the intrinsic matrices for the left and right images. Set 1: For the left image, we set the following intrinsic parameter matrix for the left camera:

Wnew =
    [ 436.73      0        141.38  ]
    [   0      -381.62     92.9492 ]
    [   0         0          1     ]

The left transform matrix calculated from the algorithm:

TL =
    [ -0.8177    -0.136     278.55  ]
    [ -0.0642     0.91      149.931 ]
    [ -0.00163   -0.0002      0.8665 ]

For the right image, we set the following intrinsic parameter matrix for the right camera:

Wnew =
    [ 446.73      0        241.38  ]
    [   0      -381.62     92.9492 ]
    [   0         0          1     ]

The right transform matrix calculated from the algorithm:

TR =
    [ -1.14002   -0.18866     353.421 ]
    [ -0.0542    -1.16102     151.575 ]
    [ -0.00171   -0.000343      1.02458 ]

The original images and their corresponding rectified images for the left and right cameras are shown below.

Figure 33: Un-rectified left and right images

Figure 34: Rectified left and right images

After obtaining the left and right rectified images, we found that corresponding points in the left image can be located accurately in the same row of the rectified right image, as shown.


Figure 35: Corresponding points on rectified images Set 2: For the left image, we set the following intrinsic parameter matrix for the left camera:

Wnew =
    [ 926.73      0        491.38  ]
    [   0      -258.62     92.9492 ]
    [   0         0          1     ]

For the right image, we set the following intrinsic parameter matrix for the right camera:

Wnew =
    [ 726.73      0        491.38  ]
    [   0      -258.62     92.9492 ]
    [   0         0          1     ]

The original images and their corresponding rectified images for the left and right cameras are shown below

Figure 36: Rectified left and right images

After obtaining the left and right rectified images, we found that corresponding points in the left image can be located accurately in the same row of the rectified right image, as shown.


Figure 37: Stereo corresponding points

7.2.3 Conclusion
First, for each point in the left rectified image to have its corresponding point on the same row of the rectified right image, the two intrinsic parameter entries fs_y and r0 must be the same in the new W for both the left and right cameras. Second, the entry c0 of the intrinsic parameter matrix can be adjusted to shift the rectified image so that it remains visible: since shrinking the focal length to keep all points visible would change the image scale relative to the original, we instead shift the image centre along the horizontal direction. Changing c0l or c0r does not destroy the rectification effect, and by moving the image centre we can bring points that fell outside the image back inside and find matching points in the shifted images. Third, from the test sets and evaluation we found it much more difficult to adjust the focal length and principal point values in algorithm 1 than in algorithm 2; hence algorithm 2 is much more favourable.

7.3 Markers detection evaluation
We evaluate the accuracy of the implementation and how well it copes with different orientations of the hand by running it on different sets of images under the same range of thresholds. After the markers are detected, their centres of gravity are calculated. Note that in all the experiments below, all the markers are visible; approaches for dealing with occlusions have not been included in our implementation due to time constraints.


7.3.1 Stereo pair 1:
Left frame: Markers with the same colour are detected at the same time. The first four markers are detected and the results are illustrated below.

Figure 38: Left image with background subtracted and image with pixels in a given range

Figure 39: Two sets of markers are detected in the image

The centre of gravity of each marker is listed in the table below.

Marker   X coordinate   Y coordinate
1        260.317        109.372
2        250.667        113.667
3        241            118
4        229.36         122.03
5        269            124.5
6        259.472        128.007
7        248.529        133.765
8        238.251        137.064


Right frame:

Figure 40: Right image with background subtracted and image with pixels in a given range

Figure 41: Markers detected in the right frame

The coordinates of the centres of gravity are listed in the table below.

Marker   X coordinate   Y coordinate
1        217.64         104.217
2        205.833        110.496
3        196.477        115.748
4        226            120.5
5        187.986        121.745
6        215.054        125.225
7        204.012        130.982
8        197            137.064

7.3.2 Stereo pair 2: Left image:


Figure 42: Left image with background subtracted and image with pixels in a given range

Figure 43: Markers detected in left image

The coordinates of the centres of gravity are listed in the table below.

Marker   X coordinate   Y coordinate
1        263.444        117.481
2        251.439        123.547
3        238.228        128.341
4        225.433        133.156
5        272.716        136.197
6        260.809        141.417
7        247.579        146.785
8        235.136        152.413

Right image:


Figure 44: Right image with background subtracted and image with pixels in a given range

Figure 45: Markers detected in the right image

The coordinates of the centres of gravity are listed in the table below.

Marker   X coordinate   Y coordinate
1        164.798        115.7
2        150.346        121.822
3        139.394        128.95
4        130.26         136.35
5        175.5          136.35
6        162.195        142.274
7        150.125        149.229
8        141.851        154.571

7.4 3D reconstruction evaluation
7.4.1 No motion
Here we try to detect markers that do not move. The glove is put on the table and the tracking system is started. In an error-free system the detected point would be the same in all frames; in practice this is not possible because of noise in the camera images and natural illumination changes.

The fundamental matrix:

F = [ 5.17244e-006    2.93337e-005   -0.00496862  ]
    [ -7.48387e-006   -4.49791e-005   0.00749192  ]
    [ -6.06856e-008   -3.64881e-007   6.0769e-005 ]

The essential matrix:

E = [ 1.80674     1.24206    -0.252517 ]
    [ 0.693459   -1.9897     -0.464655 ]
    [ 1.34376    -0.49724     0.399678 ]

The stereo rotation matrix:

R = [ 0.116346   -0.929799     0.342741 ]
    [ 0.890237   -0.0197195   -0.355695 ]
    [ 0.336843    0.365279     0.822803 ]

The stereo translation vector:

T = [ 0.233018  0.143407  0.961841 ]

The values of the markers are consistently evaluated as shown below, where (xL, yL) is the image point in the left frame, (xR, yR) is the image point in the right frame, and (X, Y, Z) is the reconstructed 3D object point.

Marker   xL        yL        xR        yR        X           Y          Z
1        260.317   109.372   217.64    104.217   -0.0967068  -0.174806  -0.167314
2        250.667   113.667   205.833   110.496   -0.0836514  -0.144399  -0.202344
3        241       118       196.477   115.748   -0.0730423  -0.118677  -0.232248
4        229.36    122.03    226       120.5     -0.0709773  -0.161228  -0.170597
5        269       124.5     187.986   121.745   -0.0789569  -0.110828  -0.262464
6        259.472   128.007   215.054   125.225   -0.0816624  -0.15847   -0.201518
7        248.529   133.765   204.012   130.982   -0.0697294  -0.129294  -0.233916
8        238.251   137.064   197       137.064   -0.0619594  -0.109576  -0.254307

7.4.2 Forwards-backwards motion
The fundamental matrix, essential matrix, stereo rotation matrix and stereo translation vector are as in Section 7.4.1:

F = [ 5.17244e-006    2.93337e-005   -0.00496862  ]
    [ -7.48387e-006   -4.49791e-005   0.00749192  ]
    [ -6.06856e-008   -3.64881e-007   6.0769e-005 ]

E = [ 1.80674     1.24206    -0.252517 ]
    [ 0.693459   -1.9897     -0.464655 ]
    [ 1.34376    -0.49724     0.399678 ]

R = [ 0.116346   -0.929799     0.342741 ]
    [ 0.890237   -0.0197195   -0.355695 ]
    [ 0.336843    0.365279     0.822803 ]

T = [ 0.233018  0.143407  0.961841 ]

The user moves the hand forwards and backwards with respect to the stereo system. To analyse the results we have chosen three frames from the sequence, labelled (a), (b) and (c) in order of appearance.

Frame (a):
Marker   xL        yL        xR        yR        X         Y         Z
1        260.317   109.372   217.64    104.217   81.4193   175.513   40.8885
2        250.667   113.667   205.833   110.496   74.3329   173.11    40.2696
3        241       118       196.477   115.748   67.2199   170.718   39.6593
4        229.36    122.03    226       120.5     59.2462   167.242   38.9021
5        269       124.5     187.986   121.745   78.536    187.014   43.4449
6        259.472   128.007   215.054   125.225   71.9092   184.269   42.8617
7        248.529   133.765   204.012   130.982   63.4424   181.979   42.2685
8        238.251   137.064   197       137.064   56.5178   178.792   41.4847

Frame (b):
Marker   xL        yL        xR        yR        X         Y         Z
1        261       117       196.046   114.85    79.0194   185.772   39.3231
2        250.413   121.936   184.593   121.12    71.0768   183.077   38.7111
3        239.333   126.333   172.884   126.566   63.1401   179.872   37.9906
4        226.941   130.977   165.196   134.039   54.3989   176.149   37.1702
5        270.489   133.867   205.796   133.215   75.7283   198.831   42.073
6        259       138       193.217   137.942   67.7075   195.294   41.2858
7        247.451   142       182.703   144.526   59.7209   191.661   40.4823
8        235.852   146.389   175.602   151.043   51.5188   188.197   39.722

Frame (c):
Marker   xL        yL        xR        yR        X         Y         Z
1        262.439   120.276   187.84    116.203   74.8949   186.796   32.5677
2        251       124.5     175.097   123.082   66.965    183.354   31.9188
3        239.1     129.955   163.52    128.797   58.1932   180.299   31.3364
4        227.456   134.773   197.249   134.631   49.8763   177.042   30.8397
5        270.89    137.18    155.83    136.842   70.8802   199.284   34.629
6        259.827   141.825   185.343   140.752   62.9451   196.223   34.1623
7        247.84    147.525   173.705   147.274   54.0085   193.247   33.5977
8        237.054   152.19    164.789   151.942   46.1961   190.339   33.0538


Our first observation is that there is a common trend in the values belonging to a column; for instance, the x-coordinates xL of all the markers increase between left frames (a) and (b). Similar behaviour can be observed in the other columns storing the coordinates of the image points in the left and right image planes. Secondly, the reconstructed 3D coordinates of the markers demonstrate the same characteristic: the values of X, Y and Z, when moving from one frame to another, change in accordance with the other values in the same column. Thirdly, since all of our markers are located on the back of the hand, they are approximately co-planar; this property is reflected in the calculated depths Z, which are very close to each other for all markers within a frame. We can therefore conclude that our 3D reconstruction provides an accurate approximation of the 3D coordinates of the markers given their coordinates in the two image planes. More tests could be done by moving the hand in different motion patterns, for instance up and down or sideways; we would expect our 3D reconstruction to provide a consistent set of results for each motion pattern.
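The reconstruction step (from a pair of corresponding 2D points to a 3D point) can be illustrated with standard linear (DLT) triangulation. The sketch below is a generic illustration, not the exact method of this report, and the camera projection matrices in the test are hypothetical stand-ins for the calibrated left and right cameras:

```python
import numpy as np

def triangulate(P_left, P_right, xl, xr):
    """Linear triangulation: each image point contributes two rows of the
    homogeneous system A X = 0; the 3D point is the null vector of A,
    recovered via SVD and dehomogenised."""
    A = np.array([
        xl[0] * P_left[2] - P_left[0],
        xl[1] * P_left[2] - P_left[1],
        xr[0] * P_right[2] - P_right[0],
        xr[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

For noise-free projections of a point through two known cameras, this recovers the original 3D coordinates exactly.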

7.5 Electromagnetic tracking system vs. video-based tracking system
7.5.1 Up-down motion
Video-based tracking system:
Video sequence captured by the left camera: 21 frames at a frame rate of 10 frames/sec, giving a duration of 21 / 10 = 2.1 sec.
Video sequence captured by the right camera: 21 frames at a frame rate of 10 frames/sec, giving a duration of 21 / 10 = 2.1 sec.

To illustrate the change in the values of the markers' coordinates in the left and right image planes, together with their reconstructed 3D coordinates, we have chosen 3 frames from the sequence to analyse.

Frame 1:
Marker   xL        yL        xR        yR        X          Y         Z
1        219.39    166.148   174.796   160.843   0.55915    -1.88248  2.35164
2        195.167   169.275   162.418   164.176   0.749806   -2.28654  2.86488
3        207.242   169.118   152.243   165.864   0.744557   -2.62088  3.16229
4        182.261   170.587   142.138   167.253   1.08218    -3.39456  4.14119
5        210.181   186.909   179.08    180.379   0.546146   -1.40504  1.81111
6        223.808   186.189   166.655   182.723   0.536586   -1.61754  1.98662
7        185.76    189.38    156.026   184.241   0.749267   -1.90988  2.39799
8        197.547   189.385   147.582   186.132   0.738591   -2.10037  2.54898

Frame 10:
Marker   xL        yL        xR        yR        X          Y         Z
1        207.472   74.5788   185.794   64.6592   0.402953   1.36942   -1.04093
2        222.189   75.3333   173.103   66.3932   0.414325   1.29666   -0.98291
3        235.331   75.2062   149.333   68        0.430033   1.24523   -0.94437
4        246.932   74.2242   160.679   67.3894   0.429612   1.18473   -0.90184
5        235.863   94.5659   186.645   83.8659   0.390031   1.22665   -0.86587
6        208.933   95.8379   174.939   85.6858   0.381834   1.37401   -0.96497
7        222.406   95.0874   161.31    87.3715   0.397459   1.31224   -0.92424
8        249.206   95.7438   151.688   88.7312   0.412801   1.19298   -0.83837

Frame 20:
Marker   xL        yL        xR        yR        X           Y         Z
1        224.816   21.8126   193.933   19.2553   -0.22681    -0.80520  2.35849
2        238.848   22.5175   180.551   20.4468   -0.400752   -1.29751  3.57058
3        251.786   22.7354   169.073   21.5122   -0.786762   -2.34915  6.12073
4        264.615   24.5278   156.58    23.2431   -2.30508    -6.50092  16.0116
5        224.926   41.9291   194.529   35.8058   -0.127725   -0.57779  1.66549
6        237.945   42.8025   157.947   41.5733   -0.373008   -1.52817  3.97585
7        252.473   43.0853   181.913   40.7195   -0.21884    -0.82096  2.15903
8        265.047   43.9794   168.205   41.7411   -0.346688   -1.20671  2.99377

From the above tables we observe that the y-coordinates in both the left and right image planes decrease as the hand moves upwards, while the x-coordinates vary by only a negligibly small number of pixels; this is due to noise and the imperfection of the motion (hand tremor, hand posture). In a perfect upwards motion the x-coordinates would remain constant and only the y-coordinates would change. If we join all 8 markers in each frame and consider them as one marker, the 3D coordinates of the centre of gravity of the polygon containing the markers are obtained by averaging the 8 reconstructed markers for each frame. The values are given in the table below.

Frame   X           Y           Z
1       0.0210728   0.165344    0.117674
2       0.713286    2.15216     2.65809
3       0.0186663   0.215333    0.170487
4       0.127718    0.0202564   0.400311
5       0.0647827   0.0883441   0.267149
6       0.0487504   0.171491    0.0435583
7       0.507711    2.20546     3.34535
8       0.545639    3.09933     4.45863
9       4.28116     34.9634     48.1319
10      0.285837    2.63691     4.78634
11      0.407381    1.27524     0.932939
12      0.305623    1.26156     1.70975
13      0.0428977   0.747517    0.553335
14      0.290814    1.20899     3.04243
15      0.176053    0.844174    1.26724
16      0.735814    2.86028     2.68928
17      1.95238     8.09573     19.572
18      1.12379     3.91155     3.49942
19      0.0212434   0.723526    0.71318
20      0.0985999   0.071983    0.120026
21      0.598208    1.8858      4.85695

Displacement of markers between two consecutive frames:

Frames   Displacement value
0-1      3.29853
1-2      3.22831
2-3      0.320572
3-4      0.162262
4-5      0.239088
5-6      3.90506
6-7      1.42823
7-8      54.1907
8-9      54.2199
9-10     4.08872
10-11    0.783567
11-12    1.2925
12-13    2.54362
13-14    1.81592
14-15    2.52986
15-16    17.7177
16-17    16.6289
17-18    4.37518
18-19    0.884491
19-20    5.09686
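The displacement values are simply the Euclidean distances between the reconstructed centroids of consecutive frames; summing them gives the total path length travelled by the hand, and dividing by the frame period gives a speed. A small illustrative sketch:

```python
def displacements(centroids):
    """Euclidean distance travelled by the 3D centroid between each pair
    of consecutive frames."""
    out = []
    for (x0, y0, z0), (x1, y1, z1) in zip(centroids, centroids[1:]):
        out.append(((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2) ** 0.5)
    return out
```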

A graphical representation of the displacement is shown below


Figure 46: Displacement values obtained from the video-based tracking system

Electromagnetic tracking system:
Number of frames: 1700
Duration: 28 sec
In order to compare the results of the video-based system with those of the electromagnetic system, assume that we apply the frame rate of the video sequence to the electromagnetic results, i.e. 10 frames/sec. For a duration of 28 sec the expected number of frames is 28 × 10 = 280. From the 1700 frames collected by the EM tracker we therefore take a sample at an interval of 7 frames, giving 243 frames in total. Owing to the large amount of data, it is not convenient to include it all in this report; a graphical representation of the displacement values over the sequence of frames is shown below.


Figure 47: Displacement values obtained from the electromagnetic tracking system

We observe that the displacement charts produced by the video-based and electromagnetic tracking systems have a very similar shape.

7.5.2 Sideways motion
Video-based tracking system:
Video sequence captured by the left camera: 18 frames at a frame rate of 10 frames/sec, giving a duration of 18 / 10 = 1.8 sec.
Video sequence captured by the right camera: 18 frames at a frame rate of 10 frames/sec, giving a duration of 18 / 10 = 1.8 sec.

To illustrate the change in the values of the markers' coordinates in the left and right image planes, together with their reconstructed 3D coordinates, we have chosen 3 frames from the sequence to analyse.

Frame 1:
Marker   xL        yL        xR        yR        X           Y         Z
1        263.409   120.518   244.722   109.148   -0.006002   -3.36892  2.87054
2        286.485   121.07    258.329   108.156   -0.060687   -3.07859  2.28067
3        299.195   120.113   271.915   107.662   -0.204236   -31.6605  2.50747
4        311.235   120.333   233.223   111.254   -0.286252   -3.11918  2.48744
5        268.484   138.456   260       128       0.0845436   -2.75584  2.28336
6        300.606   139.141   272.888   127.508   -0.003422   -2.2264   1.81329
7        312.538   138.288   234.88    129.543   -0.014509   -2.00621  1.63981
8        288.709   139.401   246.908   129.085   -0.017674   -2.68908  2.16217

Frame 9:
Marker   xL        yL        xR        yR        X           Y          Z
1        234.268   126.147   203.365   116.032   -0.148737   0.42253    -0.15571
2        246.481   125.185   179.985   118.808   -0.138157   0.423742   -0.15787
3        258.962   124.273   191.675   118.493   -0.131503   0.416612   -0.15387
4        220.952   127.41    170.291   121.78    -0.154249   0.435857   -0.16381
5        249.732   143.853   207.297   135.072   -0.141962   0.412908   -0.13684
6        260.941   143.47    194.901   136.802   -0.133621   0.412346   -0.13725
7        237       145       182.471   138.333   -0.147039   0.423812   -0.14299
8        224.726   146.428   173.5     140       -0.154064   0.430751   -0.14613

Frame 18:
Marker   xL        yL        xR        yR        X          Y          Z
1        222.962   125.273   166.479   120.566   0.010179   -0.08891   -0.20574
2        211.222   126.918   153.235   122.962   0.017895   -0.06947   -0.24837
3        198.219   128.41    143       124.5     0.024232   -0.05287   -0.28505
4        186.46    130.771   133.508   127.869   0.029586   -0.03695   -0.32311
5        226.355   144.125   168.184   139.372   0.016024   -0.08727   -0.22393
6        213.828   145.861   157       141       0.022321   -0.07072   -0.25736
7        201.258   147.502   145.124   142.711   0.02804    -0.05277   -0.29604
8        190.411   149.537   142.502   141.466   0.030959   -0.04705   -0.30638

From the tables above we observe that the x-coordinates in both left and right image planes decrease as the hand moves from right to left across the image planes, while the y-coordinates vary by only a negligible number of pixels; this variation is due to noise and imperfect motion (hand tremor, hand posture). For a perfectly sideways motion, the y-coordinates would remain constant and only the x-coordinates would change. If we join all 8 markers in each frame and treat them as one marker, the 3D coordinates of the centre of gravity of the polygon containing the markers are obtained by averaging the 8 reconstructed marker positions for each frame. Their values are given in the table below.

Frame       X            Y           Z
  1      0.0846657    2.74367     2.25559
  2      0.0453764    0.211733    0.0193351
  3      0.176147     0.0388423   0.421157
  4      0.0301399    0.166498    0.0710441
  5      0.0874652    0.201882    0.0322433
  6      0.0680036    0.0861142   0.255496
  7      0.285958     1.70288     2.34998
  8      0.0880687    0.0509365   0.131568
  9      0.143667     0.42232     0.149311
 10      0.00896297   0.159304    0.105346
 11      0.0721876    0.0772148   0.280875
 12      0.0100887    0.160908    0.0555726
 13      0.00678482   0.0964563   0.164496
 14      7.15799     83.6922     76.4985
 15      0.00855816   0.145715    0.0913175
 16      0.0105894    0.076965    0.226638
 17      0.00671422   0.107541    0.155954
 18      0.0224052    0.0632497   0.268248

Displacement of markers between two consecutive frames:

Frames    Displacement value
 0-1       3.37833
 1-2       0.456567
 2-3       0.400241
 3-4       0.0777414
 4-5       0.252235
 5-6       2.65486
 6-7       2.77298
 7-8       0.375941
 8-9       0.298757
 9-10      0.203829
10-11      0.248238
11-12      0.126607
12-13    113.43
13-14    113.442
14-15      0.151797
15-16      0.0771112
16-17      0.121729
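The centre-of-gravity and displacement computation used above can be sketched in a few lines of Python/NumPy (the array and function names here are illustrative, not those of our implementation):

```python
import numpy as np

def centroid_displacements(marker_coords):
    """marker_coords: array of shape (n_frames, n_markers, 3) holding the
    reconstructed 3D marker positions for each frame.
    Returns (centroids, displacements): the per-frame centre of gravity
    (mean of the markers) and the Euclidean distance between the
    centroids of consecutive frames."""
    centroids = marker_coords.mean(axis=1)         # (n_frames, 3)
    steps = np.diff(centroids, axis=0)             # (n_frames - 1, 3)
    displacements = np.linalg.norm(steps, axis=1)  # (n_frames - 1,)
    return centroids, displacements

# Example: 2 frames, 8 markers, the hand shifted by (1, 0, 0) between frames
frames = np.zeros((2, 8, 3))
frames[1, :, 0] = 1.0
_, d = centroid_displacements(frames)
print(d)  # [1.]
```

Averaging the 8 markers first makes the displacement measure robust to the individual marker jitter visible in the per-marker tables.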

A graphical representation of the displacement is shown below



Figure 48: Displacement values obtained from video-based tracking system

Electromagnetic tracking system:
Number of frames: 1121
Duration: 19 sec

To compare the results of the video-based system with those of the electromagnetic system, we again assume that the electromagnetic results were sampled at the same frame rate as the video sequence, i.e. 10 frames/sec. For a duration of 19 sec, the expected number of frames is 19 * 10 = 190. From the 1121 frames collected by the EM tracker, we take a sample at an interval of 5 frames, giving a total of 225 frames. The full data set is too large to include in this report; however, a graphical representation of the displacement values over the sequence of frames is shown below.



Figure 49: Displacement values obtained from Electromagnetic tracking system

We observe that the displacement charts produced by the video-based and electromagnetic tracking systems have the same overall shape. This helps to confirm the accuracy of our implementation of the video-based tracking system.

7.5.3 Summary and Conclusion

We have tested our implementation with the simple motions described above (up-down motion and sideways motion). The electromagnetic tracking system has a considerably higher frame rate than the video-based system, so the electromagnetic data must be resampled before the two sets of results can be compared. A small programme was written to take the data produced by the electromagnetic trackers and produce the displacement values between resampled frames. The displacement measurements produced by the electromagnetic tracking system in both scenarios show the same patterns as those produced by our implementation of the video-based tracking system. Although our implementation produces displacement patterns similar to those of the electromagnetic tracking system, it still needs further testing and evaluation, especially with more complex hand movements.
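The resampling step performed by that programme can be sketched as follows. This is a minimal Python illustration, not the original programme: the exact rounding rule used to pick the sampling interval is an assumption here (ceiling division reproduces the interval of 7 used for the up-down sequence).

```python
import math

def resample_indices(n_samples, duration_sec, target_fps):
    """Pick evenly spaced sample indices so that a high-rate EM tracker
    log matches the video frame rate. The interval is chosen so the
    resampled sequence does not exceed the expected video frame count
    by much (rounding rule assumed, not taken from the original code)."""
    expected = duration_sec * target_fps                # frames the video would span
    interval = max(1, math.ceil(n_samples / expected))  # e.g. ceil(1700 / 280) = 7
    return list(range(0, n_samples, interval))

idx = resample_indices(1700, 28, 10)  # up-down sequence: interval 7, 243 frames
print(len(idx))  # 243
```

Displacements are then computed between consecutive resampled positions exactly as for the video-based frames.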


Chapter 8

Conclusion and Future work

8.1 Contributions

Calibration: The accuracy of the hand tracking system depends greatly on the accuracy of calibration. One of the most popular calibration tools in computer vision is Bouguet's Matlab calibration toolbox. This tool requires the user to detect corners manually, which makes calibration a laborious task. Our implementation, using the OpenCV library, provides the user with an easy-to-use, automatic calibration of a single camera. The implementation also checks for the presence of the specified number of chessboard corners during the capturing process before allowing the images to be taken. This has helped make the calibration process more reliable and accurate. Our system also allows the user to perform stereo calibration and obtain the set of stereo extrinsic parameters with the smallest reprojection error.

Stereo correspondence: We have implemented two different algorithms for image rectification. This provides the user with an in-depth analysis of the two approaches and a choice between them.

3D reconstruction: 3D reconstruction is implemented for two scenarios. In the first, both the intrinsic and extrinsic parameters are known; reconstruction is then straightforward, as estimating the coordinates of an object point is a matter of triangulation involving the left and right camera image points of that point. The second case, in which only the intrinsic parameters are known, is somewhat more involved. The first step requires estimation of the essential matrix, which can only be found up to an unknown scale factor. Next, the normalised translation vector is found. Finally, the rotation matrix is derived from both the normalised essential matrix and the normalised translation vector.
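For the first scenario, the triangulation step can be illustrated with the standard linear (DLT) method; the sketch below is a generic textbook formulation under a toy camera setup, not the code of our system:

```python
import numpy as np

def triangulate(P_left, P_right, pt_left, pt_right):
    """Linear (DLT) triangulation: given the 3x4 projection matrices of
    the two cameras and one corresponding image point in each view,
    recover the 3D point. Each view contributes two linear constraints
    of the form x*(p3.X) - (p1.X) = 0; the homogeneous solution is the
    right singular vector for the smallest singular value."""
    A = np.vstack([
        pt_left[0]  * P_left[2]  - P_left[0],
        pt_left[1]  * P_left[2]  - P_left[1],
        pt_right[0] * P_right[2] - P_right[0],
        pt_right[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenise

# Toy example: identity intrinsics, right camera translated along x
P_L = np.hstack([np.eye(3), np.zeros((3, 1))])
P_R = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 4.0])
xl = P_L @ np.append(X_true, 1); xl = xl[:2] / xl[2]
xr = P_R @ np.append(X_true, 1); xr = xr[:2] / xr[2]
print(triangulate(P_L, P_R, xl, xr))  # recovers approximately [0.2 0.1 4.0]
```

With noisy correspondences the same system is solved in a least-squares sense, which is where reconstruction noise (and hence the filtering discussed below) enters.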
Marker-based tracking: Our implementation provides a semi-automatic approach to detecting rectangular markers of specified colours. The hand can be tracked reliably as long as all markers are visible to both cameras. The implementation runs fast, which opens up the potential for real-time tracking.
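The core of colour-based marker detection can be sketched as a threshold-and-centroid operation. This is a simplified NumPy illustration (the function name, RGB colour model, and tolerance are assumptions; the real system restricts the search to the previously located hand region):

```python
import numpy as np

def marker_centre(image_rgb, target_rgb, tol=30):
    """Keep pixels whose colour is within `tol` of the target colour in
    every channel, and return the centroid (row, col) of that region,
    or None if no pixel matches."""
    diff = np.abs(image_rgb.astype(int) - np.asarray(target_rgb))
    mask = np.all(diff <= tol, axis=2)
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())

# Toy image: a 4x4 red patch at rows 10-13, cols 20-23 on a black background
img = np.zeros((50, 50, 3), dtype=np.uint8)
img[10:14, 20:24] = (255, 0, 0)
print(marker_centre(img, (255, 0, 0)))  # (11.5, 21.5)
```

Running one such pass per marker colour, per frame, per camera yields the left/right image coordinates that feed the stereo correspondence and triangulation stages.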

8.2 Future work Calibration:


More optimisation of the stereo calibration could be done to select the optimal solution for further processing. A user-friendly graphical interface for the application would also be a useful feature.

3D reconstruction: We could introduce a filter to remove Gaussian reconstruction noise. As mentioned in the previous chapters, the Extended Kalman filter would be a good candidate, as it can be used for tracking in non-linear systems.

Marker-based tracking: We could improve the current solution by fully automating the application. We could also provide measures to deal with partial and full occlusion of markers.
