Off-vehicle Evaluation of Camera-based Pedestrian Detection

Yaniv Alon and Aharon Bar-Hillel

Y. Alon is with RoadMetric LTD, Giveat Yearim 90970, Israel ([email protected]). A. Bar-Hillel is with the GM Advanced Technical Center Israel, Hamada 7, Herzliya, Israel ([email protected]).

Abstract— Performance evaluation and comparison of vision-based automotive modules is a growing need in the automotive industry. Off-vehicle evaluation, using a database of video streams, offers many advantages over on-vehicle evaluation in terms of reduced costs, repeatability, and the ability to compare different modules under the same conditions. An off-vehicle evaluation platform for camera-based pedestrian detection is presented, enabling evaluation of industrial modules and internally developed algorithms. In order to maintain a single video database despite variability in camera location and internal parameters, experiments were done with video warping techniques, in which a video is warped to look as if taken from a target camera. To obtain ground truth annotation, both manual and Lidar-based methods were tested. Lidar-based annotation was shown to achieve a detection rate above 80% without human intervention, which can rise to 97.5% using a semi-supervised methodology with moderate human effort. Finally, we examined several performance metrics, and found that the image-based detection criterion used in most of the literature does not fit certain automotive applications well. A modified criterion based on real-world coordinates is suggested.

I. INTRODUCTION

Vision-based features for active safety are a growing trend, expected to become more significant in the coming decade. Examples are Lane Departure Warning (LDW) [10] or detection of pedestrians, vehicles and general obstacles in the front and rear cameras [4]. For the front camera, modules offering features such as Forward Collision Alert (FCA) or Traffic Sign Recognition (TSR) are developed by several vendors. For the rear, a fish-eye camera displaying the rear blind zone during reverse is gradually becoming widespread, and obstacle detection features for it are emerging. Performance evaluation and comparison between different algorithms or systems suggested by different vendors therefore becomes a significant need. Currently, such evaluation is usually done by driving a vehicle mounted with the vision system, and qualitatively estimating the performance by a human operator.

In this paper we examine the possibility of off-vehicle, lab-based evaluation for vision-based pedestrian detection. By 'off-vehicle' evaluation we refer to performance evaluation done in the lab, by injecting video and CAN streams recorded from a vehicle into the examined system. The responses of the tested system, which are object detection reports, are compared with ground truth annotation of the injected movies, and performance statistics are gathered and analyzed. If and when feasible, such an off-vehicle evaluation platform offers significant benefits compared to on-vehicle testing: reduced

costs and effort, objective and quantitative evaluation, and clean comparison, under exactly the same conditions, between two or more tested systems. The ability to objectively quantify system performance, and to repeat evaluation experiments under strictly controlled conditions, is essentially the difference between system evaluation as art and as science.

Having said that, off-vehicle evaluation faces several significant methodological and technical challenges. One major difficulty is amenability to sensor diversity. The platform should enable evaluation of different systems, relying on different cameras, and mounted on a variety of vehicles at different locations. If a new database has to be collected for each combination of camera and position, the whole concept of off-vehicle validation becomes doubtful. Another important enabler for large scale off-vehicle evaluation is obtaining ground truth annotation in a reliable, yet economic manner. Other challenges have to do with proper capture, storage and injection of the analog video signal in the case of a rear production camera, and with methodological issues: what are the proper performance criteria for automotive obstacle detection features?

Our contributions in this paper are the presentation of a working evaluation platform used to evaluate several systems successfully, a description of experimental techniques for coping with sensor diversity and ground truth gathering, and the introduction of novel, automotive-aligned performance criteria. We focus on pedestrian and children detection in the rear-view camera as a prominent example throughout this paper, as its analog nature and fish-eye lens pose severe technical difficulties for evaluation techniques.

In section II we briefly describe the video injection system we have been using and characterize its internal noise. In section III we describe our database collection and our experiments with camera simulation, in which we warp a video sequence from a high resolution camera into a stream mimicking the rear production camera at a different position. These experiments establish the value of camera simulation techniques as a tool for enabling database sharing across different cameras and camera locations. In section IV we describe our manual data annotation methodology, as well as experiments in which an IBEO Lidar was used to annotate videos containing pedestrians. In these experiments the Lidar was able to automatically annotate 82% of the pedestrians with a 1.5% false alarm rate. Furthermore, error analysis shows that a semi-supervised process can be devised in which the detection accuracy approaches 97.5%, with human effort which is less than 1/5 of the full manual annotation effort. In section V we discuss performance metrics used by the

  Statistic          Session 1   Session 2
  True positives     4400        4471
  False positives    4714        4930
  False rate         1.912       1.902
  Miss rate          0.7915      0.7949

TABLE I: Left: Overlap distribution of detection rectangles between two injection sessions of the same video. The overlap score is computed as (intersection area)/(union area). For each detection rectangle, the score of the best match is considered. Right: detection statistics for a tested system on a 416-second video stream. Quantitative results over two consecutive injection sessions are very similar. See section V for an explanation of the reported statistics (TMG, TFG = 0.5 were used).

evaluation platform, including detection statistics and range and azimuth accuracy. We claim that the detection criterion used in the object recognition and pedestrian detection community is inadequate for certain automotive contexts, and suggest an improved criterion. Concluding remarks are given in section VI.

Several papers discussing off-vehicle evaluation of vision-based features were published in recent years [3], [15], [11], [12], [13]. Some of these [15], [11] refer to the problem of testing closed-loop features using Hardware In the Loop (HIL) simulation, including a graphics engine producing synthetic visual input. In contrast, our work focuses on testing detection features using real-world data, as the value of synthetically generated clips for testing real computer vision systems is highly questionable. Other papers describe evaluation tools using real-world video scenes for testing traffic sign recognition [12], lane detection [13] and pedestrian detection [3]. Specifically, [3] presents an evaluation platform similar to ours, but it does not describe testing of black-box modules via injection, database enhancement techniques such as camera simulation and Lidar-based ground truth, or world-coordinate-related performance criteria.

II. INJECTION PLATFORM

The injection system is composed of several hardware and software ingredients. A desktop computer containing the video and CAN database is the main component. This computer has a video card with analog S-video output (NVIDIA GeForce 9500 GT), and two CAN cases for CAN input and output. The tested system gets analog video and CAN input from the main computer, sends its CAN output back to the computer, and its video output (including a detections overlay) to a second monitoring screen. The injection software module loops through the database and handles synchronous CAN and video injection, and CAN reading. Other software modules are the evaluation module, which reads system detections and produces performance analysis reports, and a database annotation and viewing tool.

While this platform is built in a one-time effort, certain modules have to be re-engineered for every detection system tested.

Fig. 1: Lens distortion calibration. (a) An image of a chessboard target that was used for lens calibration of the production camera. (b) The undistorted image after lens calibration.

These include the CAN injection and read modules (which differ due to the lack of a standard CAN interface), CAN decoding, and setting of the camera parameters assumed by the detection system. These parameters have to be set in a way consistent with the viewpoint from which the database clips were taken, and with the internal parameters of the camera used for data collection. Setting these parameters is done differently for each system, and some systems do not support it at all, or support only a limited parameter set. When these parameters cannot be adapted, one may resort to video warping techniques, as described in section III.

According to the rear camera specifications, an analog camera is used and the detection systems tested expect an analog signal. However, storage of an analog signal is highly complex, and hence we record the analog signal and store it in digital form (using a frame grabber), then convert it back to analog for injection (by the NVIDIA video card). The Analog→Digital→Analog transformation changes the original image properties and loses information. In addition, the detection systems tested are themselves not deterministic: due to high processing time, they usually process only part of the frames, with the frames chosen for processing changing between different injection epochs. Hence there is some internal variability in the injection system, so two repetitions of an experiment do not produce exactly the same results.

We have quantified this internal platform variability at several levels. At the pixel level, when one compares corresponding images in two injection epochs, the images are highly similar, with more than 65% of the pixels having the same value in both images. The difference of the intensity level at 99% of the pixels is less than 3% of the image intensity range. Despite this small difference in pixel intensity, significant differences are sometimes induced at the level of single object detections. A comparison of the detection rectangles between two injection trials on the entire data set is given in Table I (left). While there are instability effects for single detections, the average performance of the system, measured over longer epochs of several minutes, is stable across different independent injection trials. A comparison between two such trials is given in Table I (right).
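As a rough illustration, the pixel-level comparison above can be reproduced with a few lines of NumPy. The sketch below is ours, not the platform's code, and assumes grayscale 8-bit frames; it computes, for two corresponding frames from different injection sessions, the fraction of identical pixels and the fraction of pixels whose intensity differs by less than 3% of the 0-255 range.

```python
# Sketch of the pixel-level repeatability check between two injection sessions.
# frame_a, frame_b: corresponding grayscale uint8 frames of equal size.
import numpy as np

def frame_repeatability(frame_a, frame_b, range_fraction=0.03):
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    identical = np.mean(diff == 0)                      # fraction of identical pixels
    within_tol = np.mean(diff <= range_fraction * 255)  # fraction within 3% of the range
    return identical, within_tol
```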

III. DATABASE AND CAMERA SIMULATION

To check pedestrian and children detection, a database of 123 short video streams with a combined length of 64 minutes was gathered using a rear production camera (Sharp RJ0DA00041). While such a data set allows evaluation of systems tuned to a specific camera setting (the one used for data collection), some systems use different cameras, mounted at different positions on the vehicle, and with different tilt angles. It is most convenient if the tested system can be easily tuned to the database camera parameters, but unfortunately this is not always the case. A possible solution allowing multiple systems to be checked is collecting the database with a high resolution camera (the 'source' camera), and warping it to mimic the cameras of the evaluated systems (the 'target' cameras). In the experiments described here we validate the utility of this approach.

In our experiments the source camera was a 1.4MP Point Grey Chameleon CMLN-13S2C-CS with a Computar T2314FICS-3 fish-eye lens. The target camera is the Sharp RJ0DA00041 mentioned above, also equipped with a fish-eye lens with a typical horizontal 130° field of view. The source camera was positioned at a height of 1.13 meters with tilt −4°, while the target camera was at 0.98 meters with tilt −33°. In addition the cameras were horizontally translated by 10cm w.r.t. each other. These are large viewpoint differences, which we considered harder than most realistic cases. Three movies of 40 seconds each were taken, containing a pedestrian freely walking in the cameras' field of view.

During simulation, we warp movies taken with the source camera, which has a relatively high resolution, to look as if they were taken by the target production camera. Detection performance with the simulated video is then compared to the performance obtained using the video captured by the target camera. The high resolution is required since some resolution loss is inherent to the image warping process. The simulated video is created as a combination of two transformations: geometric warping and photometric histogram matching.

Geometric image warping is required since the lens distortion parameters of the source and target cameras are different, and so are their intrinsic and extrinsic camera parameters. An image warping G : R² → R² is a transformation stating for each (x, y) position in the target image the corresponding (x, y) position in the source image (if there is one). Once this transformation is estimated, and given an image S from the source camera, the simulated target image T′ is created by T′(x, y) = S(G(x, y)). Since G(x, y) is typically not an integer position, the value S(G(x, y)) is bilinearly interpolated from its 4 nearest neighbor pixels. G is computed as a composition of three transformations G = D_S ◦ H_{T→S} ◦ D_T⁻¹, where
• D_T is the lens distortion function for the target camera
• H_{T→S} is a projective homography mapping [9] between pixels in undistorted T and undistorted S. This transformation assumes flat ground and that the pixels of interest lie on the ground plane.
• D_S is the lens distortion function for the source camera

D_T and D_S are obtained by applying an independent lens distortion estimation process [9] to each of the two cameras. We have used the Caltech Camera Calibration Toolbox [1]. Figure 1 shows the result of image un-distortion after lens calibration.
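To make the composition concrete, the warping map G can be precomputed once per camera pair and then applied to every frame with bilinear interpolation. The Python/OpenCV sketch below is only a schematic of this idea, not the platform's implementation: undistort_target_points and distort_source_points are hypothetical wrappers around the estimated lens models (D_T⁻¹ and D_S), and H_ts stands for the fitted 3×3 homography H_{T→S}.

```python
# Sketch: build the per-pixel map for G = D_S o H_{T->S} o D_T^{-1} and apply it.
import cv2
import numpy as np

def build_warp_map(h, w, undistort_target_points, H_ts, distort_source_points):
    # All target pixel coordinates as an (h*w, 2) float array.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    pts_t = np.stack([xs.ravel(), ys.ravel()], axis=1)

    pts_und = np.asarray(undistort_target_points(pts_t), np.float32)      # D_T^{-1}
    pts_hom = cv2.perspectiveTransform(pts_und.reshape(-1, 1, 2), H_ts)   # H_{T->S}
    pts_src = np.asarray(distort_source_points(pts_hom.reshape(-1, 2)),
                         np.float32)                                      # D_S

    map_x = pts_src[:, 0].reshape(h, w)
    map_y = pts_src[:, 1].reshape(h, w)
    return map_x, map_y

def simulate_frame(source_img, map_x, map_y):
    # T'(x, y) = S(G(x, y)) with bilinear interpolation; pixels mapped outside the
    # source image become the blank areas discussed later in this section.
    return cv2.remap(source_img, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=0)
```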

Fig. 2: Example of geometric image warping for sensor simulation. (a) An image taken by the target camera. (b) A corresponding image taken by the source high resolution camera. (c), (d) The results of the un-distortion transformation applied to (a) and (b); red crosses mark the calibration targets. (d) is warped by a homography aligning the four calibration targets, and then distorted to get (e), the geometrically warped image. (f) is the final simulated image, obtained by applying histogram matching to (e). Compare it to the target camera image (a).

The homography H_{T→S} is computed as follows: 4 targets or more are positioned on a flat planar road at known positions and an image is taken using the source camera. The image positions of the targets are recorded in the undistorted image. This process is repeated for the target camera, with the targets placed at the same known real-world positions. A projective homography is then fitted to the 4 corresponding point pairs, matching between the two image planes. If we denote the ground plane re-projection from the two cameras as P_S, P_T respectively, then the estimated homography is H_{T→S} = P_S⁻¹ ◦ P_T. However, the parameters of H_{T→S} are estimated directly, without explicit estimation of P_S, P_T. See figure 2 for a visual example of the warping process.

The light sensitivity, as well as the gain and exposure algorithms in the two cameras, are likely to be different, leading to different pixel values for corresponding pixels. A photometric transform of histogram matching is used to compensate for these differences. Histogram matching is a function M : {0, ..., 255} → {0, ..., 255} applied to each pixel in the simulated image T′, bringing its value to the value of the same percentile in T, i.e.

    M(i) = CDF_T⁻¹(CDF_{T′}(i))    (1)

with CDF standing for the Cumulative Distribution Function.
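For illustration, equation (1) can be implemented per channel with a 256-entry lookup table built from the empirical CDFs. The following is a minimal sketch assuming 8-bit images; the names and data layout are ours, not the platform's code.

```python
# Sketch of per-channel histogram matching, M(i) = CDF_T^{-1}(CDF_{T'}(i)).
# simulated: the geometrically warped image T' (uint8, H x W x 3)
# target:    a reference image T from the production camera (uint8, H x W x 3)
import numpy as np

def match_histograms(simulated, target):
    out = np.empty_like(simulated)
    for c in range(3):  # process the three color channels independently
        src = simulated[..., c].ravel()
        ref = target[..., c].ravel()
        # Empirical CDFs over the 256 intensity levels.
        src_cdf = np.cumsum(np.bincount(src, minlength=256)) / src.size
        ref_cdf = np.cumsum(np.bincount(ref, minlength=256)) / ref.size
        # Lookup table: for level i, the smallest j with CDF_T(j) >= CDF_T'(i).
        lut = np.searchsorted(ref_cdf, src_cdf).clip(0, 255).astype(np.uint8)
        out[..., c] = lut[simulated[..., c]]
    return out
```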

Fig. 3: Miss detection (right) and false alarm (left) rates of a tested system upon injection of target, geometrically warped, and fully simulated movies. Full simulation adds photometric histogram matching to the geometrical warping. The graphs show performance as a function of the required overlap between ground truth and detected rectangles, i.e. the parameter t in equation 2. Results are averaged over 10 injections of the 3 videos, with 1 standard deviation lines plotted. The simulated movie has a consistently better detection rate, providing an upper bound which is up to 15% away from the actual detection rate.

This process is applied to the three color channels independently. After this transformation, the pixel value distribution is the same for T and T′, and the resulting image T′ is more visually similar to T.

In Figure 3 the detection performance of a tested system on original and simulated movies is compared. It can be seen that the performance using the simulated clip can accurately predict the false alarm rate of the target camera, while the estimate of detection rates is slightly biased, predicting better detection rates than were actually achieved with the target camera. We believe the reason for this difference is the superiority of the high resolution source camera with respect to image sharpness and low image noise, granting it better performance even after the quality reduction induced by the geometric and photometric transformations.

One drawback of warping by a homography through the road plane is that pixels that are not on the road plane are not accurately positioned in the simulated image. Specifically, the upper part of the pedestrian is far from the road plane, and it is slightly tilted in the simulated image (see Figure 4). For pedestrians beyond a certain range the effect is minimal. Another drawback of the technique is a possible loss in field of view. As can be seen in Figure 2(e), the warped video may have blank areas near the edges of the image. Clearly, working with simulated clips cannot completely check detection performance in all of the relevant field of view. However, for most systems it is expected that detection performance will not depend on the location in the viewing field, as the same image processing and classifiers are applied over the whole FOV. Another phenomenon which may interfere with correct evaluation is that the blank areas create artificial edges which may influence the detection systems. To avoid this, detections and objects that are close to these areas are not taken into account during evaluation.

Fig. 4: Homographic transformation through the road plane creates distortion in objects that are high above the road. (a) A source image after warping. (b) The target image. (c) The pixel-wise difference between the warped and target images. The feet of the person were warped correctly, but the upper body is slightly tilted.

IV. GROUND TRUTH

We have used a publicly available labeling tool [2] to manually annotate bounding boxes on children and pedestrians. The tool requires marking object bounding boxes in key frames, and uses linear interpolation to infer the bounding box locations in intermediate frames. This linear interpolation can then be corrected by the annotator. Each object is classified as pedestrian, people, or area to ignore. The manual marking process is slow and expensive. Using linear interpolation to avoid marking every frame saves time, but reduces the accuracy of object positions. A benefit of manual marking is that the annotator can notice complicated situations that have to be ignored by the evaluation platform and assign them the 'ignore' label.

While manual annotation provides accurate bounding boxes, the real-world coordinates of the annotated objects are not known and are inferred using the flat ground assumption. The center of the rectangle's bottom side is assumed to be the ground contact point of the object. This image point is converted to ground coordinates using a composition of the lens distortion correction function and the projective transformation from image to ground plane (i.e. we apply to it the function P_T⁻¹ ◦ D_T⁻¹ in the terminology of section III). For the estimation of P_T we put 4 targets at known real-world positions, and use the correspondence of real-world coordinates with image coordinates (after undistortion). The accuracy of this mapping was shown to be about 2% at a distance of 2 meters from the camera (see Figure 5). Manual annotation using our current tools is a laborious process, requiring 40-50 minutes for the annotation of a 1 minute movie.
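For illustration, the conversion of an annotated rectangle to ground coordinates might look like the sketch below. It only mirrors the composition P_T⁻¹ ◦ D_T⁻¹ described above; undistort_point and H_img2ground are hypothetical stand-ins for the estimated lens model and the image-to-ground homography fitted from the 4 targets, not names from the platform.

```python
# Sketch: bounding box -> ground-plane coordinates under the flat-ground assumption.
import numpy as np

def box_to_ground(box, undistort_point, H_img2ground):
    x1, y1, x2, y2 = box                      # left, top, right, bottom in pixels
    contact = ((x1 + x2) / 2.0, y2)           # center of the bottom side
    u, v = undistort_point(contact)           # D_T^{-1}: lens distortion correction
    p = H_img2ground @ np.array([u, v, 1.0])  # P_T^{-1}: undistorted image -> ground
    return p[0] / p[2], p[1] / p[2]           # ground-plane (X, Z) in meters
```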

Fig. 5: Calibration of projective transform for distance measurement based on flat ground assumption. (a) 4 targets were spread over a flat ground in known locations relative to the camera. After image distortion cancelation, the image coordinates of the targets were marked and used for the projective transform calculation. (b) An image used to verify the accuracy of the computed transformation. The distance to the two targets was estimated by the transformation, and the error was about 2% of the actual distance.

In these circumstances, automatic or semi-automatic annotation becomes an important enabler for large scale databases. Lidars offer an attractive option for pedestrian ground truthing, thanks to their exact range estimation and high angular accuracy. We have tested pedestrian ground truthing using an IBEO Lux Lidar. This Lidar emits 4 beams at elevation angles of −0.8, 0, 0.8, 1.6 degrees, scanning a horizontal field of view of 85 degrees. The detection range is up to 90 meters with an accuracy of less than 10cm, and the update rate is 12Hz. The Lidar has accompanying software for classification of the detected objects, discriminating between pedestrians, vehicles, and unknown objects.

Our experiments for checking Lidar ground truth validity were conducted with 3 video streams of length 45 seconds each, taken from a vehicle at a garage, near a bush, and in an open street. In each clip two pedestrians were freely moving at various distances from the camera, entering and leaving the field of view multiple times. The Lidar was mounted below the camera, and a projective transformation P was trained to map (X, Z) ground coordinates in the IBEO system to the image plane, as described earlier in this section. Corresponding points between the IBEO and the camera were found by placing a box at several locations in front of the sensors, then manually annotating its detections in the video and IBEO data. Synchronization of IBEO and camera readings was obtained using CAN tools. For each IBEO frame, IBEO pedestrian detections were projected into the relevant image in the video stream, and were manually annotated by a human operator as successful detections or false positives. In addition, all pedestrian appearances in the movies were marked to account for IBEO misses. In our evaluation we used only IBEO detections declared by its classification system as pedestrians. Other classes, such as 'small object', were not useful for detecting pedestrians due to the high false alarm rate they induce.

Overall, the IBEO achieved a detection rate of 82%, with 1.5% false alarms per frame. For an evaluation platform, this means that this level of ground truthing can be achieved automatically, without any human intervention in the annotation process.

Fig. 6: Breakdown of Lidar misdetection errors according to their causal source (entrance latency, occlusion, object proximity, sensor proximity, verge of FOV, unexplained). See text for explanation. The dominant error sources can be handled using a modest manual effort.

However, 82% successful detection may not be enough, and better rates can be obtained with minimal human intervention. In Figure 6 a breakdown of IBEO failure causes is given. The dominant failure causes are entry latency, pedestrian proximity to another object, and the pedestrian being extremely near the vehicle, accounting for 43%, 21.8% and 21.4% of the misdetections respectively. Entry latency includes the 10-20 initial frames of object appearance, in which the object is usually reported by the IBEO as an 'unknown object' and not classified correctly. In cases of extreme proximity of the pedestrian to the camera, he is usually also reported as an 'unknown object'. However, both of these error-prone cases can be easily found in an automatic manner using simple rules given the IBEO annotation, and submitted to a human annotator for re-annotation. For example, the 15 frames before the appearance of an IBEO-reported pedestrian can be sent to human annotation (a simple version of such a rule is sketched below). If this strategy is adopted, the detection rate achieved can be 92.7%, as two major sources of error are eliminated. The last significant source of error is proximity of the pedestrian to another object like a car or a bush. These failure cases are composed of 3-5 second long frame sequences, and are hard to detect automatically, but can be easily detected by a human operator looking at the video with IBEO detections superimposed upon it, and subsequently corrected. Hence, by applying a single human observation sweep over the recorded data, the detection rate can reach 97.5%. The rest of the misdetections are composed of short and unexpected events, which are relatively hard to find and correct.

Unlike manually annotated ground truth, Lidar-based ground truth does not give us image rectangles bounding the pedestrians. One may use the flat ground assumption to infer the ground contact point of the pedestrian, hence the middle of the bottom rectangle side, but no information is usually available regarding the pedestrian's height, so the horizontal length of the bounding rectangle cannot be determined. This motivates basing the detection criterion on real-world pedestrian coordinates, rather than on rectangle overlap in image space (as is commonly done). This issue is further discussed in section V-B.
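A possible automatic rule for harvesting the entry-latency and sensor-proximity cases for human re-annotation might look like the following sketch. The 15-frame window comes from the discussion above, while everything else (the per-frame data layout, the class names, and the near_range threshold) is hypothetical and only meant to illustrate the idea.

```python
# Sketch of the semi-supervised review rule: flag the frames just before each
# Lidar-confirmed pedestrian track, plus frames where an object is reported as
# 'unknown' very close to the sensor, for manual re-annotation.
# frames: list of per-frame detection lists; each detection is a dict with a
# class label ('cls') and a range in meters ('range').
def frames_for_review(frames, entry_window=15, near_range=2.0):
    review = set()
    prev_has_ped = False
    for i, dets in enumerate(frames):
        has_ped = any(d["cls"] == "pedestrian" for d in dets)
        if has_ped and not prev_has_ped:
            # Entry latency: flag the frames just before a pedestrian track starts.
            review.update(range(max(0, i - entry_window), i))
        if any(d["cls"] == "unknown" and d["range"] < near_range for d in dets):
            # Extreme proximity to the sensor: classification is unreliable there.
            review.add(i)
        prev_has_ped = has_ped
    return sorted(review)
```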

V. EVALUATION CRITERIA

The primary statistics of interest for a detection system are the miss detection (false negative) and false detection (false positive) rates. In academic systems an ROC or DET curve is often used, showing the trade-off between these two quantities [6], [5]. However, for a black-box system only a single performance point from the ROC can be measured, as these systems do not provide an interface for changing this trade-off. As a result, direct comparison between two systems is non-trivial, as the performance is not summarized in a single scalar, but in two or more dimensions.

A. Detection criteria for video sequences

Evaluation methods in the computer vision community [8], [6] and in previous automotive pedestrian detection work [3] were devised to test detection performance in single images or independent movie frames. The ground truth annotations in this methodology are provided as bounding boxes in the images. A detection bounding box is intersected with all the ground truth rectangles in the image, and if the overlap area with one of them is large enough the detection is considered a true positive. Otherwise, the detection is considered a false positive. Ground truth objects not matched to any detection are counted as misses. In [8], [6] the overlap criterion demands that for two bounding boxes A, B:

    |A ∩ B| / |A ∪ B| > t    (2)

where t = 0.5 is usually used. In [3] a slightly different variant is used, where the denominator in Equation 2 is replaced with |A||B|, where |·| denotes the area of a rectangle. The miss detection statistic is the ratio of the number of misses to the number of targets, and the false alarm rate is the number of false positives divided by the number of frames.

In the above definition, the number of misses is calculated by summing them over all frames. This definition is simple, but it has the disadvantage that both of the following scenarios would have the same miss detection rate: Scenario A: Targets 1 and 2 are both repeatedly detected and missed with a cycle period of 0.5 seconds. Scenario B: Target 1 is detected in all frames and Target 2 is missed in all frames. Clearly Scenario A is preferable to B from an application point of view, but both would result in the same score. Therefore, we define a grace period, Time of Miss Gap (TMG). Misses lasting for TMG seconds or less, which occur after a successful detection, are not counted.

For false positives, simple summation over the frames has the disadvantage that the false alarm rate depends on the frame rate. Even when injected with the same movie, systems differ from each other with respect to the number of frames they process. We therefore define a minimum duration, Time of False Gap (TFG), and an additional false detection is counted for every TFG seconds that the false target track persists. There is no grace period for the initial false target report: a false detection is counted when it first appears.
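For reference, a minimal sketch of the per-frame overlap test of equation (2) is given below; the (x1, y1, x2, y2) rectangle format is our assumption, not the platform's interface.

```python
# Sketch of the overlap matching criterion of equation (2): a detection and a
# ground-truth rectangle match when intersection / union exceeds t.
def overlap_match(det, gt, t=0.5):
    # Rectangles as (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_det + area_gt - inter
    return union > 0 and inter / union > t
```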

Fig. 7: Inadequacy of the overlap detection criterion. Left: The overlap criterion suggests that the dashed green rectangle is a successful detection of the child marked by the green rectangle. However, the former is at a distance of 75cm, while the child is at a distance of 25cm from the camera. Right: Histogram of relative range deviations (the ratio in equation 3) for successful detections according to the overlap matching criterion.

B. Detection criterion with world coordinates

Our experience shows that the overlap detection criterion (Equation 2) may not be suitable for automotive safety systems, as high image overlap sometimes does not indicate proximity in real-world object location. Figure 7 (left) shows an example of a detection rectangle considered a hit according to the overlap criterion of eq. 2 with t = 0.5, while the detected box is actually located far behind the true object. In a real-world scenario this is problematic, as the system assumes that objects are much farther than they actually are, and it may fail to alert the driver. Figure 7 (right) shows a histogram of range differences between detected objects, determined using the overlap criterion, and their matching ground truth objects. As can be seen, the phenomenon discussed is not uncommon, and for close objects the range difference may be up to 6 times the actual distance.

We suggest an alternative criterion for matching detections to ground truth objects, based on real-world coordinates on the ground plane. Denote the in-plane coordinates of the detected and ground truth objects as (x_d, z_d) and (x_o, z_o) respectively. A detection at (x_d, z_d) is a successful match to an object at (x_o, z_o) iff

    ||(x_d, z_d) − (x_o, z_o)|| / ||(x_o, z_o)|| ≤ t    (3)

where t is a threshold, set to 0.2 by default in our tests. This criterion demands proximity between detection and object on the ground plane, where the required proximity is proportional to the distance of the object. Results obtained when switching to the new criterion are of higher relevance to automotive tasks, like collision alert or avoidance. The criterion allows matching between one detection and several ground truth objects, or one ground truth object and several detections. For some applications it may be correct to match a detection only to the closest ground truth object, but we did not force such a one-to-one matching as it is not required for collision mitigation applications.
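A minimal sketch of the criterion in equation (3), operating on ground-plane coordinates in meters (the names are ours, not the platform's):

```python
# Sketch of the world-coordinate criterion of equation (3): the detection must lie
# within a fraction t of the object's ground-plane distance from the camera.
import math

def ground_match(det_xz, gt_xz, t=0.2):
    xd, zd = det_xz
    xo, zo = gt_xz
    deviation = math.hypot(xd - xo, zd - zo)   # ||(x_d, z_d) - (x_o, z_o)||
    distance = math.hypot(xo, zo)              # ||(x_o, z_o)||
    return distance > 0 and deviation / distance <= t
```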

  Data partition           Parameters              Measurements
  Target aspect ratio      Detection sensitivity   Detection rate
  Target distance          TMG, TFG                False alarm rate
  Location in FOV                                  Range accuracy
  Clip start                                       Azimuth accuracy

TABLE II: Possible evaluation span: the evaluation platform measures performance measurements listed in the third column, on data partitions according to variables in the first column, and as a function of parameters listed in the middle column. For example, we can plot detection and false alarm rates as a function of detection sensitivity ('t' in equation 2 or 3), as shown in figure 3, or compare between the azimuth and range accuracy in the FOV middle and side.

C. A system performance map

Based on the detections, we also estimate other performance statistics of interest, such as range accuracy and azimuth accuracy. For a successfully detected target, the range estimation error is defined as |R − R̂|/R, with R the distance to the ground truth object and R̂ the estimated distance to the detected target as measured by the detection system. The azimuth error is defined as (h − ĥ)/α, where h, ĥ are the horizontal image coordinates of the ground truth and detected objects, and α is the number of degrees per pixel in the camera used. We compute α by dividing the camera's horizontal FOV in degrees by the number of horizontal pixels. This is an approximate estimate as it ignores camera distortion, and a better estimate is possible when the internal camera parameters are known.

We use the performance statistics described in this section, and a set of data partitions similar to those defined in [6], to obtain a map of system performance under different conditions. Table II gives an overview of the possible evaluation points in this map. Data partitions of interest include partitioning according to target aspect ratio, since it separates between upright and non-upright pedestrians, and looking only at clip starts to estimate detection latency and detection quality without tracking cues.

VI. CONCLUDING REMARKS

We presented an evaluation platform for pedestrian detection which was used to evaluate several detection modules and algorithms suggested by different vendors. The platform enables estimation of detection statistics and detection accuracy in various data partitions, thus providing a characterization of the tested system's abilities. We showed that injection of analog camera video can be done with minimal variability in the testing results. However, multi-exposure cameras like the one used in [7] are still a challenge, not faced in this work. Experiments carried out with camera simulation show that this technique is useful when the camera cannot be tuned to the database acquisition parameters, and that it gives performance estimates which are slightly biased, but nevertheless indicative of the real system performance. In this respect, results may be improved by exploring finer simulation transformations, including mimicking the typical blur and image noise of the target camera.

Lidar-based ground truthing was found to be a promising path, though a semi-supervised approach is still required to reach high (> 95%) rates of correct pedestrian annotation. This may seem to be in contrast to recent literature [14], in which high-end Lidars were reported to achieve classification rates approaching 99%. However, those results were reported for successfully segmented and reasonably long tracks, which are not available in cases of pedestrian entry or pedestrians standing near another object, the main error sources according to our analysis.

Finally, we have suggested improvements to the performance statistics used in pedestrian detection, in order to make them more appropriate for the automotive context. The criteria we suggested are still sub-optimal in several respects. Specifically, our suggested criterion sometimes declares a successful detection when the detection image bounding box has no overlap at all with the ground truth bounding box, which may seem counter-intuitive. A possible solution to this may be to put different demands on the range and azimuth accuracy, asking for high accuracy from the latter and lower accuracy from the former.

VII. ACKNOWLEDGMENTS

We are grateful to Dan Levi, Wende Zhang, David Bar-Lev and Ronen Lerner for their help with this work, their ideas, and manuscript refining.

REFERENCES

[1] http://www.vision.caltech.edu/bouguetj/calib_doc
[2] http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
[3] M. Bertozzi, A. Broggi, P. Grisleri, A. Tibaldi, and M. Del Rose. A tool for vision based pedestrian detection performance evaluation. In Intelligent Vehicles Symposium, 2004.
[4] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf. Survey of pedestrian detection for advanced driver assistance systems. PAMI, pages 1239-1258, 2009.
[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, 2009.
[7] E. Raphael, R. Kiefer, P. Reisman, and G. Hayon. Development of a camera-based forward collision alert system. SAE International Journal of Passenger Cars.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge 2009 (VOC2009) results.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[10] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. Recent progress in road and lane detection - a survey. Accepted to Machine Vision and Applications, 2012.
[11] Feng Luo, Chu Liu, and Zechang Sun. Intelligent vehicle simulation and debugging environment based on physics engine. In Asia Conference on Informatics in Control, Automation and Robotics, 2009.
[12] Daniele Marenco, Davide Fontana, Guido Ghisio, Gianluca Monchiero, Elena Cardarelli, Paolo Medici, and Pier Paolo Porta. A validation tool for traffic sign recognition systems. In IEEE Conference on Intelligent Transportation Systems, 2009.
[13] Joel C. McCall and Mohan M. Trivedi. Performance evaluation of a vision based lane tracker designed for driver assistance systems. In Intelligent Vehicles Symposium, 2005.
[14] Alex Teichman, Jesse Levinson, and Sebastian Thrun. Towards 3D object recognition via classification of arbitrary object tracks. In International Conference on Robotics and Automation (ICRA), 2011.
[15] Z. Papp, K. Labibes, A. C. Thean, and M. G. Van Elk. Multi-agent based HIL simulator with high fidelity virtual sensors. In Intelligent Vehicles Symposium, 2003.
