Continuously Tracking Objects Across Multiple Widely Separated Cameras

Yinghao Cai, Wei Chen, Kaiqi Huang, and Tieniu Tan

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, 100080, China
{yhcai,wchen,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. In this paper, we present a new solution to the problem of multi-camera tracking with non-overlapping fields of view. The identities of moving objects are maintained as they travel from one camera to another. Appearance information and spatio-temporal information are explored and combined in a maximum a posteriori (MAP) framework. To compute the appearance probability, a two-layered histogram representation is proposed to incorporate spatial information about each object, and the diffusion distance is applied to histogram matching to compensate for illumination changes and camera distortions. To derive the spatio-temporal probability, the transition time distribution between each pair of entry and exit zones is modeled as a mixture of Gaussian distributions. Experimental results demonstrate the effectiveness of the proposed method.
1 Introduction
Nowadays, distributed networks of video sensors are applied to monitor activities over complex areas. Instead of relying on a single high-resolution camera with a limited field of view, multiple cameras provide a solution to wide-area surveillance by extending the field of view of a single camera. Both overlapping and non-overlapping camera configurations can be employed in multi-camera surveillance systems. Continuously tracking objects across cameras is usually termed "object handover". The objective of handover is to maintain the identities of moving objects as they travel from one camera to another. More specifically, when an object appears in one camera, we need to determine whether it has previously appeared in other cameras or whether it is a new object. Earlier work on handover requires either calibrated cameras or overlapping fields of view. Subsequent approaches recover the relative positions between cameras by statistical consistency; such statistical information reveals how people are likely to move between cameras. Possible cues for tracking across cameras include appearance information and spatio-temporal information. Appearance information includes the size, color and height of a moving object, while spatio-temporal information refers to transition time, velocity, entry zone, exit zone, trajectory, etc. These cues constrain the possible transitions between cameras; for example, a person who leaves the field of view of one camera
at exit zone A will never appear at entry zone B of another camera lying in the direction opposite to his or her motion. Combining appearance information with spatio-temporal information is promising, since it requires no a priori calibration and can adapt to changes in the cameras' positions. In this context, tracking objects across cameras is achieved by computing the probability of correspondence according to appearance and spatio-temporal cues. Since the cameras are non-overlapping, the appearances of moving objects may exhibit significant differences due to different illumination conditions, poses and camera parameters; even within the same scene, the illumination conditions vary over time. As for spatio-temporal information, the transition time from one camera to another differs dramatically from person to person: some people wander along the way, while others are rushing against time. In addition, as pointed out in [1], the denser the observations and the longer the transition time, the more likely false correspondences become. In this paper, we address these problems under a maximum a posteriori (MAP) framework. The probability that two observations under two cameras were generated by the same object depends on both the appearance probability and the spatio-temporal probability. At the off-line training stage, we assume the correspondences between objects are known, and the parameters for appearance matching and the transition distributions between each pair of entry and exit zones are learned. At the testing stage, correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework. Experimental results demonstrate the effectiveness of the proposed algorithm. In the remainder of this paper, an overview of related work is given in Section 2. The experimental setup is described in Section 3. The MAP framework, together with the appearance and spatio-temporal probabilities, is presented in Section 4. Experimental results and conclusions are given in Sections 5 and 6, respectively.
2 Related Work
To compensate for color variations between two separated cameras, one solution is color normalization. Niu et al. [2] employ a comprehensive color normalization algorithm (CCN) to remove image dependency on lighting geometry and illuminant color; the procedure iterates until no change is detected. An alternative solution is to find a transformation matrix [3] or a mapping function [4] that maps the appearance of an object under one view to its appearance under another view. In [3], the transformation matrix is obtained by solving a linear matrix equation. Javed et al. [4] show that all brightness transfer functions (BTFs) from one camera to another lie in a low-dimensional subspace; however, [4] assumes planar surfaces and uniform lighting, which is unrealistic in real applications. In determining the spatio-temporal relationship between pairs of cameras, Javed et al. [5] employ a non-parametric Parzen window technique to estimate the spatio-temporal pdfs between cameras. In [6], it is assumed that all pairs
of arrival and departure events contribute to the distribution of transition time. Observations of transition time are accumulated into a reappearance-period histogram, whose peak indicates the most common transition time. No appearance information is used in [6]. In contrast, [2,3] weight the temporally correlating information by appearance information, so that only those observations which look similar in appearance are used to derive the spatio-temporal pdfs. Both [6] and [2,3] assume a single-mode transition distribution and thus cannot handle multi-modal transition situations. In this paper, a two-layered histogram representation is proposed to incorporate spatial information about objects. This representation is more descriptive than computing the histogram of the whole body directly. Furthermore, instead of modeling color changes between cameras explicitly as a mapping function or a transformation matrix, we apply the diffusion distance [7] to histogram matching to compensate for illumination changes and camera distortions. To deal with multi-modal transition situations, we model the spatio-temporal probability between each pair of entry and exit zones as a mixture of Gaussians. Correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework.
Fig. 1. (a) The layout of the camera system. (b) Three views from the three widely separated cameras.
3 Experimental Setup
The experimental setup consists of three cameras with non-overlapping fields of view. The cameras are widely separated, covering two outdoor settings and one indoor setting. The layout is shown in Figure 1(a). As can be seen from Figure 1(b), the illumination conditions differ considerably between views. For single-camera motion detection and tracking, a Gaussian mixture model (GMM) and a Kalman filter are applied, respectively. Figure 2(a), (b) and (c) show the numbers of people in cameras C1, C2 and C3, respectively; the number of people in each view is obtained by single-camera tracking.
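For completeness, a minimal sketch of this single-camera stage is given below. It relies on OpenCV's GMM background subtractor and Kalman filter rather than the authors' implementation, and the video path, blob-size threshold and noise covariances are assumptions. A full tracker would maintain one filter per person with data association, which is omitted here.

```python
# Sketch of the per-camera detection/tracking stage: GMM background subtraction
# plus a constant-velocity Kalman filter. OpenCV is a stand-in for the authors'
# implementation; the video path and thresholds below are assumed values.
import cv2
import numpy as np

def make_kalman():
    # State: [x, y, vx, vy]; measurement: [x, y] (centroid of a foreground blob).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

cap = cv2.VideoCapture("camera1.avi")          # hypothetical input video
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kf = make_kalman()                              # one filter per tracked person in a full system

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                      # GMM foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    prediction = kf.predict()                   # predicted centroid of the track
    cv2.circle(frame, (int(prediction[0, 0]), int(prediction[1, 0])), 4, (0, 0, 255), -1)
    for c in contours:
        if cv2.contourArea(c) < 400:            # ignore small blobs (assumed threshold)
            continue
        x, y, w, h = cv2.boundingRect(c)
        centroid = np.array([[x + w / 2.0], [y + h / 2.0]], np.float32)
        kf.correct(centroid)                    # update the track with the measurement
```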
Fig. 2. (a-c) Numbers of people in cameras C1, C2 and C3, respectively (people count plotted against time in minutes).
Dense observations make the handover problem more difficult. However, the proposed method provides a satisfactory result given the difficulties above.
4 Bayesian Framework
Suppose we have m people p_1, p_2, ..., p_m observed by n cameras C_1, C_2, ..., C_n, and let O_i^a (O_j^b) denote the observation of moving object p_a (p_b) under camera i (j). The observation of object p_a comprises appearance and spatio-temporal properties, denoted O_i^a(app) and O_i^a(st) respectively. By Bayes' theorem, given two observations O_i^a and O_j^b under two cameras, the probability that they were generated by the same object is [5]:

P(a = b \mid O_i^a, O_j^b) = \frac{P(O_i^a, O_j^b \mid a = b)\, P(a = b)}{P(O_i^a, O_j^b)}    (1)

where the denominator P(O_i^a, O_j^b) is a normalization term and P(O_i^a, O_j^b | a = b) depends on both the appearance probability and the spatio-temporal probability. P(a = b) is a constant denoting the prior probability of a transition from camera i to camera j, defined as

P(a = b) = \frac{\text{Num of transitions from } C_i \text{ to } C_j}{\text{Num of people exiting } C_i}    (2)

Since the appearance of an object does not depend on its spatio-temporal properties, we assume independence between O_i^a(app) and O_i^a(st), so that

P(a = b \mid O_i^a, O_j^b) \propto P(O_i^a(app), O_j^b(app) \mid a = b) \times P(O_i^a(st), O_j^b(st) \mid a = b)    (3)

The handover problem is now formalized as follows: given an observation O_i^a under camera i, find the observation O_j^b among the candidates Q_i^a within a time sliding window under camera j that maximizes the posterior probability P(a = b | O_i^a, O_j^b):

h = \arg\max_{O_j^b \in Q_i^a} P(a = b \mid O_i^a, O_j^b)    (4)

The appearance probability P(O_i^a(app), O_j^b(app) | a = b) and the spatio-temporal probability P(O_i^a(st), O_j^b(st) | a = b) are computed in Sections 4.2 and 4.3, respectively.
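Concretely, the decision rule of Eqs. (1)-(4) can be sketched as below. The data structure and the function names (appearance_likelihood, transition_likelihood, transition_prior) are hypothetical stand-ins for the models learned in Sections 4.2 and 4.3, not a transcription of the authors' implementation.

```python
# Hedged sketch of the MAP handover rule (Eqs. 1-4). The likelihood functions
# stand for the appearance and spatio-temporal models of Sections 4.2 and 4.3;
# their concrete forms and the Observation fields are assumptions.
from dataclasses import dataclass

@dataclass
class Observation:
    app: object        # two-layered histogram (Section 4.1)
    exit_zone: int     # exit/entry zone index from k-means clustering
    entry_zone: int
    time: float        # time stamp in seconds

def handover(obs_i, candidates_j, transition_prior,
             appearance_likelihood, transition_likelihood):
    """Return the candidate under camera j that maximizes
    P(app | a=b) * P(st | a=b) * P(a=b), i.e. Eq. (3) up to normalization."""
    best, best_score = None, 0.0
    for obs_j in candidates_j:                      # sliding-window candidates Q_i^a
        p_app = appearance_likelihood(obs_i.app, obs_j.app)
        p_st = transition_likelihood(obs_i.exit_zone, obs_j.entry_zone,
                                     obs_j.time - obs_i.time)
        score = p_app * p_st * transition_prior     # numerator of Eq. (1)
        if score > best_score:
            best, best_score = obs_j, score
    return best, best_score                         # treat as a new object if the score is too low
```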
4.1 Moving Object Representation
The purpose of moving object representation is to describe the appearance of each object so that it can be distinguished from other objects. The histogram is a widely used appearance descriptor. The main drawback of histogram-based methods is that they lose the spatial information of the color distribution, which is essential for discriminating between moving objects. For example, a histogram-based method cannot distinguish a person wearing a white shirt and blue pants from another person wearing a blue shirt and white pants.
Fig. 3. A two-layered histogram representation: (a,e) Histogram of the body, (b-d, f-h) Histograms of head, torso and legs respectively
In this paper, we propose a new moving object representation based on a two-layered histogram. As pedestrians are our primary concern, the human body is divided vertically into three subregions: head, torso and legs, similar to the method in [8]. The first layer of the proposed representation is the color histogram of the whole body, Htotal, while the second layer consists of the histograms of the head, torso and legs, denoted Hh, Ht and Hl respectively. Histograms are quantized into 30 bins in each of the R, G and B channels separately. It is worth pointing out that coarse quantization discards too much discriminatory information, while fine quantization results in sparse histogram representations; our preliminary experiments validate the adequacy of thirty bins in terms of discriminability and accuracy. Figure 3 shows the separated regions and their histogram representations. The two-layered histogram representation captures both a global image description and local spatial information. Figure 3 shows that two different people can have visually similar Htotal while their Ht differ considerably, which demonstrates that the proposed two-layered representation provides more discriminability than computing the histogram of the whole body directly. Each layer of the representation under one view is matched against its corresponding layer under another view as described in the next subsection.
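As a rough illustration (not the authors' code), the following sketch computes the two-layered representation. The vertical split ratios for head, torso and legs are assumed values, whereas the paper divides the body following [8]; each histogram uses 30 bins per R, G, B channel as described above.

```python
# Sketch of the two-layered histogram of Section 4.1. The 0.2/0.5/0.3 vertical
# split is an assumption; the paper follows the body division of [8].
import numpy as np

def rgb_hist(region, bins=30):
    """Per-channel histograms (R, G, B), each with `bins` bins, L1-normalized."""
    hists = []
    for ch in range(3):
        h, _ = np.histogram(region[:, :, ch], bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

def two_layer_histogram(person_patch):
    """person_patch: HxWx3 uint8 crop of a tracked pedestrian."""
    H = person_patch.shape[0]
    head  = person_patch[: int(0.2 * H)]
    torso = person_patch[int(0.2 * H): int(0.7 * H)]
    legs  = person_patch[int(0.7 * H):]
    return {
        "total": rgb_hist(person_patch),   # first layer: whole body (Htotal)
        "head":  rgb_hist(head),           # second layer: parts (Hh, Ht, Hl)
        "torso": rgb_hist(torso),
        "legs":  rgb_hist(legs),
    }
```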
4.2 Histogram Matching
As mentioned in Section 1, the appearances of moving objects under multiple non-overlapping cameras exhibit significant differences due to different illumination conditions, poses and camera parameters. To compute the appearance probability given observations under two cameras, we first obtain the two-layered histogram representation of Section 4.1, which provides some robustness to pose changes. In this section, we apply the diffusion distance to histogram matching to compensate for illumination changes and camera distortions. The diffusion distance was first proposed by Ling et al. [7]. This approach models the difference between two histograms as a temperature field: starting from an initial difference between the two histograms, a diffusion process smooths this difference with a Gaussian kernel, and as time increases the difference approaches zero. The distance between two histograms is therefore defined as the sum of dissimilarities over the diffusion process [7]:

K(hist1, hist2) = \sum_{i=0}^{N} k(|d_i(x)|)    (5)

where

d_0(x) = hist1(x) - hist2(x)    (6)
d_i(x) = [d_{i-1}(x) * \phi(x, \sigma)] \downarrow_2,  i = 1, ..., N    (7)

Here "\downarrow_2" denotes half-size downsampling, \sigma is the standard deviation of the Gaussian filter \phi, which is learned in the training phase, and k(|\cdot|) is chosen as the L1 norm. Each subsequent difference d_i(x) is defined as the half-size downsampling of the Gaussian-smoothed difference at the previous scale.
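A minimal sketch of Eqs. (5)-(7) for one-dimensional histograms is given below. The values of sigma and the number of pyramid levels are assumptions here (the paper learns sigma during training), and the simple every-other-bin subsampling stands in for half-size downsampling.

```python
# Sketch of the diffusion distance (Eqs. 5-7) for 1-D histograms.
# sigma and the number of pyramid levels are assumed defaults; the paper
# learns sigma in the training phase.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def diffusion_distance(hist1, hist2, sigma=0.5, levels=None):
    d = np.asarray(hist1, float) - np.asarray(hist2, float)   # d_0(x), Eq. (6)
    if levels is None:
        levels = int(np.log2(len(d)))                          # diffuse until ~1 bin remains
    dist = np.abs(d).sum()                                     # k(|d_0|), L1 norm
    for _ in range(levels):
        d = gaussian_filter1d(d, sigma)                        # convolve with phi(x, sigma)
        d = d[::2]                                             # half-size downsampling, Eq. (7)
        dist += np.abs(d).sum()                                # accumulate Eq. (5)
    return dist
```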
Fig. 4. Diffusion of the histogram difference over scales (time = 1 to 5). (a) Diffusion process for the difference between histograms of the same person under two views. (b) Diffusion process for two different people.
Then the ground distance between two histograms is defined as the sum of these norms over the N scales of the pyramid. An intuitive illustration is shown in Figure 4: Figure 4(a) shows the diffusion process for the difference between histograms of the same person under two views, and Figure 4(b) shows the diffusion process for two different people; the difference in (a) decays faster than that in (b). In our method, we compare Htotal, Hh, Ht and Hl of one object with the corresponding histograms under another view using the diffusion distance. The histogram representation is one-dimensional, since each of the R, G and B channels is treated separately. The four diffusion distances dtotal, dh, dt and dl are combined by a weighted sum. A Gaussian distribution over the distances between observations of the same object under different views is learned at the training stage, and the combined distances are then transformed into probabilities to obtain the appearance probability P(O_i^a(app), O_j^b(app) | a = b). A comparison with other histogram distances is given in Section 5.
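The weighted combination and the conversion of distances into an appearance probability could then look like the following sketch. The weights and the Gaussian parameters mu_d, sigma_d of same-person distances are placeholders for values learned offline, and diffusion_distance refers to the function in the previous sketch.

```python
# Sketch of turning the four diffusion distances into an appearance probability.
# The weights and the Gaussian parameters are learned offline in the paper;
# the values here are placeholders. diffusion_distance is from the sketch above.
import numpy as np

WEIGHTS = {"total": 0.4, "head": 0.1, "torso": 0.3, "legs": 0.2}   # assumed weights

def appearance_probability(hists_i, hists_j, mu_d, sigma_d):
    d = sum(WEIGHTS[k] * diffusion_distance(hists_i[k], hists_j[k]) for k in WEIGHTS)
    # Evaluate the learned Gaussian of same-person distances at the combined distance.
    return np.exp(-0.5 * ((d - mu_d) / sigma_d) ** 2) / (np.sqrt(2 * np.pi) * sigma_d)
```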
4.3 Spatio-temporal Information
To estimate the spatio-temporal relationship between pairs of cameras, at the off-line training stage we group the locations where objects appear (entry zones) and disappear (exit zones) by k-means clustering. The transition-time distribution between each pair of entry and exit zones is modeled as a mixture of Gaussian distributions. In this paper we choose K = 3; the three Gaussian components correspond to people walking slowly, at normal speed, and quickly. The probability of an observed transition time x is

P(x) = \sum_{i=1}^{3} \omega_i \, \eta_i(x, \mu_i, \sigma_i)    (8)

where \omega_i is the weight of the i-th Gaussian in the mixture, which can be interpreted as the prior probability that the random variable is generated by the i-th component, and \mu_i and \sigma_i are its mean and standard deviation. \eta is the Gaussian probability density function

\eta(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (9)
Fig. 5. (a) Transition distribution from Camera 1 to Camera 2. (b) Transition distribution from Camera 2 to Camera 3. Each panel shows the original distribution together with the mixture-of-Gaussians and single-Gaussian fits (probability vs. transition time).
The parameters of the model are estimated by expectation maximization (EM). It should be noted that a single Gaussian distribution cannot accurately model the transition-time distributions between cameras because of the variability of walking paces. Figure 5 shows the transition distributions and their approximations by a mixture of Gaussian distributions and by a single Gaussian distribution.
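As one possible realization (the paper does not name a library), the mixture of Eq. (8) can be fitted with EM using scikit-learn and evaluated for a candidate transition time as follows; the training transition times shown are toy data.

```python
# Sketch of fitting Eq. (8) with EM. scikit-learn is an assumption, not the
# authors' tool. transit_times holds training transition times (in seconds)
# for one (exit zone, entry zone) pair; the values below are toy data.
import numpy as np
from sklearn.mixture import GaussianMixture

transit_times = np.array([41.0, 43.5, 45.0, 46.5, 48.0, 50.0, 52.0,
                          55.0, 58.0, 62.0, 66.0, 69.0])
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(transit_times.reshape(-1, 1))

def transition_likelihood(t):
    """P(x = t) under the fitted 3-component mixture, Eq. (8)."""
    return float(np.exp(gmm.score_samples(np.array([[t]]))))
```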
5 Experimental Results
Experiments are carried out on two outdoor settings and one indoor setting as shown in Figure 1. The off-line training phase lasts 40 minutes, and the effectiveness of the algorithm is evaluated on ground-truthed sequences lasting one hour. At the off-line training stage, the locations where people appear and disappear are grouped into entry zones and exit zones respectively, as shown in Figure 6. It takes approximately 40-70 seconds to travel from Camera 1 to Camera 2 and from Camera 2 to Camera 3. Some sample images from the three views are shown in Figure 7. Our first experiment covers transitions from Camera 1 to Camera 2, i.e. the two outdoor settings; our second experiment covers Camera 2 and Camera 3. The numbers of correspondence pairs in the training stage, and of transitions and detected tracks in the testing stage, are summarized in Table 1.
Fig. 6. (a-c) Entry zones and exit zones for Cameras 1, 2 and 3, respectively.
Fig. 7. Each column contains the same person under two different views.

Table 1. Experimental description

                 Training Stage           Testing Stage
                 Correspondence Pairs     Transition Nums    Detected Tracks
  Experiment 1   100                      107                150
  Experiment 2   50                       75                 100
Fig. 8. Rank matching performance (accuracy vs. rank, 1-5). "app" denotes using appearance information only, "st" denotes using spatio-temporal information only, and "app & st" denotes using both appearance and spatio-temporal information. (a) Rank matching performance of Experiment 1. (b) Rank matching performance of Experiment 2.
Fig. 9. Continuously tracking objects across three non-overlapping views
Fig. 10. Rank 1 rates for diffusion distance, L1 distance and histogram intersection
Figure 8 shows the rank matching performance. The rank-i (i = 1, ..., 5) performance is the rate at which the correct person appears in the top i of the handover list. Different people with similar appearances introduce uncertainty into the system, which explains the rank-one accuracy of 87.8% in Experiment 1 and 76% in Experiment 2. When the top three matches are taken into consideration, the performance improves to 97.5% and 98.6% respectively. Figure 9 shows people being tracked correctly across the three views. To compare the diffusion distance with the widely used L1 distance and the histogram intersection distance [9], we use the same framework and simply replace the diffusion distance with each of the two alternatives. The rank-1 rates for the different distances are shown in Figure 10, which demonstrates the superiority of the diffusion distance.
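For reference, the rank-i rate reported in Figure 8 can be computed as in the short sketch below, where each test case pairs a ranked handover list with the ground-truth identity; the identities shown are toy values.

```python
# Sketch of the rank-i matching rate of Figure 8. `results` pairs the ranked
# handover list (best match first) with the ground-truth identity; toy data.
def rank_accuracy(results, i):
    """Fraction of cases where the true identity is within the top i matches."""
    hits = sum(1 for ranked, truth in results if truth in ranked[:i])
    return hits / len(results)

results = [(["p7", "p2", "p9"], "p7"),
           (["p4", "p1", "p5"], "p1"),
           (["p3", "p8", "p6"], "p6")]
print(rank_accuracy(results, 1), rank_accuracy(results, 3))   # 0.333..., 1.0
```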
6 Conclusion and Future Work
In this paper, we have presented a new solution to the problem of multi-camera tracking with non-overlapping fields of view. People are tracked correctly across the widely separated cameras by combining appearance and spatio-temporal cues under the MAP framework. Experimental results validate the effectiveness of the proposed algorithm. The proposed method requires an off-line training phase where parameters for appearance matching and transition probabilities are learned. Future work will focus on evaluation of the proposed method on larger datasets.
Acknowledgement. This work is partly supported by the National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and the CASIA Innovation Fund for Young Scientists.
References

1. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Proc. IEEE International Conference on Computer Vision (ICCV 2005), pp. 1842–1849 (2005)
2. Niu, C., Grimson, E.: Recovering non-overlapping network topology using far-field vehicle tracking data. In: Proc. International Conference on Pattern Recognition (ICPR 2006), pp. 944–949 (2006)
3. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
4. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple non-overlapping cameras. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 26–33 (2005)
5. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proc. IEEE International Conference on Computer Vision (ICCV 2003), pp. 952–957 (2003)
6. Makris, D., Ellis, T., Black, J.: Bridging the gaps between cameras. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2004), pp. 205–210 (2004)
7. Ling, H., Okada, K.: Diffusion distance for histogram comparison. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2006), pp. 246–253 (2006)
8. Hu, M., Hu, W., Tan, T.: Tracking people through occlusions. In: Proc. International Conference on Pattern Recognition (ICPR 2004), pp. 724–727 (2004)
9. Swain, M.J., Ballard, D.H.: Indexing via color histograms, pp. 390–393 (1990)