Abstract. In this paper, we present a new solution to the problem of multi-camera tracking with non-overlapping fields of view. The identities of moving objects are maintained as they travel from one camera to another. Appearance information and spatio-temporal information are explored and combined in a maximum a posteriori (MAP) framework. In computing the appearance probability, a two-layered histogram representation is proposed to incorporate spatial information about objects. Diffusion distance is applied in histogram matching to compensate for illumination changes and camera distortions. In deriving the spatio-temporal probability, the transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. Experimental results demonstrate the effectiveness of the proposed method.

1 Introduction

Nowadays, distributed networks of video sensors are deployed to monitor activities over complex areas. Instead of a single high-resolution camera with a limited field of view, multiple cameras provide a solution to wide-area surveillance by extending the field of view of any single camera. Various configurations, with overlapping or non-overlapping views, can be employed in multi-camera surveillance systems. Continuously tracking objects across cameras is usually termed "object handover". The objective of handover is to maintain the identities of moving objects as they travel from one camera to another. More specifically, when an object appears in one camera, we need to determine whether it has previously appeared in other cameras or is a new object.

Earlier work on handover requires either calibrated cameras or overlapping fields of view. Subsequent approaches recover the relative positions between cameras by statistical consistency: statistical information reveals how people are likely to move between cameras. Possible cues for tracking across cameras include appearance information and spatio-temporal information. Appearance information includes the size, color and height of a moving object, among others, while spatio-temporal information refers to transition time, velocity, entry zone, exit zone, trajectory, and so on. These cues constrain the possible transitions between cameras; for example, a person who leaves the field of view of one camera at exit zone A will never appear at an entry zone B of another camera that lies opposite to his or her direction of motion.

Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 843–852, 2007. © Springer-Verlag Berlin Heidelberg 2007

844 Y. Cai et al.

Combining appearance information with spatio-temporal information is promising, since it does not require a priori calibration and can adapt to changes in the cameras' positions. In this context, tracking objects across cameras is achieved by computing the probability of correspondence according to appearance and spatio-temporal cues. Since the cameras are non-overlapping, the appearances of moving objects may exhibit significant differences due to different illumination conditions, poses and camera parameters. Even within the same scene, the illumination conditions vary over time. As for spatio-temporal information, the transition time from one camera to another differs dramatically from person to person: some people wander along the way, while others rush against time. In addition, as pointed out in [1], the denser the observations and the longer the transition times, the more likely false correspondences become.

In this paper, we address these problems in a maximum a posteriori (MAP) framework. The probability that two observations under two cameras are generated by the same object depends on both an appearance probability and a spatio-temporal probability. At the off-line training stage, we assume the correspondences between objects are known, and learn the parameters for appearance matching and the transition distributions between each pair of entry and exit zones. At the testing stage, correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework. Experimental results demonstrate the effectiveness of the proposed algorithm.

In the remainder of this paper, an overview of related work is given in Section 2. Section 3 describes the experimental setup.
The MAP framework, together with the appearance and spatio-temporal probabilities, is presented in Section 4. Experimental results and conclusions are given in Section 5 and Section 6, respectively.

2 Related Work

To compensate for color variations between two separated cameras, one solution is color normalization. Niu et al. [2] employ a comprehensive color normalization (CCN) algorithm to remove image dependency on lighting geometry and illuminant color; the procedure iterates until no change is detected. An alternative solution is to find a transformation matrix [3] or a mapping function [4] that maps the appearance of an object under one view to its appearance under another. In [3], the transformation matrix is obtained by solving a linear matrix equation. Javed et al. [4] show that all brightness transfer functions (BTFs) from one camera to another lie in a low-dimensional subspace; however, [4] assumes planar surfaces and uniform lighting, which rarely hold in real applications. In determining the spatio-temporal relationship between pairs of cameras, Javed et al. [5] employ a non-parametric Parzen window technique to estimate the spatio-temporal pdfs between cameras. In [6], it is assumed that all pairs

Continuously Tracking Objects Across Multiple Widely Separated Cameras


of arrival and departure events contribute to the distribution of transition time. Observations of transition time are accumulated into a reappearance-period histogram, whose peak indicates the most popular transition time. No appearance information is used in [6]. Furthermore, [2,3] weight the temporally correlating information by appearance information: only those observations that look similar in appearance are used to derive the spatio-temporal pdfs. Both [6] and [2,3] assume a single-mode transition distribution and lack the flexibility to handle multi-modal transition situations.

In this paper, a two-layered histogram representation is proposed to incorporate spatial information about objects. This representation is more descriptive than computing the histogram of the whole body directly. Furthermore, instead of modeling color changes between cameras explicitly as a mapping function or a transformation matrix, we apply diffusion distance [7] in histogram matching to compensate for illumination changes and camera distortions. To deal with multi-modal transition situations, we model the spatio-temporal probability between each pair of entry zone and exit zone as a mixture of Gaussians. Correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework.

Fig. 1. (a) The layout of the camera system, (b) Three views from three widely separated cameras

3 Experimental Setup

The experimental setup consists of three cameras with non-overlapping fields of view. The cameras are widely separated, covering two outdoor settings and one indoor setting. The layout is shown in Figure 1(a). As can be seen from Figure 1(b), the illumination conditions differ considerably between views. For motion detection and tracking within a single camera, a Gaussian mixture model (GMM) and a Kalman filter are applied, respectively. Figures 2(a), (b) and (c) show the numbers of people in cameras C1, C2 and C3 respectively, obtained by single-camera tracking.

Y. Cai et al. 40

35

35

30

30

25 20 15 10 5 0 0

30

25

Num of people

40

Num of people

Num of people

846

25 20 15 10

20

30

40

50

60

70

80

90 100 110 120

0 0

15

10

5

5

10

20

10

20

30

40

50

60

70

80

90 100 110 120

0 0

10

20

30

40

50

60

70

Time(min)

Time(min)

Time(min)

(a)

(b)

(c)

80

90 100 110 120

Fig. 2. (a-c) Numbers of people in camera C1 , C2 , and C3 respectively

Dense observations make the handover problem more difficult. Nevertheless, the proposed method provides satisfactory results despite the difficulties above.

4 Bayesian Framework

Suppose we have m people p_1, p_2, ..., p_m observed by n cameras C_1, C_2, ..., C_n, and let O_i^a (O_j^b) denote the observation of moving object p_a (p_b) under camera i (j). An observation of object p_a comprises appearance and spatio-temporal properties, written O_i^a(app) and O_i^a(st) respectively. By Bayes' theorem, given two observations O_i^a and O_j^b under two cameras, the probability that they were generated by the same object is [5]:

P(a = b | O_i^a, O_j^b) = P(O_i^a, O_j^b | a = b) P(a = b) / P(O_i^a, O_j^b)    (1)

where the denominator P(O_i^a, O_j^b) is a normalization term, and the likelihood P(O_i^a, O_j^b | a = b) depends on both the appearance probability and the spatio-temporal probability. P(a = b) is a constant denoting the prior probability of a transition from camera i to camera j, defined as

P(a = b) = (number of transitions from C_i to C_j) / (number of people exiting C_i)    (2)

Since the appearance of an object does not depend on its spatio-temporal properties, we assume O_i^a(app) and O_i^a(st) are independent, so

P(a = b | O_i^a, O_j^b) ∝ P(O_i^a(app), O_j^b(app) | a = b) × P(O_i^a(st), O_j^b(st) | a = b)    (3)

The handover problem is now formalized as follows: given an observation O_i^a under camera i, find, among the candidate observations Q_i^a that fall in a time sliding window under camera j, the one that maximizes the posterior probability P(a = b | O_i^a, O_j^b):

h = arg max_{O_j^b ∈ Q_i^a} P(a = b | O_i^a, O_j^b)    (4)

The appearance probability P(O_i^a(app), O_j^b(app) | a = b) and the spatio-temporal probability P(O_i^a(st), O_j^b(st) | a = b) are computed in Sections 4.2 and 4.3, respectively.
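The MAP assignment of Eqs. (1)-(4) can be sketched as follows. The candidate structure and all names here are illustrative assumptions; the likelihood values are assumed to be supplied by the appearance and spatio-temporal models of Sections 4.2 and 4.3.

```python
# Sketch of the MAP handover decision (Eqs. 1-4); names are illustrative.
# Each candidate carries precomputed likelihoods and the camera-pair prior.

def map_handover(candidates):
    """Return the candidate maximizing appearance * spatio-temporal * prior.

    candidates: list of dicts with keys
      'id'      -- track identity under the other camera
      'p_app'   -- appearance likelihood given the same object
      'p_st'    -- transition-time likelihood given the same object
      'p_trans' -- prior P(a = b) of Eq. (2)
    Returns (best_id, posterior), or (None, 0.0) if there is no candidate.
    """
    best_id, best_post = None, 0.0
    for c in candidates:
        # Eq. (3): independence of the two cues, weighted by the prior
        post = c['p_app'] * c['p_st'] * c['p_trans']
        if post > best_post:
            best_id, best_post = c['id'], post
    return best_id, best_post

# Hypothetical candidate set within the time sliding window
candidates = [
    {'id': 'track_7', 'p_app': 0.60, 'p_st': 0.30, 'p_trans': 0.8},
    {'id': 'track_9', 'p_app': 0.20, 'p_st': 0.70, 'p_trans': 0.8},
]
print(map_handover(candidates))  # selects 'track_7' (posterior ~0.144)
```

Note that the unnormalized posterior suffices for the arg max in Eq. (4), since the denominator of Eq. (1) is the same for all candidates of a given query.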

4.1 Moving Object Representation

The purpose of moving object representation is to describe the appearance of each object so that it can be discriminated from other objects. Histograms are widely used appearance descriptors. The main drawback of histogram-based methods is that they lose the spatial information of the color distribution, which is essential for discriminating different moving objects. For example, histogram-based methods cannot distinguish a person wearing a white shirt and blue pants from another person wearing a blue shirt and white pants.

Fig. 3. A two-layered histogram representation: (a,e) Histogram of the body, (b-d, f-h) Histograms of head, torso and legs respectively

In this paper, we propose a new moving-object representation based on a two-layered histogram. As pedestrians are our primary concern, the human body is divided vertically into three subregions: head, torso and legs, similar to the method in [8]. The first layer of the proposed representation is the color histogram of the whole body, H_total, while the second layer consists of the histograms of the head, torso and legs, denoted H_h, H_t and H_l respectively. Histograms are quantized into 30 bins in each of the R, G, B channels separately. It is worth pointing out that coarse quantization discards too much discriminatory information, while fine quantization results in sparse histogram representations; our preliminary experiments validate the adequacy of 30 bins in terms of discriminability and accuracy. Figure 3 shows the separated regions and their histogram representations. The two-layered histogram representation captures both a global image description and local spatial information. In Figure 3, two different people have visually similar H_total, yet their H_t's are quite different, which demonstrates that the proposed two-layered representation is more discriminative than computing the histogram of the whole body directly. Each layer of the representation under one view is matched against its corresponding layer under another view, as described in the next subsection.

4.2 Histogram Matching

As mentioned in Section 1, the appearances of moving objects under multiple non-overlapping cameras exhibit significant differences due to different illumination conditions, poses and camera parameters. To compute the appearance probability of observations under two cameras, we first obtain the two-layered histogram representation of Section 4.1, which provides some robustness to pose changes. In this section, we apply diffusion distance in histogram matching to compensate for illumination changes and camera distortions.

Diffusion distance was first proposed by Ling and Okada [7]. This approach models the difference between two histograms as a temperature field. A diffusion process on this field smooths the difference between the two histograms with a Gaussian kernel; as time increases, the difference approaches zero. The distance between two histograms can therefore be defined as the sum of dissimilarities over the diffusion process [7]:

K(hist1, hist2) = Σ_{i=0}^{N} k(|d_i(x)|)    (5)

where

d_0(x) = hist1(x) − hist2(x)    (6)

d_i(x) = [d_{i−1}(x) ∗ φ(x, σ)] ↓_2,  i = 1, ..., N    (7)

Here "↓_2" denotes half-size downsampling, σ is the standard deviation of the Gaussian filter φ, which can be learned in the training phase, and k(|·|) is chosen as the L1 norm. Each subsequent difference d_i(x) is the Gaussian-smoothed, half-size downsampled version of the previous layer, so the ground distance between two histograms is the sum of L1 norms over the N scales of the resulting pyramid.

Fig. 4. Diffusion processes plotted over time (time = 1, ..., 5). (a) Diffusion process for the difference of histograms of the same person under two views, (b) Diffusion process for different people.

An intuitive illustration is shown in Figure 4: Figure 4(a) shows the diffusion process for the difference between histograms of the same person under two views, and Figure 4(b) shows the process for two different people. The difference in (a) decays faster than that in (b).

In our method, we compare the H_total, H_h, H_t and H_l of one object with the corresponding histograms under another view using diffusion distance. The histogram representations are one-dimensional, since the R, G and B channels are treated separately. The four diffusion distances d_total, d_h, d_t and d_l are combined as a weighted sum. At the training stage, we fit a Gaussian distribution to the distances between observations of the same object under different views; at the testing stage, distances are transformed into probabilities under this distribution to obtain the appearance probability P(O_i^a(app), O_j^b(app) | a = b). A comparison with other histogram distances is given in Section 5.
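A minimal sketch of Eqs. (5)-(7) for one-dimensional channel histograms follows; the kernel width and pyramid depth are illustrative choices (in the paper, σ is learned at training time).

```python
import numpy as np

# Sketch of the diffusion distance of Eqs. (5)-(7); kernel support, sigma
# and the number of scales are illustrative. k(|.|) is the L1 norm.

def diffusion_distance(h1, h2, sigma=1.0, n_scales=4):
    """Sum of L1 norms of the histogram difference over a Gaussian pyramid."""
    # Eq. (6): initial difference layer d_0(x)
    d = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    # Discrete Gaussian kernel phi(x, sigma), normalized to sum to 1
    x = np.arange(-3, 4)
    phi = np.exp(-x ** 2 / (2 * sigma ** 2))
    phi /= phi.sum()
    dist = np.abs(d).sum()  # i = 0 term of Eq. (5)
    for _ in range(n_scales):
        # Eq. (7): Gaussian smoothing followed by half-size downsampling
        d = np.convolve(d, phi, mode='same')[::2]
        dist += np.abs(d).sum()
        if d.size <= 1:
            break
    return dist
```

Identical histograms yield a distance of zero, and differences that survive smoothing and downsampling (i.e., large-scale shifts of mass) accumulate across scales, which is what makes the measure tolerant of small illumination-induced bin shifts.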

4.3 Spatio-temporal Information

To estimate the spatio-temporal relationship between pairs of cameras, at the off-line training stage we group the locations where objects appear (entry zones) and disappear (exit zones) by k-means clustering. The transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. In this paper we choose K = 3: the three Gaussian components correspond to people who walk slowly, at normal speed, and quickly. The probability of an observed transition time x is

P(x) = Σ_{i=1}^{3} ω_i η(x, μ_i, σ_i)    (8)

where ω_i is the weight of the i-th Gaussian in the mixture, interpretable as the prior probability that the random variable was generated by the i-th component, and μ_i and σ_i are the mean and standard deviation of the i-th Gaussian. η is the Gaussian probability density function

η(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))    (9)
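The zone grouping mentioned at the start of this section can be sketched with a plain Lloyd's k-means over entry/exit locations; the value of K, the initialization and the fixed iteration count below are illustrative choices, not the paper's exact settings.

```python
import numpy as np

# Minimal k-means sketch for grouping entry/exit locations into zones.
# Initialization and iteration count are illustrative assumptions.

def kmeans_zones(points, k=3, iters=20, seed=0):
    """Cluster 2-D image locations into k zones; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        # Assign each location to its nearest zone center
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep a center in place if its cluster empties
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pts[labels == j].mean(axis=0)
    return centers, labels
```

Each resulting zone pair then gets its own transition-time mixture as in Eq. (8).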

Fig. 5. (a) Transition distribution from Camera 1 to Camera 2, (b) Transition distribution from Camera 2 to Camera 3


The parameters of the model are estimated by expectation maximization (EM). Note that a single Gaussian distribution cannot accurately model the transition time distributions between cameras, due to the variability of walking paces. Figure 5 shows a transition distribution together with its approximations by a mixture of Gaussian distributions and by a single Gaussian distribution.
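Evaluating Eq. (8) with the density of Eq. (9) can be sketched as follows. The component parameters below are hypothetical; in the paper, the weights, means and standard deviations are learned by EM from training correspondences.

```python
import math

# Evaluation of the transition-time mixture of Eqs. (8)-(9) with
# hypothetical parameters (weight, mean, std) per component.

def gaussian_pdf(x, mu, sigma):
    """Eq. (9): univariate Gaussian probability density."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def transition_prob(x, components):
    """Eq. (8): weighted sum over the K = 3 Gaussian components."""
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in components)

# Hypothetical components: slow, normal and fast walkers (seconds)
components = [(0.2, 70.0, 8.0), (0.6, 50.0, 5.0), (0.2, 42.0, 3.0)]
p_typical = transition_prob(50.0, components)   # near the dominant mode
p_outlier = transition_prob(120.0, components)  # far in the tail
```

A transition time near one of the learned modes receives a much higher spatio-temporal probability than an implausibly long one, which is what lets the MAP framework suppress false correspondences among dense observations.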

5 Experimental Results

Experiments are carried out on two outdoor settings and one indoor setting, as shown in Figure 1. The off-line training phase lasts 40 minutes, and the effectiveness of the algorithm is evaluated on ground-truthed sequences lasting one hour. At the off-line training stage, the locations where people appear and disappear are grouped into entry zones and exit zones respectively, shown in Figure 6. It takes approximately 40-70 seconds to travel from Camera 1 to Camera 2 and from Camera 2 to Camera 3. Some sample images from the three views are shown in Figure 7. Our first experiment consists of transitions from Camera 1 to Camera 2, covering the two outdoor settings. Our second experiment is carried out on Camera 2 and Camera 3. The numbers of correspondence pairs in the training stage, and of transitions and detected tracks in the testing stage, are summarized in Table 1.


Fig. 6. (a-c) Entry zones and exit zones for Camera 1, 2 and 3, respectively

Fig. 7. Each column contains the same person under two different views

Table 1. Experimental description

              Training Stage        Testing Stage
              Correspondence Pairs  Transition Nums  Detected Tracks
Experiment 1  100                   107              150
Experiment 2  50                    75               100

Fig. 8. Rank Matching Performance. “app” denotes using appearance information only, “st” means using spatio-temporal information only, “app & st” means both appearance information and spatio-temporal information are employed. (a) Rank Matching Performance of Experiment 1. (b) Rank Matching Performance of Experiment 2.

Fig. 9. Continuously tracking objects across three non-overlapping views

Fig. 10. Rank 1 rates for diffusion distance, L1 distance and histogram intersection

Figure 8 shows the rank matching performance. The rank-i (i = 1, ..., 5) performance is the rate at which the correct person appears in the top i of the handover list. Different people with similar appearances introduce uncertainty into the system, which explains the rank-1 accuracies of 87.8% in Experiment 1 and 76% in Experiment 2. Taking the top three matches into consideration improves the performance to 97.5% and 98.6% respectively. Figure 9 shows people being tracked correctly across the three views. To compare diffusion distance with the widely used L1 distance and the histogram intersection distance [9], we use the same framework and replace the diffusion distance with each alternative. The rank-1 rates for the different distances are shown in Figure 10, which demonstrates the superiority of diffusion distance.
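The rank-i measure used in Figure 8 can be sketched as follows; the handover lists in the example are hypothetical.

```python
# Sketch of rank-i matching accuracy: the fraction of queries whose correct
# identity appears in the top i of the handover list. Example data is
# hypothetical.

def rank_accuracy(ranked_lists, truths, i):
    """ranked_lists[q] is the handover list for query q, best match first."""
    hits = sum(1 for ranked, t in zip(ranked_lists, truths) if t in ranked[:i])
    return hits / len(truths)

ranked = [['p2', 'p1', 'p3'], ['p1', 'p3', 'p2'], ['p3', 'p2', 'p1']]
truth = ['p1', 'p1', 'p1']
print(rank_accuracy(ranked, truth, 1))  # one of three queries correct at rank 1
print(rank_accuracy(ranked, truth, 3))  # all queries correct within the top 3
```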


6 Conclusion and Future Work

In this paper, we have presented a new solution to the problem of multi-camera tracking with non-overlapping fields of view. People are tracked correctly across the widely separated cameras by combining appearance and spatio-temporal cues under the MAP framework, and experimental results validate the effectiveness of the proposed algorithm. The proposed method requires an off-line training phase in which the parameters for appearance matching and the transition probabilities are learned. Future work will focus on evaluating the proposed method on larger datasets.

Acknowledgement. This work is partly supported by the National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and the CASIA Innovation Fund for Young Scientists.

References

1. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1842–1849 (2005)
2. Niu, C., Grimson, E.: Recovering non-overlapping network topology using far-field vehicle tracking data. In: Proc. International Conference on Pattern Recognition (ICPR), pp. 944–949 (2006)
3. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
4. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple non-overlapping cameras. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26–33 (2005)
5. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proc. IEEE International Conference on Computer Vision (ICCV), pp. 952–957 (2003)
6. Makris, D., Ellis, T., Black, J.: Bridging the gaps between cameras. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 205–210 (2004)
7. Ling, H., Okada, K.: Diffusion distance for histogram comparison. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 246–253 (2006)
8. Hu, M., Hu, W., Tan, T.: Tracking people through occlusions. In: Proc. International Conference on Pattern Recognition (ICPR), pp. 724–727 (2004)
9. Swain, M.J., Ballard, D.H.: Indexing via color histograms. In: Proc. IEEE International Conference on Computer Vision (ICCV), pp. 390–393 (1990)