Adaptive background mixture models for real-time tracking
Chris Stauffer    W.E.L. Grimson
The Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139
Abstract
A common method for real-time segmentation of moving regions in image sequences involves “background subtraction,” or thresholding the error between an estimate of the image without moving objects and the current image. The numerous approaches to this problem differ in the type of background model used and the procedure used to update the model. This paper discusses modeling each pixel as a mixture of Gaussians and using an on-line approximation to update the model. The Gaussian distributions of the adaptive mixture model are then evaluated to determine which are most likely to result from a background process. Each pixel is classified based on whether the Gaussian distribution which represents it most effectively is considered part of the background model.
This results in a stable, real-time outdoor tracker which reliably deals with lighting changes, repetitive motions from clutter, and long-term scene changes. This system has been run almost continuously for 16 months, 24 hours a day, through rain and snow.
1 Introduction
In the past, computational barriers have limited the complexity of real-time video processing applications. As a consequence, most systems were either too slow to be practical, or succeeded by restricting themselves to very controlled situations. Recently, faster computers have enabled researchers to consider more complex, robust models for real-time analysis of streaming data. These new methods allow researchers to begin modeling real world processes under varying conditions.
Consider the problem of video surveillance and monitoring. A robust system should not depend on careful placement of cameras. It should also be robust to whatever is in its visual field or whatever lighting effects occur. It should be capable of dealing with movement through cluttered areas, objects overlapping in the visual field, shadows, lighting changes, effects of moving elements of the scene (e.g., swaying trees), slow-moving objects, and objects being introduced or removed from the scene. Traditional approaches based on backgrounding methods typically fail in these general situations. Our goal is to create a robust, adaptive tracking system that is flexible enough to handle variations in lighting, moving scene clutter, multiple moving objects and other arbitrary changes to the observed scene. The resulting tracker is primarily geared towards scene-level video surveillance applications.
1.1 Previous work and current shortcomings
Most researchers have abandoned non-adaptive methods of backgrounding because of the need for manual initialization. Without re-initialization, errors in the background accumulate over time, making this method useful only in highly-supervised, short-term tracking applications without significant changes in the scene.
A standard method of adaptive backgrounding is averaging the images over time, creating a background approximation which is similar to the current static scene except where motion occurs. While this is effective in situations where objects move continuously and the background is visible a significant portion of the time, it is not robust to scenes with many moving objects, particularly if they move slowly. It also cannot handle bimodal backgrounds, recovers slowly when the background is uncovered, and has a single, predetermined threshold for the entire scene.
Changes in scene lighting can cause problems for many backgrounding methods. Ridder et al. [5] modeled each pixel with a Kalman filter, which made their system more robust to lighting changes in the scene. While this method does have a pixel-wise automatic threshold, it still recovers slowly and does not handle bimodal backgrounds well. Koller et al. [4] have successfully integrated this method in an automatic traffic monitoring application.
Pfinder [7] uses a multi-class statistical model for the tracked objects, but the background model is a single Gaussian per pixel. After an initialization period where the room is empty, the system reports good results. There have been no reports on the success of this tracker in outdoor scenes.
Friedman and Russell [2] have recently implemented a pixel-wise EM framework for detection of vehicles that bears the most similarity to our work. Their method attempts to explicitly classify the pixel values into three separate, predetermined distributions corresponding to the road color, the shadow color, and colors corresponding to vehicles. Their attempt to mediate the effect of shadows appears to be somewhat successful, but it is not clear what behavior their system would exhibit for pixels which did not contain these three distributions. For example, pixels may present a single background color or multiple background colors resulting from repetitive motions, shadows, or reflectances.
1.2 Our approach
Rather than explicitly modeling the values of all the pixels as one particular type of distribution, we simply model the values of a particular pixel as a mixture of Gaussians. Based on the persistence and the variance of each of the Gaussians of the mixture, we determine which Gaussians may correspond to background colors. Pixel values that do not fit the background distributions are considered foreground until there is a Gaussian that includes them with sufficient, consistent evidence supporting it.
Our system adapts to deal robustly with lighting changes, repetitive motions of scene elements, tracking through cluttered regions, slow-moving objects, and introducing or removing objects from the scene. Slowly moving objects take longer to be incorporated into the background, because their color has a larger variance than the background. Also, repetitive variations are learned, and a model for the background distribution is generally maintained even if it is temporarily replaced by another distribution, which leads to faster recovery when objects are removed.
Our backgrounding method contains two significant parameters: α, the learning constant, and T, the proportion of the data that should be accounted for by the background. Without needing to alter parameters, our system has been used in an indoor, human-computer interface application and, for the past 16 months, has been continuously monitoring outdoor scenes.
2 The method
If each pixel resulted from a particular surface under particular lighting, a single Gaussian would be sufficient to model the pixel value while accounting for acquisition noise. If only lighting changed over time, a single, adaptive Gaussian per pixel would be sufficient. In practice, multiple surfaces often appear in the view frustum of a particular pixel and the lighting conditions change. Thus, multiple, adaptive Gaussians are necessary. We use a mixture of adaptive Gaussians to approximate this process.

Figure 1: The execution of the program. (a) the current image, (b) an image composed of the means of the most probable Gaussians in the background model, (c) the foreground pixels, (d) the current image with tracking information superimposed. Note: while the shadows are foreground in this case, if the surface was covered by shadows a significant amount of the time, a Gaussian representing those pixel values may be significant enough to be considered background.
Each time the parameters of the Gaussians are updated, the Gaussians are evaluated using a simple heuristic to hypothesize which are most likely to be part of the “background process.” Pixel values that do not match one of the pixel’s “background” Gaussians are grouped using connected components. Finally, the connected components are tracked from frame to frame using a multiple hypothesis tracker. The process is illustrated in Figure 1.
2.1 Online mixture model
Figure 2: Images and scatter plots of the red and green values of a single pixel from the image over time, illustrating some of the difficulties involved in real environments. (a) shows two scatter plots from the same pixel taken 2 minutes apart; this would require two thresholds. (b) shows a bimodal distribution of pixel values resulting from specularities on the surface of water. (c) shows another bimodality resulting from monitor flicker.

We consider the values of a particular pixel over time as a “pixel process”. The “pixel process” is a time series of pixel values, e.g. scalars for grayvalues or vectors for color images. At any time, t, what is known about a particular pixel, \{x_0, y_0\}, is its history

\{X_1, \ldots, X_t\} = \{ I(x_0, y_0, i) : 1 \le i \le t \} \qquad (1)

where I is the image sequence. Some “pixel processes” are shown by the (R,G) scatter plots in Figure 2(a)-(c), which illustrate the need for adaptive systems with automatic thresholds. Figure 2(b) and (c) also highlight a need for a multi-modal representation.
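For concreteness, the history of Eq. (1) could be gathered as follows. This is a minimal sketch; the frame layout and indexing convention are our assumptions, not from the paper:

```python
import numpy as np

def pixel_process(frames, x0, y0):
    """Return the history {X_1, ..., X_t} of pixel (x0, y0) per Eq. (1).

    frames -- iterable of H x W x 3 images I(., ., i), i = 1, ..., t
    """
    return np.array([frame[y0, x0] for frame in frames])
```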
The value of each pixel represents a measurement of the radiance in the direction of the sensor of the first object intersected by the pixel’s optical ray. With a static background and static lighting, that value would be relatively constant. If we assume that independent, Gaussian noise is incurred in the sampling process, its density could be described by a single Gaussian distribution centered at the mean pixel value. Unfortunately, the most interesting video sequences involve lighting changes, scene changes, and moving objects. If lighting changes occurred in a static scene, it would be necessary for the Gaussian to track those changes. If a static object was added to the scene and was not incorporated into the background until it had been there longer than the previous object, the corresponding pixels could be considered foreground for arbitrarily long periods. This would lead to accumulated errors in the foreground estimation, resulting in poor tracking behavior. These factors suggest that more recent observations may be more important in determining the Gaussian parameter estimates.
An additional aspect of variation occurs if moving objects are present in the scene. Even a relatively consistently colored moving object is generally expected to produce more variance than a “static” object. Also, in general, there should be more data supporting the background distributions because they are repeated, whereas pixel values for different objects are often not the same color.
These are the guiding factors in our choice of model and update procedure. The recent history of each pixel, \{X_1, \ldots, X_t\}, is modeled by a mixture of K Gaussian distributions. The probability of observing the current pixel value is

P(X_t) = \sum_{i=1}^{K} \omega_{i,t} \, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \qquad (2)

where K is the number of distributions, \omega_{i,t} is an estimate of the weight (what portion of the data is accounted for by this Gaussian) of the i-th Gaussian in the mixture at time t, \mu_{i,t} is the mean value of the i-th Gaussian in the mixture at time t, \Sigma_{i,t} is the covariance matrix of the i-th Gaussian in the mixture at time t, and \eta is a Gaussian probability density function

\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2} (X_t - \mu_t)^T \Sigma^{-1} (X_t - \mu_t)} \qquad (3)
K is determined by the available memory and computational power. Currently, from 3 to 5 are used. Also, for computational reasons, the covariance matrix is assumed to be of the form

\Sigma_{k,t} = \sigma_k^2 I \qquad (4)

This assumes that the red, green, and blue pixel values are independent and have the same variances. While this is certainly not the case, the assumption allows us to avoid a costly matrix inversion at the expense of some accuracy.
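Under this assumption, Eqs. (2)-(4) reduce to a handful of array operations. The following is a minimal sketch; function and variable names are ours, not from the paper:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Eq. (3) specialized to Eq. (4), Sigma = var * I, so that
    |Sigma|^(1/2) = var^(n/2) and Sigma^(-1) = I / var."""
    n = x.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (n / 2.0) * var ** (n / 2.0)
    return float(np.exp(-0.5 * (diff @ diff) / var) / norm)

def mixture_probability(x, weights, means, variances):
    """Eq. (2): P(X_t) = sum_i omega_{i,t} * eta(X_t, mu_{i,t}, Sigma_{i,t})."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))
```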
Thus, the distribution of recently observed values of each pixel in the scene is characterized by a mixture of Gaussians. A new pixel value will, in general, be represented by one of the major components of the mixture model and used to update the model.
If the pixel process could be considered a stationary process, a standard method for maximizing the likelihood of the observed data is expectation maximization [1]. Unfortunately, each pixel process varies over time as the state of the world changes, so we use an approximate method which essentially treats each new observation as a sample set of size 1 and uses standard learning rules to integrate the new data.
Because there is a mixture model for every pixel in the image, implementing an exact EM algorithm on a window of recent data would be costly. Instead, we implement an on-line K-means approximation. Every new pixel value, X_t, is checked against the existing K Gaussian distributions until a match is found. A match is defined as a pixel value within 2.5 standard deviations of a distribution.¹ This threshold can be perturbed with little effect on performance. This is effectively a per pixel/per distribution threshold. This is extremely useful when different regions have different lighting (see Figure 2(a)), because objects which appear in shaded regions do not generally exhibit as much noise as objects in lighted regions. A uniform threshold often results in objects disappearing when they enter shaded regions.
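In code, the match test reduces to a per-distribution distance check. A sketch under the same diagonal-covariance assumption as above (the function name and argument layout are ours):

```python
import numpy as np

def find_match(x, means, variances, threshold=2.5):
    """Return the index of the first Gaussian whose mean lies within
    `threshold` standard deviations of x, or None if no Gaussian matches."""
    for k, (mu, var) in enumerate(zip(means, variances)):
        # The test is scaled by each distribution's own sigma, giving the
        # per pixel/per distribution threshold described above.
        if np.linalg.norm(x - mu) < threshold * np.sqrt(var):
            return k
    return None
```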
If none of the K distributions match the current pixel value, the least probable distribution is replaced with a distribution with the current value as its mean value, an initially high variance, and low prior weight.
The prior weights of the K distributions at time t, \omega_{k,t}, are adjusted as follows:

\omega_{k,t} = (1 - \alpha)\, \omega_{k,t-1} + \alpha\, M_{k,t} \qquad (5)

where α is the learning rate² and M_{k,t} is 1 for the model which matched and 0 for the remaining models. After this approximation, the weights are renormalized. 1/α defines the time constant which determines the speed at which the distribution’s parameters change. \omega_{k,t} is effectively a causal low-pass filtered average of the (thresholded) posterior probability that pixel values have matched model k given observations from time 1 through t. This is equivalent to the expectation of this value with an exponential window on the past values.
The μ and σ parameters for unmatched distributions remain the same. The parameters of the distribution which matches the new observation are updated as follows:

\mu_t = (1 - \rho)\, \mu_{t-1} + \rho\, X_t \qquad (6)

\sigma_t^2 = (1 - \rho)\, \sigma_{t-1}^2 + \rho\, (X_t - \mu_t)^T (X_t - \mu_t) \qquad (7)

where

\rho = \alpha\, \eta(X_t \mid \mu_k, \sigma_k) \qquad (8)

¹ Depending on the kurtosis of the noise, some percentage of the data points “generated” by a Gaussian will not “match”. The resulting random noise is easily ignored by neglecting connected components containing only 1 or 2 pixels.

² While this rule is easily interpreted as an interpolation between two points, it is often shown in the equivalent form \omega_{k,t} = \omega_{k,t-1} + \alpha (M_{k,t} - \omega_{k,t-1}), which is effectively the same type of causal low-pass filter as mentioned above, except that only the data which matches the model is included in the estimation.
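Putting Eqs. (5)-(8) together with the replacement rule above, one update of a single pixel's mixture might look as follows. This is a sketch reusing `gaussian_pdf` and `find_match` from the earlier sketches; the initialization constants and the handling of the weights in the no-match case are our assumptions:

```python
import numpy as np

INIT_VARIANCE = 900.0  # "initially high variance" -- constant is our assumption
INIT_WEIGHT = 0.05     # "low prior weight" -- constant is our assumption

def update_pixel(x, weights, means, variances, alpha=0.01):
    """One on-line update of a single pixel's K-Gaussian mixture.

    weights   -- (K,) array of mixture weights omega_{k,t}
    means     -- (K, n) array of means mu_{k,t}
    variances -- (K,) array of sigma_k^2 (Eq. (4))
    """
    k = find_match(x, means, variances)
    if k is None:
        # No match: replace the least probable distribution with one
        # centered on the current value.
        k = int(np.argmin(weights))
        means[k] = x
        variances[k] = INIT_VARIANCE
        weights[k] = INIT_WEIGHT
    else:
        rho = alpha * gaussian_pdf(x, means[k], variances[k])        # Eq. (8)
        means[k] = (1 - rho) * means[k] + rho * x                    # Eq. (6)
        diff = x - means[k]
        variances[k] = (1 - rho) * variances[k] + rho * (diff @ diff)  # Eq. (7)
        weights *= (1 - alpha)                                       # Eq. (5), M = 0
        weights[k] += alpha                                          # Eq. (5), M = 1
    weights /= weights.sum()  # renormalize after the approximation
```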
One of the significant advantages of this method is that when something is allowed to become part of the background, it doesn’t destroy the existing model of the background. The original background color remains in the mixture until it becomes the K-th most probable and a new color is observed. Therefore, if an object is stationary just long enough to become part of the background and then it moves, the distribution describing the previous background still exists with the same μ and σ², but a lower ω, and will be quickly re-incorporated into the background.
2.2 Background Model Estimation
As the parameters of the mixture model of each pixel change, we would like to determine which of the Gaussians of the mixture are most likely produced by background processes. Heuristically, we are interested in the Gaussian distributions which have the most supporting evidence and the least variance.
To understand this choice, consider the accumulation of supporting evidence and the relatively low variance for the “background” distributions when a static, persistent object is visible. In contrast, when a new object occludes the background object, it will not, in general, match one of the existing distributions, which will result in either the creation of a new distribution or the increase in the variance of an existing distribution. Also, the variance of the moving object is expected to remain larger than a background pixel until the moving object stops. To model this, we need a method for deciding what portion of the mixture model best represents background processes.
First, the Gaussians are ordered by the value of ω/σ. This value increases both as a distribution gains more evidence and as the variance decreases. After re-estimating the parameters of the mixture, it is sufficient to sort from the matched distribution towards the most probable background distribution, because only the matched model’s relative value will have changed. This ordering of the model is effectively an ordered, open-ended list, where the most likely background distributions remain on top and the less probable transient background distributions gravitate towards the bottom and are eventually replaced by new distributions.
Then the first B distributions are chosen as the background model, where

B = \operatorname{argmin}_b \left( \sum_{k=1}^{b} \omega_k > T \right) \qquad (9)

where T is a measure of the minimum portion of the data that should be accounted for by the background. This takes the “best” distributions until a certain portion, T, of the recent data has been accounted for. If a small value for T is chosen, the background model is usually unimodal. If this is the case, using only the most probable distribution will save processing.
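Operationally, Eq. (9) is a descending sort by ω/σ followed by a cumulative-weight cutoff. A sketch (the value of T shown is a placeholder, not a setting from the paper):

```python
import numpy as np

def background_indices(weights, variances, T=0.7):
    """Return the indices of the first B distributions of Eq. (9),
    assuming `weights` have already been renormalized to sum to 1."""
    order = np.argsort(-weights / np.sqrt(variances))  # descending omega/sigma
    cumulative = np.cumsum(weights[order])
    # Smallest b such that the cumulative weight exceeds T.
    B = int(np.searchsorted(cumulative, T, side="right")) + 1
    return order[:B]
```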
If T is higher, a multi-modal distribution caused by a repetitive background motion (e.g., leaves on a tree, a flag in the wind, a construction flasher, etc.) could result in more than one color being included in the background model. This results in a transparency effect which allows the background to accept two or more separate colors.
2.3 Connected components
The method described above allows us to identify foreground pixels in each new frame while updating the description of each pixel’s process. These labeled foreground pixels can then be segmented into regions by a two-pass, connected components algorithm [3].
Because this procedure is effective in determining the whole moving object, moving regions can be characterized not only by their position, but by size, moments, and other shape information. Not only can these characteristics be useful for later processing and classification, but they can aid in the tracking process.
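As an illustration, the labeling and per-region statistics can be sketched with an off-the-shelf labeling routine (our substitution for the two-pass algorithm of [3]; the statistics computed are examples):

```python
import numpy as np
from scipy import ndimage

def extract_regions(foreground_mask, min_pixels=3):
    """Group foreground pixels into connected regions and return simple
    shape statistics; 1-2 pixel components are dropped as noise (cf.
    footnote 1)."""
    labels, count = ndimage.label(foreground_mask)
    regions = []
    for i in range(1, count + 1):
        ys, xs = np.nonzero(labels == i)
        if ys.size < min_pixels:
            continue
        regions.append({
            "size": int(ys.size),
            "centroid": np.array([xs.mean(), ys.mean()]),
            "bbox": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
        })
    return regions
```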
2.4 Multiple Hypothesis Tracking
While this section is not essential to the understanding of the background subtraction method, it will allow one to better understand and evaluate the results in the following sections.
Establishing correspondence of connected components between frames is accomplished using a linearly predictive multiple hypotheses tracking algorithm which incorporates both position and size. We have implemented an online method for seeding and maintaining sets of Kalman filters.
At each frame, we have an available pool of Kalman models and a new available pool of connected components that they could explain. First, the models are probabilistically matched to the connected regions that they could explain. Second, the connected regions which could not be sufficiently explained are checked to find new Kalman models. Finally, models whose fitness (as determined by the inverse of the variance of its prediction error) falls below a threshold are removed.
Matching the models to the connected components involves checking each existing model against the available pool of connected components which are larger than a pixel or two. All matches are used to update the corresponding model. If the updated model has sufficient fitness, it will be used in the following frame. If no match is found, a “null” match can be hypothesized which propagates the model as expected and decreases its fitness by a constant factor.
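A highly simplified sketch of this matching loop is shown below. It substitutes a constant-velocity prediction for the paper's Kalman filters, and the gate, fitness increments, decay factor, and removal threshold are all our assumptions rather than values from the paper:

```python
import numpy as np

FITNESS_DECAY = 0.5  # penalty factor for a "null" match -- our choice
MIN_FITNESS = 0.1    # models below this fitness are removed -- our choice

def step_tracker(models, regions, gate=25.0):
    """Match models to connected components (dicts with an ndarray
    'centroid', e.g. from extract_regions above), propagate unmatched
    models with decayed fitness, and prune unfit models."""
    survivors = []
    unmatched = list(regions)
    for m in models:  # m: {"pos": ndarray, "vel": ndarray, "fitness": float}
        pred = m["pos"] + m["vel"]  # linear prediction of position
        best = min(unmatched,
                   key=lambda r: np.linalg.norm(r["centroid"] - pred),
                   default=None)
        if best is not None and np.linalg.norm(best["centroid"] - pred) < gate:
            m["vel"] = best["centroid"] - m["pos"]
            m["pos"] = best["centroid"]
            m["fitness"] = min(1.0, m["fitness"] + 0.1)  # crude fitness update
            unmatched.remove(best)
        else:
            m["pos"] = pred                # "null" match: propagate as expected
            m["fitness"] *= FITNESS_DECAY  # and decrease fitness by a constant
        if m["fitness"] > MIN_FITNESS:
            survivors.append(m)
    return survivors, unmatched  # unmatched regions can seed new models
```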
The unmatched models from the current frame and the previous two frames are then used to hypothesize new models. Using pairs of unmatched connected components from the previous two frames, a model is hypothesized. If the current frame contains a match with sufficient fitness, the updated model is added to the existing models. To avoid possible combinatorial explosions in noisy situations, it may be desirable to limit the maximum number of existing models by removing the least probable models when excessive models exist. In noisy situations (e.g., cameras in low-light conditions), it is often useful to remove the short tracks that may result from random correspondences. Further details of this method can be found at www.ai.mit.edu/projects/vsam/.
3 Results
On an SGI O2 with an R10000 processor, this method can process 11 to 13 frames a second (frame size 160x120 pixels). The variation in the frame rate is due to the variation in the amount of foreground present. Our tracking system has been effectively storing tracking information for five scenes for over 16 months [6]. Figure 3 shows accumulated tracks in one scene over the period of a day.
While quick changes in cloud cover (relative to α, the learning rate) can sometimes necessitate a new set of background distributions, the system will stabilize within 10-20 seconds and tracking will continue unhindered.
Because of the stability and completeness of the representation, it is possible to do some simple classification. Figure 4 shows the classification of objects which appeared in a scene over a 10 minute period using a simple binary threshold on the time-averaged aspect ratio of the object. Tracks lasting less than a second were removed.
Every object which entered this scene (33 cars and 34 people in total) was tracked. The system successfully classified every car except in one case, where it classified two cars as the same object because one car entered the scene simultaneously with another car leaving at the same point. It found only one person in two cases where two people were walking in physical contact. It also double counted 2 objects because their tracks were not matched properly.
4 Applicability
When deciding on a tracker to implement,the most important information to a researcher is where the
