LEARNING SPATIO-TEMPORAL DEPENDENCIES FOR ACTION RECOGNITION
Qiao Cai, Yafeng Yin, Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology, Hoboken, NJ 07030, USA
ABSTRACT
In this paper, we propose a spatio-temporal dependencies learning (STDL) method for action recognition. Inspired by the self-organizing map, our method can learn implicit spatio-temporal dependencies from sequential action feature sets while preserving the intrinsic topologies characterized in human actions. A further advantage is its ability to project higher dimensional action features to a lower dimensional latent neural distribution, which significantly reduces the computational cost and data redundancy in the learning and recognition process. An ensemble learning strategy using expectation-maximization is adopted to estimate the latent parameters of the STDL model. The effectiveness and robustness of the proposed model are verified through extensive experiments on several benchmark datasets.
Index Terms— Spatio-temporal dependencies; self-organizing map; action recognition
1. INTRODUCTION
Human action recognition has attracted much attention in the fields of computer vision and machine learning in recent years [1]. Many previous works focused on either augmenting the feature descriptions, such as proposing stronger feature sets and combining different features, or improving action recognition models without human intervention.
As a typical classification problem, action recognition relies heavily on feature extraction. Due to the intrinsic sequential property of video, many spatio-temporal features, such as STIP [2] and HOSVD [3], have been developed for human action recognition. In [4], the authors modeled spatio-temporal context information in a hierarchical structure. A novel approach using very dense corner features is introduced in [5], where the features are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Besides augmenting the features, learning methods such as SVM and AdaBoost have been introduced to recognize human actions. A multi-class support vector machine (SVM) with linear kernels is adopted in [6]. In [7], the input video sequence is classified into one of the discrete action classes.
The remainder of this paper is organized as follows. In Section 2, we propose the dependencies learning model and present the detailed learning procedures. In Section 3, experimental results are presented and discussed. Finally, the conclusion is provided in Section 4.
2. SPATIO-TEMPORAL DEPENDENCIES LEARNING MODEL FOR ACTION RECOGNITION
It is a complex process to analyze the correlation and variation across space and time [8]. There are limitations on the estimation of traditional state-space models, since the high dimensional parameters may lead to complex dependency structures [9]. These dependency structures are often non-stationary or non-separable. STDL can achieve good spatio-temporal clustering results, and preserve the intrinsic topological structure pertaining to the spatio-temporal dependencies. The training procedure of STDL is illustrated in Fig. 1.
Fig. 1. Training procedure of the proposed dynamic model. (a) Optical flow is extracted from each action video sequence. Given two consecutive frames, optical flow is computed at each pixel and sampled with a 10×10 grid. For instance, the frame size of the KTH dataset is 160×120; after optical flow computation, the size of the optical flow field for each frame is 16×12×2. (b) STDL is used to extract spatio-temporal patterns. The colors of the grid represent the distances of various motions on STDL. (c) The ensemble learning based on EM is adopted to predict the action class by majority voting.
2.1. Self-organizing Map
SOM [10] is considered an effective neural network model for unsupervised learning, which can extract certain implicit knowledge without human intervention or empirical evidence. This characteristic of SOM is very useful in solving clustering problems [11]. Additionally, SOM can preserve the topological structure of the original dataset [12], while other clustering methods such as k-means can hardly do this. From the input feature space to the neuron map space, the non-linear mapping reduces the computational cost for searching for similar properties between two high dimensional data clusters. The competitive learning mechanism associates the input data instances with the corresponding optimal neurons.
Given the input data sequence X = {x_1, ..., x_n} and synaptic neuron weights m_j, j ∈ {1, ..., N_s}, the procedure of searching for the best-matching unit (BMU) can be expressed as (1). The total number of neurons on the map is N_s.
bmu(x_i) = arg min_j ||x_i − m_j||   (1)
The topological relations can be determined by a neighborhood kernel function. More specifically, the Gaussian neighborhood kernel function defined in (2) is used to constrain the neighborhood scope of the BMU.
h_{j,bmu(x_i)}(t) = exp( − d²_{j,bmu(x_i)} / (2σ²(t)) )   (2)
For the input data x_i, d_{j,bmu(x_i)} represents the Manhattan distance between its BMU and the synaptic neuron j on the 2-dimensional map. The parameter σ(t) specifies the radius of the neighboring scope, which monotonously decreases during cooperative learning.
An adaptive learning rule updates the synaptic neuron weight m_j according to (3), where α(t) represents the learning rate.
m_j(t+1) = m_j(t) + α(t) h_{j,bmu(x_i)}(t) (x(t) − m_j(t))   (3)
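For concreteness, the following is a minimal NumPy sketch of the conventional SOM step described by (1)-(3), assuming a square lattice and one feature vector per time sample; the grid size, decay schedules, and variable names are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def bmu_index(x, weights):
    """Eq. (1): index of the best-matching unit for input x."""
    # weights has shape (n_neurons, feat_dim)
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def gaussian_neighborhood(bmu, grid, sigma):
    """Eq. (2): Gaussian kernel over Manhattan distance on the 2-D lattice."""
    d = np.abs(grid - grid[bmu]).sum(axis=1)          # Manhattan distance to the BMU
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

def som_step(x, weights, grid, alpha, sigma):
    """Eq. (3): move every neuron toward x, scaled by the neighborhood kernel."""
    b = bmu_index(x, weights)
    h = gaussian_neighborhood(b, grid, sigma)          # shape (n_neurons,)
    weights += alpha * h[:, None] * (x - weights)
    return b

# Illustrative setup: a 5x5 lattice trained on 2-D optical-flow samples.
rows, cols, feat_dim = 5, 5, 2
rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
weights = rng.normal(size=(rows * cols, feat_dim))
for t, x in enumerate(rng.normal(size=(100, feat_dim))):
    som_step(x, weights, grid, alpha=0.5 * 0.99 ** t, sigma=2.0 * 0.99 ** t)
```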
2.2. Spatio-temporal dependencies learning model
STDL provides an efficient way to discover the hidden spatio-temporal dependency information. It can also reduce the feature dimension of the original video sequences. Similar to SOM, the procedure of searching for the BMU and the definition of the neighborhood function are specified in (1) and (2), respectively. The difference is the adaptive learning rule for the synaptic neuron weights.
In SOM, the neighborhood function can only be used to preserve the spatial topology. The temporal Kohonen map (TKM) and the recurrent self-organizing map (RSOM) [13] were proposed to adaptively model a data distribution over time on non-stationary input sequences. Although TKM preserves a trace of past activations in terms of a weighted sum, the weights are updated towards the last frame sample of the input sequence based on the conventional SOM update rule. RSOM provides a consistent update rule for the network parameters, but smoothes out temporary volatilities. STDL models sequential dynamics by introducing a Markov process to capture neuron transition probabilities between every two time samples. It is similar to a Markov random walk [14] on a graph, where at each step the walk jumps to another node based on a specified probability distribution. The parameters of the Markov process are used in the neuron update and model classification. In Fig. 2, we take the "run" action in the KTH dataset as an example to trace the Markov random walk. The trace records the spatial and temporal dynamics while learning knowledge from new data.
Fig. 2. The trace of the Markov random walk for the "run" action in the KTH dataset.
Suppose the video sequence x_i = {x_{i,1}, ..., x_{i,t}, ..., x_{i,T}}, where x_{i,t} is the input data x_i at time t. In STDL, we have N_s neurons on the lattice map. In (4), p_{i,j} is the transition probability from BMU i at time t to BMU j at time t+1.
p_{i,j} = K(i,j) / Σ_{m∈N_s} K(i,m)   (4)
The transition probabilities constrain the variations of the neuron weights. There are two parts contributing to the synaptic weights, and the synaptic neuron weights are updated according to (5). The first part calculates the estimation of the same target neuron weight at two adjacent timestamps. The second part emphasizes the neuron weight at the previous time. The state dynamics keeps a balance between these two parts. This formulation means the elastic characteristics of STDL have effects on both the spatial domain and the temporal domain.
m_j(t+1) = m_j(t) + α(t) h_{j,bmu(x(t))}(t) ( x(t) − p_{bmu(x(t)),bmu(x(t+1))} m*_j(t+1) − (1 − p_{bmu(x(t)),bmu(x(t+1))}) m_j(t) )   (5)
m*_j(t+1) = (1 − p_{bmu(x(t)),bmu(x(t+1))}) m_j(t) + p_{bmu(x(t)),bmu(x(t+1))} m_j(t+1)   (6)
Here m*_j(t+1) helps the target neuron adaptively learn temporal knowledge from the Markov model, and the transition probability teaches the neuron how to balance the weight update in STDL.
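As a rough illustration of (4)-(6), the sketch below forms transition probabilities from a kernel over lattice distances and applies the probability-weighted weight update. Two points are assumptions made for the example rather than the paper's exact definition: K(i, j) is taken to be the same Gaussian lattice kernel as in (2), and the mutual coupling between (5) and (6) is resolved with a provisional SOM estimate of m_j(t+1).

```python
import numpy as np

def stdl_step(x_t, bmu_t, bmu_next, weights, grid, alpha, sigma):
    """One STDL weight update in the spirit of Eqs. (4)-(6) (a sketch, not the exact scheme)."""
    # Gaussian kernel over Manhattan lattice distance to BMU(x(t)), as in Eq. (2).
    d = np.abs(grid - grid[bmu_t]).sum(axis=1)
    h = np.exp(-(d ** 2) / (2.0 * sigma ** 2))

    # Eq. (4): transition probability from BMU(x(t)) to BMU(x(t+1)),
    # assuming K(i, j) is the Gaussian lattice kernel above.
    p = h[bmu_next] / h.sum()

    m_t = weights
    m_provisional = m_t + alpha * h[:, None] * (x_t - m_t)   # plain SOM estimate of m_j(t+1)
    m_star = (1.0 - p) * m_t + p * m_provisional             # Eq. (6)
    # Eq. (5): the input is corrected by the probability-weighted blend of the
    # temporal estimate m* and the previous weight before the neighborhood update.
    return m_t + alpha * h[:, None] * (x_t - p * m_star - (1.0 - p) * m_t)
```

In a full training pass (Algorithm 1 below), this step would be applied at every timestamp after locating the BMUs of two consecutive samples.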
Algorithm 1 STDL
Input: Video feature sequences X = {x_1, ..., x_S}, where x_i is a sequence vector {x_i(1), ..., x_i(T)}; total neuron number N; initial neuron weights m_j(t_0), j = 1, ..., N; initial learning rate α(t_0); counter for each neuron cnt(j)
X ← (X − min(X)) / (max(X) − min(X)), X ∈ [0,1]
for i = 1 to S do
  for t = 1 to T do
    Search BMU b(x_i(t)), d_{i,t} ← b(x_i(t))
    cnt(b(x_i(t))) ← cnt(b(x_i(t))) + 1
    Calculate p_{b(x_i(t)), b(x_i(t+1))}
    Update m_j(t+1)
  end for
end for
Output: Label sequences D = {d_{i,1}, ..., d_{i,T}}
An adaptive merging strategy for clustering optimization (AMSCO) is introduced. The advantage of AMSCO is to avoid the local optima caused by STDL, although STDL is an effective clustering technique for complex high-dimensional data. The spatial and temporal topology knowledge is used in the adaptive merging strategy, and the Manhattan distance is the key metric to evaluate spatio-temporal relationships. min(CNT), the smallest non-zero counter value, indicates the clusters in the sparse feature space. The purpose of AMSCO is to analyze these sparse clusters and then merge them into potential clusters associated with spatio-temporal dependencies. The adaptive clustering method creates a robust latent space on which the dynamic modeling framework is established.
By using AMSCO on the Weizmann dataset, the average cluster number over all classes ranges from 4 to 20. We use the Davies-Bouldin index (DBI) in (7) to evaluate clustering performance. As shown in Fig. 3, k-means and fuzzy c-means perform similarly, and these two partitional clustering methods achieve lower performance than STDL, whose DBI value is below 3.0. STDL with AMSCO obtains even better clustering results than STDL alone, and the variation of DBI in STDL-AMSCO appears smoother.
DBI = (1/N) Σ_{i=1}^{N} max_{j≠i} ( (S_i + S_j) / dist(c_i, c_j) )   (7)
where N is the total number of clusters, c_i is the centroid of cluster i, dist(c_i, c_j) is the inter-cluster distance between centroids c_i and c_j, and S_i is the average intra-cluster distance within cluster i.
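For reference, a small sketch of the DBI computation in (7), assuming Euclidean distances between centroids and mean member-to-centroid distances for S_i; the function and variable names are illustrative.

```python
import numpy as np

def davies_bouldin_index(features, labels):
    """Eq. (7): mean over clusters of the worst-case (S_i + S_j) / dist(c_i, c_j)."""
    clusters = np.unique(labels)
    centroids = np.array([features[labels == c].mean(axis=0) for c in clusters])
    # S_i: average distance of cluster members to their centroid.
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(clusters)
    ])
    n = len(clusters)
    ratios = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        dists = np.linalg.norm(centroids[others] - centroids[i], axis=1)
        ratios[i] = np.max((scatter[i] + scatter[others]) / dists)
    return ratios.mean()       # lower is better
```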
Algorithm 2 AMSCO
Input: Label sequences D = {d_{i,1}, ..., d_{i,T}}; cluster number before merging N_c; counter vector CNT
Calculate the expected value of the counter vector, E(CNT)
repeat
  Select the cluster j with cnt(j) = min(CNT)
  if cnt(j) < E(CNT) then
    for k = 1 to N_c do
      if MhtDist(j,k) = min(MhtDist) and j ≠ k then
        Add k into the index vector
      end if
    end for
    if size(index vector) > 1 and cnt(k) = max(cnt(index vector)) then
      Merge cluster j into k
      cnt(k) ← cnt(k) + cnt(j), cnt(j) ← 0
    end if
  end if
until min(CNT) > E(CNT)
if d_{i,t} = j then
  d*_{i,t} ← k
end if
Output: Label sequences after merging D* = {d*_{i,1}, ..., d*_{i,T}}
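The following is a minimal sketch of the AMSCO merging loop under one reading of Algorithm 2: the sparsest non-empty cluster is merged into its nearest (Manhattan-distance) neighbor on the lattice, ties are broken by the largest counter, and the labels are remapped accordingly; the function name, the definition of E(CNT) over non-empty clusters, and the tie-breaking details are assumptions made for illustration.

```python
import numpy as np

def amsco_merge(labels, grid, n_clusters):
    """Merge sparse clusters into nearby dense ones (a sketch of Algorithm 2).

    labels: integer cluster/neuron indices for every sample (flattened label sequences).
    grid:   2-D lattice coordinates of each cluster, shape (n_clusters, 2).
    """
    labels = labels.copy()
    cnt = np.bincount(labels, minlength=n_clusters).astype(float)

    while True:
        nonzero = np.flatnonzero(cnt > 0)
        expected = cnt[nonzero].mean()                 # E(CNT) over non-empty clusters (assumed)
        j = nonzero[np.argmin(cnt[nonzero])]           # sparsest non-empty cluster
        if cnt[j] >= expected or len(nonzero) < 2:
            break                                      # until min(CNT) > E(CNT)
        # Candidate targets: non-empty clusters at minimal Manhattan distance from j.
        cand = nonzero[nonzero != j]
        dist = np.abs(grid[cand] - grid[j]).sum(axis=1)
        nearest = cand[dist == dist.min()]
        k = nearest[np.argmax(cnt[nearest])]           # tie-break by largest counter
        labels[labels == j] = k                        # relabel d_{i,t} = j to k
        cnt[k] += cnt[j]
        cnt[j] = 0.0
    return labels
```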
Fig. 3. Cluster analysis on the Weizmann dataset.
We assume the input data X_t = {X(x_i; t)}, i = 1, ..., S, where S is the number of spatial data attributes at time t. The covariance matrix of the zero-mean Gaussian noise is Δ_t, and Θ_t describes the state transition over time t. We collect the dynamic model parameters as Φ = {Θ_t, Δ_t}. The primary goal of this model is to estimate the modeling parameters through expectation-maximization (EM). The likelihood of the input data sequences can be estimated as (8).
P(D*|Φ) = Π_{i=1}^{n} P(D*_i|Φ)   (8)
We can predict the class label based on (9).
y = arg max_{s_i∈S} Π_{j=1}^{n} P(D*|Φ_j, s_j) P(s_j|Φ_j) P(Φ_j)   (9)
where Φ_j represents one of the alternative models and S is the set of all class labels.
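As a rough illustration of the decision rule in (9), the sketch below scores each candidate class with the log-likelihood of the merged label sequences under that class's learned model plus a log-prior, and predicts the arg max. The per-class log_likelihood interface and the explicit prior dictionary are assumptions made for the example, not the paper's exact EM formulation.

```python
import math

def predict_class(label_sequences, class_models, class_priors):
    """Pick the class whose model best explains the label sequences (cf. Eq. (9)).

    class_models: dict class_name -> model exposing log_likelihood(sequence) (assumed interface)
    class_priors: dict class_name -> prior probability of that class model
    """
    best_class, best_score = None, -math.inf
    for name, model in class_models.items():
        # Eq. (8): sequences are treated as independent, so log-likelihoods add up.
        score = sum(model.log_likelihood(seq) for seq in label_sequences)
        score += math.log(class_priors[name])
        if score > best_score:
            best_class, best_score = name, score
    return best_class
```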
3. EXPERIMENTS
To analyze the effects of periodic and non-periodic actions, we compute optical flow for feature extraction [15]. Optical flow approximates local image motion based on local derivatives in a video sequence, and it essentially reflects the spatio-temporal variability between two consecutive frames.
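For reference, a minimal OpenCV sketch of the feature extraction described in Fig. 1: dense optical flow between consecutive frames, subsampled on a 10×10 pixel grid, so a 160×120 KTH frame yields a 16×12×2 flow field. The Farnebäck method and the example file name are assumptions; the paper only specifies that optical flow [15] is used.

```python
import cv2
import numpy as np

def grid_flow_features(video_path, grid_step=10):
    """Return one grid-sampled optical-flow field per consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow at every pixel (Farneback is an assumed choice of method).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        features.append(flow[::grid_step, ::grid_step, :])   # 10x10 grid sampling
        prev_gray = gray
    cap.release()
    return np.stack(features)   # e.g. shape (T-1, 12, 16, 2) for 160x120 KTH frames

# Hypothetical usage on a KTH clip:
# seq = grid_flow_features("person01_running_d1_uncompressed.avi")
```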
3.1. Performance
A further analysis of spatio-temporal clustering is given in Table 1, where N_c and N*_c represent the average cluster numbers before and after merging, respectively. N_c increases promptly when the map size becomes larger, since SOM depends on the input feature space. STDL also inherits this sensitivity to data diversity, and AMSCO helps STDL overcome this problem. The spatio-temporal dependency is also considered during the merging procedure, and the original topological structure is preserved.
Table 1. Comparison of cluster numbers
Dataset    Map size   N_c   N*_c
KTH        2×2        4     2
KTH        3×3        5     3
KTH        4×4        8     3
KTH        5×5        12    5
Weizmann   2×2        4     2
Weizmann   3×3        6     3
Weizmann   4×4        10    6
Weizmann   5×5        20    12
UCF        2×2        4     2
UCF        3×3        7     3
UCF        4×4        12    7
UCF        5×5        16    9
To better visualize the spatio-temporal dependency, we take "jump" and "skip" as an example in Fig. 4. The "jump" and "skip" actions have similar distance errors in the STDL model, but the error variation of the "skip" action is smaller than that of the "jump" action. This verifies that the STDL model can learn the spatio-temporal dependencies between two similar actions.
Fig. 4. Two similar actions in the STDL model.
To verify the recognition capability of the proposed method, Table 2 shows the recognition results of comparable approaches on the KTH, Weizmann and UCF datasets, respectively. On the KTH dataset, Wu et al. [16] and Kovashka et al. [17] achieved the best performance with 94.5%, while our method achieves 94.2% on average. On the Weizmann dataset, Fathi [7] achieved 100%, Jhuang [18] achieved 98.8%, and our method achieved 98.2%. On the more challenging UCF dataset, Kovashka et al. [17] and Wu et al. [16] achieved 87.3% and 91.3%, respectively, and our method performs better with 91.6%. The performance of our method is comparable with these state-of-the-art methods on action datasets. In particular, for more complex data such as the UCF sports dataset, our method can effectively improve the recognition performance. More importantly, our method adaptively learns from low level features, such as optical flow, rather than relying on strong features, which improves model robustness and requires less human intervention.
Table 2. Average accuracy on benchmark datasets
Method                  KTH     Weizmann   UCF
Fathi et al. [7]        90.5%   100%       -
Jhuang et al. [18]      91.7%   98.8%      -
Laptev et al. [2]       91.8%   -          -
Campos et al. [19]      91.5%   96.7%      80.0%
Wang et al. [20]        89.0%   97.8%      83.3%
Wu et al. [16]          94.5%   -          91.3%
Kovashka et al. [17]    94.5%   -          87.3%
Our method              94.2%   98.2%      91.6%
4. CONCLUSION
In this paper, the STDL model is introduced to effectively learn spatio-temporal dependencies for action recognition. The implicit spatio-temporal knowledge is extracted by learning low level features from sequential action data. The learning process preserves the intrinsic topologies and reduces the dimension in the spatio-temporal domain, which significantly improves the computational efficiency of the model. We also analyze the overall performance of our model on several real-world datasets.
5. REFERENCES
[1] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1473–1488, 2008.
[2] I. Laptev, "On space-time interest points," Intl. Journal of Computer Vision, vol. 64, pp. 107–123, 2005.
[3] Y. Lui, J. Beveridge, and M. Kirby, "Action classification on product manifolds," CVPR, 2010.
[4] J. Sun, X. Wu, S. Yan, L. Cheong, T. Chua, and J. Li, "Hierarchical spatio-temporal context modeling for action recognition," CVPR, 2009.
[5] A. Gilbert, J. Illingworth, and R. Bowden, "Fast realistic multi-action recognition using mined dense spatio-temporal features," ICCV, 2009.
[6] Y. Zhu, X. Zhao, Y. Fu, and Y. Liu, "Sparse coding on local spatial-temporal volumes for human action recognition," ACCV, pp. 660–671, 2010.
[7] A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," CVPR, 2008.
[8] S. Zhao, D. Tuninetti, R. Ansari, and D. Schonfeld, "Multiple description coding over multiple correlated erasure channels," Trans. Emerging Tel. Tech., vol. 23, pp. 522–536, 2012.
[9] Q. Cai, Y. Yin, and H. Man, "DSPM: Dynamic structure preserving map for action recognition," ICME, 2013.
[10] T. Kohonen, "Self-organizing maps," Springer, 2001.
[11] Q. Cai, H. He, and H. Man, "Spatial outlier detection based on iterative self-organizing learning model," Neurocomputing, 2013.
[12] Q. Cai, H. He, and H. Man, "SOMSO: A self-organizing map approach for spatial outlier detection with multiple attributes," Proc. Int. Neural Networks, pp. 425–431, 2009.
[13] M. Varsta, J. Heikkonen, J. Lampinen, and J. Millan, "Temporal Kohonen map and recurrent self-organizing map: analytical and experimental comparison," Neural Processing Letters, vol. 13, pp. 237–251, 2001.
[14] M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks," NIPS, 2001.
[15] D. Fleet and Y. Weiss, "Optical flow estimation," Handbook of Mathematical Models in Computer Vision, Springer, pp. 239–258, 2005.
[16] X. Wu, D. Xu, L. Duan, and J. Luo, "Action recognition using context and appearance distribution features," CVPR, 2011.
[17] A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," CVPR, 2010.
[18] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," ICCV, 2007.
[19] T. Campos, M. Barnard, K. Mikolajczyk, J. Kittler, F. Yan, W. Christmas, and D. Windridge, "An evaluation of bags-of-words and spatio-temporal shapes for action recognition," IEEE Workshop Applications of Computer Vision, pp. 344–351, 2011.
[20] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," BMVC, 2009.
