LEARNING SPATIO-TEMPORAL DEPENDENCIES FOR ACTION RECOGNITION
Qiao Cai, Yafeng Yin, Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology, Hoboken, NJ 07030, USA
ABSTRACT
In this paper, we propose a spatio-temporal dependencies learning (STDL) method for action recognition. Inspired by the self-organizing map, our method can learn implicit spatio-temporal dependencies from sequential action feature sets while preserving the intrinsic topologies characterized in human actions. A further advantage is its ability to project higher dimensional action features to a lower dimensional latent neural distribution, which significantly reduces the computational cost and data redundancy in the learning and recognition process. An ensemble learning strategy using expectation-maximization is adopted to estimate the latent parameters of the STDL model. The effectiveness and robustness of the proposed model are verified through extensive experiments on several benchmark datasets.
Index Terms— Spatio-temporal dependencies; self-organizing map; action recognition
1. INTRODUCTION
Human action recognition has attracted much attention in the fields of computer vision and machine learning in recent years [1]. Many previous works focused on either augmenting the feature descriptions, such as proposing stronger feature sets and combining different features, or improving action recognition models without human intervention.
As a typical classification problem, action recognition relies heavily on feature extraction. Due to the intrinsic sequential property of video, many spatio-temporal features, such as STIP [2] and HOSVD [3], have been developed for human action recognition. In [4], the authors modeled spatio-temporal context information in a hierarchical structure. A novel approach using very dense corner features is introduced in [5], where the features are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Besides augmenting the features, learning methods such as SVM and AdaBoost have been introduced to recognize human actions. A multi-class support vector machine (SVM) with linear kernels is adopted in [6]. In [7], the input video sequence is classified into one of the discrete action classes.
The remainder of this paper is organized as follows. In Section 2, we propose the dependencies learning model and present the detailed learning procedures. In Section 3, experimental results are presented and discussed. Finally, the conclusion is provided in Section 4.
2. SPATIO-TEMPORAL DEPENDENCIES LEARNING MODEL FOR ACTION RECOGNITION
It is a complex process to analyze the correlation and variation across space and time [8]. There are limitations on the estimation of traditional state-space models, since the high dimensional parameters may lead to complex dependency structures [9]. These dependency structures are often non-stationary or non-separable. STDL can achieve good spatio-temporal clustering results, and preserve the intrinsic topological structure pertaining to the spatio-temporal dependencies. The training procedure of STDL is illustrated in Fig. 1.
Fig. 1. Training procedure of the proposed dynamic model. (a) Optical flow is extracted from each action video sequence. Given two consecutive frames, optical flow is computed at each pixel and sampled with a 10×10 grid. For instance, the frame size of the KTH dataset is 160×120; after optical flow computation, the size of the optical flow field for each frame is 16×12×2. (b) STDL is used to extract spatio-temporal patterns. The colors of the grid represent the distances of various motions on STDL. (c) The ensemble learning based on EM is adopted to predict the action class by majority voting.
2.1. Self-organizing Map
SOM [10] is considered an effective neural network model for unsupervised learning, which can extract certain implicit knowledge without human intervention or empirical evidence. This characteristic of SOM is very useful in solving clustering problems [11]. Additionally, SOM can preserve the topological structure of the original dataset [12], while other clustering methods such as k-means can hardly do this. From the input feature space to the neuron map space, the non-linear mapping reduces the computational cost for searching for similar properties between two high dimensional data clusters. The competitive learning mechanism associates the input data instances with the corresponding optimal neurons.
Given the input data sequence X = {x_1, ..., x_n} and synaptic neuron weights m_j, j ∈ {1, ..., N_s}, the procedure of searching for the best-matching unit (BMU) can be expressed as (1). The total number of neurons on the map is N_s.
bmu(x_i) = arg min_j ||x_i − m_j||   (1)
The topological relations can be determined by a neighborhood kernel function. More specifically, the Gaussian neighborhood kernel function defined in (2) is used to constrain the neighborhood scope of the BMU.
h_{j,bmu(x_i)}(t) = exp( − d²_{j,bmu(x_i)} / (2σ²(t)) )   (2)
For the input data x_i, d_{j,bmu(x_i)} represents the Manhattan distance between its BMU and the synaptic neuron j on the 2-dimensional map. The parameter σ(t) specifies the radius of the neighboring scope, which monotonously decreases during cooperative learning.
An adaptive learning rule updates the synaptic neuron weight m_j according to (3), where α(t) represents the learning rate.
m_j(t+1) = m_j(t) + α(t) h_{j,bmu(x_i)}(t) (x(t) − m_j(t))   (3)
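For concreteness, the following is a minimal NumPy sketch of the conventional SOM step described by (1)-(3), assuming a square lattice and one feature vector per time sample; the grid size, decay schedules, and variable names are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def bmu_index(x, weights):
    """Eq. (1): index of the best-matching unit for input x."""
    # weights has shape (n_neurons, feat_dim)
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def gaussian_neighborhood(bmu, grid, sigma):
    """Eq. (2): Gaussian kernel over Manhattan distance on the 2-D lattice."""
    d = np.abs(grid - grid[bmu]).sum(axis=1)          # Manhattan distance to the BMU
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

def som_step(x, weights, grid, alpha, sigma):
    """Eq. (3): move every neuron toward x, scaled by the neighborhood kernel."""
    b = bmu_index(x, weights)
    h = gaussian_neighborhood(b, grid, sigma)          # shape (n_neurons,)
    weights += alpha * h[:, None] * (x - weights)
    return b

# Illustrative setup: a 5x5 lattice trained on 2-D optical-flow samples.
rows, cols, feat_dim = 5, 5, 2
rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
weights = rng.normal(size=(rows * cols, feat_dim))
for t, x in enumerate(rng.normal(size=(100, feat_dim))):
    som_step(x, weights, grid, alpha=0.5 * 0.99 ** t, sigma=2.0 * 0.99 ** t)
```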
2.2. Spatio-temporal dependencies learning model
STDL provides an efficient way to discover the hidden spatio-temporal dependency information. It can also reduce the feature dimension of the original video sequences. Similar to SOM, the procedure of searching for the BMU and the definition of the neighborhood function are specified in (1) and (2), respectively. The difference is the adaptive learning rule for the synaptic neuron weights.
In SOM, the neighborhood function can only be used to preserve the spatial topology. The temporal Kohonen map (TKM) and the recurrent self-organizing map (RSOM) [13] were proposed to adaptively model a data distribution over time on non-stationary input sequences. Although TKM preserves a trace of past activations in terms of a weighted sum, the weights are updated towards the last frame sample of the input sequence based on the conventional SOM update rule. RSOM provides a consistent update rule for the network parameters, but smoothes out temporary volatilities. STDL models sequential dynamics by introducing a Markov process to capture neuron transition probabilities between every two time samples. It is similar to a Markov random walk [14] on a graph, where at each step the walk jumps to another node based on a specified probability distribution. The parameters of the Markov process are used in the neuron update and model classification. In Fig. 2, we take the "run" action in the KTH dataset as an example to trace the Markov random walk. The trace records the spatial and temporal dynamics while learning knowledge from new data.
Fig. 2. The trace of the Markov random walk for the "run" action in the KTH dataset.
Suppose the video sequence x_i = {x_{i,1}, ..., x_{i,t}, ..., x_{i,T}}, where x_{i,t} is the input data x_i at time t. In STDL, we have N_s neurons on the lattice map. In (4), p_{i,j} is the transition probability from BMU i at time t to BMU j at time t+1.
p_{i,j} = K(i,j) / Σ_{m∈N_s} K(i,m)   (4)
The transition probabilities constrain the variations of the neuron weights. There are two parts contributing to the synaptic weights, and the synaptic neuron weights are updated according to (5). The first part calculates the estimation of the same target neuron weight at two adjacent timestamps. The second part emphasizes the neuron weight at the previous time. The state dynamics keeps a balance between these two parts. This formulation means the elastic characteristics of STDL have effects on both the spatial domain and the temporal domain.
m_j(t+1) = m_j(t) + α(t) h_{j,bmu(x(t))}(t) ( x(t) − p_{bmu(x(t)),bmu(x(t+1))} m*_j(t+1) − (1 − p_{bmu(x(t)),bmu(x(t+1))}) m_j(t) )   (5)
m*_j(t+1) = (1 − p_{bmu(x(t)),bmu(x(t+1))}) m_j(t) + p_{bmu(x(t)),bmu(x(t+1))} m_j(t+1)   (6)
Here m*_j(t+1) helps the target neuron adaptively learn temporal knowledge from the Markov model, and the transition probability teaches the neuron how to balance the weight update in STDL.
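As a rough illustration of (4)-(6), the sketch below forms transition probabilities from a kernel over lattice distances and applies the probability-weighted weight update. Two points are assumptions made for the example rather than the paper's exact definition: K(i, j) is taken to be the same Gaussian lattice kernel as in (2), and the mutual coupling between (5) and (6) is resolved with a provisional SOM estimate of m_j(t+1).

```python
import numpy as np

def stdl_step(x_t, bmu_t, bmu_next, weights, grid, alpha, sigma):
    """One STDL weight update in the spirit of Eqs. (4)-(6) (a sketch, not the exact scheme)."""
    # Gaussian kernel over Manhattan lattice distance to BMU(x(t)), as in Eq. (2).
    d = np.abs(grid - grid[bmu_t]).sum(axis=1)
    h = np.exp(-(d ** 2) / (2.0 * sigma ** 2))

    # Eq. (4): transition probability from BMU(x(t)) to BMU(x(t+1)),
    # assuming K(i, j) is the Gaussian lattice kernel above.
    p = h[bmu_next] / h.sum()

    m_t = weights
    m_provisional = m_t + alpha * h[:, None] * (x_t - m_t)   # plain SOM estimate of m_j(t+1)
    m_star = (1.0 - p) * m_t + p * m_provisional             # Eq. (6)
    # Eq. (5): the input is corrected by the probability-weighted blend of the
    # temporal estimate m* and the previous weight before the neighborhood update.
    return m_t + alpha * h[:, None] * (x_t - p * m_star - (1.0 - p) * m_t)
```

In a full training pass (Algorithm 1 below), this step would be applied at every timestamp after locating the BMUs of two consecutive samples.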
Algorithm 1 STDL
Input: Video feature sequences X = {x_1, ..., x_S}, where x_i is a sequence vector {x_i(1), ..., x_i(T)}; total neuron number N; initial neuron weights m_j(t_0), j = 1, ..., N; initial learning rate α(t_0); counter for each neuron cnt(j)
X ← (X − min(X)) / (max(X) − min(X)), X ∈ [0,1]
for i = 1 to S do
  for t = 1 to T do
    Search BMU b(x_i(t)), d_{i,t} ← b(x_i(t))
    cnt(b(x_i(t))) ← cnt(b(x_i(t))) + 1
    Calculate p_{b(x_i(t)), b(x_i(t+1))}
    Update m_j(t+1)
  end for
end for
Output: Label sequences D = {d_{i,1}, ..., d_{i,T}}
An adaptive merging strategy for clustering optimization (AMSCO) is introduced. The advantage of AMSCO is to avoid the local optima caused by STDL, although STDL is an effective clustering technique for complex high-dimensional data. The spatial and temporal topology knowledge is used in the adaptive merging strategy, and the Manhattan distance is the key metric to evaluate spatio-temporal relationships. min(CNT), the smallest non-zero counter value, indicates the clusters in the sparse feature space. The purpose of AMSCO is to analyze these sparse clusters and then merge them into potential clusters associated with spatio-temporal dependencies. The adaptive clustering method creates a robust latent space on which the dynamic modeling framework is established.
By using AMSCO on the Weizmann dataset, the average cluster number over all classes ranges from 4 to 20. We use the Davies-Bouldin index (DBI) in (7) to evaluate clustering performance. As shown in Fig. 3, k-means and fuzzy c-means perform similarly, and these two partitional clustering methods achieve lower performance than STDL, whose DBI value is below 3.0. STDL with AMSCO obtains even better clustering results than STDL alone, and the variation of DBI in STDL-AMSCO appears smoother.
DBI = (1/N) Σ_{i=1}^{N} max_{j≠i} ( (S_i + S_j) / dist(c_i, c_j) )   (7)
where N is the total number of clusters, c_i is the centroid of cluster i, dist(c_i, c_j) is the inter-cluster distance between centroids c_i and c_j, and S_i is the average intra-cluster distance within cluster i.
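For reference, a small sketch of the DBI computation in (7), assuming Euclidean distances between centroids and mean member-to-centroid distances for S_i; the function and variable names are illustrative.

```python
import numpy as np

def davies_bouldin_index(features, labels):
    """Eq. (7): mean over clusters of the worst-case (S_i + S_j) / dist(c_i, c_j)."""
    clusters = np.unique(labels)
    centroids = np.array([features[labels == c].mean(axis=0) for c in clusters])
    # S_i: average distance of cluster members to their centroid.
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(clusters)
    ])
    n = len(clusters)
    ratios = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        dists = np.linalg.norm(centroids[others] - centroids[i], axis=1)
        ratios[i] = np.max((scatter[i] + scatter[others]) / dists)
    return ratios.mean()       # lower is better
```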
Algorithm 2 AMSCO
Input: Label sequences D = {d_{i,1}, ..., d_{i,T}}; cluster number before merging N_c; counter vector CNT
Calculate the expected value of the counter vector, E(CNT)
repeat
  Select the cluster j with cnt(j) = min(CNT)
  if cnt(j) < E(CNT) then
    for k = 1 to N_c do
      if MhtDist(j,k) = min(MhtDist) and j ≠ k then
        Add k into the index vector
      end if
    end for
    if size(index vector) > 1 and cnt(k) = max(cnt(index vector)) then
      Merge cluster j into k
      cnt(k) ← cnt(k) + cnt(j), cnt(j) ← 0
    end if
  end if
until min(CNT) > E(CNT)
if d_{i,t} = j then
  d*_{i,t} ← k
end if
Output: Label sequences after merging D* = {d*_{i,1}, ..., d*_{i,T}}
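The following is a minimal sketch of the AMSCO merging loop under one reading of Algorithm 2: the sparsest non-empty cluster is merged into its nearest (Manhattan-distance) neighbor on the lattice, ties are broken by the largest counter, and the labels are remapped accordingly; the function name, the definition of E(CNT) over non-empty clusters, and the tie-breaking details are assumptions made for illustration.

```python
import numpy as np

def amsco_merge(labels, grid, n_clusters):
    """Merge sparse clusters into nearby dense ones (a sketch of Algorithm 2).

    labels: integer cluster/neuron indices for every sample (flattened label sequences).
    grid:   2-D lattice coordinates of each cluster, shape (n_clusters, 2).
    """
    labels = labels.copy()
    cnt = np.bincount(labels, minlength=n_clusters).astype(float)

    while True:
        nonzero = np.flatnonzero(cnt > 0)
        expected = cnt[nonzero].mean()                 # E(CNT) over non-empty clusters (assumed)
        j = nonzero[np.argmin(cnt[nonzero])]           # sparsest non-empty cluster
        if cnt[j] >= expected or len(nonzero) < 2:
            break                                      # until min(CNT) > E(CNT)
        # Candidate targets: non-empty clusters at minimal Manhattan distance from j.
        cand = nonzero[nonzero != j]
        dist = np.abs(grid[cand] - grid[j]).sum(axis=1)
        nearest = cand[dist == dist.min()]
        k = nearest[np.argmax(cnt[nearest])]           # tie-break by largest counter
        labels[labels == j] = k                        # relabel d_{i,t} = j to k
        cnt[k] += cnt[j]
        cnt[j] = 0.0
    return labels
```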
Fig. 3. Cluster analysis on the Weizmann dataset.
We assume the input data X_t = {X(x_i; t)}, i = 1, ..., S, where S is the number of spatial data attributes at time t. The covariance matrix of the zero-mean Gaussian noise is Δ_t, and Θ_t describes the state transition over time t. We collect the dynamic model parameters as Φ = {Θ_t, Δ_t}. The primary goal of this model is to estimate the modeling parameters through expectation-maximization (EM). The likelihood of the input data sequences can be estimated as (8).
P(D*|Φ) = Π_{i=1}^{n} P(D*_i|Φ)   (8)
We can predict the class label based on (9).
y = arg max_{s_i∈S} Π_{j=1}^{n} P(D*|Φ_j, s_j) P(s_j|Φ_j) P(Φ_j)   (9)
where Φ_j represents one of the alternative models and S is the set of all class labels.
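As a rough illustration of the decision rule in (9), the sketch below scores each candidate class with the log-likelihood of the merged label sequences under that class's learned model plus a log-prior, and predicts the arg max. The per-class log_likelihood interface and the explicit prior dictionary are assumptions made for the example, not the paper's exact EM formulation.

```python
import math

def predict_class(label_sequences, class_models, class_priors):
    """Pick the class whose model best explains the label sequences (cf. Eq. (9)).

    class_models: dict class_name -> model exposing log_likelihood(sequence) (assumed interface)
    class_priors: dict class_name -> prior probability of that class model
    """
    best_class, best_score = None, -math.inf
    for name, model in class_models.items():
        # Eq. (8): sequences are treated as independent, so log-likelihoods add up.
        score = sum(model.log_likelihood(seq) for seq in label_sequences)
        score += math.log(class_priors[name])
        if score > best_score:
            best_class, best_score = name, score
    return best_class
```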
3. EXPERIMENTS
To analyze the effects of periodic and non-periodic actions, we compute optical flow for feature extraction [15]. Optical flow approximates local image motion based on local derivatives in a video sequence, and it essentially reflects the spatio-temporal variability between two consecutive frames.
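For reference, a minimal OpenCV sketch of the feature extraction described in Fig. 1: dense optical flow between consecutive frames, subsampled on a 10×10 pixel grid, so a 160×120 KTH frame yields a 16×12×2 flow field. The Farnebäck method and the example file name are assumptions; the paper only specifies that optical flow [15] is used.

```python
import cv2
import numpy as np

def grid_flow_features(video_path, grid_step=10):
    """Return one grid-sampled optical-flow field per consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow at every pixel (Farneback is an assumed choice of method).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        features.append(flow[::grid_step, ::grid_step, :])   # 10x10 grid sampling
        prev_gray = gray
    cap.release()
    return np.stack(features)   # e.g. shape (T-1, 12, 16, 2) for 160x120 KTH frames

# Hypothetical usage on a KTH clip:
# seq = grid_flow_features("person01_running_d1_uncompressed.avi")
```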
3.1. Performance
A further analysis of spatio-temporal clustering is given in Table 1, where N_c and N*_c represent the average cluster numbers before and after merging, respectively. N_c increases promptly when the map size becomes larger, since SOM depends on the input feature space. STDL also inherits this sensitivity to data diversity, and AMSCO helps STDL overcome this problem. The spatio-temporal dependency is also considered during the merging procedure, and the original topological structure is preserved.
Table 1. Comparison of cluster numbers
Dataset    Map size   N_c   N*_c
KTH        2×2        4     2
KTH        3×3        5     3
KTH        4×4        8     3
KTH        5×5        12    5
Weizmann   2×2        4     2
Weizmann   3×3        6     3
Weizmann   4×4        10    6
Weizmann   5×5        20    12
UCF        2×2        4     2
UCF        3×3        7     3
UCF        4×4        12    7
UCF        5×5        16    9
To better visualize the spatio-temporal dependency, we take "jump" and "skip" as an example in Fig. 4. The "jump" and "skip" actions have similar distance errors in the STDL model, but the error variation of the "skip" action is smaller than that of the "jump" action. This verifies that the STDL model can learn the spatio-temporal dependencies between two similar actions.
Fig. 4. Two similar actions in the STDL model.
To verify the recognition capability of the proposed method, Table 2 shows the recognition results of comparable approaches on the KTH, Weizmann and UCF datasets, respectively. On the KTH dataset, Wu et al. [16] and Kovashka et al. [17] achieved the best performance with 94.5%, while our method achieves 94.2% on average. On the Weizmann dataset, Fathi [7] achieved 100%, Jhuang [18] achieved 98.8%, and our method achieved 98.2%. On the more challenging UCF dataset, Kovashka et al. [17] and Wu et al. [16] achieved 87.3% and 91.3%, respectively, and our method performs better with 91.6%. The performance of our method is comparable with these state-of-the-art methods on action datasets. In particular, for more complex data such as the UCF sports dataset, our method can effectively improve the recognition performance. More importantly, our method adaptively learns from low level features, such as optical flow, rather than relying on strong features, which improves model robustness and requires less human intervention.
Table 2. Average accuracy on benchmark datasets
Method                  KTH     Weizmann   UCF
Fathi et al. [7]        90.5%   100%       -
Jhuang et al. [18]      91.7%   98.8%      -
Laptev et al. [2]       91.8%   -          -
Campos et al. [19]      91.5%   96.7%      80.0%
Wang et al. [20]        89.0%   97.8%      83.3%
Wu et al. [16]          94.5%   -          91.3%
Kovashka et al. [17]    94.5%   -          87.3%
Our method              94.2%   98.2%      91.6%
4. CONCLUSION
In this paper, the STDL model is introduced to effectively learn spatio-temporal dependencies for action recognition. The implicit spatio-temporal knowledge is extracted by learning low level features from sequential action data. The learning process preserves the intrinsic topologies and reduces the dimension in the spatio-temporal domain, which significantly improves the computational efficiency of the model. We also analyze the overall performance of our model on several real-world datasets.
5. REFERENCES
[1] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1473–1488, 2008.
[2] I. Laptev, "On space-time interest points," Intl. Journal of Computer Vision, vol. 64, pp. 107–123, 2005.
[3] Y. Lui, J. Beveridge, and M. Kirby, "Action classification on product manifolds," CVPR, 2010.
[4] J. Sun, X. Wu, S. Yan, L. Cheong, T. Chua, and J. Li, "Hierarchical spatio-temporal context modeling for action recognition," CVPR, 2009.
[5] A. Gilbert, J. Illingworth, and R. Bowden, "Fast realistic multi-action recognition using mined dense spatio-temporal features," ICCV, 2009.
[6] Y. Zhu, X. Zhao, Y. Fu, and Y. Liu, "Sparse coding on local spatial-temporal volumes for human action recognition," ACCV, pp. 660–671, 2010.
[7] A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," CVPR, 2008.
[8] S. Zhao, D. Tuninetti, R. Ansari, and D. Schonfeld, "Multiple description coding over multiple correlated erasure channels," Trans. Emerging Tel. Tech., vol. 23, pp. 522–536, 2012.
[9] Q. Cai, Y. Yin, and H. Man, "DSPM: Dynamic structure preserving map for action recognition," ICME, 2013.
[10] T. Kohonen, "Self-organizing maps," Springer, 2001.
[11] Q. Cai, H. He, and H. Man, "Spatial outlier detection based on iterative self-organizing learning model," Neurocomputing, 2013.
[12] Q. Cai, H. He, and H. Man, "SOMSO: A self-organizing map approach for spatial outlier detection with multiple attributes," Proc. Int. Neural Networks, pp. 425–431, 2009.
[13] M. Varsta, J. Heikkonen, J. Lampinen, and J. Millan, "Temporal Kohonen map and recurrent self-organizing map: analytical and experimental comparison," Neural Processing Letters, vol. 13, pp. 237–251, 2001.
[14] M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks," NIPS, 2001.
[15] D. Fleet and Y. Weiss, "Optical flow estimation," Handbook of Mathematical Models in Computer Vision, Springer, pp. 239–258, 2005.
[16] X. Wu, D. Xu, L. Duan, and J. Luo, "Action recognition using context and appearance distribution features," CVPR, 2011.
[17] A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," CVPR, 2010.
[18] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," ICCV, 2007.
[19] T. Campos, M. Barnard, K. Mikolajczyk, J. Kittler, F. Yan, W. Christmas, and D. Windridge, "An evaluation of bags-of-words and spatio-temporal shapes for action recognition," IEEE Workshop Applications of Computer Vision, pp. 344–351, 2011.
[20] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," BMVC, 2009.
