Authors: LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia
DOI: 10.15938/j.jhust.2022.04.014
        CLC number: TP391.3
        Document code: A
        Article ID: 1007-2683(2022)04-0107-11
        MSAM: Video Question Answering Based
        on Multi-Stage Attention Model
        LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia
        (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)
        Abstract: The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question in order to generate an answer. At present, attention-based VideoQA methods have difficulty fully understanding and accurately locating the video information related to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. The network extracts multi-modal features such as video, audio and text and feeds them into the multi-stage attention model (MSAM), which accurately locates the video information through stage-by-stage localization. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between features of different modalities. The network is tested on the ZJL dataset. The average accuracy is 54.3%, which is nearly 15% higher than the traditional method and nearly 7% higher than the existing method.
        Keywords: video question answering; multi-stage attention model; multi-modal feature fusion
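        To address the problems above, this paper proposes a multi-stage attention model (MSAM) for accurately locating the video features relevant to the question. MSAM consists of three stages, each with a different focus. The first-stage attention model finds the key channels in the video sequence that are relevant to the question. The second-stage attention model builds on the first stage and finds the question-relevant key regions within those key channels, achieving more precise localization. The third-stage attention model, building on the first stage, attends to the fused video representation and understands the question through cooperation among multiple features, yielding a question-relevant video representation. Based on MSAM, a multi-stage attention model network (MSAMN) is constructed to solve the video question answering task. Experiments show that the proposed method clearly improves classification accuracy on the video question answering task and that the proposed MSAMN generalizes well.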
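
The stage-by-stage idea in the first two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the mean-pooling of each channel, the data layout `video[c][r]` (channel `c`, region `r`), and all function names are assumptions made for illustration. The third stage is left out because this section does not specify how the fused representation is built.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def channel_attention(video, question):
    """Stage 1 (assumed form): one weight per channel.

    video[c][r] is the feature vector of region r in channel c.
    Each channel is summarized by mean-pooling over its regions.
    """
    summaries = [
        [sum(column) / len(channel) for column in zip(*channel)]
        for channel in video
    ]
    return softmax([dot(summary, question) for summary in summaries])


def region_attention(video, question, channel_weights):
    """Stage 2 (assumed form): one weight per region, computed on
    features already re-weighted by the stage-1 channel weights."""
    n_regions = len(video[0])
    fused = []
    for r in range(n_regions):
        vectors = [channel[r] for channel in video]
        fused.append([
            sum(w * x for w, x in zip(channel_weights, column))
            for column in zip(*vectors)
        ])
    weights = softmax([dot(feature, question) for feature in fused])
    return weights, fused


def attend(video, question):
    """Run stage 1, then stage 2, and pool a question-guided vector."""
    channel_weights = channel_attention(video, question)
    region_weights, fused = region_attention(video, question, channel_weights)
    # Region-weighted sum of the channel-weighted features.
    representation = [
        sum(w * x for w, x in zip(region_weights, column))
        for column in zip(*fused)
    ]
    return channel_weights, region_weights, representation


# Toy input: 2 channels, 2 regions, 2-dimensional features (hypothetical values).
question = [1.0, 0.0]
video = [
    [[2.0, 0.0], [0.0, 0.0]],  # channel 0: region 0 aligns with the question
    [[0.0, 1.0], [0.0, 1.0]],  # channel 1: orthogonal to the question
]
channel_weights, region_weights, representation = attend(video, question)
```

In this toy input, the channel and region that align with the question vector receive the larger weights, which is the narrowing behaviour the staged design aims for.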
