ABSTRACT
With the development of deep learning theory and technology, cross-modal research fields involving multi-modal data sources have attracted considerable attention. Video question answering is a typical cross-modal task, which requires a model to automatically answer a question about the visual content of an arbitrary input video. Focusing on the field of video question answering, the main contributions of this thesis are as follows:
(1) Propose a reconstructed dataset named TGIF-QA-R
TGIF-QA is a widely used large-scale dataset for video question answering. However, this thesis finds that there is an interdependent relation among the candidate answers in the Action and Transition subtasks of the original dataset, which may lead to artificially high accuracy of current models on these two subtasks. Therefore, this thesis reconstructs the original dataset, generating new candidate answers by random selection. Experimental results demonstrate that the reconstructed dataset effectively removes the bias caused by the distribution of candidate answers.
(2) Propose Motion-Aware Attention Network
This thesis proposes a Motion-Aware Attention Network (MAAN) for video question answering, which can simultaneously model the local and global motion changes of video objects. Specifically, considering that the order of the original object features is shuffled across video frames, this thesis proposes an alignment algorithm based on vector similarity to align the instances of the same object in different frames. Secondly, a local motion attention module based on the aligned features is proposed to model the motion changes of each video object. Furthermore, this thesis designs a complementary global motion attention module to explore higher-level motion information.
(3) Propose Progressive Graph Attention Network
This thesis also proposes a Progressive Graph Attention Network (PGAT), which can model the relationship