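Basic Information
File name: 基于多模态语义理解的视频问答算法研究.pdf (Research on Video Question Answering Algorithms Based on Multimodal Semantic Understanding)
File size: 6.65 MB
Total pages: 79
Updated: 2025-06-11
Total length: approx. 127,200 characters
Document Abstract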

ABSTRACT

With the development of deep learning theory and technology, cross-modal research fields involving multi-modal data sources have drawn considerable attention. Video question answering is a typical cross-modal task, which requires the model to automatically give the answer based on an arbitrary input video and a question about its visual content. Focusing on the field of video question answering, the main contents of this thesis are as follows:

(1) Propose a reconstructed dataset named TGIF-QA-R

TGIF-QA is a widely used large dataset in video question answering. However, this thesis finds that there is an interdependent relation between the candidate answers in the Action and Transition subtasks of the original dataset, which may lead to generally higher accuracy of current models on these two subtasks. In response, this thesis reconstructs the original dataset, generating new candidate answers by random selection. Experimental results demonstrate that the reconstructed dataset effectively removes the bias caused by the distribution of candidate answers.
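
The abstract does not say how the random selection is carried out, so the following is only a minimal sketch of one way to regenerate candidate answers. The function name `rebuild_candidates`, the shared `answer_pool`, and the default of five candidates per question are all assumptions made for illustration, not the thesis's actual procedure.

```python
import random


def rebuild_candidates(correct, answer_pool, num_candidates=5, rng=None):
    """Return (candidates, label): the correct answer plus random distractors.

    Hypothetical helper: the pool, the candidate count, and the sampling
    scheme are illustrative assumptions, not taken from the thesis.
    """
    if rng is None:
        rng = random.Random()
    # Distractors are drawn from the whole pool, excluding the correct answer,
    # so they no longer depend on the other candidates of the same question.
    distractor_pool = sorted(set(answer_pool) - {correct})
    if len(distractor_pool) < num_candidates - 1:
        raise ValueError("answer pool is too small for the requested size")
    candidates = rng.sample(distractor_pool, num_candidates - 1) + [correct]
    rng.shuffle(candidates)  # hide the position of the correct answer
    return candidates, candidates.index(correct)


# Toy pool of action answers (made up for this example).
pool = ["jump", "wave hand", "blink", "nod head", "clap", "spin", "smile"]
candidates, label = rebuild_candidates("clap", pool, rng=random.Random(0))
```

Sampling the distractors independently of the question's original candidates is what removes the interdependence in this sketch; a fixed seed keeps the rebuilt split reproducible.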

(2) Propose Motion-Aware Attention Network

This thesis proposes the Motion-Aware Attention Network (MAAN) for video question answering, which can simultaneously model local and global motion changes of video objects. Specifically, considering that the order of the original object features is mixed between video frames, this thesis first proposes an alignment algorithm based on vector similarity to align the instances of the same object in different frames. Secondly, a local motion attention module based on the aligned features is proposed to model the motion changes of each video object. Furthermore, this thesis designs a global motion attention module to explore higher-level motion information, which is complementary to the local module.
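
To make the alignment step more concrete, here is a minimal sketch of similarity-based alignment. It assumes cosine similarity, greedy one-to-one matching against the previous aligned frame, and the same number of objects in every frame. None of these details are given in the abstract, and the names `cosine`, `align_frame`, and `align_video` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_frame(reference, frame):
    """Reorder `frame` so slot i holds the object best matching reference[i].

    Greedy matching (an assumption): take pairs in order of decreasing
    similarity and skip any pair whose slot or object is already used.
    """
    pairs = sorted(
        ((cosine(r, f), i, j)
         for i, r in enumerate(reference)
         for j, f in enumerate(frame)),
        reverse=True,
    )
    aligned = [None] * len(reference)
    used = set()
    for _, i, j in pairs:
        if aligned[i] is None and j not in used:
            aligned[i] = frame[j]
            used.add(j)
    return aligned


def align_video(frames):
    """Align every frame to the previous aligned frame, starting from frame 0."""
    aligned = [frames[0]]
    for frame in frames[1:]:
        aligned.append(align_frame(aligned[-1], frame))
    return aligned


# Two frames with two objects each; the second frame lists them in swapped order.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
]
aligned = align_video(frames)
```

After alignment, slot i refers to the same object in every frame, which is the property a per-object (local) motion module would rely on. An optimal assignment such as the Hungarian algorithm could replace the greedy matching in this sketch.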

(3) Propose Progressive Graph Attention Network

This thesis proposes another network, the Progressive Graph Attention Network (PGAT), which can model the relationshi