ABSTRACT
With the development of deep learning theory and technology, cross-modal research fields involving multi-modal data sources have attracted considerable attention. Video question answering is a typical cross-modal task, which requires a model to automatically answer a question about the visual content of an arbitrary input video. Focusing on the field of video question answering, the main contributions of this thesis are as follows:
(1) Propose a reconstructed dataset named TGIF-QA-R
TGIF-QA is a widely used large-scale dataset for video question answering. However, this thesis finds that there is an interdependent relation among the candidate answers in the Action and Transition subtasks of the original dataset, which may lead to artificially high accuracy of current models on these two subtasks. Therefore, this thesis reconstructs the original dataset, generating new candidate answers by random selection. Experimental results demonstrate that the reconstructed dataset effectively removes the bias caused by the distribution of candidate answers.
(2) Propose Motion-Aware Attention Network
This thesis proposes a Motion-Aware Attention Network (MAAN) for video question answering, which can simultaneously model the local and global motion changes of video objects. Specifically, considering that the order of the original object features is shuffled across video frames, this thesis proposes an alignment algorithm based on vector similarity to align the instances of the same object in different frames. Secondly, a local motion attention module based on the aligned features is proposed to model the motion changes of each video object. Furthermore, this thesis designs a complementary global motion attention module to explore higher-level motion information.
(3) Propose Progressive Graph Attention Network
This thesis also proposes a Progressive Graph Attention Network (PGAT), which can model the relationship