This thesis mainly focuses on multimodal understanding and Visual Question Answering (VQA) via deep learning methods. For technical contributions, this thesis first focuses on improving multimodal... Show moreThis thesis mainly focuses on multimodal understanding and Visual Question Answering (VQA) via deep learning methods. For technical contributions, this thesis first focuses on improving multimodal fusion schemes via multi-stage vision-language interactions. Then, the thesis seeks to overcome the language bias challenges to build robust VQA models, and also extend the bias problem into the more complex audio-visual-textual question answering tasks. Furthermore, this thesis explores the open-world applicability of VQA algorithms from the aspects of lifelong learning and federated learning, thereby expanding the continuous and distributed training ability. The efficacy of the proposed methods in this thesis is verified by extensive experiments. This thesis also gives an overview of challenges, benchmarks and strategies for robust VQA algorithms. Show less