Multi modal representation learning and cross-modal semantic matching

Wang, X.

Persistent URL of this record https://hdl.handle.net/1887/3391031

Documents

- Cover: open access
- Full Text: under embargo until 2026-06-24
- Title Pages_Contents: open access
- Chapter 1: open access
- Chapter 2: under embargo until 2026-06-24
- Chapter 3: open access
- Chapter 4: open access
- Chapter 5: open access
- Chapter 6: open access
- Bibliography: open access
- Summary in English: open access
- Summary in Dutch: open access
- Curriculum Vitae_Acknowledgements: open access
- Propositions: open access

Wang, X. (2022)

Multi modal representation learning and cross-modal semantic matching

Doctoral Thesis

Humans perceive the real world through their sensory organs: vision, taste, hearing, smell, and touch. In terms of information, we consider these different modes, which are also referred to as channels of information or modalities. Considering multiple channels of information at the same time is referred to as multimodal, and the input as multimedia. By their very nature, multimedia data are complex and often involve intertwined instances of different kinds of information. We can leverage this multimodal perspective to extract meaning and understanding of the world. This is comparable to how our brain processes these multiple channels: we learn how to combine them and extract meaningful information from them. In this thesis, the learning is done by computer programs and smart algorithms, which is referred to as artificial intelligence. To that end, we have studied multimedia information, with a focus on vision and language information representation for semantic mapping. The aims of the semantic mapping learning in this thesis are: (1) visually supervised word embedding learning; (2) fine-grained label learning for vision representation; (3) kernel-based transformation for image and text association; (4) visual representation learning via a cross-modal contrastive learning framework.

Object detection

Semantic mapping

Image-text association

Kernel-based mapping

Visual representation

Phrase grounding

Contrastive learning

Weakly supervised learning