This thesis focuses on multimodal understanding and Visual Question Answering (VQA) via deep learning methods. As its first technical contribution, the thesis improves multimodal fusion schemes via multi-stage vision-language interactions. It then seeks to overcome language-bias challenges to build robust VQA models, and extends the bias problem to the more complex audio-visual-textual question answering task. Furthermore, the thesis explores the open-world applicability of VQA algorithms from the perspectives of lifelong learning and federated learning, thereby enabling continuous and distributed training. The efficacy of the proposed methods is verified by extensive experiments. The thesis also gives an overview of challenges, benchmarks, and strategies for robust VQA algorithms.
Sbrollini, A.; Barocci, M.; Mancinelli, M.; Paris, M.; Raffaelli, S.; Marcantoni, I.; ... ; Burattini, L. 2023
Heart failure (HF) diagnosis, typically performed visually by serial electrocardiography, may be supported by machine-learning approaches. The repeated structuring & learning procedure (RS & LP) is a constructive algorithm able to automatically create artificial neural networks (ANNs); it relies on three parameters, namely the maximal number of hidden layers (MNL), of initializations (MNI) and of confirmations (MNC), arbitrarily set by the user. The aim of this study is to evaluate the robustness of RS & LP to varying parameter values and to identify an optimized combination of parameter values for HF diagnosis. To this aim, the Leiden University Medical Center HF database was used. The database consists of 129 serial ECG pairs acquired in patients who experienced myocardial infarction; 48 patients developed HF at follow-up (cases), while 81 remained clinically stable (controls). Overall, 15 ANNs were created by considering 13 serial ECG features as inputs (extracted from each serial ECG pair), 2 classes as outputs (cases/controls), and varying values of MNL (1, 2, 3, 4 and 10), MNI (50, 250, 500, 1000 and 1500) and MNC (2, 5, 10, 20 and 50). The area under the curve (AUC) of the receiver operating characteristic did not vary significantly with varying parameter values (P >= 0.09). The optimized combination of parameter values, identified as the one showing the highest AUC, was obtained for MNL = 3, MNI = 500 and MNC = 50 (AUC = 86%; ANN structure: 3 hidden layers of 14, 14 and 13 neurons, respectively). Thus, RS & LP is robust, and the optimized ANN represents a potentially useful clinical tool for reliable automatic HF diagnosis.
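The abstract reports the final optimized topology (13 inputs, hidden layers of 14, 14 and 13 neurons, 2 output classes) but not the RS & LP construction procedure itself. The sketch below only illustrates a forward pass through a network of that reported shape; the activation functions, initialization scale and softmax output are assumptions, not details from the study.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass through a fully connected network with sigmoid hidden
    units and a softmax output, matching only the topology reported in
    the abstract: 13 inputs -> 14 -> 14 -> 13 -> 2 output classes."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = 1.0 / (1.0 + np.exp(-(a @ W + b)))  # sigmoid hidden layer
    z = a @ weights[-1] + biases[-1]
    e = np.exp(z - z.max())                     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [13, 14, 14, 13, 2]                     # reported optimized structure
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

probs = forward(rng.standard_normal(13), weights, biases)  # class probabilities
```

In a trained version of such a network, the two outputs would correspond to the cases/controls decision; here the weights are random, so only the shapes are meaningful.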
Dreuning, H.; Bal, H.E.; Nieuwpoort, R.V. van 2023
Deep Learning (DL) model sizes are increasing at a rapid pace, as larger models typically offer better statistical performance. Modern Large Language Models (LLMs) and image processing models contain billions of trainable parameters. Training such massive neural networks incurs significant memory requirements and financial cost. Hybrid-parallel training approaches have emerged that combine pipelining with data and tensor parallelism to facilitate the training of large DL models on distributed hardware setups. However, existing approaches to designing a hybrid-parallel partitioning and parallelization plan for DL models focus on achieving high throughput, not on minimizing memory usage and financial cost. We introduce CAPTURE, a partitioning and parallelization approach for hybrid parallelism that minimizes peak memory usage. CAPTURE combines a profiling-based approach with statistical modeling to recommend a partitioning and parallelization plan that minimizes the peak memory usage across all the Graphics Processing Units (GPUs) in the hardware setup. Our results show a reduction in memory usage of up to 43.9% compared to partitioners in state-of-the-art hybrid-parallel training systems. The reduced memory footprint enables the training of larger DL models on the same hardware resources and training with larger batch sizes. CAPTURE can also train a given model on a smaller hardware setup than other approaches, reducing the financial cost of training massive DL models.
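The core optimization objective can be illustrated with a toy stand-in: split a sequence of layers into contiguous pipeline stages so that the peak (maximum) per-stage memory is minimized. This is not CAPTURE's actual method, which uses profiling and statistical modeling; the brute-force search and the single per-layer memory numbers below are assumptions for illustration only.

```python
from itertools import combinations

def min_peak_memory(layer_mem, num_gpus):
    """Brute-force search over contiguous splits of a layer sequence into
    `num_gpus` pipeline stages, returning the smallest achievable peak
    per-stage memory and the stage boundaries that achieve it. A toy
    illustration of memory-minimizing partitioning; real systems also
    model activations, optimizer state and tensor/data parallelism."""
    n = len(layer_mem)
    best_peak, best_bounds = float("inf"), None
    for cuts in combinations(range(1, n), num_gpus - 1):
        bounds = (0, *cuts, n)
        peak = max(sum(layer_mem[a:b]) for a, b in zip(bounds, bounds[1:]))
        if peak < best_peak:
            best_peak, best_bounds = peak, bounds
    return best_peak, best_bounds

# Hypothetical per-layer memory costs (arbitrary units) on 3 GPUs.
peak, bounds = min_peak_memory([4, 2, 6, 3, 3, 2], num_gpus=3)
```

Note how the objective differs from throughput-oriented partitioners, which would instead balance per-stage compute time; balancing memory can yield a different split.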
In many real-world applications today, it is critical to continuously record and monitor certain machine or system health indicators to discover malfunctions or other abnormal behavior at an early stage and prevent potential harm. The demand for such reliable monitoring systems is expected to increase in the coming years. Particularly in the industrial context, in the course of ongoing digitization, it is becoming increasingly important to analyze growing volumes of data in an automated manner using state-of-the-art algorithms. In many practical applications, one has to deal with temporal data in the form of data streams or time series. The problem of detecting unusual (or anomalous) behavior in time series is commonly referred to as time series anomaly detection. Anomalies are events observed in the data that do not conform to the normal or expected behavior when viewed in their temporal context. This thesis focuses on unsupervised machine learning algorithms for anomaly detection in time series. In an unsupervised learning setup, a model attempts to learn the normal behavior in a time series — which might already be contaminated with anomalies — without any external assistance. The model can then use its learned notion of normality to detect anomalous events.
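The unsupervised setup described above (learn "normal" from the data itself, then flag deviations) can be sketched with a minimal rolling z-score baseline. This is a generic illustration, not one of the thesis's algorithms; the window size and threshold are arbitrary assumptions.

```python
import statistics

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag points whose deviation from the mean of the preceding `window`
    observations exceeds `threshold` standard deviations -- a minimal
    unsupervised baseline that learns 'normal' behavior from the recent
    past only, with no labels or external assistance."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sd = statistics.pstdev(hist) or 1e-9   # guard against zero variance
        if abs(series[i] - mu) / sd > threshold:
            anomalies.append(i)
    return anomalies

# A mostly flat signal with one obvious spike at index 10.
data = [0.1, 0.0, 0.2, -0.1, 0.1, 0.0, 0.2, 0.1, -0.1, 0.0, 5.0, 0.1]
anoms = zscore_anomalies(data, window=10, threshold=3.0)
```

Note that the spike itself then contaminates the history window for later points, inflating the local standard deviation; this mirrors the contamination problem the abstract mentions, which more robust methods must handle explicitly.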
The archaeology domain produces large amounts of texts, too many to read effectively or search through manually for research. To alleviate this problem, we created a search system (called AGNES) that combines full-text search with entity and geographical search. We first created a manually labelled data set to train a Named Entity Recognition model, which is used to extract entities from text. We also conducted a user-requirements study and a usability evaluation of the system to make sure it is suitable for archaeological research. In a case study on Early Medieval cremations, we show that using AGNES leads to a knowledge increase when compared to the knowledge of experts gathered using previously available search engines. This shows that this kind of intelligent search system can help with literature research, surface more relevant data, and lead to a better understanding of the past.
The manual analysis of remotely-sensed data is a widespread practice in local- and regional-scale archaeological research, as well as heritage management. However, the amount of available high-quality, remotely-sensed data is continuously growing at a staggering rate, which creates new challenges to effectively and efficiently analyze these data and to find and document the seemingly overwhelming number of potential archaeological objects. Therefore, computer-aided methods for the automated detection of archaeological objects are needed. In this thesis, the development and application of automated detection methods, based on Deep Convolutional Neural Networks, for the detection of multiple classes of archaeological objects in LiDAR data are investigated. Furthermore, the implementation of these methods into archaeological practice and the opportunities of knowledge discovery—on both a quantitative and qualitative level—for landscape or spatial archaeology are explored.
Person re-identification (ReID) methods always learn through a stationary domain that is fixed by the choice of a given dataset. In many contexts (e.g., lifelong learning), those methods are ineffective because the domain is continually changing, so incremental learning over multiple domains is potentially required. In this work, we explore a new and challenging ReID task, namely lifelong person re-identification (LReID), which enables learning continuously across multiple domains and even generalising to new and unseen domains. Following the cognitive processes in the human brain, we design an Adaptive Knowledge Accumulation (AKA) framework that is endowed with two crucial abilities: knowledge representation and knowledge operation. Our method alleviates catastrophic forgetting on seen domains and demonstrates the ability to generalize to unseen domains. Correspondingly, we also provide a new and large-scale benchmark for LReID. Extensive experiments demonstrate that our method outperforms other competitors by a margin of 5.8% mAP in generalising evaluation. The code will be available at https://github.com/TPCD/LifelongReID.
Visible-infrared person re-identification (VI-ReID) is a challenging and essential task in night-time intelligent surveillance systems. In addition to the intra-modality variance that RGB-RGB person re-identification mainly overcomes, VI-ReID suffers from additional inter-modality variance caused by the inherent heterogeneous gap. To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. Disentangling cross-modality identity-discriminable features leads to more robust retrieval for VI-ReID. To achieve efficient optimization like conventional VAEs, we theoretically derive two variational inference terms for the MoG prior under the supervised setting, which not only restrict the identity-discriminable subspace so that the model explicitly handles the cross-modality intra-identity variance, but also enable the MoG distribution to avoid posterior collapse. Furthermore, we propose a triplet swap reconstruction (TSR) strategy to promote the above disentangling process. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two VI-ReID datasets. Code will be available at https://github.com/TPCD/DG-VAE.
In 2018, the number of mobile phone users was expected to reach about 4.9 billion. Assuming an average of 5 photos taken per day using the built-in cameras would result in about 9 trillion photos annually. It thus becomes challenging to mine semantic information from such a huge amount of visual data. To address this challenge, deep learning, an important sub-field of machine learning, has achieved impressive developments in recent years. Inspired by its success, this thesis aims to develop new deep-learning approaches to explore and analyze image data along three research themes: classification, retrieval and synthesis. In summary, the research of this thesis contributes at three levels: models and algorithms, practical scenarios, and empirical analysis. First, this work presents new approaches based on deep learning to address eight research questions regarding the three themes. In addition, it aims towards adapting the approaches to practical scenarios in the real world. Furthermore, this thesis provides numerous experiments and in-depth analysis, which can help motivate further research on the three research themes.
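The back-of-the-envelope estimate in the opening sentences checks out as follows (using the abstract's own figures of 4.9 billion users and 5 photos per user per day):

```python
users = 4.9e9          # mobile phone users in 2018, per the abstract
photos_per_day = 5     # assumed average per user, per the abstract
annual = users * photos_per_day * 365
# annual works out to roughly 8.9 trillion, i.e. "about 9 trillion" photos
```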