In many real-world applications today, it is critical to continuously record and monitor certain machine or system health indicators to discover malfunctions or other abnormal behavior at an early stage and prevent potential harm. The demand for such reliable monitoring systems is expected to increase in the coming years. Particularly in the industrial context, in the course of ongoing digitization, it is becoming increasingly important to analyze growing volumes of data in an automated manner using state-of-the-art algorithms. In many practical applications, one has to deal with temporal data in the form of data streams or time series. The problem of detecting unusual (or anomalous) behavior in time series is commonly referred to as time series anomaly detection. Anomalies are events observed in the data that do not conform to the normal or expected behavior when viewed in their temporal context. This thesis focuses on unsupervised machine learning algorithms for anomaly detection in time series. In an unsupervised learning setup, a model attempts to learn the normal behavior in a time series, which might already be contaminated with anomalies, without any external assistance. The model can then use its learned notion of normality to detect anomalous events.
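The abstract does not name a specific detection algorithm; as a minimal sketch of the unsupervised setup it describes, a hypothetical rolling z-score baseline learns "normal" behavior from a sliding window of recent history and flags points that deviate strongly from it:

```python
import numpy as np

def rolling_zscore_anomalies(x, window=30, threshold=3.5):
    """Flag points that deviate strongly from the recent rolling mean.

    A minimal, hypothetical baseline for illustration -- not the
    method developed in the thesis.
    """
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        hist = x[i - window:i]          # recent history = learned "normality"
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(x[i] - mu) / sigma > threshold:
            flags[i] = True
    return flags

# Usage: a smooth signal with one injected spike
t = np.linspace(0, 10, 500)
signal = np.sin(t)
signal[250] += 5.0                      # the anomaly
flags = rolling_zscore_anomalies(signal)
# the injected spike at index 250 is flagged
```

Note that after the spike, the contaminated window inflates the rolling standard deviation, which illustrates the abstract's point that the training data itself may already contain anomalies.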
The archaeology domain produces large amounts of text, too much to read or search through manually for research. To alleviate this problem, we created a search system, called AGNES, which combines full-text search with entity and geographical search. We first created a manually labelled data set to train a Named Entity Recognition model, which is used to extract entities from text. We also conducted a user requirements study and a usability evaluation of the system, to make sure it is suitable for archaeological research. In a case study on Early Medieval cremations, we show that using AGNES leads to a knowledge increase compared to the knowledge of experts gathered using previously available search engines. This shows that this kind of intelligent search system can help with literature research, find more relevant data, and lead to a better understanding of the past.
In link prediction, the goal is to predict which links will appear in the future of an evolving network. To estimate the performance of link prediction models in a supervised machine learning setup, disjoint and independent train and test sets are needed. However, objects in a real-world network are inherently related to each other. Therefore, it is far from trivial to separate candidate links into these disjoint sets. Here we characterize and empirically investigate the two dominant approaches from the literature for creating separate train and test sets in link prediction, referred to as random and temporal splits. Comparing the performance of these two approaches on several large temporal network datasets, we find evidence that random splits may result in overly optimistic results, whereas a temporal split may give a fairer and more realistic indication of performance. Results appear robust to the selection of temporal intervals. These findings will be of interest to researchers who employ link prediction or other machine learning tasks in networks.
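The two splitting strategies contrasted above can be sketched as follows. This is a toy illustration of the general idea, not the exact evaluation protocol of the study:

```python
import random

def random_split(events, test_fraction=0.3, seed=0):
    """Shuffle (source, target, timestamp) link events and split,
    ignoring time -- test links may precede training links."""
    rng = random.Random(seed)
    shuffled = events[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(events, test_fraction=0.3):
    """Order link events by timestamp; the most recent ones form the
    test set, so no future information leaks into training."""
    ordered = sorted(events, key=lambda e: e[2])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

# Toy event list: (source, target, timestamp)
events = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3),
          ("c", "d", 4), ("b", "d", 5)]
train, test = temporal_split(events)
# Every training event precedes every test event in time
assert max(t for _, _, t in train) <= min(t for _, _, t in test)
```

The random split offers no such temporal guarantee, which is one intuition for why it can leak future information and produce overly optimistic scores.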
It is a common technique in global optimization with expensive black-box functions to learn a surrogate model of the response function from past evaluations and use it to decide on the location of future evaluations. In surrogate-model-assisted optimization, selecting the right modeling technique without preliminary knowledge about the objective function can be challenging. It might be beneficial if the algorithm trains many different surrogate models and selects the model with the smallest training error. This approach is known as model selection. In this thesis, a generalization of this approach is developed. Instead of choosing a single model, the optimal convex combination of model predictions is used to combine surrogate models into one more accurate ensemble surrogate model. This approach is studied in a fundamental way, by first evaluating minimalistic ensembles of only two surrogate models in detail and then proceeding to ensembles with more surrogate models. Finally, the approach is adopted and evaluated in the context of sequential parameter optimization. Besides discussing the general strategy, the optimal frequency of learning the convex combination weights is investigated. The results provide insights into the performance, scalability, and robustness of the approach.
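For the minimalistic two-model case discussed above, the convex combination reduces to a single weight w in [0, 1]. The sketch below finds it by a simple grid search minimizing squared training error; the grid search and the toy models are illustrative assumptions, not the optimization procedure used in the thesis:

```python
import numpy as np

def convex_ensemble_weight(preds_a, preds_b, targets, grid=101):
    """Find w in [0, 1] minimizing the squared error of the ensemble
    prediction w * preds_a + (1 - w) * preds_b against the targets."""
    best_w, best_err = 0.0, float("inf")
    for w in np.linspace(0.0, 1.0, grid):
        err = np.mean((w * preds_a + (1 - w) * preds_b - targets) ** 2)
        if err < best_err:
            best_w, best_err = w, err
    return best_w

# Two hypothetical surrogate models with complementary biases
targets = np.array([1.0, 2.0, 3.0, 4.0])
preds_a = targets + 0.5   # model A systematically overestimates
preds_b = targets - 0.5   # model B systematically underestimates
w = convex_ensemble_weight(preds_a, preds_b, targets)
# With perfectly complementary errors, the best ensemble averages both models
```

Because the errors here cancel exactly at w = 0.5, the ensemble outperforms either single model, which is the intuition behind generalizing model selection to convex combinations.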
A P300-based Brain-Computer Interface character speller, also known as a P300 speller, has been an important communication pathway, under extensive research, for people who have lost motor ability, such as patients with Amyotrophic Lateral Sclerosis or spinal-cord injury, because a P300 speller allows human beings to directly spell characters using eye gazes, thereby building communication between the human brain and a computer. Unfortunately, P300 spellers are still not used in daily life and remain in an experimental stage at research labs. The reason is that the performance and efficiency of current P300 spellers are unacceptably low for BCI users in their daily life. Therefore, in this thesis, we have focused our attention on developing high-performance and efficient P300 spellers in order to bring them into practical use. More specifically, in order to increase the performance of a P300 speller, we have developed methods to increase the character spelling accuracy and the Information Transfer Rate. In order to improve the efficiency of a P300 speller, we have developed methods to reduce the number of sensors needed to acquire EEG signals as well as to reduce the complexity of the classifier used in a P300 speller without losing performance.
The high-velocity tail of the total velocity distribution of stars provides essential insight into fundamental properties of the Galaxy. Hypervelocity stars (HVSs), travelling on unbound orbits coming from the Galactic Centre, are powerful tracers of the underlying Galactic gravitational potential, and can shed light on the stellar population in the proximity of our massive black hole. Runaway stars, ejected from the stellar disk, provide information on binary evolution and the dynamical processes of stellar clusters. The advent of data from the European Space Agency satellite Gaia has revolutionized our knowledge of high-velocity stars. In my PhD thesis, entitled "Hunting for the fastest stars in the Milky Way", I present my work on searching for the fastest stars in the Milky Way. I start by presenting our modelling work on predicting the HVS population expected to be contained in the Gaia catalogue. Then I illustrate the data mining techniques built and implemented to find these rare objects in the first and second data releases of Gaia. Finally, I conclude by discussing how HVSs can be used to constrain the Galactic dark matter halo and the binary population in the Galactic Centre.
Mining time series is a machine learning subfield that focuses on a particular data structure, where variables are measured over (short or long) periods of time. In this thesis we focus on multivariate time series, with multiple variables measured over the same period of time. In most cases, such variables are collected at different sampling rates. When combined, these variables can be explored with machine learning methods for multiple purposes. Firstly, we consider the possibility of unsupervised learning. In this case, we propose a pattern recognition method that discovers subsets of variables that show consistent behavior in a number of shared time segments. Furthermore, in a supervised setting, given a dependent variable (target), we propose a method that aggregates independent variables into meaningful features. In addition to the methods above, we provide two tools in the form of Software as a Service, where users without a programming background can intuitively follow the learning and testing methodologies for both methods. Finally, we present an applied study of machine learning to improve the performance of speed skating athletes. Here, we perform a deep analysis of historical data in order to help optimize performance results.
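The idea of aggregating variables sampled at different rates into aligned features can be sketched as follows; the window sizes, summary statistics, and sensor rates here are hypothetical illustrations, not the specific method of the thesis:

```python
import numpy as np

def window_features(series, window):
    """Aggregate a raw series into per-window summary features
    (mean, standard deviation, range), one row per time segment."""
    n = len(series) // window
    feats = []
    for i in range(n):
        seg = np.asarray(series[i * window:(i + 1) * window], dtype=float)
        feats.append((seg.mean(), seg.std(), seg.max() - seg.min()))
    return feats

# Two variables measured over the same period at different rates
# (hypothetical sensors): 100 samples vs 10 samples.
fast = list(range(100))   # high-rate sensor
slow = list(range(10))    # low-rate sensor
# Choosing windows of 10 and 1 samples aligns both to 10 segments,
# giving one fixed-size feature row per segment for supervised learning.
rows = list(zip(window_features(fast, 10), window_features(slow, 1)))
```

Aligning on shared time segments is what lets variables with different sampling rates be combined into a single feature table for a target variable.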
Many databases do not consist of a single table of fixed dimensions, but of objects that are related to each other: the databases are relational, or structured. We study the discovery of patterns in such data. In our approach, a data analyst specifies constraints on patterns that she believes to be of interest, and the computer searches for patterns that satisfy these constraints. An important constraint on which we focus is the constraint that a pattern should have a significant number of occurrences in the data. Constraints like this allow the search to be performed reasonably efficiently. We develop algorithms for searching patterns that are represented in formal first-order logic, tree data structures, and graph data structures. We perform experiments in which these algorithms, and algorithms proposed by other researchers, are compared with each other, and study which properties determine the efficiency of the algorithms. As a result, we are able to develop more efficient algorithms. As an application, we study the discovery of fragments in molecular datasets. The aim is to discover fragments that relate the structure of molecules to their activity.
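The minimum-occurrence (support) constraint can be illustrated with a deliberately simplified sketch over itemsets; the thesis itself works with logic, tree, and graph patterns, and the "fragment" labels below are hypothetical:

```python
from itertools import combinations

def frequent_patterns(transactions, min_support):
    """Count occurrences of candidate item pairs and keep only those
    meeting the minimum-support constraint. Pruning by support is what
    makes the pattern search reasonably efficient."""
    counts = {}
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Toy data: each set stands in for the fragments of one molecule
data = [
    {"C", "O", "H"},
    {"C", "O"},
    {"C", "N"},
]
result = frequent_patterns(data, min_support=2)
# Only the pair ("C", "O") occurs often enough to survive the constraint
```

Because support can only shrink as a pattern grows, infrequent patterns need never be extended, which is the key to the efficiency of this family of algorithms.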