Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two DNA sequences. Our algorithm is able to compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time, and its resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers. Next, we adapted our method for the extraction of descriptions for protein sequences, in particular for describing frame-shifted variants. We propose an addition to the HGVS nomenclature for accommodating the (complex) frame-shifted variants that can be described with our method. Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant notation that can be added to the existing HGVS nomenclature. The final chapter takes an explorative approach to classification in large cohort studies. We provide a "cross-sectional" investigation of this data to see the relative power of the different groups.
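To illustrate the core idea of deriving a description from a pair of sequences, here is a minimal Python sketch. It is not the thesis algorithm (which scales to chromosomes and covers many more variant types); it merely trims the common prefix and suffix of the two sequences and reports the remaining difference in simplified HGVS-like notation.

```python
def simple_hgvs(reference: str, observed: str) -> str:
    """Describe `observed` relative to `reference` in simplified
    HGVS-like notation. Illustration only: trims the common
    prefix/suffix and reports a single substitution, deletion,
    insertion, or deletion-insertion."""
    if reference == observed:
        return "="  # no change
    # Trim the common prefix.
    p = 0
    while p < len(reference) and p < len(observed) and reference[p] == observed[p]:
        p += 1
    # Trim the common suffix (without overlapping the prefix).
    s = 0
    while (s < len(reference) - p and s < len(observed) - p
           and reference[-1 - s] == observed[-1 - s]):
        s += 1
    ref_part = reference[p:len(reference) - s]
    obs_part = observed[p:len(observed) - s]
    start = p + 1                      # HGVS positions are 1-based
    end = len(reference) - s
    if len(ref_part) == 1 and len(obs_part) == 1:
        return f"{start}{ref_part}>{obs_part}"          # substitution
    if not obs_part:                                    # deletion
        return f"{start}_{end}del" if end > start else f"{start}del"
    if not ref_part:                                    # insertion between p and p+1
        return f"{p}_{p + 1}ins{obs_part}"
    loc = f"{start}_{end}" if end > start else f"{start}"
    return f"{loc}delins{obs_part}"                     # deletion-insertion
```

For example, `simple_hgvs("ATGC", "ATCC")` yields `3G>C`, and `simple_hgvs("ATC", "ATGC")` yields `2_3insG`. A real extraction algorithm must in addition choose among many equivalent descriptions to produce a small, canonical one.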
Many scientists focus on building models; indeed, we process nearly all information we perceive into models. There are many techniques that enable computers to build models as well. The field of research that develops such techniques is called Machine Learning. Much research is devoted to developing computer programs capable of building models (algorithms). Many such algorithms exist, and they often have various options that subtly influence performance (parameters). Furthermore, there is mathematical proof that no single algorithm works well on every dataset. This complicates the task of selecting the right algorithm for a given task. The field of meta-learning aims to resolve these problems; its purpose is to determine what kinds of algorithms work well on which datasets. To this end, we developed OpenML, an online database on which researchers can share experimental results with each other, potentially scaling up the size of meta-learning studies. With earlier experimental results freely accessible and reusable by others, it is no longer necessary to conduct time-consuming experiments; rather, researchers can answer such experimental questions with a simple database look-up. This thesis addresses how OpenML can be used to answer fundamental meta-learning questions.
We describe a formal notation for DNA molecules that may contain nicks and gaps. The resulting DNA expressions denote formal DNA molecules. Different DNA expressions may denote the same molecule. Such DNA expressions are called equivalent. We examine which DNA expressions are minimal, which means that they have the shortest length among all equivalent DNA expressions. Among others, we describe how to construct a minimal DNA expression for a given molecule. We also present an efficient, recursive algorithm to rewrite a given DNA expression into an equivalent, minimal DNA expression. For many formal DNA molecules, there exists more than one minimal DNA expression. We define a minimal normal form, i.e., a set of properties such that for each formal DNA molecule, there is exactly one (minimal) DNA expression with these properties. We finally describe an efficient, two-step algorithm to rewrite an arbitrary DNA expression into this normal form.
This thesis is about algorithms for analyzing large real-world graphs (or networks). Examples include (online) social networks, webgraphs, information networks, biological networks and scientific collaboration and citation networks. Although these graphs differ in terms of what kind of information the objects and relationships represent, it turns out that the structure of each of these networks is surprisingly similar. For computer scientists, there is an obvious challenge to design efficient algorithms that allow large graphs to be processed and analyzed in a practical setting, facing the challenges of processing millions of nodes and billions of edges. Specifically, there is an opportunity to exploit the non-random structure of real-world graphs to efficiently compute or approximate various properties and measures that would be too hard to compute using traditional graph algorithms. Examples include computation of node-to-node distances and extreme distance measures such as the exact diameter and radius of a graph.
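One way such structure can be exploited is by maintaining lower and upper bounds on node eccentricities: each breadth-first search tightens the bounds of all nodes at once, and nodes that can no longer determine the diameter are pruned. The sketch below (a simplified illustration, not the thesis's exact method) computes the exact diameter of a connected unweighted graph this way; on real-world graphs, far fewer searches are typically needed than one per node.

```python
from collections import deque

def bfs_distances(adj, src):
    """Distances from src by breadth-first search (connected graph assumed)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def exact_diameter(adj):
    """Exact diameter via eccentricity bounds.

    After a BFS from w with eccentricity ecc, every node v satisfies
        max(dist[v], ecc - dist[v]) <= e(v) <= ecc + dist[v].
    A node whose upper bound drops to at most the current diameter lower
    bound cannot be a diameter endpoint and is pruned."""
    nodes = list(adj)
    lo = {v: 0 for v in nodes}            # lower bounds on eccentricity
    hi = {v: len(nodes) for v in nodes}   # upper bounds on eccentricity
    candidates = set(nodes)
    d_lo = 0                              # lower bound on the diameter
    while candidates:
        # Heuristic: probe the candidate with the largest upper bound.
        w = max(candidates, key=lambda v: hi[v])
        dist = bfs_distances(adj, w)
        ecc = max(dist.values())
        d_lo = max(d_lo, ecc)
        for v in nodes:
            lo[v] = max(lo[v], dist[v], ecc - dist[v])
            hi[v] = min(hi[v], ecc + dist[v])
        # Prune nodes that can no longer raise the diameter above d_lo.
        candidates = {v for v in candidates if hi[v] > d_lo}
    return d_lo
```

Each iteration prunes at least the probed node itself (its bounds become exact), so the procedure terminates; on graphs with strong core-periphery structure, a handful of searches usually suffices.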
Evolution and Interaction are two processes in Computer Science that are used in many algorithms to create, shape, find and optimize solutions to real-world problems. Evolution has been very successfully applied as a powerful tool to solve complex search problems in fields ranging from physics, chemistry and biology all the way to commercial applications such as aircraft fuselage design and civil engineering grading plans. Defining interaction is a big part of algorithm design: not only defining the inputs and outputs of an algorithm, but for a complex algorithm the interactions inside the algorithm are just as important. This thesis concentrates on where Evolution overlaps Interaction. It shows how evolution can be used to evolve interaction, how the interaction inside an evolutionary algorithm impacts its performance, and how an evolutionary algorithm can interact with humans. By touching on these three forms of overlap, this thesis tries to give insight into the world of evolution and interaction.
The increase in capabilities of information technology of the last decade has led to a large increase in the creation of raw data. Data mining, a form of computer-guided, statistical data analysis, attempts to draw knowledge from these sources that is usable, human-understandable and was previously unknown. One of the potential application domains is that of law enforcement. This thesis describes a number of efforts in this direction and reports on the results reached on the application of its resulting algorithms on actual police data. The usage of specifically tailored data mining algorithms is shown to have a great potential in this area, which forebodes a future where algorithmic assistance in "combating" crime will be a valuable asset.
Many databases do not consist of a single table of fixed dimensions, but of objects that are related to each other: the databases are relational, or structured. We study the discovery of patterns in such data. In our approach, a data analyst specifies constraints on patterns that she believes to be of interest, and the computer searches for patterns that satisfy these constraints. An important constraint on which we focus is the constraint that a pattern should have a significant number of occurrences in the data. Constraints like this allow the search to be performed reasonably efficiently. We develop algorithms for searching patterns that are represented in formal first-order logic, tree data structures and graph data structures. We perform experiments in which these algorithms, and algorithms proposed by other researchers, are compared with each other, and study which properties determine the efficiency of the algorithms. As a result, we are able to develop more efficient algorithms. As an application, we study the discovery of fragments in molecular datasets. The aim is to discover fragments that relate the structure of molecules to their activity.
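The reason a minimum-frequency constraint makes the search tractable is anti-monotonicity: every extension of an infrequent pattern is itself infrequent, so infrequent patterns need never be extended. The sketch below illustrates this levelwise pruning on plain itemsets; the thesis applies the same principle to the much richer pattern languages of first-order logic, trees and graphs, but itemsets keep the illustration short.

```python
def frequent_itemsets(transactions, min_support):
    """Levelwise pattern search with a minimum-frequency constraint.

    Only frequent patterns are extended: by anti-monotonicity, any
    superset of an infrequent pattern is also infrequent, so the rest
    of the search space is pruned. Patterns are tuples of items kept
    in sorted order so each candidate is generated exactly once."""
    items = sorted({i for t in transactions for i in t})

    def support(pattern):
        return sum(1 for t in transactions if set(pattern) <= t)

    level = [(i,) for i in items if support((i,)) >= min_support]
    frequent = list(level)
    while level:
        nxt = set()
        for pattern in level:           # extend only frequent patterns
            for i in items:
                if i > pattern[-1]:     # keep items sorted: no duplicates
                    cand = pattern + (i,)
                    if support(cand) >= min_support:
                        nxt.add(cand)
        level = sorted(nxt)
        frequent.extend(level)
    return frequent
```

For example, with `transactions = [{'a','b','c'}, {'a','b'}, {'a','c'}, {'b','c'}]` and a minimum support of 2, all singletons and pairs are returned, while `('a','b','c')` (support 1) is pruned without its own extensions ever being counted.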