A data mining scenario is a logical sequence of steps to infer patterns from data. In this thesis, we present two scenarios. Our first scenario aims to identify homogeneous subtypes in data. It was... Show moreA data mining scenario is a logical sequence of steps to infer patterns from data. In this thesis, we present two scenarios. Our first scenario aims to identify homogeneous subtypes in data. It was applied to clinical research on Osteoarthritis (OA) and Parkinson’s disease (PD) and in drug discovery. Thus, because OA and PD are characterized by clinical heterogeneity, a more sensitive classification of the cohort of patients may contribute to the search for the underlying diseases mechanism. In drug discovery, subtyping may improve the understanding of the similarity (and distance) between different phenotypic effects as induced by drugs and chemicals. Our second scenario aims to compare text classification algorithms. First, we show that common classifiers achieve comparable performance on most problems. Second, tightly constrained SVM solutions are high performers. In that situation, most training documents are bounded support vectors, SVM reduces to a nearest mean classifier and no training is necessary, which raises a question on SVM merits in sparse bag of words feature spaces. Also, SVM is shown to suffer from performance deterioration for particular combinations of training set size/number of features. This relate to outlying documents of distinct classes overlapping in the feature space. Show less
Many databases do not consist of a single table of fixed dimensions, but of objects that are related to each other: the databases are relational, or structured. We study the discovery of patterns... Show moreMany databases do not consist of a single table of fixed dimensions, but of objects that are related to each other: the databases are relational, or structured. We study the discovery of patterns in such data. In our approach, a data analyst specifies constraints on patterns that she believes to be of interest, and the computer searches for patterns that satisfy these constraints. An important constraint on which we focus, is the constraint that a pattern should have a significant number of occurrences in the data. Constraints like this allow the search to be performed reasonably efficiently. We develop algorithms for searching ppatterns taht are represented in formal first order logic, tree data structures and graph data structures. We perform experiments in which these algorithms, and algorithms proposed by other researchers, are compared with each other, and study which properties determine the efficiency of the algorithms. As a result, we are able to develop more efficient algorithms. As application we study the discovery of fragments in molecular datasets. The aim is to discover fragments that relate the structure of molecules to their activity. Show less
In this thesis novel statistical methods are developed for the analysis of high dimensional microarray data. In short: Chapter 1 gives an overview of the most important research methods developed... Show moreIn this thesis novel statistical methods are developed for the analysis of high dimensional microarray data. In short: Chapter 1 gives an overview of the most important research methods developed so far. Chapter 2 describes a method for testing association of the expression of gene sets (pathways) with a patient level response variable, which can be continuous or two-valued. Chapter 3 extends the methodology of chapter 2 to survival as a response variable. Chapter 4 presents a goodness-of-fit test for the multinomial regression model, which can be used to extend the methodology of chapter 2 to multi-valued outcomes. Chapter 5 presents a general theoretical framework in for the tests of chapters 2-4 and derives optimality properties for these tests. Chapter 6 presents a method for predicting a response variable from high dimensional data, based on latent variables. Chapter 7 presents a visualization tool for improved presentation of scatterplots with many thousands of dots. Show less