Motivation: Beyond identifying genetic variants, we introduce a set of Boolean relations, which allows for a comprehensive classification of the relations of every pair of variants by taking all minimal alignments into account. We present an efficient algorithm to compute these relations, including a novel way of efficiently computing all minimal alignments within the best theoretical complexity bounds. Results: We show that these relations are common, and many non-trivial, for variants of the CFTR gene in dbSNP. Ultimately, we present an approach for the storing and indexing of variants in the context of a database that enables efficient querying for all these relations. Availability and implementation: A Python implementation is available at as well as an interface at .
MOTIVATION: Genome-scale metabolic reconstructions have been assembled for thousands of organisms using a wide range of tools. However, metabolite annotations, required to compare and link metabolites between reconstructions, remain incomplete. Here, we aim to further extend metabolite annotation coverage using various databases and chemoinformatic approaches. RESULTS: We developed a COBRA toolbox extension, deemed MetaboAnnotator, which facilitates the comprehensive annotation of metabolites with database independent and dependent identifiers, obtains molecular structure files, and calculates metabolite formula and charge at pH 7.2. The resulting metabolite annotations allow for subsequent cross-mapping between reconstructions and mapping of, e.g., metabolomic data. AVAILABILITY AND IMPLEMENTATION: MetaboAnnotator and tutorials are freely available at https://github.com/opencobra. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Bizzarri, D.; Reinders, M.J.T.; Beekman, M.; Slagboom, P.E.; Akker, E.B. van den 2022
Motivation: 1H-NMR metabolomics is rapidly becoming a standard resource in large epidemiological studies to acquire metabolic profiles in large numbers of samples in a relatively low-priced and standardized manner. Concomitantly, metabolomics-based models that capture disease risk or clinical risk factors are increasingly being developed. These developments raise the need for a user-friendly toolbox to inspect new 1H-NMR metabolomics data and project a wide array of previously established risk models. Results: We present MiMIR (Metabolomics-based Models for Imputing Risk), a graphical user interface that provides an intuitive framework for ad hoc statistical analysis of Nightingale Health's 1H-NMR metabolomics data and allows for the projection and calibration of 24 pre-trained metabolomics-based models, without any prerequisite programming knowledge. Availability and implementation: The R-shiny package is available in CRAN or downloadable at , together with an extensive user manual (also available as Supplementary Documents to the article).
Motivation: Batch effects heavily impact results in omics studies, causing bias and false positive results, but software to control them preemptively is lacking. Sample randomization prior to measurement is vital for minimizing these effects, but current approaches are often ad hoc, poorly documented and ill-equipped to handle multiple batches and outcomes. Results: We developed Omixer, a Bioconductor package implementing multivariate and reproducible sample randomization for omics studies. It proactively counters correlations between technical factors and biological variables of interest by optimizing sample distribution across batches.
Lefter, M.; Vis, J.K.; Vermaat, M.; Dunnen, J.T. den; Taschner, P.E.M.; Laros, J.F.J. 2021
Motivation: Unambiguous variant descriptions are of utmost importance in clinical genetic diagnostics, scientific literature and genetic databases. The Human Genome Variation Society (HGVS) publishes a comprehensive set of guidelines on how variants should be correctly and unambiguously described. We present the implementation of the Mutalyzer 2 tool suite, designed to automatically apply the HGVS guidelines so users do not have to deal with the HGVS intricacies explicitly to check and correct their variant descriptions. Results: Mutalyzer is widely used by the community, having processed over 133 million descriptions since its launch. Over a five-year period, Mutalyzer reported a correct input in ~50% of cases. In 41% of the cases either a syntactic or semantic error was identified and for ~7% of cases, Mutalyzer was able to automatically correct the description.
Summary: In mass spectrometry-based proteomics, accurate peptide masses improve identifications, alignment and quantitation. Getting the most out of any instrument therefore requires proper calibration. Here, we present a new stand-alone software, mzRecal, for universal automatic recalibration of data from all common mass analyzers using standard open formats and based on physical principles.
Motivation: The laboratory mouse is the most used animal model in biological research, largely due to its highly conserved synteny with human. Researchers use mice to answer various questions ranging from determining the pathological effect of a knocked-out/in gene to understanding drug metabolism. Our group developed >5000 quantitative targeted proteomics assays for 20 mouse tissues and determined the concentration ranges of a total of >1600 proteins using heavy labeled internal standards. We describe here MouseQuaPro, a knowledgebase that hosts this collection of carefully curated experimental data. Results: The web-based application includes protein concentrations from >700 mouse tissue samples from three common research strains, corresponding to >200k experimentally determined concentrations. The knowledgebase integrates the assay and protein concentration information with their human orthologs, functional and molecular annotations, biological pathways, related human diseases and known gene expressions. At its core are the protein concentration ranges, which provide insights into (dis)similarities between tissues, strains and sexes. MouseQuaPro implements advanced search as well as filtering functionalities with a simple interface and interactive visualization. This information-rich resource provides an initial map of protein absolute concentration in mouse tissues and allows guided design of proteomics phenotyping experiments. The knowledgebase is available at mousequapro.proteincentre.com.
Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.
Ubels, J.; Schaefers, T.; Punt, C.; Guchelaar, H.J.; Ridder, J. de 2020
Motivation: When phase III clinical drug trials fail their endpoint, enormous resources are wasted. Moreover, even if a clinical trial demonstrates a significant benefit, the observed effects are often small and may not outweigh the side effects of the drug. Therefore, there is a great clinical need for methods to identify genetic markers that can identify subgroups of patients which are likely to benefit from treatment as this may (i) rescue failed clinical trials and/or (ii) identify subgroups of patients which benefit more than the population as a whole. When single genetic biomarkers cannot be found, machine learning approaches that find multivariate signatures are required. For single nucleotide polymorphism (SNP) profiles, this is extremely challenging owing to the high dimensionality of the data. Here, we introduce RAINFOREST (tReAtment benefIt prediction using raNdom FOREST), which can predict treatment benefit from patient SNP profiles obtained in a clinical trial setting. Results: We demonstrate the performance of RAINFOREST on the CAIRO2 dataset, a phase III clinical trial which tested the addition of cetuximab treatment for metastatic colorectal cancer and concluded there was no benefit. However, we find that RAINFOREST is able to identify a subgroup comprising 27.7% of the patients that do benefit, with a hazard ratio of 0.69 (P = 0.04) in favor of cetuximab. The method is not specific to colorectal cancer and could aid in reanalysis of clinical trial data and provide a more personalized approach to cancer treatment, also when there is no clear link between a single variant and treatment benefit.
Abdelaal, T.; Raadt, P. de; Lelieveldt, B.P.F.; Reinders, M.J.T.; Mahfouz, A. 2020
Motivation: Single-cell data measure multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Clustering tools based on dimensionality reduction are either not scalable to large datasets containing millions of cells, or not fully automated, requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data, but scale poorly to large datasets. Results: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST.
MOTIVATION: Microbial communities drive matter and energy transformations integral to global biogeochemical cycles, yet many taxonomic groups facilitating these processes remain poorly represented in biological sequence databases. Due to this missing information, taxonomic assignment of sequences from environmental genomes remains inaccurate. RESULTS: We present the Tree-based Sensitive and Accurate Phylogenetic Profiler (TreeSAPP) software for functionally and taxonomically classifying genes, reactions and pathways from genomes of cultivated and uncultivated microorganisms using reference packages representing coding sequences mediating multiple globally relevant biogeochemical cycles. TreeSAPP uses linear regression of evolutionary distance on taxonomic rank to improve classifications, assigning both closely related and divergent query sequences at the appropriate taxonomic rank. TreeSAPP is able to provide quantitative functional and taxonomic classifications for both assembled and unassembled sequences and files supporting interactive tree of life visualizations. AVAILABILITY AND IMPLEMENTATION: TreeSAPP was developed in Python 3 as an open-source Python package and is available on GitHub at https://github.com/hallamlab/TreeSAPP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Mass spectrometry imaging (MSI) can reveal biochemical information directly from a tissue section. MSI generates a large quantity of complex spectral data which is still challenging to translate into relevant biochemical information. Here, we present rMSIproc, an open-source R package that implements a full data processing workflow for MSI experiments performed using TOF or FT-based mass spectrometers. The package provides a novel strategy for spectral alignment and recalibration, which allows multiple datasets to be processed simultaneously. This enables confident statistical analysis of multiple datasets from one or several experiments. rMSIproc is designed to work with files larger than the computer memory capacity and the algorithms are implemented using a multi-threading strategy. rMSIproc is a powerful tool able to take full advantage of modern computer systems to realize the full potential of MSI.
Motivation: To facilitate accurate estimation of statistical significance of sequence similarity in profile-profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus, proteins are commonly used as queries, which complicates establishing homology for similarities close to cutoff levels of statistical significance. Results: In this article, we describe an iterative approach, called LAMPA, LArge Multidomain Protein Annotator, that resolves the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used the Pfam database for annotating 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: number of and coverage by identified homologous regions, and number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins. This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized the obtained results based on dependencies of HHsearch hit statistical significance for local alignment similarity score from lengths and diversities of query-target pairs in computational experiments.
Cordes, M.; Pike-Overzet, K.; Eggermond, M. van; Vloemans, S.; Baert, M.R.; Garcia Perez, L.; ... ; Akker, E. van den 2020
Summary: An effective immune system is characterized by a diverse immune repertoire. There is a strong demand for accurate and quantitative methods to assess the diversity of the immune repertoire for various (pre-)clinical applications, including the diagnosis and prognosis of primary immune deficiencies, or to assess the response to therapy. Current strategies for immune diversity assessment generally comprise the visual inspection of the length distribution of rearranged T- and B-cell receptors. Visual inspections, however, are prone to subjective assessments and thus lead to biases. Here, we introduce ImSpectR, a unified approach to quantify immunodiversity using either spectratype, repertoire sequencing or single cell RNA sequencing data. ImSpectR scores various types of deviations from the expected length distribution and integrates these into one measure, allowing for robust quantitative comparisons of immune diversity across individuals or conditions.
Summary: The NanoString™ nCounter® is a platform for the targeted quantification of expression data in biofluids and tissues. While software by the manufacturer is available in addition to third-party packages, they do not provide a complete quality control (QC) pipeline. Here, we present NACHO ('NAnostring quality Control dasHbOard'), a comprehensive QC R-package. The package consists of three subsequent steps: summarize, visualize and normalize. The summarize function collects all the relevant data and stores it in a tidy format; the visualize function initiates a dashboard with plots of the relevant QC outcomes. It contains QC metrics that are measured by default by the manufacturer, but also calculates other insightful measures, including the scaling factors that are needed in the normalization step. In this normalization step, different normalization methods can be chosen to optimally preprocess data. Together, NACHO is a comprehensive method that optimizes insight and preprocessing of nCounter® data.
Abdelaal, T.; Hollt, T.; Unen, V. van; Lelieveldt, B.P.F.; Koning, F.; Reinders, M.J.T.; Mahfouz, A. 2019
Motivation: Developing a robust and performant data analysis workflow that integrates all necessary components whilst still being able to scale over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed using the Kubernetes container orchestrator. Results: We developed a Virtual Research Environment (VRE) which facilitates rapid integration of new tools and developing scalable and interoperable workflows for performing metabolomics data analysis. The environment can be launched on-demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and workflows can be re-used effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry, one nuclear magnetic resonance spectroscopy and one fluxomics study. We showed that the method scales dynamically with increasing availability of computational resources. We demonstrated that the method facilitates interoperability using integration of the major software suites resulting in a turn-key workflow encompassing all steps for mass-spectrometry-based metabolomics including preprocessing, statistics and identification. Microservices is a generic methodology that can serve any scientific discipline and opens up new types of large-scale integrative science. Availability and implementation: The PhenoMeNal consortium maintains a web portal (https://portal.phenomenal-h2020.eu) providing a GUI for launching the Virtual Research Environment. The GitHub repository https://github.com/phnmnl/ hosts the source code of all projects.
Palmblad, M.; Lamprecht, A.L.; Ison, J.; Schwammle, V. 2019