Machine Learning for Microbial Phenotype Prediction

Author: Roman Feldbauer
Publisher: Springer
Total Pages: 116
Release: 2016-06-15
Genre: Science
ISBN: 3658143193



This thesis presents a scalable, generic methodology for microbial phenotype prediction based on supervised machine learning, several models for biological and ecological traits of high relevance, and their deployment in metagenomic datasets. The results suggest that the presented prediction tool can be used to automatically annotate phenotypes in near-complete microbial genome sequences, as generated in large numbers in current metagenomic studies. Unraveling relationships between a living organism's genetic information and its observable traits is a central biological problem. Phenotype prediction facilitated by machine learning techniques will be a major step toward creating biological knowledge from big data.

Improvements in Machine Learning for Predicting Taxon, Phenotype and Function from Genetic Sequences

Author: Zhengqiao Zhao
Publisher:
Total Pages: 219
Release: 2020
Genre: Bioinformatics
ISBN:



Advances in DNA sequencing, as well as the rise of shotgun metagenomics and metabolomics, are rapidly producing complex microbiome datasets for studies of human health and the environment. The large-scale sampling of DNA/RNA from microbes provides a window into the microbiome's interactions with its host and habitat, enables us to predict phenotypic traits of the host/microbiome, aids the discovery of emergent biological function, and supports medical diagnosis. Researchers try to extract features from DNA/RNA sequencing data and make 1) taxonomic predictions ("Who is there"), 2) function annotations ("What they are doing") and 3) host/microbiome phenotype predictions. This work explores different computational methods to address challenges in these three fields. First, taxonomic classification relies on NCBI RefSeq database sequences, which are being added at an exponential rate. Therefore, the incremental learning concept is especially important. Although the incremental naive Bayes classifier (NBC) is a decade-old concept, it has not been applied to taxonomic classification in the metagenomics field. In this work, I compare the classification accuracy and runtime of the proposed incremental learning implementation of NBC with the performance of the traditional implementation of NBC and demonstrate a proof of concept of how incremental learning can make taxonomic classification much more efficient in its training process, significantly reducing computation while maintaining accuracy. In addition to predicting taxonomic labels for metagenomic samples, researchers are also interested in identifying different subtypes of a single virus, since mutations can be introduced during transmission. "Oligotyping" is an entropy analysis tool developed for subtyping taxonomic units based on 16S rRNA sequences. "Oligotyping" was formulated because the 16S rRNA gene is highly conserved and there are only a few mutations in the 16S rRNA gene for some lineages.
The SARS-CoV-2 genome, being months old, also has a relatively small number of mutations. Therefore, the entropy analysis developed for 16S rRNA sequences can be adapted for SARS-CoV-2 viral genome subtyping. However, other researchers were only looking at sequence similarity (and subsequent trees) or important single nucleotide variants individually between the genomes. To my knowledge, I am the first to draw on the "Oligotyping" concept to group mutations as a "barcode" of the viral genome and extend it to define subtypes for SARS-CoV-2 viral genomes. I further add error correction to account for ambiguities in the sequences and, optionally, apply further compression by identifying patterns of base entropy correlation. I demonstrate its application in spatiotemporal analyses of real-world SARS-CoV-2 sequences responsible for the COVID-19 pandemic. My method is validated by comparing the subtypes defined to similar subtypes discovered in other literature. Third, microbial survey data is not used efficiently for phenotype prediction. For example, a precise Crohn's disease prediction model can help diagnostics given stool samples collected from subjects. To predict Crohn's disease (or another phenotype) from microbiome composition, researchers usually start by grouping sequences that look similar together into an Operational Taxonomic Unit (OTU) or Amplicon Sequence Variant (ASV) and subsequently learn samples by examining OTU occurrences in different phenotypes. However, only looking at sequence similarity ignores the sequential information contained in DNA sequences. Bioinformatics has been inspired by successes in deep learning applications in Natural Language Processing (NLP). Both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been utilized to learn DNA sequential information for applications such as transcription factor binding site classification.
In my work, I propose to adapt deep learning architectures (such as RNNs and the attention mechanism) that have been widely used in NLP to develop a "phenotype" classifier. This Read2Pheno classifier can predict "phenotype" based on 16S rRNA reads. I demonstrate how the sequential information learned by the proposed model can provide insights on informative regions in DNA sequences/reads while making accurate predictions. The model is validated by comparing its accuracy with other baseline methods such as a random forest model trained with various features (standard OTU/ASV table and k-mers). Fourth, there have been different deep learning based functional annotation models proposed recently. However, these models can only output one class of function annotation predictions, such as Gene Ontology (GO). It would be convenient to have a tool that can output function predictions for more than one function annotation database. In this work, I first extend the proposed Read2Pheno model to a function prediction model, AttentionGO, and compare the performance with both alignment based and deep learning based models to show that the proposed model can achieve comparable performance with additional interpretability. Second, I explore the possibility of using the proposed AttentionGO classifier in a multi-task learning model to predict three branches of GO terms and KEGG Orthology terms simultaneously. The multi-task learning model is compared with single-task models trained with individual tasks to demonstrate performance improvement.
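
The entropy-based "barcode" idea in the abstract above can be illustrated with a small sketch. This is not the author's implementation: the function names, the 0.5-bit threshold, and the toy aligned sequences are all assumptions made for illustration, and the error-correction and compression steps mentioned in the abstract are left out. The sketch computes the Shannon entropy of each alignment column, keeps the high-entropy positions, and uses the bases at those positions as a subtype label.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of one alignment column, skipping non-ACGT symbols."""
    counts = Counter(base for base in column if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def informative_positions(aligned, threshold=0.5):
    """Indices of columns whose entropy exceeds the (assumed) threshold."""
    length = len(aligned[0])
    return [
        i for i in range(length)
        if column_entropy([seq[i] for seq in aligned]) > threshold
    ]


def barcode(seq, positions):
    """Concatenate the bases found at the informative positions."""
    return "".join(seq[i] for i in positions)


# Toy, hypothetical aligned sequences (not real viral genomes).
aligned = ["ACGTAC", "ACGTAC", "ATGTGC", "ATGTGC"]
positions = informative_positions(aligned)
subtypes = Counter(barcode(seq, positions) for seq in aligned)
```

In this toy alignment only two columns vary, so each sequence is reduced to a two-base label and sequences with the same label fall into the same subtype.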

Benchmarking Continuous Phenotype Prediction with Multi-omic Microbiome Data

Author: Patrick Imran McGrath
Publisher:
Total Pages: 32
Release: 2021
Genre:
ISBN:



Large-scale microbiome datasets from 16S amplicon sequencing provide opportunities for building predictive models with supervised machine learning to answer questions of biological significance. Prior regression analyses have used supervised learning to predict variables of the sampled microbial environment, such as pH, host age, or other host phenotypes and disease states; however, little justification has been given for the use of specific algorithms on microbiome data. We performed a large-scale comprehensive benchmark of 11 regression algorithms across an exhaustive grid search for tuning algorithm hyperparameters, in three large human datasets: The National FINRISK Study, Study of Latinos, and International Multiple Sclerosis Microbiome Study. We found that ensemble-based algorithms consistently performed the best, confirming prior analyses' use of ensemble algorithms such as Random Forests. For the most accurate ensemble algorithms, we analyzed the best hyperparameters from our grid search to produce a set of hyperparameters that we recommend be fixed at specific values. With those recommended hyperparameter settings, we observed no loss in accuracy and significant reductions in the runtime and computational expense of hyperparameter tuning. Our results suggest the feasibility of further streamlining the process of producing robust machine learning models specific to microbiome data. These results may generalize to compositional data obtained from other preparations, such as taxonomic profiles from shotgun metagenomic analyses, and an expansion of this work to include metagenomics profiles as well as other machine learning tasks presents an exciting opportunity.

Kernel Methods in Computational Biology

Author: Bernhard Schölkopf
Publisher: MIT Press
Total Pages: 428
Release: 2004
Genre: Computers
ISBN: 9780262195096



A detailed overview of current research in kernel methods and their application to computational biology.

Predicting "essential" Genes in Microbial Genomes

Author: Krishnaveni Palaniappan
Publisher:
Total Pages: 466
Release: 2010
Genre:
ISBN:



Essential genes constitute the minimal gene set of an organism that is indispensable for its survival under most favorable conditions. The problem of accurately identifying and predicting genes essential for survival of an organism has both theoretical and practical relevance in genome biology and medicine. From a theoretical perspective it provides insights in the understanding of the minimal requirements for cellular life and plays a key role in the emerging field of synthetic biology; from a practical perspective, it facilitates efficient identification of potential drug targets (e.g., antibiotics) in novel pathogens. However, characterizing essential genes of an organism requires sophisticated experimental studies that are expensive and time-consuming. The goal of this research study was to investigate machine learning methods to accurately classify/predict "essential genes" in newly sequenced microbial genomes based solely on their genomic sequence data. This study formulates the prediction of essential genes as a binary classification problem and systematically investigates the applicability of three different supervised classification methods for this task. In particular, Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN) based classifier models were constructed and trained on genomic features derived solely from gene sequence data of 14 experimentally validated microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence derived features (including gene and protein sequence features, protein physicochemical features and protein sub-cellular features) was used as input for the learners to learn the classifier models. The training and test datasets used in this study reflected between-class imbalance (i.e. skewed majority class vs. minority class) that is intrinsic to this data domain and essential genes prediction problem.
Two imbalance reduction techniques (homology reduction and random undersampling of 50% of the majority class) were devised without artificially balancing the datasets and compromising classifier generalizability. The classifier models were trained and evaluated using a 10-fold stratified cross validation strategy on both the full multi-genome datasets and their class-imbalance-reduced variants to assess their ability to discriminate essential genes from non-essential genes. In addition, the classifiers were also evaluated using novel blind testing strategies, called LOGO (Leave-One-Genome-Out) and LOTO (Leave-One-Taxon group-Out) tests, on carefully constructed held-out datasets (both genome-wise (LOGO) and taxonomic group-wise (LOTO)) that were not used in training of the classifier models. The prediction performance metrics accuracy, sensitivity, specificity, precision and area under the Receiver Operating Characteristic curve (AU-ROC) were assessed for DT, SVM and ANN derived models. Empirical results from 10 × 10-fold stratified cross validation, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon group-Out (LOTO) blind testing experiments indicate that SVM and ANN based models perform better than Decision Tree based models. On 10 × 10-fold cross validations, the SVM based models achieved an AU-ROC score of 0.80, while ANN and DT achieved 0.79 and 0.68 respectively. Both LOGO (genome-wise) and LOTO (taxon-wise) blind tests revealed the generalization extent of these classifiers across different genomes and taxonomic orders.
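
The Leave-One-Genome-Out evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's code: the function names, the toy genome labels, and the pairwise-comparison way of computing AU-ROC are all hypothetical choices. Each genome in turn is held out as the test set while genes from all other genomes form the training set; replacing genome labels with taxon-group labels would give the LOTO variant.

```python
from collections import defaultdict


def leave_one_genome_out(genome_ids):
    """Yield (held_out_genome, train_indices, test_indices), one split per genome."""
    by_genome = defaultdict(list)
    for index, genome in enumerate(genome_ids):
        by_genome[genome].append(index)
    for genome in sorted(by_genome):
        test = by_genome[genome]
        train = sorted(
            i for other, indices in by_genome.items() if other != genome
            for i in indices
        )
        yield genome, train, test


def auc_roc(labels, scores):
    """AU-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, hypothetical data: one genome label per gene.
genome_ids = ["g1", "g1", "g2", "g2", "g3"]
splits = list(leave_one_genome_out(genome_ids))

# Toy classifier scores for four genes (1 = essential, 0 = non-essential).
example_auc = auc_roc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Splitting by genome rather than by gene keeps every gene of the held-out organism out of training, which is what makes the test "blind" in the sense used above.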

Infections in Surgery

Author: Massimo Sartelli
Publisher: Springer Nature
Total Pages: 279
Release: 2021-01-29
Genre: Medical
ISBN: 3030621162



Although most clinicians are aware of the problem of antimicrobial resistance, most also underestimate its significance in their own hospital. The incorrect and inappropriate use of antibiotics and other antimicrobials, as well as poor prevention and poor control of infections, are contributing to the development of such resistance. Appropriate use of antibiotics and compliance with infection prevention and control measures should be integral aspects of good clinical practice and standards of care. However, these activities are often inadequate among clinicians, and there is a considerable gap between the best evidence and actual clinical practice. In hospitals, cultural determinants influence clinical practice, and improving behaviour in terms of infection prevention and antibiotics-prescribing practice remains a challenge. Despite evidence supporting the effectiveness of best practices, many clinicians fail to implement them, and evidence-based processes and practices that are known to optimize both the prevention and the treatment of infections tend to be underused. Addressing precisely this problem, this volume offers an essential toolkit for all surgeons and intensivists interested in improving their clinical practices.

Handbook of Statistical Bioinformatics

Author: Henry Horng-Shing Lu
Publisher: Springer Nature
Total Pages: 406
Release: 2022-12-08
Genre: Science
ISBN: 3662659026



Now in its second edition, this handbook collects authoritative contributions on modern methods and tools in statistical bioinformatics with a focus on the interface between computational statistics and cutting-edge developments in computational biology. The three parts of the book cover statistical methods for single-cell analysis, network analysis, and systems biology, with contributions by leading experts addressing key topics in probabilistic and statistical modeling and the analysis of massive data sets generated by modern biotechnology. This handbook will serve as a useful reference source for students, researchers and practitioners in statistics, computer science and biological and biomedical research, who are interested in the latest developments in computational statistics as applied to computational biology.

Inferring Phenotypes from Genotypes with Machine Learning

Author: Alexandre Drouin
Publisher:
Total Pages: 225
Release: 2019
Genre:
ISBN:



A thorough understanding of the relationship between the genomic characteristics of an individual (the genotype) and its biological state (the phenotype) is essential to personalized medicine, where treatments are tailored to each individual. This notably makes it possible to anticipate diseases, estimate response to treatments, and even identify new pharmaceutical targets. Machine learning is a science that aims to develop algorithms that learn from examples. Such algorithms can be used to learn models that estimate phenotypes based on genotypes, which can then be studied to elucidate the biological mechanisms that underlie the phenotypes. Nonetheless, the application of machine learning in this context poses significant algorithmic and theoretical challenges. The high dimensionality of genomic data and the small size of data samples can lead to overfitting; the large volume of genomic data requires adapted algorithms that limit their use of computational resources; and importantly, the learned models must be interpretable by domain experts, which is not always possible. This thesis presents learning algorithms that produce interpretable models for the prediction of phenotypes based on genotypes. Firstly, we explore the prediction of discrete phenotypes using rule-based learning algorithms. We propose new implementations that are highly optimized and generalization guarantees that are adapted to genomic data. Secondly, we study a more theoretical problem, namely interval regression. We propose two new learning algorithms, one of which is rule-based. Finally, we show that this type of regression can be used to predict continuous phenotypes and that this leads to models that are more accurate than those of conventional approaches in the presence of censored or noisy data. The overarching theme of this thesis is an application to the prediction of antibiotic resistance, a global public health problem of high significance.
We demonstrate that our algorithms can be used to accurately predict resistance phenotypes and contribute to the improvement of their understanding. Ultimately, we expect that our algorithms will take part in the development of tools that will allow a better use of antibiotics and improved epidemiological surveillance, a key component of the solution to this problem.

Computational Methods for Comparative Analysis of Microbiome Related to Human Diseases

Author: Wontack Han
Publisher:
Total Pages: 0
Release: 2022
Genre: Bioinformatics
ISBN:



Microbial organisms play key roles in the human hosts' health and diseases. Recent advancements in genome sequencing have resulted in a large collection of sequencing data of microbial species and have expanded the research of microbiome from the characterization of microbiomes' community associated with different environments/hosts to applications related to human health and diseases. Computational methods have been developed to identify microbial markers from microbiome datasets derived from cohorts of patients with different diseases. Predictive models based on these markers (features) have been built for discriminating host phenotypes such as disease vs healthy and cancer immunotherapy responder vs non-responder. In this dissertation, I developed computational methods for comparative analysis of metagenomes from raw sequencing data and developed Machine Learning (ML) approaches to build predictive models for host phenotype prediction based on identified microbial markers. First, I implemented the subtractive assembly method (called CoSA) for comparative metagenomics that directly detects differential reads between two groups of metagenomes, from which microbial marker genes could be assembled and characterized. Secondly, I reported the curation of a repository of microbial marker genes and predictive models built from these markers for microbiome-based prediction of host phenotype, and a computational pipeline (named Mi2P) for using the repository. Lastly, I exploited locality sensitive hashing (LSH) as a clustering algorithm to group billions of k-mers having similar abundance profiles across multiple samples into k-mer co-abundance groups (kCAGs) to improve the characterization of differential microbial markers. The overall goal of my research is to develop fast and efficient approaches for identifying microbial marker genes, and make them available for building predictive models for microbiome-based host phenotype predictions.
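
As a rough illustration of how locality sensitive hashing can group k-mers by abundance profile, the sketch below uses random-hyperplane (sign) hashing. The abstract does not say which LSH family the dissertation uses, so this choice, the function names, the number of hyperplanes, and the toy profiles are all assumptions. Each k-mer's abundance vector across samples is reduced to a short bit signature, and k-mers with identical signatures land in the same bucket, standing in for a co-abundance group.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random Gaussian normal vectors of the given dimension."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(profile, hyperplanes):
    """One bit per hyperplane: 1 if the profile lies on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, profile)) >= 0.0)
        for plane in hyperplanes
    )


def group_kmers(profiles, n_planes=8, seed=0):
    """Bucket k-mers by signature; profiles maps k-mer -> abundance per sample."""
    dim = len(next(iter(profiles.values())))
    planes = make_hyperplanes(n_planes, dim, seed)
    buckets = defaultdict(list)
    for kmer, profile in profiles.items():
        buckets[signature(profile, planes)].append(kmer)
    return dict(buckets)


# Toy, hypothetical abundance profiles across three samples.
profiles = {
    "ACGT": [1, 2, 3],
    "CGTA": [2, 4, 6],   # proportional to ACGT, so it hashes identically
    "TTTT": [9, 1, 0],
}
groups = group_kmers(profiles)
```

Because the signature depends only on the direction of the profile vector, k-mers whose abundances rise and fall together across samples tend to collide, and each k-mer is hashed once rather than compared against every other k-mer.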

Computational Topology

Author: Herbert Edelsbrunner
Publisher: American Mathematical Society
Total Pages: 241
Release: 2022-01-31
Genre: Mathematics
ISBN: 1470467690



Combining concepts from topology and algorithms, this book delivers what its title promises: an introduction to the field of computational topology. Starting with motivating problems in both mathematics and computer science and building up from classic topics in geometric and algebraic topology, the third part of the text advances to persistent homology. This point of view is critically important in turning a mostly theoretical field of mathematics into one that is relevant to a multitude of disciplines in the sciences and engineering. The main approach is the discovery of topology through algorithms. The book is ideal for teaching a graduate or advanced undergraduate course in computational topology, as it develops all the background of both the mathematical and algorithmic aspects of the subject from first principles. Thus the text could serve equally well in a course taught in a mathematics department or computer science department.