Improvements in Machine Learning for Predicting Taxon, Phenotype and Function from Genetic Sequences

Author: Zhengqiao Zhao
Publisher:
Total Pages: 219
Release: 2020
Genre: Bioinformatics
ISBN:


Advances in DNA sequencing, as well as the rise of shotgun metagenomics and metabolomics, are rapidly producing complex microbiome datasets for studies of human health and the environment. The large-scale sampling of DNA/RNA from microbes provides a window into the microbiome's interactions with its host and habitat, enables us to predict phenotypic traits of the host/microbiome, aids the discovery of emergent biological function, and supports medical diagnosis. Researchers try to extract features from DNA/RNA sequencing data and make 1) taxonomic predictions ("Who is there"), 2) function annotations ("What they are doing") and 3) host/microbiome phenotype predictions. This work explores different computational methods to address challenges in these three fields. First, taxonomic classification relies on NCBI RefSeq database sequences, which are being added at an exponential rate. Therefore, the incremental learning concept is especially important. Although the incremental naive Bayes classifier (NBC) is a decade-old concept, it has not been applied to taxonomic classification in the metagenomics field. In this work, I compare the classification accuracy and runtime of the proposed incremental learning implementation of NBC with the performance of the traditional implementation of NBC and demonstrate a proof of concept of how incremental learning can make taxonomic classification much more efficient in its training process, significantly reducing computation while maintaining accuracy. Second, in addition to predicting taxonomic labels for metagenomic samples, researchers are also interested in identifying different subtypes of a single virus, since mutations can be introduced during transmission. "Oligotyping" is an entropy analysis tool developed for subtyping taxonomic units based on 16S rRNA sequences. "Oligotyping" was formulated because the 16S rRNA gene is highly conserved and some lineages have very few mutations in it.
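The entropy analysis mentioned above can be illustrated with a minimal sketch. It is not the Oligotyping tool's implementation; it only assumes that sequences are already aligned to equal length, and the function names and the 0.5-bit threshold are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the bases in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_positions(aligned_seqs, threshold=0.5):
    """Indices of alignment columns whose entropy exceeds `threshold` (assumed cutoff)."""
    columns = zip(*aligned_seqs)  # transpose rows into columns
    return [i for i, col in enumerate(columns) if column_entropy(col) > threshold]


def barcode(seq, positions):
    """Concatenate the bases found at the selected variable positions."""
    return "".join(seq[i] for i in positions)


# Toy aligned sequences (hypothetical data, equal length).
seqs = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGTCC"]
positions = high_entropy_positions(seqs)
barcodes = [barcode(s, positions) for s in seqs]
```

Sequences that share a barcode fall into the same candidate subtype; conserved columns contribute no information and are dropped.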
The SARS-CoV-2 genome, being months old, also has a relatively small number of mutations. Therefore, the entropy analysis developed for 16S rRNA sequences can be adapted for SARS-CoV-2 viral genome subtyping. However, other researchers were only looking at sequence similarity (and subsequent trees) or important single nucleotide variants individually between the genomes. To my knowledge, I am the first to draw on the "Oligotyping" concept to group mutations as a "barcode" of the viral genome and extend it to define subtypes for SARS-CoV-2 viral genomes. I further add error correction to account for ambiguities in the sequences and, optionally, apply further compression by identifying patterns of base entropy correlation. I demonstrate its application in spatiotemporal analyses of real-world SARS-CoV-2 sequences responsible for the COVID-19 pandemic. My method is validated by comparing the subtypes defined to similar subtypes discovered in other literature. Third, microbial survey data is not used efficiently for phenotype prediction. For example, a precise Crohn's disease prediction model can aid diagnosis from stool samples collected from subjects. To predict Crohn's disease (or another phenotype) from microbiome composition, researchers usually start by grouping similar sequences into an Operational Taxonomic Unit (OTU) or Amplicon Sequence Variant (ASV) and subsequently learn to classify samples by examining OTU occurrences in different phenotypes. However, only looking at sequence similarity ignores the sequential information contained in DNA sequences. Bioinformatics has been inspired by successes in deep learning applications in Natural Language Processing (NLP). Both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used to learn DNA sequential information for applications such as transcription factor binding site classification.
In my work, I propose to adapt deep learning architectures (such as RNNs and attention mechanisms) that have been widely used in NLP to develop a "phenotype" classifier. This Read2Pheno classifier can predict "phenotype" based on 16S rRNA reads. I demonstrate how the sequential information learned by the proposed model can provide insights on informative regions in DNA sequences/reads while making accurate predictions. The model is validated by comparing its accuracy with other baseline methods such as a random forest model trained with various features (standard OTU/ASV table and k-mers). Fourth, several deep learning-based functional annotation models have been proposed recently. However, these models can only output one class of function annotation predictions, such as Gene Ontology (GO). It would be convenient to have a tool that outputs function predictions for more than one function annotation database, such as both GO and KEGG Orthology. In this work, I first extend the proposed Read2Pheno model to a function prediction model, AttentionGO, and compare its performance with both alignment-based and deep learning-based models to show that the proposed model can achieve comparable performance with additional interpretability. Second, I explore the possibility of using the proposed AttentionGO classifier in a multi-task learning model to predict three branches of GO terms and KEGG Orthology terms simultaneously. The multi-task learning model is compared with single-task models trained with individual tasks to demonstrate performance improvement.
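The incremental naive Bayes idea from the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes a multinomial model over k-mer counts with add-one smoothing and a uniform class prior, and the class name `IncrementalKmerNB`, the value k=3, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Yield the overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class IncrementalKmerNB:
    """Multinomial naive Bayes whose whole state is per-class k-mer counts."""

    def __init__(self, k=3):
        self.k = k
        self.counts = defaultdict(Counter)  # class label -> k-mer counts
        self.totals = Counter()             # class label -> total k-mers seen

    def update(self, seq, label):
        """Add one training sequence; earlier data is never revisited."""
        for kmer in kmers(seq, self.k):
            self.counts[label][kmer] += 1
            self.totals[label] += 1

    def predict(self, seq):
        """Return the label with the highest log-likelihood (uniform prior assumed)."""
        vocab = 4 ** self.k  # number of possible DNA k-mers, used for add-one smoothing

        def score(label):
            return sum(
                math.log((self.counts[label][km] + 1) / (self.totals[label] + vocab))
                for km in kmers(seq, self.k)
            )

        return max(self.totals, key=score)


model = IncrementalKmerNB(k=3)
model.update("ACGACGACGACG", "taxon_A")
model.update("TTGTTGTTGTTG", "taxon_B")
first = model.predict("ACGACG")

# A new reference sequence arrives later: only its counts are added.
model.update("GGCGGCGGCGGC", "taxon_C")
second = model.predict("GGCGGC")
```

Because the sufficient statistics are plain counts, adding a newly released reference genome costs one pass over that genome rather than retraining on the whole database.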

Handbook of Machine Learning Applications for Genomics

Author: Sanjiban Sekhar Roy
Publisher: Springer Nature
Total Pages: 222
Release: 2022-06-23
Genre: Technology & Engineering
ISBN: 9811691584


Currently, machine learning is playing a pivotal role in the progress of genomics. Applications of machine learning are helping researchers understand the emerging trends and the future scope of genomics. This book provides comprehensive coverage of machine learning applications such as DNN, CNN, and RNN for predicting the sequence of DNA- and RNA-binding proteins, gene expression, and splicing control. In addition, the book addresses multiomics data analysis of cancers using tensor decomposition, machine learning techniques for protein engineering, CNN applications in genomics, challenges of long noncoding RNAs in human disease diagnosis, and how machine learning can be used as a tool to shape the future of medicine. More importantly, it gives a comparative analysis and validates the outcomes of machine learning methods on genomic data against functional laboratory tests or formal clinical assessment. The topics of this book will interest academicians and practitioners working in the fields of functional genomics and machine learning. The book will also serve as a comprehensive guide for graduate, postgraduate, and Ph.D. scholars working in these fields.

Machine Learning for Microbial Phenotype Prediction

Author: Roman Feldbauer
Publisher: Springer
Total Pages: 116
Release: 2016-06-15
Genre: Science
ISBN: 3658143193


This thesis presents a scalable, generic methodology for microbial phenotype prediction based on supervised machine learning, several models for biological and ecological traits of high relevance, and the deployment in metagenomic datasets. The results suggest that the presented prediction tool can be used to automatically annotate phenotypes in near-complete microbial genome sequences, as generated in large numbers in current metagenomic studies. Unraveling relationships between a living organism's genetic information and its observable traits is a central biological problem. Phenotype prediction facilitated by machine learning techniques will be a major step forward to creating biological knowledge from big data.

Interpretable Machine Learning Methods for Regulatory and Disease Genomics

Author: Peyton Greis Greenside
Publisher:
Total Pages:
Release: 2018
Genre:
ISBN:


It is an incredible feat of nature that the same genome contains the code to every cell in each living organism. From this same genome, each unique cell type gains a different program of gene expression that enables the development and function of an organism throughout its lifespan. The non-coding genome - the ~98% of the genome that does not code directly for proteins - serves an important role in generating the diverse programs of gene expression turned on in each unique cell state. A complex network of proteins binds specific regulatory elements in the non-coding genome to regulate the expression of nearby genes. While basic principles of gene regulation are understood, the regulatory code of which factors bind together at which genomic elements to turn on which genes remains to be revealed. Further, we do not understand how disruptions in gene regulation, such as from mutations that fall in non-coding regions, ultimately lead to disease or other changes in cell state. In this work we present several methods developed and applied to learn the regulatory code or the rules that govern non-coding regions of the genome and how they regulate nearby genes. We first formulate the problem as one of learning pairs of sequence motifs and expressed regulator proteins that jointly predict the state of the cell, such as the cell type specific gene expression or chromatin accessibility. Using pre-engineered sequence features and known expression, we use a paired-feature boosting approach to build an interpretable model of how the non-coding genome contributes to cell state. We also demonstrate a novel improvement to this method that takes into account similarities between closely related cell types by using a hierarchy imposed on all of the predicted cell states. We apply this method to discover validated regulators of tadpole tail regeneration and to predict protein-ligand binding interactions.
Recognizing the need for improved sequence features and stronger predictive performance, we then move to a deep learning modeling framework to predict epigenomic phenotypes such as chromatin accessibility from just underlying DNA sequence. We use deep learning models, specifically multi-task convolutional neural networks, to learn a featurization of sequences over several kilobases long and their mapping to a functional phenotype. We develop novel architectures that encode principles of genomics in models typically designed for computer vision, such as incorporating reverse complementation and the 3D structure of the genome. We also develop methods to interpret traditionally "black box" neural networks by 1) assigning importance scores to each input sequence to the model, 2) summarizing non-redundant patterns learned by the model that are predictive in each cell type, and 3) discovering interactions learned by the model that provide indications as to how different non-coding sequence features depend on each other. We apply these methods in the system of hematopoiesis to interpret chromatin dynamics across differentiation of blood cell types, to understand immune stimulation, and to interpret immune disease-associated variants that fall in non-coding regions. We demonstrate strong performance of our boosting and deep learning models and demonstrate improved performance of these machine learning frameworks when taking into account existing knowledge about the biological system being modeled. We benchmark our interpretation methods using gold standard systems and existing experimental data where available. We confirm existing knowledge surrounding essential factors in hematopoiesis, and also generate novel hypotheses surrounding how factors interact to regulate differentiation.
Ultimately our work provides a set of tools for researchers to probe and understand the non-coding genome and its role in controlling gene expression as well as a set of novel insights surrounding how hematopoiesis is controlled on many scales from global quantification of regulatory sequence to interpretation of individual variants.
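The abstract above mentions incorporating reverse complementation into sequence models. The thesis builds this into the network architecture; the sketch below swaps in a simpler technique, averaging a model's output over both strands at prediction time, to show why the property matters. The names `strand_symmetric_score` and `toy_score` are hypothetical, and `toy_score` is a stand-in for a trained model.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def strand_symmetric_score(seq, score_fn):
    """Average a model's score over both strands so the result is strand-invariant."""
    forward = score_fn(one_hot(seq))
    reverse = score_fn(one_hot(reverse_complement(seq)))
    return 0.5 * (forward + reverse)


def toy_score(encoded):
    """Hypothetical position-dependent scorer standing in for a trained network."""
    return sum((i + 1) * row[0] for i, row in enumerate(encoded))


seq = "AACGT"
score_fwd = strand_symmetric_score(seq, toy_score)
score_rev = strand_symmetric_score(reverse_complement(seq), toy_score)
```

A regulatory element is the same element regardless of which strand was sequenced, so `score_fwd` and `score_rev` agree even though `toy_score` on its own does not treat the two strands equally.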

Machine Learning Models for Functional Genomics and Therapeutic Design

Author: Haoyang Zeng (Ph.D.)
Publisher:
Total Pages: 230
Release: 2019
Genre:
ISBN:


Due to the limited size of training data available, machine learning models for biology have remained rudimentary and inaccurate despite significant advances in machine learning research. With the recent advent of high-throughput sequencing technology, an exponentially growing number of genomic and proteomic datasets have been generated. These large-scale datasets admit the training of high-capacity machine learning models to characterize sophisticated features and produce accurate predictions on unseen examples. In this thesis, we attempt to develop advanced machine learning models for functional genomics and therapeutic design, two areas with ample data deposited in public databases and tremendous clinical implications. The shared theme of these models is to learn how the composition of a biological sequence encodes a functional phenotype and then leverage such knowledge to provide insight for target discovery and therapeutic design. First, we design three machine learning models that predict transcription factor binding and DNA methylation, two fundamental epigenetic phenotypes closely tied to gene regulation, from DNA sequence alone. We show that these epigenetic phenotypes can be well predicted from the sequence context. Moreover, the predicted change in phenotype between the reference and alternate allele of a genetic variant accurately reflects its functional impact and improves the identification of regulatory variants causal for complex diseases. Second, we devise two machine learning models that improve the prediction of peptides displayed by the major histocompatibility complex (MHC) on the cell surface. Computational modeling of peptide display by MHC is central in the design of peptide-based therapeutics. Our first machine learning model introduces the capacity to quantify uncertainty in the computational prediction and proposes a new metric for peptide prioritization that reduces false positives in high-affinity peptide design.
The second model improves the state-of-the-art performance in MHC-ligand prediction by employing a deep language model to learn the sequence determinants for auxiliary processes in MHC-ligand selection, such as proteasome cleavage, that are omitted by existing methods due to the lack of labeled data. Third, we develop machine learning frameworks to model the enrichment of an antibody sequence in phage-panning experiments against a target antigen. We show that antibodies with low specificity can be reduced by a computational procedure using machine learning models trained for multiple targets. Moreover, machine learning can help to design novel antibody sequences with improved affinity.
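The abstract above scores a variant by the predicted change in phenotype between the reference and alternate allele. A minimal sketch of that comparison follows, assuming 0-based positions and a sequence-to-score predictor; `toy_predict` (a motif counter) and the GATA motif are hypothetical stand-ins for a trained model, not the thesis models.

```python
def apply_variant(ref_seq, pos, ref, alt):
    """Return the sequence with the alternate allele substituted at 0-based `pos`."""
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return ref_seq[:pos] + alt + ref_seq[pos + len(ref):]


def variant_effect(ref_seq, pos, ref, alt, predict):
    """Predicted phenotype for the alternate allele minus that for the reference."""
    alt_seq = apply_variant(ref_seq, pos, ref, alt)
    return predict(alt_seq) - predict(ref_seq)


def toy_predict(seq, motif="GATA"):
    """Hypothetical stand-in for a trained model: counts motif occurrences."""
    return float(sum(seq.startswith(motif, i) for i in range(len(seq))))


# A T>C substitution that destroys the only motif instance in the toy sequence.
effect = variant_effect("CCGATACC", 4, "T", "C", toy_predict)
```

A large absolute `effect` flags the variant as a candidate regulatory variant; with a real model, `predict` would return a binding or methylation score instead of a motif count.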

Machine Learning in Genome-Wide Association Studies

Author: Ting Hu
Publisher: Frontiers Media SA
Total Pages: 74
Release: 2020-12-15
Genre: Science
ISBN: 2889662292


This eBook is a collection of articles from a Frontiers Research Topic. Frontiers Research Topics are collections of at least ten articles, all centered on a particular subject, with contributions ranging from Original Research to Review Articles.

Multivariate Statistical Machine Learning Methods for Genomic Prediction

Author: Osval Antonio Montesinos López
Publisher: Springer Nature
Total Pages: 707
Release: 2022-02-14
Genre: Technology & Engineering
ISBN: 3030890104


This book is open access under a CC BY 4.0 license. It brings together the latest genome-based prediction models currently being used by statisticians, breeders and data scientists. It provides an accessible way to understand the theory behind each statistical learning tool, the required pre-processing, the basics of model building, how to train statistical learning methods, the basic R scripts needed to implement each statistical learning tool, and the output of each tool. To do so, for each tool the book provides background theory, some elements of the R statistical software for its implementation, the conceptual underpinnings, and at least two illustrative examples with data from real-world genomic selection experiments. Lastly, worked-out examples help readers check their own comprehension. The book will greatly appeal to readers in plant (and animal) breeding, geneticists and statisticians, as it provides in a very accessible way the necessary theory, the appropriate R code, and illustrative examples for a complete understanding of each statistical learning tool. In addition, it weighs the advantages and disadvantages of each tool.