Inferring Phenotypes from Genotypes with Machine Learning

Author: Alexandre Drouin
Publisher:
Total Pages: 225
Release: 2019
Genre:
ISBN:



A thorough understanding of the relationship between the genomic characteristics of an individual (the genotype) and its biological state (the phenotype) is essential to personalized medicine, where treatments are tailored to each individual. Such understanding notably makes it possible to anticipate diseases, estimate response to treatments, and even identify new pharmaceutical targets. Machine learning is a science that aims to develop algorithms that learn from examples. Such algorithms can be used to learn models that estimate phenotypes based on genotypes, which can then be studied to elucidate the biological mechanisms that underlie the phenotypes. Nonetheless, the application of machine learning in this context poses significant algorithmic and theoretical challenges. The high dimensionality of genomic data and the small size of data samples can lead to overfitting; the large volume of genomic data requires adapted algorithms that limit their use of computational resources; and importantly, the learned models must be interpretable by domain experts, which is not always possible. This thesis presents learning algorithms that produce interpretable models for the prediction of phenotypes based on genotypes. Firstly, we explore the prediction of discrete phenotypes using rule-based learning algorithms. We propose new, highly optimized implementations and generalization guarantees adapted to genomic data. Secondly, we study a more theoretical problem, namely interval regression. We propose two new learning algorithms, one of which is rule-based. Finally, we show that this type of regression can be used to predict continuous phenotypes and that this leads to models that are more accurate than those of conventional approaches in the presence of censored or noisy data. The overarching theme of this thesis is an application to the prediction of antibiotic resistance, a global public health problem of high significance. 
We demonstrate that our algorithms can be used to accurately predict resistance phenotypes and contribute to the improvement of their understanding. Ultimately, we expect that our algorithms will take part in the development of tools that will allow a better use of antibiotics and improved epidemiological surveillance, a key component of the solution to this problem.
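
The abstract above mentions interval regression for censored or noisy targets. As a rough illustration only, the idea can be sketched with a loss that is zero whenever the prediction falls inside the target interval and grows quadratically outside it. The function names and the squared penalty are assumptions made for this sketch; they are not taken from the thesis.

```python
def interval_loss(prediction, lower, upper):
    """Squared hinge-style loss for an interval target (illustrative assumption).

    Returns 0.0 when lower <= prediction <= upper, otherwise the squared
    distance to the nearest bound. Use float("-inf") or float("inf") as a
    bound to express a left- or right-censored target.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if prediction < lower:
        return (lower - prediction) ** 2
    if prediction > upper:
        return (prediction - upper) ** 2
    return 0.0


def mean_interval_loss(predictions, intervals):
    """Average interval_loss over a list of (lower, upper) target intervals."""
    losses = [interval_loss(p, lo, hi) for p, (lo, hi) in zip(predictions, intervals)]
    return sum(losses) / len(losses)
```

For example, a measurement known only to exceed 4 can be encoded as the interval `(4.0, float("inf"))`, so any prediction of 4 or more incurs no loss.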

Statistical Methods for Inferring Correlation and Causation Between Genotypes and Phenotypes

Author: Nathan Riley Summers LaPierre
Publisher:
Total Pages: 215
Release: 2022
Genre:
ISBN:



Genome-Wide Association Studies (GWAS) have identified many genetic variants that are associated with a variety of complex phenotypes, including anthropometric and lifestyle traits as well as complex diseases. It is unclear, however, which of these variants actually play causal roles in these phenotypes, as opposed to simply being correlated with the causal variants. It is also unclear through which intermediate mechanisms causal variants impact complex phenotypes, such as effects on gene expression, metabolites, the microbiome, or other related phenotypes. In this dissertation, I present computational and statistical methods for addressing these issues. These methods infer causal variants for complex phenotypes, link variants to intermediate gene expression phenotypes, and use genetic variants to determine the causal effect of intermediate phenotypes on downstream phenotypes. Further exploring one such set of intermediate phenotypes, I present several methods for the analysis of metagenomic sequencing data. Metagenomics, the study of microbial genomes sequenced directly from their host environment, has revolutionized the study of microorganisms and illuminated their key roles in environmental function and dysfunction, including in human health and disease. However, it is challenging to determine which microbes are present and their relative abundances from sequencing data, due to incomplete genomic reference databases as well as errors in the sequencing reads themselves. I introduce several methods addressing this challenge, providing means to correct errors in sequencing reads and then to estimate the relative abundances of microbial taxa in the sequenced sample. I then explore several machine learning approaches for predicting human diseases based on inferred microbe abundance information.

Machine Learning for Microbial Phenotype Prediction

Author: Roman Feldbauer
Publisher: Springer
Total Pages: 116
Release: 2016-06-15
Genre: Science
ISBN: 3658143193



This thesis presents a scalable, generic methodology for microbial phenotype prediction based on supervised machine learning, several models for biological and ecological traits of high relevance, and the deployment in metagenomic datasets. The results suggest that the presented prediction tool can be used to automatically annotate phenotypes in near-complete microbial genome sequences, as generated in large numbers in current metagenomic studies. Unraveling relationships between a living organism's genetic information and its observable traits is a central biological problem. Phenotype prediction facilitated by machine learning techniques will be a major step forward to creating biological knowledge from big data.

Machine Learning in Genome-Wide Association Studies

Author: Ting Hu
Publisher: Frontiers Media SA
Total Pages: 74
Release: 2020-12-15
Genre: Science
ISBN: 2889662292



This eBook is a collection of articles from a Frontiers Research Topic. Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: frontiersin.org/about/contact.

Sparse Model Learning for Inferring Genotype and Phenotype Associations

Author: Anhui Huang
Publisher:
Total Pages:
Release: 2014
Genre:
ISBN:



Genotype and phenotype associations are of paramount importance in understanding the genetic basis of living organisms, improving traits of interest in animal and plant breeding, as well as gaining insights into complex biological systems and the etiology of human diseases. With advancements in molecular biology such as microarrays, high-throughput next-generation sequencing, and RNA-seq, the number of available genotype markers far exceeds the number of available samples in association studies. The objective of this dissertation is to develop sparse models for such high-dimensional data, develop accurate sparse variable selection and estimation algorithms for the models, and design statistical methods for robust hypothesis tests for the genotype and phenotype associations. We develop a novel empirical Bayesian least absolute shrinkage and selection operator (EBlasso) algorithm with Normal, Exponential and Gamma (NEG), and Normal, Exponential (NE) hierarchical prior distributions, and an empirical Bayesian elastic net (EBEN) algorithm with an innovative Normal and generalized Gamma (NG) hierarchical prior distribution, for both general linear and generalized logistic regression models. Both empirical Bayes methods estimate variance components of the regression coefficients with closed-form solutions and perform automatic variable selection such that a variable with zero variance is excluded from the model. With the closed-form solutions for variance components in the model and without estimating the posterior modes for excluded variables, the two empirical Bayes methods infer sparse models efficiently. Having both covariance and posterior modes estimated, they also provide a statistical testing method that considers as much information as possible without increasing the degrees of freedom (DF). Extensive simulation studies are carried out to evaluate the performance of the proposed methods, and real datasets are analyzed for validation. 
Both simulation and real data analyses suggest that the two methods are fast and accurate genotype-phenotype association methods that can easily handle high-dimensional data including possible main and interaction effects. Comparing the two methods, EBlasso typically selects one variable out of a group of highly correlated effects, and the EBEN algorithm encourages a grouping effect that selects a group of effects if they are correlated. Beyond the verificatory simulation and real dataset analyses, we further demonstrate the advantage of the developed algorithms through two exploratory applications, namely whole-genome QTL mapping for an elite rice hybrid and a pathway-based genome-wide association study (GWAS) for human Parkinson's disease (PD). In the first application, we exploit whole-genome markers of an immortalized F2 population derived from an elite rice hybrid to perform QTL mapping for the rice-yield phenotype. Our QTL model includes additive and dominance main effects of 1,619 markers and all pair-wise interactions, with a total of more than 5 million possible effects. This study not only reveals the major role of epistasis influencing rice yield, but also provides a set of candidate genetic loci for further experimental investigations. In the second application, we employ the EBlasso logistic regression model for pathway-based GWAS to include all possible main effects and a large number of pair-wise interactions of single nucleotide polymorphisms (SNPs) in a pathway, with a total number of more than 32 million effects included in the model. With effects inferred by EBlasso, the statistical significance of a pathway is tested with the Wald statistic and reliable effects in a significant pathway are selected using the stability selection technique. Another important area of genotype and phenotype association is to infer the structure of gene regulatory networks (GRNs). 
We develop a GRN inference algorithm by exploring sparse model selection and estimation methods in structural equation models (SEMs). We extend a previously developed sparse-aware maximum likelihood (SML) algorithm to incorporate the adaptive elastic net penalty for the SEM likelihood function (SEM-EN) and infer the model using a parallelized block coordinate ascent algorithm. With the versatile penalty function and powerful parallel computation, the SEM-EN algorithm is able to infer a network with thousands of nodes. The performance of the developed algorithm is demonstrated through simulation studies, in which power of detection and false discovery rate both suggest that SEM-EN significantly improves GRN inference over the previously developed SEM-SML algorithm. When applied to infer the GRN of a real budding yeast dataset with more than 3,000 nodes, SEM-EN infers a sparse network corroborated by previous independent studies in terms of roles of hub nodes and functions of key clusters. Given the fundamental importance of genotype and phenotype associations in understanding the genetic basis of complex biological systems, the EBlasso-NE, EBlasso-NEG, EBEN, as well as SEM-EN algorithms and software packages developed in this dissertation achieve the effectiveness, robustness and efficiency that are needed for successful application to biology. With the advancement of high-throughput molecular technologies in generating information at genetic, epigenetic, transcriptional and post-transcriptional levels, the methods developed in this dissertation can have broad applications to infer different types of genotype and phenotype associations.
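
The abstract above states that the rice QTL model, with additive and dominance main effects for 1,619 markers plus all pair-wise interactions, has more than 5 million possible effects. The arithmetic below is a sketch of one way that count can arise. It assumes that each marker pair contributes one interaction term per combination of effect types (2 x 2 = 4); the abstract does not spell out this breakdown, so treat it as an assumption.

```python
from math import comb

n_markers = 1619         # number of markers stated in the abstract
effects_per_marker = 2   # additive and dominance main effects

# Main effects: one additive and one dominance term per marker.
main_effects = n_markers * effects_per_marker

# Assumption (not stated in the abstract): every unordered marker pair
# contributes one interaction per combination of effect types (2 x 2 = 4).
interactions_per_pair = effects_per_marker ** 2
pairwise_interactions = comb(n_markers, 2) * interactions_per_pair

total_effects = main_effects + pairwise_interactions
print(f"{total_effects:,} candidate effects")
```

Under that assumption the total comes to 5,242,322, which is consistent with the "more than 5 million" figure and shows how far the number of candidate effects outstrips any realistic sample size.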

Elements of Causal Inference

Author: Jonas Peters
Publisher: MIT Press
Total Pages: 289
Release: 2017-11-29
Genre: Computers
ISBN: 0262037319



A concise and self-contained introduction to causal inference, increasingly important in data science and machine learning. The mathematization of causality is a relatively recent development, and has become increasingly important in data science and machine learning. This book offers a self-contained and concise introduction to causal models and how to learn them from data. After explaining the need for causal models and discussing some of the principles underlying causal inference, the book teaches readers how to use causal models: how to compute intervention distributions, how to infer causal models from observational and interventional data, and how causal ideas could be exploited for classical machine learning problems. All of these topics are discussed first in terms of two variables and then in the more general multivariate case. The bivariate case turns out to be a particularly hard problem for causal learning because there are no conditional independences as used by classical methods for solving multivariate cases. The authors consider analyzing statistical asymmetries between cause and effect to be highly instructive, and they report on their decade of intensive research into this problem. The book is accessible to readers with a background in machine learning or statistics, and can be used in graduate courses or as a reference for researchers. The text includes code snippets that can be copied and pasted, exercises, and an appendix with a summary of the most important technical concepts.
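
The description above mentions computing intervention distributions. As a minimal sketch of the distinction between conditioning and intervening, the example below enumerates a tiny binary model with a confounder Z (Z -> X, Z -> Y, X -> Y). The model, its probabilities, and the function names are hypothetical and chosen for illustration; this is not code from the book.

```python
from fractions import Fraction as F

# Hypothetical binary model with a confounder Z: Z -> X, Z -> Y, X -> Y.
p_z = {0: F(1, 2), 1: F(1, 2)}                 # P(Z=z)
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}        # P(X=1 | Z=z)
p_y1_given_xz = {                              # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): F(1, 10), (1, 0): F(3, 10),
    (0, 1): F(5, 10), (1, 1): F(7, 10),
}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def observational(x):
    """P(Y=1 | X=x): conditioning on X reweights Z by P(Z | X=x)."""
    joint = sum(p_z[z] * p_x_given_z(x, z) * p_y1_given_xz[(x, z)] for z in p_z)
    marginal = sum(p_z[z] * p_x_given_z(x, z) for z in p_z)
    return joint / marginal


def interventional(x):
    """P(Y=1 | do(X=x)): X is set externally, so Z keeps its marginal P(Z)."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in p_z)
```

In this toy model `observational(1)` is 31/50 while `interventional(1)` is 1/2: the observed association overstates the effect of X because Z influences both X and Y.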

Learning Genomic and Molecular Mediators of Genotype-phenotype Associations

Author: Anna Shcherbina
Publisher:
Total Pages:
Release: 2020
Genre:
ISBN:



The vast majority of genomic variants are non-coding, and many disrupt regulatory elements, causing dysregulation of gene expression. However, the functional mechanisms by which non-coding variants operate at the molecular level, as well as their tissue-specific downstream effects on cellular, organismal and disease phenotypes, remain challenging to decipher. Firstly, complex phenotypes such as physical activity patterns are difficult to characterize and measure. Secondly, even after inferring statistical associations between genetic loci and complex phenotypes, identifying the causal variants is challenging due to the issues posed by linkage disequilibrium. Finally, the elucidation of functional molecular mechanisms that mediate the manifestation of genotypic variation as phenotypic effects remains an open challenge in the field. This thesis attempts to address these three challenges via the development and application of statistical and deep learning approaches to mine large genomic, molecular and phenotypic datasets. The MyHeart Counts study serves as an example of how wearable and mobile technologies enable unobtrusive real-time measurements of complex phenotypes such as exercise and physical activity patterns. These technologies also enable rapid recruitment of large study cohorts and facilitate fully digital randomized controlled trials with low barriers to entry. Such technologies also facilitate the compilation of population-level biobanks, such as the UK Biobank, by enabling acquisition of lifestyle and activity data at scale. Having acquired complex phenotypes on large data cohorts, we can begin to investigate the effects of genomic variation on these phenotypes by performing genome-wide association studies (GWAS). Functional GWAS SNPs can be identified via in silico interrogation of predictive deep learning models of regulatory DNA. 
Here, I present convolutional neural network models trained on genome-wide chromatin profiling experiments to interpret and fine-map GWAS SNPs by leveraging their ability to learn predictive DNA sequence syntax. Case studies in colorectal cancer and Alzheimer's disease are presented to illustrate the application of these methods. To improve model stability and interpretability, I developed deep learning models that can predict regulatory chromatin profiles at single-base resolution, accounting for and correcting confounding experimental biases. I also contributed to several collaborative investigations of the molecular basis of complex cellular phenotypes. We identified the Sp1 regulatory protein as a key regulator of matrix stiffness and induction of tumorigenic phenotypes in mammary epithelium; the PI3K pathway as a key modulator of the efficiency of stem cell differentiation; and transcription factor networks that regulate murine muscle stem cell aging through differentiation. In summary, this thesis presents new computational approaches for linking genotype to phenotype through molecular mechanisms.

Machine Learning for the Genotype-to-Phenotype Problem

Author: John William Santerre
Publisher:
Total Pages: 263
Release: 2018
Genre:
ISBN: 9780355804065



This thesis demonstrates the suitability of machine learning for classifying phenotypes from genotype data. First, we analyze the suitability of machine learning techniques on antimicrobial resistance phenotypes. Additionally, we evaluate the stability of identifying DNA regions related to antimicrobial resistance. To speed up and simplify this process, we develop a unique matrix construction method specifically for use on antimicrobial resistance datasets. We also consider an alternative phenotypic classification problem, namely predicting the ability of an organism to grow on a particular medium (predicting growth rate), and the structure of the resulting feature space. Finally, we propose an extension of the Random Forest feature importance calculation and show how such an alteration results in an improvement in the identification of gene regions.