Interpretable Machine Learning Methods for Regulatory and Disease Genomics

Author: Peyton Greis Greenside
Release: 2018



It is an incredible feat of nature that the same genome contains the code to every cell in each living organism. From this same genome, each unique cell type gains a different program of gene expression that enables the development and function of an organism throughout its lifespan. The non-coding genome, the ~98% of the genome that does not code directly for proteins, serves an important role in generating the diverse programs of gene expression turned on in each unique cell state. A complex network of proteins binds specific regulatory elements in the non-coding genome to regulate the expression of nearby genes. While basic principles of gene regulation are understood, the regulatory code of which factors bind together at which genomic elements to turn on which genes remains to be revealed. Further, we do not understand how disruptions in gene regulation, such as from mutations that fall in non-coding regions, ultimately lead to disease or other changes in cell state. In this work we present several methods developed and applied to learn the regulatory code, or the rules that govern non-coding regions of the genome and how they regulate nearby genes. We first formulate the problem as one of learning pairs of sequence motifs and expressed regulator proteins that jointly predict the state of the cell, such as the cell-type-specific gene expression or chromatin accessibility. Using pre-engineered sequence features and known expression, we use a paired-feature boosting approach to build an interpretable model of how the non-coding genome contributes to cell state. We also demonstrate a novel improvement to this method that takes into account similarities between closely related cell types by using a hierarchy imposed on all of the predicted cell states. We apply this method to discover validated regulators of tadpole tail regeneration and to predict protein-ligand binding interactions.
Recognizing the need for improved sequence features and stronger predictive performance, we then move to a deep learning modeling framework to predict epigenomic phenotypes such as chromatin accessibility from just underlying DNA sequence. We use deep learning models, specifically multi-task convolutional neural networks, to learn a featurization of sequences several kilobases long and their mapping to a functional phenotype. We develop novel architectures that encode principles of genomics in models typically designed for computer vision, such as incorporating reverse complementation and the 3D structure of the genome. We also develop methods to interpret traditionally "black box" neural networks by 1) assigning importance scores to each input sequence to the model, 2) summarizing non-redundant patterns learned by the model that are predictive in each cell type, and 3) discovering interactions learned by the model that provide indications as to how different non-coding sequence features depend on each other. We apply these methods in the system of hematopoiesis to interpret chromatin dynamics across differentiation of blood cell types, to understand immune stimulation, and to interpret immune disease-associated variants that fall in non-coding regions. We demonstrate strong performance of our boosting and deep learning models and demonstrate improved performance of these machine learning frameworks when taking into account existing knowledge about the biological system being modeled. We benchmark our interpretation methods using gold standard systems and existing experimental data where available. We confirm existing knowledge surrounding essential factors in hematopoiesis, and also generate novel hypotheses surrounding how factors interact to regulate differentiation.
Ultimately, our work provides a set of tools for researchers to probe and understand the non-coding genome and its role in controlling gene expression, as well as a set of novel insights surrounding how hematopoiesis is controlled on many scales, from global quantification of regulatory sequence to interpretation of individual variants.
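
The abstract above mentions incorporating reverse complementation into sequence models. As a rough illustration of the underlying symmetry only (not the thesis's architectures), the sketch below one-hot encodes DNA, reverse-complements it, and averages a scoring function over both strands. All function names, the A, C, G, T channel order, and the toy scoring function are assumptions introduced here.

```python
# Illustrative sketch only: helper names and the A, C, G, T channel order
# are assumptions, not taken from the thesis.
ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def one_hot(seq):
    """Encode a DNA string as rows of 4 channels (A, C, G, T); N is all zeros."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in ALPHABET:
            row[ALPHABET.index(base)] = 1
        rows.append(row)
    return rows


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def reverse_complement_one_hot(rows):
    """Reverse-complement a one-hot matrix.

    With channels ordered A, C, G, T, flipping the channel axis swaps
    A<->T and C<->G, and flipping the position axis reverses the strand.
    """
    return [row[::-1] for row in reversed(rows)]


def strand_averaged_score(score_fn, seq):
    """Average a score over both strands so the result is strand-symmetric."""
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts one motif."""
    return seq.count("GATA")
```

Averaging over both strands is the simplest way to make a prediction invariant to strand orientation; building the symmetry into the network's weights, as the abstract describes, enforces it during training as well.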

Biologically Interpretable Machine Learning Methods to Understand Gene Regulation for Disease Phenotypes

Author: Ting Jin
Release: 2023



Gene expression and regulation are key molecular mechanisms driving the development of human diseases, particularly at the cell-type level, but they remain elusive. For example, in many brain diseases, such as Alzheimer's disease (AD), understanding how cell-type gene expression and regulation change across multiple stages of AD progression is still challenging. Moreover, interindividual variability of gene expression and regulation is a known characteristic of the human brain and brain diseases. However, it is still unclear how interindividual variability affects personalized gene regulation in brain diseases including AD, thereby contributing to their heterogeneity. Recent technological advances have enabled the detection of gene regulation activities through multi-omics (i.e., genomics, transcriptomics, epigenomics, proteomics). In particular, emerging single-cell sequencing technologies (e.g., scRNA-seq, scATAC-seq) allow us to study functional genomics and gene regulation at the cell-type level. Moreover, these multi-omics data of populations (e.g., human individuals) provide a unique opportunity to study the underlying regulatory mechanisms occurring in brain disease progression and clinical phenotypes. For instance, PsychAD is a large project generating single-cell multi-omics data including many neuronal and glial cell types, aiming to understand the molecular mechanisms of neuropsychiatric symptoms of multiple brain diseases (e.g., AD, SCZ, ASD, Bipolar) from over 1,000 individuals. However, analyzing and integrating large-scale multi-omics data at the population level, as well as understanding the mechanisms of gene regulation, also remains a challenge. Machine learning is a powerful and emerging tool to decode the unique complexities and heterogeneity of human diseases. For instance, Beebe-Wang, Nicosia, et al. developed MD-AD, a multi-task neural network model to predict various disease phenotypes in AD patients using RNA-seq.
Additionally, advances in graph neural networks have enhanced our ability to represent sophisticated gene network structures, such as the gene regulatory networks that control gene expression. Efforts have also been made to capture the gene regulation heterogeneity of brain diseases. For instance, Kim SY has applied graph convolutional networks to offer personalized diagnostic insights through population graphs that correspond with disease progression. However, many existing machine learning methods are limited to constructing accurate models for disease phenotype prediction and frequently lack biological interpretability or personalized insights, especially in gene regulation. Therefore, to address these challenges, my Ph.D. work has developed three machine-learning methods designed to decode the gene regulation mechanisms of human diseases. First, in this dissertation, I will present scGRNom, a computational pipeline that integrates multi-omic data to construct cell-type gene regulatory networks (GRNs) linking non-coding regulatory elements. Next, I will introduce i-BrainMap, an interpretable knowledge-guided graph neural network model to prioritize personalized cell type disease genes, regulatory linkages, and modules. Third, I will introduce ECMaker, a semi-restricted Boltzmann machine (semi-RBM) method for identifying gene networks to predict diseases and clinical phenotypes. Overall, our interpretable machine learning models improve phenotype prediction, prioritize key genes and networks associated with disease phenotypes, and are further aimed at enhancing our understanding of the gene regulatory mechanisms driving disease progression and clinical phenotypes.

Interpretable Machine Learning for Scientific Discovery in Regulatory Genomics

Author: Avanti Shrikumar
Release: 2020



All cells in our body have approximately the same DNA sequence, yet different cell types have distinct behavior due to differential expression of genes. This cell-type-specific control of gene expression is governed by regulatory proteins that bind to DNA. Over 90% of disease-associated mutations do not disrupt the DNA sequences of genes, but rather disrupt functions involved in the regulation of gene expression. Unfortunately, conventional computational models can fail to distinguish between mutations that are benign and mutations that are likely to affect regulatory activity. Machine learning offers a solution to this dilemma: by training complex models, including deep learning models, to predict regulatory activity from DNA sequence, we implicitly force the models to learn which sequence features are relevant for regulation. However, our difficulty in interpreting and trusting these models limits our ability to extract novel scientific insights from them. In this thesis, I will present techniques I have developed to address some of these limitations. I will begin by discussing DeepLIFT, a fast algorithm for calculating example-specific importance scores to explain the predictions of a deep learning model, as well as GkmExplain, an algorithm for efficiently computing importance scores for gapped k-mer support vector machines. I will then describe TF-MoDISco, an algorithm that leverages importance scores produced by an algorithm such as DeepLIFT or GkmExplain to discover recurring patterns learned by the model. Next, I discuss two projects on leveraging domain-specific knowledge to improve the performance and interpretability of deep learning models trained on regulatory genomic data. The first project, on reverse-complement parameter sharing, introduces architectures that can account for symmetries inherent in the double-stranded nature of regulatory DNA.
The second project, on separable fully-connected layers, introduces a novel parameterization to exploit the fact that positional patterns in DNA binding sites are often shared across different regulatory proteins. Finally, I will discuss three projects centered on improving the reliability of predictions derived from these models. The first project deals with the situation where a deep learning model trained on regulatory genomic data is leveraged to identify pairs of proteins that have non-additive interaction effects; we demonstrate that looking at change in the model's prediction loss, rather than simply looking at the change in the predictions, is a far more robust indicator of whether the model's learned interaction effect is likely to be an artifact. The second project presents a state-of-the-art algorithm for improving the model predictions under a type of data distribution shift known as "label shift", where the class proportions in the held-out testing set differ from the class proportions that the model was trained on (this can occur, for example, if a model that is trained to predict diseases given symptoms is deployed in a situation where the prevalence of the disease is far higher than in the data distribution it was trained on). The third project explores the scenario where a model can abstain from making predictions on a subset of examples that it is uncertain of, in order to improve user trust in the predictions on remaining examples; in the project, we devise a novel and flexible strategy for choosing which examples to abstain on when the goal is to optimize metrics other than simple prediction accuracy, such as the area under the ROC curve or the sensitivity at a target specificity level (such metrics are commonly used in genomics and medicine). Taken together, I hope these methods help pave the way for successful application of advanced machine learning techniques to derive novel scientific insights from regulatory genomic data.
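
To make the "label shift" setting described above concrete, here is a minimal sketch of the textbook prior-ratio correction, which reweights a classifier's predicted class probabilities by the ratio of deployment-time to training-time class proportions. This is not the thesis's algorithm, whose details the abstract does not give. The function name is hypothetical, and the sketch assumes the predicted probabilities are calibrated and that the new class proportions are known; in practice they usually have to be estimated.

```python
def adjust_for_label_shift(probs, train_priors, target_priors):
    """Reweight predicted class probabilities for new class proportions.

    probs:         the model's predicted probabilities for one example
    train_priors:  class proportions in the training data
    target_priors: class proportions where the model is deployed
                   (assumed known here; usually they must be estimated)
    """
    # Scale each class by how much more (or less) common it is at deployment.
    weights = [t / s for s, t in zip(train_priors, target_priors)]
    unnormalized = [p * w for p, w in zip(probs, weights)]
    total = sum(unnormalized)
    # Renormalize so the adjusted probabilities sum to one.
    return [u / total for u in unnormalized]


# Example: a model trained where a disease affects 10% of patients is
# deployed where it affects 50%. Classes are ordered (healthy, disease).
adjusted = adjust_for_label_shift(
    probs=[0.8, 0.2],
    train_priors=[0.9, 0.1],
    target_priors=[0.5, 0.5],
)
```

In this example the adjusted disease probability rises from 0.2 to 9/13 (about 0.69), reflecting the higher prevalence in the deployment setting.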

Handbook of Machine Learning Applications for Genomics

Author: Sanjiban Sekhar Roy
Publisher: Springer Nature
Total Pages: 222
Release: 2022-06-23
Genre: Technology & Engineering
ISBN: 9811691584



Machine learning currently plays a pivotal role in the progress of genomics, and its applications are helping everyone understand the emerging trends and future scope of the field. This book provides comprehensive coverage of machine learning methods such as DNNs, CNNs, and RNNs for predicting the sequence of DNA- and RNA-binding proteins, gene expression, and splicing control. In addition, the book addresses multiomics data analysis of cancers using tensor decomposition, machine learning techniques for protein engineering, CNN applications in genomics, the challenges of long noncoding RNAs in human disease diagnosis, and how machine learning can be used as a tool to shape the future of medicine. More importantly, it gives a comparative analysis and validates the outcomes of machine learning methods on genomic data against functional laboratory tests or by formal clinical assessment. The topics of this book will interest academicians and practitioners working in functional genomics and machine learning. It also serves as a comprehensive guide for graduate, postgraduate, and Ph.D. scholars working in these fields.

Machine Learning Methods for Multi-Omics Data Integration

Author: Abedalrhman Alkhateeb
Publisher: Springer Nature
Total Pages: 171
Release: 2023-12-15
Genre: Science
ISBN: 303136502X



The advancement of biomedical engineering has enabled the generation of multi-omics data by developing high-throughput technologies, such as next-generation sequencing, mass spectrometry, and microarrays. Large-scale data sets for multiple omics platforms, including genomics, transcriptomics, proteomics, and metabolomics, have become more accessible and cost-effective over time. Integrating multi-omics data has become increasingly important in many research fields, such as bioinformatics, genomics, and systems biology. This integration allows researchers to understand complex interactions between biological molecules and pathways. It enables us to comprehensively understand complex biological systems, leading to new insights into disease mechanisms, drug discovery, and personalized medicine. Still, integrating various heterogeneous data types into a single learning model also comes with challenges. In this regard, learning algorithms have been vital in analyzing and integrating these large-scale heterogeneous data sets into one learning model. This book overviews the latest multi-omics technologies, machine learning techniques for data integration, and multi-omics databases for validation. It covers both supervised and unsupervised learning techniques, including standard classifiers, deep learning, tensor factorization, ensemble learning, and clustering, among others. The book categorizes different levels of integration among multi-view models, ranging from early to middle to late stage. The underlying models target different objectives, such as knowledge discovery, pattern recognition, disease-related biomarkers, and validation tools for multi-omics data. Finally, the book emphasizes practical applications and case studies, making it an essential resource for researchers and practitioners looking to apply machine learning to their multi-omics data sets.
The book covers data preprocessing, feature selection, and model evaluation, providing readers with a practical guide to implementing machine learning techniques on various multi-omics data sets.
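
The distinction between early and late integration mentioned above can be sketched in a few lines. This is a toy illustration under simple assumptions, not code from the book: early integration is shown as concatenating each sample's features across omics views before a single model is trained, and late integration as averaging the scores of separate per-view models. Function names and the averaging rule are hypothetical choices made here.

```python
def early_integration(views):
    """Concatenate per-sample feature vectors across omics views.

    views: list of views, each a list of per-sample feature lists,
           with samples in the same order in every view.
    """
    n_samples = len(views[0])
    return [
        [value for view in views for value in view[i]]
        for i in range(n_samples)
    ]


def late_integration(per_view_scores):
    """Average per-sample prediction scores from one model per view."""
    n_views = len(per_view_scores)
    n_samples = len(per_view_scores[0])
    return [
        sum(scores[i] for scores in per_view_scores) / n_views
        for i in range(n_samples)
    ]


# Two samples measured in two views (e.g. transcriptomics, proteomics).
transcript_features = [[1.0, 2.0], [3.0, 4.0]]
protein_features = [[5.0], [6.0]]
combined = early_integration([transcript_features, protein_features])

# Scores from two hypothetical models, each trained on one view.
averaged = late_integration([[0.2, 0.8], [0.4, 0.6]])
```

Early integration lets a single model learn cross-view interactions but mixes feature scales and dimensionalities; late integration keeps each view's model separate and combines only their outputs.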

Interpretable Machine Learning in Plant Genomes

Author: Christina Brady Azodi
Total Pages: 217
Release: 2019
Genre: Electronic dissertations
ISBN: 9781392717943



Complex systems are ubiquitous in genetics and genomics. From the regulation of gene expression to the genetic basis of complex traits, we see that complex networks of diverse cellular molecules underpin the natural world. Driven by technological advances, today's researchers have access to large amounts of omics data from diverse species. At the same time, improvements in computer processing and algorithms have produced more powerful computational tools. Taken together, these advances mean that those working at the interface of data science and biology are poised to better model and understand complex biological systems. The research in this dissertation demonstrates how a data-driven approach can be used to better understand three complex systems: (1) transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana, (2) the genetic basis of flowering time, a complex trait, in Zea mays, and (3) the social basis for opinions and beliefs about biotechnology products. To study the first system, we generated models of the cis-regulatory code from information about DNA sequence and additional omics levels using both classic machine learning and deep learning algorithms. We identified 1,061 putative cis-regulatory elements associated with different patterns of response to single and combined heat and drought stress and found that information about additional levels of regulation, especially chromatin accessibility and known transcription factor binding, improved our models of the cis-regulatory code. To study the second system, we generated phenotype prediction models for flowering time, height, and yield based on either genetic markers or transcript levels at the seedling stage. We found that, while genetic marker-based models performed better than transcript level-based models, models that integrated both types of data performed best.
Furthermore, transcript-based models were more useful for finding genes known to be associated with flowering time, highlighting how using additional levels of omics data can improve our ability to understand the genetic basis of complex traits. Finally, to study the third system, we integrated 29 characteristics of a person (e.g. age, political ideology, education, values, environmental beliefs) into a machine learning model that would predict an individual's beliefs and opinions about five different types of biotechnology products (e.g. biofortification, biopharmaceuticals). While this approach was particularly useful for identifying individuals that were broadly supportive of biotechnology, finding characteristics of individuals with negative or conditional (i.e. support product A, but not B) opinions was more challenging, highlighting the complexity of public opinions about biotechnology.