A Novel Computational Algorithm for Predicting Immune Cell Types Using Single-cell RNA Sequencing Data

Author: Shuo Jia
Publisher:
Total Pages: 0
Release: 2020
Genre:
ISBN:


Background: Cells of the immune system detect and kill pathogens to protect the body against many diseases. However, current methods for determining cell types have major limitations, such as being time-consuming and having low throughput. These problems accumulate and hinder deep exploration of cellular heterogeneity. Immune cells associated with cancer tissues play a critical role in revealing the stages of tumor development. Identifying the immune composition within tumor microenvironments in a timely manner will help improve clinical prognosis and therapeutic management for cancer. Single-cell RNA sequencing (scRNA-seq), an RNA sequencing (RNA-seq) technique that operates at the single-cell level, has made cell type classification possible. Although unsupervised clustering approaches are the major methods for analyzing scRNA-seq datasets, their results vary among studies with different input parameters and sizes. In supervised machine learning methods, by contrast, information loss and low prediction accuracy are the key limitations. Methods and Results: Genes in the human genome are arranged along chromosomes in a particular order. Hence, we hypothesize that incorporating this information into our model will potentially improve cell type classification performance. To utilize gene positional information, we introduce a chromosome-based neural network, ChrNet, a novel chromosome-specific re-trainable supervised learning method based on a one-dimensional convolutional neural network (1D-CNN). The model's performance was evaluated and compared with other supervised learning architectures. Overall, ChrNet showed the highest performance among the three models we benchmarked. In addition, we demonstrated the advantages of our new model over unsupervised clustering approaches using gene expression profiles from healthy and tumor-infiltrating immune cells.
The code for our model is packaged as a Python package that is publicly available on GitHub. Conclusions: We established an innovative chromosome-based 1D-CNN architecture to extract scRNA-seq expression information for immune cell type classification. We expect that this model can become a reference architecture for future cell type classification methods.
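The idea of feeding genes to a 1D convolution in chromosomal order can be illustrated with a minimal sketch. This is not the ChrNet implementation: the gene names, positions, and the helper functions `order_genes_by_chromosome`, `conv1d_valid`, and `chromosome_features` are hypothetical, and a single fixed filter stands in for learned convolutional weights.

```python
import numpy as np

# Hypothetical toy annotation: gene name -> (chromosome, start position).
GENE_POSITIONS = {
    "GENE_A": ("chr1", 500),
    "GENE_B": ("chr2", 100),
    "GENE_C": ("chr1", 100),
    "GENE_D": ("chr2", 900),
    "GENE_E": ("chr1", 300),
}


def order_genes_by_chromosome(gene_positions):
    """Group gene names by chromosome, sorted by start position."""
    by_chrom = {}
    for gene, (chrom, start) in gene_positions.items():
        by_chrom.setdefault(chrom, []).append((start, gene))
    return {
        chrom: [gene for _, gene in sorted(items)]
        for chrom, items in sorted(by_chrom.items())
    }


def conv1d_valid(signal, kernel):
    """1D cross-correlation without padding, the operation used in CNN layers."""
    return np.correlate(signal, kernel, mode="valid")


def chromosome_features(expression, gene_positions, kernel):
    """Apply one 1D filter separately to each chromosome's ordered genes."""
    features = {}
    for chrom, genes in order_genes_by_chromosome(gene_positions).items():
        signal = np.array([expression[g] for g in genes], dtype=float)
        features[chrom] = conv1d_valid(signal, kernel)
    return features
```

In a trained network the filter weights would be learned and many filters would be stacked; the sketch only shows why gene order matters: neighbouring entries of each signal are neighbouring genes on the same chromosome.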

Statistical Simulation and Analysis of Single-cell RNA-seq Data

Author: Tianyi Sun
Publisher:
Total Pages: 0
Release: 2023
Genre:
ISBN:


The recent development of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies by revealing the genome-wide gene expression levels within individual cells. In contrast to bulk RNA sequencing, scRNA-seq technology captures cell-specific transcriptome landscapes, which can reveal crucial information about cell-to-cell heterogeneity across different tissues, organs, and systems and enable the discovery of novel cell types and new transient cell states. According to search results from PubMed, from 2009 to 2023, over 5,000 published studies have generated datasets using this technology. Such large volumes of data call for high-quality statistical methods for their analysis. In the three projects of this dissertation, I have explored and developed statistical methods to model the marginal and joint gene expression distributions and determine the latent structure type for scRNA-seq data. In all three projects, synthetic data simulation plays a crucial role. My first project focuses on the exploration of the Beta-Poisson hierarchical model for the marginal gene expression distribution of scRNA-seq data. This model is a simplified mechanistic model with biological interpretations. Through data simulation, I demonstrate three typical behaviors of this model under different parameter combinations, one of which can be interpreted as one source of the sparsity and zero inflation that is often observed in scRNA-seq datasets. Further, I discuss parameter estimation methods for this model and its other applications in the analysis of scRNA-seq data. My second project focuses on the development of a statistical simulator, scDesign2, to generate realistic synthetic scRNA-seq data. Although dozens of simulators have been developed before, they lack the capacity to simultaneously achieve the following three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths.
To fill in this gap, scDesign2 is developed as a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs. My third project focuses on deciding latent structure types for scRNA-seq datasets. Clustering and trajectory inference are two important data analysis tasks that can be performed for scRNA-seq datasets and will lead to different interpretations. However, as of now, there is no principled way to tell which one of these two types of analysis results is more suitable to describe a given dataset. In this project, we propose two computational approaches that aim to distinguish cluster-type vs. trajectory-type scRNA-seq datasets. The first approach is based on building a classifier using eigenvalue features of the gene expression covariance matrix, drawing inspiration from random matrix theory (RMT). 
The second approach is based on comparing the similarity of real data and simulated data generated by assuming the cell latent structure as clusters or a trajectory. While both approaches have limitations, we show that the second approach gives more promising results and has room for further improvements.
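The Beta-Poisson hierarchical model mentioned in the first project can be simulated in a few lines. The sketch below assumes one common parameterization (a Beta-distributed fraction scaled by a constant to give the Poisson rate); the function name `simulate_beta_poisson` and its parameter names are illustrative and are not taken from the dissertation.

```python
import numpy as np


def simulate_beta_poisson(n_cells, alpha, beta, scale, seed=0):
    """Draw counts for one gene from a Beta-Poisson hierarchy.

    For each cell: p ~ Beta(alpha, beta), then count ~ Poisson(scale * p).
    The parameter names here are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    p = rng.beta(alpha, beta, size=n_cells)  # per-cell latent fraction
    return rng.poisson(scale * p)            # observed counts per cell


# With alpha well below 1, most of the Beta mass sits near zero, so many
# cells receive a near-zero Poisson rate and the counts are sparse.
counts = simulate_beta_poisson(n_cells=20000, alpha=0.2, beta=2.0, scale=50.0)
zero_fraction = float((counts == 0).mean())
```

The expected count under this parameterization is `scale * alpha / (alpha + beta)`, which gives a quick sanity check on simulated data.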

Handbook of Statistical Bioinformatics

Author: Henry Horng-Shing Lu
Publisher: Springer Nature
Total Pages: 406
Release: 2022-12-08
Genre: Science
ISBN: 3662659026


Now in its second edition, this handbook collects authoritative contributions on modern methods and tools in statistical bioinformatics with a focus on the interface between computational statistics and cutting-edge developments in computational biology. The three parts of the book cover statistical methods for single-cell analysis, network analysis, and systems biology, with contributions by leading experts addressing key topics in probabilistic and statistical modeling and the analysis of massive data sets generated by modern biotechnology. This handbook will serve as a useful reference source for students, researchers and practitioners in statistics, computer science and biological and biomedical research, who are interested in the latest developments in computational statistics as applied to computational biology.

Understanding Cell Identity with Single Cell Transcriptomics

Author: Geoffrey Stanley
Publisher:
Total Pages:
Release: 2019
Genre:
ISBN:


In my thesis work, I use single-cell whole-transcriptome sequencing to reveal new insights into cell identity: when cell types arise in development, how cell types are patterned in the adult, how splicing and transcription factors are modulated by cell identity, and the molecules that may be responsible for generating these patterns. In the first study, I sequenced neurons from the mouse striatum, a large brain region involved in Parkinson's and Huntington's diseases, in collaboration with Ozgun Gokce and Thomas Sudhof. I created a well-resolved classification of cell types in the mouse striatum; transcriptome analysis revealed 10 distinct differentiated cell types, including neurons, astrocytes, oligodendrocytes, ependymal, immune, and vascular cells, and enabled the discovery of numerous novel marker genes. I further explored neuronal heterogeneity in the adult murine striatum by combining single-cell RNA-seq of spiny projection neurons (SPNs) with quantitative RNA in situ hybridization (ISH) using the RNAscope platform. I developed a novel computational algorithm that distinguishes discrete versus continuous cell identities in scRNA-seq data, and used it to show that SPNs in the striatum can be classified into four major discrete types with little overlap and no implied spatial relationship. I found that these discrete classes vary continuously along multiple spatial gradients of expression; these gradients define anatomical location by a combinatorial mechanism. I used this information to support the description of a novel region of the striatum. Broadly, our results suggest that neuronal circuitry has a substructure, defined by the precise identity and location of a neuron, at far higher resolution than is typically interrogated. In a collaboration with Rahul Sinha and Irving Weissman, I discovered and investigated an artifact in Illumina sequencing data.
Illumina-based next-generation sequencing (NGS) has accelerated biomedical discovery through its ability to generate thousands of gigabases of sequencing output at low cost. In 2015, a new cluster generation chemistry called exclusion amplification (ExAmp) was introduced in the newer Illumina machines. This advance has been widely adopted for genome sequencing because greater sequencing depth can be achieved for lower cost without compromising the quality of longer reads. We show that this promising chemistry is problematic, however, when multiplexing samples. We discovered that up to 0.4-10% of sequencing reads (or signals) are incorrectly assigned from a given sample to other samples in a multiplexed pool. We provide evidence that this "spreading-of-signals" arises from low levels of free index primers present in the pool. The rate of signal spreading depends on the level of free index primers present in a library pool and is therefore variable among experiments. In a collaboration with Tianying Su, Rahul Sinha, and Kristy Red-Horse, I investigated the development of mouse coronary arteries using scRNA-Seq and mouse genetics. I developed a statistical test that categorizes subpopulations within scRNA-Seq datasets as continuous or discrete to identify candidate developmental transitions. I analyzed the transitions between coronary progenitors and artery cells computationally and in vivo, which revealed that the progenitor cells of the mouse heart undergo a gradual conversion from vein to artery before a subset crosses a threshold to differentiate into pre-artery cells. I showed that pre-artery cells in scRNA-Seq data appear prior to blood flow, contrary to previous assumptions about how the heart develops. We showed that a venous transcription factor, COUP-TFII, blocked progression to the pre-artery state through activation of cell cycle genes. I was also interested in how transcription factors maintain cell identity.
I therefore analyzed a dataset composed of more than 100,000 cells from 20 organs and tissues, produced by the Tabula Muris Consortium, to understand the transcription factor codes specifying cell identity in the mouse. One of the challenges of scRNA-Seq data is that nearly all studies are specific to a single organ, and it is challenging to compare data collected from different animals by independent labs with varying experimental techniques. To understand which TFs were most informative for specifying cell types, we used random forest machine learning to show that 136 TFs are needed to simultaneously define all cell types across all organs. I collected a compendium of transcription factor reprogramming protocols and showed that for nearly all reprogramming protocols, the TFs used also specified the targeted cell type in our data, suggesting that whole-organism scRNA-Seq data can inform novel reprogramming schemes.

Algorithms for Modeling Gene Regulation and Determining Cell Type Using Single-cell Molecular Profiles

Author: Hannah Andersen Pliner
Publisher:
Total Pages: 167
Release: 2019
Genre:
ISBN:


Single-cell genomic technologies are helping us answer key biological questions that have long remained elusive. How do a single cell and a single genome generate such complex multicellular organisms as humans? More specifically, how do these cells orchestrate specific transcriptional programs depending on their cell type? New technologies like single-cell RNA-seq and single-cell ATAC-seq allow us to examine the transcription and regulation of individual cells as they develop; however, these methods have important limitations. A primary limitation of all single-cell data is data sparsity, which must be overcome computationally to extract useful information from these experiments. In this dissertation, I present two algorithms designed to overcome the sparsity of single-cell data and allow biological discovery. I first introduce Cicero for single-cell chromatin accessibility data, which is both an algorithm that calculates co-accessibility scores to assign distal regulatory elements to genes, and a software system that adapts existing single-cell RNA-seq analysis techniques for use with single-cell chromatin accessibility data. In Chapter 2, I apply Cicero to an in vitro myoblast differentiation assay and find evidence for the use of "chromatin hubs" during myogenesis. In Chapter 3, I apply Cicero to single-cell ATAC-seq data from mouse bone marrow and recapitulate known patterns of hematopoiesis and known cis-regulation of the β-globin locus. In Chapter 4, I introduce a second algorithm, Garnett, which uses single-cell expression data to train and apply automated cell type classifiers. The accuracy of this technology is demonstrated with data from various single-cell RNA-seq methods and tissue sources. In a final chapter, I reflect on the development of software for biological applications and future directions for this work.

Computational Methods for Single-Cell Data Analysis

Author: Guo-Cheng Yuan
Publisher: Humana Press
Total Pages: 271
Release: 2019-02-14
Genre: Science
ISBN: 9781493990566


This detailed book provides state-of-the-art computational approaches to further explore the exciting opportunities presented by single-cell technologies. Chapters each detail a computational toolbox aimed to overcome a specific challenge in single-cell analysis, such as data normalization, rare cell-type identification, and spatial transcriptomics analysis, all with a focus on hands-on implementation of computational methods for analyzing experimental data. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Computational Methods for Single-Cell Data Analysis aims to cover a wide range of tasks and serves as a vital handbook for single-cell data analysis.

Methods and Applications of Integrating Single Nucleus and Bulk Tissue RNA Sequencing

Author: Marcus Fernando Alvarez
Publisher:
Total Pages: 0
Release: 2022
Genre:
ISBN:


Obesity typically precedes and accompanies the development of cardiometabolic diseases (CMD) that lead to increased morbidity and mortality. One of these disorders is non-alcoholic fatty liver disease (NAFLD), which encompasses a spectrum of varying degrees of fat accumulation and inflammation in the liver. More severe forms of NAFLD, such as non-alcoholic steatohepatitis (NASH), lead to a higher risk of developing hepatocellular carcinoma (HCC), the most prevalent form of liver cancer. Adipose tissue dysfunction in obesity can lead to increased circulating free fatty acids, and thus to ectopic lipid deposition in the liver. Left unchecked, lipotoxicity in the liver can result in inflammation, cell death, fibrosis, and ultimately the development of HCC. In both adipose and liver tissues, non-parenchymal cells, such as vascular and immune cell-types, play important roles in the normal function of these tissues and the pathophysiology of obesity, NAFLD, and HCC. A holistic approach to studying cell-types in a global manner would therefore greatly enhance our understanding of these common obesity-related diseases. Single-cell technologies, such as single-cell RNA-sequencing (scRNA-seq), assay individual cells and provide an excellent tool to study cell-type changes. While these approaches provide high resolution, they are currently costly and low-throughput. Traditional methods that measure molecular phenotypes at the tissue level are therefore still more practical. These assess a composite sum of cells present in the sample or biopsy, leading to inherent uncertainty in whether observed results are due to changes at the compositional level, cellular level, or both. Given these limitations, I aimed to integrate bulk-tissue RNA-sequencing (RNA-seq) and scRNA-seq data to leverage larger sample sizes in bulk RNA-seq and higher resolution in scRNA-seq. 
The application of single-cell technologies is especially promising for biobanks, as they can contain multiple levels of data on participants to uncover novel associations. Tissues are typically stored frozen, however, and this usually requires nuclei suspensions for single-nucleus RNA-seq (snRNA-seq), whereas whole cells would typically be used for scRNA-seq. This presents challenges for current droplet-based technologies. RNA from the ambient pool of lysed cells and nuclei can be encapsulated into droplets, confounding results. In Chapter 2, I present a computational method to remove empty droplets from gene expression data (Alvarez et al. 2020). This allows for cleaner downstream data analysis by ensuring that only droplets with nuclei or cells are used. As current scRNA-seq technologies are low-throughput, their application to population-based studies and cohorts is limited. Present scRNA-seq technologies have lower throughput than bulk-tissue RNA-seq, for which data sets typically have larger sample sizes. In Chapter 3, I developed a method to help address this methodological gap. This approach, called Bisque (Jew et al. 2020), estimates cell-type composition in bulk RNA-seq data sets using single-cell-level reference data from the same tissue. The estimated cell-type proportions can be associated with sample-level data to uncover relevant cell-types, or they can be included as covariates in a model to reduce confounding caused by cell-type heterogeneity. One advantage of our method is that it requires only a minimum amount of information in the form of cell-type markers. This makes it attractive for existing data sets, which may not have accompanying single-cell-level RNA-seq data. In the fourth chapter of this dissertation, I present our application of snRNA-seq to HCC. Carcinomas, such as HCC, are typically characterized by high amounts of tissue heterogeneity.
Larger-scale cancer cohorts usually lack single-cell-level data, making interpretation of bulk-tissue results challenging. Here, I integrated HCC single-cell-level experiments with relatively large HCC case-control bulk RNA-seq cohorts. The results from these analyses highlighted the role that proliferating cells play in HCC (Alvarez et al. 2022). These cycling cells were highly enriched in cancer tissue, as expected, and were prognostic of poor survival outcomes consistently in two independent cohorts. Furthermore, we observed that individuals with TP53 mutations have higher levels of these proliferating cells. Thus, our integration helped to interpret tumor gene expression changes as cell-type composition changes. In the fifth chapter, I present our human adipose tissue snRNA-seq results, showing changes in obesity and insulin resistance (Alvarez et al. manuscript in preparation). We applied multiplexing to increase our snRNA-seq sample size to roughly 100 subcutaneous adipose samples and over 100,000 nuclei, providing unprecedented resolution of human adipose tissue. This allowed us to identify finer resolution subcell-types, or cell states, which are more challenging to study as they are lower in frequency and exhibit more subtle differences. In addition to substantiating previous findings, we identified subcell-types associated with CMD. We then applied integrative approaches to corroborate these cell state changes in adipose bulk RNA-seq. Overall, our results show that both main cell-type and subcell-type variations are associated with metabolic traits. In summary, this dissertation presents my work on the integration of snRNA-seq and bulk-tissue RNA-seq to leverage distinct advantages provided by each. This has allowed us to gain a better understanding of the origin of gene expression changes in CMD.
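The general idea of estimating cell-type composition from bulk expression with a single-cell-derived reference can be illustrated with a generic sketch. This is not the Bisque algorithm: it uses plain ordinary least squares followed by clipping and renormalization, and the function name `estimate_proportions` and the toy reference matrix are hypothetical.

```python
import numpy as np


def estimate_proportions(bulk, reference):
    """Estimate cell-type proportions for one bulk sample.

    bulk:      (n_genes,) bulk expression vector.
    reference: (n_genes, n_cell_types) mean expression per cell type,
               for example averaged from single-cell-level data.

    Generic least-squares illustration, not the Bisque method.
    """
    coef, *_ = np.linalg.lstsq(reference, bulk, rcond=None)
    coef = np.clip(coef, 0.0, None)  # proportions cannot be negative
    total = coef.sum()
    if total == 0:
        raise ValueError("all estimated coefficients are zero")
    return coef / total              # rescale so proportions sum to 1


# Toy reference: 4 marker genes (rows) by 3 cell types (columns).
reference = np.array([
    [10.0, 0.0, 1.0],
    [0.0, 8.0, 1.0],
    [2.0, 2.0, 9.0],
    [5.0, 1.0, 0.0],
])
true_proportions = np.array([0.5, 0.3, 0.2])
bulk = reference @ true_proportions  # noise-free mixture
estimated = estimate_proportions(bulk, reference)
```

On this noise-free toy mixture the estimate recovers the true proportions; real data would add noise and technology-specific biases between bulk and single-cell measurements, which this sketch does not model.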

Interpretable Machine Learning Methods for Regulatory and Disease Genomics

Author: Peyton Greis Greenside
Publisher:
Total Pages:
Release: 2018
Genre:
ISBN:


It is an incredible feat of nature that the same genome contains the code for every cell in each living organism. From this same genome, each unique cell type gains a different program of gene expression that enables the development and function of an organism throughout its lifespan. The non-coding genome, the ~98% of the genome that does not code directly for proteins, serves an important role in generating the diverse programs of gene expression turned on in each unique cell state. A complex network of proteins binds specific regulatory elements in the non-coding genome to regulate the expression of nearby genes. While basic principles of gene regulation are understood, the regulatory code of which factors bind together at which genomic elements to turn on which genes remains to be revealed. Further, we do not understand how disruptions in gene regulation, such as from mutations that fall in non-coding regions, ultimately lead to disease or other changes in cell state. In this work we present several methods developed and applied to learn the regulatory code, or the rules that govern non-coding regions of the genome and how they regulate nearby genes. We first formulate the problem as one of learning pairs of sequence motifs and expressed regulator proteins that jointly predict the state of the cell, such as the cell-type-specific gene expression or chromatin accessibility. Using pre-engineered sequence features and known expression, we use a paired-feature boosting approach to build an interpretable model of how the non-coding genome contributes to cell state. We also demonstrate a novel improvement to this method that takes into account similarities between closely related cell types by using a hierarchy imposed on all of the predicted cell states. We apply this method to discover validated regulators of tadpole tail regeneration and to predict protein-ligand binding interactions.
Recognizing the need for improved sequence features and stronger predictive performance, we then move to a deep learning modeling framework to predict epigenomic phenotypes such as chromatin accessibility from just underlying DNA sequence. We use deep learning models, specifically multi-task convolutional neural networks, to learn a featurization of sequences over several kilobases long and their mapping to a functional phenotype. We develop novel architectures that encode principles of genomics in models typically designed for computer vision, such as incorporating reverse complementation and the 3D structure of the genome. We also develop methods to interpret traditionally "black box" neural networks by 1) assigning importance scores to each input sequence to the model, 2) summarizing non-redundant patterns learned by the model that are predictive in each cell type, and 3) discovering interactions learned by the model that provide indications as to how different non-coding sequence features depend on each other. We apply these methods in the system of hematopoiesis to interpret chromatin dynamics across differentiation of blood cell types, to understand immune stimulation, and to interpret immune disease-associated variants that fall in non-coding regions. We demonstrate strong performance of our boosting and deep learning models and demonstrate improved performance of these machine learning frameworks when taking into account existing knowledge about the biological system being modeled. We benchmark our interpretation methods using gold standard systems and existing experimental data where available. We confirm existing knowledge surrounding essential factors in hematopoiesis, and also generate novel hypotheses surrounding how factors interact to regulate differentiation.
Ultimately our work provides a set of tools for researchers to probe and understand the non-coding genome and its role in controlling gene expression as well as a set of novel insights surrounding how hematopoiesis is controlled on many scales from global quantification of regulatory sequence to interpretation of individual variants.
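One of the genomics principles mentioned above, reverse complementation, has a compact form when DNA is one-hot encoded. The sketch below assumes columns ordered A, C, G, T; under that assumption, reversing both the position axis and the base axis yields the encoding of the reverse-complement strand. The helper names `one_hot` and `reverse_complement` are illustrative and do not come from the thesis.

```python
import numpy as np

ALPHABET = "ACGT"  # column order assumed by this sketch


def one_hot(sequence):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    indices = [ALPHABET.index(base) for base in sequence.upper()]
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    encoded[np.arange(len(sequence)), indices] = 1.0
    return encoded


def reverse_complement(encoded):
    """Reverse-complement a one-hot matrix whose columns are A, C, G, T.

    Flipping rows reverses the sequence; flipping columns swaps A with T
    and C with G, which is the complement under this column order.
    """
    return encoded[::-1, ::-1]


forward = one_hot("AACG")
reverse = reverse_complement(forward)  # same matrix as one_hot("CGTT")
```

A model that scores both `forward` and `reverse` and combines the results treats the two strands of the same locus consistently; how the thesis architectures incorporate this is not specified here.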