Learning Inductive Representations of Biomedical Data
Author: Samuel G. Finlayson
Publisher:
Total Pages: 189
Release: 2020
Genre: Artificial intelligence
ISBN:
Representation learning with neural networks has catalyzed rapid progress in biomedical pattern recognition. This progress, however, has generally been limited to domains where data are abundant, richly structured, and stable. In contrast, much of biomedicine is marked by limited and poorly structured data and by highly dynamic deployment environments. In particular, many of the most compelling problem areas in biomedicine involve the "long tails" of rare diseases and rare events. In this thesis, I confront the challenge of learning data representations whose utility can extend into dynamic and data-poor biomedical domains. I do so through three primary projects.

First, I present a novel method for representation learning with subgraphs. This method, called Subgraph Neural Networks (Sub-GNN), learns disentangled representations of subgraph structure, neighborhood, and position through property-aware routing channels. The work is motivated by the desire for methods that can better contextualize patient phenotypes (encoded as subgraphs) into the broader context of biomedical knowledge, which could allow for better diagnostic generalization to novel disorders involving previously unseen phenotypes. Subgraph neural networks provide a principled framework for doing just this, by leveraging the relational inductive biases of the underlying knowledge graph while still respecting subgraphs as independent entities.

Next, I present an approach to learning coordinated representations of small molecules and their associated transcriptional signatures. This approach extends a popular paradigm for drug development (known as connectivity mapping) to operate inductively, making predictions involving drugs that have not previously been experimentally assayed. I benchmark the performance of this approach, studying the circumstances under which it can and cannot achieve strong performance.
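To make the structure/neighborhood/position decomposition concrete, here is a deliberately simplified, hand-rolled sketch of a three-channel subgraph representation. This is not the thesis's actual Sub-GNN architecture (which uses learned, property-aware routing channels); it is only an illustration of the idea that a subgraph can be summarized by its internal node features, its border, and its position relative to anchor nodes. All function and variable names here are invented for the example.

```python
from collections import deque

def bfs_distances(adj, start):
    """Hop distances from `start` to all reachable nodes, via BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def subgraph_embedding(adj, features, subgraph, anchors):
    """Toy three-channel subgraph summary:
    - structure:    mean feature vector of the subgraph's own nodes
    - neighborhood: mean feature vector of nodes bordering the subgraph
    - position:     hop distance from each anchor node to the subgraph
    """
    dim = len(next(iter(features.values())))
    # Structure channel: average the internal nodes' features.
    struct = [sum(features[n][i] for n in subgraph) / len(subgraph)
              for i in range(dim)]
    # Neighborhood channel: average features of one-hop border nodes.
    border = {v for n in subgraph for v in adj[n]} - set(subgraph)
    if border:
        neigh = [sum(features[n][i] for n in border) / len(border)
                 for i in range(dim)]
    else:
        neigh = [0.0] * dim
    # Position channel: shortest distance from each anchor to the subgraph.
    pos = []
    for a in anchors:
        d = bfs_distances(adj, a)
        pos.append(min(d.get(n, float("inf")) for n in subgraph))
    return struct + neigh + pos

# Tiny path graph 0-1-2-3 with 1-d node features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: [1.0], 1: [2.0], 2: [3.0], 3: [4.0]}
emb = subgraph_embedding(adj, features, subgraph=[1, 2], anchors=[0, 3])
# → [2.5, 2.5, 1, 1]: internal mean, border mean, distances to anchors 0 and 3
```

A learned model would replace each hand-coded channel with trainable message passing, but the decomposition into structure, neighborhood, and position carries over.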
Finally, I present an analysis of the clinical challenges posed by dataset shift, the phenomenon in which the input data to a deployed machine learning algorithm become mismatched with its training data. After introducing the problem of general dataset shift, I turn to a special case—adversarial examples—which reflect the worst-case generalization conditions for a machine learning system. I then test the representational robustness of three high-accuracy machine learning systems, constructing adversarial examples that drive their accuracy to 0% on data that is imperceptibly different from the training data. I discuss the implications of these findings for clinical machine learning, offering specific regulatory recommendations. I conclude my thesis with lessons learned from these projects, and provide an extensive appendix with three additional smaller-scale projects that branched off from my main research.
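The adversarial examples described above are typically constructed by perturbing each input in the direction of the loss gradient. The sketch below shows the fast-gradient-sign idea on a toy logistic-regression model; it is an illustration of the general technique, not the thesis's actual attacks on medical imaging systems, and all names in it are invented for the example.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Fast-gradient-sign perturbation of input `x` for a logistic model.

    For loss L = -log sigmoid(y * (w.x + b)) with label y in {-1, +1},
    the input gradient is dL/dx = -y * (1 - sigmoid(y * (w.x + b))) * w.
    FGSM moves x by `eps` in the sign of that gradient, which bounds the
    per-coordinate change by eps while maximally increasing the loss.
    """
    margin = y * (x @ w + b)
    sig = 1.0 / (1.0 + np.exp(-margin))
    grad = -y * (1.0 - sig) * w
    return x + eps * np.sign(grad)

# Toy model: classify an input by the sign of its mean value.
w = np.ones(16) / 16.0
b = 0.0
x = np.full(16, 0.5)  # a clearly positive example
y = 1
assert np.sign(x @ w + b) == 1.0  # correctly classified before the attack

x_adv = fgsm_perturb(x, w, b, y, eps=0.6)
assert np.max(np.abs(x_adv - x)) <= 0.6 + 1e-12  # perturbation is bounded...
assert np.sign(x_adv @ w + b) == -1.0            # ...yet the prediction flips
```

On high-dimensional inputs such as medical images, the same small per-pixel budget suffices to flip predictions while leaving the image visually unchanged, which is what makes this the worst case for deployment.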