Deep Learning for Network Biology

ISMB 2018 Tutorial

Networks are ubiquitous in biology, where they encode connectivity patterns at all scales of organization, from the molecular level to the biome. This tutorial investigates key advancements in representation learning for networks over the last few years, with an emphasis on fundamentally new opportunities in network biology enabled by these advancements.

Tutorial information

Biological networks are powerful resources for the discovery of interactions and emergent properties in biological systems, ranging from the single-cell to the population level. Network approaches have been widely used to combine and amplify signals from individual genes, and have led to remarkable advances in areas including drug discovery, protein function prediction, disease diagnosis, and precision medicine. Furthermore, these approaches have shown broad utility in uncovering new biology and have contributed to new discoveries in wet laboratory experiments.

The mathematical machinery central to these approaches is machine learning on networks. The main challenge is to extract information about interactions between nodes and to incorporate that information into a machine learning model. To extract this information, classic machine learning approaches often rely on summary statistics (e.g., degrees or clustering coefficients) or carefully engineered features that measure local neighborhood structures (e.g., network motifs). These classic approaches can be limited because hand-engineered features are inflexible, often do not generalize to networks derived from other organisms, tissues, and experimental technologies, and can fail on datasets with low experimental coverage.
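To make the notion of hand-engineered features concrete, the following minimal sketch computes two of the summary statistics mentioned above, node degree and local clustering coefficient, on a toy network. The edge list and node names are hypothetical placeholders, not data from the tutorial, and the helper functions are illustrative rather than taken from any particular library.

```python
from itertools import combinations

# Toy undirected network; node names are hypothetical placeholders.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def build_adjacency(edge_list):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, node):
    """Number of neighbors of a node."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; use 0 by convention
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

adj = build_adjacency(edges)
# One fixed-length feature vector per node: (degree, clustering coefficient).
features = {n: (degree(adj, n), clustering_coefficient(adj, n)) for n in sorted(adj)}
```

Each node ends up with the same two fixed features regardless of the downstream task, which is the inflexibility that learned representations aim to address.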

Recent years have seen a surge in approaches that automatically learn to encode network structure into low-dimensional representations, using transformation techniques based on deep learning and nonlinear dimensionality reduction. The idea behind these representation learning approaches is to learn a data transformation function that maps nodes to points in a low-dimensional vector space; these points are termed embeddings. Representation learning methods have revolutionized the state of the art in network science, and the goal of this tutorial is to open the door for these methods in computational biology and bioinformatics.
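As a rough illustration of what learning an embedding can look like, the sketch below trains a shallow embedding, one free vector per node, so that the sigmoid of the inner product of two node vectors approximates whether the nodes are connected. This is one simple choice among many and is not claimed to be the method used in the tutorial. The toy network, the embedding dimension, the learning rate, and the number of steps are all assumptions made for illustration.

```python
import math
import random

# Toy undirected network: two triangles joined by one bridge edge (hypothetical data).
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}
num_nodes, dim = 6, 2
pairs = [(u, v) for u in range(num_nodes) for v in range(u + 1, num_nodes)]
labels = [1.0 if pair in edges else 0.0 for pair in pairs]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_log_loss(Z):
    """Average cross-entropy between predicted and observed edges."""
    total = 0.0
    for (u, v), y in zip(pairs, labels):
        p = sigmoid(dot(Z[u], Z[v]))
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pairs)

def train(Z, steps=300, lr=0.5):
    """Full-batch gradient descent on the embedding table Z (updated in place)."""
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(num_nodes)]
        for (u, v), y in zip(pairs, labels):
            err = sigmoid(dot(Z[u], Z[v])) - y  # derivative of the loss w.r.t. the score
            for k in range(dim):
                grad[u][k] += err * Z[v][k]
                grad[v][k] += err * Z[u][k]
        for n in range(num_nodes):
            for k in range(dim):
                Z[n][k] -= lr * grad[n][k] / len(pairs)
    return Z

rng = random.Random(0)  # fixed seed so the run is repeatable
Z = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_nodes)]
initial_loss = mean_log_loss(Z)
Z = train(Z)
final_loss = mean_log_loss(Z)
```

After training, each row of `Z` is a two-dimensional embedding of one node, and `final_loss` should fall below `initial_loss`. Deep methods replace the free lookup table with an encoder that computes embeddings from network structure and node features.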

The tutorial investigates techniques for biological network modeling, analytics and optimization:

How do we represent genome-wide biological data with networks and what techniques can be used to unify data coming in different formats and from different experimental technologies? How do we represent different types of biological objects, such as genes, diseases and drugs, and how do we represent different relation types using networks?
How do we automatically extract information about interactions between molecular components? How do we incorporate this information into a machine learning model? What methods can be used to embed individual nodes in a network?
Fundamental to biological networks is the principle that genes underlying the same phenotype tend to interact. How do we mathematically encode such principles into a machine learning model?
How can we embed larger network structures, such as subgraphs of disease proteins, and entire networks, such as molecular graphs, into a low-dimensional vector space? How do we deal with highly multi-relational and multi-layered networks?
Genes associated with the same function tend to cluster in the same network neighborhood. How do we use this notion to predict new gene functions?
Proteins involved in the same disease tend to interact with each other. How do representation learning methods use this information to detect disease modules? How can we prioritize candidate disease proteins?
How can representation learning methods identify cell types in mega-scale single-cell genomic networks? How do we integrate tissue-specific information into these methods?
Drugs tend to target proteins that are close to proteins associated with diseases that these drugs treat. How do we develop methods that operationalize this notion and generate hypotheses for drug repurposing?

The tutorial investigates methods and case studies for analyzing biological networks and extracting actionable insights, and in doing so, it provides attendees with a toolbox of next-generation algorithms for network biology.

Tutorial materials and outline

Our tutorial will cover the key conceptual foundations of representation learning, from more traditional approaches relying on matrix factorization and network propagation to very recent advancements in deep representation learning for networks.

In addition to a broad high-level overview, we will spend a considerable amount of time describing the details of recent advancements in deep representation learning and discussing both algorithmic and implementation aspects.

Introduction: Introduction to networks and overview of network biology (pdf) (ppt)

Biological network maps and interaction resources
Concepts of network theory
Organizing principles of network biomedicine (hubs, local principle, network parsimony principle, shared components principle)
Standard prediction tasks (node classification, link prediction, and node clustering)

Part 1: Network propagation and node embeddings (pdf) (ppt)

Network propagation
Random-walk embeddings (e.g., DeepWalk, node2vec, struc2vec)
Applications: Protein-protein interactions, Disease pathway detection
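
Network propagation is commonly formulated as a random walk with restart from a set of seed nodes, and the sketch below follows that formulation. It is a minimal illustration, not necessarily the variant presented in the slides. The toy network, the node names, and the restart probability of 0.3 are assumptions.

```python
# Toy undirected network; gene names are hypothetical placeholders.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5"), ("g2", "g5")]

def build_adjacency(edge_list):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def propagate(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart: iterate p = restart * p0 + (1 - restart) * W p,
    where W spreads each node's score evenly over its neighbors."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:  # stop once the scores have stabilized
            break
    return p

adj = build_adjacency(edges)
scores = propagate(adj, seeds={"g1"})
# Rank nodes by how strongly the seed signal reaches them.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a disease pathway setting, the seeds would be known disease proteins and the highest-scoring non-seed nodes would be candidate additions to the pathway. A larger restart probability keeps the signal closer to the seeds.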

Part 2: Graph autoencoders and deep representation learning (pdf) (ppt)

Principles of graph autoencoder approaches (encoding, message passing, decoding)
Detailed description of graph convolutional networks (GCNs)
Applications: Gene function prediction
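
For orientation, here is a minimal sketch of a single graph convolutional layer using the widely used propagation rule H' = ReLU(Â H W), where Â is the adjacency matrix with self-loops added and symmetrically normalized by node degree. It is written in plain Python for readability and is not the TensorFlow code from the demos. The toy graph, the feature matrix `H`, and the weight matrix `W` are made-up values; in practice `W` is learned.

```python
import math

def normalized_adjacency(adj_matrix):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj_matrix)
    a_tilde = [[adj_matrix[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def relu(X):
    return [[max(0.0, x) for x in row] for row in X]

def gcn_layer(adj_matrix, H, W):
    """One message-passing step: aggregate neighbor features, transform, apply ReLU."""
    return relu(matmul(matmul(normalized_adjacency(adj_matrix), H), W))

# Path graph 0 - 1 - 2 with two input features per node (hypothetical values).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
H = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[1.0, -1.0],
     [0.5, 1.0]]  # fixed here for illustration; normally trained

H_next = gcn_layer(A, H, W)
```

Stacking several such layers lets each node's representation depend on nodes several hops away. In a graph autoencoder, the final layer's output serves as the encoding that a decoder then uses to reconstruct edges or predict labels.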

Part 3: Heterogeneous networks (pdf) (ppt)

Deep learning methods for heterogeneous, multi-relational, and hierarchical graphs (e.g., OhmNet, metapath2vec, Decagon)
Integration of side information into deep networks (e.g., structural fingerprints of chemicals, gene expression levels)
Applications: Tissue-specific protein embeddings, Drug side effects, Drug repurposing, Hierarchical structure of the cell

Conclusion: End-to-end TensorFlow examples and new directions (pdf) (ppt)

Implementation insights and demos on biomedical networks (Demo 1: html, ipynb) (Demo 2: html, ipynb)
Single-cell genomics and gene regulation (e.g., clustering of cells, biomarker discovery)
Human disease (e.g., disease pathway discovery, multi-omic and clinical data)

We just released a new dataset collection, BioSNAP Datasets, containing many large biomedical network datasets.

Tutorial info

The tutorial will be held at the ISMB conference in Chicago, IL, USA, on Friday, July 6th, 2018.

The tutorial will be of broad interest to researchers who work with network data coming from biology, medicine, and the life sciences. Graph-structured data arise in many different areas of data mining and predictive analytics, so the tutorial should be of theoretical and practical interest to a large part of the data mining and network science communities.

The tutorial will not require prior knowledge beyond fundamental concepts covered in introductory machine learning and network science classes. Attendees will come away with a broad knowledge necessary to understand state-of-the-art representation learning methods and to use these methods to solve central problems in network biology.

No special software or other package installation is needed to follow this tutorial. However, this tutorial contains several demos in Python and TensorFlow that might be of interest to participants.

Presenters

Marinka Zitnik is a postdoctoral fellow in Computer Science at Stanford University. Her research focuses on network science and representation learning methods for biomedicine. She received her PhD in Computer Science from the University of Ljubljana in 2015 while also conducting research at Imperial College London, University of Toronto, Baylor College of Medicine, and Stanford University. She received outstanding research awards at the ISMB, CAMDA, RECOMB, and BC2 conferences, and is involved in projects at the Chan Zuckerberg Biohub.

Jure Leskovec is an Associate Professor of Computer Science at Stanford University and a Chan Zuckerberg Biohub Investigator. His recent research focuses on biological and biomedical problems and on applications of network science to problems in biomedicine and health. Jure received his PhD in Machine Learning from Carnegie Mellon University in 2008 and spent a year at Cornell University. His work has received five best paper awards, won the ACM KDD Cup, and topped the Battle of the Sensor Networks competition.