Meta-learning for Bridging Labeled and Unlabeled Data in Biomedicine

ISMB/ECCB 2021 Tutorial

Tutorial goals

In biomedical domains, labeled datasets are often difficult and time-consuming to obtain: classes must be hand-labeled through costly manual effort and expert knowledge before machine learning methods can even be used. As a result, many datasets are scarcely labeled or completely unlabeled. For instance, in protein function prediction a large number of functional labels have only a few labeled genes, and in single-cell transcriptomics novel and rare cell types appear across large, heterogeneous single-cell datasets.

While machine learning methods excel on tasks with large labeled datasets that can support learning of highly parameterized models, solving central problems in biomedicine requires methods that can generalize to unseen domains and datasets given only a few labeled training examples or, in the extreme case, to completely unlabeled datasets. Major advances in low-data regimes have been achieved by leveraging knowledge across related tasks with meta-learning, or learning to learn across tasks. The central idea behind meta-learning is to acquire prior knowledge from previous tasks so that new tasks can be learned efficiently from a small amount of data.
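The few-shot setting described above can be made concrete with a minimal sketch. The code below is an illustration written for this page, not the tutorial's demo notebook: it samples one N-way K-shot episode (a small labeled support set plus a query set) and classifies each query example by its nearest class mean, the decision rule used by prototypical networks. The toy Gaussian data, the function names `sample_episode` and `prototype_predict`, and the use of raw features in place of a learned embedding are all assumptions made for illustration; in an actual meta-learning method the embedding would be trained across many such episodes.

```python
import numpy as np


def sample_episode(features, labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode: a support set and a query set."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for episode_label, cls in enumerate(classes):
        # Shuffle the examples of this class, then split into support/query.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        support_idx = idx[:k_shot]
        query_idx = idx[k_shot:k_shot + n_query]
        support_x.append(features[support_idx])
        query_x.append(features[query_idx])
        support_y += [episode_label] * len(support_idx)
        query_y += [episode_label] * len(query_idx)
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))


def prototype_predict(support_x, support_y, query_x):
    """Assign each query example to the class with the nearest prototype."""
    classes = np.unique(support_y)
    # A prototype is the mean of the support examples of one class.
    prototypes = np.stack(
        [support_x[support_y == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    diffs = query_x[:, None, :] - prototypes[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]


# Hypothetical toy data: 10 well-separated Gaussian classes, 20 examples each.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 10, 20, 16
centers = rng.normal(scale=10.0, size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), per_class)
features = centers[labels] + rng.normal(size=(n_classes * per_class, dim))

# One 5-way 3-shot episode with 5 query examples per class.
sx, sy, qx, qy = sample_episode(
    features, labels, n_way=5, k_shot=3, n_query=5, rng=rng
)
accuracy = (prototype_predict(sx, sy, qx) == qy).mean()
print(f"episode accuracy: {accuracy:.2f}")
```

Because the toy classes are far apart relative to the noise, three labeled examples per class are enough here; on real biomedical data the quality of the learned embedding is what determines whether such a simple rule works.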

This tutorial will cover the principles and recent advances of meta-learning, with case studies chosen for their relevance to new biomedical discoveries. We will present representation learning methods that bridge labeled and unlabeled data by learning to generalize across datasets given only a few labeled examples or, in the extreme case, without any labeled data, with an emphasis on interpretability. We will spend a considerable amount of time explaining how interpretability can be incorporated as an essential feature in the design of the methods. The tutorial will equip participants to understand both the fundamentals and state-of-the-art meta-learning methods, and to apply the learned concepts and methods in their own research.

Learning objectives

By the end of the tutorial, participants will understand the basic concepts of and recent advances in meta-learning techniques, organized around the following questions:

How can we effectively learn from scarcely labeled datasets, e.g., protein functions or structures with a few labeled examples? How can we use prior knowledge to learn to generalize, i.e., meta-learn?
How can we utilize knowledge from existing knowledge bases, such as Gene Ontology and Cell Ontology, to provide interpretations behind decisions based on only a few labeled examples?
How can we learn without any labeled examples? How can we discover new, never-before-seen categories/classes, such as rare and unseen cell types across single-cell experiments?
How can we transfer knowledge across different species, tissues, or sequencing technologies?
What fundamental open problems in biology can benefit from meta-learning techniques? How can meta-learning be applied to these problems?
What frameworks, tools and libraries are available to use meta-learning methods on new datasets and applications?

Tutorial materials and outline

Our tutorial will cover fundamental techniques and recent advances in machine learning methods that can generalize from a small number of labels or, in the extreme case, without any labels. Participants will acquire knowledge of the fundamental concepts and an understanding of how to use these methods to advance new discoveries in their own research.

The tutorial will be organized as a half-day session. Tutorial materials will be posted on this website in early July 2021.

The detailed outline of the tutorial is as follows:

Introduction [slides]

Types of missing data problems
Basics of deep learning

Few-shot learning part I: Meta-learning for few-shot learning [slides]

Problem statement: Few-shot learning
Optimization-based methods (e.g., MAML)
Metric-based methods (e.g., Siamese, MatchingNet, ProtoNet)
Applications: Drug discovery and cellular response prediction

Few-shot learning part II: Integrating side information [slides]

Feature-level prior knowledge for interpretability (e.g., COMET)
Class-level prior knowledge (e.g., AM3, TAdaNet)
Applications: Interpretable cell type annotation, disease prediction

Open-world learning [slides]

Problem statement: Open-world learning
Novel class discovery (e.g., MARS)
Open-world semi-supervised learning (e.g., ORCA)
Applications: Cell type discovery from single-cell data and multiplexed imaging technology CODEX

Guidelines and conclusions [slides]

Tips and guidelines on available resources, libraries and tools
Demo on prototypical networks for cell type annotation [ipynb]
Future perspectives and concluding remarks

Tutorial recording

A video recording of the tutorial can be found at: https://www.youtube.com/watch?v=A5Gj9vaoimo.

Tutorial info

The tutorial was held at the ISMB/ECCB conference on Friday, July 23, 2021, 15:00 - 19:00 UTC.

The tutorial is designed for researchers who would like to learn the principles of machine learning techniques that can be applied when only a limited amount of labeled data is available. The tutorial requires basic prior knowledge of the fundamental concepts covered in introductory machine learning classes.

Presenters

Maria Brbic is a postdoctoral fellow in Computer Science at Stanford University. Her research focuses on the development of computational approaches that can generalize to never-before-seen contexts and tasks, with particular interest in applications in single-cell genomics. She is involved in projects at the Chan Zuckerberg Biohub and the Stanford Neuro-omics Initiative. She received her Bachelor's degree from the University of Zagreb, Croatia, and her PhD degree in Computer Science, with the best PhD thesis award, from the University of Zagreb. During her PhD she also conducted research at the University of Tokyo and, as a Fulbright Scholar, at Stanford University.

Chelsea Finn is an Assistant Professor of Computer Science and Electrical Engineering at Stanford University. Dr. Finn's research focuses on machine learning and robotics, with a significant focus on generalization and few-shot learning. She has pioneered work on meta-learning algorithms that can enable fast, few-shot adaptation, including the widely used model-agnostic meta-learning algorithm. Her PhD thesis, Learning to Learn with Gradients, received the ACM Doctoral Dissertation Award, and her research more broadly has been recognized by several other awards, including the Samsung AI Researcher of the Year, the Microsoft Research Faculty Fellowship, and the MIT Technology Review 35 under 35 Award. Her work has also been covered by various media outlets, including the New York Times, Wired, and Bloomberg. Finn received her Bachelor's degree in Electrical Engineering and Computer Science at MIT and her PhD in Computer Science at UC Berkeley.

Jure Leskovec is an Associate Professor of Computer Science at Stanford University, Chief Scientist at Pinterest, and an investigator at the Chan Zuckerberg Biohub. Dr. Leskovec was the co-founder of the machine learning startup Kosei, which was later acquired by Pinterest. His research focuses on machine learning and data mining of large social, information, and biological networks. Computation over massive data is at the heart of his research and has applications in computer science, social sciences, marketing, and biomedicine. This research has won several awards including a Lagrange Prize, the Microsoft Research Faculty Fellowship, the Alfred P. Sloan Fellowship, and numerous best paper and test of time awards. It has also been featured in popular press outlets such as the New York Times and the Wall Street Journal. Leskovec received his bachelor's degree in computer science from the University of Ljubljana, Slovenia, and his PhD in machine learning from Carnegie Mellon University, and completed postdoctoral training at Cornell University.