MARS: Discovering Novel Cell Types across Heterogeneous Single-cell Experiments

MARS is a tool for identifying and annotating cell types across heterogeneous single-cell experiments. MARS has a unique ability to discover novel cell types that have never been seen in previously annotated experiments and to suggest human-interpretable names for them.

The automated assignment of cells to cell types in a new single-cell experiment is the central task in single-cell genomics. Many computational tools are available for annotating cell types based on existing annotations in a predefined reference set of cell types. However, these tools are unable to annotate cells into cell types that are poorly characterized in existing annotations or have not been identified before. To assist with the discovery of cell types, new techniques are required that (i) harmonize heterogeneous and time-varying datasets, (ii) learn dataset-invariant cell representations, and (iii) use the learned representations to characterize previously unseen cell types.

Here we develop MARS, a method for discovering novel cell types and assigning them biologically meaningful, human-readable names. MARS is grounded in meta-learning, which allows it to overcome the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS has a unique ability to generalize to unannotated experiments and identify never-before-seen cell types.

Publication

MARS: Discovering Novel Cell Types across Heterogeneous Single-cell Experiments.
Maria Brbić, Marinka Zitnik, Sheng Wang, Angela O. Pisco, Russ B. Altman, Spyros Darmanis, Jure Leskovec.
Nature Methods, 2020.

@article{Brbic20,
  title={MARS: discovering novel cell types across heterogeneous single-cell experiments},
  author={Brbi\'c, Maria and Zitnik, Marinka and Wang, Sheng and Pisco, Angela O and Altman, Russ B and Darmanis, Spyros and Leskovec, Jure},
  journal={Nature Methods},
  volume={17},
  number={12},
  pages={1200--1206},
  year={2020},
}

Overview of MARS

MARS takes as input single-cell gene expression profiles from heterogeneous experiments (e.g., pancreas, lung, heart tissues) annotated according to their cell types, and a completely unannotated target experiment (e.g., brain tissue) that does not need to share any cell types with the annotated experiments. Using a deep neural network, MARS jointly learns a set of cell-type landmarks and an embedding function that projects cells into a shared embedding space, such that cells are close to their cell-type landmarks.



The embedding function projects a high-dimensional expression profile of each cell to a low-dimensional vector that directly captures the cell-type identity. Cell-type landmarks are defined as cell type representatives and are learned for both annotated and unannotated experiments. By transferring latent cell representations across multiple datasets, MARS overcomes the heterogeneity of cell types and leverages commonalities across experiments.
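The relationship between cells and landmarks can be illustrated with a minimal sketch. It assumes the cells have already been embedded, and it approximates each landmark as the mean embedding of its cell type. This is a simplification for illustration only: MARS learns the landmarks jointly with the embedding function. The function names and toy values below are hypothetical and are not part of the MARS code.

```python
import math
from collections import defaultdict


def compute_landmarks(embeddings, labels):
    """Approximate one landmark per cell type as the mean of its embeddings.

    Assumption: a simple average stands in for the landmarks that MARS
    learns jointly with its embedding network.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def nearest_landmark(vec, landmarks):
    """Return the name of the landmark closest to vec (Euclidean distance)."""
    return min(landmarks, key=lambda name: math.dist(vec, landmarks[name]))


# Toy 2D embeddings for two annotated cell types (hypothetical values).
embeddings = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 4.9]]
labels = ["B cell", "B cell", "hepatocyte", "hepatocyte"]

landmarks = compute_landmarks(embeddings, labels)
assignment = nearest_landmark([4.5, 5.2], landmarks)
print(assignment)
```

In this toy setting a cell is assigned to whichever landmark lies closest in the embedding space, which mirrors the stated goal that cells end up close to their cell-type landmarks.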

Case study: mouse cell atlas

We apply MARS to Tabula Muris, a large-scale, manually curated mouse cell atlas. We regard each tissue as a separate experiment and train MARS in a leave-one-tissue-out manner, where the held-out tissue is completely unannotated. MARS accurately annotates cell types and transfers latent cell representations across tissues.
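The leave-one-tissue-out protocol can be sketched as follows. This is an illustrative outline of the data split only, not the MARS training code; the function name, the dictionary layout, and the toy expression values are assumptions made for the example.

```python
def leave_one_tissue_out(cells_by_tissue):
    """Yield (held_out_tissue, annotated, unannotated) splits.

    cells_by_tissue maps a tissue name to a list of (profile, cell_type)
    pairs. For each split, one tissue is held out and its labels are
    dropped, so it plays the role of the unannotated experiment.
    """
    for held_out in sorted(cells_by_tissue):
        annotated = {
            tissue: cells
            for tissue, cells in cells_by_tissue.items()
            if tissue != held_out
        }
        # Keep only the expression profiles of the held-out tissue.
        unannotated = [profile for profile, _label in cells_by_tissue[held_out]]
        yield held_out, annotated, unannotated


# Toy data: two-gene expression profiles with cell-type labels (hypothetical).
toy = {
    "diaphragm": [([1.0, 0.0], "B cell"), ([0.9, 0.2], "macrophage")],
    "liver": [([0.1, 3.0], "hepatocyte")],
    "limb muscle": [([2.0, 2.0], "satellite cell")],
}

folds = list(leave_one_tissue_out(toy))
for held_out, annotated, unannotated in folds:
    print(held_out, sorted(annotated), len(unannotated))
```

Each split would then be used to train on the annotated tissues together with the unlabeled profiles of the held-out tissue, and the dropped labels serve only for evaluation.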

The figure below shows example MARS embeddings of diaphragm and liver tissues projected into 2D space using UMAP. Color indicates the ground-truth Tabula Muris cell-type annotation. In MARS's embedding space, cells naturally form clusters that correspond to cell types.



Interpretable names for novel cell types

MARS goes beyond the cell-type annotation task and assigns interpretable names to discovered groups of cells. MARS relies on the cell-type landmarks in the annotated experiments to probabilistically define a cell type based on its region in the low-dimensional embedding space.

The figure below shows an example in which limb muscle tissue is used as the unannotated experiment. For each unannotated cell type, MARS computes the distances to all landmarks from the annotated experiments and, for each landmark, outputs the probability that the discovered cell type receives the same name. MARS accurately assigns names to stromal cells, B-cells, macrophages, and satellite muscle cells.
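One way to turn landmark distances into naming probabilities is sketched below. The text above does not specify the formula, so this example assumes a softmax over negative squared Euclidean distances; that choice, the temperature parameter, the function name, and the toy landmark coordinates are all assumptions for illustration and should not be read as the exact MARS procedure.

```python
import math


def name_probabilities(new_landmark, annotated_landmarks, temperature=1.0):
    """Map distances to annotated landmarks onto a probability per name.

    Assumption: a softmax over negative squared distances, so that closer
    landmarks receive higher probability and all values sum to one.
    """
    scores = {
        name: -math.dist(new_landmark, landmark) ** 2 / temperature
        for name, landmark in annotated_landmarks.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    weights = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical landmarks learned from annotated tissues.
annotated_landmarks = {
    "B cell": [0.0, 0.0],
    "macrophage": [3.0, 0.0],
    "stromal cell": [0.0, 4.0],
}

# Landmark of a cell group discovered in the unannotated tissue.
probs = name_probabilities([0.5, 0.2], annotated_landmarks)
best_name = max(probs, key=probs.get)
print(best_name, round(probs[best_name], 3))
```

A flat distribution over names under such a scheme would suggest that the discovered group does not resemble any annotated cell type, which is the situation in which a novel cell type might be reported.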



Data

File                          Description
tms-facs-mars.tar.gz          Tabula Muris Senis dataset in h5ad format
TM_trained_models.tar.gz      Trained models for each tissue in Tabula Muris
cellbench_kolod_pollen.tgz    Small-scale CellBench and Kolod/Pollen datasets

Code

A PyTorch implementation of MARS is available on GitHub.