Prioritizing communities in megascale cell-cell interaction networks

In this notebook, we demonstrate how CRank, an automatic method for prioritizing network communities and identifying the most promising ones for further experimentation, can be used to analyze single-cell RNA-sequencing data.

This notebook reproduces Figure 4 in the CRank paper on Prioritizing network communities.

Single-cell RNA sequencing has transformed our understanding of complex cell populations (Trapnell et al., Genome Research 2015). While single-cell RNA-sequencing can answer many types of questions, a central focus is the ability to survey the diversity of cell types within a sample.

To demonstrate that CRank scales to large networks, we used a single-cell RNA-seq dataset containing 1,306,127 embryonic mouse brain cells (Zheng et al., Nature Communications 2017 and 10x Genomics) for which no cell types are known. The dataset was preprocessed using standard procedures to select and filter the cells based on quality-control metrics, normalize and scale the data, detect highly variable genes, and remove unwanted sources of variation (Satija et al., Nature Biotechnology 2015).
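As a rough illustration of the normalization step mentioned above, the following sketch scales every cell to the same total count and then log-transforms the result. It is not the pipeline used for this dataset: the function name `normalize_and_log`, the `target_sum` value, and the toy counts are assumptions made for illustration only.

```python
import math

def normalize_and_log(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    Assumption: `target_sum` is an illustrative value, not taken from the notebook.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are mapped to all zeros instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(x * scale) for x in row])
    return normalized

# Hypothetical toy matrix: 2 cells x 3 genes. The second cell was sequenced
# ten times deeper but has the same relative expression as the first.
toy_counts = [[1, 3, 0], [10, 30, 0]]
toy_log = normalize_and_log(toy_counts)
print(toy_log)
```

After normalization the two toy cells have matching profiles, which is the point of the step: differences in sequencing depth should not be mistaken for differences in expression.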

The dataset was represented as a weighted graph of nearest-neighbor relations (edges) among cells (nodes), where relations indicated cells with similar gene expression patterns calculated using diffusion pseudotime analysis (Haghverdi et al., Nature Methods 2016). To partition this graph into highly interconnected communities, we applied a community detection method proposed for single-cell data (Levine et al., Cell 2015).
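To make the graph representation concrete, here is a minimal sketch of a weighted k-nearest-neighbor graph on toy data. It assumes plain Euclidean distance with a Gaussian kernel as the similarity, which is only a stand-in for the diffusion-based similarity used in the notebook; `knn_graph` and the toy profiles are hypothetical.

```python
import math

def knn_graph(profiles, k):
    """Return {cell: {neighbor: weight}} linking each cell to its k nearest cells.

    Assumption: edge weights come from a Gaussian kernel on Euclidean distance,
    a simple substitute for the diffusion-based similarity in the notebook.
    """
    graph = {}
    for i, p in enumerate(profiles):
        # Distance from cell i to every other cell.
        dists = [(math.dist(p, q), j) for j, q in enumerate(profiles) if j != i]
        # Keep the k closest cells and turn each distance into an edge weight.
        graph[i] = {j: math.exp(-d * d) for d, j in sorted(dists)[:k]}
    return graph

# Hypothetical toy data: four cells described by two genes.
# Cells 0 and 1 are similar to each other, as are cells 2 and 3.
toy_profiles = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
toy_graph = knn_graph(toy_profiles, k=1)
print(toy_graph)
```

This brute-force version compares every pair of cells, so it is only suitable for small examples; a dataset with over a million cells needs an approximate or indexed neighbor search.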

Community detection analysis segregates the cells into 141 fine-grained communities, the largest containing 18,788 cells (1.8% of all cells) and the smallest only 203 cells (0.02%). CRank then takes the cell-cell interaction network and the detected communities as input and generates a rank-ordered list of communities, assigning a priority to each community. Prioritizing the communities derived from the cell-cell interaction network takes CRank less than 2 minutes.
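Size statistics like those quoted above can be derived directly from the per-cell community labels. The sketch below shows one way to do this on toy labels; `community_size_summary` and the toy assignment are hypothetical, and the code does not reproduce CRank's prioritization itself.

```python
from collections import Counter

def community_size_summary(labels):
    """Return (community, size, fraction of all cells), sorted by decreasing size."""
    counts = Counter(labels)
    total = len(labels)
    return [(c, n, n / total) for c, n in counts.most_common()]

# Hypothetical toy assignment of 10 cells to 3 communities.
toy_labels = ['0', '0', '0', '0', '0', '1', '1', '1', '2', '2']
summary = community_size_summary(toy_labels)
for community, size, fraction in summary:
    print(f'community {community}: {size} cells ({fraction:.1%})')
```

In the notebook, the labels would presumably come from the `louvain_groups` annotation of the loaded object rather than from a hand-written list.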

(Last compiled and run on: 11/08/2017)

In [2]:
from operator import itemgetter

import numpy as np
import scanpy.api as sc
import pandas as pd
from matplotlib import pyplot as plt

sc.settings.verbosity = 3           

%matplotlib inline
Running Scanpy version 0.2.8 on 2017-11-09 20:59.

Load single-cell RNA-seq dataset

In [3]:
ANALYSIS_DATA = '1M_neurons_corrected'
LOG_EXPRESSION_MATRIX = '1M_neurons_filtered_raw_log'
PRIORITIZATION = '1M_neurons_prioritization.txt'

Load the single-cell RNA-seq dataset.

We preprocessed (normalized, filtered) the dataset, constructed the cell-cell interaction network, calculated the diffusion map, computed t-SNE and PCA projections, and detected communities. These computations take several hours and need substantial amounts of memory. For convenience, the adata_corrected object contains the results of all these computations. Note that these computations represent the standard pipeline for analyzing single-cell RNA-seq data; they are not specific to CRank and are not required by CRank.

Also, load the log-normalized expression matrix (UMI count matrix).

In [4]:
adata_corrected = sc.read(ANALYSIS_DATA)
adata_raw = sc.read(LOG_EXPRESSION_MATRIX)
reading file ./write/1M_neurons_corrected.h5
reading file ./write/1M_neurons_filtered_raw_log.h5

Print variables in adata_corrected object containing results of additional analyses run on the cell-cell interaction network.

In [5]:
dict_keys(['louvain_groups_order', 'pca_variance_ratio', 'smp_keys_multicol', 'diffmap_evals', 'var_keys_multicol', 'louvain_params', 'data_graph_norm_weights', 'data_graph_distance_local'])

['n_genes', 'n_counts', 'X_pca', 'X_diffmap', 'X_diffmap0', 'louvain_groups', 'X_tsne']

['n_cells', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17', 'PC18', 'PC19', 'PC20', 'PC21', 'PC22', 'PC23', 'PC24', 'PC25', 'PC26', 'PC27', 'PC28', 'PC29', 'PC30', 'PC31', 'PC32', 'PC33', 'PC34', 'PC35', 'PC36', 'PC37', 'PC38', 'PC39', 'PC40', 'PC41', 'PC42', 'PC43', 'PC44', 'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50']

Plot the 1.3 million cells in the cell-cell interaction network using t-SNE. Cells are colored by the community they were assigned to.

In [6]:
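N = len(set(adata_corrected.smp['louvain_groups']))
# Sample N+1 colors from the 'hsv' colormap and use the first N, one per community
c = plt.cm.get_cmap('hsv', N+1)
colors = [c(i) for i in range(N)]

fig, ax = plt.subplots(figsize=(12, 10))
sc.pl.scatter(adata_corrected, basis='tsne', color='louvain_groups', palette=colors, ax=ax, legend_loc='none');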
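The plotting cell asks the 'hsv' colormap for N+1 colors but uses only the first N. A plausible reason is that hue is cyclic, so the last sampled color would be nearly identical to the first. The sketch below illustrates the same idea using only the standard library; `distinct_hues` is a hypothetical helper and only approximates what the colormap does.

```python
import colorsys

def distinct_hues(n):
    """Return n fully saturated RGB colors evenly spaced around the hue circle.

    Hues are taken at i / n for i = 0..n-1, so hue 1.0 (which is red again,
    like hue 0.0) is never used and no two colors coincide.
    """
    return [colorsys.hsv_to_rgb(i / n, 1.0, 1.0) for i in range(n)]

# Hypothetical example with 6 communities.
palette = distinct_hues(6)
for rgb in palette:
    print(tuple(round(channel, 2) for channel in rgb))
```

With 141 communities, neighboring hues become hard to tell apart by eye, so a palette like this mainly helps to show the overall structure rather than to identify individual communities.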