Prioritizing communities in megascale cell-cell interaction networks

In this notebook, we demonstrate the utility of CRank, an automatic method for prioritizing network communities and identifying the most promising ones for further experimentation, for the analysis of single-cell RNA-sequencing data.

This notebook reproduces Figure 4 in the CRank paper on Prioritizing network communities.


Single-cell RNA sequencing has transformed our understanding of complex cell populations (Trapnell, et al., Genome Research 2015). While many types of questions can be answered using single-cell RNA-sequencing, a central focus is the ability to survey the diversity of cell types within a sample.

To demonstrate that CRank scales to large networks, we used the single-cell RNA-seq dataset containing 1,306,127 embroyonic mouse brain cells (Zheng et al., Nature Communications 2017 and 10x Genomics) for which no cell types are known. Dataset was preprocessed using standard procedures to select and filter the cells based on quality-control metrics, normalize and scale the data, detect highly variable genes, and remove unwanted sources of variation (Satija et al., Nature Biotechnology 2015).

The dataset was represented as a weighted graph of nearest neighbor relations (edges) among cells (nodes), where relations indicated cells with similar gene expression patterns calculated using diffusion pseudotime analysis (Haghverdi et al., Nature Methods 2016). To partition this graph into highly interconnected communities we apply a community detection method proposed for single-cell data (Levine et al., Cell 2015).

Community detection analysis segregates the cells into 141 fine-grained communities, the largest containing 18,788 (1.8% of) and the smallest only 203 (0.02% of) cells. After detecting the communities, CRank takes the cell-cell interaction network and the detected communities, and generates a rank-ordered list of communities, assigning a priority to each community. CRank's prioritization of communities derived from the cell-cell interaction network takes less than 2 minutes.


(Last compiled and run on: 11/08/2017)

In [2]:
from operator import itemgetter

import numpy as np
import scanpy.api as sc
import pandas as pd
from matplotlib import pyplot as plt

sc.settings.verbosity = 3           
sc.settings.set_figure_params(dpi=70)
sc.logging.print_version_and_date()

%matplotlib inline
Running Scanpy version 0.2.8 on 2017-11-09 20:59.

Load single-cell RNA-seq dataset

In [3]:
ANALYSIS_DATA = '1M_neurons_corrected'
LOG_EXPRESSION_MATRIX = '1M_neurons_filtered_raw_log'
PRIORITIZATION = '1M_neurons_prioritization.txt'

Load the single-cell RNA-seq dataset.

We preprocessed (normalized, filtered) the dataset, constructed cell-cell interaction network, calculated diffusion map, computed t-SNE and PCA projections, and detected communities. These computations take several hours and need substantial amounts of memory. For convenience, adata_corrected object contains all results of these computations. Note that these computations represent the standard pipeline in the analysis of single-cell RNA-seq data; they are not in any way specific to CRank and are not required by CRank.

Also, load the log-normalized expression matrix (UMI count matrix).

In [4]:
adata_corrected = sc.read(ANALYSIS_DATA)
adata_raw = sc.read(LOG_EXPRESSION_MATRIX)
reading file ./write/1M_neurons_corrected.h5
reading file ./write/1M_neurons_filtered_raw_log.h5

Print variables in adata_corrected object containing results of additional analyses run on the cell-cell interaction network.

In [5]:
print(adata_corrected.add.keys())
print()
print(adata_corrected.smp.keys())
print()
print(adata_corrected.var.keys())
dict_keys(['louvain_groups_order', 'pca_variance_ratio', 'smp_keys_multicol', 'diffmap_evals', 'var_keys_multicol', 'louvain_params', 'data_graph_norm_weights', 'data_graph_distance_local'])

['n_genes', 'n_counts', 'X_pca', 'X_diffmap', 'X_diffmap0', 'louvain_groups', 'X_tsne']

['n_cells', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17', 'PC18', 'PC19', 'PC20', 'PC21', 'PC22', 'PC23', 'PC24', 'PC25', 'PC26', 'PC27', 'PC28', 'PC29', 'PC30', 'PC31', 'PC32', 'PC33', 'PC34', 'PC35', 'PC36', 'PC37', 'PC38', 'PC39', 'PC40', 'PC41', 'PC42', 'PC43', 'PC44', 'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50']

Plot 1 million cells in the cell-cell interacton network using t-SNE. Color cells according to clustering of cells into clusters/communities; cells assigned to each community are distinguished by color.

In [6]:
N = len(set(adata_corrected.smp['louvain_groups']))
c = plt.cm.get_cmap('hsv', N+1)
colors = [c(i) for i in range(N)]

fig, ax = plt.subplots(figsize=(12, 10))
sc.pl.scatter(adata_corrected, basis='tsne', color='louvain_groups', palette=colors, ax=ax, legend_loc='none');

Analyzing CRank's rank-ordered list of network communities

Network communities are detected using a community detection method developed for single-cell RNA-seq data (Levine et al., Cell 2015) and then prioritized using CRank. CRank generates a rank-ordered list of detected communities.

Let us load CRank's prioritization results, i.e., a rank-ordered list of communities (clusters of cell populations).

In [7]:
def read_prioritization(fname):
    with open(fname) as fin:
        fin.readline()
        com_score = [line.strip().split()[:2] for line in fin]
    com_score = [(com.replace('Cmt', ''), float(score)) for com, score in com_score]
    com_score = sorted(com_score, reverse=True, key=itemgetter(1))
    return pd.DataFrame(com_score, columns=['Community', 'CRank score'])


com_score = read_prioritization(PRIORITIZATION)
com_score
Out[7]:
Community CRank score
0 92 1.000000
1 87 1.000000
2 76 1.000000
3 85 1.000000
4 77 1.000000
5 105 1.000000
6 98 1.000000
7 108 0.990218
8 63 0.984885
9 55 0.973381
10 21 0.972205
11 20 0.972205
12 34 0.972205
13 11 0.963596
14 74 0.502283
15 94 0.502283
16 75 0.502283
17 90 0.502283
18 110 0.492501
19 116 0.492501
20 58 0.487168
21 62 0.487168
22 53 0.475664
23 24 0.474488
24 15 0.474488
25 42 0.474488
26 40 0.474488
27 0 0.465880
28 101 0.381119
29 86 0.381119
... ... ...
111 54 0.128837
112 50 0.128837
113 49 0.128837
114 23 0.127661
115 33 0.127661
116 124 0.121364
117 125 0.121364
118 122 0.121364
119 121 0.121364
120 10 0.119052
121 69 0.116031
122 117 0.108238
123 118 0.108238
124 47 0.104527
125 126 0.098455
126 123 0.098455
127 130 0.098455
128 128 0.098455
129 134 0.075547
130 140 0.075547
131 132 0.075547
132 135 0.075547
133 131 0.075547
134 133 0.075547
135 137 0.075547
136 136 0.075547
137 127 0.075547
138 129 0.075547
139 138 0.075547
140 139 0.075547

141 rows × 2 columns

Let us analyze communities ranked high according to CRank

We select Community 98, Community 87, Community 63. These communities are all are ranked high by CRank.

Let us analyze highly differential genes for each selected community.

In [8]:
high_ranked_communities = ['98', '87', '63']
s2c = {'98': 'red', '87': 'green', '63': 'blue'}
s2cm = {'98': plt.cm.Reds, '87': plt.cm.Greens, '63': plt.cm.Blues}

sc.pl.rank_genes_groups(adata_raw, groups=high_ranked_communities, n_genes=20)

Plot t-SNE projections showing cells assigned to communities that are ranked high by CRank (from top to bottom: Community 98, Community 87, Community 63).

Cells assigned to each community are distinguished by color, and all other cells are shown in grey.

In [9]:
for com in high_ranked_communities:
    fig, ax = plt.subplots(figsize=(6, 4))
    ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
    ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
    sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);

Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.

Investigation reveals that communities ranked high by CRank are represented by clusters of cells that have a highly localized gene expression pattern. For example, genes indicative of cells in rank 1 community (the highest community in CRank ranking, i.e., Community 98) are TYROBP, C1QB, C1QC, FCER1G and C1QA. Expression levels of these genes are concentrated in cells that belong to Community 98.

Similarly, we see that genes that best characterize Community 87 and Community 63 are active only in cell populations of Community 87 and Community 63, suggesting that these genes might be good markers for those cell populations.

In [10]:
for com in high_ranked_communities: 
    top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
    val = np.sum(adata_raw[:, top_genes].X, axis=1)
    val = np.array(val).flatten()
    label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
    adata_raw.smp[label] = val
    sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');

Let us analyze communities ranked low according to CRank

We select Community 47, Community 123, Community 127. These communities are all ranked low by CRank.

Let us analyze highly differential genes for each selected community.

In [11]:
low_ranked_communities = ['47', '123', '127']
s2c = {'47': 'orange', '127': 'purple', '123': 'brown'}
s2cm = {'47': plt.cm.Oranges, '127': plt.cm.Purples, '123': plt.cm.YlOrBr}

sc.pl.rank_genes_groups(adata_raw, groups=low_ranked_communities, n_genes=20)

Plot t-SNE projections highlighting cells assigned to communities that are ranked low by CRank (from top to bottom: Community 47, Community 123, Community 127).

Cells assigned to each community are distinguished by color, and all other cells are shown in grey.

In [12]:
for com in low_ranked_communities:
    fig, ax = plt.subplots(figsize=(6, 4))
    ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
    ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
    sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);

Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.

Although Community 47, Community 123, Community 127 correspond to clusters of cells (see t-SNE projections above), they have diluted gene expression patterns that are spread out over the entire network, meaning that CRank has correctly considered those communities to be low priority.

For example, investigation reveals that genes that are most differentially expressed between cells in Community 127 and all other cells, i.e., OPCML, TMSB4X, NYM, CCK, and CNTN2, show a weak expression pattern that is diffused across the entire network.

In [13]:
for com in low_ranked_communities: 
    top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
    val = np.sum(adata_raw[:, top_genes].X, axis=1)
    val = np.array(val).flatten()
    label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
    adata_raw.smp[label] = val
    sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');