Prioritizing communities in megascale cell-cell interaction networks¶

In this notebook, we demonstrate the utility of CRank, an automatic method for prioritizing network communities and identifying the most promising ones for further experimentation, for the analysis of single-cell RNA-sequencing data.

This notebook reproduces Figure 4 in the CRank paper on Prioritizing network communities.

Single-cell RNA sequencing has transformed our understanding of complex cell populations (Trapnell, et al., Genome Research 2015). While many types of questions can be answered using single-cell RNA-sequencing, a central focus is the ability to survey the diversity of cell types within a sample.

To demonstrate that CRank scales to large networks, we used the single-cell RNA-seq dataset containing 1,306,127 embroyonic mouse brain cells (Zheng et al., Nature Communications 2017 and 10x Genomics) for which no cell types are known. Dataset was preprocessed using standard procedures to select and filter the cells based on quality-control metrics, normalize and scale the data, detect highly variable genes, and remove unwanted sources of variation (Satija et al., Nature Biotechnology 2015).

The dataset was represented as a weighted graph of nearest neighbor relations (edges) among cells (nodes), where relations indicated cells with similar gene expression patterns calculated using diffusion pseudotime analysis (Haghverdi et al., Nature Methods 2016). To partition this graph into highly interconnected communities we apply a community detection method proposed for single-cell data (Levine et al., Cell 2015).

Community detection analysis segregates the cells into 141 fine-grained communities, the largest containing 18,788 (1.8% of) and the smallest only 203 (0.02% of) cells. After detecting the communities, CRank takes the cell-cell interaction network and the detected communities, and generates a rank-ordered list of communities, assigning a priority to each community. CRank's prioritization of communities derived from the cell-cell interaction network takes less than 2 minutes.

(Last compiled and run on: 11/08/2017)

from operator import itemgetter

import numpy as np
import scanpy.api as sc
import pandas as pd
from matplotlib import pyplot as plt

sc.settings.verbosity = 3           
sc.settings.set_figure_params(dpi=70)
sc.logging.print_version_and_date()

%matplotlib inline

Running Scanpy version 0.2.8 on 2017-11-09 20:59.

Load single-cell RNA-seq dataset¶

ANALYSIS_DATA = '1M_neurons_corrected'
LOG_EXPRESSION_MATRIX = '1M_neurons_filtered_raw_log'
PRIORITIZATION = '1M_neurons_prioritization.txt'

Load the single-cell RNA-seq dataset.

We preprocessed (normalized, filtered) the dataset, constructed cell-cell interaction network, calculated diffusion map, computed t-SNE and PCA projections, and detected communities. These computations take several hours and need substantial amounts of memory. For convenience, adata_corrected object contains all results of these computations. Note that these computations represent the standard pipeline in the analysis of single-cell RNA-seq data; they are not in any way specific to CRank and are not required by CRank.

Also, load the log-normalized expression matrix (UMI count matrix).

adata_corrected = sc.read(ANALYSIS_DATA)
adata_raw = sc.read(LOG_EXPRESSION_MATRIX)

reading file ./write/1M_neurons_corrected.h5
reading file ./write/1M_neurons_filtered_raw_log.h5

Print variables in adata_corrected object containing results of additional analyses run on the cell-cell interaction network.

print(adata_corrected.add.keys())
print()
print(adata_corrected.smp.keys())
print()
print(adata_corrected.var.keys())

dict_keys(['louvain_groups_order', 'pca_variance_ratio', 'smp_keys_multicol', 'diffmap_evals', 'var_keys_multicol', 'louvain_params', 'data_graph_norm_weights', 'data_graph_distance_local'])

['n_genes', 'n_counts', 'X_pca', 'X_diffmap', 'X_diffmap0', 'louvain_groups', 'X_tsne']

['n_cells', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17', 'PC18', 'PC19', 'PC20', 'PC21', 'PC22', 'PC23', 'PC24', 'PC25', 'PC26', 'PC27', 'PC28', 'PC29', 'PC30', 'PC31', 'PC32', 'PC33', 'PC34', 'PC35', 'PC36', 'PC37', 'PC38', 'PC39', 'PC40', 'PC41', 'PC42', 'PC43', 'PC44', 'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50']

Plot 1 million cells in the cell-cell interacton network using t-SNE. Color cells according to clustering of cells into clusters/communities; cells assigned to each community are distinguished by color.

N = len(set(adata_corrected.smp['louvain_groups']))
c = plt.cm.get_cmap('hsv', N+1)
colors = [c(i) for i in range(N)]

fig, ax = plt.subplots(figsize=(12, 10))
sc.pl.scatter(adata_corrected, basis='tsne', color='louvain_groups', palette=colors, ax=ax, legend_loc='none');

Analyzing CRank's rank-ordered list of network communities¶

Network communities are detected using a community detection method developed for single-cell RNA-seq data (Levine et al., Cell 2015) and then prioritized using CRank. CRank generates a rank-ordered list of detected communities.

Let us load CRank's prioritization results, i.e., a rank-ordered list of communities (clusters of cell populations).

def read_prioritization(fname):
    with open(fname) as fin:
        fin.readline()
        com_score = [line.strip().split()[:2] for line in fin]
    com_score = [(com.replace('Cmt', ''), float(score)) for com, score in com_score]
    com_score = sorted(com_score, reverse=True, key=itemgetter(1))
    return pd.DataFrame(com_score, columns=['Community', 'CRank score'])


com_score = read_prioritization(PRIORITIZATION)
com_score

Let us analyze communities ranked high according to CRank¶

We select Community 98, Community 87, Community 63. These communities are all are ranked high by CRank.

Let us analyze highly differential genes for each selected community.

high_ranked_communities = ['98', '87', '63']
s2c = {'98': 'red', '87': 'green', '63': 'blue'}
s2cm = {'98': plt.cm.Reds, '87': plt.cm.Greens, '63': plt.cm.Blues}

sc.pl.rank_genes_groups(adata_raw, groups=high_ranked_communities, n_genes=20)

Plot t-SNE projections showing cells assigned to communities that are ranked high by CRank (from top to bottom: Community 98, Community 87, Community 63).

Cells assigned to each community are distinguished by color, and all other cells are shown in grey.

for com in high_ranked_communities:
    fig, ax = plt.subplots(figsize=(6, 4))
    ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
    ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
    sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);

Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.

Investigation reveals that communities ranked high by CRank are represented by clusters of cells that have a highly localized gene expression pattern. For example, genes indicative of cells in rank 1 community (the highest community in CRank ranking, i.e., Community 98) are TYROBP, C1QB, C1QC, FCER1G and C1QA. Expression levels of these genes are concentrated in cells that belong to Community 98.

Similarly, we see that genes that best characterize Community 87 and Community 63 are active only in cell populations of Community 87 and Community 63, suggesting that these genes might be good markers for those cell populations.

for com in high_ranked_communities: 
    top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
    val = np.sum(adata_raw[:, top_genes].X, axis=1)
    val = np.array(val).flatten()
    label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
    adata_raw.smp[label] = val
    sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');

Let us analyze communities ranked low according to CRank¶

We select Community 47, Community 123, Community 127. These communities are all ranked low by CRank.

Let us analyze highly differential genes for each selected community.

low_ranked_communities = ['47', '123', '127']
s2c = {'47': 'orange', '127': 'purple', '123': 'brown'}
s2cm = {'47': plt.cm.Oranges, '127': plt.cm.Purples, '123': plt.cm.YlOrBr}

sc.pl.rank_genes_groups(adata_raw, groups=low_ranked_communities, n_genes=20)

Plot t-SNE projections highlighting cells assigned to communities that are ranked low by CRank (from top to bottom: Community 47, Community 123, Community 127).

Cells assigned to each community are distinguished by color, and all other cells are shown in grey.

for com in low_ranked_communities:
    fig, ax = plt.subplots(figsize=(6, 4))
    ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
    ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
    sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);

Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.

Although Community 47, Community 123, Community 127 correspond to clusters of cells (see t-SNE projections above), they have diluted gene expression patterns that are spread out over the entire network, meaning that CRank has correctly considered those communities to be low priority.

For example, investigation reveals that genes that are most differentially expressed between cells in Community 127 and all other cells, i.e., OPCML, TMSB4X, NYM, CCK, and CNTN2, show a weak expression pattern that is diffused across the entire network.

for com in low_ranked_communities: 
    top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
    val = np.sum(adata_raw[:, top_genes].X, axis=1)
    val = np.array(val).flatten()
    label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
    adata_raw.smp[label] = val
    sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');

	Community	CRank score
0	92	1.000000
1	87	1.000000
2	76	1.000000
3	85	1.000000
4	77	1.000000
5	105	1.000000
6	98	1.000000
7	108	0.990218
8	63	0.984885
9	55	0.973381
10	21	0.972205
11	20	0.972205
12	34	0.972205
13	11	0.963596
14	74	0.502283
15	94	0.502283
16	75	0.502283
17	90	0.502283
18	110	0.492501
19	116	0.492501
20	58	0.487168
21	62	0.487168
22	53	0.475664
23	24	0.474488
24	15	0.474488
25	42	0.474488
26	40	0.474488
27	0	0.465880
28	101	0.381119
29	86	0.381119
...	...	...
111	54	0.128837
112	50	0.128837
113	49	0.128837
114	23	0.127661
115	33	0.127661
116	124	0.121364
117	125	0.121364
118	122	0.121364
119	121	0.121364
120	10	0.119052
121	69	0.116031
122	117	0.108238
123	118	0.108238
124	47	0.104527
125	126	0.098455
126	123	0.098455
127	130	0.098455
128	128	0.098455
129	134	0.075547
130	140	0.075547
131	132	0.075547
132	135	0.075547
133	131	0.075547
134	133	0.075547
135	137	0.075547
136	136	0.075547
137	127	0.075547
138	129	0.075547
139	138	0.075547
140	139	0.075547