In this notebook, we demonstrate the utility of CRank, an automatic method for prioritizing network communities and identifying the most promising ones for further experimentation, for the analysis of single-cell RNA-sequencing data.
This notebook reproduces Figure 4 in the CRank paper on Prioritizing network communities.
Single-cell RNA sequencing has transformed our understanding of complex cell populations (Trapnell, et al., Genome Research 2015). While many types of questions can be answered using single-cell RNA-sequencing, a central focus is the ability to survey the diversity of cell types within a sample.
To demonstrate that CRank scales to large networks, we used the single-cell RNA-seq dataset containing 1,306,127 embroyonic mouse brain cells (Zheng et al., Nature Communications 2017 and 10x Genomics) for which no cell types are known. Dataset was preprocessed using standard procedures to select and filter the cells based on quality-control metrics, normalize and scale the data, detect highly variable genes, and remove unwanted sources of variation (Satija et al., Nature Biotechnology 2015).
The dataset was represented as a weighted graph of nearest neighbor relations (edges) among cells (nodes), where relations indicated cells with similar gene expression patterns calculated using diffusion pseudotime analysis (Haghverdi et al., Nature Methods 2016). To partition this graph into highly interconnected communities we apply a community detection method proposed for single-cell data (Levine et al., Cell 2015).
Community detection analysis segregates the cells into 141 fine-grained communities, the largest containing 18,788 (1.8% of) and the smallest only 203 (0.02% of) cells. After detecting the communities, CRank takes the cell-cell interaction network and the detected communities, and generates a rank-ordered list of communities, assigning a priority to each community. CRank's prioritization of communities derived from the cell-cell interaction network takes less than 2 minutes.
(Last compiled and run on: 11/08/2017)
from operator import itemgetter
import numpy as np
import scanpy.api as sc
import pandas as pd
from matplotlib import pyplot as plt
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=70)
sc.logging.print_version_and_date()
%matplotlib inline
ANALYSIS_DATA = '1M_neurons_corrected'
LOG_EXPRESSION_MATRIX = '1M_neurons_filtered_raw_log'
PRIORITIZATION = '1M_neurons_prioritization.txt'
Load the single-cell RNA-seq dataset.
We preprocessed (normalized, filtered) the dataset, constructed cell-cell interaction network, calculated diffusion map, computed t-SNE and PCA projections, and detected communities. These computations take several hours and need substantial amounts of memory. For convenience, adata_corrected
object contains all results of these computations. Note that these computations represent the standard pipeline in the analysis of single-cell RNA-seq data; they are not in any way specific to CRank and are not required by CRank.
Also, load the log-normalized expression matrix (UMI count matrix).
adata_corrected = sc.read(ANALYSIS_DATA)
adata_raw = sc.read(LOG_EXPRESSION_MATRIX)
Print variables in adata_corrected
object containing results of additional analyses run on the cell-cell interaction network.
print(adata_corrected.add.keys())
print()
print(adata_corrected.smp.keys())
print()
print(adata_corrected.var.keys())
Plot 1 million cells in the cell-cell interacton network using t-SNE. Color cells according to clustering of cells into clusters/communities; cells assigned to each community are distinguished by color.
N = len(set(adata_corrected.smp['louvain_groups']))
c = plt.cm.get_cmap('hsv', N+1)
colors = [c(i) for i in range(N)]
fig, ax = plt.subplots(figsize=(12, 10))
sc.pl.scatter(adata_corrected, basis='tsne', color='louvain_groups', palette=colors, ax=ax, legend_loc='none');
Network communities are detected using a community detection method developed for single-cell RNA-seq data (Levine et al., Cell 2015) and then prioritized using CRank. CRank generates a rank-ordered list of detected communities.
Let us load CRank's prioritization results, i.e., a rank-ordered list of communities (clusters of cell populations).
def read_prioritization(fname):
with open(fname) as fin:
fin.readline()
com_score = [line.strip().split()[:2] for line in fin]
com_score = [(com.replace('Cmt', ''), float(score)) for com, score in com_score]
com_score = sorted(com_score, reverse=True, key=itemgetter(1))
return pd.DataFrame(com_score, columns=['Community', 'CRank score'])
com_score = read_prioritization(PRIORITIZATION)
com_score
We select Community 98
, Community 87
, Community 63
. These communities are all are ranked high by CRank.
Let us analyze highly differential genes for each selected community.
high_ranked_communities = ['98', '87', '63']
s2c = {'98': 'red', '87': 'green', '63': 'blue'}
s2cm = {'98': plt.cm.Reds, '87': plt.cm.Greens, '63': plt.cm.Blues}
sc.pl.rank_genes_groups(adata_raw, groups=high_ranked_communities, n_genes=20)
Plot t-SNE projections showing cells assigned to communities that are ranked high by CRank (from top to bottom: Community 98
, Community 87
, Community 63
).
Cells assigned to each community are distinguished by color, and all other cells are shown in grey.
for com in high_ranked_communities:
fig, ax = plt.subplots(figsize=(6, 4))
ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);
Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.
Investigation reveals that communities ranked high by CRank are represented by clusters of cells that have a highly localized gene expression pattern. For example, genes indicative of cells in rank 1 community (the highest community in CRank ranking, i.e., Community 98
) are TYROBP, C1QB, C1QC, FCER1G and C1QA. Expression levels of these genes are concentrated in cells that belong to Community 98
.
Similarly, we see that genes that best characterize Community 87
and Community 63
are active only in cell populations of Community 87
and Community 63
, suggesting that these genes might be good markers for those cell populations.
for com in high_ranked_communities:
top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
val = np.sum(adata_raw[:, top_genes].X, axis=1)
val = np.array(val).flatten()
label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
adata_raw.smp[label] = val
sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');
We select Community 47
, Community 123
, Community 127
. These communities are all ranked low by CRank.
Let us analyze highly differential genes for each selected community.
low_ranked_communities = ['47', '123', '127']
s2c = {'47': 'orange', '127': 'purple', '123': 'brown'}
s2cm = {'47': plt.cm.Oranges, '127': plt.cm.Purples, '123': plt.cm.YlOrBr}
sc.pl.rank_genes_groups(adata_raw, groups=low_ranked_communities, n_genes=20)
Plot t-SNE projections highlighting cells assigned to communities that are ranked low by CRank (from top to bottom: Community 47
, Community 123
, Community 127
).
Cells assigned to each community are distinguished by color, and all other cells are shown in grey.
for com in low_ranked_communities:
fig, ax = plt.subplots(figsize=(6, 4))
ii = [i for i, x in enumerate(adata_corrected.smp['louvain_groups']) if x == com]
ax.scatter(adata_corrected.smp['X_tsne'][ii, 0], adata_corrected.smp['X_tsne'][ii, 1], c=s2c[com], zorder=10, s=.5)
sc.pl.scatter(adata_corrected, basis='tsne', color='grey', legend_loc='none', ax=ax, title='Community %s' % com);
Plot t-SNE projections showing the aggregated log-scaled expression values of the top five genes that are most differentially expressed between cells in each community and all other cells.
Although Community 47
, Community 123
, Community 127
correspond to clusters of cells (see t-SNE projections above), they have diluted gene expression patterns that are spread out over the entire network, meaning that CRank has correctly considered those communities to be low priority.
For example, investigation reveals that genes that are most differentially expressed between cells in Community 127
and all other cells, i.e., OPCML, TMSB4X, NYM, CCK, and CNTN2, show a weak expression pattern that is diffused across the entire network.
for com in low_ranked_communities:
top_genes = adata_raw.add['rank_genes_groups_gene_names'][com][:5]
val = np.sum(adata_raw[:, top_genes].X, axis=1)
val = np.array(val).flatten()
label = 'Community_%s:_%s' % (com, '_'.join(top_genes).upper())
adata_raw.smp[label] = val
sc.pl.scatter(adata_raw, color=label, legend_loc='none', color_map=s2cm[com], size=5., basis='tsne');