Prioritizing network communities

CRank is an automatic unsupervised method for prioritizing network communities and identifying the most promising ones for further experimentation.

CRank can be used with any community detection method and scales to large networks. It uses only information provided by the network structure and does not require any additional external metadata or labels.

Complex networks exhibit modular structure and methods for community detection allow for computational detection of modular structure by identifying a division of network’s nodes into communities. Such communities provide predictions/hypotheses about potential modules of the network, which then need to be experimentally validated and confirmed.

In this project, we focus on identifying the most promising communities for further experimentation. To this end, we develop an automatic unsupervised method for prioritizing network communities.

Prioritizing Network Communities.
Marinka Zitnik, Rok Sosic, and Jure Leskovec.
Nature Communications, 9, 2544, 2018. [arXiv]

@article{Zitnik2018prioritizing, title={Prioritizing Network Communities}, author={Zitnik, Marinka and Sosic, Rok and Leskovec, Jure}, journal={Nature Communications}, volume={9}, number={1}, pages={2544}, year={2018} }

CRank network community prioritization method

Our goal is to identify communities that are most promising candidates for follow-up investigations. Promising candidates are communities that are most associated with external network functions, such as cellular functions in protein-protein interaction networks, social communities in social networks, or product categories in product co-purchasing networks.

CRank ranks the communities detected in a network using only the information about the network structure and does not require any external data about the nodes or edges of the network. CRank takes a network and detected communities as its input and outputs a ranked list of communities, where high-ranking communities represent most promising candidates for downstream investigation.

Our network community prioritization technique for prioritizing communities is:

  1. Run a community detection method on the network to detect communities.
  2. Compute CRank's community prioritization metrics for each of the detected communities, evaluating the magnitude of network connectivity features for each community as well as its robustness against noise in the network structure. This step yields four lists, each list containing scores of all communities for one prioritization metric. Convert scores in each list into ranks, producing ranked lists of communities.
  3. Combine the ranked lists into a single aggregated ranking of communities using CRank's rank aggregation method.
  4. Prioritize the communities by ranking them by their decreasing aggregated score.

CRank can be used with any community detection method. In particular, it works with statistical community detection methods, such as Stochastic Block Model, Affiliation Graph Models and Latent Feature Graph Model, that allow for direct computation of community prioritization metrics. Additionally, CRank works with non-statistical methods like modularity optimization and spectral methods, where edge probabilities are given by an auxiliary network model.

Both operating modes of CRank are demonstrated in the following sections and are also implemented in the code provided on this website.

Prioritizing network communities in the Zachary's Karate Club network

Zachary's Karate Club network is a well known network, which consists of 34 nodes and 78 edges. It shows the relationships between Zachary Karate Club members. Each node represents a member of the karate club and each edge represents a ties between two members of the club.

Experiment

A community detection method of user's choice takes as input the network and outputs a grouping of nodes into five communities, Cmt1, Cmt2, Cmt3, Cmt4, and Cmt5, as highlighted in the figure. Each community is given by a set of its member nodes. For example, community Cmt1 contains nodes 5, 6, 7, 11, and 17. Notice that communities Cmt2 and Cmt4 split the friendship network among the Karate Club members into two widely known factions, whereas communities Cmt1, Cmt3 and Cmt5 represent less meaningful groups of the Karate Club members.

We now prioritize the detected communities using CRank. CRank takes as input the Zachary's Karate Club network given by its edge list (switch -i:) and node-community affiliations (switch -c:). As a result, CRank prioritizes the communities by ranking them by their aggregated prioritization score and saves the resulting prioritization to a file (switch -o:).

# Run the code on Zachary's Karate Club network 
./crank -i:karate.txt -c:karate_communities.txt -o:karate_prioritization.txt

Results

CRank assigns a score to each community and uses that score to determine the rank of a community in the final prioritization. The resulting prioritization of the five communities detected in the Zachary's network is shown in the figure below.

Communities Cmt4 and Cmt2 are placed at the top of the ranking, indicating that Cmt4 and Cmt2 are most promising communities for follow-up investigation. In contrast. Cmt5 is ranked last, 5 out of 5, suggesting that Cmt5 is the least promising community. Indeed, Cmt4 and Cmt2 correspond to two groups of people into which the karate club split after an argument between two teachers, a fact that is well known in the literature but that was not used for prioritization.

Notice that CRank allows the input to consist of only network and community affiliation data, given by switches -i: and -c:, respectively. As a result, CRank can be used with non-statistical community detection methods.

Prioritizing network communities in the Amazon product network

The Amazon product co-purchasing network has 330 thousand nodes and almost one million edges. It represents frequently co-purchased products at the online retailer. Each node is a product (e.g., a book), and each edge represents a pair of products commonly purchased together.

Experiment

A community detection method of user's choice takes as input the network and outputs a grouping of nodes into several thousand communities. In case the selected method implements a statistical model, such as the Affiliation Graph Model, then CRank can use edge probabilities and node-community affiliation probabilities that are provided by the method.

We now prioritize the detected communities using CRank. CRank takes as input the Amazon network given by its edge list (switch -i:) and node-community affiliations (switch -c:).

CRank also accepts information about node-community affiliation probabilities (switch -in:), edge probabilities (switch -ie:), and edge probabilities conditioned on nodes' affiliations with communities (switch -ic:). Note that this additional information (switches -in:, -ie: and -ic:) is always provided by the community detection method. CRank prioritizes the communities by ranking them by their aggregated prioritization score and save the resulting prioritization to a file (switch -o:, filename amazon_prioritization.txt).

# Run the code on Amazon network
# If necessary, scroll to the right to see the entire command
./crank -i:amazon.txt -p:F -c:amazon_communities.txt -in:amazon_CProbaH.txt -ie:amazon_EdgeProbaH.txt -ic:amazon_CEdgeProbaH.txt -o:amazon_prioritization.txt

CRank's runtime on a standard notebook is 1-2 minutes. CRank uses ~500MB of RAM to process the Amazon network.

Results

Using only network structure, CRank attempts to identify promising communities and to place them towards the top of the community ranking. It ranks the communities purely based on network structure and we use the external metadata about product categories to assess the quality of the community ranking.

We find that communities ranked high by CRank mostly contain products that belong to the same product category. For example, all nodes in community Cmt914, which has the highest CRank's prioritization score, represent books about antique and collectible jewelry. For example, the rank 2 community (2nd highest community in the ranking) contains books belonging to a children's literary franchise "The Boxcar Children." Another high-ranked (rank 3) community is about progressive country, a subgenre of country music.

In contrast, communities ranked lower by CRank (rank 200, 500, and 1000 communities) carry much broader semantic meaning and their products become increasingly more heterogeneous.

We provide metadata for the Amazon product network as part of the code linked from this website. File amazon_product_categories.txt contains the product category of each node.

Prioritizing network communities in the megascale cell interaction network

To demonstrate that CRank scales to large biological networks, we used the single-cell RNA-seq dataset containing 1,306,127 embroyonic mouse brain cells for which no cell types are known. The dataset was represented as a weighted graph of nearest neighbor relations (edges) among cells (nodes), where relations indicated cells with similar gene expression patterns calculated using diffusion pseudotime analysis.

To partition this graph into highly interconnected communities we apply a community detection method proposed for single-cell data. After detecting the communities, CRank takes the cell-cell interaction network and the detected communities, and generates a rank-ordered list of communities, assigning a priority to each community. Prioritization of communities derived from the cell-cell interaction network takes less than 2 minutes.

We find that communities that rank high by CRank contain cells with distinct gene expression patterns, whereas communities that rank low by CRank show more variation in gene expression and little or no community-specific expression patterns (see the IPython notebook and Figure 4 in the paper).

File Description
1M_neurons_filtered_raw_log.h5 Log-normalized expression matrix (UMI count matrix) as described in the paper. The raw data was generated by Zheng et al., Nature Communications 2017 and 10x Genomics. Note: The file is 14 GB large.
1M_neurons_corrected.h5 Preprocessed (normalized, filtered) dataset, cell-cell interaction network, diffusion map, t-SNE and PCA projections, and list of detected communities. These computations take several hours and need substantial amounts of memory. For convenience, the file contains all results of these computations. Note that these computations represent the standard pipeline in the analysis of single-cell RNA-seq data; they are not in any way specific to CRank and are not required by CRank. The raw data was generated by Zheng et al., Nature Communications 2017 and 10x Genomics. Note: The file is 5.3 GB large.
1M_neurons_prioritization.txt CRank ranking of communities detected in the megascale cell-cell interaction network.
1M_neurons_CRank.html IPython notebook (HTML) reproducing Figure 4 in the paper.

Experimental network community prioritization code

You can download the code directly, which comes with the Zachary's karate and Amazon network data. This is a C++ implementation. Once you have unzipped the file, run the following to prioritize communities from the Zachary's karate network.

cd crank-prioritization/
# Run the code on Zachary's Karate Club network 
./crank -i:karate.txt -c:karate_communities.txt -o:karate_prioritization.txt
# Look at the prioritization results
cat karate_prioritization.txt

The code is tested under Mac OS X, Linux and Windows systems. Binary executable for all three operating systems are included in the zip file.

A C++ implementation of CRank is available on GitHub.