Mambo: A Tool for Multimodal Biomedical Networks

Mambo is a tool for the representation and analysis of large-scale biological multimodal networks. Given a set of biological entities and a set of relationships between them, together with information describing both, Mambo can be used to construct a structured representation of this data. In turn, this structured representation can be used to construct and analyze multimodal networks.

In biology, multimodal networks provide a rich representation of biological data. This representation allows researchers to analyze the data and generate hypotheses, such as identifying drug repurposing targets and predicting genetic causes of diseases.

Recent advances in technology have resulted in the rapid collection of a diverse range of biological data from various methods and sources. While there are many databases and applications focusing on specific types of relationships between only a few types of entities (for example, protein-protein interactions), there is a lack of tools for simply and efficiently integrating data that come from several databases or cover various relationship types. This integration requires handling large amounts of data that may be in different formats and use different naming schemes to represent the same entities.

We present a framework and set of computational tools called Mambo that can be used to efficiently construct, store, and analyze large-scale multimodal networks of biological data. Mambo scales to millions of nodes and billions of edges, and supports thousands of different modes (entity types) and links (relationship types).

Multimodal Networks in Mambo

Mambo develops a representation of biological data that is compact and can be used to analyze large, complex biomedical data and generate new domain-specific hypotheses. Multimodal networks extend the classic graph/network structure from homogeneous to heterogeneous networks. A multimodal network is composed of several sets of nodes, called modes, where each mode represents a distinct entity type. Nodes are connected by edges both within a mode and across modes.
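The definition above can be illustrated with a minimal pure-Python sketch. This is not Mambo's API: the class name, method names, and the gene and protein identifiers are hypothetical and chosen only for illustration. Nodes are grouped by mode, and edges are grouped by link type, where each link type connects a source mode to a destination mode (possibly the same mode).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MultimodalNetwork:
    """Toy multimodal network (hypothetical, not the Mambo API)."""
    # mode name -> set of node ids belonging to that mode
    modes: dict = field(default_factory=lambda: defaultdict(set))
    # link type name -> (source mode, destination mode)
    link_modes: dict = field(default_factory=dict)
    # link type name -> set of (source node, destination node) pairs
    links: dict = field(default_factory=lambda: defaultdict(set))

    def add_link_type(self, name, src_mode, dst_mode):
        self.link_modes[name] = (src_mode, dst_mode)

    def add_edge(self, link_type, src, dst):
        src_mode, dst_mode = self.link_modes[link_type]
        # Register both endpoints in the mode required by the link type.
        self.modes[src_mode].add(src)
        self.modes[dst_mode].add(dst)
        self.links[link_type].add((src, dst))

    def num_nodes(self):
        return sum(len(nodes) for nodes in self.modes.values())

    def num_edges(self):
        return sum(len(edges) for edges in self.links.values())


# Illustrative usage with two modes and two link types.
net = MultimodalNetwork()
net.add_link_type("Gene-Gene", "Gene", "Gene")        # edges within a mode
net.add_link_type("Gene-Protein", "Gene", "Protein")  # edges across modes
net.add_edge("Gene-Gene", "TP53", "MDM2")
net.add_edge("Gene-Protein", "TP53", "P04637")
```

Keeping edges grouped by link type means a single analysis can be restricted to one relationship type, or run across several, without rebuilding the network.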

Mambo defines a multimodal network by specifying four components:

The figure below shows an example of a multimodal network. The multimodal network consists of six modes, each denoted by a different color.

Examples of Multimodal Networks in Mambo

Multimodal networks can be constructed from a variety of biological data types. To illustrate what multimodal networks are and why they are a useful representation format, we provide the processed data and the code required to construct some example networks. These examples show how to use Mambo, which types of analysis are possible, and how Mambo scales.

Multimodal Network for Studying Cancer

This network focuses on the protein-coding genes that are mutated in the largest number of patients, according to data provided by the International Cancer Genome Consortium (ICGC) [1]. It was constructed by taking a much larger network and selecting a subset of its nodes and edges. The original data and the databases from which they were obtained are listed in the first 11 rows of the table in the Data section. To select the sub-network, we begin with the top 500 of the 20,326 protein-coding genes listed on the ICGC website, and then include all nodes in other modes (excluding genes) that are within one hop of these genes in the large network.
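The selection procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code from the Mambo tutorial: the function names, the toy node identifiers, and the patient counts are hypothetical, and the choice to keep every edge whose two endpoints both survive the selection is an assumption, since the text does not specify which edges are retained.

```python
def top_genes(patient_counts, k):
    """Return the k genes mutated in the most patients (hypothetical helper)."""
    return sorted(patient_counts, key=patient_counts.get, reverse=True)[:k]


def select_subnetwork(edges, node_mode, seed_genes, gene_mode="Gene"):
    """Keep the seed genes plus every non-gene node within one hop of a seed."""
    seeds = set(seed_genes)
    keep = set(seeds)
    for src, dst in edges:
        # A neighbor is added only if it belongs to a mode other than Gene.
        if src in seeds and node_mode[dst] != gene_mode:
            keep.add(dst)
        if dst in seeds and node_mode[src] != gene_mode:
            keep.add(src)
    # Assumption: retain every edge whose endpoints are both kept.
    kept_edges = [(s, d) for s, d in edges if s in keep and d in keep]
    return keep, kept_edges


# Toy data: three genes, two proteins, one chemical.
patient_counts = {"G1": 90, "G2": 70, "G3": 5}
node_mode = {"G1": "Gene", "G2": "Gene", "G3": "Gene",
             "P1": "Protein", "P2": "Protein", "C1": "Chemical"}
edges = [("G1", "P1"), ("G2", "C1"), ("G3", "P2"),
         ("G1", "G3"), ("P1", "C1"), ("P2", "C1")]

seeds = top_genes(patient_counts, k=2)
nodes, sub_edges = select_subnetwork(edges, node_mode, seeds)
```

In this toy example, G3 is excluded even though it is adjacent to a seed gene, because only nodes in non-gene modes are added during the one-hop expansion.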

We selected the 500 genes with mutations in the largest number of cancer patients because these genes, and the proteins they encode, are likely to be informative for understanding cancer and are potential targets for treatment. The network we construct represents which proteins these genes encode, which drugs or chemicals interact with the genes and proteins, what the functions of these genes and proteins are, and which other diseases they are associated with. By integrating all of this information, we are able to perform a more detailed analysis of these genes.

This network has 5 modes: Chemical, Disease, Function, Gene, and Protein. It has 21 link types: Chemical-Chemical, Chemical-Protein, Disease-Chemical, Disease-Disease, Disease-Function, Disease-Gene, Function-Function, Gene-Gene (split into 6 link types by interaction type), Gene-Protein, Protein-Function, and Protein-Protein (split into 6 link types by interaction type). There are a total of 20,363 nodes and 3,474,349 links.

The data required to construct this network is available from the Mambo Repository. The tutorial for the construction and analysis of this network is also available from the Mambo Repository. The tutorial begins in notebook 03 Workflow for Constructing Multimodal Networks and ends in notebook 08 Performing Analytics on the Multimodal Network. For a more detailed description of the complete tutorial available from the Mambo repository, see the Tutorial Code section.

Large-Scale Multimodal Network

One advantage of Mambo is that it efficiently scales to very large datasets without requiring additional work by the user. We therefore also provide an example of a much larger network. This network is an expanded version of the previous cancer gene network and has 10,005,139 nodes and 2,341,201,734 edges. In this network, we integrate biological data from various databases and species to show how Mambo can easily and efficiently scale to very large networks using the same set of commands.

Because of the large size of the data used to construct this network, we do not host this data in the Mambo Repository. Instead the user must download the datasets from the original sources, which are listed in the table below. Further instructions can be found in the tutorial from the Mambo Repository.

Data

Interaction Type   File                                                   Database
Chemical-gene      CTD_chem_gene_ixns.tsv.gz                              CTD [2]
Disease-chemical   CTD_chemicals_diseases.tsv.gz                          CTD [2]
Disease-disease    doid.obo                                               Disease Ontology [3]
Disease-function   CTD_Disease-GO_biological_process_associations.tsv.gz  CTD [2]
                   CTD_Disease-GO_cellular_component_associations.tsv.gz
                   CTD_Disease-GO_molecular_function_associations.tsv.gz
Disease-gene       CTD_genes_diseases.tsv.gz                              CTD [2]
Drug-target        drugbank.xml.zip                                       Drugbank [4]
Function-function  go.obo                                                 Gene Ontology [5]
Gene-gene          Genemania Data                                         Genemania [6]
Gene-protein       Ensembl Data                                           Ensembl [7]
Protein-function   goa_human.gaf.gz                                       Gene Ontology [5]
Protein-protein    protein.links.detailed.v10.5.txt.gz                    STRING [8]
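Several of the files listed above are gzip-compressed, tab-separated text. As a rough sketch of how such a file might be turned into an edge list, the following uses only the Python standard library. The function name, the column positions, and the convention that lines starting with "#" are comments are assumptions made for illustration; the actual column layout of each file should be checked against the source database, and the sample rows here are made up.

```python
import csv
import gzip
import io


def read_tsv_gz(source, src_col, dst_col):
    """Yield (source, destination) pairs from a gzip-compressed TSV file.

    `source` may be a file path or a binary file object. Lines starting
    with '#' are treated as comments (an assumption about the file format).
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank or truncated rows that lack the requested columns.
            if len(row) > max(src_col, dst_col):
                yield row[src_col], row[dst_col]


# Illustrative in-memory file, so the sketch runs without downloading data.
raw = "# ChemicalName\tGeneSymbol\naspirin\tPTGS2\ncaffeine\tADORA2A\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
pairs = list(read_tsv_gz(buffer, src_col=0, dst_col=1))
```

Reading the file as a stream, rather than loading it whole, keeps memory use low for the larger downloads.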

Setup and Installation Instructions

Prerequisites

The following three items are required to run Mambo.

Mambo Installation

  1. Download notebooks from the Mambo Repository.
  2. Launch the Jupyter Notebook by clicking on the application or by using the command: jupyter notebook. The web browser will open to the Notebook Dashboard. See the Jupyter Notebook website for further details on how to run the notebooks.
  3. Navigate to the directory downloaded in Step 1 and click on the first notebook, named 01 Introduction to Multimodal Networks, to launch it. Then follow the Mambo tutorial detailed in the notebooks.

Mambo Tutorial

Mambo uses Jupyter Notebooks to provide a clear and easy-to-use interface with a simple set of commands that can be used to construct, represent, and analyze multimodal networks. We provide a tutorial using several examples of real-world multimodal networks in biomedicine. The tutorial details how to use Mambo and can be easily adapted to new networks. A detailed description, together with code and data, is available from the Mambo Repository.

The tutorial has three parts, ordered according to the numbering of the Jupyter Notebooks.

In the first part, we provide background on multimodal networks and how they are represented in Mambo. This can be found in two notebooks: 01 Introduction to Multimodal Networks and 02 Data Representation in SNAP.

In the second part, we provide the code and data for the construction and analysis of a multimodal network for cancer. This part of the tutorial is intended to provide a detailed understanding of how Mambo works and to be easily adaptable to new networks. This part of the tutorial begins at the notebook 03 Workflow for Constructing Multimodal Networks.

In the third part, we provide the code for the construction and analysis of a large-scale multimodal network. This part of the tutorial can be found in the notebook 09 Large Network Case Study.

The detailed tutorial including the background, code, and data is available from the Mambo Repository.

References

  1. Zhang J. et al. International Cancer Genome Consortium Data Portal—a one-stop shop for cancer genomics data. Database, Volume 2011, 16 September 2016, bar026.
  2. Curated chemical–gene interactions, chemical–disease, gene–disease, disease-function data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. http://ctdbase.org/.
  3. Kibbe W. et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research, Volume 43, Issue D1, 28 January 2015, Pages D1071–D1078.
  4. Law V. et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D1091-D1097.
  5. The Gene Ontology Consortium. Gene Ontology Consortium: going forward. (2015) Nucleic Acids Research, Volume 43, Issue D1, 28 January 2015, Pages D1049–D1056.
  6. Warde-Farley D. et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research, Volume 38, Issue W1, 1 July 2010, Pages W214–W220.
  7. Bronwen L. Aken et al. The Ensembl gene annotation system. Database, Volume 2016, 1 January 2016, baw093.
  8. Szklarczyk D. et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, Volume 43, Issue D1, 28 January 2015, Pages D447–D452.

Contributors

The following people contributed to the Mambo project (listed in alphabetical order):

Jure Leskovec
Priyanka Nigam
Rok Sosic
Marinka Zitnik