Wikipedia network of top categories

Dataset information

This is a web graph of Wikipedia hyperlinks collected in September 2011. The network was constructed by first taking the largest strongly connected component of Wikipedia, then restricting to pages in the top set of categories (those with at least 100 pages), and finally taking the largest strongly connected component of the restricted graph.
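The three construction steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the function names `largest_scc` and `build_topcats_graph` are hypothetical, category size is counted from the supplied membership lists (the description does not say at which stage the 100-page threshold was measured), and the component search uses Kosaraju's algorithm, which is a choice made here for brevity.

```python
from collections import defaultdict


def largest_scc(nodes, edges):
    """Return the node set of the largest strongly connected component.

    Only edges with both endpoints in `nodes` are considered.
    Uses an iterative form of Kosaraju's two-pass algorithm.
    """
    nodes = set(nodes)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    best, assigned = set(), set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    comp.add(nxt)
                    stack.append(nxt)
        if len(comp) > len(best):
            best = comp
    return best


def build_topcats_graph(edges, categories, min_pages=100):
    """Apply the three steps: largest SCC, category filter, largest SCC again.

    `categories` maps a category name to an iterable of page ids (assumed shape).
    """
    all_nodes = {n for edge in edges for n in edge}
    core = largest_scc(all_nodes, edges)
    # Assumption: a category is "top" if its supplied member list is large enough.
    top = [set(m) for m in categories.values() if len(m) >= min_pages]
    keep = core & set().union(*top)
    final_nodes = largest_scc(keep, edges)
    final_edges = [(u, v) for u, v in edges if u in final_nodes and v in final_nodes]
    return final_nodes, final_edges
```

The second component search matters because removing pages outside the top categories can break paths, so the restricted graph is not guaranteed to stay strongly connected.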

In addition to the graph, we also provide the page names of the articles and the categories of the articles. The categories can serve as "ground-truth" communities. The categories overlap, since an article may be classified into several of them.
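To make the overlap concrete, the sketch below inverts a category listing into a per-article membership map. The line layout it parses (`Category name; id id id ...`, one category per line) is an assumption about the categories file rather than something stated on this page, and `parse_categories` and `memberships` are hypothetical helper names.

```python
from collections import defaultdict


def parse_categories(lines):
    """Map category name -> set of article ids.

    Assumed line format: '<category name>; <id> <id> ...'.
    The split is on the last ';' in case a name itself contains one.
    """
    cats = {}
    for line in lines:
        name, sep, ids = line.strip().rpartition(";")
        if not sep:
            continue  # skip blank or unrecognized lines
        cats[name.strip()] = {int(tok) for tok in ids.split()}
    return cats


def memberships(cats):
    """Invert the mapping: article id -> set of category names it belongs to."""
    node_to_cats = defaultdict(set)
    for name, members in cats.items():
        for node in members:
            node_to_cats[node].add(name)
    return node_to_cats


sample = [
    "Category:Example_A; 1 2 3",
    "Category:Example_B; 3 4",
]
by_node = memberships(parse_categories(sample))
overlapping = sorted(n for n, names in by_node.items() if len(names) > 1)
print(overlapping)  # articles that sit in more than one community
```

Community-detection methods that assign each node to exactly one cluster cannot reproduce such labels exactly, which is worth keeping in mind when using the categories for evaluation.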

Dataset statistics
Nodes 1791489
Edges 28511807
Nodes in largest WCC 1791489 (1.000)
Edges in largest WCC 28511807 (1.000)
Nodes in largest SCC 1791489 (1.000)
Edges in largest SCC 28511807 (1.000)
Average clustering coefficient 0.2746
Number of triangles 52106893
Fraction of closed triangles 0.00165
Diameter (longest shortest path) 9
90-percentile effective diameter 3.8
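
For readers who want to see what two of these rows measure, here is a small sketch that computes an average clustering coefficient and a triangle count. It assumes the graph is treated as undirected with self-loops and duplicate edges dropped, and that nodes with fewer than two neighbors contribute a coefficient of 0; both conventions are assumptions made here and may differ from the ones used to produce the table. `triangle_stats` is a hypothetical name, and the brute-force neighbor-pair loop is meant for toy inputs, not for a graph of this size.

```python
from itertools import combinations


def triangle_stats(edges):
    """Return (average local clustering coefficient, number of triangles)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total_cc = 0.0
    closed_pairs = 0
    for nbrs in adj.values():
        k = len(nbrs)
        pairs = k * (k - 1) // 2
        # Count neighbor pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        closed_pairs += links
        if pairs:
            total_cc += links / pairs

    avg_cc = total_cc / len(adj) if adj else 0.0
    # Each triangle is counted once at each of its three corners.
    return avg_cc, closed_pairs // 3
```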

Source (citation)


File                             Description
wiki-topcats.txt.gz              Hyperlink network of Wikipedia
wiki-topcats-categories.txt.gz   Which articles are in which of the top categories
wiki-topcats-page-names.txt.gz   Names of the articles
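
A minimal sketch for reading the gzipped files directly, without unpacking them first. The line layouts are assumptions, not documented on this page: the network file is taken to hold one whitespace-separated `source target` pair of integer ids per line (with optional `#` comment lines), and the page-names file one `id name` pair per line, where the name may contain spaces. `read_edges` and `read_page_names` are hypothetical helper names.

```python
import gzip


def read_edges(path):
    """Yield (source, target) integer pairs from a gzipped edge list."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            src, dst = line.split()[:2]
            yield int(src), int(dst)


def read_page_names(path):
    """Return {article id: page name} from a gzipped 'id name' listing."""
    names = {}
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            node, _, name = line.rstrip("\n").partition(" ")
            if node:
                names[int(node)] = name
    return names
```

Because `read_edges` is a generator, the roughly 28.5 million edges can be streamed into whatever structure the analysis needs instead of being held as a Python list, for example `for src, dst in read_edges("wiki-topcats.txt.gz"): ...`.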