Web data: Reddit interaction networks

Dataset information

This dataset is a collection of monthly user interaction networks from the year 2014 for 2046 subreddit communities from reddit.com. There are two types of networks: chain-based interaction networks link users who comment within a linear chain (and are separated by at most 2 other comments); reply-based interaction networks only connect users when one has directly replied to the other. The 2046 subreddits were selected by removing subreddits that fell below certain activity thresholds (a subreddit needed at least 100 comments in every week) and discarding two subreddits, /r/counting and /r/CatsStandingUp, whose commenting patterns are significantly anomalous. Only users who commented at least 50 times on Reddit in 2014 are included in these networks, which corresponds to roughly the top 20% of users.
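
The difference between the two linking rules can be illustrated on a toy comment chain. The sketch below is not the code used to build the dataset; it assumes a chain is given as a list of authors in posting order, that each comment replies to the one directly above it, that links point from the later commenter to the earlier one, and that self-links are skipped. The function names and the `max_between` parameter are hypothetical.

```python
from collections import defaultdict


def reply_links(chain):
    """Reply-based rule: link each commenter to the author directly above."""
    net = defaultdict(set)
    for i in range(1, len(chain)):
        if chain[i] != chain[i - 1]:  # assumption: self-replies are ignored
            net[chain[i]].add(chain[i - 1])
    return {user: sorted(targets) for user, targets in net.items()}


def chain_links(chain, max_between=2):
    """Chain-based rule: link each commenter to earlier authors in the same
    chain that are separated from them by at most `max_between` comments."""
    net = defaultdict(set)
    for i, author in enumerate(chain):
        # Earlier positions i-1, i-2, ..., i-(max_between+1) qualify.
        for j in range(max(0, i - max_between - 1), i):
            if chain[j] != author:  # assumption: self-links are ignored
                net[author].add(chain[j])
    return {user: sorted(targets) for user, targets in net.items()}


# Toy linear chain: alice posts, bob replies to alice, carol to bob, and so on.
chain = ["alice", "bob", "carol", "dave", "erin"]
reply_net = reply_links(chain)
chain_net = chain_links(chain)
```

Under these assumptions, erin is linked only to dave in the reply-based network, but to bob, carol and dave in the chain-based network; alice is excluded because three comments separate her from erin.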

Each subreddit has a JSON file, which contains a list of networks defined as adjacency lists keyed by username strings. These raw adjacency lists are directed: the replier links to the users they are responding to.
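
Because the raw adjacency lists are directed, analyses that only need to know whether two users interacted may want a symmetric version. The following is a minimal sketch of one way to do this; the function name `to_undirected` and the toy network are illustrative, and dropping self-links is an assumption rather than something the dataset prescribes.

```python
def to_undirected(net):
    """Return a symmetric adjacency mapping (user -> set of neighbours)
    built from a directed mapping (user -> list of users replied to)."""
    undirected = {}
    for user, targets in net.items():
        for target in targets:
            if target == user:  # assumption: self-links are dropped
                continue
            undirected.setdefault(user, set()).add(target)
            undirected.setdefault(target, set()).add(user)
    return undirected


# Toy directed network: alice replied to bob and carol, bob replied to alice.
directed_net = {"alice": ["bob", "carol"], "bob": ["alice"]}
undirected_net = to_undirected(directed_net)
```

In the toy example, carol never appears as a key of the directed network, but she gains an entry in the undirected one because alice replied to her.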


Dataset statistics
Number of subreddits 2046
Median number of users per monthly network 504
Timespan Jan. 27, 2014 - Nov. 30, 2014

Source (citation)


Files

File Description
reddit_chain_networks.tar.gz Networks constructed from comment chains
reddit_reply_networks.tar.gz Networks constructed from direct replies

Data format

Directed JSON adjacency lists with usernames as identifiers. Each subreddit has a file "[subreddit].json" that contains a list of 11 "monthly" interaction networks (corresponding to ISO 4-week periods starting on Jan. 27, 2014 and ending on Nov. 30, 2014). Each network is represented as a directed adjacency list (a dictionary mapping each user to the list of users they replied to). The December/January holiday periods are excluded due to data quality issues.
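
Given this format, per-month user and edge counts can be computed directly from the loaded list. The sketch below uses a small hand-made stand-in for the list of monthly networks; `network_summary` is a hypothetical helper. It assumes that a user who only receives replies may be missing from the dictionary keys, so users are collected from both keys and values, and it counts repeated targets once in case a list contains duplicates.

```python
def network_summary(net):
    """Return (number of users, number of distinct directed edges)."""
    users = set(net)
    for targets in net.values():
        users.update(targets)  # include users who only receive replies
    n_edges = sum(len(set(targets)) for targets in net.values())
    return len(users), n_edges


# Toy stand-in for the list loaded from "[subreddit].json"
# (the real files contain 11 networks, one per 4-week period).
toy_month_nets = [
    {"alice": ["bob", "carol"], "bob": ["alice"]},
    {"carol": ["alice"]},
]

summaries = [network_summary(net) for net in toy_month_nets]
for period, (n_users, n_edges) in enumerate(summaries):
    print("period", period, "users:", n_users, "edges:", n_edges)
```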

How to parse (in Python)

import json

# Read in the networks for subreddit /r/politics
with open("politics.json") as fp:
    month_nets = json.load(fp)

# See who the user frog_licker replied to in the first month
print(month_nets[0]["frog_licker"])
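
The files are distributed as .tar.gz archives, so it can be convenient to read one subreddit without extracting everything. The sketch below uses Python's standard `tarfile` module. The internal layout of the archives is not documented here, so the helper (a hypothetical `load_subreddit`) simply matches on the file name "[subreddit].json" at any directory depth; treat that as an assumption.

```python
import json
import tarfile


def load_subreddit(archive_path, subreddit):
    """Load the list of monthly networks for one subreddit from a .tar.gz
    archive, matching "<subreddit>.json" regardless of its folder."""
    target = subreddit + ".json"
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Assumption: the file may sit inside a top-level directory.
            if member.isfile() and member.name.split("/")[-1] == target:
                with tar.extractfile(member) as fp:
                    return json.load(fp)
    raise KeyError("no file named %s in %s" % (target, archive_path))
```

For example, `load_subreddit("reddit_reply_networks.tar.gz", "politics")` would return the same list as the parsing example above, provided the archive contains a file named politics.json.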