Pointers to data and code

Datasets

Stanford Large Network Dataset Collection

60 large social and information network datasets

Coauthorship and Citation Networks

DBLP: Collaboration network of computer scientists
KDD Cup Dataset

Internet Topology

AS Graphs: AS-level connectivities inferred from Oregon route-views, Looking glass data and Routing registry data

Stack Overflow

Stack Overflow Data

Yelp Data

Yelp Review Data: reviews of the 250 closest businesses for 30 universities for students and academics to explore and research

Prosper peer to peer money lending dataset

Money Lending Data: Borrowers ask for loans and lenders bid (price, interest rate) on loans to fund.

Youtube dataset

Youtube data: YouTube videos as nodes. An edge a->b means video b is in the related video list (first 20 only) of video a.

Amazon product copurchasing networks and metadata

Amazon Data: The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes).

Wikipedia

Wikipedia page to page link data: A list of all page-to-page links in Wikipedia
DBPedia: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia.
Edits and talks: Complete edit history (all revisions, all pages) of Wikipedia since its inception till January 2008.

Movie Ratings

IMDB database: Movie ratings from IMDB
User rating data: Movie ratings from MovieLens

Who trusts whom data at Trustlet

Trust network datasets: Includes trust/distrust edges and Epinions product reviews/review ratings

Mark Newman's pointers

Network data: More than 20 network datasets

Munmun De Choudhury's pointers

Network data: Flickr Image Dataset, YouTube Dataset, Digg Dataset (Social Media), Engadget Dataset (online communities), Del.icio.us Dataset (Social bookmarking)

Reality Commons data

Mobile data: Several mobile data sets that contain the dynamics of several communities of about 100 people each.

Stanford Foursquare Place Graph Dataset

Every day millions of people check in to the places they go on Foursquare and in the process create vast amounts of data about how places are connected to each other. We call this set of interconnections the Place Graph, and provide a sample of this data for 5 major US cities. This dataset contains metadata about 160k popular public venues, and 21m anonymous check-in transitions (or trips between venues). You'll have to sign an agreement to gain access; contact Jure for more information.

GitHub Dataset

GitHub is one of the most popular platforms for sharing code online. Please contact Vikesh for any help with these datasets. There are two sources for GitHub data:

GHTorrent
GHTorrent maintains a relational model of GitHub activity data and offers archives for download. Both MySQL and MongoDB dumps are available.
GitHub Archive
GitHub Archive is the comprehensive source of all GitHub events starting from February 12, 2011. It is officially provided by GitHub and includes 18 event types, which range from new commits and fork events to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client.
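As a rough illustration of working with one hourly archive, the sketch below counts events by type. It assumes the archive is a gzip-compressed file with one JSON object per line and that each event carries a "type" field; both are assumptions about the file layout, and the function name is hypothetical, so check them against the archive you download.

```python
import gzip
import json
from collections import Counter


def count_event_types(gz_bytes):
    """Count events by their "type" field in one gzipped hourly archive.

    Assumes newline-delimited JSON inside the gzip file (an assumption,
    not something the dataset description guarantees).
    """
    counts = Counter()
    text = gzip.decompress(gz_bytes).decode("utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts
```

The bytes can come from any HTTP client, for example urllib.request.urlopen(url).read(), where url is the address of an hourly archive taken from the GitHub Archive site.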

Google Local Dataset

dataset

David Hallac

Bitcoin

Bitcoin is a digital currency that was invented in 2008 and operates on a peer-to-peer system for transaction validation. This decentralized currency is an attempt to mimic physical currencies in that there is a limited supply of Bitcoins in the world, each Bitcoin must be "mined", and each transaction can be verified for authenticity. Bitcoins are used to exchange everyday goods and services, but they also have known ties to black markets, illicit drugs, and illegal gambling transactions. Bitcoin usage is also strongly oriented towards anonymization of behavior, though true anonymization is rarely achieved.
The Bitcoin dataset captures transaction-level information. For each transaction, there can be multiple senders and multiple receivers as detailed here. This dataset poses a challenge in that multiple addresses are usually associated with a single entity or person. However, some initial work has been done to associate keys with a single user by looking at transactions that are linked to each other (for example, if a single transaction has multiple public keys as input, then a single user owns all of the corresponding private keys). The provided dataset captures these known associations by grouping the addresses together under a single UserId (which then maps to the set of all associated addresses).
Key Challenge Questions:
1. Can we detect bulk Bitcoin thefts by hackers? Can we track where the money went after thefts?
2. Can we detect illicit transactions based on Bitcoin transaction behavior? What sort of graph patterns emerge?
3. Can we detect attempts at money laundering (called a "mixing service" in Bitcoin)?
  1. Can we detect money laundering attempts and the people who use them? Note: Current Bitcoin mixing services tend to mix Bitcoins amongst all the people who bother to use a mixing service so does the mixing service actually obfuscate anything?
  2. Can we trace back the originator of these laundering attempts?
4. Can we detect currency manipulation (hackers try to destabilize Bitcoin currency exchanges to deflate prices)?
5. Is Bitcoin gaining traction or losing traction among the regular population for use as a regular digital currency?
6. It is Bitcoin best practice to generate and use a new address with every transaction. Is this practice followed? If not, then what can we learn from this?
7. Can we identify and extract organizational behavior amidst the Bitcoin transactions?
8. Can we determine which Bitcoin addresses belong to a single entity? While the initial pass over the data has yielded some resolution of entities, can we further improve this mapping?
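The multi-input heuristic described above (addresses that appear together as inputs of one transaction are assumed to belong to one user) can be sketched with a union-find structure. This is a minimal illustration, not the procedure used to build the dataset's UserId mapping: the input format (a list of input-address lists) and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs of the same transaction.

    `transactions` is an iterable of lists of input addresses
    (a simplified, hypothetical format). Returns a list of sets,
    one per inferred user.
    """
    parent = {}

    def find(addr):
        # Register unseen addresses, then walk to the root with path halving.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # make sure single-input addresses are registered too
        for other in inputs[1:]:
            union(first, other)

    clusters = defaultdict(set)
    for addr in list(parent):
        clusters[find(addr)].add(addr)
    return list(clusters.values())
```

Because merging is transitive, two addresses end up in the same cluster even if they never appear in the same transaction, as long as a chain of shared inputs connects them. That transitivity is also the main way the heuristic can over-merge, for example when a mixing service combines inputs from unrelated users.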

Townsquared

If you need access to the data, please contact Vikesh. For any other details of the project, please contact Rohit Prakash, Co-Founder of Townsquared. Townsquared is a Series A company that is building private online communities for local businesses.
Dataset description:
- Edges (links between businesses) - Facebook Business page likes to other Facebook Business pages.
- Nodes (physical business address and other quantitative and qualitative information) - Factual data giving qualitative data about businesses and scraped data to enhance business information.
- Big Picture: Can we use b2b Facebook likes to create an effective invite recommendation system for growth? What other qualities of the network exist that can be used to enhance other product offerings?
Questions:
1. Are Facebook b2b likes a sufficient scaffold to lean on for systemized growth?
2. If so, what qualities of individual edges create the most effective push mechanisms (familiarity) for a business owner?
3. Is “friends of friends” an important network feature that can influence pushed recommendations?
Problems to be tackled:
1. Clean up the data – ensure that the likes are between business pages
2. Associate the Facebook pages with businesses in our native business database OR create new businesses where we are confident that these are real businesses
3. What are the qualities of the network – e.g. how many edges are there, what kinds of businesses are normally connected, do socially active businesses have more b2b likes? Create a query tool that allows filtering to visualize nodes and connections.
4. Create a map reduce algorithm that takes into account different factors (location, company type, social score) gleaned from the quantification of the network to predict the top 10 businesses to recommend.
5. Community detection problem inferred by connectivity.
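As a starting point for the "friends of friends" question, the sketch below ranks candidate businesses by the number of neighbors they share with a given business. It is a plain common-neighbor count on a single machine, not the map reduce algorithm asked for above, and it ignores location, company type, and social score. Treating likes as undirected and the (page_a, page_b) input format are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def recommend(likes, business, top_n=10):
    """Rank businesses two hops from `business` by shared neighbors.

    `likes` is an iterable of (page_a, page_b) pairs, treated here as
    undirected edges (an assumption; Facebook likes are directed).
    """
    neighbors = defaultdict(set)
    for a, b in likes:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    direct = neighbors[business]
    scores = Counter()
    for friend in direct:
        for candidate in neighbors[friend]:
            # Skip the business itself and pages it is already linked to.
            if candidate != business and candidate not in direct:
                scores[candidate] += 1

    # Highest score first; ties broken by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]
```

For example, with likes between a cafe, a bakery, a florist, a butcher, and a gym, a butcher liked by both the bakery and the florist would rank above a gym liked only by the florist when recommending for the cafe.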

Chatous

If you are interested in working on this project, please contact Vikesh.
Chatous is a text-based, 1-on-1 anonymous chat network that has seen 2.5 million unique visitors from over 180 different countries. Users can create a profile that contains a screen name, age, gender, location, and a short free-form "about me" field. After clicking the "new chat" button, users are matched up with one another in a text-based conversation. Interactions on Chatous include exchanging messages, sending/accepting a friend request, reporting an abusive user, ending a conversation. In our dataset, we store all user profile information (and changes made to the profile), all actions taken by users on the site, as well as conversation content (in particular, conversation length and words used). Here are some suggested research questions that we think you could solve:
1. Predicting user "quality" - or general conversation tendency / likability by community as a whole. In particular, identifying users of poor quality is important because they rarely have long conversations and are overrepresented in the matching queue, thus affecting a large set of users. Using a user's profile information and past conversations, can we predict which users tend to be good conversationalists and which users generally engage in short or zero length conversations?
2. Predicting identity changes. Frequent changes to profile information on Chatous, especially age/gender, suggest that a user is lying about their identity. We'd like to develop a system to predict which users are most likely to be lying about their age or gender - i.e. see what type of behavior is linked with profile changes. Given a set of users, can we use information from their current profile and past interactions to predict whether or not they will tend to change key profile elements in the future?
3. Evaluating validity of user reports. On Chatous, user moderation of the community is important for flagging abusive / spammy users. However, the tendency of users to report varies widely, and we have many false positives (reports that are unwarranted, people who are reported simply because they are on the platform a lot). We'd like to develop a system that can determine the accuracy of a user report (based on the reporting user's behavior, the reported user's behavior, and the total number of reports both users have sent/received). We hope this will enable us to remove some of the noise in user reports and more easily detect abusive users on the platform.
Dataset:
- Two weeks of user interaction on our platform: ~80,000 users and ~8 million conversations
- Graph structure consisting of users as nodes and conversations as weighted edges (with conversation length as weight)
- Additional metadata for each edge includes: person who disconnected the conversation, time started, time finished, whether a "friendship" exists between the two users
- User profiles consisting of screen name, age, gender, location, and "about me" (including all changes to a user's profile)
- List of user reports (person reported, person reporting, conversation length & all associated meta data)
Potential additional data:
- Word vectors consisting of the words each user has used in a conversation
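The graph structure listed above (users as nodes, conversations as edges weighted by length) can be sketched as follows, together with a per-user average conversation length as one crude input for the user "quality" question. The (user_a, user_b, length) tuple format and both function names are hypothetical; the actual dataset files may be laid out differently.

```python
from collections import defaultdict


def build_conversation_graph(conversations):
    """Build an undirected weighted graph from (user_a, user_b, length) tuples.

    Repeated conversations between the same pair are summed into one weight.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for user_a, user_b, length in conversations:
        graph[user_a][user_b] += length
        graph[user_b][user_a] += length
    return graph


def mean_conversation_length(conversations):
    """Average conversation length per user, counting both participants."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for user_a, user_b, length in conversations:
        for user in (user_a, user_b):
            totals[user] += length
            counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}
```

Summing repeated conversations into one edge is only one possible choice; keeping each conversation as a separate edge would preserve the timing and disconnect metadata described above.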

Bookopolis

If you are interested in working on this project, please contact Vikesh.
Project Goal:
Create book recommendation engine to help kids ages 7-12 find books that ignite their love of reading. We want to provide every user with a personalized list of book recommendations.
Company Overview:
We aim to be the go-to place for 7-12 year olds (and their teachers and parents) to discover new books they are excited to read and to share their love of books with friends. It’s Goodreads with Pandora style recommendations for kid readers. Our book discovery tool and social network lets young users safely connect with friends to share reviews of their favorite books, explore books recommended by peers and curated by experts, and earn points and badges for their reading achievements. Kids can sign up directly for an account but it must be activated by parents. Kids can invite other friends to join and connect as friends with parent approval. The site is also used as an edtech tool by 2nd-6th grade classroom teachers and school librarians. There’s a Teacher Account option where teachers can set up accounts for a whole class of students and access a Teacher Dashboard to track and monitor student user data including: reading logs, book reports, book reviews, and lists of most popular books.
Social Mission and Goals:
Only 35% of young people read every day compared to the 78% who play online or on mobile devices every day. Increases in time spent reading are shown to lead to better academic performance and positive social emotional development. Our goal is to ignite a love of reading in elementary and middle school students by making reading fun through a hip online platform that helps kids find the books that hook them to become lifelong readers.
User Growth:
We recently launched in beta and currently have 3200+ users. Primary growth is through schools as teachers sign up students for accounts that can be used at school and at home. User growth is our primary business development goal and we want to add core functionality such as a book recommendation engine that will make this a needed tool for students, teachers, and parents.
Book Recommendation Engine Project Details:
- Reviews written by kid users
- Star ratings by kid users
- Number of “Likes” of user reviews
- Most popular books in Bookopolis - based on # of times placed on My Books, # of reviews, #o
- Frequency of ‘recommended by’ friend in Bookopolis by title
- Book Pick lists curated by Bookopolis
- Book titles placed on user shelves (“I Read It”, “Currently Reading,” and “Want to Read”)
- Friends’ books
- In user profile, we collect user’s preferred book genres. Can expand info collected here to include user interests, hobbies, etc.
- School Library Journal
- Publisher's Weekly - Kid's Fiction
- Kirkus Review
- Amazon Children’s Book reviews
- Goodreads : Children's tag
- NYT Children's Books List
- ALSC Book Lists
Opportunity for Students:
Build out core features in a mission-oriented edtech startup that is motivating young people to love reading. Apply your skills to an early stage start-up and have direct access to the CEO/Founder and Bookopolis’ tech advisor experts from Google and Yahoo!

MOOC Forums Dataset

All data from Stanford's courses on Coursera and NovoEd is available. For Coursera format details see this page. For an explanation of data available from Stanford courses offered on our OpenEdX platform, see Datastage. To request any of the data, fill in this form. For more details, please contact Jure.
A number of (relatively) new OpenEdX data sets are now available on datastage.stanford.edu. These include both data that the OpenEdX platform collects, and tables that result from computations over that base data. In addition, processes are now in place to keep the data current on a daily to weekly basis (Coursera and NovoEd data is integrated at the end of each course).
In summary, the additions are:
- ActivityGrade: Assignment grades. Includes right/wrong for each problem part, the learner's solution choice for each answer, and the first and last solution submission times.
- Cumulative assignment performance per learner
- 'Raw' final grades, updated at the end of courses.
- Demographic information: country, gender, year_of_birth, and level_of_education. This information is not fully populated because its provision is optional.
- A much slimmed view of the OpenEdX tracking log events. The view only includes fields that are currently in use by the platform.
- An anonymized record of the forum from each class.
- The country of origin of each class participant (by IP address).