CS224W:

Social and Information Network Analysis

Autumn 2010

Project themes & Datasets

Note - don't feel limited by the datasets listed on this page. You can also collect the data yourself. Be sure to define the project, and thus the question, yourself!

This page is intended to provide possible project themes and links to the various available datasets that you can use for your project. If you can't find a link on this page to a dataset mentioned here, please email us for access.

DBpedia

Richly labeled network containing extracted data from Wikipedia (based on infoboxes):

Richly labeled network
Multiple types of nodes and edges
Go here.

Other OpenLinkedData datasets available here.

Signed Networks

Networks of positive and negative edges:
Data includes:

Trust/distrust edges
Also Epinions product reviews and review ratings

Signed networks on SNAP : here.
Signed networks on Trustlet : here.

P2P lending : Prosper

Prosper marketplace – Peer-to-peer lending:
Borrowers ask for loans
Lenders then bid (price, interest rate) on loans to fund them
Rich social structure around the website
Prosper data here.

Facebook Game-playing

Turiya is a startup that collects game data from game publishers and processes it to produce business intelligence of value to its clients
Data collected includes:

Players and their attributes
Logs of game events
Information about virtual items & transactions in real money or credits

Analyses include:

Player segmentation
Virtual goods recommendations
Lifetime value estimation of players

Facebook What-to-Wear? (W2W) - A social game on Facebook

Contestants create outfits and submit these to a daily competition, which has a theme such as “an outfit for attending your ex’s wedding”
Contestants can also vote and comment on other people’s submissions. One gets credit for both participation and judgements
Items for outfits are either bought from the store or reused from the contestant’s closet
~30,000 players/month
Data about this game includes:

Player data & Data about previous competitions
Data about outfits and other fashion items
Much other data (~400 relations in all)

Amazon Product Network

Amazon Product Review data :

Product info: name, sales rank etc
Product categorization
All reviews - users, ratings, review helpfulness etc
"Those who bought X also bought Y" networks

DBLP Collaborations

Collaboration network of Computer scientists
Each publication record includes: author names, title, year of publication & the conference or journal in which it appeared.
Data is available here and here.

Citation networks

Patents
Citations between patents. For each patent, we also know the time at which it was made, patent category, patent owner data etc
Arxiv High-energy Physics:

Citation network between papers
For each paper, we also know the author names, title & abstract, year of publication, and the journal/conference it was accepted into.

Data here.

Twitter

~ 50 million tweets per month (from June 2009 for 6 months)
Format:

T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweets
W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy

Two important things:

URLs
Hash-tags
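
As a rough illustration of working with the T/U/W format above, here is a minimal Python sketch that groups lines into tweet records and pulls out the two things highlighted here, URLs and hash-tags. The function name `parse_tweets`, the dictionary keys, and the regular expressions are illustrative assumptions and not part of the dataset; the sketch also assumes each record ends with its W line.

```python
import re
from typing import Any, Dict, Iterable, Iterator

# Simple patterns chosen for illustration; real tweets may need more care.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def parse_tweets(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per T/U/W record (assumes W is the last line of a record)."""
    record: Dict[str, Any] = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split tag from value on any whitespace
        if len(parts) < 2:
            continue  # skip blank lines and tags without a value
        tag, value = parts
        if tag == "T":
            record = {"time": value}
        elif tag == "U":
            record["user"] = value
        elif tag == "W":
            record["text"] = value
            record["urls"] = URL_RE.findall(value)
            # Remove URLs first so that URL fragments are not counted as hash-tags.
            record["hashtags"] = HASHTAG_RE.findall(URL_RE.sub(" ", value))
            yield record
            record = {}


SAMPLE_TWEET = [
    "T 2009-06-07 02:07:42",
    "U http://twitter.com/redsoxtweets",
    "W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy",
]

if __name__ == "__main__":
    for tweet in parse_tweets(SAMPLE_TWEET):
        print(tweet["time"], tweet["hashtags"], tweet["urls"])
```

Because the parser is a generator, it can be fed an open file object directly, which matters when a month of data holds tens of millions of tweets.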

Twitter social graph & profiles
Some possible project themes using Twitter data:

Inferring links of the who-follows-whom network
What is the lifecycle of URLs and hash-tags?

How do hash-tags get adopted?
Multiple competing hash-tags, which one wins?

Finding early/influential users?
Community discovery
Where/how will the information propagate?

Memetracker

More than 1 million newsmedia and blog articles per day since August 2008
Extracted phrases (quotes) and links : Memetracker
Data format:

P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
T 2008-09-01 00:00:13
Q dangerously unprepared to be president
Q even more dangerously unprepared
Q understands the challenges that we face
Q worked and succeeded
L http://www.cnn.com
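
The P/T/Q/L format above can be read in a similar line-by-line fashion. The sketch below is one possible reader, assuming that a P line starts a new document and that Q and L lines may repeat; the function name `parse_memetracker` and the dictionary keys are illustrative choices, not part of the dataset.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def parse_memetracker(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per document; assumes each document starts with a P line."""
    doc: Optional[Dict[str, Any]] = None
    for line in lines:
        parts = line.strip().split(None, 1)  # tag, then the rest of the line
        if len(parts) < 2:
            continue  # blank line or tag without a value
        tag, value = parts
        if tag == "P":
            if doc is not None:
                yield doc  # the previous document is complete
            doc = {"url": value, "time": None, "quotes": [], "links": []}
        elif doc is None:
            continue  # ignore stray lines before the first P line
        elif tag == "T":
            doc["time"] = value
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    if doc is not None:
        yield doc  # emit the final document


SAMPLE_DOC = [
    "P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level",
    "T 2008-09-01 00:00:13",
    "Q dangerously unprepared to be president",
    "Q even more dangerously unprepared",
    "Q understands the challenges that we face",
    "Q worked and succeeded",
    "L http://www.cnn.com",
]

if __name__ == "__main__":
    for doc in parse_memetracker(SAMPLE_DOC):
        print(doc["url"], len(doc["quotes"]), doc["links"])
```

The L lines give directed edges from the document URL to each linked URL, which is one way to build the site-to-site graph that several of the ideas below rely on.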

Some Ideas:

How does information mutate/change over time?
Which media sites are the most influential? Build a predictive model of site influence
Role discovery: Which nodes are early adopters, late comers, summarizers?
Create a model of political bias (liberal vs. conservative)
What is genuine news, what are genuine phrases and what is spam?

Stanford Law School (Legal Data)

About the Dataset:

6.5 million legal opinions from the United States Judiciary from 1900 to the present.
Documents are linked (later cases refer to earlier ones)
The documents are stored in raw form on Amazon S3 and have also been pre-processed for analysis with Hadoop

Project ideas:

Label cases as pro-plaintiff or pro-defendant
Run PageRank, Hub-Authorities, or other graph algorithms on the documents (they are hyperlinked)
Identify legally important concepts
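
To make the PageRank idea concrete, here is a small power-iteration sketch in Python on a toy citation graph. The case names, the `pagerank` function, the damping factor of 0.85, and the fixed iteration count are all assumptions for illustration; at 6.5 million opinions a distributed implementation (for example on Hadoop) would be the more realistic route.

```python
from typing import Dict, Hashable, List


def pagerank(out_links: Dict[Hashable, List[Hashable]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[Hashable, float]:
    """Plain power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Total rank held by nodes with no outgoing citations.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy data: later cases cite earlier ones.
TOY_CITATIONS = {
    "case_c": ["case_a", "case_b"],
    "case_b": ["case_a"],
    "case_a": [],
}

if __name__ == "__main__":
    for case, score in sorted(pagerank(TOY_CITATIONS).items(), key=lambda kv: -kv[1]):
        print(case, round(score, 3))
```

In this toy graph the oldest, most-cited case receives the highest score, which is the intuition behind using PageRank to surface influential opinions.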

Wikipedia

Complete edit history of Wikipedia available until January 2008
Some Wikipedia snapshots
Some Ideas:

We have nicely parsed Wikipedia data for each edit:

REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
CATEGORY American_mathematicians
MAIN Boston_University MIT Harvard_University Cornell_University
OTHER De:Steven_Strogatz Es:Steven_Strogatz
EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
TEMPLATE Cite_book Cite_book Cite_journal
COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
MINOR 1
TEXTDATA 229
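
A record like the one above can be turned into a Python dictionary with a few lines of code. In the sketch below, the names given to the six REVISION fields (article id, revision id, title, timestamp, user, user id) are assumptions inferred from the sample, since the page does not label them; `parse_revision` and `edit_edge` are likewise illustrative names.

```python
from typing import Any, Dict, Iterable, Tuple

# Assumed meaning of the six REVISION fields; the page does not name them.
REVISION_FIELDS = ["article_id", "revision_id", "title", "timestamp", "user", "user_id"]
# Keys whose values are whitespace-separated lists.
LIST_KEYS = {"CATEGORY", "MAIN", "OTHER", "EXTERNAL", "TEMPLATE"}


def parse_revision(lines: Iterable[str]) -> Dict[str, Any]:
    """Parse the lines of a single edit record into a dict."""
    record: Dict[str, Any] = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        key = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if key == "REVISION":
            record["revision"] = dict(zip(REVISION_FIELDS, rest.split()))
        elif key in LIST_KEYS:
            record[key.lower()] = rest.split()
        elif key == "COMMENT":
            record["comment"] = rest
        elif key == "MINOR":
            record["minor"] = rest == "1"
        elif key == "TEXTDATA" and rest.isdigit():
            record["textdata"] = int(rest)
    return record


def edit_edge(record: Dict[str, Any]) -> Tuple[str, str]:
    """Return a (user, article title) pair, i.e. one 'who edits what' edge."""
    rev = record["revision"]
    return rev["user"], rev["title"]


SAMPLE_EDIT = [
    "REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328",
    "CATEGORY American_mathematicians",
    "MAIN Boston_University MIT Harvard_University Cornell_University",
    "OTHER De:Steven_Strogatz Es:Steven_Strogatz",
    "EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html",
    "TEMPLATE Cite_book Cite_book Cite_journal",
    "COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]",
    "MINOR 1",
    "TEXTDATA 229",
]

if __name__ == "__main__":
    rec = parse_revision(SAMPLE_EDIT)
    print(edit_edge(rec), rec["main"])
```

Collecting `edit_edge` pairs over all records yields a bipartite user-article graph, while the MAIN list gives article-to-article links for the same revision.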

Can identify networks like 'Who talks to whom' & 'Who edits what'
Also, Wikipedia has elections for admins, articles get reverted, disputes resolved, …
We also have the Wikipedia webserver logs, i.e., page visit statistics : here, here & here.
How do Wikipedia page visit statistics correlate with external events, such as natural disasters?

Use Twitter or MemeTracker data to detect those
Compare occurrence of phrases and visits to Wikipedia pages

Yahoo Altavista Web Graph

Web graph from 2002:

Nodes are webpages; we also know the URL of each node
Directed edges are hyperlinks
1.4 billion public webpages
So, several billion edges

Some Ideas:

SPAM

Use webgraph structure to more efficiently extract spam webpages, such as:

Link Farms and Spider traps

Personalized and topic-sensitive PageRank
Website structure identification:

From the webgraph extract “websites” - What are common navigational structures of websites?
Cluster website graphs - Identify common subgraphs and patterns
What roles do pages/links play in the graph: Content pages, Navigational pages & Index pages?
Build a summary/map of the website

Stanford WebBase

Collection of focused web snapshots - From 2004 to present

General crawls - start from ~1000 seed webpages - crawl up to ~150,000 pages per site
Specialized crawls - Universities, US Government, daily crawls for Hurricane Katrina (2005) and monthly newspaper crawls

Some Ideas:

Smaller than Altavista but you also have the page content
Study the evolution of the webgraph

How does website structure change and evolve over time?
How do webpages (webpage structure) change over time?

IM Buddy Graph

About the Dataset:

A large IM buddy graph from March 2005 - 230 million nodes & 7,340 million undirected edges
Limitations: Only have the buddy graph with random node ids - No communication or edge strength

Some Ideas:

Find communities, clusters in such a big graph
Count frequent subgraphs
Design algorithms to characterize the structure of the network as a whole
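
As a starting point for subgraph counting, the sketch below counts triangles, the simplest non-trivial subgraph, in an undirected edge list. It orients each edge from the lower-ranked to the higher-ranked endpoint (ranking by degree, then node id) so that every triangle is counted exactly once. The function name `count_triangles` and the toy edge list are assumptions for illustration, and node ids are assumed to be mutually comparable (for example, integers).

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_triangles(edges: Iterable[Tuple[int, int]]) -> int:
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    def order(v: int) -> Tuple[int, int]:
        # Rank nodes by degree, breaking ties by node id.
        return (len(adj[v]), v)

    # Keep only neighbours that rank higher, so each edge points "upward".
    higher = {v: {w for w in nbrs if order(w) > order(v)} for v, nbrs in adj.items()}

    total = 0
    for v, outs in higher.items():
        for w in outs:
            # A common higher-ranked neighbour of v and w closes a triangle.
            total += len(outs & higher[w])
    return total


# Hypothetical toy graph: one triangle (1, 2, 3) plus a pendant edge.
TOY_EDGES = [(1, 2), (2, 3), (1, 3), (3, 4)]

if __name__ == "__main__":
    print(count_triangles(TOY_EDGES))
```

With 230 million nodes and 7,340 million edges the average degree is about 64, so in-memory Python sets like these are only suitable for sampled subgraphs; the full graph would need a streaming or distributed approach.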

Other Ideas/Datasets

Stanford Search Queries
New York Times articles (since 1987)

Articles are manually annotated by subject
Entity or relation extraction
Extract keywords, predict article category