CS224W:
Social and Information Network Analysis
Autumn 2010
Note: don't feel limited by the above. You can also collect the data yourself. Be sure to define the project, and thus the question, yourself!
This page is intended to provide you with possible project themes and links to the various datasets that you can use for your project. If you can't find links on this page to datasets mentioned here, please email us for access.
DBpedia
- Data extracted from Wikipedia (based on infoboxes):
- Richly labeled network
- Multiple types of nodes and edges
- About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages
- Go here.
- Other OpenLinkedData datasets available here.
Signed Networks
- Networks of positive and negative edges:
- Data includes:
- Trust/distrust edges
- Also Epinions product reviews and review ratings
- Signed networks on SNAP : here.
- Signed networks on Trustlet : here.
P2P lending : Prosper
- Prosper marketplace – Peer-to-peer lending:
- Borrowers ask for loans
- Lenders then bid (price, interest rate) on loans to fund them
- Rich social structure around the website
- Prosper data here.
Facebook Game-playing
- Turiya is a startup that collects game data from game publishers and processes it to produce business intelligence of value to its clients
- Data collected includes:
- Players and their attributes
- Logs of game events
- Information about virtual items & transactions in real money (or) credits
- Analyses include:
- Player segmentation
- Virtual goods recommendations
- Lifetime value estimation of players
Facebook What-to-Wear? (W2W) - A social game on Facebook
- Contestants create outfits and submit them to a daily competition, which has a theme such as “an outfit for attending your ex’s wedding”
- Contestants can also vote and comment on other people’s submissions. Players get credit for both participation and judgements
- Items for outfits are either bought from the store or reused from the contestant’s closet
- ~30,000 players/month
- Data about this game includes:
- Player data & Data about previous competitions
- Data about outfits and other fashion items
- Much other data (~400 relations in all)
Amazon Product Network
- Amazon Product Review data:
- Product info: name, sales rank etc
- Product categorization
- All reviews - users, ratings, review helpfulness etc
- "Those who bought X also bought Y" networks
DBLP Collaborations
- Collaboration network of Computer scientists
- Each publication record includes: author names, title, year of publication, and the conference or journal in which it was accepted.
- Data is available here and here.
Citation networks
- Patents
- Citations between patents. For each patent, we also know the time at which it was made, patent category, patent owner data etc
- Arxiv High-energy Physics:
- Citation network between papers
- For each paper, we also know the author names, title & abstract, year of publication, and the journal/conference it was accepted into.
- Data here.
Twitter
- ~ 50 million tweets per month (from June 2009 for 6 months)
- Format:
- T 2009-06-07 02:07:42
- U http://twitter.com/redsoxtweets
- W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
- Two important things:
- Twitter social graph & profiles
- Some possible project themes using Twitter data:
- Inferring links of the who-follows-whom network
- What is the lifecycle of URLs and hash-tags?
- How do hash-tags get adopted?
- Multiple competing hash-tags, which one wins?
- Finding early/influential users?
- Community discovery
- Where/how will the information propagate?
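A minimal sketch of reading the T/U/W record format shown above, which may help as a starting point for the hashtag and URL lifecycle themes. It assumes each record is three lines, each starting with a one-letter tag followed by whitespace, and that the user name is the last path segment of the U line; the function name `parse_tweets` and the hashtag regular expression are illustrative choices, not part of the dataset.

```python
import re
from datetime import datetime

def parse_tweets(lines):
    """Yield one dict per tweet from T/U/W-tagged lines (layout assumed from the sample)."""
    record = {}
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and tags that carry no value
        tag, value = parts
        if tag == "T":
            # A T line starts a new record.
            record = {"time": datetime.strptime(value, "%Y-%m-%d %H:%M:%S")}
        elif tag == "U":
            # Assumption: the user name is the last path segment of the profile URL.
            record["user"] = value.rstrip("/").rsplit("/", 1)[-1]
        elif tag == "W":
            record["text"] = value
            # Rough hashtag extraction; a stricter tokenizer may be needed in practice.
            record["hashtags"] = [h.lower() for h in re.findall(r"#(\w+)", value)]
            yield record
            record = {}

sample = [
    "T 2009-06-07 02:07:42",
    "U http://twitter.com/redsoxtweets",
    "W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy",
]
tweets = list(parse_tweets(sample))
```

Grouping the resulting records by hashtag and sorting by `time` would give a first view of how a hashtag is adopted over time.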
Memetracker
- More than 1 million news media and blog articles per day since August 2008
- Extracted phrases (quotes) and links : Memetracker
- Data format:
- P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
- T 2008-09-01 00:00:13
- Q dangerously unprepared to be president
- Q even more dangerously unprepared
- Q understands the challenges that we face
- Q worked and succeeded
- L http://www.cnn.com
- Some Ideas:
- How does information mutate/change over time?
- Which media sites are the most influential? Build a predictive model of site influence
- Role discovery: Which nodes are early adopters, late comers, summarizers?
- Create a model of political bias (liberal vs. conservative)
- What is genuine news, what are genuine phrases and what is spam?
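As a possible starting point for the site-influence idea, the P/T/Q/L format above can be read into per-article records and then collapsed into a site-to-site link count. This is a sketch under assumptions: each article starts with a P line, tags are separated from values by whitespace, and a "site" is approximated by the URL hostname. The names `parse_memetracker` and `site_links` are hypothetical helpers.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

def parse_memetracker(lines):
    """Yield one dict per article; a P line is assumed to start each record."""
    doc = None
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if len(parts) != 2:
            continue
        tag, value = parts
        if tag == "P":
            if doc is not None:
                yield doc
            doc = {"url": value, "time": None, "quotes": [], "links": []}
        elif doc is None:
            continue  # ignore stray lines before the first P
        elif tag == "T":
            doc["time"] = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    if doc is not None:
        yield doc

def site_links(docs):
    """Count hyperlinks between hostnames (hostname as 'site' is an assumption)."""
    counts = Counter()
    for doc in docs:
        src = urlparse(doc["url"]).netloc
        for link in doc["links"]:
            counts[(src, urlparse(link).netloc)] += 1
    return counts

sample = [
    "P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level",
    "T 2008-09-01 00:00:13",
    "Q dangerously unprepared to be president",
    "Q even more dangerously unprepared",
    "Q understands the challenges that we face",
    "Q worked and succeeded",
    "L http://www.cnn.com",
]
docs = list(parse_memetracker(sample))
links = site_links(docs)
```

The weighted edges in `links` form a directed site graph on which influence measures can then be computed.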
Stanford Law School (Legal Data)
- About the Dataset:
- 6.5 million legal opinions from the United States Judiciary from 1900 to the present.
- Documents are linked (later cases refer to earlier ones)
- The documents are stored in raw form on Amazon S3 and have also been pre-processed for analysis with Hadoop
- Project ideas:
- Label cases as pro-plaintiff or pro-defendant
- Run PageRank, Hubs and Authorities, or other graph algorithms on the documents (they are hyperlinked)
- Identify legally important concepts
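To make the PageRank idea concrete, here is a small power-iteration sketch on a citation graph, where an edge (a, b) means "opinion a cites opinion b". The case identifiers, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions; at the scale of 6.5 million opinions a distributed implementation (for example on Hadoop) would be needed rather than this in-memory version.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing citations is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = new
    return rank

# Hypothetical toy graph: case_C cites A and B, case_B cites A.
citations = [("case_C", "case_A"), ("case_C", "case_B"), ("case_B", "case_A")]
scores = pagerank(citations)
most_cited = max(scores, key=scores.get)
```

Because later cases cite earlier ones, older opinions have had more time to accumulate citations, so comparing scores within a time window may be more informative than a global ranking.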
Wikipedia
- Complete edit history of Wikipedia available until January 2008
- Some Wikipedia snapshots
- Some Ideas:
- We have nicely parsed Wikipedia data for each edit:
- REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
- CATEGORY American_mathematicians
- MAIN Boston_University MIT Harvard_University Cornell_University
- OTHER De:Steven_Strogatz Es:Steven_Strogatz
- EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
- TEMPLATE Cite_book Cite_book Cite_journal
- COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
- MINOR 1
- TEXTDATA 229
- Can identify networks like 'Who talks to whom' & 'Who edits what'
- Also, Wikipedia has elections for admins, articles get reverted, disputes resolved, …
- We also have the Wikipedia webserver logs, i.e., page visit statistics: here, here & here.
- How do Wikipedia page visit statistics correlate with external events such as natural disasters?
- Use Twitter or MemeTracker data to detect those
- Compare occurrence of phrases and visits to Wikipedia pages
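The parsed edit records shown above can be turned into a "who edits what" network with a few lines of code. The sketch below assumes that each record starts with a REVISION line, that every line is a tag followed by whitespace-separated values, and that the six REVISION values are article id, revision id, article title, timestamp, user name, and user id; these field names are inferred from the sample, not from a format specification.

```python
from collections import Counter

# Assumed meaning of the six values on a REVISION line (inferred from the sample).
REVISION_FIELDS = ("article_id", "rev_id", "title", "timestamp", "user", "user_id")

def parse_revisions(lines):
    """Yield one dict per edit; other tags (MAIN, CATEGORY, ...) map to value lists."""
    rev = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        tag, values = fields[0], fields[1:]
        if tag == "REVISION":
            if rev is not None:
                yield rev
            rev = dict(zip(REVISION_FIELDS, values))
        elif rev is not None:
            rev[tag] = values
    if rev is not None:
        yield rev

def who_edits_what(revisions):
    """Weighted bipartite edges: (user, article title) -> number of edits."""
    return Counter((r["user"], r["title"]) for r in revisions)

sample = [
    "REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328",
    "CATEGORY American_mathematicians",
    "MAIN Boston_University MIT Harvard_University Cornell_University",
    "MINOR 1",
    "TEXTDATA 229",
]
revisions = list(parse_revisions(sample))
edits = who_edits_what(revisions)
```

The MAIN values give the article-to-article links at the time of each edit, so the same records also support studying how the link graph evolves.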
Yahoo Altavista Web Graph
- Web graph from 2002:
- Nodes are webpages; we also know the URL of each node
- Directed edges are hyperlinks
- 1.4 billion public webpages
- So, several billion edges
- Some Ideas:
- SPAM
- Use webgraph structure to extract spam webpages more efficiently
- Link Farms and Spider traps
- Personalized and topic-sensitive PageRank
- Website structure identification:
- From the webgraph extract “websites” - What are common navigational structures of websites?
- Cluster website graphs - Identify common subgraphs and patterns
- What roles do pages/links play in the graph: content pages, navigational pages & index pages?
- Build a summary/map of the website
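One simple way to begin extracting "websites" from the webgraph is to group pages by the hostname of their URL and separate within-site links from cross-site links. This is only a sketch: treating the hostname as the website boundary is an assumption (it splits subdomains of one organization and merges unrelated users on a shared host), and the URLs below are placeholders.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    """Approximate a page's website by its lowercase hostname (an assumption)."""
    return urlparse(url).netloc.lower()

def split_by_site(edges):
    """Return (internal edges per site, cross-site link counts)."""
    internal = defaultdict(list)
    external = defaultdict(int)
    for src, dst in edges:
        s, d = site_of(src), site_of(dst)
        if s == d:
            internal[s].append((src, dst))  # navigational structure of one site
        else:
            external[(s, d)] += 1           # site-level hyperlink
    return dict(internal), dict(external)

# Placeholder hyperlinks for illustration only.
hyperlinks = [
    ("http://example.com/", "http://example.com/about"),
    ("http://example.com/", "http://example.com/news"),
    ("http://example.com/news", "http://example.org/story"),
]
internal, external = split_by_site(hyperlinks)
```

Each per-site subgraph in `internal` can then be clustered or compared with others to look for common navigational patterns.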
Stanford WebBase
- Collection of focused web snapshots - From 2004 to present
- General crawls - start from ~1,000 seed webpages - crawl up to ~150,000 pages per site
- Specialized crawls - Universities, US Government, daily crawls for Hurricane Katrina (2005) and monthly newspaper crawls
- Some Ideas:
- Smaller than Altavista but you also have the page content
- Study the evolution of the webgraph
- How does website structure change and evolve over time
- How do webpages (webpage structure) change over time
IM Buddy Graph
- About the Dataset:
- A large IM buddy graph from March 2005 - 230 million nodes & 7,340 million undirected edges
- Limitations: Only have the buddy graph with random node ids - No communication or edge strength
- Some Ideas:
- Find communities, clusters in such a big graph
- Count frequent subgraphs
- Design algorithms to characterize the structure of the network as a whole
Other Ideas/Datasets
- Stanford Search Queries
- New York Times articles (since 1987)
- Articles are manually annotated by subject
- Entity or relation extraction
- Extract keywords, predict article category