CS341:
Advanced Topics in Data Mining
Spring 2011
Books (PDFs):
Datasets:
SNAP network datasets
Wikipedia
Ratings and purchases (movies, music, etc.)
Yahoo! Webscope Catalog of datasets
- Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
- Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.
Co-authorship and Citation Networks
Internet (Autonomous Systems) topology
Who-trusts-whom data at Trustlet
Stanford-only datasets
- Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
- Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
- Memetracker2: 1 million blog posts, news media articles, tweets, and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181 GB of compressed data.
- The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
- TheFind: product information data (price, category, related products) extracted from 239 different websites.
- Twitter: About 500 million tweets over a 7-month period. Data description.
- Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
- Wikipedia webserver logs: Hourly Wikipedia page access statistics.
- Yahoo! Messenger: Instant Messenger graph with some additional information.
Data can be accessed here. Email Jure if you do not have a password.
Other Datasets
- Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
- The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.