Stanford CS341: Project in Mining Massive Data Sets (Spring 2012) -- Datasets

When dealing with these datasets please be careful and responsible. The datasets are meant to be used strictly for the purposes of the class project and nothing else. This means: (1) Do not do anything "funny" with the datasets; (2) Do not try to break the anonymization; (3) Do not share the data outside the class; (4) Do not copy the data off Amazon EC2; (5) After the class is over, destroy all data.

Datasets "in progress"

These datasets are currently under preparation, but you can find rough descriptions at http://bit.ly/CS341DATA.

Stanford CS341 only datasets

Walmart. We have Walmart sales data. Click here for license and usage conditions. The idea is to combine these with Twitter or Memetracker data to mine spikes in product sales and similar patterns. We have randomly sampled market basket data from a set of stores in the northern Illinois region (including Chicago) for the period October 7 – November 7, 2011. We picked these dates to include Halloween: one of the questions I'd like students to study is whether tweets can predict which Halloween costumes sell.

Walmart transaction sample
--------------------------
Each record represents an item purchased as part of a visit
(for our purpose: a market basket):

* Unique visit ID (to link items to the same visit)
* Date of visit
* Store visited
> Store ID
> Coordinates (latitude and longitude)
> Location (address)
* Item purchased:
> UPC of scanned item
> Receipt-printed description string for item
> Walmart categorization of product
(department / category / subcategory)
> Number of units purchased

The visits are sampled from a set of Walmart retailers
in northern Illinois, 07 October 2011 - 07 November 2011.
Retailer formats may vary (e.g., Supercenter, Express,
Neighborhood Market,...).
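
Since each record is a single purchased item, a first step for most analyses is to group records by visit ID into market baskets. The following is a minimal sketch of that grouping. The field names (visit_id, upc, units) and the sample values are assumptions made for illustration; the actual file layout may differ.

```python
from collections import defaultdict

# Hypothetical records: field names and values are made up for illustration.
records = [
    {"visit_id": "v1", "upc": "0001", "units": 2},
    {"visit_id": "v1", "upc": "0002", "units": 1},
    {"visit_id": "v2", "upc": "0001", "units": 1},
]


def build_baskets(rows):
    """Map each visit ID to the set of UPCs bought during that visit."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row["visit_id"]].add(row["upc"])
    return dict(baskets)


baskets = build_baskets(records)
```

Using a set per visit drops the unit counts, which is usually fine for frequent-itemset mining; keep the counts if quantities matter for your project.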
Twitter: a complete set of all the tweets (3 billion) from January 2011. We only have the tweets; the follower network needs to be crawled separately. Some project ideas:
- Event detection: Can we detect that some outside event is going on right now? Can we detect such events early? Detect reports of particular kinds of events close to real time (e.g., natural disasters, revolutions, Justin Bieber's haircut, ...)
- Construct Twitter influence networks: An influence network specifies who influences whom, on which events, etc.
- Tweet spam and bot detection: Distinguish bots from human users.
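
As one possible starting point for the event detection idea, the sketch below flags time buckets in which the number of tweets matching some keyword jumps well above its recent level. It is a simple trailing-window z-score baseline, not a method prescribed by the course; the window size, threshold, floor on the standard deviation, and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def find_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Use a floor of 1.0 so a flat history (sigma == 0) does not
        # turn every tiny increase into a spike.
        if counts[i] > mu + threshold * max(sigma, 1.0):
            spikes.append(i)
    return spikes


# Hypothetical hourly counts of tweets mentioning one keyword.
hourly_counts = [10, 12, 11, 9, 10, 11, 95, 12, 10]
spikes = find_spikes(hourly_counts)
```

With these made-up counts, only the bucket at index 6 is flagged. A real project would likely need to handle daily seasonality and the fact that a spike inflates the statistics of the windows that follow it.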
Memetracker. 20 million news media articles and blog posts per day since August 2008. TBs of data. Some project ideas:
- Time series prediction: Predict stock movements, popularity of politicians, emergence of real-world events.
Skout is a real-time location-based service to meet, flirt and date, with hundreds of thousands of users online every day. The dataset consists of about 3 months' worth of data containing:
- User data, including birthday, ethnicity, orientation (approximately 1/4 GB)
- Searches, including search radius and search settings (10 GB / month)
- GPS locations of users over time (10 GB / month)
- Message headers, including to/from users (10 GB / month)
- Profile views (7 GB / month)
- 'Blocks' and 'hotlist' events between users
- User logins (10 GB / month)
Meetup. Details are on Piazza.
Kaggle has a number of interesting challenges and datasets. For example, KDD Cup 2012 has a link-prediction task for a Chinese clone of Twitter.
Wikipedia. Complete revision history of Wikipedia -- every edit of every article with full article content. This data is massive and super interesting. It gives you full article edit history as well as user discussion history of Wikipedia. Some project ideas:
- Model editor lifetimes: Which editors are likely to come back and make more edits? Can we detect long-term contributors to Wikipedia early on?
- Classify/cluster Wikipedia editors based on what kinds of edits they make (big/small changes, adding content, reorganizing the article, ...).
- Recommend to users the articles they might want to edit.
- Identify which articles are controversial (their edits get reverted).
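
For the controversy idea, one rough signal is how often an article is restored to an earlier state. The sketch below counts such "identity reverts" by hashing the full text of each revision and checking whether the same content appeared before. It assumes you already have an article's revision texts in chronological order; the function name and the sample history are hypothetical, and partial reverts are not detected.

```python
import hashlib


def count_identity_reverts(revision_texts):
    """Count revisions whose text exactly matches an earlier,
    non-adjacent revision (i.e., that restore an old state)."""
    seen = {}  # content hash -> index of the most recent revision with that content
    reverts = 0
    for i, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # An identical immediate predecessor is a null edit, not a revert.
        if digest in seen and seen[digest] < i - 1:
            reverts += 1
        seen[digest] = i
    return reverts


# Hypothetical revision history of one article, oldest first.
history = ["A", "A B", "A", "A C", "A", "A D"]
num_reverts = count_identity_reverts(history)
```

Storing only hashes keeps memory small even though the revisions themselves carry full article content.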
Wikipedia webserver logs: Hourly Wikipedia page access statistics. Some project ideas:
- It would be interesting to study how real-world events or different article access patterns correlate with edits and evolution.
The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
Yahoo! Messenger: Instant Messenger graph with some additional information.
Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges. No additional information is provided.
Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges. Webpage URLs (but no content) are provided. Some project ideas:
- Website structure identification: From the webgraph extract "websites". Then cluster the website graphs based on their structure to identify common navigational structures. What structural classes of websites exist?
- Build a summary/map of a website. What structural roles do webpages play in the webgraph? Cluster pages into Content pages, Navigational pages, Index pages based on the graph structure.
Stanford WebBase. The Stanford WebBase project provides repeated crawls of the web. This allows for studying the evolution of the web graph and the webpage content. Find description here. Find how to access web pages in the repository here.
Mining textbooks. Using a large repository of textbooks, one can ask a number of interesting questions, including identifying the main topics of chapters, sections, etc.; detecting use before definition (harder than it looks: "in Ch. 10 we'll learn what NP-completeness really means" is not a use before definition); and finding useful illustrations and supplements on the Web. Here is a pointer to work at Microsoft Research that has used this corpus and has some interesting ideas on the subject.
Help us find an innovative way to analyze and visualize test data
by LitePoint Corp. (Reward: $500) Sign up here to access the challenge with your Stanford email.
LitePoint makes test equipment for the wireless industry. The dataset is real but has been obfuscated and cleansed a bit. There are tests from more than 300,000 devices over a period of 1 month, and each device has almost 900 tests, so you'll find more than a quarter billion numbers in the CSV file, numbers that can be visualized in many ways. A quick intro to test data: some tests have limits, either upper, lower, or both. The test result has to be within these limits to pass the test; if there are no limits, the test is passed by default. The percentage of devices that pass all tests is called the yield, and the station is the physical location where the device actually gets tested.
So what can be done with the data? We typically calculate yield over station and over time. We also create fail Paretos, again typically over stations and time, and make simple graphs and statistics for each test, for instance histograms, line charts over time, and scatter plots for comparison. Our intent with this challenge is purely exploratory, meaning that you're free to explore as you please! Submit a video of you describing your analysis and why it's useful. Also include a PDF file that shows how you interpreted the data (images/graphs/tables).
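
To make the pass/fail and yield definitions above concrete, here is a minimal sketch of computing yield per station. The record layout (station, device ID, test name, value), the test names, and the limit values are assumptions for illustration, as is treating the limits as inclusive; the real CSV columns may look different.

```python
def passes(value, lower=None, upper=None):
    """A result passes if it lies within whichever limits exist;
    with no limits at all, it passes by default."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


def yield_by_station(results, limits):
    """results: iterable of (station, device_id, test_name, value).
    limits: test_name -> (lower, upper); either bound may be None.
    Returns station -> fraction of devices that passed all their tests."""
    device_ok = {}  # (station, device_id) -> True while every test so far passed
    for station, device, test, value in results:
        lower, upper = limits.get(test, (None, None))
        key = (station, device)
        device_ok[key] = device_ok.get(key, True) and passes(value, lower, upper)
    totals, passed = {}, {}
    for (station, _), ok in device_ok.items():
        totals[station] = totals.get(station, 0) + 1
        passed[station] = passed.get(station, 0) + (1 if ok else 0)
    return {s: passed[s] / totals[s] for s in totals}


# Hypothetical data: test names and limits are made up.
limits = {"power": (5.0, 20.0), "freq": (None, 6.0)}
results = [
    ("S1", "d1", "power", 10.0),
    ("S1", "d1", "freq", 5.0),
    ("S1", "d2", "power", 25.0),   # above the upper limit: d2 fails
    ("S2", "d3", "power", 12.0),
    ("S2", "d3", "noise", 99.0),   # no limits defined: passes by default
]
station_yield = yield_by_station(results, limits)
```

The same per-device pass flags could be grouped by day instead of station to get yield over time, and counting failures per test name would give the input for a fail Pareto.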

Let us know if you need more info on these datasets. We will upload the datasets to EC2.

Other datasets

DBpedia. Richly labeled network containing extracted data from Wikipedia (based on infoboxes). Labeled network of multiple types of nodes and edges. About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages. http://dbpedia.org. Some project ideas:
- Detection of missing links (and relation types).
- Classification of nodes into the ontology.
Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps.
Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
SNAP network datasets. 60 large social and information network datasets
Wikipedia
- Complete edit history of Wikipedia articles: Which user edited what article at what time.
- Wikipedia page to page link data
- DBpedia: A richly labeled graph of Wikipedia entities.
- Freebase: An entity graph of people, places and things.
Ratings and purchases (movies, music, etc.)
- Amazon product co-purchasing network: 600k products and all their metadata.
- KDD Cup 2011: 300M ratings from 1M users on 600k songs, albums and artists.
- IMDB database: Everything about every movie ever made.
- Movielens: User movie rating data.
Yahoo! Webscope Catalog of datasets
- Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, Competition Data
- Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.
1.usa.gov data set
It would enable questions around link propagation, half-life by referrer, geographical analysis, and I'm sure a ton of other fun stuff! 1.usa.gov