visitor_hash | timestamp | requested_url | referer_from_a_search_engine |
---|---|---|---|
a997c1950718d75c03f22ca8715e50b3 | [28/Feb/2007:23:45:47 -0800] | /group/svsa/cgi-bin/www/officers.php | "http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts" |
See http://www.stanford.edu/~antonell/tags_dataset.html for more information about how to get and use this file.
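As a rough illustration of the record layout above, here is a minimal Python sketch that splits a log line on the pipe delimiter and pulls the query out of the search-engine referer. The field order and the `q=` parameter follow the example row; the file name `access_log.txt` is only a placeholder, not the dataset's actual name.

```python
from urllib.parse import urlparse, parse_qs

def parse_log_line(line):
    """Split one pipe-delimited record into its four fields and
    extract the search query from the referer URL, if any."""
    visitor_hash, timestamp, requested_url, referer = \
        [field.strip() for field in line.split("|")[:4]]
    referer = referer.strip('"')
    query = parse_qs(urlparse(referer).query).get("q", [None])[0]
    return {
        "visitor_hash": visitor_hash,
        "timestamp": timestamp.strip("[]"),
        "requested_url": requested_url,
        "referer": referer,
        "search_query": query,  # e.g. "sexy random facts" for the row above
    }

# Placeholder file name; substitute whatever the downloaded log is called.
with open("access_log.txt") as f:
    for line in f:
        print(parse_log_line(line))
```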
Excerpt: Available for noncommercial research license from The Linguistic Data Consortium (LDC), the corpus spans 20 years of newspapers between 1987 and 2007 (that's 7,475 issues, to be exact). This collection includes the text of 1.8 million articles written at The Times (for wire service articles, you'll have to look elsewhere). Of these, more than 1.5 million have been manually annotated by The New York Times Index with distinct tags for people, places, topics and organizations drawn from a controlled vocabulary. A further 650,000 articles also include summaries written by indexers from the New York Times Index. The corpus is provided as a collection of XML documents in the News Industry Text Format and includes open source Java tools for parsing documents into memory resident objects.
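The Java parsing tools mentioned in the excerpt ship with the corpus itself. As a language-neutral sketch only, the snippet below walks a single corpus XML file with Python's standard ElementTree; the element names (`hl1` for the headline, `block`/`p` for body paragraphs) and the file path are assumptions about a typical NITF layout, not the official parser's API, so check them against the corpus documentation.

```python
import xml.etree.ElementTree as ET

def read_nitf_article(path):
    """Pull headline and body text from one NITF-style XML file.
    Tag names here are common NITF conventions and may need adjusting
    to the corpus's exact schema."""
    root = ET.parse(path).getroot()
    headline_el = root.find(".//hl1")
    headline = headline_el.text if headline_el is not None else None
    paragraphs = [p.text for p in root.findall(".//block/p") if p.text]
    return {"headline": headline, "body": "\n".join(paragraphs)}

# Hypothetical path; the LDC distribution contains many such XML files.
article = read_nitf_article("1987/01/01/0000001.xml")
print(article["headline"])
```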
If you are interested in obtaining either of these data sets, requests can be emailed to love-cs345 at cellixis dt cm.