Stanford CS341: Project in Mining Massive Data Sets

Spring 2011

Course information:

CS341 (Project in Mining Massive Data Sets) is a project-focused advanced class with access to a large MapReduce cluster. This course is the second part of the two-part sequence CS246/CS341, which replaces CS345A: Data Mining. CS246 discusses methods and algorithms for mining massive data sets.

In this class, we will develop large-scale data mining techniques and research projects. Students will have access to an Amazon EC2 computing cluster, which means we will be able to run massive MapReduce jobs. Because it is challenging to work on algorithms for large-scale data mining, we will be able to work with only a small number of students, and enrollment will be limited.

This is a purely project-based course. We expect that students are already somewhat familiar with data mining methods. There will be lectures on some advanced data mining algorithms at the beginning of the quarter. We also expect to have a good number of industrial guest lecturers discussing big data case studies.

Instructors:

Jure Leskovec
Office Hours: TBD

Anand Rajaraman
Office Hours: TBD

Jeff Ullman
Office Hours: TBD

Class meetings:

Tue, Thu 4:15-5:30 in 380-380D

This is a project course. There will be only a few weekly lectures and only two introductory homeworks. We will spend the quarter working in teams on different projects related to large-scale data mining.

Teaching assistant:

Hyung Jin (Evion) Kim
Office Hours: TBD

Staff mailing list:

You can reach us at cs341-spr1011-staff@lists.stanford.edu

Prerequisites:

Knowledge of and familiarity with the concepts of CS246.

Other classes that might be helpful: CS224W, CS245, CS347, CS229, CS221, CS228, CS224N, CS276.

Enrollment:

Doing research in data mining can be challenging! Thus we will only be able to work with a small number of students, and enrollment will be limited.

One of the goals of the class is to help students get involved in long-term research. To quickly give you the background knowledge you will need to do research in data mining, all students are required to successfully complete a programming assignment (to be posted below) by Thursday, April 7th. This programming assignment asks you to implement the Locality Sensitive Hashing algorithm for detecting similar pairs of sentences. If you have taken and mastered the material in CS246 (including C++/Java programming), we believe you should be able to successfully complete this assignment. This programming assignment will also give you a lower bound on the pace you can expect in CS341.
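As a rough illustration of the general idea behind Locality Sensitive Hashing for near-duplicate sentences, here is a minimal Python sketch. It is not the assignment specification or a reference solution: the choice of word shingles, MinHash signatures with banding, Jaccard similarity, and every function name and parameter value below (`shingles`, `minhash_signature`, `similar_pairs`, `bands`, `rows`, `threshold`) are assumptions made for this example only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(sentence, k=2):
    """Return the set of k-word shingles of a sentence (assumed representation)."""
    words = sentence.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similar_pairs(sentences, bands=10, rows=2, threshold=0.5):
    """Return index pairs (i, j), i < j, whose shingle sets have Jaccard >= threshold.

    Candidates are pairs whose signatures agree on every row of at least one
    band; each candidate is then verified against the exact Jaccard similarity.
    """
    sets = [shingles(s) for s in sentences]
    sigs = [minhash_signature(s, bands * rows) for s in sets]
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(idx)
        for bucket in buckets.values():
            candidates.update(combinations(bucket, 2))
    return sorted(
        (i, j) for i, j in candidates if jaccard(sets[i], sets[j]) >= threshold
    )
```

Calling `similar_pairs` on a list of sentences returns the index pairs that survive both the banding step and the exact similarity check. Because banding is probabilistic, a pair that is only slightly above the threshold can be missed; increasing `bands` or lowering `rows` trades extra candidate checks for fewer misses.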

Materials:

The book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman serves as a comprehensive reference for the background required for this course. You will also find Chapters 20.2, 22, and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant.

Requirements:

Students will be required to successfully complete a substantial data mining project. There will be a mid-quarter project milestone submission and presentation, in addition to the final project presentation and report submission.

Recitation Sessions:

Recitation sessions will be held to guide the students on the use of Amazon services.

Assignments:

To help students get up to speed with Amazon services, there will be an assignment in the initial weeks of the class.

Tentative grade breakup:

The course grade is based on the project.

Announcements:

The first class will be held on Tuesday 3/29. See you there!
Assignment 1 is out on 3/29 and due at 5:00PM on 4/7. You can find the data set here.
Office hours for Assignment 1 will be held on 4/1 and 4/5, 9-10AM, in Gates B26B.
Register your groups here. Registration must be completed by 24:00 on 4/7. See here to find out how to register for AWS.