Project in Mining Massive Data Sets
Spring 2012

Course Information

Course description

CS341 (Project in Mining Massive Data Sets) is a project-focused advanced class with access to a large MapReduce cluster. This course is the second part of the two-part sequence CS246/CS341.

CS246 discusses methods and algorithms for mining massive data sets. In this class, we will develop large-scale data mining techniques and research projects. Students will have access to an Amazon EC2 computing cluster, which means we will be able to run massive MapReduce jobs. Because it is challenging to work on algorithms for large-scale data mining, we will be able to work with only a small number of students, and enrollment will be limited.

This is a purely project-based course. We expect students to already have some familiarity with data mining methods. There will be lectures on some advanced data mining algorithms at the beginning of the quarter. We also expect to have a good number of industrial guest lecturers discussing big data case studies.


Knowledge of and familiarity with the concepts of CS246 or a similar class (Hadoop, large-scale data mining, and machine learning algorithms).

Other courses that might be helpful: CS224W, CS245, CS347, CS276, CS229, CS221, CS228, CS224N.


Doing research in data mining can be challenging! Thus we will only be able to work with a small number of students, and enrollment will be limited.

One of the goals of the class is to help students get involved in long-term research. To quickly give you the background knowledge you will need to do research in data mining, all students are required to successfully complete a programming assignment (which will be posted below) by Friday, April 6th (midnight). This programming assignment asks you to implement the Locality Sensitive Hashing algorithm for detecting similar pairs of sentences. If you have taken and mastered the material in CS246 (including C++/Java programming), we believe you should be able to successfully complete this assignment. This programming assignment will also give you a lower bound on the pace you can expect in CS341.
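To give a rough sense of what such an algorithm involves, here is a minimal Python sketch of one common LSH variant: MinHash signatures with banding over word sets. This is an illustration only and not the assignment specification. The similarity measure (Jaccard similarity of word sets), the function names, and the `bands` and `rows` parameters are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def word_set(sentence):
    """Return the set of lowercase whitespace-separated words (an assumed tokenization)."""
    return set(sentence.lower().split())


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: for each hash function, keep the minimum hash over the tokens."""
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(sentences, bands=10, rows=2):
    """Return index pairs (i, j), i < j, whose signatures agree on at least one band."""
    buckets = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        tokens = word_set(sentence)
        if not tokens:
            continue  # an empty sentence has no signature
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            # Sentences that share every row of a band land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    return len(a & b) / len(a | b)
```

In this sketch, `candidate_pairs` only proposes pairs; each candidate would still be checked with an exact measure such as `jaccard` before being reported. More bands make the filter more permissive, while more rows per band make it stricter.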

Course application procedure

To apply to the course, follow these instructions:

If you would like some help or guidance when developing your project idea, feel free to contact the course staff (Jure or any other instructor). We will help you develop your project idea.

The project proposal submission deadline is Thursday, March 29, at 5pm. Over the weekend we will evaluate the proposals and notify you whether your team has been accepted into the class.

Project writeups

The result of the project is a 5-10 page paper describing the approach, the results, and the related work. The overall form of the paper depends on the nature of the project. Here are some ideas on how to go about preparing the final project writeup.

Course materials

Lecture notes and/or slides will be posted in the handouts section of the website.

The book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman serves as a comprehensive reference for the background required for this course. You will also find Chapters 20.2, 22, and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant.


Students will be required to successfully complete a substantial data-mining project. There will be a mid-quarter project milestone submission/presentation in addition to the final project presentation and report submission.

Recitation sessions

Recitation sessions will be held to guide the students on the use of Amazon services.


To help students get up to speed with Amazon services, there will be an assignment in the initial weeks of the class.

Course work

The coursework for the course will consist of:

Important dates


The tentative grade breakdown is as follows:


General course questions should be posted on Piazza.

Piazza requires an @stanford.edu email address to register. If you do not have an @stanford.edu address, send us an email with your email address and we will add you to Piazza.

If you need to reach the course staff, you can reach us at cs341-spr1112-staff@lists.stanford.edu (this list consists of the professors and the TA).