CS341
Project in Mining Massive Data Sets
Spring 2012
Course Information
Course description
CS341 (Project in Mining Massive Data Sets) is a project-focused advanced class with access to a large MapReduce cluster. This course is the second part of the two-part sequence CS246/CS341.
CS246 discusses methods and algorithms for mining massive data sets. In this class, we will develop large-scale data mining techniques and research projects. Students will have access to an Amazon EC2 computing cluster, which means we will be able to run massive MapReduce jobs. Because it is challenging to work on algorithms for large-scale data mining, we will be able to work with only a small number of students, and enrollment will be limited.
This is a purely project-based course. We expect that students are already somewhat familiar with data mining methods. There will be lectures on some advanced data mining algorithms at the beginning of the quarter. We also expect to have a good number of industrial guest lecturers discussing big data case studies.
Prerequisites
Knowledge of and familiarity with the concepts of CS246 or a similar class (Hadoop, large-scale data mining and machine learning algorithms).
Other courses that might be helpful: CS224W, CS245, CS347, CS276, CS229, CS221, CS228, CS224N.
Enrollment
Doing research in data mining can be challenging! Thus we will only be able to work with a small number of students, and enrollment will be limited.
One of the goals of the class is to help students get involved in long-term research.
To quickly give you the background knowledge you will need to do research in data mining, all students are required to successfully complete a programming assignment (to be posted below) by Friday April 6th (midnight). This programming assignment asks you to implement the Locality Sensitive Hashing algorithm for detecting similar pairs of sentences. If you have taken and mastered the material in CS246 (including C++/Java programming), we believe you should be able to successfully complete this assignment. This programming assignment will also give you a lower bound on the pace you can expect in CS341.
Course application procedure
To apply to the course, follow these instructions:
- Form a team of three students. We strongly discourage teams of 2 or 4 students. Use Piazza to find project partners.
- Find a large dataset that you will work on. See the list of datasets (and project ideas) that we provide, but feel free to use your own dataset.
- Write a project proposal (there is no page limit but we do not promise to read more than the first 3 pages). Your project proposal should be structured into the following sections (it should concisely answer the following questions):
- What is the problem your team is solving? Give a brief but precise description or definition of the problem.
- What data will you use? Briefly describe the data, its size (number of records, number of GBs) and where you will get the data. It is fine to use your own data, but make sure you have it before you propose the project.
- How will you solve the problem? Describe your approach. Tell us what method, algorithm, or technique you plan to develop or use and how you will scale it up to the size of your data. Be as specific as you can!
- How will you evaluate your method? Here we want you to describe how you will measure the performance or success of your method. Against what baseline methods will you compare your algorithm, or how do you plan to obtain ground-truth labeled data so that you can measure accuracy, precision, recall or some other metric that tells us how well your method is really performing?
- What do you expect to submit/accomplish by the end of the quarter?
- Your proposal should include the following info:
- Project title
- Team member info: name, email address, SUNet id (and Twitter handle, if you have one)
- Team member description: Give us a 2-sentence description of the experience that makes you qualified for this class. Don't tell us what classes you took in the past.
- Send us the PDF of your proposal to cs341-spr1112-staff@lists.stanford.edu
If you would like some help or guidance when developing your project idea, feel free to contact the course staff (Jure or any other instructor). We will help you develop your project idea.
The project proposal submission deadline is Thursday March 29 at 5pm. Over the weekend we will evaluate the proposals and notify you whether your team has been accepted into the class.
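As an illustration of the evaluation question in the proposal outline above, here is a minimal Python sketch of how precision and recall could be computed once ground-truth labels are available. The function name precision_recall_f1 and the toy item ids are hypothetical, chosen only for this example; they are not part of any course material, and F1 is included only as one example of "some other metric".

```python
def precision_recall_f1(predicted, ground_truth):
    """Compare a set of predicted items against a set of ground-truth items.

    Both arguments are iterables of hashable items (for example, ids of
    records the method flagged, and ids that are truly positive).
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)

    # Items that the method reported and that are also in the ground truth.
    true_positives = len(predicted & ground_truth)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data: the method reported items 1-4; items 2, 3 and 5 are correct.
    p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The same numbers computed for a simple baseline method give the comparison that the evaluation section of a proposal asks for.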
Project writeups
The result of the project is a 5-10 page paper describing the approach, the results, and the related work. The overall form of the paper depends on the nature of the project. Here are some ideas on how to go about preparing the final project writeup.
- There is no page limit but we do not promise to read more than the first 10 pages.
- The writeup should address:
- Problem description: Give a brief but precise description or definition of the problem or hypothesis you set out to evaluate.
- Related work: How do this problem and the method relate to problems/methods others have developed in the past?
- Solution: How did you solve the problem? Describe the technical approach. Tell us what method/algorithm you used, developed or extended and how you implemented it.
- Experiments:
- Data: Briefly describe the data and its size (number of records, number of GBs).
- Experimental setup: Describe how you set up your experiments, how the training/testing data was prepared, what performance metrics you are considering, and what baseline methods you are using for comparison.
- Experimental results: Describe your experimental results. Structure your experiments around particular aspects of your method. For example, if you are working on a machine learning project you could structure the experiments as follows: (1) a table showing results of your method using different types of features; (2) a table comparing the performance of your method to the baselines; (3) a graph plotting the size of the training dataset vs. the time it takes to train the model; (4) an investigation of the learned model (what the important features are, etc.).
- Brief conclusion
- Writeups are due 6/11 at 5pm.
- Send us the PDF of your final writeup to cs341-spr1112-staff@lists.stanford.edu
Course materials
Lecture notes and/or slides will be posted in the handouts section of the website.
The book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman serves as a comprehensive reference for the background required for this course. You will also find Section 20.2 and Chapters 22 and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant.
Requirements
Students will be required to successfully complete a substantial data-mining project. There will be a mid-quarter project milestone submission/presentation in addition to the final project presentation and report submission.
Recitation sessions
Recitation sessions will be held to guide students in the use of Amazon services.
Assignments
To help students get up to speed with the Amazon services, there will be an assignment in the initial weeks of the class.
Course work
The coursework for the course will consist of:
- Project milestone report and in-class presentation
- Final project report and in-class presentation
Important dates
- 5/1 and 5/3: In-class project milestone presentations
- 6/5 and 6/7: Final presentations (in class)
- 6/11: Final writeups due (5pm). Email your PDFs to the course staff mailing list
Grading
The tentative grade breakdown is as follows:
- Milestone (presentation): 20%
- Final (presentation, poster and writeup): 80%
Communication
General course questions should be posted on Piazza.
Piazza requires an @stanford.edu email address to register. If you do not have an @stanford.edu address, send us an email with your email address and we will add you to Piazza.
If you need to reach the course staff, you can reach us at cs341-spr1112-staff@lists.stanford.edu (this list consists of the professors and the TA).