CS246
Mining Massive Data Sets
Winter 2013

Course Information


Meeting Times and Locations

Tuesday & Thursday 9:30AM - 10:45AM in NVidia Auditorium, Jen-Hsun Huang Engineering Center

In the first two weeks of the class we will also hold 3 recitation sessions:

Course description

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Relation extraction and Web Advertising.

CS246 is the first part of a two-part sequence, CS246--CS341. CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with unlimited access to a large MapReduce cluster.

Course outline

Tentative list of topics to be covered. These topics may change as the quarter progresses.

See Handouts for a list of topics and reading materials.

Important Dates

Assignment              Out on              Due on
Assignment #0                               Tue, January 15 (no late days!)
Assignment #1           Thu, January 10     Thu, January 24
Assignment #2           Thu, January 24     Thu, February 7
Assignment #3           Thu, February 7     Thu, February 21
Assignment #4           Thu, February 21    Thu, March 7
Alternate Final exam    --                  Tuesday March 19 6:00pm-9:00pm (320-105)
Final exam              --                  Friday March 22 12:15pm-3:15pm (Cemex Auditorium)

See FAQ for information on how to submit assignments and other work.

Prerequisites

Students are expected to have the following background:

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Course materials

Lecture notes and slides will be posted online. Readings have been derived from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

You can see earlier versions of the notes and slides for the Winter 2012 version of the course. Note there may be a slight change in the topics covered this year.

Course handouts and other reading materials can be downloaded here.

Course work and grading

The coursework for the course will consist of:

Gradiance quizzes

Here are the instructions for the weekly quizzes on Gradiance:

You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each of the questions involves a "long-answer" problem, which you should work. The Gradiance system gives you random right and wrong answers each time you open it, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also notice that you have to wait 10 minutes between openings, so brute-force random guessing will not work.

Solutions appear after the problem set is due. However, you must submit at least once, so that your most recent solution appears with the solutions embedded.

Homeworks

Four biweekly homeworks that will involve programming, working with Hadoop, as well as regular numerical/algebraic theory problems.

Questions: We try very hard to make questions unambiguous, but some ambiguities may remain. If you are confused, ask (i.e., post a question on Piazza) or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.

Honor code: We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the solutions independently, and without referring to written notes from the joint session. In other words, each student must understand the solution well enough to reconstruct it by him/herself. In addition, each student should list on the problem set the people with whom she/he collaborated.

Further, since we occasionally reuse problem set questions from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year.

Late assignments: Each student will have a total of two late days to use for homeworks, reaction papers and project proposals. One late day expires at the start of every class. (This means that if the assignment is due on Thursday then the first late day expires on the following Tuesday at the start of the class.) Once late days are exhausted, any assignments turned in late will be penalized 50% per late day. However, no assignment will be accepted more than one late day after its due date.
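As an informal illustration of the late policy above, the rules can be sketched in a few lines of Python. This is only one reading of the policy, not an official grading tool: the function name grade_multiplier and its parameters are hypothetical, and it assumes lateness is counted in "late days" as defined above (one per class start missed after the due date).

```python
def grade_multiplier(late_days_needed, late_days_remaining):
    """Sketch of the late policy (hypothetical helper, not an official tool).

    late_days_needed: class starts missed after the due date (0 = on time).
    late_days_remaining: unused late days left out of the student's two.
    Returns (score multiplier, late days left afterwards),
    or None if the assignment would not be accepted at all.
    """
    if late_days_needed == 0:
        return 1.0, late_days_remaining        # on time: full credit
    if late_days_needed > 1:
        return None                            # more than one late day: not accepted
    if late_days_remaining >= 1:
        return 1.0, late_days_remaining - 1    # spend one free late day
    return 0.5, 0                              # no late days left: 50% penalty


# One late day with a free late day available costs nothing but the late day.
print(grade_multiplier(1, 2))
# One late day with none remaining is penalized 50%.
print(grade_multiplier(1, 0))
```

If anything in this sketch disagrees with the policy text, the policy text applies.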

Assignment submission: See the F.A.Q.

Regrade policy: We take great care to ensure that grading is fair and consistent. Since we will always use the same grading procedure, any grades you receive are unlikely to change significantly. However, if you feel that your work deserves a regrade, please submit a written request within a week of receiving your grade. In your request, indicate which components of your submission you would like regraded, and prepare a clear and concise argument why you feel we should regrade those components. However, note that we reserve the right to regrade the entire assignment.

For grading questions, please talk to us during office hours. If you want a regrade, drop the homework and the request into the submission box in the Gates B-wing.

Hadoop in a Virtual Machine

Most assignments will require some level of programming in Hadoop. Hadoop is an open-source implementation of the MapReduce distributed data-processing environment for mining large data sets across clusters of computers.
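To give a rough feel for the MapReduce programming model, here is a minimal word-count sketch in plain Python. It runs in memory on a single machine and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only mirrors the three conceptual steps: map each input to key-value pairs, group the pairs by key, and reduce each group.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents (single process)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big clusters process big data"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines, and the framework would perform the grouping step over the network; this sketch only shows the data flow.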

You will be running Hadoop jobs on your local laptop/desktop. However, since installing and setting up Hadoop is non-trivial, we have prepared a Linux virtual machine with Hadoop already installed. We will post the instructions and the VM soon.

Note that we will be holding a Hadoop recitation session in the first week of the class. This session will be very useful, as it will teach you how to use Hadoop efficiently and how to troubleshoot and debug Hadoop jobs.

Recitation sessions

Three recitation sessions will be held:

Previous versions of the course

The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).

CS246 was first offered in Winter 2011. Here is the course webpage with all the materials.

Communication

General course questions should be posted on Piazza.

Piazza requires an @stanford.edu email address to register. If you do not have an @stanford.edu address, send us an email with your email address and we will register you.

If you need to reach the course staff, you can reach us at cs246-win1213-staff@lists.stanford.edu (consists of the TAs and the professor).