CS246
Mining Massive Data Sets
Winter 2017

Course Information

Meeting Times and Locations

Tuesday & Thursday 3:00PM - 4:20PM in NVIDIA Auditorium, Jen-Hsun Huang Engineering Center.

In the first two weeks of the class, we will also hold 2 recitation sessions that will serve as refreshers on important course material:

Course description

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Web Advertising.

CS246 is the first part of a two-part sequence, CS246--CS341. CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with unlimited access to a large MapReduce cluster.

For students who want to learn more about Hadoop, we are also offering CS246H: Mining Massive Data Sets: Hadoop Labs. CS246H covers Hadoop in depth to give students a more complete understanding of the platform and its role in data mining.

Course outline

Tentative list of topics to be covered. These topics may change as the quarter progresses.

See Handouts for a list of topics and reading materials.

Important Dates: Assignments

Assignment        Out on               Due on (11:59pm Pacific Time)
Hadoop tutorial   Tue, January 10      Thurs, January 19
Assignment #1     Thurs, January 12    Thurs, January 26
Assignment #2     Thurs, January 26    Thurs, February 9
Assignment #3     Thurs, February 9    Thurs, February 23
Assignment #4     Thurs, February 23   Thurs, March 9
Final exam        --                   Tue, March 21, 3:30-6:30PM

See FAQ for information on how to submit assignments and other work.

Important Dates: Gradiance quizzes

Gradiance quizzes are usually out on Tuesdays and due 9 days later, on Thursdays. Note that we cannot under any circumstances extend the quiz deadline. Once the deadline has passed, students will not be able to submit their quizzes. The table below will be updated with quiz deadlines as the quizzes go live.

Gradiance quiz   Out on             Due on (11:59pm Pacific time)
GHW1             Tue, January 10    Thurs, January 19
GHW2             Tue, January 17    Thurs, January 26
GHW3             Tue, January 24    Thurs, February 2
GHW4             Tue, January 31    Thurs, February 9
GHW5             Tue, February 7    Thurs, February 16
GHW6             Tue, February 14   Thurs, February 23
GHW7             Tue, February 21   Thurs, March 2
GHW8             Tue, February 28   Thurs, March 9
GHW9             Tue, March 7       Thurs, March 16

Prerequisites

Students are expected to have the following background:

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Course materials

Lecture notes and slides will be posted online. Readings have been derived from the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

Automated Quizzes: We will be using Gradiance. Everyone should create an account there (passwords must be at least 10 characters, letters and digits, with at least one of each) and enter the class code 380CE054. Please use your real first and last name, with the standard capitalization, e.g., "Jeffrey Ullman". Also, please register using your Stanford email or the same email you used for Gradescope so we can match your Gradiance score report to other class grades.

Books: Leskovec-Rajaraman-Ullman: Mining of Massive Datasets can be downloaded for free. It can be purchased from Cambridge University Press, but you are not required to do so.

MOOC: You can watch videos from a past Coursera MOOC (similar to this course) on YouTube.

Piazza: Piazza Discussion Group for this class (access code "mmds").

Course handouts: Available here.

Course work and grading

The coursework for the course will consist of:

Gradiance quizzes

Here are the instructions for the weekly quizzes on Gradiance:

You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each of the questions involves a "long-answer" problem, which you should work. The Gradiance system gives you random right and wrong answers each time you open it, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also notice that you have to wait 10 minutes between openings, so brute-force random guessing will not work.

Solutions appear after the problem-set is due. However, you must submit at least once, so your most recent solution appears with the solutions embedded.

Gradiance quizzes are generally out on Tuesdays and due 9 days later, on Thursdays at 11:59pm Pacific time. Note that we cannot under any circumstances extend the quiz deadline. Once the deadline has passed, students will not be able to submit their quizzes.

Homeworks

Four biweekly homeworks that will involve programming, working with Hadoop, as well as regular numerical/algebraic theory problems.

Questions: We try very hard to make questions unambiguous, but some ambiguities may remain. Ask (i.e., post a question on Piazza) if confused or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.

Honor code: We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the code and solutions independently, and without referring to written notes from the joint session. In other words, each student must understand the solution well enough to reconstruct it by himself or herself. In addition, each student should list on the problem set the people with whom he or she interacted.

Since we occasionally reuse problem set questions from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year.

Finally, we consider it an Honor Code violation to post your homework solutions to a place where it is easy for other students to access them. This includes uploading your solutions to publicly viewable repositories, such as those on GitHub.

The standard penalty for a first offense includes a one-quarter suspension from the University and 40 hours of community service. The standard penalty for multiple violations (e.g., cheating more than once in the same course) is a three-quarter suspension and 40 or more hours of community service. The Stanford Office of Community Standards has more information.

Late assignments: Each student will have a total of two late periods to use for homeworks. A late period ends at midnight on the day of the next class. (This means that if the assignment is due on Thursday, then the first late period expires at midnight on the following Tuesday.) Once late periods are exhausted, any assignments turned in late will be penalized 50% per late period. However, no assignment will be accepted more than one late period after its due date. Also note that we cannot under any circumstances extend the deadline of quizzes on Gradiance. Students cannot use any late periods for quizzes on Gradiance.
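As a rough illustration of the late-period rule above, the following Python sketch computes the date on which a late period expires. It assumes that classes meet only on Tuesdays and Thursdays (per the meeting times) and ignores holidays and cancelled lectures; `late_period_end` and `CLASS_WEEKDAYS` are hypothetical names, not part of any course tooling.

```python
from datetime import date, timedelta

# Assumption: classes meet on Tuesdays and Thursdays.
# Python's date.weekday() uses Monday=0, so Tuesday=1 and Thursday=3.
CLASS_WEEKDAYS = {1, 3}


def late_period_end(due: date) -> date:
    """Return the first class day strictly after the due date."""
    day = due + timedelta(days=1)
    while day.weekday() not in CLASS_WEEKDAYS:
        day += timedelta(days=1)
    return day


if __name__ == "__main__":
    # Assignment #1 is due Thursday, January 26, 2017.
    print(late_period_end(date(2017, 1, 26)))  # 2017-01-31 (a Tuesday)
```

Under these assumptions, an assignment due on a Tuesday would have its late period expire on the Thursday of the same week.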

Assignment submission: All students (SCPD and non-SCPD) should submit their assignments via GradeScope by 11:59PM on the due date. (We will allow a small 15-minute grace period, but beyond that and late periods, all deadlines are final.) You can typeset or scan your assignment. Make sure that you answer each sub-question on a separate page. That is, one question per page regardless of the answer length. Also, attach a signed cover sheet to the end of your submission.

Do not put code in your GradeScope submission. Also, please make sure to tag each part correctly on GradeScope so it is easier for us to grade. There will be a small point deduction for each mistagged page and for each question that includes code.

To register for GradeScope,

Students also need to upload their code at http://snap.stanford.edu/submit. Put all the code for a single question into a single file and upload it.

Regrade policy: We take great care to ensure that grading is fair and consistent. Since we will always use the same grading procedure, any grades you receive are unlikely to change significantly. However, if you feel that your work deserves a regrade, please submit a request on GradeScope within one week of receiving your grade.

Before requesting a regrade, please prepare a clear and concise argument for your stance by doing the following:

And then submit your regrade request via GradeScope. We reserve the right to regrade the entirety of any homework for which any regrade is requested. Moreover, if the regrade request is unjustified and thus not honored, then every future unsuccessful regrade request will be penalized 5 points.

For grading questions, please talk to us during office hours.

Hadoop in a Virtual Machine

Most assignments will require some level of programming in Hadoop. Hadoop is an open-source implementation of the MapReduce distributed data processing environment for mining large data sets across clusters of computers.

You will be running Hadoop jobs on your local laptop/desktop. However, since installing and setting up Hadoop is non-trivial, we have prepared a Linux virtual machine with Hadoop already installed. Instructions on installing the VM can be found in homework 0.
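For readers new to the MapReduce model, here is a minimal single-machine sketch of the classic word-count example in plain Python. It does not use Hadoop or any Hadoop API; it only mimics the three conceptual stages (map, group by key, reduce) in memory. All function names here (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster, the map and reduce functions run in parallel on many machines and the framework performs the grouping step; this sketch runs everything sequentially to show the data flow only.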

Recitation sessions

Two recitation sessions will be held:

The recitation sessions are only intended to be refreshers; it is expected that you have already taken courses that include this material.

Previous versions of the course

The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).

You can access class notes and slides of previous versions of the course here:

CS246: Winter 2016

CS246: Winter 2015

CS246: Winter 2014

CS246: Winter 2013

CS246: Winter 2012

CS246: Winter 2011

CS345a: Winter 2010

Communication

General course questions should be posted on Piazza.