CS246

Mining Massive Data Sets

Winter 2018

Course Information

Meeting Times and Locations

Tuesday & Thursday 3:00PM - 4:20PM in NVIDIA Auditorium, Jen-Hsun Huang Engineering Center.

In the first two weeks of the class, we will also hold three recitation sessions that will serve as refreshers on important course material:

Spark tutorial and help session: Thursday, January 11, from 4:30-5:50 pm in Skilling Auditorium.
Review of linear algebra and proof techniques: Tuesday, January 16, from 4:30-5:50 pm in Skilling Auditorium.
Review of probability and statistics: Thursday, January 18, from 4:30-5:50 pm in Skilling Auditorium.

Course description

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Web Advertising.

CS246 is the first part of a two-part sequence, CS246--CS341. CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with unlimited access to a large MapReduce cluster.

For students who want to learn more about Spark and Hadoop, we are also offering CS246H: Mining Massive Data Sets: Hadoop/Spark Labs. In CS246H, Spark and Hadoop will be covered in depth to give students a more complete understanding of the platform and its role in data mining. CS246H videos may be viewed here.

Course outline

Tentative list of topics to be covered. These topics may change as the quarter progresses.

Introduction and MapReduce
Association Rules: Frequent itemsets and Association rules
Near Neighbor Search in High Dimensional Data
Locality Sensitive Hashing (LSH)
Dimensionality reduction: SVD and CUR
Recommendation Systems
Clustering
Link Analysis: Personalized PageRank, Hubs and Authorities
Web spam and TrustRank
Proximity search on Graphs: Random Walks with Restarts
Large scale supervised machine learning (1): k-nearest neighbor, Perceptron
Large scale supervised machine learning (2): Classification and regression trees
Large scale supervised machine learning (3): Support Vector Machines
Mining data streams
Web Advertising

See Handouts for a list of topics and reading materials.

Important Dates: Assignments

Assignment	Out on	Due on (11:59pm Pacific Time)
Spark/Hadoop tutorial	Tue, January 09	Thurs, January 25
Assignment #1	Thurs, January 11	Thurs, January 25
Assignment #2	Thurs, January 25	Thurs, February 8
Assignment #3	Thurs, February 8	Thurs, February 22
Assignment #4	Thurs, February 22	Thurs, March 8
Final exam	--	Tue, March 20, 3:30-6:30PM

See FAQ for information on how to submit assignments and other work.

Important Dates: Gradiance quizzes

Gradiance quizzes are usually out on Tuesdays and due 9 days later, on Thursdays at 11:59pm Pacific Time. Note that we cannot extend the quiz deadline under any circumstances. Once the deadline has passed, students will not be able to submit their quizzes. The table below will be updated with quiz deadlines as the quizzes go live.

Gradiance quiz	Out on	Due on (11:59pm Pacific time)
GHW1	Tue, January 09	Thurs, January 25
GHW2	Tue, January 16	Thurs, January 25
GHW3	Tue, January 23	Thurs, February 1
GHW4	Tue, January 30	Thurs, February 8
GHW5	Tue, February 6	Thurs, February 15
GHW6	Tue, February 13	Thurs, February 22
GHW7	Tue, February 20	Thurs, March 1
GHW8	Tue, February 27	Thurs, March 8
GHW9	Tue, March 6	Thurs, March 15

Prerequisites

Students are expected to have the following background:

Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS107 or CS145 or equivalent are recommended).
Good knowledge of Java and Python will be extremely helpful, since most assignments will require the use of Spark/Hadoop, which run on the Java Virtual Machine. (Note: There will be no Hadoop in Winter 2018, and Python will suffice for this class.)
Familiarity with basic probability theory (CS109 or Stat116 or equivalent is sufficient but not necessary).
Familiarity with writing rigorous proofs (at a minimum, at the level of CS 103).
Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE 263 would be much more than necessary).
Familiarity with algorithmic analysis (e.g., CS 161 would be much more than necessary).

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Course materials

Lecture notes and slides will be posted online. Readings have been derived from the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

Books: Leskovec-Rajaraman-Ullman: Mining of Massive Datasets can be downloaded for free. It can be purchased from Cambridge University Press, but you are not required to do so.

MOOC: You can watch videos from a past Coursera MOOC (similar to this course) on YouTube.

Piazza: Piazza Discussion Group for this class.

Course handouts: Available here.

Course work and grading

The coursework for the course will consist of:

Gradiance quizzes: Short weekly Gradiance quizzes. 20% of the final grade.
Homeworks: Tutorial and four biweekly homeworks that include programming. 40% of the final grade.
Final exam. 40% of the final grade.
Extra credit: Up to 1% of the final grade. Read our extra credit policy.

Please read the homework submission instructions and policies for instructions on how to submit homework, register for Gradiance, etc.
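
As a rough illustration of how the weights listed above combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that extra credit is simply added as percentage points; both are assumptions made for this example, and the function name final_grade is hypothetical. The extra credit policy is the authoritative source for how extra credit is actually applied.

```python
def final_grade(quizzes, homeworks, final_exam, extra_credit=0.0):
    """Combine component scores (each assumed to be on a 0-100 scale).

    Weights come from the coursework list: 20% Gradiance quizzes,
    40% homeworks, 40% final exam. Extra credit is capped at 1 point,
    and adding it directly to the total is an assumption of this sketch.
    """
    if not 0.0 <= extra_credit <= 1.0:
        raise ValueError("extra credit is limited to at most 1% of the final grade")
    weighted = 0.20 * quizzes + 0.40 * homeworks + 0.40 * final_exam
    return weighted + extra_credit


if __name__ == "__main__":
    # Example: 90 on quizzes, 85 on homeworks, 80 on the final exam.
    print(round(final_grade(90, 85, 80), 2))
```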

Spark

Most assignments will require some level of programming in Spark. Spark is an open-source distributed data processing environment, built on the ideas of MapReduce, for mining large data sets across clusters of computers.

You will be running Spark jobs on your local laptop/desktop. Instructions on installing Spark can be found in homework 0.
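
For readers new to the MapReduce model that Spark builds on, the following is a minimal single-machine sketch of a word count in plain Python. It does not use the Spark API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this illustration; a real Spark job would distribute the same three steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["mining massive data sets", "massive data"]
    print(word_count(sample))
```

Each map call depends only on its own line and each reduce call only on its own key, which is what lets a framework run these steps in parallel on different machines.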

Recitation sessions

Three recitation sessions will be held:

Spark Tutorial and Help Session: Daniel Templeton, the instructor of CS246H, will go over how to use Spark, and spend some time helping individual students.
Linear Algebra and Proof Techniques Review: properties such as rank and nullspace, operations such as inverse and trace, quadratic forms, eigendecomposition. Proof techniques such as induction and contradiction.
Probability Review: random variables, moments, basic limits and bounds, maximum likelihood, basic optimization algorithms like gradient descent.

The recitation sessions are only intended to be refreshers; it is expected that you have already taken courses that include this material.

Previous versions of the course

The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3-4 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).

You can access class notes and slides of previous versions of the course here:

Communication

General course questions should be posted on Piazza.

If you need to reach the course staff, you can reach us at cs246-win1718-staff@lists.stanford.edu (consists of the TAs and the professor). Please don't email us individually; always use the mailing list or Piazza.