Course Information

Course description

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on Map Reduce as a tool for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Relation extraction and Web Advertising.

CS246 is the first part in a two part sequence CS246--CS341. CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with an unlimited access to a large MapReduce cluster.

Course outline

Tentative list of topics to be covered. These topics may change as the quarter progresses.

Introduction and MapReduce
Association Rules: Frequent itemsets and Association rules
Near Neighbor Search in High Dimensional Data
Locality Sensitive Hashing (LSH)
Dimensionality reduction: SVD and CUR
Recommendation Systems
Clustering
Link Analysis: Personalized PageRank, Hubs and Authorities
Web spam and TrustRank
Proximity search on Graphs: Random Walks with Restarts
Large scale supervised machine learning (1): k-nearest neighbor, Perceptron
Large scale supervised machine learning (2): Classification and regression trees
Large scale supervised machine learning (3): Support Vector Machines
Mining data streams
Mining the Web for Structured Data, Relation extraction
Web Advertising

See Handouts for a list of topics and reading materials.

Prerequisites

Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program/algorithm (e.g., CS107, CS161 or CS145 or equivalent are recommended).

Familiarity with the basic probability theory. (CS109 or Stat116 or equivalent is sufficient but not necessary).

Familiarity with the basic linear algebra (any one of Math 51, Math 103, Math 113, or CS 205 would be much more than necessary).

The recitation sessions in the first weeks of the class will give the overview of the expected background.

Course materials

Lecture notes and/or slides will be posted on-line. Readings have been derived from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

You can see earlier versions of the notes and slides for the Winter 2011 version of the course. Note there may be a slight change in the topics covered this year.

Course handouts and other reading materials can be downloaded here.

Course work

The coursework for the course will consist of:

Gradiance quizzes: Short weekly Gradiance quizzes
Homeworks: Four biweekly homeworks that include programming
Final exam

Gradiance quizzes

With regard to the weekly quizzes on Gradiance. Here are the instructions:

Go to http://www.newgradiance.com/services
Register and use the class token 75677A64
There is a MapReduce quiz assigned (see under the Homeworks tab) with 4 quick questions.
To answer some of them you will need access to the Mining Massive Datasets book
You have 1 week to complete the quiz. (no late days!)

You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each of the questions involves a "long-answer" problem, which you should work. The Gradiance system gives you random right and wrong answers each time you open it, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also notice that you have to wait 10 minutes between openings, so brute-force random guessing will not work.

Solutions appear after the problem-set is due. However, you must submit at least once, so your most recent solution appears with the solutions embedded.

Homeworks

Four biweekly homeworks that will involve coding, working with Hadoop, as well as regular numerical/algebraic theory problems.

Questions: We try very hard to make questions unambiguous, but some ambiguities may remain. Ask (i.e., post a question on Piazza) if confused or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.

Honor code: Students may discuss and work on homework problems in groups. However, each student must write down the solutions independently, and without referring to written notes from the joint session. In other words, each student must understand the solution well enough in order to reconstruct it by him/herself. In addition, each student should write on the problem set the set of people with whom s/he collaborated.

Further, since we occasionally reuse problem set questions from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year.

Late assignments: Assignments will be due in class (9:30am) on Fridays. Each student will have a total of seven free late (calendar) days to use for homeworks, reaction papers, project proposals and project milestones. Once these late days are exhausted, any assignments turned in late will be penalized 20% per late day. However, no assignment will be accepted more than three days after its due date. Each 24 hours or part thereof that a homework is late uses up one full late day.

Assignment submission: Assignments will be due in class (9:30am) on Fridays. You need to submit paper as well as the electronic version of the assignment.

For each assignment you need to upload two files:

Your solution report. A single PDF file. You do not need to typeset the document. (It can be handwritten/scanned.)
Any source code that you use towards obtaining your results stated in your solution report should be submitted as a single ZIP file. Also include the mention of any tools whichever you use towards your solution(s).

Refer to Frequently Asked questions on how to submit the assignments.

Important Dates

Assignment	Out on	Due on
Assignment #1	January 11	January 27
Assignment #2	January 27	February 10
Assignment #3	February 10	February 24
Assignment #4	February 24	March 9
Final exam	March 19 at 8:30am

Previous versions

The previous version of the course is CS345A: Data Mining which also included a course project. CS345A has now been split into two courses CS246 (Winter, 3 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).

CS246 was first offered in Winter 2011. Here is the course webpage with all the materials.

Recitation sessions

Two recitation sessions will be held:

Revision of basic math concepts: linear algebra -- eigenvalues, basic probability, maximum likelihood, gradient descent, basic limits and bounds. The recitation sessions is only intended to be a refresher of the material. You are still expected to have taken courses corresponding to this material.
Working with Hadoop

Grading

The tentative grade breakup is as follows:

Gradiance: 20%
Homeworks: 40%
Final: 40%

Communication:

General course questions should be posted Piazza.

Piazza requires @stanford.edu email address to register. If you do not have @stanford.edu address, send us email with your email address and we will add you to Piazza.

If you need to reach the course staff, you can reach us at cs246-win1112-staff@lists.stanford.edu (consists of the TAs and the professor).