CS246
Mining Massive Data Sets
Winter 2011

Course information:

This course is the first part of a two-part sequence, CS246/CS341, replacing CS345A: Data Mining. CS246 will discuss methods and algorithms for mining massive data sets, while CS341 (Advanced Topics in Data Mining) will be a project-focused advanced class with unlimited access to a large MapReduce cluster.

Instructor:

Jure Leskovec
Office Hours: Tuesday 9-10am (or by appointment), Gates 418

Room:

Mon, Wed 9:30-10:45 in 420-041 (Jordan Hall, room 041)

Teaching assistants:

Aditya Parameswaran Office Hours: Tues 1-2pm, Gates 424

Bahman Bahmani Office Hours: Mon 12-1pm, Bytes Cafe

Peyman Kazemian Office Hours: Friday 1-2pm, Gates 352

Hyung Jin Kim

Eunjoon Cho

Staff mailing list:

You can reach us at cs246-win1011-staff@lists.stanford.edu

Prerequisites:

Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS107 or CS145 or equivalent are recommended).

Familiarity with basic probability theory (CS109 or Stat116 or equivalent is sufficient but not necessary).

Familiarity with basic linear algebra.

Materials:

Lecture notes and/or slides will be posted online. Readings are drawn from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman. Chapters 20.2, 22, and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) are also relevant. Slides from the lectures will be made available in PDF format.

Students will use the Gradiance automated homework system, for which a fee will be charged. See instructions about how to create an account.

You can see earlier versions of the notes and slides covering 2010 CS345A Data Mining and 2008/09 CS345A Data Mining. Not all of these topics will be covered this year.

Requirements:

The coursework consists of short weekly assignments (short quizzes), biweekly homeworks that also include programming (short, specific assignments using Hadoop and/or Hive), and a final exam (but no project).

Previous Versions:

The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 units, homeworks, final, no project) and CS341 (Spring, 3 units, project-focused).

Recitation Sessions:

Two recitation sessions will be held:

Due dates for assignments:

Assignment/Work    Out on    Due on
Assignment #1      Jan 5     Jan 19
Assignment #2      Jan 19    Feb 2
Assignment #3      Feb 2     Feb 18
Assignment #4      Feb 16    Mar 2

Assignments are due at midnight (23:59). More details.

Tentative Grade Breakup

The tentative grade breakup is as follows:

Course outline

See Handouts for a list of topics and reading materials.

Here is a tentative list of topics to be covered. These topics may change as the quarter progresses.

Announcements: