Mining Massive Data Sets
Winter 2011

This course is the first part of a two-part sequence, CS246/CS341, replacing CS345A: Data Mining. CS246 will discuss methods and algorithms for mining massive data sets, while CS341 (Advanced Topics in Data Mining) will be a project-focused advanced class with unlimited access to a large MapReduce cluster.


Jure Leskovec
Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (CS107, CS145, or equivalent is recommended).

Familiarity with basic probability theory (CS109, Stat116, or equivalent is sufficient but not necessary).

Familiarity with basic linear algebra.


Lecture notes and/or slides will be posted online, with the lecture slides made available in PDF format. Readings are drawn from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman. You will also find Section 20.2 and Chapters 22 and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant.

Students will use the Gradiance automated homework system, for which a fee will be charged. See the instructions on how to create an account.

Earlier versions of the notes and slides are available from 2010 CS345A Data Mining and 2008/09 CS345A Data Mining. Not all of these topics will be covered this year.


The coursework will consist of short weekly quizzes, biweekly homeworks that also include programming (short, specific assignments using Hadoop and/or Hive), and a final exam (but no project).

The previous version of the course was CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 units, homeworks, final, no project) and CS341 (Spring, 3 units, project-focused).

