CS246
Mining Massive Data Sets
Winter 2011
Course information:
This course is the first part of a two-part sequence, CS246/CS341, replacing CS345A: Data Mining.
CS246 will discuss methods and algorithms for mining massive data sets, while CS341 (Advanced Topics in Data Mining) will be a project-focused advanced class with unlimited access to a large MapReduce cluster.
Instructor:
Jure Leskovec
Office Hours: Tuesday 9-10am (or by appointment), Gates 418
Room:
Mon, Wed 9:30-10:45 in 420-041 (Jordan Hall, room 041)
Teaching assistants:
Aditya Parameswaran
Office Hours: Tues 12pm, Gates 424
Bahman Bahmani
Office Hours: Mon 12-1pm, Bytes Cafe
Peyman Kazemian
Office Hours: Friday 12pm, Gates 352
Hyung Jin Kim
Eunjoon Cho
Staff mailing list:
You can reach us at cs246win1011staff@lists.stanford.edu
Prerequisites:
Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably nontrivial computer program (e.g., CS107, CS145, or equivalent is recommended).
Familiarity with basic probability theory (CS109, Stat116, or equivalent is sufficient but not necessary).
Familiarity with basic linear algebra.
Materials:
Lecture notes and/or slides will be posted online. Readings are drawn from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman. You will also find Chapters 20.2, 22, and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant. Slides from the lectures will be made available in PDF format.
Students will use the Gradiance automated homework system, for which a fee will be charged. See the instructions on how to create an account.
You can see earlier versions of the notes and slides from 2010 CS345a Data Mining and 2008/09 CS345a Data Mining. Not all of these topics will be covered this year.
Requirements:
The coursework will consist of short weekly assignments (short quizzes), biweekly homeworks that also include programming (short, specific assignments using Hadoop and/or Hive), and a final exam (but no project).
Previous Versions:
The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 units, homeworks, final, no project) and CS341 (Spring, 3 units, project-focused).
Recitation Sessions:
Two recitation sessions will be held:

Working with Hadoop

Review of basic math concepts: linear algebra (eigenvalues), basic probability, maximum likelihood, gradient descent, basic limits and bounds. (This recitation session is only intended as a refresher of the material. You are still expected to have taken courses corresponding to this material.)
Due dates for assignments:
Assignment/Work   Out on   Due on
Assignment #1     Jan 5    Jan 19
Assignment #2     Jan 19   Feb 2
Assignment #3     Feb 2    Feb 18
Assignment #4     Feb 16   Mar 2
Assignments are due at midnight (23:59). More details.
Tentative Grade Breakup
The tentative grade breakup is as follows:
 Final: 50%
 Gradiance: 20%
 Homeworks: 30%
Course outline
See Handouts for a list of topics and reading materials.
Here is a tentative list of topics to be covered. These topics may change as the quarter progresses.
 Introduction, MapReduce
 Association Rules: Frequent itemsets and Association rules
 Near Neighbor Search in High Dimensional Data
 Locality Sensitive Hashing (LSH)
 Dimensionality reduction: SVD and CUR
 Recommendation Systems
 Clustering
 Link Analysis: link prediction
 Personalized PageRank, Hubs and Authorities
 Web spam and TrustRank
 Proximity on Graphs: Random Walks with Restarts and Link prediction
 Large scale supervised machine learning (1): k-nearest neighbor, Perceptron
 Large scale supervised machine learning (2): Classification and regression trees
 Large scale supervised machine learning (3): Support Vector Machines
 Mining data streams
 Mining the Web for Structured Data, Relation extraction
 Web Advertising
Announcements:
 1/3: The first class will be held on Monday 1/3 in Jordan Hall Room 041. See you there!
 1/4: The first Gradiance homework is up. See the instructions on how to create an account.
 1/5: Recitations: Friday 1/7, 5-7pm: review of basic concepts of linear algebra, probability and statistics. Tuesday 1/11, 5-7pm: Hadoop Q&A session. Location will be announced later!
 1/6: Homework1 is up. Due Jan 19 at 23:59.
 1/7: Recitations: Today, Friday 01/07: Basic probability and statistics, 5-7 pm in 420-041 (the same room where we have the class). Slides
 1/10: Frequent Itemsets Gradiance homework is out. Due 2011-01-17 at 23:59.
 1/11: Recitations: Hadoop, Tuesday 1/11, 5-7 pm in Gates B12.
 1/11: Hadoop session: Slides, Example, Hadoop Installation Instructions
 1/22: Homework2 is up. Due Feb 2 at 23:59.
 1/22: Homework3 is up. Due Feb 18 at 23:59.
 2/20: The final exam will be held in HEWLETT 201 on 3/16/2011 from 8:30-11:30 AM.
 2/24: Homework4 is up. Due March 9 at 23:59.