Mining Massive Data Sets
This course is the first part of a two-part sequence, CS246/CS341, replacing CS345A: Data Mining.
CS246 will discuss methods and algorithms for mining massive data sets, while CS341 (Advanced Topics in Data Mining) will be a project-focused advanced class with unlimited access to a large MapReduce cluster.
Office Hours: Tuesday 9-10am (or by appointment), Gates 418
Mon,Wed 9:30-10:45 in 420-041 (Jordan Hall, room 041)
Office Hours: Tues 1-2pm, Gates 424
Office Hours: Mon 12-1pm, Bytes Cafe
Office Hours: Friday 1-2pm, Gates 352
Hyung Jin Kim
Staff mailing list:
You can reach us at firstname.lastname@example.org
Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (CS107, CS145, or equivalent is recommended).
Familiarity with basic probability theory (CS109, Stat116, or equivalent is sufficient but not necessary).
Familiarity with basic linear algebra.
Lecture notes and/or slides will be posted online. Readings have been derived from the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman. You will also find Chapters 20.2, 22, and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant. Slides from the lectures will be made available in PDF format.
Students will use the Gradiance automated homework system, for which a fee will be charged. See the instructions on how to create an account.
You can see earlier versions of the notes and slides covering 2010 CS345a Data Mining and 2008/09 CS345a Data Mining. Not all of these topics will be covered this year.
The coursework will consist of short weekly assignments (short quizzes), biweekly homeworks that also include programming (short, specific assignments using Hadoop and/or Hive), and a final exam (but no project).
The previous version of the course is CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).
Two recitation sessions will be held:
Working with Hadoop
Revision of basic math concepts: linear algebra (eigenvalues), basic probability, maximum likelihood, gradient descent, basic limits and bounds. (This recitation session is intended only as a refresher of the material. You are still expected to have taken courses corresponding to this material.)
Due dates for assignments:
Assignments are due at midnight (23:59). More details.
Tentative Grade Breakdown
The tentative grade breakdown is as follows:
- Final: 50%
- Gradiance: 20%
- Homeworks: 30%
See Handouts for a list of topics and reading materials.
Here is a tentative list of topics to be covered. These topics may change as the quarter progresses.
- Introduction, MapReduce
- Association Rules: Frequent itemsets and Association rules
- Near Neighbor Search in High Dimensional Data
- Locality Sensitive Hashing (LSH)
- Dimensionality reduction: SVD and CUR
- Recommendation Systems
- Link Analysis -- link prediction
- Personalized PageRank, Hubs and Authorities
- Web spam and TrustRank
- Proximity on Graphs: Random Walks with Restarts and Link prediction
- Large scale supervised machine learning (1): k-nearest neighbors
- Large scale supervised machine learning (2): Classification and regression trees
- Large scale supervised machine learning (3): Support Vector Machines
- Mining data streams
- Mining the Web for Structured Data, Relation extraction
- Web Advertising
- 1/3: The first class will be held on Monday 1/3 in Jordan Hall Room 041. See you there!
- 1/4: First Gradiance homework is up. See the instructions on how to create an account.
- 1/5: Recitations: Friday 1/7 5-7pm: review of basic concepts of linear algebra, probability and statistics. Tuesday 1/11 5-7pm: Hadoop Q&A session. Location will be announced later!
- 1/6: Homework-1 is up. Due Jan 19 at 23:59.
- 1/7: Recitations: Today Friday 01/07: Basic probability and statistics, 5-7 pm at 420-041 (same room where we have the class). slides
- 1/10: Frequent Itemsets Gradiance homework is out. Due 2011-01-17 23:59.
- 1/11: Recitations: Hadoop, Tuesday 1/11, 5-7 pm in Gates B12.
- 1/11: Hadoop session: Slides, Example, Hadoop Installation Instructions
- 1/22: Homework-2 is up. Due Feb 2 at 23:59.
- 1/22: Homework-3 is up. Due Feb 18 at 23:59.
- 2/20: The final exam will be held at HEWLETT201 on 3/16/2011 from 8:30 to 11:30 AM.
- 2/24: Homework-4 is up. Due March 9 at 23:59.