Logistics


Content

What is this course about?

The course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for building parallel algorithms that scale to such data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Web Advertising.

Previous offerings

The previous version of the course was CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3-4 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused).

You can access class notes and slides of previous versions of the course here:
CS246 Websites: CS246: Winter 2018 / CS246: Winter 2017 / CS246: Winter 2016 / CS246: Winter 2015 / CS246: Winter 2014 / CS246: Winter 2013 / CS246: Winter 2012 / CS246: Winter 2011
CS345a Website: CS345a: Winter 2010

Prerequisites

Students are expected to have the following background:

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets


Schedule

Lecture slides will be posted here shortly before each lecture. If you wish to view slides further in advance, refer to last year's slides, which are mostly similar.

This schedule is subject to change.

Date Description Course Materials Events Deadlines
Tue Jan 8 Introduction; MapReduce and Spark
[slides]
Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
Assignment 0 out
[bundle file]
Thu Jan 10 Frequent Itemsets Mining
[slides]
Suggested Readings:
  1. Ch6: Frequent itemsets
Assignment 1 out
[bundle file]
Thu Jan 10 Recitation: Spark tutorial
[slides]
4:30-5:50PM, Skilling Auditorium
Tue Jan 15 Locality-Sensitive Hashing I
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue Jan 15 Recitation: Probability and Proof Techniques
[handout]
4:30-5:50PM, Gates B01
Thu Jan 17 Locality-Sensitive Hashing II
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.5-3.8)
Thu Jan 17 Recitation: Linear Algebra
[handout]
4:30-5:50PM, Gates B01
Tue Jan 22 Clustering
[slides]
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
Thu Jan 24 Dimensionality Reduction
[slides]
Suggested Readings:
  1. Ch11: Dimensionality Reduction (Sect. 11.4)
Assignment 2 out
[bundle file]
Assignments 0 and 1 due
Tue Jan 29 Recommender Systems I
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Thu Jan 31 Recommender Systems II
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Tue Feb 5 PageRank
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu Feb 7 Link Spam and Introduction to Social Networks
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.4)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
Assignment 3 out
[bundle file]
Assignment 2 due
Tue Feb 12 Community Detection in Graphs
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu Feb 14 Algorithms on Large Graphs
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.7-10.8)
Tue Feb 19 Large-Scale Machine Learning I
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Thu Feb 21 Large-Scale Machine Learning II
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Assignment 4 out
[bundle file]
Assignment 3 due
Tue Feb 26 Mining Data Streams I
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.1-4.3)
Thu Feb 28 Mining Data Streams II
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.4-4.7)
Tue Mar 5 Computational Advertising
[slides]
Suggested Readings:
  1. Ch8: Advertising on the Web
Thu Mar 7 Learning through Experimentation
[slides]
Suggested Readings:
  1. A Contextual-Bandit Approach to Personalized News Article Recommendation by Li, Chu, Langford, Schapire. WWW 2010.
Assignment 4 due
Tue Mar 12 Optimizing Submodular Functions
[slides]
Suggested Readings:
  1. Turning Down the Noise in the Blogosphere by El-Arini, Veda, Shahaf, Guestrin. KDD 2009.
Thu Mar 14 Review
[slides]
Mon Mar 18 6:30PM - 9:30PM in Gates 104 Alternate Final Exam
Tue Mar 19 3:30PM - 6:30PM:
if SUNetID[0] in ['a', .. 'l'] then 420-040
if SUNetID[0] in ['m', .. 'z'] then Bishop Auditorium
Final Exam