Stanford CS246H: Mining Massive Data Sets: Hadoop Labs (Winter 2017)

CS246H

Mining Massive Data Sets: Hadoop Labs

Winter 2017

This course gives students a practical understanding of the tools in the Hadoop ecosystem, with a focus on MapReduce and Spark. It emphasizes the practical application of big data technologies rather than the theory behind them.
This is a partner course to CS246: Mining Massive Datasets and includes limited additional assignments.
The course is adapted from the professional courses taught by Cloudera.

Announcements:

Important course information will be posted on this web page and announced in class. You are responsible for all material that appears here and should check this page for updates frequently.

1/11: The first class will be held at 11:30 on Wednesday 1/11, in Skilling Auditorium.
We look forward to seeing you there!
1/12: We are organizing a VM clinic to help students set up their VMs. Daniel Templeton will be at the session, assisted by several TAs. Time and location: Monday, January 16, 6 PM to 9 PM, in Gates 415.

Course information:

Lectures:

Wednesdays 11:30-13:20 in Skilling Auditorium

Instructors:

Daniel Templeton (daniel at cloudera dot com), Cloudera
Office Hours: By arrangement

Jure Leskovec
Office Hours: Wednesdays 9-10am, Gates InfoLab

You Will Learn to

Implement and debug complex data processing applications in Hadoop
Use some of the tools in the Hadoop ecosystem for data mining and machine learning

Topics Include

Apache Hadoop
Apache Spark
Apache Hive
Apache Impala
Apache Kafka
Other ecosystem tools, e.g. Apache Sqoop, Apache Pig, etc.

Automated Quizzes

This course will include eight weekly Gradiance quizzes to check that students are learning the concepts. Some of the quizzes will require students to complete short programming assignments to produce the answers. The Gradiance token for this class is 6A8C4765.

Lecture notes

01/11: Introduction. Slides: Introduction to Hadoop