Hadoop cluster

What's all the hubbub? Hadoop is distributed: it spreads the data blocks across multiple nodes, and the resource manager (or scheduler, or whatever you want to call it) knows where each block lives, so it can run the work on the nodes that already hold the data. That means we can "grep" through a 50TB dataset in about half an hour. Cool, right?

How do I get access?

You'll need a CSID and a home directory on HDFS. You probably already have your CSID (if you don't, congrats on reading this far anyway). Your sysadmin can take care of the home directory (if you ask nicely).

Where do I? How do I?

These are the nodes that have the Hadoop packages installed:

madmax
madmax2
madmax3
madmax4
madmax5

Here's how you list the contents of your HDFS home directory:

hadoop fs -ls /user/tommy
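
If your HDFS home directory is named after your login name (an assumption; ask your sysadmin if yours isn't), you don't need to type the path by hand. A minimal sketch, which only contacts the cluster when the hadoop client is actually installed on the machine you're on:

```shell
# Assumption: the HDFS home directory is /user/<login name>,
# following the same pattern as the /user/tommy example.
hdfs_home="/user/$(id -un)"
echo "HDFS home: $hdfs_home"

# Only talk to the cluster if the hadoop client is available,
# i.e. you are logged in to one of the nodes listed earlier.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls "$hdfs_home"   # explicit path
    hadoop fs -ls                # no path: defaults to your HDFS home directory
fi
```

Both listings should show the same thing; if the second one complains that the directory doesn't exist, your home directory probably hasn't been created yet.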

Here's how you submit a job to the cluster:

hadoop jar <jarfile> <param1> <param2> ...
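
To make that concrete, here is a rough sketch of a distributed "grep" using the examples jar that ships with stock Hadoop distributions. The jar location, the input and output directories, and the search pattern are all placeholders, not values taken from this cluster, so adjust them before running anything:

```shell
# Hypothetical values: change all of these for your own job.
examples_jar="${EXAMPLES_JAR:-hadoop-mapreduce-examples.jar}"  # location varies by install
input_dir="/user/$(id -un)/input"      # HDFS directory holding the data to search
output_dir="/user/$(id -un)/grep-out"  # HDFS directory for results; must not exist yet
pattern='ERROR'                        # regular expression to search for

# Same shape as: hadoop jar <jarfile> <param1> <param2> ...
# The first parameter selects the "grep" program inside the examples jar.
echo hadoop jar "$examples_jar" grep "$input_dir" "$output_dir" "$pattern"

# Run it only when the hadoop client and the jar are actually present.
if command -v hadoop >/dev/null 2>&1 && [ -f "$examples_jar" ]; then
    hadoop jar "$examples_jar" grep "$input_dir" "$output_dir" "$pattern"
    # The glob is quoted so that hadoop, not the local shell, expands it.
    hadoop fs -cat "$output_dir/*"
fi
```

The job refuses to start if the output directory already exists, so either pick a fresh name each time or remove the old one first with hadoop fs -rm -r.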

More Hadoop info