What's all the hubbub? It's distributed: data blocks are spread across multiple nodes, and the resource manager/scheduler/whatever-you-want-to-call-it knows where the data lives, so work gets shipped to the data instead of the other way around. Which means we can "grep" through a 50TB dataset in about half an hour. Cool, right?
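For the curious, here's roughly what that distributed "grep" looks like. This is a sketch using the example jar that ships with Hadoop MapReduce; the jar's path and the input/output paths are assumptions, so adjust them for your install and your HDFS home directory.

```shell
# Distributed grep using the bundled examples jar.
# NOTE: the jar path below is an assumption -- check where your
# install puts hadoop-mapreduce-examples.jar.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    grep /user/tommy/bigdataset /user/tommy/grep-out 'needle'

# Matches (with counts) land in the output directory on HDFS:
hadoop fs -cat /user/tommy/grep-out/part-r-00000
```

The scheduler runs the matching on whichever nodes already hold the data blocks, which is where the speedup comes from.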
How do I get access?
You'll need a CSID and a home directory on HDFS. You probably already have your CSID (if you don't, congrats on reading this far anyway). Your sysadmin can take care of the home directory (if you ask nicely).
Where do I? How do I?
These are the nodes that have the Hadoop packages installed:
madmax madmax2 madmax3 madmax4 madmax5
Here's how you list the contents of your HDFS home directory:
hadoop fs -ls /user/tommy
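A few more everyday `hadoop fs` commands, sketched with the same example home directory (swap in your own CSID for `tommy`):

```shell
hadoop fs -mkdir /user/tommy/data               # make a directory on HDFS
hadoop fs -put localfile.txt /user/tommy/data/  # copy local -> HDFS
hadoop fs -cat /user/tommy/data/localfile.txt   # print a file's contents
hadoop fs -get /user/tommy/data/localfile.txt . # copy HDFS -> local
hadoop fs -rm -r /user/tommy/data               # remove recursively
```

These mirror the usual Unix commands, just prefixed with `hadoop fs -`.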
Here's how you submit a job to the cluster:
hadoop jar <jarfile> <param1> <param2> ...
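As a concrete example, here's the classic WordCount demo from the examples jar that ships with Hadoop. The jar path is an assumption (it varies by install), and the input directory is assumed to already exist on HDFS.

```shell
# Run WordCount over files in /user/tommy/input; output must not
# already exist. The jar path is an assumption -- locate yours first.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/tommy/input /user/tommy/wc-out

# Peek at the results (reducer output is part-r-* by default):
hadoop fs -cat /user/tommy/wc-out/part-r-00000 | head
```

You can watch the job's progress in the Application Tracker linked below.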
More Hadoop info
Examples and usage: InfolabClusterHadoop
Current cluster status: ilHadoopStatus
Current cluster statistics: ilHadoopStats
HDFS info: http://ilhadoop1.stanford.edu:50070/dfshealth.html
Application Tracker: http://ilhadoop1.stanford.edu:8088/cluster