Infolab Hadoop Cluster

Beta warning

You thought we were kidding with the beta? This is our first Hadoop installation and we are still figuring things out. Stuff may break, names may change, but we cannot make the cluster better without your feedback. So... please use the cluster, join the mailing our ilcluster mailing list and let us know about the issues you encounter and about the software you wish the cluster ran.

Contents

Infolab Hadoop Cluster

The anatomy of the cluster

Name nodes:
- iln29
JobTracker nodes:
- iln29
Data nodes:
- iln30 - iln36
- 12.19 TB
TaskTracker nodes:
- iln30 - iln36
- 224 cores

Access

You will need a CSID to use the Hadoop cluster.

To submit Hadoop jobs to the cluster you need to log in to the submission node iln29.stanford.edu.

ssh your_csid@iln29.stanford.edu

Map/reduce

Here is an example of a map/reduce job that you can try running:

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar pi 10 1000000

The above example should output something that ends like this:

Job Finished in 33.276 seconds
Estimated value of Pi is 3.14158440000000000000

If running the example above throws an AccessControlException (see below), please let Andrej know about it by e-mail (don't forget to include a full copy of the output).

Number of Maps  = 10
Samples per Map = 1000000
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=snap, access=WRITE, inode="user":hdfs:supergroup:rwxr-xr-x
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
...

At this point, congratulations are in order. If you have not used Hadoop before, you just successfully ran your first Hadoop job.

HDFS

Hadoop clusters are usually running their own file system called Hadoop Distributed Filesystem (HDFS). If you run a map/reduce job on the data that you have stored in the HDFS the JobTracker will make sure that your job runs on the nodes that have your data stored locally in order to minimize network traffic.

Your HDFS home

When you start using Hadoop your HDFS home directory will be created. Actually, if you ran the example above, your home has already been created and used. By default your HDFS home is:

/user/your_csid

Let's try and list the contents of that directory (please note that the CSID used in this example is snap):

hadoop fs -ls /user/snap

Our home directory is empty (even though it was used as a staging area in the example above) so there is no output from the command. Let's try listing all og the user homes just to see who is using the HDFS:

~/$ hadoop fs -ls /user
Found 3 items
drwxr-xr-x   - akrevl supergroup          0 2012-10-17 13:23 /user/akrevl
drwxr-xr-x   - root   root                0 2012-10-08 11:51 /user/root
drwxrwxrwx   - snap   supergroup          0 2012-10-17 14:18 /user/snap

Getting data into HDFS

We have a local directory that contains some XML data that we want to run a mapred job on. Let's create a subdirectory under our HDFS home that will hold those files:

hadoop fs -mkdir xmldata

Please note that the xmldata directory will be created as a subdirectory of /user/your_csid as we have not specified an absolute path. Let's list the content of our home directory to verify it was created.

~/$ fs -ls /user/snap
Found 1 items
drwxr-xr-x   - snap supergroup          0 2012-10-17 14:21 /user/snap/xmldata

If the XML files are in the ~/xmldata directory on the local harddisk, we copy them to the HDFS with the following command:

hadoop fs -put ~/xmldata/* /user/snap/xmldata

You already know how to list the contents of a HDFS directory so we'll leave checking on the files to you.

Deleting a file

If you want to delete a file from the HDFS you can use the following command:

hadoop fs -rm filename

Let's try and get rid of the file taskcontroller.cfg that we imported to HDFS in the previous example:

~/$ /user/snap/xmldata/taskcontroller.cfg
Deleted hdfs://iln29.stanford.edu:9000/user/snap/xmldata/taskcontroller.cfg

If we want to delete a directory we can use the following command:

hadoop fs -rmr directory

HDFS commands

By now you have probably noticed that HDFS command are very similar to the commands that we use on our "regular" file systems. We just have to start every command with the prefix hadoop fs then type in a dash - and the name of the command we use for "regular" files. So instead of chmod 777 somefile we type in hadoop fs -chown 777 somefile. We can get a list of available commands by typing hadoop fs.

InfolabClusterHadoop

Menu

SNAP

Wiki