Infolab Hadoop Cluster

The anatomy of the cluster

  • Name node: ilhadoop1
  • Yarn Resource Manager: ilhadoop1
  • Data nodes:
    • ilh01 - ilh40
    • 320 TB total capacity
  • Yarn Nodes:
    • ilh01 - ilh40
    • 320 total cores
    • 2560 GB RAM

Access

You will need a CSID to use the Hadoop cluster.

You can submit your Hadoop jobs from the following nodes:

madmax
madmax2
madmax3
madmax4
madmax5

HDFS

Hadoop clusters usually run their own file system, the Hadoop Distributed File System (HDFS). If you run a map/reduce job on data stored in HDFS, the Yarn Resource Manager will make sure that your job runs on the nodes that store your data locally, in order to minimize network traffic.

Your HDFS home

Ask your admin to create an HDFS home directory for you. Your HDFS home will be:

/user/your_csid

Let's try and list the contents of that directory:

hadoop fs -ls /user/tommy

Our home directory is empty, so the command produces no output. Let's list all of the user homes just to see who is using the HDFS:

tommy@madmax5: ~$ hadoop fs -ls /user/
Found 3 items
drwx------   - hdfs       hadoop          0 2015-10-12 18:54 /user/hdfs
drwxr-xr-x   - tommy     users           0 2015-06-12 15:20 /user/tommy
drwxr-xr-x   - west1      users           0 2015-10-12 22:05 /user/west1

Getting data into HDFS

We have a local directory that contains some XML data that we want to run a map/reduce job on. Let's create a subdirectory under our HDFS home that will hold those files:

hadoop fs -mkdir xmldata

Please note that the xmldata directory will be created as a subdirectory of /user/your_csid, as we have not specified an absolute path. Let's list the contents of our home directory to verify it was created.

~/$ hadoop fs -ls /user/tommy
Found 1 items
drwxr-xr-x   - tommy users          0 2012-10-17 14:21 /user/tommy/xmldata

If the XML files are in the ~/xmldata directory on the local harddisk, we copy them to the HDFS with the following command:

hadoop fs -put ~/xmldata/* /user/tommy/xmldata

You already know how to list the contents of an HDFS directory, so we'll leave checking on the files to you.

Getting data out of HDFS

If we want to copy a file from the HDFS back to our "regular" filesystem, we need to run the following command:

hadoop fs -get hdfs_source_filename local_destination_filename

So if we have a file called results.txt in our HDFS home directory, we can copy it to our current directory with the following command:

hadoop fs -get results.txt ./

Please note that whenever we use an HDFS path that does not start with a forward slash /, the HDFS command automatically adds the prefix /user/your_csid/.
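To make that rule concrete, here is a small sketch (tommy is just the example user from the listings above) of the absolute path a relative HDFS name resolves to:

```shell
# Relative HDFS paths are resolved against /user/<your_csid>, so
#   hadoop fs -get results.txt ./
# is equivalent to
#   hadoop fs -get /user/tommy/results.txt ./   (for the example user "tommy")
csid="tommy"
rel="results.txt"
echo "/user/$csid/$rel"
```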

Deleting a file

If you want to delete a file from the HDFS you can use the following command:

hadoop fs -rm filename

If we want to delete a directory we can use the following command:

hadoop fs -rm -r directory

HDFS commands

By now you have probably noticed that HDFS commands are very similar to the commands we use on our "regular" file systems. We just start every command with the prefix hadoop fs, followed by a dash - and the name of the command we would use for "regular" files. So instead of chmod 777 somefile we type hadoop fs -chmod 777 somefile. We can get a list of available commands by typing hadoop fs.
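As a sketch of that parallel, here are a few everyday operations with the local command on the left and its HDFS twin in the comment (only the local side runs here; the hadoop fs forms are the standard subcommands):

```shell
# Local command            # HDFS equivalent
mkdir -p demo              # hadoop fs -mkdir demo
echo "hello" > demo/f.txt  # (use hadoop fs -put to get files into HDFS)
ls demo                    # hadoop fs -ls demo
cat demo/f.txt             # hadoop fs -cat demo/f.txt
rm -r demo                 # hadoop fs -rm -r demo
```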

Map/reduce

Here are a few example map/reduce jobs that you can try running.

A life of Pi

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar pi 10 1000000

The above example should output something that ends like this:

Job Finished in 33.276 seconds
Estimated value of Pi is 3.14158440000000000000

If running the example above throws an AccessControlException (see below), please let Andrej know about it by e-mail (don't forget to include a full copy of the output).

Number of Maps  = 10
Samples per Map = 1000000
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=snap, access=WRITE, inode="user":hdfs:supergroup:rwxr-xr-x
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
...

At this point, congratulations are in order. If you have not used Hadoop before, you just successfully ran your first Hadoop job.

Some XML extraction

This example expects that there are some XML configuration files on the HDFS and creates some output files. It's here just to get you going with the HDFS...

The XML files are already in the HDFS, but let's copy them to our HDFS home directory first (please replace snap with your CSID):

hadoop fs -cp /tmp/xmldata /user/snap/xmldata

Now we can run the example:

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep xmldata output 'dfs[a-z.]+'

Again note that since we are not using absolute paths, Hadoop adds /user/snap/ as a path prefix to both xmldata and output in the command above. Perhaps the command itself needs some explanation. We are telling Hadoop to use the jar file hadoop-examples. Furthermore, it should run the example called grep from the jar. The example code finds the source files (some XML configuration files) in the directory xmldata and writes the results to the directory called output. The last part is a regular expression that grep should search for.
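For intuition, what this job computes can be sketched locally with ordinary grep: count the matches of the pattern in the input (a small sample file is created here so the sketch is self-contained):

```shell
# Count occurrences of strings matching dfs[a-z.]+ in a sample file,
# mirroring what the Hadoop grep example does across its HDFS input.
printf 'dfs.name.dir\ndfs.data.dir\ndfs.audit.logger dfs.audit.logger\n' > /tmp/sample.xml
grep -oE 'dfs[a-z.]+' /tmp/sample.xml | sort | uniq -c | sort -rn
```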

Now that the map/reduce task has finished, we need to copy the files from the output directory in HDFS to a local output directory:

hadoop fs -get output/* ~/output/

Our job prints out all the strings starting with dfs sorted by how often they appeared in the files:

~/ $ cat ~/output/part-00000
2       dfs.audit.logger
1       dfs.data.dir
1       dfs.name.dir
1       dfs.replication
1       dfs.server.namenode.
1       dfsadmin

Pig

Apache Pig is a platform running on top of Hadoop that consists of a high-level language for expressing data analysis programs and the infrastructure that runs such programs.

The following example will extract user ids from a standard *nix password file.

First we need a password file in the HDFS (any file with a similar structure will do). One is already available in the HDFS at /tmp/passwd; if you want to use your own copy instead, put it there first:

hadoop fs -put passwd /tmp/passwd

The second step is to create a Pig script. Pig scripts are written in a language called Pig Latin. Let's create a file extractid.pig and put the following code into it:

   1 A = load '/tmp/passwd' using PigStorage(':');
   2 B = foreach A generate $0 as id;
   3 store B into 'id.out'; 

The first line loads our passwd file. The second line extracts the user ids and the third line writes the results to the directory called id.out.
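If you want a quick local sanity check of what the script computes, the same extraction on an ordinary file is a one-liner (a tiny passwd-style sample is built first so the sketch is self-contained):

```shell
# Take the first ':'-separated field of each line, which is exactly
# what the Pig script does with $0.
printf 'root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\n' > /tmp/passwd_sample
cut -d: -f1 /tmp/passwd_sample   # prints "root" and "bin", one per line
```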

We can run the Pig script that we have just written by running the following:

pig -x mapreduce extractid.pig

The job should create a directory called id.out in our home directory. The listing below shows how to check for it, transfer it to the local filesystem and display the result:

~/ $ hadoop fs -ls
Found 1 item
drwxrwxrwx   - snap supergroup          0 2012-10-17 17:14 /user/snap/id.out

~/ $ hadoop fs -ls id.out/
Found 2 items
-rw-r--r--   3 snap supergroup          0 2012-10-17 17:14 /user/snap/id.out/_SUCCESS
-rw-r--r--   3 snap supergroup        242 2012-10-17 17:14 /user/snap/id.out/part-m-00000

~/ $ hadoop fs -get id.out ./
~/ $ cat id.out/part-m-00000
root
bin
daemon
...

Slow execution

If you tried the examples above, you may be wondering why you would want to use a cluster that takes half a minute for a job awk can do in less than a second on your local computer. The examples are provided only to check that simple things run on the cluster; it really does not make sense to run them there, as the cluster imposes too much overhead for such small tasks. That overhead should become irrelevant for larger jobs that actually compute something useful.