Infolab Hadoop Cluster

Beta warning

You thought we were kidding with the beta? This is our first Hadoop installation and we are still figuring things out. Stuff may break, names may change, but we cannot make the cluster better without your feedback. So... please use the cluster, join our ilcluster mailing list, and let us know about the issues you encounter and about the software you wish the cluster ran.

The anatomy of the cluster

  • Name nodes:
    • iln29
  • JobTracker nodes:
    • iln29
  • Data nodes:
    • iln30 - iln36
    • 12.19 TB
  • TaskTracker nodes:
    • iln30 - iln36
    • 224 cores

Access

You will need a CSID to use the Hadoop cluster.

To submit Hadoop jobs to the cluster you need to log in to the submission node iln29.stanford.edu.

ssh your_csid@iln29.stanford.edu
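
Once you are logged in, a quick sanity check that the Hadoop client tools are available is to ask for the version (the exact output will depend on the installed release):

hadoop version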

HDFS

Hadoop clusters usually run their own file system, the Hadoop Distributed File System (HDFS). If you run a map/reduce job on data that you have stored in HDFS, the JobTracker will make sure that your job runs on nodes that have your data stored locally, in order to minimize network traffic.
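
If you are curious about the overall state of HDFS (live data nodes, configured and remaining capacity), the standard dfsadmin report gives a read-only summary; this is optional and just for orientation:

hadoop dfsadmin -report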

Your HDFS home

When you start using Hadoop, your HDFS home directory will be created automatically. In fact, if you have already run one of the map/reduce examples below, your home has already been created and used. By default your HDFS home is:

/user/your_csid

Let's try and list the contents of that directory (please note that the CSID used in this example is snap):

hadoop fs -ls /user/snap

Our home directory is empty, so there is no output from the command (unless you have already run one of the examples below, which use it as a staging area). Let's try listing all of the user homes just to see who is using the HDFS:

~/$ hadoop fs -ls /user
Found 3 items
drwxr-xr-x   - akrevl supergroup          0 2012-10-17 13:23 /user/akrevl
drwxr-xr-x   - root   root                0 2012-10-08 11:51 /user/root
drwxrwxrwx   - snap   supergroup          0 2012-10-17 14:18 /user/snap

Getting data into HDFS

We have a local directory that contains some XML data that we want to run a map/reduce job on. Let's create a subdirectory under our HDFS home that will hold those files:

hadoop fs -mkdir xmldata

Please note that the xmldata directory will be created as a subdirectory of /user/your_csid, as we have not specified an absolute path. Let's list the contents of our home directory to verify that it was created.

~/$ hadoop fs -ls /user/snap
Found 1 items
drwxr-xr-x   - snap supergroup          0 2012-10-17 14:21 /user/snap/xmldata

If the XML files are in the ~/xmldata directory on the local hard disk, we copy them into HDFS with the following command:

hadoop fs -put ~/xmldata/* /user/snap/xmldata

You already know how to list the contents of an HDFS directory, so we'll leave checking on the files to you; a couple of handy commands are sketched below.
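
For instance, the following standard subcommands list the uploaded files and show how much space they take up (note that the exact -du output format can differ between Hadoop releases):

hadoop fs -ls xmldata
hadoop fs -du xmldata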

Getting data out of HDFS

If we want to copy a file from the HDFS back to our "regular" filesystem, we need to run the following command:

hadoop fs -get hdfs_source_filename local_destination_filename

So if we have a file called results.txt in our HDFS home directory, we can copy it to our current directory with the following command:

hadoop fs -get results.txt ./

Please note that whenever we use an HDFS path that does not start with a forward slash (/), the HDFS commands automatically add the prefix /user/your_csid.
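
You also do not have to copy a file out just to look at it; the following standard subcommands print an HDFS file straight to your terminal (-cat prints the whole file, -tail just the last kilobyte), shown here with the results.txt file from the example above:

hadoop fs -cat results.txt
hadoop fs -tail results.txt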

Deleting a file

If you want to delete a file from the HDFS you can use the following command:

hadoop fs -rm filename

Let's try and get rid of the file taskcontroller.cfg that we imported to HDFS in the previous example:

~/$ hadoop fs -rm /user/snap/xmldata/taskcontroller.cfg
Deleted hdfs://iln29.stanford.edu:9000/user/snap/xmldata/taskcontroller.cfg

If we want to delete a directory we can use the following command:

hadoop fs -rmr directory
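
One place where -rmr comes in handy: the example map/reduce jobs below will refuse to start if their output directory already exists, so before re-running one of them you first remove the old output directory (output here is just the directory name used in the examples below):

hadoop fs -rmr output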

HDFS commands

By now you have probably noticed that HDFS commands are very similar to the commands we use on our "regular" file systems: we just start every command with the prefix hadoop fs, then type a dash (-) and the name of the command we would use for "regular" files. So instead of chmod 777 somefile we type hadoop fs -chmod 777 somefile. We can get a list of available commands by typing hadoop fs.
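
A few more examples of the same pattern; all of these are standard hadoop fs subcommands, and the paths are just placeholders to replace with your own:

hadoop fs -mv olddir newdir
hadoop fs -cp sourcefile destfile
hadoop fs -du somedir
hadoop fs -chmod 644 somefile
hadoop fs -help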

Map/reduce

Here are a few example map/reduce jobs that you can try running.

A life of Pi

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar pi 10 1000000

The above example should output something that ends like this:

Job Finished in 33.276 seconds
Estimated value of Pi is 3.14158440000000000000
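
The two arguments after pi are the number of map tasks and the number of samples per map. If you are willing to wait a little longer for a better estimate, you can simply increase them, for example:

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar pi 100 10000000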

If running the example above throws an AccessControlException (see below), please let Andrej know about it by e-mail (don't forget to include a full copy of the output).

Number of Maps  = 10
Samples per Map = 1000000
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=snap, access=WRITE, inode="user":hdfs:supergroup:rwxr-xr-x
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
...

At this point, congratulations are in order. If you have not used Hadoop before, you just successfully ran your first Hadoop job.

Some XML extraction

This example expects that there are some XML configuration files in HDFS and creates some output files. It's here just to get you going with the HDFS...

The XML files are already in the HDFS, but let's copy them into the xmldata directory under our HDFS home first, so that the job below finds them there (please replace snap with your CSID):

hadoop fs -cp /tmp/xmldata /user/snap/xmldata

Now we can run the example:

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep xmldata output 'dfs[a-z.]+'

Again note that since we are not using absolute paths, Hadoop adds /user/snap/ as a path prefix to both xmldata and output in the command above. Perhaps the command itself needs some explanation: we are telling hadoop to use the jar file hadoop-examples, and within it to run the example called grep. The example code reads its source files (some XML configuration files) from the directory xmldata and writes its results to the directory called output. The last part is the regular expression that grep should search for.

Now that the map/reduce task has finished, we need to copy the files from the output directory in HDFS to a local output directory:

hadoop fs -get output/* ~/output/
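
If the -get command above complains about the local destination (it may require ~/output to be an existing directory when copying several files), create the directory first with a plain local command and re-run the copy:

mkdir -p ~/output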

Our job prints out all the strings starting with dfs sorted by how often they appeared in the files:

~/ $ cat ~/output/part-00000
2       dfs.audit.logger
1       dfs.data.dir
1       dfs.name.dir
1       dfs.replication
1       dfs.server.namenode.
1       dfsadmin
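
If you would like to try one more job, the same examples jar also contains the classic wordcount example; the sketch below counts word occurrences in the xmldata directory and writes the counts to a new directory (wordcount-output here is just an arbitrary, not-yet-existing directory name), then prints the result straight from HDFS:

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar wordcount xmldata wordcount-output
hadoop fs -cat wordcount-output/part-*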