Revision 2 as of 2012-10-17 18:54:06

Clear message
Locked History Actions

InfolabClusterHadoop

Infolab Hadoop Cluster

Beta warning

You thought we were kidding with the beta? This is our first Hadoop installation and we are still figuring things out. Stuff may break, names may change, but we cannot make the cluster better without your feedback. So... please use the cluster, join the mailing our ilcluster mailing list and let us know about the issues you encounter and about the software you wish the cluster ran.

Access

You will need a CSID to use the Hadoop cluster.

To submit Hadoop jobs to the cluster you need to log in to the submission node iln29.stanford.edu. Use your CS credentials to log in.

ssh your_csid@iln29.stanford.edu

The anatomy of the cluster

  • Name nodes:
    • iln29
  • JobTracker nodes:

    • iln29
  • Data nodes:
    • iln30 - iln36
    • 12.19 TB
  • TaskTracker nodes:

    • iln30 - iln36
    • 224 cores

Job scheduling

All the jobs are submitted with Torque resource manager and are scheduled by the MAUI scheduler. Please do not log in to the nodes directly and run jobs from there.

Torque used to be called PBS, so if you see any resources talking about the PBS resource manager those more or less apply to Torque as well. Also please excuse us if we use PBS and Torque interchangeably.

Qsub

qsub is the main command that submits your job to the cluster. The command uses the following syntax:

qsub -V script_file

So if I have a script called runjob.sh that I would like to run on a cluster I can do so by executing the following:

qsub -V myjob.sh

script_file should be a text file

The script_file should contain the name and the path to your executable file and extra instructions that tell the resource manager how to run your job. Don't worry, we'll talk more about those later.

script_file must not be binary/executable file

Never use qsub to submit a binary executable to the resource manager. This will result in a successful job submission, but the runner that is the job is assigned to will fail to execute it with a "Cannot execute a binary file" error.

Resource manager directives

These directives tell the resource manager how to run your job. All of the directives start with a pound character (#) immediately followed by the keyword PBS:

#PBS -directive options

Name

This directive tells the resource manager which name to use for your job. If you do not specify it, the name of your submission script will be used.

#PBS -N InfolabClusterTutorial

Standard output

Since you never know which server your program will run on once you submit it to the cluster, the resource manager will deposit the standard output and standard error streams to a set of files in the directory where your submission script ran from.

By default the resource manager will redirect all standard output of a job to a file named jobname.ojobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.sh.o711. If you provided the Name directive discussed in the previous section, then your default standard output will be saved to InfolabClusterTutorial.o711.

You can override this behavior by using the -o directive:

#PBS -o /dfs/rulk/0/mydir/myjob.out

This will save all the standard output to the file /dfs/rulk/0/mydir/myjob.out. Please note that the file will be overwritten if you run the job more than once.

Error output

By default the resource manager will redirect all output to standard error of a job to a file named jobname.ejobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.e711. If you provided the Name directive discussed in one of the previous sections, then your default standard error stream will be saved to InfolabClusterTutorial.e711.

You can override this behavior by using the -e directive:

#PBS -e /dfs/rulk/0/mydir/myjob.error

This will save the standard error stream to the file /dfs/rulk/0/mydir/myjob.error. Please note that the file will be overwritten if you run the job more than once.

Mail directive

This directive tells the resource manager to send you an e-mail when your job is started and when it is finished. In a cluster environment your job may not start immediately as it depends on the other jobs that are currently in cluster's queues. The following will send you and e-mail both when the job starts executing and when it finishes.

#PBS -m be

Please note that the e-mail will not be delivered to your main CS account, but rather to the local mail queue on the submission node (you could set up forwarding, but that should be a topic of another wiki page).

Parallel jobs

You may specify that you want your job to run on multiple cores and multiple nodes with the following directive:

#PBS -l nodes=node_no:ppn=core_no

In the example above the node_no represents the number of nodes (physical servers) that you are requesting and the core_no represents the number of cores that you would like to use on each of the nodes. If you would like to use 6 cores on a single node you could do it with the following directive:

#PBS -l nodes=1:ppn=6

Here is another example requesting two nodes with 32 cores each:

#PBS -l nodes=2:ppn=32

Parallel jobs

Please bear in mind that qsub will not to anything to make your job parallel. That is why you should only make requests for more than one core if your program is multi core or multi thread capable. If your program is not written in a parallel manner it will only run on a single core and your 32-core reservation will just waste system resources for others.

Number of requested cores

Please do not make requests that the cluster is not able to handle. If you submit a job with the directive -l nodes=1:ppn=128 this job will actually never run on the current configuration of the cluster as we do not have nodes with 128 cores. Please consult the cluster's hardware capabilities before using this directive.

Running time

This directive lets you specify a maximum walltime (sum of CPU time and wait time) that can be used by your job. This may be useful in a situation where you know your job should run no longer than 2 hours and if it runs longer then something went wrong. You can specify such a limit with the following directive:

#PBS -l walltime=02:00:00

You do not have to specify a maximal walltime in that case your job will run eternally... unless the cluster crashes... and it may be interrupted by shorter running jobs.

Wall time format

You should always specify the wall time in HH:MM:SS format. If you were to write walltime=120:00 your program would get killed after 2 hours of work as the setting is read as 120 minutes, 0 seconds.

An example submission script

In the following example we do not actually call some binary of our own, we just run a few standard commands and exit. Since the submission script is nothing more than a regular shell script, the example should print out what host it is running on to our standard output file.

#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

echo "I am running on:"
hostname
sleep 20

Paths

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs for your scripts, your programs and the datafiles needed for your job. You should also make sure to always use a full (absolute) path specification.

This means that using ./myjob to run your program from a submission script in your home directory is a bad idea. You should all it like this:

/afs/cs.stanford.edu/u/your_csid/myjob

You can save yourself some typing by using environment variables. You could use $HOME/myjob in the example above. If you decide to use environment variables, make sure that you run qsub with the -V parameter as we are showing you throughout this tutorial. The -V parameter makes sure that the environment variables are available to the submission script.

What is mounted under /dfs

  • /dfs/hulk/0 points to /lfs/hulk/0 on hulk.stanford.edu

  • /dfs/rulk/0 points to /lfs/rulk/0 on rulk.stanford.edu

  • /dfs/rocky/0 points to /lfs/rocky/0 on rocky.stanford.edu

  • /dfs/hulk/0 points to /lfs/hulk/0 on hulk.stanford.edu

  • /u points to /afs/cs.stanford.edu/u and contains user home directories

Passing CLI arguments

You may want to pass some arguments to the program that you want to run on the cluster. This is problematic as you would actually be passing the arguments to the qsub program instead of your own program.

You have probably already noticed the first workaround in the text above. We are always running qsub with the -V switch. This switch tells qsub to pass all the environment variables that are currently available to the environment from which the submitted script will run. That is why we told you it is OK, to use the $HOME variable when we were discussing paths.

The other workaround is to list the variables that need to be available in the program execution environment with the -v switch. If you wanted the variable $MYNAME to contain the value Alice, you would call qsub like this:

qsub -v MYNAME="Alice" myjob.sh

Please see InfolabClusterComputeHowtoVariables for a more detailed example.

Queues

There is only one queue available on the compute cluster at the moment. This is bound to change once the cluster is used more heavily and we can make better sense of what is needed.

The default queue is called test and it allows up to 35,000 jobs to be queued and up to 1,200 jobs to run simultaneously.

Qstat

The qstat command enables you to check in on your job. You run it with your job ID number:

qstat job_id

If you want to know the status of the job number 4652 you can issue the command:

qstat 4652

And the resource manager's reply might look a little something like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 Q test

Or like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 R test

The difference in the two outputs shown is the status (S) column, that has the value of Q in the first output and the value of R in the second output. Those mean that are job is queued in the first output and that our job is running in the second output.

If you run the qstat command with the -f switch you will get more detailed data about yout job:

qstat -f 4652

Qshow

As an alternative to qstat you can use the showq command. Note however that this command is somewhat sensitive to the condition of the cluster and may report a timeout even though everything is running fine on the cluster.

You can invoke the showq by running:

showq -u your_csid

And the response should be similar to:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4654                 akrevl    Running     1    00:01:00  Tue Oct 16 17:47:43

     1 Active Job        1 of  896 Processors Active (0.11%)
                         1 of   28 Nodes Active      (3.57%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

showq will display a list of active, idle and blocked jobs by default.

  • active jobs are the jobs that are currently running on the cluster,

  • idle are the jobs that are in the queue and ready to run but are still waiting for free resources,

  • blocked are the jobs that could not complete or cannot run on this cluster (typically a job will go into this state if it was running one one of the cluster nodes that happened to fail at that time).

Job Arrays

This section could also be subtitled How to submit a bunch of jobs and not crash the cluster in the process. So how does it need to be done? The obvious solution probably does not work, otherwise this section would not exist in the wiki.

Do not use loops

Running for job in {1..1000}; do qsub myjob.sh; done is a bad, bad, bad thing. It will probably crash the cluster. Sorry, that's the way it is. The software seems to be on the sensitive size.

If you cannot avoid running the above, then insert a sleep 1 after each qsub. This has been tested and while it is not a preferred way of running multiple jobs it at least does not kill the cluster... as quickly...

The right way of submitting a lot of jobs is to use job arrays. Let's say we want to submit a 100 instances of our program and we have already prepared a submission script called myjob.sh. We can submit all 100 jobs by issuing the following command:

qsub -V -t 0-99 myjob.sh

The cluster should reply with the following:

4660[].ilhead1.stanford.edu

Notice the square brackets following the job id. Those indicate that this is actually an array of jobs. Please note that the qstat command will also display only a single entry. showq on the other hand displays all of the jobs. You can use qstat to get more information if you specify an exact element of an array. If we want the full details of the 2nd job in the array with the id 4663, we can issue the following command:

qstat -f 4663[1]

Please see this howto for additional information on how to submit job arrays: InfolabClusterComputeHowtoJobArray.

Output files

Running a large job array will create a number of output files called JobName.eJobId-ElementId for standard error and JobName.oJobId-ElementId for standard output. Check them out once your job run is finished but make sure you delete them afterwards (rm *.{e,o}* seems to be helpful) as having a large collection of files in your directory might slow down other file operations in that directory...

Other useful commands

  • qdel job_id: deletes your job
  • qstat -q: lists all queues
  • qstat -a: lists all jobs
  • qstat -au userid: lists all jobs submitted by userid
  • pbsnodes: list status of all the compute nodes

HOWTOs / Tutorials

  • How to run a single core job on the cluster

  • How to pass arguments / variables to the job running on the cluster

  • How to run an MPI job on the cluster

  • How to submit job arrays