Diff for "InfolabClusterCompute"

Differences between revisions 4 and 25 (spanning 21 versions)
Revision 4 as of 2012-10-16 19:35:29
Size: 4599
Editor: akrevl
Comment:
Revision 25 as of 2012-10-17 02:49:59
Size: 12834
Editor: akrevl
Comment:
Line 3: Line 3:
=== Access === <<TableOfContents(3)>>

== Access ==
Line 11: Line 13:
=== Job scheduling === == Job scheduling ==
Line 17: Line 19:
=== Qsub === == Qsub ==
Line 22: Line 24:
qsub script_file qsub -V script_file
Line 28: Line 30:
qsub runjob.sh qsub -V myjob.sh
Line 40: Line 42:
'''script_file nost not be binary/executable file''' '''script_file must not be binary/executable file'''
Line 45: Line 47:



==== Preparing your job ====

Preparing your job for the scheduler is as simple as adding a few comments to a script that runs your program. Here is an example:
== Resource manager directives ==

These directives tell the resource manager how to run your job. All of the directives start with a pound character (#) immediately followed by the keyword PBS:

{{{
#PBS -directive options
}}}

=== Name ===

This directive tells the resource manager which name to use for your job. If you do not specify it, the name of your submission script will be used.

{{{
#PBS -N InfolabClusterTutorial
}}}

=== Standard output ===

Since you never know which server your program will run on once you submit it to the cluster, the resource manager will deposit the standard output and standard error streams to a set of files in the directory where your submission script ran from.

By default the resource manager will redirect all standard output of a job to a file named ''jobname''.o''jobid''. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.sh.o711. If you provided the Name directive discussed in the previous section, then your default standard output will be saved to InfolabClusterTutorial.o711.

You can override this behavior by using the -o directive:

{{{
#PBS -o /dfs/rulk/0/mydir/myjob.out
}}}

This will save all the standard output to the file /dfs/rulk/0/mydir/myjob.out. Please note that the file will be overwritten if you run the job more than once.

=== Error output ===

By default the resource manager will redirect everything a job writes to standard error to a file named ''jobname''.e''jobid''. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard error will be saved to myjob.sh.e711. If you provided the Name directive discussed in one of the previous sections, then your default standard error stream will be saved to InfolabClusterTutorial.e711.

You can override this behavior by using the -e directive:

{{{
#PBS -e /dfs/rulk/0/mydir/myjob.error
}}}

This will save the standard error stream to the file /dfs/rulk/0/mydir/myjob.error. Please note that the file will be overwritten if you run the job more than once.

=== Mail directive ===

This directive tells the resource manager to send you an e-mail when your job starts and when it finishes. In a cluster environment your job may not start immediately, as it depends on the other jobs that are currently in the cluster's queues. The following will send you an e-mail both when the job begins executing (b) and when it ends (e).

{{{
#PBS -m be
}}}

Please note that the e-mail will not be delivered to your main CS account, but rather to the local mail queue on the submission node (you could set up forwarding, but that should be a topic of another wiki page).

=== Parallel jobs ===

You may specify that you want your job to run on multiple cores and multiple nodes with the following directive:

{{{
#PBS -l nodes=node_no:ppn=core_no
}}}

In the example above the ''node_no'' represents the number of nodes (physical servers) that you are requesting and the ''core_no'' represents the number of cores that you would like to use on each of the nodes. If you would like to use 6 cores on a single node you could do it with the following directive:

{{{
#PBS -l nodes=1:ppn=6
}}}

Here is another example requesting two nodes with 32 cores each:

{{{
#PBS -l nodes=2:ppn=32
}}}

{{{#!wiki tip
'''Parallel jobs'''

Please bear in mind that qsub will not do anything to make your job parallel. That is why you should only request more than one core if your program is multi-core or multi-thread capable. If your program is not written in a parallel manner it will only run on a single core, and your 32-core reservation will just tie up resources that others could use.

'''Number of requested cores'''

Please do not make requests that the cluster is not able to handle. If you submit a job with the directive ''-l nodes=1:ppn=128'' this job will actually never run on the current configuration of the cluster as we do not have nodes with 128 cores. Please consult the cluster's hardware capabilities before using this directive.
}}}

=== Running time ===

This directive lets you specify a maximum walltime (sum of CPU time and wait time) that can be used by your job. This may be useful in a situation where you know your job should run no longer than 2 hours and if it runs longer then something went wrong. You can specify such a limit with the following directive:

{{{
#PBS -l walltime=02:00:00
}}}

You do not have to specify a maximum walltime. In that case your job will run indefinitely (unless the cluster crashes), though it may be interrupted by shorter-running jobs.

{{{#!wiki tip
'''Wall time format'''

You should always specify the wall time in HH:MM:SS format. If you were to write ''walltime=120:00'' meaning 120 hours, your program would be killed after 2 hours of work, as the setting is read as 120 minutes, 0 seconds.
}}}

=== An example submission script ===

In the following example we do not actually call a binary of our own; we just run a few standard commands and exit. Since the submission script is nothing more than a regular shell script, the example should print the name of the host it is running on to our standard output file.
Line 62: Line 159:
The comment lines that start with the PBS keyword let you select different PBS options:

 * #PBS -N: lets you specify a friendly job name
 * #PBS -l nodes=1:ppn=1: specifies that I would like my job to run on a single node (nodes) and on a single core (ppn)
 * #PBS -l walltime=01:10:00: specifies the amount of real time I anticipate that my script will need to finish. Please note that the scheduler will terminate my script if it does not finish in time.

For a more comprehensive list of resources that you can specify with #PBS -l see here: http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml. Note however, that there is currently only one queue without very many parameters set.

Make sure that your job uses data from:

 * Your CS home directory (whatever is under /afs/cs.stanford.edu/u/your_csid on hulk, rocky and snapx, please note that user home directories are not yet available under /u/your_csid on snapx)
 * Network mounted directories from rocky and hulk:
  * /dfs/hulk/0
  * /dfs/rocky/0

Here is an example of a bit more complex script to run an MPI job (copied from http://csc.cnsi.ucsb.edu/docs/running-jobs-torque):

{{{
#!/bin/sh
#PBS -l nodes=2:ppn=4

# Make sure that we are in the same subdirectory as where the qsub command
# is issued.
cd $PBS_O_WORKDIR

# make a list of allocated nodes(cores)
cat $PBS_NODEFILE > nodes

# How many cores total do we have?
NO_OF_CORES=`cat $PBS_NODEFILE | egrep -v '^#'\|'^$' | wc -l | awk '{print $1}'`
NODE_LIST=`cat $PBS_NODEFILE `

# Just for kicks, see which nodes we got.
echo $NODE_LIST

# Run the executable. *DO NOT PUT* a '&' at the end!!
mpirun -np $NO_OF_CORES -machinefile nodes ./pi3 >& log
}}}


==== Submitting your job ====

Now that your job is prepared you have to submit it to the resource manager. Use qsub to submit your jobs:

{{{
qsub myjob.sh
}}}

Make sure you run qsub from your CS home directory or from a network mounted filesystem (see above). Once the job is finished output data will wait for you in the same directory and there will be two additional files that end in e<job#> and o<job#>. These two are stderr and stdout, respectively.

==== Check the status of your job ====

You can check what is happening with your job with the qstat command:

{{{
qstat jobid
}}}

jobid is the number that the resource manager assigned to your job (the first number qsub will output after you successfully submit a job).

==== Other useful commands ====
== Paths ==

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs for your '''scripts''', your '''programs''' and the '''datafiles''' needed for your job. You should also make sure to always use a full (absolute) path specification.

This means that using ./myjob to run your program from a submission script in your home directory is a '''bad''' idea. You should call it like this:

{{{
/afs/cs.stanford.edu/u/your_csid/myjob
}}}

You can save yourself some typing by using environment variables. You could use $HOME/myjob in the example above. If you decide to use environment variables, make sure that you run qsub with the -V parameter as we are showing you throughout this tutorial. The -V parameter makes sure that the environment variables are available to the submission script.

=== What is mounted under /dfs ===

 * '''/dfs/hulk/0''' points to /lfs/hulk/0 on hulk.stanford.edu
 * '''/dfs/rulk/0''' points to /lfs/rulk/0 on rulk.stanford.edu
 * '''/dfs/rocky/0''' points to /lfs/rocky/0 on rocky.stanford.edu
 * '''/u''' points to /afs/cs.stanford.edu/u and contains user home directories

== Passing CLI arguments ==

You may want to pass some arguments to the program that you want to run on the cluster. This is problematic as you would actually be passing the arguments to the ''qsub'' program instead of your own program.

You have probably already noticed the first workaround in the text above: we always run qsub with the '''-V''' switch. This switch tells ''qsub'' to pass all the environment variables that are currently set to the environment in which the submitted script will run. That is why we told you it is OK to use the $HOME variable when we were discussing [[InfolabClusterCompute#Paths|paths]].

The other workaround is to list the variables that need to be available in the program execution environment with the '''-v''' switch. If you wanted the variable $MYNAME to contain the value Alice, you would call ''qsub'' like this:

{{{
qsub -v MYNAME="Alice" myjob.sh
}}}

Please see [[InfolabClusterComputeHowtoVariables]] for a more detailed example.

== Queues ==

There is only one queue available on the compute cluster at the moment. This is bound to change once the cluster is used more heavily and we can make better sense of what is needed.

The default queue is called '''test''' and it allows up to 35,000 jobs to be queued and up to 1,200 jobs to run simultaneously.

== Qstat ==

The '''qstat''' command enables you to check in on your job. You run it with your job ID number:

{{{
qstat job_id
}}}

If you want to know the status of the job number 4652 you can issue the command:

{{{
qstat 4652
}}}

And the resource manager's reply might look a little something like this:

{{{
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1 SingleCoreJob akrevl 0 Q test
}}}

Or like this:

{{{
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1 SingleCoreJob akrevl 0 R test
}}}

The difference between the two outputs is the status (S) column, which has the value '''Q''' in the first output and '''R''' in the second. These mean that our job is '''queued''' in the first output and '''running''' in the second.

If you run the ''qstat'' command with the ''-f'' switch you will get more detailed data about your job:

{{{
qstat -f 4652
}}}

== Showq ==

As an alternative to ''qstat'' you can use the ''showq'' command. Note however that this command is somewhat sensitive to the condition of the cluster and may report a timeout even though everything is running fine on the cluster.

You can invoke ''showq'' by running:

{{{
showq -u your_csid
}}}

And the response should be similar to:

{{{
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME

4654 akrevl Running 1 00:01:00 Tue Oct 16 17:47:43

     1 Active Job 1 of 896 Processors Active (0.11%)
                         1 of 28 Nodes Active (3.57%)

IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


Total Jobs: 1 Active Jobs: 1 Idle Jobs: 0 Blocked Jobs: 0
}}}

''showq'' will display a list of active, idle and blocked jobs by default.

 * '''active''' jobs are the jobs that are currently running on the cluster,
 * '''idle''' are the jobs that are in the queue and ready to run but are still waiting for free resources,
 * '''blocked''' are the jobs that could not complete or cannot run on this cluster (typically a job will go into this state if it was running on one of the cluster nodes that happened to fail at that time).

== Other useful commands ==
Line 130: Line 285:
More coming soon... == HOWTOs / Tutorials ==

 * [[InfolabClusterComputeHowtoSingle|How to]] run a single core job on the cluster
 * [[InfolabClusterComputeHowtoVariables|How to]] pass arguments / variables to the job running on the cluster
 * [[InfolabClusterComputeHowtoMpi|How to]] run an MPI job on the cluster

Infolab Compute Cluster

Access

To submit the jobs to the compute cluster you need to log in to the submission node ilhead1.stanford.edu. Use your CS credentials to log in.

ssh your_cs_id@ilhead1.stanford.edu

Job scheduling

All jobs are submitted through the Torque resource manager and scheduled by the Maui scheduler. Please do not log in to the nodes directly and run jobs from there.

Torque used to be called PBS, so if you see any resources talking about the PBS resource manager those more or less apply to Torque as well. Also please excuse us if we use PBS and Torque interchangeably.

Qsub

qsub is the main command that submits your job to the cluster. The command uses the following syntax:

qsub -V script_file

So if I have a script called myjob.sh that I would like to run on the cluster I can do so by executing the following:

qsub -V myjob.sh

script_file should be a text file

The script_file should contain the name and the path to your executable file and extra instructions that tell the resource manager how to run your job. Don't worry, we'll talk more about those later.

script_file must not be binary/executable file

Never use qsub to submit a binary executable to the resource manager. The submission itself will succeed, but the node that the job is assigned to will fail to execute it with a "Cannot execute a binary file" error.
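One way to guard against this mistake is to check the file before submitting it. The following is only a minimal sketch: the function names is_text_script and submit_checked are hypothetical (they are not Torque commands), and the check is a simple heuristic that treats any file containing NUL bytes as binary.

```shell
# Hypothetical helper (not a Torque command): succeed only if the file
# exists and contains no NUL bytes, which is a rough test for "text file".
is_text_script() {
    [ -f "$1" ] && tr -d '\000' < "$1" | cmp -s - "$1"
}

# Hypothetical wrapper: refuse to submit anything that looks binary.
submit_checked() {
    if is_text_script "$1"; then
        qsub -V "$1"
    else
        echo "refusing to submit $1: not a text script" >&2
        return 1
    fi
}
```

With these defined, submit_checked myjob.sh behaves like qsub -V myjob.sh for a shell script and stops with an error message for a compiled program.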

Resource manager directives

These directives tell the resource manager how to run your job. All of the directives start with a pound character (#) immediately followed by the keyword PBS:

#PBS -directive options

Name

This directive tells the resource manager which name to use for your job. If you do not specify it, the name of your submission script will be used.

#PBS -N InfolabClusterTutorial

Standard output

Since you never know which server your program will run on once you submit it to the cluster, the resource manager will deposit the standard output and standard error streams to a set of files in the directory where your submission script ran from.

By default the resource manager will redirect all standard output of a job to a file named jobname.ojobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.sh.o711. If you provided the Name directive discussed in the previous section, then your default standard output will be saved to InfolabClusterTutorial.o711.

You can override this behavior by using the -o directive:

#PBS -o /dfs/rulk/0/mydir/myjob.out

This will save all the standard output to the file /dfs/rulk/0/mydir/myjob.out. Please note that the file will be overwritten if you run the job more than once.

Error output

By default the resource manager will redirect everything a job writes to standard error to a file named jobname.ejobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard error will be saved to myjob.sh.e711. If you provided the Name directive discussed in one of the previous sections, then your default standard error stream will be saved to InfolabClusterTutorial.e711.

You can override this behavior by using the -e directive:

#PBS -e /dfs/rulk/0/mydir/myjob.error

This will save the standard error stream to the file /dfs/rulk/0/mydir/myjob.error. Please note that the file will be overwritten if you run the job more than once.
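The default naming rule for both streams can be written down as two small functions. This is just an illustration of the rule described in this section and the previous one; the function names are hypothetical and nothing here talks to the resource manager.

```shell
# Hypothetical helpers that mirror the default naming rule:
# <jobname>.o<jobid> for standard output, <jobname>.e<jobid> for standard error.
default_stdout_name() {
    echo "$1.o$2"
}

default_stderr_name() {
    echo "$1.e$2"
}

default_stdout_name myjob.sh 711                 # prints myjob.sh.o711
default_stderr_name InfolabClusterTutorial 711   # prints InfolabClusterTutorial.e711
```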

Mail directive

This directive tells the resource manager to send you an e-mail when your job starts and when it finishes. In a cluster environment your job may not start immediately, as it depends on the other jobs that are currently in the cluster's queues. The following will send you an e-mail both when the job begins executing (b) and when it ends (e).

#PBS -m be

Please note that the e-mail will not be delivered to your main CS account, but rather to the local mail queue on the submission node (you could set up forwarding, but that should be a topic of another wiki page).

Parallel jobs

You may specify that you want your job to run on multiple cores and multiple nodes with the following directive:

#PBS -l nodes=node_no:ppn=core_no

In the example above the node_no represents the number of nodes (physical servers) that you are requesting and the core_no represents the number of cores that you would like to use on each of the nodes. If you would like to use 6 cores on a single node you could do it with the following directive:

#PBS -l nodes=1:ppn=6

Here is another example requesting two nodes with 32 cores each:

#PBS -l nodes=2:ppn=32

Parallel jobs

Please bear in mind that qsub will not do anything to make your job parallel. That is why you should only request more than one core if your program is multi-core or multi-thread capable. If your program is not written in a parallel manner it will only run on a single core, and your 32-core reservation will just tie up resources that others could use.

Number of requested cores

Please do not make requests that the cluster is not able to handle. If you submit a job with the directive -l nodes=1:ppn=128 this job will actually never run on the current configuration of the cluster as we do not have nodes with 128 cores. Please consult the cluster's hardware capabilities before using this directive.
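A request can be sanity-checked before it is submitted. The sketch below uses hypothetical helper names and assumes a request of the form nodes=N:ppn=M; the limit of 32 cores per node used in the last line is also an assumption (it matches the 32-core examples on this page), so please check the actual hardware before relying on it.

```shell
# Hypothetical helpers for sanity-checking a "nodes=N:ppn=M" request.

# Print the ppn value of a spec such as "nodes=2:ppn=32".
spec_ppn() {
    ppn=${1##*ppn=}     # drop everything up to and including "ppn="
    echo "${ppn%%:*}"   # drop any further ":key=value" parts
}

# Print the nodes value of the same kind of spec.
spec_nodes() {
    nodes=${1#nodes=}   # drop the leading "nodes="
    echo "${nodes%%:*}" # keep only the number before the first ":"
}

# Print the total number of cores requested (nodes * ppn).
total_cores() {
    echo $(( $(spec_nodes "$1") * $(spec_ppn "$1") ))
}

# Succeed if the per-node request does not exceed the given core count.
fits_on_node() {
    [ "$(spec_ppn "$1")" -le "$2" ]
}

total_cores nodes=2:ppn=32    # prints 64
# 32 cores per node is an assumption, not a documented limit.
fits_on_node nodes=1:ppn=128 32 || echo "request can never be scheduled"
```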

Running time

This directive lets you specify a maximum walltime (sum of CPU time and wait time) that can be used by your job. This may be useful in a situation where you know your job should run no longer than 2 hours and if it runs longer then something went wrong. You can specify such a limit with the following directive:

#PBS -l walltime=02:00:00

You do not have to specify a maximum walltime. In that case your job will run indefinitely (unless the cluster crashes), though it may be interrupted by shorter-running jobs.

Wall time format

You should always specify the wall time in HH:MM:SS format. If you were to write walltime=120:00 meaning 120 hours, your program would be killed after 2 hours of work, as the setting is read as 120 minutes, 0 seconds.
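To see how a walltime string will be interpreted, it can help to convert it to seconds. The following sketch uses a hypothetical helper name; the three-field and two-field readings follow the description above, while treating a single number as seconds is an assumption.

```shell
# Hypothetical helper: convert a walltime string to seconds, reading
# "A:B:C" as hours:minutes:seconds, "A:B" as minutes:seconds and
# a single number "A" as seconds (the last case is an assumption).
walltime_to_seconds() {
    echo "$1" | awk -F: '
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
        NF == 2 { print $1 * 60 + $2 }
        NF == 1 { print $1 + 0 }'
}

walltime_to_seconds 02:00:00   # prints 7200
walltime_to_seconds 120:00     # prints 7200, i.e. 2 hours, not 120 hours
```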

An example submission script

In the following example we do not actually call a binary of our own; we just run a few standard commands and exit. Since the submission script is nothing more than a regular shell script, the example should print the name of the host it is running on to our standard output file.

#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

echo "I am running on:"
hostname
sleep 20
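A slightly longer variant of the same script can also report what the resource manager tells the job about itself. This is a sketch only: describe_job is a hypothetical function name, and the script assumes the PBS_JOBID, PBS_O_WORKDIR and PBS_NODEFILE environment variables that Torque normally sets for a running job. The defaults let it run outside the cluster as well.

```shell
#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

# Print a short report about where and how the job is running.
# The ":-" defaults only matter when the script is run outside the cluster.
describe_job() {
    echo "Job ID: ${PBS_JOBID:-unknown}"
    echo "Submitted from: ${PBS_O_WORKDIR:-$PWD}"
    echo "Running on: $(uname -n)"
    # The node file lists one line per allocated core, if it is available.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "Allocated cores: $(wc -l < "$PBS_NODEFILE")"
    fi
}

describe_job
```

uname -n is used here in place of hostname only because it is available on any POSIX system; both print the name of the host.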

Paths

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs for your scripts, your programs and the datafiles needed for your job. You should also make sure to always use a full (absolute) path specification.

This means that using ./myjob to run your program from a submission script in your home directory is a bad idea. You should call it like this:

/afs/cs.stanford.edu/u/your_csid/myjob

You can save yourself some typing by using environment variables. You could use $HOME/myjob in the example above. If you decide to use environment variables, make sure that you run qsub with the -V parameter as we are showing you throughout this tutorial. The -V parameter makes sure that the environment variables are available to the submission script.

What is mounted under /dfs

  • /dfs/hulk/0 points to /lfs/hulk/0 on hulk.stanford.edu

  • /dfs/rulk/0 points to /lfs/rulk/0 on rulk.stanford.edu

  • /dfs/rocky/0 points to /lfs/rocky/0 on rocky.stanford.edu

  • /u points to /afs/cs.stanford.edu/u and contains user home directories
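The path rules above can be checked mechanically before a job is submitted. The helper below is a hypothetical sketch: it only looks at the prefix of the path (/afs, /dfs or /u, as listed in this section) and does not check that the file actually exists.

```shell
# Hypothetical helper: succeed only for absolute paths under the shared
# filesystems named in this section (/afs, /dfs and the /u shortcut).
is_cluster_path() {
    case $1 in
        /afs/*|/dfs/*|/u/*) return 0 ;;
        *) return 1 ;;
    esac
}

is_cluster_path /afs/cs.stanford.edu/u/your_csid/myjob && echo "ok"
is_cluster_path ./myjob || echo "./myjob is not an absolute shared path"
```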

Passing CLI arguments

You may want to pass some arguments to the program that you want to run on the cluster. This is problematic as you would actually be passing the arguments to the qsub program instead of your own program.

You have probably already noticed the first workaround in the text above: we always run qsub with the -V switch. This switch tells qsub to pass all the environment variables that are currently set to the environment in which the submitted script will run. That is why we told you it is OK to use the $HOME variable when we were discussing paths.

The other workaround is to list the variables that need to be available in the program execution environment with the -v switch. If you wanted the variable $MYNAME to contain the value Alice, you would call qsub like this:

qsub -v MYNAME="Alice" myjob.sh

Please see InfolabClusterComputeHowtoVariables for a more detailed example.
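On the receiving side, the submission script simply reads the variable from its environment. A minimal sketch of what myjob.sh could contain is shown below; the greet function and the fallback value are made up for illustration.

```shell
# Hypothetical body of myjob.sh: read MYNAME from the environment and fall
# back to a default when the variable was not passed with -v (or -V).
greet() {
    echo "Hello, ${MYNAME:-stranger}!"
}

greet
```

Submitted with qsub -v MYNAME="Alice" myjob.sh, the greeting written to the job's standard output file would use the name Alice.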

Queues

There is only one queue available on the compute cluster at the moment. This is bound to change once the cluster is used more heavily and we can make better sense of what is needed.

The default queue is called test and it allows up to 35,000 jobs to be queued and up to 1,200 jobs to run simultaneously.

Qstat

The qstat command enables you to check in on your job. You run it with your job ID number:

qstat job_id

If you want to know the status of the job number 4652 you can issue the command:

qstat 4652

And the resource manager's reply might look a little something like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 Q test

Or like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 R test

The difference between the two outputs is the status (S) column, which has the value Q in the first output and R in the second. These mean that our job is queued in the first output and running in the second.

If you run the qstat command with the -f switch you will get more detailed data about your job:

qstat -f 4652
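If you want to use the job status in a script, the S column can be pulled out of the short qstat listing. The following sketch assumes the column layout shown in the sample output above (status in the second-to-last column); job_state is a hypothetical helper name, and the sample text is fed in directly so the example does not need the cluster.

```shell
# Hypothetical helper: read qstat output on standard input and print the
# value of the S column for the given numeric job ID.
job_state() {
    awk -v id="$1" '$1 ~ ("^" id "\\.") { print $(NF - 1) }'
}

# Sample output copied from this page; on the cluster you would instead run:
#   qstat 4652 | job_state 4652
job_state 4652 <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 Q test
EOF
```

For the sample above the helper prints Q, meaning the job is still queued.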

Showq

As an alternative to qstat you can use the showq command. Note however that this command is somewhat sensitive to the condition of the cluster and may report a timeout even though everything is running fine on the cluster.

You can invoke showq by running:

showq -u your_csid

And the response should be similar to:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4654                 akrevl    Running     1    00:01:00  Tue Oct 16 17:47:43

     1 Active Job        1 of  896 Processors Active (0.11%)
                         1 of   28 Nodes Active      (3.57%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

showq will display a list of active, idle and blocked jobs by default.

  • active jobs are the jobs that are currently running on the cluster,

  • idle are the jobs that are in the queue and ready to run but are still waiting for free resources,

  • blocked are the jobs that could not complete or cannot run on this cluster (typically a job will go into this state if it was running on one of the cluster nodes that happened to fail at that time).
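The three job counts can be read from the summary line at the end of the showq output. This is a sketch that assumes the "Total Jobs" line has exactly the layout shown in the sample response above; showq_counts is a hypothetical helper name.

```shell
# Hypothetical helper: read showq output on standard input and print the
# active, idle and blocked job counts from the "Total Jobs" summary line.
showq_counts() {
    awk '/^Total Jobs:/ { print "active=" $6, "idle=" $9, "blocked=" $12 }'
}

# On the cluster you would run: showq -u your_csid | showq_counts
echo "Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0" | showq_counts
# prints: active=1 idle=0 blocked=0
```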

Other useful commands

  • qdel job_id: deletes your job
  • qstat -q: lists all queues
  • qstat -a: lists all jobs
  • qstat -au userid: lists all jobs submitted by userid
  • pbsnodes: list status of all the compute nodes

HOWTOs / Tutorials

  • How to run a single core job on the cluster

  • How to pass arguments / variables to the job running on the cluster

  • How to run an MPI job on the cluster