Infolab Compute Cluster

Access

To submit jobs to the compute cluster, log in to the submission node ilhead1.stanford.edu with your CS credentials.

ssh your_cs_id@ilhead1.stanford.edu

Job scheduling

All jobs are submitted with the Torque resource manager and are scheduled by the MAUI scheduler. Please do not log in to the nodes directly and run jobs from there.

Torque used to be called PBS, so if you see any resources talking about the PBS resource manager those more or less apply to Torque as well. Also please excuse us if we use PBS and Torque interchangeably.

Qsub

qsub is the main command that submits your job to the cluster. The command uses the following syntax:

qsub -V script_file

So if I have a script called myjob.sh that I would like to run on the cluster, I can do so by executing the following:

qsub -V myjob.sh

script_file should be a text file

The script_file should contain the name and the path to your executable file and extra instructions that tell the resource manager how to run your job. Don't worry, we'll talk more about those later.

script_file must not be binary/executable file

Never use qsub to submit a binary executable to the resource manager. The submission itself will succeed, but the node that the job is assigned to will fail to execute it with a "Cannot execute a binary file" error.
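If you want to guard against this mistake, a small check before submission can help. The following is only a sketch: is_text_file and submit_if_text are hypothetical helper names (they are not part of Torque), and submit_if_text only prints the qsub command instead of running it.

```shell
# Hypothetical helper: succeeds only if the file exists and looks like text.
# grep -I treats binary files as non-matching, -q suppresses output.
# An empty file also fails the check, which is fine for a submission script.
is_text_file() {
    [ -f "$1" ] && grep -Iq . "$1"
}

# Hypothetical wrapper: prints the qsub command for text files and
# refuses anything that looks binary. Replace echo with the real call
# once you are happy with the check.
submit_if_text() {
    if is_text_file "$1"; then
        echo "qsub -V $1"
    else
        echo "refusing to submit $1: not a text file" >&2
        return 1
    fi
}
```

You would call it as submit_if_text myjob.sh from the directory that holds your submission script.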

Resource manager directives

These directives tell the resource manager how to run your job. All of the directives start with a pound character (#) immediately followed by the keyword PBS:

#PBS -directive options

Name

This directive tells the resource manager which name to use for your job. If you do not specify it, the name of your submission script will be used.

#PBS -N InfolabClusterTutorial

Standard output

Since you never know which server your program will run on once you submit it to the cluster, the resource manager will deposit the standard output and standard error streams to a set of files in the directory where your submission script ran from.

By default the resource manager will redirect all standard output of a job to a file named jobname.ojobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.sh.o711. If you provided the Name directive discussed in the previous section, then your default standard output will be saved to InfolabClusterTutorial.o711.

You can override this behavior by using the -o directive:

#PBS -o /dfs/scratch0/mydir/myjob.out

This will save all the standard output to the file /dfs/scratch0/mydir/myjob.out. Please note that the file will be overwritten if you run the job more than once.

Error output

By default the resource manager will redirect everything a job writes to standard error to a file named jobname.ejobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard error will be saved to myjob.sh.e711. If you provided the Name directive discussed in one of the previous sections, then your default standard error stream will be saved to InfolabClusterTutorial.e711.

You can override this behavior by using the -e directive:

#PBS -e /dfs/scratch0/mydir/myjob.error

This will save the standard error stream to the file /dfs/scratch0/mydir/myjob.error. Please note that the file will be overwritten if you run the job more than once.

Mail directive

This directive tells the resource manager to send you an e-mail when your job is started and when it is finished. In a cluster environment your job may not start immediately, as that depends on the other jobs currently in the cluster's queues. The following will send you an e-mail both when the job starts executing and when it finishes.

#PBS -m be -M your_csid@cs.stanford.edu

Make sure that you supply a valid @stanford.edu or @cs.stanford.edu e-mail address after the -M switch, and the cluster will notify you when the job starts and when it finishes processing (addresses outside Stanford will not work).

Parallel jobs

You may specify that you want your job to run on multiple cores and multiple nodes with the following directive:

#PBS -l nodes=node_no:ppn=core_no

In the example above the node_no represents the number of nodes (physical servers) that you are requesting and the core_no represents the number of cores that you would like to use on each of the nodes. If you would like to use 6 cores on a single node you could do it with the following directive:

#PBS -l nodes=1:ppn=6

Here is another example requesting two nodes with 32 cores each:

#PBS -l nodes=2:ppn=32

Parallel jobs

Please bear in mind that qsub will not do anything to make your job parallel. That is why you should only request more than one core if your program is multi-core or multi-thread capable. If your program is not written in a parallel manner it will only run on a single core, and your 32-core reservation will just take system resources away from others.

Number of requested cores

Please do not make requests that the cluster is not able to handle. If you submit a job with the directive -l nodes=1:ppn=128 this job will actually never run on the current configuration of the cluster as we do not have nodes with 128 cores. Please consult the cluster's hardware capabilities before using this directive.

Running time

This directive lets you specify a maximum walltime (sum of CPU time and wait time) that can be used by your job. This may be useful in a situation where you know your job should run no longer than 2 hours and if it runs longer then something went wrong. You can specify such a limit with the following directive:

#PBS -l walltime=02:00:00

You do not have to specify a maximum walltime. In that case your job will run indefinitely (unless the cluster crashes), and it may be interrupted by shorter-running jobs.

Wall time format

You should always specify the wall time in HH:MM:SS format. If you wrote walltime=120:00 meaning 120 hours, your program would get killed after 2 hours of work, because the setting is read as 120 minutes and 0 seconds.
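To avoid mixing up hours and minutes, you can compute the HH:MM:SS string from a number of seconds. This is a minimal sketch; to_walltime is a hypothetical helper name and not a Torque command.

```shell
# Hypothetical helper: convert a duration in seconds to HH:MM:SS.
to_walltime() {
    total="$1"
    printf '%02d:%02d:%02d\n' \
        "$(( total / 3600 ))" \
        "$(( total % 3600 / 60 ))" \
        "$(( total % 60 ))"
}

to_walltime 7200      # 2 hours   -> 02:00:00
to_walltime 432000    # 120 hours -> 120:00:00
```

The result can then be pasted into the directive, for example #PBS -l walltime=02:00:00.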

Requesting a specific node

Sometimes you may want to use a specific node in the cluster. This node might be special because it has some special hardware (think GPUs even though we do not have any at the moment) or because you made the extra effort of copying some data to its local storage. You can ask the resource manager to run your job on a specific node by using the following directive:

#PBS -l nodes=iln10.stanford.edu:ppn=32

The directive above asks the scheduler to assign your job 32 cores on the node called iln10.stanford.edu. If you need more than one node, use a plus sign and add more nodes:

#PBS -l nodes=iln10.stanford.edu:ppn=32+iln11.stanford.edu:ppn=32+iln12.stanford.edu:ppn=16

The directive above will schedule 32 cores on nodes iln10.stanford.edu and iln11.stanford.edu and 16 cores on node iln12.stanford.edu.

Additional delay

Please note that requesting a specific node might cause an additional delay prior to your job execution as the resources on the specific node might not be available when you submit your job.

An example submission script

In the following example we do not call a binary of our own; we just run a few standard commands and exit. Since the submission script is nothing more than a regular shell script, the example should print the name of the host it is running on to our standard output file.

#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

echo "I am running on:"
hostname
sleep 20

Paths

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs for your scripts, your programs and the datafiles needed for your job. You should also make sure to always use a full (absolute) path specification.

This means that using ./myjob to run your program from a submission script in your home directory is a bad idea. You should call it like this:

/afs/cs.stanford.edu/u/your_csid/myjob

You can save yourself some typing by using environment variables. You could use $HOME/myjob in the example above. If you decide to use environment variables, make sure that you run qsub with the -V parameter as we are showing you throughout this tutorial. The -V parameter makes sure that the environment variables are available to the submission script.
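A quick way to catch relative paths before they end up in a submission script is a check like the one below. It is a sketch only, and require_absolute is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds for absolute paths, complains otherwise.
require_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  echo "not an absolute path: $1" >&2
            return 1 ;;
    esac
}

# $HOME expands to an absolute path, so this form passes the check.
require_absolute "$HOME/myjob" && echo "OK: $HOME/myjob"
```

Calling require_absolute ./myjob would fail and print a message to standard error.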

What is mounted under /dfs

/dfs/scratch0 points to a big scratch volume on il-nfs-1
/dfs/scratch1 points to a big scratch volume on il-nfs-1
/dfs/ilfs2/0 points to the ilfs2 file server

Passing CLI arguments

You may want to pass some arguments to the program that you want to run on the cluster. This is problematic as you would actually be passing the arguments to the qsub program instead of your own program.

You have probably already noticed the first workaround in the text above: we always run qsub with the -V switch. This switch tells qsub to pass all the environment variables that are currently set to the environment in which the submitted script will run. That is why we told you it is OK to use the $HOME variable when we were discussing paths.

The other workaround is to list the variables that need to be available in the program execution environment with the -v switch. If you wanted the variable $MYNAME to contain the value Alice, you would call qsub like this:

qsub -v MYNAME="Alice" myjob.sh

Please see InfolabClusterComputeHowtoVariables for a more detailed example.
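On the receiving side, the submission script simply reads the variable from its environment. The following sketch shows what a myjob.sh for the MYNAME example might look like; the greet function and the "unknown" fallback are our own illustration and are not taken from the howto.

```shell
#PBS -N VariableSketch
#PBS -l nodes=1:ppn=1

# MYNAME is expected to arrive through "qsub -v MYNAME=..." (or -V).
# The fallback makes a forgotten variable visible in the output file.
greet() {
    echo "Hello, ${MYNAME:-unknown}"
}

greet
```

Submitted with qsub -v MYNAME="Alice" myjob.sh, the standard output file should contain Hello, Alice.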

Queues

There is only one queue available on the compute cluster at the moment. This is bound to change once the cluster is used more heavily and we can make better sense of what is needed.

The default queue is called test and it allows up to 35,000 jobs to be queued and up to 1,200 jobs to run simultaneously.

Qstat

The qstat command enables you to check in on your job. You run it with your job ID number:

qstat job_id

If you want to know the status of the job number 4652 you can issue the command:

qstat 4652

And the resource manager's reply might look a little something like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 Q test

Or like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 R test

The difference between the two outputs is the status (S) column, which has the value Q in the first output and the value R in the second. These mean that our job is queued in the first output and running in the second.
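If you only care about the status letter, for example in a script that waits for a job, you can cut it out of the qstat output. This is a sketch that assumes the column layout shown above; job_state is a hypothetical helper name.

```shell
# Hypothetical helper: print the status letter of each job in qstat output.
# The first two lines are the header and the dashes; on every job line the
# status is the second-to-last column.
job_state() {
    awk 'NR > 2 { print $(NF - 1) }'
}

# On the cluster you would run: qstat 4652 | job_state
# Here we feed it the sample output from above instead.
job_state <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4652.ilhead1               SingleCoreJob    akrevl                 0 Q test
EOF
# prints: Q
```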

If you run the qstat command with the -f switch you will get more detailed data about your job:

qstat -f 4652

ShowQ

As an alternative to qstat you can use the showq command. Note however that this command is somewhat sensitive to the condition of the cluster and may report a timeout even though everything is running fine on the cluster.

You can invoke showq by running:

showq -u your_csid

And the response should be similar to:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4654                 akrevl    Running     1    00:01:00  Tue Oct 16 17:47:43

     1 Active Job        1 of  896 Processors Active (0.11%)
                         1 of   28 Nodes Active      (3.57%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

showq will display a list of active, idle and blocked jobs by default.

active jobs are the jobs that are currently running on the cluster,
idle are the jobs that are in the queue and ready to run but are still waiting for free resources,
blocked are the jobs that could not complete or cannot run on this cluster (typically a job will go into this state if it was running on one of the cluster nodes that happened to fail at that time).

Killing / cancelling a job

Sometimes a job just goes wrong and you want to cancel it before it is able to finish gracefully. A nice way of doing that is sending the job a SIGINT signal. This will give your job a chance to clean up after itself (provided you have coded that, of course). In the following example we submit a job only to cancel it and show that it does not exist any more:

$ qsub outputjob.sh
60328.ilhead1.stanford.edu

$ qsig -s SIGINT 60328.ilhead1.stanford.edu

$ qstat -a
ilhead1.stanford.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----

For job arrays make sure you add the square brackets to the job id:

$ qsub -V -t 0-5 outputjob.sh
60355[].ilhead1.stanford.edu

$ qsig -s SIGINT 60355[].ilhead1.stanford.edu
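The "clean up after itself" part has to be coded in your submission script. Below is a minimal sketch of how that might look in bash, assuming the signal is delivered to the shell that runs your script. The scratch directory, the three-step loop and the function names are made up for illustration.

```shell
#!/bin/bash
#PBS -N CleanupSketch
#PBS -l nodes=1:ppn=1

# Hypothetical temporary working directory for intermediate files.
scratch_dir="$(mktemp -d)"
interrupted=0

cleanup() {
    rm -rf "$scratch_dir"
}

# The handler only sets a flag; the loop below checks it between steps.
# Bash runs the handler once the current foreground command has finished.
on_signal() {
    interrupted=1
}
trap on_signal INT TERM

for step in 1 2 3; do
    if [ "$interrupted" -eq 1 ]; then
        echo "Interrupted before step $step, stopping early" >&2
        break
    fi
    echo "step $step" > "$scratch_dir/step_$step.txt"
    sleep 1
done

# Runs both after a normal finish and after an interrupt.
cleanup
```

Setting a flag instead of exiting inside the handler keeps the cleanup in one place at the end of the script.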

If you have no mercy for the job you can also send it a SIGKILL signal. You can also use the qdel command that has the same effect as sending your job a SIGKILL signal:

$ qsub outputjob.sh
60352.ilhead1.stanford.edu

$ qdel 60352.ilhead1.stanford.edu

If you are using job arrays you will have to add square brackets to the job ID in order to kill the whole array:

$ qsub -V -t 0-5 outputjob.sh
60354[].ilhead1.stanford.edu

$ qdel 60354[].ilhead1.stanford.edu

Job Arrays

This section could also be subtitled How to submit a bunch of jobs and not crash the cluster in the process. So how does it need to be done? The obvious solution probably does not work, otherwise this section would not exist in the wiki.

Do not use loops

Running for job in {1..1000}; do qsub myjob.sh; done is a bad, bad, bad thing. It will probably crash the cluster. Sorry, that's the way it is. The software seems to be on the sensitive side.

If you cannot avoid running the above, then insert a sleep 1 after each qsub. This has been tested and while it is not a preferred way of running multiple jobs it at least does not kill the cluster... as quickly...

The right way of submitting a lot of jobs is to use job arrays. Let's say we want to submit 100 instances of our program and we have already prepared a submission script called myjob.sh. We can submit all 100 jobs by issuing the following command:

qsub -V -t 0-99 myjob.sh

The cluster should reply with the following:

4660[].ilhead1.stanford.edu

Notice the square brackets following the job id. Those indicate that this is actually an array of jobs. Please note that the qstat command will also display only a single entry. showq on the other hand displays all of the jobs. You can use qstat to get more information if you specify an exact element of an array. If we want the full details of the 2nd job in the array with the id 4663, we can issue the following command:

qstat -f 4663[1]

Please see this howto for additional information on how to submit job arrays: InfolabClusterComputeHowtoJobArray.
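Inside the submission script, Torque sets the PBS_ARRAYID environment variable to the index of the current array element, so each element can pick its own input. The following is a sketch only: the input file layout under /dfs/scratch0/mydir is a made-up example, and the fallback to 0 is there so the script can be tried outside the cluster.

```shell
#PBS -N ArraySketch
#PBS -l nodes=1:ppn=1

# PBS_ARRAYID holds the index of this element when submitted with qsub -t.
element_id="${PBS_ARRAYID:-0}"

# Hypothetical layout: one input file per array element.
input_file="/dfs/scratch0/mydir/input_${element_id}.txt"

echo "Element ${element_id} would process ${input_file}"
```

Submitted with qsub -V -t 0-99, element 42 would report /dfs/scratch0/mydir/input_42.txt.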

Output files

Running a large job array will create a number of output files called JobName.eJobId-ElementId for standard error and JobName.oJobId-ElementId for standard output. Check them out once your job run is finished but make sure you delete them afterwards (rm *.{e,o}* seems to be helpful) as having a large collection of files in your directory might slow down other file operations in that directory...
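Because a pattern like *.{e,o}* also matches unrelated files (for example object files ending in .o), you may prefer to delete only the output of one particular array. This is a sketch under the naming scheme described above; clean_array_output is a hypothetical helper name.

```shell
# Hypothetical helper: delete JobName.eJobId-* and JobName.oJobId-* files
# for one job array in the given directory (default: current directory).
# Every deleted file is printed so you can see what was removed.
clean_array_output() {
    array_name="$1"
    array_id="$2"
    target_dir="${3:-.}"
    find "$target_dir" -maxdepth 1 -type f \
        -name "${array_name}.[eo]${array_id}-*" -print -delete
}
```

For the array submitted above you would run clean_array_output myjob.sh 4660 in the directory that holds the output files.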

Mail and job arrays

If you submit a big array of jobs you might like to get an email once all of the jobs are finished instead of getting a message every time one of the jobs in the array is done. To my knowledge there is no support for this built into the resource scheduler, but you can use the simple qjobmon.sh script to work around that. Make sure you call the script before and after calling your job in your qsub file.

So if we had a python script called myjob.py and it was stored on /dfs/scratch0/x/ we would prepare a qsub file that looks a bit like this:

#PBS -l nodes=1:ppn=1
#PBS -N myPythonJob
#PBS -m a -M your_csid@cs.stanford.edu

/dfs/scratch0/x/qjobmon.sh
python /dfs/scratch0/x/myjob.py
/dfs/scratch0/x/qjobmon.sh

And we would submit this to the cluster as an array of 20 jobs like this:

qsub -t 1-20 /dfs/scratch0/x/myjob.sh

You can get the script here: qjobmon.sh

PBS mail command

The #PBS -m a -M ... line tells the resource scheduler to send you a message only in case the running job is aborted (something goes wrong with it). If you used #PBS -m be -M ... you would get an email every time a job is started or finished for every job in the array (in the example above you would get 40 messages).

Interactive jobs

The cluster can also be used interactively. Instead of creating a qsub script, you can just run the following:

qsub -I -l nodes=1:ppn=16

Your job will be scheduled by the resource manager as any other job would be scheduled. But in contrast to a regular job submission qsub will not exit, but rather wait for the job to start on the cluster and redirect standard input and output to your terminal.

Please note that the ppn in the example above stands for the number of CPUs you want to allocate to your job.

Here is an example of an interactive run:

~$ hostname
ilhead1.Stanford.EDU
~$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 44389.ilhead1.stanford.edu to start
qsub: job 44389.ilhead1.stanford.edu ready

~$ hostname
iln27.stanford.edu
...

Interactive jobs on multiple nodes

If you want to run your interactive job on multiple nodes you need to specify the number of nodes accordingly. The following command will run your job on 7 nodes:

qsub -I -l nodes=7:ppn=32

Please note that qsub will only open one console for you even though it has made a reservation of resources on seven nodes.

Nodes and CPUs

Torque has a funny way of interpreting what is a node and what is a CPU on a node. That is why you need to make sure you specify both nodes= and ppn= limits. If you used qsub -I -l nodes=7 in the example above, you would only get 7 CPUs on a single node.

Where are my jobs

Knowing where your jobs are running is particularly useful if you want to run an interactive session over more than one node (as you will only get a single command prompt). Here is how you can check which nodes were scheduled for your job:

qstat -n

The command will output something similar to this:

-bash-4.1$ qstat -n

ilhead1.stanford.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
46415.ilhead1.st     akrevl   test     STDIN               --      7 224    --    --  R   --
   iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28
   +iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28
   +iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln28+iln27+iln27+iln27+iln27
   +iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27
   +iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27+iln27
...

You can see that each of the node names appears multiple times in the example above. That is because Torque displays a name for every core that it assigned to you on that node.

Here is a very quick and dirty one-liner that will display only the node names from the output above:

qstat -n | sed 's/+/\n/g' | grep iln | sed 's/ //g' | sort | uniq

The line above will output something along the lines of this:

-bash-4.1$ qstat -n | sed 's/+/\n/g' | grep iln | sed 's/ //g' | sort | uniq
iln22
iln23
iln24
iln25
iln26
iln27
iln28
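A small variation of the one-liner also counts how many cores you were given on each node. This is a sketch that assumes node names start with iln as in the examples above; cores_per_node is a hypothetical helper name.

```shell
# Hypothetical helper: read qstat -n output on standard input and print
# "node cores" pairs. Every occurrence of a node name stands for one core.
cores_per_node() {
    tr '+' '\n' | grep -o 'iln[0-9][0-9]*' | sort | uniq -c | awk '{ print $2, $1 }'
}

# On the cluster you would run: qstat -n | cores_per_node
# Here we feed it a shortened, illustrative node list instead.
cores_per_node <<'EOF'
   iln28+iln28+iln28
   +iln27+iln27
EOF
# prints:
# iln27 2
# iln28 3
```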

Other useful commands

qdel job_id: deletes your job
qstat -q: lists all queues
qstat -a: lists all jobs
qstat -au userid: lists all jobs submitted by userid
pbsnodes: lists the status of all the compute nodes

HOWTOs / Tutorials

How to run a single core job on the cluster
How to pass arguments / variables to the job running on the cluster
How to run an MPI job on the cluster
How to submit job arrays
