Infolab Compute Cluster

Access

To submit jobs to the compute cluster you first need to log in to the submission node ilhead1.stanford.edu. Use your CS credentials to log in:

ssh your_cs_id@ilhead1.stanford.edu

Job scheduling

All jobs are submitted through the Torque resource manager and scheduled by the MAUI scheduler. Please do not log in to the compute nodes directly and run jobs there.

Torque used to be called PBS, so any resources that talk about the PBS resource manager more or less apply to Torque as well. Please excuse us if we use PBS and Torque interchangeably.

Qsub

qsub is the main command that submits your job to the cluster. The command uses the following syntax:

qsub -V script_file

So if I have a script called myjob.sh that I would like to run on the cluster, I can submit it by executing the following:

qsub -V myjob.sh

script_file should be a text file

The script_file should contain the name and path of your executable file, plus extra instructions that tell the resource manager how to run your job. Don't worry, we'll talk more about those later.

script_file must not be a binary/executable file

Never use qsub to submit a binary executable to the resource manager. This will result in a successful job submission, but the node that the job is assigned to will fail to execute it with a "Cannot execute a binary file" error.
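
If the program you want to run is a compiled binary, wrap it in a small shell script and submit the script instead. A minimal sketch (the path to my_binary is just a placeholder):

#!/bin/bash
# Run the compiled binary from the submission script instead of submitting it directly.
/dfs/rulk/0/mydir/my_binary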

Resource manager directives

These directives tell the resource manager how to run your job. All of the directives start with a pound (#) character immediately followed by the keyword PBS:

#PBS -directive options

Standard output

Since you never know which server your program will run on once you submit it to the cluster, the resource manager deposits the standard output and standard error streams into a set of files in the directory your submission script ran from.

By default the resource manager will redirect all standard output of a job to a file named jobname.ojobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard output will be saved to myjob.o711.

You can override this behavior by using the -o directive:

#PBS -o /dfs/rulk/0/mydir/myjob.out

This will save all the standard output to the file /dfs/rulk/0/mydir/myjob.out. Please note that the file will be overwritten if you run the job more than once.

File locations

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs to deposit your standard output files. And make sure you always use a full (absolute) path specification. You should never use ./myjob.out, use /afs/cs.stanford.edu/u/your_csid/myjob.out instead. We know it's tedious to type but it will save you a whole lot of trouble.

Error output

By default the resource manager will redirect all standard error output of a job to a file named jobname.ejobid. So if you submitted myjob.sh and the resource manager assigned it the ID 711, the standard error will be saved to myjob.e711.

You can override this behavior by using the -e directive:

#PBS -e /dfs/rulk/0/mydir/myjob.error

This will save the standard error stream to the file /dfs/rulk/0/mydir/myjob.error. Please note that the file will be overwritten if you run the job more than once.

File locations

You should always use your home directory (which is on the AFS filesystem) or one of the filesystems mounted under /dfs to deposit your standard error files. And make sure you always use a full (absolute) path specification. You should never use ./myjob.error, use /afs/cs.stanford.edu/u/your_csid/myjob.error instead. We know it's tedious to type but it will save you a whole lot of trouble.
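
Putting the two output directives together, the top of a submission script might look like this sketch; the /dfs paths and my_program are only examples:

#PBS -o /dfs/rulk/0/mydir/myjob.out
#PBS -e /dfs/rulk/0/mydir/myjob.error

# The program to run, given with an absolute path.
/dfs/rulk/0/mydir/my_program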

Name

This directive lets you give your job a friendly name, which the resource manager also uses to build the default output file names (the jobname.ojobid and jobname.ejobid files described above).
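
For example:

#PBS -N my_job_name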

Mail directive

This directive tells the resource manager to send you an e-mail when your job is started and when it is finished. In a cluster environment your job may not start immediately, as it depends on the other jobs that are currently in the cluster's queues. The following will send you an e-mail both when the job starts executing and when it finishes.

#PBS -m be

Please note that the e-mail will not be delivered to your main CS account, but rather to the local mail spooler directory on the submission node (you could set up forwarding, but that should be a topic of another document).
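
Assuming the usual Torque meaning of the flags (b for begin, e for end), you can also ask for mail only when the job finishes:

#PBS -m e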

Paths

Preparing your job

Preparing your job for the scheduler is as simple as adding a few comments to a script that runs your program. Here is an example:

#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

echo "I am running on:"
hostname
sleep 20

The comment lines that start with the PBS keyword let you select different PBS options:

  • #PBS -N: lets you specify a friendly job name
  • #PBS -l nodes=1:ppn=1: specifies that I would like my job to run on a single node (nodes) and on a single core (ppn)
  • #PBS -l walltime=01:10:00: specifies the amount of real time I anticipate that my script will need to finish. Please note that the scheduler will terminate my script if it does not finish in time.

For a more comprehensive list of resources that you can specify with #PBS -l see here: http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml. Note, however, that there is currently only one queue without very many parameters set.
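
For example, to ask for eight cores spread over two nodes (the same nodes/ppn form used in the MPI example below) and a 24-hour wall-clock limit, you could combine:

#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00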

Make sure that your job uses data from:

  • Your CS home directory (whatever is under /afs/cs.stanford.edu/u/your_csid on hulk, rocky and snapx, please note that user home directories are not yet available under /u/your_csid on snapx)
  • Network mounted directories from rocky and hulk:
    • /dfs/hulk/0
    • /dfs/rocky/0
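
For example, a job script that keeps all of its data on the network-mounted filesystems might look like the sketch below; the your_csid directories, my_program, and input.txt are placeholders, not real paths on the cluster:

#PBS -N dfs_example
#PBS -l nodes=1:ppn=1
#PBS -o /dfs/rocky/0/your_csid/dfs_example.out
#PBS -e /dfs/rocky/0/your_csid/dfs_example.err

# Read input from one network mount and run a program stored on another.
/dfs/rocky/0/your_csid/my_program /dfs/hulk/0/your_csid/input.txt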

Here is an example of a slightly more complex script that runs an MPI job (copied from http://csc.cnsi.ucsb.edu/docs/running-jobs-torque):

#PBS -l nodes=2:ppn=4

# Make sure that we are in the same subdirectory as where the qsub command 
# is issued. 
cd $PBS_O_WORKDIR 

#  make a list of allocated nodes(cores)
cat $PBS_NODEFILE > nodes

# How many cores total do we have?
NO_OF_CORES=`cat $PBS_NODEFILE | egrep -v '^#'\|'^$' | wc -l | awk '{print $1}'`
NODE_LIST=`cat $PBS_NODEFILE `

# Just for kicks, see which nodes we got.
echo $NODE_LIST

# Run the executable. *DO NOT PUT* a '&' at the end!!
mpirun -np $NO_OF_CORES -machinefile nodes ./pi3 >& log 

Submitting your job

Now that your job is prepared, you have to submit it to the resource manager. Use qsub to submit it:

qsub myjob.sh

Make sure you run qsub from your CS home directory or from a network-mounted filesystem (see above). Once the job is finished, the output data will be waiting for you in the same directory, together with two additional files that end in e<job#> and o<job#>. These contain stderr and stdout, respectively.

Check the status of your job

You can check what is happening with your job with the qstat command:

qstat jobid

jobid is the number that the resource manager assigned to your job (the first number qsub will output after you successfully submit a job).
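
For example, to check on the job with ID 711 from the examples above:

qstat 711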

Other useful commands

  • qdel job_id: deletes your job
  • qstat -q: lists all queues
  • qstat -a: lists all jobs
  • qstat -au userid: lists all jobs submitted by userid
  • pbsnodes: lists the status of all the compute nodes

More coming soon...