InfolabClusterComputeHowtoJobArray

Revision 2 as of 2012-10-17 04:11:24, Editor: akrevl

Here is a hypothetical scenario: you have one program that is neither multi-threaded nor aware of multiple cores. You have to run that program about a thousand times with different input parameters and different input data. And, luckily, the result of a single run is independent of all the other results. This HOWTO describes how one might run such a scenario on the Infolab Compute Cluster.

We presume that you know your qsub basics. If that is not the case, please see InfolabClusterComputeHowtoSingle and InfolabClusterComputeHowtoVariables first.

The submission script

We'll tackle this one the other way around and create our submission script first. Log into the submission node ilhead1 and prepare the script. You can download it here: JobArray.qsub.sh

{{{
#!/bin/bash
#PBS -N JobArray
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00

/usr/bin/python2.7 $HOME/tutorial/JobArray/JobArray.py $PBS_ARRAYID
}}}

The only special thing here is that we pass the array id (the index of this job instance within the array, which the scheduler supplies in $PBS_ARRAYID) to our Python script.
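Conceptually, the scheduler runs the same submission script once per array id, each instance seeing its own value of $PBS_ARRAYID. A rough local sketch of that effect (illustration only; qsub actually launches the instances on cluster nodes, not in a loop):

```shell
# Illustration only: the scheduler runs the submission script once per
# array id, and each instance sees its own PBS_ARRAYID value.
for PBS_ARRAYID in 0 1 2 3; do
    echo "instance $PBS_ARRAYID runs: JobArray.py $PBS_ARRAYID"
done
```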

The program

We are again using the same simple Python script that sleeps for a while and prints some timing information along with the arguments it was called with.

{{{
#!/usr/bin/python2.7

import socket, datetime, time, getpass, sys

arrayid = sys.argv[1]

# We're just using a simple list here, but you can
# easily imagine this getting read from a file instead.
arguments = [
  [ "myarg1-0", "myarg2-0", "myarg3-0" ],
  [ "myarg1-1", "myarg2-1", "myarg3-1" ],
  [ "myarg1-2", "myarg2-2", "myarg3-2" ],
  [ "myarg1-3", "myarg2-3", "myarg3-3" ]
]

start = datetime.datetime.now()
hostname = socket.gethostname().split('.')[0]
username = getpass.getuser()
time.sleep(10)
end = datetime.datetime.now()

dfmt = "%Y-%m-%d %H:%M:%S"
print "Started: %s Finished: %s Host: %s User: %s" % (start.strftime(dfmt), end.strftime(dfmt), hostname, username)
print "My arguments:"
print arguments[int(arrayid)]
}}}

The only twist is that we read the actual arguments from a list hard-coded in the script itself. This could easily be replaced by reading from a CSV file or some other, neater argument store.
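As a minimal sketch of that replacement (the file name arguments.csv and the inline sample data are assumptions, not part of the original script), each array instance could look up its row of arguments like this:

```python
import csv
import io

# Stand-in for a hypothetical arguments.csv file:
# one row of arguments per array id.
csv_text = "myarg1-0,myarg2-0,myarg3-0\nmyarg1-1,myarg2-1,myarg3-1\n"

# In the real script you would use open("arguments.csv") here.
arguments = list(csv.reader(io.StringIO(csv_text)))

arrayid = "1"  # in the real script this comes from sys.argv[1]
print(arguments[int(arrayid)])  # -> ['myarg1-1', 'myarg2-1', 'myarg3-1']
```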

Submit the job

Nothing left to do but submit the job to the cluster with qsub:

{{{
qsub -V -t 0-3 $HOME/tutorial/JobArray/JobArray.qsub.sh
}}}

There are a few things to note about the -t argument. It specifies that the job should be run as a job array, and it also specifies the array ids that our instances will get. When we run the command above, we get instances 0, 1, 2 and 3. We could also specify the ids as a comma-delimited list; the following command does the same thing as the previous one:

{{{
qsub -V -t 0,1,2,3 $HOME/tutorial/JobArray/JobArray.qsub.sh
}}}

We could also make up our own non-sequential ids:

{{{
qsub -V -t 111,211,311,411 $HOME/tutorial/JobArray/JobArray.qsub.sh
}}}
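One caveat: with non-sequential ids like these, the example script's list indexing (arguments[int(arrayid)]) would raise an IndexError for id 111. One way to handle arbitrary ids, sketched here as an assumption rather than taken from the original script, is a dict keyed by array id:

```python
# A dict keyed by array id handles arbitrary, non-sequential ids.
# This mapping is illustrative, not from the original script.
arguments = {
    "111": ["myarg1-0", "myarg2-0", "myarg3-0"],
    "211": ["myarg1-1", "myarg2-1", "myarg3-1"],
    "311": ["myarg1-2", "myarg2-2", "myarg3-2"],
    "411": ["myarg1-3", "myarg2-3", "myarg3-3"],
}

arrayid = "311"  # in the real script this comes from sys.argv[1]
print(arguments[arrayid])  # -> ['myarg1-2', 'myarg2-2', 'myarg3-2']
```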

Anyhow, if our jobs ran successfully, we should be able to see the results in the output files. In our case:

{{{
~/ $ cat *.o*
Started: 2012-10-16 21:01:12 Finished: 2012-10-16 21:01:22 Host: iln28 User: akrevl
My arguments:
['myarg1-0', 'myarg2-0', 'myarg3-0']
Started: 2012-10-16 21:01:12 Finished: 2012-10-16 21:01:22 Host: iln28 User: akrevl
My arguments:
['myarg1-1', 'myarg2-1', 'myarg3-1']
Started: 2012-10-16 21:01:12 Finished: 2012-10-16 21:01:22 Host: iln28 User: akrevl
My arguments:
['myarg1-2', 'myarg2-2', 'myarg3-2']
Started: 2012-10-16 21:01:13 Finished: 2012-10-16 21:01:23 Host: iln28 User: akrevl
My arguments:
['myarg1-3', 'myarg2-3', 'myarg3-3']
}}}

So we successfully ran four instances of our script with four different sets of arguments. This is, of course, only one way of doing things, but it gets the job done.