This HOWTO describes how to run a single core job on the [[InfolabClusterCompute|Infolab Compute Cluster]].

= The program =

We are going to use a simple Python script as our main program for this HOWTO. You can download the script here: [[attachment:SingleCore.py]]

{{{#!highlight python
#!/usr/bin/python2.7
import socket, datetime, time, getpass

# Record the start time, the short hostname and the username
start = datetime.datetime.now()
hostname = socket.gethostname().split('.')[0]
username = getpass.getuser()

# Sleep for 10 seconds so the job has some measurable runtime
time.sleep(10)

end = datetime.datetime.now()
dfmt = "%Y-%m-%d %H:%M:%S"
print "Started: %s Finished: %s Host: %s User: %s" % (start.strftime(dfmt), end.strftime(dfmt), hostname, username)
}}}

The script records the current time, figures out the hostname it is running on and the username it is running as. It then sleeps for 10 seconds (so we at least have some impact on the cluster), records the time again and prints out a line that looks something like this:

{{{
Started: 2012-10-16 15:56:55 Finished: 2012-10-16 15:57:05 Host: ilhead1 User: akrevl
}}}

It's a good idea to check whether the program will run on the target platform. It doesn't make much difference for a Python script, but if you were running a C binary it would be worth checking that it runs on the AMD platform. This is where '''ild1''' comes in: the development node '''ild1''' is set up in the same way as the cluster nodes. So let's test the script on ild1:

{{{
/usr/bin/python2.7 /afs/cs.stanford.edu/u/akrevl/tutorial/SingleCore/SingleCore.py
}}}

Note that we are using the full path both to the Python executable and to the Python script. The result is as expected:

{{{
Started: 2012-10-16 17:04:44 Finished: 2012-10-16 17:04:54 Host: ild1 User: akrevl
}}}

= The submission script =

Now that we have the program up and running, let's log into the submission node '''ilhead1''' and prepare a submission script. You can download the script here: [[attachment:SingleCore.qsub.sh]]

{{{#!highlight bash
#!/bin/bash
#PBS -N SingleCoreJob
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
/usr/bin/python2.7 /afs/cs.stanford.edu/u/akrevl/tutorial/SingleCore/SingleCore.py
}}}

We give the submission a friendly name, ''SingleCoreJob'', and limit the job to a single node and a single CPU core (based on what our script does, there really is no reason to ask for more). We also limit the wall clock time to 1 minute. Since our program only sleeps for 10 seconds, a 1 minute wall time is more than enough for the job to complete.

= Submit the job =

Nothing is left to do but submit the job to the cluster with ''qsub'':

{{{
qsub -V /afs/cs.stanford.edu/u/akrevl/tutorial/SingleCore/SingleCore.qsub.sh
}}}

If we submitted the job successfully, the resource manager should reply with the ID of the job and the name of the headnode:

{{{
4651.ilhead1.stanford.edu
}}}

= Check on the job =

While the job is running, you can check on it with the ''qstat'' and ''showq'' commands. Please be patient with the ''showq'' command, as it tends to return timeouts when a lot of jobs are in the queue.
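If there are many jobs in the queue, it can help to narrow ''qstat'' down before looking at the full listing. The flags below are standard Torque ''qstat'' options, so they should be available on ilhead1, but check `man qstat` if your installation differs; 4651 simply stands in for whatever job ID ''qsub'' returned to you.

{{{#!highlight bash
# Show only the jobs that belong to the current user
qstat -u $USER

# Show the full record for a single job -- replace 4651 with your own job ID
qstat -f 4651
}}}

Run with no arguments, ''qstat'' lists the jobs in the queue, which is what the output below shows: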
{{{
~/ $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4651.ilhead1              SingleCoreJob    akrevl                 0 R test
}}}

{{{
~/ $ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4651               akrevl        Running   1    00:01:00  Tue Oct 16 17:19:29

1 Active Job       1 of 896 Processors Active (0.11%)
                   1 of  28 Nodes Active      (3.57%)
}}}

= The results =

Once the job is finished, it should deposit two files into the directory we ran ''qsub'' from:

 * '''!SingleCoreJob.e4651''': a copy of the standard error stream
 * '''!SingleCoreJob.o4651''': a copy of the standard output stream

Let's see what our directory contains:

{{{
~/ $ ls /afs/cs.stanford.edu/u/akrevl/tutorial/SingleCore
SingleCoreJob.e4651  SingleCoreJob.o4651  SingleCore.py  SingleCore.qsub.sh
}}}

Now let's look at the content of those files:

{{{
~/ $ cat SingleCoreJob.e4651
~/ $ cat SingleCoreJob.o4651
Started: 2012-10-16 17:19:29 Finished: 2012-10-16 17:19:39 Host: iln28 User: akrevl
}}}

Excellent: the standard error file is empty, and the standard output tells us that our job ran on node iln28 and finished (as expected) in 10 seconds.
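If you end up running many small jobs like this one, it can be convenient to script this final check instead of inspecting the files by hand. Below is a minimal sketch of such a helper; it is not part of the cluster tooling, just a plain bash script that assumes the job name and numeric ID used in this tutorial, so substitute your own values.

{{{#!highlight bash
#!/bin/bash
# Hypothetical helper: report whether a finished job wrote anything to
# standard error, and show its standard output. Adjust NAME and JOBID
# to match your own submission.
NAME=SingleCoreJob
JOBID=4651

if [ -s "${NAME}.e${JOBID}" ]; then
    echo "Job ${JOBID} wrote to standard error:"
    cat "${NAME}.e${JOBID}"
else
    echo "Standard error is empty; standard output follows:"
    cat "${NAME}.o${JOBID}"
fi
}}}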