Infolab cluster

Beta warning

Please note that the cluster is still in the testing phase. Things might break. Please join the mailing list and report any glitches that you come across.

Hardware

  • 2 head nodes
  • 2 development nodes
  • 36 compute nodes:
    • 1152 cores
    • 2.25 TB RAM
  • Each node:
    • 2x AMD Opteron 6276 (Interlagos) @2.3GHz - 3.2GHz, 16 cores/CPU, AMD64, VT
    • 64 GB RAM
    • 2 TB local HDD

Software

  • TORQUE resource manager
  • MAUI scheduler
  • Planned: Hadoop

Mailing list

There is a mailing list for everyone interested in what is currently happening with the cluster and in its configuration:

Access

Login to the headnode

You can submit jobs to the cluster from the head node, snapx.Stanford.EDU (sorry about the name; the head node will be renamed to ilh soon). First, SSH to the head node:

ssh your_cs_id@snapx.Stanford.EDU

Use your CSID username and password to log in to the head node.

Preparing your job

Preparing your job for the scheduler is as simple as adding a few #PBS comment lines to the script that runs your program. Here is an example:

#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

echo "I am running on:"
hostname
sleep 20

The comment lines that start with the #PBS keyword let you set various PBS options:

  • #PBS -N: lets you specify a friendly job name
  • #PBS -l nodes=1:ppn=1: requests a single node (nodes=1) with a single core on that node (ppn=1) for the job
  • #PBS -l walltime=01:10:00: specifies the amount of wall-clock time the script is expected to need (here 1 hour 10 minutes). Please note that the scheduler will terminate the job if it does not finish within this limit.

For a more comprehensive list of resources that you can specify with #PBS -l see here: http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml. Note, however, that there is currently only one queue, with very few parameters set.
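
For illustration, here is a sketch of a job script header that requests a whole 16-core node, an explicit memory limit, and a longer walltime (the job name, the resource values, and my_program are made-up examples, not site recommendations):

#PBS -N bigger_job
#PBS -l nodes=1:ppn=16
#PBS -l mem=32gb
#PBS -l walltime=04:00:00

# TORQUE starts the script in your home directory;
# $PBS_O_WORKDIR holds the directory you ran qsub from.
cd $PBS_O_WORKDIR
./my_program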

Make sure that your job uses data from one of the following locations (a short sketch follows this list):

  • Your CS home directory (whatever is under /afs/cs.stanford.edu/u/your_csid on hulk, rocky, and snapx; please note that user home directories are not yet available under /u/your_csid on snapx)
  • Network mounted directories from rocky and hulk:
    • /dfs/hulk/0
    • /dfs/rocky/0
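
As a sketch of what that means in practice, a job body might read its input from one network mount and write its output to another (the your_csid subdirectories and file names below are hypothetical):

# Both /dfs mounts and your AFS home directory are visible from every node.
INPUT=/dfs/hulk/0/your_csid/data/input.txt
OUTPUT=/dfs/rocky/0/your_csid/results/line_count.txt

wc -l "$INPUT" > "$OUTPUT"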

Submitting your job

Now that your job is prepared, you have to submit it to the resource manager. Use qsub to submit your jobs:

qsub myjob.sh

Make sure you run qsub from your CS home directory or from a network-mounted filesystem (see above). Once the job has finished, its output will be waiting in the same directory, along with two additional files ending in e<job#> and o<job#>; these contain the job's stderr and stdout, respectively.
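
Putting it together, a typical submission might look like this (the project directory is a hypothetical example, and my_job_name is the job name set in the example script above):

cd /dfs/rocky/0/your_csid/myproject   # a network-mounted directory (hypothetical path)
qsub myjob.sh                         # qsub prints the identifier of the new job
ls my_job_name.o* my_job_name.e*      # after the job finishes, its stdout/stderr land in these files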

Check the status of your job

You can check the status of your job with the qstat command:
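
A few common invocations (the job number 1234 is just a placeholder; -u and -f are standard TORQUE qstat options):

qstat                  # list all jobs known to the scheduler
qstat -u your_cs_id    # list only your own jobs
qstat -f 1234          # show full details for job 1234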

The compute cluster is running the TORQUE resource manager with the MAUI scheduler on all nodes, so every job has to be submitted through it as described above. Hulk's filesystem is mounted on /dfs/hulk/0 and Rocky's filesystem on /dfs/rocky/0.

More coming soon...