
Diff for "InfolabCluster"

Differences between revisions 7 and 17 (spanning 10 versions)
Revision 7 as of 2012-08-11 00:38:12
Size: 3409
Editor: akrevl
Comment:
Revision 17 as of 2015-10-16 22:34:40
Size: 2072
Editor: akrevl
Comment:
Deletions are marked with a leading '-'. Additions are marked with a leading '+'.
Line 6: Line 6:
- Please note that the cluster is still in the testing phase. Things might break. Please join the mailing list and report any glitches that you come across.
+ If Google can keep things in Beta, why can't we? So.. beware... Things might break. Please join the mailing list and report any glitches that you come across.
Line 9: Line 9:
- === Hardware ===
+ It was recently decided that having a split personality is not the best thing to have in a cluster as two different resource managers start to compete while not being aware of each other. That is why we have separated our cluster into a Compute cluster and a Hadoop cluster. Read on for more about the two.
Line 11: Line 11:
-  * 2 head nodes
-  * 2 development nodes
-  * 36 compute nodes:
-   * 1152 cores
-   * 2.25 TB RAM
-  * Each node:
-   * 2x AMD Opteron 6276 (Interlagos) @2.3GHz - 3.2GHz, 16 cores/CPU, AMD64, VT
-   * 64 GB RAM
-   * 2 TB local HDD
-
- === Software ===
-
-  * TORQUE resource manager
-  * MAUI scheduler
-  * Planned: Hadoop
-
- === Mailing list ===
+ == Mailing list ==
Line 34: Line 18:
- === Access ===
+ == Compute cluster ==
Line 36: Line 20:
- ==== Login to the headnode ====
+ The compute cluster comes in handy whenever you need a lot of cores to get your job done. It is just like looking at the CPU and memory load of the other servers and then deciding which one to use for your job, only the job scheduler will take care of looking at the CPU load for you and schedule the resources on a first-come, first-served basis (at least for the time being; queue priorities may change in the future).
Line 38: Line 22:
- You can submit jobs to the cluster from the head node: snapx.Stanford.EDU (sorry about the name, the head node will be renamed to ilh soon). First SSH to the headnode:
+ ==== Hardware ====
Line 40: Line 24:
- {{{
- ssh your_cs_id@snapx.Stanford.EDU
- }}}
+  * 1 head node: ilhead1
+  * 27 compute nodes: iln1 - iln28 (iln25 currently offline)
+   * 768 CPU cores
+   * 1728 GB RAM
Line 44: Line 29:
- Use your CSID username and password to log in to the headnode.
+ ==== Software ====
Line 46: Line 31:
- ==== Preparing your job ====
+  * Torque resource manager
+  * MAUI job scheduler
+  * CentOS 6.3
+
+ ==== Resources ====
+
+  * '''[[InfolabClusterCompute|Using the compute cluster]]'''
Line 48: Line 39:
- Preparing your job for the scheduler is as simple as adding a few comments to a script that runs your program. Here is an example:
+ == Hadoop cluster ==
Line 50: Line 41:
{{{
#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00
If you want to run map/reduce jobs then this is the cluster for you.
Line 55: Line 43:
- echo "I am running on:"
- hostname
- sleep 20
- }}}
+ ==== Hardware ====
Line 60: Line 45:
- The comment lines that start with the PBS keyword let you select different PBS options:
+  * 1 head node: ilhadoop1
+  * 39 nodes: ilh01 - ilh40 (ilh19 currently offline)
+   * 312 CPU cores
+   * 2496 GB RAM
+   * 312 TB raw storage
Line 62: Line 51:
-  * #PBS -N: lets you specify a friendly job name
-  * #PBS -l nodes=1:ppn=1: specifies that I would like my job to run on a single node (nodes) and on a single core (ppn)
-  * #PBS -l walltime=01:10:00: specifies the amount of real time I anticipate that my script will need to finish. Please note that the scheduler will terminate my script if it does not finish in time.
+ ==== Software ====
Line 66: Line 53:
- For a more comprehensive list of resources that you can specify with #PBS -l see here: http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml. Note, however, that there is currently only one queue without very many parameters set.
-
- Make sure that your job uses data from:
-
-  * Your CS home directory (whatever is under /afs/cs.stanford.edu/u/your_csid on hulk, rocky and snapx; please note that user home directories are not yet available under /u/your_csid on snapx)
-  * Network mounted directories from rocky and hulk:
-   * /dfs/hulk/0
-   * /dfs/rocky/0
-
- ==== Submitting your job ====
-
- Now that your job is prepared, you have to submit it to the resource manager. Use qsub to submit your jobs:
-
- {{{
- qsub myjob.sh
- }}}
-
- Make sure you run qsub from your CS home directory or from a network mounted filesystem (see above). Once the job is finished, output data will wait for you in the same directory, along with two additional files ending in e<job#> and o<job#>. These are stderr and stdout, respectively.
-
- ==== Check the status of your job ====
-
- You can check what is happening with your job with the qstat command:
-
- {{{
- }}}
-
- Since the cluster is running Torque you need to submit a job to it.
-
- The compute cluster is running the TORQUE resource manager with
-
- More coming soon...
-
- on all nodes
- Hulk's filesystem is mounted on: /dfs/hulk/0
- Rocky's filesystem is mounted on: /dfs/rocky/0
+  * Cloudera CDH 5.4.2
+  * CentOS 6.3
+
+ ==== Resources ====
+
+  * '''[[ilHadoopStats|Hadoop Cluster Statistics]]'''
+  * '''[[InfolabClusterHadoop|Using the Hadoop cluster]]'''

Infolab cluster

Beta warning

If Google can keep things in Beta, why can't we? So.. beware... Things might break. Please join the mailing list and report any glitches that you come across.

It was recently decided that having a split personality is not the best thing to have in a cluster as two different resource managers start to compete while not being aware of each other. That is why we have separated our cluster into a Compute cluster and a Hadoop cluster. Read on for more about the two.

Mailing list

There is a mailing list for all those interested in what is currently happening with the cluster and in its configuration:

Compute cluster

The compute cluster comes in handy whenever you need a lot of cores to get your job done. It is just like looking at the CPU and memory load of the other servers and then deciding which one to use for your job, only the job scheduler will take care of looking at the CPU load for you and schedule the resources on a first-come, first-served basis (at least for the time being; queue priorities may change in the future).
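
To make this concrete, here is a minimal sketch of a Torque job, adapted from the job-script example in the earlier revision of this page. The job name, resource requests and script body are placeholders; see the "Using the compute cluster" page (InfolabClusterCompute) for the current instructions.

{{{
#!/bin/bash
# myjob.sh -- minimal Torque/PBS job sketch (placeholder values)
#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00

# The directives above set a friendly job name, request a single core on a
# single node, and ask for 1h10m of wall-clock time; the scheduler
# terminates the job if it runs longer than that.
echo "I am running on:"
hostname
sleep 20
}}}

Submit the script from the head node and check on it with the standard Torque tools:

{{{
qsub myjob.sh   # submit to the queue; prints the assigned job id
qstat           # list your queued and running jobs
}}}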

Hardware

  • 1 head node: ilhead1
  • 27 compute nodes: iln1 - iln28 (iln25 currently offline)
    • 768 CPU cores
    • 1728 GB RAM

Software

  • Torque resource manager
  • MAUI job scheduler
  • CentOS 6.3

Resources

  • Using the compute cluster (InfolabClusterCompute)

Hadoop cluster

If you want to run map/reduce jobs then this is the cluster for you.
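
As a rough illustration of the workflow (a sketch only: the file names are placeholders, and the examples jar path is what a stock Cloudera CDH package install usually provides, which may differ on this cluster; see the "Using the Hadoop cluster" page, InfolabClusterHadoop, for the real instructions), a classic wordcount map/reduce job looks like this:

{{{
# Stage some input in HDFS, run the stock wordcount example, read the result.
# The jar path below is the usual CDH package location and is an assumption here.
hadoop fs -mkdir -p wordcount/in
hadoop fs -put mytext.txt wordcount/in/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount wordcount/in wordcount/out
hadoop fs -cat wordcount/out/part-r-00000
}}}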

Hardware

  • 1 head node: ilhadoop1
  • 39 nodes: ilh01 - ilh40 (ilh19 currently offline)
    • 312 CPU cores
    • 2496 GB RAM
    • 312 TB raw storage

Software

  • Cloudera CDH 5.4.2
  • CentOS 6.3

Resources

  • Hadoop Cluster Statistics (ilHadoopStats)
  • Using the Hadoop cluster (InfolabClusterHadoop)