
Diff for "InfolabCluster"

Differences between revisions 10 and 17 (spanning 7 versions)
Revision 10 as of 2012-10-16 19:09:52
Size: 4615
Editor: akrevl
Comment:
Revision 17 as of 2015-10-16 22:34:40
Size: 2072
Editor: akrevl
Comment:
Deletions (from revision 10) are shown first; additions (in revision 17) follow.
Line 9: Line 9:
=== Hardware ===
It was recently decided that having a split personality is not the best thing for a cluster: two different resource managers end up competing for resources without being aware of each other. That is why we have separated our cluster into a Compute cluster and a Hadoop cluster. Read on for more about the two.
Line 11: Line 11:
 * 2 head nodes
 * 2 development nodes
 * 36 compute nodes:
  * 1152 cores
  * 2.25 TB RAM
 * Each node:
  * 2x AMD Opteron 6276 (Interlagos) @2.3GHz - 3.2GHz, 16 cores/CPU, AMD64, VT
  * 64 GB RAM
  * 2 TB local HDD

=== Mailing list ===
== Mailing list ==
Line 28: Line 18:
=== Software ===
== Compute cluster ==
Line 30: Line 20:
We decided that it is not good to have a split personality; that is why we now have a set of nodes dedicated to the compute cluster and another set dedicated to a Hadoop cluster.
The compute cluster comes in handy whenever you need a lot of cores to get your job done. It is just like looking at the CPU and memory load of the other servers and then deciding which one to use for your job, except that the job scheduler looks at the load for you and schedules the resources on a first-come, first-served basis (at least for the time being; queue priorities may change in the future).
Line 32: Line 22:
 * [[InfolabClusterCompute|Compute cluster]]:
   * TORQUE resource manager, MAUI job scheduler
   * Nodes: iln01-iln28
   * Submission node: ilhead1
 * Hadoop cluster
   * Apache Hadoop
   * Nodes: iln29-iln36
   * Submission node: iln29
==== Hardware ====
Line 41: Line 24:
=== Access ===
 * 1 head node: ilhead1
 * 27 compute nodes: iln1 - iln28 (iln25 currently offline)
  * 768 CPU cores
  * 1728 GB RAM
Line 43: Line 29:
==== Login to the headnode ====
==== Software ====
Line 45: Line 31:
You can submit jobs to the cluster from the head node: snapx.Stanford.EDU (sorry about the name, the head node will be renamed to ilh soon). First SSH to the headnode:
 * Torque resource manager
 * MAUI job scheduler
 * CentOS 6.3
  
==== Resources ====
 
 * '''[[InfolabClusterCompute|Using the compute cluster]]'''
Line 47: Line 39:
{{{
ssh your_cs_id@snapx.Stanford.EDU
}}}
== Hadoop cluster ==
Line 51: Line 41:
Use your CSID username and password to login to the headnode.
If you want to run map/reduce jobs then this is the cluster for you.
Line 53: Line 43:
==== Preparing your job ====
==== Hardware ====
Line 55: Line 45:
Preparing your job for the scheduler is as simple as adding a few comments to a script that runs your program. Here is an example:
 * 1 head node: ilhadoop1
 * 39 nodes: ilh01 - ilh40 (ilh19 currently offline)
  * 312 CPU cores
  * 2496 GB RAM
  * 312 TB raw storage
Line 57: Line 51:
{{{
#PBS -N my_job_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:10:00
==== Software ====
Line 62: Line 53:
echo "I am running on:"
hostname
sleep 20
}}}

The comment lines that start with #PBS let you set different PBS options:

 * #PBS -N: lets you specify a friendly job name
 * #PBS -l nodes=1:ppn=1: requests a single node (nodes) and a single core per node (ppn)
 * #PBS -l walltime=01:10:00: specifies the amount of wall-clock time you expect your script to need. Please note that the scheduler will terminate the job if it does not finish in time.

For a more comprehensive list of resources that you can specify with #PBS -l, see http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml. Note, however, that there is currently only one queue and very few parameters are set on it.
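
Here is a minimal sketch of a script that asks for four cores on a single node; my_program is a hypothetical stand-in for your own executable:

{{{
#!/bin/bash
#PBS -N my_parallel_job        # friendly name shown by qstat
#PBS -l nodes=1:ppn=4          # one node, four cores on that node
#PBS -l walltime=00:30:00      # the job is killed after 30 minutes of wall-clock time

cd $PBS_O_WORKDIR              # start in the directory where qsub was issued
echo "Allocated cores:"
cat $PBS_NODEFILE              # one line per allocated core
./my_program                   # hypothetical executable in the submission directory
}}}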

Make sure that your job reads and writes its data from one of the following locations (a short sketch follows this list):

 * Your CS home directory (whatever is under /afs/cs.stanford.edu/u/your_csid on hulk, rocky and snapx; please note that user home directories are not yet available under /u/your_csid on snapx)
 * Network mounted directories from rocky and hulk:
  * /dfs/hulk/0
  * /dfs/rocky/0
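
For illustration, a minimal job script that keeps all of its data on the rocky network mount might look like this; the your_csid subdirectory and the program and file names are hypothetical:

{{{
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=00:10:00

# Hypothetical layout: your own directory under the rocky network mount.
cd /dfs/rocky/0/your_csid
./process_data input.txt output.txt   # stand-in for your own program and data files
}}}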

Here is an example of a slightly more complex script that runs an MPI job (copied from http://csc.cnsi.ucsb.edu/docs/running-jobs-torque):

{{{
#!/bin/sh
#PBS -l nodes=2:ppn=4

# Make sure that we are in the directory where the qsub command was issued.
cd $PBS_O_WORKDIR

# Make a list of the allocated nodes (one line per allocated core).
cat $PBS_NODEFILE > nodes

# How many cores in total do we have?
NO_OF_CORES=`egrep -v '^#|^$' $PBS_NODEFILE | wc -l`
NODE_LIST=`cat $PBS_NODEFILE`

# Just for kicks, see which nodes we got.
echo $NODE_LIST

# Run the executable. *DO NOT PUT* a '&' at the end!!
mpirun -np $NO_OF_CORES -machinefile nodes ./pi3 > log 2>&1
}}}


==== Submitting your job ====

Now that your job is prepared you have to submit it to the resource manager. Use qsub to submit your jobs:

{{{
qsub myjob.sh
}}}

Make sure you run qsub from your CS home directory or from a network-mounted filesystem (see above). Once the job has finished, its output data will be waiting in the same directory, together with two additional files that end in e<job#> and o<job#>. These hold stderr and stdout, respectively.
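
Putting this together, a typical session from a network-mounted directory might look like the following sketch; the experiment directory is hypothetical and the output file names are only described, not reproduced exactly:

{{{
cd /dfs/hulk/0/your_csid/experiment   # hypothetical working directory on a network mount
qsub myjob.sh                         # prints the job id assigned by the resource manager
qstat                                 # watch the job until it disappears from the queue
ls                                    # afterwards: your output data plus the e<job#> and o<job#> files
}}}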

==== Check the status of your job ====

You can check what is happening with your job with the qstat command:

{{{
qstat jobid
}}}

jobid is the number that the resource manager assigned to your job (the first number qsub will output after you successfully submit a job).

==== Other useful commands ====

 * qdel job_id: deletes your job
 * qstat -q: lists all queues
 * qstat -a: lists all jobs
 * qstat -au userid: lists all jobs submitted by userid
 * pbsnodes: list status of all the compute nodes
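
These commands compose in the usual shell way. For example, the following sketch (not verified on this cluster) uses qselect, which standard Torque installs alongside qstat, to delete all of your own jobs at once:

{{{
qselect -u your_csid | xargs qdel   # select every job id owned by your_csid and delete it
}}}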

More coming soon...
 * Cloudera CDH 5.4.2
 * CentOS 6.3
  
==== Resources ====
 
 * '''[[ilHadoopStats|Hadoop Cluster Statistics]]'''
 * '''[[InfolabClusterHadoop|Using the Hadoop cluster]]'''

Infolab cluster

Beta warning

If Google can keep things in Beta, why can't we? So... beware: things might break. Please join the mailing list and report any glitches that you come across.

It was recently decided that having a split personality is not the best thing for a cluster: two different resource managers end up competing for resources without being aware of each other. That is why we have separated our cluster into a Compute cluster and a Hadoop cluster. Read on for more about the two.

Mailing list

There is a mailing list for everyone interested in what is currently happening with the cluster and in its configuration:

Compute cluster

The compute cluster comes in handy whenever you need a lot of cores to get your job done. It is just like looking at the CPU and memory load of the other servers and then deciding which one to use for your job, except that the job scheduler looks at the load for you and schedules the resources on a first-come, first-served basis (at least for the time being; queue priorities may change in the future).

Hardware

  • 1 head node: ilhead1
  • 27 compute nodes: iln1 - iln28 (iln25 currently offline)
    • 768 CPU cores
    • 1728 GB RAM

Software

  • Torque resource manager
  • MAUI job scheduler
  • CentOS 6.3

Resources

  • Using the compute cluster

Hadoop cluster

If you want to run map/reduce jobs then this is the cluster for you.

Hardware

  • 1 head node: ilhadoop1
  • 39 nodes: ilh01 - ilh40 (ilh19 currently offline)
    • 312 CPU cores
    • 2496 GB RAM
    • 312 TB raw storage

Software

  • Cloudera CDH 5.4.2
  • CentOS 6.3
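
As a quick sanity check once you have access, a word-count run using the stock CDH examples jar might look like the sketch below. The jar path and the HDFS directories are assumptions about a default CDH 5 installation, not verified settings for this cluster; see the links under Resources for the authoritative instructions.

{{{
# Run from ilhadoop1. Paths assume a default CDH packaged install.
hdfs dfs -mkdir -p wordcount/input
hdfs dfs -put mybook.txt wordcount/input            # mybook.txt is a stand-in local file
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount wordcount/input wordcount/output
hdfs dfs -cat wordcount/output/part-r-00000 | head  # inspect the first few result lines
}}}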

Resources

  • Hadoop Cluster Statistics
  • Using the Hadoop cluster