Revision 1 as of 2012-09-11 18:17:26

Spinn3rDataSet

Spinn3r data

Zarya

Zarya is our main crawler that downloads data from spinn3r.com.

Dataset directories

  • full3: /lfs/1/tmp/spinn3r-full3/

Crawler

  • /lfs/1/tmp/spinn3r-client/runclient.sh
  • Runs every hour: 0:00, 1:00, 2:00, ...

Copy data to Hulk

The following script copies the daily downloaded data from zarya to hulk.Stanford.EDU.

/u/snap/spinn3r/
  rsync-spinn3r-full3-to-hulk.sh  # Copies all the daily spinn3r data from zarya to hulk
  log/
    rsync-spinn3r-full3-to-hulk.log # The rsync log file
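
The contents of rsync-spinn3r-full3-to-hulk.sh are not reproduced on this page. As a rough sketch only, a copy step of this kind might look like the following. The source directory, the log file, and the hulk full3 directory are taken from this page; that the script syncs the whole directory to that destination, the rsync flags, and the names copy_to_hulk and RSYNC are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch; the real rsync-spinn3r-full3-to-hulk.sh may differ.

SRC="/lfs/1/tmp/spinn3r-full3/"    # full3 directory on zarya
# Assumed destination: the full3 dataset directory on hulk
DEST="hulk.Stanford.EDU:/lfs/hulk/0/datasets/spinn3r/spinn3r-full3/"
LOG="/u/snap/spinn3r/log/rsync-spinn3r-full3-to-hulk.log"

# The rsync binary can be overridden, e.g. for testing.
RSYNC="${RSYNC:-rsync}"

# Copy everything under SRC to DEST.
#   -a          archive mode (recursive, preserves times and permissions)
#   --partial   keep partially transferred files so a rerun can resume
#   --log-file  write rsync's own log to LOG
# Extra arguments (for example --dry-run) are passed through to rsync.
copy_to_hulk() {
    "$RSYNC" -a --partial --log-file="$LOG" "$@" "$SRC" "$DEST"
}
```

Calling copy_to_hulk --dry-run lists what would be transferred without copying anything.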

The script is run from crontab as the snap user:

  • 2:45am: rsync-spinn3r-full3-to-hulk.sh

In order for the script to work, the snap user on zarya needs a valid Kerberos ticket for the cs domain. You can check this by running klist as the snap user on zarya. The output for a valid ticket should look something like this:

Valid starting     Expires            Service principal
08/31/12 10:31:28  09/03/12 10:31:27  krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
        renew until 09/30/12 10:31:27

Once you have the ticket, the krb_renew daemon renews it for you. The ticket has to be obtained again by hand after every server reboot, though. For more info see SshKerberos.
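
The ticket check can also be scripted, for example to run as the snap user after a reboot. A minimal sketch, assuming a klist that supports the -s option (as in MIT Kerberos), which prints nothing and reports through its exit status whether the credential cache holds a valid ticket. The function names and the KLIST variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper; not part of the existing scripts.

# The klist binary can be overridden, e.g. for testing.
KLIST="${KLIST:-klist}"

# Succeeds (exit status 0) if the current user has a valid, unexpired ticket.
have_valid_ticket() {
    "$KLIST" -s
}

# Print a one-line status suitable for a log file or an e-mail.
ticket_status() {
    if have_valid_ticket; then
        echo "kerberos ticket: valid"
    else
        echo "kerberos ticket: missing or expired (see SshKerberos)"
    fi
}
```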

Checking for downloaded files

There is a script that checks if we have all the spinn3r files for the past 14 days.

/u/snap/checkF3/
  checkF3.sh  # Checks for the full3 files
  log/
    2012-09-09-checkF3.log 
    2012-09-10-checkF3.log 
    ...
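
How checkF3.sh decides that a file is present is not described on this page. As an illustration only, a 14-day check could be sketched as below. It assumes GNU date and that every data file name contains its date as YYYY-MM-DD; the real full3 file naming may differ, and the function name missing_days is hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch; the real checkF3.sh may work differently.

# Print each of the DAYS days before END for which DIR has no matching file.
#   $1  directory to check
#   $2  number of days to look back (default 14)
#   $3  reference date as YYYY-MM-DD (default: today)
missing_days() {
    local dir=$1 days=${2:-14} end=${3:-$(date +%F)}
    local i=1 day
    while [ "$i" -le "$days" ]; do
        # GNU date: step back i days; -u avoids daylight-saving edge cases
        day=$(date -u -d "$end $i days ago" +%F)
        # Expand the glob into the positional parameters. If nothing matches,
        # $1 is the unexpanded pattern, which does not exist as a file.
        set -- "$dir"/*"$day"*
        [ -e "$1" ] || echo "$day"
        i=$((i + 1))
    done
}
```

For example, missing_days /lfs/1/tmp/spinn3r-full3 prints one line per day in the last 14 days that has no file; empty output means nothing is missing under the naming assumption above.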

The script is run from crontab as the snap user:

  • 3:15am: checkF3.sh

  • The script reports by e-mail which files it found and which files had problems. The report is currently sent to rok@cs, akrevl@cs.
  • Crontab log goes to: /u/snap/.crontab/log-checkF3-zarya.txt
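
Put together, the schedule on zarya described above corresponds to crontab entries along these lines. This is a sketch, not a copy of the actual crontab: the times and paths come from this page, while the assumption that the crawler also runs from the snap user's crontab and the exact form of the output redirection (append, with stderr included) are guesses.

```text
# m   h   dom mon dow   command
0     *   *   *   *     /lfs/1/tmp/spinn3r-client/runclient.sh
45    2   *   *   *     /u/snap/spinn3r/rsync-spinn3r-full3-to-hulk.sh
15    3   *   *   *     /u/snap/checkF3/checkF3.sh >> /u/snap/.crontab/log-checkF3-zarya.txt 2>&1
```

The ordering matters: the copy to hulk at 2:45am is scheduled before the 3:15am checks.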

Hulk

Dataset directories

  • full3: /lfs/hulk/0/datasets/spinn3r/spinn3r-full3
  • full5: /lfs/hulk/0/datasets/spinn3r/spinn3r-full5

Checking for spinn3r full3 files

The checkF3 script checks if all the files for the day have been successfully copied from Zarya.Stanford.EDU:

/lfs/local/0/snap/checkF3/
  checkF3.sh # Checks for the full3 files
  log/
    2012-09-09-checkF3.log
    2012-09-10-checkF3.log
    ...

The script is run from crontab as the snap.crontab user:

  • 3:15am: checkF3.sh

  • The script reports by e-mail which files it found and which files had problems. The report is currently sent to rok@cs, akrevl@cs.
  • Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF3-hulk.txt

  CheckSpinn3rData-Full5.sh  # Checks for the full5 files
  log/
    CheckSpinn3rData.log   # Old log file for the full3 check (this can be discarded once checked)
    CheckSpinn3rData-Full3.log   # Log file for the full3 check
    CheckSpinn3rData-Full5.log   # Log file for the full5 check

Both of the scripts are run daily as the user snap.cron:

  • 3:15am: CheckSpinn3rData-Full3.sh

  • 5:00am: CheckSpinn3rData-Full5.sh

The scripts send an e-mail only if they encounter errors. The e-mail is currently sent to rok@cs, akrevl@cs.