Spinn3r data
Zarya
Zarya is our main crawler that downloads data from spinn3r.com.
Dataset directories
- full3: /lfs/1/tmp/spinn3r-full3/
Crawler
- /lfs/1/tmp/spinn3r-client/runclient.sh
- Runs every hour: 0:00, 1:00, 2:00, ...
Copy data to Hulk
The following script copies the daily downloaded data to hulk.Stanford.EDU:
/u/snap/spinn3r/
rsync-spinn3r-full3-to-hulk.sh # Copies all the daily spinn3r data from zarya to hulk
log/
rsync-spinn3r-full3-to-hulk.log # The rsync log file
The script is run from crontab as the snap user:
2:45am: rsync-spinn3r-full3-to-hulk.sh
For the script to work, the snap user on zarya needs a valid Kerberos ticket for the CS domain. You can check this by running klist as the snap user on zarya. The output for a valid ticket should look something like this:
Valid starting Expires Service principal
08/31/12 10:31:28 09/03/12 10:31:27 krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
renew until 09/30/12 10:31:27
Once you get the ticket, the krb_renew daemon will renew it for you. You have to get a new ticket after every server reboot, though. For more info see SshKerberos.
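Because the copy depends on a valid ticket, one option is to check for it before calling rsync. The following is a minimal sketch under stated assumptions, not the actual contents of rsync-spinn3r-full3-to-hulk.sh: the function names, the rsync flags, and the choice of the full3 directories as source and destination are all assumptions. It relies on klist -s, the silent mode of MIT Kerberos klist, which exits with a non-zero status when the credentials cache cannot be read or has expired.

```shell
#!/bin/bash
# Hypothetical sketch: refuse to copy unless the current user holds a valid ticket.

SRC="/lfs/1/tmp/spinn3r-full3/"   # full3 directory on zarya
DEST="hulk.Stanford.EDU:/lfs/hulk/0/datasets/spinn3r/spinn3r-full3/"   # assumed target
LOG="/u/snap/spinn3r/log/rsync-spinn3r-full3-to-hulk.log"

has_valid_ticket() {
    # klist -s prints nothing; its exit status says whether the cache is usable.
    klist -s
}

copy_to_hulk() {
    if ! has_valid_ticket; then
        echo "No valid Kerberos ticket; get one as the snap user first." >&2
        return 1
    fi
    # -a recurses and preserves file attributes; --log-file records the transfer.
    rsync -a --log-file="$LOG" "$SRC" "$DEST"
}
```

Calling copy_to_hulk from the cron job would then fail fast with a clear message after a reboot, instead of leaving an rsync authentication error in the log.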
Checking for downloaded files
There is a script that checks if we have all the spinn3r files for the past 14 days.
/u/snap/checkF3/
checkF3.sh # Checks for the full3 files
log/
2012-09-09-checkF3.log
2012-09-10-checkF3.log
...
The script is run from crontab as the snap user:
3:15am: checkF3.sh
- The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
- Crontab log goes to: /u/snap/.crontab/log-checkF3-zarya.txt
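The idea behind the 14-day check can be sketched as follows. This is a hypothetical illustration, not the real checkF3.sh: the function name is made up, and it assumes that each day's files have names starting with the date in YYYY-MM-DD form, which may not match the actual full3 layout. It also assumes GNU date and starts counting from yesterday, since today's data is not yet complete when the job runs.

```shell
#!/bin/bash
# Hypothetical sketch of a completeness check over the past N days (default 14).
# Assumption: each day's files have names that start with YYYY-MM-DD.

check_recent_days() {
    local dir="$1" days="${2:-14}" missing=0 i=1 day
    while [ "$i" -le "$days" ]; do
        day=$(date -d "$i days ago" +%Y-%m-%d)   # GNU date syntax
        # ls succeeds only if the glob matches at least one existing path.
        if ls "$dir/$day"* > /dev/null 2>&1; then
            echo "OK      $day"
        else
            echo "MISSING $day"
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    # A non-zero exit status signals that at least one day had no files.
    [ "$missing" -eq 0 ]
}
```

For example, check_recent_days /lfs/1/tmp/spinn3r-full3 14 prints one line per day and returns a non-zero status if any day is missing, which a cron wrapper could use to decide what to report.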
Hulk
Dataset directories
- full3: /lfs/hulk/0/datasets/spinn3r/spinn3r-full3
- full5: /lfs/hulk/0/datasets/spinn3r/spinn3r-full5
Checking for spinn3r full3 files
The checkF3 script checks if all the files for the day have been successfully copied from Zarya.Stanford.EDU:
/lfs/local/0/snap/checkF3/
checkF3.sh # Checks for the full3 files
log/
2012-09-09-checkF3.log
2012-09-10-checkF3.log
...
The script is run from crontab as the snap.crontab user:
3:15am: checkF3.sh
- The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
- Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF3-hulk.txt
CheckSpinn3rData-Full5.sh # Checks for the full5 files
log/
CheckSpinn3rData.log # Old log file for the full3 check (this can be discarded once checked)
CheckSpinn3rData-Full3.log # Log file for the full3 check
CheckSpinn3rData-Full5.log # Log file for the full5 check
Both of the scripts are run daily as the user snap.cron:
3:15am: CheckSpinn3rData-Full3.sh
5:00am: CheckSpinn3rData-Full5.sh
The scripts will only send an e-mail if they have encountered any errors. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
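The "mail only on errors" behavior can be sketched as below. This is a hypothetical illustration rather than the code of the CheckSpinn3rData scripts: the function name, the subject line, and the convention that problem lines in the report start with ERROR are all assumptions. It uses the standard mail command with its -s subject option.

```shell
#!/bin/bash
# Hypothetical sketch: mail the report only when the check found problems.

RECIPIENTS="rok@cs akrevl@cs"   # addresses as listed on this page

report_if_errors() {
    local report_file="$1"
    # grep -q exits 0 only if a line starts with ERROR (the marker is an assumption).
    if grep -q '^ERROR' "$report_file"; then
        # RECIPIENTS is left unquoted on purpose so each address is its own argument.
        mail -s "spinn3r check: problems found" $RECIPIENTS < "$report_file"
    fi
}
```

With this shape a clean run sends nothing, so any message that does arrive means a problem was found.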