Spinn3r dataset
Zarya
Zarya is our main crawler that downloads data from spinn3r.com.
Dataset directories
- full3: /lfs/1/tmp/spinn3r-full3/
Crawler
- /lfs/1/tmp/spinn3r-client/runclient.sh
- Runs every hour: 0:00, 1:00, 2:00, ...
Copy data to Hulk
This script copies the daily downloaded data to hulk.Stanford.EDU.
/u/snap/spinn3r/
    rsync-spinn3r-full3-to-hulk.sh        # Copies all the daily spinn3r data from zarya to hulk
    log/
        rsync-spinn3r-full3-to-hulk.log   # The rsync log file
The script is run from crontab as the snap user:
- 2:45am: rsync-spinn3r-full3-to-hulk.sh
For the script to work, the snap user on zarya needs a valid Kerberos ticket for the cs domain. You can check this by running klist as the snap user on zarya. The output for a valid ticket should look something like this:
Valid starting     Expires            Service principal
08/31/12 10:31:28  09/03/12 10:31:27  krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
        renew until 09/30/12 10:31:27
Once you have the ticket, the krb_renew daemon renews it for you. You have to obtain a new ticket after every server reboot, though. For more info see SshKerberos.
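The ticket check can also be done programmatically. The following is a minimal sketch, assuming klist prints dates in the MM/DD/YY HH:MM:SS layout shown above; real klist output varies with the Kerberos implementation and locale. The names tgt_expiry and has_valid_tgt are hypothetical helpers, not part of any existing script on zarya.

```python
import re
from datetime import datetime

# Matches a ticket line in the layout shown above (an assumption), e.g.:
# 08/31/12 10:31:28  09/03/12 10:31:27  krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
TICKET_LINE = re.compile(
    r"^\s*(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(krbtgt/\S+)"
)


def tgt_expiry(klist_output):
    """Return the expiry time of the first krbtgt ticket, or None if absent."""
    for line in klist_output.splitlines():
        match = TICKET_LINE.match(line)
        if match:
            # Group 2 is the "Expires" column.
            return datetime.strptime(match.group(2), "%m/%d/%y %H:%M:%S")
    return None


def has_valid_tgt(klist_output, now):
    """True if a krbtgt ticket exists and has not expired at time `now`."""
    expiry = tgt_expiry(klist_output)
    return expiry is not None and expiry > now
```

To use it, capture the text printed by klist (for example with subprocess.run and capture_output=True) and pass it to has_valid_tgt together with datetime.now().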
Checking for downloaded files
There is a script that checks if we have all the spinn3r files for the past 14 days.
/u/snap/checkF3/
    checkF3.sh    # Checks for the full3 files
    log/
        2012-09-09-checkF3.log
        2012-09-10-checkF3.log
        ...
The script is run from crontab as the snap user:
- 3:15am: checkF3.sh
  - The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs. The report's subject starts with [snap-srv].
  - Crontab log goes to: /u/snap/.crontab/log-checkF3-zarya.txt
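The idea behind the 14-day check can be sketched as follows. This is an illustration only, not the contents of checkF3.sh: it assumes one file per hour (the crawler runs hourly) and that each file name contains a stamp of the form YYYY-MM-DDTHH-00-00, which is a guess based on the log names elsewhere on this page. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def expected_hours(now, days=14):
    """Yield one timestamp per full hour in the `days` days before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    current = end - timedelta(days=days)
    while current < end:
        yield current
        current += timedelta(hours=1)


def missing_hours(present_names, now, days=14):
    """Return the expected hourly stamps that appear in no file name."""
    missing = []
    for hour in expected_hours(now, days):
        # Assumed stamp layout; adjust to the real file naming scheme.
        stamp = hour.strftime("%Y-%m-%dT%H-00-00")
        if not any(stamp in name for name in present_names):
            missing.append(stamp)
    return missing
```

Passing the result of os.listdir on the full3 directory as present_names yields the hours for which no file was found, under the naming assumption above.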
Hulk
There are two versions of the Spinn3r dataset on Hulk: Full3, which is copied from Zarya every night, and Full5, which is converted from the Full3 dataset. The conversion also runs every night on Hulk.
Dataset directories
- full3: /lfs/hulk/0/datasets/spinn3r/spinn3r-full3
- full5: /lfs/hulk/0/datasets/spinn3r/spinn3r-full5
Checking for Full3 files
The checkF3 script checks if all the files for the day have been successfully copied from Zarya.Stanford.EDU:
/lfs/local/0/snap/checkF3/
    checkF3.sh    # Checks for the full3 files
    log/
        2012-09-09-checkF5.txt
        2012-09-10-checkF5.txt
        ...
The script is run from crontab as the snap.crontab user:
- 3:15am: checkF3.sh
  - The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs. The report's subject starts with [snap-srv].
  - Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF3-hulk.txt
Checking for Full5 files
The checkF5 script checks if all the Full3 files have been successfully converted to the Full5 format. This is the directory structure used:
/lfs/local/0/snap/checkF5/
    checkF5.sh    # Checks for the full5 files
    log/
        2012-09-09-checkF5.txt
        2012-09-10-checkF5.txt
        ...
The script is run from crontab as the snap.crontab user:
- 6:15am: checkF5.sh
  - The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs. The report's subject starts with [snap-srv].
  - Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF5-hulk.txt
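A conversion check of this kind boils down to comparing the two directories. The sketch below is not the actual checkF5.sh logic. It assumes that a Full3 file and the Full5 files produced from it share a timestamp of the form YYYY-MM-DDTHH-MM-SS in their names, which this page does not confirm. The helper names are hypothetical.

```python
import re

# Assumed timestamp layout embedded in both Full3 and Full5 file names.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}")


def stamps(names):
    """Collect the timestamps embedded in a list of file names."""
    found = set()
    for name in names:
        match = STAMP.search(name)
        if match:
            found.add(match.group(0))
    return found


def unconverted(full3_names, full5_names):
    """Return, sorted, the Full3 timestamps with no Full5 counterpart."""
    return sorted(stamps(full3_names) - stamps(full5_names))
```

Because one Full3 file may be converted into several Full5 files, the comparison is done on sets of timestamps rather than on file counts.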
Converting Full3 to Full5
The conversion scripts are ...
/lfs/local/0/snap/convertF3F5
    convert-full3.py        # This Python script actually does the conversion
    convertF3F5byFile.sh    # A helper script that converts one Full3 file to Full5 files
    convertF3F5byTime.sh    # A helper that finds the appropriate Full3 file based on the timestamp you provide
    averageSize.sh          # Display the average Full3/Full5 file size by year and last 12 months
    debug-log/              # Directory with the full debug log of all scripts
        convertF3F5-2012-09-10.log
        convertF3F5-2012-09-11.log
        ...
    spinn3r-log/            # Directory with the old version logs
        __spinn3r_...
        lerr-2011-11-18T19-00-00.txt
        log-2011-11-18T19-00-00.txt
        lerr-2011-11-18T20-00-00.txt
        log-2011-11-18T20-00-00.txt
        ...
    prepares the temporary directories
    checkF5.sh              # Checks for the full5 files
    log/
        2012-09-09-checkF5.txt
        2012-09-10-checkF5.txt
        ...
The script is run from crontab as the snap.crontab user:
- 6:15am: checkF5.sh
  - The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs. The report's subject starts with [snap-srv].
  - Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF5-hulk.txt