
Diff for "Spinn3rDataSet"

Differences between revisions 1 and 7 (spanning 6 versions)
Revision 1 as of 2012-09-11 18:17:26
Size: 3220
Editor: akrevl
Comment:
Revision 7 as of 2014-09-06 01:22:46
Size: 1090
Editor: NikoColneric
Comment:
Line 1: Line 1:
= Spinn3r data = = Spinn3r data set on Hadoop cluster =
Line 3: Line 3:
=== Zarya === This page provides all information about the Spinn3r data set stored on the Hadoop cluster.
Line 5: Line 5:
Zarya is our main crawler that downloads data from spinn3r.com. === Data records versions ===
The records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how the text was preprocessed. For example, some versions have no capital letters, no raw HTML fields, etc.
 
|| '''From''' || '''To''' || '''Version'''||
|| 2008-08-01 || 2010-07-13 || A ||
|| 2010-07-14 || 2010-07-26 || B ||
|| 2010-07-27 || 2013-04-30 || C ||
|| 2013-05-01 || 2014-05-30 || D ||
|| 2014-06-01 || ... || E ||
Line 7: Line 15:
==== Dataset directories ==== === More information ===
 * [[Spinn3rFormat]] - provides detailed description of version transitions and parsing.
Line 9: Line 18:
 * full3: /lfs/1/tmp/spinn3r-full3/

==== Crawler ====

 * /lfs/1/tmp/spinn3r-client/runclient.sh
 * Runs every hour: 0:00, 1:00, 2:00, ...

==== Copy data to Hulk ====

This script copies the daily downloaded data to hulk.Stanford.EDU.

{{{
/u/snap/spinn3r/
  rsync-spinn3r-full3-to-hulk.sh # Copies all the daily spinn3r data from zarya to hulk
  log/
    rsync-spinn3r-full3-to-hulk.log # The rsync log file
}}}

The script is run from crontab as the snap user:

 * '''2:45am''': rsync-spinn3r-full3-to-hulk.sh

In order for the script to work, the snap user on zarya needs a valid Kerberos ticket for the cs domain. You can check this by running klist as the snap user on zarya. The output for a valid ticket should look something like this:

{{{
Valid starting Expires Service principal
08/31/12 10:31:28 09/03/12 10:31:27 krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
        renew until 09/30/12 10:31:27
}}}
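
The ticket check can also be scripted. The following is a minimal sketch, not part of the existing setup: it assumes the klist output has the layout shown above, with timestamps in mm/dd/yy format (klist formats differ between Kerberos implementations and locales). The names `tgt_expiry` and `has_valid_tgt` are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Matches a ticket line such as:
# 08/31/12 10:31:28 09/03/12 10:31:27 krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
# Assumption: two mm/dd/yy timestamps followed by the krbtgt principal.
TICKET_LINE = re.compile(
    r"^\s*(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(krbtgt/\S+)\s*$"
)
TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def tgt_expiry(klist_output: str) -> Optional[datetime]:
    """Return the expiry time of the first krbtgt ticket, or None if none is listed."""
    for line in klist_output.splitlines():
        match = TICKET_LINE.match(line)
        if match:
            # Group 2 is the "Expires" column.
            return datetime.strptime(match.group(2), TIME_FORMAT)
    return None


def has_valid_tgt(klist_output: str, now: datetime) -> bool:
    """True if a krbtgt ticket is listed and has not yet expired at `now`."""
    expiry = tgt_expiry(klist_output)
    return expiry is not None and now < expiry
```

Passing the captured output of klist together with `datetime.now()` to `has_valid_tgt` tells you whether the rsync script can be expected to authenticate.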

Once you have a ticket, the krb_renew daemon renews it for you. You do have to obtain a new ticket after every server reboot, though. For more info see SshKerberos.

==== Checking for downloaded files ====

There is a script that checks if we have all the spinn3r files for the past 14 days.

{{{
/u/snap/checkF3/
  checkF3.sh # Checks for the full3 files
  log/
    2012-09-09-checkF3.log
    2012-09-10-checkF3.log
    ...
}}}

The script is run from crontab as the snap user:

 * '''3:15am''': checkF3.sh
 * The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
 * Crontab log goes to: /u/snap/.crontab/log-checkF3-zarya.txt
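
The source of checkF3.sh is not shown on this page, so the following is only a sketch of the kind of 14-day check described above, under stated assumptions. It assumes that "the past 14 days" means the 14 days before today, and it uses a hypothetical convention in which a day counts as present when some file name in the directory contains its ISO date (for example 2012-09-10). The real spinn3r file naming may differ. The function names are hypothetical as well.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import List


def expected_days(today: date, days_back: int = 14) -> List[date]:
    """Return the `days_back` days before `today`, oldest first."""
    return [today - timedelta(days=n) for n in range(days_back, 0, -1)]


def missing_days(data_dir: Path, today: date, days_back: int = 14) -> List[date]:
    """Return the days in the window that have no matching file in `data_dir`.

    Hypothetical convention: a day is present if at least one file name
    contains its ISO date, e.g. "2012-09-10".
    """
    names = [p.name for p in data_dir.iterdir() if p.is_file()]
    return [
        day
        for day in expected_days(today, days_back)
        if not any(day.isoformat() in name for name in names)
    ]
```

An empty result from `missing_days` means every day in the window has at least one file. A non-empty result lists the days to report.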

=== Hulk ===

==== Dataset directories ====

 * full3: /lfs/hulk/0/datasets/spinn3r/spinn3r-full3
 * full5: /lfs/hulk/0/datasets/spinn3r/spinn3r-full5

==== Checking for spinn3r full3 files ====

The checkF3 script checks if all the files for the day have been successfully copied from Zarya.Stanford.EDU:

{{{
/lfs/local/0/snap/checkF3/
  checkF3.sh # Checks for the full3 files
  log/
    2012-09-09-checkF3.log
    2012-09-10-checkF3.txt
    ...
}}}

The script is run from crontab as the snap.crontab user:

 * '''3:15am''': checkF3.sh
 * The script reports which files it has found and which files had problems. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
 * Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF3-hulk.txt

########

{{{
  CheckSpinn3rData-Full5.sh # Checks for the full5 files
  log/
    CheckSpinn3rData.log # Old log file for the full3 check (this can be discarded once checked)
    CheckSpinn3rData-Full3.log # Log file for the full3 check
    CheckSpinn3rData-Full5.log # Log file for the full5 check
}}}

Both of the scripts are run daily as the user snap.cron:

 * '''3:15am''': !CheckSpinn3rData-Full3.sh
 * '''5:00am''': !CheckSpinn3rData-Full5.sh

The scripts will only send an e-mail if they have encountered any errors. At the moment the e-mail is being sent to rok@cs, akrevl@cs.
 * [[Spinn3rDataSetArchive]] - more info about processing.

Spinn3r data set on Hadoop cluster

This page provides all information about the Spinn3r data set stored on the Hadoop cluster.

Data records versions

The records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how the text was preprocessed. For example, some versions have no capital letters, no raw HTML fields, etc.

|| '''From''' || '''To''' || '''Version''' ||
|| 2008-08-01 || 2010-07-13 || A ||
|| 2010-07-14 || 2010-07-26 || B ||
|| 2010-07-27 || 2013-04-30 || C ||
|| 2013-05-01 || 2014-05-30 || D ||
|| 2014-06-01 || ... || E ||
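
When processing records by date, the version table can be encoded as a small lookup. This is a minimal sketch, assuming that both the From and To dates are inclusive and that the "..." in the last row means the range is still open. The name `record_version` is hypothetical. As listed, the table assigns no version to 2014-05-31, so the sketch returns None for that day rather than guessing.

```python
from datetime import date
from typing import Optional

# (first day, last day, version), copied from the version table.
# Assumption: both ends are inclusive; None marks the open-ended last range.
VERSION_RANGES = [
    (date(2008, 8, 1), date(2010, 7, 13), "A"),
    (date(2010, 7, 14), date(2010, 7, 26), "B"),
    (date(2010, 7, 27), date(2013, 4, 30), "C"),
    (date(2013, 5, 1), date(2014, 5, 30), "D"),
    (date(2014, 6, 1), None, "E"),
]


def record_version(day: date) -> Optional[str]:
    """Return the version letter for `day`, or None if the table does not cover it."""
    for first, last, version in VERSION_RANGES:
        if day >= first and (last is None or day <= last):
            return version
    return None
```

For example, `record_version(date(2010, 7, 20))` falls in the second row of the table and yields version B.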

More information