
Diff for "Spinn3rDataSet"

Differences between revisions 3 and 4
Revision 3 as of 2012-09-11 19:48:42
Size: 6941
Editor: akrevl
Comment:
Revision 4 as of 2014-09-06 01:20:34
Size: 1025
Editor: NikoColneric
Comment:
Line 1: Line 1:
= Spinn3r dataset = = Spinn3r data set on Hadoop cluster =
Line 3: Line 3:
{{{#!wiki tip
'''Script configuration'''
This page provides all information about the Spinn3r data set stored on the Hadoop cluster.
Line 6: Line 5:
Almost all of the scripts have a few configuration options in the beginning of the file. Make sure you check those if you move the script to a new directory or to a different machine.
}}}
=== Data records versions ===
The records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how the text was preprocessed. For example, some versions contain no capital letters, no raw HTML fields, etc.
 
|| '''From''' || '''To''' || '''Version'''||
|| 2008-08-01 || 2010-07-13 || A ||
|| 2010-07-14 || 2010-07-26 || B ||
|| 2010-07-27 || 2013-04-30 || C ||
|| 2013-05-01 || 2014-05-30 || D ||
|| 2014-06-01 || ... || E ||
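
As a rough illustration, the table above can be turned into a small lookup that returns the record version for a given date. This is only a sketch: the names `VERSION_RANGES` and `record_version` are hypothetical, and it assumes that both the From and To dates are inclusive. Note that, as listed, no range covers 2014-05-31.

```python
from datetime import date

# Date ranges copied from the table above. The open-ended "..." for
# version E is represented as None. Assumption: both ends are inclusive.
VERSION_RANGES = [
    (date(2008, 8, 1), date(2010, 7, 13), "A"),
    (date(2010, 7, 14), date(2010, 7, 26), "B"),
    (date(2010, 7, 27), date(2013, 4, 30), "C"),
    (date(2013, 5, 1), date(2014, 5, 30), "D"),
    (date(2014, 6, 1), None, "E"),
]


def record_version(day):
    """Return the version letter for a date, or None if no range covers it."""
    for start, end, version in VERSION_RANGES:
        if day >= start and (end is None or day <= end):
            return version
    return None
```

For example, `record_version(date(2012, 9, 9))` falls in the third row and gives version C, while a date before 2008-08-01 gives `None`.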
Line 9: Line 15:
=== Zarya ===

Zarya is our main crawler that downloads data from spinn3r.com.

==== Dataset directories ====

 * full3: /lfs/1/tmp/spinn3r-full3/

==== Crawler ====

 * /lfs/1/tmp/spinn3r-client/runclient.sh
 * Runs every hour: 0:00, 1:00, 2:00, ...

==== Copy data to Hulk ====

This script copies the daily downloaded data to hulk.Stanford.EDU.

{{{
/u/snap/spinn3r/
  rsync-spinn3r-full3-to-hulk.sh # Copies all the daily spinn3r data from zarya to hulk
  log/
    rsync-spinn3r-full3-to-hulk.log # The rsync log file
}}}

The script is run from crontab as the snap user:

 * '''2:45am''': rsync-spinn3r-full3-to-hulk.sh

In order for the script to work, the snap user on zarya needs a valid Kerberos ticket for the cs domain. You can check this by running klist as the snap user on zarya. The output for a valid ticket should look something like this:

{{{
Valid starting Expires Service principal
08/31/12 10:31:28 09/03/12 10:31:27 krbtgt/CS.STANFORD.EDU@CS.STANFORD.EDU
        renew until 09/30/12 10:31:27
}}}

Once you get the ticket, the krb_renew daemon will renew it for you. You have to get a new ticket after every server reboot, though. For more info see SshKerberos.
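
The klist check described above could be automated along these lines. This is a minimal sketch, not part of the existing scripts: the function name `has_valid_tgt` is hypothetical, and the date format is assumed from the sample output shown above (klist output differs between Kerberos implementations and locales).

```python
from datetime import datetime

# Assumed from the sample klist output above: month/day/two-digit year.
KLIST_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def has_valid_tgt(klist_output, now, realm="CS.STANFORD.EDU"):
    """Return True if klist_output lists an unexpired krbtgt ticket for realm."""
    principal = "krbtgt/{0}@{0}".format(realm)
    for line in klist_output.splitlines():
        fields = line.split()
        # A ticket line has five fields:
        # start date, start time, expiry date, expiry time, principal.
        if len(fields) != 5 or fields[4] != principal:
            continue
        try:
            start = datetime.strptime(fields[0] + " " + fields[1], KLIST_TIME_FORMAT)
            expires = datetime.strptime(fields[2] + " " + fields[3], KLIST_TIME_FORMAT)
        except ValueError:
            continue  # not a ticket line in the assumed format
        if start <= now < expires:
            return True
    return False
```

The function takes the text printed by klist and the current time as arguments, so it can be tried against saved output without touching a real ticket cache.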

==== Checking for downloaded files ====

There is a script that checks if we have all the spinn3r files for the past 14 days.

{{{
/u/snap/checkF3/
  checkF3.sh # Checks for the full3 files
  log/
    2012-09-09-checkF3.log
    2012-09-10-checkF3.log
    ...
}}}

The script is run from crontab as the snap user:

 * '''3:15am''': checkF3.sh
 * The script reports which files it has found and which files had problems. At the moment the e-mail is sent to rok@cs, akrevl@cs. The report's subject will start with '''[snap-srv]'''.
 * Crontab log goes to: /u/snap/.crontab/log-checkF3-zarya.txt
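
The idea behind the 14-day check can be sketched as follows. This does not reproduce checkF3.sh: it assumes one file per hour (the crawler runs hourly), and how file names map to hours is not covered here. The function name `missing_hours` is hypothetical.

```python
from datetime import datetime, timedelta


def missing_hours(found, now, days=14):
    """Return the whole hours in the `days` before `now` that are not in `found`.

    `found` is a set of datetime objects, one per hour for which a file exists.
    """
    end = now.replace(minute=0, second=0, microsecond=0)
    hour = end - timedelta(days=days)
    missing = []
    while hour < end:
        if hour not in found:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing
```

With an empty `found` set the result contains 14 x 24 = 336 hours, which is the number of files such a check would expect under the one-file-per-hour assumption.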

=== Hulk ===

There are two versions of the Spinn3r dataset on Hulk. One is Full3, which is copied from Zarya every night. The other is Full5, which is converted from the Full3 dataset. The conversion is done every night on Hulk.

==== Dataset directories ====

 * full3: /lfs/hulk/0/datasets/spinn3r/spinn3r-full3
 * full5: /lfs/hulk/0/datasets/spinn3r/spinn3r-full5

==== Checking for Full3 files ====

The checkF3 script checks if all the files for the day have been successfully copied from Zarya.Stanford.EDU:

{{{
/lfs/local/0/snap/checkF3/
  checkF3.sh # Checks for the full3 files
  log/
    2012-09-09-checkF5.txt
    2012-09-10-checkF5.txt
    ...
}}}

The script is run from crontab as the snap.crontab user:

 * '''3:15am''': checkF3.sh
 * The script reports which files it has found and which files had problems. At the moment the e-mail is sent to rok@cs, akrevl@cs. The report's subject will start with '''[snap-srv]'''.
 * Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF3-hulk.txt

==== Checking for Full5 files ====

The checkF5 script checks if all the Full3 files have been successfully converted to the Full5 format. This is the directory structure used:

{{{
/lfs/local/0/snap/checkF5/
  checkF5.sh # Checks for the full5 files
  log/
    2012-09-09-checkF5.txt
    2012-09-10-checkF5.txt
    ...
}}}

The script is run from crontab as the snap.crontab user:

 * '''6:15am''': checkF5.sh
 * The script reports which files it has found and which files had problems. At the moment the e-mail is sent to rok@cs, akrevl@cs. The report's subject will start with '''[snap-srv]'''.
 * Crontab output goes to: /lfs/local/0/snap/.crontab/log-checkF5-hulk.txt


==== Converting Full3 to Full5 ====

The daily conversion is done in the following steps:

 # convertF3F5daily.sh is called to get a list of files that were copied from Zarya.
 # The previous step calls convertF3F5byFile.sh for each of the Full3 files.
 # convertF3F5byFile.sh sets up the temporary directories, links, etc.
 # convert-full3.py is called and the formats are converted.
 # Control is handed back to convertF3F5byFile.sh which copies the processed file to Full5 directory.

Here is the directory structure.

{{{
/lfs/local/0/snap/convertF3F5
  convert-full3.py # This Python script actually does the conversion
  convertF3F5byFile.sh # A helper script that converts one Full3 file to Full5 files
  convertF3F5byTime.sh # A helper that finds the appropriate Full3 file based on the timestamp you provide
  convertF3F5daily.sh # The daily run that will convert all the files we have copied from Zarya
  averageSize.sh # Display the average Full3/Full5 file size by year and last 12 months
  debug-log/ # Directory with the full debug log of all scripts
    convertF3F5-2012-09-10.log
    convertF3F5-2012-09-11.log
    ...
  spinn3r-log/ # Directory with the logs in the same format as the previous version (__spinn3r_)
    lerr-2011-11-18T19-00-00.txt
    log-2011-11-18T19-00-00.txt
    lerr-2011-11-18T20-00-00.txt
    log-2011-11-18T20-00-00.txt
    ...
}}}

The script convertF3F5daily.sh is run from crontab as the snap.crontab user:

 * '''3:45am''': convertF3F5daily.sh
 * Crontab output goes to: /lfs/local/0/snap/.crontab/log-hulk-convertF3F5daily.txt

If checkF5.sh reports missing files, check the logs first. If you need to rerun the conversion for a missing file, use convertF3F5byTime.sh. If you know the timestamp of the full3 file, run the script like this:

{{{
./convertF3F5byTime.sh 2012 09 09 12
}}}

If you know the timestamp of the full5 file (this is the timestamp that checkF5.sh reports) you can run the script like this:

{{{
./convertF3F5byTime.sh 2012 09 09 19 UTC
}}}
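
If the two example invocations above refer to the same hour, the full3 timestamp is 7 hours behind the full5 (UTC) timestamp. The sketch below converts a full5 hour into full3 script arguments under that assumption. The fixed offset and the name `full5_to_full3_args` are hypothetical, and the real offset may change with daylight saving time, so check it before relying on the result.

```python
from datetime import datetime, timedelta

# Assumption: full3 timestamps are 7 hours behind UTC, as the two example
# commands above suggest (12 vs. 19 UTC). The real offset may vary by season.
ASSUMED_UTC_OFFSET = timedelta(hours=-7)


def full5_to_full3_args(year, month, day, hour, offset=ASSUMED_UTC_OFFSET):
    """Convert a full5 (UTC) hour to full3 arguments under the assumed offset."""
    local = datetime(year, month, day, hour) + offset
    return local.strftime("%Y %m %d %H")
```

The conversion also handles hours that cross midnight, where the full3 date is one day earlier than the full5 date.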

{{{#!wiki tip
'''Conversion not starting'''

The conversion is NOT done if a log file for a given full3 file already exists in spinn3r-log OR if the destination filenames already exist in the full5 directory.

'''Parallel processing'''

convertF3F5byFile.sh and convertF3F5byTime.sh are designed so they can run in parallel (you can fork multiple processes).
}}}
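
The skip rule from the tip above can be expressed as a small predicate, which may help when working out why a rerun did nothing. This is only a sketch: `conversion_would_run` is a hypothetical name, the caller supplies the log path and destination paths, and it assumes that a single existing destination file is enough to block the conversion.

```python
from pathlib import Path


def conversion_would_run(log_file, destination_files):
    """Apply the skip rule: no existing log and no existing destination file."""
    if Path(log_file).exists():
        return False
    # Assumption: any one existing destination file blocks the conversion.
    return not any(Path(dest).exists() for dest in destination_files)
```

If the function returns False for a file you want to reconvert, removing the stale log or the existing destination files first should let the conversion start.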

==== Checking files ====

There is another script called _convertF3F5byFile.sh_nocopy (note that it might get renamed to something more understandable) that can be used for recreating logs and checking the file integrity. This script will proceed to convert a full3 file to the full5 files, but it will leave all the data in a temporary directory.

In addition to that, the script will also do a diff between the created files and the same files in the full5 directory (if they exist, of course).

The script may be useful to test the conversion process if the conversion Python script changes.
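
The comparison step described above could look roughly like this in Python. It is a sketch rather than what the script does: the script runs a diff, whereas this uses a byte-by-byte comparison from the standard library, and the name `compare_with_full5` is hypothetical.

```python
import filecmp
from pathlib import Path


def compare_with_full5(temp_dir, full5_dir):
    """Map each file name in temp_dir to 'same', 'different' or 'missing'."""
    result = {}
    for created in sorted(Path(temp_dir).iterdir()):
        if not created.is_file():
            continue
        existing = Path(full5_dir) / created.name
        if not existing.exists():
            result[created.name] = "missing"
        elif filecmp.cmp(created, existing, shallow=False):
            # shallow=False forces a comparison of the file contents.
            result[created.name] = "same"
        else:
            result[created.name] = "different"
    return result
```

Files reported as missing simply have no counterpart in the full5 directory yet, which matches the "if they exist" caveat above.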
=== More information ===
[[Spinn3rFormat]] - provides a detailed description of version transitions and parsing.

Spinn3r data set on Hadoop cluster

This page provides all information about the Spinn3r data set stored on the Hadoop cluster.

Data records versions

The records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how the text was preprocessed. For example, some versions contain no capital letters, no raw HTML fields, etc.

|| '''From''' || '''To''' || '''Version''' ||
|| 2008-08-01 || 2010-07-13 || A ||
|| 2010-07-14 || 2010-07-26 || B ||
|| 2010-07-27 || 2013-04-30 || C ||
|| 2013-05-01 || 2014-05-30 || D ||
|| 2014-06-01 || ... || E ||

More information

Spinn3rFormat - provides a detailed description of version transitions and parsing.