Spinn3r data set on Hadoop cluster
This page provides all information about the Spinn3r data set stored on the Hadoop cluster.
Data records versions
Records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how its text was preprocessed. For example, some versions have no capital letters, no raw HTML fields, etc.
From | To | Version
2008-08-01 | 2010-07-13 | A
2010-07-14 | 2010-07-26 | B
2010-07-27 | 2013-04-30 | C
2013-05-01 | 2014-05-30 | D
2014-06-01 | ... | E
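The date ranges above can be turned into a small lookup that tells you which version a record belongs to before you parse it. The following is only a minimal sketch: the names `VERSION_RANGES` and `record_version` are hypothetical and not part of any existing tooling, and the open end of version E ("...") is modelled as "no upper bound", which is an assumption. Dates that fall outside every listed range return `None`.

```python
from datetime import date

# Date ranges copied from the version table on this page.
# Assumption: the open end of version E ("...") is represented as None,
# meaning "no upper bound known".
VERSION_RANGES = [
    (date(2008, 8, 1), date(2010, 7, 13), "A"),
    (date(2010, 7, 14), date(2010, 7, 26), "B"),
    (date(2010, 7, 27), date(2013, 4, 30), "C"),
    (date(2013, 5, 1), date(2014, 5, 30), "D"),
    (date(2014, 6, 1), None, "E"),
]


def record_version(day):
    """Return the version letter whose range contains `day`, or None."""
    for start, end, version in VERSION_RANGES:
        # Both bounds are inclusive; a missing end means the range is open.
        if day >= start and (end is None or day <= end):
            return version
    return None


if __name__ == "__main__":
    print(record_version(date(2012, 1, 15)))
```

Returning `None` instead of guessing keeps dates not covered by the table visible to the caller, so they can be checked against Spinn3rFormat rather than silently assigned a version.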
More information
Spinn3rFormat - provides a detailed description of version transitions and parsing.
Spinn3rDataSetArchive - more information about processing.