Diff for "Spinn3rHadoopDataSet"

Differences between revisions 8 and 18 (spanning 10 versions)
Revision 8 as of 2014-09-05 22:23:09
Size: 481
Editor: NikoColneric
Comment:
Revision 18 as of 2014-09-06 01:11:19
Size: 1025
Editor: NikoColneric
Comment:
Spinn3r data set on Hadoop cluster

This page provides information about the Spinn3r data set stored on the Hadoop cluster.

Data record versions

The records are stored in several versions. It is important to know which version you are processing, because the version determines which fields are available for a record and how the text was preprocessed. For example, some versions contain no capital letters, no raw HTML fields, etc.

|| '''From''' || '''To''' || '''Version''' ||
|| 2008-08-01 || 2010-07-13 || A ||
|| 2010-07-14 || 2010-07-26 || B ||
|| 2010-07-27 || 2013-04-30 || C ||
|| 2013-05-01 || 2014-05-30 || D ||
|| 2014-05-30 || ... || E ||
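
As a rough illustration, the date ranges in the table above could be used to look up the version of a record from its date. The sketch below is not part of the data set tooling: the names `VERSION_RANGES` and `versions_for_date` are hypothetical, and it assumes that both the "From" and the "To" dates are inclusive. Because 2014-05-30 is listed both as the end of D and as the start of E, the helper returns every matching version instead of guessing which one applies.

```python
from datetime import date

# Rows copied from the table above: (from, to, version).
# None stands for the open-ended "..." entry of version E.
VERSION_RANGES = [
    (date(2008, 8, 1), date(2010, 7, 13), "A"),
    (date(2010, 7, 14), date(2010, 7, 26), "B"),
    (date(2010, 7, 27), date(2013, 4, 30), "C"),
    (date(2013, 5, 1), date(2014, 5, 30), "D"),
    (date(2014, 5, 30), None, "E"),
]


def versions_for_date(day):
    """Return all versions whose range contains `day` (hypothetical helper).

    Assumes inclusive range endpoints; returns an empty list for dates
    before the first range.
    """
    matches = []
    for start, end, version in VERSION_RANGES:
        if day >= start and (end is None or day <= end):
            matches.append(version)
    return matches
```

A date that falls on the shared boundary 2014-05-30 yields two candidates, so the version of such records has to be confirmed from the record itself or from the Spinn3rFormat page.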

More information

Spinn3rFormat - provides a detailed description of the version transitions and of parsing.