This is not entirely conclusive, but my archæological endeavors hint towards the following as a
potential history of character encoding of the Spinn3r data.
(Author: Bob West, 2014-06-06)


PHASE A (until 2010-07-13)
==========================

(1) Spinn3r's data, which was probably UTF-8-encoded, was read as Latin-1 (a.k.a. ISO-8859-1).
UTF-8 uses one to four bytes per character, while Latin-1 always uses exactly one byte per
character. That is, a single non-ASCII character from the original data now looks like two or more
characters.
E.g., Unicode code point U+00E4 ("latin small letter a with diaeresis", a.k.a. "ä") is represented
by the two-byte code C3A4 in UTF-8. Reading the bytes C3A4 as Latin-1 results in the two-character
sequence "Ã¤", since C3 encodes "Ã" in Latin-1, and A4, "¤".

(2) Then, case-folding was performed on the garbled text, making it even more garbled.
E.g., "Ã¤" became "ã¤".

(3) The data was written to disk as UTF-8.
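
The three steps above can be reproduced with a minimal Python sketch. The function name
phase_a_garble is hypothetical, and str.lower() is only an assumption standing in for whatever
case-folding the original pipeline actually used.

```python
def phase_a_garble(text: str) -> bytes:
    """Hypothetical reconstruction of the Phase A corruption."""
    # (1) The UTF-8 bytes are misread as Latin-1: one byte becomes one character.
    misread = text.encode("utf-8").decode("latin-1")
    # (2) Case-folding is applied to the already garbled text
    #     (assumption: plain lower-casing).
    folded = misread.lower()
    # (3) The result is written to disk as UTF-8.
    return folded.encode("utf-8")


on_disk = phase_a_garble("ä")
print(on_disk.decode("utf-8"))  # the garbled two-character sequence
```

Reading the resulting bytes back as UTF-8 yields the two characters U+00E3 and U+00A4, i.e., the
"ã¤" from the example in step (2).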

Approximate solution:
Take the debugging table from http://www.i18nqa.com/debug/utf8-debug.html, look for the garbled
and lower-cased sequences, and replace them with their original characters.
NB: The garbling is not bijective, but since most of the garbled sequences are highly unlikely
(e.g., "ã¤"), this should be mostly fine.
NB: Since we replace several characters ("ã¤") by a single character ("ä"), the indices of links
and quotes might be off. But I think fixing the encoding is worth more than keeping the indices
precise.
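
As a rough sketch of this repair, the lookup table can also be generated instead of copied by
hand. Everything below is an assumption for illustration: the helper names, the use of strict
Latin-1 (the linked table may differ for bytes 0x80 to 0x9F), plain lower-casing as the
case-folding step, and the choice of U+00A0 to U+00FF as candidate characters.

```python
import re


def garble(ch: str) -> str:
    """Reproduce the Phase A corruption for one character (assumed steps)."""
    return ch.encode("utf-8").decode("latin-1").lower()


def build_repair_table(candidates):
    """Map garbled sequences back to their originals, dropping ambiguous ones."""
    table = {}
    ambiguous = set()
    for ch in candidates:
        bad = garble(ch)
        if bad == ch:
            continue  # character is unaffected by the corruption
        if bad in table and table[bad] != ch:
            ambiguous.add(bad)  # two originals share one garbled form
        table.setdefault(bad, ch)
    for bad in ambiguous:
        del table[bad]
    return table


def repair(text: str, table) -> str:
    """Replace garbled sequences in a single pass, longest sequences first."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


# Candidate originals: the Latin-1 supplement block (an arbitrary choice).
table = build_repair_table(chr(cp) for cp in range(0xA0, 0x100))
print(repair("fã¤hre nach kã¶ln", table))
```

The single regex pass is deliberate: replacing sequences one after another could let an already
repaired character (such as "ã") combine with its neighbor into a new apparent garbled sequence.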


PHASE B (2010-07-14 to 2010-07-26)
==================================

For just under 2 weeks, the data seems to have been read as UTF-8 and written as Latin-1 (i.e., the
reverse of Phase A).

Non-Latin-1 characters are printed as "?". However, there also seem to be a few documents that are
garbled as in Phase A, e.g., the second document in
/afs/cs/group/infolab/datasets/snap-private/spinn3r/spinn3r-full5/web/2010-07/web-2010-07-15T00-00-00Z.rar

Approximate solution:
Simply read the data as Latin-1.
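
A small sketch of what this looks like in Python. The function names are hypothetical, and
errors="replace" is an assumption that mimics the observed "?" substitution, not the original
writer's code.

```python
def phase_b_write(text: str) -> bytes:
    """Assumed Phase B behavior: characters outside Latin-1 are written as '?'."""
    return text.encode("latin-1", errors="replace")


def phase_b_read(data: bytes) -> str:
    """The suggested fix: decode the stored bytes as Latin-1."""
    return data.decode("latin-1")


stored = phase_b_write("Zürich €5")
print(phase_b_read(stored))  # "ü" survives, the euro sign is lost
```

Decoding Latin-1 never fails, since every byte value maps to a character, whereas decoding these
bytes as UTF-8 would raise an error on the lone 0xFC byte that encodes "ü".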


PHASE C (2010-07-27 to 2013-04-28)
==================================

Data was written as ASCII, such that all non-ASCII characters (including the accented Latin-1
characters) appear as "?".

Approximate solution:
None. We simply need to byte (haha...) the bullet and deal with the question marks.


PHASE D (2013-04-29 to 2014-05-21)
==================================

Data was produced by Bob's version of the ProtostreamParser, which keeps capitalization and HTML
markup.

However, due to a bad Bash environment variable, data was written as ASCII, such that non-ASCII
characters appear as "?".

NB: We store the original content with markup, plus a markup-stripped and tokenized version (tokens
are whitespace-separated). Links and quotes are extracted from the stripped and tokenized version.


PHASE E (since 2014-05-22)
==========================

Same as Phase D, but output is now written as proper Unicode, by hard-coding the output encoding as
"UTF-8" in the Java code.
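
The difference between Phases D and E can be illustrated with a Python analogue (the actual fix
was made in Java, whose code is not shown here). The helper write_text is hypothetical; passing
"ascii" stands in for an encoding inherited from a misconfigured environment, and
errors="replace" is an assumption that mimics the "?" substitution.

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text-mode writer and return the bytes it produced."""
    buf = io.BytesIO()
    writer = io.TextIOWrapper(buf, encoding=encoding, errors="replace", newline="")
    writer.write(text)
    writer.flush()  # push buffered text through the encoder into buf
    return buf.getvalue()


# Phase D analogue: the encoding comes from the environment and happens to be ASCII.
print(write_text("Zürich", "ascii"))
# Phase E analogue: the encoding is hard-coded, independent of the environment.
print(write_text("Zürich", "utf-8"))
```

Naming the encoding explicitly at the point where the writer is created is what makes the output
independent of locale settings.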