This is not entirely conclusive, but my archæological endeavors hint towards the following as a
potential history of the character encoding of the Spinn3r data.
(Author: Bob West, 2014-06-06)


PHASE A (until 2010-07-13)
==========================
(1) Spinn3r's data, which was probably UTF-8-encoded, was read as Latin-1 (a.k.a. ISO-8859-1).
    UTF-8 uses one to four bytes per character, while Latin-1 always uses exactly one byte per
    character. That is, a single non-ASCII character from the original data now looks like two
    (or more) characters.
    E.g., Unicode code point U+00E4 ("latin small letter a with diaeresis", a.k.a. "ä") is
    represented by the two-byte code C3A4 in UTF-8. Reading the bytes C3A4 as Latin-1 results in
    the two-character sequence "Ã¤", since C3 encodes "Ã" in Latin-1, and A4, "¤".
(2) Then, case-folding was performed on the garbled text, making it even more garbled.
    E.g., "Ã¤" became "ã¤".
(3) The data was written to disk as UTF-8.

Approximate solution: Take the debugging table from http://www.i18nqa.com/debug/utf8-debug.html,
look for the garbled and lower-cased sequences, and replace them with their original characters.
NB: The garbling is not bijective, but since most of the garbled sequences (e.g., "ã¤") are highly
unlikely to occur in genuine text, this should be mostly fine.
NB: Since we replace several characters ("ã¤") with a single character ("ä"), the indices of links
and quotes might be off. But I think fixing the encoding is worth more than keeping the indices
precise.


PHASE B (2010-07-14 to 2010-07-26)
==================================
For just about 2 weeks, the data seems to have been read as UTF-8 and written as Latin-1 (i.e., the
reverse of Phase A). Non-Latin-1 characters are printed as "?".
However, there also seem to be a very few cases as in Phase A, e.g., the second document in
/afs/cs/group/infolab/datasets/snap-private/spinn3r/spinn3r-full5/web/2010-07/web-2010-07-15T00-00-00Z.rar

Approximate solution: Simply read the data as Latin-1.

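
The Phase A garbling, its approximate repair, and the Phase B round trip can be sketched in Python
as follows. This is only an illustration, not the code actually used on the data. All names
(garble_phase_a, repair_phase_a, REPAIR_TABLE, write_phase_b, read_phase_b) are hypothetical.
Assumptions: Python's str.lower() stands in for whatever case-folding the original pipeline
applied, and the repair table is generated programmatically for U+00A0 to U+00FF only, rather than
copied from the debugging table referenced above.

```python
import re


def garble_phase_a(text):
    """Phase A steps (1) and (2): UTF-8 bytes misread as Latin-1, then lower-cased."""
    return text.encode("utf-8").decode("latin-1").lower()
    # e.g. garble_phase_a("Mädchen") -> "mã¤dchen"


# Assumption: only characters in U+00A0..U+00FF are repaired. Each maps to a
# distinct two-character garbled sequence, so the table has no key collisions.
REPAIR_TABLE = {garble_phase_a(chr(cp)): chr(cp) for cp in range(0xA0, 0x100)}

# One alternation over all garbled sequences (all have length 2 here).
_GARBLED = re.compile("|".join(re.escape(seq) for seq in REPAIR_TABLE))


def repair_phase_a(text):
    """Approximate inverse: replace each garbled sequence by its original character.

    Not exact: a genuine occurrence of a sequence such as "ã¤" is rewritten too,
    and the original capitalization of ASCII letters is lost for good.
    """
    return _GARBLED.sub(lambda m: REPAIR_TABLE[m.group(0)], text)


def write_phase_b(text):
    """Phase B: text written as Latin-1; unmappable characters become "?"."""
    return text.encode("latin-1", errors="replace")


def read_phase_b(raw):
    """Phase B fix: simply read the bytes as Latin-1."""
    return raw.decode("latin-1")
```

Note that the repaired string is shorter than the garbled one ("mã¤dchen" has 8 characters,
"mädchen" has 7), which is exactly why link and quote indices may drift after the repair.
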

PHASE C (2010-07-27 to 2013-04-28)
==================================
Data was written as ASCII, such that all non-ASCII characters (including Latin-1 characters) appear
as "?".

Approximate solution: None. We simply need to byte (haha...) the bullet and deal with the question
marks.


PHASE D (2013-04-29 to 2014-05-21)
==================================
Bob's version of the ProtostreamParser, i.e., capitalization and HTML markup are kept.
However, due to a bad BASH environment variable, data was still written as ASCII, such that
non-ASCII characters appear as "?".
NB: We store the original content with markup, plus a markup-stripped and tokenized version (tokens
are whitespace-separated). Links and quotes are extracted from the stripped and tokenized version.


PHASE E (since 2014-05-22)
==========================
Same as Phase D, but output is now written as proper Unicode, by hard-coding the output encoding
as "UTF-8" in the Java code.
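
The difference between Phases C/D and Phase E can be illustrated with a small sketch. This is a
Python analogue written for this note, not the Java code of the ProtostreamParser. The helper name
write_text is hypothetical, and errors="replace" is an assumption that mimics the observed "?"
substitution. The point is only that a writer with an ASCII encoding destroys non-ASCII characters
irrecoverably, while a writer with an explicitly fixed UTF-8 encoding does not depend on the
environment at all.

```python
import io


def write_text(text, encoding):
    """Write text through a text writer with the given encoding; return the bytes produced."""
    buf = io.BytesIO()
    # errors="replace" substitutes "?" for characters the encoding cannot represent.
    writer = io.TextIOWrapper(buf, encoding=encoding, errors="replace", newline="")
    writer.write(text)
    writer.flush()
    writer.detach()  # keep buf open after we are done with the writer
    return buf.getvalue()


# Phases C and D: the writer ends up with ASCII, so "ä" is lost.
ascii_bytes = write_text("Mädchen", "ascii")  # b"M?dchen"

# Phase E: the encoding is fixed explicitly, so "ä" survives as bytes C3 A4.
utf8_bytes = write_text("Mädchen", "utf-8")  # b"M\xc3\xa4dchen"
```

Unlike Phase A, there is nothing to repair afterwards: every lost character maps to the same "?",
so the original cannot be reconstructed from the output.
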