Attachment 'unicode_history.txt'
This is not entirely conclusive, but my archæological endeavors point to the following as a
potential history of the character encoding of the Spinn3r data.
(Author: Bob West, 2014-06-06)


PHASE A (until 2010-07-13)
==========================

(1) Spinn3r's data, which was probably UTF-8-encoded, was read as Latin-1 (a.k.a. ISO-8859-1). UTF-8
uses one or more bytes per character, while Latin-1 always uses exactly one byte per character. As
a result, a single non-ASCII character from the original data now looks like two or more characters.
E.g., Unicode code point U+00E4 ("latin small letter a with diaeresis", a.k.a. "ä") is represented
by the two-byte code C3A4 in UTF-8. Reading the bytes C3A4 as Latin-1 results in the two-character
sequence "Ã¤", since C3 encodes "Ã" in Latin-1, and A4, "¤".

(2) Then, case-folding (lower-casing) was performed on the garbled text, making it even more garbled.
E.g., "Ã¤" became "ã¤".

(3) The data was written to disk as UTF-8.
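
The three steps above can be reproduced on a single string. This is only a sketch of the presumed
chain of events, not the actual pipeline code; the function name garble_phase_a is made up for
illustration.

```python
def garble_phase_a(text: str) -> str:
    """Mimic the presumed Phase A processing on a string (illustrative only)."""
    raw = text.encode("utf-8")        # bytes as the source presumably delivered them
    misread = raw.decode("latin-1")   # step (1): every byte becomes one character
    return misread.lower()            # step (2): lower-casing the garbled text


original = "\u00e4"                                  # "ä"
print(original.encode("utf-8").hex())                # c3a4
garbled = garble_phase_a(original)
print([f"U+{ord(c):04X}" for c in garbled])          # ['U+00E3', 'U+00A4']
# Step (3): writing the garbled text back out as UTF-8 now takes four bytes.
print(garbled.encode("utf-8").hex())                 # c3a3c2a4
```

Pure ASCII text passes through this chain unchanged (apart from the lower-casing), which is why
only non-ASCII characters are affected.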

Approximate solution:
Take the debugging table from http://www.i18nqa.com/debug/utf8-debug.html, look for the garbled
and lower-cased sequences, and replace them with their original characters.
NB: The garbling is not bijective, but since most of the garbled sequences are highly unlikely
to occur in genuine text (e.g., "ã¤"), this should be mostly fine.
NB: Since we replace several characters ("ã¤") by a single character ("ä"), the indices of links
and quotes might be off. But I think fixing the encoding is worth more than keeping the indices
precise.
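
A minimal sketch of such a replacement follows. Instead of copying the table from the web page,
it derives the garbled sequences programmatically by replaying Phase A, assuming the misreading
really was Latin-1 (the published table may differ in detail). The names build_repair_table and
repair, and the default range U+00A0 to U+00FF, are assumptions made for illustration.

```python
import re


def build_repair_table(first=0x00A0, last=0x00FF):
    """Map each garbled, lower-cased sequence back to its original character."""
    table = {}
    for cp in range(first, last + 1):
        original = chr(cp)
        garbled = original.encode("utf-8").decode("latin-1").lower()
        table.setdefault(garbled, original)  # keep the first candidate on a collision
    return table


def repair(text, table):
    """Replace every garbled sequence found in text, trying longer sequences first."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


table = build_repair_table()
garbled = "m\u00e3\u00a4dchen"       # "mã¤dchen"
fixed = repair(garbled, table)       # "mädchen"
print(fixed)
print(len(garbled) - len(fixed))     # 1
```

The last line shows the index shift mentioned above: every repaired two-character sequence
shortens the text by one character. A genuine "ã" followed by "¤" would also be rewritten, which
is the non-bijectivity caveat in practice.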


PHASE B (2010-07-14 to 2010-07-26)
==================================

For just about 2 weeks, the data seems to have been read as UTF-8 and written as Latin-1 (i.e., the
other way round compared to Phase A).

Non-Latin-1 characters are printed as "?". However, there also seem to be a very few cases that are
garbled as in Phase A, e.g., the second document in
/afs/cs/group/infolab/datasets/snap-private/spinn3r/spinn3r-full5/web/2010-07/web-2010-07-15T00-00-00Z.rar

Approximate solution:
Simply read the data as Latin-1.
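
As a small illustration, assuming the Phase B bytes look like the made-up sample below, decoding
as Latin-1 recovers the text, whereas a strict UTF-8 decode fails. When reading from a file, the
equivalent is to pass encoding="latin-1" to open().

```python
raw = b"M\xe4dchen"                 # made-up sample: "Mädchen" stored as Latin-1 bytes

text = raw.decode("latin-1")        # Latin-1 maps every byte to exactly one character
print(text)

try:
    raw.decode("utf-8")             # the same bytes are not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                      # False

# Characters outside Latin-1 were already lost when the data was written:
lost = "\u20ac".encode("latin-1", errors="replace")   # the euro sign has no Latin-1 byte
print(lost)                         # b'?'
```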


PHASE C (2010-07-27 to 2013-04-28)
==================================

Data was written as ASCII, such that all non-ASCII characters (including Latin-1 characters)
appear as "?".

Approximate solution:
None. We simply need to byte (haha...) the bullet and deal with the question marks.
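
To see why no repair is possible, here is a sketch that imitates an ASCII-only writer using
Python's errors="replace" handler. This is an analogy to what apparently happened, not the
pipeline's actual code, and write_as_ascii is a hypothetical helper.

```python
def write_as_ascii(text: str) -> str:
    """Imitate an ASCII-only writer: each unencodable character becomes '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


print(write_as_ascii("M\u00e4dchen, na\u00efve, \u20ac5"))   # M?dchen, na?ve, ?5

# Different characters collapse to the same output, so the mapping cannot be inverted.
print(write_as_ascii("\u00e4") == write_as_ascii("\u00f6"))  # True
```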


PHASE D (2013-04-29 to 2014-05-21)
==================================

Data was produced by Bob's version of the ProtostreamParser, i.e., capitalization and HTML markup
are kept.

However, due to a bad BASH environment variable, data was written as ASCII, such that non-ASCII
characters appear as "?".

NB: We store the original content with markup, plus a markup-stripped and tokenized version (tokens
are whitespace-separated). Links and quotes are extracted from the stripped and tokenized version.


PHASE E (since 2014-05-22)
==========================

Same as Phase D, but output is now written as proper Unicode, by hard-coding the output encoding as
"UTF-8" in the Java code.
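
The actual fix lives in the Java code, which is not shown here. As a Python analogue of the same
idea, the sketch below passes the encoding explicitly when opening the output file, so that the
result does not depend on the locale settings of the shell that launches the job. The helper name
write_text_utf8 and the file name are made up for illustration.

```python
import os
import tempfile


def write_text_utf8(path: str, text: str) -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")       # hypothetical output file
    write_text_utf8(path, "M\u00e4dchen")
    with open(path, "rb") as f:
        written = f.read()

print(written)   # b'M\xc3\xa4dchen'
```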
Attached Files
- [get | view] (2014-08-08 16:14:15, 5332.2 KB) [[attachment:Main.jar.F4v3-20140521]]
- [get | view] (2014-08-08 16:13:04, 5328.9 KB) [[attachment:Main.jar.F4v4-20140808]]
- [get | view] (2014-09-16 23:02:02, 82133.1 KB) [[attachment:Spinn3rToHadoopWriterV2.jar]]
- [get | view] (2014-09-16 23:10:10, 84977.9 KB) [[attachment:Spinn3rToHadoopWriterV2.tar.gz]]
- [get | view] (2014-09-16 23:02:30, 3.3 KB) [[attachment:copy.sh]]
- [get | view] (2014-08-08 16:26:34, 2.2 KB) [[attachment:copy_spinn3r_to_hdfs.pl]]
- [get | view] (2014-09-16 23:02:46, 0.7 KB) [[attachment:handle_one.sh]]
- [get | view] (2014-08-08 16:33:00, 8.9 KB) [[attachment:notes.txt]]
- [get | view] (2014-09-16 23:02:56, 3.3 KB) [[attachment:run_java.pl]]
- [get | view] (2014-09-16 23:03:17, 90566.0 KB) [[attachment:spinn3rToHadoopAllTogether.tar.gz]]
- [get | view] (2014-08-08 16:26:29, 2231.9 KB) [[attachment:spinn3rhadoop_java.tgz]]
- [get | view] (2014-08-08 16:16:20, 8.4 KB) [[attachment:spinn3rreaderd.tgz.F4v3-20140521]]
- [get | view] (2014-08-08 16:24:58, 2.9 KB) [[attachment:unicode_history.txt]]