Attachment 'unicode_history.txt'

This is not entirely conclusive, but my archæological endeavors hint towards the following as a
potential history of character encoding of the Spinn3r data.
(Author: Bob West, 2014-06-06)


PHASE A (until 2010-07-13)
==========================

(1) Spinn3r's data, which was probably UTF-8-encoded, was read as Latin-1 (a.k.a. ISO-8859-1).
UTF-8 may use several bytes per character, while Latin-1 always uses exactly one byte per character.
That is, a single non-ASCII character from the original data now looks like two or more characters.
E.g., Unicode code point U+00E4 ("latin small letter a with diaeresis", a.k.a. "ä") is represented
by the two-byte code C3A4 in UTF-8. Reading the bytes C3A4 as Latin-1 results in the two-character
sequence "Ã¤", since C3 encodes "Ã" in Latin-1, and A4, "¤".
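
The misreading in step (1) can be reproduced in a few lines of Python. This is only an
illustrative sketch of the effect, not the code that actually produced the data; the variable
names are made up for the example.

```python
# Reproduce step (1): UTF-8 bytes interpreted as Latin-1.
original = "\u00e4"                      # "ä", U+00E4
utf8_bytes = original.encode("utf-8")    # the two bytes C3 A4
misread = utf8_bytes.decode("latin-1")   # one character per byte: "Ã" then "¤"

print(utf8_bytes.hex())                  # hex form of the UTF-8 bytes
print([f"U+{ord(c):04X}" for c in misread])
```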

(2) Then, case-folding (lower-casing) was performed on the garbled text, making it even more garbled.
E.g., "Ã¤" became "ã¤".

(3) The data was written to disk as UTF-8.
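
Steps (1) to (3) together can be simulated as follows. This is a sketch under the assumption
that the pipeline behaved exactly as hypothesized above; simulate_phase_a is a made-up name,
not a function from the Spinn3r code.

```python
def simulate_phase_a(text: str) -> bytes:
    """Hypothetical re-enactment of Phase A for a single string."""
    misread = text.encode("utf-8").decode("latin-1")  # step (1): UTF-8 read as Latin-1
    lowered = misread.lower()                         # step (2): lower-casing
    return lowered.encode("utf-8")                    # step (3): written as UTF-8

on_disk = simulate_phase_a("\u00e4")     # start from "ä"
as_seen = on_disk.decode("utf-8")        # what a UTF-8 reader sees today: "ã¤"
```

Note that the single character "ä" ends up as four bytes on disk (C3 A3 C2 A4), which a UTF-8
reader decodes as the two characters "ã¤".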

Approximate solution:
Take the debugging table from http://www.i18nqa.com/debug/utf8-debug.html, look for the garbled
and lower-cased sequences, and replace them with their original characters.
NB: The garbling is not bijective, but since most of the garbled sequences are highly unlikely to
occur in genuine text (e.g., "ã¤"), this should be mostly fine.
NB: Since we replace several characters ("ã¤") by a single character ("ä"), the indices of links
and quotes might be off. But I think fixing the encoding is worth more than keeping the indices
precise.
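
A minimal sketch of such a repair, in Python. Instead of copying the table from the web page, it
regenerates the garbled sequences for the code points U+00A0 to U+00FF by replaying the
hypothesized Phase A steps. Assumptions: the misreading was strict Latin-1 (if it was actually
Windows-1252, the entries for bytes 0x80 to 0x9F would differ), and only two-byte UTF-8
characters are covered. build_table and repair are names invented for this example.

```python
import re

def build_table(first=0x00A0, last=0x00FF):
    """Map each garbled, lower-cased sequence back to its original character."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        garbled = ch.encode("utf-8").decode("latin-1").lower()
        table.setdefault(garbled, ch)   # on a collision, keep the first candidate
    return table

TABLE = build_table()
# Try longer sequences first so a short key never shadows a longer one.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(TABLE, key=len, reverse=True))
)

def repair(text):
    """Replace every known garbled sequence by its original character."""
    return PATTERN.sub(lambda m: TABLE[m.group(0)], text)

broken = "m\u00e3\u00a4dchen"            # "mã¤dchen"
fixed = repair(broken)                   # "mädchen"
shift = len(broken) - len(fixed)         # offsets after the fix move by this much
```

Each replacement shortens the text by one character, which is exactly why link and quote indices
computed on the garbled text no longer line up afterwards. Genuine occurrences of a sequence such
as "ã¤" would be rewritten too, so this remains an approximation.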


PHASE B (2010-07-14 to 2010-07-26)
==================================

For about two weeks, the data seems to have been read as UTF-8 and written as Latin-1 (i.e., the
other way round from Phase A).

Non-Latin-1 characters are printed as "?". However, there also seem to be a few cases garbled as
in Phase A, e.g., the second document in
/afs/cs/group/infolab/datasets/snap-private/spinn3r/spinn3r-full5/web/2010-07/web-2010-07-15T00-00-00Z.rar

Approximate solution:
Simply read the data as Latin-1.
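
The Phase B behavior and its workaround can be sketched in Python as follows. The assumption is
that the writer substituted "?" for anything outside Latin-1, which is what Python's "replace"
error handler also does; simulate_phase_b is a made-up name.

```python
def simulate_phase_b(text: str) -> bytes:
    """Hypothetical Phase B writer: correct text, written out as Latin-1."""
    return text.encode("latin-1", errors="replace")   # non-Latin-1 becomes "?"

raw = simulate_phase_b("\u00e4 and \u20ac")   # "ä and €"
recovered = raw.decode("latin-1")             # workaround: read as Latin-1

# Reading the same bytes as UTF-8 fails, since a lone E4 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Latin-1 characters such as "ä" survive the round trip, while "€" is lost for good.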


PHASE C (2010-07-27 to 2013-04-28)
==================================

Data was written as ASCII, such that all non-ASCII characters (including the non-ASCII Latin-1
characters) appear as "?".

Approximate solution:
None. We simply need to byte (haha...) the bullet and deal with the question marks.
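
To see why no lookup table can help here, consider this small Python sketch (simulate_phase_c is
a made-up name, and the "replace" error handler stands in for whatever the original writer did):

```python
def simulate_phase_c(text: str) -> str:
    """Hypothetical Phase C writer: every non-ASCII character becomes "?"."""
    return text.encode("ascii", errors="replace").decode("ascii")

a = simulate_phase_c("M\u00e4dchen")   # "Mädchen"
b = simulate_phase_c("M\u00fcdchen")   # "Müdchen"
```

Both inputs come out as "M?dchen", so the original character cannot be recovered from the output
alone.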


PHASE D (2013-04-29 to 2014-05-21)
==================================

Bob's version of the ProtostreamParser is used, i.e., capitalization and HTML markup are kept.

However, due to a bad BASH environment variable, data was written as ASCII, such that non-ASCII
characters appear as "?".

NB: We store the original content with markup, plus a markup-stripped and tokenized version (tokens
are whitespace-separated). Links and quotes are extracted from the stripped and tokenized version.


PHASE E (since 2014-05-22)
==========================

Same as Phase D, but output is now written as proper Unicode, by hard-coding the output encoding as
"UTF-8" in the Java code.
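
The actual fix lives in the Java code, which is not shown here. As a rough analogue of the same
idea in Python: name the encoding explicitly when opening the output file, so that the bytes
written no longer depend on the environment. write_utf8 and the file name are made up for this
sketch.

```python
import os
import tempfile

def write_utf8(path, text):
    # The encoding is fixed here, so locale settings such as LANG or LC_ALL
    # cannot change which bytes end up on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    write_utf8(path, "\u00e4")           # "ä"
    with open(path, "rb") as f:
        written = f.read()               # the two bytes C3 A4
```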
