Client F4

Processing of Spinn3r data

Original notes by Bob: notes.txt

The title is now stored twice: (F) as is, only tabs and newlines replaced by whitespace; (T) HTML-cleaned and tokenized (see below for tokenization details)

The same is true for the content: (H) as is, only tabs are removed and newlines replaced by "*NL*" (multiple consecutive newlines are collapsed into one); this is useful if we ever realize that we threw out too much info during plain-text extraction, and HTML markup has also been shown to be a useful feature for certain NLP tasks; (C) HTML-cleaned and tokenized (see below for tokenization details)

HTML cleaning includes:

  • removing HTML tags
  • decoding HTML entities using org.apache.commons.lang.StringEscapeUtils.unescapeHtml(), e.g., "&amp;" => "&", "&gt;" => ">", "&auml;" => "ä"
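
A minimal sketch of these two steps (the tag-stripping regex is an assumption for illustration; the actual cleaner may use a proper HTML parser):

import org.apache.commons.lang.StringEscapeUtils;

public class HtmlCleanDemo {
    public static void main(String[] args) {
        String html = "<b>4 &gt; 3</b> &amp; M&auml;rchen";
        // Naive tag removal, for illustration only
        String noTags = html.replaceAll("<[^>]*>", "");
        // Entity decoding via commons-lang, as described above
        String clean = StringEscapeUtils.unescapeHtml(noTags);
        System.out.println(clean);  // 4 > 3 & Märchen
    }
}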

We do not case-fold (i.e., make everything lower case), since capitalization is a useful feature for many steps in the NLP pipeline, such as named-entity recognition and sentence boundary detection.

Tokenization is done via edu.stanford.nlp.process.PTBTokenizer, from the Stanford Core NLP library ("JavaNLP"), cf. http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html. We instantiate it with the following parameters:

tokenizeNLs=%b
americanize=false
normalizeCurrency=false
normalizeParentheses=false
normalizeOtherBrackets=false
unicodeQuotes=false
ptb3Ellipsis=true
escapeForwardSlashAsterisk=false
untokenizable=noneKeep
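
For concreteness, here is a minimal sketch of instantiating the tokenizer with these options ("%b" above appears to be a format-string placeholder filled in by the caller; we set tokenizeNLs=true here):

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;

public class TokenizeDemo {
    public static void main(String[] args) {
        String options = "tokenizeNLs=true,americanize=false,normalizeCurrency=false,"
                + "normalizeParentheses=false,normalizeOtherBrackets=false,"
                + "unicodeQuotes=false,ptb3Ellipsis=true,"
                + "escapeForwardSlashAsterisk=false,untokenizable=noneKeep";
        PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(
                new StringReader("I'm \u201CWow!\u201D"),
                new CoreLabelTokenFactory(), options);
        while (tokenizer.hasNext()) {
            System.out.print(tokenizer.next().word() + " ");  // expected: I 'm `` Wow ! ''
        }
    }
}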

PTBTokenizer results in much better handling of special cases that often arise. Some highlights of why a mature tokenizer is better than the previously used, hand-coded "ghetto tokenizer" that simply splits on whitespace:

  • Punctuation marks are lexed as separate tokens, e.g., "I am." => "I am ."; this is very useful when counting words, since naive whitespace tokenization would treat the "am" in "I am." and "I am good." as separate words, once as "am." and once as "am".

  • It includes rules for deciding whether a punctuation mark is a proper punctuation token or belongs to the previous token: e.g., "Mr." is a single token, whereas the period in "I am." should be considered a separate token.
  • It recognizes and decomposes certain composites, e.g., "I'm" => "I 'm", "don't" => "do n't", "Bob's" => "Bob 's"; again, this is useful when counting words.

  • It produces the output that is expected by other NLP tools down the pipeline, such as for detecting sentence boundaries.
  • It treats HTML tags as separate tokens, which makes it very easy to remove them.
  • It distinguishes opening from closing quotes, which makes quote extraction somewhat easier; e.g., "\"Wow!\"" => "`` Wow ! ''".

  • It recognizes a variety of quotation marks (e.g., '...', "...", ‘...’, “...”, «...»), which is a big advantage over the old code, which recognized only "..."; this was a significant drawback, since major outlets, such as the New York Times, use different quotation marks (NYT uses “...”).

As for quotes, we now extract position information for links, in the form of two numbers: the first number is the 0-based index of the starting position; the second, the length. Indices refer to the plain text. We recognize two types of links:

  1. URLs mentioned as plain text, without hyperlink tags (<a>); in this case the position information marks the appearance of the plain URL;

  2. proper HTML markup using <a> tags; in this case the position info marks the text within the opening and closing <a> tags; if there is no plain text between these tags, the length will be given as 0.

Links can appear in the content but also in the title. However, as defined above, position information refers to the plain text. So, when a link appears in the title, position information doesn't make sense and must be output as "undefined". We mark this by giving empty strings instead of integers for position and length, e.g., "L:::http://snap.stanford.edu" if "http://snap.stanford.edu" appears in the title.

Output format

Each article is represented by one line of tab-separated columns. Here is an example output line; for ease of readability, we show each column on a separate line:

# The URL of the article
U:http://www.karamatsews.com/2013/04/out-to-sea-quilt.html

# The date
D:2013-04-09T02:26:00Z

# The title in HTML-cleaned and tokenized form
T:Karamat : Out to Sea Quilt

# The original title; only change: tabs and newlines removed
F:Karamat: Out to Sea Quilt

# Plain-text (HTML-cleaned) and tokenized content (see above for details)
C:When Megan moved into her ` big girl ' bed I told her that I would make her a new quilt , with her choice of fabric . I set out a couple of fabric options and she immediately picked Out to Sea . Mermaids and Pirate Girls ... who could resist ! I wanted a pattern with good size pieces so we would n't end up with a quilt full of headless pirates or octopus without tentacles . I ended up picking a free pattern from the Andover website . It uses only 2 blocks , with good size pieces ( 4 '' x 4 '' and 4 '' x 8 '' ) . And one of the blocks is pieced with partial seam construction ... easy to do , and adds a little interest to the layout . The only thing I did different from the pattern was I left off one column ... so rather than an 80 '' x 80 '' quilt , I ended up with a 64 '' x 80 '' quilt ... much better to fit on her bed . Details Fabric : Out to Sea by Sarah Jane for Michael Miller Backing : Essential Dots by Riley Blake Pattern : Frippery Quilt ( available at Andover 's website ) Quilting : Russ @ The Back Porch Quilters

# The original content; only change: tabs and newlines removed
H: <div class='post-body entry-content' id='post-body-6489500017357713536' itemprop='description articleBody'>When Megan moved into her 'big girl' bed I told her that I would make her a new quilt, with her choice of fabric. I set out a couple of fabric options and she immediately picked Out to Sea. Mermaids and Pirate Girls... who could resist!<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633649686/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8539/8633649686_f11cc3dec8.jpg" width="500" /></a></center> <br /> I wanted a pattern with good size pieces so we wouldn't end up with a quilt full of headless pirates or octopus without tentacles. I ended up picking a free pattern from the Andover website. It uses only 2 blocks, with good size pieces (4" x 4" and 4" x 8"). And one of the blocks is pieced with partial seam construction... easy to do, and adds a little interest to the layout.<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633649274/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8259/8633649274_293c10b086.jpg" width="500" /></a></center> <br /> The only thing I did different from the pattern was I left off one column... so rather than an 80" x 80" quilt, I ended up with a 64" x 80" quilt... much better to fit on her bed.<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633648668/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8536/8633648668_82c4d07da0.jpg" width="500" /></a></center> <strong><br /> </strong> <strong><br /> </strong> <strong>Details</strong><br /> Fabric: Out to Sea by Sarah Jane for Michael Miller<br /> Backing: Essential Dots by Riley Blake<br /> Pattern: Frippery Quilt (available at Andover's website)<br /> Quilting: Russ @ <a href="http://thebackporchquilters.com/">The Back Porch Quilters</a></div>

# Links with starting position and length of the text marked up by the <a> tags;
# the first number is the 0-based index of the starting position,
# the second one, the length; indices refer to the plain text (field C)
L:244:0:http://www.flickr.com/photos/37060810@N04/8633649686/
L:641:0:http://www.flickr.com/photos/37060810@N04/8633649274/
L:833:0:http://www.flickr.com/photos/37060810@N04/8633648668/
L:1013:23:http://thebackporchquilters.com/

# Quotes, again with starting position and length (indices as for the L fields); note that, correctly, no quote is recognized in "4\" x 4\"":
Q:28:8:big girl
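
For illustration, a minimal sketch of reading such a line back in (LineParser is a hypothetical helper, not part of the pipeline): split on tabs, then on the first colon to get the field keyword; repeatable fields (L, Q) collect multiple values.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LineParser {
    // Maps each field keyword (U, D, T, F, C, H, L, Q) to its values
    public static Map<String, List<String>> parse(String line) {
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        for (String column : line.split("\t", -1)) {
            int colon = column.indexOf(':');
            String key = column.substring(0, colon);
            List<String> values = fields.get(key);
            if (values == null) {
                values = new ArrayList<String>();
                fields.put(key, values);
            }
            values.add(column.substring(colon + 1));
        }
        return fields;
    }
}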

Performance

The new code takes about 2.7 times as long as Klemen's code. I suspect that much of this overhead comes from the fact that we are now outputting about twice as much data to disk as he did (content and title are kept in original as well as cleaned and tokenized form).

> time ./convert-spinn3r.py 2013 04 11 02 &> /dev/null

real  1m3.373s
user  1m9.200s
sys   0m4.156s

> time ./convert-spinn3r_klemen.py 2013 04 11 02 &> /dev/null 

real  0m23.799s
user  0m23.165s
sys   0m3.316s

Issues

Although our coverage of quotations is much larger than before, we're still not getting everything; for instance, some languages have special conventions. Examples:

        »...« => '' ... ``
        „...” => „ ... ''
        ‚...’ => ‚ ... '

Spinn3r to Hadoop

Changes by Niko

  • Decoding HTML entities using org.apache.commons.lang.StringEscapeUtils.unescapeHtml() (see above for description) was added to both title T and content C, since some of the older versions still contain some of these characters.

  • There is no language detection threshold. We store all detected languages, since there are usually just a few of them. Besides, for each language we also store its probability as returned by the language detection algorithm.
  • The ratio of useful characters is also stored, together with a T/F flag indicating whether this ratio is above 0.8. The keyword for this field is "G".
  • Removed some logging messages (duplicate documents, no features in text, ...) since the log files were just too large. For each file we count the number of duplicates and log it.

Final code used for uploading files to hadoop:

 * [[attachment:Spinn3rToHadoopWriterV2.tar.gz]] - Java source code; when compiled, the resulting JAR is the file on the next line
 * [[attachment:Spinn3rToHadoopWriterV2.jar]] - compiled Java code as a runnable JAR
 * [[attachment:copy.sh]] - the main script
 * [[attachment:handle_one.sh]] - helper script
 * [[attachment:run_java.pl]] - helper script
 * [[attachment:spinn3rToHadoopAllTogether.tar.gz]] - archive containing these files as well as some logs and progress messages

UTF-8 history

Original text: unicode_history.txt.

This is not entirely conclusive, but my archæological endeavors hint towards the following as a potential history of character encoding of the Spinn3r data. (Author: Bob West, 2014-06-06)

PHASE A (until 2010-07-13)

(1) Spinn3r's probably UTF-8-encoded data was read as Latin-1 (a.k.a. ISO-8859-1). UTF-8 has potentially several bytes per character, while Latin-1 always has one byte per character. That is, a single character from the original data now looks like two characters. E.g., Unicode code point U+00E4 ("latin small letter a with diaeresis", a.k.a. "ä") is represented by the two-byte code C3A4 in UTF-8. Reading the bytes C3A4 as Latin-1 results in the two-character sequence "Ã¤", since C3 encodes "Ã" in Latin-1, and A4, "¤".

(2) Then, case-folding was performed on the garbled text, making it even more garbled. E.g., "Ã¤" became "ã¤".

(3) The data was written to disk as UTF-8.
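
The following sketch reproduces this hypothesized garbling in Java (using the modern StandardCharsets API for brevity):

import java.nio.charset.StandardCharsets;

public class PhaseADemo {
    public static void main(String[] args) {
        String original = "ä";  // U+00E4
        // (1) UTF-8 bytes C3A4 misread as Latin-1 => "Ã¤"
        String misread = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        // (2) case-folding => "ã¤"
        String folded = misread.toLowerCase();
        // (3) written to disk as UTF-8
        System.out.println(folded);  // prints "ã¤"
    }
}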

Approximate solution: Take the debugging table from http://www.i18nqa.com/debug/utf8-debug.html, look for the garbled and lower-cased sequences and replace them with their original character. NB: The garbling is not bijective, but since most of the garbled sequences are highly unlikely (e.g., "ã¤"), this should be mostly fine. NB: Since we replace several characters ("ã¤") with a single character ("ä"), the indices of links and quotes might be off. But I think fixing the encoding is worth more than keeping the indices precise.
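
A minimal sketch of this replacement (the table entries below are a tiny hypothetical excerpt; the full table comes from the URL above):

import java.util.LinkedHashMap;
import java.util.Map;

public class MojibakeFixer {
    // Garbled, lower-cased sequence => original character
    private static final Map<String, String> TABLE = new LinkedHashMap<String, String>();
    static {
        TABLE.put("ã¤", "ä");
        TABLE.put("ã¶", "ö");
        TABLE.put("ã¼", "ü");
        // ... remaining entries from the i18nqa debugging table
    }

    public static String fix(String s) {
        for (Map.Entry<String, String> e : TABLE.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }
}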

PHASE B (2010-07-14 to 2010-07-26)

For just about 2 weeks, the data seems to have been read as UTF-8 and written as Latin-1 (i.e., the reverse of Phase A).

Non-Latin-1 characters are printed as "?". However, there also seem to be a few cases like those in Phase A, e.g., the second document in /afs/cs/group/infolab/datasets/snap-private/spinn3r/spinn3r-full5/web/2010-07/web-2010-07-15T00-00-00Z.rar

Approximate solution: Simply read the data as Latin-1.
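
In Java, this workaround amounts to forcing the decoder (a sketch; the path handling is illustrative):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class PhaseBReader {
    // Read Phase-B files as Latin-1 instead of UTF-8
    public static BufferedReader open(String path) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "ISO-8859-1"));
    }
}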

PHASE C (2010-07-27 to 2013-04-28)

Data was written as ASCII, such that all non-ASCII characters (including Latin-1 characters) appear as "?".

Approximate solution: None. We simply need to byte (haha...) the bullet and deal with the question marks.

PHASE D (2013-04-29 to 2014-05-21)

Bob's version of the ProtostreamParser, i.e., capitalization and HTML markup are kept.

However, due to a bad BASH environment variable, data was written as ASCII, such that non-ASCII characters appear as "?".

NB: We store the original content with markup, plus a markup stripped and tokenized version (tokens are whitespace-separated). Links and quotes are extracted from the stripped and tokenized version.

PHASE E (since 2014-05-22)

Same as Phase D, but output is now written as proper Unicode, by hard-coding the output encoding as "UTF-8" in the Java code.
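
A sketch of what hard-coding the output encoding looks like (illustrative; the actual code may set it on a different writer):

import java.io.PrintStream;

public class Utf8Output {
    public static void main(String[] args) throws Exception {
        // Hard-code UTF-8 instead of relying on the platform default,
        // which a bad shell environment (cf. Phase D) can silently change
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("ä ö ü “quotes” «guillemets»");
    }
}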