attachment:notes.txt of Spinn3rFormat

Attachment 'notes.txt'

   1 PROCESSING OF SPINN3R DATA
   2 ==========================
   3 
   4 The title is now stored twice:
   5 (F) as is, only tabs and newlines replaced by whitespace
   6 (T) HTML-cleaned and tokenized (see below for tokenization details)
   7 
   8 The same is true for the content:
   9 (H) as is, except that tabs are removed and newlines are replaced by "*NL*" (multiple consecutive newlines are collapsed into one); this is useful if we ever realize that we threw out too much information during plain-text extraction; also, HTML markup has been shown to be a useful feature for certain NLP tasks
  10 (C) HTML-cleaned and tokenized (see below for tokenization details)
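
The light-touch normalization of the "as is" fields can be sketched as follows. This is a minimal Python illustration, not the pipeline's actual code; the function names are hypothetical, and treating "\r" as a newline character is an assumption.

```python
import re

def normalize_raw_title(title: str) -> str:
    # Raw title: each tab or newline character becomes a single space.
    return re.sub(r"[\t\r\n]", " ", title)

def normalize_raw_content(html: str) -> str:
    # Raw content (H): tabs are dropped entirely ...
    without_tabs = html.replace("\t", "")
    # ... and every run of consecutive newlines becomes one "*NL*" marker.
    return re.sub(r"(?:\r\n|\r|\n)+", "*NL*", without_tabs)

if __name__ == "__main__":
    print(normalize_raw_title("Out\tto Sea\nQuilt"))
    print(normalize_raw_content("<p>a\tb</p>\n\n\n<p>c</p>"))
```

Because the marker replaces the newline itself, the whole article still fits on one output line while paragraph breaks remain recoverable.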
  11 
  12 HTML cleaning includes:
  13 - removing HTML tags
  14 - decoding HTML entities using org.apache.commons.lang.StringEscapeUtils.unescapeHtml(), e.g., "&amp;" => "&", "&gt;" => ">", "&auml;" => "ä"
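
As a rough illustration of these two cleaning steps, here is a Python sketch that uses the standard library's html.unescape as a stand-in for the Java StringEscapeUtils call. The regex-based tag removal, the replacement of tags by a space, and the final whitespace collapsing are all assumptions made for the example, not a description of the real implementation.

```python
import html
import re

# Naive tag pattern; good enough for a sketch, not for arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")

def clean_html(raw: str) -> str:
    # Step 1: remove HTML tags (replaced by a space so words do not fuse).
    without_tags = TAG_RE.sub(" ", raw)
    # Step 2: decode HTML entities such as &amp;, &gt;, &auml;.
    decoded = html.unescape(without_tags)
    # Assumption: collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", decoded).strip()

if __name__ == "__main__":
    print(clean_html("<b>Tom &amp; Jerry</b> &gt; all"))
```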
  15 
  16 We do not case-fold (i.e., make everything lower case), since capitalization is a useful feature for many steps in the NLP pipeline, such as named-entity recognition and sentence boundary detection.
  17 
  18 Tokenization is done via edu.stanford.nlp.process.PTBTokenizer, from the Stanford CoreNLP library ("JavaNLP"), cf. http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html.
  19 We instantiate it with the following parameters:
  20 tokenizeNLs=%b
  21 americanize=false
  22 normalizeCurrency=false
  23 normalizeParentheses=false
  24 normalizeOtherBrackets=false
  25 unicodeQuotes=false
  26 ptb3Ellipsis=true
  27 escapeForwardSlashAsterisk=false
  28 untokenizable=noneKeep
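
PTBTokenizer accepts its options as a single comma-separated string of key=value pairs. The sketch below assembles such a string from the parameters listed above. It assumes that the "%b" in "tokenizeNLs=%b" is a boolean format placeholder filled in at run time; the function name ptb_options and its argument are hypothetical.

```python
def ptb_options(tokenize_nls: bool) -> str:
    # Options exactly as listed in these notes; only tokenizeNLs varies.
    options = {
        "tokenizeNLs": tokenize_nls,
        "americanize": False,
        "normalizeCurrency": False,
        "normalizeParentheses": False,
        "normalizeOtherBrackets": False,
        "unicodeQuotes": False,
        "ptb3Ellipsis": True,
        "escapeForwardSlashAsterisk": False,
        "untokenizable": "noneKeep",
    }

    def render(value) -> str:
        # Java-style booleans are lower case ("true"/"false").
        return str(value).lower() if isinstance(value, bool) else str(value)

    return ",".join(f"{key}={render(value)}" for key, value in options.items())

if __name__ == "__main__":
    print(ptb_options(tokenize_nls=False))
```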
  29 
  30 PTBTokenizer handles many frequently occurring special cases much better than the previously used, hand-coded ad hoc tokenizer, which simply split on whitespace. Some highlights of why a mature tokenizer is the better choice:
  31 - Punctuation marks are lexed as separate tokens, e.g., "I am." => "I am ."; this is very useful when counting words, since naive whitespace tokenization would treat the "am" in "I am." and "I am good." as separate words, once as "am." and once as "am".
  32 - It includes rules for deciding whether a punctuation mark is a proper punctuation token or belongs to the previous token: e.g., "Mr." is a single token, whereas the period in "I am." should be considered a separate token.
  33 - It recognizes and decomposes certain composites, e.g., "I'm" => "I 'm", "don't" => "do n't", "Bob's" => "Bob 's"; again, this is useful when counting words.
  34 - It produces the output that is expected by other NLP tools down the pipeline, such as for detecting sentence boundaries.
  35 - It treats HTML tags as separate tokens, which makes it very easy to remove them.
  36 - It distinguishes opening from closing quotes, which makes quote extraction somewhat easier; e.g., "\"Wow!\"" => "`` Wow ! ''".
  37 - It recognizes a variety of quotation marks (e.g., '...', "...", ‘...’, “...”, «...»), which is a big advantage over the old code, which recognized only "..."; this was a significant drawback, since major outlets, such as the New York Times, use different quotation marks (NYT uses “...”).
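
The word-counting argument from the first bullet can be made concrete with a toy comparison. The punctuation-aware tokenizer below is a deliberately crude regex stand-in, not PTBTokenizer: it does not know abbreviations such as "Mr.", contractions such as "don't", or quote direction.

```python
import re
from collections import Counter

def whitespace_tokens(text: str) -> list:
    # The old approach: split on whitespace only.
    return text.split()

def punct_aware_tokens(text: str) -> list:
    # Toy stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

if __name__ == "__main__":
    text = "I am. I am good."
    print(Counter(whitespace_tokens(text)))   # "am." and "am" counted apart
    print(Counter(punct_aware_tokens(text)))  # "am" counted twice
```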
  38 
  39 As for quotes, we now extract position information for links, in the form of two numbers: the first number is the 0-based index of the starting position, the second one is the length; indices refer to the plain text.
  40 We recognize two types of links:
  41 (1) URLs mentioned as plain text, without hyperlink tags (<a>); in this case the position information marks the appearance of the plain URL;
  42 (2) proper HTML markup using <a> tags; in this case the position info marks the text between the opening and closing <a> tags; if there is no plain text between these tags, the length will be given as 0.
  43 
  44 Links can appear in the content but also in the title. However, as defined above, position information refers to the plain text. So, when a link appears in the title, position information doesn't make sense and must be output as "undefined". We mark this by giving empty strings instead of integers for position and length, e.g., "L:::http://snap.stanford.edu" if "http://snap.stanford.edu" appears in the title.
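
A small sketch of how such link fields could be written and read back, with empty strings standing for undefined positions. The helper names format_link and parse_link are hypothetical; only the "L:start:length:url" layout comes from these notes.

```python
from typing import Optional, Tuple

def format_link(url: str, start: Optional[int] = None,
                length: Optional[int] = None) -> str:
    # Undefined position/length (link in the title) become empty strings.
    pos = "" if start is None else str(start)
    span = "" if length is None else str(length)
    return f"L:{pos}:{span}:{url}"

def parse_link(field: str) -> Tuple[Optional[int], Optional[int], str]:
    # Split at most three times so the colons inside the URL survive.
    tag, pos, span, url = field.split(":", 3)
    if tag != "L":
        raise ValueError(f"not a link field: {field!r}")
    return (int(pos) if pos else None, int(span) if span else None, url)

if __name__ == "__main__":
    print(format_link("http://snap.stanford.edu"))
    print(parse_link("L:1013:23:http://thebackporchquilters.com/"))
```

Limiting the split to three separators matters because every URL contains at least one further colon.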
  45 
  46 
  47 OUTPUT FORMAT
  48 =============
  49 
  50 Each article is represented by one line of tab-separated columns.
  51 Here is an example output line; for ease of readability, we show each column on a separate line:
  52 
  53 # The URL of the article
  54 U:http://www.karamatsews.com/2013/04/out-to-sea-quilt.html
  55 
  56 # The date
  57 D:2013-04-09T02:26:00Z
  58 
  59 # The title in HTML-cleaned and tokenized form
  60 T:Karamat : Out to Sea Quilt
  61 
  62 # The original title; only change: tabs and newlines replaced by whitespace
  63 F:Karamat: Out to Sea Quilt
  64 
  65 # Plain-text (HTML-cleaned) and tokenized content (see above for details)
  66 C:When Megan moved into her ` big girl ' bed I told her that I would make her a new quilt , with her choice of fabric . I set out a couple of fabric options and she immediately picked Out to Sea . Mermaids and Pirate Girls ... who could resist ! I wanted a pattern with good size pieces so we would n't end up with a quilt full of headless pirates or octopus without tentacles . I ended up picking a free pattern from the Andover website . It uses only 2 blocks , with good size pieces ( 4 '' x 4 '' and 4 '' x 8 '' ) . And one of the blocks is pieced with partial seam construction ... easy to do , and adds a little interest to the layout . The only thing I did different from the pattern was I left off one column ... so rather than an 80 '' x 80 '' quilt , I ended up with a 64 '' x 80 '' quilt ... much better to fit on her bed . Details Fabric : Out to Sea by Sarah Jane for Michael Miller Backing : Essential Dots by Riley Blake Pattern : Frippery Quilt ( available at Andover 's website ) Quilting : Russ @ The Back Porch Quilters
  67 
  68 # The original content; only changes: tabs removed and newlines replaced by "*NL*" (see above)
  69 H: <div class='post-body entry-content' id='post-body-6489500017357713536' itemprop='description articleBody'>When Megan moved into her 'big girl' bed I told her that I would make her a new quilt, with her choice of fabric. I set out a couple of fabric options and she immediately picked Out to Sea. Mermaids and Pirate Girls... who could resist!<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633649686/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8539/8633649686_f11cc3dec8.jpg" width="500" /></a></center> <br /> I wanted a pattern with good size pieces so we wouldn't end up with a quilt full of headless pirates or octopus without tentacles. I ended up picking a free pattern from the Andover website. It uses only 2 blocks, with good size pieces (4" x 4" and 4" x 8"). And one of the blocks is pieced with partial seam construction... easy to do, and adds a little interest to the layout.<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633649274/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8259/8633649274_293c10b086.jpg" width="500" /></a></center> <br /> The only thing I did different from the pattern was I left off one column... so rather than an 80" x 80" quilt, I ended up with a 64" x 80" quilt... 
much better to fit on her bed.<br /> <br /> <center><a href="http://www.flickr.com/photos/37060810@N04/8633648668/" title="Out To Sea Quilt by {Karamat}, on Flickr"><img alt="Out To Sea Quilt" height="334" src="http://farm9.staticflickr.com/8536/8633648668_82c4d07da0.jpg" width="500" /></a></center> <strong><br /> </strong> <strong><br /> </strong> <strong>Details</strong><br /> Fabric: Out to Sea by Sarah Jane for Michael Miller<br /> Backing: Essential Dots by Riley Blake<br /> Pattern: Frippery Quilt (available at Andover's website)<br /> Quilting: Russ @ <a href="http://thebackporchquilters.com/">The Back Porch Quilters</a></div>
  70 
  71 # Links with starting position and length of the text marked up by the <a> tags;
  72 # the first number is the 0-based index of the starting position,
  73 # the second one, the length; indices refer to the plain text (field C)
  74 L:244:0:http://www.flickr.com/photos/37060810@N04/8633649686/
  75 L:641:0:http://www.flickr.com/photos/37060810@N04/8633649274/
  76 L:833:0:http://www.flickr.com/photos/37060810@N04/8633648668/
  77 L:1013:23:http://thebackporchquilters.com/
  78 
  79 # Quotes, again with starting position and length (indices as for the L fields); note that, correctly, no quote is recognized in "4\" x 4\"":
  80 Q:28:8:big girl
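
To show how the columns fit together, here is a minimal reader for one output line that also checks that each Q field's position and length really select the quoted text in field C. It assumes that repeated fields (L, Q) appear as separate tab-separated columns on the same line; the function names are hypothetical.

```python
def parse_record(line: str) -> dict:
    # Map each one-letter tag to the list of values carrying that tag.
    record = {}
    for column in line.rstrip("\n").split("\t"):
        tag, _, value = column.partition(":")
        record.setdefault(tag, []).append(value)
    return record

def quote_spans_match(record: dict) -> bool:
    # Each Q value is "start:length:text"; indices are 0-based into C.
    content = record["C"][0]
    for quote in record.get("Q", []):
        start, length, text = quote.split(":", 2)
        begin = int(start)
        if content[begin:begin + int(length)] != text:
            return False
    return True

if __name__ == "__main__":
    line = "\t".join([
        "U:http://www.karamatsews.com/2013/04/out-to-sea-quilt.html",
        "T:Karamat : Out to Sea Quilt",
        "C:When Megan moved into her ` big girl ' bed",
        "Q:28:8:big girl",
    ])
    record = parse_record(line)
    print(record["U"][0], quote_spans_match(record))
```

With the shortened C field above, characters 28 through 35 are exactly "big girl", matching the Q field of the example.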
  81 
  82 
  83 PERFORMANCE
  84 ===========
  85 
  86 The new code takes about 2.7 times as long as Klemen's code.
  87 I suspect that much of this overhead arises because we now write about twice as much data to disk as his code did (content and title are kept in original as well as cleaned and tokenized form).
  88 
  89 > time ./convert-spinn3r.py 2013 04 11 02 &> /dev/null
  90 
  91 real  1m3.373s
  92 user  1m9.200s
  93 sys   0m4.156s
  94 
  95 > time ./convert-spinn3r_klemen.py 2013 04 11 02 &> /dev/null 
  96 
  97 real  0m23.799s
  98 user  0m23.165s
  99 sys   0m3.316s
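
The factor of about 2.7 follows from the two wall-clock ("real") timings above; a quick check, with a hypothetical helper for parsing the output format of "time":

```python
def to_seconds(timing: str) -> float:
    # Convert a timing such as "1m3.373s" into seconds.
    minutes, seconds = timing.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

new_code = to_seconds("1m3.373s")   # convert-spinn3r.py
old_code = to_seconds("0m23.799s")  # convert-spinn3r_klemen.py
ratio = new_code / old_code

print(round(ratio, 2))  # prints 2.66
```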
 100 
 101 
 102 ISSUES
 103 ======
 104 
 105 Although we now cover many more kinds of quotation marks than before, we still do not handle everything correctly; for instance, some languages have special conventions.
 106 Examples of quotation marks that are currently mapped incorrectly or left unnormalized:
 107 	»...« => '' ... ``
 108 	„...” => „ ... ''
 109 	‚...’ => ‚ ... '
 110 

Attached Files

  • [get | view] (2014-08-08 16:14:15, 5332.2 KB) [[attachment:Main.jar.F4v3-20140521]]
  • [get | view] (2014-08-08 16:13:04, 5328.9 KB) [[attachment:Main.jar.F4v4-20140808]]
  • [get | view] (2014-09-16 23:02:02, 82133.1 KB) [[attachment:Spinn3rToHadoopWriterV2.jar]]
  • [get | view] (2014-09-16 23:10:10, 84977.9 KB) [[attachment:Spinn3rToHadoopWriterV2.tar.gz]]
  • [get | view] (2014-09-16 23:02:30, 3.3 KB) [[attachment:copy.sh]]
  • [get | view] (2014-08-08 16:26:34, 2.2 KB) [[attachment:copy_spinn3r_to_hdfs.pl]]
  • [get | view] (2014-09-16 23:02:46, 0.7 KB) [[attachment:handle_one.sh]]
  • [get | view] (2014-08-08 16:33:00, 8.9 KB) [[attachment:notes.txt]]
  • [get | view] (2014-09-16 23:02:56, 3.3 KB) [[attachment:run_java.pl]]
  • [get | view] (2014-09-16 23:03:17, 90566.0 KB) [[attachment:spinn3rToHadoopAllTogether.tar.gz]]
  • [get | view] (2014-08-08 16:26:29, 2231.9 KB) [[attachment:spinn3rhadoop_java.tgz]]
  • [get | view] (2014-08-08 16:16:20, 8.4 KB) [[attachment:spinn3rreaderd.tgz.F4v3-20140521]]
  • [get | view] (2014-08-08 16:24:58, 2.9 KB) [[attachment:unicode_history.txt]]