deler

deler is a tool for simple and accountable segmentation of marked-up text. It wraps any existing text segmentation tool whose output can be mapped so that segments are delimited by newlines. deler works by converting the marked-up input to plain text and tracking the alignment between the two. For details, please see:

    @InProceedings{Rea:Dri:Oep:13,
      author    = {Read, Jonathon and Dridan, Rebecca and Oepen, Stephan},
      title     = {Simple and Accountable Segmentation of Marked-up Text},
      booktitle = {Proceedings of the 19th Nordic Conference
                   on Computational Linguistics},
      month     = {May},
      year      = {2013},
      address   = {Oslo, Norway},
      url       = {http://www.ep.liu.se/ecp/085/033/ecp1385033.pdf}
    }
        

For the motivation behind deler, please see our previous work on the creation of the WeSearch Data Collection (Read, Flickinger, Dridan, Oepen and Øvrelid, 2012) and our survey of the state of the art in sentence segmentation (Read, Dridan, Oepen and Solberg, 2012).

Obtaining deler

deler can be obtained through Subversion, at svn.delph-in.net/deler/trunk

Setting up

deler has a number of prerequisites:

  1. Python (version 2.7 recommended)
  2. some external tool for segmentation that outputs each segment on a new line. tokenizer is recommended — see examples/tokenizer for our invocation.
  3. a configuration file that specifies how to handle elements — see examples/html-wdc.xml for an example, and more information.
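
To illustrate the second prerequisite, here is a minimal sketch of a segmenter that writes each segment on its own line. It is not the recommended tokenizer and makes no attempt to handle abbreviations. The function names are hypothetical, and the README does not say how deler passes text to the segmenter, so reading from one stream and writing to another is an assumption (written for Python 3; see examples/tokenizer for the invocation deler is known to work with).

```python
import re

# Naive rule: a boundary is any whitespace that follows ., ! or ?
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_segments(text):
    """Return a list of segments with internal whitespace normalised."""
    pieces = BOUNDARY.split(text.strip())
    # Collapse line breaks inside a segment so each one stays on one line.
    return [" ".join(piece.split()) for piece in pieces if piece]


def segment_stream(src, dst):
    """Read all text from src and write one segment per line to dst."""
    for segment in split_segments(src.read()):
        dst.write(segment + "\n")


# In a standalone script this would be called as
# segment_stream(sys.stdin, sys.stdout), after importing sys.
```

Whether normalising whitespace inside a segment interferes with deler's alignment tracking is not covered here, so treat this only as an illustration of the one-segment-per-line output format.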

Running

    usage: deler.py [-h] [--accounting] [--config CONFIG] [--gml-mode] [--paragraph-mode]
                    [--post_start POST_START] [--segmenter SEGMENTER] [--validate]
                    [files [files ...]]

    positional arguments:
      files                 a list of files to segment

    optional arguments:
      -h, --help            show this help message and exit
      --accounting          output an account of modifications made to the original segment
      --config CONFIG       configuration xml
      --gml-mode            output gml instead of the input markup
      --paragraph-mode      force segmentation at double newlines
      --post_start POST_START
                            regex to extract the tag that indicates the start of
                            the post
      --segmenter SEGMENTER
                            path to segmenter executable
        

For example:

    ./deler.py \
        --accounting \
        --config examples/html-wdc.xml \
        --gml-mode \
        --paragraph-mode \
        --segmenter examples/tokenizer \
        examples/test.html
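
The same invocation can be scripted. The sketch below only assembles options that appear in the usage text above. The helper names build_command and run_deler are hypothetical, the default script path is taken from the example, and it is an assumption that deler signals failure through a non-zero exit status.

```python
import subprocess


def build_command(files, config, segmenter, accounting=False,
                  gml_mode=False, paragraph_mode=False,
                  script="./deler.py"):
    """Assemble an argument list using options from the usage text."""
    command = [script, "--config", config, "--segmenter", segmenter]
    if accounting:
        command.append("--accounting")
    if gml_mode:
        command.append("--gml-mode")
    if paragraph_mode:
        command.append("--paragraph-mode")
    command.extend(files)
    return command


def run_deler(files, **options):
    """Run deler; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_command(files, **options))
```

Calling run_deler(["examples/test.html"], config="examples/html-wdc.xml", segmenter="examples/tokenizer", accounting=True) would then correspond to the command line shown above, minus the two mode flags.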
        

Output

For each input file, the tool produces an output file with a .deler extension. This is the output for examples/test.html without --accounting specified; each line corresponds to a processed segment:

    The name ⌊∗Clanfield∗⌋ is derived from the ⌊>Old English>⌋and means “field clean of weeds”.
    Clanfield was historically a small ⌊>farming>⌋ community.
         

If --accounting is specified, there are two lines for each segment:

    The name ⌊∗Clanfield∗⌋ is derived from the ⌊>Old English>⌋and means “field clean of weeds”.
    0 104 @0-"<p>"  @12+"⌊∗"  @12-"<b>" @24+"∗⌋"  @24-"</b>"  @49+"⌊>"  @49-"<a>" @63+">⌋"  @63-"</a>"  @100-"</p>"
    Clanfield was historically a small ⌊>farming>⌋ community.
    105 172 @0-"<p>"  @38+"⌊>"  @38-"<a>" @48+">⌋"  @48-"</a>"  @63-"</p>"
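
Since the output alternates between a segment line and an account line, a file written with --accounting can be read back in pairs. This is a minimal sketch: read_accounted is a hypothetical helper, and UTF-8 encoding of the output file is an assumption.

```python
import io


def read_accounted(path):
    """Return (segment, account) pairs from a file written with --accounting."""
    with io.open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    if len(lines) % 2 != 0:
        raise ValueError("expected two lines per segment")
    # Even-numbered lines are segments, odd-numbered lines their accounts.
    return [(lines[i], lines[i + 1]) for i in range(0, len(lines), 2)]
```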
        

The first line is the segment itself. The second is the account of the segment, with fields delimited by tabs. The first and second fields are the start and end character offsets in the input file that correspond to the segment. Each subsequent field is an account of one modification made to the original segment, and matches the regular expression:

    /@(\d+)([+-])"(.+)"/
        

where the capturing groups correspond to:

  1. the character offset of this action (relative to the segment start)
  2. + (indicating insertion) or - (indicating removal)
  3. a Unicode string (where whitespace is escaped) indicating what was inserted or removed
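
The account format can be read with a few lines of Python. The sketch below parses an account line using the regular expression above and then undoes the recorded actions to rebuild the original markup. The function names are hypothetical and the code targets Python 3. Two things are assumptions inferred from the sample output rather than documented behaviour: that offsets count characters of the original text, and that actions are listed in offset order. It also does not unescape whitespace in the third group, because the escaping scheme is not described here.

```python
import re

# Pattern for one modification field, as given above.
ACTION = re.compile(r'@(\d+)([+-])"(.+)"')


def parse_account(line):
    """Split a tab-delimited account line into (start, end, actions)."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[0]), int(fields[1])
    actions = []
    for field in fields[2:]:
        match = ACTION.fullmatch(field)
        if match is None:
            raise ValueError("malformed account field: %r" % field)
        offset, sign, text = match.groups()
        actions.append((int(offset), sign, text))
    return start, end, actions


def restore_original(segment, actions):
    """Rebuild the original markup (assumes original-text offsets)."""
    out = []
    orig_pos = 0  # characters of the original rebuilt so far
    seg_pos = 0   # characters of the segment consumed so far
    for offset, sign, text in actions:
        gap = offset - orig_pos
        if gap < 0:
            raise ValueError("actions are not in offset order")
        # Text between two actions is unchanged, so copy it across.
        out.append(segment[seg_pos:seg_pos + gap])
        seg_pos += gap
        orig_pos += gap
        if sign == "-":
            # Removed from the original: put it back.
            out.append(text)
            orig_pos += len(text)
        else:
            # Inserted into the segment: skip over it.
            if not segment.startswith(text, seg_pos):
                raise ValueError("segment does not match account")
            seg_pos += len(text)
    out.append(segment[seg_pos:])
    return "".join(out)


segment = 'Clanfield was historically a small ⌊>farming>⌋ community.'
account = "\t".join(['105', '172', '@0-"<p>"', '@38+"⌊>"', '@38-"<a>"',
                     '@48+">⌋"', '@48-"</a>"', '@63-"</p>"'])
start, end, actions = parse_account(account)
print(restore_original(segment, actions))
# prints: <p>Clanfield was historically a small <a>farming</a> community.</p>
```

For this sample the rebuilt string is 67 characters long, which equals the difference between the end and start offsets (172 - 105), so the offset interpretation is at least consistent with the example.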

Acknowledgements

Work on deler was carried out in the Language Technology Group at the University of Oslo as part of the WeSearch project, funded by the Norwegian Research Council through its VerdIKT programme.

Contact

Jonathon Read
Last updated 10 May 2013

License

Copyright (c) 2013 Jonathon Read

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.