Event mention detection and coreference evaluators, with associated utilities (converters, validators).
This repository provides file conversion and scoring for event mention detection. It consists of the following three pieces of code:
To use the software, first prepare the CMU format annotation file from the Brat annotation output using “brat2tbf.py”. The scorer then takes two documents in this format, one as gold standard data and one as system output. The scorer also needs the token files produced by the tokenizer. The usage of each tool is described below.
Use the example shell script “example_run.sh” to perform all of the above steps on the sample documents. If it succeeds, you will find the scoring results in the example_data directory.
Most utility code can be found in the util directory.
The following scripts find corresponding files by docid and file extension, so the file extension must be provided exactly. The scripts have default values for these extensions, but may require additional arguments if the extensions are changed.
Here is how to find the extension:
Brat annotation files normally have the following name:
<docid>.ann
In this case, the file extension is “.ann”, which the converter assumes as the default extension. If yours differs, change it with the “-ae” argument.
In past evaluations, tokenization tables were provided. They normally have the following name:
<docid>.tab
In this case, the file extension is “.tab”, which both the converter and the scorer assume as the default extension. If yours differs, change it with the “-te” argument.
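The exact-extension matching described above can be illustrated with a minimal Python sketch. This is not the repository's code: the function name `find_by_extension` and the directory layout are assumptions made for illustration. It shows why the extension has to be given exactly: a file named `doc2.tkn.ann` yields the docid `doc2.tkn` under “.ann” but `doc2` under “.tkn.ann”.

```python
from pathlib import Path


def find_by_extension(directory, extension):
    """Map docid -> path for files named <docid><extension> in `directory`.

    Hypothetical helper for illustration; the real scripts may differ.
    """
    found = {}
    for path in Path(directory).iterdir():
        name = path.name
        # The extension is matched exactly as a suffix of the file name,
        # and whatever precedes it is treated as the docid.
        if path.is_file() and name.endswith(extension) and len(name) > len(extension):
            docid = name[: -len(extension)]
            found[docid] = path
    return found
```

With this sketch, passing the wrong extension silently changes the docids, so gold, system, and token files would no longer line up.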
The current scorer can score event mention detection and coreference based on the (.tbf) format. It also requires the token table files to detect invisible words and to generate CoNLL-style coreference files.
usage: scorer_v1.8.py [-h] -g GOLD -s SYSTEM [-d COMPARISON_OUTPUT]
[-o OUTPUT] [-c COREF] [-a SEQUENCING] [-t TOKEN_PATH]
[-m COREF_MAPPING] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-ct COREFERENCE_THRESHOLD]
[-b] [--eval_mode {char,token}] [-wl TYPE_WHITE_LIST]
[-dn DOC_ID_TO_EVAL]
Event mention scorer, provides support to Event Nugget scoring, Event Coreference and Event Sequencing scoring.
core arguments:
-g GOLD, --gold GOLD Golden Standard
-s SYSTEM, --system SYSTEM System output
optional arguments:
-d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
Compare and help show the difference between system
and gold
-o OUTPUT, --output OUTPUT
Optional evaluation result redirects, put eval result
to file
-c COREF, --coref COREF
Eval Coreference result output, need to put the
reference conll coref scorer in the same folder with
this scorer
-a SEQUENCING, --sequencing SEQUENCING
Eval Event sequencing result output (After and
Subevent)
-t TOKEN_PATH, --token_path TOKEN_PATH
Path to the directory containing the token mappings
file, only used in token mode.
-m COREF_MAPPING, --coref_mapping COREF_MAPPING
Which mapping will be used to perform coreference
mapping.
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
starts at 0, default value will be [2, 3]
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.tab], only used in token mode.
-ct COREFERENCE_THRESHOLD, --coreference_threshold COREFERENCE_THRESHOLD
Threshold for coreference mention mapping
-b, --debug turn debug mode on
--eval_mode {char,token}
Use Span or Token mode. The Span mode will take a span
as range [start:end], while the Token mode consider
each token is provided as a single id.
-wl TYPE_WHITE_LIST, --type_white_list TYPE_WHITE_LIST
Provide a file, where each line list a mention type
subtype pair to be evaluated. Types that are out of
this white list will be ignored.
-dn DOC_ID_TO_EVAL, --doc_id_to_eval DOC_ID_TO_EVAL
Provide one single doc id to evaluate.
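As a rough sketch of how the options above fit together, the following Python helper assembles a scorer command line. Only flags listed in the usage text are used; the helper name `build_scorer_command`, the `python` interpreter name, and all file paths are assumptions for illustration.

```python
def build_scorer_command(gold, system, token_path=None,
                         token_table_extension=None, output=None,
                         coref_output=None, python="python",
                         scorer="scorer_v1.8.py"):
    """Assemble an argument list for the scorer (hypothetical helper).

    The flags come from the usage text; everything else is an assumption.
    """
    cmd = [python, scorer, "-g", gold, "-s", system]
    if token_path is not None:
        cmd += ["-t", token_path]             # directory of token tables
    if token_table_extension is not None:
        cmd += ["-te", token_table_extension]  # e.g. ".tab"
    if output is not None:
        cmd += ["-o", output]                 # write eval results to a file
    if coref_output is not None:
        cmd += ["-c", coref_output]           # coreference eval output
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the scorer.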
The validator checks whether the supplied “tbf” file follows the assumed structure. It exits with status 255 if any errors are found; validation logs are written to the same directory as the validator, with “errlog” as the extension.
usage: validator.py [-h] -s SYSTEM [-tm] [-t TOKEN_PATH] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-wc WORD_COUNT_FILE]
[-ty TYPE_FILE] [-b]
The validator check whether the supplied 'tbf' file follows assumed structure.
The validator will exit at status 255 if any errors are found, validation
logs will be written at the same directory of the validator with 'errlog' as
extension.
core arguments:
-s SYSTEM, --system SYSTEM System output
optional arguments:
-h, --help show this help message and exit
-tm, --token_mode Token mode, default is false.
-t TOKEN_PATH, --token_path TOKEN_PATH
Path to the directory containing the token mappings
file, only in token mode.
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
starts at 0, default value will be [2, 3]. Only used
in token mode.
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.tab]
-wc WORD_COUNT_FILE, --word_count_file WORD_COUNT_FILE
A word count file that can be used to help validation,
such as the character_counts.tsv in LDC2016E64.
-ty TYPE_FILE, --type_file TYPE_FILE
If provided, the validator will check whether the type
subtype pair is valid.
-b, --debug turn debug mode on
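Because the validator signals errors through exit status 255, a calling script can branch on the return code. The sketch below is one possible wrapper, not part of the repository; the name `run_validator` is hypothetical, and it assumes that a successful validation exits with status 0, which the text above does not state explicitly.

```python
import subprocess


def run_validator(command):
    """Run a validator command and report whether validation passed.

    Returns False on exit status 255 (validation errors, per the README).
    Assumes status 0 means success; any other status is treated as an
    unexpected failure.
    """
    result = subprocess.run(command)
    if result.returncode == 255:
        return False
    if result.returncode != 0:
        raise RuntimeError(f"unexpected exit status {result.returncode}")
    return True
```

A call might look like `run_validator(["python", "validator.py", "-s", "system.tbf"])`, where `system.tbf` is a placeholder file name.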
This tool converts the Brat annotation format to the TBF format. We currently try to make as few assumptions as possible. However, to resolve transitive coreference redirects automatically, the coreference relation must be named “Coreference”. The tool is also developed for event coreference only.
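Resolving coreference transitively means that if E1 corefers with E2 and E2 with E3, all three end up in one cluster. The sketch below shows one common way to do this with a union-find structure. It is an illustration under that assumption, not the converter's actual implementation, and the function name `coreference_clusters` is hypothetical.

```python
def coreference_clusters(links):
    """Group event ids into clusters from pairwise (id_a, id_b) links."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    # Sort for a deterministic result.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

For example, the links `[("E1", "E2"), ("E2", "E3"), ("E4", "E5")]` produce two clusters, one with E1, E2, E3 and one with E4, E5.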
The default setup follows the Brat v1.3 ID convention:
Further development might allow a customized ID convention.
This code only scans for and detects event mentions and their attributes. Event arguments and entities are currently not handled. Annotations other than event mentions (with their attributes and text spans) are ignored, which means it only reads “E” annotations and their related attributes.
Discontinuous text-bound annotations will be supported.
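To make the scope concrete, here is a small Python sketch that reads only “E” annotations and the text-bound (“T”) triggers they point to from Brat standoff text, including discontinuous spans written as `start end;start end`. It is a simplified illustration, not the converter's code; the function name `parse_brat_events` and the sample type names in the usage note are assumptions, and attributes are left out for brevity.

```python
def parse_brat_events(ann_text):
    """Return {event_id: (event_type, spans)} from Brat standoff text.

    Only "E" lines and the "T" lines they reference are used; all other
    annotation kinds (relations, attributes, notes) are skipped.
    """
    text_bounds = {}
    events = {}
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        ann_id, body = fields[0], fields[1]
        if ann_id.startswith("T"):
            # "Type start end", or "Type s1 e1;s2 e2" when discontinuous.
            ann_type, _, offsets = body.partition(" ")
            spans = [tuple(int(n) for n in part.split())
                     for part in offsets.split(";")]
            text_bounds[ann_id] = (ann_type, spans)
        elif ann_id.startswith("E"):
            # "Type:Trigger Role:Arg ..."; only the trigger is kept here.
            event_type, _, trigger_id = body.split()[0].partition(":")
            events[ann_id] = (event_type, trigger_id)
    # Resolve triggers at the end so line order does not matter.
    return {eid: (etype, text_bounds[tid][1])
            for eid, (etype, tid) in events.items() if tid in text_bounds}
```

Given a “T” line with offsets `0 4;9 13` and an “E” line that references it, the event is returned with the two spans `(0, 4)` and `(9, 13)`.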
brat2tokenFormat.py [-h] (-d DIR | -f FILE) -t TOKENPATH [-o OUT]
[-oe EXT] [-i EID] [-w] [-te TOKEN_TABLE_EXTENSION]
[-ae ANNOTATION_EXTENSION] [-b]
This converter converts Brat annotation files into a single token-based event mention description file (CMU format). It accepts either a single file name or the name of a directory that contains the Brat annotation output. The converter also requires token offset files that share the same name as the annotation file, with the extension .txt.tab. The converter searches for the token files in the directory specified by the ‘-t’ argument.
Required Arguments:
-d DIR, --dir DIR directory of the annotation files
-f FILE, --file FILE name of one annotation file
-t TOKENPATH, --tokenPath TOKENPATH
directory to search for the corresponding token files
Optional arguments:
-h, --help show this help message and exit
-o OUT, --out OUT output path, 'converted' in the current path by
default
-oe EXT, --ext EXT output extension, 'tbf' by default
-i EID, --eid EID an engine id that will appears at each line of the
output file. 'brat_conversion' will be used by default
-w, --overwrite force overwrite existing output file
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is .txt.tab
-ae ANNOTATION_EXTENSION, --annotation_extension ANNOTATION_EXTENSION
any extension appended after docid of annotation
files. Default is .tkn.ann
-b, --debug turn debug mode on
This software converts LDC’s XML format for the TAC KBP 2015 Event Nugget task to the Brat format. More specifically, it converts LDC’s event nuggets and coreferences to events and coreference links that can be viewed via the Brat web interface. Brat annotation configurations for the output are available in the directory src/main/resources/.
The software requires Java 1.8. See pom.xml for other dependencies.
You can see its usage with the following command:
$ java -jar target/converter-1.0.3-jar-with-dependencies.jar -h
Option Description
------ -----------
-a <annotation dir> annotation directory
--ae <annotation file extension> annotation file extension
-d whether to detag text
-h help
-i <input mode> input mode ("event-nugget")
-o <output dir> output directory
-t <text dir> text directory
--te <text file extension> text file extension
The software requires Java 1.8. A precompiled jar is located in the bin directory. To compile the project from source, you will also need Maven 2.7+.
Our tokenizer implementation is based on the tokenizer in the Stanford CoreNLP tool. The software is implemented in Java, and its requirements are as follows:
java -jar bin/token-file-maker-1.0.3-jar-with-dependencies.jar -a <annotation> -e <extension> [-h] -o <output> [-s <separator>] -t <text>
-a <annotation> annotation directory
-e <extension> text file extension
-h print this message
-o <output> output directory
-s <separator> separator chars for tokenization
-t <text> text directory
These are tab-delimited files that map tokens to their tokenized files. A mapping table contains 4 columns in each row, and the rows contain an ordered listing of the document’s tokens. The columns are:
Please note that all 4 fields are required and will be used:
The tokenization table files are created using our automatic tool, which wraps the Stanford tokenizer and provides boundary checks.
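As an illustration of how such a table might be consumed, the Python sketch below reads tab-delimited rows and pulls the token offsets from the columns named by an offset field pair, using the `[2, 3]` default (0-based) documented for the `-of` option. The function name `read_token_offsets` is hypothetical, and skipping rows with non-integer offsets (such as a possible header row) is an assumption rather than documented behavior.

```python
import csv


def read_token_offsets(lines, offset_field=(2, 3)):
    """Return (row, begin, end) tuples from tab-delimited token table lines.

    `offset_field` gives the 0-based columns holding the begin and end
    offsets, mirroring the documented default of [2, 3].
    """
    begin_col, end_col = offset_field
    tokens = []
    # QUOTE_NONE keeps tokens that are themselves quote characters intact.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(begin_col, end_col):
            continue  # too few fields to contain both offsets
        try:
            begin, end = int(row[begin_col]), int(row[end_col])
        except ValueError:
            continue  # assumed: non-numeric offsets mark a header row
        tokens.append((row, begin, end))
    return tokens
```

An open file object can be passed directly as `lines`, for example `read_token_offsets(open(path, encoding="utf-8"))`.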
The visualization is provided as a mechanism to compare different outputs. It is optional and can be ignored if you are only interested in the scores. This code may be updated frequently. Please refer to the command line “-h” option for detailed instructions.
The visualization code represents mention differences in JSON, which is then passed to Embedded Brat.
Recent changes make it possible to visualize clusters by creating an additional JSON object. When enabled, a cluster selector appears on the webpage; selecting a cluster hides all other event mentions.
The visualization mapping does not fully reflect the scoring process; it is just a means to help compare the data. Note that there are up to 2^k different ways of aligning the mentions, where k is the number of attributes. The input to the visualization system is the most basic mapping (span only). It need not capture the true mapping of mention type or realis status: because several mapping options are identical under span-only mapping, the visualization system simply chooses whichever comes first.
The text-based visualization can be generated with “scorer.py” by supplying the “-d” argument. The format is straightforward: a text document is produced for comparison, in which the annotations of both systems are displayed on one line, separated by “|”.
The web-based visualization takes the text visualization file, then:
usage: visualize.py [-h] -d COMPARISON_OUTPUT -t TOKENPATH [-x TEXT]
[-v VISUALIZATION_HTML_PATH] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-se SOURCE_FILE_EXTENSION]
Mention visualizer, will create a side-by-side embedded visualization from the mapping
Required Arguments:
-d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
The comparison output file between system and gold,
used to recover the mapping
-t TOKENPATH, --tokenPath TOKENPATH
Path to the directory containing the token mappings
file
Optional Arguments:
-h, --help show this help message and exit
-x TEXT, --text TEXT Path to the directory containing the original text
-v VISUALIZATION_HTML_PATH, --visualization_html_path VISUALIZATION_HTML_PATH
The Path to find visualization web pages, default path
is [visualization]
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
starts at 0, default value will be [2, 3]
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.txt.tab]
-se SOURCE_FILE_EXTENSION, --source_file_extension SOURCE_FILE_EXTENSION
any extension appended after docid of source
files. Default is [.tkn.txt]