MPL Language Specification
==========================

Version 0.1
31 Jan 2008
Andrew Clegg
http://biotext.org.uk/

The following sections describe the syntax of MPL in extended Backus-Naur
form (ISO/IEC 14977), along with some notes on semantics.

File structure
--------------

entry			=	comment | rule ;

comment			=	( "#", ? any text ?, newline ) |
				newline ;

rule			=	( match rule, newline ) |
				( pattern rule, newline ) |
				( replacement rule, newline ) ;

newline			=	? system-specific newline character(s) ? ;

Comments introduced with a # character, and blank lines, are ignored.
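
As an illustration of the file structure above, the following Python sketch
sorts the lines of an MPL file into match rules, pattern rules and
replacement rules, skipping comments and blank lines. The function name
split_entries is hypothetical, and the sketch assumes that every line between
'pattern' and 'end' belongs to the pattern; it is not the reference parser.

```python
def split_entries(text):
    """Group the lines of an MPL file by rule type (illustrative only)."""
    matches, patterns, replacements = [], [], []
    current = None                      # lines of the pattern being collected
    for line in text.splitlines():
        stripped = line.strip()
        if current is not None:
            current.append(line)
            if stripped == "end":       # 'end' closes the pattern rule
                patterns.append("\n".join(current))
                current = None
        elif not stripped or line.startswith("#"):
            continue                    # blank line or comment: ignored
        elif stripped == "pattern":
            current = [line]
        elif line.startswith("replace "):
            replacements.append(line)   # kept verbatim: whitespace matters
        elif line.startswith(("match ", "!match ")):
            matches.append(line)
        else:
            raise ValueError(f"unrecognised entry: {line!r}")
    if current is not None:
        raise ValueError("pattern rule without a closing 'end'")
    return matches, patterns, replacements


sample = """# Match rules

match @AGENT = Entity[a-z]{1,2}

pattern
VB~~activate
        ( nsubj NN~~@AGENT )
end

replace VB~~activate = VBZ~~activates
"""
matches, patterns, replacements = split_entries(sample)
```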

Match rules
-----------

match rule		=	[ "!" ], "match ", variable, " = ", regexp ;

variable		=	( "@" | "#" ), letter, { letter } ;

letter			=	"A" | "B" | "C" | "D" | "E" | "F" | "G" |
				"H" | "I" | "J" | "K" | "L" | "M" | "N" |
				"O" | "P" | "Q" | "R" | "S" | "T" | "U" |
				"V" | "W" | "X" | "Y" | "Z" ;

regexp			=	? regular expression using Java syntax ? ;

Match rules define regular expressions for matching against words, POS tags
or arc labels in patterns (see below). Rules beginning with the ! character
are inverted; that is, they match any word, POS tag or arc label that the
regular expression does not match.

Each regular expression is assigned to a variable, whose name begins with
either a # or an @ symbol. These variables can be used when writing
patterns. The choice of symbol divides the variables into two subsets and
allows us to write replacement rules (see below) that target one subset
only.
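
The following Python sketch shows one way a match rule could be read and
evaluated, including inversion. The names MatchRule and parse_match_rule are
hypothetical, Python's re module stands in for Java's regular expression
engine (the two agree on simple expressions like these), and the sketch
assumes a rule carries at most one leading '!'.

```python
import re
from dataclasses import dataclass

# Mirrors the 'match rule' production: optional "!", variable, " = ", regexp.
MATCH_RULE = re.compile(r"^(!?)match ([@#][A-Z]+) = (.+)$")


@dataclass
class MatchRule:
    variable: str    # e.g. "@AGENT"
    regexp: str      # source text of the regular expression
    inverted: bool   # True if the rule began with "!"

    def matches(self, target: str) -> bool:
        # The expression may match anywhere in the target string;
        # an inverted rule succeeds exactly when it does not.
        found = re.search(self.regexp, target) is not None
        return not found if self.inverted else found


def parse_match_rule(line: str) -> MatchRule:
    m = MATCH_RULE.match(line)
    if m is None:
        raise ValueError(f"not a match rule: {line!r}")
    bang, variable, regexp = m.groups()
    return MatchRule(variable, regexp, inverted=(bang == "!"))


agent = parse_match_rule("match @AGENT = Entity[a-z]{1,2}")
non_verb = parse_match_rule("!match #NONVERB = ^VB")

print(agent.matches("Entityab"))   # True
print(non_verb.matches("VBD"))     # False: the expression matches, so the inverted rule fails
print(non_verb.matches("NN"))      # True
```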

Pattern rules
-------------

pattern rule		=	"pattern", newline, node, newline, "end" ;

node			=	pos tag, "~~", ( word | composite ),
				{ " ", child arc } ;

pos tag			=	variable | literal ;

word			=	variable | literal ;

literal			=	? any non-whitespace text ? ;

composite		=	"{", word, { "_", word }, "}" ;

child arc		=	"( ", arc label, " ", node, " )" ;

arc label		=	variable | literal ;

Pattern rules are textual representations of subgraphs (graph fragments) to
be matched against whole-sentence dependency graphs. The first node in the
pattern is the root node; each node may have one or more child arcs which
themselves end in nodes. This recursive definition means that patterns of
arbitrary width and depth may be defined. The order in which child arcs are
listed is not important. Excess whitespace within patterns is ignored, so
patterns can be laid out readably using line breaks, indentation etc.
However, the keywords 'pattern' and 'end' must each occur on a line by
themselves.
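
The recursive definition of a node lends itself to a small recursive-descent
parser. The Python sketch below is one possible reading of the grammar, not
the reference implementation: the names Node and parse_pattern are
hypothetical, words and composites are stored unparsed, and for brevity it
does not check that 'pattern' and 'end' sit on lines of their own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    pos_tag: str
    word: str                                       # variable, literal or composite
    children: list = field(default_factory=list)    # (arc label, Node) pairs


def parse_pattern(text: str) -> Node:
    tokens = text.split()       # excess whitespace and line breaks vanish here
    if len(tokens) < 3 or tokens[0] != "pattern" or tokens[-1] != "end":
        raise ValueError("expected 'pattern' ... 'end'")
    body = tokens[1:-1]
    node, pos = _parse_node(body, 0)
    if pos != len(body):
        raise ValueError(f"unexpected token {body[pos]!r}")
    return node


def _parse_node(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of pattern")
    pos_tag, sep, word = tokens[pos].partition("~~")
    if not (sep and pos_tag and word):
        raise ValueError(f"expected POS~~word, got {tokens[pos]!r}")
    node = Node(pos_tag, word)
    pos += 1
    # Each child arc is "(", arc label, node, ")"; the node may nest further arcs.
    while pos < len(tokens) and tokens[pos] == "(":
        if pos + 1 >= len(tokens):
            raise ValueError("arc label missing after '('")
        label = tokens[pos + 1]
        child, pos = _parse_node(tokens, pos + 2)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')' after child node")
        node.children.append((label, child))
        pos += 1
    return node, pos


root = parse_pattern("""pattern
VB~~activate
        ( nsubj NN~~@AGENT )
        ( dobj NN~~@TARGET )
end""")
print([label for label, _ in root.children])   # ['nsubj', 'dobj']
```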

Words, POS tags and arc labels (dependency types) can be specified in terms
of variables (see above) or literal strings. A variable's regular
expression succeeds if it matches anywhere in the target string, but a
literal must match the entire target string exactly. For example, a POS tag
given as the literal 'VB' would not match a 'VBD' tag.
Composites consist of variables and literals separated by underscores, where
each underscore matches any one character. These sequences must match the
target string contiguously; excess material at either end is ignored.
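
The three matching modes can be contrasted in a short Python sketch. It is an
interpretation of the paragraph above rather than the actual implementation:
word_matches is a hypothetical name, variables are passed as a plain
dictionary from variable name to regular expression, inverted variables are
left out, and literals inside a composite are assumed not to contain
underscores themselves.

```python
import re


def word_matches(spec: str, target: str, variables: dict) -> bool:
    """spec is a variable, a literal, or a composite such as {@AGENT_binding}."""
    if spec.startswith("{") and spec.endswith("}"):
        parts = spec[1:-1].split("_")
        # Each underscore becomes ".", i.e. any one character. Using search
        # lets the sequence sit anywhere in the target, so excess material
        # at either end is ignored.
        regex = ".".join(_part_regex(p, variables) for p in parts)
        return re.search(regex, target) is not None
    if spec in variables:
        # Variable: the expression may match anywhere in the target.
        return re.search(variables[spec], target) is not None
    # Literal: the whole target string must be identical.
    return spec == target


def _part_regex(part: str, variables: dict) -> str:
    if part in variables:
        return "(?:" + variables[part] + ")"
    return re.escape(part)


variables = {"@AGENT": "Entity[a-z]{1,2}"}
print(word_matches("VB", "VBD", variables))                              # False
print(word_matches("@AGENT", "xEntityabx", variables))                   # True
print(word_matches("{@AGENT_binding}", "Entityab-binding", variables))   # True
```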

Replacement rules
-----------------

replacement rule	=	"replace ", original string, " = ",
				new string ;

original string		=	? any text ? ;

new string		=	? any text ? ;

Replacement rules allow unconstrained search-and-replace functionality over
patterns before they are compiled, allowing variant patterns to be easily
generated. Typically, the string to be replaced would be a single node
definition, or a POS tag or arc label, or a variable, but this is not
mandated by the language.

Note that replacement works on the raw text of a pattern, so unexpected
whitespace inside a string can prevent a match.

On initially reading an MPL file, the MPL parser builds a pool of pattern
rules (those specified explicitly in the file) and a list of replacement
rules in the order they occur in the file. One by one, each replacement
rule is applied to every pattern in the pool, and any new patterns
generated are added to the pool, so that subsequent replacement rules
operate on them as well as on the original 'seed' set. Thus the order in
which the replacement rules are declared affects the final set of patterns.
Note that a replacement rule can match a pattern in multiple places,
generating a new pattern for each distinct combination of matches.
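
The expansion procedure can be sketched in Python as follows. This is a
minimal reading of the description above, with assumptions that the text does
not settle: the function names are hypothetical, occurrences of the original
string are taken to be non-overlapping, a rule is not re-applied to the
patterns it has itself just generated, and duplicate patterns are added only
once.

```python
from itertools import combinations


def find_occurrences(text, needle):
    """Start offsets of non-overlapping occurrences of needle in text."""
    if not needle:
        return []
    positions, start = [], text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + len(needle))
    return positions


def variants(pattern, original, new):
    """One new pattern per non-empty combination of occurrences replaced."""
    positions = find_occurrences(pattern, original)
    results = []
    for size in range(1, len(positions) + 1):
        for chosen in combinations(positions, size):
            pieces, cursor = [], 0
            for p in chosen:
                pieces.append(pattern[cursor:p])
                pieces.append(new)
                cursor = p + len(original)
            pieces.append(pattern[cursor:])
            results.append("".join(pieces))
    return results


def expand(seed_patterns, replacements):
    pool = list(seed_patterns)
    for original, new in replacements:      # rules in file order
        for pattern in list(pool):          # snapshot of the pool for this rule
            for variant in variants(pattern, original, new):
                if variant not in pool:
                    pool.append(variant)    # visible to all later rules
    return pool


pool = expand(["NN~~a ( conj NN~~b ) ( dep NN~~b )"],
              [("NN~~b", "NNS~~bs")])
for p in pool:
    print(p)
```

Here the original string occurs twice in the seed pattern, so the rule yields
three new patterns (first occurrence replaced, second replaced, both
replaced), giving a pool of four.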

An example
----------

Consider the following simple MPL file.

# Match rules

match @AGENT = Entity[a-z]{1,2}
match @TARGET = Entity[a-z]{1,2}

# Pattern rules

pattern
VB~~activate
        ( nsubj NN~~@AGENT )
        ( dobj NN~~@TARGET )
end

# Replacement rules

replace VB~~activate = VBZ~~activates

replace NN~~@TARGET = NN~~expression ( prep_of NN~~@TARGET )

The match rules define regular expressions to identify the agent and target
in the placeholder form 'Entityaa', i.e. any placeholder composed of the
string 'Entity' followed by one or two lower-case letters. This is the
minimum set of variables required to retrieve interactions. In the current
implementation,
the two variables '@AGENT' and '@TARGET' are treated differently from other
variables, in that their bindings in successfully-matched patterns are
saved and used to generate interactions.

The pattern rule matches just the simple fragment 'Entityaa activate
Entitybb'. The first replacement rule generates another pattern which
matches 'Entityaa activates Entitybb', while the second replaces the whole
target node with a prepositionally-modified noun. This is applied to both
the original pattern and the new one generated by the first replacement
rule, resulting in two new patterns that match 'Entityaa activate
expression of Entitybb' and 'Entityaa activates expression of Entitybb'.