Constrained Dynamic Parser

From Event-B
Revision as of 13:06, 1 February 2010 by imported>Mathieu (add : Open Questions : AST, de bruijn ids,...)
Jump to navigationJump to search

This page describes the requirements for a generic parser for the Event-B Mathematical Language.

A first design proposal is also drafted.


In order to be usable mathematical extensions require that the event-b mathematical language syntax can be extended by the final user.

The lexical analyser and the syntaxic parser thus have to be extensible in a simple enough way (from a user point of vue).

Requirements Exported by the Current Language Design

Operator Priority

  • operator are defined by group,
  • group of operator have a defined precedences,
  • there may be precedences defined inside groups.

Operator Compatibility

  • a compatibility table defines allowed associativities inside a group,
  • a compatibility table defines allowed associativities between groups (it allows to forbid a syntaxic construction like f(x)^{-1}\;
nota: this requirement was added afterwards with consistency in mind.

Expected Extension Schemes

We do want to at least define operators of the following form :

  • infix : a + b\; or a \vdash b : c\;
  • prefix : \neg a\;
  • postfix :  R^*\;
  • closed : \|a\|\;
  • parentheses sugar : (a +b) * c\;
  • functional postfix :  a \langle b \rangle\;
  • functional prefix :   \langle b \rangle f\;
  • lambda like :  \lambda x\mapsto y . P | E\;
  • Union like :  \Union\{ e \mid P\} or  \Union\{ x,y . P \mid e\}
  • sets : \{a, b, c + e\}\; or \{ e \mid P\}\; or \{x,y . P \mid e\}\;

We also like to define precisely precedences and associativity between existing and new operators.

Requirements exported by the dynamic feature

  • the precedence should not be enumerated but defined by a relation, like: '+' < '*' and '*' < '**', ...


Design Alternatives

Make Existing Parser Extensible

The existing parser is a LL recursive descent parser generated by the Coco/R Compiler Generator, which makes extensive use of pre-computed lookahead, which makes it very difficult to be transformed in a sufficiently enough generic parser.

Parser Combinator

This paper is interesting in its proposal of using an acyclic graph to define operator precedence.

Pratt Parser

Note that a pratt parser is essentially a parser for expressions (where expression parts are themselves expressions)... But it may be tweaked to limit the accepted sub-expressions or to enforce some non-expression parts inside an expression. More on that hereafter.

Some Existing Extensible Languages

Design Proposal

Main Concepts

a symbol is a lexem known by the parser.
each symbol belongs to one and only one group.
symbol compatibility
a relation telling if inside a group two symbol are compatibles.
Two symbol are compatibles is they can be parsed without parentheses (see Event-B Mathematical Language for more on compatibility).
group compatibility
a relation telling if two groups are compatibles.
Two symbol of different groups are said compatibles if their group are compatibles.
symbol associativity
a relation which gives the associative precedence of symbols inside a given group.
group associativity
a relation which gives the associative precedence of groups.
Two symbol of different groups have a relative precedence given by the associative precedence of their groups.
generic parser
the parser which is defined.
This parser may be called recursively (an infix expression being build upon a left and right sub-expression) and is often the default parser used when building the AST associated with a symbol.
specific parser
some specific parser, predefined as helper functions to build specific AST.
Some example of specific parser are the one used to parse and build list of identifiers (for quantified predicates), mapplets lists (for lambda), and so one...

Proposed User Interface

The hereafter proposition does not preclude the final UI to take other (more user friendly forms). But the concepts should probably stay the same

We propose that the user will be able to create several predefined type of symbols (new types may certainly be further added by mean of eclipse extension points). When defining a symbol, the user can choose the parser used to parse the sub-expressions (default to generic parser, but a specific parser may be used) and can enforce the form of the resulting parsed sub-expression in terms of their AST root type.

Predefined Symbol Builders

The proposed predefined symbol builders are:

For example predicate conjunction which expects two predicates for left and right parts may be defined by infix("∧", gid="(logic pred)", expected_nonterm=["predicate", "predicate"])
As another example, substitution ≔ of group '(infix subst)' which expect two ident list for left and right expressions may be defined by infix(symbol="≔", gid="(infix subst)", parsers=["ident_list", "ident_list"], expected_nonterm=["ident_list", "ident_list"])
prefix("¬", "(not pred)", expected_nonterm=["predicate"])
atomic("⊤", "(atomic pred)")
prefix closed
used to build prefix symbol which must be followed by enclosing symbol(brackets, parentheses,..). For example finite may be defined by: prefix_closed("finite", "(", ")","(atomic pred)", expected_nonterm=["expression"])
postfix: postfix("∼", gid="(unary relation)", expected_nonterm=["expression"])
quantified("∀", "·", expected_nonterm=["predicate"]
lambda like
lambda_like("λ", "·", u"∣", gid="(quantification)", expected_nonterm=["ident list", "predicate", "expression"])
quantified union like
Union_like("⋃", "·", "∣", gid="(quantification)")
closed sugar
use for parentheses added to enforce associativity: closed_sugar("(", ")")
functional postfix
functional_postfix("(", ")", gid="(functional)", expected_nonterm=["expression", "expression"])

Some right associative versions of those symbol builders may also be supplied (infix_right, or a specific parameter right associative parameter may also be provided to the infix symbol builder).

Predefined Specific Parser

  • expression lists
  • ident lists
  • maplet tree (for lambda)

Defining Compatibility and Precedences

Inside a group, compatibility may be defined like:

   g = group["(binop)"]
           [   ("∪", "∪"),
               ("∩", "∩"),
               ("∩", "∖"),
               ("∩", "▷"),
               ("∩", "⩥"),
               ("∘", "∘"),
               (";", ";"),
               (";", "▷"),
               (";", "⩥"),
               ("", ""),
               ("◁", "∩"),
               ("◁", "∖"),
               ("◁", ";"),
               ("◁", "⊗"),
               ("◁", "▷"),
               ("◁", "⩥"),
               ("⩥", "∩"),
               ("⩥", "∖"),
               ("⩥", ";"),
               ("⩥", "⊗"),
               ("⩥", "▷"),
               ("⩥", "⩥"),
               ("×", "×"), ]

which defines the compatibility table for expression binary operator as defined in Event-B Mathematical Language.

Inside a group, precedence may be defined like:

g = group["(arithmetic)"]
           [   ("^", "÷"),
               ("^", "∗"),
               ("^", "+"),
               ("^", "−"),
               ("^", "mod"),
               ("+", "∗"),
               ("−", "∗"),
               ("+", "÷"),
               ("−", "÷"),
               ("+", "mod"),
               ("−", "mod"), ] )

which defines ^ as having the least precedence amongst the operators of the expression arithmetic group.

Similarly, precedences and compatibilities between groups may be defined :

   group_precedence.add(("(binop)", "(interval)"))
   group_precedence.add(("(interval)", "(arithmetic)"))
   group_precedence.add(("(arithmetic)", "(unary relation)"))
   group_precedence.add(("(unary relation)", "(bound unary)"))
   group_precedence.add(( "(bound unary)", "(bool)"))
   group_compatibility.add_all( [ "(binop)", "(interval)", (arithmetic)", "(unary relation)", "(unary relation)", "(bound unary)",....])

Some Difficulties

Some difficulties appear when trying to implement an event-b parser with the preceding scheme:

  • A symbol may be declared by several different builders (thinf of unary and infix minus for example).
    • pratt parser are able to cope with a symbol appearing either at the beginning of an expression or inside an expression (see led and nud hereafter). For example, unary minus and infix minus wan be both defined by the proposed scheme.
    • but in their default form, pratt parser failed in parsing symbol like '{' which appears at the beginning of extension sets, comprehension sets and quantified sets.
The existing event-b parser use infinite lookahead to determine which one of those three sets is actually being parsed. We can not afford such a solution, as it would hinder extensibility of the parser. Implementing a backtracking parser is probably our best solution provided that we use memoisation to alleviate the cost of parsing formulas that require heavy backtracking.
  • Defining a grammar or a grammar extension is not an easy job, even with those simple mechanisms.
It would certainly be a good idea to implement a user interface to grammar testing so that an extension designer as the mean to ensure that he agree with the parser on the semantic of (or at least on the AST obtained by using) its extension.

Core Algorithm

Main Classes

a table associating a SymbolParser for each known symbol.
a Parser for a specific symbol. It should at least have the following attributes:
  • Parser
    for parsing the expression associated to the symbol
    when it appears in the middle of an expression (left denotation)
  • Parser
    for parsing the expression associated to the symbol
    when it appears at the beginning of an expression (null denotation)
  • String
  • String
    designing the group to which the
    symbol belongs to.
A parser, which should have the following attributes
  • Parser Array
    an array of subparsers used by the
    method. They defaults to
    , but may be set to specific parsers.
and the following methods:
  • parse
    : this method should consume tokens and returns an AST. It may use several
ParserBuilder TODO

Main Objects



Open Questions

How does the user specify the expected AST ?

Some examples in the existing parser:
  • some binary infix predicates are in fact parsed as n-ary operators (with more than 2 childrens if possible). Shall we generalize this feature ? How will this fit within the proposed extension scheme ?
  • bound identifiers are converted to de Bruijn indexes while constructing the AST. Is this desirable ? How does the user specify that identifier are bound or not ?

Sample Implementation

A prototype as been developed in Python to quickly try different solutions.

The development tree of the python prototype is available at