Difference between pages "Code Generation Activity" and "Constrained Dynamic Parser"

Tasking Event-B is an extension of Event-B for defining concurrent systems that share data; for details, see the [[Tasking Event-B Overview]] page. For more information contact Andy Edmunds - University of Southampton - mailto:ae2@ecs.soton.ac.uk

{{TOCright}}
=== Code Generation Feature - Version 0.2.0 for Rodin 2.3 ===

This page describes the requirements for a '''generic parser''' for the [[Event-B Mathematical Language]].

This section is undergoing maintenance. We are planning to release a new version of the code generator in the next few days.
 
===== Changes to the Tooling and Approach =====
 
The main changes are:
 
* The code generators have been completely re-written. The translators are now implemented in Java, i.e. there is no longer a dependence on the Epsilon tool set. This was undertaken for code maintenance reasons.
 
* Tasking Event-B is now integrated with the Event-B explorer.
 
* The Rose Editor is used for editing the Tasking Event-B, and
 
* a text-based editor is provided, using the Rose extension, for editing the TaskBody. This feature has been added to address some of the usability concerns. It also overcomes the 'problem' experienced with duplicate event names in a development, since the parser-builder that has been implemented automatically selects the correct event.
 
* The EMF tree editor in Rose is only used minimally; we plan enhancements to further reduce its use.
 
* Composed machines are used to store event 'synchronizations'; these are generated automatically during the decomposition process. This reduces the amount of typing in the TaskBody editor, since we no longer need to specify both local and remote (synchronizing) events.
 
* The code generation approach is now extensible; new target language constructs can be added using the Eclipse extension mechanism.
 
* The translation of the target's mathematical language is now specified in the theory plug-in. This improves clarity, since the translation from source to target is achieved by specifying pattern-matching rules. Extensibility is also improved; the theory plug-in is used to specify new data-types, and how they are implemented.
 
* Translated code is deposited in the appropriate files in an output directory. An Ada project file is generated for use with AdaCore's GPS workbench. Eventually this could be enabled/disabled in a preferences dialog box.
 
* The Tasking Event-B to Event-B translator is now properly integrated. Control variable updates to the Event-B model are made in a similar way to the equivalent updates in the state-machine plug-in. The additional elements are added to the Event-B model and marked as 'generated'. This prevents users from manually modifying them, and allows them to be removed through a menu choice.
 
  
===== Downloads =====

A first design proposal is also drafted.
* Rodin Update Site: ???

* Examples, including models used for testing: ???

* Sources at: ???


== Requirements ==

In order to be usable, [[mathematical extensions]] require that the Event-B mathematical language syntax can be extended by the end user.
  
TODO
The lexical analyser and the syntactic parser thus have to be extensible in a sufficiently simple way (from a user's point of view).
* Add array types.
 
* Add addressed variables (for direct read/write access to memory)
 
* Flattening of composed machines/implementation machines.
 
* Interrupts
 
  
=== Sensing and Actuating for Tasking Event-B ===
=== Requirements Exported by the Current Language Design ===
Version 0.1.5: sensing and actuating events, and an Environ machine, have been added to allow simulation of the environment and implementation using memory-mapped IO.

==== Operator Priority ====

* operators are defined by group,

* groups of operators have defined precedences,

* there may be precedences defined inside groups.

* The new v0.1.5 feature is available from the Rodin Update Site; it resides in the Utilities category.

==== Operator Compatibility ====

* a compatibility table defines the allowed associativities inside a group,

* a compatibility table defines the allowed associativities between groups (it allows one to forbid a syntactic construction like <math>f(x)^{-1}\;</math>),

: nota: this requirement was added afterwards with consistency in mind.

* As in previous releases, the code generation plug-in relies on the Epsilon tool suite. Add the following Epsilon interim update site to the list of available update sites in the Eclipse menu ''help/install new software'': http://download.eclipse.org/modeling/gmt/epsilon/interim/
=== Expected Extension Schemes ===

We want to be able to define at least operators of the following forms:

* infix: <math>a + b\;</math> or <math>a \vdash b : c\;</math>

* prefix: <math>\neg a\;</math>

* postfix: <math> R^*\;</math>

* closed: <math>\|a\|\;</math>

* parentheses sugar: <math>(a + b) * c\;</math>

* functional postfix: <math> a \langle b \rangle\;</math>

* functional prefix: <math> \langle b \rangle f\;</math>

* lambda like: <math> \lambda x\mapsto y . P \mid E\;</math>

* Union like: <math> \bigcup\{ e \mid P\}\;</math> or <math> \bigcup\{ x,y . P \mid e\}\;</math>

* sets: <math>\{a, b, c + e\}\;</math> or <math>\{ e \mid P\}\;</math> or <math>\{x,y . P \mid e\}\;</math>

* Select the 'Epsilon Core (Incubation)' component; this is the only component that is required for Tasking Event-B.

We would also like to precisely define the precedences and associativities between existing and new operators.

A new [http://wiki.event-b.org/index.php?title=Tasking_Event-B_Tutorial Code Generation Tutorial] has been produced that makes use of these new features. There is an explanation of the heating controller, upon which it is based, [http://wiki.event-b.org/index.php/Development_of_a_Heating_Controller_System here].
=== Requirements exported by the dynamic feature ===

* the precedence should not be enumerated but defined by a relation, for example: '+' < '*' and '*' < '**', ... (see the sketch below)
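A minimal sketch of what such a relational precedence definition could look like; the <code>PrecedenceRelation</code> class and its method names are illustrative assumptions, not part of any existing tool:
<blockquote><code><pre>
# Illustrative sketch: precedence given as a relation between operators,
# with the comparison derived by transitive closure (assumed API).
class PrecedenceRelation:
    def __init__(self):
        self.lower_than = {}          # operator -> set of operators declared higher

    def add(self, low, high):
        self.lower_than.setdefault(low, set()).add(high)

    def has_less_priority(self, a, b):
        # depth-first search through the declared pairs (transitive closure)
        seen, stack = set(), [a]
        while stack:
            op = stack.pop()
            for higher in self.lower_than.get(op, ()):
                if higher == b:
                    return True
                if higher not in seen:
                    seen.add(higher)
                    stack.append(higher)
        return False

prec = PrecedenceRelation()
prec.add('+', '*')
prec.add('*', '**')
assert prec.has_less_priority('+', '**')   # follows by transitivity
</pre></code></blockquote>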
  
The example/tutorial projects, and also a bundled Windows 7 version, are available in the [http://deploy-eprints.ecs.soton.ac.uk/304/ Deploy E-Prints archive] or the [https://codegenerationd.svn.sourceforge.net/svnroot/codegenerationd/Examples/HeatingController_Tutorial_v0.1.4/ Examples SVN site].

== Limitations ==
  
=== The Code Generation Demonstrator for Rodin 2.1.x ===
== Design Alternatives ==

Released 24 January 2011.

=== Make Existing Parser Extensible ===

The existing parser is an LL recursive descent parser generated by the [http://www.ssw.uni-linz.ac.at/Coco/ Coco/R Compiler Generator]. It makes extensive use of pre-computed lookahead, which makes it very difficult to transform into a sufficiently generic parser.
  
The Rodin 2.1.x compatible code generation demonstrator plug-ins have been released into the Rodin Sourceforge repository at:
=== Parser Combinator ===

:* on [http://en.wikipedia.org/wiki/Parser_combinator Wikipedia]. Some implementations are referenced [http://pdos.csail.mit.edu/~baford/packrat/ here].

:* with the functional language [http://wiki.portal.chalmers.se/agda/Agda Agda]: ''[http://www.cs.nott.ac.uk/~nad/publications/danielsson-norell-mixfix.pdf Parsing Mixfix Operators]''

:: This paper is interesting in its proposal of using an acyclic graph to define operator precedence.

:* with the functional language [http://user.cs.tu-berlin.de/~opal/opal-language.html OPAL]: ''[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=8A8A8B39FFBF1AEF17C782C4822C5AF6?doi=10.1.1.58.655&rep=rep1&type=url&i=0 How To Obtain Powerful Parsers That Are Elegant and Practical]''
  
  https://rodin-b-sharp.svn.sourceforge.net/svnroot/rodin-b-sharp/trunk/CodeGeneration
=== Pratt Parser ===

:* [http://portal.acm.org/citation.cfm?id=512931 original paper from Vaughan Pratt]

:* [http://effbot.org/zone/tdop-index.htm Simple ''Pratt'' parsing in Python]

:* [http://javascript.crockford.com/tdop/tdop.html ''Pratt'' parsing in JavaScript]

Note that a ''Pratt'' parser is essentially a parser for expressions (where expression parts are themselves expressions). It may nevertheless be tweaked to limit the accepted sub-expressions or to enforce some non-expression parts inside an expression. More on that hereafter. A minimal illustration in Python is given below.
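As a rough illustration only (not the proposed design; symbols and precedences are hard-coded here, whereas the proposal keeps them in extensible tables), a Pratt-style expression parser fits in a few lines of Python:
<blockquote><code><pre>
# Minimal Pratt-style expression parser for "+" and "*" over integers,
# hard-coded for illustration.
import re

PRECEDENCE = {'+': 10, '*': 20}

def tokenize(text):
    return re.findall(r'\d+|[+*()]', text) + ['(end)']

def parse(tokens, pos=0, min_prec=0):
    token = tokens[pos]
    pos += 1
    if token == '(':
        left, pos = parse(tokens, pos, 0)
        pos += 1                      # skip the closing ')'
    else:
        left = int(token)             # "nud": a literal starts an expression
    while PRECEDENCE.get(tokens[pos], 0) > min_prec:
        op = tokens[pos]
        right, pos = parse(tokens, pos + 1, PRECEDENCE[op])   # "led"
        left = (op, left, right)
    return left, pos

ast, _ = parse(tokenize("1+2*(3+4)"))
print(ast)        # ('+', 1, ('*', 2, ('+', 3, 4)))
</pre></code></blockquote>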
  
The update-site is available through the Rodin update site in the ''Utilities'' category.
=== Some Existing Extensible Languages ===

* [http://www.chrisseaton.com/katahdin/ Katahdin], and the associated paper [http://www.chrisseaton.com/katahdin/katahdin-thesis.pdf A Programming Language Where the Syntax and Semantics Are Mutable at Runtime], which describes the language and its implementation (in C#).

* [http://itcentre.tvu.ac.uk/~clark/xmf.html XMF], whose implementation does not seem to be described in detail.

* [http://xlr.sourceforge.net/index.html XLR: Extensible Language and Runtime], which looks like a ''Pratt'' parser according to the [http://xlr.svn.sourceforge.net/viewvc/xlr/trunk/xl2/xlr/parser.cpp?revision=805&view=markup sources].

* [http://www.pi-programming.org/What.html <math>\pi</math>], whose implementation does not seem to be described in detail (but sources are available).
  
The code generation tutorial examples are available for download at:
== Design Proposal ==

=== Main Concepts ===

;symbol: a symbol is a lexeme known by the parser.

;group: each symbol belongs to one and only one group.

;symbol compatibility: a relation telling whether, inside a group, two symbols are compatible.

:Two symbols are compatible if they can be parsed without parentheses (see [[Event-B Mathematical Language]] for more on compatibility).

;group compatibility: a relation telling whether two groups are compatible.

:Two symbols of different groups are said to be compatible if their groups are compatible.

;symbol associativity: a relation which gives the associative precedence of symbols inside a given group.

;group associativity: a relation which gives the associative precedence of groups.

:Two symbols of different groups have a relative precedence given by the associative precedence of their groups.

;generic parser: the parser which is being defined.

: This parser may be called recursively (an infix expression being built upon left and right sub-expressions) and is often the default parser used when building the AST associated with a ''symbol''.

;specific parser: a specific parser, predefined as a helper function to build a specific AST.

:Some examples of specific parsers are the ones used to parse and build lists of identifiers (for quantified predicates), maplet lists (for lambda), and so on.
  
  https://sourceforge.net/projects/codegenerationd/files/DemoFiles/
=== Proposed User Interface ===

''The proposition hereafter does not preclude the final UI from taking other (more user-friendly) forms, but the concepts should probably stay the same.''

The code generation plug-in relies on the Epsilon tool suite. Install Epsilon manually, since the automatic install utility does not seem to work for this feature. We currently use the Epsilon interim update site available at:

We propose that the user will be able to create several predefined types of symbols (new types may certainly be added later by means of Eclipse extension points). When defining a symbol, the user can choose the parsers used to parse the sub-expressions (defaulting to the generic parser, but a specific parser may be used) and can constrain the form of the resulting parsed sub-expressions in terms of their AST root types.
  
  http://download.eclipse.org/modeling/gmt/epsilon/interim/
==== Predefined Symbol Builders ====

The proposed predefined symbol builders are:

;infix: For example, predicate conjunction, which expects two predicates for its left and right parts, may be defined by <code>infix("∧", gid="(logic pred)", expected_nonterm=["predicate", "predicate"])</code>

:As another example, the substitution ≔ of group '(infix subst)', which expects two ''ident lists'' for its left and right expressions, may be defined by <code>infix(symbol="≔", gid="(infix subst)", parsers=["ident_list", "ident_list"], expected_nonterm=["ident_list", "ident_list"])</code>
  
Select the 'Epsilon Core (Incubation)' component; this is the only component that is required for Tasking Event-B.
;prefix: <code>prefix("¬", "(not pred)", expected_nonterm=["predicate"])</code>
  
== Code Generation Status ==
;atomic: <code>atomic("⊤", "(atomic pred)")</code>
=== Latest Developments ===
 
* Demonstrator plug-in feature version 0.1.0
 
** a version for Rodin 2.1.x is available.
 
  
* The Code Generation feature consists of,
;prefix closed: used to build a prefix symbol which must be followed by an enclosing symbol (brackets, parentheses, ...). For example, <tt>finite</tt> may be defined by: <code>prefix_closed("finite", "(", ")", "(atomic pred)", expected_nonterm=["expression"])</code>
** a tasking Development Generator.
 
** a tasking Development Editor (Based on an EMF Tree Editor).
 
** a translator, from Tasking Development to Common Language Model (IL1).
 
** a translator, from the Tasking Development to Event-B model of the implementation.
 
** a pretty-printer for the Tasking Development.
 
** a pretty-printer for Common Language Model, which generates Ada Source Code.
 
  
* A tutorial is available [[Code Generation Tutorial]]
;postfix: <code>postfix("∼", gid="(unary relation)", expected_nonterm=["expression"])</code>
** Step 1 - Create the tasking development.
 
** Step 2 - Add annotations.
 
** Step 3 - Invoke translators.
 
  
=== Ongoing Work ===
;quantified: <code>quantified("∀", "·", expected_nonterm=["predicate"])</code>
  
* Full Rodin Integration
;lambda like: <code>lambda_like("λ", "·", u"∣", gid="(quantification)", expected_nonterm=["ident list", "predicate", "expression"])</code>
* Sensed Variables
 
* Branching in Shared Machines
 
  
=== Future Work ===
;quantified union like: <code>Union_like("⋃", "·", "∣", gid="(quantification)")</code>
* Support for Interrupts.
 
* Richer DataTypes.
 
* Accommodation of duplicate event names in tasking developments.
 
  
== Metamodels ==
;closed sugar: used for parentheses added to enforce associativity: <code>closed_sugar("(", ")")</code>
* In the plug-in we define several meta-models:
 
** CompositeControl: for the control flow (algorithmic) constructs such as branch, loop and sequence etc. These constructs may be used in the specification of either  sequential or concurrent systems.
 
** Tasking Meta-model: defines the tasking model where we attach tasking specific details, such as task priority, task type. The tasking structures provide the ability to define single tasking or multi-tasking (concurrent) systems. We make use of the composite control plug-in to specify the flow of control.
 
** Common Language (IL1) Meta-model: defines an abstraction of common programming language constructs for use in translations to implementations.
 
  
== Translation Rules ==
;functional postfix: <code>functional_postfix("(", ")", gid="(functional)", expected_nonterm=["expression", "expression"])</code>
* Tasking to IL1/Event-B translation rules [http://wiki.event-b.org/images/Translation.pdf]
 
  
== Release History ==
Some right-associative versions of those symbol builders may also be supplied (an infix_right builder, or a specific ''right associative'' parameter provided to the ''infix'' symbol builder).
=== The Code Generation Demonstrator for Rodin 1.3.x ===
 
  
==== Predefined Specific Parsers ====

* expression lists

* ident lists

* maplet tree (for lambda)
  
First release: 30 November 2010.
 
  
available from:
==== Defining Compatibility and Precedences ====

Inside a group, ''compatibility'' may be defined like:
<code>
    g = group["(binop)"]
    g.add_many_compatibility(
            [   ("∪", "∪"),
                ("∩", "∩"),
                ("∩", "∖"),
                ("∩", "▷"),
                ("∩", "⩥"),
                ("∘", "∘"),
                (";", ";"),
                (";", "▷"),
                (";", "⩥"),
                ("", ""),
                ("◁", "∩"),
                ("◁", "∖"),
                ("◁", ";"),
                ("◁", "⊗"),
                ("◁", "▷"),
                ("◁", "⩥"),
                ("⩥", "∩"),
                ("⩥", "∖"),
                ("⩥", ";"),
                ("⩥", "⊗"),
                ("⩥", "▷"),
                ("⩥", "⩥"),
                ("×", "×"), ] )
</code>
which defines the compatibility table for expression binary operators as defined in the [[Event-B Mathematical Language]].
  
https://sourceforge.net/projects/codegenerationd/files/
 
  
The zip file contains a Windows XP bundle and a Windows 7 bundle. Alternatively, if you wish to build using an update-site, this is also included in the zip file, along with some notes on installation. However, note that the demonstrator tool is only compatible with Rodin 1.3.

Inside a group, ''precedence'' may be defined like:
<code>
g = group["(arithmetic)"]
g.add_many_precedence(
            [   ("^", "÷"),
                ("^", "∗"),
                ("^", "+"),
                ("^", "−"),
                ("^", "mod"),
                ("+", "∗"),
                ("−", "∗"),
                ("+", "÷"),
                ("−", "÷"),
                ("+", "mod"),
                ("−", "mod"), ] )
</code>
which defines ^ as having the least precedence amongst the operators of the expression arithmetic group.
  
A simple shared buffer example is provided. This will form the basis of a tutorial (which is work in progress). The WindowsBundles directory contains a Rodin 1.3.1 platform with the Code Generation plug-ins, together with a patch plug-in. The patch plug-in is required to correct an inconsistency in the org.eventb.emf.persistence plug-in. For the bundles, simply extract the appropriate zip file into a directory and run the rodin.exe. The plug-ins are pre-installed - the only configuration necessary may be to switch workspace to ''<installPath>\rodin1.3bWin7\workspace''. When using the update-site the example projects, and the project forming the basis of a simple tutorial, are provided in the accompanying zip file. These should be imported manually.
Similarly, precedences and compatibilities between groups may be defined:
<code>
    group_precedence.add(("(binop)", "(interval)"))
    group_precedence.add(("(interval)", "(arithmetic)"))
    group_precedence.add(("(arithmetic)", "(unary relation)"))
    group_precedence.add(("(unary relation)", "(bound unary)"))
    group_precedence.add(("(bound unary)", "(bool)"))
    group_compatibility.add_all( [ "(binop)", "(interval)", "(arithmetic)", "(unary relation)", "(unary relation)", "(bound unary)",....])
</code>

Mac users - no bundled version available at present, but use the update site in the 'advanced' folder.
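A minimal sketch of the kind of data structures that could sit behind <code>group</code>, <code>add_many_compatibility</code>, <code>add_many_precedence</code> and <code>group_precedence</code> in the examples above (the structure is an assumption, not the actual implementation; <code>group_compatibility.add_all</code> is omitted because its exact semantics is not specified here):
<blockquote><code><pre>
# Sketch of the containers used in the examples above (assumed structure).
class Group:
    def __init__(self, gid):
        self.gid = gid
        self.compatible = set()       # symbol pairs parsable without parentheses
        self.precedence = set()       # symbol pairs (lower, higher)

    def add_many_compatibility(self, pairs):
        self.compatible.update(pairs)

    def add_many_precedence(self, pairs):
        self.precedence.update(pairs)

group = {gid: Group(gid) for gid in ["(binop)", "(arithmetic)"]}
group_precedence = set()              # group pairs (lower, higher)

g = group["(arithmetic)"]
g.add_many_precedence([("^", "+"), ("^", "∗"), ("+", "∗")])
group_precedence.add(("(binop)", "(arithmetic)"))
</pre></code></blockquote>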
  
'''A step-by-step [[Code Generation Tutorial]] is available'''  
==== Some Difficulties ====

Some difficulties appear when trying to implement an Event-B parser with the preceding scheme:

* A symbol may be declared by several different builders (think of unary and infix minus for example).

: ''Pratt'' parsers are able to cope with a symbol appearing either at the beginning of an expression or inside an expression (see ''led'' and ''nud'' hereafter). For example, unary minus and infix minus can both be defined by the proposed scheme. However, in their default form, ''Pratt'' parsers fail to parse a symbol like '{', which appears at the beginning of extension sets, comprehension sets and quantified sets.

: SOLUTION #1: The existing Event-B parser uses an infinite lookahead to determine which one of those three sets is actually being parsed.

: This solution hinders the extensibility of the parser, because the addition of a symbol requires modifying the lookahead.

: However, such an issue may be bypassed, at least in a first version, by enforcing some usage rules. Thus, a new symbol contributed by an extension but already used in the lookahead may be rejected as an invalid symbol.

: SOLUTION #2: Implementing a backtracking parser.

: More precisely, it means that parsing of each of the three sets is attempted. The parsing is successful if and only if exactly one of these attempts leads to a result (''i.e.'', it is a failure if several attempts, or none, lead to a result).

: It is a good solution provided that we use memoisation to alleviate the cost of parsing formulas that require heavy backtracking (a sketch is given below).
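A minimal sketch of such backtracking with memoisation (packrat-style). The alternative sub-parsers, the token list and the failure convention are all illustrative assumptions:
<blockquote><code><pre>
# Illustrative sketch: try several alternative sub-parsers from the same
# position, memoising results so heavy backtracking stays cheap.
memo = {}                                 # (parser name, position) -> result or None

def try_alternatives(alternatives, tokens, pos):
    results = []
    for parse in alternatives:            # e.g. extension set, comprehension set, quantified set
        key = (parse.__name__, pos)
        if key not in memo:
            try:
                memo[key] = parse(tokens, pos)
            except SyntaxError:
                memo[key] = None          # remember the failure too
        if memo[key] is not None:
            results.append(memo[key])
    if len(results) != 1:                 # fail if none or several alternatives succeed
        raise SyntaxError("ambiguous or invalid formula")
    return results[0]
</pre></code></blockquote>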
  
==== About the Initial Release ====
* Defining a grammar or a grammar extension is not an easy job, even with those simple mechanisms.
The Code Generation (CG) Feature in the initial release is a demonstration tool; a proof of concept, rather than a prototype. The tool has no static checker and, therefore, there will be a heavy reliance on docs and dialogue to facilitate exploration of the tools and concepts.

: It would certainly be a good idea to implement a user interface to test the grammar, so that an extension designer has a way to ensure that he agrees with the parser on the semantics of his extension (or at least on the AST obtained by using it). For example, the representation of the syntax tree returned by <tt>Formula.getSyntaxTree</tt> could be compared to an expected string entered by the designer (a sketch is given below).
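A rough sketch of such a check, in the spirit of a unit test. The parser is passed in as a callable, and the <code>repr</code> of a nested-tuple AST stands in for what <tt>Formula.getSyntaxTree</tt> would print; both are assumptions for illustration:
<blockquote><code><pre>
# Sketch of a grammar test: parse a formula with the extended grammar and
# compare the resulting syntax tree with the tree the extension designer expects.
def check_extension(parse, formula, expected_tree):
    actual = repr(parse(formula))
    if actual != expected_tree:
        raise AssertionError(
            f"parser disagrees on {formula!r}: expected {expected_tree}, got {actual}")

# Hypothetical usage with some parser "my_parse":
# check_extension(my_parse, "a + b * c", "('+', 'a', ('*', 'b', 'c'))")
</pre></code></blockquote>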
  
==== Source Code ====
=== Core Algorithm ===

==== Main Objects ====

;{{class|KnownSymbols}}: a table associating a {{class|SymbolParser}} with each known symbol.

The sources are available from,

;{{class|SymbolParser}}: a parser for a specific symbol. It should at least have the following attributes:

:* {{class|Array<Parser>}} {{ident|led}} for parsing the expression associated with the symbol {{ident|id}} when it appears in the middle of an expression (''left denotation'')

:* {{class|Array<Parser>}} {{ident|nud}} for parsing the expression associated with the symbol {{ident|id}} when it appears at the beginning of an expression (''null denotation'')

:* {{class|String}} {{ident|id}}

:* {{class|String}} {{ident|gid}} designating the group to which the {{ident|id}} symbol belongs.

:Note that the parsers in ''led'' (respectively ''nud'') will be tried one after another until one matches (with backtracking to the original state in between).
  
https://codegenerationd.svn.sourceforge.net/svnroot/codegenerationd
;{{class|Parser}}: a parser, which should have the following attribute:

:*{{class|Parser Array}} {{ident|subparsers}}: an array of subparsers used by the {{ident|parse}} method. They default to {{ident|MainParser}}, but may be set to specific parsers.

:and the following method:

:*{{ident|parse}}: this method should consume tokens and return an AST. It may use several {{ident|subparsers}}.

Note - I used Eclipse 3.5 Galileo, and you will need to install (or have sources from) Epsilon's interim update site. There is also a dependency on Camille v2.0.0

;{{class|ParserBuilder}}: classes that inherit from this class provide the interface used to create new symbols. It has the following method:

:*{{ident|build}}: which builds a {{class|SymbolParser}} and stores it in the {{class|KnownSymbols}} container, associated with the symbol for which the parser is being built.
  
  
;{{class|MainParser}}
:This class inherits from {{class|Parser}} and its {{ident|parse}} function implements the core of the ''Pratt'' algorithm.
  
[[Category:Work in progress]]
:Without taking backtracking into account, the algorithm reads:
<blockquote><code><pre>
def parse(recognized_token, has_right_associativity=False):
    global token
    t = token                       # t is the symbol being processed
    token = tokenizer.next()        # token is the next symbol
    left = t.nud()                  # a new expression is being parsed (null denotation: nothing on the left)
    while has_less_priority(recognized_token, token, has_right_associativity):
        t = token
        token = tokenizer.next()    # if precedence allows it, we parse the next expression
        left = t.led(left)          # left becomes the left part of the current symbol (left denotation)
    return left
</pre></code></blockquote>
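The helper <tt>has_less_priority</tt> is not spelled out in this proposal. One possible reading of it, building on the group and symbol precedence containers sketched earlier (this is an assumption, not the actual implementation), is:
<blockquote><code><pre>
# Possible reading of has_less_priority (assumed): the loop continues while the
# next symbol binds more tightly than the symbol we are currently parsing for.
def has_less_priority(recognized_token, next_token, has_right_associativity):
    if next_token.gid != recognized_token.gid:
        # symbols of different groups compare through group precedence
        higher = (recognized_token.gid, next_token.gid) in group_precedence
    else:
        # symbols of the same group compare through that group's symbol precedence
        higher = (recognized_token.id, next_token.id) in group[recognized_token.gid].precedence
    if has_right_associativity:
        # for right-associative symbols, equal symbols also continue the loop
        higher = higher or next_token.id == recognized_token.id
    return higher
</pre></code></blockquote>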
:The parsing of a new expression is initiated by:
<blockquote><code><pre>
    token = tokenizer.next()
    ast = MainParser.parse(KnownSymbols["(eof)"])
</pre></code></blockquote>
:where the {{ident|tokenizer}} returns the {{class|SymbolParser}} associated with the next lexeme.
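A standalone sketch of such a tokenizer. A plain dict stands in for {{class|KnownSymbols}}, and the "(ident)" entry for unknown lexemes is an assumption made for illustration:
<blockquote><code><pre>
# Sketch of a tokenizer that maps lexemes to their SymbolParser entries.
import re

class Tokenizer:
    def __init__(self, text, known_symbols):
        self.lexemes = re.findall(r"\w+|\S", text) + ["(eof)"]
        self.known_symbols = known_symbols
        self.position = 0

    def next(self):
        lexeme = self.lexemes[self.position]
        self.position += 1
        if lexeme in self.known_symbols:
            return self.known_symbols[lexeme]
        # unknown lexemes are treated as identifiers ("(ident)" is an assumed entry)
        return self.known_symbols["(ident)"]
</pre></code></blockquote>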
;{{class|identListParser}}: inherits from {{class|Parser}}. Its {{ident|parse}} function implements the parsing of an ident list (like <math>a,b,c\;</math>). A sketch is given below.

;{{class|identMapletListParser}}: inherits from {{class|Parser}}. Its {{ident|parse}} function implements the parsing of a maplet list as used in lambda expressions (like <math>a \mapsto (b \mapsto c)\;</math>).

;{{class|expressionListParser}}: inherits from {{class|Parser}}. Its {{ident|parse}} function implements the parsing of a list of expressions, as used in the extension set expression.
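A simplified, standalone sketch of the ident-list parsing; the real {{class|identListParser}} would work on the shared token stream rather than on a plain token list:
<blockquote><code><pre>
# Simplified sketch of the ident-list specific parser: it consumes tokens such
# as ["a", ",", "b", ",", "c"] and returns the list of identifiers.
def parse_ident_list(tokens, pos=0):
    idents = []
    while True:
        name = tokens[pos]
        if not name.isidentifier():
            raise SyntaxError(f"identifier expected, got {name!r}")
        idents.append(name)
        pos += 1
        if pos >= len(tokens) or tokens[pos] != ",":
            return idents, pos
        pos += 1                      # skip the comma and continue

print(parse_ident_list(["a", ",", "b", ",", "c"]))   # (['a', 'b', 'c'], 5)
</pre></code></blockquote>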
{{TODO|section being worked on}}

;{{class|infixParser}}: inherits from {{class|Parser}}. Its {{ident|parse}} function is a ''led'' parser and implements something like (pseudocode):
<blockquote><code><pre>
parse(AST left)
  ast = new AST()
  ast.first = left
  verify_expected_group(ast.first, this.expected_non_terminals[0])
  ast.second = this.parsers[0].parse(ast, is_right_associative=False)
  verify_expected_group(ast.second, this.expected_non_terminals[1])
  return ast
</pre></code></blockquote>
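The pseudocode relies on <tt>verify_expected_group</tt>, which is not defined in this proposal. A possible sketch, assuming each AST node records the group of its root symbol, is:
<blockquote><code><pre>
# Possible sketch of verify_expected_group: check that the root of a parsed
# sub-expression belongs to the expected non-terminal group (assumption).
def verify_expected_group(ast, expected_group):
    if ast.root_group != expected_group:
        raise SyntaxError(f"expected a {expected_group}, but parsed a {ast.root_group}")
</pre></code></blockquote>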
;{{class|infixBuilder}}: inherits from {{class|ParserBuilder}} and has the following method:

:*<code>build(Lexem lexem, GroupID gid, Array<Parser> parsers, Array<NonTermGroups> expected_non_terminals)</code> which implements something like (sketch):
<blockquote><code><pre>
if lexem in KnownSymbols:
    KnownSymbols[lexem].led.append(infixParser(parsers, expected_non_terminals))   # symbol already known: add a led alternative
else:
    KnownSymbols[lexem] = SymbolParser(lexem, gid, led=[infixParser(parsers, expected_non_terminals)])
</pre></code></blockquote>
{{TODO}}
=== Open Questions ===

How does the user specify the expected AST?

: Some examples in the existing parser:

:* some binary infix predicates are in fact parsed as n-ary operators (with more than 2 children if possible). Shall we generalise this feature? How will this fit within the proposed extension scheme?

:* bound identifiers are converted to ''de Bruijn'' indices while constructing the AST. Is this desirable? How does the user specify whether identifiers are bound or not? (A small illustration of the conversion is given below.)
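For readers unfamiliar with the technique, here is a small illustration of converting bound identifiers to ''de Bruijn'' indices on a toy AST; the tuple representation is purely illustrative and is not the Rodin AST:
<blockquote><code><pre>
# Toy illustration of de Bruijn conversion: a bound identifier is replaced by
# the number of binders between its occurrence and its own binder.
def to_de_bruijn(node, bound=()):
    kind = node[0]
    if kind == "ident":
        name = node[1]
        return ("bound", bound.index(name)) if name in bound else node
    if kind == "forall":                       # ("forall", name, body)
        _, name, body = node
        return ("forall", to_de_bruijn(body, (name,) + bound))
    return (kind,) + tuple(to_de_bruijn(child, bound) for child in node[1:])

ast = ("forall", "x", ("forall", "y", ("<", ("ident", "x"), ("ident", "y"))))
print(to_de_bruijn(ast))
# ('forall', ('forall', ('<', ('bound', 1), ('bound', 0))))
</pre></code></blockquote>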
=== Sample Implementation ===

A prototype has been developed in Python to quickly try different solutions.

The development tree of the Python prototype is available at http://bitbucket.org/matclab/eventb_pratt_parser/
[[Category:Design Proposal]]
