LexAna - Bottom up parser based on regular expressions

2. Bottom-up recursion in LexAna

Although LexAna is bottom up parser, special type of recursion is implemented anyway. Parser rule of the form:

@ ->Regex

has special meaning. Regex expression must have at least one capturing group. For instance, the parsing rule:

@ -><([^<>]+)>

sends the piece of token string between angle brackets to the parser (calling the same parser with string part as argument). After that, Lexana deletes brackets and replaces the piece of token string with given result.

Problem 1: Parsing numerical expressions using recursion

Input string:

"-1+8*(4+20/10)*(-3-2)-2+10*(-1+2)"

Scanner:

n ->\d+

. ->[()]

m ->[*/]

a ->[+-]

Parser:

# Recursion call syntax is: @->X(...)Y, where X and Y are regexes and (...) is a capturing group

@ ->$([^()]+)$

# T - Term is number or expression (or something else, see below)

T ->n|S

# F - Factor is term [*/] term, providing operator precedence (first */ and then +-)

F ->T(mT)*

# S - Expression is factor [+-] factor

S ->a?F(aF)*

Result of the scanning is the token string:

ananm(nanmn)m(anan)ananm(anan)

Recursion rule @ deletes brackets and calls Parser again, where the argument is string inside. For example, content in leftmost bracket is :

nanmn

LexaAna Parser with this argument returns token S. Result after recursion rule @ is:

ananmSmSananmS

After that, Parser continues and final result is S, expression is correct. Parsing tree is:

(-, 1, +, (8, *, (4, +, (20, /, 10)), *, (-, 3, -, 2)), -, 2, +, (10, *, (-, 1, +, 2)))

For nested brackets we have to put sufix + after symbol @. There is no backtracking in LexaAna parser. After recursion call result will be replaced, no matter what that result is.

This example is very significant and useful, especially for operators precedence. It is easy to expand it. If we add, for instnce, another Scanner rule:

v ->[a-zA-Z]+

and change term rule of Parser into:

T->n|S|v

our parser can receive and variables too. Furthermore, adding vS of the Term rule, parser recognises and functional calls.

T->n|S|vS|v

Remember, vS must be before of v in alternation rule. Also, adding more operators (power, relations < = >, logical operations... ) is almost trivial:

Input string:

"2+3*(a-5)+f(x-2)>0"

# "(a>b|a!=2) & (x<=(y+5)*2 | 2+2>5)"

Scanner:

n->\d+

v ->[a-zA-Z]+

. ->[()]

m ->[*/]

a ->[+-]

p ->\^

c->[=!<>]=|[<>]

Parser:

# @+ means: do @ multiple times (for nested brackets)

@+->$([^()]+)$

T->n|S|vS|v

P ->T(pT)*

F ->P(mP)*

S->a?F(aF)*

C->ScS

Result:

C, ((2, +, (3, *, (a, -, 5)), +, (f, (x, -, 2))), >, 0)

or L, (((a, >, b), |, (a, !=, 2)), &, ((x, <=, ((y, +, 5), *, 2)), |, ((2, +, 2), >, 5)))

Adding Scanner rule:

l ->[&|]

and Parser rule at the end of the list:

L+ ->[LC](l[LC])*

we allowed our parser to recognizes and logical expressions (with brackets) too.

LexAna Parser with implemented recursion, process inner parentheses first (like a human). Without recursion, LexAna parser would be just an interesting idea. Altough very powerful, there are some problems:

Difficult to debugging
Recursion can’t determinates beginning (end) of token string, just beginning of argument string.
All parsing rules response equaly in recursion call and regular execution.
During expression evaluation, this process corupts every identificators and numbers assiging it to S symbol. Before recursion call, we need to hide it.

Example: Very simple parser for very simple regular expressions (character classes, capturing groups, simple quantifiers, alternation)

Input string:

"\-?(0|[1-9]\d*)(\.\d+)?([eE][+-]=\d+)?"

Scanner:

m->\\[.^$*+?{}[\]\\|()-]
c->\\[dDwWsSrntv]|\^|\$
.->[*+?[\]|()-]
s->[\s\S]

Parser:
C ->\[[^\]]*\]
@+ ->$([^()]*)$
E ->[RmcsC\]-]
S ->(E[*+?]?)+
R ->S(\|S)*

Result:

R, (\-, ?, (0, |, (([, 1, -, 9, ]), \d, *)), (\., \d, +), ?, (([, e, E, ]), ([, +, -, ]), =, \d, +), ?)

Example: Multifunctional call in variable expression

Input string:

"2*(f(x-2,2*g(x+1))-3)"

# "f(1,2),3"

Scanner:

n->\d+

w ->[a-zA-Z]+

m ->[*/]

s ->[+-]

. ->[(,)]

Parser:
@+ ->$([^()]+)$

T ->n|w[LS]|w|S
F ->T(mT)*
S ->s?F(sF)*
# L as a list of expressions
L ->S(,S)+

Result:

S, (2, *, ((f, (x, -, 2), (2, *, (g, (x, +, 1)))) , -, 3))

or L, ((f, (1, 2)), 3) This is an error

Example: Double recursion call. Evaluates +, -, *, /, variables, multi function such as f(1,2,3), multi array such as a[1,2,3].

Input string:

"2*g(x+1,a[0,s(0)])-1"

Scanner:

n->\d+

f ->[a-zA-Z]+(?=\()

b->[a-zA-Z]+(?=\[)

w ->[a-zA-Z]+

m ->[*/]

s ->[+-]

. ->[(,)[\]]

Parser:

@+ ->\[([^[\]]+)\]

@+ ->$([^()]+)$

T ->n|w|[fb][LS]|S
F ->T(mT)*
S ->s?F(sF)*
L ->S(,S)+

Result:

S, ((2, *, (g, ((x, +, 1), (a, (0, (s, 0)))))), -, 1)

After recursion call brackets are gone, how to distinct array from function? So, we can use Pebbles mode.

Pebbles mode in LexAna

Every token in LexAna corresponds with some content. Content could be simple - string and complex - (sub)array. After succesful parsing step, Parser replaces group of tokens with one token and wraps its group of contents into an array, assigning it to the new token. If Pebbles mode is on, informations about old and new tokens writes into the array. Pebbles informations has only arrays, not strings.

Example: If we parse simple numerical expression in Pebbles mode with Parser like this one:

Input string:

"-2/1-7*8/2+9-1"

Scanner:

n->\d+
m ->[*/]
a ->[+-]

Parser:

F->n(mn)*
S->a?F(aF)*

Parsing array looks like this:

Or like this:

S, (S:aFaFaFaF, -, (F:nmn, 2, /, 1), -, (F:nmnmn, 7, *, 8, /, 2), +, 9, -, 1)

The first cell of an array contains Pebbles informations. For instance, the cell with index 2 is a factor, containing lexical units nmn, while all array has Pebbles information S:aFaFaFaF.
It could be very useful to distinct different lexical and parsing elements after Parsing.

Example: Parsing a JSON string with pebbles (see previous chapter at the end)

Input string:

'''
{ "id": 1,
"Surname": [],
"price": 100,
"Handler": [
"Bar",
"Foo", {},
"Q",1],
"Stock": {"Market": 300, "goods": [1],
"Retail": {"old":1}
}}
'''

Scanner:

# No scanner. If you avoid it, Lexana works with it's internal scanner

Parser:
# Grammar: J = { (str : J) , (str : J)*} | [ J(,J)* ] | str | num | label
_->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}

J
.J:qJqJqJqJqJ
.id
.1
.Surname
.[]
.price
.100
.Handler
..J:JJJJJ
..Bar
..Foo
..{}
..Q
..1
.Stock
..J:qJqJqJ
..Market
..300
..goods
...J:J
...1
..Retail
...J:qJ
...old
...1

# All pebbles are bolded

Now it's easy to recognize what kind of syntactic unit is after "Stock", for example (An object).

Summary: Let T ->regex be a parsing rule and string XYZ is the regex match. After replacing group of tokens XYZ with token T, LexAna in Pebbles mode adds, at the beggining of wrapped array, string T:XYZ.

LexAna - Bottom up parser based on regular expressions

Thursday, June 4, 2020

2. Bottom-up recursion in LexAna

Pebbles mode in LexAna

0 Comments:

Post a Comment

About Me

Previous Posts