Thursday, June 4, 2020

2. Bottom-up recursion in LexAna

Although LexAna is a bottom-up parser, a special type of recursion is implemented anyway. A parser rule of the form:

@ ->Regex

has special meaning. The regex must have at least one capturing group. For instance, the parsing rule:


@ -><([^<>]+)>

sends the piece of the token string between the angle brackets to the parser (calling the same parser with that substring as the argument). After that, LexAna deletes the brackets and replaces the piece of the token string with the returned result.
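The behaviour of this rule can be sketched in a few lines of Python. This is only an illustration: the parse function here is a stand-in for a LexAna parser pass, reducing any non-empty argument to the token S.

```python
import re

def parse(tokens):
    # Stand-in for a LexAna parser pass: reduce any non-empty
    # argument to the single token S.
    return "S" if tokens else ""

def recursion_rule(tokens):
    # @ -><([^<>]+)> : parse the part between the angle brackets,
    # then replace the brackets and contents with the sub-result.
    return re.sub(r"<([^<>]+)>", lambda m: parse(m.group(1)), tokens)

print(recursion_rule("ab<cd>ef"))  # abSef
```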


Problem 1: Parsing numerical expressions using recursion

Input string:

"-1+8*(4+20/10)*(-3-2)-2+10*(-1+2)"

Scanner:
n ->\d+
. ->[()]
m ->[*/]
a ->[+-]
Parser:
# Recursion call syntax is: @->X(...)Y, where X and Y are regexes and (...) is a capturing group
@ ->\(([^()]+)\)
# T - Term is number or expression (or something else, see below)
T ->n|S
# F - Factor is term [*/] term, providing operator precedence (first */ and then +-)
F ->T(mT)*
# S - Expression is factor [+-] factor
S ->a?F(aF)*


Result of the scanning is the token string:

ananm(nanmn)m(anan)ananm(anan)

The recursion rule @ deletes the brackets and calls the Parser again with the string inside as the argument. For example, the content of the leftmost brackets is:

nanmn

The LexAna Parser called with this argument returns the token S. The result after the recursion rule @ is:

ananmSmSananmS

After that, the Parser continues and the final result is S: the expression is correct. The parsing tree is:

(-, 1, +, (8, *, (4, +, (20, /, 10)), *, (-, 3, -, 2)), -, 2, +, (10, *, (-, 1, +, 2)))
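The whole pipeline for this example can be emulated with plain regex substitutions in Python. This is only a sketch of the token-string side (tree building is omitted, the function names are ours, and the @ rule is applied once per pass, which is enough here since there are no nested brackets):

```python
import re

# Scanner rules in priority order; "." means: token = matched character
SCANNER = [("n", r"\d+"), (".", r"[()]"), ("m", r"[*/]"), ("a", r"[+-]")]
# Parser rules in order: T, F, S (the recursion rule @ is handled in parse)
PARSER = [("T", r"n|S"), ("F", r"T(mT)*"), ("S", r"a?F(aF)*")]

def scan(text):
    tokens = ""
    while text:
        for name, rx in SCANNER:
            m = re.match(rx, text)
            if m:
                tokens += m.group(0) if name == "." else name
                text = text[m.end():]
                break
        else:
            raise ValueError("Scanner cannot read: " + text)
    return tokens

def parse(tokens):
    # Recursion rule  @ ->\(([^()]+)\) : parse bracket contents first,
    # then replace the brackets and contents with the sub-result.
    tokens = re.sub(r"\(([^()]+)\)", lambda m: parse(m.group(1)), tokens)
    for name, rx in PARSER:
        tokens = re.sub(rx, name, tokens)
    return tokens

expr = "-1+8*(4+20/10)*(-3-2)-2+10*(-1+2)"
print(scan(expr))         # ananm(nanmn)m(anan)ananm(anan)
print(parse(scan(expr)))  # S
```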

For nested brackets we have to put the suffix + after the symbol @. There is no backtracking in the LexAna parser: after the recursive call, the result replaces the bracketed part no matter what that result is.
This example is very significant and useful, especially for operator precedence, and it is easy to extend. If we add, for instance, another Scanner rule:

v ->[a-zA-Z]+

and change term rule of Parser into:

T->n|S|v

our parser can accept variables too. Furthermore, by adding vS to the Term rule, the parser recognizes function calls as well.

T->n|S|vS|v

Remember, vS must come before v in the alternation. Also, adding more operators (power, relations < = >, logical operations...) is almost trivial:

Input string:
"2+3*(a-5)+f(x-2)>0"
# "(a>b|a!=2) & (x<=(y+5)*2 | 2+2>5)"
Scanner:
n->\d+
v ->[a-zA-Z]+
. ->[()]
m ->[*/]
a ->[+-]
p ->\^
c->[=!<>]=|[<>]
Parser:
# @+ means: do @ multiple times (for nested brackets)
@+->\(([^()]+)\)
T->n|S|vS|v
P ->T(pT)*
F ->P(mP)*
S->a?F(aF)*
C->ScS
Result:
C, ((2, +, (3, *, (a, -, 5)), +, (f, (x, -, 2))), >, 0)
or    L, (((a, >, b), |, (a, !=, 2)), &, ((x, <=, ((y, +, 5), *, 2)), |, ((2, +, 2), >, 5)))

Adding Scanner rule:

l ->[&|]

and Parser rule at the end of the list:

L+ ->[LC](l[LC])*

we allow our parser to recognize logical expressions (with brackets) too.

The LexAna Parser with recursion implemented processes inner parentheses first (like a human). Without recursion, the LexAna parser would be just an interesting idea. Although very powerful, it has some problems:


  • Difficult to debug
  • A recursive call cannot determine the beginning (or end) of the whole token string, only of its argument string.
  • All parsing rules respond equally in a recursive call and in regular execution.
  • During expression evaluation, this process reduces every identifier and number to the S symbol. Before the recursive call, we need to hide them.

Example: Very simple parser for very simple regular expressions (character classes, capturing groups, simple quantifiers, alternation)

Input string:
"\-?(0|[1-9]\d*)(\.\d+)?([eE][+-]=\d+)?"
Scanner:
m->\\[.^$*+?{}[\]\\|()-]
c->\\[dDwWsSrntv]|\^|\$
.->[*+?[\]|()-]
s->[\s\S]
Parser:
C ->\[[^\]]*\]
@+ ->\(([^()]*)\)
E ->[RmcsC\]-]
S ->(E[*+?]?)+
R ->S(\|S)*
Result:
R, (\-, ?, (0, |, (([, 1, -, 9, ]), \d, *)), (\., \d, +), ?, (([, e, E, ]), ([, +, -, ]), =, \d, +), ?)


Example: Multi-argument function calls in an expression with variables


Input string:
"2*(f(x-2,2*g(x+1))-3)"
# "f(1,2),3"
Scanner:
n->\d+
w ->[a-zA-Z]+
m ->[*/]
s ->[+-]
. ->[(,)]
Parser:
@+ ->\(([^()]+)\)
T  ->n|w[LS]|w|S
F  ->T(mT)*
S  ->s?F(sF)*
# L as a list of expressions
L ->S(,S)+
Result:
S, (2, *, ((f,  (x, -, 2), (2, *, (g, (x, +, 1)))) , -, 3))
or    L, ((f, (1, 2)), 3)  (this is an error)

Example: Double recursion. Evaluates +, -, *, /, variables, multi-argument functions such as f(1,2,3) and multi-index arrays such as a[1,2,3].

Input string:
"2*g(x+1,a[0,s(0)])-1"

Scanner:
n->\d+
f ->[a-zA-Z]+(?=\()
b->[a-zA-Z]+(?=\[)
w ->[a-zA-Z]+
m ->[*/]
s ->[+-]
. ->[(,)[\]]
Parser:
@+ ->\[([^[\]]+)\]
@+ ->\(([^()]+)\)
T  ->n|w|[fb][LS]|S
F  ->T(mT)*
S  ->s?F(sF)*
L  ->S(,S)+
Result:
S, ((2, *, (g, ((x, +, 1), (a, (0, (s, 0)))))), -, 1)



After the recursive call the brackets are gone, so how do we distinguish an array from a function call? For that, we can use Pebbles mode.

Pebbles mode in LexAna


Every token in LexAna corresponds to some content. Content can be simple (a string) or complex (a sub-array). After a successful parsing step, the Parser replaces a group of tokens with one token and wraps the group's contents into an array, assigning it to the new token. If Pebbles mode is on, information about the old and new tokens is written into the array. Only arrays carry Pebbles information, not strings.

Example: If we parse a simple numerical expression in Pebbles mode with a Parser like this one:

Input string:
"-2/1-7*8/2+9-1"
Scanner:
n->\d+
m ->[*/]
a ->[+-]
Parser:
F->n(mn)*
S->a?F(aF)*

The parsing array looks like this:

S, (S:aFaFaFaF, -, (F:nmn, 2, /, 1), -, (F:nmnmn, 7, *, 8, /, 2), +, 9, -, 1)

The first cell of an array contains the Pebbles information. For instance, the cell with index 2 is a factor containing the lexical units nmn, while the whole array has the Pebbles information S:aFaFaFaF.
This makes it easy to distinguish different lexical and parsing elements after parsing.
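A single parsing step with pebbles can be sketched like this (our own minimal emulation, not LexAna's code; single-token matches keep their content as a plain string, so only arrays get a pebble, as in the example above):

```python
import re

def step(tokens, contents, name, rx, pebbles=True):
    # One parsing step: replace every regex match in the token string
    # with `name` and wrap the matching lexemes into a sub-array.
    out_t, out_c, pos = "", [], 0
    for m in re.finditer(rx, tokens):
        out_t += tokens[pos:m.start()] + name
        out_c += contents[pos:m.start()]
        group = contents[m.start():m.end()]
        if len(group) == 1:
            out_c.append(group[0])      # single token: content stays a string
        else:
            if pebbles:
                group = [name + ":" + m.group(0)] + group
            out_c.append(group)
        pos = m.end()
    return out_t + tokens[pos:], out_c + contents[pos:]

# "-2/1-7*8/2+9-1" scans to "anmnanmnmnanan" (all lexemes are single chars here)
t, c = "anmnanmnmnanan", list("-2/1-7*8/2+9-1")
t, c = step(t, c, "F", r"n(mn)*")       # F ->n(mn)*
t, c = step(t, c, "S", r"a?F(aF)*")     # S ->a?F(aF)*
print(t)  # S
print(c)  # [['S:aFaFaFaF', '-', ['F:nmn', '2', '/', '1'], '-',
          #   ['F:nmnmn', '7', '*', '8', '/', '2'], '+', '9', '-', '1']]
```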

Example: Parsing a JSON string with pebbles (see previous chapter at the end)

Input string:
'''
{ "id": 1,
  "Surname": [],
  "price": 100,
  "Handler": [
    "Bar",
    "Foo", {},
    "Q",1],
  "Stock": {"Market": 300, "goods": [1],
  "Retail": {"old":1}
}}
'''
Scanner:
# No scanner. If you omit it, LexAna works with its internal scanner
Parser:
# Grammar: J = { (str : J) , (str : J)*} | [ J(,J)* ] | str | num | label
_->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}
J
.J:qJqJqJqJqJ
.id
.1
.Surname
.[]
.price
.100
.Handler
..J:JJJJJ
..Bar
..Foo
..{}
..Q
..1
.Stock
..J:qJqJqJ
..Market
..300
..goods
...J:J
...1
..Retail
...J:qJ
...old
...1

# All pebbles (the entries of the form T:XYZ) are bolded in the original output

Now it's easy to recognize what kind of syntactic unit comes after "Stock", for example (an object).

Summary: Let T ->regex be a parsing rule and let the string XYZ be the regex match. After replacing the group of tokens XYZ with the token T, LexAna in Pebbles mode adds the string T:XYZ at the beginning of the wrapped array.

Thursday, May 21, 2020


1. LexAna - introduction



LexAna is a bottom-up, level-order, non-sequential parser based on regular expressions for lexical and syntax processing. Features:

  • Uses regexes for lexical and syntax part
  • Input is a text string, output is a parsing tree, something like s-expression
  • Suitable for simple parsing, but parses even some programming languages
  • A simple and intuitive approach
  • Easy error handling
  • Parser never fails, always succeeds.
  • No ambiguity
  • Search & replace parsing technique
  • Automatically builds a syntax tree
  • Reverse level order tree building
  • Constant tree transformation in syntax analysis
  • No need for a grammar
  • Simple syntax, simple meaning: T ->R, (T - token char, R - regex expression), replaces regex match with T
  • Easy to control parsing process
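The core rule semantics T ->R (replace every regex match of R with the token T) boils down to a single regex substitution. A minimal sketch in Python, with a hypothetical token string:

```python
import re

# One parsing step: the rule  V ->I|d  replaces every match of the
# regex I|d in the token string with the token V.
tokens = "IadmI"   # a hypothetical token string
tokens = re.sub(r"I|d", "V", tokens)
print(tokens)      # VaVmV
```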
Lexana deals with:

  • Programming languages (Basic, Logo, Tiny, PL/0 ...)
  • Formats (HTML, JSON, XML, INI ...)
  • Regular expressions parser, complex expressions with operator precedence ... 
  • Roman numerals, as advanced split function ...




LexAna example of building a tree using bracket expressions.
Notice: every token is one character long.



How does it work?
1. The first regex \d+ succeeds and the Scanner fills out the first row of the token table. The input string is now three characters shorter.

2. The first and the second regex fail, but the third regex [(),] succeeds and the Scanner fills out another row of the token table.

...

At the end, the input string is empty and tokenization is successful. The output is the parse tree (tokens + lexemes).
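The scanning loop just described can be sketched in Python (a hypothetical two-rule scanner; the function name and structure are ours):

```python
import re

RULES = [("n", r"\d+"), (".", r"[(),]")]  # ordered: the first rule wins

def scan(text):
    rows = []  # token table: (token, lexeme) pairs
    while text:
        for name, rx in RULES:
            m = re.match(rx, text)      # regexes are tried in rule order
            if m:
                lexeme = m.group(0)
                rows.append((lexeme if name == "." else name, lexeme))
                text = text[m.end():]   # the input string gets shorter
                break
        else:
            raise ValueError("Scanner cannot read: " + text)
    return rows

print(scan("100,(7)"))
# [('n', '100'), (',', ','), ('(', '('), ('n', '7'), (')', ')')]
```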


LexAna Scanner priority
The rule order is important

Let's see a more realistic example:

Scanner<
n  ->\d+
w ->[a-zA-Z]+
c ->=|==|!=|>|<|<>
k ->if|then|else|return|let
g ->goto
s->[ ]+
> 

Let's see some scanning:

Input string       Output (Parse tree array)                       Remark

a=5                (wcn, a, =, 5)                                  OK
let cnt!=3         (wswcn, let,  , cnt, !=, 3)                     No k token for let, but a w token for identifiers? Something is going wrong!
x<>y               (wccw, x, <, >, y)                              In the alternation list for token c, the chars > and < have a higher priority than <>
c==0               (wccn, c, =, =, 0)                              Again the same mistake.
if x!=y goto 100   (wswcwswsn, if,  , x, !=, y,  , goto,  , 100)   No keyword tokens for the if and goto statements. Tokens k and g are unreachable in this rule order: the definition of token w comes first.

The rule order is important, and so is the alternation order. A proper definition would be:

Scanner<
n  ->\d+
c ->==|=|!=|<>|<|>
k ->if|then|else|return|let
g ->goto
w ->[a-zA-Z]+
s->[ ]+
> 


Input string       Output (an array)                               Remark

a=5                (wcn, a, =, 5)                                  OK
let cnt!=3         (kswcn, let,  , cnt, !=, 3)                     OK
x<>y               (wcw, x, <>, y)                                 OK
c==0               (wcn, c, ==, 0)                                 OK
if x!=y goto 100   (kswcwsgsn, if,  , x, !=, y,  , goto,  , 100)   OK
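The effect of alternation order can be checked directly with Python's regex engine, which, like LexAna, tries alternatives from left to right:

```python
import re

# Alternatives are tried from left to right:
print(re.match(r"=|==", "==0").group(0))   # =   (the short branch wins)
print(re.match(r"==|=", "==0").group(0))   # ==  (longest branch listed first)
```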

LexAna Scanner features:


  • The LexAna Scanner is a lexical analyzer, sequentially consuming an input string and producing a token (and lexeme) in every step
  • Input is a string, the output is a parse tree, i.e. the array (t, l, l, ...), where t is the string of tokens and l is a string or an array of strings (something like a Lisp list, i.e. an s-expression)
  • The Scanner is defined by a list of rules T ->R, where T is a token character and R is a regex
  • Token definitions (rules) are ordered (the first has the highest priority)
  • A token is one character long (letter, number or special symbol)
  • A lexeme (token content) is a string or an array of strings, not an array of one string
  • The token's name (letter or not) is significant to the Parser
  • If a part of the text is not readable by the Scanner, it does not cause a stop; the Scanner forwards it to the Parser
  • In the output array, the length of the token string is equal to the number of lexemes (parse tree integrity)
  • No zero-length match is allowed, no empty regexes; ^ is useless in the Scanner. The symbols $, \b, \B and lookahead work fine with other characters
  • Every token has a value (lexeme); there is no empty lexeme
  • The symbols _ (underline) and . (dot) are reserved token names. Underline is for ignoring. Dot means lexeme = token, producing terminal symbols
  • If the Scanner table is omitted, LexAna uses its intrinsic Scanner definition
  • One token can have more than one definition
  • Without capturing groups in the regex, the lexeme is the regex match
  • Branch reset: it's possible to provoke the Scanner to produce two tokens at once, T0, T1, ... T9


   LexAna parsing example


Input string:

John 100; Bill 150; Ann 80;


 Scanner:
i ->[a-zA-Z]+
n ->\d+
. ->[;]
Token names i and n are one char long (important)


Output token string (after scanning):

in;in;in;



Parsing tree (after scanning):

John, 100, ;, Bill, 150, ;, Ann, 80, ;

Parsing tree is an array with the same length as token string


Parser:
P ->in;
Q ->P+
Parser and scanner have similar format, but slightly different semantics


After the first parsing rule:

Tokens: PPP

Parsing tree:
(John, 100),(Bill, 150),(Ann, 80)

Parser automatically wraps data and replaces all groups of tokens  in;  with token  P


After the second parsing rule:

Token: Q
    
Parsing tree:
((John, 100),(Bill, 150),(Ann, 80))

Parsing is done: there is only one token left, and the length of the array is also one



LexAna Parser replaces one or more groups of tokens defined by regex and wraps their contents


The Parser reduces the number of tokens in every step and deletes punctuation
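One wrapping step can be sketched as follows (our own emulation; for brevity only ';' is treated as punctuation here):

```python
import re

def wrap_step(tokens, tree, name, rx):
    # Replace every match of rx in the token string with `name` and
    # wrap the matching lexemes into a sub-tuple, dropping punctuation
    # (only ';' is treated as punctuation in this sketch).
    out_t, out_tree, pos = "", [], 0
    for m in re.finditer(rx, tokens):
        out_t += tokens[pos:m.start()] + name
        out_tree += tree[pos:m.start()]
        pair = zip(tree[m.start():m.end()], tokens[m.start():m.end()])
        out_tree.append(tuple(lex for lex, tok in pair if tok != ";"))
        pos = m.end()
    return out_t + tokens[pos:], out_tree + tree[pos:]

tokens = "in;in;in;"
tree = ["John", "100", ";", "Bill", "150", ";", "Ann", "80", ";"]
tokens, tree = wrap_step(tokens, tree, "P", r"in;")   # P ->in;
print(tokens)  # PPP
print(tree)    # [('John', '100'), ('Bill', '150'), ('Ann', '80')]
```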

Reverse level order parsing tree


LexAna parsing order for Wikipedia example


LexAna is a bottom-up, reverse level order parser. This is the Wikipedia example of parsing a statement:

A=B+C*2;D=1

The bottom level is the result of scanning; the rest of the tree is the result of parsing.
Look at the LexAna example, with parse steps and node numbers in the comments:

"A=B+C*2;D=1"

Scanner <
w ->[A-Z] # 1-4
d ->\d+ # 5-6
a ->= # 7-8
s ->[+-] #  9
m ->[*/] # 10
; ->; # 11
>
Parser <
I ->w(?!a) # 12-15 (var)
V ->I|d # 16-19 (value)
P ->V(mV)* # 20-22 (products)
S ->P(sP)* # 23-24 (sums)
A ->waS # 25-26 (assign)
t ->A # 27-28 (stmt)
T ->t(;t)* # 29 (statements)
>
The Parser's token definitions are also ordered. An important difference from the Scanner is that replacements are multiple and non-sequential.

Postfix plus (+)

By adding the postfix plus (+) after a token's name in the parser definition, we send a message to the Parser to repeat the marked rule consecutively while replacements exist.
The rule:
 T+->regex
 is substitution for:
 T->regex
 T->regex
 T->regex
 ...
 T->regex

Question: How do we parse a string of length 2n, aaa...bbb, where a and b have the same number of occurrences?
Answer: A parser with n rules would be:

S  ->ab
S ->aSb
S ->aSb
...
S ->aSb

Abbreviate:

S  ->ab
S+ ->aSb

Or just:


S+ ->aS?b
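The + semantics is a fixed-point loop over one substitution. A sketch in Python, checked on the aaabbb example:

```python
import re

def apply_plus(tokens, name, rx):
    # T+ ->rx : repeat the substitution while replacements exist
    while True:
        new = re.sub(rx, name, tokens)
        if new == tokens:
            return tokens
        tokens = new

print(apply_plus("aaabbb", "S", r"aS?b"))  # S
```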

Problem:  Parsing types in Pascal given by grammar:

              Type ->   Simple
              Type ->   ^id
              Type ->   'array'  ' [' Simple '] '  'of' Type
              Simple -> 'integer'
              Simple -> 'char'
              Simple -> num '..' num

Input string:
"array [ 1..5 ] of array [ 1..3 ] of 2..5"
Scanner:
_->\s+
i ->integer
c ->char
a ->array
o ->of\b
r ->\d+\.\.\d+
w ->[a-zA-Z]+
. ->[.^[\]]
Parser:
1) s ->[icr]
2) S ->\[s\]
3) T+ ->s|\^w|aSoT
Result:
T, (array, (1..5), of, (array, (1..3), of, 2..5))

Analysis of the parsing steps is given in the table below:

Parsing step       Tokens after the step
Scanning           a[r]oa[r]or
1) s ->r           a[s]oa[s]os
2) S ->\[s\]       aSoaSos
3) T ->s           aSoaSoT
3) T ->aSoT        aSoT
3) T ->aSoT        T

LexAna applies parsing rules 1) and 2) only once. Parsing rule 3) applies three times, thanks to the suffix + after T in the rule definition. In fact, a rule with the suffix + is applied consecutively as many times as possible.
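These steps can be replayed with a small helper (our own emulation: a rule without + is applied once, a rule with + is repeated until the token string stops changing):

```python
import re

def apply_rule(tokens, name, rx, plus=False):
    # Apply a rule once; with plus=True, repeat while replacements exist.
    while True:
        new = re.sub(rx, name, tokens)
        if new == tokens or not plus:
            return new
        tokens = new

t = "a[r]oa[r]or"                                  # scanner output
t = apply_rule(t, "s", r"[icr]")                   # 1) s ->[icr]
t = apply_rule(t, "S", r"\[s\]")                   # 2) S ->\[s\]
t = apply_rule(t, "T", r"s|\^w|aSoT", plus=True)   # 3) T+ ->s|\^w|aSoT
print(t)  # T
```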

Problem: How to parse a nested list?

Let input string be:

a <<b>> <c <d e > f <q> g > h

If we want to parse this and other similar strings, first we need Scanner like this:

_ ->[ \t]+
i->[a-z]
. ->[<>]

Scanner’s output will be:

i<<i>><i<ii>i<i>i>i

One of many solutions for this problem is:

A+ -><(i|A)+>
S ->(A|i)+

When the Parser applies the first rule once, the result is the following token string:

i<A><iAiAi>i

Applying the first rule again, we get:

iAAi

And finally, after the last replacement, the token string becomes just the token S. The Parser automatically handles the lexemes, following the replacement steps.
Remark: All punctuation tokens in the Parser will be deleted. If you want to keep a punctuation character, assign it to some letter in the Scanner.
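The replacement steps above can be replayed with two regex substitutions in Python (a sketch of the token-string side only):

```python
import re

tokens = "i<<i>><i<ii>i<i>i>i"    # scanner output from above

# A+ -><(i|A)+> : repeat while replacements exist
while True:
    new = re.sub(r"<(i|A)+>", "A", tokens)
    if new == tokens:
        break
    tokens = new
print(tokens)  # iAAi

tokens = re.sub(r"(A|i)+", "S", tokens)   # S ->(A|i)+
print(tokens)  # S
```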

Example: Parsing a numerical expression with parentheses and operators + , -


Input string:
"-(-((1+(-(((-2))-((2-1)))+(-((+1)-(2-2))-(-(0))))-2)))"
Scanner:
F ->\d+
s  ->[+-]
.  ->[\(\)]
Parser:
F+ ->\(s?F\)|s?F(sF)+
E  ->s?F
Result:
E, (-, (-, (1, +, (-, ((-, 2), -, (2, -, 1)), +, (-, ((+, 1), -, (2, -, 2)), -, (-, 0))), -, 2)))

Example: Parsing expressions with variables, operators [+-*/] and function calls. In this example there is no operator precedence, nor parentheses to specify the order of operations.


Input string:
"n*fib(n-1,2*f(g(n,2*n-1)),0)-1"
Scanner:
f->[a-z]+(?=\()
w->[a-z]+
d->\d+
o->[/*+-]
.->[(),]
Parser:
E+->[wd]|E(oE)+|f\(E(,E)*\)
Result:
E, (n, *, (fib, (n, -, 1), (2, *, (f, (g, n, (2, *, n, -, 1)))), 0), -, 1)


Analysis of parsing steps is given in the table below:

Parsing step     Tokens after the step
Scanning         wof(wod,dof(f(w,dowod)),d)od
Alternation 1    Eof(EoE,Eof(f(E,EoEoE)),E)oE
Alternation 2    Eof(E,Eof(f(E,E)),E)oE
Alternation 3    Eof(E,Eof(E),E)oE
Alternation 3    Eof(E,EoE,E)oE
Alternation 2    Eof(E,E,E)oE
Alternation 3    EoEoE
Alternation 2    E

Example: Right associativity, produces a "right" tree

Input string:
"1:4:6:9:7"
Scanner:
d ->\d+
.  ->:
Parser:
a ->d:
R+ ->ad|aR
# Or just R+ ->d(?!:)|d:R
Result:
R, (1, (4, (6, (9, 7))))

Example: Left associativity, produces a "left" tree


Input string:
"1:4:6:9:7"
Scanner:
d ->\d+
.  ->:
Parser:
a ->:d
L+ ->da|La
# Or just L ->d(:d)*, but in that case the answer would be: L, (1, 4, 6, 9, 7)
Result:
L, ((((1, 4), 6), 9), 7)

Example: Nested blocks in JS-like programming languages

Input string:
'''
 {var ; let ; if {let; let } if{var} return }
 { if {let; return} return }
 { const; let; if {let ; if {return} let } return}
'''
Scanner:
_->[\s;]+ # Underline is a special symbol. This rule deletes all blanks and semicolons
i->if
R->return
L->let|var|const
.->[{}]
Parser:
B+->\{([LBR]|iB)+\}
Result:
BBB

.var
.let
.if
..let
..let
.if
..var
.return

.if
..let
..return
.return

.const
.let
.if
..let
..if
...return
..let
.return


Example: A small part of JS-like syntax: function definitions and if control statements

Input string:
'''
function foo(x) {
    if (x > 10) {
        if (a > 1){return a-1};
        return a * x;
    }
    return x + 10;
}
'''
Scanner:
_->[ \t]+
;->\r\n
f->function
i->if
r->return
v->var|let
w->[a-z]+
d->\d+
c-><>|<|>|<=|>=|==
e->=
o->[/*+-]
.->[(){};,]
Parser:
_->;
H->fw\((w(,w)*)?\)
E->(w|d)(o(w|d))+
C->wc(w|d|E)
R->r[ECdw]
A->vwe(E|d|w)
B+->\{([AR]|i\(C\)B)+\}
F->HB
# H - Function header, E - Simple expression, C - Condition, R - Return statement,
# A - Assign statement, B - Block, F - Function definition 
Result:
F
((function, foo, x), (if, (x, >, 10), (if, (a, >, 1), ((return, (a, -, 1))), (return, (a, *, x))), (return, (x, +, 10))))


Example: Parsing another "Tiny" language (simple expressions, functions, assign, conditions, if - else, while)

Input string:
'''
function foo(x,a) { var c := x-a;
    if (c > 10) {
        if (a > 1){return c + x - 1};
        return a * x;
    }
    return x + 10;
}
function gcd(x,y){
 while (y <> 0) {
  if (x > y) 
  {x := x - y} else  {y := y - x}
 }
return x}
}
'''
Scanner:
_->[ \t]+
;->\r\n
f->function
i->if
e->else
l->while
r->return
v->var|let|const
w->[a-z]+
d->\d+
c-><>|==|<=|>=|<|>
a->:=
o->[/*+-]
.->[(){};,]
Parser:
_->;
H->fw\((w(,w)*)?\)
E->(w|d)(o(w|d))+
C->\(wc(w|d|E)\)
R->r[Edw]
A->v?wa(E|d|w)
B+->\{([AR]|lCB|iCB(eB)?)+\}
F->HB
Result:
FF
((function, foo, x, a), ((var, c, :=, (x, -, a)), if, (c, >, 10), (if, (a, >, 1), (return, (c, +, x, -, 1)), (return, (a, *, x))), (return, (x, +, 10))))
((function, gcd, x, y), (while, (y, <>, 0), (if, (x, >, y), (x, :=, (x, -, y)), else, (y, :=, (y, -, x))), (return, x)))

Example: Parsing a JSON string

Input string:
'''
{ "id": 1,
  "Surname": [],
  "price": 100,
  "Handler": [
    "Bar",
    "Foo", {},
    "Q",1],
  "Stock": {"Market": 300, "goods": [1],
  "Retail": {"old":1}
}}
'''
Scanner:
# No scanner. If you omit it, LexAna works with its internal scanner
Parser:
# Grammar: J = { (str : J) , (str : J)*} | [ J(,J)* ] | str | num | label
_->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}
Result:
J

.id
.1
.Surname
.[]
.price
.100
.Handler
..Bar
..Foo
..{}
..Q
..1
.Stock
..Market
..300
..goods
...1
..Retail
...old
...1
What is after "Handler" and what is after "Stock"? How do we tell arrays from objects in the output? There is a way! See Pebbles in the next chapter.