LexAna - Bottom up parser based on regular expressions

1. LexAna - introduction

LexAna is bottom-up, level-order, non-sequential parser, based on regular expressions for lexical and syntax processing. Features:

Uses regexes for lexical and syntax part
Input is a text string, output is a parsing tree, something like s-expression
Suitable for simple parsing, but parses even some programming languages
A simple and intuitive approach
Easy error handling
Parser never fails, always succeeds.
No ambiguity
Search & replace parsing technique
Automatically builds a syntax tree
Reverse level order tree building
Constant tree transformation in syntax analysis
No need for a grammar
Simple syntax, simple meaning: T ->R, (T - token char, R - regex expression), replaces regex match with T
Easy to control parsing process

Lexana deals with:

Programming languages (Basic, Logo, Tiny, PL/0 ...)
Formats (HTML, JSON, XML, INI ...)
Regular expressions parser, complex expressions with operator precedence ...
Roman numerals, as advanced split function ...

LexAna example of building a tree using bracket expressions.
Notice - every token is one character long

How it works?
1. The first regex \d+ succed and Scanner fills out the first row in Token table. Input string is now three characters shorter.

2. The first and the second regex failed,but the third regex [(),] succed and Scanner fills out another row in Token's table.

...

At the end, input string is empty. Tokenization is successful. The output is Parse tree (tokens + lexemes).

LexAna Scanner priority
The rule order is important

Let see more real example:

Scanner<

n ->\d+

w ->[a-zA-Z]+

c ->=|==|!=|>|<|<>

k ->if|then|else|return|let

g ->goto

s->[ ]+

Let see some scanning:

Input string	Output (Parse tree array)	Remark
a=5	(wcn, a, =, 5)	OK
let cnt!=3	(wswcn, let, , cnt,! =, 3)	No k token for let, but w token for identifiers? Something going wrong!
x<>y	(wccw, x, <, >, y)	In alternation list for token c, chars > and < have a higher priority than <>
c==0	(wccn, x, =, =, y)	Again the same mistake.
if x!=y goto 100	(wswcwswsn, if, , x, !=, y, , goto, ,100)	No keyword tokens for if and goto statements. Tokens k and g are unreachable in this rule order. The definition of token w comes first.

The rule order is important. The alternation order also. Proper definition would be:

Scanner<

n ->\d+

c ->==|=|!=|<>|<|>

k ->if|then|else|return|let

g ->goto

w ->[a-zA-Z]+

s->[ ]+

Input string	Output (an array)	Remark
a=5	(wcn, a, =, 5)	OK
let cnt!=3	(kswcn, let, ,cnt,! =, 3)	OK
x<>y	(wcw, x, <>, y)	OK
c==0	(wcn, x, ==, y)	OK
if x!=y goto 100	(kswcwsgsn, if, , x, !=, y, , goto, ,100)	OK

LexAna Scanner features:

• Lexana Scanner is a lexical analyzer, sequentially consuming an input string, producing a token (and lexeme) in every step

• Input is a string, the output is Parse tree, i.e. the array (t, l,l, ...) t – string of tokens, l – string or array of strings (something like lisp list i.e. s-expression)

• Scanner is defined by a list of rules: T->R, T – Token character, R – Regex expression

• Token’s definitions (rules) are ordered (the first has high priority)

• Token is one character long (letter, number or special symbol)

• Lexeme (token content) is a string or an array of strings, not an array of one string

• Token’s name (letter or not) is significant to Parser

• If a part of the text is not readable by Scanner, it does not causes stopping. Scanner forwarded it to Parser

• In the output array, the length of token’s string is equal to number of lexemes (Parse tree integrity)

• No zero–length match is allowed, no empty regexes, ^ is useless in Scanner. Symbols $, /b,/B and lookahead work fine with other characters

• Every token has a value (lexeme). There is no empty lexeme

• Symbols _ (underline) and . (dot) are reserved token names. Underline is for ignoring. Dot means: lexeme = token, producing terminal symbols

• If the Scanner table is omitted, LexAna BU uses its intrinsic Scanner definition

• One token can have more than one definition.

• Without capturing groups in regex, lexeme is a regex match

• Branch reset. It’s possible to provoke Scanner to produce two tokens at once, T0, T1, ...T9

LexAna parsing example

Input string:

John 100; Bill 150; Ann 80;

Scanner:

i →[a-zA-Z]+

n →\d+

. →[;]

Token names i and n are one char long (important)

Output token string (after scanning):

in;in;in;

Parsing tree (after scanning):

John, 100, ;, Bill, 150, ;, Ann, 80, ;

Parsing tree is an array with the same length as token string

Parser:

P →in;

Q →P+

Parser and scanner have similar format, but slightly different semantics

After the first parsing rule:

Tokens: PPP

Parsing tree:

(John, 100),(Bill, 150),(Ann, 80)

Parser automatically wraps data and replaces all groups of tokens in; with token P

After the second parsing rule:

Token: Q

Parsing tree:

((John, 100),(Bill, 150),(Ann, 80))

Parsing is done, there was only one token left and length of array is also one

LexAna Parser replaces one or more groups of tokens defined by regex and wraps their contents

Parser reduces number of tokens in every step and deletes punctuations

Reverse level order parsing tree

LexAna parsing order for Wikipedia example

LexAna is bottom-up, reverse level order parser. This is the Wikipedia example of parsing an statement:

A=B+C*2;D=1

Bottom level is result of scanning, the rest of tree is result of parsing.

Look at the LexAna example, with parse steps and nodes in comments:

"A=B+C*2;D=1"

Scanner <

w ->[A-Z] # 1-4

d ->\d+ # 5-6

a ->= # 7-8

s ->[+-] # 9

m ->[*/] # 10

; ->; # 11

Parser <

I ->w(?!a) # 12-15 (var)

V ->I|d # 16-19 (value)

P ->V(mV)* # 20-22 (products)

S ->P(sP)* # 23-24 (sums)

A ->waS # 25-26 (assign)

t ->A # 27-28 (stmt)

T ->t(;t)* # 29 (statements)

Token’s definitions of Parser are also ordered. But important difference from Scanner is that replacements are multiple and nonsequential.

Postfix plus (+)

Adding the postfix plus (+) after token’s name in parser definition, we send a message to Parser that consecutively repeats marked rule, while replacements exists.
The rule:

T+->regex

is substitution for:

T->regex

...

T->regex

Question: How to parse a string of length 2n, aaa...bbb, where a and b have the same number of occurrences?

Answer: Parser with n rules would be:

S ->ab

S ->aSb

...

S ->aSb

Abbreviate:

S ->ab

S+ ->aSb

Or just:

S+ ->aS?b

Problem: Parsing types in Pascal given by grammar:

Type -> Simple

Type -> ^id

Type -> 'array' ' [' Simple '] ' 'of' Type

Simple -> 'integer'

Simple -> 'char'

Simple -> num '..' num

Input string:

"array [ 1..5 ] of array [ 1..3 ] of 2..5"

Scanner:

_->\s+

i ->integer

c ->char

a ->array

o ->of\b

r ->\d+\.\.\d+

w ->[a-zA-Z]+

. ->[.^[\]]

Parser:

1) s ->[icr]

2) S ->\[s\]

3) T+ ->s|\^w|aSoT

Result:

T, (array, (1..5), of, (array, (1..3), of, 2..5))

Analysis of parsing steps is given in the table below:

Parsing steps:	Tokens after parsing step
Scanning	a[r]oa[r]or
1) s->r	a[s]oa[s]os
2) S ->\[s\]	aSoaSos
3) T ->s	aSoaSoT
3) T ->aSoT	aSoT
3) T ->aSoT	T

Parsing rules 1) and 2) LexAna applies only once. Parsing rule 3) applies three times, thanks to suffix + after T in rule definition. In fact, a rule with suffix + is realized consecutively as many times as it is possible.

Problem: How to parse a nested list?

Let input string be:

a <<b>> <c <d e > f <q> g > h

If we want to parse this and other similar strings, first we need Scanner like this:

_ ->[ \t]+

i->[a-z]

. ->[<>]

Scanner’s output will be:

i<<i>><i<ii>i<i>i>i

One of many solutions for this problem is:

A+ -><(i|A)+>

S ->(A|i)+

When Parser does the first rule once, result is next token string:

i<A><iAiAi>i

Applying the first rule again, we get:

iAAi

And finally, after the last replacement, token string becomes just token S. Parser automatically handles with lexemes, following replacement steps.
Remark: All punctuations in Parser will be deleted. If you want to save a punctuation, assign it to some letter in Scanner.

Example: Parsing a numerical expression with parentheses and operators + , -

Input string:
"-(-((1+(-(((-2))-((2-1)))+(-((+1)-(2-2))-(-(0))))-2)))"

Scanner:

F ->\d+
s ->[+-]
. ->[]

Parser:

F+ ->$s?F$|s?F(sF)+
E ->s?F

Result:

E, (-, (-, (1, +, (-, ((-, 2), -, (2, -, 1)), +, (-, ((+, 1), -, (2, -, 2)), -, (-, 0))), -, 2)))

Example: Parsing expressions with variables, operators [+-*/] and function calls. In this example there are no operator precedence, neither parentheses to specify the order of operations)

Input string:
"n*fib(n-1,2*f(g(n,2*n-1)),0)-1"

Scanner:

f->[a-z]+(?=\()
w->[a-z]+
d->\d+
o->[/*+-]
.->[(),]

Parser:

E+->[wd]|E(oE)+|f$E(,E)*$

Result:

E, (n, *, (fib, (n, -, 1), (2, *, (f, (g, n, (2, *, n, -, 1)))), 0), -, 1)

Analysis of parsing steps is given in the table below:

Parsing steps:	Tokens after parsing step
Scanning	wof(wod,dof(f(w,dowod)),d)od
Alternation 1	Eof(EoE,Eof(f(E,EoEoE)),E)oE
Alternation 2	Eof(E,Eof(f(E,E)),E)oE
Alternation 3	Eof(E,Eof(E),E)oE
Alternation 3	Eof(E,EoE,E)oE
Alternation 2	Eof(E,E,E)oE
Alternation 3	EoEoE
Alternation 2	E

Example: Right associativity, produces a "right" tree

Input string:

"1:4:6:9:7"

Scanner:

d ->\d+
. ->:

Parser:

a ->d:
R+ ->ad|aR
# Or just R+ ->d(?!:)|d:R

Result:
R, (1, (4, (6, (9, 7))))

Example: Left associativity, produces a "left" tree

Input string:

"1:4:6:9:7"

Scanner:

d ->\d+
. ->:

Parser:
a ->:d

L+ ->da|La
# Or just L ->d(:d)* but in that case answer would be: L, (1, 4, 6, 9, 7)

Result:
L, ((((1, 4), 6), 9), 7)

Example: Nested blocks in JS - like programming languages

Input string:

'''
{var ; let ; if {let; let } if{var} return }
{ if {let; return} return }
{ const; let; if {let ; if {return} let } return}
'''

Scanner:

_->[\s;]+ # Underline is special symbol. This rule deletes all blanks and semicolons
i->if
R->return
L->let|var|const
.->[{}]

Parser:
B+->\{([LBR]|iB)+\}

Result:
BBB

.var
.let
.if
..let
..let
.if
..var
.return

.if
..let
..return
.return

.const
.let
.if
..let
..if
...return
..let
.return

Example: Small part of JS - like syntax, function definitions and if control statements

Input string:

'''
function foo(x) {
if (x > 10) {
if (a > 1){return a-1};
return a * x;
}
return x + 10;
}
'''

Scanner:

_->[ \t]+
;->\r\n
f->function
i->if
r->return
v->var|let
w->[a-z]+
d->\d+
c-><>|<|>|<=|>=|==
e->=
o->[/*+-]
.->[(){};,]

Parser:
_->;
H->fw$(w(,w)*)?$
E->(w|d)(o(w|d))+
C->wc(w|d|E)
R->r[ECdw]
A->vwe(E|d|w)
B+->\{([AR]|i$C$B)+\}
F->HB
# H - Function header, E - Simple expression, C - Condition, R - Return statement,
# A - Assign statement, B - Block, F - Function definition

Result:
F
((function, foo, x), (if, (x, >, 10), (if, (a, >, 1), ((return, (a, -, 1))), (return, (a, *, x))), (return, (x, +, 10))))

Example: Parsing another "Tiny" language (simple expressions, functions, assign, conditions, if - else, while)

Input string:

'''
function foo(x,a) { var c := x-a;
if (c > 10) {
if (a > 1){return c + x - 1};
return a * x;
}
return x + 10;
}
function gcd(x,y){
while (y <> 0) {
if (x > y)
{x := x - y} else {y := y - x}
}
return x}
}
'''

Scanner:

_->[ \t]+
;->\r\n
f->function
i->if
e->else
l->while
r->return
v->var|let|const
w->[a-z]+
d->\d+
c-><>|==|<=|>=|<|>
a->:=
o->[/*+-]
.->[(){};,]

Parser:
_->;
H->fw$(w(,w)*)?$
E->(w|d)(o(w|d))+
C->$wc(w|d|E)$
R->r[Edw]
A->v?wa(E|d|w)
B+->\{([AR]|lCB|iCB(eB)?)+\}
F->HB

Result:
FF
((function, foo, x, a), ((var, c, :=, (x, -, a)), if, (c, >, 10), (if, (a, >, 1), (return, (c, +, x, -, 1)), (return, (a, *, x))), (return, (x, +, 10))))
((function, gcd, x, y), (while, (y, <>, 0), (if, (x, >, y), (x, :=, (x, -, y)), else, (y, :=, (y, -, x))), (return, x)))

Example: Parsing a JSON string

Input string:

'''
{ "id": 1,
"Surname": [],
"price": 100,
"Handler": [
"Bar",
"Foo", {},
"Q",1],
"Stock": {"Market": 300, "goods": [1],
"Retail": {"old":1}
}}
'''

Scanner:

# No scanner. If you avoid it, Lexana works with it's internal scanner

Parser:
# Grammar: J = { (str : J) , (str : J)*} | [ J(,J)* ] | str | num | label
_->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}

Result:
J

.id
.1
.Surname
.[]
.price
.100
.Handler
..Bar
..Foo
..{}
..Q
..1
.Stock
..Market
..300
..goods
...1
..Retail
...old
...1

What is after "Handler" and whit is after "Stock"? How to recognize arrays from objects in output? There is a way! See Pebbles in the next chapter.

LexAna - Bottom up parser based on regular expressions

Thursday, May 21, 2020

1. LexAna - introduction

LexAna Scanner priority
The rule order is important

LexAna Scanner features:

LexAna parsing example

Reverse level order parsing tree

Postfix plus (+)

1 Comments:

Post a Comment

About Me

Previous Posts

LexAna - Bottom up parser based on regular expressions

Thursday, May 21, 2020

1. LexAna - introduction

LexAna Scanner priority The rule order is important

LexAna Scanner features:

LexAna parsing example

Reverse level order parsing tree

Postfix plus (+)

1 Comments:

Post a Comment

About Me

Previous Posts

LexAna Scanner priority
The rule order is important