Thursday, May 21, 2020


1. LexAna - introduction



LexAna is a bottom-up, level-order, non-sequential parser based on regular expressions for both lexical and syntax processing. Features:

  • Uses regexes for both the lexical and the syntax part
  • Input is a text string; output is a parsing tree, something like an s-expression
  • Suitable for simple parsing, but parses even some programming languages
  • A simple and intuitive approach
  • Easy error handling
  • The parser never fails; it always succeeds
  • No ambiguity
  • Search & replace parsing technique
  • Automatically builds a syntax tree
  • Reverse level-order tree building
  • Constant tree transformation during syntax analysis
  • No need for a grammar
  • Simple syntax, simple meaning: T ->R (T - token char, R - regex expression) replaces the regex match with T
  • Easy to control the parsing process
LexAna deals with:

  • Programming languages (Basic, Logo, Tiny, PL/0 ...)
  • Formats (HTML, JSON, XML, INI ...)
  • A regular-expression parser, complex expressions with operator precedence ...
  • Roman numerals, as an advanced split function ...




A LexAna example of building a tree using bracket expressions.
Notice: every token is one character long.



How does it work?
1. The first regex \d+ succeeds and the Scanner fills out the first row in the token table. The input string is now three characters shorter.

2. The first and the second regex fail, but the third regex [(),] succeeds and the Scanner fills out another row in the token table.

...

At the end, the input string is empty. Tokenization is successful. The output is the parse tree (tokens + lexemes).
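The steps above can be sketched as a small Python loop (an assumed model of the Scanner, not LexAna's actual implementation; the token name p for punctuation is chosen here for illustration):

```python
import re

# Minimal sketch of the scanning loop (assumed semantics):
# try each rule in order at the start of the remaining input; the first
# regex that matches emits one token and its lexeme, and the input
# shrinks by the length of the match.
def scan(rules, text):
    tokens, lexemes = "", []
    while text:
        for token, pattern in rules:
            m = re.match(pattern, text)
            if m:
                tokens += token
                lexemes.append(m.group())
                text = text[m.end():]
                break
        else:
            raise ValueError("no rule matches: " + text)
    return tokens, lexemes

# number rule \d+ and punctuation rule [(),], as in steps 1 and 2
print(scan([("n", r"\d+"), ("p", r"[(),]")], "(1,23)"))
```

When the loop ends, the input string is empty and the pair (tokens, lexemes) is the parse tree described above.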


LexAna Scanner priority
The rule order is important

Let's look at a more realistic example:

Scanner<
n  ->\d+
w ->[a-zA-Z]+
c ->=|==|!=|>|<|<>
k ->if|then|else|return|let
g ->goto
s->[ ]+
> 

Let's look at some scanning:

Input:  a=5
Output: (wcn, a, =, 5)
Remark: OK

Input:  let cnt!=3
Output: (wswcn, let,  , cnt, !=, 3)
Remark: No k token for let, only the w token for identifiers - something is going wrong!

Input:  x<>y
Output: (wccw, x, <, >, y)
Remark: In the alternation list for token c, the characters < and > come before <>, so they match first.

Input:  c==0
Output: (wccn, c, =, =, 0)
Remark: The same mistake again: = comes before == in the alternation.

Input:  if x!=y goto 100
Output: (wswcwswsn, if,  , x, !=, y,  , goto,  , 100)
Remark: No keyword tokens for the if and goto statements. Tokens k and g are unreachable in this rule order, because the identifier rule w comes first.

The rule order is important, and so is the alternation order. A proper definition would be:

Scanner<
n  ->\d+
c ->==|=|!=|<>|<|>
k ->if|then|else|return|let
g ->goto
w ->[a-zA-Z]+
s->[ ]+
> 


Input:  a=5
Output: (wcn, a, =, 5)
Remark: OK

Input:  let cnt!=3
Output: (kswcn, let,  , cnt, !=, 3)
Remark: OK

Input:  x<>y
Output: (wcw, x, <>, y)
Remark: OK

Input:  c==0
Output: (wcn, c, ==, 0)
Remark: OK

Input:  if x!=y goto 100
Output: (kswcwsgsn, if,  , x, !=, y,  , goto,  , 100)
Remark: OK
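The corrected behavior can be checked with a small Python sketch (first-match-wins rule semantics are assumed here, not taken from LexAna's source):

```python
import re

# Rules in the corrected order: operators before identifiers,
# longer alternatives before their prefixes.
RULES = [
    ("n", r"\d+"),
    ("c", r"==|=|!=|<>|<|>"),
    ("k", r"if|then|else|return|let"),
    ("g", r"goto"),
    ("w", r"[a-zA-Z]+"),
    ("s", r"[ ]+"),
]

def scan(rules, text):
    # the first rule whose regex matches at the current position wins
    tokens = ""
    while text:
        for token, pattern in rules:
            m = re.match(pattern, text)
            if m:
                tokens += token
                text = text[m.end():]
                break
        else:
            raise ValueError("unreadable: " + text)
    return tokens

print(scan(RULES, "if x!=y goto 100"))  # -> kswcwsgsn
```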

LexAna Scanner features:


The LexAna Scanner is a lexical analyzer that sequentially consumes an input string, producing a token (and lexeme) at every step.

  • Input is a string; the output is a parse tree, i.e. the array (t, l, l, ...), where t is a string of tokens and each l is a string or an array of strings (something like a Lisp list, i.e. an s-expression)
  • The Scanner is defined by a list of rules T->R, where T is a token character and R is a regex expression
  • Token definitions (rules) are ordered; the first has the highest priority
  • A token is one character long (a letter, digit or special symbol)
  • A lexeme (token content) is a string or an array of strings, not an array of one string
  • The token's name (letter or not) is significant to the Parser
  • If a part of the text is not readable by the Scanner, it does not cause a stop; the Scanner forwards it to the Parser
  • In the output array, the length of the token string equals the number of lexemes (parse tree integrity)
  • No zero-length match is allowed and no empty regexes; ^ is useless in the Scanner. The symbols $, \b, \B and lookahead work fine with other characters
  • Every token has a value (lexeme); there is no empty lexeme
  • The symbols _ (underline) and . (dot) are reserved token names. Underline means ignore; dot means lexeme = token, producing terminal symbols
  • If the Scanner table is omitted, LexAna BU uses its intrinsic Scanner definition
  • One token can have more than one definition
  • Without capturing groups in the regex, the lexeme is the whole regex match
  • Branch reset: it is possible to make the Scanner produce two tokens at once, T0, T1, ... T9


   LexAna parsing example


Input string:

John 100; Bill 150; Ann 80;


 Scanner:
i ->[a-zA-Z]+
n ->\d+
. ->[;]
Token names i and n are one character long (important).


Output token string (after scanning):

in;in;in;



Parsing tree (after scanning):

John, 100, ;, Bill, 150, ;, Ann, 80, ;

The parsing tree is an array of the same length as the token string.


Parser:
P in;
Q P+
The Parser and the Scanner have a similar format, but slightly different semantics.


After the first parsing rule:

Tokens: PPP

Parsing tree:
(John, 100),(Bill, 150),(Ann, 80)

The Parser automatically wraps the data, replacing every group of tokens  in;  with the token  P.


After the second parsing rule:

Token: Q
    
Parsing tree:
((John, 100),(Bill, 150),(Ann, 80))

Parsing is done: only one token is left, and the length of the array is also one.



The LexAna Parser replaces one or more groups of tokens defined by a regex and wraps their contents.


The Parser reduces the number of tokens in every step and deletes punctuation.
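A single parser step can be sketched as search-and-replace over the token string, wrapping the lexemes of every match into a sublist (an assumed model, not LexAna's code; punctuation lexemes, whose token name is not a letter or digit, are dropped during wrapping, as described above):

```python
import re

# Sketch of one parser step (assumed semantics):
# each regex match over the token string is replaced by the rule's token;
# the matched lexemes are wrapped into a sublist, and lexemes belonging
# to punctuation tokens are deleted.
def parse_step(token, pattern, tokens, lexemes):
    out_t, out_l, pos = "", [], 0
    for m in re.finditer(pattern, tokens):
        out_t += tokens[pos:m.start()] + token
        out_l += lexemes[pos:m.start()]
        out_l.append([lexemes[i] for i in range(m.start(), m.end())
                      if tokens[i].isalnum()])
        pos = m.end()
    return out_t + tokens[pos:], out_l + lexemes[pos:]

tokens  = "in;in;in;"
lexemes = ["John", "100", ";", "Bill", "150", ";", "Ann", "80", ";"]
print(parse_step("P", "in;", tokens, lexemes))
```

Applying the rule P ->in; yields ("PPP", [["John", "100"], ["Bill", "150"], ["Ann", "80"]]); applying Q ->P+ to that result leaves the single token Q.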

Reverse level order parsing tree


LexAna parsing order for Wikipedia example


LexAna is a bottom-up, reverse level-order parser. This is the Wikipedia example of parsing a statement:

A=B+C*2;D=1

The bottom level is the result of scanning; the rest of the tree is the result of parsing.
Look at the LexAna example, with parse steps and nodes in the comments:

"A=B+C*2;D=1"

Scanner <
w ->[A-Z] # 1-4
d ->\d+ # 5-6
a ->= # 7-8
s ->[+-] #  9
m ->[*/] # 10
; ->; # 11
>
Parser <
I ->w(?!a) # 12-15 (var)
V ->I|d # 16-19 (value)
P ->V(mV)* # 20-22 (products)
S ->P(sP)* # 23-24 (sums)
A ->waS # 25-26 (assign)
t ->A # 27-28 (stmt)
T ->t(;t)* # 29 (statements)
>
The token definitions of the Parser are also ordered. An important difference from the Scanner, however, is that replacements are multiple and non-sequential.

Postfix plus (+)

By adding the postfix plus (+) after a token's name in the parser definition, we tell the Parser to repeat the marked rule consecutively, as long as replacements exist.
The rule:
 T+->regex
 is substitution for:
 T->regex
 T->regex
 T->regex
 ...
 T->regex

Question: How do we parse a string of length 2n, aaa...bbb, where a and b have the same number of occurrences?
Answer: A parser with n rules would be:

S  ->ab
S ->aSb
S ->aSb
...
S ->aSb

Abbreviated:

S  ->ab
S+ ->aSb

Or just:


S+ ->aS?b
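The effect of the postfix plus can be simulated on the token string alone with repeated substitution (a sketch under assumed semantics; lexeme handling is omitted):

```python
import re

# Reapply the rule while a replacement still changes the token string.
def apply_plus(token, pattern, tokens):
    while True:
        new = re.sub(pattern, token, tokens)
        if new == tokens:
            return tokens
        tokens = new

print(apply_plus("S", r"aS?b", "aaabbb"))  # -> S
```

On a balanced string the loop ends with the single token S; on an unbalanced one such as aab it stops at aS, and the leftover tokens signal the error without the parser ever failing.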

Problem: Parsing types in Pascal, given by the grammar:

              Type ->   Simple
              Type ->   ^id
              Type ->   'array'  ' [' Simple '] '  'of' Type
              Simple -> 'integer'
              Simple -> 'char'
              Simple -> num '..' num

Input string:
"array [ 1..5 ] of array [ 1..3 ] of 2..5"
Scanner:
_->\s+
i ->integer
c ->char
a ->array
o ->of\b
r ->\d+\.\.\d+
w ->[a-zA-Z]+
. ->[.^[\]]
Parser:
1) s ->[icr]
2) S ->\[s\]
3) T+ ->s|\^w|aSoT
Result:
T, (array, (1..5), of, (array, (1..3), of, 2..5))

An analysis of the parsing steps is given in the table below:

Parsing step         Tokens after parsing step
Scanning             a[r]oa[r]or
1) s ->r             a[s]oa[s]os
2) S ->\[s\]         aSoaSos
3) T ->s             aSoaSoT
3) T ->aSoT          aSoT
3) T ->aSoT          T

LexAna applies parsing rules 1) and 2) only once. Parsing rule 3) is applied three times, thanks to the suffix + after T in the rule definition. In fact, a rule with the suffix + is applied consecutively as many times as possible.
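The three rules can be replayed on the token string with a small driver (a sketch with assumed semantics; plain rules run once, plus-rules repeat while they still change anything, and lexemes are omitted):

```python
import re

# Apply each rule in order; a rule marked plus=True is reapplied
# until it no longer changes the token string.
def run(rules, tokens):
    for name, pattern, plus in rules:
        while True:
            new = re.sub(pattern, name, tokens)
            changed = new != tokens
            tokens = new
            if not plus or not changed:
                break
    return tokens

print(run([("s", r"[icr]", False),
           ("S", r"\[s\]", False),
           ("T", r"s|\^w|aSoT", True)],
          "a[r]oa[r]or"))  # -> T
```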

Problem: How to parse a nested list?

Let input string be:

a <<b>> <c <d e > f <q> g > h

If we want to parse this and other similar strings, we first need a Scanner like this:

_ ->[ \t]+
i->[a-z]
. ->[<>]

Scanner’s output will be:

i<<i>><i<ii>i<i>i>i

One of many solutions for this problem is:

A+ -><(i|A)+>
S ->(A|i)+

When the Parser applies the first rule once, the result is the following token string:

i<A><iAiAi>i

Applying the first rule again, we get:

iAAi

And finally, after the last replacement, the token string becomes just the token S. The Parser automatically handles the lexemes, following the replacement steps.
Remark: All punctuation in the Parser will be deleted. If you want to keep a punctuation symbol, assign it to some letter in the Scanner.
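The replacement steps above can be replayed on the token string alone (a sketch under assumed semantics; the lexeme bookkeeping is omitted):

```python
import re

# Reapply the plus-rule while it still changes the token string.
def apply_plus(token, pattern, tokens):
    while True:
        new = re.sub(pattern, token, tokens)
        if new == tokens:
            return tokens
        tokens = new

s = apply_plus("A", r"<(i|A)+>", "i<<i>><i<ii>i<i>i>i")
print(s)                          # -> iAAi
print(re.sub(r"(A|i)+", "S", s))  # -> S
```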

Example: Parsing a numerical expression with parentheses and operators + , -


Input string:
"-(-((1+(-(((-2))-((2-1)))+(-((+1)-(2-2))-(-(0))))-2)))"
Scanner:
F ->\d+
s  ->[+-]
.  ->[\(\)]
Parser:
F+ ->\(s?F\)|s?F(sF)+
E  ->s?F
Result:
E, (-, (-, (1, +, (-, ((-, 2), -, (2, -, 1)), +, (-, ((+, 1), -, (2, -, 2)), -, (-, 0))), -, 2)))

Example: Parsing expressions with variables, the operators [+-*/] and function calls. In this example there is no operator precedence, nor are there parentheses to specify the order of operations.


Input string:
"n*fib(n-1,2*f(g(n,2*n-1)),0)-1"
Scanner:
f->[a-z]+(?=\()
w->[a-z]+
d->\d+
o->[/*+-]
.->[(),]
Parser:
E+->[wd]|E(oE)+|f\(E(,E)*\)
Result:
E, (n, *, (fib, (n, -, 1), (2, *, (f, (g, n, (2, *, n, -, 1)))), 0), -, 1)


An analysis of the parsing steps is given in the table below:

Parsing step         Tokens after parsing step
Scanning             wof(wod,dof(f(w,dowod)),d)od
Alternation 1        Eof(EoE,Eof(f(E,EoEoE)),E)oE
Alternation 2        Eof(E,Eof(f(E,E)),E)oE
Alternation 3        Eof(E,Eof(E),E)oE
Alternation 3        Eof(E,EoE,E)oE
Alternation 2        Eof(E,E,E)oE
Alternation 3        EoEoE
Alternation 2        E
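The whole table can be reproduced by running the single plus-rule to a fixed point (a sketch with assumed semantics; lexemes are omitted):

```python
import re

# Reapply the rule E+ while it still changes the token string.
def apply_plus(token, pattern, tokens):
    while True:
        new = re.sub(pattern, token, tokens)
        if new == tokens:
            return tokens
        tokens = new

print(apply_plus("E", r"[wd]|E(oE)+|f\(E(,E)*\)",
                 "wof(wod,dof(f(w,dowod)),d)od"))  # -> E
```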

Example: Right associativity, produces a "right" tree

Input string:
"1:4:6:9:7"
Scanner:
d ->\d+
.  ->:
Parser:
a ->d:
R+ ->ad|aR
# Or just R+ ->d(?!:)|d:R
Result:
R, (1, (4, (6, (9, 7))))

Example: Left associativity, produces a "left" tree


Input string:
"1:4:6:9:7"
Scanner:
d ->\d+
.  ->:
Parser:
a ->:d
L+ ->da|La
# Or just L ->d(:d)*, but in that case the answer would be: L, (1, 4, 6, 9, 7)
Result:
L, ((((1, 4), 6), 9), 7)
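Both shapes can be checked on the token string d:d:d:d:d produced by the scanner (a sketch under assumed semantics; lexeme nesting is omitted, and the helper rule a is applied once with a plain substitution):

```python
import re

# Reapply a plus-rule while it still changes the token string.
def apply_plus(token, pattern, tokens):
    while True:
        new = re.sub(pattern, token, tokens)
        if new == tokens:
            return tokens
        tokens = new

# right tree: a ->d:  then  R+ ->ad|aR
right = apply_plus("R", r"ad|aR", re.sub(r"d:", "a", "d:d:d:d:d"))
# left tree:  a ->:d  then  L+ ->da|La
left = apply_plus("L", r"da|La", re.sub(r":d", "a", "d:d:d:d:d"))
print(right, left)  # -> R L
```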

Example: Nested blocks in JS-like programming languages

Input string:
'''
 {var ; let ; if {let; let } if{var} return }
 { if {let; return} return }
 { const; let; if {let ; if {return} let } return}
'''
Scanner:
_->[\s;]+ # Underline is a special symbol. This rule deletes all blanks and semicolons
i->if
R->return
L->let|var|const
.->[{}]
Parser:
B+->\{([LBR]|iB)+\}
Result:
BBB

.var
.let
.if
..let
..let
.if
..var
.return

.if
..let
..return
.return

.const
.let
.if
..let
..if
...return
..let
.return


Example: A small part of JS-like syntax, function definitions and if control statements

Input string:
'''
function foo(x) {
    if (x > 10) {
        if (a > 1){return a-1};
        return a * x;
    }
    return x + 10;
}
'''
Scanner:
_->[ \t]+
;->\r\n
f->function
i->if
r->return
v->var|let
w->[a-z]+
d->\d+
c-><>|<|>|<=|>=|==
e->=
o->[/*+-]
.->[(){};,]
Parser:
_->;
H->fw\((w(,w)*)?\)
E->(w|d)(o(w|d))+
C->wc(w|d|E)
R->r[ECdw]
A->vwe(E|d|w)
B+->\{([AR]|i\(C\)B)+\}
F->HB
# H - Function header, E - Simple expression, C - Condition, R - Return statement,
# A - Assign statement, B - Block, F - Function definition 
Result:
F
((function, foo, x), (if, (x, >, 10), (if, (a, >, 1), ((return, (a, -, 1))), (return, (a, *, x))), (return, (x, +, 10))))


Example: Parsing another "Tiny" language (simple expressions, functions, assign, conditions, if - else, while)

Input string:
'''
function foo(x,a) { var c := x-a;
    if (c > 10) {
        if (a > 1){return c + x - 1};
        return a * x;
    }
    return x + 10;
}
function gcd(x,y){
 while (y <> 0) {
  if (x > y) 
  {x := x - y} else  {y := y - x}
 }
return x}
'''
Scanner:
_->[ \t]+
;->\r\n
f->function
i->if
e->else
l->while
r->return
v->var|let|const
w->[a-z]+
d->\d+
c-><>|==|<=|>=|<|>
a->:=
o->[/*+-]
.->[(){};,]
Parser:
_->;
H->fw\((w(,w)*)?\)
E->(w|d)(o(w|d))+
C->\(wc(w|d|E)\)
R->r[Edw]
A->v?wa(E|d|w)
B+->\{([AR]|lCB|iCB(eB)?)+\}
F->HB
Result:
FF
((function, foo, x, a), ((var, c, :=, (x, -, a)), if, (c, >, 10), (if, (a, >, 1), (return, (c, +, x, -, 1)), (return, (a, *, x))), (return, (x, +, 10))))
((function, gcd, x, y), (while, (y, <>, 0), (if, (x, >, y), (x, :=, (x, -, y)), else, (y, :=, (y, -, x))), (return, x)))

Example: Parsing a JSON string

Input string:
'''
{ "id": 1,
  "Surname": [],
  "price": 100,
  "Handler": [
    "Bar",
    "Foo", {},
    "Q",1],
  "Stock": {"Market": 300, "goods": [1],
  "Retail": {"old":1}
}}
'''
Scanner:
# No scanner. If you omit it, LexAna works with its internal scanner
Parser:
# Grammar: J = { (str : J) , (str : J)*} | [ J(,J)* ] | str | num | label
_->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}
Result:
J

.id
.1
.Surname
.[]
.price
.100
.Handler
..Bar
..Foo
..{}
..Q
..1
.Stock
..Market
..300
..goods
...1
..Retail
...old
...1
What is after "Handler" and what is after "Stock"? How can arrays be distinguished from objects in the output? There is a way! See Pebbles in the next chapter.