1. LexAna - introduction
LexAna is a bottom-up, level-order, non-sequential parser based on regular expressions for both lexical and syntax processing. Features:
- Uses regexes for lexical and syntax part
- Input is a text string, output is a parsing tree, something like s-expression
- Suitable for simple parsing, but it can parse even some programming languages
- A simple and intuitive approach
- Easy error handling
- The parser never fails, it always succeeds
- No ambiguity
- Search & replace parsing technique
- Automatically builds a syntax tree
- Reverse level order tree building
- Constant tree transformation in syntax analysis
- No need for a grammar
- Simple syntax, simple meaning: T ->R (T is a token char, R a regex expression) replaces the regex match with T
- Easy to control the parsing process
- Applications:
  - Programming languages (Basic, Logo, Tiny, PL/0 ...)
  - Formats (HTML, JSON, XML, INI ...)
  - A regular expressions parser, complex expressions with operator precedence ...
  - Roman numerals, an advanced split function ...
LexAna example of building a tree using bracket expressions.
Notice - every token is one character long
How does it work?
1. The first regex \d+ succeeds and the Scanner fills out the first row in the Token table. The input string is now three characters shorter.
2. The first and the second regex fail, but the third regex [(),] succeeds and the Scanner fills out another row in the Token table.
...
At the end, the input string is empty. Tokenization is successful. The output is the Parse tree (tokens + lexemes).
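The scanning loop described above can be sketched in a few lines of Python. This is a hypothetical illustration, not LexAna's actual code; the rule list is an assumption chosen to match the bracket-expression example (n for numbers, dot for terminal punctuation):

```python
import re

# Hypothetical Scanner rules: (token char, regex), in priority order
RULES = [("n", r"\d+"), (".", r"[(),]")]

def scan(text):
    tokens, lexemes = "", []
    while text:
        for tok, rx in RULES:
            m = re.match(rx, text)          # try the rules in order at the current position
            if m:
                lexeme = m.group()
                # dot rule: lexeme = token, i.e. the matched char becomes the token
                tokens += lexeme if tok == "." else tok
                lexemes.append(lexeme)
                text = text[len(lexeme):]   # the input string gets shorter
                break
        else:
            raise ValueError("unreadable input: " + text)
    return tokens, lexemes

print(scan("12(3,45)"))  # → ('n(n,n)', ['12', '(', '3', ',', '45', ')'])
```

Each loop iteration consumes one lexeme from the front of the string, exactly as in the step-by-step description above.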
LexAna Scanner priority
The rule order is important
Let's look at a more realistic example:
Scanner<
n ->\d+
w ->[a-zA-Z]+
c ->=|==|!=|>|<|<>
k ->if|then|else|return|let
g ->goto
s ->[ ]+
>
Let's see some scanning:

| Input string | Output (Parse tree array) | Remark |
|---|---|---|
| a=5 | (wcn, a, =, 5) | OK |
| let cnt!=3 | (wswcn, let, , cnt, !=, 3) | No k token for let, but a w token for identifiers? Something is going wrong! |
| x<>y | (wccw, x, <, >, y) | In the alternation list for token c, the chars > and < have a higher priority than <> |
| c==0 | (wccn, c, =, =, 0) | Again the same mistake. |
| if x!=y goto 100 | (wswcwswsn, if, , x, !=, y, , goto, , 100) | No keyword tokens for the if and goto statements. Tokens k and g are unreachable in this rule order; the definition of token w comes first. |
The rule order is important, and so is the alternation order. A proper definition would be:
Scanner<
n ->\d+
c ->==|=|!=|<>|<|>
k ->if|then|else|return|let
g ->goto
w ->[a-zA-Z]+
s ->[ ]+
>
| Input string | Output (an array) | Remark |
|---|---|---|
| a=5 | (wcn, a, =, 5) | OK |
| let cnt!=3 | (kswcn, let, , cnt, !=, 3) | OK |
| x<>y | (wcw, x, <>, y) | OK |
| c==0 | (wcn, c, ==, 0) | OK |
| if x!=y goto 100 | (kswcwsgsn, if, , x, !=, y, , goto, , 100) | OK |
LexAna Scanner features:
• The LexAna Scanner is a lexical analyzer: it sequentially consumes an input string, producing a token (and lexeme) at every step
• Input is a string; the output is a Parse tree, i.e. an array (t, l, l, ...) where t is a string of tokens and l is a string or an array of strings (something like a Lisp list, i.e. an s-expression)
• The Scanner is defined by a list of rules T->R, where T is a token character and R is a regex expression
• Token definitions (rules) are ordered (the first has the highest priority)
• A token is one character long (a letter, a digit, or a special symbol)
• A lexeme (token content) is a string or an array of strings, never an array of one string
• A token's name (letter or not) is significant to the Parser
• If a part of the text is not readable by the Scanner, it does not cause a stop; the Scanner forwards it to the Parser
• In the output array, the length of the token string is equal to the number of lexemes (Parse tree integrity)
• No zero-length match is allowed and no empty regexes; ^ is useless in the Scanner. The symbols $, \b, \B and lookahead work fine with other characters
• Every token has a value (lexeme); there is no empty lexeme
• The symbols _ (underscore) and . (dot) are reserved token names. Underscore means ignore; dot means lexeme = token, producing terminal symbols
• If the Scanner table is omitted, LexAna BU uses its intrinsic Scanner definition
• One token can have more than one definition
• Without capturing groups in the regex, the lexeme is the regex match
• Branch reset: it is possible to provoke the Scanner to produce two tokens at once, T0, T1, ... T9
LexAna parsing example
| Step | Remark |
|---|---|
| Input string: John 100; Bill 150; Ann 80; | |
| Scanner: i →[a-zA-Z]+   n →\d+   . →[;] | Token names i and n are one char long (important) |
| Output token string (after scanning): in;in;in; | |
| Parsing tree (after scanning): John, 100, ;, Bill, 150, ;, Ann, 80, ; | The parsing tree is an array with the same length as the token string |
| Parser: P →in;   Q →P+ | The parser and the scanner have a similar format, but slightly different semantics |
| After the first parsing rule: Tokens: PPP, Parsing tree: (John, 100), (Bill, 150), (Ann, 80) | The Parser automatically wraps data and replaces all groups of tokens in; with token P |
| After the second parsing rule: Token: Q, Parsing tree: ((John, 100), (Bill, 150), (Ann, 80)) | The Parser reduces the number of tokens at every step and deletes punctuation |
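A minimal Python sketch of this replace-and-wrap step, under assumed semantics (the function name, the sync between token string and lexeme array, and the "drop non-letter tokens" check are my illustration, not LexAna's actual code):

```python
import re

def apply_rule(tokens, tree, name, pattern):
    # One parser rule name ->pattern: replace each match in the token string
    # with `name` and wrap the matching lexemes into a sub-list, dropping
    # punctuation tokens such as ';' (assumed semantics).
    while True:
        m = re.search(pattern, tokens)
        if not m:
            return tokens, tree
        s, e = m.span()
        group = [lex for tok, lex in zip(tokens[s:e], tree[s:e]) if tok.isalpha()]
        tokens = tokens[:s] + name + tokens[e:]
        tree = tree[:s] + [group] + tree[e:]

tokens = "in;in;in;"
tree = ["John", "100", ";", "Bill", "150", ";", "Ann", "80", ";"]
tokens, tree = apply_rule(tokens, tree, "P", r"in;")
print(tokens, tree)  # PPP [['John', '100'], ['Bill', '150'], ['Ann', '80']]
tokens, tree = apply_rule(tokens, tree, "Q", r"P+")
print(tokens, tree)  # Q [[['John', '100'], ['Bill', '150'], ['Ann', '80']]]
```

Note how the token string and the lexeme array always have the same length, which is the Parse tree integrity property listed above.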
Reverse level order parsing tree
LexAna parsing order for Wikipedia example
LexAna is a bottom-up, reverse level order parser. This is the Wikipedia example of parsing a statement:
A=B+C*2;D=1
The bottom level is the result of scanning; the rest of the tree is the result of parsing.
Look at the LexAna example, with parse steps and nodes in the comments:
"A=B+C*2;D=1"
Scanner <
w ->[A-Z] # 1-4
d ->\d+ # 5-6
a ->= # 7-8
s ->[+-] # 9
m ->[*/] # 10
; ->; # 11
>
Parser <
I ->w(?!a) # 12-15 (var)
V ->I|d # 16-19 (value)
P ->V(mV)* # 20-22 (products)
S ->P(sP)* # 23-24 (sums)
A ->waS # 25-26 (assign)
t ->A # 27-28 (stmt)
T ->t(;t)* # 29 (statements)
>
The token definitions of the Parser are also ordered. But an important difference from the Scanner is that replacements are multiple and non-sequential.
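Under these rules, the token string for "A=B+C*2;D=1" can be traced step by step with plain regex substitution. This is a sketch of the token-level replacements only; LexAna also builds the parse tree at each step:

```python
import re

t = "wawswmd;wad"              # Scanner output for "A=B+C*2;D=1"
t = re.sub(r"w(?!a)", "I", t)  # I ->w(?!a)   → "waIsImd;wad"
t = re.sub(r"I|d", "V", t)     # V ->I|d      → "waVsVmV;waV"
t = re.sub(r"V(mV)*", "P", t)  # P ->V(mV)*   → "waPsP;waP"
t = re.sub(r"P(sP)*", "S", t)  # S ->P(sP)*   → "waS;waS"
t = re.sub(r"waS", "A", t)     # A ->waS      → "A;A"
t = re.sub(r"A", "t", t)       # t ->A        → "t;t"
t = re.sub(r"t(;t)*", "T", t)  # T ->t(;t)*   → "T"
print(t)                       # → T
```

Because re.sub replaces all non-overlapping matches in one pass, each line performs the multiple, non-sequential replacement described above.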
Postfix plus (+)
By adding the postfix plus (+) after a token's name in a parser definition, we tell the Parser to repeat the marked rule consecutively while replacements exist.
The rule:
T+->regex
is a substitution for:
T->regex
T->regex
T->regex
...
T->regex
Question: How do we parse a string of length 2n, aaa...bbb, where a and b have the same number of occurrences?
Answer: A Parser with n rules would be:
S ->ab
S ->aSb
S ->aSb
...
S ->aSb
Abbreviated:
S ->ab
S+ ->aSb
Or just:
S+ ->aS?b
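The + semantics can be simulated on a bare token string with a loop that re-applies one regex substitution until nothing changes. This is a hypothetical sketch (the function name and loop are mine, not LexAna's):

```python
import re

def apply_plus(tokens, pattern, repl):
    # Re-apply the rule while replacements exist (the "+" suffix semantics)
    while True:
        new = re.sub(pattern, repl, tokens)
        if new == tokens:
            return tokens
        tokens = new

print(apply_plus("aaabbb", r"aS?b", "S"))  # S+ ->aS?b  → S
print(apply_plus("aab",    r"aS?b", "S"))  # unbalanced → aS (not reduced to S)
```

A balanced string reduces to the single token S; an unbalanced one gets stuck, which is how the Parser signals that the input does not fit the rule.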
Problem: Parsing types in Pascal, given by the grammar:
Type -> Simple
Type -> ^id
Type -> 'array' '[' Simple ']' 'of' Type
Simple -> 'integer'
Simple -> 'char'
Simple -> num '..' num
Input string:
"array [ 1..5 ] of array [ 1..3 ] of 2..5"

Scanner:
_ ->\s+
i ->integer
c ->char
a ->array
o ->of\b
r ->\d+\.\.\d+
w ->[a-zA-Z]+
. ->[.^[\]]

Parser:
1) s ->[icr]
2) S ->\[s\]
3) T+ ->s|\^w|aSoT

Result:
T, (array, (1..5), of, (array, (1..3), of, 2..5))
Analysis of the parsing steps is given in the table below:

| Parsing step | Tokens after the step |
|---|---|
| Scanning | a[r]oa[r]or |
| 1) s ->r | a[s]oa[s]os |
| 2) S ->\[s\] | aSoaSos |
| 3) T ->s | aSoaSoT |
| 3) T ->aSoT | aSoT |
| 3) T ->aSoT | T |
LexAna applies parsing rules 1) and 2) only once. Parsing rule 3) is applied three times, thanks to the suffix + after T in the rule definition. In fact, a rule with the suffix + is applied consecutively as many times as possible.
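That schedule can be reproduced on the bare token string with plain regex substitutions. This sketch assumes the + rule simply loops until the string stops changing, and shows the token-level effect only:

```python
import re

t = "a[r]oa[r]or"             # Scanner output
t = re.sub(r"[icr]", "s", t)  # 1) s ->[icr]  → "a[s]oa[s]os"
t = re.sub(r"\[s\]", "S", t)  # 2) S ->\[s\]  → "aSoaSos"
while True:                   # 3) T+ ->s|\^w|aSoT, repeated while it matches
    new = re.sub(r"s|\^w|aSoT", "T", t)
    if new == t:
        break
    t = new
print(t)                      # → T
```

The while loop runs exactly three substitution passes, matching the three rows for rule 3) in the table above.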
Problem: How to parse a nested list?
Let the input string be:
a <<b>> <c <d e > f <q> g > h
If we want to parse this and other similar strings, we first need a Scanner like this:
_ ->[ \t]+
i ->[a-z]
. ->[<>]
The Scanner's output will be:
i<<i>><i<ii>i<i>i>i
One of many solutions to this problem is:
A+ -><(i|A)+>
S ->(A|i)+
When the Parser applies the first rule once, the result is the next token string:
i<A><iAiAi>i
Applying the first rule again, we get:
iAAi
And finally, after the last replacement, the token string becomes just the token S. The Parser automatically handles the lexemes, following the replacement steps.
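The two reduction passes and the final rule can be replayed on the token string directly (a sketch of the token-level effect only; tree handling is omitted):

```python
import re

s = "i<<i>><i<ii>i<i>i>i"          # Scanner output
s = re.sub(r"<(?:i|A)+>", "A", s)  # A+ -><(i|A)+>, first pass
print(s)                           # → i<A><iAiAi>i
s = re.sub(r"<(?:i|A)+>", "A", s)  # second pass
print(s)                           # → iAAi
s = re.sub(r"(?:A|i)+", "S", s)    # S ->(A|i)+
print(s)                           # → S
```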
Remark: All punctuation in the Parser will be deleted. If you want to keep a punctuation mark, assign it to some letter in the Scanner.
Example: Parsing a numerical expression with parentheses and operators + , -
Input string:
"-(-((1+(-(((-2))-((2-1)))+(-((+1)-(2-2))-(-(0))))-2)))"

Scanner:
F ->\d+
s ->[+-]
. ->[\(\)]

Parser:
F+ ->\(s?F\)|s?F(sF)+
E ->s?F

Result:
E, (-, (-, (1, +, (-, ((-, 2), -, (2, -, 1)), +, (-, ((+, 1), -, (2, -, 2)), -, (-, 0))), -, 2)))
Example: Parsing expressions with variables, operators [+-*/] and function calls. In this example there is no operator precedence, nor parentheses to specify the order of operations.

Input string:
"n*fib(n-1,2*f(g(n,2*n-1)),0)-1"

Scanner:
f ->[a-z]+(?=\()
w ->[a-z]+
d ->\d+
o ->[/*+-]
. ->[(),]

Parser:
E+ ->[wd]|E(oE)+|f\(E(,E)*\)

Result:
E, (n, *, (fib, (n, -, 1), (2, *, (f, (g, n, (2, *, n, -, 1)))), 0), -, 1)
Analysis of the parsing steps is given in the table below:

| Parsing step | Tokens after the step |
|---|---|
| Scanning | wof(wod,dof(f(w,dowod)),d)od |
| Alternation 1 | Eof(EoE,Eof(f(E,EoEoE)),E)oE |
| Alternation 2 | Eof(E,Eof(f(E,E)),E)oE |
| Alternation 3 | Eof(E,Eof(E),E)oE |
| Alternation 3 | Eof(E,EoE,E)oE |
| Alternation 2 | Eof(E,E,E)oE |
| Alternation 3 | EoEoE |
| Alternation 2 | E |
|
Example: Right associativity, produces a "right" tree
Input string:
"1:4:6:9:7"

Scanner:
d ->\d+
. ->:

Parser:
a ->d:
R+ ->ad|aR   # Or just R+ ->d(?!:)|d:R

Result:
R, (1, (4, (6, (9, 7))))
Example: Left associativity, produces a "left" tree
Input string:
"1:4:6:9:7"

Scanner:
d ->\d+
. ->:

Parser:
a ->:d
L+ ->da|La   # Or just L ->d(:d)* but in that case the answer would be: L, (1, 4, 6, 9, 7)

Result:
L, ((((1, 4), 6), 9), 7)
Example: Nested blocks in JS - like programming languages
Input string:
'''
{var ; let ; if {let; let } if{var} return } { if {let; return} return } { const; let; if {let ; if {return} let } return}
'''

Scanner:
_ ->[\s;]+   # Underscore is a special symbol. This rule deletes all blanks and semicolons
i ->if
R ->return
L ->let|var|const
. ->[{}]

Parser:
B+ ->\{([LBR]|iB)+\}

Result:
BBB
.var .let .if ..let ..let .if ..var .return
.if ..let ..return .return
.const .let .if ..let ..if ...return ..let .return
Example: Small part of JS - like syntax, function definitions and if control statements
Input string:
'''
function foo(x) { if (x > 10) { if (a > 1){return a-1}; return a * x; } return x + 10; }
'''

Scanner:
_ ->[ \t]+
; ->\r\n
f ->function
i ->if
r ->return
v ->var|let
w ->[a-z]+
d ->\d+
c -><>|<|>|<=|>=|==
e ->=
o ->[/*+-]
. ->[(){};,]

Parser:
_ ->;
H ->fw\((w(,w)*)?\)
E ->(w|d)(o(w|d))+
C ->wc(w|d|E)
R ->r[ECdw]
A ->vwe(E|d|w)
B+ ->\{([AR]|i\(C\)B)+\}
F ->HB
# H - Function header, E - Simple expression, C - Condition, R - Return statement,
# A - Assign statement, B - Block, F - Function definition

Result:
F ((function, foo, x), (if, (x, >, 10), (if, (a, >, 1), ((return, (a, -, 1))), (return, (a, *, x))), (return, (x, +, 10))))
Example: Parsing another "Tiny" language (simple expressions, functions, assign, conditions, if - else, while)
Input string:
'''
function foo(x,a) { var c := x-a; if (c > 10) { if (a > 1){return c + x - 1}; return a * x; } return x + 10; }
function gcd(x,y){ while (y <> 0) { if (x > y) {x := x - y} else {y := y - x} } return x} }
'''

Scanner:
_ ->[ \t]+
; ->\r\n
f ->function
i ->if
e ->else
l ->while
r ->return
v ->var|let|const
w ->[a-z]+
d ->\d+
c -><>|==|<=|>=|<|>
a ->:=
o ->[/*+-]
. ->[(){};,]

Parser:
_ ->;
H ->fw\((w(,w)*)?\)
E ->(w|d)(o(w|d))+
C ->\(wc(w|d|E)\)
R ->r[Edw]
A ->v?wa(E|d|w)
B+ ->\{([AR]|lCB|iCB(eB)?)+\}
F ->HB

Result:
FF
((function, foo, x, a), ((var, c, :=, (x, -, a)), if, (c, >, 10), (if, (a, >, 1), (return, (c, +, x, -, 1)), (return, (a, *, x))), (return, (x, +, 10))))
((function, gcd, x, y), (while, (y, <>, 0), (if, (x, >, y), (x, :=, (x, -, y)), else, (y, :=, (y, -, x))), (return, x)))
Example: Parsing a JSON string
Input string:
'''
{ "id": 1, "Surname": [], "price": 100, "Handler": [ "Bar", "Foo", {}, "Q",1], "Stock": {"Market": 300, "goods": [1], "Retail": {"old":1} }}
'''

Scanner:
# No scanner. If you omit it, LexAna works with its internal scanner

Parser:
# Grammar: J = { (str : J) , (str : J)* } | [ J(,J)* ] | str | num | label
_ ->;
J ->d|l|q(?!:)|\{\}|\[\]
J+ ->\[J(,J)*\]|\{q:J(,q:J)*\}

Result:
J
.id .1 .Surname .[] .price .100
.Handler ..Bar ..Foo ..{} ..Q ..1
.Stock ..Market ..300 ..goods ...1 ..Retail ...old ...1
What is after "Handler" and what is after "Stock"? How do we distinguish arrays from objects in the output? There is a way! See Pebbles in the next chapter.