I've been trying to sit down and complete this post in a sound way. This one is about an exercise in the
Erlang Programming book which asks the reader to write a parser for expression of the form:
((2+3)*4)
with the result looking something like:
{minus, {plus, {num, 2}, {num, 3}}, {num, 4}}
This last expression here, which by the way is valid Erlang, is the expression tree of the former one.
I had previous knowledge of parsers implemented in a functional a language. I used Haskell to implement part of what is in this very nice book:
Implementing Functional Languages: A Tutorial by Simon Peyton Jones. Chapter one of the book shows us how to make a parser, using a technique called
parser combinators, for the language described and implemented through out the book. The idea of parser combinators, in a nutshell, is to build a parser for a grammar combining smaller simpler parsers. The operation of combining the parser depends on the output of each parser (the analyzed result). You will see examples below in Erlang. For a thorough explanation see the links above. As I mentioned, the author presents a grammar for a programming language which includes expressions like the exercise asked as to implement.
Two thoughts might come into your mind right now:
- that I cheated
- and that I over did
Well, the code in the book is ML, the code is incomplete as there are several exercises left to the reader. About item number two; I don't know. I think there might be easier implementations but I thought that to implemented the solution in this way would be interesting (as it turned out).
A Lexer.
The parsers' input is list of tokens and to transform a string in to a list of tokens we need a
lexer. It's quite simple actually. It scans the string, character by character, if the character is in a set of characters that might form a token (such a digit in a number) a function is called (takewhile) that takes elements from the string (a list) while the element satisfies a condition. Then token is then removed from the input string and the function continues recursively.
%% ------------------------- Lexer -------------------------------
isWhitespace(Symbol) ->
case Symbol of
$ -> true;
_ -> false
end.
isDigit(Symbol) ->
($0 =< Symbol) and (Symbol =< $9).
isAlpha(Symbol) ->
($A =< Symbol) and (Symbol =< $z).
clex([], Tokens) -> Tokens;
clex([H|T], Tokens) ->
if
($0 =< H) and (H =< $9) ->
NumToken = [H|takewhile(fun isDigit/1, T)],
Rest = dropwhile(fun isDigit/1, T),
TokensPrime = [NumToken|Tokens],
clex(Rest, TokensPrime);
($ == H) -> %% $ is the space character, as in ' '
Rest = dropwhile(fun isWhitespace/1, T),
clex(Rest, Tokens);
(40 =< H) and (H =< 47) -> %% [40, .. ,47] == "()*+,-./"
SymbolToken = [H],
Rest = T,
TokensPrime = [SymbolToken|Tokens],
clex(Rest, TokensPrime);
($A =< H) and (H =< $z) ->
AlphaToks = [H|takewhile(fun isAlpha/1, T)],
Rest = dropwhile(fun isAlpha/1, T),
TokensPrime = [AlphaToks|Tokens],
clex(Rest, TokensPrime);
true ->
TokensPrime = [[H]|Tokens],
clex(T, TokensPrime)
end.
clex(Str) ->
reverse(clex(Str, [])).
So the lexer (clex) function takes a list of strings and transforms it into a list of tokens.
Something we should have in mind about grammars: (Extracted from Intro. to Func. Languages by Simon Peyton Jones)
The grammar may be ambiguous, so there is more than one way to parse the input; or the input may not conform to the grammar, in which case there is no way to successfully parse the input. An elegant way to accommodate these possibilities is to return a list of possible parses. (Empty list for no parse, and a list of several elements in case of ambiguity)
So with this background information in mind; what is a parser? A parser is a function which takes as input a list of tokens and returns a list of pairs of where the first components of each pair is the parsed result (of some type) and the second component is the remaining tokens. The type in haskell looks like this:
type Parser a = [Token] -> [(a, Token)]
Parsers in Erlang.
pSat(Predicate, [Tok|Toks]) ->
case Predicate(Tok) of
true ->
[{Tok, Toks}];
false -> []
end;
pSat(Predicate, []) ->
[].
pLit(String) ->
fun(Toks) ->
pSat(fun(OtherString) ->
String == OtherString end,
Toks)
end.
pVar() ->
fun(Toks) ->
pSat(fun(Tok) ->
Head = hd(Tok),
($A =< Head) and (Head =< $Z) end,
Toks)
end.
Those are just simple functions. pSat is a parser which takes a predicate (a boolean function) and returns the singleton list with the tuple that contains as first component the token which satisfies the predicate and the second component is the rest of list of tokens. pLit stands for 'parse literal'; it takes a string (the literal) as input and returns a parser for that literal. pVar, guess what it does! (hint: a variable starts with a capital letter).
Some combinators!
pAlt(P1, P2) ->
fun(Toks) ->
P1(Toks) ++ P2(Toks)
end.
pThen(Func, P1, P2) ->
fun(Toks) ->
[{Func(V1, V2), Toks2}
|| {V1, Toks1} <- P1(Toks),
{V2, Toks2} <- P2(Toks1)]
end.
pAlt represents choice it combines two parsers into one allowing to parse either expression parsed by the original parsers. pAlt would correspond to '|' of the BNF grammar. pAlt works by first using the first parser with list of tokens and then the second parser with same list of tokens and then combining the results. This is easily implemented by concatenating the result of the parses. (We used lists to express ambiguity). pThen is another parser combiner which corresponds to
sequencing in BNF (number ::= < digit > < number >) It works in the following way it takes parser p1, and parses the list of tokens, then with p2 we parse what is left over of the list of tokens after p1 was applied. The results (V1, V2) are combined with function passed by the user (the combining of the results depends on what you want to do as we shall see).
Example 1.
pHelloOrGoodbye() ->
fun(Toks) ->
P = pAlt(pLit("hello"), pLit("goodbye")),
P(Toks)
end.
keepFirst(X, Y) -> X.
keepTwo(X, Y, Z) -> {X, Y}.
mkPair(X, Y) -> {X, Y}.
pGreetings(Toks) ->
P = pThen(fun keepFirst/2,
pThen(fun mkPair/2, pHelloOrGoodbye(), pVar()),
pLit("!")),
P(Toks).
pHelloOrGoodbye is a parser that accepts the string "hello" or the string "goodbye". Using pThen we constructed a parser which parsers "hello World" or "goodbye John".
How to Handle Repetition.
pEmpty(Val) -> fun(Toks) -> [{Val, Toks}] end.
pZeroOrMore(Parser) ->
fun(Toks) ->
P = pAlt(pOneOrMore(Parser), pEmpty([])),
%% Just take the first one, get rid of the rest.
case P(Toks) of
[] -> [];
[H|_] -> [H]
end
end.
%% Next function is an exercise to the reader in the book, you might want to do it yourself.
combine(X, XS) -> [X|XS].
pOneOrMore(Parser) ->
fun(Toks) ->
P = pThen(fun combine/2, Parser, pZeroOrMore(Parser)),
P(Toks)
end.
It is very common of grammars to require zero or more repetitions of a symbol. To provide this functionality we need a function, pZeroOrMore, which takes a parser, p and returns a new parser which recognises zero or more occurrences of whatever p recognizes. pZeroOrMore must either be an empty parse, which is expressed using the pEmpty function, or one or more occurrences of p, which we implement using pOneOrMore.
Example 2.
pGreetingsN(Toks) ->
P = pZeroOrMore(fun pGreetings2/1),
P(Toks).
Guess what this does :-)
The grammar, in Erlang.
Please check the book for the BNF grammar (it's available for free).
-module(parser).
-import(lists, [all/2, sublist/2, takewhile/2, dropwhile/2,
nthtail/2, reverse/1]).
-export([clex/1, pThen/3, pAlt/2, mkPair/2,
pZeroOrMore/1, pOneOrMore/1, pEmpty/1, pVar/0,
pHelloOrGoodbye/0, pGreetings/1, pGreetings2/1,
pGreetingsN/1, pNum/1, pParens/1, pExpr6/0, parse/1]).
%% Here go all the functions I just showed you above...
...
%% --- Grammar for expresion parser ---
removeParens(X, Y, Z) ->
Y. %% where X = '(' and Z = ')'
pParens(Parser) ->
fun(Toks) ->
P = pThen3(fun removeParens/3, pLit("("), Parser, pLit(")")),
P(Toks)
end.
pNum([]) -> [];
pNum([Token|TokenList]) ->
All = all(fun(X) -> ($0 =< X) and (X =< $9) end, Token),
case All of
true -> [{{number, Token}, TokenList}];
false -> []
end.
aexpr() ->
fun(Toks) ->
P = pAlt(fun pNum/1, pParens(pExpr4())),
P(Toks)
end.
assembleOp(E1, no_op) -> E1;
assembleOp(E1, {found_op, Op, E2}) ->
V = {var, Op},
Ap = {ap, V, E1},
{ap, Ap, E2}.
foundOp(Op, Expr) ->
{found_op, Op, Expr}.
pExpr4() ->
fun(Toks) ->
P = pThen(fun assembleOp/2, pExpr5(), pExpr4c()),
P(Toks)
end.
pExpr4c() ->
fun(Toks) ->
P1 = pThen(fun foundOp/2, pLit("+"), pExpr4()),
P2 = pThen(fun foundOp/2, pLit("-"), pExpr5()),
P3 = pEmpty(no_op),
P = pAlt(P1, pAlt(P2, P3)),
P(Toks)
end.
pExpr5() ->
fun(Toks) ->
P = pThen(fun assembleOp/2, pExpr6(), pExpr5c()),
P(Toks)
end.
pExpr5c() ->
fun(Toks) ->
P1 = pThen(fun foundOp/2, pLit("*"), pExpr5()),
P2 = pThen(fun foundOp/2, pLit("/"), pExpr6()),
P3 = pEmpty(no_op),
P = pAlt(P1, pAlt(P2, P3)),
P(Toks)
end.
mkApChain([X]) -> X;
mkApChain([X|XS]) ->
Res = mkApChain(XS),
{ap, Res, X}.
pApply(Parser, Fun, Toks) ->
[{Fun(Val), Toks1} || {Val, Toks1} <- Parser(Toks)].
pExpr6() ->
fun(Toks) ->
pApply(pOneOrMore(aexpr()), fun mkApChain/1, Toks)
end.
parse(Str) ->
NewStr = "(" ++ Str ++ ")",
P = pExpr6(),
P(clex(NewStr)).
Bonus Track: Haskell!
module Parser where
type Token = String
-- Lexer -----------------------------------------------------------------
isWhiteSpace :: Char -> Bool
isWhiteSpace c = c == ' ' || c == '\t' || c == '\n'
isDigit :: Char -> Bool
isDigit c = '0' <= c && c <= '9'
isAlpha :: Char -> Bool
isAlpha c = (('A' <= c && c <= 'Z') ||
('a' <= c && c <= 'z'))
isIdChar :: Char -> Bool
isIdChar c = isAlpha c || isDigit c || c == '_'
clex :: String -> [Token]
clex [] = []
clex (c:cs) | isWhiteSpace c = clex cs
| isDigit c = let num_token = c : takeWhile isDigit cs
rest_cs = dropWhile isDigit cs
in
num_token : clex rest_cs
| isAlpha c = let var_tok = c : takeWhile isIdChar cs
rest_cs = dropWhile isIdChar cs
in
var_tok : clex rest_cs
| otherwise = [c] : clex cs
-- Parser --------------------------------------------------------------
type Parser a = [Token] -> [(a, [Token])]
pSat :: (String -> Bool) -> Parser String
pSat predicate (tok:toks) | predicate tok = [(tok, toks)]
| otherwise = []
pSat predicate [] = []
-- pLit :: String -> Parser String
-- pLit s (tok:toks) | s == tok = [(s, toks)]
-- | otherwise = []
-- pLit s [] = []
pLit s = pSat (== s)
-- pVar :: Parser String
-- pVar [] = []
-- pVar (tok:toks) | ('A' <= h) && (h <= 'Z') = [(tok, toks)]
-- | otherwise = []
-- where h = head tok
pVar :: Parser String
pVar = pSat (\tok -> let h = head tok in ('A' <= h) && (h <= 'Z'))
pAlt :: Parser a -> Parser a -> Parser a
pAlt p1 p2 toks = (p1 toks) ++ (p2 toks)
-- Test for pAlt ----------
pHelloOrGoodbye :: Parser String
pHelloOrGoodbye = (pLit "hello") `pAlt` (pLit "goodbye")
pThen :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
pThen combine p1 p2 toks = [(combine val1 val2, toks2)
| (val1, toks1) <- p1 toks,
(val2, toks2) <- p2 toks1]
pThen3 :: (a -> b -> c -> d) -> Parser a ->
Parser b -> Parser c -> Parser d
pThen3 combine p1 p2 p3 toks =
[(combine v1 v2 v3, toks3)
| (v1, toks1) <- p1 toks,
(v2, toks2) <- p2 toks1,
(v3, toks3) <- p3 toks2]
-- Test for pThen ---------
pGreetings :: Parser (String, String)
pGreetings = pThen keep_first
(pThen mk_pair pHelloOrGoodbye pVar)
(pLit "!")
where
keep_first x y = x
mk_pair x y = (x, y)
pGreetings2 :: Parser (String, String)
pGreetings2 = pThen3 keep_two
pHelloOrGoodbye
pVar
(pLit "!")
where keep_two x y z = (x, y)
pEmpty :: a -> Parser a
pEmpty x toks = [(x, toks)]
pOneOrMore :: Parser a -> Parser [a]
pOneOrMore p = pThen combine p (pZeroOrMore p)
where combine x xs = x:xs
pZeroOrMore :: Parser a -> Parser [a]
pZeroOrMore p = (pOneOrMore p) `pAlt` (pEmpty [])
-- Test pZeroOrMore
pGreetingsN :: Parser Int
pGreetingsN = (pZeroOrMore pGreetings) `pApply` length
pApply :: Parser a -> (a -> b) -> Parser b
pApply p fun toks = [ (fun val, toks1) | (val, toks1) <- p toks]
pOneOrMoreWithSep :: Parser b -> Parser a -> Parser [a]
pOneOrMoreWithSep p2 p1 = -- p2 is the parser which parses the separator
pThen3 combine3
p2 p1 (pZeroOrMore (pThen combine2 p2 p1))
where combine2 y x = x
combine3 y x z = x : z
-- Test
pSepGreets :: Parser [(String, String)]
pSepGreets = pThen combine pGreetings (pOneOrMoreWithSep (pLit ";") pGreetings)
where combine x xs = x:xs
-- My Stuff ---
-- Grammar ---
type Name = String
data CoreExpr = ENum Int | EVar Name | EAp CoreExpr CoreExpr deriving (Show, Read)
-- type CoreExpr = Expr Name
pParens :: Parser b -> Parser b
pParens p = pThen3 combine (pLit "(") p (pLit ")")
where
combine x y z = y -- Where x = "(" and z = ")"
pNum :: Parser CoreExpr
pNum = (pSat (\tok -> (all (\x -> ('0' <= x) && (x <= '9')) tok))) `pApply` (\x -> ENum (read x))
aexpr :: Parser CoreExpr
aexpr = pNum `pAlt` (pParens pExpr4)
data PartialExpr = NoOp | FoundOp Name CoreExpr
assembleOp :: CoreExpr -> PartialExpr -> CoreExpr
assembleOp e1 NoOp = e1
assembleOp e1 (FoundOp op e2) = EAp (EAp (EVar op) e1) e2
pExpr4 :: Parser CoreExpr
pExpr4 = pThen combine pExpr5 pExpr4c
where combine = assembleOp
pExpr4c :: Parser PartialExpr
pExpr4c = (pThen combine (pLit "+") pExpr4)
`pAlt` (pThen combine (pLit "-") pExpr5)
`pAlt` (pEmpty NoOp)
where combine = FoundOp
pExpr5 :: Parser CoreExpr
pExpr5 = pThen combine pExpr6 pExpr5c
where combine = assembleOp
pExpr5c :: Parser PartialExpr
pExpr5c = (pThen combine (pLit "*") pExpr5)
`pAlt` (pThen combine (pLit "/") pExpr6)
`pAlt` (pEmpty NoOp)
where combine = FoundOp
mk_ap_chain :: [CoreExpr] -> CoreExpr
mk_ap_chain [x] = x
mk_ap_chain (x:xs) = EAp (mk_ap_chain xs) x
pExpr6 = (pOneOrMore aexpr) `pApply` mk_ap_chain