Experimenting with parser combinators in non-functional languages

2022-09-13 3910 words 19 minutes

Contents

Functional parsing and parser combinators

Recently, on my routine round through YouTube and social media, I’ve came across a video from Ashley Jeffs regarding a message broker he’s the author of called Benthos. Benthos itself is very interesting and I recommend to learn more about it, but what especially caught my interest was a different video from Ashley, regarding bloblang - a configuration language for Benthos written using parser combinators. Frankly, this was the first time I’ve heard that term and, intriguing as it sounds, I wanted to learn all about it. That took me through another rabbit hole, which is, I think, the original paper that defined the whole notion of parser combinators by Graham Hutton and Erik Meijer. Copy of this document can be found here.

What are parser combinators?

In short, parser combinators is a functional approach to parsing. As vague as it sounds, that sentence best describes the idea. Traditionally, parsing is performed in a series of steps and the output is either translated into an abstract syntax tree that can then be interpreted (in case of interpreters) or transformed into an intermediate representation that can be translated to a bytecode or a machine language (I’m completely ignoring other necessary steps like type checking, optimisation etc). This process requires a lexer - converting the input string into a stream of tokens comprising a language and the parser itself, operating on that tokens - checking if the sequence they form is compliant with the grammar. In simplest fashion this can be depicted as:

Right, still though, what are parser combinators and how does the whole process differs?

Parser combinators is an idea where set of bare minimum, fundamental parsers is created, specialising in parsing a specific construct. These parsers are then combined to construct parsers capable of processing a more complex structures.

This means that instead of defining a lexer and a parser operating on tokens it produces, you build a set of small parser operating on input string directly. You’ll have parsers recognising trivial constructs like:

words,
single characters,
keywords,
whitespaces,
…

and you’ll combine these together to create a parser for a more complex constructs.

What I’ll build?

It’s really difficult to explain that without actually implementing them. So, please bear with me as I’ll try to implement a small library of fundamental parsers in both Python and C++, trying to explain the concept of parser combinators as I go. My goal is to be able to parse something non-trivial and my arbitrary choice is the INI file format. Let’s dive in!

Monadic parser combinators

I’ll start with walking through the aforementioned document describing the notion of parser combinators trying to recreate the ideas described there in an imperative language. I’m gonna start with Python as it’s a perfect language for prototyping.

The paper starts with a definition of a parser type:

1

type Parser a = String -> [(a, String)]

This can be translated to C++ as:

1
2
3
4
5
6
7
8


template <typename T>
using Result = std::pair<T, std::string>;

template <typename T>
using Results = std::vector<Result<T>>;

template <typename T>
using Parser = std::function<Results<T>(std::string)>;

This is quite verbose! The semantics is that Parser is a function template. Function, that consumes input string and returns a list of parsing results and the remaining, unparsed input string. For example:

Given an input string:

1

>> let input = "this is an input string"

…and a parser parsing words, the expected result would be:

1
2


>> parseWord input
[ ("this", " is input string") ]

Any given parser can either succeed or fail. The parser indicates failure by returning an empty list. For example:

1
2


>> parseDigit "abc"
[ ]

Next, the paper defines two most fundamental (and most important as well, as we’ll later find out) parsers. These are result and zero parsers. The first one, always succeeding and the second one, always failing. Let’s have a look on the Haskell implementation first:

1
2
3
4
5


result :: a -> Parser a
result v = \inp -> [(v,inp)]

zero :: Parser a
zero = \inp -> []

This looks a bit magical at first glance to people not acquainted with Haskell, also the names are a bit unfortunate as well. As the paper clarifies the \inp -> ... syntax defines a lambda function taking input inp and returning an array containing one element, a tuple of (v, inp). I must admit I was initially confused because I assumed that result v is, similarly to Parser a a template but it’s not, these functions translate to Python as:

1
2
3
4
5


def make_result_parser(v):
    return lambda inp: [(v, inp)]

def make_zero_parser():
    return lambda inp: []

Next parser that the paper describes is the item parser - a parser that parses a single character from the given input string or fails if the string is empty. In python, this will look the following way:

1
2


def make_item_parser():
    return lambda inp: [] if len(inp) == 0 else [(inp[0], inp[1:])]

Combining parsers together

The most important concept behind parser combinators is the ability to “glue” them together to construct a parser for more complex constructs. This is where the bind operation comes into play. It’s essential to understand how it works and what it does. Quoting the paper, here’s the Haskell definition:

1
2


bind :: Parser a -> (a -> Parser b) -> Parser b
p ‘bind‘ f = \inp -> concat [f v inp’ | (v,inp’) <- p inp]

bind is a function that takes two arguments: first Parser a and second, a a function that takes a and returns Parser b. bind itself returns Parser b. This is a bit tricky to understand initially but it’s quite easy after a moment. Simply speaking, bind takes a Parser and a “converter” function that is able to convert a (the parsing result from Parser a) into Parser b. Again, I must admit, it took me a while to understand what bind really does. Most of the difficulty lies really in deciphering Haskell’s syntax. Python equivalent is, in my opinion, a lot easier to understand:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12



def bind_parser(parser_a, binder):
    def parser_b(inp):
        results = []
        for pa_results in parser_a(inp):
            pa_v, pa_inp = pa_results
            parser_b = binder(pa_v)
            pb_results = parser_b(pa_inp)
            results.extend(pb_results)
        return results

    return parser_b;

I’ve called the a -> Parser b function a binder. This function takes the parsing results from the first parser and returns a parser of Parser b type. This latter parser is then invoked on the remaining input string (as returned by first parser). The loop is only to comply with the established semantics which suggests that there can be more than one valid parsing result.

That’s all the building blocks required. With that in hands, it’s possible to construct more interesting parsers. Let’s focus on how bind can be used because understanding what it does is absolutely essential. The first parser that the paper mentions is the sat parser. It’s a Parser Char parser, meaning the parsed result is a character. It takes a predicate and parses a single character from the input string if the predicate is true. Here’s the Haskell code for reference:

1
2
3


sat :: (Char -> Bool) -> Parser Char
sat p = item ‘bind‘ \x ->
    if p x then result x else zero

… and equivalent Python implementation:

1
2
3


def make_sat_parser(pred):
    return bind_parser(make_item_parser(), lambda c:
            make_result_parser(c) if pred(c) else make_zero_parser())

The “binder” function in this case is a lambda taking a single character (the result of item parser) and returning either a result or zero parser if the predicate stands or not. The argument to the binder function is always a result from the first parser - it’s important to bear that in mind. In other words what “binder” function does, is it takes first parser’s result and returns a parser of type Parser b. Hopefully, that makes sense?

One “combinator” is needed before I’ll proceed with implementation of the remaining fundamental parsers. In the paper, it’s named plus - it executes two given parsers on the same input and combines their results. Here’s the Haskell code:

1
2


plus :: Parser a -> Parser a -> Parser a
p ‘plus‘ q = \inp -> (p inp ++ q inp)

Personally, I don’t like that name and I prefer to call this function “anyof”. Here’s the implementation in Python:

1
2
3
4


def make_anyof_parser(pa, pb):
    def parser(inp: str):
        return pa(inp) + pb(inp)
    return parser

With that in mind, I’ll implement the rest of the parsers mentioned in the paper. Here they are:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44


def make_exact_char_parser(c):
    return make_sat_parser(lambda r: c == r)


def make_digit_parser():
    return make_sat_parser(str.isdigit)


def make_lower_parser():
    return make_sat_parser(str.islower)


def make_upper_parser():
    return make_sat_parser(str.isupper)


def make_space_parser():
    return make_sat_parser(str.isspace)


def make_tabs_parser():
    return make_sat_parser(lambda c: c == '\t')


def make_newline_parser():
    return make_sat_parser(lambda c: c == '\n' or c == '\r')


def make_whitechar_parser():
    return make_anyof_parser(
        make_anyof_parser(make_space_parser(), make_tabs_parser()),
        make_newline_parser())


def make_alpha_parser():
    return make_anyof_parser(
        make_lower_parser(),
        make_upper_parser())


def make_alphanum_parser():
    return make_anyof_parser(
        make_alpha_parser(),
        make_digit_parser())

Some names may be slightly different than in the original paper, but majority is “pythonized” quite verbose.

That’s all cool, I’m now able to parse characters of all sorts but still I can’t parse any strings. This can be sorted out by combining the character parsers recursively. The Haskell definition in the paper for parsing words is quite enigmatic (at least to me):

1
2
3
4
5
6


word :: Parser String
word = neWord ‘plus‘ result ""
    where
        neWord = letter ‘bind‘ \x ->
                 word ‘bind‘ \y ->
                 result (x:y)

Without knowing Haskell syntax it’s difficult to grasp it. To understand this syntax fully, here’s a couple of hints:

Sequence packing/unpacking

(x:xs) construct is used to express sequence packing/unpacking. When used on the left hand side:

1
2
3
4


    let s = "string is a sequence of characters"
    let (x:xs) = s
    -- x is now: 's'
    -- xs is now: "tring is a sequence of characters"

It can be used to concatenate sequences as well (and this is exactly how word parser uses it):

let x = 'H'
let xs = "ello"
let s = (x:xs)
-- s is now: "Hello"

`where` expression

This is used to conveniently defer the declaration of an identifier. In case of word parser, it would be difficult to employ tail recursion without resorting to where. So, what it does in this context is that it gets evaluated first before the neWord 'plus' result "" expression.

What happens in the recursive call? First, let’s analyse the base case. Both neWord and result "" are of type Parser String. If the input string is empty, the letter parser will fail and thus terminate the recursive descend so, only the result "" will matter.

Now, the recursion itself. The first “binder” lambda collects a character parsed by a letter parser. The “binder” now has a character but it has to return a Parser String type so, it performs another bind inside where a recursive call happens in which a single character stored in x is combined with the string parsed as result of the recursion (and stored in y). The results are combined to a string and returned via a result parser to form the effective type of Parser String. This can be expressed in Python the following way:

1
2
3
4
5
6
7


def make_word_parser():
    new_word_parser = bind_parser(make_alpha_parser(),
            lambda c: bind_parser(make_word_parser(),
                lambda s: make_result_parser(str(c) + s)))

    return make_anyof_parser(new_word_parser,
            make_result_parser(""))

This is pretty cool - we can now parse words! This abstraction can be (and should be) lifted though. In general, the word_parser is applying the same parser over and over again until parsing fails. Having a parser like that, that can apply any other parser until the latter fails is very handy. You can imagine it may be useful to construct a parser parsing not only words (which are sequences of characters) but i.e. sentences (which are sequences of words) or any other repeating constructs. Let’s start abstracting this implementation. First thing, is to extract the embedded parser into an argument:

1
2
3
4
5
6
7


def make_many_times_parser(p):
    seq_parser = bind_parser(p,
            lambda c: bind_parser(make_many_times_parser(p),
                lambda s: make_result_parser(str(c) + s)))

    return make_anyof_parser(seq_parser,
            make_result_parser(""))

Now, there’s a bit of a problem with that since the result "" only applies as the initial value (returned in recursive base case) for strings. I’ll extract that as well:

1
2
3
4
5
6


def make_many_times_parser(p, initial_value):
    seq_parser = bind_parser(p,
            lambda c: bind_parser(make_many_times_parser(p),
                lambda s: make_result_parser(str(c) + s)))

    return make_anyof_parser(seq_parser, initial_value)

There’s one last change required. The act of combining partial results is currently hard-coded and specific to strings only. I’ll extract that into an argument as well and call that function a “reducer”:

1
2
3
4
5
6


def make_many_times_parser(p, reducer, initial_value):
    seq_parser = bind_parser(p,
            lambda c: bind_parser(make_many_times_parser(p),
                lambda s: make_result_parser(reducer(c, s))))

    return make_anyof_parser(seq_parser, initial_value)

Now, the implementation of make_word_parser can be expressed the following way:

1
2
3
4
5


def make_word_parser():
    return make_many_times_parser(
            make_alpha_parser(),
            lambda c, s: str(c) + s,
            make_result_parser(""))

This is great but it’s not perfect yet, in case of strings, I currently parse words comprised of alphabetic characters but what if I’d want to parse all alphanumeric characters as words or include other character classes? It seems like it does make sense to inject the parser as well:

1
2
3
4
5
6
7
8


def make_many_times_string_parser(p):
    return make_many_times_parser(
            p,
            lambda c, s: str(c) + s,
            make_result_parser(""))

def make_word_parser():
    return make_many_times_string_parser(make_alpha_parser())

I’m almost ready to parse the Ini files. There’s one more thing very common when lexing that needs to be sorted out - white characters. Most of the time these needs to be ignored. They can be of arbitrary amount as well. For that purpose, I’ll introduce:

1
2
3
4
5
6
7


def make_optional_parser(pa):
    def parser(inp: str):
        res = pa(inp)
        if is_error(res):
            return make_result_parser(None)(inp)
        return res
    return parser

This parser wraps a given parser and always succeeds. If the wrapped parser fails it’ll return [(None, input)], if it succeeds it will just forward the results - this is crucial since bind_parser parsing chain is broken on first encountered parse failure and I want to have a way to avoid that.

Two more parsers which happen to be very handy are

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


def make_take_right_parser(pa, pb):
    def parser(inp: str):
        resa = pa(inp)

        results = []
        for res in resa:
            rv, rinp = res
            resb = pb(rinp)
            results.extend(resb)
        return results

    return parser


def make_take_left_parser(pa, pb):
    def parser(inp: str):
        resa = pa(inp)
        results = []
        for res in resa:
            rva, rinpa = res
            for resb in pb(rinpa):
                rvb, rinpb = resb
                results.append((rva, rinpb))

    return parser

Both take two parsers pa, pb. take_right runs the input through first parser and then runs the remaining input through second parser, returning only the second parser results. take_left does the opposite. Runs the input through first parser, then through the second parser and returns first parser results. These two parser wrappers are very handy when dealing with white characters, similarly as optional parser.

Towards parsing `INI` files

I’ve got all the necessary building blocks now. I’ll now combine them to parse the Ini file syntax. Ini file is comprised of sections. Each section has a name and a set of key value fields. The syntax can be described the following way:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


<INI> ::= <SECTIONS>

<SECTIONS> ::= <SECTION> | <SECTION> <SECTIONS>

<SECTION> ::= '[' <SECTION_NAME> ']' '\n' <KEY_VALUES>

<KEY_VALUES> ::= <KEY_VALUE> | <KEY_VALUE> '\n' <KEY_VALUES>

<KEY_VALUE> ::= <STR> = <STR>

<STR> = 'a'..'z' | 'A' .. 'Z'

I’m gonna start by parsing <KEY_VALUE> first and build the complete parser bottom-up.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


def make_whitechar_parser():
    return make_optional_parser(
        make_many_times_string_parser(
            make_whitechar_parser()))

def make_key_value_parser(delimiter = '='):
    word_p = make_word_parser()
    white_p = make_whitechar_parser()
    delim_p = make_exact_char_parser(delimiter)

    str_p = make_take_right_parser(white_p, word_p))
    delim_p = make_take_right_parser(white_p, delim_p)

    kvp_p = bind_parser(str_p, lambda key:
            bind_parser(delim_p, lambda delim:
                bind_parser(str_p, lambda val: make_result_parser((key, val)))))
    return kv_p

There it is. The code is literally self documenting so there’s no point in discussing the parser whatsoever. Suffice to say that the parser returns key-value pairs as list of tuples. This is one building block, now a way to parse a list of key values is needed. It happens to be even easier than parsing individual key-value pairs:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18



def make_single_result_parser(pa):
    def parser(inp: str):
        res = pa(inp)
        if is_error(res):
            return res
        return res[:1]

    return parser

def make_entries_parser(delimiter = '='):
    reducer = lambda a, b: [a] + b
    initial_value = []
    return make_single_result_parser(
            make_many_times_parser(
                make_key_value_parser(delimiter),
                reducer,
                initial_value))

Here, I’ve introduced one small wrapper which reduces any parser’s output to a single result. Most of my parsers return only one results apart from the recursive ones which return a concatenated list of partial results which I don’t really care about.

Believe it or not but the Ini parser is almost complete. Now, the section header parser:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


def make_section_name_parser():
    white_p = make_whitechar_parser()
    word_p = make_word_parser()

    sname_p = bind_parser(
        make_exact_char_parser('['), lambda left:
        bind_parser(word_p, lambda name:
            bind_parser(
                make_exact_char_parser(']'), lambda right:
                make_result_parser(name))))

    return make_take_right_parser(white_p, sname_p)

Pretty simple, just as the previous ones. Now, before I continue, I’m gonna introduce some types representing a parsed Ini file:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


class Section(object):
    def __init__(self, name, key_values):
        self.name = name
        self.key_values = key_values

    def __repr__(self):
        return f"Section: {self.name}, entries: {self.key_values}"

    def __str__(self):
        return self.__repr__()


class Ini(object):
    def __init__(self, sections):
        self.sections = sections

    def __repr__(self):
        return f"Ini: {self.sections}"

    def __str__(self):
        return self.__repr__()

These are pretty simple. An Ini file is a collection of sections and a section is a collection of key-value pairs with a name. I need two more parsers for a section and a collection of sections. Here they are:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20



def make_section_parser():
    sname_p = make_section_name_parser()
    section_p = bind_parser(
            sname_p, lambda section_name:
            bind_parser(make_entries_parser(), lambda entries:
                make_result_parser(Section(section_name, entries))))
    return section_p


def make_sections_parser():
    initial_value = []
    reducer = lambda a, b: [a] + b
    section_p = make_section_parser()

    sections_p = make_many_times_parser(section_p, reducer, initial_value)

    sections_p = make_single_result_parser(sections_p)

    return bind_parser(sections_p, lambda s: make_result_parser(Ini(s)))

The parser is now complete! After feeding it some example Ini file:

1
2
3
4
5
6
7
8
9


[section1]
other = value2
key =  value
last = val

[section2]
some = more
keys = with
different  = values

…the output looks the following way:

1
2
3


[(Ini: [Section: section1, entries: [('other', 'value2'), ('key', 'value'),
        ('last', 'val')], Section: section2, entries: [('some', 'more'),
        ('keys', 'with'), ('different', 'values')]], '')]

Nice! I’ve got a working prototype and a parser combinators library which I can use to parse ANY format I want. But my overall goal is to solve this problem in C++. How difficult will it be now to translate a Python code, which I fully understand, into C++?

Towards C++ `Ini` file parser

I’m tempted to implement a parser as a virtual interface template, something like this:

1
2
3
4
5
6
7


template <typename T>
class Parser {
public:
    virtual ~Parser() = default;

    virtual Result<T> parse() = 0;
};

This is a bad idea though, the amount of work required will be truly disheartening for no good reason. Since parser combinators are an idea from a functional programming world, it’s best to stick with a function-based interface. I’m gonna define the parser as:

1
2


template <typename A>
using Parser = std::function<std::unique_ptr<Result<A>>(std::string_view)>;

There’s this mysterious Result<A> type. What’s that all about? Parser can either succeed or fail and this is indicated via a return type. Originally, in Haskell, the semantics of a Parser was that an empty result list indicates a parsing failure. I’m gonna change the approach slightly in C++. I’ve very rarely had a need for multiple than one result returned from any given parser. To simplify the Result type in C++, I’m gonna drop the results list and always return a single result. How will I indicate a failure? Exceptions! Have a look:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55


template <typename T>
class Result {
public:
    virtual ~Result() = default;

    virtual T get() const = 0;

    virtual Input getRoi() const = 0;
};


template <typename T>
class Success : public Result<T> {
public:
    explicit Success(T v, Input roi) :
        value{v},
        roi{roi}
    {
    }

    T get() const override {
        return value;
    }

    Input getRoi() const override {
        return roi;
    }

private:
    T value;
    Input roi;
};

class Error : public Result<T> {
public:
    explicit Error(std::string msg, std::size_t offset) :
        message{msg},
        offset{offset}
    {
    }

    T get() const override {
        std::stringstream ss;
        ss << offset << ": " << message;
        throw std::runtime_error(ss.str());
    }

    Input getRoi() const override {
        return {""};
    }

private:
    std::string message;
    std::size_t offset;
};

Simple stuff. In case of parser error, retrieval of the parsed value will throw an exception. Let’s have a look on the first basic parsers in C++:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


template <typename A>
Parser<A> makeZeroParser(std::string msg) {
    return [=](Input inp) {
        return std::make_unique<Error<A>>(msg, inp.getOffset());
    };
}

template <typename A>
Parser<A> makeResultParser(A value) {
    return [=](Input inp) {
        return std::make_unique<Success<A>>(value, inp);
    };
}

Parser<char> makeCharParser() {
    return [=](Input inp) {
        if (inp.isEof()) {
            return makeZeroParser<char>("Unexpected end of input")(inp);
        }
        const auto [h, inp_next] = inp.takeHead();
        return makeResultParser<char>(h)(inp_next);
    };
}

I’m not gonna show all the code here. I’m gonna focus only on the remaining most important part of any parser combinators library which is of course the bind operations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


template <typename A, typename B>
using Binder = std::function<Parser<B>(A)>;

template <typename A, typename B>
Parser<B> bindParsers(Parser<A> a, Binder<A, B> b) {
    return [=](Input inp){
        auto resa = a(inp);
        auto parserb = b(resa->get());
        return parserb(resa->getRoi());
    };
}

Of course, I’ve implemented the same INI-file format parser in the examples directory of the repo as well - I encourage you to have a look.

Conclusion

That’s it! Parser combinators is a very neat concept. There’s a bit of a learning curve if you don’t have a background in functional programming - but nothing too drastic. Additionally, as even the paper title implies, the implemented parser combinators library allows for monadic composition - but that’s a topic for a completely different post.

All the code is available on my gitlab in the following repositories:

C++ version: combspp
Python version: combs