Monday, November 19, 2012

Monadic Parsing in C++

When researching parsing in Haskell, I always find this pdf: eprints.nottingham.ac.uk/223/1/pearl.pdf

I will be referring back to this paper and encourage my readers to at least skim it as well. It may provide more understanding in how it works.

It describes a simple, yet hard to comprehend functional parser. I have attempted to translate this to C++, but first I wrote a more intuitive, less functional one. Just a simple calculator, somewhat close to what's described in the pdf, and it worked like this:

The first stage took the input string and created a list of token type-lexeme pairs. Given the input "1+2*2", it would spit back {(INT,"1"),(OP,"+"),...}. The next stage uses two mutually recursive functions to represent two parse states: that of sums and subtractions and that of numbers, multiplications and divisions. Since addition has the lowest precedence in this mini-language, it's safe to parenthesize the rest of the expression, meaning that if it parsed "1+", it will expect the next number to be a number, and it might want to add one to it, eagerly, but not if fallowed by a "2*2" since multiplication has higher precedence.

These functions convert the vector of tokens to an expression tree that evaluates the arguments."2*2" would parse to a tree with a multiplier at its root and two numbers as its leaves. So, after the first state (sums) had read "1+", it would eagerly construct an addition object with the left argument being 1, and the right argument being the result of the rest of the parse! So it reads "2*2" and builds a tree that becomes the right-side argument for our "1+" tree.

This solution is theoretically consistent with building a language using flex and bison. The source can be found here (gist). But it isn't functional. That's when I stumbled back onto this research paper and decided to give a real go at it.


Functional Parsing.

The parser described in this paper does not act in the same way as my parser does. Rather than lexing, tokenizing, and building an expression tree, it cuts straight to the point. A Parser<int> will be a function that accepts a string and produces ints. But, the job may not be done; there may be more to parse. For every int it parses, it will also store the suffix of the parse. So, if p is some parser of type Parse<int>, p("1abc") will return a vector holding one match: the value, 1, and the suffix, "abc".

What does this class look like?

/* A Parser is a function taking a string and returning a vector of matches. */
template< class X > struct Parser {
    // The value is the target of the parse. (For example "1" may parse to int(1).
    using value_type    = X;

    // A match consists of a value and the rest of the input to process.
    using parse_state   = std::pair<X,std::string>;

    // A parse results in a list of matches.
    using result_type   = std::vector< std::pair<X,std::string> >;

    // A parser is a function that produces matches.
    using function_type = std::function< result_type( const std::string& ) >;

    function_type f;

    Parser( function_type f ) : f(std::move(f)) { }

    Parser( const Parser<X>& p ) : f(p.f) { }
    Parser() { }

    result_type operator () ( const std::string& s ) const {
        return f( s );
    }
};

template< class X, class F > Parser<X> parser( F f ) {
    using G = typename Parser<X>::function_type;
    return Parser<X>( G(std::move(f)) );
}

I have mentioned in previous articles that one should avoid std::function for efficiency reasons, but it vastly simplifies things here. As you can see, Parser merely wraps itself around std::function. I would encourage the reader to think of it as a type alias--not deserving of being called a new type, but more than a typedef.

This is consistent with how the paper defines this type:

    newtype Parser a = Parser (String -> [(a,String)])

If nothing can be parsed, and empty list will be returned. If many things are parsed, a list of each match will be returned.


The Parser Monad.

As a reminder, the basic monadic operations are these:

    a >> b = b -- Do a, then b.
    a >>= f = b -- For each x from a, construct b with f(x).
    mreturn x = a -- Construct a monad from a value.

How does this relate to parsing? A parser is basically just a function, so if p and q are both parsers, p >> q must return a new parser, a new function. The simple explanation (do p, then q) is correct. First, p parses and let's say it returns a match, (x,rest). rest is sent to q for parsing and the x is thrown away. It may sound odd to just throw away a value, but it will become more obvious soon.

If p had failed to parse a value, then q would not have been run.

The bind operation, p >>= f, extracts the parsed value, x, from p and creates a new parser from f(x). mreturn x creates a new parser that returns x as its value. It accepts any string, even an empty one. Ideally, x came from the output of another parser.

    p >> q -- Run parser p, then q.
    p >>= f -- Construct a parser with p's matches.
    mreturn x -- Construct a parser that returns x.

We can define it like so:

template< class X > struct Monad< Parser<X> > {
    using Pair = typename Parser<X>::parse_state;
    using R    = typename Parser<X>::result_type;

    /* 
     * mreturn x = parser(s){ vector( pair(x,s) ) }
     * Return a parsed value. Forwards rest of input to the next parser.
     */
    template< class M >
    static M mreturn( X x ) {
        return parser<typename M::value_type> (
            [x]( const std::string& s ) { 
                return R{ Pair(std::move(x),s) }; 
            }
        );
    }
    
    /* a >> b = b */
    template< class Y, class Z >
    template< class Y, class Z >
    static Parser<Y> mdo( Parser<Z> a, Parser<Y> b ) {
        return a >>= [b]( const Z& z ) { return b; };
    }

    /* Continue parsing from p into f. */
    template< class F, class Y = typename std::result_of<F(X)>::type >
    static Y mbind( F f, Parser<X> p ) 
    {
        using Z = typename Y::value_type;
        return parser<Z> (
                [f,p]( const std::string& s ) {
                    // First, extract p's matches.
                    return p(s) >>= [&]( Pair p ) {
                        // Then construct the new parser from the p's output.
                        // Continue parsing with the remaining input with the new parser.
                        return f( std::move(p.first) )( std::move(p.second) );
                    };
                }
        );
    }
};

Do not worry if this source is difficult to understand. It is more important to understand how it is used (which is perhaps common with monads). Note that mdo is defined such that for every successful parse of a, b is parsed. If a fails to parse anything, b fails, too.

These monadic operations are the core building blocks from which we can build more complex system, but the paper also discusses MonadZero and MonadPlus. They are both type classes, like Monad, but extend it to do a few interesting things. In C++, one can concatenate two string by using simple addition: s1 + s2 = s12. MonadPlus is generalization of this. MonadZero completes this generalization by supplying the additive identity. For example, we know that zero + x = x. Thus, "" + s = s.

In parsing terms, zero would refer to a parser that matches nothing and adding two parsers, p+q, will produce a third parser that accepts either p's or q's. For example, itentifier+number would create a function that parses either identifiers or numbers.

We can define MonadPlus and MonadZero in the same way we would define Monad.

template< class ... > struct MonadZero;
template< class ... > struct MonadPlus;

template< class M, class Mo = MonadZero< Cat<M> > >
auto mzero() -> decltype( Mo::template mzero<M>() )
{
    return Mo::template mzero<M>();
}

template< class A, class B, class Mo = MonadPlus<Cat<A>> >
auto mplus( A&& a, B&& b ) -> decltype( Mo::mplus(std::declval<A>(),std::declval<B>()) )
{
    return Mo::mplus( std::forward<A>(a), std::forward<B>(b) );
}

template< class X, class Y >
auto operator + ( X&& x, Y&& y ) -> decltype( mplus(std::declval<X>(),std::declval<Y>()) )
{
    return mplus( std::forward<X>(x), std::forward<Y>(y) );
}

First, we define these for sequences.

template<> struct MonadZero< sequence_tag > {
    /* An empty sequence. */
    template< class S >
    S mzero() { return S{}; }
};

/* mplus( xs, ys ) = "append xs with ys" */
template<> struct MonadPlus< sequence_tag > {
    template< class A, class B >
    static A mplus( A a, const B& b ) {
        std::copy( b.begin(), b.end(), std::back_inserter(a) );
        return a;
    }
};

And then for Parsers.

/* mzero: a parser that matches nothing, no matter the input. */
template< class X > struct MonadZero< Parser<X> > {
    template< class P >
    static P mzero() { 
        return parser<X> (
            []( const std::string& s ){
                return std::vector<std::pair<X,std::string>>(); 
            }
        );
    }
};

/* mplus( pa, pb ) = "append the results of pa with the results of pb" */
template< class X > struct MonadPlus< Parser<X> > {
    using P = Parser<X>;
    static P mplus( P a, P b ) {
        return parser<X> (
            [=]( const std::string& s ) { return a(s) + b(s); }
        );
    }
};

Since we usually only want the first successful parse, the paper define an operator, +++, that does this.

template< class X >
Parser<X> mplus_first( const Parser<X>& a, const Parser<X>& b ) {
    return parser<X> (
        [=]( const std::string s ) {
            using V = std::vector< std::pair< X, std::string > >;
            V c = (a + b)( s );
            return c.size() ?  V{ c[0] } : V{};
        }
    );
}

This completes the required building blocks. A parser of significant complexity could be made using only the above functions and types. The paper describes building a notably simple parser, so let's do that instead!


Basic Monadic Parsers.

The simplest parser is item, which accepts any char.

std::string tail( const std::string& s ) {
    return std::string( std::next(s.begin()), s.end() );
}

/* 
 * Unconditionally match a char if the string is not empty.
 * Ex: item("abc") = {('a',"bc")}
 */
auto item = parser<char> (
    []( const std::string& s ) {
        using Pair = Parser<char>::parse_state;
        using R = Parser<char>::result_type;
        return s.size() ? R{ Pair( s[0], tail(s) ) } : R();
    }
);

To demonstrate its usage, the paper defines a simple parser that takes a string of length three or more and returns the first and third values.

auto p = item >>= []( char c ) {
    return item >> item >>= [c]( char d ) {
        return mreturn<Parser>( std::make_pair(c,d) );
    };
};

p first runs item to extract c, then runs item again but throws away the value. It runs item a third time to extract d and finally returns as its value (c,d). p("abcd") would return {(('a','c'),"d")}.

The next function creates a parser that is a little more helpful:

/* sat( pred ) = "item, if pred" */
template< class P >
Parser<char> sat( P p ) {
    return item >>= [p]( char c ) { 
        return p(c) ? mreturn<Parser>(c) : mzero<Parser<char>>();
    };
}

Given some function that operates on chars, this extracts an item, but then checks the condition without consuming any additional input. If p(c) returns true, then it returns a parser with the value c, otherwise zero, a failed parse. Using this, we can define a parser to accept only a specific char.

Parser<char> pchar( char c ) {
    return sat( [=](char d){ return c == d; } );
}

And then, a parser that accepts only a specific string.

Parser<std::string> string( const std::string& s ) {
    if( s.size() == 0 )
        return mreturn<Parser>( s );

    Parser<char> p = pchar( s[0] );
    for( auto it=std::next(s.begin()); it != s.end(); it++ )
        p = p >> pchar( *it );

    return p >> mreturn<Parser>(s);
}

Note: There is no name conflict with std::string because the source does not contain "using namespace std;".

This function does something very odd. For every char in s, it chains together char parsers. For example, string("abc") would return a parser equivalent to pchar('a') >> pchar('b') >> pchar('c') >> mreturn<Parser>("abc"). If any of the char parsers fail down the line, mreturn<Parser>(s) will fail. Since we already know the values of the successful parses, their values are thrown away.

Though faithful to the paper, this may be less efficient than desirable. One could implement string in this way, too:

Parser<std::string> string( const std::string& str ) {
    return parser<std::string> (
        [str]( const std::string& s ) {
            using R = typename Parser<std::string>::result_type;
            if( std::equal(str.begin(),str.end(),s.begin()) ) {
                return R {
                    { str, s.substr( str.size() ) }
                };
            } else {
                return R();
            }
        }
    );
}

It can at times be simpler to write out these functions instead of composing them, however, that can be thought of as an optimization.

The next function creates a new parser from a parser, p, that accepts one or zero p's.

template< class X >
Parser<std::vector<X>> some( Parser<X> p ) {
    using V = std::vector<X>;
    using Pair = std::pair<V,std::string>;
    using R = std::vector< Pair >;
    using P = Parser<V>;
    return mplus_first( 
        parser<V> (
            [=]( const std::string& s ) {
                auto matches = p(s);

                return matches.size() == 0 ? R{}
                    : R{ 
                        Pair( 
                            V{std::move(matches[0].first)}, 
                            std::move( matches[0].second )
                        ) 
                    };
            }
        ),
        mreturn<Parser>( V{} )
    );
}

It was unfortunately difficult to translate, but does work. (The gist also contains a version called many, which accepts zero or many p's.) some converts a parser of type X to one that produces a vector or X's. It always succeeds, even if it does not successfully parse anything.

To create a parser that consumes whitespace is now trivial.

template< class X >
using ManyParser = Parser< std::vector<X> >;

ManyParser<char> space = some( sat([](char c){return std::isspace(c);}) );

We require one more function to parse alternating sequences. Like what? A sum, like "1+2+3" is a sort of alternating sequence of numbers and +'s. A product is an alternating sequence of numbers and *'s.

Here's the weird part: what does a parser that accepts a "+" return? What about a parser that accepts a "-"? The value of such a parser, as it turns out, is the binary function that it represents! In the case of the implementation below, this is a function pointer of type int(*)(int,int).

/* 
 * chain(p,op): Parse with p infinitely (until there is no match) folding with op.
 *
 * p and op are both parsers, but op returns a binary function, given some
 * input, and p returns the inputs to that function. For example, if:
 *      input: "4"
 *      p returns: 4
 * No operator is read, no operation is performed. But:
 *      input: "4+4"
 *      p returns: 4
 * op is then parsed with the function, rest:
 *      input: "+4"
 *      op returns: do_add
 *      input: "4"
 *      p returns: 4
 *      rest returns: 8
 * rest applies the operation parsed by op. It alternates between parsing p and
 * op until there are no more matches. 
 */
constexpr struct Chainl1 {
    template< class X, class F >
    static Parser<X> rest( const Parser<X>& p, const Parser<F>& op, const X& a ) {
        // Alternate between op and p until input is consumed or a parse fails.
        auto r = op >>= [=]( const F& f ) {
                return p >>= [&]( const X& b ) {
                    return rest( p, op, f(a,b) );
                };
        };

        // Return the first successful parse, or a if none.
        return mplus_first( r, mreturn<Parser>(a) );
    }

    template< class X, class F >
    Parser<X> operator () ( Parser<X> p, Parser<F> op ) const {
        return p >>= closet( rest<X,F>, std::move(p), std::move(op) );
    }
} chainl1{};


Adding the final touches.

The paper describes a few generally useful functions for constructing parsers based off the above function. space (defined above) consumes whitespace. token consumes any trailing whitespace after parsing p.

constexpr struct Token {
    template< class X >
    Parser<X> operator () ( Parser<X> p ) const {
        return p >>= []( X x ) {
            return space >> mreturn<Parser>(std::move(x));
        };
    }
} token{};

symb converts a string to a token.

auto symb = compose( token, string ); // symb(s) = token( string(s) )

apply consumes any leading whitespace.

constexpr struct Apply {
    template< class X >
    Parser<X> operator () ( Parser<X> p ) const {
        return space >> std::move(p);
    }
} apply{};

The big idea is that we never want to manually write a function that returns a list of successful parses. It's hard! It's much easier to compose such  functions from smaller, more comprehensible ones and use those to build reasonable complex, but more simple to reason about, parsers.

The parser itself.

Now, using all of the tools provided, we can create a parser much more trivially than otherwise--though that is perhaps true of any time one has a new set of tools. First, we define a parser that accepts digits.

constexpr bool is_num( char c ) {
    return c >= '0' and c <= '9';
}

/* Parse one digit. */
Parser<int> digit = token( sat(is_num) ) >>= []( char i ) { 
    return mreturn<Parser>(i-'0'); 
};

Now, digit("2") returns 2, but (digit >> digit)("22") returns 2 as well! Why? Because the first run of digit extracts the first 2, but throws that value away and the second run of digit extracts the second 2. To parse a two-digit number, we need something like this:

Parser<int> twoDigit = digit >>= []( int x ) {
    return digit >>= [x]( int y ) {
        return mreturn<Parser>( x*10 + y );
    };
};

It extracts the first then the second digit and returns the original number, converted from a string to an int! To parse arbitrarily long numbers (diverging from the paper's version), we can define a chain operation!

int consDigit( int accum, int digit ) { return accum*10 + digit; }
Parser<int> num = chainl1( digit, mreturn<Parser>(consDigit) );

For every two digits parse, num calls consDigit to fold the values together. As mentioned earlier, chainl1 works by alternating between its two parsers. Since the second argument is a parser which consumes no input, num only accepts digits.

Next, we can define the parsers for binary operations.

// Binary operations of type int(*)(int,int).
int do_add(  int x, int y ) { return x + y; }
int do_sub(  int x, int y ) { return x - y; }
int do_mult( int x, int y ) { return x * y; }
int do_div(  int x, int y ) { return x / y; }

auto addop = mplus (
    pchar('+') >> mreturn<Parser>(do_add),
    pchar('-') >> mreturn<Parser>(do_sub)
);

auto mulop = mplus (
    pchar('*') >> mreturn<Parser>(do_mult),
    pchar('/') >> mreturn<Parser>(do_div)
);

addop parses either a "+" or "-" and returns either do_add or do_sub, respectively. Because the parsers must return functions of the same types, std::plus and std::minus could not be used.

With this, we can define a term as an alternating sequence of numbers and multiplications and divisions; and an expr(ession) as an alternating sequence of terms, +'s and -'s.

/* 
 * Parse terms: series of numbers, multiplications and divisions.
 * Ex: "1*3*2" -> (3,"*2") -> 6
 */
Parser<int> term = chainl1( num, mulop );

/*
 * Parse expressions: series of terms, additions and subtractions.
 * Ex: "1+7*9-1" -> (1."+7*9-1") -> (63,"-1") -> 62
 */
Parser<int> expr = chainl1( term, addop );

And we're done! We have just built a calculator! It can evaluate any expression of additions, subtractions, multiplications, divisions, and it is whitespace agnostic. While implementing Parser itself took a considerable amount of work, using it does not.

int main() {
    using std::cin;
    using std::cout;
    using std::endl;

    cout << "Welcome to the calculator!\n" 
         << "Press ctrl+d or ctrl+c to exit.\n"
         << "Type in an equation and press enter to solve it!\n" << endl;

    std::string input;
    while( cout << "Solve : " and std::getline(std::cin,input) ) {
        auto ans = apply(expr)(input);
        if( ans.size() and ans[0].second.size() == 0 )
            cout << " = " << ans[0].first;
        else
            cout << "No answer.";
        cout << endl;
    }
}

I highly encourage anyone reading this to attempt to compile and modify the source code.

See the gist at github for the source in full: https://gist.github.com/4112114
And for the original parser I wrote: https://gist.github.com/4112114#file_trivial_parser.cpp